WO2021232978A1 - Video processing method and apparatus, electronic device and computer readable medium - Google Patents

Video processing method and apparatus, electronic device and computer readable medium

Info

Publication number
WO2021232978A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
video
scene
electronic device
time
Prior art date
Application number
PCT/CN2021/085692
Other languages
French (fr)
Chinese (zh)
Inventor
钟瑞
Original Assignee
Oppo Guangdong Mobile Telecommunications Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo Guangdong Mobile Telecommunications Co., Ltd.
Publication of WO2021232978A1 publication Critical patent/WO2021232978A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Definitions

  • This application relates to the field of video technology, and more specifically, to a video processing method, device, electronic equipment, and computer-readable medium.
  • Video annotation is a video processing technique in which prominent marks are placed directly on the video during preview or playback, making the video more targeted; it is widely used in many fields.
  • For example, video annotation is the analysis method most commonly used by public security investigators when studying and judging video cases, allowing officers to locate and focus on suspected targets and lock onto important video clues.
  • As another example, video annotation can also be used for image analysis in the medical field, where physicians use it to highlight body parts with lesions or abnormalities.
  • Video annotation can also serve as a way of storing information about a video and as the description content corresponding to the video, through which a user can quickly obtain part of the video's content.
  • This application proposes a video processing method, device, electronic device, and computer-readable medium to address the above-mentioned drawbacks.
  • An embodiment of the present application provides a video processing method, including: acquiring a target video to be processed; acquiring a target scene type corresponding to an image frame to be processed in the target video; determining, according to the timestamp of the image frame to be processed, a scene time segment of the target scene type in the target video, wherein, in the target video, the scene types corresponding to the image frames within the scene time segment are all the target scene type; and generating a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
  • an embodiment of the present application also provides a video processing device, including: a video acquisition unit, a scene acquisition unit, a determination unit, and a processing unit.
  • the video acquisition unit is used to acquire the target video to be processed.
  • the scene acquisition unit is configured to acquire the target scene type corresponding to the image frame to be processed in the target video.
  • the determining unit is configured to determine the scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed, wherein, in the target video, the scene types corresponding to the image frames within the scene time segment are all the target scene type.
  • the processing unit is configured to generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
  • An embodiment of the present application also provides an electronic device, including: one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to execute the foregoing method.
  • The embodiments of the present application also provide a computer-readable medium storing program code executable by a processor; when the program code is executed by the processor, the processor performs the above method.
  • FIG. 1 shows a method flowchart of a video processing method provided by an embodiment of the present application
  • Figure 2 shows a schematic diagram of a video download interface provided by an embodiment of the present application
  • Figure 3 shows a schematic diagram of a video playback interface provided by an embodiment of the present application
  • FIG. 4 shows a method flowchart of a video processing method provided by another embodiment of the present application.
  • FIG. 5 shows the training process of the Mobilenet_V1 network provided by the embodiment of the present application
  • FIG. 6 shows the process of identifying the scene classification of the image to be processed provided by the embodiment of the present application
  • FIG. 7 shows a schematic diagram of the Yolo_V3 network structure provided by an embodiment of the present application.
  • FIG. 8 shows a flowchart of S460 in FIG. 4
  • FIG. 9 shows a schematic diagram of a video annotation result provided by an embodiment of the present application.
  • FIG. 10 shows a method flowchart of a video processing method provided by another embodiment of the present application.
  • FIG. 11 shows a block diagram of a video processing device provided by an embodiment of the present application.
  • FIG. 12 shows a block diagram of a video processing device provided by another embodiment of the present application.
  • FIG. 13 shows a schematic diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 14 shows a storage unit, provided by an embodiment of the present application, for storing or carrying program code that implements the video processing method of the embodiments of the present application.
  • Video annotation is a video processing technique in which prominent marks are placed directly on the video during preview or playback, making the video more targeted; it is widely used in many fields.
  • For example, video annotation is the analysis method most commonly used by public security investigators when studying and judging video cases, allowing officers to locate and focus on suspected targets and lock onto important video clues.
  • As another example, video annotation can also be used for image analysis in the medical field, where physicians use it to highlight body parts with lesions or abnormalities.
  • Video annotation can also serve as a way of storing information about a video and as the description content corresponding to the video, through which a user can quickly learn part of the video's content.
  • Video tagging methods mainly include manual tagging and machine-learning-based video tagging.
  • A manual video tagging method may first construct, through a web page, a container interface for holding videos and load the video into the video section; the operator then manually drags the slider or clicks the video drag bar according to the video content to change the playback time point or confirm the time point of the played content, and marks the video content by clicking the video's knowledge point panel.
  • Machine learning is a type of artificial intelligence.
  • Artificial intelligence is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance.
  • Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
  • A machine-learning-based video tagging method may tag videos based on feature extraction. Specifically, the acquired video stream is first decoded and tagging commands corresponding to all frame images are received; the storage features corresponding to all frame images are then extracted according to the tagging commands; and finally the storage feature and the receiving time corresponding to each tagging command are saved in the tagging record.
  • Moreover, the accuracy of the resulting album will also decrease and, most importantly, the privacy of the album user will be infringed.
  • The disadvantage of the feature-extraction-based video tagging method is that it only records whether the video contains features of a category and does not organize those features over the course of the video, which makes the tagging result difficult to interpret as a description of the video's content.
  • the embodiments of the present application provide a video processing method, which is applied to an electronic device.
  • The execution subject of the method may be the electronic device, so that the video processing method is executed locally by the electronic device, avoiding sending the video to a cloud server, which could cause data leakage and endanger the user's privacy.
  • the method includes: S101 to S104.
  • S101 Acquire a target video to be processed.
  • the target video to be processed may be at least a part of the videos already stored in the electronic device.
  • the target video to be processed may be a video selected by the user from the videos stored by the electronic device.
  • the electronic device may display the stored video on the screen of the electronic device, and the user selects at least part of the video from the multiple displayed videos as the target video to be processed.
  • the target video to be processed may be a video requested by the user to download.
  • the interface shown in Figure 2 is a video download interface provided by an application.
  • The application can be a video application, that is, an application with a video playback function; users can watch videos online and download videos through the application.
  • the user selects the video to be downloaded in the video download interface, and the electronic device can detect the identifier of the video corresponding to the download request triggered by the user. For example, the video corresponding to the download button triggered by the user in the video download interface is detected, and the video corresponding to the triggered download button is the video requested by the user to download.
  • The video requested to be downloaded is used as the target video to be processed, so that when the user requests to download the video, the video processing method of the embodiment of this application can be executed on the video, and when the video is stored, the corresponding video annotation result is stored together with it.
  • the target video to be processed may be a video recorded by a user through a video recording application.
  • a video recorded by the user through the video recording function in the camera application can be used as the target video to be processed, so that when the video is stored, the video and the video annotation result corresponding to the video can be stored correspondingly.
  • the identification of the recorded video can also be stored, and the video can be used as the target video to be processed under specified conditions.
  • The specified condition may be a preset execution condition of the processing method of the embodiment of the application, that is, the method of the embodiment of the application may be executed on the target video to be processed under the specified condition, so as to obtain the video annotation result of the target video to be processed.
  • the specified condition may be a preset period, for example, 24 hours, that is, the method in the embodiment of the present application is executed every preset period.
  • The specified condition may be that the electronic device is in an idle state, so as to prevent the excessive power consumption and system freezes that could result from performing the method of the embodiment of the present application at a busy time; for the specific implementation of determining the idle state, reference can be made to the subsequent embodiments.
  • S102 Acquire the target scene type corresponding to the image frame to be processed in the target video.
  • the image frame to be processed may be at least part of the image frames of all corresponding image frames of the target video.
  • the image frame to be processed may be an image frame of a partial time period of the target video, for example, may be an image frame corresponding to the time period between the end time of the opening part of the video and the start time of the end part of the video. Therefore, it is possible to obtain the corresponding labeling result without performing video processing on the beginning part and the ending part of the video, and reduce the amount of data calculation.
  • the image frame to be processed can also be a key frame in all image frames corresponding to the target video, and the amount of data calculation can also be reduced.
  • all image frames in the target video may be used as image frames to be processed, so that the accuracy and comprehensiveness of the video annotation result can be improved.
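  • A minimal sketch of the frame selection described above is given below; the trimming offsets and the sampling stride are illustrative assumptions, not values taken from this application.

```python
# Select which frames become "image frames to be processed": optionally skip the
# opening/ending parts of the video, and optionally sample only every N-th frame
# as an approximation of key-frame selection.

def select_frames_to_process(total_frames, fps, opening_end_s=0.0,
                             ending_start_s=None, stride=1):
    """Return indices of the frames that will be sent to scene recognition."""
    if ending_start_s is None:
        ending_start_s = total_frames / fps
    first = int(opening_end_s * fps)
    last = int(ending_start_s * fps)
    return list(range(first, min(last, total_frames), stride))

# Example: a 60 s video at 30 fps, skip the first and last 5 s, take every 10th frame.
indices = select_frames_to_process(total_frames=1800, fps=30,
                                   opening_end_s=5, ending_start_s=55, stride=10)
```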
  • each image frame corresponds to a scene
  • each scene corresponds to a scene category.
  • The scene category may include: selfie, group photo, architecture, food, blue sky, silhouette, sunset, beach, sports, grass, text, and night scene. That is, the scene represents the content expressed by the entire image frame, and each object in the image frame can be regarded as an element of the scene. For example, if the entire image is a group photo of user A and user B, the scene type of the image frame is group photo, the elements in the scene include user A and user B, and the types of user A and user B are person.
  • The scene type of the image frame to be processed can be identified based on machine learning; for example, a neural network structure such as VGG-Net or ResNet is trained in advance.
  • the output of the neural network structure is the scene type corresponding to the image to be processed, that is, the target scene type.
  • the output of the last layer of the neural network structure is the distribution vector of the probability that the input image belongs to each predefined scene category.
  • The outputs of several intermediate layers of the deep neural network can be used as features of the input image to train a Softmax classifier, and the deep network model is trained with the mini-batch stochastic gradient descent method and the back-propagation algorithm. Therefore, the target scene type corresponding to the image frame to be processed can be obtained through the neural network's classifier, as illustrated in the sketch below.
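  • A minimal sketch (assuming PyTorch) of the idea described above: features taken from intermediate layers of a pretrained backbone are fed to a Softmax classifier trained with mini-batch SGD and backpropagation. The feature size and the 12 scene categories are illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_SCENES = 12          # e.g. selfie, group photo, architecture, food, ...
FEATURE_DIM = 1024       # assumed size of the intermediate-layer feature vector

classifier = nn.Linear(FEATURE_DIM, NUM_SCENES)   # Softmax is applied inside the loss
criterion = nn.CrossEntropyLoss()                 # log-softmax + negative log-likelihood
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)

def train_step(features, labels):
    """One mini-batch SGD step on pre-extracted backbone features."""
    optimizer.zero_grad()
    logits = classifier(features)        # (batch, NUM_SCENES)
    loss = criterion(logits, labels)
    loss.backward()                      # backpropagation
    optimizer.step()
    return loss.item()

# Example with random stand-in data:
feats = torch.randn(8, FEATURE_DIM)
labs = torch.randint(0, NUM_SCENES, (8,))
train_step(feats, labs)
```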
  • There may be multiple image frames to be processed, and they may correspond to multiple scene types, so that multiple target scene types may be obtained.
  • For example, the image frames to be processed include image 1, image 2, image 3, image 4, image 5, image 6, image 7, image 8, and image 9. The scene types corresponding to image 1, image 2, image 3, image 4, and image 5 are all the first scene type, and the scene types corresponding to image 6, image 7, image 8, and image 9 are all the second scene type, so the target scene types corresponding to the 9 image frames to be processed are the first scene type and the second scene type.
  • S103 Determine a scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed.
  • the scene types corresponding to the image frames in the scene time segment are all the target scene types.
  • each image frame in the target video corresponds to a timestamp
  • the timestamp of each image frame can reflect the playback sequence of the image frame in the target video.
  • A video can be regarded as multiple image frames synthesized and played in a certain order; therefore, the image set obtained after encoding multiple image frames in a certain order can be regarded as a video, and the timestamp can be used as tag information representing the playback order of a certain image frame in the video.
  • For example, the first image frame of the video is used as the starting image and its timestamp is the starting timestamp; each image frame after the starting image then adds a value to the starting timestamp according to the playback order, and the difference between every two adjacent image frames can be fixed.
  • each image frame in the video corresponds to a time point on the playback time axis of the video, and this time point is the timestamp of the image frame.
  • The playback time axis of the video is related to the playback time length of the video; it can take 0 as the start point and the total playback time length of the video as the end point. For example, if the total length of the video is 10 seconds, the playback time axis of the video is a time axis with 0 as the start point and 10 seconds as the end point.
  • the time stamp of each image frame in the video is located on the playback time axis, so that the position of each image frame on the time playback axis can be determined.
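  • A small worked example of the timestamp convention described above, assuming a fixed frame interval (i.e., a constant frame rate); the 30 fps value is illustrative.

```python
FPS = 30.0
FRAME_INTERVAL = 1.0 / FPS          # fixed difference between adjacent image frames

def timestamp_of(frame_index, start_timestamp=0.0):
    """Timestamp of a frame on the playback time axis (seconds from the start)."""
    return start_timestamp + frame_index * FRAME_INTERVAL

# For a 10-second video at 30 fps, frame 150 sits at the 5-second point of the axis.
assert abs(timestamp_of(150) - 5.0) < 1e-9
```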
  • the scene time segment may include at least one of the start time and the end time of the scene.
  • the scene type corresponding to each image frame to be processed can be determined, and then according to the scene type corresponding to each image frame to be processed, the start time and end time of each scene type can be determined.
  • For example, the scene types corresponding to the aforementioned image 1, image 2, image 3, image 4, and image 5 are all the first scene type, and the scene types corresponding to image 6, image 7, image 8, and image 9 are all the second scene type. Image 1 through image 9 correspond to timestamps t1, t2, t3, t4, t5, t6, t7, t8, and t9, respectively.
  • The scene time segment of the first scene type is then t1 to t5, that is, on the playback time axis of the video, the scene types corresponding to all image frames between t1 and t5 are the first scene type; the scene time segment of the second scene type is t6 to t9, that is, on the playback time axis of the video, the scene types corresponding to all image frames between t6 and t9 are the second scene type. A minimal sketch of this grouping is given below.
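  • A sketch of turning per-frame scene types and timestamps into scene time segments, following the example above; the function and variable names are illustrative, not taken from this application.

```python
def scene_time_segments(timestamps, scene_types):
    """Group consecutive frames with the same scene type into (type, start, end) segments."""
    segments = []
    start = 0
    for i in range(1, len(scene_types) + 1):
        if i == len(scene_types) or scene_types[i] != scene_types[start]:
            segments.append((scene_types[start], timestamps[start], timestamps[i - 1]))
            start = i
    return segments

ts = [1, 2, 3, 4, 5, 6, 7, 8, 9]                 # t1 .. t9
types = ["first"] * 5 + ["second"] * 4
print(scene_time_segments(ts, types))
# [('first', 1, 5), ('second', 6, 9)]
```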
  • S104 Generate a video annotation result according to the target scene type and the scene time segment corresponding to the target scene type.
  • The video annotation result is used to describe that the scene type corresponding to the scene time segment in the target video is the target scene type, so that the scene type within a certain time period of the target video, together with the start time and end time of that scene type, can be known through the video annotation result. When a video of a certain scene needs to be queried, the start time and end time of the scene in the target video can be quickly located according to the scene time segment corresponding to the scene, which makes checking convenient and quick.
  • the video tagging result may be the description content corresponding to the target video
  • the description content may be text content.
  • The description content is used to express, in the form of text, the multiple scene types in the target video and the time segments corresponding to those types.
  • the description content may be "scene: Selfie, the time segment of the scene is 2 seconds to 5 seconds".
  • the video tagging result may be content set based on the time axis of the target video.
  • the electronic device can display the video annotation result.
  • the electronic device can display the video annotation result in a designated interface of the electronic device.
  • the specified interface may be a playback interface of the target video.
  • The video annotation result may be displayed on the progress bar of the target video being played, that is, the scene time segment of the target scene type and the target scene type are marked on the progress bar.
  • the content played in the video playing interface shown in FIG. 3 is a target video, and a first mark 302 and a second mark 303 corresponding to the target scene type are displayed on the progress bar 301 of the target video.
  • the first mark 302 is used to characterize the position of the start point of the target scene type on the progress bar 301
  • the second mark 303 is used to characterize the position of the end point of the target scene type on the progress bar 301.
  • When the user triggers the first mark 302 and the second mark 303, first content and second content are displayed, where the first content is used to indicate that the position corresponding to the first mark 302 is the start time of the target scene type, and the second content is used to indicate that the position corresponding to the second mark 303 is the end time of the target scene type. Therefore, when the user is watching the target video, the first mark 302 and the second mark 303 clarify the position of each scene of the video on the progress bar 301, so that the user can quickly locate a scene of interest.
  • the progress bar 301 of the video is the playback time axis of the video.
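  • A small sketch of how the first and second marks could be placed on the progress bar: each mark's horizontal position is the corresponding timestamp divided by the total playback length. The pixel width is an illustrative assumption.

```python
def mark_position(timestamp_s, total_duration_s, bar_width_px):
    """Horizontal pixel offset of a mark on a progress bar of bar_width_px pixels."""
    return int(bar_width_px * timestamp_s / total_duration_s)

# For a 10 s video and a 500 px progress bar, a scene starting at 2 s and ending at 5 s:
first_mark_x = mark_position(2.0, 10.0, 500)    # 100 px
second_mark_x = mark_position(5.0, 10.0, 500)   # 250 px
```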
  • FIG. 4 shows a video processing method provided by another embodiment of the present application.
  • The method can not only identify the scene in the target video but also identify the various objects in the specific scene, and generate a video annotation result that combines the scene and the objects.
  • the method includes: S410 to S460.
  • S420 Acquire the target scene type corresponding to the image frame to be processed in the target video.
  • a neural network based on a computer vision method is used to determine the content in the image frame to be processed.
  • a Mobilenet network model may be used.
  • the basic unit of MobileNet is depthwise separable convolution.
  • Depth-level separable convolution is a decomposable convolution operation (factorized convolutions), which can be decomposed into two smaller operations: depthwise convolution and pointwise convolution.
  • Depthwise convolution differs from standard convolution: in standard convolution, each convolution kernel is applied to all input channels, while depthwise convolution uses a different convolution kernel for each input channel, that is, one convolution kernel corresponds to one input channel.
  • the pointwise convolution is actually an ordinary convolution, but it uses a 1x1 convolution kernel.
  • Depthwise separable convolution first uses depthwise convolution to convolve the different input channels separately, and then uses pointwise convolution to combine the outputs, which greatly reduces the amount of calculation and the number of model parameters. Therefore, the MobileNet network model can be regarded as a lightweight convolutional neural network; a minimal sketch of this building block follows.
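  • A sketch (assuming PyTorch) of the depthwise separable convolution that MobileNet is built from: a depthwise 3x3 convolution with one kernel per input channel (groups=in_channels), followed by a pointwise 1x1 convolution that combines the channel-wise outputs. Channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # depthwise: one 3x3 kernel per input channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1, groups=in_channels,
                                   bias=False)
        # pointwise: ordinary convolution with a 1x1 kernel
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# A 32-channel feature map mapped to 64 channels:
block = DepthwiseSeparableConv(32, 64)
out = block(torch.randn(1, 32, 112, 112))
```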
  • the target scene type corresponding to the image frame to be processed in the target video may be obtained based on Mobilenet_V1. Specifically, it may be fine-tuned (Finetune) on the basis of MobileNet_V1 that has been trained using the data set.
  • the network can divide the image frames to be processed into 10 categories, that is, a score of 1-10.
  • In the network parameter table, Type identifies the operator type of each layer: Conv represents a convolutional layer, Avg Pool an average pooling layer, Softmax the Softmax layer, and FC a fully connected layer. Stride represents the step size of each operation, with s1 denoting a stride of 1 and s2 a stride of 2.
  • Filter Shape represents the size of the filter. For example, 3x3x3x32 indicates 3 color channels, a 3x3 convolution kernel, and 32 convolution kernels; 3x3x32 dw indicates a depthwise convolution with 32 channels, a 3x3 convolution kernel, and 32 convolution kernels. Pool 7x7 means the average pooling kernel size is 7x7, and 1024x1000 means the fully connected layer contains 1024x1000 neurons.
  • Classifier denotes the final classification category; here its value is 10, representing an output score of 1-10.
  • Input Size represents the size of the input; 224x224x3 represents a 3-channel 224x224 image.
  • Figure 5 shows the training process of the Mobilenet_V1 network.
  • The classification network for a picture consists of two parts: the first part is composed of multiple convolutional layers and is responsible for extracting diversified features from the picture, and the second part makes the classification judgment.
  • In general, the image feature extraction module of an image classification network is already relatively complete, so the part that needs improvement and training is the image category judgment module.
  • The Finetune strategy is to first train the image category judgment module separately, and then perform global fine-tuning of the network, in which the image feature extraction module joins the training; here the fully connected layer (FC layer) is trained separately for 4000 steps, and the global fine-tuning runs for 1000 steps.
  • FC layer fully connected layer
  • The Finetune data set used by the classification network is a pre-acquired data set including 280 types of data with 5000 pieces per type, 1.4 million pieces in total; each picture is marked with a specific physical label indicating the content to be detected in the image, for example the scene type or a target object. A minimal sketch of this two-phase fine-tuning is given below.
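  • A sketch (assuming PyTorch, with the torchvision MobileNet implementation as a stand-in for MobileNet_V1) of the Finetune strategy described above: train the category-judgment head alone, then unfreeze the feature extractor and fine-tune the whole network. The 280-class head and the 4000/1000 step counts follow the text; the optimizer settings and `train_loader` are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v2(weights="DEFAULT")          # stand-in pretrained backbone
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 280)

criterion = nn.CrossEntropyLoss()

def run_steps(parameters, steps, loader):
    """Train for a fixed number of SGD steps; loader must yield enough batches."""
    optimizer = torch.optim.SGD(parameters, lr=0.01, momentum=0.9)
    it = iter(loader)
    for _ in range(steps):
        images, labels = next(it)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# Phase 1: train only the classification head for 4000 steps (backbone frozen).
for p in model.features.parameters():
    p.requires_grad = False
# run_steps(model.classifier.parameters(), 4000, train_loader)  # train_loader: hypothetical DataLoader

# Phase 2: global fine-tuning of the whole network for 1000 steps.
for p in model.features.parameters():
    p.requires_grad = True
# run_steps(model.parameters(), 1000, train_loader)
```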
  • Figure 6 shows the process of identifying the scene classification of the image to be processed.
  • The image frame to be processed is input into the network, and after feature extraction and category judgment, the scene category corresponding to the image frame to be processed is finally output.
  • the network can output the scene category label of the image frame to be processed.
  • The category labels can include: selfie, group photo, architecture, food, blue sky, silhouette, sunset, beach, sports, grass, text, and night view.
  • S430 Determine a scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed.
  • the target object may be a category corresponding to each specific object in the image, that is, the category of each object in the specific scene.
  • the Mobilenet network model may continue to be used to detect the target object in the image frame to be processed to obtain the target object category.
  • the target object in the image frame to be processed is detected to obtain the target object category.
  • Deep-learning-based target detection algorithms include algorithms that first generate candidate regions and then perform Convolutional Neural Network (CNN) classification (i.e., the RCNN (Regions with CNN features) algorithm), and algorithms that are applied directly to the input image and output the category together with the corresponding position (i.e., the YOLO algorithm).
  • CNN Convolutional Neural Networks
  • the trained Yolo_V3 network can be used to detect and recognize the target object in the image frame to be processed.
  • Figure 7 shows the Yolo_V3 network structure.
  • The network input size is 416x416 with 3 channels. DBL represents Darknetconv2d_BN_Leaky, the basic component of yolo_v3, which is convolution + batch normalization (BN) + Leaky ReLU activation; a minimal sketch of this basic block is given after the component naming list below.
  • Tensor stitching concatenates the upsampled output of an intermediate layer with a later layer.
  • the final network outputs the category and location of each detected object.
  • the network outputs 1000 types of objects and detection frames.
  • the detection frame is used to indicate the position of the object in the image where the object is located.
  • the residual model can be expressed as res_block
  • the residual unit can be expressed as res_unit
  • the tensor stitching can be expressed as concat
  • the middle layer can be expressed as darknet
  • the upsampling can be expressed as up_sample
  • the first fusion layer is route_1(1/8size )
  • the second fusion layer is route_2(1/16size)
  • the third fusion layer is route_3(1/32size)
  • zero padding is zero padding
  • the first output is y1, the second output is y2, and the third output is y3.
  • the first output channel is yolo_head/conv_6, the second output channel is yolo_head/conv_14, the third output channel is yolo_head/conv_22, the first feature map is feature_map_1, the second feature map is feature_map_2, and the third feature map is feature_map_3.
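  • A minimal sketch (assuming PyTorch) of the DBL basic component named above: convolution + batch normalization (BN) + Leaky ReLU. Layer sizes are illustrative; the full Yolo_V3 network stacks many such blocks together with the residual blocks, upsampling, and tensor stitching (concat) shown in Figure 7.

```python
import torch
import torch.nn as nn

def dbl(in_channels, out_channels, kernel_size=3, stride=1):
    """Darknetconv2d_BN_Leaky: Conv2d -> BatchNorm2d -> LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, stride,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.LeakyReLU(0.1, inplace=True),
    )

# The 416x416, 3-channel network input passes through a first DBL block:
stem = dbl(3, 32)
features = stem(torch.randn(1, 3, 416, 416))    # -> (1, 32, 416, 416)
```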
  • S450 Determine an object time segment of the target object category in the target video according to the time stamp of the image frame to be processed.
  • Specifically, the image frame in which the target object appears is determined, and the timestamp of that image frame is used as the timestamp of the target object. In this way, the timestamps corresponding to each category of target object in the target video can be determined, and thereby the time segment in which each target object appears in the target video.
  • S460 Generate a video annotation result according to the target scene type, the scene time segment, the target object category, and the object time segment.
  • That is, content corresponding to the target object category is added to the annotation result according to the target object category and the object time segment.
  • The video annotation result can describe that the scene type corresponding to the scene time segment in the target video is the target scene type, so that the scene type within a certain time period of the target video can be known through the video annotation result; in addition to clarifying the start time and end time of this type of scene, the start time and end time of each target object category in the target video can also be determined.
  • the video tagging result may be content set based on the time axis of the target video.
  • S460 may include S461 to S465.
  • S462 Determine a time axis according to the playing time.
  • the time axis may be the playback time axis corresponding to the above-mentioned video, and the specific implementation of acquiring the playback time of the target video and determining the time axis according to the playback time can refer to the foregoing embodiment, and will not be repeated here.
  • S463 Determine a scene interval of the target scene type on the time axis according to the scene time segment corresponding to the target scene type.
  • The scene time segment includes the start time and end time of the target scene type on the time axis; therefore, the area between the start time and the end time of the target scene type on the time axis is used as the scene interval corresponding to the target scene type.
  • S464 Determine a target object interval of the target object category on the time axis according to the object time segment corresponding to the target object category.
  • The object time segment includes the start time and end time of the target object category on the time axis; therefore, the area between the start time and the end time of the target object category on the time axis is taken as the target object interval corresponding to the target object category.
  • S465 Generate a video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category.
  • Specifically, the scene interval and the target object interval may be marked on the time axis, and first content and second content may be generated according to the target scene type and the target object category, with the scene interval on the time axis corresponding to the first content and the target object interval corresponding to the second content. In this way, the positions of the scene interval and the target object interval are clarified on the time axis, and the category of the scene or target object corresponding to each interval can be clarified based on the first content and the second content.
  • An implementation of generating the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category may be: obtain the scene annotation content corresponding to the target scene type and the object annotation content corresponding to the target object category, and then generate the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category.
  • the video annotation result includes a time axis, the time axis is marked with a scene section and a target object section, and the scene label content is displayed at the position of the scene section, and the object label content is displayed at the position of the target object section.
  • The scene annotation content is used to describe the scene category, and can be text, pictures, etc. For example, if the scene category is indoor, the scene annotation content is the text "indoor".
  • The object annotation content is used to describe the object category, and can likewise be text, pictures, etc. For example, if the object category is chair, the object annotation content is the text "chair".
  • the target object category can be the category of the object, or the category of specific details of the object.
  • the object category includes the main category and the subcategories under the main category.
  • The subcategory may be a category of specific details of the target object; for example, if the main category of the target object is person, the subcategory may be an expression category or an emotion category.
  • the video annotation result may be a display content
  • the display content includes a time axis
  • the time axis is marked with a scene interval and a target object interval
  • the scene annotation content is displayed at the position of the scene interval.
  • the label content of the object is displayed at the position of the target object interval.
  • the display content includes the time axis image, the scene image of each scene interval, and the target image of each target object interval.
  • The ratio of the length of each scene image or target object image to the length of the time axis image equals the ratio of the time length of the corresponding scene time segment or object time segment to the playback time length of the target video, so as to reflect, on the time axis of the target video, the time interval in which each scene and target object exists.
  • scene annotation content or object annotation content is displayed on the scene image in each scene section and the target object image in each target object section.
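  • A minimal sketch of assembling the video annotation result described above: a time axis marked with scene intervals and target object intervals, each carrying its annotation content, plus the proportional length used when drawing each interval. The data layout and field names are illustrative assumptions.

```python
def build_annotation_result(play_time_s, scene_segments, object_segments):
    """scene_segments / object_segments: lists of (label, start_s, end_s)."""
    def interval(label, start, end):
        return {
            "label": label,                                  # scene or object annotation content
            "start": start,
            "end": end,
            "length_ratio": (end - start) / play_time_s,     # share of the time axis
        }
    return {
        "time_axis": {"start": 0.0, "end": play_time_s},
        "scene_intervals": [interval(*s) for s in scene_segments],
        "object_intervals": [interval(*o) for o in object_segments],
    }

result = build_annotation_result(
    play_time_s=10.0,
    scene_segments=[("indoor", 2.0, 5.0)],
    object_segments=[("chair", 2.5, 4.0)],
)
```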
  • FIG. 10 shows a video processing method provided by another embodiment of the present application. Specifically, the method may execute the video processing method when the electronic device is idle. Specifically, referring to FIG. 10, the method includes: S1001 to S1005.
  • S1001 Acquire the working status of the electronic device.
  • the working state of an electronic device includes a busy state and an idle state.
  • The busy state indicates that the current power consumption of the electronic device is relatively high, and processing the target video to obtain the video annotation result may cause the system to freeze; the idle state is the opposite, that is, the current power consumption is low, and processing the target video to obtain the video annotation result is unlikely to cause the system to stall.
  • the working state may be determined by at least one of the CPU usage rate, the charging state, and the current time.
  • the CPU usage rate is used to determine the working status of the electronic device. Specifically, it is determined whether the CPU usage rate of the electronic device is lower than the usage rate threshold, and if it is lower, the working state of the electronic device is determined to be an idle state; otherwise, the working state of the electronic device is determined to be a busy state.
  • the CPU usage rate can be obtained by viewing the task manager of the electronic device.
  • the CPU usage rate can be obtained through the adb shell top command.
  • the utilization rate threshold may be a utilization rate set by the user.
  • For example, the utilization rate threshold may be 60%. If the current CPU utilization rate is 40%, then 40% is less than 60% and it is determined that the utilization rate of the central processing unit is less than the utilization rate threshold; if the current CPU utilization rate is 70%, then 70% is greater than 60% and it is determined that the utilization rate of the central processing unit is greater than the utilization rate threshold.
  • If the utilization rate of the central processing unit is less than the utilization rate threshold, the current CPU resources are relatively abundant and the working state of the electronic device can be determined to be the idle state, so S1002 can be executed; if the utilization rate of the central processing unit is greater than or equal to the utilization rate threshold, the current CPU resources are relatively scarce and the working state of the electronic device can be determined to be the busy state.
  • Since the CPU usage rate is related to the applications currently started on the electronic device, when the electronic device is in the busy state it can be judged whether any currently opened application matches a preset application.
  • A preset application is an application that the system is allowed to close without the user's authorization. If such an application exists, the application matching the preset application is closed, the current CPU usage rate is then obtained again, and the operation of judging whether the usage rate of the central processing unit is less than the usage rate threshold is performed again.
  • Specifically, a list of preset applications is pre-stored in the electronic device, and the list includes the identifiers of a plurality of designated applications, where a designated application is one that the user has authorized the system to close without further user authorization. The designation may be made by the user manually entering the identifier of the specified application; the system is then allowed to kill the process of that application without the user's authorization, thereby releasing a certain amount of CPU resources and reducing the CPU usage rate.
  • the working state of the electronic device is determined based on the charging state and the current time. Specifically, if the electronic device is in a charging state and the current time is within the preset time range, it is determined that the working state of the electronic device is in an idle state; otherwise, it is determined that the working state of the electronic device is in a busy state.
  • The preset time range may be a preset time interval within which the probability of the user using the electronic device is small. For example, if the preset time range is 1 am to 6 am, the user is typically asleep and the electronic device is in the charging state, so the system resources of the electronic device are less occupied at this time and the device is in an idle state.
  • the detection of the holding state of the electronic device can be added on the basis that the electronic device is in the charging state and the current time is within the preset time range, that is, if the electronic device is in the charging state and the current time is within the preset time range and If the holding state of the electronic device is an unheld state, it is determined that the working state of the electronic device is in an idle state; otherwise, it is determined that the working state of the electronic device is in a busy state.
  • When a user holds an electronic device, the gripped parts are generally concentrated on the bottom frame, the top frame, and the area of the back near the bottom or the top. Therefore, detection devices can be set at the positions of the top frame and the bottom frame to detect whether the user is holding the electronic device, that is, the electronic device can detect whether it is in a held state.
  • pressure sensors may be provided at the positions of the top frame and the bottom frame.
  • If the pressure sensor detects a pressure value, it is determined that the electronic device is in a held state.
  • Temperature sensors can also be set at the top and bottom frames.
  • When the user is not holding the electronic device, the temperature value detected by the temperature sensor is a first temperature value; when the user is holding the electronic device, the detected temperature value is a second temperature value, where the first temperature value is less than the second temperature value and the second temperature value is greater than a preset temperature value.
  • For example, the preset temperature value is 37°C, the body temperature of the human body; if the detected temperature value is greater than the preset temperature value, it is determined that the electronic device is in a held state.
  • In addition, the touch screen of the electronic device can detect the user's touch operation; if a touch operation can be detected, the holding state is determined to be the held state, otherwise it is determined to be the unheld state.
  • Even when the screen of the electronic device is off, the touch screen is not turned off and remains in a state capable of detecting touch operations.
  • The state of the electronic device can also be determined by combining the CPU usage rate, the charging state, and the current time at the same time: when the CPU usage rate is less than the usage rate threshold, the electronic device is in the charging state, and the current time is within the preset time range, the working state of the electronic device is determined to be the idle state; otherwise, it is determined to be the busy state. A minimal sketch of this combined check follows.
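  • A sketch of the idle-state decision combining the three signals described above. How the raw signals are read is platform-specific (the text mentions the task manager and the adb shell top command), so they are passed in as arguments here; the 60% threshold and the 1 am to 6 am window follow the examples in the text.

```python
from datetime import datetime, time

USAGE_THRESHOLD = 0.60
PRESET_RANGE = (time(1, 0), time(6, 0))      # 1 am to 6 am

def is_idle(cpu_usage, is_charging, now):
    """Return True when the electronic device can run video annotation in the background."""
    in_preset_range = PRESET_RANGE[0] <= now.time() <= PRESET_RANGE[1]
    return cpu_usage < USAGE_THRESHOLD and is_charging and in_preset_range

# Example: 40% CPU usage, charging, at 2:30 am -> idle, so the target video is processed.
print(is_idle(0.40, True, datetime(2021, 4, 7, 2, 30)))   # True
```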
  • S1003 Acquire the target scene type corresponding to the image frame to be processed in the target video.
  • S1004 Determine a scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed.
  • S1005 Generate a video annotation result according to the target scene type and the scene time segment corresponding to the target scene type.
  • Therefore, when the electronic device is in the idle state, the operations of acquiring the target video to be processed and subsequently obtaining the video annotation result can be performed, preventing the electronic device from stuttering and affecting the user's experience while the method is executed.
  • In addition, the MobileNet network model and the YOLO target detection model used in the embodiments of the present application have simple structures and low computational complexity, which makes them more suitable for running on electronic devices.
  • FIG. 11 shows a structural block diagram of a video processing apparatus 1100 provided by an embodiment of the present application.
  • the apparatus may include: a video acquisition unit 1101, a scene acquisition unit 1102, a determination unit 1103, and a processing unit 1104.
  • the video acquisition unit 1101 is configured to acquire the target video to be processed
  • the scene acquisition unit 1102 is configured to acquire the target scene type corresponding to the image frame to be processed in the target video;
  • the determining unit 1103 is configured to determine the scene time segment of the target scene type in the target video according to the time stamp of the image frame to be processed, wherein, in the target video, the image in the scene time segment
  • the scene types corresponding to the frames are all the target scene types
  • the processing unit 1104 is configured to generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
  • FIG. 12 shows a structural block diagram of a video processing apparatus 1200 provided by an embodiment of the present application.
  • the apparatus may include: a video acquisition unit 1201, a scene acquisition unit 1202, a first determination unit 1203, a second determination unit 1204, and a processing unit 1205.
  • the video acquisition unit 1201 is configured to acquire the target video to be processed.
  • the video acquiring unit 1201 is also used to acquire the working state of the electronic device; if the working state is an idle state, acquiring the target video to be processed.
  • the scene acquiring unit 1202 is configured to acquire the target scene type corresponding to the image frame to be processed in the target video.
  • the scene acquiring unit 1202 is also configured to acquire the target scene type corresponding to the image frame to be processed in the target video based on the Mobilenet network model.
  • the first determining unit 1203 is configured to determine a scene time segment of the target scene type in the target video according to the time stamp of the image frame to be processed, wherein, in the target video, the scene time segment is The scene types corresponding to the image frames of are all the target scene types.
  • the second determining unit 1204 is configured to detect the target object in the image frame to be processed to obtain the target object category; according to the time stamp of the image frame to be processed, determine the target object category in the target video Object time segment; generating a video annotation result according to the target scene type, the scene time segment, the target object category, and the object time segment.
  • The second determining unit 1204 is further configured to obtain the play time of the target video; determine the time axis according to the play time; determine the scene interval of the target scene type on the time axis according to the scene time segment corresponding to the target scene type; determine the target object interval of the target object category on the time axis according to the object time segment corresponding to the target object category; and generate a video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category.
  • the second determining unit 1204 is further configured to obtain the scene annotation content corresponding to the target scene type and the object annotation content corresponding to the target object category; according to the time axis, the scene interval, and the target object interval , The target scene type and the target object category generate a video annotation result, wherein the video annotation result includes a time axis, and the time axis is marked with a scene interval and a target object interval, and is displayed at the position of the scene interval There are scene labeling content, and the object labeling content is displayed at the location of the target object interval.
  • the second determining unit 1204 is further configured to detect the target object in the image frame to be processed based on the YOLO target detection model to obtain the target object category.
  • the processing unit 1205 is configured to generate a video annotation result according to the target scene type, the scene time segment, the target object category, and the object time segment.
  • the coupling between the modules may be electrical, mechanical or other forms of coupling.
  • the functional modules in the various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules.
  • the electronic device 100 may be an electronic device capable of running application programs, such as a smart phone, a tablet computer, or an e-book.
  • The electronic device 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs, where the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, and the one or more programs are configured to execute the methods described in the foregoing method embodiments.
  • the processor 110 may include one or more processing cores.
  • The processor 110 uses various interfaces and lines to connect the various parts of the entire electronic device 100, and performs the various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and calling data stored in the memory 120.
  • Optionally, the processor 110 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA).
  • DSP Digital Signal Processing
  • FPGA Field-Programmable Gate Array
  • PLA Programmable Logic Array
  • the processor 110 may be integrated with one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like.
  • the CPU mainly processes the operating system, user interface, and application programs; the GPU is used to render and draw the display content; and the modem is used to process wireless communication. It can be understood that the above-mentioned modem may not be integrated into the processor 110, but may be implemented by a communication chip alone.
  • The memory 120 may include random access memory (RAM) or read-only memory (ROM).
  • the memory 120 may be used to store instructions, programs, codes, code sets or instruction sets.
  • The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, an image playback function, etc.), instructions for implementing the following method embodiments, and the like.
  • the storage data area can also store data (such as phone book, audio and video data, chat record data) created by the electronic device 100 during use.
  • FIG. 14 shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • the computer-readable medium 1400 stores program code, and the program code can be invoked by a processor to execute the method described in the foregoing method embodiment.
  • the computer-readable storage medium 1400 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • the computer-readable storage medium 1400 includes a non-transitory computer-readable storage medium.
  • the computer-readable storage medium 1400 has storage space for the program code 1410 for executing any method steps in the above-mentioned methods. These program codes can be read from or written into one or more computer program products.
  • The program code 1410 may, for example, be compressed in a suitable form.
  • In summary, the video processing method, device, electronic device, and computer-readable medium provided by the embodiments of the present application acquire the target video to be processed; acquire the target scene type corresponding to the image frame to be processed in the target video; determine, according to the timestamp of the image frame to be processed, the scene time segment of the target scene type in the target video, wherein, in the target video, the scene types corresponding to the image frames within the scene time segment are all the target scene type; and generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
  • Therefore, the scene type of the image frames in the video can be recognized, and the annotation result can be obtained by combining the scene type with the time at which the scene type appears in the video, so that the annotation result reflects the correspondence between time periods of the video and scenes, making the annotation result more intuitive and better matched to user needs.
  • a network of picture scene recognition and picture object detection and recognition based on deep learning is used to completely record the scenes at different time points in the video and the objects that appear in the video scene at different time points.
  • Specifically, the deep-learning-based Mobilenet_V1 network is used for video content scene recognition and the Yolo_V3 network is used for video content detection and recognition, supporting the detection and recognition of 12 scenes and 1000 types of objects. The selected networks are relatively lightweight, which keeps the models small while greatly reducing the amount of calculation, so they can be run directly offline on the mobile phone without uploading the user's photographed data to the cloud. This improves the user experience while ensuring user privacy.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present application discloses a video processing method and apparatus, an electronic device and a computer readable medium, relating to the field of video technologies. Said method comprises: acquiring a target video to be processed; acquiring a target scene type corresponding to an image to be processed in the target video; determining, according to a time stamp of said image, a scene time fragment of the target scene type in the target video, wherein in the scene time fragment, the scene types corresponding to the images in the target video are all the target scene type; and generating a video labeling result according to the scene type and the scene time fragment corresponding to the scene type. Therefore, the present invention can identify the scene type of images in a video and obtain a labeling result by combining the scene type with the appearance time of the scene type in the video, so that the labeling result can reflect the correlation between a time period of the video and the scene, making the labeling result more intuitive and better able to satisfy user requirements.

Description

Video processing method, device, electronic equipment and computer readable medium
Cross-reference to related applications
This application claims priority to the Chinese patent application No. CN202010420727.0, titled "Video processing method, device, electronic equipment and computer-readable medium", filed with the Chinese Patent Office on May 18, 2020, the entire contents of which are incorporated into this application by reference.
Technical field
本申请涉及视频技术领域,更具体地,涉及一种视频处理方法、装置、电子设备及计算机可读介质。This application relates to the field of video technology, and more specifically, to a video processing method, device, electronic equipment, and computer-readable medium.
Background
视频标注是在视频预览或录像回放过程中,直接在视频上进行突出标记,使视频更具有针对性的视频处理方式,在诸多领域应用广泛。例如,视频标注是公安侦查民警在视频案件研判中最常用的一种分析手段,使公安干警可定位和重点关注嫌疑目标,锁定重要视频线索信息。又如,视频标注还可以用于医学领域的影像图像分析,医师可通过视频标注重点标出发生病变或产生异常的身体部位。再如,视频标注还可以作为视频的一种存储方式,可以作为视频对应的描述内容,用户通过该视频标注能够快速获取视频内的部分内容。Video annotation is a direct highlighting mark on the video during the video preview or video playback process, which makes the video more targeted video processing method, which is widely used in many fields. For example, video tagging is the most common analysis method used by public security investigators in the research and judgment of video cases, so that public security officers can locate and focus on suspected targets and lock important video clues. For another example, video annotation can also be used for image analysis in the medical field, and physicians can use video annotation to highlight body parts that have lesions or abnormalities. For another example, the video annotation can also be used as a storage method of the video, and can be used as the description content corresponding to the video. The user can quickly obtain part of the content in the video through the video annotation.
However, most current video tagging technology relies on manual tagging, in which the content of an album must be identified and tagged by hand. Tagging efficiency is low, a large amount of manpower and financial resources is consumed, and tagging accuracy drops as fatigue increases.
Summary of the invention
This application proposes a video processing method, device, electronic device, and computer-readable medium to overcome the above-mentioned drawbacks.
In a first aspect, an embodiment of the present application provides a video processing method, including: acquiring a target video to be processed; acquiring a target scene type corresponding to an image frame to be processed in the target video; determining, according to a timestamp of the image frame to be processed, a scene time segment of the target scene type in the target video, wherein, in the target video, the scene types corresponding to the image frames in the scene time segment are all the target scene type; and generating a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
In a second aspect, an embodiment of the present application also provides a video processing device, including: a video acquisition unit, a scene acquisition unit, a determination unit, and a processing unit. The video acquisition unit is configured to acquire the target video to be processed. The scene acquisition unit is configured to acquire the target scene type corresponding to the image frame to be processed in the target video. The determination unit is configured to determine the scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed, wherein, in the target video, the scene types corresponding to the image frames in the scene time segment are all the target scene type. The processing unit is configured to generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
In a third aspect, an embodiment of the present application also provides an electronic device, including: one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to execute the foregoing method.
In a fourth aspect, an embodiment of the present application also provides a computer-readable medium storing program code executable by a processor, wherein the program code, when executed by the processor, causes the processor to perform the above method.
Description of the drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative work.
FIG. 1 shows a method flowchart of a video processing method provided by an embodiment of the present application;
FIG. 2 shows a schematic diagram of a video download interface provided by an embodiment of the present application;
FIG. 3 shows a schematic diagram of a video playback interface provided by an embodiment of the present application;
FIG. 4 shows a method flowchart of a video processing method provided by another embodiment of the present application;
FIG. 5 shows the training process of the Mobilenet_V1 network provided by an embodiment of the present application;
FIG. 6 shows the process of identifying the scene classification of an image to be processed provided by an embodiment of the present application;
FIG. 7 shows a schematic diagram of the Yolo_V3 network structure provided by an embodiment of the present application;
FIG. 8 shows a flowchart of S460 in FIG. 4;
FIG. 9 shows a schematic diagram of a video annotation result provided by an embodiment of the present application;
FIG. 10 shows a method flowchart of a video processing method provided by yet another embodiment of the present application;
FIG. 11 shows a block diagram of a video processing device provided by an embodiment of the present application;
FIG. 12 shows a block diagram of a video processing device provided by another embodiment of the present application;
FIG. 13 shows a schematic diagram of an electronic device provided by an embodiment of the present application;
FIG. 14 shows a storage unit, provided by an embodiment of the present application, for storing or carrying program code that implements the video processing method according to the embodiments of the present application.
Detailed description of embodiments
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings of the embodiments of the present application.
视频标注是在视频预览或录像回放过程中,直接在视频上进行突出标记,使视频更具有针对性的视频处理方式,在诸多领域应用广泛。例如,视频标注是公安侦查民警在视频案件研判中最常用的一种分析手段,使公安干警可定位和重点关注嫌疑目标,锁定重要视频线索信息。又如,视频标注还可以用于医学领域的影像图像分析,医师可通过视频标注重点标出发生病变或产生异常的身体部位。再如,视频标注还可以作为视频的一种存储方式,可以作为视频对应的描述内容,用户通过该视频标注能够快速知晓视频内的部分内容。Video annotation is a direct highlighting mark on the video during the video preview or video playback process, which makes the video more targeted video processing method, which is widely used in many fields. For example, video tagging is the most common analysis method used by public security investigators in the research and judgment of video cases, so that public security officers can locate and focus on suspected targets and lock important video clues. For another example, video annotation can also be used for image analysis in the medical field, and physicians can use video annotation to highlight body parts that have lesions or abnormalities. For another example, the video annotation can also be used as a storage method of the video, and can be used as the description content corresponding to the video. The user can quickly know part of the content in the video through the video annotation.
Currently, video tagging methods mainly fall into manual tagging and machine-learning-based video tagging.
For example, in one manual video tagging method, a container interface for holding the video is first constructed through a web page and the video is loaded into the video area; then, according to the content of the video, the slider is manually dragged or the video drag bar is clicked to change the video playback time point or to confirm the time point of the content being played, and the content of the video is tagged by clicking the knowledge point panel of the video.
With the continuous application of machine learning technology in the field of computer vision, the demand for labeled data is increasing. Machine learning is a branch of artificial intelligence. Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning. Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It studies how computers simulate or realize human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations.
For example, one machine-learning-based video tagging method tags videos based on feature extraction. Specifically, the acquired video stream is first decoded, tagging commands corresponding to all frame images are received, all stored features corresponding to those frame images are then extracted according to the tagging commands, and finally the stored feature and the receiving time corresponding to each tagging command are saved in the tagging record.
However, the inventor found in research that the existing manual-tagging methods require the content in the album to be identified and tagged by hand, which is inefficient, consumes a lot of manpower and financial resources, becomes less accurate as fatigue increases, and, most importantly, can infringe on the privacy of the album user. The disadvantage of the feature-extraction-based video tagging method is that it only records whether the video contains features of a given category and does not order the features within the video, which makes it difficult for the tagging result to interpret the content of the video.
Therefore, in order to overcome the above-mentioned drawbacks, an embodiment of the present application provides a video processing method applied to an electronic device. As an implementation manner, the execution subject of the method may be the electronic device itself, so that the video processing method can be executed locally on the electronic device, avoiding sending the video to a cloud server, which could leak data and endanger the user's privacy. Specifically, as shown in FIG. 1, the method includes S101 to S104.
S101: Acquire a target video to be processed.
As an implementation manner, the target video to be processed may be at least some of the videos already stored in the electronic device. In some embodiments, the target video to be processed may be a video selected by the user from the videos stored by the electronic device. For example, the electronic device may display the stored videos on its screen, and the user selects at least some of the displayed videos as the target video to be processed.
As another implementation manner, the target video to be processed may be a video that the user requests to download. As shown in FIG. 2, the interface shown in FIG. 2 is a video download interface provided by an application. The application may be a video application, that is, an application with a video playback function through which users can watch videos online and download videos. When the user selects the video to be downloaded in the video download interface, the electronic device can detect the identifier of the video corresponding to the download request triggered by the user. For example, the video corresponding to the download button triggered by the user in the video download interface is detected, and the video corresponding to the triggered download button is the video the user requests to download.
By taking the video requested for download as the target video to be processed, the video processing method of the embodiments of this application can be executed on the video as soon as the user requests to download it, so that when the video is stored, it can be stored in correspondence with its video annotation result.
Of course, it is also possible to record the identifier of the video requested for download, or to store the video, and to select at least some of the downloaded videos as the target video to be processed under a specified condition.
As yet another implementation manner, the target video to be processed may be a video recorded by the user through a video recording application. For example, a video recorded by the user through the video recording function of the camera application can be used as the target video to be processed, so that when the video is stored, the video and its corresponding video annotation result can be stored together.
Of course, the identifier of the recorded video can also be stored, and the video can be used as the target video to be processed under a specified condition.
The specified condition may be a preset execution condition of the processing method of the embodiments of this application, that is, the method of the embodiments of this application may be executed on the target video to be processed when the specified condition is met, so as to obtain the video annotation result of the target video to be processed. As an implementation manner, the specified condition may be a preset period, for example 24 hours, that is, the method of the embodiments of the present application is executed once every preset period. As another implementation manner, the specified condition may be that the electronic device is in an idle state, so as to prevent the electronic device from consuming excessive power and causing the system to stall while executing the method of the embodiments of the present application; the specific implementation of the idle state is described in the subsequent embodiments.
S102: Acquire the target scene type corresponding to the image frame to be processed in the target video.
The image frames to be processed may be at least some of all the image frames of the target video. As an implementation manner, the image frames to be processed may be the image frames of a partial time period of the target video, for example, the image frames corresponding to the time period between the end of the opening part of the video and the start of the ending part of the video, so that the opening and ending parts of the video need not be processed to obtain the annotation result, which reduces the amount of computation. Of course, the image frames to be processed may also be the key frames among all the image frames of the target video, which likewise reduces the amount of computation. As another implementation manner, all the image frames in the target video may be used as the image frames to be processed, which improves the accuracy and comprehensiveness of the video annotation result.
As an implementation manner, each image frame corresponds to a scene, and each scene corresponds to a scene category. In some embodiments, the scene categories may include: selfie, group photo, architecture, food, blue sky, silhouette, sunset, beach, sports, grass, text, and night scene. That is, the scene represents the content expressed by the entire image frame, while the individual objects in the image frame can be regarded as elements of the scene. For example, if the entire image is a group photo of user A and user B, the scene type of the image frame is "group photo", the elements in the scene include user A and user B, and the type of user A and user B is "person".
As an implementation manner, the scene type of the image frame to be processed can be identified based on machine learning. For example, a neural network structure, such as VGG-Net or ResNet, is pre-trained; the image frame to be processed is taken as the input image of the neural network structure, and the output of the neural network structure is the scene type corresponding to the image to be processed, that is, the target scene type.
Specifically, the output of the last layer of the neural network structure is a vector giving the probability that the input image belongs to each predefined scene category. In the process of constructing the integrated classifier, the outputs of several intermediate layers of the deep neural network can be used as features of the input image to train a Softmax classifier, and the deep network model is trained with mini-batch stochastic gradient descent and the back-propagation algorithm. The target scene type corresponding to the image frame to be processed can thus be obtained through the classifier of the neural network.
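By way of illustration only (not part of the disclosed networks), the mapping from the classifier's output vector to a scene label can be sketched as follows; the label list and helper names are assumptions introduced here for clarity:

```python
import numpy as np

# Hypothetical labels corresponding to the 12 scene categories mentioned above.
SCENE_LABELS = ["selfie", "group photo", "architecture", "food", "blue sky", "silhouette",
                "sunset", "beach", "sports", "grass", "text", "night scene"]

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def classify_scene(frame_logits):
    """Turn the network's raw output vector for one frame into (scene label, confidence)."""
    probs = softmax(np.asarray(frame_logits, dtype=np.float64))
    best = int(np.argmax(probs))
    return SCENE_LABELS[best], float(probs[best])
```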
As an implementation manner, there may be multiple image frames to be processed, and the image frames to be processed may correspond to multiple scene types, so that multiple target scene types may be obtained. For example, the image frames to be processed include image 1 to image 9, where the scene types corresponding to image 1, image 2, image 3, image 4, and image 5 are all a first scene type, and the scene types corresponding to image 6, image 7, image 8, and image 9 are all a second scene type; the target scene types corresponding to these 9 image frames to be processed are then the first scene type and the second scene type.
S103: Determine the scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed.
In the target video, the scene types corresponding to the image frames in the scene time segment are all the target scene type.
Each image frame in the target video corresponds to a timestamp, and the timestamp of each image frame reflects the playback order of that image frame in the target video. A video can be regarded as multiple image frames synthesized and played in a certain order; therefore, the image set obtained after encoding multiple image frames in a certain order can be regarded as a video, and the timestamp can be regarded as tag information representing the playback position of an image frame within the video. Usually, the first image frame of the video is taken as the starting image and its timestamp as the starting timestamp; the timestamps of the image frames after the starting image then increase from the starting timestamp according to the playback order, and the difference between every two adjacent image frames can be fixed.
Therefore, each image frame in the video corresponds to a time point on the playback time axis of the video, and this time point is the timestamp of the image frame. The playback time axis of the video is related to the playback duration of the video; it can take 0 as the starting point and the total playback duration of the video as the end point. For example, if the total length of the video is 10 seconds, the playback time axis of the video is a time axis starting at 0 and ending at 10 seconds. The timestamp of each image frame in the video lies on this playback time axis, so the position of each image frame on the playback time axis can be determined.
The scene time segment may include at least one of the start time and the end time of the scene.
According to S102, the scene type corresponding to each image frame to be processed can be determined, and from the scene types of the image frames the start time and end time of each scene type can then be determined. For example, the scene types corresponding to the aforementioned image 1 to image 5 are all the first scene type, the scene types corresponding to image 6 to image 9 are all the second scene type, and image 1 to image 9 correspond to timestamps t1 to t9, respectively. It can then be determined that the scene time segment of the first scene type is t1 to t5, that is, on the video playback time axis, the scene types corresponding to all image frames between t1 and t5 are the first scene type; and the scene time segment of the second scene type is t6 to t9, that is, on the video playback time axis, the scene types corresponding to all image frames between t6 and t9 are the second scene type.
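As a purely illustrative sketch of this grouping step (the data layout is an assumption, not the claimed implementation), consecutive frames sharing a scene label can be merged into scene time segments as follows:

```python
from itertools import groupby

def scene_time_segments(frames):
    """frames: list of (timestamp, scene_label) pairs in playback order.
    Returns {scene_label: [(start, end), ...]}, where each (start, end) covers a
    maximal run of consecutive frames sharing the same scene label."""
    segments = {}
    for label, run in groupby(frames, key=lambda f: f[1]):
        run = list(run)
        segments.setdefault(label, []).append((run[0][0], run[-1][0]))
    return segments

# With frames t1..t5 labelled as the first scene type and t6..t9 as the second,
# the result is {first: [(t1, t5)], second: [(t6, t9)]}.
```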
S104: Generate a video annotation result according to the target scene type and the scene time segment corresponding to the target scene type.
The video annotation result is used to describe that the scene type corresponding to the scene time segment in the target video is the target scene type. Through the video annotation result, the scene type within a certain time period of the target video can be known, and the start time and end time of a scene of that type are made explicit, so that when a video of a certain scene needs to be queried, the start time and end time of that scene can be quickly located in the target video according to the scene time segment corresponding to the scene, which is convenient for quick lookup.
As an implementation manner, the video annotation result may be the description content corresponding to the target video, and the description content may be text. Specifically, the description content expresses, in textual form, the multiple scene types in the target video and the start time and end time corresponding to each scene type. For example, the description content may be "scene: selfie, scene time segment: 2 seconds to 5 seconds".
As another implementation manner, the video annotation result may be content arranged based on the time axis of the target video; for details, please refer to the subsequent embodiments.
In some embodiments, the electronic device may display the video annotation result. As an implementation manner, the electronic device can display the video annotation result in a designated interface of the electronic device. For example, the designated interface may be the playback interface of the target video. As an implementation manner, the video annotation result may be displayed on the progress bar of the target video being played, that is, the scene time segment of the target scene type and the target scene type are marked on the progress bar.
As shown in FIG. 3, the content played in the video playback interface of FIG. 3 is the target video, and a first mark 302 and a second mark 303 corresponding to the target scene type are displayed on the progress bar 301 of the target video. The first mark 302 indicates the position on the progress bar 301 of the start time of the target scene type, and the second mark 303 indicates the position on the progress bar 301 of the end time of the target scene type. When the user triggers the first mark 302 and the second mark 303, first content and second content are displayed, where the first content indicates that the position corresponding to the first mark 302 is the start time of the target scene type, such as the "selfie scene start time" shown in FIG. 3, and the second content indicates that the position corresponding to the second mark 303 is the end time of the target scene type. Therefore, when watching the target video, the user can see from the first mark 302 and the second mark 303 where each scene lies on the video progress bar 301, which helps the user quickly locate a scene of interest. The progress bar 301 of the video is the playback time axis of the video.
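As an illustrative sketch only (the normalization scheme is an assumption introduced here), the scene time segments can be converted into marker positions on a progress bar whose length is normalized to the range [0, 1]:

```python
def progress_bar_markers(segments, total_duration):
    """segments: {scene_label: [(start, end), ...]} in seconds.
    Returns marker descriptors with start/end positions normalized to the bar length."""
    markers = []
    for label, spans in segments.items():
        for start, end in spans:
            markers.append({
                "label": label,
                "start_pos": start / total_duration,  # position of the first mark
                "end_pos": end / total_duration,      # position of the second mark
            })
    return markers
```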
Please refer to FIG. 4, which shows a video processing method provided by another embodiment of the present application. Specifically, this method not only identifies the scenes in the target video, but also identifies the individual objects within each scene, and generates the video annotation result by combining the scenes and the objects. Specifically, referring to FIG. 4, the method includes S410 to S460.
S410: Acquire a target video to be processed.
S420: Acquire the target scene type corresponding to the image frame to be processed in the target video.
As an implementation manner, a neural network based on computer vision methods is used to determine the content of the image frame to be processed; specifically, a Mobilenet network model may be used. The basic unit of MobileNet is the depthwise separable convolution. A depthwise separable convolution is a factorized convolution that can be decomposed into two smaller operations: a depthwise convolution and a pointwise convolution. A depthwise convolution differs from a standard convolution: in a standard convolution each kernel is applied across all input channels, whereas a depthwise convolution uses a different kernel for each input channel, that is, one kernel corresponds to one input channel. A pointwise convolution is an ordinary convolution that simply uses a 1x1 kernel. A depthwise separable convolution first applies the depthwise convolution to each input channel separately and then uses the pointwise convolution to combine the outputs, which greatly reduces the amount of computation and the number of model parameters. Therefore, the Mobilenet network model can be regarded as a lightweight convolutional neural network.
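By way of illustration of this building block (a minimal sketch in PyTorch, not the network actually used in the embodiments), a depthwise separable convolution can be written as:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution (one kernel per input channel) followed by a
    1x1 pointwise convolution that mixes channels, each with BN and ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn_dw = nn.BatchNorm2d(in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn_pw = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn_dw(self.depthwise(x)))
        x = self.act(self.bn_pw(self.pointwise(x)))
        return x
```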
As an implementation manner, the target scene type corresponding to the image frame to be processed in the target video may be obtained based on Mobilenet_V1; specifically, the model may be fine-tuned (Finetune) on the basis of a MobileNet_V1 that has already been trained on a data set.
Table 1 shows a schematic of the Mobilenet_V1 network structure.
Table 1
[Table 1 is reproduced as an image in the original publication; it lists, for each layer of Mobilenet_V1, the operator type, stride, filter shape, and input size, as explained below.]
This network divides the image frames to be processed into 10 classes, that is, scores of 1-10. In the Mobilenet_V1 network structure, Type identifies the operator type of each layer, where Conv denotes a convolutional layer, Avg Pool an average pooling layer, Softmax a Softmax layer, and FC a fully connected layer. Stride denotes the step size of each operation: s1 means a stride of 1 and s2 a stride of 2. Filter Shape denotes the size of the filter: 3x3x3x32 means 3 colour channels, a 3x3 kernel, and 32 kernels; 3x3x32 dw denotes a depthwise convolution with a 3x3 kernel over 32 channels; Pool 7x7 means average pooling with a 7x7 kernel; and 1024x1000 means the fully connected layer contains 1024x1000 neurons. Classifier denotes the final classification category; in the picture scoring network the value of Classifier is 10, representing output values of 1-10. Input Size denotes the size of the input, and 224x224x3 denotes a 3-channel 224x224 image.
As shown in FIG. 5, FIG. 5 shows the training process of the Mobilenet_V1 network. A picture classification network usually consists of two parts: the first part is composed of multiple convolutional layers and is responsible for extracting the diverse features in the picture, while the second part usually consists of fully connected layers and is responsible for taking the features extracted by the convolutional layers and making the classification judgment. After training on the training data provided by the ImageNet project, the picture feature extraction module of the picture classification network is already fairly complete, so the part that needs improvement and training is the picture category judgment module. The Finetune strategy is to first train the picture category judgment module alone and then perform global fine-tuning of the network together with the picture feature extraction module, where the fully connected layer (FC layer) is trained alone for 4000 steps and the global fine-tuning runs for 1000 steps. The Finetune data set used by the classification network is a pre-acquired data set containing 280 classes of data with 5000 pictures per class, 1.4 million pictures in total; each picture is annotated with a specific label indicating the content to be detected in the image, for example the scene type or a target object.
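As a rough sketch of this two-stage fine-tuning strategy (the attribute names `features`/`classifier` and the `run_steps` training helper are assumptions for illustration, not the implementation used in the embodiments):

```python
import torch

def finetune(model, run_steps, head_steps=4000, global_steps=1000,
             lr_head=1e-3, lr_all=1e-4):
    """Two-stage fine-tuning as described above: the classifier head is trained
    alone first, then the whole network is fine-tuned globally.
    `run_steps(model, optimizer, n)` is a caller-supplied training loop over the
    labelled data set (hypothetical helper, not part of the disclosure)."""
    # Stage 1: freeze the feature-extraction layers, train only the classifier head.
    for p in model.features.parameters():
        p.requires_grad = False
    head_opt = torch.optim.SGD(model.classifier.parameters(), lr=lr_head, momentum=0.9)
    run_steps(model, head_opt, head_steps)

    # Stage 2: unfreeze everything for a short global fine-tuning pass.
    for p in model.features.parameters():
        p.requires_grad = True
    global_opt = torch.optim.SGD(model.parameters(), lr=lr_all, momentum=0.9)
    run_steps(model, global_opt, global_steps)
```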
As shown in FIG. 6, FIG. 6 shows the process of identifying the scene classification of an image to be processed. Specifically, the image frame to be processed is input into the network and, after feature extraction and category judgment, the scene category corresponding to the image frame to be processed is finally output. Specifically, the network can output the scene category label of the image frame to be processed, and the category labels may include: selfie, group photo, architecture, food, blue sky, silhouette, sunset, beach, sports, grass, text, and night scene.
S430: Determine the scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed.
S440: Detect the target objects in the image frame to be processed to obtain the target object categories.
The target object category may be the category corresponding to each specific object in the image, that is, the category of each object within the specific scene.
As an implementation manner, the Mobilenet network model may continue to be used to detect the target objects in the image frame to be processed to obtain the target object categories.
As another implementation manner, the target objects in the image frame to be processed are detected based on a YOLO target detection model to obtain the target object categories.
Deep-learning-based target detection algorithms include algorithms that first generate candidate regions and then perform convolutional neural network (CNN) classification (i.e., the RCNN (Regions with CNN features) family of algorithms), and algorithms that apply the network directly to the input image and output the categories and corresponding locations (i.e., the YOLO algorithm).
In the embodiments of the present application, a trained Yolo_V3 network can be used to detect and recognize the target objects in the image frame to be processed.
As shown in FIG. 7, FIG. 7 shows the Yolo_V3 network structure. The network input size is 416x416 with 3 channels. DBL denotes Darknetconv2d_BN_Leaky, the basic component of yolo_v3, namely convolution + batch normalization (BN) + Leaky ReLU. Residual layer n (resn): n is a number (res1, res2, ..., res8, etc.) indicating how many residual units the residual block contains. Tensor concatenation splices the upsampled output of a later layer with an intermediate layer. The network finally outputs the category and position of each detected object; it outputs 1000 classes of objects together with detection boxes, where a detection box indicates the position of an object within the image it belongs to. In the figure, the residual block is denoted res_block, the residual unit res_unit, tensor concatenation concat, the intermediate backbone darknet, upsampling up_sample, the first fusion layer route_1 (1/8 size), the second fusion layer route_2 (1/16 size), the third fusion layer route_3 (1/32 size), zero padding zero padding, the first output y1, the second output y2, the third output y3, the first output channel yolo_head/conv_6, the second output channel yolo_head/conv_14, the third output channel yolo_head/conv_22, the first feature map feature_map_1, the second feature map feature_map_2, and the third feature map feature_map_3.
S450: Determine the object time segment of the target object category in the target video according to the timestamp of the image frame to be processed.
For determining the object time segment of a target object category in the target video, reference may be made to the foregoing implementation of determining the scene time segment corresponding to the target scene type. Specifically, the timestamp of the image frame is determined, and the timestamp of the image frame can be used as the timestamp of the target object it contains, so that the timestamps corresponding to each category of target object in the target video can be determined, and thus the time segment during which each target object appears in the target video can be determined.
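Purely as an illustrative sketch (the per-frame data layout is an assumption), the per-frame detections can be turned into per-category object time segments in the same spirit as the scene segments above:

```python
def object_time_segments(frame_detections):
    """frame_detections: list of (timestamp, iterable_of_object_labels) per processed frame,
    in playback order. For each label, consecutive frames containing it are merged
    into (start, end) spans; a label may appear in several separate spans."""
    spans = {}
    open_spans = {}
    prev_labels = set()
    for ts, labels in frame_detections:
        labels = set(labels)
        for label in labels:
            if label not in open_spans:
                open_spans[label] = [ts, ts]     # label appears: open a new span
            else:
                open_spans[label][1] = ts        # label still present: extend the span
        for label in prev_labels - labels:       # label disappeared: close its span
            spans.setdefault(label, []).append(tuple(open_spans.pop(label)))
        prev_labels = labels
    for label, span in open_spans.items():       # close spans still open at the end
        spans.setdefault(label, []).append(tuple(span))
    return spans
```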
S460: Generate the video annotation result according to the target scene type, the scene time segment, the target object category, and the object time segment.
Specifically, on the basis of the video annotation result determined above from the target scene type and the scene time segment, content corresponding to the target object category is added according to the target object category and the object time segment.
Specifically, besides describing that the scene type corresponding to a scene time segment in the target video is the target scene type, so that the scene type within a certain time period of the target video can be known from the video annotation result and the start time and end time of that type of scene are made clear, the video annotation result can also specify the start time and end time of each target object category in the target video.
As an implementation manner, the video annotation result may be content arranged based on the time axis of the target video. Specifically, referring to FIG. 8, S460 may include S461 to S465.
S461: Acquire the playback time of the target video.
S462: Determine a time axis according to the playback time.
The time axis may be the playback time axis corresponding to the above-mentioned video; the specific implementation of acquiring the playback time of the target video and determining the time axis according to the playback time can refer to the foregoing embodiments and will not be repeated here.
S463: Determine the scene interval of the target scene type on the time axis according to the scene time segment corresponding to the target scene type.
The scene time segment includes the start time and the end time of the target scene type on the time axis; therefore, the region between the start time and the end time of the target scene type on the time axis serves as the scene interval corresponding to the target scene type.
S464: Determine the target object interval of the target object category on the time axis according to the object time segment corresponding to the target object category.
Similarly, the object time segment includes the start time and the end time of the target object category on the time axis; therefore, the region between the start time and the end time of the target object category on the time axis serves as the target object interval corresponding to the target object category.
S465: Generate the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category.
As an implementation manner, the scene interval and the target object interval may be marked on the time axis; first content and second content may be generated according to the target scene type and the target object category, the first content being annotated at the scene interval on the time axis and the second content at the target object interval, so that the positions of the scene interval and the target object interval are clear on the time axis and the scene or target object category corresponding to each interval can be determined from the first content and the second content.
Specifically, one implementation of generating the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category is to obtain the scene annotation content corresponding to the target scene type and the object annotation content corresponding to the target object category, and then generate the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category. The video annotation result includes the time axis, on which the scene interval and the target object interval are marked; the scene annotation content is displayed at the position of the scene interval, and the object annotation content is displayed at the position of the target object interval.
The scene annotation content is content used to describe the scene category, and may be text, a picture, or the like; for example, if the scene category is indoor, the scene annotation content is the text "indoor". Similarly, the object annotation content is content used to describe the object category, and may be text, a picture, or the like; for example, if the object category is chair, the object annotation content is the text "chair".
In addition, it should be noted that the target object category may be the category of the object itself or the category of a specific detail of the object. Specifically, the object category includes a main category and subcategories under the main category, where the main category describes the overall category of the object, for example, person. A subcategory may be the category of a specific detail of the target object; for example, where the main category of the target object is person, the subcategory may be an expression category or an emotion category.
As an implementation manner, the video annotation result may be a display content that includes a time axis on which scene intervals and target object intervals are marked, with the scene annotation content displayed at the position of each scene interval and the object annotation content displayed at the position of each target object interval. As shown in FIG. 9, the display content includes a time axis image, a scene image for each scene interval, and a target object image for each target object interval; the ratio of the length of each scene image or target object image to the length of the time axis image is consistent with the ratio of the time length of the corresponding scene time segment or object time segment to the playback duration of the target video, so that the time intervals in which each scene and target object exist are reflected on the time axis of the target video. In addition, the scene annotation content or object annotation content is displayed on the scene image of each scene interval and the target object image of each target object interval.
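A minimal sketch of assembling such an annotation result as a serialisable structure is shown below; the field names are assumptions chosen for illustration rather than a prescribed format:

```python
import json

def build_annotation(duration, scene_segments, object_segments):
    """duration: total playback time in seconds.
    scene_segments / object_segments: {label: [(start, end), ...]}.
    Returns a JSON string combining the timeline with labelled intervals."""
    result = {"timeline": {"start": 0.0, "end": duration}, "scenes": [], "objects": []}
    for label, spans in scene_segments.items():
        for start, end in spans:
            result["scenes"].append({"label": label, "start": start, "end": end})
    for label, spans in object_segments.items():
        for start, end in spans:
            result["objects"].append({"label": label, "start": start, "end": end})
    return json.dumps(result, ensure_ascii=False)
```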
Please refer to FIG. 10, which shows a video processing method provided by yet another embodiment of the present application. Specifically, in this method the video processing is performed when the electronic device is idle. Specifically, referring to FIG. 10, the method includes S1001 to S1005.
S1001: Acquire the working state of the electronic device.
The working state of the electronic device includes a busy state and an idle state. The busy state means that the current power consumption of the electronic device is relatively high, and processing the target video to obtain the video annotation result at that time may cause the system to stall; the idle state is the opposite of the busy state, that is, the current power consumption is low, and processing the target video to obtain the video annotation result is unlikely to cause the system to stall.
As an implementation manner, the working state may be determined from at least one of the CPU usage rate, the charging state, and the current time.
In some embodiments, the CPU usage rate is used to determine the working state of the electronic device. Specifically, it is determined whether the CPU usage rate of the electronic device is lower than a usage rate threshold; if it is lower, the working state of the electronic device is determined to be the idle state; otherwise, the working state of the electronic device is determined to be the busy state.
Specifically, the CPU usage rate can be obtained by viewing the task manager of the electronic device; for example, on the Android system, the CPU usage rate can be obtained through the adb shell top command. The usage rate threshold may be a usage rate set by the user, for example 60%. Assuming the current CPU usage rate is 40%, then 40% is less than 60% and the CPU usage rate is determined to be less than the usage rate threshold; if the current CPU usage rate is 70%, then 70% is greater than 60% and the CPU usage rate is determined to be greater than the usage rate threshold.
If the CPU usage rate is less than the usage rate threshold, CPU resources are currently relatively plentiful, it can be determined that the working state of the electronic device is the idle state, and S1002 can be executed; if the CPU usage rate is greater than or equal to the usage rate threshold, CPU resources are currently relatively scarce, and it can be determined that the working state of the electronic device is the busy state.
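As an illustrative sketch only (using the cross-platform psutil package rather than the task-manager or adb shell top readings mentioned above; 60% is the example threshold given here), the idle/busy decision based on CPU usage can be expressed as:

```python
import psutil

USAGE_THRESHOLD = 60.0  # per cent, the example threshold used above

def cpu_state(threshold=USAGE_THRESHOLD):
    """Sample overall CPU utilisation over one second and classify the device state."""
    usage = psutil.cpu_percent(interval=1.0)
    return "idle" if usage < threshold else "busy"
```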
In addition, since the CPU usage rate is related to the applications currently running on the electronic device, when the electronic device is in the busy state it may be determined whether any of the currently opened applications matches a preset application, where a preset application is an application that the system is allowed to close without the user's authorization. If such an application exists, the matching application is closed, the current CPU usage rate is obtained again as the CPU usage rate, and the operation of determining whether the usage rate of the central processing unit is below the usage rate threshold is performed again.

Specifically, a list of preset applications is stored in advance in the electronic device. The list contains the identifiers of a plurality of designated applications, where a designated application is one that the user has authorized the system to close without further authorization; for example, the user may manually enter the identifier of the designated application.

Therefore, when the CPU usage rate is too high, the processes of applications that the system is allowed to close without the user's authorization are killed, releasing a certain amount of CPU resources and reducing the CPU usage rate.
In some embodiments, the working state of the electronic device is determined from the charging state and the current time. Specifically, if the electronic device is charging and the current time is within a preset time range, the working state of the electronic device is determined to be the idle state; otherwise, it is determined to be the busy state. The preset time range may be a predefined time interval during which the user is unlikely to use the electronic device. For example, the preset time range may be from 1 a.m. to 6 a.m.; during this period the user is usually asleep and the electronic device is charging, so the system resources of the electronic device are largely unoccupied and the device is in the idle state.
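On Android, one hedged way of realizing this check could combine the sticky battery-status broadcast with the current time, roughly as sketched below; the 1 a.m. to 6 a.m. window and the `isIdleByChargingAndTime` helper are assumptions for illustration only.

```kotlin
import android.content.Context
import android.content.Intent
import android.content.IntentFilter
import android.os.BatteryManager
import java.time.LocalTime

// Illustrative sketch: idle if the device is charging and the current time falls
// within a preset low-usage window (1 a.m. to 6 a.m. in this example).
fun isIdleByChargingAndTime(context: Context): Boolean {
    // ACTION_BATTERY_CHANGED is a sticky broadcast, so registering a null receiver
    // simply returns the last battery status intent.
    val battery: Intent? = context.registerReceiver(
        null, IntentFilter(Intent.ACTION_BATTERY_CHANGED)
    )
    val status = battery?.getIntExtra(BatteryManager.EXTRA_STATUS, -1) ?: -1
    val charging = status == BatteryManager.BATTERY_STATUS_CHARGING ||
        status == BatteryManager.BATTERY_STATUS_FULL

    val now = LocalTime.now()
    val inQuietHours = now.isAfter(LocalTime.of(1, 0)) && now.isBefore(LocalTime.of(6, 0))

    return charging && inQuietHours
}
```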
Further, on the basis that the electronic device is charging and the current time is within the preset time range, detection of the holding state of the electronic device may be added. That is, if the electronic device is charging, the current time is within the preset time range, and the holding state of the electronic device is the unheld state, the working state of the electronic device is determined to be the idle state; otherwise, it is determined to be the busy state.

Specifically, when a user holds an electronic device, the gripped parts are generally concentrated on the bottom frame, the top frame, and the positions on the back near the bottom or the top. Therefore, detection components can be arranged at the positions of the top frame and the bottom frame to detect whether the user is holding the electronic device, that is, whether the electronic device is in the held state.

In one implementation, pressure sensors may be arranged at the positions of the top frame and the bottom frame. When the user holds the electronic device, the pressure sensor detects a pressure value, and the electronic device is determined to be in the held state. Temperature sensors may also be arranged at the top frame and the bottom frame. When the user is not holding the electronic device, the temperature detected by the temperature sensor is a first temperature value; when the user is holding it, the detected temperature is a second temperature value, the first temperature value being lower than the second temperature value and the second temperature value being higher than a preset temperature value. For example, the preset temperature value is 37, approximately human body temperature; if the detected temperature is higher than the preset temperature value, the electronic device is determined to be in the held state.

In another implementation, it may also be detected whether the touch screen of the electronic device detects a touch operation of the user; if so, the holding state is determined to be the held state, otherwise it is determined to be the unheld state. In this implementation, when the screen of the electronic device is off, the touch screen is not turned off and remains in a state capable of detecting touch operations.
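A rough Kotlin sketch of the touch-based variant is given below; it simply records the time of the last touch event reported on a view and treats the device as held if a touch was seen recently. The `GripDetector` class and the five-second window are assumptions introduced only for illustration.

```kotlin
import android.annotation.SuppressLint
import android.os.SystemClock
import android.view.View

// Illustrative sketch: treat the device as "held" if the touch screen has reported
// a touch event within the last few seconds.
class GripDetector(private val heldWindowMs: Long = 5_000) {
    @Volatile private var lastTouchMs: Long = 0

    @SuppressLint("ClickableViewAccessibility")
    fun attachTo(view: View) {
        view.setOnTouchListener { _, _ ->
            lastTouchMs = SystemClock.elapsedRealtime()
            false // do not consume the event
        }
    }

    fun isHeld(): Boolean =
        SystemClock.elapsedRealtime() - lastTouchMs < heldWindowMs
}
```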
In a further implementation, the CPU usage rate, the charging state and the current time may be combined to determine the state of the electronic device. That is, if the CPU usage rate is below the usage rate threshold, the electronic device is charging, and the current time is within the preset time range, the working state of the electronic device is determined to be the idle state; otherwise, it is determined to be the busy state.
S1002: If the working state is the idle state, acquire a target video to be processed.

S1003: Acquire the target scene type corresponding to an image frame to be processed in the target video.

S1004: Determine, according to the timestamp of the image frame to be processed, the scene time segment of the target scene type within the target video.

S1005: Generate a video annotation result according to the target scene type and the scene time segment corresponding to the target scene type.
Therefore, by performing the acquisition of the target video to be processed and the subsequent operations for obtaining the video annotation result only when the electronic device is in the idle state, the method avoids causing the electronic device to stall and affect the user's experience while the method is running on the electronic device.
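The overall idle-gated flow of S1001 to S1005 might be organized roughly as follows; the function names (`isDeviceIdle`, `loadPendingVideoFrames`, `classifyScene`) are placeholders standing in for the idle check, frame extraction and scene recognition steps described earlier, not APIs of any particular library.

```kotlin
// Illustrative sketch of the idle-gated annotation pipeline (S1001-S1005).
// Every function referenced here is a placeholder for the corresponding step
// described in the text; none of them is a real library API.
data class Frame(val timestampMs: Long /* decoded pixels omitted for brevity */)
data class SceneSegment(val sceneType: String, val startMs: Long, val endMs: Long)

fun annotateWhenIdle(
    isDeviceIdle: () -> Boolean,                    // S1001
    loadPendingVideoFrames: () -> List<Frame>,      // S1002
    classifyScene: (Frame) -> String                // S1003: e.g. a MobileNet-style classifier
): List<SceneSegment> {
    if (!isDeviceIdle()) return emptyList()         // postpone processing while busy

    val frames = loadPendingVideoFrames()
    val segments = mutableListOf<SceneSegment>()
    var current: SceneSegment? = null

    for (frame in frames) {                         // S1004: merge consecutive frames
        val scene = classifyScene(frame)            // sharing the same scene type
        val c = current
        current = when {
            c == null -> SceneSegment(scene, frame.timestampMs, frame.timestampMs)
            c.sceneType == scene -> c.copy(endMs = frame.timestampMs)
            else -> { segments += c; SceneSegment(scene, frame.timestampMs, frame.timestampMs) }
        }
    }
    current?.let { segments += it }
    return segments                                 // S1005: used to build the annotation result
}
```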
In addition, the Mobilenet network model and the YOLO target detection model used in the embodiments of the present application have a simple structure and low algorithmic complexity, and are therefore well suited to running on electronic devices.
Referring to FIG. 11, which shows a structural block diagram of a video processing apparatus 1100 provided by an embodiment of the present application, the apparatus may include a video acquisition unit 1101, a scene acquisition unit 1102, a determining unit 1103 and a processing unit 1104.

The video acquisition unit 1101 is configured to acquire a target video to be processed.

The scene acquisition unit 1102 is configured to acquire the target scene type corresponding to an image frame to be processed in the target video.

The determining unit 1103 is configured to determine, according to the timestamp of the image frame to be processed, the scene time segment of the target scene type within the target video, where in the target video the scene types corresponding to the image frames within the scene time segment are all the target scene type.

The processing unit 1104 is configured to generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
Those skilled in the art will clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the apparatus and modules described above, which are not repeated here.
Referring to FIG. 12, which shows a structural block diagram of a video processing apparatus 1200 provided by an embodiment of the present application, the apparatus may include a video acquisition unit 1201, a scene acquisition unit 1202, a first determining unit 1203, a second determining unit 1204 and a processing unit 1205.

The video acquisition unit 1201 is configured to acquire a target video to be processed.

Specifically, the video acquisition unit 1201 is further configured to acquire the working state of the electronic device and, if the working state is the idle state, acquire the target video to be processed.

The scene acquisition unit 1202 is configured to acquire the target scene type corresponding to an image frame to be processed in the target video.

The scene acquisition unit 1202 is further configured to acquire, based on a Mobilenet network model, the target scene type corresponding to the image frame to be processed in the target video.

The first determining unit 1203 is configured to determine, according to the timestamp of the image frame to be processed, the scene time segment of the target scene type within the target video, where in the target video the scene types corresponding to the image frames within the scene time segment are all the target scene type.

The second determining unit 1204 is configured to detect the target object in the image frame to be processed to obtain a target object category; determine, according to the timestamp of the image frame to be processed, the object time segment of the target object category within the target video; and generate a video annotation result according to the target scene type, the scene time segment, the target object category and the object time segment.

Further, the second determining unit 1204 is also configured to acquire the playing time of the target video; determine a time axis according to the playing time; determine the scene interval of the target scene type on the time axis according to the scene time segment corresponding to the target scene type; determine the target object interval of the target object category on the time axis according to the object time segment corresponding to the target object category; and generate the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type and the target object category.

Further, the second determining unit 1204 is also configured to acquire the scene annotation content corresponding to the target scene type and the object annotation content corresponding to the target object category, and to generate the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type and the target object category, where the video annotation result includes a time axis on which scene intervals and target object intervals are marked, the scene annotation content is displayed at the position of the scene interval, and the object annotation content is displayed at the position of the target object interval.

Further, the second determining unit 1204 is also configured to detect, based on a YOLO target detection model, the target object in the image frame to be processed to obtain the target object category.

The processing unit 1205 is configured to generate the video annotation result according to the target scene type, the scene time segment, the target object category and the object time segment.
Those skilled in the art will clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the apparatus and modules described above, which are not repeated here.
In the several embodiments provided in this application, the coupling between modules may be electrical, mechanical or in other forms.

In addition, the functional modules in the various embodiments of the present application may be integrated into one processing module, or each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
Referring to FIG. 13, which shows a structural block diagram of an electronic device provided by an embodiment of the present application, the electronic device 100 may be an electronic device capable of running application programs, such as a smartphone, a tablet computer or an e-book reader. The electronic device 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs, where the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, and the one or more programs are configured to perform the methods described in the foregoing method embodiments.

The processor 110 may include one or more processing cores. The processor 110 uses various interfaces and lines to connect the various parts of the entire electronic device 100, and performs the various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 120 and by calling data stored in the memory 120. Optionally, the processor 110 may be implemented in at least one hardware form of digital signal processing (DSP), field-programmable gate array (FPGA) and programmable logic array (PLA). The processor 110 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem and the like. The CPU mainly handles the operating system, the user interface and application programs; the GPU is responsible for rendering and drawing display content; the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.

The memory 120 may include random access memory (RAM) or read-only memory (ROM). The memory 120 may be used to store instructions, programs, code, code sets or instruction sets. The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playback function or an image playback function), instructions for implementing the following method embodiments, and the like. The data storage area may also store data created by the electronic device 100 during use (such as a phone book, audio and video data, and chat records).
Referring to FIG. 14, which shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application, the computer-readable medium 1400 stores program code that can be invoked by a processor to perform the methods described in the foregoing method embodiments.

The computer-readable storage medium 1400 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk or a ROM. Optionally, the computer-readable storage medium 1400 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 1400 has storage space for program code 1410 for performing any of the method steps of the above methods. The program code can be read from or written into one or more computer program products. The program code 1410 may, for example, be compressed in an appropriate form.
In summary, the video processing method, apparatus, electronic device and computer-readable medium provided by this application acquire a target video to be processed; acquire the target scene type corresponding to an image frame to be processed in the target video; determine, according to the timestamp of the image frame to be processed, the scene time segment of the target scene type within the target video, where in the target video the scene types corresponding to the image frames within the scene time segment are all the target scene type; and generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type. The scene type of the image frames in the video can thus be recognized, and the annotation result is obtained by combining the scene type with the time at which the scene type appears in the video, so that the annotation result reflects the correspondence between the time periods of the video and the scenes, making the annotation result more intuitive and better matched to user needs.

Further, deep-learning-based networks for picture scene recognition and picture object detection and recognition are used to completely record the scenes present at different points in time in the video, as well as the objects that appear in the video scenes at those points in time.

This is beneficial for: 1) recording the course of events in the video; 2) subsequent analysis of the video content; 3) broadening the dimensions along which video content can be searched; and 4) editing video clips of specific objects and the like.

The deep-learning-based Mobilenet_V1 network is used for video content scene recognition, and the Yolo_V3 network is used for video content detection and recognition, supporting the detection and recognition of 12 scene types and 1000 object classes. The selected networks are relatively lightweight, which keeps the models small while greatly reducing the amount of computation, so they can run offline directly on the mobile phone without uploading the user's captured data to the cloud, improving the user experience while protecting user privacy.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements for some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A video processing method, characterized in that it comprises:
    acquiring a target video to be processed;
    acquiring a target scene type corresponding to an image frame to be processed in the target video;
    determining, according to a timestamp of the image frame to be processed, a scene time segment of the target scene type within the target video, wherein in the target video the scene types corresponding to the image frames within the scene time segment are all the target scene type; and
    generating a video annotation result according to the target scene type and the scene time segment corresponding to the target scene type.
  2. The method according to claim 1, characterized in that generating the video annotation result according to the target scene type and the scene time segment corresponding to the target scene type comprises:
    detecting a target object in the image frame to be processed to obtain a target object category;
    determining, according to the timestamp of the image frame to be processed, an object time segment of the target object category within the target video; and
    generating the video annotation result according to the target scene type, the scene time segment, the target object category and the object time segment.
  3. The method according to claim 2, characterized in that generating the video annotation result according to the target scene type, the scene time segment, the target object category and the object time segment comprises:
    acquiring a playing time of the target video;
    determining a time axis according to the playing time;
    determining, according to the scene time segment corresponding to the target scene type, a scene interval of the target scene type on the time axis;
    determining, according to the object time segment corresponding to the target object category, a target object interval of the target object category on the time axis; and
    generating the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type and the target object category.
  4. The method according to claim 3, characterized in that generating the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type and the target object category comprises:
    acquiring scene annotation content corresponding to the target scene type and object annotation content corresponding to the target object category; and
    generating the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type and the target object category, wherein the video annotation result comprises a time axis on which the scene interval and the target object interval are marked, the scene annotation content is displayed at the position of the scene interval, and the object annotation content is displayed at the position of the target object interval.
  5. The method according to claim 1, characterized in that acquiring the target scene type corresponding to the image frame to be processed in the target video comprises:
    acquiring, based on a Mobilenet network model, the target scene type corresponding to the image frame to be processed in the target video.
  6. The method according to claim 2, characterized in that detecting the target object in the image frame to be processed to obtain the target object category comprises:
    detecting, based on a YOLO target detection model, the target object in the image frame to be processed to obtain the target object category.
  7. The method according to any one of claims 2 to 6, characterized in that the target object category comprises a main category of the target object and a subcategory under the main category.
  8. The method according to any one of claims 1 to 7, characterized in that the method is applied to an electronic device, and acquiring the target video to be processed comprises:
    acquiring a working state of the electronic device; and
    if the working state is an idle state, acquiring the target video to be processed.
  9. The method according to claim 8, characterized in that acquiring the working state of the electronic device comprises:
    acquiring a CPU usage rate of the electronic device;
    determining whether the CPU usage rate of the electronic device is below a usage rate threshold;
    if it is below the usage rate threshold, determining that the working state of the electronic device is the idle state; and
    if it is not below the usage rate threshold, determining that the working state of the electronic device is a busy state.
  10. The method according to claim 9, characterized in that, after determining that the working state of the electronic device is the busy state if the CPU usage rate is not below the usage rate threshold, the method further comprises:
    determining whether an application matching a preset application exists among the currently opened applications; and
    if an application matching the preset application exists, closing the application matching the preset application, acquiring the current CPU usage rate again as a new CPU usage rate, and returning to the operation of determining whether the CPU usage rate of the electronic device is below the usage rate threshold.
  11. The method according to claim 10, characterized in that the preset application is an application that the system is allowed to close without the user's authorization.
  12. The method according to claim 8, characterized in that acquiring the working state of the electronic device comprises:
    determining whether the electronic device is in a charging state and the current time is within a preset time range;
    if so, determining that the working state of the electronic device is the idle state; and
    if not, determining that the working state of the electronic device is a busy state.
  13. The method according to claim 8, characterized in that acquiring the working state of the electronic device comprises:
    determining whether the electronic device is in a charging state, the current time is within a preset time range, and a holding state of the electronic device is an unheld state;
    if so, determining that the working state of the electronic device is the idle state; and
    if not, determining that the working state of the electronic device is a busy state.
  14. The method according to claim 13, characterized in that the holding state of the electronic device is detected in the following manner:
    determining whether a touch screen of the electronic device detects a touch operation of the user;
    if a touch operation is detected, determining that the holding state is a held state; and
    if no touch operation is detected, determining that the holding state is the unheld state.
  15. The method according to any one of claims 1 to 14, characterized in that, after generating the video annotation result according to the scene type and the scene time segment corresponding to the scene type, the method further comprises:
    displaying the video annotation result in a designated interface of the electronic device, wherein the designated interface is a playback interface of the target video.
  16. The method according to any one of claims 1 to 15, characterized in that the image frame to be processed is an image frame within the time period between the end time of the opening-credits portion of the target video and the start time of the end-credits portion of the target video.
  17. The method according to any one of claims 1 to 15, characterized in that the image frame to be processed is a key frame of the target video.
  18. A video processing apparatus, characterized in that it comprises:
    a video acquisition unit, configured to acquire a target video to be processed;
    a scene acquisition unit, configured to acquire a target scene type corresponding to an image frame to be processed in the target video;
    a determining unit, configured to determine, according to a timestamp of the image frame to be processed, a scene time segment of the target scene type within the target video, wherein in the target video the scene types corresponding to the image frames within the scene time segment are all the target scene type; and
    a processing unit, configured to generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
  19. An electronic device, characterized in that it comprises:
    one or more processors;
    a memory; and
    one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to perform the method according to any one of claims 1 to 17.
  20. A computer-readable medium, characterized in that the computer-readable medium stores program code executable by a processor, and when the program code is executed by the processor, the processor performs the method according to any one of claims 1 to 17.
PCT/CN2021/085692 2020-05-18 2021-04-06 Video processing method and apparatus, electronic device and computer readable medium WO2021232978A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010420727.0A CN111581433B (en) 2020-05-18 2020-05-18 Video processing method, device, electronic equipment and computer readable medium
CN202010420727.0 2020-05-18

Publications (1)

Publication Number Publication Date
WO2021232978A1 true WO2021232978A1 (en) 2021-11-25

Family

ID=72115519

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/085692 WO2021232978A1 (en) 2020-05-18 2021-04-06 Video processing method and apparatus, electronic device and computer readable medium

Country Status (2)

Country Link
CN (1) CN111581433B (en)
WO (1) WO2021232978A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581433B (en) * 2020-05-18 2023-10-10 Oppo广东移动通信有限公司 Video processing method, device, electronic equipment and computer readable medium
CN112040277B (en) * 2020-09-11 2022-03-04 腾讯科技(深圳)有限公司 Video-based data processing method and device, computer and readable storage medium
CN112560583A (en) * 2020-11-26 2021-03-26 复旦大学附属中山医院 Data set generation method and device
CN112672061B (en) * 2020-12-30 2023-01-24 维沃移动通信(杭州)有限公司 Video shooting method and device, electronic equipment and medium
CN112822554A (en) * 2020-12-31 2021-05-18 联想(北京)有限公司 Multimedia processing method and device and electronic equipment
CN113034384A (en) * 2021-02-26 2021-06-25 Oppo广东移动通信有限公司 Video processing method, video processing device, electronic equipment and storage medium
CN113115054B (en) * 2021-03-31 2022-05-06 杭州海康威视数字技术股份有限公司 Video stream encoding method, device, system, electronic device and storage medium
CN113641852A (en) * 2021-07-13 2021-11-12 彩虹无人机科技有限公司 Unmanned aerial vehicle photoelectric video target retrieval method, electronic device and medium
CN113610006B (en) * 2021-08-09 2023-09-08 中电科大数据研究院有限公司 Overtime labor discrimination method based on target detection model
CN113657307A (en) * 2021-08-20 2021-11-16 北京市商汤科技开发有限公司 Data labeling method and device, computer equipment and storage medium
CN114697761B (en) * 2022-04-07 2024-02-13 脸萌有限公司 Processing method, processing device, terminal equipment and medium
CN115905622A (en) * 2022-11-15 2023-04-04 北京字跳网络技术有限公司 Video annotation method, device, equipment, medium and product

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330392A (en) * 2017-06-26 2017-11-07 司马大大(北京)智能系统有限公司 Video scene annotation equipment and method
CN108830208A (en) * 2018-06-08 2018-11-16 Oppo广东移动通信有限公司 Method for processing video frequency and device, electronic equipment, computer readable storage medium
CN109587578A (en) * 2018-12-21 2019-04-05 麒麟合盛网络技术股份有限公司 The processing method and processing device of video clip
CN110119711B (en) * 2019-05-14 2021-06-11 北京奇艺世纪科技有限公司 Method and device for acquiring character segments of video data and electronic equipment
CN110213610B (en) * 2019-06-13 2021-05-28 北京奇艺世纪科技有限公司 Live broadcast scene recognition method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7751683B1 (en) * 2000-11-10 2010-07-06 International Business Machines Corporation Scene change marking for thumbnail extraction
CN106126335A (en) * 2016-06-15 2016-11-16 青岛海信电器股份有限公司 The Media Survey method of terminal unit and terminal unit
CN108769801A (en) * 2018-05-28 2018-11-06 广州虎牙信息科技有限公司 Synthetic method, device, equipment and the storage medium of short-sighted frequency
CN110209879A (en) * 2018-08-15 2019-09-06 腾讯科技(深圳)有限公司 A kind of video broadcasting method, device, equipment and storage medium
CN109168062A (en) * 2018-08-28 2019-01-08 北京达佳互联信息技术有限公司 Methods of exhibiting, device, terminal device and the storage medium of video playing
CN110147722A (en) * 2019-04-11 2019-08-20 平安科技(深圳)有限公司 A kind of method for processing video frequency, video process apparatus and terminal device
CN111581433A (en) * 2020-05-18 2020-08-25 Oppo广东移动通信有限公司 Video processing method and device, electronic equipment and computer readable medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114390368A (en) * 2021-12-29 2022-04-22 腾讯科技(深圳)有限公司 Live video data processing method and device, equipment and readable medium
CN114666667A (en) * 2022-03-03 2022-06-24 海宁奕斯伟集成电路设计有限公司 Video key point generation method and device, electronic equipment and storage medium
CN114782899A (en) * 2022-06-15 2022-07-22 浙江大华技术股份有限公司 Image processing method and device and electronic equipment
CN115734045A (en) * 2022-11-15 2023-03-03 深圳市东明炬创电子股份有限公司 Video playing method, device, equipment and storage medium
CN115695944A (en) * 2022-12-30 2023-02-03 北京远特科技股份有限公司 Vehicle-mounted image processing method and device, electronic equipment and medium
CN115695944B (en) * 2022-12-30 2023-03-28 北京远特科技股份有限公司 Vehicle-mounted image processing method and device, electronic equipment and medium
CN115858854A (en) * 2023-02-28 2023-03-28 北京奇树有鱼文化传媒有限公司 Video data sorting method and device, electronic equipment and storage medium
CN115858854B (en) * 2023-02-28 2023-05-26 北京奇树有鱼文化传媒有限公司 Video data sorting method and device, electronic equipment and storage medium
CN116761019A (en) * 2023-08-24 2023-09-15 瀚博半导体(上海)有限公司 Video processing method, system, computer device and computer readable storage medium

Also Published As

Publication number Publication date
CN111581433A (en) 2020-08-25
CN111581433B (en) 2023-10-10

Similar Documents

Publication Publication Date Title
WO2021232978A1 (en) Video processing method and apparatus, electronic device and computer readable medium
Vasudevan et al. Query-adaptive video summarization via quality-aware relevance estimation
US10742900B2 (en) Method and system for providing camera effect
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
CN113010703B (en) Information recommendation method and device, electronic equipment and storage medium
US9886762B2 (en) Method for retrieving image and electronic device thereof
CN106028134A (en) Detect sports video highlights for mobile computing devices
US11100368B2 (en) Accelerated training of an image classifier
CN111209897B (en) Video processing method, device and storage medium
CN113766296B (en) Live broadcast picture display method and device
CN113395578B (en) Method, device, equipment and storage medium for extracting video theme text
Thomas et al. Perceptual video summarization—A new framework for video summarization
WO2020259449A1 (en) Method and device for generating short video
CN111581423B (en) Target retrieval method and device
US10958842B2 (en) Method of displaying images in a multi-dimensional mode based on personalized topics
CN102150163A (en) Interactive image selection method
CN110619284B (en) Video scene division method, device, equipment and medium
CN111126347B (en) Human eye state identification method, device, terminal and readable storage medium
US20210126806A1 (en) Method for recognizing and utilizing user face based on profile picture in chatroom created using group album
WO2019196795A1 (en) Video editing method, device and electronic device
CN113627402B (en) Image identification method and related device
Zhang et al. A comprehensive survey on computational aesthetic evaluation of visual art images: Metrics and challenges
JP7491867B2 (en) VIDEO CLIP EXTRACTION METHOD, VIDEO CLIP EXTRACTION DEVICE AND STORAGE MEDIUM
WO2022247112A1 (en) Task processing method and apparatus, device, storage medium, computer program, and program product
Fei et al. Creating memorable video summaries that satisfy the user’s intention for taking the videos

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21809864

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21809864

Country of ref document: EP

Kind code of ref document: A1