WO2021232978A1 - Video processing method and apparatus, electronic device and computer readable medium - Google Patents

Video processing method and apparatus, electronic device and computer readable medium

Info

Publication number
WO2021232978A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
video
scene
electronic device
time
Prior art date
Application number
PCT/CN2021/085692
Other languages
French (fr)
Chinese (zh)
Inventor
钟瑞
Original Assignee
Oppo Guangdong Mobile Telecommunications Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo Guangdong Mobile Telecommunications Co., Ltd.
Publication of WO2021232978A1 publication Critical patent/WO2021232978A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Definitions

  • This application relates to the field of video technology, and more specifically, to a video processing method, device, electronic equipment, and computer-readable medium.
  • Video annotation is a video processing technique in which prominent marks are placed directly on the video during preview or playback, making the video more targeted; it is widely used in many fields.
  • For example, video annotation is the analysis method most commonly used by public security investigators when studying and judging video cases, allowing officers to locate and focus on suspected targets and lock onto important video clues.
  • As another example, video annotation can also be used for image analysis in the medical field, where physicians use it to highlight body parts with lesions or abnormalities.
  • Video annotation can also serve as a way of storing information about a video and as the description content corresponding to the video, through which a user can quickly obtain part of the video's content.
  • This application proposes a video processing method, device, electronic device, and computer-readable medium to address the above-mentioned drawbacks.
  • An embodiment of the present application provides a video processing method, including: acquiring a target video to be processed; acquiring a target scene type corresponding to an image frame to be processed in the target video; determining, according to the timestamp of the image frame to be processed, a scene time segment of the target scene type in the target video, wherein, in the target video, the scene types corresponding to the image frames within the scene time segment are all the target scene type; and generating a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
  • an embodiment of the present application also provides a video processing device, including: a video acquisition unit, a scene acquisition unit, a determination unit, and a processing unit.
  • the video acquisition unit is used to acquire the target video to be processed.
  • the scene acquisition unit is configured to acquire the target scene type corresponding to the image frame to be processed in the target video.
  • the determining unit is configured to determine the scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed, wherein, in the target video, the scene types corresponding to the image frames within the scene time segment are all the target scene type.
  • the processing unit is configured to generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
  • An embodiment of the present application also provides an electronic device, including: one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to execute the foregoing method.
  • The embodiments of the present application also provide a computer-readable medium storing program code executable by a processor; when the program code is executed by the processor, the processor performs the above method.
  • FIG. 1 shows a method flowchart of a video processing method provided by an embodiment of the present application
  • Figure 2 shows a schematic diagram of a video download interface provided by an embodiment of the present application
  • Figure 3 shows a schematic diagram of a video playback interface provided by an embodiment of the present application
  • FIG. 4 shows a method flowchart of a video processing method provided by another embodiment of the present application.
  • FIG. 5 shows the training process of the Mobilenet_V1 network provided by the embodiment of the present application
  • FIG. 6 shows the process of identifying the scene classification of the image to be processed provided by the embodiment of the present application
  • FIG. 7 shows a schematic diagram of the Yolo_V3 network structure provided by an embodiment of the present application.
  • FIG. 8 shows a flowchart of S460 in FIG. 4
  • FIG. 9 shows a schematic diagram of a video annotation result provided by an embodiment of the present application.
  • FIG. 10 shows a method flowchart of a video processing method provided by another embodiment of the present application.
  • FIG. 11 shows a block diagram of a video processing device provided by an embodiment of the present application.
  • FIG. 12 shows a block diagram of a video processing device provided by another embodiment of the present application.
  • FIG. 13 shows a schematic diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 14 shows a storage unit, provided by an embodiment of the present application, for storing or carrying program code that implements the video processing method of the embodiments of the present application.
  • Video annotation is a video processing technique in which prominent marks are placed directly on the video during preview or playback, making the video more targeted; it is widely used in many fields.
  • For example, video annotation is the analysis method most commonly used by public security investigators when studying and judging video cases, allowing officers to locate and focus on suspected targets and lock onto important video clues.
  • As another example, video annotation can also be used for image analysis in the medical field, where physicians use it to highlight body parts with lesions or abnormalities.
  • Video annotation can also serve as a way of storing information about a video and as the description content corresponding to the video, through which a user can quickly learn part of the video's content.
  • Video tagging methods mainly include manual tagging and machine-learning-based video tagging.
  • A manual video tagging method may first construct, through a web page, a container interface for holding videos and load the video into the video section; the operator then manually drags the slider or clicks the video drag bar according to the video content to change the playback time point or confirm the time point of the played content, and marks the video content by clicking the video's knowledge point panel.
  • Machine learning is a type of artificial intelligence.
  • Artificial intelligence is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance.
  • Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
  • A machine-learning-based video tagging method may tag videos based on feature extraction. Specifically, the acquired video stream is first decoded and tagging commands corresponding to all frame images are received; the storage features corresponding to all frame images are then extracted according to the tagging commands; and finally the storage feature and the receiving time corresponding to each tagging command are saved in the tagging record.
  • Moreover, the accuracy of the resulting album will also decrease and, most importantly, the privacy of the album user will be infringed.
  • The disadvantage of the feature-extraction-based video tagging method is that it only records whether the video contains features of a category and does not organize those features over the course of the video, which makes the tagging result difficult to interpret as a description of the video's content.
  • the embodiments of the present application provide a video processing method, which is applied to an electronic device.
  • The execution subject of the method may be the electronic device, so that the video processing method is executed locally by the electronic device, avoiding sending the video to a cloud server, which could cause data leakage and endanger the user's privacy.
  • the method includes: S101 to S104.
  • S101 Acquire a target video to be processed.
  • the target video to be processed may be at least a part of the videos already stored in the electronic device.
  • the target video to be processed may be a video selected by the user from the videos stored by the electronic device.
  • the electronic device may display the stored video on the screen of the electronic device, and the user selects at least part of the video from the multiple displayed videos as the target video to be processed.
  • the target video to be processed may be a video requested by the user to download.
  • the interface shown in Figure 2 is a video download interface provided by an application.
  • The application can be a video application, that is, an application with a video playback function; users can watch videos online and download videos through the application.
  • the user selects the video to be downloaded in the video download interface, and the electronic device can detect the identifier of the video corresponding to the download request triggered by the user. For example, the video corresponding to the download button triggered by the user in the video download interface is detected, and the video corresponding to the triggered download button is the video requested by the user to download.
  • The video requested to be downloaded is used as the target video to be processed, so that when the user requests to download the video, the video processing method of the embodiment of this application can be executed on the video, and when the video is stored, the corresponding video annotation result is stored together with it.
  • the target video to be processed may be a video recorded by a user through a video recording application.
  • a video recorded by the user through the video recording function in the camera application can be used as the target video to be processed, so that when the video is stored, the video and the video annotation result corresponding to the video can be stored correspondingly.
  • the identification of the recorded video can also be stored, and the video can be used as the target video to be processed under specified conditions.
  • The specified condition may be a preset execution condition of the processing method of the embodiment of the application, that is, the method of the embodiment of the application may be executed on the target video to be processed under the specified condition, so as to obtain the video annotation result of the target video to be processed.
  • the specified condition may be a preset period, for example, 24 hours, that is, the method in the embodiment of the present application is executed every preset period.
  • The specified condition may be that the electronic device is in an idle state, so as to prevent the excessive power consumption and system freezes that could result from performing the method of the embodiment of the present application at a busy time; for the specific implementation of determining the idle state, reference can be made to the subsequent embodiments.
  • S102 Acquire the target scene type corresponding to the image frame to be processed in the target video.
  • the image frame to be processed may be at least part of the image frames of all corresponding image frames of the target video.
  • the image frame to be processed may be an image frame of a partial time period of the target video, for example, may be an image frame corresponding to the time period between the end time of the opening part of the video and the start time of the end part of the video. Therefore, it is possible to obtain the corresponding labeling result without performing video processing on the beginning part and the ending part of the video, and reduce the amount of data calculation.
  • the image frame to be processed can also be a key frame in all image frames corresponding to the target video, and the amount of data calculation can also be reduced.
  • all image frames in the target video may be used as image frames to be processed, so that the accuracy and comprehensiveness of the video annotation result can be improved.
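  • A minimal sketch of the frame selection described above is given below; the trimming offsets and the sampling stride are illustrative assumptions, not values taken from this application.

```python
# Select which frames become "image frames to be processed": optionally skip the
# opening/ending parts of the video, and optionally sample only every N-th frame
# as an approximation of key-frame selection.

def select_frames_to_process(total_frames, fps, opening_end_s=0.0,
                             ending_start_s=None, stride=1):
    """Return indices of the frames that will be sent to scene recognition."""
    if ending_start_s is None:
        ending_start_s = total_frames / fps
    first = int(opening_end_s * fps)
    last = int(ending_start_s * fps)
    return list(range(first, min(last, total_frames), stride))

# Example: a 60 s video at 30 fps, skip the first and last 5 s, take every 10th frame.
indices = select_frames_to_process(total_frames=1800, fps=30,
                                   opening_end_s=5, ending_start_s=55, stride=10)
```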
  • each image frame corresponds to a scene
  • each scene corresponds to a scene category.
  • The scene category may include: selfie, group photo, architecture, food, blue sky, silhouette, sunset, beach, sports, grass, text, and night scene. That is, the scene represents the content expressed by the entire image frame, and each object in the image frame can be regarded as an element of the scene. For example, if the entire image is a group photo of user A and user B, the scene type of the image frame is group photo, the elements in the scene include user A and user B, and the types of user A and user B are person.
  • The scene type of the image frame to be processed can be identified based on machine learning; for example, a neural network structure such as VGG-Net or ResNet is trained in advance.
  • the output of the neural network structure is the scene type corresponding to the image to be processed, that is, the target scene type.
  • the output of the last layer of the neural network structure is the distribution vector of the probability that the input image belongs to each predefined scene category.
  • The outputs of several intermediate layers of the deep neural network can be used as features of the input image to train a Softmax classifier, and the deep network model is trained with the mini-batch stochastic gradient descent method and the back-propagation algorithm. Therefore, the target scene type corresponding to the image frame to be processed can be obtained through the neural network's classifier, as illustrated in the sketch below.
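  • A minimal sketch (assuming PyTorch) of the idea described above: features taken from intermediate layers of a pretrained backbone are fed to a Softmax classifier trained with mini-batch SGD and backpropagation. The feature size and the 12 scene categories are illustrative assumptions.

```python
import torch
import torch.nn as nn

NUM_SCENES = 12          # e.g. selfie, group photo, architecture, food, ...
FEATURE_DIM = 1024       # assumed size of the intermediate-layer feature vector

classifier = nn.Linear(FEATURE_DIM, NUM_SCENES)   # Softmax is applied inside the loss
criterion = nn.CrossEntropyLoss()                 # log-softmax + negative log-likelihood
optimizer = torch.optim.SGD(classifier.parameters(), lr=0.01)

def train_step(features, labels):
    """One mini-batch SGD step on pre-extracted backbone features."""
    optimizer.zero_grad()
    logits = classifier(features)        # (batch, NUM_SCENES)
    loss = criterion(logits, labels)
    loss.backward()                      # backpropagation
    optimizer.step()
    return loss.item()

# Example with random stand-in data:
feats = torch.randn(8, FEATURE_DIM)
labs = torch.randint(0, NUM_SCENES, (8,))
train_step(feats, labs)
```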
  • There may be multiple image frames to be processed, and they may correspond to multiple scene types, so that multiple target scene types may be obtained.
  • For example, the image frames to be processed include image 1, image 2, image 3, image 4, image 5, image 6, image 7, image 8, and image 9. The scene types corresponding to image 1, image 2, image 3, image 4, and image 5 are all the first scene type, and the scene types corresponding to image 6, image 7, image 8, and image 9 are all the second scene type, so the target scene types corresponding to the 9 image frames to be processed are the first scene type and the second scene type.
  • S103 Determine a scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed.
  • the scene types corresponding to the image frames in the scene time segment are all the target scene types.
  • each image frame in the target video corresponds to a timestamp
  • the timestamp of each image frame can reflect the playback sequence of the image frame in the target video.
  • A video can be regarded as multiple image frames synthesized and played in a certain order; therefore, the image set obtained after encoding multiple image frames in a certain order can be regarded as a video, and the timestamp can be used as tag information representing the playback order of a certain image frame in the video.
  • For example, the first image frame of the video is used as the starting image and its timestamp is the starting timestamp; each image frame after the starting image then adds a value to the starting timestamp according to the playback order, and the difference between every two adjacent image frames can be fixed.
  • each image frame in the video corresponds to a time point on the playback time axis of the video, and this time point is the timestamp of the image frame.
  • The playback time axis of the video is related to the playback time length of the video; it can take 0 as the start point and the total playback time length of the video as the end point. For example, if the total length of the video is 10 seconds, the playback time axis of the video is a time axis with 0 as the start point and 10 seconds as the end point.
  • the time stamp of each image frame in the video is located on the playback time axis, so that the position of each image frame on the time playback axis can be determined.
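  • A small worked example of the timestamp convention described above, assuming a fixed frame interval (i.e., a constant frame rate); the 30 fps value is illustrative.

```python
FPS = 30.0
FRAME_INTERVAL = 1.0 / FPS          # fixed difference between adjacent image frames

def timestamp_of(frame_index, start_timestamp=0.0):
    """Timestamp of a frame on the playback time axis (seconds from the start)."""
    return start_timestamp + frame_index * FRAME_INTERVAL

# For a 10-second video at 30 fps, frame 150 sits at the 5-second point of the axis.
assert abs(timestamp_of(150) - 5.0) < 1e-9
```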
  • the scene time segment may include at least one of the start time and the end time of the scene.
  • the scene type corresponding to each image frame to be processed can be determined, and then according to the scene type corresponding to each image frame to be processed, the start time and end time of each scene type can be determined.
  • For example, the scene types corresponding to the aforementioned image 1, image 2, image 3, image 4, and image 5 are all the first scene type, and the scene types corresponding to image 6, image 7, image 8, and image 9 are all the second scene type. Image 1 through image 9 correspond to timestamps t1, t2, t3, t4, t5, t6, t7, t8, and t9, respectively.
  • The scene time segment of the first scene type is then t1 to t5, that is, on the playback time axis of the video, the scene types corresponding to all image frames between t1 and t5 are the first scene type; the scene time segment of the second scene type is t6 to t9, that is, on the playback time axis of the video, the scene types corresponding to all image frames between t6 and t9 are the second scene type. A minimal sketch of this grouping is given below.
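  • A sketch of turning per-frame scene types and timestamps into scene time segments, following the example above; the function and variable names are illustrative, not taken from this application.

```python
def scene_time_segments(timestamps, scene_types):
    """Group consecutive frames with the same scene type into (type, start, end) segments."""
    segments = []
    start = 0
    for i in range(1, len(scene_types) + 1):
        if i == len(scene_types) or scene_types[i] != scene_types[start]:
            segments.append((scene_types[start], timestamps[start], timestamps[i - 1]))
            start = i
    return segments

ts = [1, 2, 3, 4, 5, 6, 7, 8, 9]                 # t1 .. t9
types = ["first"] * 5 + ["second"] * 4
print(scene_time_segments(ts, types))
# [('first', 1, 5), ('second', 6, 9)]
```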
  • S104 Generate a video annotation result according to the target scene type and the scene time segment corresponding to the target scene type.
  • The video annotation result is used to describe that the scene type corresponding to the scene time segment in the target video is the target scene type, so that the scene type within a certain time period of the target video, together with the start time and end time of that scene type, can be known through the video annotation result. When a video of a certain scene needs to be queried, the start time and end time of the scene in the target video can be quickly located according to the scene time segment corresponding to the scene, which makes checking convenient and quick.
  • the video tagging result may be the description content corresponding to the target video
  • the description content may be text content.
  • The description content is used to express, in the form of text, the multiple scene types in the target video and the time segments corresponding to those types.
  • the description content may be "scene: Selfie, the time segment of the scene is 2 seconds to 5 seconds".
  • the video tagging result may be content set based on the time axis of the target video.
  • the electronic device can display the video annotation result.
  • the electronic device can display the video annotation result in a designated interface of the electronic device.
  • the specified interface may be a playback interface of the target video.
  • The video annotation result may be displayed on the progress bar of the target video being played, that is, the scene time segment of the target scene type and the target scene type are marked on the progress bar.
  • the content played in the video playing interface shown in FIG. 3 is a target video, and a first mark 302 and a second mark 303 corresponding to the target scene type are displayed on the progress bar 301 of the target video.
  • the first mark 302 is used to characterize the position of the start point of the target scene type on the progress bar 301
  • the second mark 303 is used to characterize the position of the end point of the target scene type on the progress bar 301.
  • When the user triggers the first mark 302 and the second mark 303, first content and second content are displayed, where the first content is used to indicate that the position corresponding to the first mark 302 is the start time of the target scene type, and the second content is used to indicate that the position corresponding to the second mark 303 is the end time of the target scene type. Therefore, when the user is watching the target video, the first mark 302 and the second mark 303 clarify the position of each scene of the video on the progress bar 301, so that the user can quickly locate a scene of interest.
  • the progress bar 301 of the video is the playback time axis of the video.
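  • A small sketch of how the first and second marks could be placed on the progress bar: each mark's horizontal position is the corresponding timestamp divided by the total playback length. The pixel width is an illustrative assumption.

```python
def mark_position(timestamp_s, total_duration_s, bar_width_px):
    """Horizontal pixel offset of a mark on a progress bar of bar_width_px pixels."""
    return int(bar_width_px * timestamp_s / total_duration_s)

# For a 10 s video and a 500 px progress bar, a scene starting at 2 s and ending at 5 s:
first_mark_x = mark_position(2.0, 10.0, 500)    # 100 px
second_mark_x = mark_position(5.0, 10.0, 500)   # 250 px
```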
  • FIG. 4 shows a video processing method provided by another embodiment of the present application.
  • The method can not only identify the scene in the target video but also identify the various objects in the specific scene, and generate a video annotation result that combines the scene and the objects.
  • the method includes: S410 to S460.
  • S420 Acquire the target scene type corresponding to the image frame to be processed in the target video.
  • a neural network based on a computer vision method is used to determine the content in the image frame to be processed.
  • a Mobilenet network model may be used.
  • the basic unit of MobileNet is depthwise separable convolution.
  • Depth-level separable convolution is a decomposable convolution operation (factorized convolutions), which can be decomposed into two smaller operations: depthwise convolution and pointwise convolution.
  • Depthwise convolution differs from standard convolution: in standard convolution, each convolution kernel is applied to all input channels, while depthwise convolution uses a different convolution kernel for each input channel, that is, one convolution kernel corresponds to one input channel.
  • the pointwise convolution is actually an ordinary convolution, but it uses a 1x1 convolution kernel.
  • Depthwise separable convolution first uses depthwise convolution to convolve the different input channels separately, and then uses pointwise convolution to combine the outputs, which greatly reduces the amount of calculation and the number of model parameters. Therefore, the MobileNet network model can be regarded as a lightweight convolutional neural network; a minimal sketch of this building block follows.
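  • A sketch (assuming PyTorch) of the depthwise separable convolution that MobileNet is built from: a depthwise 3x3 convolution with one kernel per input channel (groups=in_channels), followed by a pointwise 1x1 convolution that combines the channel-wise outputs. Channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # depthwise: one 3x3 kernel per input channel
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   stride=stride, padding=1, groups=in_channels,
                                   bias=False)
        # pointwise: ordinary convolution with a 1x1 kernel
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1,
                                   bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# A 32-channel feature map mapped to 64 channels:
block = DepthwiseSeparableConv(32, 64)
out = block(torch.randn(1, 32, 112, 112))
```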
  • the target scene type corresponding to the image frame to be processed in the target video may be obtained based on Mobilenet_V1. Specifically, it may be fine-tuned (Finetune) on the basis of MobileNet_V1 that has been trained using the data set.
  • the network can divide the image frames to be processed into 10 categories, that is, a score of 1-10.
  • In the network parameter table, Type identifies the operator type of each layer: Conv represents a convolutional layer, Avg Pool an average pooling layer, Softmax the Softmax layer, and FC a fully connected layer. Stride represents the step size of each operation, with s1 denoting a stride of 1 and s2 a stride of 2.
  • Filter Shape represents the size of the filter. For example, 3x3x3x32 indicates 3 color channels, a 3x3 convolution kernel, and 32 convolution kernels; 3x3x32 dw indicates a depthwise convolution with 32 channels, a 3x3 convolution kernel, and 32 convolution kernels. Pool 7x7 means the average pooling kernel size is 7x7, and 1024x1000 means the fully connected layer contains 1024x1000 neurons.
  • Classifier denotes the final classification category; here its value is 10, representing an output score of 1-10.
  • Input Size represents the size of the input; 224x224x3 represents a 3-channel 224x224 image.
  • Figure 5 shows the training process of the Mobilenet_V1 network.
  • The classification network for a picture consists of two parts: the first part is composed of multiple convolutional layers and is responsible for extracting diversified features from the picture, and the second part makes the classification judgment.
  • In general, the image feature extraction module of an image classification network is already relatively complete, so the part that needs improvement and training is the image category judgment module.
  • The Finetune strategy is to first train the image category judgment module separately, and then perform global fine-tuning of the network, in which the image feature extraction module joins the training; here the fully connected layer (FC layer) is trained separately for 4000 steps, and the global fine-tuning runs for 1000 steps.
  • FC layer fully connected layer
  • The Finetune data set used by the classification network is a pre-acquired data set including 280 types of data with 5000 pieces per type, 1.4 million pieces in total; each picture is marked with a specific physical label indicating the content to be detected in the image, for example the scene type or a target object. A minimal sketch of this two-phase fine-tuning is given below.
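  • A sketch (assuming PyTorch, with the torchvision MobileNet implementation as a stand-in for MobileNet_V1) of the Finetune strategy described above: train the category-judgment head alone, then unfreeze the feature extractor and fine-tune the whole network. The 280-class head and the 4000/1000 step counts follow the text; the optimizer settings and `train_loader` are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.mobilenet_v2(weights="DEFAULT")          # stand-in pretrained backbone
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 280)

criterion = nn.CrossEntropyLoss()

def run_steps(parameters, steps, loader):
    """Train for a fixed number of SGD steps; loader must yield enough batches."""
    optimizer = torch.optim.SGD(parameters, lr=0.01, momentum=0.9)
    it = iter(loader)
    for _ in range(steps):
        images, labels = next(it)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# Phase 1: train only the classification head for 4000 steps (backbone frozen).
for p in model.features.parameters():
    p.requires_grad = False
# run_steps(model.classifier.parameters(), 4000, train_loader)  # train_loader: hypothetical DataLoader

# Phase 2: global fine-tuning of the whole network for 1000 steps.
for p in model.features.parameters():
    p.requires_grad = True
# run_steps(model.parameters(), 1000, train_loader)
```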
  • Figure 6 shows the process of identifying the scene classification of the image to be processed.
  • The image frame to be processed is input into the network, and after feature extraction and category judgment, the scene category corresponding to the image frame to be processed is finally output.
  • the network can output the scene category label of the image frame to be processed.
  • The category labels can include: selfie, group photo, architecture, food, blue sky, silhouette, sunset, beach, sports, grass, text, and night view.
  • S430 Determine a scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed.
  • the target object may be a category corresponding to each specific object in the image, that is, the category of each object in the specific scene.
  • the Mobilenet network model may continue to be used to detect the target object in the image frame to be processed to obtain the target object category.
  • the target object in the image frame to be processed is detected to obtain the target object category.
  • Deep-learning-based target detection algorithms include algorithms that first generate candidate regions and then perform Convolutional Neural Network (CNN) classification (i.e., the RCNN (Regions with CNN features) algorithm), and algorithms that are applied directly to the input image and output the category together with the corresponding position (i.e., the YOLO algorithm).
  • CNN Convolutional Neural Networks
  • the trained Yolo_V3 network can be used to detect and recognize the target object in the image frame to be processed.
  • Figure 7 shows the Yolo_V3 network structure.
  • The network input size is 416x416 with 3 channels. DBL represents Darknetconv2d_BN_Leaky, the basic component of yolo_v3, which is convolution + batch normalization (BN) + Leaky ReLU activation; a minimal sketch of this basic block is given after the component naming list below.
  • Tensor stitching concatenates the upsampled output of an intermediate layer with a later layer.
  • the final network outputs the category and location of each detected object.
  • the network outputs 1000 types of objects and detection frames.
  • the detection frame is used to indicate the position of the object in the image where the object is located.
  • the residual model can be expressed as res_block
  • the residual unit can be expressed as res_unit
  • the tensor stitching can be expressed as concat
  • the middle layer can be expressed as darknet
  • the upsampling can be expressed as up_sample
  • the first fusion layer is route_1(1/8size )
  • the second fusion layer is route_2(1/16size)
  • the third fusion layer is route_3(1/32size)
  • zero padding is zero padding
  • the first output is y1, the second output is y2, and the third output is y3.
  • the first output channel is yolo_head/conv_6, the second output channel is yolo_head/conv_14, the third output channel is yolo_head/conv_22, the first feature map is feature_map_1, the second feature map is feature_map_2, and the third feature map is feature_map_3.
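  • A minimal sketch (assuming PyTorch) of the DBL basic component named above: convolution + batch normalization (BN) + Leaky ReLU. Layer sizes are illustrative; the full Yolo_V3 network stacks many such blocks together with the residual blocks, upsampling, and tensor stitching (concat) shown in Figure 7.

```python
import torch
import torch.nn as nn

def dbl(in_channels, out_channels, kernel_size=3, stride=1):
    """Darknetconv2d_BN_Leaky: Conv2d -> BatchNorm2d -> LeakyReLU."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, stride,
                  padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.LeakyReLU(0.1, inplace=True),
    )

# The 416x416, 3-channel network input passes through a first DBL block:
stem = dbl(3, 32)
features = stem(torch.randn(1, 3, 416, 416))    # -> (1, 32, 416, 416)
```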
  • S450 Determine an object time segment of the target object category in the target video according to the time stamp of the image frame to be processed.
  • Specifically, the image frame in which the target object appears is determined, and the timestamp of that image frame is used as the timestamp of the target object. In this way, the timestamps corresponding to each category of target object in the target video can be determined, and thereby the time segment in which each target object appears in the target video.
  • S460 Generate a video annotation result according to the target scene type, the scene time segment, the target object category, and the object time segment.
  • That is, content corresponding to the target object category is added to the annotation result according to the target object category and the object time segment.
  • The video annotation result can describe that the scene type corresponding to the scene time segment in the target video is the target scene type, so that the scene type within a certain time period of the target video can be known through the video annotation result; in addition to clarifying the start time and end time of this type of scene, the start time and end time of each target object category in the target video can also be determined.
  • the video tagging result may be content set based on the time axis of the target video.
  • S460 may include S461 to S465.
  • S462 Determine a time axis according to the playing time.
  • the time axis may be the playback time axis corresponding to the above-mentioned video, and the specific implementation of acquiring the playback time of the target video and determining the time axis according to the playback time can refer to the foregoing embodiment, and will not be repeated here.
  • S463 Determine a scene interval of the target scene type on the time axis according to the scene time segment corresponding to the target scene type.
  • The scene time segment includes the start time and end time of the target scene type on the time axis; therefore, the area between the start time and the end time of the target scene type on the time axis is used as the scene interval corresponding to the target scene type.
  • S464 Determine a target object interval of the target object category on the time axis according to the object time segment corresponding to the target object category.
  • The object time segment includes the start time and end time of the target object category on the time axis; therefore, the area between the start time and the end time of the target object category on the time axis is taken as the target object interval corresponding to the target object category.
  • S465 Generate a video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category.
  • Specifically, the scene interval and the target object interval may be marked on the time axis, and first content and second content may be generated according to the target scene type and the target object category, with the scene interval on the time axis corresponding to the first content and the target object interval corresponding to the second content. In this way, the positions of the scene interval and the target object interval are clarified on the time axis, and the category of the scene or target object corresponding to each interval can be clarified based on the first content and the second content.
  • An implementation of generating the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category may be: obtain the scene annotation content corresponding to the target scene type and the object annotation content corresponding to the target object category, and then generate the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category.
  • the video annotation result includes a time axis, the time axis is marked with a scene section and a target object section, and the scene label content is displayed at the position of the scene section, and the object label content is displayed at the position of the target object section.
  • The scene annotation content is used to describe the scene category, and can be text, pictures, etc. For example, if the scene category is indoor, the scene annotation content is the text "indoor".
  • The object annotation content is used to describe the object category, and can likewise be text, pictures, etc. For example, if the object category is chair, the object annotation content is the text "chair".
  • the target object category can be the category of the object, or the category of specific details of the object.
  • the object category includes the main category and the subcategories under the main category.
  • The subcategory may be a category of specific details of the target object; for example, if the main category of the target object is person, the subcategory may be an expression category or an emotion category.
  • the video annotation result may be a display content
  • the display content includes a time axis
  • the time axis is marked with a scene interval and a target object interval
  • the scene annotation content is displayed at the position of the scene interval.
  • the label content of the object is displayed at the position of the target object interval.
  • the display content includes the time axis image, the scene image of each scene interval, and the target image of each target object interval.
  • The ratio of the length of each scene image or target object image to the length of the time axis image equals the ratio of the time length of the corresponding scene time segment or object time segment to the playback time length of the target video, so as to reflect, on the time axis of the target video, the time interval in which each scene and target object exists.
  • scene annotation content or object annotation content is displayed on the scene image in each scene section and the target object image in each target object section.
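  • A minimal sketch of assembling the video annotation result described above: a time axis marked with scene intervals and target object intervals, each carrying its annotation content, plus the proportional length used when drawing each interval. The data layout and field names are illustrative assumptions.

```python
def build_annotation_result(play_time_s, scene_segments, object_segments):
    """scene_segments / object_segments: lists of (label, start_s, end_s)."""
    def interval(label, start, end):
        return {
            "label": label,                                  # scene or object annotation content
            "start": start,
            "end": end,
            "length_ratio": (end - start) / play_time_s,     # share of the time axis
        }
    return {
        "time_axis": {"start": 0.0, "end": play_time_s},
        "scene_intervals": [interval(*s) for s in scene_segments],
        "object_intervals": [interval(*o) for o in object_segments],
    }

result = build_annotation_result(
    play_time_s=10.0,
    scene_segments=[("indoor", 2.0, 5.0)],
    object_segments=[("chair", 2.5, 4.0)],
)
```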
  • FIG. 10 shows a video processing method provided by another embodiment of the present application. Specifically, the method may execute the video processing method when the electronic device is idle. Specifically, referring to FIG. 10, the method includes: S1001 to S1005.
  • S1001 Acquire the working status of the electronic device.
  • the working state of an electronic device includes a busy state and an idle state.
  • The busy state indicates that the current power consumption of the electronic device is relatively high, and processing the target video to obtain the video annotation result may cause the system to freeze; the idle state is the opposite, that is, the current power consumption is low, and processing the target video to obtain the video annotation result is unlikely to cause the system to stall.
  • the working state may be determined by at least one of the CPU usage rate, the charging state, and the current time.
  • the CPU usage rate is used to determine the working status of the electronic device. Specifically, it is determined whether the CPU usage rate of the electronic device is lower than the usage rate threshold, and if it is lower, the working state of the electronic device is determined to be an idle state; otherwise, the working state of the electronic device is determined to be a busy state.
  • the CPU usage rate can be obtained by viewing the task manager of the electronic device.
  • the CPU usage rate can be obtained through the adb shell top command.
  • the utilization rate threshold may be a utilization rate set by the user.
  • For example, the utilization rate threshold may be 60%. If the current CPU utilization rate is 40%, then 40% is less than 60% and it is determined that the utilization rate of the central processing unit is less than the utilization rate threshold; if the current CPU utilization rate is 70%, then 70% is greater than 60% and it is determined that the utilization rate of the central processing unit is greater than the utilization rate threshold.
  • If the utilization rate of the central processing unit is less than the utilization rate threshold, the current CPU resources are relatively abundant and the working state of the electronic device can be determined to be the idle state, so S1002 can be executed; if the utilization rate of the central processing unit is greater than or equal to the utilization rate threshold, the current CPU resources are relatively scarce and the working state of the electronic device can be determined to be the busy state.
  • Since the CPU usage rate is related to the applications currently started on the electronic device, when the electronic device is in the busy state it can be judged whether any currently opened application matches a preset application.
  • A preset application is an application that the system is allowed to close without the user's authorization. If such an application exists, the application matching the preset application is closed, the current CPU usage rate is then obtained again, and the operation of judging whether the usage rate of the central processing unit is less than the usage rate threshold is performed again.
  • Specifically, a list of preset applications is pre-stored in the electronic device, and the list includes the identifiers of a plurality of designated applications, where a designated application is one that the user has authorized the system to close without further user authorization. The designation may be made by the user manually entering the identifier of the specified application; the system is then allowed to kill the process of that application without the user's authorization, thereby releasing a certain amount of CPU resources and reducing the CPU usage rate.
  • the working state of the electronic device is determined based on the charging state and the current time. Specifically, if the electronic device is in a charging state and the current time is within the preset time range, it is determined that the working state of the electronic device is in an idle state; otherwise, it is determined that the working state of the electronic device is in a busy state.
  • The preset time range may be a preset time interval within which the probability of the user using the electronic device is small. For example, if the preset time range is 1 am to 6 am, the user is typically asleep and the electronic device is in the charging state, so the system resources of the electronic device are less occupied at this time and the device is in an idle state.
  • the detection of the holding state of the electronic device can be added on the basis that the electronic device is in the charging state and the current time is within the preset time range, that is, if the electronic device is in the charging state and the current time is within the preset time range and If the holding state of the electronic device is an unheld state, it is determined that the working state of the electronic device is in an idle state; otherwise, it is determined that the working state of the electronic device is in a busy state.
  • When a user holds an electronic device, the gripped parts are generally concentrated on the bottom frame, the top frame, and the area of the back near the bottom or the top. Therefore, detection devices can be set at the positions of the top frame and the bottom frame to detect whether the user is holding the electronic device, that is, the electronic device can detect whether it is in a held state.
  • pressure sensors may be provided at the positions of the top frame and the bottom frame.
  • If the pressure sensor detects a pressure value, it is determined that the electronic device is in a held state.
  • Temperature sensors can also be set at the top and bottom frames.
  • When the user is not holding the electronic device, the temperature value detected by the temperature sensor is a first temperature value; when the user is holding the electronic device, the detected temperature value is a second temperature value, where the first temperature value is less than the second temperature value and the second temperature value is greater than a preset temperature value.
  • For example, the preset temperature value is 37°C, the body temperature of the human body; if the detected temperature value is greater than the preset temperature value, it is determined that the electronic device is in a held state.
  • In addition, the touch screen of the electronic device can detect the user's touch operation; if a touch operation can be detected, the holding state is determined to be the held state, otherwise it is determined to be the unheld state.
  • Even when the screen of the electronic device is off, the touch screen is not turned off and remains in a state capable of detecting touch operations.
  • The state of the electronic device can also be determined by combining the CPU usage rate, the charging state, and the current time at the same time: when the CPU usage rate is less than the usage rate threshold, the electronic device is in the charging state, and the current time is within the preset time range, the working state of the electronic device is determined to be the idle state; otherwise, it is determined to be the busy state. A minimal sketch of this combined check follows.
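  • A sketch of the idle-state decision combining the three signals described above. How the raw signals are read is platform-specific (the text mentions the task manager and the adb shell top command), so they are passed in as arguments here; the 60% threshold and the 1 am to 6 am window follow the examples in the text.

```python
from datetime import datetime, time

USAGE_THRESHOLD = 0.60
PRESET_RANGE = (time(1, 0), time(6, 0))      # 1 am to 6 am

def is_idle(cpu_usage, is_charging, now):
    """Return True when the electronic device can run video annotation in the background."""
    in_preset_range = PRESET_RANGE[0] <= now.time() <= PRESET_RANGE[1]
    return cpu_usage < USAGE_THRESHOLD and is_charging and in_preset_range

# Example: 40% CPU usage, charging, at 2:30 am -> idle, so the target video is processed.
print(is_idle(0.40, True, datetime(2021, 4, 7, 2, 30)))   # True
```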
  • S1003 Acquire the target scene type corresponding to the image frame to be processed in the target video.
  • S1004 Determine a scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed.
  • S1005 Generate a video annotation result according to the target scene type and the scene time segment corresponding to the target scene type.
  • Therefore, when the electronic device is in the idle state, the operations of acquiring the target video to be processed and subsequently obtaining the video annotation result can be performed, preventing the electronic device from stuttering and affecting the user's experience while the method is executed.
  • In addition, the MobileNet network model and the YOLO target detection model used in the embodiments of the present application have simple structures and low computational complexity, which makes them more suitable for running on electronic devices.
  • FIG. 11 shows a structural block diagram of a video processing apparatus 1100 provided by an embodiment of the present application.
  • the apparatus may include: a video acquisition unit 1101, a scene acquisition unit 1102, a determination unit 1103, and a processing unit 1104.
  • the video acquisition unit 1101 is configured to acquire the target video to be processed
  • the scene acquisition unit 1102 is configured to acquire the target scene type corresponding to the image frame to be processed in the target video;
  • the determining unit 1103 is configured to determine the scene time segment of the target scene type in the target video according to the time stamp of the image frame to be processed, wherein, in the target video, the image in the scene time segment
  • the scene types corresponding to the frames are all the target scene types
  • the processing unit 1104 is configured to generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
  • FIG. 12 shows a structural block diagram of a video processing apparatus 1200 provided by an embodiment of the present application.
  • the apparatus may include: a video acquisition unit 1201, a scene acquisition unit 1202, a first determination unit 1203, a second determination unit 1204, and a processing unit 1205.
  • the video acquisition unit 1201 is configured to acquire the target video to be processed.
  • the video acquiring unit 1201 is also used to acquire the working state of the electronic device; if the working state is an idle state, acquiring the target video to be processed.
  • the scene acquiring unit 1202 is configured to acquire the target scene type corresponding to the image frame to be processed in the target video.
  • the scene acquiring unit 1202 is also configured to acquire the target scene type corresponding to the image frame to be processed in the target video based on the Mobilenet network model.
  • the first determining unit 1203 is configured to determine a scene time segment of the target scene type in the target video according to the time stamp of the image frame to be processed, wherein, in the target video, the scene time segment is The scene types corresponding to the image frames of are all the target scene types.
  • the second determining unit 1204 is configured to detect the target object in the image frame to be processed to obtain the target object category; according to the time stamp of the image frame to be processed, determine the target object category in the target video Object time segment; generating a video annotation result according to the target scene type, the scene time segment, the target object category, and the object time segment.
  • The second determining unit 1204 is further configured to obtain the play time of the target video; determine the time axis according to the play time; determine the scene interval of the target scene type on the time axis according to the scene time segment corresponding to the target scene type; determine the target object interval of the target object category on the time axis according to the object time segment corresponding to the target object category; and generate a video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category.
  • the second determining unit 1204 is further configured to obtain the scene annotation content corresponding to the target scene type and the object annotation content corresponding to the target object category; according to the time axis, the scene interval, and the target object interval , The target scene type and the target object category generate a video annotation result, wherein the video annotation result includes a time axis, and the time axis is marked with a scene interval and a target object interval, and is displayed at the position of the scene interval There are scene labeling content, and the object labeling content is displayed at the location of the target object interval.
  • the second determining unit 1204 is further configured to detect the target object in the image frame to be processed based on the YOLO target detection model to obtain the target object category.
  • the processing unit 1205 is configured to generate a video annotation result according to the target scene type, the scene time segment, the target object category, and the object time segment.
  • the coupling between the modules may be electrical, mechanical or other forms of coupling.
  • the functional modules in the various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules.
  • the electronic device 100 may be an electronic device capable of running application programs, such as a smart phone, a tablet computer, or an e-book.
  • The electronic device 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs, where the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, and the one or more programs are configured to execute the methods described in the foregoing method embodiments.
  • the processor 110 may include one or more processing cores.
  • The processor 110 uses various interfaces and lines to connect the various parts of the entire electronic device 100, and performs the various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and calling data stored in the memory 120.
  • Optionally, the processor 110 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA).
  • DSP Digital Signal Processing
  • FPGA Field-Programmable Gate Array
  • PLA Programmable Logic Array
  • the processor 110 may be integrated with one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like.
  • the CPU mainly processes the operating system, user interface, and application programs; the GPU is used to render and draw the display content; and the modem is used to process wireless communication. It can be understood that the above-mentioned modem may not be integrated into the processor 110, but may be implemented by a communication chip alone.
  • The memory 120 may include random access memory (RAM) or read-only memory (ROM).
  • the memory 120 may be used to store instructions, programs, codes, code sets or instruction sets.
  • The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, an image playback function, etc.), instructions for implementing the following method embodiments, and the like.
  • the storage data area can also store data (such as phone book, audio and video data, chat record data) created by the electronic device 100 during use.
  • FIG. 14 shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • the computer-readable medium 1400 stores program code, and the program code can be invoked by a processor to execute the method described in the foregoing method embodiment.
  • the computer-readable storage medium 1400 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • the computer-readable storage medium 1400 includes a non-transitory computer-readable storage medium.
  • the computer-readable storage medium 1400 has storage space for the program code 1410 for executing any method steps in the above-mentioned methods. These program codes can be read from or written into one or more computer program products.
  • The program code 1410 may, for example, be compressed in a suitable form.
  • In summary, the video processing method, device, electronic device, and computer-readable medium provided by the embodiments of the present application acquire the target video to be processed; acquire the target scene type corresponding to the image frame to be processed in the target video; determine, according to the timestamp of the image frame to be processed, the scene time segment of the target scene type in the target video, wherein, in the target video, the scene types corresponding to the image frames within the scene time segment are all the target scene type; and generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
  • Therefore, the scene type of the image frames in the video can be recognized, and the annotation result can be obtained by combining the scene type with the time at which the scene type appears in the video, so that the annotation result reflects the correspondence between time periods of the video and scenes, making the annotation result more intuitive and better matched to user needs.
  • a network of picture scene recognition and picture object detection and recognition based on deep learning is used to completely record the scenes at different time points in the video and the objects that appear in the video scene at different time points.
  • Specifically, the deep-learning-based Mobilenet_V1 network is used for video content scene recognition and the Yolo_V3 network is used for video content detection and recognition, supporting the detection and recognition of 12 scenes and 1000 types of objects. The selected networks are relatively lightweight, which keeps the models small while greatly reducing the amount of calculation, so they can be run directly offline on the mobile phone without uploading the user's photographed data to the cloud. This improves the user experience while ensuring user privacy.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The present application discloses a video processing method and apparatus, an electronic device and a computer readable medium, relating to the field of video technologies. Said method comprises: acquiring a target video to be processed; acquiring a target scene type corresponding to an image to be processed in the target video; determining, according to a time stamp of said image, a scene time fragment of the target scene type in the target video, wherein in the scene time fragment, the scene types corresponding to the images in the target video are all the target scene type; and generating a video labeling result according to the scene type and the scene time fragment corresponding to the scene type. Therefore, the present invention can identify the scene type of images in a video and obtain a labeling result by combining the scene type with the appearance time of the scene type in the video, so that the labeling result can reflect the correlation between a time period of the video and the scene, making the labeling result more intuitive and better able to satisfy user requirements.

Description

Video processing method, device, electronic equipment and computer readable medium
Cross-reference to related applications
This application claims priority to the Chinese patent application No. CN202010420727.0, titled "Video processing method, device, electronic equipment and computer-readable medium", filed with the Chinese Patent Office on May 18, 2020, the entire contents of which are incorporated into this application by reference.
Technical field
本申请涉及视频技术领域,更具体地,涉及一种视频处理方法、装置、电子设备及计算机可读介质。This application relates to the field of video technology, and more specifically, to a video processing method, device, electronic equipment, and computer-readable medium.
Background
视频标注是在视频预览或录像回放过程中,直接在视频上进行突出标记,使视频更具有针对性的视频处理方式,在诸多领域应用广泛。例如,视频标注是公安侦查民警在视频案件研判中最常用的一种分析手段,使公安干警可定位和重点关注嫌疑目标,锁定重要视频线索信息。又如,视频标注还可以用于医学领域的影像图像分析,医师可通过视频标注重点标出发生病变或产生异常的身体部位。再如,视频标注还可以作为视频的一种存储方式,可以作为视频对应的描述内容,用户通过该视频标注能够快速获取视频内的部分内容。Video annotation is a direct highlighting mark on the video during the video preview or video playback process, which makes the video more targeted video processing method, which is widely used in many fields. For example, video tagging is the most common analysis method used by public security investigators in the research and judgment of video cases, so that public security officers can locate and focus on suspected targets and lock important video clues. For another example, video annotation can also be used for image analysis in the medical field, and physicians can use video annotation to highlight body parts that have lesions or abnormalities. For another example, the video annotation can also be used as a storage method of the video, and can be used as the description content corresponding to the video. The user can quickly obtain part of the content in the video through the video annotation.
However, most current video tagging technology relies on manual tagging, in which the content of an album must be identified and tagged by hand. Tagging efficiency is low, a large amount of manpower and financial resources is consumed, and tagging accuracy drops as fatigue increases.
Summary of the invention
This application proposes a video processing method, device, electronic device, and computer-readable medium to overcome the above-mentioned drawbacks.
In a first aspect, an embodiment of the present application provides a video processing method, including: acquiring a target video to be processed; acquiring a target scene type corresponding to an image frame to be processed in the target video; determining, according to a timestamp of the image frame to be processed, a scene time segment of the target scene type in the target video, wherein, in the target video, the scene types corresponding to the image frames in the scene time segment are all the target scene type; and generating a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
In a second aspect, an embodiment of the present application also provides a video processing device, including: a video acquisition unit, a scene acquisition unit, a determination unit, and a processing unit. The video acquisition unit is configured to acquire the target video to be processed. The scene acquisition unit is configured to acquire the target scene type corresponding to the image frame to be processed in the target video. The determination unit is configured to determine the scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed, wherein, in the target video, the scene types corresponding to the image frames in the scene time segment are all the target scene type. The processing unit is configured to generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
In a third aspect, an embodiment of the present application also provides an electronic device, including: one or more processors; a memory; and one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to execute the foregoing method.
In a fourth aspect, an embodiment of the present application also provides a computer-readable medium storing program code executable by a processor, wherein the program code, when executed by the processor, causes the processor to perform the above method.
Description of the drawings
In order to describe the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative work.
FIG. 1 shows a method flowchart of a video processing method provided by an embodiment of the present application;
FIG. 2 shows a schematic diagram of a video download interface provided by an embodiment of the present application;
FIG. 3 shows a schematic diagram of a video playback interface provided by an embodiment of the present application;
FIG. 4 shows a method flowchart of a video processing method provided by another embodiment of the present application;
FIG. 5 shows the training process of the Mobilenet_V1 network provided by an embodiment of the present application;
FIG. 6 shows the process of identifying the scene classification of an image to be processed provided by an embodiment of the present application;
FIG. 7 shows a schematic diagram of the Yolo_V3 network structure provided by an embodiment of the present application;
FIG. 8 shows a flowchart of S460 in FIG. 4;
FIG. 9 shows a schematic diagram of a video annotation result provided by an embodiment of the present application;
FIG. 10 shows a method flowchart of a video processing method provided by yet another embodiment of the present application;
FIG. 11 shows a block diagram of a video processing device provided by an embodiment of the present application;
FIG. 12 shows a block diagram of a video processing device provided by another embodiment of the present application;
FIG. 13 shows a schematic diagram of an electronic device provided by an embodiment of the present application;
FIG. 14 shows a storage unit, provided by an embodiment of the present application, for storing or carrying program code that implements the video processing method according to the embodiments of the present application.
Detailed description of embodiments
In order to enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below in conjunction with the accompanying drawings of the embodiments of the present application.
视频标注是在视频预览或录像回放过程中,直接在视频上进行突出标记,使视频更具有针对性的视频处理方式,在诸多领域应用广泛。例如,视频标注是公安侦查民警在视频案件研判中最常用的一种分析手段,使公安干警可定位和重点关注嫌疑目标,锁定重要视频线索信息。又如,视频标注还可以用于医学领域的影像图像分析,医师可通过视频标注重点标出发生病变或产生异常的身体部位。再如,视频标注还可以作为视频的一种存储方式,可以作为视频对应的描述内容,用户通过该视频标注能够快速知晓视频内的部分内容。Video annotation is a direct highlighting mark on the video during the video preview or video playback process, which makes the video more targeted video processing method, which is widely used in many fields. For example, video tagging is the most common analysis method used by public security investigators in the research and judgment of video cases, so that public security officers can locate and focus on suspected targets and lock important video clues. For another example, video annotation can also be used for image analysis in the medical field, and physicians can use video annotation to highlight body parts that have lesions or abnormalities. For another example, the video annotation can also be used as a storage method of the video, and can be used as the description content corresponding to the video. The user can quickly know part of the content in the video through the video annotation.
Currently, video tagging methods mainly fall into manual tagging and machine-learning-based video tagging.
For example, in one manual video tagging method, a container interface for holding the video is first constructed through a web page and the video is loaded into the video area; then, according to the content of the video, the slider is manually dragged or the video drag bar is clicked to change the video playback time point or to confirm the time point of the content being played, and the content of the video is tagged by clicking the knowledge point panel of the video.
With the continuous application of machine learning technology in the field of computer vision, the demand for labeled data is increasing. Machine learning is a branch of artificial intelligence. Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning. Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It studies how computers simulate or realize human learning behavior in order to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations.
For example, one machine-learning-based video tagging method tags videos based on feature extraction. Specifically, the acquired video stream is first decoded, tagging commands corresponding to all frame images are received, all stored features corresponding to those frame images are then extracted according to the tagging commands, and finally the stored feature and the receiving time corresponding to each tagging command are saved in the tagging record.
However, the inventor found in research that the existing manual-tagging methods require the content in the album to be identified and tagged by hand, which is inefficient, consumes a lot of manpower and financial resources, becomes less accurate as fatigue increases, and, most importantly, can infringe on the privacy of the album user. The disadvantage of the feature-extraction-based video tagging method is that it only records whether the video contains features of a given category and does not order the features within the video, which makes it difficult for the tagging result to interpret the content of the video.
Therefore, in order to overcome the above-mentioned drawbacks, an embodiment of the present application provides a video processing method applied to an electronic device. As an implementation manner, the execution subject of the method may be the electronic device itself, so that the video processing method can be executed locally on the electronic device, avoiding sending the video to a cloud server, which could leak data and endanger the user's privacy. Specifically, as shown in FIG. 1, the method includes S101 to S104.
S101: Acquire a target video to be processed.
As an implementation manner, the target video to be processed may be at least some of the videos already stored in the electronic device. In some embodiments, the target video to be processed may be a video selected by the user from the videos stored by the electronic device. For example, the electronic device may display the stored videos on its screen, and the user selects at least some of the displayed videos as the target video to be processed.
As another implementation manner, the target video to be processed may be a video that the user requests to download. As shown in FIG. 2, the interface shown in FIG. 2 is a video download interface provided by an application. The application may be a video application, that is, an application with a video playback function through which users can watch videos online and download videos. When the user selects the video to be downloaded in the video download interface, the electronic device can detect the identifier of the video corresponding to the download request triggered by the user. For example, the video corresponding to the download button triggered by the user in the video download interface is detected, and the video corresponding to the triggered download button is the video the user requests to download.
By taking the video requested for download as the target video to be processed, the video processing method of the embodiments of this application can be executed on the video as soon as the user requests to download it, so that when the video is stored, it can be stored in correspondence with its video annotation result.
Of course, it is also possible to record the identifier of the video requested for download, or to store the video, and to select at least some of the downloaded videos as the target video to be processed under a specified condition.
As yet another implementation manner, the target video to be processed may be a video recorded by the user through a video recording application. For example, a video recorded by the user through the video recording function of the camera application can be used as the target video to be processed, so that when the video is stored, the video and its corresponding video annotation result can be stored together.
Of course, the identifier of the recorded video can also be stored, and the video can be used as the target video to be processed under a specified condition.
The specified condition may be a preset execution condition of the processing method of the embodiments of this application, that is, the method of the embodiments of this application may be executed on the target video to be processed when the specified condition is met, so as to obtain the video annotation result of the target video to be processed. As an implementation manner, the specified condition may be a preset period, for example 24 hours, that is, the method of the embodiments of the present application is executed once every preset period. As another implementation manner, the specified condition may be that the electronic device is in an idle state, so as to prevent the electronic device from consuming excessive power and causing the system to stall while executing the method of the embodiments of the present application; the specific implementation of the idle state is described in the subsequent embodiments.
S102: Acquire the target scene type corresponding to the image frame to be processed in the target video.
The image frames to be processed may be at least some of all the image frames of the target video. As an implementation manner, the image frames to be processed may be the image frames of a partial time period of the target video, for example, the image frames corresponding to the time period between the end of the opening part of the video and the start of the ending part of the video, so that the opening and ending parts of the video need not be processed to obtain the annotation result, which reduces the amount of computation. Of course, the image frames to be processed may also be the key frames among all the image frames of the target video, which likewise reduces the amount of computation. As another implementation manner, all the image frames in the target video may be used as the image frames to be processed, which improves the accuracy and comprehensiveness of the video annotation result.
As an implementation manner, each image frame corresponds to a scene, and each scene corresponds to a scene category. In some embodiments, the scene categories may include: selfie, group photo, architecture, food, blue sky, silhouette, sunset, beach, sports, grass, text, and night scene. That is, the scene represents the content expressed by the entire image frame, while the individual objects in the image frame can be regarded as elements of the scene. For example, if the entire image is a group photo of user A and user B, the scene type of the image frame is "group photo", the elements in the scene include user A and user B, and the type of user A and user B is "person".
As an implementation manner, the scene type of the image frame to be processed can be identified based on machine learning. For example, a neural network structure, such as VGG-Net or ResNet, is pre-trained; the image frame to be processed is taken as the input image of the neural network structure, and the output of the neural network structure is the scene type corresponding to the image to be processed, that is, the target scene type.
Specifically, the output of the last layer of the neural network structure is a vector giving the probability that the input image belongs to each predefined scene category. In the process of constructing the integrated classifier, the outputs of several intermediate layers of the deep neural network can be used as features of the input image to train a Softmax classifier, and the deep network model is trained with mini-batch stochastic gradient descent and the back-propagation algorithm. The target scene type corresponding to the image frame to be processed can thus be obtained through the classifier of the neural network.
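By way of illustration only (not part of the disclosed networks), the mapping from the classifier's output vector to a scene label can be sketched as follows; the label list and helper names are assumptions introduced here for clarity:

```python
import numpy as np

# Hypothetical labels corresponding to the 12 scene categories mentioned above.
SCENE_LABELS = ["selfie", "group photo", "architecture", "food", "blue sky", "silhouette",
                "sunset", "beach", "sports", "grass", "text", "night scene"]

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def classify_scene(frame_logits):
    """Turn the network's raw output vector for one frame into (scene label, confidence)."""
    probs = softmax(np.asarray(frame_logits, dtype=np.float64))
    best = int(np.argmax(probs))
    return SCENE_LABELS[best], float(probs[best])
```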
As an implementation manner, there may be multiple image frames to be processed, and the image frames to be processed may correspond to multiple scene types, so that multiple target scene types may be obtained. For example, the image frames to be processed include image 1 to image 9, where the scene types corresponding to image 1, image 2, image 3, image 4, and image 5 are all a first scene type, and the scene types corresponding to image 6, image 7, image 8, and image 9 are all a second scene type; the target scene types corresponding to these 9 image frames to be processed are then the first scene type and the second scene type.
S103: Determine the scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed.
In the target video, the scene types corresponding to the image frames in the scene time segment are all the target scene type.
Each image frame in the target video corresponds to a timestamp, and the timestamp of each image frame reflects the playback order of that image frame in the target video. A video can be regarded as multiple image frames synthesized and played in a certain order; therefore, the image set obtained after encoding multiple image frames in a certain order can be regarded as a video, and the timestamp can be regarded as tag information representing the playback position of an image frame within the video. Usually, the first image frame of the video is taken as the starting image and its timestamp as the starting timestamp; the timestamps of the image frames after the starting image then increase from the starting timestamp according to the playback order, and the difference between every two adjacent image frames can be fixed.
Therefore, each image frame in the video corresponds to a time point on the playback time axis of the video, and this time point is the timestamp of the image frame. The playback time axis of the video is related to the playback duration of the video; it can take 0 as the starting point and the total playback duration of the video as the end point. For example, if the total length of the video is 10 seconds, the playback time axis of the video is a time axis starting at 0 and ending at 10 seconds. The timestamp of each image frame in the video lies on this playback time axis, so the position of each image frame on the playback time axis can be determined.
The scene time segment may include at least one of the start time and the end time of the scene.
According to S102, the scene type corresponding to each image frame to be processed can be determined, and from the scene types of the image frames the start time and end time of each scene type can then be determined. For example, the scene types corresponding to the aforementioned image 1 to image 5 are all the first scene type, the scene types corresponding to image 6 to image 9 are all the second scene type, and image 1 to image 9 correspond to timestamps t1 to t9, respectively. It can then be determined that the scene time segment of the first scene type is t1 to t5, that is, on the video playback time axis, the scene types corresponding to all image frames between t1 and t5 are the first scene type; and the scene time segment of the second scene type is t6 to t9, that is, on the video playback time axis, the scene types corresponding to all image frames between t6 and t9 are the second scene type.
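As a purely illustrative sketch of this grouping step (the data layout is an assumption, not the claimed implementation), consecutive frames sharing a scene label can be merged into scene time segments as follows:

```python
from itertools import groupby

def scene_time_segments(frames):
    """frames: list of (timestamp, scene_label) pairs in playback order.
    Returns {scene_label: [(start, end), ...]}, where each (start, end) covers a
    maximal run of consecutive frames sharing the same scene label."""
    segments = {}
    for label, run in groupby(frames, key=lambda f: f[1]):
        run = list(run)
        segments.setdefault(label, []).append((run[0][0], run[-1][0]))
    return segments

# With frames t1..t5 labelled as the first scene type and t6..t9 as the second,
# the result is {first: [(t1, t5)], second: [(t6, t9)]}.
```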
S104: Generate a video annotation result according to the target scene type and the scene time segment corresponding to the target scene type.
The video annotation result is used to describe that the scene type corresponding to the scene time segment in the target video is the target scene type. Through the video annotation result, the scene type within a certain time period of the target video can be known, and the start time and end time of a scene of that type are made explicit, so that when a video of a certain scene needs to be queried, the start time and end time of that scene can be quickly located in the target video according to the scene time segment corresponding to the scene, which is convenient for quick lookup.
As an implementation manner, the video annotation result may be the description content corresponding to the target video, and the description content may be text. Specifically, the description content expresses, in textual form, the multiple scene types in the target video and the start time and end time corresponding to each scene type. For example, the description content may be "scene: selfie, scene time segment: 2 seconds to 5 seconds".
As another implementation manner, the video annotation result may be content arranged based on the time axis of the target video; for details, please refer to the subsequent embodiments.
In some embodiments, the electronic device may display the video annotation result. As an implementation manner, the electronic device can display the video annotation result in a designated interface of the electronic device. For example, the designated interface may be the playback interface of the target video. As an implementation manner, the video annotation result may be displayed on the progress bar of the target video being played, that is, the scene time segment of the target scene type and the target scene type are marked on the progress bar.
As shown in FIG. 3, the content played in the video playback interface of FIG. 3 is the target video, and a first mark 302 and a second mark 303 corresponding to the target scene type are displayed on the progress bar 301 of the target video. The first mark 302 indicates the position on the progress bar 301 of the start time of the target scene type, and the second mark 303 indicates the position on the progress bar 301 of the end time of the target scene type. When the user triggers the first mark 302 and the second mark 303, first content and second content are displayed, where the first content indicates that the position corresponding to the first mark 302 is the start time of the target scene type, such as the "selfie scene start time" shown in FIG. 3, and the second content indicates that the position corresponding to the second mark 303 is the end time of the target scene type. Therefore, when watching the target video, the user can see from the first mark 302 and the second mark 303 where each scene lies on the video progress bar 301, which helps the user quickly locate a scene of interest. The progress bar 301 of the video is the playback time axis of the video.
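As an illustrative sketch only (the normalization scheme is an assumption introduced here), the scene time segments can be converted into marker positions on a progress bar whose length is normalized to the range [0, 1]:

```python
def progress_bar_markers(segments, total_duration):
    """segments: {scene_label: [(start, end), ...]} in seconds.
    Returns marker descriptors with start/end positions normalized to the bar length."""
    markers = []
    for label, spans in segments.items():
        for start, end in spans:
            markers.append({
                "label": label,
                "start_pos": start / total_duration,  # position of the first mark
                "end_pos": end / total_duration,      # position of the second mark
            })
    return markers
```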
Please refer to FIG. 4, which shows a video processing method provided by another embodiment of the present application. Specifically, this method not only identifies the scenes in the target video, but also identifies the individual objects within each scene, and generates the video annotation result by combining the scenes and the objects. Specifically, referring to FIG. 4, the method includes S410 to S460.
S410: Acquire a target video to be processed.
S420: Acquire the target scene type corresponding to the image frame to be processed in the target video.
As an implementation manner, a neural network based on computer vision methods is used to determine the content of the image frame to be processed; specifically, a Mobilenet network model may be used. The basic unit of MobileNet is the depthwise separable convolution. A depthwise separable convolution is a factorized convolution that can be decomposed into two smaller operations: a depthwise convolution and a pointwise convolution. A depthwise convolution differs from a standard convolution: in a standard convolution each kernel is applied across all input channels, whereas a depthwise convolution uses a different kernel for each input channel, that is, one kernel corresponds to one input channel. A pointwise convolution is an ordinary convolution that simply uses a 1x1 kernel. A depthwise separable convolution first applies the depthwise convolution to each input channel separately and then uses the pointwise convolution to combine the outputs, which greatly reduces the amount of computation and the number of model parameters. Therefore, the Mobilenet network model can be regarded as a lightweight convolutional neural network.
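By way of illustration of this building block (a minimal sketch in PyTorch, not the network actually used in the embodiments), a depthwise separable convolution can be written as:

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution (one kernel per input channel) followed by a
    1x1 pointwise convolution that mixes channels, each with BN and ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.bn_dw = nn.BatchNorm2d(in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn_pw = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.bn_dw(self.depthwise(x)))
        x = self.act(self.bn_pw(self.pointwise(x)))
        return x
```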
As an implementation manner, the target scene type corresponding to the image frame to be processed in the target video may be obtained based on Mobilenet_V1; specifically, the model may be fine-tuned (Finetune) on the basis of a MobileNet_V1 that has already been trained on a data set.
Table 1 shows a schematic of the Mobilenet_V1 network structure.
Table 1
[Table 1 is reproduced as an image in the original publication; it lists, for each layer of Mobilenet_V1, the operator type, stride, filter shape, and input size, as explained below.]
This network divides the image frames to be processed into 10 classes, that is, scores of 1-10. In the Mobilenet_V1 network structure, Type identifies the operator type of each layer, where Conv denotes a convolutional layer, Avg Pool an average pooling layer, Softmax a Softmax layer, and FC a fully connected layer. Stride denotes the step size of each operation: s1 means a stride of 1 and s2 a stride of 2. Filter Shape denotes the size of the filter: 3x3x3x32 means 3 colour channels, a 3x3 kernel, and 32 kernels; 3x3x32 dw denotes a depthwise convolution with a 3x3 kernel over 32 channels; Pool 7x7 means average pooling with a 7x7 kernel; and 1024x1000 means the fully connected layer contains 1024x1000 neurons. Classifier denotes the final classification category; in the picture scoring network the value of Classifier is 10, representing output values of 1-10. Input Size denotes the size of the input, and 224x224x3 denotes a 3-channel 224x224 image.
As shown in FIG. 5, FIG. 5 shows the training process of the Mobilenet_V1 network. A picture classification network usually consists of two parts: the first part is composed of multiple convolutional layers and is responsible for extracting the diverse features in the picture, while the second part usually consists of fully connected layers and is responsible for taking the features extracted by the convolutional layers and making the classification judgment. After training on the training data provided by the ImageNet project, the picture feature extraction module of the picture classification network is already fairly complete, so the part that needs improvement and training is the picture category judgment module. The Finetune strategy is to first train the picture category judgment module alone and then perform global fine-tuning of the network together with the picture feature extraction module, where the fully connected layer (FC layer) is trained alone for 4000 steps and the global fine-tuning runs for 1000 steps. The Finetune data set used by the classification network is a pre-acquired data set containing 280 classes of data with 5000 pictures per class, 1.4 million pictures in total; each picture is annotated with a specific label indicating the content to be detected in the image, for example the scene type or a target object.
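As a rough sketch of this two-stage fine-tuning strategy (the attribute names `features`/`classifier` and the `run_steps` training helper are assumptions for illustration, not the implementation used in the embodiments):

```python
import torch

def finetune(model, run_steps, head_steps=4000, global_steps=1000,
             lr_head=1e-3, lr_all=1e-4):
    """Two-stage fine-tuning as described above: the classifier head is trained
    alone first, then the whole network is fine-tuned globally.
    `run_steps(model, optimizer, n)` is a caller-supplied training loop over the
    labelled data set (hypothetical helper, not part of the disclosure)."""
    # Stage 1: freeze the feature-extraction layers, train only the classifier head.
    for p in model.features.parameters():
        p.requires_grad = False
    head_opt = torch.optim.SGD(model.classifier.parameters(), lr=lr_head, momentum=0.9)
    run_steps(model, head_opt, head_steps)

    # Stage 2: unfreeze everything for a short global fine-tuning pass.
    for p in model.features.parameters():
        p.requires_grad = True
    global_opt = torch.optim.SGD(model.parameters(), lr=lr_all, momentum=0.9)
    run_steps(model, global_opt, global_steps)
```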
As shown in FIG. 6, FIG. 6 shows the process of identifying the scene classification of an image to be processed. Specifically, the image frame to be processed is input into the network and, after feature extraction and category judgment, the scene category corresponding to the image frame to be processed is finally output. Specifically, the network can output the scene category label of the image frame to be processed, and the category labels may include: selfie, group photo, architecture, food, blue sky, silhouette, sunset, beach, sports, grass, text, and night scene.
S430: Determine the scene time segment of the target scene type in the target video according to the timestamp of the image frame to be processed.
S440: Detect the target objects in the image frame to be processed to obtain the target object categories.
The target object category may be the category corresponding to each specific object in the image, that is, the category of each object within the specific scene.
As an implementation manner, the Mobilenet network model may continue to be used to detect the target objects in the image frame to be processed to obtain the target object categories.
As another implementation manner, the target objects in the image frame to be processed are detected based on a YOLO target detection model to obtain the target object categories.
Deep-learning-based target detection algorithms include algorithms that first generate candidate regions and then perform convolutional neural network (CNN) classification (i.e., the RCNN (Regions with CNN features) family of algorithms), and algorithms that apply the network directly to the input image and output the categories and corresponding locations (i.e., the YOLO algorithm).
In the embodiments of the present application, a trained Yolo_V3 network can be used to detect and recognize the target objects in the image frame to be processed.
As shown in FIG. 7, FIG. 7 shows the Yolo_V3 network structure. The network input size is 416x416 with 3 channels. DBL denotes Darknetconv2d_BN_Leaky, the basic component of yolo_v3, namely convolution + batch normalization (BN) + Leaky ReLU. Residual layer n (resn): n is a number (res1, res2, ..., res8, etc.) indicating how many residual units the residual block contains. Tensor concatenation splices the upsampled output of a later layer with an intermediate layer. The network finally outputs the category and position of each detected object; it outputs 1000 classes of objects together with detection boxes, where a detection box indicates the position of an object within the image it belongs to. In the figure, the residual block is denoted res_block, the residual unit res_unit, tensor concatenation concat, the intermediate backbone darknet, upsampling up_sample, the first fusion layer route_1 (1/8 size), the second fusion layer route_2 (1/16 size), the third fusion layer route_3 (1/32 size), zero padding zero padding, the first output y1, the second output y2, the third output y3, the first output channel yolo_head/conv_6, the second output channel yolo_head/conv_14, the third output channel yolo_head/conv_22, the first feature map feature_map_1, the second feature map feature_map_2, and the third feature map feature_map_3.
S450: Determine the object time segment of the target object category in the target video according to the timestamp of the image frame to be processed.
For determining the object time segment of a target object category in the target video, reference may be made to the foregoing implementation of determining the scene time segment corresponding to the target scene type. Specifically, the timestamp of the image frame is determined, and the timestamp of the image frame can be used as the timestamp of the target object it contains, so that the timestamps corresponding to each category of target object in the target video can be determined, and thus the time segment during which each target object appears in the target video can be determined.
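Purely as an illustrative sketch (the per-frame data layout is an assumption), the per-frame detections can be turned into per-category object time segments in the same spirit as the scene segments above:

```python
def object_time_segments(frame_detections):
    """frame_detections: list of (timestamp, iterable_of_object_labels) per processed frame,
    in playback order. For each label, consecutive frames containing it are merged
    into (start, end) spans; a label may appear in several separate spans."""
    spans = {}
    open_spans = {}
    prev_labels = set()
    for ts, labels in frame_detections:
        labels = set(labels)
        for label in labels:
            if label not in open_spans:
                open_spans[label] = [ts, ts]     # label appears: open a new span
            else:
                open_spans[label][1] = ts        # label still present: extend the span
        for label in prev_labels - labels:       # label disappeared: close its span
            spans.setdefault(label, []).append(tuple(open_spans.pop(label)))
        prev_labels = labels
    for label, span in open_spans.items():       # close spans still open at the end
        spans.setdefault(label, []).append(tuple(span))
    return spans
```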
S460: Generate the video annotation result according to the target scene type, the scene time segment, the target object category, and the object time segment.
Specifically, on the basis of the video annotation result determined above from the target scene type and the scene time segment, content corresponding to the target object category is added according to the target object category and the object time segment.
Specifically, besides describing that the scene type corresponding to a scene time segment in the target video is the target scene type, so that the scene type within a certain time period of the target video can be known from the video annotation result and the start time and end time of that type of scene are made clear, the video annotation result can also specify the start time and end time of each target object category in the target video.
As an implementation manner, the video annotation result may be content arranged based on the time axis of the target video. Specifically, referring to FIG. 8, S460 may include S461 to S465.
S461: Acquire the playback time of the target video.
S462: Determine a time axis according to the playback time.
The time axis may be the playback time axis corresponding to the above-mentioned video; the specific implementation of acquiring the playback time of the target video and determining the time axis according to the playback time can refer to the foregoing embodiments and will not be repeated here.
S463: Determine the scene interval of the target scene type on the time axis according to the scene time segment corresponding to the target scene type.
The scene time segment includes the start time and the end time of the target scene type on the time axis; therefore, the region between the start time and the end time of the target scene type on the time axis serves as the scene interval corresponding to the target scene type.
S464: Determine the target object interval of the target object category on the time axis according to the object time segment corresponding to the target object category.
Similarly, the object time segment includes the start time and the end time of the target object category on the time axis; therefore, the region between the start time and the end time of the target object category on the time axis serves as the target object interval corresponding to the target object category.
S465: Generate the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category.
As an implementation manner, the scene interval and the target object interval may be marked on the time axis; first content and second content may be generated according to the target scene type and the target object category, the first content being annotated at the scene interval on the time axis and the second content at the target object interval, so that the positions of the scene interval and the target object interval are clear on the time axis and the scene or target object category corresponding to each interval can be determined from the first content and the second content.
Specifically, one implementation of generating the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category is to obtain the scene annotation content corresponding to the target scene type and the object annotation content corresponding to the target object category, and then generate the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type, and the target object category. The video annotation result includes the time axis, on which the scene interval and the target object interval are marked; the scene annotation content is displayed at the position of the scene interval, and the object annotation content is displayed at the position of the target object interval.
The scene annotation content is content used to describe the scene category, and may be text, a picture, or the like; for example, if the scene category is indoor, the scene annotation content is the text "indoor". Similarly, the object annotation content is content used to describe the object category, and may be text, a picture, or the like; for example, if the object category is chair, the object annotation content is the text "chair".
In addition, it should be noted that the target object category may be the category of the object itself or the category of a specific detail of the object. Specifically, the object category includes a main category and subcategories under the main category, where the main category describes the overall category of the object, for example, person. A subcategory may be the category of a specific detail of the target object; for example, where the main category of the target object is person, the subcategory may be an expression category or an emotion category.
As an implementation manner, the video annotation result may be a display content that includes a time axis on which scene intervals and target object intervals are marked, with the scene annotation content displayed at the position of each scene interval and the object annotation content displayed at the position of each target object interval. As shown in FIG. 9, the display content includes a time axis image, a scene image for each scene interval, and a target object image for each target object interval; the ratio of the length of each scene image or target object image to the length of the time axis image is consistent with the ratio of the time length of the corresponding scene time segment or object time segment to the playback duration of the target video, so that the time intervals in which each scene and target object exist are reflected on the time axis of the target video. In addition, the scene annotation content or object annotation content is displayed on the scene image of each scene interval and the target object image of each target object interval.
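A minimal sketch of assembling such an annotation result as a serialisable structure is shown below; the field names are assumptions chosen for illustration rather than a prescribed format:

```python
import json

def build_annotation(duration, scene_segments, object_segments):
    """duration: total playback time in seconds.
    scene_segments / object_segments: {label: [(start, end), ...]}.
    Returns a JSON string combining the timeline with labelled intervals."""
    result = {"timeline": {"start": 0.0, "end": duration}, "scenes": [], "objects": []}
    for label, spans in scene_segments.items():
        for start, end in spans:
            result["scenes"].append({"label": label, "start": start, "end": end})
    for label, spans in object_segments.items():
        for start, end in spans:
            result["objects"].append({"label": label, "start": start, "end": end})
    return json.dumps(result, ensure_ascii=False)
```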
Please refer to FIG. 10, which shows a video processing method provided by yet another embodiment of the present application. Specifically, in this method the video processing is performed when the electronic device is idle. Specifically, referring to FIG. 10, the method includes S1001 to S1005.
S1001: Acquire the working state of the electronic device.
The working state of the electronic device includes a busy state and an idle state. The busy state means that the current power consumption of the electronic device is relatively high, and processing the target video to obtain the video annotation result at that time may cause the system to stall; the idle state is the opposite of the busy state, that is, the current power consumption is low, and processing the target video to obtain the video annotation result is unlikely to cause the system to stall.
As an implementation manner, the working state may be determined from at least one of the CPU usage rate, the charging state, and the current time.
In some embodiments, the CPU usage rate is used to determine the working state of the electronic device. Specifically, it is determined whether the CPU usage rate of the electronic device is lower than a usage rate threshold; if it is lower, the working state of the electronic device is determined to be the idle state; otherwise, the working state of the electronic device is determined to be the busy state.
Specifically, the CPU usage rate can be obtained by viewing the task manager of the electronic device; for example, on the Android system, the CPU usage rate can be obtained through the adb shell top command. The usage rate threshold may be a usage rate set by the user, for example 60%. Assuming the current CPU usage rate is 40%, then 40% is less than 60% and the CPU usage rate is determined to be less than the usage rate threshold; if the current CPU usage rate is 70%, then 70% is greater than 60% and the CPU usage rate is determined to be greater than the usage rate threshold.
If the CPU usage rate is less than the usage rate threshold, CPU resources are currently relatively plentiful, it can be determined that the working state of the electronic device is the idle state, and S1002 can be executed; if the CPU usage rate is greater than or equal to the usage rate threshold, CPU resources are currently relatively scarce, and it can be determined that the working state of the electronic device is the busy state.
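As an illustrative sketch only (using the cross-platform psutil package rather than the task-manager or adb shell top readings mentioned above; 60% is the example threshold given here), the idle/busy decision based on CPU usage can be expressed as:

```python
import psutil

USAGE_THRESHOLD = 60.0  # per cent, the example threshold used above

def cpu_state(threshold=USAGE_THRESHOLD):
    """Sample overall CPU utilisation over one second and classify the device state."""
    usage = psutil.cpu_percent(interval=1.0)
    return "idle" if usage < threshold else "busy"
```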
In addition, since the CPU usage rate is related to the applications currently running on the electronic device, when the electronic device is in the busy state it may be determined whether any of the currently opened applications matches a preset application, where a preset application is an application that the system is allowed to close without the user's authorization. If such an application exists, the matching application is closed, the current CPU usage rate is obtained again as the CPU usage rate, and the operation of determining whether the usage rate of the central processing unit is below the usage rate threshold is performed again.

Specifically, a list of preset applications is stored in advance in the electronic device. The list contains the identifiers of a plurality of designated applications, where a designated application is one that the user has authorized the system to close without further authorization; for example, the user may manually enter the identifier of the designated application.

Therefore, when the CPU usage rate is too high, the processes of applications that the system is allowed to close without the user's authorization are killed, releasing a certain amount of CPU resources and reducing the CPU usage rate.
In some embodiments, the working state of the electronic device is determined from the charging state and the current time. Specifically, if the electronic device is charging and the current time is within a preset time range, the working state of the electronic device is determined to be the idle state; otherwise, it is determined to be the busy state. The preset time range may be a predefined time interval during which the user is unlikely to use the electronic device. For example, the preset time range may be from 1 a.m. to 6 a.m.; during this period the user is usually asleep and the electronic device is charging, so the system resources of the electronic device are largely unoccupied and the device is in the idle state.
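On Android, one hedged way of realizing this check could combine the sticky battery-status broadcast with the current time, roughly as sketched below; the 1 a.m. to 6 a.m. window and the `isIdleByChargingAndTime` helper are assumptions for illustration only.

```kotlin
import android.content.Context
import android.content.Intent
import android.content.IntentFilter
import android.os.BatteryManager
import java.time.LocalTime

// Illustrative sketch: idle if the device is charging and the current time falls
// within a preset low-usage window (1 a.m. to 6 a.m. in this example).
fun isIdleByChargingAndTime(context: Context): Boolean {
    // ACTION_BATTERY_CHANGED is a sticky broadcast, so registering a null receiver
    // simply returns the last battery status intent.
    val battery: Intent? = context.registerReceiver(
        null, IntentFilter(Intent.ACTION_BATTERY_CHANGED)
    )
    val status = battery?.getIntExtra(BatteryManager.EXTRA_STATUS, -1) ?: -1
    val charging = status == BatteryManager.BATTERY_STATUS_CHARGING ||
        status == BatteryManager.BATTERY_STATUS_FULL

    val now = LocalTime.now()
    val inQuietHours = now.isAfter(LocalTime.of(1, 0)) && now.isBefore(LocalTime.of(6, 0))

    return charging && inQuietHours
}
```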
Further, on the basis that the electronic device is charging and the current time is within the preset time range, detection of the holding state of the electronic device may be added. That is, if the electronic device is charging, the current time is within the preset time range, and the holding state of the electronic device is the unheld state, the working state of the electronic device is determined to be the idle state; otherwise, it is determined to be the busy state.

Specifically, when a user holds an electronic device, the gripped parts are generally concentrated on the bottom frame, the top frame, and the positions on the back near the bottom or the top. Therefore, detection components can be arranged at the positions of the top frame and the bottom frame to detect whether the user is holding the electronic device, that is, whether the electronic device is in the held state.

In one implementation, pressure sensors may be arranged at the positions of the top frame and the bottom frame. When the user holds the electronic device, the pressure sensor detects a pressure value, and the electronic device is determined to be in the held state. Temperature sensors may also be arranged at the top frame and the bottom frame. When the user is not holding the electronic device, the temperature detected by the temperature sensor is a first temperature value; when the user is holding it, the detected temperature is a second temperature value, the first temperature value being lower than the second temperature value and the second temperature value being higher than a preset temperature value. For example, the preset temperature value is 37, approximately human body temperature; if the detected temperature is higher than the preset temperature value, the electronic device is determined to be in the held state.

In another implementation, it may also be detected whether the touch screen of the electronic device detects a touch operation of the user; if so, the holding state is determined to be the held state, otherwise it is determined to be the unheld state. In this implementation, when the screen of the electronic device is off, the touch screen is not turned off and remains in a state capable of detecting touch operations.
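A rough Kotlin sketch of the touch-based variant is given below; it simply records the time of the last touch event reported on a view and treats the device as held if a touch was seen recently. The `GripDetector` class and the five-second window are assumptions introduced only for illustration.

```kotlin
import android.annotation.SuppressLint
import android.os.SystemClock
import android.view.View

// Illustrative sketch: treat the device as "held" if the touch screen has reported
// a touch event within the last few seconds.
class GripDetector(private val heldWindowMs: Long = 5_000) {
    @Volatile private var lastTouchMs: Long = 0

    @SuppressLint("ClickableViewAccessibility")
    fun attachTo(view: View) {
        view.setOnTouchListener { _, _ ->
            lastTouchMs = SystemClock.elapsedRealtime()
            false // do not consume the event
        }
    }

    fun isHeld(): Boolean =
        SystemClock.elapsedRealtime() - lastTouchMs < heldWindowMs
}
```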
In a further implementation, the CPU usage rate, the charging state and the current time may be combined to determine the state of the electronic device. That is, if the CPU usage rate is below the usage rate threshold, the electronic device is charging, and the current time is within the preset time range, the working state of the electronic device is determined to be the idle state; otherwise, it is determined to be the busy state.
S1002: If the working state is the idle state, acquire a target video to be processed.

S1003: Acquire the target scene type corresponding to an image frame to be processed in the target video.

S1004: Determine, according to the timestamp of the image frame to be processed, the scene time segment of the target scene type within the target video.

S1005: Generate a video annotation result according to the target scene type and the scene time segment corresponding to the target scene type.
Therefore, by performing the acquisition of the target video to be processed and the subsequent operations for obtaining the video annotation result only when the electronic device is in the idle state, the method avoids causing the electronic device to stall and affect the user's experience while the method is running on the electronic device.
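The overall idle-gated flow of S1001 to S1005 might be organized roughly as follows; the function names (`isDeviceIdle`, `loadPendingVideoFrames`, `classifyScene`) are placeholders standing in for the idle check, frame extraction and scene recognition steps described earlier, not APIs of any particular library.

```kotlin
// Illustrative sketch of the idle-gated annotation pipeline (S1001-S1005).
// Every function referenced here is a placeholder for the corresponding step
// described in the text; none of them is a real library API.
data class Frame(val timestampMs: Long /* decoded pixels omitted for brevity */)
data class SceneSegment(val sceneType: String, val startMs: Long, val endMs: Long)

fun annotateWhenIdle(
    isDeviceIdle: () -> Boolean,                    // S1001
    loadPendingVideoFrames: () -> List<Frame>,      // S1002
    classifyScene: (Frame) -> String                // S1003: e.g. a MobileNet-style classifier
): List<SceneSegment> {
    if (!isDeviceIdle()) return emptyList()         // postpone processing while busy

    val frames = loadPendingVideoFrames()
    val segments = mutableListOf<SceneSegment>()
    var current: SceneSegment? = null

    for (frame in frames) {                         // S1004: merge consecutive frames
        val scene = classifyScene(frame)            // sharing the same scene type
        val c = current
        current = when {
            c == null -> SceneSegment(scene, frame.timestampMs, frame.timestampMs)
            c.sceneType == scene -> c.copy(endMs = frame.timestampMs)
            else -> { segments += c; SceneSegment(scene, frame.timestampMs, frame.timestampMs) }
        }
    }
    current?.let { segments += it }
    return segments                                 // S1005: used to build the annotation result
}
```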
In addition, the Mobilenet network model and the YOLO target detection model used in the embodiments of the present application have a simple structure and low algorithmic complexity, and are therefore well suited to running on electronic devices.
Referring to FIG. 11, which shows a structural block diagram of a video processing apparatus 1100 provided by an embodiment of the present application, the apparatus may include a video acquisition unit 1101, a scene acquisition unit 1102, a determining unit 1103 and a processing unit 1104.

The video acquisition unit 1101 is configured to acquire a target video to be processed.

The scene acquisition unit 1102 is configured to acquire the target scene type corresponding to an image frame to be processed in the target video.

The determining unit 1103 is configured to determine, according to the timestamp of the image frame to be processed, the scene time segment of the target scene type within the target video, where in the target video the scene types corresponding to the image frames within the scene time segment are all the target scene type.

The processing unit 1104 is configured to generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
Those skilled in the art will clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the apparatus and modules described above, which are not repeated here.
Referring to FIG. 12, which shows a structural block diagram of a video processing apparatus 1200 provided by an embodiment of the present application, the apparatus may include a video acquisition unit 1201, a scene acquisition unit 1202, a first determining unit 1203, a second determining unit 1204 and a processing unit 1205.

The video acquisition unit 1201 is configured to acquire a target video to be processed.

Specifically, the video acquisition unit 1201 is further configured to acquire the working state of the electronic device and, if the working state is the idle state, acquire the target video to be processed.

The scene acquisition unit 1202 is configured to acquire the target scene type corresponding to an image frame to be processed in the target video.

The scene acquisition unit 1202 is further configured to acquire, based on a Mobilenet network model, the target scene type corresponding to the image frame to be processed in the target video.

The first determining unit 1203 is configured to determine, according to the timestamp of the image frame to be processed, the scene time segment of the target scene type within the target video, where in the target video the scene types corresponding to the image frames within the scene time segment are all the target scene type.

The second determining unit 1204 is configured to detect the target object in the image frame to be processed to obtain a target object category; determine, according to the timestamp of the image frame to be processed, the object time segment of the target object category within the target video; and generate a video annotation result according to the target scene type, the scene time segment, the target object category and the object time segment.

Further, the second determining unit 1204 is also configured to acquire the playing time of the target video; determine a time axis according to the playing time; determine the scene interval of the target scene type on the time axis according to the scene time segment corresponding to the target scene type; determine the target object interval of the target object category on the time axis according to the object time segment corresponding to the target object category; and generate the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type and the target object category.

Further, the second determining unit 1204 is also configured to acquire the scene annotation content corresponding to the target scene type and the object annotation content corresponding to the target object category, and to generate the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type and the target object category, where the video annotation result includes a time axis on which scene intervals and target object intervals are marked, the scene annotation content is displayed at the position of the scene interval, and the object annotation content is displayed at the position of the target object interval.

Further, the second determining unit 1204 is also configured to detect, based on a YOLO target detection model, the target object in the image frame to be processed to obtain the target object category.

The processing unit 1205 is configured to generate the video annotation result according to the target scene type, the scene time segment, the target object category and the object time segment.
Those skilled in the art will clearly understand that, for convenience and brevity of description, reference may be made to the corresponding processes in the foregoing method embodiments for the specific working processes of the apparatus and modules described above, which are not repeated here.
In the several embodiments provided in this application, the coupling between modules may be electrical, mechanical or in other forms.

In addition, the functional modules in the various embodiments of the present application may be integrated into one processing module, or each module may exist physically on its own, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
Referring to FIG. 13, which shows a structural block diagram of an electronic device provided by an embodiment of the present application, the electronic device 100 may be an electronic device capable of running application programs, such as a smartphone, a tablet computer or an e-book reader. The electronic device 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs, where the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, and the one or more programs are configured to perform the methods described in the foregoing method embodiments.

The processor 110 may include one or more processing cores. The processor 110 uses various interfaces and lines to connect the various parts of the entire electronic device 100, and performs the various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 120 and by calling data stored in the memory 120. Optionally, the processor 110 may be implemented in at least one hardware form of digital signal processing (DSP), field-programmable gate array (FPGA) and programmable logic array (PLA). The processor 110 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem and the like. The CPU mainly handles the operating system, the user interface and application programs; the GPU is responsible for rendering and drawing display content; the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.

The memory 120 may include random access memory (RAM) or read-only memory (ROM). The memory 120 may be used to store instructions, programs, code, code sets or instruction sets. The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playback function or an image playback function), instructions for implementing the following method embodiments, and the like. The data storage area may also store data created by the electronic device 100 during use (such as a phone book, audio and video data, and chat records).
Referring to FIG. 14, which shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application, the computer-readable medium 1400 stores program code that can be invoked by a processor to perform the methods described in the foregoing method embodiments.

The computer-readable storage medium 1400 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk or a ROM. Optionally, the computer-readable storage medium 1400 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 1400 has storage space for program code 1410 for performing any of the method steps of the above methods. The program code can be read from or written into one or more computer program products. The program code 1410 may, for example, be compressed in an appropriate form.
In summary, the video processing method, apparatus, electronic device and computer-readable medium provided by this application acquire a target video to be processed; acquire the target scene type corresponding to an image frame to be processed in the target video; determine, according to the timestamp of the image frame to be processed, the scene time segment of the target scene type within the target video, where in the target video the scene types corresponding to the image frames within the scene time segment are all the target scene type; and generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type. The scene type of the image frames in the video can thus be recognized, and the annotation result is obtained by combining the scene type with the time at which the scene type appears in the video, so that the annotation result reflects the correspondence between the time periods of the video and the scenes, making the annotation result more intuitive and better matched to user needs.

Further, deep-learning-based networks for picture scene recognition and picture object detection and recognition are used to completely record the scenes present at different points in time in the video, as well as the objects that appear in the video scenes at those points in time.

This is beneficial for: 1) recording the course of events in the video; 2) subsequent analysis of the video content; 3) broadening the dimensions along which video content can be searched; and 4) editing video clips of specific objects and the like.

The deep-learning-based Mobilenet_V1 network is used for video content scene recognition, and the Yolo_V3 network is used for video content detection and recognition, supporting the detection and recognition of 12 scene types and 1000 object classes. The selected networks are relatively lightweight, which keeps the models small while greatly reducing the amount of computation, so they can run offline directly on the mobile phone without uploading the user's captured data to the cloud, improving the user experience while protecting user privacy.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements for some of the technical features therein; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A video processing method, characterized in that it comprises:
    acquiring a target video to be processed;
    acquiring a target scene type corresponding to an image frame to be processed in the target video;
    determining, according to a timestamp of the image frame to be processed, a scene time segment of the target scene type within the target video, wherein in the target video the scene types corresponding to the image frames within the scene time segment are all the target scene type; and
    generating a video annotation result according to the target scene type and the scene time segment corresponding to the target scene type.
  2. The method according to claim 1, characterized in that generating the video annotation result according to the target scene type and the scene time segment corresponding to the target scene type comprises:
    detecting a target object in the image frame to be processed to obtain a target object category;
    determining, according to the timestamp of the image frame to be processed, an object time segment of the target object category within the target video; and
    generating the video annotation result according to the target scene type, the scene time segment, the target object category and the object time segment.
  3. The method according to claim 2, characterized in that generating the video annotation result according to the target scene type, the scene time segment, the target object category and the object time segment comprises:
    acquiring a playing time of the target video;
    determining a time axis according to the playing time;
    determining, according to the scene time segment corresponding to the target scene type, a scene interval of the target scene type on the time axis;
    determining, according to the object time segment corresponding to the target object category, a target object interval of the target object category on the time axis; and
    generating the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type and the target object category.
  4. The method according to claim 3, characterized in that generating the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type and the target object category comprises:
    acquiring scene annotation content corresponding to the target scene type and object annotation content corresponding to the target object category; and
    generating the video annotation result according to the time axis, the scene interval, the target object interval, the target scene type and the target object category, wherein the video annotation result comprises a time axis on which the scene interval and the target object interval are marked, the scene annotation content is displayed at the position of the scene interval, and the object annotation content is displayed at the position of the target object interval.
  5. The method according to claim 1, characterized in that acquiring the target scene type corresponding to the image frame to be processed in the target video comprises:
    acquiring, based on a Mobilenet network model, the target scene type corresponding to the image frame to be processed in the target video.
  6. The method according to claim 2, characterized in that detecting the target object in the image frame to be processed to obtain the target object category comprises:
    detecting, based on a YOLO target detection model, the target object in the image frame to be processed to obtain the target object category.
  7. The method according to any one of claims 2 to 6, characterized in that the target object category comprises a main category of the target object and a subcategory under the main category.
  8. The method according to any one of claims 1 to 7, characterized in that the method is applied to an electronic device, and acquiring the target video to be processed comprises:
    acquiring a working state of the electronic device; and
    if the working state is an idle state, acquiring the target video to be processed.
  9. The method according to claim 8, characterized in that acquiring the working state of the electronic device comprises:
    acquiring a CPU usage rate of the electronic device;
    determining whether the CPU usage rate of the electronic device is below a usage rate threshold;
    if it is below the usage rate threshold, determining that the working state of the electronic device is the idle state; and
    if it is not below the usage rate threshold, determining that the working state of the electronic device is a busy state.
  10. The method according to claim 9, characterized in that, after determining that the working state of the electronic device is the busy state if the CPU usage rate is not below the usage rate threshold, the method further comprises:
    determining whether an application matching a preset application exists among the currently opened applications; and
    if an application matching the preset application exists, closing the application matching the preset application, acquiring the current CPU usage rate again as a new CPU usage rate, and returning to the operation of determining whether the CPU usage rate of the electronic device is below the usage rate threshold.
  11. The method according to claim 10, characterized in that the preset application is an application that the system is allowed to close without the user's authorization.
  12. The method according to claim 8, characterized in that acquiring the working state of the electronic device comprises:
    determining whether the electronic device is in a charging state and the current time is within a preset time range;
    if so, determining that the working state of the electronic device is the idle state; and
    if not, determining that the working state of the electronic device is a busy state.
  13. The method according to claim 8, characterized in that acquiring the working state of the electronic device comprises:
    determining whether the electronic device is in a charging state, the current time is within a preset time range, and a holding state of the electronic device is an unheld state;
    if so, determining that the working state of the electronic device is the idle state; and
    if not, determining that the working state of the electronic device is a busy state.
  14. The method according to claim 13, characterized in that the holding state of the electronic device is detected in the following manner:
    determining whether a touch screen of the electronic device detects a touch operation of the user;
    if a touch operation is detected, determining that the holding state is a held state; and
    if no touch operation is detected, determining that the holding state is the unheld state.
  15. The method according to any one of claims 1 to 14, characterized in that, after generating the video annotation result according to the scene type and the scene time segment corresponding to the scene type, the method further comprises:
    displaying the video annotation result in a designated interface of the electronic device, wherein the designated interface is a playback interface of the target video.
  16. The method according to any one of claims 1 to 15, characterized in that the image frame to be processed is an image frame within the time period between the end time of the opening-credits portion of the target video and the start time of the end-credits portion of the target video.
  17. The method according to any one of claims 1 to 15, characterized in that the image frame to be processed is a key frame of the target video.
  18. A video processing apparatus, characterized in that it comprises:
    a video acquisition unit, configured to acquire a target video to be processed;
    a scene acquisition unit, configured to acquire a target scene type corresponding to an image frame to be processed in the target video;
    a determining unit, configured to determine, according to a timestamp of the image frame to be processed, a scene time segment of the target scene type within the target video, wherein in the target video the scene types corresponding to the image frames within the scene time segment are all the target scene type; and
    a processing unit, configured to generate a video annotation result according to the scene type and the scene time segment corresponding to the scene type.
  19. An electronic device, characterized in that it comprises:
    one or more processors;
    a memory; and
    one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to perform the method according to any one of claims 1 to 17.
  20. A computer-readable medium, characterized in that the computer-readable medium stores program code executable by a processor, and when the program code is executed by the processor, the processor performs the method according to any one of claims 1 to 17.
PCT/CN2021/085692 2020-05-18 2021-04-06 Video processing method and apparatus, electronic device and computer readable medium WO2021232978A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010420727.0A CN111581433B (en) 2020-05-18 2020-05-18 Video processing method, device, electronic equipment and computer readable medium
CN202010420727.0 2020-05-18

Publications (1)

Publication Number Publication Date
WO2021232978A1 true WO2021232978A1 (en) 2021-11-25

Family

ID=72115519

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/085692 WO2021232978A1 (en) 2020-05-18 2021-04-06 Video processing method and apparatus, electronic device and computer readable medium

Country Status (2)

Country Link
CN (1) CN111581433B (en)
WO (1) WO2021232978A1 (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581433B (en) * 2020-05-18 2023-10-10 Oppo广东移动通信有限公司 Video processing method, device, electronic equipment and computer readable medium
CN112040277B (en) * 2020-09-11 2022-03-04 腾讯科技(深圳)有限公司 Video-based data processing method and device, computer and readable storage medium
CN112560583A (en) * 2020-11-26 2021-03-26 复旦大学附属中山医院 Data set generation method and device
CN112672061B (en) * 2020-12-30 2023-01-24 维沃移动通信(杭州)有限公司 Video shooting method and device, electronic equipment and medium
CN112822554A (en) * 2020-12-31 2021-05-18 联想(北京)有限公司 Multimedia processing method and device and electronic equipment
CN113034384A (en) * 2021-02-26 2021-06-25 Oppo广东移动通信有限公司 Video processing method, video processing device, electronic equipment and storage medium
CN113115054B (en) * 2021-03-31 2022-05-06 杭州海康威视数字技术股份有限公司 Video stream encoding method, device, system, electronic device and storage medium
CN113641852A (en) * 2021-07-13 2021-11-12 彩虹无人机科技有限公司 Unmanned aerial vehicle photoelectric video target retrieval method, electronic device and medium
CN113610006B (en) * 2021-08-09 2023-09-08 中电科大数据研究院有限公司 Overtime labor discrimination method based on target detection model
CN113657307A (en) * 2021-08-20 2021-11-16 北京市商汤科技开发有限公司 Data labeling method and device, computer equipment and storage medium
CN114697761B (en) * 2022-04-07 2024-02-13 脸萌有限公司 Processing method, processing device, terminal equipment and medium
CN115905622A (en) * 2022-11-15 2023-04-04 北京字跳网络技术有限公司 Video annotation method, device, equipment, medium and product

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330392A (en) * 2017-06-26 2017-11-07 司马大大(北京)智能系统有限公司 Video scene annotation equipment and method
CN108830208A (en) * 2018-06-08 2018-11-16 Oppo广东移动通信有限公司 Method for processing video frequency and device, electronic equipment, computer readable storage medium
CN109587578A (en) * 2018-12-21 2019-04-05 麒麟合盛网络技术股份有限公司 The processing method and processing device of video clip
CN110119711B (en) * 2019-05-14 2021-06-11 北京奇艺世纪科技有限公司 Method and device for acquiring character segments of video data and electronic equipment
CN110213610B (en) * 2019-06-13 2021-05-28 北京奇艺世纪科技有限公司 Live broadcast scene recognition method and device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7751683B1 (en) * 2000-11-10 2010-07-06 International Business Machines Corporation Scene change marking for thumbnail extraction
CN106126335A (en) * 2016-06-15 2016-11-16 青岛海信电器股份有限公司 The Media Survey method of terminal unit and terminal unit
CN108769801A (en) * 2018-05-28 2018-11-06 广州虎牙信息科技有限公司 Synthetic method, device, equipment and the storage medium of short-sighted frequency
CN110209879A (en) * 2018-08-15 2019-09-06 腾讯科技(深圳)有限公司 A kind of video broadcasting method, device, equipment and storage medium
CN109168062A (en) * 2018-08-28 2019-01-08 北京达佳互联信息技术有限公司 Methods of exhibiting, device, terminal device and the storage medium of video playing
CN110147722A (en) * 2019-04-11 2019-08-20 平安科技(深圳)有限公司 A kind of method for processing video frequency, video process apparatus and terminal device
CN111581433A (en) * 2020-05-18 2020-08-25 Oppo广东移动通信有限公司 Video processing method and device, electronic equipment and computer readable medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114390368A (en) * 2021-12-29 2022-04-22 腾讯科技(深圳)有限公司 Live video data processing method and device, equipment and readable medium
CN114666667A (en) * 2022-03-03 2022-06-24 海宁奕斯伟集成电路设计有限公司 Video key point generation method and device, electronic equipment and storage medium
CN114782899A (en) * 2022-06-15 2022-07-22 浙江大华技术股份有限公司 Image processing method and device and electronic equipment
CN115734045A (en) * 2022-11-15 2023-03-03 深圳市东明炬创电子股份有限公司 Video playing method, device, equipment and storage medium
CN115695944A (en) * 2022-12-30 2023-02-03 北京远特科技股份有限公司 Vehicle-mounted image processing method and device, electronic equipment and medium
CN115695944B (en) * 2022-12-30 2023-03-28 北京远特科技股份有限公司 Vehicle-mounted image processing method and device, electronic equipment and medium
CN115858854A (en) * 2023-02-28 2023-03-28 北京奇树有鱼文化传媒有限公司 Video data sorting method and device, electronic equipment and storage medium
CN115858854B (en) * 2023-02-28 2023-05-26 北京奇树有鱼文化传媒有限公司 Video data sorting method and device, electronic equipment and storage medium
CN116761019A (en) * 2023-08-24 2023-09-15 瀚博半导体(上海)有限公司 Video processing method, system, computer device and computer readable storage medium

Also Published As

Publication number Publication date
CN111581433A (en) 2020-08-25
CN111581433B (en) 2023-10-10

Similar Documents

Publication Publication Date Title
WO2021232978A1 (en) Video processing method and apparatus, electronic device and computer readable medium
Vasudevan et al. Query-adaptive video summarization via quality-aware relevance estimation
US10742900B2 (en) Method and system for providing camera effect
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
CN113010703B (en) Information recommendation method and device, electronic equipment and storage medium
US9886762B2 (en) Method for retrieving image and electronic device thereof
CN106028134A (en) Detect sports video highlights for mobile computing devices
US11100368B2 (en) Accelerated training of an image classifier
CN111209897B (en) Video processing method, device and storage medium
CN113766296B (en) Live broadcast picture display method and device
CN113395578B (en) Method, device, equipment and storage medium for extracting video theme text
Thomas et al. Perceptual video summarization—A new framework for video summarization
WO2020259449A1 (en) Method and device for generating short video
CN111581423B (en) Target retrieval method and device
US10958842B2 (en) Method of displaying images in a multi-dimensional mode based on personalized topics
CN102150163A (en) Interactive image selection method
CN110619284B (en) Video scene division method, device, equipment and medium
CN111126347B (en) Human eye state identification method, device, terminal and readable storage medium
US20210126806A1 (en) Method for recognizing and utilizing user face based on profile picture in chatroom created using group album
WO2019196795A1 (en) Video editing method, device and electronic device
CN113627402B (en) Image identification method and related device
Zhang et al. A comprehensive survey on computational aesthetic evaluation of visual art images: Metrics and challenges
JP7491867B2 (en) VIDEO CLIP EXTRACTION METHOD, VIDEO CLIP EXTRACTION DEVICE AND STORAGE MEDIUM
WO2022247112A1 (en) Task processing method and apparatus, device, storage medium, computer program, and program product
Fei et al. Creating memorable video summaries that satisfy the user’s intention for taking the videos

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21809864

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21809864

Country of ref document: EP

Kind code of ref document: A1