CN109922373B - Video processing method, device and storage medium

Video processing method, device and storage medium

Info

Publication number
CN109922373B
CN109922373B
Authority
CN
China
Prior art keywords
video
target
preset
scene
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910193143.1A
Other languages
Chinese (zh)
Other versions
CN109922373A (en)
Inventor
李滇博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jilian Network Technology Co ltd
Original Assignee
Shanghai Jilian Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jilian Network Technology Co ltd filed Critical Shanghai Jilian Network Technology Co ltd
Priority to CN201910193143.1A priority Critical patent/CN109922373B/en
Publication of CN109922373A publication Critical patent/CN109922373A/en
Application granted granted Critical
Publication of CN109922373B publication Critical patent/CN109922373B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The embodiments of the present application provide a video processing method, a video processing apparatus, and a storage medium, wherein the method comprises the following steps: segmenting a video according to camera shots to obtain a plurality of video clips; acquiring preset labels corresponding to the plurality of video clips; obtaining the time points in the video clips that match the preset labels; obtaining one or more target video clips matching the preset labels according to the time points; and forming a new target video from one target video clip, or combining a plurality of target video clips to form a new target video that can be played continuously. One or more target video clips can be acquired from the video according to a preset label and then played individually or in succession; the whole video is no longer played in the original order, so only the video clips matching the preset label are displayed to the user, meeting the individual requirements of different users.

Description

Video processing method, device and storage medium
Technical Field
The present disclosure relates to the field of intelligent video image monitoring, and in particular, to a video processing method, apparatus, and storage medium.
Background
In today's era of fragmented viewing time, more and more young people choose to use various video applications to watch their favorite programs such as TV series and movies. When a new TV series is released, users want to get a quick overview of its content and highlight segments. Because most current programs contain drawn-out plots, and because users pay different degrees of attention to different parts, users often drag the progress bar while watching, fast-forwarding or rewinding to reach the content they want to watch, which is tedious and laborious.
The existing video playback mode generally follows the original video content, and a user can only control the pace of playback by adjusting the playback speed; suitable content cannot be played according to the user's needs. In addition, playing content at an adjusted speed may cause audio distortion, resulting in a poor user experience.
Disclosure of Invention
The embodiments of the present application provide a video processing method, a video processing apparatus, and a storage medium, which can process videos according to user requirements, so that the processed videos meet the needs of different users.
The embodiment of the application provides a video processing method, which comprises the following steps:
segmenting the video according to a shooting lens to obtain a plurality of video segments;
acquiring preset labels corresponding to the plurality of video clips;
obtaining time points matched with the preset labels in the video clips;
obtaining one or more target video clips matched with the preset labels according to the time points;
and forming a new target video by one target video segment, or combining a plurality of target video segments to form a new target video capable of being played continuously.
An embodiment of the present application provides a video processing apparatus, which includes:
the video segmentation module is used for segmenting the video according to the shooting lens to obtain a plurality of video segments;
the preset label acquisition module is used for acquiring preset labels corresponding to the plurality of video clips;
the time point acquisition module is used for acquiring time points matched with the preset labels in the plurality of video clips;
the target video clip acquisition module is used for acquiring one or more target video clips matched with the preset tags according to the time points;
and the processing module is used for forming a new target video from one target video segment or combining a plurality of target video segments to form a new target video capable of being continuously played.
An embodiment of the present application further provides a storage medium, where a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the video processing method as described above.
In the video processing method, the video processing device and the storage medium provided by the embodiment of the application, the video is firstly segmented according to the shooting lens to obtain a plurality of video segments; then acquiring preset labels corresponding to the plurality of video clips; then, obtaining time points matched with the preset labels in the video clips; then, one or more target video clips matched with the preset tags are obtained according to the time points; and finally, forming a new target video by one target video segment, or combining a plurality of target video segments to form a new target video capable of being continuously played. One or more target video clips in the video can be obtained according to the preset tag, then the one or more target video clips are continuously played, the whole video is not played according to the original video playing sequence, only the video clip matched with the preset tag can be displayed for the user, and the individual requirements of different users are met.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic flowchart of a video processing method according to an embodiment of the present disclosure.
Fig. 2 is a schematic flowchart of a second video processing method according to an embodiment of the present disclosure.
Fig. 3 is a third flowchart illustrating a video processing method according to an embodiment of the present application.
Fig. 4 is a further flowchart of a video processing method according to an embodiment of the present application.
Fig. 5 is a schematic diagram of a video processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a first schematic flow chart of a video processing method according to an embodiment of the present application, where the flow of the video processing method may specifically include:
and 101, segmenting the video according to the shooting lens to obtain a plurality of video segments.
Video processing in the embodiments of the present application can be built on shot segmentation, that is, each segmented shot unit is used as the object of video analysis. Shot segmentation is performed by modeling the background information of the video and splitting the video into shots using the optical flow information and color distribution information of the background. Within one shot, the optical flow information and the color distribution of the background are relatively stable; when the change in this information between adjacent frames is large and exceeds a threshold value, a shot switch is judged to occur at that timestamp. In this way, the video can be divided according to camera shots to obtain a plurality of video clips.
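By way of illustration only, the sketch below shows this thresholding idea, using an HSV color-histogram difference between adjacent frames as a stand-in for the background color distribution; the Bhattacharyya metric and the 0.5 threshold are assumptions for the example, not values taken from this application.

```python
# A minimal sketch of threshold-based shot segmentation, assuming an
# HSV-histogram difference as the measure of background color change.
import cv2

def segment_shots(video_path, threshold=0.5):
    """Return frame indices at which a shot switch is detected."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # A large change in color distribution between adjacent frames
            # is judged as a shot switch at this timestamp.
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA) > threshold:
                boundaries.append(idx)
        prev_hist = hist
        idx += 1
    cap.release()
    return boundaries
```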
Each video clip is captured continuously by the same camera shot, and two adjacent video clips are captured by two different camera shots. It should be noted that if footage from the same camera shot appears in two different time periods, that is, the two portions are disconnected in the playback timeline, they are treated as two separate video clips.
And 102, acquiring preset labels corresponding to the plurality of video clips. Several frames of images can be obtained from the video, and one frame or several frames are then analyzed with image recognition technology to obtain the image features in each frame. A preset label is then extracted from the image features of each frame. The preset label may be at least one of a scene label, an action label, a person label, and the like.
And 103, obtaining the time points matched with the preset labels in the plurality of video clips.
After the plurality of video clips are obtained, image recognition is performed on each video clip. Specifically, a number of sampling frames can be obtained from each video clip at a preset sampling frequency, and image recognition is then performed on each sampling frame to identify whether it has features matching the preset label. If not, no processing is performed; if so, the time point of the preset label is recorded. The time point can be understood as a time point in the playback of the entire video.
Identifying whether a sampling frame has features matching the preset label can be understood as checking whether the frame contains the same feature as the preset label, such as a particular person, or a similar feature, such as a similar scene or action.
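As a hedged illustration of step 103, the sketch below records the whole-video time points of sampling frames that match a preset label; the function name and the `matches_tag` predicate are hypothetical stand-ins for whatever recognizer (person, scene, or action) is actually used.

```python
# Hypothetical sketch of step 103: sample a clip at a preset frequency and
# record the whole-video time points whose frames match a preset label.
def find_tag_time_points(clip_frames, clip_start_sec, sample_fps, matches_tag):
    """clip_frames: frames already sampled at `sample_fps` from one clip."""
    time_points = []
    for i, frame in enumerate(clip_frames):
        if matches_tag(frame):  # same feature (e.g. a person) or a similar one
            time_points.append(clip_start_sec + i / sample_fps)
    return time_points
```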
And 104, obtaining one or more target video clips matched with the preset label according to the time point.
After the time points matching the preset label are obtained in each video clip, one or more target video clips matching the preset label are obtained.
The video clip matching the preset label may be taken in its entirety as the target video clip.
Alternatively, the time point of the start position and the time point of the end position matching the preset label are obtained within the video clip, and the target video clip is then cut out according to these start and end time points. In this way, one or more target video clips can be obtained from the one or more video clips.
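One possible way to realize the start and end positions, sketched under the assumption that consecutive matching time points separated by more than a small gap belong to different occurrences; the 2-second gap is an illustrative choice, not specified by this application.

```python
# Illustrative grouping of recorded time points into (start, end) target
# segments. The `max_gap` of 2.0 seconds is an assumption for the sketch.
def time_points_to_segments(time_points, max_gap=2.0):
    segments, start, prev = [], None, None
    for t in sorted(time_points):
        if start is None:
            start = prev = t
        elif t - prev > max_gap:
            segments.append((start, prev))  # close the previous occurrence
            start = t
        prev = t
    if start is not None:
        segments.append((start, prev))
    return segments
```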
And 105, forming a new target video by one target video segment or combining a plurality of target video segments to form a new target video capable of being continuously played.
And finally, one or more target video clips are combined to form a new target video, which can play a single video clip or several video clips in sequence. The whole video is no longer played in the original order, so only the video clips matching the preset label are displayed to the user, meeting the individual requirements of different users.
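As one concrete, non-authoritative way to form the continuously playable target video, the segments could be cut and concatenated, for example with moviepy; the application itself does not prescribe a library, and the output file name is arbitrary.

```python
# A minimal sketch, assuming moviepy is acceptable for cutting and joining.
from moviepy.editor import VideoFileClip, concatenate_videoclips

def build_target_video(src_path, segments, out_path="target.mp4"):
    source = VideoFileClip(src_path)
    clips = [source.subclip(start, end) for start, end in segments]
    concatenate_videoclips(clips).write_videofile(out_path)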
Referring to fig. 2, fig. 2 is a second flowchart illustrating a video processing method according to an embodiment of the present disclosure. In some embodiments, the preset label may be an action label. In the video processing method, after the step of segmenting the video according to camera shots to obtain a plurality of video clips, the method may include:
201, extracting video optical flow information of each video clip;
202, determining the video clips with the video optical flow information larger than a preset threshold value as target video clips, and marking action tags;
203, extracting image information and audio information of the target video clip;
204, acquiring a first probability that each frame in the target video clip is the start of an action and a second probability that each frame in the target video clip is the end of the action according to the image information and the audio information;
205, taking the frame with the maximum first probability as an action start frame and the frame with the maximum second probability as an action end frame;
and 206, obtaining an action video clip set and the time points of the action video clip set according to the action start frame and the action end frame;
207, obtaining one or more target video clips matched with the action labels according to the time points;
and 208, forming a new target video by one target video segment or combining a plurality of target video segments to form a new target video capable of being continuously played.
The video analysis mode that takes action as its main line is mainly applied to action movies, sports events, variety shows, and the like. It identifies the highlight action segments appearing in the video, such as moments of intense and meaningful information change like fights, races, and goals, and records these highlight moments, thereby providing users with customized viewing modes such as highlight review.
In the action-main-line video analysis mode, the highlights of a video are the time segments where the main characters gather and the amount of information changes rapidly. First, candidate highlight shots with fast information change are screened out by extracting the video optical flow information. The candidate shots are then sent to a highlight action detection module, which extracts both the image and the audio information of the video and outputs, for each frame in the shot, the probability that the frame is the start of an action, the end of an action, or within an action. Finally, the start and end frames with the maximum probabilities are taken as the head and tail of an action unit and input into a pre-trained action recognition model for analysis to obtain the action classification result.
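Steps 205 and 206 amount to an argmax over the per-frame probabilities. The sketch below assumes the detection module has already produced one start probability and one end probability per frame, and additionally constrains the end frame to follow the start frame, an assumption this application does not state explicitly.

```python
# Hedged sketch of selecting action boundaries (steps 205-206).
# `start_prob` and `end_prob` are assumed per-frame model outputs.
import numpy as np

def action_boundaries(start_prob, end_prob):
    start_frame = int(np.argmax(start_prob))
    # Only frames at or after the chosen start are eligible as the end frame.
    end_frame = start_frame + int(np.argmax(end_prob[start_frame:]))
    return start_frame, end_frame
```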
And when the video analysis is finished, classifying the detected video segments according to categories and listing the time point of each highlight segment.
Online video playback is carried out in different modes, that is, the user can freely select the desired playback mode according to the character, scene, action, and other information analyzed offline.
In the embodiment of the present application, highlight action recognition first estimates the optical flow information of the video and screens out the shots with large changes in optical flow information as candidate highlight shots, which are then sent to the highlight detection and judgment module. Next, the frames with the highest start probability and highest end probability, together with the intermediate frames, are sent to an action recognition module, which is a pre-trained action recognition model that outputs the type of the action. Finally, the highlights occurring in the video are classified in the same way and their time points are output.
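The optical-flow screening could look like the following sketch, which uses OpenCV's Farneback dense flow and treats the mean flow magnitude of a shot as its "optical flow information"; the concrete flow algorithm and the threshold of 2.0 are assumptions, not values from this application.

```python
# Illustrative screening of candidate highlight shots by mean optical-flow
# magnitude; Farneback flow and the threshold are assumptions.
import cv2
import numpy as np

def mean_flow_magnitude(frames):
    """frames: list of BGR frames from one shot unit."""
    mags = []
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        mags.append(float(mag.mean()))
        prev = gray
    return float(np.mean(mags))

def is_candidate_highlight(shot_frames, threshold=2.0):
    return mean_flow_magnitude(shot_frames) > threshold
```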
In some embodiments, after the step of obtaining the motion video segment set according to the motion start frame and the motion end frame, the method further includes:
when there are multiple action video clip sets, arranging the action video clip sets and outputting the time information of each set.
Referring to fig. 3, fig. 3 is a third flowchart illustrating a video processing method according to an embodiment of the present disclosure, where the preset tag may be a scene tag. The flow of the video processing method may specifically include:
301, segmenting the video according to the shot to obtain a plurality of video segments.
Video processing in the embodiments of the present application can be built on shot segmentation, that is, each segmented shot unit is used as the object of video analysis. Shot segmentation is performed by modeling the background information of the video and splitting the video into shots using the optical flow information and color distribution information of the background. Within one shot, the optical flow information and the color distribution of the background are relatively stable; when the change in this information between adjacent frames is large and exceeds a threshold value, a shot switch is judged to occur at that timestamp. In this way, the video can be divided according to camera shots to obtain a plurality of video clips.
Each video clip is captured continuously by the same camera shot, and two adjacent video clips are captured by two different camera shots. It should be noted that if footage from the same camera shot appears in two different time periods, that is, the two portions are disconnected in the playback timeline, they are treated as two separate video clips.
And 302, performing interval sampling on each video segment to obtain a plurality of sampling frames.
Each video clip is sampled at intervals, and the sampling frequency can be one frame per second, ten frames per second, and so on, thereby obtaining a plurality of sampling frames.
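For illustration, interval sampling reduces to keeping every k-th frame; the helper below is a hypothetical sketch of step 302.

```python
# Interval sampling sketch: keep one frame per sampling period.
def sample_frames(frames, video_fps, samples_per_second=1):
    step = max(1, int(video_fps // samples_per_second))
    return frames[::step]
```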
303, obtaining scene information in one of the sampling frames.
One of the sampling frames is selected as a reference frame, and the scene information of the reference frame is acquired. The reference frame may be chosen at random, or one frame may be picked at random from the first few frames, or several consecutive sampling frames may be compared and the frame with the best quality selected as the reference frame. Of course, the scene information of one sampling frame may be obtained from each of the plurality of video clips, so as to obtain the scene information of multiple sampling frames.
304, when the scene information indicates a costume (period) scene, identifying the plurality of sampling frames according to a costume-scene recognition algorithm to obtain the scene label corresponding to each video clip; and when the scene information indicates a modern scene, identifying the plurality of sampling frames according to a modern-scene recognition algorithm to obtain the scene label corresponding to each video clip.
The video analysis mode that takes scene as its main line is mainly applied to long videos such as movies and TV series. A large number of scene samples can be collected for the two major themes of costume dramas and modern dramas, and hundreds of scene classes can be trained, mainly covering the scenes in existing film and television dramas. When video analysis is performed, content integration is carried out on the frequently occurring places and scenes in the video, so as to determine whether the video is based on costume scenes or modern scenes.
In some embodiments, after the step of obtaining the scene tag corresponding to each of the video segments, the method may further include:
acquiring time period information of each video clip, wherein the time period information corresponds to the video clip playing time period;
arranging a plurality of video clips according to time sequence;
when two adjacent scene labels are the same, merging the two scene labels into one scene label, where the merged time period information includes the time period information of the two video clips before merging.
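A minimal sketch of this merging rule, assuming each clip is represented as a (label, start, end) triple sorted by playback time; the representation is an assumption for the example.

```python
# Hedged sketch: merge adjacent clips that carry the same scene label,
# keeping the combined time period of the two clips before merging.
def merge_adjacent_scenes(clips):
    """clips: list of (scene_label, start_sec, end_sec) triples."""
    merged = []
    for label, start, end in sorted(clips, key=lambda c: c[1]):
        if merged and merged[-1][0] == label:
            merged[-1] = (label, merged[-1][1], end)  # extend the time period
        else:
            merged.append((label, start, end))
    return merged
```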
305, classifying the plurality of video clips according to the scene labels to obtain a plurality of video clip sets, and acquiring a scene image from each video clip set as the view label of the corresponding video clip set.
In the scene-main-line video analysis mode, scenes are divided into two basic categories, indoor and outdoor, and different indoor and outdoor scene category labels are defined for costume dramas and modern dramas respectively. The indoor scene labels include the offices, shops, kitchens, and bedrooms of modern dramas, and the bedrooms, studies, and prison cells of costume dramas; the outdoor scenes include the playgrounds, parks, roads, and platforms of modern dramas, and the forests, gardens, cliffs, and markets of costume dramas. This can also be understood as a first classification into costume and modern scenes, followed by a second classification into indoor and outdoor scenes.
In the scene-main-line video analysis mode, a scene database can be established that mainly records the labels and time information of the frequently appearing scenes in the video, that is, the main scenes. Since a scene in a video is generally captured by one fixed camera shot, that is, the frames within one shot unit belong to the same scene category, the category label of the current scene can be obtained by sampling just one frame of the shot and feeding it into the scene classification model.
In some embodiments, in a video analysis mode with a scene as a main line, a scene analysis model may extract scene features by using a deep learning model, and train different scene classifiers by combining methods such as traditional machine learning.
In some embodiments, the scene-main-line video analysis mode integrates the analyzed scene information, merges scene entries with repeated labels, sorts them first by scene occurrence frequency and then by scene occurrence time, and outputs the final scene list.
And 306, obtaining the time point matched with the preset label in each video clip in the video clip set.
307, obtaining one or more target video clips matched with the scene tags according to the time points.
After the time points matching the scene label are obtained in each video clip of the video clip set, one or more target video clips matching the scene label are obtained.
The video clip matching the scene label may be taken in its entirety as the target video clip.
Alternatively, the time point of the start position and the time point of the end position matching the scene label are obtained within the video clip, and the target video clip is then cut out according to these start and end time points. In this way, one or more target video clips can be obtained from the one or more video clips.
And 308, forming a new target video by one target video segment or combining a plurality of target video segments to form a new target video capable of being played continuously.
And finally, one target video clip forms a new target video, or a plurality of target video clips are combined into a new target video, which can play one video clip or several video clips in sequence. The whole video is no longer played in the original order, so only the video clips matching the scene label are displayed to the user, meeting the individual requirements of different users.
In the scene recognition analysis of the embodiment of the present application, a shot unit may be sampled at equal intervals, for example at 1 frame per second, and the sampled frames are sent to scene recognition module A to determine whether the scene is costume or modern. If it is a costume scene, the sampling frames of all remaining shot units are sent directly to the costume scene recognition module; if it is a modern scene, they are sent directly to the modern scene recognition module. Both recognition modules consist of a feature extraction model trained with a Convolutional Neural Network (CNN) and a classifier. After the scene classification unit, a voting analysis is performed on the per-frame results within each shot, that is, the shot is labeled with the label that received the most votes. Finally, after all shot units have been analyzed, adjacent units with the same label are merged, the results are sorted by frequency and time, and a picture of each scene together with its time of appearance is output.
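The per-shot voting can be sketched as a simple majority vote over the sampled frames' predicted labels; `classify` below is a hypothetical stand-in for the CNN feature extractor plus classifier.

```python
# Illustrative per-shot voting: the shot receives the label that gets the
# most votes among its sampled frames. `classify` is an assumed classifier.
from collections import Counter

def shot_scene_label(sampled_frames, classify):
    votes = Counter(classify(frame) for frame in sampled_frames)
    return votes.most_common(1)[0][0]
```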
Referring to fig. 4, fig. 4 is a fourth schematic flow chart of a video processing method according to an embodiment of the present application, where the preset tag is a person tag, and the flow of the video processing method may specifically include:
401, the video is segmented according to the shot to obtain a plurality of video segments.
Video processing in the embodiments of the present application can be built on shot segmentation, that is, each segmented shot unit is used as the object of video analysis. Shot segmentation is performed by modeling the background information of the video and splitting the video into shots using the optical flow information and color distribution information of the background. Within one shot, the optical flow information and the color distribution of the background are relatively stable; when the change in this information between adjacent frames is large and exceeds a threshold value, a shot switch is judged to occur at that timestamp. In this way, the video can be divided according to camera shots to obtain a plurality of video clips.
Each video clip is captured continuously by the same camera shot, and two adjacent video clips are captured by two different camera shots. It should be noted that if footage from the same camera shot appears in two different time periods, that is, the two portions are disconnected in the playback timeline, they are treated as two separate video clips.
And 402, performing face recognition and/or human body recognition on the people in each video segment, determining a target person, and setting a person tag according to the target person.
In some embodiments, the step of performing face recognition and/or body recognition on a plurality of people in each video segment to determine a target person includes:
determining a target person according to a preset rule;
after the target person is determined, the method further comprises the following steps:
acquiring the face characteristics of the target person, and comparing the face characteristics with a preset database;
if the target person is in a preset database, acquiring a first quality value of a face feature in the preset database and a second quality value of the current face feature;
if the second quality value is larger than the first quality value, replacing the face image in the preset database with the current face image;
and if the face characteristics of the target person are not in the preset database, storing the face characteristics and/or the human body characteristics of the target person into the preset database.
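A sketch of this quality-gated update, assuming the database maps a person identifier to a stored feature and its quality score; the dictionary layout and names are assumptions for the example.

```python
# Hedged sketch of the quality-gated database update described above.
# `db` maps person_id -> {"feature": ..., "quality": float}; this layout
# is hypothetical, not taken from the application.
def update_face_db(db, person_id, feature, quality):
    entry = db.get(person_id)
    if entry is None or quality > entry["quality"]:
        # New person, or the current capture has a higher quality value.
        db[person_id] = {"feature": feature, "quality": quality}
```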
In the person identification method of the embodiment of the present application, the video clip of a shot unit may first be sent to a detection module comprising two detectors, a face detector and a human body detector. When a face is detected, the result is sent to a face screening module, whose main function is to keep the faces of the main characters in the image and reject other irrelevant faces. Next, the screened characters are tracked to obtain each character's trajectory stream. The tracking module selects different tracking modes according to the shot distance, which can be judged from the size and proportion of the face detection box. After a character's trajectory stream is obtained, it is sampled at equal intervals at 6 frames per second, and the faces in the sampled frames are detected with the help of the tracking-stream position information (which shortens detection time and ensures that only useful targets are detected). The resulting face group is sent to a video face recognition module, whose output contains two parts: the feature information obtained by fusing the faces in the video segment, and the confidence of that feature, namely a quality score. This group of face features is then compared with the persons in the person database. If no match exists, the face features and a serial number are stored in the database together with one selected face portrait; if a match exists, the quality scores are compared and the feature information and face portrait in the database are updated accordingly. Finally, when all shot units of the video have been analyzed, the appearance times of each main character are counted and sorted, and the character's face image and the corresponding time points of appearance are output.
In some embodiments, the step of obtaining a plurality of time points at which the target person appears in a plurality of video segments comprises:
acquiring lens parameters adopted by a current video clip;
when the shot parameter is a near shot, acquiring a plurality of time points of the target person appearing in a plurality of video clips according to the face feature of the target person;
and when the lens parameters are far lenses, acquiring a plurality of time points of the target person appearing in a plurality of video clips according to the human body characteristics of the target person.
And 403, acquiring a plurality of time points of the target person appearing in a plurality of video clips.
Specifically, the person-main-line video analysis method is mainly applied to long videos featuring a limited number of main characters, such as TV series and movies.
The person-main-line video analysis mode identifies and tracks the main characters in a video with a multi-model method combining face recognition and human body recognition, and records the appearance time of each character.
In the person-main-line video analysis mode, a main-character database needs to be established. This database is generated dynamically for each TV series or movie, that is, each production has its own character database. The characters in the database are the main characters involved in the storyline of the current TV series or movie, not extras or audience members, and the identity of each character is unique, that is, one person never has two identity entries in the database. This makes it convenient for a user to select online the video content in which a certain character appears.
In the person-main-line video analysis mode, when analyzing the people in the video of a certain shot, the main character in the shot is first determined according to a certain rule. The face features of the main character are then extracted and compared with the features in the character database to judge whether the person is already in the database. If not, the face features and body features are stored in the database and marked with an ID; the person is then tracked and the appearance time recorded. If so, the database is updated dynamically according to feature quality: if the quality of the character features in the current segment is better than that of the stored features, the stored features are updated to the current ones; otherwise the features in the database are kept. The person is then tracked and the appearance time recorded.
In the person-main-line video analysis mode, the video tracking mode combines face tracking and body tracking: face tracking is selected in near shots and body tracking in far shots, and a variety of tracking methods may be chosen.
Face features are mainly used to distinguish characters in near shots, while body features are used to recall a character when tracking fails in a far shot. The face features are mainly extracted by a CNN face model. It should be emphasized that the face features used for storage and comparison are video face features obtained through quality screening and multi-frame fusion, which carry more temporal information than the features of a single face image.
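The multi-frame fusion could, for example, be a quality-weighted average of the per-frame face features; the weighting scheme below is an assumption, since this application only states that quality screening and fusion are performed.

```python
# Hedged sketch of multi-frame face-feature fusion with quality weighting.
# The per-frame features and quality scores are assumed model outputs, and
# the weighted-average scheme is an illustrative choice.
import numpy as np

def fuse_face_features(features, qualities):
    """features: (n, d) array of per-frame features; qualities: (n,) scores."""
    weights = np.asarray(qualities, dtype=float)
    weights = weights / weights.sum()
    fused = (np.asarray(features, dtype=float) * weights[:, None]).sum(axis=0)
    return fused, float(np.max(qualities))  # fused feature and its quality score
```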
In the person-main-line video analysis mode, when the analysis of the whole video is finished, the time information of the main characters is fused, sorted by total appearance duration, and output as a list.
And 404, obtaining one or more target video clips matched with the character tags according to the time points.
After the time points matching the person label are obtained in each video clip, one or more target video clips matching the person label are obtained.
The video clip matching the person label may be taken in its entirety as the target video clip.
Alternatively, the time point of the start position and the time point of the end position matching the person label are obtained within the video clip, and the target video clip is then cut out according to these start and end time points. In this way, one or more target video clips can be obtained from the one or more video clips.
And 405, forming a new target video by one target video segment or combining a plurality of target video segments to form a new target video capable of being continuously played.
And finally, one video clip forms a new target video, or a plurality of target video clips are combined into a new target video, which can play one video clip or several video clips in sequence. The whole video is no longer played in the original order, so only the video clips matching the person label are displayed to the user, meeting the individual requirements of different users.
406, displaying the target person information in the target video and a plurality of time points corresponding to the target person information.
In some embodiments, the preset labels of the video processing method may be any one, two, or all three of the scene features, the action features, and the person features, and the specific steps of the video processing method may be adjusted according to the preset labels; the specific steps may be those in the above embodiments.
In some embodiments, the video processing method may mainly comprise two parts, one part for offline video analysis and the other part for online video playback.
Wherein, the offline video analysis part includes: first, according to different customization modes such as characters, scenes, and highlight actions, defining the actor/star labels, scene labels, and highlight action labels appearing in the video, and respectively training deep learning models for character recognition, scene recognition, and action recognition;
then, the input video enters the models, the labels of the characters, the scenes and the actions appearing in the video are analyzed, and the time stamps appearing in the video are recorded.
The timestamps of all character, scene, and action labels are then combined respectively, and when a user selects a playback mode consisting of one mode or a combination of several modes, the video clips matching the selected labels are played automatically.
The online video playback part includes: receiving the playback mode, one mode or a combination of several, selected by the user according to the video type and the user's interest, and playing the corresponding video content.
The offline video analysis part can perform video analysis in different modes, but the video first needs to be shot-segmented, since a video is composed of shot units captured by multiple camera shots, and the background within a single shot is relatively uniform or continuous. Analysis is then performed within each shot unit, and finally the analysis results of the multiple shots are fused.
As can be seen from the above, in the video processing method of the embodiment of the present application, the video is firstly segmented according to the shooting lens to obtain a plurality of video segments; then acquiring preset labels corresponding to the plurality of video clips; then, obtaining time points matched with the preset labels in the video clips; then, one or more target video clips matched with the preset tags are obtained according to the time points; and finally, forming a new target video by one target video segment, or combining a plurality of target video segments to form a new target video capable of being continuously played. One or more target video clips in the video can be acquired according to the preset label, then one target video clip is played, or a plurality of target video clips are played continuously, the whole video is not played according to the original video playing sequence, only the video clip matched with the preset label can be displayed for the user, and the individual requirements of different users are met.
Referring to fig. 5, fig. 5 is a schematic view of a video processing apparatus according to an embodiment of the present disclosure. The video processing apparatus 500 may include a video segmentation module 501, a preset tag obtaining module 502, a time point obtaining module 503, a target video segment obtaining module 504, and a processing module 505.
A video segmentation module 501, configured to segment the video according to camera shots to obtain a plurality of video clips;
a preset tag obtaining module 502, configured to obtain preset tags corresponding to the multiple video segments;
a time point obtaining module 503, configured to obtain time points in the multiple video segments, where the time points are matched with the preset tag;
a target video segment obtaining module 504, configured to obtain one or more target video segments matched with the preset tag according to the time point;
and a processing module 505, configured to form a new target video from one target video segment, or combine multiple target video segments to form a new target video capable of being played continuously.
In some embodiments, the preset tag obtaining module 502 is further configured to perform interval sampling on each of the video clips to obtain a plurality of sampling frames; acquire the scene information in one of the sampling frames; when the scene information indicates a costume scene, identify the plurality of sampling frames according to a costume-scene recognition algorithm to obtain the scene label corresponding to each video clip; when the scene information indicates a modern scene, identify the plurality of sampling frames according to a modern-scene recognition algorithm to obtain the scene label corresponding to each video clip; and classify the video clips according to the scene labels to obtain a plurality of video clip sets, acquiring a scene image from each video clip set as the view label of the corresponding video clip set.
In some embodiments, the preset tag obtaining module 502 is configured to arrange a plurality of video clips according to a time sequence; and when two adjacent scene labels are the same, combining the two scene labels into one scene label, wherein the combined scene label comprises time period information corresponding to two video clips.
In some embodiments, the preset tag obtaining module 502 is configured to extract video optical flow information of each video clip; and determining the video clips with the video optical flow information larger than a preset threshold value as target video clips, and marking action tags.
The time point obtaining module 503 is further configured to extract image information and audio information of the target video segment; acquiring a first probability that each frame in the target video clip is the start of an action and a second probability that the frame is the end of the action according to the image information and the audio information; taking the frame with the maximum first probability as an action starting frame and taking the frame with the maximum second probability as an action ending frame; and obtaining an action video clip set according to the action starting frame and the action ending frame.
In some embodiments, the preset tag obtaining module 502 is further configured to, when the number of motion video clip sets is multiple, arrange multiple motion video sets and output time information of each motion video set.
In some embodiments, the preset tag obtaining module 502 is configured to perform face recognition and/or human body recognition on the people in each video segment, determine a target person, and set a person tag according to the target person.
The time point obtaining module 503 is further configured to obtain a plurality of time points of the target person appearing in the plurality of video segments.
The processing module 505 is further configured to display the target person information in the target video and a plurality of time points corresponding to the target person information.
In some embodiments, the preset tag obtaining module 502 is further configured to determine a target person according to a preset rule; acquiring the face characteristics of the target person, and comparing the face characteristics with a preset database; if the target person is in a preset database, acquiring a first quality value of a face feature in the preset database and a second quality value of the current face feature; if the second quality value is larger than the first quality value, replacing the face image in the preset database with the current face image; and if the face of the target person is not in the preset database, storing the face characteristics and/or the human body characteristics of the target person into the preset database.
In some embodiments, the time point obtaining module 503 is further configured to obtain a shot parameter adopted by the current video segment; when the shot parameter is a near shot, acquiring a plurality of time points of the target person appearing in a plurality of video clips according to the face feature of the target person; and when the lens parameters are far lenses, acquiring a plurality of time points of the target person appearing in a plurality of video clips according to the human body characteristics of the target person.
As can be seen from the above, in the video processing apparatus according to the embodiment of the present application, the video segmentation module 501 segments the video according to the shot to obtain a plurality of video segments; then, the preset tag obtaining module 502 obtains preset tags corresponding to the plurality of video clips; then, the time point obtaining module 503 obtains the time points in the plurality of video segments matching with the preset tag; then, the target video segment obtaining module 504 obtains one or more target video segments matched with the preset tag according to the time point; finally, the processing module 505 forms a new target video from one of the target video segments, or combines a plurality of the target video segments to form a new target video that can be played continuously. One or more target video clips in the video can be acquired according to the preset label, then one target video clip is played, or the target video clips are played continuously, the whole video is not played according to the original video playing sequence, only the video clip matched with the preset label can be displayed for the user, and the individual requirements of different users are met.
An embodiment of the present application further provides a storage medium, where a computer program is stored in the storage medium, and when the computer program runs on a computer, the computer executes the video processing method according to any of the above embodiments.
It should be noted that all or part of the steps in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include, but is not limited to: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disks, optical disks, and the like.
The video processing method, apparatus, and storage medium provided by the embodiments of the present application are described in detail above. The principles and implementations of the present application are explained herein with specific examples, and the description of the above embodiments is only intended to help understand the present application. Meanwhile, for those skilled in the art, there may be variations in the specific implementations and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (9)

1. A video processing method, comprising:
segmenting a video according to a shooting lens to obtain a plurality of video segments;
acquiring preset labels corresponding to the plurality of video clips;
obtaining time points matched with the preset labels in the video clips;
obtaining one or more target video clips matched with the preset labels according to the time points;
forming a new target video by one target video segment or combining a plurality of target video segments to form a new target video capable of being continuously played,
the step of acquiring the preset tag in the video comprises the following steps:
extracting video optical flow information of each video clip;
determining the video clips with the video optical flow information larger than a preset threshold value as target video clips, and marking action tags;
the obtaining of the time point, which is matched with the preset tag, in the plurality of video segments includes:
extracting image information and audio information of the target video clip;
acquiring a first probability that each frame in the target video clip is the start of an action and a second probability that the frame is the end of the action according to the image information and the audio information;
taking the frame with the maximum first probability as an action starting frame and taking the frame with the maximum second probability as an action ending frame;
and obtaining an action video clip set and a time point of the action video clip set according to the action starting frame and the action ending frame.
2. The video processing method according to claim 1, wherein the step of obtaining the preset labels corresponding to the plurality of video segments comprises:
sampling each video clip at intervals to obtain a plurality of sampling frames;
acquiring scene information in one of the sampling frames;
when the scene information is a costume scene, identifying a plurality of sampling frames according to a costume-scene recognition algorithm to obtain a scene label corresponding to each video clip;
when the scene information is a modern scene, identifying a plurality of sampling frames according to a modern-scene recognition algorithm to obtain a scene label corresponding to each video clip;
and classifying the video clips according to the scene labels to obtain a plurality of video clip sets, and acquiring a scene graph from each video clip set as a view label of the corresponding video clip set.
3. The video processing method of claim 2, wherein the step of classifying the plurality of video segments according to the scene tags is preceded by the step of:
acquiring time period information of each video clip, wherein the time period information corresponds to the video clip playing time period;
arranging a plurality of video clips according to time sequence;
when two adjacent scene labels are the same, the two scene labels are combined into one scene label, and the combined time period information comprises the time period information of the two video clips before combination.
4. The video processing method according to claim 1, wherein after the step of obtaining the motion video segment set according to the motion start frame and the motion end frame, the method further comprises:
when the number of the action video clip sets is multiple, arranging the action video sets, and outputting the time information of each action video set.
5. The video processing method according to claim 1, wherein the step of obtaining the preset tag in the video comprises:
carrying out face recognition and/or human body recognition on the characters in each video clip, determining a target character, and setting character tags according to the target character;
the obtaining of the time point, which is matched with the preset tag, in the plurality of video segments includes:
acquiring a plurality of time points of the target person appearing in a plurality of video clips;
after forming a new target video from one target video segment or combining a plurality of target video segments to form a new target video capable of being played continuously, the method further comprises the following steps:
and displaying the target person information in the target video and a plurality of time points corresponding to the target person information.
6. The video processing method of claim 5, wherein the step of performing face recognition and/or body recognition on the plurality of people in each video segment to determine the target person comprises:
determining a target person according to a preset rule;
after the step of determining the target person, the method further comprises the following steps:
acquiring the face characteristics of the target person, and comparing the face characteristics with a preset database;
if the target person is in a preset database, acquiring a first quality value of a face feature in the preset database and a second quality value of the current face feature;
if the second quality value is larger than the first quality value, replacing the face image in the preset database with the current face image;
and if the face of the target person is not in the preset database, storing the face characteristics and/or the human body characteristics of the target person into the preset database.
7. The video processing method according to claim 5, wherein the step of obtaining a plurality of time points at which the target person appears in a plurality of video clips comprises:
acquiring lens parameters adopted by a current video clip;
when the shot parameter is a near shot, acquiring a plurality of time points of the target person appearing in a plurality of video clips according to the face feature of the target person;
and when the lens parameters are far lenses, acquiring a plurality of time points of the target person appearing in a plurality of video clips according to the human body characteristics of the target person.
8. A video processing apparatus, comprising:
the video segmentation module is used for segmenting the video according to the shooting lens to obtain a plurality of video segments;
the preset tag obtaining module is configured to obtain preset tags corresponding to the plurality of video clips, and specifically includes: extracting video optical flow information of each video clip;
determining the video clips with the video optical flow information larger than a preset threshold value as target video clips, and marking action tags;
the obtaining of the time point, which is matched with the preset tag, in the plurality of video segments includes:
extracting image information and audio information of the target video clip;
acquiring a first probability that each frame in the target video clip is the start of an action and a second probability that the frame is the end of the action according to the image information and the audio information;
taking the frame with the maximum first probability as an action starting frame and taking the frame with the maximum second probability as an action ending frame;
obtaining an action video clip set and a time point of the action video clip set according to the action starting frame and the action ending frame;
the time point acquisition module is used for acquiring time points matched with the preset labels in the plurality of video clips;
the target video clip acquisition module is used for acquiring one or more target video clips matched with the preset tags according to the time points;
and the processing module is used for forming a new target video from one target video segment or combining a plurality of target video segments to form a new target video capable of being continuously played.
9. A storage medium having stored therein a computer program which, when run on a computer, causes the computer to execute a video processing method according to any one of claims 1 to 7.
CN201910193143.1A 2019-03-14 2019-03-14 Video processing method, device and storage medium Active CN109922373B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910193143.1A CN109922373B (en) 2019-03-14 2019-03-14 Video processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910193143.1A CN109922373B (en) 2019-03-14 2019-03-14 Video processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN109922373A CN109922373A (en) 2019-06-21
CN109922373B (en) 2021-09-28

Family

ID=66964775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910193143.1A Active CN109922373B (en) 2019-03-14 2019-03-14 Video processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN109922373B (en)

Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110337009A (en) * 2019-07-01 2019-10-15 百度在线网络技术(北京)有限公司 Control method, device, equipment, and storage medium for video playing
CN110381391B (en) * 2019-07-11 2021-11-09 北京字节跳动网络技术有限公司 Video fast slicing method and device and electronic equipment
CN110225369B (en) * 2019-07-16 2020-09-29 百度在线网络技术(北京)有限公司 Video selective playing method, device, equipment and readable storage medium
CN110505143A (en) * 2019-08-07 2019-11-26 A method and apparatus for sending a target video
CN110633648B (en) * 2019-08-21 2020-09-11 重庆特斯联智慧科技股份有限公司 Face recognition method and system in natural walking state
CN110602546A (en) * 2019-09-06 2019-12-20 Oppo广东移动通信有限公司 Video generation method, terminal and computer-readable storage medium
CN110933462B (en) * 2019-10-14 2022-03-25 咪咕文化科技有限公司 Video processing method, system, electronic device and storage medium
CN112749299A (en) * 2019-10-31 2021-05-04 北京国双科技有限公司 Method and device for determining video type, electronic equipment and readable storage medium
CN110675433A (en) * 2019-10-31 2020-01-10 北京达佳互联信息技术有限公司 Video processing method and device, electronic equipment and storage medium
CN110913271B (en) * 2019-11-29 2022-01-18 Oppo广东移动通信有限公司 Video processing method, mobile terminal and non-volatile computer-readable storage medium
CN111161392B (en) * 2019-12-20 2022-12-16 苏宁云计算有限公司 Video generation method and device and computer system
CN111144498B (en) * 2019-12-26 2023-09-01 深圳集智数字科技有限公司 Image recognition method and device
JP6828133B1 (en) * 2019-12-27 2021-02-10 株式会社ドワンゴ Content generation device, content distribution server, content generation method, and content generation program
CN113128261A (en) * 2019-12-30 2021-07-16 阿里巴巴集团控股有限公司 Data processing method and device and video processing method and device
CN113163272B (en) * 2020-01-07 2022-11-25 海信集团有限公司 Video editing method, computer device and storage medium
CN111274960A (en) * 2020-01-20 2020-06-12 央视国际网络有限公司 Video processing method and device, storage medium and processor
CN111209897B (en) * 2020-03-09 2023-06-20 深圳市雅阅科技有限公司 Video processing method, device and storage medium
CN111444819B (en) * 2020-03-24 2024-01-23 北京百度网讯科技有限公司 Cut frame determining method, network training method, device, equipment and storage medium
CN111460219B (en) * 2020-04-01 2023-07-14 百度在线网络技术(北京)有限公司 Video processing method and device and short video platform
CN111506771B (en) * 2020-04-22 2021-04-02 上海极链网络科技有限公司 Video retrieval method, device, equipment and storage medium
CN111586494B (en) * 2020-04-30 2022-03-11 腾讯科技(深圳)有限公司 Intelligent strip splitting method based on audio and video separation
CN111711861B (en) * 2020-05-15 2022-04-12 北京奇艺世纪科技有限公司 Video processing method and device, electronic equipment and readable storage medium
CN111711855A (en) * 2020-05-27 2020-09-25 北京奇艺世纪科技有限公司 Video generation method and device
CN111914682B (en) * 2020-07-13 2024-01-05 完美世界控股集团有限公司 Teaching video segmentation method, device and equipment containing presentation file
CN111918122A (en) * 2020-07-28 2020-11-10 北京大米科技有限公司 Video processing method and device, electronic equipment and readable storage medium
CN112069357B (en) * 2020-07-29 2024-03-01 北京奇艺世纪科技有限公司 Video resource processing method and device, electronic equipment and storage medium
CN112016427A (en) * 2020-08-21 2020-12-01 广州欢网科技有限责任公司 Video strip splitting method and device
CN112153478B (en) * 2020-09-11 2022-03-08 腾讯科技(深圳)有限公司 Video processing method and video playing method
CN113012723B (en) * 2021-03-05 2022-08-30 北京三快在线科技有限公司 Multimedia file playing method and device and electronic equipment
CN113038163B (en) * 2021-03-26 2023-06-23 百果园技术(新加坡)有限公司 User experience model training method, short video user experience assessment method and device
CN113825012B (en) * 2021-06-04 2023-05-30 腾讯科技(深圳)有限公司 Video data processing method and computer device
CN113691864A (en) * 2021-07-13 2021-11-23 北京百度网讯科技有限公司 Video clipping method, video clipping device, electronic equipment and readable storage medium
CN114339391A (en) * 2021-08-18 2022-04-12 腾讯科技(深圳)有限公司 Video data processing method, video data processing device, computer equipment and storage medium
CN114125541A (en) * 2021-11-11 2022-03-01 百度在线网络技术(北京)有限公司 Video playing method, video playing device, electronic equipment, storage medium and program product
CN113891157A (en) * 2021-11-11 2022-01-04 百度在线网络技术(北京)有限公司 Video playing method, video playing device, electronic equipment, storage medium and program product
CN113891156A (en) * 2021-11-11 2022-01-04 百度在线网络技术(北京)有限公司 Video playing method, video playing device, electronic equipment, storage medium and program product
CN114302253B (en) * 2021-11-25 2024-03-12 北京达佳互联信息技术有限公司 Media data processing method, device, equipment and storage medium
CN114022828A (en) * 2022-01-05 2022-02-08 北京金茂教育科技有限公司 Video stream processing method and device
CN114528923B (en) * 2022-01-25 2023-09-26 山东浪潮科学研究院有限公司 Video target detection method, device, equipment and medium based on time domain context
CN115086783B (en) * 2022-06-28 2023-10-27 北京奇艺世纪科技有限公司 Video generation method and device and electronic equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631422A (en) * 2015-12-28 2016-06-01 北京酷云互动科技有限公司 Video identification method and video identification system
CN107273782A (en) * 2016-04-08 2017-10-20 微软技术许可有限责任公司 Detected using the online actions of recurrent neural network
CN107820138A (en) * 2017-11-06 2018-03-20 广东欧珀移动通信有限公司 Video broadcasting method, device, terminal and storage medium
CN107958234A (en) * 2017-12-26 2018-04-24 深圳云天励飞技术有限公司 Client-based face identification method, device, client and storage medium
CN108337532A (en) * 2018-02-13 2018-07-27 腾讯科技(深圳)有限公司 Perform mask method, video broadcasting method, the apparatus and system of segment
CN108830208A (en) * 2018-06-08 2018-11-16 Oppo广东移动通信有限公司 Method for processing video frequency and device, electronic equipment, computer readable storage medium
CN109063611A (en) * 2018-07-19 2018-12-21 北京影谱科技股份有限公司 A kind of face recognition result treating method and apparatus based on video semanteme

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550713A (en) * 2015-12-21 2016-05-04 中国石油大学(华东) Video event detection method of continuous learning
CN107766992B (en) * 2017-11-09 2021-07-20 上海电力学院 Household daily load curve fine prediction method based on user behaviourology
CN108804578B (en) * 2018-05-24 2022-06-07 南京理工大学 Unsupervised video abstraction method based on consistency segment generation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Deep learning method for image content understanding; Yi Junkai, He Xiaoran, Jiang Daguang; Computer Engineering and Design; 2017-03-16; Vol. 38, No. 03; pp. 756-760 *

Also Published As

Publication number Publication date
CN109922373A (en) 2019-06-21

Similar Documents

Publication Publication Date Title
CN109922373B (en) Video processing method, device and storage medium
CN106792100B (en) Video bullet screen display method and device
US8195038B2 (en) Brief and high-interest video summary generation
CN103442262B (en) User watched behavior analysis method and system based on television terminal video program
CN111683209A (en) Mixed-cut video generation method and device, electronic equipment and computer-readable storage medium
KR101197978B1 (en) Laugh detector and system and method for tracking an emotional response to a media presentation
US9009054B2 (en) Program endpoint time detection apparatus and method, and program information retrieval system
CN107615766A (en) System and method for creating and distributing content of multimedia
Gong et al. Video summarization and retrieval using singular value decomposition
WO2007020897A1 (en) Video scene classification device and video scene classification method
JP5391144B2 (en) Facial expression change degree measuring device, program thereof, and program interest degree measuring device
US20100005485A1 (en) Annotation of video footage and personalised video generation
JP2011217209A (en) Electronic apparatus, content recommendation method, and program
KR20040077708A (en) A method and apparatus for multimodal story segmentation for linking multimedia content
CN113259780B (en) Holographic multidimensional audio and video playing progress bar generating, displaying and playing control method
KR20000009742A (en) Specific character appearing section detecting system
JP5360979B2 (en) Important information extraction method and apparatus
CN110519620A (en) Method for recommending TV programmes in a television set, and television set
CN112312142B (en) Video playing control method and device and computer readable storage medium
JP2007200249A (en) Image search method, device, program, and computer readable storage medium
CN114339423A (en) Short video generation method and device, computing equipment and computer readable storage medium
Husa et al. HOST-ATS: automatic thumbnail selection with dashboard-controlled ML pipeline and dynamic user survey
CN113992973A (en) Video abstract generation method and device, electronic equipment and storage medium
CN112287771A (en) Method, apparatus, server and medium for detecting video event
KR20180089977A (en) System and method for video segmentation based on events

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant