CN111209897B - Video processing method, device and storage medium - Google Patents

Video processing method, device and storage medium

Info

Publication number
CN111209897B
Authority
CN
China
Prior art keywords
video
human body
human
body region
training
Prior art date
Legal status
Active
Application number
CN202010157708.3A
Other languages
Chinese (zh)
Other versions
CN111209897A (en)
Inventor
吴韬
徐叙远
刘孟洋
Current Assignee
Shenzhen Yayue Technology Co ltd
Original Assignee
Shenzhen Yayue Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Yayue Technology Co ltd
Priority to CN202010157708.3A
Publication of CN111209897A
Application granted
Publication of CN111209897B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47 Detecting features for summarising video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265 Mixing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The invention relates to a video processing method, a video processing device and a storage medium. The method comprises the following steps: acquiring a video to be processed and a target human body region; detecting a plurality of human body regions in the video to be processed; inputting the plurality of human body regions into a trained feature extraction network to obtain a plurality of first features respectively describing the plurality of human body regions, and inputting the target human body region into the trained feature extraction network to obtain a second feature describing the target human body region; comparing the plurality of first features with the second feature respectively to obtain at least one first matching feature that matches the second feature; determining the corresponding time points of the at least one first matching feature in the video to be processed; and processing the video to be processed based on the respective time points to obtain a video portion associated with the target object. The feature extraction network is trained using a data set constructed based on human body region sample sets, and the human body region sample sets are generated separately for a plurality of video segments divided according to video shots.

Description

Video processing method, device and storage medium
Technical Field
The invention relates to the technical field of deep learning and computer vision, in particular to a video processing method, a video processing device and a storage medium.
Background
With the development of multimedia technology, images, audio and video add much fun to people's lives. When watching video files, people typically choose to watch the segments they are interested in. Current video clipping is generally based on certain specific categories or specific scenes, for example judging whether a segment is a highlight based on specific shots or text cues in sports and game videos (e.g., a goal or a shot in sports video, a kill or a five kill in game video), and clipping the video accordingly. Users may also wish to view only the segments of a video that involve a particular person. In this case, the related art generally identifies the person in the video picture through face recognition to complete the clip for that specific person.
Disclosure of Invention
In the technical scheme of identifying video clips containing a specific person through face recognition, in some cases the video clips containing the specific person cannot be identified, or cannot be identified accurately. For example, when the face of the specific person is unclear or incomplete, when the person is shown from the side or from behind, or when the person's motion amplitude is large (e.g., in a fight), clipping for the specific person based on face recognition performs poorly. Embodiments of the present invention address, at least in part, the above-mentioned problems.
According to an aspect of the present invention, a video processing method is presented. The method comprises the following steps: acquiring a video to be processed and a target human body region representing a target object; detecting a plurality of human body regions in the video to be processed; inputting the plurality of human body regions into a trained feature extraction network to obtain a plurality of first features respectively describing the plurality of human body regions, and inputting the target human body region into the trained feature extraction network to obtain a second feature describing the target human body region; comparing the plurality of first features with the second feature respectively to obtain, among the first features, at least one first matching feature that matches the second feature; determining the corresponding time points of the at least one first matching feature in the video to be processed; and processing the video to be processed based on the respective time points to obtain a video portion associated with the target object. The feature extraction network is trained using a data set constructed based on human body region sample sets, and the human body region sample sets are respectively generated for a plurality of video segments divided according to video shots.
In some embodiments, the dataset is constructed by: acquiring a training video for a feature extraction network; dividing a training video into a plurality of training video segments according to video shooting shots; creating, for each of a plurality of training video segments, one or more human region sample sets of the training video segments; determining whether one or more human body region sample sets contain human faces; in response to determining that each of the one or more body region sample sets contains a face, the one or more body region sample sets are combined based on features of the face to construct a training dataset.
In some embodiments, for each of the plurality of training video segments, creating one or more human region sample sets of the training video segments comprises: for each of a plurality of training video segments, each training video segment comprising a plurality of video frames belonging to the same video shot, detecting a human body region in the plurality of video frames; judging the similarity between the two or more detected human body regions; two or more human body regions whose similarity meets a predetermined threshold range are added to the same set to generate one or more human body region sample sets of training video segments.
In some embodiments, in response to determining that a face is contained in each of the one or more sets of human region samples, merging the one or more sets of human region samples based on features of the face to construct the training data set comprises: in response to determining that each of the one or more body region sample sets contains a face, respectively selecting the same predetermined number of faces from each of the body region sample sets; comparing the similarity of the faces selected from each human body region sample set; and merging the human body region sample sets with the human face similarity higher than the first preset threshold value to construct a training data set.
In some embodiments, the dataset is further constructed by: determining, using pedestrian re-identification (ReID), human body regions in the same human body region sample set whose similarity is lower than a second predetermined threshold; and removing the human body regions whose similarity is below the second predetermined threshold from the human body region sample set.
In some embodiments, determining the similarity between the two or more detected human body regions comprises: determining the similarity between the two or more detected human body regions based on hand-crafted (artificial) features.
In some embodiments, the plurality of human body regions in the video to be processed are detected by a single-shot multibox detector (SSD).
In some embodiments, processing the video to be processed based on the respective time points to obtain a video portion associated with the target object includes: splicing segments of the video to be processed based on the timestamps of the respective time points to obtain the video portion associated with the target object.
According to another aspect of the invention, a method for constructing a dataset for training a feature extraction network is presented. The method comprises the following steps: acquiring a training video for a feature extraction network; dividing a training video into a plurality of training video segments according to video shooting shots; creating, for each of a plurality of training video segments, one or more human region sample sets of the training video segments; determining whether one or more human body region sample sets contain human faces; in response to determining that each of the one or more body region sample sets contains a face, the one or more body region sample sets are combined based on features of the face to construct a training dataset.
In some embodiments, for each of the plurality of training video segments, creating one or more human region sample sets of the training video segments comprises: for each of a plurality of training video segments, each training video segment comprising a plurality of video frames belonging to the same video shot, detecting a human body region in the plurality of video frames; judging the similarity between the two or more detected human body regions; two or more human body regions whose similarity meets a predetermined threshold range are added to the same set to generate one or more human body region sample sets of training video segments.
In some embodiments, in response to determining that a face is contained in each of the one or more sets of human region samples, merging the one or more sets of human region samples based on features of the face to construct the training data set comprises: in response to determining that each of the one or more body region sample sets contains a face, respectively selecting the same predetermined number of faces from each of the body region sample sets; comparing the similarity of the faces selected from each human body region sample set; and merging the human body region sample sets with the human face similarity higher than the first preset threshold value to construct a training data set.
In some embodiments, the dataset is further constructed by: determining, using pedestrian re-identification (ReID), human body regions in the same human body region sample set whose similarity is lower than a second predetermined threshold; and removing the human body regions whose similarity is below the second predetermined threshold from the human body region sample set.
According to another aspect of the present invention, a training method of a feature extraction network is provided, including: a training video for a feature extraction network is acquired, a training dataset is constructed based on the acquired training video using the method of constructing a dataset as in the previous aspect, and the feature extraction network is trained using the dataset to extract features describing a region of the human body.
According to another aspect of the present invention, a video processing apparatus is presented. The apparatus comprises: an acquisition module, a human body detection module, a feature extraction module, a comparison module, a time point determination module and a video processing module. The acquisition module is configured to acquire a video to be processed and a target human body region representing a target object. The human body detection module is configured to detect a plurality of human body regions in the video to be processed. The feature extraction module is configured to input the plurality of human body regions into a trained feature extraction network to obtain a plurality of first features that respectively describe the plurality of human body regions, and to input the target human body region into the trained feature extraction network to obtain a second feature that describes the target human body region, wherein the feature extraction network is trained using a dataset constructed based on human body region sample sets that are respectively generated for a plurality of video segments divided according to video shots. The comparison module is configured to compare the plurality of first features with the second feature respectively, obtaining at least one first matching feature, among the first features, that matches the second feature. The time point determination module is configured to determine the corresponding time points of the at least one first matching feature in the video to be processed. The video processing module is configured to process the video to be processed based on the respective time points to obtain a video portion associated with the target object.
According to another aspect of the invention, a construction device for a dataset for training a feature extraction network is presented. The device comprises: an acquisition module, a video segmentation module, a set creation module, a determination module, a set merging module and a data set construction module. The acquisition module is configured to acquire a training video for the feature extraction network. The video segmentation module is configured to divide the training video into a plurality of training video segments according to video shots. The set creation module is configured to create, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment. The determination module is configured to determine whether a human face is contained in the one or more human body region sample sets. The set merging module is configured to, in response to determining that a human face is contained in each of the one or more human body region sample sets, merge the one or more human body region sample sets based on features of the human face to construct a training data set.
According to another aspect of the present invention, there is provided a training apparatus of a feature extraction network, including: an acquisition module configured to acquire training videos for the feature extraction network, a dataset construction module configured to construct a training dataset using the method of constructing a dataset as above based on the acquired training videos, and a training module configured to train the feature extraction network using the dataset to extract features describing a region of a human body.
According to some embodiments of the present invention, there is provided a computer device comprising: a processor; and a memory having instructions stored thereon that, when executed on the processor, cause the processor to perform any of the methods as above.
According to some embodiments of the present invention, there is provided a computer readable storage medium having instructions stored thereon, which when executed on a processor, cause the processor to perform any of the methods as above.
The video processing method, device and storage medium provided by the invention analyze the characters in video content using deep learning, and clip segments of the same character in a video through a trained feature extraction network. The video processing method can automatically segment out the clips featuring the same character in a video (such as a film, a TV drama or a variety show), saves a great deal of manpower and time cost, improves editing efficiency, is also beneficial to later video production, and enhances user experience.
Drawings
Embodiments of the present invention will now be described in more detail, by way of non-limiting example, with reference to the accompanying drawings, which are merely illustrative, and in which like reference numerals refer to like parts throughout, and in which:
FIG. 1 schematically illustrates a graphical user interface schematic according to one embodiment of the invention;
FIG. 2 schematically illustrates an example application scenario according to one embodiment of the invention;
FIG. 3 schematically illustrates a network framework diagram for target character video processing in accordance with one embodiment of the present invention;
FIG. 4 schematically shows a schematic diagram of the structure of a single-shot multibox detector;
FIG. 5 schematically illustrates a flow chart of a video processing method according to one embodiment of the invention;
FIG. 6 schematically illustrates a flow chart of a method of constructing a dataset according to another embodiment of the invention;
fig. 7 schematically shows a schematic diagram of a video processing apparatus according to an embodiment of the invention;
FIG. 8 schematically illustrates a schematic diagram of an apparatus for constructing a dataset according to another embodiment of the invention; and
fig. 9 schematically shows a schematic diagram of an example computer device for video processing and/or constructing a data set.
Detailed Description
The following description provides specific details for a thorough understanding and implementation of various embodiments of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these details. In some instances, well-known structures and functions have not been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments of the disclosure. The terminology used in the present disclosure is to be understood in its broadest reasonable manner, even though it is being used in conjunction with a particular embodiment of the present disclosure.
First, some terms involved in the embodiments of the present disclosure are explained to facilitate understanding by those skilled in the art:
Deep learning (Deep Learning, DL): a multi-layer perceptron with multiple hidden layers is a typical deep learning structure. Deep learning forms more abstract high-level representations of attribute categories or features by combining low-level features, in order to discover distributed feature representations of data. The motivation for studying deep learning is to build neural networks that simulate the human brain for analysis and learning, mimicking the mechanisms by which the human brain interprets data such as images, sounds and text.
Computer vision technology (Computer Vision, CV): computer vision is the science of how to make machines "see". More specifically, computer vision refers to machine vision in which cameras and computers replace human eyes to identify, detect and measure targets, with further graphic processing so that the result is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theory and technology in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, simultaneous localization and mapping, and the like, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
A convolutional neural network (Convolutional Neural Network, CNN) is a type of feedforward neural network that includes convolution calculations and has a deep structure, and is one of the representative algorithms of deep learning. A convolutional neural network has feature learning capability and can perform translation-invariant classification of input information according to its hierarchical structure.
The single-shot multibox detector (Single Shot MultiBox Detector, SSD) is a method for detecting objects in an image based on a single deep neural network. It discretizes the output space of bounding boxes by placing a series of default bounding boxes with different aspect ratios and different scales at the location of each feature map cell. At prediction time, the neural network generates, for each default bounding box, a score for whether it belongs to a certain class, and generates a correction to the bounding box so that it better fits the shape of the object.
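As a concrete illustration of how an SSD-style detector can be used for the human body detection step, the following sketch uses a publicly available pretrained SSD300-VGG16 model from torchvision and keeps only boxes of the "person" class. The model name, the weights argument and the COCO label id (1 = person) are assumptions about that library, not details taken from this patent.

```python
# Hedged sketch: person detection with a pretrained SSD (assumed torchvision API).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

detector = torchvision.models.detection.ssd300_vgg16(weights="DEFAULT").eval()

def detect_person_boxes(frame_rgb, score_thresh=0.5):
    """frame_rgb: HxWx3 uint8 RGB image; returns [x1, y1, x2, y2] boxes of detected persons."""
    with torch.no_grad():
        out = detector([to_tensor(frame_rgb)])[0]
    keep = (out["labels"] == 1) & (out["scores"] > score_thresh)  # COCO class 1 = person
    return out["boxes"][keep].tolist()
```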
Scale-invariant feature transform (Scale-Invariant Feature Transform, SIFT) is a feature descriptor with scale invariance and illumination invariance, together with a body of theory for feature extraction. It was first published by D. G. Lowe in 2004, and has been implemented, extended and used in the open-source algorithm library OpenCV. SIFT features remain stable under rotation, scaling, brightness variations and the like, and are very robust local features.
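Where the description below judges the similarity of human body regions with hand-crafted features, a SIFT-based comparison could look like the following sketch, which uses OpenCV's SIFT implementation, brute-force matching and a ratio test. The 0.75 ratio and the normalisation by descriptor count are illustrative assumptions rather than the patent's actual measure.

```python
# Hedged sketch: SIFT-based similarity between two human-region crops (assumed thresholds).
import cv2

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher(cv2.NORM_L2)

def sift_similarity(crop_a, crop_b):
    """Rough similarity in [0, 1] between two BGR image crops."""
    _, des_a = sift.detectAndCompute(cv2.cvtColor(crop_a, cv2.COLOR_BGR2GRAY), None)
    _, des_b = sift.detectAndCompute(cv2.cvtColor(crop_b, cv2.COLOR_BGR2GRAY), None)
    if des_a is None or des_b is None:
        return 0.0
    pairs = matcher.knnMatch(des_a, des_b, k=2)
    good = [p[0] for p in pairs if len(p) == 2 and p[0].distance < 0.75 * p[1].distance]
    return len(good) / max(len(des_a), len(des_b))
```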
Pedestrian re-identification (Person Re-identification, ReID) is a technique that uses computer vision to determine whether a particular pedestrian is present in an image or video sequence. It is a sub-problem of image retrieval: given a detected pedestrian image, the same pedestrian is retrieved across devices, for example retrieving images of the same pedestrian captured by different cameras.
Triplet loss function (Triplet Loss Function): a triplet contains three samples, for example (anchor, pos, neg), where anchor denotes the target, pos a positive sample and neg a negative sample. The triplet loss is an objective function that requires the distance from the target to the negative sample to be greater than the distance from the target to the positive sample plus a predetermined margin.
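The triplet objective described above can be written compactly; the following PyTorch sketch pushes the anchor-negative distance to exceed the anchor-positive distance by at least a margin. The margin value and the choice of PyTorch are illustrative assumptions.

```python
# Hedged sketch of the triplet loss: d(anchor, neg) should exceed d(anchor, pos) + margin.
import torch
import torch.nn.functional as F

def triplet_loss(anchor, pos, neg, margin=0.3):
    d_ap = F.pairwise_distance(anchor, pos)      # distance from target to positive sample
    d_an = F.pairwise_distance(anchor, neg)      # distance from target to negative sample
    return F.relu(d_ap - d_an + margin).mean()   # zero once d_an > d_ap + margin

# torch.nn.TripletMarginLoss(margin=0.3) provides the same objective out of the box.
```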
The main purpose of the present invention is to analyze the characters in video content using deep learning, and to clip video segments of the same character through a feature extraction network. Because a human body in video appears in multiple poses, at multiple angles and at multiple scales, distinguishing the same human body region across a video segment is a complex task. The invention uses a convolutional neural network (e.g., the single-shot multibox detector, SSD) to detect human body regions in the video and then extracts the corresponding human body features. The invention uses these human body features to locate the same human body in the video, and can automatically and effectively segment out the clips featuring the same character in the video.
FIG. 1 schematically illustrates a schematic diagram of a graphical user interface 100 according to one embodiment of the invention. The graphical user interface 100 may be displayed on various user terminals, such as a notebook computer, a personal computer, a tablet computer, a cell phone, a television, and the like. The video 101 is a video viewed by a user through a user terminal. The video 101 can be automatically clipped into a video clip about a selected target object, such as a target person, in the video 101 by the video processing method provided by the embodiment of the invention. The selected target person may be one or more. For example, the target persona may be a particular star or a particular character. Icons 102 of automatically-generated character video clips are also displayed on the graphical user interface 100. When viewing the video 101, the user can easily view a video clip of a corresponding person of interest by clicking on the corresponding icon 102.
FIG. 2 illustrates an example application scenario 200 according to one embodiment of the invention. The server 201 is connected to a user terminal 203 via a network 202. The user terminal 203 may be, for example, a notebook computer, a personal computer, a tablet computer, a mobile phone, a television, etc. Network 202 may include wired networks (e.g., LAN, cable, etc.), wireless networks (e.g., WLAN, cellular, satellite, etc.), the internet, etc. An application program for viewing video is installed on the user terminal 203. When viewing video through the application and wishing to view a video clip of a person of interest, the user may click on the icon of the corresponding person clip presented by the application. In response to the user clicking on the icon of the corresponding person clip, the application presents the clip of that person. Notably, the clips of the respective persons are obtained at the server 201, at the user terminal 203, or at both, by performing the video processing method proposed by the present invention.
Fig. 3 schematically illustrates a schematic diagram of a network framework 300 for target character video processing in accordance with one embodiment of the invention. First, for the target character to be clipped, the character feature F 304 of the target character is obtained by inputting the target character human body region 302 into the feature extraction network 310 for character feature extraction. The video 301 to be processed is input into the human body detection network 309 to obtain all human body regions 303 detected in the video 301 to be processed. The human body detection network 309 and the feature extraction network 310 are described in further detail below. The individual human body regions 303 are then input into the above-described feature extraction network 310, a feature P_i is extracted for each human body region, and the time point T_i of the human body in the video (e.g., a timestamp) is recorded. Then, the features P_i of all human body regions in the video to be processed are put into a feature pool 305. The character feature F 304 of the target character and each feature P_i in the feature pool 305 are input into a matching calculation module 311 for similarity calculation, to obtain all features P_k 306 among the P_i that match the character feature F. Feature matching may be achieved, for example, by calculating Euclidean distances between different features: a distance d between a P_i in the feature pool and the character feature F is calculated, and if d is less than a predetermined threshold, this P_i is determined to match the character feature F, i.e., the person corresponding to the human body region of this P_i corresponds to the target character. The matched features P_k 306 are passed to a temporal aggregation module 312, which finds the corresponding time points of the matching features P_k 306 and aggregates the time points T_k in chronological order, resulting in a plurality of aggregated time points 307. The video clipping module 313 clips the video based on the timestamps of the aggregated time points to form the segments corresponding to the target character, that is, all video frames containing the target character in the video to be processed.
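A minimal sketch of the matching step in Fig. 3 is given below: each feature P_i from the feature pool is compared with the target character feature F by Euclidean distance, and the time points of the matches are collected for temporal aggregation. The distance threshold and the data layout are assumptions for illustration only.

```python
# Hedged sketch: Euclidean-distance matching of pooled features against the target feature F.
import numpy as np

def find_matching_time_points(target_feature, region_features, region_times, dist_thresh=1.0):
    """region_features: 1-D feature vectors P_i; region_times: timestamps T_i in seconds."""
    matches = []
    for p_i, t_i in zip(region_features, region_times):
        d = np.linalg.norm(np.asarray(p_i) - np.asarray(target_feature))
        if d < dist_thresh:           # P_i matches the target character feature F
            matches.append(t_i)
    return sorted(matches)            # the T_k values, ready for temporal aggregation
```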
The above describes, for the case where the target character is a single character, how the segments for that target character are obtained through the video processing method provided by the present invention. It should be appreciated that in other embodiments there may be multiple target characters.
Fig. 4 schematically shows a schematic diagram of the structure of a single-shot multibox detector. The human detection network used herein adopts the single-shot multibox detector (SSD) structure. The SSD detection network performs very well in both detection speed and detection precision. Specifically, the human body detection efficiency of the SSD detection network can reach 100 frames per second on a graphics processor (GPU) while guaranteeing a detection rate higher than 85%. The structure of SSD is based on VGG-16, because VGG-16 provides high-quality image classification and transfer learning to improve results. SSD adjusts VGG-16 by replacing the original fully-connected layers, starting from the Conv6 layer, with a series of auxiliary convolutional layers. By using the auxiliary convolution layers, features at multiple scales of the image can be extracted, while the size of each convolution layer is progressively reduced.
Fig. 5 schematically shows a flow chart of a video processing method 500 according to an embodiment of the invention. The method may be executed by a user terminal or a server, or by both; this embodiment is described taking execution by the server as an example. In step 501, a video to be processed and a target human body region representing a target object are acquired. Here, the target human body region may be obtained by inputting an image sample of the target object, or a video sample containing the target object, into a human body detection network (for example, an SSD). In step 502, a plurality of human body regions in the video to be processed are detected using the human body detection network. In step 503, the plurality of human body regions are input into a trained feature extraction network to obtain a plurality of first features respectively describing the plurality of human body regions, and the target human body region is input into the trained feature extraction network to obtain a second feature describing the target human body region. How this feature extraction network is trained is described in detail below. Note that the feature extraction network is trained using a data set constructed based on human body region sample sets, and the human body region sample sets are generated separately for a plurality of video segments divided according to video shots. In step 504, the plurality of first features are compared with the second feature respectively, to obtain at least one first matching feature among the first features that matches the second feature. For example, if the first features are P_i and the second feature is F, each P_i in the feature pool composed of the P_i is compared with F to find the P_k that match F. Here, feature matching is achieved by calculating Euclidean distances between different features: the distance d between a P_i in the feature pool and the feature F is calculated, and if d is less than a predetermined threshold, this P_i is determined to match F, i.e., the person corresponding to the human body region of this P_i corresponds to the target character. In step 505, the corresponding time points of the at least one first matching feature in the video to be processed are determined; that is, the time points T_k in the video corresponding to the P_k that match F are determined. In step 506, the video to be processed is processed based on the respective time points to obtain a video portion associated with the target object. In one embodiment, the time points T_k are aggregated in chronological order, yielding the set of all time points for the same character. In one embodiment, aggregating the finally acquired set of time points T_k in chronological order includes: for any two time points, if the interval between them is less than a certain threshold, they are considered to belong to one continuous segment; otherwise they are considered separate segments. With such processing, the selected video frames are more coherent and the picture does not jump. A plurality of video clips is thus obtained. For the start time point and the end time point of each segment, the nearest shot switching point is searched for from the start time point of the segment, and the nearest scene switching point is searched for from the end time point of the segment, using the optical flow method, to ensure the integrity of the intercepted segment.
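The aggregation rule of step 506 can be sketched as follows: sorted matching time points are merged into one continuous segment while the gap between neighbours stays below a threshold, and a new segment starts otherwise. The 2-second gap is an assumed example value, not a figure from the patent.

```python
# Hedged sketch: merging matched time points into continuous segments (assumed 2 s gap).
def aggregate_time_points(time_points, max_gap=2.0):
    """time_points: sorted timestamps in seconds; returns (start, end) segments."""
    segments = []
    for t in time_points:
        if segments and t - segments[-1][1] <= max_gap:
            segments[-1][1] = t           # interval small enough: same continuous segment
        else:
            segments.append([t, t])       # interval too large: start a separate segment
    return [(start, end) for start, end in segments]
```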
Optical flow, as used in the shot switching point search above, is the instantaneous velocity, on the observation imaging plane, of the pixel motion of a spatially moving object. The optical flow method finds the correspondence between the previous frame and the current frame by using the changes of pixels of an image sequence in the time domain and the correlation between adjacent frames, and thereby calculates the motion information of objects between adjacent frames. After this operation is performed on all segments, the different segment clips of the same target object (e.g., the same character) in the video are obtained. The video processing method 500 can automatically segment out the clips featuring the same character in a video (such as a film, a TV drama or a variety show), saves a great deal of labor and time cost, improves editing efficiency, is also beneficial to later video production, and enhances user experience.
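One simple way to flag candidate shot switching points with dense optical flow is sketched below: a sudden spike in the mean flow magnitude between adjacent frames is treated as a likely cut. Both the Farneback parameters and the magnitude threshold are assumptions; the patent does not specify this particular criterion.

```python
# Hedged sketch: flagging a possible shot cut from dense optical flow between two gray frames.
import cv2
import numpy as np

def is_shot_cut(prev_gray, curr_gray, mag_thresh=15.0):
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)       # per-pixel motion magnitude
    return float(magnitude.mean()) > mag_thresh    # abrupt global motion suggests a shot change
```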
In the above video processing method, the feature extraction network is trained using a data set constructed based on human body region sample sets. The data set for training the feature extraction network is constructed using the temporal and spatial correlation of the video, together with, for example, face recognition and pedestrian re-identification (ReID) techniques. The data set is constructed by the following steps of the method 600 of constructing a data set shown in Fig. 6.
In step 601, training video for a feature extraction network is acquired.
In step 602, the training video is divided into a plurality of training video segments by video shot. Each of the plurality of training video segments contains a plurality of video frames belonging to the same video shot. Illustratively, whether shot switching exists in the training video can be judged through an optical flow method. If shot cuts exist, the video is divided at the video frames where the shot cuts occur, thereby splitting a full training video into segments corresponding to different shots.
In step 603, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment are created. In one embodiment, for each training video segment, human body regions in the plurality of video frames it contains are detected; the similarity between two or more detected human body regions is judged; and two or more human body regions whose similarity meets a predetermined threshold range are added to the same set to generate one or more human body region sample sets of the training video segment. Detecting the human body regions in the plurality of video frames is accomplished with the human body detection network SSD. Here, the similarity between the human body regions is judged using hand-crafted (artificial) features; for example, the feature may be the scale-invariant SIFT feature. In one embodiment, the predetermined threshold range is set to be above a first threshold and below a second threshold, and two or more human body regions satisfying this range are added to the same human body region sample set as a set of positive sample pairs. Requiring the similarity to be higher than the first threshold ensures that the human body regions are highly similar, i.e., that the two human body regions belong to the same character; at the same time, requiring it to be lower than the second threshold removes human body regions whose similarity is too high, because two frames with excessively high similarity have hardly changed at all and therefore contribute little to training the network model. In another embodiment, the predetermined threshold range is set below a third threshold, and two or more human body regions satisfying this range are added to the same human body region sample set as a set of negative sample pairs, i.e., such human body regions do not belong to the same character.
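The pairing rule described in this step can be sketched as follows, using a generic similarity function (for example the SIFT-based one above): pairs whose similarity lies between a lower and an upper bound are grouped as positive samples of the same character, and very dissimilar pairs are kept as negative pairs. All threshold values and the grouping strategy are assumptions for illustration.

```python
# Hedged sketch: grouping detected regions of one shot into positive sets and negative pairs.
def build_sample_sets(regions, similarity, pos_low=0.4, pos_high=0.9, neg_high=0.1):
    positive_sets, negative_pairs = [], []
    for i in range(len(regions)):
        for j in range(i + 1, len(regions)):
            s = similarity(regions[i], regions[j])
            if pos_low < s < pos_high:            # similar, but not near-identical frames
                for group in positive_sets:
                    if i in group or j in group:  # extend an existing set of the same person
                        group.update({i, j})
                        break
                else:
                    positive_sets.append({i, j})
            elif s < neg_high:                    # clearly different persons
                negative_pairs.append((i, j))
    return positive_sets, negative_pairs
```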
In step 604, it is determined whether a human face is contained in the one or more human body region sample sets. This step is implemented by face recognition techniques. In step 605, in response to determining that each of the one or more human body region sample sets contains a human face, the one or more human body region sample sets are merged based on features of the human face to construct a training data set. In one embodiment, in response to determining that each of the one or more human body region sample sets contains a human face, the same predetermined number of human faces is selected from each human body region sample set; the similarity of the faces selected from each human body region sample set is compared; and the human body region sample sets whose face similarity meets a predetermined threshold are merged. Specifically, face recognition technology is used to compare the faces in each human body region sample set. For example, in each human body region sample set in which a face is determined to exist, N faces are selected, where N is a positive integer. The selected N faces are cross-compared. If the proportion of matching faces among the N faces of two or more human body region sample sets exceeds a predetermined threshold (e.g., 50%), those human body region sample sets are merged into the same human body region sample set. That is, the human body regions in those sample sets are deemed to actually belong to the same person; this arises, in some cases, because the video switches from a first shot to a second shot and then back again.
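A sketch of the face-based merging decision follows: N faces sampled from two sets are cross-compared, and the sets are merged when the fraction of matching face pairs exceeds a ratio (50% in the example above). The face_match callable stands in for any face recognition comparison and, like N and the ratio, is an assumption.

```python
# Hedged sketch: deciding whether two human-region sample sets belong to the same person.
def should_merge(set_a_faces, set_b_faces, face_match, ratio=0.5, n=5):
    faces_a, faces_b = set_a_faces[:n], set_b_faces[:n]
    matches = sum(1 for fa in faces_a for fb in faces_b if face_match(fa, fb))
    total = len(faces_a) * len(faces_b)
    return total > 0 and matches / total >= ratio
```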
In one embodiment, the method for constructing a data set further includes: determining, using pedestrian re-identification (ReID), human body regions in the same human body region sample set whose similarity is lower than a predetermined threshold; and removing the human body regions whose similarity is below the predetermined threshold from the human body region sample set. Here, an open-source pre-trained ReID network is used, i.e., the ReID network determines whether dissimilar human body regions exist in a constructed human body region sample set.
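The ReID-based cleaning step could be sketched as below: every region in a sample set is embedded with a pretrained ReID model, and regions whose cosine similarity to the set centroid falls below a threshold are dropped. The reid_embed callable and the threshold are placeholders for an open-source ReID network, not the patent's exact procedure.

```python
# Hedged sketch: removing outlier regions from a sample set with ReID embeddings (assumed model).
import numpy as np

def clean_sample_set(regions, reid_embed, sim_thresh=0.6):
    feats = np.stack([reid_embed(r) for r in regions])
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    centroid = feats.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)
    sims = feats @ centroid                        # cosine similarity to the set centroid
    return [r for r, s in zip(regions, sims) if s >= sim_thresh]
```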
In addition, on the basis of the above method for constructing a data set, since the same person can appear in the video with a variety of poses, angles and backgrounds, manual screening is needed after the above steps to ensure that the human bodies in each set are images of the same person.
The invention also provides a training method of the feature extraction network, which trains the feature extraction network based on the data set obtained by the method 600. It is noted that during training, perturbations including random cropping, blurring, rotation and the like are applied to these samples, thereby improving the robustness of the feature extraction network.
The feature extraction network of the invention further improves and optimizes an existing deep network structure, so as to improve performance on this task. First, larger convolution kernels and larger strides are adopted in the shallow layers of the network, which increases the receptive field and speeds up the network. As the network deepens, the feature dimension keeps increasing; to improve computational efficiency, the convolution kernel size is gradually reduced, finally down to 3x3 kernels. In addition, the feature extraction network uses a triplet loss function as the final loss function. This loss function reduces the distance between positive sample pairs while increasing the distance between negative sample pairs, which works very well for the subsequent judgment of whether human bodies are similar. Here, positive samples refer to pairs of human body regions determined, through the similarity between human body regions, to belong to the same person; negative samples refer to pairs of human body regions belonging to different persons. In addition, the final feature is a superposition of deep features and shallow features. The shallow features of a deep network represent the structural information of the image, while the deep features are richer in semantic information. The invention combines the deep and shallow information of the network using an attention model, which achieves much higher accuracy than using shallow features or deep features alone.
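A rough architectural sketch consistent with this description is shown below: large kernels and strides in the shallow layers, kernel sizes shrinking to 3x3 in the deeper layers, and an attention-weighted fusion of shallow (structural) and deep (semantic) features trained with a triplet loss. All channel widths, the pooling choice and the exact fusion form are assumptions for illustration, not the patent's concrete network.

```python
# Hedged sketch: shallow/deep feature fusion with attention, trained with a triplet loss.
import torch
import torch.nn as nn

class BodyFeatureNet(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.shallow = nn.Sequential(            # large kernels and strides in shallow layers
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2),
            nn.BatchNorm2d(128), nn.ReLU(inplace=True),
        )
        self.deep = nn.Sequential(               # kernel size reduced to 3x3 in deeper layers
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj_shallow = nn.Linear(128, feat_dim)
        self.proj_deep = nn.Linear(256, feat_dim)
        self.attn = nn.Sequential(nn.Linear(2 * feat_dim, 2), nn.Softmax(dim=1))

    def forward(self, x):
        s = self.shallow(x)
        d = self.deep(s)
        fs = self.proj_shallow(self.pool(s).flatten(1))  # shallow: structural information
        fd = self.proj_deep(self.pool(d).flatten(1))     # deep: semantic information
        w = self.attn(torch.cat([fs, fd], dim=1))        # attention weights over the two cues
        return w[:, :1] * fs + w[:, 1:] * fd             # superposition of shallow and deep features

criterion = nn.TripletMarginLoss(margin=0.3)             # final triplet objective
```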
Fig. 7 schematically shows a schematic diagram of a video processing apparatus 700 according to an embodiment of the invention. The video processing apparatus 700 includes: an acquisition module 701, a human body detection module 702, a feature extraction module 703, a comparison module 704, a time point determination module 705 and a video processing module 706. The acquisition module 701 is configured to acquire a video to be processed and a target human body region representing a target object. The human body detection module 702 is configured to detect a plurality of human body regions in the video to be processed. The feature extraction module 703 is configured to input the plurality of human body regions into a trained feature extraction network to obtain a plurality of first features describing the plurality of human body regions respectively, and to input the target human body region into the trained feature extraction network to obtain a second feature describing the target human body region, the feature extraction network being trained using a dataset constructed based on human body region sample sets generated separately for a plurality of video segments divided according to video shots. The comparison module 704 is configured to compare the plurality of first features with the second feature respectively, obtaining at least one first matching feature, among the first features, that matches the second feature. The time point determination module 705 is configured to determine the corresponding time points of the at least one first matching feature in the video to be processed. The video processing module 706 is configured to process the video to be processed based on the respective time points to obtain a video portion associated with the target object. The video processing apparatus 700 can automatically segment out the clips featuring the same character in a video (such as a film, a TV drama or a variety show), saves a great deal of labor and time cost, improves editing efficiency, is also beneficial to later video production, and enhances user experience.
Fig. 8 schematically shows a schematic diagram of an apparatus 800 for constructing a data set for training a feature extraction network according to another embodiment of the invention. The data set constructing apparatus 800 includes: an acquisition module 801, a video segmentation module 802, a set creation module 803, a determination module 804, a set merging module 805 and a data set construction module 806. The acquisition module 801 is configured to acquire a training video for the feature extraction network. The video segmentation module 802 is configured to divide the training video into a plurality of training video segments according to video shots, each of the plurality of training video segments containing a plurality of video frames belonging to the same video shot. The set creation module 803 is configured to create, for each training video segment, one or more human body region sample sets of the training video segment. The determination module 804 is configured to determine whether a human face is contained in the one or more human body region sample sets. The set merging module 805 is configured to, in response to determining that a human face is contained in each of the one or more human body region sample sets, merge the one or more human body region sample sets based on features of the human face. The data set construction module 806 is configured to construct the training data set from the merged human body region sample sets.
Fig. 9 schematically illustrates a schematic diagram showing an example computer device 900 for video processing and/or constructing a data set. The computer device 900 may be a variety of different types of devices, such as a server computer (e.g., server 201 shown in fig. 2), a device associated with an application program (e.g., user terminal 203 shown in fig. 2), a system-on-chip, and/or any other suitable computer device or computing system.
Computer device 900 may include at least one processor 902, memory 904, communication interface(s) 906, display device 908, other input/output (I/O) devices 910, and one or more mass storage 912 capable of communicating with each other, such as through a system bus 914 or other suitable connection.
The processor 902 may be a single processing unit or multiple processing units, all of which may include a single or multiple computing units or multiple cores. The processor 902 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitries, and/or any devices that manipulate signals based on operational instructions. The processor 902 may be configured to, among other capabilities, obtain and execute computer-readable instructions stored in the memory 904, mass storage 912, or other computer-readable medium, such as program code of the operating system 916, program code of the application programs 918, program code of other programs 920, etc., to implement the methods for video processing and/or constructing data sets provided by one embodiment of the present invention.
Memory 904 and mass storage device 912 are examples of computer storage media for storing instructions that are executed by processor 902 to implement the various functions as previously described. For example, the memory 904 may generally include both volatile memory and nonvolatile memory (e.g., RAM, ROM, etc.). In addition, mass storage device 912 may generally include hard disk drives, solid state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network attached storage, storage area networks, and the like. Memory 904 and mass storage device 912 may both be referred to herein collectively as memory or computer storage medium, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that may be executed by processor 902 as a particular machine configured to implement the operations and functions described in the examples herein.
A number of program modules may be stored on the mass storage device 912. These programs include an operating system 916, one or more application programs 918, other programs 920 and program data 922, and they may be loaded into the memory 904 for execution. Examples of such application programs or program modules may include, for example, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: the acquisition module 701, human body detection module 702, feature extraction module 703, comparison module 704, time point determination module 705 and video processing module 706, as well as the acquisition module 801, video segmentation module 802, set creation module 803, determination module 804, set merging module 805 and data set construction module 806, and/or further embodiments described herein.
Although illustrated in fig. 9 as being stored in memory 904 of computer device 900, modules 916, 918, 920, and 922, or portions thereof, may be implemented using any form of computer readable media accessible by computer device 900. As used herein, "computer-readable medium" includes at least two types of computer-readable media, namely computer storage media and communication media.
Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computer device.
Computer device 900 can also include one or more communication interfaces 906 for exchanging data with other devices, such as via a network, direct connection, or the like, as discussed previously. One or more communication interfaces 906 may facilitate communication over a variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. The communication interface 906 may also provide for communication with external storage devices (not shown) such as in a storage array, network attached storage, storage area network, or the like.
In some examples, a display device 908, such as a display, may be included for displaying information and images. Other I/O devices 910 may be devices that take various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so on.
Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" as used herein does not exclude a plurality. Although certain features may be described in mutually different dependent claims, this mere fact is not intended to indicate that a combination of these features cannot be used or practiced.

Claims (13)

1. A video processing method, the method comprising:
acquiring a video to be processed and a target human body area representing a target object;
detecting a plurality of human body areas in the video to be processed;
inputting the plurality of human body areas into a trained feature extraction network to obtain a plurality of first features describing the plurality of human body areas respectively, and inputting the target human body areas into the trained feature extraction network to obtain second features describing the target human body areas;
comparing the plurality of first features with the second features respectively to obtain at least one first matching feature in the first features matched with the second features;
determining each corresponding time point of the at least one first matching feature in the video to be processed;
processing the video to be processed based on the respective time points to acquire a video portion associated with the target object;
the feature extraction network is trained by using a data set constructed based on a human body area sample set, and the human body area sample set is respectively generated for a plurality of video segments divided according to video shooting shots;
wherein the dataset is constructed by:
acquiring a training video for the feature extraction network;
dividing the training video into a plurality of training video segments according to video shooting shots;
creating, for each of the plurality of training video segments, one or more human region sample sets of the training video segment, each of the one or more human region sample sets comprising two or more human regions of the plurality of video frames contained in the training video segment that have a similarity that meets a predetermined threshold range;
determining whether one or more of the human body region sample sets contains a human face;
in response to determining that a human face is contained in each of the one or more human body region sample sets, merging the one or more human body region sample sets based on features of the human face to construct the data set.
2. The video processing method of claim 1, the creating one or more human region sample sets of the training video segments for each of the plurality of training video segments comprising:
detecting a human body region in a plurality of video frames for each training video segment of the plurality of training video segments, the each training video segment comprising a plurality of video frames belonging to a same video shot;
judging the similarity between the two or more detected human body regions;
two or more human body regions whose similarity meets a predetermined threshold range are added to the same set to generate one or more human body region sample sets of the training video segment.
3. The video processing method of claim 1 or 2, wherein the merging of the one or more human body region sample sets based on features of the human face to construct the dataset, in response to determining that a human face is contained in each of the one or more human body region sample sets, comprises:
in response to determining that each of the one or more body region sample sets contains a face, respectively selecting the same predetermined number of faces from each of the body region sample sets;
comparing the similarity of the faces selected from each human body region sample set;
and merging the human body region sample sets with the human face similarity higher than a first preset threshold value to construct a data set.
4. The video processing method of claim 1, wherein the dataset is further constructed by:
determining, using pedestrian re-identification (ReID), human body regions whose similarity to the other human body regions in the same human body region sample set is lower than a second predetermined threshold; and
removing the human body regions whose similarity is lower than the second predetermined threshold from the human body region sample set.
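The ReID-based cleaning of claim 4 can be sketched as scoring each region against the set's mean ReID embedding and dropping outliers. The `reid_embed` function and the 0.5 second threshold are illustrative assumptions only.

```python
# Hypothetical sketch of the ReID purification step: embed every region in a
# sample set, compare each embedding with the set centroid, and remove regions
# whose similarity falls below the second threshold.
import numpy as np

def purify_sample_set(sample_set, reid_embed, second_thresh=0.5):
    feats = np.stack([reid_embed(region) for region in sample_set])
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    centroid = feats.mean(axis=0)
    centroid /= (np.linalg.norm(centroid) + 1e-8)
    sims = feats @ centroid
    # keep only regions whose similarity to the set is at least the second threshold
    return [r for r, s in zip(sample_set, sims) if s >= second_thresh]
```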
5. The video processing method of claim 2, wherein determining the similarity between the two or more detected human body regions comprises: determining the similarity between the two or more detected human body regions based on hand-crafted features.
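Claim 5 does not name the hand-crafted features; an HSV colour histogram compared by correlation is one common choice and is used below purely as an assumption.

```python
# Hypothetical hand-crafted similarity between two person crops: hue/saturation
# histograms compared with correlation (higher means more similar, range [-1, 1]).
import cv2

def handcrafted_similarity(region_a, region_b, bins=(16, 16)):
    def hist(img):
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        h = cv2.calcHist([hsv], [0, 1], None, bins, [0, 180, 0, 256])
        return cv2.normalize(h, h).flatten()
    return float(cv2.compareHist(hist(region_a), hist(region_b), cv2.HISTCMP_CORREL))
```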
6. A method for constructing a dataset for training a feature extraction network, the method comprising:
acquiring a training video for the feature extraction network;
dividing the training video into a plurality of training video segments according to video shots;
creating, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment;
determining whether the one or more human body region sample sets contain a human face; and
in response to determining that each of the one or more human body region sample sets contains a human face, merging the one or more human body region sample sets based on features of the human face to construct the dataset;
wherein creating, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment comprises:
detecting human body regions in a plurality of video frames of each training video segment of the plurality of training video segments, each training video segment comprising a plurality of video frames belonging to a same video shot;
determining the similarity between two or more detected human body regions; and
adding two or more human body regions whose similarity falls within a predetermined threshold range to a same set to generate the one or more human body region sample sets of the training video segment.
7. The method for constructing a dataset of claim 6, wherein merging the one or more human body region sample sets based on features of the human face to construct the dataset, in response to determining that a human face is contained in each human body region of the one or more human body region sample sets, comprises:
in response to determining that each human body region in the one or more human body region sample sets contains a human face, selecting a same predetermined number of human faces from each human body region sample set;
comparing the similarity of the human faces selected from each human body region sample set; and
merging human body region sample sets whose human face similarity is higher than a first predetermined threshold to construct the dataset.
8. A training method of a feature extraction network, comprising:
acquiring a training video for the feature extraction network;
constructing a dataset based on the acquired training video using the method for constructing a dataset of any one of claims 6 to 7; and
training the feature extraction network using the dataset to extract features describing human body regions.
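Claim 8 leaves the training objective open; the sketch below assumes a PyTorch metric-learning setup in which each merged human body region sample set serves as one identity and a triplet margin loss pulls same-identity regions together. The loader, margin, and loss choice are assumptions, not the claimed method.

```python
# Hypothetical training loop: the constructed dataset yields (anchor, positive,
# negative) person crops, where anchor and positive come from the same merged
# sample set and negative comes from a different one.
import torch
import torch.nn as nn

def train_epoch(feature_net, loader, optimizer, margin=0.3, device="cpu"):
    criterion = nn.TripletMarginLoss(margin=margin)
    feature_net.train()
    for anchor, positive, negative in loader:
        anchor, positive, negative = (t.to(device) for t in (anchor, positive, negative))
        loss = criterion(feature_net(anchor), feature_net(positive), feature_net(negative))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```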
9. A video processing apparatus, the apparatus comprising:
an acquisition module configured to acquire a video to be processed and a target human body region representing a target object;
a human body detection module configured to detect a plurality of human body regions in the video to be processed;
a feature extraction module configured to input the plurality of human body regions into a trained feature extraction network to obtain a plurality of first features respectively describing the plurality of human body regions, and to input the target human body region into the trained feature extraction network to obtain a second feature describing the target human body region, wherein the feature extraction network is trained using a dataset constructed based on human body region sample sets, and the human body region sample sets are respectively generated for a plurality of video segments divided according to video shots;
a comparison module configured to compare the plurality of first features with the second feature respectively to obtain at least one first matching feature among the first features that matches the second feature;
a time point determination module configured to determine respective time points corresponding to the at least one first matching feature in the video to be processed; and
a video processing module configured to process the video to be processed based on the respective time points to obtain a video portion associated with the target object;
wherein the dataset is constructed by:
acquiring a training video for the feature extraction network;
dividing the training video into a plurality of training video segments according to video shots;
creating, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment, each of the one or more human body region sample sets comprising two or more human body regions, taken from the plurality of video frames contained in the training video segment, whose similarity falls within a predetermined threshold range;
determining whether the one or more human body region sample sets contain a human face; and
in response to determining that each of the one or more human body region sample sets contains a human face, merging the one or more human body region sample sets based on features of the human face to construct the dataset.
10. An apparatus for constructing a dataset for training a feature extraction network, the apparatus comprising:
an acquisition module configured to acquire a training video for the feature extraction network;
a video segmentation module configured to divide the training video into a plurality of training video segments according to video shots;
a set creation module configured to create, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment;
a determination module configured to determine whether the one or more human body region sample sets contain a human face; and
a set merging module configured to, in response to determining that a human face is contained in each human body region of the one or more human body region sample sets, merge the one or more human body region sample sets based on features of the human face to construct the dataset;
wherein the set creation module is further configured to:
detect human body regions in a plurality of video frames of each training video segment of the plurality of training video segments, each training video segment comprising a plurality of video frames belonging to a same video shot;
determine the similarity between two or more detected human body regions; and
add two or more human body regions whose similarity falls within a predetermined threshold range to a same set to generate the one or more human body region sample sets of the training video segment.
11. A training apparatus of a feature extraction network, comprising:
an acquisition module configured to acquire a training video for the feature extraction network;
a dataset construction module configured to construct a dataset based on the acquired training video using the method for constructing a dataset of any one of claims 6 to 7; and
a training module configured to train the feature extraction network using the dataset to extract features describing human body regions.
12. A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to perform the steps of the method of any of claims 1-7.
13. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, causes the processor to perform the steps of the method of any of claims 1-7.
CN202010157708.3A 2020-03-09 2020-03-09 Video processing method, device and storage medium Active CN111209897B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010157708.3A CN111209897B (en) 2020-03-09 2020-03-09 Video processing method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010157708.3A CN111209897B (en) 2020-03-09 2020-03-09 Video processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111209897A CN111209897A (en) 2020-05-29
CN111209897B true CN111209897B (en) 2023-06-20

Family

ID=70788826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010157708.3A Active CN111209897B (en) 2020-03-09 2020-03-09 Video processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111209897B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861981B (en) * 2021-02-22 2023-06-20 每日互动股份有限公司 Data set labeling method, electronic equipment and medium
CN113190713A (en) * 2021-05-06 2021-07-30 百度在线网络技术(北京)有限公司 Video searching method and device, electronic equipment and medium
CN113283381B (en) * 2021-06-15 2024-04-05 南京工业大学 Human body action detection method suitable for mobile robot platform
CN114189754A (en) * 2021-12-08 2022-03-15 湖南快乐阳光互动娱乐传媒有限公司 Video plot segmentation method and system
CN114363720B (en) * 2021-12-08 2024-03-12 广州海昇计算机科技有限公司 Video slicing method, system, equipment and medium based on computer vision
CN114286198B (en) * 2021-12-30 2023-11-10 北京爱奇艺科技有限公司 Video association method, device, electronic equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509457A (en) * 2017-02-28 2018-09-07 阿里巴巴集团控股有限公司 A kind of recommendation method and apparatus of video data
CN109063667A (en) * 2018-08-14 2018-12-21 视云融聚(广州)科技有限公司 A kind of video identification method optimizing and method for pushing based on scene
CN109284729A (en) * 2018-10-08 2019-01-29 北京影谱科技股份有限公司 Method, apparatus and medium based on video acquisition human face recognition model training data
CN109922373A (en) * 2019-03-14 2019-06-21 上海极链网络科技有限公司 Method for processing video frequency, device and storage medium
CN110087144A (en) * 2019-05-15 2019-08-02 深圳市商汤科技有限公司 Video file processing method, device, electronic equipment and computer storage medium
CN110119711A (en) * 2019-05-14 2019-08-13 北京奇艺世纪科技有限公司 A kind of method, apparatus and electronic equipment obtaining video data personage segment
CN110366050A (en) * 2018-04-10 2019-10-22 北京搜狗科技发展有限公司 Processing method, device, electronic equipment and the storage medium of video data
CN110505498A (en) * 2019-09-03 2019-11-26 腾讯科技(深圳)有限公司 Processing, playback method, device and the computer-readable medium of video
CN110516572A (en) * 2019-08-16 2019-11-29 咪咕文化科技有限公司 A kind of method, electronic equipment and storage medium identifying competitive sports video clip

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9049259B2 (en) * 2011-05-03 2015-06-02 Onepatont Software Limited System and method for dynamically providing visual action or activity news feed

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Online Video Recommendation through Tag-Cloud Aggregation; Jonghun Park et al.; IEEE Computer Society; 2011-03-31; Vol. 18; 78-87 *
A Video Segmentation and Key Frame Extraction Method for Compressed Video Streams; Wang Fengling et al.; Intelligent Computer and Applications (《智能计算机与应用》); 2017-10-31; Vol. 7, No. 5; 79-82 *

Also Published As

Publication number Publication date
CN111209897A (en) 2020-05-29

Similar Documents

Publication Publication Date Title
CN111209897B (en) Video processing method, device and storage medium
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
Gao et al. Counting from sky: A large-scale data set for remote sensing object counting and a benchmark method
CN110598048B (en) Video retrieval method and video retrieval mapping relation generation method and device
Liu et al. Weakly-supervised salient object detection with saliency bounding boxes
US20190065492A1 (en) Zero-shot event detection using semantic embedding
CN111754541A (en) Target tracking method, device, equipment and readable storage medium
AU2018202767B2 (en) Data structure and algorithm for tag less search and svg retrieval
Hashmi et al. An exploratory analysis on visual counterfeits using conv-lstm hybrid architecture
Mahmood et al. Automatic player detection and identification for sports entertainment applications
CN111667001B (en) Target re-identification method, device, computer equipment and storage medium
CN110619284B (en) Video scene division method, device, equipment and medium
WO2022089170A1 (en) Caption area identification method and apparatus, and device and storage medium
Li et al. Videography-based unconstrained video analysis
Yi et al. Motion keypoint trajectory and covariance descriptor for human action recognition
Zhou et al. Perceptually aware image retargeting for mobile devices
CN112101344B (en) Video text tracking method and device
CN111680678A (en) Target area identification method, device, equipment and readable storage medium
Tliba et al. Satsal: A multi-level self-attention based architecture for visual saliency prediction
CN110688524A (en) Video retrieval method and device, electronic equipment and storage medium
Fei et al. Creating memorable video summaries that satisfy the user’s intention for taking the videos
Nemade et al. Image segmentation using convolutional neural network for image annotation
CN112488072A (en) Method, system and equipment for acquiring face sample set
Zhou et al. Modeling perspective effects in photographic composition
Demidova et al. Semantic image-based profiling of users’ interests with neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221128

Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518133

Applicant after: Shenzhen Yayue Technology Co.,Ltd.

Address before: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors

Applicant before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

GR01 Patent grant