CN115529483A - Video processing method and device, electronic equipment and storage medium - Google Patents

Video processing method and device, electronic equipment and storage medium

Info

Publication number
CN115529483A
Authority
CN
China
Prior art keywords
image frame
video
processed
image
position information
Prior art date
Legal status
Pending
Application number
CN202211025874.3A
Other languages
Chinese (zh)
Inventor
蔡佳音
陶鑫
戴宇荣
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202211025874.3A
Publication of CN115529483A

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H04N21/4318Generation of visual interfaces for content selection or interaction; Content or additional data rendering by altering the content in the rendering process, e.g. blanking, blurring or masking an image region
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The disclosure relates to a video processing method and apparatus, an electronic device, and a storage medium. The method includes: segmenting an original video and extracting key frames to obtain at least one video segment and a key frame in each video segment, where the position information, on the image frame to which it belongs, of the object to be processed contained in the image frames of the same video segment satisfies preset position information; determining, in each video segment, the object to be processed of the key frame and the position information of that object; generating a mask of the object to be processed of each key frame based on the object to be processed of each key frame and its position information; and performing object processing on the at least one video segment based on the at least one video segment and the mask of the object to be processed of each key frame to obtain a target video. The method and apparatus reduce the software and hardware resources required for subsequently removing the object to be processed from the video and are generally applicable.

Description

Video processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a video processing method and apparatus, an electronic device, and a storage medium.
Background
Videos, photographs, and the like are increasingly shared on the internet for other users to obtain information from and enjoy. However, before a video or photograph is published, it may pick up stains or smudges during shooting or storage, which affects its publication.
A typical stain removal process is as follows: a large number of images containing the same stain are collected for evaluation and registration detection of the stain. The stain is then localized by detecting gradients across the entire image set to obtain a reliable initial estimate, where registration detection is used to estimate an initial alpha mask and to refine the estimated stain layer; these are then used as the initialization input of a multi-image removal algorithm, yielding stain-free images.
However, the above removal method can only remove one specific stain at a time and lacks generality.
Disclosure of Invention
The present disclosure provides a video processing method, a video processing apparatus, an electronic device, and a storage medium. The technical solutions of the present disclosure are as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a video processing method, including:
segmenting an original video and extracting key frames to obtain at least one video clip and the key frames in each video clip; the position information of an object to be processed contained in image frames in the same video clip on the image frame to which the object to be processed belongs meets preset position information; the preset position information is information corresponding to a video clip to which the image frame belongs;
determining the object to be processed of the key frame and the position information of the object to be processed of the key frame in each video clip;
generating a mask of the object to be processed of each key frame based on the object to be processed of each key frame and the position information of the object to be processed of each key frame;
and performing object processing on the at least one video clip based on the at least one video clip and the mask of the object to be processed of each key frame to obtain the target video.
In some possible embodiments, segmenting and extracting key frames from the original video to obtain at least one video segment and key frames in each video segment includes:
carrying out object identification on an original video to obtain the type information of an object to be processed of the original video;
determining a display rule of the object to be processed on the original video based on the type information of the object to be processed;
segmenting an original video based on a display rule to obtain at least one video segment;
and determining a preset image frame in each video clip of the at least one video clip as a key frame in each video clip.
In some possible embodiments, determining a display rule of the object to be processed on the original video based on the type information of the object to be processed, and segmenting the original video based on the display rule to obtain at least one video segment, includes:
determining a display area and a display duration of the object to be processed on the original video based on the type information of the object to be processed;
and segmenting the original video based on the display area and the display duration to obtain at least one video segment.
In some possible embodiments, segmenting and extracting key frames from the original video to obtain at least one video segment and key frames in each video segment includes:
carrying out object identification on a first image frame in an original video to obtain an object to be processed of the original video; the first image frame is an image frame of a to-be-processed object appearing for the first time in an original video;
determining first position information of an object to be processed on a first image frame;
cropping the first image frame based on the first position information to obtain a first sub-image corresponding to the first position information;
determining similarity data corresponding to each second image frame based on the first sub-image and each second image frame in the second image frame set; the second image frame set comprises image frames except the first image frame in the original video;
segmenting the original video based on the similarity data corresponding to each second image frame to obtain at least one video segment;
and determining a preset image frame in each video clip of the at least one video clip as a key frame in each video clip.
In some possible embodiments, determining similarity data corresponding to each second image frame based on the first sub-image and each second image frame, and segmenting the original video based on the similarity data corresponding to each second image frame to obtain at least one video segment includes:
acquiring a second sub-image corresponding to the first position information in each second image frame;
determining similarity data corresponding to each second image frame based on the similarity degree of each second sub-image and the first sub-image;
if the similarity data corresponding to each second image frame meet preset data, obtaining a video clip; the first image frame of the video segment is the first image frame.
In some possible embodiments, after determining the similarity data corresponding to each second image frame based on the similarity between each second sub-image and the first sub-image, the method further includes:
if a first target image frame set exists in the second image frame set and a first target image frame positioned at the first position in the first target image frame set is adjacent to the first image frame and positioned behind the first image frame in the original video, determining a first video segment based on the first image frame and the first target image frame set;
the first target image frame set comprises a first target image frame or a plurality of continuous first target image frames; the first target image frame set is not equal to the second image frame set, and the similarity data corresponding to the first target image frame in the first target image frame set meets preset data.
In some possible embodiments, the method further comprises:
taking a difference set between the second image frame set and the first target image frame set as a video to be segmented;
taking a first image frame of a video to be segmented as a new first image frame; taking image frames except the new first image frame in the video to be segmented as a new second image frame set;
determining new first position information of the object to be processed on a new first image frame;
obtaining a new first sub-image corresponding to the new first position information from the new first image frame, and obtaining a new second sub-image corresponding to the new first position information from each second image frame in the new second image frame set;
determining similarity data for each new second sub-image based on the similarity of the new first sub-image and each new second sub-image;
if a new first target image frame set exists in the new second image frame set and the new first target image frame positioned first in the new first target image frame set is adjacent to the new first image frame and positioned behind the new first image frame in the original video, determining a second video segment based on the new first image frame and the new first target image frame set;
the new first target image frame set comprises one new first target image frame or a plurality of consecutive new first target image frames; the new first target image frame set is not equal to the new second image frame set, and the similarity data corresponding to the new first target image frames in the new first target image frame set meet the preset data.
In some possible embodiments, the method further comprises:
carrying out preset color proportion detection on the original video to obtain proportion data of preset colors of each image frame in the original video;
if the proportion data of the preset colors of a plurality of image frames in the original video meet third preset data, determining the starting and ending time of the plurality of image frames; the plurality of image frames are continuous image frames, and the last image frame of the plurality of image frames is a video end image frame.
In some possible embodiments, if the objects to be processed include a first class of objects to be processed and a second class of objects to be processed, determining the objects to be processed of the key frames and the location information of the objects to be processed of the key frames in each video clip includes:
performing first-class object detection on key frames in each video clip based on a first detection model in the object detection models, and determining first-class objects to be processed of the key frames in each video clip and position information of the first-class objects to be processed;
and performing second-class object detection on the key frame in each video clip based on a second detection model in the object detection model and the position information of the first-class object to be processed, and determining the second-class object to be processed and the position information of the second-class object to be processed of the key frame in each video clip.
In some possible embodiments, generating a mask of the object to be processed for each key frame based on the object to be processed for each key frame and the position information of the object to be processed for each key frame comprises:
and carrying out binarization processing on the pixels of the object to be processed of each key frame based on the position information of the object to be processed of each key frame to obtain a mask of the object to be processed of each key frame.
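For illustration only, the binarization step can be sketched as follows. This is a minimal sketch, not part of the claimed solution; it assumes the position information of the object to be processed is an axis-aligned bounding box (x1, y1, x2, y2) in pixel coordinates, and the function name and the convention of 1 inside the box and 0 outside are assumptions made for the example.

```python
import numpy as np

def make_object_mask(frame_height, frame_width, bbox):
    """Build a binary mask for one key frame: pixels inside the bounding box
    of the object to be processed are set to 1, all other pixels to 0.
    bbox is assumed to be (x1, y1, x2, y2) in pixel coordinates."""
    x1, y1, x2, y2 = bbox
    mask = np.zeros((frame_height, frame_width), dtype=np.uint8)
    mask[y1:y2, x1:x2] = 1
    return mask

# Example: a 640 x 320 frame with the object in a 32 x 16 top-left region.
mask = make_object_mask(320, 640, (0, 0, 32, 16))
```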
According to a second aspect of the embodiments of the present disclosure, there is provided a video processing apparatus including:
the segmentation module is configured to segment an original video and extract key frames to obtain at least one video clip and key frames in each video clip; the position information of an object to be processed contained in image frames in the same video clip on the image frame to which the object to be processed belongs meets preset position information; the preset position information is information corresponding to a video clip to which the image frame belongs;
a determining module configured to perform determining an object to be processed of a key frame and position information of the object to be processed of the key frame in each video clip;
a mask generation module configured to perform a mask generation of the object to be processed for each key frame based on the object to be processed for each key frame and the position information of the object to be processed for each key frame;
and the object removing module is configured to execute object processing on the at least one video segment based on the at least one video segment and the mask of the object to be processed of each key frame to obtain a target video.
In some possible embodiments, the segmentation module is configured to perform:
carrying out object identification on an original video to obtain the type information of an object to be processed of the original video;
determining a display rule of the object to be processed on the original video based on the type information of the object to be processed;
segmenting an original video based on a display rule to obtain at least one video segment;
and determining a preset image frame in each video clip of the at least one video clip as a key frame in each video clip.
In some possible embodiments, the segmentation module is configured to perform:
determining a display area and a display duration of the object to be processed on the original video based on the type information of the object to be processed;
and segmenting the original video based on the display area and the display duration to obtain at least one video segment.
In some possible embodiments, the segmentation module is configured to perform:
carrying out object identification on a first image frame in an original video to obtain an object to be processed of the original video; the first image frame is an image frame of an object to be processed appearing for the first time in an original video;
determining first position information of an object to be processed on a first image frame;
cropping the first image frame based on the first position information to obtain a first sub-image corresponding to the first position information;
determining similarity data corresponding to each second image frame based on the first sub-image and each second image frame in the second image frame set; the second image frame set comprises image frames except the first image frame in the original video;
segmenting the original video based on the similarity data corresponding to each second image frame to obtain at least one video segment;
and determining a preset image frame in each video clip of the at least one video clip as a key frame in each video clip.
In some possible embodiments, the segmentation module is configured to perform:
acquiring a second sub-image corresponding to the first position information in each second image frame;
determining similarity data corresponding to each second image frame based on the similarity degree of each second sub-image and the first sub-image;
if the similarity data corresponding to each second image frame meet preset data, obtaining a video clip; the first image frame of the video segment is the first image frame.
In some possible embodiments, the segmentation module is configured to perform:
if a first target image frame set exists in the second image frame set and a first target image frame positioned at the first position in the first target image frame set is adjacent to the first image frame and positioned behind the first image frame in the original video, determining a first video segment based on the first image frame and the first target image frame set;
the first target image frame set comprises a first target image frame or a plurality of continuous first target image frames; the first target image frame set is not equal to the second image frame set, and the similarity data corresponding to the first target image frame in the first target image frame set meets preset data.
In some possible embodiments, the segmentation module is configured to perform:
taking a difference set between the second image frame set and the first target image frame set as a video to be segmented;
taking a first image frame of a video to be segmented as a new first image frame; taking image frames except the new first image frame in the video to be segmented as a new second image frame set;
determining new first position information of the object to be processed on a new first image frame;
obtaining a new first sub-image corresponding to the new first position information from the new first image frame, and obtaining a new second sub-image corresponding to the new first position information from each second image frame in the new second image frame set;
determining similarity data for each new second sub-image based on the similarity of the new first sub-image and each new second sub-image;
if a new first target image frame set exists in the new second image frame set and the new first target image frame positioned first in the new first target image frame set is adjacent to the new first image frame and positioned behind the new first image frame in the original video, determining a second video segment based on the new first image frame and the new first target image frame set;
the new first target image frame set comprises one new first target image frame or a plurality of consecutive new first target image frames; the new first target image frame set is not equal to the new second image frame set, and the similarity data corresponding to the new first target image frames in the new first target image frame set meet the preset data.
In some possible embodiments, the apparatus further comprises a time determination module configured to perform:
carrying out preset color proportion detection on the original video to obtain proportion data of preset colors of each image frame in the original video;
if the proportion data of the preset colors of a plurality of image frames in the original video meet third preset data, determining the starting and ending time of the plurality of image frames; the plurality of image frames are continuous image frames, and the last image frame of the plurality of image frames is a video end image frame.
In some possible embodiments, the determining module is configured to perform:
performing first-class object detection on key frames in each video clip based on a first detection model in the object detection models, and determining first-class objects to be processed of the key frames in each video clip and position information of the first-class objects to be processed;
and performing second-class object detection on the key frame in each video clip based on a second detection model in the object detection model and the position information of the first-class object to be processed, and determining the second-class object to be processed and the position information of the second-class object to be processed of the key frame in each video clip.
In some possible embodiments, the mask generation module is configured to perform:
and carrying out binarization processing on the pixels of the object to be processed of each key frame based on the position information of the object to be processed of each key frame to obtain a mask of the object to be processed of each key frame.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method of any one of the first aspect as described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the method of any one of the first aspects of the embodiments of the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program, the computer program being stored in a readable storage medium, from which at least one processor of a computer device reads and executes the computer program, causing the computer device to perform the method of any one of the first aspect of embodiments of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
An original video is segmented and key frames are extracted to obtain at least one video segment and a key frame in each video segment, where the position information, on the image frame to which it belongs, of the object to be processed contained in the image frames of the same video segment satisfies preset position information, and the preset position information corresponds to the video segment to which the image frame belongs. In each video segment, the object to be processed of the key frame and the position information of that object are determined; a mask of the object to be processed of each key frame is generated based on that object and its position information; and object processing is performed on the at least one video segment based on the at least one video segment and the mask of the object to be processed of each key frame to obtain the target video. With the video segments and key frames, the positions of the objects to be processed in each video segment can be accurately located, the computing resources required for subsequently removing the objects to be processed from the video are reduced, and the approach can be applied to videos containing many kinds of objects to be processed and is therefore generally applicable.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the drawings needed in the description of the embodiments are briefly introduced below. The following drawings show only some embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings based on them without creative effort.
FIG. 1 is a schematic diagram of an application environment shown in accordance with an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a video processing method according to an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method of video segmentation and key frame extraction in accordance with one illustrative embodiment;
FIG. 4 is a flow diagram illustrating a method of video segmentation and key frame extraction in accordance with one illustrative embodiment;
FIG. 5 is a flow diagram illustrating a method for segmenting an original video based on similarity data in accordance with an exemplary embodiment;
FIG. 6 is a flow diagram illustrating a method for segmenting an original video based on similarity data in accordance with an exemplary embodiment;
FIG. 7 is a flowchart illustrating a method for determining a key frame object to be processed and location information of the key frame object to be processed according to an example embodiment;
FIG. 8 is a flowchart illustrating a method of deleting a trailer in accordance with an illustrative embodiment;
FIG. 9 is a block diagram illustrating a video processing device according to an example embodiment;
FIG. 10 is a block diagram illustrating an electronic device for video processing in accordance with an exemplary embodiment.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the drawings in the embodiments of the present disclosure. The described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without creative effort fall within the scope of the present disclosure.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. The implementations described in the exemplary embodiments below do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in the present disclosure are information and data authorized by the user or sufficiently authorized by each party.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment of a video processing method according to an exemplary embodiment, and as shown in fig. 1, the application environment may include a server 01 and a client 02.
In some possible embodiments, the server 01 may include an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform. The operating system running on the server may include, but is not limited to, an android system, an IOS system, linux, windows, unix, and the like.
In some possible embodiments, the client 02 may include, but is not limited to, a smartphone, a desktop computer, a tablet computer, a laptop computer, a smart speaker, a digital assistant, an Augmented Reality (AR)/Virtual Reality (VR) device, a smart wearable device, and the like. The software running on the client may also be an application, an applet, or the like. Alternatively, the operating system running on the client may include, but is not limited to, an android system, an IOS system, linux, windows, unix, and the like.
In some possible embodiments, the server 01 or the client 02 may segment an original video and extract key frames to obtain at least one video segment and a key frame in each video segment, where the position information, on the image frame to which it belongs, of the object to be processed contained in the image frames of the same video segment satisfies preset position information, and the preset position information corresponds to the video segment to which the image frame belongs. The server or client then determines, in each video segment, the object to be processed of the key frame and the position information of that object, generates a mask of the object to be processed of each key frame based on that object and its position information, and performs object processing on the at least one video segment based on the at least one video segment and the mask of the object to be processed of each key frame to obtain a target video.
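The overall flow above can be summarized in a short sketch. The helper functions are passed in as placeholders for the steps detailed in the embodiments below; none of the names are part of the disclosure.

```python
def process_video(original_video, segment_and_extract_key_frames, detect_object,
                  make_object_mask, remove_object, concatenate):
    """End-to-end flow: segment the video, detect the object to be processed on
    each key frame, build a mask from its position information, and remove the
    object from every segment.  All helpers are placeholders supplied by the
    caller."""
    segments, key_frames = segment_and_extract_key_frames(original_video)

    processed_segments = []
    for segment, key_frame in zip(segments, key_frames):
        obj, position = detect_object(key_frame)                 # object + position info
        mask = make_object_mask(key_frame, position)             # binary mask of the object
        processed_segments.append(remove_object(segment, mask))  # object processing

    return concatenate(processed_segments)                       # target video
```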
In some possible embodiments, the client 02 and the server 01 may be connected through a wired link or a wireless link.
In an exemplary embodiment, the client, the server, and the databases corresponding to the server may be node devices in a blockchain system and can share the information they acquire and generate with other node devices in the blockchain system, thereby implementing information sharing among multiple node devices. The multiple node devices in the blockchain system may be configured with the same blockchain, which consists of multiple blocks linked to their neighbors, so that tampering with the data in any block can be detected by the next block. This prevents the data in the blockchain from being tampered with and ensures the security and reliability of the data in the blockchain.
Fig. 2 is a flowchart illustrating a video processing method according to an exemplary embodiment, and as shown in fig. 2, the video processing method may be applied to a server, and may also be applied to other node devices, such as a client, and the method is described below by taking the server as an example, and includes the following steps:
in step S201, segmenting and extracting key frames from an original video to obtain at least one video segment and a key frame in each video segment; the position information of an object to be processed contained in image frames in the same video clip on the image frame to which the object to be processed belongs meets preset position information; the preset position information is information corresponding to the video clip to which the image frame belongs.
In the embodiment of the application, the server can perform segmentation and key frame extraction on the original video to obtain at least one video clip and a key frame in each video clip.
So that the key frame in each video segment can better assist the removal of the object to be processed from the video segment to which it belongs, the position information, on the image frame to which it belongs, of the object to be processed contained in the image frames of the same video segment satisfies preset position information, and the preset position information corresponds to the video segment to which the image frame belongs.
Optionally, the statement that the position information of the object to be processed contained in the image frames of the same video segment on the image frame to which it belongs satisfies the preset position information may mean: the position information of the object to be processed contained in all image frames of the same video segment on the image frame to which it belongs satisfies the preset position information.
Alternatively, it may mean: the position information of the object to be processed contained in some of the image frames of the same video segment on the image frame to which it belongs satisfies the preset position information, while the remaining image frames do not contain the object to be processed. For example, in one video segment, 80% of the image frames contain the object to be processed and satisfy the preset position information, and the remaining 20% of the image frames do not contain the object to be processed.
In this embodiment of the present application, each video segment has corresponding preset position information, and the position information of the object to be processed contained in the image frames belonging to that video segment, on the image frame to which it belongs, falls within the preset position information.
In the following, the fact that each video segment has corresponding preset position information is illustrated with a first video segment. Assume that the first video segment contains 60 video frames and that each of those frames has a resolution of 640 × 320, i.e., each frame is 640 pixels wide and 320 pixels high. The preset position information corresponding to the first video segment may then be a rectangular region defined by the following four vertices, each written as (width pixel, height pixel): (0, 0), (32, 0), (0, 16), and (32, 16).
In an alternative embodiment, the position information of the object to be processed contained in the image frames of the video segment falls within the preset position information as follows: in the first video segment, the object to be processed contained in each of the 60 image frames occupies, on the image frame to which it belongs, a first rectangular region defined by the four vertices (0, 0), (0, 16), (32, 0), and (32, 16).
In another alternative embodiment: in the first video segment, the object to be processed contained in some of the 60 image frames occupies, on the image frame to which it belongs, a second rectangular region defined by the four vertices (0, 0), (0, 15), (32, 0), and (32, 15), while the object to be processed contained in the other image frames occupies a third rectangular region defined by the four vertices (0, 0), (0, 16), (28, 0), and (28, 16). Both the second rectangular region and the third rectangular region are contained in the first rectangular region.
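The containment check implied by the two embodiments above can be illustrated with a minimal sketch. This is illustrative only and not part of the disclosed solution; it assumes both the object's position information and the preset position information are axis-aligned rectangles given as (x1, y1, x2, y2) pixel tuples, and the function name is invented for the example.

```python
def bbox_within_preset(bbox, preset):
    """Return True if the object's bounding box lies inside the segment's
    preset region.  Both arguments are assumed to be (x1, y1, x2, y2) tuples."""
    x1, y1, x2, y2 = bbox
    px1, py1, px2, py2 = preset
    return px1 <= x1 and py1 <= y1 and x2 <= px2 and y2 <= py2

# The second and third rectangles from the example above both fall
# inside the first video segment's preset region (0, 0, 32, 16):
print(bbox_within_preset((0, 0, 32, 15), (0, 0, 32, 16)))  # True
print(bbox_within_preset((0, 0, 28, 16), (0, 0, 32, 16)))  # True
```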
There are many ways for the server to obtain the at least one video segment and the key frame in each of those segments; several of them are described in the following embodiments.
In an alternative embodiment, the server may obtain a preset time interval and obtain a plurality of key frames from the original video based on the preset time interval. The server may then divide the original video into a plurality of video segments based on the locations of the plurality of key frames in the original video.
For example, assume the frame rate of the original video is 30 frames per second, the duration of the original video is 10 seconds, and the preset time interval is 2 seconds. The key frames the server acquires from the original video based on the preset time interval are then the 1st, 61st, 121st, 181st, and 241st image frames. The server may then determine all image frames from the 1st to the 60th as a first video segment, all image frames from the 61st to the 120th as a second video segment, all image frames from the 121st to the 180th as a third video segment, and so on. In this way, the server divides the original video into 5 video segments and determines the first image frame of each video segment as the key frame of that segment.
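For illustration, the fixed-interval segmentation and key-frame selection just described can be sketched as follows; the function name and 1-based frame indexing are assumptions made for the example, not part of the disclosure.

```python
def split_by_interval(total_frames, fps, interval_seconds):
    """Split a video into fixed-length segments and take the first frame of
    each segment as its key frame.  Frame indices are 1-based to match the
    description above."""
    frames_per_segment = fps * interval_seconds
    segments, key_frames = [], []
    for start in range(1, total_frames + 1, frames_per_segment):
        end = min(start + frames_per_segment - 1, total_frames)
        segments.append((start, end))
        key_frames.append(start)
    return segments, key_frames

# 30 fps, 10-second video, 2-second interval:
segments, key_frames = split_by_interval(300, 30, 2)
# segments   -> [(1, 60), (61, 120), (121, 180), (181, 240), (241, 300)]
# key_frames -> [1, 61, 121, 181, 241]
```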
In another alternative embodiment, the server may segment and extract the key frames from the original video based on the display rules of the object to be processed in the original video, so as to obtain at least one video clip and the key frames in each video clip. Fig. 3 is a flow chart illustrating a video segmentation and key frame extraction method according to an exemplary embodiment, as shown in fig. 3, including:
in step S301, object recognition is performed on the original video to obtain type information of an object to be processed of the original video.
In this embodiment, the type information of the object to be processed may refer to company information or organization information to which the object to be processed belongs. Therefore, the server can perform object identification on the original video to obtain company information or organization information to which the object to be processed belongs in the original video.
In step S303, a display rule of the object to be processed on the original video is determined based on the type information of the object to be processed.
In this embodiment of the present application, the display rule of the object to be processed on the original video refers to the display area and display duration of the object to be processed on the original video. For example, the display rule of object A to be processed on the original video, based on display parameters set in advance, is: object A is always displayed in the upper-left corner region of each image frame of the original video (a 5 × 5 region with the upper-left corner as a vertex). The display rule of object B to be processed on the original video, based on display parameters set in advance, is: object B is displayed in the upper-left corner region of the image frames (a 5 × 5 region with the upper-left corner as a vertex) between 0 and 2 seconds of the original video, in the lower-right corner region of the image frames (a 5 × 5 region with the lower-right corner as a vertex) between 2 and 4 seconds, in the upper-right corner region of the image frames (a 5 × 5 region with the upper-right corner as a vertex) between 4 and 6 seconds, and so on.
In step S305, the original video is segmented based on the display rule to obtain at least one video segment.
For the above object A to be processed, since its display area is the upper-left corner region and its display duration is "always", the server may not segment the original video based on the display area and display duration; that is, the original video is treated as a single whole.
For the above object B to be processed, the server may segment the original video according to the display area and display duration of object B on the original video, taking the video segment corresponding to 0-2 seconds as the first video segment, the video segment corresponding to 2-4 seconds as the second video segment, the video segment corresponding to 4-6 seconds as the third video segment, and so on.
In step S307, a preset image frame in each of at least one video clip is determined as a key frame in each video clip.
Optionally, for the segmentation result corresponding to the object a to be processed, the server may determine the first image frame as a key frame in the original video, or may determine any one image frame in the original video as a key frame in the original video.
Optionally, for the segmentation result corresponding to object B to be processed, the server may determine the first image frame of the first video segment as the key frame of the first video segment, or may determine any one image frame of the first video segment as its key frame; likewise, the server may determine the first image frame of the second video segment as the key frame of the second video segment, or any one image frame of the second video segment as its key frame, and so on. Key frames in the other video segments are determined in the same way as in the first video segment, and the details are not repeated here.
In summary, since the display rule is associated with the type information of the object to be processed, the display of the object to be processed on the image frames follows a certain regularity. The server can therefore accurately segment the original video and determine the key frame of each video segment according to the display rule of the object to be processed, such as its display area and display duration, so that the key frame in each video segment can better assist the removal of the object from the video segment to which it belongs.
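As an illustration of rule-based segmentation, the sketch below splits a video according to a hypothetical table of display rules keyed by the object's type information; the rule table, function name, and region labels are all invented for the example and are not part of the disclosure.

```python
# Hypothetical display rules keyed by the object's type information.
# Each rule lists (start_second, end_second, display_area) entries;
# end_second None means the object is displayed for the whole video.
DISPLAY_RULES = {
    "object_A": [(0, None, "top_left")],
    "object_B": [(0, 2, "top_left"), (2, 4, "bottom_right"), (4, 6, "top_right")],
}

def split_by_display_rule(object_type, fps):
    """One segment per display window of the object; 1-based frame ranges."""
    segments = []
    for start_s, end_s, area in DISPLAY_RULES[object_type]:
        if end_s is None:                 # displayed throughout: no split
            return [("whole_video", area)]
        start = int(start_s * fps) + 1
        end = int(end_s * fps)
        segments.append(((start, end), area))
    return segments

print(split_by_display_rule("object_B", 30))
# [((1, 60), 'top_left'), ((61, 120), 'bottom_right'), ((121, 180), 'top_right')]
```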
In another alternative embodiment, the server may segment and key-frame extract the original video based on similarity data between image frames, resulting in at least one video clip and key-frames in each video clip. FIG. 4 is a flowchart illustrating a video segmentation and key frame extraction method according to an exemplary embodiment, as shown in FIG. 4, including:
in step S401, performing object identification on a first image frame in an original video to obtain an object to be processed of the original video; the first image frame is an image frame of an object to be processed appearing for the first time in the original video.
In this embodiment of the present application, the original video may be a video published on a video platform, so the objects to be processed in the original video may be uniform throughout the original video, for example all of them belonging to the video platform, or to the video platform plus the publisher of the video.
Optionally, the resolution of each image frame in the original video is consistent, such as 640 × 320.
In an alternative embodiment, the object to be processed in the original video may start from the first image frame, and therefore, the server may perform object recognition on the first image frame to obtain the object to be processed in the original video.
In another alternative embodiment, if the object to be processed in the original video does not start from the first image frame, the server may perform object recognition on the image frames in sequence until the object to be processed is found. In this way, the server determines the image frame in which the object to be processed first appears and uses it as the first image frame.
In this embodiment of the present application, the server may perform object recognition on the first image frame of the original video through an object detection model to obtain the object to be processed of the original video. Optionally, the object detection model may include, but is not limited to, a deep learning model using a convolutional neural network, a recurrent neural network, or a recursive neural network.
In step S403, first position information of the object to be processed on the first image frame is determined.
In an alternative embodiment, the server may not only perform object identification on the first image frame in the original video through the object detection model to obtain the object to be processed of the original video, but also obtain first position information of the object to be processed on the first image frame.
In another alternative embodiment, after the server determines the object to be processed in the original video, a rectangular frame may be marked around the object to be processed in the first image frame, and the first position information of the object to be processed on the first image frame is determined based on the pixels of the first image frame enclosed by the rectangular frame.
Optionally, the first position information indicates four pixel pairs ((X1, Y1), (X1, Y2), (X2, Y1), (X2, Y2)) corresponding to four corners of the rectangular frame, or the first position information indicates two pixel pairs ((X1, Y1), (X2, Y2)) corresponding to two opposite corners of the rectangular frame, and the two pixel pairs ((X1, Y1), (X2, Y2)) corresponding to two opposite corners can locate the four pixel pairs ((X1, Y1), (X1, Y2), (X2, Y1), (X2, Y2)) corresponding to four corners of the rectangular frame.
In step S405, the first image frame is cropped based on the first position information to obtain a first sub-image corresponding to the first position information.
In this embodiment, the server may crop the first image frame based on the first position information, for example the four pixel pairs ((X1, Y1), (X1, Y2), (X2, Y1), (X2, Y2)) corresponding to the four corners of the rectangular frame, to obtain the first sub-image corresponding to the first position information.
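A minimal cropping sketch, assuming the frame is held as an H × W × C array and the first position information is given as the two opposite-corner pixel pairs ((X1, Y1), (X2, Y2)); the function name is illustrative only.

```python
import numpy as np

def crop_sub_image(frame, position):
    """Crop the region indicated by the first position information.
    `frame` is an H x W x C array; `position` is assumed to be the two
    opposite-corner pixel pairs ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = position
    return frame[y1:y2, x1:x2]

# Example on a 320 x 640 frame with the object in the top-left 32 x 16 region:
frame = np.zeros((320, 640, 3), dtype=np.uint8)
first_sub_image = crop_sub_image(frame, ((0, 0), (32, 16)))
print(first_sub_image.shape)  # (16, 32, 3)
```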
In step S407, determining similarity data corresponding to each second image frame based on the first sub-image and each second image frame in the second image frame set; the second image frame set includes image frames in the original video other than the first image frame.
Alternatively, the server may group image frames other than the first image frame in the original video into a second image frame set. And determining similarity data corresponding to each second image frame in the set of second image frames based on the first sub-image and each second image frame.
In step S409, the original video is segmented based on the similarity data corresponding to each second image frame, so as to obtain at least one video segment.
Optionally, the server may segment the original video based on the similarity data corresponding to each second image frame to obtain at least one video segment.
In step S411, a preset image frame in each of at least one video clip is determined as a key frame in each video clip.
Alternatively, the server may determine the first image frame in each video segment as a key frame in each video segment, or may determine any one image frame in each video segment as a key frame in each video segment.
As described above, in the embodiment of the present application, image frames with the same object to be processed may be accurately divided into one video segment according to the similarity data between the image frames, so that when the mask of the object to be processed of the key frame of the video segment is required to be used to remove the object to be processed from the video segment in the final stage, the mask of the object to be processed of the key frame of the video segment may be quickly positioned to the position of the object to be processed in the image frame of the video segment, and the object to be processed may be removed from the video segment uniformly.
Fig. 5 is a flowchart illustrating a method of segmenting an original video based on similarity data, according to an exemplary embodiment, as shown in fig. 5, including:
in step S501, a second sub-image corresponding to the first position information in each second image frame is acquired.
Based on the above first position information, i.e., the four pixel pairs ((X1, Y1), (X1, Y2), (X2, Y1), (X2, Y2)) corresponding to the four corners of the rectangular frame, the server may obtain, from each second image frame, the second sub-image corresponding to the first position information. That is, the second sub-image of each second image frame is obtained by cutting out the region of that frame corresponding to the rectangular frame.
In step S502, similarity data corresponding to each second image frame is determined based on the similarity between each second sub-image and the first sub-image.
Optionally, assuming that the first sub-image and the second sub-image are sub-images corresponding to 5 × 5 pixels, the server may determine pixel pairs of pixels in the first sub-image and the second sub-image, so as to obtain 25 pixel pairs. For example, pixels with a row position of 1 and a column position of 1 in the first sub-image and pixels with a row position of 1 and a column position of 1 in the second sub-image form a pixel pair, and pixels with a row position of 5 and a column position of 5 in the first sub-image and pixels with a row position of 5 and a column position of 5 in the second sub-image form a pixel pair.
Then, the server may compare the similarity degrees based on each pixel pair to obtain similarity data corresponding to each pixel pair, and optionally, the similarity data may be percentage data. Further, the server may determine similarity data corresponding to each second image frame based on an average of the similarity data of the 25 pixel pairs.
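The per-pixel-pair comparison and averaging can be sketched as follows. The disclosure does not specify how the similarity of a single pixel pair is scored, so the sketch assumes one possible measure (100% minus the normalized absolute intensity difference); the function name is illustrative.

```python
import numpy as np

def frame_similarity(first_sub, second_sub):
    """Average per-pixel similarity between the first sub-image and one second
    sub-image.  Each pixel pair is scored as a percentage (here: 100% minus
    the normalized absolute intensity difference), then the scores are
    averaged into the similarity data of the second image frame."""
    a = first_sub.astype(np.float32)
    b = second_sub.astype(np.float32)
    per_pixel = 100.0 * (1.0 - np.abs(a - b) / 255.0)
    return float(per_pixel.mean())

# Two 5 x 5 grayscale sub-images -> 25 pixel pairs averaged into one score.
first = np.full((5, 5), 200, dtype=np.uint8)
second = np.full((5, 5), 190, dtype=np.uint8)
print(round(frame_similarity(first, second), 1))  # ~96.1
```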
In step S503, comparing the similarity data corresponding to each second image frame with preset data, and if the similarity data corresponding to each second image frame meets the preset data, going to step S504; otherwise, go to step S505.
In step S504, a video clip is obtained; the first image frame of the video segment is the first image frame.
Optionally, the server may compare the similarity data corresponding to each second image frame with preset data. Assuming the preset data is 80%, if the similarity data corresponding to every second image frame is greater than or equal to the preset data, a single video segment is obtained; that is, the original video is treated as one video segment, and the leading image frame of this video segment is the aforementioned first image frame.
In step S505, if there is a first target image frame set in the second image frame set, and the first target image frame positioned in the first target image frame set is adjacent to the first image frame and positioned behind the first image frame in the original video, a first video segment is determined based on the first image frame and the first target image frame set; the first target image frame set comprises a first target image frame or a plurality of continuous first target image frames; the first target image frame set is not equal to the second image frame set, and the similarity data corresponding to the first target image frame in the first target image frame set meets preset data.
Optionally, the server may compare the similarity data corresponding to each second image frame with the preset data. Assume the preset data is 80%, the first image frame is the first image frame of the original video, and the original video contains 300 image frames. If a first target image frame set (the 2nd through 60th image frames) exists in the second image frame set (the 2nd through 300th image frames), and the first target image frame positioned first in that set (the 2nd image frame) is adjacent to the first image frame and positioned after it in the original video, the server may determine a first video segment based on the first image frame and the first target image frame set.
Wherein the first target image frame set is not equal to the second image frame set. The similarity data (for example, all greater than or equal to 80%) corresponding to the first target image frame in the first target image frame set satisfies the preset data.
In this way, the server can determine the first video clip.
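For illustration, the search for the first target image frame set, i.e. the leading run of second image frames whose similarity data meets the preset data, can be sketched as follows; the frame indexing and function name are assumptions made for the example.

```python
def first_video_segment(similarities, threshold=80.0):
    """Find the first target image frame set: the run of consecutive second
    image frames, starting right after the first image frame, whose
    similarity data meets the preset data (threshold).

    `similarities[i]` is the score of the (i + 2)-th frame of the original
    video (the second image frame set starts at frame 2).  Returns the
    1-based (start, end) frame range of the first video segment."""
    run_length = 0
    for score in similarities:
        if score >= threshold:
            run_length += 1
        else:
            break
    # First segment = first image frame + the leading run of matching frames.
    return (1, 1 + run_length)

# Frames 2-60 match the first sub-image, frame 61 does not:
scores = [95.0] * 59 + [40.0] + [90.0] * 10
print(first_video_segment(scores))  # (1, 60)
```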
In step S506, a difference set between the second image frame set and the first target image frame set is used as a video to be segmented.
Continuing with the above embodiment, the server may use the difference set (the 61st through 300th image frames) between the second image frame set (the 2nd through 300th image frames) and the first target image frame set (the 2nd through 60th image frames) as the video to be segmented.
In step S507, a first image frame of the video to be segmented is taken as a new first image frame; and taking the image frames except the new first image frame in the video to be segmented as a new second image frame set.
Continuing with the above embodiment, the server may use the first image frame of the video to be segmented (the 61st image frame) as the new first image frame, and use the image frames of the video to be segmented other than the new first image frame (the 62nd through 300th image frames) as the new second image frame set.
In step S508, new first position information of the object to be processed on a new first image frame is determined.
Optionally, in this embodiment of the application, the server may determine new first position information of the object to be processed on the new first image frame through the object detection model, where the new first position information may be four pixel pairs ((X3, Y3), (X3, Y4), (X4, Y3), (X4, Y4)) corresponding to four corners of the rectangular frame. Optionally, the object detection model may include, but is not limited to, a deep learning model using a convolutional neural network, a cyclic neural network, or a recurrent neural network.
In step S509, a new first sub-image corresponding to the new first position information is obtained from the new first image frame, and a new second sub-image corresponding to the new first position information is obtained from each second image frame in the new second image frame set.
Alternatively, the server may obtain a new first sub-image corresponding to the new first position information from the new first image frame, and obtain a new second sub-image corresponding to the new first position information from each second image frame in the new second image frame set.
In step S510, similarity data for each new second sub-image is determined based on the degree of similarity of the new first sub-image and each new second sub-image.
Optionally, assuming that the new first sub-image and the new second sub-image are sub-images of 6 × 6 pixels, the server may pair the pixels in the new first sub-image with the pixels in the new second sub-image to obtain 36 pixel pairs. For example, the pixel at row 1, column 1 of the new first sub-image and the pixel at row 1, column 1 of the new second sub-image form one pixel pair, and the pixel at row 6, column 6 of the new first sub-image and the pixel at row 6, column 6 of the new second sub-image form another pixel pair.
Then, the server may compare the two pixels in each pixel pair to obtain similarity data for that pixel pair; optionally, the similarity data may be expressed as a percentage. Further, the server may determine the similarity data corresponding to each new second image frame based on the average of the similarity data of the 36 pixel pairs.
In step S511, if a new first target image frame set exists in the new second image frame set, and the new first target image frame positioned at the first position in the new first target image frame set is adjacent to the new first image frame and positioned after the new first image frame in the original video, a second video segment is determined based on the new first image frame and the new first target image frame set; the new first target image frame set comprises one new first target image frame or a plurality of consecutive new first target image frames; the new first target image frame set is not equal to the new second image frame set, and the similarity data corresponding to each new first target image frame in the new first target image frame set meets the preset data.
Optionally, the server may compare the similarity data corresponding to each new second image frame with the preset data. If a new first target image frame set (the 62nd to 120th image frames) exists in the new second image frame set (the 62nd to 300th image frames), and the new first target image frame (the 62nd image frame) positioned at the first position in the new first target image frame set is adjacent to the new first image frame (the 61st image frame) and positioned after it in the original video, the server may determine a second video segment (the 61st to 120th image frames) based on the new first image frame and the new first target image frame set.
Wherein the new first target image frame set is not equal to the new second image frame set, and the similarity data corresponding to each new first target image frame in the new first target image frame set (for example, all greater than or equal to 80%) satisfies the preset data.
In this way, the server may determine the second video segment. The server may then determine a third video segment, a fourth video segment, and so on, with reference to the determination of the second video segment, until all image frames in the original video are divided into video segments.
As described above, image frames in which the object to be processed is in the same condition are accurately divided into one video segment by means of the similarity data between the first sub-image located at the first position information in the first image frame and the second sub-images corresponding to the first position information in the other image frames, and the original video is thereby divided into a plurality of video segments. Therefore, in the final stage, when the mask of the object to be processed of the key frame of a video segment is used to remove the object to be processed from that video segment, the mask can be quickly positioned at the location of the object to be processed in each image frame of the video segment, and the object to be processed can then be removed from the video segment in a unified manner.
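As an illustration of the procedure of steps S505 to S511 described above, the sketch below treats segmentation as a loop: starting from the first image frame of the (remaining) video, the run of consecutive following frames whose similarity data meets the preset data forms one video segment, and the difference set is then treated as a new video to be segmented. The helper names `detect_position`, `crop` and `similarity`, and the 80% threshold, are assumptions made only for this sketch.

```python
def segment_by_similarity(frames, detect_position, crop, similarity, preset=80.0):
    """Split `frames` into segments of consecutive frames whose sub-image at the
    segment's first-frame object position stays similar to the first sub-image."""
    segments, start = [], 0
    while start < len(frames):
        pos = detect_position(frames[start])          # first position information
        first_sub = crop(frames[start], pos)          # first sub-image
        end = start + 1
        # extend the segment while the similarity data meets the preset data
        while end < len(frames) and similarity(first_sub, crop(frames[end], pos)) >= preset:
            end += 1
        segments.append(frames[start:end])            # e.g. frames 1-60, then 61-...
        start = end                                   # the difference set becomes the new video to be segmented
    return segments
```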
Fig. 6 is a flowchart illustrating a method of segmenting an original video based on similarity data, according to an exemplary embodiment, as shown in fig. 6, including:
in step S601, the first image frame is determined as the current image frame.
The server determines the first image frame as the current image frame. As stated above, the first image frame is the first image frame in the original video, and the original video contains 300 image frames.
In step S602, in the original video, image frames that are a preset interval from the current image frame are determined as execution image frames according to the video playing sequence.
In this embodiment of the application, the server may regard, as the execution image frame, an image frame that is apart from the current image frame by a preset interval in the original video according to the video playing sequence.
Alternatively, to ensure the accuracy of the subsequent video segmentation, the preset interval may be one image frame. That is to say, according to the video playing sequence, the server may regard the image frame that is one image frame away from the current image frame in the original video, that is, the second image frame of the original video, as the execution image frame.
Alternatively, in practical applications, the position of the object to be processed on the image frame in the original video may be constant for a short time. Based on this, in order to guarantee a certain software processing speed, the preset interval may be several image frames apart, such as 5 image frames apart, at the expense of a certain segmentation accuracy. That is, the server may regard, as the execution image frame, an image frame that is five image frames away from the current image frame in the original video, that is, a sixth image frame of the original video, in the video playing order.
In step S603, execution position information in the execution image frame is determined based on the first position information, and image interception is performed on the execution image frame based on the execution position information to obtain an execution sub-image corresponding to the execution position information.
As mentioned above, the resolution of each image frame in the original video is the same, for example, 640 × 320. Assuming that the first position information is two pixel pairs ((X1, Y1), (X2, Y2)) corresponding to two opposite corners of a rectangular frame, the execution position information in the execution image frame may be the two pixel pairs ((X1, Y1), (X2, Y2)) corresponding to the two opposite corners of the same rectangular frame in the execution image frame. Then, the server may perform image interception on the execution image frame based on the execution position information to obtain the execution sub-image corresponding to the execution position information.
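Because all image frames share the same resolution, the same pixel coordinates can be reused to crop the execution sub-image from the execution image frame. A minimal sketch, assuming frames are numpy arrays indexed as (row, column) and the position information is given as two opposite corners:

```python
import numpy as np

def crop_by_position(frame: np.ndarray, position) -> np.ndarray:
    """Image interception: cut out the rectangle given by two opposite corners.

    `position` is assumed to be ((x1, y1), (x2, y2)) in pixel coordinates,
    with x as the column index and y as the row index.
    """
    (x1, y1), (x2, y2) = position
    top, bottom = min(y1, y2), max(y1, y2)
    left, right = min(x1, x2), max(x1, x2)
    return frame[top:bottom, left:right]

# The first sub-image and every execution sub-image are cropped with the same
# position information, so they have identical shapes and can be compared pixel by pixel.
```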
In step S604, similarity data corresponding to the execution sub-image is determined based on the first sub-image and the execution sub-image.
Alternatively, assuming that the first sub-image and the execution sub-image are sub-images of 5 × 5 pixels, the server may pair the pixels in the first sub-image with the pixels in the execution sub-image to obtain 25 pixel pairs. For example, the pixel at row 1, column 1 of the first sub-image and the pixel at row 1, column 1 of the execution sub-image form one pixel pair, and the pixel at row 5, column 5 of the first sub-image and the pixel at row 5, column 5 of the execution sub-image form another pixel pair.
Then, the server may compare the two pixels in each pixel pair to obtain similarity data for that pixel pair; optionally, the similarity data may be expressed as a percentage. Further, the server may determine the similarity data corresponding to the execution sub-image based on the average of the similarity data of the 25 pixel pairs.
In step S605, the similarity data corresponding to the execution sub-image is compared with first preset data; if the similarity data corresponding to the execution sub-image meets the first preset data, the execution image frame is determined as the current image frame and the process goes to step S602; otherwise, the process goes to step S606.
In this embodiment of the present application, the first preset data may be preset, for example, as a similarity of greater than or equal to 95%. That is, if the similarity data is greater than or equal to 95%, it is determined that the similarity between the execution sub-image and the first sub-image is very high, and therefore the execution sub-image contains the same object to be processed as the first sub-image.
Based on this, the first image frame and the execution image frame (the second image frame), or all image frames from the first image frame to the execution image frame (the sixth image frame when the interval is five image frames), can be regarded as image frames of one video segment. It is then determined whether the image frames following the execution image frame also belong to this video segment.
Taking the execution image frame as the second image frame as an example, the server may regard the execution image frame as the current image frame, and then repeat steps S602-S605, that is, regarding an image frame (the third image frame in the original video) in the original video that is one image frame away from the current image frame as the execution image frame in the video order. And acquiring an execution sub-image corresponding to the first position information in the third image frame, determining similarity data based on the first sub-image and the execution sub-image, and if the similarity data is greater than or equal to 95%, determining that the similarity between the execution sub-image and the first sub-image is very high, so that the execution sub-image in the third image frame comprises the same object to be processed as that in the first sub-image.
Based on this, the server may regard the execution image frame (the third image frame) as the current image frame, and continue to repeat steps S602-S605, and the process of repeating steps is referred to above and will not be repeated here.
In step S606, until the similarity data corresponding to the executed sub-image satisfies the second preset data, the first image frame is determined as the start image frame of the first video segment, and the previous image frame of the executed image frame is determined as the end image frame of the first video segment, so as to obtain the first video segment.
Alternatively, corresponding to the first preset data above, the second preset data may be a similarity of less than 95%. For example, if, after several rounds of the above steps, the execution image frame is the thirty-first image frame in the original video, and the similarity data between the execution sub-image corresponding to the thirty-first image frame and the first sub-image is less than 95%, the server may regard the first image frame as the start image frame of the first video segment and the previous image frame of the execution image frame as the end image frame of the first video segment, thereby obtaining the first video segment, which includes 30 image frames in total, from the first image frame to the thirtieth image frame. In this way, the server determines the first video segment, and the position of the object to be processed of each image frame in the first video segment is within the preset position information corresponding to the first video segment, for example the upper left corner (a 5 × 5 region with the upper left corner as a vertex).
In step S607, the execution image frame is determined as the second image frame.
At this time, the execution image frame is a thirty-first image frame, and the server regards the thirty-first image frame as a second image frame.
In step S608, second position information of the object to be processed on the second image frame is determined, and the second image frame is subjected to image capture based on the second position information, so as to obtain a second sub-image corresponding to the second position information.
Alternatively, the server may determine second position information of the object to be processed on the second image frame through the object detection model, where the second position information may be four pixel pairs ((X3, Y3), (X3, Y4), (X4, Y3), (X4, Y4)) corresponding to four corners of the rectangular frame. Optionally, the object detection model may include, but is not limited to, a deep learning model using a convolutional neural network, a cyclic neural network, or a recurrent neural network. And the server may perform image capturing on the second image frame based on the second position information to obtain a second sub-image corresponding to the second position information.
In step S609, the second image frame is determined as the current image frame.
Since the process now enters the loop for determining the second video segment, the server may determine the second image frame as the current image frame.
In step S610, in the original video, image frames that are a preset interval from the current image frame are determined as execution image frames according to the video playing sequence.
In this embodiment, the server may regard, as the execution image frame, an image frame that is one image frame away from the current image frame in the original video according to the video playing sequence. That is, the server may regard the thirty-second image frame as the execution image frame.
In step S611, the execution position information in the execution image frame is determined based on the second position information, and the execution image frame is subjected to image capture based on the execution position information in the execution image frame, so as to obtain an execution sub-image corresponding to the execution position information.
In this embodiment of the application, the server may determine execution position information in the execution image frame based on the second position information, and perform image interception on the execution image frame based on the execution position information in the execution image frame to obtain an execution sub-image corresponding to the execution position information.
In step S612, similarity data corresponding to the execution sub-image is determined based on the second sub-image and the execution sub-image.
Optionally, assuming that the second sub-image and the execution sub-image are sub-images of 5 × 5 pixels, the server may pair the pixels in the second sub-image with the pixels in the execution sub-image to obtain 25 pixel pairs. For example, the pixel at row 1, column 1 of the second sub-image and the pixel at row 1, column 1 of the execution sub-image form one pixel pair, and the pixel at row 5, column 5 of the second sub-image and the pixel at row 5, column 5 of the execution sub-image form another pixel pair.
Then, the server may compare the two pixels in each pixel pair to obtain similarity data for that pixel pair; optionally, the similarity data may be expressed as a percentage. Further, the server may determine the similarity data corresponding to the execution sub-image based on the average of the similarity data of the 25 pixel pairs.
In step S613, the similarity data corresponding to the execution sub-image is compared with the first preset data; if the similarity data corresponding to the execution sub-image meets the first preset data, the execution image frame is determined as the current image frame and the process goes to step S610; otherwise, the process goes to step S614.
In the embodiment of the present application, if the similarity data is greater than or equal to 95%, it is determined that the similarity between the execution sub-image and the second sub-image is very high, and therefore the execution sub-image contains the same object to be processed as the second sub-image.
Based on this, the second image frame (31 st image frame) and the execution image frame (32 nd image frame) can be regarded as image frames in one video clip. Subsequently, it is determined whether the image frame following the execution image frame also belongs to the image frame in the video segment.
Taking the execution image frame as the 32 nd image frame as an example, the server may take the execution image frame as the current image frame, and then repeat steps S610-S613, that is, in the video order, an image frame (the 33 rd image frame in the original video) in the original video that is one image frame away from the current image frame is taken as the execution image frame. And acquiring an execution sub-image corresponding to the second position information in the 33 th image frame, determining similarity data based on the second sub-image and the execution sub-image, and if the similarity data is greater than or equal to 95%, determining that the similarity between the execution sub-image and the second sub-image is very high, so that the execution sub-image in the 33 th image frame comprises the same object to be processed as that in the second sub-image.
Based on this, the server may regard the execution image frame (33 rd image frame) as the current image frame, and continue to repeat steps S610-S613, and the process of repeating steps is referred to above and will not be repeated here.
In step S614, until the similarity data corresponding to the executed sub-image satisfies the second preset data, the second image frame is determined as the start image frame of the second video segment, and the previous image frame of the executed image frame is determined as the end image frame of the second video segment, so as to obtain the second video segment.
Alternatively, corresponding to the first preset data above, the second preset data may be a similarity of less than 95%. For example, if, after several rounds of the above steps, the execution image frame is the 61st image frame in the original video, and the similarity data between the execution sub-image corresponding to the 61st image frame and the second sub-image is less than 95%, the server may regard the second image frame as the start image frame of the second video segment and the previous image frame of the execution image frame as the end image frame of the second video segment, thereby obtaining the second video segment, which includes the 31st to 60th image frames, 30 image frames in total. In this way, the server determines the second video segment, and the position of the object to be processed of each image frame in the second video segment is within the preset position information corresponding to the second video segment, for example the lower right corner (a 5 × 5 region with the lower right corner as a vertex).
In step S615, at least one video segment is obtained until the original video is traversed; the at least one video segment includes a first video segment and a second video segment.
Then, the server may perform the above steps until each image frame in the original video has been traversed, so as to obtain at least one video segment. The first image frame in each video segment may be determined as the key frame of that video segment, or any image frame in each video segment may be determined as its key frame.
As described above, image frames in which the object to be processed is in the same condition are accurately divided into one video segment by means of the similarity data between the first sub-image located at the first position information in the first image frame and the execution sub-images corresponding to the execution position information in the other image frames, and the original video is divided into a plurality of video segments through a large loop nested with a small loop. Therefore, in the final stage, when the mask of the object to be processed of the key frame of a video segment is used to remove the object to be processed from that video segment, the mask can be quickly positioned at the location of the object to be processed in each image frame of the video segment, and the object to be processed can then be removed from the video segment in a unified manner. Moreover, since a large loop nested with a small loop can be realized by loop statements, this embodiment of the present application can implement the function with fewer lines of code.
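As an illustration of the large-loop/small-loop structure of Fig. 6, the sketch below walks the original video with a preset interval: the inner (small) loop extends the current segment while the execution sub-image still meets the first preset data, and the outer (large) loop starts a new segment, and re-detects the object position, as soon as it does not. The interval, thresholds and helper names are assumptions for the sketch only.

```python
def segment_video(frames, detect_position, crop, similarity,
                  first_preset=95.0, interval=1):
    """Fig. 6 style segmentation: an outer (large) loop over segments and an
    inner (small) loop over execution image frames inside one segment."""
    segments, start = [], 0
    while start < len(frames):                         # large loop: one pass per segment
        pos = detect_position(frames[start])           # position info on the segment's start frame
        ref_sub = crop(frames[start], pos)             # reference sub-image of the segment
        current = start
        while True:                                    # small loop: extend the segment
            execute = current + interval               # execution image frame index
            if execute >= len(frames):
                execute = len(frames)
                break
            exec_sub = crop(frames[execute], pos)      # execution sub-image
            if similarity(ref_sub, exec_sub) >= first_preset:
                current = execute                      # execution frame becomes the current frame
            else:
                break                                  # second preset data is met; segment ends
        segments.append((start, execute - 1))          # start / end image frame indices
        start = execute                                # the execution frame starts the next segment
    return segments
```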
In step S203, the object to be processed of the key frame and the position information of the object to be processed of the key frame in each video clip are determined.
In the embodiment of the application, the server may determine the object to be processed of the key frame and the position information of the object to be processed of the key frame in each video clip based on the object detection model.
Fig. 7 is a flowchart illustrating a method for determining a to-be-processed object of a key frame and position information of the to-be-processed object of the key frame according to an exemplary embodiment, as shown in fig. 7, including:
in step S701, a first class object detection is performed on the key frame in each video segment based on a first detection model in the object detection models, and a first class object to be processed and position information of the first class object to be processed of the key frame in each video segment are determined.
In this embodiment, the object detection model may include two modules, which are a first detection model and a second detection model. Wherein the object detection model is a trained model structure. Optionally, the server may invoke an object detection model, perform first class object detection on the key frame in each video segment based on a first detection model in the object detection model, and determine the first class object to be processed of the key frame in each video segment and the position information of the first class object to be processed.
In step S703, second class object detection is performed on the key frame in each video segment based on the second detection model in the object detection model and the position information of the first class object to be processed, and the second class object to be processed and the position information of the second class object to be processed of the key frame in each video segment are determined.
Optionally, the objects to be processed may include different types of objects to be processed, such as a first type of objects to be processed and a second type of objects to be processed.
Since the objects to be processed may include not only the first type of objects to be processed but also the second type of objects to be processed, the second type of object detection may be performed on the key frames in each video clip based on the second detection model in the object detection model, and the second type of objects to be processed and the position information of the second type of objects to be processed of the key frames in each video clip may be determined.
However, if only the second detection model of the object detection models is used to perform second-class object detection on the key frames of each video segment, a second-class object in the key frame that does not belong to the object to be processed may be mistakenly regarded as the second-class object to be processed. For example, when the second class of object to be processed is text, subtitles in the key frame may be detected as text belonging to the object to be processed. Therefore, the server may perform second-class object detection on the key frames in each video segment based on the second detection model in the object detection model and the position information of the first-class object to be processed, and determine the second-class object to be processed and the position information of the second-class object to be processed of the key frame in each video segment.
In the embodiment of the application, the position information of the first-class object to be processed serves to constrain the position information of the second-class object to be processed. Specifically, in the process of detecting second-class objects by the second detection model, if a plurality of second-class objects exist on one key frame, only the second-class objects within a preset distance of the first-class object to be processed are determined as second-class objects to be processed, that is, objects to be removed in the later stage; the other second-class objects do not belong to the objects to be removed, as illustrated by the sketch below.
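The distance constraint can be sketched as a simple filter over detection boxes. The box format, the center-distance measure and the threshold value are assumptions made only for illustration.

```python
def filter_second_class(first_boxes, second_boxes, max_distance=50.0):
    """Keep only second-class detections within a preset distance of some
    first-class detection; the others are not treated as objects to be removed.

    Boxes are assumed to be (x1, y1, x2, y2) rectangles in pixel coordinates.
    """
    def center(box):
        x1, y1, x2, y2 = box
        return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

    kept = []
    for sb in second_boxes:
        sx, sy = center(sb)
        for fb in first_boxes:
            fx, fy = center(fb)
            if ((sx - fx) ** 2 + (sy - fy) ** 2) ** 0.5 <= max_distance:
                kept.append(sb)      # e.g. text close to a logo is kept as part of the object
                break
    return kept
```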
Optionally, the server may determine a set of the first class of objects to be processed and the second class of objects to be processed as the objects to be processed. Optionally, the server may determine a set of the position information of the first type of object to be processed and the position information of the second type of object to be processed as the position information of the object to be processed.
The first detection model may include, but is not limited to, a deep learning model using a convolutional neural network, a cyclic neural network, or a recurrent neural network. Such deep learning models relate to Machine Learning (ML), a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specifically studies how a computer simulates or implements human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental approach to making computers intelligent, and is applied in various fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formula learning. Machine learning can be divided into supervised machine learning, unsupervised machine learning, and semi-supervised machine learning.
Through this embodiment, second-class objects in the key frame that do not belong to the object to be processed are prevented from being treated as second-class objects to be processed, the object to be processed containing both the first-class and second-class objects can be accurately located, and the possibility of errors in identifying the object to be processed is reduced.
In step S205, a mask of the object to be processed for each key frame is generated based on the object to be processed for each key frame and the position information of the object to be processed for each key frame.
In the embodiment of the application, in order to retain more background information, so that the video with the object to be processed removed looks natural and the loss of video information is reduced, the server may perform binarization processing on the pixels of the object to be processed of each key frame based on the position information of the object to be processed of each key frame, so as to obtain the mask of the object to be processed of each key frame.
Alternatively, the position information of the object to be processed may not be a complete rectangular region but a partial region in a complete rectangular region in the key frame. Therefore, the server can determine a rectangular area where the object to be processed is located based on the position information of the object to be processed, and the rectangular area is an area in the key frame.
Then, the server may cut out the rectangular region from the key frame, and determine the position information of the object to be processed on the rectangular region based on the position information of the object to be processed and the position information of the rectangular region on the key frame. The server may then perform binarization processing on the pixels of the rectangular region based on the position information of the object to be processed on the rectangular region, for example, setting the grayscale value of the pixels belonging to the object to be processed to 0 and the grayscale value of the pixels not belonging to the object to be processed to 255. In this way, the server obtains the processed rectangular region, that is, the mask of the object to be processed for each key frame.
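A minimal sketch of the binarization described above, assuming the rectangular region and the object position are given as corner rectangles in key-frame coordinates and that grayscale values of 0 and 255 mark object and background pixels respectively:

```python
import numpy as np

def build_mask(rect, object_box):
    """Create the mask of the object to be processed for one key frame.

    `rect` and `object_box` are (x1, y1, x2, y2) rectangles in key-frame
    coordinates, with `object_box` assumed to lie inside `rect`.
    """
    rx1, ry1, rx2, ry2 = rect
    ox1, oy1, ox2, oy2 = object_box
    mask = np.full((ry2 - ry1, rx2 - rx1), 255, dtype=np.uint8)  # background pixels -> 255
    # translate the object position into the rectangle's local coordinates
    mask[oy1 - ry1:oy2 - ry1, ox1 - rx1:ox2 - rx1] = 0           # object pixels -> 0
    return mask
```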
In step S207, object removal is performed on at least one video clip based on the at least one video clip and the mask of the object to be processed of each key frame, so as to obtain a target video.
In the embodiment of the application, the server can remove the object to be processed from at least one video clip based on the mask of the object to be processed of at least one video clip and each key frame by using the object removal model to obtain the target video; the target video does not contain the object to be processed.
Optionally, the server inputs the mask of the object to be processed of at least one video segment and each key frame into the object removal model, where the mask of the object to be processed carries the position information of the rectangular region in the key frame. And then, on the basis of the object removal model, removing the object to be processed from the video clip corresponding to the key frame by using the mask of the object to be processed of the key frame of each video clip to obtain the target video without the object to be processed.
The object removal model may include, but is not limited to, a deep learning model using a convolutional neural network, a cyclic neural network, or a recurrent neural network. Such deep learning models relate to Machine Learning (ML), a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specifically studies how a computer simulates or implements human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence, is the fundamental approach to making computers intelligent, and is applied in various fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formula learning. Machine learning can be divided into supervised machine learning, unsupervised machine learning, and semi-supervised machine learning.
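For illustration, applying the key frame's mask to every frame of its video segment might look like the sketch below. The function `inpaint_frame` stands in for whatever object removal model is used and is purely an assumed interface; because all frames of the segment share the same object position, the same mask and rectangle are reused for each frame.

```python
def remove_object_from_clip(clip_frames, mask, rect, inpaint_frame):
    """Remove the object to be processed from every frame of one video segment.

    `mask` is the key frame's binarized mask and `rect` its (x1, y1, x2, y2)
    position on the key frame.
    """
    x1, y1, x2, y2 = rect
    cleaned = []
    for frame in clip_frames:
        patch = inpaint_frame(frame[y1:y2, x1:x2], mask)  # the model fills the masked area
        out = frame.copy()
        out[y1:y2, x1:x2] = patch
        cleaned.append(out)
    return cleaned
```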
In the embodiment of the present application, the original video may contain a video trailer segment, and the trailer segment may not contain any content, consisting, for example, only of black image frames lasting approximately half a second. If the server acquires a trailer deletion instruction, the server can identify the trailer of the original video and then delete it from the original video.
Fig. 8 is a flowchart illustrating a method of deleting a trailer according to an example embodiment, as shown in fig. 8, including:
in step S801, preset color proportion detection is performed on the original video, so as to obtain proportion data of a preset color of each image frame in the original video.
In the embodiment of the application, the server can perform preset color proportion detection on the original video to obtain proportion data of the preset color of each image frame in the original video. For example, assuming that the preset color is black, the server may perform proportion detection of black pixels on pixels in each image frame to obtain proportion data of the black pixels in each image frame.
In step S803, if the ratio data of the preset colors of a plurality of image frames in the original video satisfies the third preset data, start and end times of the plurality of image frames are determined, wherein the plurality of image frames are consecutive image frames, and the last image frame of the plurality of image frames is a video end image frame.
In this embodiment, if the ratio data of the preset colors of a plurality of image frames in the original video satisfies a third preset data (for example, 80%), where the plurality of image frames are consecutive image frames, and a last image frame of the plurality of image frames is a video end image frame, the server may determine that the plurality of image frames constitute a video end segment. The server may then determine the start-stop time of the video trailer segment.
In step S805, based on the end-of-segment deletion instruction, the plurality of image frames are deleted from the original video based on the start-stop time, resulting in an updated original video.
In the embodiment of the application, the server may delete the plurality of image frames from the original video based on the start-stop time based on the trailer deletion instruction, so as to obtain the updated original video.
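A sketch of the preset-color proportion test for locating the trailer and its start-stop time, assuming grayscale frames, black defined as intensity below a small threshold, and a known frame rate for converting frame indices to times; all thresholds are illustrative. The frames between the returned start and end times are the ones deleted in step S805.

```python
import numpy as np

def find_black_trailer(frames, fps=25.0, ratio_preset=0.8, black_level=10):
    """Return (start_time, end_time) of a run of nearly black frames at the end
    of the video, or None if no such trailer exists."""
    start = len(frames)
    for i in range(len(frames) - 1, -1, -1):           # walk backwards from the last frame
        black_ratio = float(np.mean(frames[i] < black_level))
        if black_ratio >= ratio_preset:                # proportion data meets the preset data
            start = i
        else:
            break
    if start == len(frames):
        return None
    return start / fps, len(frames) / fps              # start-stop time of the trailer segment
```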
Optionally, the deletion of the trailer may be performed before the original video is segmented and the key frame is extracted to obtain at least one video segment and the key frame in each video segment, or may be performed after the original video is segmented and the key frame is extracted to obtain at least one video segment and the key frame in each video segment.
As described above, by locating the trailer formed by the plurality of image frames, the trailer can be deleted according to the trailer deletion instruction, so that detection and removal of the object to be processed no longer need to be performed on the trailer, which saves computing resources.
In summary, according to the present application, the video segments and key frames can be accurately positioned according to the locations where the object to be processed appears in each video segment, and image frames containing the same object to be processed can be accurately divided into one video segment according to the similarity data between image frames. Therefore, in the final stage, when the object to be processed needs to be removed from a video segment by using the mask of the object to be processed of that segment's key frame, the mask can be quickly positioned at the location of the object to be processed in the image frames of the video segment, and the object to be processed can be removed from the video segment in a unified manner. This reduces the computing resources required for removing the object to be processed in subsequent videos, and the method can be applied to videos containing many objects to be processed, so it has high universality.
Fig. 9 is a block diagram illustrating a video processing device according to an example embodiment. The device has the function of realizing the data processing method in the method embodiment, and the function can be realized by hardware or by hardware executing corresponding software. Referring to fig. 9, the apparatus includes:
a segmenting module 901 configured to segment an original video and extract key frames to obtain at least one video segment and key frames in each video segment; the method comprises the steps that position information of an object to be processed contained in image frames in the same video clip on the image frame where the object to be processed belongs meets preset position information; the preset position information is information corresponding to a video clip to which the image frame belongs;
a determining module 902 configured to perform determining an object to be processed of a key frame and position information of the object to be processed of the key frame in each video clip;
a mask generation module 903 configured to perform generating a mask of the object to be processed of each key frame based on the object to be processed of each key frame and the position information of the object to be processed of each key frame;
and an object removing module 904 configured to perform object processing on the at least one video segment based on the at least one video segment and the mask of the object to be processed of each key frame to obtain a target video.
In some possible embodiments, the segmentation module is configured to perform:
carrying out object identification on an original video to obtain the type information of an object to be processed of the original video;
determining a display rule of the object to be processed on the original video based on the type information of the object to be processed;
segmenting an original video based on a display rule to obtain at least one video segment;
and determining a preset image frame in each video clip of the at least one video clip as a key frame in each video clip.
In some possible embodiments, the segmentation module is configured to perform:
determining a display area and a display duration of the object to be processed on the original video based on the type information of the object to be processed;
and segmenting the original video based on the display area and the display duration to obtain at least one video segment.
In some possible embodiments, the segmentation module is configured to perform:
carrying out object recognition on a first image frame in an original video to obtain an object to be processed of the original video; the first image frame is an image frame of an object to be processed appearing for the first time in an original video;
determining first position information of an object to be processed on a first image frame;
image interception is carried out on the first image frame based on the first position information, and a first sub-image corresponding to the first position information is obtained;
determining similarity data corresponding to each second image frame based on the first sub-image and each second image frame in the second image frame set; the second image frame set comprises image frames except the first image frame in the original video;
segmenting the original video based on the similarity data corresponding to each second image frame to obtain at least one video segment;
and determining a preset image frame in each video clip of the at least one video clip as a key frame in each video clip.
In some possible embodiments, the segmentation module is configured to perform:
acquiring a second sub-image corresponding to the first position information in each second image frame;
determining similarity data corresponding to each second image frame based on the similarity degree of each second sub-image and the first sub-image;
if the similarity data corresponding to each second image frame meet preset data, obtaining a video clip; the first image frame of the video clip is the first image frame.
In some possible embodiments, the segmentation module is configured to perform:
if a first target image frame set exists in the second image frame set and a first target image frame positioned at the first position in the first target image frame set is adjacent to the first image frame and positioned behind the first image frame in the original video, determining a first video segment based on the first image frame and the first target image frame set;
the first target image frame set comprises a first target image frame or a plurality of continuous first target image frames; the first target image frame set is not equal to the second image frame set, and the similarity data corresponding to the first target image frame in the first target image frame set meets preset data.
In some possible embodiments, the segmentation module is configured to perform:
taking a difference set between the second image frame set and the first target image frame set as a video to be segmented;
taking a first image frame of a video to be segmented as a new first image frame; taking image frames except the new first image frame in the video to be segmented as a new second image frame set;
determining new first position information of the object to be processed on a new first image frame;
obtaining a new first sub-image corresponding to the new first position information from the new first image frame, and obtaining a new second sub-image corresponding to the new first position information from each second image frame in the new second image frame set;
determining similarity data for each new second sub-image based on the similarity of the new first sub-image and each new second sub-image;
if a new first target image frame set exists in the new second image frame set, and the new first target image frame positioned at the first position in the new first target image frame set is adjacent to the new first image frame and positioned after the new first image frame in the original video, determining a second video segment based on the new first image frame and the new first target image frame set;
the new first target image frame set comprises a new first target image frame or a plurality of consecutive new first target image frames; the new first target image frame set is not equal to the new second image frame set, and the similarity data corresponding to each new first target image frame in the new first target image frame set meets the preset data.
In some possible embodiments, the apparatus further comprises a time determination module configured to perform:
carrying out preset color proportion detection on the original video to obtain proportion data of preset colors of each image frame in the original video;
if the proportion data of the preset colors of a plurality of image frames in the original video meet third preset data, determining the starting and ending time of the plurality of image frames; the plurality of image frames are continuous image frames, and the last image frame of the plurality of image frames is a video end image frame.
In some possible embodiments, the determining module is configured to perform:
performing first-class object detection on key frames in each video clip based on a first detection model in the object detection models, and determining first-class objects to be processed of the key frames in each video clip and position information of the first-class objects to be processed;
and performing second-class object detection on the key frame in each video clip based on a second detection model in the object detection model and the position information of the first-class object to be processed, and determining the second-class object to be processed and the position information of the second-class object to be processed of the key frame in each video clip.
In some possible embodiments, the mask generation module is configured to perform:
and performing binarization processing on the pixels of the object to be processed of each key frame based on the position information of the object to be processed of each key frame to obtain a mask of the object to be processed of each key frame.
It should be noted that, when the apparatus provided in the foregoing embodiment implements the functions thereof, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus and method embodiments provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments for details, which are not described herein again.
Fig. 10 is a block diagram illustrating an apparatus 3000 for video processing according to an example embodiment. For example, the apparatus 3000 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 10, the apparatus 3000 may include one or more of the following components: processing component 3002, memory 3004, power component 3006, multimedia component 3008, audio component 3010, input/output (I/O) interface 3012, sensor component 3014, and communications component 3016.
The processing component 3002 typically controls the overall operation of the device 3000, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 3002 may include one or more processors 3020 to execute instructions to perform all or part of the steps of the methods described above. Further, processing component 3002 may include one or more modules that facilitate interaction between processing component 3002 and other components. For example, the processing component 3002 may include a multimedia module to facilitate interaction between the multimedia component 3008 and the processing component 3002.
The memory 3004 is configured to store various types of data to support operations at the device 3000. Examples of such data include instructions for any application or method operating on device 3000, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 3004 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 3006 provides power to the various components of the device 3000. The power components 3006 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 3000.
The multimedia component 3008 comprises a screen providing an output interface between the device 3000 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, multimedia component 3008 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 3000 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 3010 is configured to output and/or input an audio signal. For example, the audio component 3010 may include a Microphone (MIC) configured to receive external audio signals when the apparatus 3000 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 3004 or transmitted via the communication component 3016. In some embodiments, the audio component 3010 further includes a speaker for outputting audio signals.
I/O interface 3012 provides an interface between processing component 3002 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 3014 includes one or more sensors for providing status assessments of various aspects of the device 3000. For example, the sensor component 3014 can detect the open/closed state of the device 3000 and the relative positioning of components, such as the display and keypad of the apparatus 3000; it can also detect a change in position of the apparatus 3000 or a component of the apparatus 3000, the presence or absence of user contact with the apparatus 3000, the orientation or acceleration/deceleration of the apparatus 3000, and a change in temperature of the apparatus 3000. The sensor component 3014 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 3014 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 3014 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 3016 is configured to facilitate wired or wireless communication between the apparatus 3000 and other devices. The device 3000 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 3016 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 3016 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 3000 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
Embodiments of the present invention further provide a computer-readable storage medium, which may be disposed in an electronic device to store at least one instruction or at least one program for implementing a video processing method, where the at least one instruction or the at least one program is loaded and executed by the processor to implement the video processing method provided in the foregoing method embodiments.
Embodiments of the present invention also provide a computer program product comprising a computer program stored in a readable storage medium, from which at least one processor of a computer device reads and executes the computer program, causing the computer device to perform the method of any one of the first aspect of the disclosed embodiments.
It should be noted that the order of the above embodiments of the present invention is only for description and does not represent the merits of the embodiments. Specific embodiments have been described above; other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or advantageous.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (14)

1. A video processing method, comprising:
segmenting an original video and extracting key frames to obtain at least one video clip and the key frames in each video clip; the position information of an object to be processed contained in image frames in the same video clip on the image frame to which the object to be processed belongs meets preset position information; the preset position information is information corresponding to a video clip to which the image frame belongs;
determining an object to be processed of a key frame in each video clip and position information of the object to be processed of the key frame;
generating a mask of the object to be processed of each key frame based on the object to be processed of each key frame and the position information of the object to be processed of each key frame;
and carrying out object processing on the at least one video clip based on the at least one video clip and the mask of the object to be processed of each key frame to obtain a target video.
2. The video processing method according to claim 1, wherein the segmenting and key frame extracting the original video to obtain at least one video segment and a key frame in each video segment comprises:
carrying out object identification on the original video to obtain the type information of an object to be processed of the original video;
determining a display rule of the object to be processed on the original video based on the type information of the object to be processed;
segmenting the original video based on the display rule to obtain at least one video segment;
determining a preset image frame in each of the at least one video clip as a key frame in each of the at least one video clip.
3. The video processing method according to claim 2, wherein the determining a display rule of the object to be processed on the original video based on the type information of the object to be processed, and segmenting the original video based on the display rule to obtain the at least one video segment comprises:
determining a display area and a display duration of the object to be processed on the original video based on the type information of the object to be processed;
and segmenting the original video based on the display area and the display duration to obtain the at least one video segment.
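(Illustrative note, not part of the claims.) A sketch of the rule-based split in claims 2-3, assuming a hypothetical table that maps an object type to its display region and display duration; the example object types and values are invented for illustration.

```python
# Illustrative sketch only; the rule table, object types and values are hypothetical.
DISPLAY_RULES = {
    "logo":    {"region": (0, 0, 200, 100), "duration_s": None},  # shown for the whole video
    "credits": {"region": None,             "duration_s": 30},    # shown only in the last 30 s
}

def split_by_rule(num_frames, fps, object_type):
    """Return (start_frame, end_frame) pairs cut according to the display rule."""
    rule = DISPLAY_RULES[object_type]
    if rule["duration_s"] is None:
        return [(0, num_frames)]                           # one segment covering the video
    cut = max(0, num_frames - int(rule["duration_s"] * fps))
    return [(0, cut), (cut, num_frames)]                   # split at the display-duration boundary
```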
4. The video processing method according to claim 1, wherein the segmenting of the original video and extracting of key frames to obtain at least one video segment and a key frame in each video segment comprises:
carrying out object identification on a first image frame in the original video to obtain an object to be processed of the original video; the first image frame is an image frame of the original video where the object to be processed appears for the first time;
determining first position information of the object to be processed on the first image frame;
cropping the first image frame based on the first position information to obtain a first sub-image corresponding to the first position information;
determining similarity data corresponding to each second image frame in a second image frame set based on the first sub-image and each second image frame; wherein the second image frame set comprises the image frames of the original video other than the first image frame;
segmenting the original video based on the similarity data corresponding to each second image frame to obtain at least one video segment;
determining a preset image frame in each of the at least one video clip as a key frame in each of the at least one video clip.
5. The video processing method according to claim 4, wherein determining similarity data corresponding to each second image frame based on the first sub-image and each second image frame, and segmenting the original video based on the similarity data corresponding to each second image frame to obtain the at least one video segment comprises:
acquiring a second sub-image corresponding to the first position information in each second image frame;
determining the similarity data corresponding to each second image frame based on the degree of similarity between each second sub-image and the first sub-image;
if the similarity data corresponding to each second image frame meets preset data, obtaining a video clip whose starting image frame is the first image frame.
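(Illustrative note, not part of the claims.) A sketch of the sub-image comparison in claims 4-5, assuming the similarity data is a histogram correlation over the object's region and the preset data is a fixed threshold of 0.9; both the metric and the threshold are assumptions.

```python
# Illustrative sketch only; the histogram metric and the 0.9 threshold are assumptions.
import cv2

def region_similarity(frame_a, frame_b, box):
    """Compare the same rectangular region (x, y, w, h) of two BGR frames."""
    x, y, w, h = box
    a = cv2.cvtColor(frame_a[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    b = cv2.cvtColor(frame_b[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    hist_a = cv2.calcHist([a], [0], None, [64], [0, 256])
    hist_b = cv2.calcHist([b], [0], None, [64], [0, 256])
    return cv2.compareHist(hist_a, hist_b, cv2.HISTCMP_CORREL)

def first_segment(frames, box, threshold=0.9):
    """Grow a clip from the first image frame while the object region stays similar."""
    end = 1
    while end < len(frames) and region_similarity(frames[0], frames[end], box) >= threshold:
        end += 1
    return frames[:end]      # the first video clip; frames[end:] still need segmenting
```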
6. The video processing method according to claim 5, wherein after determining the similarity data corresponding to each second image frame based on the similarity between each second sub-image and the first sub-image, the method further comprises:
if a first target image frame set exists in the second image frame set, the first target image frame at the first position of the first target image frame set is adjacent to the first image frame, and the first target image frame set is located after the first image frame in the original video, determining a first video segment based on the first image frame and the first target image frame set;
the first target image frame set comprises one first target image frame or a plurality of consecutive first target image frames; the first target image frame set is not equal to the second image frame set, and the similarity data corresponding to a first target image frame in the first target image frame set meets the preset data.
7. The video processing method of claim 6, wherein the method further comprises:
taking a difference set between the second image frame set and the first target image frame set as a video to be segmented;
taking a first image frame of the video to be segmented as a new first image frame; taking image frames except the new first image frame in the video to be segmented as a new second image frame set;
determining new first position information of the object to be processed on the new first image frame;
obtaining a new first sub-image corresponding to the new first position information from the new first image frame, and obtaining a new second sub-image corresponding to the new first position information from each second image frame in the new second image frame set;
determining similarity data for each new second sub-image based on the similarity of the new first sub-image and each new second sub-image;
if a new first target image frame set exists in the new second image frame set, the new first target image frame at the first position of the new first target image frame set is adjacent to the new first image frame, and the new first target image frame set is located after the new first image frame in the original video, determining a second video segment based on the new first image frame and the new first target image frame set;
the new first target image frame set comprises a new first target image frame or a plurality of consecutive new first target image frames; the new first target image frame set is not equal to the new second image frame set, and the similarity data corresponding to the first target image frame in the new first target image frame set meets the preset data.
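(Illustrative note, not part of the claims.) A sketch of the iteration in claims 6-7: the frames left over after the first cut become the new video to be segmented, and the process repeats. Here detect_object and cut_first_segment are hypothetical callbacks; the latter plays the role of the per-clip cut shown in the previous sketch.

```python
# Illustrative sketch only; detect_object(frame) -> (x, y, w, h) and
# cut_first_segment(frames, box) -> leading run of frames are hypothetical callbacks.
def segment_all(frames, detect_object, cut_first_segment):
    """Repeatedly cut off the leading segment until no frames remain."""
    segments = []
    remaining = list(frames)
    while remaining:
        box = detect_object(remaining[0])                # new first position information
        segment = cut_first_segment(remaining, box)      # first clip of the remaining video
        if not segment:                                   # safeguard against an empty cut
            segment = remaining[:1]
        segments.append(segment)
        remaining = remaining[len(segment):]              # the difference set is segmented next
    return segments
```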
8. The video processing method according to any of claims 1-7, wherein the method further comprises:
performing preset color proportion detection on the original video to obtain proportion data of preset colors of each image frame in the original video;
if the proportion data of the preset colors of a plurality of image frames in the original video meets third preset data, determining a start time and an end time of the plurality of image frames; wherein the plurality of image frames are consecutive image frames, and the last image frame of the plurality of image frames is the final image frame of the video.
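(Illustrative note, not part of the claims.) A sketch of the colour-proportion check in claim 8, assuming the preset colour is near-black (for example a fade-out before the end of the video) and the third preset data is a 95% pixel proportion; both values are assumptions.

```python
# Illustrative sketch only; the near-black colour and the 0.95 proportion are assumptions.
import numpy as np

def trailing_color_run(frames, proportion=0.95, darkness=16):
    """Return (start_index, end_index) of the trailing run of near-black frames, or None."""
    start = None
    for i, frame in enumerate(frames):
        ratio = np.mean(np.all(frame < darkness, axis=-1))  # share of near-black pixels
        if ratio >= proportion:
            if start is None:
                start = i                                    # a qualifying run begins here
        else:
            start = None                                     # run broken; it must reach the last frame
    return (start, len(frames) - 1) if start is not None else None
```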
9. The video processing method according to claim 1, wherein the objects to be processed include a first class of objects to be processed and a second class of objects to be processed, and the determining the objects to be processed of the key frames and the position information of the objects to be processed of the key frames in each video clip includes:
performing first-class object detection on key frames in each video clip based on a first detection model in object detection models, and determining first-class objects to be processed of the key frames in each video clip and position information of the first-class objects to be processed;
and performing second-class object detection on the key frame in each video clip based on a second detection model in the object detection model and the position information of the first-class object to be processed, and determining the second-class object to be processed and the position information of the second-class object to be processed of the key frame in each video clip.
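(Illustrative note, not part of the claims.) A sketch of the two-stage detection in claim 9, assuming two hypothetical detector callables, the second of which is constrained by the first-stage positions (here simply by discarding boxes that overlap them); the detector interfaces and the overlap rule are assumptions.

```python
# Illustrative sketch only; first_detector and second_detector are hypothetical
# callables returning lists of bounding boxes (x, y, w, h).
def detect_two_classes(key_frame, first_detector, second_detector):
    first_boxes = first_detector(key_frame)          # first-class objects, e.g. logos
    second_boxes = [                                 # second-class objects, constrained by stage one
        box for box in second_detector(key_frame)
        if not any(_overlaps(box, fb) for fb in first_boxes)
    ]
    return first_boxes, second_boxes

def _overlaps(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah
```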
10. The video processing method according to any one of claims 1 to 7 and 9, wherein the generating a mask of the object to be processed of each key frame based on the object to be processed of each key frame and the position information of the object to be processed of each key frame comprises:
performing binarization processing on the pixels of the object to be processed of each key frame based on the position information of the object to be processed of each key frame, to obtain a mask of the object to be processed of each key frame.
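(Illustrative note, not part of the claims.) A sketch of the mask generation in claim 10: a binary image that is white inside the detected object's position and black elsewhere. Using a plain rectangle over the bounding box, rather than per-pixel thresholding of the object, is an assumption.

```python
# Illustrative sketch only; a rectangular mask over each bounding box is an assumption.
import numpy as np

def make_mask(frame_shape, boxes):
    """frame_shape: (H, W) or (H, W, C); boxes: list of (x, y, w, h); returns a uint8 mask."""
    mask = np.zeros(frame_shape[:2], dtype=np.uint8)
    for x, y, w, h in boxes:
        mask[y:y + h, x:x + w] = 255                  # binarized: 255 inside the object, 0 elsewhere
    return mask
```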
11. A video processing apparatus, comprising:
a segmentation module configured to segment an original video and extract key frames to obtain at least one video segment and a key frame in each video segment; wherein the position information, on the image frame to which it belongs, of an object to be processed contained in the image frames of a same video segment meets preset position information, the preset position information being information corresponding to the video segment to which the image frame belongs;
a determining module configured to determine, for the key frame in each video segment, an object to be processed of the key frame and position information of the object to be processed of the key frame;
a mask generation module configured to generate a mask of the object to be processed of each key frame based on the object to be processed of each key frame and the position information of the object to be processed of each key frame;
and an object removal module configured to perform object processing on the at least one video segment based on the at least one video segment and the mask of the object to be processed of each key frame, to obtain a target video.
12. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the video processing method of any of claims 1 to 10.
13. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the video processing method of any of claims 1 to 10.
14. A computer program product, characterized in that the computer program product comprises a computer program, the computer program being stored in a readable storage medium, from which at least one processor of a computer device reads and executes the computer program, causing the computer device to perform the video processing method according to any one of claims 1 to 10.
CN202211025874.3A 2022-08-25 2022-08-25 Video processing method and device, electronic equipment and storage medium Pending CN115529483A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211025874.3A CN115529483A (en) 2022-08-25 2022-08-25 Video processing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115529483A 2022-12-27

Family

ID=84696852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211025874.3A Pending CN115529483A (en) 2022-08-25 2022-08-25 Video processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115529483A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106454155A (en) * 2016-09-26 2017-02-22 新奥特(北京)视频技术有限公司 Video shade trick processing method and device
CN111526421A (en) * 2019-02-01 2020-08-11 网宿科技股份有限公司 Method for generating video mask information and preventing bullet screen from being shielded, server and client
CN112070047A (en) * 2020-09-15 2020-12-11 北京金山云网络技术有限公司 Video processing method and device and electronic equipment
CN112672033A (en) * 2019-10-15 2021-04-16 中兴通讯股份有限公司 Image processing method and device, storage medium and electronic device
CN114071184A (en) * 2021-11-11 2022-02-18 腾讯音乐娱乐科技(深圳)有限公司 Subtitle positioning method, electronic equipment and medium
CN114598919A (en) * 2022-03-01 2022-06-07 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
TWI777162B (en) Image processing method and apparatus, electronic device and computer-readable storage medium
CN109740516B (en) User identification method and device, electronic equipment and storage medium
US20210089799A1 (en) Pedestrian Recognition Method and Apparatus and Storage Medium
CN111783756B (en) Text recognition method and device, electronic equipment and storage medium
CN110569777B (en) Image processing method and device, electronic device and storage medium
CN110633700B (en) Video processing method and device, electronic equipment and storage medium
CN110472091B (en) Image processing method and device, electronic equipment and storage medium
CN111340733B (en) Image processing method and device, electronic equipment and storage medium
CN109635142B (en) Image selection method and device, electronic equipment and storage medium
CN109543536B (en) Image identification method and device, electronic equipment and storage medium
CN105574857B (en) Image analysis method and device
CN110781957A (en) Image processing method and device, electronic equipment and storage medium
CN112836801A (en) Deep learning network determination method and device, electronic equipment and storage medium
CN112911239B (en) Video processing method and device, electronic equipment and storage medium
CN105354793A (en) Facial image processing method and device
CN108171222B (en) Real-time video classification method and device based on multi-stream neural network
CN109671051B (en) Image quality detection model training method and device, electronic equipment and storage medium
CN104077597A (en) Image classifying method and device
CN111680646A (en) Motion detection method and device, electronic device and storage medium
CN114187498A (en) Occlusion detection method and device, electronic equipment and storage medium
CN110415258B (en) Image processing method and device, electronic equipment and storage medium
CN110619325A (en) Text recognition method and device
CN113506229B (en) Neural network training and image generating method and device
CN118368430A (en) Video data processing method and device, and training method and device for enhanced network
CN110765943A (en) Network training and recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination