CN113810751B - Video processing method and device, electronic device and server

Info

Publication number
CN113810751B
CN113810751B (application CN202010537094.1A)
Authority: CN (China)
Prior art keywords: target, video, time, window, segment
Legal status: Active
Application number: CN202010537094.1A
Other languages: Chinese (zh)
Other versions: CN113810751A
Inventor
张士伟
夏朱荣
耿致远
唐铭谦
Current Assignee: Alibaba Damo Academy Beijing Technology Co ltd
Original Assignee: Alibaba Group Holding Ltd
Priority date: based on the filing of CN202010537094.1A
Application filed by Alibaba Group Holding Ltd
Priority to CN202010537094.1A
Publication of CN113810751A (application)
Application granted
Publication of CN113810751B (grant)
Legal status: Active


Classifications

    • H04N 21/431: Generation of visual interfaces for content selection or interaction; content or additional data rendering
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/8456: Structuring of content, e.g. decomposing content into time segments, by decomposing the content in the time domain

Abstract

An embodiment of the invention provides a video processing method and device, an electronic device, and a server. The method comprises: detecting a first video segment in which target content is present in a target video, the first video segment corresponding to a first start time; performing time correction processing on the first start time to obtain a target start time; and acquiring a target video segment based on the target start time. The embodiment thereby improves the efficiency of acquiring video segments that contain target content.

Description

Video processing method and device, electronic device and server
Technical Field
The present invention relates to the field of electronic devices, and in particular, to a video processing method and device, an electronic device, and a server.
Background
With the rapid development of multimedia technology, the volume of video of all types has grown explosively. This rapid growth makes it increasingly difficult for people to find content of interest in the huge amount of available video. To improve the efficiency with which a video conveys information, extracting video clips of key content from the video is a common technical means.
In the prior art, to extract a video clip of key content from a video, a user must locate the clip corresponding to the key content within a long video and record its start time and end time, which are points on the video's time axis; the clip is then cut out of the long video by manual editing. For example, in a video of a football match, viewers pay most attention to shots on goal, so a user typically identifies the shot segments manually and then extracts them.
However, manually identifying the video segments of key content and then editing the video consumes considerable labor, and the efficiency of video segment acquisition is low.
Disclosure of Invention
In view of this, embodiments of the present invention provide a video processing method and apparatus, an electronic device, and a server, to solve the prior-art technical problem that editing efficiency is low because the start time and end time of the key content must be identified manually before video editing.
In a first aspect, an embodiment of the present invention provides a video processing method, including:
detecting a first video segment with target content in a target video; wherein the first video segment corresponds to a first start time;
performing time correction processing on the first start time to obtain a target start time;
and acquiring a target video clip based on the target starting time.
In a second aspect, an embodiment of the present invention provides a video processing method, including:
detecting a target video and target content input by a user;
detecting a first video segment with target content in a target video; wherein the first video segment corresponds to a first start time;
carrying out time correction processing on the first starting time to obtain a target starting time;
acquiring a target video clip based on the target starting time;
outputting the target video segment for the user.
In a third aspect, an embodiment of the present invention provides a video processing method, including:
receiving a target video and target content sent by electronic equipment; the target video and the target content are obtained by detecting user input by the electronic equipment;
detecting a first video segment with target content in a target video; wherein the first video segment corresponds to a first start time;
carrying out time correction processing on the first starting time to obtain a target starting time;
acquiring a target video clip based on the target starting time;
and sending the target video clip to the electronic equipment so that the electronic equipment can output the target video clip for the user.
In a fourth aspect, an embodiment of the present invention provides a video processing apparatus, including: a storage component and a processing component; wherein the storage component is configured to store one or more computer instructions; the one or more computer instructions are invoked for execution by the processing component;
the processing component is to:
detecting a first video segment with target content in a target video; wherein the first video segment corresponds to a first start time; time correction processing is carried out on the first starting time to obtain target starting time; and acquiring a target video clip based on the target starting time.
In a fifth aspect, an embodiment of the present invention provides an electronic device, including: a storage component and a processing component; wherein the storage component is configured to store one or more computer instructions; the one or more computer instructions are invoked for execution by the processing component;
the processing component is to:
acquiring a target video and target content input by a user; detecting a first video segment with target content in a target video; wherein the first video segment corresponds to a first start time; carrying out time correction processing on the first starting time to obtain a target starting time; acquiring a target video clip based on the target starting time; outputting the target video segment for the user.
In a sixth aspect, an embodiment of the present invention provides a server, including: a storage component and a processing component; wherein the storage component is configured to store one or more computer instructions; the one or more computer instructions are invoked for execution by the processing component;
the processing component is to:
receiving a target video and target content sent by electronic equipment; the target video and the target content are obtained by detecting user input by the electronic equipment; detecting a first video segment with target content in a target video; wherein the first video segment corresponds to a first start time; carrying out time correction processing on the first starting time to obtain a target starting time; acquiring a target video clip based on the target starting time; and sending the target video clip to the electronic equipment so that the electronic equipment can output the target video clip for the user.
According to the embodiment of the invention, a first video segment in which target content is present is detected in a target video; the first video segment corresponds to a first start time, and time correction processing is performed on the first start time to obtain a target start time, so that the target video segment can be acquired according to the target start time. The first video segment containing the target content is first identified coarsely, and its first start time is then corrected to a target start time that bounds the target content accurately, so an accurate target video segment can be obtained. Automatically detecting and time-correcting the video segment containing the target content completes the clipping of that segment from the target video without manual work, realizing automatic clipping and improving clipping efficiency.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for the description of the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of an embodiment of a video processing method according to the present invention;
fig. 2 is a flowchart of a video processing method according to another embodiment of the present invention;
fig. 3 is a flowchart of a video processing method according to another embodiment of the present invention;
fig. 4 is a flowchart of another embodiment of a video processing method according to an embodiment of the present invention;
fig. 5 is a flowchart of a video processing method according to another embodiment of the present invention;
fig. 6 is a flowchart of a video processing method according to another embodiment of the present invention;
fig. 7a to 7b are diagrams illustrating a video processing method according to an embodiment of the present invention;
fig. 8 is a diagram illustrating a video processing method according to another embodiment of the present invention;
fig. 9 is a schematic structural diagram of an embodiment of a video processing apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an embodiment of an electronic device according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of an embodiment of a server according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the embodiments of the present invention and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and "a plurality of" generally means at least two, though at least one is not excluded, unless the context clearly dictates otherwise.
It should be understood that the term "and/or" as used herein merely describes a relationship between associated objects, meaning that three relationships may exist; for example, "A and/or B" may represent: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the preceding and following objects are in an "or" relationship.
The words "if", as used herein, may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a recognition", depending on the context. Similarly, the phrases "if determined" or "if identified (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when identified (a stated condition or event)" or "in response to an identification (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises", "comprising", and any variations thereof are intended to cover a non-exclusive inclusion, so that an article or system comprising a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such an article or system. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the article or system that comprises it.
The technical solution of the embodiments of the present application can be applied to automatic video clipping scenarios: the video segment containing target content is automatically acquired and extracted from the target video, so that clipping is completed automatically and the clipping efficiency for segments containing the target content is improved.
In the prior art, to extract a video clip of key content from a video, a user must identify the clip corresponding to the key content within a long complete video and record its start time and end time on the video's time axis, so that the clip between those two times can be cut out of the complete video. For example, video clipping software can produce the clip when given the entire video together with the start time and end time of the cut. In this way of editing, however, the start and end times of the video segment must be identified manually and the cut performed by hand, so editing efficiency is low.
In the embodiment of the application, a first video segment in which target content is present is detected in the target video; the first video segment corresponds to a first start time, and time correction processing may be performed on the first start time to obtain a target start time, so that the target video segment can be acquired from the target video according to the target start time. The first video segment containing the target content is first identified coarsely, and time correction then yields a target start time that bounds the target content accurately, so an accurate target video segment can be obtained. Automatically detecting and time-correcting the video segment containing the target content completes the clipping of that segment from the target video without manual work, realizing automatic clipping and improving clipping efficiency.
The embodiments of the present application will be described in detail below with reference to the accompanying drawings.
As shown in fig. 1, a flowchart of an embodiment of a video processing method provided in this application may include the following steps:
101: a first video segment in which target content is present in a target video is detected.
Wherein the first video segment corresponds to a first start time.
The target content is key content required by the user. The target content may be specified by the user.
Whether target content is present in a video segment can be determined by inputting the video segment into a content recognition model and using the model to predict the probability that the target content is present. If the obtained probability satisfies a preset threshold, the video segment is determined to contain the target content.
The content recognition model may be a deep neural network model obtained by training in advance. For example, a plurality of training video segments may be obtained, each associated with label data indicating the probability that the segment contains the target content; the labels may be produced manually or automatically. A deep-neural-network content recognition model is constructed and then trained on the training segments and their respective labels to obtain the model parameters. The specific training process is the same as in the prior art and is not repeated here.
Optionally, in some embodiments, the target video may be divided into a plurality of alternative video segments according to a predetermined division length, such that the alternative segments, combined in time order, reconstruct the target video. The alternative segments are then input into the content recognition model to obtain, for each, a candidate probability that the target content is present. Any alternative segment whose candidate probability is greater than a preset target probability threshold is determined to be a first video segment.
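As an illustration only, this detection step could look like the following sketch, where `score_segment` stands in for the content recognition model; all names, parameters, and the Python framing are assumptions rather than the patent's implementation.

```python
# Minimal sketch of candidate-segment detection: split the target video
# into fixed-length alternative segments, score each with a content
# recognition model, and keep those above the target probability threshold.
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds

def detect_first_segments(video_duration: float,
                          division_length: float,
                          score_segment: Callable[[Segment], float],
                          target_threshold: float) -> List[Segment]:
    first_segments: List[Segment] = []
    start = 0.0
    while start < video_duration:
        segment = (start, min(start + division_length, video_duration))
        # Probability that the target content appears in this segment.
        if score_segment(segment) > target_threshold:
            first_segments.append(segment)
        start += division_length
    return first_segments
```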
In this embodiment, there may be a plurality of first video segments, and the video processing method shown in fig. 1 may be performed on any one of them to obtain a corresponding target video segment.
102: and carrying out time correction processing on the first starting time to obtain the target starting time.
The first video segment is obtained by dividing the target video or by combining a plurality of window segments; it is only a rough capture of the portion of the target video that contains the target content, and its boundaries are imprecise. Time correction processing is therefore performed on the first start time of the first video segment to obtain a more accurate target start time.
In some embodiments, to obtain a more accurate target video segment, the first video segments are first screened against an identification condition so that only those which reliably contain the target content are kept; time correction processing is then performed on the first start time of each first video segment that satisfies the condition, yielding a more accurate target start time.
103: and acquiring the target video clip based on the target starting time.
Optionally, based on the target start time, the corresponding target end time may first be determined. The target start time is then taken as the segment start point on the time axis of the target video and the target end time as the segment end point; the video between these two points is the target video segment.
In one possible design, the segment duration of the target video segment may be preset, and the target ending time corresponding to the target starting time may be obtained by calculating the sum of the target starting time and the segment duration.
In yet another possible design, the target end time corresponding to the target start time may be determined from the first end time of the first video segment: time correction processing is performed on the first end time of the first video segment to obtain the target end time corresponding to its target start time.
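A minimal sketch of the two designs, assuming hypothetical parameter names (`preset_duration`, `corrected_first_end`):

```python
from typing import Optional

def determine_target_end(target_start: float,
                         preset_duration: Optional[float] = None,
                         corrected_first_end: Optional[float] = None) -> float:
    """First design: target end = target start + preset segment duration.
    Second design: reuse an end time produced by time-correcting the
    first video segment's first end time."""
    if corrected_first_end is not None:
        return corrected_first_end
    if preset_duration is None:
        raise ValueError("need a preset duration or a corrected end time")
    return target_start + preset_duration
```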
The video processing method provided by the embodiment of the application can be applied to an electronic device or to a server corresponding to the electronic device. The electronic device may include, for example, a mobile phone, a notebook, a tablet computer, a personal computer, a wearable device, a smart speaker with a screen, or a supercomputer; this embodiment does not restrict the specific type of electronic device. The server corresponding to the electronic device may communicate with it over a wired or wireless connection; this embodiment likewise does not restrict the specific type of server.
When the technical scheme provided by the application is applied to the electronic equipment, the target video can be acquired by the electronic equipment. When the technical scheme provided by the application is applied to the server corresponding to the electronic equipment, the target video can be provided to the server by the electronic equipment.
In the embodiment of the application, a first video segment in which target content is present is detected in the target video; the segment corresponds to a first start time, and time correction processing may be performed on that time to obtain a target start time, from which the target video segment is acquired. The first video segment containing the target content is identified coarsely, and its start time is then corrected to a target start time that bounds the target content accurately, so an accurate target video segment is obtained. Automatically detecting and time-correcting the segment containing the target content completes the clipping of that segment from the target video without manual work and improves clipping efficiency.
A target video segment obtained from a preset segment duration alone may not be accurate enough. The first video segment also corresponds to a first end time paired with its first start time, so the target end time corresponding to the target start time can instead be obtained through a more accurate adjustment, as follows.
As shown in fig. 2, a flow chart of another embodiment of a video processing method provided in the embodiment of the present application may include the following steps:
201: a first video segment in which target content is present in a target video is detected.
Wherein the first video segment corresponds to a first start time and a first end time.
202: performing time correction processing on the first start time and the first end time of the first video segment to obtain a target start time and a target end time corresponding to it.
203: acquiring the target video segment based on the target start time and its corresponding target end time.
In the embodiment of the application, after a first video segment containing target content is detected in the target video, time correction may be performed on both the first start time and the first end time of the segment to obtain its target start time and target end time, so that the first video segment is corrected accurately and an accurate target video segment is obtained. Detecting the first video segment and correcting its times automates the acquisition of the target video segment and improves acquisition efficiency.
As an embodiment, there may be a plurality of first video segments, and performing time correction processing on the first start time and the first end time of a first video segment to obtain the target start time and its corresponding target end time may include:
performing time adjustment processing on the first start time and the first end time of each first video segment that satisfies a first identification condition, to obtain a second start time and a corresponding second end time;
acquiring second video segments based on the second start times and their corresponding second end times, where there may be a plurality of second video segments;
performing time correction processing on the second start time and the second end time of each second video segment that satisfies a second identification condition, to obtain the target start time and its corresponding target end time.
Optionally, since the first identification condition screens the plurality of first video segments, a first video segment that satisfies the condition is more likely to contain the target content than one that does not, and the short stretches of video adjacent to it are also likely to contain the target content. Therefore, when time adjustment processing is performed on the first start time and first end time of a first video segment satisfying the first identification condition, the segment may be extended so that the target video segment covers the target content more completely.
Time adjustment processing is performed on the first start time and the first end time of each first video segment that satisfies the first identification condition, yielding a plurality of second start times and corresponding second end times. Specifically, the first start time of the first video segment may be decreased by a preset duration along the time axis, and the first end time increased by the preset duration, giving the second start time and the corresponding second end time. For example, if the first start time of a first video segment is 10 seconds, its first end time is 20 seconds, and the preset duration is 5 seconds, then subtracting 5 seconds from the first start time gives a second start time of 5 seconds, and adding 5 seconds to the first end time gives a second end time of 25 seconds.
In some embodiments, the difference between the first start time and the second start time is the preset duration, as is the difference between the second end time and the first end time. The preset duration may be positive or negative and can be set according to actual requirements. For example, when both the first N image frames and the last N image frames of the first video segment are detected to contain the target content, the preset duration may be positive: the first start time is decreased and the first end time increased, yielding a longer segment. When neither the first N nor the last N image frames contain the target content, the preset duration may be negative: the first start time is increased and the first end time decreased, yielding a shorter segment. Here N is a positive integer greater than 1. The preset durations used in the subsequent adjustments of the second start and end times, and of the third start and end times, can be determined in the same way.
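The boundary adjustment described here can be sketched as follows; `boundary_frames_have_target` is a hypothetical flag summarizing whether the first and last N frames contain the target content, and the clamp at zero is an added assumption:

```python
def adjust_times(start: float, end: float, preset_duration: float,
                 boundary_frames_have_target: bool) -> tuple:
    """Positive preset duration when the boundary frames still contain the
    target content (expand the segment); negative when they do not
    (shrink it)."""
    delta = preset_duration if boundary_frames_have_target else -preset_duration
    return max(0.0, start - delta), end + delta

# Example from the text: start 10 s, end 20 s, preset 5 s -> (5.0, 25.0)
print(adjust_times(10.0, 20.0, 5.0, True))
```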
In the embodiment of the application, when time correction processing is performed on the first video segments, the segments satisfying the first identification condition are first time-adjusted to obtain second video segments, and the second video segments are then time-corrected to obtain the target video segments. Time-adjusting only the first video segments that satisfy the identification condition yields a first-round screened set of second video segments, and adjusting the qualifying segments over multiple rounds yields target video segments refined by multiple passes. Extracting only the video segments that satisfy the identification conditions improves both the efficiency and the accuracy of target video segment acquisition.
To obtain a still more accurate target video segment, the identification-and-adjustment applied to the first video segments can be repeated several times. Thus, as another embodiment, performing time correction processing on the second start time and second end time of the second video segments satisfying the second identification condition to obtain the target start time and its corresponding target end time may include:
performing time adjustment processing on the second start time and the second end time of each second video segment satisfying the second identification condition, to obtain a third start time and a corresponding third end time;
obtaining third video segments based on the third start times and their corresponding third end times;
performing time adjustment processing on the third start time and the third end time of each third video segment satisfying a third identification condition, to obtain the target start time and its corresponding target end time.
In the embodiment of the present application, the time adjustment applied to the second start and end times of segments satisfying the second identification condition, and to the third start and end times of segments satisfying the third identification condition, uses the same method as the adjustment of the first start and end times described above; the principle and steps are detailed in the foregoing embodiment and are not repeated here. In practice, however, the three rounds may use different preset durations, and a smaller preset duration can be chosen after each round of condition-based screening to sharpen the adjustment. For example, the first preset duration (for the first start and end times) may be 5 seconds, the second 3 seconds, and the third 2 seconds.
In this embodiment, time adjustment of the second start and end times of the second video segments satisfying the second identification condition yields third video segments, which are more likely than the second video segments to contain the target content. Time-adjusting the third video segments that satisfy the third identification condition once more allows still more accurate localization, improving the accuracy of the resulting target video segment.
As a possible implementation, before the time adjustment processing is performed on the first start time and the first end time of the first video segments satisfying the first identification condition to obtain the second start times and their corresponding second end times, the method may further include:
selecting, from the plurality of first video segments, the first video segments that satisfy the first identification condition.
The selection may be performed by extracting first key information from the plurality of first video segments and testing it against the condition. The first identification condition may be: containing first target information corresponding to the target content. The first video segments containing the first target information are then selected according to the first key information extracted from each. The first key information may include, for example, face information, voiceprint information, scene information, or action information.
As another possible implementation, before the time adjustment processing is performed on the second start time and the second end time of the second video segments satisfying the second identification condition to obtain the third start times and their corresponding third end times, the method may further include:
selecting, from the plurality of second video segments, the second video segments that satisfy the second identification condition.
The selection may be performed by extracting second key information from the plurality of second video segments. The second identification condition may be: containing second target information corresponding to the target content. The second video segments containing the second target information are selected according to the second key information extracted from each. The second key information may include, for example, face information, voiceprint information, scene information, or action information.
As yet another possible implementation, before the time adjustment processing is performed on the third start time and the third end time of the third video segments satisfying the third identification condition to obtain the target start times and their corresponding target end times, the method may further include:
selecting, from the plurality of third video segments, the third video segments that satisfy the third identification condition.
The selection may be performed by extracting third key information from the plurality of third video segments. The third identification condition may be: containing third target information corresponding to the target content. The third video segments containing the third target information are selected according to the third key information extracted from each. The third key information may include, for example, face information, voiceprint information, scene information, or action information.
In practical application, the first, second, and third target information may all differ, may all be the same, or any two may coincide, depending on the identification requirements. For example, when the target content is a video segment containing a shooting action, the first, second, and third target information may all be shooting-action information; when the target content is a video segment in which user A performs a shooting action, the first and second target information may be shooting-action information while the third is user A's shooting-action information.
However, the first, second, and third key information capture only part of the information in each video segment and cannot represent the segment as a whole, so confirming target information through key information alone yields qualifying segments of limited accuracy.
Therefore, in order to obtain a more accurate selection effect, in some embodiments, selecting a first video segment of the first video segments that satisfies the first recognition condition may include:
identifying a plurality of first video clips to obtain a plurality of first identification results;
and determining a first video segment meeting the first identification condition according to the plurality of first identification results.
In some embodiments, selecting a second video segment of the plurality of second video segments that satisfies the second identification condition may include:
identifying the plurality of second video clips to obtain a plurality of second identification results;
and determining a second video segment meeting a second identification condition according to a plurality of second identification results.
In some embodiments, selecting a third video segment of the plurality of third video segments that satisfies the third identification condition includes:
performing identification processing on the plurality of third video segments to obtain a plurality of third identification results;
and determining a third video segment meeting a third identification condition according to a plurality of third identification results.
In this embodiment of the application, the same identification processing may be applied to a first video segment, a second video segment, or a third video segment to obtain the corresponding identification result.
As shown in fig. 3, a flow chart of another embodiment of a video processing method provided in the embodiment of the present application may include the following steps:
301: A plurality of first video segments in which target content is present in a target video are detected.
The first video segment corresponds to a first start time and a first end time.
302: and identifying the plurality of first video clips to obtain a plurality of first identification results.
303: and determining a plurality of first video clips meeting the first identification condition according to the plurality of first identification results.
304: and respectively carrying out time adjustment processing on the first start time and the first end time of a plurality of first video segments meeting the first identification condition to obtain a plurality of second start times and second end times corresponding to the second start times.
305: and acquiring a plurality of second video clips based on a plurality of second starting times and second ending times respectively corresponding to the second starting times.
306: and identifying the plurality of second video clips to obtain a plurality of second identification results.
307: and determining a plurality of second video clips meeting second identification conditions according to a plurality of second identification results.
308: And respectively adjusting the second start time and the second end time of a plurality of second video segments meeting a second identification condition to obtain a plurality of third start times and third end times corresponding to the third start times.
309: And obtaining a plurality of third video segments based on the plurality of third start times and third end times corresponding to the third start times.
310: And respectively identifying the plurality of third video segments to obtain a plurality of third identification results.
311: and determining a plurality of third video clips meeting third identification conditions according to a plurality of third identification results.
312: and respectively carrying out time adjustment processing on the third start time and the third end time of a plurality of third video segments meeting the third identification condition to obtain a plurality of target start times and a plurality of target end times corresponding to the target start times.
313: and acquiring a plurality of target video clips based on the target starting times and the target ending times respectively corresponding to the target starting times.
In the embodiment of the application, after the plurality of first video segments and their first start and end times are obtained, the time correction applies multiple rounds of identification, selection, and time adjustment, so that the first start time and its corresponding first end time are adjusted more strictly and more precisely. This improves the adjustment accuracy and yields more accurate target video segments.
When identification processing is performed on a first, second, or third video segment, the same identification procedure may be used to obtain the identification result for any of them.
In a possible design, the video segment may be identified in the following manner, so as to obtain the identification result corresponding to the video segment:
extracting segment characteristics of the video segments;
inputting the segment features into a content recognition model corresponding to the target content to obtain the target probability that the target content is present in the video segment;
and determining the corresponding recognition result of the video clip according to the target probability of the target content in the video clip.
The video segment here may be a first video segment, a second video segment, or a third video segment.
When a video segment is identified, an identification model may be used: features are extracted from the segment, and the model outputs the probability that the target content occurs in it. The higher the occurrence probability, the stronger the identification result.
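A hedged sketch of this recognition step, with `extract_features` and `content_model` as placeholders for whatever feature extractor and recognition model an implementation uses:

```python
# The same per-segment recognition applies to first, second, and third
# video segments alike: extract segment features, feed them to the
# content recognition model, and map the target probability to a result.
def recognize(segment, extract_features, content_model,
              probability_threshold: float) -> bool:
    features = extract_features(segment)          # segment features
    probability = content_model(features)         # P(target content present)
    return probability >= probability_threshold   # identification result
```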
When extracting the features of a video segment, an extraction model could be applied to the segment directly, but the features obtained this way are isolated and carry no information shared between different segments. To obtain more accurate segment features, as a possible implementation, extracting the segment features of a video segment may include:
extracting a plurality of window segments of the target video, the window segments respectively corresponding to a window start time and a window end time, where any window segment partially overlaps at least one other window segment;
Respectively extracting segment characteristics of the plurality of window segments to obtain window segment characteristics respectively corresponding to the plurality of window segments;
acquiring a plurality of target window segments of which the window starting time is greater than or equal to the starting time of the video segments and the window ending time is less than or equal to the ending time of the video segments;
and determining the segment characteristics of the video segments according to the window segment characteristics respectively corresponding to the target window segments.
The window segments may be selected in a sliding-window manner, and all window segments have the same segment duration.
According to the embodiment of the application, the segment features of a video segment are obtained from window segments and their window segment features; the features of multiple window segments are fused, producing a more comprehensive segment feature.
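A sketch of the sliding-window extraction under the stated overlap requirement (window length, stride, and function names are illustrative), together with the selection of target windows that fall entirely inside a video segment:

```python
from typing import List, Tuple

def extract_windows(video_duration: float, window_length: float,
                    stride: float) -> List[Tuple[float, float]]:
    # A stride smaller than the window length guarantees that every
    # window partially overlaps at least one neighbour.
    assert 0 < stride < window_length
    windows, start = [], 0.0
    while start + window_length <= video_duration:
        windows.append((start, start + window_length))
        start += stride
    return windows

def target_windows(windows, seg_start: float, seg_end: float):
    # Keep windows whose start >= segment start and end <= segment end.
    return [(s, e) for (s, e) in windows if s >= seg_start and e <= seg_end]
```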
In the embodiment of the application, the window segments and their corresponding segment features are extracted in real time. In some embodiments, previously generated window segments of the target video and their window segment features may instead be retrieved to obtain the segment features of the video segment; the manner in which the window segments and their features are obtained is not restricted here.
In some embodiments, determining the segment characteristics of the video segment based on the window segment characteristics of each of the plurality of target window segments may include:
and carrying out feature fusion processing on the window segment features of the target window segments to obtain the segment features of the video segments.
When the window segment features of the multiple target window segments are fused, various fusion methods may be used; for example, the features may be combined by weighted calculation or by a fusion algorithm to obtain the segment features.
In order to obtain accurate segment features, as a possible implementation manner, feature fusion processing is performed on window segment features of a plurality of target window segments, and obtaining segment features of a video segment includes:
and according to the region interest pooling algorithm, carrying out feature fusion processing on the window segment features of the target window segments respectively to obtain the segment features of the video segments.
Region-of-interest pooling (RoI Pooling) is a feature extraction method that combines the window segment features of the obtained window segments with the start time and end time of the corresponding video segment to produce the segment features. The process of fusing the window segment features of the multiple window segments by RoI pooling is the same as in the prior art and is not repeated here.
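The following is an assumption-laden sketch of RoI-style pooling along the time axis, not the patent's exact procedure; the bin count and the use of max pooling are illustrative choices:

```python
import numpy as np

def roi_pool(window_features: np.ndarray, output_bins: int = 4) -> np.ndarray:
    """Stack the target windows' features as (num_windows, dim), split them
    into a fixed number of temporal bins, max-pool each bin, and concatenate,
    so the segment feature has a fixed size regardless of window count."""
    num_windows = window_features.shape[0]
    assert num_windows >= output_bins
    bins = np.array_split(np.arange(num_windows), output_bins)
    pooled = [window_features[idx].max(axis=0) for idx in bins]
    return np.concatenate(pooled)  # shape: (output_bins * dim,)

# e.g. 10 windows of 16-dim features -> one 64-dim segment feature
print(roi_pool(np.random.rand(10, 16)).shape)
```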
When the window segment features are extracted from the plurality of window segments, a plurality of feature extraction modes can be adopted for feature extraction. As a possible implementation manner, the extracting the segment features of the multiple window segments respectively, and the obtaining the window segment features corresponding to the multiple window segments respectively may include:
sequentially inputting the window segments into a feature extraction model to obtain basic features of the windows;
and carrying out characteristic analysis processing on the basic characteristics of the windows to obtain window segment characteristics corresponding to the windows respectively.
The feature extraction model may be a basic extraction model, which is a neural network model for extracting basic features of the window segment, and the feature extraction model may be obtained by pre-training, and the specific training process is the same as that in the prior art and is not described herein again.
After the window basic features of the multiple window segments are obtained, feature analysis processing is applied to them to obtain the window segment features of the respective windows. Feature analysis allows the window segment features to carry more comprehensive information and improves their expressiveness, which in turn improves the accuracy of target video segment acquisition.
The feature analysis processing of the window features may proceed in multiple layers; for example, global feature analysis, local feature analysis, and timing feature addition may be applied in sequence.
As a possible implementation manner, performing feature analysis processing on a plurality of window basic features, and obtaining window segment features corresponding to a plurality of window segments respectively includes:
attention mechanism processing is carried out on the multiple window basic features, and the context features of the multiple windows are added into the multiple window basic features to obtain multiple context features;
and performing time characteristic processing on the plurality of context characteristics, adding the time characteristics of the plurality of windows into the plurality of context characteristics, and obtaining window segment characteristics corresponding to the plurality of window segments respectively.
In the embodiment of the application, the attention mechanism processing of the window basic features mainly fuses global and local features, so the resulting context features contain both and are more accurate. The context features are then processed with time features so that the timing relations between window segments enter the feature representation; each window segment feature thus carries multiple layers of meaning, which aids the analysis of differences between windows, yields a more accurate segment division result, and improves the precision of target video segment acquisition.
In one possible design, the attention mechanism processing is performed on a plurality of window base features, and the context features of the plurality of windows are added to the plurality of window base features, wherein obtaining the plurality of context features comprises:
extracting global features corresponding to the basic features of the windows respectively based on a global pooling algorithm;
determining attention masks corresponding to the window basic features according to the window basic features and the global features corresponding to the window basic features;
for any window basic feature, computing the dot product of that feature and its attention mask to obtain the corresponding attention feature, thereby obtaining a plurality of attention features;
and respectively carrying out normalization calculation on the plurality of attention features to obtain a plurality of context features.
Optionally, the normalization may be performed by inputting the attention features in turn into a normalized exponential function (softmax), yielding the plurality of context features.
Global average pooling performs global feature analysis on the multiple window basic features to obtain the corresponding global features. For any window basic feature, an attention mask is computed from that feature and its global feature, so a plurality of attention masks is obtained, each expressing the global context of its window basic feature. The window basic feature is then dot-multiplied by its attention mask, producing an attention feature that contains both global and local information. To normalize the representation, the attention features are normalized into the context features.
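A sketch of this contextual attention step under assumed shapes and an assumed mask formula (a sigmoid-gated projection of local plus global features; the text does not spell out the exact form):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def contextual_attention(window_feats: np.ndarray, w: np.ndarray) -> np.ndarray:
    # window_feats: (num_windows, dim); w: assumed (dim, dim) weight matrix.
    global_feat = window_feats.mean(axis=0, keepdims=True)  # global pooling
    mask = sigmoid((window_feats + global_feat) @ w)        # attention masks
    attention = window_feats * mask                         # dot-product step
    return softmax(attention, axis=-1)                      # context features

# e.g. 9 windows of 16-dim basic features with a hypothetical weight matrix
ctx = contextual_attention(np.random.rand(9, 16), np.random.rand(16, 16))
```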
In the embodiment of the application, the global features and the local features are comprehensively extracted by adopting a context attention mechanism, so that the comprehensiveness of the expression of the context features is increased.
In yet another possible design, performing time feature processing on the plurality of context features, adding the timing of the windows to them, to obtain the plurality of window segment features includes:
determining the basic grouping layer, time sequence convolution layer, normalization layer, and fusion layer in a time sequence grouping module;
inputting the plurality of context features into the basic grouping layer so as to group them at least twice according to a preset grouping rule, obtaining at least two grouping results, where each grouping result comprises at least two grouping sets and each grouping set comprises at least one context feature;
inputting the context features of each grouping set into the time sequence convolution layer to obtain the time sequence feature of each grouping set;
inputting each time sequence feature into the normalization layer to obtain the corresponding normalized time sequence feature;
for any grouping result, fusing its at least two normalized time sequence features into a time sequence group feature, so that each of the at least two grouping results has a corresponding time sequence group feature;
fusing the time sequence group features of the at least two grouping results to obtain a target time sequence feature;
obtaining the plurality of window segment features from the products of each context feature with the target time sequence feature.
To facilitate understanding of how the target time sequence feature is acquired, the time feature processing is described in detail below with an example in which there are 9 context features and the plurality of context features are grouped twice according to the preset grouping rule.
Assume that the first grouping result obtained when the basic grouping layer groups the 9 context features once is an A1 grouping set composed of 5 context features and an A2 grouping set composed of the remaining 4 context features; the second grouping result obtained by the other grouping is a B1 grouping set composed of 3 context features, a B2 grouping set composed of 3 context features, and a B3 grouping set composed of the remaining 3 context features.
Inputting 5 context characteristics in the A1 grouping set into a time sequence convolution layer, and calculating to obtain A1 time sequence characteristics; inputting 4 context characteristics in the A2 grouping set into a time sequence convolution layer, and calculating to obtain A2 time sequence characteristics; inputting 3 context characteristics in the B1 grouping set into a time sequence convolution layer, and calculating to obtain B1 time sequence characteristics; inputting 3 context characteristics in the B2 grouping set into the time sequence convolutional layer, calculating to obtain B2 time sequence characteristics, inputting 3 context characteristics in the B3 grouping set into the time sequence convolutional layer, and calculating to obtain B3 time sequence characteristics.
Carrying out normalization calculation on the A1 time sequence characteristics to obtain A1 normalization time sequence characteristics; carrying out normalization calculation on the A2 time sequence characteristics to obtain A2 normalization time sequence characteristics; carrying out normalization calculation on the time sequence characteristics of the B1 to obtain the normalization time sequence characteristics of the B1; carrying out normalization calculation on the B2 time sequence characteristics to obtain B2 normalization time sequence characteristics; and carrying out normalization calculation on the B3 time sequence characteristics to obtain the B3 normalization time sequence characteristics.
Then, fusing the A1 normalization time sequence characteristics and the A2 normalization time sequence characteristics corresponding to the first grouping results to obtain A time sequence group characteristics; and performing feature fusion on the B1 normalization time sequence feature, the B2 normalization time sequence feature and the B3 normalization time sequence feature corresponding to the second grouping result to obtain a B time sequence group feature.
And then, carrying out feature fusion on the A time sequence group feature and the B time sequence group feature to obtain the target time sequence feature.
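A much-simplified sketch of this computation, mirroring the 9-feature example above, is given below. The fixed smoothing kernel standing in for the time sequence convolution layer, the L2 normalization standing in for the normalization layer, and the use of mean fusion (one of the fusion options listed next) are all illustrative assumptions.

```python
import numpy as np

def temporal_group_features(context_feats, groupings, kernel=np.array([0.25, 0.5, 0.25])):
    """context_feats: (N, D) context features; groupings: list of grouping
    results, each a list of index lists (the grouping sets)."""
    group_feats = []
    for grouping in groupings:                          # one grouping result
        set_feats = []
        for idx in grouping:                            # one grouping set
            seq = context_feats[idx]                    # (len(idx), D)
            # time sequence convolution along the window axis, per dimension
            conv = np.stack([np.convolve(seq[:, d], kernel, mode="same")
                             for d in range(seq.shape[1])], axis=1)
            pooled = conv.mean(axis=0)                  # one time sequence feature
            set_feats.append(pooled / (np.linalg.norm(pooled) + 1e-8))  # normalization layer
        group_feats.append(np.mean(set_feats, axis=0))  # fuse sets -> time sequence group feature
    target = np.mean(group_feats, axis=0)               # fuse grouping results -> target feature
    return context_feats * target                       # product with each context feature

# mirrors the example: 9 context features grouped as 5+4 and 3+3+3
feats = np.random.randn(9, 16)
groupings = [[list(range(0, 5)), list(range(5, 9))],
             [list(range(0, 3)), list(range(3, 6)), list(range(6, 9))]]
print(temporal_group_features(feats, groupings).shape)  # (9, 16)
```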
In practical applications, multiple feature fusion modes may be used, for example a feature fusion algorithm based on deep learning theory, a feature fusion algorithm based on a feature dependency model, a mean fusion algorithm, a multiplication or addition fusion algorithm, or a feature fusion algorithm based on Bayesian theory. The embodiment of the present application does not limit the fusion manner of the features.
In addition, the feature fusion algorithm used when the at least two normalized time sequence features are fused to obtain the time sequence group features may be the same as or different from the feature fusion method used when the time sequence group features corresponding to the two grouping results are fused, and may be specifically determined according to the actual fusion requirements.
As a possible implementation, the plurality of first recognition results includes a plurality of first target probabilities; the plurality of second recognition results comprise a plurality of second target probabilities; the plurality of third recognition results include: a plurality of third target probabilities;
optionally, determining, according to the plurality of first recognition results, a first video segment satisfying the first recognition condition may include:
determining a first video clip corresponding to a first target probability which is greater than a first probability threshold value in the plurality of first target probabilities to obtain a candidate first video clip;
and performing redundancy removal processing on the candidate first video segment to obtain a first video segment meeting a first identification condition.
Optionally, determining, according to the plurality of second recognition results, a second video segment that satisfies the second recognition condition may include:
determining a second video clip corresponding to a second target probability which is greater than a second probability threshold value in the plurality of second target probabilities to obtain a candidate second video clip;
and performing redundancy removal processing on the candidate second video segment to obtain a second video segment meeting a second identification condition.
Optionally, determining, according to a plurality of third recognition results, a third video segment that satisfies the third recognition condition may include:
and determining a third video clip corresponding to a third target probability greater than a third probability threshold value in the plurality of third target probabilities as a third video clip meeting a third identification condition.
Wherein the first probability threshold is less than a second probability threshold, which is less than a third probability threshold.
In the embodiment of the application, when video segments are selected according to the recognition results, they can be selected according to the target probability contained in each recognition result, so as to obtain an accurate selection result. By selecting three times and raising the probability threshold at each round, the probability that the selected third video segment contains the target content is higher, and the obtained target video segment is more accurate. In addition, among the video segments satisfying a recognition condition there may be two long segments that share a common portion, that is, two segments with high overlap; one of them can be removed as redundant, which reduces the processing pressure and improves the processing efficiency. The target video segment is thus selected in an efficient and accurate manner, improving both the effectiveness and the precision of its automatic selection.
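The three selection rounds can be pictured with the short sketch below. The threshold values and segment tuples are illustrative; the embodiment only requires that the three thresholds increase, with redundancy removal and time adjustment taking place between rounds.

```python
def select_by_threshold(segments, threshold):
    """segments: list of (start, end, target_probability) tuples."""
    return [s for s in segments if s[2] > threshold]

# illustrative thresholds; the method only requires T1 < T2 < T3
T1, T2, T3 = 0.3, 0.5, 0.7

candidates = [(0.0, 8.0, 0.35), (10.0, 18.0, 0.72), (40.0, 46.0, 0.55)]
first_stage = select_by_threshold(candidates, T1)    # coarse, high recall
# ... redundancy removal, time adjustment, and re-recognition occur here ...
second_stage = select_by_threshold(first_stage, T2)
third_stage = select_by_threshold(second_stage, T3)  # strict, high precision
print(third_stage)                                   # [(10.0, 18.0, 0.72)]
```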
In some embodiments, optionally, performing redundancy removal processing on the candidate first video segment, and obtaining the first video segment satisfying the first identification condition may include:
based on a non-maximum value suppression algorithm, performing redundancy removal processing on the candidate first video segment to obtain a first video segment meeting a first identification condition;
optionally, performing redundancy removal processing on the candidate second video segments, and obtaining the second video segments meeting the second identification condition may include:
and based on a non-maximum value suppression algorithm, performing redundancy removal processing on the candidate second video segment to obtain a second video segment meeting a second identification condition.
Redundant video segments can be removed in multiple ways; for example, segment quality evaluation may be performed on the candidate video segments, and the segments with low quality scores removed as redundant.
By adopting a non-maximum suppression algorithm, the video clip most possibly containing the target content can be selected, and the selection precision can be improved.
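A minimal sketch of such temporal non-maximum suppression follows; the greedy loop and the overlap threshold of 0.5 are common practice rather than values fixed by the embodiment.

```python
def temporal_iou(a, b):
    """Intersection-over-union of two (start, end, prob) segments in time."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, iou_threshold=0.5):
    """Greedy non-maximum suppression: keep the segment most likely to
    contain the target content, drop lower-scored overlapping segments."""
    kept = []
    for seg in sorted(segments, key=lambda s: s[2], reverse=True):
        if all(temporal_iou(seg, k) < iou_threshold for k in kept):
            kept.append(seg)
    return kept

print(temporal_nms([(0, 8, 0.9), (1, 9, 0.6), (20, 26, 0.7)]))
# -> [(0, 8, 0.9), (20, 26, 0.7)]; the overlapping (1, 9, 0.6) is removed
```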
As an embodiment, the first video segment may include a plurality of segments. Detecting the first video segments in which the target content is present in the target video, where each first video segment corresponds to a first start time and a first end time, may include:
extracting a plurality of window segments of the target video, where the window segments respectively correspond to a window start time and a window end time, and any window segment shares a partial segment with at least one other window segment among the plurality of window segments;
dividing window segments meeting the same aggregation condition in a plurality of window segments into the same window segment set to obtain a plurality of window segment sets;
determining a minimum window starting time and a maximum window ending time corresponding to a window segment set according to the window starting time and the window ending time corresponding to the window segment in any window segment set;
acquiring a plurality of first video clips according to the minimum window start time and the maximum window end time respectively corresponding to the window segment sets;
wherein the first start time of the first video segment is a minimum window start time of the corresponding window segment set and the first end time is a maximum window end time of the corresponding window segment set.
As shown in fig. 4, a flowchart of another embodiment of a video processing method provided in this embodiment of the present application may include the following steps:
401: a plurality of window segments of a target video are extracted.
Wherein the window segments respectively correspond to a window start time and a window end time, and any window segment shares a partial segment with at least one other window segment among the plurality of window segments.
Alternatively, the plurality of window segments of the target video may be extracted in a sliding window manner. When sliding a window over the target video, the window size and the sliding step length can be set according to actual requirements. If high processing precision is required, a smaller window size and sliding step may be used; for example, the window size may be set to 2 seconds and the step size to 1 second. If the requirement on processing precision is lower, a larger window size and sliding step may be used; for example, the window size may be set to 8 seconds and the step size to 2 seconds.
Extracting the plurality of window segments of the target video may include: and acquiring a plurality of window segments corresponding to the target video based on the preset window size and the sliding step length.
In some embodiments, multiple groups of window sizes and sliding step lengths may be set, and the window segments obtained when each group slides over the target video are collected in turn. The window segments obtained from all groups together serve as the plurality of window segments of the target video, on which the subsequent steps (dividing the window segments satisfying the same aggregation condition into the same window segment set to obtain the plurality of window segment sets, and so on) are executed.
That is, obtaining the plurality of window segments corresponding to the target video based on the preset window size and sliding step length may include: determining at least one window size and the sliding step length corresponding to each window size; and determining the plurality of window segments of the target video from the video segments obtained when each window size, with its corresponding sliding step, slides over the target video. Any window segment corresponds to a window start time and a window end time, and any window segment shares a partial segment with at least one other window segment among the plurality of window segments.
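As a concrete illustration, the sketch below enumerates window segments for several (window size, step) settings at once, matching the multi-group scheme just described and reusing the 2s/1s and 8s/2s example settings.

```python
def sliding_windows(video_duration, configs=((2.0, 1.0), (8.0, 2.0))):
    """configs: (window_size, sliding_step) pairs, in seconds.
    Returns (start, end) window segments; consecutive windows of one
    setting overlap, so each shares a partial segment with a neighbour."""
    windows = []
    for size, step in configs:
        t = 0.0
        while t + size <= video_duration:
            windows.append((t, t + size))
            t += step
    return windows

print(sliding_windows(10.0))  # windows from both (size, step) settings
```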
402: and dividing the window segments meeting the same aggregation condition in the window segments into the same window segment set to obtain a plurality of window segment sets.
403: and determining the minimum window starting time and the maximum window ending time corresponding to the window segment set according to the window starting time and the window ending time corresponding to the window segment in any window segment set.
404: and acquiring a plurality of first video clips according to the minimum window time and the maximum window end time corresponding to the window clip sets respectively.
The first start time of the first video segment is the minimum window start time of the corresponding window segment set, and the first end time of the first video segment is the maximum window end time of the corresponding window segment set.
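Steps 403 and 404 reduce to taking the extremes of each window segment set, as in the short sketch below (times in seconds, values illustrative).

```python
def first_segments_from_sets(window_sets):
    """window_sets: window segment sets, each a list of (start, end) windows.
    Each first video segment runs from the minimum window start time to the
    maximum window end time of its set (steps 403 and 404)."""
    return [(min(w[0] for w in ws), max(w[1] for w in ws)) for ws in window_sets]

sets = [[(3.0, 5.0), (4.0, 6.0), (5.0, 7.0)], [(20.0, 22.0), (21.0, 23.0)]]
print(first_segments_from_sets(sets))  # [(3.0, 7.0), (20.0, 23.0)]
```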
405: time correction processing is carried out on first starting time and first ending time corresponding to the plurality of first video clips respectively, and target ending time corresponding to the plurality of target starting time and the plurality of target starting time respectively is obtained;
406: and acquiring a plurality of target video clips based on the target starting times and the target ending times respectively corresponding to the target starting times.
In the embodiment of the application, when the plurality of first video segments are obtained, the plurality of window segments can be obtained in a sliding window manner, and the first video segments are obtained by aggregating the window segments that contain the target content. Aggregating the window video segments ties the acquisition of the first video segments more closely to the target content, so that more accurate target video segments are obtained after the time correction processing, improving the acquisition precision of the target video segments.
As an embodiment, dividing window segments that satisfy the same aggregation condition among multiple window segments into the same window segment set, obtaining multiple window segment sets may include:
sequentially inputting the window segments into a feature extraction model to obtain basic features of the windows;
sequentially inputting the basic characteristics of the windows into a content identification model corresponding to the target content to obtain the window probability of the target content in the window segments;
carrying out probability region identification processing on the plurality of window probabilities to obtain a plurality of probability regions;
aiming at any probability region, determining a window segment set formed by window segments corresponding to all window probabilities in the probability region;
and determining a plurality of window segment sets formed by window segment sets corresponding to the plurality of probability regions respectively.
For any probability region, determining the window segment set formed by the window segments corresponding to all window probabilities located in that region may include: determining the region start time and the region end time corresponding to the probability region; and determining the window segment set consisting of all window segments whose window start time is greater than or equal to the region start time and whose window end time is less than or equal to the region end time.
For any probability region, the region start time and the region end time may specifically be determined as follows: determine the left boundary window probability corresponding to the left boundary of the probability region and the right boundary window probability corresponding to its right boundary; take the window start time of the window segment corresponding to the left boundary window probability as the region start time; and take the window end time of the window segment corresponding to the right boundary window probability as the region end time.
In the embodiment of the application, when the window segments meeting the same aggregation condition are subjected to segment aggregation, the window probability of the target content in the window segments can be obtained, so that the window segments are aggregated in a probability region identification mode, and the aggregation efficiency and accuracy are improved.
As a possible implementation manner, performing probability region identification processing on a plurality of window probabilities, and obtaining a plurality of probability regions may include:
and carrying out probability region identification processing on the probabilities of the windows by using a watershed algorithm to obtain a plurality of probability regions.
In the embodiment of the application, the watershed algorithm takes the whole region in which the plurality of window probabilities lie as the region to be segmented, and segments it into a plurality of probability regions. Any probability region contains a plurality of window probabilities, and the window segments corresponding to all window probabilities in that region form one window segment set.
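A deliberately simplified stand-in for this segmentation is sketched below: contiguous runs of window probabilities above a floor value form one probability region, whose boundary times are taken from the boundary windows exactly as described above. A true watershed implementation would additionally split a run at internal probability valleys; the floor value here is an assumption.

```python
def probability_regions(window_probs, windows, floor=0.5):
    """window_probs[i] is the probability that the target content appears in
    windows[i] = (start, end). Contiguous runs above the floor form regions."""
    regions, current = [], []
    for prob, win in zip(window_probs, windows):
        if prob >= floor:
            current.append(win)
        elif current:
            regions.append(current)
            current = []
    if current:
        regions.append(current)
    # region start = window start time of the left boundary window,
    # region end  = window end time of the right boundary window
    return [(r[0][0], r[-1][1], r) for r in regions]

probs = [0.1, 0.7, 0.8, 0.2, 0.6, 0.9, 0.1]
wins = [(float(i), i + 2.0) for i in range(7)]
for start, end, members in probability_regions(probs, wins):
    print(start, end, len(members))   # two regions: (1.0, 4.0), (4.0, 7.0)
```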
In practical applications, there may be a need to comprehensively present target video segments in which target content exists. In one possible design, the target video segment may include a plurality. After obtaining the target video segment based on the target start time, the method may further include:
and splicing the target video clips to obtain the key video with the target content.
When the multiple target video segments are spliced, the existing video splicing algorithm can be adopted, and details are not repeated here.
In practical applications, the target video may come from a variety of life, work, or competition scenarios, and the target content may be set according to the actual needs of the user. In some embodiments, the target video may be a ball game video, and the target content may include: a goal action is present. The target video segment may then be a video segment in which a goal action exists. At this point, there may be a need to count individual players' goals or the game result. After obtaining the target video segment based on the target start time, the method may further include:
determining team tag information of at least one participating user and the participating team to which each participating user belongs;
counting, among the target video segments, the number of target video segments respectively corresponding to the at least one participating user, to obtain the goal count respectively corresponding to the at least one participating user;
and determining the competition result corresponding to the target video according to the team tag information and the goal count respectively corresponding to the at least one participating user.
The competition result corresponding to the target video may include: the goal count of each participating user, the total goal count corresponding to at least one participating team, or the game ranking of at least one participating team determined based on those totals, and the like. The competition result can take various data forms, which can be set according to the actual use requirements.
As a possible implementation manner, the determining, according to the team tag information and the goal amount respectively corresponding to at least one participating user, a match result corresponding to the target video may include:
counting the sum of the goal amounts of the participating users belonging to the same team tag information to obtain the goal total amount corresponding to at least one participating team;
and acquiring the game ranking corresponding to the at least one competition team according to the goal total amount respectively corresponding to the at least one competition team.
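The counting and ranking steps amount to two tallies, sketched below. The player identifiers and team tags are hypothetical, since the embodiment does not fix how per-segment player attribution is represented.

```python
from collections import Counter, defaultdict

def match_result(goal_segment_players, team_of_player):
    """goal_segment_players: one (hypothetical) player id per target video
    segment; team_of_player: team tag information per participating user."""
    player_goals = Counter(goal_segment_players)     # goals per participating user
    team_goals = defaultdict(int)
    for player, goals in player_goals.items():
        team_goals[team_of_player[player]] += goals  # total goals per team
    ranking = sorted(team_goals.items(), key=lambda kv: kv[1], reverse=True)
    return dict(player_goals), dict(team_goals), ranking

players = ["p1", "p2", "p1", "p3"]                   # attribution of 4 goal segments
teams = {"p1": "team A", "p2": "team A", "p3": "team B"}
print(match_result(players, teams))
```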
As shown in fig. 5, a flowchart of another embodiment of a video processing method provided in this embodiment of the present application may include:
501: and detecting target video input by a user and target content.
502: a first video segment in which target content is present in a target video is detected.
Wherein the first video segment corresponds to a first start time.
503: and carrying out time correction processing on the first starting time to obtain the target starting time.
504: and acquiring the target video clip based on the target starting time.
505: and outputting the target video clip for the user.
Some steps in the embodiments of the present application are the same as those in the previous embodiments, and are not described herein again.
In the embodiment of the application, the electronic device can detect the target video and the target content input by the user, and detect the first video segment in which the target content is present in the target video, so that time correction can be performed on the first start time of the first video segment to obtain the target start time. The target video segment can then be acquired based on the target start time and output to the user. This provides a scheme for direct interaction with the user, so that automatic clipping of the target video segment is carried out on the user's behalf.
As shown in fig. 6, which is a flowchart of another embodiment of a video processing method provided in this application, the method may include:
601: and receiving the target video and the target content sent by the electronic equipment.
The target video and the target content are obtained by detecting user input by the electronic equipment.
602: a first video segment in which target content is present in a target video is detected.
Wherein the first video segment corresponds to a first start time.
603: and carrying out time correction processing on the first starting time to obtain the target starting time.
604: and acquiring the target video clip based on the target starting time.
605: and sending the target video clip to the electronic equipment so as to enable the electronic equipment to output the target video clip for the user.
Some steps in the embodiments of the present application are the same as those in the embodiments described above, and are not described herein again.
In the embodiment of the application, the server can receive the target video and the target content sent by the electronic device, and detect the first video segment in which the target content is present in the target video. Time correction processing can thus be performed on the first start times respectively corresponding to the first video segments to obtain the target start times, after which the target video segments can be acquired based on the target start times and sent back so that the electronic device can output them to the user. Because the electronic device sends the target video and the target content to the server, and the target video segment is acquired in the server, the processing pressure on the electronic device can be reduced and the processing efficiency improved.
For ease of understanding, the case in which the technical solution of the embodiment of the present application is executed by an electronic device is described in detail below, taking a notebook computer as the electronic device.
As shown in fig. 7a, the user may upload a target video through a video upload control 701 in an upload interface provided by the notebook M1, and input target content in a text box 702. The target content may be, for example, text, a picture, or a target video clip, and a prompt control 703 for sending a picture or video may be provided in the text box 702 for inputting a picture or target video clip. Taking the target video as a ball game video as an example, the target content may be the text content "a video clip with a shooting action". In some embodiments, the text box for entering the target content may also carry a text prompt, for example the prompt text 704 "please enter content desired to be clipped" displayed before the text box 702.
In still other embodiments, a control 705 for triggering a video clip extraction instruction may further be displayed in the upload interface, and after the user triggers the control 705, the notebook may start clip extraction. The notebook M1 can acquire S701 the ball game video uploaded by the user and the target content "a video clip with a shooting action". Thereafter, the notebook M1 may detect S702 the first video segment in which the target content is present in the target video, where the first video segment may correspond to a first start time. Time correction processing S703 may then be performed on the first start time of the first video segment to obtain the target start time, so as to acquire S704 the target video segment based on the target start time.
Thereafter, as shown in FIG. 7b, the notebook M1 may output the target video segment 710 for the user. In some embodiments, a prompt for saving the video clip may be displayed, for example, the saving control 711, and after the user clicks the saving control 711, the notebook M1 may save the target video clip.
For ease of understanding, the case in which the technical solution of the embodiment of the present application is executed by a server is described in detail below, taking a cloud server as the server.
As shown in fig. 8, a user may upload a target video through a video upload control 801 in an upload interface provided by an electronic device, such as a tablet computer M2, and input target content in a text box 802. The target content may be, for example, text, a picture, or a target video clip, and a prompt control 803 for sending a picture or video may be provided in the text box 802 for inputting a picture or target video clip. Taking the target video as a ball game video as an example, the target content may be the text content "a video clip with a shooting action". The tablet computer M2 may send the target video and the target content to the cloud server M3.
In still other embodiments, a sending control 804 for the target video and the target content may further be displayed in the upload interface, and after the user triggers the sending control 804, the tablet computer M2 may send S801 the target video and the target content to the cloud server M3. The cloud server M3 may receive S802 the target video composed of the ball game video and the target content composed of "a video clip with a shooting action" sent by the tablet computer M2. Thereafter, the cloud server M3 may detect S803 the first video segment in which the target content is present in the target video, the first video segment corresponding to a first start time. Time correction processing S804 is then performed on the first start times respectively corresponding to the first video segments to obtain the target start times, so as to acquire S805 the target video segments based on the target start times.
Then, the cloud server M3 sends S806 the target video clip to the tablet computer M2, so that the tablet computer outputs the target video clip for the user.
As shown in fig. 9, for a schematic structural diagram of an embodiment of a video processing apparatus provided in an embodiment of the present application, the video processing apparatus may include: a storage component 901 and a processing component 902; wherein storage component 901 is configured to store one or more computer instructions; one or more computer instructions are invoked for execution by processing component 902;
the processing component 902 is configured to:
detecting a first video segment with target content in a target video; wherein the first video segment corresponds to a first start time; carrying out time correction processing on the first starting time to obtain a target starting time; and acquiring a target video clip based on the target starting time.
In practical applications, the video processing device may be configured in a server or an electronic device.
In the embodiment of the application, a first video segment in which target content is present in a target video is detected; the first video segment may correspond to a first start time, on which time correction processing may be performed to obtain a target start time, so that the target video segment in the target video can be acquired according to the target start time. The first video segment containing the target content is first identified coarsely and then time-corrected to obtain a target start time that frames the target content accurately, so that an accurate target video segment can be acquired from it. Automatically detecting the video segment containing the target content and performing time correction on it completes the clipping of that segment from the target video without manual work, realizing automatic clipping and improving clipping efficiency.
As an embodiment, the first video segment further corresponds to a first end time; the first start time and the first end time of the first video segment correspond to each other; the processing component performs time correction processing on the first start time, and obtaining the target start time may specifically include:
performing time correction processing on a first starting time and a first ending time corresponding to the first video segment to obtain a target starting time and a target ending time corresponding to the target starting time;
the processing component, based on the target start time, specifically acquiring the target video segment may include:
and acquiring the target video clip based on the target starting time and the target ending time corresponding to the target starting time respectively.
As a possible implementation manner, the first video segment includes a plurality of first video segments, the processing component performs time correction processing on the first start time and the first end time respectively corresponding to the first video segment, and obtaining the target start time and the target end time respectively corresponding to the target start time may specifically include:
carrying out time adjustment processing on a first starting time and a first ending time of a first video clip meeting a first identification condition to obtain a second starting time and a second ending time corresponding to the second starting time;
acquiring a second video segment based on the second starting time and a second ending time corresponding to the second starting time; wherein the second video segment comprises a plurality of segments;
and performing time correction processing on a second start time and a second end time of a second video segment meeting a second identification condition to obtain target end times corresponding to the target start time and the target start time respectively.
In some embodiments, the time correction processing is performed by the processing component on the second start time and the second end time of the second video segment that satisfy the second identification condition, and obtaining the target end times corresponding to the target start time and the target start time respectively may specifically include:
performing time adjustment processing on a second start time and a second end time of a second video segment meeting a second identification condition to obtain third end times corresponding to a third start time and a third start time respectively;
obtaining a third video segment based on a third start time and a third end time respectively corresponding to the third start time;
and performing time adjustment processing on the third start time and the third end time of the third video segment meeting the third identification condition to obtain target end times corresponding to the target start time and the target start time respectively.
In one possible design, the processing component may be further to:
selecting a first video clip meeting a first identification condition from a plurality of first video clips;
selecting a second video clip meeting a second identification condition from the plurality of second video clips;
and selecting a third video clip meeting a third identification condition from the plurality of third video clips.
As another embodiment, the selecting, by the processing component, a first video segment satisfying the first identification condition from the plurality of first video segments may specifically include:
identifying the plurality of first video clips to obtain a plurality of first identification results;
determining a first video segment meeting a first identification condition according to the plurality of first identification results;
the selecting, by the processing component, a second video segment that satisfies the second recognition condition from the plurality of second video segments may specifically include:
performing identification processing on the plurality of second video segments to obtain a plurality of second identification results;
determining a second video segment meeting a second identification condition according to the plurality of second identification results;
the selecting, by the processing component, a third video segment that meets the third identification condition from the multiple third video segments may specifically include:
performing identification processing on the plurality of third video segments to obtain a plurality of third identification results;
and determining a third video clip meeting a third identification condition according to the plurality of third identification results.
In some embodiments, the processing component may perform recognition processing on the video segment to obtain a recognition result corresponding to the video segment by:
extracting segment characteristics of the video segments;
inputting the segment characteristics into a content identification model corresponding to the target content to obtain the target probability of the target content in the video segment;
and determining the corresponding recognition result of the video clip according to the target probability of the target content in the video clip.
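Read as code, this recognition pipeline is just a feature extraction followed by a model call; in the sketch below, extract_features and content_model are hypothetical placeholders for the feature extraction step and the content recognition model named above.

```python
def recognize(video_segment, extract_features, content_model, threshold=0.5):
    """extract_features and content_model stand in for the feature
    extraction step and the content recognition model; both are placeholders."""
    feature = extract_features(video_segment)  # segment feature
    probability = content_model(feature)       # target probability
    return {"target_probability": probability,
            "has_target_content": probability > threshold}

# usage with stand-in callables
print(recognize("clip_00:10-00:18", lambda seg: [0.1, 0.9], max))
```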
As an embodiment, the extracting, by the processing component, the segment feature of the video segment may specifically include:
extracting a plurality of window segments of a target video; wherein, the window segments respectively correspond to a window starting time and a window ending time; there is at least one window segment identical to a partial segment of any one of the window segments among the plurality of window segments;
respectively extracting the segment characteristics of the plurality of window segments to obtain the segment characteristics respectively corresponding to the plurality of window segments;
determining a start time and an end time of a video clip;
acquiring a plurality of target window segments of which the window starting time is greater than or equal to the starting time of the video segments and the window ending time is less than or equal to the ending time of the video segments;
and determining the segment characteristics of the video segment according to the window segment characteristics of the plurality of target window segments.
In some embodiments, the determining, by the processing component, the segment characteristic of the video segment according to the window segment characteristic of each of the plurality of target window segments may specifically include:
and carrying out feature fusion processing on the window segment features of the target window segments to obtain the segment features of the video segments.
As a possible implementation manner, the feature fusion processing is performed on the window segment features of each of the multiple target window segments by the processing component, and the obtaining of the segment features of the video segment may specifically include:
and according to the region interest pooling algorithm, carrying out feature fusion processing on the window segment features of the target window segments respectively to obtain the segment features of the video segments.
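As a rough illustration of this fusion, the sketch below selects the target windows inside a segment and fuses their features region-of-interest style, here simplified to splitting the selected windows into a fixed number of temporal bins and max-pooling each; the bin count and the max-pooling choice are assumptions.

```python
import numpy as np

def segment_feature(window_feats, windows, seg_start, seg_end, bins=4):
    """window_feats: (N, D); windows: list of (start, end) per window.
    Target windows lie entirely within [seg_start, seg_end]; their features
    are fused by binning along time and max-pooling each bin."""
    idx = [i for i, (s, e) in enumerate(windows)
           if s >= seg_start and e <= seg_end]
    target = window_feats[idx]                        # (M, D) target window features
    chunks = np.array_split(target, bins, axis=0)     # region-of-interest style bins
    pooled = [c.max(axis=0) if len(c) else np.zeros(window_feats.shape[1])
              for c in chunks]
    return np.concatenate(pooled)                     # (bins * D,) segment feature

feats = np.random.randn(10, 8)
wins = [(float(i), i + 2.0) for i in range(10)]
print(segment_feature(feats, wins, 2.0, 9.0).shape)   # (32,)
```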
As another embodiment, the extracting, by the processing component, the segment features of the plurality of window segments, and the obtaining the window segment features corresponding to the plurality of window segments may specifically include:
sequentially inputting the window segments into a feature extraction model to obtain basic features of the windows;
and carrying out feature analysis processing on the plurality of window basic features to obtain window segment features respectively corresponding to the plurality of window segments.
In one possible design, the processing component performs feature analysis processing on the multiple window basic features, and the obtaining window segment features respectively corresponding to the multiple window segments specifically may include:
attention mechanism processing is carried out on the multiple window basic features, and the context features of the multiple windows are added into the multiple window basic features to obtain multiple context features;
and performing time characteristic processing on the plurality of context characteristics, adding the time characteristics of the plurality of windows into the plurality of context characteristics, and obtaining window segment characteristics corresponding to the plurality of window segments respectively.
In one possible design, the processing component performs attention mechanism processing on a plurality of window basic features, and adds the context features of the plurality of windows to the plurality of window basic features, and the obtaining the plurality of context features may specifically include:
extracting global features corresponding to the window basic features respectively based on a global pooling algorithm;
determining attention masks corresponding to the window basic features according to the window basic features and global features corresponding to the window basic features respectively;
performing dot product calculation on the window basic features and the attention masks corresponding to the window basic features aiming at any window basic feature to obtain the attention features corresponding to the window basic features so as to obtain a plurality of attention features;
and respectively carrying out normalization calculation on the plurality of attention characteristics to obtain a plurality of context characteristics.
In another possible design, the processing component performs temporal feature processing on the plurality of context features, adds the temporal features of the plurality of windows to the plurality of context features, and the obtaining the plurality of window segment features may specifically include:
determining a basic grouping layer, a time sequence convolution layer, a normalization layer and a fusion layer in a time sequence grouping module; inputting the plurality of context characteristics into a basic grouping layer so as to group the plurality of context characteristics at least twice according to a preset grouping rule to obtain at least two grouping results; wherein the grouping result comprises at least two grouping sets, and the grouping sets comprise at least one context feature;
inputting the at least one context feature in each of the grouping sets of the at least two grouping results into the time sequence convolution layer, to obtain the time sequence features respectively corresponding to the plurality of grouping sets;
respectively inputting the plurality of time sequence characteristics into a normalization layer to obtain a plurality of normalization time sequence characteristics;
aiming at any grouping result, carrying out fusion processing on at least two normalization time sequence characteristics corresponding to the grouping result to obtain time sequence group characteristics so as to obtain time sequence group characteristics respectively corresponding to at least two grouping results;
fusing the time sequence group characteristics respectively corresponding to at least two grouping results to obtain a target time sequence characteristic;
and obtaining a plurality of window segment characteristics based on the products of the at least one context characteristic and the target time sequence characteristic respectively.
As yet another embodiment, the plurality of first recognition results includes a plurality of first target probabilities; the plurality of second recognition results comprise a plurality of second target probabilities; the plurality of third recognition results include: a plurality of third target probabilities;
the determining, by the processing component, the first video segment meeting the first identification condition according to the plurality of first identification results may specifically include:
determining a first video clip corresponding to a first target probability which is greater than a first probability threshold value in the plurality of first target probabilities to obtain a candidate first video clip;
performing redundancy removal processing on the candidate first video clip to obtain a first video clip meeting a first identification condition;
the determining, by the processing component, the second video segment meeting the second identification condition according to the plurality of second identification results may specifically include:
determining a second video clip corresponding to a second target probability which is greater than a second probability threshold value in the plurality of second target probabilities to obtain a candidate second video clip;
performing redundancy removal processing on the candidate second video clips to obtain second video clips meeting second identification conditions;
the determining, by the processing component according to the multiple third recognition results, a third video segment that meets the third recognition condition may specifically include:
determining a third video clip corresponding to a third target probability greater than a third probability threshold value in the plurality of third target probabilities as a third video clip meeting a third identification condition;
wherein the first probability threshold is less than the second probability threshold, which is less than the third probability threshold.
As a possible implementation manner, the processing component performs redundancy removal processing on the candidate first video segment, and obtaining the first video segment that satisfies the first identification condition may specifically include:
based on a non-maximum value suppression algorithm, performing redundancy removal processing on the candidate first video segment to obtain a first video segment meeting a first identification condition;
the processing component performs redundancy removal processing on the candidate second video segment, and obtaining the second video segment meeting the second identification condition may specifically include:
and based on a non-maximum value suppression algorithm, performing redundancy removal processing on the candidate second video segment to obtain a second video segment meeting a second identification condition.
As yet another embodiment, the first video segment includes a plurality of segments, and the detecting, by the processing component, of the first video segments in which the target content is present in the target video, each first video segment corresponding to a first start time, may specifically include:
extracting a plurality of window segments of the target video, where the window segments respectively correspond to a window start time and a window end time, and any window segment shares a partial segment with at least one other window segment among the plurality of window segments;
dividing window segments meeting the same aggregation condition in the multiple window segments into the same window segment set to obtain multiple window segment sets;
determining the minimum window starting time and the maximum window ending time corresponding to the window segment set according to the window starting time and the window ending time corresponding to the window segment in any window segment set;
acquiring a plurality of first video clips according to the minimum window start time and the maximum window end time respectively corresponding to the window segment sets;
the first start time of the first video segment is the minimum window start time of the corresponding window segment set, and the first end time is the maximum window end time of the corresponding window segment set.
In one possible design, the dividing, by the processing component, the window segments that satisfy the same aggregation condition among the multiple window segments into the same window segment set, and the obtaining multiple window segment sets may specifically include:
sequentially inputting the window segments into a feature extraction model to obtain a plurality of window basic features;
sequentially inputting the basic characteristics of the windows into a content identification model corresponding to the target content to obtain the window probability of the target content in the window segments;
carrying out probability region identification processing on the plurality of window probabilities to obtain a plurality of probability regions;
aiming at any probability region, determining a window segment set formed by window segments corresponding to all window probabilities in the probability region;
and determining a plurality of window segment sets formed by window segment sets corresponding to the plurality of probability regions respectively.
In one possible design, the processing component performs probability region identification processing on the multiple window probabilities, and obtaining the multiple probability regions may specifically include:
and carrying out probability region identification processing on the probabilities of the windows by using a watershed algorithm to obtain a plurality of probability regions.
As another embodiment, the target video segment may include a plurality of segments, and the processing component may be further configured to:
and splicing the target video clips to obtain the key video with the target content.
As yet another embodiment, the target content may include: there is a goal action; the target video clip is a video clip with a goal action;
the processing component may be further to:
determining team tag information of at least one competition user and competition teams to which the at least one competition user belongs;
counting the number of the target video clips corresponding to the at least one competition user in the target video clips to obtain the goal number corresponding to the at least one competition user;
and determining a competition result corresponding to the target video according to the team tag information and the goal amount respectively corresponding to the at least one competition user.
The video processing device in fig. 9 can execute the method of video processing in the embodiment shown in fig. 1, and the implementation principle and technical effect are not described again. The detailed description of the steps performed by the processing component in the above embodiments has been described in detail in the embodiments related to the method, and will not be elaborated upon here.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed, any one of the video processing methods according to the foregoing embodiments may be executed.
As shown in fig. 10, a schematic structural diagram of an embodiment of an electronic device provided in the embodiment of the present application, where the electronic device may include: a storage component 1001 and a processing component 1002; wherein storage component 1001 is used to store one or more computer instructions; one or more computer instructions are invoked for execution by the processing component;
the processing component 1002 is to:
acquiring a target video and target content input by a user; detecting a first video segment with target content in a target video; wherein the first video segment corresponds to a first start time; time correction processing is carried out on the first starting time to obtain a target starting time; acquiring a target video clip based on the target starting time; and outputting the target video clip for the user.
The content of the part executed by the processing element in the embodiment of the present application is the same as the content of the part in the embodiment shown in fig. 9, and is not described again here.
As shown in fig. 11, a schematic structural diagram of an embodiment of a server provided in the present application is shown, where the server may include: a storage component 1101 and a processing component 1102; wherein storage component 1101 is used to store one or more computer instructions; one or more computer instructions are invoked for execution by the processing component 1102;
the processing component 1102 is configured to:
receiving a target video and target content sent by electronic equipment; the target video and the target content are obtained by detecting user input by the electronic equipment; detecting a first video segment with target content in a target video; wherein the first video segment corresponds to a first start time; time correction processing is carried out on the first starting time to obtain a target starting time; acquiring a target video clip based on the target starting time; and sending the target video clip to the electronic equipment so as to enable the electronic equipment to output the target video clip for the user.
The content of the part executed by the processing element in the embodiment of the present application is the same as the content of the part in the embodiment shown in fig. 9, and is not described again here.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by a necessary general hardware platform, or by a combination of hardware and software. Based on this understanding, the above technical solutions, in essence or in the part that contributes over the prior art, may be embodied in the form of a computer program product carried on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (24)

1. A video processing method, comprising:
detecting first video segments in which target content is present in a target video, wherein there are a plurality of first video segments, and each first video segment corresponds to a first start time and a first end time corresponding to the first start time;
performing time adjustment processing on the first start time and the first end time of a first video segment that satisfies a first identification condition, to obtain a second start time and a second end time corresponding to the second start time;
acquiring second video segments based on the second start time and the second end time corresponding to the second start time, wherein there are a plurality of second video segments;
performing time correction processing on the second start time and the second end time of a second video segment that satisfies a second identification condition, to obtain a target start time and a target end time corresponding to the target start time;
and acquiring a target video segment based on the target start time and the target end time corresponding to the target start time.
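Read as an algorithm, claim 1 describes a coarse-to-fine cascade: detect rough candidate segments, then repeatedly filter them against an identification condition and re-adjust their boundaries. The sketch below illustrates only that control flow; the scoring model, the boundary-adjustment step, and the concrete thresholds are hypothetical placeholders, not anything the claim specifies.

```python
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds

def refine_segments(
    segments: List[Segment],
    score: Callable[[Segment], float],     # hypothetical content-recognition model
    adjust: Callable[[Segment], Segment],  # hypothetical boundary-adjustment step
    thresholds: List[float],               # one identification condition per stage
) -> List[Segment]:
    """Coarse-to-fine boundary refinement in the spirit of claim 1:
    at each stage, keep only segments whose score satisfies that stage's
    condition, then adjust their start/end times for the next stage."""
    for threshold in thresholds:
        kept = [seg for seg in segments if score(seg) >= threshold]
        segments = [adjust(seg) for seg in kept]
    return segments  # final (target_start, target_end) pairs
```

With three thresholds this reproduces the first/second/target progression of claims 1 and 2.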
2. The method according to claim 1, wherein the performing time correction processing on the second start time and the second end time of the second video segment that satisfies the second identification condition, to obtain the target start time and the target end time corresponding to the target start time, comprises:
performing time adjustment processing on the second start time and the second end time of the second video segment that satisfies the second identification condition, to obtain a third start time and a third end time corresponding to the third start time;
acquiring third video segments based on the third start time and the third end time corresponding to the third start time, wherein there are a plurality of third video segments;
and performing time adjustment processing on the third start time and the third end time of a third video segment that satisfies a third identification condition, to obtain the target start time and the target end time corresponding to the target start time.
3. The method according to claim 2, wherein before the performing time adjustment processing on the first start time and the first end time of the first video segment that satisfies the first identification condition, to obtain the second start time and the second end time corresponding to the second start time, the method further comprises:
selecting, from the plurality of first video segments, a first video segment that satisfies the first identification condition;
before the performing time correction processing on the second start time and the second end time of the second video segment that satisfies the second identification condition, to obtain the target start time and the target end time corresponding to the target start time, the method further comprises:
selecting, from the plurality of second video segments, a second video segment that satisfies the second identification condition;
and before the performing time adjustment processing on the third start time and the third end time of the third video segment that satisfies the third identification condition, to obtain the target start time and the target end time corresponding to the target start time, the method further comprises:
selecting, from the plurality of third video segments, a third video segment that satisfies the third identification condition.
4. The method according to claim 3, wherein the selecting, from the plurality of first video segments, the first video segment that satisfies the first identification condition comprises:
performing identification processing on the plurality of first video segments to obtain a plurality of first identification results;
determining, according to the plurality of first identification results, a first video segment that satisfies the first identification condition;
the selecting, from the plurality of second video segments, the second video segment that satisfies the second identification condition comprises:
performing identification processing on the plurality of second video segments to obtain a plurality of second identification results;
determining, according to the plurality of second identification results, a second video segment that satisfies the second identification condition;
and the selecting, from the plurality of third video segments, the third video segment that satisfies the third identification condition comprises:
performing identification processing on the plurality of third video segments to obtain a plurality of third identification results;
and determining, according to the plurality of third identification results, a third video segment that satisfies the third identification condition.
5. The method according to claim 4, wherein a video segment is identified as follows to obtain the identification result corresponding to the video segment:
extracting segment features of the video segment;
inputting the segment features into a content identification model corresponding to the target content to obtain a target probability that the target content is present in the video segment;
and determining the identification result corresponding to the video segment according to the target probability that the target content is present in the video segment.
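As a toy rendering of this three-step identification, a minimal sketch follows; the feature extractor and the linear "content model" are stand-ins, since the patent names neither.

```python
import numpy as np

def extract_features(frames: np.ndarray) -> np.ndarray:
    # Stand-in for a real video backbone: average pixels per frame, then
    # summarize over time into a tiny segment-level feature vector.
    per_frame = frames.reshape(frames.shape[0], -1).mean(axis=1)
    return np.array([per_frame.mean(), per_frame.std()])

def identify_segment(frames: np.ndarray, weights: np.ndarray, threshold: float = 0.5):
    """Claim-5-style pipeline: segment features -> target probability -> result."""
    feat = extract_features(frames)
    prob = 1.0 / (1.0 + np.exp(-float(feat @ weights)))  # toy linear model + sigmoid
    return prob, prob >= threshold
```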
6. The method according to claim 5, wherein the extracting segment features of the video segment comprises:
extracting a plurality of window segments of the target video, wherein the window segments each correspond to a window start time and a window end time, and among the plurality of window segments there is at least one window segment sharing a partial segment with another window segment;
extracting segment features of the plurality of window segments respectively, to obtain window segment features corresponding to the plurality of window segments respectively;
determining a start time and an end time of the video segment;
acquiring a plurality of target window segments whose window start times are greater than or equal to the start time of the video segment and whose window end times are less than or equal to the end time of the video segment;
and determining the segment features of the video segment according to the window segment features of the plurality of target window segments.
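Concretely, the overlap requirement is what a strided sliding window produces whenever the stride is smaller than the window length, and the target windows are those falling wholly inside the segment. A minimal sketch (the window length and stride values are illustrative only):

```python
def make_windows(video_duration: float, window_len: float = 4.0, stride: float = 2.0):
    """Sliding windows over the target video; stride < window_len makes
    adjacent windows share a partial segment, as claim 6 requires."""
    windows, t = [], 0.0
    while t + window_len <= video_duration:
        windows.append((t, t + window_len))
        t += stride
    return windows

def target_windows(windows, seg_start: float, seg_end: float):
    """Windows whose start is >= the segment start and whose end is <=
    the segment end, i.e. the claim's target window segments."""
    return [(s, e) for (s, e) in windows if s >= seg_start and e <= seg_end]
```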
7. The method according to claim 6, wherein the determining the segment features of the video segment according to the window segment features of the plurality of target window segments comprises:
performing feature fusion processing on the window segment features of the respective target window segments to obtain the segment features of the video segment.
8. The method according to claim 7, wherein the performing feature fusion processing on the window segment features of the respective target window segments to obtain the segment features of the video segment comprises:
performing, according to a region-of-interest pooling algorithm, feature fusion processing on the window segment features of the respective target window segments to obtain the segment features of the video segment.
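Region-of-interest pooling, read along the time axis, divides the ordered target-window features into a fixed number of bins and pools within each bin, so a segment of any length yields a fixed-size feature. A minimal numpy sketch under that reading (the bin count and max-pooling are assumptions; the claim names only the algorithm family):

```python
import numpy as np

def temporal_roi_pool(window_feats: np.ndarray, num_bins: int = 4) -> np.ndarray:
    """window_feats: (num_windows, feat_dim), windows in temporal order.
    Returns a (num_bins * feat_dim,) fused segment feature."""
    n = window_feats.shape[0]
    edges = np.linspace(0, n, num_bins + 1).astype(int)
    pooled = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        hi = max(hi, lo + 1)  # every bin covers at least one window
        pooled.append(window_feats[lo:hi].max(axis=0))
    return np.concatenate(pooled)
```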
9. The method according to claim 6, wherein the extracting segment features of the plurality of window segments respectively, to obtain the window segment features corresponding to the plurality of window segments respectively, comprises:
sequentially inputting the plurality of window segments into a feature extraction model to obtain a plurality of window basic features;
and performing feature analysis processing on the plurality of window basic features to obtain the window segment features corresponding to the plurality of window segments respectively.
10. The method according to claim 9, wherein the performing feature analysis processing on the plurality of window basic features to obtain the window segment features corresponding to the plurality of window segments respectively comprises:
performing attention mechanism processing on the plurality of window basic features to add contextual features of the plurality of windows into the plurality of window basic features, obtaining a plurality of context features;
and performing temporal feature processing on the plurality of context features to add temporal features of the plurality of windows into the plurality of context features, obtaining the window segment features corresponding to the plurality of window segments respectively.
11. The method according to claim 10, wherein the performing attention mechanism processing on the plurality of window basic features to add the contextual features of the plurality of windows into the plurality of window basic features, obtaining the plurality of context features, comprises:
extracting, based on a global pooling algorithm, global features corresponding to the plurality of window basic features respectively;
determining attention masks corresponding to the plurality of window basic features according to the window basic features and the global features corresponding to the window basic features respectively;
performing, for any window basic feature, a dot-product calculation on the window basic feature and the attention mask corresponding to the window basic feature to obtain an attention feature corresponding to the window basic feature, thereby obtaining a plurality of attention features;
and performing normalization calculation on the plurality of attention features respectively to obtain the plurality of context features.
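One way to read those four steps, sketched in numpy. The sigmoid mask and the L2 normalization are assumptions; the claim fixes only the stages of pooling, mask computation, product, and normalization.

```python
import numpy as np

def attention_context(basic: np.ndarray) -> np.ndarray:
    """basic: (num_windows, feat_dim) window basic features.
    Returns context features of the same shape per claim 11:
    global pooling -> attention mask -> product -> normalization."""
    global_feat = basic.mean(axis=0, keepdims=True)       # global pooling over windows
    mask = 1.0 / (1.0 + np.exp(-basic * global_feat))     # mask from feature/global agreement
    attended = basic * mask                               # per-feature attention gating
    norms = np.linalg.norm(attended, axis=1, keepdims=True)
    return attended / np.maximum(norms, 1e-8)             # normalized context features
```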
12. The method according to claim 10, wherein the performing temporal feature processing on the plurality of context features to add the temporal features of the plurality of windows into the plurality of context features, obtaining the plurality of window segment features, comprises:
determining a basic grouping layer, a temporal convolution layer, a normalization layer and a fusion layer in a temporal grouping module;
inputting the plurality of context features into the basic grouping layer to group the plurality of context features at least twice according to a preset grouping rule, obtaining at least two grouping results, wherein each grouping result comprises at least two grouping sets and each grouping set comprises at least one context feature;
inputting the at least one context feature in each of the plurality of grouping sets into the temporal convolution layer to obtain temporal features corresponding to the plurality of grouping sets respectively;
inputting the plurality of temporal features into the normalization layer respectively to obtain a plurality of normalized temporal features;
performing, for any grouping result, fusion processing on the at least two normalized temporal features corresponding to the grouping result to obtain a temporal group feature, thereby obtaining temporal group features corresponding to the at least two grouping results respectively;
performing fusion processing on the temporal group features corresponding to the at least two grouping results respectively to obtain a target temporal feature;
and obtaining the plurality of window segment features based on products of the at least one context feature and the target temporal feature respectively.
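A compact reading of the temporal grouping module, with two grouping granularities, a depthwise temporal convolution, per-group normalization, and mean fusion. All of these concrete choices are assumptions; the claim only fixes the layer sequence.

```python
import numpy as np

def temporal_conv(x: np.ndarray, kernel=(0.25, 0.5, 0.25)) -> np.ndarray:
    """Depthwise temporal convolution: smooth each feature channel over time."""
    k = np.asarray(kernel)
    return np.stack([np.convolve(x[:, c], k, mode="same")
                     for c in range(x.shape[1])], axis=1)

def temporal_grouping(contexts: np.ndarray, group_sizes=(2, 4)) -> np.ndarray:
    """contexts: (num_windows, feat_dim). Two grouping results (sizes 2 and 4),
    temporal conv + normalization inside each grouping set, fusion within and
    across groupings, then modulation of the contexts by the resulting target
    temporal feature -- the window segment features of claim 12."""
    contexts = np.asarray(contexts, dtype=float)
    n = contexts.shape[0]
    group_results = []
    for size in group_sizes:
        fused = np.zeros_like(contexts)
        for lo in range(0, n, size):                  # one grouping set at a time
            t = temporal_conv(contexts[lo:lo + size])
            t /= np.maximum(np.linalg.norm(t, axis=1, keepdims=True), 1e-8)
            fused[lo:lo + size] = t
        group_results.append(fused)                   # temporal group feature
    target = np.mean(group_results, axis=0)           # fuse across grouping results
    return contexts * target                          # window segment features
```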
13. The method according to claim 5, wherein the plurality of first identification results comprise a plurality of first target probabilities, the plurality of second identification results comprise a plurality of second target probabilities, and the plurality of third identification results comprise a plurality of third target probabilities;
the determining, according to the plurality of first identification results, the first video segment that satisfies the first identification condition comprises:
determining first video segments corresponding to first target probabilities, among the plurality of first target probabilities, that are greater than a first probability threshold, to obtain candidate first video segments;
performing redundancy removal processing on the candidate first video segments to obtain the first video segment that satisfies the first identification condition;
the determining, according to the plurality of second identification results, the second video segment that satisfies the second identification condition comprises:
determining second video segments corresponding to second target probabilities, among the plurality of second target probabilities, that are greater than a second probability threshold, to obtain candidate second video segments;
performing redundancy removal processing on the candidate second video segments to obtain the second video segment that satisfies the second identification condition;
the determining, according to the plurality of third identification results, the third video segment that satisfies the third identification condition comprises:
determining a third video segment corresponding to a third target probability, among the plurality of third target probabilities, that is greater than a third probability threshold as the third video segment satisfying the third identification condition;
wherein the first probability threshold is less than the second probability threshold, and the second probability threshold is less than the third probability threshold.
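The escalating thresholds amount to a progressively stricter filter at each stage of the cascade. A minimal sketch (the numeric threshold values are illustrative; only their ordering is claimed):

```python
def filter_by_stage(stage_probs, thresholds=(0.3, 0.5, 0.7)):
    """stage_probs: list with one dict per stage, mapping a segment to its
    target probability at that stage. Keeps only segments clearing each
    stage's strictly higher probability threshold, per claim 13."""
    assert thresholds[0] < thresholds[1] < thresholds[2]
    segments = list(stage_probs[0])
    for stage, thr in enumerate(thresholds):
        segments = [s for s in segments if stage_probs[stage].get(s, 0.0) > thr]
    return segments
```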
14. The method according to claim 13, wherein the performing redundancy removal processing on the candidate first video segments to obtain the first video segment that satisfies the first identification condition comprises:
performing, based on a non-maximum suppression algorithm, redundancy removal processing on the candidate first video segments to obtain the first video segment that satisfies the first identification condition;
and the performing redundancy removal processing on the candidate second video segments to obtain the second video segment that satisfies the second identification condition comprises:
performing, based on the non-maximum suppression algorithm, redundancy removal processing on the candidate second video segments to obtain the second video segment that satisfies the second identification condition.
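Non-maximum suppression over temporal segments keeps the highest-scoring candidate and drops any candidate whose temporal overlap with a kept segment is too large. A standard 1-D sketch (the IoU threshold value is an assumption):

```python
def temporal_nms(segments, scores, iou_thr=0.5):
    """segments: list of (start, end); scores: matching target probabilities.
    Returns the segments that survive redundancy removal."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        s1, e1 = segments[i]
        redundant = False
        for j in kept:
            s2, e2 = segments[j]
            inter = max(0.0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            if union > 0 and inter / union > iou_thr:
                redundant = True
                break
        if not redundant:
            kept.append(i)
    return [segments[i] for i in kept]
```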
15. The method according to claim 1, wherein there are a plurality of first video segments, and the detecting the first video segments in which the target content is present in the target video, each first video segment corresponding to a first start time, comprises:
extracting a plurality of window segments of the target video, wherein the window segments each correspond to a window start time and a window end time, and among the plurality of window segments there is at least one window segment sharing a partial segment with another window segment;
dividing window segments, among the plurality of window segments, that satisfy a same aggregation condition into a same window segment set, to obtain a plurality of window segment sets;
determining, for any window segment set, a minimum window start time and a maximum window end time corresponding to the window segment set according to the window start times and window end times of the window segments in the set;
acquiring the plurality of first video segments according to the minimum window start times and the maximum window end times corresponding to the plurality of window segment sets respectively;
wherein the first start time of a first video segment is the minimum window start time of the corresponding window segment set, and its first end time is the maximum window end time of the corresponding window segment set.
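The aggregation step collapses each window segment set into one candidate first segment spanning the set's earliest window start and latest window end, e.g.:

```python
def segments_from_sets(window_sets):
    """window_sets: list of lists of (win_start, win_end) tuples.
    One first video segment per set: (min start, max end), per claim 15."""
    return [(min(s for s, _ in ws), max(e for _, e in ws))
            for ws in window_sets if ws]
```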
16. The method according to claim 15, wherein the dividing the window segments that satisfy the same aggregation condition into the same window segment set, obtaining the plurality of window segment sets, comprises:
sequentially inputting the plurality of window segments into a feature extraction model to obtain a plurality of window basic features;
sequentially inputting the plurality of window basic features into a content identification model corresponding to the target content to obtain window probabilities that the target content is present in the plurality of window segments;
performing probability region identification processing on the plurality of window probabilities to obtain a plurality of probability regions;
determining, for any probability region, a window segment set formed by the window segments corresponding to the window probabilities in the probability region;
and determining the plurality of window segment sets formed by the window segment sets corresponding to the plurality of probability regions respectively.
17. The method according to claim 16, wherein the performing probability region identification processing on the plurality of window probabilities to obtain the plurality of probability regions comprises:
performing probability region identification processing on the plurality of window probabilities by using a watershed algorithm, to obtain the plurality of probability regions.
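A watershed over a 1-D window-probability curve can be read as: every window hill-climbs to its nearest local maximum, and windows draining to the same peak form one probability region. A simplified sketch under that reading (the low-probability floor is an added assumption, not in the claim):

```python
def watershed_regions(probs, floor=0.1):
    """probs: window probabilities in temporal order. Returns lists of
    window indices, one list per probability region."""
    n = len(probs)

    def climb(i):
        # Follow strictly increasing neighbors up to a local maximum.
        while True:
            best = i
            if i > 0 and probs[i - 1] > probs[best]:
                best = i - 1
            if i < n - 1 and probs[i + 1] > probs[best]:
                best = i + 1
            if best == i:
                return i
            i = best

    regions = {}
    for i in range(n):
        if probs[i] >= floor:            # drop near-zero windows entirely
            regions.setdefault(climb(i), []).append(i)
    return list(regions.values())
```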
18. The method according to claim 1, wherein there are a plurality of target video segments, and the method further comprises:
splicing the plurality of target video segments to obtain a key video in which the target content is present.
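One way to realize the splicing, sketched with the ffmpeg concat demuxer. The tool, the file names, and the stream-copy choice are assumptions; the claim does not prescribe any of them.

```python
import os
import subprocess
import tempfile

def splice(clip_paths, out_path="key_video.mp4"):
    """Concatenate already-cut target video clips into one key video."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for p in clip_paths:
            f.write(f"file '{os.path.abspath(p)}'\n")   # concat-demuxer list format
        list_file = f.name
    try:
        subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                        "-i", list_file, "-c", "copy", out_path], check=True)
    finally:
        os.unlink(list_file)
    return out_path
```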
19. The method according to claim 1, wherein the target content comprises presence of a goal action, and the target video segment is a video segment in which a goal action is present;
the method further comprises:
determining team tag information of at least one competing user and the competition team to which the at least one competing user belongs;
counting, among the target video segments, the number of target video segments corresponding to the at least one competing user, to obtain a goal count corresponding to the at least one competing user;
and determining a competition result corresponding to the target video according to the team tag information and the goal count corresponding to the at least one competing user respectively.
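Counting goals per player and rolling them up by team tag is a straightforward tally; a toy sketch (the data shapes are assumptions):

```python
from collections import Counter

def match_result(goal_scorers, player_team):
    """goal_scorers: one player id per detected goal segment.
    player_team: player id -> team tag. Returns (per-team score, summary)."""
    score = Counter()
    for player, goals in Counter(goal_scorers).items():
        score[player_team[player]] += goals              # roll up via team tags
    ranked = score.most_common()
    if not ranked:
        return score, "no goals detected"
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return score, "draw"
    return score, f"{ranked[0][0]} wins"
```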
20. A video processing method, comprising:
detecting a target video and target content input by a user;
detecting first video segments in which the target content is present in the target video, wherein there are a plurality of first video segments, and each first video segment corresponds to a first start time and a first end time corresponding to the first start time;
performing time adjustment processing on the first start time and the first end time of a first video segment that satisfies a first identification condition, to obtain a second start time and a second end time corresponding to the second start time;
acquiring second video segments based on the second start time and the second end time corresponding to the second start time, wherein there are a plurality of second video segments;
performing time correction processing on the second start time and the second end time of a second video segment that satisfies a second identification condition, to obtain a target start time and a target end time corresponding to the target start time;
acquiring a target video segment based on the target start time and the target end time corresponding to the target start time;
and outputting the target video segment for the user.
21. A video processing method, comprising:
receiving a target video and target content sent by an electronic device, wherein the target video and the target content are obtained by the electronic device detecting user input;
detecting first video segments in which the target content is present in the target video, wherein there are a plurality of first video segments, and each first video segment corresponds to a first start time and a first end time corresponding to the first start time;
performing time adjustment processing on the first start time and the first end time of a first video segment that satisfies a first identification condition, to obtain a second start time and a second end time corresponding to the second start time;
acquiring second video segments based on the second start time and the second end time corresponding to the second start time, wherein there are a plurality of second video segments;
performing time correction processing on the second start time and the second end time of a second video segment that satisfies a second identification condition, to obtain a target start time and a target end time corresponding to the target start time;
acquiring a target video segment based on the target start time and the target end time corresponding to the target start time;
and sending the target video segment to the electronic device so that the electronic device outputs the target video segment for the user.
22. A video processing apparatus, comprising a storage component and a processing component, wherein the storage component is configured to store one or more computer instructions, and the one or more computer instructions are invoked and executed by the processing component;
the processing component is configured to:
detect first video segments in which target content is present in a target video, wherein there are a plurality of first video segments, and each first video segment corresponds to a first start time and a first end time corresponding to the first start time; perform time adjustment processing on the first start time and the first end time of a first video segment that satisfies a first identification condition, to obtain a second start time and a second end time corresponding to the second start time; acquire second video segments based on the second start time and the second end time corresponding to the second start time, wherein there are a plurality of second video segments; perform time correction processing on the second start time and the second end time of a second video segment that satisfies a second identification condition, to obtain a target start time and a target end time corresponding to the target start time; and acquire a target video segment based on the target start time and the target end time corresponding to the target start time.
23. An electronic device, comprising a storage component and a processing component, wherein the storage component is configured to store one or more computer instructions, and the one or more computer instructions are invoked and executed by the processing component;
the processing component is configured to:
acquire a target video and target content input by a user; detect first video segments in which the target content is present in the target video, wherein there are a plurality of first video segments, and each first video segment corresponds to a first start time and a first end time corresponding to the first start time; perform time adjustment processing on the first start time and the first end time of a first video segment that satisfies a first identification condition, to obtain a second start time and a second end time corresponding to the second start time; acquire second video segments based on the second start time and the second end time corresponding to the second start time, wherein there are a plurality of second video segments; perform time correction processing on the second start time and the second end time of a second video segment that satisfies a second identification condition, to obtain a target start time and a target end time corresponding to the target start time; acquire a target video segment based on the target start time and the target end time corresponding to the target start time; and output the target video segment for the user.
24. A server, comprising a storage component and a processing component, wherein the storage component is configured to store one or more computer instructions, and the one or more computer instructions are invoked and executed by the processing component;
the processing component is configured to:
receive a target video and target content sent by an electronic device, wherein the target video and the target content are obtained by the electronic device detecting user input; detect first video segments in which the target content is present in the target video, wherein there are a plurality of first video segments, and each first video segment corresponds to a first start time and a first end time corresponding to the first start time; perform time adjustment processing on the first start time and the first end time of a first video segment that satisfies a first identification condition, to obtain a second start time and a second end time corresponding to the second start time; acquire second video segments based on the second start time and the second end time corresponding to the second start time, wherein there are a plurality of second video segments; perform time correction processing on the second start time and the second end time of a second video segment that satisfies a second identification condition, to obtain a target start time and a target end time corresponding to the target start time; acquire a target video segment based on the target start time and the target end time corresponding to the target start time; and send the target video segment to the electronic device so that the electronic device outputs the target video segment for the user.
CN202010537094.1A 2020-06-12 2020-06-12 Video processing method and device, electronic device and server Active CN113810751B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010537094.1A CN113810751B (en) 2020-06-12 2020-06-12 Video processing method and device, electronic device and server

Publications (2)

Publication Number Publication Date
CN113810751A CN113810751A (en) 2021-12-17
CN113810751B (en) 2022-10-28

Family

ID=78892189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010537094.1A Active CN113810751B (en) 2020-06-12 2020-06-12 Video processing method and device, electronic device and server

Country Status (1)

Country Link
CN (1) CN113810751B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6771285B1 (en) * 1999-11-26 2004-08-03 Sony United Kingdom Limited Editing device and method
CN102414755A (en) * 2009-03-16 2012-04-11 苹果公司 Device, method, and graphical user interface for editing an audio or video attachment in an electronic message
CN103096184A (en) * 2013-01-18 2013-05-08 深圳市龙视传媒有限公司 Method and device for video editing
CN107615766A (en) * 2015-04-16 2018-01-19 维斯克体育科技有限公司 System and method for creating and distributing content of multimedia
CN107888988A (en) * 2017-11-17 2018-04-06 广东小天才科技有限公司 A kind of video clipping method and electronic equipment
CN108229280A (en) * 2017-04-20 2018-06-29 北京市商汤科技开发有限公司 Time domain motion detection method and system, electronic equipment, computer storage media
CN108476289A (en) * 2017-07-31 2018-08-31 深圳市大疆创新科技有限公司 A kind of method for processing video frequency, equipment, aircraft and system
CN108471550A (en) * 2018-03-16 2018-08-31 维沃移动通信有限公司 A kind of video intercepting method and terminal
CN109740530A (en) * 2018-12-29 2019-05-10 深圳Tcl新技术有限公司 Extracting method, device, equipment and the computer readable storage medium of video-frequency band
CN109889860A (en) * 2019-03-12 2019-06-14 山东云缦智能科技有限公司 A kind of live video file demolition method, apparatus and system
CN110267116A (en) * 2019-05-22 2019-09-20 北京奇艺世纪科技有限公司 Video generation method, device, electronic equipment and computer-readable medium
CN110796069A (en) * 2019-10-28 2020-02-14 广州博衍智能科技有限公司 Behavior detection method, system, equipment and machine readable medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108391171B (en) * 2018-02-27 2022-06-24 京东方科技集团股份有限公司 Video playing control method and device, and terminal

Similar Documents

Publication Publication Date Title
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
US9646227B2 (en) Computerized machine learning of interesting video sections
CN111428088A (en) Video classification method and device and server
CN110267119B (en) Video precision and chroma evaluation method and related equipment
CN110083741B (en) Character-oriented video abstract extraction method based on text and image combined modeling
CN112511854B (en) Live video highlight generation method, device, medium and equipment
CN111522996B (en) Video clip retrieval method and device
CN109063611B (en) Face recognition result processing method and device based on video semantics
CN109508406B (en) Information processing method and device and computer readable storage medium
US20220172476A1 (en) Video similarity detection method, apparatus, and device
CN110166826B (en) Video scene recognition method and device, storage medium and computer equipment
CN112533051A (en) Bullet screen information display method and device, computer equipment and storage medium
CN111783712A (en) Video processing method, device, equipment and medium
EP4239585A1 (en) Video loop recognition method and apparatus, computer device, and storage medium
CN112487207A (en) Image multi-label classification method and device, computer equipment and storage medium
CN111401238B (en) Method and device for detecting character close-up fragments in video
CN112686165A (en) Method and device for identifying target object in video, electronic equipment and storage medium
CN111209897A (en) Video processing method, device and storage medium
CN110688524A (en) Video retrieval method and device, electronic equipment and storage medium
CN113496208B (en) Video scene classification method and device, storage medium and terminal
CN114339362A (en) Video bullet screen matching method and device, computer equipment and storage medium
CN113255685A (en) Image processing method and device, computer equipment and storage medium
CN111027555A (en) License plate recognition method and device and electronic equipment
CN114519863A (en) Human body weight recognition method, human body weight recognition apparatus, computer device, and medium
CN108235126B (en) Method and device for inserting recommendation information in video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231114

Address after: Room 2801, 28th Floor, Building 9, Zone 4, Wangjing Dongyuan, Chaoyang District, Beijing

Patentee after: Alibaba Damo Academy (Beijing) Technology Co.,Ltd.

Address before: Fourth Floor, Capital Building, P.O. Box 847, Grand Cayman, Cayman Islands (British Overseas Territory)

Patentee before: ALIBABA GROUP HOLDING Ltd.