CN111770359B - Event video clipping method, system and computer readable storage medium - Google Patents

Event video clipping method, system and computer readable storage medium

Info

Publication number
CN111770359B
CN111770359B (application number CN202010493124.3A)
Authority
CN
China
Prior art keywords
video
event
playback
playback segment
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010493124.3A
Other languages
Chinese (zh)
Other versions
CN111770359A (en)
Inventor
赵筠
尹东芹
吴双龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Biying Technology Co ltd
Jiangsu Suning Cloud Computing Co ltd
Original Assignee
Suning Cloud Computing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suning Cloud Computing Co Ltd filed Critical Suning Cloud Computing Co Ltd
Priority to CN202010493124.3A priority Critical patent/CN111770359B/en
Publication of CN111770359A publication Critical patent/CN111770359A/en
Application granted granted Critical
Publication of CN111770359B publication Critical patent/CN111770359B/en

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21Server components or server architectures
    • H04N21/218Source of audio or video content, e.g. local disk arrays
    • H04N21/2187Live feed
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8547Content authoring involving timestamps for synchronizing content

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention discloses an event video clipping method, system and computer readable storage medium, wherein the method comprises the following steps: separating the video track and the audio track of an event video to be processed and storing them in correspondence; identifying playback segments and non-playback segments in the video track; analyzing the audio track corresponding to the non-playback segment to obtain the commentary speech endpoint, and clipping the non-playback segment according to the commentary speech endpoint to obtain non-playback segment material; filtering the playback special-effect frames out of the playback segment to obtain playback segment material; clipping the audio track according to the playback segment material and the non-playback segment material to obtain the corresponding audio material; merging the playback segment material and the non-playback segment material to obtain a material video, and synthesizing the audio material onto the material video to obtain the target clip video. The invention supports multiple event types, automatically extracts and retains important playback segments while accurately shortening the video duration, and can quickly produce large numbers of highlight videos of different dimensions with high accuracy.

Description

Event video clipping method, system and computer readable storage medium
Technical Field
The invention relates to the field of video data clipping and processing, and in particular to an event video clipping method, system and computer readable storage medium.
Background
In traditional licensed-event operation, broadcasters edit highlight segments or highlight reels during the live broadcast so that large audiences can quickly browse and share them. This process usually relies heavily on manual video editing and has the following problems: 1. Poor timeliness: producing a full-match highlight video often requires an editor to spend a great deal of time repeatedly watching the match to find highlight passages and to position cut points by trial and error; the production process is tedious and inefficient, the video is usually delivered long after the match has ended, and the viewing experience suffers. 2. Low content output: limited by operating resources, content production can only cover key events, the output for non-headline events is particularly affected, and the number of videos produced per event is limited.
As sports events have become more plentiful, manual editing can no longer meet the demand for fast, professional clipping and output across a large number of matches. Automatic video data screening methods and devices have begun to appear, but the accuracy of the videos they produce is poor.
Disclosure of Invention
The invention aims to provide an event video clipping method, system and computer readable storage medium that can produce and clip event videos quickly and accurately.
The technical scheme of the invention is as follows:
In a first aspect, an event video clipping method is provided, the method comprising:
separating and correspondingly storing a video track and an audio track of the event video to be processed;
identifying playback segments and non-playback segments in the video track;
analyzing the audio track corresponding to the non-playback segment to obtain the commentary speech endpoint, and clipping the non-playback segment according to the commentary speech endpoint to obtain non-playback segment material;
filtering the playback special-effect frames out of the playback segment to obtain playback segment material;
clipping the audio track according to the playback segment material and the non-playback segment material to obtain the corresponding audio material;
and merging the playback segment material and the non-playback segment material to obtain a material video, and synthesizing the audio material onto the material video to obtain a target clip video.
Further, the identifying playback segments and non-playback segments in the video track specifically includes:
identifying the event logo picture in the video frames of the video track by using a ResNet50 neural network to classify playback special-effect frames, and recording the classification results in real time;
and identifying playback segments and non-playback segments in the classification result.
Further, the analyzing to obtain the commentary speech endpoint in the audio track corresponding to the non-playback segment specifically includes:
setting a threshold T for the retained duration range after an event occurs, according to the event type;
and analyzing, with a voice endpoint detection method based on short-time energy and short-time average zero-crossing rate, the endpoint of the commentator's speech that is closest to the boundary of the duration-range threshold T in the corresponding audio track.
Further, the clipping the non-playback segment according to the commentary speech endpoint to obtain non-playback segment material specifically includes:
clipping the non-playback segment with the commentary speech endpoint as the segment end time point to obtain the non-playback segment material.
Further, the clipping the audio track according to the playback segment material and the non-playback segment material to obtain the corresponding audio material specifically includes:
clipping the audio track from its starting position for a duration equal to the sum of the durations of the non-playback segment material and the playback segment material to obtain the audio material.
Further, before the separating and correspondingly storing the video track and the audio track of the event video to be processed, the method comprises the following steps:
acquiring an event video to be processed; the method specifically comprises the following steps:
acquiring N video frames in a to-be-processed event video stream and a display timestamp corresponding to each video frame;
identifying an event time identifier in each video frame picture in the N video frames, and performing first time axis matching on the event time identifier and a display time stamp to position an effective event video;
analyzing the event data corresponding to the event video stream to be processed to obtain structured data for all events, wherein the structured data includes the occurrence time of each event in the match, and performing second time axis matching between the occurrence times and the display time stamps;
acquiring, from all events, a plurality of associated events that form any target event, determining the start display time stamp and the end display time stamp of the target event according to the start and end points of the plurality of associated events, locating and extracting all video frames of the target event in the effective event video, and clipping and encoding them into the event video to be processed.
Further, the identifying an event time identifier in each video frame picture of the N video frames, and performing first time axis matching on the event time identifier and a display timestamp to locate an effective event video specifically includes:
and identifying and verifying the event time identifier in each video frame picture in the N video frames by utilizing a deep learning algorithm based on the Faster R-CNN neural network through an AI identification module, and performing first time axis matching on the verified event time identifier and a display time stamp to position an effective event video.
Further, the method also comprises: merging the target clip videos according to their occurrence times to obtain a collection video.
In a second aspect, there is provided an event video clip system, the system comprising:
the separation storage module is used for separating and correspondingly storing the video track and the audio track of the event video to be processed;
the filtering module is used for filtering the playback special-effect frames out of the playback segment to obtain playback segment material;
the recognition and analysis module is used for recognizing playback segments and non-playback segments in the video track, and analyzing the audio track corresponding to the non-playback segment to obtain the commentary speech endpoint;
the clipping module is used for clipping the non-playback segment according to the commentary speech endpoint to obtain non-playback segment material, and clipping the audio track according to the playback segment material and the non-playback segment material to obtain the corresponding audio material;
and the merging module is used for merging the playback segment material and the non-playback segment material to obtain a material video, and synthesizing the audio material onto the material video to obtain a target clip video.
In a third aspect, a computer-readable storage medium is provided, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method of any one of the first aspects.
The invention has the advantages that: the method supports multiple event types, automatically extracts and retains important playback segments from event videos while accurately removing the playback special-effect frames to shorten the video duration, and quickly produces large numbers of clipped videos from massive video resources with high accuracy; this helps users quickly grasp the summary and essence of an event, and provides useful support for the clipping, production, sharing and spreading of increasingly popular short videos.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required by the embodiments are briefly described below. Obviously, the drawings described below are only some embodiments of the present application, and other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
FIG. 1 is a flow chart illustrating a method for editing a video clip of an event according to an embodiment of the present invention;
FIG. 2 is a block diagram of a highlight cutting system for a soccer game according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating transition effects achieved by a video editing method for an event according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating goal coordinates and range in a video editing method for an event according to an embodiment of the present invention;
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments that can be derived from the embodiments given herein by a person of ordinary skill in the art are intended to be within the scope of the present disclosure.
In this application, ResNet, also called a residual neural network, adds residual learning to the traditional convolutional neural network. This alleviates the problems of gradient vanishing and accuracy degradation (on the training set) in deep networks, so that the network can be made deeper while accuracy is maintained and speed remains under control.
Block ID: the identifier of a fragment.
CDN: abbreviation of Content Delivery Network, a layer of intelligent virtual network on top of the existing internet, formed by placing node servers throughout the network; it avoids, as far as possible, the bottlenecks and links on the internet that may affect data transmission speed and stability, making content delivery faster and more stable.
VAR: abbreviation of Video Assistant Referee; an assistant referee provides information to the on-field referee by replaying video, helping the referee correct missed incidents or clear and obvious errors that change the course of the match, thereby improving refereeing accuracy.
An event video clip needs to preserve the integrity of each event, and the video duration must be compressed on the premise of that integrity. To solve the problems of inefficient, costly manual clipping and inaccurate automatic clipping (where the video duration cannot be effectively compressed, or inaccurate event cuts leave events incomplete), this application provides an event video clipping method that performs fine-grained clipping automatically, compresses the display duration of a single event, highlights the event focus, and effectively shortens the overall video duration.
Embodiment 1: an event video clipping method, as shown in FIG. 1, includes:
101. separating and correspondingly storing a video track and an audio track of the event video to be processed;
specifically, the video track and the audio track of the event video to be processed are separated and correspondingly stored, and the picture sequence and the audio sequence are correspondingly stored.
Prior to this step, the method further comprises acquiring the event video to be processed, which specifically comprises the following steps:
101-1, acquiring N video frames in a video stream of the event to be processed and a display time stamp corresponding to each video frame;
the display time stamp is mainly used for measuring when the decoded video frame is displayed, that is, for marking the display time point of each frame in the manufactured target video.
The event video stream to be processed is a live event video stream provided by an event data provider or an on-demand event video stream downloaded through an on-demand video playing address, and the specific type of the event video stream is not limited in this embodiment.
N is at least 4. Take a soccer game as an example:
When the event video stream is a live event video stream, the video frames are obtained from the CDN: the CDN intercepts N video frames from the live stream at a preset fixed frequency, and extracts the Block ID of the TS fragment containing each video frame together with the corresponding display time stamp. After the first or second half of a football match starts, the system obtains from the CDN the first 6 video frames after each kick-off in the event video stream to be processed, the Block IDs of the corresponding TS fragments, and the display time stamp of each video frame.
When the event video stream is an on-demand event video stream, the on-demand file is read directly by the AI identification module. Since the total duration of an on-demand file is fixed, relative time points for the first half and the second half are chosen at random within the total duration, frames are extracted at those time points, and the corresponding display time stamps are obtained by decoding. Here N is 4, i.e. 2 frames each for the first half and the second half; if recognition fails, 2 further frames are extracted at random. If there is extra time, the first and second halves of extra time are also processed by frame extraction as described above.
101-2, identifying the event time identifier in each video frame picture in the N video frames, and carrying out first time axis matching on the event time identifier and the display time stamp to position an effective event video.
Specifically, an AI identification module identifies the event time identifier in each of the N video frame pictures by using a deep learning algorithm. In this embodiment, the event time identifier in each video frame is preferably obtained by the AI identification module recognizing the displayed match time on the scoreboard in each video frame of the event video stream with a deep learning algorithm based on the Faster R-CNN neural network, and then verifying the result.
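A minimal sketch of the detection part of this step, assuming a Faster R-CNN fine-tuned to localize the scoreboard game-clock overlay; the checkpoint name and the two-class setup are assumptions for illustration, not details from the patent:

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Assumed fine-tuned detector: background + clock overlay = 2 classes.
# "clock_detector.pth" is a hypothetical checkpoint name.
detector = fasterrcnn_resnet50_fpn(weights=None, num_classes=2)
detector.load_state_dict(torch.load("clock_detector.pth"))
detector.eval()

def detect_clock_box(frame: torch.Tensor):
    """Return the highest-scoring [x1, y1, x2, y2] box for the clock
    overlay, or None. The crop would then go to a digit recognizer, and
    the recognized time would be verified across neighbouring frames."""
    with torch.no_grad():
        out = detector([frame])[0]  # frame: float tensor, CHW, in [0, 1]
    if out["boxes"].shape[0] == 0:
        return None
    best = out["scores"].argmax()
    return out["boxes"][best].tolist()

box = detect_clock_box(torch.rand(3, 720, 1280))  # placeholder frame
```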
Because video segments from before the match starts exist in the stream, the event time identifier and the display time stamp are matched on a first time axis to locate the match portion, i.e. the effective event video; determining the match starting point from the effective match time improves positioning precision.
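A minimal sketch of this first time-axis matching, assuming the recognized (and verified) game-clock readings are already available for several frames; with a constant offset between the two time axes, the kick-off point follows directly:

```python
def locate_kickoff_pts(samples: list[tuple[float, float]]) -> float:
    """Estimate the display timestamp (PTS) at which the game clock
    reads 0:00, i.e. the start of the effective event video.

    samples: (pts_seconds, clock_seconds) pairs, one per verified frame:
    the frame's display timestamp and the game clock recognized from its
    scoreboard. With a constant offset between the two axes, the
    kick-off PTS is pts - clock; averaging smooths recognition noise.
    """
    offsets = [pts - clock for pts, clock in samples]
    return sum(offsets) / len(offsets)

# e.g. frames whose scoreboards read 12:03, 12:05 and 12:08 of match time
samples = [(1923.0, 723.0), (1925.1, 725.0), (1928.2, 728.0)]
print(f"effective event video starts at PTS {locate_kickoff_pts(samples):.1f}s")
```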
101-3, analyzing the event data corresponding to the event video stream to be processed to obtain structured data for all events, wherein the structured data includes the occurrence time of each event in the match, and performing second time axis matching between the occurrence times and the display time stamps;
specifically, when the event video stream is a live event video stream provided by an event data provider, the event data is obtained by the event data provider through feedback or through query;
and when the event video stream is the on-demand event video stream downloaded through the on-demand video playing address, the event data is obtained by directly inquiring the detailed event data of the historical events.
Since a match may be interrupted or extended, it is necessary to extract the occurrence time of each event in the match and perform second time-axis matching between the occurrence time and the display time stamp, aligning occurrence times with display times.
101-4, acquiring, from all events, a plurality of associated events that form any target event, determining the start display time stamp and the end display time stamp of the target event according to the start and end points of the plurality of associated events, locating and extracting all video frames of the target event in the effective event video, and clipping and encoding them into the event video to be processed.
Specifically, a core event is determined, and the events related to the core event are obtained through preset association rules, for example: when the core event is a goal, the related events include the passing, dribbling and centre-circle kick-off events that may occur before and after the goal. The core event and its related events together constitute the associated events.
In this embodiment, the preset association rule is not specifically limited.
After the event video to be processed is obtained, it can be screened according to preset weighting rules. For example, when the event is a football match, the preset weighting rules at least include the start of the first half and the end of the second half, plus at least one of a goal, a Highlight event given by the event data provider, a VAR review, a threatening shot, a red or yellow card, and a penalty kick. For a match without goals, the preset weighting rules include the start of the first half and the end of the second half and at least threatening shots, and the degree of threat must be defined: referring to the diagram of goal coordinates and range in FIG. 4, if the ball's trajectory after a shot falls within the Close area ring around the outer frame of the goal, the shot is judged to be a threatening shot; other trajectories far from the goal are not screened or recorded.
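As an illustration only, the Close-area test could look like the following sketch, assuming the shot trajectory's endpoint is available in goal-plane coordinates; the margin and the coordinate convention are hypothetical (the standard 7.32 m × 2.44 m goal frame is used), not values taken from FIG. 4:

```python
# Hedged sketch of the "Close area" test: x runs along the goal line,
# y is height above the ground; the 1.0 m margin is an assumption.
def is_threatening_shot(end_x: float, end_y: float,
                        goal_left: float = 0.0, goal_right: float = 7.32,
                        goal_top: float = 2.44, margin: float = 1.0) -> bool:
    """True if the shot trajectory's endpoint lands inside the Close band
    around the outer frame of the goal (goal mouth plus the margin)."""
    within_x = (goal_left - margin) <= end_x <= (goal_right + margin)
    within_y = 0.0 <= end_y <= (goal_top + margin)
    return within_x and within_y

print(is_threatening_shot(7.8, 2.1))   # just wide of the post -> True
print(is_threatening_shot(12.0, 5.0))  # far from the goal -> False, not recorded
```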
It should be noted that the preset weighting rule is set according to the event content, the existing editing habit and experience, and the preset weighting rule is not specifically limited in this embodiment.
102. Identifying playback segments and non-playback segments in the video track; the method specifically comprises the following steps:
identifying the event logo picture in the video frames of the video track by using a ResNet50 neural network to classify playback special-effect frames, and recording the classification results in real time;
and identifying playback segments and non-playback segments from the classification results.
More specifically, the classification results recorded in real time are analyzed: if the inter-frame distance between playback special-effect frames is smaller than a preset value, they are judged to belong to the same playback special-effect transition, and the video content between two such transitions is judged to be a playback segment; the playback segments are thereby identified.
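A minimal sketch of this classification-and-grouping step, assuming a ResNet50 fine-tuned as a two-class logo/non-logo classifier; the checkpoint name, the class index and the gap threshold are illustrative assumptions:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Assumed fine-tuned classifier: does a frame show the playback
# special-effect (event logo) picture? Checkpoint name is hypothetical.
model = models.resnet50()
model.fc = torch.nn.Linear(model.fc.in_features, 2)
model.load_state_dict(torch.load("logo_frame_classifier.pth"))
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def is_effect_frame(frame: Image.Image) -> bool:
    """Classify one video frame; class 1 = playback special-effect frame."""
    with torch.no_grad():
        logits = model(preprocess(frame).unsqueeze(0))
    return logits.argmax(dim=1).item() == 1

def playback_segments(effect_indices: list[int], max_gap: int = 30):
    """Group recorded effect-frame indices into transitions (inter-frame
    distance below max_gap = same transition), then mark the span between
    consecutive transitions as one playback segment."""
    groups, current = [], [effect_indices[0]]
    for idx in effect_indices[1:]:
        if idx - current[-1] < max_gap:
            current.append(idx)
        else:
            groups.append(current)
            current = [idx]
    groups.append(current)
    return [(a[-1] + 1, b[0] - 1) for a, b in zip(groups, groups[1:])]
```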
103. Analyzing the audio track corresponding to the non-playback segment to obtain the commentary speech endpoint, and clipping the non-playback segment according to the commentary speech endpoint to obtain non-playback segment material;
the analyzing to obtain the commentary speech endpoint in the audio track corresponding to the non-playback segment specifically includes:
setting a threshold T for the retained duration range after an event occurs, according to the event type;
and analyzing, with a voice endpoint detection method based on short-time energy and short-time average zero-crossing rate, the endpoint of the commentator's speech that is closest to the boundary of the duration-range threshold T in the corresponding audio track.
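A minimal sketch of energy/ZCR endpoint detection, assuming the audio track is available as a mono numpy array; the thresholds are illustrative and would be tuned on real commentary, and in the method above the candidate endpoint closest to the boundary of the duration-range threshold T would then be selected:

```python
import numpy as np

def last_speech_endpoint(signal: np.ndarray, sr: int,
                         frame_ms: int = 25, hop_ms: int = 10) -> float:
    """Return the time (in seconds) at which commentary speech ends.

    Classic endpoint detection: a frame counts as speech if its
    short-time energy is high (voiced speech) OR its short-time average
    zero-crossing rate is high (unvoiced consonants). The 2x-median
    energy threshold and the 0.25 ZCR threshold are assumptions.
    """
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    x = signal.astype(np.float64)
    starts = range(0, len(x) - frame, hop)
    energy = np.array([np.mean(x[s:s + frame] ** 2) for s in starts])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(x[s:s + frame])))) / 2
                    for s in starts])
    is_speech = (energy > 2.0 * np.median(energy)) | (zcr > 0.25)
    speech = np.flatnonzero(is_speech)
    if speech.size == 0:
        return 0.0
    # end of the last speech frame, converted back to seconds
    return float(speech[-1] * hop + frame) / sr
```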
The clipping the non-playback segment according to the commentary speech endpoint to obtain non-playback segment material specifically includes:
clipping the non-playback segment with the commentary speech endpoint as the segment end time point to obtain the non-playback segment material.
104. Filtering the playback special-effect frames out of the playback segment to obtain the playback segment material;
105. Clipping the audio track according to the playback segment material and the non-playback segment material to obtain the corresponding audio material; this specifically comprises:
clipping the audio track from its starting position for a duration equal to the sum of the durations of the non-playback segment material and the playback segment material to obtain the audio material. More specifically, the purpose of clipping the audio track is to keep sound and picture synchronized over the non-playback portion; whether the audio accompanying the playback segment pictures corresponds to the original picture period in the event video to be processed is not limited in this embodiment of the invention.
106. Merging the playback segment material and the non-playback segment material to obtain a material video, and synthesizing the audio material onto the material video to obtain the target clip video.
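Steps 105 and 106 can be sketched with ffmpeg as follows, assuming the segment materials have already been written out as files in playback order; all file names and the summed duration are hypothetical:

```python
import subprocess

materials = ["non_playback_01.mp4", "playback_01.mp4", "non_playback_02.mp4"]

# Concatenate the segment materials into the material video using the
# ffmpeg concat demuxer (the inputs must share codec parameters).
with open("list.txt", "w") as f:
    for m in materials:
        f.write(f"file '{m}'\n")
subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                "-i", "list.txt", "-c", "copy", "material_video.mp4"],
               check=True)

# Step 105: trim the audio track from its start to the summed duration
# of the materials (assumed here to be 95.0 s), giving the audio material.
subprocess.run(["ffmpeg", "-y", "-i", "match_audio.aac", "-t", "95.0",
                "-c", "copy", "audio_material.aac"], check=True)

# Step 106: synthesize the audio material onto the material video by
# muxing the video stream of the concat result with the trimmed audio.
subprocess.run(["ffmpeg", "-y", "-i", "material_video.mp4",
                "-i", "audio_material.aac",
                "-map", "0:v:0", "-map", "1:a:0", "-c", "copy",
                "target_clip.mp4"], check=True)
```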
The method further comprises: merging the target clip videos according to their occurrence times to obtain a collection video.
The event video clipping method provided by this embodiment supports multiple event types, automatically extracts and retains important playback segments from event videos while accurately removing the playback special-effect frames to shorten the duration, and quickly produces large numbers of videos from massive video resources with high accuracy; it helps users quickly grasp the summary and essence of an event and provides useful support for the clipping, production, sharing and spreading of increasingly popular short videos.
Embodiment 2: this embodiment provides an event video clipping system which, as shown in FIG. 2, includes:
a separation storage module 21, configured to separate and correspondingly store the video track and the audio track of the event video to be processed;
a filtering module 22, configured to filter the playback special-effect frames out of the playback segment to obtain playback segment material;
a recognition and analysis module 23, configured to recognize playback segments and non-playback segments in the video track, and to analyze the audio track corresponding to the non-playback segment to obtain the commentary speech endpoint;
a clipping module 24, configured to clip the non-playback segment according to the commentary speech endpoint to obtain non-playback segment material, and to clip the audio track according to the playback segment material and the non-playback segment material to obtain the audio material;
and a merging module 25, configured to merge the playback segment material and the non-playback segment material to obtain a material video, and to synthesize the audio material onto the material video.
The beneficial effects of the event video clipping system provided in this embodiment for implementing the event video clipping method provided in embodiment 1 are the same as those of the event video clipping method provided in embodiment 1, and are not described herein again.
It should be noted that: in the event video clipping system provided in the above embodiment, when executing an event video clipping method, only the division of the above function modules is used for illustration, and in practical applications, the function distribution may be completed by different function modules according to needs, that is, the internal structure of the device is divided into different function modules to complete all or part of the functions described above. In addition, the event video editing system and the event video editing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
Embodiment 3: this embodiment provides a computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the following steps:
acquiring an event video to be processed;
separating and correspondingly storing a video track and an audio track of the event video to be processed;
identifying playback segments and non-playback segments in the video track;
analyzing the audio track corresponding to the non-playback segment to obtain the commentary speech endpoint, and clipping the non-playback segment according to the commentary speech endpoint to obtain non-playback segment material;
filtering the playback special-effect frames out of the playback segment to obtain playback segment material;
clipping the audio track according to the playback segment material and the non-playback segment material to obtain the corresponding audio material;
merging the playback segment material and the non-playback segment material to obtain a material video, and synthesizing the audio material onto the material video to obtain a target clip video;
and merging the target clip videos according to their occurrence times to obtain a collection video.
The beneficial effects of the computer-readable storage medium provided in this embodiment, which executes the steps of the event video clipping method provided in Embodiment 1, are the same as those of that method and are not repeated here.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be, but is not limited to, a read-only memory, a magnetic or optical disk, and the like.
It should be understood that the above-mentioned embodiments are only illustrative of the technical concepts and features of the present invention, and are intended to enable those skilled in the art to understand the contents of the present invention and implement the present invention, and not to limit the scope of the present invention. All modifications made according to the spirit of the main technical scheme of the invention are covered in the protection scope of the invention.

Claims (8)

1. An event video clipping method, the method comprising:
separating and correspondingly storing a video track and an audio track of the event video to be processed;
identifying playback segments and non-playback segments in the video track, comprising: identifying the event logo picture in the video frames of the video track by using a ResNet50 neural network to classify playback special-effect frames, and recording the classification results in real time;
identifying playback segments and non-playback segments in the classification result;
analyzing the audio track corresponding to the non-playback segment to obtain the commentary speech endpoint, and clipping the non-playback segment according to the commentary speech endpoint to obtain non-playback segment material;
wherein the analyzing to obtain the commentary speech endpoint in the audio track corresponding to the non-playback segment comprises: setting a threshold T for the retained duration range after an event occurs, according to the event type;
analyzing, with a voice endpoint detection method based on short-time energy and short-time average zero-crossing rate, the endpoint of the commentator's speech that is closest to the boundary of the duration-range threshold T in the corresponding audio track;
filtering the playback special-effect frames out of the playback segment to obtain playback segment material;
clipping the audio track according to the playback segment material and the non-playback segment material to obtain the corresponding audio material;
and merging the playback segment material and the non-playback segment material to obtain a material video, and synthesizing the audio material onto the material video to obtain a target clip video.
2. The event video clipping method according to claim 1, wherein the clipping the non-playback segment according to the commentary speech endpoint to obtain non-playback segment material specifically comprises:
clipping the non-playback segment with the commentary speech endpoint as the segment end time point to obtain the non-playback segment material.
3. The event video clipping method according to claim 1, wherein the clipping the audio track according to the playback segment material and the non-playback segment material to obtain the corresponding audio material specifically comprises:
clipping the audio track from its starting position for a duration equal to the sum of the durations of the non-playback segment material and the playback segment material to obtain the audio material.
4. An event video clipping method as claimed in claim 1, further comprising, before said separating and correspondingly storing the video track and the audio track of the event video to be processed:
acquiring an event video to be processed; the method specifically comprises the following steps:
acquiring N video frames in a video stream of an event to be processed and a display timestamp corresponding to each video frame;
identifying an event time identifier in each video frame picture in the N video frames, and performing first time axis matching on the event time identifier and a display time stamp to position an effective event video;
analyzing the event data corresponding to the event video stream to be processed to obtain structured data for all events, wherein the structured data includes the occurrence time of each event in the match, and performing second time axis matching between the occurrence times and the display time stamps;
acquiring, from all events, a plurality of associated events that form any target event, determining the start display time stamp and the end display time stamp of the target event according to the start and end points of the plurality of associated events, locating and extracting all video frames of the target event in the effective event video, and clipping and encoding them into the event video to be processed.
5. The event video clipping method according to claim 4, wherein the identifying the event time identifier in each of the N video frames, and performing a first time axis matching between the event time identifier and a display time stamp to locate a valid event video specifically comprises:
and identifying and verifying the event time identifier in each video frame picture in the N video frames by utilizing a deep learning algorithm based on the Faster R-CNN neural network through an AI identification module, and performing first time axis matching on the verified event time identifier and a display time stamp to position an effective event video.
6. The event video clipping method as claimed in claim 1, the method further comprising: merging the target clip videos according to their occurrence times to obtain a collection video.
7. An event video clipping system, the system comprising:
the separation storage module is used for separating and correspondingly storing a video track and an audio track of the event video to be processed;
the recognition and analysis module is used for recognizing playback segments and non-playback segments in the video track and analyzing the audio track corresponding to the non-playback segment to obtain the commentary speech endpoint;
the recognition and analysis module comprises: a classification recording unit, for identifying the event logo picture in the video frames of the video track by using a ResNet50 neural network to classify playback special-effect frames, and recording the classification results in real time;
a recognition unit, for identifying playback segments and non-playback segments from the classification results;
a setting unit, for setting a threshold T for the retained duration range after an event occurs, according to the event type;
an analysis unit, for analyzing, with a voice endpoint detection method based on short-time energy and short-time average zero-crossing rate, the endpoint of the commentator's speech that is closest to the boundary of the duration-range threshold T in the corresponding audio track;
the filtering module is used for filtering the playback special-effect frames out of the playback segment to obtain playback segment material;
the clipping module is used for clipping the audio track according to the playback segment material and the non-playback segment material to obtain the corresponding audio material;
and the merging module is used for merging the playback segment material and the non-playback segment material to obtain a material video, and synthesizing the audio material onto the material video to obtain a target clip video.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202010493124.3A 2020-06-03 2020-06-03 Event video clipping method, system and computer readable storage medium Active CN111770359B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010493124.3A CN111770359B (en) 2020-06-03 2020-06-03 Event video clipping method, system and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010493124.3A CN111770359B (en) 2020-06-03 2020-06-03 Event video clipping method, system and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111770359A CN111770359A (en) 2020-10-13
CN111770359B (en) 2022-10-11

Family

ID=72720600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010493124.3A Active CN111770359B (en) 2020-06-03 2020-06-03 Event video clipping method, system and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111770359B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113515997B (en) * 2020-12-28 2024-01-19 腾讯科技(深圳)有限公司 Video data processing method and device and readable storage medium
CN112839236A (en) * 2020-12-31 2021-05-25 北京达佳互联信息技术有限公司 Video processing method, device, server and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017177859A1 (en) * 2016-04-11 2017-10-19 腾讯科技(深圳)有限公司 Video playing method and device, and computer readable storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101599179B (en) * 2009-07-17 2011-06-01 北京邮电大学 Method for automatically generating highlight collections of exciting scenes in field sports
US20120017153A1 (en) * 2010-07-15 2012-01-19 Ken Matsuda Dynamic video editing
US20160037217A1 (en) * 2014-02-18 2016-02-04 Vidangel, Inc. Curating Filters for Audiovisual Content
CN109194978A (en) * 2018-10-15 2019-01-11 广州虎牙信息科技有限公司 Live video clipping method, device and electronic equipment
CN109862388A * 2019-04-02 2019-06-07 网宿科技股份有限公司 Method, device, server and storage medium for generating live video highlight collections
CN110188241B (en) * 2019-06-04 2023-07-25 成都索贝数码科技股份有限公司 Intelligent manufacturing system and manufacturing method for events
CN110012348B * 2019-06-04 2019-09-10 成都索贝数码科技股份有限公司 Automatic highlight-collection system and method for event programs
CN110087097B (en) * 2019-06-05 2021-08-03 西安邮电大学 Method for automatically removing invalid video clips based on electronic endoscope
CN110572722B (en) * 2019-09-26 2021-04-16 腾讯科技(深圳)有限公司 Video clipping method, device, equipment and readable storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017177859A1 (en) * 2016-04-11 2017-10-19 腾讯科技(深圳)有限公司 Video playing method and device, and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sawitchaya Tippaya; Suchada Sitjongsataporn; Tele Tan; Masood Me. Multi-Modal Visual Features-Based Video Shot Boundary Detection. IEEE Access, 2017, vol. 5. *

Also Published As

Publication number Publication date
CN111770359A (en) 2020-10-13

Similar Documents

Publication Publication Date Title
CN102547141B (en) Method and device for screening video data based on sports event video
CN106162223B (en) News video segmentation method and device
Hanjalic Adaptive extraction of highlights from a sport video based on excitement modeling
AU2021200219A1 (en) System and method for creating and distributing multimedia content
US8195038B2 (en) Brief and high-interest video summary generation
CN102222103B (en) Method and device for processing matching relationship of video content
CN111770359B (en) Event video clipping method, system and computer readable storage medium
CN108235141A (en) Live video turns method, apparatus, server and the storage medium of fragmentation program request
CN111757147B (en) Method, device and system for event video structuring
US9426411B2 (en) Method and apparatus for generating summarized information, and server for the same
CN104185088B (en) A kind of method for processing video frequency and device
CN103165151A (en) Method and device for playing multi-media file
KR20200023013A (en) Video Service device for supporting search of video clip and Method thereof
Merler et al. Automatic curation of golf highlights using multimodal excitement features
CN104320670A (en) Summary information extracting method and system for network video
CN106210773B (en) The method and system of barrage are played in local video
CN114845149A (en) Editing method of video clip, video recommendation method, device, equipment and medium
CN114782879B (en) Video identification method and device, computer equipment and storage medium
CN114339451B (en) Video editing method, device, computing equipment and storage medium
CN111741333B (en) Live broadcast data acquisition method and device, computer equipment and storage medium
CN111615008B (en) Intelligent abstract generation and subtitle reading system based on multi-device experience
US8214854B2 (en) Method and system for facilitating analysis of audience ratings data for content
CN115022663A (en) Live stream processing method and device, electronic equipment and medium
Liu et al. Brief and high-interest video summary generation: evaluating the AT&T labs rushes summarizations
Wang et al. Automatic Set List Identification and Song Segmentation for Full-Length Concert Videos.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210000

Patentee after: Jiangsu Suning cloud computing Co.,Ltd.

Country or region after: China

Address before: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210000

Patentee before: Suning Cloud Computing Co.,Ltd.

Country or region before: China

CP03 Change of name, title or address
TR01 Transfer of patent right

Effective date of registration: 20240709

Address after: Room 3104, Building A5, No. 3 Gutan Avenue, Economic Development Zone, Gaochun District, Nanjing City, Jiangsu Province, 210000

Patentee after: Jiangsu Biying Technology Co.,Ltd.

Country or region after: China

Address before: No.1-1 Suning Avenue, Xuzhuang Software Park, Xuanwu District, Nanjing, Jiangsu Province, 210000

Patentee before: Jiangsu Suning cloud computing Co.,Ltd.

Country or region before: China

TR01 Transfer of patent right