CN113542865B - Video editing method, device and storage medium - Google Patents

Video editing method, device and storage medium

Info

Publication number
CN113542865B
CN113542865B (application CN202011559704.4A)
Authority
CN
China
Prior art keywords
video
event
target
text information
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011559704.4A
Other languages
Chinese (zh)
Other versions
CN113542865A (en)
Inventor
赵天昊
田思达
袁微
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011559704.4A priority Critical patent/CN113542865B/en
Publication of CN113542865A publication Critical patent/CN113542865A/en
Application granted granted Critical
Publication of CN113542865B publication Critical patent/CN113542865B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application provides a video editing method, a video editing device and a computer-readable storage medium, which can automatically generate video segments from a video, saving labor cost and reducing human error. The method comprises the following steps: extracting a plurality of video frames from a video to be clipped; inputting the plurality of video frames into a first convolution network model for feature extraction to obtain image features of the plurality of video frames; inputting the image features of the plurality of video frames into a time sequence action segmentation network model to obtain the start and stop times of a target event; inputting the video frames corresponding to the start and stop times of the target event into a second convolution network model to obtain the event type of the target event; acquiring, according to the event type of the target event, text information in the video frames corresponding to the target event to generate label information of the target event; and clipping the video to be clipped according to the start and stop times of the target event, and assigning the label information of the target event to the video segment corresponding to the target event to obtain the target video segment corresponding to the video to be clipped.

Description

Video editing method, device and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a video clipping method, a video clipping device and a storage medium.
Background
Game videos are usually long and are not suitable for direct sharing, so the highlight segments in a game video need to be clipped into a highlight collection for sharing.
In the related art, producing such highlight collections requires manually identifying the highlight moments and then clipping and post-production, which is time-consuming and labor-intensive. For fast-paced game videos it is difficult to identify directly from the picture whether a kill has been completed, and for games with complex scenes the numerous labels such as maps, weapons and characters are difficult to annotate. How to clip game videos quickly and accurately is therefore a problem that urgently needs to be solved.
Disclosure of Invention
The application provides a video editing method, a video editing device and a storage medium, which can automatically edit a video to be edited to generate a target video segment, save labor cost and reduce human errors.
In a first aspect, the present application provides a video clipping method, comprising:
extracting a plurality of video frames in a video to be edited;
inputting the video frames into a first convolution network model for feature extraction to obtain image features of the video frames;
inputting the image characteristics of the video frames into a time sequence action segmentation network model to obtain the starting and ending time of a target event in the video to be edited; inputting the video frames corresponding to the start-stop time of the target event into a second convolution network model for event classification to obtain the event type of the target event;
acquiring text information in a video frame corresponding to the target event according to the event type of the target event, and generating label information of the target event;
and according to the starting and ending time of the target event in the video to be clipped, clipping the video to be clipped, and giving the tag information of the target event to the video segment corresponding to the target event to obtain the target video segment corresponding to the video to be clipped.
In a second aspect, the present application provides a video clipping device comprising:
the frame extracting module is used for extracting a plurality of video frames in a video to be edited;
the characteristic extraction module is used for extracting image characteristics of the plurality of video frames through a first convolution network model;
the time sequence action segmentation module is used for inputting the image characteristics of the video frames into a time sequence action segmentation network model and outputting the start and stop time of a target event in the video to be edited;
the event classification module is used for performing event classification on the video frames corresponding to the start and stop time of the target event through a second convolutional network model and outputting the event type of the target event;
the tag generation module is used for acquiring text information in a video frame corresponding to the target event according to the event type of the target event and generating tag information of the target event;
and the clipping module is used for clipping the video to be clipped according to the starting and ending time of the target event in the video to be clipped, and assigning the tag information of the target event to the video segment corresponding to the target event to obtain the target video segment corresponding to the video to be clipped.
In a third aspect, there is provided a video clipping device comprising: a processor and a memory, the memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of the first aspect.
In a fourth aspect, there is provided a computer readable storage medium for storing a computer program for causing a computer to perform the method of the first aspect.
According to the technical scheme, a plurality of video frames in the video to be edited can be input into the convolution network model and the time sequence action segmentation network model, the video to be edited is automatically edited into the video segment corresponding to the target event through the network model, event classification and labeling are carried out on the target event, the target video segment corresponding to the video to be edited is further automatically generated, and compared with manual editing operation, a large amount of labor cost is saved and the influence of human errors on an editing result is reduced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application.
Fig. 2 is a flowchart of a video clipping method according to an embodiment of the present application.
Fig. 3 is a flowchart of another video clipping method according to an embodiment of the present application.
FIG. 4 is a diagram illustrating an example of a match event.
FIG. 5 is a schematic diagram of an example backpack selection event.
FIG. 6 is a schematic diagram of an example of a bullet change event.
FIG. 7 is a diagram illustrating an example of a multi-badge kill event.
Fig. 8 is a schematic diagram showing an example of a character animation event.
Fig. 9 is a schematic diagram of a training flow of the RGB convolutional network model.
Fig. 10 is a schematic diagram of a training flow of the time-series action segmentation network model.
Fig. 11 is a schematic diagram of a video editing apparatus according to an embodiment of the present application.
FIG. 12 is a schematic block diagram of another video clipping device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
It should be noted that the terms "first", "second", and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between different objects and not for describing a particular sequential or chronological order. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In some videos, such as shooting-game videos, the game scenes are complex: there are many labels, such as weapons, maps and characters, which makes annotation difficult; moreover, the pace of such games is generally fast, so it is difficult to identify from the picture whether a kill has been completed.
It should be understood that the video to be clipped in the embodiment of the present application may be any video, for example, a game video, such as a shooting game, a ball game, etc., or may also be a clip of other multimedia works, etc., and the following description will take an example of clipping a game video, but the present application is not limited thereto.
In view of the above, the present application provides a video clipping method, which can automatically generate a collection of video segments of target events from a video to be clipped through multi-modal techniques (e.g., an event classification technique, a temporal action segmentation technique, and an Optical Character Recognition (OCR) technique), thereby saving labor cost and reducing human error.
Optionally, the technical solution of the present application may be applied to the following application scenarios, but is not limited thereto: as shown in fig. 1, the apparatus 110 may upload a video to be clipped or a Uniform Resource Locator (URL) of the video to be clipped to the apparatus 120, so that the apparatus 120 clips the video to be clipped to generate a target video segment.
Alternatively, the device 110 may upload the video to be clipped or the URL of the video to be clipped through the Web interface.
In some embodiments, the device 110 may upload the URL of the video stream to be clipped, and the device 120 segments the received video stream to be clipped, and further processes the segmented video stream to be clipped by using a multi-modal technique to obtain the target video segment. Specifically, the device 120 may receive a clipping request from the device 110, further obtain the video to be clipped based on the clipping request, and process the video to be clipped to obtain the target video segment.
Optionally, after the device 120 generates the target video segment, the target video segment or the URL of the target video segment may be transmitted to the device 110 for the user to view the target video segment.
It should be understood that the application scenario shown in fig. 1 is exemplified by including one apparatus 110 and one apparatus 120, and in fact, other numbers of apparatuses 110 may be further included, and other data transmission devices may be further included between the apparatuses 110 and 120, which is not limited in this application.
Alternatively, in the present application, the device 110 may be a game machine, a mobile phone, a tablet Computer, a notebook Computer, a Personal Computer (PC), or the like, which is not limited in the present application.
Optionally, in this embodiment of the present application, the apparatus 120 may be a terminal device or a server, which is not limited in this application.
Embodiments of the present application relate to Computer Vision technology (CV) in Artificial Intelligence (AI).
AI is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence base technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
CV is a science that studies how to make machines 'see'. Specifically, it refers to using cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking and measurement on targets, and to further perform image processing so that the processed image is more suitable for human eyes to observe or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.
The technical solution of the present application will be explained in detail with reference to fig. 2 to 12.
Fig. 2 and fig. 3 are flowcharts of a video clipping method provided in an embodiment of the present application, and a main execution body of the method may be a video clipping apparatus, and the video clipping apparatus may be apparatus 120 in fig. 1, for example, a terminal device or a server, and the like, but is not limited thereto.
In the following, embodiments of the present application are described from the perspective of a video clipping device, as shown in fig. 2 and 3, the method comprising the steps of:
s210, extracting a plurality of video frames in the video to be clipped.
Alternatively, the video clipping apparatus may acquire the video to be clipped or the URL of the video to be clipped from another device. If the video clipping device acquires the URL of the video to be clipped, the video clipping device may acquire the video to be clipped from a local or cloud server according to the URL.
Optionally, in some embodiments of the present application, the method further comprises:
receiving a clipping request sent by a client;
and acquiring the video to be clipped according to the clipping request.
For example, in some scenarios, when a video clip is needed, a user may upload a video to be clipped through a client and send a clip request to a video clipping device, and the video clipping device acquires the video to be clipped when receiving the clip request. After finishing the clipping of the video to be clipped, the clipped target video segment can be sent to one or more corresponding clients.
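As a rough illustration of this request/response flow, the sketch below assumes an HTTP/JSON interface built with Flask; the endpoint name, the JSON fields and the clip_video() helper are hypothetical and are not specified by the patent.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def clip_video(video_url):
    # Placeholder for the pipeline described below (frame extraction, feature
    # extraction, temporal segmentation, event classification, OCR labelling,
    # clipping); returns a list of target-segment descriptors.
    return []

@app.route("/clip", methods=["POST"])
def handle_clip_request():
    body = request.get_json()                  # clipping request from the client
    segments = clip_video(body["video_url"])   # obtain and process the video to be clipped
    return jsonify({"segments": segments})     # return the target video segments (or their URLs)
```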
Optionally, in some embodiments, the video to be clipped may be a game video, and the game video may be a complete game video, or may also be a segment of the complete game video, which is not limited in this application. More specifically, the game video may be an on-demand game video.
Optionally, the video clipping device may perform frame extraction on the video to be clipped at a first time interval to obtain a plurality of video frames of the video to be clipped. For example, the video clipping device may extract 10 consecutive video frames every 1 second. Optionally, the first time interval may be preset, and may be fixed or dynamically adjusted, which is not limited in this application.
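A minimal frame-extraction sketch follows; using OpenCV is an assumption (the patent does not name a library), and the sampling step simply realizes the first time interval described above.

```python
import cv2

def extract_frames(video_path, interval_sec=1.0):
    # Sample one frame per `interval_sec` from the video to be clipped.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0          # fall back if FPS is unavailable
    step = max(1, int(round(fps * interval_sec)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                          # keep one frame per interval
            frames.append(frame)                     # BGR numpy array
        idx += 1
    cap.release()
    return frames
```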
In S220, the plurality of video frames are input to the first convolutional network model for feature extraction, and image features of the plurality of video frames are extracted.
Alternatively, the first convolution network model may be an RGB convolution network model for extracting image features, such as color features, of the input video frame, and the training manner of the first convolution network model is described below.
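For illustration, the sketch below extracts per-frame image features with a generic RGB CNN backbone; the concrete architecture (a torchvision ResNet-50) is an assumption, since the patent only requires a first convolution network model.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()      # drop the classifier head, keep 2048-d features
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):
    # `frames` are RGB images; convert BGR->RGB first if they come from OpenCV.
    batch = torch.stack([preprocess(f) for f in frames])   # (T, 3, 224, 224)
    return backbone(batch)                                  # (T, 2048) image features
```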
S230, inputting the image features of the plurality of video frames into a time sequence action segmentation (Temporal Action Segmentation) network model to obtain the start and stop times of the target event in the video to be clipped.
It should be understood that the specific content of the target event is not limited in the embodiments of the present application, and may vary according to the type of the video to be clipped, for example, if the video to be clipped is a movie video, the target event may be a highlight or a climax part in the video, or, if the video to be clipped is a game video, the target event may include a highlight in the game video, for example, a goal segment in a soccer video, a killing segment in a shooting game, and the like.
Optionally, in some embodiments, the video to be clipped is a game video, and the target event may include a first type event and a second type event, for example, the first type event is an event in a preparation process of a battle, and the second type event is an event in a battle process.
In this embodiment of the present application, the video frames corresponding to the first type of event may be referred to as key frames, and the video frames corresponding to the second type of event may be referred to as highlight segments. Accordingly, the start-stop time of the key frame and the highlight in the game video may be obtained in S230.
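How the per-frame output of the segmentation network is turned into start and stop times is not spelled out in the patent; the sketch below shows one common post-processing assumption, merging consecutive frames that carry the same non-background label into (event, start, end) segments.

```python
def frames_to_segments(frame_labels, interval_sec=1.0, background=0):
    # frame_labels: per-frame event predictions from the segmentation network.
    segments, start, current = [], None, background
    for i, label in enumerate(frame_labels + [background]):   # sentinel flushes the last run
        if label != current:
            if current != background:
                segments.append((current, start * interval_sec, i * interval_sec))
            start, current = i, label
    return segments

# e.g. frames_to_segments([0, 2, 2, 2, 0, 5, 5]) -> [(2, 1.0, 4.0), (5, 5.0, 7.0)]
```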
It should be understood that, the embodiments of the present application are not limited to the specific events included in the first type of event and the second type of event.
By way of example and not limitation, the key frames may include, for example, video frames corresponding to events such as battle matches and weapon backpack selections as shown in table 1.
TABLE 1
Key frame | Label
Weapon backpack selection | Weapon-skin name, character-skin name
Battle matching | Map name
That is, the first type of event may include weapon backpack selection, battle matching, etc.
Fig. 4-5 show exemplary screens of video frames corresponding to events of fight matching, weapon backpack selection, etc. in a game video.
By way of example and not limitation, the highlight clips may include, for example, video frames corresponding to events such as multi-badge kills, bullet changes, character animations, and character wins or character failures, as illustrated in Table 2.
TABLE 2
(Table 2 is provided as an image in the original publication; it lists the highlight-segment event types, such as multi-badge kill, bullet change, character animation and character win/failure, together with their corresponding labels, e.g., the weapon name, skill name and map name involved.)
That is, the second type of event may include multi-badge kills, bullet changes, character animations, etc.
Fig. 6 to 8 show exemplary screens of video frames corresponding to events such as a bullet change, a multi-badge kill, a character animation, and the like in a game video.
It should be understood that the above exemplary description describes specific examples of key frames and highlights, and those skilled in the art will recognize that this description is merely exemplary and is not intended to limit the scope of the embodiments of the present application.
S240, inputting the video frames corresponding to the start and stop time of the target event into a second convolution network model for event classification to obtain the event type of the target event.
For example, the video frames corresponding to the start and stop times of the key frames and the highlight segments in the plurality of video frames are input into the second convolution network model, and the key frames and the highlight segments are subjected to event classification to obtain event types of the key frames and the highlight segments.
It should be understood that the embodiment of the present application is not limited to the event types corresponding to the key frames and the highlight segments.
By way of example, a key frame may be a video frame corresponding to an event such as map selection, character selection or weapon selection, and a highlight segment may be video frames corresponding to an event such as a multi-badge kill, a bullet change, a character animation, and the like.
Optionally, the second convolutional network model may be an RGB convolutional network model.
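A sketch of S240 under simple assumptions: the second RGB convolutional network scores every frame inside the segment, and the segment's event type is taken by majority vote over the per-frame predictions (the voting rule is illustrative and not stated in the patent).

```python
import torch
from collections import Counter

@torch.no_grad()
def classify_segment(classifier, preprocess, segment_frames):
    # classifier: the second convolutional network (N event types + background).
    batch = torch.stack([preprocess(f) for f in segment_frames])
    logits = classifier(batch)                       # (T, N+1) per-frame scores
    per_frame = logits.argmax(dim=1).tolist()
    return Counter(per_frame).most_common(1)[0][0]   # most frequent predicted type
```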
With reference to fig. 9, the second convolutional network model is taken as an RGB convolutional network model as an example to describe the training method of the RGB convolutional network model.
S301, frame extraction is carried out on the training video to obtain a plurality of video frames.
For example, the training video is decimated at certain time intervals to obtain the plurality of video frames.
If the method is applied to editing of game videos, the training videos may include game video data including various events in various games, for example, to implement automatic editing of shooting-type games, the training videos may be various shooting-type game video data, and in this case, the RGB convolutional network model may be used to edit different shooting-type game videos. Alternatively, the training video may also include game video data of various types of events in a particular game, for example, a video of a large number of users playing the game may be used as the training video, in which case, the RGB convolutional network model may be used to clip the game video of the particular game.
S302, labeling the event types of the target events in the video frames to form a training set of the RGB convolutional network model.
For example, the event types of the key frames and the highlight segments in the plurality of video frames are labeled to form a training set for the RGB convolutional network model. For example, if N event types plus the background are labeled, the RGB convolutional network is an (N+1)-class classifier used to distinguish whether a video frame belongs to one of the N event types or to the background.
For example, the event types corresponding to the key frames and the highlights are labeled according to the examples in table 1 and table 2.
Optionally, in some embodiments, the badge area can be cropped out of the video frames of kill events separately to form a training set, and during training it is determined from the badge area whether a multi-badge kill event has occurred.
S303, inputting the marked video frames into an RGB convolutional network for training to obtain model parameters of the RGB convolutional network.
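The training of S301-S303 could look roughly like the sketch below, assuming frame-level cross-entropy over N event types plus a background class; the data loader and hyper-parameters are placeholders, not values from the patent.

```python
import torch
import torch.nn as nn
import torchvision.models as models

def train_rgb_classifier(train_loader, num_event_types, epochs=10, lr=1e-4):
    # train_loader yields (frames, labels) from the labelled training set of S302.
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, num_event_types + 1)   # N+1 classes
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for frames, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(frames), labels)
            loss.backward()
            optimizer.step()
    return model                      # model parameters of the RGB convolutional network
```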
The training method of the time sequence action segmentation network model is described with reference to fig. 10.
It should be understood that the RGB convolutional network model may serve as the backbone network of the time sequence action segmentation network model for feature extraction, and the extracted features are further input into the time sequence action segmentation network model for training. The result of feature extraction therefore also affects the output of the time sequence action segmentation network model, so the model parameters of the RGB convolutional network model may be adjusted according to the output of the time sequence action segmentation network model, allowing the two to cooperate to output the optimal result.
S401, marking the starting and ending time of the target event in the training video.
In some embodiments, if the time series action segmentation model is used to clip game videos, the start and stop times of key frames and highlights in the training videos may be labeled.
If the method is applied to the clipping of live game videos, the training videos can comprise game video data of various events from various major game platforms, in which case the time sequence action segmentation network model can be used to clip game videos of different games; alternatively, the training videos can comprise game video data of various events of a specific game, in which case the time sequence action segmentation network model can be used to clip game videos of that specific game.
S402, extracting frames of the marked training video at fixed time intervals, inputting the extracted video frames into an RGB (red, green and blue) convolution network model, outputting image characteristics of the video frames, and forming a training set of the time sequence action segmentation network.
Optionally, the fixed time interval may be preset, may be fixed, and may also be dynamically adjusted, which is not limited in this application.
S403, training the time sequence action segmentation network by using the training set obtained in S402 to obtain the model parameters of the time sequence action segmentation network.
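The patent does not fix the segmentation architecture; as an assumption, the sketch below stacks dilated 1-D temporal convolutions (in the spirit of MS-TCN) over the per-frame image features produced by the RGB network, and is trained with frame-wise cross-entropy against the labelled start-stop times.

```python
import torch
import torch.nn as nn

class TemporalSegmenter(nn.Module):
    def __init__(self, feat_dim=2048, hidden=64, num_classes=8, layers=8):
        super().__init__()
        self.inp = nn.Conv1d(feat_dim, hidden, 1)
        self.blocks = nn.ModuleList(
            nn.Conv1d(hidden, hidden, 3, padding=2 ** i, dilation=2 ** i)
            for i in range(layers))
        self.out = nn.Conv1d(hidden, num_classes, 1)

    def forward(self, feats):                 # feats: (batch, T, feat_dim)
        x = self.inp(feats.transpose(1, 2))   # -> (batch, hidden, T)
        for conv in self.blocks:
            x = x + torch.relu(conv(x))       # residual dilated temporal convolution
        return self.out(x)                    # per-frame class logits (batch, classes, T)

# Training sketch: loss = nn.CrossEntropyLoss()(model(feats), frame_labels),
# where frame_labels (batch, T) are derived from the labelled start-stop times.
```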
So far, the start-stop time and the event type of the target event in the video to be clipped can be determined through the steps. Further, with continued reference to fig. 2 and fig. 3, the tag information corresponding to the target event can be determined and the output of the video clip can be performed through the following steps.
And S250, acquiring text information in the video frame corresponding to the target event according to the event type of the target event, and generating label information of the target event.
For example, according to the event type of the key frame, text information in a video frame corresponding to the key frame may be acquired, and tag information of the key frame may be generated.
For another example, the text information in the video frame corresponding to the highlight is obtained according to the event type of the highlight, and the label information of the highlight is generated.
In some embodiments of the present application, as shown in fig. 3, S250 may include three steps of character recognition, text combination and text matching in S251 to S253.
In the following, the three steps are respectively described by taking the determination of the text information in the key frame and the highlight as an example.
1. Text recognition:
in some embodiments, the video frames corresponding to the key frames may be decimated at different frame rates according to the event type of the key frames. For example, the frame extraction is performed at a higher frame rate for the key frames with shorter occurrence time, and at a lower frame rate for the key frames with longer occurrence time.
In some embodiments, the video frames corresponding to the highlight are also decimated at different frame rates according to the event type of the highlight. For example, the frame extraction is performed at a higher frame rate for the highlight with a shorter occurrence time, and at a lower frame rate for the highlight with a longer occurrence time.
In other embodiments, the frames may be decimated at different frame rates according to the importance of the key frames, for example, decimating at a higher frame rate for important events, such as a knapsack selection event, decimating at a lower frame rate for less important events, and so on.
In other embodiments, the frames may be extracted at different frame rates according to the importance of the highlight, for example, the frames are extracted at a higher frame rate for an important event, such as a character animation event, at a lower frame rate for an event with lower importance, and the like.
In some embodiments, a specific region in the extracted video frames corresponding to the key frame may be cropped out according to the event type of the key frame to obtain the text information in that region. For example, as shown in fig. 4 and 5, the text information in key frames of different event types usually appears in different regions. Therefore, to extract the text information in key frames of different event types, the specific region where the text information appears can be cropped out so that the text information in that region can be extracted.
In some embodiments, a specific region in the extracted video frames corresponding to the highlight segment may likewise be cropped out according to the event type of the highlight segment to obtain the text information in that region. For example, as shown in fig. 6 to 8, the text information in highlight segments of different event types usually appears in different regions. Therefore, to extract the text information in highlight segments of different event types, the specific region can be cropped out so that the text information in that region can be extracted.
Further, an image of a specific area in the video frame corresponding to the key frame and the highlight can be input to an OCR module, and text information in the image and coordinate information of a text box corresponding to the text information can be identified. In some alternative implementations, only the images including important textual information in the extracted video frames may be input to the OCR module for recognition.
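As an illustration only, the sketch below uses Tesseract (via pytesseract) in place of the OCR module to read a cropped, event-type-specific region and return the text together with its text-box coordinates; the region tuple is an assumed input.

```python
import pytesseract
from pytesseract import Output

def ocr_region(frame, region):
    # region = (x, y, w, h): the event-type-specific area that contains text.
    x, y, w, h = region
    crop = frame[y:y + h, x:x + w]
    data = pytesseract.image_to_data(crop, output_type=Output.DICT)
    boxes = []
    for i, text in enumerate(data["text"]):
        if text.strip():
            boxes.append({
                "text": text,
                "box": (data["left"][i], data["top"][i],
                        data["width"][i], data["height"][i]),   # text-box coordinates
            })
    return boxes
```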
2. Text merging:
in some cases, the OCR module may recognize a complete sentence in the video frame as multiple phrases or words, or recognize a phrase as multiple words, that is, the text information that should belong to a text box is divided into multiple text boxes.
In some embodiments, it may be determined whether two text boxes belong to the same phrase or sentence according to information such as a horizontal distance, a vertical height, or a vertical overlapping range between the text boxes.
Alternatively, the lateral spacing of the text boxes may refer to the horizontal distance between two text boxes, i.e., the shortest distance between the vertical edges of two text boxes.
Alternatively, the vertical height of the text box may refer to the length of the vertical side of the text box.
Alternatively, the longitudinal overlap range between text boxes may refer to the overlap between the vertical extents (the vertical edges) of the two text boxes.
By way of example and not limitation, two text boxes that satisfy the following condition may be merged into one text box:
the transverse distance between the two text boxes is smaller than a first threshold value;
the vertical height difference of the two text boxes is smaller than a second threshold value;
the longitudinal coincidence range of the two text boxes is larger than a third threshold value.
By way of example and not limitation, the first threshold may be, for example, the vertical height of the shorter of the two text boxes.
By way of example and not limitation, the second threshold may be, for example, 25% of the vertical height of the shorter of the two text boxes.
By way of example and not limitation, the third threshold may be, for example, 75% of the vertical height of the shorter of the two text boxes.
Further, the text information in the combined text box can be spliced to obtain complete text information.
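A sketch of this merging rule, using the example thresholds given above (the shorter box's height, 25% of it and 75% of it); boxes are assumed to be (x, y, width, height) tuples.

```python
def should_merge(box_a, box_b):
    (ax, ay, aw, ah), (bx, by, bw, bh) = box_a, box_b
    shorter = min(ah, bh)                                  # height of the shorter box
    gap = max(bx - (ax + aw), ax - (bx + bw), 0)           # lateral spacing
    height_diff = abs(ah - bh)                             # vertical height difference
    overlap = max(0, min(ay + ah, by + bh) - max(ay, by))  # longitudinal overlap
    return (gap < shorter                    # first threshold
            and height_diff < 0.25 * shorter # second threshold
            and overlap > 0.75 * shorter)    # third threshold

def merge_text(box_a, text_a, box_b, text_b):
    # Concatenate left-to-right to splice the complete text information.
    return text_a + text_b if box_a[0] <= box_b[0] else text_b + text_a
```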
3. Text matching:
in some embodiments of the present application, the complete text information and the entries in the phrase dictionary may be integrally matched to determine the target text information.
If the video to be clipped is a game video, the phrase dictionary may include entries of various tags in the game, or entries of tag combinations, such as weapon names, character names, map names, etc. When the video to be edited is a video of other types, the phrase dictionary may also include entries of other contents, which is not limited in this application.
For example, a first text edit distance between the complete text information and an entry in the phrase dictionary may be calculated, and the target text information matched with the complete text information in the phrase dictionary may be determined according to the first text edit distance.
It should be understood that the text edit distance between two strings refers to the minimum number of editing operations required to convert one string into the other; the larger the distance, the more different the two strings are.
For example, the number of operations required to convert abc to bcd is: delete a, insert d, i.e. text edit distance is 2.
Specifically, the complete text information may be matched with a plurality of entries in a word group dictionary, a first text edit distance between the complete text information and each entry in the plurality of entries may be determined, and a target entry matched with the complete text information, that is, target text information, may be further determined according to the first text edit distance between the complete text information and each entry.
In some embodiments, the entry for which the first text editing distance is the smallest and less than or equal to the fourth threshold may be selected as the target entry.
By way of example, and not limitation, the fourth threshold may be, for example, 1, or 2.
In some scenarios, the phrase dictionary does not necessarily include every combination of labels. Assuming the weapon names include A and B and the skin names include 1 and 2, weapons and skins may be combined arbitrarily, i.e., the possible combinations are A-1, A-2, B-1 and B-2. If the detected complete text information is A-2 but the phrase dictionary only includes A-1 and B-2, then matching the complete text information as a whole against the entries in the phrase dictionary yields an inaccurate result.
In other embodiments of the present application, each word in the complete text information may be independently matched with the phrase dictionary to determine the target text information.
In some embodiments, the text edit distance between each word in the complete text information and the entries in the phrase dictionary may be calculated to obtain a second text edit distance between the complete text information and the entries in the phrase dictionary, and further, the target text information matched with the complete text information in the phrase dictionary may be determined according to the second text edit distance.
In some embodiments, the entry whose second text editing distance is the smallest and less than or equal to the fifth threshold may be selected as the target entry.
By way of example, and not limitation, the fifth threshold may be, for example, 1, or 2.
In another embodiment of the present application, the target text information matched with the complete text information may be determined according to a first text editing distance between the complete text information and an entry in the phrase dictionary and a second text editing distance between the complete text information and an entry in the phrase dictionary.
Optionally, the entry corresponding to the smaller value of the first text editing distance and the second text editing distance is used as the target text information matched with the complete text information.
As an example, the complete text information in the key frame corresponding to a weapon backpack selection event may be matched as a whole against the entries in the weapon, character and skin dictionaries to obtain a first matching result; alternatively, the individual words in that complete text information may be matched against the weapon, character and skin dictionaries to obtain a second matching result. By comparing the first matching result with the second matching result, the result with the higher matching degree (i.e., the lower text edit distance) can be selected. In this way, combinations such as weapon-skin or character-skin that are not entered in the phrase dictionary can still be detected.
Continuing the previous example, assume that the complete text information is A-2 and the phrase dictionary includes A-1 and B-2. Matching the complete text information A-2 as a whole against the entries in the phrase dictionary gives a first text edit distance of 1, while matching each word in A-2 independently against the phrase dictionary gives a second text edit distance of 0; the target text information corresponding to the complete text information A-2 is therefore determined to be A-2.
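The two matching modes can be sketched as follows. Interpreting "independent matching" as matching each word against the individual label entries (weapon names, skin names, etc.) rather than against the combined entries is an assumption that reproduces the A-2 example above; the distance threshold follows the example values (1 or 2) given in the text.

```python
def edit_distance(a, b):
    # Classic Levenshtein distance: minimum insert/delete/substitute operations.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def match_text(complete_text, combined_entries, label_entries, max_dist=2):
    whole_entry, whole_dist = min(
        ((e, edit_distance(complete_text, e)) for e in combined_entries),
        key=lambda kv: kv[1])                                  # first text edit distance
    word_dist = sum(min(edit_distance(w, e) for e in label_entries)
                    for w in complete_text.split("-"))         # second text edit distance
    if min(whole_dist, word_dist) > max_dist:
        return None                                            # no close enough match
    # Keep the result of whichever matching mode yields the smaller distance; e.g. for
    # "A-2" with combined entries ["A-1", "B-2"] the word-by-word distance is 0,
    # so "A-2" itself is kept, as in the example above.
    return complete_text if word_dist < whole_dist else whole_entry
```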
Further, according to the target text information, the label information of the key frames and the highlight segments is determined.
In some embodiments, the target text information corresponding to the text information detected in the highlight is used as the label information of the highlight.
For example, if text information confirming a multi-kill (for example, a weapon kill or a skill kill) is detected in the video frames corresponding to a kill-type highlight segment, the multi-kill can be used as the tag information corresponding to that highlight segment. Optionally, the tag information corresponding to the highlight segment may further include information such as the weapon name, skill name and map name used to complete the kill.
In other embodiments, the target text information corresponding to the text information detected in the adjacent key frames with the same event type is used as the label information of the highlight segment between the adjacent key frames.
For example, suppose the event types corresponding to a first key frame and a second key frame are the same, e.g., both are weapon backpack selection events, the second key frame is the next key frame with the same event type after the first key frame, and a first highlight segment is located between the first key frame and the second key frame. In that case, the target text information matched with the text information detected in the first key frame and the second key frame may be used as the tag information of the first highlight segment; for example, the weapon name or character name detected in the first key frame and the second key frame may be used as the tag information of the first highlight segment.
That is, as shown in table 2, the tag information of the highlight may include text information from the highlight, and may also include text information from the related key frame.
Therefore, the starting and ending time of the target event in the video to be edited and the label information corresponding to the target event can be obtained.
In S260, the video to be clipped is clipped according to the start-stop time of the target event to obtain a video segment corresponding to the target event, and further, the tag information of the target event is assigned to the video segment corresponding to the target event to obtain a target video segment corresponding to the target event.
It should be understood that clipping the video to be clipped according to the start-stop time of the target event may be performed after S250, or alternatively after S230, which is not limited in this application.
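The clipping of S260 could be realized, for example, by invoking ffmpeg; this is an assumption (the patent does not prescribe a tool), and carrying the tag information in a sidecar structure is likewise only illustrative.

```python
import subprocess

def cut_segment(src, start_sec, end_sec, dst):
    # -ss/-to select the start-stop time window; -c copy avoids re-encoding.
    subprocess.run(["ffmpeg", "-y", "-i", src,
                    "-ss", str(start_sec), "-to", str(end_sec),
                    "-c", "copy", dst], check=True)

def clip_target_events(src, segments):
    # segments: list of (event_type, start_sec, end_sec, tag_info) tuples.
    outputs = []
    for i, (event, start, end, tags) in enumerate(segments):
        dst = f"segment_{i}_{event}.mp4"
        cut_segment(src, start, end, dst)
        outputs.append({"file": dst, "event": event, "tags": tags})   # tagged target segment
    return outputs
```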
In some embodiments of the present application, as shown in fig. 3, in S270, the target video segment may be further processed, for example, a special video segment may be spliced.
As an example, if the target video segments include a video segment corresponding to a first virtual event, the video segments immediately before and after it correspond to a second virtual event, and the first-type event labels corresponding to the first virtual event and the second virtual event are the same, then the video segment corresponding to the first virtual event is spliced with the video segments corresponding to the second virtual event before and after it.
In some embodiments, the first virtual event and the second virtual event are both events of a second type, i.e., events in a battle.
Alternatively, the first-type event tags being the same may mean that the key frame tags, such as the map tag and the weapon tag, are the same; identical key frame tags indicate that the video segments occur in the same battle.
Alternatively, the first virtual event may refer to a change bullet event and the second virtual event may refer to a multi-badge kill event. For example, for each bullet changing event, if a multi-badge killing event with the same map label and the same weapon label exists before and after the bullet changing event, the two multi-badge killing events are spliced with the video clip corresponding to the bullet changing event to obtain a special video clip spliced by the multi-badge killing event + the bullet changing event + the multi-badge killing event.
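A sketch of this splicing rule, again assuming ffmpeg (here its concat demuxer) as the editing backend; the label comparison and file handling are illustrative only.

```python
import os
import subprocess
import tempfile

def splice_special_segment(kill_before, reload_clip, kill_after, tags_equal, dst):
    # kill_before / reload_clip / kill_after: paths of the three clipped segments.
    if not tags_equal:                 # map label and weapon label must both match
        return None
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for path in (kill_before, reload_clip, kill_after):
            f.write(f"file '{os.path.abspath(path)}'\n")
        list_file = f.name
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", list_file, "-c", "copy", dst], check=True)
    os.unlink(list_file)
    return dst                         # multi-kill + bullet-change + multi-kill segment
```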
Optionally, in some embodiments the video to be clipped contains a complete battle; in other embodiments it may contain multiple battles, or only part of a battle (i.e., the battle has not ended). In the latter case, in order to improve the viewing experience, in this embodiment of the present application the target video segments may also be output according to the event types of the video segments they contain.
The character animation event represents the end of a battle, so that the character animation event can be used as the end node of a video clip when a video clip is made.
In some embodiments, if the target video segments include a character animation highlight segment, for example a character win segment or a character failure segment, the first character animation highlight segment and the highlight segments before it may be spliced to obtain a first video segment, and the first video segment is output; the highlight segments between two character animation highlight segments are spliced to obtain a second video segment, and the second video segment is output; and a character animation highlight segment together with the highlight segments after it is cached and is not output until the corresponding battle has finished.
In other embodiments, if the target video segment does not include the character animation highlight segment, which indicates that the fight is not over, in this case, the highlight segment may be cached, and after the character animation highlight segment appears, the highlight segment may be output by being spliced with the character animation highlight segment.
That is, in the embodiment of the present application, each output video segment may include a highlight segment representing the end of a battle, and may further include highlight segments from the battle process, for example multi-badge kill highlight segments, bullet-change highlight segments, highlight segments of specific scenes, and the like.
Therefore, in the embodiment of the application, the video clipping device can extract frames of a video to be clipped according to a certain time interval, send the extracted video frames to the RGB convolution network model and the time sequence action segmentation network model, automatically clip video segments corresponding to target events from the video to be clipped through the network model, classify and add labels to the video segments, and further automatically generate the target video segments.
Fig. 11 is a schematic diagram of a video clipping device according to an embodiment of the present application, and as shown in fig. 11, the video clipping device 1000 includes:
a frame extracting module 1001, configured to extract a plurality of video frames in a video to be edited;
a feature extraction module 1002, configured to extract image features of the multiple video frames through a first convolutional network model;
a time sequence action segmentation module 1003, which inputs the image characteristics of the plurality of video frames into a time sequence action segmentation network model to obtain the start and stop time of the target event in the video to be edited;
the event classification module 1004 is configured to perform event classification on the video frames corresponding to the start-stop time of the target event through a second convolutional network model, and output an event type of the target event;
a tag generation module 1005, configured to obtain text information in a video frame corresponding to the target event according to the event type of the target event, and generate tag information of the target event;
the clipping module 1006 is configured to clip the video to be clipped according to the start-stop time of the target event in the video to be clipped, and assign the tag information of the target event to the video segment corresponding to the target event, so as to obtain a target video segment corresponding to the video to be clipped.
Optionally, in some embodiments, the framing module 1001 is further configured to:
according to the event type of the target event, extracting frames of the video frames corresponding to the target event according to a specific frame rate;
the apparatus 1000 further comprises:
the acquisition module is used for acquiring an image of a specific area in the extracted video frame according to the event type of the target event, wherein the image of the specific area is an image containing text information in the video frame;
the optical character recognition OCR module is used for recognizing the image of the specific area to obtain text information contained in the video frame and coordinate information of a text box corresponding to the text information;
and the processing module is used for determining the label information of the target event according to the text information and the coordinate information of the text box corresponding to the text information.
Optionally, in some embodiments, the processing module is further configured to:
combining the text boxes according to the coordinate information of the text boxes to obtain complete text information;
matching the complete text information with the phrase dictionary, and determining target text information matched with the complete text information;
and determining the label information of the target event according to the target text information.
Optionally, in some embodiments, the processing module is further configured to:
two text boxes satisfying the following conditions are merged into one text box:
the lateral distance between the two text boxes is smaller than a first threshold value;
the vertical height difference of the two text boxes is smaller than a second threshold value;
the vertical coincidence range of the two text boxes is larger than a third threshold value.
Optionally, in some embodiments, the processing module is further configured to:
integrally matching the complete text information with the entries in the phrase dictionary, and determining a first text editing distance between the complete text information and the entries in the phrase dictionary;
and taking the entry of which the first text editing distance is the minimum and the first text editing distance is smaller than a fourth threshold value as the target text information.
Optionally, in some embodiments, the processing module is further configured to:
integrally matching the complete text information with the entries in the phrase dictionary, and determining a first text editing distance between the complete text information and the entries in the phrase dictionary;
independently matching each word in the complete text information with the entry in the phrase dictionary, and determining a second text editing distance between the complete text information and the entry in the phrase dictionary;
and taking the entry corresponding to the smaller value of the first text editing distance and the second text editing distance as the target text information matched with the complete text information.
Optionally, in some embodiments, the target video segment includes at least one highlight segment, or the target video segment includes at least one highlight segment and at least one key frame, the target event includes a first type of event and a second type of event, the key frame includes a plurality of video frames corresponding to the first type of event, the highlight segment includes a plurality of video frames corresponding to the second type of event, and the processing module is further configured to:
target text information matched with the text information detected in the video frame corresponding to the second type of event is used as label information of the second type of event;
and taking the target text information matched with the text information detected in the video frames corresponding to the first-class events which are adjacent and have the same event type as the label information of the second-class event between the adjacent first-class events.
Optionally, in some embodiments, the video to be clipped is an on-demand game video, the target video segment includes at least one highlight segment, or the target video segment includes at least one highlight segment and at least one key frame, where the target event includes a first type of event and a second type of event, the key frame includes a plurality of video frames corresponding to the first type of event, and the highlight segment includes a plurality of video frames corresponding to the second type of event.
Optionally, in some embodiments, the apparatus 1000 further comprises:
and an output module, configured to: if the target video segment includes a video segment corresponding to a first virtual event, the video segments immediately before and after it correspond to a second virtual event, and the first-type event labels corresponding to the first virtual event and the second virtual event are the same, splice the video segment corresponding to the first virtual event with the video segments corresponding to the second virtual event before and after it, where the first virtual event and the second virtual event are both the second type of event.
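A sketch of this splicing condition, operating on time ranges only; the dictionary keys and event names are hypothetical, and writing out the spliced media file is omitted.

    # Merge a [second, first, second] run of segments that share the same
    # first-type event label into a single clip covering the whole time span.
    def splice(segments, first_event, second_event):
        out, i = [], 0
        while i < len(segments):
            if (i + 2 < len(segments)
                    and segments[i]["event"] == second_event
                    and segments[i + 1]["event"] == first_event
                    and segments[i + 2]["event"] == second_event
                    and segments[i]["tag"] == segments[i + 1]["tag"] == segments[i + 2]["tag"]):
                out.append({"start": segments[i]["start"],
                            "end": segments[i + 2]["end"],
                            "event": second_event,
                            "tag": segments[i]["tag"]})
                i += 3
            else:
                out.append(segments[i])
                i += 1
        return out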
Optionally, in some embodiments, the apparatus 1000 further comprises:
the communication module is used for receiving a clipping request sent by a client;
and the processing module is used for acquiring the video to be clipped according to the clipping request.
The communication module is further configured to: send the target video segments obtained by clipping the video to be clipped to the corresponding one or more clients.
Optionally, in some embodiments, the framing module 1001 is further configured to:
extracting a plurality of video frames in a training video;
Optionally, in some embodiments, the apparatus 1000 further comprises:
the first labeling module is used for labeling the event type of the target event in the plurality of video frames;
and the first training module is used for inputting the plurality of video frames into the second convolution network model for training to obtain the model parameters of the second convolution network model.
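As one possible realization only (the embodiment does not fix the architecture of the second convolution network model), a minimal PyTorch training sketch with a torchvision ResNet-18 standing in for that model and random tensors standing in for the labelled video frames.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset
    from torchvision.models import resnet18

    NUM_EVENT_TYPES = 5                      # example value, not from this application
    frames = torch.randn(64, 3, 224, 224)    # placeholder labelled video frames
    labels = torch.randint(0, NUM_EVENT_TYPES, (64,))

    model = resnet18(num_classes=NUM_EVENT_TYPES)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    loader = DataLoader(TensorDataset(frames, labels), batch_size=16, shuffle=True)
    for epoch in range(3):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)      # classify each frame into an event type
            loss.backward()
            optimizer.step()

    # The learned weights play the role of the "model parameters" referred to above.
    torch.save(model.state_dict(), "event_classifier.pt")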
Optionally, in some embodiments, the apparatus 1000 further comprises:
the second labeling module is used for labeling the start-stop time of the target event in the training video;
Optionally, in some embodiments, the framing module 1001 is further configured to:
extracting a plurality of video frames in the marked training video;
the feature extraction module 1002 is further configured to: inputting the plurality of video frames into a first convolution network model for feature extraction to obtain image features of the plurality of video frames;
and the second training module is used for inputting the image characteristics of the plurality of video frames into the time sequence action segmentation network for training to obtain the model parameters of the time sequence action segmentation network.
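Likewise, a minimal sketch of training a temporal head on precomputed per-frame image features; the 1-D convolutional stack is only a stand-in for the time sequence action segmentation network, and deriving per-frame labels from the annotated start-stop times is an assumption.

    import torch
    from torch import nn

    T, FEAT_DIM, NUM_CLASSES = 500, 2048, 3   # example sizes (background + event classes)
    features = torch.randn(1, FEAT_DIM, T)    # image features of T extracted frames
    frame_labels = torch.randint(0, NUM_CLASSES, (1, T))  # from annotated start-stop times

    head = nn.Sequential(
        nn.Conv1d(FEAT_DIM, 256, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv1d(256, NUM_CLASSES, kernel_size=1),
    )
    optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        optimizer.zero_grad()
        logits = head(features)               # (1, NUM_CLASSES, T) per-frame scores
        loss = loss_fn(logits, frame_labels)  # CrossEntropyLoss accepts (N, C, T) vs (N, T)
        loss.backward()
        optimizer.step()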
It is to be understood that the apparatus embodiments and the method embodiments correspond to each other, and similar descriptions may refer to the method embodiments; they are not repeated here. Specifically, the video clipping apparatus shown in Fig. 11 may execute the method embodiments corresponding to Fig. 2, Fig. 3, Fig. 9, and Fig. 10, and the foregoing and other operations and/or functions of the modules in the video clipping apparatus respectively implement the corresponding flows in the methods of Fig. 2, Fig. 3, Fig. 9, and Fig. 10.
The video clipping apparatus of the embodiments of the present application is described above from the perspective of functional modules in conjunction with the drawings. It should be understood that the functional modules may be implemented by hardware, by instructions in software, or by a combination of hardware and software modules. Specifically, the steps of the method embodiments in the present application may be implemented by integrated logic circuits of hardware in a processor and/or by instructions in the form of software; the steps of the methods disclosed in the embodiments of the present application may be directly executed by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. Optionally, the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media well known in the art. The storage medium is located in a memory, and a processor reads information from the memory and completes the steps in the above method embodiments in combination with its hardware.
Fig. 12 is a schematic block diagram of a video clipping device 1100 provided by an embodiment of the present application.
As shown in fig. 12, the video clipping device 1100 may include:
a memory 1110 and a processor 1120, the memory 1110 being configured to store computer programs and to transfer the program codes to the processor 1120. In other words, the processor 1120 can call and run a computer program from the memory 1110 to implement the method in the embodiment of the present application.
For example, the processor 1120 may be configured to perform the above-described method embodiments according to instructions in the computer program.
In an exemplary embodiment, the processor 1120 performs the steps of the above-described method embodiments by calling one or more instructions in the memory 1110. In particular, the memory stores one or more first instructions adapted to be loaded by the processor to perform the following steps:
extracting a plurality of video frames in a video to be edited;
inputting the video frames into a first convolution network model for feature extraction to obtain image features of the video frames;
inputting the image characteristics of the plurality of video frames into a time sequence action segmentation network model to obtain the start-stop time of a target event in the video to be edited; inputting the video frames corresponding to the start-stop time of the target event into a second convolution network model for event classification to obtain the event type of the target event;
acquiring text information in a video frame corresponding to the target event according to the event type of the target event, and generating label information of the target event;
and according to the starting and ending time of the target event in the video to be clipped, clipping the video to be clipped, and assigning the tag information of the target event to the video segment corresponding to the target event to obtain the target video segment corresponding to the video to be clipped.
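As an illustration of the two media-level steps listed above (sampling frames from the video to be clipped, and cutting out a segment by its start-stop time), a sketch using OpenCV; the file names, sampling rate, and times are placeholders.

    import cv2

    def sample_frames(path, every_n=30):
        # Keep one frame every `every_n` frames of the video to be clipped.
        cap = cv2.VideoCapture(path)
        frames, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % every_n == 0:
                frames.append(frame)
            idx += 1
        cap.release()
        return frames

    def cut_segment(path, out_path, start_s, end_s):
        # Copy the frames between start_s and end_s (seconds) into a new file.
        cap = cv2.VideoCapture(path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(start_s * fps))
        for _ in range(int((end_s - start_s) * fps)):
            ok, frame = cap.read()
            if not ok:
                break
            writer.write(frame)
        cap.release()
        writer.release()

    # Example usage (placeholder file and times):
    # frames = sample_frames("game_vod.mp4"); cut_segment("game_vod.mp4", "clip_01.mp4", 120.0, 150.0)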
In some embodiments of the present application, the processor 1120 may include, but is not limited to:
general purpose processors, digital Signal Processors (DSPs), application Specific Integrated Circuits (ASICs), field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like.
In some embodiments of the present application, the computer program may be partitioned into one or more modules, which are stored in the memory 1110 and executed by the processor 1120 to perform the methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program in the video clipping apparatus.
As shown in fig. 12, the video clipping device 1100 may further include:
a transceiver 1130, the transceiver 1130 being connectable to the processor 1120 or the memory 1110.
The processor 1120 may control the transceiver 1130 to communicate with other devices, and in particular may transmit information or data to other devices or receive information or data transmitted by other devices. The transceiver 1130 may include a transmitter and a receiver, and may further include one or more antennas.
As an example, the transceiver 1130 may receive a game video, or the URL of a game video, uploaded by a user through a client; after the game video is clipped, the clipped video segments are returned in the form of URLs.
It will be appreciated that the various components in the video clipping device are connected by a bus system, wherein the bus system comprises, in addition to a data bus, a power bus, a control bus and a status signal bus.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the methods of the above method embodiments. The present application further provides a computer program product containing instructions which, when executed by a computer, cause the computer to perform the methods of the above method embodiments.
Embodiments of the present application also provide a computer program product including computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the corresponding contents in the method embodiment.
Embodiments of the present application further provide a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes the corresponding contents in the method embodiment.
It should be understood that the processor of the embodiments of the present application may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method embodiments may be performed by integrated logic circuits of hardware in a processor or by instructions in the form of software. The processor may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in connection with the embodiments of the present application may be directly executed by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and a processor reads information from the memory and completes the steps of the methods in combination with its hardware.
It will be appreciated that the memory in the embodiments of the subject application can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
It should be understood that the above memories are exemplary but not limiting; for example, the memories in the embodiments of the present application may also be static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Direct Rambus RAM (DR RAM), and the like. That is, the memory in the embodiments of the present application is intended to comprise, without being limited to, these and any other suitable types of memory.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solutions of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer or a server) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A video clipping method, comprising:
extracting a plurality of video frames in a video to be edited;
inputting the plurality of video frames into a first convolution network model for feature extraction to obtain image features of the plurality of video frames;
inputting the image characteristics of the video frames into a time sequence action segmentation network model to obtain the starting and ending time of a target event in the video to be edited;
inputting the video frames corresponding to the start-stop time of the target event into a second convolution network model for event classification to obtain the event type of the target event;
according to the event type of the target event, acquiring text information in a video frame corresponding to the target event, and generating label information of the target event, wherein the method comprises the following steps: according to the event type of the target event, performing frame extraction on a video frame corresponding to the target event according to a specific frame rate; acquiring an image of a specific area in the extracted video frame according to the event type of the target event, wherein the image of the specific area is an image containing text information in the video frame; inputting the image of the specific area to an Optical Character Recognition (OCR) module to obtain text information included in the video frame and coordinate information of a text box corresponding to the text information; determining the label information of the target event according to the text information and the coordinate information of the text box corresponding to the text information, wherein the determining comprises the following steps: combining the text boxes according to the coordinate information of the text boxes to obtain complete text information, matching the complete text information with a phrase dictionary to determine target text information matched with the complete text information, and determining label information of the target event according to the target text information;
and according to the starting and ending time of the target event in the video to be clipped, clipping the video to be clipped, and assigning the tag information of the target event to the video segment corresponding to the target event to obtain the target video segment corresponding to the video to be clipped.
2. The method according to claim 1, wherein the merging the text boxes according to the coordinate information of the text boxes comprises:
two text boxes satisfying the following conditions are merged into one text box:
the lateral distance between the two text boxes is smaller than a first threshold value;
the vertical height difference of the two text boxes is smaller than a second threshold value;
the vertical overlap range of the two text boxes is larger than a third threshold value.
3. The method of claim 1, wherein the matching the complete text information with a phrase dictionary to determine the target text information matched with the complete text information comprises:
integrally matching the complete text information with the entries in the phrase dictionary, and determining a first text editing distance between the complete text information and the entries in the phrase dictionary;
and taking, as the target text information, the entry whose first text editing distance is the minimum and is smaller than a fourth threshold value.
4. The method of claim 1, wherein the matching the complete text information with a phrase dictionary to determine the target text information matched with the complete text information comprises:
integrally matching the complete text information with the entries in the phrase dictionary, and determining a first text editing distance between the complete text information and the entries in the phrase dictionary;
independently matching each word in the complete text information with the entry in the phrase dictionary, and determining a second text editing distance between the complete text information and the entry in the phrase dictionary;
and taking the entry corresponding to the smaller value of the first text editing distance and the second text editing distance as target text information matched with the complete text information.
5. The method according to any one of claims 1-4, wherein the target video segment includes at least one highlight segment, or the target video segment includes at least one highlight segment and at least one key frame, the target event includes a first type of event and a second type of event, the key frame includes a plurality of video frames corresponding to the first type of event, the highlight segment includes a plurality of video frames corresponding to the second type of event, and determining the tag information of the target event according to the target text information includes:
target text information matched with the text information detected in the video frame corresponding to the second type of event is used as label information of the second type of event;
and taking target text information matched with text information detected in video frames corresponding to adjacent first-type events of the same event type as label information of a second-type event between the adjacent first-type events.
6. The method according to claim 1, wherein the video to be clipped is an on-demand game video, the target video segment comprises at least one highlight segment, or the target video segment comprises at least one highlight segment and at least one key frame, wherein the target event comprises a first type of event and a second type of event, the key frame comprises a plurality of video frames corresponding to the first type of event, and the highlight segment comprises a plurality of video frames corresponding to the second type of event.
7. The method of claim 6, further comprising:
if the target video segment comprises a video segment corresponding to a first virtual event, the video segments immediately before and after it correspond to a second virtual event, and the first-type event labels corresponding to the first virtual event and the second virtual event are the same, splicing the video segment corresponding to the first virtual event with the video segments corresponding to the second virtual event before and after it, wherein the first virtual event and the second virtual event are both the second type of event.
8. The method of claim 1, further comprising:
receiving a clipping request sent by a client;
acquiring the video to be clipped according to the clipping request;
and sending the target video segments obtained by clipping the video to be clipped to one or more corresponding clients.
9. The method of claim 1, further comprising:
extracting a plurality of video frames in a training video;
marking the event type of a target event in the plurality of video frames;
and inputting the plurality of video frames to the second convolution network model for training to obtain model parameters of the second convolution network model.
10. The method of claim 1, further comprising:
marking the starting and ending time of a target event in a training video;
extracting a plurality of video frames in the marked training video;
inputting the video frames into a first convolution network model for feature extraction to obtain image features of the video frames;
and inputting the image characteristics of the plurality of video frames into the time sequence action segmentation network for training to obtain model parameters of the time sequence action segmentation network.
11. A video clipping apparatus, comprising:
the frame extracting module is used for extracting a plurality of video frames in a video to be edited;
the characteristic extraction module is used for extracting image characteristics of the plurality of video frames through a first convolution network model;
the time sequence action segmentation module is used for inputting the image characteristics of the video frames into a time sequence action segmentation network model and outputting the start-stop time of a target event in the video to be edited;
the event classification module is used for performing event classification on the video frames corresponding to the start and stop time of the target event through a second convolutional network model and outputting the event type of the target event;
a tag generation module, configured to obtain text information in a video frame corresponding to the target event according to the event type of the target event, and generate tag information of the target event, where the tag generation module includes: according to the event type of the target event, performing frame extraction on a video frame corresponding to the target event according to a specific frame rate; acquiring an image of a specific area in the extracted video frame according to the event type of the target event, wherein the image of the specific area is an image containing text information in the video frame; inputting the image of the specific area to an Optical Character Recognition (OCR) module to obtain text information included in the video frame and coordinate information of a text box corresponding to the text information; determining the label information of the target event according to the text information and the coordinate information of the text box corresponding to the text information, wherein the determining comprises the following steps: combining the text boxes according to the coordinate information of the text boxes to obtain complete text information, matching the complete text information with a phrase dictionary to determine target text information matched with the complete text information, and determining label information of the target event according to the target text information;
and the clipping module is used for clipping the video to be clipped according to the start-stop time of the target event in the video to be clipped, and assigning the tag information of the target event to the video segment corresponding to the target event to obtain the target video segment corresponding to the video to be clipped.
12. A video clipping apparatus, comprising:
a processor and a memory, the memory for storing a computer program, the processor for invoking and executing the computer program stored in the memory to perform the method of any of claims 1-10.
13. A computer-readable storage medium for storing a computer program which causes a computer to perform the method of any one of claims 1 to 10.
CN202011559704.4A 2020-12-25 2020-12-25 Video editing method, device and storage medium Active CN113542865B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011559704.4A CN113542865B (en) 2020-12-25 2020-12-25 Video editing method, device and storage medium


Publications (2)

Publication Number Publication Date
CN113542865A CN113542865A (en) 2021-10-22
CN113542865B true CN113542865B (en) 2023-04-07

Family

ID=78094309

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011559704.4A Active CN113542865B (en) 2020-12-25 2020-12-25 Video editing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN113542865B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114554273A (en) * 2022-01-27 2022-05-27 深圳市大梦龙途文化传播有限公司 Method, equipment and storage medium for automatically recording highlight moment of game
CN114449346B (en) * 2022-02-14 2023-08-15 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN114666656A (en) * 2022-03-15 2022-06-24 北京沃东天骏信息技术有限公司 Video clipping method, video clipping device, electronic equipment and computer readable medium
CN114863321B (en) * 2022-04-08 2024-03-08 北京凯利时科技有限公司 Automatic video generation method and device, electronic equipment and chip system
CN115134631B (en) * 2022-07-25 2024-01-30 北京达佳互联信息技术有限公司 Video processing method and video processing device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106803987B (en) * 2015-11-26 2021-09-07 腾讯科技(深圳)有限公司 Video data acquisition method, device and system
US10740620B2 (en) * 2017-10-12 2020-08-11 Google Llc Generating a video segment of an action from a video
CN111013150B (en) * 2019-12-09 2020-12-18 腾讯科技(深圳)有限公司 Game video editing method, device, equipment and storage medium
CN111274442B (en) * 2020-03-19 2023-10-27 聚好看科技股份有限公司 Method for determining video tag, server and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40053184; Country of ref document: HK)
GR01 Patent grant