CN107707931B

CN107707931B - Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment

Info

Publication number: CN107707931B
Application number: CN201610644155.8A
Authority: CN
Inventors: 刘垚; 华先胜; 黄健强; 周昌
Original assignee: Alibaba Group Holding Ltd
Current assignee: Zhejiang Tmall Technology Co Ltd
Priority date: 2016-08-08
Filing date: 2016-08-08
Publication date: 2021-09-10
Anticipated expiration: 2036-08-08
Also published as: CN107707931A

Abstract

The application provides a method and a device for generating interpretation data according to video data, a method and a device for synthesizing the video and the interpretation data, and an electronic device for synthesizing the video and the interpretation data, wherein the method for generating the interpretation data according to the video data comprises the following steps: acquiring a video clip to be processed; identifying identifiable events and attribute parameters thereof from the video frames to be processed of the video clips to be processed; and generating interpretation data corresponding to the identifiable events according to the attribute parameters of the identifiable events. The effect of saving the program production cost is achieved. And can play a role in enabling programs to be played more quickly and timely.

Description

Method and device for generating interpretation data according to video data, method and device for synthesizing data and electronic equipment

Technical Field

The present application relates to a method of generating interpretation data, and more particularly, to a method and apparatus for generating interpretation data from video data. And to a method and apparatus for the composition of video and interpretation data. And to an electronic device for the composition of video and interpretation data.

Background

Generally, most multimedia programs already contain video and voice corresponding to the content of the video, and such programs can be made by pre-recording the video and the corresponding voice.

For a program that only includes video but not voice, in order to meet the viewing habit of the viewer and improve the viewing experience of the viewer, the program that only includes video needs to be processed to add a voice or text interpretation corresponding to the content of the program to the program, and then the program is provided to the viewer. This undoubtedly requires a lot of manpower to make corresponding commentary and recording according to the content of the video, and even requires manual work to generate a set of multimedia programs from the voice or text and the corresponding video.

In addition, in the case of live broadcasting, the corresponding voice can be synchronously recorded and played simultaneously only when the video is collected. This requires that, for the live broadcast situation, the corresponding commentator must provide the voice commentary corresponding to the situation at that time synchronously according to the situation at the site.

For example, for a new video generated by re-editing and editing the existing video material, the original interpretation of voice or text and the like is no longer suitable for the new video after editing and editing, and the interpretation data of voice or text and the like corresponding to the content of the new video after editing and editing also needs to be added manually.

Similar to the above-described cases, it takes labor and additional time to provide interpretation data. Thereby causing problems of increased production cost of the program and non-timely broadcasting.

Disclosure of Invention

The present application provides a method of generating interpretation data from video data. The application also provides a device for generating the interpretation data according to the video data. The application also provides a method for generating interpretation data for the video and a device for generating interpretation data for the video. The application also provides an electronic device for generating interpretation data for a video.

The application provides a method for generating interpretation data according to video data, which comprises the following steps:

acquiring a video clip to be processed;

identifying identifiable events and attribute parameters thereof from the video frames to be processed of the video clips to be processed;

and generating interpretation data corresponding to the identifiable events according to the attribute parameters of the identifiable events.

Preferably, the identifying identifiable events and attribute parameters thereof from the video frames to be processed of the video clip to be processed comprises:

detecting whether a recognizable event is included within a to-be-processed video frame of the to-be-processed video segment;

and if so, determining the attribute parameters of the identifiable event.

Preferably, the detecting whether an identifiable event is contained within a pending video frame of the pending video segment comprises:

and detecting whether the video frame to be processed of the video clip to be processed contains identifiable events or not by utilizing the computer neural network for calculation obtained by training the marked video data and the data of the video frame to be processed of the video clip to be processed.

Preferably, the computer neural network comprises a three-dimensional convolutional neural network or a long-short term memory artificial neural network.

Preferably, the determining attribute parameters of the identifiable event comprises:

and determining attribute parameters of the identifiable events by using the computer neural network for calculation obtained by training the marked video data and the data of the video frames to be processed of the video clips to be processed.

Preferably, the attribute parameters of the identifiable events include at least one of:

the time corresponding to the video frame at which the identifiable event occurs, the time corresponding to the video frame at which the identifiable event ends, the name of the identifiable event, and the location of the participant of the identifiable event.

Preferably, the computer neural network comprises: three-dimensional convolutional neural networks or long-short term memory artificial neural networks.

and identifying identifiable events and attribute parameters thereof from the video frames to be processed of the video clips to be processed by utilizing the computer neural network for calculation obtained by training the marked video data and the data of the video frames of the video clips to be processed and the video frames of the video clips to be processed.

Preferably, the interpretation data comprises voice data;

correspondingly, the generating of the interpretation data corresponding to the recognizable event according to the attribute parameter of the recognizable event comprises:

respectively determining each corresponding voice fragment data according to the attribute parameters of the recognizable events;

and generating each piece of voice data into a piece of voice data as the voice data corresponding to the recognizable event.

Preferably, the interpretation data comprises textual data;

respectively determining each corresponding text data according to the attribute parameters of the identifiable events;

and generating each text data into a piece of text data as the text data corresponding to the identifiable event.

Preferably, the step of identifying identifiable events and attribute parameters thereof from the video frames to be processed of the video clip to be processed further comprises the following steps:

sampling and decoding the video clip to be processed according to a preset frame rate;

and taking the video frame obtained after the sampling decoding as a video frame to be processed of the video segment to be processed.

Preferably, after the step of generating the interpretation data corresponding to the recognizable event according to the attribute parameter of the recognizable event, the method further includes:

and synthesizing the interpretation data and the video clip to be processed.

Preferably, the feature is used for generating the commentary voice from the sports program video.

The application provides a method for synthesizing video and interpretation data, which comprises the following steps:

acquiring a video to be processed in a stream format;

slicing the video to be processed to generate a video clip with a preset time length, wherein the video clip to be processed comprises a group of video frames to be processed;

judging whether each group of video frames to be processed contains identifiable events or not;

when the method comprises the following steps: determining identifiable events and attribute parameters thereof from a video frame to be processed; generating interpretation data corresponding to the identifiable events according to the attribute parameters of the identifiable events; synthesizing the interpretation data and the video clip to be processed to form a processed video clip containing the video clip to be processed and the interpretation data;

and aggregating the unprocessed video clips and the processed video clips in the video to be processed into the processed video according to the sequence of the video clips to be processed.

The application provides a device for generating explanation data according to video data, which comprises:

the acquisition unit is used for acquiring a video clip to be processed;

the identification unit is used for identifying identifiable events and attribute parameters thereof from the video frames to be processed of the video clips to be processed;

and the generating unit is used for generating the interpretation data corresponding to the identifiable event according to the attribute parameters of the identifiable event.

Preferably, the identification unit includes:

a detection subunit, configured to detect whether an identifiable event is included in a to-be-processed video frame of the to-be-processed video segment;

and the determining subunit is used for determining the attribute parameters of the identifiable events if the identifiable events are contained.

Preferably, the detection subunit is specifically configured to:

Preferably, the determining subunit is specifically configured to:

if yes, utilizing the computer neural network for calculation obtained by training the marked video data and the data of the video frame to be processed of the video clip to be processed to determine the attribute parameters of the identifiable event.

Preferably, the identification unit is specifically configured to:

Preferably, the generating unit is specifically configured to generate, according to the attribute parameter of the recognizable event, voice data corresponding to the recognizable event;

accordingly, the generating unit comprises:

the voice segment determining subunit is used for respectively determining each corresponding voice segment data according to the attribute parameters of the identifiable events;

and the voice generating subunit is used for generating each piece of voice data into a piece of voice data as the voice corresponding to the recognizable event.

Preferably, the generating unit is specifically configured to generate text data corresponding to the identifiable event according to the attribute parameter of the identifiable event;

accordingly, the generating unit comprises:

the text determining subunit is used for respectively determining each section of text data corresponding to the identifiable event according to the attribute parameters of the identifiable event;

and the character generation subunit is used for generating the text data into a piece of text data as the character data corresponding to the identifiable event.

Preferably, the method further comprises the following steps:

the sampling unit is used for sampling and decoding the video frame to be processed according to a preset frame rate;

and the to-be-processed unit is used for taking the video frame obtained after the sampling decoding as the to-be-processed video frame.

Preferably, the method further comprises the following steps:

and the synthesis unit is used for synthesizing the interpretation data and the video to be processed.

Preferably for generating commentary speech from sports programming video.

The application provides a video and synthesized device of explanation data, includes:

the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a video to be processed in a stream format;

the slicing unit is used for slicing the video to be processed to generate a video clip with preset time length, and the video clip to be processed comprises a group of video frames to be processed;

the judging unit is used for judging whether the video frames to be processed contain identifiable events or not aiming at each group of video frames to be processed;

an inclusion unit, configured to, when included, perform the following steps: determining identifiable events and attribute parameters thereof from a video frame to be processed; generating interpretation data corresponding to the identifiable events according to the attribute parameters of the identifiable events; synthesizing the interpretation data and the video clip to be processed to form a processed video clip containing the video clip to be processed and the interpretation data;

and the aggregation unit is used for aggregating the unprocessed video clips and the processed video clips in the video to be processed into the processed video according to the sequence of the video clips to be processed.

The application provides an electronic device for synthesizing video and interpretation data, which comprises a processor and a memory,

the memory is used for storing a program for realizing the method for generating the interpretation data according to the video data, and after the equipment is powered on and the program for realizing the method for generating the interpretation data according to the video data is run by the processor, the following steps are executed:

acquiring a video clip to be processed;

Compared with the prior art, the method for generating the interpretation data according to the video data has the following advantages: identifiable events are identified from the video, and interpretation data corresponding to the identifiable events are generated according to attribute parameters of the identifiable events identified from the video. The explanation data corresponding to the events in the video can be generated for the video without manually watching the video, so that the effect of saving the program manufacturing cost is achieved. And can play a role in enabling programs to be played more quickly and timely.

Drawings

Fig. 1 is a flowchart illustrating a method for generating interpretation data from video data according to a first embodiment of the present application;

FIG. 2 is a flow chart illustrating a method for synthesizing video and interpretation data according to a second embodiment of the present application;

FIG. 3 is a schematic diagram of a sports video and commentary voice synthesis of a method for synthesizing video and explanatory data according to a second embodiment of the present application;

fig. 4 is a block diagram illustrating an apparatus for generating interpretation data from video data according to a third embodiment of the present application;

fig. 5 is a block diagram of an apparatus for synthesizing video and explanatory data according to a fourth embodiment of the present invention.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. This application is capable of implementation in many different ways than those herein set forth and of similar import by those skilled in the art without departing from the spirit of this application and is therefore not limited to the specific implementations disclosed below.

A first embodiment of the present application provides a method for generating interpretation data according to video data, which can be applied to any situation where it is necessary to add interpretation data corresponding to an event in a video to a video, and a flowchart of the method according to this embodiment is shown in fig. 1, and this embodiment includes the following steps:

step S101, acquiring a video frame of a to-be-processed video clip in a stream format with preset duration as the to-be-processed video frame.

The stream format in this step refers to a stream format of streaming transmission, and for video data, a video in the stream format does not need to completely include data of the complete duration of the video, and only data of a part of the duration can be decoded to obtain a video of a video image of the duration.

The video clip to be processed in the stream format is a stream format video containing preset duration, and the stream format video comprises a monitoring video, a video of a live broadcast program or a video of a sports program and the like. The video to be processed in the streaming format can be directly from a monitoring device or a live device, or can be program data generated in a post-generation mode, such as a set of certain actions of a sports game clipped together or a set of highlights in a sports game.

Some of the above streaming format videos do not include interpretation data corresponding to the content thereof when being generated, and some of the streaming format videos may include interpretation data originally, but for a new streaming format video edited together after being edited, the original interpretation data is not matched with the original interpretation data and needs to be removed, which requires that the new interpretation data be matched with the original interpretation data.

The video is usually stored in a storage device such as a magnetic disk (including a hard disk) or an optical disk, and the video stream data stored in the storage device can be read into a cache by a reading device for subsequent processing, or the data of the video sent by the other device through the network can be received by the network and subjected to subsequent processing.

The predetermined duration is sufficient for an identifiable event to be contained in the video segment and for the identifiable event contained in the video frame to be identified by the subsequent steps.

Step S102, identifying identifiable events and attribute parameters thereof from the video frames to be processed of the video clips to be processed.

Before the step of identifying the identifiable event and the attribute parameter thereof from the video to be processed, some pre-processing may be preferably performed on the obtained video segment, such as resampling the video at a predetermined frame rate.

Video is composed of a plurality of continuous static pictures, and each static picture is a static photo and is called a frame. The frame rate (also referred to as a picture update rate) is a measure for measuring the number of display picture frames in a video unit time. The measurement unit is "number of display frames per Second" (Frame per Second), so the english of the Frame rate is abbreviated as FPS.

Generally, in order to ensure the continuity of pictures during video playing, the frame rate of video stream data is usually above 24, i.e. each second contains more than 24 frames of still pictures.

Before this step, the obtained video may be resampled at a preset frame rate, for example, 6 to 10FPS, so that the resampled video contains 6 to 10 frames of static pictures per second, that is, 6 to 10 photos per second. The specific frame rate should be selected based on the fact that the sampled video can still represent a complete and recognizable event.

Searching for the identifiable event after the frame rate reduction resampling process can significantly reduce the amount of data to be processed when searching for the identifiable event subsequently.

And whether all video frames of the resampled to-be-processed segment or all video frames of the to-be-processed video segment which is not resampled are the to-be-processed video frames of the to-be-processed video segment. After the video frame to be processed is acquired, the video frame needs to be processed to identify events occurring in the video, and the events occurring in the video segment that can be identified are referred to as identifiable events.

After the frame data to be processed is obtained, whether identifiable events are contained in the video frames to be processed or not is detected.

The method for detecting whether the video frame to be processed contains the recognizable event includes using a computer neural network for calculation obtained by training video data that has been marked and the video frame data to be processed, and calculating and detecting whether the video frame to be processed contains the recognizable event.

The computerized neural network for calculation is obtained by training the computerized neural network for calculation by using the marked video data and adopting a machine learning algorithm.

The method can fully utilize the existing marked real video data to obtain the detection result which is as accurate as possible and accords with the actual situation.

The computer neural network comprises a three-dimensional convolution (C3D) neural network or a Long Short Term Memory (LSTM) artificial neural network.

The three-dimensional convolution (C3D) neural network is composed of one or more convolution layers and a top fully connected layer (corresponding to the classical neural network), and also includes an associated weight and pooling layer (Pooling layer). This structure enables the convolutional neural network to utilize a two-dimensional structure of the input data. Convolutional neural networks can give superior results in terms of image and speech recognition compared to other learning structures. This model can also be trained using a back propagation algorithm. Convolutional neural networks require fewer parameters to estimate than other, feed-forward neural networks.

The Long Short Term Memory (LSTM) artificial neural network is a time recurrent neural network that is suitable for processing and predicting significant events of very long intervals and delays in time series due to its more unique structure than the time Recurrent Neural Network (RNN). As a nonlinear model, LSTM can be used as a complex nonlinear unit for constructing larger neural networks.

The input of the computer neural network is data of a video frame, and the output can be set to be whether the recognizable event is contained or not, and can also be directly set to be the name and the inclusion of the event in the attribute parameters of the recognizable event. Therefore, whether the recognizable event is contained or not can be obtained through calculation only once, and the name of the recognizable event can also be obtained at the same time. Further, the output of the computer neural network may even be set to contain not only the name and whether or not the identifiable event is contained, but also all the attribute parameters of the identifiable event. This allows all results to be obtained in one calculation, saving time. And resources are saved. The specific structure of the computer neural network can be determined according to the type of the adopted computer neural network, the actual conditions of the computing environment, the time, the cost and the like.

The machine learning algorithm uses an algorithm that involves a complex structure or high-level abstraction of data by multiple processing layers consisting of multiple non-linear transformations. The benefit of machine learning algorithms is that the manual acquisition of features can be replaced by efficient algorithms for unsupervised or supervised feature learning and hierarchical feature extraction.

The marked video data refers to video data as follows: the marked video frame data are a plurality of groups of video frame data, the plurality of groups of video frames comprise all identifiable events, each group of video frames comprises one identifiable event, and the attribute parameters of the identifiable events contained in the video frames are also known.

The frame rate of the marked video is the same as that of the video to be processed, and if the video to be processed is resampled, the frame rate of the corresponding marked video data is the same as that of the video data to be processed after resampling.

The marked video can be obtained by performing frame rate adjustment and frame truncation on the existing video and adding necessary marking information. When the existing video is processed, only the group of video frames containing the recognizable events can be reserved, and in order to obtain videos containing the recognizable events in as many cases as possible from the limited existing video resources, the group of video frames containing the recognizable events can be subjected to some processing such as turning, blurring and the like, and the group of video frames obtained after the processing is correspondingly subjected to the processing of adding necessary marks to be used as the marked video data. So that the video data obtained containing the already marked identifiable events covers as much of the situation as possible.

And training the computer neural network by taking the marked video frame data as sample data and test data in advance, and fixing the parameters of the computer neural network as the computer neural network for calculation when the preset condition for stopping training the computer neural network is met.

There are many ways to train a computer neural network, either supervised or unsupervised, or semi-supervised, and the specific way to use can be determined according to the kind of the computer neural network used, and the characteristics of the sample data and the recognizable events themselves.

The training process is roughly as follows: calculating a sample mean value of sample data; taking sample data as the input of the computer neural network, calculating to obtain a result, adjusting each parameter of the computer neural network according to the deviation of the result and the sample mean value, inputting the sample data into the computer neural network, calculating to obtain the result and calculating the deviation, continuing adjusting each parameter of the computer neural network and performing calculation training again when the deviation does not meet the expectation, until the deviation meets the expectation or other training finishing conditions are met, and finishing the training if the training times reach the expectation.

And performing necessary verification on the computer neural network after training by using the test data, and if the verification result is positive, using the computer neural network as a computer neural network for calculation. If the verification result is negative, the computer neural network needs to be adjusted and the verification needs to be retrained.

In the step, the data of the video frame to be processed is used as the input of the computer neural network for calculation with fixed parameters for calculation, and whether the video frame to be processed contains identifiable events can be judged according to the output of the computer neural network.

And if the video frame to be processed contains the identifiable event, determining the attribute parameters of the identifiable event.

Similarly to the detection of whether the video frame to be processed contains the identifiable event, the attribute parameters of the identifiable event are calculated and determined by using the computer neural network for calculation obtained by training the marked video data and the video frame data to be processed. As such. Therefore, the detection result which is as accurate as possible and accords with the actual situation can be obtained by fully utilizing the existing marked real video data.

The identifiable event attribute parameters include at least one of: the time corresponding to the video frame at which the identifiable event occurs, the time corresponding to the video frame at which the identifiable event ends, the name of the identifiable event, and the location of the participant of the identifiable event. Determining these attribute parameters can provide rich information for subsequent processing, and also facilitate generation of interpretation data more consistent with video content.

For the attribute parameters of different recognizable events, different computer neural networks can be selected according to the characteristics of the attribute parameters, for example, for the time of the recognizable event or the name of the recognizable event, the two computer neural networks can obtain satisfactory results, and for the positions of participants of the recognizable event, the positions are not related to the time sequence, and the three-dimensional convolution (C3D) neural network is more convenient to train.

The input of the computer neural network may be set as data of a video frame and the output may be set as a relevant parameter necessary for calculating the attribute parameter of the recognizable event.

For example, for the case of determining the time corresponding to the video frame where the recognizable event occurs, the output of the computer neural network may be set to include an offset of the frame where the recognizable event occurs relative to the first frame of the to-be-processed video segment, and once the video frame where the recognizable event occurs is determined, the time corresponding to the video frame where the recognizable event occurs may be determined according to the timestamp information of the video frame where the recognizable event occurs.

Similarly, the output of the computer neural network may be set to contain an offset of the frame at which the recognizable event ended relative to the last frame of the video segment to be processed.

For the case that the name of the recognizable event has been determined when detecting whether the recognizable event is contained in the video frame to be processed, it is only necessary to set the output of the computer neural network to the remaining attribute parameters of the recognizable event.

For more than one attribute parameter of the identifiable event, the output of the computer neural network may be set as a plurality of attribute parameters of the identifiable event, so that the computer neural network for calculation obtained after training the computer neural network may be used for calculation of different attribute parameters of the identifiable event. I.e. simultaneously and possibly calculated different attribute parameters of the identifiable event.

For more than one attribute parameter of the identifiable event, the output of the computer neural network can be set as only one attribute parameter of the identifiable event each time, that is, the computer neural networks for the computers corresponding to different attribute parameters of the identifiable event are also different and need to be trained in a targeted manner. Training the computer neural network in this way makes it easier to obtain a computer neural network for computation quickly.

In other aspects of the computer neural network and the learning algorithm, for example, the process of training the computer neural network by using the marked video data and obtaining the attribute parameters of the identifiable event by using the to-be-processed video frame data and the calculation neural network is described in the foregoing with respect to detecting whether the to-be-processed video frame includes the identifiable event portion, and details thereof are not repeated herein.

The following description will take as an example the case where the motion of the basketball game is recognized from the video clip of the basketball game, the time corresponding to the video frame of the motion, and the position of the player of the motion.

Wherein the segment of the video of the basketball game corresponds to the segment of the video to be processed, the action of the basketball game corresponds to the recognizable event and is also the name of the recognizable event, the time of the action corresponds to the time corresponding to the video frame of the recognizable event, and the position of the key player of the action corresponds to the position of the participant of the recognizable event.

The basketball game usually completes one action in 2-3 seconds, the market of the video segment of the basketball game is 4 seconds, firstly, the video of the basketball game is resampled, the frame rate is adjusted to 6FPS, and 24 frames of video frames are supplied to the 4-second-duration 6FPS as the video frames to be processed.

And training the computer neural network in advance to obtain the computer neural network for calculation. The training sample data and test data are typically obtained in the following manner: and after the video containing the identifiable basketball movement action is subjected to frame rate adjustment, video frame interception, marking and other work, screening out the groups of video frame data only containing the identifiable basketball movement action.

Those sets of video frames that have been marked with different basketball game actions and the time and player position of the action are used as sample data and test data.

The method comprises the steps of constructing a computer neural network, determining a network structure such as the number of convolution layers, the number of pooling (posing) layers and the number of fully-connected (FC) layers in the network by adopting a three-dimensional convolution (C3D) neural network, and defining parameters of each layer, including the size of a convolution kernel and the like. For example: 5 groups of convolutional layers (each of which comprises a convolutional layer, a excitation (ReLU) layer, and a Pooling (Pooling) layer), plus three fully-connected (FC) layers and final output layers.

Taking sample data as an input of the three-dimensional convolution (C3D) neural network, for example, for a video frame with a time length of 4 seconds, if 6 frames per second, the video frame or sample data to be processed is 24 frames, the size of each frame is 160 × 120 pixels, and the input of the three-dimensional convolution (C3D) neural network can be set to be 3 × 24 × 120 × 160 input quantities in the case that each frame contains three channels of RGB.

The outputs of the three-dimensional convolutional (C3D) neural network may be set separately, depending on the results that are desired.

If it is required to determine whether only the recognizable motion is contained in the basketball game video to be processed, the output of the three-dimensional convolution (C3D) neural network may be set as a vector with a length of 2, the values of the vectors are probabilities of "contained" or "not contained", respectively, the range is 0-1, the sum of the two vector values is 1, when the first vector is set as the probability of "contained", the second vector is set as the probability of "not contained", and the output result is (0.01, 0.99), the probability that the motion of the recognizable basketball motion is not contained in the currently input basketball game video frame is 99%, and the motion of the recognizable basketball motion is not contained in the currently input basketball game video frame may be determined.

For the case that it is required to determine whether the basketball game video to be processed contains recognizable motions and the exact motion content, i.e. name, needs to be obtained, the output of the three-dimensional convolution (C3D) neural network can be set according to the number of recognizable motions. For example, when the number of recognizable motions is 13, the output of the three-dimensional convolution (C3D) neural network may be set as a vector of length 14, the 14 vectors respectively corresponding to the probabilities of 13 recognizable motions and the case where the recognizable motion is "not included", and the sum of the 14 vector values is 1.

For the case where only the content, i.e., the name, of the action in the basketball game video to be processed needs to be determined, the output of the three-dimensional convolutional (C3D) neural network may be set according to the number of recognizable actions. For example, when the number of recognizable motions is 13, the output of the three-dimensional convolution (C3D) neural network can be set to be a vector with a length of 13, the 14 vectors respectively correspond to the probabilities of 13 recognizable motions, and the sum of the 13 vector values is 1.

When the three-dimensional convolution (C3D) neural network is trained, inputting each sample data, namely a plurality of groups of video frame data containing all basketball movement motions, into the three-dimensional convolution (C3D) neural network, if the sample data is too large and is not convenient to pass all the samples through the three-dimensional convolution (C3D) neural network, then calculating the deviation, adopting a mini-batch mode, namely calculating the deviation once after every N sample data, adjusting each parameter of the three-dimensional convolution (C3D) neural network according to the deviation, then inputting N sample data into the three-dimensional convolution (C3D) neural network until the calculated deviation meets the expectation, and finishing training the three-dimensional convolution (C3D) neural network.

And performing necessary verification on the neural network with the deviation meeting the expected three-dimensional convolution (C3D) by using the test data, and if the verification result is positive, using the neural network with the deviation meeting the expected three-dimensional convolution (C3D) as the computer neural network for calculation. If the verification result is negative, the computer neural network needs to be adjusted and the verification needs to be retrained.

And utilizing the trained three-dimensional convolution for calculation (C3D) neural network and the video frame to be processed of the video segment of the basketball game, and continuously determining other attribute parameters of the basketball game after detecting that the video frame to be processed of the basketball game contains identifiable basketball movement motions, wherein the attribute parameters comprise the starting and stopping time of the motions and the positions of the players of the motions.

For the case of determining the time corresponding to the video frames where the motion occurs and ends in the basketball game video to be processed, the output of the three-dimensional convolution (C3D) neural network may be set as a vector of length 2, and the values of the vectors may be set as the offsets of the video frames where the motion starts and ends with respect to the first and last frames of the input video frames of the three-dimensional convolution (C3D) neural network, respectively.

The offset may be set as a frame number, and for the video frame data to be processed with a frame rate of 6FPS and a duration of 4 seconds, the input of the three-dimensional convolution (C3D) neural network is 24 frames, and the number of each frame is 0 to 23 respectively. If the output of the three-dimensional convolutional (C3D) neural network is set to the number of frames offset from the first and last frames and the output is {3, -5}, this indicates that the action starts from frame 3 to frame 18 of the input video frames. For a video frame to be processed with a timestamp for each frame, the start-stop time of the action occurring in the original video frame can be calculated from the timestamp of each video frame.

For the situation of determining the positions of the players in the actions in the basketball game video to be processed, the positions of the key players in the actions, namely the areas of the players in the video frames, can be obtained by calculating the to-be-processed video frame data with the names and the time of the actions determined according to actual requirements by adopting a saliency detection method. The area can be represented in the form of coordinates or can be directly expressed as a picture similar to a thermodynamic diagram.

In addition to calculating the positions of key players of the action using data of video frames containing identifiable basketball movement actions, the positions of key players of the action in the video frames containing identifiable basketball movement actions may also be calculated using output data of convolutional layers of the computer neural network as input data of a saliency detection method.

At this point, the attribute parameters of the recognizable events contained in the video frame are determined, and the step is finished. The next step may be performed.

S103, generating interpretation data corresponding to the identifiable event according to the attribute parameters of the identifiable event.

Preferably, first, according to the attribute parameters of the recognizable events, the corresponding segment interpretation data is determined respectively.

For different attribute parameters, the interpretation data segments corresponding to the different attribute parameters are different, and the interpretation data segment corresponding to each attribute parameter can be determined by searching the interpretation data segment database by taking the attribute parameters of the recognizable events identified in the previous step as keywords.

The interpretation data can be voice data, text data, or any other form of interpretation data.

For the case where the interpretation data is voice data, the interpretation data segment database is a speech segment database that can be created and maintained by intercepting the speech of the video with speech. Or respectively re-recording corresponding voice segments of different styles according to attribute parameters of different recognizable events, establishing a voice segment database, and storing the voice segments.

Each speech segment stored by the speech segment database has an index key corresponding to an attribute parameter of a recognizable event for ease of lookup.

For example, after the action capable of identifying the basketball movement and the attribute parameters thereof are identified from the video clip of the basketball game in the previous step, the voice clip corresponding to the action capable of identifying the basketball movement is searched and identified from the voice clip database according to each attribute parameter of the action capable of identifying the basketball movement. If the recognized action is "basket-up success", the time corresponding to the video frame in which the basket-up success action occurs is "10 minutes", and the position coordinates of the player who has succeeded in basket-up is the right side of the screen, the corresponding voice segments can be determined as "basket-up success", when the game has proceeded for 10 minutes ", and" player of a certain team ", respectively. The information of the team to which the player successfully kicks up belongs can be determined according to the position and time of the player kicking up and the identification of the two parties of the game recorded by the video.

When the attribute parameter of the recognizable event corresponds to more than one speech segment in the speech segment database, one speech segment may be selected according to a preset rule for subsequent speech generation, such as random selection, or selection of voices with the same style, and the like.

And after the voice segments corresponding to the attribute parameters are selected, merging the voice segment data corresponding to the attribute parameters of the recognizable event according to a language rule to generate a voice as the voice corresponding to the recognizable event.

Taking the determined speech segments as an example, according to the grammar language habit of Chinese, the speech segments can be combined and generated as follows: "when the game is played for 10 minutes, a certain team of players successfully basket up".

For the case that the interpretation data is character data, the interpretation data segment database is a text segment database, and the database can be input, established and maintained according to the attribute parameters of the recognizable events.

Similar to the speech segment database, the stored text segments in the text segment database are also indexed by attribute parameters of recognizable events, i.e., basketball movement actions, and the stored text segments can be either the related attribute information of the basketball movement or any information related to the video or the movement.

The method for selecting corresponding text segments from the text segment database is similar to the method for selecting voice segments from the voice segment database, and when more than one text segment can be selected, the selection can be carried out according to the context or can be carried out randomly.

And after the sections of texts corresponding to the attribute parameters of the basketball action are determined, connecting the sections of texts according to language and grammar habits or connecting the sections of texts according to the application scenes of the texts to achieve the expected effect. For example, when generating the bullet-screen information data for a video, it is even possible to intentionally combine text not according to syntax to achieve a special effect.

The interpretation data of the identifiable event may include statistical information in addition to the information in the form described above, for example, for a surveillance video (corresponding to the to-be-processed video stream data in the present application), after an event to be analyzed is identified (the time to be analyzed corresponds to the identifiable event in the present application), corresponding statistical data may be adjusted according to attribute parameters of the event, such as the event type, and the like, so that a large amount of statistical data of the event to be analyzed may be obtained without manually viewing all videos, and time and cost are saved. The work based on the statistical information is convenient to carry out and is completed quickly.

At this point, according to the attribute parameters of the recognizable time, the interpretation data corresponding to the recognizable event is generated in a combined manner.

Thereafter, the interpretation data may be further synthesized with the to-be-processed video stream data, including adding the interpretation information to a position of a video frame corresponding to the recognizable event of the to-be-processed video segment, such as adding voice data to a corresponding position of the video frame, or adding the text data to the video frame, and so on.

And continuously intercepting the video frames of the videos of the sports programs to be played and carrying out corresponding processing according to the descriptions in the steps, so that the commentary voice or the commentary subtitles can be generated for the videos of the sports programs. Other information such as a bullet screen can be added to the program, so that the program is more interesting.

In addition to the above-mentioned manner of synchronizing the interpretation data, it is even possible to synchronize more than one kind of interpretation data to the video to be processed, such as adding synchronized caption interpretation information to the video to be processed and adding synchronized narration voice data at the same time. Therefore, the original video without the interpretation information has various interpretation information, the expression modes of the video are enriched, and the use range of the video is widened.

The above is an embodiment of the present application, which is a method for generating interpretation data according to video data, and generates a voice corresponding to a recognizable event from attribute parameters of the recognizable event recognized in a video. The voice corresponding to the event in the video can be generated for the video without manually watching the video, so that the effect of saving the program manufacturing cost is achieved. And can play a role in enabling programs to be played more quickly and timely.

A second embodiment of the present application provides a method for synthesizing video and interpretation data, and a flowchart thereof is shown in fig. 2.

S201, acquiring the video to be processed in the stream format.

The video may be a complete video of a sporting event or a compilation of a plurality of highlights. The format is a streaming format, and portions thereof can be processed without affecting other portions of the food product.

This step obtains the complete video or obtains a part of the video streamed. Such as a streaming video of a sporting event.

S202, slicing the video to be processed to generate a video clip with preset duration, wherein the video clip to be processed comprises a group of video frames to be processed.

The method comprises the steps of slicing and segmenting an acquired video according to a preset time length to generate a video segment with the preset time length, wherein the preset time length is set according to the characteristics of the content of the acquired video, so that an identifiable event is guaranteed to be contained in one segment of the video, and the video segment contains a group of video frames.

The obtained complete video can be sliced, and the received partial video can be sliced when enough data is received for receiving the sports game video obtained by streaming transmission. And obtaining the video clip to be processed.

S203, aiming at each group of video frames to be processed, judging whether the video frames to be processed contain identifiable events.

The method for determining whether the processed video frames contain the recognizable events in the first embodiment of the present application can be used to determine whether each set of processed video frames contains the recognizable events.

For the case of slicing the complete video, the judgment and subsequent processing can be performed for all sets of video frames of all video segments at one time.

And for the case of receiving partial video, only each group of video frames after the received slice is processed. Such as a video of a sports game, slicing and corresponding judgment are made only on the received portion.

S204, when the data is contained, executing the following steps: determining identifiable events and attribute parameters thereof from a video frame to be processed; generating interpretation data corresponding to the identifiable events according to the attribute parameters of the identifiable events; and synthesizing the interpretation data and the video clip to be processed to form a processed video clip containing the video clip to be processed and the interpretation data.

When the method of the previous step is applied to judge that the video frame to be processed contains the identifiable event, determining the attribute parameters of the identifiable event contained in the video to be processed with the judged identifiable event, and generating the interpretation data corresponding to the identifiable event according to the attribute parameters. And the interpretation data and the video clip to be processed are added into a processed video clip containing the video clip to be processed and the interpretation data.

For a video segment in which no recognizable event is detected, it is referred to as an unprocessed video segment.

For example, for a video of a sports game in a streaming format, although only a bonus video is received, the video can be judged, after a recognizable action is judged to occur, attribute parameters of the action are determined, corresponding interpretation data such as commentary voice data are determined according to the attribute parameters, and the commentary voice data are combined with a video clip to be processed of a food of the sports game to form a processed video clip.

And S205, aggregating the unprocessed video segments and the processed video segments in the video to be processed into the processed video according to the sequence of the video segments to be processed.

For the case where the complete video is processed to obtain all processed videos, all processed video segments and all unprocessed video segments may be aggregated into a processed video in the order of their original stream format.

For the case that the video in the partial stream format is received and is correspondingly processed, all the received processed video segments and all the received unprocessed video segments can be aggregated into the processed video according to the sequence of the original stream format.

For example, the video of the sports game in the previous step, the processed video containing the commentary voice and the processed video containing no voice are continuously collected according to the sequence of the streaming format, so that the effect of adding the synchronous commentary voice to the sports game can be achieved.

The application of the method provided by the embodiment will be briefly described below by taking the synthesis of the sports video and the narration voice as an example, as shown in fig. 3,

firstly, the acquired video is subjected to necessary preprocessing, including encoding and decoding, slicing and the like, so as to obtain a video segment to be processed. Each video clip contains a set of video frames.

Identifying whether the video segment contains an identifiable sporting event. And determining attribute parameters of the sporting event, including the time, action, and location of the athlete at the event.

And determining the narration voice according to the attribute parameters.

Aggregating the voice with the video clip.

And then, continuously repeating the steps on the video data needing to be acquired later, so as to achieve the effect of synthesizing the sports video and the commentary voice.

A third embodiment of the present application provides an apparatus for generating interpretation data from video data, a block diagram of which is shown in fig. 4, the apparatus comprising the following units: a U301 acquisition unit, a U302 identification unit, and a U303 generation unit.

The obtaining unit U301 is configured to obtain a video clip to be processed.

The identifying unit U302 is configured to identify an identifiable event and an attribute parameter thereof from a to-be-processed video frame of the to-be-processed video clip.

The identification unit U302 is specifically configured to identify an identifiable event and attribute parameters thereof from the to-be-processed video frame of the to-be-processed video clip by using a computer neural network for calculation obtained by training the marked video data and the data of the to-be-processed video frame and the video frame of the to-be-processed video clip.

Optionally, the identification unit U302 may include a detection subunit and a determination subunit.

The detection subunit is configured to detect whether an identifiable event is included in a to-be-processed video frame of the to-be-processed video segment.

Preferably, the detecting subunit is specifically configured to detect, by using the computer neural network for calculation obtained by training the marked video data and the data of the to-be-processed video frame of the to-be-processed video segment, whether an identifiable event is included in the to-be-processed video frame of the to-be-processed video segment.

And the determining subunit is used for determining the attribute parameters of the identifiable events if the identifiable events are contained in the identified events.

Optionally, the determining subunit is specifically configured to, if the event identifier includes the event identifier, determine the attribute parameter of the identifiable event by using the data of the to-be-processed video frame of the to-be-processed video segment and the computer neural network for calculation obtained by training using the marked video data.

The generating unit U303 is configured to generate interpretation data corresponding to the identifiable event according to the attribute parameter of the identifiable event.

Preferably, the generating unit U303 is specifically configured to generate the voice data corresponding to the recognizable event according to the attribute parameter of the recognizable event.

Accordingly, the number of the first and second electrodes,

the generating unit may include a speech segment determining subunit and a speech generating subunit.

The voice segment determining subunit is configured to determine, according to the attribute parameter of the identifiable event, each corresponding voice segment data respectively;

and the voice generating subunit is configured to generate each piece of voice segment data into a piece of voice as the voice corresponding to the identifiable event.

Preferably, the generating unit U303 is also specifically configured to generate the text data corresponding to the identifiable event according to the attribute parameter of the identifiable event.

Accordingly, the number of the first and second electrodes,

the generating unit comprises a text determining subunit and a character generating subunit.

And the text determining subunit is used for respectively determining each text data corresponding to the identifiable event according to the attribute parameters of the identifiable event.

Preferably, the apparatus for generating interpretation data from video data may further include a sampling unit and a unit to be processed.

The sampling unit is used for sampling and decoding the video stream data to be processed according to a preset frame rate;

and the unit to be processed is used for taking the video frame obtained after the sampling decoding as the video frame to be processed.

Preferably, the apparatus for generating interpretation data according to video data further comprises a synthesizing unit, and the synthesizing unit is configured to synthesize the interpretation data and the video to be processed.

Preferably, the means for generating interpretation data from the video data may be operable to generate commentary speech from sports programming video.

A fourth embodiment of the present application provides an apparatus for synthesizing video and interpretation data, a block diagram of which is shown in fig. 5, and the apparatus includes: an obtaining unit U401, configured to obtain a video to be processed in a stream format;

a slicing unit U402, configured to slice the video to be processed to generate a video segment with a predetermined duration, where the video segment to be processed includes a group of video frames to be processed;

a determining unit U403, configured to determine, for each group of to-be-processed video frames, whether the to-be-processed video frames include an identifiable event;

an inclusion unit U404, configured to, when included, perform the following steps: determining identifiable events and attribute parameters thereof from a video frame to be processed; generating interpretation data corresponding to the identifiable events according to the attribute parameters of the identifiable events; synthesizing the interpretation data and the video clip to be processed to form a processed video clip containing the video clip to be processed and the interpretation data;

and the aggregation unit U405 is configured to aggregate the unprocessed video segments and the processed video segments in the video to be processed into the processed video according to the sequence of the video segments to be processed.

A fifth embodiment of the present application provides an electronic device for composition of video and interpretive data, comprising a processor and a memory,

acquiring a video clip to be processed;

Although the present application has been described with reference to the preferred embodiments, it is not intended to limit the present application, and those skilled in the art can make variations and modifications without departing from the spirit and scope of the present application, therefore, the scope of the present application should be determined by the claims that follow.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

1. Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include non-transitory computer readable media (transmyedia), such as modulated data signals and carrier waves.

2. As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Claims

1. A method of generating interpretation data from video data, comprising the steps of:

acquiring a video clip to be processed;

identifying identifiable events and attribute parameters thereof from the video frames to be processed of the video clips to be processed; wherein the recognizable event refers to an event occurring when the video segment to be processed can be recognized; the attribute parameters comprise the time corresponding to the video frame of the identifiable event;

generating interpretation data corresponding to the identifiable events according to the attribute parameters of the identifiable events, wherein the interpretation data comprises: respectively determining each section of interpretation data corresponding to the attribute parameters according to the attribute parameters of the identifiable events, and merging the sections of interpretation data corresponding to each attribute parameter to generate interpretation data of the identifiable events;

synthesizing the interpretation data and the video clip to be processed;

wherein the identifying identifiable events and attribute parameters thereof from the video frames to be processed of the video clip to be processed comprises: detecting whether a recognizable event is included within a to-be-processed video frame of the to-be-processed video segment; if yes, determining attribute parameters of the identifiable events;

the determining attribute parameters of the identifiable events comprises: and determining attribute parameters of the identifiable events by using the computer neural network for calculation obtained by training the marked video data and the data of the video frames to be processed of the video clips to be processed.

2. The method of claim 1, wherein said detecting whether a recognizable event is contained within a pending video frame of said pending video segment comprises:

3. A method of generating interpretation data from video data according to claim 2, wherein the computer neural network comprises a three-dimensional convolutional neural network or a long-short term memory artificial neural network.

4. Method of generating interpretation data from video data according to claim 1, wherein the attribute parameters of the identifiable events comprise at least one of:

the time corresponding to the video frame at which the identifiable event ends, the name of the identifiable event, and the location of the participant of the identifiable event.

5. The method of claim 1, wherein the computer neural network comprises: three-dimensional convolutional neural networks or long-short term memory artificial neural networks.

6. The method of claim 1, wherein the identifying recognizable events and their attribute parameters from the video frames to be processed of the video clip comprises:

7. A method of generating interpretation data from video data according to claim 1, wherein the interpretation data comprises voice data;

8. A method of generating interpretation data from video data according to claim 1, wherein the interpretation data comprises textual data;

9. The method of claim 1, wherein the step of identifying recognizable events and their attribute parameters from the video frames to be processed of the video clip further comprises the steps of:

10. A method of generating interpretation data from video data according to any of claims 1 to 9 for generating commentary speech from sports programming video.

11. A method of compositing video with interpretive data, comprising the steps of:

acquiring a video to be processed in a stream format;

judging whether each group of video frames to be processed contains identifiable events or not; wherein the recognizable event refers to an event occurring when the video segment to be processed can be recognized;

when the method comprises the following steps: determining identifiable events and attribute parameters thereof from a video frame to be processed; generating interpretation data corresponding to the identifiable events according to the attribute parameters of the identifiable events, wherein the interpretation data comprises: respectively determining each section of interpretation data corresponding to the attribute parameters according to the attribute parameters of the identifiable events, and merging the sections of interpretation data corresponding to each attribute parameter to generate interpretation data of the identifiable events;

synthesizing the interpretation data and the video clip to be processed to form a processed video clip containing the video clip to be processed and the interpretation data; wherein the attribute parameters comprise the time corresponding to the video frame of the occurrence of the identifiable event; aggregating unprocessed video clips and processed video clips in the video to be processed into processed video according to the sequence of the video clips to be processed;

wherein, the step of identifying the recognizable event and the attribute parameter thereof from the video frame to be processed of the video clip to be processed comprises the following steps: detecting whether a recognizable event is included within a to-be-processed video frame of the to-be-processed video segment; if yes, determining attribute parameters of the identifiable events;

12. An apparatus for generating interpretation data from video data, comprising:

the acquisition unit is used for acquiring a video clip to be processed;

the identification unit is used for identifying identifiable events and attribute parameters thereof from the video frames to be processed of the video clips to be processed; wherein the recognizable event refers to an event occurring when the video segment to be processed can be recognized; the attribute parameters comprise the time corresponding to the video frame of the identifiable event;

the generating unit is used for generating interpretation data corresponding to the identifiable event according to the attribute parameters of the identifiable event, and comprises the following steps: respectively determining each section of interpretation data corresponding to the attribute parameters according to the attribute parameters of the identifiable events, and merging the sections of interpretation data corresponding to each attribute parameter to generate interpretation data of the identifiable events;

a synthesizing unit for synthesizing the interpretation data with the video to be processed;

the identification unit includes: a detection subunit, configured to detect whether an identifiable event is included in a to-be-processed video frame of the to-be-processed video segment;

the determining subunit is used for determining the attribute parameters of the identifiable events if the identifiable events are contained in the identified events; the determining subunit is specifically configured to: if yes, utilizing the computer neural network for calculation obtained by training the marked video data and the data of the video frame to be processed of the video clip to be processed to determine the attribute parameters of the identifiable event.

13. The apparatus for generating interpretation data from video data according to claim 12, wherein the detecting subunit is specifically configured to:

14. The apparatus for generating interpretive data from video data according to claim 12, wherein the identification unit is specifically configured to:

15. The apparatus according to claim 12, wherein the generating unit is specifically configured to generate the voice data corresponding to the recognizable event according to the attribute parameter of the recognizable event;

accordingly, the generating unit comprises:

16. The apparatus according to claim 12, wherein the generating unit is specifically configured to generate the text data corresponding to the recognizable event according to the attribute parameter of the recognizable event;

accordingly, the generating unit comprises:

17. An apparatus for generating interpretation data from video data according to claim 12, further comprising:

18. Apparatus for generating interpretation data from video data according to any of claims 12 to 17 for generating commentary speech from sports programming video.

19. An apparatus for compositing video with interpretive data, comprising:

the judging unit is used for judging whether the video frames to be processed contain identifiable events or not aiming at each group of video frames to be processed; wherein the recognizable event refers to an event occurring when the video segment to be processed can be recognized;

an inclusion unit, configured to, when included, perform the following steps: determining identifiable events and attribute parameters thereof from a video frame to be processed; wherein, the step of identifying the recognizable event and the attribute parameter thereof from the video frame to be processed of the video clip to be processed comprises the following steps: detecting whether a recognizable event is included within a to-be-processed video frame of the to-be-processed video segment; if yes, determining attribute parameters of the identifiable events;

the determining attribute parameters of the identifiable events comprises: determining attribute parameters of the identifiable events by using a computer neural network for calculation obtained by training marked video data and data of video frames to be processed of the video segments to be processed;

synthesizing the interpretation data and the video clip to be processed to form a processed video clip containing the video clip to be processed and the interpretation data; wherein the attribute parameters comprise the time corresponding to the video frame of the occurrence of the identifiable event;

20. An electronic device for the composition of video and interpretive data, comprising a processor and a memory,

acquiring a video clip to be processed;

identifying identifiable events and attribute parameters thereof from the video frames to be processed of the video clips to be processed; wherein the recognizable event refers to an event occurring when the video segment to be processed can be recognized; the attribute parameters comprise the time corresponding to the video frame of the identifiable event; wherein the identifying identifiable events and attribute parameters thereof from the video frames to be processed of the video clip to be processed comprises: detecting whether a recognizable event is included within a to-be-processed video frame of the to-be-processed video segment; if yes, determining attribute parameters of the identifiable events;

and synthesizing the interpretation data and the video clip to be processed.