CN114547370A - Video abstract extraction method and system

Info

Publication number
CN114547370A
Authority
CN
China
Prior art keywords
text
video
extracted
abstract
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210135790.9A
Other languages
Chinese (zh)
Inventor
王诏
边凯归
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202210135790.9A
Publication of CN114547370A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames

Abstract

The invention provides a video abstract extraction method, which relates to the technical field of data abstract extraction. The method extracts an audio file from a video to be extracted and divides the video to be extracted into a plurality of video frames to be extracted; inputs the audio file into a voice-to-text model to obtain a time sequence text file of the video to be extracted; determines a non-time-sequence text file and a text timestamp-text mapping relation from the time sequence text file; inputs the non-time-sequence text file into a text abstract extraction model to obtain a text abstract of the video to be extracted; determines a picture abstract of the video to be extracted from an object recognition model and the plurality of video frames to be extracted; and determines the video abstract of the video to be extracted from the text abstract, the picture abstract and the text timestamp-text mapping relation, thereby enabling abstract extraction for videos without subtitles or scripts.

Description

Video abstract extraction method and system
Technical Field
The invention relates to the technical field of data abstract extraction, in particular to a video abstract extraction method and system.
Background
Speech-to-text technology and automatic text summarization technology have developed greatly and are widely applied. Speech recognition has advanced continuously, from recognizing single words by template matching, to recognizing speech segments with hidden Markov models, to recognizing speech segments end to end with neural networks, and is applied in fields such as voice assistants, smart homes and autonomous driving. Automatic text summarization has likewise advanced, from extracting key sentences from a text using keywords to extracting text features with a neural network and generating new summaries from the text content. It is used in fields such as information retrieval, automatic book marking and indexing, and news information services.
Speech-to-text and text summarization are thus mature technologies that are widely applied in their respective fields. With the development of 5G, the capacity and speed of network transmission have reached a new level, and video has become the main multimedia content on the network. The emergence of massive amounts of video raises a new requirement: how to quickly select a suitable video to watch.
Existing text-based video summarization technology mainly uses a dynamic time-sequence matching algorithm to align subtitles with scripts, then measures the importance of video frames by analysing the script, and finally clips the video into a summary according to the obtained importance. However, such summarization can only be used for special videos that have both subtitles and scripts, so its range of application is relatively limited.
Disclosure of Invention
The invention aims to provide a video abstract extraction method, which can extract an abstract of a video without subtitles and scripts and improve the precision of abstract extraction.
In order to achieve the purpose, the invention provides the following scheme:
a video abstract extraction method comprises the following steps:
acquiring a video to be extracted;
extracting an audio file of the video to be extracted, and dividing the video to be extracted into a plurality of video frames to be extracted; each video frame to be extracted corresponds to a video frame timestamp; the video frame timestamp is used for describing the starting time, the ending time and the duration of the corresponding video frame to be extracted in the video to be extracted;
inputting the audio file into a voice-to-text model to obtain a time sequence text file of the video to be extracted; the voice-to-text model is obtained by training a first deep neural network by using an audio file training set; each sentence of text in the time sequence text file corresponds to a text timestamp; the text timestamp is used for describing the starting time, the ending time and the duration of a corresponding sentence in the video to be extracted;
determining a non-time sequence text file and a text timestamp-text mapping relation according to the time sequence text file;
inputting the non-time sequence text file into a text abstract extraction model to obtain a text abstract of the video to be extracted; the text abstract extraction model is obtained by training a second deep neural network by using a text file training set;
determining a picture abstract of the video to be extracted according to an object recognition model and a plurality of video frames to be extracted; the object recognition model is obtained by training a third deep neural network by utilizing a video frame training set;
and determining the video abstract of the video to be extracted according to the text abstract, the picture abstract and the text timestamp-text mapping relation.
Optionally, before the obtaining the video to be extracted, the method further includes:
acquiring an audio file training set; the audio file training set comprises a plurality of historical audio files and historical text files corresponding to the historical audio files;
and training the first deep neural network by taking the historical audio file as input and taking a historical text file corresponding to the historical audio file as output to obtain a voice-to-text model.
Optionally, before the obtaining the video to be extracted, the method further includes:
acquiring a text file training set; the text file training set comprises a historical text file and a text abstract corresponding to the historical text file;
and training a second deep neural network by taking the historical text file as input and the text abstract corresponding to the historical text file as output to obtain a text abstract extraction model.
Optionally, before the obtaining the video to be extracted, the method further includes:
acquiring a video frame training set, and marking the objects in the video frame training set with rectangular marking frames to obtain a plurality of marked video frames;
and training a third deep neural network by taking the plurality of marked video frames as input and the rectangular marked frame as output to obtain an object recognition model.
Optionally, the determining, according to the object recognition model and the plurality of video frames to be extracted, the picture summary of the video to be extracted specifically includes:
inputting the plurality of video frames to be extracted into an object recognition model to obtain a plurality of marked video frames to be extracted;
determining any marked video frame to be extracted as a current marked video frame to be extracted;
determining the weight W_i of each object in the current marked video frame to be extracted according to the objects and categories contained in that frame, using an object-weight formula (given as an image in the original publication) that combines the class weight W_cls of the object with the size and position of its rectangular marking frame; wherein W_i is the object weight of the i-th object in the marked video frame to be extracted, and W_cls is the class weight of the object, given by

W_cls = N_cls / N,

wherein N is the total number of marked video frames to be extracted and N_cls is the number of marked video frames to be extracted containing an object of class cls; x_i, y_i are the coordinates of the center point of the rectangular marking frame, w_i, h_i are the width and height of the rectangular marking frame, and W and H are the width and height of the video frame to be extracted;
according to the object weights of the objects in the marked video frame to be extracted, determining the picture weight of the current marked video frame to be extracted by using the formula W_img = α × W_i1 + β × W_i2; wherein W_img is the weight of the current marked video frame to be extracted, W_i1 and W_i2 are the object weights of the two objects with the largest weights in the current marked video frame to be extracted, and both α and β are hyper-parameters;
performing descending arrangement on the plurality of video frames to be extracted according to the object weight to obtain a video frame sequence to be extracted;
determining a first element in the video frame sequence to be extracted as a summary picture;
and updating the video frame sequence to be extracted and returning to the step of determining that the first element in the video frame sequence to be extracted is the abstract picture until the number of elements in the video frame sequence to be extracted is 0, so as to obtain a plurality of abstract pictures as the picture abstract of the video to be extracted.
Optionally, the updating the sequence of the video frames to be extracted specifically includes:
determining the similarity of each element except a first element in a video frame sequence to be extracted and the first element;
and deleting the first element in the video frame sequence to be extracted and the elements with the similarity larger than the similarity threshold value to obtain an updated video frame sequence to be extracted.
Optionally, the determining the video abstract of the video to be extracted according to the text abstract, the picture abstract and the text timestamp-text mapping relationship specifically includes:
splitting the text abstract into a plurality of abstract sentence vectors;
determining a plurality of abstract sentence vectors and corresponding text timestamps in the time sequence text file as a text timestamp set according to the text timestamp-text mapping relation;
determining a video frame timestamp corresponding to the picture summary as a video frame timestamp set;
intercepting a plurality of sections of initial abstract videos in the video to be extracted according to the intersection of the text timestamp set and the video frame timestamp set;
and splicing the multiple sections of initial abstract videos by using an ffmpeg tool to obtain the video abstract of the video to be extracted.
A video summary extraction system, comprising:
the to-be-extracted video acquisition module is used for acquiring a to-be-extracted video;
the audio file extraction module is used for extracting the audio file of the video to be extracted and dividing the video to be extracted into a plurality of video frames to be extracted; each video frame to be extracted corresponds to a video frame timestamp; the video frame timestamp is used for describing the starting time, the ending time and the duration of the corresponding video frame to be extracted in the video to be extracted;
the time sequence text file determining module is used for inputting the audio file into a voice-to-text model to obtain a time sequence text file of the video to be extracted; the voice-to-text model is obtained by training a first deep neural network by using an audio file training set; each sentence of text in the time sequence text file corresponds to a text timestamp; the text timestamp is used for describing the starting time, the ending time and the duration of a corresponding sentence in the video to be extracted;
the text timestamp-text mapping relation determining module is used for determining a non-time sequence text file and a text timestamp-text mapping relation according to the time sequence text file;
the text abstract determining module is used for inputting the non-time sequence text file into a text abstract extracting model to obtain a text abstract of the video to be extracted; the text abstract extraction model is obtained by training a second deep neural network by using a text file training set;
the picture abstract determining module is used for determining the picture abstract of the video to be extracted according to the object recognition model and the plurality of video frames to be extracted; the object recognition model is obtained by training a third deep neural network by utilizing a video frame training set;
and the video abstract determining module is used for determining the video abstract of the video to be extracted according to the text abstract, the picture abstract and the text timestamp-text mapping relation.
Optionally, the system further includes:
the audio file training set acquisition module is used for acquiring an audio file training set; the audio file training set comprises a plurality of historical audio files and historical text files corresponding to the historical audio files;
and the voice-to-text model determining module is used for training the first deep neural network by taking the historical audio file as input and the historical text file corresponding to the historical audio file as output to obtain a voice-to-text model.
Optionally, the system further includes:
the text file training set acquisition module is used for acquiring a text file training set; the text file training set comprises a historical text file and a text abstract corresponding to the historical text file;
and the text abstract extraction model determining module is used for training the second deep neural network by taking the historical text file as input and the text abstract corresponding to the historical text file as output to obtain a text abstract extraction model.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention aims to provide a video abstract extraction method, which comprises the steps of converting an audio file into a text file by constructing a voice-to-text model, extracting the text abstract by utilizing a text abstract extraction model, constructing an object recognition model to extract a picture abstract, and fusing the text abstract and the picture abstract to obtain a video abstract of a video to be extracted; in addition, the invention can extract the abstract of the data in text format, audio format and video format, thus expanding the application range of abstract extraction.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of a video summary extraction method according to an embodiment of the present invention;
FIG. 2 is a first schematic diagram of a video summary extraction method according to an embodiment of the present invention;
fig. 3 is a second schematic diagram of a video summary extraction method according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a video abstract extraction method, which can extract an abstract of a video without subtitles and scripts and improve the precision of abstract extraction.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the present invention provides a video summary extracting method, including:
step 101: acquiring a video to be extracted;
step 102: extracting an audio file of a video to be extracted, and dividing the video to be extracted into a plurality of video frames to be extracted; each video frame to be extracted corresponds to a video frame time stamp; the video frame timestamp is used for describing the starting time, the ending time and the duration of the corresponding video frame to be extracted in the video to be extracted;
step 103: inputting the audio file into a voice-to-text model to obtain a time sequence text file of the video to be extracted; the voice-to-text model is obtained by training a first deep neural network by using an audio file training set; each sentence of text in the time sequence text file corresponds to a text timestamp; the text timestamp is used for describing the starting time, the ending time and the duration of the corresponding sentence in the video to be extracted;
step 104: determining a non-time sequence text file and a text timestamp-text mapping relation according to the time sequence text file;
step 105: inputting the non-time sequence text file into a text abstract extraction model to obtain a text abstract of a video to be extracted; the text abstract extraction model is obtained by training the second deep neural network by using a text file training set;
step 106: determining a picture abstract of a video to be extracted according to the object identification model and the plurality of video frames to be extracted; the object recognition model is obtained by training the third deep neural network by utilizing a video frame training set;
step 107: and determining the video abstract of the video to be extracted according to the text abstract, the picture abstract and the text timestamp-text mapping relation.
Before step 101, further comprising:
acquiring an audio file training set; the audio file training set comprises a plurality of historical audio files and historical text files corresponding to the historical audio files;
and training the first deep neural network by taking the historical audio file as input and taking the historical text file corresponding to the historical audio file as output to obtain a voice-to-text model.
Before step 101, further comprising:
acquiring a text file training set; the text file training set comprises a historical text file and a text abstract corresponding to the historical text file;
and training the second deep neural network by taking the historical text file as input and the text abstract corresponding to the historical text file as output to obtain a text abstract extraction model.
Before step 101, further comprising:
acquiring a video frame training set, and marking the objects in the video frame training set with rectangular marking frames to obtain a plurality of marked video frames;
and training the third deep neural network by taking the plurality of marked video frames as input and the rectangular marked frame as output to obtain an object recognition model.
Step 106, specifically comprising:
inputting a plurality of video frames to be extracted into an object recognition model to obtain a plurality of marked video frames to be extracted;
determining any marked video frame to be extracted as a current marked video frame to be extracted;
determining the weight W_i of each object in the currently marked video frame to be extracted according to the objects and categories contained in that frame, using an object-weight formula (given as an image in the original publication) that combines the class weight W_cls of the object with the size and position of its rectangular marking frame; wherein W_i is the object weight of the i-th object in the marked video frame to be extracted, and W_cls is the class weight of the object, given by

W_cls = N_cls / N,

wherein N is the total number of marked video frames to be extracted and N_cls is the number of marked video frames to be extracted containing an object of class cls; x_i, y_i are the coordinates of the center point of the rectangular marking frame, w_i, h_i are the width and height of the rectangular marking frame, and W and H are the width and height of the video frame to be extracted;
according to the object weights of the objects in the marked video frame to be extracted, determining the picture weight of the currently marked video frame to be extracted by using the formula W_img = α × W_i1 + β × W_i2; wherein W_img is the weight of the currently marked video frame to be extracted, W_i1 and W_i2 are the object weights of the two objects with the largest weights in the currently marked video frame to be extracted, and both α and β are hyper-parameters;
performing descending arrangement on a plurality of video frames to be extracted according to the object weight to obtain a video frame sequence to be extracted;
determining a first element in a video frame sequence to be extracted as a summary picture;
and updating the video frame sequence to be extracted and returning to the step of determining that the first element in the video frame sequence to be extracted is the abstract picture until the number of elements in the video frame sequence to be extracted is 0, so as to obtain a plurality of abstract pictures as the picture abstract of the video to be extracted.
Specifically, updating the sequence of the video frames to be extracted specifically includes:
determining the similarity of each element except the first element in the video frame sequence to be extracted and the first element;
and deleting the first element in the video frame sequence to be extracted and the elements with the similarity greater than the similarity threshold value to obtain the updated video frame sequence to be extracted.
Step 107, specifically including:
splitting the text abstract into a plurality of abstract sentence vectors;
determining a plurality of abstract sentence vectors and corresponding text timestamps in the time sequence text file as a text timestamp set according to a text timestamp-text mapping relation;
determining a video frame timestamp corresponding to the picture abstract as a video frame timestamp set;
intercepting a plurality of sections of initial abstract videos in a video to be extracted according to the intersection of the text timestamp set and the video frame timestamp set;
and splicing the multiple sections of initial abstract videos by using an ffmpeg tool to obtain a video abstract of the video to be extracted.
Specifically, as shown in fig. 2-3, the method for extracting the key information summary provided by the present invention includes:
first, the overall process:
1. inputting a multi-channel data file name of the internet;
2. judging the specific type of the internet multichannel data:
2.1, judging whether the data type is a text, if so, skipping to 6, and if not, skipping to 2.2;
2.2, judging whether the data type is a picture, if so, skipping to 7, and if not, skipping to 2.3;
and 2.3, judging whether the data type is a video, if so, jumping to 3, and if not, ending the program.
3. Separating out the audio by using an audio/video separation tool, and intercepting video frames.
4. Judging whether the output result of step 3 is audio; if so, jumping to 5, and if not, jumping to 7.
5. Inputting the audio into a speech-to-text model to obtain text content with timestamps, and removing the timestamps after storing the mapping relation between the timestamps and the text to obtain plain text content.
6. Inputting the text into the automatic text abstract model to obtain a text abstract.
7. Inputting the pictures into a picture abstract model to obtain a picture abstract, and classifying the pictures according to the categories of their main objects.
8. Inputting the obtained text content and picture abstract into a video abstract model based on the text abstract and the picture abstract to obtain the video abstract.
The text abstract model is an extractive summarization model pre-trained using public data sets. The abstract model can be a traditional model based on information such as word frequency and position, or an end-to-end deep learning model. The text is split into sentences at sentence-ending punctuation marks such as '.', '!' and '?'; an open-source word segmentation tool is used to segment the sentences, common stop words are removed, a dictionary is constructed, and the words are vectorized. k keywords are screened out according to information such as the frequency, position and part of speech of the words. The vectorized text is taken as the input of the abstract model; the keywords are also added to the input of the model as auxiliary information. The abstract model outputs a text abstract.
Specifically, the picture abstract model is a target detection model trained using a public data set: the pictures are input into the pre-trained model, and the model gives the position and classification category of each object. According to the category of each object and the proportion of the picture it occupies, a weight is given to each object in the picture, and the weight of the picture is computed from the weights of all or part of the objects in it. Pictures with high weight and a low repetition rate are selected as the picture abstract.
Secondly, the concrete implementation:
1. training a text abstract model, a target detection model and a voice recognition model.
1.1 Training the text abstract model. The automatic text summarization model is first pre-trained on a common pre-training data set such as Common Crawl. Common Crawl is a public data set obtained by crawling web pages since 2008; it provides corpora in some 40 languages that can be downloaded directly from the official website. In this implementation, the Chinese corpus in Common Crawl is used as the pre-training data set. During pre-training, 15% of the words are selected for replacement (of the selected words, 80% are replaced by a mask token, 10% keep the original word, and 10% are replaced by a random word), and the model is trained to predict the replaced words. After pre-training is finished, the parameter weights obtained by pre-training are used as the initial parameter weights of the automatic text abstract model, which is then fine-tuned with a Chinese automatic text summarization data set; such data sets include THUCNews (the Tsinghua Chinese news data set) and the NLPCC 2017 summarization data set (the summarization task data set of the 2017 conference on Natural Language Processing and Chinese Computing). The THUCNews data set was generated by sorting and filtering data from the Sina News RSS subscription channel in 2005-2011; the data contain texts and titles, and the titles can be regarded as abstracts. NLPCC 2017 is the data set of the summarization task of the 2017 NLPCC competition and contains texts and summaries. During fine-tuning after pre-training, the article is fed to the model as input together with the abstract as the prediction target, and the model parameters obtained by pre-training are updated iteratively. The input of the fine-tuned model is an article (a longer text), and the output is a summary generated by the model (a shorter text containing the important information).
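As an illustration of the pre-training objective described above, the following is a minimal sketch of the 15% masked-word selection with the 80%/10%/10% replacement split; it is a plain-Python sketch under assumptions, not the implementation used in the patent, and the token ids, the MASK_ID value and the vocabulary size are hypothetical.

```python
import random

MASK_ID = 0          # hypothetical id of the mask token
VOCAB_SIZE = 30000   # hypothetical vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    """Return (corrupted_ids, labels); labels are -100 where no prediction is required."""
    corrupted, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if random.random() >= mask_prob:                 # 85% of positions are left untouched
            continue
        labels[i] = tok                                  # the model must predict the original word here
        r = random.random()
        if r < 0.8:
            corrupted[i] = MASK_ID                       # 80%: replace with the mask token
        elif r < 0.9:
            corrupted[i] = random.randrange(VOCAB_SIZE)  # 10%: replace with a random word
        # remaining 10%: keep the original word
    return corrupted, labels

ids, labels = mask_tokens([101, 2769, 1440, 3221, 102])
```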
1.2 training the target detection model. The model is pre-trained on a larger data set, such as ImageNet (image network data set), so that the model can better extract picture features. The pre-training task is a picture classification task, the input is a picture, and the output is the category of the picture. And after the pre-training is finished, discarding the input layer and the output layer, and only keeping the network layer used for extracting the picture characteristics in the middle for training a target detection model later. ImageNet is a large-scale picture database with over 1400 million labeled data, and provides download links and thumbnails of pictures due to copyright issues. Since each picture of a common picture classification data set only has one object, and picture classification of a plurality of objects is difficult to perform, a picture classification model is not directly adopted in implementation, but a target detection model is adopted to obtain main objects of the picture, and then the class of the picture is given according to the main objects. After pre-training, new input layers and output layers are built according to the data format of the special data set for target detection, and parameters of the two neural network layers are initialized randomly. The special data set for target detection comprises COCO-2017(Common Objects in Context, version of Common object identification data set 2017), OpenImageData (open image data set) and the like, and after the special data set is downloaded from an official website, the data are input into a target detection model consisting of a new input layer, a pre-trained intermediate layer and a new output layer, and parameters are optimized in an iterative mode. After training, a picture is input into the model, and the model gives the position of each object in the picture and the most possible category of each object.
1.3 Training the speech recognition model. An end-to-end speech recognition model is trained on a Chinese speech recognition training set such as THCHS-30 (the 30-hour Chinese speech corpus of Tsinghua University) or AISHELL. THCHS-30 is an open-source Chinese speech data set published by the Center for Speech and Language Technologies of Tsinghua University; the speech was recorded with a single carbon microphone in a quiet office environment, with a total duration of more than 30 hours, a sampling frequency of 16 kHz and a sampling precision of 16 bits. AISHELL is a series of speech recognition data sets released by Beijing Shell Shell Technology Co., Ltd.; the latest release is AISHELL-3, with a total recording duration of about 85 hours comprising 88035 utterances, recorded at 44.1 kHz with 16-bit precision. The speech data and the text data are input into the end-to-end speech recognition model for training, giving a trained speech recognition model. After training, a piece of audio is input into the model, and the model outputs the text content with timestamps.
2. The user inputs the file name of the multimedia content to the overall control module, and the overall control module judges the type of the input multimedia content. The data type can be determined from the file suffix entered by the user. If the file input by the user is not a text, picture or video file, an error message is output and the run is terminated. If a text file is input, jump to 5; if a picture file is input, jump to 6; if a video file is input, jump to 3 and continue execution.
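A minimal sketch of the suffix-based type dispatch described above; the suffix lists and the returned branch labels are illustrative assumptions, not names from the patent.

```python
from pathlib import Path

TEXT_EXT = {".txt", ".md"}
IMAGE_EXT = {".jpg", ".jpeg", ".png", ".bmp"}
VIDEO_EXT = {".mp4", ".avi", ".mkv", ".mov"}

def dispatch(file_name: str) -> str:
    """Decide which branch of the overall process handles the input file."""
    suffix = Path(file_name).suffix.lower()
    if suffix in TEXT_EXT:
        return "text"      # jump to step 5/6: text summarization
    if suffix in IMAGE_EXT:
        return "picture"   # jump to step 6/7: picture summarization
    if suffix in VIDEO_EXT:
        return "video"     # jump to step 3: audio/video separation and frame cutting
    raise ValueError(f"unsupported input type: {file_name}")
```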
3. The audio in the video file is separated with an open-source audio/video separation tool and stored as an independent audio file; an open-source video cutting tool is used to cut video frames from the video at a set time interval (5-10 ms), and the video frames are stored as pictures named after the timestamp at which they were cut. After this step is finished, jump to 4 and continue execution.
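The patent does not name the separation and frame-cutting tools; the following is a sketch of one possible realisation that drives the ffmpeg command-line tool from Python. The output paths, the one-frame-per-second rate and the index-based frame naming are assumptions for illustration only.

```python
import subprocess

def separate_audio(video_path: str, audio_path: str = "audio.wav") -> None:
    """Strip the video stream (-vn) and save the audio track as a separate file."""
    subprocess.run(["ffmpeg", "-y", "-i", video_path, "-vn", audio_path], check=True)

def cut_frames(video_path: str, out_dir: str = "frames", fps: float = 1.0) -> None:
    """Save one frame every 1/fps seconds; the frame index maps back to a timestamp."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vf", f"fps={fps}",
         f"{out_dir}/frame_%06d.png"],
        check=True,
    )
```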
4. The audio file obtained in step 3 is input into the pre-trained speech recognition model to obtain the text data corresponding to the speech data. Each sentence of text has a timestamp (start time, end time, duration). The correspondence between the text and the timestamps is stored using a HashSet, and the timestamps are then removed to obtain the plain text content.
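A minimal sketch of the text-to-timestamp bookkeeping described in step 4; the patent speaks of a HashSet, and a Python dict plays the same role here. The Segment structure and field names are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Segment:
    text: str
    start: float     # seconds from the beginning of the video
    end: float
    duration: float

def build_mappings(segments: List[Segment]) -> Tuple[Dict[str, Segment], Dict[float, str], str]:
    """Return text->timestamp map, start-time->text map, and the plain (timestamp-free) text."""
    text_to_ts = {s.text: s for s in segments}            # used in 7.4 to find a sentence's timestamp
    start_to_text = {s.start: s.text for s in segments}   # used in 7.6 to go back from a timestamp
    plain_text = "".join(s.text for s in segments)
    return text_to_ts, start_to_text, plain_text
```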
5. An open-source word segmentation tool is used to segment the plain text content; word segmentation splits complete sentences into word units. A dictionary is constructed from the segmentation result: each distinct word is assigned a unique sequence number (starting from 1) and is represented by that number. The segmented sentences are then vectorized; vectorization represents a sentence by the sequence numbers of its words, and because sentence lengths differ, a maximum sentence length is set and sentences shorter than the maximum are padded with 0. The vectorized text is used as the input of the trained text abstract model. The automatic text summarization model outputs the most probable text sequence as the text summary according to the input content, and the summary length can be set manually. The text abstract model is an extractive summarization model, so the summary is composed of sentences from the original text, which facilitates the later alignment of the summary with the video.
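A minimal sketch of the dictionary construction and sentence vectorization in step 5, under the assumption that the sentences have already been word-segmented (for example by an open-source segmenter such as jieba); the padding value 0 and the 1-based ids follow the description above.

```python
from typing import Dict, List

def build_vocab(segmented_sentences: List[List[str]]) -> Dict[str, int]:
    """Assign each distinct word a unique id starting from 1 (0 is reserved for padding)."""
    vocab: Dict[str, int] = {}
    for sentence in segmented_sentences:
        for word in sentence:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab

def vectorize(sentence: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Replace each word by its id, truncate to max_len and pad shorter sentences with 0."""
    ids = [vocab.get(w, 0) for w in sentence][:max_len]
    return ids + [0] * (max_len - len(ids))

sentences = [["视频", "摘要", "提取"], ["语音", "转", "文本"]]
vocab = build_vocab(sentences)
vectors = [vectorize(s, vocab, max_len=8) for s in sentences]
```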
6. The picture is input into the trained target detection model, and the model gives the position (x_i, y_i, w_i, h_i) of each object in the picture and the probability distribution of the object over all object classes, with maximum probability c_i; here x_i, y_i are the coordinates of the center point of the rectangle framing the current object, and w_i, h_i are the width and height of the box.
6.1 The pictures are classified according to the categories of their objects. Each picture is input into the target detection model trained in step 1.2, and the model gives the position and category of each object in the picture; the area of each object is then calculated from its w_i and h_i, and the class C_i of the object with the largest area is taken as the picture category.
6.2 selecting the picture with higher weight as the picture abstract. The picture weight is calculated as follows:
(a) Different classes of objects are assigned different weights W_cls; for example, the weight of the class person may be set to 0.8 and the weight of the class car to 0.5. The weight of each class is assigned according to its frequency:

W_cls = N_cls / N,

where N is the number of all pictures and N_cls is the number of pictures containing objects of class cls.
(b) Calculating the weight W_i of each object in the picture. The importance of each object is determined by its class, its area and its position in the picture; the object-weight formula is given as an image in the original publication and combines W_cls with the box size w_i, h_i, its center coordinates x_i, y_i, and the picture size W, H. Here x_i, y_i, w_i, h_i are defined as above, W and H are the width and height of the picture, and W_cls is obtained in the previous step.
(c) Calculating the weight of each picture and sorting the pictures. All objects in a picture are sorted according to their importance W_i, and the weights of the top 2 are combined into the picture weight W_img:

W_img = α × W_i1 + β × W_i2,

where W_i1, W_i2 are the object weights of the two objects with the largest weights in the current picture img. α and β are hyper-parameters; different values can be selected according to the specific input data and the best-performing combination chosen, or they can be weighted by the two object weights themselves, for example

α = W_i1 / (W_i1 + W_i2), β = W_i2 / (W_i1 + W_i2).

After the picture weights are obtained, the pictures are sorted in descending order of W_img.
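A sketch of the picture-weight computation in (a)-(c), assuming the class weights W_cls have already been computed and using, purely as an illustration, an object weight of the form W_i = W_cls × (w_i·h_i)/(W·H); the exact object-weight formula in the patent is only available as an image, so this term is an assumption.

```python
from typing import Dict, List, Tuple

# one detection: (cls, x_center, y_center, box_w, box_h)
Detection = Tuple[str, float, float, float, float]

def object_weight(det: Detection, class_weight: Dict[str, float],
                  img_w: float, img_h: float) -> float:
    """Illustrative W_i: class weight scaled by the fraction of the picture the box covers."""
    cls, _x, _y, w, h = det
    return class_weight.get(cls, 0.0) * (w * h) / (img_w * img_h)

def picture_weight(dets: List[Detection], class_weight: Dict[str, float],
                   img_w: float, img_h: float,
                   alpha: float = 0.6, beta: float = 0.4) -> float:
    """W_img = alpha * W_i1 + beta * W_i2 over the two heaviest objects in the picture."""
    weights = sorted((object_weight(d, class_weight, img_w, img_h) for d in dets),
                     reverse=True)
    top2 = (weights + [0.0, 0.0])[:2]   # pad in case the picture has fewer than 2 objects
    return alpha * top2[0] + beta * top2[1]
```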
(d) Obtaining the picture abstract. Initially the picture-abstract list is empty. For the pictures arranged in descending order of weight, the first-ranked picture is added to the abstract; it is then taken out, and the similarity between it and each of the following pictures is computed. If the similarity of a following picture to the first picture is greater than the threshold, that picture is highly similar to the first picture but has a lower weight, and since the first picture has already been added to the abstract, the current picture is deleted from the list. The first picture is then deleted from the ordered picture list, and the previous operation is repeated on the remaining pictures until the list is empty; the picture-abstract list is then returned.
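A sketch of the greedy, similarity-deduplicated selection in (d); image similarity is assumed to be the SSIM-based measure described below (see the sketch after the SSIM formulas), and the threshold value is an assumption.

```python
from typing import Callable, List, Tuple

def select_picture_abstract(
    ranked: List[Tuple[str, float]],              # (frame_path, W_img), already sorted descending
    similarity: Callable[[str, str], float],      # e.g. an SSIM-based similarity function
    threshold: float = 0.8,
) -> List[str]:
    """Repeatedly keep the heaviest remaining frame and drop frames too similar to it."""
    abstract: List[str] = []
    remaining = list(ranked)
    while remaining:
        best_path, _ = remaining[0]
        abstract.append(best_path)
        remaining = [
            (path, w) for path, w in remaining[1:]
            if similarity(best_path, path) <= threshold
        ]
    return abstract
```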
The similarity of any two pictures is calculated using SSIM (the structural similarity measure), which compares the pictures in terms of luminance, contrast and structure. The picture abstract is then obtained from the SSIM similarities by the method in step (d). The steps for calculating the similarity of two pictures with the SSIM algorithm are as follows:
the sizes of two pictures with similarity to be calculated are modified into the same size (specified by people), and the pictures are converted into gray level pictures.
The SSIM value within each window is calculated using a sliding window approach (window size is specified artificially), and each window on the picture is weighted using a gaussian weighting function with a standard deviation of 1.5.
For any two input pictures X and Y, the picture portions falling within the current window are denoted x and y, and the SSIM between x and y is calculated as follows:

Luminance similarity:

l(x, y) = (2 μ_x μ_y + C_1) / (μ_x^2 + μ_y^2 + C_1),

where C_1 is a constant that prevents the denominator from being 0, μ_x = (1/N_p) Σ_{i=1}^{N_p} x_i is the mean grey value of x, x_i is the grey value of the i-th pixel of x and N_p is the total number of pixels of x; μ_y = (1/N_q) Σ_{i=1}^{N_q} y_i is the mean grey value of y, y_i is the grey value of the i-th pixel of y and N_q is the total number of pixels of y.

Contrast similarity:

c(x, y) = (2 σ_x σ_y + C_2) / (σ_x^2 + σ_y^2 + C_2),

where C_2 is a constant, σ_x is the grey-level standard deviation of x and σ_y is the grey-level standard deviation of y.

Structural similarity:

s(x, y) = (σ_xy + C_3) / (σ_x σ_y + C_3),

where C_3 is a constant and σ_xy is the grey-level covariance of x and y.

SSIM:

SSIM(x, y) = l(x, y) · c(x, y) · s(x, y).

The weighted average of the SSIM values of all windows is used as the similarity of the two pictures. If there are K sliding windows, the similarity of picture X and picture Y is

Sim(X, Y) = Σ_{k=1}^{K} w_k · SSIM(X_k, Y_k),

where k indexes the SSIM value computed in the k-th window, X_k and Y_k are the parts of picture X and picture Y falling in the k-th window, and w_k is the weight of the k-th window, obtained from a Gaussian function with a standard deviation of 1.5.
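A compact sketch of the SSIM-based picture similarity, assuming equally sized greyscale images as NumPy arrays; for simplicity it uses uniform (rather than Gaussian-weighted) non-overlapping windows and the usual merged contrast-structure term with C_3 = C_2/2, so it approximates rather than exactly reproduces the procedure above.

```python
import numpy as np

C1, C2 = (0.01 * 255) ** 2, (0.03 * 255) ** 2  # common stabilizing constants for 8-bit images

def ssim_window(x: np.ndarray, y: np.ndarray) -> float:
    """SSIM of two equally sized greyscale patches (a single window)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    luminance = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)
    contrast_structure = (2 * cov_xy + C2) / (var_x + var_y + C2)  # merged c(x,y)*s(x,y) with C3 = C2/2
    return luminance * contrast_structure

def image_similarity(img_x: np.ndarray, img_y: np.ndarray, win: int = 8) -> float:
    """Average SSIM over non-overlapping win x win windows (uniform weights here)."""
    h = min(img_x.shape[0], img_y.shape[0]) // win * win
    w = min(img_x.shape[1], img_y.shape[1]) // win * win
    scores = []
    for r in range(0, h, win):
        for c in range(0, w, win):
            scores.append(ssim_window(img_x[r:r + win, c:c + win].astype(float),
                                      img_y[r:r + win, c:c + win].astype(float)))
    return float(np.mean(scores))
```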
7. Generating the video abstract according to the text abstract and the picture abstract.
7.1 The plain text obtained from the audio of the video is grouped into c categories using a clustering method. Clustering uses the K-means method: a vector representation of each sentence is constructed from the words in the sentence, the distance between two sentences is the cosine similarity of the corresponding vectors, and the category of a sentence is determined by the nearest cluster center. Initially the cluster centers are c randomly chosen sentence vectors; through continuous iteration, the mean of all vectors in the same class is taken as the new cluster center, and iteration stops after a specified number of rounds or once the minimum distance between classes exceeds a specified threshold.
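A sketch of the sentence clustering in 7.1, under the assumption that sentences are represented as bag-of-words count vectors. It uses scikit-learn's KMeans, which clusters by Euclidean distance, on L2-normalised vectors so that the geometry approximates cosine similarity; this is a substitution for, not a reproduction of, the exact procedure described above.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

def cluster_sentences(segmented_sentences, c=5):
    """segmented_sentences: list of space-joined, word-segmented sentences."""
    X = CountVectorizer().fit_transform(segmented_sentences)   # bag-of-words sentence vectors
    X = normalize(X)                                           # L2-normalise -> cosine-like geometry
    labels = KMeans(n_clusters=c, n_init=10, random_state=0).fit_predict(X)
    clusters = {}
    for sent, lab in zip(segmented_sentences, labels):
        clusters.setdefault(lab, []).append(sent)
    return clusters
```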
7.2 If the video has a brief text introduction, the similarity between the text introduction and each category obtained in 7.1 is calculated, and the G_top categories with the highest similarity are kept; if there is no text introduction, the C_top categories containing the most text are retained. The similarity between the text introduction and the text of each of the C_top categories is measured by the Jaccard similarity, which uses the intersection and union of two sets to represent their similarity:

J(A, B_i) = |A ∩ B_i| / |A ∪ B_i|,

where A denotes the word-segmentation result of the text introduction and B_i denotes the set of words contained in the text of the i-th category.
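A one-function sketch of the Jaccard similarity used in 7.2, assuming both texts have already been word-segmented into Python sets.

```python
def jaccard(intro_words: set, category_words: set) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B|; returns 0.0 for two empty sets."""
    union = intro_words | category_words
    return len(intro_words & category_words) / len(union) if union else 0.0

score = jaccard({"视频", "摘要", "提取"}, {"视频", "字幕", "摘要"})
```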
7.3 The summary content obtained from the text abstract is split into sentences, and the cosine similarity between each sentence and each of the C_top categories is calculated. The current summary sentence is vectorized, and the cosine similarity is computed between it and the vectorization result of each sentence in each of the C_top categories; the vectorization method is the same as in step 5. If the cosine similarity with a sentence in one of the C_top categories is greater than a threshold (set manually), the summary sentence contains information of that category and is retained. If the cosine similarity between the current sentence and every one of the C_top categories is smaller than the threshold, the sentence is discarded.
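A sketch of the cosine-similarity filter in 7.3, assuming the sentences are represented as numeric feature vectors (for instance bag-of-words counts rather than the raw id sequences of step 5); the threshold value is an assumption.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def keep_summary_sentence(sent_vec: np.ndarray,
                          category_vecs: list,      # vectors of the sentences in the C_top categories
                          threshold: float = 0.5) -> bool:
    """Keep the summary sentence if it is close enough to any sentence of any retained category."""
    return any(cosine(sent_vec, v) > threshold for v in category_vecs)
```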
7.4 The timestamp of each text abstract sentence in the video is acquired according to the mapping relation between the text and the timestamps. In the result obtained by the speech recognition model in step 4, each sentence of text has a timestamp (start time, end time, duration); the mapping between each text sentence and its timestamp is stored in a HashSet so that the timestamp can be found from the text sentence. In addition, another HashSet stores the correspondence from start time to text, mapping the start time of each sentence to the text itself.
7.5 For the timestamp of each text abstract sentence, it is checked whether the corresponding video frame is in the picture abstract obtained from the video frames; if so, the current timestamp is kept. The pictures obtained from the video frames in step 3 are named after their timestamps, so the picture abstract is searched for a picture named with the timestamp obtained in 7.4; if such a picture exists, the timestamp corresponding to the current abstract sentence is retained, and otherwise it is discarded.
7.6 For each timestamp saved in 7.5, the HashSet that maps start times to text can retrieve the corresponding text sentence. Then, in the plain text content obtained earlier, the n text sentences before it and the n text sentences after it are found. For each of these 2n text sentences, if its cosine similarity with any of the C_top categories is greater than the threshold, or its corresponding video frame is in the picture abstract, the timestamp (start time, end time, duration) corresponding to that text sentence is also retained.
7.7 According to the timestamps obtained from 7.6, the ffmpeg tool is used to cut, from each start time, a video segment of the corresponding duration, and the segments are finally spliced into the video abstract. The specific process is as follows:
(a) The command ffmpeg -ss {start_time} -i {input_video} -t {duration} -c:v copy -c:a copy {output_video} is executed, where start_time is the start time of the clipped video segment (here, the start time of the timestamp corresponding to the text sentence), input_video is the input video (here, the file name the user entered into the overall control module), duration is the duration of the clipped segment (either the duration corresponding to the text sentence or a manually specified time), and output_video is the file name under which the current video segment is saved.
(b) A patch .txt file is created that records the file name of each segment. Each line of the .txt file has the form: file video_patch_path, where video_patch_path is the storage path of each segment followed by its file name.
(c) Finally, the command ffmpeg -f concat -i patch.txt -c copy {video_sum} is entered on the command line, where video_sum is the file name of the video abstract and specifies where the video abstract is stored and what it is called.
In addition, the invention also provides a video abstract extraction system, which comprises:
the to-be-extracted video acquisition module is used for acquiring a to-be-extracted video;
the audio file extraction module is used for extracting an audio file of the video to be extracted and dividing the video to be extracted into a plurality of video frames to be extracted; each video frame to be extracted corresponds to a video frame time stamp; the video frame timestamp is used for describing the starting time, the ending time and the duration of the corresponding video frame to be extracted in the video to be extracted;
the time sequence text file determining module is used for inputting the audio file into the voice-to-text model to obtain a time sequence text file of the video to be extracted; the voice-to-text model is obtained by training a first deep neural network by using an audio file training set; each sentence of text in the time sequence text file corresponds to a text timestamp; the text timestamp is used for describing the starting time, the ending time and the duration of the corresponding sentence in the video to be extracted;
the text timestamp-text mapping relation determining module is used for determining a non-time sequence text file and a text timestamp-text mapping relation according to the time sequence text file;
the text abstract determining module is used for inputting the non-time sequence text file into the text abstract extracting model to obtain a text abstract of the video to be extracted; the text abstract extraction model is obtained by training the second deep neural network by using a text file training set;
the picture abstract determining module is used for determining the picture abstract of the video to be extracted according to the object recognition model and the plurality of video frames to be extracted; the object recognition model is obtained by training the third deep neural network by utilizing a video frame training set;
and the video abstract determining module is used for determining the video abstract of the video to be extracted according to the text abstract, the picture abstract and the text timestamp-text mapping relation.
The audio file training set acquisition module is used for acquiring an audio file training set; the audio file training set comprises a plurality of historical audio files and historical text files corresponding to the historical audio files;
and the voice-to-text model determining module is used for training the first deep neural network by taking the historical audio file as input and the historical text file corresponding to the historical audio file as output to obtain the voice-to-text model.
The text file training set acquisition module is used for acquiring a text file training set; the text file training set comprises a historical text file and a text abstract corresponding to the historical text file;
and the text abstract extraction model determining module is used for training the second deep neural network by taking the historical text file as input and the text abstract corresponding to the historical text file as output to obtain a text abstract extraction model.
Correspondingly, the invention provides an internet multi-channel data key-information abstract system, which comprises: an overall control module, which judges the data type of the input multi-channel data and determines the input and output of each module; an audio/video conversion module, which separates audio from video and intercepts video frames; a speech recognition module, which converts audio into text; a text abstract module, which automatically generates the abstract of a long text; a picture abstract module, which automatically generates a picture abstract; and a video abstract module, which automatically generates a video abstract.
In summary, the technology and system for abstracting key information of internet multichannel data provided by the invention can judge the input file type according to the file name input by the user, abstract texts, pictures and videos according to the file type, and realize the abstraction of key information of internet multichannel data. The text abstract is realized through an extraction type text abstract model, the picture abstract is realized through a target detection model, and the video abstract is realized by separating audio, intercepting video frames and combining the text abstract and the picture abstract.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A method for extracting a video abstract, comprising the steps of:
acquiring a video to be extracted;
extracting an audio file of the video to be extracted, and dividing the video to be extracted into a plurality of video frames to be extracted; each video frame to be extracted corresponds to a video frame timestamp; the video frame timestamp is used for describing the starting time, the ending time and the duration of the corresponding video frame to be extracted in the video to be extracted;
inputting the audio file into a voice-to-text model to obtain a time sequence text file of the video to be extracted; the voice-to-text model is obtained by training a first deep neural network by using an audio file training set; each sentence of text in the time sequence text file corresponds to a text time stamp; the text timestamp is used for describing the starting time, the ending time and the duration of a corresponding sentence in the video to be extracted;
determining a non-time sequence text file and a text timestamp-text mapping relation according to the time sequence text file;
inputting the non-time sequence text file into a text abstract extraction model to obtain a text abstract of the video to be extracted; the text abstract extraction model is obtained by training a second deep neural network by using a text file training set;
determining a picture abstract of the video to be extracted according to an object recognition model and a plurality of video frames to be extracted; the object recognition model is obtained by training a third deep neural network by utilizing a video frame training set;
and determining the video abstract of the video to be extracted according to the text abstract, the picture abstract and the text timestamp-text mapping relation.
2. The method for extracting video abstract according to claim 1, further comprising, before the obtaining the video to be extracted:
acquiring an audio file training set; the audio file training set comprises a plurality of historical audio files and historical text files corresponding to the historical audio files;
and training the first deep neural network by taking the historical audio file as input and taking the historical text file corresponding to the historical audio file as output to obtain a voice-to-text model.
3. The method for extracting video abstract according to claim 1, further comprising, before the obtaining the video to be extracted:
acquiring a text file training set; the text file training set comprises a historical text file and a text abstract corresponding to the historical text file;
and training a second deep neural network by taking the historical text file as input and the text abstract corresponding to the historical text file as output to obtain a text abstract extraction model.
4. The method for extracting video abstract according to claim 1, further comprising, before the obtaining the video to be extracted:
acquiring a video frame training set, and marking the objects in the video frame training set with rectangular marking frames to obtain a plurality of marked video frames;
and training a third deep neural network by taking the plurality of marked video frames as input and the rectangular marked frame as output to obtain an object recognition model.
5. The method for extracting a video summary according to claim 1, wherein the determining a picture summary of the video to be extracted according to the object recognition model and the plurality of video frames to be extracted specifically includes:
inputting a plurality of video frames to be extracted into an object recognition model to obtain a plurality of marked video frames to be extracted;
determining any marked video frame to be extracted as a current marked video frame to be extracted;
determining the weight W_i of each object in the currently marked video frame to be extracted according to the objects and categories contained in that frame, using an object-weight formula (given as an image in the original publication) that combines the class weight W_cls of the object with the size and position of its rectangular marking frame; wherein W_i is the object weight of the i-th object in the marked video frame to be extracted, and W_cls is the class weight of the object, given by

W_cls = N_cls / N,

wherein N is the total number of marked video frames to be extracted and N_cls is the number of marked video frames to be extracted containing an object of class cls; x_i, y_i are the coordinates of the center point of the rectangular marking frame, w_i, h_i are the width and height of the rectangular marking frame, and W and H are the width and height of the video frame to be extracted;
according to the object weights of the objects in the marked video frame to be extracted, determining the picture weight of the currently marked video frame to be extracted by using the formula W_img = α × W_i1 + β × W_i2; wherein W_img is the weight of the currently marked video frame to be extracted, W_i1 and W_i2 are the object weights of the two objects with the largest weights in the currently marked video frame to be extracted, and both α and β are hyper-parameters;
sorting the plurality of marked video frames to be extracted in descending order of picture weight to obtain a video frame sequence to be extracted;
determining the first element in the video frame sequence to be extracted as an abstract picture;
and updating the video frame sequence to be extracted and returning to the step of determining the first element in the video frame sequence to be extracted as an abstract picture, until the number of elements in the video frame sequence to be extracted is 0, so as to obtain a plurality of abstract pictures as the picture abstract of the video to be extracted.
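The per-object and class-weight formulas of claim 5 are given only as images in the original, so the sketch below substitutes a plausible class-frequency × box-area weighting purely for illustration; only the combination W_img = α·W_i1 + β·W_i2 and the descending-sort / greedy-selection loop follow the claim text.

```python
# Minimal sketch of claim 5's frame weighting and greedy selection. The
# Detection fields mirror the rectangular marking box (center/size notation
# simplified to width and height only for the stand-in weight).
from dataclasses import dataclass

@dataclass
class Detection:
    cls: str        # object class
    w: float        # box width
    h: float        # box height

def picture_weight(detections, class_counts, n_frames, frame_w, frame_h,
                   alpha=0.6, beta=0.4):
    weights = []
    for d in detections:
        w_cls = class_counts[d.cls] / n_frames              # stand-in class weight
        w_obj = w_cls * (d.w * d.h) / (frame_w * frame_h)   # stand-in object weight
        weights.append(w_obj)
    top = sorted(weights, reverse=True)[:2] + [0.0, 0.0]    # pad if < 2 objects
    return alpha * top[0] + beta * top[1]                   # W_img = a*W_i1 + b*W_i2

def select_summary_frames(frames, weights):
    # Descending sort by picture weight, then repeatedly take the head frame;
    # claim 6's similarity pruning would additionally shrink the sequence.
    sequence = sorted(zip(frames, weights), key=lambda fw: fw[1], reverse=True)
    summary = []
    while sequence:
        summary.append(sequence[0][0])
        sequence = sequence[1:]
    return summary
```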
6. The video abstract extraction method according to claim 5, wherein the updating the video frame sequence to be extracted specifically includes:
determining the similarity between the first element of the video frame sequence to be extracted and each other element in the sequence;
and deleting, from the video frame sequence to be extracted, the first element and every element whose similarity exceeds a similarity threshold, so as to obtain an updated video frame sequence to be extracted.
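A sketch of the claim-6 update step: drop the head of the sequence together with every frame that is too similar to it. The histogram-correlation similarity below is an assumption, since the claim does not fix a particular similarity measure or threshold value.

```python
# Minimal sketch, assuming frames are numpy arrays with pixel values in 0-255.
import numpy as np

def frame_similarity(a, b, bins=32):
    # Correlation of normalised grey-level histograms of the two frames.
    ha, _ = np.histogram(a, bins=bins, range=(0, 255), density=True)
    hb, _ = np.histogram(b, bins=bins, range=(0, 255), density=True)
    return float(np.corrcoef(ha, hb)[0, 1])

def update_sequence(sequence, threshold=0.9):
    head, rest = sequence[0], sequence[1:]
    # Keep only the elements that are not too similar to the removed head.
    return [f for f in rest if frame_similarity(head, f) <= threshold]
```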
7. The video abstract extraction method according to claim 5, wherein the determining the video abstract of the video to be extracted according to the text abstract, the picture abstract and the text timestamp-text mapping relation specifically includes:
splitting the text abstract into a plurality of abstract sentence vectors;
determining, according to the text timestamp-text mapping relation, the text timestamps in the time sequence text file that correspond to the plurality of abstract sentence vectors as a text timestamp set;
determining the video frame timestamps corresponding to the picture abstract as a video frame timestamp set;
intercepting a plurality of sections of initial abstract videos in the video to be extracted according to the intersection of the text timestamp set and the video frame timestamp set;
and splicing the multiple sections of initial abstract videos by using an ffmpeg tool to obtain the video abstract of the video to be extracted.
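For claim 7, a sketch of the final assembly: intersect the two timestamp sets, cut the corresponding segments, and splice them with ffmpeg's concat demuxer. The (start, end) second values and file names are invented, treating the intersection as exact span matches is a simplification, and only standard ffmpeg options are used.

```python
# Minimal sketch of the claim-7 assembly step using the ffmpeg CLI.
import subprocess

text_spans = {(12.0, 18.5), (40.0, 47.0), (61.0, 66.0)}    # from the text abstract
frame_spans = {(12.0, 18.5), (61.0, 66.0), (90.0, 95.0)}   # from the picture abstract
clips = sorted(text_spans & frame_spans)                    # intersection of both sets

parts = []
for i, (start, end) in enumerate(clips):
    part = f"clip_{i}.mp4"
    # Cut one initial abstract video segment without re-encoding.
    subprocess.run(["ffmpeg", "-y", "-i", "input.mp4", "-ss", str(start),
                    "-to", str(end), "-c", "copy", part], check=True)
    parts.append(part)

with open("concat.txt", "w") as f:
    f.writelines(f"file '{p}'\n" for p in parts)

# Splice the segments into the final video abstract with the concat demuxer.
subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                "-i", "concat.txt", "-c", "copy", "summary.mp4"], check=True)
```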
8. A video abstract extraction system, the system comprising:
the to-be-extracted video acquisition module is used for acquiring a to-be-extracted video;
the audio file extraction module is used for extracting the audio file of the video to be extracted and dividing the video to be extracted into a plurality of video frames to be extracted; each video frame to be extracted corresponds to a video frame timestamp; the video frame timestamp is used for describing the starting time, the ending time and the duration of the corresponding video frame to be extracted in the video to be extracted;
the time sequence text file determining module is used for inputting the audio file into a voice-to-text model to obtain a time sequence text file of the video to be extracted; the voice-to-text model is obtained by training a first deep neural network by using an audio file training set; each sentence of text in the time sequence text file corresponds to a text timestamp; the text timestamp is used for describing the starting time, the ending time and the duration of a corresponding sentence in the video to be extracted;
the text timestamp-text mapping relation determining module is used for determining a non-time sequence text file and a text timestamp-text mapping relation according to the time sequence text file;
the text abstract determining module is used for inputting the non-time sequence text file into a text abstract extracting model to obtain a text abstract of the video to be extracted; the text abstract extraction model is obtained by training a second deep neural network by using a text file training set;
the picture abstract determining module is used for determining the picture abstract of the video to be extracted according to the object recognition model and the plurality of video frames to be extracted; the object recognition model is obtained by training a third deep neural network by utilizing a video frame training set;
and the video abstract determining module is used for determining the video abstract of the video to be extracted according to the text abstract, the picture abstract and the text timestamp-text mapping relation.
9. The video abstract extraction system according to claim 8, wherein the system further comprises:
the audio file training set acquisition module is used for acquiring an audio file training set; the audio file training set comprises a plurality of historical audio files and historical text files corresponding to the historical audio files;
and the voice-to-text model determining module is used for training the first deep neural network by taking the historical audio file as input and the historical text file corresponding to the historical audio file as output to obtain a voice-to-text model.
10. The video abstract extraction system according to claim 8, wherein the system further comprises:
the text file training set acquisition module is used for acquiring a text file training set; the text file training set comprises a historical text file and a text abstract corresponding to the historical text file;
and the text abstract extraction model determining module is used for training the second deep neural network by taking the historical text file as input and the text abstract corresponding to the historical text file as output to obtain a text abstract extraction model.
CN202210135790.9A 2022-02-15 2022-02-15 Video abstract extraction method and system Pending CN114547370A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210135790.9A CN114547370A (en) 2022-02-15 2022-02-15 Video abstract extraction method and system

Publications (1)

Publication Number Publication Date
CN114547370A true CN114547370A (en) 2022-05-27

Family

ID=81676521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210135790.9A Pending CN114547370A (en) 2022-02-15 2022-02-15 Video abstract extraction method and system

Country Status (1)

Country Link
CN (1) CN114547370A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821272A (en) * 2022-06-28 2022-07-29 上海蜜度信息技术有限公司 Image recognition method, image recognition system, image recognition medium, electronic device, and target detection model
CN115086783A (en) * 2022-06-28 2022-09-20 北京奇艺世纪科技有限公司 Video generation method and device and electronic equipment
CN115086783B (en) * 2022-06-28 2023-10-27 北京奇艺世纪科技有限公司 Video generation method and device and electronic equipment
CN115209121A (en) * 2022-07-14 2022-10-18 江苏龙威中科技术有限公司 Full-range simulation system and method with intelligent integration function
CN115209121B (en) * 2022-07-14 2024-03-15 江苏龙威中科技术有限公司 Full-range simulation system and method with intelligent integration function
CN116453023A (en) * 2023-04-23 2023-07-18 上海帜讯信息技术股份有限公司 Video abstraction system, method, electronic equipment and medium for 5G rich media information
CN116453023B (en) * 2023-04-23 2024-01-26 上海帜讯信息技术股份有限公司 Video abstraction system, method, electronic equipment and medium for 5G rich media information


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination