CN111681678A - Method, system, device and storage medium for automatically generating sound effect and matching video - Google Patents

Method, system, device and storage medium for automatically generating sound effect and matching video

Info

Publication number
CN111681678A
CN111681678A
Authority
CN
China
Prior art keywords
audio, video, matching score, specific, matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010518573.9A
Other languages
Chinese (zh)
Other versions
CN111681678B (en)
Inventor
薛媛 (Xue Yuan)
金若熙 (Jin Ruoxi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Xinghe Shangshi Film Media Co ltd
Original Assignee
Hangzhou Xinghe Shangshi Film Media Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xinghe Shangshi Film Media Co ltd filed Critical Hangzhou Xinghe Shangshi Film Media Co ltd
Priority to CN202010518573.9A
Publication of CN111681678A
Application granted
Publication of CN111681678B
Active legal status (current)
Anticipated expiration legal status

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method for automatically generating sound effects and matching them to video. The method comprises: extracting video key frames at a reduced frame rate from the video to be processed and performing a preliminary recognition analysis to obtain modularized specific sound-producing objects; performing multi-stage recognition analysis on the modularized specific sound-producing objects through a deep residual network model to obtain the types of the specific sound-producing objects and to extract their sound-producing features; constructing, based on the sound-producing features, the object types of the specific sound-producing objects and the audio of the specific sound-producing objects, the audio comprising an audio introduction and audio keywords; and obtaining a video and audio matching score based on the object types and the audio, and searching and matching the specific sound-producing objects with audio according to the video and audio matching score, so that the audio introduction, the audio keywords and the object type of each specific sound-producing object match one another. When dubbing a video, no foley artist is needed for special-effect dubbing: the sound effects are generated automatically and matched directly into the corresponding video, which is convenient, fast and highly accurate.

Description

Method, system, device and storage medium for automatically generating sound effect and matching video
Technical Field
The invention relates to the technical field of video processing, in particular to a method, a system, a device and a storage medium for automatically generating sound effect and matching video.
Background
At present, with the development of science and technology, multimedia audio and video technology is widely applied in many fields. A good video gives the audience a better experience and helps them understand and recognize those fields, so how to produce a good video has become increasingly important.
In existing video processing technology, the clipping, special effects, subtitles and audio material of a video are added separately. For example, when sound is added to a video, the video is recorded first and then dubbed, or someone produces the sound on site so that it is recorded directly in the video; however, sounds other than the characters' voices are difficult to match, and sound that cannot be captured on the shooting site, such as footsteps, doors opening and closing, or water being poured, is currently produced by a foley artist in post-production and then matched into the video.
This traditional way of dubbing a video is slow: synchronizing the video with the various sounds is a complex operation, the workload is heavy, a great deal of time is required, and the workflow is extremely inflexible.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a method, a system, a device and a storage medium for automatically generating sound effects and matching them to video.
In order to solve the technical problem, the invention is solved by the following technical scheme:
a method for automatically generating sound effect and matching video comprises the following steps:
based on a video to be processed, extracting a video key frame in a frequency reduction mode, and performing primary identification analysis processing to obtain a modularized specific sound production object;
carrying out multi-stage recognition analysis processing on the modularized specific sounding object through a depth residual error network model to obtain the type of the specific sounding object and extract the sounding characteristics of the specific sounding object;
constructing an object type of a specific sounding object and a specific sounding object audio based on the sounding characteristics, wherein the audio comprises audio introduction and audio keywords;
and obtaining a video and audio matching score based on the object type and the audio of the specific sound-producing object, and searching and matching the specific sound-producing object and the audio according to the video and audio matching score so that the audio introduction, the audio keywords and the object type of the specific sound-producing object are matched with each other.
As an implementation manner, the frequency-reducing extraction of the video key frames and the preliminary identification analysis processing specifically include:
reducing the frame extraction frequency of the relevant information of the video to be processed, and extracting video key frames;
generating a frame image stream from the extracted video key frames;
and performing modular multi-object recognition on the frame image stream by adopting a deep convolutional neural network model.
As an alternative to the above-described embodiment,
the audio introduction is introduction content text of the audio, and the audio keywords comprise at least three words for describing the audio, wherein the words for describing the audio comprise the category name of a specific sound production object and the category name of a sound production sound.
As an implementation manner, the obtaining a video and audio matching score based on the object category and the audio of the specific sounding object, and performing search matching on the specific sounding object and the audio according to the video and audio matching score, so that the audio introduction, the audio keyword, and the object category of the specific sounding object are matched with each other, specifically:
processing the object type and the audio introduction of the specific sound-producing object to obtain a first matching score;
obtaining the BERT vector of the object type of the specific sound-producing object and the BERT vector of the audio introduction, calculating their cosine similarity, and using the cosine similarity as a neural network matching score;
obtaining a video and audio matching score based on the first matching score and the neural network matching score;
and selecting the audios corresponding to the several highest video and audio matching scores as the audio recommendations for the specific sound-producing object.
As an implementation manner, the processing the object class and the audio introduction of the specific sound-generating object to obtain the first matching score specifically includes:
performing word segmentation processing on the object type and the audio introduction of the specific sound-producing object to obtain words;
respectively obtaining the proportion of words of the object type of the specific sound-producing object that overlap with the audio introduction and with the audio keywords to obtain a first proportion and a second proportion, and carrying out weighted average processing on the first proportion and the second proportion to obtain a word matching score, wherein word matching score = (word overlap proportion of the object type with the audio introduction) × audio introduction weight + (word overlap proportion of the object type with the audio keywords) × audio keyword weight, and audio introduction weight + audio keyword weight = 1;
obtaining the object type TF-IDF vector based on the statistical data of the audio introductions, and using the first cosine similarity between the object type TF-IDF vector and the audio introduction TF-IDF vector as a TF-IDF matching score, wherein TF-IDF matching score = cosine_similarity(object type TF-IDF vector, audio introduction TF-IDF vector);
and carrying out weighted average processing on the word matching score and the TF-IDF matching score to obtain a first matching score, wherein first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1.
As an implementation manner, the obtaining a video and audio matching score based on the first matching score and the neural network matching score specifically includes:
and carrying out weighted average processing on the first matching score and the neural network matching score to obtain a video and audio matching score, wherein video and audio matching score = first matching score × first weight + neural network matching score × neural network weight, and first weight + neural network weight = 1.
As an implementation manner, after the step of searching and matching the specific sound-generating object and the audio so that the audio is matched to the specific sound-generating object, the method further comprises the following steps:
and mixing all the audios to form a complete audio file, and adding the audio file into the audio track of the video to enable the audio file and the video to be synchronous.
A system for automatically generating sound effect and matching videos comprises a video processing module, a feature extraction module, a feature representation module and a search matching module;
the video processing module is used for extracting video key frames in a frequency reduction mode based on a video to be processed and carrying out primary identification analysis processing to obtain a modularized specific sound production object;
the characteristic extraction module is used for carrying out multi-stage recognition analysis processing on the modularized specific sounding object through a depth residual error network model to obtain the type of the specific sounding object and extracting the sounding characteristics of the specific sounding object;
the feature representation module is used for constructing an object type of a specific sounding object and a specific sounding object audio based on the sounding features, wherein the audio comprises audio introduction and audio keywords;
the searching and matching module is used for obtaining a video and audio matching score based on the object type and the audio of the specific sounding object, and searching and matching the specific sounding object and the audio according to the video and audio matching score, so that the audio introduction, the audio keywords and the object type of the specific sounding object are matched with each other.
A computer-readable storage medium storing a computer program which, when executed by a processor, performs the steps of:
based on a video to be processed, extracting a video key frame in a frequency reduction mode, and performing primary identification analysis processing to obtain a modularized specific sound production object;
carrying out multi-stage recognition analysis processing on the modularized specific sounding object through a depth residual error network model to obtain the type of the specific sounding object and extract the sounding characteristics of the specific sounding object;
constructing an object type of a specific sounding object and a specific sounding object audio based on the sounding characteristics, wherein the audio comprises audio introduction and audio keywords;
and obtaining a video and audio matching score based on the object type and the audio of the specific sound-producing object, and searching and matching the specific sound-producing object and the audio according to the video and audio matching score so that the audio introduction, the audio keywords and the object type of the specific sound-producing object are matched with each other.
An apparatus for automatically generating sound effects and matching videos, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer program
Based on a video to be processed, extracting a video key frame in a frequency reduction mode, and performing primary identification analysis processing to obtain a modularized specific sound production object;
carrying out multi-stage recognition analysis processing on the modularized specific sounding object through a depth residual error network model to obtain the type of the specific sounding object and extract the sounding characteristics of the specific sounding object;
constructing an object type of a specific sounding object and a specific sounding object audio based on the sounding characteristics, wherein the audio comprises audio introduction and audio keywords;
and obtaining a video and audio matching score based on the object type and the audio of the specific sound-producing object, and searching and matching the specific sound-producing object and the audio according to the video and audio matching score so that the audio introduction, the audio keywords and the object type of the specific sound-producing object are matched with each other.
Due to the adoption of the technical scheme, the invention has the remarkable technical effects that:
According to the method, the system, the device and the storage medium for automatically generating sound effects and matching video, video key frames are extracted at a reduced frame rate from the video to be processed and subjected to preliminary recognition analysis to obtain modularized specific sound-producing objects; multi-stage recognition analysis is performed on the modularized specific sound-producing objects through a deep residual network model to obtain the types of the specific sound-producing objects and to extract their sound-producing features; the object types of the specific sound-producing objects and the audio of the specific sound-producing objects are constructed based on the sound-producing features, the audio comprising an audio introduction and audio keywords; and a video and audio matching score is obtained based on the object types and the audio, and the specific sound-producing objects are search-matched with audio according to the video and audio matching score, so that the audio introduction, the audio keywords and the object type of each specific sound-producing object match one another. Because the sound effects are generated automatically and matched into the corresponding video, no foley artist is needed for special-effect dubbing: the sound effects can be generated and matched into the video directly and automatically, which is convenient, fast and highly accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic overall flow diagram of the process of the present invention;
fig. 2 is a schematic diagram of the overall structure of the system of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples, which are illustrative of the present invention and are not to be construed as being limited thereto.
Example 1:
a method for automatically generating sound effects and matching videos, as shown in fig. 1, includes the following steps:
s100: based on a video to be processed, extracting a video key frame in a frequency reduction mode, and performing primary identification analysis processing to obtain a modularized specific sound production object;
Specifically, the video to be processed is the video clip, provided by the user, to which sound effects need to be added. Video key frames are extracted from it at a reduced frame rate; the frame-extraction frequency is an adjustable parameter with no lower limit, and its upper limit is determined by the native frame rate of the video (usually 25 frames per second). Extracting frames from the video to be processed produces a time-ordered sequence of static frame images, the frame image stream, which is used for recognizing specific sound-producing objects in the next step.
In one embodiment: the frequency reduction extraction video key frame is subjected to preliminary identification analysis processing, and specifically comprises the following steps:
s110: reducing the frame extraction frequency of the relevant information of the video to be processed, and extracting video key frames;
s120: generating a frame image stream from the extracted video key frames;
s130: and performing modular multi-object recognition on the frame image stream by adopting a deep convolutional neural network model.
Specifically, the video key frames are first extracted at a reduced frequency. An object or person worth dubbing must be present in the video to be processed for a certain continuous time; dubbing an object that disappears within one or two frames is not considered, because it would be meaningless. In practice, a video key frame in the frame image stream is handled as follows: if the frames of the preceding 2 seconds do not contain the recognized object type, the object is regarded as starting to sound from that second; if the same object already appears in the frames of the preceding 2 seconds, the object is considered to be sounding continuously, and the minimum sounding duration is set to 5 seconds. In actual operation, different continuous sounding durations and minimum sounding durations can be set for different objects according to their sounding patterns. Extracting video key frames at a reduced frequency for object recognition, for example sampling a 25 frames-per-second video at 1 key frame per second, i.e. taking one frame out of every 25 as the recognition input for the objects appearing in that second, effectively and simply reduces the number of reads and increases processing speed. The frame-extraction frequency is an adjustable parameter with no lower limit, and its upper limit is determined by the native frame rate of the video (usually 25 frames per second), so the user can choose a suitable frequency according to the characteristics of the video sample.
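As an illustration of the reduced-frequency key-frame extraction described above, a minimal Python sketch using OpenCV follows; the function name, the default 1 frame-per-second sampling rate and the 25 fps fallback are assumptions for the example rather than values fixed by the embodiment.

```python
import cv2

def extract_key_frames(video_path, frames_per_second=1.0):
    """Extract key frames at a reduced, adjustable frequency.

    frames_per_second is the adjustable down-sampled rate; its upper bound
    is the native frame rate of the video (typically 25 fps). Returns a list
    of (timestamp_seconds, frame) pairs, i.e. the frame image stream used
    for object recognition.
    """
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0          # fall back to 25 fps
    step = max(1, int(round(native_fps / min(frames_per_second, native_fps))))

    frame_stream = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:                               # keep one frame per interval
            frame_stream.append((index / native_fps, frame))
        index += 1
    cap.release()
    return frame_stream

# e.g. one key frame per second of a 25 fps clip, i.e. every 25th frame:
# stream = extract_key_frames("input.mp4", frames_per_second=1.0)
```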
Next, the frame image stream generated from the video key frames is processed, and modular multi-object recognition is performed with an embedded deep convolutional neural network (Deep CNN). For each static frame image in the frame image stream, the network applies highly nonlinear operations to the pixel values of the RGB (red, green and blue) color channels to generate probability vectors centered on each recognizable specific sound-producing object; the deep convolutional neural network determines the category of each specific sound-producing object from the maximum value in its probability vector, and determines the size of the current object selection box from the numerical distribution of the probability vectors in the rectangular region around the object's center. The generated selection box is used to crop a screenshot of the specific sound-producing object from each frame image, so that more detailed recognition of the object can be performed in the second stage. It should be explained that all neural networks involved in this step come from pre-trained Fast-RCNN networks in an object recognition library for the Python language and the TensorFlow deep learning framework.
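The sketch below illustrates the first-stage detection and cropping step. The embodiment uses a pre-trained Fast-RCNN from a TensorFlow object recognition library; purely for illustration, this sketch substitutes torchvision's pre-trained Faster R-CNN, and the 0.5 score threshold and helper names are assumptions.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Stand-in for the first-stage detection network (the embodiment itself
# uses a pre-trained Fast-RCNN from a TensorFlow object recognition library).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")  # torchvision >= 0.13
detector.eval()

def detect_and_crop(frame_rgb, score_threshold=0.5):
    """Run first-stage detection on one frame (H x W x 3 uint8 array) and
    return per-object crops for the second-stage fine-grained network."""
    with torch.no_grad():
        pred = detector([to_tensor(frame_rgb)])[0]          # dict with boxes, labels, scores
    crops = []
    for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
        if score < score_threshold:                          # confidence of this selection box
            continue
        x1, y1, x2, y2 = [int(v) for v in box.tolist()]
        crops.append({
            "label": int(label),                # coarse object class
            "confidence": float(score),         # certainty value of the selection box
            "box": (x1, y1, x2, y2),
            "crop": frame_rgb[y1:y2, x1:x2],    # screenshot passed to stage two
        })
    return crops
```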
This embodiment obtains modularized specific sound-producing objects, and correspondingly a modular design is adopted for every level of deep convolutional neural network embedded in object recognition. Any required single-level deep neural network at any level of object recognition can be swapped at will to adapt to special usage scenes or special object classes; for example, the recognition network that performs the refined classification of shoes and floors is not based on any pre-trained CNN model. The modular design can also be extended to embed several deep convolutional neural networks at each recognition stage, using an Ensemble Learning algorithm to improve the accuracy of overall object recognition, the positioning precision, and the recognition accuracy of the refined classification.
For example, the ensemble learning algorithm can use each deep neural network's confidence value for its recognized selection box (the closer to 1, the more certain the network is that the selection box is correct; the confidence value is the model's probability judgement of whether an object recognition is correct, and can be understood as the model's confidence in a single recognition, with higher confidence meaning a higher likelihood of being correct) to take a weighted average of several selection boxes, thereby fine-tuning a more reliable selection box for object positioning and generating a higher-quality screenshot for recognition in the subsequent steps.
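A minimal sketch of this confidence-weighted fusion of selection boxes follows; the box format and function name are assumptions for the example.

```python
def ensemble_boxes(candidates):
    """Fuse selection boxes proposed by several detection networks for the
    same object by confidence-weighted averaging of their coordinates.

    candidates: list of (box, confidence) with box = (x1, y1, x2, y2) and
    confidence in (0, 1]; a more confident network pulls the fused box
    closer to its own proposal.
    """
    total = sum(conf for _, conf in candidates)
    fused = [
        sum(box[i] * conf for box, conf in candidates) / total
        for i in range(4)
    ]
    return tuple(fused)

# Two networks roughly agree on an object; the more confident one dominates:
# ensemble_boxes([((10, 10, 100, 200), 0.9), ((14, 8, 110, 196), 0.6)])
```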
S200: carrying out multi-stage recognition analysis processing on the modularized specific sounding object through a depth residual error network model to obtain the type of the specific sounding object and extract the sounding characteristics of the specific sounding object;
Specifically, no existing deep neural network can recognize the details of every object in a natural image, so a technical framework of a multi-stage object recognition network is provided. In this embodiment, the multi-stage recognition analysis follows a coarse-to-fine design: for each static frame image in the frame image stream, a first-level deep neural network first performs preliminary recognition analysis to obtain the general categories of the specific sound-producing objects (such as persons, shoes, doors and windows); then, for the detailed screenshot of each object's location, a new neural network performs multi-stage recognition analysis of the object's sub-category to obtain the type of the specific sound-producing object (for example, whether a shoe is a sports shoe, a skate shoe or a leather shoe). The multi-stage recognition analysis of this embodiment can be extended to an image recognition framework with more stages (three or more); in general, because the resolution of the extracted frames used in the experiments is limited, a two-level deep neural network performing two-stage recognition analysis is sufficient to realize the currently required functions.
The process of the second-stage recognition analysis performed by the second-level deep neural network is mainly described here. The preliminary recognition analysis uses a first-level deep recognition network derived from a pre-trained Fast-RCNN network; the multi-stage recognition analysis uses multi-level deep recognition networks, and the second-level deep recognition network of the second-stage analysis further refines individual key objects recognized by the first-level network. For example, for the "shoes" recognized by the first-level deep recognition network in a static frame image, the second-level deep recognition network performs second-stage recognition analysis on the screenshot of the "shoes" region to determine the "shoe type" and the "floor type". More specifically, this embodiment can recognize four refined kinds of footwear (sports shoes, leather shoes, high-heeled shoes, others) and five refined kinds of floor (tile floor, plank floor, cement floor, sand floor, others). The specific architecture of the second-level deep recognition network is designed on the basis of a 50-layer deep residual network (Resnet50). The deep residual network model is obtained through the following process (a condensed training sketch is given after the steps below):
s210, acquiring a plurality of images containing specific sound-producing objects, and eliminating unqualified images of the specific sound-producing objects to obtain qualified images of the specific sound-producing objects;
s220, preprocessing the image of the qualified specific sounding object to obtain an image data set of the qualified specific sounding object, and dividing the image data set into a training set and a verification set;
and S230, inputting the training set into the initial depth residual error network model for training, and verifying the training result through the verification set to obtain the depth residual error network model capable of acquiring the type of the specific sound-producing object.
In the prior art there is no deep residual network pre-trained to recognize shoes, floors or other specific sound-producing objects. The deep residual network used in this embodiment is not based on any pre-trained parameters: its network parameters are trained entirely from random initial values, all image sets required for training come from screenshots of actual videos, and the shoe and floor types are labelled manually. The image training set contains at least 17000+ pictures of different sizes, with variable aspect ratios and a maximum resolution of no more than 480p; the subjects are mainly shoes and floors, together with a small number of other specific sound-producing objects. When training the deep residual network model, unqualified images, such as pictures that are very blurred or in which the object is incomplete, need to be removed, and the remaining qualified images are divided into a training set and a validation set. Unlike public image recognition data sets, these pictures are mostly low-resolution and non-square, which reflects the fact that screenshots of video frames in real usage scenes have irregular shapes and that video compression can further lower the resolution; this irregularity and low resolution can be understood as noise contained in the image set, so a network trained on this data set has stronger noise resistance and is specifically optimized for footwear and floors. The recognition accuracy (calculated on a test set) of the five refined floor classes obtained by the deep residual network of this embodiment reaches 73.4%, far higher than random selection (20%) or majority selection (35.2%); the recognition accuracy of the four shoe classes is of the same order; and the actual recognition speed reaches 100 pictures per second on a single NVIDIA P100 graphics card.
In addition, the multi-layer perceptron at the end of the network (inherent to Resnet50) is deepened to two layers and combined with a random-deactivation design (Dropout = 0.5) to suit the category requirements of the various specific objects to be recognized; this avoids, to a certain extent, the over-fitting caused by an excessive number of network parameters (i.e. recognition performance on the training set far better than on the test set).
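A sketch of this modified classification head follows: the single fully connected layer at the end of ResNet50 is replaced by a two-layer perceptron with Dropout = 0.5. The hidden width of 512 is an assumption.

```python
import torch.nn as nn
from torchvision import models

def build_fine_grained_net(num_classes, hidden=512):
    """ResNet50 whose final perceptron is deepened to two layers with
    random deactivation (Dropout = 0.5) to curb over-fitting."""
    net = models.resnet50()                       # randomly initialised, no pre-training
    net.fc = nn.Sequential(
        nn.Linear(net.fc.in_features, hidden),    # 2048 -> hidden
        nn.ReLU(inplace=True),
        nn.Dropout(p=0.5),
        nn.Linear(hidden, num_classes),           # hidden -> class scores
    )
    return net

# e.g. a 5-way floor-type classifier: build_fine_grained_net(num_classes=5)
```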
The deep residual network (Resnet50) adopted in this embodiment was developed and open-sourced by the Microsoft Research Asia team in 2015 and has been widely used in industry and academia in the years since, so the invention does not elaborate further on its specific mathematical implementation; this embodiment simply trains it to recognize the types of specific sound-producing objects required, i.e. the calculation and recognition process for a single picture and the specific usage scene are adapted accordingly. The deep residual network (Resnet50) reads a square RGB image of at least 224 × 224 pixels; for a rectangular input image whose sides are not 224 pixels, this embodiment first uses conventional linear interpolation to deform it into a regular 224 × 224 × 3 floating-point matrix (three RGB color channels). After the matrix is fed into the network, it is transformed through a series of convolution blocks into feature maps of higher abstraction and smaller size. The convolution block is the basic unit of a conventional convolutional neural network (CNN) design; the convolution block used in Resnet50 consists of three to four two-dimensional convolution layers (2D convolution) combined with a random-deactivation design (Dropout), a batch normalization layer and a linear rectification layer (ReLU), and each block is paralleled by a residual path (a residual layer containing only a simple single two-dimensional convolution layer, or a simple copy of the input matrix). The feature map output by the previous block is computed separately through the residual path and the convolution block path, producing two new matrices of identical dimensions that are simply added together to form the input matrix of the next block. The number in the name of the deep residual network (Resnet50) refers to the total of 50 two-dimensional convolution layers contained in all its convolution blocks. After all convolution blocks, the deep residual network outputs a 2048-dimensional vector, which then passes through one layer of perceptron to output a vector of dimension 1000. Each element of the final output vector represents the probability that the image belongs to a certain category, and the final category label is determined by the maximum probability value. Common deep residual networks similar to Resnet50 include Resnet34, Resnet101, etc.; other common image recognition networks such as Alexnet, VGGnet and Inception net are also applicable in this embodiment, but their results were not as good, so the deep residual network (Resnet50) was chosen.
In addition, the secondary recognition network architecture, i.e., the deep residual error network (Resnet50), in the present embodiment simultaneously supports the feedback learning mode: when the recognition accuracy of the secondary depth recognition network does not meet the scene requirement, the frame image stream can be subjected to screenshot through an object selection box recognized by the primary depth recognition network, the screenshot is used as a new data set to be manually calibrated, and the secondary depth recognition network, namely a depth residual error network (Resnet50) is finely adjusted. Therefore, when the video content to be processed is changed greatly, the trained model and a small amount of new data can be used for rapidly obtaining higher recognition accuracy, and the preparation period for adapting to a new application scene is shortened. The first-level depth recognition network can also be retrained in stages according to the change of the video type or the change of the application scene so as to adapt to the characteristics of new video data.
Furthermore, the information on specific sound-producing objects recognized by each level of the two-level deep recognition network is merged and stored in a common format. For each object the following information is stored: the coarse object class (identified by the first-level network), the certainty value of the coarse class, the fine object class (identified by the second-level deep recognition network), the certainty value of the fine class, and the width, height and center of the object's selection box (measured in frame-image pixels); all of this is passed on for further processing in json file format.
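An illustrative record in this merged json format is shown below; the exact field names are assumptions chosen to mirror the items listed above.

```python
import json

# One recognised object, merged from both recognition stages.
record = {
    "coarse_class": "shoes",          # first-level network identification
    "coarse_confidence": 0.93,        # certainty value of the coarse class
    "fine_class": "sports shoes",     # second-level (ResNet50) identification
    "fine_confidence": 0.78,          # certainty value of the fine class
    "box": {"center_x": 412, "center_y": 305, "width": 96, "height": 120},  # frame-image pixels
}
print(json.dumps(record, indent=2, ensure_ascii=False))
```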
S300: constructing an object type of a specific sounding object and a specific sounding object audio based on the sounding characteristics, wherein the audio comprises audio introduction and audio keywords; the audio introduction is introduction content text of the audio, and the audio keywords comprise at least three words for describing the audio, wherein the words for describing the audio comprise the category name of a specific sound production object and the category name of a sound production sound.
In order to associate the object type of a specific sound-producing object with the audio of that object, this embodiment uses natural language as the intermediate representation for matching the object types of the video to be processed with the audio; using natural language as the matching representation makes it easy for people to understand and label, and to organize and maintain the audio library.
For the video to be processed, the object class identified by the video understanding module is represented in natural language (e.g. "cat"); for the audio, two kinds of natural-language annotation are used: the audio introduction and the audio keywords, i.e. each audio item has an audio introduction and audio keywords. The audio introduction describes the content of the audio with a sentence or phrase (e.g. "the sound of a person walking on snow"), while the audio keywords describe the content of the audio with three key words (e.g. "shoes / snow / footsteps"). Unlike the audio introduction, the audio keywords must include the sound-producing object and the category of the sound it makes; in short, introducing audio keywords bridges the gap between the object recognition categories and the audio introductions.
For a specific sound-producing object, the class name from object recognition is used directly as its natural-language representation. Since a computer cannot understand natural language directly, the natural-language representation is further mapped to a vector representation. Specifically, this embodiment introduces two vector representations of natural language: TF-IDF (term frequency-inverse document frequency) and BERT (Bidirectional Encoder Representations from Transformers).
In a specific embodiment, the TF-IDF vector is computed from the audio introduction text and indicates how much each word in a piece of text contributes to the semantics of the whole text. The method is as follows: first, Chinese word segmentation is performed on the audio introductions of all audio items with the "jieba" word segmenter; then the term frequency TF of each word within each audio introduction and the document frequency DF of each word over the set of all audio introductions are calculated; for an audio introduction, the TF-IDF of any of its words can then be computed as TF-IDF = TF × log(1/DF + 1). Note that this TF-IDF formula is a normalized TF-IDF, used to keep the values stable. Finally, for any piece of text, its TF-IDF vector is calculated: all words in the text library are ordered, the TF-IDF value of each word is calculated in that order for the piece of text, and a word that does not appear in the text is given a TF-IDF value of 0. The result is a vector whose length equals the vocabulary of the text library, namely the TF-IDF vector representation of the text.
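A sketch of this normalized TF-IDF representation follows, using the jieba segmenter; the helper names and the treatment of DF as a fraction of documents are assumptions.

```python
import math
from collections import Counter

import jieba  # Chinese word segmenter used on the audio introductions

def build_vocab_and_df(introductions):
    """Ordered vocabulary and document frequency DF over all audio introductions."""
    docs = [list(jieba.cut(text)) for text in introductions]
    vocab = sorted({w for doc in docs for w in doc})
    df = {w: sum(1 for doc in docs if w in doc) / len(docs) for w in vocab}
    return vocab, df

def tfidf_vector(text, vocab, df):
    """Normalized TF-IDF vector: TF-IDF = TF * log(1/DF + 1);
    words absent from the text get the value 0."""
    words = list(jieba.cut(text))
    counts = Counter(words)
    tf = {w: counts[w] / len(words) for w in counts}
    return [tf.get(w, 0.0) * math.log(1.0 / df[w] + 1.0) for w in vocab]

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```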
Further, the BERT vector is calculated. BERT in this embodiment is a Transformer neural network structure whose parameters are trained by large-scale unsupervised learning; the resulting model can be applied directly to downstream natural-language-understanding problems and can map sentences and phrases of natural language directly to vectors. This embodiment combines the two vector representations (together with simple word matching), which makes the result more accurate.
This embodiment computes the BERT vector representation of a sentence with the pre-trained Chinese BERT model in the pytorch_pretrained_bert package for PyTorch. To keep the matching efficient, the smallest BERT model, "bert-base-chinese", is used. Specifically, the sentence is split into individual characters, "[CLS]" and "[SEP]" are added as its first and last tokens to form the input index_tokens, an all-zero list of the same length as index_tokens is used as the input segment_ids, the two inputs are fed into the pre-trained BERT model together, and the output vector of the last neural-network layer at the position of the first token ("[CLS]") is taken as the BERT vector of the sentence.
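A sketch of this sentence-to-BERT-vector mapping follows, assuming the legacy pytorch_pretrained_bert package and the "bert-base-chinese" model are available in the environment.

```python
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def bert_vector(sentence):
    """Split the sentence into single characters, wrap it with [CLS] / [SEP],
    feed it with an all-zero segment list, and take the last layer's output
    at the [CLS] position as the sentence vector."""
    tokens = ["[CLS]"] + list(sentence) + ["[SEP]"]
    index_tokens = tokenizer.convert_tokens_to_ids(tokens)
    segment_ids = [0] * len(index_tokens)
    with torch.no_grad():
        encoded_layers, _ = model(
            torch.tensor([index_tokens]), torch.tensor([segment_ids])
        )
    return encoded_layers[-1][0, 0]      # [CLS] vector of the last layer

# e.g. vec = bert_vector("一个人在雪地上行走的声音")
```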
S400: and obtaining a video and audio matching score based on the object type and the audio of the specific sound-producing object, and searching and matching the specific sound-producing object and the audio according to the video and audio matching score so that the audio introduction, the audio keywords and the object type of the specific sound-producing object are matched with each other.
In one embodiment, the obtaining a video-audio matching score based on the object category and the audio of the specific sounding object and performing search matching on the specific sounding object and the audio according to the video-audio matching score enables the audio introduction and the audio keyword to be matched with the object category of the specific sounding object, specifically:
s410: processing the object type and the audio introduction of the specific sound-producing object to obtain a first matching score;
s420: obtaining the BERT vector of the object type of the specific sound-producing object and the BERT vector of the audio introduction, calculating their cosine similarity, and using the cosine similarity as a neural network matching score;
s430: obtaining a video and audio matching score based on the first matching score and the neural network matching score;
s440: and selecting the audios corresponding to the several highest video and audio matching scores as the audio recommendations for the specific sound-producing object.
The audio and video matching process matches the object types identified in the video against the audio introductions and audio keywords. The matching score is calculated in two ways, a traditional method and a neural network method. The advantage of the traditional method is that when the natural-language expressions of the audio and the video share the same words, the score can be calculated accurately; the advantage of the neural network is that the two expressions can still be matched even when they share no words at all. Using and combining the scores of both methods therefore makes them complementary.
Specifically, for each identified object, 10 best matching audios may be selected as dubbing recommendations according to the final matching score, although other numbers are also possible.
Further, the processing of the object type and the audio introduction of the specific sound-generating object to obtain the first matching score specifically includes:
s411: performing word segmentation processing on the object type and the audio introduction of the specific sound-producing object to obtain words;
s412: respectively obtaining the proportion of words of the object type of the specific sound-producing object that overlap with the audio introduction and with the audio keywords to obtain a first proportion and a second proportion, and carrying out weighted average processing on the first proportion and the second proportion to obtain a word matching score, wherein word matching score = (word overlap proportion of the object type with the audio introduction) × audio introduction weight + (word overlap proportion of the object type with the audio keywords) × audio keyword weight, and audio introduction weight + audio keyword weight = 1;
s413: obtaining the object type TF-IDF vector based on the statistical data of the audio introductions, and using the first cosine similarity between the object type TF-IDF vector and the audio introduction TF-IDF vector as a TF-IDF matching score, wherein TF-IDF matching score = cosine_similarity(object type TF-IDF vector, audio introduction TF-IDF vector);
s414: and carrying out weighted average processing on the word matching score and the TF-IDF matching score to obtain a first matching score, wherein first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and word weight + TF-IDF weight = 1.
In steps S411 to S414, the matching score is obtained by the traditional method. The object type of the specific sound-producing object and the audio introduction are segmented with the jieba word segmenter. Then the proportions of the object type's words that overlap with the audio introduction and with the audio keywords are calculated, and the two proportions are weighted and averaged to give the word matching score; the TF-IDF vector representation of the object type is obtained from the statistics of the audio introduction texts, the cosine similarity between the object-type TF-IDF vector and the audio-introduction TF-IDF vector is used as the TF-IDF matching score, and the word matching score and the TF-IDF matching score are weighted and averaged to give the traditional-method matching score, i.e. the first matching score of this step.
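A sketch of this traditional first-matching-score computation (steps S411 to S414) follows. The 0.5 default weights are assumptions, and tfidf_vector / cosine_similarity are the helpers from the TF-IDF sketch above, assumed to be in scope.

```python
import jieba

def overlap_ratio(class_words, reference_words):
    """Fraction of the object-type words that also appear in the reference text."""
    class_words = list(class_words)
    if not class_words:
        return 0.0
    ref = set(reference_words)
    return sum(1 for w in class_words if w in ref) / len(class_words)

def first_matching_score(object_type, audio_intro, audio_keywords,
                         vocab, df, intro_weight=0.5, word_weight=0.5):
    """Traditional score of S411-S414 (the 0.5 weights are assumptions):
      word score  = overlap(type, intro) * intro_weight
                    + overlap(type, keywords) * (1 - intro_weight)
      tfidf score = cosine(type TF-IDF, intro TF-IDF)
      first score = word score * word_weight + tfidf score * (1 - word_weight)
    """
    type_words = jieba.lcut(object_type)
    word_score = (overlap_ratio(type_words, jieba.lcut(audio_intro)) * intro_weight
                  + overlap_ratio(type_words, audio_keywords) * (1 - intro_weight))
    # tfidf_vector / cosine_similarity: helpers from the TF-IDF sketch above
    tfidf_score = cosine_similarity(tfidf_vector(object_type, vocab, df),
                                    tfidf_vector(audio_intro, vocab, df))
    return word_score * word_weight + tfidf_score * (1 - word_weight)
```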
Step S420 performs the matching with a neural network: the BERT vectors of the image category and of all audio introductions are computed first, and then the cosine similarity between the image-category BERT vector and each audio-introduction BERT vector is taken as the neural network matching score.
In step S430, obtaining a video and audio matching score based on the first matching score and the neural network matching score specifically means carrying out weighted average processing on the first matching score and the neural network matching score, wherein video and audio matching score = first matching score × first weight + neural network matching score × neural network weight, and first weight + neural network weight = 1.
Specifically, the first matching score and the neural network matching score are weighted and averaged to obtain the final video and audio matching score. In practice the weights of this weighted average can be adjusted as needed: if the name of the object type of the specific sound-producing object should appear literally in the audio introduction or keywords, the weight of the traditional matching score can be increased to raise accuracy; if the name of the object type need not appear in the audio introduction or keywords but should have the same meaning, the weight of the neural network matching score can be increased to improve generalization.
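A sketch of the final combination and of the top-scoring recommendations follows; the 0.5 default weight and the top-10 cut-off mirror the embodiment but remain adjustable, and the BERT vectors are assumed to come from the bert_vector sketch above.

```python
import torch
import torch.nn.functional as F

def neural_matching_score(type_bert_vec, intro_bert_vec):
    """Cosine similarity of the two BERT vectors (see the BERT sketch above)."""
    return F.cosine_similarity(type_bert_vec, intro_bert_vec, dim=0).item()

def video_audio_score(first_score, nn_score, first_weight=0.5):
    """final score = first score * w1 + neural score * (1 - w1); raise w1 when the
    object-type name should appear literally, lower it for purely semantic matches."""
    return first_score * first_weight + nn_score * (1 - first_weight)

def recommend_audios(scored_audios, top_k=10):
    """scored_audios: list of (audio_id, final_score); return the top_k best matches."""
    return sorted(scored_audios, key=lambda item: item[1], reverse=True)[:top_k]
```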
In one embodiment, after the step of searching for a match between the specific sound-generating object and the audio so that the audio matches to the specific sound-generating object, the method further comprises the following steps: and mixing all the audios to form a complete audio file, and adding the audio file into the audio track of the video to enable the audio file and the video to be synchronous. In this embodiment, the generated audio is mixed, and after the audio file required for dubbing and the start/stop time of playing each audio file are found, all the required audio files can be read, and each audio file is converted into a uniform frequency domain signal format, so as to facilitate subsequent editing.
In the present embodiment, audio files in any common format, including wav and mp3, can be read, which improves the capability of using scenes and generalizing to other specific audio libraries.
The specific process of mixing all the audio is as follows. Each audio segment is intelligently stretched or compressed to the duration required by the dubbing. First, the silent parts at the beginning and end of the audio are cut off, so that the dubbing and the picture that triggers it occur simultaneously and the dubbing effect is optimal. It is then checked whether the duration of the audio, after the leading and trailing silence has been removed, is longer than the required playing time. If it is, the audio is cut to the playing duration required by the dubbing and a fade-out is applied at the end to avoid an abrupt stop. If not, the audio is looped until it reaches the playing duration required by the dubbing; at each joint between consecutive repetitions, an overlap with a fade-in and fade-out of a certain length is used so that the looped sections join seamlessly, the long audio segment sounds natural and complete, and the listener gets the best experience. The fade-in/fade-out length equals the overlap length and is determined from the audio duration by a piecewise function: if the original audio is shorter than 20 seconds, the overlap and fade time is set to 10% of the audio duration, so that the overlap is of moderate length, the two adjoining sections transition smoothly, and as much of the non-overlapping part of a short clip as possible is kept for playback; if the original audio is longer than 20 seconds, the overlap and fade time is set to 2 seconds, which prevents an unnecessarily long transition in long audio and again plays as much non-overlapping audio as possible.
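A sketch of this per-clip processing follows, using pydub: trim leading and trailing silence, then either cut with a fade-out or loop with an overlapping cross-fade whose length follows the piecewise rule (10% of the clip below 20 s, otherwise 2 s). The silence threshold and the fade-out length used for the cut case are assumptions.

```python
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

def trim_silence(seg, threshold_dbfs=-50.0):
    """Cut the silent head and tail of an audio segment."""
    start = detect_leading_silence(seg, silence_threshold=threshold_dbfs)
    end = detect_leading_silence(seg.reverse(), silence_threshold=threshold_dbfs)
    return seg[start:len(seg) - end]

def fit_to_duration(seg, target_ms):
    """Stretch (by looping with cross-fades) or shrink (by cutting with a
    fade-out) a trimmed clip to the duration required by the dubbing."""
    seg = trim_silence(seg)
    if len(seg) == 0:                                   # fully silent clip
        return AudioSegment.silent(duration=target_ms)
    if len(seg) >= target_ms:
        return seg[:target_ms].fade_out(min(2000, target_ms // 4))  # assumed fade length
    # piecewise rule for the overlap / cross-fade length
    overlap = int(len(seg) * 0.10) if len(seg) < 20_000 else 2_000
    out = seg
    while len(out) < target_ms:
        out = out.append(seg, crossfade=min(overlap, len(out), len(seg)))
    return out[:target_ms]

# clip = AudioSegment.from_file("footsteps.wav")   # wav, mp3, ... are all readable
# fitted = fit_to_duration(clip, target_ms=8_000)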
Finally, the audio frequencies processed according to the steps are combined together, and added into the audio track of the video, and a new video file with dubbing is output, so that the whole dubbing process is completed.
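A sketch of this final step, overlaying the fitted clips into one track and writing it onto the video's audio track, follows, using pydub together with moviepy; the file names are placeholders.

```python
from pydub import AudioSegment
from moviepy.editor import AudioFileClip, VideoFileClip

def mix_and_attach(video_path, placed_clips, out_path="dubbed.mp4"):
    """placed_clips: list of (AudioSegment, start_ms) already fitted to their
    playing durations; overlay them onto one track and add it to the video."""
    video = VideoFileClip(video_path)
    track = AudioSegment.silent(duration=int(video.duration * 1000))
    for seg, start_ms in placed_clips:
        track = track.overlay(seg, position=start_ms)
    track.export("mixed_track.wav", format="wav")
    video.set_audio(AudioFileClip("mixed_track.wav")).write_videofile(out_path)
```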
The audio mixing module has a simple, easy-to-use functional interface and can generate a dubbed video with one click, which greatly improves the user's working efficiency. The mixing method of this embodiment relies on the mixing module; although common audio tools are used, the specific mixing steps and parameters are designed especially for films, dramas and short videos. For example, the silence removal and the compression or extension of special-effect audio described above specifically solve the dubbing problem of these categories of video, namely that the lengths of the clips in the special-effect audio library often do not match the durations required for dubbing; these specific audio processing parameters are also the ones best suited to this embodiment and are not achieved by other techniques or parameter choices.
Example 2:
a system for automatically generating sound effect and matching video is shown in FIG. 2, and comprises a video processing module 100, a feature extraction module 200, a feature representation module 300 and a search matching module 400;
the video processing module 100 is configured to, based on a video to be processed, down-convert and extract a video key frame, and perform preliminary identification, analysis and processing to obtain a modular specific sound object;
the feature extraction module 200 is configured to perform multi-stage recognition analysis processing on the modular specific sound-producing object through a deep residual error network model to obtain the type of the specific sound-producing object and extract sound-producing features of the specific sound-producing object;
the feature representation module 300 is configured to construct an object class of a specific sounding object and a specific sounding object audio based on the sounding features, where the audio includes audio introduction and audio keywords;
the search matching module 400 is configured to obtain a video and audio matching score based on the object type and the audio of the specific sound-generating object, and perform search matching on the specific sound-generating object and the audio according to the video and audio matching score, so that the audio introduction, the audio keyword, and the object type of the specific sound-generating object are matched with each other.
In one embodiment, the video processing module 100 is configured to: reducing the frame extraction frequency of the relevant information of the video to be processed, and extracting video key frames; generating a frame image stream from the extracted video key frames; and performing modular multi-object recognition on the frame image stream by adopting a deep convolutional neural network model.
In one embodiment, the feature representation module 300 is configured to: and resolving the audio into an audio introduction and audio keywords, wherein the audio introduction is an introduction content text of the audio, and the audio keywords comprise at least three words for describing the audio, and the words for describing the audio comprise the category name of the specific sound production object and the category name of the sound production sound.
In one embodiment, the search matching module 400 is configured to: process the object category and the audio introduction of the specific sound-producing object to obtain a first matching score; obtain the BERT vector of the object category and the BERT vector of the audio introduction, compute their cosine similarity, and take that cosine similarity as the neural network matching score; obtain a video-audio matching score based on the first matching score and the neural network matching score; and select the audios corresponding to the several highest video-audio matching scores as the audio recommendations for the specific sound-producing object.
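The neural network matching score can be illustrated roughly as the cosine similarity of two BERT sentence embeddings. The sketch below assumes the sentence-transformers library and a multilingual MiniLM checkpoint; the embodiment does not name a specific BERT model, so both are placeholders.

```python
# Sketch only: BERT-based neural network matching score, assuming a
# sentence-transformers checkpoint that the embodiment does not specify.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

_model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def neural_network_matching_score(object_category: str, audio_introduction: str) -> float:
    """Cosine similarity between the BERT vectors of the category and the introduction."""
    vectors = _model.encode([object_category, audio_introduction])
    return float(cosine_similarity([vectors[0]], [vectors[1]])[0][0])
```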
In one embodiment, the search matching module 400 is configured to: segment the object category and the audio introduction of the specific sound-producing object into words; compute the proportion of the object category's words that overlap with the audio introduction and with the audio keywords, obtaining a first proportion and a second proportion, and combine them by weighted average into a word matching score, i.e. word matching score = overlap proportion with the audio introduction × audio introduction weight + overlap proportion with the audio keywords × audio keyword weight, where audio introduction weight + audio keyword weight = 1; build TF-IDF vectors from the statistics of the audio introductions and take the cosine similarity between the object category TF-IDF vector and the audio introduction TF-IDF vector as the TF-IDF matching score, i.e. TF-IDF matching score = cosine_similarity(object category TF-IDF vector, audio introduction TF-IDF vector); and combine the word matching score and the TF-IDF matching score by weighted average into the first matching score, i.e. first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, where word weight + TF-IDF weight = 1.
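The sketch below illustrates, under assumptions, how the word matching score, the TF-IDF matching score and the first matching score could be computed. jieba word segmentation, scikit-learn's TfidfVectorizer and the 0.5/0.5 default weights are illustrative choices, not the embodiment's fixed tools or parameters.

```python
# Sketch only: first matching score as a weighted blend of word overlap and
# TF-IDF cosine similarity; segmentation tool and weights are assumptions.
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def word_matching_score(category, introduction, keywords, w_intro=0.5, w_kw=0.5):
    """Overlap of the category's words with the introduction and with the keywords."""
    cat_words = set(jieba.lcut(category))
    intro_words = set(jieba.lcut(introduction))
    kw_words = set(keywords)
    intro_ratio = len(cat_words & intro_words) / max(len(cat_words), 1)
    kw_ratio = len(cat_words & kw_words) / max(len(cat_words), 1)
    return w_intro * intro_ratio + w_kw * kw_ratio        # w_intro + w_kw = 1

def tfidf_matching_score(category, introductions, index):
    """Cosine similarity between the category and one introduction, both vectorised
    on the statistics of the whole introduction corpus."""
    vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, token_pattern=None)
    intro_vectors = vectorizer.fit_transform(introductions)
    cat_vector = vectorizer.transform([category])
    return float(cosine_similarity(cat_vector, intro_vectors[index])[0][0])

def first_matching_score(word_score, tfidf_score, w_word=0.5, w_tfidf=0.5):
    return w_word * word_score + w_tfidf * tfidf_score    # w_word + w_tfidf = 1
```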
In one embodiment, the search matching module 400 is configured to: combine the first matching score and the neural network matching score by weighted average into the video-audio matching score, i.e. video-audio matching score = first matching score × first weight + neural network matching score × neural network weight, where first weight + neural network weight = 1.
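For completeness, here is a tiny sketch of the final fusion and the top-N audio recommendation; the 0.5/0.5 weights and N = 5 are assumed values, not figures fixed by the embodiment.

```python
# Sketch only: fusing the two scores and recommending the top-N audios.
def video_audio_matching_score(first_score, nn_score, w_first=0.5, w_nn=0.5):
    return w_first * first_score + w_nn * nn_score        # w_first + w_nn = 1

def recommend_audios(scored_audios, top_n=5):
    """scored_audios: list of (audio_entry, video_audio_matching_score) pairs."""
    return sorted(scored_audios, key=lambda pair: pair[1], reverse=True)[:top_n]
```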
In one embodiment, the system further comprises a processing synchronization module, which mixes all the audio into a complete audio file and adds the audio file to the audio track of the video so that the audio and the video are synchronized. For specific limitations of the system for automatically generating sound effects and matching videos, reference may be made to the above limitations of the method for automatically generating sound effects and matching videos, which are not repeated here.
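A rough sketch of this mixing and synchronization step is shown below, assuming pydub for overlaying clips on a silent track and an ffmpeg command for attaching the mixed track to the video; the clip offsets, file names and codec choices are illustrative assumptions.

```python
# Sketch only: mix matched clips into one track and attach it to the video's
# audio track with ffmpeg; offsets and output settings are assumptions.
import subprocess
from pydub import AudioSegment

def mix_and_attach(video_path, clips, video_duration_ms, output_path):
    """clips: list of (audio_file, start_ms) pairs aligned to the video timeline."""
    track = AudioSegment.silent(duration=video_duration_ms)
    for audio_file, start_ms in clips:
        track = track.overlay(AudioSegment.from_file(audio_file), position=start_ms)
    track.export("mixed_track.wav", format="wav")
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path, "-i", "mixed_track.wav",
        "-map", "0:v", "-map", "1:a", "-c:v", "copy", "-shortest", output_path,
    ], check=True)
```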
The processing synchronization module provides a simple, easy-to-use functional interface and can generate a dubbed video with one click, which greatly improves the user's working efficiency. Although the processing synchronization module relies on common audio tools, the specific mixing steps and parameters of the method are designed for movies, dramas and short videos: the silence-removal and special-effect-audio compression or extension methods mentioned in the method embodiment address the dubbing problems of these categories of video, in particular the frequent case where the length of an audio clip in the special-effect audio library does not match the length required for dubbing. These specific audio processing parameters are the best fit for this embodiment, and the same result cannot be achieved with other techniques or audio processing parameters.
All of the modules in the system for automatically generating sound effects and matching videos may be implemented wholly or partly in software, in hardware, or in a combination of both. The modules may be embedded in, or independent of, the processor of the computer device or mobile terminal in hardware form, or stored in the memory of the computer device or mobile terminal in software form, so that the processor can invoke and execute the operations corresponding to each module.
Since the system embodiment is substantially similar to the method embodiment, its description is brief; for relevant details, refer to the corresponding parts of the method embodiment.
Example 3:
A computer-readable storage medium storing a computer program which, when executed by a processor, performs the following steps:
extracting video key frames from a video to be processed at a reduced frame rate, and performing preliminary recognition and analysis to obtain modularized specific sound-producing objects;
performing multi-stage recognition and analysis on the modularized specific sound-producing objects through a deep residual network model to obtain the category of each specific sound-producing object and extract its sound-producing features;
constructing, based on the sound-producing features, the object category of the specific sound-producing object and the audio of the specific sound-producing object, wherein the audio comprises an audio introduction and audio keywords;
and obtaining a video-audio matching score based on the object category and the audio of the specific sound-producing object, and performing search matching between the specific sound-producing object and the audio according to the video-audio matching score, so that the audio introduction, the audio keywords and the object category of the specific sound-producing object match one another.
In one embodiment, when the processor executes the computer program, extracting the video key frames at a reduced frame rate and performing the preliminary recognition and analysis specifically comprise:
reducing the frame sampling frequency of the video to be processed and extracting video key frames;
generating a frame image stream from the extracted key frames;
and performing modularized multi-object recognition on the frame image stream with a deep convolutional neural network model.
In one embodiment, when the processor executes the computer program, constructing the object category and the audio of the specific sound-producing object based on the sound-producing features, where the audio includes an audio introduction and audio keywords, specifically comprises:
parsing the audio into an audio introduction and audio keywords, where the audio introduction is the descriptive text of the audio and the audio keywords comprise at least three words describing the audio, including the category name of the specific sound-producing object and the category name of the sound it produces.
In one embodiment, when the processor executes the computer program, obtaining the video-audio matching score based on the object category and the audio of the specific sound-producing object and performing search matching between the specific sound-producing object and the audio according to the video-audio matching score, so that the audio introduction, the audio keywords and the object category match one another, specifically comprises:
processing the object category and the audio introduction of the specific sound-producing object to obtain a first matching score;
obtaining the BERT vector of the object category and the BERT vector of the audio introduction, computing their cosine similarity, and taking that cosine similarity as the neural network matching score;
obtaining a video-audio matching score based on the first matching score and the neural network matching score;
and selecting the audios corresponding to the several highest video-audio matching scores as the audio recommendations for the specific sound-producing object.
In one embodiment, when the processor executes the computer program, processing the object category and the audio introduction of the specific sound-producing object to obtain the first matching score specifically comprises:
segmenting the object category and the audio introduction of the specific sound-producing object into words;
computing the proportion of the object category's words that overlap with the audio introduction and with the audio keywords to obtain a first proportion and a second proportion, and combining them by weighted average into a word matching score, i.e. word matching score = overlap proportion with the audio introduction × audio introduction weight + overlap proportion with the audio keywords × audio keyword weight, where audio introduction weight + audio keyword weight = 1;
building TF-IDF vectors from the statistics of the audio introductions and taking the cosine similarity between the object category TF-IDF vector and the audio introduction TF-IDF vector as the TF-IDF matching score, i.e. TF-IDF matching score = cosine_similarity(object category TF-IDF vector, audio introduction TF-IDF vector);
and combining the word matching score and the TF-IDF matching score by weighted average into the first matching score, i.e. first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, where word weight + TF-IDF weight = 1.
In one embodiment, when the processor executes the computer program, obtaining the video-audio matching score based on the first matching score and the neural network matching score specifically comprises:
combining the first matching score and the neural network matching score by weighted average into the video-audio matching score, i.e. video-audio matching score = first matching score × first weight + neural network matching score × neural network weight, where first weight + neural network weight = 1.
In one embodiment, when the processor executes the computer program, after searching and matching the specific sound-producing object and the audio so that the audio matches the specific sound-producing object, the following step is further performed:
mixing all the audio into a complete audio file, and adding the audio file to the audio track of the video so that the audio and the video are synchronized.
Example 4:
In one embodiment, a device for automatically generating sound effects and matching videos is provided; the device may be a server or a mobile terminal. The device comprises a processor, a memory, a network interface and a database connected through a system bus. The processor of the device provides computing and control capabilities. The memory of the device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database stores all of the data of the device. The network interface of the device communicates with external terminals through a network connection. The computer program, when executed by the processor, implements the method for automatically generating sound effects and matching videos.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the parts that are the same or similar between embodiments may be referred to one another.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that:
reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
In addition, it should be noted that the specific embodiments described in this specification may differ in the shape and names of their components. Any equivalent or simple change to the structure, features and principles described in the inventive concept of this patent falls within the protection scope of this patent. Those skilled in the art may make various modifications, additions and substitutions to the specific embodiments described without departing from the scope of the invention as defined in the accompanying claims.

Claims (10)

1. A method for automatically generating sound effects and matching videos, characterized by comprising the following steps:
extracting video key frames from a video to be processed at a reduced frame rate, and performing preliminary recognition and analysis to obtain modularized specific sound-producing objects;
performing multi-stage recognition and analysis on the modularized specific sound-producing objects through a deep residual network model to obtain the category of each specific sound-producing object and extract its sound-producing features;
constructing, based on the sound-producing features, the object category of the specific sound-producing object and the audio of the specific sound-producing object, wherein the audio comprises an audio introduction and audio keywords;
and obtaining a video-audio matching score based on the object category and the audio of the specific sound-producing object, and performing search matching between the specific sound-producing object and the audio according to the video-audio matching score, so that the audio introduction, the audio keywords and the object category of the specific sound-producing object match one another.
2. The method for automatically generating sound effects and matching videos according to claim 1, wherein extracting video key frames at a reduced frame rate and performing preliminary recognition and analysis specifically comprise:
reducing the frame sampling frequency of the video to be processed and extracting video key frames;
generating a frame image stream from the extracted key frames;
and performing modularized multi-object recognition on the frame image stream with a deep convolutional neural network model.
3. The method for automatically generating sound effects and matching videos according to claim 1, wherein the audio introduction is the descriptive text of the audio, and the audio keywords comprise at least three words describing the audio, including the category name of the specific sound-producing object and the category name of the sound it produces.
4. The method for automatically generating sound effects and matching videos according to claim 3, wherein obtaining the video-audio matching score based on the object category and the audio of the specific sound-producing object and performing search matching between the specific sound-producing object and the audio according to the video-audio matching score, so that the audio introduction, the audio keywords and the object category of the specific sound-producing object match one another, specifically comprises:
processing the object category and the audio introduction of the specific sound-producing object to obtain a first matching score;
obtaining the BERT vector of the object category and the BERT vector of the audio introduction, computing their cosine similarity, and taking that cosine similarity as the neural network matching score;
obtaining a video-audio matching score based on the first matching score and the neural network matching score;
and selecting the audios corresponding to the several highest video-audio matching scores as the audio recommendations for the specific sound-producing object.
5. The method for automatically generating sound effects and matching videos according to claim 4, wherein processing the object category and the audio introduction of the specific sound-producing object to obtain the first matching score specifically comprises:
segmenting the object category and the audio introduction of the specific sound-producing object into words;
computing the proportion of the object category's words that overlap with the audio introduction and with the audio keywords to obtain a first proportion and a second proportion, and combining them by weighted average into a word matching score, i.e. word matching score = overlap proportion with the audio introduction × audio introduction weight + overlap proportion with the audio keywords × audio keyword weight, where audio introduction weight + audio keyword weight = 1;
building TF-IDF vectors from the statistics of the audio introductions and taking the cosine similarity between the object category TF-IDF vector and the audio introduction TF-IDF vector as the TF-IDF matching score, i.e. TF-IDF matching score = cosine_similarity(object category TF-IDF vector, audio introduction TF-IDF vector);
and combining the word matching score and the TF-IDF matching score by weighted average into the first matching score, i.e. first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, where word weight + TF-IDF weight = 1.
6. The method for automatically generating sound effects and matching videos according to claim 4, wherein obtaining the video-audio matching score based on the first matching score and the neural network matching score specifically comprises:
combining the first matching score and the neural network matching score by weighted average into the video-audio matching score, i.e. video-audio matching score = first matching score × first weight + neural network matching score × neural network weight, where first weight + neural network weight = 1.
7. The method for automatically generating sound effects and matching videos according to claim 1, wherein after searching and matching the specific sound-producing object and the audio so that the audio matches the specific sound-producing object, the method further comprises the following step:
mixing all the audio into a complete audio file, and adding the audio file to the audio track of the video so that the audio and the video are synchronized.
8. A system for automatically generating sound effects and matching videos, characterized by comprising a video processing module, a feature extraction module, a feature representation module and a search matching module;
the video processing module is configured to extract video key frames from a video to be processed at a reduced frame rate, and to perform preliminary recognition and analysis to obtain modularized specific sound-producing objects;
the feature extraction module is configured to perform multi-stage recognition and analysis on the modularized specific sound-producing objects through a deep residual network model, obtaining the category of each specific sound-producing object and extracting its sound-producing features;
the feature representation module is configured to construct, based on the sound-producing features, the object category of the specific sound-producing object and the audio of the specific sound-producing object, where the audio includes an audio introduction and audio keywords;
the search matching module is configured to obtain a video-audio matching score based on the object category and the audio of the specific sound-producing object, and to perform search matching between the specific sound-producing object and the audio according to the video-audio matching score, so that the audio introduction, the audio keywords and the object category of the specific sound-producing object match one another.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 7.
10. An apparatus for automatically generating sound effects and matching video, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the method steps of any of claims 1 to 7.
CN202010518573.9A 2020-06-09 2020-06-09 Method, system, device and storage medium for automatically generating sound effects and matching videos Active CN111681678B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010518573.9A CN111681678B (en) 2020-06-09 2020-06-09 Method, system, device and storage medium for automatically generating sound effects and matching videos

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010518573.9A CN111681678B (en) 2020-06-09 2020-06-09 Method, system, device and storage medium for automatically generating sound effects and matching videos

Publications (2)

Publication Number Publication Date
CN111681678A true CN111681678A (en) 2020-09-18
CN111681678B CN111681678B (en) 2023-08-22

Family

ID=72454217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010518573.9A Active CN111681678B (en) 2020-06-09 2020-06-09 Method, system, device and storage medium for automatically generating sound effects and matching videos

Country Status (1)

Country Link
CN (1) CN111681678B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010105089A1 (en) * 2009-03-11 2010-09-16 Google Inc. Audio classification for information retrieval using sparse features
US20150143416A1 (en) * 2013-11-21 2015-05-21 Thomson Licensing Method and apparatus for matching of corresponding frames in multimedia streams
CN104394331A (en) * 2014-12-05 2015-03-04 厦门美图之家科技有限公司 Video processing method for adding matching sound effect in video picture
CN109920409A (en) * 2019-02-19 2019-06-21 标贝(深圳)科技有限公司 A kind of speech search method, device, system and storage medium
CN110446063A (en) * 2019-07-26 2019-11-12 腾讯科技(深圳)有限公司 Generation method, device and the electronic equipment of video cover
CN110839173A (en) * 2019-11-18 2020-02-25 上海极链网络科技有限公司 Music matching method, device, terminal and storage medium
CN111031386A (en) * 2019-12-17 2020-04-17 腾讯科技(深圳)有限公司 Video dubbing method and device based on voice synthesis, computer equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUNIL RAVULAPALLI et al.: "Association of Sound to Motion in Video using Perceptual Organization" *
LI YING: "Audio Data Retrieval Based on Local Search", vol. 3, no. 3, pages 259-264 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114007145A (en) * 2021-10-29 2022-02-01 青岛海信传媒网络技术有限公司 Subtitle display method and display equipment
CN114189738A (en) * 2021-12-17 2022-03-15 中国传媒大学 Sound effect synthesis method and device, electronic equipment and storage medium
CN117641019A (en) * 2023-12-01 2024-03-01 广州一千零一动漫有限公司 Audio matching verification method and system based on animation video

Also Published As

Publication number Publication date
CN111681678B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN111681678B (en) Method, system, device and storage medium for automatically generating sound effects and matching videos
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN111626049B (en) Title correction method and device for multimedia information, electronic equipment and storage medium
US11790271B2 (en) Automated evaluation of acting performance using cloud services
CN110347866B (en) Information processing method, information processing device, storage medium and electronic equipment
CN114143479B (en) Video abstract generation method, device, equipment and storage medium
CN104980790A (en) Voice subtitle generating method and apparatus, and playing method and apparatus
CN114363695B (en) Video processing method, device, computer equipment and storage medium
CN113923521B (en) Video scripting method
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
CN111681680B (en) Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN111681676B (en) Method, system, device and readable storage medium for constructing audio frequency by video object identification
US11176947B2 (en) System and method for neural network orchestration
CN111523430A (en) Customizable interactive video production method and device based on UCL
CN115965810A (en) Short video rumor detection method based on multi-modal consistency
CN111681679B (en) Video object sound effect searching and matching method, system, device and readable storage medium
CN111681677B (en) Video object sound effect construction method, system, device and readable storage medium
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
US20220004773A1 (en) Apparatus for training recognition model, apparatus for analyzing video, and apparatus for providing video search service
Hukkeri et al. Erratic navigation in lecture videos using hybrid text based index point generation
CN112530456B (en) Language category identification method and device, electronic equipment and storage medium
WO2023238722A1 (en) Information creation method, information creation device, and moving picture file
CN110444053B (en) Language learning method, computer device and readable storage medium
CN116956915A (en) Entity recognition model training method, device, equipment, storage medium and product
Moens et al. State of the art on semantic retrieval of AV content beyond text resources

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant