CN111681677A - Video object sound effect construction method, system and device and readable storage medium

Info

Publication number
CN111681677A
Authority
CN
China
Prior art keywords
audio
video
matching score
sound
specific
Legal status
Granted
Application number
CN202010517918.9A
Other languages
Chinese (zh)
Other versions
CN111681677B (en)
Inventor
薛媛
金若熙
Current Assignee
Hangzhou Xinghe Shangshi Film Media Co., Ltd.
Original Assignee
Hangzhou Xinghe Shangshi Film Media Co., Ltd.
Application filed by Hangzhou Xinghe Shangshi Film Media Co., Ltd.
Priority to CN202010517918.9A
Publication of CN111681677A
Application granted
Publication of CN111681677B
Current status: Active

Classifications

    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/08 Neural networks; learning methods
    • G06V20/40 Scenes; scene-specific elements in video content
    • G10L25/54 Speech or voice analysis techniques specially adapted for comparison or discrimination, for retrieval
    • G10L25/57 Speech or voice analysis techniques specially adapted for comparison or discrimination, for processing of video signals
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a video object sound effect construction method, which comprises the following steps: recognizing a video to be processed to obtain the category of a specific sound-producing object in the video and extracting the sound-producing features of the specific sound-producing object; constructing, based on the sound-producing features, the object category of the specific sound-producing object and the audio of the specific sound-producing object, wherein the audio comprises an audio introduction and audio keywords; performing score matching based on the object category, the audio introduction and the audio keywords of the specific sound-producing object to obtain a first matching score and a neural network matching score respectively; and obtaining a video-audio matching score based on the first matching score and the neural network matching score, and obtaining at least one suitable audio for the specific sound-producing object according to the video-audio matching score. By constructing suitable sound effects for the specific sound-producing objects in the video, the sound effects are generated directly, without a Foley artist dubbing special-effect sounds when the video is scored; this reduces the Foley artist's tedious work, is convenient and fast, and has high accuracy.

Description

Video object sound effect construction method, system and device and readable storage medium
Technical Field
The invention relates to the technical field of video processing, and in particular to a video object sound effect construction method, system and device and a readable storage medium.
Background
At present, with the development of science and technology, multimedia audio and video technology is widely applied in many fields. Matching sound effects to the specific sound-producing objects in a video gives the audience a better experience and helps their understanding and cognition of those fields, so how to make a good video has become increasingly important.
In existing video processing technology, clipping, special effects, subtitles, audio material and so on are added to a video separately. For example, a Foley artist imitates the sounds of the specific sound-producing objects of a video, either synchronously with the video, or by recording the video first and adding the imitated sounds afterwards, or by performing the sounds on site and recording them directly into the video. Sounds other than the characters' voices are difficult to match, however, and the sound parts that cannot be finished at the shooting site, such as footsteps, doors opening and closing, or falling water, are produced by the Foley artist and matched to the video in post-production.
The traditional way of matching special-effect sounds to specific objects in a video is slow and inaccurate, and synchronizing the video with the various sounds is complicated, so the workload of the staff is large, a great deal of time is required, and the workflow is extremely inflexible.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a video object sound effect construction method, system and device and a readable storage medium.
In order to solve the above technical problem, the invention adopts the following technical solution:
A video object sound effect construction method comprises the following steps:
recognizing a video to be processed to obtain the category of a specific sound-producing object in the video and extracting the sound-producing features of the specific sound-producing object;
constructing, based on the sound-producing features, the object category of the specific sound-producing object and the audio of the specific sound-producing object, wherein the audio comprises an audio introduction and audio keywords;
performing score matching based on the object category, the audio introduction and the audio keywords of the specific sound-producing object to obtain a first matching score and a neural network matching score respectively;
and obtaining a video-audio matching score based on the first matching score and the neural network matching score, and obtaining at least one suitable audio for the specific sound-producing object according to the video-audio matching score.
As an implementable embodiment, the recognizing of the video to be processed to obtain the category of the specific sound-producing object in the video and extract its sound-producing features specifically comprises:
reducing the frame-extraction frequency for the video to be processed and extracting video key frames;
generating a frame-image stream from the extracted video key frames;
performing modular multi-object recognition on the frame-image stream with a deep convolutional neural network model to obtain modularized specific sound-producing objects;
and performing multi-stage recognition and analysis on the modularized specific sound-producing objects with a deep residual network model to obtain the category of the specific sound-producing object in the video and extract its sound-producing features.
As an implementation manner, the audio introduction is the introductory text of the audio, and the audio keywords comprise at least three words describing the audio, including the category name of the specific sound-producing object and the category name of the produced sound.
As an implementation manner, the performing of score matching based on the object category, the audio introduction and the audio keywords of the specific sound-producing object to obtain a first matching score and a neural network matching score respectively comprises:
performing word segmentation on the object category of the specific sound-producing object and on the audio introduction to obtain words;
obtaining the proportions of the object-category words that coincide with the audio introduction and with the audio keywords, respectively, as a first proportion and a second proportion, and performing weighted averaging on the first proportion and the second proportion to obtain a word matching score, where word matching score = (word-coincidence proportion between the object category and the audio introduction) x audio-introduction weight + (word-coincidence proportion between the object category and the audio keywords) x audio-keyword weight, and audio-introduction weight + audio-keyword weight = 1;
obtaining an object-category TF-IDF vector based on the statistics over the audio introductions, and taking the first cosine similarity between the object-category TF-IDF vector and the audio-introduction TF-IDF vector as a TF-IDF matching score, i.e. TF-IDF matching score = cosine_similarity(object-category TF-IDF vector, audio-introduction TF-IDF vector);
performing weighted averaging on the word matching score and the TF-IDF matching score to obtain the first matching score, where first matching score = word matching score x word weight + TF-IDF matching score x TF-IDF weight, and word weight + TF-IDF weight = 1;
and obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing their cosine similarity, and taking this cosine similarity as the neural network matching score.
As an implementation manner, the obtaining of a video-audio matching score based on the first matching score and the neural network matching score specifically comprises:
performing weighted averaging on the first matching score and the neural network matching score to obtain the video-audio matching score, where video-audio matching score = first matching score x first weight + neural network matching score x neural network weight, and first weight + neural network weight = 1.
As an implementation manner, after obtaining one or more suitable audios for the specific sound-producing object according to the video-audio matching score, the method further comprises:
searching and matching the specific sound-producing object with the selected audio according to the video-audio matching score, so that the audio introduction, the audio keywords and the object category of the specific sound-producing object match one another;
and mixing all the audios into a complete audio file, and adding the audio file to the audio track of the video so that the audio and the video are synchronized.
A video object sound effect construction system comprises a recognition processing module, a category construction module, a score calculation module and a score processing module;
the recognition processing module is configured to recognize a video to be processed to obtain the category of a specific sound-producing object in the video and to extract the sound-producing features of the specific sound-producing object;
the category construction module is configured to construct, based on the sound-producing features, the object category of the specific sound-producing object and the audio of the specific sound-producing object, wherein the audio comprises an audio introduction and audio keywords;
the score calculation module is configured to perform score matching based on the object category, the audio introduction and the audio keywords of the specific sound-producing object to obtain a first matching score and a neural network matching score respectively;
the score processing module is configured to obtain a video-audio matching score based on the first matching score and the neural network matching score, and to obtain at least one suitable audio for the specific sound-producing object according to the video-audio matching score.
As an implementation manner, the system further comprises a search matching module and a mixing processing module;
the search matching module is configured to search and match the specific sound-producing object with the selected audio according to the video-audio matching score, so that the audio introduction, the audio keywords and the object category of the specific sound-producing object match one another;
and the mixing processing module is configured to mix all the audios into a complete audio file, and to add the audio file to the audio track of the video so that the audio and the video are synchronized.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the following method steps:
recognizing a video to be processed to obtain the category of a specific sound-producing object in the video and extracting the sound-producing features of the specific sound-producing object;
constructing, based on the sound-producing features, the object category of the specific sound-producing object and the audio of the specific sound-producing object, wherein the audio comprises an audio introduction and audio keywords;
performing score matching based on the object category, the audio introduction and the audio keywords of the specific sound-producing object to obtain a first matching score and a neural network matching score respectively;
and obtaining a video-audio matching score based on the first matching score and the neural network matching score, and obtaining at least one suitable audio for the specific sound-producing object according to the video-audio matching score.
A video object sound effect construction apparatus comprises a memory, a processor and a computer program stored in the memory and executable on the processor, the processor executing the computer program to implement the following method steps:
recognizing a video to be processed to obtain the category of a specific sound-producing object in the video and extracting the sound-producing features of the specific sound-producing object;
constructing, based on the sound-producing features, the object category of the specific sound-producing object and the audio of the specific sound-producing object, wherein the audio comprises an audio introduction and audio keywords;
performing score matching based on the object category, the audio introduction and the audio keywords of the specific sound-producing object to obtain a first matching score and a neural network matching score respectively;
and obtaining a video-audio matching score based on the first matching score and the neural network matching score, and obtaining at least one suitable audio for the specific sound-producing object according to the video-audio matching score.
Owing to the adoption of the above technical solution, the invention has the following notable technical effects:
The invention discloses a video object sound effect construction method, system, device and computer-readable storage medium. A video to be processed is recognized to obtain the category of a specific sound-producing object in the video and to extract the sound-producing features of the specific sound-producing object; based on the sound-producing features, the object category of the specific sound-producing object and the audio of the specific sound-producing object are constructed, the audio comprising an audio introduction and audio keywords; score matching is performed based on the object category, the audio introduction and the audio keywords of the specific sound-producing object to obtain a first matching score and a neural network matching score respectively; and a video-audio matching score is obtained from the first matching score and the neural network matching score, from which at least one suitable audio for the specific sound-producing object is obtained. By constructing suitable audio for the specific sound-producing objects in the video, the audio can be generated directly and automatically, without a Foley artist imitating ordinary or special-effect sounds when the video is dubbed; this reduces the Foley artist's tedious work, is convenient and fast, and has high accuracy.
With the constructed sound effects, the specific sound-producing objects in the video can subsequently be matched directly, so the entire dubbing of the video to be processed can be completed directly, with high accuracy and speed.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic overall flow diagram of the method of the present invention;
FIG. 2 is a schematic diagram of the overall structure of the system of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to embodiments, which are illustrative of the invention and should not be construed as limiting it.
Example 1:
a method for constructing audio effect of video object, as shown in fig. 1, includes the following steps:
s100, identifying a video to be processed to obtain the type of a specific sound-producing object in the video to be processed and extracting sound-producing characteristics of the specific sound-producing object;
s200, constructing an object type of a specific sounding object and a specific sounding object audio based on the sounding characteristics, wherein the audio comprises audio introduction and audio keywords;
s300, performing score matching processing based on the object type, the audio introduction and the audio keywords of the specific sound-producing object to respectively obtain a first matching score and a neural network matching score;
s400, obtaining a video and audio matching score based on the first matching score and the neural network matching score, and obtaining at least one appropriate audio of the specific sounding object according to the video and audio matching score.
Specifically, in order to associate the object category of the specific sound-producing object with the audio of the specific sound-producing object, this embodiment uses natural language as the intermediate representation for matching the object category of the specific sound-producing object in the video to be processed with the audio. Using natural language as the matching representation makes the representation easy for people to understand and annotate, and makes the audio library easy to organize and maintain.
For a video to be processed, the object category recognized from the video is represented in natural language (e.g., "cat"); for audio, two kinds of natural language annotation are used: an audio introduction and audio keywords, i.e., each audio item includes an audio introduction and audio keywords. The audio introduction describes the content of the audio in one sentence or phrase (e.g., "a person's footsteps walking on snow"), and the audio keywords describe the content of the audio with three key words (e.g., "shoes / snow / footsteps"). Unlike the audio introduction, the audio keywords must include the sound-producing object and the category of the produced sound; in short, introducing audio keywords bridges the gap between object-recognition categories and sound introductions. The audio can be annotated with an audio introduction and audio keywords as in step S200, where the audio introduction is the introductory text of the audio and the audio keywords include at least three words describing the audio, including the category name of the specific sound-producing object and the category name of the produced sound.
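For illustration only, the following minimal Python sketch shows one way such an audio-library entry could be represented; the class and field names are assumptions of this sketch and are not prescribed by the embodiment.

from dataclasses import dataclass
from typing import List

@dataclass
class AudioEntry:
    file_path: str          # path to the audio file (e.g. wav or mp3)
    introduction: str       # one sentence or phrase describing the audio content
    keywords: List[str]     # at least three words, including the sounding object and the sound category

example_entry = AudioEntry(
    file_path="library/footsteps_snow_01.wav",          # hypothetical path
    introduction="a person's footsteps walking on snow",
    keywords=["shoes", "snow", "footsteps"],
)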
For a specific sound-producing object, the class name from object recognition is used directly as its natural language representation. Since a computer cannot understand natural language, the natural language representation is further mapped to a vector representation. Specifically, this embodiment introduces two vector representations of natural language: TF-IDF (term frequency-inverse document frequency) and BERT (Bidirectional Encoder Representations from Transformers).
In a specific embodiment, the TF-IDF vector is computed from the audio introduction text; it indicates how much each word in a text segment contributes to the semantics of the whole segment. Specifically: first, the audio introductions of all audios are segmented into Chinese words with the jieba word segmenter; then the term frequency TF of each word within each audio introduction and the document frequency DF of each word over the set of all audio introductions are computed; for an audio introduction, the TF-IDF of any word can then be computed as TF-IDF = TF x log(1/DF + 1). Note that this formula is a normalized TF-IDF, used to keep the values stable. Finally, the TF-IDF vector of any text segment is computed: all words in the text library are ordered, the TF-IDF value of each word in the segment is computed in that order, and the TF-IDF value is taken as 0 if the segment does not contain the word. The result is a vector whose length equals the vocabulary size of the text library, i.e., the TF-IDF vector representation of the segment.
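For illustration, a minimal Python sketch of the normalized TF-IDF computation described above is given below; it assumes jieba as the segmenter and treats the audio introductions as the text library, and the function names and normalization details are assumptions of the sketch.

import math
from collections import Counter
import jieba

def build_vocab_and_df(introductions):
    """Segment every audio introduction and compute each word's document frequency DF."""
    tokenized = [list(jieba.cut(text)) for text in introductions]
    vocab = sorted({w for doc in tokenized for w in doc})       # ordered text-library vocabulary
    n_docs = len(tokenized)
    df = {w: sum(1 for doc in tokenized if w in doc) / n_docs for w in vocab}
    return vocab, df

def tfidf_vector(text, vocab, df):
    """TF-IDF vector of one text segment: TF * log(1/DF + 1), 0 for words the segment lacks."""
    words = list(jieba.cut(text))
    counts = Counter(words)
    total = max(len(words), 1)
    return [
        (counts[w] / total) * math.log(1.0 / df[w] + 1.0) if w in counts else 0.0
        for w in vocab
    ]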
Further, a BERT vector is computed. BERT in this embodiment is a Transformer neural network whose parameters are trained by large-scale unsupervised learning; the resulting model can be applied directly to downstream natural language understanding problems and can map sentences and phrases of natural language directly to vectors. This embodiment combines the two vector representations (together with simple word matching), which makes the result more accurate.
In one embodiment, the BERT vector representation of a sentence is computed with the pre-trained Chinese BERT model in the pytorch_pretrained_bert package of the PyTorch ecosystem. To keep matching efficient, the smallest BERT model, "bert-base-chinese", is used. Specifically, the sentence is split into individual characters, "[CLS]" and "[SEP]" are added as the first and last tokens to form the input indexed_tokens, an all-zero list of the same length as the input indexed_tokens is used as the input segment_ids, both inputs are fed into the pre-trained BERT model, and the output vector of the last neural network layer at the first token ("[CLS]") is taken as the BERT vector of the sentence.
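For illustration, a minimal Python sketch of this sentence-vector extraction is given below; it assumes the pytorch_pretrained_bert package and the "bert-base-chinese" weights mentioned above.

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def bert_sentence_vector(sentence):
    """Return the last-layer output at the [CLS] position as the sentence's BERT vector."""
    tokens = ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokens)
    segment_ids = [0] * len(indexed_tokens)          # single-sentence input: all zeros
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensor = torch.tensor([segment_ids])
    with torch.no_grad():
        encoded_layers, _ = model(tokens_tensor, segments_tensor)
    return encoded_layers[-1][0, 0]                  # [CLS] vector of the last layer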
The audio-video construction process matches the object categories recognized in the video against the audio introductions and audio keywords, and selects suitable audio by the computed matching score. The matching score is computed in two ways, a traditional method and a neural network method. The advantage of the traditional method is that when the natural language expressions of the audio and the video contain the same words, the score is computed accurately; the advantage of the neural network matching score is that when the two natural language expressions share no words at all, they can still be matched by their semantics. Using and combining the scores of both methods makes them complementary.
In one embodiment, the score matching in step S300 based on the object category, the audio introduction and the audio keywords of the specific sound-producing object, to obtain a first matching score and a neural network matching score respectively, is specifically:
S310, performing word segmentation on the object category of the specific sound-producing object and on the audio introduction to obtain words;
S320, obtaining the proportions of the object-category words that coincide with the audio introduction and with the audio keywords, respectively, as a first proportion and a second proportion, and performing weighted averaging on the first proportion and the second proportion to obtain a word matching score, where word matching score = (word-coincidence proportion between the object category and the audio introduction) x audio-introduction weight + (word-coincidence proportion between the object category and the audio keywords) x audio-keyword weight, and audio-introduction weight + audio-keyword weight = 1;
S330, obtaining an object-category TF-IDF vector based on the statistics over the audio introductions, and taking the first cosine similarity between the object-category TF-IDF vector and the audio-introduction TF-IDF vector as a TF-IDF matching score, i.e. TF-IDF matching score = cosine_similarity(object-category TF-IDF vector, audio-introduction TF-IDF vector);
S340, performing weighted averaging on the word matching score and the TF-IDF matching score to obtain the first matching score, where first matching score = word matching score x word weight + TF-IDF matching score x TF-IDF weight, and word weight + TF-IDF weight = 1;
S350, obtaining the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction, computing their cosine similarity, and taking this cosine similarity as the neural network matching score.
Steps S310 to S340 perform matching by the traditional method; the jieba word segmenter is used to segment the object category of the specific sound-producing object and the audio introduction. The proportions of the object-category words that coincide with the audio introduction and with the audio keywords are then computed, and the two proportions are weighted and averaged as the word matching score; the TF-IDF vector representation of the object category is obtained from the statistics over the audio introduction texts. The cosine similarity between the object-category TF-IDF vector and the audio-introduction TF-IDF vector is then computed as the TF-IDF matching score, and the word matching score and the TF-IDF matching score are weighted and averaged to obtain the matching score of the traditional method, i.e., the first matching score in this step. Of course, in other embodiments the first matching score may be obtained by other techniques, which are not described again here.
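For illustration, a minimal Python sketch of this traditional first matching score is given below; the 0.5 default weights are illustrative only, and the helper names are assumptions of the sketch (the TF-IDF vectors are assumed to come from the TF-IDF sketch above).

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def overlap_ratio(category_words, reference_words):
    """Proportion of the object-category words that also appear in the reference words."""
    if not category_words:
        return 0.0
    ref = set(reference_words)
    return sum(1 for w in category_words if w in ref) / len(category_words)

def first_matching_score(category_words, intro_words, keywords,
                         cat_vec, intro_vec,
                         intro_w=0.5, word_w=0.5):
    """Word matching score and TF-IDF matching score, each weighted so the weights sum to 1."""
    word_score = (overlap_ratio(category_words, intro_words) * intro_w +
                  overlap_ratio(category_words, keywords) * (1.0 - intro_w))
    tfidf_score = cosine_similarity(cat_vec, intro_vec)
    return word_score * word_w + tfidf_score * (1.0 - word_w)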
In step S350, the BERT vector of the object category of the specific sound-producing object and the BERT vector of the audio introduction are obtained, their cosine similarity is computed, and this cosine similarity is taken as the neural network matching score; this is realized by the neural network matching method.
In an embodiment, the obtaining of a video-audio matching score based on the first matching score and the neural network matching score in step S400 is specifically: performing weighted averaging on the first matching score and the neural network matching score to obtain the video-audio matching score, where video-audio matching score = first matching score x first weight + neural network matching score x neural network weight, and first weight + neural network weight = 1.
In practice, the weights of the weighted average can be adjusted as needed. If the name of the object category of the specific sound-producing object is expected to appear literally in the audio introduction or keywords, the weight of the traditional matching score can be increased to raise accuracy; if the name of the object category does not appear in the audio introduction or keywords but has the same meaning, the weight of the neural network matching score can be increased to improve generalization.
Specifically, for each recognized object, the 10 best-matching audios may be selected as dubbing recommendations according to the final matching score; other numbers are of course also possible.
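For illustration, a minimal Python sketch of the final combination and top-10 selection is given below; the equal default weights are illustrative, not prescribed by the embodiment.

def video_audio_score(first_score, nn_score, first_weight=0.5):
    """Weighted average of the traditional and neural network scores (weights sum to 1)."""
    return first_score * first_weight + nn_score * (1.0 - first_weight)

def top_k_audios(scores_by_audio, k=10):
    """Return the k best-matching audio ids for one recognized object."""
    return sorted(scores_by_audio, key=scores_by_audio.get, reverse=True)[:k]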
In an embodiment, the recognition of the video to be processed in step S100, to obtain the category of the specific sound-producing object in the video and extract its sound-producing features, specifically comprises the following steps:
S110, reducing the frame-extraction frequency for the video to be processed and extracting video key frames;
S120, generating a frame-image stream from the extracted video key frames;
S130, performing modular multi-object recognition on the frame-image stream with a deep convolutional neural network model to obtain modularized specific sound-producing objects;
S140, performing multi-stage recognition and analysis on the modularized specific sound-producing objects with a deep residual network model to obtain the category of the specific sound-producing object in the video and extract its sound-producing features.
In this embodiment, the video to be processed is a video clip supplied by the user that needs sound effects. Video key frames are extracted from it by frame-rate-reduced frame extraction; the frame-extraction frequency is set as an adjustable parameter with no lower limit, and its upper limit is determined by the frame rate of the video (usually 25 frames per second). Frame extraction yields a time-ordered sequence of static frame images, i.e., the frame-image stream, which is used for the subsequent recognition of specific sound-producing objects.
In the implementation, the video key frames must first be down-sampled: an object or person worth dubbing must appear continuously in the video to be processed for a certain time, and dubbing objects that disappear within one or two frames is generally not considered, because it has little value from the point of view of dubbing technique. In practice, if a video key frame in the frame-image stream is such that the frames of the previous 2 seconds do not contain the recognized object category, the object is taken to start sounding from that second; if the same object already exists in the frames of the previous 2 seconds, the object is taken to be sounding continuously, and the minimum sounding time is set to 5 seconds. In actual operation, different continuous sounding times and minimum sounding times can be set for different objects according to their sounding rules. Video key frames are extracted for object recognition at a reduced frequency: for example, for a video with a frame rate of 25 frames per second, the key-frame sampling frequency after reduction is set to 1 frame per second, i.e., one frame is extracted from every 25 key-frame pictures as the recognition input sample for the objects appearing in that second of the video, which simply and effectively reduces the number of reads and increases processing speed. Meanwhile, the frame-extraction frequency is set as an adjustable parameter with no lower limit and with an upper limit determined by the frame rate of the video (usually 25 frames per second), so that the user can choose a suitable frame-extraction frequency according to the characteristics of the video sample.
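For illustration, a minimal Python sketch of frame-rate-reduced key-frame extraction is given below; the use of OpenCV and the parameter names are assumptions of the sketch.

import cv2

def extract_key_frames(video_path, frames_per_second=1.0):
    """Sample the video at `frames_per_second` (upper bound: the video's own frame rate)."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0            # typically 25 fps
    step = max(int(round(native_fps / min(frames_per_second, native_fps))), 1)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append((index / native_fps, frame))        # (timestamp in seconds, BGR image)
        index += 1
    cap.release()
    return frames                                             # time-ordered frame-image stream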
Next, the frame-image stream generated from the extracted video key frames undergoes modular multi-object recognition based on an embedded deep convolutional neural network (Deep CNN). For each static frame image in the frame-image stream, the network applies highly nonlinear operations to the pixel values of the RGB three-colour channels of the image's pixels to generate probability vectors centred on each recognizable specific sound-producing object; the deep convolutional neural network determines the category of the specific sound-producing object from the maximum value in each probability vector, and determines the size of the current object selection box from the numerical distribution of the probability vector over the rectangular region around the centre of the specific sound-producing object. The generated selection box is used to crop a screenshot of the specific sound-producing object from each frame image, for more detailed recognition in the second stage. It should be explained that all neural networks involved in this step come from the pre-trained Fast-RCNN networks of the object recognition library of the Python language and the TensorFlow deep learning framework.
This embodiment obtains modularized specific sound-producing objects; correspondingly, each level of the deep convolutional neural networks embedded in object recognition adopts a modular design. Any required level of the deep neural network used in object recognition can be swapped at will to adapt to special usage scenarios or special object classes; for example, the recognition network that performs the fine classification of shoes and floors is not based on any pre-trained CNN model. The modular design can also be extended to embed several deep convolutional neural networks at each recognition level and use an ensemble learning algorithm to improve overall object-recognition accuracy, localization precision and fine-classification accuracy.
For example, the ensemble learning algorithm can use each deep neural network's confidence value for the recognized selection box (the closer to 1, the more certain the network is that the box is correct; the confidence value is the model's probability judgment of whether an object recognition is correct, i.e., its confidence in one object recognition, and the higher the confidence, the more likely the recognition is correct) to take a weighted average of several selection boxes, thereby fine-tuning a more reliable box for object localization and generating a higher-quality screenshot for recognition in the subsequent steps.
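For illustration, a minimal Python sketch of this confidence-weighted box fusion is given below; the (centre, width, height) box format is an assumption of the sketch.

import numpy as np

def fuse_boxes(boxes, confidences):
    """boxes: list of (cx, cy, w, h) from different networks for the same object;
    confidences: each network's confidence value in [0, 1]."""
    boxes = np.asarray(boxes, dtype=float)
    weights = np.asarray(confidences, dtype=float)
    weights = weights / weights.sum()
    return tuple(weights @ boxes)          # confidence-weighted average selection box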
Once the modularized specific sound-producing objects are obtained, multi-stage recognition and analysis with a deep residual network model yields the category of the specific sound-producing object and extracts its sound-producing features, specifically as follows. Since existing deep neural networks cannot recognize the details of all objects from a natural image, a technical framework of a multi-stage object recognition network is provided. In this embodiment the multi-stage recognition and analysis follows a "coarse to fine" design concept: for each static frame image in the frame-image stream, a first-level deep neural network first performs a preliminary analysis and recognition to obtain the coarse categories of the specific sound-producing objects (such as people, shoes, doors and windows); then, for the detailed screenshot of each object's location, a new neural network performs multi-stage recognition and analysis of the object sub-categories to obtain the category of the specific sound-producing object (for example, whether a shoe is a sports shoe, a skate shoe or a leather shoe). The multi-stage recognition and analysis of this embodiment can be extended to image-recognition frameworks with more stages (for example, three or more); in general, because the resolution of the extracted frames used in the experiments is limited, a two-level deep neural network performing two-stage recognition and analysis is enough to realize the functions currently required.
The second-stage recognition and analysis performed by the second-level deep neural network is mainly described here. The preliminary recognition and analysis uses a first-level deep recognition network derived from a pre-trained Fast-RCNN network; the multi-stage recognition and analysis uses multi-level deep recognition networks, of which the second-level deep recognition network performs further detailed recognition of individual key objects recognized by the first-level deep recognition network. For example, for the "shoes" recognized by the first-level deep recognition network in a static frame image, the second-level deep recognition network performs second-stage recognition and analysis on the screenshot of the "shoes" region to determine the "shoe type" and the "floor type". More specifically, this embodiment can recognize four fine categories of footwear (sports shoes, leather shoes, high-heeled shoes, other) and five fine categories of floor (tile floor, plank floor, cement floor, sand floor, other). The specific network architecture of the second-level deep recognition network is designed on the basis of a 50-layer deep residual network (ResNet-50). The deep residual network model is obtained as follows:
s141, acquiring a plurality of images containing specific sound production objects, and eliminating unqualified images of the specific sound production objects to obtain qualified images of the specific sound production objects;
s142, preprocessing the image of the qualified specific sounding object to obtain an image data set of the qualified specific sounding object, and dividing the image data set into a training set and a verification set;
and S143, inputting the training set into the initial depth residual error network model for training, and verifying the training result through the verification set to obtain the depth residual error network model capable of acquiring the type of the specific sound-producing object.
In the prior art there is no deep residual network pre-trained for recognizing shoes, floors or other specific sound-producing objects. The deep residual network used in this embodiment is not based on any pre-trained parameters: its network parameters are trained entirely from random initialization, all image sets required for training come from screenshots of actual videos, and the shoe and floor types are calibrated manually. The image training set contains at least 17,000+ pictures of different sizes, with variable aspect ratios and a maximum resolution not exceeding 480p, whose main subjects are shoes, floors and other specific sound-producing objects. When training the deep residual network model, unqualified images, such as very blurry pictures or pictures in which the object is incomplete, must be removed, and the remaining qualified images are divided into a training set and a validation set. Unlike public image-recognition data sets, these pictures are mostly low-resolution and non-square, because screenshots of video frames in real usage scenarios are irregular in shape and their resolution may be lowered by video compression algorithms; the irregularity and low resolution can be regarded as noise contained in the image set, so a network trained on this data set has stronger noise resistance and is better targeted at footwear and floors. The recognition accuracy (computed on a test set) of the deep residual network of this embodiment for the five fine floor categories reaches 73.4%, far higher than random selection (20%) and crowd selection (35.2%); the recognition accuracy for the four shoe categories is of the same order. The actual recognition speed reaches 100 pictures per second on a single NVIDIA P100 GPU.
In addition, the multi-layer perceptron at the end of the network (inherent to ResNet-50) is deepened to two layers and combined with a random-deactivation design (Dropout = 0.5) to adapt to the category requirements of the various specific objects to be recognized; this avoids, to a certain extent, the overfitting phenomenon caused by an excessive number of network parameters (recognition on the training set being far better than on the test set).
The deep residual network (ResNet-50) used in this embodiment is based on the existing deep residual network and is trained accordingly so that it can recognize the categories of specific sound-producing objects required by this embodiment; that is, the computation and recognition process for a single picture is adapted to the specific usage scenario. The deep residual network (ResNet-50) reads square RGB images of at least 224 x 224 pixels; for a rectangular input image whose length and width are not 224 pixels, this embodiment first deforms it with conventional linear interpolation into a regular 224 x 224 x 3 floating-point matrix (three RGB colour channels). After the matrix is input into the network, it is transformed by a series of convolution blocks into feature maps of higher abstraction and smaller size. The convolution block is the basic unit of a conventional convolutional neural network (CNN) design; the convolution blocks used in ResNet-50 consist of three to four two-dimensional convolution layers (2D convolution) combined with a random-deactivation design (Dropout), batch normalization layers and linear rectification layers (ReLU), and each block is paralleled by a residual path (which contains only a simple one-layer two-dimensional convolution or is a simple copy of the input matrix). The feature map output by the previous block is computed separately through the residual path and the convolution-block path, producing two new matrices of identical dimensions, which are simply added to form the input matrix of the next block. The number in the name of the deep residual network (ResNet-50) refers to the total of 50 two-dimensional convolution layers contained in all convolution blocks. After all convolution blocks, the deep residual network outputs a 2048-dimensional vector and then, through one layer of perceptron, a vector of dimension 1000. Each element of the final output vector represents the probability that the image belongs to a certain category, and the final category is determined by the maximum probability value. Common deep residual networks similar to ResNet-50 include ResNet-34 and ResNet-101; other common image-recognition networks such as AlexNet, VGGNet and Inception are also applicable in this embodiment, but their effect is not as good, so the deep residual network (ResNet-50) is chosen.
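For illustration, a minimal PyTorch sketch of the second-level classifier described above is given below: a ResNet-50 without pre-trained parameters whose final perceptron is deepened to two layers with Dropout = 0.5, and a 224 x 224 bilinear resize for rectangular screenshots. The hidden size of 512 and the use of torchvision are assumptions of the sketch.

import torch.nn as nn
from torchvision import models, transforms

def build_fine_classifier(num_classes):
    net = models.resnet50(pretrained=False)              # no pre-trained parameters
    net.fc = nn.Sequential(                              # two-layer head with random deactivation
        nn.Linear(2048, 512),
        nn.ReLU(inplace=True),
        nn.Dropout(p=0.5),
        nn.Linear(512, num_classes),
    )
    return net

# Rectangular screenshots are first resized (bilinear interpolation) to 224 x 224 x 3.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

shoe_classifier = build_fine_classifier(num_classes=4)    # sports / leather / high-heeled / other
floor_classifier = build_fine_classifier(num_classes=5)   # tile / plank / cement / sand / other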
In addition, the second-level recognition network architecture, i.e., the deep residual network (ResNet-50), in this embodiment also supports a feedback learning mode: when the recognition accuracy of the second-level deep recognition network does not meet the requirements of a scenario, screenshots can be taken from the frame-image stream using the object selection boxes recognized by the first-level deep recognition network, manually calibrated as a new data set, and used to fine-tune the second-level deep recognition network, i.e., the deep residual network (ResNet-50). In this way, when the content of the videos to be processed changes considerably, a higher recognition accuracy can be obtained quickly from the already trained model and a small amount of new data, shortening the preparation period for adapting to a new application scenario. The first-level deep recognition network can also be retrained in stages as the video types or application scenarios change, to adapt to the characteristics of new video data.
Furthermore, the specific sound-producing object information recognized by each level of the two-level deep recognition network is merged and stored in the same format. For each object the following information is stored: the object coarse class (recognized by the upper-level network), the coarse-class confidence value, the object fine class (recognized by the second-level deep recognition network), the fine-class confidence value, and the width, height and centre of the object's location selection box (measured in frame-image pixels); all of this is stored in json file format for further processing.
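For illustration, a minimal Python sketch of such a merged per-object json record is given below; the key names are assumptions that mirror the fields listed above.

import json

detected_object = {
    "coarse_class": "shoes",          # first-stage network label
    "coarse_confidence": 0.93,
    "fine_class": "sports shoes",     # second-stage (ResNet-50) label
    "fine_confidence": 0.78,
    "box": {"width": 120, "height": 80, "center_x": 340, "center_y": 410},  # frame-image pixels
}

with open("detections.json", "w", encoding="utf-8") as f:
    json.dump([detected_object], f, ensure_ascii=False, indent=2)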
In one embodiment, step S400 of obtaining one or more suitable audios for the specific sound-producing object according to the video-audio matching score is followed by the following steps:
S500, searching and matching the specific sound-producing object with the selected audio according to the video-audio matching score, so that the audio introduction, the audio keywords and the object category of the specific sound-producing object match one another;
S600, mixing all the audios into a complete audio file and adding the audio file to the audio track of the video so that the audio and the video are synchronized.
In one embodiment, the specific sound-producing object and the selected audio are searched and matched according to the video-audio matching score, so that the audio introduction, the audio keywords and the object category of the specific sound-producing object match one another; that is, the specific sound-producing object and the audio are matched through the video-audio matching score, which yields the individual dubbings. The audio is then mixed into the overall dubbing, i.e., the generated audio is mixed: once the audio files required for dubbing and the start and end times of each file's playback are found, all the required audio files are read and each is converted into a uniform frequency-domain signal format for subsequent editing.
In this embodiment, audio files in any common format, including wav and mp3, can be read, which broadens the usable scenarios and generalizes to other special-effect audio libraries.
The specific process of mixing all the audios is as follows. Each audio segment is intelligently stretched or compressed to the length needed for dubbing. The silent parts at the beginning and end of the audio are cut off first, so that the dubbing and the picture that triggers it occur simultaneously and the dubbing effect is optimal. It is then checked whether the audio, with head and tail silence removed, is longer than the time it needs to play. If so, the audio is cut to the playing length required for dubbing and a fade-out is applied at the end to avoid an abrupt stop. If not, the audio is looped up to the playing time required for dubbing; when the audio is looped, the junctions between successive segments use an overlap with a fade-in/fade-out of a certain length, so that the looping points are seamless, the long audio segment sounds natural and complete, and the listener has the best experience. The fade-in/fade-out length equals the overlap length and is determined from the audio duration by a piecewise function: if the original audio is shorter than 20 seconds, the overlap and fade time is set to 10% of the audio length, so that the overlap is moderate, the two segments transition smoothly, and more of the non-overlapping part of the short audio is kept to play to the user; if the original audio is longer than 20 seconds, the overlap and fade time is set to 2 seconds, so that long audio does not get an unnecessarily long transition and as much non-overlapping audio as possible is played.
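For illustration, a minimal Python sketch of these mixing rules is given below, using the pydub audio library as an assumed tool; the silence threshold and fade-length defaults are assumptions of the sketch.

from pydub import AudioSegment
from pydub.silence import detect_leading_silence

def trim_silence(seg, threshold_dbfs=-50.0):
    """Cut the silent parts at the beginning and end of an audio segment."""
    start = detect_leading_silence(seg, silence_threshold=threshold_dbfs)
    end = detect_leading_silence(seg.reverse(), silence_threshold=threshold_dbfs)
    return seg[start:len(seg) - end]

def fit_to_duration(seg, target_ms, fade_ms=500):
    """Cut (with fade-out) or loop (with overlapped cross-fade) a clip to target_ms."""
    seg = trim_silence(seg)
    if len(seg) == 0:
        return AudioSegment.silent(duration=target_ms)
    if len(seg) >= target_ms:                                      # too long: cut and fade out
        return seg[:target_ms].fade_out(min(fade_ms, target_ms))
    cross = int(len(seg) * 0.10) if len(seg) < 20_000 else 2_000   # piecewise overlap length
    out = seg
    while len(out) < target_ms:                                    # too short: loop with cross-fade
        out = out.append(seg, crossfade=min(cross, len(out), len(seg)))
    return out[:target_ms].fade_out(min(fade_ms, target_ms))

def mix_track(clips, total_ms):
    """clips: list of (AudioSegment, start_ms, end_ms); returns one complete dubbing track."""
    track = AudioSegment.silent(duration=total_ms)
    for seg, start_ms, end_ms in clips:
        track = track.overlay(fit_to_duration(seg, end_ms - start_ms), position=start_ms)
    return track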
Finally, the audios processed according to the above steps are combined, added to the audio track of the video, and a new video file with dubbing is output, completing the whole dubbing process.
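For illustration, a minimal Python sketch of this final step is given below, using moviepy as an assumed tool to attach the mixed track to the video's audio track; the file paths are hypothetical.

from moviepy.editor import VideoFileClip, AudioFileClip

video = VideoFileClip("input.mp4")                       # the video to be processed
dubbing = AudioFileClip("dubbing.wav")                   # the complete mixed track from the previous step
video.set_audio(dubbing).write_videofile("output_with_dubbing.mp4")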
Example 2:
a video object sound effect construction system is shown in FIG. 2, and includes an identification processing module 100, a construction category module 200, a score calculating module 300 and a score processing module 400;
the recognition processing module 100 is configured to perform recognition processing on a video to be processed to obtain a type of a specific sound-generating object in the video to be processed and extract sound-generating characteristics of the specific sound-generating object;
the build category module 200 is arranged to: constructing an object type of a specific sounding object and a specific sounding object audio based on the sounding characteristics, wherein the audio comprises audio introduction and audio keywords;
the calculate score module 300 is configured to: performing score matching processing based on the object category, the audio introduction and the audio keywords of the specific sound-producing object to respectively obtain a first matching score and a neural network matching score;
the score processing module 400 is arranged to: and obtaining a video and audio matching score based on the first matching score and the neural network matching score, and obtaining at least one appropriate audio of the specific sounding object according to the video and audio matching score.
In one embodiment, the system further comprises a search matching module 500 and a remix processing module 600;
the search matching module 500 is configured to search and match the specific sounding object and the selected audio according to the video and audio matching score, so that the audio introduction, the audio keyword and the object category of the specific sounding object are matched with each other;
the audio mixing processing module 600 is configured to perform audio mixing processing on all audio to form a complete audio file, and add the audio file to the audio track of the video to synchronize the audio file and the video.
A simple and easy-to-use functional interface is provided in the mixing processing module, and the dubbed video can be generated with one click, which greatly improves the user's working efficiency. Although the mixing processing module 600 uses common audio tools, the specific mixing steps and parameters in the method are designed specifically for films, dramas and short videos; the silence-removal and special-effect-audio compression or extension methods mentioned in the method embodiment specifically solve the dubbing problem of this class of videos, namely that the audio lengths in a special-effect audio library often do not meet the video dubbing requirements, and these specific audio-processing parameters are the most suitable for this embodiment; other techniques or audio-processing parameters are not as suitable.
In one embodiment, the identification processing module 100 is configured to:
reducing the frame extraction frequency according to the relevant information of the video to be processed, and extracting video key frames;
generating a frame image stream from the extracted video key frames;
performing modularized multi-object recognition on the frame image stream by adopting a deep convolutional neural network model to obtain modularized specific sound-producing objects;
and carrying out multi-stage recognition and analysis processing on the modularized specific sound-producing objects through a deep residual network model to obtain the type of the specific sound-producing object in the video to be processed and extract the sound-producing characteristics of the specific sound-producing object.
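The following sketch illustrates one way such a two-stage pipeline could look in practice; the frame step, the Faster R-CNN detector and the ResNet-50 classifier are illustrative assumptions and are not prescribed by this embodiment.

import cv2
import torch
import torch.nn.functional as F
from torchvision.models import resnet50
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor

detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()   # convolutional multi-object detector
classifier = resnet50(weights="DEFAULT").eval()                # residual-network second stage

def extract_key_frames(path, step=30):
    # Reduced frame-extraction frequency: keep one frame out of every `step`.
    cap, frames, index = cv2.VideoCapture(path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        index += 1
    cap.release()
    return frames

def recognize_sounding_objects(frames, score_threshold=0.7):
    results = []
    with torch.no_grad():
        for frame in frames:
            tensor = to_tensor(frame)
            detections = detector([tensor])[0]
            for box, score in zip(detections["boxes"], detections["scores"]):
                if score < score_threshold:
                    continue
                x0, y0, x1, y1 = box.int().tolist()
                if x1 <= x0 or y1 <= y0:
                    continue
                crop = tensor[:, y0:y1, x0:x1].unsqueeze(0)
                crop = F.interpolate(crop, size=(224, 224))    # resize for the second stage
                category = classifier(crop).argmax(1).item()   # fine-grained category id
                results.append((category, float(score)))       # (ImageNet normalization omitted for brevity)
    return results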
In one embodiment, the build category module 200 is configured such that the audio introduction is the introductory text of the audio, and the audio keywords include at least three words describing the audio, the describing words including the category name of the specific sound-producing object and the category name of the sound it produces.
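For illustration only, an entry in the sound-effect library satisfying this structure might be represented as follows; the field names and values are hypothetical.

audio_entry = {
    "file": "dog_bark_01.wav",                 # hypothetical file name
    "introduction": "A medium-sized dog barking twice in a quiet yard",
    "keywords": ["dog", "bark", "animal"],     # at least three describing words, including the
                                               # object category ("dog") and the sound category ("bark")
}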
In one embodiment, the calculate score module 300 is configured to:
performing word segmentation processing on the object type and the audio introduction of the specific sound-producing object to obtain words;
respectively obtaining the proportion of coinciding words between the object type of the specific sound-producing object and the audio introduction, and between the object type and the audio keywords, to obtain a first proportion and a second proportion, and performing weighted average processing on the first proportion and the second proportion to obtain a word matching score, wherein the word matching score = (word coincidence proportion of the object type with the audio introduction) × audio introduction weight + (word coincidence proportion of the object type with the audio keywords) × audio keyword weight, and the audio introduction weight + the audio keyword weight = 1;
obtaining an object type TF-IDF vector and an audio introduction TF-IDF vector based on the statistical data of the audio introductions, computing a first cosine similarity between the object type TF-IDF vector and the audio introduction TF-IDF vector, and taking the first cosine similarity as a TF-IDF matching score, i.e. TF-IDF matching score = cosine_similarity(object type TF-IDF vector, audio introduction TF-IDF vector);
performing weighted average processing on the word matching score and the TF-IDF matching score to obtain a first matching score, wherein the first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and the word weight + the TF-IDF weight = 1;
and obtaining a BERT vector of the object type of the specific sound-producing object and a BERT vector of the audio introduction, computing the cosine similarity of the two BERT vectors, and taking this cosine similarity as the neural network matching score.
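A hedged sketch of these three scores is given below; the tokenizer (jieba), the sentence-embedding model and the weight values are illustrative assumptions, not values fixed by this embodiment.

import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

bert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed encoder

def overlap_ratio(category_words, other_words):
    # Proportion of category words that also appear in the other word set.
    category_words = set(category_words)
    return len(category_words & set(other_words)) / max(len(category_words), 1)

def first_and_nn_scores(category, introduction, keywords,
                        w_intro=0.6, w_keyword=0.4, w_word=0.5, w_tfidf=0.5):
    category_words = list(jieba.cut(category))
    introduction_words = list(jieba.cut(introduction))

    # Word matching score: weighted overlap with the introduction and the keywords.
    word_score = (w_intro * overlap_ratio(category_words, introduction_words)
                  + w_keyword * overlap_ratio(category_words, keywords))

    # TF-IDF matching score: cosine similarity of the two TF-IDF vectors.
    # In practice the vectorizer would be fitted on the whole audio-introduction corpus.
    tfidf = TfidfVectorizer(tokenizer=lambda text: list(jieba.cut(text)))
    vectors = tfidf.fit_transform([category, introduction])
    tfidf_score = cosine_similarity(vectors[0], vectors[1])[0, 0]

    # First matching score: weighted average of the two scores above.
    first_score = w_word * word_score + w_tfidf * tfidf_score

    # Neural-network matching score: cosine similarity of the BERT sentence vectors.
    embeddings = bert.encode([category, introduction])
    nn_score = float(cosine_similarity([embeddings[0]], [embeddings[1]])[0, 0])
    return first_score, nn_score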
In one embodiment, the calculate score module 300 is configured to:
and performing weighted average processing on the first matching score and the neural network matching score to obtain a video and audio matching score, wherein the video and audio matching score = first matching score × first weight + neural network matching score × neural network weight, and the first weight + the neural network weight = 1.
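Continuing the sketch above, the final combination is a single weighted sum; the 0.5/0.5 split shown here is only an assumed choice satisfying the constraint that the two weights sum to 1.

first_score, nn_score = first_and_nn_scores(
    "dog", "A medium-sized dog barking twice in a quiet yard", ["dog", "bark", "animal"])
video_audio_score = 0.5 * first_score + 0.5 * nn_score  # first weight + neural network weight = 1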
All of the modules in the above video object sound effect construction system may be implemented wholly or partly in software, in hardware, or in a combination of the two. The modules may be embedded in, or independent of, the processor of the computer device or mobile terminal in hardware form, or stored in the memory of the computer device or mobile terminal in software form, so that the processor can invoke them and execute the operations corresponding to each module.
Since the system embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, reference may be made to the corresponding parts of the description of the method embodiment.
Example 3:
a computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the following method steps:
identifying the video to be processed to obtain the type of a specific sound-producing object in the video to be processed and extracting the sound-producing characteristics of the specific sound-producing object;
constructing an object type of a specific sounding object and a specific sounding object audio based on the sounding characteristics, wherein the audio comprises audio introduction and audio keywords;
performing score matching processing based on the object category, the audio introduction and the audio keywords of the specific sound-producing object to respectively obtain a first matching score and a neural network matching score;
and obtaining a video and audio matching score based on the first matching score and the neural network matching score, and obtaining at least one appropriate audio of the specific sounding object according to the video and audio matching score.
In one embodiment, when the processor executes the computer program, the processor implements recognition processing on the video to be processed to obtain the type of the specific sound-generating object in the video to be processed and extract the sound-generating characteristics thereof, specifically:
reducing the frame extraction frequency according to the relevant information of the video to be processed, and extracting video key frames;
generating a frame image stream from the extracted video key frames;
performing modularized multi-object recognition on the frame image stream by adopting a deep convolutional neural network model to obtain modularized specific sound-producing objects;
and carrying out multi-stage recognition and analysis processing on the modularized specific sound-producing objects through a deep residual network model to obtain the type of the specific sound-producing object in the video to be processed and extract the sound-producing characteristics of the specific sound-producing object.
In one embodiment, when the processor executes the computer program, the audio introduction is the introductory text of the audio, and the audio keywords comprise at least three words describing the audio, including the category name of the specific sound-producing object and the category name of the sound it produces.
In one embodiment, when the processor executes the computer program, the score matching processing based on the object category, the audio introduction, and the audio keyword of the specific sound object is implemented to obtain a first matching score and a neural network matching score, which specifically includes:
performing word segmentation processing on the object type and the audio introduction of the specific sound-producing object to obtain words;
respectively obtaining the proportion of coinciding words between the object type of the specific sound-producing object and the audio introduction, and between the object type and the audio keywords, to obtain a first proportion and a second proportion, and performing weighted average processing on the first proportion and the second proportion to obtain a word matching score, wherein the word matching score = (word coincidence proportion of the object type with the audio introduction) × audio introduction weight + (word coincidence proportion of the object type with the audio keywords) × audio keyword weight, and the audio introduction weight + the audio keyword weight = 1;
obtaining an object type TF-IDF vector and an audio introduction TF-IDF vector based on the statistical data of the audio introductions, computing a first cosine similarity between the object type TF-IDF vector and the audio introduction TF-IDF vector, and taking the first cosine similarity as a TF-IDF matching score, i.e. TF-IDF matching score = cosine_similarity(object type TF-IDF vector, audio introduction TF-IDF vector);
performing weighted average processing on the word matching score and the TF-IDF matching score to obtain a first matching score, wherein the first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and the word weight + the TF-IDF weight = 1;
and obtaining a BERT vector of the object type of the specific sound-producing object and a BERT vector of the audio introduction, computing the cosine similarity of the two BERT vectors, and taking this cosine similarity as the neural network matching score.
In one embodiment, when the processor executes the computer program, the obtaining of the video and audio matching score based on the first matching score and the neural network matching score is implemented by:
and performing weighted average processing on the first matching score and the neural network matching score to obtain a video and audio matching score, wherein the video and audio matching score = first matching score × first weight + neural network matching score × neural network weight, and the first weight + the neural network weight = 1.
In one embodiment, when the processor executes the computer program, after the step of obtaining one or more suitable audios of the specific sound-producing object according to the video and audio matching score, the following steps are further implemented:
searching and matching the specific sound-producing object and the selected audio according to the video and audio matching score, so that the audio introduction, the audio keywords and the object type of the specific sound-producing object are matched with each other;
and mixing all the audios to form a complete audio file, and adding the audio file into the audio track of the video to enable the audio file and the video to be synchronous.
Example 4:
in one embodiment, a video object sound effect construction device is provided, which may be a server or a mobile terminal. The video object sound effect construction device comprises a processor, a memory, a network interface and a database connected through a system bus. The processor of the video object sound effect construction device provides computation and control capability. The memory of the video object sound effect construction device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database is used for storing all data of the video object sound effect construction device. The network interface of the device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the video object sound effect construction method.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that:
reference in the specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Thus, the appearances of the phrase "one embodiment" or "an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
In addition, it should be noted that the specific embodiments described in the present specification may differ in the shape of the components, the names of the components, and the like. All equivalent or simple changes of the structure, the characteristics and the principle of the invention which are described in the patent conception of the invention are included in the protection scope of the patent of the invention. Various modifications, additions and substitutions for the specific embodiments described may be made by those skilled in the art without departing from the scope of the invention as defined in the accompanying claims.

Claims (10)

1. A video object sound effect construction method is characterized by comprising the following steps:
identifying the video to be processed to obtain the type of a specific sound-producing object in the video to be processed and extracting the sound-producing characteristics of the specific sound-producing object;
constructing an object type of a specific sounding object and a specific sounding object audio based on the sounding characteristics, wherein the audio comprises audio introduction and audio keywords;
performing score matching processing based on the object category, the audio introduction and the audio keywords of the specific sound-producing object to respectively obtain a first matching score and a neural network matching score;
and obtaining a video and audio matching score based on the first matching score and the neural network matching score, and obtaining at least one appropriate audio of the specific sounding object according to the video and audio matching score.
2. The video object sound effect construction method according to claim 1, wherein the video to be processed is subjected to recognition processing to obtain the type of a specific sound generating object in the video to be processed and extract the sound generating characteristics thereof, and specifically the method comprises the following steps:
reducing the frame extraction frequency according to the relevant information of the video to be processed, and extracting video key frames;
generating a frame image stream from the extracted video key frames;
performing modularized multi-object recognition on the frame image stream by adopting a deep convolutional neural network model to obtain modularized specific sound-producing objects;
and carrying out multi-stage recognition and analysis processing on the modularized specific sound-producing objects through a deep residual network model to obtain the type of the specific sound-producing object in the video to be processed and extract the sound-producing characteristics of the specific sound-producing object.
3. The video object sound effect construction method according to claim 1, wherein the audio introduction is an introduction text of the audio, and the audio keywords include at least three words describing the audio, the describing words including a category name of the specific sound-producing object and a category name of the sound it produces.
4. The video object sound effect construction method according to claim 1, wherein the score matching processing based on the object category, the audio introduction and the audio keyword of the specific sound object respectively obtains a first matching score and a neural network matching score, and specifically comprises:
performing word segmentation processing on the object type and the audio introduction of the specific sound-producing object to obtain words;
respectively obtaining the proportion of coinciding words between the object type of the specific sound-producing object and the audio introduction, and between the object type and the audio keywords, to obtain a first proportion and a second proportion, and performing weighted average processing on the first proportion and the second proportion to obtain a word matching score, wherein the word matching score = (word coincidence proportion of the object type with the audio introduction) × audio introduction weight + (word coincidence proportion of the object type with the audio keywords) × audio keyword weight, and the audio introduction weight + the audio keyword weight = 1;
obtaining an object type TF-IDF vector and an audio introduction TF-IDF vector based on the statistical data of the audio introductions, computing a first cosine similarity between the object type TF-IDF vector and the audio introduction TF-IDF vector, and taking the first cosine similarity as a TF-IDF matching score, i.e. TF-IDF matching score = cosine_similarity(object type TF-IDF vector, audio introduction TF-IDF vector);
performing weighted average processing on the word matching score and the TF-IDF matching score to obtain a first matching score, wherein the first matching score = word matching score × word weight + TF-IDF matching score × TF-IDF weight, and the word weight + the TF-IDF weight = 1;
and obtaining a BERT vector of the object type of the specific sound-producing object and a BERT vector of the audio introduction, computing the cosine similarity of the two BERT vectors, and taking this cosine similarity as the neural network matching score.
5. The video object sound effect construction method according to claim 4, wherein the video and audio matching score is obtained based on the first matching score and the neural network matching score, and specifically comprises:
and performing weighted average processing on the first matching score and the neural network matching score to obtain a video and audio matching score, wherein the video and audio matching score = first matching score × first weight + neural network matching score × neural network weight, and the first weight + the neural network weight = 1.
6. The video object sound effect construction method according to claim 1, wherein after the step of obtaining one or more suitable audios of the specific sound-producing object according to the video and audio matching score, the method further comprises:
searching and matching the specific sound-producing object and the selected audio according to the video and audio matching score, so that the audio introduction, the audio keywords and the object type of the specific sound-producing object are matched with each other;
and mixing all the audios to form a complete audio file, and adding the audio file into the audio track of the video to enable the audio file and the video to be synchronous.
7. A video object sound effect construction system is characterized by comprising an identification processing module, a construction category module, a score calculating module and a score processing module;
the recognition processing module is used for recognizing the video to be processed to obtain the type of a specific sound-producing object in the video to be processed and extracting the sound-producing characteristics of the specific sound-producing object;
the build category module is configured to: constructing an object type of a specific sounding object and a specific sounding object audio based on the sounding characteristics, wherein the audio comprises audio introduction and audio keywords;
the calculate score module is configured to: performing score matching processing based on the object category, the audio introduction and the audio keywords of the specific sound-producing object to respectively obtain a first matching score and a neural network matching score;
the score processing module is configured to: and obtaining a video and audio matching score based on the first matching score and the neural network matching score, and obtaining at least one appropriate audio of the specific sounding object according to the video and audio matching score.
8. The video object sound effect construction system according to claim 7, further comprising a search matching module and a mixing processing module;
the search matching module is used for searching and matching the specific sounding object and the selected audio according to the video and audio matching score, so that the audio introduction, the audio keywords and the object type of the specific sounding object are matched with each other;
and the audio mixing processing module is used for mixing audio of all audios to form a complete audio file, and adding the audio file into the audio track of the video to enable the audio file and the video to be synchronous.
9. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method steps of one of claims 1 to 6.
10. Video object sound effect construction apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method steps of any one of claims 1 to 6 when executing the computer program.
CN202010517918.9A 2020-06-09 2020-06-09 Video object sound effect construction method, system, device and readable storage medium Active CN111681677B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010517918.9A CN111681677B (en) 2020-06-09 2020-06-09 Video object sound effect construction method, system, device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010517918.9A CN111681677B (en) 2020-06-09 2020-06-09 Video object sound effect construction method, system, device and readable storage medium

Publications (2)

Publication Number Publication Date
CN111681677A true CN111681677A (en) 2020-09-18
CN111681677B CN111681677B (en) 2023-08-04

Family

ID=72454156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010517918.9A Active CN111681677B (en) 2020-06-09 2020-06-09 Video object sound effect construction method, system, device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111681677B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150293928A1 (en) * 2014-04-14 2015-10-15 David Mo Chen Systems and Methods for Generating Personalized Video Playlists
US20170133038A1 (en) * 2015-11-11 2017-05-11 Apptek, Inc. Method and apparatus for keyword speech recognition
US20200013427A1 (en) * 2018-07-06 2020-01-09 Harman International Industries, Incorporated Retroactive sound identification system
CN110166818A (en) * 2018-11-30 2019-08-23 腾讯科技(深圳)有限公司 Wait match generation method, computer equipment and the storage medium of audio-video
CN109862421A (en) * 2018-12-05 2019-06-07 北京达佳互联信息技术有限公司 A kind of video information recognition methods, device, electronic equipment and storage medium
CN109584858A (en) * 2019-01-08 2019-04-05 武汉西山艺创文化有限公司 A kind of virtual dubbing method and its device based on AI artificial intelligence
CN110688526A (en) * 2019-11-07 2020-01-14 山东舜网传媒股份有限公司 Short video recommendation method and system based on key frame identification and audio textualization

Also Published As

Publication number Publication date
CN111681677B (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN106534548B (en) Voice error correction method and device
CN110364146B (en) Speech recognition method, speech recognition device, speech recognition apparatus, and storage medium
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN111681678B (en) Method, system, device and storage medium for automatically generating sound effects and matching videos
CN111626049B (en) Title correction method and device for multimedia information, electronic equipment and storage medium
US11321639B1 (en) Automated evaluation of acting performance using cloud services
CN112487139A (en) Text-based automatic question setting method and device and computer equipment
CN114143479B (en) Video abstract generation method, device, equipment and storage medium
CN111462553A (en) Language learning method and system based on video dubbing and sound correction training
CN110750996A (en) Multimedia information generation method and device and readable storage medium
CN114363695B (en) Video processing method, device, computer equipment and storage medium
CN113923521B (en) Video scripting method
CN115272533A (en) Intelligent image-text video conversion method and system based on video structured data
CN113642536B (en) Data processing method, computer device and readable storage medium
CN114281948A (en) Summary determination method and related equipment thereof
CN112733505B (en) Document generation method and device, electronic equipment and storage medium
CN111681680B (en) Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN111681676B (en) Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN111681679B (en) Video object sound effect searching and matching method, system, device and readable storage medium
CN115965810A (en) Short video rumor detection method based on multi-modal consistency
CN111681677B (en) Video object sound effect construction method, system, device and readable storage medium
CN113539235B (en) Text analysis and speech synthesis method, device, system and storage medium
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
Hukkeri et al. Erratic navigation in lecture videos using hybrid text based index point generation
Wang et al. Video Captioning Based on Joint Image–Audio Deep Learning Techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant