CN107888843A - Audio mixing method, apparatus, storage medium, and terminal device for user-generated content - Google Patents
- Publication number
- CN107888843A CN107888843A CN201710952671.1A CN201710952671A CN107888843A CN 107888843 A CN107888843 A CN 107888843A CN 201710952671 A CN201710952671 A CN 201710952671A CN 107888843 A CN107888843 A CN 107888843A
- Authority
- CN
- China
- Prior art keywords
- information
- user
- original content
- video frame
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an audio mixing method, apparatus, storage medium, and terminal device for user-generated content. The method includes: obtaining video information from the user-generated content; extracting video frame information from the video information; identifying a target object from the video frame information; and superimposing corresponding audio information according to the attributes of the target object. Sound effects are applied automatically based on the video content of the user-generated content, and matching background music is added, removing the reliance on manual processing.
Description
Technical field
The invention belongs to the field of communication technology, and in particular relates to an audio mixing method, apparatus, storage medium, and terminal device for user-generated content.
Background art
User-generated content (UGC, User Generated Content) arose with the Web2.0 concept, whose main characteristic is the promotion of personalization. UGC is not a specific business but a new way for users to use the internet: shifting from download-centric use to equal emphasis on downloading and uploading. Websites such as YouTube can be regarded as UGC success stories; social networks, video sharing, blogs, and vlogs are all major application forms of UGC. As mobile phones have gradually become more capable, users can produce pictures and video anywhere and at any time, recording their moods and experiences on their phones, and sharing that content with others is becoming a trend. However, the quality of user-produced content is uneven. A user who wants to create high-quality works, attract a high-quality audience, expand the influence of the works, and improve click-through rates needs a series of post-production steps for content editing and sound editing.
After making UGC, a user who wants to expand its influence and reach needs a series of post-production steps, mainly including: face beautification, editing of the video content, subtitle processing, and audio post-processing. At present, all of this tedious post-production is done manually. The audio part in particular requires fine manual work to make UGC content personalized and make it stand out, for example matching different personalized voices and different effects to different UGC content and scenes. All of this requires a person to watch the UGC content repeatedly and mark content-change times precisely, and making manually spliced audio sound natural and fluent is itself a considerable challenge.
Summary of the invention
The present invention provides an audio mixing method, apparatus, storage medium, and terminal device for user-generated content that can automatically match background music to video content.
An embodiment of the present invention provides an audio mixing method for user-generated content, comprising the steps of:
Obtaining video information from the user-generated content;
Extracting video frame information from the video information;
Identifying a target object from the video frame information;
Superimposing corresponding audio information according to the attributes of the target object.
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying scene information from the video frame information;
If the scene information identified from adjacent video frames is the same, merging the adjacent video frames;
Superimposing corresponding background music according to the attributes of the scene information.
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying person information or object information from the video frame information;
Adjusting voice features according to the attributes of the person information or object information, so as to produce converted audio for the person or object.
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying action information from the video frame information;
Superimposing action background music according to the attributes of the action information.
Further, identifying a target object from the video frame information includes:
Identifying the target object from the video frame information using a deep learning method.
An embodiment of the present invention also provides an audio mixing apparatus for user-generated content, including:
An acquiring unit for obtaining video information from the user-generated content;
An extraction unit for extracting video frame information from the video information;
A recognition unit for identifying a target object from the video frame information;
A superposition unit for superimposing corresponding audio information according to the attributes of the target object.
Further, the recognition unit includes:
A scene recognition subunit for identifying scene information from the video frame information;
A merging subunit for merging adjacent video frames if the scene information identified from them is the same;
The superposition unit is further configured to superimpose corresponding background music according to the attributes of the scene information.
Further, the recognition unit includes:
An object recognition subunit for identifying person information or object information from the video frame information;
The superposition unit is further configured to adjust voice features according to the attributes of the person information or object information, so as to produce converted audio for the person or object.
Further, the recognition unit includes:
An action recognition subunit for identifying action information from the video frame information;
The superposition unit is further configured to superimpose action background music according to the attributes of the action information.
Further, the recognition unit is additionally configured to identify the target object from the video frame information using a deep learning method.
An embodiment of the present invention also provides a storage medium on which a computer program is stored; when the computer program runs on a computer, the computer performs the audio mixing method for user-generated content described in any of the above embodiments.
An embodiment of the present invention also provides a terminal device including a processor and a memory; the memory stores a computer program, and the processor, by calling the computer program, performs the audio mixing method for user-generated content described in any of the above embodiments.
The audio mixing method, apparatus, storage medium, and terminal device for user-generated content provided by the embodiments of the present invention obtain video information from the user-generated content, extract video frame information from the video information, identify a target object from the video frame information, and superimpose corresponding audio information according to the attributes of the target object. Sound effects are applied automatically based on the video content of the user-generated content, and matching background music is added, removing the reliance on manual processing.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the audio mixing method for user-generated content provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the evolution of classic CNN convolutional neural networks provided by an embodiment of the present invention;
Fig. 3 is another flowchart of the audio mixing method for user-generated content provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of YOLO object detection provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the C3D network structure provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention;
Fig. 7 is another schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention;
Fig. 8 is a further schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention;
Fig. 9 is yet another schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention;
Fig. 10 is still another schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of the residual learning module of ResNet provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.
As shown in Fig. 1, Fig. 1 is a flowchart of the audio mixing method for user-generated content provided by an embodiment of the present invention. In this embodiment, an audio mixing method for user-generated content includes the steps:
Step S101: obtain video information from the user-generated content;
Step S102: extract video frame information from the video information;
Step S103: identify a target object from the video frame information;
Step S104: superimpose corresponding audio information according to the attributes of the target object.
The audio mixing method for user-generated content provided by this embodiment obtains video information from the user-generated content, extracts video frame information from the video information, identifies a target object from the video frame information, and superimposes corresponding audio information according to the attributes of the target object. Sound effects are applied automatically based on the video content of the user-generated content, and matching background music is added, removing the reliance on manual processing. The original video is input and, after passing through the intelligent mixing system, sound effects are applied automatically according to the scenes in the UGC content, and corresponding background music is matched automatically based on scene and action recognition.
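The four steps above amount to a recognize-then-match loop over extracted frames. A minimal sketch, assuming frames have already been extracted; `mix_ugc_audio`, `recognize`, and `audio_for` are hypothetical stand-ins for the recognition models and the music library described later, not names from the patent:

```python
def mix_ugc_audio(frames, recognize, audio_for):
    """Skeleton of steps S103-S104: identify a target object per frame and
    pick the audio matching its attributes. Both callables are stand-ins."""
    track = []
    for frame in frames:
        target = recognize(frame)        # S103: identify the target object
        track.append(audio_for(target))  # S104: audio by target attribute
    return track

# Toy run with dictionary-backed stand-ins:
frames = ["f1", "f2"]
scene_of = {"f1": "sea", "f2": "bar"}.get
music_of = {"sea": "waves.mp3", "bar": "jazz.mp3"}.get
print(mix_ugc_audio(frames, scene_of, music_of))  # → ['waves.mp3', 'jazz.mp3']
```

In practice `recognize` would be one of the scene, object, or action networks discussed below, and `audio_for` a lookup into a background-music library.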
Extracting the video frame information from the video information is done by using FFmpeg to extract each frame image from the video.
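A hedged sketch of that extraction step: the helper below only builds the FFmpeg command line (the `ffmpeg_extract_frames_cmd` name and the optional `fps` sampling are illustrative, not from the patent); `ffmpeg -i input out%06d.png` dumping numbered frame images is standard FFmpeg usage.

```python
def ffmpeg_extract_frames_cmd(video_path, out_pattern, fps=None):
    """Build an ffmpeg command that dumps video frames as image files.

    `out_pattern` is a printf-style template such as "frames/%06d.png".
    If `fps` is given, frames are sampled at that rate via the fps filter
    instead of extracting every frame.
    """
    cmd = ["ffmpeg", "-i", video_path]
    if fps is not None:
        cmd += ["-vf", f"fps={fps}"]
    cmd += [out_pattern]
    return cmd

# The command can then be executed with subprocess.run(cmd, check=True).
print(ffmpeg_extract_frames_cmd("ugc.mp4", "frames/%06d.png", fps=5))
```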
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying scene information from the video frame information;
If the scene information identified from adjacent video frames is the same, merging the adjacent video frames;
Superimposing corresponding background music according to the attributes of the scene information.
Specifically, the target object in this embodiment is scene information. First, scene information is obtained by frame-level scene recognition: using a deep learning method, each picture's scene is classified; for example, a deep convolutional neural network such as Inception V3 or ResNet-152 can perform the scene classification, and the Places365 scene database can be used for training. Then scene segmentation is performed: the scene recognition classes of consecutive frames are compared, consecutive frames of the same scene are merged, and a segmentation of the video by scene is obtained. Finally, background music corresponding to the scene is played, for example for the sea, a bar, a dance hall, a castle, a field, an island, a skating rink, and so on; the original sound can also be transformed according to the scene, for example generating an echo effect in a valley or a reverberation effect in a concert hall.
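The scene-segmentation step, merging consecutive frames that received the same scene class, can be sketched as a run-length merge over per-frame labels; `merge_scene_segments` is an illustrative name, not from the patent:

```python
def merge_scene_segments(frame_labels):
    """Collapse per-frame scene labels into (label, start, end) segments.

    Adjacent frames with the same predicted scene are merged, yielding the
    segmentation of the video by scene; `end` is exclusive.
    """
    segments = []
    for i, label in enumerate(frame_labels):
        if segments and segments[-1][0] == label:
            # Same scene as the previous frame: extend the open segment.
            segments[-1] = (label, segments[-1][1], i + 1)
        else:
            segments.append((label, i, i + 1))
    return segments

labels = ["sea", "sea", "sea", "bar", "bar", "sea"]
print(merge_scene_segments(labels))
# → [('sea', 0, 3), ('bar', 3, 5), ('sea', 5, 6)]
```

Each resulting segment can then be assigned one piece of background music.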
Several classic convolutional neural networks from machine learning are AlexNet, VGGNet, Google InceptionNet, and ResNet, all landmarks of deep learning and neural networks. Significant performance gains have almost always been accompanied by deeper networks; ResNet uses as many as 152 hidden layers. As shown in Fig. 2, Fig. 2 is a schematic diagram of the evolution of classic CNN convolutional neural networks provided by an embodiment of the present invention.
ResNet introduces a residual learning framework that eases the training of networks substantially deeper than those used before. It explicitly reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. On the ImageNet dataset, residual networks with 152 layers (variants exceed 1000 layers), 8 times deeper than VGG networks, were evaluated while still having lower complexity. The residual learning module of ResNet is shown in Fig. 11.
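A toy numpy illustration of the residual unit y = relu(F(x) + x): with the residual branch zeroed out, the identity shortcut alone carries the input through, which is why very deep stacks of such blocks remain trainable. The weights and shapes here are arbitrary, chosen only to make the behavior visible.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Minimal residual learning unit: y = relu(F(x) + x).

    F is two linear transforms with a ReLU between them; the identity
    shortcut lets the block learn a residual F(x) = H(x) - x instead of
    the full unreferenced mapping H(x).
    """
    relu = lambda z: np.maximum(z, 0.0)
    fx = relu(x @ w1) @ w2   # the residual branch F(x)
    return relu(fx + x)      # shortcut connection adds the input back

x = np.ones(4)
w1 = np.zeros((4, 4))
w2 = np.zeros((4, 4))
# With zero weights F(x) = 0, so the block reduces to the identity (after ReLU):
print(residual_block(x, w1, w2).tolist())  # → [1.0, 1.0, 1.0, 1.0]
```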
Starting from the original sound, adding a time-delayed copy back onto the original produces an echo effect, while adding a copy of the original that has undergone a certain convolution and delay produces a reverberation effect. The specific strength of the echo and reverberation is then tuned by continually adjusting the convolution kernel and the time delay.
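Both effects can be sketched directly in numpy: echo as delay-and-add, reverberation as convolution with an impulse response. `add_echo` and `add_reverb` are illustrative names; a real reverb would convolve with a measured room impulse response rather than this three-tap toy kernel.

```python
import numpy as np

def add_echo(signal, delay_samples, decay=0.5):
    """Delay-and-add echo: out[n] = x[n] + decay * x[n - delay]."""
    out = np.copy(signal).astype(float)
    out[delay_samples:] += decay * signal[:-delay_samples]
    return out

def add_reverb(signal, impulse_response):
    """Reverb as convolution of the dry signal with an impulse response."""
    return np.convolve(signal, impulse_response)[: len(signal)]

x = np.array([1.0, 0.0, 0.0, 0.0])  # unit impulse makes the effect visible
print(add_echo(x, delay_samples=2, decay=0.5).tolist())
# → [1.0, 0.0, 0.5, 0.0]
print(add_reverb(x, np.array([1.0, 0.0, 0.25])).tolist())
# → [1.0, 0.0, 0.25, 0.0]
```

Varying `delay_samples`, `decay`, and the impulse response corresponds to the kernel-and-delay tuning described above.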
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying person information or object information from the video frame information;
Adjusting voice features according to the attributes of the person information or object information, so as to produce converted audio for the person or object.
Specifically, the target object in this embodiment is person information or object information. First, key object or person detection is performed: using a deep learning method, the picture is checked for certain key or characteristic persons or objects, such as a Transformer, a little girl, a middle-aged man, or certain cartoon characters. A neural-network-based object detection model such as YOLO (You Only Look Once) can be used for detection. Then voice conversion is performed: according to the detected key object or person, different voice features are adjusted, such as the fundamental frequency F0, duration, pitch, and mel-frequency cepstral coefficients (MFCC), so as to achieve the effect of converting the sound into that of the key object or person.
For voice conversion based on the detected key object or person, the voice features are adjusted as follows: in general, a girl's fundamental frequency is higher than a boy's, and a robot's is higher than a girl's; robot and young-girl voices typically have slightly shorter durations. By adjusting several such sound features, the desired converted audio is obtained.
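As one hedged illustration of adjusting a single feature, F0 can be raised by resampling, which also shortens the duration, matching the "slightly shorter" note above. Real voice conversion would manipulate F0, duration, and MFCC through a vocoder rather than this crude resampling; `shift_pitch` is an illustrative name.

```python
import numpy as np

def shift_pitch(signal, semitones):
    """Crude pitch shift by resampling: playing a signal faster raises its
    fundamental frequency F0 and proportionally shortens its duration."""
    factor = 2.0 ** (semitones / 12.0)   # frequency ratio per semitone
    n_out = int(len(signal) / factor)
    idx = np.arange(n_out) * factor      # fractional read positions
    return np.interp(idx, np.arange(len(signal)), signal)

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)     # one second of a 220 Hz tone
up = shift_pitch(tone, 12)               # one octave up, toward ~440 Hz
print(len(tone), len(up))                # → 8000 4000 (half the duration)
```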
As shown in Fig. 4, Fig. 4 is a schematic diagram of YOLO object detection provided by an embodiment of the present invention. Object detection frameworks have evolved from R-CNN, SPP-net, Fast R-CNN, and Faster R-CNN to the YOLO algorithm, with both accuracy and speed steadily improving. YOLO is a convolutional neural network that predicts multiple box positions and classes in a single pass, achieving end-to-end object detection and recognition; its greatest advantage is speed. In essence it treats object detection as regression, so a single CNN performing regression suffices, without a complicated pipeline. YOLO trains on whole images rather than on sliding windows or extracted region proposals, which lets it distinguish targets from background better; by contrast, Fast R-CNN with proposal-based training often mis-detects background regions as specific objects. Of course, YOLO sacrifices some precision in exchange for its detection speed.
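One component of every such detector that is easy to show concretely is non-maximum suppression, which keeps the highest-scoring box among overlapping candidates. This numpy sketch is generic NMS, not YOLO's exact post-processing:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring box, dropping overlapping ones."""
    order = np.argsort(scores)[::-1]     # indices by descending score
    keep = []
    while len(order):
        best = order[0]
        keep.append(int(best))
        # Retain only candidates that do not overlap the kept box too much.
        order = order[1:][[iou(boxes[best], boxes[i]) < iou_threshold
                           for i in order[1:]]]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))  # → [0, 2]
```

Box 1 heavily overlaps box 0 and scores lower, so it is suppressed; box 2 is disjoint and survives.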
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying action information from the video frame information;
Superimposing action background music according to the attributes of the action information.
As shown in Fig. 5, Fig. 5 is a schematic diagram of the C3D network structure provided by an embodiment of the present invention. Specifically, the target object in this embodiment is action information. The action information can be obtained with a deep learning method: the video is fed into a recognition network, which outputs a specific video action class. The action recognition network can use the C3D model (3D convolutional networks), and the UCF-101 action recognition database can be used for training. Background music corresponding to the action information is then played, for example adding background effects for punches, impacts, and so on.
For video action recognition, the input is three-dimensional video rather than the two-dimensional images considered so far, so traditional CNNs are unsuitable and three-dimensional convolutional neural networks are needed. 2D convolution applies to single-channel and multi-channel images (where a multi-channel image may mean the 3 color channels of one picture, or several pictures stacked together, i.e., a short clip of video); for one filter, the output is a two-dimensional feature map, and the channel information is fully compressed. The output of a 3D convolution, by contrast, remains a 3D feature map.
For example, for a video segment of size c*l*h*w, where c is the number of image channels (usually 3), l is the length of the video sequence, and h and w are the height and width of the video, a 3D convolution with a kernel size of 3*3*3, stride 1, "same" padding, and K filters yields an output of dimension K*l*h*w.
As shown in Fig. 5, the figure depicts the C3D network structure for activity recognition, based on 3D convolution operations. The C3D network has 8 convolution operations and 5 pooling operations. The convolution kernels are of size 3*3*3 with stride 1*1*1. The pooling kernels are 2*2*2 with stride 2*2*2, except for the first pooling layer, whose kernel size and stride are 1*2*2, so as not to reduce the temporal length too early. After two fully connected layers and a softmax layer, the network produces its final output. The network's input size is 3*16*112*112, i.e., 16 frames are input at a time.
As shown in Fig. 3, Fig. 3 is another flowchart of the audio mixing method for user-generated content provided by an embodiment of the present invention. In this embodiment, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying scene information from the video frame information, identifying person information or object information from the video frame information, and identifying action information from the video frame information;
If the scene information identified from adjacent video frames is the same, merging the adjacent video frames;
Superimposing corresponding background music according to the attributes of the scene information, adjusting voice features according to the attributes of the person information or object information so as to produce converted audio for the person or object, and superimposing action background music according to the attributes of the action information.
As shown in Fig. 6, Fig. 6 is a schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention. This embodiment provides an audio mixing apparatus 200 for user-generated content, including an acquiring unit 201, an extraction unit 202, a recognition unit 203, and a superposition unit 204. Wherein:
The acquiring unit 201 obtains video information from the user-generated content;
The extraction unit 202 extracts video frame information from the video information;
The recognition unit 203 identifies a target object from the video frame information;
The superposition unit 204 superimposes corresponding audio information according to the attributes of the target object.
The audio mixing method for user-generated content provided by this embodiment obtains video information from the user-generated content, extracts video frame information from the video information, identifies a target object from the video frame information, and superimposes corresponding audio information according to the attributes of the target object. Sound effects are applied automatically based on the video content of the user-generated content, and matching background music is added, removing the reliance on manual processing. The original video is input and, after passing through the intelligent mixing system, sound effects are applied automatically according to the scenes in the UGC content, and corresponding background music is matched automatically based on scene and action recognition.
As shown in Fig. 7, Fig. 7 is another schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention. The recognition unit 203 includes:
A scene recognition subunit 2031 for identifying scene information from the video frame information;
A merging subunit 2032 for merging adjacent video frames if the scene information identified from them is the same;
The superposition unit 204 is further configured to superimpose corresponding background music according to the attributes of the scene information.
Specifically, the target object in this embodiment is scene information. First, scene information is obtained by frame-level scene recognition: using a deep learning method, each picture's scene is classified; for example, a deep convolutional neural network such as Inception V3 or ResNet-152 can perform the scene classification, and the Places365 scene database can be used for training. Then scene segmentation is performed: the scene recognition classes of consecutive frames are compared, consecutive frames of the same scene are merged, and a segmentation of the video by scene is obtained. Finally, background music corresponding to the scene is played, for example for the sea, a bar, a dance hall, a castle, a field, an island, a skating rink, and so on; the original sound can also be transformed according to the scene, for example generating an echo effect in a valley or a reverberation effect in a concert hall.
Starting from the original sound, adding a time-delayed copy back onto the original produces an echo effect, while adding a copy of the original that has undergone a certain convolution and delay produces a reverberation effect. The specific strength of the echo and reverberation is then tuned by continually adjusting the convolution kernel and the time delay.
As shown in Fig. 8, Fig. 8 is a further schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention. The recognition unit 203 includes:
An object recognition subunit 2033 for identifying person information or object information from the video frame information;
The superposition unit 204 is further configured to adjust voice features according to the attributes of the person information or object information, so as to produce converted audio for the person or object.
Specifically, the target object in this embodiment is person information or object information. First, key object or person detection is performed: using a deep learning method, the picture is checked for certain key or characteristic persons or objects, such as a Transformer, a little girl, a middle-aged man, or certain cartoon characters. A neural-network-based object detection model such as YOLO (You Only Look Once) can be used for detection. Then voice conversion is performed: according to the detected key object or person, different voice features are adjusted, such as the fundamental frequency F0, duration, pitch, and mel-frequency cepstral coefficients (MFCC), so as to achieve the effect of converting the sound into that of the key object or person.
For voice conversion based on the detected key object or person, the voice features are adjusted as follows: in general, a girl's fundamental frequency is higher than a boy's, and a robot's is higher than a girl's; robot and young-girl voices typically have slightly shorter durations. By adjusting several such sound features, the desired converted audio is obtained.
As shown in Fig. 9, Fig. 9 is yet another schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention. The recognition unit 203 includes:
An action recognition subunit 2034 for identifying action information from the video frame information;
The superposition unit 204 is further configured to superimpose action background music according to the attributes of the action information.
Specifically, the target object in this embodiment is action information. The action information can be obtained with a deep learning method: the video is fed into a recognition network, which outputs a specific video action class. The action recognition network can use the C3D model (3D convolutional networks), and the UCF-101 action recognition database can be used for training. Background music corresponding to the action information is then played, for example adding background effects for punches, impacts, and so on.
As shown in Fig. 10, Fig. 10 is yet another schematic structural diagram of the sound mixing device for user's original content provided in an embodiment of the present invention. The recognition unit 203 includes:
a scene recognition subunit 2031, configured to identify and obtain scene information from the video frame information;
an object recognition subunit 2033, configured to identify and obtain person information or object information from the video frame information;
an action recognition subunit 2034, configured to identify and obtain action information from the video frame information; and
a merging subunit 2032, configured to merge adjacent video frame information when the scene information identified and obtained from the adjacent video frame information is identical.
The superposition unit 204 is further configured to superimpose corresponding background music according to an attribute of the scene information; to adjust voice features according to an attribute of the person information or the object information, so that the voice of the person information or the object information is converted into the voice-changed audio; and to superimpose action background music according to an attribute of the action information.
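The merging subunit's behavior, collapsing runs of adjacent frames whose recognized scene is identical into one segment, can be sketched as follows (the scene labels are illustrative, not from the patent):

```python
from itertools import groupby

def merge_adjacent_scenes(frame_scenes):
    """Collapse consecutive frames sharing a scene label into
    (scene, start_index, end_index) segments."""
    segments, start = [], 0
    for scene, run in groupby(frame_scenes):   # groups consecutive equal labels
        length = len(list(run))
        segments.append((scene, start, start + length - 1))
        start += length
    return segments

print(merge_adjacent_scenes(["beach", "beach", "forest", "forest", "forest"]))
# [('beach', 0, 1), ('forest', 2, 4)]
```

One background track can then be superimposed per merged segment instead of per frame, avoiding a music change on every frame of an unchanged scene.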
In specific implementation, the above modules may be implemented as independent entities, or may be arbitrarily combined and implemented as one or several entities.
All of the above technical solutions may be arbitrarily combined to form alternative embodiments of the present invention, which are not described again here.
In the embodiments of the present invention, the sound mixing device for user's original content and the sound mixing method for user's original content in the foregoing embodiments belong to the same concept. Any method provided in the embodiments of the sound mixing method for user's original content may be run on the sound mixing device; for the specific implementation process, refer to the embodiments of the sound mixing method for user's original content, which are not described again here.
An embodiment of the present invention further provides a terminal device including a processor and a memory. The memory stores a computer program, and the processor is configured to call the computer program to perform the sound mixing method for user's original content according to any one of the above.
The terminal device may be a smartphone, a tablet computer, a desktop computer, a notebook computer, a handheld computer, or a similar apparatus.
An embodiment of the present invention further provides a storage medium storing a computer program. When the computer program is run on a computer, the computer is caused to perform the sound mixing method for user's original content in any of the above embodiments, for example: obtaining video information in the user's original content; extracting video frame information from the video information; identifying a target object from the video frame information; and superimposing corresponding audio information according to an attribute of the target object.
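The four steps recited above can be sketched end to end. Every helper here (frame extraction, target recognition, audio selection) is a caller-supplied stand-in stub, not the claimed implementation:

```python
def mix_ugc(video_info, extract_frames, identify_target, pick_audio):
    """Run the four-step pipeline: extract frames, identify a target in each,
    and pair frames that contain a target with the selected audio."""
    mixed = []
    for frame in extract_frames(video_info):
        target = identify_target(frame)      # may be None if nothing is found
        if target is not None:
            mixed.append((frame, pick_audio(target)))
    return mixed

# Stub components for illustration only.
frames = lambda v: ["f0", "f1", "f2"]
target = lambda f: "punch" if f == "f1" else None
audio = lambda t: f"{t}.wav"
print(mix_ugc("video.mp4", frames, target, audio))
# [('f1', 'punch.wav')]
```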
In the embodiments of the present invention, the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
It should be noted that, for the sound mixing method for user's original content of the embodiments of the present invention, a person of ordinary skill in the art can understand that all or part of the flow of the sound mixing method described in the embodiments of the present invention may be completed by a computer program controlling related hardware. The computer program may be stored in a computer-readable storage medium, for example in a memory of an electronic device, and executed by at least one processor in the electronic device, and the execution process may include the flow of the embodiments of the sound mixing method for user's original content. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
For the sound mixing device for user's original content of the embodiments of the present invention, the functional modules may be integrated into one processing chip, each module may exist physically alone, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. When the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The sound mixing method, device, storage medium, and electronic device for user's original content provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, a person skilled in the art may make changes to the specific implementations and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (12)
1. A sound mixing method for user's original content, comprising the steps of:
obtaining video information in the user's original content;
extracting video frame information from the video information;
identifying a target object from the video frame information; and
superimposing corresponding audio information according to an attribute of the target object.
2. The sound mixing method for user's original content according to claim 1, wherein identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object comprises:
identifying and obtaining scene information from the video frame information;
merging adjacent video frame information when the scene information identified and obtained from the adjacent video frame information is identical; and
superimposing corresponding background music according to an attribute of the scene information.
3. The sound mixing method for user's original content according to claim 1, wherein identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object comprises:
identifying and obtaining person information or object information from the video frame information; and
adjusting voice features according to an attribute of the person information or the object information, so that the voice of the person information or the object information is converted into the voice-changed audio.
4. The sound mixing method for user's original content according to claim 1, wherein identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object comprises:
identifying and obtaining action information from the video frame information; and
superimposing action background music according to an attribute of the action information.
5. The sound mixing method for user's original content according to claim 1, wherein identifying a target object from the video frame information comprises:
identifying the target object from the video frame information by using a deep learning method.
6. A sound mixing device for user's original content, comprising:
an acquiring unit, configured to obtain video information in the user's original content;
an extraction unit, configured to extract video frame information from the video information;
a recognition unit, configured to identify a target object from the video frame information; and
a superposition unit, configured to superimpose corresponding audio information according to an attribute of the target object.
7. The sound mixing device for user's original content according to claim 6, wherein the recognition unit comprises:
a scene recognition subunit, configured to identify and obtain scene information from the video frame information; and
a merging subunit, configured to merge adjacent video frame information when the scene information identified and obtained from the adjacent video frame information is identical;
wherein the superposition unit is further configured to superimpose corresponding background music according to an attribute of the scene information.
8. The sound mixing device for user's original content according to claim 6, wherein the recognition unit comprises:
an object recognition subunit, configured to identify and obtain person information or object information from the video frame information;
wherein the superposition unit is further configured to adjust voice features according to an attribute of the person information or the object information, so that the voice of the person information or the object information is converted into the voice-changed audio.
9. The sound mixing device for user's original content according to claim 6, wherein the recognition unit comprises:
an action recognition subunit, configured to identify and obtain action information from the video frame information;
wherein the superposition unit is further configured to superimpose action background music according to an attribute of the action information.
10. The sound mixing device for user's original content according to claim 6, wherein the recognition unit is further configured to identify the target object from the video frame information by using a deep learning method.
11. A storage medium storing a computer program, wherein, when the computer program is run on a computer, the computer is caused to perform the sound mixing method for user's original content according to any one of claims 1 to 5.
12. A terminal device comprising a processor and a memory, wherein the memory stores a computer program, and the processor is configured to call the computer program to perform the sound mixing method for user's original content according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710952671.1A CN107888843A (en) | 2017-10-13 | 2017-10-13 | Sound mixing method, device, storage medium and the terminal device of user's original content |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107888843A true CN107888843A (en) | 2018-04-06 |
Family
ID=61781613
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710952671.1A Pending CN107888843A (en) | 2017-10-13 | 2017-10-13 | Sound mixing method, device, storage medium and the terminal device of user's original content |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107888843A (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108805036A (en) * | 2018-05-22 | 2018-11-13 | 电子科技大学 | A kind of new non-supervisory video semanteme extracting method |
CN109119089A (en) * | 2018-06-05 | 2019-01-01 | 安克创新科技股份有限公司 | The method and apparatus of penetrating processing is carried out to music |
CN109286841A (en) * | 2018-10-17 | 2019-01-29 | Oppo广东移动通信有限公司 | Film sound effect treatment method and Related product |
CN109309776A (en) * | 2018-08-13 | 2019-02-05 | 张利军 | Piece caudal flexure based on dynamic degree selects system |
CN109587552A (en) * | 2018-11-26 | 2019-04-05 | Oppo广东移动通信有限公司 | Video personage sound effect treatment method, device, mobile terminal and storage medium |
CN109618076A (en) * | 2018-08-07 | 2019-04-12 | 吴秋琴 | The adaptive method for down loading of singer's music |
CN109640166A (en) * | 2018-08-13 | 2019-04-16 | 张利军 | Piece caudal flexure selection method based on dynamic degree |
CN110163050A (en) * | 2018-07-23 | 2019-08-23 | 腾讯科技(深圳)有限公司 | A kind of method for processing video frequency and device, terminal device, server and storage medium |
CN110677716A (en) * | 2019-08-20 | 2020-01-10 | 咪咕音乐有限公司 | Audio processing method, electronic device, and storage medium |
CN110852375A (en) * | 2019-11-09 | 2020-02-28 | 北京工业大学 | End-to-end music score note identification method based on deep learning |
CN110858924A (en) * | 2018-08-22 | 2020-03-03 | 北京优酷科技有限公司 | Video background music generation method and device |
CN111031391A (en) * | 2019-12-19 | 2020-04-17 | 北京达佳互联信息技术有限公司 | Video dubbing method, device, server, terminal and storage medium |
CN111028920A (en) * | 2019-12-06 | 2020-04-17 | 杨保红 | Mental health decompression flow system platform |
WO2020087979A1 (en) * | 2018-10-30 | 2020-05-07 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating model |
CN111541936A (en) * | 2020-04-02 | 2020-08-14 | 腾讯科技(深圳)有限公司 | Video and image processing method and device, electronic equipment and storage medium |
CN111970579A (en) * | 2020-08-14 | 2020-11-20 | 苏州思萃人工智能研究所有限公司 | Video music adaptation method and system based on AI video understanding |
CN112040335A (en) * | 2020-08-14 | 2020-12-04 | 苏州思萃人工智能研究所有限公司 | Artificial intelligent sound effect creation and video adaptation method and system |
CN112633087A (en) * | 2020-12-09 | 2021-04-09 | 新奥特(北京)视频技术有限公司 | Automatic journaling method and device based on picture analysis for IBC system |
CN112690823A (en) * | 2020-12-22 | 2021-04-23 | 海南力维科贸有限公司 | Method and system for identifying physiological sounds of lungs |
CN113469321A (en) * | 2020-03-30 | 2021-10-01 | 聚晶半导体股份有限公司 | Object detection device and object detection method based on neural network |
US11495015B2 (en) | 2020-03-30 | 2022-11-08 | Altek Semiconductor Corp. | Object detection device and object detection method based on neural network |
WO2024067157A1 (en) * | 2022-09-29 | 2024-04-04 | 北京字跳网络技术有限公司 | Special-effect video generation method and apparatus, electronic device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1073272A1 (en) * | 1999-02-15 | 2001-01-31 | Sony Corporation | Signal processing method and video/audio processing device |
CN102222227A (en) * | 2011-04-25 | 2011-10-19 | 中国华录集团有限公司 | Video identification based system for extracting film images |
CN103050124A (en) * | 2011-10-13 | 2013-04-17 | 华为终端有限公司 | Sound mixing method, device and system |
CN103795897A (en) * | 2014-01-21 | 2014-05-14 | 深圳市中兴移动通信有限公司 | Method and device for automatically generating background music |
CN106534618A (en) * | 2016-11-24 | 2017-03-22 | 广州爱九游信息技术有限公司 | Method, device and system for realizing pseudo field interpretation |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108805036A (en) * | 2018-05-22 | 2018-11-13 | 电子科技大学 | A kind of new non-supervisory video semanteme extracting method |
CN109119089A (en) * | 2018-06-05 | 2019-01-01 | 安克创新科技股份有限公司 | The method and apparatus of penetrating processing is carried out to music |
CN113450811B (en) * | 2018-06-05 | 2024-02-06 | 安克创新科技股份有限公司 | Method and equipment for performing transparent processing on music |
US11887615B2 (en) | 2018-06-05 | 2024-01-30 | Anker Innovations Technology Co., Ltd. | Method and device for transparent processing of music |
CN113450811A (en) * | 2018-06-05 | 2021-09-28 | 安克创新科技股份有限公司 | Method and equipment for performing transparent processing on music |
WO2019233359A1 (en) * | 2018-06-05 | 2019-12-12 | 安克创新科技股份有限公司 | Method and device for transparency processing of music |
CN110163050A (en) * | 2018-07-23 | 2019-08-23 | 腾讯科技(深圳)有限公司 | A kind of method for processing video frequency and device, terminal device, server and storage medium |
CN110163050B (en) * | 2018-07-23 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Video processing method and device, terminal equipment, server and storage medium |
CN109618076A (en) * | 2018-08-07 | 2019-04-12 | 吴秋琴 | The adaptive method for down loading of singer's music |
CN109309776B (en) * | 2018-08-13 | 2019-08-27 | 上海蒙彤文化传播有限公司 | Piece caudal flexure based on dynamic degree selects system |
CN109640166A (en) * | 2018-08-13 | 2019-04-16 | 张利军 | Piece caudal flexure selection method based on dynamic degree |
CN109309776A (en) * | 2018-08-13 | 2019-02-05 | 张利军 | Piece caudal flexure based on dynamic degree selects system |
CN110858924A (en) * | 2018-08-22 | 2020-03-03 | 北京优酷科技有限公司 | Video background music generation method and device |
CN110858924B (en) * | 2018-08-22 | 2021-11-26 | 阿里巴巴(中国)有限公司 | Video background music generation method and device and storage medium |
CN109286841B (en) * | 2018-10-17 | 2021-10-08 | Oppo广东移动通信有限公司 | Movie sound effect processing method and related product |
CN109286841A (en) * | 2018-10-17 | 2019-01-29 | Oppo广东移动通信有限公司 | Film sound effect treatment method and Related product |
WO2020087979A1 (en) * | 2018-10-30 | 2020-05-07 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating model |
CN109587552A (en) * | 2018-11-26 | 2019-04-05 | Oppo广东移动通信有限公司 | Video personage sound effect treatment method, device, mobile terminal and storage medium |
CN110677716A (en) * | 2019-08-20 | 2020-01-10 | 咪咕音乐有限公司 | Audio processing method, electronic device, and storage medium |
CN110852375A (en) * | 2019-11-09 | 2020-02-28 | 北京工业大学 | End-to-end music score note identification method based on deep learning |
CN111028920A (en) * | 2019-12-06 | 2020-04-17 | 杨保红 | Mental health decompression flow system platform |
CN111031391A (en) * | 2019-12-19 | 2020-04-17 | 北京达佳互联信息技术有限公司 | Video dubbing method, device, server, terminal and storage medium |
CN113469321A (en) * | 2020-03-30 | 2021-10-01 | 聚晶半导体股份有限公司 | Object detection device and object detection method based on neural network |
US11495015B2 (en) | 2020-03-30 | 2022-11-08 | Altek Semiconductor Corp. | Object detection device and object detection method based on neural network |
CN111541936A (en) * | 2020-04-02 | 2020-08-14 | 腾讯科技(深圳)有限公司 | Video and image processing method and device, electronic equipment and storage medium |
CN112040335A (en) * | 2020-08-14 | 2020-12-04 | 苏州思萃人工智能研究所有限公司 | Artificial intelligent sound effect creation and video adaptation method and system |
CN111970579A (en) * | 2020-08-14 | 2020-11-20 | 苏州思萃人工智能研究所有限公司 | Video music adaptation method and system based on AI video understanding |
CN112633087A (en) * | 2020-12-09 | 2021-04-09 | 新奥特(北京)视频技术有限公司 | Automatic journaling method and device based on picture analysis for IBC system |
CN112690823A (en) * | 2020-12-22 | 2021-04-23 | 海南力维科贸有限公司 | Method and system for identifying physiological sounds of lungs |
WO2024067157A1 (en) * | 2022-09-29 | 2024-04-04 | 北京字跳网络技术有限公司 | Special-effect video generation method and apparatus, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180406 |