CN107888843A - Audio mixing method, apparatus, storage medium, and terminal device for user-generated content - Google Patents
- Publication number
- CN107888843A CN107888843A CN201710952671.1A CN201710952671A CN107888843A CN 107888843 A CN107888843 A CN 107888843A CN 201710952671 A CN201710952671 A CN 201710952671A CN 107888843 A CN107888843 A CN 107888843A
- Authority
- CN
- China
- Prior art keywords
- information
- user
- original content
- video frame
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/222—Studio circuitry; Studio devices; Studio equipment
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an audio mixing method, apparatus, storage medium, and terminal device for user-generated content. The method includes: obtaining video information from the user-generated content; extracting video frame information from the video information; identifying a target object from the video frame information; and superimposing corresponding audio information according to the attributes of the target object. Sound effects are applied automatically based on the video content of the user-generated content, and matching background music is added, removing the reliance on manual processing.
Description
Technical field
The invention belongs to the field of communication technology, and in particular relates to an audio mixing method, apparatus, storage medium, and terminal device for user-generated content.
Background art
User-generated content (UGC, User Generated Content) arose with the Web2.0 concept, whose main characteristic is the promotion of personalization. UGC is not a specific business but a new way for users to use the internet: shifting from download-centric use to equal emphasis on downloading and uploading. Websites such as YouTube can be regarded as UGC success stories; social networks, video sharing, blogs, and vlogs are all major application forms of UGC. As mobile phones have gradually become more capable, users can produce pictures and video anywhere and at any time, recording their moods and experiences on their phones, and sharing that content with others is becoming a trend. However, the quality of user-produced content is uneven. A user who wants to create high-quality works, attract a high-quality audience, expand the influence of the works, and improve click-through rates needs a series of post-production steps for content editing and sound editing.
After making UGC, a user who wants to expand its influence and reach needs a series of post-production steps, mainly including: face beautification, editing of the video content, subtitle processing, and audio post-processing. At present, all of this tedious post-production is done manually. The audio part in particular requires fine manual work to make UGC content personalized and make it stand out, for example matching different personalized voices and different effects to different UGC content and scenes. All of this requires a person to watch the UGC content repeatedly and mark content-change times precisely, and making manually spliced audio sound natural and fluent is itself a considerable challenge.
Summary of the invention
The present invention provides an audio mixing method, apparatus, storage medium, and terminal device for user-generated content that can automatically match background music to video content.
An embodiment of the present invention provides an audio mixing method for user-generated content, comprising the steps of:
Obtaining video information from the user-generated content;
Extracting video frame information from the video information;
Identifying a target object from the video frame information;
Superimposing corresponding audio information according to the attributes of the target object.
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying scene information from the video frame information;
If the scene information identified from adjacent video frames is the same, merging the adjacent video frames;
Superimposing corresponding background music according to the attributes of the scene information.
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying person information or object information from the video frame information;
Adjusting voice features according to the attributes of the person information or object information, so as to produce converted audio for the person or object.
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying action information from the video frame information;
Superimposing action background music according to the attributes of the action information.
Further, identifying a target object from the video frame information includes:
Identifying the target object from the video frame information using a deep learning method.
An embodiment of the present invention also provides an audio mixing apparatus for user-generated content, including:
An acquiring unit for obtaining video information from the user-generated content;
An extraction unit for extracting video frame information from the video information;
A recognition unit for identifying a target object from the video frame information;
A superposition unit for superimposing corresponding audio information according to the attributes of the target object.
Further, the recognition unit includes:
A scene recognition subunit for identifying scene information from the video frame information;
A merging subunit for merging adjacent video frames if the scene information identified from them is the same;
The superposition unit is further configured to superimpose corresponding background music according to the attributes of the scene information.
Further, the recognition unit includes:
An object recognition subunit for identifying person information or object information from the video frame information;
The superposition unit is further configured to adjust voice features according to the attributes of the person information or object information, so as to produce converted audio for the person or object.
Further, the recognition unit includes:
An action recognition subunit for identifying action information from the video frame information;
The superposition unit is further configured to superimpose action background music according to the attributes of the action information.
Further, the recognition unit is additionally configured to identify the target object from the video frame information using a deep learning method.
An embodiment of the present invention also provides a storage medium on which a computer program is stored; when the computer program runs on a computer, the computer performs the audio mixing method for user-generated content described in any of the above embodiments.
An embodiment of the present invention also provides a terminal device including a processor and a memory; the memory stores a computer program, and the processor, by calling the computer program, performs the audio mixing method for user-generated content described in any of the above embodiments.
The audio mixing method, apparatus, storage medium, and terminal device for user-generated content provided by the embodiments of the present invention obtain video information from the user-generated content, extract video frame information from the video information, identify a target object from the video frame information, and superimpose corresponding audio information according to the attributes of the target object. Sound effects are applied automatically based on the video content of the user-generated content, and matching background music is added, removing the reliance on manual processing.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the audio mixing method for user-generated content provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the evolution of classic CNN convolutional neural networks provided by an embodiment of the present invention;
Fig. 3 is another flowchart of the audio mixing method for user-generated content provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of YOLO object detection provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the C3D network structure provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention;
Fig. 7 is another schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention;
Fig. 8 is a further schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention;
Fig. 9 is yet another schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention;
Fig. 10 is still another schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of the residual learning module of ResNet provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.
As shown in Fig. 1, Fig. 1 is a flowchart of the audio mixing method for user-generated content provided by an embodiment of the present invention. In this embodiment, an audio mixing method for user-generated content includes the steps:
Step S101: obtain video information from the user-generated content;
Step S102: extract video frame information from the video information;
Step S103: identify a target object from the video frame information;
Step S104: superimpose corresponding audio information according to the attributes of the target object.
The audio mixing method for user-generated content provided by this embodiment obtains video information from the user-generated content, extracts video frame information from the video information, identifies a target object from the video frame information, and superimposes corresponding audio information according to the attributes of the target object. Sound effects are applied automatically based on the video content of the user-generated content, and matching background music is added, removing the reliance on manual processing. The original video is input and, after passing through the intelligent mixing system, sound effects are applied automatically according to the scenes in the UGC content, and corresponding background music is matched automatically based on scene and action recognition.
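The four steps above amount to a recognize-then-match loop over extracted frames. A minimal sketch, assuming frames have already been extracted; `mix_ugc_audio`, `recognize`, and `audio_for` are hypothetical stand-ins for the recognition models and the music library described later, not names from the patent:

```python
def mix_ugc_audio(frames, recognize, audio_for):
    """Skeleton of steps S103-S104: identify a target object per frame and
    pick the audio matching its attributes. Both callables are stand-ins."""
    track = []
    for frame in frames:
        target = recognize(frame)        # S103: identify the target object
        track.append(audio_for(target))  # S104: audio by target attribute
    return track

# Toy run with dictionary-backed stand-ins:
frames = ["f1", "f2"]
scene_of = {"f1": "sea", "f2": "bar"}.get
music_of = {"sea": "waves.mp3", "bar": "jazz.mp3"}.get
print(mix_ugc_audio(frames, scene_of, music_of))  # → ['waves.mp3', 'jazz.mp3']
```

In practice `recognize` would be one of the scene, object, or action networks discussed below, and `audio_for` a lookup into a background-music library.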
Extracting the video frame information from the video information is done by using FFmpeg to extract each frame image from the video.
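A hedged sketch of that extraction step: the helper below only builds the FFmpeg command line (the `ffmpeg_extract_frames_cmd` name and the optional `fps` sampling are illustrative, not from the patent); `ffmpeg -i input out%06d.png` dumping numbered frame images is standard FFmpeg usage.

```python
def ffmpeg_extract_frames_cmd(video_path, out_pattern, fps=None):
    """Build an ffmpeg command that dumps video frames as image files.

    `out_pattern` is a printf-style template such as "frames/%06d.png".
    If `fps` is given, frames are sampled at that rate via the fps filter
    instead of extracting every frame.
    """
    cmd = ["ffmpeg", "-i", video_path]
    if fps is not None:
        cmd += ["-vf", f"fps={fps}"]
    cmd += [out_pattern]
    return cmd

# The command can then be executed with subprocess.run(cmd, check=True).
print(ffmpeg_extract_frames_cmd("ugc.mp4", "frames/%06d.png", fps=5))
```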
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying scene information from the video frame information;
If the scene information identified from adjacent video frames is the same, merging the adjacent video frames;
Superimposing corresponding background music according to the attributes of the scene information.
Specifically, the target object in this embodiment is scene information. First, scene information is obtained by frame-level scene recognition: using a deep learning method, each picture's scene is classified; for example, a deep convolutional neural network such as Inception V3 or ResNet-152 can perform the scene classification, and the Places365 scene database can be used for training. Then scene segmentation is performed: the scene recognition classes of consecutive frames are compared, consecutive frames of the same scene are merged, and a segmentation of the video by scene is obtained. Finally, background music corresponding to the scene is played, for example for the sea, a bar, a dance hall, a castle, a field, an island, a skating rink, and so on; the original sound can also be transformed according to the scene, for example generating an echo effect in a valley or a reverberation effect in a concert hall.
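The scene-segmentation step, merging consecutive frames that received the same scene class, can be sketched as a run-length merge over per-frame labels; `merge_scene_segments` is an illustrative name, not from the patent:

```python
def merge_scene_segments(frame_labels):
    """Collapse per-frame scene labels into (label, start, end) segments.

    Adjacent frames with the same predicted scene are merged, yielding the
    segmentation of the video by scene; `end` is exclusive.
    """
    segments = []
    for i, label in enumerate(frame_labels):
        if segments and segments[-1][0] == label:
            # Same scene as the previous frame: extend the open segment.
            segments[-1] = (label, segments[-1][1], i + 1)
        else:
            segments.append((label, i, i + 1))
    return segments

labels = ["sea", "sea", "sea", "bar", "bar", "sea"]
print(merge_scene_segments(labels))
# → [('sea', 0, 3), ('bar', 3, 5), ('sea', 5, 6)]
```

Each resulting segment can then be assigned one piece of background music.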
Several classic convolutional neural networks from machine learning are AlexNet, VGGNet, Google InceptionNet, and ResNet, all landmarks of deep learning and neural networks. Significant performance gains have almost always been accompanied by deeper networks; ResNet uses as many as 152 hidden layers. As shown in Fig. 2, Fig. 2 is a schematic diagram of the evolution of classic CNN convolutional neural networks provided by an embodiment of the present invention.
ResNet introduces a residual learning framework that eases the training of networks substantially deeper than those used before. It explicitly reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. On the ImageNet dataset, residual networks with 152 layers (variants exceed 1000 layers), 8 times deeper than VGG networks, were evaluated while still having lower complexity. The residual learning module of ResNet is shown in Fig. 11.
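A toy numpy illustration of the residual unit y = relu(F(x) + x): with the residual branch zeroed out, the identity shortcut alone carries the input through, which is why very deep stacks of such blocks remain trainable. The weights and shapes here are arbitrary, chosen only to make the behavior visible.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Minimal residual learning unit: y = relu(F(x) + x).

    F is two linear transforms with a ReLU between them; the identity
    shortcut lets the block learn a residual F(x) = H(x) - x instead of
    the full unreferenced mapping H(x).
    """
    relu = lambda z: np.maximum(z, 0.0)
    fx = relu(x @ w1) @ w2   # the residual branch F(x)
    return relu(fx + x)      # shortcut connection adds the input back

x = np.ones(4)
w1 = np.zeros((4, 4))
w2 = np.zeros((4, 4))
# With zero weights F(x) = 0, so the block reduces to the identity (after ReLU):
print(residual_block(x, w1, w2).tolist())  # → [1.0, 1.0, 1.0, 1.0]
```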
Starting from the original sound, adding a time-delayed copy back onto the original produces an echo effect, while adding a copy of the original that has undergone a certain convolution and delay produces a reverberation effect. The specific strength of the echo and reverberation is then tuned by continually adjusting the convolution kernel and the time delay.
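Both effects can be sketched directly in numpy: echo as delay-and-add, reverberation as convolution with an impulse response. `add_echo` and `add_reverb` are illustrative names; a real reverb would convolve with a measured room impulse response rather than this three-tap toy kernel.

```python
import numpy as np

def add_echo(signal, delay_samples, decay=0.5):
    """Delay-and-add echo: out[n] = x[n] + decay * x[n - delay]."""
    out = np.copy(signal).astype(float)
    out[delay_samples:] += decay * signal[:-delay_samples]
    return out

def add_reverb(signal, impulse_response):
    """Reverb as convolution of the dry signal with an impulse response."""
    return np.convolve(signal, impulse_response)[: len(signal)]

x = np.array([1.0, 0.0, 0.0, 0.0])  # unit impulse makes the effect visible
print(add_echo(x, delay_samples=2, decay=0.5).tolist())
# → [1.0, 0.0, 0.5, 0.0]
print(add_reverb(x, np.array([1.0, 0.0, 0.25])).tolist())
# → [1.0, 0.0, 0.25, 0.0]
```

Varying `delay_samples`, `decay`, and the impulse response corresponds to the kernel-and-delay tuning described above.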
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying person information or object information from the video frame information;
Adjusting voice features according to the attributes of the person information or object information, so as to produce converted audio for the person or object.
Specifically, the target object in this embodiment is person information or object information. First, key object or person detection is performed: using a deep learning method, the picture is checked for certain key or characteristic persons or objects, such as a Transformer, a little girl, a middle-aged man, or certain cartoon characters. A neural-network-based object detection model such as YOLO (You Only Look Once) can be used for detection. Then voice conversion is performed: according to the detected key object or person, different voice features are adjusted, such as the fundamental frequency F0, duration, pitch, and mel-frequency cepstral coefficients (MFCC), so as to achieve the effect of converting the sound into that of the key object or person.
For voice conversion based on the detected key object or person, the voice features are adjusted as follows: in general, a girl's fundamental frequency is higher than a boy's, and a robot's is higher than a girl's; robot and young-girl voices typically have slightly shorter durations. By adjusting several such sound features, the desired converted audio is obtained.
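As one hedged illustration of adjusting a single feature, F0 can be raised by resampling, which also shortens the duration, matching the "slightly shorter" note above. Real voice conversion would manipulate F0, duration, and MFCC through a vocoder rather than this crude resampling; `shift_pitch` is an illustrative name.

```python
import numpy as np

def shift_pitch(signal, semitones):
    """Crude pitch shift by resampling: playing a signal faster raises its
    fundamental frequency F0 and proportionally shortens its duration."""
    factor = 2.0 ** (semitones / 12.0)   # frequency ratio per semitone
    n_out = int(len(signal) / factor)
    idx = np.arange(n_out) * factor      # fractional read positions
    return np.interp(idx, np.arange(len(signal)), signal)

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)     # one second of a 220 Hz tone
up = shift_pitch(tone, 12)               # one octave up, toward ~440 Hz
print(len(tone), len(up))                # → 8000 4000 (half the duration)
```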
As shown in Fig. 4, Fig. 4 is a schematic diagram of YOLO object detection provided by an embodiment of the present invention. Object detection frameworks have evolved from R-CNN, SPP-net, Fast R-CNN, and Faster R-CNN to the YOLO algorithm, with both accuracy and speed steadily improving. YOLO is a convolutional neural network that predicts multiple box positions and classes in a single pass, achieving end-to-end object detection and recognition; its greatest advantage is speed. In essence it treats object detection as regression, so a single CNN performing regression suffices, without a complicated pipeline. YOLO trains on whole images rather than on sliding windows or extracted region proposals, which lets it distinguish targets from background better; by contrast, Fast R-CNN with proposal-based training often mis-detects background regions as specific objects. Of course, YOLO sacrifices some precision in exchange for its detection speed.
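One component of every such detector that is easy to show concretely is non-maximum suppression, which keeps the highest-scoring box among overlapping candidates. This numpy sketch is generic NMS, not YOLO's exact post-processing:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedily keep the highest-scoring box, dropping overlapping ones."""
    order = np.argsort(scores)[::-1]     # indices by descending score
    keep = []
    while len(order):
        best = order[0]
        keep.append(int(best))
        # Retain only candidates that do not overlap the kept box too much.
        order = order[1:][[iou(boxes[best], boxes[i]) < iou_threshold
                           for i in order[1:]]]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(non_max_suppression(boxes, scores))  # → [0, 2]
```

Box 1 heavily overlaps box 0 and scores lower, so it is suppressed; box 2 is disjoint and survives.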
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying action information from the video frame information;
Superimposing action background music according to the attributes of the action information.
As shown in Fig. 5, Fig. 5 is a schematic diagram of the C3D network structure provided by an embodiment of the present invention. Specifically, the target object in this embodiment is action information. The action information can be obtained with a deep learning method: the video is fed into a recognition network, which outputs a specific video action class. The action recognition network can use the C3D model (3D convolutional networks), and the UCF-101 action recognition database can be used for training. Background music corresponding to the action information is then played, for example adding background effects for punches, impacts, and so on.
For video action recognition, the input is three-dimensional video rather than the two-dimensional images considered so far, so traditional CNNs are unsuitable and three-dimensional convolutional neural networks are needed. 2D convolution applies to single-channel and multi-channel images (where a multi-channel image may mean the 3 color channels of one picture, or several pictures stacked together, i.e., a short clip of video); for one filter, the output is a two-dimensional feature map, and the channel information is fully compressed. The output of a 3D convolution, by contrast, remains a 3D feature map.
For example, for a video segment of size c*l*h*w, where c is the number of image channels (usually 3), l is the length of the video sequence, and h and w are the height and width of the video, a 3D convolution with a kernel size of 3*3*3, stride 1, "same" padding, and K filters yields an output of dimension K*l*h*w.
As shown in Fig. 5, the figure depicts the C3D network structure for activity recognition, based on 3D convolution operations. The C3D network has 8 convolution operations and 5 pooling operations. The convolution kernels are of size 3*3*3 with stride 1*1*1. The pooling kernels are 2*2*2 with stride 2*2*2, except for the first pooling layer, whose kernel size and stride are 1*2*2, so as not to reduce the temporal length too early. After two fully connected layers and a softmax layer, the network produces its final output. The network's input size is 3*16*112*112, i.e., 16 frames are input at a time.
As shown in Fig. 3, Fig. 3 is another flowchart of the audio mixing method for user-generated content provided by an embodiment of the present invention. In this embodiment, identifying a target object from the video frame information and superimposing corresponding audio information according to its attributes includes:
Identifying scene information from the video frame information, identifying person information or object information from the video frame information, and identifying action information from the video frame information;
If the scene information identified from adjacent video frames is the same, merging the adjacent video frames;
Superimposing corresponding background music according to the attributes of the scene information, adjusting voice features according to the attributes of the person information or object information so as to produce converted audio for the person or object, and superimposing action background music according to the attributes of the action information.
As shown in Fig. 6, Fig. 6 is a schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention. This embodiment provides an audio mixing apparatus 200 for user-generated content, including an acquiring unit 201, an extraction unit 202, a recognition unit 203, and a superposition unit 204. Wherein:
The acquiring unit 201 obtains video information from the user-generated content;
The extraction unit 202 extracts video frame information from the video information;
The recognition unit 203 identifies a target object from the video frame information;
The superposition unit 204 superimposes corresponding audio information according to the attributes of the target object.
The audio mixing method for user-generated content provided by this embodiment obtains video information from the user-generated content, extracts video frame information from the video information, identifies a target object from the video frame information, and superimposes corresponding audio information according to the attributes of the target object. Sound effects are applied automatically based on the video content of the user-generated content, and matching background music is added, removing the reliance on manual processing. The original video is input and, after passing through the intelligent mixing system, sound effects are applied automatically according to the scenes in the UGC content, and corresponding background music is matched automatically based on scene and action recognition.
As shown in Fig. 7, Fig. 7 is another schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention. The recognition unit 203 includes:
A scene recognition subunit 2031 for identifying scene information from the video frame information;
A merging subunit 2032 for merging adjacent video frames if the scene information identified from them is the same;
The superposition unit 204 is further configured to superimpose corresponding background music according to the attributes of the scene information.
Specifically, the target object in this embodiment is scene information. First, scene information is obtained by frame-level scene recognition: using a deep learning method, each picture's scene is classified; for example, a deep convolutional neural network such as Inception V3 or ResNet-152 can perform the scene classification, and the Places365 scene database can be used for training. Then scene segmentation is performed: the scene recognition classes of consecutive frames are compared, consecutive frames of the same scene are merged, and a segmentation of the video by scene is obtained. Finally, background music corresponding to the scene is played, for example for the sea, a bar, a dance hall, a castle, a field, an island, a skating rink, and so on; the original sound can also be transformed according to the scene, for example generating an echo effect in a valley or a reverberation effect in a concert hall.
Starting from the original sound, adding a time-delayed copy back onto the original produces an echo effect, while adding a copy of the original that has undergone a certain convolution and delay produces a reverberation effect. The specific strength of the echo and reverberation is then tuned by continually adjusting the convolution kernel and the time delay.
As shown in Fig. 8, Fig. 8 is a further schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention. The recognition unit 203 includes:
An object recognition subunit 2033 for identifying person information or object information from the video frame information;
The superposition unit 204 is further configured to adjust voice features according to the attributes of the person information or object information, so as to produce converted audio for the person or object.
Specifically, the target object in this embodiment is person information or object information. First, key object or person detection is performed: using a deep learning method, the picture is checked for certain key or characteristic persons or objects, such as a Transformer, a little girl, a middle-aged man, or certain cartoon characters. A neural-network-based object detection model such as YOLO (You Only Look Once) can be used for detection. Then voice conversion is performed: according to the detected key object or person, different voice features are adjusted, such as the fundamental frequency F0, duration, pitch, and mel-frequency cepstral coefficients (MFCC), so as to achieve the effect of converting the sound into that of the key object or person.
For voice conversion based on the detected key object or person, the voice features are adjusted as follows: in general, a girl's fundamental frequency is higher than a boy's, and a robot's is higher than a girl's; robot and young-girl voices typically have slightly shorter durations. By adjusting several such sound features, the desired converted audio is obtained.
As shown in Fig. 9, Fig. 9 is yet another schematic structural diagram of the audio mixing apparatus for user-generated content provided by an embodiment of the present invention. The recognition unit 203 includes:
An action recognition subunit 2034 for identifying action information from the video frame information;
The superposition unit 204 is further configured to superimpose action background music according to the attributes of the action information.
Specifically, the target object in this embodiment is action information. The action information can be obtained with a deep learning method: the video is fed into a recognition network, which outputs a specific video action class. The action recognition network can use the C3D model (3D convolutional networks), and the UCF-101 action recognition database can be used for training. Background music corresponding to the action information is then played, for example adding background effects for punches, impacts, and so on.
As shown in Fig. 10, Fig. 10 is yet another schematic structural diagram of the sound mixing device for user's original content provided in an embodiment of the present invention. The recognition unit 203 includes:
a scene recognition subunit 2031, configured to identify and obtain scene information from the video frame information;
an object recognition subunit 2033, configured to identify and obtain person information or object information from the video frame information;
an action recognition subunit 2034, configured to identify and obtain action information from the video frame information; and
a merging subunit 2032, configured to merge adjacent video frame information when the scene information identified and obtained from the adjacent video frame information is identical.
The superposition unit 204 is further configured to superimpose corresponding background music according to an attribute of the scene information; to adjust voice features according to an attribute of the person information or the object information, so that the voice of the person information or the object information is converted into the voice-changed audio; and to superimpose action background music according to an attribute of the action information.
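The merging subunit's behavior, collapsing runs of adjacent frames whose recognized scene is identical into one segment, can be sketched as follows (the scene labels are illustrative, not from the patent):

```python
from itertools import groupby

def merge_adjacent_scenes(frame_scenes):
    """Collapse consecutive frames sharing a scene label into
    (scene, start_index, end_index) segments."""
    segments, start = [], 0
    for scene, run in groupby(frame_scenes):   # groups consecutive equal labels
        length = len(list(run))
        segments.append((scene, start, start + length - 1))
        start += length
    return segments

print(merge_adjacent_scenes(["beach", "beach", "forest", "forest", "forest"]))
# [('beach', 0, 1), ('forest', 2, 4)]
```

One background track can then be superimposed per merged segment instead of per frame, avoiding a music change on every frame of an unchanged scene.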
In specific implementation, the above modules may be implemented as independent entities, or may be arbitrarily combined and implemented as one or several entities.
All of the above technical solutions may be arbitrarily combined to form alternative embodiments of the present invention, which are not described again here.
In the embodiments of the present invention, the sound mixing device for user's original content and the sound mixing method for user's original content in the foregoing embodiments belong to the same concept. Any method provided in the embodiments of the sound mixing method for user's original content may be run on the sound mixing device; for the specific implementation process, refer to the embodiments of the sound mixing method for user's original content, which are not described again here.
An embodiment of the present invention further provides a terminal device including a processor and a memory. The memory stores a computer program, and the processor is configured to call the computer program to perform the sound mixing method for user's original content according to any one of the above.
The terminal device may be a smartphone, a tablet computer, a desktop computer, a notebook computer, a handheld computer, or a similar apparatus.
An embodiment of the present invention further provides a storage medium storing a computer program. When the computer program is run on a computer, the computer is caused to perform the sound mixing method for user's original content in any of the above embodiments, for example: obtaining video information in the user's original content; extracting video frame information from the video information; identifying a target object from the video frame information; and superimposing corresponding audio information according to an attribute of the target object.
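The four steps recited above can be sketched end to end. Every helper here (frame extraction, target recognition, audio selection) is a caller-supplied stand-in stub, not the claimed implementation:

```python
def mix_ugc(video_info, extract_frames, identify_target, pick_audio):
    """Run the four-step pipeline: extract frames, identify a target in each,
    and pair frames that contain a target with the selected audio."""
    mixed = []
    for frame in extract_frames(video_info):
        target = identify_target(frame)      # may be None if nothing is found
        if target is not None:
            mixed.append((frame, pick_audio(target)))
    return mixed

# Stub components for illustration only.
frames = lambda v: ["f0", "f1", "f2"]
target = lambda f: "punch" if f == "f1" else None
audio = lambda t: f"{t}.wav"
print(mix_ugc("video.mp4", frames, target, audio))
# [('f1', 'punch.wav')]
```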
In the embodiments of the present invention, the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
It should be noted that, for the sound mixing method for user's original content of the embodiments of the present invention, a person of ordinary skill in the art can understand that all or part of the flow of the sound mixing method described in the embodiments of the present invention may be completed by a computer program controlling related hardware. The computer program may be stored in a computer-readable storage medium, for example in a memory of an electronic device, and executed by at least one processor in the electronic device, and the execution process may include the flow of the embodiments of the sound mixing method for user's original content. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
For the sound mixing device for user's original content of the embodiments of the present invention, the functional modules may be integrated into one processing chip, each module may exist physically alone, or two or more modules may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. When the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
The sound mixing method, device, storage medium, and electronic device for user's original content provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, a person skilled in the art may make changes to the specific implementations and application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (12)
1. A sound mixing method for user's original content, comprising the steps of:
obtaining video information in the user's original content;
extracting video frame information from the video information;
identifying a target object from the video frame information; and
superimposing corresponding audio information according to an attribute of the target object.
2. The sound mixing method for user's original content according to claim 1, wherein identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object comprises:
identifying and obtaining scene information from the video frame information;
merging adjacent video frame information when the scene information identified and obtained from the adjacent video frame information is identical; and
superimposing corresponding background music according to an attribute of the scene information.
3. The sound mixing method for user's original content according to claim 1, wherein identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object comprises:
identifying and obtaining person information or object information from the video frame information; and
adjusting voice features according to an attribute of the person information or the object information, so that the voice of the person information or the object information is converted into the voice-changed audio.
4. The sound mixing method for user's original content according to claim 1, wherein identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object comprises:
identifying and obtaining action information from the video frame information; and
superimposing action background music according to an attribute of the action information.
5. The sound mixing method for user's original content according to claim 1, wherein identifying a target object from the video frame information comprises:
identifying the target object from the video frame information by using a deep learning method.
6. A sound mixing device for user's original content, comprising:
an acquiring unit, configured to obtain video information in the user's original content;
an extraction unit, configured to extract video frame information from the video information;
a recognition unit, configured to identify a target object from the video frame information; and
a superposition unit, configured to superimpose corresponding audio information according to an attribute of the target object.
7. The sound mixing device for user's original content according to claim 6, wherein the recognition unit comprises:
a scene recognition subunit, configured to identify and obtain scene information from the video frame information; and
a merging subunit, configured to merge adjacent video frame information when the scene information identified and obtained from the adjacent video frame information is identical;
wherein the superposition unit is further configured to superimpose corresponding background music according to an attribute of the scene information.
8. The sound mixing device for user's original content according to claim 6, wherein the recognition unit comprises:
an object recognition subunit, configured to identify and obtain person information or object information from the video frame information;
wherein the superposition unit is further configured to adjust voice features according to an attribute of the person information or the object information, so that the voice of the person information or the object information is converted into the voice-changed audio.
9. The sound mixing device for user's original content according to claim 6, wherein the recognition unit comprises:
an action recognition subunit, configured to identify and obtain action information from the video frame information;
wherein the superposition unit is further configured to superimpose action background music according to an attribute of the action information.
10. The sound mixing device for user's original content according to claim 6, wherein the recognition unit is further configured to identify the target object from the video frame information by using a deep learning method.
11. A storage medium storing a computer program, wherein, when the computer program is run on a computer, the computer is caused to perform the sound mixing method for user's original content according to any one of claims 1 to 5.
12. A terminal device comprising a processor and a memory, wherein the memory stores a computer program, and the processor is configured to call the computer program to perform the sound mixing method for user's original content according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710952671.1A CN107888843A (en) | 2017-10-13 | 2017-10-13 | Sound mixing method, device, storage medium and the terminal device of user's original content |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107888843A true CN107888843A (en) | 2018-04-06 |
Family
ID=61781613
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710952671.1A Pending CN107888843A (en) | 2017-10-13 | 2017-10-13 | Sound mixing method, device, storage medium and the terminal device of user's original content |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107888843A (en) |
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108805036A (en) * | 2018-05-22 | 2018-11-13 | 电子科技大学 | A kind of new non-supervisory video semanteme extracting method |
CN109119089A (en) * | 2018-06-05 | 2019-01-01 | 安克创新科技股份有限公司 | The method and apparatus of penetrating processing is carried out to music |
CN109286841A (en) * | 2018-10-17 | 2019-01-29 | Oppo广东移动通信有限公司 | Film sound effect treatment method and Related product |
CN109309776A (en) * | 2018-08-13 | 2019-02-05 | 张利军 | Piece caudal flexure based on dynamic degree selects system |
CN109587552A (en) * | 2018-11-26 | 2019-04-05 | Oppo广东移动通信有限公司 | Video personage sound effect treatment method, device, mobile terminal and storage medium |
CN109618076A (en) * | 2018-08-07 | 2019-04-12 | 吴秋琴 | The adaptive method for down loading of singer's music |
CN109640166A (en) * | 2018-08-13 | 2019-04-16 | 张利军 | Piece caudal flexure selection method based on dynamic degree |
CN110163050A (en) * | 2018-07-23 | 2019-08-23 | 腾讯科技(深圳)有限公司 | A kind of method for processing video frequency and device, terminal device, server and storage medium |
CN110677716A (en) * | 2019-08-20 | 2020-01-10 | 咪咕音乐有限公司 | Audio processing method, electronic device, and storage medium |
CN110852375A (en) * | 2019-11-09 | 2020-02-28 | 北京工业大学 | End-to-end music score note identification method based on deep learning |
CN110858924A (en) * | 2018-08-22 | 2020-03-03 | 北京优酷科技有限公司 | Video background music generation method and device |
CN111031391A (en) * | 2019-12-19 | 2020-04-17 | 北京达佳互联信息技术有限公司 | Video dubbing method, device, server, terminal and storage medium |
CN111028920A (en) * | 2019-12-06 | 2020-04-17 | 杨保红 | Mental health decompression flow system platform |
WO2020087979A1 (en) * | 2018-10-30 | 2020-05-07 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating model |
CN111541936A (en) * | 2020-04-02 | 2020-08-14 | 腾讯科技(深圳)有限公司 | Video and image processing method and device, electronic equipment and storage medium |
CN111970579A (en) * | 2020-08-14 | 2020-11-20 | 苏州思萃人工智能研究所有限公司 | Video music adaptation method and system based on AI video understanding |
CN112040335A (en) * | 2020-08-14 | 2020-12-04 | 苏州思萃人工智能研究所有限公司 | Artificial intelligent sound effect creation and video adaptation method and system |
CN112633087A (en) * | 2020-12-09 | 2021-04-09 | 新奥特(北京)视频技术有限公司 | Automatic journaling method and device based on picture analysis for IBC system |
CN112690823A (en) * | 2020-12-22 | 2021-04-23 | 海南力维科贸有限公司 | Method and system for identifying physiological sounds of lungs |
CN113469321A (en) * | 2020-03-30 | 2021-10-01 | 聚晶半导体股份有限公司 | Object detection device and object detection method based on neural network |
US11495015B2 (en) | 2020-03-30 | 2022-11-08 | Altek Semiconductor Corp. | Object detection device and object detection method based on neural network |
WO2024067157A1 (en) * | 2022-09-29 | 2024-04-04 | 北京字跳网络技术有限公司 | Special-effect video generation method and apparatus, electronic device and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1073272A1 (en) * | 1999-02-15 | 2001-01-31 | Sony Corporation | Signal processing method and video/audio processing device |
CN102222227A (en) * | 2011-04-25 | 2011-10-19 | 中国华录集团有限公司 | Video identification based system for extracting film images |
CN103050124A (en) * | 2011-10-13 | 2013-04-17 | 华为终端有限公司 | Sound mixing method, device and system |
CN103795897A (en) * | 2014-01-21 | 2014-05-14 | 深圳市中兴移动通信有限公司 | Method and device for automatically generating background music |
CN106534618A (en) * | 2016-11-24 | 2017-03-22 | 广州爱九游信息技术有限公司 | Method, device and system for realizing pseudo field interpretation |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108805036A (en) * | 2018-05-22 | 2018-11-13 | 电子科技大学 | A kind of new non-supervisory video semanteme extracting method |
CN109119089A (en) * | 2018-06-05 | 2019-01-01 | 安克创新科技股份有限公司 | The method and apparatus of penetrating processing is carried out to music |
CN113450811B (en) * | 2018-06-05 | 2024-02-06 | 安克创新科技股份有限公司 | Method and equipment for performing transparent processing on music |
US11887615B2 (en) | 2018-06-05 | 2024-01-30 | Anker Innovations Technology Co., Ltd. | Method and device for transparent processing of music |
CN113450811A (en) * | 2018-06-05 | 2021-09-28 | 安克创新科技股份有限公司 | Method and equipment for performing transparent processing on music |
WO2019233359A1 (en) * | 2018-06-05 | 2019-12-12 | 安克创新科技股份有限公司 | Method and device for transparency processing of music |
CN110163050A (en) * | 2018-07-23 | 2019-08-23 | 腾讯科技(深圳)有限公司 | A kind of method for processing video frequency and device, terminal device, server and storage medium |
CN110163050B (en) * | 2018-07-23 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Video processing method and device, terminal equipment, server and storage medium |
CN109618076A (en) * | 2018-08-07 | 2019-04-12 | 吴秋琴 | The adaptive method for down loading of singer's music |
CN109309776B (en) * | 2018-08-13 | 2019-08-27 | 上海蒙彤文化传播有限公司 | Piece caudal flexure based on dynamic degree selects system |
CN109640166A (en) * | 2018-08-13 | 2019-04-16 | 张利军 | Piece caudal flexure selection method based on dynamic degree |
CN109309776A (en) * | 2018-08-13 | 2019-02-05 | 张利军 | Piece caudal flexure based on dynamic degree selects system |
CN110858924A (en) * | 2018-08-22 | 2020-03-03 | 北京优酷科技有限公司 | Video background music generation method and device |
CN110858924B (en) * | 2018-08-22 | 2021-11-26 | 阿里巴巴(中国)有限公司 | Video background music generation method and device and storage medium |
CN109286841B (en) * | 2018-10-17 | 2021-10-08 | Oppo广东移动通信有限公司 | Movie sound effect processing method and related product |
CN109286841A (en) * | 2018-10-17 | 2019-01-29 | Oppo广东移动通信有限公司 | Film sound effect treatment method and Related product |
WO2020087979A1 (en) * | 2018-10-30 | 2020-05-07 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating model |
CN109587552A (en) * | 2018-11-26 | 2019-04-05 | Oppo广东移动通信有限公司 | Video personage sound effect treatment method, device, mobile terminal and storage medium |
CN110677716A (en) * | 2019-08-20 | 2020-01-10 | 咪咕音乐有限公司 | Audio processing method, electronic device, and storage medium |
CN110852375A (en) * | 2019-11-09 | 2020-02-28 | 北京工业大学 | End-to-end music score note identification method based on deep learning |
CN111028920A (en) * | 2019-12-06 | 2020-04-17 | 杨保红 | Mental health decompression flow system platform |
CN111031391A (en) * | 2019-12-19 | 2020-04-17 | 北京达佳互联信息技术有限公司 | Video dubbing method, device, server, terminal and storage medium |
CN113469321A (en) * | 2020-03-30 | 2021-10-01 | 聚晶半导体股份有限公司 | Object detection device and object detection method based on neural network |
US11495015B2 (en) | 2020-03-30 | 2022-11-08 | Altek Semiconductor Corp. | Object detection device and object detection method based on neural network |
CN111541936A (en) * | 2020-04-02 | 2020-08-14 | 腾讯科技(深圳)有限公司 | Video and image processing method and device, electronic equipment and storage medium |
CN112040335A (en) * | 2020-08-14 | 2020-12-04 | 苏州思萃人工智能研究所有限公司 | Artificial intelligent sound effect creation and video adaptation method and system |
CN111970579A (en) * | 2020-08-14 | 2020-11-20 | 苏州思萃人工智能研究所有限公司 | Video music adaptation method and system based on AI video understanding |
CN112633087A (en) * | 2020-12-09 | 2021-04-09 | 新奥特(北京)视频技术有限公司 | Automatic journaling method and device based on picture analysis for IBC system |
CN112690823A (en) * | 2020-12-22 | 2021-04-23 | 海南力维科贸有限公司 | Method and system for identifying physiological sounds of lungs |
WO2024067157A1 (en) * | 2022-09-29 | 2024-04-04 | 北京字跳网络技术有限公司 | Special-effect video generation method and apparatus, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180406 |