CN107888843A - Sound mixing method, device, storage medium and terminal device for user-generated content

Sound mixing method, device, storage medium and terminal device for user-generated content

Info

Publication number
CN107888843A
CN107888843A (application CN201710952671.1A)
Authority
CN
China
Prior art keywords
information
user
original content
video frame
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710952671.1A
Other languages
Chinese (zh)
Inventor
罗斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xunlei Network Technology Co Ltd
Original Assignee
Shenzhen Xunlei Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xunlei Network Technology Co Ltd filed Critical Shenzhen Xunlei Network Technology Co Ltd
Priority to CN201710952671.1A
Publication of CN107888843A
Legal status: Pending (Current)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a sound mixing method, device, storage medium and terminal device for user-generated content. The method includes: obtaining video information from the user-generated content; extracting video frame information from the video information; identifying a target object from the video frame information; and superimposing corresponding audio information according to an attribute of the target object. Sound effects are thus applied automatically according to the video content of the user-generated content, and matching background music is added, eliminating the reliance on manual processing.

Description

Sound mixing method, device, storage medium and terminal device for user-generated content
Technical field
The invention belongs to the field of communication technology, and in particular relates to a sound mixing method, device, storage medium and terminal device for user-generated content.
Background technology
User-generated content (UGC) arose with the Web 2.0 concept, whose main characteristic is personalization. UGC is not a specific business but a new way for users to use the Internet: instead of mainly downloading content, users both download and upload it. Websites such as YouTube can be regarded as UGC success cases, and social networks, video sharing, blogs and vlogs (video blogs) are all main application forms of UGC. As mobile phones have grown more capable, users can shoot pictures and video anywhere at any time, recording their moods and experiences on the phone, and sharing these contents with others is becoming a trend. However, the quality of user-produced content is uneven. To create high-quality works that attract high-quality users, expand the influence of the works and improve the click-through rate, a series of post-production content editing and sound editing steps is needed.
After producing UGC, a user who wants to expand its reach and influence needs a series of post-production steps, mainly including facial beautification, editing of the video content, subtitle processing and sound post-processing. At present all of this tedious post-production is done manually, especially the sound processing. To personalize UGC content and make it stand out, the sound must be finely edited by hand, for example by matching different personalized voices and different special effects to different UGC contents and scenes. This requires someone to watch the UGC content repeatedly and mark the exact times at which the content changes, and making manually spliced sound transitions smooth and natural is itself a considerable challenge.
Summary of the invention
The present invention provides a sound mixing method, device, storage medium and terminal device for user-generated content, which can automatically match corresponding background music to the video content.
An embodiment of the present invention provides a sound mixing method for user-generated content, including the steps of:
obtaining video information from the user-generated content;
extracting video frame information from the video information;
identifying a target object from the video frame information;
superimposing corresponding audio information according to an attribute of the target object.
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object includes:
identifying scene information from the video frame information;
if the scene information identified from adjacent video frame information is identical, merging the adjacent video frame information;
superimposing corresponding background music according to an attribute of the scene information.
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object includes:
identifying person information or object information from the video frame information;
adjusting voice features according to an attribute of the person information or the object information, so as to convert the voice into audio corresponding to the person information or the object information.
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object includes:
identifying action information from the video frame information;
superimposing action background music according to an attribute of the action information.
Further, identifying a target object from the video frame information includes:
identifying the target object from the video frame information using a deep learning method.
An embodiment of the present invention also provides a sound mixing device for user-generated content, including:
an acquiring unit for obtaining video information from the user-generated content;
an extraction unit for extracting video frame information from the video information;
a recognition unit for identifying a target object from the video frame information;
a superposition unit for superimposing corresponding audio information according to an attribute of the target object.
Further, the recognition unit includes:
a scene recognition subunit for identifying scene information from the video frame information;
a merging subunit for merging adjacent video frame information if the scene information identified from the adjacent video frame information is identical;
and the superposition unit is further used to superimpose corresponding background music according to an attribute of the scene information.
Further, the recognition unit includes:
an object recognition subunit for identifying person information or object information from the video frame information;
and the superposition unit is further used to adjust voice features according to an attribute of the person information or the object information, so as to convert the voice into audio corresponding to the person information or the object information.
Further, the recognition unit includes:
an action recognition subunit for identifying action information from the video frame information;
and the superposition unit is further used to superimpose action background music according to an attribute of the action information.
Further, the recognition unit is also used to identify the target object from the video frame information using a deep learning method.
An embodiment of the present invention also provides a storage medium on which a computer program is stored; when the computer program runs on a computer, the computer performs the sound mixing method for user-generated content described in any one of the above.
An embodiment of the present invention also provides a terminal device including a processor and a memory; the memory stores a computer program, and the processor, by calling the computer program, performs the sound mixing method for user-generated content described in any one of the above.
The sound mixing method, device, storage medium and terminal device for user-generated content provided by the embodiments of the present invention obtain video information from the user-generated content, extract video frame information from the video information, identify a target object from the video frame information, and superimpose corresponding audio information according to an attribute of the target object. Sound effects are applied automatically according to the video content of the UGC, and matching background music is added, eliminating the reliance on manual processing.
Brief description of the drawings
To illustrate the technical schemes in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative work.
Fig. 1 is a flow chart of the sound mixing method for user-generated content provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of the evolution of classic CNN convolutional neural networks provided by an embodiment of the present invention;
Fig. 3 is another flow chart of the sound mixing method for user-generated content provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of YOLO object detection provided by an embodiment of the present invention;
Fig. 5 is a schematic diagram of the C3D network structure provided by an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of the sound mixing device for user-generated content provided by an embodiment of the present invention;
Fig. 7 is another structural schematic diagram of the sound mixing device for user-generated content provided by an embodiment of the present invention;
Fig. 8 is a further structural schematic diagram of the sound mixing device for user-generated content provided by an embodiment of the present invention;
Fig. 9 is yet another structural schematic diagram of the sound mixing device for user-generated content provided by an embodiment of the present invention;
Fig. 10 is still another structural schematic diagram of the sound mixing device for user-generated content provided by an embodiment of the present invention;
Fig. 11 is a structural schematic diagram of the residual learning module of ResNet provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical schemes in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings in the embodiments of the present invention.
As shown in Fig. 1, Fig. 1 is a flow chart of the sound mixing method for user-generated content provided by an embodiment of the present invention. In this embodiment, a sound mixing method for user-generated content includes the steps:
Step S101: obtain video information from the user-generated content;
Step S102: extract video frame information from the video information;
Step S103: identify a target object from the video frame information;
Step S104: superimpose corresponding audio information according to an attribute of the target object.
The sound mixing method for user-generated content provided by the embodiment of the present invention obtains video information from the user-generated content, extracts video frame information from the video information, identifies a target object from the video frame information, and superimposes corresponding audio information according to an attribute of the target object. Sound effects are applied automatically according to the video content of the UGC, and matching background music is added, eliminating the reliance on manual processing. The original video is input and, after passing through the intelligent mixing system, sound effects are applied automatically according to the scene of the UGC content, and corresponding background music is matched automatically according to scene and action recognition.
Here, extracting the video frame information from the video information means using FFmpeg to extract each frame image of the video.
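For illustration, a minimal sketch of this extraction step follows; it assumes the ffmpeg command-line tool is installed, and the file names and one-frame-per-second sampling rate are illustrative choices, not part of the claimed method.

```python
# Minimal sketch: dump video frames as JPEG images with the ffmpeg CLI.
import subprocess
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, fps: int = 1) -> list:
    """Extract frames from `video_path` into `out_dir` at `fps` frames per second."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path,                 # input video
         "-vf", f"fps={fps}",                        # frame sampling rate
         str(Path(out_dir) / "frame_%06d.jpg")],     # numbered output frames
        check=True,
    )
    return sorted(Path(out_dir).glob("frame_*.jpg"))

frames = extract_frames("ugc_video.mp4", "frames/")
```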
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object includes:
identifying scene information from the video frame information;
if the scene information identified from adjacent video frame information is identical, merging the adjacent video frame information;
superimposing corresponding background music according to an attribute of the scene information.
Specifically, the target object in this embodiment is scene information, and the scene information is obtained first. Frame-level scene recognition can be performed using a deep learning method that classifies the scene of each picture; for example, a deep convolutional neural network such as Inception V3 or ResNet-152 can perform the scene classification, and the scene database Places365 can be used for network training. Scene segmentation is then performed by merging the scene recognition classes of consecutive frames: consecutive frames belonging to the same scene are merged, finally yielding a video segmentation based on the different scenes. Finally, background music corresponding to the scene is played, for example for a sea, bar, dance hall, castle, field, island or skating rink; the original sound can also be transformed according to the scene, for example generating an echo effect in a valley or a reverberation effect in a concert hall.
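A hedged sketch of this frame-level scene recognition and merging step is shown below. It uses a torchvision ResNet-152 with ImageNet weights as a stand-in classifier; a network trained on Places365, as the text suggests, would be substituted in practice, and all file paths are illustrative.

```python
# Sketch: classify each frame's scene, then merge adjacent same-scene frames.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet152(weights=models.ResNet152_Weights.DEFAULT).eval()
preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

def classify_scene(frame_path) -> int:
    """Return the predicted scene class index for one frame."""
    x = preprocess(Image.open(frame_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return int(model(x).argmax(dim=1))

def segment_by_scene(frame_paths):
    """Merge consecutive frames with the same label into [label, start, end] segments."""
    segments = []
    for i, path in enumerate(frame_paths):
        label = classify_scene(path)
        if segments and segments[-1][0] == label:
            segments[-1][2] = i                 # extend the open segment
        else:
            segments.append([label, i, i])      # start a new segment
    return segments
```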
Some classic convolutional neural networks in machine learning are AlexNet, VGGNet, Google's InceptionNet and ResNet, all landmarks of deep learning and neural networks. Their significant performance improvements have almost always been accompanied by deeper convolutional neural networks; ResNet even uses 152 hidden layers. As shown in Fig. 2, Fig. 2 is a schematic diagram of the evolution of classic CNN convolutional neural networks provided by an embodiment of the present invention.
The ResNet network proposes a residual learning framework that eases the burden of training networks substantially deeper than those used previously. It explicitly lets the layers learn a residual function with reference to the layer input, rather than learning an unreferenced function. Residual networks with 152 layers (the number of layers can even exceed 1000), eight times deeper than VGG networks, were evaluated on the ImageNet data set, yet they still have lower complexity. The residual learning module of ResNet is shown in Fig. 11.
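As a concrete illustration of the residual learning module in Fig. 11, the following is a minimal sketch assuming the standard two-convolution basic block: the block learns a residual F(x) and outputs F(x) + x through a skip connection.

```python
# Sketch of a basic ResNet residual block: output = ReLU(F(x) + x).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)               # skip connection adds the input back

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))   # shape preserved: (1, 64, 56, 56)
```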
Adding a copy of the original sound delayed by a certain time back onto the original sound produces an echo effect, while adding the original sound back onto itself after a certain convolution and delay produces a reverberation effect. The specific strengths of the echo and the reverberation are then obtained by continually adjusting the convolution kernel and the delay time.
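The following sketch illustrates the two effects on a mono floating-point waveform; the delay, gain and toy impulse response are illustrative values that, as the text notes, would be tuned in practice.

```python
# Sketch: echo = delayed attenuated copy; reverb = convolution with an impulse response.
import numpy as np

def add_echo(signal: np.ndarray, sr: int, delay_s: float = 0.3, gain: float = 0.5) -> np.ndarray:
    """Mix a single delayed, attenuated copy of the signal back onto itself."""
    d = int(delay_s * sr)
    out = np.concatenate([signal, np.zeros(d)])
    out[d:] += gain * signal                     # delayed copy added to the original
    return out

def add_reverb(signal: np.ndarray, impulse_response: np.ndarray) -> np.ndarray:
    """Convolve the signal with an impulse response to produce dense reflections."""
    wet = np.convolve(signal, impulse_response)
    return wet / (np.max(np.abs(wet)) + 1e-9)    # normalize to avoid clipping

sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)                        # 1 s test tone
ir = np.exp(-np.linspace(0, 4, sr // 2)) * np.random.randn(sr // 2) * 0.1  # toy decaying IR
echoed, reverbed = add_echo(tone, sr), add_reverb(tone, ir)
```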
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object includes:
identifying person information or object information from the video frame information;
adjusting voice features according to an attribute of the person information or the object information, so as to convert the voice into audio corresponding to the person information or the object information.
Specifically, the target object in this embodiment is person information or object information. Key object or person detection is performed first: a deep learning method detects whether the picture contains certain key or characteristic persons or objects, such as a Transformer, a young girl, a middle-aged man, or certain cartoon characters. The neural-network-based object detection model YOLO (You Only Look Once) can be used for detection. Voice conversion is then performed: according to the detected key object or person, different voice features, such as the fundamental frequency F0, duration, pitch and Mel-frequency cepstral coefficients (MFCC), are adjusted so as to achieve the effect of converting the sound into that of the key object or person.
The voice conversion adjusts different voice features, such as the fundamental frequency F0, duration, pitch and MFCC, according to the detected key object or person. In general, the fundamental frequency of a girl's voice is higher than a boy's, the fundamental frequency of a robot voice is higher still, and robot and young-girl voices usually have slightly shorter durations. By adjusting these few sound features, the audio to be converted is obtained.
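As a hedged sketch of this adjustment, the snippet below raises the pitch and slightly shortens the duration of an input voice using librosa; the semitone and rate values, like the file names, are illustrative assumptions rather than values fixed by the patent.

```python
# Sketch: shift the fundamental frequency upward and shorten the duration.
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=None, mono=True)
y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=6)      # raise pitch ~6 semitones
y_converted = librosa.effects.time_stretch(y_shifted, rate=1.15)  # play ~15% faster (shorter)
sf.write("speech_converted.wav", y_converted, sr)
```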
As shown in Fig. 4, Fig. 4 is a schematic diagram of YOLO object detection provided by an embodiment of the present invention. Object detection frameworks have evolved from R-CNN, SPP-net, Fast R-CNN and Faster R-CNN to the YOLO algorithm, with the accuracy and speed of object detection steadily improving. YOLO is a convolutional neural network that predicts multiple box positions and classes at once, achieving end-to-end target detection and recognition; its greatest advantage is speed. In fact, the essence of this style of target detection is regression, so a CNN that implements the regression function needs no complicated design process. YOLO does not train the network with sliding windows or extracted region proposals but trains the model directly on the whole image; the advantage is that target and background regions are distinguished better, whereas Fast R-CNN, which uses proposal-based training, often misdetects background regions as specific targets. Of course, YOLO sacrifices some precision while improving detection speed.
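A sketch of running such a detector over the extracted frames is given below. The torch.hub YOLOv5 entry point stands in for the detector described in the text, and the set of wanted class names is an illustrative assumption.

```python
# Sketch: detect key persons/objects in the extracted frames with a YOLO model.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

def detect_key_objects(frame_paths, wanted=("person",)):
    """Return, per frame, the detected labels that belong to the wanted set."""
    results = model([str(p) for p in frame_paths])   # batched inference
    hits = []
    for det in results.pandas().xyxy:                # one detections DataFrame per frame
        hits.append(set(det["name"]) & set(wanted))
    return hits
```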
Further, identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object includes:
identifying action information from the video frame information;
superimposing action background music according to an attribute of the action information.
As shown in Fig. 5, Fig. 5 is a schematic diagram of the C3D network structure provided by an embodiment of the present invention. Specifically, the target object in this embodiment is action information, which can be obtained with a deep learning method: the video is input into a recognition network, and a specific video action class is obtained. The action recognition network can use the C3D model (3D convolutional networks), and the action recognition database UCF-101 can be used for network training. Background music corresponding to the action information is then played, for example adding background sound effects such as punches and impacts.
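The superposition of such an effect onto the original soundtrack can be sketched as follows; pydub serves as an illustrative mixing backend, and the file names and segment time are assumptions.

```python
# Sketch: overlay an action sound effect onto the original soundtrack.
from pydub import AudioSegment

original = AudioSegment.from_file("ugc_audio.wav")
punch_fx = AudioSegment.from_file("punch.wav") - 6      # attenuate the effect by 6 dB

# Overlay the effect at the start of the detected action segment (e.g. 12.5 s in).
mixed = original.overlay(punch_fx, position=12_500)     # position is in milliseconds
mixed.export("ugc_audio_mixed.wav", format="wav")
```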
For action recognition in video, the input is three-dimensional video rather than the two-dimensional images used previously, so traditional CNN networks are not suitable and a three-dimensional convolutional neural network is needed. 2D convolution is applied to single-channel and multi-channel images (a multi-channel image here can mean the three color channels of one picture, or several pictures stacked together, i.e. a short video clip); for one filter, the output is a two-dimensional feature map, and the channel information is completely compressed. The output of a 3D convolution, by contrast, remains a 3D feature map.
For example, consider a video-segment input of size c*l*h*w, where c is the number of image channels (usually 3), l is the length of the video sequence, and h and w are the height and width of the video. After a 3D convolution with kernel size 3*3*3, stride 1, padding=True and K filters, the output has dimensions K*l*h*w.
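This dimension bookkeeping can be checked quickly with PyTorch; the sizes below are illustrative, and padding=1 plays the role of the "padding=True" in the text.

```python
# Sketch: a 3x3x3 3D convolution with stride 1, padding 1 maps c*l*h*w to K*l*h*w.
import torch
import torch.nn as nn

c, l, h, w, K = 3, 16, 112, 112, 64
clip = torch.randn(1, c, l, h, w)                 # a batch of one video clip
conv3d = nn.Conv3d(c, K, kernel_size=3, stride=1, padding=1)
print(conv3d(clip).shape)                         # torch.Size([1, 64, 16, 112, 112])
```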
As shown in Fig. 5, the figure depicts the C3D network structure used for action recognition. Built on 3D convolution operations, the C3D network structure has 8 convolution operations and 5 pooling operations in total. The convolution kernels have size 3*3*3 with stride 1*1*1; the pooling kernels have size 2*2*2 with stride 2*2*2, except for the first pooling layer, whose size and stride are 1*2*2, so as not to shorten the temporal dimension too early. Finally, after two fully connected layers and a softmax layer, the network produces the final output. The input size of the network is 3*16*112*112, i.e. 16 frames are input at a time.
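A compact sketch of this architecture follows (8 convolutions, 5 poolings, 2 fully connected layers); the per-stage channel widths are taken from the published C3D design and are assumptions here, and the softmax is left to the loss or inference step, as is usual in PyTorch.

```python
# Sketch of the C3D structure: input 3*16*112*112, first pool keeps temporal length.
import torch
import torch.nn as nn

def c3d(num_classes: int = 101) -> nn.Sequential:    # 101 action classes for UCF-101
    def stage(cin, cout, n_conv, pool):
        layers = []
        for i in range(n_conv):
            layers += [nn.Conv3d(cin if i == 0 else cout, cout, 3, padding=1),
                       nn.ReLU(inplace=True)]
        return layers + [nn.MaxPool3d(pool, stride=pool)]

    return nn.Sequential(
        *stage(3, 64, 1, (1, 2, 2)),                  # pool1: 1*2*2, preserves time
        *stage(64, 128, 1, (2, 2, 2)),
        *stage(128, 256, 2, (2, 2, 2)),
        *stage(256, 512, 2, (2, 2, 2)),
        *stage(512, 512, 2, (2, 2, 2)),               # 8 convs, 5 pools in total
        nn.Flatten(),
        nn.Linear(512 * 1 * 3 * 3, 4096), nn.ReLU(inplace=True),
        nn.Linear(4096, num_classes),
    )

logits = c3d()(torch.randn(1, 3, 16, 112, 112))       # -> shape (1, 101)
```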
As shown in Fig. 3, Fig. 3 is another flow chart of the sound mixing method for user-generated content provided by an embodiment of the present invention. In this embodiment, identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object includes:
identifying scene information from the video frame information, identifying person information or object information from the video frame information, and identifying action information from the video frame information;
if the scene information identified from adjacent video frame information is identical, merging the adjacent video frame information;
superimposing corresponding background music according to an attribute of the scene information, adjusting voice features according to an attribute of the person information or the object information so as to convert the voice into audio corresponding to the person information or the object information, and superimposing action background music according to an attribute of the action information.
As shown in Fig. 6, Fig. 6 is a structural schematic diagram of the sound mixing device for user-generated content provided by an embodiment of the present invention. This embodiment provides a sound mixing device 200 for user-generated content, including an acquiring unit 201, an extraction unit 202, a recognition unit 203 and a superposition unit 204, wherein:
the acquiring unit 201 is used to obtain video information from the user-generated content;
the extraction unit 202 is used to extract video frame information from the video information;
the recognition unit 203 is used to identify a target object from the video frame information;
the superposition unit 204 is used to superimpose corresponding audio information according to an attribute of the target object.
The sound mixing method for user-generated content provided by the embodiment of the present invention obtains video information from the user-generated content, extracts video frame information from the video information, identifies a target object from the video frame information, and superimposes corresponding audio information according to an attribute of the target object. Sound effects are applied automatically according to the video content of the UGC, and matching background music is added, eliminating the reliance on manual processing. The original video is input and, after passing through the intelligent mixing system, sound effects are applied automatically according to the scene of the UGC content, and corresponding background music is matched automatically according to scene and action recognition.
As shown in Fig. 7, Fig. 7 is another structural schematic diagram of the sound mixing device for user-generated content provided by an embodiment of the present invention. The recognition unit 203 includes:
a scene recognition subunit 2031 for identifying scene information from the video frame information;
a merging subunit 2032 for merging adjacent video frame information if the scene information identified from the adjacent video frame information is identical;
and the superposition unit 204 is further used to superimpose corresponding background music according to an attribute of the scene information.
Specifically, the target object in this embodiment is scene information, and the scene information is obtained first. Frame-level scene recognition can be performed using a deep learning method that classifies the scene of each picture; for example, a deep convolutional neural network such as Inception V3 or ResNet-152 can perform the scene classification, and the scene database Places365 can be used for network training. Scene segmentation is then performed by merging the scene recognition classes of consecutive frames: consecutive frames belonging to the same scene are merged, finally yielding a video segmentation based on the different scenes. Finally, background music corresponding to the scene is played, for example for a sea, bar, dance hall, castle, field, island or skating rink; the original sound can also be transformed according to the scene, for example generating an echo effect in a valley or a reverberation effect in a concert hall.
Adding a copy of the original sound delayed by a certain time back onto the original sound produces an echo effect, while adding the original sound back onto itself after a certain convolution and delay produces a reverberation effect. The specific strengths of the echo and the reverberation are then obtained by continually adjusting the convolution kernel and the delay time.
As shown in Fig. 8, Fig. 8 is a further structural schematic diagram of the sound mixing device for user-generated content provided by an embodiment of the present invention. The recognition unit 203 includes:
an object recognition subunit 2033 for identifying person information or object information from the video frame information;
and the superposition unit 204 is further used to adjust voice features according to an attribute of the person information or the object information, so as to convert the voice into audio corresponding to the person information or the object information.
Specifically, the target object in this embodiment is person information or object information. Key object or person detection is performed first: a deep learning method detects whether the picture contains certain key or characteristic persons or objects, such as a Transformer, a young girl, a middle-aged man, or certain cartoon characters. The neural-network-based object detection model YOLO (You Only Look Once) can be used for detection. Voice conversion is then performed: according to the detected key object or person, different voice features, such as the fundamental frequency F0, duration, pitch and Mel-frequency cepstral coefficients (MFCC), are adjusted so as to achieve the effect of converting the sound into that of the key object or person.
The voice conversion adjusts different voice features, such as the fundamental frequency F0, duration, pitch and MFCC, according to the detected key object or person. In general, the fundamental frequency of a girl's voice is higher than a boy's, the fundamental frequency of a robot voice is higher still, and robot and young-girl voices usually have slightly shorter durations. By adjusting these few sound features, the audio to be converted is obtained.
As shown in Fig. 9, Fig. 9 is yet another structural schematic diagram of the sound mixing device for user-generated content provided by an embodiment of the present invention. The recognition unit 203 includes:
an action recognition subunit 2034 for identifying action information from the video frame information;
and the superposition unit 204 is further used to superimpose action background music according to an attribute of the action information.
Specifically, the target object in this embodiment is action information, which can be obtained with a deep learning method: the video is input into a recognition network, and a specific video action class is obtained. The action recognition network can use the C3D model (3D convolutional networks), and the action recognition database UCF-101 can be used for network training. Background music corresponding to the action information is then played, for example adding background sound effects such as punches and impacts.
As shown in Fig. 10, Fig. 10 is still another structural schematic diagram of the sound mixing device for user-generated content provided by an embodiment of the present invention. The recognition unit 203 includes:
a scene recognition subunit 2031 for identifying scene information from the video frame information;
an object recognition subunit 2033 for identifying person information or object information from the video frame information;
an action recognition subunit 2034 for identifying action information from the video frame information;
a merging subunit 2032 for merging adjacent video frame information if the scene information identified from the adjacent video frame information is identical;
and the superposition unit 204 is further used to superimpose corresponding background music according to an attribute of the scene information, to adjust voice features according to an attribute of the person information or the object information so as to convert the voice into audio corresponding to the person information or the object information, and to superimpose action background music according to an attribute of the action information.
In specific implementation, each of the above modules can be realized as an independent entity, or the modules can be combined arbitrarily and realized as one or several entities.
All of the above technical schemes can be combined arbitrarily to form alternative embodiments of the present invention, which are not repeated here one by one.
In the embodiments of the present invention, the sound mixing device for user-generated content and the sound mixing method for user-generated content in the foregoing embodiments belong to the same concept; any method provided in the method embodiments can be run on the sound mixing device, and its specific implementation process is described in detail in the method embodiments and not repeated here.
An embodiment of the present invention also provides a terminal device including a processor and a memory; the memory stores a computer program, and the processor, by calling the computer program, performs the sound mixing method for user-generated content described in any one of the above.
The terminal device may be a device such as a smartphone, tablet computer, desktop computer, notebook computer or palmtop computer.
An embodiment of the present invention also provides a storage medium storing a computer program; when the computer program runs on a computer, the computer performs the sound mixing method for user-generated content of any of the above embodiments, for example: obtaining video information from the user-generated content; extracting video frame information from the video information; identifying a target object from the video frame information; and superimposing corresponding audio information according to an attribute of the target object.
In the embodiments of the present invention, the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
It should be noted that, for the sound mixing method for user-generated content of the embodiments of the present invention, a person of ordinary skill in the art will understand that all or part of the flow of the method can be completed by a computer program controlling the relevant hardware. The computer program can be stored in a computer-readable storage medium, for example in the memory of an electronic device, and be executed by at least one processor in the electronic device; during execution, the flow of the embodiments of the sound mixing method can be included. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random access memory, or the like.
For the sound mixing device for user-generated content of the embodiments of the present invention, its functional modules may be integrated into one processing chip, or each module may exist physically alone, or two or more modules may be integrated into one module. The integrated module can be realized in the form of hardware or in the form of a software function module. If the integrated module is realized as a software function module and sold or used as an independent product, it can also be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk or an optical disc.
The sound mixing method, device, storage medium and electronic device for user-generated content provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementation of the present invention, and the explanation of the above embodiments is only intended to help in understanding the method and its core idea. Meanwhile, those skilled in the art will, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (12)

1. A sound mixing method for user-generated content, characterized by including the steps of:
obtaining video information from the user-generated content;
extracting video frame information from the video information;
identifying a target object from the video frame information;
superimposing corresponding audio information according to an attribute of the target object.
2. The sound mixing method for user-generated content according to claim 1, characterized in that identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object includes:
identifying scene information from the video frame information;
if the scene information identified from adjacent video frame information is identical, merging the adjacent video frame information;
superimposing corresponding background music according to an attribute of the scene information.
3. The sound mixing method for user-generated content according to claim 1, characterized in that identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object includes:
identifying person information or object information from the video frame information;
adjusting voice features according to an attribute of the person information or the object information, so as to convert the voice into audio corresponding to the person information or the object information.
4. The sound mixing method for user-generated content according to claim 1, characterized in that identifying a target object from the video frame information and superimposing corresponding audio information according to an attribute of the target object includes:
identifying action information from the video frame information;
superimposing action background music according to an attribute of the action information.
5. The sound mixing method for user-generated content according to claim 1, characterized in that identifying a target object from the video frame information includes:
identifying the target object from the video frame information using a deep learning method.
6. A sound mixing device for user-generated content, characterized by including:
an acquiring unit for obtaining video information from the user-generated content;
an extraction unit for extracting video frame information from the video information;
a recognition unit for identifying a target object from the video frame information;
a superposition unit for superimposing corresponding audio information according to an attribute of the target object.
7. The sound mixing device for user-generated content according to claim 6, characterized in that the recognition unit includes:
a scene recognition subunit for identifying scene information from the video frame information;
a merging subunit for merging adjacent video frame information if the scene information identified from the adjacent video frame information is identical;
and the superposition unit is further used to superimpose corresponding background music according to an attribute of the scene information.
8. The sound mixing device for user-generated content according to claim 6, characterized in that the recognition unit includes:
an object recognition subunit for identifying person information or object information from the video frame information;
and the superposition unit is further used to adjust voice features according to an attribute of the person information or the object information, so as to convert the voice into audio corresponding to the person information or the object information.
9. The sound mixing device for user-generated content according to claim 6, characterized in that the recognition unit includes:
an action recognition subunit for identifying action information from the video frame information;
and the superposition unit is further used to superimpose action background music according to an attribute of the action information.
10. The sound mixing device for user-generated content according to claim 6, characterized in that the recognition unit is further used to identify the target object from the video frame information using a deep learning method.
11. A storage medium on which a computer program is stored, characterized in that when the computer program runs on a computer, the computer performs the sound mixing method for user-generated content according to any one of claims 1 to 5.
12. A terminal device including a processor and a memory, the memory storing a computer program, characterized in that the processor, by calling the computer program, performs the sound mixing method for user-generated content according to any one of claims 1 to 5.
CN201710952671.1A 2017-10-13 2017-10-13 Sound mixing method, device, storage medium and terminal device for user-generated content Pending CN107888843A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710952671.1A CN107888843A (en) Sound mixing method, device, storage medium and terminal device for user-generated content

Publications (1)

Publication Number Publication Date
CN107888843A 2018-04-06

Family

ID=61781613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710952671.1A Pending CN107888843A (en) Sound mixing method, device, storage medium and terminal device for user-generated content

Country Status (1)

Country Link
CN (1) CN107888843A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1073272A1 (en) * 1999-02-15 2001-01-31 Sony Corporation Signal processing method and video/audio processing device
CN102222227A (en) * 2011-04-25 2011-10-19 中国华录集团有限公司 Video identification based system for extracting film images
CN103050124A (en) * 2011-10-13 2013-04-17 华为终端有限公司 Sound mixing method, device and system
CN103795897A (en) * 2014-01-21 2014-05-14 深圳市中兴移动通信有限公司 Method and device for automatically generating background music
CN106534618A (en) * 2016-11-24 2017-03-22 广州爱九游信息技术有限公司 Method, device and system for realizing pseudo field interpretation

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108805036A (en) * 2018-05-22 2018-11-13 电子科技大学 A kind of new non-supervisory video semanteme extracting method
CN109119089A (en) * 2018-06-05 2019-01-01 安克创新科技股份有限公司 The method and apparatus of penetrating processing is carried out to music
CN113450811B (en) * 2018-06-05 2024-02-06 安克创新科技股份有限公司 Method and equipment for performing transparent processing on music
US11887615B2 (en) 2018-06-05 2024-01-30 Anker Innovations Technology Co., Ltd. Method and device for transparent processing of music
CN113450811A (en) * 2018-06-05 2021-09-28 安克创新科技股份有限公司 Method and equipment for performing transparent processing on music
WO2019233359A1 (en) * 2018-06-05 2019-12-12 安克创新科技股份有限公司 Method and device for transparency processing of music
CN110163050A (en) * 2018-07-23 2019-08-23 腾讯科技(深圳)有限公司 A kind of method for processing video frequency and device, terminal device, server and storage medium
CN110163050B (en) * 2018-07-23 2022-09-27 腾讯科技(深圳)有限公司 Video processing method and device, terminal equipment, server and storage medium
CN109618076A (en) * 2018-08-07 2019-04-12 吴秋琴 The adaptive method for down loading of singer's music
CN109309776B (en) * 2018-08-13 2019-08-27 上海蒙彤文化传播有限公司 Piece caudal flexure based on dynamic degree selects system
CN109640166A (en) * 2018-08-13 2019-04-16 张利军 Piece caudal flexure selection method based on dynamic degree
CN109309776A (en) * 2018-08-13 2019-02-05 张利军 Piece caudal flexure based on dynamic degree selects system
CN110858924A (en) * 2018-08-22 2020-03-03 北京优酷科技有限公司 Video background music generation method and device
CN110858924B (en) * 2018-08-22 2021-11-26 阿里巴巴(中国)有限公司 Video background music generation method and device and storage medium
CN109286841B (en) * 2018-10-17 2021-10-08 Oppo广东移动通信有限公司 Movie sound effect processing method and related product
CN109286841A (en) * 2018-10-17 2019-01-29 Oppo广东移动通信有限公司 Film sound effect treatment method and Related product
WO2020087979A1 (en) * 2018-10-30 2020-05-07 北京字节跳动网络技术有限公司 Method and apparatus for generating model
CN109587552A (en) * 2018-11-26 2019-04-05 Oppo广东移动通信有限公司 Video personage sound effect treatment method, device, mobile terminal and storage medium
CN110677716A (en) * 2019-08-20 2020-01-10 咪咕音乐有限公司 Audio processing method, electronic device, and storage medium
CN110852375A (en) * 2019-11-09 2020-02-28 北京工业大学 End-to-end music score note identification method based on deep learning
CN111028920A (en) * 2019-12-06 2020-04-17 杨保红 Mental health decompression flow system platform
CN111031391A (en) * 2019-12-19 2020-04-17 北京达佳互联信息技术有限公司 Video dubbing method, device, server, terminal and storage medium
CN113469321A (en) * 2020-03-30 2021-10-01 聚晶半导体股份有限公司 Object detection device and object detection method based on neural network
US11495015B2 (en) 2020-03-30 2022-11-08 Altek Semiconductor Corp. Object detection device and object detection method based on neural network
CN111541936A (en) * 2020-04-02 2020-08-14 腾讯科技(深圳)有限公司 Video and image processing method and device, electronic equipment and storage medium
CN112040335A (en) * 2020-08-14 2020-12-04 苏州思萃人工智能研究所有限公司 Artificial intelligent sound effect creation and video adaptation method and system
CN111970579A (en) * 2020-08-14 2020-11-20 苏州思萃人工智能研究所有限公司 Video music adaptation method and system based on AI video understanding
CN112633087A (en) * 2020-12-09 2021-04-09 新奥特(北京)视频技术有限公司 Automatic journaling method and device based on picture analysis for IBC system
CN112690823A (en) * 2020-12-22 2021-04-23 海南力维科贸有限公司 Method and system for identifying physiological sounds of lungs
WO2024067157A1 (en) * 2022-09-29 2024-04-04 北京字跳网络技术有限公司 Special-effect video generation method and apparatus, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN107888843A (en) Sound mixing method, device, storage medium and terminal device for user-generated content
Zhang et al. Speech emotion recognition using deep convolutional neural network and discriminant temporal pyramid matching
LeCun Deep learning & convolutional networks.
WO2020177190A1 (en) Processing method, apparatus and device
CN107220235A (en) Speech recognition error correction method, device and storage medium based on artificial intelligence
CN107211061A (en) The optimization virtual scene layout played back for space meeting
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN107210034A (en) selective conference summary
CN104902012B (en) The method and singing contest system of singing contest are carried out by network
CN114419205B (en) Driving method of virtual digital person and training method of pose acquisition model
CN108206027A (en) A kind of audio quality evaluation method and system
WO2023207541A1 (en) Speech processing method and related device
TWI740315B (en) Sound separation method, electronic and computer readable storage medium
CN115691544A (en) Training of virtual image mouth shape driving model and driving method, device and equipment thereof
Tang et al. Improved convolutional neural networks for acoustic event classification
WO2023197749A1 (en) Background music insertion time point determining method and apparatus, device, and storage medium
Geng Evaluation model of college english multimedia teaching effect based on deep convolutional neural networks
CN110136689A (en) Song synthetic method, device and storage medium based on transfer learning
CN108550173A (en) Method based on speech production shape of the mouth as one speaks video
CN109584904A (en) The sightsinging audio roll call for singing education applied to root LeEco identifies modeling method
CN112528049A (en) Video synthesis method and device, electronic equipment and computer-readable storage medium
CN108847066A (en) A kind of content of courses reminding method, device, server and storage medium
Guo et al. Attention-based visual-audio fusion for video caption generation
Hu et al. 3DACRNN Model Based on Residual Network for Speech Emotion Classification.
Küçükbay et al. Audio event detection using adaptive feature extraction scheme

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20180406