CN113923517B - Background music generation method and device and electronic equipment - Google Patents
- Publication number
- CN113923517B CN113923517B CN202111166926.4A CN202111166926A CN113923517B CN 113923517 B CN113923517 B CN 113923517B CN 202111166926 A CN202111166926 A CN 202111166926A CN 113923517 B CN113923517 B CN 113923517B
- Authority
- CN
- China
- Prior art keywords
- music
- feature vectors
- training
- generators
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/485—End-user interface for client configuration
- H04N21/4852—End-user interface for client configuration for modifying audio parameters, e.g. switching between mono and stereo
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/60—Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
- H04N21/65—Transmission of management data between client and server
- H04N21/658—Transmission by the client directed to the server
- H04N21/6587—Control parameters, e.g. trick play commands, viewpoint selection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8106—Monomedia components thereof involving special audio data, e.g. different tracks for different languages
- H04N21/8113—Monomedia components thereof involving special audio data, e.g. different tracks for different languages comprising music, e.g. song in MP3 format
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/541—Details of musical waveform synthesis, i.e. audio waveshape processing from individual wavetable samples, independently of their origin or of the sound they represent
Abstract
The invention discloses a background music generation method. Speech recognition is performed on acquired target audio/video data to obtain recognized text; features are extracted from the recognized text using natural language processing to obtain N feature vectors; N music generators corresponding to the N feature vectors are obtained from a pre-trained music generator set; each of the N feature vectors is input into its corresponding music generator to obtain N styles of music; and the N styles of music are synthesized to obtain background music. Because the background music is synthesized from N styles of music, where N is an integer not less than 2, it is generated from multiple styles of music rather than taken from existing songs, so the generated background music is more personalized and better matches the user's needs.
Description
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for generating background music, and an electronic device.
Background
Music has long been an important art form accompanying humanity, and humans have never stopped exploring it. With the development of computer technology, music created by combining computers with deep learning is finding more and more applications.
In the prior art, background music can be generated quickly either by having the user preset music feature parameters and input them into a neural network that predicts future notes, or by using a generative adversarial network to generate music. However, background music generated in these ways does not meet users' needs well. A background music generation method is therefore needed to solve this problem.
Disclosure of Invention
The embodiments of the invention provide a background music generation method and apparatus and an electronic device, which are used to generate background music for an audio/video file.
An embodiment of the present invention provides a method for generating background music, where the method includes:
performing speech recognition on the acquired target audio/video data to obtain recognized text;
extracting features from the recognized text using natural language processing to obtain N feature vectors, where N is an integer not less than 2;
obtaining N music generators corresponding to the N feature vectors from a pre-trained music generator set;
inputting each of the N feature vectors into its corresponding music generator to obtain N styles of music;
and synthesizing the N styles of music to obtain background music.
Optionally, obtaining the N music generators corresponding to the N feature vectors includes:
obtaining N emotion tags corresponding to the N feature vectors;
and obtaining, from the music generator set, the N music generators corresponding to the N emotion tags according to the correspondence between emotion tags and music generators.
Optionally, performing speech recognition on the acquired target audio/video data to obtain the recognized text includes:
performing audio extraction on the acquired target audio/video data to obtain user audio data;
and performing speech recognition on the user audio data to obtain the recognized text.
Optionally, the training step of the music generator set includes:
acquiring a training sample set, where each training sample in the training sample set includes training audio/video data;
for each training sample in the training sample set, performing speech recognition on the training audio/video data of the sample to obtain training recognized text, and extracting features from the training recognized text using natural language processing to obtain M feature vectors, where M is an integer not less than N;
performing adversarial model training on M music generators using the M feature vectors of each training sample to obtain M trained music generators, and using the M trained music generators as the music generator set, where the M music generators correspond to the M feature vectors.
Optionally, after the background music is obtained, the method further includes:
adding the background music to the target audio/video data.
A second aspect of the embodiments of the present invention further provides a background music generating apparatus, where the apparatus includes:
a recognition unit, configured to perform speech recognition on the acquired target audio/video data to obtain recognized text;
a feature extraction unit, configured to extract features from the recognized text using natural language processing to obtain N feature vectors, where N is an integer not less than 2;
a music generator acquisition unit, configured to obtain N music generators corresponding to the N feature vectors from a pre-trained music generator set;
a style music obtaining unit, configured to input each of the N feature vectors into its corresponding music generator to obtain N styles of music;
and a background music obtaining unit, configured to synthesize the N styles of music to obtain background music.
Optionally, the music generator acquisition unit is configured to obtain N emotion tags corresponding to the N feature vectors, and to obtain, from the music generator set, the N music generators corresponding to the N emotion tags according to the correspondence between emotion tags and music generators.
Optionally, the recognition unit is configured to perform audio extraction on the acquired target audio/video data to obtain user audio data, and to perform speech recognition on the user audio data to obtain the recognized text.
Optionally, the apparatus further includes:
a music generator training unit, configured to acquire a training sample set, where each training sample in the set includes training audio/video data; to perform, for each training sample, speech recognition on the training audio/video data of the sample to obtain training recognized text; to extract features from the training recognized text using natural language processing to obtain M feature vectors, where M is an integer not less than N; and to perform adversarial model training on M music generators using the M feature vectors of each training sample to obtain M trained music generators, which serve as the music generator set, the M music generators corresponding to the M feature vectors.
Optionally, the apparatus further includes:
a background music adding unit, configured to add the background music to the target audio/video data after the background music is obtained.
A third aspect of the embodiments of the present invention provides an electronic device, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include operation instructions for performing the background music generation method provided in the first aspect.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the background music generation method provided in the first aspect.
The technical solutions in the embodiments of the application have at least the following technical effects:
Speech recognition is performed on the acquired target audio/video data to obtain recognized text; features are extracted from the recognized text using natural language processing to obtain N feature vectors; each of the N feature vectors is input into its corresponding music generator from a pre-trained music generator set to obtain N styles of music; and the N styles of music are synthesized to obtain background music. In this way, the target audio/video data undergoes speech recognition and feature extraction in turn, the extracted N feature vectors are input into N pre-trained music generators to generate N styles of music, and the N styles of music are then synthesized into background music.
Drawings
Fig. 1 is a flow chart of a background music generating method according to an embodiment of the present application;
fig. 2 is a flow chart of a training method of a music generator set according to an embodiment of the present application;
fig. 3 is a block diagram of a background music generating apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions provided by the embodiments of the application include a background music generation method: speech recognition is performed on the acquired target audio/video data to obtain recognized text; features are extracted from the recognized text using natural language processing to obtain N feature vectors; each of the N feature vectors is input into its corresponding music generator from a pre-trained music generator set to obtain N styles of music; and the N styles of music are synthesized to obtain background music. In this way, the target audio/video data undergoes speech recognition and feature extraction in turn, the extracted N feature vectors are input into N pre-trained music generators to generate N styles of music, and the N styles of music are then synthesized into background music.
The main implementation principle, specific implementation manners, and corresponding beneficial effects of the technical solutions of the embodiments of the application are described in detail below with reference to the accompanying drawings.
Examples
Referring to fig. 1, an embodiment of the present application provides a background music generation method, which includes:
S101, performing speech recognition on the acquired target audio/video data to obtain recognized text;
S102, extracting features from the recognized text using natural language processing to obtain N feature vectors, where N is an integer not less than 2;
S103, obtaining N music generators corresponding to the N feature vectors from a pre-trained music generator set;
S104, inputting each of the N feature vectors into its corresponding music generator to obtain N styles of music;
S105, synthesizing the N styles of music to obtain background music.
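The five steps above can be sketched end to end as follows. Every function here is a hypothetical stand-in for a component the method describes (an ASR model, an NLP feature extractor, pre-trained generators, a synthesizer); none of these names come from the patent.

```python
# Hypothetical end-to-end sketch of steps S101-S105. Each function is a
# stand-in for a real component, not an API named in the patent.

def speech_recognize(av_data):
    # S101: speech recognition on the target audio/video data (stubbed)
    return "recognized text from " + str(av_data)

def extract_feature_vectors(text, n=3):
    # S102: NLP feature extraction, e.g. emotion/scene/user feature vectors
    return [(kind, len(text)) for kind in ("emotion", "scene", "user")[:n]]

def get_generators(vectors, generator_set):
    # S103: pick one pre-trained generator per feature vector
    return [generator_set[i % len(generator_set)] for i in range(len(vectors))]

def generate_music(generator, vector):
    # S104: each generator turns its feature vector into one style of music
    return f"{generator}:{vector[0]}"

def synthesize(tracks):
    # S105: synthesize the N styles into a single background-music track
    return " + ".join(tracks)

def generate_background_music(av_data, generator_set):
    text = speech_recognize(av_data)
    vectors = extract_feature_vectors(text)
    generators = get_generators(vectors, generator_set)
    tracks = [generate_music(g, v) for g, v in zip(generators, vectors)]
    return synthesize(tracks)
```

With three generators, `generate_background_music("video_A", ["G1", "G2", "G3"])` produces one track per feature vector and joins them, mirroring the N-styles-then-synthesize flow.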
In step S101, the target audio/video data may first be acquired, and speech recognition may then be performed on it to obtain the recognized text. Target audio/video data covers both audio data and video data, so it may be either audio data or video data; this specification does not limit it specifically.
In a specific implementation, when acquiring the target audio/video data, audio data or video data selected by the user may be used as the target audio/video data. When performing speech recognition on the target audio/video data, a speech recognition model may be used so that the recognized text is more accurate.
In the embodiments of this specification, the speech recognition model may be, for example, a neural-network-based Connectionist Temporal Classification (CTC) model, a Long Short-Term Memory (LSTM) model, a CNN model, or a CLDNN model; this specification does not limit it specifically.
Specifically, when performing speech recognition on the target audio/video data to obtain the recognized text, in order to improve the accuracy of the recognized text, audio extraction may first be performed on the acquired target audio/video data to obtain user audio data, and speech recognition may then be performed on the user audio data to obtain the recognized text. Because extracting the user audio data from the target audio/video data removes any other background music in it, speech recognition on the user audio data is free from the interference such background music would cause, which effectively improves recognition accuracy and thus the accuracy of the recognized text.
Specifically, to further improve the accuracy of the recognized text, after the user audio data is obtained, noise in the audio, including music noise and background noise, may be removed by performing noise reduction on the user audio data to obtain noise-reduced audio data in which only the human voice is retained; speech recognition is then performed on the noise-reduced audio data to obtain the recognized text. Because the noise-reduced audio data retains only the voice in the user audio data, with noise such as music or background noise removed, the influence of noise on recognition is further reduced, so recognition accuracy, and hence the accuracy of the recognized text, is further improved.
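The patent does not specify a noise-reduction algorithm. As a rough illustration of the idea only, here is a minimal spectral-subtraction sketch with NumPy: the magnitude spectrum of an estimated noise segment is subtracted from each frame of the signal (floored at zero), keeping the frame's phase.

```python
import numpy as np

def spectral_subtract(signal, noise_estimate, frame=256):
    """Very rough noise reduction: subtract the noise magnitude spectrum
    from each frame's spectrum (floored at zero), keep the phase.
    Illustrative only; real systems use far more robust methods."""
    noise_mag = np.abs(np.fft.rfft(noise_estimate[:frame]))
    out = np.zeros_like(signal)
    for start in range(0, len(signal) - frame + 1, frame):
        chunk = signal[start:start + frame]
        spec = np.fft.rfft(chunk)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        phase = np.angle(spec)
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * phase), n=frame)
    return out
```

For a tone plus a steady interfering tone (a stand-in for background music), subtracting the noise spectrum recovers the voice-like component almost exactly.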
After the recognized text is obtained, step S102 is performed.
In step S102, feature extraction is performed on the recognized text using natural language processing to obtain N feature vectors.
In a specific implementation, feature extraction on the recognized text may use a bag-of-words model, a CNN model, an RNN model, an LSTM model, or the like, extracting N feature vectors that include at least two of an emotion feature vector, a scene feature vector, a user feature vector, and other vectors. The N feature vectors may of course also include a semantic vector carrying the semantic information of the recognized text.
Specifically, the emotion feature vector may carry information representing emotions such as happiness, sadness, and anger; the scene feature vector may carry information representing scenes such as a wedding, a dinner, a birthday party, or a conference; and the user feature vector may carry user information such as the user's gender and age.
In another embodiment of this specification, after the N feature vectors are obtained, N emotion tags corresponding to them also need to be obtained. The N emotion tags may be found by searching the tag-vector correspondence between emotion tags and feature vectors according to the N feature vectors, with feature vectors and emotion tags in one-to-one correspondence. Alternatively, K emotion tags corresponding to the N feature vectors may be found in the same correspondence, where K is an integer not less than 1 and not greater than N; in that case one emotion tag may correspond to one or more feature vectors, but each feature vector corresponds to only one emotion tag.
For example, take video data A as the target audio/video data. First, the user audio data in A is extracted and denoted A1; A1 is noise-reduced, and speech recognition is performed on the noise-reduced A1 to obtain recognized text denoted A2. Features are extracted from A2 using an LSTM model: the extracted emotion feature vector is denoted Q1, the scene feature vector Q2, and the user feature vector Q3, and Q1, Q2, and Q3 are taken as the N feature vectors. According to the preset tag-vector correspondence, the emotion tag corresponding to Q1 is found to be B1, that corresponding to Q2 is B2, and that corresponding to Q3 is B3.
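The Q1/Q2/Q3 to B1/B2/B3 lookup in this example amounts to a dictionary over the preset tag-vector correspondence. A sketch, with the keys standing in for the extracted feature vectors (all identifiers hypothetical):

```python
# Preset tag-vector correspondence from the example: Q1 -> B1, Q2 -> B2, Q3 -> B3.
TAG_VECTOR_MAP = {"Q1": "B1", "Q2": "B2", "Q3": "B3"}

def lookup_emotion_tags(feature_vector_ids):
    """One-to-one case: each feature vector maps to exactly one emotion tag."""
    return [TAG_VECTOR_MAP[v] for v in feature_vector_ids]
```

In the many-to-one (K-tag) variant described above, several keys would simply share the same tag value.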
After N feature vectors are obtained, step S103 is performed.
In step S103, the N emotion tags corresponding to the N feature vectors may first be obtained; the N music generators corresponding to the N emotion tags are then obtained from the music generator set according to the correspondence between emotion tags and music generators. The N music generators corresponding to the N feature vectors are thus obtained via the correspondence between emotion tags and feature vectors together with the correspondence between emotion tags and music generators: first the N emotion tags corresponding to the N feature vectors are obtained, and then the N music generators are found according to the tag-generator correspondence. Looking up through tags in this way shortens the time needed to obtain the N music generators and improves efficiency.
In a specific implementation, the feature vectors may also correspond directly to the music generators, in which case the N music generators corresponding to the N feature vectors can be obtained directly from the N feature vectors; this specification does not limit this specifically.
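Both lookup paths in step S103 reduce to table lookups. A sketch of the tag-to-generator path and of the direct vector-to-generator alternative just mentioned (all names hypothetical):

```python
# Emotion-tag -> generator correspondence (names are hypothetical stand-ins).
TAG_TO_GENERATOR = {"B1": "gen_B1", "B2": "gen_B2", "B3": "gen_B3"}

# Direct feature-vector -> generator correspondence, skipping the tag step.
VECTOR_TO_GENERATOR = {"Q1": "gen_B1", "Q2": "gen_B2", "Q3": "gen_B3"}

def generators_for_tags(tags):
    # two-step path: feature vector -> emotion tag -> generator
    return [TAG_TO_GENERATOR[t] for t in tags]

def generators_for_vectors(vector_ids):
    # direct path: feature vector -> generator
    return [VECTOR_TO_GENERATOR[v] for v in vector_ids]
```

Both paths return the same generators for the example's vectors; the tag indirection simply factors the mapping into two smaller tables.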
Specifically, before step S103 is performed, the music generator set is trained in advance. As shown in fig. 2, the training step of the music generator set includes:
S201, acquiring a training sample set, where each training sample in the training sample set includes training audio/video data;
S202, for each training sample in the training sample set, performing speech recognition on the training audio/video data of the sample to obtain training recognized text, and extracting features from the training recognized text using natural language processing to obtain M feature vectors, where M is an integer not less than N;
S203, performing adversarial model training on M music generators using the M feature vectors of each training sample to obtain M trained music generators, and using the M trained music generators as the music generator set, where the M music generators correspond to the M feature vectors.
In step S201, a training sample set is first acquired. The training sample set includes at least one training sample, and each training sample includes training audio/video data, which covers both audio data and video data, so it may be either; this specification does not limit it specifically.
After the training sample set is acquired, step S202 is performed.
In step S202, for each training sample in the training sample set, feature extraction may be performed on the training recognized text using natural language processing to obtain M feature vectors.
In a specific implementation, a bag-of-words model, a CNN model, an RNN model, an LSTM model, or the like may be used to extract features from the training recognized text of each training sample, yielding M feature vectors per sample that include at least two of an emotion feature vector, a scene feature vector, a user feature vector, and other vectors. The M feature vectors may of course also include a semantic vector carrying the semantic information of the recognized text.
Specifically, the emotion feature vector may carry information representing emotions such as happiness, sadness, and anger; the scene feature vector may carry information representing scenes such as a wedding, a dinner, a birthday party, or a conference; and the user feature vector may carry user information such as the user's gender and age.
Specifically, M is usually equal to N, so that the M feature vector types used in training are the same as the N types actually used, which makes the music generators' predictions more accurate. M may of course also be an integer greater than N, in which case more feature vector types are used during training and some or all of those types are used in actual use. For example, M types such as C1, C2, C3, C4, and C5 may be extracted during training, while in actual use N types such as C1, C2, C3, and C4, or three types such as C1, C2, and C3, are extracted; this specification does not limit this specifically.
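The relationship between the M training feature types and the N types used at inference can be stated as a simple subset selection (the C1-C5 names are taken from the example above; the helper itself is hypothetical):

```python
TRAINING_FEATURE_TYPES = ["C1", "C2", "C3", "C4", "C5"]  # M = 5 types at training

def inference_feature_types(training_types, n):
    """At inference, use n of the M training types (n <= M); here simply
    the first n, e.g. C1-C4 for n=4 or C1-C3 for n=3."""
    if n > len(training_types):
        raise ValueError("N must not exceed M")
    return training_types[:n]
```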
Specifically, when performing speech recognition on the training audio/video data of each training sample, in order to improve the accuracy of the training recognized text, audio extraction may first be performed on the training audio/video data of each sample to obtain training user audio data, and speech recognition may then be performed on the training user audio data to obtain the training recognized text.
Specifically, to further improve the accuracy of the training recognized text of each training sample, after the training user audio data of each sample is obtained, noise in the audio, including music noise and background noise, may be removed by noise reduction to obtain training noise-reduced audio data for each sample in which only the human voice is retained; speech recognition is then performed on the training noise-reduced audio data of each sample to obtain its training recognized text.
After M feature vectors for each training sample are acquired, step S203 is performed.
In step S203, for each training sample, M emotion tags corresponding to the M feature vectors are obtained according to the preset correspondence between emotion tags and feature vectors; M music generators corresponding to the M feature vectors are obtained according to the correspondence between emotion tags and music generators; the M feature vectors are input into the M music generators to obtain M styles of music; and the M styles of music are synthesized to obtain the training background music of each training sample. After the training background music of each sample is obtained, a music discriminator is used to distinguish it from real music data, the parameters of each music generator are adjusted, and the discriminator is applied again, realizing continuous adversarial optimization. Finally, when the discriminator's accuracy in distinguishing the training background music from real music data falls below a set accuracy, the M music generators at that point are taken as the M trained music generators.
In the embodiment of the present specification, when the M types of music are synthesized to obtain the training background music, a music synthesizer is generally used to synthesize them.
In the embodiment of the present specification, the music style generated by each of the M music generators may be different from the music styles generated by other music generators.
Specifically, let the M music generators be denoted G1, G2, G3 and G4, and the music discriminator be denoted D. For each training sample, the M feature vectors of the training sample are input into G1, G2, G3 and G4, and the M types of output music are synthesized into training background music; D is then used to distinguish the training background music from real music data. In the continuous adversarial optimization of G1, G2, G3, G4 and D, eventually either the training background music and the real music data cannot be distinguished, or D's rate of distinguishing them satisfies the constraint condition (it is smaller than the set accuracy). At that point, the training background music output by G1, G2, G3 and G4 is very similar to real music data, and G1, G2, G3 and G4 are taken as the M trained music generators.
Because model training is performed in this adversarial manner, the M trained music generators obtained through adversarial training predict background music with higher accuracy.
Thus, after the M music generators are trained through steps S201-S203, since they are obtained by adversarial training, the background music they predict is more accurate; and since the N music generators into which the N feature vectors are input are some or all of the M trained music generators, the output background music better matches the music required by the user, i.e., the accuracy of the output background music is higher.
After the trained M music generators are obtained, N music generators are obtained from the trained M music generators according to the N emotion tags.
After N music generators are acquired, step S104 is performed.
In step S104, since the N feature vectors are in one-to-one correspondence with the N music generators, each of the N feature vectors may be input to the corresponding music generator to obtain N styles of music.
Specifically, each of the N feature vectors may be routed to its corresponding music generator through its emotion tag, to avoid a feature vector being input to the wrong music generator. For example, if the emotion tag of a certain feature vector is B2 and the music generator corresponding to B2 is G2, that feature vector is input to G2.
After the N types of music are acquired, step S105 is performed.
In step S105, N types of music may be input to a music synthesizer to be synthesized, and the synthesized music may be obtained as background music.
For example, take the M trained music generators to be G1, G2, G3 and G4, with corresponding emotion tags B1, B2, B3 and B4 in turn, and let the target audio/video data be video data A. Suppose the N feature vectors obtained from A are Q1, Q2 and Q3. According to the preset correspondence between tags and vectors, the emotion tag corresponding to Q1 is B1, that corresponding to Q2 is B2, and that corresponding to Q3 is B3; according to the correspondence between emotion tags and music generators, the N music generators are determined to be G1, G2 and G3. Therefore, Q1 is input into G1, Q2 into G2 and Q3 into G3 to obtain 3 types of music, and the 3 types of music are synthesized by a music synthesizer to obtain the background music.
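The worked example above can be sketched as a small routing-and-mixing pipeline. The stub generators, the concrete feature values, and the averaging `synthesize` mixer are assumptions for illustration, since the patent leaves the generators and the synthesizer unspecified:

```python
# Sketch of the worked example above: feature vectors Q1..Q3 are mapped to
# emotion tags B1..B3, routed to generators G1..G3, and the three resulting
# tracks are mixed into one background track. Stub generators and the
# additive mixer are illustrative assumptions.

def make_generator(style_offset):
    """Stand-in music generator: turns a feature vector into 'music' samples."""
    return lambda vec: [x + style_offset for x in vec]

generators = {"G1": make_generator(0.1), "G2": make_generator(0.2),
              "G3": make_generator(0.3), "G4": make_generator(0.4)}
tag_of_vector = {"Q1": "B1", "Q2": "B2", "Q3": "B3"}   # preset tag/vector map
generator_of_tag = {"B1": "G1", "B2": "G2", "B3": "G3", "B4": "G4"}

def synthesize(tracks):
    """Mix equal-length tracks by averaging (a trivial 'music synthesizer')."""
    return [sum(samples) / len(tracks) for samples in zip(*tracks)]

features = {"Q1": [0.0, 1.0], "Q2": [1.0, 0.0], "Q3": [0.5, 0.5]}

tracks = []
for name, vec in features.items():
    gen_name = generator_of_tag[tag_of_vector[name]]   # e.g. Q2 -> B2 -> G2
    tracks.append(generators[gen_name](vec))

background_music = synthesize(tracks)
```

Note that G4 exists in the trained set but is simply not selected, matching the "some or all of the M trained music generators" wording.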
In this way, since the N music generators are some or all of the M trained music generators, and the M trained music generators are obtained by adversarial training, the accuracy of the background music predicted by the M trained music generators is improved, and so is the accuracy of the background music predicted by the N music generators.
In another embodiment of the present specification, after the background music is obtained, it may be added to the target audio/video data, and the target audio/video data with the background music added may then be distributed. For example, in self-media creation, a user may have only a section of audio or video; with the background music generation method provided by this embodiment, background music can be generated automatically according to the content of the audio/video, added automatically to the user's audio/video, and released by the user.
In this case, since the generated background music is produced by the N music generators, the N music generators are some or all of the M trained music generators, and the M trained music generators are obtained by adversarial training, the accuracy of the background music predicted by the M trained music generators is improved, and so is the accuracy of the background music predicted by the N music generators.
In the above technical scheme, speech recognition is performed on the acquired target audio/video data to obtain recognized text; feature extraction is performed on the recognized text by using a natural language processing technology to obtain N feature vectors; each of the N feature vectors is input into its corresponding music generator from a pre-trained music generator set to obtain N types of music; and the N types of music are synthesized to obtain background music. In this way, the target audio/video data undergoes speech recognition and feature extraction in sequence, the extracted N feature vectors are input into the N pre-trained music generators to generate N types of music, and the N types of music are synthesized to obtain the background music.
With reference to fig. 3, the embodiment of the present application further provides a background music generating device, where the background music generating device includes:
The recognition unit 301 is configured to perform speech recognition on the obtained target audio/video data to obtain a recognition text;
A feature extraction unit 302, configured to perform feature extraction on the identified text by using a natural language processing technology, so as to obtain N feature vectors, where N is an integer not less than 2;
A music generator obtaining unit 303, configured to obtain N music generators corresponding to the N feature vectors from a pre-trained music generator set;
A style music obtaining unit 304, configured to input each of the N feature vectors into a corresponding music generator to obtain N style music;
the background music obtaining unit 305 is configured to synthesize the N types of music to obtain background music.
In an alternative embodiment, the music generator obtaining unit 303 is configured to obtain N emotion tags corresponding to the N feature vectors; and acquiring N music generators corresponding to the N emotion labels from the music generator set according to the corresponding relation between the emotion labels and the music generators.
In an optional implementation manner, the identifying unit 301 is configured to perform audio extraction on the obtained target audio/video data to obtain user audio data; and carrying out voice recognition on the user audio data to obtain the recognition text.
In an alternative embodiment, the apparatus further comprises:
The music generator training unit is used for acquiring a training sample set, each training sample in the training sample set comprising training audio/video data; for each training sample in the training sample set, performing speech recognition on the training audio/video data of the training sample to obtain training recognition text; performing feature extraction on the training recognition text by using a natural language processing technology to obtain M feature vectors, wherein M is an integer not less than N; and performing model training on M music generators with the M feature vectors of each training sample by using an adversarial network, to obtain the M trained music generators, which are used as the music generator set, the M music generators corresponding to the M feature vectors.
In an alternative embodiment, the apparatus further comprises:
And the background music adding unit is used for adding the background music to the target audio/video data after obtaining the background music.
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method, and will not be repeated here.
Fig. 4 is a block diagram of an electronic device 800 for a background music generation method, according to an example embodiment. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 4, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the device 800, a relative positioning of the components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of a user's contact with the electronic device 800, an orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including instructions executable by processor 820 of electronic device 800 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a background music generation method, the method comprising:
performing voice recognition on the acquired target audio and video data to obtain recognition characters;
Extracting features of the identified words by using a natural language processing technology to obtain N feature vectors, wherein N is an integer not less than 2;
acquiring N music generators corresponding to the N feature vectors from a pre-trained music generator set;
inputting each characteristic vector in the N characteristic vectors into a corresponding music generator to obtain N types of music;
and synthesizing the N types of music to obtain background music.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings and described above, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The foregoing description of the preferred embodiments is not intended to limit the invention to the precise form disclosed, and any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within its scope.
Claims (12)
1. A background music generation method, the method comprising:
performing voice recognition on the acquired target audio and video data to obtain recognition characters;
Extracting features of the identified words by using a natural language processing technology to obtain N feature vectors, wherein N is an integer not less than 2; the N feature vectors comprise at least two of emotion feature vectors, scene feature vectors and user feature vectors;
Acquiring N music generators corresponding to the N feature vectors from a pre-trained music generator set; each music generator represents a different style of music;
inputting each characteristic vector in the N characteristic vectors into a corresponding music generator to obtain N types of music;
and synthesizing the N types of music to obtain background music.
2. The method of claim 1, wherein the obtaining N music generators corresponding to the N feature vectors comprises:
Acquiring N emotion labels corresponding to the N feature vectors;
And acquiring N music generators corresponding to the N emotion labels from the music generator set according to the corresponding relation between the emotion labels and the music generators.
3. The method of claim 2, wherein performing speech recognition on the obtained target audio-video data to obtain the recognized text comprises:
performing audio extraction on the obtained target audio and video data to obtain user audio data;
And carrying out voice recognition on the user audio data to obtain the recognition text.
4. The method of claim 3, wherein the training step of the music generator set comprises:
Acquiring a training sample set, wherein each training sample in the training sample set comprises training audio and video data;
Aiming at each training sample in the training sample set, performing voice recognition on training audio and video data of the training sample to obtain training recognition characters; extracting features of the training recognition characters by using a natural language processing technology to obtain M feature vectors, wherein M is an integer not less than N;
Model training is performed on M music generators with the M feature vectors of each training sample by using an adversarial network, to obtain the M trained music generators, which are used as the music generator set, the M music generators corresponding to the M feature vectors.
5. The method of any one of claims 1-4, wherein after obtaining background music, the method further comprises:
and adding the background music to the target audio/video data.
6. A background music generating apparatus, the apparatus comprising:
the recognition unit is used for carrying out voice recognition on the acquired target audio and video data to obtain recognition characters;
The feature extraction unit is used for extracting features of the identification characters by utilizing a natural language processing technology to obtain N feature vectors, wherein N is an integer not less than 2; the N feature vectors comprise at least two of emotion feature vectors, scene feature vectors and user feature vectors;
a music generator acquisition unit, configured to acquire N music generators corresponding to the N feature vectors from a pre-trained music generator set; each music generator represents a different style of music;
A style music obtaining unit, configured to input each feature vector of the N feature vectors into a corresponding music generator to obtain N style music;
And the background music acquisition unit is used for synthesizing the N types of music to obtain background music.
7. The apparatus of claim 6, wherein the music generator obtaining unit is configured to obtain N emotion tags corresponding to the N feature vectors; and acquiring N music generators corresponding to the N emotion labels from the music generator set according to the corresponding relation between the emotion labels and the music generators.
8. The apparatus of claim 7, wherein the identification unit is configured to perform audio extraction on the obtained target audio-video data to obtain user audio data; and carrying out voice recognition on the user audio data to obtain the recognition text.
9. The apparatus as recited in claim 8, further comprising:
The music generator training unit is used for acquiring a training sample set, each training sample in the training sample set comprising training audio/video data; for each training sample in the training sample set, performing speech recognition on the training audio/video data of the training sample to obtain training recognition text; performing feature extraction on the training recognition text by using a natural language processing technology to obtain M feature vectors, wherein M is an integer not less than N; and performing model training on M music generators with the M feature vectors of each training sample by using an adversarial network, to obtain the M trained music generators, which are used as the music generator set, the M music generators corresponding to the M feature vectors.
10. The apparatus of any one of claims 6-9, further comprising:
And the background music adding unit is used for adding the background music to the target audio/video data after obtaining the background music.
11. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the method of any of claims 1-5.
12. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, carries out the steps corresponding to the method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111166926.4A CN113923517B (en) | 2021-09-30 | 2021-09-30 | Background music generation method and device and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111166926.4A CN113923517B (en) | 2021-09-30 | 2021-09-30 | Background music generation method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113923517A CN113923517A (en) | 2022-01-11 |
CN113923517B true CN113923517B (en) | 2024-05-07 |
Family
ID=79237894
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111166926.4A Active CN113923517B (en) | 2021-09-30 | 2021-09-30 | Background music generation method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113923517B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116504206B (en) * | 2023-03-18 | 2024-02-20 | 深圳市狼视天下科技有限公司 | Camera capable of identifying environment and generating music |
Citations (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006022606A2 (en) * | 2003-01-07 | 2006-03-02 | Madwares Ltd. | Systems and methods for portable audio synthesis |
CN103186527A (en) * | 2011-12-27 | 2013-07-03 | 北京百度网讯科技有限公司 | System for building music classification model, system for recommending music and corresponding method |
CN103795897A (en) * | 2014-01-21 | 2014-05-14 | 深圳市中兴移动通信有限公司 | Method and device for automatically generating background music |
CN108986842A (en) * | 2018-08-14 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Music style identifying processing method and terminal |
CN109492128A (en) * | 2018-10-30 | 2019-03-19 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating model |
CN109599079A (en) * | 2017-09-30 | 2019-04-09 | 腾讯科技(深圳)有限公司 | A kind of generation method and device of music |
CN109862393A (en) * | 2019-03-20 | 2019-06-07 | 深圳前海微众银行股份有限公司 | Method of dubbing in background music, system, equipment and the storage medium of video file |
CN110085263A (en) * | 2019-04-28 | 2019-08-02 | 东华大学 | A kind of classification of music emotion and machine composing method |
CN110148393A (en) * | 2018-02-11 | 2019-08-20 | 阿里巴巴集团控股有限公司 | Music generating method, device and system and data processing method |
CN110309327A (en) * | 2018-02-28 | 2019-10-08 | 北京搜狗科技发展有限公司 | Audio generation method, device and the generating means for audio |
CN110740262A (en) * | 2019-10-31 | 2020-01-31 | 维沃移动通信有限公司 | Background music adding method and device and electronic equipment |
CN110767201A (en) * | 2018-07-26 | 2020-02-07 | Tcl集团股份有限公司 | Score generation method, storage medium and terminal equipment |
CN110781835A (en) * | 2019-10-28 | 2020-02-11 | 中国传媒大学 | Data processing method and device, electronic equipment and storage medium |
CN110830368A (en) * | 2019-11-22 | 2020-02-21 | 维沃移动通信有限公司 | Instant messaging message sending method and electronic equipment |
CN110858924A (en) * | 2018-08-22 | 2020-03-03 | 北京优酷科技有限公司 | Video background music generation method and device |
CN110971969A (en) * | 2019-12-09 | 2020-04-07 | 北京字节跳动网络技术有限公司 | Video dubbing method and device, electronic equipment and computer readable storage medium |
CN111737516A (en) * | 2019-12-23 | 2020-10-02 | 北京沃东天骏信息技术有限公司 | Interactive music generation method and device, intelligent sound box and storage medium |
CN111950266A (en) * | 2019-04-30 | 2020-11-17 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN112040273A (en) * | 2020-09-11 | 2020-12-04 | 腾讯科技(深圳)有限公司 | Video synthesis method and device |
CN112189193A (en) * | 2018-05-24 | 2021-01-05 | 艾米有限公司 | Music generator |
CN112231499A (en) * | 2019-07-15 | 2021-01-15 | 李姿慧 | Intelligent video music distribution system |
CN112584062A (en) * | 2020-12-10 | 2021-03-30 | 上海哔哩哔哩科技有限公司 | Background audio construction method and device |
CN112597320A (en) * | 2020-12-09 | 2021-04-02 | 上海掌门科技有限公司 | Social information generation method, device and computer readable medium |
CN113190709A (en) * | 2021-03-31 | 2021-07-30 | 浙江大学 | Background music recommendation method and device based on short video key frame |
CN113299255A (en) * | 2021-05-13 | 2021-08-24 | 中国科学院声学研究所 | Emotional music generation method based on deep neural network and music element drive |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10380983B2 (en) * | 2016-12-30 | 2019-08-13 | Google Llc | Machine learning to generate music from text |
CN110555126B (en) * | 2018-06-01 | 2023-06-27 | 微软技术许可有限责任公司 | Automatic generation of melodies |
US11741922B2 (en) * | 2018-09-14 | 2023-08-29 | Bellevue Investments Gmbh & Co. Kgaa | Method and system for template based variant generation of hybrid AI generated song |
KR102148006B1 (en) * | 2019-04-30 | 2020-08-25 | 주식회사 카카오 | Method and apparatus for providing special effects to video |
-
2021
- 2021-09-30 CN CN202111166926.4A patent/CN113923517B/en active Active
Patent Citations (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006022606A2 (en) * | 2003-01-07 | 2006-03-02 | Madwares Ltd. | Systems and methods for portable audio synthesis |
CN103186527A (en) * | 2011-12-27 | 2013-07-03 | 北京百度网讯科技有限公司 | System for building music classification model, system for recommending music and corresponding method |
CN103795897A (en) * | 2014-01-21 | 2014-05-14 | 深圳市中兴移动通信有限公司 | Method and device for automatically generating background music |
CN109599079A (en) * | 2017-09-30 | 2019-04-09 | 腾讯科技(深圳)有限公司 | A kind of generation method and device of music |
CN110148393A (en) * | 2018-02-11 | 2019-08-20 | 阿里巴巴集团控股有限公司 | Music generating method, device and system and data processing method |
CN110309327A (en) * | 2018-02-28 | 2019-10-08 | 北京搜狗科技发展有限公司 | Audio generation method, device and the generating means for audio |
CN112189193A (en) * | 2018-05-24 | 2021-01-05 | Aimi, Inc. | Music generator |
CN110767201A (en) * | 2018-07-26 | 2020-02-07 | TCL Corporation | Score generation method, storage medium and terminal equipment |
CN108986842A (en) * | 2018-08-14 | 2018-12-11 | Baidu Online Network Technology (Beijing) Co., Ltd. | Music style identification processing method and terminal |
CN110858924A (en) * | 2018-08-22 | 2020-03-03 | Beijing Youku Technology Co., Ltd. | Video background music generation method and device |
CN109492128A (en) * | 2018-10-30 | 2019-03-19 | Beijing ByteDance Network Technology Co., Ltd. | Method and apparatus for generating model |
CN109862393A (en) * | 2019-03-20 | 2019-06-07 | Shenzhen Qianhai WeBank Co., Ltd. | Background music dubbing method, system, device, and storage medium for video files |
CN110085263A (en) * | 2019-04-28 | 2019-08-02 | Donghua University | Music emotion classification and machine composition method |
CN111950266A (en) * | 2019-04-30 | 2020-11-17 | Beijing Sogou Technology Development Co., Ltd. | Data processing method and device and data processing device |
CN112231499A (en) * | 2019-07-15 | 2021-01-15 | Li Zihui | Intelligent video music distribution system |
CN110781835A (en) * | 2019-10-28 | 2020-02-11 | Communication University of China | Data processing method and device, electronic equipment and storage medium |
CN110740262A (en) * | 2019-10-31 | 2020-01-31 | Vivo Mobile Communication Co., Ltd. | Background music adding method and device and electronic equipment |
CN110830368A (en) * | 2019-11-22 | 2020-02-21 | Vivo Mobile Communication Co., Ltd. | Instant messaging message sending method and electronic equipment |
CN110971969A (en) * | 2019-12-09 | 2020-04-07 | Beijing ByteDance Network Technology Co., Ltd. | Video dubbing method and device, electronic equipment and computer readable storage medium |
CN111737516A (en) * | 2019-12-23 | 2020-10-02 | Beijing Wodong Tianjun Information Technology Co., Ltd. | Interactive music generation method and device, intelligent sound box and storage medium |
CN112040273A (en) * | 2020-09-11 | 2020-12-04 | Tencent Technology (Shenzhen) Co., Ltd. | Video synthesis method and device |
CN112597320A (en) * | 2020-12-09 | 2021-04-02 | Shanghai Zhangmen Science & Technology Co., Ltd. | Social information generation method, device and computer readable medium |
CN112584062A (en) * | 2020-12-10 | 2021-03-30 | Shanghai Bilibili Technology Co., Ltd. | Background audio construction method and device |
CN113190709A (en) * | 2021-03-31 | 2021-07-30 | Zhejiang University | Background music recommendation method and device based on short video key frames |
CN113299255A (en) * | 2021-05-13 | 2021-08-24 | Institute of Acoustics, Chinese Academy of Sciences | Emotional music generation method driven by deep neural networks and music elements |
Non-Patent Citations (3)
Title |
---|
Fang-Fei Kuo; Man-Kwan Shan; Suh-Yin Lee. Background music recommendation for video based on multimodal latent semantic analysis. 2013 IEEE International Conference on Multimedia and Expo (ICME). 2013, full text. * |
Analysis of music short videos from the perspective of the interaction ritual chain: the Douyin App as an example; Zhai Xin; New Media Research; 2018-08-31 (No. 16); full text. * |
Research on an automatic video background music recommendation algorithm based on deep learning; Lyu Junhui; Video Engineering; 2018-10-05 (No. 10); full text. * |
Also Published As
Publication number | Publication date |
---|---|
CN113923517A (en) | 2022-01-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107644646B (en) | | Voice processing method and device for voice processing |
CN107527619B (en) | | Method and device for positioning voice control service |
CN108038102B (en) | | Method and device for recommending expression image, terminal and storage medium |
CN104378441A (en) | | Schedule creating method and device |
US11335348B2 | | Input method, device, apparatus, and storage medium |
CN107945806B (en) | | User identification method and device based on sound characteristics |
CN111831806B (en) | | Semantic integrity determination method, device, electronic equipment and storage medium |
CN110781323A (en) | | Method and device for determining label of multimedia resource, electronic equipment and storage medium |
CN112068711A (en) | | Information recommendation method and device of input method and electronic equipment |
CN105447109A (en) | | Key word searching method and apparatus |
CN112037756A (en) | | Voice processing method, apparatus and medium |
CN110610720B (en) | | Data processing method and device and data processing device |
CN111797262A (en) | | Poetry generation method and device, electronic equipment and storage medium |
CN113177419B (en) | | Text rewriting method and device, storage medium and electronic equipment |
CN113923517B (en) | | Background music generation method and device and electronic equipment |
CN110728981A (en) | | Interactive function execution method and device, electronic equipment and storage medium |
CN113656557A (en) | | Message reply method, device, storage medium and electronic equipment |
CN113936697B (en) | | Voice processing method and device for voice processing |
CN112948565A (en) | | Man-machine conversation method, device, electronic equipment and storage medium |
CN113709548B (en) | | Image-based multimedia data synthesis method, device, equipment and storage medium |
CN112130839A (en) | | Method for constructing database, method for voice programming and related device |
CN111831132A (en) | | Information recommendation method and device and electronic equipment |
CN114550691A (en) | | Multi-tone word disambiguation method and device, electronic equipment and readable storage medium |
CN113115104B (en) | | Video processing method and device, electronic equipment and storage medium |
CN113420553A (en) | | Text generation method and device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |