CN113923517B - Background music generation method and device and electronic equipment - Google Patents

Background music generation method and device and electronic equipment

Info

Publication number
CN113923517B
Authority
CN
China
Prior art keywords
music
feature vectors
training
generators
audio
Prior art date
Legal status
Active
Application number
CN202111166926.4A
Other languages
Chinese (zh)
Other versions
CN113923517A (en)
Inventor
Cui Guohui (崔国辉)
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202111166926.4A priority Critical patent/CN113923517B/en
Publication of CN113923517A publication Critical patent/CN113923517A/en
Application granted granted Critical
Publication of CN113923517B publication Critical patent/CN113923517B/en


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4852End-user interface for client configuration for modifying audio parameters, e.g. switching between mono and stereo
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/60Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client 
    • H04N21/65Transmission of management data between client and server
    • H04N21/658Transmission by the client directed to the server
    • H04N21/6587Control parameters, e.g. trick play commands, viewpoint selection
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8106Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • H04N21/8113Monomedia components thereof involving special audio data, e.g. different tracks for different languages comprising music, e.g. song in MP3 format
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/541Details of musical waveform synthesis, i.e. audio waveshape processing from individual wavetable samples, independently of their origin or of the sound they represent

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a background music generation method. Speech recognition is performed on acquired target audio/video data to obtain recognized text; features are extracted from the recognized text using natural language processing to obtain N feature vectors; N music generators corresponding to the N feature vectors are acquired from a pre-trained music generator set; each of the N feature vectors is input into its corresponding music generator to obtain N styles of music; and the N styles of music are synthesized to obtain background music. Because the background music is synthesized from N styles of music, where N is an integer not less than 2, it does not belong to any existing piece of music or song, so the generated background music is more personalized and better matches the user's needs.

Description

Background music generation method and device and electronic equipment
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for generating background music, and an electronic device.
Background
Music has long been an important art form accompanying humanity, and humans have never stopped exploring it. With the development of computer technology, the combination of computers and deep learning has been applied to music creation in a growing number of scenarios.
In the prior art, background music can be generated quickly by having a user preset music feature parameters and feeding them into a neural network that predicts future notes, or by generating music with a generative adversarial network; however, the background music generated in this way often fails to meet the user's requirements well. A background music generation method is therefore needed to solve the above problem.
Disclosure of Invention
The embodiments of the invention provide a background music generation method and apparatus, and an electronic device, for generating background music for an audio/video file.
An embodiment of the present invention provides a method for generating background music, where the method includes:
performing speech recognition on the acquired target audio/video data to obtain recognized text;
extracting features from the recognized text using natural language processing to obtain N feature vectors, where N is an integer not less than 2;
acquiring N music generators corresponding to the N feature vectors from a pre-trained music generator set;
inputting each of the N feature vectors into its corresponding music generator to obtain N styles of music; and
synthesizing the N styles of music to obtain background music.
Optionally, the acquiring N music generators corresponding to the N feature vectors includes:
acquiring N emotion tags corresponding to the N feature vectors; and
acquiring, from the music generator set, N music generators corresponding to the N emotion tags according to the correspondence between emotion tags and music generators.
Optionally, the performing speech recognition on the acquired target audio/video data to obtain the recognized text includes:
performing audio extraction on the acquired target audio/video data to obtain user audio data; and
performing speech recognition on the user audio data to obtain the recognized text.
Optionally, the training step of the music generator set includes:
acquiring a training sample set, where each training sample in the training sample set includes training audio/video data;
for each training sample in the training sample set, performing speech recognition on the training audio/video data of the training sample to obtain training recognized text, and extracting features from the training recognized text using natural language processing to obtain M feature vectors, where M is an integer not less than N; and
performing model training on M music generators with the M feature vectors of each training sample using an adversarial network to obtain M trained music generators, and using the M trained music generators as the music generator set, where the M music generators correspond to the M feature vectors.
Optionally, after obtaining the background music, the method further includes:
adding the background music to the target audio/video data.
A second aspect of the embodiments of the present invention further provides a background music generating apparatus, the apparatus including:
a recognition unit, configured to perform speech recognition on the acquired target audio/video data to obtain recognized text;
a feature extraction unit, configured to extract features from the recognized text using natural language processing to obtain N feature vectors, where N is an integer not less than 2;
a music generator acquisition unit, configured to acquire N music generators corresponding to the N feature vectors from a pre-trained music generator set;
a style music acquisition unit, configured to input each of the N feature vectors into its corresponding music generator to obtain N styles of music; and
a background music acquisition unit, configured to synthesize the N styles of music to obtain background music.
Optionally, the music generator acquisition unit is configured to acquire N emotion tags corresponding to the N feature vectors, and to acquire, from the music generator set, N music generators corresponding to the N emotion tags according to the correspondence between emotion tags and music generators.
Optionally, the recognition unit is configured to perform audio extraction on the acquired target audio/video data to obtain user audio data, and to perform speech recognition on the user audio data to obtain the recognized text.
Optionally, the method further comprises:
a music generator training unit, configured to acquire a training sample set, where each training sample in the training sample set includes training audio/video data; for each training sample in the training sample set, perform speech recognition on the training audio/video data of the training sample to obtain training recognized text, and extract features from the training recognized text using natural language processing to obtain M feature vectors, where M is an integer not less than N; and perform model training on M music generators with the M feature vectors of each training sample using an adversarial network to obtain M trained music generators, and use the M trained music generators as the music generator set, where the M music generators correspond to the M feature vectors.
Optionally, the method further comprises:
And the background music adding unit is used for adding the background music to the target audio/video data after obtaining the background music.
A third aspect of the embodiment of the present invention provides an electronic device, including a memory, and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by one or more processors, where the one or more programs include operation instructions for performing a background music generating method as provided in the first aspect.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps corresponding to the background music generation method as provided in the first aspect.
One or more of the above technical solutions in the embodiments of the present application have at least the following technical effects:
based on the above technical solution, speech recognition is performed on the acquired target audio/video data to obtain recognized text; features are extracted from the recognized text using natural language processing to obtain N feature vectors; each of the N feature vectors is input into its corresponding music generator from the pre-trained music generator set to obtain N styles of music; and the N styles of music are synthesized to obtain background music. In this way, the target audio/video data undergoes speech recognition and feature extraction in sequence, the extracted N feature vectors are fed into N pre-trained music generators to generate N styles of music, and the N styles of music are then synthesized into background music. Because the background music is synthesized from multiple styles of music, it does not belong to any existing piece of music or song, so the generated background music is more personalized and better matches the user's needs.
Drawings
Fig. 1 is a flow chart of a background music generating method according to an embodiment of the present application;
fig. 2 is a flow chart of a training method of a music generator set according to an embodiment of the present application;
fig. 3 is a block diagram of a background music generating apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solution provided by the embodiments of the present application is a background music generation method. Based on this solution, speech recognition is performed on the acquired target audio/video data to obtain recognized text; features are extracted from the recognized text using natural language processing to obtain N feature vectors; each of the N feature vectors is input into its corresponding music generator from the pre-trained music generator set to obtain N styles of music; and the N styles of music are synthesized to obtain background music. In this way, the target audio/video data undergoes speech recognition and feature extraction in sequence, the extracted N feature vectors are fed into N pre-trained music generators to generate N styles of music, and the N styles of music are then synthesized into background music.
The main implementation principle, the specific implementation manner and the corresponding beneficial effects of the technical scheme of the embodiment of the application are described in detail below with reference to the accompanying drawings.
Examples
Referring to fig. 1, an embodiment of the present application provides a background music generating method, which includes:
S101, performing speech recognition on the acquired target audio/video data to obtain recognized text;
S102, extracting features from the recognized text using natural language processing to obtain N feature vectors, where N is an integer not less than 2;
S103, acquiring N music generators corresponding to the N feature vectors from a pre-trained music generator set;
S104, inputting each of the N feature vectors into its corresponding music generator to obtain N styles of music;
S105, synthesizing the N styles of music to obtain background music.
In step S101, the target audio/video data may first be acquired, and speech recognition may then be performed on it to obtain the recognized text. The target audio/video data covers both audio data and video data; that is, it may be either audio data or video data, which is not specifically limited in this specification.
In a specific implementation, when acquiring the target audio/video data, the audio data or video data selected by the user may be used as the target audio/video data. When performing speech recognition on the target audio/video data, a speech recognition model may be used, so that the obtained recognized text is more accurate.
In the embodiments of this specification, the speech recognition model may be, for example, a connectionist temporal classification (CTC) model, a long short-term memory (LSTM) network, a CNN model, or a CLDNN model, which is not specifically limited in this specification.
Specifically, when performing speech recognition on the target audio/video data to obtain the recognized text, in order to improve the accuracy of the recognized text, audio extraction may first be performed on the acquired target audio/video data to obtain user audio data, and speech recognition may then be performed on the user audio data to obtain the recognized text. Because extracting the user audio data from the target audio/video data removes any other background music contained in the target audio/video data, speech recognition on the user audio data is not disturbed by that background music, so the recognition accuracy, and thus the accuracy of the recognized text, can be effectively improved.
Specifically, to further improve the accuracy of the recognized text, after the user audio data is obtained, noise reduction may be applied to the user audio data to remove noise in the audio, including music noise and background noise, yielding noise-reduced audio data in which only the human voice in the user audio data is retained; speech recognition is then performed on the noise-reduced audio data to obtain the recognized text. Because the noise-reduced audio data retains only the human voice and removes noise such as music noise and background noise, the influence of noise on recognition is further reduced, so the recognition accuracy, and thus the accuracy of the recognized text, is further improved.
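For illustration only, the following minimal sketch outlines this extraction, noise-reduction, and recognition pipeline of step S101. The three helper functions are hypothetical stand-ins (assumptions, not defined by this specification); a real system would call an actual audio toolkit and a trained speech recognition model here.

```python
# Sketch of step S101: audio extraction -> noise reduction -> speech recognition.
# The helpers below are placeholders standing in for an audio toolkit and a
# trained ASR model (e.g. a CTC / LSTM / CLDNN model).

def extract_audio(av_path: str) -> bytes:
    # Placeholder: read/strip the audio track of the audio/video file.
    with open(av_path, "rb") as f:
        return f.read()

def denoise(audio: bytes) -> bytes:
    # Placeholder: remove music noise / background noise, keep only the human voice.
    return audio

def recognize_speech(audio: bytes) -> str:
    # Placeholder: run a speech recognition model on the noise-reduced audio.
    return "recognized text of the user's speech"

def generate_recognized_text(target_av_path: str) -> str:
    user_audio = extract_audio(target_av_path)
    clean_audio = denoise(user_audio)
    return recognize_speech(clean_audio)
```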
After the recognized text is obtained, step S102 is performed.
In step S102, feature extraction is performed on the recognized text by using a natural language processing technique, so as to obtain N feature vectors.
In a specific implementation, when extracting features from the recognized text using natural language processing, models such as a bag-of-words model, a CNN model, an RNN model, or an LSTM model may be used to extract N feature vectors, where the N feature vectors include at least two of an emotion feature vector, a scene feature vector, a user feature vector, and other vectors. Of course, the N feature vectors may also include a semantic vector, which may carry the semantic information of the recognized text.
Specifically, emotion feature vectors may include information for representing emotion such as happiness, sadness, and anger, scene feature vectors may include information for representing scenes such as wedding, dinner, birthday party, and conference, and user feature vectors may include user information for representing the gender of the user, the age of the user, and the like.
In another embodiment of this specification, after the N feature vectors are obtained, N emotion tags corresponding to the N feature vectors also need to be acquired. When acquiring the N emotion tags, the tag-vector correspondence between emotion tags and feature vectors is searched according to the N feature vectors to find the N emotion tags corresponding to them; in this case feature vectors and emotion tags are in one-to-one correspondence. Alternatively, the tag-vector correspondence may be searched according to the N feature vectors to find K emotion tags corresponding to the N feature vectors, where K is an integer not less than 1 and not greater than N; in this case one emotion tag may correspond to one or more feature vectors, but each feature vector corresponds to only one emotion tag.
For example, take video data A as the target audio/video data. First, the user audio data extracted from A is denoted A1; A1 is denoised, speech recognition is performed on the denoised A1, and the resulting recognized text is denoted A2. Features are extracted from A2 with an LSTM model, the extracted emotion feature vector is denoted Q1, the scene feature vector Q2, and the user feature vector Q3, and Q1, Q2, and Q3 are taken as the N feature vectors. Then, according to the preset tag-vector correspondence, the emotion tag corresponding to Q1 is found to be B1, that corresponding to Q2 is B2, and that corresponding to Q3 is B3.
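The sketch below mirrors this example for step S102. The feature extractor is a stand-in for an actual bag-of-words/CNN/RNN/LSTM model, and the tag-vector correspondence is shown as a simple dictionary; both are illustrative assumptions rather than the patent's concrete implementation.

```python
# Sketch of step S102: recognized text -> N feature vectors -> emotion tags.
from typing import Dict, List

def extract_feature_vectors(recognized_text: str) -> Dict[str, List[float]]:
    # Placeholder for an LSTM/CNN/bag-of-words extractor producing N feature vectors.
    return {
        "Q1_emotion": [0.9, 0.1, 0.0],  # e.g. happiness / sadness / anger scores
        "Q2_scene":   [1.0, 0.0, 0.0],  # e.g. wedding / dinner / conference scores
        "Q3_user":    [0.0, 1.0],       # e.g. user gender / age-group indicators
    }

# Preset tag-vector correspondence (one emotion tag per feature vector).
TAG_VECTOR_TABLE = {
    "Q1_emotion": "B1",
    "Q2_scene":   "B2",
    "Q3_user":    "B3",
}

def lookup_emotion_tags(feature_vectors: Dict[str, List[float]]) -> Dict[str, str]:
    return {name: TAG_VECTOR_TABLE[name] for name in feature_vectors}

# Example: the three extracted vectors map to tags B1, B2, B3.
tags = lookup_emotion_tags(extract_feature_vectors("recognized text A2"))
```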
After N feature vectors are obtained, step S103 is performed.
In step S103, the N emotion tags corresponding to the N feature vectors may first be acquired, and the N music generators corresponding to the N emotion tags may then be acquired from the music generator set according to the correspondence between emotion tags and music generators. In this way, the N music generators corresponding to the N feature vectors are obtained through the correspondence between emotion tags and feature vectors and the correspondence between emotion tags and music generators: the N emotion tags corresponding to the N feature vectors are acquired first, and the N music generators are then found according to the correspondence between emotion tags and music generators. Looking up the generators through these tag correspondences shortens the time needed to obtain the N music generators and improves the efficiency of obtaining them.
In a specific implementation, a feature vector may also correspond directly to a music generator, so that the N music generators corresponding to the N feature vectors can be acquired directly from the N feature vectors; this specification does not specifically limit this.
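A minimal sketch of this selection step is shown below; the tag-to-generator table and the generator placeholders are illustrative assumptions, not the patent's concrete data structures.

```python
# Sketch of step S103: emotion tags -> music generators.
# GENERATOR_TABLE maps emotion tags to pre-trained generators (stand-in strings here;
# in practice each value would be a trained generator model).
GENERATOR_TABLE = {
    "B1": "G1",
    "B2": "G2",
    "B3": "G3",
    "B4": "G4",
}

def select_generators(emotion_tags):
    """Return the N generators corresponding to the N emotion tags."""
    return [GENERATOR_TABLE[tag] for tag in emotion_tags]

# Example: tags B1, B2, B3 select generators G1, G2, G3.
print(select_generators(["B1", "B2", "B3"]))  # ['G1', 'G2', 'G3']
```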
Specifically, before step S103 is performed, the music generator set is trained in advance. As shown in fig. 2, the training step of the music generator set includes:
S201, acquiring a training sample set, where each training sample in the training sample set includes training audio/video data;
S202, for each training sample in the training sample set, performing speech recognition on the training audio/video data of the training sample to obtain training recognized text, and extracting features from the training recognized text using natural language processing to obtain M feature vectors, where M is an integer not less than N;
S203, performing model training on M music generators with the M feature vectors of each training sample using an adversarial network, obtaining M trained music generators, and using the M trained music generators as the music generator set, where the M music generators correspond to the M feature vectors.
In step S201, a training sample set needs to be acquired first. The training sample set includes at least one training sample, and each training sample includes training audio/video data; the training audio/video data covers both audio data and video data, that is, it may be either audio data or video data, which is not specifically limited in this specification.
After the training sample set is acquired, step S202 is performed.
In step S202, for each training sample in the training sample set, feature extraction may be performed on the training recognized text using natural language processing to obtain M feature vectors.
In a specific implementation, for each training sample, models such as a bag-of-words model, a CNN model, an RNN model, or an LSTM model may be used to extract features from the training recognized text of the training sample, yielding M feature vectors that include at least two of an emotion feature vector, a scene feature vector, a user feature vector, and other vectors. Of course, the M feature vectors may also include a semantic vector carrying the semantic information of the recognized text.
Specifically, emotion feature vectors may include information for representing emotion such as happiness, sadness, and anger, scene feature vectors may include information for representing scenes such as wedding, dinner, birthday party, and conference, and user feature vectors may include user information for representing the gender of the user, the age of the user, and the like.
Specifically, M is generally equal to N, so that the M feature vector types used for training are the same as the N feature vector types used in actual operation, which makes the music generators' predictions more accurate. Of course, M may also be an integer greater than N, in which case more feature vector types are used during training and only some (or all) of them are used in actual operation. For example, M feature vector types C1, C2, C3, C4, and C5 may be obtained during training, while N feature vector types C1, C2, C3, and C4, or three types C1, C2, and C3, are obtained in actual use; this specification does not specifically limit this.
Specifically, when performing speech recognition on the training audio/video data of each training sample, in order to improve the accuracy of the obtained training recognized text, audio extraction may be performed on the training audio/video data of each training sample to obtain training user audio data, and speech recognition may then be performed on the training user audio data to obtain the training recognized text.
Specifically, to further improve the accuracy of the training recognized text of each training sample, after the training user audio data of each training sample is obtained, noise reduction may be applied to it to remove noise in the audio, including music noise and background noise, yielding training noise-reduced audio data in which only the human voice is retained; speech recognition is then performed on the training noise-reduced audio data of each training sample to obtain its training recognized text.
After M feature vectors for each training sample are acquired, step S203 is performed.
In step S203, for each training sample, the M emotion tags corresponding to the M feature vectors are acquired according to the preset correspondence between emotion tags and feature vectors; the M music generators corresponding to the M feature vectors are acquired according to the correspondence between emotion tags and music generators; the M feature vectors are input into the M music generators to obtain M styles of music; and the M styles of music are synthesized to obtain training background music, so that the training background music of each training sample is obtained. After the training background music of each training sample is obtained, a music discriminator is used to distinguish the training background music from real music data, the parameters of each music generator are adjusted continuously, and discrimination is performed again, realizing continuous adversarial optimization. Finally, when the accuracy with which the music discriminator distinguishes the training background music from real music data falls below a set accuracy, the M music generators at that point are taken as the M trained music generators.
In the embodiments of this specification, when the M styles of music are synthesized to obtain the training background music, a music synthesizer is generally used to synthesize them.
In the embodiment of the present specification, the music style generated by each of the M music generators may be different from the music styles generated by other music generators.
Specifically, let the M music generators be denoted G1, G2, G3, and G4, and the music discriminator be denoted D. For each training sample, the M feature vectors of the training sample are input into G1, G2, G3, and G4, and the M styles of output music are synthesized into training background music. D is then used to distinguish the training background music from real music data. Through continuous adversarial optimization of G1, G2, G3, G4, and D, eventually either the training background music can no longer be distinguished from the real music data, or D's discrimination rate for them satisfies the constraint condition (falls below the set accuracy). At that point the training background music output by G1, G2, G3, and G4 is very close to real music data, and G1, G2, G3, and G4 are taken as the M trained music generators.
Because model training is performed in this adversarial manner, the background music predicted by the M trained music generators obtained through adversarial training is more accurate.
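As a schematic illustration of this adversarial optimization (not the patent's concrete architecture), the sketch below trains M toy generators against one discriminator with PyTorch; the network sizes, feature dimensions, mixing step, and random stand-in data are all assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn as nn

FEAT_DIM, MUSIC_DIM, M = 16, 64, 4   # assumed sizes: feature vector, music segment, number of generators

# One toy generator per feature-vector type / emotion tag, plus one shared discriminator.
generators = [nn.Sequential(nn.Linear(FEAT_DIM, 128), nn.ReLU(), nn.Linear(128, MUSIC_DIM))
              for _ in range(M)]
discriminator = nn.Sequential(nn.Linear(MUSIC_DIM, 128), nn.ReLU(), nn.Linear(128, 1), nn.Sigmoid())

g_opt = torch.optim.Adam([p for g in generators for p in g.parameters()], lr=1e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCELoss()

for step in range(200):
    feats = torch.randn(8, M, FEAT_DIM)     # stand-in for the M feature vectors of each training sample
    real_music = torch.randn(8, MUSIC_DIM)  # stand-in for real music data

    # "Synthesize" training background music by mixing the M generators' outputs.
    fake_music = torch.stack([generators[i](feats[:, i]) for i in range(M)]).mean(dim=0)

    # Discriminator step: learn to distinguish real music from generated background music.
    d_loss = bce(discriminator(real_music), torch.ones(8, 1)) + \
             bce(discriminator(fake_music.detach()), torch.zeros(8, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator step: adjust generator parameters so the discriminator is fooled.
    g_loss = bce(discriminator(fake_music), torch.ones(8, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```

Training would stop once the discriminator's accuracy on generated versus real music drops below the set threshold described above.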
Thus, after the M music generators are trained through steps S201-S203, the background music they predict is more accurate because they were obtained through adversarial training; and since the N music generators into which the N feature vectors are input are some or all of the M trained music generators, the output background music matches the music required by the user more closely and is more accurate.
After the trained M music generators are obtained, N music generators are obtained from the trained M music generators according to the N emotion tags.
After N music generators are acquired, step S104 is performed.
In step S104, since the N feature vectors are in one-to-one correspondence with the N music generators, each of the N feature vectors may be input to the corresponding music generator to obtain N styles of music.
Specifically, each of the N feature vectors may be routed to its corresponding music generator through its emotion tag, which avoids inputting a feature vector into the wrong music generator. For example, if the emotion tag of a feature vector is B2 and the music generator corresponding to B2 is G2, the feature vector is input into G2.
After the N styles of music are obtained, step S105 is performed.
In step S105, the N styles of music may be input into a music synthesizer for synthesis, and the synthesized music is taken as the background music.
For example, suppose the M trained music generators are G1, G2, G3, and G4, and their corresponding emotion tags are B1, B2, B3, and B4 in turn. If the target audio/video data is video data A, the N feature vectors of A are obtained as Q1, Q2, and Q3; according to the preset tag-vector correspondence, the emotion tag corresponding to Q1 is found to be B1, that corresponding to Q2 is B2, and that corresponding to Q3 is B3; and according to the correspondence between emotion tags and music generators, the N music generators are determined to be G1, G2, and G3. Q1 is therefore input into G1, Q2 into G2, and Q3 into G3 to obtain 3 styles of music, and the 3 styles of music are synthesized by a music synthesizer to obtain the background music.
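The following sketch ties steps S104 and S105 together for this example. The toy generator and the sample-wise mixing used as the "music synthesizer" are simplifying assumptions; a real system would use the trained generator models and a proper synthesis stage.

```python
import numpy as np

def toy_generator(feature_vector, seed):
    # Stand-in for a trained music generator: returns a 1-second waveform at 16 kHz.
    rng = np.random.default_rng(seed)
    return rng.standard_normal(16000) * float(np.mean(feature_vector))

def synthesize(tracks):
    # Stand-in "music synthesizer": mix the N style tracks into one background track.
    mix = np.mean(np.stack(tracks), axis=0)
    return mix / (np.max(np.abs(mix)) + 1e-9)  # normalize to avoid clipping

# Q1, Q2, Q3 are routed to generators G1, G2, G3 (selected via tags B1, B2, B3 in step S103).
Q = {"Q1": [0.9, 0.1], "Q2": [0.4, 0.6], "Q3": [0.2, 0.8]}
styles = [toy_generator(vec, seed=i) for i, vec in enumerate(Q.values())]
background_music = synthesize(styles)
```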
In this way, because the N music generators are some or all of the M trained music generators, and the M trained music generators are obtained through adversarial training, the accuracy of the background music predicted by the M trained music generators is improved, and so is the accuracy of the background music predicted by the N music generators.
In another embodiment of this specification, after the background music is obtained, it may be added to the target audio/video data, and the target audio/video data with the background music added may then be published. For example, in self-media creation, a user may have only a segment of audio or video; with the background music generation method provided by this embodiment, background music can be generated automatically according to the content of the audio/video, added automatically to the user's audio/video, and published by the user.
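One possible way to add the generated background music to the target video, assuming the ffmpeg command-line tool is available, is sketched below; the file names and the choice to mix the new track with the original audio are illustrative assumptions rather than requirements of the method.

```python
import subprocess

def add_background_music(video_in: str, bgm: str, video_out: str) -> None:
    # Mix the original audio track with the generated background music; keep the video stream as-is.
    subprocess.run([
        "ffmpeg", "-y",
        "-i", video_in, "-i", bgm,
        "-filter_complex", "[0:a][1:a]amix=inputs=2:duration=first[a]",
        "-map", "0:v", "-map", "[a]",
        "-c:v", "copy",
        video_out,
    ], check=True)

add_background_music("target_video.mp4", "background_music.wav", "video_with_bgm.mp4")
```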
In this case, because the generated background music is produced by N music generators, the N music generators are some or all of the M trained music generators, and the M trained music generators are obtained through adversarial training, the accuracy of the background music predicted by the M trained music generators is improved, and so is the accuracy of the background music predicted by the N music generators.
Based on the above technical solution, speech recognition is performed on the acquired target audio/video data to obtain recognized text; features are extracted from the recognized text using natural language processing to obtain N feature vectors; each of the N feature vectors is input into its corresponding music generator from the pre-trained music generator set to obtain N styles of music; and the N styles of music are synthesized to obtain background music. In this way, the target audio/video data undergoes speech recognition and feature extraction in sequence, the extracted N feature vectors are fed into N pre-trained music generators to generate N styles of music, and the N styles of music are then synthesized into background music, which is therefore more personalized and better matches the user's needs.
With reference to fig. 3, an embodiment of the present application further provides a background music generating apparatus, which includes:
a recognition unit 301, configured to perform speech recognition on the acquired target audio/video data to obtain recognized text;
a feature extraction unit 302, configured to extract features from the recognized text using natural language processing to obtain N feature vectors, where N is an integer not less than 2;
a music generator acquisition unit 303, configured to acquire N music generators corresponding to the N feature vectors from a pre-trained music generator set;
a style music acquisition unit 304, configured to input each of the N feature vectors into its corresponding music generator to obtain N styles of music;
a background music acquisition unit 305, configured to synthesize the N styles of music to obtain background music.
In an optional implementation, the music generator acquisition unit 303 is configured to acquire N emotion tags corresponding to the N feature vectors, and to acquire, from the music generator set, N music generators corresponding to the N emotion tags according to the correspondence between emotion tags and music generators.
In an optional implementation, the recognition unit 301 is configured to perform audio extraction on the acquired target audio/video data to obtain user audio data, and to perform speech recognition on the user audio data to obtain the recognized text.
In an alternative embodiment, the apparatus further comprises:
a music generator training unit, configured to acquire a training sample set, where each training sample in the training sample set includes training audio/video data; for each training sample in the training sample set, perform speech recognition on the training audio/video data of the training sample to obtain training recognized text, and extract features from the training recognized text using natural language processing to obtain M feature vectors, where M is an integer not less than N; and perform model training on M music generators with the M feature vectors of each training sample using an adversarial network to obtain M trained music generators, and use the M trained music generators as the music generator set, where the M music generators correspond to the M feature vectors.
In an alternative embodiment, the apparatus further comprises:
And the background music adding unit is used for adding the background music to the target audio/video data after obtaining the background music.
The specific manner in which the modules of the apparatus in the above embodiments perform their operations has been described in detail in the method embodiments and will not be elaborated here.
Fig. 4 is a block diagram of an electronic device 800 for a background music generation method, according to an example embodiment. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 4, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800; the sensor assembly 814 may also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including instructions executable by processor 820 of electronic device 800 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
A non-transitory computer-readable storage medium is also provided; when the instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform a background music generation method, the method including:
performing speech recognition on the acquired target audio/video data to obtain recognized text;
extracting features from the recognized text using natural language processing to obtain N feature vectors, where N is an integer not less than 2;
acquiring N music generators corresponding to the N feature vectors from a pre-trained music generator set;
inputting each of the N feature vectors into its corresponding music generator to obtain N styles of music; and
synthesizing the N styles of music to obtain background music.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within the scope of the invention.

Claims (12)

1. A background music generation method, the method comprising:
performing speech recognition on acquired target audio/video data to obtain recognized text;
extracting features from the recognized text using natural language processing to obtain N feature vectors, wherein N is an integer not less than 2, and the N feature vectors comprise at least two of an emotion feature vector, a scene feature vector, and a user feature vector;
acquiring N music generators corresponding to the N feature vectors from a pre-trained music generator set, wherein each music generator represents a different music style;
inputting each of the N feature vectors into its corresponding music generator to obtain N styles of music; and
synthesizing the N styles of music to obtain background music.
2. The method of claim 1, wherein the acquiring N music generators corresponding to the N feature vectors comprises:
acquiring N emotion tags corresponding to the N feature vectors; and
acquiring, from the music generator set, N music generators corresponding to the N emotion tags according to the correspondence between emotion tags and music generators.
3. The method of claim 2, wherein performing speech recognition on the acquired target audio/video data to obtain the recognized text comprises:
performing audio extraction on the acquired target audio/video data to obtain user audio data; and
performing speech recognition on the user audio data to obtain the recognized text.
4. The method of claim 3, wherein the training step of the music generator set comprises:
acquiring a training sample set, wherein each training sample in the training sample set comprises training audio/video data;
for each training sample in the training sample set, performing speech recognition on the training audio/video data of the training sample to obtain training recognized text, and extracting features from the training recognized text using natural language processing to obtain M feature vectors, wherein M is an integer not less than N; and
performing model training on M music generators with the M feature vectors of each training sample using an adversarial network to obtain M trained music generators, and using the M trained music generators as the music generator set, wherein the M music generators correspond to the M feature vectors.
5. The method of any one of claims 1-4, wherein after obtaining background music, the method further comprises:
adding the background music to the target audio/video data.
6. A background music generating apparatus, the apparatus comprising:
the recognition unit is used for carrying out voice recognition on the acquired target audio and video data to obtain recognition characters;
The feature extraction unit is used for extracting features of the identification characters by utilizing a natural language processing technology to obtain N feature vectors, wherein N is an integer not less than 2; the N feature vectors comprise at least two of emotion feature vectors, scene feature vectors and user feature vectors;
a music generator acquisition unit, configured to acquire N music generators corresponding to the N feature vectors from a pre-trained music generator set; each music generator represents a different style of music;
A style music obtaining unit, configured to input each feature vector of the N feature vectors into a corresponding music generator to obtain N style music;
And the background music acquisition unit is used for synthesizing the N types of music to obtain background music.
7. The apparatus of claim 6, wherein the music generator acquisition unit is configured to acquire N emotion tags corresponding to the N feature vectors, and to acquire, from the music generator set, N music generators corresponding to the N emotion tags according to the correspondence between emotion tags and music generators.
8. The apparatus of claim 7, wherein the recognition unit is configured to perform audio extraction on the acquired target audio/video data to obtain user audio data, and to perform speech recognition on the user audio data to obtain the recognized text.
9. The apparatus as recited in claim 8, further comprising:
a music generator training unit, configured to acquire a training sample set, where each training sample in the training sample set includes training audio/video data; for each training sample in the training sample set, perform speech recognition on the training audio/video data of the training sample to obtain training recognized text, and extract features from the training recognized text using natural language processing to obtain M feature vectors, where M is an integer not less than N; and perform model training on M music generators with the M feature vectors of each training sample using an adversarial network to obtain M trained music generators, and use the M trained music generators as the music generator set, where the M music generators correspond to the M feature vectors.
10. The apparatus of any one of claims 6-9, further comprising:
And the background music adding unit is used for adding the background music to the target audio/video data after obtaining the background music.
11. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the method of any of claims 1-5.
12. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, carries out the steps corresponding to the method according to any one of claims 1-5.
CN202111166926.4A 2021-09-30 2021-09-30 Background music generation method and device and electronic equipment Active CN113923517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111166926.4A CN113923517B (en) 2021-09-30 2021-09-30 Background music generation method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111166926.4A CN113923517B (en) 2021-09-30 2021-09-30 Background music generation method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN113923517A CN113923517A (en) 2022-01-11
CN113923517B true CN113923517B (en) 2024-05-07

Family

ID=79237894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111166926.4A Active CN113923517B (en) 2021-09-30 2021-09-30 Background music generation method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113923517B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116504206B (en) * 2023-03-18 2024-02-20 深圳市狼视天下科技有限公司 Camera capable of identifying environment and generating music

Citations (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006022606A2 (en) * 2003-01-07 2006-03-02 Madwares Ltd. Systems and methods for portable audio synthesis
CN103186527A (en) * 2011-12-27 2013-07-03 北京百度网讯科技有限公司 System for building music classification model, system for recommending music and corresponding method
CN103795897A (en) * 2014-01-21 2014-05-14 深圳市中兴移动通信有限公司 Method and device for automatically generating background music
CN108986842A (en) * 2018-08-14 2018-12-11 百度在线网络技术(北京)有限公司 Music style identifying processing method and terminal
CN109492128A (en) * 2018-10-30 2019-03-19 北京字节跳动网络技术有限公司 Method and apparatus for generating model
CN109599079A (en) * 2017-09-30 2019-04-09 腾讯科技(深圳)有限公司 A kind of generation method and device of music
CN109862393A (en) * 2019-03-20 2019-06-07 深圳前海微众银行股份有限公司 Method of dubbing in background music, system, equipment and the storage medium of video file
CN110085263A (en) * 2019-04-28 2019-08-02 东华大学 A kind of classification of music emotion and machine composing method
CN110148393A (en) * 2018-02-11 2019-08-20 阿里巴巴集团控股有限公司 Music generating method, device and system and data processing method
CN110309327A (en) * 2018-02-28 2019-10-08 北京搜狗科技发展有限公司 Audio generation method, device and the generating means for audio
CN110740262A (en) * 2019-10-31 2020-01-31 维沃移动通信有限公司 Background music adding method and device and electronic equipment
CN110767201A (en) * 2018-07-26 2020-02-07 Tcl集团股份有限公司 Score generation method, storage medium and terminal equipment
CN110781835A (en) * 2019-10-28 2020-02-11 中国传媒大学 Data processing method and device, electronic equipment and storage medium
CN110830368A (en) * 2019-11-22 2020-02-21 维沃移动通信有限公司 Instant messaging message sending method and electronic equipment
CN110858924A (en) * 2018-08-22 2020-03-03 北京优酷科技有限公司 Video background music generation method and device
CN110971969A (en) * 2019-12-09 2020-04-07 北京字节跳动网络技术有限公司 Video dubbing method and device, electronic equipment and computer readable storage medium
CN111737516A (en) * 2019-12-23 2020-10-02 北京沃东天骏信息技术有限公司 Interactive music generation method and device, intelligent sound box and storage medium
CN111950266A (en) * 2019-04-30 2020-11-17 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN112040273A (en) * 2020-09-11 2020-12-04 腾讯科技(深圳)有限公司 Video synthesis method and device
CN112189193A (en) * 2018-05-24 2021-01-05 艾米有限公司 Music generator
CN112231499A (en) * 2019-07-15 2021-01-15 李姿慧 Intelligent video music distribution system
CN112584062A (en) * 2020-12-10 2021-03-30 上海哔哩哔哩科技有限公司 Background audio construction method and device
CN112597320A (en) * 2020-12-09 2021-04-02 上海掌门科技有限公司 Social information generation method, device and computer readable medium
CN113190709A (en) * 2021-03-31 2021-07-30 浙江大学 Background music recommendation method and device based on short video key frame
CN113299255A (en) * 2021-05-13 2021-08-24 中国科学院声学研究所 Emotional music generation method based on deep neural network and music element drive

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10380983B2 (en) * 2016-12-30 2019-08-13 Google Llc Machine learning to generate music from text
CN110555126B (en) * 2018-06-01 2023-06-27 微软技术许可有限责任公司 Automatic generation of melodies
US11741922B2 (en) * 2018-09-14 2023-08-29 Bellevue Investments Gmbh & Co. Kgaa Method and system for template based variant generation of hybrid AI generated song
KR102148006B1 (en) * 2019-04-30 2020-08-25 주식회사 카카오 Method and apparatus for providing special effects to video

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Fang-Fei Kuo, Man-Kwan Shan, Suh-Yin Lee. Background music recommendation for video based on multimodal latent semantic analysis. 2013 IEEE International Conference on Multimedia and Expo (ICME), 2013, full text. *
Zhai Xin. An analysis of music short videos from the perspective of the interaction ritual chain: the Douyin App as an example. New Media Research, 2018-08-31 (No. 16), full text. *
Lyu Junhui. Research on an automatic video background music recommendation algorithm based on deep learning. Video Engineering, 2018-10-05 (No. 10), full text. *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant