CN113923517B - Background music generation method and device and electronic equipment - Google Patents
- Publication number
- CN113923517B CN113923517B CN202111166926.4A CN202111166926A CN113923517B CN 113923517 B CN113923517 B CN 113923517B CN 202111166926 A CN202111166926 A CN 202111166926A CN 113923517 B CN113923517 B CN 113923517B
- Authority
- CN
- China
- Prior art keywords
- music
- feature vectors
- training
- generators
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/485—End-user interface for client configuration
- H04N21/4852—End-user interface for client configuration for modifying audio parameters, e.g. switching between mono and stereo
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/60—Network structure or processes for video distribution between server and client or between remote clients; Control signalling between clients, server and network components; Transmission of management data between server and client, e.g. sending from server to client commands for recording incoming content stream; Communication details between server and client
- H04N21/65—Transmission of management data between client and server
- H04N21/658—Transmission by the client directed to the server
- H04N21/6587—Control parameters, e.g. trick play commands, viewpoint selection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/8106—Monomedia components thereof involving special audio data, e.g. different tracks for different languages
- H04N21/8113—Monomedia components thereof involving special audio data, e.g. different tracks for different languages comprising music, e.g. song in MP3 format
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2250/00—Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
- G10H2250/541—Details of musical waveform synthesis, i.e. audio waveshape processing from individual wavetable samples, independently of their origin or of the sound they represent
Abstract
The invention discloses a background music generation method. Speech recognition is performed on acquired target audio/video data to obtain recognized text; features are extracted from the recognized text using natural language processing to obtain N feature vectors; N music generators corresponding to the N feature vectors are obtained from a pre-trained music generator set; each of the N feature vectors is input into its corresponding music generator to obtain N styles of music; and the N styles of music are synthesized to obtain background music. Because the background music is synthesized from N styles of music, where N is an integer not less than 2, it is generated from multiple styles of music rather than taken from existing songs, so the generated background music is more personalized and better matches the user's needs.
Description
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a method and an apparatus for generating background music, and an electronic device.
Background
Music has long been an important art form accompanying humanity, and humans have never stopped exploring it. With the development of computer technology, music created by combining computers with deep learning is finding more and more applications.
In the prior art, background music can be generated quickly either by having the user preset music feature parameters and input them into a neural network that predicts future notes, or by using a generative adversarial network to generate music. However, background music generated in these ways does not meet users' needs well. A background music generation method is therefore needed to solve this problem.
Disclosure of Invention
The embodiments of the invention provide a background music generation method and apparatus and an electronic device, which are used to generate background music for an audio/video file.
An embodiment of the present invention provides a method for generating background music, where the method includes:
performing speech recognition on the acquired target audio/video data to obtain recognized text;
extracting features from the recognized text using natural language processing to obtain N feature vectors, where N is an integer not less than 2;
obtaining N music generators corresponding to the N feature vectors from a pre-trained music generator set;
inputting each of the N feature vectors into its corresponding music generator to obtain N styles of music;
and synthesizing the N styles of music to obtain background music.
Optionally, obtaining the N music generators corresponding to the N feature vectors includes:
obtaining N emotion tags corresponding to the N feature vectors;
and obtaining, from the music generator set, the N music generators corresponding to the N emotion tags according to the correspondence between emotion tags and music generators.
Optionally, performing speech recognition on the acquired target audio/video data to obtain the recognized text includes:
performing audio extraction on the acquired target audio/video data to obtain user audio data;
and performing speech recognition on the user audio data to obtain the recognized text.
Optionally, the training step of the music generator set includes:
acquiring a training sample set, where each training sample in the training sample set includes training audio/video data;
for each training sample in the training sample set, performing speech recognition on the training audio/video data of the sample to obtain training recognized text, and extracting features from the training recognized text using natural language processing to obtain M feature vectors, where M is an integer not less than N;
performing adversarial model training on M music generators using the M feature vectors of each training sample to obtain M trained music generators, and using the M trained music generators as the music generator set, where the M music generators correspond to the M feature vectors.
Optionally, after the background music is obtained, the method further includes:
adding the background music to the target audio/video data.
A second aspect of the embodiments of the present invention further provides a background music generating apparatus, where the apparatus includes:
a recognition unit, configured to perform speech recognition on the acquired target audio/video data to obtain recognized text;
a feature extraction unit, configured to extract features from the recognized text using natural language processing to obtain N feature vectors, where N is an integer not less than 2;
a music generator acquisition unit, configured to obtain N music generators corresponding to the N feature vectors from a pre-trained music generator set;
a style music obtaining unit, configured to input each of the N feature vectors into its corresponding music generator to obtain N styles of music;
and a background music obtaining unit, configured to synthesize the N styles of music to obtain background music.
Optionally, the music generator acquisition unit is configured to obtain N emotion tags corresponding to the N feature vectors, and to obtain, from the music generator set, the N music generators corresponding to the N emotion tags according to the correspondence between emotion tags and music generators.
Optionally, the recognition unit is configured to perform audio extraction on the acquired target audio/video data to obtain user audio data, and to perform speech recognition on the user audio data to obtain the recognized text.
Optionally, the apparatus further includes:
a music generator training unit, configured to acquire a training sample set, where each training sample in the set includes training audio/video data; to perform, for each training sample, speech recognition on the training audio/video data of the sample to obtain training recognized text; to extract features from the training recognized text using natural language processing to obtain M feature vectors, where M is an integer not less than N; and to perform adversarial model training on M music generators using the M feature vectors of each training sample to obtain M trained music generators, which serve as the music generator set, the M music generators corresponding to the M feature vectors.
Optionally, the apparatus further includes:
a background music adding unit, configured to add the background music to the target audio/video data after the background music is obtained.
A third aspect of the embodiments of the present invention provides an electronic device, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include operation instructions for performing the background music generation method provided in the first aspect.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the background music generation method provided in the first aspect.
The technical solutions in the embodiments of the application have at least the following technical effects:
Speech recognition is performed on the acquired target audio/video data to obtain recognized text; features are extracted from the recognized text using natural language processing to obtain N feature vectors; each of the N feature vectors is input into its corresponding music generator from a pre-trained music generator set to obtain N styles of music; and the N styles of music are synthesized to obtain background music. In this way, the target audio/video data undergoes speech recognition and feature extraction in turn, the extracted N feature vectors are input into N pre-trained music generators to generate N styles of music, and the N styles of music are then synthesized into background music.
Drawings
Fig. 1 is a flow chart of a background music generating method according to an embodiment of the present application;
fig. 2 is a flow chart of a training method of a music generator set according to an embodiment of the present application;
fig. 3 is a block diagram of a background music generating apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions provided by the embodiments of the application include a background music generation method: speech recognition is performed on the acquired target audio/video data to obtain recognized text; features are extracted from the recognized text using natural language processing to obtain N feature vectors; each of the N feature vectors is input into its corresponding music generator from a pre-trained music generator set to obtain N styles of music; and the N styles of music are synthesized to obtain background music. In this way, the target audio/video data undergoes speech recognition and feature extraction in turn, the extracted N feature vectors are input into N pre-trained music generators to generate N styles of music, and the N styles of music are then synthesized into background music.
The main implementation principle, specific implementation manners, and corresponding beneficial effects of the technical solutions of the embodiments of the application are described in detail below with reference to the accompanying drawings.
Examples
Referring to fig. 1, an embodiment of the present application provides a background music generation method, which includes:
S101, performing speech recognition on the acquired target audio/video data to obtain recognized text;
S102, extracting features from the recognized text using natural language processing to obtain N feature vectors, where N is an integer not less than 2;
S103, obtaining N music generators corresponding to the N feature vectors from a pre-trained music generator set;
S104, inputting each of the N feature vectors into its corresponding music generator to obtain N styles of music;
S105, synthesizing the N styles of music to obtain background music.
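The five steps above can be sketched end to end as follows. Every function here is a hypothetical stand-in for a component the method describes (an ASR model, an NLP feature extractor, pre-trained generators, a synthesizer); none of these names come from the patent.

```python
# Hypothetical end-to-end sketch of steps S101-S105. Each function is a
# stand-in for a real component, not an API named in the patent.

def speech_recognize(av_data):
    # S101: speech recognition on the target audio/video data (stubbed)
    return "recognized text from " + str(av_data)

def extract_feature_vectors(text, n=3):
    # S102: NLP feature extraction, e.g. emotion/scene/user feature vectors
    return [(kind, len(text)) for kind in ("emotion", "scene", "user")[:n]]

def get_generators(vectors, generator_set):
    # S103: pick one pre-trained generator per feature vector
    return [generator_set[i % len(generator_set)] for i in range(len(vectors))]

def generate_music(generator, vector):
    # S104: each generator turns its feature vector into one style of music
    return f"{generator}:{vector[0]}"

def synthesize(tracks):
    # S105: synthesize the N styles into a single background-music track
    return " + ".join(tracks)

def generate_background_music(av_data, generator_set):
    text = speech_recognize(av_data)
    vectors = extract_feature_vectors(text)
    generators = get_generators(vectors, generator_set)
    tracks = [generate_music(g, v) for g, v in zip(generators, vectors)]
    return synthesize(tracks)
```

With three generators, `generate_background_music("video_A", ["G1", "G2", "G3"])` produces one track per feature vector and joins them, mirroring the N-styles-then-synthesize flow.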
In step S101, the target audio/video data may first be acquired, and speech recognition may then be performed on it to obtain the recognized text. Target audio/video data covers both audio data and video data, so it may be either audio data or video data; this specification does not limit it specifically.
In a specific implementation, when acquiring the target audio/video data, audio data or video data selected by the user may be used as the target audio/video data. When performing speech recognition on the target audio/video data, a speech recognition model may be used so that the recognized text is more accurate.
In the embodiments of this specification, the speech recognition model may be, for example, a neural-network-based Connectionist Temporal Classification (CTC) model, a Long Short-Term Memory (LSTM) model, a CNN model, or a CLDNN model; this specification does not limit it specifically.
Specifically, when performing speech recognition on the target audio/video data to obtain the recognized text, in order to improve the accuracy of the recognized text, audio extraction may first be performed on the acquired target audio/video data to obtain user audio data, and speech recognition may then be performed on the user audio data to obtain the recognized text. Because extracting the user audio data from the target audio/video data removes any other background music in it, speech recognition on the user audio data is free from the interference such background music would cause, which effectively improves recognition accuracy and thus the accuracy of the recognized text.
Specifically, to further improve the accuracy of the recognized text, after the user audio data is obtained, noise in the audio, including music noise and background noise, may be removed by performing noise reduction on the user audio data to obtain noise-reduced audio data in which only the human voice is retained; speech recognition is then performed on the noise-reduced audio data to obtain the recognized text. Because the noise-reduced audio data retains only the voice in the user audio data, with noise such as music or background noise removed, the influence of noise on recognition is further reduced, so recognition accuracy, and hence the accuracy of the recognized text, is further improved.
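The patent does not specify a noise-reduction algorithm. As a rough illustration of the idea only, here is a minimal spectral-subtraction sketch with NumPy: the magnitude spectrum of an estimated noise segment is subtracted from each frame of the signal (floored at zero), keeping the frame's phase.

```python
import numpy as np

def spectral_subtract(signal, noise_estimate, frame=256):
    """Very rough noise reduction: subtract the noise magnitude spectrum
    from each frame's spectrum (floored at zero), keep the phase.
    Illustrative only; real systems use far more robust methods."""
    noise_mag = np.abs(np.fft.rfft(noise_estimate[:frame]))
    out = np.zeros_like(signal)
    for start in range(0, len(signal) - frame + 1, frame):
        chunk = signal[start:start + frame]
        spec = np.fft.rfft(chunk)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
        phase = np.angle(spec)
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * phase), n=frame)
    return out
```

For a tone plus a steady interfering tone (a stand-in for background music), subtracting the noise spectrum recovers the voice-like component almost exactly.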
After the recognized text is obtained, step S102 is performed.
In step S102, feature extraction is performed on the recognized text using natural language processing to obtain N feature vectors.
In a specific implementation, feature extraction on the recognized text may use a bag-of-words model, a CNN model, an RNN model, an LSTM model, or the like, extracting N feature vectors that include at least two of an emotion feature vector, a scene feature vector, a user feature vector, and other vectors. The N feature vectors may of course also include a semantic vector carrying the semantic information of the recognized text.
Specifically, the emotion feature vector may carry information representing emotions such as happiness, sadness, and anger; the scene feature vector may carry information representing scenes such as a wedding, a dinner, a birthday party, or a conference; and the user feature vector may carry user information such as the user's gender and age.
In another embodiment of this specification, after the N feature vectors are obtained, N emotion tags corresponding to them also need to be obtained. The N emotion tags may be found by searching the tag-vector correspondence between emotion tags and feature vectors according to the N feature vectors, with feature vectors and emotion tags in one-to-one correspondence. Alternatively, K emotion tags corresponding to the N feature vectors may be found in the same correspondence, where K is an integer not less than 1 and not greater than N; in that case one emotion tag may correspond to one or more feature vectors, but each feature vector corresponds to only one emotion tag.
For example, take video data A as the target audio/video data. First, the user audio data in A is extracted and denoted A1; A1 is noise-reduced, and speech recognition is performed on the noise-reduced A1 to obtain recognized text denoted A2. Features are extracted from A2 using an LSTM model: the extracted emotion feature vector is denoted Q1, the scene feature vector Q2, and the user feature vector Q3, and Q1, Q2, and Q3 are taken as the N feature vectors. According to the preset tag-vector correspondence, the emotion tag corresponding to Q1 is found to be B1, that corresponding to Q2 is B2, and that corresponding to Q3 is B3.
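The Q1/Q2/Q3 to B1/B2/B3 lookup in this example amounts to a dictionary over the preset tag-vector correspondence. A sketch, with the keys standing in for the extracted feature vectors (all identifiers hypothetical):

```python
# Preset tag-vector correspondence from the example: Q1 -> B1, Q2 -> B2, Q3 -> B3.
TAG_VECTOR_MAP = {"Q1": "B1", "Q2": "B2", "Q3": "B3"}

def lookup_emotion_tags(feature_vector_ids):
    """One-to-one case: each feature vector maps to exactly one emotion tag."""
    return [TAG_VECTOR_MAP[v] for v in feature_vector_ids]
```

In the many-to-one (K-tag) variant described above, several keys would simply share the same tag value.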
After N feature vectors are obtained, step S103 is performed.
In step S103, the N emotion tags corresponding to the N feature vectors may first be obtained; the N music generators corresponding to the N emotion tags are then obtained from the music generator set according to the correspondence between emotion tags and music generators. The N music generators corresponding to the N feature vectors are thus obtained via the correspondence between emotion tags and feature vectors together with the correspondence between emotion tags and music generators: first the N emotion tags corresponding to the N feature vectors are obtained, and then the N music generators are found according to the tag-generator correspondence. Looking up through tags in this way shortens the time needed to obtain the N music generators and improves efficiency.
In a specific implementation, the feature vectors may also correspond directly to the music generators, in which case the N music generators corresponding to the N feature vectors can be obtained directly from the N feature vectors; this specification does not limit this specifically.
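Both lookup paths in step S103 reduce to table lookups. A sketch of the tag-to-generator path and of the direct vector-to-generator alternative just mentioned (all names hypothetical):

```python
# Emotion-tag -> generator correspondence (names are hypothetical stand-ins).
TAG_TO_GENERATOR = {"B1": "gen_B1", "B2": "gen_B2", "B3": "gen_B3"}

# Direct feature-vector -> generator correspondence, skipping the tag step.
VECTOR_TO_GENERATOR = {"Q1": "gen_B1", "Q2": "gen_B2", "Q3": "gen_B3"}

def generators_for_tags(tags):
    # two-step path: feature vector -> emotion tag -> generator
    return [TAG_TO_GENERATOR[t] for t in tags]

def generators_for_vectors(vector_ids):
    # direct path: feature vector -> generator
    return [VECTOR_TO_GENERATOR[v] for v in vector_ids]
```

Both paths return the same generators for the example's vectors; the tag indirection simply factors the mapping into two smaller tables.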
Specifically, before step S103 is performed, the music generator set is trained in advance. As shown in fig. 2, the training step of the music generator set includes:
S201, acquiring a training sample set, where each training sample in the training sample set includes training audio/video data;
S202, for each training sample in the training sample set, performing speech recognition on the training audio/video data of the sample to obtain training recognized text, and extracting features from the training recognized text using natural language processing to obtain M feature vectors, where M is an integer not less than N;
S203, performing adversarial model training on M music generators using the M feature vectors of each training sample to obtain M trained music generators, and using the M trained music generators as the music generator set, where the M music generators correspond to the M feature vectors.
In step S201, a training sample set is first acquired. The training sample set includes at least one training sample, and each training sample includes training audio/video data, which covers both audio data and video data, so it may be either; this specification does not limit it specifically.
After the training sample set is acquired, step S202 is performed.
In step S202, for each training sample in the training sample set, feature extraction may be performed on the training recognized text using natural language processing to obtain M feature vectors.
In a specific implementation, a bag-of-words model, a CNN model, an RNN model, an LSTM model, or the like may be used to extract features from the training recognized text of each training sample, yielding M feature vectors per sample that include at least two of an emotion feature vector, a scene feature vector, a user feature vector, and other vectors. The M feature vectors may of course also include a semantic vector carrying the semantic information of the recognized text.
Specifically, the emotion feature vector may carry information representing emotions such as happiness, sadness, and anger; the scene feature vector may carry information representing scenes such as a wedding, a dinner, a birthday party, or a conference; and the user feature vector may carry user information such as the user's gender and age.
Specifically, M is usually equal to N, so that the M feature vector types used in training are the same as the N types actually used, which makes the music generators' predictions more accurate. M may of course also be an integer greater than N, in which case more feature vector types are used during training and some or all of those types are used in actual use. For example, M types such as C1, C2, C3, C4, and C5 may be extracted during training, while in actual use N types such as C1, C2, C3, and C4, or three types such as C1, C2, and C3, are extracted; this specification does not limit this specifically.
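The relationship between the M training feature types and the N types used at inference can be stated as a simple subset selection (the C1-C5 names are taken from the example above; the helper itself is hypothetical):

```python
TRAINING_FEATURE_TYPES = ["C1", "C2", "C3", "C4", "C5"]  # M = 5 types at training

def inference_feature_types(training_types, n):
    """At inference, use n of the M training types (n <= M); here simply
    the first n, e.g. C1-C4 for n=4 or C1-C3 for n=3."""
    if n > len(training_types):
        raise ValueError("N must not exceed M")
    return training_types[:n]
```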
Specifically, when performing speech recognition on the training audio/video data of each training sample, in order to improve the accuracy of the training recognized text, audio extraction may first be performed on the training audio/video data of each sample to obtain training user audio data, and speech recognition may then be performed on the training user audio data to obtain the training recognized text.
Specifically, to further improve the accuracy of the training recognized text of each training sample, after the training user audio data of each sample is obtained, noise in the audio, including music noise and background noise, may be removed by noise reduction to obtain training noise-reduced audio data for each sample in which only the human voice is retained; speech recognition is then performed on the training noise-reduced audio data of each sample to obtain its training recognized text.
After M feature vectors for each training sample are acquired, step S203 is performed.
In step S203, for each training sample, M emotion tags corresponding to the M feature vectors are obtained according to the preset correspondence between emotion tags and feature vectors; M music generators corresponding to the M feature vectors are obtained according to the correspondence between emotion tags and music generators; the M feature vectors are input into the M music generators to obtain M styles of music; and the M styles of music are synthesized to obtain the training background music of each training sample. After the training background music of each sample is obtained, a music discriminator is used to distinguish it from real music data, the parameters of each music generator are adjusted, and the discriminator is applied again, realizing continuous adversarial optimization. Finally, when the discriminator's accuracy in distinguishing the training background music from real music data falls below a set accuracy, the M music generators at that point are taken as the M trained music generators.
In the embodiment of the present specification, when the M types of music are synthesized to obtain the training background music, a music synthesizer is generally used to synthesize them.
In the embodiment of the present specification, the music style generated by each of the M music generators may be different from the music styles generated by other music generators.
Specifically, let the M music generators be denoted G1, G2, G3 and G4, and the music discriminator be denoted D. For each training sample, the M feature vectors of the training sample are input into G1, G2, G3 and G4, and the M types of output music are synthesized into training background music; D is then used to distinguish the training background music from real music data. In the continuous adversarial optimization of G1, G2, G3, G4 and D, eventually either the training background music and the real music data cannot be distinguished, or D's rate of distinguishing them satisfies the constraint condition (it is smaller than the set accuracy). At that point, the training background music output by G1, G2, G3 and G4 is very similar to real music data, and G1, G2, G3 and G4 are taken as the M trained music generators.
Because model training is performed in this adversarial manner, the M trained music generators obtained through adversarial training predict background music with higher accuracy.
Thus, after the M music generators are trained through steps S201-S203, since they are obtained by adversarial training, the background music they predict is more accurate; and since the N music generators into which the N feature vectors are input are some or all of the M trained music generators, the output background music better matches the music required by the user, i.e., the accuracy of the output background music is higher.
After the trained M music generators are obtained, N music generators are obtained from the trained M music generators according to the N emotion tags.
After N music generators are acquired, step S104 is performed.
In step S104, since the N feature vectors are in one-to-one correspondence with the N music generators, each of the N feature vectors may be input to the corresponding music generator to obtain N styles of music.
Specifically, each of the N feature vectors may be routed to its corresponding music generator through its emotion tag, to avoid a feature vector being input to the wrong music generator. For example, if the emotion tag of a certain feature vector is B2 and the music generator corresponding to B2 is G2, that feature vector is input to G2.
After the N types of music are acquired, step S105 is performed.
In step S105, N types of music may be input to a music synthesizer to be synthesized, and the synthesized music may be obtained as background music.
For example, take the M trained music generators to be G1, G2, G3 and G4, with corresponding emotion tags B1, B2, B3 and B4 in turn, and let the target audio/video data be video data A. Suppose the N feature vectors obtained from A are Q1, Q2 and Q3. According to the preset correspondence between tags and vectors, the emotion tag corresponding to Q1 is B1, that corresponding to Q2 is B2, and that corresponding to Q3 is B3; according to the correspondence between emotion tags and music generators, the N music generators are determined to be G1, G2 and G3. Therefore, Q1 is input into G1, Q2 into G2 and Q3 into G3 to obtain 3 types of music, and the 3 types of music are synthesized by a music synthesizer to obtain the background music.
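The worked example above can be sketched as a small routing-and-mixing pipeline. The stub generators, the concrete feature values, and the averaging `synthesize` mixer are assumptions for illustration, since the patent leaves the generators and the synthesizer unspecified:

```python
# Sketch of the worked example above: feature vectors Q1..Q3 are mapped to
# emotion tags B1..B3, routed to generators G1..G3, and the three resulting
# tracks are mixed into one background track. Stub generators and the
# additive mixer are illustrative assumptions.

def make_generator(style_offset):
    """Stand-in music generator: turns a feature vector into 'music' samples."""
    return lambda vec: [x + style_offset for x in vec]

generators = {"G1": make_generator(0.1), "G2": make_generator(0.2),
              "G3": make_generator(0.3), "G4": make_generator(0.4)}
tag_of_vector = {"Q1": "B1", "Q2": "B2", "Q3": "B3"}   # preset tag/vector map
generator_of_tag = {"B1": "G1", "B2": "G2", "B3": "G3", "B4": "G4"}

def synthesize(tracks):
    """Mix equal-length tracks by averaging (a trivial 'music synthesizer')."""
    return [sum(samples) / len(tracks) for samples in zip(*tracks)]

features = {"Q1": [0.0, 1.0], "Q2": [1.0, 0.0], "Q3": [0.5, 0.5]}

tracks = []
for name, vec in features.items():
    gen_name = generator_of_tag[tag_of_vector[name]]   # e.g. Q2 -> B2 -> G2
    tracks.append(generators[gen_name](vec))

background_music = synthesize(tracks)
```

Note that G4 exists in the trained set but is simply not selected, matching the "some or all of the M trained music generators" wording.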
In this way, since the N music generators are some or all of the M trained music generators, and the M trained music generators are obtained by adversarial training, the accuracy of the background music predicted by the M trained music generators is improved, and so is the accuracy of the background music predicted by the N music generators.
In another embodiment of the present specification, after the background music is obtained, it may be added to the target audio/video data, and the target audio/video data with the background music added may then be distributed. For example, in self-media creation, a user may have only a section of audio or video; with the background music generation method provided by this embodiment, background music can be generated automatically according to the content of the audio/video, added automatically to the user's audio/video, and released by the user.
In this case, since the generated background music is produced by the N music generators, the N music generators are some or all of the M trained music generators, and the M trained music generators are obtained by adversarial training, the accuracy of the background music predicted by the M trained music generators is improved, and so is the accuracy of the background music predicted by the N music generators.
In the above technical scheme, speech recognition is performed on the acquired target audio/video data to obtain recognized text; feature extraction is performed on the recognized text by using a natural language processing technology to obtain N feature vectors; each of the N feature vectors is input into its corresponding music generator from a pre-trained music generator set to obtain N types of music; and the N types of music are synthesized to obtain background music. In this way, the target audio/video data undergoes speech recognition and feature extraction in sequence, the extracted N feature vectors are input into the N pre-trained music generators to generate N types of music, and the N types of music are synthesized to obtain the background music.
With reference to fig. 3, the embodiment of the present application further provides a background music generating device, where the background music generating device includes:
The recognition unit 301 is configured to perform speech recognition on the obtained target audio/video data to obtain a recognition text;
A feature extraction unit 302, configured to perform feature extraction on the identified text by using a natural language processing technology, so as to obtain N feature vectors, where N is an integer not less than 2;
A music generator obtaining unit 303, configured to obtain N music generators corresponding to the N feature vectors from a pre-trained music generator set;
A style music obtaining unit 304, configured to input each of the N feature vectors into a corresponding music generator to obtain N style music;
the background music obtaining unit 305 is configured to synthesize the N types of music to obtain background music.
In an alternative embodiment, the music generator obtaining unit 303 is configured to obtain N emotion tags corresponding to the N feature vectors; and acquiring N music generators corresponding to the N emotion labels from the music generator set according to the corresponding relation between the emotion labels and the music generators.
In an optional implementation manner, the identifying unit 301 is configured to perform audio extraction on the obtained target audio/video data to obtain user audio data; and carrying out voice recognition on the user audio data to obtain the recognition text.
In an alternative embodiment, the apparatus further comprises:
The music generator training unit is used for acquiring a training sample set, each training sample in the training sample set comprising training audio/video data; for each training sample in the training sample set, performing speech recognition on the training audio/video data of the training sample to obtain training recognition text; performing feature extraction on the training recognition text by using a natural language processing technology to obtain M feature vectors, wherein M is an integer not less than N; and performing model training on M music generators with the M feature vectors of each training sample by using an adversarial network, to obtain the M trained music generators, which are used as the music generator set, the M music generators corresponding to the M feature vectors.
In an alternative embodiment, the apparatus further comprises:
And the background music adding unit is used for adding the background music to the target audio/video data after obtaining the background music.
The specific manner in which the various modules perform operations in the apparatus of the above embodiments has been described in detail in the embodiments of the method, and will not be repeated here.
Fig. 4 is a block diagram of an electronic device 800 for a background music generation method, according to an example embodiment. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 4, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operational mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, click wheel, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 814 includes one or more sensors for providing status assessment of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the device 800, a relative positioning of the components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of a user's contact with the electronic device 800, an orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including instructions executable by processor 820 of electronic device 800 to perform the above-described method. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
A non-transitory computer-readable storage medium having stored thereon instructions that, when executed by a processor of a mobile terminal, cause the mobile terminal to perform a background music generation method, the method comprising:
performing voice recognition on the acquired target audio and video data to obtain recognition characters;
Extracting features of the identified words by using a natural language processing technology to obtain N feature vectors, wherein N is an integer not less than 2;
acquiring N music generators corresponding to the N feature vectors from a pre-trained music generator set;
inputting each characteristic vector in the N characteristic vectors into a corresponding music generator to obtain N types of music;
and synthesizing the N types of music to obtain background music.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the invention is not limited to the precise arrangements and instrumentalities shown in the drawings and described above, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The foregoing description of the preferred embodiments is not intended to limit the invention to the precise form disclosed, and any modifications, equivalents, and alternatives falling within the spirit and scope of the invention are intended to be included within its scope.
Claims (12)
1. A background music generation method, the method comprising:
performing voice recognition on the acquired target audio and video data to obtain recognition characters;
Extracting features of the identified words by using a natural language processing technology to obtain N feature vectors, wherein N is an integer not less than 2; the N feature vectors comprise at least two of emotion feature vectors, scene feature vectors and user feature vectors;
Acquiring N music generators corresponding to the N feature vectors from a pre-trained music generator set; each music generator represents a different style of music;
inputting each characteristic vector in the N characteristic vectors into a corresponding music generator to obtain N types of music;
and synthesizing the N types of music to obtain background music.
2. The method of claim 1, wherein the obtaining N music generators corresponding to the N feature vectors comprises:
Acquiring N emotion labels corresponding to the N feature vectors;
And acquiring N music generators corresponding to the N emotion labels from the music generator set according to the corresponding relation between the emotion labels and the music generators.
3. The method of claim 2, wherein performing speech recognition on the obtained target audio-video data to obtain the recognized text comprises:
performing audio extraction on the obtained target audio and video data to obtain user audio data;
And carrying out voice recognition on the user audio data to obtain the recognition text.
4. The method of claim 3, wherein the training step of the music generator set comprises:
Acquiring a training sample set, wherein each training sample in the training sample set comprises training audio and video data;
Aiming at each training sample in the training sample set, performing voice recognition on training audio and video data of the training sample to obtain training recognition characters; extracting features of the training recognition characters by using a natural language processing technology to obtain M feature vectors, wherein M is an integer not less than N;
Model training is performed on M music generators with the M feature vectors of each training sample by using an adversarial network, to obtain the M trained music generators, which are used as the music generator set, the M music generators corresponding to the M feature vectors.
5. The method of any one of claims 1-4, wherein after obtaining background music, the method further comprises:
and adding the background music to the target audio/video data.
6. A background music generating apparatus, the apparatus comprising:
the recognition unit is used for carrying out voice recognition on the acquired target audio and video data to obtain recognition characters;
The feature extraction unit is used for extracting features of the identification characters by utilizing a natural language processing technology to obtain N feature vectors, wherein N is an integer not less than 2; the N feature vectors comprise at least two of emotion feature vectors, scene feature vectors and user feature vectors;
a music generator acquisition unit, configured to acquire N music generators corresponding to the N feature vectors from a pre-trained music generator set; each music generator represents a different style of music;
A style music obtaining unit, configured to input each feature vector of the N feature vectors into a corresponding music generator to obtain N style music;
And the background music acquisition unit is used for synthesizing the N types of music to obtain background music.
7. The apparatus of claim 6, wherein the music generator obtaining unit is configured to obtain N emotion tags corresponding to the N feature vectors; and acquiring N music generators corresponding to the N emotion labels from the music generator set according to the corresponding relation between the emotion labels and the music generators.
8. The apparatus of claim 7, wherein the identification unit is configured to perform audio extraction on the obtained target audio-video data to obtain user audio data; and carrying out voice recognition on the user audio data to obtain the recognition text.
9. The apparatus as recited in claim 8, further comprising:
The music generator training unit is used for acquiring a training sample set, each training sample in the training sample set comprising training audio/video data; for each training sample in the training sample set, performing speech recognition on the training audio/video data of the training sample to obtain training recognition text; performing feature extraction on the training recognition text by using a natural language processing technology to obtain M feature vectors, wherein M is an integer not less than N; and performing model training on M music generators with the M feature vectors of each training sample by using an adversarial network, to obtain the M trained music generators, which are used as the music generator set, the M music generators corresponding to the M feature vectors.
10. The apparatus of any one of claims 6-9, further comprising:
And the background music adding unit is used for adding the background music to the target audio/video data after obtaining the background music.
11. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs comprising instructions for performing the method of any of claims 1-5.
12. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, carries out the steps corresponding to the method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111166926.4A CN113923517B (en) | 2021-09-30 | 2021-09-30 | Background music generation method and device and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111166926.4A CN113923517B (en) | 2021-09-30 | 2021-09-30 | Background music generation method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113923517A CN113923517A (en) | 2022-01-11 |
CN113923517B true CN113923517B (en) | 2024-05-07 |
Family
ID=79237894
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111166926.4A Active CN113923517B (en) | 2021-09-30 | 2021-09-30 | Background music generation method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113923517B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116504206B (en) * | 2023-03-18 | 2024-02-20 | 深圳市狼视天下科技有限公司 | Camera capable of identifying environment and generating music |
Citations (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006022606A2 (en) * | 2003-01-07 | 2006-03-02 | Madwares Ltd. | Systems and methods for portable audio synthesis |
CN103186527A (en) * | 2011-12-27 | 2013-07-03 | 北京百度网讯科技有限公司 | System for building music classification model, system for recommending music and corresponding method |
CN103795897A (en) * | 2014-01-21 | 2014-05-14 | 深圳市中兴移动通信有限公司 | Method and device for automatically generating background music |
CN108986842A (en) * | 2018-08-14 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Music style identifying processing method and terminal |
CN109492128A (en) * | 2018-10-30 | 2019-03-19 | 北京字节跳动网络技术有限公司 | Method and apparatus for generating model |
CN109599079A (en) * | 2017-09-30 | 2019-04-09 | 腾讯科技(深圳)有限公司 | A kind of generation method and device of music |
CN109862393A (en) * | 2019-03-20 | 2019-06-07 | 深圳前海微众银行股份有限公司 | Method of dubbing in background music, system, equipment and the storage medium of video file |
CN110085263A (en) * | 2019-04-28 | 2019-08-02 | 东华大学 | A kind of classification of music emotion and machine composing method |
CN110148393A (en) * | 2018-02-11 | 2019-08-20 | 阿里巴巴集团控股有限公司 | Music generating method, device and system and data processing method |
CN110309327A (en) * | 2018-02-28 | 2019-10-08 | 北京搜狗科技发展有限公司 | Audio generation method, device and the generating means for audio |
CN110740262A (en) * | 2019-10-31 | 2020-01-31 | 维沃移动通信有限公司 | Background music adding method and device and electronic equipment |
CN110767201A (en) * | 2018-07-26 | 2020-02-07 | Tcl集团股份有限公司 | Score generation method, storage medium and terminal equipment |
CN110781835A (en) * | 2019-10-28 | 2020-02-11 | 中国传媒大学 | Data processing method and device, electronic equipment and storage medium |
CN110830368A (en) * | 2019-11-22 | 2020-02-21 | 维沃移动通信有限公司 | Instant messaging message sending method and electronic equipment |
CN110858924A (en) * | 2018-08-22 | 2020-03-03 | 北京优酷科技有限公司 | Video background music generation method and device |
CN110971969A (en) * | 2019-12-09 | 2020-04-07 | 北京字节跳动网络技术有限公司 | Video dubbing method and device, electronic equipment and computer readable storage medium |
CN111737516A (en) * | 2019-12-23 | 2020-10-02 | 北京沃东天骏信息技术有限公司 | Interactive music generation method and device, intelligent sound box and storage medium |
CN111950266A (en) * | 2019-04-30 | 2020-11-17 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN112040273A (en) * | 2020-09-11 | 2020-12-04 | 腾讯科技(深圳)有限公司 | Video synthesis method and device |
CN112189193A (en) * | 2018-05-24 | 2021-01-05 | 艾米有限公司 | Music generator |
CN112231499A (en) * | 2019-07-15 | 2021-01-15 | 李姿慧 | Intelligent video music distribution system |
CN112584062A (en) * | 2020-12-10 | 2021-03-30 | 上海哔哩哔哩科技有限公司 | Background audio construction method and device |
CN112597320A (en) * | 2020-12-09 | 2021-04-02 | 上海掌门科技有限公司 | Social information generation method, device and computer readable medium |
CN113190709A (en) * | 2021-03-31 | 2021-07-30 | 浙江大学 | Background music recommendation method and device based on short video key frame |
CN113299255A (en) * | 2021-05-13 | 2021-08-24 | 中国科学院声学研究所 | Emotional music generation method based on deep neural network and music element drive |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10380983B2 (en) * | 2016-12-30 | 2019-08-13 | Google Llc | Machine learning to generate music from text |
CN110555126B (en) * | 2018-06-01 | 2023-06-27 | 微软技术许可有限责任公司 | Automatic generation of melodies |
US11741922B2 (en) * | 2018-09-14 | 2023-08-29 | Bellevue Investments Gmbh & Co. Kgaa | Method and system for template based variant generation of hybrid AI generated song |
KR102148006B1 (en) * | 2019-04-30 | 2020-08-25 | 주식회사 카카오 | Method and apparatus for providing special effects to video |
-
2021
- 2021-09-30 CN CN202111166926.4A patent/CN113923517B/en active Active
Patent Citations (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2006022606A2 (en) * | 2003-01-07 | 2006-03-02 | Madwares Ltd. | Systems and methods for portable audio synthesis |
CN103186527A (en) * | 2011-12-27 | 2013-07-03 | 北京百度网讯科技有限公司 | System for building music classification model, system for recommending music and corresponding method |
CN103795897A (en) * | 2014-01-21 | 2014-05-14 | 深圳市中兴移动通信有限公司 | Method and device for automatically generating background music |
CN109599079A (en) * | 2017-09-30 | 2019-04-09 | 腾讯科技(深圳)有限公司 | A kind of generation method and device of music |
CN110148393A (en) * | 2018-02-11 | 2019-08-20 | 阿里巴巴集团控股有限公司 | Music generating method, device and system and data processing method |
CN110309327A (en) * | 2018-02-28 | 2019-10-08 | 北京搜狗科技发展有限公司 | Audio generation method, device and the generating means for audio |
CN112189193A (en) * | 2018-05-24 | 2021-01-05 | Aimi, Inc. | Music generator |
CN110767201A (en) * | 2018-07-26 | 2020-02-07 | TCL Corporation | Score generation method, storage medium and terminal equipment |
CN108986842A (en) * | 2018-08-14 | 2018-12-11 | Baidu Online Network Technology (Beijing) Co., Ltd. | Music style identification processing method and terminal |
CN110858924A (en) * | 2018-08-22 | 2020-03-03 | Beijing Youku Technology Co., Ltd. | Video background music generation method and device |
CN109492128A (en) * | 2018-10-30 | 2019-03-19 | Beijing ByteDance Network Technology Co., Ltd. | Method and apparatus for generating model |
CN109862393A (en) * | 2019-03-20 | 2019-06-07 | Shenzhen Qianhai WeBank Co., Ltd. | Background music dubbing method, system, device, and storage medium for video files |
CN110085263A (en) * | 2019-04-28 | 2019-08-02 | Donghua University | Music emotion classification and machine composition method |
CN111950266A (en) * | 2019-04-30 | 2020-11-17 | Beijing Sogou Technology Development Co., Ltd. | Data processing method and device and data processing device |
CN112231499A (en) * | 2019-07-15 | 2021-01-15 | Li Zihui | Intelligent video music distribution system |
CN110781835A (en) * | 2019-10-28 | 2020-02-11 | Communication University of China | Data processing method and device, electronic equipment and storage medium |
CN110740262A (en) * | 2019-10-31 | 2020-01-31 | Vivo Mobile Communication Co., Ltd. | Background music adding method and device and electronic equipment |
CN110830368A (en) * | 2019-11-22 | 2020-02-21 | Vivo Mobile Communication Co., Ltd. | Instant messaging message sending method and electronic equipment |
CN110971969A (en) * | 2019-12-09 | 2020-04-07 | Beijing ByteDance Network Technology Co., Ltd. | Video dubbing method and device, electronic equipment and computer readable storage medium |
CN111737516A (en) * | 2019-12-23 | 2020-10-02 | Beijing Wodong Tianjun Information Technology Co., Ltd. | Interactive music generation method and device, intelligent sound box and storage medium |
CN112040273A (en) * | 2020-09-11 | 2020-12-04 | Tencent Technology (Shenzhen) Co., Ltd. | Video synthesis method and device |
CN112597320A (en) * | 2020-12-09 | 2021-04-02 | Shanghai Zhangmen Science & Technology Co., Ltd. | Social information generation method, device and computer readable medium |
CN112584062A (en) * | 2020-12-10 | 2021-03-30 | Shanghai Bilibili Technology Co., Ltd. | Background audio construction method and device |
CN113190709A (en) * | 2021-03-31 | 2021-07-30 | Zhejiang University | Background music recommendation method and device based on short video key frames |
CN113299255A (en) * | 2021-05-13 | 2021-08-24 | Institute of Acoustics, Chinese Academy of Sciences | Emotional music generation method driven by deep neural networks and music elements |
Non-Patent Citations (3)
Title |
---|
Fang-Fei Kuo; Man-Kwan Shan; Suh-Yin Lee. Background music recommendation for video based on multimodal latent semantic analysis. 2013 IEEE International Conference on Multimedia and Expo (ICME). 2013, full text. * |
Analysis of music short videos from the perspective of the interaction ritual chain: the Douyin App as an example; Zhai Xin; New Media Research; 2018-08-31 (No. 16); full text. * |
Research on an automatic video background music recommendation algorithm based on deep learning; Lyu Junhui; Video Engineering; 2018-10-05 (No. 10); full text. * |
Also Published As
Publication number | Publication date |
---|---|
CN113923517A (en) | 2022-01-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107644646B (en) | | Voice processing method and device for voice processing |
CN107527619B (en) | | Method and device for positioning voice control service |
CN108038102B (en) | | Method and device for recommending expression image, terminal and storage medium |
CN104378441A (en) | | Schedule creating method and device |
US11335348B2 | | Input method, device, apparatus, and storage medium |
CN107945806B (en) | | User identification method and device based on sound characteristics |
CN111831806B (en) | | Semantic integrity determination method, device, electronic equipment and storage medium |
CN110781323A (en) | | Method and device for determining label of multimedia resource, electronic equipment and storage medium |
CN112068711A (en) | | Information recommendation method and device of input method and electronic equipment |
CN105447109A (en) | | Key word searching method and apparatus |
CN112037756A (en) | | Voice processing method, apparatus and medium |
CN110610720B (en) | | Data processing method and device and data processing device |
CN111797262A (en) | | Poetry generation method and device, electronic equipment and storage medium |
CN113177419B (en) | | Text rewriting method and device, storage medium and electronic equipment |
CN113923517B (en) | | Background music generation method and device and electronic equipment |
CN110728981A (en) | | Interactive function execution method and device, electronic equipment and storage medium |
CN113656557A (en) | | Message reply method, device, storage medium and electronic equipment |
CN113936697B (en) | | Voice processing method and device for voice processing |
CN112948565A (en) | | Man-machine conversation method, device, electronic equipment and storage medium |
CN113709548B (en) | | Image-based multimedia data synthesis method, device, equipment and storage medium |
CN112130839A (en) | | Method for constructing database, method for voice programming and related device |
CN111831132A (en) | | Information recommendation method and device and electronic equipment |
CN114550691A (en) | | Multi-tone word disambiguation method and device, electronic equipment and readable storage medium |
CN113115104B (en) | | Video processing method and device, electronic equipment and storage medium |
CN113420553A (en) | | Text generation method and device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |