CN114242025A - Method and device for generating accompaniment and storage medium

Method and device for generating accompaniment and storage medium

Info

Publication number: CN114242025A
Application number: CN202111527995.3A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: sound, virtual, dry sound, chorus
Legal status: Pending
Inventors: 张超鹏, 翁志强, 寇志娟
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority application: CN202111527995.3A
Related publication: WO2023109278A1 (PCT/CN2022/124590)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H: ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00: Details of electrophonic musical instruments
    • G10H1/36: Accompaniment arrangements
    • G10H2210/00: Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/005: Musical accompaniment, i.e. complete instrumental rhythm synthesis added to a performed melody, e.g. as output by drum machines
    • G10H2210/341: Rhythm pattern selection, synthesis or composition

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The embodiment of the application discloses an accompaniment generation method, device and storage medium, wherein the accompaniment generation method comprises the following steps: acquiring a dry sound signal set, wherein the dry sound signal set comprises x dry sound signals corresponding to a target song; generating a virtual sound signal based on the corresponding dry sound signal at each of N virtual three-dimensional space sound image positions, wherein the x dry sound signals correspond to the N virtual three-dimensional space sound image positions, the N positions are pairwise different, and each position is allowed to correspond to one or more of the x dry sound signals; merging the virtual sound signals in the virtual sound signal set to obtain a chorus dry sound; and performing sound effect synthesis processing on the chorus dry sound and the background music of the target song according to a sound effect optimization rule to obtain the accompaniment of the target song. With the method and the device, a stereo surround effect of the accompaniment can be realized.

Description

Method and device for generating accompaniment and storage medium
Technical Field
The present application relates to the field of computer application technologies, and in particular, to an accompaniment generation method, device, and storage medium.
Background
With the development of Virtual Reality (VR) technology, virtual three-dimensional (3D) audio technology is also gradually being refined. Virtual 3D audio technology can create a three-dimensional dynamic effect and, when applied to singing software, can give the user an experience of being personally on the scene. At present, when virtual 3D audio technology is applied to a multi-person chorus scene, the existing technical solution is to directly apply weighted superposition to the multiple vocal tracks; however, the resulting sound effect is not sufficiently stereophonic, leading to a poor user experience.
Disclosure of Invention
The embodiments of the present application provide an accompaniment generation method, device, and storage medium, which can realize an all-around audio stereo surround effect and improve the user experience.
In one aspect, an embodiment of the present application provides an accompaniment generation method, including:
acquiring a dry sound signal set, wherein the dry sound signal set comprises x dry sound signals corresponding to a target song, and x is an integer greater than 1;
generating a virtual sound signal based on the corresponding dry sound signal at each of N virtual three-dimensional space sound image positions, wherein the x dry sound signals correspond to the N virtual three-dimensional space sound image positions, N is an integer greater than 1, the N virtual three-dimensional space sound image positions are pairwise different, and each virtual three-dimensional space sound image position is allowed to correspond to one or more of the x dry sound signals;
merging each virtual sound signal in a virtual sound signal set to obtain chorus dry sound, wherein the virtual sound signal set comprises: a virtual sound signal at each of the N virtual three-dimensional spatial sound image positions;
and performing sound effect synthesis processing on the chorus dry sound and the background music of the target song according to a sound effect optimization rule to obtain the accompaniment of the target song.
In one aspect, an embodiment of the present application provides an accompaniment playing processing method, including:
displaying a user interface, wherein the user interface is used for receiving a selection instruction of a target song;
if the selection instruction received on the user interface indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern, acquiring the accompaniment corresponding to the target song;
playing the accompaniment corresponding to the target song;
the accompaniment is generated from a chorus dry sound and background music, the chorus dry sound is generated from a plurality of dry sound signals in a dry sound signal set, the dry sound signals in the dry sound signal set correspond to a plurality of different virtual three-dimensional space sound image positions, and the dry sound signal set is obtained from the dry sound signals recorded by a plurality of users for the target song.
In another aspect, an embodiment of the present application provides an accompaniment generating apparatus, including:
the acquisition unit is used for acquiring a dry sound signal set, wherein the dry sound signal set comprises x dry sound signals corresponding to a target song, and x is an integer greater than 1; and for generating a virtual sound signal based on the corresponding dry sound signal at each of N virtual three-dimensional space sound image positions, wherein the x dry sound signals correspond to the N virtual three-dimensional space sound image positions, N is an integer greater than 1, the N virtual three-dimensional space sound image positions are pairwise different, and each virtual three-dimensional space sound image position is allowed to correspond to one or more of the x dry sound signals.
A processing unit, configured to perform merging processing on each virtual sound signal in a virtual sound signal set to obtain chorus dry sound, where the virtual sound signal set includes: a virtual sound signal at each of the N virtual three-dimensional spatial sound image positions; and performing sound effect synthesis processing on the chorus dry sound and the background music of the target song according to a sound effect optimization rule to obtain the accompaniment of the target song.
In another aspect, an embodiment of the present application provides an accompaniment playing processing apparatus, including:
the system comprises an acquisition unit, a storage unit and a display unit, wherein the acquisition unit is used for displaying a user interface which is used for receiving a selection instruction of a target song; and if the selection instruction received by the user interface indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern, acquiring the accompaniment corresponding to the target song.
The processing unit is used for playing the accompaniment corresponding to the target song; the accompaniment is generated according to chorus dry sound and background music, the chorus dry sound is generated according to a plurality of dry sound signals in a dry sound signal set, the dry sound signals in the dry sound signal set correspond to a plurality of different virtual three-dimensional space sound image positions, and the dry sound signal set is obtained according to the dry sound signals recorded by a plurality of users aiming at the target song.
Accordingly, an embodiment of the present application provides an electronic device, including: the system comprises a memory, a processor and a network interface, wherein the processor is connected with the memory and the network interface, the network interface is used for providing a network communication function, the memory is used for storing program codes, and the processor is used for calling the program codes and executing the method in the embodiment of the application.
Accordingly, an embodiment of the present application provides a computer-readable storage medium, including: the computer-readable storage medium stores therein a computer program that, when executed by a processor, implements the method in the embodiments of the present application.
Accordingly, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium, and a processor of a computer device reads and executes the computer instructions from the computer-readable storage medium, so that the computer device executes the method in the embodiments of the present application.
By implementing the embodiments of the application, on the one hand, the virtual sound signals of all the dry sound signals in the dry sound signal set corresponding to the target song at different virtual three-dimensional space sound image positions can be acquired, the virtual sound signals corresponding to the dry sound signals are then merged to obtain a chorus dry sound, and finally sound effect synthesis processing is performed on the chorus dry sound and the background music of the target song according to a sound effect optimization rule to obtain the accompaniment of the target song; on the other hand, a selection instruction of the user for the target song may be received, and when the received selection instruction indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern, the accompaniment corresponding to the target song may be acquired and played. In this way, the sound image positions of the dry sound signals in the virtual three-dimensional space can be simulated in an all-around manner, an audio stereo surround effect is realized, and the user hears the accompaniment with a feeling of being personally on the scene, thereby obtaining an immersive experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic view of an application scenario of a method for generating an accompaniment according to an embodiment of the present application;
fig. 2 is a schematic flowchart of an accompaniment generating method according to an embodiment of the present application;
fig. 3a is a schematic diagram of a horizontal plane, an upper plane and a lower plane in a method for generating an accompaniment according to an embodiment of the present application;
fig. 3b is a schematic diagram of a virtual three-dimensional spatial sound image position in an accompaniment generation method according to an embodiment of the present application;
fig. 3c is a schematic diagram illustrating dividing planes by a preset angle interval in an accompaniment generation method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another accompaniment generation method provided in the embodiment of the present application;
fig. 5 is a schematic flowchart illustrating the acquisition of the virtual sound signals corresponding to the dry sound signals in the dry sound signal set in the accompaniment generating method according to the embodiment of the present application;
fig. 6 is a flowchart illustrating an accompaniment playing processing method according to an embodiment of the present application;
fig. 7a is a schematic flowchart illustrating the acquisition of the accompaniment of a target song in the accompaniment playing processing method according to the embodiment of the present application;
fig. 7b is a schematic diagram illustrating a first single sentence interface displayed in the accompaniment play processing method according to the embodiment of the present application;
fig. 7c is a schematic diagram illustrating a second single sentence interface displayed in the accompaniment play processing method according to the embodiment of the present application;
fig. 8a is a schematic structural diagram of an accompaniment generating apparatus according to an embodiment of the present application;
fig. 8b is a schematic structural diagram of an accompaniment playing processing device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be applied to the following explanations.
1) Dry sound signal: the dry sound signal in the embodiment of the present application refers to a pure human sound signal without accompanying music, and the dry sound signal is a mono sound signal, that is, does not include any direction information.
2) Binaural signal: the binaural principle is that, when hearing a sound, a person can judge the specific position of the sound source from the phase difference between the sounds arriving at the left ear and the right ear. The binaural signals in the embodiment of the present application refer to a left-channel sound signal and a right-channel sound signal.
3) Head Related Transfer Functions (HRTFs): HRTFs, which may also be referred to as binaural transfer functions, describe the transmission of sound waves from a sound source to the two ears. An HRTF is a group of filters; using the principle that convolution in the time domain is equivalent to multiplication in the frequency domain, the virtual sound signals transmitted to the two ears can be calculated from the HRTF data corresponding to the sound source position information.
The embodiments of the application provide an accompaniment generation method, device, and storage medium. Through the embodiments of the application, a dry sound signal set formed by a plurality of dry sound signals corresponding to the same target song can be obtained, the virtual sound signals corresponding to the dry sound signals in the set at different virtual three-dimensional space sound image positions can be acquired, the virtual sound signals corresponding to the dry sound signals are then merged to obtain a chorus dry sound, and finally sound effect synthesis processing is performed on the chorus dry sound and the background music of the target song according to a sound effect optimization rule to obtain the accompaniment of the target song. In this way, on the one hand, the sound image position of each dry sound signal in the dry sound signal set in the virtual three-dimensional space can be simulated in an all-around manner, and the chorus dry sound is obtained by merging the virtual sound signals corresponding to the dry sound signals at the different virtual three-dimensional space sound image positions, realizing an audio stereo surround effect; on the other hand, the chorus dry sound and the background music can be synthesized according to the sound effect optimization rule to obtain the accompaniment, which enhances the immersiveness of the audio effect. In general, compared with directly superimposing the dry sound signals, the present application can obtain a richer audio processing effect and improves the user experience.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a method for generating an accompaniment according to an embodiment of the present application. As shown in fig. 1, the application scenario may include a smart device 100, where the smart device communicates with a server 110 in a wired or wireless manner, and the server 110 is connected to a database 120.
The method for generating the accompaniment provided by the embodiment of the application can be implemented by an electronic device such as the smart device 100. For example, when the received selection instruction indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern, the smart device 100 may acquire a set of dry sound signals corresponding to the target song, generate a virtual sound signal based on the dry sound signal corresponding to each of the N virtual three-dimensional space sound image positions (for example, the virtual sound signal may be a binaural signal), merge the virtual sound signals corresponding to the dry sound signals to obtain a chorus dry sound, and perform sound effect synthesis processing on the chorus dry sound and the background music of the target song according to a sound effect optimization rule to obtain the accompaniment. As an example, the smart device 100 of fig. 1 shows a "chorus accompaniment" option; the user may generate a selection instruction for the chorus accompaniment pattern through voice control, or after triggering a selection control displayed on the user interface. The set of dry sound signals may be pre-stored locally by the smart device 100, or may be obtained by the smart device 100 from the server 110 or the database 120.
The accompaniment generation method provided by the embodiment of the present application may also be implemented by an electronic device such as the server 110. For example, when the received selection instruction indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern, the server 110 may acquire a set of dry sound signals corresponding to the target song, generate a virtual sound signal based on the dry sound signal corresponding to each of the N virtual three-dimensional space sound image positions (the virtual sound signal may be a binaural signal), merge the virtual sound signals corresponding to the dry sound signals to obtain a chorus dry sound, and perform sound effect synthesis processing on the chorus dry sound and the background music of the target song according to a sound effect optimization rule to obtain the accompaniment. The set of dry sound signals may be pre-stored locally by the server 110, or may be acquired by the server 110 from the database 120, and the finally obtained accompaniment may be stored locally or in the database 120 to be retrieved when needed. Of course, the server 110 need not start generating the accompaniment only when the received selection instruction indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern; the server 110 may start executing the steps of the accompaniment generation method of the present application at an appropriate time, for example when the load of the server 110 is low, when the server 110 receives a new dry sound signal for the target song, or when a management operation for generating the accompaniment is received. Preferably, the chorus-version accompaniment may be generated in advance and stored on the server. After the accompaniments of a large number of songs have been generated, the user may select "chorus accompaniment" on the user interface of the smart device 100 to issue a selection instruction for the target song, so that the server 110 can, in response, find the chorus accompaniment of the target song among the generated accompaniments and deliver it to the smart device 100.
The accompaniment generation method provided by the embodiment of the present application may also be cooperatively implemented by an electronic device such as the smart device 100 and an electronic device such as the server 110. For example, the server 110 may generate a virtual sound signal based on a corresponding dry sound signal at each of the N virtual three-dimensional space sound image positions, where the virtual sound signal may be a binaural signal, merge the virtual sound signals corresponding to the dry sound signals to obtain a chorus dry sound, perform sound effect synthesis processing on the chorus dry sound and background music of the target song according to a sound effect optimization rule to obtain an accompaniment, and issue the obtained accompaniment to the smart device 100.
The accompaniment generation method provided by the embodiment of the present application may also be implemented by an electronic device such as the smart device 100 or the server 110 running a computer program. For example, the computer program may be a native program or a software module in an operating system, a local application (APP), or an applet; in short, the computer program may be an application program, a module, or a plug-in in any form, which is not limited in this embodiment of the present application.
The intelligent device related in the embodiment of the present application may be a personal computer, a notebook computer, a smart phone, a tablet computer, a smart watch, an intelligent voice interaction device, an intelligent household appliance, a vehicle-mounted terminal, an intelligent wearable device, and the like, but is not limited thereto. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The smart device and the server may be directly or indirectly connected through wired or wireless communication, which is not specifically limited in this embodiment of the application.
It should be understood that the numbers of dry sound signals and virtual three-dimensional space sound image locations shown in fig. 1 are merely illustrative, and that any number of dry sound signals may be in the set of dry sound signals and any number of virtual three-dimensional space sound image locations may be in the virtual three-dimensional space, as desired for an implementation.
Further, please refer to fig. 2, fig. 2 is a schematic flow chart of an accompaniment generation method provided in the embodiment of the present application, and the method of the embodiment of the present application may be applied to an electronic device, for example, an intelligent device such as a smart phone, a tablet computer, a smart wearable device, a personal computer, or a server. The method may include, but is not limited to, the steps of:
S201: A dry sound signal set is acquired.
In the embodiment of the application, the electronic device may acquire a dry sound signal set, where the dry sound signal set includes a number of dry sound signals corresponding to a target song.
In one embodiment, the set of dry sound signals may be obtained from an audio database, where the audio database includes initial dry sound signals recorded by multiple users singing the same song; it should be noted that the initial dry sound signals in the audio database are recorded only with the users' authorization and consent. The electronic device may screen out the dry sound signals that satisfy a condition according to the sound parameters of the initial dry sound signals to form the dry sound signal set.
In one embodiment, the electronic device may screen the dry sound signals satisfying the condition out of the initial dry sound signal set according to intonation characteristic parameters and timbre characteristic parameters. The intonation characteristic parameters may include any one or more of a pitch parameter, a tempo parameter, and a prosody parameter; dry sound signals screened out according to the intonation characteristic parameters have pitch and rhythm highly consistent with the accompaniment melody of the song. The timbre characteristic parameters may include any one or more of a noise parameter, an energy parameter, and a speed parameter; dry sound signals screened out according to the timbre characteristic parameters have characteristics such as clear audio, appropriate audio energy, and uniform audio speed. The order of screening is not limited: for example, the electronic device may first screen out the dry sound signals satisfying the preset intonation characteristic parameters and then, from those, screen out the ones also satisfying the preset timbre characteristic parameters, or it may screen according to the timbre characteristic parameters first and then according to the intonation characteristic parameters. The dry sound signals screened out of the initial dry sound signal set in this way form a dry sound signal set with good intonation and sound quality.
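As an illustrative aside (not part of the patent text), the two-stage screening described above can be sketched in Python as follows; the score fields and thresholds are assumptions introduced for this example only:

```python
# A minimal sketch of the screening step, assuming each initial dry sound
# signal carries precomputed intonation and timbre scores; the field names
# and thresholds below are illustrative, not from the patent.
from dataclasses import dataclass

@dataclass
class DrySignal:
    audio: list          # mono PCM samples
    pitch_score: float   # intonation: agreement with the song's melody, 0..1
    rhythm_score: float  # intonation: agreement with the accompaniment beat, 0..1
    snr_db: float        # timbre: signal-to-noise ratio
    energy_db: float     # timbre: mean audio energy

def screen(initial_set, pitch_min=0.8, rhythm_min=0.8, snr_min=20.0, energy_min=-30.0):
    """First filter on intonation parameters, then on timbre parameters."""
    in_tune = [s for s in initial_set
               if s.pitch_score >= pitch_min and s.rhythm_score >= rhythm_min]
    return [s for s in in_tune
            if s.snr_db >= snr_min and s.energy_db >= energy_min]
```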
S202: A virtual sound signal is generated based on the corresponding dry sound signal at each of the N virtual three-dimensional space sound image positions.
In this embodiment, the electronic device may simulate different sound image positions of each dry sound signal in the dry sound signal set in the virtual three-dimensional space, and generate a virtual sound signal based on a corresponding dry sound signal at each virtual three-dimensional space sound image position in the N virtual three-dimensional space sound image positions, where the virtual sound signal may be a binaural signal, for example. Wherein the N virtual three-dimensional spatial sound image locations are different, and each virtual three-dimensional spatial sound image location may correspond to one or more dry sound signals.
In one embodiment, the N virtual three-dimensional space sound image positions may be simulated in the virtual three-dimensional space as follows. As shown in fig. 3a, relative to the listener's head, the positive directions of the x, y and z axes of the virtual three-dimensional space correspond to the direction directly in front, the left side and the vertex respectively, and the virtual three-dimensional space is divided into three planes: a horizontal plane 301, an upper plane 302 whose included angle with the horizontal plane is a first angle threshold, and a lower plane 303 whose included angle with the horizontal plane is a second angle threshold. As shown in fig. 3b, each virtual three-dimensional space sound image position in the virtual three-dimensional space is determined by an azimuth angle and an elevation angle; with θ denoting the azimuth angle and φ denoting the elevation angle, each virtual three-dimensional space sound image position can be written as (θ, φ). Correspondingly, the horizontal plane 301 is the plane corresponding to an elevation angle of 0°; the upper plane is the plane whose elevation angle equals the first angle threshold, which may be any angle value above the horizontal plane; and the lower plane is the plane whose elevation angle equals the second angle threshold, which may be any angle value below the horizontal plane. For example, the upper plane may be the plane corresponding to an elevation angle of 40°, and the lower plane the plane corresponding to an elevation angle of -40°. The azimuth angle θ describes the angle within a plane, measured clockwise, between the target direction line and the sound image position. Further, as shown in fig. 3c, a plurality of virtual three-dimensional space sound image positions can be obtained by dividing the planes corresponding to the different elevation angles at their respective preset angle intervals. Specifically, dividing the horizontal plane at intervals of a first preset angle yields n1 virtual three-dimensional space sound image positions on the horizontal plane; dividing the upper plane at intervals of a second preset angle yields n2 virtual three-dimensional space sound image positions on the upper plane; and dividing the lower plane at intervals of a third preset angle yields n3 virtual three-dimensional space sound image positions on the lower plane. For example, assuming the first preset angle is 10° and the second and third preset angles are both 15°, dividing the horizontal plane at 10° intervals yields 36 virtual three-dimensional space sound image positions, and dividing the upper plane and the lower plane at 15° intervals yields 24 positions each, giving 84 different virtual three-dimensional space sound image positions in total. It should be noted that the first, second and third preset angles may be any preset angle values; the specific numerical values here are only examples and do not limit the embodiments of the present application. In this way, a plurality of virtual three-dimensional space sound image positions can be virtualized at different azimuth intervals on three different planes in the virtual three-dimensional space, realizing an all-around, immersive simulation of the sound sources.
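As an illustrative aside (not part of the patent text), building the sound image position grid with the example angles above can be sketched as follows:

```python
# A sketch of building the virtual three-dimensional space sound image
# positions: azimuths at 10 degree steps on the horizontal plane
# (elevation 0) and at 15 degree steps on the upper/lower planes
# (elevation +/-40), using the example figures from the text; other
# thresholds and intervals work the same way.

def sound_image_positions(plane_specs=((0, 10), (40, 15), (-40, 15))):
    """plane_specs: (elevation_deg, azimuth_step_deg) per plane."""
    positions = []  # list of (azimuth_deg, elevation_deg) pairs
    for elevation, azimuth_step in plane_specs:
        positions += [(az, elevation) for az in range(0, 360, azimuth_step)]
    return positions

print(len(sound_image_positions()))  # 36 + 24 + 24 = 84 positions
```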
In one embodiment, each virtual three-dimensional space sound image position may correspond to one dry sound signal or to a plurality of dry sound signals. The electronic device may obtain the virtual sound signal corresponding to the one or more dry sound signals at each virtual three-dimensional space sound image position. Specifically, the electronic device may obtain the virtual sound signal corresponding to each dry sound signal in the virtual three-dimensional space as follows: acquire the azimuth angle and the elevation angle of the virtual three-dimensional space sound image position corresponding to the dry sound signal, determine the head related transfer function (HRTF) corresponding to that position according to its azimuth angle and elevation angle, and calculate the virtual sound signal corresponding to the dry sound signal at that position from the azimuth angle, the elevation angle, and the corresponding HRTF data. For example, if the azimuth angle and the elevation angle of the virtual three-dimensional space sound image position corresponding to the dry sound signal X are (θ, φ), and the HRTF data corresponding to that position is H(θ, φ), then the binaural signal obtained by the calculation, consisting of a left channel signal Y_L and a right channel signal Y_R, is the virtual sound signal corresponding to the dry sound signal X at that virtual three-dimensional space sound image position.
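As an illustrative aside (not part of the patent text), rendering one dry sound signal at one sound image position can be sketched as below, assuming the HRTF is available as a pair of head-related impulse responses (HRIRs); the placeholder data stands in for a real HRTF database:

```python
# A minimal sketch: convolve a mono dry signal with the HRIR pair of one
# position to obtain the binaural virtual sound signal (Y_L, Y_R).
import numpy as np
from scipy.signal import fftconvolve

def render_at_position(dry, hrir_left, hrir_right):
    y_left = fftconvolve(dry, hrir_left, mode="full")
    y_right = fftconvolve(dry, hrir_right, mode="full")
    return y_left, y_right

# Placeholder data: a 1 s mono dry signal and 256-tap HRIRs.
rng = np.random.default_rng(0)
dry = rng.standard_normal(44100)
y_l, y_r = render_at_position(dry, rng.standard_normal(256), rng.standard_normal(256))
```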
In an embodiment, the electronic device may obtain a part of the dry sound signals from the dry sound signal set, for example by random selection or by screening out, according to a further screening rule, dry sound signals with better intonation and sound quality, and perform a delay processing operation on each of the selected dry sound signals to obtain the delayed binaural signal corresponding to each of them. Specifically, when the delay processing operation is performed on one dry sound signal, 8 pairs of different time parameters may be selected; note that the 8 pairs of time parameters consist of 8 time parameters for acquiring the delayed left channel signal and 8 time parameters for acquiring the delayed right channel signal, so 16 time parameters are selected in total. For example, 16 different values may be selected as the time parameters from the range of 21 ms to 79 ms (80 ms being chosen as the reverberation time according to a typical room impulse response), or 16 (or some other number of) different values may be randomly selected as the time parameters from another reasonable range according to actual needs. In this way, dry sound signals located at the left ear or the right ear of the listener can be simulated, making the audio effect richer. In one embodiment, the selection mode and the setting of the time parameters (delay duration parameters) used in the delay processing operation can be adjusted through an interface, so that a user producing the chorus audio can configure them flexibly. It should be noted that the step of acquiring the virtual sound signals and the steps of acquiring the delayed left channel signals and the delayed right channel signals may be executed simultaneously or sequentially, which is not limited by the present application.
S203: and merging each virtual sound signal in the virtual sound signal set to obtain chorus dry sound.
In this embodiment, the electronic device may perform merging processing on each virtual sound signal in the set of virtual sound signals to obtain a chorus dry sound.
In one embodiment, the merging of the virtual sound signals corresponding to the dry sound signals can be realized by normalization processing, so that the loudness of the merged virtual sound signal is adjusted to [-1 dB, 1 dB]. The virtual sound signals involved in the merging processing include: the virtual sound signal corresponding to the dry sound signal at each of the N virtual three-dimensional space sound image positions, and each delayed binaural signal obtained by performing the delay processing operation on part of the dry sound signals in the dry sound signal set.
S203: and performing sound effect synthesis processing on the chorus dry sound and the background music of the target song according to a sound effect optimization rule to obtain the accompaniment.
In the embodiment of the application, the electronic device performs sound effect synthesis processing on the chorus dry sound and the background music of the target song according to the sound effect optimization rule to obtain the final accompaniment. The sound effect optimization rule may, for example, adjust the sound parameters of the background music of the target song and of the virtual sound signals corresponding to the dry sound signals, where the sound parameters may be common adjustable parameters such as loudness and timbre.
In one embodiment, after obtaining the chorus dry sound, the electronic device may obtain the background music of the target song, and if the energy relationship between the chorus dry sound and the background music does not satisfy an energy ratio condition, the electronic device may adjust it. Here, the energy ratio condition may be that the ratio between the energy value of the chorus dry sound and the energy value of the background music is less than a ratio threshold, or that the loudness of the chorus dry sound is 3 dB lower than the loudness of the background music. In this way, the chorus dry sound can be prevented from having more energy than the background music of the target song, so that the finally obtained accompaniment is more harmonious.
By implementing the embodiment of the application, the virtual sound signals corresponding to the dry sound signals at different sound image positions in the virtual three-dimensional space can be obtained, the chorus dry sound is then obtained by merging these virtual sound signals, and sound effect synthesis processing is performed on the chorus dry sound and the background music of the target song according to the sound effect optimization rule to obtain the accompaniment of the target song. A stereo surround effect in the audio listening experience is thus achieved, the immersiveness of the audio effect is enhanced, and the user experience is good.
Further, please refer to fig. 4, where fig. 4 is a schematic flowchart of another accompaniment generation method provided in the embodiment of the present application, and the method of the embodiment of the present application may be applied to an electronic device, such as a smart phone, a tablet computer, a smart wearable device, a personal computer, a server, and the like. The method may include, but is not limited to, the steps of:
S401: An initial dry sound signal set is obtained from an audio database.
In this embodiment of the present application, the electronic device may obtain an initial set of dry sound signals from an audio database; it should be noted that the initial dry sound signals in the audio database are recorded only with the users' authorization and consent.
In one embodiment, the audio database may be a stand-alone database or may be integrated with the electronic device, i.e., the audio database may be regarded as stored inside the electronic device. Here, the initial dry sound signal set refers to the set of original dry sound signals, recorded with authorization, of users singing the same song in the audio database.
S402: and screening out dry sound signals from the initial dry sound signal set according to the sound parameters of the initial dry sound signals, wherein the screened dry sound signals form a dry sound signal set.
In this embodiment, the electronic device may screen out, from the initial dry sound signal set, dry sound signals that satisfy a condition according to the sound parameter of each initial dry sound signal, so as to narrow down the initial dry sound signal set to form a dry sound signal set.
In one embodiment, the sound parameters of the initial dry sound signals may include intonation characteristic parameters and timbre characteristic parameters. The intonation characteristic parameters may include any one or more of a pitch parameter, a tempo parameter, and a prosody parameter, and the timbre characteristic parameters may include any one or more of a noise parameter, an energy parameter, and a speed parameter. In this way, initial dry sound signals with poor audio effects, such as noisy audio, off-key singing, overly short duration, low audio energy, or crackling, can be removed from the initial dry sound signal set, yielding a dry sound signal set with good intonation and sound quality.
S403: and acquiring a head related transfer function corresponding to each virtual three-dimensional space acoustic image position of the N virtual three-dimensional space acoustic image positions.
In this embodiment of the application, the electronic device may acquire the N virtual three-dimensional space sound image positions in the virtual three-dimensional space, and then acquire the head related transfer function corresponding to each of those positions.
In one embodiment, the head related transfer functions corresponding to the sound image positions in the virtual three-dimensional space may be pre-saved in a head related transfer function database, so that the electronic device can call up the corresponding head related transfer function from that database according to each sound image position.
S404: and processing the dry sound signal corresponding to the target virtual three-dimensional space sound image position through the head related transfer function corresponding to the target virtual three-dimensional space sound image position to obtain the virtual sound signal at the target virtual three-dimensional space sound image position.
In this embodiment, the electronic device may process the target dry sound signal according to a head-related transfer function corresponding to the target virtual three-dimensional space sound image position to obtain a virtual sound signal corresponding to the target dry sound signal at the target virtual three-dimensional space sound image position, where the target virtual three-dimensional space sound image position may be any one of N virtual three-dimensional space sound image positions, and the target dry sound signal may be any one of a set of dry sound signals.
In one embodiment, the head related transfer function corresponding to the target virtual three-dimensional space sound image position is the HRTF data corresponding to that position. The HRTF data can be determined from known HRTF data according to the azimuth angle and the elevation angle of the target virtual three-dimensional space sound image position, and the electronic device can then convolve the target dry sound signal with the HRTF data corresponding to the target position to obtain the virtual sound signal corresponding to the target dry sound signal at the target virtual three-dimensional space sound image position.
S405: p dry sound signals are obtained from the x dry sound signals included in the dry sound signal set.
In the embodiment of the present application, the electronic device may randomly acquire p dry sound signals from the x dry sound signals included in the dry sound signal set. It should be noted that S404 and S405 may be executed simultaneously or sequentially, and the present application is not limited thereto.
S406: and carrying out time delay processing operation on each dry sound signal in the p dry sound signals to obtain a time delay left channel signal and a time delay right channel signal corresponding to each dry sound signal in the p dry sound signals.
In this embodiment of the application, the electronic device may perform a delay processing operation with m1 time parameters on each of the p dry sound signals to obtain m1 delayed dry sound signals corresponding to each of them, and obtain the delayed left channel signal corresponding to each of the p dry sound signals by superimposing its m1 delayed dry sound signals, where m1 is a positive integer; likewise, the electronic device may perform a delay processing operation with m2 time parameters on each of the p dry sound signals to obtain m2 delayed dry sound signals corresponding to each of them, and obtain the delayed right channel signal corresponding to each of the p dry sound signals by superimposing its m2 delayed dry sound signals, where m2 is a positive integer.
In an embodiment, the electronic device may process one dry sound signal with 16 delays with different time parameters to obtain 16 dry sound signals with different delays and attenuation degrees, then divide these 16 signals evenly into two groups, and superimpose the signals within each group to obtain the delayed left channel signal and the delayed right channel signal corresponding to that dry sound signal.
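As an illustrative aside (not part of the patent text), the delay operation can be sketched as follows; the 21-79 ms range comes from the text, while the per-tap attenuation and the alternating grouping of the 16 taps are assumptions:

```python
# Pass one dry signal through 16 delays with distinct time parameters,
# split the results into two groups of 8, and sum each group into the
# delayed left and right channels.
import numpy as np

def delayed_binaural(dry, sample_rate=44100, n_taps=16, lo_ms=21.0, hi_ms=79.0):
    delays_ms = np.linspace(lo_ms, hi_ms, n_taps)
    taps = []
    for i, d in enumerate(delays_ms):
        shift = int(round(d * sample_rate / 1000.0))
        gain = 0.9 ** (i + 1)            # later taps attenuated more (assumed)
        taps.append(np.pad(dry, (shift, 0)) * gain)
    longest = max(len(t) for t in taps)
    taps = [np.pad(t, (0, longest - len(t))) for t in taps]
    left = sum(taps[0::2])               # 8 taps for the delayed left channel
    right = sum(taps[1::2])              # 8 taps for the delayed right channel
    return left, right
```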
In one embodiment, before obtaining the delayed binaural signal corresponding to each of the p dry sound signals, the sound field of the dry sound signal may be widened by adding a bass enhancement and reverberation simulation module to reduce the correlation between the delayed left channel signal and the delayed right channel signal in the binaural signal obtained by the delay processing. It should be noted that the steps of acquiring the virtual sound signal in S403 and S404 and the steps of acquiring the delayed left channel signal and the delayed right channel signal in S405 and S406 may be executed simultaneously or sequentially, which is not limited in this application, where S405 and S406 are two optional steps.
S407: and merging each virtual sound signal in the virtual sound signal set to obtain chorus dry sound.
In this embodiment, the electronic device may perform merging processing on each virtual sound signal in the set of virtual sound signals to obtain a chorus dry sound. Here, the virtual sound signal set includes a virtual sound signal corresponding to each dry sound signal acquired by the electronic device by simulating N virtual three-dimensional spatial positions, and a delayed binaural signal acquired by the electronic device by performing a delay processing operation on p dry sound signals in the dry sound signal set.
In one embodiment, each virtual sound signal in the set of virtual sound signals is a binaural signal comprising a left channel signal and a right channel signal; the merging can process the left channel signals and the right channel signals separately, with the same processing rule applying to both channels. Here, the merging may be implemented by normalization processing such that the loudness of the merged binaural signal lies in [-1 dB, 1 dB]. For example, assuming there are 1000 binaural signals, comprising 1000 left channel signals and 1000 right channel signals, each left channel signal is normalized, the 1000 normalized left channel signals are added together, and the sum is divided by 1000 to obtain the merged left channel signal. The chorus dry sound is obtained in this way.
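As an illustrative aside (not part of the patent text), the merging step can be sketched as below; peak normalization is used as one plausible reading of the normalization described above, and all signals are assumed to have been padded to equal length:

```python
import numpy as np

def merge_channel(signals):
    """signals: list of equal-length 1-D arrays for one channel."""
    normalized = [s / max(np.max(np.abs(s)), 1e-12) for s in signals]
    return np.sum(normalized, axis=0) / len(normalized)

def merge_virtual_signals(binaural_signals):
    """binaural_signals: list of (left, right) array pairs."""
    lefts = [l for l, _ in binaural_signals]
    rights = [r for _, r in binaural_signals]
    return merge_channel(lefts), merge_channel(rights)  # the chorus dry sound
```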
In one embodiment, the energy relationship between the obtained chorus dry sound and the background music of the target song may or may not satisfy the energy ratio condition. If the obtained energy relationship between the chorus dry sound and the background music meets the energy ratio condition, step S408 can be omitted; accordingly, if the energy relationship between the obtained chorus dry sound and the background music does not satisfy the energy ratio condition, step S408 is executed.
S408: and acquiring background music of the target song, and adjusting the energy relation between the chorus dry sound and the background music.
In the embodiment of the application, the electronic device may adjust the energy relationship between the chorus dry sound and the corresponding background music according to the background music of the target song, and the energy relationship between the adjusted chorus dry sound and the adjusted background music satisfies the energy ratio condition.
In one embodiment, the chorus dry sound may have too much energy and overpower the background music. This situation can be handled by adjusting the chorus dry sound and the background music so that their adjusted energy relationship satisfies the energy ratio condition. The energy ratio condition may be that the ratio between the energy value of the chorus dry sound and the energy value of the background music is less than a ratio threshold, or that the loudness of the chorus dry sound is 3 dB lower than the loudness of the background music.
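As an illustrative aside (not part of the patent text), enforcing the "3 dB lower" form of the energy ratio condition can be sketched as follows; RMS level is used as a stand-in for loudness, which is an assumption, since the patent does not fix a loudness measure:

```python
import numpy as np

def rms_db(x):
    return 20.0 * np.log10(max(np.sqrt(np.mean(np.square(x))), 1e-12))

def balance_chorus(chorus, background, margin_db=3.0):
    """Attenuate the chorus dry sound until it sits margin_db below the background."""
    excess = rms_db(chorus) - (rms_db(background) - margin_db)
    if excess > 0:                        # chorus too loud: scale it down
        chorus = chorus * 10.0 ** (-excess / 20.0)
    return chorus
```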
In an embodiment, after the background music of the target song is obtained, the background music may also be processed in the same way as the generation of the virtual sound signals based on the corresponding dry sound signals at the N virtual three-dimensional space sound image positions described in S202 above, so that the chorus dry sound and the background music have similar effects and a more harmonious and unified auditory experience is achieved.
S409: and carrying out frequency spectrum equalization processing on the chorus dry sound at a preset frequency section.
In the embodiment of the application, the electronic device may perform spectrum equalization processing on the chorus dry sound in a preset frequency band.
In one embodiment, the electronic device may achieve the spectrum equalization by adding a spectral notch in a preset frequency band; for example, the electronic device may apply a notch of around 6 dB around 4 kHz. In this way, the chorus dry sound sounds more natural, and high-frequency electrical-noise artifacts caused by spectral incompatibility are prevented.
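As an illustrative aside (not part of the patent text), the roughly 6 dB attenuation around 4 kHz can be sketched as a standard peaking-equalizer biquad (RBJ audio-EQ cookbook) with negative gain; the Q value is an assumption:

```python
import numpy as np
from scipy.signal import lfilter

def peaking_eq(x, fs, f0=4000.0, gain_db=-6.0, q=1.0):
    a_gain = 10.0 ** (gain_db / 40.0)     # amplitude factor for the cut
    w0 = 2.0 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2.0 * q)
    b = [1 + alpha * a_gain, -2 * np.cos(w0), 1 - alpha * a_gain]
    a = [1 + alpha / a_gain, -2 * np.cos(w0), 1 - alpha / a_gain]
    return lfilter(b, a, x)               # lfilter normalizes by a[0]
```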
S410: and acquiring the loudness of the background music.
In the embodiment of the application, the electronic device may obtain the loudness of the background music.
S411: and if the loudness is smaller than the loudness threshold value, the loudness of the background music after adjustment is increased to the loudness threshold value.
In this embodiment of the application, if the loudness of the background music is less than the loudness threshold, the electronic device may raise the loudness of the background music to the loudness threshold. For example, the loudness threshold may be set to -14 dB; if the loudness of the background music is less than -14 dB, the electronic device may raise it to -14 dB.
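As an illustrative aside (not part of the patent text), the loudness floor of S410/S411 can be sketched as below; as before, RMS level stands in for loudness, which is an assumption:

```python
import numpy as np

def rms_db(x):
    return 20.0 * np.log10(max(np.sqrt(np.mean(np.square(x))), 1e-12))

def raise_to_floor(background, floor_db=-14.0):
    """Apply makeup gain if the background music is below the threshold."""
    level = rms_db(background)
    if level < floor_db:
        background = background * 10.0 ** ((floor_db - level) / 20.0)
    return background
```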
S412: the accompaniment is obtained.
In the embodiment of the application, the electronic device superimposes the chorus dry sound and the background music to obtain the final accompaniment. It should be noted that the accompaniment may be obtained through any one or any combination of steps S408 to S411. In one embodiment, S408 to S411 may be executed selectively according to actual needs; for example, the energy relationship between the chorus dry sound and the background music may not need adjustment, in which case S408 is not executed. Similarly, the spectrum equalization of the chorus dry sound in the preset frequency band (S409) is optional, and steps S410 and S411 may likewise be skipped. Fig. 4 merely indicates the technical measures adopted to make the accompaniment more harmonious and natural with better sound quality: the energy relationship adjustment of S408, the spectrum equalization of S409, and the loudness adjustment of S410 and S411; the order of precedence of these three aspects is not limited in this application.
In one implementation, the final accompaniment may be stored in the database after being obtained, so that the electronic device may directly retrieve the corresponding accompaniment from the database when receiving the chorus request of the same song.
By implementing the embodiment of the application, the virtual sound signals corresponding to the dry sound signals simulated at the N virtual three-dimensional space positions can be obtained, and the delayed binaural signals corresponding to the dry sound signals can also be obtained by performing the delay processing operation on them, which enriches the chorus dry sound. In addition, the energy relationship between the chorus dry sound and the background music is adjusted, so that the finally obtained accompaniment sounds more harmonious and natural, and the user can clearly feel the sense of space and immersion when the chorus is played.
Further, please refer to fig. 5, which is a schematic flow chart of acquiring a virtual sound signal in the accompaniment generation method according to an embodiment of the present application. Acquiring a virtual sound signal includes: obtaining the virtual sound signal corresponding to the dry sound signal at each of the N virtual three-dimensional space sound image positions, and performing a delay processing operation on each of p dry sound signals to obtain the delayed binaural signal corresponding to each of the p dry sound signals.
In this embodiment of the application, after the dry sound signal set is obtained, a virtual sound signal corresponding to the dry sound signal at each virtual three-dimensional space sound image position in the N virtual three-dimensional space sound image positions may be obtained, or a delay processing operation may be performed on each dry sound signal in p dry sound signals in the dry sound signal set, so as to obtain a delay binaural signal corresponding to each dry sound signal in the p dry sound signals.
For example, as shown in fig. 5, the corresponding virtual sound signals are obtained in the two manners above for the dry sound signal X and the dry sound signal W included in the dry sound signal set, where the dry sound signal X and the dry sound signal W may each be any dry sound signal in the set. Specifically, after acquiring the dry sound signal X in the dry sound signal set, the electronic device may describe the position information of a virtual three-dimensional space sound image position by its azimuth angle and elevation angle, that is, as (θ, φ). Then, according to the position information (θ, φ), the head related transfer function HRTF corresponding to that virtual three-dimensional space sound image position, denoted h(θ, φ), can be found. After the dry sound signal X is convolved with the head related transfer function h(θ, φ) of the virtual three-dimensional space sound image position, the virtual sound signal corresponding to the dry sound signal at that position is obtained. The virtual sound signal is a two-channel signal comprising a left channel signal Y_L and a right channel signal Y_R. The virtual sound signal corresponding to the dry sound signal obtained in this manner can enhance the user's sense of stereoscopic immersion.
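The convolution step can be sketched in Python as follows, assuming a time-domain HRIR pair has already been looked up for the position (θ, φ) from whichever HRTF database is used (the lookup itself is outside this sketch):

import numpy as np
from scipy.signal import fftconvolve

def spatialize(dry_x, hrir_left, hrir_right):
    # Convolve the mono dry sound with the HRIR pair of one
    # virtual three-dimensional space sound image position
    y_l = fftconvolve(dry_x, hrir_left, mode="full")   # left channel Y_L
    y_r = fftconvolve(dry_x, hrir_right, mode="full")  # right channel Y_R
    return y_l, y_r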
Besides, after acquiring the dry sound signal W in the dry sound signal set, the electronic device may perform a delay processing operation on the dry sound signal W. By way of example, the electronic device may pass the dry sound signal W through 16 delay processors with different time parameters, d_L(1), d_L(2), ..., d_L(8) and d_R(1), d_R(2), ..., d_R(8). The 8 dry sound signals obtained through the delay processing of d_L(1), d_L(2), ..., d_L(8) are superimposed to obtain the delayed left channel signal W_L corresponding to the dry sound signal W, and the 8 dry sound signals obtained through the delay processing of d_R(1), d_R(2), ..., d_R(8) are superimposed to obtain the delayed right channel signal W_R corresponding to the dry sound signal W. The delayed binaural signal obtained in this manner can simulate the binaural signals at the left and right ears of a human head, enriching the user's listening effect.
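A sketch of this 16-delay structure follows; the millisecond delay values are chosen arbitrarily for illustration and assume each delay is shorter than the signal itself.

import numpy as np

def delayed_binaural(dry_w, fs,
                     left_ms=(5, 11, 17, 23, 29, 37, 43, 53),
                     right_ms=(7, 13, 19, 31, 41, 47, 59, 61)):
    # Superimpose eight delayed copies per channel to form W_L and W_R
    n = len(dry_w)
    def bank(delays_ms):
        out = np.zeros(n)
        for ms in delays_ms:
            d = int(round(ms * fs / 1000.0))
            out[d:] += dry_w[:n - d]   # shift by d samples, then add
        return out / len(delays_ms)    # scale down to avoid clipping
    return bank(left_ms), bank(right_ms)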
In one embodiment, the resulting virtual sound signal set includes both of the above cases; that is, the final virtual sound signal set is Z = {Z_L, Z_R}, where Z_L = Y_L + W_L and Z_R = Y_R + W_R. It should be noted that the step of acquiring the virtual sound signal and the steps of acquiring the delayed left channel signal and the delayed right channel signal may be executed simultaneously or sequentially, which this application does not limit. Obtaining the corresponding virtual sound signals for the dry sound signals in the dry sound signal set in these two different manners presents the chorus scene in an all-round way and makes the audio effect richer.
Further, please refer to fig. 6, where fig. 6 is a method for playing an accompaniment according to an embodiment of the present application. The method of the embodiment of the application can be applied to an electronic device, which may be a smart device such as a smart phone, a tablet computer, a smart wearable device, or a personal computer, or may be a server or the like. The method may include, but is not limited to, the following steps:
S601: a user interface is displayed.
In an embodiment of the application, the electronic device may display a user interface for receiving a selection instruction of a target song by a user.
In one embodiment, the selection instruction may include a selection instruction of an accompaniment pattern of the target song, which may be, but is not limited to, a chorus accompaniment pattern, an acoustic accompaniment pattern, and an Artificial Intelligence (AI) accompaniment pattern.
In an embodiment, the selection instruction may be an instruction generated by a user by triggering a selection control displayed on the user interface, or may be a selection instruction generated by the user by controlling the electronic device through voice, for example, the voice control of the electronic device by the user may be "please play with the chorus accompaniment pattern", so that the electronic device may generate the selection instruction indicating that the accompaniment pattern for the target song is the chorus accompaniment pattern.
S602: and if the selection instruction received by the user interface indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern, acquiring the accompaniment corresponding to the target song.
In the embodiment of the application, if the selection instruction received by the electronic device at the user interface indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern, the electronic device may acquire the accompaniment corresponding to the chorus accompaniment pattern of the target song.
In one embodiment, a selection control for an accompaniment pattern of a target song may be displayed on the user interface, and the selection control may include: a chorus accompaniment mode selection control and an acoustic accompaniment mode selection control. Before acquiring the accompaniment corresponding to the target song, the electronic device may detect whether a selection operation for the chorus accompaniment pattern selection control is acquired, and if so, confirm that the selection instruction received at the user interface indicates that the accompaniment pattern for the target song is the chorus accompaniment pattern.
In one embodiment, the corresponding accompaniment in the chorus accompaniment pattern is generated from the chorus dry sound and the background music. The chorus dry sound may be generated from a virtual sound signal set that includes the virtual sound signal generated at each of the N virtual three-dimensional space sound image positions according to the acquired dry sound signal set. Multiple dry sound signals in the dry sound signal set may correspond to multiple different virtual three-dimensional space sound image positions, and each virtual three-dimensional space sound image position may correspond to one or more dry sound signals. The dry sound signal set is obtained from the dry sound signals recorded by multiple users for the target song; it should be noted that a user's dry sound signal for the target song is recorded only with that user's authorization. For the specific method of generating the accompaniment in the chorus accompaniment pattern, reference may be made to the embodiments shown in fig. 2 to fig. 5, which will not be described again here.
S603: and playing the accompaniment corresponding to the target song.
In the embodiment of the application, after the electronic device acquires the corresponding accompaniment in the chorus accompaniment pattern of the target song, the accompaniment can be played to the user.
In an implementation, the accompaniment corresponding to the target song can be applied to a karaoke scenario: the user can sing while this accompaniment is played, and, with the user's authorization and consent, the electronic device can collect the user's singing voice, fuse it with the accompaniment corresponding to the target song, and then play the result, giving the user a unique experience as if singing at a live concert.
In one embodiment, as shown in fig. 7a, the electronic device acquiring the accompaniment corresponding to the target song may include, but is not limited to, the following steps:
S701: an accompaniment request is sent to the server.
In an embodiment of the application, the electronic device may send an accompaniment request to the server, where the accompaniment request may include identification information of the target song.
In one embodiment, the identification information of the target song is unique information for identifying the target song, for example, the identification information may be a song name of the target song.
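As an illustrative sketch only (the endpoint path and payload fields below are assumptions; this application does not define a wire format), the request might be issued as:

import requests

def request_accompaniment(server_url, song_name):
    # Send an accompaniment request carrying the target song's identification
    resp = requests.post(server_url + "/accompaniment",
                         json={"song_id": song_name, "mode": "chorus"})
    resp.raise_for_status()
    return resp.json()  # e.g. the chorus dry sound and background music resources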
S702: and receiving the chorus dry sound and the background music returned by the server in response to the accompaniment request.
In an embodiment of the application, the electronic device may receive chorus dry sounds and background music returned by the server in response to an accompaniment request for a target song.
In one embodiment, the server may return the chorus dry sound and the background music respectively, or return the chorus dry sound and the background music after being combined, and the specific return mode may be selected according to the setting of the user.
S703: and determining a target chorus dry sound segment from the chorus dry sound.
In an embodiment of the present application, the electronic device may determine the target chorus stem segment according to the returned chorus stem.
In one embodiment, the electronic device may display a first sentence interface, as shown in fig. 7b, which displays the respective sentences in the text data corresponding to the chorus in the order of the time playback nodes of the chorus. The user may select a target chorus dry sound segment based on each single sentence displayed in the first single sentence interface.
In one embodiment, the target chorus dry sound segment may be composed of a part of the single sentences in the chorus dry sound, or may be composed of all the single sentences in the chorus dry sound, and may be specifically determined by the selection operation of the user.
S704: and obtaining the accompaniment corresponding to the target song according to the chorus dry sound and the background music corresponding to the target chorus dry sound fragment.
In the embodiment of the application, the electronic device can obtain the accompaniment corresponding to the target song according to the chorus dry sound and the background music corresponding to the target chorus dry sound segment selected by the user.
In one embodiment, the electronic device may display a second sentence interface, as shown in fig. 7c, which may be displayed during the playing of the accompaniment corresponding to the target song, and display the sentences in the text data corresponding to the accompaniment in the order of the time playing node of the accompaniment.
In an embodiment, the electronic device may further detect whether a mute selection operation for the chorus stems in the accompaniment is acquired in the playing process, and if the mute selection operation for the chorus stems in the accompaniment by the user is received, the playing of the chorus stems in the accompaniment may be cancelled at the playing node at the current time, and only the playing of the background music in the accompaniment is reserved.
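Per playing frame, this mute behavior amounts to dropping the chorus stem from the mix while the background music continues, as in this minimal sketch:

def render_frame(chorus_frame, background_frame, chorus_muted):
    # From the current playing node onward, output only the background
    # music when the chorus dry sound has been muted
    return background_frame if chorus_muted else background_frame + chorus_frame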
By implementing the embodiment of the application, on one hand, a selection instruction of a user for the target song can be received, and when the selection instruction indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern, the accompaniment corresponding to the target song can be acquired and played; on the other hand, the accompaniment of the target song in the chorus accompaniment pattern is generated according to the chorus dry sound and the background music, a target chorus dry sound segment can be determined from the chorus dry sound, and the accompaniment corresponding to the target song can be generated according to the chorus dry sound corresponding to that segment and the background music. In this way, when the accompaniment corresponding to the target song is played, the user can feel as if at a live concert, with an immersive sense of hearing; in addition, the user can flexibly select the chorus dry sound in the accompaniment, which makes the accompaniment more engaging and improves the user experience.
Further, referring to fig. 8a, fig. 8a is a schematic structural diagram of an apparatus for generating an accompaniment according to an embodiment of the present disclosure. The apparatus of the embodiment of the present disclosure may be applied to an electronic device, such as a smart phone, a tablet computer, a smart wearable device, a personal computer, or a server. In an embodiment, as shown in fig. 8a, the apparatus 80 for generating an accompaniment may include:
an obtaining unit 801, configured to obtain a dry sound signal set, where the dry sound signal set includes x dry sound signals corresponding to a target song, and x is an integer greater than 1; and generate a virtual sound signal based on the corresponding dry sound signal at each of N virtual three-dimensional space sound image positions, where the x dry sound signals correspond to the N virtual three-dimensional space sound image positions, N is an integer greater than 1, the N virtual three-dimensional space sound image positions are different, and each virtual three-dimensional space sound image position is allowed to correspond to one or more of the x dry sound signals.
A processing unit 802, configured to perform merging processing on each virtual sound signal in a virtual sound signal set to obtain chorus dry sound, where the virtual sound signal set includes: a virtual sound signal at each of the N virtual three-dimensional spatial sound image positions; and performing sound effect synthesis processing on the chorus dry sound and the background music of the target song according to a sound effect optimization rule to obtain the accompaniment of the target song.
In one embodiment, the obtaining unit 801 may be further configured to obtain an initial dry sound signal set from an audio database, where the audio database includes initial dry sound signals recorded when a plurality of users sing the same song; the processing unit 802 may be further configured to screen out dry sound signals from the initial dry sound signal set according to the sound parameters of each initial dry sound signal, where the screened-out dry sound signals constitute the dry sound signal set.
In one embodiment, the dry sound signal set includes: dry sound signals screened from the initial dry sound signal set according to intonation characteristic parameters and sound quality characteristic parameters; the intonation characteristic parameters include any one or more of a pitch parameter, a tempo parameter, and a rhythm parameter; the sound quality characteristic parameters include any one or more of a noise parameter, an energy parameter, and a speed parameter.
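A sketch of such screening, with hypothetical threshold values and a dict-based representation of each initial dry sound signal's measured parameters (neither the fields nor the thresholds are prescribed by this application):

def screen_dry_sounds(initial_signals, max_noise=0.1, min_energy=0.2,
                      max_tempo_dev=0.05):
    # Keep only signals whose intonation and sound quality parameters pass
    return [s for s in initial_signals
            if s["noise"] <= max_noise
            and s["energy"] >= min_energy
            and abs(s["tempo_deviation"]) <= max_tempo_dev]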
In one embodiment, the N virtual three-dimensional space sound image positions comprise: n1 virtual three-dimensional space sound image positions on the horizontal plane, obtained by dividing the horizontal plane at intervals of a first preset angle; n2 virtual three-dimensional space sound image positions on the upper plane, obtained by dividing the upper plane at intervals of a second preset angle, where the included angle between the upper plane and the horizontal plane is a first angle threshold; and n3 virtual three-dimensional space sound image positions on the lower plane, obtained by dividing the lower plane at intervals of a third preset angle, where the included angle between the lower plane and the horizontal plane is a second angle threshold; n1, n2, and n3 are positive integers and their sum is equal to N.
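For illustration, with a first preset angle of 30°, second and third preset angles of 45°, and angle thresholds of ±45° (all values assumed for this sketch), the positions could be enumerated as:

def sound_image_positions(horiz_step=30, upper_step=45, lower_step=45,
                          upper_elev=45, lower_elev=-45):
    # (azimuth, elevation) pairs on the horizontal, upper, and lower planes
    positions = [(az, 0) for az in range(0, 360, horiz_step)]            # n1 = 12
    positions += [(az, upper_elev) for az in range(0, 360, upper_step)]  # n2 = 8
    positions += [(az, lower_elev) for az in range(0, 360, lower_step)]  # n3 = 8
    return positions  # N = n1 + n2 + n3 = 28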
In one embodiment, the obtaining unit 801 may be further configured to obtain a head related transfer function corresponding to each of the N virtual three-dimensional space sound image positions; the processing unit 802 may further be configured to process, through a head-related transfer function corresponding to the target virtual three-dimensional space sound image position, a dry sound signal corresponding to the target virtual three-dimensional space sound image position, so as to obtain a virtual sound signal at the target virtual three-dimensional space sound image position; the virtual sound signal at the sound image position of the target virtual three-dimensional space is a two-channel signal; the target virtual three-dimensional space acoustic image position is any virtual three-dimensional space acoustic image position in the N virtual three-dimensional space acoustic image positions.
In one embodiment, the set of virtual sound signals further comprises: a delayed left channel signal and a delayed right channel signal corresponding to each of the p dry sound signals; the obtaining unit 801 may further be configured to obtain p dry sound signals from the x dry sound signals included in the dry sound signal set, where p is a positive integer and is less than or equal to x; the processing unit 802 may further be configured to perform delay processing operations on m1 time parameters on each of the p dry sound signals, to obtain m1 delayed dry sound signals corresponding to each of the p dry sound signals, and obtain a delayed left channel signal corresponding to each of the p dry sound signals by superimposing m1 delayed dry sound signals corresponding to each of the p dry sound signals, where m1 is a positive integer; and performing time delay processing operation on m2 time parameters on each dry sound signal in the p dry sound signals to obtain m2 time delay dry sound signals corresponding to each dry sound signal in the p dry sound signals, and obtaining a time delay right channel signal corresponding to each dry sound signal in the p dry sound signals by superposing the m2 time delay dry sound signals corresponding to each dry sound signal, wherein m2 is a positive integer.
In one embodiment, the obtaining unit 801 may further be configured to obtain background music of the target song, and the processing unit 802 may further be configured to adjust an energy relationship between the chorus dry sound and the background music, where the energy relationship between the adjusted chorus dry sound and the adjusted background music satisfies an energy ratio condition; the accompaniment is obtained according to the adjusted chorus dry sound and the background music.
In one embodiment, the processing unit 802 may be further configured to perform a spectrum equalization process on the chorus dry sound at a preset frequency band; the obtaining unit 801 may also be configured to obtain the loudness of background music; the processing unit 802 may be further configured to increase the loudness of the background music to a loudness threshold if the loudness of the background music is less than the loudness threshold; the accompaniment is obtained according to the chorus dry sound after the frequency spectrum equalization processing and the background music after the loudness processing.
It should be noted that, details which are not mentioned in the embodiment corresponding to fig. 8a and the specific implementation manner of each step may refer to the embodiments shown in fig. 2 to fig. 5 and the foregoing, and are not described again here.
Further, please refer to fig. 8b, fig. 8b is a schematic structural diagram of an apparatus for processing accompaniment playing provided in the embodiment of the present application, the apparatus of the embodiment of the present application may be applied to an electronic device, for example, a smart phone, a tablet computer, a smart wearable device, a personal computer, a server, and the like, and in an embodiment, as shown in fig. 8b, the apparatus 81 for processing accompaniment playing may include:
an obtaining unit 811 for displaying a user interface for receiving a selection instruction for a target song; and if the selection instruction received by the user interface indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern, acquiring the accompaniment corresponding to the target song.
A processing unit 812 for playing an accompaniment corresponding to the target song; the accompaniment is generated according to chorus dry sound and background music, the chorus dry sound is generated according to a plurality of dry sound signals in a dry sound signal set, the dry sound signals in the dry sound signal set correspond to a plurality of different virtual three-dimensional space sound image positions, and the dry sound signal set is obtained according to the dry sound signals recorded by a plurality of users aiming at the target song.
In one embodiment, the chorus dry sound is generated from a set of virtual sound signals comprising: generating a virtual sound signal at each virtual three-dimensional space sound image position in the N virtual three-dimensional space sound image positions according to the acquired dry sound signal set; wherein the plurality of dry sound signals in the dry sound signal set correspond to N virtual three-dimensional space acoustic image locations, N being an integer greater than 1, the N virtual three-dimensional space acoustic image locations being different and each virtual three-dimensional space acoustic image location being allowed to correspond to one or more dry sound signals.
In one embodiment, a selection control of an accompaniment pattern for a target song is displayed on the user interface, and the selection control of the accompaniment pattern comprises: a chorus accompaniment mode selection control and an acoustic accompaniment mode selection control; before acquiring the accompaniment corresponding to the target song, the processing unit 812 may be further configured to detect whether a selection operation for the chorus accompaniment pattern selection control is acquired; if yes, confirming that the selection instruction received in the user interface indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern.
In one embodiment, the processing unit 812 may be further configured to send an accompaniment request to the server, the accompaniment request including identification information of the target song; the acquisition unit 811 may also be configured to receive the chorus dry sound and the background music returned by the server in response to the accompaniment request; the processing unit 812 may also be configured to determine a target chorus vocal segment from the chorus; and obtaining the accompaniment corresponding to the target song according to the chorus dry sound and the background music corresponding to the target chorus dry sound fragment.
In an embodiment, the processing unit 812 may be further configured to display a first single sentence interface, and display each single sentence in the text data corresponding to the chorus dry sound according to a time playing node sequence of the chorus dry sound; the target chorus dry sound segment is determined based on a single sentence selection operation on the first single sentence interface.
In one embodiment, the processing unit 812 may be further configured to display a second single sentence interface, which displays each single sentence in the text data corresponding to the accompaniment in the order of the time playing node of the accompaniment; detecting whether a mute selection operation aiming at chorus dry sound in the accompaniment is acquired in the playing process; and if so, canceling the playing of the chorus dry sound at the playing node at the current time.
It should be noted that, details which are not mentioned in the embodiment corresponding to fig. 8b and the specific implementation manner of each step may refer to the embodiments shown in fig. 2 to fig. 7 and the foregoing, and are not described again here.
Further, please refer to fig. 9, where fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include: a network interface 901, a memory 902 and a processor 903, the network interface 901, the memory 902 and the processor 903 being connected by one or more communication buses for enabling connection communication between these components. Network interface 901 may include a standard wired interface, a wireless interface (e.g., a WIFI interface). Memory 902 may include volatile memory (volatile memory), such as random-access memory (RAM); the memory 902 may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a solid-state drive (SSD), etc.; the memory 902 may also comprise a combination of the above-described types of memory. The processor 903 may be a Central Processing Unit (CPU). The processor 903 may further comprise a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or the like. The PLD may be a field-programmable gate array (FPGA), a General Array Logic (GAL), or the like.
Optionally, the memory 902 is further configured to store program instructions, which the processor 903 may also call to implement:
acquiring an acoustic stem signal set, wherein the acoustic stem signal set comprises x acoustic stem signals corresponding to a target song, and x is an integer greater than 1;
generating a virtual sound signal based on a corresponding dry sound signal at each of N virtual three-dimensional space sound image positions, wherein x dry sound signals correspond to the N virtual three-dimensional space sound image positions, N is an integer greater than 1, the N virtual three-dimensional space sound image positions are different, and each virtual three-dimensional space sound image position is allowed to correspond to one or more of the x dry sound signals;
merging each virtual sound signal in a virtual sound signal set to obtain chorus dry sound, wherein the virtual sound signal set comprises: a virtual sound signal at each of the N virtual three-dimensional spatial sound image positions;
and performing sound effect synthesis processing on the chorus dry sound and the background music of the target song according to a sound effect optimization rule to obtain the accompaniment of the target song.
In one embodiment, the processor 903 may also call the program instructions to implement: acquiring an initial dry sound signal set from an audio database, wherein the audio database comprises initial dry sound signals recorded when a plurality of users sing the same song; and screening out dry sound signals from the initial dry sound signal set according to the sound parameters of the initial dry sound signals, wherein the screened dry sound signals form a dry sound signal set.
In one embodiment, the dry sound signal set includes: dry sound signals screened from the initial dry sound signal set according to intonation characteristic parameters and sound quality characteristic parameters; the intonation characteristic parameters include any one or more of a pitch parameter, a tempo parameter, and a rhythm parameter; the sound quality characteristic parameters include any one or more of a noise parameter, an energy parameter, and a speed parameter.
In one embodiment, the N virtual three-dimensional space sound image positions comprise: n1 virtual three-dimensional space sound image positions on the horizontal plane, obtained by dividing the horizontal plane at intervals of a first preset angle; n2 virtual three-dimensional space sound image positions on the upper plane, obtained by dividing the upper plane at intervals of a second preset angle, where the included angle between the upper plane and the horizontal plane is a first angle threshold; and n3 virtual three-dimensional space sound image positions on the lower plane, obtained by dividing the lower plane at intervals of a third preset angle, where the included angle between the lower plane and the horizontal plane is a second angle threshold; n1, n2, and n3 are positive integers and their sum is equal to N.
In one embodiment, the processor 903 may also call the program instructions to implement: acquiring a head related transfer function corresponding to each virtual three-dimensional space sound image position in the N virtual three-dimensional space sound image positions; processing a dry sound signal corresponding to a target virtual three-dimensional space sound image position through a head related transfer function corresponding to the target virtual three-dimensional space sound image position to obtain a virtual sound signal at the target virtual three-dimensional space sound image position; the virtual sound signal at the sound image position of the target virtual three-dimensional space is a two-channel signal; the target virtual three-dimensional space acoustic image position is any virtual three-dimensional space acoustic image position in the N virtual three-dimensional space acoustic image positions.
In one embodiment, the set of virtual sound signals further comprises: a delayed left channel signal and a delayed right channel signal corresponding to each of the p dry sound signals; the processor 903 may also call the program instructions to implement: acquiring p dry sound signals from x dry sound signals included in a dry sound signal set, wherein p is a positive integer and is less than or equal to x; performing time delay processing operation on m1 time parameters on each dry sound signal in the p dry sound signals to obtain m1 time delay dry sound signals corresponding to each dry sound signal in the p dry sound signals, and obtaining a time delay left channel signal corresponding to each dry sound signal in the p dry sound signals by superposing the m1 time delay dry sound signals corresponding to each dry sound signal, wherein m1 is a positive integer; and performing time delay processing operation on m2 time parameters on each dry sound signal in the p dry sound signals to obtain m2 time delay dry sound signals corresponding to each dry sound signal in the p dry sound signals, and obtaining a time delay right channel signal corresponding to each dry sound signal in the p dry sound signals by superposing the m2 time delay dry sound signals corresponding to each dry sound signal, wherein m2 is a positive integer.
In one embodiment, the processor 903 may also call the program instructions to implement: acquiring background music of a target song, and adjusting the energy relationship between chorus dry sound and the background music, wherein the energy relationship between the adjusted chorus dry sound and the adjusted background music meets an energy ratio condition; the accompaniment is obtained according to the adjusted chorus dry sound and the background music.
In one embodiment, the processor 903 may also call the program instructions to implement: performing spectrum equalization processing on the chorus dry sound at a preset frequency band; acquiring the loudness of the background music; if the loudness of the background music is smaller than the loudness threshold, raising the loudness of the background music to the loudness threshold; the accompaniment is obtained according to the chorus dry sound after the spectrum equalization processing and the background music after the loudness processing.
Optionally, the memory 902 is further configured to store program instructions, which the processor 903 may also call to implement:
displaying a user interface, wherein the user interface is used for receiving a selection instruction of a target song;
if the selection instruction received on the user interface indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern, acquiring the accompaniment corresponding to the target song;
playing the accompaniment corresponding to the target song; the accompaniment is generated according to chorus dry sound and background music, the chorus dry sound is generated according to a plurality of dry sound signals in a dry sound signal set, the dry sound signals in the dry sound signal set correspond to a plurality of different virtual three-dimensional space sound image positions, and the dry sound signal set is obtained according to the dry sound signals recorded by a plurality of users aiming at the target song.
In one embodiment, the chorus dry sound is generated from a set of virtual sound signals comprising: generating a virtual sound signal at each virtual three-dimensional space sound image position in the N virtual three-dimensional space sound image positions according to the acquired dry sound signal set; wherein the plurality of dry sound signals in the dry sound signal set correspond to N virtual three-dimensional space acoustic image locations, N being an integer greater than 1, the N virtual three-dimensional space acoustic image locations being different and each virtual three-dimensional space acoustic image location being allowed to correspond to one or more dry sound signals.
In one embodiment, a selection control of an accompaniment pattern for a target song is displayed on the user interface, and the selection control of the accompaniment pattern comprises: a chorus accompaniment mode selection control and an acoustic accompaniment mode selection control; before obtaining the accompaniment corresponding to the target song, the processor 903 may further call the program instructions to implement: detecting whether a selection operation for the chorus accompaniment mode selection control is acquired; if yes, confirming that the selection instruction received in the user interface indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern.
In one embodiment, the processor 903 may also call the program instructions to implement: sending an accompaniment request to a server, wherein the accompaniment request comprises identification information of a target song; receiving chorus dry sound and background music returned by the server in response to the accompaniment request; determining a target chorus dry sound segment from the chorus dry sound; and obtaining the accompaniment corresponding to the target song according to the chorus dry sound and the background music corresponding to the target chorus dry sound fragment.
In one embodiment, the processor 903 may also call the program instructions to implement: displaying a first single sentence interface, and displaying each single sentence in the text data corresponding to the chorus dry sound according to the time playing node sequence of the chorus dry sound; the target chorus dry sound segment is determined based on a single sentence selection operation on the first single sentence interface.
In one embodiment, the processor 903 may also call the program instructions to implement: displaying a second single sentence interface, and displaying each single sentence in the text data corresponding to the accompaniment according to the time playing node sequence of the accompaniment; detecting whether a mute selection operation aiming at chorus dry sound in the accompaniment is acquired in the playing process; and if so, canceling the playing of the chorus dry sound at the playing node at the current time.
It should be understood that the principle and the advantageous effects of the electronic device 90 described in the embodiment of the present application for solving the problems are similar to the embodiment shown in fig. 2 to fig. 7 of the present application and the principle and the advantageous effects of the foregoing for solving the problems, and therefore, for brevity, detailed descriptions thereof are omitted here.
Furthermore, the present application also provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method provided by the foregoing embodiment.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the method provided by the foregoing embodiment.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The units in the device of the embodiment of the application can be combined, divided and deleted according to actual needs.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While the present disclosure has been described with reference to particular embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present disclosure.

Claims (16)

1. A method of generating an accompaniment, the method comprising:
acquiring a dry sound signal set, wherein the dry sound signal set comprises x dry sound signals corresponding to a target song, and x is an integer greater than 1;
generating a virtual sound signal based on a corresponding dry sound signal at each of N virtual three-dimensional space sound image positions, wherein the x dry sound signals correspond to the N virtual three-dimensional space sound image positions, N is an integer greater than 1, the N virtual three-dimensional space sound image positions are not the same, and each virtual three-dimensional space sound image position is allowed to correspond to one or more of the x dry sound signals;
merging each virtual sound signal in a virtual sound signal set to obtain chorus dry sound, wherein the virtual sound signal set comprises: a virtual sound signal at each of the N virtual three-dimensional spatial sound image positions;
and performing sound effect synthesis processing on the chorus dry sound and the background music of the target song according to a sound effect optimization rule to obtain the accompaniment of the target song.
2. The method of claim 1, wherein the acquiring a set of dry sound signals comprises:
acquiring an initial dry sound signal set from an audio database, wherein the audio database comprises initial dry sound signals recorded when a plurality of users sing a target song;
and screening x dry sound signals from the initial dry sound signal set according to the sound parameters of the initial dry sound signals to form the dry sound signal set.
3. The method of claim 2, wherein the sound parameters comprise: intonation characteristic parameters and sound quality characteristic parameters;
wherein the intonation characteristic parameters comprise any one or more of a pitch parameter, a tempo parameter, and a rhythm parameter; the sound quality characteristic parameters comprise any one or more of a noise parameter, an energy parameter, and a speed parameter.
4. The method of claim 1, wherein the N virtual three-dimensional spatial sound image locations comprise:
dividing a horizontal plane at intervals of a first preset angle on the horizontal plane to obtain n1 virtual three-dimensional space sound image positions on the horizontal plane;
dividing the upper plane at intervals of a second preset angle on the upper plane to obtain n2 virtual three-dimensional space sound image positions on the upper plane; an included angle between the upper plane and the horizontal plane is a first angle threshold value;
dividing a lower plane at intervals of a third preset angle on the lower plane to obtain n3 virtual three-dimensional space sound image positions on the lower plane; an included angle between the lower plane and the horizontal plane is a second angle threshold value;
wherein n1, n2, and n3 are positive integers and their sum is equal to N.
5. The method of any one of claims 1-4, wherein generating the virtual sound signal based on the corresponding dry sound signal at each of the N virtual three-dimensional spatial sound image locations comprises:
acquiring a head related transfer function corresponding to each virtual three-dimensional space sound image position in the N virtual three-dimensional space sound image positions;
processing a dry sound signal corresponding to a target virtual three-dimensional space sound image position through a head related transfer function corresponding to the target virtual three-dimensional space sound image position to obtain a virtual sound signal at the target virtual three-dimensional space sound image position;
the virtual sound signal at the sound image position of the target virtual three-dimensional space is a two-channel signal;
the target virtual three-dimensional space acoustic image position is any one of the N virtual three-dimensional space acoustic image positions.
6. The method of any one of claims 1-4, wherein the virtual sound signal set further comprises: a delayed left channel signal and a delayed right channel signal corresponding to each of p dry sound signals;
before the combining processing is performed on each virtual sound signal in the virtual sound signal set to obtain the chorus dry sound, the method further includes:
acquiring p dry sound signals from x dry sound signals included in the dry sound signal set, wherein p is a positive integer and is less than or equal to x;
performing m1 time-delay processing operations on each dry sound signal in the p dry sound signals to obtain m1 time-delay dry sound signals corresponding to each dry sound signal in the p dry sound signals, and obtaining a time-delay left channel signal corresponding to each dry sound signal in the p dry sound signals by superposing the m1 time-delay dry sound signals corresponding to each dry sound signal, wherein m1 is a positive integer;
and performing m2 time-delay processing operation on each dry sound signal in the p dry sound signals to obtain m2 time-delay dry sound signals corresponding to each dry sound signal in the p dry sound signals, and obtaining a time-delay right channel signal corresponding to each dry sound signal in the p dry sound signals by superposing the m2 time-delay dry sound signals corresponding to each dry sound signal, wherein m2 is a positive integer.
7. The method according to any one of claims 1-4, wherein the sound effect synthesizing processing of the chorus dry sound and the background music of the target song according to the sound effect optimization rule to obtain the accompaniment of the target song comprises:
acquiring background music of the target song, and adjusting the energy relationship between the chorus dry sound and the background music, wherein the energy relationship between the adjusted chorus dry sound and the adjusted background music meets an energy ratio condition;
the accompaniment is obtained according to the adjusted chorus dry sound and the background music.
8. The method according to any one of claims 1-4, wherein the sound effect synthesizing processing of the chorus dry sound and the background music of the target song according to the sound effect optimization rule to obtain the accompaniment of the target song comprises:
carrying out frequency spectrum equalization processing on the chorus dry sound at a preset frequency band;
obtaining the loudness of the background music;
if the loudness of the background music is smaller than a loudness threshold value, the loudness of the background music is increased to the loudness threshold value;
the accompaniment is obtained according to the chorus dry sound after the frequency spectrum equalization processing and the background music after the loudness processing.
9. A method for processing playback of an accompaniment, comprising:
displaying a user interface, wherein the user interface is used for receiving a selection instruction of a target song;
if the selection instruction received by the user interface indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern, acquiring the accompaniment corresponding to the target song;
playing the accompaniment corresponding to the target song;
the accompaniment is generated according to chorus dry sound and background music, the chorus dry sound is generated according to a plurality of dry sound signals in a dry sound signal set, the dry sound signals in the dry sound signal set correspond to a plurality of different virtual three-dimensional space sound image positions, and the dry sound signal set is obtained according to the dry sound signals recorded by a plurality of users aiming at the target song.
10. The method of claim 9, wherein the chorus dry sound is generated from a set of virtual sound signals, the set of virtual sound signals comprising: generating a virtual sound signal at each virtual three-dimensional space sound image position in the N virtual three-dimensional space sound image positions according to the acquired dry sound signal set;
wherein the plurality of dry sound signals in the dry sound signal set correspond to N virtual three-dimensional spatial sound image locations, N being an integer greater than 1, the N virtual three-dimensional spatial sound image locations being different and each virtual three-dimensional spatial sound image location being allowed to correspond to one or more dry sound signals.
11. The method of claim 9 or 10, wherein a selection control of an accompaniment pattern for a target song is displayed on the user interface, the selection control of the accompaniment pattern comprising: a chorus accompaniment mode selection control and an acoustic accompaniment mode selection control; before the obtaining of the accompaniment corresponding to the target song, the method further comprises:
detecting whether a selection operation aiming at the chorus accompaniment mode selection control is acquired;
if yes, confirming that the selection instruction received by the user interface indicates that the accompaniment pattern of the target song is the chorus accompaniment pattern.
12. The method of claim 9, wherein the obtaining the accompaniment corresponding to the target song comprises:
sending an accompaniment request to a server, wherein the accompaniment request comprises identification information of the target song;
receiving the chorus dry sound and the background music returned by the server in response to the accompaniment request;
determining a target chorus dry sound segment from the chorus dry sound;
and obtaining the accompaniment corresponding to the target song according to the chorus dry sound corresponding to the target chorus dry sound segment and the background music.
13. The method of claim 12, wherein prior to determining the target chorus vocal segment from the chorus, the method further comprises:
displaying a first single sentence interface, and displaying each single sentence in the text data corresponding to the chorus dry sound according to the time playing node sequence of the chorus dry sound;
the target chorus dry sound segment is determined based on a single sentence selection operation on the first single sentence interface.
14. The method according to claim 9 or 12, wherein after the playing of the accompaniment corresponding to the target song, the method further comprises:
displaying a second single sentence interface, and displaying each single sentence in the text data corresponding to the accompaniment according to the time playing node sequence of the accompaniment;
detecting whether a mute selection operation aiming at the chorus dry sound in the accompaniment is acquired in the playing process;
and if so, canceling the playing of the chorus dry sound at the playing node at the current time.
15. An electronic device, comprising: the device comprises a memory, a processor and a network interface, wherein the processor is connected with the memory and the network interface, the network interface is used for providing a network communication function, the memory is used for storing program codes, and the processor is used for calling the program codes and executing the method of any one of claims 1 to 14.
16. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 14.