WO2022228220A1

WO2022228220A1 - Method and device for processing chorus audio, and storage medium

Info

Publication number: WO2022228220A1
Application number: PCT/CN2022/087784
Authority: WO
Inventors: 张超鹏; 陈灏; 武文昊; 罗辉; 李革委; 姜涛; 胡鹏
Original assignee: 腾讯音乐娱乐科技（深圳）有限公司
Priority date: 2021-04-27
Filing date: 2022-04-20
Publication date: 2022-11-03
Also published as: CN113192486B; CN113192486A

Abstract

A method and device for processing a chorus audio, and a storage medium. The method comprises the following steps: obtaining acapella audios of a plurality of singers singing the same target song (S110); performing time alignment on the plurality of obtained acapella audios (S120), and performing virtual sound image positioning, so as to position the plurality of acapella audios onto the plurality of virtual sound images (S130); generating a chorus audio on the basis of the plurality of acapella audios after having undergone the virtual sound image positioning (S140); and when a lead singer audio based on the singing of the target song is obtained, synthesizing the lead singer audio, the chorus audio, and a corresponding accompaniment, and then outputting a chorus effect audio (S150). The plurality of virtual sound images surround human ears and the plurality of acapella audios are positioned onto the plurality of virtual sound images, so that the outputted chorus effect audio can have a sound field surround sound effect, effectively preventing an in-head effect caused by sound field gathering in the center of the head, and enabling the sound field to be wider.

Description

Method, device and storage medium for processing chorus audio

This application claims the priority of the Chinese patent application with the application number 202110460280.4 and the invention titled "A chorus audio processing method, device and storage medium", which was filed with the China Patent Office on April 27, 2021, the entire contents of which are by reference Incorporated in this application.

technical field

The present application relates to the technical field of computer applications, and in particular, to a method, device and storage medium for processing chorus audio.

Background technique

With the rapid development of computer technology, various types of software such as audio, video, and office have gradually increased, bringing a lot of convenience to people's lives. Using audio software, users can listen to songs, sing and other experiences.

At present, in order to provide users with an auditory experience of a concert chorus, most of the multi-person singing data is directly superimposed. However, the audio obtained through simple superposition processing has a sound field concentrated in the center of the human head, with a head effect, the sound field is not wide enough, and the listening experience is poor.

SUMMARY OF THE INVENTION

The purpose of the present application is to provide a chorus audio processing method, device and storage medium, so as to avoid the head effect caused by the sound field gathering in the center of the human head, so that the sound field is wider and the listening experience is improved.

In order to solve the above-mentioned technical problems, the application provides the following technical solutions:

A method for processing chorus audio, comprising:

Obtain the dry audio of the same target song performed by multiple singers respectively;

performing time alignment processing on the obtained plurality of the dry audio frequencies;

Perform virtual sound image localization on a plurality of the dry sound audios after time alignment processing, so as to locate the plurality of dry sound audios on a plurality of virtual sound images; In the virtual audio-visual coordinate system, the virtual audio-visual coordinate system is centered on the human head, with the midpoint of the straight line where the left and right ears are located as the coordinate origin, the positive direction of the first coordinate axis represents the front of the human head, and the positive direction of the second coordinate axis Represents the side of the human head from the left ear to the right ear, the positive direction of the third coordinate axis represents directly above the human head, the distance between each virtual sound image and the coordinate origin is within the set distance range, and each virtual sound image is within the set distance range. The pitch angle of the sound image relative to the plane formed by the first coordinate axis and the second coordinate axis is within a set angle range;

generating chorus audio based on a plurality of the dry audio audio after virtual sound image localization;

In the case of acquiring the lead vocal audio sung based on the target song, after synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment, a large chorus effect audio is output.

In a specific implementation manner of the present application, performing time alignment processing on the plurality of obtained dry audio frequencies includes:

Determine the reference audio corresponding to the target song;

For each of the obtained dry audio frequencies, extract the audio features of the current dry audio audio and the reference audio respectively, and the audio features are fingerprint features or fundamental frequency features;

Determine the time corresponding to the audio feature similarity maximum value of the current dry sound audio and the reference audio as the audio alignment time;

Based on the audio alignment time, time alignment processing is performed on the current dry audio audio.

In a specific embodiment of the present application, it also includes:

Band-pass filtering is performed on the obtained multiple dry audio frequencies respectively to obtain multiple bass data;

Correspondingly, the chorus audio is generated based on the plurality of dry sound audios after virtual sound image localization, including:

A chorus audio is generated based on the plurality of the dry audio frequencies and the plurality of the bass data subjected to virtual sound image localization.

In a specific embodiment of the present application, it also includes:

respectively performing reverberation simulation processing on the obtained plurality of the dry audio frequencies;

A chorus audio is generated based on the plurality of the dry audio frequencies subjected to virtual sound image localization and the plurality of the dry audio frequencies subjected to reverberation simulation processing.

In a specific implementation manner of the present application, performing reverberation simulation processing on a plurality of the obtained dry audio frequencies respectively includes:

Reverberation simulation processing is respectively performed on the obtained plurality of the dry audio frequencies by using the cascade of comb filters and all-pass filters.

In a specific implementation manner of the present application, after the virtual sound image localization is performed on the plurality of dry audio audios subjected to the time alignment process, the method further includes:

respectively performing reverberation simulation processing on a plurality of the dry audio frequencies after virtual sound image localization;

A chorus audio is generated based on the plurality of dry audio audios subjected to virtual sound image localization and reverberation simulation processing.

In a specific embodiment of the present application, it also includes:

respectively perform binaural simulation processing on the obtained plurality of the dry audio frequencies;

A chorus audio is generated based on the plurality of the dry audio audio after virtual sound image localization and the plurality of the dry audio audio after binaural simulation processing.

In a specific implementation manner of the present application, after the two-channel simulation processing is performed on the obtained plurality of dry audio frequencies, the method further includes:

performing reverberation simulation processing on a plurality of the dry audio frequencies after the binaural simulation processing;

Correspondingly, the generation of chorus audio based on the plurality of dry audio audios after virtual sound image localization and the plurality of dry audio audios after binaural simulation processing includes:

The chorus audio is generated based on the plurality of the dry audio audios after virtual sound image localization, the plurality of the dry audio audios after the binaural simulation processing and the reverberation simulation processing.

In a specific embodiment of the present application, performing virtual sound image localization on a plurality of the dry audio audio after time alignment processing includes:

According to the number of virtual sound images, the obtained dry sound audio frequency after time alignment processing is grouped, and the number of groups is the same as the number of virtual sound images;

Each group of dry audio audio is located on the corresponding virtual audio image, and different groups of dry audio audio correspond to different virtual audio images.

In a specific embodiment of the present application,

Among the plurality of virtual sound images, the elevation angle of the virtual sound image located behind the human head relative to the plane formed by the first coordinate axis and the second coordinate axis is greater than that of the virtual sound image located in front of the human head relative to the first coordinate axis. the elevation angle of the plane formed by the coordinate axis and the second coordinate axis;

or,

Each of the virtual sound images is evenly distributed on a circumference of the plane formed by the first coordinate axis and the second coordinate axis.

In a specific implementation manner of the present application, the synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment includes:

respectively performing volume adjustment on the lead vocal audio and the chorus audio, and/or performing reverberation simulation processing on the lead vocal audio and the chorus audio;

The lead vocal audio, the chorus audio and the corresponding accompaniment after volume adjustment and/or reverberation simulation processing are synthesized.

A processing device for chorus audio, comprising:

The dry sound audio obtaining module is used to obtain the dry sound audio of the same target song performed by multiple singers respectively;

an alignment processing module, configured to perform time alignment processing on a plurality of the obtained dry audio frequencies;

A virtual sound image localization module, configured to perform virtual sound image localization on a plurality of the dry sound audio after time alignment processing, so as to locate a plurality of the dry sound audio on a plurality of virtual sound images; The virtual sound image is located in a pre-established virtual sound image coordinate system, the virtual sound image coordinate system is centered on the human head, the midpoint of the straight line where the left and right ears are located is the coordinate origin, and the positive direction of the first coordinate axis represents the front of the human head. , the positive direction of the second coordinate axis represents the side of the human head from the left ear to the right ear, the positive direction of the third coordinate axis represents directly above the human head, and the distance between each virtual sound image and the coordinate origin is within the set distance Within the range, the pitch angle of each virtual sound image relative to the plane formed by the first coordinate axis and the second coordinate axis is within a set angle range;

a chorus audio generation module for generating chorus audio based on a plurality of the dry audio frequencies after performing virtual sound image localization;

The chorus effect audio output module is configured to output the chorus effect audio after synthesizing the lead singer audio, the chorus audio and the corresponding accompaniment under the condition of obtaining the lead vocal audio sung based on the target song.

A chorus audio processing device, comprising:

memory for storing computer programs;

The processor is configured to implement the steps of the chorus audio processing method described in any one of the above when executing the computer program.

A computer-readable storage medium storing a computer program on the computer-readable storage medium, when the computer program is executed by a processor, implements the steps of any of the above-mentioned chorus audio processing methods.

By applying the technical solutions provided by the embodiments of the present application, after obtaining the dry voice audio of the same target song performed by a plurality of singers respectively, time alignment processing is performed on the obtained plurality of dry voice audio, and the aligned plurality of dry voice audio is processed. Perform virtual sound image localization of sound and audio to locate multiple dry sound audio on multiple virtual sound images. The multiple virtual sound images are located in the virtual sound image coordinate system centered on the human head, and the distance from the coordinate origin is set. Within the distance, surround the human ear, generate chorus audio based on multiple dry audios after virtual sound image positioning, and when the lead vocal audio based on the target song is obtained, the lead vocal audio, chorus audio and corresponding accompaniment Perform a chorus, get and output a large chorus effect audio. Positioning multiple dry sound audios on multiple virtual sound images surrounding the human ear can make the generated chorus audio have a sound field surround sound effect. In terms of listening sense, it can effectively prevent the sound field of the final output large chorus effect audio from gathering in the center of the human head. The resulting head effect makes the sound field wider.

Description of drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following briefly introduces the accompanying drawings required for the description of the embodiments or the prior art. Obviously, the drawings in the following description are only These are some embodiments of the present application. For those of ordinary skill in the art, other drawings can also be obtained based on these drawings without any creative effort.

FIG. 1 is an implementation flowchart of a method for processing chorus audio in an embodiment of the application;

Fig. 2 is the schematic diagram that the virtual sound image localization coordinate system shows the sound image orientation in the embodiment of the application;

3 is a schematic diagram of a virtual sound image localization in an embodiment of the application;

Fig. 4 is the schematic diagram of the virtual sound image after positioning in the embodiment of the application;

5 is a schematic diagram of the composition of a spatial sound field process in an embodiment of the application;

6 is a schematic diagram of a cascaded form of a comb filter and an all-pass filter in an embodiment of the application;

7 is a schematic diagram of a reverberation impulse response in an embodiment of the application;

8 is a schematic diagram of a two-channel simulation process in an embodiment of the present application;

9 is a schematic diagram of a framework of a chorus audio processing system in an embodiment of the application;

10 is a schematic diagram of a specific structure of a chorus audio processing system in an embodiment of the application;

11 is a schematic structural diagram of an apparatus for processing chorus audio in an embodiment of the application;

FIG. 12 is a schematic structural diagram of a chorus audio processing device according to an embodiment of the present application.

Detailed ways

The core of the present application is to provide a method for processing chorus audio. After obtaining the dry voice audio of the same target song performed by multiple singers, time alignment is performed on the obtained dry voice audio, and virtual sound image localization is performed on the aligned Each dry sound audio is located on multiple virtual sound images, and the multiple virtual sound images are located in the virtual sound image coordinate system centered on the human head, and the distance from the coordinate origin is within the set distance range, surrounding the human ear, based on the virtual sound Like the positioned multiple dry audios, chorus audio is generated, and when the lead vocal audio based on the target song is obtained, the lead vocal audio, the chorus audio and the corresponding accompaniment are chorused to obtain and output the large chorus effect audio. Positioning multiple dry sound audios on multiple virtual sound images surrounding the human ear can make the generated chorus audio have a sound field surround sound effect. In terms of listening sense, it can effectively prevent the sound field of the final output large chorus effect audio from gathering in the center of the human head. The resulting head effect makes the sound field wider.

In practical applications, the methods provided in the embodiments of the present application can be applied in various scenarios where a large chorus sound effect is to be obtained, and specific solutions can be implemented through the interaction between the server and the client.

For example, in Scenario 1, the server may obtain in advance the dry voice audios of multiple singers, such as

singers

1, 2, 3, 4... for the same target song, and perform time alignment on the obtained dry voice audios Process, and perform virtual sound image localization on multiple dry sound audios after alignment, and locate multiple dry sound audios on multiple virtual sound images, and multiple virtual sound images can surround the human ear. Multiple dry audio to generate chorus audio. When user X wants to make the song he sings achieve a chorus sound effect, he can sing the target song through the client. The chorus effect audio can be obtained, and the chorus effect audio can be output through the client, so that the user X can feel the chorus sound effect.

In Scenario 2, several good friends (

Users

1, 2, 3, 4, and 5) sing the target song at the same time period but in different spaces, and want to achieve a large chorus sound effect. From the perspective of any user, the current user can be used as the lead singer. For example, from the perspective of user 1, the server can obtain the dry audios of the target songs performed by users 2, 3, 4, and 5 respectively, perform time alignment processing on the obtained dry audios, and locate the plurality of dry audios after alignment. On the multiple virtual sound images, the multiple virtual sound images surround the human ear, and the chorus audio is generated based on the multiple dry sound audio frequencies after the virtual sound image localization. When the server obtains the lead vocal audio sung by user 1 through the client based on the target song, the server synthesizes the lead vocal audio, the chorus audio and the corresponding accompaniment to obtain the chorus effect audio, which is output to the user 1 through the client, so that the user 1 can Feel the big chorus sound.

The above only describes the application scenarios exemplarily. In practical applications, the technical solutions of the present application can also be applied to more scenarios, such as sound effect processing scenarios of multi-person chorus and multi-person small bands.

In order to make those skilled in the art better understand the solution of the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments. Obviously, the described embodiments are only a part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

Referring to FIG. 1, which is an implementation flowchart of a method for processing chorus audio provided by an embodiment of the present application, the method may include the following steps:

S110: Obtain dry audio audio of the same target song performed by multiple singers respectively.

In this embodiment of the present application, multiple dry audio frequencies may be obtained according to actual needs. The multiple dry audio audios may be audio data obtained by different singers singing the same target song, and different singers may be in the same or different environments.

S120: Perform time alignment processing on the obtained multiple dry audio frequencies.

The dry audios of the same target song sung by multiple singers are obtained separately, because the multiple dry audios may be sung by different singers at different times, and there may be a phenomenon of misalignment such as delay. In order to achieve a better chorus sound effect in the future, you can first perform time alignment processing on the obtained dry sound audio, so that the dry sound audio after time alignment processing does not have serious rush or slow beat, such as advance or lag more than 1 second 's audio. Specifically, an alignment tool can be used to align the time of the most identical starting positions of the obtained multiple dry audio audios.

In the specific embodiment of the present application, before the time alignment processing is performed on the obtained multiple dry audio audios, the obtained multiple dry audio audios may also be preliminarily screened, for example, by using tools such as sound quality detection, and the audio itself is eliminated. Audio with poor sound quality, such as noise, accompaniment backstepping, audio length is too short, audio energy is too small, popping sound, etc. Then, time alignment processing and subsequent steps are performed on the dry audio audio retained after screening.

S130: Perform virtual sound image localization on the multiple dry audio audios subjected to the time alignment processing, so as to locate the multiple dry audio audios on the multiple virtual audio images.

Among them, a plurality of virtual sound images are located in a pre-established virtual sound image coordinate system, the virtual sound image coordinate system is centered on the human head, the center point of the straight line where the left and right ears are located is the coordinate origin, and the positive direction of the first coordinate axis indicates the front of the human head. , the positive direction of the second coordinate axis represents the side of the human head from the left ear to the right ear, the positive direction of the third coordinate axis represents the top of the human head, and the distance between each virtual sound image and the coordinate origin is within the set distance range. The pitch angle of each virtual sound image relative to the plane formed by the first coordinate axis and the second coordinate axis is within a set angle range.

In this embodiment of the present application, a virtual audio-visual coordinate system may be established in advance to display the audio-visual orientation. The virtual audiovisual coordinate system may specifically be a Cartesian coordinate system. As shown in Figure 2, the virtual audio-visual coordinate system can be centered on the human head, and the midpoint of the straight line where the left and right ears are located as the coordinate origin. The positive direction represents the side of the human head from the left ear to the right ear. The positive direction of the third coordinate axis, that is, the z-axis, represents the top of the human head, that is, the direction of the top of the head. The sound image has a certain azimuth and elevation (azimuth) in space ( elevation), you can use

to indicate, rad indicates the distance between the current sound image and the coordinate origin.

Generally, the sound signal is a single-channel signal, which can be regarded as the sound image in the

In terms of position, in order to obtain a certain virtual sound image, HRTF (Head Related Transfer Function, head related transformation function) can be used to perform data convolution to realize the localization operation. The schematic diagram of virtual sound image localization is shown in Figure 3, where X represents a real sound source (single-channel signal), Y _L and Y _R represent the sound signals heard by the left ear and right ear respectively, and HRTF represents the position of the sound signal from the sound source. Transfer function of the transmission path to both ears. Based on HRTF technology, the real audio source (single-channel signal) can be passed through a certain

The HRTF filtering of the left and right ears on the position to obtain a two-way acoustic signal.

The frequency domain characteristics of the acoustic signals received by the left and right ears can be expressed as:

It can be simply considered that the acoustic signal heard by the human ear is the result of the HRTF filtering of the sound source X. Therefore, when performing virtual sound image localization, the sound signal can be filtered through the HRTF of the corresponding position. In the virtual sound image coordinate system, multiple virtual sound images can be set, and the distance between each virtual sound image and the coordinate origin can be within the set distance range, such as 1 meter range, and each virtual sound image is relative to the virtual sound image coordinates. The pitch angle of the plane formed by the first coordinate axis and the second coordinate axis of the system may be within a set angle range, such as a range of 10°, so that multiple virtual sound images surround the human ear.

Specifically, each virtual sound image of the plurality of virtual sound images may be uniformly distributed on a circumference of the plane formed by the first coordinate axis and the second coordinate axis. That is, it surrounds the horizontal plane of the human ear at the same interval angle. The interval angle can be set according to the actual situation or analysis of historical data, for example, it is set to 30°. If the interval angle is set to 30°, 12 virtual sound images can be located around the horizontal plane of the human ear at 30° intervals. The elevation angle of these 12 virtual sound images is 0°, and the azimuth angles are: 0°, 30°, 60°, …, 330°. Of course, the interval angle can also be set to other values, such as 15°, 60°, and so on.

In another embodiment, among the plurality of virtual sound images, the elevation angle of the virtual sound image located behind the human head relative to the plane formed by the first coordinate axis and the second coordinate axis may be greater than that of the virtual sound image located in front of the human head relative to the first coordinate axis and the second coordinate axis. The elevation angle of the plane formed by the first coordinate axis and the second coordinate axis. That is, among the plurality of virtual audio images, the elevation angle of the virtual audio image located behind the human head may be greater than the elevation angle of the virtual audio image located in front of the human head. This enhances the localization effect and reduces the front and rear mirroring problems of the virtual sound image. For example, the elevation angle of the virtual audio image located behind the head can be increased by 10°, that is, the elevation angle of the virtual audio image located in front of the human head is θ=0°, and the elevation angle of the virtual audio image located behind the human head is θ=10°.

As shown in Fig. 4, a plurality of virtual sound images surround the horizontal plane of the human ear at intervals of 30°, the elevation angle of the virtual sound image in front of the human head is 0=0°, and the elevation angle of the virtual sound image behind the head is θ=10°.

It should be noted that the positions of the plurality of virtual sound images in the virtual sound image coordinate system are not limited to the ones mentioned above, and can also be specifically set according to actual needs, and only need to satisfy each virtual sound image. The distance from the coordinate origin is within the set distance range, and the pitch angle of each virtual sound image relative to the plane formed by the first coordinate axis and the second coordinate axis may be within the set angle range. For example, some of the virtual sound images of the multiple virtual sound images surround the plane of the human ear at intervals of 30°, and the elevation angle is 0°. The distance between some virtual sound images and the coordinate origin can be the same or different, but they are all within the set distance range, which will enhance the surround effect of the subsequently generated chorus audio.

The virtual sound image localization is performed on the multiple dry sound audio frequencies after the time alignment processing, and after the multiple dry sound audio frequencies are located on the multiple virtual sound images, the operations of the subsequent steps can be continued.

S140: Generate chorus audio based on the plurality of dry audio audios after virtual sound image localization.

After obtaining the dry audio of the same target song performed by multiple singers respectively, perform time alignment processing on the plurality of dry audios, and perform virtual sound image positioning on the aligned dry audio, and locate the plurality of dry audios. After being placed on the multiple virtual audio images, each dry audio audio in the multiple dry audio audio can be subjected to HRTF filtering processing at the corresponding virtual audio image position, and corresponding audio data can be obtained at each virtual audio image. Chorus audio may be generated based on the plurality of dry audio audio after virtual panning. Specifically, the corresponding audio data obtained after HRTF filtering processing of multiple virtual sound image positions may be superimposed, or weighted and superimposed to obtain chorus audio. The sound effect of the obtained chorus audio has a three-dimensional sound field sense of hearing.

S150: In the case of acquiring the lead vocal audio sung based on the target song, after synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment, output the chorus effect audio.

In an application scenario of the embodiment of the present application, after the chorus audio is generated, the chorus audio can be stored in a database and used when needed. For example, if a user wants to sing a song with a chorus effect, in this case, the chorus audio can be used to achieve the corresponding effect.

You can obtain the audio sung by the current user based on the target song, use the audio as the lead vocal audio, and then synthesize the lead vocal audio, chorus audio and the corresponding accompaniment to obtain the chorus effect audio, output the chorus effect audio, and the current user can enjoy the chorus sound effect. .

For the synthesis of the main vocal audio, chorus audio and corresponding accompaniment, it can be realized in various ways, such as synthesizing the main vocal audio and the corresponding accompaniment first, and then synthesizing it with the chorus audio, or, first synthesizing the chorus audio and the corresponding accompaniment Synthesize, and then synthesize with the lead vocal audio, or, first synthesize the lead vocal audio and chorus audio, and then synthesize with the corresponding accompaniment. Layer the corresponding accompaniment. The chorus sound effect obtained by different implementation methods will be different, and the specific implementation method can be selected according to the actual situation.

Applying the method provided by the embodiment of the present application, after obtaining the dry voice audio of the same target song performed by a plurality of singers respectively, perform time alignment processing on the obtained plurality of dry voice audio, and align the plurality of dry voice audio after the alignment. The audio performs virtual sound image localization to locate multiple dry audio audios on multiple virtual sound images. The multiple virtual sound images are located in the virtual sound image coordinate system centered on the human head, and the distance from the coordinate origin is within the set distance. Within the range, surround the human ear, generate chorus audio based on multiple dry audios after virtual sound image positioning, and when the lead vocal audio based on the target song is obtained, the lead vocal audio, chorus audio and corresponding accompaniment are performed. Chorus, get and output the large chorus effect audio. Positioning multiple dry audio audios on multiple virtual sound images surrounding the human ear can make the generated chorus audio have a sound field surround effect. In terms of hearing, it can effectively prevent the sound field of the final output large chorus effect audio from gathering in the center of the human head. The resulting head effect makes the sound field wider.

In an embodiment of the present application, step S120 performs time alignment processing on the obtained plurality of dry audio frequencies, which may include the following steps:

The first step: determine the reference audio corresponding to the target song;

The second step: for each obtained dry audio frequency, extract the audio features of the current dry audio audio and the reference audio respectively, and the audio features are fingerprint features or fundamental frequency features;

The third step: determining the time corresponding to the maximum value of the audio feature similarity between the current dry audio and the reference audio as the audio alignment time;

The fourth step: performing time alignment processing on the current dry audio audio based on the audio alignment time.

For the convenience of description, the above steps are combined for description.

In the embodiment of the present application, after obtaining the dry voice audio of the same target song performed by a plurality of singers respectively, in the process of performing time alignment processing on the obtained plurality of dry voice audio, the reference corresponding to the target song may be determined first. audio. Specifically, a dry audio audio with better sound quality can be selected from the plurality of obtained dry audio audios as a reference audio. The original dry vocal audio of the target song may also be determined as the reference audio.

For each obtained dry audio audio, audio features of the current dry audio audio and the reference audio can be extracted respectively, and the audio features are fingerprint features or fundamental frequency features. For example, Mel frequency band information, Bark frequency band information, Erb frequency band power, etc. can be extracted through multi-band filtering, and then fingerprint features can be obtained through half-wave rectification, binary judgment, etc. For another example, fundamental frequency features can be extracted by fundamental frequency extraction tools such as pyin, crepe, and harvest. The audio features of the reference audio can be saved after being extracted once, and can be called directly when necessary.

The audio features of the current dry audio and the reference audio are compared, which can be characterized by a similarity curve, etc., and the time corresponding to the maximum similarity value can be determined as the audio alignment time. Then, based on the audio alignment time, time alignment processing is performed on the current dry audio audio.

For each obtained dry audio frequency, the corresponding audio alignment time is obtained by comparing with the audio features of the reference audio, and after time alignment processing is performed, multiple dry audio audios after time alignment processing can be obtained.

In an embodiment of the present application, the method may further include the following steps:

Correspondingly, chorus audio is generated based on the plurality of dry audio audios after virtual sound image localization, including:

The chorus audio is generated based on the plurality of dry audio frequencies and the plurality of bass data after virtual panning.

In this embodiment of the present application, after obtaining the dry audios of the same target song performed by multiple singers, bandpass filtering may be performed on the obtained dry audios, for example, performing bandpass filtering on the plurality of dry audios. Bandpass filtering with a cutoff frequency of [33,523] Hz to obtain multiple bass data.

Chorus audio may be generated based on the plurality of dry audio frequencies and the plurality of bass data after virtual panning. Specifically, chorus audio may be generated by superimposing or weighted superimposing a plurality of obtained bass data and a plurality of dry audio audios based on virtual sound image localization. After superimposing the bass signal, the heaviness of the sound signal can be enhanced.

Reverberation simulation processing is performed on the obtained multiple dry audio frequencies respectively;

A chorus audio is generated based on the plurality of dry audio frequencies subjected to virtual sound image localization and the plurality of dry audio frequencies subjected to reverberation simulation processing.

Usually, the sound signal emitted by the sound source in the sound field will go through processes such as direct sound, reflection, and reverberation. Figure 5 shows a schematic diagram of a typical spatial sound field process composition. In this figure, the acoustic signal with the largest amplitude is the direct sound, the following acoustic signal is the reflected acoustic signal obtained by the sound wave reflected on the object closest to the listener, which has obvious directionality, and then a dense acoustic signal It is the reverberation sound signal obtained by the superposition of sound waves after multiple reflections of surrounding objects. It is the superposition of a large number of reflected sounds from different directions without directionality.

According to the known room impulse response characteristics, the reverberation sound is the superposition of multiple reflection sounds, which is characterized by weak energy and no directionality, because it is the superposition of a large number of late reflection sounds from different directions, and has a high echo density, so You can use reverb to create a surround sound with a sense of surround.

In this embodiment of the present application, after obtaining the dry voice audio of the same target song performed by multiple singers respectively, reverberation simulation processing may be performed on the obtained multiple dry voice audio respectively. Specifically, a cascade of comb filters and all-pass filters can be used to perform reverberation simulation processing on the obtained multiple dry audio frequencies respectively.

Figure 6 shows a cascaded form of comb filters and all-pass filters, in which four comb filters are connected in parallel with two all-pass filters in series. The actual simulated reverberation impulse response is shown in Figure 7.

It should be noted that Figure 6 is only a specific form. In practical applications, there can be other forms. The number of comb filters and all-pass filters and the cascading method can be adjusted according to actual needs. .

Perform reverberation simulation processing and virtual sound image localization on the obtained multiple dry sound audios respectively. After positioning the multiple dry sound audios on multiple virtual sound images, you can Acoustic tones and reverbs simulate multiple dry-voiced tones after processing to generate chorus audio. Specifically, multiple dry audio audios after virtual sound image localization and multiple dry audio audios after reverberation simulation processing may be superimposed or weighted superimposed to generate chorus audio. This enhances the spatial effect of the sound signal, further suppresses the head-in-head effect, and expands the sound field.

In an embodiment of the present application, after performing virtual sound image localization on the plurality of dry audio audios subjected to time alignment processing, the method may further include the following steps:

Reverberation simulation processing is performed on the multiple dry audio frequencies after virtual sound image localization;

The chorus audio is generated based on the plurality of dry audio audios subjected to virtual sound image localization and reverberation simulation processing.

In the embodiment of the present application, after obtaining the dry voice audio of the same target song performed by multiple singers respectively, performing time alignment processing on the obtained multiple dry voice audio frequency, and performing virtual sound image localization, it is possible to further separate the Reverberation simulation processing is performed on the plurality of dry audio frequencies after virtual sound image localization. For the reverberation simulation processing process, reference may be made to the reverberation simulation processing process in the previous embodiment, which will not be repeated here.

A chorus audio can be generated based on the plurality of dry audio audios subjected to virtual sound image localization and subjected to reverberation simulation processing. Specifically, the chorus audio may be generated by superimposing or weighted superimposing a plurality of dry audio audios after performing virtual sound image localization and performing reverberation simulation processing.

Performing reverberation simulation processing on multiple dry audio frequencies after virtual sound image processing can enhance the spatial sound effects of the sound signal, further suppress the head-in-head effect, and expand the sound field.

Perform binaural analog processing on the obtained multiple dry audio frequencies respectively;

A chorus audio is generated based on the plurality of dry audio audios after virtual sound image localization and the plurality of dry audio audios after binaural simulation processing.

In the embodiment of the present application, after obtaining the dry voice audio of the same target song performed by a plurality of singers respectively, and performing time alignment processing on the obtained plurality of dry voice audio, the dual Channel analog processing. The correlation between the two-channel signals is reduced by delay, and the sound field is expanded as much as possible to obtain two-channel output.

As shown in Figure 8, a plurality of dry audio audios can be simulated by 8 groups of different delay weights on the left and right, where d represents the delay and g represents the weight. Since the general room impulse response takes 80ms as the reverberation time, the delay parameter can choose 16 parameters ranging from 21ms to 79ms. The amplitude attenuation is used to represent the energy loss of the sound wave due to reflection, thereby reducing the correlation between the two environmental information. That is, the dry audio can be copied separately to obtain two signals with the same information. The two signals are completely correlated, and then attenuated by different delays and amplitudes to reduce the correlation of the two signals to obtain a pseudo stereo signal.

It should be noted that what is shown in FIG. 8 is only a specific example, and it is possible to set fewer groups or more groups of different delays to implement two-channel simulation according to actual needs.

A chorus audio may be generated based on the plurality of dry audio frequencies after virtual panning and the plurality of dry audio frequencies after binaural simulation processing. Specifically, multiple dry audio audios after virtual sound image localization and multiple dry audio audios after binaural simulation processing may be superimposed or weighted superimposed to generate chorus audio.

In an embodiment of the present application, after performing binaural simulation processing on the obtained plurality of dry audio frequencies, the method may further include the following steps:

Perform reverberation simulation processing on a plurality of dry audio frequencies after two-channel simulation processing;

Correspondingly, chorus audio is generated based on the plurality of dry audio audios after virtual sound image localization and the plurality of dry audio audios after binaural simulation processing, including:

The chorus audio is generated based on the plurality of dry sound audio frequencies after virtual sound image localization, the plurality of dry sound audio frequencies after the binaural simulation processing and the reverberation simulation processing.

In the embodiment of the present application, after obtaining the dry voice audio of the same target song performed by a plurality of singers respectively, time alignment processing is performed on the plurality of dry voice audio frequencies, and binaural simulation processing is performed on the plurality of dry voice audio frequencies respectively. After that, reverberation simulation processing may be further performed on the plurality of dry audio frequencies after the binaural simulation processing, so as to enhance the spatial effect of the sound signal, suppress the head effect, and expand the sound field.

Perform virtual sound image localization on multiple dry sound audios. After positioning multiple dry sound audios on multiple virtual sound images, you can perform dual-channel analog processing and mixing based on the multiple dry sound audios after virtual sound image localization. The chorus audio is generated by ringing multiple dry audio frequencies after analog processing. Specifically, multiple dry audio audios after virtual sound image localization, and multiple dry audio audios after binaural simulation processing and reverberation simulation processing may be superimposed or weighted superimposed to generate chorus audio.

In practical applications, after obtaining the dry audio of the same target song performed by multiple singers, time alignment processing can be performed on the obtained dry audio, and then based on the time alignment processing of the plurality of dry audios The sound audio is processed by virtual sound image localization, bass enhancement, reverberation simulation, two-channel simulation, etc. The specific processing can be carried out in combination with the above embodiments. Sound simulation and two-channel simulation, so that the final generated chorus audio has a surround sound effect, which can be highly robust to a wide range of sound misalignment. If you want to superimpose the chorus audio and the lead vocal The delay drop is large, which can also ensure that the user has a harmonious listening experience.

FIG. 9 is a schematic diagram of a system framework for processing multiple dry audio frequencies after time alignment processing, including a bass enhancement unit, a virtual sound image localization unit, a two-channel simulation unit and a reverberation simulation unit. The bass enhancement unit is used to perform bandpass filtering on multiple dry audios to obtain bass data; the virtual sound image localization unit is used to perform virtual sound image localization on multiple dry audios, so as to locate the multiple dry audios to multiple The two-channel simulation unit is used for performing dual-channel simulation processing on a plurality of dry sound audios; the reverberation simulation unit is used for performing reverberation simulation processing on a plurality of dry sound audio frequencies. Both the virtual sound image localization unit and the two-channel simulation unit can be connected to the reverberation simulation unit. After the virtual sound image localization of multiple dry audio frequencies is performed by the virtual sound image localization unit, the reverberation simulation unit can be further used for reverberation simulation. Similarly, after the binaural simulation processing is performed on the plurality of dry audio frequencies by the binaural simulation unit, the reverberation simulation processing may be further performed by the reverberation simulation unit. Finally, weighted superposition can be performed on the audio data processed by these units to obtain chorus audio.

Figure 10 shows a specific example of processing multiple dry audio frequencies, H represents the transfer function of HRTF filtering, through the processing of this transfer function, virtual sound image localization can be performed on The sound audio is positioned on 12 virtual sound images around the human ear level, REV represents the reverberation analog unit, BASS represents the bass enhancement unit, and REF represents the two-channel analog unit. The reverberation simulation unit here can use the same parameters, and you can also configure different parameters for different reverberation simulation units according to actual needs, so as to obtain flexible reverberation modulation.

The grand chorus effect of the chorus audio finally generated in the embodiment of the present application is closer to the hearing sense of a real concert chorus. In practical applications, adding accompaniment on the basis of the lead vocal audio and mixing the chorus audio at the same time allows users to have an immersive concert experience and a more shocking immersive sound field surround experience.

In an embodiment of the present application, performing virtual sound image localization on a plurality of dry audio frequencies after time alignment processing may include the following steps:

Step 1: according to the number of virtual audio images, group the obtained multiple dry audio audios after time alignment processing, and the number of groups is the same as the number of virtual audio images;

Step 2: Position each group of dry audio audio on the corresponding virtual audio image, and different groups of dry audio audio correspond to different virtual audio images.

For the convenience of description, the above two steps are combined for description.

In the embodiment of the present application, after obtaining the dry voice audio of the same target song performed by a plurality of singers respectively, and performing time alignment processing on the obtained plurality of dry voice audio frequency, the obtained dry voice audio can be obtained according to the number of virtual The plurality of dry audio audios after time alignment processing are grouped, and the number of divided groups is the same as the number of virtual audio images, and the same group includes several dry audio audios. If the number of obtained dry audios is large, the same dry audio can be in only one group; if the number of obtained dry audios is small, the same dry audio can be in multiple groups to better Achieve large chorus sound.

After a plurality of dry sound tones are grouped, each group of dry sound tones can be positioned on corresponding virtual sound images respectively, and different groups of dry sound tones correspond to different virtual sound images. Realize the localization processing of the virtual sound images of multiple dry audios, and enhance the sound effect of the chorus.

In an embodiment of the present application, synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment may include the following steps:

Perform volume adjustment on lead vocal audio and chorus audio respectively, and/or perform reverberation simulation processing on lead vocal audio and chorus audio;

Synthesize the lead vocal audio, chorus audio and corresponding accompaniment after volume adjustment and/or reverb simulation processing.

In this embodiment of the present application, after obtaining the lead vocal audio sung based on the target song, the volume of the lead vocal audio and the chorus audio can be adjusted respectively, so that the volume of the lead vocal audio and the chorus audio are equal, or the volume of the lead vocal audio is greater than that of the chorus audio. volume. At the same time, reverberation simulation processing can also be performed on the lead vocal audio and the chorus audio to obtain a surround sound with a sense of surround.

Then, the lead vocal audio, chorus audio and the corresponding accompaniment after volume adjustment and/or reverberation simulation processing are synthesized, so that the final output chorus effect audio brings a better listening experience to the user.

Corresponding to the above method embodiments, embodiments of the present application further provide a chorus audio processing apparatus, and the chorus audio processing apparatus described below and the chorus audio processing method described above may refer to each other correspondingly.

Referring to Figure 11, the device may include the following modules:

A dry-sound audio obtaining module 1110 is used to obtain the dry-sound audio of the same target song performed by a plurality of singers respectively;

a time alignment processing module 1120, configured to perform time alignment processing on the obtained multiple dry audio frequencies;

The virtual sound image localization module 1130 is configured to perform virtual sound image localization on the plurality of dry sound audios after time alignment processing, so as to locate the plurality of dry sound audios on the plurality of virtual sound images; It is located in the pre-established virtual audio-visual coordinate system. The virtual audio-visual coordinate system is centered on the human head and the coordinate origin is the midpoint of the straight line where the left and right ears are located. The positive direction of the first coordinate axis indicates the front of the head, and the positive direction of the second coordinate axis. The direction represents the side of the human head from the left ear to the right ear, the positive direction of the third coordinate axis represents the top of the human head, the distance between each virtual sound image and the coordinate origin is within the set distance range, and each virtual sound image is relative to the first. The pitch angle of the plane formed by one coordinate axis and the second coordinate axis is within the set angle range;

The chorus audio generation module 1140 is configured to generate chorus audio based on the plurality of dry audio frequencies after virtual sound image localization;

The chorus effect audio obtaining module 1150 is configured to output the chorus effect audio after synthesizing the lead singer audio, the chorus audio and the corresponding accompaniment in the case of obtaining the lead singer audio based on the target song singing.

By applying the device provided by the embodiment of the present application, after obtaining the dry voice audio of the same target song performed by a plurality of singers respectively, time alignment processing is performed on the obtained plurality of dry voice audio, and the aligned plurality of dry voice audio is processed. The audio performs virtual sound image localization to locate multiple dry audio audios on multiple virtual sound images. The multiple virtual sound images are located in the virtual sound image coordinate system centered on the human head, and the distance from the coordinate origin is within the set distance. Within the range, surround the human ear, generate chorus audio based on multiple dry audios after virtual sound image positioning, and when the lead vocal audio based on the target song is obtained, the lead vocal audio, chorus audio and corresponding accompaniment are performed. Chorus, get and output the large chorus effect audio. Positioning multiple dry sound audios on multiple virtual sound images surrounding the human ear can make the generated chorus audio have a sound field surround sound effect. In terms of listening sense, it can effectively prevent the sound field of the final output large chorus effect audio from gathering in the center of the human head. The resulting head effect makes the sound field wider.

In a specific implementation manner of the present application, the time alignment processing module 1120 is used for:

Determine the reference audio corresponding to the target song;

For each obtained dry sound audio, the audio features of the current dry sound audio and the reference audio are extracted respectively, and the audio features are fingerprint features or fundamental frequency features;

Determining the time corresponding to the maximum value of the audio feature similarity between the current dry audio and the reference audio as the audio alignment time;

In a specific embodiment of the present application, it also includes a bass data acquisition module for:

Correspondingly, the chorus audio generation module 1140 is used for:

In a specific implementation manner of the present application, a reverberation simulation processing module is also included for:

Correspondingly, the chorus audio generation module 1140 is used for:

In a specific embodiment of the present application, the reverberation simulation processing module is used for:

Reverberation simulation processing is performed on the obtained multiple dry audio frequencies by using the cascade of comb filters and all-pass filters.

In a specific embodiment of the present application, the reverberation simulation processing module is also used for:

After performing virtual sound image localization on the plurality of dry sound audios subjected to the time alignment processing, reverberation simulation processing is performed on the plurality of dry sound audio frequencies after the virtual sound image localization;

Correspondingly, the chorus audio generation module 1140 is used for:

In a specific embodiment of the present application, it also includes a two-channel analog processing module for:

Correspondingly, the chorus audio generation module 1140 is used for:

After performing binaural simulation processing on the obtained plurality of dry sound audios respectively, performing reverberation simulation processing on the plurality of dry sound audios after the binaural simulation processing;

Correspondingly, the chorus audio generation module 1140 is used for:

In a specific embodiment of the present application, the virtual sound image localization module 1130 is used for:

According to the number of virtual sound images, the obtained multiple dry sound audio frequency after time alignment processing is grouped, and the number of groups is the same as the number of virtual sound images;

Each group of dry audio audio is positioned on the corresponding virtual audio image, and different groups of dry audio audio correspond to different virtual audio images.

In a specific embodiment of the present application, among the plurality of virtual sound images, the elevation angle of the virtual sound image located behind the human head relative to the plane formed by the first coordinate axis and the second coordinate axis is greater than that of the virtual sound image located in front of the human head. is based on the elevation angle of the plane formed by the first coordinate axis and the second coordinate axis; or, each virtual sound image is uniformly distributed on a circumference of the plane formed by the first coordinate axis and the second coordinate axis.

In a specific embodiment of the present application, the chorus effect audio obtaining module 1150 is used for:

Corresponding to the above method embodiments, the embodiments of the present application also provide a chorus audio processing device, including:

memory for storing computer programs;

The processor is configured to implement the steps of the above-mentioned chorus audio processing method when the computer program is executed.

As shown in FIG. 12 , which is a schematic diagram of the composition and structure of a chorus audio processing device, the chorus audio processing device may include: a processor 10 , a memory 11 , a communication interface 12 and a communication bus 13 . The processor 10 , the memory 11 , and the communication interface 12 all communicate with each other through the communication bus 13 .

In this embodiment of the present application, the processor 10 may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit, a digital signal processor, a field programmable gate array, or other programmable logic devices, and the like.

The processor 10 may call the program stored in the memory 11, and specifically, the processor 10 may execute the operations in the embodiments of the method for processing chorus audio.

The memory 11 is used to store one or more programs, and the programs may include program codes, and the program codes include computer operation instructions. In the embodiment of the present application, the memory 11 at least stores a program for realizing the following functions:

Perform time alignment processing on the obtained multiple dry audio frequencies;

Perform virtual sound image localization on the plurality of dry sound audios after time alignment processing, so as to locate the plurality of dry sound audios on the plurality of virtual sound images; wherein, the plurality of virtual sound images are located in a pre-established virtual sound image coordinate system , the virtual audio-visual coordinate system is centered on the human head, and the coordinate origin is the midpoint of the straight line where the left and right ears are located. The side of the third coordinate axis is directly above the human head, the distance between each virtual sound image and the coordinate origin is within the set distance range, and each virtual sound image is formed relative to the first coordinate axis and the second coordinate axis. The pitch angle of the plane is within the set angle range;

Generate chorus audio based on the plurality of dry audio audio after virtual sound image localization;

In the case of acquiring the lead vocal audio sung based on the target song, after synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment, the large chorus effect audio is output.

In a possible implementation, the memory 11 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function (such as an audio playback function and an audio synthesis function). etc.; the storage data area can store data created during use, such as sound image positioning data, audio synthesis data, etc.

Additionally, the memory 11 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device or other volatile solid-state storage device.

The communication interface 12 may be an interface of a communication module for connecting with other devices or systems.

Of course, it should be noted that the structure shown in FIG. 12 does not constitute a limitation on the chorus audio processing device in the embodiment of the present application. In practical applications, the chorus audio processing device may include more or more chorus audio processing devices than those shown in FIG. 12 . Fewer parts, or a combination of certain parts.

Corresponding to the above method embodiments, the embodiments of the present application also provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above-mentioned chorus audio processing method are implemented. .

The various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same or similar parts between the various embodiments may be referred to each other.

Professionals may further realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two, in order to clearly illustrate the possibilities of hardware and software. Interchangeability, the above description has generally described the components and steps of each example in terms of functionality. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be directly implemented in hardware, a software module executed by a processor, or a combination of the two. The software module can be placed in random access memory (RAM), internal memory, read only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other in the technical field. in any other known form of storage medium.

The principles and implementations of the present application are described herein by using specific examples, and the descriptions of the above embodiments are only used to help understand the technical solutions and core ideas of the present application. It should be pointed out that for those of ordinary skill in the art, without departing from the principles of the present application, several improvements and modifications can also be made to the present application, and these improvements and modifications also fall within the protection scope of the claims of the present application.

Claims

A method for processing chorus audio, comprising:

Obtain the dry audio of the same target song performed by multiple singers respectively;

performing time alignment processing on the obtained plurality of the dry audio frequencies;

Perform virtual sound image localization on a plurality of the dry sound audios after time alignment processing, so as to locate the plurality of dry sound audios on a plurality of virtual sound images; In the virtual audio-visual coordinate system, the virtual audio-visual coordinate system is centered on the human head, with the midpoint of the straight line where the left and right ears are located as the coordinate origin, the positive direction of the first coordinate axis represents the front of the human head, and the positive direction of the second coordinate axis Represents the side of the human head from the left ear to the right ear, the positive direction of the third coordinate axis represents directly above the human head, the distance between each virtual sound image and the coordinate origin is within the set distance range, and each virtual sound image is within the set distance range. The pitch angle of the sound image relative to the plane formed by the first coordinate axis and the second coordinate axis is within a set angle range;

generating chorus audio based on a plurality of the dry audio audio after virtual sound image localization;

In the case of acquiring the lead vocal audio sung based on the target song, after synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment, a large chorus effect audio is output.
The method for processing chorus audio according to claim 1, wherein the performing time alignment processing on the plurality of obtained dry audio audios comprises:

Determine the reference audio corresponding to the target song;

For each of the obtained dry audio frequencies, extract the audio features of the current dry audio audio and the reference audio respectively, and the audio features are fingerprint features or fundamental frequency features;

Determine the time corresponding to the audio feature similarity maximum value of the current dry sound audio and the reference audio as the audio alignment time;

Based on the audio alignment time, time alignment processing is performed on the current dry audio audio.
The method for processing chorus audio according to claim 1, further comprising:

Band-pass filtering is performed on the obtained multiple dry audio frequencies respectively to obtain multiple bass data;

Correspondingly, the chorus audio is generated based on the plurality of dry sound audios after virtual sound image localization, including:

A chorus audio is generated based on the plurality of the dry audio frequencies and the plurality of the bass data subjected to virtual sound image localization.
The method for processing chorus audio according to claim 1, further comprising:

respectively performing reverberation simulation processing on the obtained plurality of the dry audio frequencies;

Correspondingly, the chorus audio is generated based on the plurality of dry sound audios after virtual sound image localization, including:

A chorus audio is generated based on the plurality of the dry audio frequencies subjected to virtual sound image localization and the plurality of the dry audio frequencies subjected to reverberation simulation processing.
The method for processing chorus audio according to claim 4, wherein the performing reverberation simulation processing on the obtained plurality of the dry audio frequencies respectively comprises:

Reverberation simulation processing is respectively performed on the obtained plurality of the dry audio frequencies by using the cascade of comb filters and all-pass filters.
The method for processing chorus audio according to claim 1, wherein after performing virtual sound image localization on the plurality of dry audio audios subjected to time alignment processing, the method further comprises:

respectively performing reverberation simulation processing on a plurality of the dry audio frequencies after virtual sound image localization;

Correspondingly, the chorus audio is generated based on the plurality of dry sound audios after virtual sound image localization, including:

A chorus audio is generated based on the plurality of dry audio audios subjected to virtual sound image localization and reverberation simulation processing.
The method for processing chorus audio according to claim 1, further comprising:

respectively perform binaural simulation processing on the obtained plurality of the dry audio frequencies;

Correspondingly, the chorus audio is generated based on the plurality of dry sound audios after virtual sound image localization, including:

A chorus audio is generated based on the plurality of the dry audio audio after virtual sound image localization and the plurality of the dry audio audio after binaural simulation processing.
The method for processing chorus audio according to claim 7, characterized in that, after performing binaural simulation processing on the obtained plurality of dry audio audios, the method further comprises:

performing reverberation simulation processing on a plurality of the dry audio frequencies after the binaural simulation processing;

Correspondingly, described based on a plurality of described dry sound audio frequency after virtual sound image localization and a plurality of described dry sound audio frequency after binaural simulation processing, generate chorus audio, including:

The chorus audio is generated based on the plurality of the dry audio audios after virtual sound image localization, the plurality of the dry audio audios after the binaural simulation processing and the reverberation simulation processing.
The method for processing chorus audio according to claim 1, wherein the performing virtual sound image localization on a plurality of the dry audio audio after time alignment processing comprises:

According to the number of virtual sound images, the obtained dry sound audio frequency after time alignment processing is grouped, and the number of groups is the same as the number of virtual sound images;

Each group of dry audio audio is positioned on the corresponding virtual audio image, and different groups of dry audio audio correspond to different virtual audio images.
The processing method of chorus audio according to claim 1, wherein,

Among the plurality of virtual sound images, the elevation angle of the virtual sound image located behind the human head relative to the plane formed by the first coordinate axis and the second coordinate axis is greater than that of the virtual sound image located in front of the human head relative to the first coordinate axis. the elevation angle of the plane formed by the coordinate axis and the second coordinate axis;

or,

Each of the virtual sound images is evenly distributed on a circumference of the plane formed by the first coordinate axis and the second coordinate axis.
The method for processing chorus audio according to any one of claims 1 to 10, wherein the synthesizing the lead vocal audio, the chorus audio and the corresponding accompaniment comprises:

respectively performing volume adjustment on the lead vocal audio and the chorus audio, and/or performing reverberation simulation processing on the lead vocal audio and the chorus audio;

The lead vocal audio, the chorus audio and the corresponding accompaniment after volume adjustment and/or reverberation simulation processing are synthesized.
A processing device for chorus audio, comprising:

memory for storing computer programs;

The processor is configured to implement the steps of the chorus audio processing method according to any one of claims 1 to 11 when executing the computer program.
A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the chorus audio according to any one of claims 1 to 11 is implemented. The steps of the processing method.