CN113192486B - Chorus audio processing method, chorus audio processing equipment and storage medium - Google Patents


Info

Publication number
CN113192486B
CN113192486B (application CN202110460280.4A)
Authority
CN
China
Prior art keywords
audio
dry
chorus
virtual sound
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110460280.4A
Other languages
Chinese (zh)
Other versions
CN113192486A (en)
Inventor
张超鹏
陈灏
武文昊
罗辉
李革委
姜涛
胡鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202110460280.4A priority Critical patent/CN113192486B/en
Publication of CN113192486A publication Critical patent/CN113192486A/en
Priority to PCT/CN2022/087784 priority patent/WO2022228220A1/en
Application granted granted Critical
Publication of CN113192486B publication Critical patent/CN113192486B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335 Pitch control
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/13 Aspects of volume control, not necessarily automatic, in stereophonic sound systems

Abstract

The application discloses a chorus audio processing method, which may include the following steps: respectively obtaining the dry audio of a plurality of singers singing the same target song; performing time alignment on the obtained dry audio tracks and then performing virtual sound image localization, so as to position the dry audio tracks at a plurality of virtual sound images; generating chorus audio; and, when a main singing audio sung for the target song is acquired, synthesizing the main singing audio, the chorus audio and the corresponding accompaniment, and outputting the large chorus effect audio. By applying the technical scheme provided by the application, the plurality of virtual sound images surround the human ears and the plurality of dry audio tracks are positioned at those virtual sound images, so the output large chorus effect audio has a surrounding sound field; the in-head effect caused by the sound field gathering at the center of the human head is effectively avoided, and the sound field is wider. The application also discloses a chorus audio processing apparatus, a chorus audio processing device and a storage medium, which have corresponding technical effects.

Description

Chorus audio processing method, chorus audio processing equipment and storage medium
Technical Field
The present invention relates to the field of computer application technologies, and in particular, to a method, an apparatus, and a storage medium for processing chorus audio.
Background
With the rapid development of computer technology, audio, video, office and other kinds of software have gradually multiplied, bringing great convenience to people's lives. With audio software, a user can listen to songs, sing songs, and so on.
Currently, in order to give users the hearing experience of a large concert chorus, the singing data of multiple people is often simply superimposed. However, the audio obtained by such direct superposition sounds as if the sound field were concentrated at the center of the listener's head (the so-called in-head effect); the sound field is not wide enough and the listening experience is poor.
Disclosure of Invention
The purpose of the application is to provide a chorus audio processing method, apparatus, device and storage medium, so as to avoid the in-head effect caused by the sound field gathering at the center of the human head, widen the sound field, and improve the hearing experience.
In order to solve the technical problems, the application provides the following technical scheme:
a method of processing chorus audio, comprising:
respectively obtaining the dry audio of a plurality of singers singing the same target song;
performing time alignment processing on the obtained plurality of dry sound frequencies;
performing virtual sound image positioning on the plurality of dry sound frequencies subjected to time alignment processing so as to position the plurality of dry sound frequencies on a plurality of virtual sound images; the virtual sound images are located in a pre-established virtual sound image coordinate system, the virtual sound image coordinate system takes a human head as a center, a straight line midpoint of a left ear and a right ear as a coordinate origin, the positive direction of a first coordinate axis represents the right front of the human head, the positive direction of a second coordinate axis represents the side of the human head from the left ear to the right ear, the positive direction of a third coordinate axis represents the right top of the human head, the distance between each virtual sound image and the coordinate origin is within a set distance range, and the pitch angle of each virtual sound image relative to a plane formed by the first coordinate axis and the second coordinate axis is within a set angle range;
Generating chorus audio based on the plurality of the dry audio frequencies subjected to virtual sound image localization;
and under the condition that the main singing audio based on the target song performance is acquired, synthesizing the main singing audio, the chorus audio and the corresponding accompaniment, and outputting the large chorus effect audio.
In a specific embodiment of the present application, the time alignment processing for the obtained plurality of dry audio frequencies includes:
determining a reference audio corresponding to the target song;
respectively extracting audio features of the current dry sound audio and the reference audio aiming at each obtained dry sound audio, wherein the audio features are fingerprint features or fundamental frequency features;
determining the time corresponding to the maximum value of the audio feature similarity of the current dry audio and the reference audio as audio alignment time;
and performing time alignment processing on the current dry sound frequency based on the audio alignment time.
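One hedged way to realize the alignment step described above is to take, as the audio alignment time, the lag that maximizes the cross-correlation between the dry track and the reference audio. The sketch below operates on raw sample arrays for simplicity, although the text says fingerprint or fundamental-frequency feature sequences may be correlated instead; the function names are illustrative, not from the patent.

```python
import numpy as np

def alignment_offset(dry, reference):
    """Estimate the lag (in samples) at which `dry` best matches
    `reference`, as the argmax of their cross-correlation.
    A positive offset means the dry track starts late."""
    corr = np.correlate(dry, reference, mode="full")
    # index len(reference) - 1 corresponds to zero lag
    return int(np.argmax(corr)) - (len(reference) - 1)

def time_align(dry, reference):
    """Shift `dry` so its best-matching position lines up with `reference`."""
    off = alignment_offset(dry, reference)
    if off > 0:                      # dry is delayed: drop its leading samples
        return dry[off:]
    return np.concatenate([np.zeros(-off), dry])  # dry is early: pad with silence
```

In practice the same argmax would be taken over feature-sequence similarity (fingerprints or F0 contours), which is more robust to level and timbre differences than raw-sample correlation.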
In a specific embodiment of the present application, further comprising:
respectively carrying out band-pass filtering processing on the obtained dry sound frequencies to obtain a plurality of bass data;
correspondingly, the generating chorus audio based on the plurality of the dry audio after virtual sound image localization includes:
And generating chorus audio based on the plurality of dry sound audios and the plurality of bass data after virtual sound image localization.
In a specific embodiment of the present application, further comprising:
respectively carrying out reverberation simulation processing on the obtained dry sound frequencies;
correspondingly, the generating chorus audio based on the plurality of the dry audio after virtual sound image localization includes:
and generating chorus audio based on the plurality of the dry sound audio subjected to virtual sound image localization and the plurality of the dry sound audio subjected to reverberation simulation processing.
In a specific embodiment of the present application, the performing reverberation simulation processing on the obtained plurality of dry audio frequencies includes:
and respectively carrying out reverberation simulation processing on the obtained dry sound frequencies by utilizing cascade connection of a comb filter and an all-pass filter.
In one specific embodiment of the present application, after the virtual sound image localization is performed on the plurality of dry sound frequencies after the time alignment process, the method further includes:
respectively carrying out reverberation simulation processing on the plurality of dry sound frequencies subjected to virtual sound image positioning;
correspondingly, the generating chorus audio based on the plurality of the dry audio after virtual sound image localization includes:
And generating chorus audio based on the plurality of dry audio after virtual sound image localization and reverberation simulation processing.
In a specific embodiment of the present application, further comprising:
respectively carrying out double-channel simulation processing on the obtained dry sound frequencies;
correspondingly, the generating chorus audio based on the plurality of the dry audio after virtual sound image localization includes:
and generating chorus audio based on the plurality of the dry audio subjected to virtual sound image localization and the plurality of the dry audio subjected to binaural simulation processing.
In a specific embodiment of the present application, after the two-channel analog processing is performed on the obtained plurality of dry audio frequencies, the method further includes:
performing reverberation simulation processing on the plurality of dry sound frequencies subjected to the binaural simulation processing;
correspondingly, the generating chorus audio based on the plurality of the dry audio subjected to virtual sound image localization and the plurality of the dry audio subjected to binaural simulation processing includes:
and generating chorus audio based on the plurality of the dry audio subjected to virtual sound image positioning, the two-channel simulation processing and the plurality of the dry audio subjected to reverberation simulation processing.
In a specific embodiment of the present application, the performing virtual sound image localization on the plurality of dry sound frequencies after performing the time alignment processing includes:
grouping the obtained dry audio after time alignment treatment according to the number of the virtual sound images, wherein the number of the groups is the same as the number of the virtual sound images;
and respectively positioning each group of dry sound frequency to the corresponding virtual sound image, wherein different groups of dry sound frequency correspond to different virtual sound images.
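The grouping step can be sketched as a simple round-robin assignment; the assignment policy is an assumption, since the patent only requires that the number of groups equal the number of virtual sound images and that different groups map to different images.

```python
def group_tracks(tracks, num_images):
    """Assign dry vocal tracks to virtual sound images round-robin:
    groups[i] holds the tracks that will be localized at image i."""
    groups = [[] for _ in range(num_images)]
    for i, track in enumerate(tracks):
        groups[i % num_images].append(track)
    return groups
```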
In one embodiment of the present application,
among the plurality of virtual sound images, the elevation angle of the virtual sound image positioned behind the human head relative to the plane formed by the first coordinate axis and the second coordinate axis is larger than the elevation angle of the virtual sound image positioned in front of the human head relative to the plane formed by the first coordinate axis and the second coordinate axis;
or,
each virtual sound image is uniformly distributed on one circle of a plane formed by the first coordinate axis and the second coordinate axis.
In a specific embodiment of the present application, the synthesizing the main singing audio, the chorus audio and the corresponding accompaniment includes:
respectively adjusting the volume of the main singing audio and the chorus audio, and/or performing reverberation simulation processing on the main singing audio and the chorus audio;
And synthesizing the main singing audio, the chorus audio and the corresponding accompaniment after volume adjustment and/or reverberation simulation processing.
A chorus audio processing apparatus, comprising:
the dry sound frequency obtaining module is used for respectively obtaining dry sound frequency of singing the same target song by a plurality of singers;
an alignment processing module, configured to perform time alignment processing on the obtained plurality of dry audio frequencies;
the virtual sound image positioning module is used for positioning the virtual sound images of the plurality of the dry sound frequencies subjected to the time alignment processing so as to position the plurality of the dry sound frequencies on the plurality of virtual sound images; the virtual sound images are located in a pre-established virtual sound image coordinate system, the virtual sound image coordinate system takes a human head as a center, a straight line midpoint of a left ear and a right ear as a coordinate origin, the positive direction of a first coordinate axis represents the right front of the human head, the positive direction of a second coordinate axis represents the side of the human head from the left ear to the right ear, the positive direction of a third coordinate axis represents the right top of the human head, the distance between each virtual sound image and the coordinate origin is within a set distance range, and the pitch angle of each virtual sound image relative to a plane formed by the first coordinate axis and the second coordinate axis is within a set angle range;
A chorus audio generation module for generating chorus audio based on the plurality of the dry audio after virtual sound image localization;
and the large chorus effect audio output module is used for outputting the large chorus effect audio after synthesizing the main singing audio, the chorus audio and the corresponding accompaniment under the condition that the main singing audio based on the target song singing is acquired.
A chorus audio processing device, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method for processing chorus audio as described in any one of the above when executing the computer program.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the chorus audio processing method of any of the above.
By applying the technical scheme provided by the embodiments of the application, after the dry audio of a plurality of singers singing the same target song is respectively obtained, time alignment is performed on the obtained dry audio tracks, and virtual sound image localization is performed on the aligned tracks so that they are positioned at a plurality of virtual sound images. The virtual sound images lie in a virtual sound image coordinate system centered on the human head, with their distances from the coordinate origin within a set range, so that they surround the human ears. Chorus audio is generated based on the dry audio tracks after virtual sound image localization, and when a main singing audio sung for the target song is acquired, the main singing audio, the chorus audio and the corresponding accompaniment are synthesized to obtain and output the large chorus effect audio. Because the plurality of dry audio tracks are positioned at virtual sound images surrounding the human ears, the generated chorus audio has a surrounding sound field; audibly, the in-head effect that would arise if the sound field of the final large chorus effect audio gathered at the center of the human head is effectively avoided, and the sound field is wider.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of an embodiment of a method for processing chorus audio;
FIG. 2 is a schematic diagram showing the sound image orientation of a virtual sound image localization coordinate system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of virtual sound image localization according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a localized virtual sound image according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a spatial sound field process according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a cascade of a comb filter and an all-pass filter according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a reverberant impulse response according to one embodiment of the present application;
FIG. 8 is a schematic diagram of a two-channel simulation process according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a frame of a chorus audio processing system according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a specific structure of a chorus audio processing system according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a chorus audio processing apparatus according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a chorus audio processing device according to an embodiment of the present application.
Detailed Description
The core of the application is to provide a chorus audio processing method. After the dry audio of a plurality of singers singing the same target song is respectively obtained, time alignment is performed on the obtained dry audio tracks, and virtual sound image localization is performed on the aligned tracks so that they are positioned at a plurality of virtual sound images. The virtual sound images lie in a virtual sound image coordinate system centered on the human head, with their distances from the coordinate origin within a set range, so that they surround the human ears. Chorus audio is generated based on the dry audio tracks after virtual sound image localization, and when a main singing audio sung for the target song is acquired, the main singing audio, the chorus audio and the corresponding accompaniment are synthesized to obtain and output the large chorus effect audio. Because the plurality of dry audio tracks are positioned at virtual sound images surrounding the human ears, the generated chorus audio has a surrounding sound field; audibly, the in-head effect that would arise if the sound field of the final large chorus effect audio gathered at the center of the human head is effectively avoided, and the sound field is wider.
In practical application, the method provided by the embodiment of the application can be applied to various scenes in which a great chorus sound effect is wanted, and implementation of a specific scheme can be carried out through interaction between a server and a client.
For example, in scenario 1, the server may obtain in advance the dry audio of a plurality of singers (singer 1, singer 2, singer 3, singer 4, and so on) singing the target song, perform time alignment on the obtained tracks, perform virtual sound image localization on the aligned tracks so as to position them at a plurality of virtual sound images, and generate chorus audio based on the localized tracks. When user X wants his own rendition of the song to carry the large chorus sound effect, he can sing the target song through the client; the server obtains the main singing audio sung by user X through the client, synthesizes the main singing audio, the chorus audio and the corresponding accompaniment to obtain the large chorus effect audio, and outputs it through the client, so that user X can experience the large chorus sound effect.
In scenario 2, several friends (users 1, 2, 3, 4 and 5) who sing the target song during the same period but in different places want to achieve the large chorus sound effect. From the perspective of any one of them, the current user can be taken as the main singer. For example, from the perspective of user 1, the server may obtain the dry audio of users 2, 3, 4 and 5 singing the target song, perform time alignment on the obtained tracks, position the aligned tracks at a plurality of virtual sound images, and generate chorus audio based on the localized tracks. When the server acquires, through the client, the main singing audio of user 1 singing the target song, it synthesizes the main singing audio, the chorus audio and the corresponding accompaniment to obtain the large chorus effect audio, and outputs it to user 1 through the client, so that user 1 can experience the large chorus effect.
The application scenario is described by way of example only, and in practical application, the technical scheme of the application can also be applied to more scenarios, such as multi-person chorus, multi-person small band and other sound effect processing scenarios.
In order to provide a better understanding of the present application, those skilled in the art will now make further details of the present application with reference to the drawings and detailed description. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Referring to fig. 1, a flowchart of an embodiment of a method for processing chorus audio according to the present application may include the following steps:
s110: and respectively obtaining dry voice audios for singing the same target song by a plurality of singers.
In the embodiment of the application, a plurality of dry sound audios can be obtained according to actual needs. The plurality of dry audio may be audio data of different singers singing for the same target song, and the different singers may be in the same or different environments.
S120: the obtained plurality of dry audio frequencies are subjected to time alignment processing.
Since the dry audio tracks of the same target song are sung by different singers at different times, misalignment such as delay may exist among them. To achieve a better chorus sound effect later, time alignment can be performed on the obtained dry audio tracks, so that after alignment no track seriously rushes or drags the beat, for example leading or lagging by more than 1 second. Specifically, an alignment tool may be used to align the obtained dry audio tracks to the same starting position.
In a specific embodiment of the application, before time alignment, the obtained dry audio tracks may be preliminarily screened, for example with tools such as sound quality detection, so as to reject tracks of poor quality: noisy recordings, accompaniment bleed-through, audio that is too short, audio whose energy is too low, pops, and the like. Time alignment and the subsequent steps are then performed on the dry audio tracks remaining after screening.
S130: and carrying out virtual sound image positioning on the plurality of dry sound frequencies subjected to the time alignment processing so as to position the plurality of dry sound frequencies on the plurality of virtual sound images.
The plurality of virtual sound images lie in a pre-established virtual sound image coordinate system. The coordinate system is centered on the human head, with the midpoint of the line connecting the left and right ears as the coordinate origin; the positive direction of the first coordinate axis points directly in front of the human head, the positive direction of the second coordinate axis points from the left ear toward the right ear, and the positive direction of the third coordinate axis points directly above the human head. The distance between each virtual sound image and the coordinate origin is within a set distance range, and the pitch angle of each virtual sound image relative to the plane formed by the first and second coordinate axes is within a set angle range.
In the embodiment of the application, a virtual sound image coordinate system may be pre-established for describing sound image azimuths. It may specifically be a Cartesian coordinate system. As shown in FIG. 2, the coordinate system is centered on the human head, with the midpoint of the line connecting the left and right ears as the coordinate origin; the positive x-axis (first coordinate axis) points directly in front of the head, the positive y-axis (second coordinate axis) points from the left ear toward the right ear, and the positive z-axis (third coordinate axis) points directly above the head (the top-of-head direction). A sound image in space has a certain azimuth (azimuth angle φ) and elevation (elevation angle θ), so its position may be denoted (φ, θ, rad), where rad denotes the distance of the sound image from the coordinate origin.
A typical vocal signal is a single-channel (mono) signal, which can be regarded as a sound image located at the origin. To obtain a desired virtual sound image, the localization operation can be implemented by convolving the data with an HRTF (Head Related Transfer Function). The virtual sound image localization diagram is shown in FIG. 3, where X denotes the real sound source (the mono signal), Y_L and Y_R denote the acoustic signals heard by the left and right ears respectively, and the HRTFs denote the transfer functions along the paths from the source position to the two ears. Filtering the real sound source (mono signal) with the left-ear and right-ear HRTFs for a given position (φ, θ) yields a two-channel acoustic signal.
The frequency-domain characteristics of the acoustic signals received by the left and right ears can be expressed as:

Y_L(ω) = H_L(ω) · X(ω)
Y_R(ω) = H_R(ω) · X(ω)

where H_L(ω) and H_R(ω) are the HRTFs of the left and right ears.
the acoustic signal heard by the human ear can be simply considered to be the result of HRTF filtering of the sound source X. Therefore, when virtual sound image localization is performed, the sound signal can be filtered through HRTFs at corresponding positions. A plurality of virtual sound images may be set in the virtual sound image coordinate system, a distance between each virtual sound image and the origin of coordinates may be within a set distance range, such as a range of 1 meter, and a pitch angle of each virtual sound image with respect to a plane formed by the first coordinate axis and the second coordinate axis of the virtual sound image coordinate system may be within a set angle range, such as a range of 10 °, so that the plurality of virtual sound images encircle the human ear.
Specifically, the plurality of virtual sound images may be uniformly distributed on a circle in the plane formed by the first and second coordinate axes, i.e., surrounding the horizontal plane of the ears at equal angular intervals. The interval angle may be set according to the actual situation or from analysis of historical data, for example 30°. With a 30° interval around the horizontal plane of the ears, 12 virtual sound images can be placed, each with an elevation of 0° and with azimuths of 0°, 30°, 60°, …, 330°. Of course, the interval angle may also be set to other values, such as 15° or 60°.
In another embodiment, among the plurality of virtual sound images, an elevation angle of the virtual sound image located behind the human head with respect to a plane formed by the first coordinate axis and the second coordinate axis may be larger than an elevation angle of the virtual sound image located in front of the human head with respect to a plane formed by the first coordinate axis and the second coordinate axis. That is, among the plurality of virtual sound images, the virtual sound image located behind the head may have an elevation angle greater than that of the virtual sound image located in front of the head. Thus, the localization effect can be enhanced, and the front-back mirror image problem of the virtual sound image can be reduced. For example, the elevation angle of the virtual sound image behind the head may be adjusted up by 10 °, i.e. the elevation angle θ=0° of the virtual sound image in front of the head, and the elevation angle θ=10° of the virtual sound image behind the head.
As shown in fig. 4, the plurality of virtual sound images surround the human ear level surface at intervals of 30 °, the elevation angle θ=0° of the virtual sound image located in front of the human head, and the elevation angle θ=10° of the virtual sound image located behind the human head.
The positions of the plurality of virtual sound images in the virtual sound image coordinate system are not limited to the above-mentioned ones, and may be specifically set according to actual needs, so long as the distance between each virtual sound image and the origin of coordinates is within a set distance range, and the pitch angle of each virtual sound image with respect to the plane formed by the first coordinate axis and the second coordinate axis is within a set angle range. For example, a part of the virtual sound images in the plurality of virtual sound images surround the human ear plane at intervals of 30 degrees, the elevation angle is 0 degrees, the other part of the virtual sound images surround the human ear plane at intervals of 60 degrees, the elevation angle is 10 degrees, and the distances between the two parts of virtual sound images and the origin of coordinates can be the same or different, but are all within a set distance range, so that the surrounding effect of the chorus audio generated later can be enhanced.
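The layouts described above can be sketched as a small generator function. The 30° interval, the 10° elevation for rear images, and a 1-meter radius follow the examples in the text; treating azimuths strictly between 90° and 270° as "behind the head" is an assumption of this sketch.

```python
def virtual_image_layout(step_deg=30, front_elev=0.0, rear_elev=10.0, rad=1.0):
    """Place virtual sound images around the horizontal plane of the
    ears at a fixed angular interval; images behind the head get a
    raised elevation to reduce front-back confusion."""
    images = []
    for az in range(0, 360, step_deg):
        # azimuths in (90, 270) are treated as behind the head (assumption)
        elev = rear_elev if 90 < az < 270 else front_elev
        images.append((az, elev, rad))
    return images
```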
After virtual sound image localization is performed on the plurality of time-aligned dry audio signals, so that they are localized onto the plurality of virtual sound images, the subsequent steps are executed.
S140: chorus audio is generated based on the plurality of dry audio frequencies after virtual sound image localization.
After the dry audio of a plurality of singers singing the same target song is obtained, time alignment processing is performed, virtual sound image localization is performed on the aligned dry audio, and the dry audio signals are localized onto the plurality of virtual sound images. Each dry audio signal can then be filtered with the HRTF (head-related transfer function) of its corresponding virtual sound image position, yielding audio data at each virtual sound image. Chorus audio may be generated based on the plurality of dry audio signals after virtual sound image localization; specifically, the audio data obtained from the HRTF filtering at the plurality of virtual sound image positions may be superimposed, or weighted and superimposed, to obtain the chorus audio. The resulting chorus audio has the listening sensation of a three-dimensional sound field.
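A minimal sketch of this rendering step, assuming per-image HRIR (head-related impulse response) pairs are available. Here random placeholder filters stand in for measured HRTF data, and `render_chorus` is a hypothetical helper, not the patent's implementation:

```python
import numpy as np

def render_chorus(dry_signals, hrirs, weights=None):
    """Localize each dry vocal with the HRIR pair of its assigned virtual
    sound image and superimpose (optionally weighted) the results.

    dry_signals : list of 1-D arrays (mono, time-aligned dry vocals)
    hrirs       : list of (left, right) impulse-response pairs, one per
                  dry signal -- random placeholders here; a real system
                  would use a measured HRTF/HRIR set.
    Returns an (n, 2) stereo chorus mix.
    """
    if weights is None:
        weights = [1.0] * len(dry_signals)
    n = max(len(s) + len(h[0]) - 1 for s, h in zip(dry_signals, hrirs))
    mix = np.zeros((n, 2))
    for s, (hl, hr), w in zip(dry_signals, hrirs, weights):
        left = np.convolve(s, hl)    # filter with left-ear HRIR
        right = np.convolve(s, hr)   # filter with right-ear HRIR
        mix[:len(left), 0] += w * left
        mix[:len(right), 1] += w * right
    return mix

rng = np.random.default_rng(0)
dry = [rng.standard_normal(1000) for _ in range(3)]
hrirs = [(rng.standard_normal(128), rng.standard_normal(128)) for _ in range(3)]
chorus = render_chorus(dry, hrirs)
```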
S150: in the case that a main singing audio based on the target song is obtained, synthesize the main singing audio, the chorus audio and the corresponding accompaniment, and output the grand chorus effect audio.
In an application scenario of the embodiment of the present application, the generated chorus audio may be stored in a database and used when needed. For example, when a user wants a song sung by himself to carry a chorus effect, the stored chorus audio can be used to achieve it.
The audio of the current user singing the target song is obtained and taken as the main singing audio. The main singing audio, the chorus audio and the corresponding accompaniment are then synthesized to obtain the grand chorus effect audio, which is output so that the current user can enjoy the grand chorus effect.
The synthesis of the main singing audio, the chorus audio and the corresponding accompaniment can be realized in various ways: for example, first synthesize the main singing audio with the corresponding accompaniment and then mix in the chorus audio; or first synthesize the main singing audio with the chorus audio and then mix in the corresponding accompaniment, e.g. after equalization of the main singing audio and the chorus audio, superimpose the corresponding accompaniment according to a set vocal-to-accompaniment ratio. Different implementations may yield different grand chorus effects, and a specific implementation can be selected according to the actual situation.
In the method provided by the embodiment of the present application, after the dry audio of a plurality of singers singing the same target song is obtained, time alignment processing is performed on the obtained dry audio, and virtual sound image localization is performed on the aligned dry audio so as to localize the dry audio signals onto a plurality of virtual sound images. The virtual sound images lie in a virtual sound image coordinate system centered on the human head, at distances from the origin of coordinates within a set distance range, surrounding the ears. Chorus audio is generated based on the dry audio after virtual sound image localization, and, when a main singing audio based on the target song is obtained, the main singing audio, the chorus audio and the corresponding accompaniment are synthesized to obtain and output the grand chorus effect audio. Because the dry audio signals are localized onto virtual sound images surrounding the ears, the generated chorus audio has a surrounding sound field; audibly, this effectively avoids the in-head effect where the sound field of the final grand chorus effect audio collapses to the center of the head, so the sound field is wider.
In one embodiment of the present application, step S120 of performing time alignment processing on the obtained plurality of dry audio signals may include the following steps:
Step 1: determine a reference audio corresponding to the target song;
Step 2: for each obtained dry audio, extract the audio features of the current dry audio and of the reference audio, the audio features being fingerprint features or fundamental-frequency features;
Step 3: determine the time corresponding to the maximum similarity between the audio features of the current dry audio and those of the reference audio as the audio alignment time;
Step 4: perform time alignment processing on the current dry audio based on the audio alignment time.
For ease of description, the steps described above are combined.
In the embodiment of the present application, after the dry audio of a plurality of singers singing the same target song is obtained, in order to time-align the obtained dry audio signals, the reference audio corresponding to the target song may be determined first. Specifically, one of the obtained dry audio signals with good sound quality can be selected as the reference audio, or the original singer's dry audio of the target song may be used as the reference audio.
For each obtained dry audio, the audio features of the current dry audio and of the reference audio may be extracted separately, the audio features being fingerprint features or fundamental-frequency features. For example, Mel-band information, Bark-band information or ERB-band power can be extracted through multi-band filtering, and fingerprint features can then be obtained through half-wave rectification, binary decision, and so on. As another example, fundamental-frequency features can be extracted with fundamental-frequency extraction tools such as pYIN, CREPE or Harvest. The audio features of the reference audio need only be extracted once, saved, and retrieved when needed.
The comparison between the audio features of the current dry audio and those of the reference audio can be characterized by a similarity curve or the like, and the time corresponding to the maximum similarity is determined as the audio alignment time. The current dry audio is then time-aligned based on this audio alignment time.
By comparing each obtained dry audio with the audio features of the reference audio, the corresponding audio alignment time is obtained and time alignment is performed, yielding the plurality of time-aligned dry audio signals.
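The similarity-maximum step can be sketched with plain cross-correlation. This is a hedged stand-in: the method above correlates extracted fingerprint or fundamental-frequency features, while raw samples are used here only to keep the example short.

```python
import numpy as np

def alignment_offset(dry, reference):
    """Lag (in samples) at which the dry signal best matches the
    reference, i.e. the argmax of their cross-correlation."""
    corr = np.correlate(dry, reference, mode="full")
    return int(np.argmax(corr)) - (len(reference) - 1)

def time_align(dry, reference):
    """Shift the dry signal so that it lines up with the reference."""
    lag = alignment_offset(dry, reference)
    if lag > 0:                                   # dry starts late: trim
        return dry[lag:]
    return np.concatenate([np.zeros(-lag), dry])  # dry starts early: pad

rng = np.random.default_rng(0)
ref = rng.standard_normal(2000)
delayed = np.concatenate([np.zeros(37), ref])  # dry vocal arriving 37 samples late
aligned = time_align(delayed, ref)
```

With feature sequences instead of raw samples, the same argmax-of-similarity logic yields the audio alignment time of Step 3.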
In one embodiment of the present application, the method may further comprise the steps of:
Respectively carrying out band-pass filtering processing on the obtained dry sound frequencies to obtain a plurality of bass data;
correspondingly, generating chorus audio based on the plurality of dry audio frequencies subjected to virtual sound image localization comprises the following steps:
a chorus audio is generated based on the plurality of dry audio and the plurality of bass data after virtual sound image localization.
In this embodiment of the present application, after the dry audio of a plurality of singers singing the same target song is obtained, band-pass filtering may be performed on the obtained dry audio signals, for example band-pass filtering with cut-off frequencies of [33, 523] Hz, so as to obtain a plurality of pieces of bass data.
Chorus audio may be generated based on the plurality of dry audio signals after virtual sound image localization together with the plurality of pieces of bass data. Specifically, the obtained bass data may be superimposed, or weighted and superimposed, on the virtual-sound-image-localized dry audio to generate the chorus audio. Superimposing the bass signal enhances the weight and fullness of the sound.
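The band-pass step can be sketched with a crude FFT-mask filter keeping the [33, 523] Hz band quoted above. This is an assumption-laden stand-in; an actual system would more likely use an IIR or FIR band-pass filter.

```python
import numpy as np

def bandpass_fft(signal, fs, low=33.0, high=523.0):
    """Crude band-pass via FFT masking: zero every spectral component
    outside the [low, high] Hz band and transform back."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    spectrum[(freqs < low) | (freqs > high)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

fs = 8000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 100 * t) + np.sin(2 * np.pi * 2000 * t)  # 100 Hz + 2 kHz
bass = bandpass_fft(sig, fs)  # only the 100 Hz component survives
```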
In one embodiment of the present application, the method may further comprise the steps of:
respectively carrying out reverberation simulation processing on the obtained dry sound frequencies;
Correspondingly, generating chorus audio based on the plurality of dry audio frequencies subjected to virtual sound image localization comprises the following steps:
and generating chorus audio based on the plurality of dry sound audios subjected to virtual sound image localization and the plurality of dry sound audios subjected to reverberation simulation processing.
Generally, an acoustic signal emitted by a sound source in a sound field undergoes direct sound, reflection, reverberation and similar processes. Fig. 5 shows a schematic composition of a typical spatial sound field. In the figure, the signal with the largest amplitude is the direct sound; next come the early reflections, produced by sound waves reflecting off the objects closest to the listener, which retain clear directivity; the subsequent dense signal is the reverberant sound, formed by the superposition of sound waves after multiple reflections from surrounding objects. Being a superposition of a large number of reflections from different directions, the reverberant sound has no directivity.
According to known room impulse response characteristics, reverberant sound is a superposition of multipath reflections, characterized by weak energy and no directivity. Because it is a superposition of a large number of late reflections from different directions and has a high echo density, reverberant sound can be used to generate an enveloping surround effect.
In the embodiment of the present application, after the dry audio frequencies of the singers singing the same target song are obtained, the obtained dry audio frequencies may be subjected to reverberation simulation processing. Specifically, the obtained plurality of dry audio frequencies may be respectively subjected to reverberation simulation processing by using a cascade of a comb filter and an all-pass filter.
Fig. 6 shows a cascade of comb filters and all-pass filters, wherein four comb filters are connected in parallel and then in series with two all-pass filters. The actual simulated reverberation impulse response is shown in fig. 7.
It should be noted that fig. 6 is only one specific form; in practical applications there may be many others, and the number and cascading arrangement of comb filters and all-pass filters may be adjusted according to actual needs.
Reverberation simulation processing is performed on each of the obtained dry audio signals, virtual sound image localization is performed, and after the dry audio signals are localized onto the virtual sound images, chorus audio can be generated based on the virtual-sound-image-localized dry audio and the reverberation-processed dry audio. Specifically, the two sets of signals can be superimposed, or weighted and superimposed, to generate the chorus audio. This enhances the spatial effect of the sound signal, further suppresses the in-head effect, and widens the sound field.
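The parallel-comb plus series-all-pass cascade of fig. 6 resembles the classic Schroeder reverberator structure, which can be sketched as below. The delay and gain values are illustrative assumptions, not the patent's actual parameters.

```python
import numpy as np

def comb(x, delay, g):
    """Feedback comb filter: y[n] = x[n] + g * y[n - delay]."""
    y = np.copy(x)
    for n in range(delay, len(x)):
        y[n] += g * y[n - delay]
    return y

def allpass(x, delay, g):
    """Schroeder all-pass: y[n] = -g*x[n] + x[n-delay] + g*y[n-delay]."""
    y = np.zeros_like(x)
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

def schroeder_reverb(x):
    """Four comb filters in parallel followed by two all-pass filters
    in series, as in the cascade of fig. 6 (parameters illustrative)."""
    combs = [(1557, 0.84), (1617, 0.83), (1491, 0.85), (1422, 0.86)]
    y = sum(comb(x, d, g) for d, g in combs) / len(combs)
    for d, g in [(225, 0.7), (556, 0.7)]:
        y = allpass(y, d, g)
    return y

impulse = np.zeros(4000)
impulse[0] = 1.0
ir = schroeder_reverb(impulse)  # simulated reverberation impulse response
```

Feeding an impulse through the cascade yields a dense, decaying response of the kind shown in fig. 7.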
In one embodiment of the present application, after virtual sound image localization is performed on the plurality of dry sound frequencies subjected to the time alignment process, the method may further include the steps of:
Respectively carrying out reverberation simulation processing on a plurality of dry sound frequencies subjected to virtual sound image positioning;
correspondingly, generating chorus audio based on the plurality of dry audio frequencies subjected to virtual sound image localization comprises the following steps:
and generating chorus audio based on the plurality of dry audio frequencies subjected to virtual sound image localization and reverberation simulation processing.
In this embodiment of the present application, after the dry audio of a plurality of singers singing the same target song is obtained, time alignment processing is performed on the obtained dry audio signals, virtual sound image localization is performed, and reverberation simulation processing may then be further performed on the localized dry audio signals. The reverberation simulation process is as described in the previous embodiment and is not repeated here.
Chorus audio can be generated based on a plurality of dry audio frequencies subjected to virtual sound image localization and reverberation simulation processing. Specifically, the chorus audio may be generated by performing processing such as superposition or weighted superposition on a plurality of dry audio frequencies subjected to the virtual sound image localization and the reverberation simulation processing.
The reverberation simulation processing is carried out on the plurality of dry sound frequencies subjected to the virtual sound image processing, so that the space sound effect of the sound signals can be enhanced, the effect in the head can be further suppressed, and the sound field can be expanded.
In one embodiment of the present application, the method may further comprise the steps of:
respectively carrying out double-channel simulation processing on the obtained dry sound frequencies;
correspondingly, generating chorus audio based on the plurality of dry audio frequencies subjected to virtual sound image localization comprises the following steps:
and generating chorus audio based on the plurality of dry audio frequencies subjected to virtual sound image localization and the plurality of dry audio frequencies subjected to binaural simulation processing.
In the embodiment of the present application, after the dry audio of a plurality of singers singing the same target song is obtained and the obtained dry audio signals are time-aligned, two-channel simulation processing can be performed on each dry audio signal: the correlation between the two channel signals is reduced by means of delays, expanding the sound field as much as possible to obtain a two-channel output.
As shown in fig. 8, a plurality of dry audio signals can be modeled with 8 groups of different delays and weights on the left and right, where d represents a delay and g a weight. Since a typical room impulse response has a reverberation time on the order of 80 ms, 16 unequal delay parameters can be selected between 21 ms and 79 ms. Amplitude attenuation represents the energy lost by the sound wave through reflection and further reduces the correlation between the two channels of environment information. Two signals carrying the same information are first obtained by duplicating the dry audio; these two signals are fully correlated, and their correlation is then reduced using different delays and amplitudes to obtain a pseudo-stereophonic signal.
It should be noted that, fig. 8 is only a specific example, and fewer or more different sets of delays may be set to implement the binaural simulation according to actual needs.
Chorus audio can be generated based on the plurality of dry audio frequencies after virtual sound image localization and the plurality of dry audio frequencies after binaural simulation processing. Specifically, the chorus audio may be generated by performing processing such as superposition or weighted superposition on a plurality of dry audio frequencies subjected to virtual sound image localization and a plurality of dry audio frequencies subjected to binaural simulation processing.
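The delay/gain decorrelation can be sketched as follows. The 8-tap layout, random gain range, and sampling rate are illustrative assumptions; the text only fixes the 21–79 ms delay range.

```python
import numpy as np

def pseudo_stereo(dry, fs=44100, seed=0):
    """Turn a mono dry vocal into a pseudo-stereo pair by giving each
    channel the direct signal plus 8 attenuated copies at different
    delays drawn from the 21-79 ms range quoted above."""
    rng = np.random.default_rng(seed)
    out = np.zeros((2, len(dry)))
    for ch in range(2):
        delays = rng.integers(int(0.021 * fs), int(0.079 * fs), size=8)
        gains = rng.uniform(0.1, 0.5, size=8)
        out[ch] += dry                        # direct path (fully correlated)
        for d, g in zip(delays, gains):       # attenuated "reflections"
            out[ch, d:] += g * dry[:len(dry) - d]
    return out

rng = np.random.default_rng(1)
dry = rng.standard_normal(44100)  # one second of a placeholder dry vocal
lr = pseudo_stereo(dry)
corr = np.corrcoef(lr[0], lr[1])[0, 1]  # < 1: channels are decorrelated
```

Because each channel receives different delay/gain taps, the inter-channel correlation drops well below 1 while both channels still carry the same vocal.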
In one embodiment of the present application, after the two-channel simulation processing is performed on the obtained plurality of dry audio signals, the method may further include the following steps:
performing reverberation simulation processing on the plurality of dry sound frequencies subjected to the binaural simulation processing;
correspondingly, generating chorus audio based on the plurality of dry audio after virtual sound image localization and the plurality of dry audio after binaural simulation processing, including:
and generating chorus audio based on the plurality of dry audio frequencies subjected to virtual sound image localization, the two-channel simulation processing and the plurality of dry audio frequencies subjected to reverberation simulation processing.
In the embodiment of the present application, after the dry audio of a plurality of singers singing the same target song is obtained, the dry audio signals are time-aligned, and two-channel simulation processing is performed on each of them, reverberation simulation processing can further be performed on the two-channel-processed dry audio signals, thereby enhancing the spatial effect of the sound signal, suppressing the in-head effect, and widening the sound field.
After virtual sound image localization is performed and the dry audio signals are localized onto the plurality of virtual sound images, chorus audio can be generated based on the dry audio signals after virtual sound image localization and after two-channel simulation plus reverberation simulation processing. Specifically, these signals can be superimposed, or weighted and superimposed, to generate the chorus audio.
In practical applications, after the dry audio of a plurality of singers singing the same target song is obtained, time alignment processing may be performed on the obtained dry audio signals, and then virtual sound image localization, bass enhancement, reverberation simulation, two-channel simulation and other processing may be applied on that basis; the specific processing can combine the above embodiments.
Fig. 9 is a schematic diagram of a system framework for processing a plurality of time-aligned dry audio signals, comprising a bass enhancement unit, a virtual sound image localization unit, a two-channel simulation unit and a reverberation simulation unit. The bass enhancement unit band-pass filters the dry audio signals to obtain bass data; the virtual sound image localization unit localizes the dry audio signals onto the plurality of virtual sound images; the two-channel simulation unit performs two-channel simulation processing on the dry audio signals; and the reverberation simulation unit performs reverberation simulation processing on them. Both the virtual sound image localization unit and the two-channel simulation unit can feed the reverberation simulation unit: after localization, further reverberation processing can be applied, and likewise after two-channel simulation. Finally, the audio data processed by these units can be weighted and superimposed to obtain the chorus audio.
Fig. 10 shows a specific example of processing a plurality of dry audio signals, where H represents the transfer function of HRTF filtering, by which the dry audio signals are localized onto 12 virtual sound images surrounding the horizontal plane of the ears; REV represents a reverberation simulation unit, BASS a bass enhancement unit, and REF a two-channel simulation unit. The reverberation simulation units can share the same parameters, or different parameters can be configured for different units according to actual requirements, giving flexible reverberation modulation.
The grand chorus effect of the chorus audio finally generated by the embodiment of the present application is closer to the sound of a chorus at a real concert. In practical applications, accompaniment is added on top of the main singing audio and the chorus audio is mixed in, so that the user audibly experiences an immersive concert and a more striking, enveloping sound field.
In one embodiment of the present application, performing virtual sound image localization on the plurality of time-aligned dry audio signals may include the following steps:
Step 1: group the time-aligned dry audio signals according to the number of virtual sound images, the number of groups being the same as the number of virtual sound images;
Step 2: localize each group of dry audio onto its corresponding virtual sound image, different groups corresponding to different virtual sound images.
For ease of description, the two steps above are described together.
In this embodiment of the present application, after the dry audio of a plurality of singers singing the same target song is obtained and the obtained dry audio signals are time-aligned, the time-aligned dry audio signals may be grouped according to the number of virtual sound images, the number of groups being the same as the number of virtual sound images, with each group containing one or more dry audio signals. If many dry audio signals are obtained, each dry audio may belong to only one group; if few are obtained, the same dry audio may belong to several groups, so as to better realize the grand chorus effect.
After the dry audio signals are grouped, each group can be localized onto its corresponding virtual sound image, different groups corresponding to different virtual sound images. This realizes the virtual sound image localization of the plurality of dry audio signals and enhances the chorus effect.
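One hypothetical grouping strategy consistent with the description (round-robin assignment when vocals outnumber images, reuse of vocals when they do not) can be sketched as:

```python
def group_dry_vocals(dry_vocals, num_images):
    """Distribute dry vocals over the virtual sound images.

    With at least as many vocals as images, each vocal lands in exactly
    one group (round-robin); with fewer vocals, vocals are reused so
    every virtual sound image still receives audio.  This policy is an
    illustrative assumption, not the patent's mandated rule.
    """
    groups = [[] for _ in range(num_images)]
    if len(dry_vocals) >= num_images:
        for i, vocal in enumerate(dry_vocals):
            groups[i % num_images].append(vocal)
    else:
        for i in range(num_images):
            groups[i].append(dry_vocals[i % len(dry_vocals)])
    return groups

# 30 singers spread over the 12 virtual sound images of fig. 4
groups = group_dry_vocals([f"singer{i}" for i in range(30)], 12)
```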
In one embodiment of the present application, synthesizing the main singing audio, the chorus audio and the corresponding accompaniment may include the following steps:
adjusting the volumes of the main singing audio and the chorus audio, and/or performing reverberation simulation processing on the main singing audio and the chorus audio;
synthesizing the volume-adjusted and/or reverberation-processed main singing audio, the chorus audio and the corresponding accompaniment.
In this embodiment of the present application, after the main singing audio based on the target song is obtained, the volumes of the main singing audio and the chorus audio may be adjusted so that they are comparable, or so that the main singing audio is louder than the chorus audio. Reverberation simulation processing may also be applied to the main singing audio and the chorus audio to obtain an enveloping surround effect.
The volume-adjusted and/or reverberation-processed main singing audio, the chorus audio and the corresponding accompaniment are then synthesized, so that the finally output grand chorus effect audio brings the user a better listening experience.
Corresponding to the above method embodiments, the embodiments of the present application further provide a device for processing chorus audio, where the device for processing chorus audio described below and the method for processing chorus audio described above may be referred to correspondingly.
Referring to fig. 11, the apparatus may include the following modules:
a dry audio obtaining module 1110, configured to obtain dry audio of singing the same target song by a plurality of singers, respectively;
a time alignment processing module 1120, configured to perform time alignment processing on the obtained plurality of dry audio frequencies;
a virtual sound image localization module 1130, configured to perform virtual sound image localization on the plurality of time-aligned dry audio signals, so as to localize them onto the plurality of virtual sound images; the plurality of virtual sound images lie in a pre-established virtual sound image coordinate system centered on the human head, with the midpoint of the line connecting the left and right ears as the origin of coordinates, the positive direction of the first coordinate axis pointing to the front of the head, the positive direction of the second coordinate axis pointing from the left ear toward the right ear, and the positive direction of the third coordinate axis pointing toward the top of the head; the distance between each virtual sound image and the origin of coordinates lies within a set distance range, and the elevation angle of each virtual sound image relative to the plane formed by the first and second coordinate axes lies within a set angle range;
a chorus audio generation module 1140, configured to generate chorus audio based on the plurality of dry audio frequencies after virtual sound image localization;
And the chorus effect audio obtaining module 1150 is configured to, when obtaining the primary singing audio based on the target song, synthesize the primary singing audio, the chorus audio and the corresponding accompaniment, and output the chorus effect audio.
Through the apparatus provided by the embodiment of the present application, after the dry audio of a plurality of singers singing the same target song is obtained, time alignment processing is performed on the obtained dry audio, and virtual sound image localization is performed on the aligned dry audio so as to localize the dry audio signals onto a plurality of virtual sound images lying in a head-centered virtual sound image coordinate system, at distances from the origin of coordinates within a set distance range, surrounding the ears. Chorus audio is generated based on the localized dry audio, and, when a main singing audio based on the target song is obtained, the main singing audio, the chorus audio and the corresponding accompaniment are synthesized to obtain and output the grand chorus effect audio. Because the dry audio signals are localized onto virtual sound images surrounding the ears, the generated chorus audio has a surrounding sound field; audibly, this effectively avoids the in-head effect where the sound field of the final grand chorus effect audio collapses to the center of the head, so the sound field is wider.
In a specific embodiment of the present application, the time alignment processing module 1120 is configured to:
determining a reference audio corresponding to the target song;
for each obtained dry sound audio, respectively extracting the audio characteristics of the current dry sound audio and the reference audio, wherein the audio characteristics are fingerprint characteristics or fundamental frequency characteristics;
determining the time corresponding to the maximum value of the similarity of the audio characteristics of the current dry audio and the reference audio as audio alignment time;
and performing time alignment processing on the current dry audio based on the audio alignment time.
In a specific embodiment of the present application, the device further includes a bass data obtaining module, configured to:
respectively carrying out band-pass filtering processing on the obtained dry sound frequencies to obtain a plurality of bass data;
accordingly, a chorus audio generation module 1140 is configured to:
a chorus audio is generated based on the plurality of dry audio and the plurality of bass data after virtual sound image localization.
In a specific embodiment of the present application, the apparatus further includes a reverberation simulation processing module configured to:
respectively carrying out reverberation simulation processing on the obtained dry sound frequencies;
accordingly, a chorus audio generation module 1140 is configured to:
and generating chorus audio based on the plurality of dry sound audios subjected to virtual sound image localization and the plurality of dry sound audios subjected to reverberation simulation processing.
In one specific embodiment of the present application, the reverberation simulation processing module is configured to:
and respectively carrying out reverberation simulation processing on the obtained dry sound frequencies by utilizing cascade connection of the comb filter and the all-pass filter.
In a specific embodiment of the present application, the reverberation simulation processing module is further configured to:
after virtual sound image positioning is carried out on the plurality of dry sound frequencies subjected to time alignment treatment, respectively carrying out reverberation simulation treatment on the plurality of dry sound frequencies subjected to virtual sound image positioning;
accordingly, a chorus audio generation module 1140 is configured to:
and generating chorus audio based on the plurality of dry audio frequencies subjected to virtual sound image localization and reverberation simulation processing.
In a specific embodiment of the present application, the apparatus further includes a two-channel simulation processing module, configured to:
respectively carrying out double-channel simulation processing on the obtained dry sound frequencies;
accordingly, a chorus audio generation module 1140 is configured to:
and generating chorus audio based on the plurality of dry audio frequencies subjected to virtual sound image localization and the plurality of dry audio frequencies subjected to binaural simulation processing.
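In its simplest form, the two-channel (binaural) simulation could apply an interaural time and level difference to each mono dry track. The Woodworth ITD formula and the cosine level shading below are illustrative assumptions, not the patent's method:

```python
import numpy as np

def binaural_pan(x, fs, azimuth_deg, head_radius=0.0875, c=343.0):
    """Render a mono track as a (left, right) pair using only an
    interaural time difference (Woodworth model) and a crude
    interaural level difference."""
    az = np.deg2rad(azimuth_deg)
    itd = head_radius / c * (abs(az) + np.sin(abs(az)))  # seconds
    delay = int(round(itd * fs))
    near = np.asarray(x, dtype=float)
    far = np.concatenate([np.zeros(delay), near])[:len(near)]
    far = far * (0.5 + 0.5 * np.cos(az))  # simple head-shadow attenuation
    if azimuth_deg >= 0:   # source to the right: left ear is the far ear
        return np.stack([far, near])
    return np.stack([near, far])
```

For a source at 90° to the right, the left-channel impulse arrives later and weaker than the right-channel one; at 0° the two channels are identical.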
In a specific embodiment of the present application, the reverberation simulation processing module is further configured to:
after the obtained dry audio signals are subjected to two-channel simulation processing, perform reverberation simulation processing on the plurality of dry audio signals subjected to the two-channel simulation processing;
accordingly, the chorus audio generation module 1140 is configured to:
generate the chorus audio based on the plurality of dry audio signals subjected to virtual sound image localization and two-channel simulation processing, and the plurality of dry audio signals subjected to reverberation simulation processing.
In one embodiment of the present application, the virtual sound image localization module 1130 is configured to:
group the obtained time-aligned dry audio signals according to the number of virtual sound images, the number of groups being the same as the number of virtual sound images;
and localize each group of dry audio signals to its corresponding virtual sound image, with different groups of dry audio signals corresponding to different virtual sound images.
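A hypothetical sketch of the grouping step, assuming a simple round-robin assignment (the patent does not specify how the tracks are distributed among the images):

```python
def group_tracks(track_ids, num_images):
    """Split the time-aligned dry tracks into num_images groups,
    one group per virtual sound image, via round-robin assignment."""
    groups = [[] for _ in range(num_images)]
    for i, track in enumerate(track_ids):
        groups[i % num_images].append(track)
    return groups
```

Every track lands in exactly one group, and group sizes differ by at most one, so no virtual sound image is left silent when there are at least as many tracks as images.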
In one specific embodiment of the present application, among the plurality of virtual sound images, the elevation angle of a virtual sound image located behind the human head, relative to the plane formed by the first coordinate axis and the second coordinate axis, is larger than the elevation angle of a virtual sound image located in front of the human head relative to that plane; alternatively, the virtual sound images are uniformly distributed on a circle in the plane formed by the first coordinate axis and the second coordinate axis.
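The two placement schemes above can be sketched in the patent's coordinate system (first axis toward the front of the head, second axis from left ear to right ear, third axis upward). The radius and the specific elevation values are assumptions for the sketch:

```python
import numpy as np

def place_images(n, radius=1.0, front_elev_deg=0.0, rear_elev_deg=15.0):
    """Place n virtual sound images uniformly in azimuth around the head.
    Images behind the head (negative first-axis coordinate) are given a
    larger elevation than those in front, per the first scheme above."""
    positions = []
    for k in range(n):
        az = 2.0 * np.pi * k / n   # 0 = directly in front of the head
        elev = np.deg2rad(rear_elev_deg if np.cos(az) < 0 else front_elev_deg)
        x = radius * np.cos(elev) * np.cos(az)   # first axis: front
        y = radius * np.cos(elev) * np.sin(az)   # second axis: left -> right ear
        z = radius * np.sin(elev)                # third axis: above the head
        positions.append((x, y, z))
    return positions
```

Each image sits at the set distance from the origin, and rear images are lifted above the ear plane relative to front images.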
In one embodiment of the present application, the chorus effect audio acquisition module 1150 is configured to:
adjust the volume of the main singing audio and the chorus audio respectively, and/or perform reverberation simulation processing on the main singing audio and the chorus audio;
and synthesize the volume-adjusted and/or reverberation-processed main singing audio and chorus audio with the corresponding accompaniment.
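The volume adjustment and synthesis step could be sketched as a gain-weighted mix with peak normalization; the per-part gain values are assumptions, not values from the patent:

```python
import numpy as np

def mix_chorus_effect(lead, chorus, accomp,
                      lead_gain=1.0, chorus_gain=0.7, accomp_gain=0.8):
    """Mix the main singing audio, the chorus audio, and the accompaniment
    with per-part gains, then peak-normalize to avoid clipping."""
    n = min(len(lead), len(chorus), len(accomp))
    out = (lead_gain * lead[:n]
           + chorus_gain * chorus[:n]
           + accomp_gain * accomp[:n])
    peak = np.abs(out).max()
    if peak > 1.0:
        out = out / peak
    return out
```

The output is truncated to the shortest input and never exceeds full scale.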
Corresponding to the above method embodiments, an embodiment of the present application further provides a chorus audio processing apparatus, including:
a memory for storing a computer program;
and a processor for implementing the steps of the above chorus audio processing method when executing the computer program.
As shown in fig. 12, which illustrates a composition structure of the chorus audio processing apparatus, the apparatus may include: a processor 10, a memory 11, a communication interface 12 and a communication bus 13. The processor 10, the memory 11 and the communication interface 12 communicate with one another through the communication bus 13.
In the present embodiment, the processor 10 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device.
The processor 10 may call a program stored in the memory 11; in particular, the processor 10 may perform the operations in the embodiments of the chorus audio processing method.
The memory 11 is used for storing one or more programs, which may include program code comprising computer operation instructions. In this embodiment, the memory 11 stores at least the programs for implementing the following functions:
obtaining dry audio signals of a plurality of singers singing the same target song;
performing time alignment processing on the plurality of obtained dry audio signals;
performing virtual sound image localization on the plurality of time-aligned dry audio signals so as to localize them on a plurality of virtual sound images; the virtual sound images are located in a pre-established virtual sound image coordinate system, which is centered on the human head with the midpoint of the straight line connecting the left ear and the right ear as the origin of coordinates; the positive direction of a first coordinate axis points directly in front of the human head, the positive direction of a second coordinate axis points from the left ear toward the right ear, and the positive direction of a third coordinate axis points directly above the human head; the distance between each virtual sound image and the origin of coordinates is within a set distance range, and the pitch angle of each virtual sound image relative to the plane formed by the first coordinate axis and the second coordinate axis is within a set angle range;
generating chorus audio based on the plurality of dry audio signals subjected to virtual sound image localization;
and, when the main singing audio sung based on the target song is acquired, synthesizing the main singing audio, the chorus audio and the corresponding accompaniment, and outputting the chorus effect audio.
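The time-alignment function listed above can be illustrated with a cross-correlation sketch. The patent aligns on fingerprint or fundamental-frequency feature similarity, so plain sample-level cross-correlation here is a simplifying stand-in for that similarity search:

```python
import numpy as np

def time_align(dry, ref):
    """Estimate the lag (in samples) at which dry best matches ref via
    peak cross-correlation, then shift and pad dry to the reference grid."""
    corr = np.correlate(dry, ref, mode="full")
    lag = int(np.argmax(corr)) - (len(ref) - 1)   # > 0: dry starts late
    if lag > 0:
        aligned = dry[lag:]
    else:
        aligned = np.concatenate([np.zeros(-lag), dry])
    # pad or trim to the reference length
    if len(aligned) < len(ref):
        aligned = np.concatenate([aligned, np.zeros(len(ref) - len(aligned))])
    return aligned[:len(ref)], lag
```

A track that starts 50 samples late relative to the reference is detected with lag 50 and shifted back into alignment.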
In one possible implementation, the memory 11 may include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function (such as an audio playing function or an audio synthesizing function); the data storage area may store data created during use, such as sound image localization data and audio synthesis data.
In addition, the memory 11 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device or other non-volatile solid-state storage device.
The communication interface 12 may be an interface of a communication module for connecting with other devices or systems.
Of course, it should be noted that the structure shown in fig. 12 does not limit the chorus audio processing apparatus in the embodiments of the present application; in practical applications, the apparatus may include more or fewer components than those shown in fig. 12, or combine certain components.
Corresponding to the above method embodiments, the present application further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above chorus audio processing method.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be referred to one another.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative elements and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), flash memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Specific examples are used herein to illustrate the principles and implementations of the present application; the description of the above embodiments is only intended to aid in understanding the technical solution of the present application and its core ideas. It should be noted that those of ordinary skill in the art can make various improvements and modifications to the present application without departing from its principles, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims (12)

1. A method of processing chorus audio, comprising:
obtaining dry audio signals of a plurality of singers singing the same target song;
performing time alignment processing on the plurality of obtained dry audio signals;
performing virtual sound image localization on the plurality of time-aligned dry audio signals so as to localize them on a plurality of virtual sound images; the virtual sound images are located in a pre-established virtual sound image coordinate system, which is centered on the human head with the midpoint of the straight line connecting the left ear and the right ear as the origin of coordinates; the positive direction of a first coordinate axis points directly in front of the human head, the positive direction of a second coordinate axis points from the left ear toward the right ear, and the positive direction of a third coordinate axis points directly above the human head; the distance between each virtual sound image and the origin of coordinates is within a set distance range, and the pitch angle of each virtual sound image relative to the plane formed by the first coordinate axis and the second coordinate axis is within a set angle range;
generating chorus audio based on the plurality of dry audio signals subjected to virtual sound image localization;
when the main singing audio sung based on the target song is obtained, synthesizing the main singing audio, the chorus audio and the corresponding accompaniment, and then outputting the chorus effect audio;
wherein the synthesizing of the main singing audio, the chorus audio and the corresponding accompaniment includes:
adjusting the volume of the main singing audio and the chorus audio respectively, and/or performing reverberation simulation processing on the main singing audio and the chorus audio;
and synthesizing the volume-adjusted and/or reverberation-processed main singing audio and chorus audio with the corresponding accompaniment.
2. The method of processing chorus audio according to claim 1, wherein the time alignment processing of the plurality of obtained dry audio signals comprises:
determining a reference audio corresponding to the target song;
for each obtained dry audio signal, extracting audio features of the current dry audio signal and of the reference audio respectively, the audio features being fingerprint features or fundamental frequency features;
determining the time corresponding to the maximum of the audio feature similarity between the current dry audio signal and the reference audio as the audio alignment time;
and performing time alignment processing on the current dry audio signal based on the audio alignment time.
3. The method of processing chorus audio according to claim 1, further comprising:
performing band-pass filtering on each of the obtained dry audio signals to obtain a plurality of bass data;
correspondingly, the generating of chorus audio based on the plurality of dry audio signals subjected to virtual sound image localization includes:
generating the chorus audio based on the plurality of dry audio signals subjected to virtual sound image localization and the plurality of bass data.
4. The method of processing chorus audio according to claim 1, further comprising:
performing reverberation simulation processing on each of the obtained dry audio signals;
correspondingly, the generating of chorus audio based on the plurality of dry audio signals subjected to virtual sound image localization includes:
generating the chorus audio based on the plurality of dry audio signals subjected to virtual sound image localization and the plurality of dry audio signals subjected to reverberation simulation processing.
5. The method of processing chorus audio according to claim 4, wherein the performing of reverberation simulation processing on each of the obtained dry audio signals comprises:
performing reverberation simulation processing on each of the obtained dry audio signals using a cascade of a comb filter and an all-pass filter.
6. The method of processing chorus audio according to claim 1, further comprising, after the virtual sound image localization of the plurality of time-aligned dry audio signals:
performing reverberation simulation processing on each of the plurality of dry audio signals subjected to virtual sound image localization;
correspondingly, the generating of chorus audio based on the plurality of dry audio signals subjected to virtual sound image localization includes:
generating the chorus audio based on the plurality of dry audio signals subjected to both virtual sound image localization and reverberation simulation processing.
7. The method of processing chorus audio according to claim 1, further comprising:
performing two-channel simulation processing on each of the obtained dry audio signals;
correspondingly, the generating of chorus audio based on the plurality of dry audio signals subjected to virtual sound image localization includes:
generating the chorus audio based on the plurality of dry audio signals subjected to virtual sound image localization and the plurality of dry audio signals subjected to binaural simulation processing.
8. The method of processing chorus audio according to claim 7, further comprising, after the two-channel simulation processing of the obtained dry audio signals:
performing reverberation simulation processing on the plurality of dry audio signals subjected to the binaural simulation processing;
correspondingly, the generating of chorus audio based on the plurality of dry audio signals subjected to virtual sound image localization and the plurality of dry audio signals subjected to binaural simulation processing includes:
generating the chorus audio based on the plurality of dry audio signals subjected to virtual sound image localization and two-channel simulation processing, and the plurality of dry audio signals subjected to reverberation simulation processing.
9. The method of processing chorus audio according to claim 1, wherein the performing of virtual sound image localization on the plurality of time-aligned dry audio signals comprises:
grouping the obtained time-aligned dry audio signals according to the number of virtual sound images, the number of groups being the same as the number of virtual sound images;
and localizing each group of dry audio signals to its corresponding virtual sound image, with different groups of dry audio signals corresponding to different virtual sound images.
10. The method of processing chorus audio according to claim 1, wherein,
among the plurality of virtual sound images, the elevation angle of a virtual sound image located behind the human head relative to the plane formed by the first coordinate axis and the second coordinate axis is larger than the elevation angle of a virtual sound image located in front of the human head relative to that plane;
or,
the virtual sound images are uniformly distributed on a circle in the plane formed by the first coordinate axis and the second coordinate axis.
11. A chorus audio processing apparatus, comprising:
a memory for storing a computer program;
a processor for implementing the steps of the method of processing chorus audio as claimed in any one of claims 1 to 10 when executing said computer program.
12. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method of processing chorus audio according to any of claims 1 to 10.
CN202110460280.4A 2021-04-27 2021-04-27 Chorus audio processing method, chorus audio processing equipment and storage medium Active CN113192486B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110460280.4A CN113192486B (en) 2021-04-27 2021-04-27 Chorus audio processing method, chorus audio processing equipment and storage medium
PCT/CN2022/087784 WO2022228220A1 (en) 2021-04-27 2022-04-20 Method and device for processing chorus audio, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110460280.4A CN113192486B (en) 2021-04-27 2021-04-27 Chorus audio processing method, chorus audio processing equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113192486A CN113192486A (en) 2021-07-30
CN113192486B true CN113192486B (en) 2024-01-09

Family

ID=76979435

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110460280.4A Active CN113192486B (en) 2021-04-27 2021-04-27 Chorus audio processing method, chorus audio processing equipment and storage medium

Country Status (2)

Country Link
CN (1) CN113192486B (en)
WO (1) WO2022228220A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192486B (en) * 2021-04-27 2024-01-09 腾讯音乐娱乐科技(深圳)有限公司 Chorus audio processing method, chorus audio processing equipment and storage medium
CN114242025A (en) * 2021-12-14 2022-03-25 腾讯音乐娱乐科技(深圳)有限公司 Method and device for generating accompaniment and storage medium
CN114363793A (en) * 2022-01-12 2022-04-15 厦门市思芯微科技有限公司 System and method for converting dual-channel audio into virtual surround 5.1-channel audio
CN114630145A (en) * 2022-03-17 2022-06-14 腾讯音乐娱乐科技(深圳)有限公司 Multimedia data synthesis method, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009044261A (en) * 2007-08-06 2009-02-26 Yamaha Corp Device for forming sound field
CN108269560A (en) * 2017-01-04 2018-07-10 北京酷我科技有限公司 A kind of speech synthesizing method and system
CN110992970A (en) * 2019-12-13 2020-04-10 腾讯音乐娱乐科技(深圳)有限公司 Audio synthesis method and related device
CN111028818A (en) * 2019-11-14 2020-04-17 北京达佳互联信息技术有限公司 Chorus method, apparatus, electronic device and storage medium
WO2020177190A1 (en) * 2019-03-01 2020-09-10 腾讯音乐娱乐科技(深圳)有限公司 Processing method, apparatus and device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000333297A (en) * 1999-05-14 2000-11-30 Sound Vision:Kk Stereophonic sound generator, method for generating stereophonic sound, and medium storing stereophonic sound
CN105208039B (en) * 2015-10-10 2018-06-08 广州华多网络科技有限公司 The method and system of online concert cantata
CN106331977B (en) * 2016-08-22 2018-06-12 北京时代拓灵科技有限公司 A kind of virtual reality panorama acoustic processing method of network K songs
CN107422862B (en) * 2017-08-03 2021-01-15 嗨皮乐镜(北京)科技有限公司 Method for virtual image interaction in virtual reality scene
CN110379401A (en) * 2019-08-12 2019-10-25 黑盒子科技(北京)有限公司 A kind of music is virtually chorused system and method
CN113192486B (en) * 2021-04-27 2024-01-09 腾讯音乐娱乐科技(深圳)有限公司 Chorus audio processing method, chorus audio processing equipment and storage medium


Also Published As

Publication number Publication date
CN113192486A (en) 2021-07-30
WO2022228220A1 (en) 2022-11-03

Similar Documents

Publication Publication Date Title
CN113192486B (en) Chorus audio processing method, chorus audio processing equipment and storage medium
US5371799A (en) Stereo headphone sound source localization system
Hacihabiboglu et al. Perceptual spatial audio recording, simulation, and rendering: An overview of spatial-audio techniques based on psychoacoustics
CN105900457B (en) The method and system of binaural room impulse response for designing and using numerical optimization
Brown et al. A structural model for binaural sound synthesis
US6259795B1 (en) Methods and apparatus for processing spatialized audio
KR100964353B1 (en) Method for processing audio data and sound acquisition device therefor
US20120262536A1 (en) Stereophonic teleconferencing using a microphone array
CA2835463C (en) Apparatus and method for generating an output signal employing a decomposer
RU2513910C2 (en) Angle-dependent operating device or method for generating pseudo-stereophonic audio signal
US20080298610A1 (en) Parameter Space Re-Panning for Spatial Audio
US20060120533A1 (en) Apparatus and method for producing virtual acoustic sound
EP1522868A1 (en) System for determining the position of a sound source and method therefor
US20090067636A1 (en) Optimization of Binaural Sound Spatialization Based on Multichannel Encoding
JP2013211906A (en) Sound spatialization and environment simulation
CA2744429C (en) Converter and method for converting an audio signal
Noisternig et al. Framework for real-time auralization in architectural acoustics
Lee et al. A real-time audio system for adjusting the sweet spot to the listener's position
Pulkki et al. Spatial effects
Hoffbauer et al. Four-directional ambisonic spatial decomposition method with reduced temporal artifacts
US11917394B1 (en) System and method for reducing noise in binaural or stereo audio
US9794717B2 (en) Audio signal processing apparatus and audio signal processing method
US11032660B2 (en) System and method for realistic rotation of stereo or binaural audio
Yuan et al. Externalization improvement in a real-time binaural sound image rendering system
De Sena Analysis, design and implementation of multichannel audio systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant