CN108962277A - Speech signal separation method, apparatus, computer equipment and storage medium - Google Patents
- Publication number
- CN108962277A CN108962277A CN201810802835.7A CN201810802835A CN108962277A CN 108962277 A CN108962277 A CN 108962277A CN 201810802835 A CN201810802835 A CN 201810802835A CN 108962277 A CN108962277 A CN 108962277A
- Authority
- CN
- China
- Prior art keywords
- frequency spectrum
- audio
- audio signal
- frame
- accompaniment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L21/0232—Processing in the frequency domain
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
Abstract
The invention discloses a speech signal separation method, apparatus, computer equipment and storage medium, belonging to the field of speech signal processing. The method includes: sampling the acoustic waveform of an audio file to be separated to obtain an audio signal; converting the audio signal from the time domain to the frequency domain to obtain a spectrum of the audio signal, where the spectrum represents only the amplitude of the audio signal and the amplitude is a real number; decomposing the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum; and converting the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio. Because the invention uses a transform algorithm that represents the amplitude of each audio frame with real numbers only, both the time-to-frequency and the frequency-to-time conversions leave phase untouched, so no phase information is lost. Separating accompaniment and vocals from an audio file in this way avoids the phase distortion that arises when decomposing a Fourier-transform spectrum.
Description
Technical field
The present invention relates to the field of speech signal processing, and in particular to a speech signal separation method, apparatus, computer equipment and storage medium.
Background technique
With the continuous development of speech processing technology, speech signal separation has found wide application in daily life. For example, a user of a karaoke application who wants to record a song over an accompaniment needs the accompaniment provided by the server, and the quality of that accompaniment directly affects the quality of the final recording. How to perform speech signal separation to obtain accompaniment audio and vocal audio is therefore crucial for improving the quality of the accompaniment audio.
Currently, speech signal separation typically involves converting the audio signal from the time domain to the frequency domain with a Fourier transform, a process that yields a complex spectrum. The complex spectrum can then be decomposed to separate out an accompaniment spectrum and a vocal spectrum, after which an inverse Fourier transform yields accompaniment audio and vocal audio.
In the course of implementing the present invention, the inventors found that the prior art has at least the following problem: because only the amplitude spectrum is used when the complex spectrum is decomposed, the separated accompaniment audio exhibits phase distortion.
Summary of the invention
Embodiments of the present invention provide a speech signal separation method, apparatus, computer equipment and storage medium that can solve the phase distortion problem in speech signal separation. The technical solution is as follows:
In one aspect, a speech signal separation method is provided, the method comprising:
sampling the acoustic waveform of an audio file to be separated to obtain an audio signal;
converting the audio signal from the time domain to the frequency domain to obtain a spectrum of the audio signal, the spectrum representing only the amplitude of the audio signal, the amplitude being a real number;
decomposing the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum;
converting the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
In one possible implementation, converting the audio signal from the time domain to the frequency domain to obtain the spectrum of the audio signal comprises:
performing framing processing on the audio signal to obtain multiple audio frames;
converting each of the multiple audio frames from the time domain to the frequency domain to obtain spectra of the multiple audio frames, the spectrum of each audio frame representing only the amplitude of the audio frame, the amplitude being a real number;
combining the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
In one possible implementation, performing framing processing on the audio signal to obtain multiple audio frames comprises:
applying a preset window function to the audio signal to obtain multiple audio frames.
In one possible implementation, the length of the preset window function equals the number of sampling points of each audio frame.
In one possible implementation, the number of sampling points of each audio frame is twice the number of overlapping sampling points between frames.
In one possible implementation, decomposing the spectrum of the audio signal to obtain the accompaniment spectrum and the vocal spectrum comprises:
calling a preset decomposition model, the preset decomposition model being configured to perform spectrum separation based on a signal spectrum;
inputting the spectrum of the audio signal into the preset decomposition model and outputting the accompaniment spectrum and the vocal spectrum.
In one aspect, a speech signal separation device is provided, the device comprising:
a sampling module, configured to sample the acoustic waveform of an audio file to be separated to obtain an audio signal;
a first conversion module, configured to convert the audio signal from the time domain to the frequency domain to obtain a spectrum of the audio signal, the spectrum representing only the amplitude of the audio signal, the amplitude being a real number;
a decomposition module, configured to decompose the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum;
a second conversion module, configured to convert the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
In one possible implementation, the first conversion module comprises:
a framing unit, configured to perform framing processing on the audio signal to obtain multiple audio frames;
a time-frequency conversion unit, configured to convert each of the multiple audio frames from the time domain to the frequency domain to obtain spectra of the multiple audio frames, the spectrum of each audio frame representing only the amplitude of the audio frame, the amplitude being a real number;
a combining unit, configured to combine the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
In one possible implementation, the framing unit is configured to:
apply a preset window function to the audio signal to obtain multiple audio frames.
In one possible implementation, the length of the preset window function equals the number of sampling points of each audio frame.
In one possible implementation, the number of sampling points of each audio frame is twice the number of overlapping sampling points between frames.
In one possible implementation, the decomposition module is configured to call a preset decomposition model, the preset decomposition model being configured to perform spectrum separation based on a signal spectrum, input the spectrum of the audio signal into the preset decomposition model, and output the accompaniment spectrum and the vocal spectrum.
In one aspect, a computer equipment is provided, the computer equipment comprising a processor and a memory, the memory storing at least one instruction that is loaded and executed by the processor to implement the operations performed by the speech signal separation method above.
In one aspect, a computer-readable storage medium is provided, the storage medium storing at least one instruction that is loaded and executed by a processor to implement the operations performed by the speech signal separation method above.
In the method provided by embodiments of the present invention, the conversions between the time domain and the frequency domain use a transform algorithm that represents the amplitude of each audio frame with real numbers only. Because neither the forward nor the inverse transform alters phase, no phase information is lost; separating accompaniment and vocals from an audio file in this way therefore avoids the phase distortion of Fourier-transform spectrum decomposition.
Detailed description of the invention
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a diagram of an implementation scenario of a speech signal separation method provided in an embodiment of the present invention;
Fig. 2 is a kind of flow chart of speech signal separation method provided in an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a speech signal separation device provided in an embodiment of the present invention;
Fig. 4 is a kind of structural schematic diagram of computer equipment provided in an embodiment of the present invention.
Specific embodiment
To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Fig. 1 is a diagram of an implementation scenario of a speech signal separation method provided in an embodiment of the present invention. Referring to Fig. 1, the scenario may include at least one terminal 101 and at least one server 102. The at least one terminal 101 can serve as a capture terminal for speech signals or a playback terminal for audio files, and the at least one server 102 provides audio services to the at least one terminal 101: it can supply audio files to be played, and it can also provide the signal separation function corresponding to the methods provided by embodiments of the present invention, so as to perform speech signal separation on audio files it provides or audio files chosen by a terminal. The server 102 can be provided as a computer equipment.
Fig. 2 is a flow chart of a speech signal separation method provided in an embodiment of the present invention. Referring to Fig. 2, the embodiment specifically includes:
201. The computer equipment samples the acoustic waveform of an audio file to be separated to obtain an audio signal.
The audio file to be separated can be an audio file uploaded by a terminal or an audio file stored in the computer equipment; the computer equipment can be a server or any terminal, which is not limited by the embodiments of the present invention. After obtaining the audio file to be processed, the computer equipment obtains the acoustic waveform of the audio file and samples the waveform at a preset sample rate to obtain the audio signal.
The preset sample rate can correspond to the format of the audio file, with different audio file formats corresponding to different preset sample rates. Sampling the acoustic waveform at the sample rate corresponding to its format helps ensure that the resulting audio signals are consistent.
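As an illustration, this sampling step can be sketched as follows. This is a minimal sketch under stated assumptions: the format-to-rate table `PRESET_SAMPLE_RATES` and the representation of the acoustic waveform as a callable of time are hypothetical choices, not specified by the embodiment.

```python
import numpy as np

# Hypothetical format-to-rate table; the embodiment only states that the
# preset sample rate "can correspond to the format of the audio file".
PRESET_SAMPLE_RATES = {"wav": 44100, "mp3": 44100, "flac": 48000}

def sample_waveform(waveform, duration_s, fmt="wav"):
    """Sample a continuous acoustic waveform (a callable of time in seconds)
    at the preset sample rate for the given file format."""
    sr = PRESET_SAMPLE_RATES[fmt]
    t = np.arange(int(duration_s * sr)) / sr
    return waveform(t), sr

# A 440 Hz tone sampled for one second at the "wav" preset rate.
audio_signal, sr = sample_waveform(lambda t: np.sin(2 * np.pi * 440 * t), 1.0)
assert sr == 44100 and len(audio_signal) == 44100
```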
202. The computer equipment applies a preset window function to the audio signal to obtain multiple audio frames.
The sampled audio signal can be divided into frames of a preset frame length to obtain multiple original audio frames. The preset frame length should be short enough — generally 20 to 50 milliseconds — because within a sufficiently short interval the audio signal can be approximated as a stationary periodic signal, which facilitates the subsequent steps.
When framing, the number of sampling points of each audio frame should be chosen within a reasonable range to improve the spectral resolution of the audio frames. In one possible implementation, consecutive original audio frames should partially overlap, so that each original audio frame contains content from the previous frame and no discontinuity appears between two original audio frames. Generally, the number of sampling points of each original audio frame can be chosen between 512 and 8192. For example, in embodiments of the present invention, each audio frame can contain 2048 sampling points, with a corresponding frame overlap of 1024 sampling points.
During framing, both the preset frame length and the number of sampling points of each audio frame can be considered so that the two jointly satisfy the above conditions and achieve the best framing effect.
In practice, framing can be performed by windowing, that is, multiplying each of the multiple original audio frames by a window function to obtain the multiple audio frames. Windowing lets the audio frames better satisfy the periodicity assumption of the time-frequency conversion in the subsequent step, reduces spectral leakage, and improves spectral resolution. For example, the preset window function can be a Hann window or a Hamming window. The length of the preset window function can equal the number of sampling points of each audio frame, and the number of sampling points of each audio frame is twice the number of overlapping sampling points.
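The framing and windowing described above can be sketched as follows, using the example values from the text (2048-point frames with a 1024-point overlap) and a periodic Hann window whose length equals the frame's sample count; the function name is ours.

```python
import numpy as np

def frame_signal(x, frame_len=2048, hop=1024):
    """Split the signal into 50%-overlapping frames and apply a periodic
    Hann window of the same length as each frame, as described above."""
    n = np.arange(frame_len)
    window = 0.5 - 0.5 * np.cos(2 * np.pi * n / frame_len)  # periodic Hann
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )  # shape: (samples per frame, number of frames)

x = np.random.default_rng(0).standard_normal(8192)
frames = frame_signal(x)
assert frames.shape == (2048, 7)  # 2048-point frames, 1024-point hop
```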
203. The computer equipment converts each of the multiple audio frames from the time domain to the frequency domain to obtain spectra of the multiple audio frames, the spectrum of each audio frame representing only the amplitude of the audio frame, the amplitude being a real number.
In embodiments of the present invention, the time-frequency conversion can convert each of the multiple audio frames from the time domain to the frequency domain with a Hartley transform to obtain the spectra of the multiple audio frames. Because the Hartley transform is a real-valued transform, the resulting spectra are real spectra: each spectrum represents only the amplitude of the frame and does not involve phase. Specifically, the Hartley transform can be realized with the following formula:
H_k = Σ_{n=0}^{N-1} x_n [cos(2πkn/N) + sin(2πkn/N)], k = 0, 1, 2, ..., N-1
where N is the number of sampling points of each audio frame, M is the number of overlapping sampling points (M = N/2), x_n is the sample amplitude of each frame (n = 0, 1, 2, ..., N-1), H_k is the spectrum after the Hartley transform, and k is the frequency bin index.
It should be noted that the embodiments of the present invention use the Hartley transform only as an example; in practice, any other conversion that does not damage phase can also be used, which is not limited by the embodiments of the present invention.
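A direct implementation of the Hartley formula above can be sketched as follows, checked against the known identity H = Re(FFT) − Im(FFT) for real input (a sketch only; an O(N log N) fast Hartley transform would be used in practice):

```python
import numpy as np

def dht(x):
    """Discrete Hartley transform: H_k = sum_n x_n * cas(2*pi*k*n/N),
    with cas(t) = cos(t) + sin(t). Real input yields a real spectrum."""
    N = len(x)
    arg = 2 * np.pi * np.outer(np.arange(N), np.arange(N)) / N
    return (np.cos(arg) + np.sin(arg)) @ x

x = np.random.default_rng(0).standard_normal(16)
H = dht(x)
F = np.fft.fft(x)
assert np.allclose(H, F.real - F.imag)  # DHT/FFT identity for real input
assert np.isrealobj(H)                  # the spectrum carries no phase
```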
204. The computer equipment combines the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
After the spectrum of each audio frame is obtained, the spectra of the audio frames are concatenated head to tail in sequence to form a two-dimensional array of dimension N*L, where N equals the number of sampling points of each audio frame and L is the total number of frames.
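Combining the per-frame spectra head to tail into the N*L array can be sketched as follows (for brevity the per-frame Hartley spectra are computed through the FFT identity H = Re(F) − Im(F); the name `dht_cols` is ours):

```python
import numpy as np

def dht_cols(frames):
    """Hartley spectrum of each time-domain frame (one frame per column),
    via the real-input identity H = Re(FFT) - Im(FFT)."""
    F = np.fft.fft(frames, axis=0)
    return F.real - F.imag

frames = np.random.default_rng(1).standard_normal((2048, 5))  # N x L frames
spectrum = dht_cols(frames)  # the N x L two-dimensional spectrum above
assert spectrum.shape == (2048, 5)
assert np.isrealobj(spectrum)
```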
205. The computer equipment calls a preset decomposition model, the preset decomposition model being configured to perform spectrum separation based on a signal spectrum; the spectrum of the audio signal is input into the preset decomposition model, which outputs an accompaniment spectrum and a vocal spectrum.
The preset decomposition model can be trained in advance on the spectra of multiple audio signals together with the accompaniment spectra and vocal spectra of those audio signals. For example, the preset decomposition model can represent the separation law between accompaniment spectra and vocal spectra, so that the spectrum of the audio signal is decomposed based on that law.
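The patent does not disclose the internals of the preset decomposition model; one common hypothetical realization, shown only as a sketch, is a trained soft mask applied to the mixture spectrum:

```python
import numpy as np

def decompose(spectrum, vocal_mask):
    """Hypothetical decomposition: split the mixture spectrum with a soft
    mask in [0, 1] (assumed to come from a model trained on accompaniment
    and vocal spectra, as described above)."""
    vocal = vocal_mask * spectrum
    accompaniment = (1.0 - vocal_mask) * spectrum
    return accompaniment, vocal

S = np.random.default_rng(2).standard_normal((2048, 5))  # mixture spectrum
mask = np.full_like(S, 0.3)                              # stand-in mask
acc, voc = decompose(S, mask)
assert np.allclose(acc + voc, S)  # the two parts sum back to the mixture
```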
206. The computer equipment converts the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
After the accompaniment spectrum and the vocal spectrum are obtained, the inverse Hartley transform can convert them from the frequency domain to the time domain to obtain the accompaniment audio and the vocal audio.
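Because the Hartley transform is its own inverse up to a factor of 1/N, the frequency-to-time conversion can reuse the forward transform; a sketch:

```python
import numpy as np

def dht(x):
    """Discrete Hartley transform (cas kernel)."""
    N = len(x)
    arg = 2 * np.pi * np.outer(np.arange(N), np.arange(N)) / N
    return (np.cos(arg) + np.sin(arg)) @ x

def idht(H):
    """Inverse: the DHT is involutive up to 1/N, so dht(dht(x)) = N * x."""
    return dht(H) / len(H)

x = np.array([0.5, -1.0, 2.0, 0.25])
assert np.allclose(idht(dht(x)), x)  # round trip restores the frame exactly
```

After each frame's spectrum is inverted, the time-domain frames can be reassembled by overlap-add; with the periodic Hann window and 50% overlap from step 202, overlapping windows sum to one, so the interior of the signal is reconstructed exactly.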
In the method provided by embodiments of the present invention, the conversions between the time domain and the frequency domain use a transform algorithm that represents the amplitude of each audio frame with real numbers only. Because the transformed spectrum is a real spectrum carrying no phase information, the inverse transform restores the original phase and no phase information is lost. Separating accompaniment and vocals from an audio file in this way therefore avoids the phase distortion of Fourier-transform spectrum decomposition.
Any combination of the above optional technical solutions may form optional embodiments of the present disclosure, which are not described in detail here.
Fig. 3 is a schematic structural diagram of a speech signal separation device provided in an embodiment of the present invention. Referring to Fig. 3, the device comprises:
a sampling module 301, configured to sample the acoustic waveform of an audio file to be separated to obtain an audio signal;
a first conversion module 302, configured to convert the audio signal from the time domain to the frequency domain to obtain a spectrum of the audio signal, the spectrum representing only the amplitude of the audio signal, the amplitude being a real number;
a decomposition module 303, configured to decompose the spectrum of the audio signal to obtain an accompaniment spectrum and a vocal spectrum;
a second conversion module 304, configured to convert the accompaniment spectrum and the vocal spectrum from the frequency domain to the time domain to obtain accompaniment audio and vocal audio.
In one possible embodiment, the first conversion module 302 comprises:
a framing unit, configured to perform framing processing on the audio signal to obtain multiple audio frames;
a time-frequency conversion unit, configured to convert each of the multiple audio frames from the time domain to the frequency domain to obtain spectra of the multiple audio frames, the spectrum of each audio frame representing only the amplitude of the audio frame, the amplitude being a real number;
a combining unit, configured to combine the spectra of the multiple audio frames to obtain the spectrum of the audio signal.
In one possible embodiment, the framing unit is configured to:
apply a preset window function to the audio signal to obtain multiple audio frames.
In one possible embodiment, the length of the preset window function equals the number of sampling points of each audio frame.
In one possible embodiment, the number of sampling points of each audio frame is twice the number of overlapping sampling points between frames.
In one possible embodiment, the decomposition module is configured to call a preset decomposition model, the preset decomposition model being configured to perform spectrum separation based on a signal spectrum, input the spectrum of the audio signal into the preset decomposition model, and output the accompaniment spectrum and the vocal spectrum.
It should be noted that when the speech signal separation device provided by the above embodiments performs speech signal separation, the division into the functional modules above is only an example; in practical applications, the functions can be assigned to different functional modules as needed, that is, the internal structure of the equipment can be divided into different functional modules to complete all or part of the functions described above. In addition, the speech signal separation device provided by the above embodiments belongs to the same concept as the speech signal separation method embodiments; its specific implementation process is detailed in the method embodiments and is not repeated here.
Fig. 4 is a schematic structural diagram of a computer equipment provided in an embodiment of the present invention. The computer equipment may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPU) 401 and one or more memories 402, where the memory 402 stores at least one instruction that is loaded and executed by the processor 401 to implement the methods provided by each of the method embodiments above. Of course, the computer equipment can also have components such as a wired or wireless network interface, a keyboard and an input/output interface for input and output, and can also include other components for realizing the functions of the equipment, which are not described here.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions that can be executed by a processor in a terminal to complete the speech signal separation method in the embodiments above. For example, the computer-readable storage medium can be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above embodiments can be completed by hardware, or by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium, and the storage medium mentioned above can be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is merely the preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (14)
1. a kind of speech signal separation method, which is characterized in that the described method includes:
The acoustic waveform of audio file to be separated is sampled, audio signal is obtained;
The audio signal is converted from time domain to frequency domain, the frequency spectrum of the audio signal is obtained, the frequency spectrum is only used for indicating
The amplitude of the audio signal and the amplitude are real number;
The frequency spectrum of the audio signal is decomposed, accompaniment frequency spectrum and voice frequency spectrum are obtained;
The accompaniment frequency spectrum and voice frequency spectrum are converted from frequency domain to time domain, audio accompaniment and voice audio are obtained.
2. the method according to claim 1, wherein described convert the audio signal to frequency domain from time domain,
Obtain the frequency spectrum of the audio signal, comprising:
The audio signal is subjected to sub-frame processing, obtains multiple audio frames;
The multiple audio frame is converted from time domain to frequency domain respectively, obtains the frequency spectrum of the multiple audio frame, each audio frame
Frequency spectrum be only used for indicating the amplitude of the audio frame and amplitude is real number;
The frequency spectrum of the multiple audio frame is combined, the frequency spectrum of the audio signal is obtained.
3. according to the method described in claim 2, it is characterized in that, it is described by the audio signal carry out sub-frame processing, obtain
Multiple audio frames, comprising:
Based on default window function, windowing process is carried out to the audio signal, obtains multiple audio frames.
4. according to the method described in claim 3, it is characterized in that, the length of the default window function and each audio frame
Sampling number it is identical.
5. according to the method described in claim 2, it is characterized in that, the sampling number of each audio frame is frame overlap sampling points
2 times.
6. being obtained the method according to claim 1, wherein the frequency spectrum by the audio signal decomposes
To accompaniment frequency spectrum and voice frequency spectrum, comprising:
Preset decomposition model is called, the preset decomposition model is used to carry out frequency spectrum separation based on signal spectrum;
The frequency spectrum of the audio signal is inputted into the preset decomposition model, output accompaniment frequency spectrum and voice frequency spectrum.
7. a kind of speech signal separation device, which is characterized in that described device includes:
Sampling module samples for the acoustic waveform to audio file to be separated, obtains audio signal;
First conversion module obtains the frequency spectrum of the audio signal, institute for converting the audio signal from time domain to frequency domain
Frequency spectrum is stated to be only used for indicating the amplitude of the audio signal and the amplitude for real number;
Decomposing module obtains accompaniment frequency spectrum and voice frequency spectrum for decomposing the frequency spectrum of the audio signal;
Second conversion module, for converting the accompaniment frequency spectrum and voice frequency spectrum from frequency domain to time domain, obtain audio accompaniment with
Voice audio.
8. The apparatus according to claim 7, wherein the first conversion module comprises:
a framing unit, configured to perform framing processing on the audio signal to obtain multiple audio frames;
a time-frequency conversion unit, configured to convert each of the multiple audio frames from the time domain to the frequency domain to obtain frequency spectra of the multiple audio frames, the frequency spectrum of each audio frame being used only to indicate the amplitude of that audio frame, the amplitude being a real number; and
a combining unit, configured to combine the frequency spectra of the multiple audio frames to obtain the frequency spectrum of the audio signal.
9. The apparatus according to claim 8, wherein the framing unit is configured to:
perform windowing processing on the audio signal based on a preset window function to obtain multiple audio frames.
10. The apparatus according to claim 9, wherein the length of the preset window function is identical to the number of sampling points of each audio frame.
11. The apparatus according to claim 8, wherein the number of sampling points of each audio frame is twice the number of frame-overlap sampling points.
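Claims 9 to 11 (and their method-side counterparts) describe windowed framing in which the window length equals the frame length and each frame is twice the overlap, i.e. a 50% hop. A minimal numpy sketch under those assumptions; the frame length of 1024 samples and the Hann window are illustrative choices, not taken from the claims:

```python
import numpy as np

frame_len = 1024               # sampling points per frame (illustrative)
hop = frame_len // 2           # claim 11: frame length = 2 x overlap, so 50% hop
window = np.hanning(frame_len) # claim 10: window length equals frame length

signal = np.random.randn(16000)  # stand-in for the sampled audio signal

# Split the signal into overlapping frames and apply the window (claim 9).
n_frames = 1 + (len(signal) - frame_len) // hop
frames = np.stack([
    signal[i * hop : i * hop + frame_len] * window
    for i in range(n_frames)
])

print(frames.shape)  # (30, 1024): 30 frames of 1024 windowed samples each
```

Each row of `frames` is one windowed audio frame, ready for a per-frame time-to-frequency transform.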
12. The apparatus according to claim 7, wherein the decomposition module is configured to: call a preset decomposition model, the preset decomposition model being configured to perform frequency-spectrum separation based on a signal frequency spectrum; input the frequency spectrum of the audio signal into the preset decomposition model; and output the accompaniment frequency spectrum and the vocal frequency spectrum.
13. A computer device, wherein the computer device comprises a processor and a memory, the memory storing at least one instruction, the instruction being loaded and executed by the processor to implement the operations performed in the speech signal separation method according to any one of claims 1 to 7.
14. A computer-readable storage medium, wherein at least one instruction is stored in the storage medium, the instruction being loaded and executed by a processor to implement the operations performed in the speech signal separation method according to any one of claims 1 to 7.
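Taken together, the claims describe a full pipeline: sample, frame, transform to a real-valued magnitude spectrum, decompose, and convert each part back to the time domain. Because the claimed spectrum carries magnitude only, some phase must be supplied for the inverse transform; the sketch below reuses the mixture's phase, which is a common engineering choice but an assumption here, not something the claims specify. The decomposition step is again a hypothetical fixed mask, and `scipy.signal` provides the STFT/ISTFT:

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
mixture = np.random.randn(fs)  # stand-in for the sampled audio file

# Time -> frequency: Hann window, 50% overlap (cf. claims 3-5 / 9-11).
f, t, Z = stft(mixture, fs=fs, window='hann', nperseg=1024, noverlap=512)
magnitude, phase = np.abs(Z), np.angle(Z)  # real-valued spectrum per claim 1

# Hypothetical decomposition: a fixed 70/30 soft mask, illustration only.
acc_mag = 0.7 * magnitude
voc_mag = 0.3 * magnitude

# Frequency -> time (claim 1's final step); reusing the mixture phase is
# an assumption, since the claims keep only the amplitude.
_, accompaniment = istft(acc_mag * np.exp(1j * phase), fs=fs,
                         window='hann', nperseg=1024, noverlap=512)
_, vocal = istft(voc_mag * np.exp(1j * phase), fs=fs,
                 window='hann', nperseg=1024, noverlap=512)

# With a linear mask and shared phase, the two parts sum back to the mixture
# (the ISTFT output may carry a few padded samples past the original length).
assert np.allclose((accompaniment + vocal)[:len(mixture)], mixture, atol=1e-6)
```

A trained separation model would replace the fixed mask, but the time-frequency round trip is the same.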
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810802835.7A CN108962277A (en) | 2018-07-20 | 2018-07-20 | Speech signal separation method, apparatus, computer equipment and storage medium |
PCT/CN2018/118293 WO2020015270A1 (en) | 2018-07-20 | 2018-11-29 | Voice signal separation method and apparatus, computer device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810802835.7A CN108962277A (en) | 2018-07-20 | 2018-07-20 | Speech signal separation method, apparatus, computer equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108962277A true CN108962277A (en) | 2018-12-07 |
Family
ID=64482037
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810802835.7A Pending CN108962277A (en) | 2018-07-20 | 2018-07-20 | Speech signal separation method, apparatus, computer equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108962277A (en) |
WO (1) | WO2020015270A1 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109767760A (en) * | 2019-02-23 | 2019-05-17 | 天津大学 | Far field audio recognition method based on the study of the multiple target of amplitude and phase information |
CN109801644A (en) * | 2018-12-20 | 2019-05-24 | 北京达佳互联信息技术有限公司 | Separation method, device, electronic equipment and the readable medium of mixed sound signal |
CN110085251A (en) * | 2019-04-26 | 2019-08-02 | 腾讯音乐娱乐科技(深圳)有限公司 | Voice extracting method, voice extraction element and Related product |
CN110277105A (en) * | 2019-07-05 | 2019-09-24 | 广州酷狗计算机科技有限公司 | Eliminate the methods, devices and systems of background audio data |
CN111192594A (en) * | 2020-01-10 | 2020-05-22 | 腾讯音乐娱乐科技(深圳)有限公司 | Method for separating voice and accompaniment and related product |
CN111429942A (en) * | 2020-03-19 | 2020-07-17 | 北京字节跳动网络技术有限公司 | Audio data processing method and device, electronic equipment and storage medium |
CN115240709A (en) * | 2022-07-25 | 2022-10-25 | 镁佳(北京)科技有限公司 | Sound field analysis method and device for audio file |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1945689A (en) * | 2006-10-24 | 2007-04-11 | 北京中星微电子有限公司 | Method and its device for extracting accompanying music from songs |
CN101944355A (en) * | 2009-07-03 | 2011-01-12 | 深圳Tcl新技术有限公司 | Obbligato music generation device and realization method thereof |
CN102402977A (en) * | 2010-09-14 | 2012-04-04 | 无锡中星微电子有限公司 | Method for extracting accompaniment and human voice from stereo music and device of method |
CN104053120A (en) * | 2014-06-13 | 2014-09-17 | 福建星网视易信息系统有限公司 | Method and device for processing stereo audio frequency |
CN106024005A (en) * | 2016-07-01 | 2016-10-12 | 腾讯科技(深圳)有限公司 | Processing method and apparatus for audio data |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8954175B2 (en) * | 2009-03-31 | 2015-02-10 | Adobe Systems Incorporated | User-guided audio selection from complex sound mixtures |
CN104078051B (en) * | 2013-03-29 | 2018-09-25 | 南京中兴软件有限责任公司 | A kind of voice extracting method, system and voice audio frequency playing method and device |
CN103943113B (en) * | 2014-04-15 | 2017-11-07 | 福建星网视易信息系统有限公司 | The method and apparatus that a kind of song goes accompaniment |
CN104134444B (en) * | 2014-07-11 | 2017-03-15 | 福建星网视易信息系统有限公司 | A kind of song based on MMSE removes method and apparatus of accompanying |
2018
- 2018-07-20 CN CN201810802835.7A patent/CN108962277A/en active Pending
- 2018-11-29 WO PCT/CN2018/118293 patent/WO2020015270A1/en active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1945689A (en) * | 2006-10-24 | 2007-04-11 | 北京中星微电子有限公司 | Method and its device for extracting accompanying music from songs |
CN101944355A (en) * | 2009-07-03 | 2011-01-12 | 深圳Tcl新技术有限公司 | Obbligato music generation device and realization method thereof |
CN102402977A (en) * | 2010-09-14 | 2012-04-04 | 无锡中星微电子有限公司 | Method for extracting accompaniment and human voice from stereo music and device of method |
CN104053120A (en) * | 2014-06-13 | 2014-09-17 | 福建星网视易信息系统有限公司 | Method and device for processing stereo audio frequency |
CN106024005A (en) * | 2016-07-01 | 2016-10-12 | 腾讯科技(深圳)有限公司 | Processing method and apparatus for audio data |
Non-Patent Citations (2)
Title |
---|
吴本谷: "Research on Vocal Separation in Music", China Master's Theses Full-text Database (Information Science and Technology) *
栾正禧: "Encyclopedia of China Posts and Telecommunications", 30 September 1993 *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109801644A (en) * | 2018-12-20 | 2019-05-24 | 北京达佳互联信息技术有限公司 | Separation method, device, electronic equipment and the readable medium of mixed sound signal |
US11430427B2 (en) | 2018-12-20 | 2022-08-30 | Beijing Dajia Internet Information Technology Co., Ltd. | Method and electronic device for separating mixed sound signal |
CN109767760A (en) * | 2019-02-23 | 2019-05-17 | 天津大学 | Far field audio recognition method based on the study of the multiple target of amplitude and phase information |
CN110085251B (en) * | 2019-04-26 | 2021-06-25 | 腾讯音乐娱乐科技(深圳)有限公司 | Human voice extraction method, human voice extraction device and related products |
CN110085251A (en) * | 2019-04-26 | 2019-08-02 | 腾讯音乐娱乐科技(深圳)有限公司 | Voice extracting method, voice extraction element and Related product |
CN110277105A (en) * | 2019-07-05 | 2019-09-24 | 广州酷狗计算机科技有限公司 | Eliminate the methods, devices and systems of background audio data |
CN110277105B (en) * | 2019-07-05 | 2021-08-13 | 广州酷狗计算机科技有限公司 | Method, device and system for eliminating background audio data |
CN111192594A (en) * | 2020-01-10 | 2020-05-22 | 腾讯音乐娱乐科技(深圳)有限公司 | Method for separating voice and accompaniment and related product |
CN111192594B (en) * | 2020-01-10 | 2022-12-09 | 腾讯音乐娱乐科技(深圳)有限公司 | Method for separating voice and accompaniment and related product |
CN111429942A (en) * | 2020-03-19 | 2020-07-17 | 北京字节跳动网络技术有限公司 | Audio data processing method and device, electronic equipment and storage medium |
CN111429942B (en) * | 2020-03-19 | 2023-07-14 | 北京火山引擎科技有限公司 | Audio data processing method and device, electronic equipment and storage medium |
CN115240709A (en) * | 2022-07-25 | 2022-10-25 | 镁佳(北京)科技有限公司 | Sound field analysis method and device for audio file |
CN115240709B (en) * | 2022-07-25 | 2023-09-19 | 镁佳(北京)科技有限公司 | Sound field analysis method and device for audio file |
Also Published As
Publication number | Publication date |
---|---|
WO2020015270A1 (en) | 2020-01-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108962277A (en) | Speech signal separation method, apparatus, computer equipment and storage medium | |
Li et al. | ICASSP 2021 deep noise suppression challenge: Decoupling magnitude and phase optimization with a two-stage deep network | |
CN109584903B (en) | Multi-user voice separation method based on deep learning | |
US20210193149A1 (en) | Method, apparatus and device for voiceprint recognition, and medium | |
CN103426437A (en) | Source separation using independent component analysis with mixed multi-variate probability density function | |
Ming et al. | Exemplar-based sparse representation of timbre and prosody for voice conversion | |
CN103426436A (en) | Source separation by independent component analysis in conjuction with optimization of acoustic echo cancellation | |
CN108492818B (en) | Text-to-speech conversion method and device and computer equipment | |
WO1993018505A1 (en) | Voice transformation system | |
WO2022166710A1 (en) | Speech enhancement method and apparatus, device, and storage medium | |
CN103426434A (en) | Source separation by independent component analysis in conjunction with source direction information | |
US10141008B1 (en) | Real-time voice masking in a computer network | |
Kumar | Comparative performance evaluation of MMSE-based speech enhancement techniques through simulation and real-time implementation | |
CN113921022B (en) | Audio signal separation method, device, storage medium and electronic equipment | |
US9484044B1 (en) | Voice enhancement and/or speech features extraction on noisy audio signals using successively refined transforms | |
US9530434B1 (en) | Reducing octave errors during pitch determination for noisy audio signals | |
CN114203163A (en) | Audio signal processing method and device | |
CN112185410A (en) | Audio processing method and device | |
Peer et al. | Phase-aware deep speech enhancement: It's all about the frame length | |
US9208794B1 (en) | Providing sound models of an input signal using continuous and/or linear fitting | |
CN113035207A (en) | Audio processing method and device | |
Li et al. | Filtering and refining: A collaborative-style framework for single-channel speech enhancement | |
CN112750444A (en) | Sound mixing method and device and electronic equipment | |
CN112151055B (en) | Audio processing method and device | |
CN113744715A (en) | Vocoder speech synthesis method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181207 |
RJ01 | Rejection of invention patent application after publication | |