WO2024027246A1 - Sound signal processing method, apparatus, electronic device and storage medium - Google Patents

Sound signal processing method, apparatus, electronic device and storage medium

Info

Publication number
WO2024027246A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound
sound source
candidate
signal
processed
Application number
PCT/CN2023/092372
Other languages
English (en)
French (fr)
Inventor
陈俊彬
Original Assignee
深圳Tcl新技术有限公司
Application filed by 深圳Tcl新技术有限公司 filed Critical 深圳Tcl新技术有限公司
Publication of WO2024027246A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G10L21/0308 Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques, the extracted parameters being the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/60 Speech or voice analysis techniques for measuring the quality of voice signals

Definitions

  • The present application relates to the field of signal processing technology, and specifically to a sound signal processing method, apparatus, electronic device and storage medium.
  • When electronic equipment collects sound signals through multiple microphone channels, the collected signals often include environmental noise, interfering source signals, reflections in the environment, and other interference signals unrelated to the sound source.
  • Because of the complexity of the transmission process, these interference signals become mixed with the sound source signal, making it difficult to extract the sound source signal.
  • Blind source separation is an effective method to solve this problem. Its purpose is to extract the sound source signal from complex mixed signals.
  • Although existing blind source separation methods can separate sound source signals from a complex mixed sound signal, they cannot identify whether the separated sound source signals are valid and whether their quality meets requirements; as a result, the accuracy of the separated sound source signals is not high, and blind source separation is not very stable.
  • Embodiments of the present application provide a sound signal processing method, apparatus, electronic device and storage medium, which can improve the stability of signal separation.
  • Embodiments of the present application provide a sound signal processing method, including: performing sound source separation processing on sound source data to be processed to obtain the candidate sound sources corresponding to the sound source data to be processed and the sound signals belonging to each candidate sound source; evaluating the quality of the sound signal of each candidate sound source and determining the evaluation value of the sound signal of each candidate sound source; determining a target sound source from the multiple candidate sound sources based on the evaluation values; and processing the sound signal of the target sound source.
  • Embodiments of the present application also provide a sound signal processing apparatus, including:
  • a separation module configured to perform sound source separation processing on the sound source data to be processed, and obtain the candidate sound sources corresponding to the sound source data to be processed and the sound signals belonging to each of the candidate sound sources;
  • an evaluation module configured to evaluate the quality of the sound signal of each candidate sound source and determine the evaluation value of the sound signal of each candidate sound source;
  • a selection module configured to determine a target sound source from a plurality of candidate sound sources based on the evaluation value corresponding to the sound signal of each candidate sound source; and
  • a processing module configured to process the sound signal of the target sound source.
  • Embodiments of the present application also provide an electronic device, including: a memory storing a computer program, and a processor configured to run the computer program in the memory to perform the operations in the sound signal processing method.
  • Embodiments of the present application also provide a storage medium storing multiple instructions, the instructions being suitable for loading by a processor to execute the steps of the sound signal processing method.
  • The embodiments of the present application add a signal quality evaluation mechanism to signal separation. Sound source separation processing is performed on the sound source data to be processed to obtain the candidate sound sources corresponding to the sound source data to be processed and the sound signals belonging to each candidate sound source; the quality of the sound signal of each candidate sound source is evaluated to determine the evaluation value of the sound signal of each candidate sound source; a target sound source is determined from the multiple candidate sound sources according to the evaluation values; and the sound signal of the target sound source is processed. By evaluating the sound signal quality of each candidate sound source within signal separation, an effective target sound source can be selected, which improves the accuracy of the separated sound source signal and alleviates the problem of low separation stability.
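  • To make the four-step flow above concrete, the following is a minimal Python sketch of the pipeline; all function names are illustrative, and the separation and evaluation steps are placeholders standing in for the methods detailed later, not the patent's own implementation.

```python
import numpy as np
from scipy.stats import kurtosis

def separate_sources(mixture):
    """Placeholder for step 101: blind source separation.

    mixture: (channels, samples) array of microphone signals.
    Returns an array of candidate source signals. Any separation
    method (ICA, IVA, NMF, clustering, ...) could be plugged in here.
    """
    return mixture.copy()  # stand-in: treat each channel as a candidate

def evaluate_quality(signal):
    """Placeholder for step 102: one scalar evaluation value per signal.

    Kurtosis is one of the measures listed in the text: speech is
    super-Gaussian, so higher kurtosis suggests a cleaner speech signal.
    """
    return kurtosis(signal)

def process_sound_signal(sound_mixture):
    candidates = separate_sources(sound_mixture)                  # step 101
    scores = np.array([evaluate_quality(s) for s in candidates])  # step 102
    target = candidates[np.argmax(scores)]                        # step 103
    return target  # step 104: hand off to output/recognition/storage

rng = np.random.default_rng(0)
out = process_sound_signal(rng.standard_normal((2, 16000)))
```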
  • Figure 1 is a schematic flow chart of the sound signal processing method provided by an embodiment of the present application;
  • Figure 2 is a schematic flow chart of a sound source separation process in the sound signal processing method provided by an embodiment of the present application;
  • Figure 3 is a schematic flow chart of the estimation method of candidate sound sources in the sound signal processing method provided by an embodiment of the present application;
  • Figure 4 is a schematic flow chart of another sound source separation process in the sound signal processing method provided by an embodiment of the present application;
  • Figure 5 is a schematic structural diagram of a sound signal processing apparatus provided by an embodiment of the present application;
  • Figure 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • AuxICA: Auxiliary Function Based Independent Component Analysis, an independent component analysis method based on an auxiliary function.
  • AuxIVA: Auxiliary Function Based Independent Vector Analysis, an independent vector analysis method based on an auxiliary function.
  • Embodiments of the present application provide a sound signal processing method that determines the final target sound source through the evaluation value of each candidate sound source, thereby improving the accuracy of the separated sound source signal, alleviating the problem of low signal separation stability, and improving the audio-visual effect.
  • Figure 1 is a schematic flow chart of a sound signal processing method provided by an embodiment of the present application.
  • The sound signal processing method shown can be applied to electronic devices.
  • The electronic device can be a mobile terminal.
  • The electronic device may also be a voice device, such as a Bluetooth speaker, a smart speaker, a microphone, or a smart home device.
  • The sound signal processing method shown includes steps 101 to 104:
  • The sound source data to be processed refers to the speech signals in the current environment collected by the electronic device, which include the speech signal of the sound source and the noise existing in the environment.
  • Candidate sound sources are the sound sources, estimated from the sound source data to be processed, that may exist in the current environment; they include the target sound source.
  • The sound source data to be processed may be speech signals in the current environment collected in real time, or speech signals in the current environment collected within a preset time period.
  • A microphone array is provided in the electronic device.
  • The electronic device collects speech signals in its current environment through the microphone array. Since the distance between the sound source and each microphone channel in the array differs, every microphone channel in the array may receive the sound signal from the sound source, and room reverberation, interference from other sound sources, environmental noise, and noise inside the device inevitably reduce the quality and intelligibility of the speech signal.
  • Speech recognition technology is not as sensitive and robust as human hearing, which can distinguish various sound sources and eliminate interference; these interferences therefore leave noise in the sound source data to be processed.
  • The microphone array may be a ring microphone array, a linear microphone array, or a distributed microphone array, and includes at least one microphone channel.
  • The characteristics of different sound sources can be separated from the sound source data to be processed through a sound source separation method based on a deep neural network, and the candidate sound sources in the sound source data to be processed can be obtained from the separated characteristics. Methods for obtaining the candidate sound sources and the sound signal of each candidate sound source include, but are not limited to, sound source separation methods based on deep clustering, sound source separation methods based on permutation invariant training, and end-to-end sound source separation methods.
  • The candidate sound sources in the sound source data to be processed and the sound signals of each candidate sound source can be obtained through blind source separation using a separation method based on independent subspace analysis.
  • They can also be obtained through blind source separation using a separation method based on non-negative matrix factorization.
  • Blind source separation can be performed on the sound source data to be processed through a clustering-based separation method to obtain the candidate sound sources and the sound signals of each candidate sound source.
  • For example, a Gaussian mixture model is used to cluster the sound source data to be processed to obtain the candidate sound sources and the sound signal of each candidate sound source.
  • Blind source separation can also be performed on the sound source data to be processed through principal component analysis to obtain the candidate sound sources and the sound signals of each candidate sound source.
  • A separation method based on independent component analysis can perform blind source separation by analyzing the mutually independent statistical characteristics of the signals in the sound source data to be processed, to obtain the candidate sound sources and the sound signal of each candidate sound source.
  • Blind source separation can likewise be performed through a separation method based on independent vector analysis to obtain the candidate sound sources and the sound signal of each candidate sound source.
  • The candidate sound sources and the sound signals of each candidate sound source can also be obtained by performing blind source separation through a separation method of independent vector analysis optimized with an auxiliary function.
  • The above sound source separation processing methods are only illustrative and do not constitute a limitation of the sound signal processing method provided by the embodiment of the present application.
  • For example, sound source separation processing can also be performed on the sound source data to be processed through a separation method based on overdetermined independent vector analysis optimized with an auxiliary function, to obtain the candidate sound sources in the sound source data to be processed and the sound signal of each candidate sound source.
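  • As one concrete, hedged illustration of the ICA-style separation listed above, the sketch below uses scikit-learn's FastICA on a synthetic two-source instantaneous mixture; this is a stand-in for the patent's methods (which operate on convolutive mixtures in the frequency domain), and all data and names are illustrative.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 8000)
s1 = np.sin(2 * np.pi * 5 * t)            # tone source
s2 = np.sign(np.sin(2 * np.pi * 3 * t))   # square-wave source
S = np.c_[s1, s2]                         # (samples, sources)

# Unknown mixing matrix standing in for the channel transfer functions.
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T                               # (samples, channels) mixture

# Blind source separation: recover statistically independent components.
ica = FastICA(n_components=2, random_state=0)
candidates = ica.fit_transform(X)         # (samples, sources) candidates
```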
  • The evaluation value characterizes the sound quality of the sound signal of each candidate sound source and is used to quantify the probability that each candidate sound source is the target sound source.
  • The quality of the sound signal of each candidate sound source can be evaluated, and its evaluation value determined, by calculating the signal-to-interference ratio of the sound signal of each candidate sound source.
  • The quality can likewise be evaluated by calculating the signal-to-distortion ratio of the sound signal of each candidate sound source.
  • The quality can be evaluated by calculating the maximum likelihood ratio of the sound signal of each candidate sound source.
  • The quality can be evaluated by calculating the cepstral distance of the sound signal of each candidate sound source.
  • The quality can be evaluated by calculating the frequency-weighted segmental signal-to-noise ratio of the sound signal of each candidate sound source.
  • The quality can be evaluated by calculating the perceptual evaluation of speech quality (PESQ) score of the sound signal of each candidate sound source.
  • The quality can be evaluated by calculating the kurtosis value of the sound signal of each candidate sound source.
  • The quality can be evaluated by calculating the probability score corresponding to the speech feature vector of the sound signal of each candidate sound source.
  • The probability score characterizes the probability that the sound signal of each candidate sound source is the speech signal of the target sound source.
  • The above-mentioned methods of evaluating the quality of the sound signal of each candidate sound source are only exemplary and do not constitute a limitation on the sound signal processing method provided by the embodiment of the present application. In practical applications, the evaluation method can be selected based on the computing efficiency of the electronic device in the actual application scenario.
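  • Some of the measures above (signal-to-interference ratio, signal-to-distortion ratio, PESQ) compare against a reference signal and so suit offline benchmarking, while kurtosis is reference-free; a minimal sketch of one of each kind, with illustrative names:

```python
import numpy as np
from scipy.stats import kurtosis

def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB against a known reference signal."""
    err = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(err ** 2))

def blind_score(estimate):
    """Reference-free proxy: kurtosis, higher for cleaner speech."""
    return kurtosis(estimate)
```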
  • In some embodiments, step 103 includes: selecting the candidate sound source corresponding to the maximum evaluation value according to the evaluation values of the sound signals of the candidate sound sources, and setting that candidate sound source as the target sound source.
  • In other embodiments, step 103 includes: obtaining a statistical characteristic of the evaluation values based on the evaluation value of the sound signal of each candidate sound source, and determining the target sound source from the multiple candidate sound sources based on that statistical characteristic, where the statistical characteristic includes the median or the mode of the evaluation values. For example, the candidate sound sources whose evaluation value is greater than or equal to the median, or greater than the mode, are selected, and the selected candidate sound sources are set as target sound sources.
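  • A sketch of the two selection rules just described for step 103 (maximum evaluation value, or thresholding against the median of the evaluation values); the mode-based rule follows the same pattern:

```python
import numpy as np

def select_max(scores):
    """Pick the single candidate with the largest evaluation value."""
    return int(np.argmax(scores))

def select_by_median(scores):
    """Pick every candidate whose evaluation value is >= the median."""
    return np.flatnonzero(scores >= np.median(scores)).tolist()

scores = np.array([0.2, 1.7, 0.9, 1.1])
assert select_max(scores) == 1             # candidate 1 has the largest value
assert select_by_median(scores) == [1, 3]  # median is 1.0
```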
  • Processing the sound signal of the target sound source includes, but is not limited to, speech output, speech recognition, speech transmission, speech storage, etc.
  • When the electronic device is a voice interaction device, step 104 includes: obtaining the sound signal corresponding to the target sound source, performing semantic analysis on it to obtain the speech information it carries, and having the electronic device respond to the instruction corresponding to that speech information by performing the corresponding operation, such as conducting a dialogue interaction, performing a query, or playing music.
  • When the electronic device is a voice collection device, such as a radio device, step 104 includes: obtaining the sound signal corresponding to the target sound source and storing the voice signal, or transmitting the voice signal to a server that communicates with the electronic device.
  • Embodiments of the present application provide a sound signal processing method that determines the final target sound source through the evaluation value of each candidate sound source, improving the accuracy of the separated sound source signal and alleviating the problem of low signal separation stability.
  • Existing blind source separation methods cannot use the position information of the sound source: when the position of the sound source changes, they cannot accurately detect the change. The target sound source separated by such methods therefore carries a degree of uncertainty, which makes the sound signal of the separated target sound source unstable. To further improve the certainty of the blind source separation output and reduce the noise in the output speech signal, achieving a noise reduction effect, sound source estimation is performed on the sound source data to be processed to obtain the sound source prior information in the sound source data, and sound source separation processing is performed based on this prior information. This improves the accuracy of the candidate sound sources obtained by blind source separation, thereby ensuring the accuracy of the final target sound source.
  • The sound source prior information refers to the location information of candidate sound sources that may exist in the environment where the electronic device collecting the sound source data to be processed is located.
  • The position information of a candidate sound source may be its spatial coordinates, or its pitch angle and azimuth angle in the space of that environment. It should be noted that the embodiment of the present application does not limit how the spatial coordinate system is established; for example, the geometric center of the electronic device can be used as the origin.
  • Figure 2 is a schematic flow chart of a sound source separation process in the sound signal processing method provided by an embodiment of the present application.
  • The sound source separation processing method shown includes steps 201 to 203:
  • SRP, full name: Steered Response Power (controllable response power).
  • SRP can be used to perform sound source estimation on the sound source data to be processed to obtain the candidate sound sources corresponding to the sound source data to be processed and the location information of each candidate sound source. Specifically, this includes: estimating the spatial power spectrum distribution of the sound source data to be processed through the SRP method, and determining the candidate sound sources and the position information of each candidate sound source according to the power spectrum distribution.
  • For example, the position of maximum power can be determined according to the power spectrum distribution, and that position is set as the position information of the candidate sound source.
  • Multiple positions with power greater than or equal to a preset power value can also be determined from the power spectrum distribution, and the selected positions are set as the positions of candidate sound sources.
  • The preset power value can be the average power value in the power spectrum distribution, or it can be obtained by sorting the powers at each position in the space in descending order and taking the power value ranked S-th as the preset power value.
  • S is an integer greater than 0, and its value can be set according to the actual application scenario.
  • For example, the value of S can be 2, 3, 4, 5, etc.
  • Because the high-frequency part of the signal is prone to spatial aliasing, which affects the estimation of the position information of candidate sound sources, the embodiment of the present application first performs sound source estimation on the low-frequency part of the sound source data to be processed through SRP to obtain the estimation areas where candidate sound sources may be located, and then performs sound source estimation on the high-frequency part through SRP to select, from those estimation areas, the candidate sound sources corresponding to the sound source data to be processed and the position information of each candidate sound source. This includes steps a1 to a4:
  • Step a1: Perform frequency-domain conversion on the sound source data to be processed to obtain its frequency-domain signal.
  • Step a2: Filter the frequency-domain signal of the sound source data to be processed to obtain the low-frequency signal and the high-frequency signal of the frequency-domain signal.
  • The filter can be a low-pass filter or a high-pass filter.
  • Step a3: Based on the low-frequency signal, use the SRP method to estimate the time delays of the low-frequency signal across the microphone channels of the microphone array in the electronic device, obtain the controllable response power function value of the microphone array in each preset area, select the preset areas whose controllable response power function values are greater than or equal to a preset function value threshold, and set the selected preset areas as the estimation areas where candidate sound sources are located.
  • Step a4: Based on the high-frequency signal, use the SRP method to estimate the time delays of the high-frequency signal across the microphone channels, obtain the controllable response power function value of the microphone array in each estimation area, select the estimation areas whose controllable response power function values are greater than or equal to the preset function value threshold, set each selected estimation area as a candidate sound source, and set the position information of each selected estimation area as the position information of the corresponding candidate sound source.
  • In some embodiments, steps a3 to a4 include: dividing the spatial coordinate system into multiple first grid areas, where each first grid area corresponds to position information consisting of a pitch angle and an azimuth angle; using the SRP method to estimate the time delays of the low-frequency signal across the microphone channels of the microphone array and obtaining the first controllable response power function value of the microphone array in each first grid area; selecting the first grid area with the largest first controllable response power function value and setting it as the estimation area where the candidate sound source is located; dividing that estimation area into multiple second grid areas, where each second grid area corresponds to position information consisting of a pitch angle and an azimuth angle, and the angle difference between every two adjacent second grid areas is smaller than that between every two adjacent first grid areas; using the SRP method to estimate the time delays of the high-frequency signal across the microphone channels and obtaining the second controllable response power function value of the microphone array in each second grid area; and selecting the second grid area with the largest second controllable response power function value, whose position information is set as the position information of the candidate sound source.
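  • A simplified sketch of this coarse-to-fine idea for a two-microphone far-field case: a coarse azimuth grid is scored with low-frequency SRP-PHAT, then a finer grid around the winning cell is scored with the high-frequency band. The array geometry, band split, and grid spacing are illustrative assumptions, not values from the patent.

```python
import numpy as np

C = 343.0    # speed of sound, m/s
FS = 16000   # sampling rate, Hz
D = 0.1      # mic spacing, m (2-mic linear array, far-field model)

def srp_phat_score(X1, X2, tau, freqs, band):
    """SRP-PHAT score of one candidate delay tau over one frequency band."""
    sel = (freqs >= band[0]) & (freqs < band[1])
    cross = X1[sel] * np.conj(X2[sel])
    cross /= np.abs(cross) + 1e-12                    # PHAT weighting
    return np.real(np.sum(cross * np.exp(-2j * np.pi * freqs[sel] * tau)))

def scan(X1, X2, freqs, azimuths_deg, band):
    taus = D * np.cos(np.deg2rad(azimuths_deg)) / C   # far-field delays
    return np.array([srp_phat_score(X1, X2, t, freqs, band) for t in taus])

# Synthetic frame: channel 2 is channel 1 delayed by 3 samples (~50 deg).
rng = np.random.default_rng(2)
x1 = rng.standard_normal(2048)
x2 = np.roll(x1, 3)
X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
freqs = np.fft.rfftfreq(2048, 1 / FS)

# Stage 1: coarse grid on the low band, which resists spatial aliasing.
coarse = np.arange(0.0, 181.0, 10.0)
best = coarse[np.argmax(scan(X1, X2, freqs, coarse, (100, 1500)))]

# Stage 2: fine grid around the coarse winner, scored on the high band.
fine = np.clip(np.arange(best - 10.0, best + 10.5, 1.0), 0.0, 180.0)
estimate = fine[np.argmax(scan(X1, X2, freqs, fine, (1500, 6000)))]
```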
  • The position of a sound channel collecting the sound source data to be processed refers to the position information, in a preset spatial coordinate system, of each microphone channel of the microphone array provided in the electronic device.
  • The position guidance information is determined through the position vectors between the position information of each candidate sound source and the positions of the sound channels collecting the sound source data to be processed.
  • Since existing blind source separation methods do not consider the position of the sound source, noise exists in the separated sound signal. Therefore, to improve the stability of blind source separation, the position information of each candidate sound source is obtained through sound source estimation, and the position vector between the position information of each candidate sound source and the position of each sound channel collecting the sound source data to be processed is determined.
  • Blind source separation is then performed using these position vectors together with the sound signals collected at each sound channel position: sound source separation is performed on the sound source data to be processed to obtain the sound signal component of each candidate sound source in the sound signal at each sound channel position, and the sound signal of each candidate sound source is obtained by collecting the signal components attributable to each candidate sound source position from the channel signals.
  • In some embodiments, the distance between the position information of each candidate sound source and the position of each sound channel collecting the sound source data to be processed can be obtained from the sound channel positions and the candidate sound source position information. From these distances, the angle information between the position information of each candidate sound source and the position of each sound channel is obtained, and the position guidance information of each candidate sound source is constructed from this angle information, where M is the number of sound channels collecting the sound source data to be processed.
  • In other embodiments, the distance between the position information of each candidate sound source and the position of each sound channel collecting the sound source data to be processed is obtained from the sound channel positions and the candidate sound source position information; from these distances, the time information required for the sound signal to travel from each candidate sound source position to each sound channel position is obtained; and the position guidance information of each candidate sound source is obtained from this time information, e.g. as a vector whose elements have the form e^(-j2πfτ), where τ represents the time required for the sound signal to travel from the position information of the candidate sound source to the position of each sound channel, f is the frequency, and j is the imaginary unit.
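  • A sketch of this position guidance (steering) information: the distances from a candidate source position to each microphone give the propagation times τ, and the entries at one frequency are e^(-j2πfτ). The geometry and frequency below are illustrative.

```python
import numpy as np

C = 343.0  # speed of sound, m/s

def steering_vector(source_pos, mic_positions, freq):
    """Position guidance information of one candidate source at one frequency.

    source_pos: (3,) candidate sound source coordinates, meters.
    mic_positions: (M, 3) coordinates of the M sound channels.
    Returns a complex (M,) vector with entries exp(-j*2*pi*freq*tau_m).
    """
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)  # 2-norm
    taus = dists / C               # propagation time to each channel
    return np.exp(-2j * np.pi * freq * taus)

mics = np.array([[0.00, 0.0, 0.0],
                 [0.05, 0.0, 0.0],
                 [0.10, 0.0, 0.0]])
d = steering_vector(np.array([1.0, 1.0, 0.0]), mics, freq=1000.0)
```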
  • Sound source separation can then be performed on the sound source data to be processed according to the separation methods described in step 101, using the position guidance information of each candidate sound source, to obtain the sound signal of each candidate sound source.
  • Below, the separation method based on overdetermined independent vector analysis optimized with an auxiliary function is taken as an example.
  • In this method, the received sound source data is composed of N source signals s_1, s_2, ..., s_N in the environment, which are mixed by the transfer functions h_mn into the mixed signals x_1, x_2, ..., x_M received by the M sound channels; that is, the sound source data can be expressed as x_m = Σ_{n=1..N} h_mn · s_n, where N is not greater than M (the overdetermined case).
  • The separation parameters of each frame of the frequency-domain signal of the sound source data are solved at each frequency point, and the frequency signals of each frame at each frequency point are separated through these separation parameters: the frequency-domain signal of each candidate sound source is separated from the frequency signals at each frequency point, and an inverse short-time Fourier transform is applied to the separated frequency-domain signal of each candidate sound source to obtain the sound signal of the candidate sound source.
  • When the separation method based on overdetermined independent vector analysis optimized with an auxiliary function is used for sound source separation of the sound source data, the separation parameters of each frame of the frequency-domain signal at each frequency point are solved by combining the position guidance information with the auxiliary-function-based optimization, and the candidate sound sources are then separated from the frequency signals of each frame at each frequency point in the frequency-domain signal of the sound source data.
  • The separation method includes steps b1 to b2:
  • Step b1: Determine the separation parameters of the sound source data to be processed based on the position guidance information of each candidate sound source.
  • The sound source data to be processed may be the current frame of the frequency-domain signal of the sound source data.
  • In some embodiments, step b1 includes: after the auxiliary parameters corresponding to the sound source data to be processed are obtained, correcting them according to the position guidance information of each candidate sound source to obtain corrected auxiliary parameters, and then obtaining the separation parameters of the sound source data to be processed based on the corrected auxiliary parameters.
  • The auxiliary parameters include an auxiliary parameter for the frequency signal at each frequency point of each frame of the frequency-domain signal of the sound source data to be processed.
  • The method of determining the separation parameters of the sound source data to be processed based on the position guidance information therefore includes: correcting the auxiliary parameters to obtain corrected auxiliary parameters, and obtaining the separation parameters of the sound source data to be processed from the corrected auxiliary parameters.
  • The historical separation parameters refer to the separation parameters of the frame of sound source data immediately preceding the sound source data to be processed. Since the sound signals of a sound source are correlated in time, the auxiliary-function-optimized overdetermined independent vector analysis method performs sound source separation on the sound source data to be processed frame by frame.
  • The separation parameters and auxiliary parameters are alternately updated to perform the sound source separation.
  • The auxiliary parameters are updated through the frequency-domain signal of the sound source data to be processed, and the separation parameters are updated by applying the updated auxiliary parameters to the separation parameters of the previous frame of sound source data.
  • In some embodiments, obtaining the auxiliary parameters corresponding to the sound source data to be processed includes: obtaining the historical separation parameters and historical auxiliary parameters of the historical sound source data; obtaining, from the historical separation parameters and the sound signal of the sound source data to be processed, the energy of each candidate sound source output for the previous frame; and obtaining the auxiliary parameters corresponding to the sound source data to be processed from the historical auxiliary parameters, the sound signal of the sound source data to be processed, and those energies.
  • The historical auxiliary parameters refer to the auxiliary parameters of the previous frame of sound source data.
  • For example, the auxiliary parameter of the k-th frequency point can take the form V_s(l,k) = α·V_s(l-1,k) + (1-α)·X(l,k)X^H(l,k)/r_s(l), where V_s(l-1,k) is the historical auxiliary parameter, r_s(l) is the energy of the s-th candidate sound source output for the previous frame, W_s(l-1,k) is the separation parameter of the k-th frequency point in the previous frame of sound source data (the historical separation parameter) from which that energy is computed, (·)^H denotes the conjugate transpose, and the auxiliary parameter V_s(1,k) of the first frame of sound source data is a matrix whose diagonal values are nonzero constants (e.g. the identity matrix).
  • Modifying the auxiliary parameters according to the position guidance information of each candidate sound source includes: calculating, from the position guidance information of each candidate sound source, the correction parameter corresponding to that candidate sound source's auxiliary parameter V_s(l,k), and correcting each V_s(l,k) with its correction parameter to obtain the corrected auxiliary parameters.
  • For example, the correction parameter corresponding to each candidate sound source's auxiliary parameter V_s(l,k) can be obtained from the position guidance information d_s(k), e.g. as a term of the form λ_s·d_s(k)d_s^H(k) added to V_s(l,k), where λ_s is a preset constant, giving the corrected auxiliary parameter D_s(l,k).
  • Based on the corrected auxiliary parameters and the historical separation parameters, the s-th intermediate parameter P_s(l,k) can be obtained through (W(l-1,k)D_s(l,k))^{-1}, and the separation parameter W(l,k) is then obtained from the intermediate parameters.
  • The value of the third intermediate parameter θ_s(l,k) of each element can be compared with a preset value. If it is consistent with the preset value, the s-th element W_s(l,k) of the separation parameter W(l,k) is obtained from the element's second intermediate parameter, the first intermediate parameter P_s(l,k), and the modified first intermediate parameter Q_s(l,k); if it is inconsistent with the preset value, W_s(l,k) is obtained from the element's intermediate parameters accordingly.
  • The above method for determining the separation parameters is only an illustrative description of determining them in a separation method based on overdetermined independent vector analysis optimized with an auxiliary function; in practical applications, the way the separation parameters are determined can be adjusted according to the separation method adopted.
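  • The following is a simplified single-frame sketch of an auxiliary-function-style online update for the determined case (as many sources as channels), including the steering correction D_s = V_s + λ·d_s·d_s^H as one hedged reading of the correction step above; the patent's overdetermined variant and its exact intermediate parameters (P_s, Q_s, θ_s) may differ.

```python
import numpy as np

def auxiva_frame_update(X, W, V, D_steer, alpha=0.95, lam=1e-3):
    """One online AuxIVA-style update for a single frame.

    X: (K, M) current-frame STFT, K frequency bins, M channels.
    W: (K, M, M) separation matrices; row s of W[k] is w_s^H.
    V: (K, M, M, M) auxiliary parameters, one M x M matrix per source s.
    D_steer: (K, M, M) steering vectors; column s of D_steer[k] is d_s(k).
    """
    K, M = X.shape
    Y = np.einsum('ksm,km->ks', W, X)                     # previous estimates
    r = np.sqrt(np.sum(np.abs(Y) ** 2, axis=0)) + 1e-12   # source energies
    for k in range(K):
        xx = np.outer(X[k], X[k].conj())
        for s in range(M):
            # Auxiliary parameter update with forgetting factor alpha.
            V[k, s] = alpha * V[k, s] + (1 - alpha) * xx / r[s]
            # Hedged steering correction: D_s = V_s + lam * d_s d_s^H.
            d = D_steer[k, :, s]
            Ds = V[k, s] + lam * np.outer(d, d.conj())
            # w_s = (W D_s)^-1 e_s, normalized so that w_s^H D_s w_s = 1.
            ws = np.linalg.solve(W[k] @ Ds, np.eye(M)[:, s])
            ws = ws / np.sqrt(np.real(ws.conj() @ Ds @ ws) + 1e-12)
            W[k, s] = ws.conj()
    return W, V

# Tiny usage example with random complex data.
K, M = 4, 2
rng = np.random.default_rng(5)
X = rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))
W = np.tile(np.eye(M, dtype=complex), (K, 1, 1))
V = np.tile(np.eye(M, dtype=complex), (K, M, 1, 1))
D = np.ones((K, M, M), dtype=complex)
W, V = auxiva_frame_update(X, W, V, D)
```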
  • Step b2: Perform sound source separation on the sound signals in the sound source data to be processed according to the separation parameters, and determine the sound signal of each candidate sound source.
  • In some embodiments, step b2 includes: after the separation parameters are obtained, separating the separated signal from the sound source data to be processed by computing the product of the separation parameters and the frequency-domain signal of the sound signals in the sound source data to be processed, where each element of the separated signal represents the frequency-domain signal of one candidate sound source; an inverse short-time Fourier transform is then applied to each separated frequency-domain signal to obtain the sound signal of each candidate sound source.
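  • A sketch of this application of the separation parameters: per frequency bin the product W(k)·X(l,k) yields the separated frequency-domain signals, and scipy's inverse STFT recovers the time-domain signals. The identity matrices below stand in for solved separation parameters.

```python
import numpy as np
from scipy.signal import stft, istft

FS = 16000
rng = np.random.default_rng(3)
x = rng.standard_normal((2, FS))             # (channels, samples) mixture

# STFT per channel: X has shape (channels, freq_bins, frames).
f, t, X = stft(x, fs=FS, nperseg=512)

# Stand-in separation parameters: one M x M matrix per frequency bin.
K = X.shape[1]
W = np.tile(np.eye(2, dtype=complex), (K, 1, 1))        # (K, M, M)

# Y(l,k) = W(k) X(l,k): each row of Y is one candidate source's spectrum.
Y = np.einsum('ksm,mkl->skl', W, X)                     # (sources, K, frames)

# Inverse short-time Fourier transform of each separated signal.
_, y = istft(Y, fs=FS, nperseg=512)                     # (sources, samples)
```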
  • In other embodiments, step b2 includes: after the separation parameters are obtained, obtaining the noise separation parameters from the separation parameters, forming the total separation parameters of the sound source data to be processed from the noise separation parameters and the separation parameters, and computing the product of the total separation parameters and the frequency-domain signal of the sound signals in the sound source data to be processed to separate the separated signal, where each element of the separated signal represents the frequency-domain signal of one candidate sound source; an inverse short-time Fourier transform is applied to each separated frequency-domain signal to obtain the sound signal of each candidate sound source.
  • The noise parameter matrix of the sound source data to be processed is obtained from the preceding noise parameter matrix C(l-1,k) and the sound signal of the sound source data to be processed through C(l,k) = α·C(l-1,k) + (1-α)·X(l,k)X^H(l,k), where α is the forgetting factor; it takes the same value as the forgetting factor used in the auxiliary parameter calculation and can be set to 0.95. In some embodiments of this application, the noise parameter matrix C(1,k) of the first frame of sound source data is a zero matrix.
  • The noise separation parameter U(l,k) is obtained from the separation parameters and the noise parameter matrix, and the total separation parameter is formed from the separation parameters W(l,k) and the noise separation parameters U(l,k).
  • An inverse short-time Fourier transform is performed on the frequency-domain signal separated for each candidate sound source to obtain the sound signal of each candidate sound source.
  • In other embodiments of step b2, after the separation parameters are obtained, the noise separation parameters are obtained from them and the total separation parameter of the sound source data to be processed is formed. A first transformation matrix of the total separation parameter is then obtained, e.g. as the product of A(l,k) and the total separation parameter, where A(l,k) is an M×M diagonal matrix whose diagonal elements are the diagonal elements of the inverse of the total separation parameter; the second transformation matrix W_bp(l,k) is obtained by extracting the elements in the first row to the S-th row of the first transformation matrix; and the separated signal Y(l,k) is separated from the sound source data to be processed through Y(l,k) = W_bp(l,k)X(l,k). Here (·)^H denotes the conjugate transpose.
  • The above-mentioned way of separating the sound signals of candidate sound sources from the sound source data to be processed through the separation parameters is only an exemplary illustration of the separation method based on overdetermined independent vector analysis optimized with an auxiliary function; in practical applications, it can be adjusted according to the separation method adopted.
  • In some embodiments, an initial sound source area may be selected from the sound source space in which the electronic device collecting the sound source data to be processed is located, and the initial sound source area is selected according to preset azimuth angles.
  • The area is evenly divided to obtain multiple direction vectors, each direction vector is set as an initial sound source position, and SRP (Steered Response Power, controllable response power) is used to calculate the power value of the sound source data to be processed at each initial sound source position.
  • Candidate sound source positions are then selected from the multiple initial sound source positions.
  • The candidate sound sources and the position information of the candidate sound sources are obtained from the selected candidate sound source positions.
  • Figure 3 is a schematic flow chart of a candidate sound source estimation method in the sound signal processing method provided by an embodiment of the present application.
  • The illustrated candidate sound source estimation method includes steps 301 to 304:
  • In some embodiments, the geometric center of the microphone array in the electronic device is used as the origin and a spatial coordinate system is established at it; the initial sound source area is the region within a preset angle range, traversed in the clockwise or counterclockwise direction with the origin as the center and a preset distance as the radius.
  • Multiple positions are selected within this area at preset angular intervals, and each selected position is set as an initial sound source position.
  • The direction vector of each selected position is obtained from the azimuth angle and the pitch angle that the position forms with the origin, and this direction vector is set as the position information of the initial sound source position.
  • In some embodiments, step 302 includes: determining the position coordinates of each microphone channel of the microphone array in the electronic device in the spatial coordinate system, and setting those position coordinates as the positions of the sound channels collecting the sound source data to be processed.
  • The position coordinates of each initial sound source position are determined from its position information. For each initial sound source position, the distance between the initial sound source position and each sound channel position is obtained from their position coordinates.
  • For example, the 2-norm between the position coordinates of the initial sound source position and the sound channel position can be calculated, or the Euclidean distance or the Mahalanobis distance between them can be calculated, to obtain the distance between the initial sound source position and each sound channel position.
  • In some embodiments, step 303 includes: for each initial sound source position, obtaining the distance difference between the initial sound source position and each pair of adjacent sound channel positions from the distances between the initial sound source position and the sound channel positions; obtaining, from the distance difference, the time difference with which the two adjacent sound channel positions receive the signal from the initial sound source position; determining, from the sound source data to be processed, the sound signal at each of the two adjacent sound channel positions; calculating, from the time difference and those sound signals, the power of the signal from the initial sound source position at the former of the two adjacent sound channel positions; and summing these powers over the sound channel positions to obtain the power of the sound signal at the initial sound source position.
  • The former sound channel position of two adjacent sound channel positions may be the one whose position coordinate is smaller than that of the other.
  • In some embodiments, step 303 may also calculate the power of the sound signal at each initial sound source position according to steps a1 to a3.
  • In some embodiments, step 304 includes: sorting the initial sound source positions in descending order of the power of the sound signal at each initial sound source position, selecting a preset number of target initial sound source positions from the top of the sorted list, setting the selected target initial sound source positions as candidate sound sources, and setting the position information of each target initial sound source position as the position information of the candidate sound source.
  • The embodiment of the present application does not limit the specific value of the preset number, that is, the number of candidate sound sources is not limited.
  • For example, the number of candidate sound sources can be set to be less than or equal to the number of sound channels collecting the sound source data to be processed.
  • Alternatively, step 304 includes: comparing the power of the sound signal at each initial sound source position with a power threshold in turn, selecting the initial sound source positions whose power is greater than or equal to the power threshold, setting the selected positions as candidate sound sources, and setting the position information of each selected position as the position information of the candidate sound source.
  • The power threshold may be preset; it may be determined from the average, mode, or median of the powers of the sound signals at the initial sound source positions; or it may be determined by sorting those powers in descending order and setting the power value at a preset rank among the sorted powers as the power threshold.
  • Alternatively, step 304 includes: determining the maximum power among the powers of the sound signals at the initial sound source positions, calculating the power difference between the power of the sound signal at each initial sound source position and the maximum power, setting the initial sound source positions whose power difference is less than or equal to a preset power difference threshold as candidate sound sources, and setting the position information of those initial sound source positions as the position information of the candidate sound sources.
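  • The three variants of step 304 just described, sketched over an array of per-position powers; the threshold values are illustrative.

```python
import numpy as np

def top_k(powers, k=3):
    """Variant 1: the k initial positions with the highest power."""
    return np.argsort(powers)[::-1][:k].tolist()

def above_threshold(powers, threshold):
    """Variant 2: every position whose power is >= a power threshold."""
    return np.flatnonzero(powers >= threshold).tolist()

def within_delta_of_max(powers, delta):
    """Variant 3: every position within delta of the maximum power."""
    return np.flatnonzero(powers.max() - powers <= delta).tolist()

powers = np.array([0.10, 0.80, 0.75, 0.30])
print(top_k(powers, 2))                  # [1, 2]
print(above_threshold(powers, 0.5))      # [1, 2]
print(within_delta_of_max(powers, 0.1))  # [1, 2]
```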
  • In some embodiments of step 303, for each initial sound source position, the time information for the signal from the initial sound source position to reach each sound channel position can be obtained from the initial sound source position and the sound channel positions, and the power of the sound signal at the initial sound source position is then determined from this time information.
  • This power calculation method includes steps c1 to c3:
  • Step c1: For each initial sound source position, determine the time information for the signal from the initial sound source position to reach each sound channel position based on the distance between the initial sound source position and each sound channel position.
  • The distance between the initial sound source position and each sound channel position can be obtained from the initial sound source position and the positions of the sound channels collecting the sound source data to be processed; from the propagation speed of sound and this distance, the time for the signal from the initial sound source position to reach each sound channel position is obtained.
  • Step c2 Determine the power of the sound signal at each sound channel position based on the time information of the signal at the initial sound source position reaching each sound channel position.
  • the time delay can be estimated based on the time information of the signal from the initial sound source position reaching the position of each sound channel, and the controllable response power function value of each sound channel can be obtained, and the controllable response power function value of each sound channel can be obtained.
  • the power function value is set to the power of the sound signal at each sound channel position.
  • In some embodiments, the controllable response power function value can be obtained by estimating the time delay, through a generalized cross-correlation function based on phase transformation weighting, from the time information of the signal from the initial sound source position reaching each sound channel position. Specifically, determining the power of the sound signal at each sound channel position by the phase-transformation-weighted generalized cross-correlation function includes: (1) for each sound channel position, determining the first time information for the signal from the initial sound source position to arrive at that sound channel position, and the second time information for it to arrive at the next sound source channel position adjacent to that sound channel position; (2) determining the time difference between the first time information and the second time information; (3) determining the power of that sound channel position from the time difference, the sound signal at that sound channel position, and the sound signal at the adjacent next sound source channel position.
  • In some embodiments, the time difference τ_ij(d_n) between the signal from the initial sound source position d_n arriving at the i-th sound channel position and arriving at the adjacent j-th sound channel position (j = i+1) can be obtained by subtracting the first time information from the second time information. After τ_ij(d_n) is obtained, the controllable response power function value R_ij[τ_ij(d_n)] of the sound channel position is obtained from the frequency domain signal X_i(k) at each frequency point k of the sound signal at that sound channel position and the frequency domain signal X_j(k) at each frequency point k of the sound signal at the adjacent next sound source channel position, for example through the phase-transform-weighted cross-correlation R_ij[τ_ij(d_n)] = Σ_{k=1..K} ( X_i(k)·X_j*(k) / |X_i(k)·X_j*(k)| ) · e^{ j2π(k/K)·Fs·τ_ij(d_n) }, and the controllable response power function value is set as the power of the sound signal at that sound channel position, where (·)* represents conjugation, Fs is the sampling frequency of the sound signal in the sound source data to be processed, and K is the number of frequency points of the short-time Fourier transform.
  • Step c3: Determine the power of the sound signal at the initial sound source position from the powers of the sound channel positions. In some embodiments, after the power of each sound channel position is obtained, the power F(d_n) of the sound signal at the initial sound source position is obtained by accumulating the pairwise values, F(d_n) = Σ_{i=1..M-1} R_{i,i+1}[τ_{i,i+1}(d_n)].
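  • A compact sketch of this steered response power computation, assuming the standard GCC-PHAT form written above; the STFT frames `X`, the per-pair delays, and the sampling frequency are hypothetical inputs:

    import numpy as np

    def gcc_phat_value(Xi, Xj, tau, fs):
        """PHAT-weighted cross-correlation of two channel spectra at delay tau."""
        K = Xi.shape[0]
        cross = Xi * np.conj(Xj)
        cross /= np.abs(cross) + 1e-12          # phase transform weighting
        k = np.arange(K)
        return np.real(np.sum(cross * np.exp(2j * np.pi * k * fs * tau / K)))

    def srp_power(X, delays, fs):
        """F(d_n): sum of pairwise GCC-PHAT values over adjacent channel pairs."""
        M = X.shape[0]
        return sum(gcc_phat_value(X[i], X[i + 1], delays[i], fs) for i in range(M - 1))

    # usage with random spectra for a 4-mic array and candidate-direction delays
    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, 512)) + 1j * rng.standard_normal((4, 512))
    print(srp_power(X, delays=[1e-4, 2e-4, 1.5e-4], fs=16000.0))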
  • In some embodiments, considering that the different sound channel positions receive the signal from a given initial sound source position with differing quality, if, in the method above, the power F(d_n) at the initial sound source position is obtained for each sound channel position without regard to the quality of the signal it receives, the accuracy of the subsequent sound source estimation may be reduced; in practical applications, the maximum of the controllable response power function of a pair of sound channel positions can represent the quality of the signal received at that pair of sound channel positions.
  • Based on this, in step c2, the initial power of each sound channel position is obtained from the time difference between the signal from the initial sound source position reaching that sound channel position and reaching the next sound channel position adjacent to it; a power weight for each sound channel position is then obtained from the maximum of the initial power of that position and the initial power of the adjacent next sound channel position; and the power of the sound signal at each sound channel position is obtained from the initial power of that position and its power weight.
  • Specifically, the weight-based power determination method includes: (1) determining the initial power of the sound signal at each sound channel position from the time information of the signal from the initial sound source position reaching each sound channel position; (2) determining, from the initial powers, the target power for every two adjacent sound channel positions, where the target power represents the larger of the initial powers corresponding to the two adjacent positions; (3) for each sound channel position, determining the power weight of that position from its initial power, the initial power of the adjacent next sound channel position, and the target powers; (4) determining the power of that sound channel position from its initial power and its power weight. In some embodiments, the initial power of the sound signal at each sound channel position can be obtained by the phase-transformation-weighted generalized cross-correlation method described above. For each sound channel position, the maximum initial power R^max_ij[τ_ij(d_n)] of the initial power at that position and the initial power at the adjacent next position is set as the target power; after each target power is obtained, the target powers are accumulated to give the target power total of the sound channel positions collecting the sound source data to be processed at the initial sound source position. The maximum initial power of each pair is then normalized by the target power total, ω_i,j = R^max_ij[τ_ij(d_n)] / Σ_{i'} R^max_{i',i'+1}[τ_{i',i'+1}(d_n)], giving the power weight ω_i,j of the sound signal at that sound channel position; the power of the position is ω_i,j·R_ij[τ_ij(d_n)], and the power of the sound signal at the initial sound source position is the sum of these weighted powers.
  • In some embodiments, in step 202, from the position information of the candidate sound sources and the positions of the sound channels collecting the sound source data to be processed, the time information for the signal from each candidate sound source position to reach each sound channel position of the sound source data to be processed can be obtained, and the position guidance information of each candidate sound source is obtained from that time information.
  • the method for determining location guidance information includes steps d1 to d2:
  • Step d1 For each candidate sound source, determine the time information for the signal of the candidate sound source to reach the position of each sound source channel based on the position information of the candidate sound source.
  • Step d2 Obtain the position guidance information of the candidate sound source based on the time information of the signal of the candidate sound source reaching the position of each sound source channel.
  • In some embodiments, step d1 includes: according to the established spatial coordinate system, obtaining the position vector of each sound source channel that collects the sound source data to be processed from the position information of each such channel in the spatial coordinate system and the position information of the coordinate origin of that system.
  • For each candidate sound source, the propagation distance from the position of the candidate sound source to the position of each sound source channel collecting the sound source data to be processed is obtained from the inner product between the direction vector of the candidate sound source's position information and the position vector of each sound source channel; the time information for the signal of the candidate sound source to reach each sound source channel position is then obtained from those propagation distances and the propagation speed of sound.
  • In some embodiments, step d2 includes: inputting the time information of the signal of the candidate sound source reaching each sound source channel position into a preset vector model, for example a steering vector of the form a(τ) = [ e^{-jωτ_1}, e^{-jωτ_2}, ..., e^{-jωτ_M} ]^T, to obtain the position guidance information of the candidate sound source, where ω is a preselected analog angular frequency and τ_m, m = 1, 2, ..., M, is the time information for the signal of the candidate sound source to reach the m-th sound source channel position; the model output depends on the position information of the candidate sound source through these arrival times.
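  • A minimal sketch of such a steering-vector model, assuming the exponential form above and a hypothetical far-field line array; the delays are obtained from the inner product of the direction vector and the channel position vectors as described earlier:

    import numpy as np

    def steering_vector(mic_positions, direction, omega, c=343.0):
        """a(tau): per-channel phase terms e^{-j*omega*tau_m} for a far-field source."""
        direction = direction / np.linalg.norm(direction)
        taus = mic_positions @ direction / c          # propagation delays via inner product
        return np.exp(-1j * omega * taus)

    mics = np.array([[0.0, 0.0], [0.05, 0.0], [0.10, 0.0], [0.15, 0.0]])  # 4-mic line array
    a = steering_vector(mics, np.array([1.0, 1.0]), omega=2 * np.pi * 1000)
    print(a.shape)  # (4,)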
  • In this embodiment of the present application, in the sound source separation processing, the sound source data to be processed is first localized to obtain the number of candidate sound sources and their position information; the position information of each candidate sound source is then used to obtain its position guidance information, and a separation method of overdetermined independent vector analysis combined with the position guidance information is used to separate the sound source data to be processed, separating the sound signal of each candidate sound source from it. The embodiment uses the position guidance information to drive the blind source separation, strengthening the stability of the separated output sound signals and avoiding the situation, possible with an existing separation method based on overdetermined independent vector analysis alone, in which pure noise is output when the sound source position changes.
  • Figure 4 is a schematic flow chart of another sound source separation processing method in the sound signal processing method provided by an embodiment of the present application; the sound source separation processing method shown includes steps 401 to 403:
  • When performing sound source separation with a separation method based on independent vector analysis, the embodiment of the present application establishes an initial separation parameter W_{M×M}(l) from the number M of sound channels of the sound source data to be processed, iteratively updates W_{M×M}(l) to separate the sound signals of M predicted sound sources from the sound source data to be processed, performs redundant-signal detection and elimination on the separated signals, and extracts the sound signals of the candidate sound sources from the sound signals of the M predicted sound sources.
  • In some embodiments, in step 401, the initial separation parameter W_{M×M}(l) can be established from the number M of sound channels of the sound source data to be processed and iterated with an iterative model based on the equivariant adaptive decomposition algorithm; at each iteration the separation signal of that iteration is separated from the sound source data to be processed, and when the number of iterations reaches a preset number the final separation signals, namely the sound signals of M predicted sound sources, are obtained. The iterative model can take the standard equivariant adaptive form W(l+1) = W(l) - α(l)·E{ y·y^T - I + g(y)·y^T - y·g(y)^T }·W(l), where I is the M×M identity matrix, l represents the number of iteration steps, α(l) represents the iteration step size, E represents the expectation, g is a nonlinear function related to the probability density function of the sound signals of the sound source data to be processed, y represents the separated signal obtained at the l-th iteration, and T represents transposition.
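  • As an illustration only, a minimal sketch of one such equivariant adaptive (EASI-style) update loop, assuming g(y) = tanh(y) as the nonlinearity and batch averages in place of the expectation E:

    import numpy as np

    def easi_separate(x, n_iters=100, step=0.005):
        """Iterate W(l+1) = W(l) - step*(y y^T - I + g(y) y^T - y g(y)^T) W(l).

        x: observed mixtures of shape (M, T); returns (W, y) after n_iters steps.
        """
        M, T = x.shape
        W = np.eye(M)                                  # initial separation parameter
        for _ in range(n_iters):
            y = W @ x
            g = np.tanh(y)                             # assumed nonlinearity
            C = (y @ y.T) / T - np.eye(M)              # E{y y^T} - I
            D = (g @ y.T - y @ g.T) / T                # E{g(y) y^T - y g(y)^T}
            W -= step * (C + D) @ W
        return W, W @ x

    rng = np.random.default_rng(1)
    S = np.sign(rng.standard_normal((3, 5000)))        # three binary ±1 sources
    A = rng.standard_normal((3, 3))                    # random mixing matrix
    W, y = easi_separate(A @ S)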
  • In some embodiments, in step 401, the initial separation parameter W_{M×M}(l) can also be established from the number M of sound channels of the sound source data to be processed and iterated with an iterative model based on the natural gradient method, for example W(l+1) = W(l) + α(l)·( I - E{ g(y)·y^T } )·W(l), separating the sound signals of M predicted sound sources from the sound source data to be processed; the parameter meanings of the natural-gradient iterative model are the same as those of the iterative model based on the equivariant adaptive decomposition algorithm and are not repeated here.
  • In some embodiments, in step 401, the time domain signals x = {x_1; x_2; ...; x_M} of the M sound channels of the sound source data to be processed may also be short-time Fourier transformed to obtain the frequency domain signals X(k), k = 1, 2, ..., K, where K is the number of points of the short-time Fourier transform and X(k) = {X_1(k); ...; X_M(k)}; sound source separation is then performed on the frequency domain signal at each frequency point according to a separation method of independent vector analysis based on auxiliary function optimization, separating the sound signals of M predicted sound sources from the sound source data to be processed.
  • In some embodiments, when the number of candidate sound sources is unknown, the separated sound signals of the M predicted sound sources contain S independent components, and the remaining M-S components are copies of one or more of the independent components or zero signals, where the S independent components are the sound signals of the S candidate sound sources. Since the correlation among the S independent components is relatively low, while the M-S components composed of copies of independent components or zero signals are correlated with them, the sound signals of the S candidate sound sources can be extracted from the sound signals of the M predicted sound sources through the cross-correlation coefficients between the sound signals of the predicted sound sources.
  • Specifically, step 402 includes: for each predicted sound source, calculating the autocorrelation coefficient of the sound signal corresponding to that predicted sound source with itself, and the cross-correlation coefficients between that sound signal and the sound signal of every other predicted sound source, obtaining the correlation coefficients of the sound signal corresponding to that predicted sound source; and establishing, from the correlation coefficients of the sound signals of the predicted sound sources, the correlation coefficient matrix corresponding to the sound signals of the predicted sound sources. In the correlation coefficient matrix, the diagonal elements represent the autocorrelation coefficients of the sound signals of the predicted sound sources and are necessarily all 1, while the other elements represent the cross-correlation coefficients between the sound signals of any two predicted sound sources. In step 403, the values of the off-diagonal elements in each column or row of the correlation coefficient matrix are compared with a preset coefficient; if there is a target element whose absolute difference from the preset coefficient is less than or equal to a preset threshold, a redundant signal identical or similar to the sound signal of the predicted sound source corresponding to the diagonal element exists among the predicted sound sources, and the predicted sound source corresponding to the target element is eliminated; by eliminating redundant signals through the correlation coefficient matrix, the multiple predicted sound sources are cleaned, giving the candidate sound sources and the sound signals of the candidate sound sources.
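  • A small sketch of this redundancy screening; the correlation threshold and the energy floor used to detect zero signals are hypothetical choices:

    import numpy as np

    def prune_redundant(y, corr_thresh=0.9, energy_floor=1e-8):
        """Drop predicted sources that copy an earlier one or are (near-)zero signals.

        y: separated signals, shape (M, T). Returns indices of retained sources.
        """
        C = np.corrcoef(y)                     # correlation coefficient matrix, diag = 1
        keep = []
        for i in range(C.shape[0]):
            is_copy = np.any(np.abs(C[i, keep]) >= corr_thresh) if keep else False
            is_zero = np.max(np.abs(y[i])) < energy_floor
            if not is_copy and not is_zero:
                keep.append(i)
        return keep

    rng = np.random.default_rng(2)
    s1, s2 = rng.standard_normal((2, 1000))
    y = np.stack([s1, s2, 0.99 * s1, 1e-9 * rng.standard_normal(1000)])
    print(prune_redundant(y))  # [0, 1]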
  • Considering that the separated sound signals of the candidate sound sources include the sound signal of the target sound source and the sound signals of non-target sound sources, that the sound signal of the target sound source has very few other signals or noise components mixed in and therefore better speech quality, and that the sound signals of non-target sound sources have worse speech quality than that of the target sound source because of aliased noise or other signals, the sound signal of the target sound source can be selected from the separated candidate sound source signals by evaluating the speech quality of the sound signal of each candidate sound source. Therefore, after obtaining the candidate sound sources and their sound signals, in order to further remove noise among the candidate sound sources, the embodiment of the present application evaluates the speech quality of the sound signal of each candidate sound source to obtain an evaluation value for each, screens the candidate sound sources on the basis of the evaluation values, and selects the target sound source from the multiple candidate sound sources.
  • In some embodiments, the evaluation value corresponding to the sound signal of each candidate sound source can be obtained by calculating the kurtosis value of the sound signal of each candidate sound source, where the kurtosis value is used to describe the speech characteristics of the sound signal: the greater the kurtosis value of a sound signal, the higher its speech quality. Specifically, the kurtosis-based sound signal evaluation method includes: (1) performing time-frequency domain conversion on the sound signal of each candidate sound source, obtaining its time domain signal; (2) determining the kurtosis value corresponding to the time domain signal of the sound signal of each candidate sound source and setting the kurtosis value as the evaluation value corresponding to that sound signal; for a time domain signal y_s(l), the kurtosis can take the standard form K(y_s(l)) = E{y_s^4(l)} / ( E{y_s^2(l)} )^2 - 3.
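  • A minimal sketch of kurtosis-based scoring, assuming the standard excess-kurtosis definition given above:

    import numpy as np

    def kurtosis_score(y):
        """Excess kurtosis E{y^4}/(E{y^2})^2 - 3 of a time-domain signal."""
        y = np.asarray(y, dtype=float)
        y = y - y.mean()
        return float(np.mean(y**4) / (np.mean(y**2) ** 2 + 1e-12) - 3.0)

    def pick_target(signals):
        """Return the index of the candidate with the largest kurtosis value."""
        return int(np.argmax([kurtosis_score(y) for y in signals]))

    rng = np.random.default_rng(3)
    speechlike = rng.laplace(size=8000)      # speech-like: heavy-tailed, high kurtosis
    noiselike = rng.standard_normal(8000)    # noise-like: near-zero excess kurtosis
    print(pick_target([noiselike, speechlike]))  # 1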
  • In some embodiments, when the electronic device interacts by voice, a wake-up word score in the sound signal of each candidate sound source can also be determined from the speech characteristics of the sound signal of each candidate sound source, and the wake-up word score set as the evaluation value corresponding to the sound signal of that candidate sound source; that is, the target sound source is selected from the multiple candidate sound sources through the wake-up word scores in the sound signals of the candidate sound sources. The wake-up word score is used to quantify the speech quality in the sound signal of each candidate sound source; in some embodiments, it can be determined from the probability that the sound features corresponding to the sound signal of each candidate sound source are the sound features of the wake-up word.
  • Specifically, the wake-up-word-score-based sound signal evaluation method includes: (1) acquiring the speech feature vector of the sound signal of each candidate sound source; (2) determining the probability score corresponding to that speech feature vector; (3) determining the evaluation value corresponding to the sound signal of each candidate sound source from the probability score, where the probability score represents the probability that the speech feature vector is the speech feature vector corresponding to the wake-up word.
  • In some embodiments, the semantic features of the sound signal of each candidate sound source can be obtained from its speech feature vector, for example by the semantic feature extraction method in step 102; the semantic features are compared with preset semantic features to obtain the similarity between them, and that similarity is set as the probability score corresponding to the speech feature vector of the sound signal of each candidate sound source.
  • In some embodiments, for each candidate sound source, the probability score corresponding to the speech feature vector of its sound signal can be set directly as the evaluation value corresponding to the sound signal of that candidate sound source. Alternatively, the probability score can be compared with a preset probability threshold: if the probability score is greater than the preset probability threshold, the evaluation value corresponding to the sound signal of the candidate sound source is set to a first preset value; if the probability score is less than or equal to the preset probability threshold, the evaluation value is set to a second preset value. For example, the first preset value can be 1 and the second preset value 0; the first preset value can also be 100 and the second preset value 0.
  • In some embodiments, for each candidate sound source, pre-stored evaluation data can be queried with the probability score corresponding to the speech feature vector of its sound signal to determine the probability interval in which the probability score lies and the evaluation score corresponding to that interval, and the evaluation score corresponding to the probability interval is set as the evaluation value corresponding to the sound signal of the candidate sound source. The pre-stored evaluation data includes multiple probability intervals and the evaluation score corresponding to each probability interval.
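  • A minimal sketch of this interval lookup; the interval table `EVAL_TABLE` is a hypothetical stand-in for the pre-stored evaluation data:

    # pre-stored evaluation data: (lower bound, upper bound, evaluation score)
    EVAL_TABLE = [(0.0, 0.3, 10), (0.3, 0.7, 50), (0.7, 1.0, 90)]

    def evaluation_value(prob_score):
        """Map a wake-word probability score to the score of its probability interval."""
        for lo, hi, score in EVAL_TABLE:
            if lo <= prob_score <= hi:
                return score
        raise ValueError("probability score outside all intervals")

    print(evaluation_value(0.82))  # 90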
  • In some embodiments, after the evaluation value corresponding to the sound signal of each candidate sound source is obtained, the candidate sound source corresponding to the maximum evaluation value can be determined from those evaluation values; that candidate sound source is set as the target sound source, and the sound signal of the candidate sound source corresponding to the maximum evaluation value is set as the sound signal of the target sound source.
  • The sound signal processing method provided by the embodiments of the present application determines the final target sound source by evaluating the sound signal of each candidate sound source obtained by blind source separation, alleviating the low stability of blind source separation; by improving the accuracy of the target sound source, it reduces noise and improves the audio-visual effect.
  • Figure 5 is a schematic structural diagram of a sound signal processing device provided by an embodiment of the present application. The sound signal processing device shown includes:
  • the separation module 501 is used to perform sound source separation processing on the sound source data to be processed, and obtain the candidate sound sources corresponding to the sound source data to be processed and the sound signals belonging to each candidate sound source in the sound source data to be processed;
  • the evaluation module 502 is used to evaluate the quality of the sound signal of each candidate sound source and determine the evaluation value of the sound signal of each candidate sound source;
  • the selection module 503 is used to determine the target sound source from multiple candidate sound sources according to the evaluation value corresponding to the sound signal of each candidate sound source;
  • the processing module 504 is used to process the sound signal of the target sound source.
  • the separation module 501 includes:
  • the sound source estimation unit is used to estimate the sound source position of the sound source data to be processed, and determine the candidate sound sources corresponding to the sound source data to be processed and the position information of each candidate sound source;
  • a vector determination unit used to determine the position guidance information of each candidate sound source based on the position of each sound channel that collects the sound source data to be processed and the position information of each candidate sound source;
  • the separation unit is used to perform sound source separation on the sound source data to be processed according to the position guidance information of each candidate sound source, and obtain the sound signal of each candidate sound source.
  • In some embodiments, the separation unit includes: a separation parameter subunit, used to determine the separation parameters based on the position guidance information of each candidate sound source; and a separation subunit, used to perform sound source separation on the sound signals in the sound source data to be processed according to the separation parameters, determining the sound signal of each candidate sound source. In some embodiments, the separation parameter subunit is used to: acquire the historical separation parameters of the historical sound source data and the auxiliary parameters corresponding to the sound source data to be processed; correct the auxiliary parameter matrix according to the position guidance information of each candidate sound source, obtaining corrected auxiliary parameters; and obtain the separation parameters of the sound source data to be processed from the corrected auxiliary parameters and the historical separation parameters.
  • In some embodiments, the sound source estimation unit is used to: determine multiple initial sound source positions according to a preset azimuth angle; for each initial sound source, determine the distance between the position of that initial sound source and the position of each sound channel collecting the sound source data to be processed; determine the power of the sound signal at each initial sound source position from the initial sound source positions and the sound channel positions; and determine the candidate sound sources and the position information of the candidate sound sources from those powers.
  • In some embodiments, the sound source estimation unit is used to: for each initial sound source position, determine the time information for the signal from the initial sound source position to reach each sound channel position based on the distance between the initial sound source position and each sound channel position; determine the power of the sound signal at each sound channel position from that time information; and determine the power of the sound signal at the initial sound source position from the powers at the sound channel positions.
  • In some embodiments, the sound source estimation unit is used to: for each sound channel position, determine the first time information for the signal from the initial sound source position to arrive at that sound channel position and the second time information for it to arrive at the next sound source channel position adjacent to that sound channel position; determine the time difference between the first and second time information; and determine the power of the sound signal at the sound channel position from the time difference and the sound signals at the two channel positions.
  • In some embodiments, the sound source estimation unit is used to: determine the initial power of the sound signal at each sound channel position from the time information of the signal from the initial sound source position reaching each sound channel position; determine the target power among the initial powers corresponding to every two adjacent sound channel positions, the target power representing the larger of the two; for each sound channel position, determine the power weight of that position based on its initial power, the initial power of the adjacent next sound channel position, and the target powers; and determine the power of the sound channel position from its initial power and power weight.
  • In some embodiments, the vector determination unit is used to: for each candidate sound source, determine the time information for the signal of the candidate sound source to reach each sound source channel position based on the position information of the candidate sound source; and obtain the position guidance information of the candidate sound source from that time information.
  • the separation module 501 includes:
  • An initial separation unit is used to separate the sound sources of the sound source data to be processed, and obtain the predicted sound sources corresponding to the sound source data to be processed and the sound signals belonging to each predicted sound source in the sound source data to be processed;
  • a correlation calculation unit is used to calculate the cross-correlation coefficients between the sound signals of each predicted sound source to obtain a correlation coefficient matrix
  • a screening unit is used to determine candidate sound sources and sound signals of the candidate sound sources from each predicted sound source according to the correlation coefficient matrix.
  • In some embodiments, the evaluation module 502 is used to: perform time-frequency domain conversion on the sound signal of each candidate sound source, obtaining its time domain signal; and determine the kurtosis value corresponding to the time domain signal of the sound signal of each candidate sound source, setting the kurtosis value as the evaluation value corresponding to the sound signal of the candidate sound source.
  • In some embodiments, the evaluation module 502 is used to: acquire the speech feature vector of the sound signal of each candidate sound source; determine the probability score corresponding to that speech feature vector, where the probability score represents the probability that the speech feature vector is the speech feature vector corresponding to the wake-up word; and determine the evaluation value corresponding to the sound signal of each candidate sound source from the probability score.
  • In some embodiments, the selection module 503 is used to: determine, from the evaluation values corresponding to the sound signals of the candidate sound sources, the candidate sound source corresponding to the maximum evaluation value, and set it as the target sound source. The sound signal processing device determines the final target sound source by evaluating the sound signal of each candidate sound source obtained by blind source separation, alleviating the low stability of blind source separation; by improving the accuracy of the target sound source, it reduces noise and improves the audio-visual effect.
  • The embodiment of the present application also provides an electronic device. The electronic device can include a radio frequency (RF) circuit 601, a memory 602 including one or more computer-readable storage media, an input unit 603, a display unit 604, a sensor 605, an audio circuit 606, a wireless fidelity (WiFi) module 607, a processor 608 with one or more processing cores, a power supply 609, and other components.
  • The RF circuit 601 can be used to receive and send information, or to receive and send signals during a call; in particular, after receiving downlink information from a base station, it hands the information to the one or more processors 608 for processing, and it sends uplink data to the base station. The RF circuit 601 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and so on.
  • RF circuit 601 can also communicate with networks and other devices through wireless communications.
  • Wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, and Short Messaging Service (SMS).
  • the memory 602 can be used to store software programs and modules, and the processor 608 executes various functional applications and data processing by running the software programs and modules stored in the memory 602 .
  • The memory 602 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, a computer program required for at least one function (such as a sound playback function or an image playback function), and the like, and the data storage area may store data created according to the use of the electronic device (such as audio data and phone books) and the like. In addition, the memory 602 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 608 and the input unit 603 with access to the memory 602.
  • the input unit 603 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal input related to user settings and function control.
  • the input unit 603 may include a touch-sensitive surface as well as other input devices.
  • The touch-sensitive surface, also known as a touch display screen or touchpad, can collect the user's touch operations on or near it (such as operations performed by the user on or near the touch-sensitive surface using a finger, a stylus, or any suitable object or accessory) and drive the corresponding connection device according to a preset program.
  • the touch-sensitive surface may include two parts: a touch detection device and a touch controller.
  • The touch detection device detects the user's touch orientation and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact point coordinates, sends them to the processor 608, and can receive and execute commands sent by the processor 608.
  • touch-sensitive surfaces can be implemented using a variety of types including resistive, capacitive, infrared, and surface acoustic waves.
  • the input unit 603 may also include other input devices. Specifically, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), trackball, mouse, joystick, etc.
  • the display unit 604 may be used to display information input by the user or information provided to the user as well as various graphical user interfaces of the electronic device. These graphical user interfaces may be composed of graphics, text, icons, videos, and any combination thereof.
  • the display unit 604 may include a display panel, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
  • The touch-sensitive surface can cover the display panel; when the touch-sensitive surface detects a touch operation on or near it, the operation is transmitted to the processor 608 to determine the type of touch event, and the processor 608 then provides the corresponding visual output on the display panel according to the type of touch event. Although the touch-sensitive surface and the display panel may be used as two independent components to implement the input and output functions, in some embodiments the touch-sensitive surface and the display panel can be integrated to implement the input and output functions.
  • the electronic device may also include at least one sensor 605, such as a light sensor, motion sensor, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor.
  • the ambient light sensor may adjust the brightness of the display panel according to the brightness of the ambient light.
  • The proximity sensor may turn off the display panel and/or the backlight when the electronic device is moved to the ear. As a kind of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in various directions (usually three axes) and can detect the magnitude and direction of gravity when stationary.
  • the audio circuit 606, speaker, and microphone can provide an audio interface between the user and the electronic device.
  • On the one hand, the audio circuit 606 can transmit the electrical signal converted from the received audio data to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts the collected sound signal into an electrical signal, which is received by the audio circuit 606 and converted into audio data; after the audio data is processed by the audio data output processor 608, it is sent, for example, to another electronic device via the RF circuit 601, or output to the memory 602 for further processing.
  • Audio circuitry 606 may also include an earphone jack to provide communication between peripheral earphones and electronic devices.
  • WiFi is a short-distance wireless transmission technology. Electronic devices can help users send and receive emails, browse web pages, and access streaming media through the WiFi module 607. It provides users with wireless broadband Internet access.
  • Although FIG. 6 shows the WiFi module 607, it can be understood that it is not a necessary component of the electronic device and can be omitted as needed without changing the essence of the invention.
  • The processor 608 is the control center of the electronic device; it connects the various parts of the entire device using various interfaces and lines, and performs the various functions of the electronic device and processes data by running or executing the software programs and/or modules stored in the memory 602 and calling the data stored in the memory 602, thereby monitoring the device as a whole. Optionally, the processor 608 may include one or more processing cores; preferably, the processor 608 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, computer programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the above modem processor may also not be integrated into the processor 608.
  • The electronic device also includes a power supply 609 (such as a battery) that supplies power to the various components. Preferably, the power supply can be logically connected to the processor 608 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
  • Power supply 609 may also include one or more DC or AC power supplies, recharging systems, power failure detection circuits, power converters or inverters, power status indicators, and other arbitrary components.
  • the electronic device may also include a camera, a Bluetooth module, etc., which will not be described again here.
  • The processor 608 in the electronic device loads the executable files corresponding to the processes of one or more computer programs into the memory 602 according to instructions, and runs the computer programs stored in the memory 602 to perform the steps of the sound signal processing method described above. The embodiments of the present application also provide a storage medium in which multiple instructions are stored; the multiple instructions can be loaded by a processor to execute the steps in any sound signal processing method provided by the embodiments of the present application, namely: performing sound source separation processing on the sound source data to be processed, obtaining the candidate sound sources and the sound signal of each candidate sound source; evaluating the quality of the sound signal of each candidate sound source and determining its evaluation value; determining the target sound source from the multiple candidate sound sources according to the evaluation values; and processing the sound signal of the target sound source.
  • The storage medium may include a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and the like.


Abstract

A sound signal processing method, apparatus, electronic device, and storage medium. The method includes: performing sound source separation processing on sound source data to be processed, obtaining the sound signal of each candidate sound source (101); performing quality evaluation on the sound signals and determining evaluation values (102); obtaining a target sound source according to the evaluation value of the sound signal of each candidate sound source (103); and processing the sound signal of the target sound source (104). In this way, quality evaluation improves the accuracy of the separated sound source signals and the stability of signal separation.

Description

Sound signal processing method, apparatus, electronic device and storage medium

Technical Field

The present application relates to the technical field of signal processing, and in particular to a sound signal processing method, apparatus, electronic device, and storage medium.
Background Art

When an electronic device collects sound signals through multiple microphone channels, the collected signals often contain interference signals that do not belong to the sound source, such as environmental noise, correlated interfering source signals, and reflected signals in the environment. Because of the many complexities of signal transmission, the interference signals are mixed with the sound source signal, making the sound source signal difficult to extract. Blind source separation is an effective way of solving this problem; its purpose is to extract the sound source signals from complex mixed signals.
Technical Problem

Although existing blind source separation methods can separate sound source signals from complex mixed sound signals, they cannot discriminate whether the multiple separated sound source signals are valid or whether their quality meets requirements, so the accuracy of the separated sound source signals is not high and the stability of blind source separation is not high.
Technical Solution

The embodiments of the present application provide a sound signal processing method, apparatus, electronic device, and storage medium that can improve the stability of signal separation.
An embodiment of the present application provides a sound signal processing method, including:
performing sound source separation processing on sound source data to be processed, obtaining candidate sound sources corresponding to the sound source data to be processed and the sound signals in the sound source data to be processed belonging to each candidate sound source;
performing quality evaluation on the sound signal of each candidate sound source and determining an evaluation value of the sound signal of each candidate sound source;
determining a target sound source from the multiple candidate sound sources according to the evaluation values of the sound signals of the candidate sound sources;
processing the sound signal of the target sound source.
Correspondingly, an embodiment of the present application further provides a sound signal processing apparatus, including:
a separation module, used to perform sound source separation processing on sound source data to be processed, obtaining candidate sound sources corresponding to the sound source data to be processed and the sound signals in the sound source data to be processed belonging to each candidate sound source;
an evaluation module, used to perform quality evaluation on the sound signal of each candidate sound source and determine an evaluation value of the sound signal of each candidate sound source;
a selection module, used to determine a target sound source from the multiple candidate sound sources according to the evaluation values corresponding to the sound signals of the candidate sound sources;
a processing module, used to process the sound signal of the target sound source.
Correspondingly, an embodiment of the present application further provides an electronic device, including a memory and a processor; the memory stores a computer program, and the processor is used to run the computer program in the memory to perform the operations in the sound signal processing method described above.
In addition, an embodiment of the present application further provides a storage medium storing multiple instructions, the instructions being suitable for loading by a processor to execute the steps in the sound signal processing method described above.
Beneficial Effects

The embodiments of the present application add a signal quality evaluation mechanism to signal separation. Sound source separation processing is performed on sound source data to be processed, obtaining the candidate sound sources corresponding to the sound source data to be processed and the sound signals belonging to each candidate sound source; quality evaluation is performed on the sound signal of each candidate sound source, determining an evaluation value for each; a target sound source is determined from the multiple candidate sound sources according to the evaluation values; and the sound signal of the target sound source is processed. In this way, by adding a signal quality evaluation mechanism to signal separation and evaluating the sound signal quality of each candidate sound source according to its evaluation value, a valid target sound source is selected, the accuracy of the separated sound source signals is improved, and the problem of low stability of signal separation is alleviated.
Brief Description of the Drawings

In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Figure 1 is a schematic flow chart of a sound signal processing method provided by an embodiment of the present application;
Figure 2 is a schematic flow chart of one kind of sound source separation processing in the sound signal processing method provided by an embodiment of the present application;
Figure 3 is a schematic flow chart of a method for estimating candidate sound sources in the sound signal processing method provided by an embodiment of the present application;
Figure 4 is a schematic flow chart of another kind of sound source separation processing in the sound signal processing method provided by an embodiment of the present application;
Figure 5 is a schematic structural diagram of a sound signal processing apparatus provided by an embodiment of the present application;
Figure 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Embodiments of the Invention

The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those skilled in the art without creative effort fall within the scope of protection of the present application.
As stated in the background art, when the prior art performs blind source separation through AuxICA (Auxiliary Function Based Independent Component Analysis) or AuxIVA (Auxiliary Function Based Independent Vector Analysis), it does not evaluate or screen the sound quality of the separated target sound sources, so the accuracy of the target sound source cannot be guaranteed, which in turn reduces the quality of the separated sound signals, and the stability of blind source separation is not high.
On this basis, in order to improve the stability of the blind source separation output and achieve a noise reduction effect, the embodiments of the present application provide a sound signal processing method that determines the final target sound source through the evaluation values of the candidate sound sources, improving the accuracy of the separated sound source signals, alleviating the low stability of signal separation, and improving the audio-visual effect.
Please refer to Figure 1, which is a schematic flow chart of the sound signal processing method provided by an embodiment of the present application. The sound signal processing method can be applied in an electronic device. In some embodiments of the present application, the electronic device can be a mobile terminal, such as a mobile phone, tablet computer, computer, or television; in some embodiments, the electronic device can also be a voice device, such as a Bluetooth speaker, smart speaker, microphone, or smart home device.
The sound signal processing method shown includes steps 101 to 104:
101: Perform sound source separation processing on the sound source data to be processed, obtaining the candidate sound sources in the sound source data to be processed and the sound signals in the sound source data to be processed belonging to each candidate sound source.
Here, the sound source data to be processed refers to the speech signals in the current environment collected by the electronic device, including the speech signals of sound sources and the noise present in the environment. A candidate sound source refers to a sound source that may exist in the current environment as estimated from the sound source data to be processed, and includes the target sound source.
In some embodiments of the present application, the sound source data to be processed can be speech signals in the current environment collected in real time, or speech signals in the current environment collected within a preset time period.
A microphone array is provided in the electronic device, and the electronic device collects the speech signals in its current environment through the microphone array. The distance from a sound source to each microphone channel in the array differs, yet every microphone channel may receive the sound signal of that source, and room reverberation, interference from other sound sources, environmental noise, and noise inside the device inevitably reduce the quality and intelligibility of the speech signal. Current speech recognition technology cannot fully match human hearing in sensitivity and robustness, distinguishing various sound sources and rejecting interference, so such interference leaves noise in the sound source data to be processed. If the sound source data to be processed were output directly, the audio-visual effect would suffer and the performance of electronic devices that interact by voice would degrade; therefore sound source separation processing must be performed on the sound source data to be processed to determine the sound sources in the environment. The microphone array can be a circular microphone array, a linear microphone array, or a distributed microphone array, and includes at least one microphone channel.
In some embodiments of the present application, there are multiple ways of performing sound source separation processing on the sound source data to be processed, examples of which include:
(1) Features of different sound sources can be separated from the sound source data to be processed by a sound source separation method based on deep neural networks, and the candidate sound sources and the sound signal of each candidate sound source obtained from the separated features. Sound source separation methods based on deep neural networks include, but are not limited to, methods based on deep clustering, methods based on permutation invariant training, and end-to-end methods.
(2) Blind source separation can be performed on the sound source data to be processed by a separation method based on independent subspace analysis, obtaining the candidate sound sources and the sound signal of each candidate sound source.
(3) Blind source separation can be performed on the sound source data to be processed by a separation method based on non-negative matrix factorization, obtaining the candidate sound sources and the sound signal of each candidate sound source.
(4) Blind source separation can be performed on the sound source data to be processed by a clustering-based separation method, obtaining the candidate sound sources and the sound signal of each candidate sound source; for example, the sound source data to be processed can be clustered with a Gaussian mixture model.
(5) Blind source separation can be performed on the sound source data to be processed by principal component analysis, obtaining the candidate sound sources and the sound signal of each candidate sound source.
(6) Based on a separation method of independent component analysis, blind source separation can be performed on the sound source data to be processed by analyzing the mutually independent statistical characteristics of the signals in it, obtaining the candidate sound sources and the sound signal of each candidate sound source.
(7) Blind source separation can be performed on the sound source data to be processed by a separation method based on independent vector analysis, obtaining the candidate sound sources and the sound signal of each candidate sound source.
(8) Blind source separation can be performed on the sound source data to be processed by a separation method of independent vector analysis based on auxiliary function optimization, obtaining the candidate sound sources and the sound signal of each candidate sound source.
It should be noted that the above sound source separation processing methods are only illustrative and do not limit the sound signal processing method provided by the embodiments of the present application; for example, sound source separation processing can also be performed on the sound source data to be processed by a separation method of overdetermined independent vector analysis based on auxiliary function optimization, obtaining the candidate sound sources in the sound source data to be processed and the sound signal of each candidate sound source.
102: Perform quality evaluation on the sound signal of each candidate sound source and determine the evaluation value of the sound signal of each candidate sound source.
The evaluation value characterizes the sound quality of the sound signal of each candidate sound source and is used to quantify the probability that each candidate sound source is the target sound source.
In some embodiments of the present application, there are multiple ways of evaluating the quality of the sound signal of each candidate sound source, examples of which include:
(1) The quality of the sound signal of each candidate sound source can be evaluated by calculating its signal-to-interference ratio, determining the evaluation value of the sound signal of each candidate sound source.
(2) The quality can be evaluated by calculating the signal-to-distortion ratio of the sound signal of each candidate sound source.
(3) The quality can be evaluated by calculating the maximum likelihood ratio of the sound signal of each candidate sound source.
(4) The quality can be evaluated by calculating the cepstral clustering of the sound signal of each candidate sound source.
(5) The quality can be evaluated by calculating the frequency-weighted segmental signal-to-noise ratio of the sound signal of each candidate sound source.
(6) The quality can be evaluated by calculating the perceptual evaluation of speech quality score of the sound signal of each candidate sound source.
(7) The quality can be evaluated by calculating the kurtosis value of the sound signal of each candidate sound source.
(8) The quality can be evaluated by calculating the probability score corresponding to the speech feature vector of the sound signal of each candidate sound source, where the probability score characterizes the probability that the sound signal of each candidate sound source is the speech signal of the target sound source.
It should be noted that the above methods of evaluating the quality of the sound signal of each candidate sound source are only illustrative and do not limit the sound signal processing method provided by the embodiments of the present application. In practical applications, the corresponding evaluation value determination method can be selected according to the computing capacity of the electronic device in the actual application scenario.
103: Determine the target sound source from the multiple candidate sound sources according to the evaluation value of the sound signal of each candidate sound source.
In some embodiments of the present application, step 103 includes: selecting, according to the evaluation values of the sound signals of the candidate sound sources, the candidate sound source corresponding to the maximum evaluation value, and setting the selected candidate sound source as the target sound source.
In some embodiments of the present application, step 103 includes: obtaining statistical features of the evaluation values from the evaluation values of the sound signals of the candidate sound sources and determining the target sound source from the multiple candidate sound sources according to those statistical features, where the statistical features include the median or mode of the evaluation values. Specifically, the median or mode of the evaluation values is obtained from the evaluation values of the sound signals of the candidate sound sources, and candidate sound sources whose evaluation values are greater than the median, or greater than the mode, are selected from the multiple candidate sound sources and set as target sound sources.
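A minimal sketch of this statistics-based screening, using the median variant; the evaluation values are hypothetical:

    import numpy as np

    def select_by_median(eval_values):
        """Return indices of candidate sources whose evaluation value exceeds the median."""
        v = np.asarray(eval_values, dtype=float)
        return list(np.flatnonzero(v > np.median(v)))

    print(select_by_median([0.2, 0.9, 0.4, 0.8]))  # [1, 3]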
104: Process the sound signal of the target sound source.
Processing the sound signal of the target sound source includes, but is not limited to, speech output, speech recognition, speech transmission, and speech storage.
In some embodiments of the present application, when the electronic device interacts by voice, step 104 includes: acquiring the sound signal corresponding to the target sound source, performing semantic analysis on it to obtain the speech information in the sound signal, and having the electronic device respond to the instruction corresponding to that speech information and perform the corresponding operation, such as dialogue interaction, query operations, or music playback operations.
In some embodiments of the present application, when the electronic device is a voice collection device, step 104 includes: acquiring the sound signal corresponding to the target sound source and storing the speech signal, as in a sound pickup device; the speech signal can also be transmitted to a server communicatively connected with the electronic device.
The embodiment of the present application provides a sound signal processing method that determines the final target sound source through the evaluation values of the candidate sound sources, improving the accuracy of the separated sound source signals and alleviating the low stability of signal separation.
Considering that existing blind source separation methods cannot use the position information of sound sources, when a sound source's position changes they cannot accurately detect the change, so the target sound source they separate carries a degree of uncertainty, and the sound signal of the separated target sound source is unstable. Therefore, in order to further improve the certainty of the blind source separation output, reduce the noise in the output speech signal, and achieve noise reduction, in some embodiments of the present application, in the sound source separation processing of step 101, sound source estimation is performed on the sound source data to be processed to obtain sound source prior information, and sound source separation processing is performed on the basis of that prior information, improving the accuracy of the candidate sound sources obtained by blind source separation and thereby guaranteeing the accuracy of the final target sound source. The sound source prior information refers to the position information of the candidate sound sources that may exist in the environment of the electronic device collecting the sound source data to be processed. The position information of a candidate sound source can be its spatial coordinate values, or its pitch angle and azimuth angle in the space of the environment of the electronic device collecting the sound source data to be processed. It should be noted that the embodiments of the present application do not limit the way the spatial coordinate system is established; for example, it can be established with the geometric center of the electronic device as the origin.
As shown in Figure 2, which is a schematic flow chart of one kind of sound source separation processing in the sound signal processing method provided by an embodiment of the present application, the sound source separation processing method shown includes steps 201 to 203:
201: Perform sound source position estimation on the sound source data to be processed, determining the candidate sound sources corresponding to the sound source data to be processed and the position information of each candidate sound source.
In some embodiments of the present application, sound source estimation can be performed on the sound source data to be processed by the SRP (Steered Response Power) method, obtaining the candidate sound sources corresponding to the sound source data to be processed and the position information of each candidate sound source. Specifically, this includes: estimating the spatial power spectrum distribution of the sound source data to be processed by the SRP method, and determining the candidate sound sources and the position information of each candidate sound source from the power spectrum distribution.
In some embodiments of the present application, the position of maximum power can be determined from the power spectrum distribution and set as the position information of a candidate sound source. In some embodiments, multiple positions whose power is greater than or equal to a preset power value can also be determined from the power spectrum distribution and set as the position information of candidate sound sources. The preset power value can be the average power value in the power spectrum distribution, or the powers of the spatial positions can be sorted in descending order according to the power spectrum distribution and the power ranked S-th in the sorted list set as the preset power value, where S is an integer greater than 0 whose value can be set according to the actual application scenario, for example 2, 3, 4, or 5.
In some embodiments of the present application, considering that when the power spectrum of the sound source data to be processed is estimated by the SRP method the high-frequency part is prone to aliasing, which affects the accuracy of the estimated position information of the candidate sound sources, the embodiments of the present application, when estimating the power spectrum by SRP, first perform sound source estimation on the low-frequency part of the sound source data to be processed to obtain the estimated regions in which candidate sound sources lie, then perform sound source estimation on the high-frequency part by SRP, selecting the candidate sound sources corresponding to the sound source data to be processed and the position information of each candidate sound source from the estimated regions. Specifically, the sound source position estimation includes steps a1 to a4:
Step a1: Perform frequency domain conversion on the sound source data to be processed, obtaining the frequency domain signal of the sound source data to be processed.
Step a2: Filter the frequency domain signal of the sound source data to be processed with a filter, obtaining the low-frequency signal and the high-frequency signal of the frequency signal. The filter can be a low-pass filter or a high-pass filter.
Step a3: From the low-frequency signal of the frequency signal, perform time delay estimation on the low-frequency signal of each microphone channel of the microphone array provided in the electronic device by the SRP method, obtaining the controllable response power function value of the microphone array in each preset region; select the preset regions whose controllable response power function values are greater than or equal to a preset function value threshold, and set the selected preset regions as the estimated regions in which candidate sound sources lie.
Step a4: From the high-frequency signal of the frequency signal, perform time delay estimation on the high-frequency signal of each microphone channel by the SRP method, obtaining the controllable response power function value of the microphone array in each estimated region; select the estimated regions whose controllable response power function values are greater than or equal to the preset function value threshold, set each selected estimated region as the position of a candidate sound source, and set the position information of each selected estimated region as the position information of a candidate sound source.
In some embodiments of the present application, steps a3 to a4 include: dividing the spatial coordinate system into multiple first grid regions, each first grid region corresponding to position information consisting of a pitch angle and an azimuth angle; performing time delay estimation on the low-frequency signal of each microphone channel of the microphone array by the SRP method, obtaining the first controllable response power function value of the microphone array in each first grid region; selecting the first grid region with the largest first controllable response power function value and setting it as the estimated region in which a candidate sound source lies; dividing the estimated region into multiple second grid regions, each corresponding to position information consisting of a pitch angle and an azimuth angle, with the angular difference between every two adjacent second grid regions smaller than that between every two adjacent first grid regions; performing time delay estimation on the high-frequency signal of each microphone channel by the SRP method, obtaining the second controllable response power function value of the microphone array in each second grid region; and, from these values, selecting the second grid regions whose second controllable response power function values are greater than or equal to the preset function value threshold, setting each selected second grid region as the position of a candidate sound source and the position information of each selected second grid region as the position information of that candidate sound source.
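A minimal sketch of this coarse-to-fine grid search; the SRP evaluation is abstracted as a callable srp_value(band, azimuth, elevation), and the grid resolutions, thresholds, and toy peak are hypothetical:

    import numpy as np

    def coarse_to_fine_search(srp_value, coarse_step=20, fine_step=5, threshold=0.8):
        """Locate candidates: coarse scan on the low band, refine on the high band."""
        # coarse scan of the whole space with the low-frequency band
        coarse = [(az, el, srp_value("low", az, el))
                  for az in range(0, 360, coarse_step)
                  for el in range(-60, 61, coarse_step)]
        az0, el0, _ = max(coarse, key=lambda t: t[2])   # region with the largest value
        # fine scan of the selected region with the high-frequency band
        return [(az, el)
                for az in range(az0 - coarse_step, az0 + coarse_step + 1, fine_step)
                for el in range(el0 - coarse_step, el0 + coarse_step + 1, fine_step)
                if srp_value("high", az, el) >= threshold]

    # usage with a toy SRP function peaking at azimuth 90, elevation 0
    toy = lambda band, az, el: float(np.exp(-((az - 90) ** 2 + el ** 2) / 800.0))
    print(coarse_to_fine_search(toy)[:3])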
202: Determine the position guidance information of each candidate sound source from the positions of the sound channels collecting the sound source data to be processed and the position information of the candidate sound sources.
Here, the position of a sound channel collecting the sound source data to be processed refers to the position information, in the preset spatial coordinate system, of each microphone channel of the microphone array provided in the electronic device that collects the sound source data to be processed.
The position guidance information is used to determine the position vector between the position information of each candidate sound source and the position of each sound channel collecting the sound source data to be processed.
In some embodiments of the present application, since existing blind source separation methods do not consider the sound source position, noise remains in the separated sound signals. Therefore, in order to improve the stability of blind source separation, the position information of each candidate sound source obtained by sound source estimation is used to determine the position vector between the position information of each candidate sound source and the position of each sound channel of the sound source data to be processed. In the blind source separation, the sound source data to be processed is separated using these position vectors together with the sound signals at the sound channel positions collecting the sound source data to be processed, obtaining the sound signal component of each candidate sound source in the sound signal at each sound channel position, and the sound signal of each candidate sound source is obtained by aggregating the sound signal components at each candidate sound source position in the sound signals at the sound channel positions.
In some embodiments of the present application, from the sound channel positions collecting the sound source data to be processed and the position information of the candidate sound sources, the distance between the position information of each candidate sound source and each sound channel position can be obtained; from those distances and the positions, the angle information θ between the position information of each candidate sound source and each sound channel position is obtained, and the position guidance information of each candidate sound source is obtained from that angle information, for example through a steering vector whose M elements are phase terms determined by θ, where θ represents the angle information between the position information of the candidate sound source and each sound channel position and M is the number of sound channels collecting the sound source data to be processed.
In some embodiments of the present application, from the sound channel positions collecting the sound source data to be processed and the position information of the candidate sound sources, the distance between the position information of each candidate sound source and each sound channel position can be obtained; from those distances, the time information τ needed for the sound signal to travel from the position information of each candidate sound source to each sound channel position is obtained, and the position guidance information of each candidate sound source is obtained from that time information, for example through elements of the form e^{-jωτ}, where τ characterizes the time information needed for the sound signal to travel from the position information of each candidate sound source to each sound channel position and j is the imaginary unit.
203: Perform sound source separation on the sound source data to be processed according to the position guidance information of each candidate sound source, obtaining the sound signal of each candidate sound source.
In some embodiments of the present application, sound source separation can be performed on the sound source data to be processed according to the position guidance information of each candidate sound source following the separation methods in step 101, obtaining the sound signal of each candidate sound source. The separation method of overdetermined independent vector analysis based on auxiliary function optimization is taken as an example below.
In sound source separation of the sound source data to be processed by the separation method of overdetermined independent vector analysis based on auxiliary function optimization, the received sound source data is assumed to consist of the mixed signals x_1, x_2, ..., x_M received by M sound channels after the N transmitting-end source signals S_1, S_2, ..., S_N in the environment are mixed through transfer functions h_mn (that is, each channel observes a transfer-function mixture of the sources, x_m = Σ_n h_mn * s_n), where the N transmitting-end source signals are the N sound source signals in the environment. The sound source data is converted to the frequency domain by short-time Fourier transform, giving the frequency domain signal X(l,k) = H(k)S(l,k), l = 1, ..., L, where L is the number of frames of the short-time Fourier transform, S(l,k) = [S_1(l,k), ..., S_N(l,k)]^T is the transmitting source signal at frequency point k, X(l,k) = [X_1(l,k), ..., X_M(l,k)]^T is the frequency domain signal of the received mixed signals, and H(k) is the mixing matrix. From X(l,k) = H(k)S(l,k), the separated signal at each frequency point can be expressed as Y(l,k) = W(k)X(l,k), where Y(l,k) = [Y_1(l,k), ..., Y_N(l,k)]^T is the separated signal at frequency point k, approximating the N transmitting-end source signals, and W(k) is the separation parameter at frequency point k. The separation parameter of each frame of the frequency domain signal at each frequency point is solved by auxiliary function optimization; from the separation parameters and the frequency signals of each frame at each frequency point, the frequency domain signals of the candidate sound sources are separated from those of the sound source data to be processed, and the sound signals of the candidate sound sources are obtained by inverse short-time Fourier transform.
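A minimal sketch of the per-frequency demixing step Y(l,k) = W(k)X(l,k); the separation parameters W are assumed given (for example from the auxiliary-function updates described below), and all shapes are hypothetical:

    import numpy as np

    def demix(X, W):
        """Apply per-frequency separation parameters: Y(l,k) = W(k) X(l,k).

        X: mixtures, shape (L, K, M); W: shape (K, N, M); returns Y of shape (L, K, N).
        """
        return np.einsum("knm,lkm->lkn", W, X)

    L, K, M, N = 100, 257, 4, 2
    rng = np.random.default_rng(4)
    X = rng.standard_normal((L, K, M)) + 1j * rng.standard_normal((L, K, M))
    W = rng.standard_normal((K, N, M)) + 1j * rng.standard_normal((K, N, M))
    print(demix(X, W).shape)  # (100, 257, 2)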
In some embodiments of the present application, in order to improve the reliability of the separated candidate sound source signals and reduce the noise in them, in the sound source separation by overdetermined independent vector analysis based on auxiliary function optimization, the separation parameters of each frame of the frequency domain signal at each frequency point are solved by auxiliary function optimization combined with the guidance information, and the sound signals of the candidate sound sources are separated from the frequency signals of each frame at each frequency point. Specifically, the separation method includes steps b1 to b2:
Step b1: Determine the separation parameters of the sound source data to be processed according to the position guidance information of each candidate sound source.
The sound source data to be processed here can be the current frame of the frequency domain signal of the sound source data.
In some embodiments of the present application, step b1 includes: after the auxiliary parameters corresponding to the sound source data to be processed are obtained, correcting the auxiliary parameters corresponding to the sound source data to be processed according to the position guidance information of each candidate sound source, obtaining corrected auxiliary parameters, and solving for the separation parameters of the sound source data to be processed by optimization based on the corrected auxiliary parameters. The auxiliary parameters include the auxiliary parameters of each frame of the frequency domain signal of the sound source data to be processed at each frequency point. Specifically, the method of determining the separation parameters of the sound source data to be processed from the position guidance information includes:
(1) Acquire the historical separation parameters of the historical sound source data and the auxiliary parameters corresponding to the sound source data to be processed.
(2) Correct the auxiliary parameters according to the position guidance information of each candidate sound source, obtaining the corrected auxiliary parameters.
(3) Obtain the separation parameters of the sound source data to be processed from the corrected auxiliary parameters and the historical separation parameters.
The historical separation parameters of the historical sound source data refer to the separation parameters of the frame of sound source data preceding the sound source data to be processed. Since the sound signal of a sound source is correlated in time, the separation method of overdetermined independent vector analysis based on auxiliary function optimization separates the sound source data to be processed by alternately updating the separation parameters and the auxiliary parameters: the auxiliary parameters are updated from the frequency domain signal of the sound source data to be processed and the separation parameters of the previous frame, and the separation parameters are updated from the auxiliary parameters of the frequency domain signal of the sound source data to be processed and the separation parameters of the previous frame.
In some embodiments of the present application, the step of acquiring the auxiliary parameters corresponding to the sound source data to be processed includes: acquiring the historical separation parameters of the historical sound source data and the historical auxiliary parameters; obtaining the energy of each candidate sound source output for the previous frame of sound source data from the historical separation parameters and the sound signal of the sound source data to be processed; and obtaining the auxiliary parameters corresponding to the sound source data to be processed from the historical auxiliary parameters, the sound signal of the sound source data to be processed, and that energy. The historical auxiliary parameters refer to the auxiliary parameters of the previous frame of sound source data. For example, when the number S of separated candidate sound sources is less than or equal to the number M of sound channels of the sound source data to be processed, the auxiliary parameters V(l,k) = [V_1(l,k), V_2(l,k), ..., V_s(l,k), ..., V_S(l,k)] of the sound source data to be processed can be obtained through a recursive update of the form V_s(l,k) = α·V_s(l-1,k) + (1-α)·φ(r_s(l))·X(l,k)X^H(l,k), where α is a forgetting factor with α ∈ [0,1], l is the frame number of the sound source data to be processed, V_s(l-1,k) is the auxiliary parameter at frequency point k in the previous frame of sound source data (the historical auxiliary parameter), r_s(l) is the output energy of each candidate sound source obtained from W_s(l-1,k) and the frequency domain signal, W_s(l-1,k) is the separation parameter at frequency point k in the previous frame (the historical separation parameter), φ(·) is a weighting function of the output energy, and (·)^H represents the conjugate transpose; the auxiliary parameter V_s(1,k) of the first frame of processed sound source data is a preset matrix whose diagonal elements are 1 and whose other elements are zero.
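A minimal sketch of this recursive auxiliary-parameter update at a single frequency point, assuming the weighted-covariance form above with the weighting φ(r) = 1/r (a common auxiliary-function choice); shapes and values are hypothetical:

    import numpy as np

    def update_aux(V_prev, x, r, alpha=0.95):
        """One-frame update at a single frequency point.

        V_prev: (S, M, M) previous auxiliary parameters; x: (M,) current observation;
        r: (S,) per-source output energies (computed across all frequency points).
        """
        xxh = np.outer(x, np.conj(x))
        phi = 1.0 / (r + 1e-12)                        # assumed weighting phi(r) = 1/r
        return alpha * V_prev + (1 - alpha) * phi[:, None, None] * xxh

    S, M = 2, 4
    rng = np.random.default_rng(5)
    V0 = np.tile(np.eye(M, dtype=complex), (S, 1, 1))  # first-frame auxiliary parameter
    x = rng.standard_normal(M) + 1j * rng.standard_normal(M)
    print(update_aux(V0, x, r=np.array([1.3, 0.7])).shape)  # (2, 4, 4)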
In some embodiments of the present application, correcting the auxiliary parameter matrix according to the position guidance information of each candidate sound source includes: calculating, from the position guidance information of each candidate sound source, a correction parameter β corresponding to each candidate sound source auxiliary parameter V_s(l,k), where β is obtained from the position guidance information and a preset constant λ_s; correcting each V_s(l,k) with the correction parameter, obtaining each corrected candidate sound source auxiliary parameter D_s(l,k) = V_s(l,k) + β; and collecting the corrected auxiliary parameters into D(l,k) = [D_1(l,k), D_2(l,k), ..., D_s(l,k), ..., D_S(l,k)].
In some embodiments of the present application, the separation parameter W(l,k) of the sound source data to be processed can be obtained from the corrected auxiliary parameters and the historical separation parameters through D(l,k)W(l-1,k).
In some embodiments of the present application, from the corrected auxiliary parameters and the historical separation parameters, the s-th intermediate parameter P_s(l,k) can be obtained through (W(l-1,k)D_s(l,k))^{-1}, and the separation parameter W(l,k) obtained from the intermediate parameters, for example by the standard auxiliary-function normalization W_s(l,k) = P_s(l,k)/√( P_s^H(l,k)D_s(l,k)P_s(l,k) ).
In some embodiments of the present application, in order to increase the accuracy of the separation result and resolve the ambiguity of blind source separation, in determining the separation parameters the s-th first intermediate parameter P_s(l,k) can be obtained from the corrected auxiliary parameters and the historical separation parameters through (W(l-1,k)V_s(l,k))^{-1}; the s-th corrected first intermediate parameter Q_s(l,k) is then obtained; the s-th second intermediate parameter Ψ_s(l,k) is obtained through P_s^H(l,k)D_s(l,k)P_s(l,k); the s-th third intermediate parameter Φ_s(l,k) is obtained through P_s^H(l,k)D_s(l,k)Q_s(l,k); and the separation parameter W_s(l,k) of the s-th element is obtained from the first intermediate parameter, the corrected first intermediate parameter, the second intermediate parameter, and the third intermediate parameter, the separation parameters of all elements being collected into W(l,k). Specifically, for the s-th element W_s(l,k) of the separation parameter W(l,k), the value of its third intermediate parameter Φ_s(l,k) is compared with a preset value, which can be 0: if Φ_s(l,k) equals the preset value, W_s(l,k) is obtained from the second intermediate parameter Ψ_s(l,k), the first intermediate parameter P_s(l,k), and the corrected first intermediate parameter Q_s(l,k); if Φ_s(l,k) differs from the preset value, W_s(l,k) is obtained from Ψ_s(l,k), Φ_s(l,k), P_s(l,k), and Q_s(l,k).
It should be noted that the above determination of the separation parameters is only an illustration for the separation method of overdetermined independent vector analysis based on auxiliary function optimization; in practical applications, the way the separation parameters are determined can be adjusted according to the separation method adopted.
Step b2: Perform sound source separation on the sound signals in the sound source data to be processed according to the separation parameters, determining the sound signal of each candidate sound source.
In some embodiments of the present application, step b2 includes: after the separation parameters are obtained, separating the separation signal from the sound source data to be processed by computing the product of the separation parameters and the frequency domain signal of the sound signals in the sound source data to be processed, where each element of the separation signal represents the frequency domain signal of one candidate sound source, and performing an inverse short-time Fourier transform on the separated frequency domain signal of each candidate sound source, obtaining the sound signal of each candidate sound source.
In some embodiments of the present application, step b2 includes: after the separation parameters are obtained, obtaining noise separation parameters from the separation parameters, obtaining the total separation parameters of the sound source data to be processed from the noise separation parameters and the separation parameters, separating the separation signal from the sound source data to be processed by computing the product of the total separation parameters and the frequency domain signal of the sound signals in the sound source data to be processed, where each element of the separation signal represents the frequency domain signal of one candidate sound source, and performing an inverse short-time Fourier transform on the frequency domain signal of each candidate sound source, obtaining its sound signal.
In some embodiments of the present application, the method of determining the noise separation parameters includes: from the separation parameters, computing the noise subspace J(l,k) through (A_2C(l,k)W^H(l,k))(A_1C(l,k)W^H(l,k))^{-1}, and obtaining the noise separation parameter U(l,k) through [J(l,k), -I_{M-S}], where A_1 and A_2 are constant matrices with A_1 = [I_S, O_{S×(M-S)}] and A_2 = [O_{(M-S)×S}, I_{M-S}], I is the identity matrix, O_{*×*} is the zero matrix, and C(l,k) is the M×M noise parameter matrix. In some embodiments of the present application, C(l,k) can be obtained from the preceding noise parameter matrix C(l-1,k) in the noise separation parameters of the previous frame of sound source data and the sound signal of the sound source data to be processed through αC(l-1,k) + (1-α)X(l,k)X^H(l,k), where α is the forgetting factor used in the auxiliary parameter calculation, set to the same value as there, for example 0.95; in some embodiments, the noise parameter matrix C(1,k) of the first frame of processed sound source data is the zero matrix.
In some embodiments of the present application, after the noise separation parameter U(l,k) is obtained, the total separation parameter can be obtained by stacking the separation parameter and the noise separation parameter, for example as W̄(l,k) = [W(l,k); U(l,k)]; the separation signal Y(l,k) is separated from the sound source data to be processed by computing the product of the total separation parameter and the frequency domain signal of the sound signal X(l,k), where Y(l,k) is a column vector with S elements, each element Y_s(l,k), s = 1, 2, ..., S, representing the frequency domain signal of one candidate sound source; an inverse short-time Fourier transform of the frequency domain signal of each candidate sound source gives its sound signal y_s(l).
In some embodiments of the present application, in order to increase the stability of the separation signal and the accuracy of the separation result and to resolve the ambiguity of blind source separation, in step b2, after the separation parameters are obtained, the noise separation parameters are obtained from them and the total separation parameter of the sound source data to be processed is formed; a first transformation matrix of the total separation parameter is obtained, for example as A(l,k) times the total separation parameter; the elements of the first through S-th rows of the first transformation matrix are extracted to give the second transformation matrix W_bp(l,k); and the separation signal Y(l,k) is separated from the sound source data to be processed through W_bp(l,k), for example Y(l,k) = W_bp(l,k)X(l,k). Here A(l,k) is an M×M diagonal matrix whose diagonal elements are the diagonal elements of the inverse of the total separation parameter, and (·)^H represents the conjugate transpose.
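A minimal sketch of this overdetermined extraction with projection back, under the stacking and diagonal-rescaling forms sketched above; sizes and matrices are hypothetical:

    import numpy as np

    def extract_sources(W, U, X):
        """Stack [W; U], rescale by projection back, keep the first S outputs.

        W: (S, M) source separation rows; U: (M-S, M) noise rows; X: (M, L) spectra.
        """
        W_total = np.vstack([W, U])                    # total separation parameter
        A = np.diag(np.diag(np.linalg.inv(W_total)))   # projection-back scaling
        W_bp = (A @ W_total)[: W.shape[0]]             # rows 1..S of the rescaled matrix
        return W_bp @ X                                # Y(l,k) = W_bp X(l,k)

    rng = np.random.default_rng(6)
    M, S, L = 4, 2, 10
    W = rng.standard_normal((S, M)) + 1j * rng.standard_normal((S, M))
    U = rng.standard_normal((M - S, M)) + 1j * rng.standard_normal((M - S, M))
    X = rng.standard_normal((M, L)) + 1j * rng.standard_normal((M, L))
    print(extract_sources(W, U, X).shape)  # (2, 10)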
It should be noted that the above way of separating the sound signals of the candidate sound sources from the sound source data to be processed through the separation parameters is only an illustration for the separation method of overdetermined independent vector analysis based on auxiliary function optimization; in practical applications, it can be adjusted according to the separation method adopted.
在本申请一些实施例中,步骤304包括:将各初始声源位置上的声音信号的功率依次与功率阈值进行比较,选取出功率大于或等于功率预设的初始声源位置,将选取出的功率大于或等于功率预设的初始声源位置设为候选声源,将每个选取出的功率大于或等于功率预设的初始声源位置的位置信息设置为候选声源的位置信息。在本申请一些实施例中,功率阈值可以是预先设置的,可以是根据各初始声源位置上的声音信号的功率的平均值、众数或中位数确定得到的,还可以根据各初始声源位置上的声音信号的功率,按照功率从大到小的顺序对各功率进行排序,从排序后的功率中第预设数量处的功率值设为功率阈值。
在本申请一些实施例中,步骤304包括:根据各初始声源位置上的声音信号的功率,确定各初始声源位置上的声音信号的功率中的最大功率,计算各初始声源位置上的声音信号的功率与该最大功率之间的功率差值,将功率差值小于或等于预设功率差阈值的功率所对应的初始声源位置设置为候选声源,将功率差值小于或等于预设功率差阈值的功率所对应的初始声源位置的位置信息设置为候选声源的位置信息。
在本申请一些实施例中,在步骤303中,对于每一个初始声源位置,可以通过该初始声源位置与各声音通道位置,得到该初始声源位置的信号到达各声音通道位置的时间信息,根据该初始声源位置的信号到达各声音通道位置的时间信息, 确定得到该初始声源位置上的声音信号的功率,具体地,初始声源位置的功率计算方法包括步骤c1~c3:
步骤c1,针对每个初始声源位置,根据该初始声源位置与各声音通道位置的距离,确定该初始声源位置的信号到达各声音通道位置的时间信息。
在本申请一些实施例中,针对每个初始声源位置,可以通过该初始声源位置与采集待处理声源数据的各声音通道位置,得到该初始声源位置与各声音通道位置之间的距离,根据声音的传播速度以及该初始声源位置与各声音通道位置之间的距离,得到该初始声源位置的信号到达各声音通道位置的时间信息。
步骤c2,根据该初始声源位置的信号到达各声音通道位置的时间信息,确定得到各声音通道位置的声音信号的功率。
在本申请一些实施例中,可以根据该初始声源位置的信号到达各声音通道位置的时间信息进行时延估计,得到各声音通道的可控响应功率函数值,将各声音通道的可控响应功率函数值设置为各声音通道位置的声音信号的功率。其中,可控响应功率函数值可以通过基于相位变换加权的广义互相关函数根据该初始声源位置的信号到达各声音通道位置的时间信息进行时延估计得到。具体地,根据基于相位变换加权的广义互相关函数确定得到各声音通道位置的声音信号的功率的方法包括:
(1)针对每个声音通道位置,确定该初始声源位置的信号到达与该声音通道位置相邻的下一个声源通道位置的第一时间信息,以及该初始声源位置的信号到达与该声音通道位置相邻的下一个声源通道位置的第二时间信息。
(2)确定第一时间信息与第二时间信息的时间差。
(3)根据时间差、该声音通道位置的声音信号以及与该声音通道位置相邻的下一声源通道位置的声音信号,确定得到该声音通道位置的功率。
在本申请一些实施例中,可以通过该初始声源位置的信号到达与该声音通道位置相邻的下一个声源通道位置的第二时间信息减去该初始声源位置的信号到达该声音通道位置的第一时间信息,得到该初始声源位置的信号到达该声音通道位置以及到达与该声音通道位置相邻的下一个声源通道位置的时间差τij(dn),其中dn表示该初始声源位置,i表示第i个声源通道,j表示第j个声源通道,且j=i+1。
在本申请一些实施例中,在得到时间差τij(dn)之后,根据该声音通道位置的声音信号的各频率点k的频域信号Xi(k)以及与该声音通道位置相邻的下一声源通道位置的各频率点k声音信号的频域信号Xj(k),通过得到该声音通道位置的可控响应功率函数值Rijij(dn)],将该声音通道位置的可控响应功率函数值设置为该声音通道位置的声音信号的功率,其中,(·)*表示共轭,Fs是待处理声源数据中的是声音信号的采样频率,K为短时傅立叶变换的频率点数。
步骤c3,根据各声音通道位置的功率,确定得到该初始声源位置上的声音 信号的功率。
在本申请一些实施例中,在得到该声音通道位置的声音信号的功率之后,通过得到该初始声源位置上的声音信号的功率F(dn)。
在本申请一些实施例中,考虑到对于该初始声源位置,电子设备中的不同的声音通道位置接收到该初始声源位置的信号质量是存在差异的,在基于相位变换加权的广义互相关函数确定得到各声音通道位置的声音信号的功率的方法中,如果对于每一个声音通道位置,不考虑该声音通道位置接收到的信号质量,仅通过得到该初始声源位置上的声音信号的功率F(dn),可能会降低后续声源估计的准确度,并且,在实际应用中,可以通过声音通道位置对的可控响应功率函数的最大值可表征该对声音通道位置接收信号的质量。
基于此,在步骤c2中,通过该初始声源位置的信号达到每个声音通道位置以及达到与每个声音通道位置相邻的下一声音通道位置的时间差,得到每个声音通道位置的初始功率以及每个声音通道位置相邻的下一声音通道位置的初始功率,根据每个声音通道位置的初始功率以及每个声音通道位置相邻的下一声音通道位置的初始功率中的最大值得到每个声音通道位置的功率权重,通过每个声音通道位置的初始功率以及该声音通道位置的功率权重,得到各声音通道位置的声音信号的功率,具体地,基于权重的声音信号的功率确定方法包括:
(1)根据该初始声源位置的信号到达各声音通道位置的时间信息,确定得到各声音通道位置的声音信号的初始功率。
(2)根据各声音通道位置对应的初始功率,确定得到每两个相邻声音通道位置对应的初始功率中的目标功率,目标功率表征每两个相邻声音通道位置对应的初始功率中的较大值。
(3)针对每个声音通道位置,根据该声音通道位置对应的初始功率、该声音通道位置相邻的下一声音通道位置对应的初始功率以及各目标功率,确定得到该声音通道位置的功率权重。
(4)根据该声音通道位置对应的初始功率以及该声音通道位置的功率权重,确定得到该声音通道位置的功率。
其中,对于该初始声源位置,可以按照上述基于相位变换加权的广义互相关函数确定得到各声音通道位置的声音信号的功率的方法得到每个声音通道位置的声音信号的初始功率。
在本申请一些实施例中,对于采集待处理声源数据的每个声音通道位置,确定该声音通道位置的声音信号的初始功率以及该声音通道位置相邻的下一声音通道位置的声音信号的初始功率中的最大初始功率Rmax ijij(dn)],将该最大初始功率Rmax ijij(dn)]设置目标功率。在得到每个目标功率后,通过累加每个目标功率,得到在该初始声源位置下,采集待处理声源数据的声音通道位置的目标功率总值针对每个声音通道位置,确定该声音通道位置的声音 信号的初始功率以及该声音通道位置相邻的下一声音通道位置的声音信号的初始功率中的最大初始功率,通过该最大初始功率与目标功率总值的比值对该声音通道位置的声音信号的初始功率以及该声音通道位置相邻的下一声音通道位置的声音信号的初始功率中的最大初始功率进行归一化,得到该声音通道位置的声音信号的功率权重ωi ,j
在本申请一些实施例中,在得到该声音通道位置的声音信号的功率权重ωi ,j、该声音通道位置的声音信号的初始功率Rijij(dn)]后,根据ωi,jRijij(dn)]得到该声音通道位置的功率。
在本申请一些实施例中,在通过ωi,jRijij(dn)]得到该声音通道位置的功率后,通过得到该初始声源位置上的声音信号的功率F(dn)。
在本申请一些实施例中,步骤202中,可以根据候选声源的位置信息以及采集待处理声源数据的每个声音通道位置,得到每个候选声源的位置信息的信号达到待处理声源数据的每个声音通道位置的时间信息,根据每个候选声源的位置信息的信号达到待处理声源数据的每个声音通道位置的时间信息,得到每个候选声源的位置导向信息。具体地位置导向信息的确定方法包括步骤d1~d2:
步骤d1,针对每个候选声源,根据该候选声源的位置信息,确定得到该候选声源的信号到达各声源通道位置的时间信息。
步骤d2,根据该候选声源的信号到达各声源通道位置的时间信息,得到该候选声源的位置导向信息。
在本申请一些实施例中,步骤d1包括:根据已建立的空间坐标系,根据采集待处理声源数据的每个声源通道在空间坐标系中的位置信息以及该空间坐标系中的坐标原点的位置信息,得到采集待处理声源数据的每个声源通道的位置向量,针对每个候选声源,根据该候选声源的位置信息的方向向量与待处理声源数据的每个声源通道的位置向量之间的内积,得到该候选声源的位置处的信号到达采集待处理声源数据的每个声源通道位置的传播距离,根据该候选声源的信号到达采集待处理声源数据的每个声源通道的传播距离和声音传播速度,得到该候选声源的信号到达采集待处理声源数据的每个声源通道位置的时间信息。
在本申请一些实施例中,步骤d2包括:将该候选声源的信号到达采集待处理声源数据的每个声源通道位置的时间信息输入至预设的矢量模型得到该候选声源的位置导向信息其中,表示该候选声源的位置信息,ω是预选设置的模拟角频率,τm,m=1,2,...,M表征该候选声源的信号到达采集待处理声源数据的每个声源通道位置的时间信息。
在本申请一些实施例中,在得到每个候选声源的位置导向信息后,按照步骤 203根据每个候选声源的位置导向信息,对待处理声源数据进行声源分离,得到每个候选声源的声音信号。
本申请实施例,在声源分离处理中,先是对待处理声源数据声源定位,得到候选声源的个数以及候选声源的位置信息,继而利用各个候选声源的位置信息求取各个候选声源的位置导向信息,采用结合位置导向信息的超定独立向量分析的分离方法对待处理声源数据进行声源分离,从待处理声源数据中分离出各个候选声源的声音信号,本申请实施例利用了位置导向信息来牵引盲源分离,加强了分离输出的声音信号的稳定性,避免了现有的仅通过超定独立向量分析的分离方法的在声源位置发生变化的情况下可能输出纯噪声的情况。
在本申请一些实施例中,考虑到201~203所示的声源分离处理方法需要进行候选声源的数量估计,这增加了声音信号处理方法的计算量,因此为了降低声音信号处理方法的计算量,本申请实施例提供一种无需声源估计的声源分离处理方法,具体地,如图4所示,图4是本申请实施例提供的声音信号处理方法中另一种声源分离处理的一个流程示意图,所示的声源分离处理方法包括步骤401~403:
401,对待处理声源数据进行声源分离处理,得到多个预测声源,以及每个初预测声源的声音信号。
考虑到现有的基于独立向量分析的分离方法在进行声源分离处理时,需要进行声源估计确定候选声源的数量或者预先知道需要分离出的候选声源数量,增加了声音信号处理方法的计算量,为了解决在基于独立向量分析的分离方法在进行盲源分离时,必须进行候选声源的数量估计问题,本申请实施例在基于独立向量分析的分离方法在进行声源分离时,通过待处理声源数据的声音通道的数量m建立初始分离参数WM×M(l),通过迭代更新初始分离参数WM×M(l),从待处理声源数据的分离出M个预测声源的声音信号,对分离出的M个预测声源的声音信号进行冗余信号检测与剔除,从分离出的M个预测声源的声音信号中提取出候选声源的声音信号。
在本申请一些实施例中,在步骤401中,可以通过待处理声源数据的声音通道的数量m建立初始分离参数WM×M(l),通过基于等变自适应分解算法的迭代模型对初始分离参数WM×M(l)进行迭代,每一次迭代时,从待处理声源数据分离出该次迭代的分离信号,当迭代次数达到预设迭代次数时,得到最终的分离信号从待处理声源数据的分离出M个预测声源的声音信号。其中,I时m*m维的单位矩阵,l表征迭代步数,α(l)表征迭代步长,E表示求期望,y表示非线性函数,其与待处理声源数据的声音信号的概率密度函数有关,y表示第l次挈带得到的分离信号,其中T表示转置。
在本申请一些实施例中,在步骤401中,还通过待处理声源数据的声音通道的数量m建立初始分离参数WM×M(l),通过基于自然梯度法的迭代模型对初始分离参数 WM×M(l)进行迭代,每一次迭代时,从待处理声源数据分离出该次迭代的分离信号,当迭代次数达到预设迭代次数时,得到最终的分离信号,从待处理声源数据分离出M个预测声源的声音信号。其中,基于自然梯度法的迭代模型的参数含义与基于等变自适应分解算法的迭代模型相同,此处不再赘述。
在本申请一些实施例中,在步骤401中,也可以对待处理声源数据的M个声音通道的时域信号x={x1;x2;...;xM}进行短时傅里叶变换,得到待处理声源数据的M个声音通道的频域信号X(k),k=1,2,...,K,其中K是短时傅里叶变换的点数,X(k)={X1(k);...;XM(k)};根据基于辅助函数优化的独立向量分析的分离方法对各频率点的频域信号X(k)进行声源分离,从待处理声源数据分离出M个预测声源的声音信号。
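对于基于辅助函数优化的独立向量分析（AuxIVA），一个可能的开源参考实现如下（非本申请实施例的组成部分，假设使用pyroomacoustics库，具体接口以该库文档为准）：

```python
import numpy as np
import pyroomacoustics as pra

# x: (L, M)的M通道时域信号, 此处以随机信号示意
x = np.random.default_rng(1).standard_normal((16000, 4))

# 多通道短时傅里叶变换, 得到(帧数, 频点数, 通道数)的频域信号X(k)
X = pra.transform.stft.analysis(x, L=512, hop=256)

# 基于辅助函数优化的独立向量分析, 分离出M个预测声源的频域信号
Y = pra.bss.auxiva(X, n_iter=20)

# 逆短时傅里叶变换得到各预测声源的时域声音信号
y = pra.transform.stft.synthesis(Y, L=512, hop=256)
```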
402,计算各预测声源的声音信号之间的互相关系数,得到相关系数矩阵。
在本申请一些实施例中，候选声源的数量未知时，分离得到的M个预测声源的声音信号中存在S个独立分量，其余的M-S个分量是一个或者多个独立分量的拷贝或零信号，其中，S个独立分量即为S个候选声源的声音信号。由于M个预测声源的声音信号中的S个独立分量之间的相关性比较低，而由一个或者多个独立分量的拷贝或零信号构成的M-S个分量与被拷贝的独立分量之间存在较强的相关性，因此可以通过各预测声源对应的声音信号之间的互相关系数，从M个预测声源的声音信号中提取出S个候选声源的声音信号。
具体地，步骤402包括：针对每个预测声源，计算该预测声源对应的声音信号的自相关系数，以及该预测声源对应的声音信号与其余每个预测声源的声音信号之间的互相关系数，得到该预测声源对应的声音信号的相关系数；根据每个预测声源对应的声音信号的相关系数建立预测声源的声音信号对应的相关系数矩阵C=(ρ_{pq})_{M×M}，其中

$$\rho_{pq}=\frac{E\left[y_p(l)y_q(l)\right]}{\sqrt{E\left[y_p^{2}(l)\right]E\left[y_q^{2}(l)\right]}}$$

表征第p个预测声源与第q个预测声源的声音信号之间的相关系数。
403,根据相关系数矩阵,从各预测声源中确定得到候选声源以及候选声源的声音信号。
在本申请一些实施例中，相关系数矩阵中，矩阵对角线元素表示预测声源的声音信号的自相关系数，必然都为1，矩阵中其他元素表示任意两个预测声源的声音信号之间的互相关系数。将相关系数矩阵中每一列或每一行非对角线元素的数值与预设系数进行比较；若相关系数矩阵中每一列或每一行的非对角线元素中，存在数值与预设系数之间的绝对差小于或等于预设阈值的目标元素，则说明预测声源中存在与对角线元素所对应的预测声源的声音信号相同或者相似的冗余信号，将该目标元素对应的预测声源剔除。通过相关系数矩阵剔除冗余信号，对多个预测声源进行数据清洗，即得到候选声源以及候选声源的声音信号。
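上述冗余信号检测与剔除可用如下Python片段示意（非本申请实施例的组成部分，预设系数与预设阈值的取值为说明性假设；零信号可另行按能量阈值剔除，此处从略）：

```python
import numpy as np

def prune_redundant_sources(y, preset_coef=1.0, eps=0.05):
    """由相关系数矩阵剔除冗余的预测声源, 返回候选声源的声音信号(示意实现)。

    y: (M, T)的M个预测声源时域声音信号; preset_coef: 预设系数;
    eps: 预设阈值, |ρ_pq - preset_coef| <= eps 的非对角元素视为冗余。
    """
    M = y.shape[0]
    C = np.corrcoef(y)                     # 相关系数矩阵, 对角线元素为1
    keep = np.ones(M, dtype=bool)
    for p in range(M):
        if not keep[p]:
            continue
        for q in range(p + 1, M):
            # 非对角元素接近预设系数 => 第q路是第p路的拷贝或近似拷贝
            if keep[q] and abs(abs(C[p, q]) - preset_coef) <= eps:
                keep[q] = False
    return y[keep]                          # 候选声源的声音信号
```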
考虑到分离得到的候选声源的声音信号中包括目标声源的声音信号和非目标声源的声音信号，并且目标声源的声音信号中混叠的其他信号或噪声成分极少、语音质量较好，而非目标声源的声音信号由于混叠噪声或者其他信号，其语音质量比目标声源的声音信号差，因此可以通过评估每个候选声源的声音信号的语音质量，从分离得到的候选声源的声音信号中选取出目标声源的声音信号。基于此，本申请实施例在得到待处理声源数据中的候选声源以及每个候选声源的声音信号之后，为了进一步去除候选声源中的噪声，对每个候选声源的声音信号的语音质量进行评估，得到每个候选声源的声音信号的评估值，根据评估值对候选声源进行筛选，从多个候选声源中选取出目标声源。
在本申请一些实施例中,可以通过计算每个候选声源的声音信号的峭度值,得到每个候选声源的声音信号对应的评估值,其中峭度值用于描述声音信号的语音特征,声音信号的峭度值越大则该声音信号的语音质量越高。具体地,基于峭度值的声音信号评估方法包括:
(1)对每个候选声源的声音信号进行时频域转换,得到每个候选声源的声音信号的时域信号。
(2)确定每个候选声源的声音信号的时域信号对应的峭度值,将峭度值设置为该候选声源的声音信号对应的评估值。
在本申请一些实施例中，在分离得到多个候选声源的声音信号的频域信号Y(l,k)=[Y_1(l,k),Y_2(l,k),...,Y_S(l,k)]后，通过逆短时傅立叶变换，得到多个候选声源的声音信号的时域信号y(l)=[y_1(l),y_2(l),...,y_S(l)]。对于每一个候选声源，根据该候选声源的声音信号的时域信号y_s(l)，s=1,…,S，通过

$$K\!\left(y_s(l)\right)=\frac{E\left[y_s^{4}(l)\right]}{\left(E\left[y_s^{2}(l)\right]\right)^{2}}-3$$

得到该候选声源的声音信号的峭度值K(y_s(l))，将该候选声源的声音信号的峭度值K(y_s(l))设置为该候选声源的声音信号对应的评估值。
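峭度评估可用如下Python片段示意（非本申请实施例的组成部分，示例信号为说明性假设）：

```python
import numpy as np

def kurtosis_score(y_s):
    """计算候选声源时域信号的峭度值, 作为其语音质量的评估值(示意实现)。"""
    y = np.asarray(y_s, dtype=float)
    y = y - y.mean()
    var = np.mean(y ** 2) + 1e-12
    return float(np.mean(y ** 4) / var ** 2 - 3.0)  # 超高斯语音信号的峭度明显大于0

# 用法示意: 峭度值(评估值)最大的候选声源被选取为目标声源
rng = np.random.default_rng(2)
candidates = [rng.standard_normal(16000), rng.laplace(size=16000)]  # 假设的两路候选信号
scores = [kurtosis_score(s) for s in candidates]
target_idx = int(np.argmax(scores))                 # 超高斯的laplace信号峭度更大
```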
在本申请一些实施例中，当电子设备是以语音为交互方式的电子设备时，还可以根据每个候选声源的声音信号的语音特征确定每个候选声源的声音信号的唤醒词得分，将每个候选声源的声音信号的唤醒词得分设置为每个候选声源的声音信号对应的评估值，即通过每个候选声源的声音信号的唤醒词得分，从多个候选声源中选取出目标声源。其中，唤醒词得分用于量化每个候选声源的声音信号的语音质量。在本申请一些实施例中，可以通过确定每个候选声源的声音信号对应的声音特征是唤醒词的声音特征的概率，确定得到每个候选声源的声音信号的唤醒词得分。具体地，基于唤醒词得分的声音信号评估方法包括：
(1)获取每个候选声源的声音信号的语音特征向量。
(2)确定每个候选声源的声音信号的语音特征向量所对应的概率分值。
(3)根据每个候选声源的声音信号的语音特征向量所对应的概率分值,确定每个候选声源的声音信号对应的评估值。
其中,概率分值表征语音特征向量是唤醒词对应的语音特征向量的概率。
在本申请一些实施例中，可以在分离得到多个候选声源的声音信号的频域信号Y(l,k)=[Y_1(l,k),Y_2(l,k),...,Y_S(l,k)]后，通过逆短时傅立叶变换，得到多个候选声源的声音信号的时域信号y(l)=[y_1(l),y_2(l),...,y_S(l)]；对于每一个候选声源的时域信号y_s(l)，s=1,…,S，通过由频谱衍生出来的梅尔频率倒谱系数（MFCC）从y_s(l)中提取出反映语音信号特征的关键特征参数，形成特征矢量序列，将特征矢量序列设置为该候选声源的声音信号的语音特征向量。
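MFCC特征的提取可用如下Python片段示意（非本申请实施例的组成部分，假设使用librosa库，采样率与系数个数为说明性取值）：

```python
import numpy as np
import librosa

def speech_feature_vectors(y_s, sr=16000, n_mfcc=13):
    """从候选声源的时域信号y_s提取MFCC特征矢量序列(示意实现)。"""
    mfcc = librosa.feature.mfcc(y=np.asarray(y_s, dtype=float), sr=sr, n_mfcc=n_mfcc)
    return mfcc.T                    # (帧数, n_mfcc): 每帧一个语音特征向量

# 用法示意: 特征矢量序列可输入唤醒词模型, 得到相应的概率分值
feats = speech_feature_vectors(np.random.default_rng(3).standard_normal(16000))
```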
在本申请一些实施例中,可以根据每个候选声源的声音信号的语音特征向量得到每个候选声源的声音信号的语义特征,将每个候选声源的声音信号的语义特征与预设语义特征进行比对,得到每个候选声源的声音信号的语义特征与预设语义特征的相似程度,将每个候选声源的声音信号的语义特征与预设语义特征的相似程度设置为每个候选声源的声音信号的语音特征向量所对应的概率分值。其中,可以根据步骤102中的语义特征提取方法得到每个候选声源的声音信号的语义特征。
在本申请一些实施例中,对于每个候选声源,可以将该候选声源的声音信号的语音特征向量所对应的概率分值设置为该候选声源的声音信号对应的评估值。
在本申请一些实施例中,对于每个候选声源,可以将该候选声源的声音信号的语音特征向量所对应的概率分值与预设概率阈值进行比较,若该候选声源的声音信号的语音特征向量所对应的概率分值大于预设概率阈值,则将该候选声源的声音信号对应的评估值设置为第一预设值;若该候选声源的声音信号的语音特征向量所对应的概率分值小于或等于预设概率阈值,则将该候选声源的声音信号对应的评估值设置为第二预设值。其中,第一预设值可以是1,第二预设值可以为0;第一预设值还可以是100,第二预设值还可以是0。
在本申请一些实施例中，对于每个候选声源，可以根据该候选声源的声音信号的语音特征向量所对应的概率分值查询预存的评估数据，确定该概率分值所在的概率区间，以及该概率区间所对应的评估分数，将该概率区间所对应的评估分数设置为该候选声源的声音信号对应的评估值。其中，预存的评估数据包括多个概率区间以及每个概率区间所对应的评估分数。
在本申请一些实施例中,在得到每个候选声源的声音信号对应的评估值之后,可以根据每个候选声源的声音信号对应的评估值确定最大评估值对应的候选声源,将最大评估值对应的候选声源设置为目标声源,并将最大评估值对应的候选声源的声音信号设置为目标声源的声音信号。
本申请实施例提供的声音信号处理方法，通过评估盲源分离得到的每个候选声源的声音信号来确定最终的目标声源，改善了盲源分离稳定性不高的问题；通过提高目标声源选取的准确度进行降噪，提升了视听效果。
为了更好实施本申请实施例提供的声音信号处理方法,在声音信号处理方法实施例基础上,本申请实施例还提供一种声音信号处理装置,如图5所示,图5是本申请实施例提供的声音信号处理装置的结构示意图,所示的声音信号处理装置包括:
分离模块501,用于对待处理声源数据进行声源分离处理,得到待处理声源数据对应的候选声源以及待处理声源数据中属于各个候选声源的声音信号;
评估模块502,用于对每个候选声源的声音信号进行质量评估,确定每个候选声源的声音信号的评估值;
选取模块503,用于根据每个候选声源的声音信号对应的评估值,从多个候选声源中确定得到目标声源;
处理模块504,用于对目标声源的声音信号进行处理。
在本申请一些实施例中,分离模块501包括:
声源估计单元,用于对待处理声源数据进行声源位置估计,确定得到待处理声源数据对应的候选声源以及每个候选声源的位置信息;
矢量确定单元,用于根据采集待处理声源数据的各声音通道位置以及各候选声源的位置信息,确定得到每个候选声源的位置导向信息;
分离单元,用于根据每个候选声源的位置导向信息,对待处理声源数据进行声源分离,得到每个候选声源的声音信号。
在本申请一些实施例中，分离单元包括：
分离参数子单元,用于根据每个候选声源的位置导向信息,确定得到分离参数;
分离子单元,用于根据分离参数,对待处理声源数据中的声音信号进行声源分离,确定得到每个候选声源的声音信号。
在本申请一些实施例中,分离参数子单元用于:
获取历史声源数据的历史分离参数以及待处理声源数据对应的辅助参数;
根据每个候选声源的位置导向信息，对辅助参数进行修正，得到修正后的辅助参数；
根据修正后的辅助参数以及历史分离参数,得到待处理声源数据的分离参数。
在本申请一些实施例中,声源估计单元,用于:
根据预设的方位角,确定得到多个初始声源位置;
根据各初始声源位置,确定各初始声源位置与采集待处理声源数据的各声音通道位置的距离;
根据各初始声源位置与各声音通道位置的距离,确定得到各初始声源位置上的声音信号的功率;
根据各初始声源位置上的声音信号的功率,确定得到候选声源以及候选声源的位置信息。
在本申请一些实施例中,声源估计单元,用于:
针对每个初始声源位置,根据该初始声源位置与各声音通道位置的距离,确定该初始声源位置的信号到达各声音通道位置的时间信息;
根据该初始声源位置的信号到达各声音通道位置的时间信息,确定得到各声音通道位置的声音信号的功率;
根据各声音通道位置的声音信号的功率,确定得到该初始声源位置上的声音信号的功率。
在本申请一些实施例中,声源估计单元,用于:
针对每个声音通道位置,确定该初始声源位置的信号到达该声音通道位置的第一时间信息,以及该初始声源位置的信号到达与该声音通道位置相邻的下一个声源通道位置的第二时间信息;
确定第一时间信息与第二时间信息的时间差;
根据时间差、该声音通道位置的声音信号、与该声音通道位置相邻的下一声源通道位置的声音信号,确定得到该声音通道位置的声音信号的功率。
在本申请一些实施例中,声源估计单元,用于:
根据该初始声源位置的信号到达各声音通道位置的时间信息,确定得到各声音通道位置的声音信号的初始功率;
根据各声音通道位置对应的初始功率,确定得到每两个相邻声音通道位置对应的初始功率中的目标功率,目标功率表征每两个相邻声音通道位置对应的初始功率中的较大值;
针对每个声音通道位置,根据该声音通道位置对应的初始功率、该声音通道位置相邻的下一声音通道位置对应的初始功率以及各目标功率,确定得到该声音通道位置的功率权重;
根据该声音通道位置对应的初始功率以及该声音通道位置的功率权重,确定得到该声音通道位置的功率。
在本申请一些实施例中,矢量确定单元,用于:
针对每个候选声源,根据该候选声源的位置信息,确定得到该候选声源的信号到达各声源通道位置的时间信息;
根据该候选声源的信号到达各声源通道位置的时间信息,得到该候选声源的位置导向信息。
在本申请一些实施例中,分离模块501包括:
初始分离单元,用于对待处理声源数据进行声源分离,得到待处理声源数据对应的预测声源以及待处理声源数据中属于各个预测声源的声音信号;
相关计算单元,用于计算各预测声源的声音信号之间的互相关系数,得到相关系数矩阵;
筛选单元,用于根据相关系数矩阵,从各预测声源中确定得到候选声源以及候选声源的声音信号。
在本申请一些实施例中,评估模块502,用于:
对每个候选声源的声音信号进行时频域转换,得到每个候选声源的声音信号的时域信号;
确定每个候选声源的声音信号的时域信号对应的峭度值,将峭度值设置为该候选声源的声音信号对应的评估值。
在本申请一些实施例中,评估模块502,用于:
获取每个候选声源的声音信号的语音特征向量;
确定每个候选声源的声音信号的语音特征向量所对应的概率分值;概率分值表征语音特征向量是唤醒词对应的语音特征向量的概率;
根据每个候选声源的声音信号的语音特征向量所对应的概率分值,确定每个候选声源的声音信号对应的评估值。
在本申请一些实施例中,选取模块503,用于:
根据每个候选声源的声音信号对应的评估值,确定最大评估值对应的候选声源;
将最大评估值对应的候选声源设置为目标声源。
本申请实施例提供的声音信号处理装置，通过评估盲源分离得到的每个候选声源的声音信号来确定最终的目标声源，改善了盲源分离稳定性不高的问题；通过提高目标声源选取的准确度进行降噪，提升了视听效果。
相应的，本申请实施例还提供一种电子设备，如图6所示，该电子设备可以包括射频(RF，Radio Frequency)电路601、包括有一个或一个以上计算机可读存储介质的存储器602、输入单元603、显示单元604、传感器605、音频电路606、无线保真(WiFi，Wireless Fidelity)模块607、包括有一个或者一个以上处理核心的处理器608、以及电源609等部件。本领域技术人员可以理解，图6中示出的电子设备结构并不构成对电子设备的限定，电子设备可以包括比图示更多或更少的部件，或者组合某些部件，或者不同的部件布置。其中：
RF电路601可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,交由一个或者一个以上处理器608处理;另外,将涉及上行的数据发送给基站。通常,RF电路601包括但不限于天线、至少一个放大器、调谐器、一个或多个振荡器、用户身份模块(SIM,Subscriber Identity Module)卡、收发信机、耦合器、低噪声放大器(LNA,Low Noise Amplifier)、双工器等。此外,RF电路601还可以通过无线通信与网络和其他设备通信。无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统(GSM,Global System of Mobile communication)、通用分组无线服务(GPRS,General Packet Radio Service)、码分多址(CDMA,Code Division Multiple Access)、宽带码分多址(WCDMA,Wideband Code Division Multiple Access)、长期演进(LTE,Long Term Evolution)、电子邮件、短消息服务(SMS,Short Messaging Service)等。
存储器602可用于存储软件程序以及模块,处理器608通过运行存储在存储器602的软件程序以及模块,从而执行各种功能应用以及数据处理。存储器602可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的计算机程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据电子设备的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器602可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。相应地,存储器602还可以包括存储器控制器,以提供处理器608和输入单元603对存储器602的访问。
输入单元603可用于接收输入的数字或字符信息,以及产生与用户设置以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。具体地,在一个具体的实施例中,输入单元603可包括触敏表面以及其他输入设备。触敏表面,也称为触摸显示屏或者触控板,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触敏表面上或在触敏表面附近的操作),并根据预先设定的程式驱动相应的连接装置。可选的,触敏表面可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器608,并能接收处理器608发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触敏表面。除了触敏表面,输入单元603还可以包括其他输入设备。具体地,其他输入设备可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元604可用于显示由用户输入的信息或提供给用户的信息以及电子设备的各种图形用户接口，这些图形用户接口可以由图形、文本、图标、视频及其任意组合来构成。显示单元604可包括显示面板，可选的，可以采用液晶显示器(LCD，Liquid Crystal Display)、有机发光二极管(OLED，Organic Light-Emitting Diode)等形式来配置显示面板。进一步的，触敏表面可覆盖显示面板，当触敏表面检测到在其上或附近的触摸操作后，传送给处理器608以确定触摸事件的类型，随后处理器608根据触摸事件的类型在显示面板上提供相应的视觉输出。虽然在图6中，触敏表面与显示面板是作为两个独立的部件来实现输入和输出功能，但是在某些实施例中，可以将触敏表面与显示面板集成而实现输入和输出功能。
电子设备还可包括至少一种传感器605,比如光传感器、运动传感器以及其他传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板的亮度,接近传感器可在电子设备移动到耳边时,关闭显示面板和/或背光。作为运动传感器的一种,重力加速度传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于电子设备还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。
音频电路606、扬声器、传声器可提供用户与电子设备之间的音频接口。音频电路606可将接收到的音频数据转换为电信号，传输到扬声器，由扬声器转换为声音信号输出；另一方面，传声器将收集的声音信号转换为电信号，由音频电路606接收后转换为音频数据，再将音频数据输出至处理器608处理后，经RF电路601发送给比如另一电子设备，或者将音频数据输出至存储器602以便进一步处理。音频电路606还可能包括耳塞插孔，以提供外设耳机与电子设备的通信。
WiFi属于短距离无线传输技术,电子设备通过WiFi模块607可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图6示出了WiFi模块607,但是可以理解的是,其并不属于电子设备的必须构成,完全可以根据需要在不改变发明的本质的范围内而省略。
处理器608是电子设备的控制中心，利用各种接口和线路连接整个电子设备的各个部分，通过运行或执行存储在存储器602内的软件程序和/或模块，以及调用存储在存储器602内的数据，执行电子设备的各种功能和处理数据，从而对电子设备进行整体监控。可选的，处理器608可包括一个或多个处理核心；优选的，处理器608可集成应用处理器和调制解调处理器，其中，应用处理器主要处理操作系统、用户界面和计算机程序等，调制解调处理器主要处理无线通信。可以理解的是，上述调制解调处理器也可以不集成到处理器608中。
电子设备还包括给各个部件供电的电源609(比如电池),优选的,电源可以通过电源管理系统与处理器608逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。电源609还可以包括一个或一个以上的直流或交流电源、再充电系统、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。
尽管未示出,电子设备还可以包括摄像头、蓝牙模块等,在此不再赘述。具体在本实施例中,电子设备中的处理器608会按照如下的指令,将一个或一个以上的计算机程序的进程对应的可执行文件加载到存储器602中,并由处理器608来运行存储在存储器602中的计算机程序,从而实现各种功能:
对待处理声源数据进行声源分离处理,得到待处理声源数据对应的候选声源以及待处理声源数据中属于各个候选声源的声音信号;
对每个候选声源的声音信号进行质量评估,确定每个候选声源的声音信号的评估值;
根据每个候选声源的声音信号的评估值,从多个候选声源中确定得到目标声源;
对目标声源的声音信号进行处理。
本领域普通技术人员可以理解,上述实施例的各种方法中的全部或部分步骤可以通过指令来完成,或通过指令控制相关的硬件来完成,该指令可以存储于一计算机可读存储介质中,并由处理器进行加载和执行。
为此,本申请实施例提供一种存储介质,其中存储有多条指令,该多条指令能够被处理器进行加载,以执行本申请实施例所提供的任一种声音信号处理方法中的步骤。例如,该计算机程序可以执行如下步骤:
对待处理声源数据进行声源分离处理,得到待处理声源数据对应的候选声源以及待处理声源数据中属于各个候选声源的声音信号;
对每个候选声源的声音信号进行质量评估,确定每个候选声源的声音信号的评估值;
根据每个候选声源的声音信号的评估值,从多个候选声源中确定得到目标声源;
对目标声源的声音信号进行处理。
以上各个操作的具体实施可参见前面的实施例,在此不再赘述。
其中,该存储介质可以包括:只读存储器(ROM,Read Only Memory)、随机存取记忆体(RAM,Random Access Memory)、磁盘或光盘等。
由于该存储介质中所存储的计算机程序,可以执行本申请实施例所提供的任一种声音信号处理方法中的步骤,因此,可以实现本申请实施例所提供的任一种声音信号处理方法所能实现的有益效果,详见前面的实施例,在此不再赘述。
以上对本申请实施例所提供的一种声音信号处理方法、装置、电子设备和存储介质进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (20)

  1. 一种声音信号处理方法,其中,所述方法包括:
    对待处理声源数据进行声源分离处理,得到所述待处理声源数据对应的候选声源以及所述待处理声源数据中属于各个所述候选声源的声音信号;
    对每个所述候选声源的声音信号进行质量评估,确定每个所述候选声源的声音信号的评估值;
    根据每个所述候选声源的声音信号的评估值,从多个所述候选声源中确定得到目标声源;
    对所述目标声源的声音信号进行处理。
  2. 如权利要求1所述的声音信号处理方法,其中,所述对待处理声源数据进行声源分离处理,得到所述待处理声源数据对应的候选声源以及所述待处理声源数据中属于各个所述候选声源的声音信号包括:
    对待处理声源数据进行声源位置估计,确定得到所述待处理声源数据对应的候选声源以及每个候选声源的位置信息;
    根据采集所述待处理声源数据的各声音通道位置以及各所述候选声源的位置信息,确定得到每个所述候选声源的位置导向信息;
    根据每个所述候选声源的位置导向信息,对所述待处理声源数据进行声源分离,得到每个所述候选声源的声音信号。
  3. 如权利要求2所述的声音信号处理方法,其中,所述根据每个所述候选声源的位置导向信息,对所述待处理声源数据进行声源分离,得到每个所述候选声源的声音信号,包括:
    根据每个所述候选声源的位置导向信息,确定得到分离参数;
    根据所述分离参数,对所述待处理声源数据中的声音信号进行声源分离,确定得到每个所述候选声源的声音信号。
  4. 如权利要求3所述的声音信号处理方法,其中,所述根据每个所述候选声源的位置导向信息,确定得到分离参数包括:
    获取历史声源数据的历史分离参数以及所述待处理声源数据对应的辅助参数;
    根据每个所述候选声源的位置导向信息,对所述辅助参数进行修正,得到修正后的辅助参数;
    根据所述修正后的辅助参数以及所述历史分离参数,得到所述待处理声源数据的分离参数。
  5. 如权利要求2所述的声音信号处理方法,其中,所述对待处理声源数据进行声源位置估计,确定得到所述待处理声源数据对应的候选声源以及每个候选声源的位置信息包括:
    根据预设的方位角,确定得到多个初始声源位置;
    根据各所述初始声源位置,确定各所述初始声源位置与采集待处理声源数据的各声音通道位置的距离;
    根据各所述初始声源位置与各所述声音通道位置的距离,确定得到各所述初始声源位置上的声音信号的功率;
    根据各所述初始声源位置上的声音信号的功率,确定得到候选声源以及所述候选声源的位置信息。
  6. 如权利要求5所述的声音信号处理方法，其中，所述根据各所述初始声源位置与各所述声音通道位置的距离，确定得到各所述初始声源位置上的声音信号的功率包括：
    针对每个所述初始声源位置,根据该初始声源位置与各所述声音通道位置的距离,确定该初始声源位置的信号到达各所述声音通道位置的时间信息;
    根据该初始声源位置的信号到达各所述声音通道位置的时间信息,确定得到各所述声音通道位置的声音信号的功率;
    根据各所述声音通道位置的声音信号的功率,确定得到该初始声源位置上的声音信号的功率。
  7. 如权利要求6所述的声音信号处理方法,其中,所述根据该初始声源位置的信号到达各所述声音通道位置的时间信息,确定得到各所述声音通道位置的声音信号的功率,包括:
    针对每个所述声音通道位置,确定该初始声源位置的信号到达该声音通道位置的第一时间信息,以及该初始声源位置的信号到达与该声音通道位置相邻的下一个声源通道位置的第二时间信息;
    确定所述第一时间信息与所述第二时间信息的时间差;
    根据所述时间差、该声音通道位置的声音信号、与该声音通道位置相邻的下一声源通道位置的声音信号,确定得到该声音通道位置的声音信号的功率。
  8. 如权利要求6所述的声音信号处理方法,其中,所述根据该初始声源位置的信号到达各所述声音通道位置的时间信息,确定得到各所述声音通道位置的声音信号的功率,包括:
    根据该初始声源位置的信号到达各所述声音通道位置的时间信息,确定得到各所述声音通道位置的声音信号的初始功率;
    根据各所述声音通道位置对应的初始功率,确定得到每两个相邻声音通道位置对应的初始功率中的目标功率,所述目标功率表征每两个相邻声音通道位置对应的初始功率中的较大值;
    针对每个所述声音通道位置,根据该声音通道位置对应的初始功率、该声音通道位置相邻的下一声音通道位置对应的初始功率以及各所述目标功率,确定得到该声音通道位置的功率权重;
    根据该声音通道位置对应的初始功率以及该声音通道位置的功率权重,确定得到该声音通道位置的功率。
  9. 如权利要求2所述的声音信号处理方法,其中,所述根据采集所述待处理声源数据的各声音通道位置以及各所述候选声源的位置信息,确定得到每个所述候选声源的位置导向信息,包括:
    针对每个所述候选声源,根据该候选声源的位置信息,确定得到该候选声源的信号到达各声源通道位置的时间信息;
    根据该候选声源的信号到达各所述声源通道位置的时间信息,得到该候选声源的位置导向信息。
  10. 如权利要求1所述的声音信号处理方法，其中，所述对待处理声源数据进行声源分离处理，得到所述待处理声源数据对应的候选声源以及所述待处理声源数据中属于各个所述候选声源的声音信号包括：
    对所述待处理声源数据进行声源分离,得到所述待处理声源数据对应的预测声源以及所述待处理声源数据中属于各个所述预测声源的声音信号;
    计算各所述预测声源的声音信号之间的互相关系数,得到相关系数矩阵;
    根据所述相关系数矩阵，从各所述预测声源中确定得到候选声源以及候选声源的声音信号。
  11. 如权利要求1所述的声音信号处理方法,其中,所述对每个所述候选声源的声音信号进行质量评估,确定每个所述候选声源的声音信号对应的评估值包括:
    对每个所述候选声源的声音信号进行时频域转换,得到每个所述候选声源的声音信号的时域信号;
    确定每个所述候选声源的声音信号的时域信号对应的峭度值,将所述峭度值设置为该候选声源的声音信号对应的评估值。
  12. 如权利要求1所述的声音信号处理方法,其中,所述对每个所述候选声源的声音信号进行质量评估,确定每个所述候选声源的声音信号对应的评估值包括:
    获取每个所述候选声源的声音信号的语音特征向量;
    确定每个所述候选声源的声音信号的语音特征向量所对应的概率分值;所述概率分值表征语音特征向量是唤醒词对应的语音特征向量的概率;
    根据每个所述候选声源的声音信号的语音特征向量所对应的概率分值,确定每个候选声源的声音信号对应的评估值。
  13. 如权利要求1所述的声音信号处理方法，其中，所述根据每个所述候选声源的声音信号对应的评估值，从多个所述候选声源中确定得到目标声源包括：
    根据每个所述候选声源的声音信号对应的评估值,确定得到最大评估值对应的候选声源;
    将所述最大评估值对应的候选声源设置为目标声源。
  14. 如权利要求12所述的声音信号处理方法,其中,所述确定每个所述候选声源的声音信号的语音特征向量所对应的概率分值,包括:
    根据每个所述候选声源的声音信号的语音特征向量,得到每个所述候选声源的声音信号的语义特征;
    将每个所述候选声源的声音信号的语义特征与预设语义特征进行比对,得到每个所述候选声源的声音信号的语义特征与预设语义特征的相似程度;
    将每个所述候选声源的声音信号的语义特征与预设语义特征的相似程度,设置为每个所述候选声源的声音信号的语音特征向量所对应的概率分值。
  15. 如权利要求12所述的声音信号处理方法,其中,所述根据每个所述候选声源的声音信号的语音特征向量所对应的概率分值,确定每个候选声源的声音信号对应的评估值,包括:
    对于每个所述候选声源,将所述候选声源的声音信号的语音特征向量所对应的概率分值与预设概率阈值进行比较;
    若所述候选声源的声音信号的语音特征向量所对应的概率分值大于预设概率阈值,则将所述候选声源的声音信号对应的评估值设置为第一预设值;
    若所述候选声源的声音信号的语音特征向量所对应的概率分值小于或等于所述预设概率阈值,则将所述候选声源的声音信号对应的评估值设置为第二预设值。
  16. 如权利要求12所述的声音信号处理方法,其中,所述根据每个所述候选声源的声音信号的语音特征向量所对应的概率分值,确定每个候选声源的声音信号对应的评估值,包括:
    对于每个所述候选声源,将所述候选声源的声音信号的语音特征向量所对应的概率分值,查询预存的评估数据;
    确定所述候选声源的声音信号的语音特征向量所对应的概率分值所在的概率区间,以及所述概率区间所对应的评估分数;
    将所述概率区间所对应的评估分数,设置为所述候选声源的声音信号对应的评估值,其中,预存的评估数据包括多个概率区间以及每个概率区间所对应的评估分数。
  17. 如权利要求2所述的声音信号处理方法,其中,所述对待处理声源数据进行声源位置估计,确定得到所述待处理声源数据对应的候选声源以及每个候选声源的位置信息,包括:
    对所述待处理声源数据进行频域转换,得到所述待处理声源数据的频域信号;
    通过滤波器对所述待处理声源数据的频域信号进行滤波处理，得到所述频域信号的低频信号和高频信号；
    根据所述频域信号的低频信号，通过SRP方法对电子设备中设置的麦克风阵列中的每个麦克风通道的低频信号进行时延估计，得到所述麦克风阵列在每个预设区域的可控响应功率函数值，以及，选取出多个可控响应功率函数值大于或等于预设函数值阈值的预设区域，将选取出的预设区域设置为候选声源所在的估计区域；
    根据所述频域信号的高频信号，通过SRP方法对电子设备中设置的麦克风阵列中的每个麦克风通道的高频信号进行时延估计，得到所述麦克风阵列在每个估计区域的可控响应功率函数值，以及，选取出可控响应功率函数值大于或等于预设函数值阈值的估计区域；
    将选取出的每个可控响应功率函数值大于或等于预设函数值阈值的估计区域,设置为每个候选声源所在的位置,以及,将选取出的每个可控响应功率函数值大于或等于预设函数值阈值的估计区域的位置信息设置为每个候选声源的位置信息。
  18. 一种声音信号处理装置,其中,所述装置包括:
    分离模块,用于对待处理声源数据进行声源分离处理,得到所述待处理声源数据对应的候选声源以及所述待处理声源数据中属于各个所述候选声源的声音信号;
    评估模块,用于对每个所述候选声源的声音信号进行质量评估,确定每个所述候选声源的声音信号的评估值;
    选取模块,用于根据每个所述候选声源的声音信号对应的评估值,从多个所述候选声源中确定得到目标声源;
    处理模块,用于对所述目标声源的声音信号进行处理。
  19. 一种电子设备,其中,包括存储器和处理器;所述存储器存储有计算机程序,所述处理器用于运行所述存储器内的计算机程序,以执行权利要求1至17任一项所述的声音信号处理方法中的操作。
  20. 一种存储介质,其中,所述存储介质存储有多条指令,所述指令适于处理器进行加载,以执行权利要求1至17任一项所述的声音信号处理方法中的步骤。
PCT/CN2023/092372 2022-08-05 2023-05-05 声音信号处理方法、装置、电子设备和存储介质 WO2024027246A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210944168.2A CN117153186A (zh) 2022-08-05 2022-08-05 声音信号处理方法、装置、电子设备和存储介质
CN202210944168.2 2022-08-05

Publications (1)

Publication Number Publication Date
WO2024027246A1 true WO2024027246A1 (zh) 2024-02-08

Family

ID=88904825

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092372 WO2024027246A1 (zh) 2022-08-05 2023-05-05 声音信号处理方法、装置、电子设备和存储介质

Country Status (2)

Country Link
CN (1) CN117153186A (zh)
WO (1) WO2024027246A1 (zh)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100070274A1 (en) * 2008-09-12 2010-03-18 Electronics And Telecommunications Research Institute Apparatus and method for speech recognition based on sound source separation and sound source identification
JP2016045225A (ja) * 2014-08-19 2016-04-04 日本電信電話株式会社 音源数推定装置、音源数推定方法および音源数推定プログラム
CN106797413A (zh) * 2014-09-30 2017-05-31 惠普发展公司,有限责任合伙企业 声音调节
US20180299527A1 (en) * 2015-12-22 2018-10-18 Huawei Technologies Duesseldorf Gmbh Localization algorithm for sound sources with known statistics
CN113327624A (zh) * 2021-05-25 2021-08-31 西北工业大学 一种采用端到端时域声源分离系统进行环境噪声智能监测的方法
CN113096684A (zh) * 2021-06-07 2021-07-09 成都启英泰伦科技有限公司 一种基于双麦克风阵列的目标语音提取方法
CN113889138A (zh) * 2021-06-07 2022-01-04 成都启英泰伦科技有限公司 一种基于双麦克风阵列的目标语音提取方法
CN114220454A (zh) * 2022-01-25 2022-03-22 荣耀终端有限公司 一种音频降噪方法、介质和电子设备

Also Published As

Publication number Publication date
CN117153186A (zh) 2023-12-01

Similar Documents

Publication Publication Date Title
CN110176226B (zh) 一种语音识别、及语音识别模型训练方法及装置
CN110164469B (zh) 一种多人语音的分离方法和装置
CN106710596B (zh) 回答语句确定方法及装置
CN109558512B (zh) 一种基于音频的个性化推荐方法、装置和移动终端
CN109903773B (zh) 音频处理方法、装置及存储介质
CN110163367B (zh) 一种终端部署方法和装置
EP3493198A1 (en) Method and device for determining delay of audio
WO2020088153A1 (zh) 语音处理方法、装置、存储介质和电子设备
WO2019128639A1 (zh) 音频信号底鼓节拍点的检测方法以及终端
CN107993672B (zh) 频带扩展方法及装置
CN112820299B (zh) 声纹识别模型训练方法、装置及相关设备
CN113676226A (zh) 导频信息符号发送方法、信道估计方法及通信设备
CN111477243B (zh) 音频信号处理方法及电子设备
CN111883091A (zh) 音频降噪方法和音频降噪模型的训练方法
CN109302528B (zh) 一种拍照方法、移动终端及计算机可读存储介质
CN110517677B (zh) 语音处理系统、方法、设备、语音识别系统及存储介质
CN112735388B (zh) 网络模型训练方法、语音识别处理方法及相关设备
CN117332844A (zh) 对抗样本生成方法、相关装置及存储介质
CN112748899A (zh) 一种数据处理方法和相关设备
WO2024027246A1 (zh) 声音信号处理方法、装置、电子设备和存储介质
CN106782614B (zh) 音质检测方法及装置
CN112948763B (zh) 件量预测方法、装置、电子设备及存储介质
CN112489644B (zh) 用于电子设备的语音识别方法及装置
CN111091180B (zh) 一种模型训练方法和相关装置
CN117012202B (zh) 语音通道识别方法、装置、存储介质及电子设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23848957

Country of ref document: EP

Kind code of ref document: A1