WO2020134851A1 - Voice signal conversion method, apparatus, device, and storage medium - Google Patents

Voice signal conversion method, apparatus, device, and storage medium

Info

Publication number
WO2020134851A1
Authority
WO
WIPO (PCT)
Prior art keywords
segmented
original
frequency domain
target
signal
Prior art date
Application number
PCT/CN2019/121838
Other languages
English (en)
French (fr)
Inventor
吴晓婕
Original Assignee
广州市百果园信息技术有限公司 (Guangzhou Baiguoyuan Information Technology Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 广州市百果园信息技术有限公司
Priority to RU2021119297A (patent RU2770747C1)
Priority to EP19902578.4A (patent EP3905243A4)
Priority to SG11202106539QA (patent SG11202106539QA)
Priority to US17/416,709 (patent US20220051685A1)
Publication of WO2020134851A1 (Chinese-language publication)



Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0324 Details of processing therefor
    • G10L21/034 Automatic adjustment
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/90 Pitch determination of speech signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003 Changing voice quality, e.g. pitch or formants
    • G10L21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013 Adapting to target pitch
    • G10L2021/0135 Voice conversion or morphing

Definitions

  • Embodiments of the present application relate to the field of voice recognition technology, for example, to a voice signal conversion method, apparatus, device, and storage medium.
  • When a voice signal is transposed (pitch-shifted), the voice characteristics of the speaker may also change, so that the played voice differs considerably from the speaker's actual voice.
  • For example, if a male voice signal is raised by 4 semitones, it sounds like a girl's voice, which is a timbre error.
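For concreteness: in twelve-tone equal temperament, a shift of n semitones scales frequency by 2^{n/12}, so the 4-semitone example raises the fundamental by roughly 26%. The 120 Hz starting value below is an illustrative assumption, not a figure from the application. When the whole spectrum is scaled by this factor, the formants move up along with the fundamental, which is why the result sounds like a different speaker.

```latex
\begin{aligned}
r &= 2^{n/12}, \qquad n = 4 \;\Rightarrow\; r = 2^{1/3} \approx 1.260,\\
f_0' &= r\, f_0 \approx 1.260 \times 120\ \mathrm{Hz} \approx 151\ \mathrm{Hz}.
\end{aligned}
```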
  • In the related art, a fixed-length window function is usually applied to the short-time Fourier transforms of the voice signals before and after transposition to obtain their formant envelopes, and the transposed speech signal is then processed according to these formant envelopes to obtain a transposed speech signal free of timbre error.
  • However, with a fixed-length window the determined formant envelope is inaccurate, so the voice characteristics of the final transposed speech signal are inconsistent with those of the speech signal before transposition.
  • As a result, the quality of the transposed voice signal is poor, and the timbre error is not eliminated.
  • The embodiments of the present application provide a voice signal conversion method, apparatus, device, and storage medium that, while transposing the original voice signal, keep the voice characteristics of the voice signal consistent before and after transposition and improve the quality of the transposed voice signal.
  • An embodiment of the present application provides a voice signal conversion method.
  • The method includes:
  • the original segmented window function corresponding to each segmented original frequency domain signal is determined according to the fundamental frequency of that signal and the segment length;
  • the target segmented window function corresponding to each segmented target frequency domain signal is determined according to the fundamental frequency of that signal and the segment length;
  • a transposed speech signal is determined.
  • An embodiment of the present application provides a voice signal conversion apparatus.
  • The apparatus includes:
  • a segmentation and transform module, configured to segment the original speech signal and the initial target speech signal obtained by transposing the original speech signal, and to Fourier transform the resulting segmented original speech signals and segmented target speech signals, to obtain multiple segmented original frequency domain signals and multiple segmented target frequency domain signals;
  • an envelope determination module, configured to filter the multiple segmented original frequency domain signals with multiple original segmented window functions, respectively, to obtain multiple original formant envelopes, and to filter the multiple segmented target frequency domain signals with multiple target segmented window functions, respectively, to obtain multiple target formant envelopes;
  • where the original segmented window function corresponding to each segmented original frequency domain signal is determined according to the fundamental frequency of that signal and the segment length, and the target segmented window function corresponding to each segmented target frequency domain signal is determined according to the fundamental frequency of that signal and the segment length; and
  • a transposed speech determination module, configured to determine the transposed speech signal according to the multiple segmented target frequency domain signals, the multiple original formant envelopes, and the multiple target formant envelopes.
  • An embodiment of the present application provides a device, which includes:
  • one or more processors;
  • a storage device, configured to store one or more programs;
  • where, when the one or more programs are executed by the one or more processors, the one or more processors implement the voice signal conversion method described in any embodiment of the present application.
  • An embodiment of the present application provides a computer-readable storage medium that stores a computer program, and when the program is executed by a processor, a method for converting a voice signal according to any embodiment of the present application is implemented.
  • FIG. 1A is a flowchart of a voice signal conversion method according to Embodiment 1 of the present application.
  • FIG. 1B is a schematic diagram of the principle of the voice signal conversion process provided in Embodiment 1 of the present application;
  • FIG. 2 is a schematic diagram of the principle of the fundamental frequency detection and window function construction process provided in Embodiment 2 of the present application;
  • FIG. 3 is a schematic diagram of the principle of the voice signal conversion process provided in Embodiment 3 of the present application.
  • FIG. 4 is a schematic structural diagram of a voice signal conversion device according to Embodiment 4 of the present application.
  • FIG. 5 is a schematic structural diagram of a device according to Embodiment 5 of the present application.
  • The formant envelope determines the voice quality, that is, the voice characteristics; the embodiments of this application mainly address the consistency of the formant envelope of the voice signal before and after transposition.
  • A formant envelope preservation algorithm is used to eliminate the effect of transposition on the target formant envelope, so that the voice signals before and after transposition have the same formant envelope, which improves the voice quality of the transposed voice signal.
  • FIG. 1A is a flowchart of a voice signal conversion method according to Embodiment 1 of the present application.
  • This embodiment can be applied to any device capable of transposing voice signals.
  • The technical solutions in the embodiments of the present application apply to cases where the voice characteristics of a voice signal must remain consistent before and after transposition.
  • a voice signal conversion method provided in this embodiment may be performed by a voice signal conversion device provided in an embodiment of the present application.
  • the device may be implemented in software and/or hardware, and integrated in a device that executes the method.
  • The device may be a smart terminal, such as a smartphone, a tablet, or a handheld computer, on which an application capable of transposing voice signals is installed.
  • the method may include the following steps:
  • The original voice signal is the voice signal of a voice user as initially captured by a voice collector, before any processing.
  • The original voice signal is a discrete signal consisting of a large number of voice sampling points.
  • The original voice signal captured by the voice collector is obtained first, and is then transposed.
  • S120 Transpose the original voice signal to obtain an initial target voice signal.
  • Transposition means adjusting the pitch of a voice signal, that is, its main (fundamental) frequency; for example, defective notes in a singer's original recording can be corrected by transposing the voice signal.
  • The transposition requirement is determined, and a corresponding transposition parameter is set in the voice pitch-shifting software according to that requirement.
  • The original speech signal is transposed according to the transposition parameter and a pitch-shifting algorithm to obtain the initial target speech signal. Because transposition destroys the voice characteristics of the original speech signal, the voice characteristics of the initial target speech signal differ from those of the original speech signal, and the initial target speech signal cannot be output directly. The changed voice characteristics must first be restored, so that when the final voice signal is played, other users can still tell which voice user produced it.
  • transposing the original speech signal to obtain the initial target speech signal may include: acquiring a transposition amplitude; transposing the original speech signal according to the transposition amplitude to obtain the initial target speech signal.
  • For example, the original voice signal can be processed with a pitch-shift algorithm.
  • The transposition amplitude for the current pitch change is determined in advance and set in the pitch-shifting algorithm, which then adjusts the pitch of the original voice signal by that amplitude to obtain the initial target voice signal.
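The application does not specify the pitch-shift algorithm. Purely as an illustration, the sketch below converts a transposition amplitude in semitones to a frequency ratio and raises the pitch by resampling with linear interpolation; `transposition_ratio` and `naive_pitch_shift` are hypothetical names. This naive method changes the signal's duration and scales the formants together with the pitch, which is the kind of timbre change the later steps are meant to undo.

```python
from typing import List


def transposition_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of `semitones` in twelve-tone equal temperament."""
    return 2.0 ** (semitones / 12.0)


def naive_pitch_shift(samples: List[float], semitones: float) -> List[float]:
    """Illustrative pitch shift by resampling with linear interpolation.

    Reading the input `ratio` times faster raises every frequency (pitch and
    formants alike) by `ratio`, but also shortens the signal by that factor;
    a practical pitch-shift algorithm additionally time-stretches to keep the
    original duration.
    """
    ratio = transposition_ratio(semitones)
    out_len = int(len(samples) / ratio)
    out = []
    for n in range(out_len):
        pos = n * ratio              # fractional read position in the input
        i = int(pos)
        frac = pos - i
        nxt = samples[i + 1] if i + 1 < len(samples) else samples[i]
        out.append((1.0 - frac) * samples[i] + frac * nxt)
    return out
```

For example, `naive_pitch_shift(samples, 4)` reads the input about 1.26 times faster, so a 1000-sample input yields 793 output samples.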
  • S130 Segment the original speech signal and the initial target speech signal, respectively, and Fourier transform the segmented original speech signals and the segmented target speech signals obtained after segmentation, respectively, to obtain multiple segmented original frequency domain signals and multiple segmented target frequency domain signals.
  • the Fourier transform is a transform method that converts a time-domain signal into a frequency-domain signal.
  • Information that cannot be obtained in the time domain can thus be analyzed in the frequency domain.
  • The original voice signal, produced by a voice user, contains different frequency content at different times within a period. If a single frequency domain signal is computed over the entire time domain, the resulting spectrum is determined by all of the voice information at once; it cannot reflect the frequency characteristics of local time intervals, so the frequency content of different time periods cannot be analyzed. Therefore, in this embodiment, a short-time Fourier transform is applied to the original voice signal and the initial target voice signal to obtain their frequency domain information in different time periods.
  • Short-time Fourier transform refers to representing the frequency domain characteristics of a moment by the frequency domain signal corresponding to a segment of speech signal within a specified time window.
  • The original voice signal and the initial target voice signal are first segmented separately to obtain multiple segmented original speech signals and multiple segmented target speech signals, so that the segmented original and target speech signals in the same time segment can subsequently be analyzed separately.
  • the multiple segmented original speech signals and multiple segmented target speech signals are all Fourier transformed, thereby obtaining multiple segmented original frequency domain signals and multiple segmented target frequency domain signals within multiple segments.
  • The segmented original frequency domain signals and segmented target frequency domain signals obtained by Fourier transform after segmentation likewise correspond one-to-one across the segments.
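A minimal sketch of the segment-then-transform step, assuming a segment length `seg_len` and a hop (segment displacement) `hop`. It uses a direct O(N²) DFT from the standard library for clarity; a real implementation would use an FFT such as `numpy.fft.rfft`. The application does not say whether an analysis window is applied before the transform, so none is used here, and the function names are hypothetical. Running it on both the original and the initial target signal with the same parameters gives segment spectra that correspond one-to-one.

```python
import cmath
import math
from typing import List


def segment(signal: List[float], seg_len: int, hop: int) -> List[List[float]]:
    """Split the signal into full segments of seg_len samples whose starts are hop apart."""
    return [signal[s:s + seg_len] for s in range(0, len(signal) - seg_len + 1, hop)]


def dft(frame: List[float]) -> List[complex]:
    """Direct discrete Fourier transform of one segment (O(N^2), for clarity only)."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]


def short_time_spectra(signal: List[float], seg_len: int, hop: int) -> List[List[complex]]:
    """Segment the signal, then Fourier-transform every segment separately."""
    return [dft(frame) for frame in segment(signal, seg_len, hop)]
```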
  • S140 Filter the multiple segmented original frequency domain signals with the corresponding original segmented window functions, respectively, to obtain multiple original formant envelopes, and filter the multiple segmented target frequency domain signals with the corresponding target segmented window functions, respectively, to obtain multiple target formant envelopes.
  • The original segmented window function corresponding to each segmented original frequency domain signal is determined according to that signal's fundamental frequency and the segment length, and the target segmented window function corresponding to each segmented target frequency domain signal is determined according to that signal's fundamental frequency and the segment length.
  • Both the original and the target segmented window functions are adaptive variable-length window functions: because the segmented original frequency domain signals have different fundamental frequencies, the resulting original segmented window functions have different lengths, and likewise, because the segmented target frequency domain signals have different fundamental frequencies, the target segmented window functions have different lengths.
  • Processing the voice signals before and after transposition in each segment with an adaptive variable-length window function reduces processing error.
  • The fundamental frequency of a segmented original speech signal is the fundamental frequency component it contains, which is reflected in the corresponding segmented original frequency domain signal.
  • Likewise, the fundamental frequency of a segmented target frequency domain signal is the fundamental frequency contained in that segment's target signal, which is reflected in the segmented target frequency domain signal.
  • The segment length is the number of sampling points contained in the speech signal in each segment, generally a power of two (2^n), for example 1024 or 2048.
  • A formant is a region of the frequency domain signal where sound energy is relatively concentrated; formants determine the timbre of a sound,
  • so the formants of a voice signal can be used to tell which voice user produced it.
  • The formant envelope is the curve obtained by connecting the amplitude peaks at different frequencies in the frequency domain signal; it represents the voice characteristics of the voice user in the current segment.
  • The fundamental frequency of the segmented target frequency domain signal in a segment can be computed directly from the fundamental frequency of the segmented original frequency domain signal in that segment and the transposition amplitude, without detecting the fundamental frequencies of the segmented target frequency domain signals again; this avoids extra detection operations and increases the signal processing rate.
  • The fundamental frequency of each segmented original frequency domain signal may be detected first, and the corresponding original segmented window function is then determined from that fundamental frequency and the segment length.
  • Because different segmented original frequency domain signals have different fundamental frequencies, they have different original segmented window functions. The same method is applied to the segmented target frequency domain signals: their fundamental frequencies and the segment length determine the target segmented window functions corresponding to the multiple segmented target frequency domain signals.
  • Each segmented original frequency domain signal is filtered with its corresponding original segmented window function to obtain the corresponding original formant envelope; at the same time, each segmented target frequency domain signal is filtered with its corresponding target segmented window function to obtain the corresponding target formant envelope.
  • The numbers of original formant envelopes and target formant envelopes both equal the number of segments.
  • In this embodiment, the window function filters the frequency domain signal and can be understood as a low-pass filter of various forms; because the window length is adaptive, the low-pass filtering characteristics change with the characteristics of each frequency domain signal.
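Treating each window as a low-pass filter along the frequency axis, as described above, one minimal way to turn a segment's magnitude spectrum into a formant-envelope estimate is a window-weighted moving average. This is a sketch under the assumption that the window (built per segment from its fundamental frequency and the segment length) spans about one harmonic spacing, so the individual harmonic peaks are averaged away and the slowly varying envelope remains. The application does not give the filtering formula; `hann` and `smooth_spectrum` are hypothetical names.

```python
import math
from typing import List


def hann(length: int) -> List[float]:
    """Symmetric Hann window without the zero-valued end points."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * (i + 1) / (length + 1)) for i in range(length)]


def smooth_spectrum(magnitude: List[float], window: List[float]) -> List[float]:
    """Low-pass filter a magnitude spectrum along the frequency axis.

    Each output bin is the window-weighted average of the neighbouring bins,
    with the window centred on the bin. Taps that fall outside the spectrum
    are skipped and the weights renormalised, so a flat spectrum stays flat.
    """
    half = len(window) // 2
    n = len(magnitude)
    envelope = []
    for k in range(n):
        acc = norm = 0.0
        for j, w in enumerate(window):
            idx = k + j - half
            if 0 <= idx < n:
                acc += w * magnitude[idx]
                norm += w
        envelope.append(acc / norm)
    return envelope
```

The same function would be applied to each segmented original spectrum with its original window and to each segmented target spectrum with its target window.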
  • S150 Determine a transposed speech signal according to the multiple segmented target frequency domain signals, the multiple original formant envelopes, and the multiple target formant envelopes.
  • The transposed speech signal is the final result of transposing the original speech signal: the influence of transposition on the voice characteristics has been eliminated, so the output speech signal has voice characteristics consistent with those of the original speech signal.
  • Within each segment, the ratio between the original and target formant envelopes represents the change in voice characteristics between the segmented original frequency domain signal before transposition and the segmented target frequency domain signal after transposition.
  • The final segmented frequency domain signal of each segment is determined from that segment's target frequency domain signal and this ratio; the segmented frequency domain signals of all segments, obtained from the segmented target frequency domain signals and their ratios, together form the final transposed frequency domain signal, from which the final transposed voice signal is determined.
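Under one plausible reading of this step (the exact formula is not given here), each target bin is scaled by the ratio of the original to the target formant envelope in the same segment, so the transposed harmonics are re-weighted by the original timbre. The sketch below assumes that reading; `eps` is an added guard against division by zero, and the function names are hypothetical. Converting the corrected segment spectra back to a time-domain signal (inverse transform and overlap-add) is not shown.

```python
from typing import List


def correct_segment(target_spec: List[complex], orig_env: List[float],
                    target_env: List[float], eps: float = 1e-12) -> List[complex]:
    """Re-impose the original formant envelope on one transposed segment spectrum.

    Each bin is scaled by orig_env / target_env. The gain is a non-negative real
    number, so only the magnitude changes and the phase of the target bin is kept.
    """
    return [y * (e_o / max(e_t, eps))
            for y, e_o, e_t in zip(target_spec, orig_env, target_env)]


def correct_all(target_specs: List[List[complex]], orig_envs: List[List[float]],
                target_envs: List[List[float]]) -> List[List[complex]]:
    """Apply the correction segment by segment (segments correspond one-to-one)."""
    return [correct_segment(s, eo, et)
            for s, eo, et in zip(target_specs, orig_envs, target_envs)]
```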
  • In the technical solution provided in this embodiment, the original speech signal and the initial target speech signal obtained by transposing it are each segmented, and the resulting segmented original speech signals and segmented target speech signals are Fourier transformed to obtain multiple segmented original frequency domain signals and multiple segmented target frequency domain signals.
  • Multiple original segmented window functions are determined from the fundamental frequencies of the segmented original frequency domain signals and the segment length, and multiple target segmented window functions are determined from the fundamental frequencies of the segmented target frequency domain signals and the segment length, so different segments can use different window functions.
  • The original and target segmented window functions filter the segmented original and segmented target frequency domain signals, respectively, to obtain multiple original formant envelopes and multiple target formant envelopes, which reduces the error in obtaining the formant envelopes before and after transposition. The final transposed speech signal is then determined from the segmented target frequency domain signals and the formant envelopes before and after transposition, which eliminates the effect of transposition on the target formant envelope.
  • The voice signals before and after transposition therefore have the same formant envelope, which keeps the voice characteristics consistent before and after transposition and improves the voice quality of the transposed voice signal.
  • FIG. 2 is a schematic diagram of the principle of the fundamental frequency detection and window function construction process provided in Embodiment 2 of the present application. This embodiment builds on the above embodiment and describes how the fundamental frequencies of the segmented original frequency domain signals, obtained by segmenting and Fourier transforming the original speech signal, are detected, and how the original segmented window functions corresponding to the segmented original frequency domain signals and the target segmented window functions corresponding to the segmented target frequency domain signals are constructed.
  • S2020 Transpose the original voice signal to obtain the initial target voice signal.
  • S2030 Segment the original speech signal and the initial target speech signal separately, and Fourier transform the segmented original speech signals and segmented target speech signals, respectively, to obtain multiple segmented original frequency domain signals and multiple segmented target frequency domain signals.
  • The segmented original frequency domain signals and segmented target frequency domain signals are subsequently filtered with window functions to determine the corresponding formant envelopes, and those window functions depend on the fundamental frequencies.
  • The fundamental frequency of each segmented original frequency domain signal must therefore be detected first, which means determining whether each of the segmented original frequency domain signals carries a fundamental frequency.
  • The result of this check may be marked for each segment: if the segmented original frequency domain signal in the current segment carries a fundamental frequency, the detected fundamental frequency value is recorded; if it does not, the segmented original frequency domain signal in the current segment is marked with a preset flag.
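The application does not say how the fundamental frequency is detected. As one common choice, assumed here rather than taken from the application, the sketch estimates F0 from the autocorrelation of a segment's time-domain samples and returns `None` as a hypothetical "preset flag" when no clear periodicity is found. The search range `fmin`/`fmax` and the `threshold` are illustrative defaults, and `detect_f0` is a hypothetical name.

```python
import math
from typing import List, Optional

UNVOICED: Optional[float] = None  # hypothetical "preset flag" for segments with no F0


def detect_f0(frame: List[float], fs: float, fmin: float = 60.0,
              fmax: float = 500.0, threshold: float = 0.5) -> Optional[float]:
    """Estimate the F0 of one segment by autocorrelation, or return UNVOICED.

    r(lag) is the autocorrelation normalised by the frame energy; the lag with
    the largest r inside [fs/fmax, fs/fmin] is taken as the pitch period. If
    that peak is below `threshold`, the segment is treated as carrying no F0.
    """
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return UNVOICED
    min_lag = max(1, int(fs / fmax))
    max_lag = min(len(frame) - 1, int(fs / fmin))
    best_lag, best_r = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag)) / energy
        if r > best_r:
            best_lag, best_r = lag, r
    if best_lag == 0 or best_r < threshold:
        return UNVOICED
    return fs / best_lag


if __name__ == "__main__":
    fs = 8000.0
    tone = [math.sin(2 * math.pi * 100.0 * t / fs) for t in range(400)]
    print(detect_f0(tone, fs))  # expected to be close to 100 Hz
```

Because r is normalised by the full-frame energy, it shrinks as the lag grows; this discourages picking a multiple of the true period, but it also means a segment should span at least two periods to pass the threshold.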
  • S2050 Use the carried fundamental frequency as the fundamental frequency of the original frequency domain signal of each segment.
  • the carried fundamental frequency is directly used as the fundamental frequency of the original frequency domain signal of the current segment.
  • S2060 Determine the fundamental frequency of each segmented original frequency domain signal according to the fundamental frequency of the preceding segmented original frequency domain signal and the fundamental frequency of the following segmented original frequency domain signal.
  • Fundamental frequency detection can fail: after the original speech signal is segmented and Fourier transformed, the segmented original frequency domain signals corresponding to unvoiced (soft) or weak parts of the signal may not carry a fundamental frequency.
  • When the current segmented original frequency domain signal does not carry a fundamental frequency, its fundamental frequency is determined from the fundamental frequencies of the preceding and following segmented original frequency domain signals, which smooths the fundamental frequency detection result.
  • Determining the fundamental frequency of each segmented original frequency domain signal may include: computing it from the fundamental frequency of the preceding segmented original frequency domain signal and the fundamental frequency of the following segmented original frequency domain signal.
  • For example, an interpolation algorithm may be applied to the fundamental frequencies of the preceding and following segmented original frequency domain signals to obtain the fundamental frequency of the current segmented original frequency domain signal.
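The interpolation method is left open. The sketch below assumes linear interpolation between the nearest segments that do carry a fundamental frequency, which also covers runs of several consecutive flagged segments, and, as a further assumption, copies the single available neighbour at the start or end of the signal. `None` stands in for the preset flag; `fill_f0` is a hypothetical name.

```python
from typing import List, Optional


def fill_f0(track: List[Optional[float]]) -> List[Optional[float]]:
    """Replace None (no detected F0) by interpolating the nearest voiced neighbours."""
    voiced = [i for i, f in enumerate(track) if f is not None]
    out = list(track)
    if not voiced:
        return out  # no segment carries an F0, so there is nothing to interpolate from
    for i, f in enumerate(track):
        if f is not None:
            continue
        prev = max((j for j in voiced if j < i), default=None)
        nxt = min((j for j in voiced if j > i), default=None)
        if prev is not None and nxt is not None:
            w = (i - prev) / (nxt - prev)  # relative position between the neighbours
            out[i] = (1.0 - w) * track[prev] + w * track[nxt]
        else:
            out[i] = track[prev if prev is not None else nxt]  # edge: copy the one neighbour
    return out
```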
  • S2070 Determine the fundamental frequency of each segmented target frequency domain signal as the product of the fundamental frequency of the corresponding segmented original frequency domain signal and the transposition amplitude.
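Written out, with f_{0,i} the fundamental frequency of the i-th segmented original frequency domain signal and α the transposition amplitude expressed as a frequency ratio (this notation is ours, not the application's; if the amplitude were given as n semitones, the ratio would be 2^{n/12}):

```latex
\hat{f}_{0,i} = \alpha \, f_{0,i}, \qquad \alpha = 2^{n/12}
```

Only the original segments need a fundamental frequency detector; the target values follow from this multiplication.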
  • S2080 Obtain the original window length corresponding to each segmented original frequency domain signal according to its fundamental frequency and the segment length, and construct the original segmented window function corresponding to each segmented original frequency domain signal according to its original window length and a preset window type.
  • That is, the original window length of the window function used in each segment can be determined from the fundamental frequency of the corresponding segmented original frequency domain signal and the segment length.
  • The preset window type is the type of window function, which may be a triangular window, a rectangular window, a Hanning window, or the like; this embodiment does not limit it.
  • Multiple original segmented window functions corresponding to the multiple segmented original frequency domain signals can thus be constructed, and each is subsequently used to filter its corresponding segmented original frequency domain signal.
  • S2090 Obtain the target window length corresponding to each segmented target frequency domain signal according to its fundamental frequency and the segment length, and construct the target segmented window function corresponding to each segmented target frequency domain signal according to its target window length and the preset window type.
  • That is, the target window length of the window function used in each segment may be determined from the fundamental frequency of the corresponding segmented target frequency domain signal and the segment length.
  • Multiple target segmented window functions corresponding to the multiple segmented target frequency domain signals can thus be constructed, and each filters its corresponding segmented target frequency domain signal.
  • S2080 and S2090 have no fixed order and may be executed at the same time; this embodiment does not limit this.
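The text names the inputs to the window length (fundamental frequency and segment length) and a preset window type, but not the formula. The sketch below assumes a hypothetical rule: the window spans about one harmonic spacing on the frequency axis, F0·N/fs bins, where the sampling rate `fs` is an extra assumed parameter. A higher fundamental frequency therefore gives a longer window, in line with the adaptive variable-length behaviour described above. The window shapes follow the listed types; `window_length_bins`, `make_window`, and `segment_window` are illustrative names.

```python
import math
from typing import List


def window_length_bins(f0_hz: float, seg_len: int, fs: float) -> int:
    """Hypothetical rule: odd length spanning about one harmonic spacing (F0 * N / fs bins)."""
    half = max(1, round(f0_hz * seg_len / fs / 2))
    return 2 * half + 1


def make_window(kind: str, length: int) -> List[float]:
    """Build a window of the preset type, with nonzero end taps."""
    if kind == "rectangular":
        return [1.0] * length
    if kind == "triangular":
        mid = (length - 1) / 2
        return [1.0 - abs(i - mid) / (mid + 1) for i in range(length)]
    if kind == "hann":
        return [0.5 - 0.5 * math.cos(2 * math.pi * (i + 1) / (length + 1)) for i in range(length)]
    raise ValueError(f"unknown window type: {kind!r}")


def segment_window(f0_hz: float, seg_len: int, fs: float, kind: str = "hann") -> List[float]:
    """Window for one segment: length from that segment's F0 and the segment length."""
    return make_window(kind, window_length_bins(f0_hz, seg_len, fs))
```

The original windows would use the detected (or interpolated) original fundamental frequencies, and the target windows the target fundamental frequencies, with the same segment length and window type.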
  • S2110 Determine a transposed speech signal according to the multiple segmented target frequency domain signals, the multiple original formant envelopes, and the multiple target formant envelopes.
  • In the technical solution provided in this embodiment, the fundamental frequencies of the multiple segmented original frequency domain signals and of the multiple segmented target frequency domain signals are determined. Original window lengths for the segments are determined from the fundamental frequencies of the segmented original frequency domain signals and the segment length, and target window lengths from the fundamental frequencies of the segmented target frequency domain signals and the segment length.
  • Adaptive variable-length window functions are then constructed to filter the segmented original and segmented target frequency domain signals, respectively, yielding the corresponding original and target formant envelopes. This reduces the error in obtaining the formant envelopes before and after transposition, so that the effect of transposition on the target formant envelope can be eliminated using the envelopes before and after transposition. The voice signals before and after transposition then have the same formant envelope, which keeps the voice characteristics consistent before and after transposition and improves the voice quality of the transposed voice signal.
  • FIG. 3 is a schematic diagram of the principle of a speech signal transformation process provided in Embodiment 3 of the present application. This embodiment builds on the above embodiments. It describes how the speech signals are segmented and Fourier-transformed, and how the pitch-shifted speech signal is determined.
  • S320: Pitch-shift the original speech signal to obtain an initial target speech signal.
  • S330: Segment the original speech signal and the initial target speech signal according to the preset segment length and segment shift to obtain multiple segmented original speech signals and multiple segmented target speech signals.
  • The preset segment length is the number of sampling points the speech signal in each segment should contain. It is generally a power of two (2^n), for example 1024 or 2048.
  • The segment shift is the distance between the starting sampling points of adjacent segments. For example, with a preset segment length of 1024 and a segment shift of 512:
  • the first segment consists of sampling points 1-1024;
  • the second segment consists of sampling points 513-1536.
  • Using the preset segment length and segment shift, this embodiment segments the original speech signal and the initial target speech signal separately, obtaining a segmented original speech signal and a segmented target speech signal in each segment.
  • S340: Apply a Fourier transform to the multiple segmented original speech signals and the multiple segmented target speech signals, respectively, to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals.
  • Once the segmented original and segmented target speech signals are obtained, each is Fourier-transformed within its segment. This gives the segmented original and segmented target frequency-domain signals of every segment.
  • S350: Filter the multiple segmented original frequency-domain signals with the multiple original segment window functions, respectively, to obtain multiple original formant envelopes. Filter the multiple segmented target frequency-domain signals with the multiple target segment window functions, respectively, to obtain multiple target formant envelopes.
  • The original segment window function of each segmented original frequency-domain signal is determined by that signal's fundamental frequency and the segment length.
  • The target segment window function of each segmented target frequency-domain signal is determined by that signal's fundamental frequency and the segment length.
  • S360: Determine the pitch-shift ratio of each segmented target frequency-domain signal from its corresponding original formant envelope and target formant envelope.
  • Once the original formant envelope of each segmented original frequency-domain signal and the target formant envelope of each segmented target frequency-domain signal are available, the two envelopes of the same segment can be compared for a single segmented target frequency-domain signal. This comparison determines that signal's pitch-shift ratio, which represents the influence of the shifted target formant envelope on the voice characteristics during pitch shifting. The same method gives the pitch-shift ratios of all segmented target frequency-domain signals.
  • S370: Determine the segmented pitch-shifted frequency-domain signal of each segmented target frequency-domain signal from that signal and its pitch-shift ratio.
  • The segmented target frequency-domain signal corresponding to the target formant envelope can be multiplied by the pitch-shift ratio. This gives, for that segment, a segmented pitch-shifted frequency-domain signal with the pitch-shift influence removed.
  • This segmented pitch-shifted frequency-domain signal has the same formant envelope as the segmented original frequency-domain signal in the same segment. The same method gives the segmented pitch-shifted frequency-domain signals of all segments.
  • S380: Apply an inverse Fourier transform to the segmented pitch-shifted frequency-domain signal of each segmented target frequency-domain signal to obtain the corresponding segmented pitch-shifted speech signal.
  • The inverse transform is applied segment by segment, giving a segmented pitch-shifted speech signal in each segment. The final pitch-shifted speech signal is then determined from these segmented signals.
  • S390: Determine the pitch-shifted speech signal from the multiple segmented pitch-shifted speech signals, the preset segment length and the segment shift.
  • The segmented pitch-shifted speech signals can be assembled using the same preset segment length and segment shift that were used to segment the original speech signal. The result is the final pitch-shifted speech signal, in which the influence of the target formant envelope on the voice characteristics has been removed.
  • This pitch-shifted speech signal has the same formant envelope as the original speech signal, which keeps the voice characteristics consistent before and after pitch shifting.
  • For each segmented target frequency-domain signal, the corresponding pitch-shift ratio is determined from the formant envelopes before and after pitch shifting. The segmented pitch-shifted frequency-domain signal is then determined from the segmented target frequency-domain signal in that segment and this ratio, which removes the influence of the formant envelope in that segment.
  • The segmented pitch-shifted speech signals obtained by inverse Fourier transform are assembled into the pitch-shifted speech signal. This keeps the voice characteristics consistent before and after pitch shifting and improves the voice quality of the pitch-shifted speech signal.
  • FIG. 4 is a schematic structural diagram of a speech signal transformation apparatus according to Embodiment 4 of the present application.
  • The apparatus may include a segmentation and transform module 410 and an envelope determination module 420.
  • The segmentation and transform module 410 is configured to segment the original speech signal and the initial target speech signal obtained by pitch-shifting it. It then Fourier-transforms the resulting segmented original and segmented target speech signals to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals.
  • The envelope determination module 420 is configured to filter the segmented original frequency-domain signals with the original segment window functions to obtain multiple original formant envelopes. It also filters the segmented target frequency-domain signals with the target segment window functions to obtain multiple target formant envelopes.
  • A pitch-shifted speech determination module 430 is configured to determine the pitch-shifted speech signal from the multiple segmented target frequency-domain signals, original formant envelopes and target formant envelopes.
  • In the technical solution of this embodiment, the original speech signal and the initial target speech signal obtained by pitch-shifting it are segmented. The resulting segmented original and segmented target speech signals are Fourier-transformed to obtain segmented original and segmented target frequency-domain signals.
  • The original segment window functions are determined from the fundamental frequencies of the segmented original frequency-domain signals and the segment length. The target segment window functions are determined from the fundamental frequencies of the segmented target frequency-domain signals and the segment length. Different segments can therefore use different window functions.
  • Filtering the segmented original and target frequency-domain signals with these window functions yields the original and target formant envelopes and reduces their estimation error before and after pitch shifting. The final pitch-shifted speech signal is then determined from the segmented target frequency-domain signals and the formant envelopes before and after pitch shifting.
  • FIG. 5 is a schematic structural diagram of a device according to Embodiment 5 of the present invention. As shown in FIG. 5, the device includes a processor 50, a storage device 51, and a communication device 52.
  • As a computer-readable storage medium, the storage device 51 can store software programs, computer-executable programs and modules. These include the program instructions/modules corresponding to the speech signal transformation method described in any embodiment of the present invention.
  • By running the software programs, instructions and modules stored in the storage device 51, the processor 50 executes the device's functional applications and data processing, that is, implements the above speech signal transformation method.
  • Embodiment 6 of the present application also provides a computer-readable storage medium storing a computer program. When executed by a processor, the program can implement the speech signal transformation method of any embodiment of the present application.
  • The method may include the following steps:
  • Segment the original speech signal and the initial target speech signal obtained by pitch-shifting it, respectively. Apply a Fourier transform to the resulting segmented original and segmented target speech signals to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals.
  • Filter the multiple segmented original frequency-domain signals according to multiple original segment window functions to obtain multiple original formant envelopes. Filter the multiple segmented target frequency-domain signals according to multiple target segment window functions to obtain multiple target formant envelopes.
  • The original segment window function of each segmented original frequency-domain signal is determined by its fundamental frequency and the segment length. The target segment window function of each segmented target frequency-domain signal is determined by its fundamental frequency and the segment length.
  • Determine the pitch-shifted speech signal according to the multiple segmented target frequency-domain signals, original formant envelopes and target formant envelopes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

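A speech signal transformation method, apparatus, device and storage medium. The method comprises three steps. First, an original speech signal and an initial target speech signal obtained by pitch-shifting the original speech signal are segmented, respectively. A Fourier transform is applied to the resulting multiple segmented original speech signals and multiple segmented target speech signals, respectively, to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals (S130). Second, the multiple segmented original frequency-domain signals are filtered according to multiple original segment window functions, respectively, to obtain multiple original formant envelopes. The multiple segmented target frequency-domain signals are filtered according to multiple target segment window functions, respectively, to obtain multiple target formant envelopes (S140). Third, a pitch-shifted speech signal is determined according to the multiple segmented target frequency-domain signals, the multiple original formant envelopes and the multiple target formant envelopes (S150).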

Description

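Speech signal transformation method, apparatus, device and storage medium
This application claims priority to Chinese patent application No. 201811628761.6, filed with the China Patent Office on December 28, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present application relate to the field of speech recognition technology, for example to a speech signal transformation method, apparatus, device and storage medium.
Background
With the rapid development of Internet technology, entertainment software that changes the pitch of original speech through a pitch-shift algorithm has come into wide everyday use. Playing back pitch-shifted speech gives users a new way to relax. For example, when a singer's original recording is retouched, flawed notes are pitch-shifted to make the song more polished.
Processing the original speech with a pitch-shift algorithm does adjust the pitch, but it may also change the speaker's voice characteristics. The played speech may then differ considerably from the speaker's actual voice. For example, raising a male voice signal by 4 semitones can make it sound like a female voice, which introduces a voice error. To overcome this, the related art usually applies a fixed-length window function to the short-time Fourier transform signals of the speech before and after pitch shifting to obtain their respective formant envelopes. It then processes the pitch-shifted speech signal according to those envelopes to obtain a pitch-shifted speech signal free of the voice error. However, the window function used to determine the formant envelope in the related art has a fixed length, so the resulting formant envelope is inaccurate. As a result, the voice characteristics of the final pitch-shifted speech signal do not match those of the speech signal before pitch shifting. The pitch-shifted speech signal is of poor quality, and the voice error is not eliminated.
Summary
Embodiments of the present application provide a speech signal transformation method, apparatus, device and storage medium. While pitch-shifting an original speech signal, they keep the voice characteristics of the speech signal consistent before and after pitch shifting and improve the quality of the pitch-shifted speech signal.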
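An embodiment of the present application provides a speech signal transformation method, the method comprising:
segmenting an original speech signal and an initial target speech signal obtained by pitch-shifting the original speech signal, respectively, and applying a Fourier transform to the multiple segmented original speech signals and the multiple segmented target speech signals obtained after segmentation, respectively, to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals;
filtering the multiple segmented original frequency-domain signals according to multiple original segment window functions, respectively, to obtain multiple original formant envelopes, and filtering the multiple segmented target frequency-domain signals according to multiple target segment window functions, respectively, to obtain multiple target formant envelopes, wherein the original segment window function corresponding to each segmented original frequency-domain signal is determined according to the fundamental frequency of that segmented original frequency-domain signal and a segment length, and the target segment window function corresponding to each segmented target frequency-domain signal is determined according to the fundamental frequency of that segmented target frequency-domain signal and the segment length;
determining a pitch-shifted speech signal according to the multiple segmented target frequency-domain signals, the multiple original formant envelopes and the multiple target formant envelopes.
An embodiment of the present application provides a speech signal transformation apparatus, the apparatus comprising:
a segmentation and transform module, configured to segment an original speech signal and an initial target speech signal obtained by pitch-shifting the original speech signal, respectively, and to apply a Fourier transform to the multiple segmented original speech signals and the multiple segmented target speech signals obtained after segmentation, respectively, to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals;
an envelope determination module, configured to filter the multiple segmented original frequency-domain signals according to multiple original segment window functions, respectively, to obtain multiple original formant envelopes, and to filter the multiple segmented target frequency-domain signals according to multiple target segment window functions, respectively, to obtain multiple target formant envelopes, wherein the original segment window function corresponding to each segmented original frequency-domain signal is determined according to the fundamental frequency of that segmented original frequency-domain signal and a segment length, and the target segment window function corresponding to each segmented target frequency-domain signal is determined according to the fundamental frequency of that segmented target frequency-domain signal and the segment length;
a pitch-shifted speech determination module, configured to determine a pitch-shifted speech signal according to the multiple segmented target frequency-domain signals, the multiple original formant envelopes and the multiple target formant envelopes.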
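An embodiment of the present application provides a device, the device comprising:
one or more processors;
a storage device, configured to store one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the speech signal transformation method according to any embodiment of the present application.
An embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech signal transformation method according to any embodiment of the present application.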
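Brief Description of the Drawings
FIG. 1A is a flowchart of a speech signal transformation method provided in Embodiment 1 of the present application;
FIG. 1B is a schematic diagram of the principle of a speech signal transformation process provided in Embodiment 1 of the present application;
FIG. 2 is a schematic diagram of the principle of a fundamental frequency detection and window function construction process provided in Embodiment 2 of the present application;
FIG. 3 is a schematic diagram of the principle of a speech signal transformation process provided in Embodiment 3 of the present application;
FIG. 4 is a schematic structural diagram of a speech signal transformation apparatus provided in Embodiment 4 of the present application;
FIG. 5 is a schematic structural diagram of a device provided in Embodiment 5 of the present application.
Detailed Description
The present application is described below with reference to the accompanying drawings and embodiments. The specific embodiments described herein only explain the present application and do not limit it. For ease of description, the drawings show only the parts relevant to the present application rather than the entire structure.
The goal is to keep the voice characteristics of a speech signal consistent before and after pitch shifting. Formants reflect the energy distribution of a speech signal in the frequency domain and determine its timbre, that is, its voice characteristics. The embodiments of the present application therefore focus on keeping the formant envelope consistent before and after pitch shifting. A formant-envelope preservation algorithm removes the influence of the shifted target formant envelope, so that the signal has the same formant envelope before and after pitch shifting. This improves the speech quality of the pitch-shifted speech signal.
Embodiment 1
FIG. 1A is a flowchart of a speech signal transformation method provided in Embodiment 1 of the present application. This embodiment can be applied to any device capable of pitch-shifting a speech signal, in cases where the voice characteristics of the speech signal must stay consistent before and after pitch shifting. The method may be executed by the speech signal transformation apparatus provided in the embodiments of the present application. The apparatus may be implemented in software and/or hardware and integrated in the device that executes the method. That device may be an intelligent terminal, such as a smartphone, tablet or palmtop computer, configured with any application capable of pitch-shifting a speech signal.
In an embodiment, referring to FIG. 1A, the method may include the following steps: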
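S110: Acquire an original speech signal.
In this embodiment, the original speech signal is the unprocessed speech signal initially recorded by the speaker and captured by a speech capture device. It is encoded as a discrete signal and contains a large number of speech sampling points.
In this embodiment, when a speech signal needs to be pitch-shifted, the original speech signal recorded by the speaker and captured by the speech capture device is acquired first and then pitch-shifted.
S120: Pitch-shift the original speech signal to obtain an initial target speech signal.
In this embodiment, pitch shifting means adjusting the pitch of a speech signal, that is, adjusting its dominant frequency. For example, retouching flawed notes in a singer's original recording is a pitch shift of the speech signal.
In an embodiment, once the original speech signal is obtained and needs to be pitch-shifted, the pitch-shift requirement can be determined and matching pitch-shift parameters set in the pitch-shifting software. The original speech signal is then pitch-shifted with these parameters and a pitch-shift algorithm to obtain the initial target speech signal. Pitch shifting destroys the voice characteristics of the original speech signal, so the voice characteristics of the initial target speech signal differ from the original ones. The initial target speech signal therefore cannot be output directly. The altered voice characteristics must first be restored, so that when the final speech signal is played, other users can tell which speaker recorded it.
In an embodiment, pitch-shifting the original speech signal to obtain the initial target speech signal may include: acquiring a pitch-shift magnitude; and pitch-shifting the original speech signal according to the pitch-shift magnitude to obtain the initial target speech signal.
In an embodiment, the original speech signal may be processed by a pitch-shift algorithm. The pitch-shift magnitude is determined in advance and set in the algorithm, which then pitch-shifts the original speech signal by that magnitude to obtain the initial target speech signal.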
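The text does not say in what unit the pitch-shift magnitude is given. The sketch below assumes it is specified in equal-tempered semitones, which is not stated in the patent. It converts that value into the multiplicative factor (Ratio) that Embodiment 2 later multiplies into the fundamental frequency. The 4-semitone case repeats the example from the Background.

```python
def semitones_to_ratio(semitones: float) -> float:
    """Convert a pitch shift in equal-tempered semitones to a frequency factor.

    Assumption: the pitch-shift magnitude is used downstream as a plain
    multiplicative factor (called Ratio in Embodiment 2).
    """
    return 2.0 ** (semitones / 12.0)


# Raising a voice by 4 semitones, as in the Background example
ratio = semitones_to_ratio(4)
print(round(ratio, 4))  # 1.2599
```

One octave (12 semitones) doubles every frequency, and negative values lower the pitch.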
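S130: Segment the original speech signal and the initial target speech signal, respectively, and apply a Fourier transform to the resulting multiple segmented original speech signals and multiple segmented target speech signals, respectively, to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals.
In this embodiment, the Fourier transform converts a time-domain signal into a frequency-domain signal. Information that cannot be clearly obtained in the time domain can then be analysed in the frequency domain.
In this embodiment, the original speech signal is produced by the speaker over a period of time and carries different frequency information at different times. If the whole original speech signal were Fourier-transformed at once, the result would be a single spectrum computed from all of the speech over the entire time span. That spectrum would not reflect the frequency characteristics of local time intervals, so the frequency-domain information of different time periods could not be analysed. This embodiment therefore processes the original speech signal and the initial target speech signal with the short-time Fourier transform (STFT) to obtain their frequency-domain information in different time periods. The STFT represents the frequency-domain characteristics at a given moment by the spectrum of the segment of speech inside a specified time window.
In this embodiment, after the original speech signal and the initial target speech signal are obtained, the goal is to analyse their frequency-domain information at each moment accurately. As shown in FIG. 1B, both signals are first segmented into multiple segmented original speech signals and multiple segmented target speech signals, so that the segmented original and segmented target speech signals in the same time segment can be analysed separately. All of these segmented signals are Fourier-transformed, giving multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals within the multiple segments. Because the original speech signal and the initial target speech signal are segmented in the same way, the resulting segmented original and segmented target frequency-domain signals correspond one-to-one, segment by segment.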
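S140: Filter the multiple segmented original frequency-domain signals according to multiple original segment window functions, respectively, to obtain multiple original formant envelopes. Filter the multiple segmented target frequency-domain signals according to multiple target segment window functions, respectively, to obtain multiple target formant envelopes.
In this embodiment, the original segment window function of each segmented original frequency-domain signal is determined by that signal's fundamental frequency and the segment length. The target segment window function of each segmented target frequency-domain signal is determined by that signal's fundamental frequency and the segment length.

Both kinds of segment window function are adaptive variable-length windows. The segmented original frequency-domain signals have different fundamental frequencies, so the resulting original segment window functions have different lengths. Likewise, the segmented target frequency-domain signals have different fundamental frequencies, so the target segment window functions also differ in length. The frequency content varies differently from segment to segment, so analysing every segment with a fixed-length window would introduce errors. Processing the speech signals before and after pitch shifting in each segment with adaptive variable-length windows reduces this error.

Here, the fundamental frequency of a segmented original speech signal is the base frequency it contains, which is reflected in the segmented original frequency-domain signal. The fundamental frequency of a segmented target frequency-domain signal is the base frequency contained in that signal. The segment length is the number of sampling points the speech signal in each segment should contain, generally 2^n, for example 1024 or 2048.
In an embodiment, a formant is a region of a frequency-domain signal where sound energy is relatively concentrated. Formants determine the timbre of a voice and can be used to judge which speaker produced a speech signal. The formant envelope is the curve in the frequency domain formed by connecting the highest amplitude points at the different frequencies of the frequency-domain signal. It represents the speaker's voice characteristics in the current segment.
In an embodiment, the fundamental frequency of the segmented target frequency-domain signals need not be detected again, which speeds up signal processing. Pitch shifting simply scales the signal frequency. The fundamental frequency of the segmented target frequency-domain signal in a segment can therefore be determined directly from the fundamental frequency of the segmented original frequency-domain signal in that segment and the pitch-shift magnitude. This avoids extra detection operations.
In an embodiment, once the segmented original and target frequency-domain signals are available, the fundamental frequency of each segmented original frequency-domain signal is detected first. The corresponding original segment window function is then determined from that fundamental frequency and the segment length. Each original segment window function processes only the segmented original frequency-domain signal in its own segment, not those of other segments. Different segmented original frequency-domain signals have different fundamental frequencies and therefore different original segment window functions. The segmented target frequency-domain signals are handled the same way: their fundamental frequencies and the segment length determine the corresponding target segment window functions.
In an embodiment, each segmented original frequency-domain signal is filtered by its own original segment window function to obtain its original formant envelope. Likewise, each segmented target frequency-domain signal is filtered by its own target segment window function to obtain its target formant envelope. The number of original formant envelopes and the number of target formant envelopes both equal the number of segments.
When a window function in this embodiment filters a frequency-domain signal, it can be understood as a low-pass filter of a particular form. Because the window length adapts, the low-pass filtering behaviour changes with the characteristics of the frequency-domain signal.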
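The patent does not spell out how the window "filters" a spectrum. One plausible reading, assumed here for illustration only, treats the window as a smoothing (low-pass) kernel convolved along the frequency axis of the magnitude spectrum. A longer window smooths more strongly, which is how an adaptive length changes the low-pass behaviour. The function name `formant_envelope` and the edge handling are hypothetical.

```python
from typing import List, Sequence


def formant_envelope(spectrum: Sequence[complex], window: Sequence[float]) -> List[float]:
    """Smooth the magnitude spectrum of one segment with a window kernel.

    Assumptions (not from the patent): the window has an odd number of taps
    and sums to 1, and bins past either end of the spectrum are clamped to
    the nearest edge bin.
    """
    mags = [abs(c) for c in spectrum]
    n = len(mags)
    half = len(window) // 2
    envelope = []
    for k in range(n):
        acc = 0.0
        for i, w in enumerate(window):
            j = min(max(k + i - half, 0), n - 1)  # clamp indices at the edges
            acc += w * mags[j]
        envelope.append(acc)
    return envelope


# A harmonic comb (energy in every other bin) becomes a smoother curve
comb = [4.0, 0.0, 4.0, 0.0, 4.0, 0.0]
print(formant_envelope(comb, [0.25, 0.5, 0.25]))  # [3.0, 2.0, 2.0, 2.0, 2.0, 1.0]
```

The uneven values at the two ends come from edge clamping. A practical implementation would pick its boundary handling deliberately.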
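S150: Determine a pitch-shifted speech signal according to the multiple segmented target frequency-domain signals, the multiple original formant envelopes and the multiple target formant envelopes.
In this embodiment, the pitch-shifted speech signal is the result of pitch-shifting the original speech signal after the influence of pitch shifting on the voice characteristics has been removed. It can therefore be output with voice characteristics consistent with those of the original speech signal.
After the original and target formant envelopes are obtained, the influence of the target formants in the pitch-shifted segmented target frequency-domain signals must be removed to keep the voice characteristics consistent before and after pitch shifting. In an embodiment, the ratio of the original formant envelope to the target formant envelope in each segment represents how the voice characteristics changed between the segmented original frequency-domain signal (before pitch shifting) and the segmented target frequency-domain signal (after pitch shifting). The corrected frequency-domain signal of the segment is determined from the segmented target frequency-domain signal and this ratio. Repeating this for every segment gives the corrected frequency-domain signals of all segments. These form the final pitch-shifted frequency-domain signal, from which the final pitch-shifted speech signal is determined.
In the technical solution of this embodiment, the original speech signal and the initial target speech signal obtained by pitch-shifting it are segmented. The resulting segmented original and segmented target speech signals are Fourier-transformed to obtain segmented original and segmented target frequency-domain signals. The original segment window functions are determined from the fundamental frequencies of the segmented original frequency-domain signals and the segment length, and the target segment window functions from those of the segmented target frequency-domain signals, so different segments can use different window functions. Filtering the segmented original and target frequency-domain signals with these window functions yields the original and target formant envelopes and reduces their estimation error before and after pitch shifting. The final pitch-shifted speech signal is then determined from the segmented target frequency-domain signals and the formant envelopes before and after pitch shifting. This removes the influence of the target formant envelope, so the speech signals before and after pitch shifting have the same formant envelope. The voice characteristics therefore stay consistent, and the speech quality of the pitch-shifted speech signal improves.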
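Embodiment 2
FIG. 2 is a schematic diagram of the principle of the fundamental frequency detection and window function construction process provided in Embodiment 2 of the present application. This embodiment builds on the above embodiment. It mainly describes two things: how the fundamental frequencies of the segmented original frequency-domain signals (obtained by segmenting and Fourier-transforming the original speech signal) are detected, and how the original segment window functions of the segmented original frequency-domain signals and the target segment window functions of the segmented target frequency-domain signals are constructed.
The method in this embodiment may include the following steps:
S2010: Acquire an original speech signal.
S2020: Pitch-shift the original speech signal to obtain an initial target speech signal.
S2030: Segment the original speech signal and the initial target speech signal, respectively, and apply a Fourier transform to the resulting multiple segmented original speech signals and multiple segmented target speech signals, respectively, to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals.
S2040: Determine whether each of the multiple segmented original frequency-domain signals carries a fundamental frequency. If a segmented original frequency-domain signal carries a fundamental frequency, perform S2050; if it does not, perform S2060.
In an embodiment, window functions will later filter the segmented original and target frequency-domain signals to determine the corresponding formant envelopes. To make the formant envelopes in different segments before and after pitch shifting more accurate, the frequency-domain signals are filtered with adaptive variable-length window functions. The window for each signal is chosen from its fundamental frequency and the segment length. This embodiment therefore first detects the fundamental frequencies of the segmented original frequency-domain signals, that is, determines whether each of them carries a fundamental frequency. The detection result for the current segmented original frequency-domain signal can be marked so that its validity can be analysed later. If the signal carries a fundamental frequency, the detected value is recorded. If it does not, the signal is marked with a preset flag, so that segmented original frequency-domain signals without a fundamental frequency can be identified later.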
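S2050: Use the carried fundamental frequency as the fundamental frequency of the segmented original frequency-domain signal.
In an embodiment, if the current segmented original frequency-domain signal carries a fundamental frequency, that value is used directly as its fundamental frequency.
S2060: Determine the fundamental frequency of the segmented original frequency-domain signal from the fundamental frequency of the preceding segmented original frequency-domain signal and that of the following segmented original frequency-domain signal.
In an embodiment, fundamental frequency detection can fail on soft-sounding parts of the original speech signal or where the signal is weak. After segmentation and the Fourier transform, the segmented original frequency-domain signals of those parts may therefore carry no fundamental frequency. In this embodiment, if the current segmented original frequency-domain signal carries no fundamental frequency, its fundamental frequency is determined from those of the preceding and following segmented original frequency-domain signals, which keeps the detection result smooth.
In an embodiment, determining this fundamental frequency from the preceding and following segments may include computing it from their fundamental frequencies with an interpolation algorithm.
In this embodiment, an interpolation algorithm can be applied to the fundamental frequencies of the preceding and following segmented original frequency-domain signals to obtain the fundamental frequency of the current segmented original frequency-domain signal.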
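Steps S2040 to S2060 can be sketched as follows. Several choices here are assumptions, not from the patent:
- `None` plays the role of the "preset flag" for a segment whose fundamental frequency was not detected.
- The unspecified interpolation algorithm is linear interpolation.
- Runs of several undetected segments use the nearest detected neighbours; the patent only mentions the immediately preceding and following segments.
- Segments at either end copy the nearest detected value.

```python
from typing import List, Optional, Sequence


def fill_missing_f0(f0: Sequence[Optional[float]]) -> List[float]:
    """Fill segments flagged as undetected (None) by interpolating neighbouring F0 values."""
    detected = [i for i, v in enumerate(f0) if v is not None]
    if not detected:
        raise ValueError("no segment carries a fundamental frequency")
    out = []
    for i, v in enumerate(f0):
        if v is not None:
            out.append(float(v))  # S2050: use the carried fundamental frequency
            continue
        prev = max((j for j in detected if j < i), default=None)
        nxt = min((j for j in detected if j > i), default=None)
        if prev is None:  # no earlier detected segment: copy the next one
            out.append(float(f0[nxt]))
        elif nxt is None:  # no later detected segment: copy the previous one
            out.append(float(f0[prev]))
        else:  # S2060: linear interpolation between the two neighbours
            w = (i - prev) / (nxt - prev)
            out.append((1 - w) * f0[prev] + w * f0[nxt])
    return out


print(fill_missing_f0([200.0, None, 220.0]))  # [200.0, 210.0, 220.0]
```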
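S2070: Determine the fundamental frequency of each segmented target frequency-domain signal as the product of the fundamental frequency of the corresponding segmented original frequency-domain signal and the pitch-shift magnitude.
S2080: Obtain the original window length of each segmented original frequency-domain signal from its fundamental frequency and the segment length. Construct the original segment window function of each segmented original frequency-domain signal from its original window length and a preset window type.
After the fundamental frequencies of the segmented original frequency-domain signals are obtained, this embodiment can determine the original window length of the window function used in each segment from that segment's fundamental frequency and the segment length. For example, the original window length can be computed as Ln_s = Pn*N/Fs, where:
- Ln_s is the original window length;
- Pn is the fundamental frequency of the segmented original frequency-domain signal;
- N is the segment length, that is, the number of sampling points in each segment;
- Fs is the sampling rate of the original speech signal, generally 48 kHz.
In an embodiment, the preset window type is the type of window function, such as a triangular, rectangular or Hann window; this embodiment does not limit it. The original segment window functions are constructed from the original window length of each segmented original frequency-domain signal and the preset window type. Each original segment window function then filters its corresponding segmented original frequency-domain signal.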
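The window-length formula and the choice of window type can be sketched as follows. Fs/N is the spacing between FFT bins, so Ln_s = Pn*N/Fs is the fundamental frequency expressed in bins, which is also the spacing between neighbouring harmonics in the spectrum. The patent does not specify the following, so they are assumptions of this sketch:
- the length is rounded to an odd number of taps;
- the Hann variant has no zero end points;
- the taps are normalised to sum to 1.

```python
import math
from typing import List


def window_length(f0_hz: float, seg_len: int, fs: float) -> float:
    """Ln_s = Pn * N / Fs: the fundamental frequency expressed in FFT bins."""
    return f0_hz * seg_len / fs


def make_window(length: float, kind: str = "hann") -> List[float]:
    """Build a symmetric window of roughly `length` taps, normalised to sum to 1."""
    n = max(1, int(round(length)))
    if n % 2 == 0:
        n += 1  # assumption: odd length so the window has a centre tap
    if n == 1:
        return [1.0]
    if kind == "rect":
        w = [1.0] * n
    elif kind == "triangle":
        half = (n - 1) / 2
        w = [1.0 - abs(i - half) / (half + 1) for i in range(n)]
    elif kind == "hann":
        # Hann variant without zero end points
        w = [0.5 - 0.5 * math.cos(2 * math.pi * (i + 1) / (n + 1)) for i in range(n)]
    else:
        raise ValueError(f"unknown window type: {kind}")
    total = sum(w)
    return [v / total for v in w]


length = window_length(200.0, 1024, 48000.0)
print(round(length, 4))  # 4.2667
print([round(v, 4) for v in make_window(length, "hann")])  # [0.0833, 0.25, 0.3333, 0.25, 0.0833]
```

With Fs = 48 kHz and N = 1024, a 200 Hz voice gets a window only about four bins wide. A lower voice gets a narrower window, and a higher one gets a wider window.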
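S2090: Obtain the target window length of each segmented target frequency-domain signal from its fundamental frequency and the segment length. Construct the target segment window function of each segmented target frequency-domain signal from its target window length and a preset window type.
The fundamental frequencies of the segmented target frequency-domain signals are obtained from those of the segmented original frequency-domain signals and the pitch-shift magnitude. This embodiment can then determine the target window length of the window function used in each segment from the fundamental frequency of that segment's target frequency-domain signal and the segment length. For example, the target window length can be computed as Ln_t = Pn*Ratio*N/Fs, where:
- Ln_t is the target window length;
- Pn is the fundamental frequency of the segmented original frequency-domain signal;
- Ratio is the pitch-shift magnitude;
- N is the segment length, that is, the number of sampling points in each segment;
- Fs is the sampling rate of the initial target speech signal, generally 48 kHz.
In an embodiment, the target segment window functions are constructed from the target window length of each segmented target frequency-domain signal and the preset window type. Each target segment window function then filters its corresponding segmented target frequency-domain signal.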
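Steps S2070 and S2090 can be combined. The target fundamental frequency is the original one multiplied by Ratio, so the target window length (Ln_t below) is Ratio times the original window length, and no second pitch detection is needed. The helper name `segment_window_lengths` is hypothetical. The per-segment F0 values and Ratio = 1.5 in the usage lines are made-up inputs.

```python
from typing import List, Sequence, Tuple


def segment_window_lengths(
    f0_per_segment: Sequence[float], ratio: float, seg_len: int, fs: float
) -> List[Tuple[float, float]]:
    """Return (original, target) window lengths in FFT bins for every segment.

    Original: Ln_s = Pn * N / Fs
    Target:   Ln_t = Pn * Ratio * N / Fs   (target F0 = Pn * Ratio, per S2070)
    """
    lengths = []
    for pn in f0_per_segment:
        original_len = pn * seg_len / fs
        target_len = pn * ratio * seg_len / fs
        lengths.append((original_len, target_len))
    return lengths


for orig_len, tgt_len in segment_window_lengths([200.0, 150.0], 1.5, 1024, 48000.0):
    print(round(orig_len, 4), round(tgt_len, 4))
# 4.2667 6.4
# 3.2 4.8
```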
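S2080 and S2090 can be performed in either order or simultaneously; this embodiment does not limit this.
S2100: Filter the multiple segmented original frequency-domain signals according to the multiple original segment window functions, respectively, to obtain multiple original formant envelopes. Filter the multiple segmented target frequency-domain signals according to the multiple target segment window functions, respectively, to obtain multiple target formant envelopes.
S2110: Determine the pitch-shifted speech signal according to the multiple segmented target frequency-domain signals, the multiple original formant envelopes and the multiple target formant envelopes.
In the technical solution of this embodiment, the fundamental frequencies of the segmented original and segmented target frequency-domain signals are determined. The original window length for each segment is derived from the fundamental frequency of its segmented original frequency-domain signal and the segment length, and the target window length from the fundamental frequency of its segmented target frequency-domain signal and the segment length. Adaptive variable-length window functions built from these lengths filter the segmented original and segmented target frequency-domain signals. This yields the corresponding original and target formant envelopes and reduces their estimation error before and after pitch shifting. Using these envelopes, the influence of the target formant envelope is removed, so the speech signals before and after pitch shifting have the same formant envelope. The voice characteristics therefore stay consistent, and the speech quality of the pitch-shifted speech signal improves.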
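Embodiment 3
FIG. 3 is a schematic diagram of the principle of a speech signal transformation process provided in Embodiment 3 of the present application. This embodiment builds on the above embodiments. It describes how the speech signals are segmented and Fourier-transformed, and how the pitch-shifted speech signal is determined.
This embodiment may include the following steps:
S310: Acquire an original speech signal.
S320: Pitch-shift the original speech signal to obtain an initial target speech signal.
S330: Segment the original speech signal and the initial target speech signal according to a preset segment length and a segment shift to obtain multiple segmented original speech signals and multiple segmented target speech signals.
In an embodiment, before segmenting the original speech signal and the initial target speech signal, this embodiment first determines the preset segment length and the segment shift. The preset segment length is the number of sampling points the speech signal in each segment should contain, generally 2^n, for example 1024 or 2048. The segment shift is the distance between the starting sampling points of adjacent segments. For example, with a preset segment length of 1024 and a segment shift of 512, the first segment consists of sampling points 1-1024 and the second of sampling points 513-1536. Segmenting both signals with the preset segment length and segment shift yields a segmented original speech signal and a segmented target speech signal in each segment.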
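The segmentation rule of S330 (segment length N, fixed shift between segment start points) can be sketched as follows. Zero-padding the last, incomplete segment is an assumption; the patent does not say how the tail is handled. Calling the same function on the original and the initial target speech signals keeps their segments in one-to-one correspondence.

```python
from typing import List, Sequence


def segment_signal(x: Sequence[float], seg_len: int, hop: int) -> List[List[float]]:
    """Split x into segments of seg_len samples whose starts are hop samples apart.

    Assumption: the last, incomplete segment is zero-padded to seg_len.
    """
    if seg_len <= 0 or hop <= 0:
        raise ValueError("seg_len and hop must be positive")
    segments = []
    start = 0
    while True:
        chunk = list(x[start:start + seg_len])
        chunk.extend([0.0] * (seg_len - len(chunk)))  # zero-pad the tail
        segments.append(chunk)
        if start + seg_len >= len(x):  # this segment reached the end of x
            break
        start += hop
    return segments


# Sample values equal their 1-based index, so each segment shows which samples it holds
samples = list(range(1, 2049))
segs = segment_signal(samples, 1024, 512)
print(len(segs), segs[1][0], segs[1][-1])  # 3 513 1536
```

As in the text, the second segment with N = 1024 and a shift of 512 covers samples 513 to 1536.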
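S340: Apply a Fourier transform to the multiple segmented original speech signals and the multiple segmented target speech signals, respectively, to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals.
In an embodiment, once the segmented original and segmented target speech signals are obtained, the signals in each segment can be Fourier-transformed separately. This gives the segmented original and segmented target frequency-domain signals of all segments.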
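The per-segment Fourier transform of S340 can be written as a direct DFT to make the step explicit. This naive O(N^2) version is only illustrative; a practical implementation would use an FFT library. STFT practice often applies an analysis window before the transform, but the text does not specify one, so this sketch transforms each segment as-is.

```python
import cmath
from typing import List, Sequence


def dft(segment: Sequence[float]) -> List[complex]:
    """Direct discrete Fourier transform X[k] = sum_t x[t] * exp(-2j*pi*k*t/N)."""
    n = len(segment)
    return [
        sum(segment[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def segments_to_spectra(segments: Sequence[Sequence[float]]) -> List[List[complex]]:
    """Transform every segment independently (one spectrum per segment)."""
    return [dft(seg) for seg in segments]


# A unit impulse has a flat magnitude spectrum
print([round(abs(v), 6) for v in dft([1.0, 0.0, 0.0, 0.0])])  # [1.0, 1.0, 1.0, 1.0]
```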
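S350: Filter the multiple segmented original frequency-domain signals according to multiple original segment window functions, respectively, to obtain multiple original formant envelopes. Filter the multiple segmented target frequency-domain signals according to multiple target segment window functions, respectively, to obtain multiple target formant envelopes. The original segment window function of each segmented original frequency-domain signal is determined by its fundamental frequency and the segment length. The target segment window function of each segmented target frequency-domain signal is determined by its fundamental frequency and the segment length.
S360: Determine the pitch-shift ratio of each segmented target frequency-domain signal from its corresponding original formant envelope and target formant envelope.
In an embodiment, suppose the original formant envelope of each segmented original frequency-domain signal and the target formant envelope of each segmented target frequency-domain signal have been obtained. For a single segmented target frequency-domain signal, the original and target formant envelopes of its segment can be compared to determine its pitch-shift ratio. This ratio represents the influence of the shifted target formant envelope on the voice characteristics during pitch shifting. The same method gives the pitch-shift ratios of all segmented target frequency-domain signals.
S370: Determine the segmented pitch-shifted frequency-domain signal of each segmented target frequency-domain signal from that signal and its pitch-shift ratio.
In this embodiment, the influence of the target formant envelope on the voice characteristics during pitch shifting is removed by multiplying the segmented target frequency-domain signal corresponding to that envelope by the pitch-shift ratio. The product is, for that segment, a segmented pitch-shifted frequency-domain signal with the pitch-shift influence removed. It has the same formant envelope as the segmented original frequency-domain signal in the same segment. The same method gives the segmented pitch-shifted frequency-domain signals of all segments. In this embodiment, the segmented pitch-shifted frequency-domain signal is computed as STFT_tn' = STFT_tn*Esn/Etn, where:
- STFT_tn' is the segmented pitch-shifted frequency-domain signal;
- STFT_tn is the segmented target frequency-domain signal;
- Esn is the original formant envelope in that segment;
- Etn is the target formant envelope in that segment.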
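The correction STFT_tn' = STFT_tn*Esn/Etn can be sketched bin by bin for one segment. Esn/Etn is a real, non-negative gain, so each bin keeps its phase and only its magnitude is rescaled toward the original envelope. The function name and the small `eps` guard against a zero target envelope are assumptions added for numerical safety.

```python
from typing import List, Sequence


def remove_formant_shift(
    target_spec: Sequence[complex],
    env_original: Sequence[float],
    env_target: Sequence[float],
    eps: float = 1e-12,
) -> List[complex]:
    """Apply STFT_tn' = STFT_tn * Esn / Etn bin by bin for one segment."""
    if not (len(target_spec) == len(env_original) == len(env_target)):
        raise ValueError("spectrum and envelopes must have the same number of bins")
    return [
        x * (es / max(et, eps))  # real gain: rescales magnitude, keeps phase
        for x, es, et in zip(target_spec, env_original, env_target)
    ]


print(remove_formant_shift([2 + 0j, 1j], [1.0, 3.0], [2.0, 1.0]))  # [(1+0j), 3j]
```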
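S380: Apply an inverse Fourier transform to the segmented pitch-shifted frequency-domain signal of each segmented target frequency-domain signal to obtain the corresponding segmented pitch-shifted speech signal.
In an embodiment, once the segmented pitch-shifted frequency-domain signal of each segment is available, an inverse Fourier transform can be applied to it. This gives the segmented pitch-shifted speech signal of that segment. The final pitch-shifted speech signal is then determined from the multiple segmented pitch-shifted speech signals.
S390: Determine the pitch-shifted speech signal according to the multiple segmented pitch-shifted speech signals, the preset segment length and the segment shift.
In an embodiment, after the segmented pitch-shifted speech signals are obtained, they can be assembled using the preset segment length and segment shift that were used to segment the original speech signal. The result is the final pitch-shifted speech signal, in which the influence of the target formant envelope on the voice characteristics during pitch shifting has been removed. This pitch-shifted speech signal has the same formant envelope as the original speech signal, which keeps the voice characteristics consistent before and after pitch shifting.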
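S380 and S390 can be sketched as an inverse DFT per segment, followed by overlap-add with the same segment shift used in S330. Two assumptions apply:
- Each output sample is divided by the number of segments covering it, so that unmodified, unwindowed segments reconstruct the input exactly. Practical systems more often use windowed overlap-add.
- Taking the real part assumes the corrected spectrum stays (approximately) conjugate-symmetric; any imaginary residue is discarded.

```python
import cmath
from typing import List, Sequence


def idft_real(spec: Sequence[complex]) -> List[float]:
    """Direct inverse DFT of one segment, keeping only the real part."""
    n = len(spec)
    return [
        (sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def overlap_add(segments: Sequence[Sequence[float]], hop: int, total_len: int) -> List[float]:
    """Reassemble segments that start every `hop` samples into one signal.

    Samples covered by several segments are averaged (assumption, see text).
    """
    out = [0.0] * total_len
    count = [0] * total_len
    for s, seg in enumerate(segments):
        start = s * hop
        for i, value in enumerate(seg):
            if start + i < total_len:  # drop samples (e.g. zero-padding) past the end
                out[start + i] += value
                count[start + i] += 1
    return [o / c if c else 0.0 for o, c in zip(out, count)]


signal = [float(i) for i in range(10)]
pieces = [signal[s:s + 4] for s in range(0, 7, 2)]  # segment length 4, shift 2
print(overlap_add(pieces, 2, len(signal)) == signal)  # True
```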
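In the technical solution of this embodiment, the pitch-shift ratio of each segmented target frequency-domain signal is determined from the formant envelopes before and after pitch shifting. The segmented pitch-shifted frequency-domain signal is then determined from the segmented target frequency-domain signal in that segment and this ratio, which removes the influence of the formant envelope in that segment. This yields the segmented pitch-shifted frequency-domain signals of all segments with the formant-envelope influence removed. An inverse Fourier transform turns them into segmented pitch-shifted speech signals, which are assembled into the pitch-shifted speech signal. This keeps the voice characteristics consistent before and after pitch shifting and improves the speech quality of the pitch-shifted speech signal.
Embodiment 4
FIG. 4 is a schematic structural diagram of a speech signal transformation apparatus provided in Embodiment 4 of the present application. As shown in FIG. 4, the apparatus may include three modules:
- a segmentation and transform module 410, configured to segment the original speech signal and the initial target speech signal obtained by pitch-shifting the original speech signal, respectively, and to apply a Fourier transform to the resulting multiple segmented original speech signals and multiple segmented target speech signals, respectively, to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals;
- an envelope determination module 420, configured to filter the multiple segmented original frequency-domain signals according to multiple original segment window functions, respectively, to obtain multiple original formant envelopes, and to filter the multiple segmented target frequency-domain signals according to multiple target segment window functions, respectively, to obtain multiple target formant envelopes; the original segment window function of each segmented original frequency-domain signal is determined by its fundamental frequency and the segment length, and the target segment window function of each segmented target frequency-domain signal is determined by its fundamental frequency and the segment length;
- a pitch-shifted speech determination module 430, configured to determine a pitch-shifted speech signal according to the multiple segmented target frequency-domain signals, the multiple original formant envelopes and the multiple target formant envelopes.
In the technical solution of this embodiment, the original speech signal and the initial target speech signal obtained by pitch-shifting it are segmented. The resulting segmented original and segmented target speech signals are Fourier-transformed to obtain segmented original and segmented target frequency-domain signals. The original segment window functions are determined from the fundamental frequencies of the segmented original frequency-domain signals and the segment length, and the target segment window functions from those of the segmented target frequency-domain signals, so different segments can use different window functions. Filtering the segmented original and target frequency-domain signals with these window functions yields the original and target formant envelopes and reduces their estimation error before and after pitch shifting. The final pitch-shifted speech signal is then determined from the segmented target frequency-domain signals and the formant envelopes before and after pitch shifting. This removes the influence of the target formant envelope, so the speech signals before and after pitch shifting have the same formant envelope. The voice characteristics therefore stay consistent, and the speech quality of the pitch-shifted speech signal improves.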
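Embodiment 5
FIG. 5 is a schematic structural diagram of a device provided in Embodiment 5 of the present invention. As shown in FIG. 5, the device includes a processor 50, a storage device 51 and a communication device 52.
As a computer-readable storage medium, the storage device 51 can store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the speech signal transformation method described in any embodiment of the present invention. By running the software programs, instructions and modules stored in the storage device 51, the processor 50 executes the device's functional applications and data processing, that is, implements the above speech signal transformation method.
Embodiment 6
Embodiment 6 of the present application further provides a computer-readable storage medium storing a computer program. When executed by a processor, the program can implement the speech signal transformation method of any embodiment of the present application. The method may include three steps. First, segment the original speech signal and the initial target speech signal obtained by pitch-shifting it, respectively, and apply a Fourier transform to the resulting multiple segmented original speech signals and multiple segmented target speech signals, respectively, to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals. Second, filter the multiple segmented original frequency-domain signals according to multiple original segment window functions, respectively, to obtain multiple original formant envelopes, and filter the multiple segmented target frequency-domain signals according to multiple target segment window functions, respectively, to obtain multiple target formant envelopes. The original segment window function of each segmented original frequency-domain signal is determined by its fundamental frequency and the segment length, and the target segment window function of each segmented target frequency-domain signal is determined by its fundamental frequency and the segment length. Third, determine a pitch-shifted speech signal according to the multiple segmented target frequency-domain signals, the multiple original formant envelopes and the multiple target formant envelopes.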

Claims (12)

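  1. A speech signal transformation method, comprising:
    segmenting an original speech signal and an initial target speech signal obtained by pitch-shifting the original speech signal, respectively, and applying a Fourier transform to the multiple segmented original speech signals obtained after segmentation and the multiple segmented target speech signals obtained after segmentation, respectively, to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals;
    filtering the multiple segmented original frequency-domain signals according to multiple original segment window functions, respectively, to obtain multiple original formant envelopes, and filtering the multiple segmented target frequency-domain signals according to multiple target segment window functions, respectively, to obtain multiple target formant envelopes, wherein the original segment window function corresponding to each segmented original frequency-domain signal is determined according to the fundamental frequency of that segmented original frequency-domain signal and a segment length, and the target segment window function corresponding to each segmented target frequency-domain signal is determined according to the fundamental frequency of that segmented target frequency-domain signal and the segment length;
    determining a pitch-shifted speech signal according to the multiple segmented target frequency-domain signals, the multiple original formant envelopes and the multiple target formant envelopes.
  2. The method according to claim 1, further comprising:
    acquiring a pitch-shift magnitude;
    pitch-shifting the original speech signal according to the pitch-shift magnitude to obtain the initial target speech signal.
  3. The method according to claim 2, wherein the fundamental frequency of each segmented target frequency-domain signal is the product of the fundamental frequency of the segmented original frequency-domain signal corresponding to that segmented target frequency-domain signal and the pitch-shift magnitude.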
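  4. The method according to any one of claims 1 to 3, further comprising, before filtering the multiple segmented original frequency-domain signals according to the multiple original segment window functions, respectively:
    in a case where a segmented original frequency-domain signal carries a fundamental frequency, using the carried fundamental frequency as the fundamental frequency of that segmented original frequency-domain signal;
    in a case where a segmented original frequency-domain signal does not carry a fundamental frequency, determining the fundamental frequency of that segmented original frequency-domain signal according to the fundamental frequency of the preceding segmented original frequency-domain signal and the fundamental frequency of the following segmented original frequency-domain signal.
  5. The method according to claim 4, wherein determining the fundamental frequency of the segmented original frequency-domain signal according to the fundamental frequencies of the preceding and following segmented original frequency-domain signals comprises:
    computing, by means of an interpolation algorithm, the fundamental frequency of the segmented original frequency-domain signal from the fundamental frequency of the preceding segmented original frequency-domain signal and the fundamental frequency of the following segmented original frequency-domain signal.
  6. The method according to any one of claims 1 to 5, further comprising, before filtering the multiple segmented original frequency-domain signals according to the multiple original segment window functions, respectively, to obtain the multiple original formant envelopes:
    obtaining an original window length corresponding to each segmented original frequency-domain signal according to the fundamental frequency of that segmented original frequency-domain signal and the segment length;
    constructing the original segment window function corresponding to each segmented original frequency-domain signal according to its original window length and a preset window type.
  7. The method according to any one of claims 1 to 6, further comprising, before filtering the multiple segmented target frequency-domain signals according to the multiple target segment window functions, respectively, to obtain the multiple target formant envelopes:
    obtaining a target window length corresponding to each segmented target frequency-domain signal according to the fundamental frequency of that segmented target frequency-domain signal and the segment length;
    constructing the target segment window function corresponding to each segmented target frequency-domain signal according to its target window length and a preset window type.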
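  8. The method according to any one of claims 1 to 7, wherein segmenting the original speech signal and the initial target speech signal obtained by pitch-shifting the original speech signal, respectively, and applying a Fourier transform to the multiple segmented original speech signals and the multiple segmented target speech signals obtained after segmentation, respectively, to obtain the multiple segmented original frequency-domain signals and the multiple segmented target frequency-domain signals, comprises:
    segmenting the original speech signal and the initial target speech signal obtained by pitch-shifting the original speech signal according to a preset segment length and a segment shift to obtain the multiple segmented original speech signals and the multiple segmented target speech signals;
    applying a Fourier transform to the multiple segmented original speech signals and the multiple segmented target speech signals, respectively, to obtain the multiple segmented original frequency-domain signals and the multiple segmented target frequency-domain signals.
  9. The method according to claim 8, wherein determining the pitch-shifted speech signal according to the multiple segmented target frequency-domain signals, the multiple original formant envelopes and the multiple target formant envelopes comprises:
    determining a pitch-shift ratio corresponding to each segmented target frequency-domain signal according to the original formant envelope and the target formant envelope corresponding to that segmented target frequency-domain signal;
    determining a segmented pitch-shifted frequency-domain signal corresponding to each segmented target frequency-domain signal according to that segmented target frequency-domain signal and its pitch-shift ratio;
    applying an inverse Fourier transform to the segmented pitch-shifted frequency-domain signal corresponding to each segmented target frequency-domain signal to obtain a segmented pitch-shifted speech signal corresponding to that segmented target frequency-domain signal;
    determining the pitch-shifted speech signal according to the segmented pitch-shifted speech signals corresponding to the multiple segmented target frequency-domain signals, the preset segment length and the segment shift.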
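  10. A speech signal transformation apparatus, comprising:
    a segmentation and transform module, configured to segment an original speech signal and an initial target speech signal obtained by pitch-shifting the original speech signal, respectively, and to apply a Fourier transform to the multiple segmented original speech signals and the multiple segmented target speech signals obtained after segmentation, respectively, to obtain multiple segmented original frequency-domain signals and multiple segmented target frequency-domain signals;
    an envelope determination module, configured to filter the multiple segmented original frequency-domain signals according to multiple original segment window functions, respectively, to obtain multiple original formant envelopes, and to filter the multiple segmented target frequency-domain signals according to multiple target segment window functions, respectively, to obtain multiple target formant envelopes, wherein the original segment window function corresponding to each segmented original frequency-domain signal is determined according to the fundamental frequency of that segmented original frequency-domain signal and a segment length, and the target segment window function corresponding to each segmented target frequency-domain signal is determined according to the fundamental frequency of that segmented target frequency-domain signal and the segment length;
    a pitch-shifted speech determination module, configured to determine a pitch-shifted speech signal according to the multiple segmented target frequency-domain signals, the multiple original formant envelopes and the multiple target formant envelopes.
  11. A device, comprising:
    at least one processor;
    a storage device, configured to store at least one program;
    wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the speech signal transformation method according to any one of claims 1 to 9.
  12. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the speech signal transformation method according to any one of claims 1 to 9.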
PCT/CN2019/121838 2018-12-28 2019-11-29 Speech signal transformation method, apparatus, device and storage medium WO2020134851A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
RU2021119297A RU2770747C1 (ru) 2018-12-28 2019-11-29 Audio signal transformation method, device and data carrier
EP19902578.4A EP3905243A4 (en) 2018-12-28 2019-11-29 AUDIO SIGNAL TRANSFORMATION METHOD, DEVICE, APPARATUS, AND INFORMATION CARRIER
SG11202106539QA SG11202106539QA (en) 2018-12-28 2019-11-29 Audio signal transformation method, device, apparatus, and storage medium
US17/416,709 US20220051685A1 (en) 2018-12-28 2019-11-29 Method for transforming audio signal, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811628761.6 2018-12-28
CN201811628761.6A CN111383646B (zh) 2018-12-28 2018-12-28 Speech signal transformation method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2020134851A1 (zh) 2020-07-02

Family

ID=71126923

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121838 WO2020134851A1 (zh) 2018-12-28 2019-11-29 Speech signal transformation method, apparatus, device and storage medium

Country Status (6)

Country Link
US (1) US20220051685A1 (zh)
EP (1) EP3905243A4 (zh)
CN (1) CN111383646B (zh)
RU (1) RU2770747C1 (zh)
SG (1) SG11202106539QA (zh)
WO (1) WO2020134851A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112887480A (zh) * 2021-01-22 2021-06-01 维沃移动通信有限公司 Audio signal processing method and apparatus, electronic device and readable storage medium

Families Citing this family (6)

Publication number Priority date Publication date Assignee Title
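CN112289330A (zh) * 2020-08-26 2021-01-29 北京字节跳动网络技术有限公司 Audio processing method, apparatus, device and storage medium
CN112908351A (zh) * 2021-01-21 2021-06-04 腾讯音乐娱乐科技(深圳)有限公司 Audio pitch-shifting method, apparatus, device and storage medium
CN113129922B (zh) * 2021-04-21 2022-11-08 维沃移动通信有限公司 Speech signal processing method and apparatus
CN113241082B (zh) * 2021-04-22 2024-02-20 杭州网易智企科技有限公司 Voice changing method, apparatus, device and medium
CN114295577B (zh) * 2022-01-04 2024-04-09 太赫兹科技应用(广东)有限公司 Method, apparatus, device and medium for processing terahertz detection signals
CN116761128B (zh) * 2023-08-23 2023-11-24 深圳市中翔达润电子有限公司 Sound leakage detection method for sports Bluetooth earphones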

Citations (5)

Publication number Priority date Publication date Assignee Title
US6336092B1 (en) * 1997-04-28 2002-01-01 Ivl Technologies Ltd Targeted vocal transformation
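CN1719514A (zh) * 2004-07-06 2006-01-11 中国科学院自动化研究所 High-quality real-time voice changing method based on speech analysis and synthesis
CN101354889A (zh) * 2008-09-18 2009-01-28 北京中星微电子有限公司 Speech pitch-shifting method and apparatus
CN101527141A (zh) * 2009-03-10 2009-09-09 苏州大学 Method for converting whispered speech into normal speech based on a radial basis function neural network
CN102592590A (zh) * 2012-02-21 2012-07-18 华南理工大学 Arbitrarily adjustable natural voice-changing method and apparatus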

Family Cites Families (15)

Publication number Priority date Publication date Assignee Title
JP3265962B2 (ja) * 1995-12-28 2002-03-18 日本ビクター株式会社 Pitch conversion device
US6757659B1 (en) * 1998-11-16 2004-06-29 Victor Company Of Japan, Ltd. Audio signal processing apparatus
WO2006046761A1 (ja) * 2004-10-27 2006-05-04 Yamaha Corporation Pitch conversion device
US8315857B2 (en) * 2005-05-27 2012-11-20 Audience, Inc. Systems and methods for audio signal analysis and modification
JP5400059B2 (ja) * 2007-12-18 2014-01-29 エルジー エレクトロニクス インコーポレイティド Audio signal processing method and apparatus
ATE500588T1 (de) * 2008-01-04 2011-03-15 Dolby Sweden Ab Audio encoder and decoder
US9240193B2 (en) * 2013-01-21 2016-01-19 Cochlear Limited Modulation of speech signals
WO2014145960A2 (en) * 2013-03-15 2014-09-18 Short Kevin M Method and system for generating advanced feature discrimination vectors for use in speech recognition
US9583116B1 (en) * 2014-07-21 2017-02-28 Superpowered Inc. High-efficiency digital signal processing of streaming media
EP2980795A1 (en) * 2014-07-28 2016-02-03 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Audio encoding and decoding using a frequency domain processor, a time domain processor and a cross processor for initialization of the time domain processor
CN105304092A (zh) * 2015-09-18 2016-02-03 深圳市海派通讯科技有限公司 Real-time voice changing method based on an intelligent terminal
US9947341B1 (en) * 2016-01-19 2018-04-17 Interviewing.io, Inc. Real-time voice masking in a computer network
CN106057208B (zh) * 2016-06-14 2019-11-15 科大讯飞股份有限公司 Audio correction method and apparatus
CN106228973A (zh) * 2016-07-21 2016-12-14 福州大学 Music and speech pitch-shifting method with stable timbre
CN108988822A (zh) * 2018-08-24 2018-12-11 广东石油化工学院 Method and system for filtering out non-stationary non-Gaussian noise

Non-Patent Citations (1)

Title
See also references of EP3905243A4

Cited By (3)

Publication number Priority date Publication date Assignee Title
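CN112887480A (zh) * 2021-01-22 2021-06-01 维沃移动通信有限公司 Audio signal processing method and apparatus, electronic device, and readable storage medium
WO2022156709A1 (zh) * 2021-01-22 2022-07-28 维沃移动通信有限公司 Audio signal processing method and apparatus, electronic device, and readable storage medium
CN112887480B (zh) * 2021-01-22 2022-07-29 维沃移动通信有限公司 Audio signal processing method and apparatus, electronic device, and readable storage medium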

Also Published As

Publication number Publication date
EP3905243A1 (en) 2021-11-03
CN111383646B (zh) 2020-12-08
EP3905243A4 (en) 2022-02-23
US20220051685A1 (en) 2022-02-17
RU2770747C1 (ru) 2022-04-21
CN111383646A (zh) 2020-07-07
SG11202106539QA (en) 2021-07-29

Similar Documents

Publication Publication Date Title
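WO2020134851A1 (zh) Speech signal conversion method, apparatus, device, and storage medium
CN111128213B (zh) Noise suppression method and system with per-frequency-band processing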
Chi et al. Multiresolution spectrotemporal analysis of complex sounds
US7660718B2 (en) Pitch detection of speech signals
WO2020006898A1 (zh) Method and apparatus for musical instrument recognition in audio data, electronic device, and storage medium
Kaya et al. A temporal saliency map for modeling auditory attention
Caetano et al. Improved estimation of the amplitude envelope of time-domain signals using true envelope cepstral smoothing
US20210193149A1 (en) Method, apparatus and device for voiceprint recognition, and medium
JP6724932B2 (ja) Speech synthesis method, speech synthesis system, and program
WO2022012195A1 (zh) Audio signal processing method and related apparatus
Quatieri et al. Audio signal processing based on sinusoidal analysis/synthesis
CN109410971B (zh) Method and device for beautifying sound
US8750530B2 (en) Method and arrangement for processing audio data, and a corresponding corresponding computer-readable storage medium
JP2012208177A (ja) Band extension device and voice correction device
Giannoulis et al. On the disjointess of sources in music using different time-frequency representations
JP6241131B2 (ja) Acoustic filter device, acoustic filtering method, and program
Li et al. Musical sound separation using pitch-based labeling and binary time-frequency masking
Eichas et al. Feature design for the classification of audio effect units by input/output measurements
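WO2020241641A1 (ja) Generative model establishing method, generative model establishing system, program, and training data preparation method
JPH05118906A (ja) Acoustic measurement method and apparatus
CN109697985B (zh) Speech signal processing method, apparatus, and terminal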
Zivanovic Harmonic bandwidth companding for separation of overlapping harmonics in pitched signals
US11756558B2 (en) Sound signal generation method, generative model training method, sound signal generation system, and recording medium
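JP2003241777A (ja) Musical tone formant extraction method, recording medium, and musical tone formant extraction device
CN115602182B (zh) Sound conversion method and system, computer device, and storage medium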

Legal Events

Date Code Title Description

121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 19902578; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

ENP Entry into the national phase
Ref document number: 2019902578; Country of ref document: EP; Effective date: 20210728