CN112309425A - Sound tone changing method, electronic equipment and computer readable storage medium - Google Patents

Sound tone changing method, electronic equipment and computer readable storage medium

Info

Publication number: CN112309425A
Application number: CN202011098110.8A
Authority: CN (China)
Prior art keywords: audio, processed, length, synthesized, sound
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN112309425B
Inventors: 王宝俊, 史巍, 林聚财, 殷俊
Current and original assignee: Zhejiang Dahua Technology Co Ltd
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202011098110.8A
Publication of CN112309425A; application granted; publication of CN112309425B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: characterised by the type of extracted parameters
    • G10L25/45: characterised by the type of analysis window
    • G10L25/90: Pitch determination of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Stereophonic System (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention provides a sound tonal modification method, an electronic device and a computer-readable storage medium. The method comprises the following steps: acquiring the positions of the pitch periods in the audio to be processed; framing the audio to be processed according to the pitch-period positions to form a plurality of audio frames; windowing the audio frames with a window function to obtain a plurality of analysis frames; by controlling the overlapping length between the analysis frames, synthesizing the analysis frames to obtain a synthesized audio and synthesizing the corresponding window functions to obtain a synthesis window; and processing the synthesized audio with the synthesis window to obtain the final audio after tonal modification. The final audio obtained by the method has a natural tonal-modification effect and high quality.

Description

Sound tone changing method, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of audio signal processing, and in particular, to a sound tonal modification method, an electronic device, and a computer-readable storage medium.
Background
With the development of information technology and multimedia technology, people's leisure life has become richer and richer, the demand for audio material keeps growing, and ordinary audio material can no longer meet the needs of production and daily life. Voice tonal-modification technology emerged in response, and its applications can be seen everywhere in real life: for example, it can change a person's voice into an unfamiliar one, thereby protecting personal privacy, and it is widely used in security applications such as voice-changing doorbells, intelligent buildings and camera intercoms.
Existing tonal-modification algorithms are numerous; the mainstream ones mostly adopt SOLA, PSOLA, WSOLA and phase-vocoder techniques. These algorithms perform their calculations in the time and frequency domains and suffer from a large amount of computation, an unnatural tonal-modification effect, low audio quality after modification, and unsuitability for deployment on embedded devices.
Disclosure of Invention
The invention provides a sound tonal modification method, an electronic device and a computer readable storage medium, which have natural sound tonal modification effect and high quality.
In order to solve the above technical problems, a first technical solution provided by the present invention is: there is provided a sound transposition method, the method comprising: acquiring the position of a pitch period in audio to be processed; framing the audio to be processed according to the position of the pitch period to form a plurality of audio frames; windowing the audio frames by using a window function to obtain a plurality of analysis frames; synthesizing a plurality of analysis frames to obtain a synthesized audio by controlling the overlapping length between the analysis frames, and synthesizing the window functions corresponding to the analysis frames to obtain a synthesized window; and processing the synthesized audio by using the synthesis window to obtain final audio after sound is modulated.
Wherein the obtaining of the position of the pitch period in the audio to be processed comprises: carrying out differential calculation on the audio to be processed to obtain a first audio signal to be processed; performing first zero-frequency suppression processing on the first audio signal to be processed to obtain a second audio signal to be processed; performing second zero-frequency suppression processing on the second audio signal to be processed to obtain a third audio signal to be processed; performing trend elimination processing on the third audio signal to be processed to obtain a fourth audio signal to be processed; obtaining a position of the pitch period based on the fourth audio signal to be processed.
Wherein, the performing a differential calculation on the audio to be processed to obtain a first audio signal to be processed includes: acquiring pulse code modulation data of the audio to be processed, and converting integer data corresponding to each sampling point into floating point data; intercepting fixed-length sampling points as operation objects; and carrying out differential calculation on the floating point type data corresponding to each sampling point in the operation object to obtain the first audio signal to be processed.
Performing first zero-frequency suppression processing and/or second zero-frequency suppression processing on the first audio signal to be processed and/or the second audio signal to be processed by using a zero-frequency container; the trend elimination processing on the third audio signal to be processed to obtain a fourth audio signal to be processed includes: and subtracting the average value of the short-time window from the data corresponding to the current sampling point in the third audio signal to be processed to perform trend elimination processing, thereby obtaining a fourth audio signal to be processed.
Each audio frame comprises a plurality of pitch periods, and the number of sampling points between every two adjacent pitch periods is a sliding window; the windowing the audio frame with a window function to obtain a plurality of analysis frames comprises: and adding a window function with a preset length to the sliding window in each audio frame to obtain an analysis frame corresponding to each audio frame.
Wherein the synthesizing a plurality of analysis frames to obtain a synthesized audio by controlling the overlapping length between the analysis frames and synthesizing the window functions corresponding to the analysis frames to obtain a synthesis window comprises: synthesizing the current analysis frame and the current synthesized analysis frame according to the target synthesis length based on the current synthesis length, so that the synthesized length is the target synthesis length; wherein the length of the synthesis window is equal to the length of the synthesized audio.
Wherein the method further comprises: acquiring a time length change factor, and obtaining the total target length of sound tone modification based on the time length change factor; and setting the target synthesis length of each synthesis according to the total target length.
Synthesizing the current analysis frame and the currently synthesized analysis frames according to the target synthesis length based on the current synthesis length, so that the synthesized length is the target synthesis length, comprises the following steps: judging the relation between the current synthesis length and the target synthesis length; and if the current synthesis length is smaller than the target synthesis length, offsetting the current analysis frame by the first sliding window of its previous analysis frame, on the basis of the current synthesis length, for synthesis.
Wherein the processing the synthesized audio using the synthesis window to obtain a final audio after sound transposition further comprises: judging a time length change factor; if the time length change factor is larger than a preset value, downsampling the synthesized audio to obtain a final audio after sound modulation; and if the time length change factor is smaller than a preset value, performing up-sampling processing on the synthesized audio to obtain a final audio after sound transposition.
Wherein, if the time length variation factor is greater than a preset value, the down-sampling processing of the synthesized audio to obtain a final audio after sound transposition comprises: if the time length change factor is larger than a preset value, processing the synthesized audio by using a low-pass filter, and then performing down-sampling processing on the processed synthesized audio to obtain a final audio after sound modulation; if the time length variation factor is smaller than a preset value, performing upsampling processing on the synthesized audio to obtain a final audio after sound transposition, wherein the final audio comprises: and if the time length change factor is smaller than a preset value, performing up-sampling processing on the synthesized audio, and processing the synthesized audio subjected to the up-sampling processing by using a low-pass filter to obtain a final audio subjected to sound modulation.
In order to solve the above technical problems, a second technical solution provided by the present invention is: provided is an electronic device including: a memory storing program instructions and a processor retrieving the program instructions from the memory to perform any of the sound transposition methods.
In order to solve the above technical problems, a third technical solution provided by the present invention is: there is provided a computer-readable storage medium storing a program file executable to implement any one of the sound transposition methods above.
The beneficial effect of the present invention, in contrast to the prior art, is that by controlling the overlapping length between analysis frames, a plurality of analysis frames are synthesized into a synthesized audio and the window functions corresponding to the analysis frames are synthesized into a synthesis window; the synthesized audio is then processed using the synthesis window to obtain the final audio after tonal modification. The final audio obtained by the method has a natural tonal-modification effect and high quality.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort, wherein:
FIG. 1 is a flowchart illustrating a first embodiment of a sound modification method according to the present invention;
FIG. 2 is a flowchart illustrating an embodiment of step S11;
FIG. 3 is a flowchart illustrating an embodiment of step S12;
FIG. 4 is a flowchart illustrating another embodiment of step S12;
FIG. 5 is a flowchart illustrating an embodiment of step S13;
FIG. 6 is a flowchart illustrating an embodiment of step S14;
FIG. 7 is a flowchart illustrating an embodiment of step S15;
FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the invention;
FIG. 9 is a schematic structural diagram of an embodiment of a computer-readable storage medium according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are only a part, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
In the prior art, a certain error is produced when the pitch search is performed, and this error is amplified, making waveform splicing inaccurate; using the autocorrelation function to compute waveform similarity over different time periods increases the computational cost; and in the waveform insertion and deletion process, noise remains to a certain extent even though windowing is applied. For example, when a waveform is inserted or cut, a spike is easily produced, causing a sudden change in the spectrum, so the quality of the generated speed-changed speech is not high. Therefore, in view of the above disadvantages, the present invention provides a method with low computational complexity and a good voice-changing effect. The present invention is described in detail below with reference to the accompanying drawings and embodiments.
Referring to fig. 1, a schematic flow chart of a first embodiment of the sound tonal modification method of the present invention includes:
step S11: the method comprises the steps of obtaining the position of a pitch period in audio to be processed.
Specifically, a section of audio data is acquired as the audio to be processed, and the pitch-period positions in it are obtained. In one embodiment, obtaining the pitch periods requires acquiring the pulse-code-modulation data of the audio to be processed, where pulse-code-modulation data is what is transmitted in a channel after a time- and value-continuous analog signal has been converted into a time- and value-discrete digital signal. Pulse code modulation is the process of sampling the analog signal, quantizing the sample amplitudes and coding them. The acquired pulse-code-modulation data comprises a number of sampling points, each with a corresponding integer value. In one embodiment, for convenience of processing, the integer value corresponding to each sampling point is normalized and converted into floating-point data. On the basis of the floating-point data, fixed-length runs of sampling points are intercepted as operation objects. Specifically, if the whole audio to be processed needs to be processed, no interception is required and all sampling points in the data are taken as the operation object; if only part of the audio needs tonal modification, the fixed-length sampling points are intercepted as the operation object after conversion into floating-point data.
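The normalization step can be sketched as follows; `pcm_to_float` is a hypothetical helper, and dividing by 32768 for 16-bit PCM is an assumed convention, since the patent only says the integer values are normalized to floating point.

```python
import numpy as np

# Hypothetical helper: normalize 16-bit PCM integers to floats in [-1, 1).
# The divisor 32768 is one common convention, not stated in the patent.
def pcm_to_float(pcm_int16):
    return pcm_int16.astype(np.float32) / 32768.0

samples = np.array([0, 16384, -32768, 32767], dtype=np.int16)
floats = pcm_to_float(samples)
```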
In one embodiment, in order to accurately obtain the position of the pitch period, the influence of the low frequency signal and the vocal tract model needs to be eliminated, and specifically, referring to fig. 2, the method includes:
step S111: and carrying out differential calculation on the audio to be processed to obtain a first audio signal to be processed.
It can be understood that if the whole audio to be processed is the tonal-modification object, the difference calculation is performed on the floating-point data of every sampling point of the audio; if only a certain section is the tonal-modification object, the difference calculation is performed on the intercepted fixed-length sampling points to obtain the first audio signal to be processed. Specifically, taking intercepted fixed-length sampling points as an example, the difference calculation is performed on the sampling points one by one, yielding a series of difference results that together form the first audio signal to be processed.
Specifically, the difference calculation can be performed using the following formula:
x[n]=s[n]-s[n-1] (1)。
wherein s[n] represents the floating-point value corresponding to the nth sampling point, s[n-1] represents the floating-point value corresponding to the (n-1)th sampling point, and x[n] represents the result of the difference calculation between the floating-point values of the nth and (n-1)th sampling points.
After each sampling point is subjected to differential calculation in the mode of the formula (1), a first audio signal to be processed is formed.
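Equation (1) can be sketched as follows; `difference` is a hypothetical helper, and passing the first sample through unchanged (it has no predecessor) is an assumption.

```python
import numpy as np

# Vectorized form of x[n] = s[n] - s[n-1]. How the first sample is
# handled is an assumption; here it is passed through unchanged.
def difference(s):
    x = np.empty_like(s)
    x[0] = s[0]
    x[1:] = s[1:] - s[:-1]
    return x

x = difference(np.array([0.0, 0.5, 0.25, 0.25]))
```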
Step S112: and carrying out first zero-frequency suppression processing on the first audio signal to be processed to obtain a second audio signal to be processed.
Specifically, the first audio signal to be processed is subjected to the first zero-frequency suppression processing to obtain the second audio signal to be processed; that is, the difference-calculation result of each sampling point is passed through the first zero-frequency container, and the outputs form the second audio signal to be processed.
The first zero-frequency container performs the first zero-frequency suppression processing in the following manner:
y1[n] = -Σ_{k=1}^{2} a_k·y1[n-k] + x[n] (2)
wherein y1[n] is the result obtained after the nth sampling point passes through the first zero-frequency container, a_k is a parameter of the first zero-frequency container, y1[n-k] is the result obtained after the kth sampling point before the nth sampling point passes through the first zero-frequency container, and x[n] is the result of the difference calculation between the floating-point values of the nth and (n-1)th sampling points.
And (3) carrying out first zero-frequency suppression processing on all sampling points in the first audio signal to be processed by utilizing the formula (2) to form a second audio signal to be processed.
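A sketch of one zero-frequency container follows, assuming the coefficients a1 = -2, a2 = 1 commonly used in zero-frequency filtering (an ideal second-order resonator at 0 Hz); the patent does not state the a_k values, so these coefficients are an assumption.

```python
import numpy as np

# Equation (2) with assumed coefficients a1 = -2, a2 = 1, so that
# y[n] = -a1*y[n-1] - a2*y[n-2] + x[n] = 2*y[n-1] - y[n-2] + x[n].
def zero_frequency_container(x):
    y = np.zeros(len(x))
    for n in range(len(x)):
        y_1 = y[n - 1] if n >= 1 else 0.0
        y_2 = y[n - 2] if n >= 2 else 0.0
        y[n] = 2.0 * y_1 - y_2 + x[n]
    return y

# An impulse comes out as a steadily growing ramp: the drift that
# the trend-elimination step S114 later removes.
y1 = zero_frequency_container(np.array([1.0, 0.0, 0.0, 0.0, 0.0]))
```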
Step S113: and carrying out second zero-frequency suppression processing on the second audio signal to be processed to obtain a third audio signal to be processed.
Specifically, the second audio signal to be processed is subjected to the second zero-frequency suppression processing to obtain the third audio signal to be processed; that is, each sampling point in the second audio signal to be processed is passed through the second zero-frequency container, and the outputs form the third audio signal to be processed.
The second zero-frequency container performs the second zero-frequency suppression processing in the following manner:
y2[n] = -Σ_{k=1}^{2} a_k·y2[n-k] + y1[n] (3)
wherein y1[n] is the result obtained after the nth sampling point passes through the first zero-frequency container, a_k is a parameter of the zero-frequency container, y2[n-k] is the result obtained after the kth sampling point before the nth sampling point passes through the second zero-frequency container, and y2[n] is the result obtained after the nth sampling point passes through the second zero-frequency container.
And (3) carrying out second zero-frequency suppression processing on all sampling points in the second audio signal to be processed by utilizing the formula (3) to form a third audio signal to be processed.
Step S114: and performing trend elimination processing on the third audio signal to be processed to obtain a fourth audio signal to be processed.
Specifically, the signal obtained after the zero-frequency containers perform zero-frequency suppression is an exponentially increasing or decreasing signal and needs to be re-centred so that it oscillates about the zero axis; trend-elimination processing therefore needs to be performed on the third audio signal to be processed. Specifically, the trend elimination is performed by subtracting the average value over a short-time window from the data corresponding to the current sampling point in the third audio signal to be processed, thereby obtaining the fourth audio signal to be processed.
y[n] = y2[n] - (1/(2N+1))·Σ_{m=-N}^{N} y2[n+m] (4)
wherein y2[n] represents the result obtained after the nth sampling point passes through the second zero-frequency container, y2[n+m] represents that result for the (n+m)th sampling point, 2N+1 is the length of the short-time window, and y[n] is the trend-removed result for the nth sampling point.
And (4) forming a fourth audio signal to be processed after trend elimination of each sampling point in the third audio signal to be processed by utilizing the formula (4).
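Equation (4) can be sketched as follows; the window half-length N is an illustrative choice, and truncating the window at the signal edges is an assumption the patent does not address.

```python
import numpy as np

# Subtract the mean over a short window of 2N+1 samples centred on n.
def remove_trend(y2, N=2):
    y = np.empty_like(y2)
    for n in range(len(y2)):
        lo, hi = max(0, n - N), min(len(y2), n + N + 1)
        y[n] = y2[n] - y2[lo:hi].mean()
    return y

# A pure linear ramp (the resonator's drift) is flattened to zero
# wherever the window fits fully inside the signal.
detrended = remove_trend(np.arange(10, dtype=float))
```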
Step S115: the position of the pitch period is obtained based on a fourth audio signal to be processed.
Specifically, the fourth audio signal to be processed obtained after the trend elimination is an oscillating waveform, and the local period of the fourth audio signal is approximately equal to the local pulse period, so that zero-crossing point extraction from negative to positive is performed on the signal after the trend elimination, and the position of the pitch period is determined.
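The negative-to-positive zero-crossing extraction can be sketched as:

```python
import numpy as np

# Indices where the detrended signal crosses zero from negative to
# non-negative; these mark the pitch-period positions.
def rising_zero_crossings(y):
    nonneg = y >= 0
    return np.where(~nonneg[:-1] & nonneg[1:])[0] + 1

y = np.array([-1.0, -0.5, 0.3, 0.8, -0.2, -0.6, 0.1])
marks = rising_zero_crossings(y)
```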
Step S12: and framing the audio to be processed according to the position of the pitch period to form a plurality of audio frames.
The audio to be processed is framed according to the pitch-period positions to form a plurality of audio frames. Specifically, multiple pitch-period positions are extracted in step S11, and the sampling points between two adjacent pitch periods constitute a sliding window, as shown in fig. 3. Assuming that 6 pitch-period positions are extracted in step S11, each pair of adjacent pitch periods in fig. 3 (1 and 2, 2 and 3, 3 and 4, 4 and 5, 5 and 6) delimits one sliding window, and each sliding window covers a number of sampling points.
The audio data is framed according to the pitch periods shown in fig. 3 to obtain a plurality of audio frames. An audio frame may contain multiple pitch periods; for example, with 4 pitch periods (3 sliding windows) per audio frame, as shown in fig. 4, pitch periods 1, 2, 3 and 4 form one audio frame; pitch periods 2, 3, 4 and 5 form one audio frame; and pitch periods 3, 4, 5 and 6 form one audio frame.
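The framing of the fig. 4 example (4 pitch periods, i.e. 3 sliding windows, per frame, advancing one pitch mark at a time) can be sketched as follows; the mark positions are made up for illustration.

```python
# Group pitch-mark positions into overlapping frames of 4 marks
# (3 sliding windows), advancing one mark per frame, as in fig. 4.
def frames_from_pitch_marks(marks, periods_per_frame=4):
    n = periods_per_frame
    return [marks[i:i + n] for i in range(len(marks) - n + 1)]

frames = frames_from_pitch_marks([10, 90, 170, 260, 340, 420])
```

With 6 pitch marks this yields 3 overlapping frames, matching the example in the text.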
Step S13: the audio frames are windowed using a window function to obtain a plurality of analysis frames.
Specifically, a window function of preset length is applied over the sliding windows in each audio frame to obtain the analysis frame corresponding to that audio frame; referring to fig. 5, a window function whose length equals the total sliding-window length of the audio frame is applied, converting the audio frame into an analysis frame. Specifically, taking the audio frame formed by pitch periods 1, 2, 3 and 4 as an example, which contains the sliding windows Hop1, Hop2 and Hop3, a window function of the total length of Hop1, Hop2 and Hop3 is applied to the audio frame, converting it into an analysis frame.
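The windowing step can be sketched as follows; the Hann shape is an assumption, since the patent fixes only the window's length (the frame's total sliding-window length, Hop1 + Hop2 + Hop3 in the example).

```python
import numpy as np

# Window an audio frame over the span of its sliding windows, i.e.
# from its first to its last pitch mark. Hann is an assumed shape.
def make_analysis_frame(audio, frame_marks):
    start, end = frame_marks[0], frame_marks[-1]
    w = np.hanning(end - start)   # length = Hop1 + Hop2 + Hop3
    return audio[start:end] * w, w

audio = np.ones(500)
frame, w = make_analysis_frame(audio, [10, 90, 170, 260])
```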
In this embodiment, since every n adjacent sliding windows constitute an analysis frame, the tail of the sequence, of length equal to the sum of the last n-1 sliding windows, cannot be processed; its pitch-period positions and the corresponding sequence data are retained and processed together with the next subsequence.
As shown in fig. 5, two adjacent analysis frames are offset by the length of the first sliding window of the earlier frame. For example, analysis frame 2 is offset from analysis frame 1 by the length Hop1 of the first sliding window of analysis frame 1, and analysis frame 3 is offset from analysis frame 2 by the length Hop2 of the first sliding window of analysis frame 2.
Step S14: by controlling the overlapping length between the analysis frames, the analysis frames are synthesized to obtain a synthesized audio, and the window functions corresponding to the analysis frames are synthesized to obtain a synthesis window.
Specifically, by controlling the overlapping length between the analysis frames, the analysis frames are synthesized to obtain the synthesized audio and the window functions corresponding to the analysis frames are synthesized to obtain the synthesis window, thereby achieving expansion or compression of the signal's duration.
Specifically, an initial target length and an initial synthesis length are set to be 0, the target synthesis length is set during first synthesis, the analysis frame 1 is synthesized on the basis of the initial synthesis length, and the synthesis length reaches the target synthesis length; and then synthesizing the current analysis frame and the current synthesized analysis frame according to the target synthesis length based on the current synthesis length, so that the synthesized length is the target synthesis length. Wherein the length of the synthesis window is equal to the length of the synthesized audio.
For example, if the current synthesis length is 5 s and the target synthesis length is 10 s, the current analysis frame contributes a further 5 s on top of the current 5 s, so that the synthesized length reaches the target of 10 s; if the next target synthesis length is 13 s, the next analysis frame contributes 3 s on top of the current 10 s, so that the synthesized length reaches the target of 13 s.
In an embodiment, when setting the target synthesis length, the total target length of the tonal modification may be obtained from the time-length change factor, which characterizes whether the duration of the current audio to be processed is to be extended or shortened. The target synthesis length of each synthesis is then set according to the total target length; specifically, the number of analysis frames may be taken into account when setting each target synthesis length.
In an embodiment, please refer to fig. 6 for overlapping the analysis frames, which specifically includes:
step S61: and judging the relation between the current synthesis length and the target synthesis length.
Specifically, it is determined whether the current synthesis length is greater than the target synthesis length.
Step S62: and if the current synthesis length is smaller than the target synthesis length, the current analysis frame is offset from the first sliding window of the last analysis frame of the current analysis frame on the basis of the current synthesis length to be synthesized.
In an embodiment, if the current synthesis length is 10 s and the target synthesis length is 13 s, the current synthesis length is smaller than the target synthesis length; at this time the current analysis frame is offset by the first sliding window of its previous analysis frame, on the basis of the current synthesis length, and synthesized.
Step S63: if the current synthesis length is greater than the target synthesis length, the next analysis frame is selected, and the next target synthesis length is set, and the process returns to step S61.
Specifically, if the current synthesis length is 13 s and the target synthesis length is 10 s, the current synthesis length is greater than the target synthesis length; the current analysis frame is therefore not synthesized. Instead, the next analysis frame is selected and the next target synthesis length is set, and the judgment is repeated: once the current synthesis length is smaller than the new target synthesis length, that analysis frame is synthesized.
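The judgment loop of steps S61 to S63 can be sketched as follows. This is a simplified model: the per-frame target schedule derived from the time-length change factor `alpha` is an assumption about how the successive targets are set, and each frame is taken to advance the output by its first sliding window.

```python
# Decide, frame by frame, whether an analysis frame is overlap-added
# or skipped, driven only by running vs. target synthesis length.
# frame_hops[i] is the first sliding-window length of frame i.
def plan_synthesis(frame_hops, alpha):
    used, placed, analysed = 0, [], 0
    for i, hop in enumerate(frame_hops):
        analysed += hop
        target = alpha * analysed      # assumed target schedule
        if used < target:              # behind target: synthesize frame i
            placed.append(i)
            used += hop                # frame advances output by its hop
        # ahead of target: skip this frame and re-judge with the next
    return placed

# Halving the duration (alpha = 0.5) keeps every other frame.
kept = plan_synthesis([80] * 10, alpha=0.5)
```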
When synthesizing the analysis frame, it is necessary to synthesize the window functions in the same manner, and the position where the window functions are superimposed is the same as the position where the analysis frame is superimposed.
Step S15: and processing the synthesized audio by using the synthesis window to obtain the final audio after the sound is modulated.
Specifically, the final audio after tonal modification is obtained by dividing the synthesized audio by the synthesis window at the corresponding superposition positions.
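The division by the synthesis window can be sketched as follows; the `eps` floor guarding near-zero window values is an assumed implementation detail the patent does not spell out.

```python
import numpy as np

# Undo the windowing gain by dividing the overlap-added audio by the
# overlap-added window, sample by sample.
def normalize_by_window(synth_audio, synth_window, eps=1e-8):
    return synth_audio / np.maximum(synth_window, eps)

audio = np.array([0.25, 1.0, 2.25])
window = np.array([0.5, 1.0, 1.5])
out = normalize_by_window(audio, window)
```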
In an embodiment, referring to fig. 7, step S15 further includes:
step S71: and judging the time length change factor.
Specifically, by judging the time length change factor, it is determined whether the time length of the audio to be processed is extended or shortened according to the sound tone change result.
Step S72: and if the time length change factor is larger than a preset value, performing downsampling processing on the synthesized audio to obtain a final audio after sound transposition.
Specifically, the preset value is 1, if the time length variation factor is greater than 1, it indicates that the duration of the audio to be processed needs to be stretched, and at this time, down-sampling processing is performed on the synthesized audio to obtain the final audio after sound transposition.
In one embodiment, in order to prevent aliasing from occurring in downsampling, the synthesized audio is processed by using a low-pass filter, and then downsampling is performed on the processed synthesized audio to obtain a final audio after sound modification.
Step S73: and if the time length change factor is smaller than a preset value, performing up-sampling processing on the synthesized audio to obtain a final audio after sound transposition.
If the time length change factor is less than 1, it indicates that the duration of the audio to be processed needs to be shortened, and at this time, the synthesized audio is subjected to up-sampling processing to obtain the final audio after sound modulation.
In an embodiment, in order to prevent spectral images (mirroring) from appearing during upsampling, the synthesized audio is first upsampled, and the upsampled audio is then processed with a low-pass filter to obtain the final pitch-shifted audio.
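The resampling of steps S71-S73, together with the anti-aliasing and anti-imaging low-pass filtering described above, can be sketched as follows. This is a simplified illustration restricted to integer factors; the function names and the windowed-sinc filter design are assumptions, not the patent's implementation:

```python
import numpy as np

def lowpass_fir(cutoff, numtaps=101):
    """Windowed-sinc low-pass FIR; cutoff is a fraction of the sample rate's Nyquist."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = np.sinc(cutoff * n) * cutoff * np.hamming(numtaps)
    return h / h.sum()  # unity DC gain

def resample_by_factor(x, alpha):
    """alpha > 1: audio was stretched, so downsample by alpha
                  (low-pass FIRST to prevent aliasing, step S72);
       alpha < 1: audio was shortened, so upsample by 1/alpha
                  (low-pass AFTER zero-stuffing to remove images, step S73)."""
    if alpha > 1:
        m = int(round(alpha))
        y = np.convolve(x, lowpass_fir(1.0 / m), mode="same")   # anti-alias
        return y[::m]                                           # decimate
    elif alpha < 1:
        m = int(round(1 / alpha))
        up = np.zeros(len(x) * m)
        up[::m] = x                                             # zero-stuff
        y = np.convolve(up, lowpass_fir(1.0 / m), mode="same")  # anti-image
        return y * m                                            # restore gain
    return x
```

In practice a polyphase resampler (e.g. `scipy.signal.resample_poly`) performs both steps more efficiently and supports rational factors, but the order of filtering relative to rate change is the same as stated in the embodiments above.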
According to the sound pitch-shifting method provided by the invention, when the position of the pitch period is obtained, low-frequency interference is eliminated and the influence of the vocal tract is cancelled through difference calculation, first zero-frequency suppression processing, second zero-frequency suppression processing and trend elimination processing, so that the pitch period is extracted accurately. In the analysis-frame synthesis stage, the signal is stretched or compressed by controlling the overlap length between analysis frames. The final audio obtained by the method has a natural pitch-shifting effect and high quality, and the amount of computation involved is small.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the invention. The electronic device comprises a memory 202 and a processor 201 connected to each other.
The memory 202 is used to store program instructions implementing the sound transposition method of the device of any of the above.
The processor 201 is used to execute program instructions stored by the memory 202.
The processor 201 may also be referred to as a Central Processing Unit (CPU). The processor 201 may be an integrated circuit chip having signal processing capabilities. The processor 201 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 202 may be a memory bank, a TF card, or the like, and can store all information in the electronic device, including the input raw data, the computer program, intermediate operation results and final operation results. It stores and retrieves information according to the location specified by the controller; only with the memory does the electronic device have the storage capability required for normal operation. By purpose, the storage of an electronic device can be divided into main storage (internal storage) and auxiliary storage (external storage). External storage is usually a magnetic medium, an optical disc, or the like, and can retain information for a long time. Internal storage refers to the storage components on the main board, which hold the data and programs currently being executed; it only stores programs and data temporarily, and the data is lost when the power is turned off.
In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a system server, a network device, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application.
Please refer to fig. 9, which is a schematic structural diagram of a computer-readable storage medium according to the present invention. The storage medium of the present application stores a program file 203 capable of implementing all of the sound transposition methods described above, wherein the program file 203 may be stored in the storage medium in the form of a software product, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disc, or terminal devices such as a computer, a server, a mobile phone, and a tablet.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (12)

1. A method of tonal modification, the method comprising:
acquiring the position of a pitch period in audio to be processed;
framing the audio to be processed according to the position of the pitch period to form a plurality of audio frames;
windowing the audio frames by using a window function to obtain a plurality of analysis frames;
synthesizing a plurality of analysis frames to obtain a synthesized audio by controlling the overlapping length between the analysis frames, and synthesizing the window functions corresponding to the analysis frames to obtain a synthesized window;
and processing the synthesized audio by using the synthesis window to obtain final audio after sound is modulated.
2. The sound transposition method of claim 1, wherein the obtaining the position of the pitch period in the audio to be processed comprises:
carrying out differential calculation on the audio to be processed to obtain a first audio signal to be processed;
performing first zero-frequency suppression processing on the first audio signal to be processed to obtain a second audio signal to be processed;
performing second zero-frequency suppression processing on the second audio signal to be processed to obtain a third audio signal to be processed;
performing trend elimination processing on the third audio signal to be processed to obtain a fourth audio signal to be processed;
obtaining a position of the pitch period based on the fourth audio signal to be processed.
3. The sound tonal modification method of claim 2, wherein the performing a differential calculation on the audio to be processed to obtain a first audio signal to be processed comprises:
acquiring pulse code modulation data of the audio to be processed, and converting integer data corresponding to each sampling point into floating point data;
intercepting fixed-length sampling points as operation objects;
and carrying out differential calculation on the floating point type data corresponding to each sampling point in the operation object to obtain the first audio signal to be processed.
4. The sound tonal modification method of claim 3, wherein a zero-frequency resonator is used to perform the first zero-frequency suppression processing and/or the second zero-frequency suppression processing on the first audio signal to be processed and/or the second audio signal to be processed;
the trend elimination processing on the third audio signal to be processed to obtain a fourth audio signal to be processed includes:
and subtracting the average value of the short-time window from the data corresponding to the current sampling point in the third audio signal to be processed to perform trend elimination processing, thereby obtaining a fourth audio signal to be processed.
5. The sound transposition method of claim 1 wherein each of the audio frames includes a plurality of pitch periods, and the number of samples between two adjacent pitch periods is a sliding window;
the windowing the audio frame with a window function to obtain a plurality of analysis frames comprises:
and adding a window function with a preset length to the sliding window in each audio frame to obtain an analysis frame corresponding to each audio frame.
6. The method of claim 5, wherein the synthesizing a plurality of analysis frames to obtain a synthesized audio by controlling the length of the overlap between the analysis frames, and synthesizing the window functions corresponding to the analysis frames to obtain a synthesis window comprises:
synthesizing the current analysis frame and the current synthesized analysis frame according to the target synthesis length based on the current synthesis length, so that the synthesized length is the target synthesis length;
wherein the length of the synthesis window is equal to the length of the synthesized audio.
7. The sound transposition method of claim 6, further comprising:
acquiring a time length change factor, and obtaining the total target length of sound tone modification based on the time length change factor;
and setting the target synthesis length of each synthesis according to the total target length.
8. The sound transposition method according to claim 6, wherein the synthesizing a current analysis frame and a currently synthesized analysis frame according to a target synthesis length based on a current synthesis length so that the synthesized length is the target synthesis length comprises:
judging the relation between the current synthesis length and the target synthesis length;
and if the current synthesis length is smaller than the target synthesis length, the current analysis frame is offset from a first sliding window of a last analysis frame of the current analysis frame on the basis of the current synthesis length to be synthesized.
9. The method of claim 1, wherein the processing the synthesized audio using the synthesis window to obtain the final audio after sound modification further comprises:
judging a time length change factor;
if the time length change factor is larger than a preset value, downsampling the synthesized audio to obtain a final audio after sound modulation;
and if the time length change factor is smaller than a preset value, performing up-sampling processing on the synthesized audio to obtain a final audio after sound transposition.
10. The method of claim 9, wherein the down-sampling the synthesized audio to obtain the final audio after the pitch modification if the time length variation factor is greater than a preset value comprises:
if the time length change factor is larger than a preset value, processing the synthesized audio by using a low-pass filter, and then performing down-sampling processing on the processed synthesized audio to obtain a final audio after sound modulation;
if the time length variation factor is smaller than a preset value, performing upsampling processing on the synthesized audio to obtain a final audio after sound transposition, wherein the final audio comprises:
and if the time length change factor is smaller than a preset value, performing up-sampling processing on the synthesized audio, and processing the synthesized audio subjected to the up-sampling processing by using a low-pass filter to obtain a final audio subjected to sound modulation.
11. An electronic device, comprising: a memory storing program instructions and a processor retrieving the program instructions from the memory to perform the sound transposition method according to any one of claims 1-10.
12. A computer-readable storage medium, characterized in that a program file is stored, which can be executed to implement the sound transposition method according to any one of claims 1 to 10.
CN202011098110.8A 2020-10-14 2020-10-14 Sound tone changing method, electronic equipment and computer readable storage medium Active CN112309425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011098110.8A CN112309425B (en) 2020-10-14 2020-10-14 Sound tone changing method, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011098110.8A CN112309425B (en) 2020-10-14 2020-10-14 Sound tone changing method, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112309425A true CN112309425A (en) 2021-02-02
CN112309425B CN112309425B (en) 2024-08-30

Family

ID=74327048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011098110.8A Active CN112309425B (en) 2020-10-14 2020-10-14 Sound tone changing method, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112309425B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113178183A (en) * 2021-04-30 2021-07-27 杭州网易云音乐科技有限公司 Sound effect processing method and device, storage medium and computing equipment
CN113782050A (en) * 2021-09-08 2021-12-10 浙江大华技术股份有限公司 Sound tone changing method, electronic device and storage medium

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1993004467A1 (en) * 1991-08-22 1993-03-04 Georgia Tech Research Corporation Audio analysis/synthesis system
WO1995030983A1 (en) * 1994-05-04 1995-11-16 Georgia Tech Research Corporation Audio analysis/synthesis system
CN1719514A (en) * 2004-07-06 2006-01-11 中国科学院自动化研究所 Based on speech analysis and synthetic high-quality real-time change of voice method
JP2006030609A (en) * 2004-07-16 2006-02-02 Yamaha Corp Voice synthesis data generating device, voice synthesizing device, voice synthesis data generating program, and voice synthesizing program
EP1895511A1 (en) * 2005-06-23 2008-03-05 Matsushita Electric Industrial Co., Ltd. Audio encoding apparatus, audio decoding apparatus and audio encoding information transmitting apparatus
EP2211335A1 (en) * 2009-01-21 2010-07-28 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus, method and computer program for obtaining a parameter describing a variation of a signal characteristic of a signal
JP2010210758A (en) * 2009-03-09 2010-09-24 Univ Of Tokyo Method and device for processing signal containing voice
US20150302845A1 (en) * 2012-08-01 2015-10-22 National Institute Of Advanced Industrial Science And Technology Estimation system of spectral envelopes and group delays for sound analysis and synthesis, and audio signal synthesis system
EP2980791A1 (en) * 2014-07-28 2016-02-03 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Processor, method and computer program for processing an audio signal using truncated analysis or synthesis window overlap portions
WO2016102651A1 (en) * 2014-12-24 2016-06-30 Reza Yves Jean-Paul Guy Methods for processing and analysing a signal, and devices implementing said methods
CN106228973A (en) * 2016-07-21 2016-12-14 福州大学 Stablize the music voice modified tone method of tone color
CN107731241A (en) * 2017-09-29 2018-02-23 广州酷狗计算机科技有限公司 Handle the method, apparatus and storage medium of audio signal
CN108269579A (en) * 2018-01-18 2018-07-10 厦门美图之家科技有限公司 Voice data processing method, device, electronic equipment and readable storage medium storing program for executing
CN108492832A (en) * 2018-03-21 2018-09-04 北京理工大学 High quality sound transform method based on wavelet transformation
US20180315439A1 (en) * 2017-04-28 2018-11-01 Nxp B.V. Signal processor
CN109829515A (en) * 2019-03-07 2019-05-31 北京市博汇科技股份有限公司 A kind of audio-frequency fingerprint matching process and audio-frequency fingerprint coalignment
CN110942766A (en) * 2019-11-29 2020-03-31 厦门快商通科技股份有限公司 Audio event detection method, system, mobile terminal and storage medium
WO2020103035A1 (en) * 2018-11-21 2020-05-28 深圳市欢太科技有限公司 Audio processing method and apparatus, and storage medium and electronic device

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1993004467A1 (en) * 1991-08-22 1993-03-04 Georgia Tech Research Corporation Audio analysis/synthesis system
WO1995030983A1 (en) * 1994-05-04 1995-11-16 Georgia Tech Research Corporation Audio analysis/synthesis system
CN1719514A (en) * 2004-07-06 2006-01-11 中国科学院自动化研究所 Based on speech analysis and synthetic high-quality real-time change of voice method
JP2006030609A (en) * 2004-07-16 2006-02-02 Yamaha Corp Voice synthesis data generating device, voice synthesizing device, voice synthesis data generating program, and voice synthesizing program
EP1895511A1 (en) * 2005-06-23 2008-03-05 Matsushita Electric Industrial Co., Ltd. Audio encoding apparatus, audio decoding apparatus and audio encoding information transmitting apparatus
EP2211335A1 (en) * 2009-01-21 2010-07-28 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus, method and computer program for obtaining a parameter describing a variation of a signal characteristic of a signal
JP2010210758A (en) * 2009-03-09 2010-09-24 Univ Of Tokyo Method and device for processing signal containing voice
US20150302845A1 (en) * 2012-08-01 2015-10-22 National Institute Of Advanced Industrial Science And Technology Estimation system of spectral envelopes and group delays for sound analysis and synthesis, and audio signal synthesis system
EP2980791A1 (en) * 2014-07-28 2016-02-03 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Processor, method and computer program for processing an audio signal using truncated analysis or synthesis window overlap portions
WO2016102651A1 (en) * 2014-12-24 2016-06-30 Reza Yves Jean-Paul Guy Methods for processing and analysing a signal, and devices implementing said methods
CN107210046A (en) * 2014-12-24 2017-09-26 伊夫斯·吉恩-保罗·盖伊·雷扎 For handling the method with signal Analysis, and realize the device of this method
CN106228973A (en) * 2016-07-21 2016-12-14 福州大学 Stablize the music voice modified tone method of tone color
US20180315439A1 (en) * 2017-04-28 2018-11-01 Nxp B.V. Signal processor
CN107731241A (en) * 2017-09-29 2018-02-23 广州酷狗计算机科技有限公司 Handle the method, apparatus and storage medium of audio signal
CN108269579A (en) * 2018-01-18 2018-07-10 厦门美图之家科技有限公司 Voice data processing method, device, electronic equipment and readable storage medium storing program for executing
CN108492832A (en) * 2018-03-21 2018-09-04 北京理工大学 High quality sound transform method based on wavelet transformation
WO2020103035A1 (en) * 2018-11-21 2020-05-28 深圳市欢太科技有限公司 Audio processing method and apparatus, and storage medium and electronic device
CN109829515A (en) * 2019-03-07 2019-05-31 北京市博汇科技股份有限公司 A kind of audio-frequency fingerprint matching process and audio-frequency fingerprint coalignment
CN110942766A (en) * 2019-11-29 2020-03-31 厦门快商通科技股份有限公司 Audio event detection method, system, mobile terminal and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHANG DONG; PENG JIANYUN; YU CHUNYAN: "Music voice pitch-shifting method with stable timbre", Computer Engineering, no. 03, 15 March 2018 (2018-03-15), pages 320 - 326 *
LI ZHENZHEN; YAN GUOZHENG: "Real-time speech speed change processing using phase adjustment", Measurement &amp; Control Technology, no. 07, 18 July 2006 (2006-07-18), pages 21 - 23 *
XU XUEQIONG; YU XIAOQING; LI CHANGLIAN; WAN WANGGEN: "Audio duration adjustment based on an improved waveform similarity overlap-add algorithm", Journal of Applied Sciences, no. 05, 15 September 2009 (2009-09-15), pages 78 - 83 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113178183A (en) * 2021-04-30 2021-07-27 杭州网易云音乐科技有限公司 Sound effect processing method and device, storage medium and computing equipment
CN113178183B (en) * 2021-04-30 2024-05-14 杭州网易云音乐科技有限公司 Sound effect processing method, device, storage medium and computing equipment
CN113782050A (en) * 2021-09-08 2021-12-10 浙江大华技术股份有限公司 Sound tone changing method, electronic device and storage medium
CN113782050B (en) * 2021-09-08 2024-08-30 浙江大华技术股份有限公司 Sound tone changing method, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112309425B (en) 2024-08-30

Similar Documents

Publication Publication Date Title
CN107731223B (en) Voice activity detection method, related device and equipment
CN110709924B (en) Audio-visual speech separation
CN111402855B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111508508A (en) Super-resolution audio generation method and equipment
US20210193149A1 (en) Method, apparatus and device for voiceprint recognition, and medium
CN111916093B (en) Audio processing method and device
CA1065490A (en) Emphasis controlled speech synthesizer
WO2022017040A1 (en) Speech synthesis method and system
Ganapathy et al. Temporal envelope compensation for robust phoneme recognition using modulation spectrum
CN112927674B (en) Voice style migration method and device, readable medium and electronic equipment
EP4254408A1 (en) Speech processing method and apparatus, and apparatus for processing speech
CN114203163A (en) Audio signal processing method and device
CN113571047B (en) Audio data processing method, device and equipment
CN112309425A (en) Sound tone changing method, electronic equipment and computer readable storage medium
CN111739544A (en) Voice processing method and device, electronic equipment and storage medium
CN112259116A (en) Method and device for reducing noise of audio data, electronic equipment and storage medium
CN112185342A (en) Voice conversion and model training method, device and system and storage medium
CN113299313A (en) Audio processing method and device and electronic equipment
CN112908351A (en) Audio tone changing method, device, equipment and storage medium
CN113421584B (en) Audio noise reduction method, device, computer equipment and storage medium
CN114420100B (en) Voice detection method and device, electronic equipment and storage medium
CN113782050B (en) Sound tone changing method, electronic equipment and storage medium
CN112687284B (en) Reverberation suppression method and device for reverberation voice
Zheng et al. Bandwidth extension WaveNet for bone-conducted speech enhancement
CN114678036B (en) Speech enhancement method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant