CN113782050A - Sound tone changing method, electronic device and storage medium - Google Patents

Sound tone changing method, electronic device and storage medium

Info

Publication number
CN113782050A
Authority
CN
China
Prior art keywords
frame
audio data
speech
current
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111052275.6A
Other languages
Chinese (zh)
Inventor
史巍
王宝俊
张锦铖
林聚财
殷俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202111052275.6A
Publication of CN113782050A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/45 - Speech or voice analysis techniques characterised by the type of analysis window
    • G10L25/90 - Pitch determination of speech signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The application discloses a sound tone changing method, an electronic device and a computer readable storage medium. The method comprises the following steps: acquiring audio data to be processed; searching for pitch periods from the current starting point in order of duration from short to long to determine a current first audio frame, where the number of pitch periods in the current first audio frame is greater than a pitch period threshold, and the shorter the searched duration, the larger the corresponding pitch period threshold; windowing the current first audio frame to obtain a current analysis frame; and synthesizing target audio data using the current analysis frame. In this way, the real-time performance of sound processing during tone modification can be improved.

Description

Sound tone changing method, electronic device and storage medium
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a sound tonal modification method, an electronic device, and a computer-readable storage medium.
Background
For purposes such as protecting personal information or entertainment, it is often necessary to change the sound in original audio data by a sound tone changing (sound transposition) method. A sound tone changing method changes the pitch of a voice without changing its speed. For example, in some anonymous video conferences, if personal information needs to be protected, the audio data of the conference is processed with a sound tone changing method so that the speaker's voice is altered and the original voice cannot be recognized. However, conventional sound tone changing methods offer poor real-time performance when processing sound.
Disclosure of Invention
The application provides a sound tone changing method, an electronic device and a computer readable storage medium, which address the poor real-time performance of existing sound tone changing methods.
In order to solve the technical problem, the application adopts a technical scheme: a sound tone changing method is provided. The method comprises the following steps: acquiring audio data to be processed; searching for pitch periods from the current starting point in order of duration from short to long to determine a current first audio frame, where the number of pitch periods in the current first audio frame is greater than a pitch period threshold, and the shorter the searched duration, the larger the corresponding pitch period threshold; windowing the current first audio frame to obtain a current analysis frame; and synthesizing target audio data using the current analysis frame.
In order to solve the above technical problem, another technical solution adopted by the present application is: an electronic device is provided, which comprises a processor and a memory connected with the processor, wherein the memory stores program instructions; the processor is configured to execute the program instructions stored by the memory to implement the above-described method.
In order to solve the above technical problem, the present application adopts another technical solution: there is provided a computer readable storage medium storing program instructions that when executed are capable of implementing the above method.
In this way, when the current first audio frame is determined, pitch periods are searched from the current starting point in order of duration from short to long; that is, the number of pitch periods is searched adaptively under a multi-stage duration strategy. After the current first audio frame is windowed, the resulting analysis frame contains more pitch periods than the pitch period threshold while its duration stays within an allowed range, so the analysis frame is never excessively long and the real-time performance of sound processing is improved.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a method for pitch modification according to the present application;
FIG. 2 is a schematic view of the detailed process of S11 in FIG. 1;
FIG. 3 is a schematic flow chart of another embodiment of the sound tone changing method of the present application;
FIG. 4 is a schematic structural diagram of an embodiment of an electronic device of the present application;
FIG. 5 is a schematic structural diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first", "second" and "third" in this application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any indication of the number of technical features indicated. Thus, a feature defined as "first," "second," or "third" may explicitly or implicitly include at least one of the feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless explicitly specifically limited otherwise.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments without conflict.
Before introducing the method provided by the present application, the reason why the existing sound tonal modification method cannot process in real time is explained:
the processing of audio data referred to herein is from a time domain perspective. The audio data to be processed comprises a plurality of sampling points, and each sampling point is provided with a corresponding sampling time point and a corresponding sequence number. The pitch period is the expression form of the pitch period in the time domain, the position of the pitch period is the initial position of the pitch period in the time domain, and the pitch period is expressed as the serial number of a specific sampling point in the time domain.
In the existing scheme, the number of pitch periods in an analysis frame used to synthesize the target audio data is fixed. Taking a fixed number M (M ≥ 2) as an example, an analysis frame is the portion of the original audio data spanning M pitch period intervals, multiplied by a Hanning window of the same length. The analysis frame can be expressed as:
x_s(n) = x(e(n) : e(n+M)) * h_win, n++
where x_s(n) denotes the nth analysis frame, x denotes the original audio data, h_win denotes the Hanning window function, and e(n) denotes the sequence number of the sampling point corresponding to the nth pitch period.
Since pitch periods exist only in the voiced parts of the original audio data, when determining an analysis frame the pitch period can be searched only in voiced parts, not in noisy or silent parts. When the original audio data contains a long soundless stretch, the time interval between two adjacent pitch periods becomes very long, so the analysis frame becomes very long and cannot be processed in real time.
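For illustration, the fixed-number scheme can be sketched as follows (a hypothetical helper, assuming the pitch-mark sample indices e are already known); note that the frame length is dictated entirely by the spacing of the pitch marks, which is exactly what breaks real-time behavior during long silences:

    import numpy as np

    def fixed_m_analysis_frames(x, e, M=2):
        # Conventional scheme: each analysis frame spans exactly M pitch
        # period intervals and is multiplied by a Hanning window of the
        # same length.
        frames = []
        for n in range(len(e) - M):
            seg = x[e[n]:e[n + M]]
            frames.append(seg * np.hanning(len(seg)))
        return frames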
To this end, the method provided by the present application is as follows:
Fig. 1 is a schematic flow chart of an embodiment of the sound tone changing method of the present application. It should be noted that, if substantially the same result is obtained, the method is not limited to the flow sequence shown in fig. 1. As shown in fig. 1, the present embodiment may include:
s11: and acquiring audio data to be processed.
The audio data to be processed may be the acquired original audio data, or data obtained by processing the original audio data. Compared with the former, the latter changes the distribution of pitch periods and avoids noise between speech frames and non-speech frames in the target audio data obtained by tone modification. Referring also to fig. 2, in the latter case S11 may include the following sub-steps:
s111: raw audio data is acquired.
The raw audio data may be real-time or non-real-time.
S112: the original audio data is divided into a number of second audio frames.
The original audio data may be divided into a plurality of second audio frames using a window function. The length of a second audio frame (the window length) may be chosen arbitrarily, or determined from the sampling rate of the original audio data, with the frame length proportional to the sampling rate.
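A minimal sketch of this division, assuming a rectangular window and an illustrative 20 ms frame duration so that the frame length scales with the sampling rate:

    import numpy as np

    def split_into_frames(x, fs, frame_dur=0.02):
        # Divide audio into fixed-length "second audio frames"; the frame
        # length grows proportionally with the sampling rate fs.
        frame_len = int(fs * frame_dur)
        n_frames = len(x) // frame_len
        return x[:n_frames * frame_len].reshape(n_frames, frame_len)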
S113: speech frames and non-speech frames of a number of second audio frames are detected.
Both speech frames and non-speech frames can be detected using a VAD (Voice Activity Detection) algorithm; any classical VAD algorithm may be used.
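As a stand-in for whichever VAD algorithm is chosen, the following minimal sketch labels frames by short-time energy; the 0.1 energy ratio is an assumed tuning constant, not part of the patent:

    import numpy as np

    def simple_vad(frames, energy_ratio=0.1):
        # Label each frame: True = speech frame, False = non-speech frame.
        energy = np.mean(frames.astype(np.float64) ** 2, axis=1)
        return energy > energy_ratio * np.max(energy)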
S114: and smoothing at least part of the non-speech frames adjacent to the speech frames by utilizing the speech frames.
It will be appreciated that smoothing non-speech frames with a speech frame converts them into transition frames between speech and non-speech, so that pitch periods can be searched within the transition frames when the analysis frame is subsequently determined.
The maximum number N of non-speech frames smoothed with a single speech frame may be determined from the sampling rate fs, and this maximum number is positively correlated with the sampling rate. For example, the number of non-speech frames smoothed with a single speech frame may be determined by:
N = fs * T_1, T_1 ∈ {T_11, T_12}, T_11 < T_12
where N denotes the number of non-speech frames smoothed with one speech frame, fs denotes the sampling rate, T_11 and T_12 denote the candidate smoothing durations, and T_1 denotes the smoothing duration actually used.
The non-speech frame adjacent to the speech frame may be a non-speech frame adjacent after the speech frame, or a non-speech frame adjacent before the speech frame. The smoothing of non-speech frames with speech frames can be divided into unidirectional smoothing and bidirectional smoothing.
Unidirectional smoothing performs only forward smoothing or only reverse smoothing. Forward smoothing uses a speech frame to smooth the non-speech frames adjacent after it; reverse smoothing uses a speech frame to smooth the non-speech frames adjacent before it. Bidirectional smoothing performs both forward and reverse smoothing.
Smoothing the non-speech frames adjacent after a speech frame (forward smoothing) can be expressed as:
x_n(j) = x_{n-1}(j) * β + x_n(j) * (1 - β), n++
where x_n(j) denotes the value of the jth sampling point of the nth frame among the speech frame and the adjacent non-speech frames after it, x_{n-1}(j) denotes the value of the jth sampling point of the (n-1)th frame, and β denotes the smoothing coefficient.
Smoothing the non-speech frames adjacent before a speech frame (reverse smoothing) can be expressed as:
x_n(j) = x_{n+1}(j) * β + x_n(j) * (1 - β), n--
where x_n(j) denotes the value of the jth sampling point of the nth frame among the speech frame and the adjacent non-speech frames before it, and x_{n+1}(j) denotes the value of the jth sampling point of the (n+1)th frame.
When a speech frame is used to smooth multiple non-speech frames adjacent to it, the farther a non-speech frame is from the speech frame (that is, the greater the difference between their frame numbers), the smaller the smoothing coefficient applied to that non-speech frame.
For example, when the number of non-speech frames smoothed with one speech frame is N, the smoothing coefficient β may be decremented frame by frame by 1/N, from 1 down to 0.
As one embodiment, in S114 only the non-speech frames adjacent after the speech frame are smoothed (forward smoothing).
As another embodiment, in S114 only the non-speech frames adjacent before the speech frame are smoothed (reverse smoothing).
As another embodiment, in S114 both the non-speech frames adjacent after the speech frame and the non-speech frames adjacent before it are smoothed (bidirectional smoothing).
As another embodiment, in S114 whether one-way or two-way smoothing is performed depends on the number of non-speech frames between two adjacent speech frames.
Specifically, the preset number threshold may be derived based on the sampling rate, and the preset number threshold is positively correlated to the sampling rate. The preset number threshold may be equal to the maximum number of non-speech frames smoothed with a single speech frame.
A non-speech frame between two adjacent speech frames is taken as a target non-speech frame. If the number of target non-speech frames is greater than or equal to the preset number threshold, one of the two adjacent speech frames (the previous one or the next one) is taken as a first starting speech frame, and the first starting speech frame is used to smooth at least part of the target non-speech frames adjacent to it. If the number of target non-speech frames is smaller than the preset number threshold, the two adjacent speech frames are taken as a second starting speech frame and a third starting speech frame respectively; the second starting speech frame is used to smooth at least part of the target non-speech frames adjacent to it, and the third starting speech frame is used to smooth at least part of the target non-speech frames adjacent to it.
For example, when the number of target non-speech frames is smaller than N, transition frames are generated by bidirectional smoothing. Forward smoothing is first performed from the first target non-speech frame, with the smoothing coefficient decremented by 1/N frame by frame, starting from 1, until the last target non-speech frame. Reverse smoothing is then performed from the last target non-speech frame, with the smoothing coefficient again decremented by 1/N frame by frame from 1, until the first target non-speech frame. This can be realized by the following formulas:
x_n(j) = x_{n-1}(j) * β + x_n(j) * (1 - β), n++ (forward pass)
x_n(j) = x_{n+1}(j) * β + x_n(j) * (1 - β), n-- (reverse pass)
When the number of target non-speech frames is greater than or equal to N, transition frames are generated by forward smoothing alone. Smoothing starts from the first target non-speech frame, and the smoothing coefficient is decremented frame by frame by 1/N, starting from 1, until 0. This can be realized by the following formulas:
x_n(j) = x_{n-1}(j) * β + x_n(j) * (1 - β), n++
β_k = max(1 - k/N, 0), k = 0, 1, 2, …
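The two smoothing passes can be sketched as follows (operating in place on a float frame matrix; whether the first smoothed frame uses β = 1 or 1 - 1/N is ambiguous in the source, so the schedule below is one consistent reading). Bidirectional smoothing of a short gap runs forward_smooth from the left speech frame and backward_smooth from the right one:

    import numpy as np

    def forward_smooth(frames, start, N):
        # Smooth up to N non-speech frames after the speech frame at
        # index `start`; beta starts at 1 and drops by 1/N per frame.
        beta = 1.0
        for n in range(start + 1, min(start + 1 + N, len(frames))):
            frames[n] = frames[n - 1] * beta + frames[n] * (1.0 - beta)
            beta = max(beta - 1.0 / N, 0.0)

    def backward_smooth(frames, start, N):
        # Mirror image: smooth up to N non-speech frames before `start`.
        beta = 1.0
        for n in range(start - 1, max(start - 1 - N, -1), -1):
            frames[n] = frames[n + 1] * beta + frames[n] * (1.0 - beta)
            beta = max(beta - 1.0 / N, 0.0)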
s12: the pitch periods are searched in order of duration from short to long from the current start point to determine the current first audio frame.
The shorter the searched duration, the larger the corresponding pitch period threshold.
The current starting point is the time starting point of the duration searched when determining the current first audio frame. In this step, to prevent the current first audio frame from being too long, an allowed range of search durations is given. Each duration (or duration range) within the allowed range has a corresponding pitch period threshold: a shorter duration corresponds to a larger pitch period threshold, and a longer duration corresponds to a smaller one.
In the searching process, the time lengths are sequentially used as target time lengths according to the sequence of the time lengths in the allowed range from short to long, if the number of pitch periods searched in the target time lengths is larger than the threshold value of the pitch period corresponding to the target time lengths, the searching is stopped, and the part between the first pitch period and the last pitch period searched in the target time lengths in the audio data to be processed is determined as the current first audio frame.
If the number of pitch periods found in all durations is not greater than the corresponding thresholds, that is, if even searching over the maximum duration of the allowed range yields no more pitch periods than the corresponding threshold, the current starting point is shifted backwards to update it, and the search is restarted. The number of sampling time points shifted backwards depends on the overlap length of two adjacent analysis frames, the processing accuracy, and other factors.
The search process of this step is illustrated as follows:
given three time length thresholds, a first time length threshold T is respectivelyminA second time length threshold TmidAnd a third duration threshold Tmax,Tmin<Tmid<Tmax. The three time length thresholds are used for dividing the allowable range into three parts, namely (0, T)min]、(Tmin,Tmid]And (T)mid,Tmax]。(0,Tmin]The corresponding pitch period threshold is M, (T)min,Tmid]The corresponding pitch period threshold is M-1, (T)mid,Tmax]The corresponding pitch period threshold is 2. The search duration is T.
1) At (0, T)min]Carrying out pitch period search according to the sequence of short to long T; if the number of the searched pitch periods is larger than M, determining the part between the first pitch period and the last pitch period obtained by T search in the audio data to be processed as the part between the first pitch period and the last pitch periodA current first audio frame; if the number of pitch periods searched is not greater than M, then go to 2).
2) In (T)min,Tmid]Carrying out pitch period search according to the sequence of short to long T; if the searched pitch period number is larger than M-1, determining the current first audio frame; if the number of pitch periods searched is not greater than M-1, then go to 3).
3) In (T)mid,Tmax]Carrying out pitch period search according to the sequence of short to long T; if the number of the searched pitch periods is larger than 2, determining the current first audio frame; if the number of the searched pitch periods is not more than 2, modifying the current starting point (shifting the current starting point backwards), and jumping to 1) to repeatedly execute the steps until the current first audio frame is determined.
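A sketch of this multi-stage search, where pitch_marks are the detected pitch-period positions as sample indices and the stage durations and thresholds are illustrative stand-ins for (0, T_min], (T_min, T_mid], (T_mid, T_max] with thresholds M, M-1 and 2:

    def search_first_audio_frame(pitch_marks, start, fs,
                                 stages=((0.02, 4), (0.05, 3), (0.20, 2))):
        # Try durations from short to long; shorter durations demand more
        # pitch periods.  Returns (first_mark, last_mark) bounding the
        # current first audio frame, or None if even the longest duration
        # fails (the caller then shifts the starting point backwards and
        # retries).
        for duration, threshold in stages:
            end = start + int(duration * fs)
            marks = [m for m in pitch_marks if start <= m <= end]
            if len(marks) > threshold:
                return marks[0], marks[-1]
        return None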
S13: the current first audio frame is windowed to obtain a current analysis frame.
Windowing the current first audio frame means multiplying it by a window function, including but not limited to the Hanning window function. Taking the Hanning window function as an example, the current analysis frame can be expressed as:
x_s = x(e(1) : e(m)) * h_win
where m represents the number of pitch periods searched for over the search duration T.
S14: and synthesizing target audio data by using the current analysis frame.
The method of synthesizing the target audio data using the current analysis frame may be the SOLA, PSOLA or WSOLA algorithm, among others.
Through this embodiment, when the current first audio frame is determined, pitch periods are searched from the current starting point in order of duration from short to long; that is, the number of pitch periods is searched adaptively under a multi-stage duration strategy. After the current first audio frame is windowed, the resulting analysis frame contains more pitch periods than the pitch period threshold while its duration stays within the allowed range, so the analysis frame is never excessively long and real-time processing performance is improved.
Fig. 3 is a schematic flow chart of another embodiment of the sound tone changing method of the present application. It should be noted that, if substantially the same result is obtained, the method is not limited to the flow sequence shown in fig. 3. This embodiment is a further extension of S14. As shown in fig. 3, the present embodiment may include:
s141: the time duration for which the current analysis frame overlaps the synthesized intermediate audio data is determined based on the time scale factor.
The time scale factor is the ratio of the duration of the intermediate audio data obtained after the last superposition to the duration of the data to be processed.
The overlap duration is the length of time for which the tail of the synthesized intermediate audio data overlaps the head of the current analysis frame. The time scale factor may be preset or specified by the user during application. Since the duration of the data to be processed is known, once the time scale factor is determined, the duration of the intermediate audio data obtained after the last superposition can also be determined.
The time scale factor may be greater than 1 or smaller than 1. If the time scale factor is greater than 1, the audio data to be processed is stretched during processing, so the sound in the audio data becomes deeper; if the time scale factor is smaller than 1, the audio data to be processed is compressed during processing, so the sound in the audio data becomes sharper.
S142: and overlapping the current analysis frame to the synthesized intermediate audio data according to the overlapped duration.
The values of the sampling points where the current analysis frame overlaps the synthesized intermediate audio data are added correspondingly, while the remaining parts are kept unchanged; over the overlap, the new intermediate audio data is therefore the sample-wise sum of the current analysis frame and the previously synthesized intermediate audio data.
s143: and sampling the intermediate audio data obtained after the last superposition to obtain target audio data.
The sampling process is to make the duration of the target audio data coincide with the duration of the audio data to be processed, thereby realizing the pitch change of the sound without speed change. Thus, if the time scale factor is greater than 1, downsampling the intermediate audio data; and if the time scale factor is smaller than 1, performing up-sampling processing on the intermediate audio data.
In addition, before sampling processing, normalization processing can be performed on intermediate audio data obtained after the last superposition.
In addition, to avoid the actual length of the intermediate audio data obtained after the last superposition exceeding the target length, before S14 the duration of the current analysis frame may be multiplied by the time scale factor to determine the target length of the intermediate audio data obtained after superposing the current analysis frame, and it is judged whether the actual length of the intermediate audio data obtained after superposing the current analysis frame is smaller than the target length. If it is smaller, S14 is executed; otherwise, the current analysis frame does not participate in synthesizing the target audio data.
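As an illustration of S141 to S143 above, the following sketch (hypothetical helper names; a simple linear-interpolation resampler stands in for whatever resampling the implementation uses) shows overlap-add synthesis followed by normalization and resampling back to the original duration:

    import numpy as np

    def overlap_add(y, frame, overlap):
        # Add the head of the analysis frame onto the tail of the
        # intermediate audio over `overlap` samples; append the rest
        # unchanged.
        if overlap > 0 and len(y) >= overlap:
            y[-overlap:] += frame[:overlap]
            return np.concatenate([y, frame[overlap:]])
        return np.concatenate([y, frame])

    def finalize(y, out_len):
        # Normalize, then resample the intermediate audio back to the
        # duration of the audio to be processed, so the pitch changes
        # without a speed change (per the description above: down-sampling
        # when the time scale factor > 1, up-sampling when it is < 1).
        y = y / (np.max(np.abs(y)) + 1e-12)
        grid = np.linspace(0.0, len(y) - 1.0, out_len)
        return np.interp(grid, np.arange(len(y), dtype=float), y)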
Fig. 4 is a schematic structural diagram of an embodiment of an electronic device according to the present application. As shown in fig. 4, the electronic device includes a processor 21, and a memory 22 coupled to the processor 21.
Wherein the memory 22 stores program instructions for implementing the method of any of the above embodiments; processor 21 is operative to execute program instructions stored by memory 22 to implement the steps of the above-described method embodiments. The processor 21 may also be referred to as a CPU (Central Processing Unit). The processor 21 may be an integrated circuit chip having signal processing capabilities. The processor 21 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
FIG. 5 is a schematic structural diagram of an embodiment of a computer-readable storage medium of the present application. As shown in fig. 5, the computer-readable storage medium 30 of the embodiment of the present application stores program instructions 31, and the program instructions 31 implement the method provided by the above-mentioned embodiment of the present application when executed. The program instructions 31 may form a program file stored in the computer-readable storage medium 30 in the form of a software product, so as to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute all or part of the steps of the methods according to the embodiments of the present application. And the aforementioned computer-readable storage medium 30 includes: various media capable of storing program codes, such as a usb disk, a mobile hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, or terminal devices, such as a computer, a server, a mobile phone, and a tablet.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. The above embodiments are merely examples and are not intended to limit the scope of the present disclosure, and all modifications, equivalents, and flow charts using the contents of the specification and drawings of the present disclosure or those directly or indirectly applied to other related technical fields are intended to be included in the scope of the present disclosure.

Claims (11)

1. A method of tonal modification, comprising:
acquiring audio data to be processed;
searching pitch periods from short to long according to the duration from a current starting point to determine a current first audio frame, wherein the number of the pitch periods of the current first audio frame is greater than a pitch period threshold, and the shorter the duration of the search is, the larger the corresponding pitch period threshold is;
windowing the current first audio frame to obtain a current analysis frame;
and synthesizing target audio data by using the current analysis frame.
2. The method of claim 1, wherein searching pitch periods in order of duration from short to long from a current start point to determine a current first audio frame comprises:
and if the pitch period number searched in all the time lengths is not larger than the corresponding pitch period threshold value, shifting the current starting point backwards to update the current starting point, and restarting the search.
3. The method of claim 1, further comprising:
according to the sequence of the time lengths from short to long, the time lengths are sequentially used as target time lengths;
if the pitch period number searched in the target time length is larger than the pitch period threshold corresponding to the target time length, stopping searching, and determining the part between the first pitch period and the last pitch period searched in the target time length in the audio data to be processed as the current first audio frame.
4. The method of claim 1, wherein prior to obtaining the audio data to be processed, comprising:
acquiring original audio data;
dividing the original audio data into a plurality of second audio frames;
detecting a speech frame and a non-speech frame in the plurality of second audio frames;
and smoothing at least part of the non-speech frames adjacent to the speech frames by utilizing the speech frames.
5. The method of claim 4, wherein the smoothing of at least some non-speech frames adjacent to the speech frame with the speech frame comprises:
taking a non-speech frame between two adjacent speech frames as a target non-speech frame;
if the number of the target non-speech frames is larger than or equal to a preset number threshold, taking one of the two adjacent speech frames as a first starting speech frame, and smoothing at least part of the target non-speech frames adjacent to the first starting speech frame by using the first starting speech frame;
and if the number of the target non-speech frames is smaller than the preset number threshold, respectively taking the two adjacent speech frames as a second starting speech frame and a third starting speech frame, smoothing at least part of the target non-speech frames adjacent to the second starting speech frame by using the second starting speech frame, and smoothing at least part of the target non-speech frames adjacent to the third starting speech frame by using the third starting speech frame.
6. The method of claim 5, wherein the preset number threshold is a maximum number of the non-speech frames smoothed by a single speech frame, and wherein the preset number threshold is positively correlated to a sampling rate of the original audio data.
7. The method of claim 1, wherein synthesizing target audio data using the current analysis frame comprises:
determining the overlapping duration of the current analysis frame and the synthesized intermediate audio data based on a time scale factor, wherein the time scale factor is the ratio of the duration of the intermediate audio data obtained after the last overlapping to the duration of the data to be processed;
superposing the current analysis frame to the synthesized intermediate audio data according to the overlapped duration;
and sampling the intermediate audio data obtained after the last superposition to obtain the target audio data.
8. The method of claim 7, prior to said synthesizing target audio data using said current analysis frame, comprising:
multiplying the duration of the current analysis frame by the time scale factor to determine the target length of the intermediate audio data obtained after the current analysis frame is overlapped;
judging whether the actual length of the intermediate audio data obtained after the current analysis frame is superposed is smaller than the target length;
and if the current analysis frame is smaller than the target audio data, the step of synthesizing the target audio data by using the current analysis frame is executed.
9. The method according to claim 7, wherein the sampling the intermediate audio data obtained after the last superposition to obtain the target audio data further comprises:
if the time scale factor is larger than 1, performing downsampling processing on the intermediate audio data;
and if the time scale factor is smaller than 1, performing upsampling processing on the intermediate audio data.
10. An electronic device comprising a processor, a memory coupled to the processor, wherein,
the memory stores program instructions;
the processor is configured to execute the program instructions stored by the memory to implement the method of any of claims 1-9.
11. A computer-readable storage medium, characterized in that the storage medium stores program instructions that, when executed, implement the method of any of claims 1-9.
CN202111052275.6A 2021-09-08 2021-09-08 Sound tone changing method, electronic device and storage medium Pending CN113782050A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111052275.6A CN113782050A (en) 2021-09-08 2021-09-08 Sound tone changing method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111052275.6A CN113782050A (en) 2021-09-08 2021-09-08 Sound tone changing method, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN113782050A true CN113782050A (en) 2021-12-10

Family

ID=78842033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111052275.6A Pending CN113782050A (en) 2021-09-08 2021-09-08 Sound tone changing method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113782050A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114679676A (en) * 2022-04-12 2022-06-28 重庆紫光华山智安科技有限公司 Audio device testing method and system, electronic device and readable storage medium

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04372998A (en) * 1991-06-24 1992-12-25 N T T Data Tsushin Kk Voice pitch converting device
US6272459B1 (en) * 1996-04-12 2001-08-07 Olympus Optical Co., Ltd. Voice signal coding apparatus
KR20020054237A (en) * 2000-12-27 2002-07-06 오길록 A fast pitch analysis method for the voiced region
CN1402591A (en) * 2001-08-29 2003-03-12 声波株式会社 Voice modulating talking system and headphone therefor
CN1758331A (en) * 2005-10-31 2006-04-12 浙江大学 Quick audio-frequency separating method based on tonic frequency
CN101622664A (en) * 2007-03-02 2010-01-06 松下电器产业株式会社 Adaptive sound source vector quantization device and adaptive sound source vector quantization method
WO2011065741A2 (en) * 2009-11-24 2011-06-03 엘지전자 주식회사 Audio signal processing method and device
US20150081287A1 (en) * 2013-09-13 2015-03-19 Advanced Simulation Technology, inc. ("ASTi") Adaptive noise reduction for high noise environments
CN105304092A (en) * 2015-09-18 2016-02-03 深圳市海派通讯科技有限公司 Real-time voice changing method based on intelligent terminal
WO2017028658A1 (en) * 2015-08-19 2017-02-23 中兴通讯股份有限公司 Method and device for adjusting voice data
CN107507619A (en) * 2017-09-11 2017-12-22 厦门美图之家科技有限公司 Phonetics transfer method, device, electronic equipment and readable storage medium storing program for executing
CN109389988A (en) * 2017-08-08 2019-02-26 腾讯科技(深圳)有限公司 Audio adjusts control method and device, storage medium and electronic device
CN112309425A (en) * 2020-10-14 2021-02-02 浙江大华技术股份有限公司 Sound tone changing method, electronic equipment and computer readable storage medium
CN112908351A (en) * 2021-01-21 2021-06-04 腾讯音乐娱乐科技(深圳)有限公司 Audio tone changing method, device, equipment and storage medium
CN113241082A (en) * 2021-04-22 2021-08-10 杭州朗和科技有限公司 Sound changing method, device, equipment and medium

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH04372998A (en) * 1991-06-24 1992-12-25 N T T Data Tsushin Kk Voice pitch converting device
US6272459B1 (en) * 1996-04-12 2001-08-07 Olympus Optical Co., Ltd. Voice signal coding apparatus
KR20020054237A (en) * 2000-12-27 2002-07-06 오길록 A fast pitch analysis method for the voiced region
CN1402591A (en) * 2001-08-29 2003-03-12 声波株式会社 Voice modulating talking system and headphone therefor
CN1758331A (en) * 2005-10-31 2006-04-12 浙江大学 Quick audio-frequency separating method based on tonic frequency
CN101622664A (en) * 2007-03-02 2010-01-06 松下电器产业株式会社 Adaptive sound source vector quantization device and adaptive sound source vector quantization method
US20100063804A1 (en) * 2007-03-02 2010-03-11 Panasonic Corporation Adaptive sound source vector quantization device and adaptive sound source vector quantization method
WO2011065741A2 (en) * 2009-11-24 2011-06-03 엘지전자 주식회사 Audio signal processing method and device
US20150081287A1 (en) * 2013-09-13 2015-03-19 Advanced Simulation Technology, inc. ("ASTi") Adaptive noise reduction for high noise environments
WO2017028658A1 (en) * 2015-08-19 2017-02-23 中兴通讯股份有限公司 Method and device for adjusting voice data
CN106469559A (en) * 2015-08-19 2017-03-01 中兴通讯股份有限公司 The method of adjustment of speech data and device
CN105304092A (en) * 2015-09-18 2016-02-03 深圳市海派通讯科技有限公司 Real-time voice changing method based on intelligent terminal
CN109389988A (en) * 2017-08-08 2019-02-26 腾讯科技(深圳)有限公司 Audio adjusts control method and device, storage medium and electronic device
CN107507619A (en) * 2017-09-11 2017-12-22 厦门美图之家科技有限公司 Phonetics transfer method, device, electronic equipment and readable storage medium storing program for executing
CN112309425A (en) * 2020-10-14 2021-02-02 浙江大华技术股份有限公司 Sound tone changing method, electronic equipment and computer readable storage medium
CN112908351A (en) * 2021-01-21 2021-06-04 腾讯音乐娱乐科技(深圳)有限公司 Audio tone changing method, device, equipment and storage medium
CN113241082A (en) * 2021-04-22 2021-08-10 杭州朗和科技有限公司 Sound changing method, device, equipment and medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114679676A (en) * 2022-04-12 2022-06-28 重庆紫光华山智安科技有限公司 Audio device testing method and system, electronic device and readable storage medium
CN114679676B (en) * 2022-04-12 2023-05-26 重庆紫光华山智安科技有限公司 Audio device testing method, system, electronic device and readable storage medium

Similar Documents

Publication Publication Date Title
CN110503976B (en) Audio separation method and device, electronic equipment and storage medium
US11115541B2 (en) Post-teleconference playback using non-destructive audio transport
CN111402855B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
KR101942521B1 (en) Speech endpointing
CN110709924A (en) Audio-visual speech separation
US8209176B2 (en) System and method for latency reduction for automatic speech recognition using partial multi-pass results
JP3836815B2 (en) Speech recognition apparatus, speech recognition method, computer-executable program and storage medium for causing computer to execute speech recognition method
CN111145786A (en) Speech emotion recognition method and device, server and computer readable storage medium
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
JP4391701B2 (en) System and method for segmentation and recognition of speech signals
WO2022017040A1 (en) Speech synthesis method and system
US8676584B2 (en) Method for time scaling of a sequence of input signal values
US9058384B2 (en) System and method for identification of highly-variable vocalizations
CN112967738B (en) Human voice detection method and device, electronic equipment and computer readable storage medium
CN112053702A (en) Voice processing method and device and electronic equipment
US8065140B2 (en) Method and system for determining predominant fundamental frequency
CN113782050A (en) Sound tone changing method, electronic device and storage medium
CN113035223B (en) Audio processing method, device, equipment and storage medium
CN112309425A (en) Sound tone changing method, electronic equipment and computer readable storage medium
WO2023226572A1 (en) Feature representation extraction method and apparatus, device, medium and program product
JP3422716B2 (en) Speech rate conversion method and apparatus, and recording medium storing speech rate conversion program
CN112534444A (en) Method and device for generating frequency component vector of time series data
JP2008139573A (en) Vocal quality conversion method, vocal quality conversion program and vocal quality conversion device
CN115129923B (en) Voice searching method, device and storage medium
CN113436644B (en) Sound quality evaluation method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination