CN108831509B

CN108831509B - Method and device for determining pitch period, computer equipment and storage medium

Info

Publication number: CN108831509B
Application number: CN201810607513.7A
Authority: CN
Inventors: 袁念德; 邵明绪; 田姣
Original assignee: Xi'an Fengyu Information Technology Co ltd
Current assignee: Xi'an Fengyu Information Technology Co ltd
Priority date: 2018-06-13
Filing date: 2018-06-13
Publication date: 2020-12-04
Anticipated expiration: 2038-06-13
Also published as: CN108831509A

Abstract

The application relates to a pitch period determination method, a pitch period determination device, a computer device and a storage medium. The method comprises: when the audio signal to be detected is a voiced signal in the current frame, acquiring, according to a preset cost function, a target cost value of each first pitch period of the audio signal to be detected in the current frame, wherein the target cost value comprises a cost value between each first pitch period of the audio signal to be detected and each second pitch period in an associated frame, the associated frame including a history frame adjacent to the current frame and a set of leading frames located after the current frame; and determining, according to the target cost values, the target pitch period of the audio signal to be detected in the current frame from among the first pitch periods. The method can improve the accuracy of the pitch period.

Description

Method and device for determining pitch period, computer equipment and storage medium

Technical Field

The present invention relates to the field of communications technologies, and in particular, to a method, an apparatus, a computer device, and a storage medium for determining a pitch period.

Background

The pitch period is the duration of one opening-and-closing cycle of the vocal cords. As a characteristic of an audio signal, it is widely used in fields such as speech coding and speech recognition.

During pitch period extraction, various kinds of interference can cause errors. For example, a period corresponding to double or half the true pitch frequency may be taken as the pitch period of a frame, or isolated abrupt-change points may appear in the pitch period trajectory. To reduce the error rate, the trajectory formed by the extracted pitch periods is usually smoothed after extraction to remove points where abrupt changes occurred. A commonly used smoothing method is median filtering, which takes the median of several consecutive candidate pitch periods within a sliding window as the final output pitch period.

However, when the pitch period trajectory is smoothed by median filtering, the accuracy of the resulting pitch periods is low.

Disclosure of Invention

In view of the above, it is necessary to provide a pitch period determination method, apparatus, computer device and storage medium capable of improving the accuracy of the pitch period.

A method of pitch period determination, the method comprising:

when the audio signal to be detected is a voiced signal in the current frame, acquiring a target cost value of each first pitch period of the audio signal to be detected in the current frame according to a preset cost function, wherein the target cost value comprises a cost value between each first pitch period of the audio signal to be detected and each second pitch period in an associated frame, the associated frame including a history frame adjacent to the current frame and a set of leading frames located after the current frame;

and determining the target pitch period of the audio signal to be detected in the current frame from each first pitch period according to each target cost value.

In one embodiment, the obtaining, according to a preset cost function, a target cost value corresponding to each first pitch period of the audio signal to be detected in the current frame includes:

according to the cost function, acquiring a first cost value between each first pitch period of the audio signal to be detected and a second pitch period in the history frame, and acquiring a second cost value between each first pitch period of the audio signal to be detected and a second pitch period in a target leading frame, the target leading frame being the last leading frame in time order in the leading frame set;

and obtaining the target cost value according to the first cost value and the second cost value.

In one embodiment, the leading frame set comprises a first leading frame and a second leading frame, and the second leading frame is the target leading frame; the obtaining a second cost value between each first pitch period of the audio signal to be detected and a second pitch period in the target leading frame includes:

according to the cost function, acquiring a third cost value between each first pitch period of the audio signal to be detected and each second pitch period in the first leading frame, and acquiring a fourth cost value between each second pitch period of the audio signal to be detected in the first leading frame and each second pitch period of the audio signal to be detected in the second leading frame;

and obtaining the second cost value according to the third cost value and the fourth cost value.

In one embodiment, if the audio signal to be detected is an unvoiced signal in the associated frame, the cost function is a function constructed according to an error value between the audio signal to be detected and an offset audio signal in the first pitch period, and the offset audio signal is a signal obtained by offsetting the audio signal to be detected according to the first pitch period.

In one embodiment, the cost function is $W(n, n\pm 1) = \alpha \cdot E_n(k_n)$, where $n$ is the identifier of the current frame, $n\pm 1$ is the identifier of the associated frame, $\alpha$ is the smoothing coefficient, $E_n(k_n)$ is the error function corresponding to the first pitch period, and $k_n$ is the first pitch period.

In one embodiment, if the audio signal to be detected is a voiced signal in the associated frame, the cost function is constructed from the first pitch period, the second pitch period, and an error value between the audio signal to be detected and an offset audio signal at the first pitch period, the offset audio signal being the audio signal to be detected offset according to the first pitch period.

In one embodiment, the cost function is $W(n, n\pm 1) = |k_n - k_{n\pm 1}| + \alpha \cdot E_n(k_n)$, where $n$ is the identifier of the current frame, $n\pm 1$ is the identifier of the associated frame, $\alpha$ is the smoothing coefficient, $E_n(k_n)$ is the error function corresponding to the first pitch period, $k_n$ is the first pitch period, and $k_{n\pm 1}$ is the second pitch period.

In one embodiment, the determining, according to each target cost value, a target pitch period of the audio signal to be detected in the current frame from each first pitch period includes:

determining the first pitch period corresponding to the minimum of the target cost values as the target pitch period.

An apparatus for pitch period determination, the apparatus comprising:

the acquisition module is used for acquiring the target cost value of each first pitch period of the audio signal to be detected in the current frame according to a preset cost function when the audio signal to be detected is a voiced signal in the current frame; wherein the target cost value comprises: a cost value between each first pitch period of the audio signal to be measured and each second pitch period in an associated frame, the associated frame including: a history frame adjacent to the current frame and a leading frame set positioned after the current frame;

and the determining module is used for determining the target pitch period of the audio signal to be detected in the current frame from each first pitch period according to each target cost value.

A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the pitch period determination method described above.

A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the pitch period determination method described above.

With the above pitch period determination method, device, computer equipment and storage medium, when the audio signal to be detected is a voiced signal in the current frame, the target cost value of each first pitch period of the audio signal in the current frame is obtained according to the preset cost function, and the target pitch period in the current frame is determined from the first pitch periods according to the target cost values. Each target cost value comprises cost values between the first pitch period and each second pitch period in the associated frames, namely the history frame adjacent to the current frame and the set of leading frames after the current frame. The target pitch period of the current frame is therefore determined from the pitch period variation across the current frame, the history frame and the leading frames, which effectively removes abruptly changing pitch periods, gives a better smoothing effect and improves the accuracy of the pitch period.

Drawings

FIG. 1 is a flow diagram of a method for pitch period determination according to one embodiment;

FIG. 2 is a diagram of a frame structure according to an embodiment;

FIG. 3 is a flowchart of one possible implementation of step 101 in FIG. 1;

FIG. 4 is a flowchart of one possible implementation of step 201 in FIG. 3;

FIG. 5 shows a device for determining a pitch period according to an embodiment;

FIG. 6 shows a device for determining a pitch period according to another embodiment;

FIG. 7 is a schematic diagram of an architecture of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The pitch period determination method provided by the application can be applied in acoustic detection to smooth the pitch period of a speech signal and filter out points where the pitch period changes abruptly. The method may be executed by a terminal, a server, or the like. The terminal may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer or portable wearable device; the server may be an independent server or a server cluster formed by multiple servers.

Fig. 1 is a flowchart of a pitch period determining method according to an embodiment, as shown in fig. 1, the method includes the following steps:

Step 101: when the audio signal to be detected is a voiced signal in the current frame, obtain a target cost value of each first pitch period of the audio signal to be detected in the current frame according to a preset cost function, wherein the target cost values include the cost value between each first pitch period of the audio signal to be detected and each second pitch period in the associated frame, the associated frame comprising a history frame adjacent to the current frame and a set of leading frames located after the current frame.

The cost function is constructed from the variation of pitch periods between adjacent time frames and is used to calculate the cost value between pitch periods in adjacent time frames; the cost value represents the error between those pitch periods. The larger the pitch period variation between adjacent time frames, the larger the cost value given by the cost function. The target cost value of a first pitch period includes the cost value between that first pitch period of the audio signal to be detected and each second pitch period in the history frame, and the cost value between that first pitch period and each second pitch period in the leading frames. The first and second pitch periods are the candidate pitch periods of the audio signal to be detected in each time frame, determined by methods such as autocorrelation detection or amplitude-difference methods.
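The text leaves the candidate extractor open ("autocorrelation detection, amplitude difference and the like"). As a minimal sketch only, assuming candidates are taken as the strongest local maxima of a normalized autocorrelation over a lag range given in samples, the Python below produces the kind of multi-candidate list each frame in FIG. 2 carries. The function names, the lag limits and the default of five candidates are illustrative assumptions, not part of the described method.

```python
import math
from typing import List, Sequence


def normalized_autocorrelation(frame: Sequence[float], lag: int) -> float:
    """Correlation between the frame and its copy offset by `lag` samples,
    normalized to the range [-1, 1]."""
    a, b = frame[:-lag], frame[lag:]
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den > 0 else 0.0


def candidate_pitch_periods(frame: Sequence[float], min_lag: int, max_lag: int,
                            num_candidates: int = 5) -> List[int]:
    """Return up to `num_candidates` lags (candidate pitch periods, in samples)
    at local maxima of the normalized autocorrelation, strongest first."""
    r = {k: normalized_autocorrelation(frame, k) for k in range(min_lag, max_lag + 1)}
    # Only interior lags can be local maxima, since both neighbours must exist.
    peaks = [k for k in range(min_lag + 1, max_lag)
             if r[k] >= r[k - 1] and r[k] >= r[k + 1]]
    peaks.sort(key=lambda k: r[k], reverse=True)
    return peaks[:num_candidates]
```

On a periodic frame the list typically contains the true period together with its multiples (for a 400-sample sine with a period of 40 samples and lags 20 to 150, both 40 and 80 appear), which is the kind of doubling ambiguity the cost-based smoothing is meant to resolve.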

As shown in FIG. 2, the current frame is Frm^(0), the history frame is Frm^(-1), and the leading frame set comprises the leading frames Frm^(1) and Frm^(2). The history frame Frm^(-1) contains one second pitch period that has already been smoothed, while the current frame Frm^(0) and the leading frames Frm^(1) and Frm^(2) each contain 5 candidate pitch periods. It should be noted that the leading frame set may include a single leading frame adjacent to the current frame, or multiple leading frames after the current frame, for example 3, 4 or more; the leading frames in the set are consecutive in time. The number of candidate pitch periods in the current frame and in each leading frame is not limited in this application.

According to the cost function, the cost value W1 between each first pitch period of the audio signal to be detected in the current frame and the second pitch period in the history frame is calculated, and then the cost value W2 between each first pitch period in the current frame and the second pitch periods in the leading frame is calculated. The target cost value is obtained by adding W1 and W2 for each first pitch period, or by a weighted sum of W1 and W2. Because the leading frame contains multiple second pitch periods, each first pitch period in the current frame has multiple values of W2, and therefore multiple target cost values.
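A minimal sketch of how W1 and W2 might be combined, assuming both have already been computed with the cost function; the function names and the default weights of 1.0 (which give the plain sum) are hypothetical, and the weighted-sum variant simply passes other weights.

```python
from typing import Dict, List


def target_cost_values(w1: float, w2_values: List[float],
                       weight_history: float = 1.0,
                       weight_leading: float = 1.0) -> List[float]:
    """Target cost values of ONE first pitch period.

    w1        -- cost value W1 against the single smoothed pitch of the history frame
    w2_values -- cost values W2 against each second pitch period of the leading frame
    One target cost value is produced per W2 entry.
    """
    return [weight_history * w1 + weight_leading * w2 for w2 in w2_values]


def all_target_costs(w1_by_pitch: Dict[int, float],
                     w2_by_pitch: Dict[int, List[float]]) -> Dict[int, List[float]]:
    """Apply target_cost_values (plain sum) to every first pitch period."""
    return {k: target_cost_values(w1_by_pitch[k], w2_by_pitch[k]) for k in w1_by_pitch}
```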

Step 102: determine the target pitch period of the audio signal to be detected in the current frame from the first pitch periods according to the target cost values.

In this embodiment, since a larger cost value indicates a larger variation between pitch periods in adjacent time frames, a first pitch period with a smaller target cost value may be selected as the target pitch period of the audio signal to be detected in the current frame. For example, the target cost values are sorted in ascending order, and the first pitch period corresponding to the one or more top-ranked target cost values is taken as the target pitch period.

In the pitch period determination method provided in this embodiment, when the audio signal to be detected is a voiced signal in the current frame, the target cost value of each first pitch period in the current frame is obtained according to the preset cost function, and the target pitch period in the current frame is selected from the first pitch periods according to the target cost values. Since the target cost values include the cost values between each first pitch period and each second pitch period in the associated frames (the history frame adjacent to the current frame and the set of leading frames after it), the target pitch period takes into account the pitch period variation across the current frame, the history frame and the leading frames. This effectively removes abruptly changing pitch periods, achieves a better smoothing effect and improves the accuracy of the pitch period.

Optionally, step 102, "determining a target pitch period of the audio signal to be detected in the current frame from each first pitch period according to each target cost value", includes: determining the first pitch period corresponding to the minimum of the target cost values as the target pitch period.

In this embodiment, since a larger pitch period variation between adjacent time frames yields a larger cost value, taking the first pitch period with the minimum target cost value as the target pitch period effectively filters out abruptly changing pitch periods, keeps the pitch period variation between time frames small, and ensures the accuracy and reliability of the pitch period.
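Step 102 with the minimum-cost rule can be sketched as a linear scan over all target cost values. The dictionary layout (each first pitch period mapped to its list of target cost values) and the function name are assumptions made for illustration.

```python
import math
from typing import Dict, List, Optional, Tuple


def select_target_pitch(target_costs: Dict[int, List[float]]) -> Tuple[Optional[int], float]:
    """Return (first pitch period, cost) for the smallest target cost value.

    target_costs maps each first pitch period of the current frame to the
    target cost values computed for it (one per path into the leading frames).
    """
    best_pitch, best_cost = None, math.inf
    for pitch, costs in target_costs.items():
        for cost in costs:
            if cost < best_cost:  # strict '<': on ties the first candidate seen wins
                best_pitch, best_cost = pitch, cost
    return best_pitch, best_cost
```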

Optionally, on the basis of the embodiment shown in fig. 1, if the audio signal to be measured is an unvoiced signal in the associated frame, the cost function is a function constructed according to an error value between the audio signal to be measured and the offset audio signal in the first pitch period, and the offset audio signal is a signal of the audio signal to be measured after being offset according to the first pitch period.

In this embodiment, the cost function calculates the cost value between two adjacent time frames, and the associated frame may be the history frame or a leading frame. When the history frame or the leading frame is an unvoiced signal, the cost function may be constructed from the error value between the audio signal to be detected and the offset audio signal at the first pitch period. This error value may be obtained by a method such as a normalized autocorrelation error or a normalized energy difference.
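The text names a normalized energy difference as one possible error value but does not give its formula. One plausible form, shown here purely as an assumption, compares the frame with its copy offset by k samples and normalizes the energy of the difference by the energy of both segments; the function name is hypothetical.

```python
from typing import Sequence


def normalized_energy_difference(frame: Sequence[float], k: int) -> float:
    """Hypothetical error value E(k) between a frame and its copy offset by k samples.

    E(k) = sum (x[i] - x[i+k])^2 / sum (x[i]^2 + x[i+k]^2), which lies in [0, 2]:
    0 for a perfect match at lag k, 1 for uncorrelated segments, and 2 when the
    offset copy is the exact negative of the original.
    """
    a, b = frame[:-k], frame[k:]
    diff = sum((x - y) ** 2 for x, y in zip(a, b))
    energy = sum(x * x for x in a) + sum(y * y for y in b)
    # An all-zero frame has zero difference energy, so treat it as a perfect match.
    return diff / energy if energy > 0 else 0.0
```

Because a small value means the offset signal matches well, this kind of measure fits the role of $E_n(k_n)$ in the cost functions: a poorly matching candidate period raises the cost.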

Optionally, if the audio signal to be detected is an unvoiced signal in the associated frame, the cost function is $W(n, n\pm 1) = \alpha \cdot E_n(k_n)$, where $n$ is the identifier of the current frame, $n\pm 1$ is the identifier of the associated frame, $\alpha$ is the smoothing coefficient with $0 < \alpha \le 256$, $E_n(k_n)$ is the error function corresponding to the first pitch period, and $k_n$ is the first pitch period.

In this embodiment, when the audio signal to be detected is an unvoiced signal in the associated frame, the cost value may be calculated with $W(n, n\pm 1) = \alpha \cdot E_n(k_n)$, where $n-1$ identifies the history frame and $n+1$ identifies the leading frame. Taking FIG. 2 as an example, if the history frame Frm^(-1) is an unvoiced signal, the cost value between the current frame Frm^(0) and the history frame Frm^(-1) is calculated with $W(0, -1) = \alpha \cdot E_0(k_0)$; if the leading frame Frm^(1) is an unvoiced signal, the cost value between the current frame Frm^(0) and the leading frame Frm^(1) is calculated with $W(0, 1) = \alpha \cdot E_0(k_0)$.

Optionally, on the basis of the embodiment shown in fig. 1, if the audio signal to be measured is a voiced signal in the associated frame, the cost function is a function constructed according to the first pitch period, the second pitch period, and an error value between the audio signal to be measured and the offset signal in the first pitch period, and the offset audio signal is a signal obtained by offsetting the audio signal to be measured according to the first pitch period.

In this embodiment, when the history frame or the leading frame is a voiced signal, the cost function may be constructed from the first pitch period, a second pitch period, and the error value between the audio signal to be detected and the offset audio signal at the first pitch period, where the second pitch period is the second pitch period in the history frame or in the leading frame.

Alternatively, on the basis of the embodiment shown in FIG. 1, if the audio signal to be detected is a voiced signal in the associated frame, the cost function is $W(n, n\pm 1) = |k_n - k_{n\pm 1}| + \alpha \cdot E_n(k_n)$, where $n$ is the identifier of the current frame, $n\pm 1$ is the identifier of the associated frame, $\alpha$ is the smoothing coefficient, $E_n(k_n)$ is the error function corresponding to the first pitch period, $k_n$ is the first pitch period, and $k_{n\pm 1}$ is the second pitch period.

In this embodiment, if the audio signal to be detected is a voiced signal in the associated frame, the cost value is calculated with $W(n, n\pm 1) = |k_n - k_{n\pm 1}| + \alpha \cdot E_n(k_n)$, where $n-1$ identifies the history frame, $n+1$ identifies the leading frame, $k_{n-1}$ is the second pitch period in the history frame, and $k_{n+1}$ is the second pitch period in the leading frame. Taking FIG. 2 as an example, if the history frame Frm^(-1) is a voiced signal, the cost value between the current frame Frm^(0) and the history frame Frm^(-1) is calculated with $W(0, -1) = |k_0 - k_{-1}| + \alpha \cdot E_0(k_0)$; if the leading frame Frm^(1) is a voiced signal, the cost value between the current frame Frm^(0) and the leading frame Frm^(1) is calculated with $W(0, 1) = |k_0 - k_1| + \alpha \cdot E_0(k_0)$.
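The two cost functions (unvoiced and voiced associated frame) can be written as a single helper that branches on the voicing of the associated frame. This Python sketch follows the stated formulas; the function and argument names are assumptions, and the error value $E_n(k_n)$ is taken as precomputed by whatever error measure is used for the offset signal.

```python
from typing import Optional


def cost_value(k_n: int, e_n: float, k_assoc: Optional[int],
               assoc_voiced: bool, alpha: float) -> float:
    """Cost value W(n, n±1) for candidate pitch period k_n of frame n.

    k_n          -- first pitch period under evaluation
    e_n          -- its error value E_n(k_n)
    k_assoc      -- pitch period k_(n±1) in the associated frame (ignored if unvoiced)
    assoc_voiced -- whether the audio signal is voiced in the associated frame
    alpha        -- smoothing coefficient
    """
    if assoc_voiced:
        # Voiced associated frame: penalize the pitch jump plus the weighted error.
        return abs(k_n - k_assoc) + alpha * e_n
    # Unvoiced associated frame: no pitch to compare against, only the error term.
    return alpha * e_n
```

For example, with an illustrative $\alpha = 10$, a candidate of 40 samples with $E = 0.1$ costs $|40 - 42| + 10 \times 0.1 = 3$ against a voiced history pitch of 42, and $10 \times 0.1 = 1$ if the history frame is unvoiced.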

Fig. 3 is a flowchart of a possible implementation method of step 101 in fig. 1, where on the basis of the embodiment shown in fig. 1, as shown in fig. 3, the step "obtaining a target cost value corresponding to each first pitch period of the audio signal to be detected in the current frame according to a preset cost function" includes:

Step 201: according to the cost function, obtain a first cost value between each first pitch period of the audio signal to be detected and the second pitch period in the history frame, and obtain a second cost value between each first pitch period of the audio signal to be detected and the second pitch periods in the target leading frame, the target leading frame being the last leading frame in time order in the leading frame set.

The target leading frame is the last leading frame in time order in the leading frame set. Referring to FIG. 2, if the leading frame set includes only Frm^(1), the target leading frame is Frm^(1); if it includes Frm^(1) and Frm^(2), the target leading frame is Frm^(2); if it includes Frm^(1), Frm^(2) and Frm^(3), the target leading frame is Frm^(3); and so on.

In this embodiment, the first cost value between each first pitch period of the audio signal to be detected and the second pitch period in the history frame, and the second cost value between each first pitch period and the second pitch periods in the target leading frame, can be calculated according to the cost function. As shown in FIG. 2, for the first pitch period B1 of the current frame Frm^(0), the first cost value between B1 and the second pitch period A of the history frame Frm^(-1) is denoted W(B1, A), and the second cost values between B1 and the second pitch periods D1 to D5 of the target leading frame Frm^(2) are denoted W(B1, D1), W(B1, D2), W(B1, D3), W(B1, D4) and W(B1, D5). The paths corresponding to W(B1, D1) may include B1-C1-D1, B1-C2-D1, B1-C3-D1, and so on; the remaining cases are similar and are not repeated here.

Optionally, if the leading frame set includes a first leading frame and a second leading frame, and the second leading frame is a target leading frame; as shown in fig. 4, one possible implementation method of step 201 may include:

Step 301: according to the cost function, obtain a third cost value between each first pitch period of the audio signal to be detected and each second pitch period in the first leading frame, and obtain a fourth cost value between each second pitch period of the audio signal to be detected in the first leading frame and each second pitch period of the audio signal to be detected in the second leading frame.

Taking FIG. 2 as an example, the first leading frame is Frm^(1) and the second leading frame is Frm^(2). For the first pitch period B1, the third cost values W(B1, C1), W(B1, C2), W(B1, C3), W(B1, C4) and W(B1, C5) between B1 and the second pitch periods of the first leading frame Frm^(1) are obtained according to the cost function. For the second pitch period C1 of the first leading frame Frm^(1), the fourth cost values W(C1, D1), W(C1, D2), W(C1, D3), W(C1, D4) and W(C1, D5) between C1 and the second pitch periods of the audio signal to be detected in the second leading frame Frm^(2) are obtained. Proceeding in the same way for the other pitch periods yields a plurality of third cost values and a plurality of fourth cost values.

The fourth cost values between the second pitch periods of the audio signal to be detected in the first leading frame and those in the second leading frame may likewise be calculated with the cost function $W(n, n\pm 1) = \alpha \cdot E_n(k_n)$ or $W(n, n\pm 1) = |k_n - k_{n\pm 1}| + \alpha \cdot E_n(k_n)$. For example, if the second leading frame is unvoiced, the fourth cost value is calculated with $W(1, 2) = \alpha \cdot E_1(k_1)$; if the second leading frame is voiced, it is calculated with $W(1, 2) = |k_1 - k_2| + \alpha \cdot E_1(k_1)$.

Optionally, for each second pitch period of the audio signal to be detected in the first leading frame, the minimum cost value against the second pitch periods in the second leading frame may be obtained first and taken as the fourth cost value of that second pitch period. The third cost value between each first pitch period and each second pitch period in the first leading frame is then added to the corresponding fourth cost value to obtain the second cost value. This reduces the number of calculation steps and improves efficiency.

Taking FIG. 2 as an example: 1) Calculate the cost values for all 25 combinations from the second pitch periods of the first leading frame Frm^(1) to those of the second leading frame Frm^(2), obtaining the minimum cost value of each second pitch period of Frm^(1). 2) When calculating the cost values along Frm^(0) to Frm^(1) to Frm^(2), the Frm^(1)-to-Frm^(2) combinations other than the known minimum need not be evaluated; the minimum cost value from Frm^(0) through Frm^(1) to Frm^(2) is calculated and recorded. 3) Because Frm^(-1) holds the pitch period found by the previous smoothing step, only Frm^(-1) needs to be traversed: combining it with the results of step 2) gives 5 values, namely the minimum cost values of the five first pitch periods of Frm^(0), which serve as the target cost values. 4) Frm^(0) becomes the history frame, with the first pitch period corresponding to the minimum target cost value as its second pitch period; Frm^(1) becomes the current frame and Frm^(2) becomes the first leading frame, and smoothing of the next frame begins.

For example, for the second pitch period C1 of the first leading frame Frm^(1), the cost values W(C1, D1), W(C1, D2), W(C1, D3), W(C1, D4) and W(C1, D5) between C1 and the second pitch periods of the audio signal to be detected in the second leading frame Frm^(2) are obtained. If W(C1, D2) is the smallest, W(C1, D2) is taken as the fourth cost value of the second pitch period C1 of Frm^(1). The fourth cost values of the other second pitch periods C2 to C5 of Frm^(1) are calculated in the same way.
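Taking the minimum fourth cost value per second pitch period of the first leading frame, and then following steps 1) to 4) of the FIG. 2 example, amounts to a small dynamic-programming pass over a window of four frames. The Python sketch below assumes that every frame carries a non-empty list of (pitch period, error value) candidates even when it is unvoiced, so voicing only selects which cost formula applies. The class and function names, the toy data, and the handling of unvoiced current frames (no output, history pitch left unchanged) are assumptions rather than details from the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Frame:
    voiced: bool
    # (pitch period k in samples, error value E(k)); assumed non-empty for every frame
    candidates: List[Tuple[int, float]]


def cost_value(k_n: int, e_n: float, k_assoc: int, assoc_voiced: bool, alpha: float) -> float:
    """W(n, n±1): |k_n - k_(n±1)| + alpha*E_n(k_n) if the associated frame is voiced,
    alpha*E_n(k_n) if it is unvoiced."""
    return abs(k_n - k_assoc) + alpha * e_n if assoc_voiced else alpha * e_n


def smooth_current_frame(hist_pitch: int, hist_voiced: bool, current: Frame,
                         lead1: Frame, lead2: Frame, alpha: float) -> Tuple[int, float]:
    """Return (target pitch period, target cost value) for the current frame."""
    # 1) Fourth cost value of each second pitch period C of the first leading frame:
    #    the minimum of W(1, 2) over all second pitch periods D of the second one.
    fourth = [min(cost_value(k1, e1, k2, lead2.voiced, alpha) for k2, _ in lead2.candidates)
              for k1, e1 in lead1.candidates]
    best: Optional[Tuple[int, float]] = None
    for k0, e0 in current.candidates:
        # 2) Second cost value: min over C of third cost value W(0, 1) + fourth(C).
        second = min(cost_value(k0, e0, k1, lead1.voiced, alpha) + f
                     for (k1, _), f in zip(lead1.candidates, fourth))
        # 3) Target cost value: first cost value W(0, -1) + second cost value.
        target = cost_value(k0, e0, hist_pitch, hist_voiced, alpha) + second
        if best is None or target < best[1]:
            best = (k0, target)
    assert best is not None  # holds because candidate lists are non-empty
    return best


def smooth_track(frames: List[Frame], alpha: float, initial_pitch: int,
                 initial_voiced: bool = True) -> List[Optional[int]]:
    """4) Slide the window by one frame per decision. Returns one entry per frame
    that still has two leading frames after it; None marks unvoiced frames."""
    hist_pitch, hist_voiced = initial_pitch, initial_voiced
    track: List[Optional[int]] = []
    for i in range(len(frames) - 2):
        current, lead1, lead2 = frames[i], frames[i + 1], frames[i + 2]
        pitch: Optional[int] = None
        if current.voiced:
            pitch, _ = smooth_current_frame(hist_pitch, hist_voiced,
                                            current, lead1, lead2, alpha)
            hist_pitch = pitch  # chosen period becomes the history frame's pitch
        hist_voiced = current.voiced
        track.append(pitch)
    return track


# Invented toy data: frame 2's lowest-error candidate is an octave error at 84.
frames = [
    Frame(True, [(40, 0.05), (80, 0.04)]),
    Frame(True, [(41, 0.05), (82, 0.06)]),
    Frame(True, [(84, 0.01), (42, 0.05)]),
    Frame(True, [(42, 0.05), (84, 0.06)]),
    Frame(True, [(43, 0.05), (86, 0.06)]),
]
print(smooth_track(frames, alpha=10.0, initial_pitch=40))  # [40, 41, 42]
```

In the toy track the jump to 84 is rejected because it is penalized both against the history pitch and against the leading frames, even though 84 has the lowest error value in its own frame.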

Step 302: obtain the second cost value according to the third cost value and the fourth cost value.

In this embodiment, the third cost values and the fourth cost values are combined and added, or their weighted values are combined and added, to obtain the second cost value.

In this method, the third cost values between each first pitch period of the audio signal to be detected and each second pitch period in the first leading frame, and the fourth cost values between each second pitch period in the first leading frame and each second pitch period in the second leading frame, are obtained according to the cost function, and the second cost value is obtained from them. The target cost value of each first pitch period of the current frame is thus calculated from the pitch periods of at least four adjacent time frames, so it reflects the pitch variation across those frames and makes the smoothing of the pitch period more reliable and accurate.

Step 202: obtain the target cost value according to the first cost value and the second cost value.

In this embodiment, the sum of the first cost value and the second cost value may be used as the target cost value; alternatively, the first and second cost values may be given different weights and summed to obtain the target cost value. For example, as shown in FIG. 2, the target cost values of the first pitch period B1 may be W(B1, A) + W(B1, D1), W(B1, A) + W(B1, D2), W(B1, A) + W(B1, D3), and so on.

Illustratively, in FIG. 2, the first cost value between each first pitch period of the current frame Frm^(0) and the second pitch period in the history frame Frm^(-1) is denoted W(0, -1); the third cost value between each first pitch period of Frm^(0) and each second pitch period in the first leading frame Frm^(1) is denoted W(0, 1); and the fourth cost value between each second pitch period of the audio signal to be detected in Frm^(1) and each second pitch period in the second leading frame Frm^(2) is denoted W(1, 2). The target cost value is $W = W(0, -1) + W(0, 1) + W(1, 2)$. In principle there are $5 \times 5 \times 5 = 125$ target cost values, and the first pitch period corresponding to the smallest of them may be used as the target pitch period.
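For comparison, the exhaustive version of this example can be sketched by enumerating every path B-C-D and keeping the smallest total $W = W(0,-1) + W(0,1) + W(1,2)$. The Python below is an illustrative assumption (function names, data layout and the numbers in the usage line are hypothetical), useful mainly as a reference for checking a faster implementation.

```python
from itertools import product
from typing import List, Tuple


def cost_value(k_n, e_n, k_assoc, assoc_voiced, alpha):
    """W(n, n±1): |k_n - k_(n±1)| + alpha*E_n(k_n) if voiced, else alpha*E_n(k_n)."""
    return abs(k_n - k_assoc) + alpha * e_n if assoc_voiced else alpha * e_n


def brute_force_target(k_hist: int, hist_voiced: bool,
                       current: List[Tuple[int, float]],
                       lead1: List[Tuple[int, float]], lead1_voiced: bool,
                       lead2: List[int], lead2_voiced: bool,
                       alpha: float) -> Tuple[float, int, int, int]:
    """Evaluate W = W(0,-1) + W(0,1) + W(1,2) on every path B-C-D and return
    (minimum W, B, C, D). `current` and `lead1` hold (pitch, error) pairs; `lead2`
    holds pitches only, because W(1,2) uses the error of the first leading frame."""
    best = None
    for (k0, e0), (k1, e1), k2 in product(current, lead1, lead2):
        w = (cost_value(k0, e0, k_hist, hist_voiced, alpha)    # W(0, -1)
             + cost_value(k0, e0, k1, lead1_voiced, alpha)    # W(0, 1)
             + cost_value(k1, e1, k2, lead2_voiced, alpha))   # W(1, 2)
        if best is None or w < best[0]:
            best = (w, k0, k1, k2)
    return best


print(brute_force_target(40, True, [(40, 0.05), (80, 0.04)],
                         [(41, 0.05), (82, 0.06)], True,
                         [84, 42], True, alpha=10.0))  # (3.5, 40, 41, 42)
```

With five candidates per frame the loop visits 5 × 5 × 5 = 125 paths. Taking the minimum over D for each C first, as in the optional embodiment, reaches the same minimum, because W(1, 2) depends only on C and D.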

In the pitch period determination method provided by this embodiment, a first cost value between each first pitch period of the audio signal to be detected and the second pitch period in the history frame, and a second cost value between each first pitch period and the second pitch periods in the target leading frame, are obtained according to the cost function, and the target cost value is obtained from the first and second cost values. Because the target cost value is determined from the cost values between pitch periods in at least three time frames, the target cost value of each pitch period in the current frame is a more meaningful reference; abruptly changing pitch periods can be filtered out and the accuracy of the pitch period is improved.

It should be understood that although the steps in the flowcharts of FIGS. 1-4 are shown in the sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, there is no strict restriction on the order of these steps, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 1-4 may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times; nor are these sub-steps or stages necessarily performed sequentially, as they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Fig. 5 shows a pitch period determining device according to an embodiment. As shown in Fig. 5, the device includes an obtaining module 11 and a determining module 12.

The obtaining module 11 is configured to, when the audio signal to be detected is a voiced signal in the current frame, obtain a target cost value of each first pitch period of the audio signal to be detected in the current frame according to a preset cost function, wherein the target cost value comprises the cost values between each first pitch period of the audio signal to be detected and each second pitch period in an associated frame, and the associated frame includes a history frame adjacent to the current frame and a leading frame set located after the current frame.

The determining module 12 is configured to determine, according to each target cost value, the target pitch period of the audio signal to be detected in the current frame from the first pitch periods.

In one embodiment, as shown in Fig. 6, the obtaining module 11 includes a first obtaining submodule 111 and a second obtaining submodule 112. The first obtaining submodule 111 is configured to obtain, according to the cost function, a first cost value between each first pitch period of the audio signal to be detected and the second pitch periods in the history frame, and a second cost value between each first pitch period of the audio signal to be detected and the second pitch periods in the target leading frame, the target leading frame being the last leading frame in time order in the leading frame set. The second obtaining submodule 112 is configured to obtain the target cost value from the first cost value and the second cost value.

In one embodiment, the leading frame set comprises a first leading frame and a second leading frame, and the second leading frame is the target leading frame. The first obtaining submodule 111 obtains the second cost value between each first pitch period of the audio signal to be detected and the second pitch periods in the target leading frame as follows: according to the cost function, it obtains a third cost value between each first pitch period of the audio signal to be detected and each second pitch period in the first leading frame, and a fourth cost value between each second pitch period of the audio signal to be detected in the first leading frame and each second pitch period of the audio signal to be detected in the second leading frame; it then obtains the second cost value from the third cost value and the fourth cost value.
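The text only says the second cost value is obtained "from" the third and fourth cost values. A minimal sketch, assuming (as in the FIG. 2 example) that the two are summed along a path and the cheapest continuation through the leading frames is kept, is shown below. It swaps the exhaustive search for a min-plus (Viterbi-style) reduction: the best fourth cost reachable from each first-leading-frame candidate is computed once, which yields the same minimum with fewer evaluations. The function name and cost callables are hypothetical.

```python
def second_cost_values(cur, lead1, lead2, w_01, w_12):
    """For each k0 in cur, return min over (k1, k2) of W(0,1)(k0, k1) + W(1,2)(k1, k2).

    cur, lead1, lead2: candidate pitch periods of Frm(0), Frm(1), Frm(2).
    w_01: hypothetical third cost value function; w_12: hypothetical fourth cost value function.
    """
    # The minimum over k2 depends only on k1, so compute it once per k1.
    best_from_k1 = {k1: min(w_12(k1, k2) for k2 in lead2) for k1 in lead1}
    # Combine each third cost value with the best fourth-cost continuation.
    return {k0: min(w_01(k0, k1) + best_from_k1[k1] for k1 in lead1) for k0 in cur}
```

Adding the first cost value $W(0,-1)$ of each `k0` to its entry here and taking the smallest sum selects the same current-frame candidate as enumerating all 125 paths; the reduction only changes how much work is done.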

In one embodiment, the determining module 12 is specifically configured to determine the first pitch period corresponding to the smallest of the target cost values as the target pitch period.
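The split between the obtaining module 11 and the determining module 12 can be sketched in software as two small components. This is an illustrative structure only: the class names and the `target_cost` callable (mapping a first pitch period to its target cost value) are hypothetical, and how that value is computed is left to the cost function described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class ObtainingModule:
    """Module 11: computes a target cost value for every first pitch period."""
    target_cost: Callable[[int], float]  # hypothetical: first pitch period -> target cost value

    def obtain(self, first_periods: Sequence[int]) -> Dict[int, float]:
        return {k0: self.target_cost(k0) for k0 in first_periods}


class DeterminingModule:
    """Module 12: selects the first pitch period with the smallest target cost value."""

    @staticmethod
    def determine(costs: Dict[int, float]) -> int:
        # min() returns the first key encountered among equal minima.
        return min(costs, key=costs.get)
```

For example, `DeterminingModule.determine(ObtainingModule(lambda k: abs(k - 80)).obtain([40, 78, 120]))` selects 78. Keeping the cost computation behind a callable lets the same determining step be reused whichever cost function is configured.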

For the specific definition of the pitch period determining device, refer to the definition of the pitch period determination method above; details are not repeated here. The modules of the pitch period determining device may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a server whose internal structure may be as shown in Fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store cost value data. The network interface of the computer device is used to communicate with an external terminal over a network connection. When executed by the processor, the computer program implements a pitch period determination method.

Those skilled in the art will appreciate that the structure shown in Fig. 7 is merely a block diagram of part of the structure related to the disclosed solution and does not limit the computer devices to which the disclosed solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:

In one embodiment, the processor, when executing the computer program, further performs the steps of:

according to the cost function, obtaining a first cost value between each first pitch period of the audio signal to be detected and the second pitch periods in the history frame, and obtaining a second cost value between each first pitch period of the audio signal to be detected and the second pitch periods in a target leading frame, the target leading frame being the last leading frame in time order in the leading frame set; and obtaining the target cost value from the first cost value and the second cost value.

according to the cost function, obtaining a third cost value between each first pitch period of the audio signal to be detected and each second pitch period in the first leading frame, and obtaining a fourth cost value between each second pitch period of the audio signal to be detected in the first leading frame and each second pitch period of the audio signal to be detected in the second leading frame; and obtaining the second cost value from the third cost value and the fourth cost value.

In one embodiment, the processor, when executing the computer program, further performs the steps of: if the audio signal to be detected is an unvoiced signal in the associated frame, the cost function is a function constructed according to an error value between the audio signal to be detected and a shifted audio signal in the first pitch period, and the shifted audio signal is a signal of the audio signal to be detected after being shifted according to the first pitch period.

In one embodiment, the processor, when executing the computer program, further implements the method of:

the cost function is $W(n, n\pm 1) = \alpha \cdot E_n(k_n)$, where $n$ is the identifier of the current frame, $n\pm 1$ is the identifier of the associated frame, $\alpha$ is the smoothing coefficient, $E_n(k_n)$ is the error function corresponding to the first pitch period, and $k_n$ is the first pitch period.
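The document does not define the error function $E_n(k_n)$ beyond "an error value between the audio signal to be detected and the audio signal shifted by the first pitch period". The sketch below assumes one plausible form, a squared difference between the frame and its copy shifted by $k$ samples, normalized by the energy of the overlapping samples, and then applies the unvoiced cost $W(n, n\pm 1) = \alpha \cdot E_n(k_n)$. Both function names and the normalization are hypothetical choices.

```python
def shift_error(frame, k):
    """Hypothetical E_n(k): normalized squared error between the frame and
    a copy of itself shifted by k samples. 0.0 means the frame repeats
    exactly every k samples; 2.0 means the shifted copy is its exact negative."""
    if not 0 < k < len(frame):
        raise ValueError("pitch period k must satisfy 0 < k < len(frame)")
    overlap = len(frame) - k
    diff = sum((frame[i] - frame[i + k]) ** 2 for i in range(overlap))
    energy = sum(frame[i] ** 2 + frame[i + k] ** 2 for i in range(overlap))
    return diff / energy if energy > 0 else 0.0


def unvoiced_cost(frame, k_n, alpha=1.0):
    """Cost W(n, n±1) = alpha * E_n(k_n), used when the associated frame is unvoiced."""
    return alpha * shift_error(frame, k_n)
```

For a frame such as `[0, 1, 0, -1] * 10`, a shift of 4 samples gives an error of 0.0 (the true period), while a shift of 2 gives 2.0 because the shifted copy is inverted; the energy normalization keeps the error independent of the signal level.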

if the audio signal to be detected is a voiced signal in the associated frame, the cost function is a function constructed from the first pitch period, the second pitch period, and an error value between the audio signal to be detected and the offset audio signal at the first pitch period, the offset audio signal being the audio signal to be detected after it is offset by the first pitch period.

the cost function is $W(n, n\pm 1) = |k_n - k_{n\pm 1}| + \alpha \cdot E_n(k_n)$, where $n$ is the identifier of the current frame, $n\pm 1$ is the identifier of the associated frame, $\alpha$ is the smoothing coefficient, $E_n(k_n)$ is the error function corresponding to the first pitch period, $k_n$ is the first pitch period, and $k_{n\pm 1}$ is the second pitch period.
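The two cost functions can be combined into one helper that chooses the form from the voicing of the associated frame. This is a sketch under stated assumptions: the error value $E_n(k_n)$ is taken as already computed, and an unvoiced associated frame is represented by passing no pitch period (`k_assoc=None`); both the name `associated_cost` and that convention are hypothetical.

```python
def associated_cost(k_n, error_n, alpha, k_assoc=None):
    """Cost W(n, n±1) between candidate k_n of the current frame and the associated frame.

    k_n:     first pitch period of the current frame (in samples).
    error_n: E_n(k_n), the error value for k_n, computed elsewhere.
    alpha:   smoothing coefficient.
    k_assoc: second pitch period k_{n±1} of the associated frame, or None
             (hypothetical convention) when the associated frame is unvoiced.
    """
    if k_assoc is None:
        # Unvoiced associated frame: W = alpha * E_n(k_n).
        return alpha * error_n
    # Voiced associated frame: W = |k_n - k_{n±1}| + alpha * E_n(k_n).
    return abs(k_n - k_assoc) + alpha * error_n
```

In the voiced form, the jump term $|k_n - k_{n\pm 1}|$ penalizes candidates far from the neighboring frame's pitch, which is what suppresses doubled, halved, or abruptly changing pitch periods, while $\alpha$ sets how strongly the periodicity error is weighed against that jump.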

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

In one embodiment, the computer program when executed by the processor further performs the steps of:

if the audio signal to be detected is an unvoiced signal in the associated frame, the cost function is a function constructed according to an error value between the audio signal to be detected and a shifted audio signal in the first pitch period, and the shifted audio signal is a signal of the audio signal to be detected after being shifted according to the first pitch period.

In one embodiment, the computer program when executed by the processor further implements the method of:

the cost function is $W(n, n\pm 1) = \alpha \cdot E_n(k_n)$, where $n$ is the identifier of the current frame, $n\pm 1$ is the identifier of the associated frame, $\alpha$ is the smoothing coefficient, $E_n(k_n)$ is the error function corresponding to the first pitch period, and $k_n$ is the first pitch period.

It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program instructing relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above embodiments express only several implementations of the present invention. Although they are described specifically and in detail, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims

1. A method for pitch period determination, the method comprising:

when the audio signal to be detected is a voiced sound signal in the current frame, acquiring a target cost value of each first pitch period of the audio signal to be detected in the current frame according to a preset cost function; wherein the target cost value comprises: a cost value between each first pitch period of the audio signal to be measured and each second pitch period in an associated frame, the associated frame comprising: a history frame adjacent to the current frame and a leading frame set positioned after the current frame; if the audio signal to be detected is an unvoiced signal in the associated frame, the cost function is a function constructed according to an error value between the audio signal to be detected and a shifted audio signal in the first pitch period, and the shifted audio signal is a signal of the audio signal to be detected after being shifted according to the first pitch period;

2. The method according to claim 1, wherein the obtaining a target cost value corresponding to each first pitch period of the audio signal to be tested in the current frame according to a preset cost function includes:

3. The method of claim 2, wherein the set of leading frames comprises a first leading frame and a second leading frame, and the second leading frame is the target leading frame; the obtaining a second cost value between each first pitch period of the audio signal to be detected and a second pitch period in the target leading frame includes:

4. The method according to claim 1, wherein the cost function is $W(n, n\pm 1) = \alpha \cdot E_n(k_n)$, where $n$ is the identifier of the current frame, $n\pm 1$ is the identifier of the associated frame, $\alpha$ is the smoothing coefficient, $E_n(k_n)$ is the error function corresponding to the first pitch period, and $k_n$ is the first pitch period.

5. The method according to any one of claims 1-3, wherein if the audio signal under test is a voiced signal in the associated frame, the cost function is a function constructed from the first pitch period, the second pitch period, and an error value between the audio signal under test and an offset audio signal at the first pitch period, and the offset audio signal is the audio signal under test after it is offset by the first pitch period.

6. The method of claim 5, wherein the cost function is $W(n, n\pm 1) = |k_n - k_{n\pm 1}| + \alpha \cdot E_n(k_n)$, where $n$ is the identifier of the current frame, $n\pm 1$ is the identifier of the associated frame, $\alpha$ is the smoothing coefficient, $E_n(k_n)$ is the error function corresponding to the first pitch period, $k_n$ is the first pitch period, and $k_{n\pm 1}$ is the second pitch period.

7. A method according to any one of claims 1-3, wherein said determining a target pitch period of the audio signal under test in the current frame from each first pitch period according to each target cost value comprises:

8. An apparatus for pitch period determination, the apparatus comprising:

the acquisition module is used for acquiring the target cost value of each first pitch period of the audio signal to be detected in the current frame according to a preset cost function when the audio signal to be detected is a voiced signal in the current frame; wherein the target cost value comprises: a cost value between each first pitch period of the audio signal to be measured and each second pitch period in an associated frame, the associated frame including: a history frame adjacent to the current frame and a leading frame set positioned after the current frame; if the audio signal to be detected is an unvoiced signal in the associated frame, the cost function is a function constructed according to an error value between the audio signal to be detected and a shifted audio signal in the first pitch period, and the shifted audio signal is a signal of the audio signal to be detected after being shifted according to the first pitch period;

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.