Disclosure of Invention
In order to solve the above technical problems, embodiments of the present application provide a corpus extraction method and apparatus that improve corpus extraction accuracy. The technical scheme is as follows:
A corpus extraction method comprises the following steps:
dividing an audio signal into a plurality of speech frames according to a set frame duration;
respectively performing activity detection on each speech frame according to a set activity detection mode, and determining an attribute of each speech frame, wherein the attribute is valid or silent;
arranging the speech frames marked with the attributes in time order to obtain a speech frame sequence;
sliding a rectangular window with a set size through the speech frame sequence by a set step length, and detecting the start position and the end position of an utterance segment in the speech frame sequence;
and extracting the utterance segment from the audio signal according to the start position and the end position, and taking the utterance segment as a corpus.
Preferably, the sliding a rectangular window with a set size through the speech frame sequence by a set step length, and detecting the start position and the end position of the utterance segment in the speech frame sequence includes:
sliding a rectangular window with a set size through the speech frame sequence by a set step length, calculating the ratio of the number of speech frames whose attribute is valid in the rectangular window to the size of the rectangular window as a first ratio, and calculating the ratio of the number of speech frames whose attribute is silent in the rectangular window to the size of the rectangular window as a second ratio;
and determining the start position of the utterance segment in the speech frame sequence by comparing the first ratio with a first ratio threshold, and determining the end position of the utterance segment in the speech frame sequence by comparing the second ratio with a second ratio threshold.
Preferably, the method further comprises:
optimizing the set frame duration, the set activity detection mode and the set size of the rectangular window to obtain an optimized set frame duration, an optimized set activity detection mode and an optimized set size of the rectangular window;
and replacing the set frame duration with the optimized set frame duration, replacing the set activity detection mode with the optimized set activity detection mode, and replacing the set size of the rectangular window with the optimized set size of the rectangular window.
Preferably, the optimizing the set frame duration, the set activity detection mode, and the set size of the rectangular window to obtain an optimized set frame duration, an optimized set activity detection mode, and an optimized set size of the rectangular window includes:
setting different values for the set frame duration, the set activity detection mode and the set size of the rectangular window respectively, and randomly combining the set frame duration, the set activity detection mode and the set size of the rectangular window with different values to obtain different combination sets;
extracting utterance segments from audio test signals according to each combination set respectively, and taking the utterance segments as segmented corpora;
calculating the average amplitude of each audio test signal and the frame average amplitude of each segmented corpus;
calculating the average mute duration of each audio test signal in different combination sets, and dividing the sum of the average mute durations of each audio test signal in different combination sets by the number of the combination sets respectively to obtain a result as the total average mute duration;
for each audio test signal, respectively calculating the difference between its average mute duration under each combination set and its total average mute duration, sorting the differences from small to large, selecting the first L differences, and taking the combination sets corresponding to these first L differences as to-be-processed combination sets, wherein L is an integer greater than 1;
counting the occurrence frequency of each to-be-processed combination set across the to-be-processed combination sets of the plurality of audio test signals, and taking the k to-be-processed combination sets with the highest occurrence frequencies as candidate combination sets, wherein k is an integer greater than 1;
and respectively calculating the variances of the corpora extracted according to the candidate combination sets, taking the set frame time length in the candidate combination set corresponding to the corpora with the minimum variance as the optimized set frame time length, taking the set activity detection mode in the candidate combination set corresponding to the corpora with the minimum variance as the optimized set activity detection mode, and taking the set size of the rectangular window in the candidate combination set corresponding to the corpora with the minimum variance as the set size of the optimized rectangular window.
Preferably, the method further comprises:
respectively calculating the corpus average amplitude and the frame average amplitude of each corpus;
respectively taking the frames with the frame average amplitude values larger than the corpus average amplitude values in each corpus as effective frames, and calculating the proportion of the effective frames;
and taking the corpus of which the ratio of the effective frame is greater than a set effective frame ratio threshold value and the average corpus amplitude is greater than a set corpus average amplitude threshold value as an effective corpus.
Preferably, the calculating the average amplitude of each audio test signal and the frame average amplitude of each segmented corpus includes:
respectively calculating the amplitude sum of sampling points in each audio test signal, and dividing the amplitude sum of the sampling points by the number of the sampling points to obtain a result as the average amplitude of the audio test signal;
and respectively calculating the sum of the amplitudes of the sampling points in each corpus, dividing the sum of the amplitudes of the sampling points in the corpus by the number of the sampling points in the corpus, and taking the obtained result as the frame average amplitude of the corpus.
Preferably, the calculating the variance of the corpus extracted according to each candidate combination set includes:
calculating, by using a relational expression, the variance of the corpora extracted according to each candidate combination set;
wherein variance represents the variance of the corpus, x_i represents the average amplitude of the i-th frame in the corpus, avg represents the average amplitude of the audio signal, t represents the duration of the corpus, and n represents the number of speech frames included in the corpus.
A corpus extraction device, comprising:
the dividing module is used for dividing the audio signal into a plurality of voice frames according to the set frame duration;
the activity detection module is used for respectively carrying out activity detection on each voice frame according to a set activity detection mode and determining the attribute of each voice frame, wherein the attribute is effective or silent;
the arrangement module is used for arranging the voice frames marked with the attributes according to a time sequence to obtain a voice frame array;
an endpoint detection module, configured to utilize a rectangular window with a set size to slide in the speech frame array by a set step length, and detect an initial position and an end position of a speech segment in the speech frame array;
and the extraction module is used for extracting an utterance segment from the audio signal according to the start position and the end position, and taking the utterance segment as a corpus.
Preferably, the endpoint detection module includes:
a first calculating module, configured to utilize a rectangular window with a set size, slide in the speech frame array in a set step length, calculate a ratio of the number of speech frames with an effective attribute in the rectangular window to the size of the rectangular window, as a first ratio, and calculate a ratio of the number of speech frames with a mute attribute in the rectangular window to the size of the rectangular window, as a second ratio;
and the comparison module is used for determining the initial position of the speech section in the speech frame array by comparing the first proportion with a first proportion threshold value, and determining the end position of the speech section in the speech frame array by comparing the second proportion with a second proportion threshold value.
Preferably, the apparatus further comprises:
the optimization module is used for optimizing the set frame duration, the set activity detection mode and the set size of the rectangular window to obtain an optimized set frame duration, an optimized set activity detection mode and an optimized set size of the rectangular window;
and the replacing module is used for replacing the set frame duration with the optimized set frame duration, replacing the set activity detection mode with the optimized set activity detection mode, and replacing the set size of the rectangular window with the optimized set size of the rectangular window.
Preferably, the optimization module is specifically configured to:
respectively setting different values for the set frame duration, the set activity detection mode and the set size of the rectangular window, and randomly combining the set frame duration, the set activity detection mode and the set size of the rectangular window with different values to obtain different combination sets;
extracting utterance segments from audio test signals according to each combination set respectively, and taking the utterance segments as segmented corpora;
calculating the average amplitude of each audio test signal and the frame average amplitude of each segmented corpus;
calculating the average mute duration of each audio test signal in different combination sets, and dividing the sum of the average mute durations of each audio test signal in different combination sets by the number of the combination sets respectively to obtain a result as the total average mute duration;
for each audio test signal, respectively calculating the difference between its average mute duration under each combination set and its total average mute duration, sorting the differences from small to large, selecting the first L differences, and taking the combination sets corresponding to these first L differences as to-be-processed combination sets, wherein L is an integer greater than 1;
counting the occurrence frequency of each to-be-processed combination set across the to-be-processed combination sets of the plurality of audio test signals, and taking the k to-be-processed combination sets with the highest occurrence frequencies as candidate combination sets, wherein k is an integer greater than 1;
and respectively calculating the variances of the corpora extracted according to each candidate combination set, taking the set frame time length in the candidate combination set corresponding to the corpus with the minimum variance as the optimized set frame time length, taking the set activity detection mode in the candidate combination set corresponding to the corpus with the minimum variance as the optimized set activity detection mode, and taking the set size of the rectangular window in the candidate combination set corresponding to the corpus with the minimum variance as the set size of the optimized rectangular window.
Preferably, the apparatus further comprises:
the second calculation module is used for calculating the corpus average amplitude and the frame average amplitude of each corpus respectively;
the first determining module is used for respectively taking the frames with the frame average amplitude values larger than the corpus average amplitude values in each corpus as effective frames and calculating the proportion of the effective frames;
and the second determining module is used for taking the corpus of which the ratio of the effective frame is greater than a set effective frame ratio threshold and the average corpus amplitude is greater than a set corpus average amplitude threshold as an effective corpus.
Preferably, the optimization module is specifically configured to:
respectively calculating the amplitude sum of sampling points in each audio test signal, and dividing the amplitude sum of the sampling points by the number of the sampling points to obtain a result as the average amplitude of the audio test signal;
and respectively calculating the sum of the amplitudes of the sampling points in each corpus, and dividing the sum of the amplitudes of the sampling points in the corpus by the number of the sampling points in the corpus to obtain a result as the frame average amplitude of the corpus.
Preferably, the optimization module is specifically configured to:
calculating, by using a relational expression, the variance of the corpora extracted according to each candidate combination set;
wherein variance represents the variance of the corpus, x_i represents the average amplitude of the i-th frame in the corpus, avg represents the average amplitude of the audio signal, t represents the duration of the corpus, and n represents the number of speech frames included in the corpus.
Compared with the prior art, the beneficial effects of the present application are as follows:
In the present application, after an audio signal is divided into a plurality of speech frames, activity detection is performed on each speech frame, and based on the activity detection results, endpoint detection (i.e., detection of the start position and the end position of an utterance segment) is performed by sliding a rectangular window with a set step length through the speech frame sequence formed by the speech frames. This ensures that the speech frames are examined in order and without omission, thereby improving corpus extraction accuracy.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application discloses a corpus extraction method, which comprises the following steps: dividing an audio signal into a plurality of speech frames according to a set frame duration; respectively performing activity detection on each speech frame according to a set activity detection mode, and determining an attribute of each speech frame, wherein the attribute is valid or silent; arranging the speech frames marked with the attributes in time order to obtain a speech frame sequence; sliding a rectangular window with a set size through the speech frame sequence by a set step length, and detecting the start position and the end position of an utterance segment in the speech frame sequence; and extracting the utterance segment from the audio signal according to the start position and the end position, and taking the utterance segment as a corpus. With this method and apparatus, corpus extraction accuracy can be improved.
Next, the corpus extraction method disclosed in an embodiment of the present application is introduced. Referring to fig. 1, which is a flowchart of Embodiment 1 of the corpus extraction method provided in the present application, the method may include the following steps:
step S11, dividing the audio signal into a plurality of speech frames according to the set frame duration.
Sound is a time-varying signal, but in acoustics a segment of 10 ms to 30 ms is regarded as approximately stable and constant over that short interval. Therefore, the set frame duration is preferably set to any value from 10 ms to 30 ms.
Dividing the audio signal into a plurality of speech frames according to the set frame duration can be understood as: dividing the audio signal into a plurality of speech frames in time order according to the set frame duration.
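As an illustrative sketch only (the application does not specify an implementation), the frame division of step S11 might be expressed in Python as follows; the sample rate, the 20 ms default frame duration and the function name are assumptions made for the example:

```python
import numpy as np

def split_into_frames(samples: np.ndarray, sample_rate: int, frame_ms: int = 20):
    """Split a mono audio signal into fixed-duration speech frames, in time order."""
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame
    n_frames = len(samples) // frame_len              # drop any trailing partial frame
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

# e.g. 20 ms frames of 16 kHz audio contain 320 samples each.
```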
Step S12, performing activity detection on each voice frame according to a set activity detection mode, and determining an attribute of each voice frame, where the attribute is valid or silent.
The set activity detection mode may include, but is not limited to, a detection mode of a VAD (voice activity detection) algorithm, specifically: a normal mode, a low bit rate mode, an aggressive mode, or a very aggressive mode. Since each activity detection mode has a different sensitivity when detecting speech frames, the detection results under different modes may not be consistent.
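The four modes listed above match the aggressiveness levels exposed by the open-source WebRTC VAD; the following minimal sketch uses the webrtcvad Python package as an assumption about one possible concrete implementation, not as the implementation used by the application:

```python
import webrtcvad

# Assumed mapping: 0 = normal, 1 = low bit rate, 2 = aggressive, 3 = very aggressive.
vad = webrtcvad.Vad(2)

def frame_attribute(frame_bytes: bytes, sample_rate: int = 16000) -> str:
    """Label one frame 'valid' or 'silent'. frame_bytes must be 16-bit mono PCM
    covering 10, 20 or 30 ms, as webrtcvad requires."""
    return "valid" if vad.is_speech(frame_bytes, sample_rate) else "silent"
```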
And step S13, arranging the speech frames marked with the attributes in time order to obtain a speech frame sequence.
Step S14, using a rectangular window with a set size to slide in the speech frame sequence with a set step length, and detecting the initial position and the end position of the speech segment in the speech frame sequence.
Sliding a rectangular window with a set size through the speech frame sequence by a set step length to detect the start position and the end position of an utterance segment ensures that the speech frames in the sequence are examined in order and without gaps, which improves the accuracy of detecting the start and end positions of utterance segments in the speech frame sequence.
Step S15, extracting utterance segments from the audio signal according to the start position and the end position, and using the utterance segments as corpora.
During the sliding of the rectangular window from start to end, the start and end positions of one or more utterance segments can be determined. Once the start positions and end positions of utterance segments have been determined, the utterance segments are extracted from the audio signal according to each start position and its corresponding end position.
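As an illustrative sketch, once the frame-level start and end positions are known, the corresponding samples can be cut out of the audio signal; the sample rate and frame duration here are assumptions made for the example:

```python
def extract_segments(samples, segments, sample_rate=16000, frame_ms=20):
    """segments: list of (start_frame, end_frame) index pairs, both inclusive."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[start * frame_len:(end + 1) * frame_len] for start, end in segments]
```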
In the present application, after an audio signal is divided into a plurality of speech frames, activity detection is performed on each speech frame, and based on the activity detection results, endpoint detection (i.e., detection of the start position and the end position of an utterance segment) is performed by sliding a rectangular window with a set step length through the speech frame sequence formed by the speech frames. This ensures that the speech frames are examined in order and without omission, thereby improving corpus extraction accuracy.
As another alternative embodiment of the present application, referring to fig. 2, a schematic flow diagram of an embodiment 2 of a corpus extraction method provided in the present application is provided, where this embodiment mainly relates to a refinement scheme of the corpus extraction method described in the above embodiment 1, as shown in fig. 2, the method may include, but is not limited to, the following steps:
step S21, dividing the audio signal into a plurality of speech frames according to the set frame duration.
Step S22, performing activity detection on each voice frame according to a set activity detection mode, and determining an attribute of each voice frame, where the attribute is valid or silent.
And step S23, arranging the speech frames marked with the attributes in time order to obtain a speech frame sequence.
Step S24, sliding in the speech frame sequence by a set step length by using a rectangular window with a set size, and calculating a ratio of the number of speech frames with an attribute of being valid in the rectangular window to the size of the rectangular window as a first ratio, and calculating a ratio of the number of speech frames with an attribute of being silent in the rectangular window to the size of the rectangular window as a second ratio.
In this embodiment, a rectangular window with a set size slides through the speech frame sequence by a set step length until it reaches the last frame of the speech frame sequence. Each time the rectangular window slides, the ratio of the number of speech frames whose attribute is valid in the rectangular window to the size of the rectangular window is calculated as the first ratio, and the ratio of the number of speech frames whose attribute is silent in the rectangular window to the size of the rectangular window is calculated as the second ratio.
Step S25, determining the initial position of the speech segment in the speech frame sequence by comparing the first ratio with the first ratio threshold, and determining the ending position of the speech segment in the speech frame sequence by comparing the second ratio with the second ratio threshold.
Determining the start position of the utterance segment in the speech frame sequence by comparing the first ratio with the first ratio threshold can be understood as: comparing the first ratio with the first ratio threshold, and if the first ratio is not less than the first ratio threshold, taking the first frame in the rectangular window as the start position of the utterance segment in the speech frame sequence. For example, if the first ratio threshold is 0.9 and the size of the rectangular window is 10, as shown in fig. 3, where 0 in the speech frame sequence represents a mute frame and 1 represents a valid frame, then when the rectangular window slides to "1011111111", the first ratio is 0.9, equal to the first ratio threshold, so the first frame in the rectangular window can be taken as the start position of the utterance segment in the speech frame sequence.
Determining the end position of the utterance segment in the speech frame sequence by comparing the second ratio with the second ratio threshold can be understood as: comparing the second ratio with the second ratio threshold, and if the second ratio is not less than the second ratio threshold, taking the last frame in the rectangular window as the end position of the utterance segment in the speech frame sequence. For example, if the second ratio threshold is 0.9 and the size of the rectangular window is 10, as shown in fig. 4, when the rectangular window slides to "1000000000", the second ratio is 0.9, equal to the second ratio threshold, so the last frame in the rectangular window can be taken as the end position of the utterance segment in the speech frame sequence.
From the start of the sliding of the rectangular window to its end, each time a start position and an end position are determined, the speech frames between that start position and the corresponding end position can be taken as an utterance segment. As shown in fig. 5, the utterance segment is "1011111111.. 10111111111000000000".
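A minimal sketch of this sliding-window endpoint detection, assuming the frame attributes are available as a list of 'valid'/'silent' labels and that the window advances one frame per step; the thresholds and window size are the example values used above:

```python
def detect_endpoints(labels, window=10, step=1, start_th=0.9, end_th=0.9):
    """Return (start_frame, end_frame) pairs of utterance segments.

    labels   -- 'valid'/'silent' attributes in time order
    window   -- rectangular window size, in frames
    start_th -- first ratio threshold (share of valid frames)
    end_th   -- second ratio threshold (share of silent frames)
    """
    segments, start = [], None
    for i in range(0, len(labels) - window + 1, step):
        frames = labels[i:i + window]
        valid_ratio = frames.count("valid") / window     # first ratio
        silent_ratio = frames.count("silent") / window   # second ratio
        if start is None and valid_ratio >= start_th:
            start = i                                    # first frame of the window
        elif start is not None and silent_ratio >= end_th:
            segments.append((start, i + window - 1))     # last frame of the window
            start = None
    return segments

# detect_endpoints(['silent'] * 5 + ['valid'] * 30 + ['silent'] * 15) -> [(5, 43)]
```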
Steps S24-S25 are a specific implementation of step S14 in Embodiment 1.
Step S26, extracting a speaking segment from the audio signal according to the start position and the end position, and using the speaking segment as a corpus.
As another alternative embodiment of the present application, referring to fig. 6, a schematic flow diagram of an embodiment 3 of a corpus extraction method provided in the present application is provided, where this embodiment is mainly an extension of the corpus extraction method described in the above embodiment 1, as shown in fig. 6, the method may include, but is not limited to, the following steps:
step S31, dividing the audio signal into a plurality of speech frames according to the set frame duration.
Step S32, performing activity detection on each voice frame according to a set activity detection mode, and determining an attribute of each voice frame, where the attribute is valid or silent.
And step S33, arranging the speech frames marked with the attributes in time order to obtain a speech frame sequence.
Step S34, using a rectangular window with a set size to slide in the speech frame sequence with a set step length, and detecting the initial position and the end position of the speech segment in the speech frame sequence.
Step S35, extracting utterance segments from the audio signal according to the start position and the end position, and using the utterance segments as corpora.
The detailed procedures of steps S31-S35 can be referred to the related descriptions of steps S11-S15 in embodiment 1, and are not described herein again.
And step S36, optimizing the set frame duration, the set activity detection mode and the set size of the rectangular window to obtain an optimized set frame duration, an optimized set activity detection mode and an optimized set size of the rectangular window.
And step S37, replacing the set frame duration with the optimized set frame duration, replacing the set activity detection mode with the optimized set activity detection mode, and replacing the set size of the rectangular window with the optimized set size of the rectangular window.
By using the optimized set frame duration, the optimized set activity detection mode and the optimized set size of the rectangular window in place of the original settings, subsequent corpus extraction can be made more accurate.
In another embodiment of the present application, the process in Embodiment 3 of optimizing the set frame duration, the set activity detection mode and the set size of the rectangular window to obtain the optimized set frame duration, the optimized set activity detection mode and the optimized set size of the rectangular window is described in detail, and specifically may include:
a11, setting different values for the set frame duration, the set activity detection mode and the set size of the rectangular window, and randomly combining the set frame duration, the set activity detection mode and the set size of the rectangular window with different values to obtain different combination sets.
Preferably, the different values set for the set frame duration may be 10 ms, 20 ms or 30 ms; the set activity detection mode may be set to the normal mode, the low bit rate mode, the aggressive mode or the very aggressive mode; and the different values set for the set size of the rectangular window may be several times (for example, 4 to 10 times) the set frame duration; for example, when the set frame duration is 10 ms, the set size of the rectangular window may be 40 ms, 80 ms or 120 ms.
For example, if the set frame duration may be 10 ms, 20 ms or 30 ms, the set activity detection mode may be the normal mode, the low bit rate mode, the aggressive mode or the very aggressive mode, and the set size of the rectangular window may be 40 ms, 80 ms or 120 ms, then arbitrarily combining these values yields different combination sets, such as: a set frame duration of 10 ms, the normal mode, and a rectangular window of 40 ms; or a set frame duration of 10 ms, the low bit rate mode, and a rectangular window of 40 ms; or a set frame duration of 20 ms, the normal mode, and a rectangular window of 40 ms; or a set frame duration of 10 ms, the normal mode, and a rectangular window of 80 ms; and so on.
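A sketch of step A11: the combination sets are simply the Cartesian product of the candidate values; the value lists below are the example values named above:

```python
from itertools import product

frame_durations = [10, 20, 30]                                         # ms
vad_modes = ["normal", "low bit rate", "aggressive", "very aggressive"]
window_sizes = [40, 80, 120]                                           # ms

combination_sets = [
    {"frame_ms": f, "vad_mode": m, "window_ms": w}
    for f, m, w in product(frame_durations, vad_modes, window_sizes)
]
# 3 * 4 * 3 = 36 different combination sets
```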
And A12, extracting utterance segments from the audio test signals according to each combination set respectively, and taking the utterance segments as segmented corpora.
In this embodiment, 10000 audio test signals may be selected, but the number is not limited thereto.
The process of extracting utterance sections from the audio test signals according to the combination sets can be referred to the related descriptions of steps S11-S15 in embodiment 1, and will not be described herein again.
And A13, calculating the average amplitude of each audio test signal and the frame average amplitude of each segmentation corpus.
The process of calculating the average amplitude of each audio test signal and the frame average amplitude of each segmented corpus may include:
b11, respectively calculating the sum of the amplitudes of the sampling points in each audio test signal, and dividing the sum of the amplitudes of the sampling points by the number of the sampling points to obtain a result as the average amplitude of the audio test signal.
Respectively calculating the sum of the amplitudes of the sampling points in each audio test signal, dividing the sum of the amplitudes of the sampling points by the number of the sampling points, and taking the obtained result as the average amplitude of the audio test signal to be understood as follows:
and respectively calculating the amplitude sum of the sampling points in each audio test signal, dividing the amplitude sum of the sampling points by the number of the sampling points in the audio test signal, and taking the obtained result as the average amplitude of the audio test signal.
In this embodiment, the average amplitude of each audio test signal may be calculated as the sum of the amplitudes of its sampling points divided by the number of sampling points in the audio test signal, i.e. using the relational expression:
raw_avg = (A_1 + A_2 + ... + A_n) / n
wherein raw_avg is the average amplitude of a certain audio test signal, A_n is the amplitude of the n-th sampling point, and n is the number of sampling points in the audio test signal.
And B12, respectively calculating the sum of the amplitudes of the sampling points in each corpus, and dividing the sum of the amplitudes of the sampling points in the corpus by the number of the sampling points in the corpus to obtain a result which is used as the frame average amplitude of the corpus.
In this embodiment, the frame average amplitude of each segmented corpus may be calculated as the sum of the amplitudes of its sampling points divided by the number of sampling points in the corpus, i.e. using the relational expression:
frame_avg = (A_1 + A_2 + ... + A_n) / n
wherein frame_avg represents the frame average amplitude of a certain segmented corpus, A_n represents the amplitude of the n-th sampling point, and n is the number of sampling points in the segmented corpus.
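A sketch of the two averages in B11 and B12, assuming the audio test signal and the segmented corpus are available as NumPy arrays of sampling-point amplitudes; taking absolute values is an assumption, since the text only speaks of "amplitude":

```python
import numpy as np

def average_amplitude(samples: np.ndarray) -> float:
    """Sum of sampling-point amplitudes divided by the number of sampling points."""
    return float(np.sum(np.abs(samples)) / len(samples))   # abs() is an assumption

# raw_avg   = average_amplitude(audio_test_signal)   # B11, per audio test signal
# frame_avg = average_amplitude(segmented_corpus)    # B12, per segmented corpus
```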
A14, calculating the average mute duration of each audio test signal in different combination sets, and dividing the sum of the average mute durations of each audio test signal in different combination sets by the number of the combination sets respectively to obtain the result as the total average mute duration.
In this embodiment, the average mute duration of each audio test signal under each combination set may be calculated using the relational expression:
CorpusAverageSilence = (SilenceFrameNum × FrameDuration) / CorpusNum
wherein CorpusAverageSilence is the average mute duration of a certain audio test signal under a certain combination set, SilenceFrameNum is the number of mute frames in all corpora extracted from the audio test signal under that combination set, FrameDuration represents the duration of one speech frame, and CorpusNum is the number of all corpora extracted from the audio test signal under that combination set.
In this embodiment, the sum of the average mute durations of each audio test signal under the different combination sets may be divided by the number of combination sets, and the result taken as the total average mute duration, using the relational expression:
AudioAverageSilence = (CorpusAverageSilence_1 + CorpusAverageSilence_2 + ... + CorpusAverageSilence_n) / n
wherein AudioAverageSilence is the total average mute duration of a certain audio test signal, CorpusAverageSilence_i is the average mute duration of the audio test signal under the i-th combination set, and n is the number of combination sets.
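A sketch of A14, assuming that for a given audio test signal and combination set the number of mute frames, the frame duration and the number of extracted corpora are already known:

```python
def corpus_average_silence(silence_frame_num: int, frame_duration_ms: float,
                           corpus_num: int) -> float:
    """Average mute duration of one audio test signal under one combination set."""
    return silence_frame_num * frame_duration_ms / corpus_num

def audio_average_silence(per_set_averages: list) -> float:
    """Total average mute duration: mean over all combination sets."""
    return sum(per_set_averages) / len(per_set_averages)
```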
A15, for each audio test signal, respectively calculating the difference between its average mute duration under each combination set and its total average mute duration, sorting the differences from small to large, selecting the first L differences, and taking the combination sets corresponding to these first L differences as to-be-processed combination sets, wherein L is an integer greater than 1.
L can be set flexibly as needed; preferably, L may be set to 5.
A16, counting the occurrence frequency of each to-be-processed combination set across the to-be-processed combination sets of the plurality of audio test signals, and taking the k to-be-processed combination sets with the highest occurrence frequencies as candidate combination sets, wherein k is an integer greater than 1.
k can be set flexibly as needed; preferably, k may be set to 20.
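Steps A15 and A16 might be sketched as follows; the data layout (a mapping from each audio test signal to its average mute duration per combination set) and the use of absolute differences are assumptions made for the example:

```python
from collections import Counter

def candidate_combination_sets(avg_silence, overall_avg, L=5, k=20):
    """avg_silence: {signal_id: {combo_id: average mute duration}}
    overall_avg:   {signal_id: total average mute duration}
    Returns the k combination sets that most often rank among each signal's
    L combination sets closest to its total average mute duration."""
    counter = Counter()
    for sig, per_combo in avg_silence.items():
        ranked = sorted(per_combo, key=lambda c: abs(per_combo[c] - overall_avg[sig]))
        for combo_id in ranked[:L]:                       # A15: the L smallest differences
            counter[combo_id] += 1
    return [combo for combo, _ in counter.most_common(k)]  # A16: top k by frequency
```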
A17, calculating the variance of the corpora extracted according to each candidate combination set, taking the set frame time length in the candidate combination set corresponding to the corpora with the minimum variance as the optimized set frame time length, taking the set activity detection mode in the candidate combination set corresponding to the corpora with the minimum variance as the optimized set activity detection mode, and taking the set size of the rectangular window in the candidate combination set corresponding to the corpora with the minimum variance as the set size of the optimized rectangular window.
In this embodiment, the variance of the corpora extracted according to each candidate combination set may be calculated using a relational expression,
wherein variance represents the variance of the corpus, x_i represents the average amplitude of the i-th frame in the corpus, avg represents the average amplitude of the audio signal, t represents the duration of the corpus, and n represents the number of speech frames included in the corpus.
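The relational expression itself is not reproduced in this text; a plausible form consistent with the variables defined above, stated here only as an assumption rather than as the application's exact formula, would be:

```latex
% Assumed reconstruction, not the application's verbatim formula:
% spread of the frame average amplitudes of a corpus, normalized by the corpus duration t.
\mathrm{variance} = \frac{1}{t} \sum_{i=1}^{n} \left( x_i - \mathrm{avg} \right)^{2}
```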
In this embodiment, the set frame duration, the set activity detection mode and the set size of the rectangular window are optimized in real time by evaluating the variance (mean square error) of the segmented corpora, so that the probability of problems such as incomplete segmentation and overly long blank (silent) passages can be well controlled, and the quality of the extracted corpora is greatly improved.
As another alternative embodiment of the present application, referring to fig. 7, a schematic flow diagram of an embodiment 4 of a corpus extraction method provided in the present application is provided, where this embodiment is mainly an extension of the corpus extraction method described in the above embodiment 1, as shown in fig. 7, the method may include, but is not limited to, the following steps:
step S41, dividing the audio signal into a plurality of speech frames according to the set frame duration.
Step S42, performing activity detection on each voice frame according to a set activity detection mode, and determining an attribute of each voice frame, where the attribute is valid or silent.
And step S43, arranging the speech frames marked with the attributes in time order to obtain a speech frame sequence.
Step S44, using a rectangular window with a set size to slide in the speech frame sequence with a set step length, and detecting the initial position and the end position of the speech segment in the speech frame sequence.
Step S45, extracting utterance segments from the audio signal according to the start position and the end position, and using the utterance segments as corpora.
The detailed procedures of steps S41-S45 can be referred to the related descriptions of steps S11-S15 in embodiment 1, and are not described herein again.
And step S46, calculating corpus average amplitude and frame average amplitude of each corpus respectively.
The corpus average amplitude may be understood as: the average of the amplitudes of the sampling points contained in the corpus.
The frame average amplitude may be understood as: the average of the amplitudes of the sampling points contained in a frame of the corpus.
Step S47, respectively taking the frame with the frame average amplitude greater than the corpus average amplitude in each corpus as an effective frame, and calculating the ratio of the effective frame.
The process of calculating the proportion of valid frames may include:
and dividing the number of the effective frames by the number of all frames in the corpus to obtain a result which is the proportion of the effective frames.
Step S48, using the corpus in which the ratio of the effective frame is greater than the set effective frame ratio threshold and the corpus average amplitude is greater than the set corpus average amplitude threshold as the effective corpus.
In this embodiment, after the corpora are extracted in step S45, invalid corpora such as corpora with an excessively long silence duration, a low voice, or pure background noise inevitably occur. Compared with a normal corpus, the audio signal of a corpus with a long silence duration or a low voice has a relatively weak amplitude. Thus, the average amplitude of the audio signal of an invalid corpus is usually lower, and the proportion of low-amplitude frames in the whole corpus is higher.
Based on these characteristics of invalid corpora, the corpora are filtered according to the proportion of effective frames in each corpus and the corpus average amplitude.
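A minimal sketch of the filtering in steps S46 to S48, assuming a corpus is represented as a list of per-frame arrays of sampling-point amplitudes; the two thresholds are placeholders, not values given by the application:

```python
import numpy as np

def is_valid_corpus(frames, ratio_threshold=0.5, amplitude_threshold=100.0) -> bool:
    """frames: list of 1-D NumPy arrays, one per speech frame of the corpus."""
    frame_avgs = [float(np.mean(np.abs(f))) for f in frames]       # frame average amplitudes
    corpus_avg = float(np.mean(np.abs(np.concatenate(frames))))    # corpus average amplitude
    effective = sum(1 for a in frame_avgs if a > corpus_avg)       # S47: effective frames
    return (effective / len(frames) > ratio_threshold              # S48: both conditions
            and corpus_avg > amplitude_threshold)
```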
The corpus extraction device provided in the present application is described next, and the corpus extraction device described below and the corpus extraction method described above may be referred to in correspondence.
Referring to fig. 8, the corpus extraction device includes: a partitioning module 11, an activity detection module 12, an arrangement module 13, an endpoint detection module 14, and an extraction module 15.
The dividing module 11 is configured to divide the audio signal into a plurality of voice frames according to a set frame duration;
an activity detection module 12, configured to perform activity detection on each voice frame according to a set activity detection mode, and determine an attribute of each voice frame, where the attribute is valid or silent;
the arrangement module 13 is configured to arrange the speech frames labeled with the attributes in time order to obtain a speech frame sequence;
an endpoint detection module 14, configured to utilize a rectangular window with a set size to slide in the speech frame array by a set step length, and detect a starting position and an ending position of a speech segment in the speech frame array;
and the extracting module 15 is configured to extract a speech segment from the audio signal according to the starting position and the ending position, and use the speech segment as a corpus.
In this embodiment, the endpoint detection module 14 may include:
a first calculating module, configured to utilize a rectangular window with a set size, slide in the speech frame array in a set step length, calculate a ratio of the number of speech frames with an effective attribute in the rectangular window to the size of the rectangular window, as a first ratio, and calculate a ratio of the number of speech frames with a mute attribute in the rectangular window to the size of the rectangular window, as a second ratio;
the comparison module is used for determining the initial position of the speech section in the speech frame array by comparing the first proportion with a first proportion threshold value, and determining the end position of the speech section in the speech frame array by comparing the second proportion with a second proportion threshold value;
in this embodiment, the corpus extraction device may further include:
the optimization module is used for optimizing the set frame duration, the set activity detection mode and the set size of the rectangular window to obtain an optimized set frame duration, an optimized set activity detection mode and an optimized set size of the rectangular window;
and the replacing module is used for replacing the set frame time length with the optimized set frame time length, replacing the set activity detection mode with the optimized activity detection mode and replacing the set size of the rectangular window with the set size of the optimized rectangular window.
In this embodiment, the optimization module is specifically configured to:
setting different values for the set frame duration, the set activity detection mode and the set size of the rectangular window respectively, and randomly combining the set frame duration, the set activity detection mode and the set size of the rectangular window with different values to obtain different combination sets;
extracting utterance segments from audio test signals according to each combination set respectively, and taking the utterance segments as segmented corpora;
calculating the average amplitude of each audio test signal and the frame average amplitude of each segmented corpus;
calculating the average mute duration of each audio test signal in different combination sets, and dividing the sum of the average mute durations of each audio test signal in different combination sets by the number of the combination sets respectively to obtain a result as the total average mute duration;
for each audio test signal, respectively calculating the difference between its average mute duration under each combination set and its total average mute duration, sorting the differences from small to large, selecting the first L differences, and taking the combination sets corresponding to these first L differences as to-be-processed combination sets, wherein L is an integer greater than 1;
counting the occurrence frequency of each to-be-processed combination set across the to-be-processed combination sets of the plurality of audio test signals, and taking the k to-be-processed combination sets with the highest occurrence frequencies as candidate combination sets, wherein k is an integer greater than 1;
and respectively calculating the variances of the corpora extracted according to the candidate combination sets, taking the set frame time length in the candidate combination set corresponding to the corpora with the minimum variance as the optimized set frame time length, taking the set activity detection mode in the candidate combination set corresponding to the corpora with the minimum variance as the optimized set activity detection mode, and taking the set size of the rectangular window in the candidate combination set corresponding to the corpora with the minimum variance as the set size of the optimized rectangular window.
In this embodiment, the corpus extraction device may further include:
the second calculation module is used for calculating the corpus average amplitude and the frame average amplitude of each corpus respectively;
the first determining module is used for respectively taking the frames with the frame average amplitude values larger than the corpus average amplitude values in each corpus as effective frames and calculating the proportion of the effective frames;
and the second determining module is used for taking the corpus of which the ratio of the effective frame is greater than a set effective frame ratio threshold and the average corpus amplitude is greater than a set corpus average amplitude threshold as an effective corpus.
In this embodiment, the optimization module may be specifically configured to:
respectively calculating the amplitude sum of sampling points in each audio test signal, and dividing the amplitude sum of the sampling points by the number of the sampling points to obtain a result as the average amplitude of the audio test signal;
and respectively calculating the sum of the amplitudes of the sampling points in each corpus, dividing the sum of the amplitudes of the sampling points in the corpus by the number of the sampling points in the corpus, and taking the obtained result as the frame average amplitude of the corpus.
In this embodiment, the optimization module may be specifically configured to:
calculating, by using a relational expression, the variance of the corpora extracted according to each candidate combination set;
wherein variance represents the variance of the corpus, x_i represents the average amplitude of the i-th frame in the corpus, avg represents the average amplitude of the audio signal, t represents the duration of the corpus, and n represents the number of speech frames included in the corpus.
It should be noted that each embodiment is mainly described as a difference from the other embodiments, and the same and similar parts between the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The corpus extraction method and device provided by the present application are introduced in detail, and a specific example is applied in the text to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.