CN110600010B - Corpus extraction method and apparatus - Google Patents


Info

Publication number: CN110600010B
Authority: CN (China)
Prior art keywords: frame, corpus, rectangular window, speech, activity detection
Legal status: Active
Application number: CN201910891615.0A
Other languages: Chinese (zh)
Other versions: CN110600010A
Inventors: 李博, 杨森
Current Assignee: Du Xiaoman Technology Beijing Co Ltd
Original Assignee: Du Xiaoman Technology Beijing Co Ltd
Application filed by Du Xiaoman Technology Beijing Co Ltd; priority to CN201910891615.0A
Publication of application CN110600010A; application granted; publication of grant CN110600010B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 - Segmentation; Word boundary detection
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/45 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window


Abstract

In the method, an audio signal is first divided into a plurality of speech frames and activity detection is performed on each frame. Based on the detection results, a rectangular window is slid, with a set step length, over the speech frame sequence formed by the frames, and endpoint detection is performed, i.e., the start position and end position of each utterance segment are detected.

Description

Corpus extraction method and apparatus
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a corpus extraction method and apparatus.
Background
With the rapid development of artificial intelligence, speech recognition technology has made major breakthroughs and is widely applied in fields such as e-commerce, finance, and instant messaging.
In a speech recognition system, recognition is generally performed using a speech recognition model. Training such a model requires a large amount of corpus data (e.g., audio and labeled text). The quality of the corpus directly affects the final recognition accuracy of the model, so accurately extracting corpora from audio data is very important for speech recognition. How to perform corpus extraction accurately, however, remains a problem.
Disclosure of Invention
To solve the above technical problem, embodiments of the present application provide a corpus extraction method and apparatus that improve corpus extraction accuracy. The technical solution is as follows:
a corpus extraction method comprises the following steps:
dividing the audio signal into a plurality of voice frames according to the set frame duration;
respectively carrying out activity detection on each voice frame according to a set activity detection mode, and determining the attribute of each voice frame, wherein the attribute is effective or silent;
arranging the voice frames marked with the attributes according to a time sequence to obtain a voice frame list;
using a rectangular window with a set size to slide in the voice frame array by a set step length, and detecting the initial position and the end position of a speaking speech segment in the voice frame array;
and extracting a speaking speech segment from the audio signal according to the initial position and the end position, and taking the speaking speech segment as a corpus.
Preferably, the detecting the initial position and the end position of the speech segment in the speech frame sequence by using the rectangular window with the set size to slide in the speech frame sequence with the set step length includes:
sliding in the voice frame array by using a rectangular window with a set size according to a set step length, calculating the ratio of the number of voice frames with an effective attribute in the rectangular window to the size of the rectangular window as a first ratio, and calculating the ratio of the number of voice frames with a mute attribute in the rectangular window to the size of the rectangular window as a second ratio;
and determining the initial position of the speech section in the speech frame array by comparing the first proportion with a first proportion threshold value, and determining the end position of the speech section in the speech frame array by comparing the second proportion with a second proportion threshold value.
Preferably, the method further comprises:
optimizing the set frame time length, the set activity detection mode and the set size of the rectangular window to obtain the optimized set frame time length, the optimized set activity detection mode and the optimized set size of the rectangular window;
and replacing the set frame time length with the optimized set frame time length, replacing the set activity detection mode with the optimized activity detection mode, and replacing the set size of the rectangular window with the set size of the optimized rectangular window.
Preferably, the optimizing the set frame duration, the set activity detection mode, and the set size of the rectangular window to obtain an optimized set frame duration, an optimized set activity detection mode, and an optimized set size of the rectangular window includes:
setting different values for the set frame duration, the set activity detection mode and the set size of the rectangular window respectively, and randomly combining the set frame duration, the set activity detection mode and the set size of the rectangular window with different values to obtain different combination sets;
extracting speaking speech segments from the audio test signals according to the combination sets respectively, and taking the speaking speech segments as segmentation linguistic data;
calculating the average amplitude of each audio test signal and the frame average amplitude of each segmented corpus;
calculating the average mute duration of each audio test signal in different combination sets, and dividing the sum of the average mute durations of each audio test signal in different combination sets by the number of the combination sets respectively to obtain a result as the total average mute duration;
respectively comparing the difference value between the average mute time length of each audio test signal under different combination sets and the total average mute time length of the audio test signal, selecting the difference values arranged at the first L according to the sequence of the difference values from small to large, and taking the combination set corresponding to the difference values arranged at the first L as a combination set to be processed, wherein L is an integer greater than 1;
according to the sequence of the occurrence frequencies from top to bottom, counting the first k to-be-processed combination sets with the occurrence frequencies arranged in the to-be-processed combination sets of the plurality of audio test signals, and taking the first k to-be-processed combination sets as candidate combination sets, wherein k is an integer greater than 1;
and respectively calculating the variances of the corpora extracted according to the candidate combination sets, taking the set frame time length in the candidate combination set corresponding to the corpora with the minimum variance as the optimized set frame time length, taking the set activity detection mode in the candidate combination set corresponding to the corpora with the minimum variance as the optimized set activity detection mode, and taking the set size of the rectangular window in the candidate combination set corresponding to the corpora with the minimum variance as the set size of the optimized rectangular window.
Preferably, the method further comprises:
respectively calculating the corpus average amplitude and the frame average amplitude of each corpus;
respectively taking the frames with the frame average amplitude values larger than the corpus average amplitude values in each corpus as effective frames, and calculating the proportion of the effective frames;
and taking the corpus of which the ratio of the effective frame is greater than a set effective frame ratio threshold value and the average corpus amplitude is greater than a set corpus average amplitude threshold value as an effective corpus.
Preferably, the calculating the average amplitude of each audio test signal and the frame average amplitude of each segmented corpus includes:
respectively calculating the amplitude sum of sampling points in each audio test signal, and dividing the amplitude sum of the sampling points by the number of the sampling points to obtain a result as the average amplitude of the audio test signal;
and respectively calculating the sum of the amplitudes of the sampling points in each corpus, dividing the sum of the amplitudes of the sampling points in the corpus by the number of the sampling points in the corpus, and taking the obtained result as the frame average amplitude of the corpus.
Preferably, the calculating the variance of the corpus extracted according to each candidate combination set includes:
using the relational expression

variance = ((x_1 - avg)² + (x_2 - avg)² + ... + (x_n - avg)²) / n

respectively calculating the variance of the corpora extracted according to each candidate combination set;
wherein variance represents the variance of the corpus, x_i represents the average amplitude of the ith frame in the corpus, avg represents the average amplitude of the audio signal, t represents the duration of the corpus, and n represents the number of speech frames contained in the corpus.
A corpus extraction device, comprising:
the dividing module is used for dividing the audio signal into a plurality of voice frames according to the set frame duration;
the activity detection module is used for respectively carrying out activity detection on each voice frame according to a set activity detection mode and determining the attribute of each voice frame, wherein the attribute is effective or silent;
the arrangement module is used for arranging the voice frames marked with the attributes according to a time sequence to obtain a voice frame array;
an endpoint detection module, configured to utilize a rectangular window with a set size to slide in the speech frame array by a set step length, and detect an initial position and an end position of a speech segment in the speech frame array;
and the extraction module is used for extracting a speaking language segment from the audio signal according to the initial position and the end position, and taking the speaking language segment as a language material.
Preferably, the endpoint detection module includes:
a first calculating module, configured to utilize a rectangular window with a set size, slide in the speech frame array in a set step length, calculate a ratio of the number of speech frames with an effective attribute in the rectangular window to the size of the rectangular window, as a first ratio, and calculate a ratio of the number of speech frames with a mute attribute in the rectangular window to the size of the rectangular window, as a second ratio;
and the comparison module is used for determining the initial position of the speech section in the speech frame array by comparing the first proportion with a first proportion threshold value, and determining the end position of the speech section in the speech frame array by comparing the second proportion with a second proportion threshold value.
Preferably, the apparatus further comprises:
the optimization module is used for optimizing the set frame time length, the set activity detection mode and the set size of the rectangular window to obtain the optimized set frame time length, the optimized set activity detection mode and the optimized set size of the rectangular window;
and the replacing module is used for replacing the set frame time length with the optimized set frame time length, replacing the set activity detection mode with the optimized activity detection mode and replacing the set size of the rectangular window with the set size of the optimized rectangular window.
Preferably, the optimization module is specifically configured to:
respectively setting different values for the set frame duration, the set activity detection mode and the set size of the rectangular window, and randomly combining the set frame duration, the set activity detection mode and the set size of the rectangular window with different values to obtain different combination sets;
extracting speaking speech segments from the audio test signals according to the combination sets respectively, and taking the speaking speech segments as segmentation linguistic data;
calculating the average amplitude of each audio test signal and the frame average amplitude of each segmented corpus;
calculating the average mute duration of each audio test signal in different combination sets, and dividing the sum of the average mute durations of each audio test signal in different combination sets by the number of the combination sets respectively to obtain a result as the total average mute duration;
respectively comparing the difference value between the average mute time length of each audio test signal under different combination sets and the total average mute time length of the audio test signal, selecting the difference values arranged at the first L according to the sequence of the difference values from small to large, and taking the combination set corresponding to the difference values arranged at the first L as a combination set to be processed, wherein L is an integer greater than 1;
according to the sequence of the occurrence frequencies from top to bottom, counting the first k to-be-processed combination sets with the occurrence frequencies arranged in the to-be-processed combination sets of the plurality of audio test signals, and taking the first k to-be-processed combination sets as candidate combination sets, wherein k is an integer greater than 1;
and respectively calculating the variances of the corpora extracted according to each candidate combination set, taking the set frame time length in the candidate combination set corresponding to the corpus with the minimum variance as the optimized set frame time length, taking the set activity detection mode in the candidate combination set corresponding to the corpus with the minimum variance as the optimized set activity detection mode, and taking the set size of the rectangular window in the candidate combination set corresponding to the corpus with the minimum variance as the set size of the optimized rectangular window.
Preferably, the apparatus further comprises:
the second calculation module is used for calculating the corpus average amplitude and the frame average amplitude of each corpus respectively;
the first determining module is used for respectively taking the frames with the frame average amplitude values larger than the corpus average amplitude values in each corpus as effective frames and calculating the proportion of the effective frames;
and the second determining module is used for taking the corpus of which the ratio of the effective frame is greater than a set effective frame ratio threshold and the average corpus amplitude is greater than a set corpus average amplitude threshold as an effective corpus.
Preferably, the optimization module is specifically configured to:
respectively calculating the amplitude sum of sampling points in each audio test signal, and dividing the amplitude sum of the sampling points by the number of the sampling points to obtain a result as the average amplitude of the audio test signal;
and respectively calculating the sum of the amplitudes of the sampling points in each corpus, and dividing the sum of the amplitudes of the sampling points in the corpus by the number of the sampling points in the corpus to obtain a result as the frame average amplitude of the corpus.
Preferably, the optimization module is specifically configured to:
calculate, using the relational expression

variance = ((x_1 - avg)² + (x_2 - avg)² + ... + (x_n - avg)²) / n

the variance of the corpora extracted according to each candidate combination set;
wherein variance represents the variance of the corpus, x_i represents the average amplitude of the ith frame in the corpus, avg represents the average amplitude of the audio signal, t represents the duration of the corpus, and n represents the number of speech frames contained in the corpus.
Compared with the prior art, the beneficial effects of the present application are as follows:
In the present application, after an audio signal is divided into a plurality of speech frames, activity detection is performed on each speech frame. Based on the detection results, a rectangular window is slid, with a set step length, over the speech frame sequence formed by the frames to perform endpoint detection, i.e., detection of the start position and end position of each utterance segment.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
FIG. 1 is a flowchart of an embodiment 1 of a corpus extraction method provided in the present application;
FIG. 2 is a flowchart of an embodiment 2 of a corpus extraction method provided in the present application;
FIG. 3 is a schematic view of the sliding of a rectangular window provided herein;
FIG. 4 is another schematic illustration of the rectangular window sliding provided herein;
FIG. 5 is yet another illustration of the sliding of a rectangular window provided by the present application;
FIG. 6 is a flowchart of an embodiment 3 of a corpus extraction method provided in the present application;
FIG. 7 is a flowchart of an embodiment 4 of a corpus extraction method provided in the present application;
fig. 8 is a schematic diagram of a logical structure of a corpus extraction device according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application discloses a corpus extraction method, which comprises the following steps: dividing the audio signal into a plurality of voice frames according to the set frame duration; respectively carrying out activity detection on each voice frame according to a set activity detection mode, and determining the attribute of each voice frame, wherein the attribute is effective or silent; arranging the voice frames marked with the attributes according to the time sequence to obtain a voice frame list; using a rectangular window with a set size to slide in the voice frame array by a set step length, and detecting the initial position and the end position of a speaking speech segment in the voice frame array; and extracting a speaking speech segment from the audio signal according to the initial position and the end position, and taking the speaking speech segment as a corpus. According to the method and the device, the corpus extraction accuracy can be improved.
Next, a corpus extraction method disclosed in an embodiment of the present application is introduced, and as shown in fig. 1, a flowchart of an embodiment 1 of the corpus extraction method provided in the present application may include the following steps:
step S11, dividing the audio signal into a plurality of speech frames according to the set frame duration.
Sound is a time-varying signal; in acoustics, however, a segment of 10 ms to 30 ms is considered short enough to be approximately stationary. Therefore, the set frame duration is preferably any value from 10 ms to 30 ms.
According to the set frame duration, dividing the audio signal into a plurality of speech frames, which can be understood as: and dividing the audio signal into a plurality of voice frames according to the set frame duration and the time sequence.
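As an illustrative sketch only (the file handling, sample rate, and I/O library below are assumptions, not part of the described method), step S11 could be implemented roughly as follows:

```python
import soundfile as sf  # assumed PCM reader; any audio I/O library would do

def split_into_frames(audio_path, frame_ms=20):
    """Divide an audio signal into fixed-duration speech frames in time order (step S11)."""
    samples, sample_rate = sf.read(audio_path, dtype="int16")  # a mono signal is assumed here
    frame_len = int(sample_rate * frame_ms / 1000)             # samples per frame
    n_frames = len(samples) // frame_len                       # the trailing partial frame is dropped
    frames = [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]
    return frames, sample_rate
```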
Step S12, performing activity detection on each voice frame according to a set activity detection mode, and determining an attribute of each voice frame, where the attribute is valid or silent.
The set activity detection mode may include, but is not limited to, the detection modes of a VAD (voice activity detection) algorithm, specifically a normal mode, a low bit rate mode, an aggressive mode, or a very aggressive mode. Since each activity detection mode has a different sensitivity when detecting speech frames, the detection results under the different modes may not be consistent.
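The four modes listed above match, for example, the aggressiveness levels 0-3 of the WebRTC VAD; using the `webrtcvad` Python package here is an assumption made for illustration, not a requirement of the method. A sketch of step S12:

```python
import webrtcvad

def label_frames(frames, sample_rate, vad_mode=1):
    """Mark each frame as valid (1) or silent (0) using a VAD (step S12).
    vad_mode: 0 = normal, 1 = low bit rate, 2 = aggressive, 3 = very aggressive."""
    vad = webrtcvad.Vad(vad_mode)
    labels = []
    for frame in frames:
        # webrtcvad expects 16-bit mono PCM frames of 10, 20 or 30 ms
        labels.append(1 if vad.is_speech(frame.tobytes(), sample_rate) else 0)
    return labels
```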
And step S13, arranging the voice frames marked with the attributes according to the time sequence to obtain a voice frame list.
Step S14, using a rectangular window with a set size to slide in the speech frame sequence with a set step length, and detecting the initial position and the end position of the speech segment in the speech frame sequence.
The method and the device utilize a rectangular window with a set size to slide in the voice frame array by a set step length to detect the initial position and the end position of the speech segment in the voice frame array, can ensure that the voice frames in the voice frame array are orderly and continuously detected, and improve the accuracy of detecting the initial position and the end position of the speech segment in the voice frame array.
Step S15, extracting utterance segments from the audio signal according to the start position and the end position, and using the utterance segments as corpora.
During the process from the start of the rectangular window to the end of the sliding, the start and end positions of one or more utterance sections can be determined. When the initial positions and the end positions of the speech segments are determined, the speech segments are extracted from the audio signal according to the initial positions and the corresponding end positions.
In the present application, after an audio signal is divided into a plurality of speech frames, activity detection is performed on each speech frame. Based on the detection results, a rectangular window is slid, with a set step length, over the speech frame sequence formed by the frames to perform endpoint detection, i.e., detection of the start position and end position of each utterance segment.
As another alternative embodiment of the present application, referring to fig. 2, a schematic flow diagram of an embodiment 2 of a corpus extraction method provided in the present application is provided, where this embodiment mainly relates to a refinement scheme of the corpus extraction method described in the above embodiment 1, as shown in fig. 2, the method may include, but is not limited to, the following steps:
step S21, dividing the audio signal into a plurality of speech frames according to the set frame duration.
Step S22, performing activity detection on each voice frame according to a set activity detection mode, and determining an attribute of each voice frame, where the attribute is valid or silent.
And step S23, arranging the voice frames marked with the attributes according to the time sequence to obtain a voice frame list.
Step S24, sliding in the speech frame sequence by a set step length by using a rectangular window with a set size, and calculating a ratio of the number of speech frames with an attribute of being valid in the rectangular window to the size of the rectangular window as a first ratio, and calculating a ratio of the number of speech frames with an attribute of being silent in the rectangular window to the size of the rectangular window as a second ratio.
In this embodiment, a rectangular window with a set size is slid over the speech frame sequence with a set step length until the last frame of the sequence is reached. Each time the rectangular window slides, the ratio of the number of speech frames in the window whose attribute is valid to the size of the window is calculated as the first ratio, and the ratio of the number of speech frames in the window whose attribute is silent to the size of the window is calculated as the second ratio.
Step S25, determining the initial position of the speech segment in the speech frame sequence by comparing the first ratio with the first ratio threshold, and determining the ending position of the speech segment in the speech frame sequence by comparing the second ratio with the second ratio threshold.
By comparing the first ratio with a first ratio threshold, determining the initial position of the speech segment in the speech frame sequence, which can be understood as: and comparing the first proportion with a first proportion threshold, and if the first proportion is not less than the first proportion threshold, determining a first frame in the rectangular window as the initial position of the speech segment in the speech frame column. For example, if the first scale threshold is 0.9 and the size of the rectangular window is 10, as shown in fig. 3, 0 in the speech frame column represents a mute frame and 1 represents a valid frame, and when the rectangular window slides to "1011111111", the first scale is 0.9, which is equal to the first scale threshold, then the first frame in the rectangular window can be used as the starting position of the speech segment in the speech frame column.
By comparing the second ratio with the second ratio threshold, the end position of the speech segment in the speech frame column is determined, which can be understood as: and comparing the second proportion with a second proportion threshold, and if the second proportion is not less than the second proportion threshold, determining that the last frame in the rectangular window is the end position of the speech section in the speech frame column. For example, if the second ratio threshold is 0.9 and the size of the rectangular window is 10, as shown in fig. 4, when the rectangular window slides to "1000000000", the second ratio is 0.9 and is equal to the second ratio threshold, the last frame in the rectangular window can be the end position of the speech segment in the speech frame column.
From the start of the sliding to its end, whenever a start position and a corresponding end position have both been determined, the speech frames between them are taken as an utterance segment. As shown in fig. 5, the utterance segment runs from the window "1011111111" that triggered the start position through the window "1000000000" that triggered the end position.
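A minimal sketch of the sliding-window endpoint detection of steps S24-S25, assuming the frame labels produced by activity detection (1 = valid, 0 = silent); the window size, step length, and thresholds are illustrative values:

```python
def detect_endpoints(labels, win_size=10, step=1, first_threshold=0.9, second_threshold=0.9):
    """Slide a rectangular window over the labeled frame sequence and return
    (start_frame, end_frame) index pairs of the detected utterance segments."""
    segments, start = [], None
    for pos in range(0, len(labels) - win_size + 1, step):
        window = labels[pos:pos + win_size]
        first_ratio = sum(window) / win_size                  # share of valid frames
        second_ratio = 1.0 - first_ratio                      # share of silent frames
        if start is None and first_ratio >= first_threshold:
            start = pos                                       # first frame of the window
        elif start is not None and second_ratio >= second_threshold:
            segments.append((start, pos + win_size - 1))      # last frame of the window
            start = None
    return segments
```

With the values of figs. 3 and 4 (threshold 0.9, window size 10), a window containing nine valid frames marks a start position and a window containing nine silent frames marks an end position.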
Steps S24-S25 are a specific implementation of step S14 in example 1.
Step S26, extracting a speaking segment from the audio signal according to the start position and the end position, and using the speaking segment as a corpus.
As another alternative embodiment of the present application, referring to fig. 6, a schematic flow diagram of an embodiment 3 of a corpus extraction method provided in the present application is provided, where this embodiment is mainly an extension of the corpus extraction method described in the above embodiment 1, as shown in fig. 6, the method may include, but is not limited to, the following steps:
step S31, dividing the audio signal into a plurality of speech frames according to the set frame duration.
Step S32, performing activity detection on each voice frame according to a set activity detection mode, and determining an attribute of each voice frame, where the attribute is valid or silent.
And step S33, arranging the voice frames marked with the attributes according to the time sequence to obtain a voice frame list.
Step S34, using a rectangular window with a set size to slide in the speech frame sequence with a set step length, and detecting the initial position and the end position of the speech segment in the speech frame sequence.
Step S35, extracting utterance segments from the audio signal according to the start position and the end position, and using the utterance segments as corpora.
The detailed procedures of steps S31-S35 can be referred to the related descriptions of steps S11-S15 in embodiment 1, and are not described herein again.
And step S36, optimizing the set frame time length, the set activity detection mode and the set size of the rectangular window to obtain the optimized set frame time length, the optimized set activity detection mode and the optimized set size of the rectangular window.
And step S37, replacing the set frame time length with the optimized set frame time length, replacing the set activity detection mode with the optimized activity detection mode, and replacing the set size of the rectangular window with the set size of the optimized rectangular window.
By replacing the set frame duration with the optimized set frame duration, the set activity detection mode with the optimized activity detection mode, and the set size of the rectangular window with the optimized set size of the rectangular window, subsequent corpus extraction can be made more accurate.
In another embodiment of the present application, the process in embodiment 3 of optimizing the set frame duration, the set activity detection mode, and the set size of the rectangular window to obtain the optimized set frame duration, the optimized set activity detection mode, and the optimized set size of the rectangular window is described in detail, and may specifically include:
a11, setting different values for the set frame duration, the set activity detection mode and the set size of the rectangular window, and randomly combining the set frame duration, the set activity detection mode and the set size of the rectangular window with different values to obtain different combination sets.
Preferably, the different values set for the set frame duration may be 10 ms, 20 ms, or 30 ms; the set activity detection mode may be set to the normal mode, the low bit rate mode, the aggressive mode, or the very aggressive mode; and the different values set for the set size of the rectangular window may be several times (e.g., 4 to 10 times) the set frame duration. For example, when the set frame duration is 10 ms, the set size of the rectangular window may be 40 ms, 80 ms, or 120 ms.
For example, different values are set for the set frame duration, the set activity detection mode, and the set size of the rectangular window, and the set frame duration, the set activity detection mode, and the set size of the rectangular window with different values are arbitrarily combined to obtain different combination sets, for example, if the set frame duration is 10ms, 20ms, or 30ms, the set activity detection mode is a normal mode, a low bit rate mode, an aggressive mode, or a very aggressive mode, and the set size of the rectangular window is 40ms, 80ms, or 120ms, the set frame duration, the set activity detection mode, and the set size of the rectangular window with different values are arbitrarily combined to obtain different combination sets, which can be: a set frame duration of 10ms, a set activity detection mode of the normal mode, and a set size of a rectangular window of 40 ms; or, a set frame duration of 10ms, a set activity detection mode of the low bit rate mode, and a set size of a rectangular window of 40 ms; or, a set frame duration of 20ms, a set activity detection mode of the normal mode, and a set size of a rectangular window of 40 ms; or a set frame duration of 10ms, a set activity detection mode of the normal mode, a set size of a rectangular window of 80ms, and the like.
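As a sketch of step A11, the combination sets can be enumerated as a Cartesian product of the candidate values; the value lists below simply reuse the examples given above:

```python
from itertools import product

frame_durations_ms = [10, 20, 30]
vad_modes = ["normal", "low bit rate", "aggressive", "very aggressive"]
window_sizes_ms = [40, 80, 120]

combination_sets = [
    {"frame_ms": f, "vad_mode": m, "window_ms": w}
    for f, m, w in product(frame_durations_ms, vad_modes, window_sizes_ms)
]
# 3 * 4 * 3 = 36 combination sets for these example values
```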
And A12, extracting utterance segments from the audio test signals according to the combination sets respectively, and taking the utterance segments as segmentation corpora.
In this embodiment, 10,000 audio test signals may be selected, for example, although the number is not limited thereto.
The process of extracting utterance sections from the audio test signals according to the combination sets can be referred to the related descriptions of steps S11-S15 in embodiment 1, and will not be described herein again.
And A13, calculating the average amplitude of each audio test signal and the frame average amplitude of each segmentation corpus.
The process of calculating the average amplitude of each audio test signal and the frame average amplitude of each segmented corpus may include:
b11, respectively calculating the sum of the amplitudes of the sampling points in each audio test signal, and dividing the sum of the amplitudes of the sampling points by the number of the sampling points to obtain a result as the average amplitude of the audio test signal.
Respectively calculating the sum of the amplitudes of the sampling points in each audio test signal, dividing the sum of the amplitudes of the sampling points by the number of the sampling points, and taking the obtained result as the average amplitude of the audio test signal to be understood as follows:
and respectively calculating the amplitude sum of the sampling points in each audio test signal, dividing the amplitude sum of the sampling points by the number of the sampling points in the audio test signal, and taking the obtained result as the average amplitude of the audio test signal.
In this embodiment, the following relational expression may be used to calculate the sum of the amplitudes of the sampling points in each audio test signal and divide it by the number of sampling points in that audio test signal, the result being taken as the average amplitude of the audio test signal:

raw_avg = (A_1 + A_2 + ... + A_n) / n

where raw_avg is the average amplitude of a given audio test signal, A_i is the amplitude of the ith sampling point, and n is the number of sampling points in the audio test signal.
And B12, respectively calculating the sum of the amplitudes of the sampling points in each corpus, and dividing the sum of the amplitudes of the sampling points in the corpus by the number of the sampling points in the corpus to obtain a result which is used as the frame average amplitude of the corpus.
In this embodiment, the following relational expression may be used to calculate the sum of the amplitudes of the sampling points in each segmented corpus and divide it by the number of sampling points in that corpus, the result being taken as the frame average amplitude of the corpus:

frame_avg = (A_1 + A_2 + ... + A_n) / n

where frame_avg represents the frame average amplitude of a given segmented corpus, A_i represents the amplitude of the ith sampling point, and n is the number of sampling points in the segmented corpus.
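A sketch of step A13, assuming each audio test signal and each segmented corpus is available as an array of PCM samples; taking the absolute value as the "amplitude" of a signed sample is an assumption of this sketch:

```python
import numpy as np

def average_amplitude(samples):
    """raw_avg / frame_avg: sum of sample amplitudes divided by the number of samples (B11, B12)."""
    samples = np.abs(np.asarray(samples, dtype=np.float64))   # amplitude taken as |sample|
    return samples.sum() / len(samples)
```

The same helper serves for both raw_avg of an audio test signal and frame_avg of a segmented corpus, since both are defined identically over their respective sampling points.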
A14, calculating the average mute duration of each audio test signal in different combination sets, and dividing the sum of the average mute durations of each audio test signal in different combination sets by the number of the combination sets respectively to obtain the result as the total average mute duration.
In this embodiment, the following relational expression may be used to calculate the average mute duration of each audio test signal under a given combination set:

CorpusAverageSilence = (SilenceFrameNum × FrameDuration) / CorpusNum

where CorpusAverageSilence is the average mute duration of a given audio test signal under a given combination set, SilenceFrameNum is the number of mute frames in all corpora extracted from the audio test signal under that combination set, FrameDuration is the duration of a speech frame, and CorpusNum is the number of corpora extracted from the audio test signal under that combination set.
In this embodiment, the following relational expression may be used to divide the sum of the average mute durations of an audio test signal under the different combination sets by the number of combination sets, the result being taken as the overall average mute duration:

AudioAverageSilence = (CorpusAverageSilence_1 + CorpusAverageSilence_2 + ... + CorpusAverageSilence_n) / n

where AudioAverageSilence is the overall average mute duration of a given audio test signal, CorpusAverageSilence_i is the average mute duration of the audio test signal under the ith combination set, and n is the number of combination sets.
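A sketch of step A14, assuming that for each audio test signal and each combination set the number of mute frames, the frame duration, and the number of extracted corpora are already known:

```python
def corpus_average_silence(silence_frame_num, frame_duration_ms, corpus_num):
    """Average mute duration of one audio test signal under one combination set (in ms)."""
    return silence_frame_num * frame_duration_ms / corpus_num

def audio_average_silence(per_set_silences):
    """Overall average mute duration of one audio test signal across all combination sets."""
    return sum(per_set_silences) / len(per_set_silences)
```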
A15, for each audio test signal, computing the difference between its average mute duration under each combination set and its overall average mute duration, sorting the differences from small to large, selecting the first L differences, and taking the combination sets corresponding to those first L differences as the to-be-processed combination sets, wherein L is an integer greater than 1.
L can be flexibly set according to needs, and preferably, L can be set to 5.
A16, counting how often each to-be-processed combination set occurs across the to-be-processed combination sets of the plurality of audio test signals, sorting by occurrence frequency from high to low, and taking the first k to-be-processed combination sets as candidate combination sets, wherein k is an integer greater than 1.
k can be set flexibly as needed; preferably, k can be set to 20.
A17, calculating the variance of the corpora extracted according to each candidate combination set, taking the set frame time length in the candidate combination set corresponding to the corpora with the minimum variance as the optimized set frame time length, taking the set activity detection mode in the candidate combination set corresponding to the corpora with the minimum variance as the optimized set activity detection mode, and taking the set size of the rectangular window in the candidate combination set corresponding to the corpora with the minimum variance as the set size of the optimized rectangular window.
In this embodiment, the variance of the corpora extracted according to each candidate combination set may be calculated using the following relational expression:

variance = ((x_1 - avg)² + (x_2 - avg)² + ... + (x_n - avg)²) / n

where variance represents the variance of the corpus, x_i represents the average amplitude of the ith frame in the corpus, avg represents the average amplitude of the audio signal, t represents the duration of the corpus, and n represents the number of speech frames contained in the corpus.
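A sketch of the variance evaluation in step A17; since the exact role of the corpus duration t in the relational expression is not spelled out here, this sketch uses the plain mean squared deviation of the per-frame average amplitudes from the signal's average amplitude, which is an assumption:

```python
import numpy as np

def corpus_variance(frame_avgs, signal_avg):
    """Mean squared deviation of the per-frame average amplitudes from the signal average."""
    frame_avgs = np.asarray(frame_avgs, dtype=np.float64)
    return float(np.mean((frame_avgs - signal_avg) ** 2))

def pick_optimized_parameters(candidate_sets, variances):
    """Return the candidate combination set whose extracted corpora have the smallest variance."""
    return candidate_sets[int(np.argmin(variances))]
```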
In this embodiment, the frame duration, the set activity detection mode, and the set size of the rectangular window are optimized in real time by evaluating the mean squared deviation of the segmented corpora, so that problems such as incomplete segmentation and excessively long silent stretches can be well controlled, which greatly improves the quality of the extracted corpora.
As another alternative embodiment of the present application, referring to fig. 7, a schematic flow diagram of an embodiment 4 of a corpus extraction method provided in the present application is provided, where this embodiment is mainly an extension of the corpus extraction method described in the above embodiment 1, as shown in fig. 7, the method may include, but is not limited to, the following steps:
step S41, dividing the audio signal into a plurality of speech frames according to the set frame duration.
Step S42, performing activity detection on each voice frame according to a set activity detection mode, and determining an attribute of each voice frame, where the attribute is valid or silent.
And step S43, arranging the voice frames marked with the attributes according to the time sequence to obtain a voice frame list.
Step S44, using a rectangular window with a set size to slide in the speech frame sequence with a set step length, and detecting the initial position and the end position of the speech segment in the speech frame sequence.
Step S45, extracting utterance segments from the audio signal according to the start position and the end position, and using the utterance segments as corpora.
The detailed procedures of steps S41-S45 can be referred to the related descriptions of steps S11-S15 in embodiment 1, and are not described herein again.
And step S46, calculating corpus average amplitude and frame average amplitude of each corpus respectively.
The corpus average amplitude can be understood as the average of the amplitudes of all sampling points contained in the corpus.
The frame average amplitude can be understood as the average of the amplitudes of the sampling points contained in a single frame of the corpus.
Step S47, respectively taking the frame with the frame average amplitude greater than the corpus average amplitude in each corpus as an effective frame, and calculating the ratio of the effective frame.
The process of calculating the proportion of valid frames may include:
and dividing the number of the effective frames by the number of all frames in the corpus to obtain a result which is the proportion of the effective frames.
Step S48, using the corpus in which the ratio of the effective frame is greater than the set effective frame ratio threshold and the corpus average amplitude is greater than the set corpus average amplitude threshold as the effective corpus.
In this embodiment, after the corpora are extracted in step S45, invalid corpora inevitably occur, such as corpora with an excessively long silence duration, very quiet speech, or pure background noise. Compared with a normal corpus, the audio signal of a corpus with long silence or quiet speech is relatively weak. Thus, the average amplitude of an invalid corpus is usually lower, and the proportion of low-amplitude frames in the whole corpus is higher.
Based on these characteristics of invalid corpora, the corpora are filtered according to the proportion of effective frames and the corpus average amplitude.
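A sketch of the filtering of steps S46-S48, assuming a corpus is given as a list of frames of samples; both thresholds are illustrative parameters, not values fixed by the description:

```python
import numpy as np

def is_valid_corpus(frames, effective_ratio_threshold=0.5, corpus_avg_threshold=500.0):
    """Keep a corpus only if enough frames are louder than the corpus average amplitude
    and the corpus average amplitude itself exceeds its threshold (steps S46-S48)."""
    frames = [np.abs(np.asarray(f, dtype=np.float64)) for f in frames]
    corpus_avg = np.concatenate(frames).mean()                 # corpus average amplitude
    frame_avgs = np.array([f.mean() for f in frames])          # frame average amplitudes
    effective_ratio = float(np.mean(frame_avgs > corpus_avg))  # proportion of effective frames
    return effective_ratio > effective_ratio_threshold and corpus_avg > corpus_avg_threshold
```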
The corpus extraction device provided in the present application is described next, and the corpus extraction device described below and the corpus extraction method described above may be referred to in correspondence.
Referring to fig. 8, the corpus extraction device includes: a partitioning module 11, an activity detection module 12, an arrangement module 13, an endpoint detection module 14, and an extraction module 15.
The dividing module 11 is configured to divide the audio signal into a plurality of voice frames according to a set frame duration;
an activity detection module 12, configured to perform activity detection on each voice frame according to a set activity detection mode, and determine an attribute of each voice frame, where the attribute is valid or silent;
the arrangement module 13 is configured to arrange the voice frames labeled with the attributes according to a time sequence to obtain a voice frame list;
an endpoint detection module 14, configured to utilize a rectangular window with a set size to slide in the speech frame array by a set step length, and detect a starting position and an ending position of a speech segment in the speech frame array;
and the extracting module 15 is configured to extract a speech segment from the audio signal according to the starting position and the ending position, and use the speech segment as a corpus.
In this embodiment, the endpoint detection module 14 may include:
a first calculating module, configured to utilize a rectangular window with a set size, slide in the speech frame array in a set step length, calculate a ratio of the number of speech frames with an effective attribute in the rectangular window to the size of the rectangular window, as a first ratio, and calculate a ratio of the number of speech frames with a mute attribute in the rectangular window to the size of the rectangular window, as a second ratio;
the comparison module is used for determining the initial position of the speech section in the speech frame array by comparing the first proportion with a first proportion threshold value, and determining the end position of the speech section in the speech frame array by comparing the second proportion with a second proportion threshold value;
in this embodiment, the corpus extraction device may further include:
the optimization module is used for optimizing the set frame time length, the set activity detection mode and the set size of the rectangular window to obtain the optimized set frame time length, the optimized set activity detection mode and the optimized set size of the rectangular window;
and the replacing module is used for replacing the set frame time length with the optimized set frame time length, replacing the set activity detection mode with the optimized activity detection mode and replacing the set size of the rectangular window with the set size of the optimized rectangular window.
In this embodiment, the optimization module is specifically configured to:
setting different values for the set frame duration, the set activity detection mode and the set size of the rectangular window respectively, and randomly combining the set frame duration, the set activity detection mode and the set size of the rectangular window with different values to obtain different combination sets;
extracting speaking speech segments from the audio test signals according to the combination sets respectively, and taking the speaking speech segments as segmentation linguistic data;
calculating the average amplitude of each audio test signal and the frame average amplitude of each segmented corpus;
calculating the average mute duration of each audio test signal in different combination sets, and dividing the sum of the average mute durations of each audio test signal in different combination sets by the number of the combination sets respectively to obtain a result as the total average mute duration;
respectively comparing the difference value between the average mute time length of each audio test signal under different combination sets and the total average mute time length of the audio test signal, selecting the difference values arranged at the first L according to the sequence of the difference values from small to large, and taking the combination set corresponding to the difference values arranged at the first L as a combination set to be processed, wherein L is an integer greater than 1;
according to the sequence of the occurrence frequencies from top to bottom, counting the first k to-be-processed combination sets with the occurrence frequencies arranged in the to-be-processed combination sets of the plurality of audio test signals, and taking the first k to-be-processed combination sets as candidate combination sets, wherein k is an integer greater than 1;
and respectively calculating the variances of the corpora extracted according to the candidate combination sets, taking the set frame time length in the candidate combination set corresponding to the corpora with the minimum variance as the optimized set frame time length, taking the set activity detection mode in the candidate combination set corresponding to the corpora with the minimum variance as the optimized set activity detection mode, and taking the set size of the rectangular window in the candidate combination set corresponding to the corpora with the minimum variance as the set size of the optimized rectangular window.
In this embodiment, the corpus extraction device may further include:
the second calculation module is used for calculating the corpus average amplitude and the frame average amplitude of each corpus respectively;
the first determining module is used for respectively taking the frames with the frame average amplitude values larger than the corpus average amplitude values in each corpus as effective frames and calculating the proportion of the effective frames;
and the second determining module is used for taking the corpus of which the ratio of the effective frame is greater than a set effective frame ratio threshold and the average corpus amplitude is greater than a set corpus average amplitude threshold as an effective corpus.
In this embodiment, the optimization module may be specifically configured to:
respectively calculating the amplitude sum of sampling points in each audio test signal, and dividing the amplitude sum of the sampling points by the number of the sampling points to obtain a result as the average amplitude of the audio test signal;
and respectively calculating the sum of the amplitudes of the sampling points in each corpus, dividing the sum of the amplitudes of the sampling points in the corpus by the number of the sampling points in the corpus, and taking the obtained result as the frame average amplitude of the corpus.
In this embodiment, the optimization module may be specifically configured to:
calculate, using the relational expression

variance = ((x_1 - avg)² + (x_2 - avg)² + ... + (x_n - avg)²) / n

the variance of the corpora extracted according to each candidate combination set;
wherein variance represents the variance of the corpus, x_i represents the average amplitude of the ith frame in the corpus, avg represents the average amplitude of the audio signal, t represents the duration of the corpus, and n represents the number of speech frames contained in the corpus.
It should be noted that each embodiment is mainly described as a difference from the other embodiments, and the same and similar parts between the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functionality of the units may be implemented in one or more software and/or hardware when implementing the present application.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
The corpus extraction method and device provided by the present application are introduced in detail, and a specific example is applied in the text to explain the principle and the implementation of the present application, and the description of the above embodiment is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (8)

1. A corpus extraction method is characterized by comprising the following steps:
dividing the audio signal into a plurality of voice frames according to the set frame duration;
respectively carrying out activity detection on each voice frame according to a set activity detection mode, and determining the attribute of each voice frame, wherein the attribute is effective or silent;
arranging the voice frames marked with the attributes in time order to obtain a voice frame sequence;
sliding a rectangular window of a set size in the voice frame sequence by a set step length, and detecting an initial position and an end position of a speaking speech segment in the voice frame sequence; wherein the detecting the initial position and the end position of the speaking speech segment in the voice frame sequence by sliding the rectangular window of the set size in the voice frame sequence by the set step length comprises:
sliding the rectangular window of the set size in the voice frame sequence by the set step length, calculating the ratio of the number of voice frames whose attribute is effective in the rectangular window to the size of the rectangular window as a first ratio, and calculating the ratio of the number of voice frames whose attribute is silent in the rectangular window to the size of the rectangular window as a second ratio;
determining the initial position of the speaking speech segment in the voice frame sequence by comparing the first ratio with a first ratio threshold, and determining the end position of the speaking speech segment in the voice frame sequence by comparing the second ratio with a second ratio threshold;
and extracting the speaking speech segment from the audio signal according to the initial position and the end position, and taking the speaking speech segment as a corpus.
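To make the flow of claim 1 concrete, the following is a minimal Python sketch, assuming a mono NumPy signal, a simple energy threshold standing in for the unspecified set activity detection mode, and illustrative defaults for the window size, step length, and the two ratio thresholds; none of these names or values are taken from the patent itself.

import numpy as np

def extract_corpora(signal, sr, frame_ms=30, win_size=10, step=1,
                    start_ratio=0.6, end_ratio=0.9, energy_thresh=1e-3):
    # Divide the signal into voice frames of the set frame duration.
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)

    # Per-frame attribute: True = "effective" (speech), False = "silent".
    attrs = (frames.astype(float) ** 2).mean(axis=1) > energy_thresh

    # Slide a rectangular window of win_size frames over the labelled sequence.
    corpora, start = [], None
    for i in range(0, n_frames - win_size + 1, step):
        window = attrs[i:i + win_size]
        eff_ratio = window.sum() / win_size   # first ratio
        sil_ratio = 1.0 - eff_ratio           # second ratio
        if start is None and eff_ratio >= start_ratio:
            start = i                         # initial position of the speech segment
        elif start is not None and sil_ratio >= end_ratio:
            corpora.append(signal[start * frame_len:(i + win_size) * frame_len])
            start = None                      # end position reached
    return corpora

For example, extract_corpora(audio, 16000) would return one slice of the signal per detected speaking speech segment, each of which can then be used as a corpus.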
2. The method of claim 1, further comprising:
optimizing the set frame duration, the set activity detection mode, and the set size of the rectangular window to obtain an optimized set frame duration, an optimized set activity detection mode, and an optimized set size of the rectangular window;
and replacing the set frame duration with the optimized set frame duration, replacing the set activity detection mode with the optimized set activity detection mode, and replacing the set size of the rectangular window with the optimized set size of the rectangular window.
3. The method of claim 2, wherein optimizing the set frame duration, the set activity detection mode, and the set size of the rectangular window to obtain the optimized set frame duration, the optimized set activity detection mode, and the optimized set size of the rectangular window comprises:
setting different values for each of the set frame duration, the set activity detection mode, and the set size of the rectangular window, and randomly combining the differently valued set frame durations, set activity detection modes, and set sizes of the rectangular window to obtain different combination sets;
extracting speaking speech segments from a plurality of audio test signals according to each combination set respectively, and taking the speaking speech segments as segmented corpora;
calculating an average amplitude of each audio test signal and a frame average amplitude of each segmented corpus;
calculating an average mute duration of each audio test signal under the different combination sets, and dividing the sum of the average mute durations of each audio test signal under the different combination sets by the number of combination sets, the obtained result being the total average mute duration of that audio test signal;
calculating, for each audio test signal, the differences between the average mute durations under the different combination sets and the total average mute duration of that audio test signal, selecting the first L differences in ascending order of the differences, and taking the combination sets corresponding to the first L differences as to-be-processed combination sets, wherein L is an integer greater than 1;
counting the occurrence frequency of each to-be-processed combination set across the to-be-processed combination sets of the plurality of audio test signals, and taking the first k to-be-processed combination sets in descending order of occurrence frequency as candidate combination sets, wherein k is an integer greater than 1;
and respectively calculating the variances of the corpora extracted according to the candidate combination sets, taking the set frame duration in the candidate combination set corresponding to the corpus with the smallest variance as the optimized set frame duration, taking the set activity detection mode in that candidate combination set as the optimized set activity detection mode, and taking the set size of the rectangular window in that candidate combination set as the optimized set size of the rectangular window.
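As a rough Python sketch of the parameter search in claim 3: the parameter grids are illustrative, and mute_duration and corpus_variance are assumed helper callables, the first measuring the average silence remaining after extraction under a given combination set, the second computing the variance of claim 6; neither helper is defined by the patent.

import itertools
from collections import Counter

# Illustrative parameter grids; the claim only requires that different values be set.
FRAME_MS  = [10, 20, 30]           # set frame duration (ms)
VAD_MODES = ["energy", "webrtc"]   # set activity detection mode
WIN_SIZES = [5, 10, 20]            # set size of the rectangular window
COMBOS = list(itertools.product(FRAME_MS, VAD_MODES, WIN_SIZES))

def optimize_parameters(test_signals, mute_duration, corpus_variance, L=3, k=2):
    shortlists = []
    for sig in test_signals:
        # Average mute duration of this signal under every combination set.
        mutes = {c: mute_duration(sig, c) for c in COMBOS}
        total_avg = sum(mutes.values()) / len(COMBOS)   # total average mute duration
        # Keep the L combinations whose mute duration is closest to the total average.
        shortlists.append(sorted(COMBOS, key=lambda c: abs(mutes[c] - total_avg))[:L])

    # Keep the k combinations that occur most often across the per-signal shortlists.
    counts = Counter(c for sl in shortlists for c in sl)
    candidates = [c for c, _ in counts.most_common(k)]

    # Choose the candidate whose extracted corpora have the smallest variance.
    best = min(candidates,
               key=lambda c: sum(corpus_variance(sig, c) for sig in test_signals))
    return best   # (optimized frame duration, activity detection mode, window size)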
4. The method according to any one of claims 1-3, further comprising:
respectively calculating the corpus average amplitude and the frame average amplitude of each corpus;
respectively taking, in each corpus, the frames whose frame average amplitude is greater than the corpus average amplitude as effective frames, and calculating the proportion of effective frames;
and taking a corpus in which the proportion of effective frames is greater than a set effective frame proportion threshold and the corpus average amplitude is greater than a set corpus average amplitude threshold as an effective corpus.
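A minimal sketch of the screening in claim 4, assuming each corpus is a 1-D NumPy array and frame_len is the number of samples per voice frame; the two thresholds are illustrative assumptions, not values from the patent.

import numpy as np

def filter_effective_corpora(corpora, frame_len, ratio_thresh=0.5, amp_thresh=0.01):
    effective = []
    for corpus in corpora:
        n = len(corpus) // frame_len
        if n == 0:
            continue
        frames = np.abs(corpus[:n * frame_len]).reshape(n, frame_len)
        frame_avg = frames.mean(axis=1)               # frame average amplitudes
        corpus_avg = np.abs(corpus).mean()            # corpus average amplitude
        eff_ratio = (frame_avg > corpus_avg).mean()   # proportion of effective frames
        if eff_ratio > ratio_thresh and corpus_avg > amp_thresh:
            effective.append(corpus)                  # keep as an effective corpus
    return effective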
5. The method according to claim 3, wherein the calculating the average amplitude of each of the audio test signals and the frame average amplitude of each of the segmented corpora comprises:
respectively calculating the sum of the amplitudes of the sampling points in each audio test signal, and dividing that sum by the number of sampling points, the obtained result being the average amplitude of the audio test signal;
and respectively calculating the sum of the amplitudes of the sampling points in each corpus, and dividing that sum by the number of sampling points in the corpus, the obtained result being the frame average amplitude of the corpus.
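The averaging in claim 5 is plain arithmetic; as a sketch (assuming a NumPy array of sample amplitudes), the same helper applies to a whole audio test signal or to any corpus taken from it:

import numpy as np

def average_amplitude(samples):
    # Sum of the sample amplitudes divided by the number of sampling points.
    return np.abs(samples).sum() / len(samples)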
6. The method according to claim 3, wherein the respectively calculating the variances of the corpora extracted according to each candidate combination set comprises:
calculating the variance of the corpora extracted according to each candidate combination set respectively by using the relational expression of the claim (formula image FDA0003443264910000031, not reproduced in this text);
wherein variance represents the variance of the corpus, x_i represents the average amplitude of the i-th frame in the corpus, avg represents the average amplitude of the audio signal, t represents the duration of the corpus, and n represents the number of voice frames comprised in the corpus.
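The formula image referenced above is not reproduced here. Given the symbols defined in the claim, one plausible reading, offered only as an assumption, is the squared deviation of each frame's average amplitude from the signal-level average, summed over the n frames and normalized by the corpus duration t:

def corpus_variance(frame_avgs, signal_avg, duration_s):
    # frame_avgs: the x_i values; signal_avg: avg; duration_s: t.
    # Assumed reading of the claimed expression; the published formula may differ.
    return sum((x - signal_avg) ** 2 for x in frame_avgs) / duration_s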
7. A corpus extraction device, comprising:
the dividing module is used for dividing an audio signal into a plurality of voice frames according to a set frame duration;
the activity detection module is used for respectively carrying out activity detection on each voice frame according to a set activity detection mode and determining the attribute of each voice frame, wherein the attribute is effective or silent;
the arrangement module is used for arranging the voice frames marked with the attributes in time order to obtain a voice frame sequence;
an endpoint detection module, which is used for sliding a rectangular window of a set size in the voice frame sequence by a set step length and detecting an initial position and an end position of a speaking speech segment in the voice frame sequence; the endpoint detection module comprises:
a first calculating module, which is used for sliding the rectangular window of the set size in the voice frame sequence by the set step length, calculating the ratio of the number of voice frames whose attribute is effective in the rectangular window to the size of the rectangular window as a first ratio, and calculating the ratio of the number of voice frames whose attribute is silent in the rectangular window to the size of the rectangular window as a second ratio;
a comparison module, which is used for determining the initial position of the speaking speech segment in the voice frame sequence by comparing the first ratio with a first ratio threshold, and determining the end position of the speaking speech segment in the voice frame sequence by comparing the second ratio with a second ratio threshold;
and an extraction module, which is used for extracting the speaking speech segment from the audio signal according to the initial position and the end position, and taking the speaking speech segment as a corpus.
8. The apparatus of claim 7, further comprising:
the optimization module is used for optimizing the set frame duration, the set activity detection mode, and the set size of the rectangular window to obtain an optimized set frame duration, an optimized set activity detection mode, and an optimized set size of the rectangular window;
and the replacing module is used for replacing the set frame duration with the optimized set frame duration, replacing the set activity detection mode with the optimized set activity detection mode, and replacing the set size of the rectangular window with the optimized set size of the rectangular window.
CN201910891615.0A 2019-09-20 2019-09-20 Corpus extraction method and apparatus Active CN110600010B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910891615.0A CN110600010B (en) 2019-09-20 2019-09-20 Corpus extraction method and apparatus


Publications (2)

Publication Number Publication Date
CN110600010A CN110600010A (en) 2019-12-20
CN110600010B true CN110600010B (en) 2022-05-17

Family

ID=68861751

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910891615.0A Active CN110600010B (en) 2019-09-20 2019-09-20 Corpus extraction method and apparatus

Country Status (1)

Country Link
CN (1) CN110600010B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111629267B (en) * 2020-04-30 2023-06-09 腾讯科技(深圳)有限公司 Audio labeling method, device, equipment and computer readable storage medium
CN112489623A (en) * 2020-11-17 2021-03-12 携程计算机技术(上海)有限公司 Language identification model training method, language identification method and related equipment

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1763844A (en) * 2004-10-18 2006-04-26 中国科学院声学研究所 End-point detecting method, device and speech recognition system based on moving window
CN103646649A (en) * 2013-12-30 2014-03-19 中国科学院自动化研究所 High-efficiency voice detecting method
JP2015022112A (en) * 2013-07-18 2015-02-02 独立行政法人産業技術総合研究所 Voice activity detection device and method
CN108962227A (en) * 2018-06-08 2018-12-07 百度在线网络技术(北京)有限公司 Voice beginning and end detection method, device, computer equipment and storage medium
CN108986830A (en) * 2018-08-28 2018-12-11 安徽淘云科技有限公司 A kind of audio corpus screening technique and device
CN109545188A (en) * 2018-12-07 2019-03-29 深圳市友杰智新科技有限公司 A kind of real-time voice end-point detecting method and device
CN110335593A (en) * 2019-06-17 2019-10-15 平安科技(深圳)有限公司 Sound end detecting method, device, equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on speech endpoint detection algorithms; Fei Yuquan (费宇泉) et al.; Instrumentation and Detection Technology (《仪器仪表与检测技术》); 2017-08-25; Vol. 36, No. 8; pp. 98-101 *

Also Published As

Publication number Publication date
CN110600010A (en) 2019-12-20

Similar Documents

Publication Publication Date Title
CN109065031B (en) Voice labeling method, device and equipment
Tatman Gender and dialect bias in YouTube’s automatic captions
US10152988B2 (en) Selecting speech features for building models for detecting medical conditions
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
KR101942521B1 (en) Speech endpointing
US9368126B2 (en) Assessing speech prosody
EP2940684A1 (en) Voice recognizing method and system for personalized user information
US20140236600A1 (en) Method and device for keyword detection
CN109036471B (en) Voice endpoint detection method and device
CN103677729A (en) Voice input method and system
KR102296878B1 (en) Foreign language learning evaluation device
CN110600010B (en) Corpus extraction method and apparatus
CN112750445A (en) Voice conversion method, device and system and storage medium
JP6276513B2 (en) Speech recognition apparatus and speech recognition program
CN117711376A (en) Language identification method, system, equipment and storage medium
Bisikalo et al. Precision Automated Phonetic Analysis of Speech Signals for Information Technology of Text-dependent Authentication of a Person by Voice.
KR20130126570A (en) Apparatus for discriminative training acoustic model considering error of phonemes in keyword and computer recordable medium storing the method thereof
CN113611284B (en) Speech library construction method, speech library recognition method, speech library construction system and speech library recognition system
Tripathi et al. VEP detection for read, extempore and conversation speech
CN114254628A (en) Method and device for quickly extracting hot words by combining user text in voice transcription, electronic equipment and storage medium
Płonkowski Using bands of frequencies for vowel recognition for Polish language
Pal et al. Modified energy based method for word endpoints detection of continuous speech signal in real world environment
Ferro et al. Using deep neural networks for smoothing pitch profiles in connected speech
CN113593523A (en) Speech detection method and device based on artificial intelligence and electronic equipment
CN115862617A (en) Computer starting method, system, equipment and medium based on speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 401121 b7-7-2, Yuxing Plaza, No.5 Huangyang Road, Yubei District, Chongqing

Applicant after: Chongqing duxiaoman Youyang Technology Co.,Ltd.

Address before: 201800 room j1328, 3 / F, building 8, 55 Huiyuan Road, Jiading District, Shanghai

Applicant before: SHANGHAI YOUYANG NEW MEDIA INFORMATION TECHNOLOGY Co.,Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20211221

Address after: 100193 Room 606, 6 / F, building 4, West District, courtyard 10, northwest Wangdong Road, Haidian District, Beijing

Applicant after: Du Xiaoman Technology (Beijing) Co.,Ltd.

Address before: 401121 b7-7-2, Yuxing Plaza, No.5 Huangyang Road, Yubei District, Chongqing

Applicant before: Chongqing duxiaoman Youyang Technology Co.,Ltd.

GR01 Patent grant