WO2022134781A1 - 拖音的检测方法、装置、设备及存储介质 - Google Patents
拖音的检测方法、装置、设备及存储介质 Download PDFInfo
- Publication number
- WO2022134781A1 WO2022134781A1 PCT/CN2021/124632 CN2021124632W WO2022134781A1 WO 2022134781 A1 WO2022134781 A1 WO 2022134781A1 CN 2021124632 W CN2021124632 W CN 2021124632W WO 2022134781 A1 WO2022134781 A1 WO 2022134781A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voice
- speech
- segment
- generate
- voiced
- Prior art date
Links
- 238000001514 detection method Methods 0.000 title claims abstract description 161
- 230000002035 prolonged effect Effects 0.000 title abstract 7
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 106
- 238000000034 method Methods 0.000 claims abstract description 44
- 230000000694 effects Effects 0.000 claims abstract description 38
- 238000012545 processing Methods 0.000 claims abstract description 37
- 230000001629 suppression Effects 0.000 claims abstract description 34
- 238000005070 sampling Methods 0.000 claims abstract description 28
- 230000001755 vocal effect Effects 0.000 claims description 42
- 238000004364 calculation method Methods 0.000 claims description 29
- 230000035945 sensitivity Effects 0.000 claims description 12
- 239000000284 extract Substances 0.000 claims description 11
- 206010019133 Hangover Diseases 0.000 claims description 7
- 238000001228 spectrum Methods 0.000 claims description 6
- 239000012634 fragment Substances 0.000 claims description 5
- 238000004590 computer program Methods 0.000 claims 6
- 238000005516 engineering process Methods 0.000 abstract description 2
- 238000013473 artificial intelligence Methods 0.000 abstract 1
- 238000010586 diagram Methods 0.000 description 6
- 238000012549 training Methods 0.000 description 6
- 230000006870 function Effects 0.000 description 4
- 208000003028 Stuttering Diseases 0.000 description 3
- 238000012937 correction Methods 0.000 description 3
- 238000004458 analytical method Methods 0.000 description 2
- 238000010606 normalization Methods 0.000 description 2
- 238000003672 processing method Methods 0.000 description 2
- 241001672694 Citrus reticulata Species 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 238000006243 chemical reaction Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000002085 persistent effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/21—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being power information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
Definitions
- the present application relates to the technical field of speech processing, and in particular, to a method, device, device and storage medium for detecting a dragging sound.
- the present application provides a dragging sound detection method, device, equipment and storage medium, which are used to save the dragging sound detection time, thereby improving the dragging sound detection efficiency.
- a first aspect of the present application provides a method for detecting hangover, wherein the method for detecting hangover includes: acquiring multiple pieces of voice data in real time, and performing real-time sampling processing on the multiple pieces of voice data to generate discrete voice signals;
- the discrete speech signal is processed by using the activity detection algorithm and the silence suppression algorithm to generate at least one voiced voice segment, and one voiced voice segment includes multiple voiced voice sub-segments;
- the voice segment is subjected to vocal detection to determine at least one target vocal segment; syllable detection is performed on the at least one target vocal segment to generate multiple syllables to be detected; according to a preset pronunciation duration threshold, the multiple syllables to be detected are detected Perform dragging detection, and determine a target dragging syllable among the multiple to-be-detected syllables, where the target dragging syllable is one or more.
- a second aspect of the present application provides a drag sound detection device, including a memory, a processor, and computer-readable instructions stored on the memory and executable on the processor, and the processor executes the computer
- the following steps are implemented: acquiring multiple segments of voice data in real time, and performing real-time sampling processing on the multiple segments of voice data to generate discrete voice signals; sequentially using an activity detection algorithm and a silence suppression algorithm to process the discrete voice signals to generate At least one voiced voice segment, one voiced voice fragment includes multiple voiced voice sub-segments; perform human voice detection on the at least one voiced voice fragment in combination with a preset zero-crossing rate algorithm, and determine at least one target voice segment; Perform syllable detection on at least one target vocal segment to generate multiple syllables to be detected; perform drag detection on the multiple syllables to be detected according to a preset pronunciation duration threshold, and determine the target drag from the multiple syllables to be detected syllable, the target drag s
- a third aspect of the present application provides a computer-readable storage medium, where computer instructions are stored in the computer-readable storage medium, and when the computer instructions are executed on a computer, the computer performs the following steps: acquiring multiple pieces of voice data in real time , and perform real-time sampling processing on the multi-segment voice data to generate discrete voice signals; sequentially use an activity detection algorithm and a silence suppression algorithm to process the discrete voice signals to generate at least one voiced voice segment, and a voiced voice segment includes multiple voiced speech sub-segments; perform vocal detection on the at least one voiced voice segment in combination with a preset zero-crossing rate algorithm to determine at least one target vocal segment; perform syllable detection on the at least one target vocal segment to generate a plurality of To-be-detected syllables; according to the preset pronunciation duration threshold value, carry out drag sound detection on the multiple to-be-detected syllables, determine target drag-sound syllables in the multiple to-be-
- a fourth aspect of the present application provides a drag sound detection device, wherein the drag sound detection device includes: an acquisition module configured to acquire multiple pieces of voice data in real time, and perform real-time sampling processing on the multiple pieces of voice data to generate a discrete speech signal; a voiced segment generation module, configured to sequentially use an activity detection algorithm and a silence suppression algorithm to process the discrete voice signal to generate at least one voiced voice segment, and one voiced voice segment includes a plurality of voiced speech sub-segments; human voice a detection module, configured to perform human voice detection on the at least one voiced speech segment in combination with a preset zero-crossing rate algorithm to determine at least one target human voice segment; a syllable detection module, configured to perform human voice detection on the at least one target human voice segment Syllable detection, generating a plurality of syllables to be detected; a dragging sound detection module for performing dragging detection on the plurality of syllables to be detected according to a preset pronunciation duration threshold, and
- multiple pieces of voice data are acquired in real time, and the multiple pieces of voice data are subjected to real-time sampling processing to generate discrete voice signals; the discrete voice signals are processed by using an activity detection algorithm and a silence suppression algorithm to generate at least one A voiced speech segment, a voiced speech segment includes a plurality of voiced speech sub-segments; human voice detection is performed on the at least one voiced speech segment in combination with a preset zero-crossing rate algorithm to determine at least one target human voice segment; A target vocal segment carries out syllable detection, and generates a plurality of syllables to be detected; according to the preset pronunciation duration threshold, the multiple syllables to be detected are subjected to drag detection, and the target drag syllable is determined in the multiple syllables to be detected.
- the target drag syllable is one or more.
- a series of processing is performed on the discrete speech signal by using the activity detection algorithm, the silence suppression algorithm and the zero-crossing rate algorithm to generate the syllables to be detected, and then the target dragged syllables are determined based on the syllables to be detected, so that a large number of syllables are not required.
- the labeled dragging data is used for model training, which saves the dragging detection time and improves the dragging detection efficiency.
- FIG. 1 is a schematic diagram of an embodiment of a method for detecting drag in the embodiment of the present application
- FIG. 2 is a schematic diagram of another embodiment of the method for detecting drag in the embodiment of the present application.
- FIG. 3 is a schematic diagram of an embodiment of a drag sound detection device in an embodiment of the present application.
- FIG. 4 is a schematic diagram of another embodiment of the apparatus for detecting drag in the embodiment of the present application.
- FIG. 5 is a schematic diagram of an embodiment of a drag sound detection device in an embodiment of the present application.
- the embodiments of the present application provide a dragging sound detection method, device, device and storage medium, which are used to save dragging noise detection time, thereby improving dragging noise detection efficiency.
- an embodiment of the method for detecting drag in the embodiment of the present application includes:
- the server acquires multiple pieces of voice data in real time, and samples and processes the multiple voice data into discrete voice signals in real time. It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned multiple pieces of voice data, the above-mentioned multiple pieces of voice data can also be stored in a node of a blockchain.
- online processing is used to process multiple pieces of voice data, so as to obtain a series of binary data, that is, discrete voice signals.
- the voice data is "What are you doing (pause for 3 seconds in the middle) (noise 1) Why don't you answer (noise 2) me (no one answered for 3 seconds in the middle)", after online processing, the generated discrete speech signal is [1 0 0 ... 1].
- the online processing method can process the voice data in real time, and there is no need to wait for all the voices to be collected before processing.
- the voice data is obtained by sampling, and each piece of voice data records the state of the original analog sound wave at the time of acquisition.
- the execution body of the present application may be a drag sound detection device, and may also be a terminal or a server, which is not specifically limited here.
- the embodiments of the present application take the server as an execution subject as an example for description.
- the server uses an activity detection algorithm and a silence suppression algorithm to process the discrete speech signal into at least one voiced voice segment, and one voiced voice segment includes a plurality of voiced speech sub-segments.
- the activity detection algorithm is the WebRTC VAD algorithm
- the mute suppression algorithm is an algorithm that suppresses the mute signal.
- the server uses the activity detection algorithm to detect the activity of [100...1], generates the activity detection result, and uses the silence suppression algorithm to suppress the silence of the activity detection result, Thereby, two voiced speech segments are generated, namely "what are you doing (noise 1)" and "why don't you answer (noise 2) me”.
- the server performs human voice detection on at least one voiced speech segment in combination with a preset zero-crossing rate algorithm, thereby determining at least one target human voice segment.
- the zero-crossing rate algorithm can be understood as combining multiple voiced speech sub-segments to generate a threshold value, and then determining the voiced voice sub-segments greater than the threshold value as the target vocal segment. Assuming the voiced speech segments are "what are you doing (noise 1)" and "why aren't you answering (noise 2) me", the server determines the threshold based on the voiced speech sub-segments in each voiced segment and the zero-crossing rate algorithm, The volume value of noise and noise is lower than the threshold value, so the server determines the voiced speech sub-segments greater than the threshold value as two voiced voice sub-segments of "what are you doing” and "why don't you answer me” .
- the server performs syllable detection on at least one target vocal segment, thereby generating a plurality of syllables to be detected.
- the phonology of modern standard Chinese often uses syllable as the unit of analysis, and often a Chinese character corresponds to a syllable. Therefore, the server performs syllable detection on the target vocal segment, thereby obtaining a plurality of syllables to be detected.
- the target vocal segment is "what are you doing", in which the server performs syllable detection on the target vocal segment of "what are you doing", so as to obtain "you", “are”, “do”, "what” and “what” ” for multiple syllables to be detected.
- 105 Perform dragging detection on a plurality of syllables to be detected according to a preset pronunciation duration threshold, and determine a target dragging syllable among the multiple to-be-detected syllables, where there are one or more target dragging syllables.
- the server performs dragging detection on a plurality of syllables to be detected respectively according to a preset pronunciation duration threshold, and then determines one or more target dragging syllables among the multiple syllables to be detected.
- the server will classify "you”, “zai”, “qian”, “shi”, and “mo" according to the preset pronunciation duration threshold , "What” and “What” are respectively detected by dragging sound, so as to determine the target dragging syllable as "you".
- a series of processing is performed on the discrete speech signal by using the activity detection algorithm, the silence suppression algorithm and the zero-crossing rate algorithm to generate the syllables to be detected, and then the target dragged syllables are determined based on the syllables to be detected, so that a large number of syllables are not required.
- the labeled dragging data is used for model training, which saves the dragging detection time and improves the dragging detection efficiency.
- another embodiment of the method for detecting drag in the embodiment of the present application includes:
- the server acquires multiple pieces of voice data in real time, and samples and processes the multiple voice data into discrete voice signals in real time. It should be emphasized that, in order to further ensure the privacy and security of the above-mentioned multiple pieces of voice data, the above-mentioned multiple pieces of voice data can also be stored in a node of a blockchain.
- online processing is used to process multiple pieces of voice data, so as to obtain a series of binary data, that is, discrete voice signals.
- the voice data is "What are you doing (pause for 3 seconds in the middle) (noise 1) Why don't you answer (noise 2) me (no one answered for 3 seconds in the middle)", after online processing, the generated discrete speech signal is [1 0 0 ... 1].
- the online processing method can process the voice data in real time, and there is no need to wait for all the voices to be collected before processing.
- the voice data is obtained by sampling, and each piece of voice data records the state of the original analog sound wave at the time of acquisition.
- the server acquires multiple pieces of voice data as analog sound wave data in real time according to a preset sampling rate; the server splices the multiple pieces of voice data in real time, generates real-time spliced voice data, and performs binary processing on the real-time spliced voice data, Generates discrete speech signals as binary data.
- the server acquires multiple pieces of voice data as analog sound wave data in real time according to the preset sampling rate.
- the sampling rate can be 8khz, 16khz, 32khz, 48khz, when data sampling is performed at the sampling rate of 16khz, 32khz, 48khz After that, it is necessary to reduce the audio frequency of the voice data to 8khz and then perform the processing of voice signal conversion; after obtaining multiple pieces of voice data, perform online binary processing on the multiple pieces of voice data respectively to generate binary discrete voice signals.
- the server uses a band-pass filter to divide the audio of the discrete signal into multiple audio subbands according to the audio spectrum.
- the server mainly divides the audio of the discrete signal into 6 audio sub-bands, wherein the 6 audio sub-bands are 80Hz ⁇ 250Hz, 250Hz ⁇ 500Hz, 500Hz ⁇ 1KHz, 1KHz ⁇ 2KHz, 2KHz ⁇ 3KHz and 3KHz respectively ⁇ 4KHz.
- the server sequentially calculates the audio subbands through the activity detection algorithm, generates the calculated result, and then uses the silence suppression algorithm to suppress the silence of the calculated result, so as to generate a voiced speech segment, for example, the discrete speech signal is "What are you doing ( The middle pause is silent for 3 seconds) (noise 1) why don't you answer (noise 2) I (no one answer for 3 seconds in the middle)" discrete speech signal.
- the activity detection algorithm, and the silence processing algorithm two voiced speech segments are generated, which are "what are you doing (noise 1)" and “why don't you answer (noise 2) me”.
- the server performs feature calculation on multiple audio subbands, generates multiple subband features, and then calculates the subband energies of the multiple subband features respectively, and generates multiple subband feature quantities, wherein the subband energy of the subband features is calculated.
- the function is the WebRtcVad_CalcVad8khz function; the server then uses the preset Gaussian model to calculate the probability density of each subband feature quantity, that is, the noise distribution probability and the speech distribution probability; and then calculates the relevant likelihood parameters based on the noise distribution probability and the speech distribution probability, and calculates the distribution probability
- the function of is the WebRtcVad_GaussianProbability function, and the relevant likelihood parameter is the relevant likelihood parameter of the Gaussian model, and then based on the relevant likelihood parameter and the activity detection algorithm, the noise distribution probability and the speech distribution probability are calculated to generate multiple weighted log-likelihood ratios, A weighted log-likelihood ratio corresponds to one audio subband; finally, the audio subband
- the server performs human voice detection on at least one voiced speech segment in combination with a preset zero-crossing rate algorithm, thereby determining at least one target human voice segment.
- the zero-crossing rate algorithm can be understood as combining multiple voiced speech sub-segments to generate a threshold value, and then determining the voiced voice sub-segments greater than the threshold value as the target vocal segment. Assuming the voiced speech segments are "what are you doing (noise 1)" and "why aren't you answering (noise 2) me", the server determines the threshold based on the voiced speech sub-segments in each voiced segment and the zero-crossing rate algorithm, Then, the voiced speech sub-segments larger than the threshold value are determined as two voiced voice sub-segments of "what are you doing" and "why don't you answer me”.
- the server extracts multiple voice volumes from at least one voiced voice segment, and calculates the average value of the multiple voice volumes to generate an average voice volume.
- One voiced voice sub-segment corresponds to one voice volume; the server calculates the multiple voice volumes respectively.
- the normalized volume value group calculates the average value according to the preset number of points, and obtains multiple volume average values; the server uses the preset zero-crossing rate algorithm to check the zero-crossing of multiple volume averages, and generates multiple non-zero volume values value, and perform sum calculation and mean calculation on multiple non-zero volume values to generate non-zero volume total value and non-zero volume mean value; the server generates sensitivity based on multiple voice average volume, non-zero volume total value and non-zero volume mean value Threshold; the server determines the voiced speech sub-segments whose speech volume is greater than or equal to the sensitivity threshold as the target vocal segment.
- the server first extracts multiple voice volumes from the voice fragments, and then calculates the average value of the multiple voice volumes to generate the average voice volume; then normalizes the multiple voice volumes to 0 Between -1, a normalized volume value group is generated.
- the normalized volume value group includes multiple normalized volume values.
- the server adjusts the normalized volume value less than 0.25 (normalization threshold) to 0, and generates The adjusted normalized volume value group; the server uses the zero-crossing rate algorithm to calculate the average value of the adjusted normalized volume value group with 256 discrete voice signal points, that is, the preset number points as a value segment, and generate Multiple average volume values, and then use the preset zero-crossing rate algorithm to subtract each point and the corresponding volume average value to generate volume difference values, and then filter out the volume difference values that are 0 to obtain multiple non-zero volume values.
- the server uses the zero-crossing rate algorithm to calculate the average value of the adjusted normalized volume value group with 256 discrete voice signal points, that is, the preset number points as a value segment, and generate Multiple average volume values, and then use the preset zero-crossing rate algorithm to subtract each point and the corresponding volume average value to generate volume difference values, and then filter out the volume difference values that are 0 to obtain multiple non-zero volume values.
- the server sets the average volume based on multiple voices, the non-zero volume total and the non-zero volume mean Sensitivity threshold, and finally determine the voiced speech sub-segments whose voice volume is greater than or equal to the sensitivity threshold as the target vocal segment.
- the average voice volume is generally 70, the total value of non-zero volume is 50, and the average non-zero volume is 0.35.
- the server performs syllable detection on at least one target vocal segment, thereby generating a plurality of syllables to be detected.
- the phonology of modern standard Chinese often uses syllable as the unit of analysis, and often a Chinese character corresponds to a syllable. Therefore, the server performs syllable detection on the target vocal segment, thereby obtaining a plurality of syllables to be detected.
- the target vocal segment is "what are you doing", in which the server performs syllable detection on the target vocal segment of "what are you doing", so as to obtain "you", “are”, “do”, "what” and “what” ” for multiple syllables to be detected.
- the server extracts multiple discrete data absolute value groups from at least one target vocal segment; the server determines the smallest discrete data absolute value in each discrete data absolute value group, and obtains multiple discrete data absolute values; the server reads The discrete data between the absolute values of two adjacent discrete data is obtained to obtain a plurality of discrete data, and the plurality of discrete data are respectively determined as a plurality of syllables to be detected.
- the server extracts multiple discrete data absolute value groups from each target vocal field.
- the server When collecting voice data at a sampling rate of 8k, the server extracts discrete data absolute values at every 600 discrete voice signal points to generate multiple discrete data absolute values. value group; when the voice data is collected at a sampling rate of 16k, the server extracts the discrete data absolute value at every 1200 discrete voice signal points to generate multiple discrete data absolute value groups; then the server determines in each discrete data absolute value group The smallest absolute value of discrete data is obtained to obtain multiple absolute values of discrete data, and then the discrete data between the absolute values of two adjacent discrete data is read to obtain multiple discrete data. Finally, a plurality of discrete data are respectively determined as a plurality of syllables to be detected. It should be noted that when the absolute value of the discrete data is less than 0.35, the server ignores the absolute value of the discrete data.
- the server performs dragging detection on a plurality of syllables to be detected respectively according to a preset pronunciation duration threshold, and then determines one or more target dragging syllables among the multiple syllables to be detected.
- the server will classify "you”, “zai”, “qian”, “shi”, and “mo" according to the preset pronunciation duration threshold , "What” and “What” are respectively detected by dragging sound, so as to determine the target dragging syllable as "you".
- the server reads the pronunciation duration of each syllable to be detected, obtains the pronunciation duration of multiple syllables, and then compares the pronunciation duration of each syllable with the pronunciation duration threshold.
- the pronunciation duration threshold is 0.4 seconds, for example, "you", ""
- the pronunciation durations of ", “qian”, “shi” and “me” are 0.5, 0.4, 0.3, 0.3 and 0.3 respectively, and the server determines the syllables to be detected whose pronunciation duration is longer than 0.4 seconds, namely "you", as the target For the drag syllable, in other embodiments, the target drag syllable may be one, two or more. It should be noted that 0.4 seconds is the pronunciation duration threshold of Mandarin, and if it is Sichuan dialect, the pronunciation duration threshold is 0.3 seconds.
- a series of processing is performed on the discrete speech signal by using the activity detection algorithm, the silence suppression algorithm and the zero-crossing rate algorithm to generate the syllables to be detected, and then the target dragged syllables are determined based on the syllables to be detected, so that a large number of syllables are not required.
- the labeled dragging data is used for model training, which saves the dragging detection time and improves the dragging detection efficiency.
- an embodiment of the device for detecting drag in the embodiment of the present application includes:
- an acquisition module 301 configured to acquire multiple segments of voice data in real time, and perform real-time sampling processing on the multiple segments of voice data to generate discrete voice signals;
- a voiced segment generation module 302 configured to sequentially use an activity detection algorithm and a silence suppression algorithm to process the discrete speech signal to generate at least one voiced voice segment, and one voiced voice segment includes a plurality of voiced voice sub-segments;
- a human voice detection module 303 configured to perform human voice detection on the at least one voiced speech segment in combination with a preset zero-crossing rate algorithm, and determine at least one target human voice segment;
- a syllable detection module 304 configured to perform syllable detection on the at least one target vocal segment, and generate a plurality of syllables to be detected;
- the dragging sound detection module 305 is used for performing dragging detection on the plurality of syllables to be detected according to a preset pronunciation duration threshold, and determining a target dragging syllable in the plurality of syllables to be detected, and the target dragging syllable is one or more.
- a series of processing is performed on the discrete speech signal by using the activity detection algorithm, the silence suppression algorithm and the zero-crossing rate algorithm to generate the syllables to be detected, and then the target dragged syllables are determined based on the syllables to be detected, so that a large number of syllables are not required.
- the labeled dragging data is used for model training, which saves the dragging detection time and improves the dragging detection efficiency.
- another embodiment of the drag sound detection device in the embodiment of the present application includes:
- an acquisition module 301 configured to acquire multiple segments of voice data in real time, and perform real-time sampling processing on the multiple segments of voice data to generate discrete voice signals;
- a voiced segment generation module 302 configured to sequentially use an activity detection algorithm and a silence suppression algorithm to process the discrete speech signal to generate at least one voiced voice segment, and one voiced voice segment includes a plurality of voiced voice sub-segments;
- a human voice detection module 303 configured to perform human voice detection on the at least one voiced speech segment in combination with a preset zero-crossing rate algorithm, and determine at least one target human voice segment;
- a syllable detection module 304 configured to perform syllable detection on the at least one target vocal segment, and generate a plurality of syllables to be detected;
- the dragging sound detection module 305 is used for performing dragging detection on the plurality of syllables to be detected according to a preset pronunciation duration threshold, and determining a target dragging syllable in the plurality of syllables to be detected, and the target dragging syllable is one or more.
- the obtaining module 301 can also be specifically used for:
- the multi-segment speech data is spliced in real time to generate real-time spliced speech data, and the real-time spliced speech data is subjected to binary processing to generate discrete speech signals, where the discrete speech signals are binary data.
- the vocal segment generation module 302 includes:
- the dividing unit 3021 is used for adopting a band-pass filter to divide the audio of the discrete speech signal into a plurality of audio subbands according to a preset audio frequency spectrum;
- the voiced segment generating unit 3022 is configured to calculate the plurality of audio subbands through the activity detection algorithm and the silence suppression algorithm in sequence, to generate at least one voiced speech segment, and a voiced speech segment includes multiple voiced speech subsegments.
- the voice segment generating unit 3022 can also be specifically used for:
- An activity detection algorithm is used to calculate the likelihood ratio corresponding to each subband feature quantity based on the noise distribution probability and the speech distribution probability, and generate a plurality of weighted log-likelihood ratios.
- the multiple audio subbands are in one-to-one correspondence;
- the audio subband whose weighted log-likelihood ratio is less than or equal to the likelihood threshold as the mute signal use the mute suppression algorithm to suppress the mute signal, and determine the audio subband whose weighted log-likelihood ratio is greater than the likelihood threshold It is determined that it is a voiced speech segment, and at least one voiced speech segment is obtained, and one voiced speech segment includes a plurality of voiced speech sub-segments.
- the human voice detection module 303 can also be specifically used for:
- the multiple voice volumes are respectively normalized to generate a normalized volume value group, and the normalized volume value less than the normalized threshold in the normalized volume value group is adjusted to zero, and the adjusted The normalized volume value group of ;
- the preset zero-crossing rate algorithm is used to perform zero-crossing check on the average values of the multiple volumes, generate multiple non-zero volume values, and perform sum calculation and average calculation on the multiple non-zero volume values to generate non-zero volume values Total volume and non-zero volume mean;
- a voiced speech sub-segment whose speech volume is greater than or equal to the sensitivity threshold is determined as a target vocal segment.
- the syllable detection module 304 can also be specifically used for:
- the discrete data between the absolute values of two adjacent discrete data is read to obtain multiple discrete data, and the multiple discrete data are respectively determined as multiple syllables to be detected.
- the dragging sound detection module 305 can also be specifically used for:
- a syllable to be detected whose syllable pronunciation duration is greater than a preset pronunciation duration threshold is determined as a target drag syllable, and the target drag syllable is one or more.
- a series of processing is performed on the discrete speech signal by using the activity detection algorithm, the silence suppression algorithm and the zero-crossing rate algorithm to generate the syllables to be detected, and then the target syllables are determined based on the syllables to be detected, so that a large number of syllables are not required.
- the labeled dragging data is used for model training, which saves the dragging detection time and improves the dragging detection efficiency.
- FIGS 3 and 4 above describe in detail the dragging sound detection device in the embodiment of the present application from the perspective of modular functional entities, and the following describes the dragging sound detection device in the embodiment of the present application in detail from the perspective of hardware processing.
- FIG. 5 is a schematic structural diagram of a drag sound detection device provided by an embodiment of the present application.
- the drag noise detection device 500 may vary greatly due to different configurations or performances, and may include one or more processors (central processing units, CPU) 510 (eg, one or more processors) and memory 520, one or more storage media 530 (eg, one or more mass storage devices) that store application programs 533 or data 532.
- the memory 520 and the storage medium 530 may be short-term storage or persistent storage.
- the program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the device 500 for detecting drag sound.
- the processor 510 may be configured to communicate with the storage medium 530, and execute a series of instruction operations in the storage medium 530 on the drag sound detection device 500.
- the sound detection device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input and output interfaces 560, and/or, one or more operating systems 531, such as Windows Serve, Mac OS X, Unix, Linux, FreeBSD, and more.
- operating systems 531 such as Windows Serve, Mac OS X, Unix, Linux, FreeBSD, and more.
- the present application also provides a drag sound detection device, the computer device includes a memory and a processor, the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, causes the processor to execute the above-mentioned embodiments.
- the steps of the dragging sound detection method are not limited to a drag sound detection device, the computer device includes a memory and a processor, the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, causes the processor to execute the above-mentioned embodiments. The steps of the dragging sound detection method.
- the present application also provides a computer-readable storage medium.
- the computer-readable storage medium may be a non-volatile computer-readable storage medium.
- the computer-readable storage medium may also be a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, cause the computer to execute the steps of the method for detecting drag.
- the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
- Blockchain essentially a decentralized database, is a series of data blocks associated with cryptographic methods. Each data block contains a batch of network transaction information to verify its Validity of information (anti-counterfeiting) and generation of the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
- the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
- the technical solutions of the present application can be embodied in the form of software products in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, and the computer software products are stored in a storage medium , including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
- the aforementioned storage medium includes: U disk, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk and other media that can store program codes .
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Telephonic Communication Services (AREA)
Abstract
本申请涉及人工智能技术领域,公开了拖音的检测方法、装置、设备及存储介质,用于节省拖音检测的时间,从而提高拖音检测的效率。拖音的检测方法包括:实时获取多段语音数据,并对多段语音数据进行实时采样处理,生成离散语音信号;依次采用活性检测算法和静音抑制算法对离散语音信号进行处理,生成至少一个有声语音片段;结合预置的过零率算法对至少一个有声语音片段进行人声检测,确定至少一个目标人声段;对至少一个目标人声段进行音节检测,生成多个待检测音节;按照预置的发音时长阈值对多个待检测音节进行拖音检测,在多个待检测音节中确定目标拖音音节。此外,本申请还涉及区块链技术,多段语音数据可存储于区块链中。
Description
本申请要求于2020年12月23日提交中国专利局、申请号为202011538711.6、发明名称为“拖音的检测方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在申请中。
本申请涉及语音处理技术领域,尤其涉及一种拖音的检测方法、装置、设备及存储介质。
现实生活中,我们会发现很多人说话有口吃之类的问题,因此很多矫正口吃问题教育语言机构和在线发音矫正平台应运而生。
在现有技术中,大多数教育语言机构或者在线发音矫正平台,对口吃中的拖音通常没有智能算法进行检测和提取,发明人意识到即使有部分教育语言机构或者在线发音矫正平台采用深度学习来训练智能模型进行拖音检测,但是通常需要大量的标注数据进行模型训练,从而导致检测效率低下。
发明内容
本申请提供了一种拖音的检测方法、装置、设备及存储介质,用于节省拖音检测的时间,从而提高拖音检测的效率。
本申请第一方面提供了一种拖音的检测方法,其中,所述拖音的检测方法包括:实时获取多段语音数据,并对所述多段语音数据进行实时采样处理,生成离散语音信号;依次采用活性检测算法和静音抑制算法对所述离散语音信号进行处理,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段;结合预置的过零率算法对所述至少一个有声语音片段进行人声检测,确定至少一个目标人声段;对所述至少一个目标人声段进行音节检测,生成多个待检测音节;按照预置的发音时长阈值对所述多个待检测音节进行拖音检测,在所述多个待检测音节中确定目标拖音音节,所述目标拖音音节为一个或者多个。
本申请第二方面提供了一种拖音的检测设备,包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现如下步骤:实时获取多段语音数据,并对所述多段语音数据进行实时采样处理,生成离散语音信号;依次采用活性检测算法和静音抑制算法对所述离散语音信号进行处理,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段;结合预置的过零率算法对所述至少一个有声语音片段进行人声检测,确定至少一个目标人声段;对所述至少一个目标人声段进行音节检测,生成多个待检测音节;按照预置的发音时长阈值对所述多个待检测音节进行拖音检测,在所述多个待检测音节中确定目标拖音音节,所述目标拖音音节为一个或者多个。
本申请的第三方面提供了一种计算机可读存储介质,所述计算机可读存储介质中存储计算机指令,当所述计算机指令在计算机上运行时,使得计算机执行如下步骤:实时获取多段语音数据,并对所述多段语音数据进行实时采样处理,生成离散语音信号;依次采用活性检测算法和静音抑制算法对所述离散语音信号进行处理,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段;结合预置的过零率算法对所述至少一个有声语音片段进行人声检测,确定至少一个目标人声段;对所述至少一个目标人声段进行音节检测,生成多个待检测音节;按照预置的发音时长阈值对所述多个待检测音节进行拖音检测,在所述多个待检测音节中确定目标拖音音节,所述目标拖音音节为一个或者多个。
本申请第四方面提供了一种拖音的检测装置,其中,所述拖音的检测装置包括:获取 模块,用于实时获取多段语音数据,并对所述多段语音数据进行实时采样处理,生成离散语音信号;有声片段生成模块,用于依次采用活性检测算法和静音抑制算法对所述离散语音信号进行处理,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段;人声检测模块,用于结合预置的过零率算法对所述至少一个有声语音片段进行人声检测,确定至少一个目标人声段;音节检测模块,用于对所述至少一个目标人声段进行音节检测,生成多个待检测音节;拖音检测模块,用于按照预置的发音时长阈值对所述多个待检测音节进行拖音检测,在所述多个待检测音节中确定目标拖音音节,所述目标拖音音节为一个或者多个。
本申请提供的技术方案中,实时获取多段语音数据,并对所述多段语音数据进行实时采样处理,生成离散语音信号;采用活性检测算法和静音抑制算法对所述离散语音信号进行处理,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段;结合预置的过零率算法对所述至少一个有声语音片段进行人声检测,确定至少一个目标人声段;对所述至少一个目标人声段进行音节检测,生成多个待检测音节;按照预置的发音时长阈值对所述多个待检测音节进行拖音检测,在所述多个待检测音节中确定目标拖音音节,所述目标拖音音节为一个或者多个。本申请实施例中,通过使用活性检测算法、静音抑制算法和过零率算法对离散语音信号进行一系列处理,从而生成待检测音节,然后基于待检测音节确定目标拖音音节,从而不需要大量的标注拖音数据进行模型训练,节省了拖音检测的时间,提高了拖音检测的效率。
图1为本申请实施例中拖音的检测方法的一个实施例示意图;
图2为本申请实施例中拖音的检测方法的另一个实施例示意图;
图3为本申请实施例中拖音的检测装置的一个实施例示意图;
图4为本申请实施例中拖音的检测装置的另一个实施例示意图;
图5为本申请实施例中拖音的检测设备的一个实施例示意图。
本申请实施例提供了一种拖音的检测方法、装置、设备及存储介质,用于节省拖音检测的时间,从而提高拖音检测的效率。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”或“具有”及其任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
为便于理解,下面对本申请实施例的具体流程进行描述,请参阅图1,本申请实施例中拖音的检测方法的一个实施例包括:
101、实时获取多段语音数据,并对多段语音数据进行实时采样处理,生成离散语音信号;
服务器实时获取多段语音数据,并将多个语音数据实时采样处理为离散语音信号。需要强调的是,为进一步保证上述多段语音数据的私密和安全性,上述多段语音数据还可以存储于一区块链的节点中。
在本实施例中,采用在线处理的方式对多段语音数据进行处理,从而得到一连串的二进制数据,即离散语音信号,例如,语音数据为“你在干什么(中间停顿无声3秒)(杂音 1)你怎么不回答(杂音2)我(中间无人回答3秒)”,经过在线处理,生成离散语音信号为[1 0 0 ……1]。在线处理的方式能够对语音数据进行实时处理,无需等采集完全部语音再进行处理。语音数据是由采样得到的,每一段语音数据都记录了原始模拟声波在获取时刻的状态。
可以理解的是,本申请的执行主体可以为拖音的检测装置,还可以是终端或者服务器,具体此处不做限定。本申请实施例以服务器为执行主体为例进行说明。
102、依次采用活性检测算法和静音抑制算法对离散语音信号进行处理,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段;
服务器采用活性检测算法和静音抑制算法将离散语音信号处理为至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段。
需要说明的是,在本实施例中,活性检测算法为WebRTC VAD算法,静音抑制算法即将静音信号进行抑制的算法。假设离散语音信号为[1 0 0 ……1],服务器采用活性检测算法对[1 0 0 ……1]进行活性检测,生成活性检测结果,并采用静音抑制算法对活性检测结果进行静音抑制,从而生成两个有声语音片段,即“你在干什么(杂音1)”和“你怎么不回答(杂音2)我”。
103、结合预置的过零率算法对至少一个有声语音片段进行人声检测,确定至少一个目标人声段;
服务器结合预置的过零率算法对至少一个有声语音片段进行人声检测,从而确定至少一个目标人声段。
过零率算法可以理解为结合多个有声语音子片段生成一个门限值,然后将大于该门限值的有声语音子片段确定为目标人声段。假设有声语音片段为“你在干什么(杂音1)”和“你怎么不回答(杂音2)我”,服务器则基于每个有声片段中的有声语音子片段和过零率算法确定门限值,其中杂音、噪音的音量值低于所述门限值,因此服务器将大于该门限值的有声语音子片段确定为“你在干什么”和“你怎么不回答我”的两个有声语音子片段。
104、对至少一个目标人声段进行音节检测,生成多个待检测音节;
服务器对至少一个目标人声段进行音节检测,从而生成多个待检测音节。
现代标准汉语的音系常用音节作为分析单位,往往一个汉字对应一个音节。因此服务器对目标人声段进行音节检测,从而得到多个待检测音节。例如目标人声段为“你在干什么”,其中,服务器对“你在干什么”的目标人声段进行音节检测,从而得到“你”、“在”、“干”、“什”以及“么”的多个待检测音节。
105、按照预置的发音时长阈值对多个待检测音节进行拖音检测,在多个待检测音节中确定目标拖音音节,目标拖音音节为一个或者多个。
服务器按照预置的发音时长阈值分别对多个待检测音节进行拖音检测,然后在多个待检测音节中确定为一个或者多个的目标拖音音节。
例如,假设多个待检测音节分别为“你”、“在”、“干”、“什”以及“么”,服务器按照预置的发音时长阈值对“你”、“在”、“干”、“什”以及“么”分别进行拖音检测,从而确定目标拖音音节为“你”。
本申请实施例中,通过使用活性检测算法、静音抑制算法和过零率算法对离散语音信号进行一系列处理,从而生成待检测音节,然后基于待检测音节确定目标拖音音节,从而不需要大量的标注拖音数据进行模型训练,节省了拖音检测的时间,提高了拖音检测的效率。
请参阅图2,本申请实施例中拖音的检测方法的另一个实施例包括:
201、实时获取多段语音数据,并对多段语音数据进行实时采样处理,生成离散语音信 号;
服务器实时获取多段语音数据,并将多个语音数据实时采样处理为离散语音信号。需要强调的是,为进一步保证上述多段语音数据的私密和安全性,上述多段语音数据还可以存储于一区块链的节点中。
在本实施例中,采用在线处理的方式对多段语音数据进行处理,从而得到一连串的二进制数据,即离散语音信号,例如,语音数据为“你在干什么(中间停顿无声3秒)(杂音1)你怎么不回答(杂音2)我(中间无人回答3秒)”,经过在线处理,生成离散语音信号为[1 0 0 ……1]。在线处理的方式能够对语音数据进行实时处理,无需等采集完全部语音再进行处理。语音数据是由采样得到的,每一段语音数据都记录了原始模拟声波在获取时刻的状态。
具体的,服务器按照预置的采样率实时获取为模拟声波数据的多段语音数据;服务器将多段语音数据进行实时拼接,生成实时拼接后的语音数据,并将实时拼接后的语音数据进行二进制处理,生成为二进制数据的离散语音信号。
服务器按照预置的采样率实时获取为模拟声波数据的多段语音数据,在本实施例中,采样率可以为8khz、16khz、32khz、48khz,当在16khz、32khz、48khz的采样率下进行数据采样后,还需要将语音数据的音频频率降为8khz再进行语音信号转换的处理;在得到多段语音数据之后,分别对多段语音数据进行在线的二进制处理,生成为二进制的离散语音信号。
202、采用带通滤波器,按照预置的音频频谱将离散语音信号的音频分割为多个音频子带;
服务器按照音频频谱采用带通滤波器将离散信号的音频分割为多个音频子带。在本实施例中,服务器主要将离散信号的音频划分为6个音频子带,其中6个音频子带分别为80Hz~250Hz、250Hz~500Hz、500Hz~1KHz、1KHz~2KHz、2KHz~3KHz以及3KHz~4KHz。
203、依次通过活性检测算法和静音抑制算法分别对多个音频子带进行计算,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段;
服务器依次通过活性检测算法对音频子带进行计算,生成计算后的结果,然后采用静音抑制算法对计算后的结果进行静音抑制,从而生成有声语音片段,例如,离散语音信号为“你在干什么(中间停顿无声3秒)(杂音1)你怎么不回答(杂音2)我(中间无人回答3秒)”的离散语音信号。经过高斯模型、活性检测算法以及静音处理算法处理后,生成两个有声语音片段,分别为“你在干什么(杂音1)”和“你怎么不回答(杂音2)我”。
具体的,服务器分别对多个音频子带进行特征计算,生成多个子带特征,然后分别计算多个子带特征的子带能量,生成多个子带特征量,其中计算子带特征的子带能量的函数为WebRtcVad_CalcVad8khz函数;服务器再采用预置的高斯模型计算每个子带特征量的概率密度,即噪声分布概率和语音分布概率;然后基于噪声分布概率和语音分布概率计算相关似然参数,计算分布概率的函数为WebRtcVad_GaussianProbability函数,相关似然参数为高斯模型的相关似然参数,然后基于相关似然参数和活性检测算法对噪音分布概率和语音分布概率进行计算,生成多个加权对数似然比,一个加权对数似然比对应一个音频子带;最后将加权对数似然比小于或者等于似然阈值的音频子带确定为静音信号,并采用静音抑制算法抑制静音信号,将加权对数似然比大于似然阈值的音频子带确定为有声语音片段。
204、结合预置的过零率算法对至少一个有声语音片段进行人声检测,确定至少一个目标人声段;
服务器结合预置的过零率算法对至少一个有声语音片段进行人声检测,从而确定至少 一个目标人声段。
过零率算法可以理解为结合多个有声语音子片段生成一个门限值,然后将大于该门限值的有声语音子片段确定为目标人声段。假设有声语音片段为“你在干什么(杂音1)”和“你怎么不回答(杂音2)我”,服务器则基于每个有声片段中的有声语音子片段和过零率算法确定门限值,然后将大于该门限值的有声语音子片段确定为“你在干什么”和“你怎么不回答我”两个有声语音子片段。
具体的,服务器从至少一个有声语音片段中提取多个语音音量,并计算多个语音音量的平均值,生成语音平均音量,一个有声语音子片段对应一个语音音量;服务器分别将多个语音音量进行归一化,生成归一化音量值组,并将归一化音量值组中小于归一化阈值的归一化音量值调整为零,生成调整后的归一化音量值组;然后服务器对归一化音量值组按照预置的数量点进行平均值计算,得到多个音量平均值;服务器采用预置的过零率算法对多个音量平均值进行过零检查,生成多个非零音量值,并对多个非零音量值进行求和计算以及均值计算,生成非零音量总值和非零音量均值;服务器基于多个语音平均音量、非零音量总值和非零音量均值生成灵敏度阈值;服务器将语音音量大于或者等于灵敏度阈值的有声语音子片段确定为目标人声段。
服务器首先从语音片段中提取多个语音音量,再对多个语音音量进行平均值的计算,生成语音平均音量;然后对多个语音音量进行归一化,将多个语音音量归一化至0-1之间,生成归一化音量值组,归一化音量值组中包括多个归一化音量值,服务器将小于0.25(归一化阈值)的归一化音量值调整为0,生成调整后的归一化音量值组;服务器采用过零率算法对调整后的归一化音量值组以256个离散语音信号点,即预置的数量点为一值段进行平均值计算,生成多个音量平均值,然后采用预置的过零率算法分别为每个点与对应的音量平均值进行减法计算,生成音量差值,再过滤掉为0的音量差值,得到多个非零音量值,并对多个非零音量值进行求和计算和平均值计算,生成非零音量总之和非零音量均值;服务器基于多个语音平均音量、非零音量总值和非零音量均值设置灵敏度阈值,最后将语音音量大于或者等于灵敏度阈值的有声语音子片段确定为目标人声段。其中,语音平均音量一般为70,非零音量总值为50,非零音量均值为0.35。
205、对至少一个目标人声段进行音节检测,生成多个待检测音节;
服务器对至少一个目标人声段进行音节检测,从而生成多个待检测音节。
现代标准汉语的音系常用音节作为分析单位,往往一个汉字对应一个音节。因此服务器对目标人声段进行音节检测,从而得到多个待检测音节。例如目标人声段为“你在干什么”,其中,服务器对“你在干什么”的目标人声段进行音节检测,从而得到“你”、“在”、“干”、“什”以及“么”的多个待检测音节。
具体的,服务器从至少一个目标人声段中提取多个离散数据绝对值组;服务器在每个离散数据绝对值组中确定最小的离散数据绝对值,得到多个离散数据绝对值;服务器读取相邻两个离散数据绝对值间的离散数据,得到多个离散数据,并分别将多个离散数据确定为多个待检测音节。
服务器从每个目标人声字段中提取多个离散数据绝对值组,当以8k的采样率采集语音数据时,服务器以每600个离散语音信号点提取离散数据绝对值,生成多个离散数据绝对值组;当以16k的采样率采集语音数据时,服务器以每1200个离散语音信号点提取离散数据绝对值,生成多个离散数据绝对值组;然后服务器在每个离散数据绝对值组中确定最小的离散数据绝对值,得到多个离散数据绝对值,然后读取相邻两个离散数据绝对值间的离散数据,得到多个离散数据。最后将多个离散数据分别确定为多个待检测音节。需要说明的是,当离散数据绝对值小于0.35时,服务器忽略不计该离散数据绝对值。
206、按照预置的发音时长阈值对多个待检测音节进行拖音检测,在多个待检测音节中确定目标拖音音节,目标拖音音节为一个或者多个。
服务器按照预置的发音时长阈值分别对多个待检测音节进行拖音检测,然后在多个待检测音节中确定为一个或者多个的目标拖音音节。
例如,假设多个待检测音节分别为“你”、“在”、“干”、“什”以及“么”,服务器按照预置的发音时长阈值对“你”、“在”、“干”、“什”以及“么”分别进行拖音检测,从而确定目标拖音音节为“你”。
具体的,服务器读取每个待检测音节的发音时长,得到多个音节发音时长,然后将每个音节发音时长与发音时长阈值进行对比,发音时长阈值为0.4秒,例如,“你”、“在”、“干”、“什”以及“么”的发音时长分别为0.5、0.4、0.3、0.3和0.3,服务器将音节发音时长大于0.4秒的待检测音节,即“你”,确定为目标拖音音节,在其他实施例中,目标拖音音节可以为一个、两个或者多个。需要说明的是,0.4秒为普通话的发音时长阈值,如果是四川方言,发音时长阈值则为0.3秒。
本申请实施例中,通过使用活性检测算法、静音抑制算法和过零率算法对离散语音信号进行一系列处理,从而生成待检测音节,然后基于待检测音节确定目标拖音音节,从而不需要大量的标注拖音数据进行模型训练,节省了拖音检测的时间,提高了拖音检测的效率。
上面对本申请实施例中拖音的检测方法进行了描述,下面对本申请实施例中拖音的检测装置进行描述,请参阅图3,本申请实施例中拖音的检测装置一个实施例包括:
获取模块301,用于实时获取多段语音数据,并对所述多段语音数据进行实时采样处理,生成离散语音信号;
有声片段生成模块302,用于依次采用活性检测算法和静音抑制算法对所述离散语音信号进行处理,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段;
人声检测模块303,用于结合预置的过零率算法对所述至少一个有声语音片段进行人声检测,确定至少一个目标人声段;
音节检测模块304,用于对所述至少一个目标人声段进行音节检测,生成多个待检测音节;
拖音检测模块305,用于按照预置的发音时长阈值对所述多个待检测音节进行拖音检测,在所述多个待检测音节中确定目标拖音音节,所述目标拖音音节为一个或者多个。
本申请实施例中,通过使用活性检测算法、静音抑制算法和过零率算法对离散语音信号进行一系列处理,从而生成待检测音节,然后基于待检测音节确定目标拖音音节,从而不需要大量的标注拖音数据进行模型训练,节省了拖音检测的时间,提高了拖音检测的效率。
请参阅图4,本申请实施例中拖音的检测装置的另一个实施例包括:
获取模块301,用于实时获取多段语音数据,并对所述多段语音数据进行实时采样处理,生成离散语音信号;
有声片段生成模块302,用于依次采用活性检测算法和静音抑制算法对所述离散语音信号进行处理,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段;
人声检测模块303,用于结合预置的过零率算法对所述至少一个有声语音片段进行人声检测,确定至少一个目标人声段;
音节检测模块304,用于对所述至少一个目标人声段进行音节检测,生成多个待检测音节;
拖音检测模块305,用于按照预置的发音时长阈值对所述多个待检测音节进行拖音检 测,在所述多个待检测音节中确定目标拖音音节,所述目标拖音音节为一个或者多个。
可选的,获取模块301还可以具体用于:
按照预置的采样率实时获取多段语音数据,所述多段语音数据为模拟声波数据;
将所述多段语音数据进行实时拼接,生成实时拼接后的语音数据,并将所述实时拼接后的语音数据进行二进制处理,生成离散语音信号,所述离散语音信号为二进制数据。
可选的,有声片段生成模块302包括:
分割单元3021,用于采用带通滤波器,按照预置的音频频谱将所述离散语音信号的音频分割为多个音频子带;
有声片段生成单元3022,用于依次通过活性检测算法和静音抑制算法分别对所述多个音频子带进行计算,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段。
可选的,有声片段生成单元3022还可以具体用于:
分别对所述多个音频子带进行特征计算,生成多个子带特征;
分别对所述多个子带特征进行子带能量计算,生成多个子带特征量;
分别对所述多个子带特征量进行概率密度计算,生成噪声分布概率和语音分布概率;
采用活性检测算法基于所述噪声分布概率和所述语音分布概率计算每个子带特征量对应的似然比,生成多个加权对数似然比,所述多个加权对数似然比与所述多个音频子带一一对应;
将加权对数似然比小于或者等于似然阈值的音频子带确定为静音信号,采用静音抑制算法对所述静音信号进行抑制,并将加权对数似然比大于似然阈值的音频子带确定为有声语音片段,得到至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段。
可选的,人声检测模块303还可以具体用于:
从所述至少一个有声语音片段中提取多个语音音量,并计算所述多个语音音量的平均值,生成语音平均音量,一个有声语音子片段对应一个语音音量;
分别将所述多个语音音量进行归一化,生成归一化音量值组,并将所述归一化音量值组中小于归一化阈值的归一化音量值调整为零,生成调整后的归一化音量值组;
对所述归一化音量值组按照预置的数量点进行平均值计算,得到多个音量平均值;
采用预置的过零率算法对所述多个音量平均值进行过零检查,生成多个非零音量值,并对所述多个非零音量值进行求和计算以及均值计算,生成非零音量总值和非零音量均值;
基于所述多个语音平均音量、所述非零音量总值和所述非零音量均值生成灵敏度阈值;
将语音音量大于或者等于所述灵敏度阈值的有声语音子片段确定为目标人声段。
可选的,音节检测模块304还可以具体用于:
从所述至少一个目标人声段中提取多个离散数据绝对值组;
在每个离散数据绝对值组中确定最小的离散数据绝对值,得到多个离散数据绝对值;
读取相邻两个离散数据绝对值间的离散数据,得到多个离散数据,并分别将所述多个离散数据确定为多个待检测音节。
可选的,拖音检测模块305还可以具体用于:
分别提取所述多个待检测音节的发音时长,得到多个音节发音时长;
将音节发音时长大于预置的发音时长阈值的待检测音节确定为目标拖音音节,所述目标拖音音节为一个或者多个。
本申请实施例中,通过使用活性检测算法、静音抑制算法和过零率算法对离散语音信号进行一系列处理,从而生成待检测音节,然后基于待检测音节确定目标拖音音节,从而不需要大量的标注拖音数据进行模型训练,节省了拖音检测的时间,提高了拖音检测的效 率。
上面图3和图4从模块化功能实体的角度对本申请实施例中的拖音的检测装置进行详细描述,下面从硬件处理的角度对本申请实施例中拖音的检测设备进行详细描述。
图5是本申请实施例提供的一种拖音的检测设备的结构示意图,该拖音的检测设备500可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器(central processing units,CPU)510(例如,一个或一个以上处理器)和存储器520,一个或一个以上存储应用程序533或数据532的存储介质530(例如一个或一个以上海量存储设备)。其中,存储器520和存储介质530可以是短暂存储或持久存储。存储在存储介质530的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对拖音的检测设备500中的一系列指令操作。更进一步地,处理器510可以设置为与存储介质530通信,在拖音的检测设备500上执行存储介质530中的一系列指令操作。
拖音的检测设备500还可以包括一个或一个以上电源540,一个或一个以上有线或无线网络接口550,一个或一个以上输入输出接口560,和/或,一个或一个以上操作系统531,例如Windows Serve,Mac OS X,Unix,Linux,FreeBSD等等。本领域技术人员可以理解,图5示出的拖音的检测设备结构并不构成对拖音的检测设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
本申请还提供一种拖音的检测设备,所述计算机设备包括存储器和处理器,存储器中存储有计算机可读指令,计算机可读指令被处理器执行时,使得处理器执行上述各实施例中的所述拖音的检测方法的步骤。
本申请还提供一种计算机可读存储介质,该计算机可读存储介质可以为非易失性计算机可读存储介质,该计算机可读存储介质也可以为易失性计算机可读存储介质,所述计算机可读存储介质中存储有指令,当所述指令在计算机上运行时,使得计算机执行所述拖音的检测方法的步骤。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。
Claims (20)
- 一种拖音的检测方法,其中,所述拖音的检测方法包括:实时获取多段语音数据,并对所述多段语音数据进行实时采样处理,生成离散语音信号;依次采用活性检测算法和静音抑制算法对所述离散语音信号进行处理,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段;结合预置的过零率算法对所述至少一个有声语音片段进行人声检测,确定至少一个目标人声段;对所述至少一个目标人声段进行音节检测,生成多个待检测音节;按照预置的发音时长阈值对所述多个待检测音节进行拖音检测,在所述多个待检测音节中确定目标拖音音节,所述目标拖音音节为一个或者多个。
- 根据权利要求1所述的拖音的检测方法,其中,所述实时获取多段语音数据,并对所述多段语音数据进行实时采样处理,生成离散语音信号包括:按照预置的采样率实时获取多段语音数据,所述多段语音数据为模拟声波数据;将所述多段语音数据进行实时拼接,生成实时拼接后的语音数据,并将所述实时拼接后的语音数据进行二进制处理,生成离散语音信号,所述离散语音信号为二进制数据。
- 根据权利要求1所述的拖音的检测方法,其中,所述依次采用活性检测算法和静音抑制算法对所述离散语音信号进行处理,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段包括:采用带通滤波器,按照预置的音频频谱将所述离散语音信号的音频分割为多个音频子带;依次通过活性检测算法和静音抑制算法分别对所述多个音频子带进行计算,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段。
- 根据权利要求3所述的拖音的检测方法,其中,所述依次通过活性检测算法和静音抑制算法分别对所述多个音频子带进行计算,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段包括:分别对所述多个音频子带进行特征计算,生成多个子带特征;分别对所述多个子带特征进行子带能量计算,生成多个子带特征量;分别对所述多个子带特征量进行概率密度计算,生成噪声分布概率和语音分布概率;采用活性检测算法基于所述噪声分布概率和所述语音分布概率计算每个子带特征量对应的似然比,生成多个加权对数似然比,所述多个加权对数似然比与所述多个音频子带一一对应;将加权对数似然比小于或者等于似然阈值的音频子带确定为静音信号,采用静音抑制算法对所述静音信号进行抑制,并将加权对数似然比大于似然阈值的音频子带确定为有声语音片段,得到至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段。
- 根据权利要求1所述的拖音的检测方法,其中,所述结合预置的过零率算法对所述至少一个有声语音片段进行人声检测,确定至少一个目标人声段包括:从所述至少一个有声语音片段中提取多个语音音量,并计算所述多个语音音量的平均值,生成语音平均音量,一个有声语音子片段对应一个语音音量;分别将所述多个语音音量进行归一化,生成归一化音量值组,并将所述归一化音量值组中小于归一化阈值的归一化音量值调整为零,生成调整后的归一化音量值组;对所述归一化音量值组按照预置的数量点进行平均值计算,得到多个音量平均值;采用预置的过零率算法对所述多个音量平均值进行过零检查,生成多个非零音量值, 并对所述多个非零音量值进行求和计算以及均值计算,生成非零音量总值和非零音量均值;基于所述多个语音平均音量、所述非零音量总值和所述非零音量均值生成灵敏度阈值;将语音音量大于或者等于所述灵敏度阈值的有声语音子片段确定为目标人声段。
- 根据权利要求1所述的拖音的检测方法,其中,所述对所述至少一个目标人声段进行音节检测,生成多个待检测音节包括:从所述至少一个目标人声段中提取多个离散数据绝对值组;在每个离散数据绝对值组中确定最小的离散数据绝对值,得到多个离散数据绝对值;读取相邻两个离散数据绝对值间的离散数据,得到多个离散数据,并分别将所述多个离散数据确定为多个待检测音节。
- 根据权利要求1所述的拖音的检测方法,其中,所述按照预置的发音时长阈值对所述多个待检测音节进行拖音检测,在所述多个待检测音节中确定目标拖音音节,所述目标拖音音节为一个或者多个包括:分别提取所述多个待检测音节的发音时长,得到多个音节发音时长;将音节发音时长大于预置的发音时长阈值的待检测音节确定为目标拖音音节,所述目标拖音音节为一个或者多个。
- 一种拖音的检测设备,包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现如下步骤:实时获取多段语音数据,并对所述多段语音数据进行实时采样处理,生成离散语音信号;依次采用活性检测算法和静音抑制算法对所述离散语音信号进行处理,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段;结合预置的过零率算法对所述至少一个有声语音片段进行人声检测,确定至少一个目标人声段;对所述至少一个目标人声段进行音节检测,生成多个待检测音节;按照预置的发音时长阈值对所述多个待检测音节进行拖音检测,在所述多个待检测音节中确定目标拖音音节,所述目标拖音音节为一个或者多个。
- 根据权利要求8所述的拖音的检测设备,所述处理器执行所述计算机程序时还实现以下步骤:按照预置的采样率实时获取多段语音数据,所述多段语音数据为模拟声波数据;将所述多段语音数据进行实时拼接,生成实时拼接后的语音数据,并将所述实时拼接后的语音数据进行二进制处理,生成离散语音信号,所述离散语音信号为二进制数据。
- 根据权利要求8所述的拖音的检测设备,所述处理器执行所述计算机程序时还实现以下步骤:采用带通滤波器,按照预置的音频频谱将所述离散语音信号的音频分割为多个音频子带;依次通过活性检测算法和静音抑制算法分别对所述多个音频子带进行计算,生成至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段。
- 根据权利要求10所述的拖音的检测设备,所述处理器执行所述计算机程序时还实现以下步骤:分别对所述多个音频子带进行特征计算,生成多个子带特征;分别对所述多个子带特征进行子带能量计算,生成多个子带特征量;分别对所述多个子带特征量进行概率密度计算,生成噪声分布概率和语音分布概率;采用活性检测算法基于所述噪声分布概率和所述语音分布概率计算每个子带特征量对 应的似然比,生成多个加权对数似然比,所述多个加权对数似然比与所述多个音频子带一一对应;将加权对数似然比小于或者等于似然阈值的音频子带确定为静音信号,采用静音抑制算法对所述静音信号进行抑制,并将加权对数似然比大于似然阈值的音频子带确定为有声语音片段,得到至少一个有声语音片段,一个有声语音片段包括多个有声语音子片段。
- 根据权利要求8所述的拖音的检测设备,所述处理器执行所述计算机程序时还实现以下步骤:从所述至少一个有声语音片段中提取多个语音音量,并计算所述多个语音音量的平均值,生成语音平均音量,一个有声语音子片段对应一个语音音量;分别将所述多个语音音量进行归一化,生成归一化音量值组,并将所述归一化音量值组中小于归一化阈值的归一化音量值调整为零,生成调整后的归一化音量值组;对所述归一化音量值组按照预置的数量点进行平均值计算,得到多个音量平均值;采用预置的过零率算法对所述多个音量平均值进行过零检查,生成多个非零音量值,并对所述多个非零音量值进行求和计算以及均值计算,生成非零音量总值和非零音量均值;基于所述多个语音平均音量、所述非零音量总值和所述非零音量均值生成灵敏度阈值;将语音音量大于或者等于所述灵敏度阈值的有声语音子片段确定为目标人声段。
- 根据权利要求8所述的拖音的检测设备,所述处理器执行所述计算机程序时还实现以下步骤:从所述至少一个目标人声段中提取多个离散数据绝对值组;在每个离散数据绝对值组中确定最小的离散数据绝对值,得到多个离散数据绝对值;读取相邻两个离散数据绝对值间的离散数据,得到多个离散数据,并分别将所述多个离散数据确定为多个待检测音节。
- 根据权利要求8所述的拖音的检测设备,所述处理器执行所述计算机程序时还实现以下步骤:分别提取所述多个待检测音节的发音时长,得到多个音节发音时长;将音节发音时长大于预置的发音时长阈值的待检测音节确定为目标拖音音节,所述目标拖音音节为一个或者多个。
- A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps: acquiring multiple segments of speech data in real time, and performing real-time sampling processing on the multiple segments of speech data to generate a discrete speech signal; processing the discrete speech signal successively with an activity detection algorithm and a silence suppression algorithm to generate at least one voiced speech segment, one voiced speech segment comprising multiple voiced speech sub-segments; performing vocal detection on the at least one voiced speech segment in combination with a preset zero-crossing rate algorithm to determine at least one target vocal segment; performing syllable detection on the at least one target vocal segment to generate multiple syllables to be detected; and performing prolonged sound detection on the multiple syllables to be detected according to a preset pronunciation duration threshold, and determining one or more target prolonged-sound syllables among the multiple syllables to be detected.
- The computer-readable storage medium according to claim 15, wherein the computer instructions stored in the computer-readable storage medium, when run on a computer, cause the computer to perform the following steps: acquiring the multiple segments of speech data in real time at a preset sampling rate, the multiple segments of speech data being analog sound-wave data; and concatenating the multiple segments of speech data in real time to generate real-time concatenated speech data, and performing binary processing on the real-time concatenated speech data to generate the discrete speech signal, the discrete speech signal being binary data.
- The computer-readable storage medium according to claim 15, wherein the computer instructions stored in the computer-readable storage medium, when run on a computer, cause the computer to perform the following steps: splitting the audio of the discrete speech signal into multiple audio sub-bands with a band-pass filter according to a preset audio spectrum; and performing calculation on the multiple audio sub-bands respectively, successively through the activity detection algorithm and the silence suppression algorithm, to generate the at least one voiced speech segment, one voiced speech segment comprising multiple voiced speech sub-segments.
- The computer-readable storage medium according to claim 17, wherein the computer instructions stored in the computer-readable storage medium, when run on a computer, cause the computer to perform the following steps: performing feature calculation on the multiple audio sub-bands respectively to generate multiple sub-band features; performing sub-band energy calculation on the multiple sub-band features respectively to generate multiple sub-band feature quantities; performing probability density calculation on the multiple sub-band feature quantities respectively to generate a noise distribution probability and a speech distribution probability; using the activity detection algorithm to calculate a likelihood ratio corresponding to each sub-band feature quantity based on the noise distribution probability and the speech distribution probability, generating multiple weighted log-likelihood ratios in one-to-one correspondence with the multiple audio sub-bands; and determining the audio sub-bands whose weighted log-likelihood ratio is less than or equal to a likelihood threshold as silence signals, suppressing the silence signals with the silence suppression algorithm, and determining the audio sub-bands whose weighted log-likelihood ratio is greater than the likelihood threshold as voiced speech segments, obtaining the at least one voiced speech segment, one voiced speech segment comprising multiple voiced speech sub-segments.
- The computer-readable storage medium according to claim 15, wherein the computer instructions stored in the computer-readable storage medium, when run on a computer, cause the computer to perform the following steps: extracting multiple speech volumes from the at least one voiced speech segment, and calculating the average of the multiple speech volumes to generate an average speech volume, one voiced speech sub-segment corresponding to one speech volume; normalizing the multiple speech volumes respectively to generate a group of normalized volume values, and adjusting to zero the normalized volume values in the group that are smaller than a normalization threshold, generating an adjusted group of normalized volume values; performing average calculation on the group of normalized volume values according to a preset number of points to obtain multiple volume averages; performing a zero-crossing check on the multiple volume averages with the preset zero-crossing rate algorithm to generate multiple non-zero volume values, and performing summation and mean calculation on the multiple non-zero volume values to generate a non-zero volume total and a non-zero volume mean; generating a sensitivity threshold based on the average speech volume, the non-zero volume total and the non-zero volume mean; and determining the voiced speech sub-segments whose speech volume is greater than or equal to the sensitivity threshold as target vocal segments.
- A prolonged sound detection apparatus, wherein the prolonged sound detection apparatus comprises: an acquisition module configured to acquire multiple segments of speech data in real time and perform real-time sampling processing on the multiple segments of speech data to generate a discrete speech signal; a voiced-segment generation module configured to process the discrete speech signal successively with an activity detection algorithm and a silence suppression algorithm to generate at least one voiced speech segment, one voiced speech segment comprising multiple voiced speech sub-segments; a vocal detection module configured to perform vocal detection on the at least one voiced speech segment in combination with a preset zero-crossing rate algorithm to determine at least one target vocal segment; a syllable detection module configured to perform syllable detection on the at least one target vocal segment to generate multiple syllables to be detected; and a prolonged sound detection module configured to perform prolonged sound detection on the multiple syllables to be detected according to a preset pronunciation duration threshold and determine one or more target prolonged-sound syllables among the multiple syllables to be detected.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011538711.6 | 2020-12-23 | ||
CN202011538711.6A CN112712823A (zh) | 2020-12-23 | 2020-12-23 | Prolonged sound detection method, apparatus, device and storage medium
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022134781A1 true WO2022134781A1 (zh) | 2022-06-30 |
Family
ID=75543676
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/124632 WO2022134781A1 (zh) | 2020-12-23 | 2021-10-19 | Prolonged sound detection method, apparatus, device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112712823A (zh) |
WO (1) | WO2022134781A1 (zh) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112712823A (zh) * | 2020-12-23 | 2021-04-27 | 深圳壹账通智能科技有限公司 | Prolonged sound detection method, apparatus, device and storage medium |
CN113744730B (zh) * | 2021-09-13 | 2023-09-08 | 北京奕斯伟计算技术股份有限公司 | Sound detection method and device |
- 2020-12-23: CN application CN202011538711.6A, published as CN112712823A (zh), status: active, pending
- 2021-10-19: PCT application PCT/CN2021/124632, published as WO2022134781A1 (zh), status: active, application filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TW201705127A (zh) * | 2015-07-30 | 2017-02-01 | 國立屏東大學 | Stuttering detection method and device, and computer program product
CN108831508A (zh) * | 2018-06-13 | 2018-11-16 | 百度在线网络技术(北京)有限公司 | Voice activity detection method, apparatus and device
CN111554324A (zh) * | 2020-04-01 | 2020-08-18 | 深圳壹账通智能科技有限公司 | Intelligent language fluency recognition method and apparatus, electronic device, and storage medium
CN111862951A (zh) * | 2020-07-23 | 2020-10-30 | 海尔优家智能科技(北京)有限公司 | Speech endpoint detection method and apparatus, storage medium, and electronic device
CN112712823A (zh) * | 2020-12-23 | 2021-04-27 | 深圳壹账通智能科技有限公司 | Prolonged sound detection method, apparatus, device and storage medium
Also Published As
Publication number | Publication date |
---|---|
CN112712823A (zh) | 2021-04-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021139425A1 (zh) | Speech endpoint detection method, apparatus, device and storage medium | |
Kim et al. | Power-normalized cepstral coefficients (PNCC) for robust speech recognition | |
KR101269296B1 (ko) | Neural network classifier for separating audio sources from a monophonic audio signal | |
CN108922541B (zh) | Multi-dimensional feature parameter voiceprint recognition method based on DTW and GMM models | |
Kumar et al. | Design of an automatic speaker recognition system using MFCC, vector quantization and LBG algorithm | |
WO2022134781A1 (zh) | Prolonged sound detection method, apparatus, device and storage medium | |
WO2014153800A1 (zh) | Speech recognition system | |
Mittal et al. | Analysis of production characteristics of laughter | |
Walters | Auditory-based processing of communication sounds | |
WO2017045429A1 (zh) | Audio data detection method, system and storage medium | |
Ismail et al. | Mfcc-vq approach for qalqalah tajweed rule checking | |
Nossier et al. | Mapping and masking targets comparison using different deep learning based speech enhancement architectures | |
Labied et al. | An overview of automatic speech recognition preprocessing techniques | |
Wiśniewski et al. | Automatic detection of disorders in a continuous speech with the hidden Markov models approach | |
Murugaiya et al. | Probability enhanced entropy (PEE) novel feature for improved bird sound classification | |
Chang et al. | Spectro-temporal features for noise-robust speech recognition using power-law nonlinearity and power-bias subtraction | |
Moritz et al. | Integration of optimized modulation filter sets into deep neural networks for automatic speech recognition | |
Vanderreydt et al. | A novel channel estimate for noise robust speech recognition | |
Zouhir et al. | A bio-inspired feature extraction for robust speech recognition | |
Varela et al. | Combining pulse-based features for rejecting far-field speech in a HMM-based voice activity detector | |
Iwok et al. | Evaluation of Machine Learning Algorithms using Combined Feature Extraction Techniques for Speaker Identification | |
Wiśniewski et al. | Automatic detection of prolonged fricative phonemes with the hidden Markov models approach | |
Kaminski et al. | Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models | |
Cristea et al. | New cepstrum frequency scale for neural network speaker verification | |
Montalvão et al. | Is masking a relevant aspect lacking in MFCC? A speaker verification perspective |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21908782 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.11.2023) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21908782 Country of ref document: EP Kind code of ref document: A1 |