CN109378016A - A VAD-based keyword recognition annotation method - Google Patents
A VAD-based keyword recognition annotation method
- Publication number
- CN109378016A
- Application number
- CN201811179716.7A
- Authority
- CN
- China
- Prior art keywords
- vad
- voice
- vadreg
- delay
- len
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L25/78 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00: detection of presence or absence of voice signals
- G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique, using neural networks
- G10L25/51 — Speech or voice analysis techniques specially adapted for particular use, for comparison or discrimination
Abstract
The invention discloses a VAD-based keyword recognition annotation method. Raw corpus data is split into frames: data of a given length is read at a given sample rate as one frame, each frame is fed into a VAD algorithm for a speech/non-speech decision, and the decisions are then smoothed by hangover processing. Let vadreg denote the VAD decision for one frame of raw corpus data, where vadreg = 0 indicates non-speech and vadreg = 1 indicates speech; the corpus is then annotated at the corresponding times according to VAD, the final output of the VAD decision combined with the hangover decision. By combining VAD with the speech hangover decision, the method effectively turns tedious manual listening and manual annotation into manual proofreading, greatly saving labor and time while keeping the annotations valid.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a VAD-based keyword recognition annotation method.
Background technique
In the field of keyword recognition, model training for keywords demands a large amount of raw corpus data, and that data must be annotated into keyword and non-keyword speech segments; these annotations are called "labels". Labels correspond to times in the corpus, so that model training can extract the required data from the corresponding time points and ultimately train the model.

At present the main speech annotation approach in industry is manual labeling: a tool plays the corpus and displays its waveform, and a worker annotates it by listening and by reading the waveform display. Its advantage is that unusable segments or abnormal segments with errors can be distinguished very accurately; its obvious disadvantage is extremely low efficiency — a corpus-labeling job is often measured in person-months, consuming a great deal of labor and time.
Summary of the invention
To solve the problems in the prior art, the object of the present invention is to provide a VAD-based keyword recognition annotation method. Combined with the speech hangover decision, the method effectively turns tedious manual listening and manual annotation into manual proofreading, greatly saving labor and time while keeping the annotations valid.
To achieve the above object, the technical solution adopted by the present invention is: a VAD-based keyword recognition annotation method. Raw corpus data is split into frames, with data of a given length read at a given sample rate as one frame; each frame is fed into the VAD algorithm for a VAD decision, and the decisions are then processed by hangover. Let vadreg denote the VAD decision on one frame of raw corpus data being speech or non-speech, where vadreg = 0 indicates non-speech and vadreg = 1 indicates speech. The inputs to the hangover decision are then vadreg and the previous decision VAD', and the decision proceeds as follows (a code sketch is given after the steps):

I. Introduce variables burst_count and hang_count, recording the number of consecutive frames with vadreg = 1 and with vadreg = 0 respectively, and use burst_len and hang_len to denote the speech-onset delay and the speech-end delay respectively.

II. If the previous decision is VAD' = 0: when the number of consecutive frames with vadreg = 1 reaches the speech-onset delay, i.e. when burst_count >= burst_len, output the new decision VAD = 1; otherwise output VAD = 0.

III. If the previous decision is VAD' = 1: when the count hang_count of consecutive frames with vadreg = 0 reaches the speech-end delay, i.e. when hang_count >= hang_len, output the new decision VAD = 0; otherwise output VAD = 1.

IV. Annotate the corpus at the corresponding times according to VAD, the final output of the VAD decision combined with the hangover decision, where VAD = 0 is labeled non-keyword and VAD = 1 is labeled keyword.
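The hangover decision above is a small two-state machine over the per-frame vadreg stream. The following is a minimal sketch of steps I-IV, not code from the patent itself; the variable names follow the text, and resetting the counters on a state change is an implementation assumption.

```python
def hangover(vadreg_stream, burst_len=5, hang_len=7):
    """Smooth raw per-frame VAD decisions (vadreg: 1 = speech, 0 = non-speech)
    with the burst/hang logic of steps I-IV, yielding the final VAD per frame."""
    vad_prev = 0      # previous decision VAD'
    burst_count = 0   # consecutive frames with vadreg == 1
    hang_count = 0    # consecutive frames with vadreg == 0
    for vadreg in vadreg_stream:
        if vad_prev == 0:
            # Step II: switch to speech only after burst_len consecutive speech frames.
            burst_count = burst_count + 1 if vadreg == 1 else 0
            vad = 1 if burst_count >= burst_len else 0
        else:
            # Step III: switch to non-speech only after hang_len consecutive silence frames.
            hang_count = hang_count + 1 if vadreg == 0 else 0
            vad = 0 if hang_count >= hang_len else 1
        if vad != vad_prev:
            burst_count = hang_count = 0  # reset counters on a state change
        vad_prev = vad
        yield vad  # step IV: 1 -> keyword segment, 0 -> non-keyword
```

Because the output flips only after burst_len (or hang_len) frames of evidence, the marked boundary lags the true onset or end; the preferred embodiment below compensates by rolling the marked times back by the corresponding delay.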
In a preferred embodiment, the lengths of the speech-onset delay burst_len and the speech-end delay hang_len are either fixed or dynamic.

In another preferred embodiment, the lengths of the speech-onset delay burst_len and the speech-end delay hang_len are fixed: the burst_len delay is 5 frames and the hang_len delay is 7 frames, and when VAD is finally marked, the times produced by the hangover decision are rolled back by the corresponding delay.

In another preferred embodiment, the VAD algorithm is a voice activity detection algorithm based on a recurrent neural network.
The beneficial effects of the present invention are: compared with traditional manual annotation, the present invention achieves a huge gain in labeling efficiency, and converts the manual effort of annotation into manual proofreading and error correction, greatly saving labor and time costs.
Brief description of the drawings
Fig. 1 is a flow chart of the embodiment of the present invention;
Fig. 2 shows the annotation result of the embodiment.
Detailed description of the embodiments
The embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment
This embodiment applies voice activity detection (VAD), a relatively mature technology in the field of speech signal processing, combined with the widely used hangover scheme, to distinguish and annotate the keyword segments and the non-speech portions in the raw corpus to be labeled. It can replace the work of manual annotation and save labor and time. At present, the raw-corpus annotation for keyword recognition in industry is mainly done manually.

VAD technology can very accurately distinguish the speech and non-speech portions of a raw corpus. There are many kinds of VAD today, broadly divided into zero-crossing-rate detection, filtering, machine learning, and so on. The VAD used in this embodiment is the speech-presence-probability VAD from the RNNoise open-source project. Its principle is to use a recurrent neural network (RNN) to deeply learn and model various noises and speech; the corpus is analyzed with the RNN, which classifies each frame of data as speech or non-speech. With this algorithm, the corpus can be automatically divided into speech and non-speech.
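As one way to obtain the per-frame speech/non-speech stream from RNNoise, the sketch below calls librnnoise through ctypes. It is an illustration under assumptions, not part of the patent: it assumes librnnoise has been built from the RNNoise project and is on the loader path, and it uses a 0.5 probability threshold that the text does not specify. Note also that rnnoise_process_frame works on 480-sample (10 ms at 48 kHz) frames, while the embodiment below describes 16 ms frames; the text does not say how the two are reconciled.

```python
import ctypes
import numpy as np

FRAME_SIZE = 480  # RNNoise frame: 10 ms at 48 kHz

_lib = ctypes.cdll.LoadLibrary("librnnoise.so")  # assumes librnnoise is built and findable
_lib.rnnoise_create.restype = ctypes.c_void_p
_lib.rnnoise_create.argtypes = [ctypes.c_void_p]
_lib.rnnoise_process_frame.restype = ctypes.c_float
_lib.rnnoise_process_frame.argtypes = [ctypes.c_void_p,
                                       ctypes.POINTER(ctypes.c_float),
                                       ctypes.POINTER(ctypes.c_float)]

def vadreg_stream(samples, threshold=0.5):
    """Yield vadreg (1 = speech, 0 = non-speech) per 480-sample frame.

    `samples` is a 1-D array of 48 kHz PCM samples scaled to the 16-bit
    range, as RNNoise expects; `threshold` is an assumed cut-off."""
    state = _lib.rnnoise_create(None)
    out = np.empty(FRAME_SIZE, dtype=np.float32)  # denoised output, unused here
    fptr = ctypes.POINTER(ctypes.c_float)
    for i in range(0, len(samples) - FRAME_SIZE + 1, FRAME_SIZE):
        frame = np.ascontiguousarray(samples[i:i + FRAME_SIZE], dtype=np.float32)
        # rnnoise_process_frame returns the voice probability for this frame
        prob = _lib.rnnoise_process_frame(state, out.ctypes.data_as(fptr),
                                          frame.ctypes.data_as(fptr))
        yield 1 if prob > threshold else 0
```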
The raw corpus data is split into frames, with data of a given length read at a given sample rate as one frame; each frame is fed into the VAD algorithm for a VAD decision, and the decisions are then processed by hangover. Let vadreg denote the VAD decision on one frame of raw corpus data being speech or non-speech, where vadreg = 0 indicates non-speech and vadreg = 1 indicates speech. The inputs to the hangover decision are then vadreg and the previous decision VAD', and the decision proceeds as follows:

I. Introduce variables burst_count and hang_count, recording the number of consecutive frames with vadreg = 1 and with vadreg = 0 respectively, and use burst_len and hang_len to denote the speech-onset delay and the speech-end delay respectively.

II. If the previous decision is VAD' = 0: when the number of consecutive frames with vadreg = 1 reaches the speech-onset delay, i.e. when burst_count >= burst_len, output the new decision VAD = 1; otherwise output VAD = 0.

III. If the previous decision is VAD' = 1: when the count hang_count of consecutive frames with vadreg = 0 reaches the speech-end delay, i.e. when hang_count >= hang_len, output the new decision VAD = 0; otherwise output VAD = 1.

IV. Annotate the corpus at the corresponding times according to VAD, the final output of the VAD decision combined with the hangover decision, where VAD = 0 is labeled non-keyword and VAD = 1 is labeled keyword.
As shown in Fig. 1, the flow chart of this embodiment, the method is illustrated with a raw corpus whose keyword is "Changhong Xiaobai": suppose we automatically annotate a single-channel corpus with a sample rate of 48 kHz.

The raw corpus is first split into frames, with data of a given length (e.g. 16 ms) read at the given sample rate as one frame. Each frame is fed into the VAD algorithm, which processes the frame and outputs a speech or non-speech decision.

Hangover processing is then applied according to whether the VAD decision is speech or non-speech, and the processed result drives the automatic annotation. The basis of automatic annotation is the index of the current frame: multiplying it by the frame length set at framing time (e.g. 16 ms) determines the start and end times to mark, as in the sketch below.
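The frame-to-time bookkeeping is plain arithmetic. Here is a minimal sketch under the embodiment's example settings (16 ms frames); the function name is illustrative, not from the patent.

```python
FRAME_MS = 16  # frame length set at framing time (embodiment example)

def frame_span_ms(start_frame, end_frame, frame_ms=FRAME_MS):
    """Map an inclusive range of frame indices to (start, end) times in ms."""
    return start_frame * frame_ms, (end_frame + 1) * frame_ms

# e.g. a keyword spanning frames 100..160 -> (1600 ms, 2576 ms)
```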
Hangover processing is a processing method, common in speech signal processing, that simulates the natural start-stop gaps of human articulation in order to reduce the misclassification rate between speech and non-speech. Concretely, a "burst" decision is introduced at the transition stage from non-speech to speech: if the current state is non-speech, a certain number of consecutive frames must be judged speech before the state transitions from non-speech to speech, and otherwise the non-speech state is kept. Similarly, a "hang" decision is introduced at the transition stage from speech to non-speech: if the current state is speech, a certain number of consecutive frames must be judged non-speech before the state transitions from speech to non-speech, and otherwise the speech state is kept. The choice of the "burst" and "hang" lengths may be fixed or dynamic. Based on feedback from experimental data, this embodiment uses fixed lengths: the "burst" delay is 5 frames and the "hang" delay is 7 frames. When marking, the hangover times are rolled back by the delay to guarantee the validity of the annotation. The labels that this embodiment's automatic annotation generates for the raw corpus are inspected and proofread with the PC tool Audacity. Fig. 2 shows the result: a raw corpus with the keyword "Changhong Xiaobai" and the generated labels opened in Audacity.

The annotation itself uses the common corpus-labeling convention: "start and end time + keyword spelling" is recorded in a txt file with the same name as the raw corpus, as sketched below.
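Audacity reads label tracks as tab-separated lines of start time, end time, and label text, in seconds, so a minimal writer for these records could look as follows; the segment tuple format and the example keyword string are assumptions for illustration, not specified by the patent.

```python
import os

def write_labels(segments, corpus_path):
    """Write (start_s, end_s, text) segments to a txt file with the same
    name as the corpus, in Audacity's tab-separated label-track format."""
    label_path = os.path.splitext(corpus_path)[0] + ".txt"
    with open(label_path, "w", encoding="utf-8") as f:
        for start_s, end_s, text in segments:
            f.write(f"{start_s:.3f}\t{end_s:.3f}\t{text}\n")

# e.g. write_labels([(1.600, 2.576, "changhong xiaobai")], "corpus.wav")
```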
This embodiment uses a recurrent-neural-network voice activity detection algorithm to perform voice activity detection on the raw corpus.
The above embodiment only expresses a specific implementation of the present invention, and its description is comparatively specific and detailed, but it cannot therefore be construed as limiting the scope of this patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the invention, and these all fall within the protection scope of the present invention.
Claims (4)
1. A VAD-based keyword recognition annotation method, characterized in that raw corpus data is split into frames, data of a given length is read at a given sample rate as one frame, each frame is fed into a VAD algorithm for a VAD decision, and the decision is then processed by hangover: let vadreg denote the VAD decision on the speech or non-speech of one frame of raw corpus data, where vadreg = 0 indicates non-speech and vadreg = 1 indicates speech; the inputs to the hangover decision are then vadreg and the previous decision VAD', and the decision is as follows:
I. introduce variables burst_count and hang_count, recording the number of consecutive frames with vadreg = 1 and with vadreg = 0 respectively, and use burst_len and hang_len to denote the speech-onset delay and the speech-end delay respectively;
II. if the previous decision VAD' = 0: when the number of consecutive frames with vadreg = 1, burst_count, reaches the speech-onset delay, i.e. when burst_count >= burst_len, output the new decision VAD = 1, and otherwise output VAD = 0;
III. if the previous decision VAD' = 1: when the number of consecutive frames with vadreg = 0, hang_count, reaches the speech-end delay, i.e. when hang_count >= hang_len, output the new decision VAD = 0, and otherwise output VAD = 1;
IV. annotate the corpus at the corresponding times according to VAD, the final output of the VAD decision combined with the hangover decision, where VAD = 0 is labeled non-keyword and VAD = 1 is labeled keyword.
2. The VAD-based keyword recognition annotation method according to claim 1, characterized in that the lengths of the speech-onset delay burst_len and the speech-end delay hang_len are fixed or dynamic.
3. The VAD-based keyword recognition annotation method according to claim 2, characterized in that the lengths of the speech-onset delay burst_len and the speech-end delay hang_len are fixed, the burst_len delay is 5 frames, the hang_len delay is 7 frames, and when VAD is finally marked the times of the hangover decision are rolled back by the delay.
4. The VAD-based keyword recognition annotation method according to any one of claims 1-3, characterized in that the VAD algorithm is a voice activity detection algorithm based on a recurrent neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811179716.7A | 2018-10-10 | 2018-10-10 | A VAD-based keyword recognition annotation method
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811179716.7A | 2018-10-10 | 2018-10-10 | A VAD-based keyword recognition annotation method
Publications (1)
Publication Number | Publication Date |
---|---|
CN109378016A (en) | 2019-02-22
Family
ID=65404041
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811179716.7A | A VAD-based keyword recognition annotation method | 2018-10-10 | 2018-10-10
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109378016A (en) |
2018
- 2018-10-10: CN patent application CN201811179716.7A filed; publication CN109378016A, status pending
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020120440A1 (en) * | 2000-12-28 | 2002-08-29 | Shude Zhang | Method and apparatus for improved voice activity detection in a packet voice network |
CN104409080A (en) * | 2014-12-15 | 2015-03-11 | 北京国双科技有限公司 | Voice end node detection method and device |
CN108428448A (en) * | 2017-02-13 | 2018-08-21 | 芋头科技(杭州)有限公司 | A kind of sound end detecting method and audio recognition method |
Non-Patent Citations (1)
Title |
---|
陈明 (Chen Ming): "语音活动检测的算法研究" [Research on Voice Activity Detection Algorithms], 《万方数据库》 [Wanfang Data] *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110010153A (en) * | 2019-03-25 | 2019-07-12 | 平安科技(深圳)有限公司 | A kind of mute detection method neural network based, terminal device and medium |
CN112420070A (en) * | 2019-08-22 | 2021-02-26 | 北京峰趣互联网信息服务有限公司 | Automatic labeling method and device, electronic equipment and computer readable storage medium |
CN110930997A (en) * | 2019-12-10 | 2020-03-27 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
CN110930997B (en) * | 2019-12-10 | 2022-08-16 | 四川长虹电器股份有限公司 | Method for labeling audio by using deep learning model |
Legal Events

Date | Code | Title | Description
---|---|---|---
2019-02-22 | PB01 | Publication | Application publication date: 2019-02-22
 | SE01 | Entry into force of request for substantive examination |
 | RJ01 | Rejection of invention patent application after publication |