CN109378016A - A VAD-based keyword recognition annotation method - Google Patents

A VAD-based keyword recognition annotation method

Info

Publication number
CN109378016A
CN109378016A
Authority
CN
China
Prior art keywords
vad
voice
vadreg
delay
len
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811179716.7A
Other languages
Chinese (zh)
Inventor
车云飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd filed Critical Sichuan Changhong Electric Co Ltd
Priority to CN201811179716.7A priority Critical patent/CN109378016A/en
Publication of CN109378016A publication Critical patent/CN109378016A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a VAD-based annotation method for keyword recognition. The original corpus data is split into frames: data of a fixed length is read at a given sample rate as one frame, each frame is fed into a VAD algorithm for a speech/non-speech decision, and the decision results are then smoothed by hangover processing. Let the VAD speech/non-speech decision for one frame of the original corpus be vadreg, where vadreg = 0 denotes non-speech and vadreg = 1 denotes speech; the corpus is annotated at the corresponding times according to the final output flag VAD of the VAD decision combined with the hangover decision. By combining VAD with a speech hangover decision, the method effectively turns tedious manual listening and labeling into manual proofreading, greatly saving labor and time costs while ensuring the validity of the labels.

Description

A VAD-based keyword recognition annotation method
Technical field
The present invention relates to the field of computer technology, and in particular to a VAD-based keyword recognition annotation method.
Background technique
In the field of keyword recognition, model training requires a large amount of original corpus data, and the keyword and non-keyword speech portions of that data must be annotated; the annotations are called "labels". A label corresponds to a time span in the corpus, so that training can extract the required data from the corresponding time points and ultimately build the model.
At present the main speech annotation approach in industry is manual labeling: a tool plays the corpus and displays its waveform, and a worker labels the data by listening and inspecting the waveform. Its advantage is that unusable or abnormal segments can be distinguished very accurately; its obvious disadvantage is extremely low efficiency. A single corpus-labeling job is often measured in person-months, consuming large amounts of labor and time.
Summary of the invention
To solve the above problems in the prior art, the object of the present invention is to provide a VAD-based keyword recognition annotation method. By combining VAD with a speech hangover decision, the method effectively turns tedious manual listening and labeling into manual proofreading, greatly saving labor and time while ensuring the validity of the labels.
To achieve the above object, the technical solution adopted by the present invention is: a VAD-based keyword recognition annotation method. The original corpus data is split into frames: data of a fixed length is read at a given sample rate as one frame, each frame is fed into the VAD algorithm for a VAD decision, and the decision results are then smoothed by hangover processing. Let the VAD speech/non-speech decision for one frame of the original corpus be vadreg, where vadreg = 0 denotes non-speech and vadreg = 1 denotes speech; the inputs to the hangover decision are then vadreg and the previous decision VAD'. The decision proceeds as follows:
I. Introduce variables burst_count and hang_count to count the numbers of consecutive frames with vadreg = 1 and with vadreg = 0 respectively, and use burst_len and hang_len to denote the speech-onset delay and the speech-offset delay;
II. If the previous decision is VAD' = 0 and the number burst_count of consecutive frames with vadreg = 1 reaches the speech-onset delay, i.e. burst_count >= burst_len, output the new decision VAD = 1; otherwise output VAD = 0;
III. If the previous decision is VAD' = 1 and the number hang_count of consecutive frames with vadreg = 0 reaches the speech-offset delay, i.e. hang_count >= hang_len, output the new decision VAD = 0; otherwise output VAD = 1;
IV. Annotate the corpus at the corresponding times according to the final output flag VAD of the VAD decision combined with the hangover decision, where VAD = 0 is labeled non-keyword and VAD = 1 is labeled keyword.
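The four steps above describe a two-state smoothing machine. Below is a minimal sketch, reusing the variable names from the text (vadreg, burst_count, hang_count, burst_len, hang_len); packaging the smoother as a per-frame class is an illustrative choice of ours, not something the patent specifies:

```python
class HangoverDecision:
    """Smooths raw per-frame VAD flags with burst/hang delays (steps I-IV)."""

    def __init__(self, burst_len=5, hang_len=7):
        self.burst_len = burst_len  # consecutive speech frames needed to enter speech state
        self.hang_len = hang_len    # consecutive non-speech frames needed to leave it
        self.vad = 0                # previous decision VAD'
        self.burst_count = 0        # run length of vadreg == 1 while in non-speech
        self.hang_count = 0         # run length of vadreg == 0 while in speech

    def step(self, vadreg):
        if self.vad == 0:
            # Step II: need burst_len consecutive speech frames to switch on.
            if vadreg == 1:
                self.burst_count += 1
                if self.burst_count >= self.burst_len:
                    self.vad = 1
                    self.hang_count = 0
            else:
                self.burst_count = 0
        else:
            # Step III: need hang_len consecutive non-speech frames to switch off.
            if vadreg == 0:
                self.hang_count += 1
                if self.hang_count >= self.hang_len:
                    self.vad = 0
                    self.burst_count = 0
            else:
                self.hang_count = 0
        return self.vad
```

Feeding the per-frame vadreg stream through step() yields the smoothed VAD flags used for labeling; isolated flips shorter than the configured delays no longer toggle the state.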
In a preferred embodiment, the speech-onset delay burst_len and the speech-offset delay hang_len have either fixed or dynamic lengths.
In another preferred embodiment, the speech-onset delay burst_len and the speech-offset delay hang_len have fixed lengths, with burst_len set to 5 frames and hang_len set to 7 frames; when the final VAD flag is written out, the hangover decision times are rolled back by the corresponding delay.
In another preferred embodiment, the VAD algorithm is a voice activity detection algorithm based on a recurrent neural network.
The beneficial effects of the present invention are: compared with traditional manual annotation, the present invention greatly improves labeling efficiency and converts the manual input from labeling into proofreading and error correction, greatly saving labor and time costs.
Brief description of the drawings
Fig. 1 is a flow diagram of an embodiment of the present invention;
Fig. 2 shows the demonstration effect of an embodiment of the present invention.
Specific embodiment
Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment
This embodiment uses voice activity detection (VAD), a relatively mature technology in the speech-signal field, combined with the widely used hangover scheme, to distinguish and annotate the keyword segments and non-speech portions of the original corpus to be labeled, so that the work of manual annotation can be replaced and labor and time saved. At present, the original-corpus annotation for keyword recognition technology in industry is mainly carried out manually.
VAD technology can very accurately distinguish the speech and non-speech portions of the original corpus. There are currently many kinds of VAD techniques, mainly divided into zero-crossing-rate detection, filtering, machine learning, and so on. The VAD used in this embodiment is the speech-presence-probability VAD from the open-source RNNoise project. Its principle is to use a recurrent neural network (RNN) for deep learning and modeling of various noises and speech; the corpus is analyzed by the RNN, which decides for each frame whether the data is speech or non-speech. With this algorithm, we can automatically classify the corpus into speech and non-speech.
The original corpus data is split into frames: data of a fixed length is read at a given sample rate as one frame, each frame is fed into the VAD algorithm for a VAD decision, and the decision results are then smoothed by hangover processing. Let the VAD speech/non-speech decision for one frame of the original corpus be vadreg, where vadreg = 0 denotes non-speech and vadreg = 1 denotes speech; the inputs to the hangover decision are then vadreg and the previous decision VAD'. The decision proceeds as follows:
I. Introduce variables burst_count and hang_count to count the numbers of consecutive frames with vadreg = 1 and with vadreg = 0 respectively, and use burst_len and hang_len to denote the speech-onset delay and the speech-offset delay;
II. If the previous decision is VAD' = 0 and the number burst_count of consecutive frames with vadreg = 1 reaches the speech-onset delay, i.e. burst_count >= burst_len, output the new decision VAD = 1; otherwise output VAD = 0;
III. If the previous decision is VAD' = 1 and the number hang_count of consecutive frames with vadreg = 0 reaches the speech-offset delay, i.e. hang_count >= hang_len, output the new decision VAD = 0; otherwise output VAD = 1;
IV. Annotate the corpus at the corresponding times according to the final output flag VAD of the VAD decision combined with the hangover decision, where VAD = 0 is labeled non-keyword and VAD = 1 is labeled keyword.
As shown in Fig. 1, the flow diagram of this embodiment, the embodiment is illustrated with an original corpus whose keyword is "Changhong Xiaobai". Suppose we automatically annotate a mono corpus with a sample rate of 48 kHz.
First the original corpus is split into frames: data of a fixed length (e.g. 16 ms) is read at the given sample rate as one frame. Each frame is fed into the VAD algorithm, which processes the frame and outputs a decision of speech or non-speech.
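As a rough illustration of this framing step: at 48 kHz with 16 ms frames, one frame spans 768 samples. The sketch below additionally assumes raw 16-bit mono PCM input, a detail the embodiment does not state:

```python
# Framing parameters taken from the embodiment (48 kHz mono, 16 ms frames);
# the 16-bit sample width is an assumption for illustration.
SAMPLE_RATE = 48_000                                  # Hz
FRAME_MS = 16                                         # frame length in milliseconds
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000    # 768 samples per frame
BYTES_PER_FRAME = SAMPLES_PER_FRAME * 2               # 1536 bytes at 16 bits/sample

def frames(pcm: bytes):
    """Yield successive fixed-length frames from raw 16-bit mono PCM."""
    for off in range(0, len(pcm) - BYTES_PER_FRAME + 1, BYTES_PER_FRAME):
        yield pcm[off:off + BYTES_PER_FRAME]
```

Each yielded frame would then be handed to the VAD algorithm for a speech/non-speech decision.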
Hangover decision processing is then applied according to whether the VAD result is speech or non-speech, and the processed result is used for automatic annotation. The basis of automatic annotation is the index of the current frame: multiplied by the frame length set during framing (e.g. 16 ms), it determines the start and end times of the label.
Hangover processing is a technique common in speech signal processing that models the start-stop gaps of human articulation in order to reduce the misclassification rate between speech and non-speech. Concretely, a "burst" decision is introduced in the transition from non-speech to speech: if the current state is non-speech, a certain number of consecutive frames must be judged speech before the state transitions from non-speech to speech; otherwise the non-speech state is kept. Similarly, a "hang" decision is introduced in the transition from speech to non-speech: if the current state is speech, a certain number of consecutive frames must be judged non-speech before the state transitions from speech to non-speech; otherwise the speech state is kept. The lengths of "burst" and "hang" may be fixed or dynamic. Based on feedback from experimental data, this embodiment uses fixed lengths in its implementation: the "burst" delay is 5 frames and the "hang" delay is 7 frames. When labeling, the hangover times are rolled back by the corresponding delay to guarantee the validity of the labels. The labels for the original corpus generated by the automatic annotation of this embodiment are inspected and proofread with the PC tool Audacity. Fig. 2 shows the demonstration effect of this embodiment: the label generated for an original corpus whose keyword is "Changhong Xiaobai", opened in Audacity.
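The "roll back by the delay" step can be read as subtracting the smoothing latency from each detected boundary. One plausible interpretation, using the embodiment's 16 ms frames and 5/7-frame delays; the exact rollback arithmetic is our assumption, not spelled out in the text:

```python
FRAME_MS = 16    # frame length from the embodiment
BURST_LEN = 5    # "burst" (speech-onset) delay, in frames
HANG_LEN = 7     # "hang" (speech-offset) delay, in frames

def rolled_back_start(flip_frame: int) -> float:
    """Seconds at which speech plausibly began, given the frame where the
    smoothed VAD flipped to 1: the onset was BURST_LEN - 1 frames earlier."""
    return max(0, flip_frame - (BURST_LEN - 1)) * FRAME_MS / 1000.0

def rolled_back_end(flip_frame: int) -> float:
    """Seconds at which speech plausibly ended, given the frame where the
    smoothed VAD flipped to 0: the offset was HANG_LEN - 1 frames earlier."""
    return max(0, flip_frame - (HANG_LEN - 1)) * FRAME_MS / 1000.0
```

For example, a speech decision first emitted at frame 10 would be rolled back to 6 x 16 ms = 0.096 s under this reading.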
The labels themselves follow the common corpus-labeling convention: "start time + end time + keyword pinyin" is recorded in a .txt file with the same name as the original corpus.
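A sketch of writing such a label file, assuming tab-separated "start end pinyin" lines (the layout Audacity uses for label tracks) and a hypothetical write_labels helper; neither detail is prescribed by the text:

```python
from pathlib import Path

def write_labels(corpus_path: str, segments):
    """Write (start_sec, end_sec, pinyin) tuples to a .txt file sharing the
    corpus file's name, one tab-separated label per line."""
    out = Path(corpus_path).with_suffix(".txt")
    lines = [f"{start:.3f}\t{end:.3f}\t{pinyin}" for start, end, pinyin in segments]
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

The resulting .txt can be imported into Audacity as a label track for the proofreading pass described above.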
This embodiment uses a recurrent-neural-network voice activity detection algorithm to perform the voice activity detection of the original corpus.
The above embodiment merely expresses a specific implementation of the present invention; its description is relatively specific and detailed, but it cannot therefore be construed as limiting the scope of the present patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention.

Claims (4)

1. A VAD-based keyword recognition annotation method, characterized in that: the original corpus data is split into frames, data of a fixed length being read at a given sample rate as one frame; each frame is fed into the VAD algorithm for a VAD decision; the decision results are then smoothed by hangover processing: let the VAD speech/non-speech decision for one frame of the original corpus be vadreg, where vadreg = 0 denotes non-speech and vadreg = 1 denotes speech; the inputs to the hangover decision are then vadreg and the previous decision VAD'; the decision proceeds as follows:
I. Introduce variables burst_count and hang_count to count the numbers of consecutive frames with vadreg = 1 and with vadreg = 0 respectively, and use burst_len and hang_len to denote the speech-onset delay and the speech-offset delay;
II. If the previous decision is VAD' = 0 and the number burst_count of consecutive frames with vadreg = 1 reaches the speech-onset delay, i.e. burst_count >= burst_len, output the new decision VAD = 1; otherwise output VAD = 0;
III. If the previous decision is VAD' = 1 and the number hang_count of consecutive frames with vadreg = 0 reaches the speech-offset delay, i.e. hang_count >= hang_len, output the new decision VAD = 0; otherwise output VAD = 1;
IV. Annotate the corpus at the corresponding times according to the final output flag VAD of the VAD decision combined with the hangover decision, where VAD = 0 is labeled non-keyword and VAD = 1 is labeled keyword.
2. The VAD-based keyword recognition annotation method according to claim 1, characterized in that the speech-onset delay burst_len and the speech-offset delay hang_len have either fixed or dynamic lengths.
3. The VAD-based keyword recognition annotation method according to claim 2, characterized in that the speech-onset delay burst_len and the speech-offset delay hang_len have fixed lengths, burst_len being 5 frames and hang_len being 7 frames, and that the hangover decision times are rolled back by the corresponding delay when the final VAD flag is written out.
4. The VAD-based keyword recognition annotation method according to any one of claims 1-3, characterized in that the VAD algorithm is a voice activity detection algorithm based on a recurrent neural network.
CN201811179716.7A 2018-10-10 2018-10-10 A VAD-based keyword recognition annotation method Pending CN109378016A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811179716.7A CN109378016A (en) 2018-10-10 2018-10-10 A VAD-based keyword recognition annotation method


Publications (1)

Publication Number Publication Date
CN109378016A true CN109378016A (en) 2019-02-22

Family

ID=65404041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811179716.7A Pending CN109378016A (en) A VAD-based keyword recognition annotation method

Country Status (1)

Country Link
CN (1) CN109378016A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110010153A (en) * 2019-03-25 2019-07-12 平安科技(深圳)有限公司 A neural-network-based silence detection method, terminal device, and medium
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model
CN112420070A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Automatic labeling method and device, electronic equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020120440A1 (en) * 2000-12-28 2002-08-29 Shude Zhang Method and apparatus for improved voice activity detection in a packet voice network
CN104409080A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Voice endpoint detection method and device
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 A speech endpoint detection method and speech recognition method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020120440A1 (en) * 2000-12-28 2002-08-29 Shude Zhang Method and apparatus for improved voice activity detection in a packet voice network
CN104409080A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Voice endpoint detection method and device
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 A speech endpoint detection method and speech recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Ming: "Research on Voice Activity Detection Algorithms", Wanfang Database *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110010153A (en) * 2019-03-25 2019-07-12 平安科技(深圳)有限公司 A neural-network-based silence detection method, terminal device, and medium
CN112420070A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Automatic labeling method and device, electronic equipment and computer readable storage medium
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model
CN110930997B (en) * 2019-12-10 2022-08-16 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model

Similar Documents

Publication Publication Date Title
CN110364171B (en) Voice recognition method, voice recognition system and storage medium
CN103559894B (en) Oral evaluation method and system
CN109378016A (en) A VAD-based keyword recognition annotation method
CN108428448A (en) A speech endpoint detection method and speech recognition method
CN111341305B (en) Audio data labeling method, device and system
CN101751919B (en) Spoken Chinese stress automatic detection method
CN103761975B (en) Method and device for oral evaluation
CN109801628B (en) Corpus collection method, apparatus and system
CN112580367B (en) Telephone traffic quality inspection method and device
CN105427858A (en) Method and system for achieving automatic voice classification
CN113468296B (en) Model self-iteration type intelligent customer service quality inspection system and method capable of configuring business logic
CN110648691B (en) Emotion recognition method, device and system based on energy value of voice
Audhkhasi et al. Formant-based technique for automatic filled-pause detection in spontaneous spoken English
CN102568475A (en) System and method for assessing proficiency in Putonghua
CN112951275B (en) Voice quality inspection method and device, electronic equipment and medium
CN103177733A (en) Method and system for evaluating Chinese mandarin retroflex suffixation pronunciation quality
Howell et al. Development of a two-stage procedure for the automatic recognition of dysfluencies in the speech of children who stutter: I. Psychometric procedures appropriate for selection of training material for lexical dysfluency classifiers
CN108549628A (en) Punctuation device and method for streaming natural-language information
CN104464755A (en) Voice evaluation method and device
CN109460558B (en) Effect judging method of voice translation system
CN110992959A (en) Voice recognition method and system
CN110888989A (en) Intelligent learning platform and construction method thereof
CN110196897B (en) Case identification method based on question and answer template
CN109213970B (en) Method and device for generating notes
CN111933120A (en) Voice data automatic labeling method and system for voice recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190222