CN109378016A - A VAD-based keyword recognition annotation method - Google Patents

A VAD-based keyword recognition annotation method

Info

Publication number
CN109378016A
CN109378016A
Authority
CN
China
Prior art keywords
vad
voice
vadreg
delay
len
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811179716.7A
Other languages
Chinese (zh)
Inventor
车云飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd filed Critical Sichuan Changhong Electric Co Ltd
Priority to CN201811179716.7A priority Critical patent/CN109378016A/en
Publication of CN109378016A publication Critical patent/CN109378016A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a VAD-based annotation method for keyword recognition. The original corpus data is split into frames: data of a fixed length is read at a given sample rate as one frame, each frame is fed into a VAD algorithm for a speech/non-speech decision, and the decision results are then smoothed by hangover processing. Let the VAD speech/non-speech decision for one frame of the original corpus be vadreg, where vadreg = 0 denotes non-speech and vadreg = 1 denotes speech; the corpus is annotated at the corresponding times according to the final output flag VAD of the VAD decision combined with the hangover decision. By combining VAD with a speech hangover decision, the method effectively turns tedious manual listening and labeling into manual proofreading, greatly saving labor and time costs while ensuring the validity of the labels.

Description

A VAD-based keyword recognition annotation method
Technical field
The present invention relates to the field of computer technology, and in particular to a VAD-based keyword recognition annotation method.
Background technique
In the field of keyword recognition, model training requires a large amount of original corpus data, and the keyword and non-keyword speech portions of that data must be annotated; the annotations are called "labels". A label corresponds to a time span in the corpus, so that training can extract the required data from the corresponding time points and ultimately build the model.
At present the main speech annotation approach in industry is manual labeling: a tool plays the corpus and displays its waveform, and a worker labels the data by listening and inspecting the waveform. Its advantage is that unusable or abnormal segments can be distinguished very accurately; its obvious disadvantage is extremely low efficiency. A single corpus-labeling job is often measured in person-months, consuming large amounts of labor and time.
Summary of the invention
To solve the above problems in the prior art, the object of the present invention is to provide a VAD-based keyword recognition annotation method. By combining VAD with a speech hangover decision, the method effectively turns tedious manual listening and labeling into manual proofreading, greatly saving labor and time while ensuring the validity of the labels.
To achieve the above object, the technical solution adopted by the present invention is: a VAD-based keyword recognition annotation method. The original corpus data is split into frames: data of a fixed length is read at a given sample rate as one frame, each frame is fed into the VAD algorithm for a VAD decision, and the decision results are then smoothed by hangover processing. Let the VAD speech/non-speech decision for one frame of the original corpus be vadreg, where vadreg = 0 denotes non-speech and vadreg = 1 denotes speech; the inputs to the hangover decision are then vadreg and the previous decision VAD'. The decision proceeds as follows:
I. Introduce variables burst_count and hang_count to count the numbers of consecutive frames with vadreg = 1 and with vadreg = 0 respectively, and use burst_len and hang_len to denote the speech-onset delay and the speech-offset delay;
II. If the previous decision is VAD' = 0 and the number burst_count of consecutive frames with vadreg = 1 reaches the speech-onset delay, i.e. burst_count >= burst_len, output the new decision VAD = 1; otherwise output VAD = 0;
III. If the previous decision is VAD' = 1 and the number hang_count of consecutive frames with vadreg = 0 reaches the speech-offset delay, i.e. hang_count >= hang_len, output the new decision VAD = 0; otherwise output VAD = 1;
IV. Annotate the corpus at the corresponding times according to the final output flag VAD of the VAD decision combined with the hangover decision, where VAD = 0 is labeled non-keyword and VAD = 1 is labeled keyword.
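The four steps above describe a two-state smoothing machine. Below is a minimal sketch, reusing the variable names from the text (vadreg, burst_count, hang_count, burst_len, hang_len); packaging the smoother as a per-frame class is an illustrative choice of ours, not something the patent specifies:

```python
class HangoverDecision:
    """Smooths raw per-frame VAD flags with burst/hang delays (steps I-IV)."""

    def __init__(self, burst_len=5, hang_len=7):
        self.burst_len = burst_len  # consecutive speech frames needed to enter speech state
        self.hang_len = hang_len    # consecutive non-speech frames needed to leave it
        self.vad = 0                # previous decision VAD'
        self.burst_count = 0        # run length of vadreg == 1 while in non-speech
        self.hang_count = 0         # run length of vadreg == 0 while in speech

    def step(self, vadreg):
        if self.vad == 0:
            # Step II: need burst_len consecutive speech frames to switch on.
            if vadreg == 1:
                self.burst_count += 1
                if self.burst_count >= self.burst_len:
                    self.vad = 1
                    self.hang_count = 0
            else:
                self.burst_count = 0
        else:
            # Step III: need hang_len consecutive non-speech frames to switch off.
            if vadreg == 0:
                self.hang_count += 1
                if self.hang_count >= self.hang_len:
                    self.vad = 0
                    self.burst_count = 0
            else:
                self.hang_count = 0
        return self.vad
```

Feeding the per-frame vadreg stream through step() yields the smoothed VAD flags used for labeling; isolated flips shorter than the configured delays no longer toggle the state.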
In a preferred embodiment, the speech-onset delay burst_len and the speech-offset delay hang_len have either fixed or dynamic lengths.
In another preferred embodiment, the speech-onset delay burst_len and the speech-offset delay hang_len have fixed lengths, with burst_len set to 5 frames and hang_len set to 7 frames; when the final VAD flag is written out, the hangover decision times are rolled back by the corresponding delay.
In another preferred embodiment, the VAD algorithm is a voice activity detection algorithm based on a recurrent neural network.
The beneficial effects of the present invention are: compared with traditional manual annotation, the present invention greatly improves labeling efficiency and converts the manual input from labeling into proofreading and error correction, greatly saving labor and time costs.
Brief description of the drawings
Fig. 1 is a flow diagram of an embodiment of the present invention;
Fig. 2 shows the demonstration effect of an embodiment of the present invention.
Specific embodiment
Embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment
This embodiment uses voice activity detection (VAD), a relatively mature technology in the speech-signal field, combined with the widely used hangover scheme, to distinguish and annotate the keyword segments and non-speech portions of the original corpus to be labeled, so that the work of manual annotation can be replaced and labor and time saved. At present, the original-corpus annotation for keyword recognition technology in industry is mainly carried out manually.
VAD technology can very accurately distinguish the speech and non-speech portions of the original corpus. There are currently many kinds of VAD techniques, mainly divided into zero-crossing-rate detection, filtering, machine learning, and so on. The VAD used in this embodiment is the speech-presence-probability VAD from the open-source RNNoise project. Its principle is to use a recurrent neural network (RNN) for deep learning and modeling of various noises and speech; the corpus is analyzed by the RNN, which decides for each frame whether the data is speech or non-speech. With this algorithm, we can automatically classify the corpus into speech and non-speech.
The original corpus data is split into frames: data of a fixed length is read at a given sample rate as one frame, each frame is fed into the VAD algorithm for a VAD decision, and the decision results are then smoothed by hangover processing. Let the VAD speech/non-speech decision for one frame of the original corpus be vadreg, where vadreg = 0 denotes non-speech and vadreg = 1 denotes speech; the inputs to the hangover decision are then vadreg and the previous decision VAD'. The decision proceeds as follows:
I. Introduce variables burst_count and hang_count to count the numbers of consecutive frames with vadreg = 1 and with vadreg = 0 respectively, and use burst_len and hang_len to denote the speech-onset delay and the speech-offset delay;
II. If the previous decision is VAD' = 0 and the number burst_count of consecutive frames with vadreg = 1 reaches the speech-onset delay, i.e. burst_count >= burst_len, output the new decision VAD = 1; otherwise output VAD = 0;
III. If the previous decision is VAD' = 1 and the number hang_count of consecutive frames with vadreg = 0 reaches the speech-offset delay, i.e. hang_count >= hang_len, output the new decision VAD = 0; otherwise output VAD = 1;
IV. Annotate the corpus at the corresponding times according to the final output flag VAD of the VAD decision combined with the hangover decision, where VAD = 0 is labeled non-keyword and VAD = 1 is labeled keyword.
As shown in Fig. 1, the flow diagram of this embodiment, the embodiment is illustrated with an original corpus whose keyword is "Changhong Xiaobai". Suppose we automatically annotate a mono corpus with a sample rate of 48 kHz.
First the original corpus is split into frames: data of a fixed length (e.g. 16 ms) is read at the given sample rate as one frame. Each frame is fed into the VAD algorithm, which processes the frame and outputs a decision of speech or non-speech.
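As a rough illustration of this framing step: at 48 kHz with 16 ms frames, one frame spans 768 samples. The sketch below additionally assumes raw 16-bit mono PCM input, a detail the embodiment does not state:

```python
# Framing parameters taken from the embodiment (48 kHz mono, 16 ms frames);
# the 16-bit sample width is an assumption for illustration.
SAMPLE_RATE = 48_000                                  # Hz
FRAME_MS = 16                                         # frame length in milliseconds
SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000    # 768 samples per frame
BYTES_PER_FRAME = SAMPLES_PER_FRAME * 2               # 1536 bytes at 16 bits/sample

def frames(pcm: bytes):
    """Yield successive fixed-length frames from raw 16-bit mono PCM."""
    for off in range(0, len(pcm) - BYTES_PER_FRAME + 1, BYTES_PER_FRAME):
        yield pcm[off:off + BYTES_PER_FRAME]
```

Each yielded frame would then be handed to the VAD algorithm for a speech/non-speech decision.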
Hangover decision processing is then applied according to whether the VAD result is speech or non-speech, and the processed result is used for automatic annotation. The basis of automatic annotation is the index of the current frame: multiplied by the frame length set during framing (e.g. 16 ms), it determines the start and end times of the label.
Hangover processing is a technique common in speech signal processing that models the start-stop gaps of human articulation in order to reduce the misclassification rate between speech and non-speech. Concretely, a "burst" decision is introduced in the transition from non-speech to speech: if the current state is non-speech, a certain number of consecutive frames must be judged speech before the state transitions from non-speech to speech; otherwise the non-speech state is kept. Similarly, a "hang" decision is introduced in the transition from speech to non-speech: if the current state is speech, a certain number of consecutive frames must be judged non-speech before the state transitions from speech to non-speech; otherwise the speech state is kept. The lengths of "burst" and "hang" may be fixed or dynamic. Based on feedback from experimental data, this embodiment uses fixed lengths in its implementation: the "burst" delay is 5 frames and the "hang" delay is 7 frames. When labeling, the hangover times are rolled back by the corresponding delay to guarantee the validity of the labels. The labels for the original corpus generated by the automatic annotation of this embodiment are inspected and proofread with the PC tool Audacity. Fig. 2 shows the demonstration effect of this embodiment: the label generated for an original corpus whose keyword is "Changhong Xiaobai", opened in Audacity.
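The "roll back by the delay" step can be read as subtracting the smoothing latency from each detected boundary. One plausible interpretation, using the embodiment's 16 ms frames and 5/7-frame delays; the exact rollback arithmetic is our assumption, not spelled out in the text:

```python
FRAME_MS = 16    # frame length from the embodiment
BURST_LEN = 5    # "burst" (speech-onset) delay, in frames
HANG_LEN = 7     # "hang" (speech-offset) delay, in frames

def rolled_back_start(flip_frame: int) -> float:
    """Seconds at which speech plausibly began, given the frame where the
    smoothed VAD flipped to 1: the onset was BURST_LEN - 1 frames earlier."""
    return max(0, flip_frame - (BURST_LEN - 1)) * FRAME_MS / 1000.0

def rolled_back_end(flip_frame: int) -> float:
    """Seconds at which speech plausibly ended, given the frame where the
    smoothed VAD flipped to 0: the offset was HANG_LEN - 1 frames earlier."""
    return max(0, flip_frame - (HANG_LEN - 1)) * FRAME_MS / 1000.0
```

For example, a speech decision first emitted at frame 10 would be rolled back to 6 x 16 ms = 0.096 s under this reading.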
The labels themselves follow the common corpus-labeling convention: "start time + end time + keyword pinyin" is recorded in a .txt file with the same name as the original corpus.
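A sketch of writing such a label file, assuming tab-separated "start end pinyin" lines (the layout Audacity uses for label tracks) and a hypothetical write_labels helper; neither detail is prescribed by the text:

```python
from pathlib import Path

def write_labels(corpus_path: str, segments):
    """Write (start_sec, end_sec, pinyin) tuples to a .txt file sharing the
    corpus file's name, one tab-separated label per line."""
    out = Path(corpus_path).with_suffix(".txt")
    lines = [f"{start:.3f}\t{end:.3f}\t{pinyin}" for start, end, pinyin in segments]
    out.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return out
```

The resulting .txt can be imported into Audacity as a label track for the proofreading pass described above.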
This embodiment uses a recurrent-neural-network voice activity detection algorithm to perform the voice activity detection of the original corpus.
The above embodiment merely expresses a specific implementation of the present invention; its description is relatively specific and detailed, but it cannot therefore be construed as limiting the scope of the present patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention.

Claims (4)

1. A VAD-based keyword recognition annotation method, characterized in that: the original corpus data is split into frames, data of a fixed length being read at a given sample rate as one frame; each frame is fed into the VAD algorithm for a VAD decision; the decision results are then smoothed by hangover processing: let the VAD speech/non-speech decision for one frame of the original corpus be vadreg, where vadreg = 0 denotes non-speech and vadreg = 1 denotes speech; the inputs to the hangover decision are then vadreg and the previous decision VAD'; the decision proceeds as follows:
I. Introduce variables burst_count and hang_count to count the numbers of consecutive frames with vadreg = 1 and with vadreg = 0 respectively, and use burst_len and hang_len to denote the speech-onset delay and the speech-offset delay;
II. If the previous decision is VAD' = 0 and the number burst_count of consecutive frames with vadreg = 1 reaches the speech-onset delay, i.e. burst_count >= burst_len, output the new decision VAD = 1; otherwise output VAD = 0;
III. If the previous decision is VAD' = 1 and the number hang_count of consecutive frames with vadreg = 0 reaches the speech-offset delay, i.e. hang_count >= hang_len, output the new decision VAD = 0; otherwise output VAD = 1;
IV. Annotate the corpus at the corresponding times according to the final output flag VAD of the VAD decision combined with the hangover decision, where VAD = 0 is labeled non-keyword and VAD = 1 is labeled keyword.
2. The VAD-based keyword recognition annotation method according to claim 1, characterized in that the speech-onset delay burst_len and the speech-offset delay hang_len have either fixed or dynamic lengths.
3. The VAD-based keyword recognition annotation method according to claim 2, characterized in that the speech-onset delay burst_len and the speech-offset delay hang_len have fixed lengths, burst_len being 5 frames and hang_len being 7 frames, and that the hangover decision times are rolled back by the corresponding delay when the final VAD flag is written out.
4. The VAD-based keyword recognition annotation method according to any one of claims 1-3, characterized in that the VAD algorithm is a voice activity detection algorithm based on a recurrent neural network.
CN201811179716.7A 2018-10-10 2018-10-10 A VAD-based keyword recognition annotation method Pending CN109378016A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811179716.7A CN109378016A (en) 2018-10-10 2018-10-10 A VAD-based keyword recognition annotation method


Publications (1)

Publication Number Publication Date
CN109378016A true CN109378016A (en) 2019-02-22

Family

ID=65404041

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811179716.7A Pending CN109378016A (en) A VAD-based keyword recognition annotation method

Country Status (1)

Country Link
CN (1) CN109378016A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110010153A (en) * 2019-03-25 2019-07-12 平安科技(深圳)有限公司 A neural-network-based silence detection method, terminal device, and medium
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model
CN112420070A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Automatic labeling method and device, electronic equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020120440A1 (en) * 2000-12-28 2002-08-29 Shude Zhang Method and apparatus for improved voice activity detection in a packet voice network
CN104409080A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Voice endpoint detection method and device
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 A speech endpoint detection method and speech recognition method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020120440A1 (en) * 2000-12-28 2002-08-29 Shude Zhang Method and apparatus for improved voice activity detection in a packet voice network
CN104409080A (en) * 2014-12-15 2015-03-11 北京国双科技有限公司 Voice endpoint detection method and device
CN108428448A (en) * 2017-02-13 2018-08-21 芋头科技(杭州)有限公司 A speech endpoint detection method and speech recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN Ming: "Research on Voice Activity Detection Algorithms", Wanfang Database *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110010153A (en) * 2019-03-25 2019-07-12 平安科技(深圳)有限公司 A neural-network-based silence detection method, terminal device, and medium
CN112420070A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Automatic labeling method and device, electronic equipment and computer readable storage medium
CN110930997A (en) * 2019-12-10 2020-03-27 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model
CN110930997B (en) * 2019-12-10 2022-08-16 四川长虹电器股份有限公司 Method for labeling audio by using deep learning model

Similar Documents

Publication Publication Date Title
CN110364171B (en) Voice recognition method, voice recognition system and storage medium
CN103559894B (en) Oral evaluation method and system
CN109378016A (en) A VAD-based keyword recognition annotation method
CN108428448A (en) A speech endpoint detection method and speech recognition method
CN111341305B (en) Audio data labeling method, device and system
CN101751919B (en) Spoken Chinese stress automatic detection method
CN103761975B (en) Method and device for oral evaluation
CN109801628B (en) Corpus collection method, apparatus and system
CN112580367B (en) Telephone traffic quality inspection method and device
CN105427858A (en) Method and system for achieving automatic voice classification
CN113468296B (en) Model self-iteration type intelligent customer service quality inspection system and method capable of configuring business logic
CN110648691B (en) Emotion recognition method, device and system based on energy value of voice
Audhkhasi et al. Formant-based technique for automatic filled-pause detection in spontaneous spoken English
CN102568475A (en) System and method for assessing proficiency in Putonghua
CN112951275B (en) Voice quality inspection method and device, electronic equipment and medium
CN103177733A (en) Method and system for evaluating Chinese mandarin retroflex suffixation pronunciation quality
Howell et al. Development of a two-stage procedure for the automatic recognition of dysfluencies in the speech of children who stutter: I. Psychometric procedures appropriate for selection of training material for lexical dysfluency classifiers
CN108549628A (en) Punctuation device and method for streaming natural-language information
CN104464755A (en) Voice evaluation method and device
CN109460558B (en) Effect judging method of voice translation system
CN110992959A (en) Voice recognition method and system
CN110888989A (en) Intelligent learning platform and construction method thereof
CN110196897B (en) Case identification method based on question and answer template
CN109213970B (en) Method and device for generating notes
CN111933120A (en) Voice data automatic labeling method and system for voice recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190222