CN108320733A - Voice data processing method and device, storage medium, electronic equipment - Google Patents
- Publication number
- CN108320733A CN108320733A CN201711364085.1A CN201711364085A CN108320733A CN 108320733 A CN108320733 A CN 108320733A CN 201711364085 A CN201711364085 A CN 201711364085A CN 108320733 A CN108320733 A CN 108320733A
- Authority
- CN
- China
- Prior art keywords
- voice data
- voice
- wake
- data
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
Abstract
The present disclosure provides a voice data processing method and device, a storage medium, and electronic equipment. The method includes: obtaining voice data input by a user, the voice data including wake-up voice data that successfully wakes up an intelligent terminal and control voice data that expresses an operation intention; extracting acoustic-level features and/or semantic-level features of the voice data, the acoustic-level features characterizing the user's pronunciation and the semantic-level features characterizing the text of the voice data; and taking the acoustic-level features and/or semantic-level features as input to a pre-built voice discrimination model, which determines whether the wake-up voice data is false-wake-up data. By optimizing the wake-up model with wake-up voice data from which false-wake-up data has been screened out, the scheme helps improve the optimization performance of the wake-up model.
Description
Technical field
The present disclosure relates to the field of speech processing, and in particular to a voice data processing method and device, a storage medium, and electronic equipment.
Background technology
Voice wake-up technology is an important branch of the speech processing field, with important applications in smart homes, intelligent robots, intelligent vehicle devices, smartphones, and the like.
In general, the voice wake-up process of an intelligent terminal proceeds as follows. The terminal monitors whether the user inputs voice data; when voice data is received, its acoustic features are extracted. The acoustic features are then fed into a pre-built wake-up model for wake-word recognition. If a wake word is recognized, the wake-up succeeds and the terminal continues monitoring for the user's operation intention; otherwise the wake-up fails and the terminal resumes monitoring for further wake-up attempts. The acoustic features are typically spectral features of the voice data, for example Mel Frequency Cepstrum Coefficient (MFCC) features or Perceptual Linear Predictive (PLP) features.
In general, the performance of the initial wake-up model is not optimal, and the model must be continually optimized during use to improve its recognition accuracy. Specifically, voice data that woke the terminal successfully can be treated as positive examples, voice data that failed to wake it as negative examples, and the current wake-up model trained and optimized on a discriminative criterion.
In practice, because the initial wake-up model performs imperfectly, the successful wake-ups may contain false-wake-up data: background noise, speech interference, or non-wake words pronounced similarly to the wake word may accidentally wake the intelligent terminal. If such false-wake-up data is used as positive examples during optimization, the performance of the wake-up model is likely to grow steadily worse.
Summary of the invention
A general object of the present disclosure is to provide a voice data processing method and device, a storage medium, and electronic equipment that help improve the optimization performance of the wake-up model.
To achieve the above object, the disclosure provides a voice data processing method, the method including:
obtaining voice data input by a user, the voice data including wake-up voice data that successfully wakes up an intelligent terminal and control voice data that expresses an operation intention;
extracting acoustic-level features and/or semantic-level features of the voice data, the acoustic-level features being used to characterize the user's pronunciation and the semantic-level features being used to characterize the text of the voice data;
taking the acoustic-level features and/or semantic-level features as input to a pre-built voice discrimination model and, after processing by the model, determining whether the wake-up voice data is false-wake-up data.
Optionally, the wake-up voice data is obtained as follows: judging whether at least two pieces of voice data for waking up the intelligent terminal are collected consecutively within a preset time period; if so, and each of their score values d after processing by the current wake-up model satisfies d2 ≤ d < d1, determining those pieces of voice data to be the wake-up voice data, where d1 is a first wake-up score threshold and d2 is a second wake-up score threshold.
Optionally, the acoustic-level features include an acoustic score of the current wake-up model, and extracting the acoustic-level features of the voice data includes: obtaining the top-N recognition results output by the current wake-up model for each speech unit of the wake-up voice data; if the top-N recognition results for a speech unit include that unit's correct pronunciation, judging the unit to be recognized correctly; and, from the recognition results of all the speech units, computing the recognition accuracy of the wake-up voice data as the acoustic score of the current wake-up model.
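As a sketch only (the patent does not specify an implementation), the top-N accuracy computation above might look as follows; `acoustic_score` and the phonetic units in the example are illustrative names and values, not from the patent:

```python
# Sketch: the "acoustic score" of the current wake-up model, taken as the
# fraction of speech units whose correct pronunciation appears among the
# model's top-N recognition candidates for that unit.

def acoustic_score(top_n_results, reference):
    """top_n_results: one list of top-N candidates per speech unit.
    reference: the correct pronunciation of each speech unit."""
    correct = sum(1 for cands, ref in zip(top_n_results, reference) if ref in cands)
    return correct / len(reference)

# A wake word with 4 speech units; each unit has the model's top-3 candidates.
cands = [["d", "t", "g"], ["ing", "in", "ang"], ["d", "b", "p"], ["ong", "eng", "ang"]]
ref = ["d", "ing", "d", "ong"]
print(acoustic_score(cands, ref))  # 1.0: every unit's correct unit is in its top-3
```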
Optionally, the acoustic-level features further include at least one of fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate;
And/or
the acoustic-level features further include a voiced/unvoiced sequence feature, in which case extracting the acoustic-level features of the voice data includes: taking at least one of the fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate as input to a pre-built voiced/unvoiced classifier, which outputs the voiced/unvoiced sequence {a1, a2, ..., ai, ..., am} of the wake-up voice data, where ai denotes the voiced/unvoiced class of the i-th phoneme of the wake-up voice data; and computing the similarity between this sequence and the voiced/unvoiced sequence of the wake word, as the voiced/unvoiced sequence feature;
And/or
the acoustic-level features further include a tone sequence feature, in which case extracting the acoustic-level features of the voice data includes: taking at least one of the fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate as input to a pre-built tone classifier, which outputs the tone sequence {b1, b2, ..., bj, ..., bn} of the wake-up voice data, where bj denotes the tone type of the j-th syllable of the wake-up voice data; and computing the similarity between this sequence and the tone sequence of the wake word, as the tone sequence feature;
And/or
the acoustic-level features further include duration features of the speech units, in which case extracting the acoustic-level features of the voice data includes: measuring the duration of each speech unit of the wake-up voice data, and computing the mean and variance of those durations as the duration features of the speech units;
And/or
the acoustic-level features further include a voiceprint feature, in which case extracting the acoustic-level features of the voice data includes: extracting the i-vector feature of the wake-up voice data with a pre-built voiceprint extraction model, as the voiceprint feature;
And/or
the acoustic-level features further include an energy-distribution feature, in which case extracting the acoustic-level features of the voice data includes: cutting the voice data into three parts c_{t-1}, c_t, c_{t+1} and computing the average energy of each part as the energy-distribution feature, where c_t denotes the wake-up voice data, c_{t+1} denotes the voice data collected after the wake-up voice data (including the control voice data), and c_{t-1} denotes the voice data collected before the wake-up voice data.
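The sequence-similarity and energy-distribution features above can be sketched as follows, under assumed representations not fixed by the patent: a voiced/unvoiced (or tone) sequence as a list of labels, and audio as a list of amplitude samples; all names and values are illustrative:

```python
# Sketch of two optional acoustic-level features (assumed representations).

def sequence_similarity(seq, template):
    """Fraction of positions where the observed voiced/unvoiced (or tone)
    sequence matches the wake word's template sequence; length mismatches
    are penalized by dividing by the longer length."""
    if not seq or not template:
        return 0.0
    matches = sum(1 for a, b in zip(seq, template) if a == b)
    return matches / max(len(seq), len(template))

def energy_distribution(samples, t_start, t_end):
    """Split the recording into c_{t-1} (before the wake-up), c_t (the
    wake-up) and c_{t+1} (after it, incl. the control utterance), and
    return each segment's average energy (mean squared amplitude)."""
    def avg_energy(seg):
        return sum(x * x for x in seg) / len(seg) if seg else 0.0
    return (avg_energy(samples[:t_start]),
            avg_energy(samples[t_start:t_end]),
            avg_energy(samples[t_end:]))

print(sequence_similarity([1, 0, 1, 0], [1, 0, 1, 1]))              # 0.75
print(energy_distribution([0.0, 0.0, 1.0, -1.0, 0.5, 0.5], 2, 4))  # (0.0, 1.0, 0.25)
```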
Optionally, the semantic-level features include semantic smoothness, in which case extracting the semantic-level features of the voice data includes: performing word segmentation on the voice data to obtain the word sequence {w1, w2, ..., wk, ..., wf}, where wk denotes the k-th word of the voice data; and computing the probability that the f words occur in the order of the word sequence, as the semantic smoothness;
And/or
the semantic-level features include a part-of-speech edit distance, in which case extracting the semantic-level features of the voice data includes: performing word segmentation on the voice data to obtain the part-of-speech sequence {q1, q2, ..., qk, ..., qf}, where qk denotes the part of speech of the k-th word of the voice data; computing the edit distance between this sequence and the part-of-speech sequence of each piece of sample voice data, the sample voice data being the data used to train the voice discrimination model; and choosing the smallest such distance as the part-of-speech edit distance;
And/or
the semantic-level features include an intention feature, in which case extracting the semantic-level features of the voice data includes: extracting the intention feature of the control voice data with a pre-built intention analysis model, the intention feature indicating either a clear intention or no clear intention, or, alternatively, the intention category to which the control voice data corresponds.
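A minimal sketch of the two computable semantic-level features above, under assumptions: semantic smoothness scored with a toy bigram probability table, and the part-of-speech feature as the minimum Levenshtein distance to the training samples' POS sequences. The bigram table and tag values are illustrative, not from the patent:

```python
# Sketch: semantic smoothness (toy bigram model) and POS edit distance.

def smoothness(words, bigram_prob, default=1e-6):
    """Product of bigram probabilities over the segmented word sequence."""
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= bigram_prob.get((a, b), default)
    return p

def edit_distance(s, t):
    """Levenshtein distance between two part-of-speech tag sequences."""
    d = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, b in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (a != b))
    return d[-1]

bigrams = {("I", "want"): 0.1, ("want", "music"): 0.05}
print(smoothness(["I", "want", "music"], bigrams))   # 0.1 * 0.05

pos = ["r", "v", "n"]                                # e.g. pronoun-verb-noun
samples = [["r", "v", "u", "n"], ["v", "n"]]         # POS sequences of training samples
print(min(edit_distance(pos, s) for s in samples))   # 1
```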
Optionally, the voice discrimination model is built as follows: collecting sample voice data, the sample voice data including sample wake-up voice data and sample control voice data, each piece of sample wake-up voice data being labeled as either positive-example or negative-example wake-up voice data, the negative examples including false-wake-up data and voice data that failed to wake the terminal; extracting the acoustic-level features and/or semantic-level features of the sample voice data; determining the topology of the voice discrimination model; and training the voice discrimination model with that topology and those features until the data type the model outputs for each piece of sample wake-up voice data matches its label.
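The patent leaves the discrimination model's topology open; purely as an illustrative stand-in, the training step above can be sketched with a tiny perceptron over a synthetic acoustic/semantic feature vector (feature values and labels below are invented for the example):

```python
# Sketch: training a stand-in discrimination model on labeled feature vectors.
# 1 = genuine wake-up (positive example), 0 = false wake-up (negative example).

def train_perceptron(samples, labels, epochs=50, lr=0.1):
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Feature vector: [acoustic score, voiced/unvoiced similarity, semantic smoothness]
X = [[0.9, 0.9, 0.8], [0.95, 0.85, 0.9], [0.3, 0.4, 0.1], [0.2, 0.5, 0.2]]
y = [1, 1, 0, 0]
w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X])  # [1, 1, 0, 0] on this separable toy set
```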
Optionally, the method further includes: optimizing the current wake-up model with the wake-up voice data from which the false-wake-up data has been screened out.
The disclosure also provides a voice data processing apparatus, the apparatus including:
a voice data acquisition module, for obtaining voice data input by a user, the voice data including wake-up voice data that successfully wakes up an intelligent terminal and control voice data that expresses an operation intention;
a feature extraction module, for extracting acoustic-level features and/or semantic-level features of the voice data, the acoustic-level features characterizing the user's pronunciation and the semantic-level features characterizing the text of the voice data;
a model processing module, for taking the acoustic-level features and/or semantic-level features as input to the pre-built voice discrimination model and, after processing by the model, determining whether the wake-up voice data is false-wake-up data.
Optionally, the voice data acquisition module judges whether at least two pieces of voice data for waking up the intelligent terminal are collected consecutively within a preset time period and, if so, and each of their score values d after processing by the current wake-up model satisfies d2 ≤ d < d1, determines them to be the wake-up voice data, where d1 is the first wake-up score threshold and d2 is the second wake-up score threshold.
Optionally, the acoustic-level features include the acoustic score of the current wake-up model, and the feature extraction module obtains the top-N recognition results output by the current wake-up model for each speech unit of the wake-up voice data; judges a speech unit to be recognized correctly if its top-N results include its correct pronunciation; and, from the recognition results of all the speech units, computes the recognition accuracy of the wake-up voice data as the acoustic score of the current wake-up model.
Optionally, the acoustic-level features further include at least one of fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate;
And/or
the acoustic-level features further include the voiced/unvoiced sequence feature, and the feature extraction module takes at least one of the fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate as input to the pre-built voiced/unvoiced classifier, obtains the voiced/unvoiced sequence {a1, a2, ..., ai, ..., am} of the wake-up voice data, where ai denotes the voiced/unvoiced class of the i-th phoneme, and computes the similarity between this sequence and the voiced/unvoiced sequence of the wake word, as the voiced/unvoiced sequence feature;
And/or
the acoustic-level features further include the tone sequence feature, and the feature extraction module takes at least one of the fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate as input to the pre-built tone classifier, obtains the tone sequence {b1, b2, ..., bj, ..., bn} of the wake-up voice data, where bj denotes the tone type of the j-th syllable, and computes the similarity between this sequence and the tone sequence of the wake word, as the tone sequence feature;
And/or
the acoustic-level features further include the duration features of the speech units, and the feature extraction module measures the duration of each speech unit of the wake-up voice data and computes the mean and variance of those durations as the duration features;
And/or
the acoustic-level features further include the voiceprint feature, and the feature extraction module extracts the i-vector feature of the wake-up voice data with the pre-built voiceprint extraction model, as the voiceprint feature;
And/or
the acoustic-level features further include the energy-distribution feature, and the feature extraction module cuts the voice data into three parts c_{t-1}, c_t, c_{t+1} and computes the average energy of each part as the energy-distribution feature, where c_t denotes the wake-up voice data, c_{t+1} denotes the voice data collected after it (including the control voice data), and c_{t-1} denotes the voice data collected before it.
Optionally, the semantic-level features include the semantic smoothness, and the feature extraction module performs word segmentation on the voice data to obtain the word sequence {w1, w2, ..., wk, ..., wf}, where wk denotes the k-th word, and computes the probability that the f words occur in the order of the word sequence, as the semantic smoothness;
And/or
the semantic-level features include the part-of-speech edit distance, and the feature extraction module performs word segmentation on the voice data to obtain the part-of-speech sequence {q1, q2, ..., qk, ..., qf}, where qk denotes the part of speech of the k-th word, computes the edit distance between this sequence and the part-of-speech sequence of each piece of sample voice data (the data used to train the voice discrimination model), and chooses the smallest such distance as the part-of-speech edit distance;
And/or
the semantic-level features include the intention feature, and the feature extraction module extracts the intention feature of the control voice data with the pre-built intention analysis model, the intention feature indicating either a clear intention or no clear intention, or, alternatively, the intention category to which the control voice data corresponds.
Optionally, the device further includes:
a sample voice data collection module, for collecting sample voice data, the sample voice data including sample wake-up voice data and sample control voice data, each piece of sample wake-up voice data being labeled as either positive-example or negative-example wake-up voice data, the negative examples including false-wake-up data and voice data that failed to wake the terminal;
a sample feature extraction module, for extracting the acoustic-level features and/or semantic-level features of the sample voice data;
a topology determination module, for determining the topology of the voice discrimination model;
a model training module, for training the voice discrimination model with that topology and those features until the data type the model outputs for each piece of sample wake-up voice data matches its label.
Optionally, the device further includes: a model optimization module, for optimizing the current wake-up model with the wake-up voice data from which the false-wake-up data has been screened out.
The disclosure provides a storage device storing a plurality of instructions that, when loaded by a processor, carry out the steps of the above voice data processing method.
The disclosure provides electronic equipment, the electronic equipment including:
the above storage device; and
a processor, for executing the instructions in the storage device.
According to the disclosed scheme, the wake-up voice data that successfully wakes the intelligent terminal and the control voice data expressing the operation intention can be collected; acoustic-level features characterizing the user's pronunciation and/or semantic-level features characterizing the text of the voice data can be extracted from them and fed into the voice discrimination model, whose output determines whether the wake-up voice data is false-wake-up data. The scheme can thus screen false-wake-up data out of the wake-up voice data before model optimization; compared with the prior art, which treats false-wake-up data as positive examples, this helps improve model optimization performance.
Other features and advantages of the disclosure are described in detail in the detailed description below.
Description of the drawings
The accompanying drawings, which form part of the specification, provide further understanding of the disclosure and, together with the detailed description below, serve to explain the disclosure without limiting it. In the drawings:
Fig. 1 is a flow diagram of the voice data processing method of the disclosed scheme;
Fig. 2 is a flow diagram of building the voice discrimination model in the disclosed scheme;
Fig. 3 is a schematic diagram of the composition of the voice data processing apparatus of the disclosed scheme;
Fig. 4 is a structural schematic diagram of electronic equipment for voice data processing according to the disclosed scheme.
Detailed description
Specific embodiments of the disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only used to describe and explain the disclosure, not to limit it.
Referring to Fig. 1, which shows a flow diagram of the disclosed voice data processing method, the method may include the following steps:
S101: obtain voice data input by the user, the voice data including wake-up voice data that successfully wakes up the intelligent terminal and control voice data that expresses an operation intention.
In general, the wake-up interaction between the user and the intelligent terminal proceeds as follows. The terminal monitors whether the user inputs voice data for waking it up; if voice data is input and a wake word is recognized from it, the wake-up succeeds and the terminal continues monitoring for voice data that manipulates it; if further voice data is input and an operation intention is recognized from it, the terminal can be controlled to execute the corresponding operation.
In the disclosed scheme, voice data from which a wake word is recognized and which successfully wakes the intelligent terminal is called wake-up voice data, and voice data from which an operation intention is recognized and which controls the terminal to execute an operation is called control voice data.
It should be appreciated that, compared with other kinds of voice interaction, the wake-up interaction has a distinct sense of interruption and can be abstracted as "silence + wake word + short pause + operation intention". For example, in "sil ding-dong ding-dong sp I want to listen to Liu Dehua's music", "sil" denotes the silence or ambient noise the terminal hears before the user wakes it; "ding-dong ding-dong" denotes the wake word; "sp" denotes the short pause between the wake-up voice data and the control voice data; and "I want to listen to Liu Dehua's music" denotes the operation intention.
The disclosed scheme screens false-wake-up data out of the wake-up voice data to improve the performance of model optimization. That is, it judges whether each piece of wake-up voice data is false-wake-up data and, if so, classifies it as negative-example voice data. Compared with the prior art, which treats false-wake-up data as positive examples during model optimization, the disclosed scheme helps improve model optimization performance.
As an example, the disclosed voice data processing procedure can be triggered whenever the intelligent terminal is successfully woken, or when some other preset condition is met, for example when a preset number of pieces of voice data has been collected or a predetermined time arrives. The disclosed scheme does not restrict the timing of the processing, which can be set according to practical requirements.
As an example, the intelligent terminal in the disclosed scheme can be any electronic device with a voice wake-up function, for example a smart appliance, mobile phone, PC, or tablet computer. In practice, the user's voice data can be collected through the terminal's microphone; the disclosed scheme places no specific restriction on the form of the terminal or on the device that collects the voice data.
As an example, the wake-up voice data in the disclosed scheme can be voice data from which the current wake-up model recognizes the wake word. For example, a first wake-up score threshold d1 can be set: if the score a piece of voice data for waking the intelligent terminal receives after processing by the current wake-up model is not less than d1, its recognition result is considered to be the wake word, and it can be determined to be wake-up voice data.
As an example, the disclosed scheme can also obtain wake-up voice data as follows: judge whether at least two pieces of voice data for waking up the intelligent terminal are collected consecutively within a preset time period; if so, and each of their scores d after processing by the current wake-up model satisfies d2 ≤ d < d1, determine them to be wake-up voice data.
Practical experience shows that when a user's first wake-up attempt fails, they usually retry quickly, often repeatedly, until the wake-up succeeds or they give up. Based on this behavior, the disclosed scheme provides a new way to identify wake-up voice data: on top of the threshold d1 above, set a second wake-up score threshold d2 with d2 < d1. If at least two pieces of voice data for waking the intelligent terminal, collected consecutively within the preset time period, all score in the interval [d2, d1) after processing by the current wake-up model, they can be determined to be wake-up voice data. In this way, wake-up voice data scoring below d1 can be retained to some extent, enriching the data available for optimizing the current wake-up model.
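The retry rule above can be sketched as follows; the thresholds, window length, and timestamps below are illustrative values, not from the patent:

```python
# Sketch: keep near-threshold wake-up attempts when at least two of them
# fall within a preset time window, each scoring in [d2, d1).

D1, D2, WINDOW = 0.8, 0.5, 5.0  # first/second score thresholds, window in seconds

def recover_wakeups(attempts):
    """attempts: list of (timestamp, wake-up score) from the current model.
    Returns the attempts retained as wake-up voice data under the retry rule."""
    near = [(t, d) for t, d in attempts if D2 <= d < D1]
    kept = []
    for i, (t, d) in enumerate(near):
        # retained if another near-threshold attempt falls within the window
        if any(abs(t - t2) <= WINDOW for j, (t2, _) in enumerate(near) if j != i):
            kept.append((t, d))
    return kept

attempts = [(0.0, 0.6), (2.0, 0.7), (30.0, 0.55), (40.0, 0.95)]
print(recover_wakeups(attempts))  # [(0.0, 0.6), (2.0, 0.7)]
```

The attempt at t=30.0 is near-threshold but isolated, and the one at t=40.0 already exceeds d1, so only the two clustered retries are recovered.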
S102: extract the acoustic-level features and/or semantic-level features of the voice data, the acoustic-level features characterizing the user's pronunciation and the semantic-level features characterizing the text of the voice data.
After the user's voice data is obtained, its acoustic-level features and/or semantic-level features can be extracted for processing by the voice discrimination model.
As an example, the acoustic-level features may include the acoustic score of the current wake-up model. Besides the acoustic score, the acoustic-level features may also include at least one of the following optional features: fundamental-frequency mean, short-time average energy, short-time zero-crossing rate, voiced/unvoiced sequence feature, tone sequence feature, speech-unit duration features, voiceprint feature, and energy-distribution feature. These optional features fall into two types: primitive features extracted directly from the wake-up voice data, such as the fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate; and features obtained by further processing the wake-up voice data, such as the voiced/unvoiced sequence feature, tone sequence feature, duration features, voiceprint feature, and energy-distribution feature.
As an example, the semantic-level features may include at least one of the following: semantic smoothness, part-of-speech edit distance, and intention feature.
The meaning of each feature and its specific extraction process are not detailed here; see the introduction below.
S103: take the acoustic-level features and/or semantic-level features as input to the pre-built voice discrimination model and, after processing by the model, determine whether the wake-up voice data is false-wake-up data.
After the acoustic-level features and/or semantic-level features are extracted from the voice data, the pre-built voice discrimination model can process them to determine whether the wake-up voice data is false-wake-up data; if it is, it can be classified as negative-example voice data, and otherwise it can continue to serve as positive-example voice data.
Taking the case where the current wake-up model comprises a current foreground wake-up model and a current background wake-up model as an example, the process of optimizing the current wake-up model using the wake-up voice data from which false wake-up data has been screened out is briefly described below.
It is to be appreciated that the foreground wake-up model describes the wake-up word and may be trained on voice data containing the wake-up word; the background wake-up model describes non-wake-up words and may be trained on voice data that does not contain the wake-up word.
When the disclosed scheme optimizes the wake-up model, the wake-up voice data from which false wake-up data has been screened out can be used to update the current foreground wake-up model, while negative-example voice data, for example voice data of failed wake-ups and false wake-up data, can be used to update the current background wake-up model. In this way, the distance between the two paths is enlarged, which helps improve the speech recognition accuracy of the updated wake-up model. The specific optimization process can be implemented with reference to related technologies and is not detailed here.
As an example, only the current foreground wake-up model may be updated, that is, only the wake-up voice data from which false wake-up data has been screened out is used to optimize the current foreground wake-up model. The manner of model updating can be determined in combination with practical application requirements, and the disclosed scheme does not limit this.
The acoustic-level features and semantic-level features in the disclosed scheme are explained below.
1. Acoustic-level features
(1) Acoustic score of the current wake-up model, reflecting the recognition accuracy of the wake-up word
As an example, the top-N recognition results output by the current wake-up model for each voice unit of the wake-up voice data can be obtained. If the top-N recognition results of a voice unit include the correct pronunciation of that voice unit, the voice unit is judged to be correctly recognized. According to the recognition results of the voice units, the recognition accuracy of the wake-up voice data is counted as the acoustic score of the current wake-up model.
For example, a voice unit can be the basic recognition unit of the current wake-up model, such as a phoneme or a syllable; the disclosed scheme does not specifically limit this.
Taking syllables as voice units, the wake-up word "ding-dong ding-dong" can be divided into 4 voice units: "ding", "dong", "ding", "dong". If N is 3, then for the first voice unit "ding", the recognition probabilities output by the current wake-up model for that unit can be obtained, and the 3 results with the highest probabilities taken as the top-3 recognition results of the voice unit "ding". If the correct pronunciation "ding" appears among these 3 results, the voice unit is judged to be correctly recognized. The same applies to the other 3 voice units. The recognition accuracy of the wake-up voice data, that is, the ratio of the number of correctly recognized voice units to the total number of voice units, is then calculated as the acoustic score of the current wake-up model.
It is to be appreciated that in the disclosed scheme N may be any value with N ≥ 1, which can be set in combination with practical application requirements; the disclosed scheme does not limit this.
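To make the counting concrete, the following Python sketch (illustrative only; the per-unit top-N lists would in practice come from the wake-up model's decoder, and all names here are assumptions, not part of the original disclosure) computes the acoustic score as the fraction of voice units whose correct pronunciation appears in their top-N results:

```python
def acoustic_score(top_n_results, correct_units):
    """Fraction of voice units whose correct pronunciation appears in the
    model's top-N recognition results for that unit."""
    assert len(top_n_results) == len(correct_units)
    hits = sum(1 for top_n, unit in zip(top_n_results, correct_units)
               if unit in top_n)
    return hits / len(correct_units)

# Wake-up word "ding-dong ding-dong" split into 4 syllable units, N = 3
# (the candidate lists below are made up for illustration):
results = [["ding", "din", "ting"],   # correct "ding" present -> hit
           ["dong", "tong", "don"],   # hit
           ["ting", "tin", "din"],    # miss
           ["dong", "dou", "don"]]    # hit
print(acoustic_score(results, ["ding", "dong", "ding", "dong"]))  # 0.75
```

With 3 of the 4 units correctly recognized, the acoustic score is 0.75.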
(2) Primitive features of the wake-up voice data, for example the fundamental frequency mean, short-time average energy and short-time zero-crossing rate
Generally, depending on whether the vocal cords vibrate during pronunciation, speech can be divided into unvoiced and voiced sounds. Voiced sounds carry most of the energy in speech and show obvious periodicity in the time domain; unvoiced sounds are similar to white noise, without obvious periodicity. When a voiced sound is produced, the airflow through the glottis makes the vocal cords vibrate in a relaxation-oscillation manner, producing a quasi-periodic pulse train; the frequency of this vocal-cord vibration is called the fundamental frequency (F0). The fundamental frequency is generally related to a person's vocal cords, pronunciation habits and so on, and can reflect personal characteristics to a certain extent.
As an example, the fundamental frequency mean can be extracted as follows: perform framing on the wake-up voice data to obtain multiple speech data frames, extract the fundamental frequency of each frame, and then use the per-frame fundamental frequencies to calculate the fundamental frequency mean of the wake-up voice data.
In addition, it should be noted that the short-time average energy can serve as a characteristic parameter for distinguishing unvoiced from voiced sounds; alternatively, when the signal-to-noise ratio is high, it can serve as a characteristic parameter for distinguishing speech from silence.
The short-time zero-crossing rate is the number of times the speech waveform crosses the horizontal axis (zero level) within one speech data frame. Generally, the energy of voiced sounds is concentrated in the low-frequency band and the energy of unvoiced sounds in the high-frequency band, so the zero-crossing rate can reflect frequency to a certain extent: voiced segments have a lower zero-crossing rate and unvoiced segments a higher one.
The disclosed scheme does not limit the manner of obtaining the fundamental frequency mean, short-time average energy and short-time zero-crossing rate; they can be implemented with reference to related technologies and are not detailed here.
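As a rough sketch of how these primitive features might be obtained (not part of the original disclosure; a production system would use a robust pitch tracker, and the naive autocorrelation below only works for clean periodic signals), framing plus per-frame energy, zero-crossing rate and F0 can be written as:

```python
import math

def frames(signal, size, hop):
    """Split a sample list into overlapping or adjacent frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def short_time_energy(frame):
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose signs differ.
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)

def f0_autocorr(frame, sr, fmin=80, fmax=400):
    """Naive autocorrelation pitch estimate, searching lags in [sr/fmax, sr/fmin]."""
    best_lag, best_val = 0, 0.0
    for lag in range(int(sr / fmax), int(sr / fmin) + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_val, best_lag = val, lag
    return sr / best_lag if best_lag else 0.0

sr = 8000
signal = [math.sin(2 * math.pi * 200 * i / sr) for i in range(2400)]  # 200 Hz tone
f0_mean = sum(f0_autocorr(f, sr) for f in frames(signal, 800, 800)) / 3
print(f0_mean)                                   # 200.0 for a clean 200 Hz sine
print(short_time_energy(signal[:800]))           # ~0.5 (mean square of a unit sine)
print(zero_crossing_rate(signal[:800]))          # low, as expected for a voiced-like tone
```

Averaging the per-frame F0 values as in `f0_mean` corresponds to the fundamental frequency mean described above.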
(3) Voiced/unvoiced sequence feature, reflecting the voiced/unvoiced characteristics of the phonemes in the wake-up voice data
As an example, at least one of the fundamental frequency mean, short-time average energy and short-time zero-crossing rate can be taken as input, and after processing by a voiced/unvoiced classifier built in advance, the voiced/unvoiced sequence {a1, a2, ..., ai, ..., am} of the wake-up voice data is output, where ai denotes the voiced/unvoiced class of the i-th phoneme of the wake-up voice data. The similarity between the voiced/unvoiced sequence of the wake-up voice data and the voiced/unvoiced sequence of the wake-up word corresponding to the wake-up voice data is then calculated as the voiced/unvoiced sequence feature.
For example, the voiced/unvoiced class of a phoneme can be unvoiced or voiced; "0" may denote unvoiced and "1" voiced. The disclosed scheme does not specifically limit this.
It is to be appreciated that the intelligent terminal may store only one wake-up word, so that the wake-up word corresponding to the wake-up voice data is known in advance; alternatively, the intelligent terminal may store multiple wake-up words, in which case the current wake-up model can be used to identify which wake-up word the wake-up voice data corresponds to. The disclosed scheme does not specifically limit this.
As an example, the voiced/unvoiced sequence of the wake-up word can be stored in the intelligent terminal and read directly when the similarity needs to be calculated; alternatively, it can be determined in real time by the voiced/unvoiced classifier when the similarity needs to be calculated. The disclosed scheme does not specifically limit this.
As an example, the similarity of the voiced/unvoiced sequences can be calculated by XOR: if the voiced/unvoiced classes of the phonemes at corresponding positions are the same, for example both are the voiced class denoted "1", the XOR result at that position is 0; otherwise the XOR result is 1. The number of non-zero results can then be counted to obtain the similarity; generally, the fewer non-zero results, the higher the similarity.
It is to be appreciated that the voiced/unvoiced classifier in the disclosed scheme may adopt a common classification model, for example a support vector machine model or a neural network model; the disclosed scheme does not specifically limit this.
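The XOR-style comparison above can be sketched as follows (illustrative only; the detected sequence would come from the voiced/unvoiced classifier, and the same position-wise comparison applies to the tone sequences of feature (4), where the classes are tone identifiers rather than 0/1):

```python
def sequence_similarity(seq_a, seq_b):
    """Position-wise comparison of two equal-length class sequences:
    count mismatches (the non-zero XOR results) and turn the count into a
    similarity in [0, 1] -- fewer mismatches means higher similarity."""
    assert len(seq_a) == len(seq_b)
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return 1 - mismatches / len(seq_a)

# "0" = unvoiced, "1" = voiced; compare the detected sequence against the
# stored reference sequence of the wake-up word (both made up here):
detected  = [1, 1, 1, 0, 1, 1]
reference = [1, 1, 1, 1, 1, 1]
print(sequence_similarity(detected, reference))  # 1 - 1/6 ≈ 0.833
```

Turning the mismatch count into a ratio is one assumed normalization; the text only requires that fewer non-zero results yield a higher similarity.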
(4) Tone sequence feature, reflecting the tone characteristics of the syllables in the wake-up voice data
As an example, at least one of the fundamental frequency mean, short-time average energy and short-time zero-crossing rate can be taken as input, and after processing by a tone classifier built in advance, the tone sequence {b1, b2, ..., bj, ..., bn} of the wake-up voice data is output, where bj denotes the tone type of the j-th syllable of the wake-up voice data. The similarity between the tone sequence of the wake-up voice data and the tone sequence of the wake-up word corresponding to the wake-up voice data is then calculated as the tone sequence feature.
Taking Chinese as an example, the tone types of syllables can be the 4 common tones, with identifiers "1", "2", "3", "4" denoting the different tones; tone types of syllables may also be determined in combination with other languages. The disclosed scheme does not specifically limit this.
From the above description, regardless of whether the intelligent terminal stores one wake-up word or multiple wake-up words, the wake-up word corresponding to the wake-up voice data can be determined, and the tone sequence of the wake-up word thereby obtained; for details, see the introduction under "(3) Voiced/unvoiced sequence feature", which is not repeated here.
As an example, the similarity of the tone sequences can be calculated by XOR: if the tone types of the syllables at corresponding positions are the same, for example both are the Chinese falling tone denoted "4", the XOR result at that position is 0; otherwise the XOR result is 1. The number of non-zero results can then be counted to obtain the similarity; generally, the fewer non-zero results, the higher the similarity.
It is to be appreciated that the tone classifier in the disclosed scheme may adopt a common classification model, for example a support vector machine model or a neural network model; the disclosed scheme does not specifically limit this.
(5) Temporal feature of voice units, reflecting abnormalities of the wake-up voice data in voice-unit segmentation
As an example, based on the voice recognition result obtained by the current wake-up model, forced alignment can be performed on the wake-up voice data to obtain the start time and end time of each voice unit, and thereby the duration of each voice unit. Using the durations of the voice units, the duration mean and duration variance can be calculated as the temporal feature of the voice units.
Generally, the temporal feature of voice units can reflect abnormalities of the voice units during segmentation, for example an individual voice unit whose duration is too long or too short to conform to normal speech. As an example, a voice unit can be a phoneme, a syllable, or the like; the disclosed scheme does not specifically limit this.
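A minimal sketch of the duration statistics (the segment boundaries below are invented; in practice they would come from forced alignment against the wake-up model's recognition result):

```python
def temporal_features(segments):
    """Duration mean and variance of voice units; `segments` is a list of
    (start, end) times in seconds from forced alignment."""
    durations = [end - start for start, end in segments]
    mean = sum(durations) / len(durations)
    var = sum((d - mean) ** 2 for d in durations) / len(durations)
    return mean, var

# 4 units of "ding dong ding dong" with made-up alignment times; the
# overlong final unit is the kind of anomaly the variance exposes:
print(temporal_features([(0.00, 0.21), (0.21, 0.45), (0.45, 0.64), (0.64, 1.30)]))
```

Here the mean is 0.325 s, and the 0.66 s final unit inflates the variance well above that of a normally paced utterance.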
(6) Voiceprint feature, reflecting the physiological and behavioral characteristics of the speaker
As an example, a voiceprint extraction model built in advance can be used to extract the i-vector feature of the wake-up voice data as the voiceprint feature. For example, voiceprint extraction models such as DNN i-vector and GMM-UBM i-vector can be used; the disclosed scheme does not specifically limit this.
It is to be appreciated that the voiceprint feature reflects the individual characteristics of the speaker, and the voiceprint feature of a speaker generally does not change within a short time. Therefore, the voiceprint extraction model may also extract i-vector features from the control voice data, or from the whole voice data including both the wake-up voice data and the control voice data; the disclosed scheme does not specifically limit this.
(7) Energy distribution feature, reflecting the characteristics of the wake-up interaction process
As an example, the voice data can be cut into three parts ct−1, ct, ct+1, and the average energy distribution of each part counted; for example, the average energy distribution of the three parts can be expressed as gt−1, gt, gt+1, giving the energy distribution feature.
As an example, the energy distribution feature can be extracted as follows: perform framing on each of the 3 parts of voice data to obtain the speech data frames of each part, extract the energy of each frame, and then use the per-frame energies to calculate the average energy of each part.
In combination with the wake-up interaction example given above, "sil ding-dong ding-dong sp I want to listen to music by Andy Lau", the voice data can be divided into 3 parts by recognizing the wake-up word. Here, ct denotes the wake-up voice data; ct−1 denotes the voice data collected before the wake-up voice data, usually a silent segment or background noise; ct+1 denotes the voice data collected after the wake-up voice data, usually a short pause followed by the operation intent. It is to be appreciated that the durations of ct−1 and ct+1 can be determined flexibly; for example, they can be determined from VAD (Voice Activity Detection) information, or set to a fixed duration such as 1 s to 5 s. The disclosed scheme does not specifically limit this.
Compared with an ordinary utterance that mentions the wake-up word and thereby falsely wakes the intelligent terminal, for example "I think ding-dong ding-dong is a nice name", the energy distribution of the wake-up interaction process in the disclosed scheme shows a significant difference.
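The three-part energy profile can be sketched as follows (illustrative only; the boundary indices would come from recognizing the wake-up word, and the per-sample squaring stands in for the frame-wise energy averaging described above):

```python
def energy_distribution(signal, wake_start, wake_end):
    """Average energy of the segments before / within / after the wake-up
    word (c_{t-1}, c_t, c_{t+1} in the text)."""
    def avg_energy(seg):
        return sum(x * x for x in seg) / len(seg) if seg else 0.0
    return (avg_energy(signal[:wake_start]),
            avg_energy(signal[wake_start:wake_end]),
            avg_energy(signal[wake_end:]))

# Toy waveform: silence, a loud wake-up word, then a quieter command.
sig = [0.0] * 100 + [0.8, -0.8] * 100 + [0.3, -0.3] * 100
print(energy_distribution(sig, 100, 300))  # (0.0, 0.64, 0.09)
```

A genuine wake-up interaction tends toward this silence / peak / speech shape, whereas a wake-up word mentioned mid-sentence has comparable energy on both sides.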
2. Semantic-level features
(1) Semantic smoothness
As an example, word segmentation can be performed on the voice data to obtain the word sequence {w1, w2, ..., wk, ..., wf}, where wk denotes the k-th word of the voice data; the probability that the f words occur in the order of the word sequence is then calculated as the semantic smoothness.
For example, the semantic smoothness in the disclosed scheme can be the forward semantic smoothness P(w1, w2, ..., wf) in the direction from w1 to wf, and/or the reverse semantic smoothness P(wf, wf−1, ..., w1) in the direction from wf to w1. Taking the forward semantic smoothness as an example, it can be calculated by the following formula:
P(w1, w2, ..., wf) = P(w1) × P(w2|w1) × ... × P(wf|wf−1)
where P(wk|wk−1) can be obtained from statistics over the sample voice data that participate in training the voice discrimination model.
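A minimal sketch of the forward smoothness, assuming bigram and unigram probabilities estimated from the training samples (the probability tables and the floor value for unseen pairs are invented for illustration; log space avoids underflow on long sequences):

```python
import math

def forward_smoothness(words, bigram_prob, unigram_prob):
    """log P(w1..wf) = log P(w1) + sum over k of log P(wk | wk-1)."""
    logp = math.log(unigram_prob[words[0]])
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigram_prob.get((prev, cur), 1e-6))  # floor unseen pairs
    return logp

unigram = {"play": 0.2}
bigram = {("play", "some"): 0.3, ("some", "music"): 0.4}
print(forward_smoothness(["play", "some", "music"], bigram, unigram))  # log(0.024)
```

The reverse smoothness would run the same chain with the word order and the bigram table reversed.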
(2) Edit distance of the part-of-speech sequence
As an example, word segmentation can be performed on the voice data to obtain the part-of-speech sequence {q1, q2, ..., qk, ..., qf}, where qk denotes the part of speech of the k-th word of the voice data; the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each sample voice data is calculated, and the smallest edit distance is chosen as the edit distance of the part-of-speech sequence. Here, the sample voice data are the data that participate in training the voice discrimination model.
The part-of-speech sequence feature can reflect semantic information to a certain extent, and is particularly notable for the wake-up interaction process. In the disclosed scheme, the part-of-speech sequence feature can be the edit distance of the part-of-speech sequence, that is, the minimum number of edit operations required to change one symbol string into the other; generally, the smaller the edit distance, the greater the similarity of the two strings.
If the part-of-speech sequence of a sample voice data is expressed as {p1, p2, ..., ph}, the edit distance d[f, h] between {q1, q2, ..., qf} and {p1, p2, ..., ph} can be calculated with the standard recurrence: d[i, 0] = i, d[0, j] = j, and d[i, j] = min(d[i−1, j] + 1, d[i, j−1] + 1, d[i−1, j−1] + cost), where cost = 0 if qi = pj and 1 otherwise.
It is to be appreciated that the sample voice data may be all data that participate in training the voice discrimination model; alternatively, positive-example data filtered out of all the data may serve as the sample voice data for the edit-distance calculation. The disclosed scheme does not specifically limit this, as long as the smallest edit distance can be determined.
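The recurrence above is the standard Levenshtein dynamic program; applied to part-of-speech tags it can be sketched as follows (the tag names and sample sequences are invented for illustration):

```python
def edit_distance(seq_q, seq_p):
    """Levenshtein distance d[f, h] between two part-of-speech sequences."""
    f, h = len(seq_q), len(seq_p)
    d = [[0] * (h + 1) for _ in range(f + 1)]
    for i in range(f + 1):
        d[i][0] = i                    # delete all of seq_q[:i]
    for j in range(h + 1):
        d[0][j] = j                    # insert all of seq_p[:j]
    for i in range(1, f + 1):
        for j in range(1, h + 1):
            cost = 0 if seq_q[i - 1] == seq_p[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[f][h]

query = ["PRON", "VERB", "NOUN"]
samples = [["PRON", "VERB", "ADJ", "NOUN"], ["VERB", "NOUN"]]
print(min(edit_distance(query, s) for s in samples))  # 1
```

The feature value is the minimum over all sample sequences, as the text specifies.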
(3) Intent feature
As an example, an intent analysis model built in advance can be used to extract the intent feature of the control voice data. The intent feature may indicate a clear intent or no clear intent; alternatively, the intent feature may include the intent category corresponding to the control voice data.
The disclosed scheme can build the intent analysis model in advance to determine the tendency of the operation intent. For example, the intent analysis model can be a binary classifier whose output indicates clear intent or no clear intent; or it can be a regression model whose output indicates the scores of the various intent categories, so that the intent category corresponding to the control voice data can be determined according to the scores, for example by taking the M highest-scoring intent categories as the intent categories of the control voice data, where M may be any value with M ≥ 1, set in combination with practical application requirements; the disclosed scheme does not limit this. For example, the intent categories can include playing music, querying the weather and so on, depending on practical application requirements.
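Purely as a toy stand-in for the trained intent analysis model (the keyword table, category names and decision rule below are all invented; a real system would use the binary classifier or regression model described above), the two outputs of the intent feature can be illustrated as:

```python
# Hypothetical keyword-to-category table standing in for a trained model.
INTENT_KEYWORDS = {"play": "play_music", "listen": "play_music",
                   "weather": "query_weather"}

def intent_feature(control_words):
    """Return (has_clear_intent, category) for segmented control voice data."""
    for word in control_words:
        if word in INTENT_KEYWORDS:
            return True, INTENT_KEYWORDS[word]
    return False, None

print(intent_feature(["i", "want", "to", "listen", "to", "music"]))  # (True, 'play_music')
print(intent_feature(["is", "a", "nice", "name"]))                   # (False, None)
```

The second call mirrors the false-wake example "I think ding-dong ding-dong is a nice name": the trailing words carry no clear operation intent.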
The process of building the voice discrimination model in the disclosed scheme is explained below. For details, see the flowchart shown in Fig. 2, which may include the following steps:
S201: collect sample voice data, the sample voice data including sample wake-up voice data and sample control voice data, the data type of the sample wake-up voice data being labeled as positive-example wake-up voice data or negative-example wake-up voice data, the negative-example wake-up voice data including false wake-up data and voice data of failed wake-ups.
When model training is performed, a large amount of sample voice data can be collected, the sample voice data taking the form of sample wake-up voice data and sample control voice data. In addition, the sample wake-up voice data can be labeled with a data type, for example positive-example wake-up voice data or negative-example wake-up voice data; the negative-example wake-up voice data can further be labeled in finer detail as false wake-up data or voice data of failed wake-ups.
S202: extract the acoustic-level feature and/or the semantic-level feature of the sample voice data.
The specific implementation can refer to the introduction made above and is not detailed here.
S203: determine the topological structure of the voice discrimination model.
As an example, the topological structure in the disclosed scheme can be a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a DNN (Deep Neural Network), or the like; the disclosed scheme does not specifically limit this.
As an example, the output layer of the neural network can include 2 output nodes, respectively representing positive-example wake-up voice data and false wake-up data; for example, "0" may denote positive-example wake-up voice data and "1" false wake-up data. Alternatively, the output layer of the neural network can include 1 output node, indicating the probability that the wake-up voice data is judged to be false wake-up data. The disclosed scheme does not limit the specific form of the neural network.
S204: train the voice discrimination model using the topological structure and the acoustic-level feature and/or the semantic-level feature of the sample voice data, until the data type output by the voice discrimination model for the sample wake-up voice data is identical to the labeled data type.
After the topological structure of the model has been determined and the acoustic-level feature and/or the semantic-level feature of the sample voice data extracted, model training can be performed. As an example, the training process may use the cross-entropy criterion and update the model parameters with common stochastic gradient descent, ensuring that, when training is completed, the data type output by the model for the sample wake-up voice data is identical to the labeled data type.
As an example, the voice discrimination model can be a universal model, that is, not built for one or several specific wake-up words; alternatively, the voice discrimination model can be a personalized model, that is, a different voice discrimination model built for each wake-up word. The disclosed scheme does not specifically limit this.
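To illustrate cross-entropy training with stochastic gradient descent, the following deliberately minimal stand-in trains a logistic classifier over the extracted features (a real discrimination model would be the CNN/RNN/DNN named above; the feature values and labels here are invented toy data):

```python
import math

def train_discriminator(samples, labels, epochs=200, lr=0.5):
    """Binary classifier with cross-entropy loss and plain per-sample SGD.
    Label 1 = false wake-up data, label 0 = positive-example wake-up data."""
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid output node
            g = p - y                        # d(cross-entropy)/dz
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) > 0.5 else 0

# Toy 2-d features (e.g. [acoustic score, semantic smoothness]):
X = [[0.9, 0.8], [0.85, 0.9], [0.2, 0.3], [0.3, 0.1]]
y = [0, 0, 1, 1]
w, b = train_discriminator(X, y)
print([predict(w, b, x) for x in X])  # [0, 0, 1, 1]
```

Training is stopped once the model reproduces the labeled data types, mirroring the termination condition of S204.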
Referring to Fig. 3, a schematic diagram of the composition of the voice data processing apparatus of the present disclosure is shown. The apparatus may include:
a voice data acquisition module 301, configured to obtain voice data input by the user, the voice data including wake-up voice data that successfully wakes the intelligent terminal and control voice data that indicates the operation intent;
a feature extraction module 302, configured to extract the acoustic-level feature and/or the semantic-level feature of the voice data, the acoustic-level feature indicating the pronunciation characteristics of the user and the semantic-level feature indicating the text characteristics of the voice data;
a model processing module 303, configured to determine, with the acoustic-level feature and/or the semantic-level feature as input and after processing by the voice discrimination model built in advance, whether the wake-up voice data is false wake-up data.
Optionally, the voice data acquisition module is configured to judge whether at least two pieces of voice data for waking the intelligent terminal are collected consecutively within a preset time period; if at least two pieces of voice data for waking the intelligent terminal are collected consecutively within the preset time period, and the score d of each of the at least two pieces of voice data after processing by the current wake-up model satisfies d2 ≤ d < d1, the at least two pieces of voice data for waking the intelligent terminal are determined to be the wake-up voice data, where d1 is a first wake-up score threshold and d2 is a second wake-up score threshold.
Optionally, the acoustic-level feature includes the acoustic score of the current wake-up model, and the feature extraction module is configured to obtain the top-N recognition results output by the current wake-up model for each voice unit of the wake-up voice data; if the top-N recognition results of a voice unit include the correct pronunciation of that voice unit, judge the voice unit to be correctly recognized; and, according to the recognition results of the voice units, count the recognition accuracy of the wake-up voice data as the acoustic score of the current wake-up model.
Optionally, the acoustic-level feature further includes at least one of the fundamental frequency mean, short-time average energy and short-time zero-crossing rate;
and/or
the acoustic-level feature further includes a voiced/unvoiced sequence feature, and the feature extraction module is configured to take at least one of the fundamental frequency mean, short-time average energy and short-time zero-crossing rate as input, output, after processing by a voiced/unvoiced classifier built in advance, the voiced/unvoiced sequence {a1, a2, ..., ai, ..., am} of the wake-up voice data, where ai denotes the voiced/unvoiced class of the i-th phoneme of the wake-up voice data, and calculate the similarity between the voiced/unvoiced sequence of the wake-up voice data and the voiced/unvoiced sequence of the wake-up word corresponding to the wake-up voice data as the voiced/unvoiced sequence feature;
and/or
the acoustic-level feature further includes a tone sequence feature, and the feature extraction module is configured to take at least one of the fundamental frequency mean, short-time average energy and short-time zero-crossing rate as input, output, after processing by a tone classifier built in advance, the tone sequence {b1, b2, ..., bj, ..., bn} of the wake-up voice data, where bj denotes the tone type of the j-th syllable of the wake-up voice data, and calculate the similarity between the tone sequence of the wake-up voice data and the tone sequence of the wake-up word corresponding to the wake-up voice data as the tone sequence feature;
and/or
the acoustic-level feature further includes a temporal feature of voice units, and the feature extraction module is configured to count the duration of each voice unit of the wake-up voice data, and use the durations of the voice units to calculate the duration mean and duration variance as the temporal feature of the voice units;
and/or
the acoustic-level feature further includes a voiceprint feature, and the feature extraction module is configured to extract, using a voiceprint extraction model built in advance, the i-vector feature of the wake-up voice data as the voiceprint feature;
and/or
the acoustic-level feature further includes an energy distribution feature, and the feature extraction module is configured to cut the voice data into three parts ct−1, ct, ct+1 and count the average energy distribution of each part as the energy distribution feature, where ct denotes the wake-up voice data, ct+1 denotes the voice data collected after the wake-up voice data and including the control voice data, and ct−1 denotes the voice data collected before the wake-up voice data.
Optionally, the semantic-level feature includes semantic smoothness, and the feature extraction module is configured to perform word segmentation on the voice data to obtain the word sequence {w1, w2, ..., wk, ..., wf}, where wk denotes the k-th word of the voice data, and calculate the probability that the f words occur in the order of the word sequence as the semantic smoothness;
and/or
the semantic-level feature includes the edit distance of the part-of-speech sequence, and the feature extraction module is configured to perform word segmentation on the voice data to obtain the part-of-speech sequence {q1, q2, ..., qk, ..., qf}, where qk denotes the part of speech of the k-th word of the voice data, calculate the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each sample voice data, and choose the smallest edit distance as the edit distance of the part-of-speech sequence, the sample voice data being the data that participate in training the voice discrimination model;
and/or
the semantic-level feature includes an intent feature, and the feature extraction module is configured to extract, using an intent analysis model built in advance, the intent feature of the control voice data, the intent feature indicating a clear intent or no clear intent, or the intent feature including the intent category corresponding to the control voice data.
Optionally, the apparatus further includes:
a sample voice data collection module, configured to collect sample voice data, the sample voice data including sample wake-up voice data and sample control voice data, the data type of the sample wake-up voice data being labeled as positive-example wake-up voice data or negative-example wake-up voice data, the negative-example wake-up voice data including false wake-up data and voice data of failed wake-ups;
a sample feature extraction module, configured to extract the acoustic-level feature and/or the semantic-level feature of the sample voice data;
a topological structure determination module, configured to determine the topological structure of the voice discrimination model;
a model training module, configured to train the voice discrimination model using the topological structure and the acoustic-level feature and/or the semantic-level feature of the sample voice data, until the data type output by the voice discrimination model for the sample wake-up voice data is identical to the labeled data type.
Optionally, the apparatus further includes:
a model optimization module, configured to optimize the current wake-up model using the wake-up voice data from which the false wake-up data has been screened out.
With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method and will not be elaborated here.
Referring to Fig. 4, a schematic structural diagram of an electronic device 400 for voice data processing of the present disclosure is shown. Referring to Fig. 4, the electronic device 400 includes a processing component 401, which further includes one or more processors, and storage resources represented by a storage medium 402 for storing instructions executable by the processing component 401, for example an application program. The application program stored in the storage medium 402 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 401 is configured to execute the instructions so as to perform the voice data processing method described above.
The electronic device 400 may also include a power supply component 403 configured to perform power management of the electronic device 400, a wired or wireless network interface 404 configured to connect the electronic device 400 to a network, and an input/output (I/O) interface 405. The electronic device 400 can operate based on an operating system stored in the storage medium 402, for example Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM or the like.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details of the above embodiments; within the scope of the technical concept of the present disclosure, various simple variants can be made to the technical solution of the present disclosure, and these simple variants all belong to the protection scope of the present disclosure.
It should further be noted that the specific technical features described in the above specific embodiments can, where not contradictory, be combined in any suitable manner; to avoid unnecessary repetition, the present disclosure does not separately describe the various possible combinations.
In addition, the various different embodiments of the present disclosure can also be combined arbitrarily; as long as such combinations do not depart from the idea of the present disclosure, they should likewise be regarded as content disclosed by the present disclosure.
Claims (16)
1. A voice data processing method, characterized in that the method comprises:
obtaining voice data input by a user, the voice data comprising wake-up voice data that successfully wakes up an intelligent terminal, and control voice data indicating an operation intent;
extracting an acoustic-level feature and/or a semantic-level feature of the voice data, the acoustic-level feature being used to represent pronunciation characteristics of the user, and the semantic-level feature being used to represent textual characteristics of the voice data;
taking the acoustic-level feature and/or the semantic-level feature as input to a pre-built voice discrimination model, and determining, after processing by the model, whether the wake-up voice data is false-wake-up data.
2. The method according to claim 1, characterized in that the wake-up voice data is obtained by:
judging whether at least two pieces of voice data for waking up the intelligent terminal are continuously collected within a preset time period;
if at least two pieces of voice data for waking up the intelligent terminal are continuously collected within the preset time period, and the score d of each of the at least two pieces of voice data after processing by the current wake-up model satisfies the condition d2 ≤ d < d1, determining the at least two pieces of voice data for waking up the intelligent terminal as the wake-up voice data, where d1 is a first wake-up score threshold and d2 is a second wake-up score threshold.
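The two-threshold screening of claim 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation; the threshold values, function name, and the "all scores in the ambiguous band" reading are assumptions.

```python
def is_candidate_wakeup(scores, d1=0.9, d2=0.6, min_count=2):
    """Return True when at least `min_count` consecutive wake-up attempts
    within the preset window all score in the ambiguous band [d2, d1):
    high enough to be plausible wake-ups, too low to be confident ones."""
    if len(scores) < min_count:
        return False
    return all(d2 <= d < d1 for d in scores)

# Two borderline attempts -> flagged as wake-up voice data to be
# checked by the discrimination model; confident scores are not.
print(is_candidate_wakeup([0.72, 0.68]))   # in the band [0.6, 0.9)
print(is_candidate_wakeup([0.95, 0.97]))   # above the band
```

Such candidates are the ones most likely to contain false wake-ups, which is why they are routed to the discrimination model rather than accepted outright.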
3. The method according to claim 1 or 2, characterized in that the acoustic-level feature comprises an acoustic score of the current wake-up model, and extracting the acoustic-level feature of the voice data comprises:
obtaining the top-N recognition results output by the current wake-up model for each speech unit of the wake-up voice data;
if the top-N recognition results of a speech unit contain the correct pronunciation of that speech unit, judging the recognition result of the speech unit to be correct;
counting, according to the recognition results of the speech units, the recognition accuracy of the wake-up voice data as the acoustic score of the current wake-up model.
4. The method according to claim 3, characterized in that:
the acoustic-level feature further comprises at least one of a fundamental-frequency mean, a short-time average energy, and a short-time zero-crossing rate;
and/or
the acoustic-level feature further comprises a voiced/unvoiced sequence feature, in which case extracting the acoustic-level feature of the voice data comprises: taking at least one of the fundamental-frequency mean, the short-time average energy, and the short-time zero-crossing rate as input to a pre-built voiced/unvoiced classifier, and outputting, after processing, the voiced/unvoiced sequence {a1, a2, …, ai, …, am} of the wake-up voice data, where ai denotes the voiced/unvoiced category corresponding to the i-th phoneme of the wake-up voice data; and calculating the similarity between the voiced/unvoiced sequence of the wake-up voice data and the voiced/unvoiced sequence corresponding to the wake-up word of the wake-up voice data as the voiced/unvoiced sequence feature;
and/or
the acoustic-level feature further comprises a tone sequence feature, in which case extracting the acoustic-level feature of the voice data comprises: taking at least one of the fundamental-frequency mean, the short-time average energy, and the short-time zero-crossing rate as input to a pre-built tone classifier, and outputting, after processing, the tone sequence {b1, b2, …, bj, …, bn} of the wake-up voice data, where bj denotes the tone type corresponding to the j-th syllable of the wake-up voice data; and calculating the similarity between the tone sequence of the wake-up voice data and the tone sequence corresponding to the wake-up word of the wake-up voice data as the tone sequence feature;
and/or
the acoustic-level feature further comprises a temporal feature of the speech units, in which case extracting the acoustic-level feature of the voice data comprises: counting the duration of each speech unit of the wake-up voice data; and calculating, from the durations of the speech units, a time mean and a time variance as the temporal feature of the speech units;
and/or
the acoustic-level feature further comprises a voiceprint feature, in which case extracting the acoustic-level feature of the voice data comprises: extracting, by means of a pre-built voiceprint extraction model, the i-vector feature of the wake-up voice data as the voiceprint feature;
and/or
the acoustic-level feature further comprises an energy-distribution feature, in which case extracting the acoustic-level feature of the voice data comprises: segmenting the voice data into three parts ct-1, ct, and ct+1, and counting the average energy distribution of each part as the energy-distribution feature, where ct denotes the wake-up voice data, ct+1 denotes the voice data set collected after the wake-up voice data and containing the control voice data, and ct-1 denotes the voice data set collected before the wake-up voice data.
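Two of the acoustic-level features in claim 4 reduce to simple computations: the voiced/unvoiced (or tone) sequence feature is a similarity between two label sequences, and the temporal feature is the mean and variance of per-unit durations. A hedged sketch; position-wise agreement is used as the similarity measure here, since the patent does not fix a particular metric:

```python
def sequence_similarity(seq, ref):
    """Fraction of positions where the classifier's voiced/unvoiced
    (or tone) labels match the wake-up word's reference labels.
    Length mismatches count as disagreements."""
    n = max(len(seq), len(ref))
    matches = sum(1 for a, b in zip(seq, ref) if a == b)
    return matches / n

def temporal_feature(durations):
    """Time mean and time variance of speech-unit durations."""
    mean = sum(durations) / len(durations)
    var = sum((d - mean) ** 2 for d in durations) / len(durations)
    return mean, var

# 'V' = voiced, 'U' = unvoiced; 3 of 4 labels agree with the reference.
print(sequence_similarity("VUVV", "VUVU"))
# Durations (seconds) of four speech units of the wake-up word.
print(temporal_feature([0.2, 0.3, 0.25, 0.25]))
```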
5. The method according to claim 1 or 2, characterized in that:
the semantic-level feature comprises a semantic smoothness, in which case extracting the semantic-level feature of the voice data comprises: performing word segmentation on the voice data to obtain a word sequence {w1, w2, …, wk, …, wf}, where wk denotes the k-th word of the voice data; and calculating the probability that the f words occur in the order of the word sequence as the semantic smoothness;
and/or
the semantic-level feature comprises an edit distance of a part-of-speech sequence, in which case extracting the semantic-level feature of the voice data comprises: performing word segmentation on the voice data to obtain a part-of-speech sequence {q1, q2, …, qk, …, qf}, where qk denotes the part of speech of the k-th word of the voice data; calculating the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each piece of sample voice data; and selecting the minimum edit distance therefrom as the edit distance of the part-of-speech sequence, the sample voice data being the data used for training the voice discrimination model;
and/or
the semantic-level feature comprises an intent feature, in which case extracting the semantic-level feature of the voice data comprises: extracting, by means of a pre-built intent analysis model, the intent feature of the control voice data, the intent feature indicating either a clear intent or no clear intent, or alternatively the intent feature indicating the intent category corresponding to the control voice data.
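The part-of-speech edit distance of claim 5 is a standard Levenshtein distance over POS-tag sequences, minimised across the training samples. A minimal sketch; the POS tags and sample sequences are illustrative, not from the patent:

```python
def edit_distance(a, b):
    """Levenshtein distance between two part-of-speech sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def pos_edit_distance_feature(pos_seq, sample_pos_seqs):
    """Minimum edit distance to any training sample's POS sequence."""
    return min(edit_distance(pos_seq, s) for s in sample_pos_seqs)

samples = [["v", "n"], ["r", "v", "n"], ["n", "v", "n", "u"]]
print(pos_edit_distance_feature(["v", "n", "u"], samples))  # closest sample is 1 edit away
```

A small minimum distance suggests the utterance is phrased like the training data; a large one suggests background speech that merely resembled the wake-up word.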
6. The method according to claim 1 or 2, characterized in that the voice discrimination model is built by:
collecting sample voice data, the sample voice data comprising sample wake-up voice data and sample control voice data, the data type of the sample wake-up voice data being labeled as positive-example wake-up voice data or negative-example wake-up voice data, the negative-example wake-up voice data comprising false-wake-up data and voice data that failed to wake up the terminal;
extracting the acoustic-level feature and/or semantic-level feature of the sample voice data;
determining the topology of the voice discrimination model;
training the voice discrimination model using the topology and the acoustic-level feature and/or semantic-level feature of the sample voice data, until the data type of the sample wake-up voice data output by the voice discrimination model matches the labeled data type.
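The training loop of claim 6 can be illustrated with any binary classifier over the extracted features. The sketch below uses logistic regression trained by gradient descent; the feature values and labels are fabricated for illustration, and the patent does not prescribe this (or any particular) model topology:

```python
import math

def train_discriminator(features, labels, lr=0.5, epochs=2000):
    """Fit weights w and bias b so that sigmoid(w.x + b) separates
    positive-example (1) from negative-example (0) wake-up samples."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy features: [acoustic score, semantic smoothness]; 1 = genuine wake-up.
X = [[0.9, 0.8], [0.85, 0.9], [0.2, 0.1], [0.3, 0.2]]
y = [1, 1, 0, 0]
w, b = train_discriminator(X, y)
print([predict(w, b, x) for x in X])  # reproduces the labels
```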
7. The method according to claim 1 or 2, characterized in that the method further comprises:
optimizing the current wake-up model using wake-up voice data from which the false-wake-up data has been screened out.
8. A voice data processing apparatus, characterized in that the apparatus comprises:
a voice data obtaining module, configured to obtain voice data input by a user, the voice data comprising wake-up voice data that successfully wakes up an intelligent terminal, and control voice data indicating an operation intent;
a feature extraction module, configured to extract an acoustic-level feature and/or a semantic-level feature of the voice data, the acoustic-level feature being used to represent pronunciation characteristics of the user, and the semantic-level feature being used to represent textual characteristics of the voice data;
a model processing module, configured to take the acoustic-level feature and/or the semantic-level feature as input to a pre-built voice discrimination model, and to determine, after processing by the model, whether the wake-up voice data is false-wake-up data.
9. The apparatus according to claim 8, characterized in that:
the voice data obtaining module is configured to judge whether at least two pieces of voice data for waking up the intelligent terminal are continuously collected within a preset time period; and, if at least two pieces of voice data for waking up the intelligent terminal are continuously collected within the preset time period and the score d of each of the at least two pieces of voice data after processing by the current wake-up model satisfies d2 ≤ d < d1, to determine the at least two pieces of voice data for waking up the intelligent terminal as the wake-up voice data, where d1 is a first wake-up score threshold and d2 is a second wake-up score threshold.
10. The apparatus according to claim 8 or 9, characterized in that the acoustic-level feature comprises an acoustic score of the current wake-up model, and
the feature extraction module is configured to obtain the top-N recognition results output by the current wake-up model for each speech unit of the wake-up voice data; if the top-N recognition results of a speech unit contain the correct pronunciation of that speech unit, to judge the recognition result of the speech unit to be correct; and to count, according to the recognition results of the speech units, the recognition accuracy of the wake-up voice data as the acoustic score of the current wake-up model.
11. The apparatus according to claim 10, characterized in that:
the acoustic-level feature further comprises at least one of a fundamental-frequency mean, a short-time average energy, and a short-time zero-crossing rate;
and/or
the acoustic-level feature further comprises a voiced/unvoiced sequence feature, and the feature extraction module is configured to take at least one of the fundamental-frequency mean, the short-time average energy, and the short-time zero-crossing rate as input to a pre-built voiced/unvoiced classifier and to output, after processing, the voiced/unvoiced sequence {a1, a2, …, ai, …, am} of the wake-up voice data, where ai denotes the voiced/unvoiced category corresponding to the i-th phoneme of the wake-up voice data; and to calculate the similarity between the voiced/unvoiced sequence of the wake-up voice data and the voiced/unvoiced sequence corresponding to the wake-up word of the wake-up voice data as the voiced/unvoiced sequence feature;
and/or
the acoustic-level feature further comprises a tone sequence feature, and the feature extraction module is configured to take at least one of the fundamental-frequency mean, the short-time average energy, and the short-time zero-crossing rate as input to a pre-built tone classifier and to output, after processing, the tone sequence {b1, b2, …, bj, …, bn} of the wake-up voice data, where bj denotes the tone type corresponding to the j-th syllable of the wake-up voice data; and to calculate the similarity between the tone sequence of the wake-up voice data and the tone sequence corresponding to the wake-up word of the wake-up voice data as the tone sequence feature;
and/or
the acoustic-level feature further comprises a temporal feature of the speech units, and the feature extraction module is configured to count the duration of each speech unit of the wake-up voice data, and to calculate, from the durations of the speech units, a time mean and a time variance as the temporal feature of the speech units;
and/or
the acoustic-level feature further comprises a voiceprint feature, and the feature extraction module is configured to extract, by means of a pre-built voiceprint extraction model, the i-vector feature of the wake-up voice data as the voiceprint feature;
and/or
the acoustic-level feature further comprises an energy-distribution feature, and the feature extraction module is configured to segment the voice data into three parts ct-1, ct, and ct+1, and to count the average energy distribution of each part as the energy-distribution feature, where ct denotes the wake-up voice data, ct+1 denotes the voice data set collected after the wake-up voice data and containing the control voice data, and ct-1 denotes the voice data set collected before the wake-up voice data.
12. The apparatus according to claim 8 or 9, characterized in that:
the semantic-level feature comprises a semantic smoothness, and the feature extraction module is configured to perform word segmentation on the voice data to obtain a word sequence {w1, w2, …, wk, …, wf}, where wk denotes the k-th word of the voice data, and to calculate the probability that the f words occur in the order of the word sequence as the semantic smoothness;
and/or
the semantic-level feature comprises an edit distance of a part-of-speech sequence, and the feature extraction module is configured to perform word segmentation on the voice data to obtain a part-of-speech sequence {q1, q2, …, qk, …, qf}, where qk denotes the part of speech of the k-th word of the voice data; to calculate the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each piece of sample voice data; and to select the minimum edit distance therefrom as the edit distance of the part-of-speech sequence, the sample voice data being the data used for training the voice discrimination model;
and/or
the semantic-level feature comprises an intent feature, and the feature extraction module is configured to extract, by means of a pre-built intent analysis model, the intent feature of the control voice data, the intent feature indicating either a clear intent or no clear intent, or alternatively the intent feature indicating the intent category corresponding to the control voice data.
13. The apparatus according to claim 8 or 9, characterized in that the apparatus further comprises:
a sample voice data collection module, configured to collect sample voice data, the sample voice data comprising sample wake-up voice data and sample control voice data, the data type of the sample wake-up voice data being labeled as positive-example wake-up voice data or negative-example wake-up voice data, the negative-example wake-up voice data comprising false-wake-up data and voice data that failed to wake up the terminal;
a sample feature extraction module, configured to extract the acoustic-level feature and/or semantic-level feature of the sample voice data;
a topology determining module, configured to determine the topology of the voice discrimination model;
a model training module, configured to train the voice discrimination model using the topology and the acoustic-level feature and/or semantic-level feature of the sample voice data, until the data type of the sample wake-up voice data output by the voice discrimination model matches the labeled data type.
14. The apparatus according to claim 8 or 9, characterized in that the apparatus further comprises:
a model optimization module, configured to optimize the current wake-up model using wake-up voice data from which the false-wake-up data has been screened out.
15. A storage device storing a plurality of instructions, characterized in that the instructions are loaded by a processor to execute the steps of the method according to any one of claims 1 to 7.
16. An electronic device, characterized in that the electronic device comprises:
the storage device according to claim 15; and
a processor, configured to execute the instructions in the storage device.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711364085.1A CN108320733B (en) | 2017-12-18 | 2017-12-18 | Voice data processing method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108320733A true CN108320733A (en) | 2018-07-24 |
CN108320733B CN108320733B (en) | 2022-01-04 |
Family
ID=62893086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711364085.1A Active CN108320733B (en) | 2017-12-18 | 2017-12-18 | Voice data processing method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108320733B (en) |
Citations (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101727903A (en) * | 2008-10-29 | 2010-06-09 | 中国科学院自动化研究所 | Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems |
US20110004624A1 (en) * | 2009-07-02 | 2011-01-06 | International Business Machines Corporation | Method for Customer Feedback Measurement in Public Places Utilizing Speech Recognition Technology |
CN102999161A (en) * | 2012-11-13 | 2013-03-27 | 安徽科大讯飞信息科技股份有限公司 | Implementation method and application of voice awakening module |
WO2013163113A1 (en) * | 2012-04-26 | 2013-10-31 | Nuance Communications, Inc | Embedded system for construction of small footprint speech recognition with user-definable constraints |
CN103474069A (en) * | 2013-09-12 | 2013-12-25 | 中国科学院计算技术研究所 | Method and system for fusing recognition results of a plurality of speech recognition systems |
CN103943105A (en) * | 2014-04-18 | 2014-07-23 | 安徽科大讯飞信息科技股份有限公司 | Voice interaction method and system |
CN104281645A (en) * | 2014-08-27 | 2015-01-14 | 北京理工大学 | Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency |
CN104538030A (en) * | 2014-12-11 | 2015-04-22 | 科大讯飞股份有限公司 | Control system and method for controlling household appliances through voice |
CN105096939A (en) * | 2015-07-08 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice wake-up method and device |
CN105702253A (en) * | 2016-01-07 | 2016-06-22 | 北京云知声信息技术有限公司 | Voice awakening method and device |
CN105976813A (en) * | 2015-03-13 | 2016-09-28 | 三星电子株式会社 | Speech recognition system and speech recognition method thereof |
CN106272481A (en) * | 2016-08-15 | 2017-01-04 | 北京光年无限科技有限公司 | The awakening method of a kind of robot service and device |
CN106297777A (en) * | 2016-08-11 | 2017-01-04 | 广州视源电子科技股份有限公司 | A kind of method and apparatus waking up voice service up |
CN106653056A (en) * | 2016-11-16 | 2017-05-10 | 中国科学院自动化研究所 | Fundamental frequency extraction model based on LSTM recurrent neural network and training method thereof |
CN106782554A (en) * | 2016-12-19 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | Voice awakening method and device based on artificial intelligence |
FI20156000A (en) * | 2015-12-22 | 2017-06-23 | Code-Q Oy | Speech recognition method and apparatus based on a wake-up call |
CN107223280A (en) * | 2017-03-03 | 2017-09-29 | 深圳前海达闼云端智能科技有限公司 | robot awakening method, device and robot |
CN107358951A (en) * | 2017-06-29 | 2017-11-17 | 阿里巴巴集团控股有限公司 | A kind of voice awakening method, device and electronic equipment |
CN107464564A (en) * | 2017-08-21 | 2017-12-12 | 腾讯科技(深圳)有限公司 | voice interactive method, device and equipment |
Non-Patent Citations (4)
Title |
---|
FENGPEI GE, YONGHONG YAN: "Deep neural network based wake-up-word speech recognition with two-stage detection", 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) * |
SHILEI ZHANG, ET AL.: "Wake-up-word spotting using end-to-end deep neural network system", 2016 23rd International Conference on Pattern Recognition (ICPR) * |
SHI Weijia et al.: "Software and hardware implementation of an intelligent voice set-top box", Telecommunications Science * |
CHEN Yongzhen et al.: "Integrated digital home design based on speech recognition and Bluetooth technology", World Electronic Components * |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109119079A (en) * | 2018-07-25 | 2019-01-01 | 天津字节跳动科技有限公司 | voice input processing method and device |
CN108831471A (en) * | 2018-09-03 | 2018-11-16 | 与德科技有限公司 | A kind of voice method for security protection, device and route terminal |
CN109036412A (en) * | 2018-09-17 | 2018-12-18 | 苏州奇梦者网络科技有限公司 | voice awakening method and system |
US11869482B2 (en) | 2018-09-30 | 2024-01-09 | Microsoft Technology Licensing, Llc | Speech waveform generation |
CN111602194A (en) * | 2018-09-30 | 2020-08-28 | 微软技术许可有限责任公司 | Speech waveform generation |
CN110444210A (en) * | 2018-10-25 | 2019-11-12 | 腾讯科技(深圳)有限公司 | A kind of method of speech recognition, the method and device for waking up word detection |
CN111261143A (en) * | 2018-12-03 | 2020-06-09 | 杭州嘉楠耘智信息科技有限公司 | Voice wake-up method and device and computer readable storage medium |
CN111261143B (en) * | 2018-12-03 | 2024-03-22 | 嘉楠明芯(北京)科技有限公司 | Voice wakeup method and device and computer readable storage medium |
CN109671435B (en) * | 2019-02-21 | 2020-12-25 | 三星电子(中国)研发中心 | Method and apparatus for waking up smart device |
CN109671435A (en) * | 2019-02-21 | 2019-04-23 | 三星电子(中国)研发中心 | Method and apparatus for waking up smart machine |
CN110070863A (en) * | 2019-03-11 | 2019-07-30 | 华为技术有限公司 | A kind of sound control method and device |
CN110060665A (en) * | 2019-03-15 | 2019-07-26 | 上海拍拍贷金融信息服务有限公司 | Word speed detection method and device, readable storage medium storing program for executing |
CN110049107A (en) * | 2019-03-22 | 2019-07-23 | 钛马信息网络技术有限公司 | A kind of net connection vehicle awakening method, device, equipment and medium |
CN110049107B (en) * | 2019-03-22 | 2022-04-08 | 钛马信息网络技术有限公司 | Internet vehicle awakening method, device, equipment and medium |
CN110534098A (en) * | 2019-10-09 | 2019-12-03 | 国家电网有限公司客户服务中心 | A kind of the speech recognition Enhancement Method and device of age enhancing |
CN110992940A (en) * | 2019-11-25 | 2020-04-10 | 百度在线网络技术(北京)有限公司 | Voice interaction method, device, equipment and computer-readable storage medium |
US11250854B2 (en) | 2019-11-25 | 2022-02-15 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for voice interaction, device and computer-readable storage medium |
CN111128155A (en) * | 2019-12-05 | 2020-05-08 | 珠海格力电器股份有限公司 | Awakening method, device, equipment and medium for intelligent equipment |
EP3923272A1 (en) * | 2020-06-10 | 2021-12-15 | Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. | Method and apparatus for adapting a wake-up model |
US11587550B2 (en) | 2020-06-10 | 2023-02-21 | Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. | Method and apparatus for outputting information |
WO2021159756A1 (en) * | 2020-09-04 | 2021-08-19 | 平安科技(深圳)有限公司 | Method for response obligation detection based on multiple modes, and system and apparatus |
CN112037772A (en) * | 2020-09-04 | 2020-12-04 | 平安科技(深圳)有限公司 | Multi-mode-based response obligation detection method, system and device |
CN112037772B (en) * | 2020-09-04 | 2024-04-02 | 平安科技(深圳)有限公司 | Response obligation detection method, system and device based on multiple modes |
CN112530442B (en) * | 2020-11-05 | 2023-11-17 | 广东美的厨房电器制造有限公司 | Voice interaction method and device |
CN112530442A (en) * | 2020-11-05 | 2021-03-19 | 广东美的厨房电器制造有限公司 | Voice interaction method and device |
WO2022147692A1 (en) * | 2021-01-06 | 2022-07-14 | 京东方科技集团股份有限公司 | Voice command recognition method, electronic device and non-transitory computer-readable storage medium |
CN112951235A (en) * | 2021-01-27 | 2021-06-11 | 北京云迹科技有限公司 | Voice recognition method and device |
CN112951235B (en) * | 2021-01-27 | 2022-08-16 | 北京云迹科技股份有限公司 | Voice recognition method and device |
CN113436615A (en) * | 2021-07-06 | 2021-09-24 | 南京硅语智能科技有限公司 | Semantic recognition model, training method thereof and semantic recognition method |
CN117784632A (en) * | 2024-02-28 | 2024-03-29 | 深圳市轻生活科技有限公司 | Intelligent household control system based on offline voice recognition |
Also Published As
Publication number | Publication date |
---|---|
CN108320733B (en) | 2022-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108320733A (en) | Voice data processing method and device, storage medium, electronic equipment | |
CN106611597B (en) | Voice awakening method and device based on artificial intelligence | |
US8825479B2 (en) | System and method for recognizing emotional state from a speech signal | |
CN105206258A (en) | Generation method and device of acoustic model as well as voice synthetic method and device | |
KR20180091903A (en) | METHOD, APPARATUS AND STORAGE MEDIUM FOR CONFIGURING VOICE DECODING NETWORK IN NUMERIC VIDEO RECOGNI | |
CN110299153A (en) | Sound section detection device, sound section detection method and recording medium | |
JP4322785B2 (en) | Speech recognition apparatus, speech recognition method, and speech recognition program | |
CN105206271A (en) | Intelligent equipment voice wake-up method and system for realizing method | |
CN111105785B (en) | Text prosody boundary recognition method and device | |
CN112102850B (en) | Emotion recognition processing method and device, medium and electronic equipment | |
CN108320734A (en) | Audio signal processing method and device, storage medium, electronic equipment | |
CN102982811A (en) | Voice endpoint detection method based on real-time decoding | |
JP2006048065A (en) | Method and apparatus for voice-interactive language instruction | |
CN102194454A (en) | Equipment and method for detecting key word in continuous speech | |
CN109036395A (en) | Personalized speaker control method, system, intelligent sound box and storage medium | |
CN102945673A (en) | Continuous speech recognition method with speech command range changed dynamically | |
CN108269574B (en) | Method and device for processing voice signal to represent vocal cord state of user, storage medium and electronic equipment | |
CN107871499A (en) | Audio recognition method, system, computer equipment and computer-readable recording medium | |
CN110265063A (en) | A kind of lie detecting method based on fixed duration speech emotion recognition sequence analysis | |
CN112071308A (en) | Awakening word training method based on speech synthesis data enhancement | |
CN108536668A (en) | Wake up word appraisal procedure and device, storage medium, electronic equipment | |
CN115414042B (en) | Multi-modal anxiety detection method and device based on emotion information assistance | |
CN110827853A (en) | Voice feature information extraction method, terminal and readable storage medium | |
CN111276156B (en) | Real-time voice stream monitoring method | |
KR102113879B1 (en) | The method and apparatus for recognizing speaker's voice by using reference database |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||