CN108320733A - Voice data processing method and device, storage medium, electronic equipment - Google Patents
- Publication number
- CN108320733A CN108320733A CN201711364085.1A CN201711364085A CN108320733A CN 108320733 A CN108320733 A CN 108320733A CN 201711364085 A CN201711364085 A CN 201711364085A CN 108320733 A CN108320733 A CN 108320733A
- Authority
- CN
- China
- Prior art keywords
- voice data
- voice
- wake
- data
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/90—Pitch determination of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
Abstract
The present disclosure provides a voice data processing method and device, a storage medium, and electronic equipment. The method includes: obtaining voice data input by a user, the voice data including wake-up voice data that successfully wakes up an intelligent terminal and control voice data that expresses an operation intention; extracting acoustic-level features and/or semantic-level features of the voice data, the acoustic-level features characterizing the user's pronunciation and the semantic-level features characterizing the text of the voice data; and taking the acoustic-level features and/or semantic-level features as input to a pre-built voice discrimination model, which determines whether the wake-up voice data is false-wake-up data. By optimizing the wake-up model with wake-up voice data from which false-wake-up data has been screened out, the scheme helps improve the optimization performance of the wake-up model.
Description
Technical field
The present disclosure relates to the field of speech processing, and in particular to a voice data processing method and device, a storage medium, and electronic equipment.
Background technology
Voice wake-up technology is an important branch of the speech processing field, with important applications in smart homes, intelligent robots, intelligent vehicle devices, smartphones, and the like.
In general, the voice wake-up process of an intelligent terminal proceeds as follows. The terminal monitors whether the user inputs voice data; when voice data is received, its acoustic features are extracted. The acoustic features are then fed into a pre-built wake-up model for wake-word recognition. If a wake word is recognized, the wake-up succeeds and the terminal continues monitoring for the user's operation intention; otherwise the wake-up fails and the terminal resumes monitoring for further wake-up attempts. The acoustic features are typically spectral features of the voice data, for example Mel Frequency Cepstrum Coefficient (MFCC) features or Perceptual Linear Predictive (PLP) features.
In general, the performance of the initial wake-up model is not optimal, and the model must be continually optimized during use to improve its recognition accuracy. Specifically, voice data that woke the terminal successfully can be treated as positive examples, voice data that failed to wake it as negative examples, and the current wake-up model trained and optimized on a discriminative criterion.
In practice, because the initial wake-up model performs imperfectly, the successful wake-ups may contain false-wake-up data: background noise, speech interference, or non-wake words pronounced similarly to the wake word may accidentally wake the intelligent terminal. If such false-wake-up data is used as positive examples during optimization, the performance of the wake-up model is likely to grow steadily worse.
Summary of the invention
A general object of the present disclosure is to provide a voice data processing method and device, a storage medium, and electronic equipment that help improve the optimization performance of the wake-up model.
To achieve the above object, the disclosure provides a voice data processing method, the method including:
obtaining voice data input by a user, the voice data including wake-up voice data that successfully wakes up an intelligent terminal and control voice data that expresses an operation intention;
extracting acoustic-level features and/or semantic-level features of the voice data, the acoustic-level features being used to characterize the user's pronunciation and the semantic-level features being used to characterize the text of the voice data;
taking the acoustic-level features and/or semantic-level features as input to a pre-built voice discrimination model and, after processing by the model, determining whether the wake-up voice data is false-wake-up data.
Optionally, the wake-up voice data is obtained as follows: judging whether at least two pieces of voice data for waking up the intelligent terminal are collected consecutively within a preset time period; if so, and each of their score values d after processing by the current wake-up model satisfies d2 ≤ d < d1, determining those pieces of voice data to be the wake-up voice data, where d1 is a first wake-up score threshold and d2 is a second wake-up score threshold.
Optionally, the acoustic-level features include an acoustic score of the current wake-up model, and extracting the acoustic-level features of the voice data includes: obtaining the top-N recognition results output by the current wake-up model for each speech unit of the wake-up voice data; if the top-N recognition results for a speech unit include that unit's correct pronunciation, judging the unit to be recognized correctly; and, from the recognition results of all the speech units, computing the recognition accuracy of the wake-up voice data as the acoustic score of the current wake-up model.
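As a sketch only (the patent does not specify an implementation), the top-N accuracy computation above might look as follows; `acoustic_score` and the phonetic units in the example are illustrative names and values, not from the patent:

```python
# Sketch: the "acoustic score" of the current wake-up model, taken as the
# fraction of speech units whose correct pronunciation appears among the
# model's top-N recognition candidates for that unit.

def acoustic_score(top_n_results, reference):
    """top_n_results: one list of top-N candidates per speech unit.
    reference: the correct pronunciation of each speech unit."""
    correct = sum(1 for cands, ref in zip(top_n_results, reference) if ref in cands)
    return correct / len(reference)

# A wake word with 4 speech units; each unit has the model's top-3 candidates.
cands = [["d", "t", "g"], ["ing", "in", "ang"], ["d", "b", "p"], ["ong", "eng", "ang"]]
ref = ["d", "ing", "d", "ong"]
print(acoustic_score(cands, ref))  # 1.0: every unit's correct unit is in its top-3
```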
Optionally, the acoustic-level features further include at least one of fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate;
And/or
the acoustic-level features further include a voiced/unvoiced sequence feature, in which case extracting the acoustic-level features of the voice data includes: taking at least one of the fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate as input to a pre-built voiced/unvoiced classifier, which outputs the voiced/unvoiced sequence {a1, a2, ..., ai, ..., am} of the wake-up voice data, where ai denotes the voiced/unvoiced class of the i-th phoneme of the wake-up voice data; and computing the similarity between this sequence and the voiced/unvoiced sequence of the wake word, as the voiced/unvoiced sequence feature;
And/or
the acoustic-level features further include a tone sequence feature, in which case extracting the acoustic-level features of the voice data includes: taking at least one of the fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate as input to a pre-built tone classifier, which outputs the tone sequence {b1, b2, ..., bj, ..., bn} of the wake-up voice data, where bj denotes the tone type of the j-th syllable of the wake-up voice data; and computing the similarity between this sequence and the tone sequence of the wake word, as the tone sequence feature;
And/or
the acoustic-level features further include duration features of the speech units, in which case extracting the acoustic-level features of the voice data includes: measuring the duration of each speech unit of the wake-up voice data, and computing the mean and variance of those durations as the duration features of the speech units;
And/or
the acoustic-level features further include a voiceprint feature, in which case extracting the acoustic-level features of the voice data includes: extracting the i-vector feature of the wake-up voice data with a pre-built voiceprint extraction model, as the voiceprint feature;
And/or
the acoustic-level features further include an energy-distribution feature, in which case extracting the acoustic-level features of the voice data includes: cutting the voice data into three parts c_{t-1}, c_t, c_{t+1} and computing the average energy of each part as the energy-distribution feature, where c_t denotes the wake-up voice data, c_{t+1} denotes the voice data collected after the wake-up voice data (including the control voice data), and c_{t-1} denotes the voice data collected before the wake-up voice data.
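The sequence-similarity and energy-distribution features above can be sketched as follows, under assumed representations not fixed by the patent: a voiced/unvoiced (or tone) sequence as a list of labels, and audio as a list of amplitude samples; all names and values are illustrative:

```python
# Sketch of two optional acoustic-level features (assumed representations).

def sequence_similarity(seq, template):
    """Fraction of positions where the observed voiced/unvoiced (or tone)
    sequence matches the wake word's template sequence; length mismatches
    are penalized by dividing by the longer length."""
    if not seq or not template:
        return 0.0
    matches = sum(1 for a, b in zip(seq, template) if a == b)
    return matches / max(len(seq), len(template))

def energy_distribution(samples, t_start, t_end):
    """Split the recording into c_{t-1} (before the wake-up), c_t (the
    wake-up) and c_{t+1} (after it, incl. the control utterance), and
    return each segment's average energy (mean squared amplitude)."""
    def avg_energy(seg):
        return sum(x * x for x in seg) / len(seg) if seg else 0.0
    return (avg_energy(samples[:t_start]),
            avg_energy(samples[t_start:t_end]),
            avg_energy(samples[t_end:]))

print(sequence_similarity([1, 0, 1, 0], [1, 0, 1, 1]))              # 0.75
print(energy_distribution([0.0, 0.0, 1.0, -1.0, 0.5, 0.5], 2, 4))  # (0.0, 1.0, 0.25)
```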
Optionally, the semantic-level features include semantic smoothness, in which case extracting the semantic-level features of the voice data includes: performing word segmentation on the voice data to obtain the word sequence {w1, w2, ..., wk, ..., wf}, where wk denotes the k-th word of the voice data; and computing the probability that the f words occur in the order of the word sequence, as the semantic smoothness;
And/or
the semantic-level features include a part-of-speech edit distance, in which case extracting the semantic-level features of the voice data includes: performing word segmentation on the voice data to obtain the part-of-speech sequence {q1, q2, ..., qk, ..., qf}, where qk denotes the part of speech of the k-th word of the voice data; computing the edit distance between this sequence and the part-of-speech sequence of each piece of sample voice data, the sample voice data being the data used to train the voice discrimination model; and choosing the smallest such distance as the part-of-speech edit distance;
And/or
the semantic-level features include an intention feature, in which case extracting the semantic-level features of the voice data includes: extracting the intention feature of the control voice data with a pre-built intention analysis model, the intention feature indicating either a clear intention or no clear intention, or, alternatively, the intention category to which the control voice data corresponds.
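A minimal sketch of the two computable semantic-level features above, under assumptions: semantic smoothness scored with a toy bigram probability table, and the part-of-speech feature as the minimum Levenshtein distance to the training samples' POS sequences. The bigram table and tag values are illustrative, not from the patent:

```python
# Sketch: semantic smoothness (toy bigram model) and POS edit distance.

def smoothness(words, bigram_prob, default=1e-6):
    """Product of bigram probabilities over the segmented word sequence."""
    p = 1.0
    for a, b in zip(words, words[1:]):
        p *= bigram_prob.get((a, b), default)
    return p

def edit_distance(s, t):
    """Levenshtein distance between two part-of-speech tag sequences."""
    d = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        prev, d[0] = d[0], i
        for j, b in enumerate(t, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (a != b))
    return d[-1]

bigrams = {("I", "want"): 0.1, ("want", "music"): 0.05}
print(smoothness(["I", "want", "music"], bigrams))   # 0.1 * 0.05

pos = ["r", "v", "n"]                                # e.g. pronoun-verb-noun
samples = [["r", "v", "u", "n"], ["v", "n"]]         # POS sequences of training samples
print(min(edit_distance(pos, s) for s in samples))   # 1
```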
Optionally, the voice discrimination model is built as follows: collecting sample voice data, the sample voice data including sample wake-up voice data and sample control voice data, each piece of sample wake-up voice data being labeled as either positive-example or negative-example wake-up voice data, the negative examples including false-wake-up data and voice data that failed to wake the terminal; extracting the acoustic-level features and/or semantic-level features of the sample voice data; determining the topology of the voice discrimination model; and training the voice discrimination model with that topology and those features until the data type the model outputs for each piece of sample wake-up voice data matches its label.
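The patent leaves the discrimination model's topology open; purely as an illustrative stand-in, the training step above can be sketched with a tiny perceptron over a synthetic acoustic/semantic feature vector (feature values and labels below are invented for the example):

```python
# Sketch: training a stand-in discrimination model on labeled feature vectors.
# 1 = genuine wake-up (positive example), 0 = false wake-up (negative example).

def train_perceptron(samples, labels, epochs=50, lr=0.1):
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Feature vector: [acoustic score, voiced/unvoiced similarity, semantic smoothness]
X = [[0.9, 0.9, 0.8], [0.95, 0.85, 0.9], [0.3, 0.4, 0.1], [0.2, 0.5, 0.2]]
y = [1, 1, 0, 0]
w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X])  # [1, 1, 0, 0] on this separable toy set
```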
Optionally, the method further includes: optimizing the current wake-up model with the wake-up voice data from which the false-wake-up data has been screened out.
The disclosure also provides a voice data processing apparatus, the apparatus including:
a voice data acquisition module, for obtaining voice data input by a user, the voice data including wake-up voice data that successfully wakes up an intelligent terminal and control voice data that expresses an operation intention;
a feature extraction module, for extracting acoustic-level features and/or semantic-level features of the voice data, the acoustic-level features characterizing the user's pronunciation and the semantic-level features characterizing the text of the voice data;
a model processing module, for taking the acoustic-level features and/or semantic-level features as input to the pre-built voice discrimination model and, after processing by the model, determining whether the wake-up voice data is false-wake-up data.
Optionally, the voice data acquisition module judges whether at least two pieces of voice data for waking up the intelligent terminal are collected consecutively within a preset time period and, if so, and each of their score values d after processing by the current wake-up model satisfies d2 ≤ d < d1, determines them to be the wake-up voice data, where d1 is the first wake-up score threshold and d2 is the second wake-up score threshold.
Optionally, the acoustic-level features include the acoustic score of the current wake-up model, and the feature extraction module obtains the top-N recognition results output by the current wake-up model for each speech unit of the wake-up voice data; judges a speech unit to be recognized correctly if its top-N results include its correct pronunciation; and, from the recognition results of all the speech units, computes the recognition accuracy of the wake-up voice data as the acoustic score of the current wake-up model.
Optionally, the acoustic-level features further include at least one of fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate;
And/or
the acoustic-level features further include the voiced/unvoiced sequence feature, and the feature extraction module takes at least one of the fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate as input to the pre-built voiced/unvoiced classifier, obtains the voiced/unvoiced sequence {a1, a2, ..., ai, ..., am} of the wake-up voice data, where ai denotes the voiced/unvoiced class of the i-th phoneme, and computes the similarity between this sequence and the voiced/unvoiced sequence of the wake word, as the voiced/unvoiced sequence feature;
And/or
the acoustic-level features further include the tone sequence feature, and the feature extraction module takes at least one of the fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate as input to the pre-built tone classifier, obtains the tone sequence {b1, b2, ..., bj, ..., bn} of the wake-up voice data, where bj denotes the tone type of the j-th syllable, and computes the similarity between this sequence and the tone sequence of the wake word, as the tone sequence feature;
And/or
the acoustic-level features further include the duration features of the speech units, and the feature extraction module measures the duration of each speech unit of the wake-up voice data and computes the mean and variance of those durations as the duration features;
And/or
the acoustic-level features further include the voiceprint feature, and the feature extraction module extracts the i-vector feature of the wake-up voice data with the pre-built voiceprint extraction model, as the voiceprint feature;
And/or
the acoustic-level features further include the energy-distribution feature, and the feature extraction module cuts the voice data into three parts c_{t-1}, c_t, c_{t+1} and computes the average energy of each part as the energy-distribution feature, where c_t denotes the wake-up voice data, c_{t+1} denotes the voice data collected after it (including the control voice data), and c_{t-1} denotes the voice data collected before it.
Optionally, the semantic-level features include the semantic smoothness, and the feature extraction module performs word segmentation on the voice data to obtain the word sequence {w1, w2, ..., wk, ..., wf}, where wk denotes the k-th word, and computes the probability that the f words occur in the order of the word sequence, as the semantic smoothness;
And/or
the semantic-level features include the part-of-speech edit distance, and the feature extraction module performs word segmentation on the voice data to obtain the part-of-speech sequence {q1, q2, ..., qk, ..., qf}, where qk denotes the part of speech of the k-th word, computes the edit distance between this sequence and the part-of-speech sequence of each piece of sample voice data (the data used to train the voice discrimination model), and chooses the smallest such distance as the part-of-speech edit distance;
And/or
the semantic-level features include the intention feature, and the feature extraction module extracts the intention feature of the control voice data with the pre-built intention analysis model, the intention feature indicating either a clear intention or no clear intention, or, alternatively, the intention category to which the control voice data corresponds.
Optionally, the device further includes:
a sample voice data collection module, for collecting sample voice data, the sample voice data including sample wake-up voice data and sample control voice data, each piece of sample wake-up voice data being labeled as either positive-example or negative-example wake-up voice data, the negative examples including false-wake-up data and voice data that failed to wake the terminal;
a sample feature extraction module, for extracting the acoustic-level features and/or semantic-level features of the sample voice data;
a topology determination module, for determining the topology of the voice discrimination model;
a model training module, for training the voice discrimination model with that topology and those features until the data type the model outputs for each piece of sample wake-up voice data matches its label.
Optionally, the device further includes: a model optimization module, for optimizing the current wake-up model with the wake-up voice data from which the false-wake-up data has been screened out.
The disclosure provides a storage device storing a plurality of instructions that, when loaded by a processor, carry out the steps of the above voice data processing method.
The disclosure provides electronic equipment, the electronic equipment including:
the above storage device; and
a processor, for executing the instructions in the storage device.
According to the disclosed scheme, the wake-up voice data that successfully wakes the intelligent terminal and the control voice data expressing the operation intention can be collected; acoustic-level features characterizing the user's pronunciation and/or semantic-level features characterizing the text of the voice data can be extracted from them and fed into the voice discrimination model, whose output determines whether the wake-up voice data is false-wake-up data. The scheme can thus screen false-wake-up data out of the wake-up voice data before model optimization; compared with the prior art, which treats false-wake-up data as positive examples, this helps improve model optimization performance.
Other features and advantages of the disclosure are described in detail in the detailed description below.
Description of the drawings
The accompanying drawings, which form part of the specification, provide further understanding of the disclosure and, together with the detailed description below, serve to explain the disclosure without limiting it. In the drawings:
Fig. 1 is a flow diagram of the voice data processing method of the disclosed scheme;
Fig. 2 is a flow diagram of building the voice discrimination model in the disclosed scheme;
Fig. 3 is a schematic diagram of the composition of the voice data processing apparatus of the disclosed scheme;
Fig. 4 is a structural schematic diagram of electronic equipment for voice data processing according to the disclosed scheme.
Detailed description
Specific embodiments of the disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only used to describe and explain the disclosure, not to limit it.
Referring to Fig. 1, which shows a flow diagram of the disclosed voice data processing method, the method may include the following steps:
S101: obtain voice data input by the user, the voice data including wake-up voice data that successfully wakes up the intelligent terminal and control voice data that expresses an operation intention.
In general, the wake-up interaction between the user and the intelligent terminal proceeds as follows. The terminal monitors whether the user inputs voice data for waking it up; if voice data is input and a wake word is recognized from it, the wake-up succeeds and the terminal continues monitoring for voice data that manipulates it; if further voice data is input and an operation intention is recognized from it, the terminal can be controlled to execute the corresponding operation.
In the disclosed scheme, voice data from which a wake word is recognized and which successfully wakes the intelligent terminal is called wake-up voice data, and voice data from which an operation intention is recognized and which controls the terminal to execute an operation is called control voice data.
It should be appreciated that, compared with other kinds of voice interaction, the wake-up interaction has a distinct sense of interruption and can be abstracted as "silence + wake word + short pause + operation intention". For example, in "sil ding-dong ding-dong sp I want to listen to Liu Dehua's music", "sil" denotes the silence or ambient noise the terminal hears before the user wakes it; "ding-dong ding-dong" denotes the wake word; "sp" denotes the short pause between the wake-up voice data and the control voice data; and "I want to listen to Liu Dehua's music" denotes the operation intention.
The disclosed scheme screens false-wake-up data out of the wake-up voice data to improve the performance of model optimization. That is, it judges whether each piece of wake-up voice data is false-wake-up data and, if so, classifies it as negative-example voice data. Compared with the prior art, which treats false-wake-up data as positive examples during model optimization, the disclosed scheme helps improve model optimization performance.
As an example, the disclosed voice data processing procedure can be triggered whenever the intelligent terminal is successfully woken, or when some other preset condition is met, for example when a preset number of pieces of voice data has been collected or a predetermined time arrives. The disclosed scheme does not restrict the timing of the processing, which can be set according to practical requirements.
As an example, the intelligent terminal in the disclosed scheme can be any electronic device with a voice wake-up function, for example a smart appliance, mobile phone, PC, or tablet computer. In practice, the user's voice data can be collected through the terminal's microphone; the disclosed scheme places no specific restriction on the form of the terminal or on the device that collects the voice data.
As an example, the wake-up voice data in the disclosed scheme can be voice data from which the current wake-up model recognizes the wake word. For example, a first wake-up score threshold d1 can be set: if the score a piece of voice data for waking the intelligent terminal receives after processing by the current wake-up model is not less than d1, its recognition result is considered to be the wake word, and it can be determined to be wake-up voice data.
As an example, the disclosed scheme can also obtain wake-up voice data as follows: judge whether at least two pieces of voice data for waking up the intelligent terminal are collected consecutively within a preset time period; if so, and each of their scores d after processing by the current wake-up model satisfies d2 ≤ d < d1, determine them to be wake-up voice data.
Practical experience shows that when a user's first wake-up attempt fails, they usually retry quickly, often repeatedly, until the wake-up succeeds or they give up. Based on this behavior, the disclosed scheme provides a new way to identify wake-up voice data: on top of the threshold d1 above, set a second wake-up score threshold d2 with d2 < d1. If at least two pieces of voice data for waking the intelligent terminal, collected consecutively within the preset time period, all score in the interval [d2, d1) after processing by the current wake-up model, they can be determined to be wake-up voice data. In this way, wake-up voice data scoring below d1 can be retained to some extent, enriching the data available for optimizing the current wake-up model.
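The retry rule above can be sketched as follows; the thresholds, window length, and timestamps below are illustrative values, not from the patent:

```python
# Sketch: keep near-threshold wake-up attempts when at least two of them
# fall within a preset time window, each scoring in [d2, d1).

D1, D2, WINDOW = 0.8, 0.5, 5.0  # first/second score thresholds, window in seconds

def recover_wakeups(attempts):
    """attempts: list of (timestamp, wake-up score) from the current model.
    Returns the attempts retained as wake-up voice data under the retry rule."""
    near = [(t, d) for t, d in attempts if D2 <= d < D1]
    kept = []
    for i, (t, d) in enumerate(near):
        # retained if another near-threshold attempt falls within the window
        if any(abs(t - t2) <= WINDOW for j, (t2, _) in enumerate(near) if j != i):
            kept.append((t, d))
    return kept

attempts = [(0.0, 0.6), (2.0, 0.7), (30.0, 0.55), (40.0, 0.95)]
print(recover_wakeups(attempts))  # [(0.0, 0.6), (2.0, 0.7)]
```

The attempt at t=30.0 is near-threshold but isolated, and the one at t=40.0 already exceeds d1, so only the two clustered retries are recovered.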
S102: extract the acoustic-level features and/or semantic-level features of the voice data, the acoustic-level features characterizing the user's pronunciation and the semantic-level features characterizing the text of the voice data.
After the user's voice data is obtained, its acoustic-level features and/or semantic-level features can be extracted for processing by the voice discrimination model.
As an example, the acoustic-level features may include the acoustic score of the current wake-up model. Besides the acoustic score, the acoustic-level features may also include at least one of the following optional features: fundamental-frequency mean, short-time average energy, short-time zero-crossing rate, voiced/unvoiced sequence feature, tone sequence feature, speech-unit duration features, voiceprint feature, and energy-distribution feature. These optional features fall into two types: primitive features extracted directly from the wake-up voice data, such as the fundamental-frequency mean, short-time average energy, and short-time zero-crossing rate; and features obtained by further processing the wake-up voice data, such as the voiced/unvoiced sequence feature, tone sequence feature, duration features, voiceprint feature, and energy-distribution feature.
As an example, the semantic-level features may include at least one of the following: semantic smoothness, part-of-speech edit distance, and intention feature.
The meaning of each feature and its specific extraction process are not detailed here; see the introduction below.
S103: take the acoustic-level features and/or semantic-level features as input to the pre-built voice discrimination model and, after processing by the model, determine whether the wake-up voice data is false-wake-up data.
After the acoustic-level features and/or semantic-level features are extracted from the voice data, the pre-built voice discrimination model can process them to determine whether the wake-up voice data is false-wake-up data; if it is, it can be classified as negative-example voice data, and otherwise it can continue to serve as positive-example voice data.
Taking the case where the current wake-up model comprises a current foreground wake-up model and a current background wake-up model as an example, the process of optimizing the current wake-up model using the wake-up voice data from which false wake-up data has been screened out is briefly described below.
It is to be appreciated that the foreground wake-up model describes the wake-up word and may be trained on voice data containing the wake-up word; the background wake-up model describes non-wake-up words and may be trained on voice data that does not contain the wake-up word.
When the disclosed scheme optimizes the wake-up model, the wake-up voice data from which false wake-up data has been screened out can be used to update the current foreground wake-up model, while negative-example voice data, for example voice data of failed wake-ups and false wake-up data, can be used to update the current background wake-up model. In this way, the distance between the two paths is enlarged, which helps improve the speech recognition accuracy of the updated wake-up model. The specific optimization process can be implemented with reference to related technologies and is not detailed here.
As an example, only the current foreground wake-up model may be updated, that is, only the wake-up voice data from which false wake-up data has been screened out is used to optimize the current foreground wake-up model. The manner of model updating can be determined in combination with practical application requirements, and the disclosed scheme does not limit this.
The acoustic-level features and semantic-level features in the disclosed scheme are explained below.
1. Acoustic-level features
(1) Acoustic score of the current wake-up model, reflecting the recognition accuracy of the wake-up word
As an example, the top-N recognition results output by the current wake-up model for each voice unit of the wake-up voice data can be obtained. If the top-N recognition results of a voice unit include the correct pronunciation of that voice unit, the voice unit is judged to be correctly recognized. According to the recognition results of the voice units, the recognition accuracy of the wake-up voice data is counted as the acoustic score of the current wake-up model.
For example, a voice unit can be the basic recognition unit of the current wake-up model, such as a phoneme or a syllable; the disclosed scheme does not specifically limit this.
Taking syllables as voice units, the wake-up word "ding-dong ding-dong" can be divided into 4 voice units: "ding", "dong", "ding", "dong". If N is 3, then for the first voice unit "ding", the recognition probabilities output by the current wake-up model for that unit can be obtained, and the 3 results with the highest probabilities taken as the top-3 recognition results of the voice unit "ding". If the correct pronunciation "ding" appears among these 3 results, the voice unit is judged to be correctly recognized. The same applies to the other 3 voice units. The recognition accuracy of the wake-up voice data, that is, the ratio of the number of correctly recognized voice units to the total number of voice units, is then calculated as the acoustic score of the current wake-up model.
It is to be appreciated that in the disclosed scheme N may be any value with N ≥ 1, which can be set in combination with practical application requirements; the disclosed scheme does not limit this.
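To make the counting concrete, the following Python sketch (illustrative only; the per-unit top-N lists would in practice come from the wake-up model's decoder, and all names here are assumptions, not part of the original disclosure) computes the acoustic score as the fraction of voice units whose correct pronunciation appears in their top-N results:

```python
def acoustic_score(top_n_results, correct_units):
    """Fraction of voice units whose correct pronunciation appears in the
    model's top-N recognition results for that unit."""
    assert len(top_n_results) == len(correct_units)
    hits = sum(1 for top_n, unit in zip(top_n_results, correct_units)
               if unit in top_n)
    return hits / len(correct_units)

# Wake-up word "ding-dong ding-dong" split into 4 syllable units, N = 3
# (the candidate lists below are made up for illustration):
results = [["ding", "din", "ting"],   # correct "ding" present -> hit
           ["dong", "tong", "don"],   # hit
           ["ting", "tin", "din"],    # miss
           ["dong", "dou", "don"]]    # hit
print(acoustic_score(results, ["ding", "dong", "ding", "dong"]))  # 0.75
```

With 3 of the 4 units correctly recognized, the acoustic score is 0.75.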
(2) Primitive features of the wake-up voice data, for example the fundamental frequency mean, short-time average energy and short-time zero-crossing rate
Generally, depending on whether the vocal cords vibrate during pronunciation, speech can be divided into unvoiced and voiced sounds. Voiced sounds carry most of the energy in speech and show obvious periodicity in the time domain; unvoiced sounds are similar to white noise, without obvious periodicity. When a voiced sound is produced, the airflow through the glottis makes the vocal cords vibrate in a relaxation-oscillation manner, producing a quasi-periodic pulse train; the frequency of this vocal-cord vibration is called the fundamental frequency (F0). The fundamental frequency is generally related to a person's vocal cords, pronunciation habits and so on, and can reflect personal characteristics to a certain extent.
As an example, the fundamental frequency mean can be extracted as follows: perform framing on the wake-up voice data to obtain multiple speech data frames, extract the fundamental frequency of each frame, and then use the per-frame fundamental frequencies to calculate the fundamental frequency mean of the wake-up voice data.
In addition, it should be noted that the short-time average energy can serve as a characteristic parameter for distinguishing unvoiced from voiced sounds; alternatively, when the signal-to-noise ratio is high, it can serve as a characteristic parameter for distinguishing speech from silence.
The short-time zero-crossing rate is the number of times the speech waveform crosses the horizontal axis (zero level) within one speech data frame. Generally, the energy of voiced sounds is concentrated in the low-frequency band and the energy of unvoiced sounds in the high-frequency band, so the zero-crossing rate can reflect frequency to a certain extent: voiced segments have a lower zero-crossing rate and unvoiced segments a higher one.
The disclosed scheme does not limit the manner of obtaining the fundamental frequency mean, short-time average energy and short-time zero-crossing rate; they can be implemented with reference to related technologies and are not detailed here.
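As a rough sketch of how these primitive features might be obtained (not part of the original disclosure; a production system would use a robust pitch tracker, and the naive autocorrelation below only works for clean periodic signals), framing plus per-frame energy, zero-crossing rate and F0 can be written as:

```python
import math

def frames(signal, size, hop):
    """Split a sample list into overlapping or adjacent frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]

def short_time_energy(frame):
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    # Fraction of adjacent sample pairs whose signs differ.
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (len(frame) - 1)

def f0_autocorr(frame, sr, fmin=80, fmax=400):
    """Naive autocorrelation pitch estimate, searching lags in [sr/fmax, sr/fmin]."""
    best_lag, best_val = 0, 0.0
    for lag in range(int(sr / fmax), int(sr / fmin) + 1):
        val = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if val > best_val:
            best_val, best_lag = val, lag
    return sr / best_lag if best_lag else 0.0

sr = 8000
signal = [math.sin(2 * math.pi * 200 * i / sr) for i in range(2400)]  # 200 Hz tone
f0_mean = sum(f0_autocorr(f, sr) for f in frames(signal, 800, 800)) / 3
print(f0_mean)                                   # 200.0 for a clean 200 Hz sine
print(short_time_energy(signal[:800]))           # ~0.5 (mean square of a unit sine)
print(zero_crossing_rate(signal[:800]))          # low, as expected for a voiced-like tone
```

Averaging the per-frame F0 values as in `f0_mean` corresponds to the fundamental frequency mean described above.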
(3) Voiced/unvoiced sequence feature, reflecting the voiced/unvoiced characteristics of the phonemes in the wake-up voice data
As an example, at least one of the fundamental frequency mean, short-time average energy and short-time zero-crossing rate can be taken as input, and after processing by a voiced/unvoiced classifier built in advance, the voiced/unvoiced sequence {a1, a2, ..., ai, ..., am} of the wake-up voice data is output, where ai denotes the voiced/unvoiced class of the i-th phoneme of the wake-up voice data. The similarity between the voiced/unvoiced sequence of the wake-up voice data and the voiced/unvoiced sequence of the wake-up word corresponding to the wake-up voice data is then calculated as the voiced/unvoiced sequence feature.
For example, the voiced/unvoiced class of a phoneme can be unvoiced or voiced; "0" may denote unvoiced and "1" voiced. The disclosed scheme does not specifically limit this.
It is to be appreciated that the intelligent terminal may store only one wake-up word, so that the wake-up word corresponding to the wake-up voice data is known in advance; alternatively, the intelligent terminal may store multiple wake-up words, in which case the current wake-up model can be used to identify which wake-up word the wake-up voice data corresponds to. The disclosed scheme does not specifically limit this.
As an example, the voiced/unvoiced sequence of the wake-up word can be stored in the intelligent terminal and read directly when the similarity needs to be calculated; alternatively, it can be determined in real time by the voiced/unvoiced classifier when the similarity needs to be calculated. The disclosed scheme does not specifically limit this.
As an example, the similarity of the voiced/unvoiced sequences can be calculated by XOR: if the voiced/unvoiced classes of the phonemes at corresponding positions are the same, for example both are the voiced class denoted "1", the XOR result at that position is 0; otherwise the XOR result is 1. The number of non-zero results can then be counted to obtain the similarity; generally, the fewer non-zero results, the higher the similarity.
It is to be appreciated that the voiced/unvoiced classifier in the disclosed scheme may adopt a common classification model, for example a support vector machine model or a neural network model; the disclosed scheme does not specifically limit this.
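The XOR-style comparison above can be sketched as follows (illustrative only; the detected sequence would come from the voiced/unvoiced classifier, and the same position-wise comparison applies to the tone sequences of feature (4), where the classes are tone identifiers rather than 0/1):

```python
def sequence_similarity(seq_a, seq_b):
    """Position-wise comparison of two equal-length class sequences:
    count mismatches (the non-zero XOR results) and turn the count into a
    similarity in [0, 1] -- fewer mismatches means higher similarity."""
    assert len(seq_a) == len(seq_b)
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return 1 - mismatches / len(seq_a)

# "0" = unvoiced, "1" = voiced; compare the detected sequence against the
# stored reference sequence of the wake-up word (both made up here):
detected  = [1, 1, 1, 0, 1, 1]
reference = [1, 1, 1, 1, 1, 1]
print(sequence_similarity(detected, reference))  # 1 - 1/6 ≈ 0.833
```

Turning the mismatch count into a ratio is one assumed normalization; the text only requires that fewer non-zero results yield a higher similarity.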
(4) Tone sequence feature, reflecting the tone characteristics of the syllables in the wake-up voice data
As an example, at least one of the fundamental frequency mean, short-time average energy and short-time zero-crossing rate can be taken as input, and after processing by a tone classifier built in advance, the tone sequence {b1, b2, ..., bj, ..., bn} of the wake-up voice data is output, where bj denotes the tone type of the j-th syllable of the wake-up voice data. The similarity between the tone sequence of the wake-up voice data and the tone sequence of the wake-up word corresponding to the wake-up voice data is then calculated as the tone sequence feature.
Taking Chinese as an example, the tone types of syllables can be the 4 common tones, with identifiers "1", "2", "3", "4" denoting the different tones; tone types of syllables may also be determined in combination with other languages. The disclosed scheme does not specifically limit this.
From the above description, regardless of whether the intelligent terminal stores one wake-up word or multiple wake-up words, the wake-up word corresponding to the wake-up voice data can be determined, and the tone sequence of the wake-up word thereby obtained; for details, see the introduction under "(3) Voiced/unvoiced sequence feature", which is not repeated here.
As an example, the similarity of the tone sequences can be calculated by XOR: if the tone types of the syllables at corresponding positions are the same, for example both are the Chinese falling tone denoted "4", the XOR result at that position is 0; otherwise the XOR result is 1. The number of non-zero results can then be counted to obtain the similarity; generally, the fewer non-zero results, the higher the similarity.
It is to be appreciated that the tone classifier in the disclosed scheme may adopt a common classification model, for example a support vector machine model or a neural network model; the disclosed scheme does not specifically limit this.
(5) Temporal feature of voice units, reflecting abnormalities of the wake-up voice data in voice-unit segmentation
As an example, based on the voice recognition result obtained by the current wake-up model, forced alignment can be performed on the wake-up voice data to obtain the start time and end time of each voice unit, and thereby the duration of each voice unit. Using the durations of the voice units, the duration mean and duration variance can be calculated as the temporal feature of the voice units.
Generally, the temporal feature of voice units can reflect abnormalities of the voice units during segmentation, for example an individual voice unit whose duration is too long or too short to conform to normal speech. As an example, a voice unit can be a phoneme, a syllable, or the like; the disclosed scheme does not specifically limit this.
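A minimal sketch of the duration statistics (the segment boundaries below are invented; in practice they would come from forced alignment against the wake-up model's recognition result):

```python
def temporal_features(segments):
    """Duration mean and variance of voice units; `segments` is a list of
    (start, end) times in seconds from forced alignment."""
    durations = [end - start for start, end in segments]
    mean = sum(durations) / len(durations)
    var = sum((d - mean) ** 2 for d in durations) / len(durations)
    return mean, var

# 4 units of "ding dong ding dong" with made-up alignment times; the
# overlong final unit is the kind of anomaly the variance exposes:
print(temporal_features([(0.00, 0.21), (0.21, 0.45), (0.45, 0.64), (0.64, 1.30)]))
```

Here the mean is 0.325 s, and the 0.66 s final unit inflates the variance well above that of a normally paced utterance.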
(6) Voiceprint feature, reflecting the physiological and behavioral characteristics of the speaker
As an example, a voiceprint extraction model built in advance can be used to extract the i-vector feature of the wake-up voice data as the voiceprint feature. For example, voiceprint extraction models such as DNN i-vector and GMM-UBM i-vector can be used; the disclosed scheme does not specifically limit this.
It is to be appreciated that the voiceprint feature reflects the individual characteristics of the speaker, and the voiceprint feature of a speaker generally does not change within a short time. Therefore, the voiceprint extraction model may also extract i-vector features from the control voice data, or from the whole voice data including both the wake-up voice data and the control voice data; the disclosed scheme does not specifically limit this.
(7) Energy distribution feature, reflecting the characteristics of the wake-up interaction process
As an example, the voice data can be cut into three parts ct−1, ct, ct+1, and the average energy distribution of each part counted; for example, the average energy distribution of the three parts can be expressed as gt−1, gt, gt+1, giving the energy distribution feature.
As an example, the energy distribution feature can be extracted as follows: perform framing on each of the 3 parts of voice data to obtain the speech data frames of each part, extract the energy of each frame, and then use the per-frame energies to calculate the average energy of each part.
In combination with the wake-up interaction example given above, "sil ding-dong ding-dong sp I want to listen to music by Andy Lau", the voice data can be divided into 3 parts by recognizing the wake-up word. Here, ct denotes the wake-up voice data; ct−1 denotes the voice data collected before the wake-up voice data, usually a silent segment or background noise; ct+1 denotes the voice data collected after the wake-up voice data, usually a short pause followed by the operation intent. It is to be appreciated that the durations of ct−1 and ct+1 can be determined flexibly; for example, they can be determined from VAD (Voice Activity Detection) information, or set to a fixed duration such as 1 s to 5 s. The disclosed scheme does not specifically limit this.
Compared with an ordinary utterance that mentions the wake-up word and thereby falsely wakes the intelligent terminal, for example "I think ding-dong ding-dong is a nice name", the energy distribution of the wake-up interaction process in the disclosed scheme shows a significant difference.
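The three-part energy profile can be sketched as follows (illustrative only; the boundary indices would come from recognizing the wake-up word, and the per-sample squaring stands in for the frame-wise energy averaging described above):

```python
def energy_distribution(signal, wake_start, wake_end):
    """Average energy of the segments before / within / after the wake-up
    word (c_{t-1}, c_t, c_{t+1} in the text)."""
    def avg_energy(seg):
        return sum(x * x for x in seg) / len(seg) if seg else 0.0
    return (avg_energy(signal[:wake_start]),
            avg_energy(signal[wake_start:wake_end]),
            avg_energy(signal[wake_end:]))

# Toy waveform: silence, a loud wake-up word, then a quieter command.
sig = [0.0] * 100 + [0.8, -0.8] * 100 + [0.3, -0.3] * 100
print(energy_distribution(sig, 100, 300))  # (0.0, 0.64, 0.09)
```

A genuine wake-up interaction tends toward this silence / peak / speech shape, whereas a wake-up word mentioned mid-sentence has comparable energy on both sides.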
2. Semantic-level features
(1) Semantic smoothness
As an example, word segmentation can be performed on the voice data to obtain the word sequence {w1, w2, ..., wk, ..., wf}, where wk denotes the k-th word of the voice data; the probability that the f words occur in the order of the word sequence is then calculated as the semantic smoothness.
For example, the semantic smoothness in the disclosed scheme can be the forward semantic smoothness P(w1, w2, ..., wf) in the direction from w1 to wf, and/or the reverse semantic smoothness P(wf, wf−1, ..., w1) in the direction from wf to w1. Taking the forward semantic smoothness as an example, it can be calculated by the following formula:
P(w1, w2, ..., wf) = P(w1) × P(w2|w1) × ... × P(wf|wf−1)
where P(wk|wk−1) can be obtained from statistics over the sample voice data that participate in training the voice discrimination model.
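A minimal sketch of the forward smoothness, assuming bigram and unigram probabilities estimated from the training samples (the probability tables and the floor value for unseen pairs are invented for illustration; log space avoids underflow on long sequences):

```python
import math

def forward_smoothness(words, bigram_prob, unigram_prob):
    """log P(w1..wf) = log P(w1) + sum over k of log P(wk | wk-1)."""
    logp = math.log(unigram_prob[words[0]])
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigram_prob.get((prev, cur), 1e-6))  # floor unseen pairs
    return logp

unigram = {"play": 0.2}
bigram = {("play", "some"): 0.3, ("some", "music"): 0.4}
print(forward_smoothness(["play", "some", "music"], bigram, unigram))  # log(0.024)
```

The reverse smoothness would run the same chain with the word order and the bigram table reversed.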
(2) Edit distance of the part-of-speech sequence
As an example, word segmentation can be performed on the voice data to obtain the part-of-speech sequence {q1, q2, ..., qk, ..., qf}, where qk denotes the part of speech of the k-th word of the voice data; the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each sample voice data is calculated, and the smallest edit distance is chosen as the edit distance of the part-of-speech sequence. Here, the sample voice data are the data that participate in training the voice discrimination model.
The part-of-speech sequence feature can reflect semantic information to a certain extent, and is particularly notable for the wake-up interaction process. In the disclosed scheme, the part-of-speech sequence feature can be the edit distance of the part-of-speech sequence, that is, the minimum number of edit operations required to change one symbol string into the other; generally, the smaller the edit distance, the greater the similarity of the two strings.
If the part-of-speech sequence of a sample voice data is expressed as {p1, p2, ..., ph}, the edit distance d[f, h] between {q1, q2, ..., qf} and {p1, p2, ..., ph} can be calculated with the standard recurrence: d[i, 0] = i, d[0, j] = j, and d[i, j] = min(d[i−1, j] + 1, d[i, j−1] + 1, d[i−1, j−1] + cost), where cost = 0 if qi = pj and 1 otherwise.
It is to be appreciated that the sample voice data may be all data that participate in training the voice discrimination model; alternatively, positive-example data filtered out of all the data may serve as the sample voice data for the edit-distance calculation. The disclosed scheme does not specifically limit this, as long as the smallest edit distance can be determined.
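The recurrence above is the standard Levenshtein dynamic program; applied to part-of-speech tags it can be sketched as follows (the tag names and sample sequences are invented for illustration):

```python
def edit_distance(seq_q, seq_p):
    """Levenshtein distance d[f, h] between two part-of-speech sequences."""
    f, h = len(seq_q), len(seq_p)
    d = [[0] * (h + 1) for _ in range(f + 1)]
    for i in range(f + 1):
        d[i][0] = i                    # delete all of seq_q[:i]
    for j in range(h + 1):
        d[0][j] = j                    # insert all of seq_p[:j]
    for i in range(1, f + 1):
        for j in range(1, h + 1):
            cost = 0 if seq_q[i - 1] == seq_p[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[f][h]

query = ["PRON", "VERB", "NOUN"]
samples = [["PRON", "VERB", "ADJ", "NOUN"], ["VERB", "NOUN"]]
print(min(edit_distance(query, s) for s in samples))  # 1
```

The feature value is the minimum over all sample sequences, as the text specifies.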
(3) Intent feature
As an example, an intent analysis model built in advance can be used to extract the intent feature of the control voice data. The intent feature may indicate a clear intent or no clear intent; alternatively, the intent feature may include the intent category corresponding to the control voice data.
The disclosed scheme can build the intent analysis model in advance to determine the tendency of the operation intent. For example, the intent analysis model can be a binary classifier whose output indicates clear intent or no clear intent; or it can be a regression model whose output indicates the scores of the various intent categories, so that the intent category corresponding to the control voice data can be determined according to the scores, for example by taking the M highest-scoring intent categories as the intent categories of the control voice data, where M may be any value with M ≥ 1, set in combination with practical application requirements; the disclosed scheme does not limit this. For example, the intent categories can include playing music, querying the weather and so on, depending on practical application requirements.
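Purely as a toy stand-in for the trained intent analysis model (the keyword table, category names and decision rule below are all invented; a real system would use the binary classifier or regression model described above), the two outputs of the intent feature can be illustrated as:

```python
# Hypothetical keyword-to-category table standing in for a trained model.
INTENT_KEYWORDS = {"play": "play_music", "listen": "play_music",
                   "weather": "query_weather"}

def intent_feature(control_words):
    """Return (has_clear_intent, category) for segmented control voice data."""
    for word in control_words:
        if word in INTENT_KEYWORDS:
            return True, INTENT_KEYWORDS[word]
    return False, None

print(intent_feature(["i", "want", "to", "listen", "to", "music"]))  # (True, 'play_music')
print(intent_feature(["is", "a", "nice", "name"]))                   # (False, None)
```

The second call mirrors the false-wake example "I think ding-dong ding-dong is a nice name": the trailing words carry no clear operation intent.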
The process of building the voice discrimination model in the disclosed scheme is explained below. For details, see the flowchart shown in Fig. 2, which may include the following steps:
S201: collect sample voice data, the sample voice data including sample wake-up voice data and sample control voice data, the data type of the sample wake-up voice data being labeled as positive-example wake-up voice data or negative-example wake-up voice data, the negative-example wake-up voice data including false wake-up data and voice data of failed wake-ups.
When model training is performed, a large amount of sample voice data can be collected, the sample voice data taking the form of sample wake-up voice data and sample control voice data. In addition, the sample wake-up voice data can be labeled with a data type, for example positive-example wake-up voice data or negative-example wake-up voice data; the negative-example wake-up voice data can further be labeled in finer detail as false wake-up data or voice data of failed wake-ups.
S202: extract the acoustic-level feature and/or the semantic-level feature of the sample voice data.
The specific implementation can refer to the introduction made above and is not detailed here.
S203: determine the topological structure of the voice discrimination model.
As an example, the topological structure in the disclosed scheme can be a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), a DNN (Deep Neural Network), or the like; the disclosed scheme does not specifically limit this.
As an example, the output layer of the neural network can include 2 output nodes, respectively representing positive-example wake-up voice data and false wake-up data; for example, "0" may denote positive-example wake-up voice data and "1" false wake-up data. Alternatively, the output layer of the neural network can include 1 output node, indicating the probability that the wake-up voice data is judged to be false wake-up data. The disclosed scheme does not limit the specific form of the neural network.
S204: train the voice discrimination model using the topological structure and the acoustic-level feature and/or the semantic-level feature of the sample voice data, until the data type output by the voice discrimination model for the sample wake-up voice data is identical to the labeled data type.
After the topological structure of the model has been determined and the acoustic-level feature and/or the semantic-level feature of the sample voice data extracted, model training can be performed. As an example, the training process may use the cross-entropy criterion and update the model parameters with common stochastic gradient descent, ensuring that, when training is completed, the data type output by the model for the sample wake-up voice data is identical to the labeled data type.
As an example, the voice discrimination model can be a universal model, that is, not built for one or several specific wake-up words; alternatively, the voice discrimination model can be a personalized model, that is, a different voice discrimination model built for each wake-up word. The disclosed scheme does not specifically limit this.
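To illustrate cross-entropy training with stochastic gradient descent, the following deliberately minimal stand-in trains a logistic classifier over the extracted features (a real discrimination model would be the CNN/RNN/DNN named above; the feature values and labels here are invented toy data):

```python
import math

def train_discriminator(samples, labels, epochs=200, lr=0.5):
    """Binary classifier with cross-entropy loss and plain per-sample SGD.
    Label 1 = false wake-up data, label 0 = positive-example wake-up data."""
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid output node
            g = p - y                        # d(cross-entropy)/dz
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if 1.0 / (1.0 + math.exp(-z)) > 0.5 else 0

# Toy 2-d features (e.g. [acoustic score, semantic smoothness]):
X = [[0.9, 0.8], [0.85, 0.9], [0.2, 0.3], [0.3, 0.1]]
y = [0, 0, 1, 1]
w, b = train_discriminator(X, y)
print([predict(w, b, x) for x in X])  # [0, 0, 1, 1]
```

Training is stopped once the model reproduces the labeled data types, mirroring the termination condition of S204.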
Referring to Fig. 3, a schematic diagram of the composition of the voice data processing apparatus of the present disclosure is shown. The apparatus may include:
a voice data acquisition module 301, configured to obtain voice data input by the user, the voice data including wake-up voice data that successfully wakes the intelligent terminal and control voice data that indicates the operation intent;
a feature extraction module 302, configured to extract the acoustic-level feature and/or the semantic-level feature of the voice data, the acoustic-level feature indicating the pronunciation characteristics of the user and the semantic-level feature indicating the text characteristics of the voice data;
a model processing module 303, configured to determine, with the acoustic-level feature and/or the semantic-level feature as input and after processing by the voice discrimination model built in advance, whether the wake-up voice data is false wake-up data.
Optionally, the voice data acquisition module is configured to judge whether at least two pieces of voice data for waking the intelligent terminal are collected consecutively within a preset time period; if at least two pieces of voice data for waking the intelligent terminal are collected consecutively within the preset time period, and the score d of each of the at least two pieces of voice data after processing by the current wake-up model satisfies d2 ≤ d < d1, the at least two pieces of voice data for waking the intelligent terminal are determined to be the wake-up voice data, where d1 is a first wake-up score threshold and d2 is a second wake-up score threshold.
Optionally, the acoustic-level feature includes the acoustic score of the current wake-up model, and the feature extraction module is configured to obtain the top-N recognition results output by the current wake-up model for each voice unit of the wake-up voice data; if the top-N recognition results of a voice unit include the correct pronunciation of that voice unit, judge the voice unit to be correctly recognized; and, according to the recognition results of the voice units, count the recognition accuracy of the wake-up voice data as the acoustic score of the current wake-up model.
Optionally, the acoustic-level feature further includes at least one of the fundamental frequency mean, short-time average energy and short-time zero-crossing rate;
and/or
the acoustic-level feature further includes a voiced/unvoiced sequence feature, and the feature extraction module is configured to take at least one of the fundamental frequency mean, short-time average energy and short-time zero-crossing rate as input, output, after processing by a voiced/unvoiced classifier built in advance, the voiced/unvoiced sequence {a1, a2, ..., ai, ..., am} of the wake-up voice data, where ai denotes the voiced/unvoiced class of the i-th phoneme of the wake-up voice data, and calculate the similarity between the voiced/unvoiced sequence of the wake-up voice data and the voiced/unvoiced sequence of the wake-up word corresponding to the wake-up voice data as the voiced/unvoiced sequence feature;
and/or
the acoustic-level feature further includes a tone sequence feature, and the feature extraction module is configured to take at least one of the fundamental frequency mean, short-time average energy and short-time zero-crossing rate as input, output, after processing by a tone classifier built in advance, the tone sequence {b1, b2, ..., bj, ..., bn} of the wake-up voice data, where bj denotes the tone type of the j-th syllable of the wake-up voice data, and calculate the similarity between the tone sequence of the wake-up voice data and the tone sequence of the wake-up word corresponding to the wake-up voice data as the tone sequence feature;
and/or
the acoustic-level feature further includes a temporal feature of voice units, and the feature extraction module is configured to count the duration of each voice unit of the wake-up voice data, and use the durations of the voice units to calculate the duration mean and duration variance as the temporal feature of the voice units;
and/or
the acoustic-level feature further includes a voiceprint feature, and the feature extraction module is configured to extract, using a voiceprint extraction model built in advance, the i-vector feature of the wake-up voice data as the voiceprint feature;
and/or
the acoustic-level feature further includes an energy distribution feature, and the feature extraction module is configured to cut the voice data into three parts ct−1, ct, ct+1 and count the average energy distribution of each part as the energy distribution feature, where ct denotes the wake-up voice data, ct+1 denotes the voice data collected after the wake-up voice data and including the control voice data, and ct−1 denotes the voice data collected before the wake-up voice data.
Optionally, the semantic-level feature includes semantic smoothness, and the feature extraction module is configured to perform word segmentation on the voice data to obtain the word sequence {w1, w2, ..., wk, ..., wf}, where wk denotes the k-th word of the voice data, and calculate the probability that the f words occur in the order of the word sequence as the semantic smoothness;
and/or
the semantic-level feature includes the edit distance of the part-of-speech sequence, and the feature extraction module is configured to perform word segmentation on the voice data to obtain the part-of-speech sequence {q1, q2, ..., qk, ..., qf}, where qk denotes the part of speech of the k-th word of the voice data, calculate the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each sample voice data, and choose the smallest edit distance as the edit distance of the part-of-speech sequence, the sample voice data being the data that participate in training the voice discrimination model;
and/or
the semantic-level feature includes an intent feature, and the feature extraction module is configured to extract, using an intent analysis model built in advance, the intent feature of the control voice data, the intent feature indicating a clear intent or no clear intent, or the intent feature including the intent category corresponding to the control voice data.
Optionally, the apparatus further includes:
a sample voice data collection module, configured to collect sample voice data, the sample voice data including sample wake-up voice data and sample control voice data, the data type of the sample wake-up voice data being labeled as positive-example wake-up voice data or negative-example wake-up voice data, the negative-example wake-up voice data including false wake-up data and voice data of failed wake-ups;
a sample feature extraction module, configured to extract the acoustic-level feature and/or the semantic-level feature of the sample voice data;
a topological structure determination module, configured to determine the topological structure of the voice discrimination model;
a model training module, configured to train the voice discrimination model using the topological structure and the acoustic-level feature and/or the semantic-level feature of the sample voice data, until the data type output by the voice discrimination model for the sample wake-up voice data is identical to the labeled data type.
Optionally, the apparatus further includes:
a model optimization module, configured to optimize the current wake-up model using the wake-up voice data from which the false wake-up data has been screened out.
With regard to the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method and will not be elaborated here.
Referring to Fig. 4, a schematic structural diagram of an electronic device 400 for voice data processing of the present disclosure is shown. Referring to Fig. 4, the electronic device 400 includes a processing component 401, which further includes one or more processors, and storage resources represented by a storage medium 402 for storing instructions executable by the processing component 401, for example an application program. The application program stored in the storage medium 402 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 401 is configured to execute the instructions so as to perform the voice data processing method described above.
The electronic device 400 may also include a power supply component 403 configured to perform power management of the electronic device 400, a wired or wireless network interface 404 configured to connect the electronic device 400 to a network, and an input/output (I/O) interface 405. The electronic device 400 can operate based on an operating system stored in the storage medium 402, for example Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM or the like.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings. However, the present disclosure is not limited to the specific details of the above embodiments; within the scope of the technical concept of the present disclosure, various simple variants can be made to the technical solution of the present disclosure, and these simple variants all belong to the protection scope of the present disclosure.
It should further be noted that the specific technical features described in the above specific embodiments can, where not contradictory, be combined in any suitable manner; to avoid unnecessary repetition, the present disclosure does not separately describe the various possible combinations.
In addition, the various different embodiments of the present disclosure can also be combined arbitrarily; as long as such combinations do not depart from the idea of the present disclosure, they should likewise be regarded as content disclosed by the present disclosure.
Claims (16)
1. A voice data processing method, characterized in that the method comprises:
obtaining voice data input by a user, the voice data comprising wake-up voice data that successfully wakes up an intelligent terminal, and control voice data indicating an operation intent;
extracting an acoustic-level feature and/or a semantic-level feature of the voice data, the acoustic-level feature being used to represent pronunciation characteristics of the user, and the semantic-level feature being used to represent textual characteristics of the voice data;
taking the acoustic-level feature and/or the semantic-level feature as input to a pre-built voice discrimination model, and determining, after processing by the model, whether the wake-up voice data is false-wake-up data.
2. The method according to claim 1, characterized in that the wake-up voice data is obtained by:
judging whether at least two pieces of voice data for waking up the intelligent terminal are continuously collected within a preset time period;
if at least two pieces of voice data for waking up the intelligent terminal are continuously collected within the preset time period, and the score d of each of the at least two pieces of voice data after processing by the current wake-up model satisfies the condition d2 ≤ d < d1, determining the at least two pieces of voice data for waking up the intelligent terminal as the wake-up voice data, where d1 is a first wake-up score threshold and d2 is a second wake-up score threshold.
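The two-threshold screening of claim 2 can be sketched as follows. This is a minimal illustration, not the patent's implementation; the threshold values, function name, and the "all scores in the ambiguous band" reading are assumptions.

```python
def is_candidate_wakeup(scores, d1=0.9, d2=0.6, min_count=2):
    """Return True when at least `min_count` consecutive wake-up attempts
    within the preset window all score in the ambiguous band [d2, d1):
    high enough to be plausible wake-ups, too low to be confident ones."""
    if len(scores) < min_count:
        return False
    return all(d2 <= d < d1 for d in scores)

# Two borderline attempts -> flagged as wake-up voice data to be
# checked by the discrimination model; confident scores are not.
print(is_candidate_wakeup([0.72, 0.68]))   # in the band [0.6, 0.9)
print(is_candidate_wakeup([0.95, 0.97]))   # above the band
```

Such candidates are the ones most likely to contain false wake-ups, which is why they are routed to the discrimination model rather than accepted outright.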
3. The method according to claim 1 or 2, characterized in that the acoustic-level feature comprises an acoustic score of the current wake-up model, and extracting the acoustic-level feature of the voice data comprises:
obtaining the top-N recognition results output by the current wake-up model for each speech unit of the wake-up voice data;
if the top-N recognition results of a speech unit contain the correct pronunciation of that speech unit, judging the recognition result of the speech unit to be correct;
counting, according to the recognition results of the speech units, the recognition accuracy of the wake-up voice data as the acoustic score of the current wake-up model.
4. The method according to claim 3, characterized in that:
the acoustic-level feature further comprises at least one of a fundamental-frequency mean, a short-time average energy, and a short-time zero-crossing rate;
and/or
the acoustic-level feature further comprises a voiced/unvoiced sequence feature, in which case extracting the acoustic-level feature of the voice data comprises: taking at least one of the fundamental-frequency mean, the short-time average energy, and the short-time zero-crossing rate as input to a pre-built voiced/unvoiced classifier, and outputting, after processing, the voiced/unvoiced sequence {a1, a2, …, ai, …, am} of the wake-up voice data, where ai denotes the voiced/unvoiced category corresponding to the i-th phoneme of the wake-up voice data; and calculating the similarity between the voiced/unvoiced sequence of the wake-up voice data and the voiced/unvoiced sequence corresponding to the wake-up word of the wake-up voice data as the voiced/unvoiced sequence feature;
and/or
the acoustic-level feature further comprises a tone sequence feature, in which case extracting the acoustic-level feature of the voice data comprises: taking at least one of the fundamental-frequency mean, the short-time average energy, and the short-time zero-crossing rate as input to a pre-built tone classifier, and outputting, after processing, the tone sequence {b1, b2, …, bj, …, bn} of the wake-up voice data, where bj denotes the tone type corresponding to the j-th syllable of the wake-up voice data; and calculating the similarity between the tone sequence of the wake-up voice data and the tone sequence corresponding to the wake-up word of the wake-up voice data as the tone sequence feature;
and/or
the acoustic-level feature further comprises a temporal feature of the speech units, in which case extracting the acoustic-level feature of the voice data comprises: counting the duration of each speech unit of the wake-up voice data; and calculating, from the durations of the speech units, a time mean and a time variance as the temporal feature of the speech units;
and/or
the acoustic-level feature further comprises a voiceprint feature, in which case extracting the acoustic-level feature of the voice data comprises: extracting, by means of a pre-built voiceprint extraction model, the i-vector feature of the wake-up voice data as the voiceprint feature;
and/or
the acoustic-level feature further comprises an energy-distribution feature, in which case extracting the acoustic-level feature of the voice data comprises: segmenting the voice data into three parts ct-1, ct, and ct+1, and counting the average energy distribution of each part as the energy-distribution feature, where ct denotes the wake-up voice data, ct+1 denotes the voice data set collected after the wake-up voice data and containing the control voice data, and ct-1 denotes the voice data set collected before the wake-up voice data.
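Two of the acoustic-level features in claim 4 reduce to simple computations: the voiced/unvoiced (or tone) sequence feature is a similarity between two label sequences, and the temporal feature is the mean and variance of per-unit durations. A hedged sketch; position-wise agreement is used as the similarity measure here, since the patent does not fix a particular metric:

```python
def sequence_similarity(seq, ref):
    """Fraction of positions where the classifier's voiced/unvoiced
    (or tone) labels match the wake-up word's reference labels.
    Length mismatches count as disagreements."""
    n = max(len(seq), len(ref))
    matches = sum(1 for a, b in zip(seq, ref) if a == b)
    return matches / n

def temporal_feature(durations):
    """Time mean and time variance of speech-unit durations."""
    mean = sum(durations) / len(durations)
    var = sum((d - mean) ** 2 for d in durations) / len(durations)
    return mean, var

# 'V' = voiced, 'U' = unvoiced; 3 of 4 labels agree with the reference.
print(sequence_similarity("VUVV", "VUVU"))
# Durations (seconds) of four speech units of the wake-up word.
print(temporal_feature([0.2, 0.3, 0.25, 0.25]))
```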
5. The method according to claim 1 or 2, characterized in that:
the semantic-level feature comprises a semantic smoothness, in which case extracting the semantic-level feature of the voice data comprises: performing word segmentation on the voice data to obtain a word sequence {w1, w2, …, wk, …, wf}, where wk denotes the k-th word of the voice data; and calculating the probability that the f words occur in the order of the word sequence as the semantic smoothness;
and/or
the semantic-level feature comprises an edit distance of a part-of-speech sequence, in which case extracting the semantic-level feature of the voice data comprises: performing word segmentation on the voice data to obtain a part-of-speech sequence {q1, q2, …, qk, …, qf}, where qk denotes the part of speech of the k-th word of the voice data; calculating the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each piece of sample voice data; and selecting the minimum edit distance therefrom as the edit distance of the part-of-speech sequence, the sample voice data being the data used for training the voice discrimination model;
and/or
the semantic-level feature comprises an intent feature, in which case extracting the semantic-level feature of the voice data comprises: extracting, by means of a pre-built intent analysis model, the intent feature of the control voice data, the intent feature indicating either a clear intent or no clear intent, or alternatively the intent feature indicating the intent category corresponding to the control voice data.
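The part-of-speech edit distance of claim 5 is a standard Levenshtein distance over POS-tag sequences, minimised across the training samples. A minimal sketch; the POS tags and sample sequences are illustrative, not from the patent:

```python
def edit_distance(a, b):
    """Levenshtein distance between two part-of-speech sequences."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def pos_edit_distance_feature(pos_seq, sample_pos_seqs):
    """Minimum edit distance to any training sample's POS sequence."""
    return min(edit_distance(pos_seq, s) for s in sample_pos_seqs)

samples = [["v", "n"], ["r", "v", "n"], ["n", "v", "n", "u"]]
print(pos_edit_distance_feature(["v", "n", "u"], samples))  # closest sample is 1 edit away
```

A small minimum distance suggests the utterance is phrased like the training data; a large one suggests background speech that merely resembled the wake-up word.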
6. The method according to claim 1 or 2, characterized in that the voice discrimination model is built by:
collecting sample voice data, the sample voice data comprising sample wake-up voice data and sample control voice data, the data type of the sample wake-up voice data being labeled as positive-example wake-up voice data or negative-example wake-up voice data, the negative-example wake-up voice data comprising false-wake-up data and voice data that failed to wake up the terminal;
extracting the acoustic-level feature and/or semantic-level feature of the sample voice data;
determining the topology of the voice discrimination model;
training the voice discrimination model using the topology and the acoustic-level feature and/or semantic-level feature of the sample voice data, until the data type of the sample wake-up voice data output by the voice discrimination model matches the labeled data type.
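The training loop of claim 6 can be illustrated with any binary classifier over the extracted features. The sketch below uses logistic regression trained by gradient descent; the feature values and labels are fabricated for illustration, and the patent does not prescribe this (or any particular) model topology:

```python
import math

def train_discriminator(features, labels, lr=0.5, epochs=2000):
    """Fit weights w and bias b so that sigmoid(w.x + b) separates
    positive-example (1) from negative-example (0) wake-up samples."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Toy features: [acoustic score, semantic smoothness]; 1 = genuine wake-up.
X = [[0.9, 0.8], [0.85, 0.9], [0.2, 0.1], [0.3, 0.2]]
y = [1, 1, 0, 0]
w, b = train_discriminator(X, y)
print([predict(w, b, x) for x in X])  # reproduces the labels
```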
7. The method according to claim 1 or 2, characterized in that the method further comprises:
optimizing the current wake-up model using wake-up voice data from which the false-wake-up data has been screened out.
8. A voice data processing apparatus, characterized in that the apparatus comprises:
a voice data obtaining module, configured to obtain voice data input by a user, the voice data comprising wake-up voice data that successfully wakes up an intelligent terminal, and control voice data indicating an operation intent;
a feature extraction module, configured to extract an acoustic-level feature and/or a semantic-level feature of the voice data, the acoustic-level feature being used to represent pronunciation characteristics of the user, and the semantic-level feature being used to represent textual characteristics of the voice data;
a model processing module, configured to take the acoustic-level feature and/or the semantic-level feature as input to a pre-built voice discrimination model, and to determine, after processing by the model, whether the wake-up voice data is false-wake-up data.
9. The apparatus according to claim 8, characterized in that:
the voice data obtaining module is configured to judge whether at least two pieces of voice data for waking up the intelligent terminal are continuously collected within a preset time period; and, if at least two pieces of voice data for waking up the intelligent terminal are continuously collected within the preset time period and the score d of each of the at least two pieces of voice data after processing by the current wake-up model satisfies d2 ≤ d < d1, to determine the at least two pieces of voice data for waking up the intelligent terminal as the wake-up voice data, where d1 is a first wake-up score threshold and d2 is a second wake-up score threshold.
10. The apparatus according to claim 8 or 9, characterized in that the acoustic-level feature comprises an acoustic score of the current wake-up model, and
the feature extraction module is configured to obtain the top-N recognition results output by the current wake-up model for each speech unit of the wake-up voice data; if the top-N recognition results of a speech unit contain the correct pronunciation of that speech unit, to judge the recognition result of the speech unit to be correct; and to count, according to the recognition results of the speech units, the recognition accuracy of the wake-up voice data as the acoustic score of the current wake-up model.
11. The apparatus according to claim 10, characterized in that:
the acoustic-level feature further comprises at least one of a fundamental-frequency mean, a short-time average energy, and a short-time zero-crossing rate;
and/or
the acoustic-level feature further comprises a voiced/unvoiced sequence feature, and the feature extraction module is configured to take at least one of the fundamental-frequency mean, the short-time average energy, and the short-time zero-crossing rate as input to a pre-built voiced/unvoiced classifier and to output, after processing, the voiced/unvoiced sequence {a1, a2, …, ai, …, am} of the wake-up voice data, where ai denotes the voiced/unvoiced category corresponding to the i-th phoneme of the wake-up voice data; and to calculate the similarity between the voiced/unvoiced sequence of the wake-up voice data and the voiced/unvoiced sequence corresponding to the wake-up word of the wake-up voice data as the voiced/unvoiced sequence feature;
and/or
the acoustic-level feature further comprises a tone sequence feature, and the feature extraction module is configured to take at least one of the fundamental-frequency mean, the short-time average energy, and the short-time zero-crossing rate as input to a pre-built tone classifier and to output, after processing, the tone sequence {b1, b2, …, bj, …, bn} of the wake-up voice data, where bj denotes the tone type corresponding to the j-th syllable of the wake-up voice data; and to calculate the similarity between the tone sequence of the wake-up voice data and the tone sequence corresponding to the wake-up word of the wake-up voice data as the tone sequence feature;
and/or
the acoustic-level feature further comprises a temporal feature of the speech units, and the feature extraction module is configured to count the duration of each speech unit of the wake-up voice data, and to calculate, from the durations of the speech units, a time mean and a time variance as the temporal feature of the speech units;
and/or
the acoustic-level feature further comprises a voiceprint feature, and the feature extraction module is configured to extract, by means of a pre-built voiceprint extraction model, the i-vector feature of the wake-up voice data as the voiceprint feature;
and/or
the acoustic-level feature further comprises an energy-distribution feature, and the feature extraction module is configured to segment the voice data into three parts ct-1, ct, and ct+1, and to count the average energy distribution of each part as the energy-distribution feature, where ct denotes the wake-up voice data, ct+1 denotes the voice data set collected after the wake-up voice data and containing the control voice data, and ct-1 denotes the voice data set collected before the wake-up voice data.
12. The apparatus according to claim 8 or 9, characterized in that:
the semantic-level feature comprises a semantic smoothness, and the feature extraction module is configured to perform word segmentation on the voice data to obtain a word sequence {w1, w2, …, wk, …, wf}, where wk denotes the k-th word of the voice data, and to calculate the probability that the f words occur in the order of the word sequence as the semantic smoothness;
and/or
the semantic-level feature comprises an edit distance of a part-of-speech sequence, and the feature extraction module is configured to perform word segmentation on the voice data to obtain a part-of-speech sequence {q1, q2, …, qk, …, qf}, where qk denotes the part of speech of the k-th word of the voice data; to calculate the edit distance between the part-of-speech sequence of the voice data and the part-of-speech sequence of each piece of sample voice data; and to select the minimum edit distance therefrom as the edit distance of the part-of-speech sequence, the sample voice data being the data used for training the voice discrimination model;
and/or
the semantic-level feature comprises an intent feature, and the feature extraction module is configured to extract, by means of a pre-built intent analysis model, the intent feature of the control voice data, the intent feature indicating either a clear intent or no clear intent, or alternatively the intent feature indicating the intent category corresponding to the control voice data.
13. The apparatus according to claim 8 or 9, characterized in that the apparatus further comprises:
a sample voice data collection module, configured to collect sample voice data, the sample voice data comprising sample wake-up voice data and sample control voice data, the data type of the sample wake-up voice data being labeled as positive-example wake-up voice data or negative-example wake-up voice data, the negative-example wake-up voice data comprising false-wake-up data and voice data that failed to wake up the terminal;
a sample feature extraction module, configured to extract the acoustic-level feature and/or semantic-level feature of the sample voice data;
a topology determining module, configured to determine the topology of the voice discrimination model;
a model training module, configured to train the voice discrimination model using the topology and the acoustic-level feature and/or semantic-level feature of the sample voice data, until the data type of the sample wake-up voice data output by the voice discrimination model matches the labeled data type.
14. The apparatus according to claim 8 or 9, characterized in that the apparatus further comprises:
a model optimization module, configured to optimize the current wake-up model using wake-up voice data from which the false-wake-up data has been screened out.
15. A storage device storing a plurality of instructions, characterized in that the instructions are loaded by a processor to execute the steps of the method according to any one of claims 1 to 7.
16. An electronic device, characterized in that the electronic device comprises:
the storage device according to claim 15; and
a processor, configured to execute the instructions in the storage device.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711364085.1A CN108320733B (en) | 2017-12-18 | 2017-12-18 | Voice data processing method and device, storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108320733A true CN108320733A (en) | 2018-07-24 |
CN108320733B CN108320733B (en) | 2022-01-04 |
Family
ID=62893086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711364085.1A Active CN108320733B (en) | 2017-12-18 | 2017-12-18 | Voice data processing method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108320733B (en) |
Citations (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101727903A (en) * | 2008-10-29 | 2010-06-09 | 中国科学院自动化研究所 | Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems |
US20110004624A1 (en) * | 2009-07-02 | 2011-01-06 | International Business Machines Corporation | Method for Customer Feedback Measurement in Public Places Utilizing Speech Recognition Technology |
CN102999161A (en) * | 2012-11-13 | 2013-03-27 | 安徽科大讯飞信息科技股份有限公司 | Implementation method and application of voice awakening module |
WO2013163113A1 (en) * | 2012-04-26 | 2013-10-31 | Nuance Communications, Inc | Embedded system for construction of small footprint speech recognition with user-definable constraints |
CN103474069A (en) * | 2013-09-12 | 2013-12-25 | 中国科学院计算技术研究所 | Method and system for fusing recognition results of a plurality of speech recognition systems |
CN103943105A (en) * | 2014-04-18 | 2014-07-23 | 安徽科大讯飞信息科技股份有限公司 | Voice interaction method and system |
CN104281645A (en) * | 2014-08-27 | 2015-01-14 | 北京理工大学 | Method for identifying emotion key sentence on basis of lexical semantics and syntactic dependency |
CN104538030A (en) * | 2014-12-11 | 2015-04-22 | 科大讯飞股份有限公司 | Control system and method for controlling household appliances through voice |
CN105096939A (en) * | 2015-07-08 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice wake-up method and device |
CN105702253A (en) * | 2016-01-07 | 2016-06-22 | 北京云知声信息技术有限公司 | Voice awakening method and device |
CN105976813A (en) * | 2015-03-13 | 2016-09-28 | 三星电子株式会社 | Speech recognition system and speech recognition method thereof |
CN106272481A (en) * | 2016-08-15 | 2017-01-04 | 北京光年无限科技有限公司 | The awakening method of a kind of robot service and device |
CN106297777A (en) * | 2016-08-11 | 2017-01-04 | 广州视源电子科技股份有限公司 | A kind of method and apparatus waking up voice service up |
CN106653056A (en) * | 2016-11-16 | 2017-05-10 | 中国科学院自动化研究所 | Fundamental frequency extraction model based on LSTM recurrent neural network and training method thereof |
CN106782554A (en) * | 2016-12-19 | 2017-05-31 | 百度在线网络技术(北京)有限公司 | Voice awakening method and device based on artificial intelligence |
FI20156000A (en) * | 2015-12-22 | 2017-06-23 | Code-Q Oy | Speech recognition method and apparatus based on a wake-up call |
CN107223280A (en) * | 2017-03-03 | 2017-09-29 | 深圳前海达闼云端智能科技有限公司 | robot awakening method, device and robot |
CN107358951A (en) * | 2017-06-29 | 2017-11-17 | 阿里巴巴集团控股有限公司 | A kind of voice awakening method, device and electronic equipment |
CN107464564A (en) * | 2017-08-21 | 2017-12-12 | 腾讯科技(深圳)有限公司 | voice interactive method, device and equipment |
Non-Patent Citations (4)
Title |
---|
FENGPEI GE, YONGHONG YAN: "Deep neural network based wake-up-word speech recognition with two-stage detection", 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) * |
SHILEI ZHANG, ET AL.: "Wake-up-word spotting using end-to-end deep neural network system", 2016 23rd International Conference on Pattern Recognition (ICPR) * |
SHI Weijia et al.: "Software and hardware implementation of an intelligent voice set-top box", Telecommunications Science * |
CHEN Yongzhen et al.: "Integrated digital home design based on speech recognition and Bluetooth technology", World Electronic Components * |
Cited By (30)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109119079A (en) * | 2018-07-25 | 2019-01-01 | 天津字节跳动科技有限公司 | voice input processing method and device |
CN108831471A (en) * | 2018-09-03 | 2018-11-16 | 与德科技有限公司 | A kind of voice method for security protection, device and route terminal |
CN109036412A (en) * | 2018-09-17 | 2018-12-18 | 苏州奇梦者网络科技有限公司 | voice awakening method and system |
US11869482B2 (en) | 2018-09-30 | 2024-01-09 | Microsoft Technology Licensing, Llc | Speech waveform generation |
CN111602194A (en) * | 2018-09-30 | 2020-08-28 | 微软技术许可有限责任公司 | Speech waveform generation |
CN110444210A (en) * | 2018-10-25 | 2019-11-12 | 腾讯科技(深圳)有限公司 | A kind of method of speech recognition, the method and device for waking up word detection |
CN111261143A (en) * | 2018-12-03 | 2020-06-09 | 杭州嘉楠耘智信息科技有限公司 | Voice wake-up method and device and computer readable storage medium |
CN111261143B (en) * | 2018-12-03 | 2024-03-22 | 嘉楠明芯(北京)科技有限公司 | Voice wakeup method and device and computer readable storage medium |
CN109671435B (en) * | 2019-02-21 | 2020-12-25 | 三星电子(中国)研发中心 | Method and apparatus for waking up smart device |
CN109671435A (en) * | 2019-02-21 | 2019-04-23 | 三星电子(中国)研发中心 | Method and apparatus for waking up smart machine |
CN110070863A (en) * | 2019-03-11 | 2019-07-30 | 华为技术有限公司 | A kind of sound control method and device |
CN110060665A (en) * | 2019-03-15 | 2019-07-26 | 上海拍拍贷金融信息服务有限公司 | Word speed detection method and device, readable storage medium storing program for executing |
CN110049107A (en) * | 2019-03-22 | 2019-07-23 | 钛马信息网络技术有限公司 | A kind of net connection vehicle awakening method, device, equipment and medium |
CN110049107B (en) * | 2019-03-22 | 2022-04-08 | 钛马信息网络技术有限公司 | Internet vehicle awakening method, device, equipment and medium |
CN110534098A (en) * | 2019-10-09 | 2019-12-03 | 国家电网有限公司客户服务中心 | A kind of the speech recognition Enhancement Method and device of age enhancing |
CN110992940A (en) * | 2019-11-25 | 2020-04-10 | 百度在线网络技术(北京)有限公司 | Voice interaction method, device, equipment and computer-readable storage medium |
US11250854B2 (en) | 2019-11-25 | 2022-02-15 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for voice interaction, device and computer-readable storage medium |
CN111128155A (en) * | 2019-12-05 | 2020-05-08 | 珠海格力电器股份有限公司 | Awakening method, device, equipment and medium for intelligent equipment |
EP3923272A1 (en) * | 2020-06-10 | 2021-12-15 | Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. | Method and apparatus for adapting a wake-up model |
US11587550B2 (en) | 2020-06-10 | 2023-02-21 | Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. | Method and apparatus for outputting information |
WO2021159756A1 (en) * | 2020-09-04 | 2021-08-19 | 平安科技(深圳)有限公司 | Method for response obligation detection based on multiple modes, and system and apparatus |
CN112037772A (en) * | 2020-09-04 | 2020-12-04 | 平安科技(深圳)有限公司 | Multi-mode-based response obligation detection method, system and device |
CN112037772B (en) * | 2020-09-04 | 2024-04-02 | 平安科技(深圳)有限公司 | Response obligation detection method, system and device based on multiple modes |
CN112530442B (en) * | 2020-11-05 | 2023-11-17 | 广东美的厨房电器制造有限公司 | Voice interaction method and device |
CN112530442A (en) * | 2020-11-05 | 2021-03-19 | 广东美的厨房电器制造有限公司 | Voice interaction method and device |
WO2022147692A1 (en) * | 2021-01-06 | 2022-07-14 | 京东方科技集团股份有限公司 | Voice command recognition method, electronic device and non-transitory computer-readable storage medium |
CN112951235A (en) * | 2021-01-27 | 2021-06-11 | 北京云迹科技有限公司 | Voice recognition method and device |
CN112951235B (en) * | 2021-01-27 | 2022-08-16 | 北京云迹科技股份有限公司 | Voice recognition method and device |
CN113436615A (en) * | 2021-07-06 | 2021-09-24 | 南京硅语智能科技有限公司 | Semantic recognition model, training method thereof and semantic recognition method |
CN117784632A (en) * | 2024-02-28 | 2024-03-29 | 深圳市轻生活科技有限公司 | Intelligent household control system based on offline voice recognition |
Also Published As
Publication number | Publication date |
---|---|
CN108320733B (en) | 2022-01-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108320733A (en) | Voice data processing method and device, storage medium, electronic equipment | |
CN106611597B (en) | Voice awakening method and device based on artificial intelligence | |
US8825479B2 (en) | System and method for recognizing emotional state from a speech signal | |
CN105206258A (en) | Generation method and device of acoustic model as well as voice synthetic method and device | |
KR20180091903A (en) | METHOD, APPARATUS AND STORAGE MEDIUM FOR CONFIGURING VOICE DECODING NETWORK IN NUMERIC VIDEO RECOGNI | |
CN110299153A (en) | Sound section detection device, sound section detection method and recording medium | |
JP4322785B2 (en) | Speech recognition apparatus, speech recognition method, and speech recognition program | |
CN105206271A (en) | Intelligent equipment voice wake-up method and system for realizing method | |
CN111105785B (en) | Text prosody boundary recognition method and device | |
CN112102850B (en) | Emotion recognition processing method and device, medium and electronic equipment | |
CN108320734A (en) | Audio signal processing method and device, storage medium, electronic equipment | |
CN102982811A (en) | Voice endpoint detection method based on real-time decoding | |
JP2006048065A (en) | Method and apparatus for voice-interactive language instruction | |
CN102194454A (en) | Equipment and method for detecting key word in continuous speech | |
CN109036395A (en) | Personalized speaker control method, system, intelligent sound box and storage medium | |
CN102945673A (en) | Continuous speech recognition method with speech command range changed dynamically | |
CN108269574B (en) | Method and device for processing voice signal to represent vocal cord state of user, storage medium and electronic equipment | |
CN107871499A (en) | Audio recognition method, system, computer equipment and computer-readable recording medium | |
CN110265063A (en) | A kind of lie detecting method based on fixed duration speech emotion recognition sequence analysis | |
CN112071308A (en) | Awakening word training method based on speech synthesis data enhancement | |
CN108536668A (en) | Wake up word appraisal procedure and device, storage medium, electronic equipment | |
CN115414042B (en) | Multi-modal anxiety detection method and device based on emotion information assistance | |
CN110827853A (en) | Voice feature information extraction method, terminal and readable storage medium | |
CN111276156B (en) | Real-time voice stream monitoring method | |
KR102113879B1 (en) | The method and apparatus for recognizing speaker's voice by using reference database |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||