CN110534099A - Voice wake-up processing method, apparatus, storage medium and electronic device - Google Patents

Voice wake-up processing method, apparatus, storage medium and electronic device

Info

Publication number
CN110534099A
Authority
CN
China
Prior art keywords
audio frame
confidence
frame feature
verification
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910828451.7A
Other languages
Chinese (zh)
Other versions
CN110534099B (en)
Inventor
陈杰
苏丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910828451.7A
Publication of CN110534099A
Application granted
Publication of CN110534099B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 - Adaptation
    • G10L 15/07 - Adaptation to the speaker
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 - Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Telephone Function (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present application provides a voice wake-up processing method and apparatus, a storage medium, and an electronic device. Audio frame features of the input voice information are extracted and fed into an acoustic model, which outputs the posterior probability of the target audio frame features corresponding to each syllable of a preset wake-up word. Two confidence decisions, deployed separately for an adult mode and a child mode, are then applied to these posterior probabilities, so that each syllable receives two confidence scores. If either confidence score yields a passing decision, verification audio frame features of the corresponding length are retrieved from the cache for a secondary confidence check. If the secondary check passes, the instruction corresponding to the preset wake-up word is responded to and the electronic device is controlled to perform the preset operation. The voice wake-up processing method provided by this embodiment therefore balances adult and child voice wake-up performance, improving wake-up efficiency and accuracy.

Description

Voice wake-up processing method, apparatus, storage medium and electronic device
Technical field
The present application relates to the field of artificial intelligence applications, and in particular to a voice wake-up processing method and apparatus, a storage medium, and an electronic device.
Background
As an artificial intelligence technology, speech recognition is widely used in fields such as industry, household appliances, communications, automotive electronics, medical care, home services, and consumer electronics. Electronic devices in these fields are given speech recognition capability so that a wake-up word spoken by the user can be recognized to wake up the device and its applications, which greatly improves convenience for the user.
In the prior art, referring to the flow diagram of an existing voice wake-up processing method shown in Fig. 1, the voice information input by the user is typically sent to an acoustic model (for example, a deep neural network) to obtain the phonemes or syllables that make up the wake-up word, while a filler unit absorbs non-wake-up-word speech. A posterior processing module then applies a smoothing window and a confidence calculation window to the phonemes or syllables of the wake-up word to obtain a confidence score for the wake-up word. If the confidence score reaches a threshold, the wake-up word is responded to and the electronic device is controlled to perform a preset operation.
Although the existing voice wake-up processing method can balance wake-up performance by adjusting the threshold, it does not take into account the differences between adult and child speech features. As a result, the output accuracy of the acoustic model is low, which degrades the voice wake-up performance of the electronic device.
Summary of the invention
In view of this, embodiments of the present application provide a voice wake-up processing method and apparatus, a storage medium, and an electronic device that can balance adult and child voice wake-up performance and improve wake-up efficiency and accuracy.
To achieve the above object, the embodiments of the present application provide the following technical solutions.
In one aspect, the present application proposes a voice wake-up processing method, the method comprising:
obtaining audio frame features of input voice information;
inputting the audio frame features into an acoustic model for processing, to obtain posterior probabilities of the target audio frame features corresponding to each syllable of a preset wake-up word;
performing a dual confidence decision on the posterior probability of the target audio frame features corresponding to each syllable, to obtain a first confidence score and a second confidence score of the corresponding syllable;
if the decision of either the first confidence score or the second confidence score passes, obtaining verification audio frame features from the audio frame features of the voice information;
obtaining a confidence verification result of the verification audio frame features, the confidence verification result being obtained by performing a secondary confidence decision on the verification audio frame features;
if the confidence verification result passes, responding to the instruction corresponding to the preset wake-up word and controlling the electronic device to perform a preset operation.
In another aspect, the present application proposes a voice wake-up processing apparatus, the apparatus comprising:
a feature acquisition module, configured to obtain audio frame features of input voice information;
a posterior probability acquisition module, configured to input the audio frame features into an acoustic model for processing, to obtain posterior probabilities of the target audio frame features corresponding to each syllable of a preset wake-up word;
a confidence decision module, configured to perform a dual confidence decision on the posterior probability of the target audio frame features corresponding to each syllable, to obtain a first confidence score and a second confidence score of the corresponding syllable;
a verification feature acquisition module, configured to obtain verification audio frame features from the audio frame features of the voice information when the decision of either the first confidence score or the second confidence score passes;
a confidence verification result acquisition module, configured to obtain a confidence verification result of the verification audio frame features, the confidence verification result being obtained by performing a secondary confidence decision on the verification audio frame features;
a voice wake-up module, configured to respond to the instruction corresponding to the preset wake-up word and control the electronic device to perform a preset operation if the confidence verification result passes.
In another aspect, the present application proposes a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the voice wake-up processing described above.
In another aspect, the present application proposes an electronic device, the electronic device comprising:
a sound collector, configured to collect voice information output by a user;
a communication interface;
a memory, configured to store a program implementing the voice wake-up processing described above; and
a processor, configured to load and execute the program stored in the memory to implement the steps of the voice wake-up processing described above.
It can be seen that, compared with the prior art, after the voice information input by the user to the electronic device is obtained, the audio frame features of the voice information are acquired and input into an acoustic model, yielding the posterior probabilities of the target audio frame features corresponding to each syllable of the preset wake-up word contained in the voice information. Taking into account the differences between the speech features of different types of users (for example, adults and children), this embodiment deploys separate confidence decision modules for the adult mode and the child mode that share one acoustic model, and performs a dual confidence decision on the obtained posterior probabilities, so that each syllable receives two confidence scores. If the decision of either confidence score passes, verification audio frame features of the corresponding length are retrieved from the cache for a secondary confidence check. If the secondary check passes, it can be determined that the voice information contains the preset wake-up word, the instruction corresponding to the preset wake-up word can be responded to directly, and the electronic device is controlled to perform the preset operation. The voice wake-up processing method provided by this embodiment therefore balances adult and child voice wake-up performance, improving wake-up efficiency and accuracy.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a flow diagram of an existing voice wake-up processing method;
Fig. 2 shows an optional architecture explored during the development of the voice wake-up processing method proposed by the present application;
Fig. 3 shows a structural diagram of an optional example of a system implementing the voice wake-up processing method proposed by the present application;
Fig. 4 shows a hardware structural diagram of an optional example of the electronic device proposed by the present application;
Fig. 5 shows a hardware structural diagram of another optional example of the electronic device proposed by the present application;
Fig. 6 shows a flowchart of an optional example of the voice wake-up processing method proposed by the present application;
Fig. 7 shows a signaling flowchart of an optional example of the voice wake-up processing method proposed by the present application;
Fig. 8 shows a structural diagram of an optional example of the voice wake-up processing apparatus proposed by the present application;
Fig. 9 shows a structural diagram of another optional example of the voice wake-up processing apparatus proposed by the present application;
Fig. 10 shows a system structural diagram for implementing the voice wake-up processing method proposed by the present application;
Fig. 11 shows a schematic diagram of an application scenario of the voice wake-up processing method proposed by the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may mean A or B. The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, in the description of the embodiments of the present application, "multiple" means two or more.
The terms "first" and "second" below are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features involved. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this embodiment, unless otherwise indicated, "plurality" means two or more.
As introduced in the background section, the voice wake-up processing method currently executed by electronic devices uses only one acoustic model to process the voice information of different types of users (for example, adult users and child users). As a result, this single acoustic model cannot balance adult and child voice wake-up performance. Usually, in the training data of the model, adult data far exceeds child data, so the existing voice wake-up processing method may achieve high adult wake-up performance but poor child wake-up performance.
To improve voice wake-up performance, the present application considered training two acoustic models of different sizes to form a two-stage acoustic model that shares one posterior processing module for confidence score calculation and final decision. Referring to the flow diagram of a voice wake-up processing method shown in Fig. 2, voice feature information is first extracted from the voice information output by the user, for example using MFCC (Mel-scale Frequency Cepstral Coefficients), although the extraction is not limited to this; the extracted voice feature information is then written into a frame buffer. The first-stage model, i.e. the smaller acoustic model (the first acoustic model in Fig. 2), computes a confidence score for the extracted voice feature information, for example using a hidden Markov model (HMM), or using the posterior processing module shown in Fig. 1. After the first-stage model is triggered, the same extracted voice feature information is sent to the larger acoustic model (the second acoustic model in Fig. 2), which computes a confidence score in a similar way, thereby realizing a second decision on the same voice feature information. Compared with the single-model voice wake-up processing shown in Fig. 1, this improves voice wake-up performance to some extent.
In addition, the present application also considered another voice wake-up processing method. It differs from the method shown in Fig. 2 in that, after the first-stage model is triggered, the voice information output by the user is sent to a cloud server and recognized by the server's automatic speech recognition (ASR) component. In this case, the server can use a larger-scale acoustic model combined with a larger language model and decoder processing to realize the second decision on the voice information.
It can be seen that both voice wake-up processing methods described above introduce a larger second-stage model to improve system performance. Although these methods can improve voice wake-up performance relative to a single acoustic model, they still do not account for the differences between adult and child speech features, such as the much slower speaking rate of children relative to adults. Consequently, the acoustic models built in these methods still cannot balance adult and child performance, so an electronic device using such a voice wake-up processing method cannot serve adults and children well at the same time, which greatly degrades the user experience.
Building on the improvements above, in order to solve the problem that child and adult voice wake-up performance cannot be balanced, the present application improves on the system architecture used by the voice wake-up processing method shown in Fig. 1 with respect to child speech features: a dual confidence decision mechanism is added, and in the second-stage model the child and adult models are separated so that the voice feature information and training data they receive differ, which significantly improves child wake-up performance.
Specifically, referring to the structural diagram in Fig. 3 of a system implementing the voice wake-up processing method proposed by the embodiments of the present application, the system can consist of three models arranged in two cascaded stages. As shown in Fig. 3, in addition to a feature computation module and a feature cache module, the first-stage model is configured with one acoustic model and one dual confidence decision module. The dual confidence decision module performs posterior processing separately according to an adult mode and a child mode; that is, it can include an adult posterior processing module and a child posterior processing module. In the second-stage model, a corresponding adult verification model and child verification model are configured for the two posterior processing modules, and both share the first-stage model. When the output result of either posterior processing module passes, the second-stage model is triggered to perform a secondary confidence decision. If the secondary decision passes, the voice information is considered to contain the preset wake-up word, and the electronic device is controlled to perform the preset operation; the specific implementation process is described in the corresponding parts of the method embodiments below.
Based on the analysis above of the technical concept of the voice wake-up processing method proposed by the present application, the method is applicable to computer devices such as electronic devices (i.e. terminal devices) and/or servers. Specifically, the first-stage model proposed above can be deployed on the electronic device, while the second-stage model, which runs after the first-stage model is triggered, can be deployed on the electronic device or on a cloud server; the deployment is not limited to these options and can be determined according to the requirements of the actual scenario.
Illustratively, the voice wake-up processing method proposed by the present application can be applied to an electronic device; that is, both the first-stage model and the second-stage model in the system architecture above may be located on the electronic device. Of course, depending on actual needs, the first-stage model may be located on the electronic device while the second-stage model is located on a server or other device. Whichever deployment is used, the process of implementing the voice wake-up processing method is similar, so the present application does not describe the process separately for each deployment.
The electronic device above may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, a smart home device, an augmented reality (AR)/virtual reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a personal digital assistant (PDA), or the like; the embodiments of the present application do not limit the specific type of the electronic device.
It should be understood that, in order to realize voice control of an electronic device, the electronic device usually needs speech recognition capability, for example an installed voice assistant application. In this way, when the user needs to use the electronic device, no manual operation is required; simply speaking the wake-up word of the electronic device can start the electronic device or one of its installed applications, which is very convenient. In general, for different types of electronic devices from different manufacturers, the wake-up words configured for the system and for each application may differ, which the present application does not detail. The user can flexibly adjust the wake-up words of the system and applications according to actual needs; the configuration and use of wake-up words are not described in detail here.
Illustratively, Fig. 4 shows a hardware structural diagram of an electronic device implementing the voice wake-up processing method provided by the present application. The electronic device may include a sound collector 11, a communication interface 12, a memory 13, and a processor 14, wherein:
In this embodiment, the sound collector 11, the communication interface 12, the memory 13, and the processor 14 can communicate with one another via a communication bus, and the number of sound collectors 11, communication interfaces 12, memories 13, processors 14, and communication buses can each be at least one and can be determined according to specific application requirements; the present application does not limit the number of these components of the electronic device.
The sound collector 11 can collect the voice information output by the user for the electronic device, which usually contains the wake-up word for waking up the electronic device system and/or any application installed on the electronic device. That is, when the user needs to wake up the electronic device or one of its applications, the user can directly speak the corresponding preset wake-up word; the sound collector 11 of the electronic device collects the voice information containing the wake-up word so that the wake-up word can be recognized, the corresponding control instruction responded to, and the electronic device controlled to perform the preset operation. The configuration and use of the wake-up words of the electronic device are not detailed here.
The communication interface 12 can receive the voice information output by the sound collector 11 and send it to the processor 14 for processing. It can also be used for data exchange between the sound collector 11 and the memory 13, between the memory 13 and the processor 14, between other components of the electronic device and the components enumerated in this embodiment, or among other components. The present application does not detail the content of the data sent and received by the communication interface 12, which can be determined according to the type of the electronic device and its application scenario.
Based on this, the communication interface 12 may include interfaces of a wireless communication module and/or a wired communication module, such as an interface of a GSM (Global System for Mobile Communications) module, an interface of a WIFI module, or an interface of a GPRS (General Packet Radio Service) module. It may also include a USB (universal serial bus) interface, a serial/parallel port, and the like, which the present application does not describe one by one.
The memory 13 can be used to store the program implementing the voice wake-up processing method proposed by the present application. It can also store at least one preset wake-up word, various intermediate data generated during the operation of the voice wake-up processing method, data sent by other electronic devices or users, and so on, which can be determined according to the requirements of the application scenario and are not detailed here.
In practical applications, the memory 13 may include high-speed RAM memory and may also include non-volatile memory, for example at least one magnetic disk storage.
The processor 14 can be used to load and execute the program stored in the memory, so as to implement the steps of the voice wake-up processing method applied to the electronic device; the specific implementation process is described in the corresponding parts of the method embodiments below.
In this embodiment, the processor 14 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application; the specific structure of the processor 14 is not detailed here.
Optionally, the memory 13 may be independent of the processor 14 or may be deployed in the processor 14. Similarly, at least some of the interfaces included in the communication interface above may also be deployed in the processor 14, such as an integrated circuit interface, an integrated circuit built-in audio interface, or a USB interface. The present application does not limit the deployment relationship between the memory 13 and the processor 14 or the number and types of communication interfaces deployed in the processor 14, which can be determined according to actual requirements.
In addition, it should be understood that the system composition of the electronic device is not limited to the sound collector, communication interface, memory, and processor listed above. As shown in Fig. 5, the electronic device may also include components such as a display, an input device, a power module, a loudspeaker, a sensor module, a camera, an indicator light, and an antenna, which the present application does not enumerate one by one. The electronic device may include more or fewer components than those shown in Fig. 5, combine or split certain components, or use a different component arrangement; the illustrated components can be implemented in hardware, software, or a combination of hardware and software.
The interface connection relationships between the modules shown in Fig. 5 are only illustrative and do not constitute a structural limitation of the electronic device. That is, in other embodiments, the electronic device may also use interface connection relationships different from those of this embodiment, or a combination of multiple interface connection manners, which are not described one by one here.
In conjunction with the system structural diagram shown in Fig. 3, Fig. 6 shows a flow diagram of a voice wake-up processing method provided by an embodiment of the present application. As above, the method can be executed by an electronic device, or implemented by an electronic device in cooperation with a server; this embodiment is described mainly from the perspective of the electronic device. The specific implementation process may include, but is not limited to, the following steps:
Step S11: obtain audio frame features of the input voice information.
In the practical application of this embodiment, the user wishes to control the electronic device by voice, replacing traditional manual operation and freeing the user's hands. In general, corresponding wake-up words can be pre-configured for the various operations of different types of electronic devices, and the user only needs to speak the wake-up word corresponding to the desired operation to control the electronic device, by voice, to perform the corresponding operation.
For example, if the user wishes to have a smart speaker play song A, the user may say "xx, play song A". By analyzing this voice information, the smart speaker can recognize the wake-up word it contains, wake up the smart speaker system, and play song A.
In this process, since the speech features of different types of users differ greatly, for example between the two broad classes of adult and child users, in order to accurately recognize the wake-up word contained in the voice information, this embodiment can divide the voice information input to the electronic device into multiple frames (i.e. multiple audio frames) and then perform feature extraction on each frame to obtain the corresponding audio frame feature, which can be a feature vector. In this way, this embodiment can obtain n feature vectors, where the value of n depends on the number of audio frames contained in the voice information; the present application does not limit the value of n.
It should be understood that the present application does not limit the process of extracting features from the obtained input voice information to produce the feature data fed into the acoustic model. For example, after framing preprocessing of the voice information, FBank (FilterBank) feature extraction can be performed frame by frame on each preprocessed audio frame to obtain the audio frame feature of the corresponding frame. The specific implementation of FBank feature extraction is not described further here, and the implementation of obtaining the audio frame feature of each audio frame of the voice information is not limited to FBank feature extraction.
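As an illustration of this extraction step, the following Python sketch (not part of the patent) shows one common way to compute log-mel filterbank (FBank) features frame by frame; the 16 kHz sampling rate, 25 ms/10 ms frame and hop lengths, and 40 mel bands are assumptions chosen for the example.

```python
# Minimal sketch of frame-level FBank feature extraction, assuming 16 kHz mono audio.
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    return fb

def fbank_features(signal, sr=16000, frame_ms=25, hop_ms=10, n_mels=40):
    frame_len = sr * frame_ms // 1000
    hop_len = sr * hop_ms // 1000
    n_fft = 512
    fb = mel_filterbank(n_mels, n_fft, sr)
    window = np.hamming(frame_len)
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
        feats.append(np.log(fb @ power + 1e-10))   # one feature vector per audio frame
    return np.stack(feats)                          # shape: (num_frames, n_mels)
```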
Step S12: input the audio frame features into the acoustic model for processing, to obtain the posterior probabilities of the target audio frame features corresponding to each syllable of the preset wake-up word.
The acoustic model is one of the most important parts of a speech recognition system. It can be modeled with a hidden Markov model (HMM), although the modeling is not limited to this; other neural networks or deep learning networks can also be used to build the acoustic model. A hidden Markov model is a discrete-time finite-state automaton, and the algorithms for its scoring, decoding, and training can be the forward algorithm, the Viterbi algorithm, and the forward-backward algorithm, respectively; the modeling process of the acoustic model is not detailed here.
In general, the input of the acoustic model is the multidimensional features extracted by the feature extraction module, and their values can be discrete or continuous; this embodiment can obtain the audio frame features input into the acoustic model according to actual requirements.
After the multiple audio frame features of the voice information are input into the acoustic model, the acoustic model can process these audio frame features together with the acoustic features corresponding to the preset wake-up word, so as to screen out, from the multiple audio frame features, the range of audio frames corresponding to the acoustic features of each syllable of the preset wake-up word. The acoustic likelihood scores of the audio frames within the screened ranges can then be used to determine, from each range, a preset number of target audio frames that meet a preset requirement, for example the preset number of target audio frames whose acoustic likelihood scores reach a preset score, although the determination is not limited to this. This embodiment can denote the audio frame features corresponding to the target audio frames as target audio frame features. Finally, the acoustic model can be used to compute the acoustic posterior scores, i.e. the posterior probabilities, of these target audio frame features; how the acoustic model is used to compute the posterior probability of an audio frame feature is not detailed here.
It can be seen that, for the audio frame feature of each frame input into the acoustic model, a posterior probability can be obtained which indicates how likely the corresponding audio frame feature is to be an audio frame feature of the preset wake-up word. In general, the larger the posterior probability, the more likely the corresponding audio frame feature belongs to the preset wake-up word.
It should be understood that, in practical applications, after all audio frame features of the voice information are input into the acoustic model, the output data may include not only the posterior probabilities of the audio frame features of the syllables or phonemes that make up the wake-up word, but often also the posterior probabilities of audio frame features of syllables or phonemes that do not belong to the wake-up word. Since the present application performs subsequent processing only on the posterior probabilities of the audio frame features of the syllables or phonemes that make up the wake-up word, this needed part can be screened out from the output data of the acoustic model; the specific implementation process is not detailed.
The preset wake-up word in this embodiment can refer to the wake-up word preconfigured for the voice control operation that the user currently performs on the electronic device. In general, when the user issues a voice instruction for the electronic device to perform a certain operation, the voice information spoken by the user contains this preset wake-up word; the present application does not limit the content of the preset wake-up word.
In addition, it should be noted that the target audio frame features corresponding to each syllable of the preset wake-up word in step S12 can be the audio frame features which, among the input audio frame features, the acoustic model considers likely to correspond to each syllable of the preset wake-up word.
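The screening of target frames described above could take many forms; the following sketch (not from the patent) illustrates one simple interpretation, keeping the highest-scoring frames per syllable from a per-frame posterior matrix. The top_k cutoff and the layout (unit 0 as the filler unit) are assumptions.

```python
# Minimal sketch: pick candidate target frames for each wake-word syllable unit.
# `posteriors` is a (num_frames, num_units) matrix; unit 0 is assumed to be the filler unit.
import numpy as np

def select_target_frames(posteriors: np.ndarray, top_k: int = 10):
    """For each wake-word syllable unit, keep the top_k frames by posterior probability."""
    num_units = posteriors.shape[1]
    targets = {}
    for unit in range(1, num_units):              # skip the filler (non-wake-word) unit
        scores = posteriors[:, unit]
        frame_ids = np.argsort(scores)[-top_k:]   # indices of the highest-scoring frames
        targets[unit] = [(int(f), float(scores[f])) for f in sorted(frame_ids)]
    return targets   # syllable unit -> [(frame index, posterior), ...]
```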
Step S13: perform a dual confidence decision on the posterior probabilities of the target audio frame features corresponding to each syllable of the preset wake-up word, to obtain a first confidence score and a second confidence score of the corresponding syllable.
In this embodiment, after the processing by the acoustic model, the cached audio frame features of the voice information and the confidence decision modules preconfigured for different types of users are used to perform a dual confidence decision on the processing result, so that each syllable in the voice information that may belong to the preset wake-up word obtains two confidence scores, denoted the first confidence score and the second confidence score. The present application does not limit the confidence calculation method for each syllable that may belong to the preset wake-up word; it may include, but is not limited to, the following calculation:
$$\mathrm{confidence} = \left( \prod_{i=1}^{n-1} \max_{h_{\max} \le k \le j} p'_{ik} \right)^{\frac{1}{n-1}}$$
In the above confidence calculation formula, n denotes the number of output units of the acoustic model, whose specific value is determined by the acoustic model; $p'_{ik}$ denotes the smoothed posterior probability of the i-th output unit for the audio frame feature of the k-th frame; and $h_{\max} = \max\{1, j - w_{\max} + 1\}$ denotes the position of the first frame within the confidence calculation window (i.e. the confidence decision window) $w_{\max}$.
From the above confidence calculation formula, it can be seen that the present application can determine, among the posterior probabilities of the audio frame features at each output unit of the acoustic model, the maximum posterior probability of each output unit, multiply these maxima together, and take the root, so as to obtain the confidence score of each syllable of the preset wake-up word. For example, if the wake-up word by which the user wishes the electronic device to perform the preset operation is "okay google", then according to the above confidence calculation, the obtained confidence score indicates how likely it is that "okay" and "google" occur within the decision window starting at frame h_max.
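The following Python sketch (not from the patent) illustrates this windowed score: posteriors are smoothed, the per-unit maximum over the decision window is taken, and the geometric mean over the n-1 wake-word units gives the confidence. The smoothing window w_smooth, decision window w_max, and the filler unit at index 0 are assumptions.

```python
# Minimal sketch of the smoothed, windowed confidence score described above.
import numpy as np

def smooth_posteriors(post: np.ndarray, w_smooth: int = 30) -> np.ndarray:
    """Moving-average smoothing of a (num_frames, num_units) posterior matrix."""
    smoothed = np.zeros_like(post)
    for j in range(post.shape[0]):
        h = max(0, j - w_smooth + 1)
        smoothed[j] = post[h:j + 1].mean(axis=0)
    return smoothed

def confidence_at_frame(smoothed: np.ndarray, j: int, w_max: int = 100) -> float:
    """Confidence at frame j over the decision window [h_max, j]."""
    n_units = smoothed.shape[1]
    h_max = max(0, j - w_max + 1)
    window = smoothed[h_max:j + 1]
    per_unit_max = window.max(axis=0)[1:]            # skip the filler unit
    return float(np.prod(per_unit_max) ** (1.0 / (n_units - 1)))
```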
Following the analysis above of the technical concept of the voice wake-up processing method proposed by the present application, different confidence decision rules are used for different types of users to improve voice wake-up accuracy. Taking adult users (adults) and underage users (young children) as the two types of users for illustration, corresponding confidence decision modules (i.e. posterior processing modules) can be configured in advance for these two types of users to perform posterior processing, namely the adult posterior processing module and the child posterior processing module in Fig. 3. These two posterior processing modules are used to perform confidence calculation separately on the posterior probabilities, obtained above, of the target audio frame features corresponding to each syllable of the preset wake-up word, so that each syllable obtains two confidence scores.
It should be noted that there are large differences between the speech features of the different types of users concerned in the present application; for example, children usually speak more slowly than adults. Consequently, during confidence calculation, a decision window sized for adult voice information may not cover the complete speech of a wake-up word spoken by a child. The present application therefore configures the decision window for child voice information to be larger than the decision window for adult voice information; the specific sizes of the two decision windows are not limited and can be adjusted flexibly according to actual requirements.
It can be seen that, because the decision windows configured for the two confidence decision modules differ in size, the lengths of time over which the two modules cache the posterior probabilities of audio frame features also differ. When a first decision passes and the secondary decision is subsequently performed, the length of the cached audio frame features retrieved for the secondary decision changes accordingly: this length can match the corresponding decision window size, so that the audio frame features used for the secondary decision contain the complete wake-up word features as far as possible.
After the above decision window is configured, if, for example, the decision window is set to cache the audio frame features of 100 frames, then once 100 frames of audio frame features have been saved, when the audio frame feature of a newest frame is obtained, the earliest cached frame can be discarded and the newest frame added, achieving the purpose of caching; the decision window size described in this embodiment is not limiting.
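A minimal sketch of this fixed-size feature cache follows (not from the patent): once the cache is full, the oldest frame feature is discarded when a new one arrives. The 100-frame capacity mirrors the example in the text.

```python
# Minimal sketch of the rolling frame-feature cache described above.
from collections import deque
import numpy as np

class FrameFeatureCache:
    def __init__(self, capacity: int = 100):
        self._frames = deque(maxlen=capacity)   # deque drops the oldest item automatically

    def push(self, frame_feature: np.ndarray) -> None:
        self._frames.append(frame_feature)

    def last(self, length: int) -> np.ndarray:
        """Return the most recent `length` cached frame features, oldest first."""
        frames = list(self._frames)[-length:]
        return np.stack(frames)
```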
Step S14: use the passing decision between the first confidence score and the second confidence score to obtain the verification audio frame features from the audio frame features of the voice information.
Following the analysis above, for the confidence scores produced by the different confidence decision modules, the thresholds used to judge whether the corresponding syllable belongs to the preset wake-up word differ; this embodiment can denote these different thresholds as the first confidence decision threshold, the second confidence decision threshold, and so on.
Thus, after the first confidence score and the second confidence score are obtained, the first confidence score can be compared with the first confidence decision threshold and the second confidence score with the second confidence decision threshold. If either confidence score reaches the corresponding confidence decision threshold, the syllable can be considered to belong to the preset wake-up word input by the corresponding type of user. At this point, the first-stage model in Fig. 3 is triggered, and the verification audio frame features can be obtained from the cache according to the decision window size corresponding to that type of user.
For example, if the second confidence score produced by the confidence decision module for children reaches the second confidence decision threshold (i.e. the confidence decision threshold for children; correspondingly, the first confidence decision threshold applies to adults), then verification audio frame features of the corresponding length can be obtained from the cached audio frame features according to the decision window size corresponding to children. Similarly, if the first confidence score produced by the confidence decision module for adults reaches the first confidence decision threshold, verification audio frame features of the length matched to the decision window size corresponding to adults can be obtained; the specific retrieval process is not detailed.
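The dual decision and the retrieval of verification features could be combined as in the following sketch (not from the patent); the thresholds and window sizes are illustrative assumptions, with the child window larger than the adult window because children speak more slowly.

```python
# Minimal sketch of the dual confidence decision and cache retrieval described above.
ADULT_THRESHOLD, CHILD_THRESHOLD = 0.8, 0.7        # first / second confidence decision thresholds
ADULT_WINDOW, CHILD_WINDOW = 100, 150              # decision window sizes, in frames

def dual_decision(adult_score: float, child_score: float, cache: "FrameFeatureCache"):
    """Return (user type, verification frame features) if either decision passes, else None."""
    if adult_score >= ADULT_THRESHOLD:
        return "adult", cache.last(ADULT_WINDOW)
    if child_score >= CHILD_THRESHOLD:
        return "child", cache.last(CHILD_WINDOW)
    return None                                     # neither decision passed: keep listening
```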
Step S15: obtain the confidence verification result of the verification audio frame features, the confidence verification result being obtained by performing a secondary confidence decision on the verification audio frame features.
Based on the above analysis, this embodiment uses the dual confidence decision module in the first-stage model to recognize the wake-up word in the voice information. After the first-stage model is triggered, i.e. once it is preliminarily determined that the voice information contains the preset wake-up word, the second-stage model continues to perform secondary verification on the voice information. As analyzed above, the second-stage model can be deployed on the electronic device or on a server; the present application does not limit the deployment location or the structure of the second-stage model.
Optionally, for the second-stage model in Fig. 3, corresponding verification models can be configured for different types of users, such as the adult model and the child model in Fig. 3. The network structures of the two verification models can be the same, for example the larger acoustic model plus posterior processing module deployed on the electronic device or in the cloud as proposed during the development of the technical solution above, or the acoustic model plus corresponding confidence decision module of the first-stage model; the present application does not limit the specific network structure of the verification models.
It should be noted that, when building the verification model corresponding to each type of user, the speech samples of that type of user need to be used for training, and during training the audio frame lengths of the sample features input into the network also differ; refer to the description of the decision window above.
The process of the secondary confidence decision on the verification audio frame features is similar to the first confidence decision performed on the target audio frame features by the first-stage model above and is not repeated here.
Step S16: if the confidence verification result passes, respond to the instruction corresponding to the preset wake-up word and control the electronic device to perform the preset operation.
As analyzed above, the secondary confidence decision is performed only after the first-stage model is triggered, i.e. when at least one of the decisions on the first confidence score and the second confidence score in step S14 passes. If the result of the secondary confidence decision also passes, it can be considered that the wake-up word recognized from the voice information is indeed the preconfigured preset wake-up word, i.e. the wake-up word in the voice information input by the user has been accurately recognized. The electronic device can then respond to the instruction corresponding to the wake-up word and be controlled to perform the preset operation, for example controlling the smart speaker to play song A.
In summary, after the voice information input by the user to the electronic device is obtained in this embodiment, the audio frame features of the voice information are acquired and input into the acoustic model to obtain the posterior probabilities of the target audio frame features corresponding to each syllable of the preset wake-up word contained in the voice information. Then, taking into account the differences between the speech features of different types of users (such as adults and children), confidence decisions for the adult mode and the child mode are deployed separately to realize a dual confidence decision on the obtained posterior probabilities, so that each syllable obtains two confidence scores. If the decision of either confidence score passes, verification audio frame features of the corresponding length can be obtained from the cache for a secondary confidence check. If the confidence verification result passes, it can be determined that the voice information contains the preset wake-up word, the instruction corresponding to the preset wake-up word can be responded to directly, and the electronic device is controlled to perform the preset operation. The voice wake-up processing method provided by this embodiment therefore balances adult and child voice wake-up performance, improving wake-up efficiency and accuracy.
The voice wake-up processing method described above is refined below, although the refinement is not limited to the example described here. As shown in Fig. 7, which is a signaling flowchart of a refined example of the voice wake-up processing method proposed by the present application, the method may include, but is not limited to, the following steps:
Step S21: the electronic device obtains the voice information input by the user.
Step S22: the electronic device performs frame-by-frame feature extraction on the voice information to obtain audio frame features and caches them.
In this embodiment, frame-by-frame feature extraction on the voice information input by the user yields the audio frame features of the audio frames that make up the voice information. The obtained audio frame features of the voice information can then be cached so that the wake-up word in the voice information can be recognized, thereby realizing voice wake-up control of the electronic device.
The present application does not limit the acquisition method or the caching method of the audio frame features, which may include, but are not limited to, the methods described in the foregoing embodiments.
Step S23: the electronic device inputs the cached audio frame features into the acoustic model for processing to obtain the posterior probabilities of the target audio frame features corresponding to each syllable of the preset wake-up word.
For the implementation of step S23, refer to the description of the corresponding parts of the above embodiments.
Step S24: the electronic device performs confidence calculation according to the first confidence decision rule and the second confidence decision rule respectively, to obtain the first confidence score and the second confidence score of the same syllable of the preset wake-up word contained in the voice information.
In conjunction with the description of the above embodiments, this embodiment can perform confidence calculation on the posterior probability of the target audio frame features corresponding to each syllable of the preset wake-up word according to the first confidence decision rule, to obtain the first confidence score of the corresponding syllable; and perform confidence calculation on the posterior probability of the target audio frame features corresponding to each syllable of the preset wake-up word according to the second confidence decision rule, to obtain the second confidence score of the corresponding syllable. The first confidence decision rule and the second confidence decision rule differ in decision window size and confidence decision threshold; the decision window is used to determine the time length of the target audio frame features over which confidence is calculated, and its specific value is not limited.
In this embodiment, the first confidence decision rule and the second confidence decision rule above can be the confidence calculation rules on which the different confidence decision modules (i.e. posterior processing modules) base their confidence calculations; the present application does not limit their specific content, which can be determined according to the confidence calculation method of the respective confidence decision module. As analyzed above, the confidence decision modules may include an adult confidence decision module and a child confidence decision module. Compared with the prior art, a confidence decision module for the child mode is added, independent of the confidence decision module for the adult mode, so that, without affecting adult wake-up performance, the wake-up performance for child speech can be effectively improved by configuring a larger decision window.
Step S25: the electronic device makes a decision on the first confidence score using the first confidence decision threshold to obtain a first decision result, and makes a decision on the second confidence score using the second confidence decision threshold to obtain a second decision result.
This embodiment does not limit the specific values of the first confidence decision threshold and the second confidence decision threshold.
Step S26: if the first decision result or the second decision result passes, the electronic device obtains the verification audio frame features.
The verification audio frame features are the cached audio frame features matched to the decision window size corresponding to the passing decision result; for the specific acquisition process, refer to the description of the corresponding parts of the above embodiments.
Step S27: the electronic device sends a voice confidence verification request to the server.
The voice confidence verification request can carry the verification audio frame features and the user type identifier corresponding to the verification audio frame features, such as an adult user identifier or a child user identifier. It should be understood that the content carried by the voice confidence verification request is not limited to this; it may also include the result of the first confidence decision, such as pass or fail.
Step S28: the server parses the voice confidence verification request to obtain the verification audio frame features and the corresponding user type identifier.
Step S29: the server performs confidence verification on the verification audio frame features using the verification model corresponding to the user type identifier, to obtain a confidence verification result.
It can be seen that, after determining the verification audio frame features, the electronic device can have confidence verification performed on them by the verification model corresponding to the passing decision result, to obtain the confidence verification result of the verification audio frame features. For different confidence decision rules, corresponding verification models are configured, each trained on speech samples of the type of user corresponding to its confidence decision rule; for the specific implementation process, refer to the description of the corresponding parts of the above embodiments, although the processing is not limited to the manner described in this embodiment.
Step S210: the server feeds the confidence verification result back to the electronic device.
Step S211: if the confidence verification result passes, the electronic device responds to the instruction corresponding to the preset wake-up word and performs the preset operation.
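The following sketch (not from the patent) illustrates the signaling in steps S27 to S211: the device sends the verification frame features plus a user-type tag, and the server picks the matching verification model (adult or child) for the secondary check. The request fields, the model interface, and the 0.8 pass threshold are assumptions made for the example.

```python
# Minimal sketch of the device/server exchange for the secondary confidence verification.
import numpy as np

class VerificationModel:
    """Placeholder for a second-stage model (e.g. a larger acoustic model + posterior processing)."""
    def score(self, features: np.ndarray) -> float:
        # A real model would run the secondary confidence decision on `features`;
        # a constant stands in here so the sketch runs end to end.
        return 0.9

VERIFICATION_MODELS = {"adult": VerificationModel(), "child": VerificationModel()}

def handle_verification_request(request: dict) -> dict:
    """Server side: parse the request and run the verification model matching the user type."""
    features = np.asarray(request["verification_features"])
    user_type = request["user_type"]                          # "adult" or "child"
    score = VERIFICATION_MODELS[user_type].score(features)
    return {"passed": score >= 0.8, "score": score}

def on_first_stage_trigger(user_type: str, features: np.ndarray, execute_command) -> None:
    """Device side: request the secondary check and act on the result."""
    request = {"user_type": user_type, "verification_features": features.tolist()}
    result = handle_verification_request(request)             # stands in for the network round trip
    if result["passed"]:
        execute_command()                                      # respond to the wake-up word's instruction
```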
In summary, the electronic device of this embodiment configures two corresponding confidence decision modules, i.e., dual confidence decision modules, according to the respective characteristics of child speech and adult speech. Compared with the prior art, a confidence decision for the child mode is added, and the two confidence decision modules are relatively independent, so that the electronic device of this embodiment can effectively improve the wake-up performance for child speech by setting a larger judgment window, without affecting the adult wake-up performance.
Moreover, in the first-level model such as that of Fig. 3, voice information input by either an adult user or a child user is processed by the shared acoustic model, so there is no need to set up two acoustic models for these two types of users. This reduces the amount of computation and the occupation of electronic device resources, making the scheme suitable for resource-constrained scenarios on the electronic device.
In addition, in the second-level model of Fig. 3, the application configures different verification models for different types of users. The two verification models can be modeled on adult speech samples and child speech samples respectively, so the speech samples of the two classes of users can be used efficiently to reach the best performance for each, effectively improving the accuracy of the secondary confidence decision while raising the wake-up rate for child speech.
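By way of example only, the two-level design described above could be captured in a configuration such as the following, where all model names, window sizes and thresholds are placeholders:

```python
# Placeholder configuration for the two-level design: one shared first-level acoustic
# model, two decision rules with different windows/thresholds, and one verification
# model per user type at the second level. All names and values are assumptions.
WAKEUP_CONFIG = {
    "acoustic_model": "shared_syllable_posterior_model",
    "first_level": {
        "adult": {"window_ms": 800, "threshold": 0.60},
        "child": {"window_ms": 1200, "threshold": 0.45},  # larger window for children
    },
    "second_level": {
        "adult": "adult_verification_model",   # trained on adult speech samples
        "child": "child_verification_model",   # trained on child speech samples
    },
}
```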
Referring to Fig. 8, an optional example structural diagram of the voice wake-up processing apparatus proposed by the application, the apparatus may be applied to an electronic device; the application does not limit the product type of the electronic device. As shown in Fig. 8, the apparatus may include:
a feature acquisition module 21, configured to obtain audio frame features of the input voice information;
Optionally, the feature acquisition module 21 may include:
a voice information acquisition unit, configured to obtain the voice information input to the electronic device;
a feature extraction unit, configured to perform feature extraction on the voice information to obtain the audio frame features of each audio frame forming the voice information, and to cache the obtained audio frame features.
a posterior probability acquisition module 22, configured to input the audio frame features into the acoustic model for processing, obtaining the posterior probabilities of the target audio frame features corresponding to each syllable of the preset wake-up word;
a confidence decision module 23, configured to perform dual confidence decisions on the posterior probabilities of the target audio frame features corresponding to each syllable, obtaining a first confidence score and a second confidence score of the corresponding syllables;
a verification feature acquisition module 24, configured to use the decision result that passed among the first confidence score and the second confidence score to obtain the verification audio frame features from the audio frame features of the voice information;
As an optional example of the application, as shown in Fig. 9, the confidence decision module 23 may include:
a first confidence calculation unit 231, configured to perform confidence calculation on the posterior probabilities of the target audio frame features corresponding to each syllable according to the first confidence decision rule, obtaining the first confidence score of the corresponding syllables;
a second confidence calculation unit 232, configured to perform confidence calculation on the posterior probabilities of the target audio frame features corresponding to each syllable according to the second confidence decision rule, obtaining the second confidence score of the corresponding syllables;
wherein the first confidence decision rule and the second confidence decision rule differ in judgment window size and confidence decision threshold, and the judgment window is used to determine the time span of the target audio frame features over which the confidence calculation is performed (an illustrative windowed computation is sketched below).
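As referenced above, one common way (assumed here for illustration, not mandated by this embodiment) to turn per-frame syllable posteriors into a single windowed confidence score is to take each syllable's maximum posterior within the judgment window and combine them with a geometric mean:

```python
# Windowed confidence from per-syllable posteriors; the max-then-geometric-mean
# combination is one common choice and is assumed here for illustration.
import numpy as np


def window_confidence(posteriors, window_frames):
    """posteriors: array of shape (num_frames, num_syllables) from the shared acoustic model."""
    window = posteriors[-window_frames:]       # judgment window = the most recent frames
    per_syllable_max = window.max(axis=0)      # strongest evidence for each wake-word syllable
    num_syllables = window.shape[1]
    return float(np.prod(per_syllable_max) ** (1.0 / num_syllables))
```

The first and second decision rules would call such a function with different window_frames values and compare the result against their own thresholds.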
Correspondingly, the verification feature acquisition module 24 may include:
a first decision unit 241, configured to decide on the first confidence score using the first confidence decision threshold, obtaining a first decision result;
a second decision unit 242, configured to decide on the second confidence score using the second confidence decision threshold, obtaining a second decision result;
a verification audio frame feature acquisition unit 243, configured to obtain, from the audio frame features of the voice information, the verification audio frame features matching the judgment window size corresponding to the decision result that passed, when the first decision result or the second decision result passes.
a confidence verification result acquisition module 25, configured to obtain the confidence verification result of the verification audio frame features, the confidence verification result being obtained by performing a secondary confidence decision on the verification audio frame features;
Optionally, the confidence verification result acquisition module 25 may include:
a confidence verification unit, configured to perform confidence verification on the verification audio frame features using the verification model corresponding to the decision result that passed, obtaining the confidence verification result of the verification audio frame features;
wherein a corresponding verification model is configured for each of the different confidence decision rules, and the verification model is trained on speech samples of the user type corresponding to the respective confidence decision rule.
In practical applications, the confidence verification result of the verification audio frame features may be obtained by the electronic device itself performing the secondary confidence decision, or by a server or another electronic device in communication connection with the electronic device. The application does not limit the specific manner of obtaining the confidence verification result of the verification audio frame features; reference may be made to the description of the corresponding part of the method embodiments above.
On this basis, the confidence verification unit may include:
a confidence verification request sending unit, configured to send a voice confidence verification request to the server, the voice confidence verification request carrying the verification audio frame features and the user type identifier corresponding to the verification audio frame features;
a confidence verification result receiving unit, configured to receive the confidence verification result of the verification audio frame features fed back by the server, the confidence verification result being obtained by the server responding to the voice confidence verification request and performing confidence verification on the verification audio frame features using the verification model corresponding to the user type identifier.
Based on the above analysis, it should be understood that, in the example where the confidence verification result is obtained by the electronic device's own computation, the processing is similar to the computation flow described in this embodiment: verification models corresponding to the different user type identifiers can be trained in advance, and the secondary confidence verification is performed on the verification audio features using the verification model corresponding to the respective user type identifier. The specific verification process can be similar to the confidence decision method corresponding to the foregoing user type identifier and is not elaborated in this embodiment.
a voice wake-up module 26, configured to respond to the instruction corresponding to the preset wake-up word and control the electronic device to execute the preset operation, if the confidence verification result passes.
In conclusion, for the obtained voice information, this embodiment performs dual confidence decisions on the voice information by taking the speech characteristics of different types of users into account, and the dual confidence decision modules share the same acoustic model, i.e., the dual confidence decision modules perform confidence decisions on the same audio frame features. As long as one confidence decision passes, the subsequent secondary confidence verification operation is triggered: the verification audio frame features are obtained with a length matching the judgment window size used by the confidence decision that passed, and are sent to the verification model of the corresponding user type for confidence verification. If the verification passes, the obtained voice information is determined to contain the preset wake-up word, and the electronic device can respond to the voice information input by the user and execute the preset operation. As can be seen, the voice wake-up processing scheme proposed by the application can take into account both adult voice wake-up performance and child voice wake-up performance, and compared with the prior art improves the wake-up performance for child speech, i.e., improves the efficiency and accuracy of voice wake-up.
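An end-to-end sketch of the flow summarised above, using assumed interfaces for the shared acoustic model, the dual decision logic and the per-user-type verification models (none of these names are part of the disclosure):

```python
# Orchestration sketch with assumed interfaces: acoustic_model.posteriors() returns
# per-syllable posteriors for one frame, dual_decision implements the two first-level
# rules (push_frame, score, decide), and verification_models maps a user type to a
# second-level model exposing verify().
def process_utterance(frames, acoustic_model, dual_decision, verification_models):
    for frame_feature in frames:
        dual_decision.push_frame(frame_feature)                  # keep features cached
        posteriors = acoustic_model.posteriors(frame_feature)    # shared acoustic model
        adult_score = dual_decision.score(posteriors, "adult")   # first-level, rule 1
        child_score = dual_decision.score(posteriors, "child")   # first-level, rule 2
        user_type, features = dual_decision.decide(adult_score, child_score)
        if user_type is None:
            continue                                             # neither decision passed
        if verification_models[user_type].verify(features):      # second-level check
            return True    # wake word confirmed: respond and run the preset operation
    return False
```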
In addition, it should be noted that each module and unit in the above voice wake-up processing apparatus is in fact a functional module composed of program code, and the function of the functional module is realized by executing the corresponding program code. For the process by which each functional module realizes its corresponding function, reference may be made to the description of the corresponding part of the embodiments above.
An embodiment of the application further provides a storage medium on which a computer program is stored. When the computer program is executed by a processor, each step of the above voice wake-up processing method is realized; for the implementation process of the voice wake-up processing method, reference may be made to the description of the method embodiments above.
Referring to Fig. 10, an optional example structural schematic diagram of the voice wake-up processing system proposed by the application, the system may include, but is not limited to, at least one electronic device 31 and a server 32, wherein:
this embodiment does not limit the product type of each electronic device 31, which is not limited to the electronic device types shown in Fig. 10.
The server 32 may be a single service device or a server cluster composed of multiple service devices; the application does not limit the structure and type of the server 32. The server may, for example, include a communication interface, a memory and a processor. The memory may store the program for performing the secondary confidence decision method on the verification audio frame features, and the processor can call and execute the program to realize the secondary confidence decision on the verification audio frame features and obtain their confidence verification result. For the specific implementation, reference may be made to the description of the corresponding part of the method embodiments above.
As shown in Fig. 11, when a user wishes to control the electronic device by voice to execute a certain operation (i.e., a preset operation), the user can say the corresponding wake-up word. For example, if the user wants the smart speaker to play song B, the user may say "xx (which may be the wake-up word of the smart speaker system, though it is not limited to this), play song B". After the electronic device collects the voice information output by the user, it can process it in the manner described in the embodiments above: the electronic device can perform frame-by-frame feature extraction on the voice information to obtain multiple audio frame features, input them into the preset acoustic model for processing, and obtain the posterior probability of each audio frame feature. After determining the posterior probabilities of at least one target audio frame feature corresponding to each syllable of what may be the preset wake-up word contained in the voice information, dual confidence decisions are performed on the posterior probabilities of the target audio frame features corresponding to each syllable, for example by processing them separately with an adult confidence decision module and a child confidence decision module. It can be seen that the application takes into account the difference between adult and child speech characteristics by using different confidence decision modules that share one acoustic model, performing confidence calculation and decision on the posterior probabilities of each target audio frame feature output by the acoustic model. It should be noted that the judgment window sizes and confidence thresholds used here differ and can be determined according to the characteristics of different user types; typically the judgment window for children is larger than that for adults, so as to preserve the integrity of the wake-up word features as far as possible.
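A minimal framing-and-caching sketch for the frame-by-frame feature extraction mentioned above; a real system would typically compute filterbank or MFCC features per frame, which this sketch leaves abstract, and the frame/shift lengths and cache size are assumptions:

```python
# Illustrative frame splitting and caching; frame/shift lengths and the cache size
# are assumptions. Per-frame acoustic features (e.g. filterbanks) are not computed here.
from collections import deque

import numpy as np


def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a mono waveform into overlapping frames (one row per frame)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, shift)]
    return np.stack(frames) if frames else np.empty((0, frame_len))


# Roughly two seconds of frames at a 10 ms shift, kept for the verification step.
feature_cache = deque(maxlen=200)
```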
In practical applications, among the above dual confidence decision results, as long as one confidence decision passes, the first-level model shown in Fig. 3 can be considered activated, which triggers the second-level model to work. At this point, the verification audio frame features whose feature length matches the judgment window size of the user type whose confidence decision passed are obtained and sent to the verification model corresponding to that user type (which may be deployed on the electronic device, or on another electronic device such as the above server). The verification model (e.g., the adult verification model or the child verification model) performs the secondary confidence verification on the verification audio frame features in the manner described above; the detailed process is not repeated here. The verification models for the different user types are trained on the data of the corresponding user types, which guarantees the accuracy of the secondary confidence decision.
After the two rounds of confidence decision described above both pass, it can be determined that the currently obtained voice information contains the preset wake-up word, and the electronic device can respond to the control instruction corresponding to the preset wake-up word and execute the preset operation, meeting the user's voice wake-up control requirement for the electronic device. For example, when the first confidence decision passes with the child decision result, the voice information may have been output by a child and may contain the preset wake-up word; the verification audio frame features matching the child judgment window size are obtained from the cached audio frame features and sent to the child verification model for the secondary confidence decision. If it passes, the voice information is determined to have been uttered by a child and to contain the preset wake-up word, and the electronic device will respond to the voice information, improving child voice wake-up performance.
It should be noted that, in the application scenario of this embodiment, after the verification audio frame features are obtained, the processing is not limited to the manner shown in Fig. 11 of sending them to the server for the secondary confidence decision; the secondary confidence decision can also be performed by the electronic device itself, and the specific implementation process is the same, which the application does not repeat.
The embodiments in this specification are described in a progressive or parallel manner, each embodiment focusing on its differences from the other embodiments, and the same or similar parts of the embodiments may refer to each other. For the apparatus, system and electronic device disclosed in the embodiments, since they correspond to the method disclosed in the embodiments, the description is relatively simple, and the relevant parts may refer to the description of the method part.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the application.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium well known in the technical field.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the core idea or scope of the application. Therefore, the application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A voice wake-up processing method, characterized in that the method comprises:
obtaining audio frame features of input voice information;
inputting the audio frame features into an acoustic model for processing, obtaining posterior probabilities of target audio frame features corresponding to each syllable of a preset wake-up word;
performing dual confidence decisions on the posterior probabilities of the target audio frame features corresponding to each syllable, obtaining a first confidence score and a second confidence score of the corresponding syllables;
using the decision result that passed among the first confidence score and the second confidence score, obtaining verification audio frame features from the audio frame features of the voice information;
obtaining a confidence verification result of the verification audio frame features, the confidence verification result being obtained by performing a secondary confidence decision on the verification audio frame features;
if the confidence verification result passes, responding to an instruction corresponding to the preset wake-up word and controlling an electronic device to execute a preset operation.
2. The method according to claim 1, characterized in that performing dual confidence decisions on the posterior probabilities of the target audio frame features corresponding to each syllable to obtain the first confidence score and the second confidence score of the corresponding syllables comprises:
performing confidence calculation on the posterior probabilities of the target audio frame features corresponding to each syllable according to a first confidence decision rule, obtaining the first confidence score of the corresponding syllables;
performing confidence calculation on the posterior probabilities of the target audio frame features corresponding to each syllable according to a second confidence decision rule, obtaining the second confidence score of the corresponding syllables;
wherein the first confidence decision rule and the second confidence decision rule differ in judgment window size and confidence decision threshold, and the judgment window is used to determine a time span of the target audio frame features over which the confidence calculation is performed.
3. The method according to claim 2, characterized in that using the decision result that passed among the first confidence score and the second confidence score to obtain the verification audio frame features from the audio frame features of the voice information comprises:
deciding on the first confidence score using a first confidence decision threshold to obtain a first decision result, and deciding on the second confidence score using a second confidence decision threshold to obtain a second decision result;
if the first decision result or the second decision result passes, obtaining, from the audio frame features of the voice information, the verification audio frame features matching the judgment window size corresponding to the decision result that passed.
4. The method according to any one of claims 1 to 3, characterized in that obtaining the confidence verification result of the verification audio frame features comprises:
performing confidence verification on the verification audio frame features using a verification model corresponding to the decision result that passed, obtaining the confidence verification result of the verification audio frame features;
wherein a corresponding verification model is configured for each of the different confidence decision rules, and the verification model is trained on speech samples of a user type corresponding to the respective confidence decision rule.
5. The method according to claim 4, characterized in that performing confidence verification on the verification audio frame features using the verification model corresponding to the decision result that passed, to obtain the confidence verification result of the verification audio frame features, comprises:
sending a voice confidence verification request to a server, the voice confidence verification request carrying the verification audio frame features and a user type identifier corresponding to the verification audio frame features;
receiving the confidence verification result of the verification audio frame features fed back by the server, the confidence verification result being obtained by the server responding to the voice confidence verification request and performing confidence verification on the verification audio frame features using a verification model corresponding to the user type identifier.
6. The method according to any one of claims 1 to 4, characterized in that obtaining the audio frame features of the input voice information comprises:
obtaining voice information input to an electronic device;
performing feature extraction on the voice information, obtaining audio frame features of each audio frame forming the voice information, and caching the obtained audio frame features.
7. A voice wake-up processing apparatus, characterized in that the apparatus comprises:
a feature acquisition module, configured to obtain audio frame features of input voice information;
a posterior probability acquisition module, configured to input the audio frame features into an acoustic model for processing, obtaining posterior probabilities of target audio frame features corresponding to each syllable of a preset wake-up word;
a confidence decision module, configured to perform dual confidence decisions on the posterior probabilities of the target audio frame features corresponding to each syllable, obtaining a first confidence score and a second confidence score of the corresponding syllables;
a verification feature acquisition module, configured to use the decision result that passed among the first confidence score and the second confidence score to obtain verification audio frame features from the audio frame features of the voice information;
a confidence verification result acquisition module, configured to obtain a confidence verification result of the verification audio frame features, the confidence verification result being obtained by performing a secondary confidence decision on the verification audio frame features;
a voice wake-up module, configured to respond to an instruction corresponding to the preset wake-up word and control an electronic device to execute a preset operation, if the confidence verification result passes.
8. The apparatus according to claim 7, characterized in that the confidence verification result acquisition module comprises:
a confidence verification unit, configured to perform confidence verification on the verification audio frame features using a verification model corresponding to the decision result that passed, obtaining the confidence verification result of the verification audio frame features;
wherein a corresponding verification model is configured for each of the different confidence decision rules, and the verification model is trained on speech samples of a user type corresponding to the respective confidence decision rule.
9. A storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements each step of the voice wake-up processing method according to any one of claims 1 to 6.
10. An electronic device, characterized in that the electronic device comprises:
a sound collector, configured to collect voice information output by a user;
a communication interface;
a memory, configured to store a program for realizing the voice wake-up processing method according to any one of claims 1 to 6;
a processor, configured to load and execute the program stored in the memory, so as to realize each step of the voice wake-up processing method according to any one of claims 1 to 6.
CN201910828451.7A 2019-09-03 2019-09-03 Voice wake-up processing method and device, storage medium and electronic equipment Active CN110534099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910828451.7A CN110534099B (en) 2019-09-03 2019-09-03 Voice wake-up processing method and device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN110534099A true CN110534099A (en) 2019-12-03
CN110534099B CN110534099B (en) 2021-12-14

Family

ID=68666681



Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6125345A (en) * 1997-09-19 2000-09-26 At&T Corporation Method and apparatus for discriminative utterance verification using multiple confidence measures
US20160189716A1 (en) * 2013-10-11 2016-06-30 Apple Inc. Speech recognition wake-up of a handheld portable electronic device
EP2881939A1 (en) * 2013-12-09 2015-06-10 MediaTek, Inc System for speech keyword detection and associated method
US20160232916A1 (en) * 2015-02-09 2016-08-11 Oki Electric Industry Co., Ltd. Object sound period detection apparatus, noise estimating apparatus and snr estimation apparatus
FI20156000A (en) * 2015-12-22 2017-06-23 Code-Q Oy Speech recognition method and apparatus based on a wake-up call
CN106448663A (en) * 2016-10-17 2017-02-22 海信集团有限公司 Voice wakeup method and voice interaction device
CN108447472A (en) * 2017-02-16 2018-08-24 腾讯科技(深圳)有限公司 Voice awakening method and device
CN107134279A (en) * 2017-06-30 2017-09-05 百度在线网络技术(北京)有限公司 A kind of voice awakening method, device, terminal and storage medium
CN107507612A (en) * 2017-06-30 2017-12-22 百度在线网络技术(北京)有限公司 A kind of method for recognizing sound-groove and device
US20190164542A1 (en) * 2017-11-29 2019-05-30 Nuance Communications, Inc. System and method for speech enhancement in multisource environments
CN108335696A (en) * 2018-02-09 2018-07-27 百度在线网络技术(北京)有限公司 Voice awakening method and device
CN109215647A (en) * 2018-08-30 2019-01-15 出门问问信息科技有限公司 Voice awakening method, electronic equipment and non-transient computer readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Roozbeh Jafari, "Low Power Tiered Wake-up Module for Lightweight Embedded Systems Using Cross Correlation", 2011 International Conference on Body Sensor Networks *
Zhang Shuili et al., "Design of an Intelligent Home Wake-up System with Voice Function", Microcomputer Applications *
Mao Yuehui, "Research on Deep-Learning-Based Speech Recognition Technology and Its Application in Air Conditioners", Home Appliance Technology *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110910884B (en) * 2019-12-04 2022-03-22 北京搜狗科技发展有限公司 Wake-up detection method, device and medium
CN110910884A (en) * 2019-12-04 2020-03-24 北京搜狗科技发展有限公司 Wake-up detection method, device and medium
CN110910885A (en) * 2019-12-12 2020-03-24 苏州思必驰信息科技有限公司 Voice awakening method and device based on decoding network
CN111161728A (en) * 2019-12-26 2020-05-15 珠海格力电器股份有限公司 Awakening method, device, equipment and medium for intelligent equipment
CN111312222A (en) * 2020-02-13 2020-06-19 北京声智科技有限公司 Awakening and voice recognition model training method and device
CN111312222B (en) * 2020-02-13 2023-09-12 北京声智科技有限公司 Awakening and voice recognition model training method and device
CN111583927A (en) * 2020-05-08 2020-08-25 安创生态科技(深圳)有限公司 Data processing method and device for multi-channel I2S voice awakening low-power-consumption circuit
CN111667818A (en) * 2020-05-27 2020-09-15 北京声智科技有限公司 Method and device for training awakening model
CN111667818B (en) * 2020-05-27 2023-10-10 北京声智科技有限公司 Method and device for training wake-up model
CN111833867A (en) * 2020-06-08 2020-10-27 北京嘀嘀无限科技发展有限公司 Voice instruction recognition method and device, readable storage medium and electronic equipment
CN111833867B (en) * 2020-06-08 2023-12-05 北京嘀嘀无限科技发展有限公司 Voice instruction recognition method and device, readable storage medium and electronic equipment
CN111986659A (en) * 2020-07-16 2020-11-24 百度在线网络技术(北京)有限公司 Method and device for establishing audio generation model
CN112543390B (en) * 2020-11-25 2023-03-24 南阳理工学院 Intelligent infant sound box and interaction method thereof
CN112543390A (en) * 2020-11-25 2021-03-23 南阳理工学院 Intelligent infant sound box and interaction method thereof
CN112951211A (en) * 2021-04-22 2021-06-11 中国科学院声学研究所 Voice awakening method and device
CN112951211B (en) * 2021-04-22 2022-10-18 中国科学院声学研究所 Voice awakening method and device
CN113241059A (en) * 2021-04-27 2021-08-10 标贝(北京)科技有限公司 Voice wake-up method, device, equipment and storage medium
CN113539266A (en) * 2021-07-13 2021-10-22 盛景智能科技(嘉兴)有限公司 Command word recognition method and device, electronic equipment and storage medium
CN113782016B (en) * 2021-08-06 2023-05-05 佛山市顺德区美的电子科技有限公司 Wakeup processing method, wakeup processing device, equipment and computer storage medium
CN113782016A (en) * 2021-08-06 2021-12-10 佛山市顺德区美的电子科技有限公司 Wake-up processing method, device, equipment and computer storage medium
CN115132195A (en) * 2022-05-12 2022-09-30 腾讯科技(深圳)有限公司 Voice wake-up method, apparatus, device, storage medium and program product
CN115132195B (en) * 2022-05-12 2024-03-12 腾讯科技(深圳)有限公司 Voice wakeup method, device, equipment, storage medium and program product
CN115132198A (en) * 2022-05-27 2022-09-30 腾讯科技(深圳)有限公司 Data processing method, data processing device, electronic equipment, program product and medium
CN115132197A (en) * 2022-05-27 2022-09-30 腾讯科技(深圳)有限公司 Data processing method, data processing apparatus, electronic device, program product, and medium
CN115132198B (en) * 2022-05-27 2024-03-15 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment, program product and medium
CN115132197B (en) * 2022-05-27 2024-04-09 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment, program product and medium

Also Published As

Publication number Publication date
CN110534099B (en) 2021-12-14

Similar Documents

Publication Publication Date Title
CN110534099A (en) Voice wakes up processing method, device, storage medium and electronic equipment
CN107481718B (en) Audio recognition method, device, storage medium and electronic equipment
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN110838286B (en) Model training method, language identification method, device and equipment
CN112259106B (en) Voiceprint recognition method and device, storage medium and computer equipment
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
CN107767861B (en) Voice awakening method and system and intelligent terminal
CN107767863A (en) voice awakening method, system and intelligent terminal
CN110364143A (en) Voice awakening method, device and its intelligent electronic device
CN110473554B (en) Audio verification method and device, storage medium and electronic equipment
CN108694940A (en) A kind of audio recognition method, device and electronic equipment
CN104765996B (en) Voiceprint password authentication method and system
CN108711429A (en) Electronic equipment and apparatus control method
CN113643693B (en) Acoustic model conditioned on sound characteristics
CN109036395A (en) Personalized speaker control method, system, intelligent sound box and storage medium
CN110544468B (en) Application awakening method and device, storage medium and electronic equipment
CN112634897B (en) Equipment awakening method and device, storage medium and electronic device
CN113129867B (en) Training method of voice recognition model, voice recognition method, device and equipment
CN110491373A (en) Model training method, device, storage medium and electronic equipment
CN110400571A (en) Audio-frequency processing method, device, storage medium and electronic equipment
CN112562723B (en) Pronunciation accuracy determination method and device, storage medium and electronic equipment
CN113393828A (en) Training method of voice synthesis model, and voice synthesis method and device
CN110580897B (en) Audio verification method and device, storage medium and electronic equipment
CN110853669B (en) Audio identification method, device and equipment
CN114360510A (en) Voice recognition method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant