CN110534099B - Voice wake-up processing method and device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN110534099B
Authority
CN
China
Prior art keywords
confidence
audio frame
voice
judgment
result
Prior art date
Legal status
Active
Application number
CN201910828451.7A
Other languages
Chinese (zh)
Other versions
CN110534099A (en)
Inventor
陈杰
苏丹
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910828451.7A priority Critical patent/CN110534099B/en
Publication of CN110534099A publication Critical patent/CN110534099A/en
Application granted granted Critical
Publication of CN110534099B publication Critical patent/CN110534099B/en

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/065 — Adaptation
    • G10L 15/07 — Adaptation to the speaker
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 — Execution procedure of a spoken command

Abstract

The application provides a voice wake-up processing method and apparatus, a storage medium, and an electronic device. Audio frame features of input voice information are extracted and fed into an acoustic model, which outputs the posterior probabilities of the target audio frame features corresponding to each syllable of a preset wake-up word. Confidence decision modules deployed for an adult mode and a child mode respectively then perform a dual confidence decision on the obtained posterior probabilities, so that each syllable receives two confidence scores. If the decision on either confidence score passes, verification audio frame features of a corresponding length are fetched from a cache for a secondary confidence check; if that check passes, the instruction corresponding to the preset wake-up word is responded to directly and the electronic device is controlled to execute a preset operation. The voice wake-up processing method provided by this embodiment can therefore accommodate both adult and child voice wake-up performance, improving voice wake-up efficiency and accuracy.

Description

Voice wake-up processing method and device, storage medium and electronic equipment
Technical Field
The application relates to the field of artificial intelligence application, in particular to a voice awakening processing method and device, a storage medium and electronic equipment.
Background
As an artificial intelligence technology, voice recognition has been widely used in fields such as industry, home appliances, communications, automotive electronics, medical care, home services, and consumer electronics. Electronic devices in these fields thus have voice recognition capability and can be woken up, along with the applications they host, by recognizing wake-up words issued by users, which makes the devices far more convenient to use.
In the prior art, referring to the flow diagram of a conventional voice wake-up processing method shown in fig. 1, voice information input by a user is generally fed to an acoustic model (such as a deep neural network) to obtain the phonemes or syllables constituting a wake-up word, while non-wake-up speech is absorbed by a filler unit. The wake-up-word phonemes or syllables are then processed through the smoothing window and confidence calculation window of a posterior processing module to obtain a confidence score for the wake-up word; if the confidence score reaches a threshold, the electronic device is controlled to execute a preset operation in response to the wake-up word.
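The smoothing-window and confidence-calculation-window steps just described are typically implemented as in the following sketch, which follows the common keyword-spotting formulation (a trailing moving average over per-frame posteriors, then a geometric mean of each unit's maximum smoothed posterior). The window sizes here are illustrative assumptions, not values from the patent.

```python
import numpy as np

def smooth_posteriors(posteriors, w_smooth=30):
    """Trailing moving-average smoothing of per-frame posteriors.

    posteriors: array of shape (n_frames, n_units), one column per
    wake-word syllable/phoneme output by the acoustic model.
    """
    smoothed = np.empty_like(posteriors)
    for j in range(posteriors.shape[0]):
        start = max(0, j - w_smooth + 1)          # window start, clipped at 0
        smoothed[j] = posteriors[start:j + 1].mean(axis=0)
    return smoothed

def confidence_score(smoothed, w_max=100):
    """Geometric mean of each unit's max smoothed posterior in a window."""
    window = smoothed[-w_max:]                    # confidence calculation window
    maxes = window.max(axis=0)                    # best frame per wake-word unit
    return float(np.prod(maxes) ** (1.0 / smoothed.shape[1]))
```

If the resulting score reaches the threshold, the wake-up word is considered detected and the device responds.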
As can be seen, although the existing voice wake-up processing method can balance the wake-up performance by adjusting the threshold, it does not consider the difference between the adult voice feature and the child voice feature, resulting in lower accuracy of the output of the acoustic model and reduced voice wake-up performance for the electronic device.
Disclosure of Invention
In view of this, embodiments of the present application provide a voice wake-up processing method and apparatus, a storage medium, and an electronic device, which can simultaneously consider both adult voice wake-up performance and child voice wake-up performance, and improve voice wake-up efficiency and accuracy.
In order to achieve the above purpose, the embodiments of the present application provide the following technical solutions:
in one aspect, the present application provides a voice wake-up processing method, where the method includes:
acquiring audio frame characteristics of input voice information;
inputting the audio frame characteristics into an acoustic model for processing to obtain the posterior probability of the target audio frame characteristics corresponding to each syllable of the preset awakening words;
carrying out double confidence degree judgment on the posterior probability of the target audio frame characteristic corresponding to each syllable to obtain a first confidence degree score and a second confidence degree score of the corresponding syllable;
obtaining a verification audio frame characteristic in the audio frame characteristics of the voice information by using a passed judgment result in the first confidence score and the second confidence score;
obtaining a confidence coefficient checking result of the checking audio frame characteristics, wherein the confidence coefficient checking result is obtained by performing secondary confidence coefficient judgment on the checking audio frame characteristics;
and if the confidence verification result passes, responding to the instruction corresponding to the preset awakening word, and controlling the electronic equipment to execute preset operation.
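The method steps above can be sketched end to end as follows; every threshold and helper name in the sketch is an illustrative assumption, not taken from the application.

```python
def voice_wakeup(frames, acoustic_model, adult_judge, child_judge,
                 verify, cache, thr_adult=0.8, thr_child=0.7, thr_verify=0.8):
    """Dual-confidence wake-up decision (illustrative sketch).

    acoustic_model: audio frame features -> per-syllable posterior probabilities
    adult_judge / child_judge: posteriors -> confidence score in [0, 1]
    verify: cached verification frame features -> secondary confidence score
    """
    posteriors = acoustic_model(frames)
    score_adult = adult_judge(posteriors)          # first confidence score
    score_child = child_judge(posteriors)          # second confidence score
    if score_adult >= thr_adult or score_child >= thr_child:   # either passes
        if verify(cache) >= thr_verify:            # secondary confidence check
            return "execute preset operation"
    return None
```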
In another aspect, the present application provides a voice wake-up processing apparatus, including:
the characteristic acquisition module is used for acquiring the audio frame characteristics of the input voice information;
the posterior probability acquisition module is used for inputting the audio frame characteristics into an acoustic model for processing to obtain the posterior probability of the target audio frame characteristics corresponding to each syllable of the preset awakening word;
the confidence coefficient judging module is used for carrying out double confidence coefficient judgment on the posterior probability of the target audio frame characteristic corresponding to each syllable to obtain a first confidence coefficient score and a second confidence coefficient score of the corresponding syllable;
a verification feature obtaining module, configured to obtain a verification audio frame feature in the audio frame features of the speech information by using a passed determination result in the first confidence score and the second confidence score;
a confidence check result obtaining module, configured to obtain a confidence check result of the verified audio frame feature, where the confidence check result is obtained by performing secondary confidence decision on the verified audio frame feature;
and the voice awakening module is used for responding to the instruction corresponding to the preset awakening word and controlling the electronic equipment to execute preset operation if the confidence coefficient verification result passes.
In yet another aspect, the present application proposes a storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the voice wake-up processing method described above.
In yet another aspect, the present application provides an electronic device, including:
the voice collector is used for collecting voice information output by a user;
a communication interface;
a memory for storing a program for implementing the voice wakeup process as described above;
and the processor is used for loading and executing the program stored in the memory so as to realize the steps of the voice wake-up processing.
Therefore, compared with the prior art, after voice information input by the user for the electronic device is acquired, the audio frame features of the voice information are extracted and input into an acoustic model, yielding the posterior probabilities of the target audio frame features corresponding to each syllable of the preset wake-up word contained in the voice information. Taking into account the differences between the voice features of different types of users (such as adults and children), this embodiment deploys separate confidence decision modules for an adult mode and a child mode that share one acoustic model, realizing a dual confidence decision on the obtained posterior probabilities so that each syllable receives two confidence scores. If the decision on either confidence score passes, verification audio frame features of a corresponding length are fetched from the cache for a secondary confidence check. When that check passes, it can be determined that the voice information contains the preset wake-up word, and the instruction corresponding to the preset wake-up word can be responded to directly, controlling the electronic device to execute a preset operation. The voice wake-up processing method provided by this embodiment can therefore accommodate both adult and child voice wake-up performance, improving voice wake-up efficiency and accuracy.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart illustrating a conventional voice wake-up processing method;
fig. 2 is a schematic diagram illustrating an alternative structure of a voice wake-up processing method provided in the present application in a development process of the voice wake-up processing method;
fig. 3 is a schematic structural diagram illustrating an alternative example of implementing the voice wakeup processing method proposed in the present application;
fig. 4 is a schematic diagram showing a hardware structure of an alternative example of the electronic device proposed in the present application;
fig. 5 shows a hardware architecture diagram of yet another alternative example of the electronic device proposed by the present application;
fig. 6 is a flow chart illustrating an alternative example of the voice wake-up processing method proposed in the present application;
fig. 7 is a signaling flow diagram illustrating an alternative example of the voice wakeup processing method proposed in the present application;
fig. 8 is a schematic structural diagram of an alternative example of the voice wakeup processing device proposed in the present application;
fig. 9 is a schematic structural diagram of a further alternative example of the voice wakeup processing device proposed in the present application;
fig. 10 is a schematic diagram of a system structure for implementing the voice wakeup processing method proposed in the present application;
fig. 11 is a schematic view of an application scenario for implementing the voice wakeup processing method proposed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the description of the embodiments herein, "/" means "or" unless otherwise specified; for example, A/B may mean A or B. "And/or" herein merely describes an association between objects and indicates that three relationships are possible: for example, A and/or B may mean that A exists alone, A and B exist simultaneously, or B exists alone. In addition, in the description of the embodiments of the present application, "a plurality" means two or more.
The terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present embodiment, "a plurality" means two or more unless otherwise specified.
As introduced in the background, current voice wake-up electronic devices execute a voice wake-up processing method that uses only one acoustic model to process the voice information of different types of users (such as adult users and child users), so the acoustic model cannot accommodate the voice wake-up performance of both adults and children.
In order to improve voice wake-up performance, the present application proposes training two acoustic models of different sizes to form a two-level cascade that shares one posterior processing module for calculating a confidence score and making the final decision; refer to the flow diagram of a voice wake-up processing method shown in fig. 2. For voice information output by a user, voice feature information may first be extracted, for example as MFCCs (Mel-Frequency Cepstral Coefficients), although this is not limiting. The extracted voice feature information is written into a frame buffer, and a confidence score is calculated by a first-level model (e.g., the first acoustic model in fig. 2), for example using a hidden Markov model (HMM). After the first-level model is triggered, the same extracted voice feature information can be sent to a larger acoustic model (such as the second acoustic model in fig. 2), whose confidence score is calculated in a similar manner, realizing a secondary decision on the same voice feature information. Compared with the single-model voice wake-up processing shown in fig. 1, this improves voice wake-up performance to a certain extent.
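The cascade just described (a small first-level model cheaply gating a larger second-level model that re-scores the same features) can be sketched as follows; the model interfaces and thresholds are illustrative assumptions, not values from the application.

```python
def cascade_decision(features, small_model, large_model,
                     thr_small=0.6, thr_large=0.85):
    """Two-level cascade decision (illustrative sketch of the fig. 2 scheme).

    small_model / large_model: feature sequence -> confidence score in [0, 1].
    The larger model runs only when the small one triggers, saving compute.
    """
    if small_model(features) < thr_small:
        return False                                # first level not triggered
    return large_model(features) >= thr_large       # secondary decision
```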
Meanwhile, the present application also provides another voice wake-up processing method, which differs from the method shown in fig. 2 in that, after the first-level model is triggered, the voice information output by the user is sent to a server in the cloud and recognized by an Automatic Speech Recognition (ASR) component of the server. The server may employ a larger-scale acoustic model combined with a larger language model and perform decoding with a decoder, thereby implementing a secondary decision on the voice information.
It can thus be seen that both voice wake-up processing methods proposed above introduce a larger secondary model to improve system performance. Although this appropriately improves voice wake-up performance compared with a single-acoustic-model scheme, it does not genuinely account for how children's voice characteristics differ from adults' (for example, children's speaking rate is much slower). The acoustic models constructed in these methods therefore cannot truly accommodate both adult and child performance, and an electronic device using them cannot serve adults and children well at the same time, greatly degrading the user experience.
Building on the above improved schemes, and in order to solve the problem that child and adult voice wake-up performance cannot be accommodated simultaneously, the present application proposes improvements targeting children's voice characteristics on the basis of the system architecture used by the voice wake-up processing method shown in fig. 1: a dual-confidence decision mechanism is added, and in the secondary model the child and adult models are separated, so that the voice feature information and training data for children and adults differ, significantly improving wake-up performance for children.
Specifically, referring to the system structure diagram of fig. 3 for implementing the voice wake-up processing method provided in this embodiment, the system may be composed of two levels of models connected in series. As shown in fig. 3, the first-level model includes a feature calculation module and a feature cache module, and is configured with an acoustic model and a dual-confidence decision module. The dual-confidence decision module performs posterior processing for an adult mode and a child mode respectively; that is, it may include an adult posterior processing module and a child posterior processing module. In the second-level model, corresponding adult and child verification models are configured for the two posterior processing modules and share the first-level model. When the output result of either posterior processing module passes, the second-level model is triggered to perform a secondary confidence decision; if that passes, the preset wake-up word contained in the voice information is responded to and the electronic device is controlled to execute a preset operation. For the specific implementation process, refer to the description of the corresponding parts of the method embodiments.
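The fig. 3 composition (shared acoustic model, feature cache, dual adult/child posterior processing, per-branch verification models) can be sketched as a class; all module interfaces, the cache length, and the branch priority here are illustrative assumptions.

```python
from collections import deque

class TwoStageWakeup:
    """Illustrative sketch of the two-stage dual-confidence architecture."""

    def __init__(self, acoustic_model, adult_post, child_post,
                 adult_verify, child_verify, cache_len=200):
        self.acoustic_model = acoustic_model        # shared first-level model
        self.adult_post, self.child_post = adult_post, child_post
        self.adult_verify, self.child_verify = adult_verify, child_verify
        self.cache = deque(maxlen=cache_len)        # feature cache module

    def push_frame(self, frame_feature):
        """Process one audio frame feature; return a verification result
        when a branch triggers the second stage, else False."""
        self.cache.append(frame_feature)            # buffer for second stage
        posterior = self.acoustic_model(frame_feature)
        if self.adult_post(posterior):              # adult branch passed
            return self.adult_verify(list(self.cache))
        if self.child_post(posterior):              # child branch passed
            return self.child_verify(list(self.cache))
        return False
```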
In combination with the above analysis of the technical concept of the voice wake-up processing method proposed in the present application, the voice wake-up processing method can be applied to computer devices such as electronic devices (i.e. terminal devices) and/or servers. Specifically, the primary model proposed in the application can be deployed on the electronic device, and the secondary model is run after the primary model is triggered, and can be deployed on the electronic device or a cloud server, but is not limited to this deployment mode, and can be determined according to the requirements of an actual scene.
For example, the voice wake-up processing method provided by the present application may be applied to an electronic device, that is, both the primary model and the secondary model in the system structure may be located in the electronic device, and certainly, the primary model may be located in the electronic device according to actual needs, and the secondary model may be located in a server or other devices, and no matter which system layout is used, the process of implementing the voice wake-up processing method is similar.
The electronic device may be a mobile phone, a tablet computer, a wearable device, an in-vehicle device, a smart home device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a Personal Digital Assistant (PDA), and the like, and the specific type of the electronic device is not limited in the embodiment of the present application.
It should be understood that, in order to implement voice control of an electronic device, the device generally needs a voice recognition function, for example through an installed application such as a voice assistant. When a user needs to use the electronic device, the user can start the device or an application installed on it simply by speaking its wake-up word, without manual operation, which is very convenient. In general, the configured wake-up words for starting the system and applications may differ across device types and manufacturers; the configuration and usage of wake-up words are not detailed in the present application.
For example, fig. 4 shows a hardware structure diagram of an electronic device for implementing the voice wakeup processing method provided by the present application, where the electronic device may include: sound collector 11, communication interface 12, memory 13 and processor 14, wherein:
in this embodiment, the sound collector 11, the communication interface 12, the memory 13, and the processor 14 may implement mutual communication through a communication bus, and the number of the sound collector 11, the communication interface 12, the memory 13, the processor 14, and the communication bus may be at least one, and may be determined according to a specific application requirement.
The sound collector 11 may collect voice information output by a user to the electronic device, which generally includes a wake-up word for waking up the electronic device system and/or any application installed on it. That is, when the user needs to wake up the electronic device or some application it provides, the user can directly speak the corresponding preset wake-up word; the sound collector 11 collects the voice information containing the wake-up word so that, by recognizing the wake-up word, the corresponding control instruction is responded to and the electronic device is controlled to execute a preset operation.
The communication interface 12 may receive the voice information output by the sound collector 11 and send it to the processor 14 for processing; it may also be used to implement data interaction between the sound collector 11 and the memory 13, between the memory 13 and the processor 14, or between other components of the electronic device and the components listed in this embodiment.
Based on this, the communication interface 12 may include the interface of a wireless and/or wired communication module, such as the interface of a GSM (Global System for Mobile Communications) module, a WIFI module, or a GPRS (General Packet Radio Service) module, and may further include a USB (Universal Serial Bus) interface, a serial/parallel interface, and the like, which are not described in detail herein.
The memory 13 may store a program implementing the voice wake-up processing method provided by the present application, and may also store at least one preset wake-up word, various intermediate data generated while the voice wake-up processing method runs, data sent by other electronic devices or users, and the like, as determined by the requirements of the application scenario; details are not described herein.
In practical applications, the memory 13 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory (non-volatile memory), such as at least one disk memory.
The processor 14 may be configured to call and execute a program stored in the memory to implement the steps of the above-mentioned voice wakeup processing method applied to the electronic device, and the specific implementation process may refer to the description of the corresponding parts of the method embodiments below.
In this embodiment, the processor 14 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present application; the specific structure of the processor 14 is not described in detail herein.
Optionally, the memory 13 may be independent of the processor 14, or may be disposed in the processor 14, and similarly, at least a part of the interfaces included in the communication interface may also be disposed in the processor 14, such as an integrated circuit interface, an integrated circuit built-in audio interface, a USB interface, and the like.
In addition, it should be understood that the system composition of the electronic device is not limited to the sound collector, communication interface, memory, and processor listed above. As shown in fig. 5, the electronic device may further include components such as a display, an input device, a power supply module, a speaker, a sensor module, a camera, an indicator light, and an antenna, which are not listed one by one in this application. The electronic device may include more or fewer components than shown in fig. 5, or combine/split some components, or use different component arrangements; the illustrated components may be implemented in hardware, software, or a combination of the two.
Moreover, the interface connection relationship between the modules shown in fig. 5 is only schematically illustrated, and does not form a structural limitation on the electronic device, that is, in other embodiments, the electronic device may also adopt an interface connection relationship different from that in this embodiment, or a combination of multiple interface connection manners, which is not described in detail herein.
With reference to the system structure diagram shown in fig. 3, fig. 6 shows a flowchart of a voice wake-up processing method provided in this embodiment. The method may be implemented by an electronic device alone, or by an electronic device and a server in cooperation; it is described here mainly from the perspective of the electronic device. A specific implementation process may include, but is not limited to, the following steps:
step S11, obtaining the audio frame characteristics of the input voice information;
in practical application of this embodiment, a user wants to perform voice control on an electronic device to replace a conventional manual operation and liberate both hands of the user, and in a general case, corresponding wake-up words can be preconfigured for various operations of different types of electronic devices, and the user can control the electronic device to perform corresponding operations in a voice control manner only by speaking the wake-up words corresponding to the required operations.
For example, the user may wish to control a smart speaker to play song A, and so says "xx, play song A"; by analyzing the voice information, the smart speaker recognizes the wake-up word it contains, wakes up the speaker system, and plays song A.
In this process, the voice features of different types of users differ; for example, the voice features of the two major user groups, adults and children, differ greatly. In order to accurately recognize the wake-up word contained in the voice information, this embodiment may divide the voice information input to the electronic device into multiple frames (i.e., multiple audio frames) of data and perform feature extraction on each frame to obtain the corresponding audio frame feature, where each audio frame feature may be a feature vector. In this way an n-dimensional feature representation is obtained, where the value of n depends on the number of audio frames contained in the voice information; the application does not limit the value of n.
It should be noted that the present application does not limit the process of extracting features from the acquired input voice information to obtain the feature data fed into the acoustic model. For example, after frame-division preprocessing of the voice information, an FBank (filter bank) feature extraction scheme may be used to extract features frame by frame from each preprocessed audio frame, obtaining the audio frame feature of the corresponding frame.
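As an illustration of the frame division and FBank extraction just described, a minimal NumPy sketch follows. The 25 ms frame length, 10 ms hop, and 40 mel filters are common defaults assumed here, not parameters specified by the application.

```python
import numpy as np

def fbank(signal, sr=16000, frame_ms=25, hop_ms=10, n_filters=40, n_fft=512):
    """Minimal log-mel filterbank (FBank) feature extraction."""
    # Pre-emphasis to boost high frequencies.
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)        # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)                # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    # Frame division, then a Hamming window per frame.
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)]) * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular mel filterbank.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return np.log(power @ fb.T + 1e-10)          # (n_frames, n_filters)
```

Each row of the result is one audio frame feature vector, ready to be fed to the acoustic model.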
Step S12, inputting the audio frame characteristics into an acoustic model for processing to obtain the posterior probability of the target audio frame characteristics corresponding to each syllable of the preset awakening words;
the acoustic model is one of the most important parts in the speech recognition system, and may be modeled by using a hidden markov model HMM, but is not limited to this modeling method, and may be constructed by using a deep learning network such as other neural networks. The hidden Markov model is a discrete time domain finite state automaton, and the corresponding algorithms for scoring, decoding and training can be a forward algorithm, a Viterbi algorithm, a forward and backward algorithm and the like, and the modeling process of the acoustic model is not detailed in the application.
In general, the input of the acoustic model is the multi-dimensional features extracted by the feature extraction module, whose values may be discrete or continuous; in this embodiment, these are the audio frame features to be input into the acoustic model.
In this embodiment, after the multiple audio frame features obtained from the voice information are input into the acoustic model, the acoustic model may process them together with the acoustic features corresponding to the preset wake-up word, so as to screen out, from the multiple audio frame features, the range of audio frames corresponding to each syllable of the preset wake-up word. A preset number of target audio frames meeting a preset requirement may then be determined from each screened range by using the acoustic likelihood score of each audio frame in that range, for example a preset number of target audio frames whose acoustic likelihood scores reach a preset score, though the determination is not limited to this manner. The audio frame features corresponding to the target audio frames may be marked as the target audio frame features, and finally the acoustic model may be used to calculate the acoustic posterior score, that is, the posterior probability, of each target audio frame feature. How the acoustic model is used to calculate the posterior probability of an audio frame feature is not described in detail in this application.
Therefore, inputting each frame's audio frame feature into the acoustic model yields a posterior probability. This posterior probability represents the possibility that the corresponding audio frame feature belongs to the preset wake-up word; in general, the greater the posterior probability, the greater that possibility.
It should be understood that, in practical applications, after all audio frame features of the speech information are input into the acoustic model, the output data may include not only the posterior probabilities of the audio frame features of the syllables or phonemes constituting the wakeup word, but also the posterior probabilities of the audio frame features of the syllables or phonemes of other non-wakeup words.
The preset wake-up word in this embodiment may refer to the wake-up word preset for the voice control currently performed on the electronic device by a user. In general, when the user issues a voice instruction to the electronic device for performing a certain operation, the voice information spoken by the user includes the preset wake-up word; the content of the preset wake-up word is not limited in this application.
In addition, it should be noted that the target audio frame features corresponding to each syllable of the preset wake-up word in step S12 may be, among the audio frame features input to the acoustic model, the audio frame features corresponding to each syllable of the preset wake-up word.
Step S13, carrying out double confidence degree judgment on the posterior probability of the target audio frame characteristics corresponding to each syllable of the preset awakening words to obtain a first confidence degree score and a second confidence degree score of the corresponding syllable;
in this embodiment, after the cached audio frame features of the voice information are processed by the acoustic model, the processing result is subjected to a double confidence decision by different confidence decision modules preset for different types of users, so that each syllable that may belong to the preset wake-up word obtains two confidence scores, recorded as the first confidence score and the second confidence score. The method for calculating the confidence of each such syllable is not limited in the present application, and may include, but is not limited to, the following calculation method:
$$\text{confidence} = \left( \prod_{i=1}^{n} \max_{h_{\max} \le k \le j} p'_{ik} \right)^{\frac{1}{n}}$$
In the above confidence calculation formula, $n$ may represent the number of output units of the acoustic model, the specific value of which may be determined according to the specific structure of the acoustic model; $p'_{ik}$ may represent the smoothed posterior probability of the audio frame feature of the $i$-th unit at the $k$-th frame; and $h_{\max} = \max\{1, j - w_{\max} + 1\}$ may represent the position of the first frame in the confidence computation window (i.e., the confidence decision window) of size $w_{\max}$.
According to this confidence calculation formula, the maximum posterior probability of each output unit of the acoustic model is determined within the window, and the confidence score of each syllable of the preset wake-up word is obtained after multiplying these maxima and taking the $n$-th root. If the user wants the wake-up word for the electronic device's preset operation to be 'okay google', the confidence score obtained in this manner represents how likely 'okay' and 'google' are to occur within a window of size $w_{\max}$.
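The confidence calculation just described, taking each output unit's maximum smoothed posterior within a sliding window and combining the maxima by a geometric mean, can be sketched as follows (the function name and the 0-based frame indexing are illustrative choices):

```python
import numpy as np

def confidence_score(posteriors, j, w_max):
    """Confidence at frame j: the geometric mean, over the n wake-word output
    units, of each unit's maximum smoothed posterior inside the sliding
    window of size w_max ending at frame j.

    posteriors: array of shape (n_units, n_frames) holding smoothed
    posterior probabilities p'_ik (frames indexed from 0 here).
    """
    n_units = posteriors.shape[0]
    h_max = max(0, j - w_max + 1)        # first frame of the decision window
    window = posteriors[:, h_max:j + 1]  # (n_units, window length)
    per_unit_max = window.max(axis=1)    # best score of each unit in the window
    return float(per_unit_max.prod() ** (1.0 / n_units))
```

A child-mode decision module would simply call this with a larger `w_max` than the adult-mode module, matching the larger decision window discussed below.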
Following the above analysis of the technical concept of the voice wake-up processing method proposed by the present application, the application adopts different confidence decision rules for different types of users to improve voice wake-up accuracy, taking adult users (i.e., adults) and immature users (young children) as the different user types for explanation. Corresponding confidence decision modules (i.e., posterior processing modules) may be configured in advance for these two user types, such as the adult posterior processing module and the child posterior processing module in fig. 3. These two posterior processing modules are used to respectively perform confidence calculation on the obtained posterior probabilities of the target audio frame features corresponding to each syllable of the preset wake-up word, so that two confidence scores are obtained for each syllable.
It should be noted that the speech characteristics of different types of users differ greatly; for example, the speech speed of children is generally slower than that of adults, so a decision window sized for the voice information of an adult user may not cover a child's complete utterance of the wake-up word during confidence calculation. The present application may therefore configure the decision window applicable to the voice information of child users to be larger than the decision window applicable to adult users. The specific sizes of the two decision windows are not limited and can be flexibly adjusted according to actual requirements.
Therefore, because the decision windows configured for the two confidence decision modules differ in size, the two modules cache the posterior probabilities of the audio frame features for different lengths of time. When a decision passes and the secondary decision is subsequently performed, the length of the cached audio frame features to be decided changes accordingly, matching the size of the corresponding decision window, so that the audio frame features used for the secondary decision contain the complete wake-up word features as far as possible.
After the decision window is configured, for example to buffer 100 frames of audio frame features, then once 100 frames have been stored, each time the audio frame feature of a newest frame arrives, the earliest buffered frame is discarded and the newest frame's feature is added, thereby achieving the purpose of buffering.
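The fixed-length buffering behavior described here, where the earliest frame is discarded once the window is full, is exactly what a bounded deque provides. A minimal sketch (the class and method names are hypothetical):

```python
from collections import deque

class FeatureCache:
    """Fixed-size cache of the most recent audio-frame features, sized to the
    larger of the two decision windows so a full wake word is retained."""
    def __init__(self, window_frames=100):
        # deque with maxlen drops the oldest entry automatically when full
        self.buf = deque(maxlen=window_frames)

    def push(self, frame_feature):
        self.buf.append(frame_feature)

    def latest(self, n_frames):
        """Return the newest n_frames features, e.g. the window-matched
        segment handed to the secondary verification step."""
        return list(self.buf)[-n_frames:]
```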
Step S14, obtaining the verification audio frame characteristic in the audio frame characteristic of the voice information by using the passing judgment result in the first confidence score and the second confidence score;
after the above analysis, the confidence scores obtained by the different confidence decision modules are compared against different thresholds to determine whether the corresponding syllables are syllables of the preset wake-up word; in this embodiment, these different thresholds may be recorded as a first confidence decision threshold, a second confidence decision threshold, and so on.
Thus, after the first confidence score and the second confidence score are obtained, the first confidence score may be compared with the first confidence decision threshold and the second confidence score with the second confidence decision threshold. If either confidence score reaches its corresponding confidence decision threshold, the syllable may be considered to belong to the preset wake-up word input by the corresponding type of user; at this point, the first-level model in fig. 3 will be triggered, and the verification audio frame features may be obtained from the cache according to the size of the decision window corresponding to that user type.
For example, if the second confidence score obtained by the confidence decision module applicable to the child reaches the second confidence decision threshold (i.e., the confidence decision threshold of the child, and accordingly, the first confidence decision threshold is applicable to adults), the verification audio frame features with corresponding lengths may be obtained from the cached audio frame features according to the size of the decision window corresponding to the child; similarly, if the first confidence score obtained by the confidence judgment module applicable to adults reaches the first confidence judgment threshold, the verification audio frame features with the corresponding length and matched with the size of the judgment window corresponding to the adults can be obtained, and the specific obtaining process is not described in detail.
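The branch logic just illustrated, taking whichever decision passes and fetching a window-matched number of cached frames, might be sketched as below. All thresholds and window sizes are placeholder values, since the patent leaves the concrete numbers open:

```python
def dual_decision(score_adult, score_child,
                  thr_adult=0.6, thr_child=0.5,
                  win_adult=80, win_child=120):
    """Apply both confidence decisions; if either passes, report which user
    type triggered and how many cached frames to fetch for verification.
    Thresholds and window sizes are illustrative placeholders only."""
    if score_adult >= thr_adult:
        return "adult", win_adult
    if score_child >= thr_child:
        return "child", win_child  # larger window: child speech is slower
    return None, 0
```

With a frame cache like the one described earlier, the returned frame count would be passed to its retrieval method to obtain the verification segment.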
Step S15, obtaining a confidence coefficient checking result of the checking audio frame characteristic, wherein the confidence coefficient checking result is obtained by carrying out secondary confidence coefficient judgment on the checking audio frame characteristic;
based on the above analysis, in this embodiment, a dual-confidence-level decision module is adopted in a primary model to recognize a wake-up word of voice information, and after the primary model is woken up, that is, under the condition that it is preliminarily determined that the voice information includes a preset wake-up word, secondary verification is continuously performed on the voice information by a secondary model.
Optionally, for the secondary model shown in fig. 3, corresponding verification models may be configured for different types of users, such as the adult model and the child model shown in fig. 3. The network structures of the two verification models may be the same, for example a larger acoustic model deployed in the electronic device or in the cloud together with the posterior processing module provided in the above development of the technical scheme, or the acoustic model in the primary model together with the corresponding confidence decision module, and the like; the specific network structure of the verification model is not limited in the present application.
It should be noted that, in the process of constructing the verification models corresponding to different types of users, voice samples of users of the corresponding types need to be used for training; during training, the lengths of the audio frames of the sample features input into the network also differ, for which the description of the decision window portion may be referred to.
The secondary confidence degree judgment process for the verified audio frame features is similar to the primary confidence degree judgment process for the target audio frame features by the primary model, and is not repeated in the application.
Step S16, if the confidence check result passes, controlling the electronic device to execute a preset operation in response to the instruction corresponding to the preset wake-up word.
As described above, after the primary model is awakened, that is, when at least one of the first confidence score and the second confidence score in step S14 passes the decision, the secondary confidence decision is performed. If the confidence decision result obtained by the secondary confidence decision also passes, it can be considered that the wake-up word identified from the voice information is indeed the preset wake-up word, that is, the wake-up word in the voice information input by the user has been accurately identified. The electronic device can then respond to the instruction corresponding to the wake-up word and be controlled to perform the preset operation, such as controlling a smart speaker to play song A.
To sum up, after acquiring the voice information input by the user for the electronic device, this embodiment acquires the audio frame features of the voice information and inputs them into the acoustic model for processing, obtaining the posterior probabilities of the target audio frame features corresponding to each syllable of the preset wake-up word included in the voice information. Then, considering the difference between the voice features of different types of users (such as adults and children), the embodiment deploys confidence decisions for the adult mode and the child mode respectively, realizing a double confidence decision on the obtained posterior probabilities so that each syllable obtains two confidence scores. When the decision result of either confidence score passes, the verification audio frame features of the corresponding length can be acquired from the cache for secondary confidence checking; when the check result passes, it can be determined that the voice information includes the preset wake-up word, the instruction corresponding to the preset wake-up word can be directly responded to, and the electronic device controlled to execute the preset operation. Therefore, the voice wake-up processing method provided by this embodiment can give consideration to both adult and child voice wake-up performance, improving voice wake-up efficiency and accuracy.
The voice wake-up processing method described above in the present application will be detailed below, though it is not limited to the detailed example described. As shown in fig. 7, a signaling flowchart of a detailed example of the voice wake-up processing method proposed in the present application, the method may include, but is not limited to, the following steps:
step S21, the electronic equipment acquires the voice information input by the user;
step S22, the electronic equipment extracts the feature of the voice information frame by frame to obtain the audio frame feature and caches the audio frame feature;
in this embodiment, the voice information input by the user is subjected to feature extraction frame by frame, so that audio frame features of each audio frame constituting the voice information are obtained, and then, the obtained audio frame features of the voice information can be cached to recognize a wakeup word of the voice information, so as to realize voice wakeup control of the electronic device.
The method for acquiring the audio frame features and the buffering mode thereof are not limited in the present application, and may include, but are not limited to, the methods described in the above embodiments.
Step S23, the electronic equipment inputs the cached audio frame characteristics into an acoustic model for processing to obtain the posterior probability of the target audio frame characteristics corresponding to each syllable of the preset awakening word;
the implementation process of step S23 can refer to the description of the corresponding parts of the above embodiments.
Step S24, the electronic device carries out confidence calculation according to the first confidence decision rule and the second confidence decision rule respectively, to obtain a first confidence score and a second confidence score for each syllable of the preset wake-up word contained in the voice information;
with reference to the description of the foregoing embodiment, in this embodiment, confidence calculation is performed according to the first confidence decision rule on the posterior probability of the target audio frame feature corresponding to each syllable in the preset wake-up word, obtaining the first confidence score of the corresponding syllable; and according to the second confidence decision rule, the second confidence score of the corresponding syllable is obtained in the same way. The first confidence decision rule and the second confidence decision rule differ in decision window size and confidence decision threshold; the decision window is used to determine the duration of the target audio frame features over which the confidence calculation runs, and the specific numerical values are not limited.
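The two decision rules, differing only in window size and threshold, can be represented as simple configuration objects. The names and numerical values below are illustrative assumptions, since the patent leaves both unspecified:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DecisionRule:
    """One confidence decision rule: a window length (in frames) over which
    the confidence is computed, and the threshold its score must reach."""
    name: str
    window_frames: int
    threshold: float

# Placeholder values; the child rule gets the larger window because
# child speech is generally slower, as discussed above.
FIRST_RULE = DecisionRule("adult", window_frames=80, threshold=0.6)
SECOND_RULE = DecisionRule("child", window_frames=120, threshold=0.5)
```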
In this embodiment, the first confidence decision rule and the second confidence decision rule may be the confidence calculation rules followed by the different confidence decision modules (i.e., posterior processing modules) in their confidence calculation processes; their specific content is not limited in the present application and may be determined according to the confidence calculation method of the corresponding confidence decision module. As described above, the confidence decision modules may include an adult confidence decision module and a child confidence decision module. Compared with the prior art, a confidence decision module for the child mode is added, independent of the adult confidence decision module; by setting a larger decision window, the wake-up performance for child voice can be effectively improved without affecting the wake-up performance for adults.
Step S25, the electronic device judges the first confidence score by using the first confidence judgment threshold to obtain a first judgment result, and judges the second confidence score by using the second confidence judgment threshold to obtain a second judgment result;
the present embodiment does not limit the specific values of the first confidence level decision threshold and the second confidence level decision threshold.
Step S26, the electronic equipment acquires the verification audio frame characteristic under the condition that the first judgment result or the second judgment result passes;
the verified audio frame feature is an audio frame feature that is cached and matches the size of the decision window corresponding to the passed decision result, and the specific acquisition process may refer to the description of the corresponding portion of the above embodiment.
Step S27, the electronic equipment sends a voice confidence check request to the server;
the voice confidence verification request may carry verification audio frame features and user type identifiers corresponding to the verification audio frame features, such as adult user identifiers and child user identifiers, which need to be described.
Step S28, the server analyzes the voice confidence check request to obtain the check audio frame characteristics and the corresponding user type identification;
step S29, the server uses the verification model corresponding to the user type identification to carry out confidence verification on the verification audio frame characteristics to obtain a confidence verification result;
it can be seen that, after determining the verification audio frame features, the electronic device may have confidence verification performed on them by the verification model corresponding to the passed decision result, to obtain the confidence verification result of the verification audio frame features. For different confidence decision rules, corresponding verification models are configured, each obtained by training on voice samples of the user type corresponding to that confidence decision rule.
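One possible shape for the client/server exchange in steps S27 through S29, where the server dispatches to a verification model keyed by the user type identifier, is sketched below. The JSON field names and the toy stand-in "models" are purely hypothetical; in practice each entry would be a trained network scoring the verification audio frame features:

```python
import json

# Hypothetical verification models keyed by user type identifier.
# Real models would be neural networks; these stand-ins just threshold
# the mean of the received feature values for illustration.
VERIFICATION_MODELS = {
    "adult": lambda feats: sum(feats) / len(feats) > 0.55,
    "child": lambda feats: sum(feats) / len(feats) > 0.45,
}

def handle_check_request(request_body: str) -> str:
    """Parse a voice confidence check request, run the verification model
    matching the user type identifier, and return a pass/fail result."""
    req = json.loads(request_body)
    model = VERIFICATION_MODELS[req["user_type"]]
    passed = model(req["frame_features"])
    return json.dumps({"check_passed": bool(passed)})
```

The electronic device would serialize the verification audio frame features and user type identifier into the request body, and react to the `check_passed` field fed back in step S210.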
Step S210, the server feeds back the confidence coefficient checking result to the electronic equipment;
step S211, the electronic device responds to the instruction corresponding to the preset wakeup word and executes a preset operation when the confidence check result passes.
In summary, the electronic device of this embodiment configures two corresponding confidence decision modules, that is, a double confidence decision module, for the respective characteristics of child voice and adult voice. Compared with the prior art, a confidence decision for the child mode is added, and the two confidence decision modules are relatively independent, so the wake-up performance for child voice can be effectively improved by setting a larger decision window without affecting the wake-up performance for adults.
In addition, in the primary model as shown in fig. 3, no matter an adult user or a child user inputs voice information, the acoustic models are shared for processing, two acoustic models do not need to be set for the two types of users, the calculation amount is reduced, and the occupation of resources of the electronic equipment is reduced, so that the method can be suitable for scenes with limited resources on the electronic equipment.
In addition, in the secondary model of fig. 3, different verification models are configured for different types of users, the two verification models can be respectively modeled for adult voice samples and child voice samples, the voice samples of the two types of users can be effectively utilized, respective optimal performances can be respectively obtained, accuracy of secondary confidence degree judgment is effectively improved, and meanwhile awakening rate of the child voice is improved.
Referring to fig. 8, which is a block diagram of an alternative example of a voice wakeup processing apparatus proposed in the present application, the apparatus may be used in an electronic device, and the product type of the electronic device is not limited in the present application, as shown in fig. 8, the apparatus may include:
a feature obtaining module 21, configured to obtain an audio frame feature of the input voice information;
optionally, the feature obtaining module 21 may include:
a voice information acquisition unit for acquiring voice information input for the electronic device;
and the feature extraction unit is used for extracting features of the voice information to obtain audio frame features of each audio frame forming the voice information and caching the obtained audio frame features.
The posterior probability obtaining module 22 is configured to input the audio frame features into an acoustic model for processing, so as to obtain posterior probabilities of target audio frame features corresponding to each syllable of a preset wake-up word;
the confidence coefficient judging module 23 is configured to perform double confidence coefficient judgment on the posterior probability of the target audio frame feature corresponding to each syllable, so as to obtain a first confidence coefficient score and a second confidence coefficient score of the corresponding syllable;
a verification feature obtaining module 24, configured to obtain a verification audio frame feature in the audio frame features of the speech information by using a passing determination result in the first confidence score and the second confidence score;
as an alternative example of the present application, as shown in fig. 9, the confidence level determining module 23 may include:
a first confidence coefficient calculating unit 231, configured to perform confidence coefficient calculation on the posterior probability of the target audio frame feature corresponding to each syllable according to a first confidence coefficient decision rule, so as to obtain a first confidence coefficient score of the corresponding syllable;
a second confidence coefficient calculating unit 232, configured to perform confidence coefficient calculation on the posterior probability of the target audio frame feature corresponding to each syllable according to a second confidence coefficient decision rule, so as to obtain a second confidence coefficient score of the corresponding syllable;
the first confidence coefficient judgment rule and the second confidence coefficient judgment rule have different judgment window sizes and confidence coefficient judgment thresholds, and the judgment windows are used for determining the time length of the target audio frame features for performing confidence coefficient calculation.
Accordingly, the verification feature obtaining module 24 may include:
a first decision unit 241, configured to decide the first confidence score by using a first confidence decision threshold to obtain a first decision result;
a second decision unit 242, configured to decide the second confidence score by using a second confidence decision threshold, so as to obtain a second decision result;
and a verification audio frame feature obtaining unit 243, configured to, when the first decision result or the second decision result is passed, obtain, from the audio frame features of the speech information, a verification audio frame feature that matches the size of the decision window corresponding to the passed decision result.
A confidence check result obtaining module 25, configured to obtain a confidence check result of the verified audio frame feature, where the confidence check result is obtained by performing secondary confidence decision on the verified audio frame feature;
optionally, the confidence verification result obtaining module 25 may include:
the confidence coefficient checking unit is used for carrying out confidence coefficient checking on the checking audio frame characteristics by using a checking model corresponding to the passed judgment result to obtain a confidence coefficient checking result of the checking audio frame characteristics;
and aiming at different confidence coefficient judgment rules, corresponding verification models are configured, wherein the verification models are obtained by training voice samples of users of types corresponding to the corresponding confidence coefficient judgment rules.
In practical applications, the confidence level verification result of the verified audio frame feature may be obtained by directly performing secondary confidence level determination by the electronic device, or by performing secondary confidence level determination by a server or other electronic devices that can be in communication connection with the electronic device.
Based on this, the confidence level check unit may include:
a confidence check request sending unit, configured to send a voice confidence check request to a server, where the voice confidence check request carries the check audio frame feature and a user type identifier corresponding to the check audio frame feature;
and the confidence coefficient checking result receiving unit is used for receiving a confidence coefficient checking result of the checking audio frame characteristics fed back by the server, wherein the confidence coefficient checking result is obtained by the server responding to the voice confidence coefficient checking request and performing confidence coefficient checking on the checking audio frame characteristics by using a checking model corresponding to the user type identifier.
Based on the above analysis, it should be understood that, in the example where the confidence level verification result is obtained by direct operation of the electronic device, similar to the operation processing process described in this embodiment, verification models corresponding to different user type identifiers may be trained in advance, and a secondary confidence level verification may be performed on the verification audio features corresponding to the corresponding user type identifiers by using the verification models, where a specific verification process may be similar to a previous confidence level determination method corresponding to the user type identifier, and this embodiment is not described in detail.
And the voice awakening module 26 is configured to respond to the instruction corresponding to the preset awakening word if the confidence verification result passes, and control the electronic device to execute a preset operation.
In summary, in this embodiment, for the obtained voice information, a double confidence decision is performed in combination with the voice characteristics of different types of users, and the double confidence decision modules share the same acoustic model; that is, the double confidence decision modules perform confidence decisions on the same audio frame features. As long as one confidence decision passes, the subsequent secondary confidence check operation is triggered: the verification audio frame features are obtained at a length matching the size of the decision window used by the passed confidence decision, and are sent to the check model of the corresponding user type for confidence checking. If the check passes, it is determined that the obtained voice information includes the preset wake-up word, and the electronic device may respond to the voice information input by the user and perform the preset operation. Therefore, the voice wake-up processing scheme provided by the application can simultaneously take account of adult and child voice wake-up performance; compared with the prior art, the voice wake-up performance for children is improved, that is, voice wake-up efficiency and accuracy are improved.
In addition, it should be noted that each module and unit in the above voice wake-up processing apparatus is actually a functional module composed of program code, whose function is realized by executing the corresponding program code; for the process by which each functional module realizes its corresponding function, the description of the corresponding part of the above embodiments may be referred to.
The embodiment of the present application further provides a storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the above-mentioned voice wake-up processing method, and the implementation process of the voice wake-up processing method may refer to the description of the above-mentioned method embodiment.
Referring to fig. 10, a schematic diagram of an alternative example of a voice wake-up processing system proposed in the present application may include, but is not limited to: at least one electronic device 31 and a server 32, wherein:
the present embodiment does not limit the product type of each electronic device 31, which is not restricted to the types of electronic devices shown in fig. 10.
The server 32 may be a single service device or a server cluster composed of a plurality of service devices; the structure and type of the server 32 are not limited in the present application. For example, the server may include a communication interface, a memory, and a processor. The memory in the server may store a program for performing the secondary confidence decision on the verification audio frame features, and the processor may call and execute the program to perform the secondary confidence decision on the verification audio frame features and obtain their confidence verification result. For the specific implementation process, reference may be made to the description of the corresponding part of the above method embodiment.
As shown in fig. 11, when a user desires to control the electronic device by voice to perform a certain operation (i.e., a preset operation), the user may speak the corresponding wakeup word. For example, if the user wants the smart speaker to play song B, the user may say "xx, play song B" (where "xx" may be, but is not limited to, the wakeup word of the smart speaker system). After the electronic device collects the voice information output by the user, it may process the voice information in the manner described in the above embodiments. For example, the electronic device may perform feature extraction on the voice information frame by frame to obtain a plurality of audio frame features, input them into a preset acoustic model for processing to obtain the posterior probabilities of the audio frame features, and determine the posterior probability of at least one target audio frame feature corresponding to each syllable of the preset wakeup word that may be contained in the voice information. A dual confidence decision is then performed on the posterior probabilities of the target audio frame features corresponding to each syllable, processed by the adult confidence decision module and the child confidence decision module respectively. It can be seen that this method takes into account the difference between adult and child voice characteristics: the different confidence decision modules share one acoustic model and each performs confidence calculation and decision on the posterior probability of every target audio frame feature output by that acoustic model. It should be noted that the decision window sizes and confidence thresholds used by the two modules are different and may be determined according to the characteristics of the different user types; the decision window for a child is usually larger than that for an adult, so as to preserve the completeness of the wakeup word features as much as possible.
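The frame-by-frame feature extraction mentioned above starts by slicing the waveform into overlapping frames, each of which is then converted into a feature vector for the acoustic model. A minimal sketch of the framing step follows; the frame and hop sizes are typical assumed values, not taken from the patent.

```python
def frame_signal(samples, frame_len=400, hop=160):
    """Split a waveform into overlapping frames.

    With 16 kHz audio the defaults correspond to 25 ms frames with a
    10 ms hop -- common values, assumed here since the patent does not
    specify them.  Each returned frame would then be turned into an
    audio frame feature and fed to the acoustic model.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames
```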
In practical applications, as long as one of the two confidence decisions passes, the primary model shown in fig. 3 is considered activated, and the secondary model may be triggered to operate. At this time, verification audio frame features whose length matches the size of the decision window of the user type whose confidence decision passed are obtained and sent to the verification model corresponding to that user type (which may be deployed in the electronic device or in another electronic device, such as the above server). The verification model (such as the adult verification model or the child verification model) performs secondary confidence verification on the verification audio frame features according to the processing method described above; the specific process is not described in detail. The verification models for different user types are trained on data of the corresponding user types, which ensures the accuracy of the secondary confidence decision.
After both confidence decisions pass, it can be determined that the currently acquired voice information contains the preset wakeup word; the electronic device can then respond to the control instruction corresponding to the preset wakeup word and execute the preset operation, satisfying the user's voice wakeup control requirement for the electronic device. For example, if the child confidence decision passes in the first confidence decision, the voice information can be considered to have been output by a child and may contain the preset wakeup word. Verification audio frame features matching the size of the child decision window are then obtained from the cached audio frame features and sent to the child verification model for the secondary confidence decision. If this also passes, it is determined that the voice information was uttered by a child and contains the preset wakeup word, and the electronic device responds to the voice information, thereby improving the voice wakeup performance for children.
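The trigger logic just described, in which a first-stage pass selects both the cached feature segment to fetch and the verification model to consult, can be sketched as follows. The `verifiers` interface is hypothetical: in practice it would wrap the adult or child verification model, possibly behind a server call, and the window lengths are illustrative.

```python
def trigger_verification(first_stage, cached_features, windows, verifiers):
    """Second-stage trigger logic, sketched from the description above.

    first_stage: dict mapping user type -> bool (did the primary
        confidence decision pass for that type).
    cached_features: the buffered audio frame features, oldest first.
    windows: dict mapping user type -> decision-window length in frames.
    verifiers: dict mapping user type -> callable(features) -> bool,
        a hypothetical stand-in for the adult/child verification models.
    Returns True only if some type that passed the first stage is also
    confirmed by its verification model.
    """
    for user_type, passed in first_stage.items():
        if not passed:
            continue
        # Fetch the cached segment whose length matches the decision
        # window of the user type that passed the first stage.
        segment = cached_features[-windows[user_type]:]
        if verifiers[user_type](segment):
            return True  # wakeup word confirmed; device may respond
    return False
```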
It should be noted that the application scenario of this embodiment is not limited to the processing manner shown in fig. 11, in which the verification audio frame features are sent to the server for the secondary confidence decision; the secondary confidence decision may also be performed by the electronic device itself. The specific implementation process is the same and is not described in detail here.
The embodiments in the present description are described in a progressive or parallel manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another. Since the device, system, and electronic equipment disclosed in the embodiments correspond to the method disclosed in the embodiments, their description is relatively brief; for relevant details, refer to the description of the method part.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A voice wakeup processing method, the method comprising:
acquiring audio frame characteristics of input voice information;
inputting the audio frame characteristics into an acoustic model for processing to obtain the posterior probability of the target audio frame characteristics corresponding to each syllable of the preset awakening words;
carrying out double confidence degree judgment on the posterior probability of the target audio frame characteristic corresponding to each syllable to obtain a first confidence degree score and a second confidence degree score of the corresponding syllable;
obtaining a verification audio frame characteristic in the audio frame characteristics of the voice information by using a passed judgment result in the first confidence score and the second confidence score;
obtaining a confidence coefficient checking result of the checking audio frame characteristics, wherein the confidence coefficient checking result is obtained by performing secondary confidence coefficient judgment on the checking audio frame characteristics;
and if the confidence verification result passes, responding to the instruction corresponding to the preset awakening word, and controlling the electronic equipment to execute preset operation.
2. The method according to claim 1, wherein performing the double confidence decision on the posterior probability of the target audio frame feature corresponding to each syllable to obtain the first confidence score and the second confidence score of the corresponding syllable comprises:
according to a first confidence coefficient judgment rule, performing confidence coefficient calculation on the posterior probability of the target audio frame characteristics corresponding to each syllable to obtain a first confidence coefficient score of the corresponding syllable;
according to a second confidence coefficient judgment rule, performing confidence coefficient calculation on the posterior probability of the target audio frame characteristics corresponding to each syllable to obtain a second confidence coefficient score of the corresponding syllable;
the first confidence coefficient judgment rule and the second confidence coefficient judgment rule have different judgment window sizes and confidence coefficient judgment thresholds, and the judgment windows are used for determining the time length of the target audio frame features for performing confidence coefficient calculation.
3. The method according to claim 2, wherein the obtaining the verified audio frame feature of the audio frame features of the speech information using the passed decision result of the first confidence score and the second confidence score comprises:
judging the first confidence score by using a first confidence judgment threshold to obtain a first judgment result, and judging the second confidence score by using a second confidence judgment threshold to obtain a second judgment result;
and if the first judgment result or the second judgment result passes, acquiring verification audio frame characteristics matched with the size of a judgment window corresponding to the passing judgment result from the audio frame characteristics of the voice information.
4. The method according to any one of claims 1 to 3, wherein the obtaining of the confidence verification result of the verified audio frame feature comprises:
performing confidence check on the verified audio frame features by using a check model corresponding to the passed judgment result to obtain a confidence check result of the verified audio frame features;
and aiming at different confidence coefficient judgment rules, corresponding verification models are configured, wherein the verification models are obtained by training voice samples of users of types corresponding to the corresponding confidence coefficient judgment rules.
5. The method according to claim 4, wherein performing confidence check on the verified audio frame feature by using a check model corresponding to the passed decision result to obtain a confidence check result of the verified audio frame feature comprises:
sending a voice confidence check request to a server, wherein the voice confidence check request carries the check audio frame characteristics and the user type identification corresponding to the check audio frame characteristics;
and receiving a confidence coefficient checking result of the checking audio frame characteristics fed back by the server, wherein the confidence coefficient checking result is obtained by performing confidence coefficient checking on the checking audio frame characteristics by the server in response to the voice confidence coefficient checking request and by using a checking model corresponding to the user type identification.
6. The method according to any one of claims 1 to 3, wherein the obtaining of the audio frame characteristics of the input speech information comprises:
acquiring voice information input aiming at the electronic equipment;
and extracting the characteristics of the voice information to obtain the audio frame characteristics of each audio frame forming the voice information, and caching the obtained audio frame characteristics.
7. A voice wake-up processing apparatus, the apparatus comprising:
the characteristic acquisition module is used for acquiring the audio frame characteristics of the input voice information;
the posterior probability acquisition module is used for inputting the audio frame characteristics into an acoustic model for processing to obtain the posterior probability of the target audio frame characteristics corresponding to each syllable of the preset awakening word;
the confidence coefficient judging module is used for carrying out double confidence coefficient judgment on the posterior probability of the target audio frame characteristic corresponding to each syllable to obtain a first confidence coefficient score and a second confidence coefficient score of the corresponding syllable;
a verification feature obtaining module, configured to obtain a verification audio frame feature in the audio frame features of the speech information by using a passed determination result in the first confidence score and the second confidence score;
a confidence check result obtaining module, configured to obtain a confidence check result of the verified audio frame feature, where the confidence check result is obtained by performing secondary confidence decision on the verified audio frame feature;
and the voice awakening module is used for responding to the instruction corresponding to the preset awakening word and controlling the electronic equipment to execute preset operation if the confidence coefficient verification result passes.
8. The apparatus of claim 7, wherein the confidence check result obtaining module comprises:
the confidence coefficient checking unit is used for carrying out confidence coefficient checking on the checking audio frame characteristics by using a checking model corresponding to the passed judgment result to obtain a confidence coefficient checking result of the checking audio frame characteristics;
and aiming at different confidence coefficient judgment rules, corresponding verification models are configured, wherein the verification models are obtained by training voice samples of users of types corresponding to the corresponding confidence coefficient judgment rules.
9. A storage medium having stored thereon a computer program, characterized in that, when the computer program is executed by a processor, the steps of the voice wakeup processing method according to any one of claims 1-6 are implemented.
10. An electronic device, characterized in that the electronic device comprises:
the voice collector is used for collecting voice information output by a user;
a communication interface;
a memory for storing a program for implementing the voice wakeup processing method according to any one of claims 1 to 6;
a processor for loading and executing the program stored in the memory to implement the steps of the voice wakeup processing method according to any one of claims 1 to 6.
CN201910828451.7A 2019-09-03 2019-09-03 Voice wake-up processing method and device, storage medium and electronic equipment Active CN110534099B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910828451.7A CN110534099B (en) 2019-09-03 2019-09-03 Voice wake-up processing method and device, storage medium and electronic equipment


Publications (2)

Publication Number Publication Date
CN110534099A CN110534099A (en) 2019-12-03
CN110534099B true CN110534099B (en) 2021-12-14

Family

ID=68666681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910828451.7A Active CN110534099B (en) 2019-09-03 2019-09-03 Voice wake-up processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110534099B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110910884B (en) * 2019-12-04 2022-03-22 北京搜狗科技发展有限公司 Wake-up detection method, device and medium
CN110910885B (en) * 2019-12-12 2022-05-27 思必驰科技股份有限公司 Voice wake-up method and device based on decoding network
CN111161728B (en) * 2019-12-26 2022-08-30 珠海格力电器股份有限公司 Awakening method, awakening device, awakening equipment and awakening medium of intelligent equipment
CN111312222B (en) * 2020-02-13 2023-09-12 北京声智科技有限公司 Awakening and voice recognition model training method and device
CN111583927A (en) * 2020-05-08 2020-08-25 安创生态科技(深圳)有限公司 Data processing method and device for multi-channel I2S voice awakening low-power-consumption circuit
CN111667818B (en) * 2020-05-27 2023-10-10 北京声智科技有限公司 Method and device for training wake-up model
CN111833867B (en) * 2020-06-08 2023-12-05 北京嘀嘀无限科技发展有限公司 Voice instruction recognition method and device, readable storage medium and electronic equipment
CN111986659A (en) * 2020-07-16 2020-11-24 百度在线网络技术(北京)有限公司 Method and device for establishing audio generation model
CN112543390B (en) * 2020-11-25 2023-03-24 南阳理工学院 Intelligent infant sound box and interaction method thereof
CN112951211B (en) * 2021-04-22 2022-10-18 中国科学院声学研究所 Voice awakening method and device
CN113241059B (en) * 2021-04-27 2022-11-08 标贝(北京)科技有限公司 Voice wake-up method, device, equipment and storage medium
CN113539266A (en) * 2021-07-13 2021-10-22 盛景智能科技(嘉兴)有限公司 Command word recognition method and device, electronic equipment and storage medium
CN113782016B (en) * 2021-08-06 2023-05-05 佛山市顺德区美的电子科技有限公司 Wakeup processing method, wakeup processing device, equipment and computer storage medium
CN115132195B (en) * 2022-05-12 2024-03-12 腾讯科技(深圳)有限公司 Voice wakeup method, device, equipment, storage medium and program product
CN115132197B (en) * 2022-05-27 2024-04-09 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment, program product and medium
CN115132198B (en) * 2022-05-27 2024-03-15 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment, program product and medium

Citations (3)

Publication number Priority date Publication date Assignee Title
US6125345A (en) * 1997-09-19 2000-09-26 At&T Corporation Method and apparatus for discriminative utterance verification using multiple confidence measures
EP2881939A1 (en) * 2013-12-09 2015-06-10 MediaTek, Inc System for speech keyword detection and associated method
CN108335696A (en) * 2018-02-09 2018-07-27 百度在线网络技术(北京)有限公司 Voice awakening method and device

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
US9245527B2 (en) * 2013-10-11 2016-01-26 Apple Inc. Speech recognition wake-up of a handheld portable electronic device
JP6531412B2 (en) * 2015-02-09 2019-06-19 沖電気工業株式会社 Target sound section detection apparatus and program, noise estimation apparatus and program, SNR estimation apparatus and program
FI128000B (en) * 2015-12-22 2019-07-15 Code Q Oy Speech recognition method and apparatus based on a wake-up word
CN106448663B (en) * 2016-10-17 2020-10-23 海信集团有限公司 Voice awakening method and voice interaction device
CN108447472B (en) * 2017-02-16 2022-04-05 腾讯科技(深圳)有限公司 Voice wake-up method and device
CN107134279B (en) * 2017-06-30 2020-06-19 百度在线网络技术(北京)有限公司 Voice awakening method, device, terminal and storage medium
CN107507612B (en) * 2017-06-30 2020-08-28 百度在线网络技术(北京)有限公司 Voiceprint recognition method and device
US10482878B2 (en) * 2017-11-29 2019-11-19 Nuance Communications, Inc. System and method for speech enhancement in multisource environments
CN109215647A (en) * 2018-08-30 2019-01-15 出门问问信息科技有限公司 Voice awakening method, electronic equipment and non-transient computer readable storage medium



Similar Documents

Publication Publication Date Title
CN110534099B (en) Voice wake-up processing method and device, storage medium and electronic equipment
WO2021093449A1 (en) Wakeup word detection method and apparatus employing artificial intelligence, device, and medium
CN108520741B (en) Method, device and equipment for restoring ear voice and readable storage medium
US10332507B2 (en) Method and device for waking up via speech based on artificial intelligence
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
CN108320733B (en) Voice data processing method and device, storage medium and electronic equipment
CN110310623B (en) Sample generation method, model training method, device, medium, and electronic apparatus
CN110570873B (en) Voiceprint wake-up method and device, computer equipment and storage medium
CN106875936B (en) Voice recognition method and device
CN110972112B (en) Subway running direction determining method, device, terminal and storage medium
CN111667818A (en) Method and device for training awakening model
CN108711429A (en) Electronic equipment and apparatus control method
CN111508493B (en) Voice wake-up method and device, electronic equipment and storage medium
CN114038457B (en) Method, electronic device, storage medium, and program for voice wakeup
TW202018696A (en) Voice recognition method and device and computing device
CN113314119B (en) Voice recognition intelligent household control method and device
CN112259089A (en) Voice recognition method and device
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
US20240013784A1 (en) Speaker recognition adaptation
US10866948B2 (en) Address book management apparatus using speech recognition, vehicle, system and method thereof
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN111883121A (en) Awakening method and device and electronic equipment
CN110728993A (en) Voice change identification method and electronic equipment
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN111048068A (en) Voice wake-up method, device and system and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant