CN113488050B - Voice wakeup method and device, storage medium and electronic equipment


Info

Publication number
CN113488050B
CN113488050B (application CN202110778555.9A)
Authority
CN
China
Prior art keywords
wake
word
threshold
voice
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110778555.9A
Other languages
Chinese (zh)
Other versions
CN113488050A (en)
Inventor
李亚伟
姚海涛
田垚
蔡猛
马泽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110778555.9A
Publication of CN113488050A
Application granted
Publication of CN113488050B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Abstract

The disclosure relates to a voice wake-up method and apparatus, a storage medium, and an electronic device, which design a higher threshold for common wake-up words to reduce false wake-ups and set a lower threshold for rare wake-up words to improve the wake-up rate. The method comprises: acquiring a voice wake-up word for which a threshold is to be set; determining a word score of the voice wake-up word through a language model, wherein the word score characterizes the occurrence probability of the voice wake-up word; and determining a target wake-up threshold corresponding to the voice wake-up word according to the word score of the voice wake-up word and a preset correspondence between wake-up thresholds and word scores, wherein the wake-up threshold in the preset correspondence is positively correlated with the word score, and the target wake-up threshold is used, during the voice wake-up process, for comparison with the collected target voice wake-up word so as to determine the voice wake-up result corresponding to the target voice wake-up word.

Description

Voice wakeup method and device, storage medium and electronic equipment
Technical Field
The disclosure relates to the technical field of voice, and in particular relates to a voice awakening method, a voice awakening device, a storage medium and electronic equipment.
Background
Voice wake-up technology presets wake-up words in an electronic device or in software; when a user utters a voice instruction corresponding to a wake-up word, the electronic device can be woken from a dormant state and respond in a specified manner. Specifically, each preset wake-up word has a corresponding wake-up threshold. After the user utters a voice instruction, a word score corresponding to the voice instruction is determined; if the word score is greater than or equal to the wake-up threshold, the electronic device is woken from the dormant state and responds in the specified manner. If the word score is less than the wake-up threshold, the electronic device is not woken.
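By way of illustration only, the following Python sketch shows the basic wake-up decision described above; the score values and the threshold used in the example are hypothetical and do not come from this disclosure.

```python
# Minimal sketch of the wake-up decision (illustrative only): the wake-up model's
# score for the captured utterance is compared against the threshold configured
# for the wake-up word.

def should_wake(utterance_score: float, wake_threshold: float) -> bool:
    """Wake the device only if the score reaches the wake-up word's threshold."""
    return utterance_score >= wake_threshold

# Hypothetical values: with a threshold of 0.6, a score of 0.55 does not wake the device.
print(should_wake(0.55, 0.6))  # False
print(should_wake(0.72, 0.6))  # True
```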
In the related art, a single uniform wake-up threshold is generally set for all wake-up words. As a result, some wake-up words have a low wake-up rate while others produce many false wake-ups, which affects the accuracy of voice wake-up.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
In a first aspect, the present disclosure provides a voice wake-up method, the method comprising:
acquiring a voice wake-up word of which the threshold value is to be set;
determining word scores of the voice wake words through a language model, wherein the word scores are used for representing the occurrence probability of the voice wake words;
determining a target wake-up threshold corresponding to the voice wake-up word according to the word score of the voice wake-up word and a preset corresponding relation between the wake-up threshold and the word score, wherein the wake-up threshold in the preset corresponding relation is positively related to the word score, and the target wake-up threshold is used for comparing the target wake-up threshold with the collected target voice wake-up word in the voice wake-up process so as to determine a voice wake-up result corresponding to the target voice wake-up word.
In a second aspect, the present disclosure provides a voice wake apparatus, the apparatus comprising:
the acquisition module is used for acquiring voice wake-up words of the threshold to be set;
the first determining module is used for determining word scores of the voice wake words through a language model, wherein the word scores are used for representing the occurrence probability of the voice wake words;
the second determining module is configured to determine a target wake-up threshold corresponding to the voice wake-up word according to the word score of the voice wake-up word and a preset corresponding relation between the wake-up threshold and the word score, where the wake-up threshold in the preset corresponding relation is positively related to the word score, and the target wake-up threshold is used for comparing with the collected target voice wake-up word in the voice wake-up process to determine a voice wake-up result corresponding to the target voice wake-up word.
In a third aspect, the present disclosure provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processing device, implements the steps of the method described in the first aspect.
In a fourth aspect, the present disclosure provides an electronic device comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method described in the first aspect.
Through the above technical solution, the target wake-up threshold corresponding to a voice wake-up word can be determined according to the word score of the voice wake-up word and the preset correspondence between wake-up thresholds and word scores. Moreover, because the wake-up threshold in the preset correspondence is positively correlated with the word score, the higher the word score, i.e. the more common the voice wake-up word, the higher the corresponding wake-up threshold, so that false wake-ups of that wake-up word can be reduced. Conversely, the lower the word score, i.e. the rarer the voice wake-up word, the lower the corresponding wake-up threshold, so that the wake-up rate of that wake-up word can be improved. This solves the problem of a low wake-up rate or frequent false wake-ups in the related art and improves the accuracy of voice wake-up.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow chart illustrating a method of voice wakeup according to an example embodiment of the present disclosure;
FIG. 2 is a block diagram of a voice wake apparatus, shown in accordance with an exemplary embodiment of the present disclosure;
fig. 3 is a block diagram of an electronic device, according to an exemplary embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and completely. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first", "second", and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order of, or interdependence between, the functions performed by these devices, modules, or units. It is further noted that references to "a" or "a plurality" in this disclosure are illustrative rather than limiting, and those of ordinary skill in the art will appreciate that "a" should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
As described in the background, the related art generally sets a uniform wake-up threshold for all wake-up words. However, the inventors have found that some wake-up words are relatively common and therefore prone to false wake-up, so they need a higher threshold, while other wake-up words are relatively rare and therefore harder to wake, so they need a lower threshold. Setting a single threshold for different wake-up words in the related art therefore leads to a low wake-up rate for some wake-up words, or to frequent false wake-ups for others, which affects the accuracy of voice wake-up.
Through extensive data analysis, the inventors found that the appropriate threshold of a wake-up word is strongly correlated with the word score obtained when the wake-up word is input into a language model. The present disclosure therefore proposes a new threshold-setting scheme: a higher threshold is designed for wake-up words that are easy to wake (and thus prone to false wake-up), so as to reduce false wake-ups, and a lower threshold is set for wake-up words that are difficult to wake, so as to improve the wake-up rate.
Fig. 1 is a flow chart illustrating a voice wakeup method according to an example embodiment of the present disclosure. Referring to fig. 1, the voice wakeup method includes:
step 101, obtaining a voice wake-up word of which the threshold value is to be set;
step 102, determining word scores of the voice wake words through a language model, wherein the word scores are used for representing occurrence probability of the voice wake words;
step 103, determining a target wake-up threshold corresponding to the voice wake-up word according to the word score of the voice wake-up word and a preset correspondence between wake-up thresholds and word scores, wherein the wake-up threshold in the preset correspondence is positively correlated with the word score. The target wake-up threshold is used, during the voice wake-up process, for comparison with the collected target voice wake-up word, so as to determine the voice wake-up result corresponding to the target voice wake-up word.
For example, the voice wake-up word for which a threshold is to be set may be a word or phrase customized by the user for waking the electronic device to perform an operation, such as "navigation" or "open navigation", which embodiments of the present disclosure do not limit. The word score characterizes the occurrence probability of the voice wake-up word, i.e. how common it is. The word score of a voice wake-up word is positively correlated with its occurrence probability: the higher the word score, the more common the voice wake-up word; conversely, the lower the word score, the rarer the voice wake-up word. In practical applications, the score of a voice wake-up word may be determined by an N-gram language model (N being a positive integer), which embodiments of the present disclosure do not limit.
After the word score of the voice wake-up word is determined, the target wake-up threshold corresponding to the voice wake-up word is determined according to the word score and the preset correspondence between wake-up thresholds and word scores. Moreover, because the wake-up threshold in the preset correspondence is positively correlated with the word score, the higher the word score, i.e. the more common the voice wake-up word, the higher the corresponding wake-up threshold, so that false wake-ups of that wake-up word can be reduced. Conversely, the lower the word score, i.e. the rarer the voice wake-up word, the lower the corresponding wake-up threshold, so that the wake-up rate of that wake-up word can be improved. This solves the problem of a low wake-up rate or frequent false wake-ups in the related art and improves the accuracy of voice wake-up.
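By way of illustration only, the following sketch shows how a target wake-up threshold could be read off from a preset correspondence of the positively correlated linear form described later (threshold = a * word_score + b); the coefficients and the clipping range are assumptions for illustration, not values given in this disclosure.

```python
# Illustrative sketch: map a wake-up word's language-model score to its target
# wake-up threshold using an already-fitted linear correspondence (a, b and the
# clipping range are assumed values, not taken from this disclosure).

def target_wake_threshold(word_score: float, a: float, b: float,
                          lo: float = 0.0, hi: float = 1.0) -> float:
    """Higher word score (more common wake-up word) -> higher wake-up threshold."""
    threshold = a * word_score + b
    return min(max(threshold, lo), hi)

# A common (high-score) wake-up word receives a higher threshold than a rare one.
print(target_wake_threshold(word_score=0.8, a=0.5, b=0.3))  # higher threshold (about 0.7)
print(target_wake_threshold(word_score=0.2, a=0.5, b=0.3))  # lower threshold (about 0.4)
```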
In order to enable those skilled in the art to better understand the voice wake-up method provided in the present disclosure, each of the above steps is described in detail below.
Illustratively, the language model may be an N-gram language model, such as a 1-gram or 2-gram language model, which embodiments of the present disclosure do not limit. A 1-gram language model splits the input text or word into individual tokens, and the word score of each token is independent of the other tokens. In a 2-gram language model, after the input text or word is split into tokens, the word score of each token depends on the token that precedes it.
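By way of illustration only, the following sketch scores a phrase with a simple count-based 2-gram model using add-one smoothing; this is an illustrative stand-in rather than the specific language model of this disclosure, and the toy corpus is an assumption.

```python
# Illustrative count-based 2-gram scorer with add-one smoothing (a stand-in,
# not the specific language model of this disclosure).
from collections import Counter

def train_bigram(corpus_tokens):
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab = len(unigrams)

    def score(phrase_tokens):
        # P(w1) * P(w2 | w1) * ..., each factor smoothed with add-one counts.
        prob = (unigrams[phrase_tokens[0]] + 1) / (len(corpus_tokens) + vocab)
        for prev, cur in zip(phrase_tokens, phrase_tokens[1:]):
            prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
        return prob

    return score

score = train_bigram("open the navigation please open the door".split())
print(score(["open", "the"]))    # frequent bigram -> higher score (more common phrase)
print(score(["the", "please"]))  # unseen bigram -> lower score (rarer phrase)
```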
In one possible manner, if the voice wake-up model is an end-to-end speech recognition RNN-T model, the language model may correspondingly be a 2-gram language model, which determines the word score of a token according to the single token that precedes it. It should be appreciated that the RNN-T model models a correlation between two adjacent items of data during processing, so a 2-gram language model is correspondingly used to determine the word score of each token from the single token preceding it. The language model thus matches the voice wake-up model, which improves the accuracy of the resulting wake-up threshold and, in turn, the accuracy of voice wake-up.
For example, the language model may be trained on sample text, and the word score of a voice wake-up word may then characterize the probability of the voice wake-up word occurring in that sample text. In one possible manner, a first sample text used to train the voice wake-up model may be obtained, and the language model may be trained on the first sample text. That is, the language model can be trained on the same text data used to train the voice wake-up model, which improves the match between the voice wake-up model and the language model and thus the accuracy of voice wake-up.
It should be understood that if a voice wake-up word occurs frequently in the first sample text used to train the voice wake-up model, then after the language model is trained on the first sample text it will assign that voice wake-up word a high word score, indicating that the word is common; a high wake-up threshold can then be determined for it subsequently, reducing its false wake-ups. Conversely, if the voice wake-up word occurs rarely in the first sample text, the trained language model will assign it a low word score, indicating that the word is rare; a low wake-up threshold can then be determined for it subsequently, improving its wake-up rate.
In a possible manner, the preset correspondence between the wake-up threshold and the word score may be obtained by: determining sample word scores corresponding to the words in the second sample text through the language model, determining sample wake-up thresholds corresponding to the words in the second sample text, and performing data fitting according to the sample word scores corresponding to the words in the second sample text and the sample wake-up thresholds to obtain preset corresponding relations between the wake-up thresholds and the word scores.
The second sample text may be the same as or different from the first sample text, which embodiments of the present disclosure do not limit. A plurality of word segments can be selected from the second sample text, and the sample word scores corresponding to these word segments are determined through the language model. A sample wake-up threshold corresponding to each of these word segments may also be determined, where the sample wake-up threshold may be determined by analyzing a large amount of data.
In one possible manner, the sample wake-up threshold corresponding to each word segment in the second sample text may be determined as follows: for a target word segment in the second sample text (the target word segment being any word segment in the sample text), test corpus that does not contain the target word segment is input into the voice wake-up model over a preset time period; the false wake-up rate corresponding to each of a plurality of candidate wake-up thresholds of the voice wake-up model over the preset time period is determined; and the candidate wake-up threshold at which the false wake-up rate reaches a preset false wake-up rate is determined as the sample wake-up threshold corresponding to the target word segment.
For example, the preset time period may be set according to the actual situation, for example 100 hours or 200 hours, which embodiments of the present disclosure do not limit. The candidate wake-up thresholds may be a plurality of values within a preset threshold range; for example, with the threshold range set to 0 to 1 and the step size set to 0.1, the candidate wake-up thresholds take the values 0, 0.1, 0.2, ..., 1 in turn. The preset threshold range and the step size may likewise be set according to the actual situation, which embodiments of the present disclosure do not limit. In addition, the false wake-up rate is the number of false wake-ups produced by the test corpus per unit time, and the preset false wake-up rate is the false wake-up rate expected in practical applications, which may be set according to the actual situation and is not limited by embodiments of the present disclosure.
In the embodiment of the disclosure, test corpus that does not contain the target word segment is input into the voice wake-up model over the preset time period, and the false wake-up rate corresponding to each candidate wake-up threshold over that period is then determined; if the false wake-up rate corresponding to a candidate wake-up threshold reaches the preset false wake-up rate, that candidate wake-up threshold is taken as the sample wake-up threshold. In this way, the sample wake-up threshold corresponding to each word segment in the second sample text is obtained from the false wake-up rate. Because each sample wake-up threshold meets the expected false wake-up rate, fitting the correspondence between wake-up thresholds and word scores from these sample wake-up thresholds can reduce false wake-ups and improve the accuracy of threshold setting.
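By way of illustration only, the following sketch sweeps candidate wake-up thresholds against scores produced by test corpus that does not contain the target word segment and keeps the first threshold whose false wake-up rate meets the preset rate; the scores, the test duration, and the preset rate are assumptions for illustration.

```python
# Illustrative threshold sweep: pick the sample wake-up threshold for one target
# word segment from the false wake-up rate of non-target test corpus (all values assumed).

def sample_wake_threshold(false_scores, test_hours, preset_false_rate,
                          candidates=None):
    """false_scores: wake-up model scores for test corpus without the target word.
    Returns the first candidate threshold whose false wake-ups per hour do not
    exceed the preset false wake-up rate."""
    if candidates is None:
        candidates = [round(0.1 * i, 1) for i in range(11)]  # 0.0, 0.1, ..., 1.0
    for threshold in candidates:
        false_wakes = sum(score >= threshold for score in false_scores)
        if false_wakes / test_hours <= preset_false_rate:
            return threshold
    return candidates[-1]

# Hypothetical example: 100 hours of non-target audio, at most one false wake per 100 hours.
scores = [0.35, 0.42, 0.61, 0.58, 0.72, 0.15, 0.49]
print(sample_wake_threshold(scores, test_hours=100, preset_false_rate=0.01))  # 0.7
```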
After determining the sample word score and the sample wake-up threshold corresponding to each word in the second sample text, performing data fitting according to the sample word score and the sample wake-up threshold corresponding to each word in the second sample text to obtain a preset corresponding relation between the wake-up threshold and the word score. In a possible manner, sample word scores corresponding to the words in the second sample text are used as independent variables, and sample wake-up thresholds corresponding to the words in the second sample text are used as dependent variables to perform linear fitting, so that a functional relation for representing the corresponding relation between the wake-up thresholds and the word scores is obtained.
For example, the sample word score corresponding to each word segment in the second sample text may be used as the independent variable and the corresponding sample wake-up threshold as the dependent variable, and a linear fit may be performed using the linear equation y = ax + b, where y denotes the sample wake-up threshold corresponding to each word segment in the second sample text, x denotes the corresponding sample word score, and a and b are the coefficients to be fitted. This yields a functional relation representing the correspondence between wake-up thresholds and word scores, and the target wake-up threshold corresponding to a voice wake-up word can then be determined from this functional relation and the word score of the voice wake-up word. Moreover, because the wake-up threshold in this functional relation is positively correlated with the word score, the higher the word score, i.e. the more common the voice wake-up word, the higher the corresponding wake-up threshold, so that false wake-ups of that wake-up word can be reduced. Conversely, the lower the word score, i.e. the rarer the voice wake-up word, the lower the corresponding wake-up threshold, so that the wake-up rate of that wake-up word can be improved. This solves the problem of a low wake-up rate or frequent false wake-ups in the related art and improves the accuracy of voice wake-up.
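By way of illustration only, the following sketch fits the coefficients a and b of y = ax + b with numpy from a handful of assumed (sample word score, sample wake-up threshold) pairs and then applies the fitted relation to a new wake-up word; all numeric values are assumptions for illustration.

```python
# Illustrative linear fit of the preset correspondence threshold = a * word_score + b
# from assumed (sample word score, sample wake-up threshold) pairs.
import numpy as np

sample_word_scores = np.array([0.10, 0.25, 0.40, 0.60, 0.85])  # x: language-model scores
sample_thresholds  = np.array([0.35, 0.42, 0.50, 0.61, 0.72])  # y: swept wake-up thresholds

a, b = np.polyfit(sample_word_scores, sample_thresholds, deg=1)
print(f"threshold = {a:.3f} * word_score + {b:.3f}")

# The fitted relation serves as the preset correspondence: plug in a new wake-up
# word's score to obtain its target wake-up threshold.
new_word_score = 0.70
print(a * new_word_score + b)
```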
Based on the same inventive concept, the embodiments of the present disclosure also provide a voice wake-up device, which may be part or all of an electronic device by means of software, hardware or a combination of both. Referring to fig. 2, the voice wakeup apparatus 200 includes:
an obtaining module 201, configured to obtain a voice wake-up word for which a threshold is to be set;
a first determining module 202, configured to determine, by using a language model, a word score of the voice wake word, where the word score is used to characterize an occurrence probability of the voice wake word;
the second determining module 203 is configured to determine a target wake-up threshold corresponding to the voice wake-up word according to a word score of the voice wake-up word and a preset correspondence between wake-up thresholds and word scores, where the wake-up thresholds in the preset correspondence are positively related to the word scores, and the target wake-up threshold is used for comparing with the collected target voice wake-up word in a voice wake-up process to determine a voice wake-up result corresponding to the target voice wake-up word.
Optionally, the apparatus 200 further comprises a training module for obtaining a first sample text for training a voice wake model, and training the language model according to the first sample text.
Optionally, the apparatus 200 further includes a data fitting module for:
determining a sample word score corresponding to each word in a second sample text through the language model, and determining a sample wake-up threshold corresponding to each word in the second sample text;
and performing data fitting according to the sample word score and the sample wake-up threshold corresponding to each word in the second sample text to obtain a preset corresponding relation between the wake-up threshold and the word score.
Optionally, the data fitting module is configured to:
and taking the sample word score corresponding to each word in the second sample text as an independent variable, and taking a sample wake-up threshold corresponding to each word in the second sample text as a dependent variable to perform linear fitting so as to obtain a functional relation used for representing the corresponding relation between the wake-up threshold and the word score.
Optionally, the data fitting module is configured to:
aiming at target word segmentation in the second sample text, inputting test corpus except the target word segmentation to a voice wake-up model in a preset time period, and determining a false wake-up rate corresponding to each candidate wake-up threshold in the preset time period according to a plurality of candidate wake-up thresholds corresponding to the voice wake-up model, wherein the target word segmentation is any word segmentation in the sample text;
and determining a candidate wake-up threshold value when the false wake-up rate reaches a preset false wake-up rate as a sample wake-up threshold value corresponding to the target word segmentation.
Optionally, the voice wake-up model includes an end-to-end voice recognition RNN-T model, and the language model is a 2-gram language model, and the 2-gram language model is used for determining a word score corresponding to a word according to a single word before the word segmentation.
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method and will not be elaborated here.
Based on the same inventive concept, the embodiments of the present disclosure also provide a non-transitory computer-readable storage medium, on which a computer program is stored, which when executed by a processing device, implements the steps of any of the above-described voice wakeup methods.
Based on the same inventive concept, the embodiments of the present disclosure further provide an electronic device, including:
a storage device having a computer program stored thereon;
and the processing device is used for executing the computer program in the storage device so as to realize the steps of any voice awakening method.
Referring now to fig. 3, a schematic diagram of an electronic device 300 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 3 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various suitable actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 3 shows an electronic device 300 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via a communication device 309, or installed from a storage device 308, or installed from a ROM 302. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing means 301.
It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, communication may use any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquiring a voice wake-up word of which the threshold value is to be set; determining word scores of the voice wake words through a language model, wherein the word scores are used for representing the occurrence probability of the voice wake words; determining a target wake-up threshold corresponding to the voice wake-up word according to the word score of the voice wake-up word and a preset corresponding relation between the wake-up threshold and the word score, wherein the wake-up threshold in the preset corresponding relation is positively related to the word score, and the target wake-up threshold is used for comparing the target wake-up threshold with the collected target voice wake-up word in the voice wake-up process so as to determine a voice wake-up result corresponding to the target voice wake-up word.
Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a module does not constitute a limitation on the module itself.
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In accordance with one or more embodiments of the present disclosure, example 1 provides a voice wake method, comprising:
acquiring a voice wake-up word of which the threshold value is to be set;
determining word scores of the voice wake words through a language model, wherein the word scores are used for representing the occurrence probability of the voice wake words;
determining a target wake-up threshold corresponding to the voice wake-up word according to the word score of the voice wake-up word and a preset corresponding relation between the wake-up threshold and the word score, wherein the wake-up threshold in the preset corresponding relation is positively related to the word score, and the target wake-up threshold is used for comparing the target wake-up threshold with the collected target voice wake-up word in the voice wake-up process so as to determine a voice wake-up result corresponding to the target voice wake-up word.
Example 2 provides the method of example 1, according to one or more embodiments of the present disclosure, the language model being trained by:
a first sample text for training a voice wake model is obtained, and the language model is trained according to the first sample text.
In accordance with one or more embodiments of the present disclosure, example 3 provides the method of example 1 or 2, wherein the preset correspondence between the arousal threshold and the word score is obtained by:
determining a sample word score corresponding to each word in a second sample text through the language model, and determining a sample wake-up threshold corresponding to each word in the second sample text;
and performing data fitting according to the sample word score and the sample wake-up threshold corresponding to each word in the second sample text to obtain a preset corresponding relation between the wake-up threshold and the word score.
According to one or more embodiments of the present disclosure, example 4 provides the method of example 3, wherein the performing data fitting according to the sample word score and the sample wake-up threshold corresponding to each word in the second sample text, to obtain a preset correspondence between the wake-up threshold and the word score includes:
and taking the sample word score corresponding to each word in the second sample text as an independent variable, and taking a sample wake-up threshold corresponding to each word in the second sample text as a dependent variable to perform linear fitting so as to obtain a functional relation used for representing the corresponding relation between the wake-up threshold and the word score.
According to one or more embodiments of the present disclosure, example 5 provides the method of example 3, the determining a sample wake-up threshold corresponding to each word segment in the second sample text includes:
aiming at target word segmentation in the second sample text, inputting test corpus except the target word segmentation to a voice wake-up model in a preset time period, and determining a false wake-up rate corresponding to each candidate wake-up threshold in the preset time period according to a plurality of candidate wake-up thresholds corresponding to the voice wake-up model, wherein the target word segmentation is any word segmentation in the sample text;
and determining a candidate wake-up threshold value when the false wake-up rate reaches a preset false wake-up rate as a sample wake-up threshold value corresponding to the target word segmentation.
In accordance with one or more embodiments of the present disclosure, example 6 provides the method of example 1 or 2, the voice wake model comprising an end-to-end voice recognition RNN-T model, the language model being a 2gram language model, the 2gram language model being configured to determine a word score corresponding to a word from a single word preceding the word.
According to one or more embodiments of the present disclosure, example 7 provides a voice wake apparatus, the apparatus comprising:
the acquisition module is used for acquiring voice wake-up words of the threshold to be set;
the first determining module is used for determining word scores of the voice wake words through a language model, wherein the word scores are used for representing the occurrence probability of the voice wake words;
the second determining module is configured to determine a target wake-up threshold corresponding to the voice wake-up word according to the word score of the voice wake-up word and a preset corresponding relation between the wake-up threshold and the word score, where the wake-up threshold in the preset corresponding relation is positively related to the word score, and the target wake-up threshold is used for comparing with the collected target voice wake-up word in the voice wake-up process to determine a voice wake-up result corresponding to the target voice wake-up word.
Example 8 provides the apparatus of example 7, according to one or more embodiments of the disclosure, further comprising:
and the training module is used for acquiring target sample text for training the voice wake model and taking the target sample text as a first sample text for training the language model.
According to one or more embodiments of the present disclosure, example 9 provides a non-transitory computer-readable storage medium having stored thereon a computer program that, when executed by a processing device, performs the steps of the method of any of examples 1-6.
In accordance with one or more embodiments of the present disclosure, example 10 provides an electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method of any one of examples 1-6.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the technical principles employed. It will be appreciated by persons skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the specific combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions in which the above features are replaced by technical features with similar functions disclosed in the present disclosure (but not limited thereto).
Moreover, although operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are example forms of implementing the claims.

Claims (9)

1. A method of waking up speech, the method comprising:
acquiring a voice wake-up word of which the threshold value is to be set;
determining word scores of the voice wake words through a language model, wherein the word scores are used for representing the occurrence probability of the voice wake words;
determining a target wake-up threshold corresponding to the voice wake-up word according to the word score of the voice wake-up word and a preset corresponding relation between the wake-up threshold and the word score, wherein the wake-up threshold in the preset corresponding relation is positively correlated with the word score, and the target wake-up threshold is used for comparing the target wake-up word with the collected target voice wake-up word in the voice wake-up process so as to determine a voice wake-up result corresponding to the target voice wake-up word;
the preset corresponding relation is obtained according to sample word scores corresponding to the segmentation words in the second sample text and a sample wake-up threshold; the sample wake-up threshold value corresponding to each word in the second sample text is obtained by the following method:
aiming at target word segmentation in the second sample text, inputting test corpus except the target word segmentation to a voice wake-up model in a preset time period, and determining a false wake-up rate corresponding to each candidate wake-up threshold in the preset time period according to a plurality of candidate wake-up thresholds corresponding to the voice wake-up model, wherein the target word segmentation is any word segmentation in the sample text;
and determining a candidate wake-up threshold value when the false wake-up rate reaches a preset false wake-up rate as a sample wake-up threshold value corresponding to the target word segmentation.
2. The method of claim 1, wherein the language model is trained by:
a first sample text for training a voice wake model is obtained, and the language model is trained according to the first sample text.
3. The method according to claim 1 or 2, wherein the preset correspondence is obtained by:
determining a sample word score corresponding to each word in a second sample text through the language model, and determining a sample wake-up threshold corresponding to each word in the second sample text;
and performing data fitting according to the sample word score and the sample wake-up threshold corresponding to each word in the second sample text to obtain a preset corresponding relation between the wake-up threshold and the word score.
4. The method of claim 3, wherein the performing data fitting according to the sample word score and the sample wake-up threshold corresponding to each word in the second sample text to obtain a preset correspondence between wake-up threshold and word score includes:
and taking the sample word score corresponding to each word in the second sample text as an independent variable, and taking a sample wake-up threshold corresponding to each word in the second sample text as a dependent variable to perform linear fitting so as to obtain a functional relation used for representing the corresponding relation between the wake-up threshold and the word score.
5. The method of claim 1 or 2, wherein the voice wake model comprises an end-to-end voice recognition RNN-T model, the language model being a 2gram language model, the 2gram language model being configured to determine a word score corresponding to a word from a word preceding the word.
6. A voice wakeup apparatus, the apparatus comprising:
the acquisition module is used for acquiring voice wake-up words of the threshold to be set;
the first determining module is used for determining word scores of the voice wake words through a language model, wherein the word scores are used for representing the occurrence probability of the voice wake words;
the second determining module is used for determining a target wake-up threshold corresponding to the voice wake-up word according to the word score of the voice wake-up word and a preset corresponding relation between the wake-up threshold and the word score, wherein the wake-up threshold in the preset corresponding relation is positively correlated with the word score, and the target wake-up threshold is used for comparing the target wake-up threshold with the collected target voice wake-up word in the voice wake-up process so as to determine a voice wake-up result corresponding to the target voice wake-up word;
the preset corresponding relation is obtained according to sample word scores corresponding to the segmentation words in the second sample text and a sample wake-up threshold;
the sample wake-up threshold value corresponding to each word in the second sample text is obtained by the following method:
aiming at target word segmentation in the second sample text, inputting test corpus except the target word segmentation to a voice wake-up model in a preset time period, and determining a false wake-up rate corresponding to each candidate wake-up threshold in the preset time period according to a plurality of candidate wake-up thresholds corresponding to the voice wake-up model, wherein the target word segmentation is any word segmentation in the sample text;
and determining a candidate wake-up threshold value when the false wake-up rate reaches a preset false wake-up rate as a sample wake-up threshold value corresponding to the target word segmentation.
7. The apparatus of claim 6, wherein the apparatus further comprises:
and the training module is used for acquiring target sample text for training the voice wake model and taking the target sample text as a first sample text for training the language model.
8. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processing device, implements the steps of the method according to any one of claims 1-5.
9. An electronic device, comprising:
a storage device having a computer program stored thereon;
processing means for executing said computer program in said storage means to carry out the steps of the method according to any one of claims 1-5.
CN202110778555.9A 2021-07-09 2021-07-09 Voice wakeup method and device, storage medium and electronic equipment Active CN113488050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110778555.9A CN113488050B (en) 2021-07-09 2021-07-09 Voice wakeup method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110778555.9A CN113488050B (en) 2021-07-09 2021-07-09 Voice wakeup method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113488050A CN113488050A (en) 2021-10-08
CN113488050B (en) 2024-03-26

Family

ID=77938552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110778555.9A Active CN113488050B (en) 2021-07-09 2021-07-09 Voice wakeup method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113488050B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030817B (en) * 2022-07-18 2023-09-19 荣耀终端有限公司 Voice wakeup method, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102420450B1 (en) * 2015-09-23 2022-07-14 삼성전자주식회사 Voice Recognition Apparatus, Voice Recognition Method of User Device and Computer Readable Recording Medium
US10878811B2 (en) * 2018-09-14 2020-12-29 Sonos, Inc. Networked devices, systems, and methods for intelligently deactivating wake-word engines

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107767861A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
CN107123417A (en) * 2017-05-16 2017-09-01 上海交通大学 Optimization method and system are waken up based on the customized voice that distinctive is trained
CN107610695A (en) * 2017-08-08 2018-01-19 问众智能信息科技(北京)有限公司 Driver's voice wakes up the dynamic adjusting method of instruction word weight
US10510340B1 (en) * 2017-12-05 2019-12-17 Amazon Technologies, Inc. Dynamic wakeword detection
CN108564951A (en) * 2018-03-02 2018-09-21 北京云知声信息技术有限公司 The method that intelligence reduces voice control device false wake-up probability
CN108847219A (en) * 2018-05-25 2018-11-20 四川斐讯全智信息技术有限公司 A kind of wake-up word presets confidence threshold value adjusting method and system
CN109741755A (en) * 2018-12-25 2019-05-10 苏州思必驰信息科技有限公司 Voice wakes up word threshold management device and manages the method that voice wakes up word threshold value
CN109753665A (en) * 2019-01-30 2019-05-14 北京声智科技有限公司 Wake up the update method and device of model
WO2020203275A1 (en) * 2019-03-29 2020-10-08 株式会社東芝 Threshold value adjustment device, threshold value adjustment method, and recording medium

Also Published As

Publication number Publication date
CN113488050A (en) 2021-10-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant