CN116312481A

CN116312481A - Cascade awakening method and device based on keyword recognition technology and storage medium

Info

Publication number: CN116312481A
Application number: CN202310132392.6A
Authority: CN
Inventors: 赵茂祥; 李全忠
Original assignee: Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Current assignee: Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority date: 2023-02-15
Filing date: 2023-02-15
Publication date: 2023-06-23

Abstract

The embodiment of the specification provides a cascade awakening method, a device and a storage medium based on a keyword recognition technology, wherein the method comprises the following steps: receiving externally input voice information, wherein the voice information comprises a voice sequence; identifying keyword nodes in the voice sequence; determining the probability that the voice information is a wake-up signal based on the keyword node and the voice sequence; and executing a wake-up system or carrying out voice recognition on the voice sequence according to the probability. The technical scheme provided by the application is used for solving the problem that the performance of the wake-up system is not matched with the power consumption and the computing capacity of the embedded equipment.

Description

Cascade awakening method and device based on keyword recognition technology and storage medium

Technical Field

The present document relates to the field of artificial intelligence, and in particular, to a method, an apparatus, and a storage medium for cascaded wake-up based on keyword recognition technology.

Background

The wake-up function is used in many intelligent devices because it offers low power consumption and quick response.

For embedded intelligent equipment, the intelligent equipment is characterized by small volume, low power consumption and limited computing capacity. Therefore, the embedded intelligent device has the following problems when implementing the wake-up function:

If the wake-up system recognizes speech with high accuracy, its performance may not match the power consumption and computing capacity of the embedded device. If the performance of the wake-up system does match the power consumption and computing capacity of the embedded device, the wake-up system recognizes speech less accurately.

Disclosure of Invention

In view of the above analysis, the present application aims to provide a cascade wake-up method, device and storage medium based on a keyword recognition technology, so as to ensure accuracy of a wake-up system in recognizing speech under the condition that performance of the wake-up system is matched with power consumption and computing capacity of an embedded intelligent device.

In a first aspect, one or more embodiments of the present disclosure provide a cascade wake-up method based on a keyword recognition technique, including:

receiving externally input voice information, wherein the voice information comprises a voice sequence;

identifying keyword nodes in the voice sequence;

determining the probability that the voice information is a wake-up signal based on the keyword node and the voice sequence;

and executing a wake-up system or carrying out voice recognition on the voice sequence according to the probability.

Further, the determining, based on the keyword node and the voice sequence, the probability that the voice information is a wake-up signal includes:

and sequentially performing sequence head-to-tail detection and node detection on the keyword nodes, and performing node time sequence detection on the voice sequence to obtain the probability that the voice information is a wake-up signal.

Further, the sequence head-tail detection for the keyword node includes:

detecting whether a first keyword node and a last keyword node in the voice sequence contain preset keywords or not;

and when the first keyword node and the last keyword node both contain preset keywords, executing the node detection.

Further, the node detection for the keyword node includes:

respectively determining sound data of each keyword node in the voice sequence;

respectively determining the similarity between the sound data of each keyword node and corresponding preset target sound data;

and when the similarity of the keyword nodes is larger than a preset value, executing the node time sequence detection.

Further, the node timing detection for the voice sequence includes:

determining sound data of the speech sequence;

and determining the probability that the sound data of the voice sequence is the overall sound data of the preset target.

Further, the performing a wake-up system or performing speech recognition on the speech sequence according to the probability includes:

executing a wake-up system when the probability reaches a first preset value;

and executing the voice recognition on the voice sequence when the probability does not reach the first preset value.

determining whether the probability reaches a first preset value;

when the probability reaches a first preset value, determining whether the probability reaches a second preset value;

executing a wake-up system when the second preset value is determined to be reached;

and executing the voice recognition on the voice sequence when the probability does not reach a second preset value.

In a second aspect, one or more embodiments of the present disclosure provide a cascaded wake-up device based on keyword recognition technology, including: the device comprises a receiving module, an identification determining module, a data processing module and an executing module;

the receiving module is used for receiving voice information input from outside, and the voice information comprises a voice sequence;

the recognition determining module is used for recognizing keyword nodes in the voice sequence;

the data processing module is used for determining the probability that the voice information is a wake-up signal based on the keyword node and the voice sequence;

the execution module is used for executing a wake-up system or carrying out voice recognition on the voice sequence according to the probability.

Further, the data processing module is configured to sequentially perform sequence head-to-tail detection and node detection on the keyword node, and perform node timing detection on the voice sequence to obtain a probability that the voice information is a wake-up signal.

In a third aspect, one or more embodiments of the present specification provide a storage medium for storing computer-executable instructions which, when executed, implement the method of the first aspect.

Compared with the prior art, the application can at least realize the following technical effects:

The voice information that wakes up the system is typically a sentence. Each word in the sentence can be regarded as a keyword node, and the keyword nodes form a voice sequence in a certain order. The method and the device judge the probability that the voice information is a wake-up signal based on the keyword nodes. When the probability is relatively high, the system is awakened directly without triggering voice recognition. Because voice recognition generally requires a large amount of computation, most of the energy consumed by an embedded device comes from voice recognition. The method reduces how often voice recognition is triggered and therefore saves energy. Meanwhile, when the voice information cannot be judged to be a wake-up signal, voice recognition can be started to ensure accuracy. The accuracy of the wake-up system in recognizing speech is thus ensured while its performance remains matched to the power consumption and computing capacity of the embedded intelligent device.

Drawings

To describe one or more embodiments of the present specification or the prior-art solutions more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some of the embodiments of the present specification; a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a flowchart of a cascade wake-up method based on a keyword recognition technique according to one or more embodiments of the present disclosure.

Detailed Description

To enable a person skilled in the art to better understand the technical solutions in one or more embodiments of the present specification, these technical solutions are described clearly and completely below with reference to the drawings. The described embodiments are only some embodiments of the present specification, not all of them. All other embodiments obtained by a person skilled in the art from one or more embodiments of the present specification without inventive effort are intended to fall within the scope of the present disclosure.

An embedded device consists of hardware and software and can operate independently. Its software comprises only the software running environment and the operating system. Its hardware comprises a signal processor, a memory, a communication module and the like. An embedded system differs considerably from a general computer processing system: it cannot provide large-capacity storage because it has no matching large-capacity storage medium. Most intelligent devices in daily life are embedded devices, such as intelligent door locks and intelligent sweeping robots.

These embedded devices receive user instructions through voice recognition. However, the more accurate a voice recognition algorithm is, the more energy it generally consumes, which both prevents the embedded device from reacting quickly and increases the user's cost of use. If an accurate voice recognition algorithm is not used, on the other hand, the embedded device may fail to wake up.

Based on the above scenario, the embodiment of the application provides a cascade wake-up method based on a keyword recognition technology, which comprises the following steps:

Step 1, receiving voice information input from outside.

In the embodiment of the application, the voice information includes a voice sequence.

Step 2, identifying keyword nodes in the voice sequence.

In an embodiment of the present application, the voice sequence includes at least one keyword node. For example, if the wake-up signal is "hello little meaning", then "hello", "little" and "meaning" are all keyword nodes of the voice sequence. The keyword nodes are arranged in a specific order according to the semantics; that is, the system is awakened only when the keyword nodes carried in the voice sequence appear in that order. For example, a reordered utterance such as "hello is small" cannot wake up the system.

Step 3, determining the probability that the voice information is a wake-up signal based on the keyword nodes and the voice sequence.

In the embodiment of the application, three detections are executed in sequence to determine the probability that the voice information is a wake-up signal: sequence head-to-tail detection, node detection and node timing detection. Sequence head-to-tail detection and node detection are performed on the keyword nodes, and node timing detection is performed on the voice sequence.

Specifically, sequence head-to-tail detection includes:

detecting whether a first keyword node and a last keyword node in a voice sequence contain preset keywords or not;

and when the first keyword node and the last keyword node both contain preset keywords, performing node detection.

If either the first keyword node or the last keyword node does not contain a preset keyword, the probability that the voice information is a wake-up signal is 0, and the current flow ends directly.

Because accent, speaking rate and pronunciation habits differ from person to person, the wake-up phrase can be spoken in many ways. For example, for people who speak quickly, the waveforms corresponding to the four keywords "you", "good", "small" and "meaning" are difficult to separate. In that case, apart from its head and tail, the voice sequence may contain only partial waveforms corresponding to "good" and "small". Therefore, to prevent misjudgment and filter noise at the same time, the criterion adopted is whether the first keyword node and the last keyword node both contain preset keywords.
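The head-to-tail check can be sketched as follows. This is a minimal illustration rather than the implementation of the application: it assumes that an upstream keyword spotter has already produced a list of node labels, it interprets "contains a preset keyword" as simple set membership, and the names `PRESET_KEYWORDS` and `head_tail_check` are hypothetical.

```python
from typing import Sequence

# Hypothetical preset keywords for the example wake-up phrase.
PRESET_KEYWORDS = ("you", "good", "small", "meaning")


def head_tail_check(nodes: Sequence[str],
                    preset: Sequence[str] = PRESET_KEYWORDS) -> bool:
    """Return True when both the first and the last node are preset keywords.

    Assumption: each node is already reduced to a text label; the source
    does not specify how a node is compared with a preset keyword.
    """
    if not nodes:
        return False
    return nodes[0] in preset and nodes[-1] in preset


if __name__ == "__main__":
    print(head_tail_check(["you", "good", "small", "meaning"]))
    print(head_tail_check(["cough", "good", "small", "meaning"]))
```

A sequence that fails this check would be assigned probability 0 and the flow would end, so the more expensive later stages never run for it.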

The node detection comprises the following steps:

sound data of each keyword node in the speech sequence is determined respectively.

In the embodiment of the present application, the sound data includes: duration of sound and waveform of sound.

And respectively determining the similarity between the sound data of each keyword node and the corresponding preset target sound data.

And when the similarity of the keyword nodes is larger than a preset value, performing node time sequence detection.

As described above, the waveforms corresponding to the four keywords "you", "good", "small" and "meaning" may not be separable. Therefore, node detection does not consider the order of the keyword nodes, which allows it both to prevent misjudgment and to filter noise.
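Node detection can be illustrated with the sketch below. The application does not specify a similarity measure, so cosine similarity over fixed-length feature vectors is used here purely as an assumption; the function names, the template dictionary and the threshold of 0.8 are likewise hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def node_detection(nodes: List[Tuple[str, Sequence[float]]],
                   templates: Dict[str, Sequence[float]],
                   threshold: float = 0.8) -> bool:
    """Pass only if every node is similar enough to its own preset target.

    Each node is a (label, features) pair. The position of a node in the
    list is ignored, matching the statement that node detection does not
    consider the order of the keyword nodes.
    """
    if not nodes:
        return False
    for label, features in nodes:
        target = templates.get(label)
        if target is None:
            return False
        if cosine_similarity(features, target) <= threshold:
            return False
    return True


if __name__ == "__main__":
    templates = {"you": [1.0, 0.0], "good": [0.0, 1.0]}
    shuffled = [("good", [0.1, 0.9]), ("you", [0.9, 0.1])]
    print(node_detection(shuffled, templates))
```

In a real device the feature vectors would be derived from the duration and waveform of each node; how that is done is outside what the text describes.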

Node timing detection, comprising:

sound data of the speech sequence is determined.

And determining the probability that the sound data of the voice sequence is the whole sound data of the preset target.

Most of the noise has been filtered out after sequence head-to-tail detection and node detection, and only noise in which the keyword nodes are out of order, such as "you little good", remains. At this point, whether the order of the keyword nodes in the voice sequence matches the preset order is judged, which improves recognition accuracy.
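The ordering aspect of node timing detection can be sketched as below. The application does not state how the probability is computed, so this example substitutes a simple stand-in: the length of the longest common subsequence between the detected node labels and the preset order, normalized to the range 0 to 1. The names `lcs_length`, `order_score` and `PRESET_ORDER` are hypothetical.

```python
from typing import Sequence

# Hypothetical preset order of keyword nodes for the example wake-up phrase.
PRESET_ORDER = ("you", "good", "small", "meaning")


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]


def order_score(nodes: Sequence[str],
                preset: Sequence[str] = PRESET_ORDER) -> float:
    """Stand-in score: 1.0 when the nodes appear exactly in the preset order."""
    longest = max(len(nodes), len(preset))
    if longest == 0:
        return 0.0
    return lcs_length(nodes, preset) / longest


if __name__ == "__main__":
    print(order_score(["you", "good", "small", "meaning"]))  # 1.0
    print(order_score(["you", "small", "good", "meaning"]))  # 0.75
```

A score computed on the sound data of the whole sequence, for example by a small neural network, would replace `order_score` in practice; the sketch only shows why reordered nodes score lower.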

In summary, sequence head-to-tail detection performs a preliminary screening of the voice signal, after which node detection and node timing detection are performed in turn, following the idea of "first whole and then part", so as to accurately determine the probability that the voice information is a wake-up signal. This approach can recognize whether the voice information is a wake-up signal without relying on voice recognition. It can therefore take over part of the voice recognition role of the embedded device, so that the accuracy of the wake-up system in recognizing speech is ensured while its performance remains matched to the power consumption and computing capacity of the embedded intelligent device.
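The cascade of three detections can be sketched as a chain with early exits. The stage functions are passed in as parameters so that the sketch stays independent of how each stage is implemented; `cascade_probability` and its parameter names are hypothetical. Returning 0 when node detection fails is an assumption, since the text states the 0 probability only for a failed head-to-tail check.

```python
from typing import Callable, Sequence

Nodes = Sequence[str]


def cascade_probability(nodes: Nodes,
                        head_tail_ok: Callable[[Nodes], bool],
                        nodes_ok: Callable[[Nodes], bool],
                        timing_probability: Callable[[Nodes], float]) -> float:
    """Run the three detections in order, stopping at the first failure."""
    # Stage 1: cheap screening on the first and last keyword nodes.
    if not head_tail_ok(nodes):
        return 0.0
    # Stage 2: per-node similarity check (assumed to yield 0 on failure).
    if not nodes_ok(nodes):
        return 0.0
    # Stage 3: whole-sequence check that produces the final probability.
    return timing_probability(nodes)


if __name__ == "__main__":
    probability = cascade_probability(
        ["you", "good", "small", "meaning"],
        head_tail_ok=lambda n: len(n) > 0,
        nodes_ok=lambda n: True,
        timing_probability=lambda n: 0.97,
    )
    print(probability)
```

Ordering the stages from cheapest to most expensive means that most noise is rejected before the whole-sequence computation is reached.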

Step 4, executing a wake-up system or performing voice recognition on the voice sequence according to the probability.

In the embodiment of the application, when the probability reaches a first preset value, executing a wake-up system;

and executing voice recognition on the voice sequence when the probability does not reach the first preset value.

Preferably, because the result calculated by combining sequence head-to-tail detection, node detection and node timing detection is not as accurate as voice recognition based on artificial intelligence, the embodiment of the application sets a second preset value to improve accuracy. When the probability reaches the first preset value, it is determined whether the probability also reaches the second preset value; the wake-up system is executed when the second preset value is reached, and voice recognition is executed on the voice sequence when it is not. A voice signal that does not reach the first preset value is directly judged not to be a wake-up signal. The second preset value thus regulates how often voice recognition is triggered. For example, if the user prefers higher accuracy, the second preset value may be set to 99%, which increases the triggering frequency of voice recognition and therefore the accuracy. If the user prefers to save energy, the second preset value may be set to 93%, which reduces the triggering frequency of voice recognition and therefore the energy consumption.
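The two-threshold decision of the preferred embodiment can be sketched as follows. The values 0.99 and 0.93 for the second preset value come from the example above; the first preset value of 0.5, the reading of "reaches" as greater than or equal to, and the names `Action` and `decide` are assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    WAKE = "wake"
    RECOGNIZE = "speech_recognition"
    REJECT = "not_a_wake_up_signal"


def decide(probability: float,
           first_preset: float = 0.5,
           second_preset: float = 0.99) -> Action:
    """Map the wake-up probability to one of three outcomes."""
    if probability < first_preset:
        # Below the first preset value: directly judged not a wake-up signal.
        return Action.REJECT
    if probability >= second_preset:
        # Confident enough to wake without running voice recognition.
        return Action.WAKE
    # In between: fall back to the more expensive voice recognition.
    return Action.RECOGNIZE


if __name__ == "__main__":
    print(decide(0.95))                       # accuracy-oriented setting
    print(decide(0.95, second_preset=0.93))   # energy-saving setting
```

With the same probability of 0.95, the accuracy-oriented setting hands the utterance to voice recognition, while the energy-saving setting wakes the system directly.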

In summary, the present application can flexibly adjust the balance between the power consumption and the accuracy of the embedded device.

Specifically, the method disclosed by the application is applicable to neural-network-based wake-up methods. Even when the size of the neural network model is limited, the wake-up rate and the false wake-up rate can be balanced well. For a given set of 2000 positive samples (containing the wake-up word) and 75 hours of negative data (randomly recorded content without the wake-up word), a wake-up rate of 99% and a false wake-up rate of 0.03 times per hour can be achieved at an appropriate threshold. On a 2 GHz, 32-core CPU machine, the real-time factor (RT, the ratio of speech processing time to speech length) can reach 0.02 to 0.03.
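The three metrics quoted above follow from simple ratios, sketched below. The function names are hypothetical, and the counts used in the example (1980 successful wake-ups, 2 false wake-ups, 0.25 seconds of processing for 10 seconds of audio) are illustrative inputs chosen to land near the reported figures, not measurements from the application.

```python
def wake_rate(woken: int, positives: int) -> float:
    """Fraction of positive samples that woke the system."""
    return woken / positives


def false_wake_rate(false_wakes: int, negative_hours: float) -> float:
    """False wake-ups per hour of negative (no wake-up word) audio."""
    return false_wakes / negative_hours


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Ratio of processing time to audio length; below 1 is faster than real time."""
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    print(wake_rate(1980, 2000))             # 0.99
    print(round(false_wake_rate(2, 75), 3))  # 0.027
    print(real_time_factor(0.25, 10.0))      # 0.025
```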

The embodiment of the application provides a cascade awakening device based on a keyword recognition technology, which comprises the following components: the device comprises a receiving module, an identification determining module, a data processing module and an executing module;

In this embodiment of the present application, the data processing module is configured to sequentially perform sequence head-to-tail detection and node detection on the keyword node, and perform node timing detection on the voice sequence, so as to obtain a probability that the voice information is a wake-up signal.
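The division of the device into four modules can be sketched as a small class that wires the modules together. This is an illustrative structure only: the class name `CascadeWakeDevice`, the field names and the callable signatures are hypothetical, and the stub functions in the usage example stand in for real audio capture, keyword spotting and scoring.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CascadeWakeDevice:
    # Receiving module: returns the voice sequence as raw samples.
    receive: Callable[[], List[float]]
    # Identification determining module: finds keyword nodes in the sequence.
    identify: Callable[[List[float]], List[str]]
    # Data processing module: probability that the input is a wake-up signal.
    score: Callable[[List[str], List[float]], float]
    # Execution module: wakes the system or runs voice recognition.
    execute: Callable[[float, List[float]], str]

    def run_once(self) -> str:
        """Process one utterance through the four modules in order."""
        sequence = self.receive()
        nodes = self.identify(sequence)
        probability = self.score(nodes, sequence)
        return self.execute(probability, sequence)


if __name__ == "__main__":
    device = CascadeWakeDevice(
        receive=lambda: [0.1, 0.2, 0.3],
        identify=lambda seq: ["you", "good", "small", "meaning"],
        score=lambda nodes, seq: 0.995,
        execute=lambda p, seq: "wake" if p >= 0.99 else "speech_recognition",
    )
    print(device.run_once())
```

Keeping each module behind a plain callable makes it straightforward to replace one stage, for example the scoring stage, without touching the others.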

An embodiment of the present application provides a storage medium for storing computer-executable instructions that, when executed, implement the methods of the above-described embodiments.

The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, transistor or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logical method flow can readily be obtained by lightly logic-programming the method flow into an integrated circuit using the hardware description languages described above.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, or the form of logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20 and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, it is entirely possible to logic-program the method steps so that the controller achieves the same functionality in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. The means for performing the various functions may even be regarded both as software modules implementing the method and as structures within the hardware component.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each unit may be implemented in the same piece or pieces of software and/or hardware when implementing the embodiments of the present specification.

One skilled in the relevant art will recognize that one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article or apparatus that comprises the element.

One or more embodiments of the present specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. One or more embodiments of the specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In this specification, the embodiments are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the description of the method embodiments.

The foregoing description is by way of example only and is not intended to limit the present disclosure. Various modifications and changes may occur to those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. that fall within the spirit and principles of the present document are intended to be included within the scope of the claims of the present document.

Claims

1. The cascade awakening method based on the keyword recognition technology is characterized by comprising the following steps of:

identifying keyword nodes in the voice sequence;

2. The method of claim 1, wherein

the determining, based on the keyword node and the voice sequence, the probability that the voice information is a wake-up signal includes:

3. The method of claim 2, wherein

the step of detecting the head and the tail of the sequence aiming at the keyword node comprises the following steps:

4. The method of claim 2, wherein

the node detection for the keyword node comprises the following steps:

respectively determining sound data of each keyword node in the voice sequence:

5. The method of claim 2, wherein

the node timing detection for the voice sequence includes:

determining sound data of the speech sequence;

6. The method of claim 1, wherein

the executing a wake-up system or performing voice recognition on the voice sequence according to the probability comprises:

executing a wake-up system when the probability reaches a first preset value;

7. The method of claim 6, wherein

determining whether the probability reaches a first preset value;

and executing the voice recognition on the voice sequence when the probability does not reach a second preset value.

8. A cascading wake-up device based on a keyword recognition technology, comprising: the device comprises a receiving module, an identification determining module, a data processing module and an executing module;

9. The apparatus of claim 8, wherein

the data processing module is used for sequentially carrying out sequence head-to-tail detection and node detection on the keyword nodes, and carrying out node time sequence detection on the voice sequence to obtain the probability that the voice information is a wake-up signal.

10. A storage medium, comprising:

for storing computer-executable instructions which, when executed, implement the method of any of claims 1-7.