CN113516967A - Voice recognition method and device - Google Patents

Voice recognition method and device

Info

Publication number
CN113516967A
CN113516967A
Authority
CN
China
Prior art keywords
optimal path
user voice
user
cost value
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110889732.0A
Other languages
Chinese (zh)
Inventor
李程帅
周全
徐涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao Xinxin Microelectronics Technology Co Ltd
Original Assignee
Qingdao Xinxin Microelectronics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao Xinxin Microelectronics Technology Co Ltd
Priority to CN202110889732.0A
Publication of CN113516967A
Legal status: Pending

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/01 Assessment or evaluation of speech recognition systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/32 Multiple recognisers used in sequence or in parallel; Score combination systems therefor, e.g. voting systems

Abstract

The application discloses a voice recognition method and device for improving voice recognition efficiency and thereby the response speed to voice instructions. The voice recognition method provided by the application comprises the following steps: determining, frame by frame, the optimal path of the user voice decoding; and, before the user voice is cut off, determining whether to output a recognition result corresponding to the user voice according to the confidence of the current optimal path.

Description

Voice recognition method and device
Technical Field
The present application relates to the field of information technology, and in particular, to a method and an apparatus for speech recognition.
Background
In a command word speech recognition system, in order to ensure recognition accuracy, the streaming decoding of speech recognition is often used in combination with a voice activity detection (VAD) module, i.e. the final result is obtained only after a command word has been completely spoken, from the beginning to the end of the speech.
For example, if the speaker issues the air-conditioner command "air blowing mode", the prior art has to wait until the speaker has finished saying "air blowing mode" before outputting the recognition result, using, for example, 3 consecutive frames of silence as the condition for judging that the speech has been cut off; otherwise "blowing" could be confused with the short instruction "stroke" and cause false recognition. That is, the prior art must wait for the voice command to be cut off, which necessarily introduces a delay: with 3 consecutive silent frames as the cut-off condition, there is a delay of at least 3 frames, so the response to the user's voice instruction is slow and the user experience suffers.
Disclosure of Invention
The embodiment of the application provides a voice recognition method and a voice recognition device, which are used for improving the voice recognition efficiency and further improving the response speed of a voice instruction.
The voice recognition method provided by the embodiment of the application comprises the following steps:
determining the optimal path of user voice decoding frame by frame;
and before the user voice is cut off, determining whether to output a recognition result corresponding to the user voice according to the confidence of the current optimal path.
By this method, the optimal path of the user voice decoding is determined frame by frame, and before the user voice is cut off it is determined, according to the confidence of the current optimal path, whether to output the recognition result corresponding to the user voice; this improves the voice recognition efficiency and thus the response speed to voice instructions.
Optionally, determining whether to output a recognition result corresponding to the user voice according to the confidence of the current optimal path specifically includes:
comparing the cost value of the current optimal path with a preset threshold value, and determining whether to output a recognition result corresponding to the user voice according to the comparison result;
or comparing the distances between the current optimal path and other paths, and determining whether to output the recognition result corresponding to the user voice according to the comparison result.
Optionally, before the user speech is cut off, if a recognition result corresponding to the user speech is not output, the method further includes:
and if the cost value of the current optimal path is smaller than a preset first threshold value and the user voice is cut off, outputting a recognition result corresponding to the user voice.
Optionally, before the user voice is cut off, if the cost value of the current optimal path is smaller than a preset second threshold, outputting a recognition result corresponding to the user voice, where the second threshold is smaller than the first threshold.
Optionally, when the final state of the user voice is reached, if the cost value of the current optimal path is smaller than a preset second threshold, outputting a recognition result corresponding to the user voice.
Specifically, how to judge whether the final state of the user voice is reached belongs to the prior art.
Optionally, the cost value of the current optimal path is an average optimal cost value of the current optimal path, or an average cost value of the current optimal path.
Optionally, the user voice cut-off is determined when a preset number of frames have elapsed after the final state of the user voice is reached.
Optionally, the method further comprises:
updating the preset frame number in the following way:
N' = (average cost value of the current optimal path / first threshold) × N
wherein N is the preset frame number;
the value obtained by rounding N' is taken as the updated preset frame number;
and the user voice cut-off is determined when the updated preset number of frames has elapsed after the final state of the user voice is reached.
Another embodiment of the present application provides a speech recognition apparatus, including:
a memory for storing program instructions;
and the processor is used for calling the program instructions stored in the memory and executing any one of the methods according to the obtained program.
Another embodiment of the present application provides a computing device, which includes a memory and a processor, wherein the memory is used for storing program instructions, and the processor is used for calling the program instructions stored in the memory and executing any one of the above methods according to the obtained program.
Another embodiment of the present application provides a computer storage medium having stored thereon computer-executable instructions for causing a computer to perform any one of the methods described above.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of an optimal path provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of an optimal path according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a voice recognition method and a voice recognition device, which are used for improving the voice recognition efficiency and further improving the response speed of a voice instruction.
The method and the device are based on the same application concept, and since the principles by which they solve the problem are similar, the implementations of the device and the method may refer to each other; repeated descriptions are omitted.
Various embodiments of the present application will be described in detail below with reference to the accompanying drawings. It should be noted that the order in which the embodiments are presented only reflects their sequence and does not indicate the relative merits of the technical solutions they provide.
The embodiment of the application proposes that streaming decoding need not always wait for the voice cut-off judgment; instead, whether to wait for the voice cut-off or to output the current recognition result directly is decided according to the confidence of the current path. Moreover, when waiting for the voice cut-off judgment, the number of frames to wait can be adjusted dynamically rather than being fixed at a preset number of frames. This improves the response speed to voice commands and effectively reduces the response time.
In the embodiment of the present application, the confidence of a path may be taken as the average optimal cost value of the path: the smaller this value, the more likely the path is to be the correct output path, i.e. the higher the accuracy of the speech recognition.
A path is a decoding path of the speech recognition. For example, when 8 consecutive frames of voice data are input, the acoustic model produces for each frame the predicted probability values of all phonemes; these outputs are fed into the decoder and the decoding paths are obtained according to the Viterbi algorithm. Fig. 1 shows one path among all paths in the decoding graph after the 8 frames have been input.
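As a minimal, self-contained sketch of how such an optimal path can be tracked frame by frame, the following Python snippet runs a Viterbi-style left-to-right search over the phoneme sequence of the "power on" example used later in the worked description. The search loop, the variable names, and all cost values off the optimal path are assumptions for illustration; only the per-frame costs of the optimal path itself (0.05, 0.2, 0.12, 0.1, 0.3, 0.2, 0.09, 0.1) follow the fig. 1 example, and a real system would decode a full decoding graph containing many command words rather than a single linear sequence.

```python
def advance(prev, frame_cost):
    # One frame of a left-to-right search: a path may stay in its current
    # phoneme or move on to the next one; keep the cheaper option per position.
    cur = [None] * len(prev)
    for i in range(len(prev)):
        options = [c for c in (prev[i], prev[i - 1] if i > 0 else None) if c is not None]
        if options:
            cur[i] = min(options) + frame_cost[i]
    return cur

# "power on" = phonemes 173 95 146 171 139 (5 positions). Rows are frames,
# columns are positions; the off-path costs are made up for this sketch.
frame_costs = [
    [0.05, 0.9, 0.9, 0.9, 0.9],
    [0.6, 0.2, 0.9, 0.9, 0.9],
    [0.7, 0.12, 0.8, 0.9, 0.9],
    [0.8, 0.7, 0.10, 0.9, 0.9],
    [0.9, 0.8, 0.30, 0.9, 0.9],
    [0.9, 0.8, 0.7, 0.20, 0.9],
    [0.9, 0.9, 0.8, 0.09, 0.9],
    [0.9, 0.9, 0.9, 0.8, 0.10],
]

costs = [frame_costs[0][0]] + [None] * 4      # the first frame must start in phoneme 173
for fc in frame_costs[1:]:
    costs = advance(costs, fc)                # updated once per incoming frame

# After the 8th frame, costs[-1] is the cumulative cost of the optimal path
# whose last state is phoneme 139; here it is 1.16, as in the worked example.
print(round(costs[-1], 2))
```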
That is to say, the speech recognition method provided by the embodiment of the present application includes:
the method comprises the following steps: determining the optimal path of user voice decoding frame by frame;
for example, as shown in fig. 1, for 8 continuous frames of voice data, each time a frame is input, there are several paths in the decoding graph, and the path with the smallest accumulated cost value is the optimal path in the decoding graph of the current frame;
step two: and before the user voice is cut off, determining whether to output a recognition result corresponding to the user voice according to the confidence coefficient of the current optimal path.
In the embodiment of the present application, once the optimal path in the decoding graph of the current frame has been obtained, it is determined whether the last state of the command word has been reached, i.e. whether the final state of the user speech (which may also be called the final state of the optimal path) has been reached. Taking fig. 1 as an example, the state corresponding to phoneme 139 is the final state.
It should be noted that, in the embodiment of the present application, "before the user speech is cut off" may mean the moment at which the final state of the user speech is reached, for example the last state shown in fig. 1; or it may mean a point within the preset number of delay frames counted after the final state is reached. For example, suppose the preset is that the user speech is considered cut off four frames after reaching the phoneme 139 state with the cost value of 0.5 shown in fig. 2; according to the method provided in the embodiment of the present application, whether to output the recognition result corresponding to the user speech may already be determined, from the confidence of the current optimal path, two frames earlier, i.e. the recognition result is output two frames in advance. Exactly how many frames in advance may be determined according to the actual situation; an example is developed subsequently.
If the optimal path has reached the final state of the command word, the cost value of the current optimal path is calculated; this cost value may be the average optimal cost value of the current optimal path, or the average cost value of the current optimal path.
The smaller the cost value of the current optimal path is, the higher the confidence degree of the optimal path is.
Optionally, determining whether to output a recognition result corresponding to the user voice according to the confidence of the current optimal path specifically includes:
comparing the cost value of the current optimal path with a preset threshold value, and determining whether to output a recognition result corresponding to the user voice according to the comparison result;
or comparing the distances between the current optimal path and other paths, and determining whether to output the recognition result corresponding to the user voice according to the comparison result.
Optionally, before the user speech is cut off, if a recognition result corresponding to the user speech is not output, the method further includes:
and if the cost value of the current optimal path is smaller than a preset first threshold value and the user voice is cut off, outputting a recognition result corresponding to the user voice.
Optionally, before the user voice is cut off, if the cost value of the current optimal path is smaller than a preset second threshold, outputting a recognition result corresponding to the user voice, where the second threshold is smaller than the first threshold.
For example, if the average optimal cost value of the optimal path is smaller than the first threshold, it is further compared with the second threshold; if the average optimal cost value of the optimal path is smaller than the second threshold, the command word (i.e. the current recognition result) corresponding to the optimal path is output directly, and if it is not smaller than the second threshold, the judgment on the command word is made after the voice has finished.
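The following Python sketch summarizes this two-threshold decision rule. The function name, signature and return values are our own; the default thresholds 1.0 and 0.1 and the numeric cases simply mirror the worked example given below.

```python
def decide(avg_optimal_cost, final_state_reached, speech_cut_off,
           first_threshold=1.0, second_threshold=0.1):
    # Returns "output" to emit the recognition result now, "wait" to keep
    # decoding, or "reject" if the best path is not good enough after cut-off.
    if final_state_reached and avg_optimal_cost < second_threshold:
        return "output"                  # high confidence: answer before the voice cut-off
    if speech_cut_off:
        return "output" if avg_optimal_cost < first_threshold else "reject"
    return "wait"                        # defer the decision until the voice is cut off

# Fig. 1 style case: 0.092 < 0.1, so the result is output immediately.
print(decide(0.092, final_state_reached=True, speech_cut_off=False))   # output
# Fig. 2 style case: 0.32 lies between the thresholds, so keep waiting.
print(decide(0.32, final_state_reached=True, speech_cut_off=False))    # wait
# After the cut-off the fig. 2 path has cost 0.24 < 1.0, so it is output.
print(decide(0.24, final_state_reached=True, speech_cut_off=True))     # output
```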
An example of a specific embodiment is illustrated with reference to fig. 3, for example:
the embodiment of the application provides that the time delay of stream decoding is dynamically adjusted by using the confidence coefficient of a path. If the phoneme sequence of the command word "power on" is "17395146171139", wherein the phoneme refers to the minimum pronunciation unit specified artificially, the serial numbers in the phoneme sequence correspond to the pronunciations, for example, "173" indicates the phoneme "k" in "on". A command word, i.e. a user voice command, corresponds to a sequence of phonemes.
As shown in fig. 1, after 8 frames have been input into the speech recognition decoding graph, the optimal path of "power on" reaches its last state 139 (how to judge whether the optimal path has reached the last state can be implemented with the prior art and is not described in this embodiment). At this time the cumulative cost value of this optimal path is 0.05 + 0.2 + 0.12 + 0.1 + 0.3 + 0.2 + 0.09 + 0.1 = 1.16, and it is the optimal path in the decoding graph, i.e. 1.16 is the smallest of the cumulative cost values of all paths in the decoding graph. The average optimal cost value of the optimal path is then computed: "173 95 146 171 139" contains 5 phonemes in total, corresponding to 5 optimal states. When the same phoneme has several cost values, the smallest one is taken; for example, the cost value of phoneme 95 is 0.2 in the second frame and 0.12 in the third frame, and since 0.12 is smaller than 0.2, 0.12 is the optimal cost value of phoneme 95 and is the value used in calculating the average optimal cost value of the optimal path. The average of the optimal cost values of the 5 phonemes is (0.05 + 0.12 + 0.1 + 0.09 + 0.1)/5 = 0.092; the smaller this value, the higher the confidence of the optimal path.
If the average optimal cost value is lower than the second threshold of the command word, the command word is recognized directly at this moment, i.e. the recognition result is output. For example, the second threshold may be set to 0.1 times the first threshold. If the threshold of the average optimal cost value for recognizing "power on" by way of voice cut-off (the first threshold) is 1.0, then 0.092 is smaller than the second threshold 0.1, i.e. smaller than 0.1 times the first threshold. This shows that the confidence of the path is high, so the command word can be recognized as "power on" directly, without waiting for the voice cut-off. In this way, provided the voice command is recognized correctly, the response speed to the voice command is greatly increased and the user experience is improved.
In another case, the average optimal cost value at this time lies between 0.1 and 1.0, i.e. it is greater than the second threshold and less than the first threshold. For example, as shown in fig. 2, the average optimal cost value over the first 8 frames is (0.5 + 0.2 + 0.3 + 0.1 + 0.5)/5 = 0.32, which lies between 0.1 and 1.0, and likewise the average optimal cost values at the 9th and 10th frames lie between 0.1 and 1.0. In this case the recognition result cannot be output directly in the above manner, and the judgment has to be made after the speech is cut off, i.e. the recognition result is output after a delay.
This is because the voice detection module has a certain false-recognition rate; for example, a speech frame may be mistakenly recognized as a silent frame. After the speech is cut off, if the average optimal cost value of the optimal path is smaller than the first threshold 1.0, the command word is recognized. As shown in fig. 2, for example, the average optimal cost value of the optimal path is (0.5 + 0.2 + 0.3 + 0.1 + 0.1)/5 = 0.24, which is less than the first threshold 1.0, so it is recognized as "power on".
In fig. 1 the confidence of the "power on" path is high, so recognition can be performed directly; otherwise, as shown in fig. 2, recognition is performed only after the speech is cut off. Clearly, the result in fig. 1 is output 4 frames earlier than in fig. 2, which shortens the recognition response time.
In addition to the solutions described in the above specific examples, the technical solutions provided in the embodiments of the present application may also have the following implementation manners:
First, when determining the confidence of the optimal path, the way of calculating the path cost is not unique: the average optimal cost value may be calculated as in the above embodiment, or the average cost over all states of the path may be calculated directly. Taking fig. 1 as an example, the average cost value over all states is (0.05 + 0.2 + 0.12 + 0.1 + 0.3 + 0.2 + 0.09 + 0.1)/8 = 0.145.
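The two cost variants can be reproduced in a few lines of Python. The frame-to-phoneme alignment used below is not stated explicitly in the description; it is reconstructed from the arithmetic of the fig. 1 example (it is the only grouping of the eight frame costs over the five phonemes that yields 0.092), so treat it as an illustration rather than the figure itself.

```python
# Frame-by-frame costs along the fig. 1 optimal path, paired with the phoneme
# each frame is aligned to (alignment reconstructed from the example's arithmetic).
path = [("173", 0.05), ("95", 0.2), ("95", 0.12), ("146", 0.1),
        ("146", 0.3), ("171", 0.2), ("171", 0.09), ("139", 0.1)]

# Variant 1: average optimal cost value, i.e. the smallest cost of each of the
# 5 phonemes, averaged over the phonemes.
best = {}
for phoneme, cost in path:
    best[phoneme] = min(cost, best.get(phoneme, float("inf")))
avg_optimal = sum(best.values()) / len(best)

# Variant 2: average cost value, i.e. the mean cost over all 8 states of the path.
avg_all = sum(cost for _, cost in path) / len(path)

print(round(avg_optimal, 3), round(avg_all, 3))   # 0.092 0.145
```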
Next, regarding how the second threshold is determined: in the above embodiment 0.1 times the command word threshold (i.e. the first threshold) is used as the second threshold, but the second threshold may also be a fixed value or be determined in other ways, as long as it is less than or equal to the conventional command word threshold (i.e. the first threshold).
In addition to thresholding the confidence of the optimal path, the distance between the current optimal path and the other paths may be compared, for example the distance between the optimal path and the second-ranked path. Suppose the confidences of both the path whose command word is recognized as "supply air" and the path whose command word is recognized as "stroke" reach the first threshold, the cumulative cost value of the "supply air" path is second only to that of the "stroke" path, and the cumulative cost value of the "stroke" path is less than 0.1 times that of the "supply air" path: for example, the cumulative cost value of the "supply air" path is 4.0 and that of the "stroke" path is 0.3. This indicates that the "stroke" path and the "supply air" path are far apart; since the "stroke" path is the optimal path, its confidence is high, and the speech recognition result "stroke" is output directly.
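A small sketch of this margin-based check follows. The function name and the shape of the path list are assumptions; the 0.1 factor and the cost values 4.0 and 0.3 come from the example above.

```python
def pick_by_margin(paths, margin=0.1):
    # paths: list of (command_word, cumulative_cost) whose confidence already
    # reached the first threshold; lower cost is better. If the best path's cost
    # is below `margin` times the runner-up's cost, output it immediately.
    ranked = sorted(paths, key=lambda p: p[1])
    (best_word, best_cost), (_, second_cost) = ranked[0], ranked[1]
    return best_word if best_cost < margin * second_cost else None

# "stroke" (0.3) is far ahead of "supply air" (4.0): 0.3 < 0.1 * 4.0, so it is
# output directly; a None result would mean waiting for the voice cut-off instead.
print(pick_by_margin([("supply air", 4.0), ("stroke", 0.3)]))   # stroke
```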
Finally, regarding shortening the delay of streaming voice decoding: in the above embodiment the user voice cut-off is determined after a preset number of frames (for example, 4 frames) following the final state of the user voice. In the embodiment of the application the preset frame number may also be updated in the following way, and the user speech cut-off is then determined according to the updated preset frame number. For example:
The updated preset number of frames equals (average cost value of the optimal path / first threshold) × N, rounded to an integer, that is:
N' = (average cost value of the current optimal path / first threshold) × N
wherein N is the preset frame number;
the value obtained by rounding N' is taken as the updated preset frame number.
In this way, the smaller the average optimal cost value of the current optimal path, the smaller N' is, i.e. the shorter the delay; otherwise the delay is longer. For example, if the first threshold is 1.0, the average optimal cost value of the optimal path at this moment is 0.4 and N is 5, the updated number of delay frames is round(5 × 0.4 / 1.0) = 2, i.e. because the confidence of the path is high the delay is reduced from 5 frames to 2 frames, and the recognition result can be output two frames after the final state of the command word is reached.
In the embodiment of the present application, the above-mentioned speech cut-off determination method may also be implemented by using conventional techniques, and may be any algorithm or model based on VAD, GMM, DNN, or the like, for example.
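For completeness, here is a toy version of the consecutive-silence cut-off rule mentioned in the background, driven by the (possibly updated) preset frame number. The VAD itself and all names are placeholders, since the patent leaves the cut-off algorithm to conventional VAD, GMM or DNN based techniques.

```python
def speech_cut_off(vad_flags, wait_frames):
    # Declare the user speech cut off once `wait_frames` consecutive frames
    # have been classified as silence; vad_flags holds True for speech frames.
    silent_run = 0
    for is_speech in vad_flags:
        silent_run = 0 if is_speech else silent_run + 1
        if silent_run >= wait_frames:
            return True
    return False

# With the delay shrunk from 5 to 2 frames (see above), cut-off is declared sooner.
print(speech_cut_off([True, True, False, False], wait_frames=2))   # True
print(speech_cut_off([True, True, False, False], wait_frames=5))   # False
```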
In addition, in this embodiment of the present application, different command words may correspond to different thresholds (including the first threshold and/or the second threshold), that is, different paths may correspond to different thresholds, and the specific embodiment of the present application is not limited.
In summary, referring to fig. 4, a speech recognition method in the embodiment of the present application includes:
s101, determining an optimal path of user voice decoding frame by frame;
and S102, before the voice of the user is cut off, determining whether to output a recognition result corresponding to the voice of the user according to the confidence coefficient of the current optimal path.
Optionally, determining whether to output a recognition result corresponding to the user voice according to the confidence of the current optimal path specifically includes:
comparing the cost value of the current optimal path with a preset threshold value, and determining whether to output a recognition result corresponding to the user voice according to the comparison result;
or comparing the distances between the current optimal path and other paths, and determining whether to output the recognition result corresponding to the user voice according to the comparison result.
Optionally, before the user speech is cut off, if a recognition result corresponding to the user speech is not output, the method further includes:
and if the cost value of the current optimal path is smaller than a preset first threshold value and the user voice is cut off, outputting a recognition result corresponding to the user voice.
Optionally, before the user voice is cut off, if the cost value of the current optimal path is smaller than a preset second threshold, outputting a recognition result corresponding to the user voice, where the second threshold is smaller than the first threshold.
Optionally, when the final state of the user voice is reached, if the cost value of the current optimal path is smaller than a preset second threshold, outputting a recognition result corresponding to the user voice.
Specifically, how to judge whether the final state of the user voice is reached belongs to the prior art.
Optionally, the cost value of the current optimal path is an average optimal cost value of the current optimal path, or an average cost value of the current optimal path.
Optionally, the user voice cut-off is determined when a preset number of frames have elapsed after the final state of the user voice is reached.
Optionally, the method further comprises:
updating the preset frame number in the following way:
N' = (average cost value of the current optimal path / first threshold) × N
wherein N is the preset frame number;
the value obtained by rounding N' is taken as the updated preset frame number;
and the user voice cut-off is determined when the updated preset number of frames has elapsed after the final state of the user voice is reached.
Referring to fig. 5, a speech recognition apparatus provided in an embodiment of the present application includes:
the processor 600, which is used to read the program in the memory 620, executes the following processes:
determining the optimal path of user voice decoding frame by frame;
and before the user voice is cut off, determining whether to output a recognition result corresponding to the user voice according to the confidence of the current optimal path.
Optionally, determining whether to output a recognition result corresponding to the user voice according to the confidence of the current optimal path specifically includes:
comparing the cost value of the current optimal path with a preset threshold value, and determining whether to output a recognition result corresponding to the user voice according to the comparison result;
or comparing the distances between the current optimal path and other paths, and determining whether to output the recognition result corresponding to the user voice according to the comparison result.
Optionally, before the user speech is cut off, if the recognition result corresponding to the user speech is not output, the processor 600 is further configured to:
and if the cost value of the current optimal path is smaller than a preset first threshold value and the user voice is cut off, outputting a recognition result corresponding to the user voice.
Optionally, before the user voice is cut off, if the cost value of the current optimal path is smaller than a preset second threshold, outputting a recognition result corresponding to the user voice, where the second threshold is smaller than the first threshold.
Optionally, when the final state of the user voice is reached, if the cost value of the current optimal path is smaller than a preset second threshold, outputting a recognition result corresponding to the user voice.
Optionally, the cost value of the current optimal path is an average optimal cost value of the current optimal path, or an average cost value of the current optimal path.
Optionally, the user voice cut-off is determined when a preset number of frames have elapsed after the final state of the user voice is reached.
Optionally, the processor 600 is further configured to:
updating the preset frame number in the following way:
N' = (average cost value of the current optimal path / first threshold) × N
wherein N is the preset frame number;
the value obtained by rounding N' is taken as the updated preset frame number;
and the user voice cut-off is determined when the updated preset number of frames has elapsed after the final state of the user voice is reached.
A transceiver 610 for receiving and transmitting data under the control of the processor 600.
Wherein in fig. 5, the bus architecture may include any number of interconnected buses and bridges, with one or more processors, represented by processor 600, and various circuits of memory, represented by memory 620, being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface. The transceiver 610 may be a number of elements including a transmitter and a receiver that provide a means for communicating with various other apparatus over a transmission medium. For different user devices, the user interface 630 may also be an interface capable of interfacing with a desired device externally, including but not limited to a keypad, display, speaker, microphone, joystick, etc.
The processor 600 is responsible for managing the bus architecture and general processing, and the memory 620 may store data used by the processor 600 in performing operations.
Alternatively, the processor 600 may be a CPU (central processing unit), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or a CPLD (Complex Programmable Logic Device).
Referring to fig. 6, another speech recognition apparatus provided in the embodiment of the present application includes:
a first unit 11, configured to determine an optimal path for user speech decoding on a frame-by-frame basis;
and a second unit 12, configured to determine, before the user voice is cut off, whether to output a recognition result corresponding to the user voice according to the confidence of the current optimal path.
Optionally, determining whether to output a recognition result corresponding to the user voice according to the confidence of the current optimal path specifically includes:
comparing the cost value of the current optimal path with a preset threshold value, and determining whether to output a recognition result corresponding to the user voice according to the comparison result;
or comparing the distances between the current optimal path and other paths, and determining whether to output the recognition result corresponding to the user voice according to the comparison result.
Optionally, before the user speech is cut off, if the recognition result corresponding to the user speech is not output, the second unit 12 is further configured to:
and if the cost value of the current optimal path is smaller than a preset first threshold value and the user voice is cut off, outputting a recognition result corresponding to the user voice.
Optionally, before the user voice is cut off, if the cost value of the current optimal path is smaller than a preset second threshold, outputting a recognition result corresponding to the user voice, where the second threshold is smaller than the first threshold.
Optionally, when the final state of the user voice is reached, if the cost value of the current optimal path is smaller than a preset second threshold, outputting a recognition result corresponding to the user voice.
Specifically, how to judge whether the final state of the user voice is reached belongs to the prior art.
Optionally, the cost value of the current optimal path is an average optimal cost value of the current optimal path, or an average cost value of the current optimal path.
Optionally, the user voice cut-off is determined when a preset number of frames have elapsed after the final state of the user voice is reached.
Optionally, the second unit 12 is further configured to:
updating the preset frame number in the following way:
N' = (average cost value of the current optimal path / first threshold) × N
wherein N is the preset frame number;
the value obtained by rounding N' is taken as the updated preset frame number;
and the user voice cut-off is determined when the updated preset number of frames has elapsed after the final state of the user voice is reached.
It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The voice recognition device provided in the embodiment of the application can be user equipment, such as intelligent household appliances like an air conditioner, a refrigerator and a washing machine, and can also be other types of terminal equipment.
The embodiment of the present application provides a computing device, which may specifically be a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), and the like. The computing device may include a Central Processing Unit (CPU), memory, input/output devices, etc., the input devices may include a keyboard, mouse, touch screen, etc., and the output devices may include a Display device, such as a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT), etc.
The memory may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides the processor with program instructions and data stored in the memory. In the embodiments of the present application, the memory may be used for storing a program of any one of the methods provided by the embodiments of the present application.
The processor is used for executing any one of the methods provided by the embodiment of the application according to the obtained program instructions by calling the program instructions stored in the memory.
Embodiments of the present application provide a computer storage medium for storing computer program instructions for an apparatus provided in the embodiments of the present application, which includes a program for executing any one of the methods provided in the embodiments of the present application.
The computer storage media may be any available media or data storage device that can be accessed by a computer, including, but not limited to, magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method of speech recognition, the method comprising:
determining the optimal path of user voice decoding frame by frame;
and before the user voice is cut off, determining whether to output a recognition result corresponding to the user voice according to the confidence of the current optimal path.
2. The method according to claim 1, wherein determining whether to output a recognition result corresponding to the user's voice according to the confidence of the current optimal path specifically includes:
comparing the cost value of the current optimal path with a preset threshold value, and determining whether to output a recognition result corresponding to the user voice according to the comparison result;
or comparing the distances between the current optimal path and other paths, and determining whether to output the recognition result corresponding to the user voice according to the comparison result.
3. The method of claim 2, wherein before the user speech is cut off, if the recognition result corresponding to the user speech is not output, the method further comprises:
and if the cost value of the current optimal path is smaller than a preset first threshold value and the user voice is cut off, outputting a recognition result corresponding to the user voice.
4. The method according to claim 3, wherein before the user speech is cut off, if the cost value of the current optimal path is smaller than a preset second threshold, the recognition result corresponding to the user speech is output, wherein the second threshold is smaller than the first threshold.
5. The method according to claim 4, wherein when the final state of the user voice is reached, if the cost value of the current optimal path is smaller than a preset second threshold, the recognition result corresponding to the user voice is output.
6. The method according to any one of claims 1 to 5, wherein the cost value of the current optimal path is an average optimal cost value of the current optimal path or an average cost value of the current optimal path.
7. The method of claim 3, wherein the user speech cut-off is determined when a preset number of frames have elapsed after the final state of the user speech is reached.
8. The method of claim 7, further comprising:
updating the preset frame number in the following way:
N' = (average cost value of the current optimal path / first threshold) × N,
wherein N is the preset frame number;
taking the value obtained by rounding N' as the updated preset frame number;
and determining the user speech cut-off when the updated preset number of frames has elapsed after the final state of the user speech is reached.
9. A speech recognition apparatus, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory to execute the method of any one of claims 1 to 8 in accordance with the obtained program.
10. A computer storage medium having stored thereon computer-executable instructions for causing a computer to perform the method of any one of claims 1 to 8.
CN202110889732.0A 2021-08-04 2021-08-04 Voice recognition method and device Pending CN113516967A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110889732.0A CN113516967A (en) 2021-08-04 2021-08-04 Voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110889732.0A CN113516967A (en) 2021-08-04 2021-08-04 Voice recognition method and device

Publications (1)

Publication Number Publication Date
CN113516967A true CN113516967A (en) 2021-10-19

Family

ID=78067935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110889732.0A Pending CN113516967A (en) 2021-08-04 2021-08-04 Voice recognition method and device

Country Status (1)

Country Link
CN (1) CN113516967A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA1229925A (en) * 1985-01-17 1987-12-01 James K. Baker Speech recognition method
US7930181B1 (en) * 2002-09-18 2011-04-19 At&T Intellectual Property Ii, L.P. Low latency real-time speech transcription
US20110054892A1 (en) * 2008-05-28 2011-03-03 Koreapowervoice Co., Ltd. System for detecting speech interval and recognizing continuous speech in a noisy environment through real-time recognition of call commands
CN102592595A (en) * 2012-03-19 2012-07-18 安徽科大讯飞信息科技股份有限公司 Voice recognition method and system
CN107004407A (en) * 2015-09-03 2017-08-01 谷歌公司 Enhanced sound end is determined
CN110612567A (en) * 2017-05-12 2019-12-24 苹果公司 Low latency intelligent automated assistant
CN108735201A (en) * 2018-06-29 2018-11-02 广州视源电子科技股份有限公司 Continuous speech recognition method, apparatus, equipment and storage medium
CN112542170A (en) * 2019-09-20 2021-03-23 现代自动车株式会社 Dialogue system, dialogue processing method, and electronic device
CN111261161A (en) * 2020-02-24 2020-06-09 腾讯科技(深圳)有限公司 Voice recognition method, device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453507A (en) * 2023-02-21 2023-07-18 北京数美时代科技有限公司 Confidence model-based voice recognition optimization method, system and storage medium
CN116453507B (en) * 2023-02-21 2023-09-08 北京数美时代科技有限公司 Confidence model-based voice recognition optimization method, system and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination