CN111429911A - Method and device for reducing power consumption of speech recognition engine in noise scene

Info

Publication number: CN111429911A
Application number: CN202010163866.XA
Authority: CN
Inventors: 闫子魁
Original assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date: 2020-03-11
Filing date: 2020-03-11
Publication date: 2020-07-17

Abstract

The invention discloses a method and a device for reducing the power consumption of a speech recognition engine in a noise scene. The method comprises the following steps: acquiring a first voice recognition engine and a second voice recognition engine, wherein the power consumption of the first voice recognition engine is less than that of the second voice recognition engine; acquiring a preset number of pieces of voice information in a noise scene; preliminarily recognizing the preset number of pieces of voice information through the first voice recognition engine to obtain a first voice recognition result; and recognizing the first voice recognition result through the second voice recognition engine to obtain a second voice recognition result. Because the low-power first voice recognition engine recognizes and filters out useless noise, the high-power second voice recognition engine no longer has to recognize that noise frequently; its working frequency is reduced, power consumption is greatly reduced, and the accuracy of the obtained second voice recognition result is high.

Description

Method and device for reducing power consumption of speech recognition engine in noise scene

Technical Field

The invention relates to the technical field of voice interaction and recognition, in particular to a method and a device for reducing power consumption of a voice recognition engine in a noise scene.

Background

Speech recognition is an interdisciplinary field. Over the last two decades the technology has advanced significantly and has begun to move from the laboratory to the market. Within the next 10 years it is expected to enter fields such as industry, home appliances, communications, automotive electronics, medical care, home services, and consumer electronics. Noise is a major obstacle to putting speech recognition into practical use in these fields.

At present, a voice recognition engine that uses VAD (voice activity detection) recognizes speech poorly in a noise scene, while a voice recognition engine without VAD consumes relatively high power when recognizing speech in a noise scene. How to ensure a good recognition effect in a noise scene while keeping power consumption low is therefore a problem that urgently needs to be solved.

Disclosure of Invention

The invention provides a method and a device for reducing power consumption of a speech recognition engine in a noise scene, wherein the technical scheme is as follows:

according to a first aspect of the embodiments of the present invention, there is provided a method for reducing power consumption of a speech recognition engine in a noise scene, including:

acquiring a first voice recognition engine and a second voice recognition engine, wherein the power consumption of the first voice recognition engine is less than that of the second voice recognition engine;

acquiring a preset number of pieces of voice information in a noise scene;

preliminarily recognizing the preset number of pieces of voice information through the first voice recognition engine to obtain a first voice recognition result;

and identifying the first voice recognition result through the second voice recognition engine to obtain a second voice recognition result.

In one embodiment, the method for reducing power consumption of a speech recognition engine in a noisy scene further comprises:

judging whether the first voice recognition result meets a preset condition or not;

when the first voice recognition result does not meet the preset condition, recognizing the first voice recognition result through the first voice recognition engine to obtain a third voice recognition result;

and recognizing the third voice recognition result through the second voice recognition engine to obtain a fourth voice recognition result.

And when the first voice recognition result meets the preset condition, recognizing the first voice recognition result through the second voice recognition engine to obtain a second voice recognition result.

In one embodiment, the preliminary recognition of the preset number of pieces of speech information by the first speech recognition engine to obtain a first speech recognition result includes:

acquiring various noise information in the noise scene;

extracting various noise characteristics in the various noise information;

extracting the voice characteristics corresponding to the preset number of pieces of voice information;

and the first voice recognition engine filters the preset number of pieces of voice information according to the voice characteristics and the various noise characteristics to obtain the first voice recognition result.

In one embodiment, the recognizing, by the second speech recognition engine, the first speech recognition result to obtain a second speech recognition result includes:

judging whether the first voice recognition result has voice information of a user;

when the first voice recognition result has the voice information of the user, calculating a living body detection score from the first voice recognition result;

judging whether the living body detection score is larger than a preset threshold value or not;

and when the living body detection score is larger than a preset threshold value, identifying the first voice recognition result through the second voice recognition engine to obtain a second voice recognition result.

In one embodiment, the filtering, by the first speech recognition engine, the preset number of pieces of speech information according to the speech features and the various noise features to obtain the first speech recognition result includes:

respectively judging whether the voice characteristics corresponding to the preset number of pieces of voice information are matched with the various noise characteristics;

and the first voice recognition engine filters a plurality of pieces of voice information of which the voice characteristics are matched with the various noise characteristics in the preset number of pieces of voice information to obtain a first voice recognition result.

According to a second aspect of the embodiments of the present invention, there is provided an apparatus for reducing power consumption of a speech recognition engine in a noise scene, including:

the first acquisition module is used for acquiring a first voice recognition engine and a second voice recognition engine, wherein the power consumption of the first voice recognition engine is less than that of the second voice recognition engine;

the second acquisition module is used for acquiring a preset number of pieces of voice information in a noise scene;

the first recognition module is used for carrying out preliminary recognition on the preset number of pieces of voice information through the first voice recognition engine so as to obtain a first voice recognition result;

and the second recognition module is used for recognizing the first voice recognition result through the second voice recognition engine so as to obtain a second voice recognition result.

In one embodiment, the apparatus for reducing power consumption of a speech recognition engine in a noisy scene further comprises:

the judging module is used for judging whether the first voice recognition result meets a preset condition or not;

the third recognition module is used for recognizing the first voice recognition result through the first voice recognition engine when the first voice recognition result does not meet the preset condition so as to obtain a third voice recognition result;

and the fourth recognition module is used for recognizing the third voice recognition result through the second voice recognition engine so as to obtain a fourth voice recognition result.

And the fifth recognition module is used for recognizing the first voice recognition result through the second voice recognition engine when the first voice recognition result meets the preset condition so as to obtain a second voice recognition result.

In one embodiment, the first recognition module includes:

the obtaining submodule is used for obtaining various noise information in the noise scene;

the first extraction submodule is used for extracting various noise characteristics in the various noise information;

the second extraction submodule is used for extracting the voice features corresponding to the preset number of pieces of voice information;

and the filtering submodule is used for filtering the preset number of pieces of voice information by the first voice recognition engine according to the voice characteristics and the various noise characteristics so as to obtain the first voice recognition result.

In one embodiment, the second recognition module includes:

the first judgment submodule is used for judging whether the first voice recognition result has voice information of a user;

the calculation submodule is used for calculating a living body detection score from the first voice recognition result when the first voice recognition result has the voice information of the user;

the second judgment submodule is used for judging whether the living body detection score is larger than a preset threshold value or not;

and the recognition submodule is used for recognizing the first voice recognition result through the second voice recognition engine when the living body detection score is larger than a preset threshold value so as to obtain a second voice recognition result.

In one embodiment, the filtering submodule includes:

the judging unit is used for respectively judging whether the voice characteristics corresponding to the preset number of pieces of voice information are matched with the various noise characteristics;

and the filtering unit is used for filtering a plurality of pieces of voice information of which the voice characteristics are matched with the various noise characteristics in the preset number of pieces of voice information by the first voice recognition engine so as to obtain a first voice recognition result.

The technical scheme provided by the embodiment of the invention has the following beneficial effects:

A first voice recognition engine and a second voice recognition engine are acquired, and then a preset number of pieces of voice information in a noise scene are acquired. The first voice recognition engine preliminarily recognizes the preset number of pieces of voice information to obtain a first voice recognition result, and the second voice recognition engine then recognizes the first voice recognition result to obtain a second voice recognition result. In this technical scheme, the low-power first voice recognition engine performs the preliminary recognition and filters out useless noise, and only the filtered first voice recognition result is passed to the high-power, more capable second voice recognition engine. Because the second voice recognition engine no longer has to recognize useless noise frequently, its working frequency is reduced, power consumption is greatly reduced, and the accuracy of the obtained second voice recognition result is high.
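The power saving described above can be illustrated with a rough energy estimate. The following Python sketch is an illustration only: the function names, the per-clip energy costs, and the pass rate are hypothetical values chosen for the example and do not come from the invention.

```python
def cascade_energy(n_clips: int, pass_rate: float,
                   e_first: float, e_second: float) -> float:
    """Energy when every clip goes through the low-power first engine and
    only the fraction `pass_rate` that survives filtering reaches the
    high-power second engine. Units are arbitrary (assumed)."""
    return n_clips * e_first + n_clips * pass_rate * e_second


def baseline_energy(n_clips: int, e_second: float) -> float:
    """Energy when the high-power engine processes every clip directly."""
    return n_clips * e_second


# Hypothetical numbers: 100 clips, 25% contain useful speech,
# first engine costs 1 unit per clip, second engine costs 8 units per clip.
with_cascade = cascade_energy(100, 0.25, 1.0, 8.0)
without_cascade = baseline_energy(100, 8.0)
```

Under these assumed numbers the cascade is cheaper whenever the first engine's per-clip cost is smaller than the second engine's cost multiplied by the fraction of clips filtered out.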

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a flow chart of a method for reducing power consumption of a speech recognition engine in a noisy scene according to an embodiment of the present invention;

FIG. 2 is a flow chart of another method for reducing power consumption of a speech recognition engine in a noisy scene according to an embodiment of the present invention;

FIG. 3 is a block diagram of an apparatus for reducing power consumption of a speech recognition engine in a noisy scene according to an embodiment of the present invention;

FIG. 4 is a block diagram of another apparatus for reducing power consumption of a speech recognition engine in a noisy scene according to an embodiment of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.

FIG. 1 is a flowchart illustrating a method for reducing power consumption of a speech recognition engine in a noisy scene according to an embodiment of the present invention. As shown in FIG. 1, the method can be implemented as the following steps S11-S14:

In step S11, a first speech recognition engine and a second speech recognition engine are obtained, wherein the power consumption of the first speech recognition engine is less than that of the second speech recognition engine. The first speech recognition engine has few parameters, so its power consumption is very low but its capability is limited; its main function is to filter useless noise out of the speech information. The second speech recognition engine has high power consumption but a good recognition effect.

In step S12, a preset number of pieces of voice information in a noise scene are acquired;

In step S13, the first speech recognition engine performs preliminary recognition on the preset number of pieces of speech information to obtain a first speech recognition result. That is, the first voice recognition engine recognizes the useful voice among the preset number of pieces of voice information and filters out the useless noise; the remaining voice information is the first voice recognition result.

In step S14, the first speech recognition result is recognized by the second speech recognition engine to obtain a second speech recognition result.
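Steps S11-S14 can be sketched as a two-stage cascade. The following Python code is a minimal illustration, not the implementation of the invention: `Clip`, `LowPowerEngine`, `HighPowerEngine`, the `is_noise` flag, and the call counter are hypothetical names introduced only to show that the high-power engine runs solely on the clips that survive the low-power filter.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    """One piece of voice information (hypothetical container)."""
    text: str       # transcript a recognizer would produce
    is_noise: bool  # label used here to simulate the noise filter


class LowPowerEngine:
    """Stand-in for the first engine: cheap, only separates voice from noise."""

    def preliminary_recognize(self, clips: List[Clip]) -> List[Clip]:
        # Step S13: keep useful voice, drop useless noise.
        return [c for c in clips if not c.is_noise]


@dataclass
class HighPowerEngine:
    """Stand-in for the second engine: expensive, performs full recognition."""
    calls: int = 0

    def recognize(self, clips: List[Clip]) -> List[str]:
        # Step S14: count processed clips to make the saved work visible.
        self.calls += len(clips)
        return [c.text for c in clips]


def run_cascade(clips: List[Clip], first: LowPowerEngine,
                second: HighPowerEngine) -> List[str]:
    first_result = first.preliminary_recognize(clips)  # first result
    return second.recognize(first_result)              # second result


# Step S12: a preset number of clips captured in a noise scene (made up here).
clips = [Clip("turn on the light", False), Clip("", True), Clip("", True)]
second = HighPowerEngine()
result = run_cascade(clips, LowPowerEngine(), second)
```

In this example the second engine processes one clip instead of three, which is the effect the method relies on to reduce power consumption.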

In one embodiment, whether the first voice recognition result meets a preset condition is judged; the preset condition may be, but is not limited to, that no noise information exists in the first speech recognition result.

When the first voice recognition result does not meet the preset condition, the first voice recognition result is recognized through the first voice recognition engine to obtain a third voice recognition result. That is, when noise information still exists in the first voice recognition result, the first voice recognition engine recognizes the first voice recognition result again so as to filter out the remaining noise information.

After judging whether the first voice recognition result meets the preset condition, the subsequent recognition operation is carried out according to the judgment. This prevents the second voice recognition engine from recognizing useless voice information, reduces use of the high-power second voice recognition engine, and thus reduces power consumption.
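The preset-condition check can be sketched as a small loop around the low-power filter. This Python example is an assumption-laden illustration: `filter_until_clean`, `drop_one_noise`, `has_noise`, and the `max_passes` bound are hypothetical. The embodiment above describes a single re-recognition; the bound merely generalizes it and guarantees termination.

```python
from typing import Callable, List, Tuple


def filter_until_clean(
    clips: List[str],
    filter_once: Callable[[List[str]], List[str]],
    has_noise: Callable[[List[str]], bool],
    max_passes: int = 3,  # assumed safety bound, not from the source
) -> Tuple[List[str], int]:
    """Re-apply the low-power filter while the preset condition fails."""
    result = filter_once(clips)       # first voice recognition result
    passes = 1
    while has_noise(result) and passes < max_passes:
        result = filter_once(result)  # third voice recognition result
        passes += 1
    return result, passes


def drop_one_noise(clips: List[str]) -> List[str]:
    """Toy filter that removes only one noise item per pass."""
    out = list(clips)
    if "noise" in out:
        out.remove("noise")
    return out


def has_noise(clips: List[str]) -> bool:
    """Preset condition fails while any noise item remains."""
    return "noise" in clips


cleaned, passes = filter_until_clean(
    ["noise", "hello", "noise"], drop_one_noise, has_noise
)
```

Only the cleaned result would then be handed to the second engine, so the expensive engine never sees the clips removed in either pass.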

As shown in FIG. 2, in one embodiment, the above step S13 can be implemented as the following steps S131-S134:

in step S131, various noise information in a noise scene is acquired; the noise information refers to useless sounds, such as the cry of an animal, the booming of a machine, and the like.

In step S132, various noise features in various noise information are extracted;

in step S133, extracting respective corresponding voice features of a preset number of pieces of voice information;

in step S134, the first speech recognition engine filters a preset number of pieces of speech information according to the speech characteristics and various noise characteristics to obtain a first speech recognition result.

The noise features are extracted from the acquired noise information, and the voice features are extracted from the voice information. The first voice recognition engine uses the voice features and the noise features to filter out the noise information; the remaining voice information is the first voice recognition result. Filtering on the basis of both the voice features and the noise features prevents useful voice information from being filtered out.
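The description does not specify which features are extracted in steps S132 and S133. As one possible illustration, the Python sketch below computes two simple features per clip, RMS energy and zero-crossing rate; the choice of features and the name `extract_features` are assumptions made for this example.

```python
import math
from typing import List, Tuple


def extract_features(samples: List[float]) -> Tuple[float, float]:
    """Return (RMS energy, zero-crossing rate) for one clip of audio samples.

    These two features are an assumed stand-in for the unspecified
    "voice features" and "noise features" of steps S132 and S133.
    """
    if len(samples) < 2:
        return (0.0, 0.0)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # A zero crossing occurs when two consecutive samples differ in sign.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    zcr = crossings / (len(samples) - 1)
    return (rms, zcr)


alternating = extract_features([1.0, -1.0, 1.0, -1.0])
constant = extract_features([0.5, 0.5, 0.5, 0.5])
```

The same function can be applied both to the recorded noise information and to each piece of voice information, so that the two sets of features are directly comparable.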

In one embodiment, when the first voice recognition result contains the user's voice information, a living body detection score is calculated from the first voice recognition result and compared with a preset threshold value. When the living body detection score is larger than the preset threshold value, the second voice recognition engine recognizes the first voice recognition result. The living body detection prevents unnecessary recognition work by the second voice recognition engine and avoids wasting resources.
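The living body detection gate can be sketched as follows. This Python example assumes that living body detection means checking that the voice comes from a live speaker; the function `recognize_if_live`, the scoring callbacks, and the default threshold of 0.5 are hypothetical and serve only to show the control flow.

```python
from typing import Callable, List, Optional


def recognize_if_live(
    first_result: List[str],
    has_user_voice: Callable[[List[str]], bool],
    liveness_score: Callable[[List[str]], float],
    second_engine: Callable[[List[str]], List[str]],
    threshold: float = 0.5,  # assumed value; the source only says "preset"
) -> Optional[List[str]]:
    """Invoke the high-power engine only when both checks pass."""
    # No user voice in the first result: nothing to recognize.
    if not has_user_voice(first_result):
        return None
    # Score not larger than the threshold: skip the expensive engine.
    if liveness_score(first_result) <= threshold:
        return None
    return second_engine(first_result)


calls: List[List[str]] = []


def second_engine(clips: List[str]) -> List[str]:
    """Toy second engine that records each invocation."""
    calls.append(clips)
    return [c.upper() for c in clips]


live = recognize_if_live(
    ["open door"], lambda r: bool(r), lambda r: 0.9, second_engine
)
rejected = recognize_if_live(
    ["open door"], lambda r: bool(r), lambda r: 0.2, second_engine
)
```

The second call returns without invoking the second engine, which is how the gate avoids unnecessary high-power recognition.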

In one embodiment, whether the voice features corresponding to each of the preset number of pieces of voice information match the various noise features is judged respectively. When they match, the matched voice information is determined to be noise and is filtered out by the first voice recognition engine; the remaining voice information is the first voice recognition result. This matching mechanism prevents useful voice information from being filtered out.

Corresponding to the method for reducing the power consumption of the speech recognition engine in the noise scene provided by the embodiment of the present invention, an embodiment of the present invention also provides a device for reducing the power consumption of the speech recognition engine in the noise scene. As shown in FIG. 3, the device includes:

a first obtaining module 31, configured to obtain a first speech recognition engine and a second speech recognition engine, where power consumption of the first speech recognition engine is less than that of the second speech recognition engine;
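The matching step can be sketched as a nearest-feature test. The description does not define what "matched" means, so this Python example assumes a Euclidean distance below a tolerance; `matches_noise`, `filter_noise`, the tolerance `tol`, and the sample feature values are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Feature = Tuple[float, float]


def matches_noise(voice_feat: Feature, noise_feats: Sequence[Feature],
                  tol: float = 0.1) -> bool:
    """True if the clip's feature lies within `tol` of any noise feature."""
    return any(math.dist(voice_feat, n) <= tol for n in noise_feats)


def filter_noise(clips: List[Tuple[str, Feature]],
                 noise_feats: Sequence[Feature],
                 tol: float = 0.1) -> List[str]:
    """Drop clips whose features match a noise feature; keep the rest."""
    return [
        name for name, feat in clips
        if not matches_noise(feat, noise_feats, tol)
    ]


# Made-up features: one known noise profile and two captured clips.
noise_feats = [(1.0, 1.0)]
clips = [("machine hum", (1.0, 0.95)), ("user speech", (0.4, 0.2))]
first_result = filter_noise(clips, noise_feats)
```

Because a clip is removed only when it matches a known noise feature, clips that resemble none of the noise profiles are kept, which reflects the stated goal of not filtering out useful voice information.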

a second obtaining module 32, configured to obtain a preset number of pieces of voice information in a noise scene;

the first recognition module 33 is configured to perform preliminary recognition on the preset number of pieces of speech information by using the first speech recognition engine to obtain a first speech recognition result;

and a second recognition module 34, configured to recognize the first speech recognition result through the second speech recognition engine to obtain a second speech recognition result.

As shown in FIG. 4, in one embodiment, the first recognition module 33 includes:

the obtaining sub-module 331 is configured to obtain various noise information in the noise scene;

a first extraction submodule 332, configured to extract various noise features in the various noise information;

the second extraction submodule 333 is configured to extract respective voice features corresponding to the preset number of pieces of voice information;

and a filtering submodule 334, configured to filter, by the first speech recognition engine, the preset number of pieces of speech information according to the speech features and the various noise features, so as to obtain the first speech recognition result.

In one embodiment, the second recognition module includes:

In one embodiment, the filtering submodule includes:

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for reducing power consumption of a speech recognition engine in a noisy scene, comprising:

acquiring a preset number of pieces of voice information in a noise scene;

2. The method of claim 1, further comprising:

recognizing the third voice recognition result through the second voice recognition engine to obtain a fourth voice recognition result;

3. The method of claim 1, wherein the preliminary recognizing, by the first speech recognition engine, the preset number of pieces of speech information to obtain a first speech recognition result comprises:

acquiring various noise information in the noise scene;

extracting various noise characteristics in the various noise information;

4. The method of claim 1, wherein said recognizing, by the second speech recognition engine, the first speech recognition result to obtain a second speech recognition result comprises:

5. The method of claim 3, wherein the first speech recognition engine filtering the preset number of pieces of speech information according to the speech characteristics and the various noise characteristics to obtain the first speech recognition result, comprising:

6. An apparatus for reducing power consumption of a speech recognition engine in a noisy scene, comprising:

7. The apparatus of claim 6, further comprising:

the fourth recognition module is used for recognizing the third voice recognition result through the second voice recognition engine to obtain a fourth voice recognition result;

8. The apparatus of claim 6, wherein the first recognition module comprises:

9. The apparatus of claim 6, wherein the second recognition module comprises:

10. The apparatus of claim 8, wherein the filtering submodule comprises: