WO2017024835A1

WO2017024835A1 - Voice recognition method and device

Info

Publication number: WO2017024835A1
Application number: PCT/CN2016/082079
Authority: WO
Inventors: 曾一庭
Original assignee: 中兴通讯股份有限公司 (ZTE Corporation)
Priority date: 2015-08-13
Filing date: 2016-05-13
Publication date: 2017-02-16
Also published as: CN106469553A

Abstract

A voice recognition method and device. The method comprises: acquiring and recognizing voice information (101); judging whether the voice information is consistent with a pre-extracted sound feature of a predetermined user (102); and when a judging result is that the voice information is consistent with the sound feature, determining the voice information to be the voice information of the predetermined user (103).

Description

Speech recognition method and device

Technical field

This application relates to, but is not limited to, the field of communication technology.

Background art

With the release of Apple's voice assistant (Siri), intelligent voice applications have entered a period of explosive development. For voice applications, the recognition success rate is an important performance indicator. In the related art, speech recognition acquires sound input and then performs recognition on that input. However, voice applications in the related art cannot distinguish whether the sound is the user speaking, ambient noise, or other people's voices, which causes a problem. In a quiet environment, the recognition success rate of voice applications is very high. In some actual usage scenarios, however, sudden ambient noise or other people's voices can trigger the voice application to start recognition, resulting in false triggering and a significant decline in the recognition success rate.

Speech recognition in the related art usually has a concept of confidence, that is, the degree to which the recording of the user's speech matches the standard data preset in the engine after training on a large amount of data; the higher the confidence, the more accurate the match. A voice application sets a confidence level as its standard according to its own circumstances. Above this standard, the recognition is treated as correct; below this standard, the recognition is treated as incorrect.

Because the voice application judges the success or failure of recognition by confidence, setting the threshold involves a trade-off. If the confidence threshold is set low, recognition is easy: the user's command does not need to be spoken in a very standard way or very loudly for a recognition result to be obtained. However, surrounding noise is then more easily recognized as the user's voice, which leads to misrecognition and reduces the recognition rate. If the confidence threshold is set high, recognition is accurate and noise has little effect, but the user's command must be spoken in a very standard way and loudly to be recognized successfully. Often the user has spoken clearly but still does not pass the confidence threshold, resulting in recognition failure.
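The trade-off described above can be illustrated with a minimal sketch. The confidence scores, the labels, and the helper name `count_errors` are hypothetical and chosen only for illustration; they are not data or names from this application.

```python
# Hypothetical (confidence, is_user_command) pairs; illustrative only.
SAMPLES = [
    (0.95, True),   # clear user command
    (0.70, True),   # user command spoken softly
    (0.55, True),   # user command spoken in a non-standard way
    (0.60, False),  # ambient noise
    (0.40, False),  # another person's voice
]


def count_errors(samples, threshold):
    """Return (false_accepts, false_rejects) for a confidence threshold.

    A sample is accepted when its confidence is greater than the threshold.
    """
    false_accepts = sum(
        1 for conf, is_cmd in samples if conf > threshold and not is_cmd
    )
    false_rejects = sum(
        1 for conf, is_cmd in samples if conf <= threshold and is_cmd
    )
    return false_accepts, false_rejects


low = count_errors(SAMPLES, 0.5)   # permissive threshold
high = count_errors(SAMPLES, 0.8)  # strict threshold
```

With these made-up samples, the permissive threshold lets one noise sample through, while the strict threshold rejects two genuine commands. Confidence alone cannot remove both kinds of error.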

Speech recognition based on confidence cannot distinguish between commands spoken by the user and the voices of other people. In actual usage scenarios, such as a driving environment, other people speaking can easily cause the voice application to start recognizing in error, which reduces the recognition rate.

Summary of the invention

The following is an overview of the topics detailed in this document. This Summary is not intended to limit the scope of the claims.

No effective solution has yet been proposed for the problem in the related art that speech recognition is affected by other sounds and the false recognition rate is high.

This document provides a speech recognition method and device to solve the problem in the related art that speech recognition is affected by other sounds and the false recognition rate is high.

A speech recognition method comprising:

acquiring and recognizing voice information;

judging whether the voice information matches a pre-extracted sound feature of a predetermined user;

when the judgment result is that the voice information is consistent with the sound feature, determining that the voice information is voice information of the predetermined user.

Optionally, after the determining that the voice information is the voice information of the predetermined user, the method further includes:

Determining whether the confidence of the voice information is greater than a preset threshold;

When the result of the determination is that the confidence of the voice information is greater than the preset threshold, determining that the voice information is an instruction issued by the predetermined user;

When the result of the determination is that the confidence of the voice information is less than or equal to the preset threshold, the voice information is discarded.

Optionally, after the determining that the voice information is an instruction issued by the predetermined user, the method further includes: executing an instruction corresponding to the voice information.

Optionally, before the determining whether the voice information matches the voice feature of the predetermined user that is extracted in advance, the method further includes:

Extracting the sound characteristics of the recording by repeatedly acquiring the same recording;

The extracted sound features are saved.

Optionally, before the saving the extracted sound feature, the method further comprises: determining that the confidence of the sound feature exceeds a preset threshold.

A speech recognition device comprising:

an acquisition module, configured to: acquire and recognize voice information;

a judging module, configured to: judge whether the voice information acquired by the acquisition module matches a pre-extracted sound feature of a predetermined user;

and a determining module, configured to: when the judgment result of the judging module is that the voice information is consistent with the sound feature, determine that the voice information is voice information of the predetermined user.

Optionally, the device further includes a discarding module;

the judging module is further configured to: judge whether the confidence of the voice information is greater than a preset threshold;

the determining module is further configured to: when the judgment result of the judging module is that the confidence of the voice information is greater than the preset threshold, determine that the voice information is an instruction issued by the predetermined user;

the discarding module is configured to: discard the voice information when the judgment result of the judging module is that the confidence of the voice information is less than or equal to the preset threshold.

Optionally, the device further includes: an execution module, configured to: after the determining module determines that the voice information is an instruction issued by the predetermined user, execute an instruction corresponding to the voice information.

Optionally, the device further includes: a repeated acquisition module, configured to: before the judging module judges whether the voice information matches the pre-extracted sound feature of the predetermined user, extract the sound feature of a recording by repeatedly acquiring the same recording;

and a saving module, configured to: save the sound feature extracted by the repeated acquisition module.

Optionally, the determining module is further configured to: before the saving module saves the extracted sound feature, determine that the confidence of the sound feature is greater than the preset threshold.

In the voice recognition method and device provided by the embodiments of the present invention, voice information is acquired and recognized, it is judged whether the voice information matches a pre-extracted sound feature of a predetermined user, and, when the judgment result is that the voice information is consistent with the sound feature, the voice information is determined to be the voice information of the predetermined user. The embodiments of the present invention solve the problem that the false recognition rate is high due to the influence of other sounds in the speech recognition process, and thereby reduce the false recognition rate.

Other aspects will be apparent upon reading and understanding the drawings and detailed description.

Brief description of the drawings

FIG. 1 is a flowchart of a voice recognition method according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a voice recognition apparatus according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of another voice recognition apparatus according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of still another voice recognition apparatus according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of extracting a sound feature in a voice recognition method according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of voice recognition in a voice recognition method according to an embodiment of the present invention.

Embodiments of the invention

Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted that, provided there is no conflict, the embodiments herein and the features in the embodiments may be arbitrarily combined with each other.

The steps illustrated in the flowcharts of the figures may be executed in a computer system, for example as a set of computer-executable instructions. Also, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the one described herein.

An embodiment of the present invention provides a voice recognition method. FIG. 1 is a flowchart of a voice recognition method according to an embodiment of the present invention. As shown in FIG. 1, the voice recognition method provided in this embodiment includes the following steps, namely step 101 to step 103:

Step 101: Acquire and recognize voice information.

Step 102: Judge whether the voice information matches the pre-extracted sound feature of the predetermined user.

Step 103: When the judgment result is that the voice information is consistent with the sound feature, determine that the voice information is voice information of the predetermined user.

Through the steps of the flow shown in FIG. 1, voice information is acquired and recognized, it is judged whether the voice information matches the pre-extracted sound feature of the predetermined user, and, when the judgment result is that the voice information is consistent with the sound feature, the voice information is determined to be the voice information of the predetermined user. The method provided in this embodiment solves the problem that the false recognition rate is high due to the influence of other sounds in the voice recognition process, and thereby reduces the false recognition rate.
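Steps 102 and 103 can be sketched as follows. The application does not specify how a sound feature is represented or compared. This sketch assumes fixed-length numeric feature vectors compared by cosine similarity, and the function names, the sample vectors, and the 0.8 default threshold are all illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as matching nothing
    return dot / (norm_a * norm_b)


def is_predetermined_user(voice_feature, enrolled_feature, match_threshold=0.8):
    """Step 102 (judge the match) and step 103 (decide whose voice it is).

    Assumption: "consistent with the sound feature" is modelled as the
    similarity being greater than match_threshold.
    """
    return cosine_similarity(voice_feature, enrolled_feature) > match_threshold


# Hypothetical feature vectors, for illustration only.
enrolled = [0.9, 0.1, 0.4]         # pre-extracted feature of the user
same_speaker = [0.85, 0.15, 0.38]  # new utterance from the same user
other_sound = [0.1, 0.9, 0.2]      # noise or another speaker
```

Here `is_predetermined_user(same_speaker, enrolled)` passes the check and `is_predetermined_user(other_sound, enrolled)` does not, so only the first utterance would be treated as the predetermined user's voice information.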

Optionally, in another embodiment of the present invention, after determining that the voice information is the voice information of the predetermined user, the method further includes: judging whether the confidence of the voice information is greater than a preset threshold; when the judgment result is that the confidence of the voice information is greater than the preset threshold, determining that the voice information is an instruction issued by the predetermined user; and when the judgment result is that the confidence of the voice information is less than or equal to the preset threshold, discarding the voice information.

Optionally, in another embodiment of the present invention, after determining that the voice information is an instruction issued by the predetermined user, the method further includes: executing an instruction corresponding to the voice information, where the operation performed is, for example, triggering an application according to the instruction.
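The confidence check and the execution step described in the two preceding paragraphs can be sketched together. The function name `dispatch_instruction`, the handler table, the sample command text, and the 0.6 default threshold are hypothetical and do not come from this application.

```python
def dispatch_instruction(text, confidence, handlers, preset_threshold=0.6):
    """Execute the handler for `text` if the confidence check passes.

    Returns the handler's result, or None when the voice information is
    discarded or no handler is registered for the recognized text.
    """
    if confidence <= preset_threshold:
        return None  # less than or equal to the threshold: discard
    handler = handlers.get(text)
    if handler is None:
        return None  # recognized text does not map to a known instruction
    return handler()


# Hypothetical instruction table: recognized text -> action to trigger.
handlers = {"open navigation": lambda: "navigation started"}
```

Calling `dispatch_instruction("open navigation", 0.9, handlers)` runs the registered action, whereas a confidence equal to the threshold is discarded, matching the "less than or equal to" condition in the text.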

Optionally, in another embodiment of the present invention, before judging whether the voice information matches the pre-extracted sound feature of the predetermined user, the method further includes: extracting the sound feature of a recording by repeatedly acquiring the same recording; and saving the extracted sound feature.

Optionally, before saving the extracted sound feature, the embodiment further includes: determining that the confidence of the sound feature is greater than the preset threshold.
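The enrollment flow above can be sketched as follows. The application does not define how the confidence of a sound feature is computed. This sketch assumes it is the lowest cosine similarity between any single recording's feature vector and the averaged feature. The function names, the dictionary used as a feature library, and the 0.8 default threshold are illustrative assumptions.

```python
import math


def _cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def enroll(recording_features, library, user_id, preset_threshold=0.8):
    """Average per-recording features and save them if they are consistent.

    recording_features: one feature vector per repetition of the same text.
    Returns True when the averaged feature was saved to `library`.
    """
    if not recording_features:
        return False
    n = len(recording_features)
    mean = [sum(column) / n for column in zip(*recording_features)]
    # Assumed confidence measure: worst agreement with the averaged feature.
    confidence = min(_cosine(f, mean) for f in recording_features)
    if confidence > preset_threshold:
        library[user_id] = mean
        return True
    return False
```

A possible usage: three near-identical repetitions such as `[[0.9, 0.1, 0.4], [0.88, 0.12, 0.41], [0.91, 0.09, 0.39]]` are saved, while two very different vectors are rejected so that an unreliable feature is not stored.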

An embodiment of the present invention further provides a voice recognition device. FIG. 2 is a schematic structural diagram of a voice recognition device according to an embodiment of the present invention. As shown in FIG. 2, the voice recognition device 20 provided in this embodiment may include: an acquisition module 21, a judging module 22 and a determining module 23.

The acquisition module 21 is configured to: acquire and recognize voice information;

The judging module 22 is configured to: judge whether the voice information acquired by the acquisition module 21 matches the pre-extracted sound feature of the predetermined user;

The determining module 23 is configured to: when the judgment result of the judging module 22 is that the voice information is consistent with the sound feature, determine that the voice information is the voice information of the predetermined user.

Optionally, FIG. 3 is a schematic structural diagram of another voice recognition apparatus according to an embodiment of the present invention. Based on the device shown in FIG. 2 above, the voice recognition device 20 provided in this embodiment may further include a discarding module 24.

The judging module 22 is further configured to: judge whether the confidence of the voice information is greater than a preset threshold;

The determining module 23 is further configured to: when the judgment result of the judging module 22 is that the confidence of the voice information is greater than the preset threshold, determine that the voice information is an instruction issued by the predetermined user;

The discarding module 24 is configured to: discard the voice information when the judgment result of the judging module 22 is that the confidence of the voice information is less than or equal to the preset threshold.

Optionally, the apparatus further includes: an execution module 25, configured to: after the determining module 23 determines that the voice information is an instruction issued by the predetermined user, execute an instruction corresponding to the voice information.

Optionally, the device further includes: a repeated acquisition module 26, configured to: before the judging module 22 judges whether the voice information matches the pre-extracted sound feature of the predetermined user, extract the sound feature of a recording by repeatedly acquiring the same recording; and a saving module 27, configured to: save the sound feature extracted by the repeated acquisition module 26.

Optionally, the determining module 23 in the device is further configured to: before the saving module 27 saves the extracted sound feature, determine that the confidence of the sound feature is greater than a preset threshold.

To address the above problems in the related art, a description is given below with reference to specific optional embodiments, which combine the optional embodiments described above and their optional implementations.

Optionally, FIG. 4 is a schematic structural diagram of still another voice recognition apparatus according to an embodiment of the present invention. The voice recognition device 30 provided in this embodiment includes the following parts: a voiceprint extraction module 31, a voiceprint feature library 32, a voiceprint discrimination module 33, a voice recognition module 34, a control module 35, and a recording management module 36. Together, these parts implement the functions of some or all of the acquisition module 21, the judging module 22, the determining module 23, the discarding module 24, the execution module 25, the repeated acquisition module 26, and the saving module 27.

The voiceprint extraction module 31 is configured to: extract the user's voice features when the user trains the voiceprint.

The voiceprint feature library 32 is configured to: store the voice features of the user extracted by the voiceprint extraction module 31, and provide them to subsequent modules for use.

The voiceprint discriminating module 33 is configured to determine whether it is the voice of the current user according to the user voice data provided by the recording management module 36.

The voice recognition module 34 is configured to: perform voice recognition on the user voice data provided by the recording management module 36, and convert the voice into text.

The control module 35 is configured to: control the overall logic.

The recording management module 36 is configured to: manage the system recordings, and provide them to the voiceprint discriminating module 33 and the voice recognition module 34, respectively.

An embodiment of the invention further provides a way of using the voiceprint to improve the speech recognition rate, which is described below through an optional embodiment.

FIG. 5 is a schematic diagram of extracting a sound feature in a voice recognition method according to an embodiment of the present invention. As shown in FIG. 5, when the user uses the system for the first time, the voiceprint extraction module 31 is used to extract the user's voice features. For example, because the user's voice features need to be extracted, the user is required to read a certain piece of text aloud repeatedly, and the extracted sound features are then saved to the voiceprint feature library 32.

FIG. 6 is a schematic diagram of voice recognition in a voice recognition method according to an embodiment of the present invention. As shown in FIG. 6, the user triggers speech recognition, and the recording management module 36 sends the system recording separately to the voiceprint discrimination module 33 and the voice recognition module 34. The results of the voiceprint discrimination module 33 and the voice recognition module 34 are provided to the control module 35, which makes the decision. The control module 35 first judges whether the voiceprint discrimination result conforms to the user's voice feature. If not, the system recording is noise or a surrounding voice; the control module 35 discards the voice recognition result and notifies the recording management module 36 to continue recording. If the voiceprint discrimination is passed, the control module 35 judges whether the confidence reported by the voice recognition module 34 is greater than a threshold. If the confidence is less than or equal to the threshold, this indicates that the sound is the user's voice but is not necessarily a voice command, so the control module 35 discards the result and notifies the recording management module 36 to continue recording. If both checks are passed, the correct result is returned to the subsequent application process.

Once the user has pre-trained the voiceprint, the system has recorded the voiceprint feature. When the application is triggered by the user's voice or by noise, the recording management module 36 starts recording and distributes the recording to the speech recognition module 34 and the voiceprint discrimination module 33, and the control module 35 waits for the speech recognition module 34 and the voiceprint discrimination module 33 to each give a result.

When the control module 35 receives the result returned by the voiceprint discrimination module 33, the control module 35 judges whether the voiceprint matching degree reaches a threshold; for example, the threshold is 80%. The threshold can be set by the user or preset by the system. If the control module 35 judges that the voiceprint matching degree does not exceed the threshold, the control module 35 discards the result returned by the voice recognition module 34 and at the same time notifies the recording management module 36 to continue recording and wait for a correct result.

When the control module 35 determines that the voiceprint matching degree is greater than the threshold, it goes on to judge whether the confidence of the voice recognition result is greater than the confidence threshold. If the confidence does not exceed the threshold, the result is still discarded. If it passes, the result is returned to the back-end process or module for use.
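The two-stage decision made by the control module 35 can be sketched as below. The 80% voiceprint threshold is the example value given in the text. The `Decision` type, the function name `control_decide`, the reason strings, and the 0.6 confidence threshold are hypothetical names and values introduced only for this sketch.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    accepted: bool
    text: Optional[str]
    reason: str


def control_decide(voiceprint_match, recognized_text, recognition_confidence,
                   voiceprint_threshold=0.8, confidence_threshold=0.6):
    """Two-stage check modelled on the description of control module 35."""
    # Stage 1: voiceprint. Failing means noise or someone else's voice, so
    # the recognition result is discarded and recording should continue.
    if voiceprint_match <= voiceprint_threshold:
        return Decision(False, None, "voiceprint mismatch: continue recording")
    # Stage 2: recognition confidence. The user spoke, but the utterance is
    # not necessarily a command, so a low-confidence result is discarded.
    if recognition_confidence <= confidence_threshold:
        return Decision(False, None, "low confidence: continue recording")
    # Both checks passed: hand the result to the back-end process.
    return Decision(True, recognized_text, "accepted")
```

Checking the voiceprint first means that a confidently recognized phrase from another speaker is still rejected, which is the case that confidence alone cannot handle.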

The user connects the wireless router to the wired network and turns on the power switch. The user then uses a mobile phone to search for the wireless router's Bluetooth and pair with it. After pairing is completed, the user opens the settings program on the mobile phone and sets the hotspot, encryption mode, password, and the access mode of the wide area network (WAN) port. Once the settings are applied successfully, they take effect on the router. Such a device is generally intended for business travelers who often change hotels and need a wireless routing device that is portable and convenient to set up.

One of ordinary skill in the art will appreciate that all or some of the steps of the above embodiments can be implemented by a computer program flow. The computer program can be stored in a computer-readable storage medium and is executed on a corresponding hardware platform (e.g., a system, device, apparatus, or component); when executed, it includes one or a combination of the steps of the method embodiments.

Alternatively, all or some of the steps of the above embodiments may also be implemented using integrated circuits. These steps may be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module.

The devices/function modules/functional units in the above embodiments may be implemented by a general-purpose computing device, which may be centralized on a single computing device or distributed over a network of multiple computing devices.

When the device/function module/functional unit in the above embodiment is implemented in the form of a software function module and sold or used as a stand-alone product, it can be stored in a computer readable storage medium. The above mentioned computer readable storage medium may be a read only memory, a magnetic disk or an optical disk or the like.

Industrial applicability

In the embodiments of the present invention, voice information is acquired and recognized, it is judged whether the voice information matches the pre-extracted sound feature of the predetermined user, and, when the judgment result is that the voice information matches the sound feature, the voice information is determined to be the voice information of the predetermined user. The embodiments of the present invention solve the problem that the false recognition rate is high due to the influence of other sounds in the voice recognition process, and thereby reduce the false recognition rate.

Claims

1. A speech recognition method comprising:

acquiring and recognizing voice information;

judging whether the voice information matches a pre-extracted sound feature of a predetermined user;

when the judgment result is that the voice information is consistent with the sound feature, determining that the voice information is voice information of the predetermined user.

2. The method of claim 1, wherein after the determining that the voice information is voice information of the predetermined user, the method further comprises:

judging whether the confidence of the voice information is greater than a preset threshold;

when the judgment result is that the confidence of the voice information is greater than the preset threshold, determining that the voice information is an instruction issued by the predetermined user;

when the judgment result is that the confidence of the voice information is less than or equal to the preset threshold, discarding the voice information.

3. The method of claim 2, wherein after the determining that the voice information is an instruction issued by the predetermined user, the method further comprises:

executing an instruction corresponding to the voice information.

4. The method according to claim 1, wherein before the judging whether the voice information matches a pre-extracted sound feature of a predetermined user, the method further comprises:

extracting the sound feature of a recording by repeatedly acquiring the same recording;

saving the extracted sound feature.

5. The method of claim 4, wherein before the saving the extracted sound feature, the method further comprises:

determining that the confidence of the sound feature is greater than a preset threshold.

6. A speech recognition device comprising:

an acquisition module, configured to: acquire and recognize voice information;

a judging module, configured to: judge whether the voice information acquired by the acquisition module matches a pre-extracted sound feature of a predetermined user;

and a determining module, configured to: when the judgment result of the judging module is that the voice information is consistent with the sound feature, determine that the voice information is voice information of the predetermined user.

7. The apparatus of claim 6, further comprising a discarding module;

the judging module is further configured to: judge whether the confidence of the voice information is greater than a preset threshold;

the determining module is further configured to: when the judgment result of the judging module is that the confidence of the voice information is greater than the preset threshold, determine that the voice information is an instruction issued by the predetermined user;

the discarding module is configured to: discard the voice information when the judgment result of the judging module is that the confidence of the voice information is less than or equal to the preset threshold.

8. The apparatus of claim 7, further comprising:

an execution module, configured to: after the determining module determines that the voice information is an instruction issued by the predetermined user, execute an instruction corresponding to the voice information.

9. The apparatus of claim 6, further comprising:

a repeated acquisition module, configured to: before the judging module judges whether the voice information matches the pre-extracted sound feature of the predetermined user, extract the sound feature of a recording by repeatedly acquiring the same recording;

and a saving module, configured to: save the sound feature extracted by the repeated acquisition module.

10. The apparatus according to claim 9, wherein

the determining module is further configured to: before the saving module saves the extracted sound feature, determine that the confidence of the sound feature is greater than the preset threshold.