CN108831477B

CN108831477B - Voice recognition method, device, equipment and storage medium

Info

Publication number: CN108831477B
Application number: CN201810615353.0A
Authority: CN
Inventors: 许超
Original assignee: Mobvoi Information Technology Co Ltd
Current assignee: Mobvoi Information Technology Co Ltd
Priority date: 2018-06-14
Filing date: 2018-06-14
Publication date: 2021-07-09
Anticipated expiration: 2038-06-14
Also published as: CN108831477A

Abstract

Embodiments of the invention disclose a voice recognition method, apparatus, device, and storage medium. The method comprises: when a terminal device determines, while in a sleep state, that voice information to be played is about to be played, acquiring a word set corresponding to the voice information to be played, the terminal device having enabled its wake-up word wake-up function in advance; and performing, by the terminal device, wake-up word detection on received voice information according to the similarity between the word set and a preset wake-up word. With this technical solution, voice information played in the environment can be shielded in a targeted manner during wake-up recognition, based on the voice information to be played, so that false wake-ups are avoided, wake-up recognition is optimized, and user experience is improved.

Description

Voice recognition method, device, equipment and storage medium

Technical Field

Embodiments of the invention relate to intelligent terminal technology, and in particular to a voice recognition method, apparatus, device, and storage medium.

Background

With the continuous progress of science and technology, voice control technology is becoming widespread. Most intelligent terminals now include a dialog system that supports voice interaction. Interacting with the intelligent terminal by voice through the dialog system makes the terminal simpler and more convenient to operate.

In the prior art, a fixed wake-up word is used to wake up the dialog system before each interaction, and voice interaction is performed after the system enters the wake-up state.

In the course of implementing the invention, the inventor found the following defect in the prior art: the user may play voice information, for example an audiobook, through the intelligent terminal or an associated intelligent terminal. When the played voice information contains content similar to the wake-up word, false wake-ups occur easily. That is, although the user has no intention of waking up the dialog system of the intelligent terminal, content in the voice information played in the environment that resembles the wake-up word is recognized as the wake-up word. The dialog system is then woken up by mistake and starts a voice interaction, which disturbs the user.

Disclosure of Invention

The invention provides a voice recognition method, apparatus, device, and storage medium that shield, in a targeted manner, voice information played in the environment when the intelligent terminal performs wake-up recognition, so that false wake-ups are avoided and user experience is improved.

In a first aspect, an embodiment of the present invention provides a speech recognition method, including:

when a terminal device determines, while in a sleep state, that voice information to be played is about to be played, acquiring a word set corresponding to the voice information to be played; and

performing, by the terminal device, wake-up word detection on received voice information according to the similarity between the word set and a preset wake-up word.

In a second aspect, an embodiment of the present invention further provides a speech recognition apparatus, including:

a word set acquisition module, configured to acquire a word set corresponding to voice information to be played when the terminal device determines, while in a sleep state, that the voice information to be played is about to be played; and

a wake-up word detection module, configured to cause the terminal device to perform wake-up word detection on received voice information according to the similarity between the word set and a preset wake-up word.

In a third aspect, an embodiment of the present invention further provides an apparatus, including:

one or more processors;

a storage device for storing one or more programs,

when the one or more programs are executed by the one or more processors, the one or more processors implement the speech recognition method provided by the embodiments of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the speech recognition method provided by the embodiment of the present invention.

According to the technical solution of the embodiments of the invention, when the terminal device determines, while in a sleep state, that voice information to be played is about to be played, it acquires the word set corresponding to the voice information to be played and performs wake-up word detection on received voice information according to the similarity between the word set and the preset wake-up word. During wake-up recognition, voice information played in the environment can thus be shielded in a targeted manner based on the voice information to be played, so that false wake-ups are avoided, wake-up recognition is optimized, and user experience is improved.

Drawings

Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention;

Fig. 2 is a flowchart of a speech recognition method according to a second embodiment of the present invention;

Fig. 3 is a flowchart of a speech recognition method according to a third embodiment of the present invention;

Fig. 4 is a flowchart of a speech recognition method according to a fourth embodiment of the present invention;

Fig. 5 is a block diagram of a speech recognition apparatus according to a fifth embodiment of the present invention;

Fig. 6 is a schematic structural diagram of an apparatus according to a sixth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.

Example one

Fig. 1 is a flowchart of a speech recognition method according to a first embodiment of the present invention. This embodiment is applicable to recognizing a speech signal. The method can be executed by a speech recognition apparatus, which is implemented in software and/or hardware and can generally be integrated in a terminal device. Terminal devices include, but are not limited to, computers. For example, the terminal device may be a smart watch, a smartphone, a smart bracelet, a smart speaker, a smart television, or the like. As shown in Fig. 1, the method includes the following steps:

Step 101, when the terminal device determines, while in a sleep state, that voice information to be played is about to be played, a word set corresponding to the voice information to be played is acquired.

The terminal device is in the sleep state when the user is not using it. When the user needs to use it, the terminal device is woken up and enters the working state from the sleep state.

The wake-up word is one or more words preset by the user or set by the system, for example: "you get a question". The wake-up word is indivisible and must be spoken continuously. For example, if the user says "if you are good, the user learns well, and asks questions", where the parts of the wake-up word appear only separately and are interleaved with other words, the speech input by the user does not contain the wake-up word.

Optionally, the terminal device provides a wake-up word wake-up function. If the user manually enables this function before the terminal device enters the sleep state, the terminal device can perform wake-up word detection on received voice information using the preset wake-up word and be woken up according to the detection result. The received voice information is the voice information captured by the terminal device. Specifically, with the wake-up word wake-up function enabled in advance, when the terminal device determines, while in the sleep state, that voice information to be played is about to be played, it acquires the word set corresponding to the voice information to be played. It then performs wake-up word detection on the received voice information according to the similarity between the word set and the preset wake-up word. If the wake-up word is recognized in the voice information, the terminal device is woken up; if the wake-up word is not recognized, the terminal device is not woken up.

Optionally, when the terminal device detects that it is entering the sleep state, it automatically enables the wake-up word wake-up function.

If the wake-up word wake-up function has not been enabled before the terminal device enters the sleep state, the terminal device cannot perform wake-up word detection on received voice information using the preset wake-up word, and therefore cannot be woken up according to a detection result.

Optionally, a preset application of the terminal device is in a sleep state when the user is not using it. When the user needs to use the preset application, it is woken up and enters the working state from the sleep state. For example, when the user is not using the dialog system of the terminal device, the dialog system is in a sleep state; when the user needs to use the dialog system, it is woken up and enters the working state from the sleep state to perform voice interaction with the user.

If the terminal device enables the wake-up word wake-up function before its dialog system enters the sleep state, the terminal device can perform wake-up word detection on received voice information using the preset wake-up word and wake up the dialog system according to the detection result. Specifically, with the wake-up word wake-up function enabled in advance, when the terminal device determines, while in the sleep state, that voice information to be played is about to be played, it acquires the word set corresponding to the voice information to be played. It then performs wake-up word detection on the received voice information according to the similarity between the word set and the preset wake-up word. If the wake-up word is recognized in the voice information, the dialog system of the terminal device is woken up; if the wake-up word is not recognized, the dialog system is not woken up.

When the terminal device detects that a set audio file is about to be played in a set playback application, it determines that voice information to be played is about to be played. The set playback application is an application capable of playing audio files. The set audio file may include music files and audiobook files.

The word set corresponding to the voice information to be played is the set of common words in that voice information. The common words can be obtained from the set audio file, and the word set corresponding to the voice information to be played is generated from them.

Step 102, the terminal device performs wake-up word detection on the received voice information according to the similarity between the word set and the preset wake-up word.

The similarity between each word in the word set and the preset wake-up word is calculated using a preset word similarity algorithm. Based on each word's similarity and preset similarity thresholds, it is then determined whether the word set contains similar words of the wake-up word and whether it contains same words of the wake-up word. Specifically, the preset similarity thresholds include a similar-word threshold and a same-word threshold, and the similar-word threshold is less than the same-word threshold. When the similarity between a word and the preset wake-up word is greater than the similar-word threshold and less than the same-word threshold, the word is determined to be a similar word of the wake-up word. When the similarity between a word and the preset wake-up word is greater than the same-word threshold, the word is determined to be a same word of the wake-up word.
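The two-threshold rule above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the description does not name its word similarity algorithm, so `difflib.SequenceMatcher` stands in for it here, and the threshold values (0.6 and 0.9) and all function names are hypothetical. The description also does not say how a similarity exactly equal to a threshold is handled; the sketch treats equality with the same-word threshold as "same" and equality with the similar-word threshold as "other".

```python
from difflib import SequenceMatcher

# Hypothetical threshold values; the description only requires SIMILAR < SAME.
SIMILAR_THRESHOLD = 0.6
SAME_THRESHOLD = 0.9


def word_similarity(word: str, wake_word: str) -> float:
    """Stand-in similarity in [0, 1]; the actual algorithm is unspecified."""
    return SequenceMatcher(None, word, wake_word).ratio()


def classify_similarity(score: float,
                        similar_threshold: float = SIMILAR_THRESHOLD,
                        same_threshold: float = SAME_THRESHOLD) -> str:
    """Map a similarity score to 'same', 'similar', or 'other'."""
    if similar_threshold >= same_threshold:
        raise ValueError("similar-word threshold must be below same-word threshold")
    if score >= same_threshold:      # handling of equality is an assumption
        return "same"
    if score > similar_threshold:
        return "similar"
    return "other"


def classify_word_set(word_set, wake_word):
    """Return (similar_words, same_words) found in the word set."""
    similar, same = set(), set()
    for word in word_set:
        label = classify_similarity(word_similarity(word, wake_word))
        if label == "same":
            same.add(word)
        elif label == "similar":
            similar.add(word)
    return similar, same
```

Separating `classify_similarity` from `word_similarity` keeps the threshold logic independent of whichever similarity measure is actually used.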

The terminal device performs wake-up word detection on the received voice information according to the similarity between the word set and the preset wake-up word. Based on the similarity, the terminal device determines which of the following cases applies: the word set contains similar words of the wake-up word but no same words of the wake-up word; the word set contains neither similar words nor same words of the wake-up word; or the word set contains same words of the wake-up word.

Specifically, when the word set contains similar words of the wake-up word but no same words of the wake-up word, the terminal device decides whether the wake-up word is recognized in the voice information by comparing two matching scores: the matching score between the recognition result of the voice information and the wake-up word, and the matching score between the recognition result of the voice information and the similar word. When the matching score for the wake-up word is greater than or equal to the matching score for the similar word, the wake-up word is determined to be recognized in the voice information. When the matching score for the wake-up word is smaller than the matching score for the similar word, the wake-up word is determined not to be recognized, and neither the terminal device nor its preset application is woken up, which avoids a false wake-up.

When the terminal device determines from the similarity that the word set contains neither similar words nor same words of the wake-up word, it performs wake-up word detection directly using the preset wake-up word.

When the terminal device determines from the similarity that the word set contains same words of the wake-up word, it must verify the user's identity from voiceprint features to determine whether the received voice information is the user's voice. When the received voice information is confirmed to be the user's voice, wake-up word detection is performed directly using the preset wake-up word.

When the wake-up word is recognized in the voice information, the terminal device or its preset application is woken up. When the wake-up word is not recognized in the voice information, neither the terminal device nor its preset application is woken up.
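The three cases handled in step 102 can be pictured as a small dispatch routine. The Python sketch below is illustrative only: the names `Strategy`, `choose_strategy`, and `should_wake` are hypothetical, `wake_detected` stands for the outcome of ordinary wake-up word detection, and the behaviour for an unverified speaker (never waking) is an assumption, since the description only states what happens once the speaker is confirmed.

```python
from enum import Enum


class Strategy(Enum):
    COMPARE_WITH_SIMILAR = "compare wake-up word score with similar-word score"
    DIRECT = "detect the wake-up word directly"
    VERIFY_VOICEPRINT = "verify the speaker by voiceprint, then detect directly"


def choose_strategy(has_similar: bool, has_same: bool) -> Strategy:
    """Pick the detection strategy from what the word set contains."""
    if has_same:
        return Strategy.VERIFY_VOICEPRINT
    if has_similar:
        return Strategy.COMPARE_WITH_SIMILAR
    return Strategy.DIRECT


def should_wake(strategy: Strategy,
                wake_detected: bool,
                first_score: float = 0.0,
                second_score: float = 0.0,
                speaker_verified: bool = False) -> bool:
    """Decide whether to wake the terminal device or its preset application."""
    if strategy is Strategy.VERIFY_VOICEPRINT:
        # Assumption: speech from an unverified speaker never wakes the device.
        return speaker_verified and wake_detected
    if strategy is Strategy.COMPARE_WITH_SIMILAR:
        # Wake only if the wake-up word matches at least as well as the similar word.
        return wake_detected and first_score >= second_score
    return wake_detected
```

For example, `should_wake(choose_strategy(True, False), True, 9.0, 9.5)` returns `False`, because the recognition result matches the similar word better than the wake-up word.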

In the voice recognition method provided by this embodiment, when the terminal device determines, while in a sleep state, that voice information to be played is about to be played, it acquires the word set corresponding to the voice information to be played and performs wake-up word detection on received voice information according to the similarity between the word set and the preset wake-up word. During wake-up recognition, voice information played in the environment can thus be shielded in a targeted manner based on the voice information to be played, so that false wake-ups are avoided, wake-up recognition is optimized, and user experience is improved.

Example two

Fig. 2 is a flowchart of a speech recognition method according to a second embodiment of the present invention. This embodiment refines step 102 of the foregoing embodiment. The terminal device performing wake-up word detection on the received voice information according to the similarity between the word set and the preset wake-up word includes: when the terminal device determines from the similarity that the word set contains similar words of the wake-up word but no same words of the wake-up word, detecting whether a voice signal corresponding to the wake-up word exists in the received voice information; when such a voice signal is detected, acquiring a first matching score between the voice information and the wake-up word and a second matching score between the voice information and the similar word, and comparing the first matching score with the second matching score; and when the first matching score is greater than or equal to the second matching score, determining that the wake-up word is recognized in the voice information.

As shown in fig. 2, the method includes:

Step 201, when the terminal device determines, while in a sleep state, that voice information to be played is about to be played, a word set corresponding to the voice information to be played is acquired.

Optionally, the terminal device determining, while in the sleep state, that voice information to be played is about to be played includes: the terminal device periodically checks the set playback application in the local system and/or an associated device, and determines that voice information to be played is about to be played when it detects that a set audio file is about to be played in the set playback application; and/or the terminal device determines that voice information to be played is about to be played when it receives audio file playback prompt information sent by the local system and/or the associated device.

The terminal device checks the set playback application in the local system at a preset period, and determines that voice information to be played is about to be played when it detects that a set audio file is about to be played in the set playback application. The set playback application is an application capable of playing audio files. The set audio file may include music files and audiobook files.

Optionally, the terminal device periodically checks the set playback application in the associated device, and determines that voice information to be played is about to be played when it detects that a set audio file is about to be played in the set playback application. The associated device may be another terminal device connected to the same server as the terminal device. Optionally, the associated device may be another terminal device that shares the same user account with the terminal device. The user account records the user's username and password, affiliated groups, accessible network resources, and personal files and settings.

Optionally, the terminal device periodically checks the set playback application in both the local system and the associated device, and determines that voice information to be played is about to be played when it detects that a set audio file is about to be played in the set playback application.

Optionally, the terminal device determines that voice information to be played is about to be played when it receives audio file playback prompt information sent by the local system.

When the set playback application in the local system is about to play a set audio file, it sends audio file playback prompt information. From the prompt information received from the local system, the terminal device can determine that the set playback application in the local system is about to play a set audio file, that is, that voice information to be played is about to be played.

Optionally, the terminal device determines that voice information to be played is about to be played when it receives audio file playback prompt information sent by the associated device.

When the set playback application in the associated device is about to play a set audio file, it sends audio file playback prompt information. From the prompt information received from the associated device, the terminal device can determine that the set playback application in the associated device is about to play a set audio file, that is, that voice information to be played is about to be played.

Optionally, the terminal device determines that voice information to be played is about to be played when it receives audio file playback prompt information sent by the local system and the associated device.

Optionally, the two approaches are combined: the terminal device periodically checks the set playback application in the local system and the associated device and determines that voice information to be played is about to be played when it detects that a set audio file is about to be played, and it also determines that voice information to be played is about to be played when it receives audio file playback prompt information sent by the local system and the associated device.
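The two ways of learning about upcoming playback, periodic checking and prompt information, can be sketched as follows. This is a simplified Python illustration, not the implementation described in the patent: `PlaybackSource` and `PlaybackWatcher` are hypothetical names, a source stands for either the local system or an associated device, and a real device would obtain this state from its operating system or over a network connection.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PlaybackSource:
    """Hypothetical stand-in for the local system or an associated device."""
    name: str
    pending_audio_file: Optional[str] = None  # set audio file about to be played


@dataclass
class PlaybackWatcher:
    sources: list = field(default_factory=list)
    upcoming: Optional[str] = None  # audio file whose word set should be built

    def poll(self) -> bool:
        """Periodic check: is any source about to play a set audio file?"""
        for source in self.sources:
            if source.pending_audio_file is not None:
                self.upcoming = source.pending_audio_file
                return True
        return False

    def on_play_prompt(self, audio_file: str) -> bool:
        """Push path: called when playback prompt information is received."""
        self.upcoming = audio_file
        return True
```

In this sketch `poll` would be invoked from a timer at the preset period, while `on_play_prompt` would be invoked by whatever message handler receives the prompt information. Either path ends with `upcoming` set, which is the trigger for acquiring the word set.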

Optionally, acquiring the word set corresponding to the voice information to be played includes: acquiring introduction information of the voice information to be played; obtaining the common words of the voice information to be played from the introduction information; and generating the word set corresponding to the voice information to be played.

The introduction information is preset information that describes the content of the set audio file corresponding to the voice information to be played. After the introduction information is acquired, the common words of the voice information to be played are obtained from it and the word set is generated. Specifically, the statistical features of each word in the introduction information are extracted using a preset statistical algorithm. Words whose occurrence frequency reaches a preset frequency threshold are then selected according to these features and determined to be common words of the voice information to be played. The word set corresponding to the voice information to be played is generated from all the common words so determined.

Optionally, acquiring the word set corresponding to the voice information to be played includes: acquiring the set audio file corresponding to the voice information to be played; obtaining the common words of the voice information to be played from the set audio file; and generating the word set corresponding to the voice information to be played.

The statistical features of each word in the set audio file are extracted using a preset statistical algorithm. Words whose occurrence frequency reaches a preset frequency threshold are then selected according to these features and determined to be common words of the voice information to be played. The word set corresponding to the voice information to be played is generated from all the common words so determined.
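The frequency-based selection of common words can be illustrated with a short Python sketch. It rests on several assumptions that are not in the description: the input is already text (introduction information, or a transcript of the set audio file), words are split with a simple regular expression (Chinese text would need a proper word segmenter), the "statistical feature" is a plain occurrence count, and the name `common_words` and the default threshold are hypothetical.

```python
import re
from collections import Counter


def common_words(text: str, min_count: int = 2) -> set:
    """Return the words whose occurrence count reaches min_count."""
    # Naive tokenization; a real system would use language-specific segmentation.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}


# Usage: build the word set for an upcoming audiobook from its introduction text.
intro = "the cat and the hat and the bat"
word_set = common_words(intro, min_count=2)
```

Here `word_set` contains only "the" and "and", the two words that occur at least twice.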

Step 202, when the terminal device determines from the similarity that the word set contains similar words of the wake-up word but no same words of the wake-up word, it detects whether a voice signal corresponding to the wake-up word exists in the received voice information.

Specifically, acoustic features of the voice information are extracted and input into a preset voice recognition model. The model recognizes the received voice information and produces a recognition result, and the matching score between this recognition result and the wake-up word, namely the first matching score, is calculated. The matching score may range from 0 to 10 points, with a higher score indicating a better match. The range of the matching score can be set according to actual needs.

After the first matching score is obtained, whether a voice signal corresponding to the wake-up word exists in the received voice information is determined from the first matching score and a first preset threshold. When the first matching score is greater than or equal to the first preset threshold, the recognition result of the voice information matches the wake-up word, that is, a voice signal corresponding to the wake-up word is detected in the received voice information. When the first matching score is smaller than the first preset threshold, the recognition result does not match the wake-up word, that is, no voice signal corresponding to the wake-up word is detected. For example, if the matching score ranges from 0 to 10 points, the first preset threshold may be 8 points.

Step 203, when a voice signal corresponding to the wake-up word is detected in the voice information, a first matching score between the voice information and the wake-up word and a second matching score between the voice information and the similar word are acquired, and the first matching score is compared with the second matching score.

The first matching score between the voice information and the wake-up word is the matching score between the recognition result of the voice information and the wake-up word. The second matching score between the voice information and the similar word is the matching score between the recognition result of the voice information and the similar word.

When a voice signal corresponding to the wake-up word is detected in the received voice information, the first matching score is acquired, and the second matching score is calculated using the preset voice recognition model. Once both scores are available, the first matching score is compared with the second matching score.

Step 204, when the first matching score is greater than or equal to the second matching score, it is determined that the wake-up word is recognized in the voice information.

If the first matching score is greater than the second matching score, the recognition result of the voice information matches the wake-up word better than it matches the similar word, and the wake-up word is determined to be recognized in the voice information. If the first matching score is equal to the second matching score, the recognition result matches the wake-up word about as well as it matches the similar word, and the wake-up word is likewise determined to be recognized. If the first matching score is smaller than the second matching score, the recognition result matches the wake-up word less well than it matches the similar word, and the wake-up word is determined not to be recognized in the voice information.

In other words, the wake-up word is determined to be recognized in the voice information whenever the recognition result matches the wake-up word at least as well as it matches the similar word.
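Steps 202 to 204 reduce to a threshold check followed by a score comparison, which the following Python sketch illustrates. The 0 to 10 score range and the first preset threshold of 8 points are the example values given in the description; the function name `recognize_wake_word` is hypothetical, and the scores are taken as inputs because the voice recognition model that produces them is not specified.

```python
FIRST_PRESET_THRESHOLD = 8.0  # example value from the description (scores 0 to 10)


def recognize_wake_word(first_score: float,
                        second_score: float,
                        first_threshold: float = FIRST_PRESET_THRESHOLD) -> bool:
    """Return True if the wake-up word is considered recognized.

    first_score:  match between the recognition result and the wake-up word.
    second_score: match between the recognition result and the similar word.
    """
    # Step 202: is there a voice signal corresponding to the wake-up word at all?
    if first_score < first_threshold:
        return False
    # Steps 203 and 204: the wake-up word must match at least as well as the similar word.
    return first_score >= second_score
```

With these example values, `recognize_wake_word(8.5, 9.5)` is rejected even though the wake-up word score clears the threshold, because the similar word from the played audio matches better. That is the case in which a false wake-up is suppressed.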

In the voice recognition method provided by this embodiment, when the word set is determined to contain similar words of the wake-up word but no same words of the wake-up word, and a voice signal corresponding to the wake-up word is detected in the voice information, the first matching score between the voice information and the wake-up word is compared with the second matching score between the voice information and the similar word. When the first matching score is greater than or equal to the second matching score, the wake-up word is determined to be recognized in the voice information. During wake-up recognition, voice information played in the environment can thus be shielded in a targeted manner based on the similar words of the wake-up word in the voice information to be played, so that false wake-ups are avoided.

Example three

Fig. 3 is a flowchart of a speech recognition method according to a third embodiment of the present invention. This embodiment refines step 102 of the foregoing embodiments. Specifically, the terminal device performing wake-up word detection on the received voice information according to the degree of similarity between the word set and the preset wake-up word includes: when the terminal device determines, according to the degree of similarity, that the word set contains neither similar words of the wake-up word nor a word identical to the wake-up word, detecting whether a voice signal corresponding to the wake-up word exists in the received voice information; and when such a voice signal is detected, determining that the wake-up word is recognized in the voice information.

As shown in fig. 3, the method includes:

Step 301, when the terminal device determines, during sleep, that voice information to be played is about to be played, it acquires a word set corresponding to the voice information to be played.

Step 302, when the terminal device determines that the word set does not contain similar words of the awakening word and does not contain the same words of the awakening word according to the similarity degree, it detects whether a voice signal corresponding to the awakening word exists in the received voice information.

Specifically, acoustic features of the received voice information are extracted and input into a preset voice recognition model, which recognizes the voice information and produces a corresponding recognition result. A matching score between this recognition result and the wake-up word, namely the first matching score, is then calculated. The matching score may, for example, range from 0 to 10 points, with a higher score indicating a better match. The range of the matching score can be set according to actual needs.

Step 303, when it is detected that the voice information has a voice signal corresponding to the wakeup word, determining that the wakeup word is recognized in the voice information.

When voice signals corresponding to the awakening words are detected to exist in the voice information, the awakening words are determined to be recognized in the voice information, and the terminal equipment or the preset application of the terminal equipment is awakened; when the voice information is detected to have no voice signal corresponding to the awakening word, the awakening word is determined not to be recognized in the voice information, and the terminal equipment or the preset application of the terminal equipment is not awakened.

In the voice recognition method provided by this embodiment, when the word set is determined to contain neither similar words of the wake-up word nor a word identical to the wake-up word, the terminal device detects whether a voice signal corresponding to the wake-up word exists in the received voice information. When such a voice signal is detected, the wake-up word is determined to be recognized in the voice information. Wake-up recognition can therefore be performed directly against the wake-up word when the voice information to be played contains no similar or identical words.
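As a rough sketch of this direct detection path, the example below scores a recognition result against the wake-up word on the 0 to 10 scale mentioned above. It is not the preset voice recognition model of the embodiment: text similarity from Python's `difflib` stands in for the model's matching score, and the detection threshold, the function names, and the sample wake-up word are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

MAX_SCORE = 10.0           # the description allows scores from 0 to 10
DETECTION_THRESHOLD = 8.0  # assumption: the text does not give a threshold


def matching_score(recognized_text: str, target: str) -> float:
    """Stand-in matching score: text similarity scaled to 0..MAX_SCORE."""
    return MAX_SCORE * SequenceMatcher(None, recognized_text, target).ratio()


def detect_wake_word_directly(recognized_text: str,
                              wake_word: str,
                              threshold: float = DETECTION_THRESHOLD) -> bool:
    """Return True when the first matching score reaches the threshold."""
    first_score = matching_score(recognized_text, wake_word)
    return first_score >= threshold
```

A real system would compute the score from acoustic features rather than from text; the sketch only shows where the first matching score enters the decision.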

Example four

Fig. 4 is a flowchart of a speech recognition method according to a fourth embodiment of the present invention. This embodiment refines step 102 of the foregoing embodiments. Specifically, the terminal device performing wake-up word detection on the received voice information according to the degree of similarity between the word set and the preset wake-up word includes: when the terminal device determines, according to the degree of similarity, that the word set contains a word identical to the wake-up word, determining the voiceprint features corresponding to the received voice information; judging whether the voiceprint features match preset voiceprint features; when the voiceprint features match the preset voiceprint features, detecting whether a voice signal corresponding to the wake-up word exists in the voice information; and when such a voice signal is detected, determining that the wake-up word is recognized in the voice information.

As shown in fig. 4, the method includes:

Step 401, when the terminal device determines, during sleep, that voice information to be played is about to be played, it acquires a word set corresponding to the voice information to be played.

Step 402, when the terminal device determines, according to the degree of similarity, that the word set contains a word identical to the wake-up word, it determines the voiceprint features corresponding to the received voice information.

The voiceprint features may be determined by processing the received voice information and extracting the corresponding voiceprint features from it.

Step 403, judging whether the voiceprint features match preset voiceprint features.

The preset voiceprint features are voiceprint features set in advance for a user of the device. They may be set directly by the user, or obtained by analyzing voice signals input by the user. Optionally, the preset voiceprint features may comprise a plurality of voiceprint features.

Step 404, when the voiceprint features match the preset voiceprint features, detecting whether a voice signal corresponding to the wake-up word exists in the voice information.

If the voiceprint features match the preset voiceprint features, the received voice information was input by a user of the terminal device, and whether a voice signal corresponding to the wake-up word exists in the voice information is detected. If the voiceprint features do not match the preset voiceprint features, the received voice information was not input by a user of the terminal device and may be interfering voice information in the environment, so no further processing is performed.

Step 405, when a voice signal corresponding to the awakening word is detected to exist in the voice message, determining that the awakening word is recognized in the voice message.

In the voice recognition method provided by this embodiment, when the terminal device determines that the word set contains a word identical to the wake-up word, it determines the voiceprint features corresponding to the received voice information and judges whether they match the preset voiceprint features. Only when the voiceprint features match does it detect whether a voice signal corresponding to the wake-up word exists in the voice information. Voice recognition can thus rely on voiceprint features when the voice information to be played contains the wake-up word itself, so that interfering voice information in the environment is shielded.
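The voiceprint gate of steps 402 to 405 might be sketched as follows. The embodiment does not say how voiceprint features are represented or compared, so the sketch assumes fixed-length numeric vectors compared by cosine similarity against an assumed threshold; every name in it is hypothetical.

```python
import math
from typing import Callable, Sequence

MATCH_THRESHOLD = 0.8  # assumption: the text does not define a match criterion


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def voiceprint_matches(feature: Sequence[float],
                       presets: Sequence[Sequence[float]],
                       threshold: float = MATCH_THRESHOLD) -> bool:
    """True if the feature matches any of the preset voiceprint features."""
    return any(cosine_similarity(feature, p) >= threshold for p in presets)


def recognize_with_voiceprint(feature: Sequence[float],
                              presets: Sequence[Sequence[float]],
                              detect_signal: Callable[[], bool]) -> bool:
    """Run wake-up word detection only after the voiceprint check passes."""
    if not voiceprint_matches(feature, presets):
        # Treated as interfering voice information: no further processing.
        return False
    return detect_signal()
```

Passing the detector as a callable keeps the ordering described above: the wake-up word detector is not invoked at all for a non-matching voiceprint.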

Example five

Fig. 5 is a block diagram of a speech recognition apparatus according to a fifth embodiment of the present invention. As shown in fig. 5, the apparatus includes:

a word set acquisition module 501 and a wake-up word detection module 502.

The word set acquisition module 501 is configured to acquire a word set corresponding to voice information to be played when the terminal device determines, during sleep, that the voice information to be played is about to be played. The wake-up word detection module 502 is configured to perform wake-up word detection on voice information received by the terminal device according to the degree of similarity between the word set and the preset wake-up word.

The voice recognition device provided by this embodiment acquires the word set corresponding to the voice information to be played when the terminal device determines, during sleep, that the voice information to be played is about to be played, and performs wake-up word detection on the received voice information according to the degree of similarity between the word set and the preset wake-up word. During wake-up recognition, voice information played in the environment can therefore be shielded in a targeted manner according to the voice information to be played, which avoids false wake-ups, optimizes wake-up recognition, and improves user experience.

On the basis of the foregoing embodiments, the word set acquisition module 501 may include:

a periodic information detection unit, configured to have the terminal device periodically check a set playback application in the local system and/or an associated device, and to determine that voice information to be played is about to be played when it detects that the set playback application is about to play a set audio file; and/or

an information receiving unit, configured to determine that voice information to be played is about to be played when the terminal device receives audio file playback prompt information sent by the local system and/or the associated device.
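The two triggers handled by these units might be sketched as follows. The `PlaybackApp` type and both function names are hypothetical; the embodiment does not specify how a playback application exposes its pending audio or how prompt information is delivered.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class PlaybackApp:
    """Hypothetical stand-in for a set playback application."""
    name: str
    queued_audio: List[str] = field(default_factory=list)


def poll_pending_audio(apps: Iterable[PlaybackApp]) -> Optional[str]:
    """Periodic check: return the first audio file that is about to play."""
    for app in apps:
        if app.queued_audio:
            return app.queued_audio[0]
    return None


def playback_is_pending(apps: Iterable[PlaybackApp],
                        prompt_received: bool) -> bool:
    """Either trigger suffices: a received prompt or a positive poll."""
    return prompt_received or poll_pending_audio(apps) is not None
```

In this sketch a scheduler on the terminal device would call `playback_is_pending` at a fixed interval, while a received prompt sets `prompt_received` immediately.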

On the basis of the foregoing embodiments, the wakeup word detection module 502 may include:

a first signal detection unit, configured to detect whether a voice signal corresponding to the wake-up word exists in the received voice information when the terminal device determines, according to the degree of similarity, that the word set contains similar words of the wake-up word and does not contain a word identical to the wake-up word;

a matching score comparing unit, configured to acquire a first matching score between the voice information and the wake-up word and a second matching score between the voice information and the similar word when a voice signal corresponding to the wake-up word is detected in the voice information, and to compare the first matching score with the second matching score;

and a first identification unit, configured to determine that the wake-up word is recognized in the voice information when the first matching score is greater than or equal to the second matching score.

On the basis of the foregoing embodiments, the wake-up word detection module 502 may include:

a second signal detection unit, configured to detect whether a voice signal corresponding to the wake-up word exists in the received voice information when the terminal device determines, according to the degree of similarity, that the word set contains neither similar words of the wake-up word nor a word identical to the wake-up word;

and a second identification unit, configured to determine that the wake-up word is recognized in the voice information when a voice signal corresponding to the wake-up word is detected in the voice information.

On the basis of the foregoing embodiments, the wake-up word detection module 502 may include:

a voiceprint feature determining unit, configured to determine the voiceprint features corresponding to the received voice information when the terminal device determines, according to the degree of similarity, that the word set contains a word identical to the wake-up word;

a voiceprint judging unit, configured to judge whether the voiceprint features match the preset voiceprint features;

a third signal detection unit, configured to detect whether a voice signal corresponding to the wake-up word exists in the voice information when the voiceprint features match the preset voiceprint features;

and a third identification unit, configured to determine that the wake-up word is recognized in the voice information when a voice signal corresponding to the wake-up word is detected in the voice information.

On the basis of the foregoing embodiments, the word set acquisition module 501 may include:

an introduction information acquisition unit, configured to acquire introduction information of the voice information to be played;

and a word set generating unit, configured to acquire common words of the voice information to be played according to the introduction information and to generate a word set corresponding to the voice information to be played.

The voice recognition device provided by the embodiment of the invention can execute the voice recognition method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
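One possible reading of the word set generating unit is sketched below. The embodiment does not state how common words are derived from the introduction information, so this sketch simply assumes the introduction is plain text and keeps its most frequent tokens; the function name, the tokenization rule, and the `top_n` parameter are all assumptions.

```python
import re
from collections import Counter
from typing import Set


def build_word_set(introduction: str, top_n: int = 5) -> Set[str]:
    """Return the top_n most frequent lowercase tokens of the introduction.

    Assumption: frequency in the introduction text approximates the
    "common words" of the voice information to be played.
    """
    tokens = re.findall(r"[a-z']+", introduction.lower())
    counts = Counter(tokens)
    return {word for word, _ in counts.most_common(top_n)}
```

For text in a language without whitespace word boundaries, a word segmenter would have to replace the regular expression used here.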

Example six

Fig. 6 is a schematic structural diagram of a device according to a sixth embodiment of the present invention. Fig. 6 illustrates a block diagram of an exemplary device 612 suitable for implementing embodiments of the present invention. The device shown in fig. 6 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present invention.

As shown in FIG. 6, device 612 is in the form of a general purpose computing device. Components of device 612 may include, but are not limited to: one or more processors or processing units 616, a system memory 628, and a bus 618 that couples various system components including the system memory 628 and the processing unit 616.

Bus 618 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.

Device 612 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by device 612 and includes both volatile and nonvolatile media, removable and non-removable media.

The system memory 628 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 630 and/or cache memory 632. The device 612 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 634 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard disk drive"). Although not shown in FIG. 6, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be connected to bus 618 by one or more data media interfaces. Memory 628 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.

A program/utility 640 having a set (at least one) of program modules 642 may be stored, for example, in memory 628. Such program modules 642 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, each of which, or some combination of which, may include an implementation of a network environment. The program modules 642 generally perform the functions and/or methods of the described embodiments of the present invention.

Device 612 may also communicate with one or more external devices 614 (e.g., keyboard, pointing device, display 624, etc.), with one or more devices that enable a user to interact with device 612, and/or with any devices (e.g., network card, modem, etc.) that enable device 612 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 622. Also, the device 612 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through the network adapter 620. As shown, the network adapter 620 communicates with the other modules of the device 612 via the bus 618. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the device 612, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

The processing unit 616 executes programs stored in the system memory 628 to perform various functional applications and data processing, such as implementing a voice recognition method provided by an embodiment of the present invention.

Namely: when the terminal device determines, during sleep, that voice information to be played is about to be played, a word set corresponding to the voice information to be played is acquired; and the terminal device performs wake-up word detection on the received voice information according to the degree of similarity between the word set and the preset wake-up word.
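The way the method selects among the three detection paths described in the embodiments might be sketched as follows. How the degree of similarity is computed is not restated here, so the sketch assumes text similarity from Python's `difflib` with an assumed threshold; the enum, the function name, and the sample words are hypothetical.

```python
from difflib import SequenceMatcher
from enum import Enum
from typing import Iterable

SIMILARITY_THRESHOLD = 0.6  # assumption: not specified in the text


class Branch(Enum):
    SCORE_COMPARISON = "similar words only"         # compare two matching scores
    DIRECT_DETECTION = "no similar or same words"   # detect the wake-up word directly
    VOICEPRINT_CHECK = "same word present"          # check the voiceprint first


def select_branch(word_set: Iterable[str],
                  wake_word: str,
                  threshold: float = SIMILARITY_THRESHOLD) -> Branch:
    """Pick the detection path from the word set of the audio to be played."""
    words = list(word_set)
    if wake_word in words:
        return Branch.VOICEPRINT_CHECK
    if any(SequenceMatcher(None, w, wake_word).ratio() >= threshold
           for w in words):
        return Branch.SCORE_COMPARISON
    return Branch.DIRECT_DETECTION
```

Checking for an identical word before checking for similar words means a word set containing both takes the voiceprint path, which is the stricter of the two.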

Example seven

The seventh embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the speech recognition method provided by the embodiment of the present invention.

Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

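1. A speech recognition method, comprising:

when the terminal equipment determines that voice information to be played is to be played in a sleeping process, acquiring a word set corresponding to the voice information to be played;

the terminal equipment detects the awakening words of the received voice information according to the similarity degree between the word set and the preset awakening words;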

the terminal equipment detects the awakening words of the received voice information according to the similarity degree between the word set and the preset awakening words, and the method comprises the following steps:

when the terminal equipment determines that the word set contains similar words of the awakening word and does not contain the same words of the awakening word according to the similarity degree, whether a voice signal corresponding to the awakening word exists in the received voice information or not is detected;

when voice signals corresponding to the awakening words exist in the voice information, acquiring a first matching score of the voice information and the awakening words and a second matching score of the voice information and the similar words, and comparing the first matching score with the second matching score;

determining that the wake-up word is recognized in the voice message when the first matching score is greater than or equal to the second matching score.

2. The method of claim 1, wherein the terminal device determines that the voice message to be played is to be played during the sleep process, and the method comprises:

the terminal equipment regularly detects a set playing application program in a local system and/or associated equipment, and determines that voice information to be played is to be played when a set audio file to be played in the set playing application program is detected; and/or

when the terminal equipment receives the audio file playing prompt information sent by the local system and/or the associated equipment, the terminal equipment determines that the voice information to be played is to be played.

3. The method according to claim 1 or 2, wherein the terminal device performs wakeup word detection on the received voice message according to the similarity between the word set and a preset wakeup word, and the method includes:

when the terminal equipment determines that the word set does not contain similar words of the awakening word and does not contain the same words of the awakening word according to the similarity degree, whether a voice signal corresponding to the awakening word exists in the received voice information or not is detected;

and when the voice information is detected to have the voice signal corresponding to the awakening word, determining that the awakening word is recognized in the voice information.

4. The method according to claim 1 or 2, wherein the terminal device performs wakeup word detection on the received voice message according to the similarity between the word set and a preset wakeup word, and the method includes:

when the terminal equipment determines, according to the similarity degree, that the word set contains the same words as the awakening word, determining voiceprint characteristics corresponding to the voice information according to the received voice information;

judging whether the voiceprint features are matched with preset voiceprint features or not;

when the voiceprint features are matched with preset voiceprint features, detecting whether a voice signal corresponding to a wakeup word exists in the voice information;

and when detecting that the voice information has the voice signal corresponding to the awakening word, determining that the awakening word is recognized in the voice information.

5. The method of claim 1, wherein obtaining a set of words corresponding to the voice information to be played comprises:

acquiring introduction information of the voice information to be played;

and acquiring the common words of the voice information to be played according to the introduction information, and generating a word set corresponding to the voice information to be played.

6. A speech recognition apparatus, comprising:

a word set acquisition module, used for acquiring a word set corresponding to voice information to be played when the terminal equipment determines that the voice information to be played is to be played in a sleeping process;

the awakening word detection module is used for the terminal equipment to perform awakening word detection on the received voice information according to the similarity degree between the word set and a preset awakening word;

the awakening word detection module comprises:

a first signal detection unit, used for detecting whether a voice signal corresponding to the awakening word exists in the received voice information when the terminal equipment determines, according to the similarity degree, that the word set contains similar words of the awakening word and does not contain the same words of the awakening word;

and the first identification unit is used for determining that the awakening word is identified in the voice information when the first matching score is greater than or equal to the second matching score.

7. The apparatus of claim 6, wherein the term set acquisition module comprises:

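the information regular detection unit is used for detecting the set playing application programs in the local system and/or the associated equipment by the terminal equipment regularly, and determining that the voice information to be played is to be played when the set playing application programs are detected to be about to play the set audio files; and/or

the information receiving unit is used for determining that the voice information to be played is to be played when the terminal equipment receives the audio file playing prompt information sent by the local system and/or the associated equipment.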

8. An electronic device, characterized in that the device comprises:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech recognition method of any of claims 1-5.

9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the speech recognition method according to any one of claims 1 to 5.