CN108766441B - Voice control method and device based on offline voiceprint recognition and voice recognition - Google Patents


Info

Publication number
CN108766441B
CN108766441B (application CN201810533494.8A)
Authority
CN
China
Prior art keywords
voice
voiceprint
template
feature
command word
Prior art date
Legal status
Active
Application number
CN201810533494.8A
Other languages
Chinese (zh)
Other versions
CN108766441A (en
Inventor
卢敬光
刘海模
吴晓东
刘雄
肖虎
马鸿飞
Current Assignee
ZHUHAI RONGTAI ELECTRONICS Co.,Ltd.
Original Assignee
Guangdong Shengjiangjun Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Shengjiangjun Technology Co ltd filed Critical Guangdong Shengjiangjun Technology Co ltd
Priority to CN201810533494.8A priority Critical patent/CN108766441B/en
Publication of CN108766441A publication Critical patent/CN108766441A/en
Application granted granted Critical
Publication of CN108766441B publication Critical patent/CN108766441B/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building

Abstract

A voice control method based on offline voiceprint recognition and voice recognition comprises the following steps: receiving wake-up word speech, and extracting a first speech feature and a first voiceprint feature of the wake-up word speech; checking whether the extracted first speech feature and first voiceprint feature respectively match a wake-up word speech template and a voiceprint template, and acquiring a first voiceprint code corresponding to the first voiceprint feature; receiving command word speech and extracting a second voiceprint feature of the command word speech; checking whether the second voiceprint feature matches the voiceprint template, and acquiring a second voiceprint code corresponding to the second voiceprint feature; checking whether the first voiceprint code is identical to the second voiceprint code, and extracting a second speech feature of the command word speech; and checking whether the extracted second speech feature matches the command word speech template, acquiring the speech code of the second speech feature, and generating a corresponding control instruction based on the speech code.

Description

Voice control method and device based on offline voiceprint recognition and voice recognition
Technical Field
The present application relates to the field of speaker verification technologies, and in particular, to a voice control method based on offline voiceprint recognition and voice recognition and a device for implementing the method.
Background
With the recent maturation and popularization of voice recognition technology, the ability to issue control instructions to electronic devices by voice has been successfully applied in a number of consumer electronic products (for example, the Siri function of Apple's iPhone). This voice-based device control technology relies on Speaker Verification, a branch of voice recognition technology, both to confirm that the relevant speech was uttered by a designated user (for example, the owner of a mobile phone or a person authorized to use the electronic device) and to determine the control instruction corresponding to the speech content.
Compared with traditional control technologies, voice-based control offers users a friendlier and more convenient way to interact with electronic devices (for example, users need not type a password to verify their access rights). However, prior art solutions are unstable because speech itself is easily affected by external conditions (such as background noise and the speaker's own condition), and determining the speech content and converting natural language into a computer language acceptable to the electronic device usually require the device to connect to an external database for online semantic conversion. Both problems increase the cost of using voice-based device control technology.
Disclosure of Invention
The present application aims to overcome the above defects of the prior art by providing a voice control method and device based on offline voiceprint recognition and voice recognition, which control an electronic device by voice entirely offline while minimizing the influence of external conditions on voice recognition.
To achieve the above object, the present application first proposes a voice control method based on offline voiceprint recognition and voice recognition, comprising the following steps: receiving wake-up word speech, and extracting a first speech feature and a first voiceprint feature of the wake-up word speech; checking whether the extracted first speech feature and first voiceprint feature respectively match the wake-up word speech template and the voiceprint template; if not, ending the process, otherwise acquiring a first voiceprint code corresponding to the first voiceprint feature; receiving command word speech and extracting a second voiceprint feature of the command word speech; checking whether the second voiceprint feature matches the voiceprint template; if not, ending the process, otherwise acquiring a second voiceprint code corresponding to the second voiceprint feature; checking whether the first voiceprint code is identical to the second voiceprint code; if not, ending the process, otherwise extracting a second speech feature of the command word speech; and checking whether the extracted second speech feature matches the command word speech template; if not, ending the process, otherwise acquiring the speech code of the second speech feature and generating a corresponding control instruction based on the speech code. The wake-up word speech template, the command word speech template, and the voiceprint template are all stored locally.
In a preferred embodiment of the above method, the wake-up word speech template and the command word speech template are generated by training on pre-collected speech.
In a preferred embodiment of the above method, the voiceprint template is generated by training on speech previously collected from at least one user.
In a preferred embodiment of the above method, the correspondence between speech and speech codes is user-definable.
Further, in the above preferred embodiment, the correspondence between speech and speech codes is stored locally.
In a preferred embodiment of the above method, the wake-up word speech template and the command word speech template are trained by dynamically updating the collected speech.
In a preferred embodiment of the above method, the voiceprint template is trained by dynamically updating the speech of at least one designated person.
Secondly, the present application further provides a voice control device based on offline voiceprint recognition and voice recognition, comprising the following modules: a first receiving module, configured to receive wake-up word speech and extract a first speech feature and a first voiceprint feature of the wake-up word speech; a first checking module, configured to check whether the extracted first speech feature and first voiceprint feature respectively match the wake-up word speech template and the voiceprint template, end the process if not, and otherwise acquire a first voiceprint code corresponding to the first voiceprint feature; a second receiving module, configured to receive command word speech and extract a second voiceprint feature of the command word speech; a second checking module, configured to check whether the second voiceprint feature matches the voiceprint template, end the process if not, and otherwise acquire a second voiceprint code corresponding to the second voiceprint feature; a third checking module, configured to check whether the first voiceprint code is identical to the second voiceprint code, end the process if not, and otherwise extract a second speech feature of the command word speech; and an instruction generating module, configured to check whether the extracted second speech feature matches the command word speech template, end the process if not, and otherwise acquire the speech code of the second speech feature and generate a corresponding control instruction based on the speech code. The wake-up word speech template, the command word speech template, and the voiceprint template are all stored locally.
In a preferred embodiment of the above device, the wake-up word speech template and the command word speech template are generated by training on pre-collected speech.
In a preferred embodiment of the above device, the voiceprint template is generated by training on speech previously collected from at least one user.
In a preferred embodiment of the above device, the correspondence between speech and speech codes is user-definable.
Further, in the above preferred embodiment, the correspondence between speech and speech codes is stored locally.
In a preferred embodiment of the above device, the wake-up word speech template and the command word speech template are trained by dynamically updating the collected speech.
In a preferred embodiment of the above device, the voiceprint template is trained by dynamically updating the speech of at least one designated person.
Finally, the present application also discloses a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, carry out the steps of any of the methods described above.
The beneficial effects of the present application are: the identity of the speaker and the content of the speech can be conveniently confirmed using local speech templates and a local voiceprint template, improving the usability of voice-based electronic device control.
Drawings
FIG. 1 is a flow diagram illustrating one embodiment of a method for speech control based on offline voiceprint recognition and speech recognition;
FIG. 2 is a schematic configuration diagram of the device according to the embodiment of FIG. 1;
FIG. 3 is a diagram illustrating a user-defined correspondence between speech and speech codes;
FIG. 4 is a block diagram of an embodiment of a voice control apparatus based on offline voiceprint recognition and voice recognition.
Detailed Description
The conception, specific structure and technical effects of the present application will be described clearly and completely with reference to the following embodiments and the accompanying drawings, so that the purpose, scheme and effects of the present application can be fully understood. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The same reference numbers will be used throughout the drawings to refer to the same or like parts.
In this context, unless explicitly stated otherwise, wake-up word speech refers to speech uttered by a user authorized to use the electronic device in order to verify the user's identity and initiate the device control flow. Only when the wake-up word speech meets certain conditions will the related device accept further voice instructions. Correspondingly, command word speech refers to a voice instruction with actual, specific meaning that the user issues to the electronic device after the wake-up word speech has been confirmed.
FIG. 1 is a flow diagram illustrating one embodiment of a voice control method based on offline voiceprint recognition and voice recognition. The method comprises the following steps: receiving wake-up word speech, and extracting a first speech feature and a first voiceprint feature of the wake-up word speech; checking whether the extracted first speech feature and first voiceprint feature respectively match the wake-up word speech template and the voiceprint template; if not, ending the process, otherwise acquiring a first voiceprint code corresponding to the first voiceprint feature; receiving command word speech and extracting a second voiceprint feature of the command word speech; checking whether the second voiceprint feature matches the voiceprint template; if not, ending the process, otherwise acquiring a second voiceprint code corresponding to the second voiceprint feature; checking whether the first voiceprint code is identical to the second voiceprint code; if not, ending the process, otherwise extracting a second speech feature of the command word speech; and checking whether the extracted second speech feature matches the command word speech template; if not, ending the process, otherwise acquiring the speech code of the second speech feature and generating a corresponding control instruction based on the speech code. As shown in the schematic diagram of FIG. 2, the wake-up word speech template, the command word speech template, and the voiceprint template are all stored locally. Whenever a speech feature or voiceprint feature fails to match the corresponding locally stored template, the method ends the process and returns to waiting for the user to input the wake-up word speech again.
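The two-stage flow above can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary "templates", helper names, and example feature strings are all hypothetical stand-ins for the trained, locally stored templates the method actually describes.

```python
# Toy local "templates"; a real system stores trained acoustic models,
# not lookup tables (assumption for illustration only).
WAKE_TEMPLATE = {"hello device"}                     # accepted wake-word speech features
VOICEPRINT_TEMPLATE = {"alice": 101, "bob": 102}     # voiceprint feature -> voiceprint code
COMMAND_TEMPLATE = {"lights on": 0x01, "lights off": 0x02}  # speech feature -> speech code

def handle_wake(first_speech_feature, first_voiceprint_feature):
    """Stage 1: verify the wake-up word, then identify the speaker.
    Returns the first voiceprint code, or None to end the process."""
    if first_speech_feature not in WAKE_TEMPLATE:
        return None
    return VOICEPRINT_TEMPLATE.get(first_voiceprint_feature)

def handle_command(first_code, second_speech_feature, second_voiceprint_feature):
    """Stage 2: confirm the command comes from the same speaker as the
    wake-up word, then decode it into a control instruction (or None)."""
    second_code = VOICEPRINT_TEMPLATE.get(second_voiceprint_feature)
    if first_code is None or second_code is None or second_code != first_code:
        return None
    return COMMAND_TEMPLATE.get(second_speech_feature)
```

A command spoken by a different speaker than the one who woke the device yields `None`, which corresponds to the forced return to the wake-word stage.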
The first and second voiceprint features are spectrogram feature parameters formed from physical quantities of the collected speech (such as timbre, duration, intensity, and pitch), relying on the relative stability of an individual's voice. Further, in one embodiment of the present application, the voiceprint template is generated by extracting the voiceprint features of a plurality of users who are authorized to use the electronic device, and grouping and sorting those voiceprint features according to each user's access rights. The voiceprint features can be computed from the user's voice by any algorithm conventional in the art, which the present application does not limit.
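As a rough, purely illustrative sketch of deriving such physical quantities from a speech frame (not the algorithm claimed by the application, which is deliberately unspecified), one can compute duration, RMS intensity, and a zero-crossing rate as a crude pitch proxy:

```python
import math

def voiceprint_stats(samples, sample_rate=16000):
    """Crude per-utterance acoustic statistics. Illustrative only:
    real voiceprint features are far richer than these three numbers."""
    duration = len(samples) / sample_rate                              # seconds
    intensity = math.sqrt(sum(s * s for s in samples) / len(samples))  # RMS amplitude
    # Zero-crossing rate: a rough stand-in for pitch ("height").
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0)
    return {"duration": duration,
            "intensity": intensity,
            "zcr": crossings / len(samples)}
```

For a one-second 440 Hz sine tone the duration is 1.0 s, the RMS is near 1/sqrt(2), and the zero-crossing rate is near 880/16000.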
Similarly, the first and second speech features are feature parameters formed from the collected speech according to the words, phonemes, tones, and other elements of a specific language. These feature parameters are matched against the feature parameters of the groups of speech samples that have been labeled with specific meanings in the speech templates (the wake-up word speech template and the command word speech template) to determine the specific meaning of the speech uttered by the user. The speech features may likewise be computed by any algorithm conventional in the art, which the present application does not limit.
To reduce the computational load of the system, in one embodiment of the present application, only the first speech feature is extracted upon receiving the wake-up word speech. Only when the first speech feature matches one of the groups of feature parameters recorded in the wake-up word speech template is the first voiceprint feature of the wake-up word extracted; otherwise, if the first speech feature matches no feature parameters in the wake-up word speech template, the user is prompted to utter the wake-up word again for re-matching. The related match determinations (the matching of the first speech feature to the wake-up word speech template, of the first voiceprint feature to the voiceprint template, and of the second speech feature to the command word speech template) can be implemented with matching algorithms conventional in the art, which the present application does not limit.
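The lazy-extraction optimization above can be sketched like this; the extractor callables and template structures are hypothetical placeholders, and the point is only that the heavier voiceprint extraction runs after the cheap speech-feature check passes:

```python
def verify_wake(audio, wake_template, voiceprint_template,
                extract_speech_feature, extract_voiceprint_feature):
    """Check the speech feature first; extract the voiceprint only on a hit.
    Returns the first voiceprint code, or None (prompt user to retry)."""
    if extract_speech_feature(audio) not in wake_template:
        return None  # cheap rejection: voiceprint extraction never runs
    # Only now pay for the more expensive voiceprint extraction.
    return voiceprint_template.get(extract_voiceprint_feature(audio))
```

With counting stubs for the extractors one can confirm that a rejected wake-up word never triggers the voiceprint path.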
In one embodiment of the present application, the wake-up word speech template and the command word speech template are generated by training on pre-collected speech. Specifically, the user can input the wake-up word speech and the command word speech multiple times in advance, and the templates are refined through supervised training, improving the accuracy of voice recognition.
Similarly, in one embodiment of the present application, the voiceprint template is generated by training on speech previously collected from at least one user. Correspondingly, one or more users authorized to use the electronic device input the wake-up word speech and command word speech multiple times in advance, and the voiceprint template is refined through supervised training, improving the accuracy of voiceprint recognition.
Referring to the schematic diagram of the user-defined correspondence between speech and speech codes shown in FIG. 3, in one embodiment of the present application, the correspondence between speech and speech codes can be configured according to the actual electronic device and the language used by the user. Because the user can customize this correspondence, the specific instruction issued to the electronic device is independent of the particular language in which the user utters the command word speech. For example, command word speech uttered in English or Chinese can be received and converted into the same control instruction by modifying the feature parameters of speech that has been labeled with a specific meaning in the command word speech template, so that the English or Chinese speech is associated with the designated speech code.
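The language independence of the speech-code mapping can be illustrated as follows; the mapping contents and the instruction codes are invented for illustration, standing in for the user-edited, locally stored correspondence:

```python
# Hypothetical user-edited mapping, stored locally: utterances in different
# languages may share one speech code, so the resulting control instruction
# is independent of the language the command was spoken in.
SPEECH_CODES = {
    "turn on the light": 0x10,
    "开灯": 0x10,               # the same instruction, spoken in Chinese
    "turn off the light": 0x11,
    "关灯": 0x11,
}

def to_instruction(recognized_command):
    """Map a recognized command phrase to its control instruction code."""
    return SPEECH_CODES.get(recognized_command)  # None if unrecognized
```

Because the lookup table rather than the recognizer carries the meaning, adding a new language only requires extending the table, not retraining the command logic.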
Further, in the above embodiments of the present application, the correspondence between speech and speech codes is also stored locally, so that voice-based control of the electronic device can be realized without a network connection.
In one embodiment of the present application, the wake-up word speech template and the command word speech template are trained by dynamically updating the collected speech. By regularly updating the wake-up word speech and the command word speech, the user can raise the security factor of the electronic device and prevent it from being misused by persons without access rights.
Similarly, in one embodiment of the present application, the voiceprint template is trained by dynamically updating the speech of at least one designated person, so that the user's voiceprint characteristics are updated in time (especially for users whose voices are changing, such as adolescents or users who have just undergone laryngeal surgery).
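One simple way to realize such a dynamic update, assuming the voiceprint template is represented as a numeric feature vector (the application does not specify the representation), is an exponential moving average that lets the template track gradual changes in the user's voice:

```python
def update_voiceprint(template_vector, new_feature_vector, weight=0.1):
    """Blend a newly collected voiceprint feature vector into the stored
    template. Illustrative only: a real system might instead retrain the
    underlying statistical model on the accumulated speech."""
    return [(1 - weight) * t + weight * n
            for t, n in zip(template_vector, new_feature_vector)]
```

A small weight keeps the template stable against noisy utterances while still drifting toward the speaker's current voice over repeated updates.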
FIG. 4 is a block diagram of an embodiment of a voice control device based on offline voiceprint recognition and voice recognition. The illustrated device includes the following modules: a first receiving module, configured to receive wake-up word speech and extract a first speech feature and a first voiceprint feature of the wake-up word speech; a first checking module, configured to check whether the extracted first speech feature and first voiceprint feature respectively match the wake-up word speech template and the voiceprint template, end the process if not, and otherwise acquire a first voiceprint code corresponding to the first voiceprint feature; a second receiving module, configured to receive command word speech and extract a second voiceprint feature of the command word speech; a second checking module, configured to check whether the second voiceprint feature matches the voiceprint template, end the process if not, and otherwise acquire a second voiceprint code corresponding to the second voiceprint feature; a third checking module, configured to check whether the first voiceprint code is identical to the second voiceprint code, end the process if not, and otherwise extract a second speech feature of the command word speech; and an instruction generating module, configured to check whether the extracted second speech feature matches the command word speech template, end the process if not, and otherwise acquire the speech code of the second speech feature and generate a corresponding control instruction based on the speech code. As shown in the schematic diagram of FIG. 2, the wake-up word speech template, the command word speech template, and the voiceprint template are all stored locally.
Whenever a speech feature or voiceprint feature fails to match the corresponding locally stored template, the device returns to waiting for the user to input the wake-up word speech again.
To reduce the computational load of the system, in one embodiment of the present application, the first receiving module extracts only the first speech feature upon receiving the wake-up word speech. Only when the first checking module determines that the first speech feature matches one of the groups of feature parameters recorded in the wake-up word speech template does the first receiving module extract the first voiceprint feature of the wake-up word; otherwise, if the first checking module determines that the first speech feature matches no feature parameters in the wake-up word speech template, the first receiving module prompts the user to utter the wake-up word again for re-matching. The related match determinations (the matching of the first speech feature to the wake-up word speech template, of the first voiceprint feature to the voiceprint template, and of the second speech feature to the command word speech template) can be implemented with matching algorithms conventional in the art, which the present application does not limit.
In one embodiment of the present application, the wake-up word speech template and the command word speech template are generated by training on pre-collected speech. Specifically, the user can input the wake-up word speech and the command word speech multiple times in advance, and the templates are refined through supervised training, improving the accuracy of voice recognition.
Similarly, in one embodiment of the present application, the voiceprint template is generated by training on speech previously collected from at least one user. Correspondingly, one or more users authorized to use the electronic device input the wake-up word speech and command word speech multiple times in advance, and the voiceprint template is refined through supervised training, improving the accuracy of voiceprint recognition.
Referring to the schematic diagram of the user-defined correspondence between speech and speech codes shown in FIG. 3, in one embodiment of the present application, the instruction generating module may set the correspondence between speech and speech codes according to the actual electronic device and the language used by the user. Because the user can customize this correspondence, the specific instruction issued to the electronic device is independent of the particular language in which the user utters the command word speech. For example, command word speech uttered in English or Chinese can be received and converted into the same control instruction by modifying the feature parameters of speech that has been labeled with a specific meaning in the command word speech template, so that the English or Chinese speech is associated with the designated speech code.
Further, in the above embodiments of the present application, the correspondence between speech and speech codes is also stored locally, so that voice-based control of the electronic device can be realized without a network connection.
In one embodiment of the present application, the wake-up word speech template and the command word speech template are trained by dynamically updating the collected speech. By regularly updating the wake-up word speech and the command word speech, the user can raise the security factor of the electronic device and prevent it from being misused by persons without access rights.
Similarly, in one embodiment of the present application, the voiceprint template is trained by dynamically updating the speech of at least one designated person, so that the user's voiceprint characteristics are updated in time (especially for users whose voices are changing, such as adolescents or users who have just undergone laryngeal surgery).
While the description of the present application has been made in considerable detail and with particular reference to a few illustrated embodiments, it is not intended to be limited to any such details or embodiments or any particular embodiments, but it is to be construed as effectively covering the intended scope of the application by providing a broad interpretation of the claims in view of the prior art with reference to the appended claims. Further, the foregoing describes the present application in terms of embodiments foreseen by the inventor for which an enabling description was available, notwithstanding that insubstantial changes from the present application, not presently foreseen, may nonetheless represent equivalents thereto.

Claims (5)

1. A voice control method based on offline voiceprint recognition and voice recognition, characterized by comprising the following steps:
receiving wake-up word speech, and extracting a first speech feature and a first voiceprint feature of the wake-up word speech;
checking whether the extracted first speech feature and first voiceprint feature respectively match a wake-up word speech template and a voiceprint template; if not, ending the process, otherwise acquiring a first voiceprint code corresponding to the first voiceprint feature;
receiving command word speech and extracting a second voiceprint feature of the command word speech;
checking whether the second voiceprint feature matches the voiceprint template; if not, ending the process, otherwise acquiring a second voiceprint code corresponding to the second voiceprint feature;
checking whether the first voiceprint code is identical to the second voiceprint code; if not, ending the process, otherwise extracting a second speech feature of the command word speech;
checking whether the extracted second speech feature matches the command word speech template; if not, ending the process, otherwise acquiring the speech code of the second speech feature and generating a corresponding control instruction based on the speech code;
wherein the wake-up word speech template, the command word speech template, and the voiceprint template are all stored locally; the correspondence between speech and speech codes is user-defined and stored locally; and the wake-up word speech template and the command word speech template are generated by training on pre-collected speech and by dynamically updating the collected speech.
2. The method of claim 1, wherein the voiceprint template is generated by training on speech previously collected from at least one user.
3. The method of claim 1, wherein the voiceprint template is trained by dynamically updating the speech of at least one designated person.
4. A voice control device based on offline voiceprint recognition and voice recognition, using the method according to any one of claims 1 to 3, characterized by comprising the following modules:
a first receiving module, configured to receive wake-up word speech and extract a first speech feature and a first voiceprint feature of the wake-up word speech;
a first checking module, configured to check whether the extracted first speech feature and first voiceprint feature respectively match the wake-up word speech template and the voiceprint template, end the process if not, and otherwise acquire a first voiceprint code corresponding to the first voiceprint feature;
a second receiving module, configured to receive command word speech and extract a second voiceprint feature of the command word speech;
a second checking module, configured to check whether the second voiceprint feature matches the voiceprint template, end the process if not, and otherwise acquire a second voiceprint code corresponding to the second voiceprint feature;
a third checking module, configured to check whether the first voiceprint code is identical to the second voiceprint code, end the process if not, and otherwise extract a second speech feature of the command word speech;
an instruction generating module, configured to check whether the extracted second speech feature matches the command word speech template, end the process if not, and otherwise acquire the speech code of the second speech feature and generate a corresponding control instruction based on the speech code;
wherein the wake-up word speech template, the command word speech template, and the voiceprint template are stored locally.
5. A computer-readable storage medium having stored thereon computer instructions, characterized in that the instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 3.
CN201810533494.8A 2018-05-29 2018-05-29 Voice control method and device based on offline voiceprint recognition and voice recognition Active CN108766441B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810533494.8A CN108766441B (en) 2018-05-29 2018-05-29 Voice control method and device based on offline voiceprint recognition and voice recognition

Publications (2)

Publication Number Publication Date
CN108766441A CN108766441A (en) 2018-11-06
CN108766441B true CN108766441B (en) 2020-11-10

Family

ID=64003870

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810533494.8A Active CN108766441B (en) 2018-05-29 2018-05-29 Voice control method and device based on offline voiceprint recognition and voice recognition

Country Status (1)

Country Link
CN (1) CN108766441B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109495360A (en) * 2018-12-18 2019-03-19 深圳国美云智科技有限公司 A kind of smart home Internet of Things platform, offline sound control method and system
CN111768769A (en) * 2019-03-15 2020-10-13 阿里巴巴集团控股有限公司 Voice interaction method, device, equipment and storage medium
CN110217194B (en) * 2019-04-28 2021-09-07 大众问问(北京)信息科技有限公司 Shared automobile control method and device and electronic equipment
CN110843725B (en) * 2019-07-30 2021-03-30 中国第一汽车股份有限公司 Vehicle action control method and automobile
CN110602624B (en) * 2019-08-30 2021-05-25 Oppo广东移动通信有限公司 Audio testing method and device, storage medium and electronic equipment
CN112992133A (en) * 2019-12-02 2021-06-18 杭州智芯科微电子科技有限公司 Sound signal control method, system, readable storage medium and device
CN111147484B (en) * 2019-12-25 2022-06-14 秒针信息技术有限公司 Account login method and device
CN111161731A (en) * 2019-12-30 2020-05-15 四川虹美智能科技有限公司 Intelligent off-line voice control device for household electrical appliances
CN111276141A (en) * 2020-01-19 2020-06-12 珠海格力电器股份有限公司 Voice interaction method and device, storage medium, processor and electronic equipment
CN111724768A (en) * 2020-04-22 2020-09-29 深圳市伟文无线通讯技术有限公司 System and method for real-time generation of decoded files for offline speech recognition
CN114444042A (en) * 2020-10-30 2022-05-06 华为终端有限公司 Electronic equipment unlocking method and device
CN113421567A (en) * 2021-08-25 2021-09-21 江西影创信息产业有限公司 Terminal equipment control method and system based on intelligent glasses and intelligent glasses
CN113593584B (en) * 2021-09-27 2022-01-18 深圳市羽翼数码科技有限公司 Electronic product voice control system for restraining response time delay

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104658533A (en) * 2013-11-20 2015-05-27 中兴通讯股份有限公司 Terminal unlocking method and device as well as terminal
CN105845139A (en) * 2016-05-20 2016-08-10 北方民族大学 Off-line speech control method and device
CN106453859A (en) * 2016-09-23 2017-02-22 维沃移动通信有限公司 Voice control method and mobile terminal
CN106502649A (en) * 2016-09-27 2017-03-15 北京光年无限科技有限公司 A kind of robot service awakening method and device
CN106537493A (en) * 2015-09-29 2017-03-22 深圳市全圣时代科技有限公司 Speech recognition system and method, client device and cloud server
CN106653021A (en) * 2016-12-27 2017-05-10 上海智臻智能网络科技股份有限公司 Voice wake-up control method and device and terminal

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3702867B2 (en) * 2002-06-25 2005-10-05 株式会社デンソー Voice control device

Also Published As

Publication number Publication date
CN108766441A (en) 2018-11-06

Similar Documents

Publication Publication Date Title
CN108766441B (en) Voice control method and device based on offline voiceprint recognition and voice recognition
WO2021159688A1 (en) Voiceprint recognition method and apparatus, and storage medium and electronic apparatus
CN106373575B (en) User voiceprint model construction method, device and system
AU2016216737B2 (en) Voice Authentication and Speech Recognition System
US6766295B1 (en) Adaptation of a speech recognition system across multiple remote sessions with a speaker
US5303299A (en) Method for continuous recognition of alphanumeric strings spoken over a telephone network
CN109584860B (en) Voice wake-up word definition method and system
WO2019227580A1 (en) Voice recognition method, apparatus, computer device, and storage medium
US20060217978A1 (en) System and method for handling information in a voice recognition automated conversation
AU2013203139A1 (en) Voice authentication and speech recognition system and method
JPH0354600A (en) Method of verifying identity of unknown person
CN100524459C (en) Method and system for speech recognition
CN109272991B (en) Voice interaction method, device, equipment and computer-readable storage medium
EP3989217A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN110175016A (en) Start the method for voice assistant and the electronic device with voice assistant
US7844459B2 (en) Method for creating a speech database for a target vocabulary in order to train a speech recognition system
JPS62502571A (en) Personal identification through voice analysis
JP2018021953A (en) Voice interactive device and voice interactive method
CN111128127A (en) Voice recognition processing method and device
US20080243498A1 (en) Method and system for providing interactive speech recognition using speaker data
JP2009086207A (en) Minute information generation system, minute information generation method, and minute information generation program
CN112233679A (en) Artificial intelligence speech recognition system
JPS63500126A (en) speaker verification device
US20230153408A1 (en) Methods and systems for training a machine learning model and authenticating a user with the model
Ali et al. Voice Reminder Assistant based on Speech Recognition and Speaker Identification using Kaldi

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230811

Address after: 519000, 3rd, 4th, and 5th floors of Industrial Building 1, Ecaoshan Gas Depot, Qianshan, Zhuhai, Guangdong Province

Patentee after: ZHUHAI RONGTAI ELECTRONICS Co.,Ltd.

Address before: Courtyard 802, No. 66 Hongjixuan, Jingshan Road, Jida, Zhuhai City, Guangdong Province, 519015

Patentee before: GUANGDONG SHENGJIANGJUN TECHNOLOGY Co.,Ltd.