CN115331673A - Voiceprint recognition household appliance control method and device in complex sound scene - Google Patents


Info

Publication number
CN115331673A
CN115331673A (application number CN202211256541.1A)
Authority
CN
China
Prior art keywords
audio
voiceprint recognition
similarity
template
model
Prior art date
Legal status
Granted
Application number
CN202211256541.1A
Other languages
Chinese (zh)
Other versions
CN115331673B (en)
Inventor
张林焘
吴昊
别荣芳
Current Assignee
Beijing Normal University
Original Assignee
Beijing Normal University
Priority date
Filing date
Publication date
Application filed by Beijing Normal University
Priority to CN202211256541.1A
Publication of CN115331673A
Application granted
Publication of CN115331673B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G10L2015/223 - Execution procedure of a spoken command

Abstract

The invention provides a voiceprint recognition household appliance control method and device for complex sound scenes, in the field of household appliance control. The template audio fully accounts for the varied conditions of a complex sound scene, is highly representative, and lays the foundation for improving voiceprint recognition accuracy in such scenes. A similarity detection model based on the template audio, a voiceprint recognition decision model based on an SVM model, and a voiceprint recognition model based on a convolutional neural network judge the audio in sequence, improving recognition accuracy. The models progress from simple to complex: audio that is easy to classify is resolved by the simple models, while only audio that is hard to classify reaches the complex model, reducing the consumption of computing resources.

Description

Voiceprint recognition household appliance control method and device in complex sound scene
Technical Field
The invention relates to the field of household appliance control, in particular to a voiceprint recognition household appliance control method and device in a complex sound scene.
Background
With the advancement of technology, modern household appliances are increasingly widely used by consumers. As an important identity verification technology, voiceprint recognition can identify individual family members, so that an appliance accepts instructions only from specific family members and rejects interference from unrelated persons. Under ordinary conditions, common voiceprint recognition technology achieves high recognition accuracy, enabling specific family members to control appliances precisely.
However, controlling appliances through voiceprint recognition is often accompanied by complex sound scenes, in which the recognition accuracy of voiceprint technology drops sharply. As accuracy falls, the practical value of appliances controlled by voiceprint recognition falls with it. Designing a voiceprint recognition household appliance control method that maintains recognition accuracy in complex sound scenes therefore has significant application value.
Disclosure of Invention
In order to overcome the above problems or at least partially solve the above problems, embodiments of the present invention provide a method and an apparatus for controlling a voiceprint recognition appliance in a complex sound scene.
The embodiment of the invention is realized by the following steps:
in a first aspect, an embodiment of the present invention provides a voiceprint recognition household appliance control method in a complex sound scene, including:
respectively recording multiple sections of audio of specific family members in multiple sound scenes;
encoding a plurality of pieces of audio;
after encoding, calculating the similarity between every two audios of each family member, reserving a section of audio with the similarity larger than a preset value, and determining all the reserved audio as template audio;
all template audios are used as positive training samples, audios of a plurality of non-specific family members are collected and used as negative training samples, and a machine learning model is used for training to obtain a voiceprint recognition decision model;
when the household appliance user outputs a section of audio, calculating the similarity between that audio and each template audio; if its similarity to any template audio exceeds the preset similarity, directly identifying the audio as that of a specific family member; if its similarity to every template audio is below the preset similarity, proceeding to the next step;
and judging whether the output audio of the household appliance user is the audio of the specific family member by using the voiceprint recognition decision model.
Based on the first aspect, in some embodiments of the invention, the machine learning model is an SVM model.
Based on the first aspect, in some embodiments of the present invention, the step of determining whether the output audio of the household appliance user is the audio of a specific family member by using a voiceprint recognition decision model includes:
if the score of the SVM-based voiceprint recognition decision is greater than a first preset score, directly recognizing the audio as that of a specific family member; if the score is less than a second preset score, directly recognizing the audio as that of a non-specific family member; and if the score lies between the first and second preset scores, proceeding to the next step;
and using a voiceprint recognition model based on a convolutional neural network to make the final judgment on whether the output audio of the household appliance user is that of a specific family member.
In some embodiments of the invention based on the first aspect, the step of calculating the similarity between the audio segment and the template audio comprises:
performing, for the segment of audio and the template audio: audio filtering, calculating short-time energy of an audio signal and intercepting effective data of the audio signal;
and calculating the cosine distance between the section of audio and the template audio.
Based on the first aspect, in some embodiments of the present invention, the step of respectively recording multiple pieces of audio of a specific family member in multiple sound scenes includes:
recording multiple sections of audio of a specific family member under one or more conditions such as high noise, multiple people speaking, and quiet speech;
and controlling the duration of each piece of audio to be within 5 seconds when the audio is recorded.
Based on the first aspect, in some embodiments of the present invention, the step of encoding the multiple pieces of audio includes:
and coding the multi-segment audio by using an I-Vector calculation method.
Based on the first aspect, in some embodiments of the invention, the step of collecting audio of a plurality of non-specific family members as negative training samples comprises:
collecting audio from more than 50 non-specific family members as negative training samples.
In a second aspect, an embodiment of the present invention provides a home appliance control system for voiceprint recognition in a complex sound scene, including:
a recording module: respectively recording multiple sections of audio of specific family members in multiple sound scenes;
and an encoding module: encoding a plurality of pieces of audio;
a calculate similarity module: after encoding, calculating the similarity between every two audios of each family member, reserving a section of audio with the similarity larger than a preset value, and determining all the reserved audio as template audio;
a training module: taking all template audios as positive training samples, collecting audios of a plurality of non-specific family members as negative training samples, and training by using a machine learning model to obtain a voiceprint recognition decision model;
an identification module: when the household appliance user outputs a section of audio, the similarity between the section of audio and the template audio is calculated, and if the similarity between the section of audio and any template audio is greater than the preset similarity, the audio is directly identified as the audio of a specific family member;
a judgment module: if the similarity between the audio and every template audio is below the preset similarity, using the voiceprint recognition decision model to judge whether the output audio of the household appliance user is that of a specific family member.
In a third aspect, an embodiment of the present invention provides an electronic device, including:
at least one processor, at least one memory, and a data bus; wherein:
the processor and the memory complete mutual communication through the data bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to execute the method.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing a computer program, where the computer program causes a computer to execute the method described above.
Compared with the prior art, the embodiment of the invention has at least the following advantages or beneficial effects:
(1) The template audio fully considers various conditions in a complex sound scene, has better representativeness and lays a foundation for improving the voiceprint recognition precision in the complex sound scene.
(2) And the similarity detection model based on the template audio, the voiceprint recognition decision model based on the SVM model and the voiceprint recognition model based on the convolutional neural network are used for sequentially judging, so that the voiceprint recognition precision is improved.
(3) The similarity detection model based on the template audio, the voiceprint recognition decision model based on the SVM model and the voiceprint recognition model based on the convolutional neural network are used for sequentially judging, the models are simple to complex, the result can be obtained by using the simple model for the audio which is easy to judge, the result can be obtained by using the complex model for the audio signal which is difficult to judge, and the consumption of computing resources is reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a flowchart of an embodiment of a voiceprint recognition home appliance control method in a complex sound scene according to the present invention;
FIG. 2 is a flowchart of an embodiment of a voiceprint recognition appliance control method in a complex sound scene according to the present invention;
fig. 3 is a block diagram illustrating a structure of a voiceprint recognition home appliance control apparatus in a complex sound scene according to an embodiment of the present invention;
fig. 4 is a block diagram of an electronic device according to an embodiment of the invention.
Reference numerals: 1. recording module; 2. encoding module; 3. similarity calculation module; 4. training module; 5. recognition module; 6. judgment module; 7. processor; 8. memory; 9. data bus.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures.
In the embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. The system embodiments are merely illustrative, and for example, the block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and computer program products according to various embodiments of the present application. In this regard, each block in the block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling a computer device, which may be a personal computer, a server, or a network device, to execute all or part of the steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In the description of the embodiments of the present invention, "a plurality" represents at least 2.
In the description of the embodiments of the present invention, it should be further noted that unless otherwise explicitly stated or limited, the terms "disposed" and "connected" should be interpreted broadly, and may be, for example, fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
Examples
Referring to fig. 1, in a first aspect, an embodiment of the present invention provides a method for controlling a voiceprint recognition appliance in a complex sound scene, including:
s1: respectively recording multiple sections of audio of specific family members in multiple sound scenes;
in this step, the multiple sound scenes, i.e., complex scenes, include conditions such as high noise, multiple people speaking, and quiet speech, covering as comprehensively as possible the situations voiceprint recognition encounters during appliance use. The audio duration can be set according to actual conditions; voice control of an appliance is usually brief, so each section of audio is kept within 5 seconds. The template audio fully accounts for the varied conditions of a complex sound scene, is highly representative, and lays the foundation for improving voiceprint recognition accuracy in such scenes.
S2: encoding a plurality of pieces of audio;
in this step, the multiple sections of audio are encoded using the I-Vector method. In practice, speaker information is mixed with various kinds of interference in the recorded speech, and differences between the channels of different acquisition devices introduce additional channel interference into the collected audio. Such interference perturbs the speaker information. The traditional GMM-UBM method cannot overcome this problem, so system performance is unstable. In the GMM-UBM framework, each target speaker is described by a GMM. Because adapting from the UBM to each speaker's GMM changes only the means, leaving the weights and covariances untouched, most of the speaker information is contained in the GMM mean vectors; however, those mean vectors carry channel information in addition to speaker information. Joint Factor Analysis (JFA) models speaker variability and channel variability separately, compensating for channel differences and improving system performance; but JFA requires large training corpora from different channels, which are difficult to obtain, and is computationally complex, making it hard to deploy in practice. Dehak proposed a novel solution based on I-Vector factor analysis: where JFA models the speaker-difference space and the channel-difference space separately, the I-Vector approach models the total variability as a whole, which relaxes the corpus requirements, is simple to compute, and delivers comparable performance.
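As a rough illustration of what an utterance encoder delivers to the later stages, the sketch below maps variable-length frame features to a fixed-length, length-normalised embedding. The mean pooling is a deliberate simplification standing in for a real i-vector extractor (which projects a GMM supervector through a trained total-variability matrix T); only the input/output shape mirrors the pipeline described here.

```python
import numpy as np

def utterance_embedding(frames: np.ndarray) -> np.ndarray:
    """Map a (num_frames, dim) matrix of per-frame features to one
    fixed-length, length-normalised vector.

    NOTE: a real i-vector extractor projects a GMM supervector through
    a trained total-variability matrix; mean pooling here is only an
    illustrative stand-in with the same input/output contract."""
    v = frames.mean(axis=0)
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

# Utterances of different lengths map to embeddings of identical size,
# which is what makes template comparison and SVM training possible.
short = utterance_embedding(np.random.rand(80, 20))
long_ = utterance_embedding(np.random.rand(300, 20))
```

The fixed output dimension is the point: every downstream stage (template similarity, SVM, CNN) can assume one vector size per utterance.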
S3: after encoding, calculating the similarity between every two audios of each family member, reserving a section of audio with the similarity larger than a preset value, and regarding all reserved audio as template audio;
in the step, calculating the similarity between every two audios of each family member comprises respectively carrying out audio filtering on the two sections of audios, calculating the short-time energy of an audio signal and intercepting effective data of the audio signal; and calculating the cosine distance of the two sections of audio. And reserving a section of audio with the similarity larger than a preset value, and considering all the reserved audio as template audio, wherein the preset value can be reasonably set according to actual requirements.
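The template-selection rule above can be sketched as follows. The function names and the 0.8 preset value are illustrative assumptions, not figures from the patent; the logic is simply "keep a recording if it agrees with at least one other recording of the same member above the preset value".

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two encoded audio segments."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_templates(embeddings, preset=0.8):
    """Pairwise similarity over one member's recordings; keep indices of
    recordings whose similarity to some other recording exceeds the
    preset value (0.8 is an assumed figure)."""
    kept = []
    for i, e in enumerate(embeddings):
        if any(cosine_sim(e, o) > preset
               for j, o in enumerate(embeddings) if j != i):
            kept.append(i)
    return kept

# Three mutually consistent recordings and one outlier: the outlier
# never exceeds the preset similarity with any peer, so it is dropped.
base = np.ones(8)
recs = [base, base + 0.01, base - 0.01, -base]
templates = select_templates(recs)
```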
S4: all template audios are used as positive training samples, audios of a plurality of non-specific family members are collected and used as negative training samples, and a machine learning model is used for training to obtain a voiceprint recognition decision model;
in this step, collecting audio of a plurality of non-specific family members as negative training samples means collecting audio from more than 50 non-specific family members. The machine learning model may be an SVM model.
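A sketch of this training step using scikit-learn's `SVC` on synthetic stand-in embeddings; the data, embedding dimension, and hyperparameters are all assumptions made for illustration, not values from the patent.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in embeddings: positives play the role of the member's template
# audio, negatives the recordings from more than 50 other speakers.
pos = rng.normal(loc=1.0, scale=0.3, size=(20, 16))
neg = rng.normal(loc=-1.0, scale=0.3, size=(60, 16))
X = np.vstack([pos, neg])
y = np.array([1] * len(pos) + [0] * len(neg))

# probability=True yields a continuous score that the later cascade can
# threshold twice (first/second preset scores) instead of a hard 0/1.
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
score = float(clf.predict_proba(pos[:1])[0, 1])
```

The two-threshold use of `score` is what lets ambiguous audio fall through to the CNN stage.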
S5: when the household appliance user outputs a section of audio, the similarity between that audio and each template audio is calculated; if the similarity to any template audio exceeds the preset similarity, the audio is directly identified as that of a specific family member; if the similarity to every template audio is below the preset similarity, the next step is carried out;
in this step, the similarity between the piece of audio and the template audio may be calculated using a similarity detection model based on the template audio. The step of calculating the similarity between the audio and the template audio comprises: performing, for the segment of audio and the template audio: audio filtering, calculating short-time energy of an audio signal and intercepting effective data of the audio signal; and calculating the cosine distance between the section of audio and the template audio.
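Two of the preprocessing operations named above, plus the cosine-distance comparison, can be sketched as follows (audio filtering is omitted). The frame length and energy ratio are assumed values, and the energy-based trimming is a minimal stand-in for "intercepting effective data of the audio signal".

```python
import numpy as np

def short_time_energy(signal, frame_len=256):
    """Sum of squares per non-overlapping frame (frame_len is assumed)."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    return (frames ** 2).sum(axis=1)

def trim_effective(signal, frame_len=256, ratio=0.1):
    """Keep the span between the first and last frame whose energy
    exceeds `ratio` of the peak frame energy; a minimal stand-in for
    intercepting the effective data of the signal."""
    e = short_time_energy(signal, frame_len)
    active = np.flatnonzero(e > ratio * e.max())
    if active.size == 0:
        return signal
    return signal[active[0] * frame_len:(active[-1] + 1) * frame_len]

def cosine_distance(u, v):
    """Cosine distance between two encoded audio segments."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# A burst of signal flanked by silence: trimming drops the silent frames.
sig = np.concatenate([np.zeros(512),
                      np.sin(np.linspace(0, 60, 1024)),
                      np.zeros(512)])
trimmed = trim_effective(sig)
```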
S6: and judging whether the output audio of the household appliance user is the audio of the specific family member by using the voiceprint recognition decision model.
The similarity detection model based on the template audio and the voiceprint recognition decision model based on the SVM model judge the audio in sequence, improving voiceprint recognition accuracy. The models progress from simple to complex: audio that is easy to classify is resolved by the simple model, while audio that is hard to classify is passed to the complex model, reducing the consumption of computing resources.
Based on the first aspect, in some embodiments of the present invention, the step of determining whether the output audio of the household appliance user is the audio of a specific family member by using a voiceprint recognition decision model includes:
referring to fig. 2, S61: if the score of the voiceprint recognition decision result based on the SVM model is larger than a first preset score, directly recognizing the voiceprint recognition decision result based on the SVM model as the audio frequency of a specific family member, if the score of the voiceprint recognition decision result based on the SVM model is smaller than a second preset score, directly recognizing the voiceprint recognition decision result based on the SVM model as the audio frequency of a non-specific family member, and if the score of the voiceprint recognition decision result based on the SVM model is between the first preset score and the second preset score, performing the next step;
S62: the voiceprint recognition model based on the convolutional neural network makes the final judgment on whether the output audio of the household appliance user is that of a specific family member.
The similarity detection model based on the template audio, the voiceprint recognition decision model based on the SVM model, and the voiceprint recognition model based on the convolutional neural network judge the audio in sequence, improving voiceprint recognition accuracy. The models progress from simple to complex: audio that is easy to classify obtains a result from the simple models, audio that is hard to classify obtains a result from the complex model, and the consumption of computing resources is reduced.
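The simple-to-complex cascade described above can be expressed as a short decision function. All threshold values and the `cnn_decide` callback are illustrative assumptions; the patent leaves the preset values to be set per deployment.

```python
def cascade_decide(template_sim, svm_score, cnn_decide=None,
                   sim_preset=0.9, first_preset=0.8, second_preset=0.3):
    """Three-stage decision; threshold values are assumptions.
    Stage 1: a strong template match accepts immediately.
    Stage 2: the SVM score accepts above first_preset and rejects
    below second_preset.
    Stage 3: only the ambiguous band pays for CNN inference."""
    if template_sim > sim_preset:
        return True                      # cheap template match suffices
    if svm_score > first_preset:
        return True
    if svm_score < second_preset:
        return False
    # Hard cases fall through to the (expensive) CNN model.
    return cnn_decide() if cnn_decide is not None else False

easy_accept = cascade_decide(0.95, 0.0)
easy_reject = cascade_decide(0.1, 0.1)
hard_case = cascade_decide(0.1, 0.5, cnn_decide=lambda: True)
```

The design point is that most utterances exit at stage 1 or 2, so the CNN runs only on the minority of ambiguous inputs.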
Referring to fig. 3, in a second aspect, an embodiment of the present invention provides a voiceprint recognition home appliance control system in a complex sound scene, including:
the recording module 1: respectively recording multiple sections of audio of specific family members in multiple sound scenes;
and the coding module 2: encoding a plurality of pieces of audio;
the calculate similarity module 3: after encoding, calculating the similarity between every two audios of each family member, reserving a section of audio with the similarity larger than a preset value, and regarding all reserved audio as template audio;
the training module 4: taking all template audios as positive training samples, collecting audios of a plurality of non-specific family members as negative training samples, and training by using a machine learning model to obtain a voiceprint recognition decision model;
the identification module 5: when the household appliance user outputs a section of audio, calculating the similarity between the section of audio and the template audio, and if the similarity between the section of audio and any template audio is greater than the preset similarity, directly identifying the audio as the audio of a specific family member;
and a judgment module 6: if the similarity between the audio and every template audio is below the preset similarity, using the voiceprint recognition decision model to judge whether the output audio of the household appliance user is that of a specific family member.
For the specific implementation of the apparatus, please refer to the implementation of the method, and redundant description is omitted here.
Referring to fig. 4, in a third aspect, an embodiment of the invention provides an electronic device, including:
at least one processor 7, at least one memory 8 and a data bus 9; wherein:
the processor 7 and the memory 8 complete communication with each other through the data bus 9; the memory 8 stores program instructions executable by the processor 7, and the processor 7 calls the program instructions to execute the method. For example, the above-described steps S1 to S6 are performed.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium storing a computer program, where the computer program causes a computer to execute the method described above. For example, the above-described steps S1 to S6 are performed.
In conclusion, the invention provides a voiceprint recognition household appliance control method for complex sound scenes. The template audio fully accounts for the varied conditions of a complex sound scene, is highly representative, and lays the foundation for improving voiceprint recognition accuracy in such scenes. The similarity detection model based on the template audio, the voiceprint recognition decision model based on the SVM model, and the voiceprint recognition model based on the convolutional neural network judge the audio in sequence, improving recognition accuracy. The models progress from simple to complex: audio that is easy to classify is resolved by the simple models and audio that is hard to classify by the complex model, reducing the consumption of computing resources.
The present invention has been described in terms of the preferred embodiment, and it is not intended to be limited to the embodiment. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (10)

1. A voiceprint recognition household appliance control method in a complex sound scene is characterized by comprising the following steps:
respectively recording multiple sections of audio of specific family members in multiple sound scenes;
encoding a plurality of pieces of audio;
after encoding, calculating the similarity between every two audios of each family member, reserving a section of audio with the similarity larger than a preset value, and determining all the reserved audio as template audio;
taking all template audios as positive training samples, collecting audios of a plurality of non-specific family members as negative training samples, and training by using a machine learning model to obtain a voiceprint recognition decision model;
when the household appliance user outputs a segment of audio, calculating the similarity between that segment and the template audios; if its similarity to any template audio is greater than a preset similarity, directly identifying it as the audio of a specific family member; if its similarity to every template audio is less than the preset similarity, proceeding to the next step;
and judging, with the voiceprint recognition decision model, whether the audio output by the household appliance user is the audio of a specific family member.
2. The method as claimed in claim 1, wherein the machine learning model is an SVM model.
3. The method as claimed in claim 2, wherein the step of determining whether the output audio of the user of the household appliance is the audio of the specific family member using the voiceprint recognition decision model comprises:
if the score of the voiceprint recognition decision based on the SVM model is greater than a first preset score, directly identifying the audio as that of a specific family member; if the score is less than a second preset score, directly identifying the audio as that of a non-specific family member; and if the score lies between the first preset score and the second preset score, proceeding to the next step;
and making a final judgment on the audio output by the household appliance user with a voiceprint recognition model based on a convolutional neural network, to determine whether it is the audio of a specific family member.
4. The method as claimed in claim 1, wherein the user of the home appliance outputs a segment of audio, and the step of calculating the similarity between the segment of audio and the template audio comprises:
performing, on both the segment of audio and the template audio: audio filtering, calculation of the short-time energy of the audio signal, and interception of the effective data of the audio signal;
and calculating the cosine distance between the section of audio and the template audio.
5. The method as claimed in claim 1, wherein the step of respectively recording multiple audio segments of a specific family member in a plurality of sound scenes comprises:
recording multiple audio segments of the specific family member under one or more of the conditions of high noise, multiple people speaking, and low speaking volume;
and keeping the duration of each audio segment within 5 seconds during recording.
6. The method as claimed in claim 1, wherein the step of encoding the multiple audio segments comprises:
and encoding the multiple audio segments using the i-vector method.
7. The method as claimed in claim 1, wherein the step of collecting the audios of a plurality of unspecific family members as the negative training samples comprises:
more than 50 non-specific family member audios were collected as negative training samples.
8. A voiceprint recognition household appliance control device in a complex sound scene is characterized by comprising:
a recording module: respectively recording multiple audio segments of a specific family member in a plurality of sound scenes;
and an encoding module: encoding a plurality of pieces of audio;
a similarity calculation module: after encoding, calculating the pairwise similarity between the audio segments of each family member, retaining the segments whose similarity is greater than a preset value, and regarding all retained segments as template audio;
a training module: all template audios are used as positive training samples, audios of a plurality of non-specific family members are collected and used as negative training samples, and a machine learning model is used for training to obtain a voiceprint recognition decision model;
an identification module: when the household appliance user outputs a segment of audio, calculating the similarity between that segment and the template audios, and if its similarity to any template audio is greater than the preset similarity, directly identifying it as the audio of a specific family member;
a judging module: if the similarity between the segment of audio and every template audio is less than the preset similarity, using the voiceprint recognition decision model to judge whether the audio output by the household appliance user is the audio of a specific family member.
9. An electronic device, comprising:
at least one processor, at least one memory, and a data bus; wherein:
the processor and the memory communicate with each other via the data bus; the memory stores program instructions executable by the processor, and the processor calls the program instructions to perform the method of any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing a computer program that causes a computer to perform the method according to any one of claims 1 to 7.
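Claims 4 and 5 describe the similarity computation: short-time energy is used to intercept the effective portion of the signal, and the cosine distance between encoded segments measures similarity. A minimal sketch follows; the frame length and the relative energy threshold are illustrative assumptions, not values specified by the claims.

```python
import numpy as np

def trim_by_short_time_energy(signal, frame_len=256, energy_ratio=0.1):
    """Keep only frames whose short-time energy exceeds a fraction of the
    peak frame energy -- a simple form of the 'effective data interception'
    named in claim 4."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)           # short-time energy per frame
    keep = energy > energy_ratio * energy.max()  # threshold relative to peak
    return frames[keep].reshape(-1)

def cosine_distance(a, b):
    """Cosine distance between two fixed-length feature vectors
    (e.g. i-vector encodings of the audio segments)."""
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - float(cos)
```

Audio filtering, also named in claim 4, is omitted here; a band-pass filter over the speech band would typically precede the energy computation.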
CN202211256541.1A 2022-10-14 2022-10-14 Voiceprint recognition household appliance control method and device in complex sound scene Active CN115331673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211256541.1A CN115331673B (en) 2022-10-14 2022-10-14 Voiceprint recognition household appliance control method and device in complex sound scene

Publications (2)

Publication Number Publication Date
CN115331673A true CN115331673A (en) 2022-11-11
CN115331673B CN115331673B (en) 2023-01-03

Family

ID=83913606

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211256541.1A Active CN115331673B (en) 2022-10-14 2022-10-14 Voiceprint recognition household appliance control method and device in complex sound scene

Country Status (1)

Country Link
CN (1) CN115331673B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160217793A1 (en) * 2015-01-26 2016-07-28 Verint Systems Ltd. Acoustic signature building for a speaker from multiple sessions
CN105895077A (en) * 2015-11-15 2016-08-24 乐视移动智能信息技术(北京)有限公司 Recording editing method and recording device
CN106971737A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 A kind of method for recognizing sound-groove spoken based on many people
CN107766868A (en) * 2016-08-15 2018-03-06 中国联合网络通信集团有限公司 A kind of classifier training method and device
CN110164453A (en) * 2019-05-24 2019-08-23 厦门快商通信息咨询有限公司 A kind of method for recognizing sound-groove, terminal, server and the storage medium of multi-model fusion
US20200058293A1 (en) * 2017-10-23 2020-02-20 Tencent Technology (Shenzhen) Company Limited Object recognition method, computer device, and computer-readable storage medium
CN111785286A (en) * 2020-05-22 2020-10-16 南京邮电大学 Home CNN classification and feature matching combined voiceprint recognition method
CN112230555A (en) * 2020-10-12 2021-01-15 珠海格力电器股份有限公司 Intelligent household equipment, control method and device thereof and storage medium
CN112351047A (en) * 2021-01-07 2021-02-09 北京远鉴信息技术有限公司 Double-engine based voiceprint identity authentication method, device, equipment and storage medium
CN112634869A (en) * 2020-12-09 2021-04-09 鹏城实验室 Command word recognition method, device and computer storage medium
CN113241081A (en) * 2021-04-25 2021-08-10 华南理工大学 Far-field speaker authentication method and system based on gradient inversion layer
CN114464193A (en) * 2022-03-12 2022-05-10 云知声智能科技股份有限公司 Voiceprint clustering method and device, storage medium and electronic device

Also Published As

Publication number Publication date
CN115331673B (en) 2023-01-03

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
CN105976812B (en) A kind of audio recognition method and its equipment
CN111785288B (en) Voice enhancement method, device, equipment and storage medium
Abidin et al. Spectrotemporal analysis using local binary pattern variants for acoustic scene classification
CN109410956B (en) Object identification method, device, equipment and storage medium of audio data
Hwang et al. Environmental audio scene and activity recognition through mobile-based crowdsourcing
US11133022B2 (en) Method and device for audio recognition using sample audio and a voting matrix
CN110473552A (en) Speech recognition authentication method and system
CN110136726A (en) A kind of estimation method, device, system and the storage medium of voice gender
CN111816170B (en) Training of audio classification model and garbage audio recognition method and device
CN111344717A (en) Interactive behavior prediction method, intelligent device and computer-readable storage medium
Ghaemmaghami et al. Complete-linkage clustering for voice activity detection in audio and visual speech
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN115331673B (en) Voiceprint recognition household appliance control method and device in complex sound scene
CN115221351A (en) Audio matching method and device, electronic equipment and computer-readable storage medium
CN115547345A (en) Voiceprint recognition model training and related recognition method, electronic device and storage medium
CN111477248B (en) Audio noise detection method and device
CN114333840A (en) Voice identification method and related device, electronic equipment and storage medium
CN112489678A (en) Scene recognition method and device based on channel characteristics
CN114333844A (en) Voiceprint recognition method, voiceprint recognition device, voiceprint recognition medium and voiceprint recognition equipment
CN113889086A (en) Training method of voice recognition model, voice recognition method and related device
CN113571063A (en) Voice signal recognition method and device, electronic equipment and storage medium
US20230377560A1 (en) Speech tendency classification
CN114363673B (en) Video clipping method, model training method and device
CN111833897B (en) Voice enhancement method for interactive education

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant