CN111261168A - Speech recognition engine and method supporting multi-task and multi-model - Google Patents

Speech recognition engine and method supporting multi-task and multi-model

Info

Publication number
CN111261168A
CN111261168A
Authority
CN
China
Prior art keywords
model
recognition
voice
user
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010069206.5A
Other languages
Chinese (zh)
Inventor
范小朋
俞恺源
严伟玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Zhongke Advanced Technology Research Institute Co ltd
Original Assignee
Hangzhou Zhongke Advanced Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Zhongke Advanced Technology Research Institute Co ltd filed Critical Hangzhou Zhongke Advanced Technology Research Institute Co ltd
Priority to CN202010069206.5A priority Critical patent/CN111261168A/en
Publication of CN111261168A publication Critical patent/CN111261168A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/30Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/22Interactive procedures; Man-machine interfaces

Abstract

The invention relates to a speech recognition engine and method supporting multi-task and multi-model operation. The speech recognition engine comprises a central device; the central device comprises a container, a voice acquisition module, a voiceprint recognition module and a result output module, and the container is loaded with a speech recognition model. The voice acquisition module acquires the user's voice information; the voiceprint recognition module performs voiceprint recognition on the voice information and determines the corresponding speech recognition model according to the voiceprint recognition result; after the speech recognition model recognizes the voice information, the recognition result is output through the result output module. Because each user uploads his or her own speech recognition model and voiceprint model, the central device can perform speech recognition with the model of each participating user, and the central device deletes the loaded user models when the service ends, thereby protecting the users' private data and models.

Description

Speech recognition engine and method supporting multi-task and multi-model
Technical Field
The invention belongs to the field of speech recognition, and relates to a speech recognition engine and method supporting multiple tasks and multiple models.
Background
With the increasing maturity of speech recognition technology, mainstream speech recognition products achieve high recognition accuracy. However, mainstream systems collect the user's voice data and transmit it to the cloud for analysis, processing and model training, which compromises user privacy to some extent. As technology develops rapidly, people's privacy awareness keeps increasing. Therefore, how to guarantee the privacy of user data while performing speech recognition is a research issue worth considering.
Disclosure of Invention
The invention provides a speech recognition engine and method supporting multiple tasks and multiple models. The speech recognition engine neither infringes user privacy nor collects the user's voice data covertly: all training is completed on the user's private device. When a recognition service is needed, each user uploads his or her speech recognition model and voiceprint model, so that the central device (for example, a conference recorder) can perform speech recognition with the model of each participating user; when the service ends, the central device deletes the loaded user models, thereby protecting the users' private data and models.
The technical scheme for solving the problems is as follows: a speech recognition engine that supports multitasking and multiple models, characterized by:
comprises a central device;
the central equipment comprises a container, a voice acquisition module, a voiceprint recognition module and a result output module; the container is loaded with a speech recognition model;
the voice acquisition module is used for acquiring the user's voice information; the voiceprint recognition module is used for performing voiceprint recognition on the voice information and determining the corresponding speech recognition model according to the voiceprint recognition result; after the speech recognition model recognizes the voice information, the recognition result is output through the result output module.
Preferably, the number of containers is at least two.
Preferably, each container is loaded with a different speech recognition model.
Preferably, the container and the voiceprint recognition module may be located in the central device (i.e., locally) or in the cloud.
Preferably, the underlying system of the central device is an Android system.
A speech recognition method supporting multitask and multiple models is characterized by comprising the following steps:
1) acquiring user voice information;
2) performing voiceprint recognition on the voice information;
3) after the corresponding user is identified, the voice information is transmitted to the container corresponding to the user, and speech recognition is carried out by using the speech recognition model of the corresponding user;
4) and recording and outputting the recognition result.
Preferably, the method further comprises a step 5) of deleting the voice recognition model of the corresponding user after the recognition result is output.
Preferably, the speech recognition model corresponding to the user refers to a speech recognition model uploaded by the user.
The invention has the advantages that:
the invention provides a voice recognition engine and a method supporting multitask and multiple models, which have the advantages that enterprises or other users cannot obtain voice data and models of the users, so that private data of the users are protected; the user can decide whether to authorize the model obtained by the voice training of the user or not; the distributed voice recognition container also enables users to upload different voice models which are most suitable for the users; the whole engine achieves privacy and individuation.
Drawings
FIG. 1 is a flow chart of a multitasking and multimodal speech recognition service proposed by the present invention;
FIG. 2 is a diagram of a central facility architecture according to the present invention;
FIG. 3 is a flow chart of the operation of the center device in the present invention;
FIG. 4 is a diagram of a multi-tasking multi-model cloud identification architecture according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings of the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention.
A speech recognition method supporting multitask and multiple models comprises the following steps:
1) acquiring user voice information;
2) performing voiceprint recognition on the voice information;
3) after the corresponding user is identified, the voice information is transmitted to a container corresponding to the user, and voice recognition is carried out by using a voice recognition model of the corresponding user;
4) and recording and outputting the recognition result.
Preferably, the method further comprises a step 5): after the recognition result is output, the user is disconnected and the speech recognition model of the corresponding user is deleted.
Preferably, the speech recognition model corresponding to the user refers to a speech recognition model uploaded by the user.
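The method steps above, including the optional step 5) deletion, can be sketched as a small pipeline. This is a hedged illustration only: the class name SpeechEngine, the dictionary-based container store, and the speaker tag in the audio record are all invented stand-ins; real voiceprint and speech recognition models, which the patent assumes, are replaced by toys.

```python
# Illustrative sketch of the five-step method; all names are invented, not from the patent.
class SpeechEngine:
    def __init__(self):
        self.containers = {}  # user id -> loaded speech recognition model (one per container)

    def register(self, user, model):
        """Load a user's uploaded speech recognition model into its own container."""
        self.containers[user] = model

    def identify_speaker(self, audio):
        """Stand-in for voiceprint recognition: here the audio is simply tagged with its speaker."""
        return audio["speaker"]

    def recognize(self, audio):
        user = self.identify_speaker(audio)   # step 2): voiceprint recognition
        model = self.containers[user]         # step 3): route to the user's container
        text = model(audio["samples"])        # step 3): per-user speech recognition
        return f"{user}: {text}"              # step 4): record and output

    def end_service(self):
        """Step 5): delete every loaded user model when the service finishes."""
        self.containers.clear()

engine = SpeechEngine()
engine.register("alice", lambda samples: "hello")  # toy model: always returns "hello"
result = engine.recognize({"speaker": "alice", "samples": b"\x00\x01"})
engine.end_service()
```

Deleting the models in `end_service` mirrors the privacy guarantee of the method: once the service ends, the central device retains no user model.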
Based on the above method, the present invention provides a speech recognition engine supporting multi-task and multi-model, as shown in fig. 1, comprising a central device; the central device comprises a container, a voice acquisition module, a voiceprint recognition module and a result output module, and the container is loaded with a speech recognition model. The voice acquisition module is used for acquiring the user's voice information; the voiceprint recognition module is used for performing voiceprint recognition on the voice information and determining the corresponding speech recognition model according to the voiceprint recognition result; after the speech recognition model recognizes the voice information, the recognition result is output through the result output module.
Preferably, there may be a plurality of containers.
Preferably, each container is loaded with a different speech recognition model; alternatively, each container is loaded with the same speech recognition model but with different parameters.
Preferably, the user's speech recognition model and voiceprint model are both trained on the user's private device. When the central device/system needs to use a user's models, the user's speech and voiceprint models are uploaded to the central device after the user's manual authorization. The containers into which the models are loaded are isolated from each other and do not affect each other, and different types of models can be loaded. After the required voice service ends, each container in the central device deletes the user model it has loaded.
The central equipment architecture in the invention is as follows:
the bottom layer system of the central device is an android system, referring to fig. 2, a plurality of docker containers are operated on the android system, the specific number of the containers is determined by the number of connected users, and each container is allocated with the same amount of GPU and CPU resources. By the method, the isolation among the containers is ensured, and the condition that the model which is in operation by the other side cannot be obtained among the containers is ensured. Each container has a respective TensorflowLite. Each TensorflowLite is responsible for loading a model for one user. Taking the following (architecture) diagram as an example, the number of connected users is three, three containers are established in an equipment main system (android), each container is respectively provided with a TensorflowLite, and each TensorflowLite is used for loading a speech recognition model of a corresponding user.
The user mobile phone and the center device in the invention are interacted as follows:
when the mobile phone of the user and the central device are in the same network environment, the user can transmit the own pb model file to the central device by using the app on the mobile phone. Because the central facility opens a separate container space for each user, the user can use different speech recognition models, or the same model with parameters custom-trained from the user's own data. Achieving the effect of individuation or multiple models. The whole steps are as follows:
connect to a wireless network (the same one as the central device);
open the app, then confirm and authorize transmission of the .pb model file;
after all users have finished transmission, send an instruction to the central device to start recognition and recording.
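The phone-side steps can be sketched as below. This is an assumption-laden illustration: the function prepare_upload, the field names, and the file-extension check are invented; the patent only specifies that the phone and central device share one wireless network and that the user confirms and authorizes transmission of the .pb model file.

```python
# Hypothetical phone-side helper; names and fields are illustrative, not from the patent.
import os

def prepare_upload(pb_path, authorized):
    """Package a user's .pb model file for transmission once the user authorizes it."""
    if not authorized:
        # The patent requires explicit user authorization before any model leaves the phone.
        raise PermissionError("user has not authorized model transmission")
    if not pb_path.endswith(".pb"):
        raise ValueError("expected a TensorFlow .pb model file")
    return {"filename": os.path.basename(pb_path), "action": "upload_model"}

request = prepare_upload("/sdcard/models/user_a.pb", authorized=True)
```

Refusing to build the request without authorization reflects the design choice that the user, not the central device, controls when a model is shared.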
Identification and recording of the central device in the invention:
when the recognition service is started, the voice signal is acquired by the central equipment through the microphone, the central equipment recognizes the user corresponding to the voice section by comparing the voiceprint models uploaded by all the participating users, and the container corresponding to the user is searched through the hash table.
The central device pushes the voice to the container of the corresponding user, and the model in the container performs speech recognition; the container then pushes the recognized text back to the central device, which records and saves it in the form "xxx (speaker): xxxx (content)". The specific flow is shown in fig. 3.
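A minimal sketch of the lookup described above, assuming the uploaded voiceprint models can be treated as per-user scoring functions: the central device picks the highest-scoring user for a segment, then resolves that user's container through a hash table (a plain Python dict here). The scoring lambdas are toys, not real voiceprint models.

```python
# Illustrative voiceprint matching plus hash-table container lookup.
def best_matching_user(segment, voiceprints):
    """Pick the user whose voiceprint model scores highest on this segment."""
    return max(voiceprints, key=lambda user: voiceprints[user](segment))

voiceprints = {
    "user_a": lambda seg: 0.9 if seg == "seg1" else 0.1,  # toy scorer for user A
    "user_b": lambda seg: 0.8 if seg == "seg2" else 0.2,  # toy scorer for user B
}
container_of = {"user_a": "container_1", "user_b": "container_2"}  # the hash table

speaker = best_matching_user("seg1", voiceprints)
container = container_of[speaker]
```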
Multitasking and multiple models
A) Multitasking:
because the container system created for each user has relative independence, each container is respectively responsible for processing the voice data of a single user corresponding to the container;
B) multiple models are as follows:
the user is free to upload the model. The model uploaded by the user A and the model uploaded by the user B have different model structures, or have the same model structure but different parameters, namely the model is a multi-model. The model can be a model which is trained by a user by using personal data and optimized aiming at individuals, so that the recognition effect is optimal as much as possible.
Preferably, the container and the voiceprint recognition module of the central device are arranged in the cloud.
The architecture in the central device may also be implemented in the cloud (see fig. 4):
1) the user still connects to the central device through mobile phone authorization and authorizes the use of his or her voiceprint and speech recognition models;
2) the central device connects to the cloud through a network, the same containers as described above are established in the cloud, and the user models are transmitted to the cloud;
3) after the service starts, the central device transmits the voice input to the cloud; the cloud performs voiceprint recognition first, and the voiceprint module transmits the voice to the container corresponding to the voiceprint recognition result for speech recognition;
4) the cloud combines the results of voiceprint recognition and speech recognition into the form "xxx (speaker): xxxxx (specific speech content)" and sends it back to the central device;
5) the central device stores the received recognition result.
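Steps 3) to 5) of the cloud variant can be sketched as follows; cloud_recognize and store are illustrative names, and the lambdas stand in for the uploaded voiceprint and speech recognition models, which the patent does not specify in code form.

```python
# Hypothetical cloud-side round trip: voiceprint first, then the matched container's model.
def cloud_recognize(audio, voiceprint_of, model_in_container):
    """Cloud side: identify the speaker, then run that user's container model."""
    speaker = voiceprint_of(audio)             # step 3): voiceprint recognition
    text = model_in_container[speaker](audio)  # step 3): container speech recognition
    return f"{speaker}: {text}"                # step 4): combined "speaker: content" record

records = []
def store(record):
    """Step 5): the central device saves the record it receives from the cloud."""
    records.append(record)

store(cloud_recognize("clip1",
                      lambda a: "user_a",                    # toy voiceprint model
                      {"user_a": lambda a: "hello world"}))  # toy recognition model
```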
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent structures or equivalent flow transformations made by using the contents of the specification and the drawings, or applied directly or indirectly to other related systems, are included in the scope of the present invention.

Claims (8)

1. A speech recognition method supporting multitask and multiple models is characterized by comprising the following steps:
1) acquiring user voice information;
2) performing voiceprint recognition on the voice information;
3) after the corresponding user is identified, the voice information is transmitted to the container corresponding to the user, and speech recognition is carried out by using the speech recognition model of the corresponding user;
4) and recording and outputting the recognition result.
2. The speech recognition method supporting multi-task and multi-model according to claim 1, characterized by further comprising a step 5):
after the recognition result is output, deleting the speech recognition model of the corresponding user.
3. The speech recognition method supporting multi-task and multi-model according to claim 1 or 2, characterized in that:
the voice recognition model corresponding to the user refers to a voice recognition model uploaded by the user.
4. A speech recognition engine that supports multitasking and multiple models, characterized by:
comprises a central device;
the central equipment comprises a container, a voice acquisition module, a voiceprint recognition module and a result output module; the container is loaded with a speech recognition model;
the voice acquisition module is used for acquiring the user's voice information; the voiceprint recognition module is used for performing voiceprint recognition on the voice information and determining the corresponding speech recognition model according to the voiceprint recognition result; after the speech recognition model recognizes the voice information, the recognition result is output through the result output module.
5. A speech recognition engine supporting multitasking and multiple models according to claim 4 and wherein:
the number of the containers is at least two.
6. A speech recognition engine supporting multitasking and multiple models according to claim 5 and wherein:
each container is loaded with a different speech recognition model.
7. A speech recognition engine supporting multitasking and multiple models according to any one of claims 4-6 and wherein:
the container and the voiceprint recognition module are located in the central device or the cloud.
8. A speech recognition engine supporting multitasking and multiple models according to claim 7 and wherein:
the bottom layer system of the central equipment is an android system.
CN202010069206.5A 2020-01-21 2020-01-21 Speech recognition engine and method supporting multi-task and multi-model Pending CN111261168A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010069206.5A CN111261168A (en) 2020-01-21 2020-01-21 Speech recognition engine and method supporting multi-task and multi-model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010069206.5A CN111261168A (en) 2020-01-21 2020-01-21 Speech recognition engine and method supporting multi-task and multi-model

Publications (1)

Publication Number Publication Date
CN111261168A true CN111261168A (en) 2020-06-09

Family

ID=70954670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010069206.5A Pending CN111261168A (en) 2020-01-21 2020-01-21 Speech recognition engine and method supporting multi-task and multi-model

Country Status (1)

Country Link
CN (1) CN111261168A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105096941A (en) * 2015-09-02 2015-11-25 百度在线网络技术(北京)有限公司 Voice recognition method and device
CN105931643A (en) * 2016-06-30 2016-09-07 北京海尔广科数字技术有限公司 Speech recognition method and apparatus
CN110675872A (en) * 2019-09-27 2020-01-10 青岛海信电器股份有限公司 Voice interaction method based on multi-system display equipment and multi-system display equipment

Non-Patent Citations (1)

Title
Dong Ze: "Research on Container-Based Task Offloading Technology", China Masters' Theses Full-text Database *

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN113823263A (en) * 2020-06-19 2021-12-21 深圳Tcl新技术有限公司 Voice recognition method and system
WO2021253779A1 (en) * 2020-06-19 2021-12-23 深圳Tcl新技术有限公司 Speech recognition method and system
CN111785275A (en) * 2020-06-30 2020-10-16 北京捷通华声科技股份有限公司 Voice recognition method and device

Similar Documents

Publication Publication Date Title
US6687671B2 (en) Method and apparatus for automatic collection and summarization of meeting information
US9424836B2 (en) Privacy-sensitive speech model creation via aggregation of multiple user models
CN103081004B (en) For the method and apparatus providing input to voice-enabled application program
CN104679631B (en) Method of testing and system for the equipment based on android system
CN109388701A (en) Minutes generation method, device, equipment and computer storage medium
US9047506B2 (en) Computer-readable recording medium storing authentication program, authentication device, and authentication method
WO2015024413A1 (en) Conference summary extraction method and device
CN111261168A (en) Speech recognition engine and method supporting multi-task and multi-model
US20110316671A1 (en) Content transfer system and communication terminal
CN104038354A (en) Intelligent mobile phone-based conference interaction method
US20130243186A1 (en) Audio encryption systems and methods
CN105897686A (en) Smart television user account speech management method and smart television
CN107862071A (en) The method and apparatus for generating minutes
CN110060656A (en) Model management and phoneme synthesizing method, device and system and storage medium
CN112350834B (en) AI voice conference system with screen and method
CN109493866A (en) Intelligent sound box and its operating method
CN109637534A (en) Voice remote control method, system, controlled device and computer readable storage medium
JP4469867B2 (en) Apparatus, method and program for managing communication status
CN105427857B (en) Generate the method and system of writing record
CN104113604A (en) Implementation method of voice rapid acquisition in cloud environment
CN113241070A (en) Hot word recall and updating method, device, storage medium and hot word system
CN107122291A (en) Mobile terminal software stability test method and apparatus
KR101351264B1 (en) System and method for message translation based on voice recognition
CN108419108A (en) Sound control method, device, remote controler and computer storage media
JP2018120203A (en) Information processing method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200609