CN111261168A - Speech recognition engine and method supporting multi-task and multi-model - Google Patents

Speech recognition engine and method supporting multi-task and multi-model

Info

Publication number
CN111261168A
CN111261168A
Authority
CN
China
Prior art keywords
model
recognition
voice
user
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010069206.5A
Other languages
Chinese (zh)
Inventor
范小朋
俞恺源
严伟玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Zhongke Advanced Technology Research Institute Co ltd
Original Assignee
Hangzhou Zhongke Advanced Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Zhongke Advanced Technology Research Institute Co ltd filed Critical Hangzhou Zhongke Advanced Technology Research Institute Co ltd
Priority to CN202010069206.5A priority Critical patent/CN111261168A/en
Publication of CN111261168A publication Critical patent/CN111261168A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/28Constructional details of speech recognition systems
    • G10L15/30Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification
    • G10L17/22Interactive procedures; Man-machine interfaces

Abstract

The invention relates to a speech recognition engine and method supporting multi-task and multi-model operation. The speech recognition engine comprises a central device; the central device comprises a container, a voice acquisition module, a voiceprint recognition module and a result output module, and the container is loaded with a speech recognition model. The voice acquisition module acquires the user's voice information; the voiceprint recognition module performs voiceprint recognition on the voice information and determines the corresponding speech recognition model according to the voiceprint recognition result; after the speech recognition model recognizes the voice information, the recognition result is output through the result output module. Because each user uploads his or her own speech recognition model and voiceprint model, the central device can perform speech recognition with the model of each participating user, and the central device deletes the loaded user models when the service ends, thereby protecting the users' private data and models.

Description

Speech recognition engine and method supporting multi-task and multi-model
Technical Field
The invention belongs to the field of speech recognition, and relates to a speech recognition engine and method supporting multiple tasks and multiple models.
Background
With the increasing maturity of speech recognition technology, mainstream speech recognition products achieve high recognition accuracy. However, mainstream systems collect the user's voice data and transmit it to the cloud for analysis, processing and model training, which compromises user privacy to some extent. As technology develops rapidly, people's privacy awareness keeps increasing. Therefore, how to guarantee the privacy of user data while performing speech recognition is a research issue worth considering.
Disclosure of Invention
The invention provides a speech recognition engine and method supporting multiple tasks and multiple models. The speech recognition engine neither infringes user privacy nor collects the user's voice data covertly: all training is completed on the user's private device. When a recognition service is needed, each user uploads his or her speech recognition model and voiceprint model, so that the central device (for example, a conference recorder) can perform speech recognition with the model of each participating user; when the service ends, the central device deletes the loaded user models, thereby protecting the users' private data and models.
The technical scheme for solving the problems is as follows: a speech recognition engine that supports multitasking and multiple models, characterized by:
comprises a central device;
the central equipment comprises a container, a voice acquisition module, a voiceprint recognition module and a result output module; the container is loaded with a speech recognition model;
the voice acquisition module is used for acquiring the user's voice information; the voiceprint recognition module is used for performing voiceprint recognition on the voice information and determining the corresponding speech recognition model according to the voiceprint recognition result; after the speech recognition model recognizes the voice information, the recognition result is output through the result output module.
Preferably, the number of containers is at least two.
Preferably, each container is loaded with a different speech recognition model.
Preferably, the container and the voiceprint recognition module may be located in the central device (i.e., locally) or in the cloud.
Preferably, the underlying system of the central device is an Android system.
A speech recognition method supporting multitask and multiple models is characterized by comprising the following steps:
1) acquiring user voice information;
2) performing voiceprint recognition on the voice information;
3) after the corresponding user is identified, the voice information is transmitted to the container corresponding to the user, and speech recognition is carried out by using the speech recognition model of the corresponding user;
4) and recording and outputting the recognition result.
Preferably, the method further comprises a step 5) of deleting the voice recognition model of the corresponding user after the recognition result is output.
Preferably, the speech recognition model corresponding to the user refers to a speech recognition model uploaded by the user.
The invention has the advantages that:
the invention provides a voice recognition engine and a method supporting multitask and multiple models, which have the advantages that enterprises or other users cannot obtain voice data and models of the users, so that private data of the users are protected; the user can decide whether to authorize the model obtained by the voice training of the user or not; the distributed voice recognition container also enables users to upload different voice models which are most suitable for the users; the whole engine achieves privacy and individuation.
Drawings
FIG. 1 is a flow chart of a multitasking and multimodal speech recognition service proposed by the present invention;
FIG. 2 is a diagram of a central facility architecture according to the present invention;
FIG. 3 is a flow chart of the operation of the center device in the present invention;
FIG. 4 is a diagram of a multi-tasking multi-model cloud identification architecture according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings of the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention.
A speech recognition method supporting multitask and multiple models comprises the following steps:
1) acquiring user voice information;
2) performing voiceprint recognition on the voice information;
3) after the corresponding user is identified, the voice information is transmitted to a container corresponding to the user, and voice recognition is carried out by using a voice recognition model of the corresponding user;
4) and recording and outputting the recognition result.
Preferably, the method further comprises a step 5): after the recognition result is output, the user is disconnected and the speech recognition model of the corresponding user is deleted.
Preferably, the speech recognition model corresponding to the user refers to a speech recognition model uploaded by the user.
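The method steps above, including the optional step 5) deletion, can be sketched as a small pipeline. This is a hedged illustration only: the class name SpeechEngine, the dictionary-based container store, and the speaker tag in the audio record are all invented stand-ins; real voiceprint and speech recognition models, which the patent assumes, are replaced by toys.

```python
# Illustrative sketch of the five-step method; all names are invented, not from the patent.
class SpeechEngine:
    def __init__(self):
        self.containers = {}  # user id -> loaded speech recognition model (one per container)

    def register(self, user, model):
        """Load a user's uploaded speech recognition model into its own container."""
        self.containers[user] = model

    def identify_speaker(self, audio):
        """Stand-in for voiceprint recognition: here the audio is simply tagged with its speaker."""
        return audio["speaker"]

    def recognize(self, audio):
        user = self.identify_speaker(audio)   # step 2): voiceprint recognition
        model = self.containers[user]         # step 3): route to the user's container
        text = model(audio["samples"])        # step 3): per-user speech recognition
        return f"{user}: {text}"              # step 4): record and output

    def end_service(self):
        """Step 5): delete every loaded user model when the service finishes."""
        self.containers.clear()

engine = SpeechEngine()
engine.register("alice", lambda samples: "hello")  # toy model: always returns "hello"
result = engine.recognize({"speaker": "alice", "samples": b"\x00\x01"})
engine.end_service()
```

Deleting the models in `end_service` mirrors the privacy guarantee of the method: once the service ends, the central device retains no user model.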
Based on the above method, the present invention provides a speech recognition engine supporting multi-task and multi-model, as shown in fig. 1, comprising a central device; the central device comprises a container, a voice acquisition module, a voiceprint recognition module and a result output module, and the container is loaded with a speech recognition model. The voice acquisition module is used for acquiring the user's voice information; the voiceprint recognition module is used for performing voiceprint recognition on the voice information and determining the corresponding speech recognition model according to the voiceprint recognition result; after the speech recognition model recognizes the voice information, the recognition result is output through the result output module.
Preferably, there may be a plurality of containers.
Preferably, each container is loaded with a different speech recognition model; alternatively, each container is loaded with the same speech recognition model but with different parameters.
Preferably, the user's speech recognition model and voiceprint model are both trained on the user's private device. When the central device/system needs to use a user's models, the user's speech and voiceprint models are uploaded to the central device after the user's manual authorization. The containers into which the models are loaded are isolated from each other and do not affect each other, and different types of models can be loaded. After the required voice service ends, each container in the central device deletes the user model it has loaded.
The central equipment architecture in the invention is as follows:
the bottom layer system of the central device is an android system, referring to fig. 2, a plurality of docker containers are operated on the android system, the specific number of the containers is determined by the number of connected users, and each container is allocated with the same amount of GPU and CPU resources. By the method, the isolation among the containers is ensured, and the condition that the model which is in operation by the other side cannot be obtained among the containers is ensured. Each container has a respective TensorflowLite. Each TensorflowLite is responsible for loading a model for one user. Taking the following (architecture) diagram as an example, the number of connected users is three, three containers are established in an equipment main system (android), each container is respectively provided with a TensorflowLite, and each TensorflowLite is used for loading a speech recognition model of a corresponding user.
The user mobile phone and the center device in the invention are interacted as follows:
when the mobile phone of the user and the central device are in the same network environment, the user can transmit the own pb model file to the central device by using the app on the mobile phone. Because the central facility opens a separate container space for each user, the user can use different speech recognition models, or the same model with parameters custom-trained from the user's own data. Achieving the effect of individuation or multiple models. The whole steps are as follows:
connect to a wireless network (the same one as the central device);
open the app, then confirm and authorize transmission of the .pb model file;
after all users have finished transmission, send an instruction to the central device to start recognition and recording.
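The phone-side steps can be sketched as below. This is an assumption-laden illustration: the function prepare_upload, the field names, and the file-extension check are invented; the patent only specifies that the phone and central device share one wireless network and that the user confirms and authorizes transmission of the .pb model file.

```python
# Hypothetical phone-side helper; names and fields are illustrative, not from the patent.
import os

def prepare_upload(pb_path, authorized):
    """Package a user's .pb model file for transmission once the user authorizes it."""
    if not authorized:
        # The patent requires explicit user authorization before any model leaves the phone.
        raise PermissionError("user has not authorized model transmission")
    if not pb_path.endswith(".pb"):
        raise ValueError("expected a TensorFlow .pb model file")
    return {"filename": os.path.basename(pb_path), "action": "upload_model"}

request = prepare_upload("/sdcard/models/user_a.pb", authorized=True)
```

Refusing to build the request without authorization reflects the design choice that the user, not the central device, controls when a model is shared.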
Identification and recording of the central device in the invention:
when the recognition service is started, the voice signal is acquired by the central equipment through the microphone, the central equipment recognizes the user corresponding to the voice section by comparing the voiceprint models uploaded by all the participating users, and the container corresponding to the user is searched through the hash table.
The central device pushes the voice to the container of the corresponding user, and the model in the container performs speech recognition; the container then pushes the recognized text back to the central device, which records and saves it in the form "xxx (speaker): xxxx (content)". The specific flow is shown in fig. 3.
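A minimal sketch of the lookup described above, assuming the uploaded voiceprint models can be treated as per-user scoring functions: the central device picks the highest-scoring user for a segment, then resolves that user's container through a hash table (a plain Python dict here). The scoring lambdas are toys, not real voiceprint models.

```python
# Illustrative voiceprint matching plus hash-table container lookup.
def best_matching_user(segment, voiceprints):
    """Pick the user whose voiceprint model scores highest on this segment."""
    return max(voiceprints, key=lambda user: voiceprints[user](segment))

voiceprints = {
    "user_a": lambda seg: 0.9 if seg == "seg1" else 0.1,  # toy scorer for user A
    "user_b": lambda seg: 0.8 if seg == "seg2" else 0.2,  # toy scorer for user B
}
container_of = {"user_a": "container_1", "user_b": "container_2"}  # the hash table

speaker = best_matching_user("seg1", voiceprints)
container = container_of[speaker]
```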
Multitasking and multiple models
A) Multitasking:
because the container system created for each user has relative independence, each container is respectively responsible for processing the voice data of a single user corresponding to the container;
B) multiple models are as follows:
the user is free to upload the model. The model uploaded by the user A and the model uploaded by the user B have different model structures, or have the same model structure but different parameters, namely the model is a multi-model. The model can be a model which is trained by a user by using personal data and optimized aiming at individuals, so that the recognition effect is optimal as much as possible.
Preferably, the container and the voiceprint recognition module of the central device are arranged in the cloud.
The architecture in the central device may also be implemented in the cloud (see fig. 4):
1) the user still connects to the central device through mobile phone authorization and authorizes the use of his or her voiceprint and speech recognition models;
2) the central device connects to the cloud through a network, the same containers as described above are established in the cloud, and the user models are transmitted to the cloud;
3) after the service starts, the central device transmits the voice input to the cloud; the cloud performs voiceprint recognition first, and the voiceprint module transmits the voice to the container corresponding to the voiceprint recognition result for speech recognition;
4) the cloud combines the results of voiceprint recognition and speech recognition into the form "xxx (speaker): xxxxx (specific speech content)" and sends it back to the central device;
5) the central device stores the received recognition result.
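Steps 3) to 5) of the cloud variant can be sketched as follows; cloud_recognize and store are illustrative names, and the lambdas stand in for the uploaded voiceprint and speech recognition models, which the patent does not specify in code form.

```python
# Hypothetical cloud-side round trip: voiceprint first, then the matched container's model.
def cloud_recognize(audio, voiceprint_of, model_in_container):
    """Cloud side: identify the speaker, then run that user's container model."""
    speaker = voiceprint_of(audio)             # step 3): voiceprint recognition
    text = model_in_container[speaker](audio)  # step 3): container speech recognition
    return f"{speaker}: {text}"                # step 4): combined "speaker: content" record

records = []
def store(record):
    """Step 5): the central device saves the record it receives from the cloud."""
    records.append(record)

store(cloud_recognize("clip1",
                      lambda a: "user_a",                    # toy voiceprint model
                      {"user_a": lambda a: "hello world"}))  # toy recognition model
```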
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent structures or equivalent flow transformations made by using the contents of the specification and the drawings, or applied directly or indirectly to other related systems, are included in the scope of the present invention.

Claims (8)

1. A speech recognition method supporting multitask and multiple models is characterized by comprising the following steps:
1) acquiring user voice information;
2) performing voiceprint recognition on the voice information;
3) after the corresponding user is identified, the voice information is transmitted to the container corresponding to the user, and speech recognition is carried out by using the speech recognition model of the corresponding user;
4) and recording and outputting the recognition result.
2. The speech recognition method supporting multi-task and multi-model according to claim 1, characterized by further comprising a step 5):
after the recognition result is output, deleting the speech recognition model of the corresponding user.
3. The speech recognition method supporting multi-task and multi-model according to claim 1 or 2, characterized in that:
the voice recognition model corresponding to the user refers to a voice recognition model uploaded by the user.
4. A speech recognition engine that supports multitasking and multiple models, characterized by:
comprises a central device;
the central equipment comprises a container, a voice acquisition module, a voiceprint recognition module and a result output module; the container is loaded with a speech recognition model;
the voice acquisition module is used for acquiring the user's voice information; the voiceprint recognition module is used for performing voiceprint recognition on the voice information and determining the corresponding speech recognition model according to the voiceprint recognition result; after the speech recognition model recognizes the voice information, the recognition result is output through the result output module.
5. A speech recognition engine supporting multitasking and multiple models according to claim 4 and wherein:
the number of the containers is at least two.
6. A speech recognition engine supporting multitasking and multiple models according to claim 5 and wherein:
each container is loaded with a different speech recognition model.
7. A speech recognition engine supporting multitasking and multiple models according to any one of claims 4-6 and wherein:
the container and the voiceprint recognition module are located in the central device or the cloud.
8. A speech recognition engine supporting multitasking and multiple models according to claim 7 and wherein:
the bottom layer system of the central equipment is an android system.
CN202010069206.5A 2020-01-21 2020-01-21 Speech recognition engine and method supporting multi-task and multi-model Pending CN111261168A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010069206.5A CN111261168A (en) 2020-01-21 2020-01-21 Speech recognition engine and method supporting multi-task and multi-model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010069206.5A CN111261168A (en) 2020-01-21 2020-01-21 Speech recognition engine and method supporting multi-task and multi-model

Publications (1)

Publication Number Publication Date
CN111261168A true CN111261168A (en) 2020-06-09

Family

ID=70954670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010069206.5A Pending CN111261168A (en) 2020-01-21 2020-01-21 Speech recognition engine and method supporting multi-task and multi-model

Country Status (1)

Country Link
CN (1) CN111261168A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105096941A (en) * 2015-09-02 2015-11-25 百度在线网络技术(北京)有限公司 Voice recognition method and device
CN105931643A (en) * 2016-06-30 2016-09-07 北京海尔广科数字技术有限公司 Speech recognition method and apparatus
CN110675872A (en) * 2019-09-27 2020-01-10 青岛海信电器股份有限公司 Voice interaction method based on multi-system display equipment and multi-system display equipment

Non-Patent Citations (1)

Title
Dong Ze: "Research on Container-Based Task Offloading Technology", China Masters' Theses Full-text Database *

Cited By (3)

Publication number Priority date Publication date Assignee Title
CN113823263A (en) * 2020-06-19 2021-12-21 深圳Tcl新技术有限公司 Voice recognition method and system
WO2021253779A1 (en) * 2020-06-19 2021-12-23 深圳Tcl新技术有限公司 Speech recognition method and system
CN111785275A (en) * 2020-06-30 2020-10-16 北京捷通华声科技股份有限公司 Voice recognition method and device

Similar Documents

Publication Publication Date Title
US6687671B2 (en) Method and apparatus for automatic collection and summarization of meeting information
US9424836B2 (en) Privacy-sensitive speech model creation via aggregation of multiple user models
CN103081004B (en) For the method and apparatus providing input to voice-enabled application program
CN104679631B (en) Method of testing and system for the equipment based on android system
CN109388701A (en) Minutes generation method, device, equipment and computer storage medium
US9047506B2 (en) Computer-readable recording medium storing authentication program, authentication device, and authentication method
WO2015024413A1 (en) Conference summary extraction method and device
CN111261168A (en) Speech recognition engine and method supporting multi-task and multi-model
US20110316671A1 (en) Content transfer system and communication terminal
CN104038354A (en) Intelligent mobile phone-based conference interaction method
US20130243186A1 (en) Audio encryption systems and methods
CN105897686A (en) Smart television user account speech management method and smart television
CN107862071A (en) The method and apparatus for generating minutes
CN110060656A (en) Model management and phoneme synthesizing method, device and system and storage medium
CN112350834B (en) AI voice conference system with screen and method
CN109493866A (en) Intelligent sound box and its operating method
CN109637534A (en) Voice remote control method, system, controlled device and computer readable storage medium
JP4469867B2 (en) Apparatus, method and program for managing communication status
CN105427857B (en) Generate the method and system of writing record
CN104113604A (en) Implementation method of voice rapid acquisition in cloud environment
CN113241070A (en) Hot word recall and updating method, device, storage medium and hot word system
CN107122291A (en) Mobile terminal software stability test method and apparatus
KR101351264B1 (en) System and method for message translation based on voice recognition
CN108419108A (en) Sound control method, device, remote controler and computer storage media
JP2018120203A (en) Information processing method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200609