CN110544480B

CN110544480B - Voice recognition resource switching method and device

Info

Publication number: CN110544480B
Application number: CN201910838220.4A
Authority: CN
Inventors: 王星培
Original assignee: Sipic Technology Co Ltd
Current assignee: Sipic Technology Co Ltd
Priority date: 2019-09-05
Filing date: 2019-09-05
Publication date: 2022-03-11
Anticipated expiration: 2039-09-05
Also published as: CN110544480A

Abstract

The invention discloses a voice recognition resource switching method and device. The method comprises the following steps: initializing a resource pool, wherein each voice recognition resource in the initialized resource pool has an initial weight; collecting user voice and performing distributed multi-path decoding on the user voice; obtaining the scores of the distributed multi-path decoding, and determining the domain classification result corresponding to the text result with the highest score in the multi-path decoding; judging whether the domain classification result belongs to each voice recognition resource; and if the domain classification result belongs to each voice recognition resource, updating the initial weight of each voice recognition resource according to the domain classification result. The method and device exploit the domain customization capability of voice recognition resources and adaptively adjust the domain weights as the user uses the system, which can improve recognition accuracy and give the user a better customization and interaction experience.

Description

Voice recognition resource switching method and device

Technical Field

The invention belongs to the technical field of voice recognition, and particularly relates to a voice recognition resource switching method and device.

Background

In the related art, some simple voice recognition switching systems exist, such as the voice recognition switching system of patent number CN201810233474. Such systems implement simple switching between different voice recognition resources and optimize the user experience in context-switching scenarios.

In the process of implementing the present application, the inventor found that the prior scheme has at least the following defects: its performance depends too heavily on the accuracy of the switching judgment, and its stability is poor.

Disclosure of Invention

An embodiment of the present invention provides a method and an apparatus for switching speech recognition resources, which are used to solve at least one of the above technical problems.

In a first aspect, an embodiment of the present invention provides a method for switching speech recognition resources, including: initializing a resource pool, wherein each voice recognition resource in the initialized resource pool has an initial weight; collecting user voice and carrying out distributed multi-path decoding on the user voice; obtaining scores of distributed multi-path decoding, and determining a domain classification result corresponding to a text result with the highest score in the multi-path decoding; judging whether the domain classification result belongs to each voice recognition resource; and if the domain classification result belongs to each voice recognition resource, updating the initial weight of each voice recognition resource according to the domain classification result.

In a second aspect, an embodiment of the present invention provides a speech recognition resource switching apparatus, including: the initialization module is configured to initialize the resource pool, wherein each voice recognition resource in the initialized resource pool has an initial weight; the multi-channel decoding module is configured to collect user voice and perform distributed multi-channel decoding on the user voice; the domain classification module is configured to obtain scores of distributed multi-path decoding and determine a domain classification result corresponding to a text result with the highest score in the multi-path decoding; a domain judging module configured to judge whether the domain classification result belongs to each of the speech recognition resources; and the updating module is configured to update the initial weight of each voice recognition resource according to the domain classification result if the domain classification result belongs to each voice recognition resource.

In a third aspect, an electronic device is provided, comprising: at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so as to enable the at least one processor to execute the steps of the voice recognition resource switching method of any embodiment of the invention.

In a fourth aspect, the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-volatile computer-readable storage medium, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer executes the steps of the speech recognition resource switching method according to any one of the embodiments of the present invention.

The method and the device provided by the application exploit the domain customization capability of voice recognition resources and adaptively adjust the domain weights as the user uses the system, which can improve recognition accuracy and give the user a better customization and interaction experience.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present invention, and those skilled in the art can obtain other drawings based on them without creative effort.

Fig. 1 is a flowchart of a method for switching speech recognition resources according to an embodiment of the present invention;

Fig. 2 is a flowchart illustrating another method for switching speech recognition resources according to an embodiment of the present invention;

Fig. 3 is a flowchart of a method for switching speech recognition resources according to an embodiment of the present invention;

Fig. 4 is a block diagram of a speech recognition resource switching apparatus according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, which shows a flowchart of an embodiment of a speech recognition resource switching method according to the present application, the speech recognition resource switching method of the present embodiment may be applied to terminals with speech recognition capability, such as smart voice televisions, smart speakers, smart dialogue toys, and other existing smart terminals with speech recognition capability.

As shown in fig. 1, in step 101, a resource pool is initialized;

in step 102, collecting user voice and performing distributed multi-path decoding on the user voice;

in step 103, obtaining scores of distributed multi-path decoding, and determining a domain classification result corresponding to a text result with the highest score in the multi-path decoding;

in step 104, judging whether the domain classification result belongs to each voice recognition resource;

in step 105, if the domain classification result belongs to each speech recognition resource, the initial weight of each speech recognition resource is updated according to the domain classification result.

In this embodiment, for step 101, the speech recognition resource switching apparatus first needs to initialize the resource pool, and add the speech recognition resources that may be used subsequently to the resource pool, where each speech recognition resource in the initialized resource pool has an initial weight. For example, a plurality of speech recognition resources with the highest initialization weight may be selected to be added to the resource pool, and resource initialization may also be performed in other manners, which is not limited in this respect.

Then, for step 102, the speech recognition resource switching device collects the user speech and performs distributed multi-path decoding on the user speech, and for step 103, according to the score of the multi-path decoding, the domain classification result corresponding to the text result with the highest score is determined.
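
As a non-limiting illustration of steps 102 and 103, the following Python sketch runs one decoding path per resource in parallel and keeps the text result with the highest score. The `Hypothesis` structure, the stub decoders, the example scores, and the keyword-based `classify_domain` function are assumptions made for this example only; the application does not specify how decoding or domain classification is implemented.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, NamedTuple


class Hypothesis(NamedTuple):
    resource: str  # name of the resource that produced this decoding path
    text: str      # recognized text
    score: float   # decoding score; higher is better


Decoder = Callable[[bytes], tuple[str, float]]


def decode_all(audio: bytes, decoders: dict[str, Decoder]) -> list[Hypothesis]:
    """Run every decoder on the same audio in parallel, one path per resource."""
    with ThreadPoolExecutor() as executor:
        futures = {name: executor.submit(dec, audio) for name, dec in decoders.items()}
        return [Hypothesis(name, *fut.result()) for name, fut in futures.items()]


def classify_domain(text: str) -> str:
    """Toy keyword classifier standing in for a real domain classifier."""
    keywords = {"navigate": "navigation", "play": "music"}
    return next((d for k, d in keywords.items() if k in text), "generic")


# Stub decoders returning fixed (text, score) pairs for illustration only.
decoders = {
    "generic": lambda audio: ("navigate two the airport", 0.71),
    "navigation": lambda audio: ("navigate to the airport", 0.93),
}
best = max(decode_all(b"\x00\x01", decoders), key=lambda h: h.score)
domain = classify_domain(best.text)
print(best.resource, domain)  # navigation navigation
```

In this sketch the domain is derived only from the best-scoring text, which mirrors the order of operations described for step 103.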

Then, in step 104, the speech recognition resource switching device determines whether the domain classification result belongs to each speech recognition resource in the resource pool, that is, whether the pool contains a resource for that domain. Finally, in step 105, if it does, the device updates the initial weight of each speech recognition resource according to the domain classification result. For example, the weight of the speech recognition resource corresponding to the domain classification result may be increased, or the weights of the other speech recognition resources may be decreased accordingly, which is not limited herein.
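
One possible form of the weight update in step 105 is sketched below in Python. The additive `boost` and multiplicative `decay` values are hypothetical; the application only states that the matched resource may be weighted up or the other resources weighted down, without fixing a formula.

```python
def update_weights(weights: dict[str, float], domain: str,
                   boost: float = 0.1, decay: float = 0.95) -> dict[str, float]:
    """Return new weights: raise the matched domain, scale the others down."""
    if domain not in weights:
        raise KeyError(f"{domain!r} is not in the resource pool")
    return {
        name: w + boost if name == domain else w * decay
        for name, w in weights.items()
    }


weights = update_weights({"generic": 1.0, "navigation": 0.6, "music": 0.4}, "navigation")
# The matched "navigation" resource now outweighs "music" by a wider margin.
```

Either adjustment alone would also fit the description; setting `boost=0` or `decay=1.0` gives the two single-sided variants.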

The method of this embodiment first judges the domain to which the user voice belongs and, if the resource pool contains a resource for that domain, uses that resource for subsequent recognition of the user voice, which can markedly improve the recognition effect. Further, after the user has used a voice recognition resource, its weight is updated so that the resource is more readily used again in the future.

In some optional embodiments, when the resource pool is initialized, a plurality of voice recognition resources with higher frequency used by the user can be selected as the resources in the initialized resource pool according to the use data of the historical voice recognition resources of the user, so that the subsequent switching of the resources in the resource pool can be reduced, and the user experience is better.
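
A minimal sketch of such history-based initialization follows, in Python. It assumes the usage history is available as a list of domain names, one per past utterance, and it uses the relative usage frequency as the initial weight; both choices are assumptions of this example rather than requirements of the application.

```python
from collections import Counter


def init_pool_from_history(history: list[str], size: int) -> dict[str, float]:
    """Pick the `size` most frequently used domains; weight = usage frequency."""
    counts = Counter(history)
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.most_common(size)}


history = ["music", "navigation", "music", "weather", "music", "navigation"]
pool = init_pool_from_history(history, size=2)
print(list(pool))  # ['music', 'navigation']
```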

Further referring to Fig. 2, a flowchart of another embodiment of the speech recognition resource switching method according to the present application is shown. This flowchart mainly covers the further steps taken when the determination in step 104 of Fig. 1 is that the domain classification result does not belong to any speech recognition resource in the resource pool.

In step 201, if the domain classification result does not belong to each voice recognition resource, the voice recognition resource corresponding to the domain classification result is added into the resource pool from the alternative resource library;

in step 202, determining the speech recognition resource with the lowest weight in the speech recognition resources based on the initial weight or the updated weight of the speech recognition resources;

in step 203, the speech recognition resource with the lowest weight is removed from the resource pool.

In this embodiment, for step 201, if the domain classification result does not belong to any voice recognition resource in the current resource pool, the voice recognition resource switching device selects the voice recognition resource corresponding to the current domain classification result from the alternative resource library and adds it to the resource pool. Thereafter, in step 202, the device determines the speech recognition resource with the lowest weight in the current resource pool based on the initial weight or the updated weight of each speech recognition resource. The initial weight applies to a resource pool whose weights have not yet been updated; the updated weight applies once the current resource pool has undergone at least one update, after which the weights of the speech recognition resources are no longer the initial weights. Finally, for step 203, the speech recognition resource with the lowest weight is removed from the current resource pool.

In the method of this embodiment, when the domain classification result is judged not to belong to any voice recognition resource in the current resource pool, the corresponding resource is added to the resource pool from the alternative resource library and the resource with the lowest weight is removed, so that the resources in the resource pool remain stable and a better recognition effect can be provided.
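
Steps 201 to 203 can be sketched in Python as follows. Three points are assumptions of this example and are not stated in the application: the lowest-weight resource is chosen among the resources already in the pool, so that the newly added resource is not evicted immediately; the evicted resource is returned to the alternative resource library; and the new resource enters the pool with the weight stored for it in the library.

```python
def switch_in(pool: dict[str, float], library: dict[str, float], domain: str) -> str:
    """Move `domain` from the library into the pool and evict one resource.

    Returns the name of the evicted resource.
    """
    # Step 202: find the lowest-weight resource among those already in the pool.
    evicted = min(pool, key=pool.get)
    # Step 201: add the resource for the classified domain from the library.
    pool[domain] = library.pop(domain)
    # Step 203: remove the lowest-weight resource (kept in the library for reuse).
    library[evicted] = pool.pop(evicted)
    return evicted


pool = {"generic": 1.0, "music": 0.5, "weather": 0.2}
library = {"navigation": 0.3}
evicted = switch_in(pool, library, "navigation")
print(evicted, sorted(pool))  # weather ['generic', 'music', 'navigation']
```

Because one resource enters and one leaves, the pool size stays constant across a switch.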

In some optional embodiments, the initialized resource pool includes a generic speech recognition resource, a user-customized speech recognition resource, and at least one common-domain speech recognition resource. The resource pool therefore contains both the generic and the user-customized resource, so a good recognition effect is ensured while user customization is supported. Furthermore, adding speech recognition resources for common domains improves recognition performance in those domains, such as navigation and song requests.

Further optionally, the initialized resource pool always retains the generic speech recognition resource and the user-customized speech recognition resource. These two resources thus reside in the resource pool permanently, so that generic speech and user-customized speech can be better recognized. They may be kept resident by assigning them a higher weight, by excluding them from consideration when resources are removed, or in other ways, which is not limited herein.
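
The second option, excluding the resident resources when choosing what to remove, might look like the Python sketch below. The resource names `generic` and `user_custom` and the helper `lowest_evictable` are hypothetical names chosen for this example.

```python
from typing import Optional

# Resources that must always stay in the pool (names are illustrative).
PINNED = frozenset({"generic", "user_custom"})


def lowest_evictable(pool: dict[str, float],
                     pinned: frozenset = PINNED) -> Optional[str]:
    """Return the lowest-weight resource that is not pinned, or None if all are."""
    candidates = {name: w for name, w in pool.items() if name not in pinned}
    return min(candidates, key=candidates.get) if candidates else None


pool = {"generic": 0.1, "user_custom": 0.2, "music": 0.5, "weather": 0.3}
print(lowest_evictable(pool))  # weather
```

Here `weather` is selected even though the two pinned resources carry lower weights, so residency does not depend on the weight values themselves.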

In some optional embodiments, adding the speech recognition resource corresponding to the domain classification result into the resource pool from the alternative resource library, if the domain classification result does not belong to each speech recognition resource, includes: if the domain classification result does not belong to each voice recognition resource, inquiring whether the user needs to switch to the voice recognition resource corresponding to the domain classification result; and in response to receiving a switching instruction from the user, adding the voice recognition resource corresponding to the domain classification result into the resource pool from the alternative resource library. For example, the user may be asked: "The current resource pool does not contain a speech recognition resource for the XX domain. Do you need to switch to the speech recognition resource for the XX domain?" If the user agrees, the switch is performed according to the switching procedure described above. In this way the user's wishes are fully taken into account.

Further optionally, the method further includes: in response to receiving a no-switch instruction from the user, performing subsequent processing on the collected user voice using the voice recognition resources in the initialized resource pool. Thus, after the user's no-switch instruction is received, the current resources are used for subsequent recognition of the user voice and no resource switch takes place.
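
The confirmation flow of these optional embodiments can be sketched in Python as follows. The `ask` callback stands in for whatever dialogue mechanism the terminal provides and is an assumption of this example, as are the prompt wording and the simple lowest-weight eviction used when the user agrees.

```python
from typing import Callable


def handle_missing_domain(domain: str, pool: dict[str, float],
                          library: dict[str, float],
                          ask: Callable[[str], bool]) -> bool:
    """Ask the user before switching; return True if the pool was changed."""
    prompt = (f"The current resource pool does not contain a speech recognition "
              f"resource for the {domain} domain. Switch to it?")
    if not ask(prompt):
        # No-switch instruction: keep using the current pool unchanged.
        return False
    # Switching instruction: evict the lowest-weight resource, then add the new one.
    evicted = min(pool, key=pool.get)
    library[evicted] = pool.pop(evicted)
    pool[domain] = library.pop(domain)
    return True


pool = {"generic": 1.0, "music": 0.5}
library = {"navigation": 0.3}
declined = handle_missing_domain("navigation", pool, library, ask=lambda prompt: False)
accepted = handle_missing_domain("navigation", pool, library, ask=lambda prompt: True)
print(declined, accepted, sorted(pool))  # False True ['generic', 'navigation']
```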

The following description is provided to enable those skilled in the art to better understand the present disclosure by describing some of the problems encountered by the inventors in implementing the present disclosure and by describing one particular embodiment of the finally identified solution.

The inventor finds in the process of implementing the present application that the defects of the prior art are mainly caused by the following:

in the existing scheme, the existing voice recognition resources are emptied directly once the switching judgment is finished, and the switching judgment itself is overly cumbersome.

Context-switching scenarios generally come from the user's personalized customization needs and the user's desire for a quality voice interaction experience. The industry's solutions are typically based on only a single, generalized voice recognition resource and do not focus on the user's experience with voice interaction.

The scheme of the application provides a voice recognition resource switching device.

(1) The switching judgment only comprises the domain classification result of the recognition text and the judgment condition customized by the user;

(2) decoding a plurality of resources by adopting a voice recognition resource pool;

(3) the generic speech recognition resources and the user-customized recognition resources are permanently reserved in the resource pool.

The following describes the embodiments of the present application in detail with reference to fig. 3.

(1) Initializing a resource pool, wherein the resource pool comprises a universal voice recognition resource, a user customized voice recognition resource and voice recognition resources in some common fields;

(2) distributed multi-path decoding is adopted;

(3) Domain classification is performed on the text result with the highest multi-path decoding score, and the weight of each resource in the resource pool is updated according to the classification result. If the resource pool does not contain a voice recognition resource for the resulting domain, the resource for that domain is added to the resource pool from the alternative resource library, and the voice recognition resource with the lowest current weight is deleted.
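
The flow of Fig. 3 can be tied together in a short Python sketch that processes a sequence of domain classification results. The class name `AdaptivePool`, the additive `boost`, the pinned resource names, and the example weights are assumptions of this illustration; decoding and classification are taken as already done, so each utterance is represented only by its classified domain.

```python
class AdaptivePool:
    """Resource pool that adapts its weights to the domains a user actually uses."""

    def __init__(self, pool, library, pinned=("generic", "user_custom"), boost=0.1):
        self.pool = dict(pool)        # domain -> current weight
        self.library = dict(library)  # alternative resource library
        self.pinned = set(pinned)     # resources that are never evicted
        self.boost = boost            # assumed weight increment per use

    def observe(self, domain: str) -> None:
        """Process the domain classification result of one utterance."""
        if domain in self.pool:
            # Domain already served by the pool: raise its weight.
            self.pool[domain] += self.boost
            return
        if domain not in self.library:
            return  # no resource available for this domain; keep the pool
        # Evict the lowest-weight resource that is not pinned, if there is one.
        evictable = [name for name in self.pool if name not in self.pinned]
        if evictable:
            victim = min(evictable, key=self.pool.get)
            self.library[victim] = self.pool.pop(victim)
        # Bring in the resource for the newly seen domain.
        self.pool[domain] = self.library.pop(domain)


adaptive = AdaptivePool(
    pool={"generic": 1.0, "user_custom": 1.0, "music": 0.5, "weather": 0.3},
    library={"navigation": 0.4},
)
for domain in ["music", "navigation", "navigation"]:
    adaptive.observe(domain)
print(sorted(adaptive.pool))  # ['generic', 'music', 'navigation', 'user_custom']
```

After the three utterances, the unused `weather` resource has been moved back to the library while the generic and user-customized resources remain in the pool.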

In the process of implementing the present application, the inventors also tried the following scheme: when the switching condition is met, an interaction flow is added that actively asks the user whether to switch. The advantage is that the accuracy of switching is further ensured. The disadvantage is that the switching process becomes more complex; considering that domain switching is not a deliberate action of the user, this scheme was finally rejected.

The scheme of the embodiments of the application can achieve the following beneficial effects: it exploits the domain customization capability of voice recognition resources and adaptively adjusts the domain weights as the user uses the system, which can improve recognition accuracy and give the user a better customization and interaction experience.

Generic speech recognition resources generalize well but cannot meet personalized customization requirements, whereas speech recognition resources for a fixed domain can effectively improve recognition performance within that domain. On this basis, an adaptive speech recognition resource switching system is designed. The resource pool design ensures the overall stability of the speech recognition system to a certain extent, and the domain weight update mechanism gives the speech recognition system a dynamic domain customization capability.

Referring to fig. 4, a block diagram of a speech recognition resource switching apparatus according to an embodiment of the present invention is shown.

As shown in fig. 4, the speech recognition resource switching apparatus 400 includes an initialization module 410, a multi-channel decoding module 420, a domain classification module 430, a domain judgment module 440, and an update module 450.

The initialization module 410 is configured to initialize a resource pool, wherein each voice recognition resource in the initialized resource pool has an initial weight; the multi-channel decoding module 420 is configured to collect user speech and perform distributed multi-channel decoding on the user speech; the domain classification module 430 is configured to obtain scores of the distributed multi-path decoding and determine a domain classification result corresponding to a text result with the highest score in the multi-path decoding; the domain judging module 440 is configured to judge whether the domain classification result belongs to each of the speech recognition resources; and the updating module 450 is configured to update the initial weight of each speech recognition resource according to the domain classification result if the domain classification result belongs to each speech recognition resource.

In some optional embodiments, the apparatus further comprises: a resource switching module (not shown in the figure), configured to add the speech recognition resource corresponding to the domain classification result into a resource pool from an alternative resource library if the domain classification result does not belong to each speech recognition resource; a weight judging module (not shown in the figure) configured to determine a speech recognition resource with the lowest weight in the speech recognition resources based on the initial weight or the updated weight of the speech recognition resources; and a resource removal module (not shown) configured to remove the lowest weighted speech recognition resource from the resource pool.

It should be understood that the modules depicted in fig. 4 correspond to various steps in the method described with reference to fig. 1. Thus, the operations and features described above for the method and the corresponding technical effects are also applicable to the modules in fig. 4, and are not described again here.

It should be noted that the module names in the embodiments of the present application do not limit the solution of the present application; for example, the domain judging module may be described as a module that judges whether the domain classification result belongs to each speech recognition resource. In addition, the related functional modules may also be implemented by a hardware processor; for example, the domain judging module may be implemented by a processor, which is not described herein again.

In other embodiments, an embodiment of the present invention further provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions, and the computer-executable instructions may execute the voice recognition resource switching method in any of the above method embodiments;

as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:

initializing a resource pool, wherein each voice recognition resource in the initialized resource pool has an initial weight;

collecting user voice and carrying out distributed multi-path decoding on the user voice;

obtaining scores of distributed multi-path decoding, and determining a domain classification result corresponding to a text result with the highest score in the multi-path decoding;

judging whether the domain classification result belongs to each voice recognition resource;

and if the domain classification result belongs to each voice recognition resource, updating the initial weight of each voice recognition resource according to the domain classification result.

The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the voice recognition apparatus, and the like. Further, the non-volatile computer-readable storage medium may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the speech recognition apparatus over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-volatile computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform any of the speech recognition resource switching methods described above.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 5, the electronic device includes one or more processors 510 and a memory 520; one processor 510 is taken as an example in Fig. 5. The device for the voice recognition resource switching method may further include an input device 530 and an output device 540. The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or by other means; a bus connection is taken as an example in Fig. 5. The memory 520 is a non-volatile computer-readable storage medium as described above. The processor 510 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 520, that is, it implements the voice recognition resource switching method of the above method embodiments. The input device 530 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the voice recognition resource switching apparatus. The output device 540 may include a display device such as a display screen.

The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the method provided by the embodiment of the present invention.

As an embodiment, the electronic device is applied to a speech recognition resource switching apparatus, and includes:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the steps of the voice recognition resource switching method of any of the above embodiments.

The electronic device of the embodiments of the present application exists in various forms, including but not limited to:

(1) A mobile communication device: such devices have mobile communication capabilities and are primarily intended to provide voice and data communication. Such terminals include smart phones (e.g., iPhone), multimedia phones, feature phones, and low-end phones, among others.

(2) An ultra-mobile personal computer device: such devices belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.

(3) A portable entertainment device: such devices can display and play multimedia content. They include audio and video players (e.g., iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.

(4) A server: its architecture is similar to that of a general-purpose computer, but because it must provide highly reliable services, it has higher requirements for processing capability, stability, reliability, security, scalability, manageability, and the like.

(5) Other electronic devices with data interaction functions.

The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

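1. A speech recognition resource switching method comprises the following steps:

initializing a resource pool, wherein each voice recognition resource in the initialized resource pool has an initial weight;

collecting user voice and carrying out distributed multi-path decoding on the user voice;

obtaining scores of distributed multi-path decoding, and determining a domain classification result corresponding to a text result with the highest score in the multi-path decoding;

judging whether the domain classification result belongs to each voice recognition resource;

if the domain classification result belongs to each voice recognition resource, using the corresponding voice recognition resource in the resource pool to perform subsequent recognition processing on the user voice, and updating the initial weight of each voice recognition resource according to the domain classification result.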

2. The method of claim 1, wherein after said determining whether the domain classification result belongs to the respective speech recognition resource, the method further comprises:

if the domain classification result does not belong to each voice recognition resource, adding the voice recognition resource corresponding to the domain classification result into a resource pool from an alternative resource library;

determining the speech recognition resource with the lowest weight in the speech recognition resources based on the initial weight or the updated weight of the speech recognition resources;

removing the lowest weighted speech recognition resource from the resource pool.

3. The method of claim 2, wherein the initialized resource pool comprises a generic speech recognition resource, a user-customized speech recognition resource, and at least one common domain speech recognition resource.

4. The method of claim 3, wherein the generic speech recognition resources and user-customized speech recognition resources are always reserved in the initialized resource pool.

5. The method according to any one of claims 2 to 4, wherein, if the domain classification result does not belong to the speech recognition resources, adding the speech recognition resource corresponding to the domain classification result into a resource pool from an alternative resource library comprises:

if the domain classification result does not belong to each voice recognition resource, inquiring whether the user needs to switch to the voice recognition resource corresponding to the domain classification result;

and responding to a received switching instruction of a user, and adding the voice recognition resource corresponding to the domain classification result into a resource pool from the alternative resource library.

6. The method of claim 5, wherein the method further comprises:

and in response to receiving a no-switch instruction from the user, using each voice recognition resource in the initialized resource pool to perform subsequent processing on the collected user voice.

7. A speech recognition resource switching apparatus comprising:

the initialization module is configured to initialize the resource pool, wherein each voice recognition resource in the initialized resource pool has an initial weight;

the multi-channel decoding module is configured to collect user voice and perform distributed multi-channel decoding on the user voice;

the domain classification module is configured to obtain scores of distributed multi-path decoding and determine a domain classification result corresponding to a text result with the highest score in the multi-path decoding;

a domain judging module configured to judge whether the domain classification result belongs to each of the speech recognition resources;

and the updating module is configured to perform subsequent recognition processing on the user voice by using the corresponding voice recognition resource in the resource pool if the domain classification result belongs to each voice recognition resource, and update the initial weight of each voice recognition resource according to the domain classification result.

8. The apparatus of claim 7, wherein the apparatus further comprises:

the resource switching module is configured to add the voice recognition resources corresponding to the domain classification result into a resource pool from an alternative resource library if the domain classification result does not belong to each voice recognition resource;

the weight judging module is configured to determine the voice recognition resource with the lowest weight in the voice recognition resources based on the initial weight or the updated weight of the voice recognition resources;

a resource removal module configured to remove the lowest weighted speech recognition resource from the resource pool.

9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1 to 6.

10. A storage medium having stored thereon a computer program, characterized in that the program, when being executed by a processor, is adapted to carry out the steps of the method of any one of claims 1 to 6.