CN117711377A - Speech recognition method, device, electronic equipment and readable medium

Publication number
CN117711377A
Authority
CN
China
Prior art keywords
voiceprint
target
acoustic model
acoustic
similar
Legal status
Pending
Application number
CN202211029147.4A
Other languages
Chinese (zh)
Inventor
俞科峰
仝建刚
李嫚
吴滢
陈梦夏
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
2022-08-25
Filing date
2022-08-25
Publication date
2024-03-15
Application filed by China Telecom Corp Ltd
Priority to CN202211029147.4A
Publication of CN117711377A


Abstract

The application provides a voice recognition method and apparatus, an electronic device, and a readable medium. The method comprises the following steps: acquiring voice data; performing feature matching between the voice data and voiceprint features in a voiceprint feature library to obtain a target voiceprint feature, wherein the voiceprint feature library comprises at least one voiceprint feature and each voiceprint feature corresponds to an acoustic model; and performing voice recognition on the voice data through the target acoustic model corresponding to the target voiceprint feature and M similar acoustic models of the target acoustic model to obtain a voice recognition result, wherein M is an integer greater than or equal to 1 and the voiceprint features of the M similar acoustic models are similar to the target voiceprint feature. The method avoids the inaccurate recognition results that arise for accented or acoustically similar speech when an adaptation layer selects the wrong recognition engine, and thereby improves the accuracy of the voice recognition result.

Description

Speech recognition method, device, electronic equipment and readable medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a voice recognition method and apparatus, an electronic device, and a readable medium.
Background
With the advancement of digital transformation in the industry, intelligent speech recognition technology is widely used in customer service. As the demand for personalized services expands, large call centers need to provide multilingual and multi-dialect services.
In the related art, a large call center typically deploys separate speech recognition engines for different languages and dialects and invokes the appropriate engine for the speech to be recognized through an adaptation layer or adapter.
However, for accented speech, or speech whose features are close to those of another language or dialect, the adapter often fails to adapt or selects the wrong recognition engine, so the invoked engine matches the speech to be recognized poorly and the accuracy of the recognition result drops.
Disclosure of Invention
To address these technical problems, the application provides a voice recognition method and apparatus, an electronic device, and a readable medium, so as to avoid the inaccurate recognition results caused by mis-adapted recognition engines for accented or acoustically similar speech and to improve the accuracy of voice recognition.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned in part by the practice of the application.
According to an aspect of the embodiments of the present application, there is provided a voice recognition method, including:
acquiring voice data;
performing feature matching on the voice data and voiceprint features in a voiceprint feature library to obtain target voiceprint features, wherein the voiceprint feature library comprises at least one voiceprint feature, and each voiceprint feature corresponds to an acoustic model;
and performing voice recognition on the voice data through a target acoustic model corresponding to the target voiceprint feature and M similar acoustic models of the target acoustic model to obtain a voice recognition result, wherein M is an integer greater than or equal to 1, and the voiceprint features of the M similar acoustic models are similar to the target voiceprint features.
In some embodiments of the present application, based on the above technical solution, the performing, through the target acoustic model corresponding to the target voiceprint feature and the M similar acoustic models of the target acoustic model, voice recognition on the voice data to obtain a voice recognition result includes:
acquiring a target acoustic model corresponding to the target voiceprint features;
obtaining M similar acoustic models similar to the target acoustic model according to the model similarity relation;
generating a voice recognition strategy according to the target acoustic model and the M similar acoustic models, wherein the voice recognition strategy is used for indicating the execution sequence of the target acoustic model and the M similar acoustic models;
and according to the voice recognition strategy, performing voice recognition on the voice data through a target acoustic model corresponding to the target voiceprint feature and M similar acoustic models to obtain a voice recognition result.
In some embodiments of the present application, based on the above technical solution, the performing, according to the voice recognition strategy, voice recognition on the voice data through the target acoustic model corresponding to the target voiceprint feature and the M similar acoustic models to obtain a voice recognition result includes:
performing voice recognition on the voice data through the target acoustic model to obtain a first recognition result;
respectively carrying out voice recognition on the voice data according to the M similar acoustic models to obtain M second recognition results;
and correcting the first recognition result according to the M second recognition results to obtain a voice recognition result.
In some embodiments of the present application, based on the above technical solution, before the M similar acoustic models similar to the target acoustic model are obtained according to the model similarity relation, the method further includes:
for a specific voiceprint feature in the voiceprint feature library, determining voiceprint similarity between the specific voiceprint feature and N-1 other voiceprint features;
determining M other voiceprint features with highest voiceprint similarity to the specific voiceprint feature;
and for the specific acoustic model corresponding to the specific voiceprint feature, determining M acoustic models corresponding to the M other voiceprint features as M similar acoustic models of the specific acoustic model, and obtaining the model similarity relation.
In some embodiments of the present application, based on the above technical solutions, before performing feature matching on the voice data and the voiceprint features in the voiceprint feature library to obtain the target voiceprint feature, the method further includes:
acquiring N acoustic models, wherein each acoustic model in the N acoustic models is used for identifying voice information of a language or an accent, and N is an integer greater than or equal to M+1;
and respectively extracting voiceprint features of the N acoustic models to obtain the voiceprint feature library.
In some embodiments of the present application, based on the above technical solutions, after the voiceprint features of the N acoustic models are respectively extracted to obtain the voiceprint feature library, the method further includes:
acquiring updated training data corresponding to an acoustic model to be updated in the N acoustic models;
training the acoustic model to be updated and updating model parameters according to the updated training data to obtain an updated acoustic model;
extracting updated voiceprint features of the updated acoustic model, and replacing the voiceprint features of the acoustic model to be updated in the voiceprint feature library by using the updated voiceprint features.
In some embodiments of the present application, based on the above technical solutions, the generating a speech recognition strategy according to the target acoustic model and the M similar acoustic models includes:
obtaining model access addresses of the target acoustic model and the M similar acoustic models from an acoustic resource index;
determining the execution sequence of the target acoustic model and the M similar acoustic models according to the sequence of the voiceprint similarity between the target acoustic model and the M similar acoustic models;
and sequencing the model access addresses according to the execution sequence to generate a voice recognition strategy.
According to an aspect of the embodiments of the present application, there is provided a voice recognition apparatus, including:
the data acquisition module is used for acquiring voice data;
the feature matching module is used for carrying out feature matching on the voice data and voiceprint features in a voiceprint feature library to obtain target voiceprint features, the voiceprint feature library comprises at least one voiceprint feature, and each voiceprint feature corresponds to an acoustic model;
and the voice recognition module is used for carrying out voice recognition on the voice data through the target acoustic model corresponding to the target voiceprint characteristic and M similar acoustic models of the target acoustic model to obtain a voice recognition result, wherein M is an integer greater than or equal to 1, and the voiceprint characteristic of the M similar acoustic models is similar to the target voiceprint characteristic.
According to an aspect of the embodiments of the present application, there is provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the speech recognition method in the above technical solutions via execution of the executable instructions.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements a speech recognition method as in the above technical solution.
In the embodiments of the present application, when voice recognition is performed, the target acoustic model is determined according to the degree of matching between the voice data and the voiceprint features of each acoustic model, similar acoustic models with similar voiceprint features are obtained from the target acoustic model, and the voice data is recognized through the target acoustic model together with the similar acoustic models. This avoids the inaccurate recognition results caused by mis-adapted recognition engines for accented or acoustically similar speech, and improves the accuracy of the voice recognition result.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
In the drawings:
FIG. 1 schematically illustrates an exemplary system architecture diagram of the present application in an application scenario;
FIG. 2 is a schematic block diagram of a speech recognition system of the present application;
FIG. 3 is a schematic flow chart of a speech recognition process according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a speech recognition method according to an embodiment of the present application;
FIG. 5 schematically shows a block diagram of the speech recognition apparatus in an embodiment of the present application;
fig. 6 shows a schematic diagram of a computer system suitable for use in implementing the electronic device of the embodiments of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present application. One skilled in the relevant art will recognize, however, that the aspects of the application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
The scheme of the application can be applied to voice recognition scenarios, in particular to the customer service flow of large call centers and similar institutions, where calls initiated by users are answered automatically. For example, customer service centers whose service areas span a country, or even the world, often employ answering systems to process user calls automatically and improve call-handling efficiency. Applied to such a scenario, the method determines, from the customer's voice information, the acoustic model matching the customer's language and accent together with several similar acoustic models, recognizes the customer's speech through the matched acoustic model, and corrects the result through the similar models, thereby obtaining a recognition result with higher accuracy.
The application scenario of the present scheme is described below. Referring to fig. 1, fig. 1 schematically shows an exemplary system architecture diagram of the technical solution of the present application in an application scenario. As shown in fig. 1, the architecture includes a client 110, a recognition system 120, and a business system 130. The client 110 communicates with the call center by telephone or over a network and transmits voice data, typically by speaking or by voice messaging. The voice data is passed to the recognition system 120. A recognition engine is deployed in the recognition system 120, and a plurality of acoustic models are deployed in the recognition engine, each acoustic model being configured to recognize speech with a particular accent; for example, an acoustic model may recognize Mandarin with a particular regional accent, standard Mandarin, or the Cantonese of a particular region. While recognizing the voice data, the recognition system 120 performs matching on the voiceprint features of the voice data, selects the acoustic model matching the voice data from the plurality of acoustic models, and selects several acoustic models similar to the selected model from the remaining models. These acoustic models recognize the voice data to produce a voice recognition result. The voice recognition result is typically semantic information of the voice data, such as the text into which the voice data is converted. The result is transmitted to the business system 130, which performs business processing according to the result and provides services to the client 110.
In the above application scenario, the servers in the recognition system 120 and the business system 130 may be independent physical servers, a server cluster or distributed system formed by multiple physical servers, or cloud servers providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data, and artificial intelligence platforms; this is not limited here. The terminal device of the client 110 may be, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, or the like. The terminal device and the server may be connected directly or indirectly through wired or wireless communication, which is not limited here, and the numbers of terminal devices and servers are not limited.
The speech recognition system of the present application is described below. Referring to fig. 2, fig. 2 is a schematic block diagram of the speech recognition system in the present application. As shown in fig. 2, the speech recognition system includes a client media information module 210, a speech recognition engine 220, and a business processing module 230. The client media information module 210 comprises a client and an interactive system; a user accesses the interactive system through the client to initiate a voice recognition access request. The speech recognition engine 220 includes a policy orchestration component, a task execution component, a feature library, an acoustic model index, and a plurality of acoustic models. The policy orchestration component arranges the execution strategy of the speech recognition engine according to the feature library matching result. The feature library stores the acoustic features of the deployed acoustic models, chiefly the mapping between the main voiceprint feature identifier of each dialect and the corresponding acoustic model identifier. The acoustic model index stores the cloud resource pool address at which each acoustic model is hosted. The task execution component continuously synchronizes the voiceprint features of the acoustic models into the feature library and executes the recognition tasks arranged by the policy orchestration component, resolving models through the acoustic model index. The execution result of the task execution component is pushed to the semantic recognition unit of the business processing module for semantic recognition, the semantic recognition result is then pushed to the business system, and the business system responds to the interactive system according to the semantics.
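To make the two mappings concrete, the following sketch shows one possible shape for the feature library and the acoustic model index; all identifiers and addresses here are hypothetical examples, not values from the patent.

```python
# Illustrative sketch of the feature library and acoustic model index
# (all identifiers and addresses are hypothetical).

# Feature library: main voiceprint feature identifier of a dialect
# -> identifier of the corresponding acoustic model.
feature_library = {
    "vp_mandarin_std": "am_mandarin_std",
    "vp_mandarin_sichuan": "am_mandarin_sichuan",
    "vp_cantonese_gz": "am_cantonese_gz",
}

# Acoustic model index: acoustic model identifier
# -> cloud resource pool address where the model is hosted.
acoustic_model_index = {
    "am_mandarin_std": "https://pool.example.com/models/am_mandarin_std",
    "am_mandarin_sichuan": "https://pool.example.com/models/am_mandarin_sichuan",
    "am_cantonese_gz": "https://pool.example.com/models/am_cantonese_gz",
}
```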
The specific operation of the speech recognition system can be seen in fig. 3. Fig. 3 is a flow chart of the voice recognition flow in an embodiment of the present application. The required acoustic models, such as a standard Mandarin acoustic model and a Cantonese recognition model, are deployed in the cloud of the voice recognition system; the cloud resource pool address of each acoustic model is configured in the acoustic model index component, and the task execution component extracts the characteristic voiceprint of each acoustic model and stores it in the feature library for identifying the voiceprint of incoming media information. At the same time, the similarity between acoustic models is calculated from the voiceprint features, and the similar acoustic models are determined. As shown in fig. 3, a customer calls into the interactive system, which pushes the customer media information to the speech recognition engine. The policy orchestration component of the speech recognition engine first invokes the feature library to identify the media, selects one backbone model and several auxiliary models according to voiceprint similarity (one backbone model and two auxiliary models in the embodiment of fig. 3), and arranges the recognition policy according to the determined acoustic models. The recognition policy is pushed to the task execution component for voice recognition. Following the recognition policy and the acoustic model index, the task execution component invokes the corresponding acoustic models to recognize the voice: recognition is performed through the backbone model, and dialect correction is applied to the recognition result through the two auxiliary models, yielding the recognition result. The recognition result is pushed to the semantic understanding system, which calls the corresponding business system to output the business result and return it to the interactive system.
The speech recognition method in the embodiments of the present application is described further below. Referring to fig. 4, fig. 4 is a schematic flowchart of a voice recognition method according to an embodiment of the present application. The method may be executed by a server on which a speech recognition engine is deployed. As shown in fig. 4, the voice recognition method includes at least steps S410 to S430, described in detail as follows:
Step S410, acquiring voice data;
step S420, performing feature matching on the voice data and the voiceprint features in the voiceprint feature library to obtain target voiceprint features, wherein the voiceprint feature library comprises at least one voiceprint feature, and each voiceprint feature corresponds to an acoustic model.
The voiceprint feature library is a feature library that the speech recognition engine constructs before recognition and that stores the voiceprint features of the individual acoustic models. The speech recognition engine extracts the voiceprint features of each acoustic model and stores them in the feature library. Typically, the voiceprint feature library includes a plurality of voiceprint features, each voiceprint feature corresponds to one acoustic model, and each acoustic model is used to recognize speech data of a different language or accent. The speech recognition engine matches the voice data, or voiceprint features extracted from the voice data, against the voiceprint features in the library, and the library feature that best matches the voice data is determined to be the target voiceprint feature. By determining the matching voiceprint feature, the speech recognition engine also determines the language or accent most likely contained in the voice data, which enables targeted recognition.
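As a rough illustration of this matching step, the sketch below compares a voiceprint embedding extracted from the speech data against each library entry and returns the best match. The patent does not fix a similarity measure, so cosine similarity is an assumption, as are all names in the snippet.

```python
import numpy as np

def match_target_voiceprint(speech_embedding: np.ndarray, library: dict) -> str:
    """Return the id of the library voiceprint that best matches the speech.

    `library` maps a voiceprint feature id to its embedding vector.
    Cosine similarity is assumed; the patent leaves the metric open.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    return max(library, key=lambda vp_id: cosine(speech_embedding, library[vp_id]))
```

The returned identifier then resolves, through the feature library, to the corresponding target acoustic model.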
In an embodiment of the present application, based on the above technical solution, before step S420 performs feature matching between the voice data and the voiceprint features in the voiceprint feature library to obtain the target voiceprint feature, the method further includes the following steps:
acquiring N acoustic models, wherein each acoustic model in the N acoustic models is used for identifying voice information of a language or an accent, and N is an integer greater than or equal to M+1;
and respectively extracting voiceprint features of the N acoustic models to obtain the voiceprint feature library.
In this embodiment, after an acoustic model is connected to the speech recognition engine, the engine extracts the voiceprint features of the acoustic model, thereby forming the voiceprint feature library. Specifically, the voiceprint features may be determined from the training data used to train the acoustic model together with its feature data and configuration parameters, or from standard audio of the language or accent to which the acoustic model corresponds.
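A minimal sketch of this construction step, assuming each of the N models is represented by reference audio of its language or accent; extract_voiceprint is a hypothetical stand-in for a real embedding extractor.

```python
import numpy as np

def extract_voiceprint(reference_audio: np.ndarray) -> np.ndarray:
    """Hypothetical extractor: reduce reference audio to a fixed-size vector."""
    flat = np.asarray(reference_audio, dtype=float).ravel()
    # Placeholder embedding: pad/truncate to 128 dims. A real system would
    # use a trained speaker/dialect embedding model here.
    emb = np.zeros(128)
    emb[:min(128, flat.size)] = flat[:128]
    return emb

def build_voiceprint_library(reference_audio_by_model: dict) -> dict:
    """Map each acoustic model id to the voiceprint of its reference audio."""
    return {model_id: extract_voiceprint(audio)
            for model_id, audio in reference_audio_by_model.items()}
```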
In an embodiment of the present application, based on the above technical solution, after extracting the voiceprint features of the N acoustic models respectively to obtain the voiceprint feature library, the method of the present application further includes the following steps:
Acquiring updated training data corresponding to an acoustic model to be updated in the N acoustic models;
training the acoustic model to be updated and updating model parameters according to the updated training data to obtain an updated acoustic model;
extracting updated voiceprint features of the updated acoustic model, and replacing the voiceprint features of the acoustic model to be updated in the voiceprint feature library by using the updated voiceprint features.
Specifically, after the acoustic model is connected to the voice recognition engine, the engine collects feedback on recognition results during voice recognition and uses voice data whose recognition results were inaccurate as training data to update and retrain the acoustic model. After the update training is finished, updated voiceprint features are re-extracted from the retrained model and replace the voiceprint features of the model to be updated in the voiceprint feature library, so that the recognition performance of the acoustic model is continuously improved.
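The update-and-replace loop described here could be organized as below; train_model and extract_voiceprint are injected callables standing in for the engine's actual training and extraction routines, which the patent does not specify.

```python
def update_model_and_voiceprint(model_id, update_training_data,
                                models, voiceprint_library,
                                train_model, extract_voiceprint):
    """Retrain one acoustic model and refresh its entry in the library.

    `train_model(model, data)` returns the updated model and
    `extract_voiceprint(model)` returns its updated voiceprint feature;
    both are assumed interfaces.
    """
    updated_model = train_model(models[model_id], update_training_data)
    models[model_id] = updated_model
    # Replace the stale voiceprint feature with the updated one.
    voiceprint_library[model_id] = extract_voiceprint(updated_model)
    return updated_model
```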
Step S430, performing voice recognition on the voice data through a target acoustic model corresponding to the target voiceprint feature and M similar acoustic models of the target acoustic model to obtain a voice recognition result, wherein M is an integer greater than or equal to 1, and the voiceprint features of the M similar acoustic models are similar to the target voiceprint features.
Specifically, each acoustic model has at least one similar acoustic model. These similar acoustic models may be determined before recognition from the degree of similarity between the voiceprint features of the respective acoustic models, or during recognition from how well each acoustic model's voiceprint features match the speech data. For example, in one embodiment the speech recognition engine calculates, for the speech data, a voiceprint match value for each voiceprint feature in the library. The voiceprint feature with the highest match value is determined to be the target voiceprint feature. The similar acoustic models are then chosen either by a preset match threshold, with other voiceprint features above the threshold treated as similar voiceprint features, or by ranking directly on the match values and taking the acoustic models of the M voiceprint features ranked immediately after the highest one. The speech recognition engine invokes the target acoustic model and the M similar acoustic models to recognize the speech data and obtain the recognition result: it first calls the target acoustic model to recognize the speech data, and then corrects the output of the target acoustic model using the M similar acoustic models to determine the final speech recognition result.
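Under the assumption that a scalar voiceprint match value is available for each library entry, the ranking described above might be implemented like this:

```python
from typing import Optional

def select_models(match_values: dict, m: int,
                  threshold: Optional[float] = None):
    """Pick the target model and up to M similar models from match values.

    `match_values` maps a model id to its voiceprint match value against
    the speech data. If `threshold` is given, similar models must also
    reach it; otherwise the M next-best models are taken directly.
    """
    ranked = sorted(match_values, key=match_values.get, reverse=True)
    target = ranked[0]
    similar = [mid for mid in ranked[1:]
               if threshold is None or match_values[mid] >= threshold][:m]
    return target, similar
```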
In one embodiment of the present application, based on the above technical solution, step S430, performing speech recognition on the speech data through the target acoustic model corresponding to the target voiceprint feature and the M similar acoustic models of the target acoustic model to obtain a speech recognition result, specifically includes the following steps:
acquiring a target acoustic model corresponding to the target voiceprint features;
obtaining M similar acoustic models similar to the target acoustic model according to the model similarity relation;
generating a voice recognition strategy according to the target acoustic model and the M similar acoustic models, wherein the voice recognition strategy is used for indicating the execution sequence of the target acoustic model and the M similar acoustic models;
and according to the voice recognition strategy, performing voice recognition on the voice data through a target acoustic model corresponding to the target voiceprint feature and M similar acoustic models to obtain a voice recognition result.
In this embodiment, the speech recognition engine formulates a speech recognition strategy based on the determined target acoustic model and the similar acoustic models. The strategy specifies the order in which the acoustic models are invoked for recognition. The engine then invokes the target acoustic model and the M similar acoustic models according to the strategy to obtain the speech recognition result. Because multiple acoustic models are called in sequence during recognition, no adaptation layer needs to be deployed in front of the engine to match speech to engines, which removes redundant modules or components from the recognition system and reduces the cost of speech recognition.
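A sketch of strategy generation and sequential execution under these assumptions (the index lookup and the recognize call are assumed interfaces, not APIs named in the patent):

```python
def build_recognition_strategy(target_id, similar_ids, acoustic_model_index):
    """Order model access addresses: target model first, then the similar
    models in descending voiceprint similarity (similar_ids pre-sorted)."""
    return [acoustic_model_index[mid] for mid in (target_id, *similar_ids)]

def run_recognition_strategy(strategy, speech_data, recognize):
    """Invoke the models in strategy order.

    `recognize(address, speech_data)` is an assumed interface that calls
    the acoustic model deployed at `address` and returns a transcript.
    """
    results = [recognize(address, speech_data) for address in strategy]
    first_result, second_results = results[0], results[1:]
    return first_result, second_results  # correction happens downstream
```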
In one embodiment of the present application, based on the above technical solution, the foregoing step of performing, according to the voice recognition strategy, voice recognition on the voice data through the target acoustic model corresponding to the target voiceprint feature and the M similar acoustic models to obtain a voice recognition result specifically includes the following steps:
performing voice recognition on the voice data through the target acoustic model to obtain a first recognition result;
respectively carrying out voice recognition on the voice data according to the M similar acoustic models to obtain M second recognition results;
and correcting the first recognition result according to the M second recognition results to obtain a voice recognition result.
In this embodiment, the target acoustic model and the M similar acoustic models each perform speech recognition on the speech data, yielding a first recognition result and M second recognition results. The first recognition result is then corrected using the M second recognition results in turn: one recognition result is selected from the M second results and used to correct the first result, producing a corrected result; the next result is then selected from the remaining M-1 second results to continue the correction, and so on until all corrections are complete. The correction targets the semantic content of the first recognition result; for example, where the first recognition result and a second recognition result differ, the differing parts can be checked semantically to determine which reading is correct.
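This turn-by-turn correction is essentially a fold over the M auxiliary results; the reconciliation step itself is left open by the patent, so `correct` below is a hypothetical callable.

```python
from functools import reduce

def correct_recognition_result(first_result, second_results, correct):
    """Fold the M second results into the first result one at a time.

    `correct(current, other)` is an assumed reconciliation step that keeps
    `current` where the two transcripts agree and re-examines the
    divergent spans semantically against `other`.
    """
    return reduce(correct, second_results, first_result)
```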
In an embodiment of the present application, based on the above technical solution, before the step of obtaining M similar acoustic models similar to the target acoustic model according to the model similarity relationship, the method of the present application further includes the following steps:
for a specific voiceprint feature in the voiceprint feature library, determining voiceprint similarity between the specific voiceprint feature and N-1 other voiceprint features;
determining M other voiceprint features with highest voiceprint similarity to the specific voiceprint feature;
and for the specific acoustic model corresponding to the specific voiceprint feature, determining M acoustic models corresponding to the M other voiceprint features as M similar acoustic models of the specific acoustic model, and obtaining the model similarity relation.
In this embodiment, the speech recognition engine calculates the similarity between the individual voiceprint features and treats two acoustic models whose similarity satisfies a threshold as similar acoustic models. The number of similar acoustic models may differ from model to model. Specifically, for any voiceprint feature, the similarity between that feature and the other voiceprint features in the feature library is calculated first, and then, given a configured number of similar models, the acoustic models are selected in descending order of similarity until that number is reached.
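Precomputing the model similarity relation over all N voiceprint features could look like the following sketch, again with cosine similarity as an assumed metric:

```python
import numpy as np

def build_model_similarity_relation(voiceprint_library: dict, m: int) -> dict:
    """For each model, record the M models with the most similar voiceprints."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    relation = {}
    for model_id, vp in voiceprint_library.items():
        scores = {other_id: cosine(vp, other_vp)
                  for other_id, other_vp in voiceprint_library.items()
                  if other_id != model_id}
        relation[model_id] = sorted(scores, key=scores.get, reverse=True)[:m]
    return relation
```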
In one embodiment of the present application, based on the above technical solution, the step of generating a speech recognition policy according to the target acoustic model and the M similar acoustic models specifically includes the following steps:
obtaining model access addresses of the target acoustic model and the M similar acoustic models from an acoustic resource index;
determining the execution sequence of the target acoustic model and the M similar acoustic models according to the sequence of the voiceprint similarity between the target acoustic model and the M similar acoustic models;
and sequencing the model access addresses according to the execution sequence to generate a voice recognition strategy.
In the embodiments of the present application, when voice recognition is performed, the target acoustic model is determined according to the degree of matching between the voice data and the voiceprint features of each acoustic model, similar acoustic models with similar voiceprint features are obtained from the target acoustic model, and the voice data is recognized through the target acoustic model together with the similar acoustic models. This avoids the inaccurate recognition results caused by mis-adapted recognition engines for accented or acoustically similar speech, and improves the accuracy of the voice recognition result.
It should be noted that although the steps of the methods in the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes the implementation of the apparatus of the present application, which may be used to perform the speech recognition method in the above-described embodiments of the present application. Fig. 5 schematically shows a block diagram of the voice recognition apparatus in the embodiment of the present application. As shown in fig. 5, the voice recognition apparatus 500 may mainly include:
a data acquisition module 510, configured to acquire voice data;
the feature matching module 520 is configured to perform feature matching on the voice data and voiceprint features in a voiceprint feature library to obtain target voiceprint features, where the voiceprint feature library includes at least one voiceprint feature, and each voiceprint feature corresponds to an acoustic model;
and the voice recognition module 530 is configured to perform voice recognition on the voice data through a target acoustic model corresponding to the target voiceprint feature and M similar acoustic models of the target acoustic model, so as to obtain a voice recognition result, where M is an integer greater than or equal to 1, and voiceprint features of the M similar acoustic models are similar to the target voiceprint feature.
In some embodiments of the present application, based on the above technical solutions, the voice recognition module 530 includes:
the target model acquisition unit is used for acquiring a target acoustic model corresponding to the target voiceprint feature;
a similar model obtaining unit, configured to obtain M similar acoustic models similar to the target acoustic model according to a model similarity relationship;
a strategy generation unit, configured to generate a speech recognition strategy according to the target acoustic model and the M similar acoustic models, where the speech recognition strategy is used to indicate an execution sequence of the target acoustic model and the M similar acoustic models;
and the voice recognition unit is used for carrying out voice recognition on the voice data through the target acoustic model corresponding to the target voiceprint characteristics and M similar acoustic models according to the voice recognition strategy to obtain a voice recognition result.
In some embodiments of the present application, based on the above technical solutions, the speech recognition unit includes:
the first recognition subunit is used for carrying out voice recognition on the voice data through the target acoustic model to obtain a first recognition result;
the second recognition subunit is used for respectively carrying out voice recognition on the voice data according to the M similar acoustic models to obtain M second recognition results;
and the correction unit is used for correcting the first recognition result according to the M second recognition results to obtain a voice recognition result.
In some embodiments of the present application, based on the above technical solutions, the voice recognition apparatus further includes:
the similarity determining module is used for determining the voiceprint similarity between the specific voiceprint feature and N-1 other voiceprint features for the specific voiceprint feature in the voiceprint feature library;
the feature determining module is used for determining M other voiceprint features with the highest voiceprint similarity to the specific voiceprint feature;
and the relation determining module is used for determining M acoustic models corresponding to the M other voiceprint features as M similar acoustic models of the specific acoustic model for the specific acoustic model corresponding to the specific voiceprint features, so as to obtain the model similarity relation.
In some embodiments of the present application, based on the above technical solutions, the voice recognition apparatus further includes:
the model acquisition module is used for acquiring N acoustic models, wherein each acoustic model in the N acoustic models is used for identifying voice information of a language or an accent, and N is an integer greater than or equal to M+1;
and the feature extraction unit is used for respectively extracting voiceprint features of the N acoustic models to obtain the voiceprint feature library.
In some embodiments of the present application, based on the above technical solutions, the voice recognition apparatus further includes:
the training data acquisition module is used for acquiring updated training data corresponding to the acoustic models to be updated in the N acoustic models;
the model training module is used for training the acoustic model to be updated and updating model parameters according to the updated training data to obtain an updated acoustic model;
and the feature updating module is used for extracting updated voiceprint features of the updated acoustic model and replacing the voiceprint features of the acoustic model to be updated in the voiceprint feature library with the updated voiceprint features.
In some embodiments of the present application, based on the above technical solution, the policy generating unit includes:
an address obtaining subunit, configured to obtain, from an acoustic resource index, model access addresses of the target acoustic model and the M similar acoustic models;
a sequence determining subunit, configured to determine an execution sequence of the target acoustic model and the M similar acoustic models according to an order of voiceprint similarities between the target acoustic model and the M similar acoustic models;
and the address ordering subunit is used for ordering the model access addresses according to the execution sequence to generate a voice recognition strategy.
It should be noted that the apparatus provided in the foregoing embodiments and the method provided in the foregoing embodiments belong to the same concept; the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not repeated here.
Fig. 6 shows a schematic diagram of a computer system suitable for use in implementing the electronic device of the embodiments of the present application.
It should be noted that the computer system 600 of the electronic device shown in fig. 6 is only an example and should not impose any limitation on the functions or scope of application of the embodiments of the present application.
As shown in fig. 6, the computer system 600 includes a Central Processing Unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for system operation are also stored. The CPU 601, ROM 602, and RAM 603 are connected to each other through a bus 604. An Input/Output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, mouse, etc.; an output portion 607 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker, etc.; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN (Local Area Network ) card, a modem, or the like. The communication section 609 performs communication processing via a network such as the internet. The drive 610 is also connected to the I/O interface 605 as needed. Removable media 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on drive 610 so that a computer program read therefrom is installed as needed into storage section 608.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611. When executed by a Central Processing Unit (CPU) 601, performs the various functions defined in the system of the present application.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), flash memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wired, or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, in accordance with embodiments of the present application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB disk, or a mobile hard disk) or on a network, and which includes several instructions to cause a computing device (such as a personal computer, a server, a touch terminal, or a network device) to perform the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method of speech recognition, comprising:
acquiring voice data;
performing feature matching on the voice data and voiceprint features in a voiceprint feature library to obtain target voiceprint features, wherein the voiceprint feature library comprises at least one voiceprint feature, and each voiceprint feature corresponds to an acoustic model;
and performing voice recognition on the voice data through a target acoustic model corresponding to the target voiceprint feature and M similar acoustic models of the target acoustic model to obtain a voice recognition result, wherein M is an integer greater than or equal to 1, and the voiceprint features of the M similar acoustic models are similar to the target voiceprint features.
2. The method according to claim 1, wherein performing speech recognition on the speech data through the target acoustic model corresponding to the target voiceprint feature and M similar acoustic models of the target acoustic model to obtain a speech recognition result includes:
acquiring a target acoustic model corresponding to the target voiceprint features;
obtaining M similar acoustic models similar to the target acoustic model according to the model similarity relation;
generating a voice recognition strategy according to the target acoustic model and the M similar acoustic models, wherein the voice recognition strategy is used for indicating the execution sequence of the target acoustic model and the M similar acoustic models;
and according to the voice recognition strategy, performing voice recognition on the voice data through a target acoustic model corresponding to the target voiceprint feature and M similar acoustic models to obtain a voice recognition result.
3. The method according to claim 2, wherein performing speech recognition on the speech data according to the speech recognition policy through the target acoustic model corresponding to the target voiceprint feature and M similar acoustic models to obtain a speech recognition result includes:
performing voice recognition on the voice data through the target acoustic model to obtain a first recognition result;
respectively carrying out voice recognition on the voice data according to the M similar acoustic models to obtain M second recognition results;
and correcting the first recognition result according to the M second recognition results to obtain a voice recognition result.
4. The method of claim 2, wherein prior to obtaining M similar acoustic models that are similar to the target acoustic model based on model similarity relationships, the method further comprises:
for a specific voiceprint feature in the voiceprint feature library, determining voiceprint similarity between the specific voiceprint feature and N-1 other voiceprint features;
determining M other voiceprint features with highest voiceprint similarity to the specific voiceprint feature;
and for the specific acoustic model corresponding to the specific voiceprint feature, determining M acoustic models corresponding to the M other voiceprint features as M similar acoustic models of the specific acoustic model, and obtaining the model similarity relation.
5. The method of claim 1, wherein before feature matching the voice data with the voiceprint features in the voiceprint feature library to obtain the target voiceprint feature, the method further comprises:
acquiring N acoustic models, wherein each acoustic model in the N acoustic models is used for identifying voice information of a language or an accent, and N is an integer greater than or equal to M+1;
and respectively extracting voiceprint features of the N acoustic models to obtain the voiceprint feature library.
6. The method of claim 5, wherein after extracting the voiceprint features of the N acoustic models, respectively, to obtain the voiceprint feature library, the method further comprises:
acquiring updated training data corresponding to an acoustic model to be updated in the N acoustic models;
training the acoustic model to be updated and updating model parameters according to the updated training data to obtain an updated acoustic model;
extracting updated voiceprint features of the updated acoustic model, and replacing the voiceprint features of the acoustic model to be updated in the voiceprint feature library by using the updated voiceprint features.
7. The method of claim 4, wherein generating a speech recognition strategy from the target acoustic model and the M similar acoustic models comprises:
obtaining model access addresses of the target acoustic model and the M similar acoustic models from an acoustic resource index;
determining the execution sequence of the target acoustic model and the M similar acoustic models according to the sequence of the voiceprint similarity between the target acoustic model and the M similar acoustic models;
and sequencing the model access addresses according to the execution sequence to generate a voice recognition strategy.
8. A speech recognition apparatus, comprising:
the data acquisition module is used for acquiring voice data;
the feature matching module is used for carrying out feature matching on the voice data and voiceprint features in a voiceprint feature library to obtain target voiceprint features, the voiceprint feature library comprises at least one voiceprint feature, and each voiceprint feature corresponds to an acoustic model;
and the voice recognition module is used for carrying out voice recognition on the voice data through the target acoustic model corresponding to the target voiceprint feature and M similar acoustic models of the target acoustic model to obtain a voice recognition result, wherein M is an integer greater than or equal to 1, and the voiceprint features of the M similar acoustic models are similar to the target voiceprint feature.
9. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the speech recognition method of any one of claims 1 to 7 via execution of the executable instructions.
10. A computer readable medium on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the speech recognition method according to any one of claims 1 to 7.
Priority Application

Application number: CN202211029147.4A
Priority date: 2022-08-25
Filing date: 2022-08-25
Title: Speech recognition method, device, electronic equipment and readable medium

Publication

Publication number: CN117711377A (pending)
Publication date: 2024-03-15
Family ID: 90157539
Country: CN


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination