CN113327587A - Method and device for voice recognition in specific scene, electronic equipment and storage medium - Google Patents
- Publication number: CN113327587A
- Application number: CN202110616948.XA
- Authority
- CN
- China
- Prior art keywords
- domain
- audio data
- speech
- database
- decoding network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L15/26—Speech to text systems
Abstract
The invention relates to a method for speech recognition in a specific scenario, an electronic device, and a storage medium. The method comprises the following steps: acquiring audio data to be recognized; extracting features of the audio data; and inputting the features of the audio data into a first decoding network to obtain a recognized text. The first decoding network is determined by training an acoustic model on the features of a domain speech database to obtain a domain acoustic model; the domain acoustic model, a dictionary, and a language model together form the first decoding network. After feature extraction, the audio data to be recognized are input into the first decoding network, where the corresponding domain acoustic model is found. Because the domain acoustic model has learned the acoustic signal characteristics of the specific scenario, it matches the audio data to be recognized more closely, so the recognized text is more accurate. This improves speech recognition accuracy in specific application domains while saving time and resource investment.
Description
Technical Field
The invention relates to the technical field of speech recognition, and in particular to a method for speech recognition in a specific scenario, an electronic device, and a storage medium.
Background
A conventional ASR system includes a training phase and a decoding phase. In the training phase, an Acoustic Model (AM) is trained on a speech database using techniques such as deep neural networks, and a Language Model (LM) is trained on a text database using techniques such as n-grams and deep neural networks. In the decoding phase, the acoustic model, the language model, and the pronunciation dictionary obtained during training form a decoding network. After feature extraction, a decoding algorithm searches the decoding network for the optimal path through the input audio, yielding the final recognition result.
In general-purpose application scenarios, the massive datasets used to train the acoustic model are collected from a variety of everyday scenes or drawn from open-source corpora, so the test data seen in the decoding stage are, in acoustic terms, largely matched to the training data, and recognition works very well.
In a domain-specific application scenario, however, the test data in the decoding stage are usually mismatched with the training data in acoustic terms, and this acoustic mismatch can cause a drastic degradation in performance.
Disclosure of Invention
The invention provides a method, an apparatus, an electronic device, and a storage medium for speech recognition in a specific scenario, which address the technical problem of sharply degraded speech recognition performance.
The technical scheme for solving the technical problems is as follows:
in a first aspect, an embodiment of the present invention provides a method for speech recognition in a specific scenario, including:
acquiring audio data to be recognized;
extracting features of the audio data;
inputting the features of the audio data into a first decoding network to obtain a recognized text;
wherein the first decoding network is determined by:
training an acoustic model on the features of a domain speech database to obtain a domain acoustic model;
the domain acoustic model, a dictionary, and a language model together form the first decoding network.
In some embodiments, the features of the domain speech database in the above method are determined by:
acquiring a domain speech database; and
extracting the features of the domain speech database.
In some embodiments, the domain speech database in the above method is a labeled domain speech database.
In some embodiments, extracting the features of the domain speech database in the above method includes at least: pre-emphasis, framing, windowing, and discrete Fourier transform.
In some embodiments, the domain speech databases in the above method are domain speech databases corresponding to different domains.
In a second aspect, an embodiment of the present invention further provides an apparatus for speech recognition in a specific scenario, including:
an acquisition module, for acquiring audio data to be recognized;
an extraction module, for extracting features of the audio data;
an input module, for inputting the features of the audio data into the first decoding network to obtain a recognized text;
wherein the first decoding network is determined by:
training an acoustic model on the features of a domain speech database to obtain a domain acoustic model;
the domain acoustic model, a dictionary, and a language model together form the first decoding network.
In some embodiments, the features of the domain speech database in the apparatus are determined by:
acquiring a domain speech database; and
extracting the features of the domain speech database.
In some embodiments, the domain speech database in the apparatus is a labeled domain speech database.
In some embodiments, extracting the features of the domain speech database in the apparatus includes at least: pre-emphasis, framing, windowing, and discrete Fourier transform.
In some embodiments, the domain speech databases in the apparatus are domain speech databases corresponding to different domains.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a processor and a memory;
the processor is configured to execute any of the above methods for speech recognition in a specific scenario by calling a program or instructions stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a program or instructions that cause a computer to execute any of the above methods for speech recognition in a specific scenario.
The beneficial effects of the invention are as follows: audio data to be recognized are acquired; features of the audio data are extracted; and the features are input into a first decoding network to obtain a recognized text, where the first decoding network is determined by training an acoustic model on the features of a domain speech database to obtain a domain acoustic model, with the domain acoustic model, a dictionary, and a language model together forming the first decoding network. After feature extraction, the audio data to be recognized are input into the first decoding network, where the corresponding domain acoustic model is found. Because the domain acoustic model has learned the acoustic signal characteristics of the specific scenario, it matches the audio data more closely, the recognized text is more accurate, speech recognition accuracy in the specific application domain is improved, and time and resource investment are saved.
Drawings
FIG. 1 is a diagram illustrating a method for speech recognition in a specific scenario according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a second method for speech recognition in a specific scenario according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a third method for speech recognition in a specific scenario according to an embodiment of the present invention;
FIG. 4 is a diagram of an apparatus for speech recognition in a specific scenario according to an embodiment of the present invention;
fig. 5 is a schematic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
In order that the above objects, features, and advantages of the present application may be more clearly understood, the present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. The embodiments described are only some, not all, of the embodiments of the present disclosure; they are merely illustrative and not limiting. All other embodiments that a person of ordinary skill in the art can derive from the described embodiments fall within the scope of the present disclosure.
It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between such entities or actions.
The accuracy of automatic speech recognition (ASR) systems is greatly affected by environmental factors. Different application scenarios differ enormously in their environments; for example, a home scene and an in-vehicle scene produce two completely different environmental signals. Acoustic signals can also vary widely within a single application scenario: in an in-vehicle scene, for instance, one must consider the vehicle being stationary, driving at low speed, driving at high speed, the air conditioner being turned on or off, the doors being opened or closed, and so on.
In certain environments, the performance of a speech recognition system is often unsatisfactory, largely because of data mismatch. Deep customization for a specific domain traditionally requires a large investment: collecting large-scale, high-quality training data so that an acoustic model for that specific environment can be trained and a good recognition effect achieved. In view of this, the present application provides a method, an apparatus, an electronic device, and a storage medium for speech recognition in a specific scenario that address the above technical problem of poor speech recognition performance.
Fig. 1 is a diagram of a method for speech recognition in a specific scenario according to an embodiment of the present invention.
In a first aspect, with reference to fig. 1, an embodiment of the present invention provides a method for speech recognition in a specific scenario, including three steps S101, S102, and S103.
S101: acquire the audio data to be recognized.
Specifically, the audio data to be recognized may be a command sentence spoken by the user, such as "turn on the air conditioner" or "set the temperature to 25 degrees Celsius" in a voice-controlled air conditioner scenario, or "open the meeting PPT" in a conference scenario.
S102: features of the audio data are extracted.
Specifically, extracting the features of the audio data converts the audio into feature vectors that a computer can process. The most common features are MFCCs (Mel-frequency cepstral coefficients) and filter-bank (Fbank) features.
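As a hedged sketch of the two feature types just mentioned, the code below computes log-Mel filter-bank (Fbank) features from a waveform with NumPy, then derives MFCCs by applying a DCT-II to them. The frame sizes, filter count, and cepstral-coefficient count are illustrative choices, not values prescribed by the patent.

```python
import numpy as np

def fbank(signal, sr=16000, n_fft=512, n_mels=8, frame_len=400, hop=160):
    """Log-Mel filter-bank (Fbank) features of a waveform (illustrative)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)           # frame + window
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2        # power spectrum
    mel = lambda f: 2595 * np.log10(1 + f / 700)           # Hz -> Mel
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)          # Mel -> Hz
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))                # triangular filters
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(power @ fb.T + 1e-10)

def mfcc(log_mel, n_ceps=4):
    """MFCCs as the (unnormalized) DCT-II of the log-Mel features."""
    n = log_mel.shape[1]
    basis = np.cos(np.pi / n * (np.arange(n) + 0.5)[None, :]
                   * np.arange(n_ceps)[:, None])
    return log_mel @ basis.T

wave = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
feats = fbank(wave)
ceps = mfcc(feats)
print(feats.shape, ceps.shape)   # (98, 8) (98, 4)
```

Production systems would use a tuned library implementation; this sketch only shows how a waveform becomes a per-frame feature-vector sequence suitable for the decoding network.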
S103: features of the audio data are input into the first decoding network to obtain the recognized text.
Specifically, after feature extraction, the audio data to be recognized are input into the first decoding network, where the corresponding domain acoustic model is found. Because the domain acoustic model has learned the acoustic signal characteristics of the specific scenario, it matches the audio data more closely, the recognized text is more accurate, speech recognition accuracy in the specific application domain is improved, and time and resource investment are saved.
Fig. 2 is a diagram of a second method for speech recognition in a specific scenario according to an embodiment of the present invention.
With reference to fig. 2, the first decoding network in step S103 is determined by the following steps S201 and S202:
S201: train an acoustic model on the features of the domain speech database to obtain a domain acoustic model.
S202: the domain acoustic model, the dictionary, and the language model form the first decoding network.
It should be understood that the domain acoustic model in the present application is obtained by taking an existing general-purpose model and learning, from limited domain speech data, the acoustic signal characteristics of a specific scenario. This "domain acoustic model" replaces the "general acoustic model" of the conventional framework and, together with the dictionary and the language model, forms the first decoding network. Because the domain acoustic model matches the audio data to be recognized in the specific scenario, recognition accuracy in that scenario is improved.
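The adaptation idea in step S201 can be sketched schematically. Below, a tiny softmax classifier over acoustic states stands in for the "general acoustic model"; it is further trained on a small labeled domain set, and the domain loss is checked to drop. The model, data, and hyperparameters are all hypothetical toys; the patent does not specify a particular training algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

n_feat, n_states = 8, 4
W = rng.normal(0, 0.1, (n_feat, n_states))      # "general model" weights

# Small labeled domain dataset: feature vectors + frame-level state labels
X = rng.normal(size=(64, n_feat))
y = rng.integers(0, n_states, size=64)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(W):
    """Mean cross-entropy of the state classifier on the domain data."""
    p = softmax(X @ W)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

before = loss(W)
for _ in range(200):                             # fine-tune on domain data
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1                 # gradient of cross-entropy
    W -= 0.1 * (X.T @ p) / len(y)
after = loss(W)
print(after < before)                            # domain loss decreases
```

The design point mirrors the patent's argument: only a small domain dataset is needed to adapt an existing model, rather than retraining from scratch on massive general data.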
Fig. 3 is a diagram of a third method for speech recognition in a specific scenario according to an embodiment of the present invention.
In some embodiments, in conjunction with fig. 3, the features of the domain speech database in the above method are determined by the following two steps, S301 and S302.
S301: acquire a domain speech database.
Specifically, the domain speech database in the embodiments of the present application differs from a general-purpose speech database in that its scale is much smaller: depending on task difficulty, only a few hundred to a few tens of thousands of sentences are needed. For applications with simpler texts, such as command-style scenarios (e.g., "turn on the sound" or "play a song"), a few hundred sentences can already greatly improve recognition performance; for scenarios with free-form speech and special acoustic characteristics, tens of thousands of sentences may be required. In a conference scenario, for example, the speech content is unconstrained and the relative positions of speakers and microphones vary, so more training data are needed to learn the acoustic features of the scene. In general, the larger the dataset, the greater the improvement in speech recognition performance in the specific scenario.
S302: extract the features of the domain speech database.
Specifically, extracting the features of the domain speech database converts the audio in the database into feature vectors that a computer can process, such as MFCCs (Mel-frequency cepstral coefficients) and filter-bank (Fbank) features.
In some embodiments, the domain speech database in the above method is a labeled domain speech database.
Specifically, the domain speech database in the embodiments of the present application carries text labels: for example, a smart-appliance-control speech database carries a smart-control label, a home-scene speech database carries a home-scene label, an in-vehicle-scene speech database carries an in-vehicle-scene label, and so on.
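The labeled domain speech database described above can be pictured as a simple collection of utterance records, each pairing audio with a transcript and a domain tag. The file names, transcripts, and tags below are hypothetical examples only, not data from the patent.

```python
# Illustrative structure for a labeled domain speech database.
domain_db = [
    {"audio": "utt_0001.wav", "text": "turn on the air conditioner", "domain": "smart-home"},
    {"audio": "utt_0002.wav", "text": "set the temperature to 25 degrees", "domain": "smart-home"},
    {"audio": "utt_0003.wav", "text": "open the meeting PPT", "domain": "conference"},
]

# Selecting the sub-database for one domain:
smart_home = [u for u in domain_db if u["domain"] == "smart-home"]
print(len(smart_home))  # -> 2
```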
In some embodiments, extracting the features of the domain speech database in the above method includes at least: pre-emphasis, framing, windowing, and discrete Fourier transform.
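The four feature-extraction steps just listed can be sketched in order with NumPy. The frame length, hop, pre-emphasis coefficient, and FFT size are common illustrative choices, not values fixed by the patent.

```python
import numpy as np

sr = 16000
x = np.sin(2 * np.pi * 300 * np.arange(sr) / sr)        # 1 s toy signal

# 1. Pre-emphasis: boost high frequencies, y[n] = x[n] - 0.97 * x[n-1]
y = np.append(x[0], x[1:] - 0.97 * x[:-1])

# 2. Framing: 25 ms frames with a 10 ms hop
frame_len, hop = int(0.025 * sr), int(0.010 * sr)
n_frames = 1 + (len(y) - frame_len) // hop
frames = np.stack([y[i * hop: i * hop + frame_len] for i in range(n_frames)])

# 3. Windowing: taper each frame to reduce spectral leakage
frames = frames * np.hamming(frame_len)

# 4. Discrete Fourier transform: per-frame magnitude spectrum
spectrum = np.abs(np.fft.rfft(frames, n=512))
print(spectrum.shape)   # -> (98, 257)
```

Each row of `spectrum` is one frame's spectral slice; a Mel filter bank applied on top of these would yield the Fbank/MFCC features discussed earlier.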
In some embodiments, the domain speech databases in the above method are domain speech databases corresponding to different domains.
It should be understood that the smart-appliance-control domain corresponds to a smart-appliance-control speech database, the conference-scene domain corresponds to a conference-scene domain database, the in-vehicle scene corresponds to an in-vehicle-scene domain database, and so on; these examples are not exhaustive and do not limit the scope of the embodiments of the present application. It should also be appreciated that the finer the segmentation into distinct domain speech databases, the more accurate the resulting domain acoustic model and, in turn, the speech recognition.
Fig. 4 is a diagram of a device for speech recognition in a specific scenario according to an embodiment of the present invention.
In a second aspect, with reference to fig. 4, an embodiment of the present invention further provides an apparatus for speech recognition in a specific scenario, including:
the acquisition module 401: for obtaining audio data to be identified.
Specifically, the audio data to be recognized, which is acquired by the acquisition module 401 in the present application, may be a command statement spoken by a user, such as "turn on the air conditioner", "adjust the temperature to 25 degrees celsius" spoken by the user in a speech control air conditioner scene, or "turn on the conference PPT" spoken by the user in a conference scene.
The extraction module 402: for extracting features of the audio data.
Specifically, the extraction module 402 in the present application extracts features of audio data, and converts the audio data into feature vectors that can be processed by a computer. Such as MFCC, i.e. mel-frequency cepstral coefficients, and Filterbank, i.e. filter bank based, features.
The input module 403: for inputting the characteristics of the audio data into the first decoding network to obtain the recognized text.
Specifically, after the audio data to be recognized in the application is subjected to feature extraction, the input module 403 inputs the audio data to the first decoding network, and a corresponding field acoustic model can be found out from the first decoding network, the field acoustic model is more matched with the audio data to be recognized due to learning of acoustic signal features in a specific scene, the performance of the obtained recognized text is better, the speech recognition accuracy in a specific application field is improved, and time and resource investment are saved.
The first decoding network in the above apparatus is determined by:
training an acoustic model on the features of the domain speech database to obtain a domain acoustic model;
the domain acoustic model, the dictionary, and the language model form the first decoding network.
It should be understood that the domain acoustic model in the present application is obtained by taking an existing general-purpose model and learning, from limited domain speech data, the acoustic signal characteristics of a specific scenario. This "domain acoustic model" replaces the "general acoustic model" of the conventional framework and, together with the dictionary and the language model, forms the first decoding network. Because the domain acoustic model matches the audio data to be recognized in the specific scenario, recognition accuracy in that scenario is improved.
In some embodiments, the features of the domain speech database in the apparatus are determined by:
acquiring a domain speech database.
Specifically, the domain speech database in the embodiments of the present application differs from a general-purpose speech database in that its scale is much smaller: depending on task difficulty, only a few hundred to a few tens of thousands of sentences are needed. For applications with simpler texts, such as command-style scenarios (e.g., "turn on the sound" or "play a song"), a few hundred sentences can already greatly improve recognition performance; for scenarios with free-form speech and special acoustic characteristics, tens of thousands of sentences may be required. In a conference scenario, for example, the speech content is unconstrained and the relative positions of speakers and microphones vary, so more training data are needed to learn the acoustic features of the scene. In general, the larger the dataset, the greater the improvement in speech recognition performance in the specific scenario.
And extracting the features of the domain speech database.
Specifically, extracting the features of the domain speech database converts the audio in the database into feature vectors that a computer can process. The most common features are MFCCs (Mel-frequency cepstral coefficients) and filter-bank (Fbank) features.
In some embodiments, the domain speech database in the apparatus is a labeled domain speech database.
In some embodiments, the features of the domain speech database extracted by the extraction module 402 include at least: pre-emphasis, framing, windowing, and discrete Fourier transform.
In some embodiments, the domain speech databases in the apparatus are domain speech databases corresponding to different domains.
It should be understood that the smart-appliance-control domain corresponds to a smart-appliance-control speech database, the conference-scene domain corresponds to a conference-scene domain database, the in-vehicle scene corresponds to an in-vehicle-scene domain database, and so on; these examples are not exhaustive and do not limit the scope of the embodiments of the present application. It should also be appreciated that the finer the segmentation into distinct domain speech databases, the more accurate the resulting domain acoustic model and, in turn, the speech recognition.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a processor and a memory;
the processor is configured to execute any of the above methods for speech recognition in a specific scenario by calling a program or instructions stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a program or instructions that cause a computer to execute any of the above methods for speech recognition in a specific scenario.
Fig. 5 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.
As shown in fig. 5, the electronic device includes at least one processor 501, at least one memory 502, and at least one communication interface 503. The components of the electronic device are coupled together by a bus system 504, and the communication interface 503 handles information exchange with external devices. It is understood that the bus system 504 enables communication among these components and includes a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, all buses are labeled as the bus system 504 in fig. 5.
It will be appreciated that the memory 502 in this embodiment can be volatile memory, nonvolatile memory, or a combination of both.
In some embodiments, memory 502 stores elements, executable units or data structures, or a subset thereof, or an expanded set thereof as follows: an operating system and an application program.
The operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, which implement basic services and handle hardware-based tasks. The application programs, such as a media player (Media Player) and a browser (Browser), implement various application services. The program implementing any of the methods for speech recognition in a specific scenario provided by the embodiments of the present application may be included in an application program.
In this embodiment of the present application, the processor 501 executes the steps of the method embodiments for speech recognition in a specific scenario provided herein by calling a program or instructions stored in the memory 502 (specifically, a program or instructions stored in an application program), namely:
acquiring audio data to be recognized;
extracting features of the audio data;
inputting the features of the audio data into a first decoding network to obtain a recognized text;
wherein the first decoding network is determined by: training an acoustic model on the features of a domain speech database to obtain a domain acoustic model; the domain acoustic model, a dictionary, and a language model together form the first decoding network.
Any of the methods for speech recognition in a specific scenario provided in the embodiments of the present application may be applied to, or implemented by, the processor 501. The processor 501 may be an integrated circuit chip with signal processing capabilities; in implementation, the steps of the above method may be completed by integrated hardware logic circuits in the processor 501 or by instructions in the form of software. The processor 501 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor or any conventional processor.
The steps of any method for speech recognition in a specific scenario provided in the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software units within a decoding processor. The software units may reside in storage media well known in the art, such as RAM, flash memory, ROM, PROM, or EPROM, and registers. The storage medium is located in the memory 502; the processor 501 reads the information in the memory 502 and, in combination with its hardware, performs the steps of the method for speech recognition in a specific scenario.
Those skilled in the art will appreciate that although some embodiments described herein include some features that other embodiments lack, and vice versa, combinations of features from different embodiments are within the scope of the application and form further embodiments.
Those skilled in the art will appreciate that the description of each embodiment has a respective emphasis, and reference may be made to the related description of other embodiments for those parts of an embodiment that are not described in detail.
Although the embodiments of the present application have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the application, and such modifications and variations fall within the scope defined by the appended claims. Any person skilled in the art can readily conceive of equivalent modifications and substitutions within the technical scope of the present disclosure, and these are likewise intended to be covered. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A method for speech recognition in a specific scene, comprising:
acquiring audio data to be recognized;
extracting features of the audio data;
inputting the features of the audio data into a first decoding network to obtain a recognized text;
wherein the first decoding network is determined by:
training an acoustic model on the features of a domain speech database to obtain a domain acoustic model;
the domain acoustic model, a lexicon, and a language model together constituting the first decoding network.
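For illustration only (not part of the claims), the structure recited in claim 1 — a decoding network composed of a domain acoustic model, a lexicon, and a language model — can be sketched as follows. All class, function, and variable names here are hypothetical, and the greedy decoder stands in for a real WFST/beam-search decoder:

```python
import numpy as np

class DecodingNetwork:
    """Toy sketch of a decoding network built from three components."""

    def __init__(self, acoustic_model, lexicon, language_model):
        self.acoustic_model = acoustic_model    # maps a feature frame to a phone label
        self.lexicon = lexicon                  # maps phone labels to words
        self.language_model = language_model    # scores a candidate word string

    def decode(self, feature_frames):
        # Acoustic model: one phone label per feature frame (greedy stand-in).
        phones = [self.acoustic_model(f) for f in feature_frames]
        # Lexicon: map phone labels to words, skipping out-of-lexicon labels.
        words = [self.lexicon[p] for p in phones if p in self.lexicon]
        # Language model: pick the best-scoring candidate (trivial single candidate here).
        candidates = [" ".join(words)]
        return max(candidates, key=self.language_model)

# Hypothetical usage with toy components:
net = DecodingNetwork(
    acoustic_model=lambda f: "a" if f.mean() > 0 else "b",
    lexicon={"a": "hello", "b": "world"},
    language_model=len,
)
```

In a real system the three components would be composed offline into a single search graph rather than applied sequentially, but the claim only recites that the three together constitute the network.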
2. The method of claim 1, wherein the features of the domain speech database are determined by:
acquiring a domain speech database;
and extracting features of the domain speech database.
3. The method of claim 2, wherein the domain speech database is a labeled domain speech database.
4. The method of claim 2, wherein extracting features of the domain speech database comprises at least: pre-emphasis, framing, windowing, and discrete Fourier transform.
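For illustration only (not part of the claims), the four steps recited in claim 4 can be sketched with numpy; the frame length, hop size, and pre-emphasis coefficient below are conventional assumed values, not taken from the disclosure:

```python
import numpy as np

def extract_features(signal, frame_len=400, hop=160, alpha=0.97):
    """Sketch of the claimed pipeline: pre-emphasis, framing, windowing, DFT."""
    # Pre-emphasis: boost high frequencies, y[n] = x[n] - alpha * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Framing: split the signal into overlapping frames.
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    # Windowing: apply a Hamming window to each frame to reduce spectral leakage.
    frames = frames * np.hamming(frame_len)
    # Discrete Fourier transform: magnitude spectrum per frame (real-input FFT).
    return np.abs(np.fft.rfft(frames, axis=1))
```

With 16 kHz audio these defaults correspond to 25 ms frames and a 10 ms hop; each frame yields `frame_len // 2 + 1` spectral magnitudes.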
5. The method of claim 2, wherein the domain speech database comprises domain speech databases corresponding to different domains.
6. An apparatus for speech recognition in a specific scene, comprising:
an acquisition module, configured to acquire audio data to be recognized;
an extraction module, configured to extract features of the audio data;
an input module, configured to input the features of the audio data into a first decoding network to obtain a recognized text;
wherein the first decoding network is determined by:
training an acoustic model on the features of a domain speech database to obtain a domain acoustic model;
the domain acoustic model, a lexicon, and a language model together constituting the first decoding network.
7. The apparatus of claim 6, wherein the features of the domain speech database are determined by:
acquiring a domain speech database;
and extracting features of the domain speech database.
8. The apparatus of claim 7, wherein the domain speech database is a labeled domain speech database.
9. An electronic device, comprising: a processor and a memory;
the processor is used for executing the method for speech recognition under the specific scene according to any one of claims 1 to 5 by calling the program or the instructions stored in the memory.
10. A computer-readable storage medium storing a program or instructions for causing a computer to execute a method of speech recognition in a specific scenario according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110616948.XA CN113327587A (en) | 2021-06-02 | 2021-06-02 | Method and device for voice recognition in specific scene, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110616948.XA CN113327587A (en) | 2021-06-02 | 2021-06-02 | Method and device for voice recognition in specific scene, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113327587A true CN113327587A (en) | 2021-08-31 |
Family
ID=77419419
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110616948.XA Pending CN113327587A (en) | 2021-06-02 | 2021-06-02 | Method and device for voice recognition in specific scene, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113327587A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114120979A (en) * | 2022-01-25 | 2022-03-01 | 荣耀终端有限公司 | Optimization method, training method, device and medium of voice recognition model |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110024026A (en) * | 2016-11-28 | 2019-07-16 | 谷歌有限责任公司 | Structured text content is generated using speech recognition modeling |
CN110379415A (en) * | 2019-07-24 | 2019-10-25 | 出门问问(苏州)信息科技有限公司 | The training method of domain-adaptive acoustic model |
CN111402862A (en) * | 2020-02-28 | 2020-07-10 | 问问智能信息科技有限公司 | Voice recognition method, device, storage medium and equipment |
CN112002308A (en) * | 2020-10-30 | 2020-11-27 | 腾讯科技(深圳)有限公司 | Voice recognition method and device |
CN112669851A (en) * | 2021-03-17 | 2021-04-16 | 北京远鉴信息技术有限公司 | Voice recognition method and device, electronic equipment and readable storage medium |
CN112802461A (en) * | 2020-12-30 | 2021-05-14 | 深圳追一科技有限公司 | Speech recognition method and device, server, computer readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20240038218A1 (en) | Speech model personalization via ambient context harvesting | |
JP2017058674A (en) | Apparatus and method for speech recognition, apparatus and method for training transformation parameter, computer program and electronic apparatus | |
US20220115002A1 (en) | Speech recognition method, speech recognition device, and electronic equipment | |
CN110097870A (en) | Method of speech processing, device, equipment and storage medium | |
CN112562640B (en) | Multilingual speech recognition method, device, system, and computer-readable storage medium | |
JP6189818B2 (en) | Acoustic feature amount conversion device, acoustic model adaptation device, acoustic feature amount conversion method, acoustic model adaptation method, and program | |
CN112420050B (en) | Voice recognition method and device and electronic equipment | |
WO2023030235A1 (en) | Target audio output method and system, readable storage medium, and electronic apparatus | |
Chao et al. | Speaker-targeted audio-visual models for speech recognition in cocktail-party environments | |
Shahnawazuddin et al. | Improvements in IITG Assamese spoken query system: Background noise suppression and alternate acoustic modeling | |
CN113327587A (en) | Method and device for voice recognition in specific scene, electronic equipment and storage medium | |
CN110648669B (en) | Multi-frequency shunt voiceprint recognition method, device and system and computer readable storage medium | |
Ponting | Computational Models of Speech Pattern Processing | |
CN111798838A (en) | Method, system, equipment and storage medium for improving speech recognition accuracy | |
Sinha et al. | AI based Desktop Voice Assistant for Visually Impared Persons | |
CN112668704B (en) | Training method and device of audio recognition model and audio recognition method and device | |
Rasipuram et al. | Grapheme and multilingual posterior features for under-resourced speech recognition: a study on scottish gaelic | |
CN115132170A (en) | Language classification method and device and computer readable storage medium | |
Shah et al. | Speaker recognition for pashto speakers based on isolated digits recognition using accent and dialect approach | |
CN114203180A (en) | Conference summary generation method and device, electronic equipment and storage medium | |
Yadava et al. | An end-to-end continuous Kannada ASR system under uncontrolled environment | |
Dey et al. | Enhancements in Assamese spoken query system: Enabling background noise suppression and flexible queries | |
CN115699170A (en) | Text echo cancellation | |
CN113782005A (en) | Voice recognition method and device, storage medium and electronic equipment | |
Mann et al. | Tamil talk: What you speak is what you get! |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |