CN115662407A - Voice recognition method, device, equipment and storage medium

Voice recognition method, device, equipment and storage medium

Info

Publication number: CN115662407A
Authority: CN (China)
Prior art keywords: acoustic, target, features, voice, model
Prior art date: 2022-10-25
Legal status: Pending
Application number: CN202211309926.XA
Other languages: Chinese (zh)
Inventors: 王诗达, 毛振苏, 徐勇攀, 李乾, 张琛, 潘仰耀
Current Assignee: Shanghai Pudong Development Bank Co Ltd
Original Assignee: Shanghai Pudong Development Bank Co Ltd
Priority date: 2022-10-25
Filing date: 2022-10-25
Publication date: 2023-01-31
Application filed by Shanghai Pudong Development Bank Co Ltd


Abstract

The invention discloses a voice recognition method, device, equipment and storage medium. The method comprises: inputting target user voice data into a set acoustic model and outputting an acoustic feature sequence corresponding to the target user voice data, wherein the target user voice data carries a region attribute and the acoustic feature sequence comprises a plurality of acoustic features; determining at least one target word corresponding to each acoustic feature in the acoustic feature sequence to obtain a plurality of target words; and inputting the target words into a domain language model and outputting a target speech recognition result, wherein the target speech recognition result is characterized by a word sequence. The technical scheme of the invention can improve the accuracy of voice recognition.

Description

Voice recognition method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of voice processing, in particular to a voice recognition method, a voice recognition device, voice recognition equipment and a storage medium.
Background
In the financial industry, which serves customers nationwide, business personnel handle a large volume of business problem feedback submitted to banks from different regions of the country every day. Although the communication scenarios in financial business are largely fixed, regional languages differ considerably in accent and mode of expression, so business personnel face speech that is complex and difficult to recognize and understand when processing business nationwide.
Disclosure of Invention
Embodiments of the present invention provide a voice recognition method, apparatus, device, and storage medium, which can improve the accuracy of voice recognition.
According to an aspect of the present invention, there is provided a speech recognition method including:
inputting target user voice data into a set acoustic model, and outputting an acoustic feature sequence corresponding to the target user voice data; the target user voice data carries a region attribute; the sequence of acoustic features comprises a plurality of acoustic features;
determining at least one target word corresponding to each acoustic feature in the acoustic feature sequence to obtain a plurality of target words;
inputting the target words into a domain language model, and outputting a target voice recognition result; wherein the target speech recognition result is characterized by a word sequence.
Optionally, determining at least one target word corresponding to each acoustic feature in the acoustic feature sequence includes:
decoding each acoustic feature in the acoustic feature sequence to obtain a plurality of candidate words corresponding to the acoustic feature and probability information of each candidate word;
selecting at least one target word from the plurality of candidate words based on the probability information.
Optionally, inputting the multiple target words into the domain language model, and outputting a target speech recognition result, including:
inputting the target words into a domain language model, and outputting a plurality of candidate recognition results and recognition probabilities of the candidate recognition results;
and determining the candidate recognition result with the highest recognition probability as the target voice recognition result.
Optionally, the training mode of the set acoustic model is as follows:
acquiring voice sample data; the voice sample data carries a region attribute;
inputting the voice sample data into a standard acoustic model, and outputting phoneme residual error characteristics; wherein the standard acoustic model is used for identifying phoneme characteristics of standard voice data; the phoneme residual error feature represents the difference of phoneme features between the voice sample data and standard voice data;
inputting the voice sample data into a sound feature recognition model, and outputting sound features;
training the set acoustic model based on the phoneme residual features and the sound features.
Optionally, training the set acoustic model based on the phoneme residual feature and the sound feature includes:
clustering the phoneme residual error characteristics based on standard words to obtain accent characteristics;
clustering the sound features based on user information to obtain user pronunciation features;
training the set acoustic model based on the accent features and the user pronunciation features.
Optionally, training the set acoustic model based on the accent features and the user pronunciation features includes:
inputting the accent features and the user pronunciation features into the set acoustic model, and outputting first acoustic features;
inputting the first acoustic feature into an auxiliary acoustic model, and outputting a second acoustic feature;
and performing backward parameter adjustment on the set acoustic model based on the first acoustic feature and the second acoustic feature to obtain the trained set acoustic model.
Optionally, the training mode of the domain language model is as follows:
obtaining corpus sample data of a set field; the corpus sample data comprises user corpus sample data and database corpus sample data;
training a domain language model based on the corpus sample data;
and fusing the trained domain language model and the general language model to obtain a final domain language model.
According to another aspect of the present invention, there is provided a voice recognition apparatus including:
the acoustic feature sequence output module is used for inputting target user voice data into a set acoustic model and outputting an acoustic feature sequence corresponding to the target user voice data; the target user voice data carries a region attribute; the sequence of acoustic features comprises a plurality of acoustic features;
the target word acquisition module is used for determining at least one target word corresponding to each acoustic feature in the acoustic feature sequence to acquire a plurality of target words;
the target voice recognition result output module is used for inputting the target words into the domain language model and outputting a target voice recognition result; wherein the target voice recognition result is characterized by a word sequence.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein:
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the speech recognition method according to any of the embodiments of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement a speech recognition method according to any one of the embodiments of the present invention when the computer instructions are executed.
In the technical solution of the present invention, target user voice data is input into a set acoustic model, and an acoustic feature sequence corresponding to the target user voice data is output, where the target user voice data carries a region attribute and the acoustic feature sequence comprises a plurality of acoustic features; at least one target word corresponding to each acoustic feature in the acoustic feature sequence is determined to obtain a plurality of target words; and the target words are input into a domain language model to output a target speech recognition result, which is characterized by a word sequence. By this technical scheme, the accuracy of voice recognition can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present invention, nor do they necessarily limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be considered as limiting the scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 is a flowchart of a speech recognition method according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a speech recognition method according to a second embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a speech recognition apparatus according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a speech recognition method according to an embodiment of the present invention, where the embodiment is applicable to a case of recognizing speech, and the method may be executed by a speech recognition apparatus, and specifically includes the following steps:
and step 110, inputting the target user voice data into a set acoustic model, and outputting an acoustic feature sequence corresponding to the target user voice data.
The target user voice data carries a region attribute, and the acoustic feature sequence includes a plurality of acoustic features. The region attribute may be understood to mean that the target user's voice data carries a local accent; illustratively, the voice data may carry a regional attribute such as Cantonese or Southern Min. An acoustic feature may be understood as the pronunciation corresponding to each word in the voice data. The set acoustic model may be a preset, trained acoustic model, which may be used to identify the acoustic feature sequence corresponding to the user's voice data. The target user voice data may be understood as voice data that comes from a user with a regional accent and needs to be recognized. In this embodiment, the target user may be in a conversation with an intelligent customer service system, and the voice data obtained from the target user's telephone communication may be used.
In this embodiment, the target user voice data may be input into the acoustic model, so as to output an acoustic feature sequence corresponding to the target user voice data.
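By way of illustration only, this step can be sketched as feeding the user's audio frames through a trained acoustic model to obtain a frame-level acoustic feature sequence. The patent does not specify the model architecture; the recurrent encoder, feature dimensions, and tensor shapes below are assumptions.

```python
import torch
import torch.nn as nn

class SetAcousticModel(nn.Module):
    """Hypothetical stand-in for the 'set acoustic model' of step 110."""
    def __init__(self, n_mels: int = 80, feat_dim: int = 256):
        super().__init__()
        # A small recurrent encoder mapping spectral frames to acoustic features.
        self.encoder = nn.GRU(n_mels, feat_dim, num_layers=2, batch_first=True)

    def forward(self, mel_frames: torch.Tensor) -> torch.Tensor:
        # mel_frames: (batch, time, n_mels) -> (batch, time, feat_dim)
        feats, _ = self.encoder(mel_frames)
        return feats  # one acoustic feature per frame

model = SetAcousticModel()
mel = torch.randn(1, 120, 80)            # 120 frames of target user voice data
acoustic_feature_sequence = model(mel)   # sequence of acoustic features
```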
Step 120, determining at least one target word corresponding to each acoustic feature in the acoustic feature sequence to obtain a plurality of target words.
A target word may be understood as the word corresponding to an acoustic feature in the acoustic feature sequence. Since the sequence includes a plurality of acoustic features and each acoustic feature may correspond to at least one target word, a plurality of target words can be obtained. In this embodiment, at least one target word corresponding to each acoustic feature in the acoustic feature sequence is determined to obtain the plurality of target words.
In this embodiment, optionally, the determining at least one target word corresponding to each acoustic feature in the acoustic feature sequence includes: decoding each acoustic feature in the acoustic feature sequence to obtain a plurality of candidate words corresponding to the acoustic feature and probability information of each candidate word; selecting at least one target word from the plurality of candidate words based on the probability information.
Each acoustic feature in the acoustic feature sequence may be decoded by, for example, looking it up against an existing database, lexicon, or dictionary. Because the pronunciations of many words are the same or similar, one acoustic feature may correspond to multiple candidate words. The probability information of a candidate word may be understood as the probability assigned to that candidate word, and the probabilities of the respective candidate words differ. In this embodiment, decoding each acoustic feature in the acoustic feature sequence yields the candidate words and their probability information, and the target words can then be selected based on the probability information. A preset number of target words can be selected according to service requirements; for example, the top 5 target words may be selected in descending order of probability.
In this embodiment, each acoustic feature in the acoustic feature sequence may be decoded to obtain a plurality of candidate words and the probability information of each candidate word, and at least one target word may be selected from the candidate words based on the probability information, as sketched below. With this arrangement, the target words with higher probability can be selected according to service requirements, which helps improve the accuracy of voice recognition.
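A minimal sketch of the selection step, assuming each acoustic feature has already been decoded into candidate words with probabilities (the dictionary-lookup decoder itself is abstracted away); the homophone candidates and `top_k = 5` mirror the example above and are hypothetical values.

```python
def select_target_words(candidates, top_k=5):
    """candidates: list of (word, probability) pairs for one acoustic feature.
    Returns up to top_k words in descending order of probability."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:top_k]]

# Hypothetical same-pronunciation candidates decoded from one acoustic feature.
candidates = [("转账", 0.42), ("转帐", 0.30), ("专长", 0.15),
              ("赚账", 0.08), ("撰章", 0.05)]
target_words = select_target_words(candidates)  # top-5 by probability
```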
Step 130, inputting the target words into a domain language model, and outputting a target voice recognition result.
The target speech recognition result may be characterized by a word sequence, which can be understood as individual words connected into a sequence, that is, a segment of text. The domain language model may be understood as a pre-trained language model for a set domain; in this embodiment it may be a financial-domain language model. The domain language model may be used to fuse the target words into complete sentences that make sense in the domain, and the target speech recognition result is output by the domain language model. In this embodiment, the plurality of target words may be input into the domain language model to output the target speech recognition result.
In this embodiment, optionally, inputting the plurality of target words into the domain language model, and outputting a target speech recognition result, including: inputting the target words into a domain language model, and outputting a plurality of candidate recognition results and recognition probabilities of the candidate recognition results; and determining the candidate recognition result with the highest recognition probability as the target voice recognition result.
The candidate recognition results can be understood as multiple complete sentences assembled from the target words; they can be output by inputting the target words into the domain language model. The recognition probability of a candidate recognition result may be understood as the probability assigned to that result, and the recognition probabilities of the respective candidate results differ. The target speech recognition result may be the candidate recognition result with the highest recognition probability; in this embodiment, that candidate is determined as the target speech recognition result.
In this embodiment, the plurality of target words may be input into the domain language model, the candidate recognition results and the recognition probability of each candidate may be output, and the candidate recognition result with the highest recognition probability may be determined as the target speech recognition result. All possible candidate recognition results and their recognition probabilities can be obtained directly by inputting the target words corresponding to the acoustic features into the model; the domain language model then scores and ranks the candidates, correcting grammar and semantics, and the candidate with the highest recognition probability is finally selected.
With this arrangement, the trained model takes the recognition result with the highest recognition probability as the target speech recognition result, which further improves the accuracy of speech recognition, in particular the recognition of accented speech. A minimal rescoring sketch follows.
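The scoring-and-ranking behaviour described above can be sketched as follows: the domain language model assigns each candidate word sequence a recognition probability, and the argmax is returned. The `lm_score` callable and the toy scores are stand-in assumptions for the trained domain language model.

```python
def pick_target_result(candidate_results, lm_score):
    """candidate_results: candidate word sequences (strings).
    lm_score: maps a sentence to its recognition probability."""
    scored = [(sentence, lm_score(sentence)) for sentence in candidate_results]
    best_sentence, _ = max(scored, key=lambda pair: pair[1])
    return best_sentence

# Hypothetical scores: the in-domain phrasing ranks highest.
toy_scores = {"我要办理转账业务": 0.83, "我要办理专长业务": 0.04}
target_result = pick_target_result(list(toy_scores), lm_score=toy_scores.get)
```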
In this embodiment, target user voice data is input into the set acoustic model and the corresponding acoustic feature sequence is output, where the target user voice data carries a region attribute and the acoustic feature sequence comprises a plurality of acoustic features; at least one target word corresponding to each acoustic feature in the sequence is determined to obtain a plurality of target words; and the target words are input into the domain language model to output the target speech recognition result, which is characterized by a word sequence. By this technical scheme, the accuracy of voice recognition can be improved.
Example two
Fig. 2 is a flowchart of a speech recognition method according to a second embodiment of the present invention, which is optimized on the basis of the first embodiment. The specific optimization is as follows: the training mode of the set acoustic model is: acquiring voice sample data, the voice sample data carrying a region attribute; inputting the voice sample data into a standard acoustic model, and outputting phoneme residual features, wherein the standard acoustic model is used for identifying phoneme features of standard voice data, and the phoneme residual features represent the difference in phoneme features between the voice sample data and the standard voice data; inputting the voice sample data into a sound feature recognition model, and outputting sound features; and training the set acoustic model based on the phoneme residual features and the sound features. As shown in Fig. 2, the method of this embodiment specifically includes the following steps:
Step 210, acquiring voice sample data.
the voice sample data may carry a region attribute. The regional attribute may be understood as voice sample data of a speaker having a regional accent. In this embodiment, voice sample data carrying a region attribute may be acquired.
Further, in this embodiment, before the voice sample data is obtained, standard voice data may be input into the standard acoustic model to output the phoneme features of the standard voice data. Illustratively, standard voice data may be understood as speech pronounced in standard Mandarin Chinese.
Step 220, inputting the voice sample data into a standard acoustic model, and outputting phoneme residual error characteristics.
Wherein the standard acoustic model may be used to identify phoneme features of standard speech data; the phoneme residual features may characterize differences in phoneme features between the speech sample data and standard speech data. Standard speech data may be understood as speech data spoken with a standard pronunciation. The standard acoustic model may be a pre-trained model. In this embodiment, the voice sample data may be input into a standard acoustic model, and the phoneme residual error feature may be output.
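Because the phoneme residual feature is defined as the difference in phoneme features between the voice sample and standard speech, one simple realization (an assumption; the patent does not specify the exact computation) is a frame-wise subtraction of aligned phoneme feature vectors:

```python
import numpy as np

def phoneme_residual(sample_feats: np.ndarray,
                     standard_feats: np.ndarray) -> np.ndarray:
    """Both inputs: (time, dim) phoneme feature arrays aligned on the same text.
    The residual characterizes how the accented sample deviates from the standard."""
    assert sample_feats.shape == standard_feats.shape
    return sample_feats - standard_feats

sample_feats = np.random.rand(100, 48)    # phoneme features of accented voice sample
standard_feats = np.random.rand(100, 48)  # phoneme features of standard speech
residual_feats = phoneme_residual(sample_feats, standard_feats)
```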
Step 230, inputting the voice sample data into a sound feature recognition model, and outputting sound features.
The sound features may be the features output by the sound feature recognition model, which may be a pre-trained model used to recognize the sound features of the voice sample data. In this embodiment, the voice sample data may be input into the sound feature recognition model to output the sound features.
Step 240, training the set acoustic model based on the phoneme residual features and the sound features.
In this embodiment, the set acoustic model may be trained based on the phoneme residual features and the sound features.
In this embodiment, optionally, training the set acoustic model based on the phoneme residual feature and the sound feature includes: clustering the phoneme residual error characteristics based on standard words to obtain accent characteristics; clustering the sound features based on user information to obtain user pronunciation features; training the set acoustic model based on the accent features and the user pronunciation features.
A standard word can be understood as a word with a standard pronunciation. The accent features may be obtained by clustering the phoneme residual features based on the standard words; in this embodiment, after clustering, an averaging or weighted-summation operation may be performed to obtain the accent features. The user information may be understood as user identification information, for example the user's speaking frequency, speaking speed, and pitch; it may also include the content of the user's speech. The user pronunciation features can be obtained by clustering the sound features based on the user information. In this embodiment, the set acoustic model may be trained based on the accent features and the user pronunciation features.
In this embodiment, the phoneme residual features can be clustered based on the standard words to obtain the accent features; the sound features can then be clustered based on the content of the user's speech to obtain the user pronunciation features; and the set acoustic model is trained based on the accent features and the user pronunciation features, as sketched below. This arrangement helps improve the recognition accuracy for accented users.
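A minimal sketch of the two clustering steps. The patent does not fix a clustering algorithm, so k-means is used here as a stand-in, followed by the averaging described above; the feature dimensions and cluster counts are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_average(features: np.ndarray, n_clusters: int) -> np.ndarray:
    """features: (n_samples, dim). Cluster, then average within each cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    return np.stack([features[km.labels_ == c].mean(axis=0)
                     for c in range(n_clusters)])

phoneme_residuals = np.random.rand(500, 48)  # one residual per standard-word token
sound_features = np.random.rand(500, 64)     # one sound feature per user utterance

accent_features = cluster_and_average(phoneme_residuals, n_clusters=8)
user_pronunciation_features = cluster_and_average(sound_features, n_clusters=8)
```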
In this embodiment, optionally, training the set acoustic model based on the accent features and the user pronunciation features includes: inputting the accent features and the user pronunciation features into the set acoustic model, and outputting first acoustic features; inputting the first acoustic feature into an auxiliary acoustic model, and outputting a second acoustic feature; and performing backward parameter adjustment on the set acoustic model based on the first acoustic feature and the second acoustic feature to obtain the trained set acoustic model.
The first acoustic feature may be obtained by inputting the accent features and the user pronunciation features into the set acoustic model. The auxiliary acoustic model may be a pre-trained model used to modify the first acoustic feature to obtain the second acoustic feature. Backward parameter adjustment may be understood as the process of adjusting parameters by computing the difference between the first acoustic feature and the second acoustic feature. In this embodiment, the set acoustic model may be back-tuned based on the first acoustic feature and the second acoustic feature to obtain the trained set acoustic model. In addition, the auxiliary acoustic model in this embodiment adaptively adjusts the scale and shift values of Batch Normalization in the network through an adversarial learning module and eliminates information irrelevant to the accent feature vectors, thereby strengthening the influence of accent data on the recognition result and compensating for the scarcity of accent data resources.
In this embodiment, the accent features and the user pronunciation features can be input into the set acoustic model to output the first acoustic feature; the first acoustic feature is then input into the auxiliary acoustic model for modification to obtain the second acoustic feature; finally, backward parameter adjustment is performed on the set acoustic model based on the first and second acoustic features to obtain the trained set acoustic model (see the sketch below). With this arrangement, the set acoustic model is trained on the speaker's acoustic characteristics, so the trained model matches the specific speaker and the recognition accuracy for accented users is improved.
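The backward parameter adjustment can be sketched as a loop in which a frozen auxiliary model corrects the set model's output and the set model is nudged toward that correction. The MSE objective, layer sizes, and the concatenated 64-dimensional input are all assumptions, not the patent's specified procedure.

```python
import torch
import torch.nn as nn

set_model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))
aux_model = nn.Sequential(nn.Linear(64, 64))   # pre-trained corrector, kept frozen
for p in aux_model.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(set_model.parameters(), lr=1e-3)
inputs = torch.randn(32, 64)  # accent features + user pronunciation features

for step in range(100):
    first_acoustic = set_model(inputs)           # first acoustic feature
    second_acoustic = aux_model(first_acoustic)  # corrected second acoustic feature
    # The difference between the two features drives the parameter adjustment.
    loss = nn.functional.mse_loss(first_acoustic, second_acoustic.detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```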
In this embodiment, optionally, the training mode of the domain language model is: obtaining corpus sample data of a set field; the corpus sample data comprises user corpus sample data and database corpus sample data; training a domain language model based on the corpus sample data; and fusing the trained domain language model and the general language model to obtain a final domain language model.
The set field can be understood as the field of a given industry, for example a financial field such as banking; in this embodiment, the set field may be chosen according to service requirements. The corpus sample data may include user corpus sample data and database corpus sample data. The user corpus sample data can be understood as the user's historical corpus concerning services in the set field. The database corpus sample data can be corpus data commonly used in the industry field, for example downloaded from the Internet.
In this embodiment, key information may also be extracted from the corpus information. Illustratively, when a sentence in the corpus is "hello, I need to transact XX service", extracting the key information "transact" and "XX service" reveals the user's intention.
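A toy sketch of that key-information extraction; the patent does not specify the technique, so simple pattern matching stands in, with "XX service" kept as the placeholder used in the text.

```python
import re

def extract_key_info(utterance: str):
    """Pull the action verb and the requested service from a corpus sentence."""
    match = re.search(r"(transact)\s+(.*service)", utterance)
    return match.groups() if match else None

intent = extract_key_info("hello, I need to transact XX service")
# -> ("transact", "XX service"), revealing the user's intention
```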
The domain language model can be trained based on the user corpus sample data and the database corpus sample data. The general language model can recognize speech in all fields and combine keywords in the corpus into complete professional terms appropriate to the industry field. Fusion may be understood as an averaging or weighted-fusion operation: in this embodiment, the domain language model and the general language model may have the same network layers but different parameter settings, and fusing them may be understood as averaging or weighted summation of the parameters of the corresponding network layers. By fusing the general language model and the domain model, a decoder can compute the acoustic model score and the language model score for a given feature vector sequence and a plurality of candidate word sequences, and the word sequence with the highest overall score is output as the final personalized recognition result.
In this embodiment, user corpus sample data and database corpus sample data of the set field can be acquired; the domain language model is then trained based on the user corpus sample data and the database corpus sample data; finally, an averaging or weighted-fusion operation is performed on the trained domain language model and the general language model to obtain the final domain language model.
Through this arrangement, training is personalized by combining the domain language model with the user's own corpus, which improves the recognition accuracy of the user's commonly used words and grammar and raises the speech recognition rate. A parameter-fusion sketch follows.
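Since the fusion is described as averaging or weighted summation of the parameters of corresponding network layers, a minimal sketch under the stated assumption of identical architectures is the following; the 0.7 weight is illustrative.

```python
import torch

def fuse_language_models(domain_state: dict, general_state: dict,
                         domain_weight: float = 0.7) -> dict:
    """Weighted fusion of two state dicts with identical keys and shapes."""
    return {name: domain_weight * domain_state[name]
                  + (1.0 - domain_weight) * general_state[name]
            for name in domain_state}

# Usage (hypothetical models sharing the same network layers):
# final_lm.load_state_dict(fuse_language_models(
#     domain_lm.state_dict(), general_lm.state_dict()))
```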
In this embodiment, voice sample data is acquired; the voice sample data is input into a standard acoustic model to output phoneme residual features; the voice sample data is input into a sound feature recognition model to output sound features; and the set acoustic model is trained based on the phoneme residual features and the sound features. By this technical scheme, the accuracy of voice recognition can be improved.
Example three
Fig. 3 is a schematic structural diagram of a speech recognition apparatus according to a third embodiment of the present invention, where the apparatus is capable of executing a speech recognition method according to any embodiment of the present invention, and has functional modules and beneficial effects corresponding to the execution method. As shown in fig. 3, the apparatus includes:
the acoustic feature sequence output module 310 is configured to input target user voice data into a set acoustic model, and output an acoustic feature sequence corresponding to the target user voice data.
The target user voice data carries a region attribute; the sequence of acoustic features comprises a plurality of acoustic features;
the target word obtaining module 320 is configured to determine at least one target word corresponding to each acoustic feature in the acoustic feature sequence, and obtain a plurality of target words.
A target speech recognition result output module 330, configured to input the target words into the domain language model, and output a target speech recognition result; wherein the target speech recognition result is characterized by a word sequence.
Optionally, the target word obtaining module 320 is specifically configured to:
decoding each acoustic feature in the acoustic feature sequence to obtain a plurality of candidate words corresponding to the acoustic feature and probability information of each candidate word;
selecting at least one target word from the plurality of candidate words based on the probability information.
Optionally, the target speech recognition result output module 330 is specifically configured to:
inputting the target words into a domain language model, and outputting a plurality of candidate recognition results and recognition probabilities of the candidate recognition results;
and determining the candidate recognition result with the highest recognition probability as the target voice recognition result.
Optionally, the training module of the set acoustic model includes:
a voice sample data acquisition unit for acquiring voice sample data; the voice sample data carries a region attribute;
a phoneme residual error feature output unit, configured to input the voice sample data into a standard acoustic model, and output a phoneme residual error feature; wherein the standard acoustic model is used for identifying phoneme characteristics of standard voice data; the phoneme residual error feature represents the difference of phoneme features between the voice sample data and standard voice data;
the sound feature output unit is used for inputting the voice sample data into a sound feature recognition model and outputting sound features;
a first training unit configured to train the set acoustic model based on the phoneme residual feature and the sound feature.
Optionally, the first training unit includes:
the accent characteristic subunit is used for clustering the phoneme residual error characteristics based on the standard words to obtain accent characteristics;
the user pronunciation feature acquisition subunit is used for clustering the sound features based on user information to acquire user pronunciation features;
and the training subunit is used for training the set acoustic model based on the accent features and the user pronunciation features.
Optionally, the training subunit is specifically configured to:
inputting the accent features and the user pronunciation features into the set acoustic model, and outputting first acoustic features;
inputting the first acoustic feature into a secondary acoustic model and outputting a second acoustic feature;
and performing backward parameter adjustment on the set acoustic model based on the first acoustic feature and the second acoustic feature to obtain the trained set acoustic model.
Optionally, the training module of the domain language model includes:
the corpus sample data acquisition unit is used for acquiring corpus sample data of a set field; the corpus sample data comprises user corpus sample data and database corpus sample data;
the second training unit is used for training a domain language model based on the corpus sample data;
and the domain language model acquisition unit is used for fusing the trained domain language model and the general language model to acquire a final domain language model.
The device can execute the methods provided by all the embodiments of the invention, and has corresponding functional modules and beneficial effects for executing the methods. For details not described in detail in this embodiment, reference may be made to the methods provided in all the foregoing embodiments of the present invention.
Example four
Fig. 4 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention. The electronic device 10 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 4, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 11 performs the various methods and processes described above, such as speech recognition methods.
In some embodiments, the speech recognition method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into the RAM 13 and executed by the processor 11, one or more steps of the speech recognition method described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the speech recognition method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and Virtual Private Server (VPS) services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A speech recognition method, comprising:
inputting target user voice data into a set acoustic model, and outputting an acoustic feature sequence corresponding to the target user voice data; the target user voice data carries a region attribute; the sequence of acoustic features comprises a plurality of acoustic features;
determining at least one target word corresponding to each acoustic feature in the acoustic feature sequence to obtain a plurality of target words;
inputting the target words into a domain language model, and outputting a target voice recognition result; wherein the target speech recognition result is characterized by a word sequence.
2. The method of claim 1, wherein determining at least one target word corresponding to each acoustic feature in the sequence of acoustic features comprises:
decoding each acoustic feature in the acoustic feature sequence to obtain a plurality of candidate words corresponding to the acoustic feature and probability information of each candidate word;
selecting at least one target word from the plurality of candidate words based on the probability information.
3. The method of claim 1, wherein inputting the plurality of target words into a domain language model and outputting a target speech recognition result comprises:
inputting the target words into a domain language model, and outputting a plurality of candidate recognition results and recognition probabilities of the candidate recognition results;
and determining the candidate recognition result with the highest recognition probability as the target voice recognition result.
4. The method of claim 1, wherein the training mode of the set acoustic model is:
acquiring voice sample data; the voice sample data carries a region attribute;
inputting the voice sample data into a standard acoustic model, and outputting phoneme residual error characteristics; wherein the standard acoustic model is used for identifying phoneme characteristics of standard voice data; the phoneme residual error feature represents the difference of phoneme features between the voice sample data and standard voice data;
inputting the voice sample data into a sound feature recognition model, and outputting sound features;
training the set acoustic model based on the phoneme residual features and the sound features.
5. The method of claim 4, wherein training the set acoustic model based on the phoneme residual features and the sound features comprises:
clustering the phoneme residual error characteristics based on standard words to obtain accent characteristics;
clustering the sound features based on user information to obtain user pronunciation features;
training the set acoustic model based on the accent features and the user pronunciation features.
6. The method of claim 5, wherein training the set acoustic model based on the accent features and the user pronunciation features comprises:
inputting the accent features and the user pronunciation features into the set acoustic model, and outputting first acoustic features;
inputting the first acoustic feature into a secondary acoustic model and outputting a second acoustic feature;
and performing backward parameter adjustment on the set acoustic model based on the first acoustic feature and the second acoustic feature to obtain the trained set acoustic model.
7. The method of claim 1, wherein the domain language model is trained by:
obtaining corpus sample data of a set field; the corpus sample data comprises user corpus sample data and database corpus sample data;
training a domain language model based on the corpus sample data;
and fusing the trained domain language model and the general language model to obtain a final domain language model.
8. A speech recognition apparatus, comprising:
the acoustic feature sequence output module is used for inputting target user voice data into a set acoustic model and outputting an acoustic feature sequence corresponding to the target user voice data; the target user voice data carries a region attribute; the sequence of acoustic features comprises a plurality of acoustic features;
the target word acquisition module is used for determining at least one target word corresponding to each acoustic feature in the acoustic feature sequence to acquire a plurality of target words;
the target speech recognition result output module is used for inputting the target words into the domain language model and outputting a target speech recognition result; wherein the target speech recognition result is characterized by a word sequence.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein:
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the speech recognition method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a processor to perform the speech recognition method of any one of claims 1-7 when executed.
CN202211309926.XA, filed 2022-10-25 (priority date 2022-10-25): Voice recognition method, device, equipment and storage medium. Status: Pending. Publication: CN115662407A.

Priority Applications (1)

CN202211309926.XA (priority date 2022-10-25, filing date 2022-10-25): Voice recognition method, device, equipment and storage medium

Publications (1)

CN115662407A, published 2023-01-31

Family

ID=84991506

Family Applications (1)

CN202211309926.XA (pending): Voice recognition method, device, equipment and storage medium

Country Status (1)

CN: CN115662407A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination