CN114596840A - Speech recognition method, device, equipment and computer readable storage medium


Info

Publication number: CN114596840A
Authority: CN (China)
Prior art keywords: phoneme, sequence, phoneme sequence, standard, phonemes
Legal status: Pending
Application number: CN202210209960.3A
Other languages: Chinese (zh)
Inventors: 朱传聪 (Zhu Chuancong), 孙思宁 (Sun Sining)
Current Assignee / Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd; priority to CN202210209960.3A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The application provides a speech recognition method and apparatus, applicable to the fields of automatic driving, vehicle-mounted systems, maps, and traffic. The method comprises the following steps: acquiring a speech to be recognized, and performing phoneme conversion on the speech to be recognized to obtain an initial phoneme sequence, wherein the initial phoneme sequence comprises a first phoneme sequence corresponding to a wake-up word and a second phoneme sequence corresponding to the voice content associated with the wake-up word; acquiring a standard phoneme sequence corresponding to the first phoneme sequence, and determining difference information between the first phoneme sequence and the standard phoneme sequence; determining a sequence adjustment mode corresponding to the difference information, and adjusting the standard phoneme sequence in that mode to obtain a recognizable phoneme sequence corresponding to the first phoneme sequence; and performing speech recognition on the second phoneme sequence in the initial phoneme sequence based on the recognizable phoneme sequence to obtain a speech recognition result corresponding to the voice content. The method and apparatus improve the accuracy of speech recognition and the efficiency of human-computer interaction.

Description

Speech recognition method, device, equipment and computer readable storage medium
Technical Field
The present application relates to speech recognition technologies, and in particular, to a speech recognition method, apparatus, electronic device, computer-readable storage medium, and computer program product.
Background
With the development of artificial intelligence technology, smart voice devices (such as smart speakers, smart phones, and smart televisions) are being used by a growing number of users, who interact with these devices by voice, for example to wake them up.
In the related art, wake-up errors are common: a device may be woken up when no wake-up is intended, or may fail to wake up promptly when the user does intend it. As a result, speech recognition accuracy is low, and the wake-up success rate of smart devices is low or their false wake-up rate is high.
Disclosure of Invention
Embodiments of the present application provide a voice recognition method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can improve accuracy of voice recognition and improve human-computer interaction efficiency.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a voice recognition method, which comprises the following steps:
acquiring a voice to be recognized, and performing phoneme conversion on the voice to be recognized to obtain an initial phoneme sequence, wherein the initial phoneme sequence comprises: a first phoneme sequence corresponding to a wake-up word and a second phoneme sequence corresponding to a voice content associated with the wake-up word;
acquiring a standard phoneme sequence corresponding to the first phoneme sequence, and determining difference information between the first phoneme sequence and the standard phoneme sequence;
determining a sequence adjustment mode corresponding to the difference information, and adjusting the standard phoneme sequence by adopting the sequence adjustment mode to obtain a recognizable phoneme sequence corresponding to the first phoneme sequence;
and performing voice recognition on the second phoneme sequence in the initial phoneme sequence based on the recognizable phoneme sequence to obtain a voice recognition result corresponding to the voice content.
An embodiment of the present application provides a speech recognition apparatus, including:
an obtaining module, configured to obtain a speech to be recognized, and perform phoneme conversion on the speech to be recognized to obtain an initial phoneme sequence, where the initial phoneme sequence includes: a first phoneme sequence corresponding to a wake-up word and a second phoneme sequence corresponding to a voice content associated with the wake-up word;
the determining module is used for acquiring a standard phoneme sequence corresponding to the first phoneme sequence and determining difference information between the first phoneme sequence and the standard phoneme sequence;
the adjusting module is used for determining a sequence adjustment mode corresponding to the difference information, and adjusting the standard phoneme sequence by adopting the sequence adjustment mode to obtain a recognizable phoneme sequence corresponding to the first phoneme sequence;
and the recognition module is used for carrying out voice recognition on the second phoneme sequence in the initial phoneme sequence based on the recognizable phoneme sequence to obtain a voice recognition result corresponding to the voice content.
In the foregoing solution, the determining module is further configured to compare the phonemes in the first phoneme sequence with the phonemes in the standard phoneme sequence, respectively, so as to determine a phoneme difference of the first phoneme sequence compared with the standard phoneme sequence;
wherein the phoneme difference comprises at least one of: missing standard phonemes and phoneme redundancy;
and taking the phoneme difference of the first phoneme sequence compared with the standard phoneme sequence as difference information between the first phoneme sequence and the standard phoneme sequence.
In the foregoing scheme, the adjusting module is further configured to determine that the sequence adjustment mode corresponding to the difference information is a sequence supplement mode when the difference information indicates that the first phoneme sequence has missing standard phonemes relative to the standard phoneme sequence;
correspondingly, the adjusting module is further configured to determine the standard phonemes missing from the first phoneme sequence relative to the standard phoneme sequence, and, based on the missing standard phonemes, determine the phonemes in the standard phoneme sequence other than the missing standard phonemes as candidate standard phonemes;
reconstructing a phoneme sequence based on the candidate standard phonemes to obtain at least one sub-phoneme sequence;
searching the sub-phoneme sequence corresponding to the first phoneme sequence from the at least one sub-phoneme sequence as the recognizable phoneme sequence.
In the above solution, the adjusting module is further configured to determine that the sequence adjustment mode corresponding to the difference information is a noise supplement mode when the difference information indicates that the first phoneme sequence has phoneme redundancy relative to the standard phoneme sequence;
correspondingly, the adjusting module is further configured to determine at least one redundant phoneme of the first phoneme sequence relative to the standard phoneme sequence;
constructing a noise phoneme corresponding to the at least one redundant phoneme;
and reconstructing a phoneme sequence based on the noise phoneme and the standard phoneme sequence to obtain a recognizable phoneme sequence corresponding to the first phoneme sequence.
In the above solution, the adjusting module is further configured to determine a phoneme filling position of the standard phoneme sequence;
filling the noise phonemes based on the phoneme filling positions of the standard phoneme sequence to obtain a noise phoneme sequence;
mapping each phoneme in the first phoneme sequence to a phoneme in the noise phoneme sequence in sequence to obtain a recognizable phoneme sequence corresponding to the first phoneme sequence;
wherein the redundant phonemes in the first phoneme sequence are used for mapping to the noise phonemes in the noise phoneme sequence.
In the foregoing solution, the adjusting module is further configured to determine the number of the at least one redundant phoneme;
and filling that number of noise phonemes at the phoneme filling position of the standard phoneme sequence to obtain a noise phoneme sequence.
In the foregoing solution, the adjusting module is further configured to determine the mapping times for the noise phonemes in a process of mapping the redundant phonemes in the first phoneme sequence to the noise phonemes;
and when the mapping times reach a time threshold value, generating and outputting prompt information, wherein the prompt information is used for prompting that the voice recognition aiming at the voice to be recognized fails.
In the foregoing solution, the adjusting module is further configured to determine that the sequence adjustment mode corresponding to the difference information is a combination of a sequence supplement mode and a noise supplement mode when the difference information indicates that the first phoneme sequence has both missing standard phonemes and phoneme redundancy with respect to the standard phoneme sequence;
correspondingly, the adjusting module is further configured to determine the standard phonemes missing from the first phoneme sequence relative to the standard phoneme sequence, and, based on the missing standard phonemes, determine the phonemes in the standard phoneme sequence other than the missing standard phonemes as candidate standard phonemes;
determining at least one redundant phoneme of the first phoneme sequence relative to the standard phoneme sequence and constructing a noise phoneme corresponding to the at least one redundant phoneme;
and reconstructing a phoneme sequence based on the candidate standard phonemes, the noise phoneme and the standard phoneme sequence to obtain a recognizable phoneme sequence corresponding to the first phoneme sequence.
In the above scheme, the determining module is further configured to perform word segmentation processing on the wake-up word by taking a character as a unit to obtain at least two characters;
respectively determining pronunciations corresponding to the characters;
performing phoneme conversion on the characters based on the pronunciations corresponding to the characters to obtain phoneme sequences corresponding to the characters;
and determining a standard phoneme sequence corresponding to the first phoneme sequence based on the phoneme sequence corresponding to each character.
In the above scheme, the determining module is further configured to perform phoneme conversion on the characters based on the pronunciations corresponding to the characters to obtain an intermediate phoneme sequence corresponding to the characters;
when the number of the phonemes in the intermediate phoneme sequence is one, splicing the phonemes in the intermediate phoneme sequence with a preset modified phoneme to obtain a corresponding target phoneme;
and replacing the phoneme in the intermediate phoneme sequence with the target phoneme to obtain a phoneme sequence corresponding to the character.
In the foregoing scheme, the identification module is further configured to cut the initial phoneme sequence based on the recognizable phoneme sequence to obtain the second phoneme sequence;
acquiring a dictionary for phoneme recognition;
and performing voice recognition on the second phoneme sequence based on the dictionary to obtain a voice recognition result corresponding to the voice content.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the voice recognition method provided by the embodiment of the application when the processor executes the executable instructions stored in the memory.
The embodiment of the present application provides a computer-readable storage medium, which stores executable instructions for causing a processor to implement the speech recognition method provided by the embodiment of the present application when executed.
The embodiment of the present application provides a computer program product, which includes a computer program or instructions, and the computer program or instructions, when executed by a processor, implement the speech recognition method provided by the embodiment of the present application.
The embodiment of the application has the following beneficial effects:
By applying the embodiments of the present application, phoneme conversion is performed on the speech to be recognized to obtain a first phoneme sequence containing the wake-up word and a second phoneme sequence corresponding to the voice content associated with the wake-up word, so the user need not pause after speaking the wake-up word, which effectively reduces the time of human-computer interaction. Then, difference information between the first phoneme sequence and a standard phoneme sequence representing the wake-up word is determined, and different sequence adjustment modes are chosen for different difference information; performing different sequence adjustments according to the difference information improves the accuracy of determining the recognizable phoneme sequence corresponding to the first phoneme sequence. Finally, speech recognition is performed on the second phoneme sequence to obtain a recognition result, which reduces the influence of the difference information on recognizing the second phoneme sequence and improves the efficiency of speech recognition.
Drawings
Fig. 1 is a schematic architecture diagram of a speech recognition system 100 provided in an embodiment of the present application;
fig. 2 is a schematic structural diagram of an electronic device implementing a speech recognition method according to an embodiment of the present application;
FIG. 3 is a flow chart of a speech recognition method provided by an embodiment of the present application;
FIG. 4 is a diagram of a phoneme sequence provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of difference information provided by an embodiment of the present application;
FIG. 6 is a flowchart of determining a recognizable phoneme sequence in the sequence supplement mode provided by an embodiment of the present application;
fig. 7 is an illustration of a phoneme sequence directed graph provided in an embodiment of the present application;
FIG. 8 is a flowchart of determining a recognizable phoneme sequence in the noise supplement mode provided by an embodiment of the present application;
fig. 9 is a schematic diagram illustrating the filling process of noise phonemes provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of the noise supplement mode provided by an embodiment of the present application;
FIG. 11 is a flowchart of a method for determining recognizable phoneme sequences provided by an embodiment of the present application;
FIG. 12 is a flow chart of another method for determining recognizable phoneme sequences provided by embodiments of the present application;
FIG. 13 is a schematic diagram of a Oneshot recognition method provided in the embodiments of the present application;
FIG. 14 is a speech recognition decoding diagram provided by an embodiment of the present application;
FIG. 15 is a schematic diagram of a speech frame provided in an embodiment of the present application;
FIG. 16 is a schematic diagram of self-loop (spin) absorption provided by an embodiment of the present application;
FIG. 17 is a schematic diagram of phoneme decoding provided by an embodiment of the present application;
fig. 18 is a flowchart of speech recognition provided in an embodiment of the present application.
Detailed Description
In order to make the purpose, technical solutions and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort fall within the protection scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
Where the terms "first/second/third" appear in the specification and the description below, they are used merely to distinguish similar objects and do not denote a particular order or importance. It will be appreciated that "first/second/third" may be interchanged, where permissible, in a particular order or sequence, so that the embodiments of the application described herein can be practiced in an order other than that shown or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
Before further detailed description of the embodiments of the present application, the terms and expressions referred to in these embodiments are described; the following explanations apply to them.
1) Weighted Finite-State Transducer (WFST): WFSTs are used to represent the HCLG model in Automatic Speech Recognition (ASR). Because they can be conveniently composed and optimized, WFSTs support a simple and flexible ASR decoder design; in speech recognition, the WFST mainly plays the role of the decoding graph.
2) Voice wakeup (KWS, keyword spotting): voice wake-up enables real-time detection of specific voice segments with user intent in a continuous voice stream.
3) Oneshot interaction mode: an interaction mode in voice interaction in which the user speaks the wake-up word and the content to be recognized in one continuous utterance, so that wake-up and recognition happen in a single pass. The experience presented to the user is that the device starts services such as speech recognition and semantic understanding directly after being woken up, shortening the interaction time. Devices using Oneshot interaction must support looping back the played audio for echo cancellation, and the echo cancellation must be very robust. Oneshot audio refers to combined audio containing the front-end wake-up word audio and the back-end recognition audio. The Oneshot recognition result refers to the recognition result obtained after the wake-up word's recognition result is removed from the Oneshot recognition output.
4) Acoustic model: modeling of pronunciation that converts a speech input into an acoustic representation; more precisely, it gives the probability that a speech frame corresponds to each of a number of states. The acoustic model may be a convolutional neural network model, a deep neural network model, or the like, and needs to be trained on a large amount of speech data. The pronunciation of a word is made up of phonemes; a state is understood here as a speech unit finer than a phoneme, and a phoneme is usually divided into 3 states. Speech recognition thus recognizes speech frames as states, combines states into phonemes, and combines phonemes into words (see the illustrative sketch after this term list).
5) Language model: an abstract mathematical model of a language built from objective linguistic facts. Its function can be simply understood as resolving ambiguity among same-sounding candidates: after the acoustic model gives a pronunciation sequence, the language model finds the character sequence with the highest probability among the candidate character sequences, and that character sequence is the text corresponding to the speech to be recognized. The linguistic features of the speech to be recognized refer to semantic features and/or text features of the wake-up word.
6) Hidden Markov Model (HMM): a model describing a Markov process with hidden, unknown parameters. A Markov process is a stochastic process with the Markov property: given that the state of the process at time t0 is known, the conditional distribution of the state at any time t > t0 is independent of the states of the process before t0. In other words, the conditional probability in a Markov process depends only on the current state of the system and is unrelated to its past states.
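Stated formally (a standard textbook formulation added for clarity, not taken from the patent text), the Markov property in term 6) can be written as:

```latex
% Markov property: conditioned on the state at time t_0, the state at any
% later time t > t_0 is independent of the history before t_0.
P\bigl(X_t \in A \mid X_{t_0} = x,\ \{X_s : s < t_0\}\bigr)
    = P\bigl(X_t \in A \mid X_{t_0} = x\bigr), \qquad t > t_0 .
```

The frame-to-word pipeline in term 4) (speech frames are recognized as states, states combine into phonemes, phonemes combine into words) can likewise be sketched in a few lines of Python. This is illustrative only: the state labels, the duplicate-collapsing step, and the hard 3-states-per-phoneme grouping are assumptions, whereas a real acoustic model emits per-frame state probabilities rather than hard labels.

```python
STATES_PER_PHONEME = 3  # the common convention mentioned in term 4)

def frames_to_phonemes(frame_states):
    """Collapse a frame-level state path into phoneme labels."""
    # Each state usually spans several frames, so drop consecutive duplicates.
    path = [s for i, s in enumerate(frame_states)
            if i == 0 or s != frame_states[i - 1]]
    # Group every 3 distinct states into one phoneme.
    return ["+".join(path[i:i + STATES_PER_PHONEME])
            for i in range(0, len(path), STATES_PER_PHONEME)]

# Hypothetical frame-level state labels for the phones "n" and "i3":
print(frames_to_phonemes(["n_1", "n_1", "n_2", "n_3", "i3_1", "i3_2", "i3_2", "i3_3"]))
# -> ['n_1+n_2+n_3', 'i3_1+i3_2+i3_3']
```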
Based on the above explanations of the terms involved in the embodiments of the present application, the speech recognition system provided by the embodiments is described below. Referring to fig. 1, fig. 1 is a schematic diagram of the architecture of a speech recognition system 100 provided in an embodiment of the present application. To support an exemplary application, in the speech recognition system 100, terminals with an intelligent speech recognition function (terminals 400-1 and 400-2 are shown as examples) are connected to the server 200 through a network 300; the network 300 may be a wide area network or a local area network, or a combination of the two, and uses wireless or wired links for data transmission.
The terminal (such as the terminal 400-1 and the terminal 400-2) is used for collecting the voice of the current environment as the voice to be recognized and sending a voice recognition request carrying the voice to be recognized to the server.
The server 200 is configured to receive the speech recognition request carrying the speech to be recognized, acquire the speech to be recognized, and perform phoneme conversion on it to obtain an initial phoneme sequence, where the initial phoneme sequence includes: a first phoneme sequence corresponding to the wake-up word and a second phoneme sequence corresponding to the voice content associated with the wake-up word; acquire a standard phoneme sequence corresponding to the first phoneme sequence and determine difference information between the first phoneme sequence and the standard phoneme sequence; determine a sequence adjustment mode corresponding to the difference information and adjust the standard phoneme sequence in that mode to obtain a recognizable phoneme sequence corresponding to the first phoneme sequence; perform speech recognition on the second phoneme sequence in the initial phoneme sequence based on the recognizable phoneme sequence to obtain a speech recognition result corresponding to the voice content, and perform semantic analysis on the speech recognition result to obtain a semantic analysis result; and acquire response information corresponding to the semantic analysis result and output the response information to the terminal.
The terminals (such as the terminal 400-1 and the terminal 400-2) are further configured to receive the response information sent by the server and play the response information.
In some embodiments, a user speaks the wake-up word preset for the smart voice device. After a microphone in the smart voice device collects the speech to be recognized containing the wake-up word from the user in the current environment, a speech recognition request carrying the speech to be recognized is sent to the server for speech recognition. The server acquires the speech to be recognized and performs phoneme conversion on it to obtain an initial phoneme sequence, where the initial phoneme sequence includes: a first phoneme sequence corresponding to the wake-up word and a second phoneme sequence corresponding to the voice content associated with the wake-up word; acquires a standard phoneme sequence corresponding to the first phoneme sequence and determines difference information between the two; determines a sequence adjustment mode corresponding to the difference information and adjusts the standard phoneme sequence in that mode to obtain a recognizable phoneme sequence corresponding to the first phoneme sequence; performs speech recognition on the second phoneme sequence based on the recognizable phoneme sequence to obtain a speech recognition result corresponding to the voice content; performs semantic analysis on the speech recognition result to obtain a semantic analysis result; and acquires response information corresponding to the semantic analysis result and outputs it to the smart voice device, which plays the received response information through a loudspeaker.
In practical application, the server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like. The terminals (such as the terminal 400-1 and the terminal 400-2) may be, but are not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart voice interaction device, a smart speaker, a smart television, a smart watch, a smart home appliance, a vehicle-mounted terminal, an aircraft, and the like. The terminals (e.g., terminal 400-1 and terminal 400-2) and the server 200 may be directly or indirectly connected through wired or wireless communication, and the application is not limited thereto.
The embodiments of the present application can also be implemented by means of cloud technology, which refers to a hosting technology that unifies a series of resources such as hardware, software, and network within a wide area network or a local area network to implement data calculation, storage, processing, and sharing.
Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, and the like applied on the basis of the cloud computing business model; these technologies can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support, since the background services of technical network systems require large amounts of computing and storage resources.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device implementing a speech recognition method according to an embodiment of the present application. In practical applications, the electronic device 500 may be a server or a terminal shown in fig. 1, and the electronic device 500 is taken as the terminal shown in fig. 1 as an example to explain an electronic device implementing the speech recognition method according to the embodiment of the present application, where the electronic device 500 provided in the embodiment of the present application includes: at least one processor 510, memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It is understood that the bus system 540 is used to enable communications among the components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, however, the various buses are labeled as bus system 540 in fig. 2.
The Processor 510 may be an integrated circuit chip having Signal processing capabilities, such as a general purpose Processor, a Digital Signal Processor (DSP), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like, wherein the general purpose Processor may be a microprocessor or any conventional Processor, or the like.
The user interface 530 includes one or more output devices 531 enabling presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 530 also includes one or more input devices 532, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
The memory 550 may be volatile memory or nonvolatile memory, and may include both. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 550 described in the embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating to other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), etc.;
a presentation module 553 for enabling presentation of information (e.g., a user interface for operating peripherals and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
an input processing module 554 to detect one or more user inputs or interactions from one of the one or more input devices 532 and to translate the detected inputs or interactions.
In some embodiments, the speech recognition device provided by the embodiments of the present application may be implemented in software, and fig. 2 shows a speech recognition device 555 stored in a memory 550, which may be software in the form of programs and plug-ins, and includes the following software modules: the obtaining module 5551, the determining module 5552, the adjusting module 5553 and the identifying module 5554 are logical modules, and thus may be arbitrarily combined or further split according to the implemented functions, and the functions of the respective modules will be described below.
In other embodiments, the voice recognition Device provided in the embodiments of the present Application may be implemented by a combination of hardware and software, and as an example, the voice recognition Device provided in the embodiments of the present Application may be a processor in the form of a hardware decoding processor, which is programmed to perform the voice recognition method provided in the embodiments of the present Application, for example, the processor in the form of a hardware decoding processor may be one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.
Based on the above description of the speech recognition and electronic device provided in the embodiments of the present application, the speech recognition method provided in the embodiments of the present application is described below. In some embodiments, the voice recognition method provided by the embodiments of the present application may be implemented by a server or a terminal alone, or implemented by a server and a terminal in cooperation. In some embodiments, the terminal or the server may implement the speech recognition method provided by the embodiment of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; the Application program may be a local (Native) Application program (APP), that is, a program that needs to be installed in an operating system to be run, such as a client that supports voice recognition, e.g., a mobile phone voice assistant; or may be an applet, i.e. a program that can be run only by downloading it to the browser environment; but also an applet that can be embedded into any APP. In general, the computer programs described above may be any form of application, module or plug-in.
The following describes a speech recognition method provided in the embodiments of the present application by taking a server as an example. Referring to fig. 3, fig. 3 is a schematic flowchart of a speech recognition method provided in an embodiment of the present application, where the speech recognition method provided in the embodiment of the present application includes:
in step 101, a server acquires a speech to be recognized, and performs phoneme conversion on the speech to be recognized to obtain an initial phoneme sequence.
It should be noted that the initial phoneme sequence includes: a first phoneme sequence corresponding to the wake-up word and a second phoneme sequence corresponding to the voice content associated with the wake-up word.
In practical implementation, the speech to be recognized received by the server may be standard speech or non-standard speech. In the "wake-up and recognize" voice interaction scenario, the server receives the speech to be recognized sent by the smart voice device, i.e., the speech of the environment in which the device is located. Standard speech suitable for the "wake-up and recognize" interaction mode consists of two parts of voice content: the wake-up word and the voice content expressing the user's intent. The wake-up word is used to wake up the smart voice device; the voice content expressing the user's intent is what the smart voice device performs speech recognition on before returning response information. Non-standard speech may arise from noise interference or similar conditions in the device's environment, so the speech to be recognized received by the server may be non-standard speech with truncated, extra, or swallowed input.
For example, the text of user U's standard speech D for the smart voice device is "Hello, Xiaowei! How is the weather today?", where "Hello, Xiaowei!" is the wake-up word of the smart voice device and "How is the weather today" is the speech expressing the user's true intent. In one situation, the speech to be recognized collected by the smart voice device is truncated because user U speaks too fast, becoming, say, "Xiaowei! How is the weather today" with part of the wake-up word cut off; in another situation, because user U is playing music in the environment, the smart voice device collects extra voice input, and the speech to be recognized gains additional audio ahead of "Hello, Xiaowei! How is the weather today".
In practical implementation, the server receives the speech to be recognized collected by the smart voice device and performs phoneme conversion on it through a preset acoustic model to obtain the initial phoneme sequence, which may include a phoneme sequence corresponding to the wake-up word (called the first phoneme sequence) and a phoneme sequence corresponding to the voice content associated with the wake-up word and expressing the user's intent (called the second phoneme sequence). Following the above example, the first phoneme sequence (ni3hao3xiao3wei1) corresponds to the wake-up word "Hello, Xiaowei!", and the second phoneme sequence (jin1tian1tian1qi4ru2he2) corresponds to "How is the weather today".
In step 102, a standard phoneme sequence corresponding to the first phoneme sequence is obtained, and difference information between the first phoneme sequence and the standard phoneme sequence is determined.
In practical implementation, the first phoneme sequence in the initial phoneme sequence corresponds to the wake-up word, and the standard phoneme sequence corresponding to the first phoneme sequence can be regarded as the phoneme sequence of the standard wake-up word.
To illustrate the manner in which the standard phoneme sequence is obtained, in some embodiments the server may determine it as follows: the server performs word segmentation processing on the wake-up word by taking the character as a unit to obtain at least two characters; determines the pronunciations corresponding to the characters respectively; performs phoneme conversion on the characters based on the pronunciations corresponding to the characters to obtain phoneme sequences corresponding to the characters; and determines the standard phoneme sequence corresponding to the first phoneme sequence based on the phoneme sequence corresponding to each character.
In practical implementation, let the wake-up word K be composed of n characters {K1, K2, ..., Kn}, where n ≥ 1 and n is an integer. In a simple treatment of Chinese, a character can be directly defined as a word, so n can be regarded as the length of the wake-up word K. A given Kn possesses the pronunciations {A1, A2, ..., Am}, where m ≥ 1 and m is an integer; if m > 1, Kn is a polyphone. Each pronunciation Am can in turn be divided into a phoneme sequence {P1, P2, ..., Pj}, where a phoneme P is a phone. For the wake-up word K, mK1 × mK2 × ... × mKn phoneme sequences can therefore be constructed, where mKn represents the number of pronunciations of the nth character. In practical applications, all phoneme sequences corresponding to the wake-up word K can be represented by a directed graph; in the directed graph, the phoneme sequences may also be called pronunciation paths, and the nodes of the directed graph are the individual phonemes.
Illustratively, take the wake-up word K composed of the four characters {"ni", "hao", "ha", "fu"} as an example, so n = 4. Here "hao" and "ha" are polyphones, distinguished as "hao3"/"hao4" and "ha1"/"ha4" respectively (the digits 1, 3, 4, etc. can be regarded as disambiguation symbols that ensure the uniqueness of each pronunciation). The paths "ni3 hao3 ha1 fu2", "ni3 hao4 ha1 fu2", "ni3 hao3 ha4 fu2" and "ni3 hao4 ha4 fu2" can then be formed; that is, the text is converted into the phoneme sequences {ni3hao3ha1fu2}, {ni3hao4ha1fu2}, {ni3hao3ha4fu2}, and {ni3hao4ha4fu2}. Referring to fig. 4, fig. 4 is a phoneme sequence diagram provided in an embodiment of the present application: the 4 phoneme sequences form the directed graph shown in the figure, and each node in the graph represents a corresponding phoneme. In practical applications, for convenience of representation, empty nodes (eps) are inserted between characters, i.e., a phoneme sequence can be represented as {ni3 eps hao3 eps ha1 eps fu2}.
By this method, the various pronunciations of the wake-up word can all be covered, improving the accuracy of speech recognition.
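The construction of the mK1 × mK2 × ... × mKn pronunciation paths amounts to a Cartesian product over per-character pronunciation lists. The following Python sketch is illustrative only: the LEXICON table and the build_paths helper are hypothetical names, and a real system would build the directed graph with eps nodes rather than flat lists.

```python
from itertools import product

# Hypothetical pronunciation lexicon for the wake-up word {"ni", "hao", "ha", "fu"}.
# Each character maps to its candidate pronunciations; each pronunciation is
# already split into its phoneme sequence.
LEXICON = {
    "ni":  [["ni3"]],
    "hao": [["hao3"], ["hao4"]],  # polyphone: hao3 / hao4
    "ha":  [["ha1"], ["ha4"]],    # polyphone: ha1 / ha4
    "fu":  [["fu2"]],
}

def build_paths(wake_word_chars):
    """Enumerate all mK1 x mK2 x ... x mKn phoneme sequences (pronunciation paths)."""
    per_char = [LEXICON[c] for c in wake_word_chars]
    paths = []
    for combo in product(*per_char):  # pick one pronunciation per character
        paths.append([p for pron in combo for p in pron])
    return paths

for path in build_paths(["ni", "hao", "ha", "fu"]):
    print(path)
# 4 paths: ni3-hao3-ha1-fu2, ni3-hao3-ha4-fu2, ni3-hao4-ha1-fu2, ni3-hao4-ha4-fu2
```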
In some embodiments, the server may determine the phoneme sequence for characters with a single phoneme as follows: the server performs phoneme conversion on the characters based on their corresponding pronunciations to obtain an intermediate phoneme sequence for each character; when the number of phonemes in the intermediate phoneme sequence is one, splices the phoneme in the intermediate phoneme sequence with a preset modified phoneme to obtain a corresponding target phoneme; and replaces the phoneme in the intermediate phoneme sequence with the target phoneme to obtain the phoneme sequence corresponding to the character.
In practical implementation, for a Chinese wake-up word, a Chinese character can be regarded as being composed of two phonemes (an initial part and a final part). For some single-phoneme characters, such as "a", "o", and "e", splicing preset modified phonemes such as Ini_e, Ini_o, and Ini_a ensures that a single-phoneme character is also composed of two phonemes.
For example, assume the wake-up word K in the speech to be recognized is "Xiao A, Xiao A!". The intermediate phoneme sequence of the character "A" is {a1}; the preset modified phoneme Ini_e can then be spliced with a1 to form the phoneme sequence {Ini_e a1}, which contains two phonemes, for the character "A".
This method of supplementing single-phoneme characters with preset modified phonemes can effectively reduce the phoneme recognition time in the speech recognition process and improve speech recognition efficiency.
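A minimal sketch of the padding rule for single-phoneme characters follows. The patent names the modified phonemes Ini_e, Ini_o, and Ini_a; which modified phoneme pairs with which single-phoneme pronunciation, and the pad_single_phoneme helper itself, are assumptions made for illustration.

```python
# Assumed pairing of single-phoneme pronunciations with modified phonemes.
MODIFIED_PHONEMES = {"a1": "Ini_e", "o1": "Ini_o", "e1": "Ini_e"}

def pad_single_phoneme(intermediate_seq):
    """Ensure a character's phoneme sequence contains two phonemes."""
    if len(intermediate_seq) == 1:
        phoneme = intermediate_seq[0]
        prefix = MODIFIED_PHONEMES.get(phoneme, "Ini_e")
        # Splice the modified phoneme in front, as in {Ini_e a1}.
        return [prefix, phoneme]
    return intermediate_seq

print(pad_single_phoneme(["a1"]))         # -> ['Ini_e', 'a1']
print(pad_single_phoneme(["x", "iao3"]))  # already two phonemes: unchanged
```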
Next, the determination of the difference information is described. In some embodiments, the server may determine the difference information between the first phoneme sequence and the standard phoneme sequence as follows: the server compares the phonemes in the first phoneme sequence with the phonemes in the standard phoneme sequence respectively to determine the phoneme difference of the first phoneme sequence compared with the standard phoneme sequence, wherein the phoneme difference comprises at least one of missing standard phonemes and phoneme redundancy; and takes the phoneme difference of the first phoneme sequence compared with the standard phoneme sequence as the difference information between the first phoneme sequence and the standard phoneme sequence.
In practical implementation, after obtaining the first phoneme sequence, the server obtains the standard phoneme sequence corresponding to the wake-up word. The phoneme difference between the first phoneme sequence and the standard phoneme sequence is determined by comparing each phoneme of the first phoneme sequence with each phoneme of the standard phoneme sequence, and the phoneme difference includes at least two cases: missing standard phonemes and phoneme redundancy. A missing standard phoneme means that, relative to the standard phoneme sequence, a standard phoneme is absent from the first phoneme sequence; phoneme redundancy means that the first phoneme sequence contains extra phonemes relative to the standard phoneme sequence.
Exemplarily, referring to fig. 5, fig. 5 is a schematic diagram of difference information provided by an embodiment of the present application. Assume that the phoneme sequence {ni3hao3xiao3wei1} shown as number 1 in the figure is the standard phoneme sequence corresponding to the wake-up word K "Hello, Xiaowei!". The phoneme difference corresponding to the phoneme sequence {ni3hao3ni3hao3xiao3wei1} shown as number 2 is phoneme redundancy, i.e., the phonemes of ni3hao3 are added relative to the standard phoneme sequence; the phoneme difference corresponding to the phoneme sequence {xiao3wei1} shown as number 3 is missing standard phonemes, i.e., the phonemes of ni3hao3 are missing relative to the standard phoneme sequence.
This way of determining the difference information covers various cases of phoneme loss and redundancy and improves the generality of speech recognition.
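As a concrete illustration of the comparison just described, the Python sketch below classifies the difference between a first phoneme sequence and the standard phoneme sequence using multiset counting. The diff_info helper is a hypothetical name, and order-insensitive exact-string comparison is an assumption; the patent does not fix a particular comparison algorithm.

```python
from collections import Counter

def diff_info(first_seq, standard_seq):
    """Classify the phoneme difference: missing standard phonemes and/or redundancy."""
    first, std = Counter(first_seq), Counter(standard_seq)
    missing = list((std - first).elements())    # in the standard sequence, absent from the first
    redundant = list((first - std).elements())  # in the first sequence, beyond the standard counts
    kinds = []
    if missing:
        kinds.append("missing_standard_phonemes")
    if redundant:
        kinds.append("phoneme_redundancy")
    return {"missing": missing, "redundant": redundant, "kinds": kinds}

std = ["ni3", "hao3", "xiao3", "wei1"]
print(diff_info(["xiao3", "wei1"], std))                                # missing ni3, hao3
print(diff_info(["ni3", "hao3", "ni3", "hao3", "xiao3", "wei1"], std))  # redundant ni3, hao3
```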
In step 103, a sequence adjustment mode corresponding to the difference information is determined, and the standard phoneme sequence is adjusted by adopting the sequence adjustment mode to obtain a recognizable phoneme sequence corresponding to the first phoneme sequence.
In practical implementation, the server may determine the sequence adjustment mode corresponding to each type of phoneme difference that the first phoneme sequence exhibits relative to the standard phoneme sequence, wherein the phoneme difference comprises at least one of: missing standard phonemes and phoneme redundancy.
For the determination of the sequence adjustment mode, in some embodiments the server may determine it as follows: when the difference information indicates that the first phoneme sequence has missing standard phonemes relative to the standard phoneme sequence, the sequence adjustment mode corresponding to the difference information is determined to be the sequence supplement mode.
Illustratively, referring to fig. 5, by comparing the phoneme sequence 3 corresponding to number 3 with the standard phoneme sequence 1 corresponding to number 1, it can be determined that the difference information between the two characterizes phoneme sequence 3 as having missing standard phonemes relative to standard phoneme sequence 1, the missing phonemes being those of ni3hao3. In this case, the server may determine the recognizable phoneme sequence corresponding to phoneme sequence 3 in the sequence supplement mode.
Accordingly, in some embodiments, referring to fig. 6, which is a flowchart of determining a recognizable phoneme sequence in the sequence supplement mode provided by an embodiment of the present application, the process by which the server determines a recognizable phoneme sequence in the sequence supplement mode is described in conjunction with the steps shown in fig. 6.
Step 1031a: the server determines the standard phonemes missing from the first phoneme sequence compared to the standard phoneme sequence.
Illustratively, referring to fig. 5, it is determined that the phoneme sequence 3 corresponding to number 3 lacks ni3hao3 compared to the standard phoneme sequence 1 corresponding to number 1.
Step 1032a, based on the missing standard phonemes, determining the phonemes in the standard phoneme sequence other than the missing standard phonemes as candidate standard phonemes.
Following the above example, referring to fig. 5, the other phonemes "xiao 3" and "wei 1" of the standard phoneme sequence 1 corresponding to the number 1 are determined as candidate standard phonemes.
Step 1033a, reconstructing the phoneme sequence based on the candidate standard phonemes to obtain at least one sub-phoneme sequence.
Continuing the above example, the standard phoneme sequence 1 shown as number 1 in fig. 5 can be characterized as start → ni3 → hao3 → xiao3 → wei1 → end; with the candidate phonemes "xiao3" and "wei1", the corresponding sub-phoneme sequences constructed may be start → xiao3 → wei1 → end and start → wei1 → end.
Step 1034a, searching for a sub-phoneme sequence corresponding to the first phoneme sequence among the at least one sub-phoneme sequence as the recognizable phoneme sequence.
Continuing the above example, from the two sub-phoneme sequences start → xiao3 → wei1 → end and start → wei1 → end, start → xiao3 → wei1 → end is determined to be the recognizable phoneme sequence corresponding to the first phoneme sequence.
In practical implementation, when constructing a phoneme sequence, each phoneme sequence may also be represented in a directed graph, for convenience of representation in the directed graph, a start element representing a start point and an end element representing an end point are usually added, and at the same time, characters may be separated from each other by eps.
Illustratively, the sequence supplement mode is described by representing each phoneme sequence as a directed graph. Referring to fig. 7, fig. 7 is an illustration of a phoneme sequence directed graph provided in an embodiment of the present application. The standard phoneme sequence corresponding to the wake-up word K {"ni", "hao", "ha", "fu"} is "ni3hao3ha1fu2", but after the server performs phoneme conversion on the speech to be recognized, the first phoneme sequence obtained may be "hao3ha1fu2" (a standard phoneme is missing, i.e., ni3 is truncated) or "ni3ha1fu2" (a standard phoneme is missing, i.e., hao3 is swallowed). Therefore, paths for such phoneme sequences are additionally added on the basis of the standard phoneme sequences shown in fig. 4: number 1 in the figure shows the paths of the phoneme sequences {ha1fu2} and {ha4fu2} in the directed graph when the phonemes ni3 and hao3 (or hao4) are both missing; number 2 shows the paths of the phoneme sequences {hao3ha1fu2}, {hao3ha4fu2}, {hao4ha1fu2}, and {hao4ha4fu2} in the directed graph in the absence of the phoneme ni3; number 3 shows the paths of the phoneme sequences {ni3hao3ha1fu2}, {ni3hao3ha4fu2}, {ni3ha1fu2}, and {ni3ha4fu2} in the directed graph in the absence of the phoneme hao4.
By adjusting the standard phoneme sequence in the sequence supplement mode, a correct recognizable phoneme sequence can be obtained even when phonemes are lost.
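The sequence supplement mode can be sketched as follows: rebuild candidate sub-phoneme sequences by dropping subsets of the missing standard phonemes, then search them for one matching the first phoneme sequence. The helper names are hypothetical, and plain list equality stands in for the directed-graph path search described above.

```python
from itertools import combinations

def build_sub_sequences(standard_seq, missing):
    """Rebuild candidate paths by deleting subsets of the missing phonemes."""
    subs = []
    for r in range(1, len(missing) + 1):
        for dropped in combinations(missing, r):
            subs.append([p for p in standard_seq if p not in dropped])
    return subs

def find_recognizable(first_seq, standard_seq, missing):
    """Search the sub-phoneme sequences for the one matching the first sequence."""
    for sub in build_sub_sequences(standard_seq, missing):
        if sub == first_seq:
            return sub  # the recognizable phoneme sequence
    return None

std = ["ni3", "hao3", "xiao3", "wei1"]
print(find_recognizable(["xiao3", "wei1"], std, missing=["ni3", "hao3"]))
# -> ['xiao3', 'wei1']
```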
In some embodiments, the server may further determine the sequence adjustment mode corresponding to the difference information as follows: when the difference information indicates that the first phoneme sequence has phoneme redundancy relative to the standard phoneme sequence, the server determines that the sequence adjustment mode corresponding to the difference information is the noise supplement mode.
Illustratively, referring to fig. 5, by comparing the phoneme sequence 2 corresponding to number 2 with the standard phoneme sequence 1 corresponding to number 1, it can be determined that the difference information between the two characterizes phoneme sequence 2 as having phoneme redundancy relative to standard phoneme sequence 1, the redundant phonemes being those of ni3hao3. In this case, the server may determine the recognizable phoneme sequence corresponding to phoneme sequence 2 in the noise supplement mode.
Accordingly, in some embodiments, referring to fig. 8, which is a flowchart of determining a recognizable phoneme sequence in the noise supplement mode provided by an embodiment of the present application, the process is described in conjunction with the steps shown in fig. 8.
In step 1031b, the server determines at least one redundant phoneme of the first phoneme sequence relative to the standard phoneme sequence.
Illustratively, referring to fig. 5, the phoneme difference corresponding to the phoneme sequence { ni3hao3ni3hao3xiao3wei1} shown in the number 2 in the figure is phoneme redundancy, i.e., there is an increase of the phoneme corresponding to ni3hao3 relative to the standard phoneme sequence.
Step 1032b, constructing a noise phoneme corresponding to the at least one redundant phoneme.
In practical implementation, the server can absorb the redundant phonemes by constructing a noise phoneme, which does not correspond to a real pronunciation.
Step 1033b, reconstructing the phoneme sequence based on the noise phoneme and the standard phoneme sequence to obtain a recognizable phoneme sequence corresponding to the first phoneme sequence.
In practical implementation, after determining the noise phoneme, the server reconstructs the phoneme sequence on the basis of the standard phoneme sequence, thereby obtaining the recognizable phoneme sequence corresponding to the first phoneme sequence.
In some embodiments, referring to fig. 9, fig. 9 is a schematic filling flow diagram of noise phonemes provided in an embodiment of the present application, and the steps shown in fig. 9 are combined to explain a process in which a server determines a recognizable phoneme sequence in a noise supplementation manner.
In step 301, the server determines the phoneme filling position of the standard phoneme sequence.
In practical implementation, the noise phoneme may be filled at any position of the standard phoneme sequence; in general, it is filled at the start position of the standard phoneme sequence.
Step 302, based on the phoneme filling position of the standard phoneme sequence, filling noise phonemes to obtain a noise phoneme sequence;
In some embodiments, the server may derive the noise phoneme sequence as follows: the server determines the number of the at least one redundant phoneme, and fills a target number of noise phonemes at the phoneme filling position of the standard phoneme sequence to obtain the noise phoneme sequence, where the target number is equal to the number of redundant phonemes.
In practical implementation, at least one noise phoneme may be filled into the standard phoneme sequence. The noise phonemes and the redundant phonemes may be in a one-to-one relationship: when the number of redundant phonemes is below the number threshold, the standard phoneme sequence may be filled with as many noise phonemes as there are redundant phonemes, and the redundant phonemes are mapped one-to-one onto them.
It should be noted that the noise phonemes and the redundant phonemes may also be in a one-to-many relationship, that is, all redundant phonemes are mapped onto the same noise phoneme.
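The filling step can be sketched as follows; this is a minimal illustration, assuming list-of-string phoneme sequences, and the helper name build_noise_sequence and the "#ABSORB" constant are hypothetical stand-ins for whatever the engine actually uses:

```python
from typing import List

# "#ABSORB" mirrors the noise-phoneme symbol used in the figures; the helper
# name and signature are illustrative, not the patent's actual implementation.
ABSORB = "#ABSORB"

def build_noise_sequence(standard: List[str], num_redundant: int,
                         one_to_one: bool = True) -> List[str]:
    """Fill noise phonemes at the start position of the standard phoneme
    sequence. With one_to_one=True, one noise phoneme is filled per redundant
    phoneme; with one_to_one=False, a single shared noise phoneme absorbs
    every redundant phoneme."""
    count = num_redundant if one_to_one else 1
    return [ABSORB] * count + list(standard)

# Four redundant phonemes (n, i3, h, ao3 from a repeated "ni3hao3") in front
# of the standard sequence of the wake-up word "ni3hao3xiao3wei1":
standard = ["n", "i3", "h", "ao3", "x", "iao3", "w", "ei1"]
print(build_noise_sequence(standard, 4))
# ['#ABSORB', '#ABSORB', '#ABSORB', '#ABSORB', 'n', 'i3', 'h', 'ao3', ...]
```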
Step 303, mapping each phoneme in the first phoneme sequence to a phoneme in the noise phoneme sequence in sequence to obtain an identifiable phoneme sequence corresponding to the first phoneme sequence.
It should be noted that the redundant phonemes in the first phoneme sequence are mapped to the noise phonemes in the noise phoneme sequence.
In practical implementation, when the number of noise phonemes matches the number of redundant phonemes, the phonemes in the first phoneme sequence are mapped in order onto the noise phoneme sequence, finally yielding the recognizable phoneme sequence.
In some embodiments, the server may generate prompt information for prompting a speech recognition failure by: in the process of mapping the redundant phonemes in the first phoneme sequence to the noise phonemes, the server determines the number of mappings onto the noise phoneme; when the number of mappings reaches the times threshold, the server generates and outputs prompt information indicating that speech recognition for the speech to be recognized has failed.
In practical implementation, since the number of redundant phonemes cannot be determined in advance, a single noise phoneme may be set to absorb all redundant phonemes; however, considering the computational cost incurred each time a redundant phoneme is absorbed, the absorption cannot be unlimited. A times threshold is therefore set to represent the maximum number of times a noise phoneme may be mapped during one speech recognition pass. When the number of mappings exceeds the times threshold, the server may generate prompt information indicating that speech recognition for the speech to be recognized has failed.
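A minimal sketch of the mapping loop with a times threshold follows; the names map_with_absorb and TIMES_THRESHOLD are illustrative, and the simple left-to-right matching is an assumption (a real decoder would make this decision inside the Viterbi search):

```python
from typing import List, Optional

ABSORB = "#ABSORB"     # hypothetical noise-phoneme symbol
TIMES_THRESHOLD = 8    # hypothetical maximum mappings onto one noise phoneme

def map_with_absorb(first_seq: List[str],
                    standard: List[str]) -> Optional[List[str]]:
    """Map the first phoneme sequence onto the noise phoneme sequence.

    Phonemes matching the next expected standard phoneme advance through the
    standard sequence; anything else is treated as redundant and absorbed by
    the shared noise phoneme. Returns None once the noise phoneme has been
    mapped more than TIMES_THRESHOLD times, signalling recognition failure."""
    mapped: List[str] = []
    absorbed, i = 0, 0
    for ph in first_seq:
        if i < len(standard) and ph == standard[i]:
            mapped.append(ph)           # regular phoneme: follow the standard path
            i += 1
        else:
            absorbed += 1               # redundant phoneme: absorb it
            if absorbed > TIMES_THRESHOLD:
                return None             # caller emits the failure prompt
            mapped.append(ABSORB)
    return mapped

# "wan4 sui4" is absorbed in front of the wake word "ni3 xiao3 wei1":
print(map_with_absorb(["w", "an4", "s", "ui4", "n", "i3", "x", "iao3", "w", "ei1"],
                      ["n", "i3", "x", "iao3", "w", "ei1"]))
```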
Exemplarily, referring to fig. 10, fig. 10 is a schematic diagram of the noise supplementation manner provided by an embodiment of the present application. For the phoneme difference of the phoneme sequence {ni3hao3ni3hao3xiao3wei1} shown at number 2 in fig. 5, which is phoneme redundancy (i.e., the phonemes corresponding to ni3hao3 are extra relative to the standard phoneme sequence), the noise phoneme "#ABSORB" is filled at the start position (start) of the standard phoneme sequence {ni3hao3xiao3wei1}, and the four redundant phonemes n, i3, h, ao3 are mapped onto "#ABSORB" in turn, which can be understood as the self-loop at "#ABSORB" absorbing four times, thereby obtaining the recognizable phoneme sequence.
In some embodiments, the server may further determine a sequence adjustment manner for the corresponding difference information by: when the difference information represents that the first phoneme sequence has both standard phoneme missing and phoneme redundancy relative to the standard phoneme sequence, the server determines that the sequence adjustment mode corresponding to the difference information is a combination of a sequence supplement mode and a noise supplement mode.
In practical implementation, when the first phoneme sequence has both a missing phoneme and a redundant phoneme with respect to the standard phoneme sequence corresponding to the wakeup word, a method of combining a sequence supplement manner and a noise supplement manner may be used to determine the recognizable phoneme sequence for the first phoneme sequence.
Accordingly, referring to fig. 11, in some embodiments, fig. 11 is a flowchart of a method for determining a recognizable phoneme sequence according to an embodiment of the present application, and the steps shown in fig. 11 are used to illustrate a process of determining a recognizable phoneme sequence by a server in a sequence supplement manner in combination with a noise supplement manner.
In step 1031c, the server determines the missing standard phonemes of the first phoneme sequence compared to the standard phoneme sequence and determines phonemes in the standard phoneme sequence other than the missing standard phonemes as candidate standard phonemes based on the missing standard phonemes.
In practical implementation, for the missing standard phonemes of the first phoneme sequence, the server may determine the candidate standard phonemes according to the aforementioned steps 1031a-1032a shown in fig. 6.
Step 1032c determines at least one redundant phoneme of the first phoneme sequence relative to the standard phoneme sequence and constructs a noise phoneme corresponding to the at least one redundant phoneme.
In practical implementation, for the redundant phonemes in the first phoneme sequence, the server may construct the corresponding noise phonemes according to the aforementioned steps 1031b-1032b shown in fig. 8.
Step 1033c, based on the candidate standard phoneme and the noise phoneme, reconstructing the phoneme sequence to obtain an identifiable phoneme sequence corresponding to the first phoneme sequence.
In practical implementation, after determining the candidate standard phonemes and the noise phonemes for the first phoneme sequence, the server may perform phoneme sequence reconstruction to obtain the recognizable phoneme sequence corresponding to the first phoneme sequence.
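The difference information driving the combined mode can be sketched with a generic sequence diff; using difflib here is an illustrative choice, not the patent's method, and the example sequences mirror the fig. 12 scenario described below:

```python
import difflib
from typing import List, Tuple

def diff_phonemes(first: List[str],
                  standard: List[str]) -> Tuple[List[str], List[str]]:
    """Return (missing standard phonemes, redundant phonemes) between the
    first phoneme sequence and the standard phoneme sequence."""
    missing, redundant = [], []
    sm = difflib.SequenceMatcher(a=first, b=standard, autojunk=False)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op in ("delete", "replace"):
            redundant.extend(first[i1:i2])   # present in speech, absent in standard
        if op in ("insert", "replace"):
            missing.extend(standard[j1:j2])  # expected but swallowed/truncated
    return missing, redundant

# Truncated "ni3" plus extra "jin1 tian1", as in the fig. 12 example:
first    = ["j", "in1", "t", "ian1", "h", "ao3", "h", "a1", "f", "u2"]
standard = ["n", "i3", "h", "ao3", "h", "a1", "f", "u2"]
missing, redundant = diff_phonemes(first, standard)
# missing   -> ["n", "i3"]                (add a path that skips them)
# redundant -> ["j", "in1", "t", "ian1"]  (absorb them with a noise phoneme)
```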
Illustratively, taking a wake-up word phoneme sequence characterized by a directed graph as an example, referring to fig. 12, fig. 12 is a flowchart of another method for determining a recognizable phoneme sequence provided in an embodiment of the present application. The standard phoneme sequence corresponding to the wake-up word K {"you", "good", "ha", "fu"} is "ni3 hao3 ha1 fu2", but after the server performs phoneme conversion on the speech to be recognized, the obtained first phoneme sequence is "jin1 tian1 hao3 ha1 fu2": a standard phoneme is missing (ni3 is truncated) and there are redundant phonemes (jin1 tian1). For the missing standard phoneme, a path corresponding to the phoneme sequence {hao3 ha1 fu2} may be added in the graph (the path shown at number 2 in the figure: start → eps → h → ao3 → eps → h → a1 → eps → f → u2 → end). For the redundant phonemes of "jin1 tian1", the noise phoneme "#ABSORB" shown at number 4 in the figure is filled at the start position of the standard phoneme sequence, yielding the corresponding noise phoneme sequence start → #ABSORB → n → i3 → eps → h → ao3 → eps → h → a1 → eps → f → u2 → end, and the four redundant phonemes j, in1, t, ian1 are mapped onto the noise phoneme in turn, i.e., the noise phoneme is mapped 4 times.
Determining the recognizable phoneme sequence by combining the sequence supplementation manner with the noise supplementation manner adds fault-tolerant processing for the wake-up word and a special absorption node, which can effectively avoid recognition errors caused by truncation, extra input, swallowing, and the like in the wake-up audio.
In step 104, based on the recognizable phoneme sequence, the server performs speech recognition on the second phoneme sequence in the initial phoneme sequence to obtain a speech recognition result corresponding to the speech content.
In some embodiments, the server may determine to perform speech recognition on the second phoneme sequence by: the server cuts the initial phoneme sequence based on the recognizable phoneme sequence to obtain a second phoneme sequence; acquiring a dictionary for phoneme recognition; and performing voice recognition on the second phoneme sequence based on the dictionary to obtain a voice recognition result corresponding to the voice content.
In practical implementation, after determining the recognizable phoneme sequence for the first phoneme sequence, the server removes the recognizable phoneme sequence from the initial phoneme sequence corresponding to the speech to be recognized to obtain the speech content associated with the wake-up word (i.e., the second phoneme sequence), and recognizes that speech content based on a preset phoneme dictionary and a preset language model to obtain the corresponding text content as the speech recognition result.
Exemplarily, for the speech to be recognized "wan4sui4 ni3xiao3wei1 jin1tian1 tian1qi4 ru2he2" ("hurray, hello Xiaowei, how is the weather today"), which contains the wake-up word K "hello, Xiaowei", the recognizable phoneme sequence of the first phoneme sequence "wan4sui4ni3xiao3wei1" corresponding to the wake-up word K is determined to be "start → #ABSORB → n → i3 → x → iao3 → w → ei1 → end". Then, the recognizable phoneme sequence is separated from the initial phoneme sequence, yielding the phoneme sequence that matches the user intention: {start → j → in1 → t → ian1 → t → ian1 → q → i4 → r → u2 → h → e2 → end}.
In some embodiments, the server may determine the second phoneme sequence in the speech to be recognized by: acquiring the start position and the end position of the recognizable phoneme sequence within the initial phoneme sequence; and ignoring the phonemes of the initial phoneme sequence from the start position up to the end position, taking the phoneme sequence composed of the remaining phonemes of the initial phoneme sequence as the second phoneme sequence.
In practical implementation, the ignoring operation for the first phoneme sequence may be performed based on the start position and the end position of the recognizable phoneme sequence within the initial phoneme sequence, that is, the phonemes of the initial phoneme sequence are ignored from the start position up to the end position.
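As a minimal sketch, assuming the start and end positions are known as indices into a list-of-string phoneme sequence (the helper name is hypothetical):

```python
from typing import List, Tuple

def extract_second_sequence(initial: List[str],
                            span: Tuple[int, int]) -> List[str]:
    """Ignore the recognizable (wake-word) phonemes in the initial sequence.

    span is the (start, end) position of the recognizable phoneme sequence
    inside the initial sequence; everything in between is dropped and the
    remaining phonemes form the second phoneme sequence."""
    start, end = span
    return initial[:start] + initial[end:]

# The wake word occupies positions 0..5 of the initial sequence:
initial = ["n", "i3", "x", "iao3", "w", "ei1",
           "j", "in1", "t", "ian1", "q", "i4"]
second = extract_second_sequence(initial, (0, 6))
# second -> ["j", "in1", "t", "ian1", "q", "i4"]
```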
In some embodiments, after obtaining the speech recognition result corresponding to the speech content, the server may further generate and respond to the information by: the server carries out semantic analysis on the voice recognition result to obtain a semantic analysis result; and acquiring response information corresponding to the semantic analysis result, and outputting the response information.
Illustratively, for the speech to be recognized "wan4sui4 ni3xiao3wei1 jin1tian1 tian1qi4 ru2he2" ("hurray, hello Xiaowei, how is the weather today"), which contains the wake-up word K "hello, Xiaowei", the separated phoneme sequence is {start → j → in1 → t → ian1 → t → ian1 → q → i4 → r → u2 → h → e2 → end} (i.e., "how is the weather today"). The server performs semantic analysis on "how is the weather today", obtains the specific weather conditions of today as the response information for the speech to be recognized, and sends the response information to the intelligent speech device so that the device responds.
By applying the embodiment of the present application, when there is a phoneme difference between the first phoneme sequence corresponding to the wake-up word and the standard phoneme sequence, a corresponding sequence adjustment manner can be determined according to the type of the phoneme difference, and the standard phoneme sequence is adjusted in that manner to obtain the recognizable phoneme sequence for the wake-up word. The wake-up word portion can thus be ignored in the speech to be recognized, yielding a phoneme sequence that matches the user intention and, from it, the target recognition result.
In the following, an exemplary application of the embodiments of the present application in a practical application scenario will be described.
In the related art, under the "wake-up-then-recognize" mode of voice interaction, a user first needs to wake up the speech recognition system with a wake-up word, pause silently for a certain duration, and only then say the content to be recognized. This interaction mode is relatively inefficient, and noise interference and the like may cause the recognized speech to be truncated early. The Oneshot recognition technique is an interaction mode that omits the silent waiting period between wake-up and recognition, so that the user can interact by voice more quickly while avoiding part of the noise interference.
Based on the above, the embodiment of the present application provides a speech recognition method based on Oneshot audio recognition with a Transducer acoustic model. With this method, when Oneshot audio is recognized, the recognition result of the wake-up word can be completely eliminated, so that subsequent speech recognition errors caused by truncation, extra input, swallowing, and the like of the wake-up word audio can be effectively avoided.
First, the Oneshot voice interaction mode is explained. Oneshot recognition is a more convenient variant of "wake-up-then-recognize" voice interaction: during voice input, wake-up is performed first; then, according to the wake-up result, the wake-up word audio and the subsequently input audio are spliced, and the spliced Oneshot audio is recognized by the speech recognition engine; finally, the wake-up word portion is removed from the recognition result according to the pronunciation of the wake-up word, yielding the Oneshot recognition result. Referring to fig. 13, fig. 13 is a schematic diagram of the Oneshot recognition method provided in the embodiment of the present application: after the wake-up engine of the intelligent speech device detects the wake-up word ("jingle"), the cached wake-up audio is first input into the recognition engine, and the speech continues to be streamed in; then, the recognition engine recognizes the wake-up word audio and the user speech audio as a whole; finally, the wake-up word recognition result is removed and the speech recognition result is output ("navigate to the library").
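The overall Oneshot flow can be sketched as follows; recognize and strip_wake are placeholder callables standing in for the recognition engine and the wake-word removal step, not real APIs:

```python
from typing import Callable, List

def oneshot_recognize(wake_buffer: List[bytes], stream: List[bytes],
                      recognize: Callable[[List[bytes]], str],
                      strip_wake: Callable[[str], str]) -> str:
    """Splice the cached wake-word audio with the streamed audio, recognize
    the whole utterance in one pass, then strip the wake-word portion of the
    transcript according to the wake word's pronunciation."""
    audio = wake_buffer + stream    # wake audio first, user speech after it
    transcript = recognize(audio)   # one recognition pass over the splice
    return strip_wake(transcript)   # remove the wake-word part of the result
```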
In practical implementation, relying on the Keyword Spotting (KWS) result, referring to fig. 14 (fig. 14 is a speech recognition decoding diagram provided in this embodiment of the present application), a decoding graph W for the wake-up word may be constructed before speech recognition. The wake-up word decoding graph W shown at number 1 in the figure and the speech recognition decoding graph S shown at number 2 are searched sequentially in a state-distinguishing manner, corresponding to the wake-up audio and the recognition audio respectively. As shown in the figure, after the speech to be recognized is processed by the acoustic model, phonemes or HMM states of phonemes are generated, and driven by the acoustic output, search decoding can be performed in the wake-up word decoding graph W and the speech recognition decoding graph S. In fig. 14, each circle in the decoding graph represents a State, and each arc represents a jump, called a state transition arc (Arc), which may consist of elements such as an input label (ilabel), an output label (olabel), a transition weight (weight), and a next state (nextstate). Taking the state transition arc shown at number 3 in the figure as an example, the ilabel on the arc is d, the olabel is also d, and the weight is 1, i.e., input d yields output d with weight 1; the arc information can be represented in the form d:d/1. When the input label and the output label are the same, the weight can be omitted and the elements on the arc may be reduced to just d. It should be noted that the figure is the decoding graph (including the wake-up word decoding graph W and the speech recognition decoding graph S) corresponding to the illustrated optimal path, and the arc between State8 and State9 (shown at number 4 in the figure) indicates that the search of W is completed and the search of S begins.
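The arc structure described above can be sketched as a small data type; the field names follow the ilabel/olabel/weight/nextstate terminology of the figure, while the class itself is an illustrative stand-in for the WFST library actually used:

```python
from dataclasses import dataclass

EPS = "eps"  # empty label, as inserted between words in the decoding graphs

@dataclass
class Arc:
    """A state-transition arc in the decoding graph, mirroring the
    ilabel:olabel/weight notation used in fig. 14."""
    ilabel: str      # input label consumed from the acoustic output
    olabel: str      # output label emitted into the recognition result
    weight: float    # transition weight (omitted in the figure when labels match)
    nextstate: int   # destination state

# The d:d/1 arc from the figure:
arc = Arc(ilabel="d", olabel="d", weight=1.0, nextstate=1)
```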
In practical application, when the wake-up audio is cut, problems such as truncation, extra input, and swallowing of the wake-up audio easily arise under the influence of noise, speech speed, and the like. The impact in a "speech frame synchronous decoding" system is not obvious, because the number of speech frames is usually much larger than the number of phonemes of the wake-up word.
Illustratively, referring to fig. 15, fig. 15 is a schematic diagram of speech frames provided in the present application. In a speech frame synchronous decoding system, the speech is divided into frames of the same duration, each frame corresponding to one acoustic output; in a phoneme synchronous decoding system, the acoustic model outputs phonemes at acoustically determined positions, and the other frames are padded with blank (blk). In a "speech frame synchronous decoding" system, the pronunciation (phoneme) sequence of the acoustic output is redundant, so even if there is garbage acoustic input caused by truncation, swallowing, extra input, and the like, W and S under this system can absorb or expand it through the self-loops in the decoding graph. Referring to fig. 16, fig. 16 is a self-loop absorption diagram provided by the embodiment of the application; in fig. 16, whether the acoustic system inputs "sil d d d d ang1 ang1 ang1 sil" or "sil d ang1 ang1 ang1 sil", decoding can proceed correctly, i.e., even if truncation, swallowing, or extra input occurs in the acoustic output, it can be absorbed or expanded by the decoding graph within a certain error range.
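The absorption effect of such self-loops can be illustrated with a toy collapse function; this is a simplification of frame-synchronous decoding, not the decoder itself:

```python
from typing import List

def collapse_self_loops(frames: List[str]) -> List[str]:
    """Model the self-loop absorption of a frame-synchronous decoder: a run
    of identical acoustic outputs is consumed by one state looping on itself,
    so both frame sequences below collapse to the same path
    sil -> d -> ang1 -> sil."""
    path: List[str] = []
    for sym in frames:
        if not path or path[-1] != sym:  # a repeat is absorbed by the self-loop
            path.append(sym)
    return path

assert collapse_self_loops("sil d d d d ang1 ang1 ang1 sil".split()) == \
       collapse_self_loops("sil d ang1 ang1 ang1 sil".split())
```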
In practical implementation, in an end-to-end "phoneme synchronous decoding" system, if a phoneme is truncated, inserted, or swallowed, an obvious garbage-phoneme insertion or normal-phoneme deletion is introduced, so that the acoustic output may fail to perform a correct Viterbi search in the defined wake-up word decoding graph W, leading to incorrect timing or an inability to jump from the wake-up word decoding graph W to the speech recognition decoding graph S. Referring to fig. 17, fig. 17 is a schematic diagram of phoneme decoding provided in the embodiment of the present application: the acoustic output "d ing1 d ang1" can be searched correctly, but when the acoustic input contains extra phonemes ("n i3 h ao3 d ing1 d ang1") or missing phonemes ("d ang1"), the search breaks down. In a Transducer-based (end-to-end) acoustic model, recognition errors caused by truncation, extra input, swallowing, and the like of the wake-up audio can be avoided by adding fault-tolerant processing for the wake-up word and adding a special absorption node.
In some embodiments, referring to fig. 18, fig. 18 is a flowchart of speech recognition provided by an embodiment of the present application, described in conjunction with the steps shown in fig. 18. Step 401 is executed first: the user wakes up the corresponding intelligent speech device. The wake-up engine then performs step 402: detecting whether wake-up succeeded; if not, the user may repeat step 401; if wake-up succeeded, the wake-up engine executes step 403, cutting the cached wake-up audio, and, based on the cut wake-up audio, executes step 404: constructing the wake-up word WFST. After construction is complete, the wake-up engine executes step 405: inputting the cached wake-up audio into the recognition engine, and then proceeds to step 406: inputting the speech audio content into the recognition engine. It continues with step 407: detecting the end of the speech to be recognized, and finally executes step 408: outputting the final result.
First, the manner of cutting the wake-up word audio is explained. The wake-up engine needs to cache the most recent audio during the wake-up process, generally about 2 s of audio. After the wake-up engine is triggered, it outputs the wake-up word and a wake-up alignment result, where the wake-up alignment refers to the position of the wake-up word audio within the whole input audio. The length L of the wake-up audio can be calculated from the wake-up result, and the wake-up word audio can be obtained by backtracking by L from the wake-up point within the cached audio. Illustratively, fig. 5 illustrates the voice alignment information: when Lb > L is calculated from the alignment information, extra input of wake-up word audio has occurred; when La < L is calculated from the alignment information, truncation of the wake-up word audio has occurred. In order to perform fault-tolerant processing for these situations, which arise for various reasons such as background noise, the wake-up alignment algorithm, and insufficient cache, so as to obtain a correct L, the specific fault-tolerant processing is as described below.
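A minimal sketch of the backtracking cut, assuming a flat byte buffer and millisecond units (all parameter names are illustrative; a real engine would operate on its own ring buffer):

```python
def cut_wake_audio(buffer: bytes, wake_point_ms: int, wake_len_ms: int,
                   bytes_per_ms: int) -> bytes:
    """Backtrack the wake-word length L from the wake-up point inside the
    cached audio (roughly the last 2 s) to recover the wake-word audio."""
    end = wake_point_ms * bytes_per_ms
    start = max(0, end - wake_len_ms * bytes_per_ms)  # clamp if the cache is short
    return buffer[start:end]

# 16 kHz, 16-bit mono audio -> 32 bytes per millisecond:
cache = bytes(2000 * 32)  # ~2 s of cached audio
wake = cut_wake_audio(cache, wake_point_ms=1800, wake_len_ms=600, bytes_per_ms=32)
```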
Next, the construction of the wake-up word WFST is described. The construction of the wake-up word WFST (W) is divided into basic steps such as word segmentation, pronunciation conversion, polyphone compatibility, and WFST construction. Let the wake-up word K be composed of n words {K1, K2, ..., Kn}; in the simplest case, a word can be directly defined as a Chinese character, and n can be regarded as the length of K. A given Kn possesses pronunciations {A1, A2, ..., Am}; if m > 1, Kn is a polyphone. Each pronunciation Am can be further divided into a pronunciation sequence {P1, P2, ..., Pj}, where each P is a phone. In general, a Chinese character is composed of two phones; for some single-phone characters, two phones can be formed by padding with a custom prefix (e.g., ni_e, ni_o, ni_a). For the wake-up word K, mK1 × mK2 × ... × mKn paths may be constructed in the wake-up word WFST, where mKn represents the number of pronunciations of the n-th word. Each path may be scored using the following formula:
R_m = -\sum_{j=1}^{J} \mathrm{Loglike}(P_j)
in the above formula, the language score in W is usually set to 0 so as to facilitate forced matching of the wake-up word; Pj denotes the j-th phone of the m-th pronunciation of Kn, and the Loglike function obtains the posterior probability of a given phone in a given audio frame. The path whose Rm value is minimal is the optimal path.
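Under this description, the path score can be sketched as the accumulated negative log-likelihood (the function names are illustrative, and the reduction to a plain sum assumes the language score is fixed at 0 as stated above):

```python
from typing import Callable, List, Sequence

def path_score(phones: Sequence[str], frames: Sequence[int],
               loglike: Callable[[str, int], float]) -> float:
    """Score one pronunciation path of the wake word: with the language score
    in W fixed at 0, the path cost Rm reduces to the accumulated negative
    acoustic log-likelihood of its phones."""
    return sum(-loglike(p, t) for p, t in zip(phones, frames))

def best_pronunciation(paths: List[List[str]], frames: Sequence[int],
                       loglike: Callable[[str, int], float]) -> List[str]:
    """Pick the pronunciation variant (e.g. of a polyphone) with minimal Rm."""
    return min(paths, key=lambda ph: path_score(ph, frames, loglike))
```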
By way of example, referring to fig. 4, take the wake-up word "hello haver" as an illustration. K is composed of four words {"you", "good", "ha", "fu"}, so n is 4, where "good" and "ha" are polyphones, distinguished as "hao3"/"hao4" and "ha1"/"ha4" respectively. Four paths can therefore be composed: "ni3 hao3 ha1 fu2", "ni3 hao4 ha1 fu2", "ni3 hao3 ha4 fu2", and "ni3 hao4 ha4 fu2". The text is then converted into pronunciation sequences and composed into a decoding path diagram, in which the circles represent phonemes. For ease of presentation and understanding, empty nodes (eps) are inserted between words.
When the phonemes output by the acoustic system conform to the standard structure described above, the minimal Rm, i.e., the optimal path, can be obtained, for example for audio with the acoustic input "n i3 h ao3 h a1 f u2". However, if swallowing or truncation occurs, the Transducer-based acoustic model cannot output a complete phoneme sequence, yielding for example "h ao3 h a1 f u2" (truncation) or "n i3 h a1 f u2" (swallowing). The solution for swallowed and truncated wake-up word pronunciation sequences is therefore as follows: when the wake-up word WFST is constructed, paths are additionally added on the basis of the decoding graph corresponding to the existing wake-up word K. For a four-character wake-up word (n = 4), the pronunciations that are easily swallowed are checked, and paths are constructed for the last three characters and the last two characters; that is, for K, paths such as {K1, K3, K4}, {K3, K4}, and {K2, K3, K4} are additionally established. For example, for K {"you", "good", "ha", "fu"}, the constructed additional paths may refer to the path information shown in fig. 7. This gives W the ability to skip frames, so that even if acoustic frames are missing, the correct path can still be searched.
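The additional-path construction can be sketched as follows for the four-character case described above; the exact set of extra paths follows the {K1, K3, K4}, {K3, K4}, {K2, K3, K4} example, and the helper name is hypothetical:

```python
from typing import List

def fault_tolerant_paths(chars: List[str]) -> List[List[str]]:
    """Besides the full wake word, add paths that drop leading characters and
    a path that skips the easily swallowed second character, e.g. for
    K = [K1, K2, K3, K4]: [K2, K3, K4], [K3, K4], and [K1, K3, K4]."""
    paths = [chars[:]]                        # full path K1..Kn
    for i in range(1, len(chars) - 1):        # suffix paths: drop leading chars
        paths.append(chars[i:])
    if len(chars) >= 3:
        paths.append([chars[0]] + chars[2:])  # skip the second character
    return paths

# For the four-character wake word this yields
# [you good ha fu], [good ha fu], [ha fu], [you ha fu]:
print(fault_tolerant_paths(["you", "good", "ha", "fu"]))
```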
Thirdly, the fault-tolerant processing for extra input in the wake-up word audio is explained. If the alignment point deviates, extra audio may be included (e.g., the Lb case), and that extra audio is likely to contain human voice. For this extra-input case, the solution is as shown in fig. 12: at the beginning of the wake-up word WFST, a self-loop on the absorption symbol "#ABSORB" is added, and each traversal adds a certain cost w to the average cost ave_cost; when the cost exceeds the cost threshold, the self-loop can be exited, preventing the search from being trapped in it. In fig. 12, "#ABSORB" is not a true pronunciation and therefore has no corresponding acoustic score; in the Viterbi search, an average acoustic score, namely the average of the acoustic scores of all nodes in the current State, may be forcibly assigned to "#ABSORB", and its value range is limited to be larger than a certain threshold, so that decoding is prevented from falling into the "#ABSORB" self-loop.
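One traversal of the "#ABSORB" self-loop can be sketched as below; the forced average score and the fixed per-traversal cost w follow the description above, while the function itself is an illustrative fragment rather than a full Viterbi implementation:

```python
from typing import List, Tuple

def absorb_step(node_scores: List[float], ave_cost: float,
                w: float, cost_threshold: float) -> Tuple[float, float, bool]:
    """One traversal of the "#ABSORB" self-loop during the Viterbi search.

    "#ABSORB" is not a true pronunciation and has no acoustic score of its
    own, so it is forced to the average of all node acoustic scores in the
    current State; each traversal also adds a fixed cost w, and once the
    accumulated cost exceeds the threshold the loop must be exited so the
    search cannot be trapped in it."""
    forced_score = sum(node_scores) / len(node_scores)  # forced average score
    new_cost = ave_cost + w                             # accumulate loop cost
    stay_in_loop = new_cost <= cost_threshold           # exit once exceeded
    return forced_score, new_cost, stay_in_loop
```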
Next, the continuous recognition of the wake-up word and the subsequent speech is explained. After the acoustic output of the Transducer-based acoustic model is obtained, a Viterbi search is performed in the constructed WFST decoding graph; the search starts from state S.NumStates, i.e., after all states of W have been shifted into S. The Viterbi search forms multiple paths according to the pronunciation sequence output by the acoustic model and the state transitions in W, and when a path reaches the Final node (end point) of the wake-up word graph W, an additional path is constructed to the state S.StartState, thereby realizing continuous recognition of the wake-up word and the subsequent speech.
Finally, the separation of the wake-up word from the speech to be recognized is explained. In the constructed wake-up word WFST, the output labels of all nodes are the empty node eps, which means that when the Viterbi search runs through W, only input labels are consumed and no output labels are produced, i.e., the recognition result for the wake-up word is empty (the wake-up word yields no output). Therefore, when the output labels along the finally obtained search path are collected, they cannot contain the content of the wake-up word, achieving the separation of the wake-up word. Meanwhile, when the input labels of the final path are counted, the corresponding input labels can be removed according to the grammar used to construct W, so that the phone sequence of the user's speech is obtained.
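The separation step can be sketched by filtering eps output labels along the final path; the Arc stand-in and the example path are hypothetical:

```python
from collections import namedtuple
from typing import List

EPS = "eps"  # empty output label used on every arc inside the wake-word graph W

Arc = namedtuple("Arc", "ilabel olabel")  # minimal stand-in for a decoding arc

def collect_transcript(path_arcs: List[Arc]) -> List[str]:
    """Collect the output labels along the final search path. Every arc in W
    outputs eps, so the wake word contributes nothing to the transcript and
    is thereby separated from the recognition result."""
    return [arc.olabel for arc in path_arcs if arc.olabel != EPS]

path = [Arc("n", EPS), Arc("i3", EPS),        # wake-word arcs: output eps
        Arc("j", "j"), Arc("in1", "in1")]     # user-speech arcs: real labels
print(collect_transcript(path))               # ['j', 'in1']
```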
In practical applications, the performance indicators of speech recognition may include the Insertion Error Rate (IER), the Deletion Error Rate (DER), the Substitution Error Rate (SER), and the Word Error Rate (WER). Referring to table 1, table 1 is a table of test results for the speech recognition method provided in the embodiment of the present application; the relevant test results are as follows:
Scheme            IER     DER     SER     WER
Other schemes     1.24%   0.42%   2.50%   4.19%
This application  0.21%   0.52%   2.53%   3.24%

TABLE 1
As can be seen from table 1, the speech recognition method provided in the embodiment of the present application, especially when applied to the voice wake-up function, can effectively reduce the Insertion Error Rate (IER) of speech recognition.
The embodiment of the present application can be used in vehicle-mounted offline and online Oneshot speech recognition: the vehicle-mounted system is first woken up by the wake-up model, the wake-up word audio is then cut after alignment, and the cut audio together with the subsequent speech is input into the speech recognition module. In this way, after the user wakes up the voice system, no pause is needed, which reduces interaction time and improves interaction efficiency. In addition, the fault-tolerant processing and the special absorption node provided by the embodiment of the present application can effectively prevent alignment problems in the wake-up audio from affecting subsequent speech recognition, thereby improving the understanding of the user's semantics.
It is understood that, when the embodiments of the present application are applied to specific products or technologies, data related to user information and the like require the user's permission or consent, and the collection, use, and processing of the related data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
Continuing with the exemplary structure of the speech recognition device 555 provided by the embodiments of the present application as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the speech recognition device 555 in the memory 550 may include:
an obtaining module 5551, configured to obtain a speech to be recognized, and perform phoneme conversion on the speech to be recognized to obtain an initial phoneme sequence, where the initial phoneme sequence includes: a first phoneme sequence corresponding to a wakeup word and a second phoneme sequence corresponding to a voice content associated with the wakeup word;
a determining module 5552, configured to obtain a standard phoneme sequence corresponding to the first phoneme sequence, and determine difference information between the first phoneme sequence and the standard phoneme sequence;
an adjusting module 5553, configured to determine a sequence adjusting manner corresponding to the difference information, and adjust the standard phoneme sequence by using the sequence adjusting manner to obtain an identifiable phoneme sequence corresponding to the first phoneme sequence;
a recognition module 5554, configured to perform speech recognition on the second phoneme sequence in the initial phoneme sequence based on the recognizable phoneme sequence, so as to obtain a speech recognition result corresponding to the speech content.
In some embodiments, the determining module is further configured to compare the phonemes in the first phoneme sequence with the phonemes in the standard phoneme sequence, respectively, to determine the phoneme difference of the first phoneme sequence compared with the standard phoneme sequence; wherein the phoneme difference comprises at least one of: loss of standard phonemes, phoneme redundancy; and taking the phoneme difference of the first phoneme sequence compared with the standard phoneme sequence as difference information between the first phoneme sequence and the standard phoneme sequence.
In some embodiments, the adjusting module is further configured to determine that a sequence adjusting manner corresponding to the difference information is a sequence supplementing manner when the difference information indicates that the first phoneme sequence has standard phonemes missing with respect to the standard phoneme sequence; correspondingly, the adjusting module is further configured to determine the standard phonemes missing from the first phoneme sequence compared with the standard phoneme sequence, and determine, based on the missing standard phonemes, the phonemes in the standard phoneme sequence other than the missing standard phonemes as candidate standard phonemes; reconstruct a phoneme sequence based on the candidate standard phonemes to obtain at least one sub-phoneme sequence; and search the sub-phoneme sequence corresponding to the first phoneme sequence from the at least one sub-phoneme sequence as the recognizable phoneme sequence.
In some embodiments, the adjusting module is further configured to determine that a sequence adjusting manner corresponding to the difference information is a noise supplementing manner when the difference information indicates that the first phoneme sequence has phoneme redundancy relative to the standard phoneme sequence; correspondingly, the adjusting module is further configured to determine at least one redundant phoneme of the first phoneme sequence relative to the standard phoneme sequence; constructing a noise phoneme corresponding to the at least one redundant phoneme; and reconstructing a phoneme sequence based on the noise phoneme and the standard phoneme sequence to obtain an identifiable phoneme sequence corresponding to the first phoneme sequence.
In some embodiments, the adjusting module is further configured to determine a phoneme filling position of the standard phoneme sequence; fill the noise phoneme based on the phoneme filling position of the standard phoneme sequence to obtain a noise phoneme sequence; and map each phoneme in the first phoneme sequence in order to a phoneme in the noise phoneme sequence to obtain a recognizable phoneme sequence corresponding to the first phoneme sequence, wherein the redundant phonemes in the first phoneme sequence are mapped to the noise phonemes in the noise phoneme sequence.
In some embodiments, the adjusting module is further configured to determine a number of the at least one redundant phoneme; and filling the noise phonemes with the number at the phoneme filling positions of the standard phoneme sequence to obtain a noise phoneme sequence.
In some embodiments, the adjusting module is further configured to determine a number of mappings for the noise phoneme in the process of mapping the redundant phoneme in the first phoneme sequence to the noise phoneme; and when the mapping times reach a time threshold value, generating and outputting prompt information, wherein the prompt information is used for prompting that the voice recognition aiming at the voice to be recognized fails.
In some embodiments, the adjusting module is further configured to determine that a sequence adjusting manner corresponding to the difference information is a combination of a sequence supplementing manner and a noise supplementing manner when the difference information indicates that the first phoneme sequence has both missing standard phonemes and redundant phonemes with respect to the standard phoneme sequence; correspondingly, the adjusting module is further configured to determine the standard phonemes missing from the first phoneme sequence compared with the standard phoneme sequence, and determine, based on the missing standard phonemes, the phonemes in the standard phoneme sequence other than the missing standard phonemes as candidate standard phonemes; determine at least one redundant phoneme of the first phoneme sequence relative to the standard phoneme sequence and construct a noise phoneme corresponding to the at least one redundant phoneme; and reconstruct a phoneme sequence based on the candidate standard phonemes and the noise phonemes to obtain a recognizable phoneme sequence corresponding to the first phoneme sequence.
In some embodiments, the determining module is further configured to perform word segmentation processing on the wakeup word by taking a character as a unit to obtain at least two characters; respectively determining pronunciations corresponding to the characters; performing phoneme conversion on the characters based on the pronunciations corresponding to the characters to obtain phoneme sequences corresponding to the characters; and determining a standard phoneme sequence corresponding to the first phoneme sequence based on the phoneme sequence corresponding to each character.
In some embodiments, the determining module is further configured to perform phoneme conversion on the characters based on pronunciations corresponding to the characters, so as to obtain an intermediate phoneme sequence corresponding to the characters; when the number of the phonemes in the intermediate phoneme sequence is one, splicing the phonemes in the intermediate phoneme sequence with a preset modified phoneme to obtain a corresponding target phoneme; and replacing the phoneme in the intermediate phoneme sequence with the target phoneme to obtain a phoneme sequence corresponding to the character.
In some embodiments, the recognition module is further configured to cut the initial phoneme sequence based on the recognizable phoneme sequence to obtain the second phoneme sequence; acquiring a dictionary for phoneme recognition; and performing voice recognition on the second phoneme sequence based on the dictionary to obtain a voice recognition result corresponding to the voice content.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the speech recognition method described in the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium having stored therein executable instructions that, when executed by a processor, cause the processor to perform a speech recognition method provided by embodiments of the present application, for example, the speech recognition method as shown in fig. 3.
In some embodiments, the computer-readable storage medium may be memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash, magnetic surface memory, optical disk, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may correspond, but do not necessarily have to correspond, to files in a file system, and may be stored in a portion of a file that holds other programs or data, such as in one or more scripts in a hypertext Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.
In summary, with the embodiment of the present application, after the user wakes up the voice system, no pause is needed, which reduces interaction time and improves interaction efficiency. In addition, the fault-tolerant processing and the special absorption node provided by the embodiment of the present application can effectively prevent alignment problems in the wake-up audio from affecting subsequent speech recognition, thereby improving the understanding of the user's semantics.
The above description is only an example of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, and improvement made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (16)

1. A method of speech recognition, the method comprising:
acquiring a voice to be recognized, and performing phoneme conversion on the voice to be recognized to obtain an initial phoneme sequence, wherein the initial phoneme sequence comprises: a first phoneme sequence corresponding to a wakeup word and a second phoneme sequence corresponding to a voice content associated with the wakeup word;
acquiring a standard phoneme sequence corresponding to the first phoneme sequence, and determining difference information between the first phoneme sequence and the standard phoneme sequence;
determining a sequence adjustment mode corresponding to the difference information, and adjusting the standard phoneme sequence by adopting the sequence adjustment mode to obtain an identifiable phoneme sequence corresponding to the first phoneme sequence;
and performing voice recognition on the second phoneme sequence in the initial phoneme sequence based on the recognizable phoneme sequence to obtain a voice recognition result corresponding to the voice content.
2. The method of claim 1, wherein said determining difference information between said first phoneme sequence and said standard phoneme sequence comprises:
comparing the phonemes in the first phoneme sequence with the phonemes in the standard phoneme sequence respectively to determine the phoneme difference of the first phoneme sequence compared with the standard phoneme sequence;
wherein the phoneme difference comprises at least one of: loss of standard phonemes, phoneme redundancy;
and taking the phoneme difference of the first phoneme sequence compared with the standard phoneme sequence as difference information between the first phoneme sequence and the standard phoneme sequence.
3. The method as claimed in claim 1, wherein the determining the sequence adjustment manner corresponding to the difference information comprises:
when the difference information represents that the first phoneme sequence has standard phoneme missing relative to the standard phoneme sequence, determining a sequence adjustment mode corresponding to the difference information as a sequence supplement mode;
the adjusting the standard phoneme sequence by using the sequence adjusting method to obtain an identifiable phoneme sequence corresponding to the first phoneme sequence includes:
determining the standard phonemes missing from the first phoneme sequence compared with the standard phoneme sequence, and determining, based on the missing standard phonemes, the phonemes in the standard phoneme sequence other than the missing standard phonemes as candidate standard phonemes;
reconstructing a phoneme sequence based on the candidate standard phonemes to obtain at least one sub-phoneme sequence;
searching the sub-phoneme sequence corresponding to the first phoneme sequence from the at least one sub-phoneme sequence as the recognizable phoneme sequence.
4. The method of claim 1, wherein the determining the sequence adjustment corresponding to the difference information comprises:
when the difference information represents that the first phoneme sequence has phoneme redundancy relative to the standard phoneme sequence, determining that a sequence adjustment mode corresponding to the difference information is a noise supplement mode;
the adjusting the standard phoneme sequence by using the sequence adjusting method to obtain an identifiable phoneme sequence corresponding to the first phoneme sequence includes:
determining at least one redundant phoneme of the first phoneme sequence relative to the standard phoneme sequence;
constructing a noise phoneme corresponding to the at least one redundant phoneme;
and reconstructing a phoneme sequence based on the noise phoneme and the standard phoneme sequence to obtain an identifiable phoneme sequence corresponding to the first phoneme sequence.
5. The method of claim 4 wherein said performing phoneme sequence reconstruction based on said noise phonemes and said standard phoneme sequence to obtain an identifiable phoneme sequence corresponding to said first phoneme sequence comprises:
determining phoneme filling positions of the standard phoneme sequence;
filling the noise phonemes based on the phoneme filling positions of the standard phoneme sequence to obtain a noise phoneme sequence;
mapping each phoneme in the first phoneme sequence to a phoneme in the noise phoneme sequence in sequence to obtain an identifiable phoneme sequence corresponding to the first phoneme sequence;
wherein the redundant phone in the first phone sequence is used for mapping to a noise phone in the noise phone sequence.
6. The method of claim 5 wherein filling the noise phonemes based on the phoneme filling positions of the sequence of standard phonemes to obtain a sequence of noise phonemes comprises:
determining a number of the at least one redundant phoneme;
and filling the noise phonemes with the number at the phoneme filling positions of the standard phoneme sequence to obtain a noise phoneme sequence.
7. The method of claim 5, wherein when the number of redundant phonemes is at least two, the method further comprises:
determining a number of mappings for the noise phoneme in a process of mapping the redundant phoneme in the first phoneme sequence to the noise phoneme;
and when the mapping times reach a time threshold value, generating and outputting prompt information, wherein the prompt information is used for prompting that the voice recognition aiming at the voice to be recognized fails.
8. The method of claim 1, wherein the determining the sequence adjustment corresponding to the difference information comprises:
when the difference information represents that the first phoneme sequence has both standard phoneme deletion and phoneme redundancy relative to the standard phoneme sequence, determining that a sequence adjustment mode corresponding to the difference information is a combination of a sequence supplement mode and a noise supplement mode;
the adjusting the standard phoneme sequence by using the sequence adjusting method to obtain an identifiable phoneme sequence corresponding to the first phoneme sequence includes:
determining the standard phonemes missing from the first phoneme sequence compared with the standard phoneme sequence, and determining, based on the missing standard phonemes, the phonemes in the standard phoneme sequence other than the missing standard phonemes as candidate standard phonemes;
determining at least one redundant phoneme of the first phoneme sequence relative to the standard phoneme sequence and constructing a noise phoneme corresponding to the at least one redundant phoneme;
and reconstructing a phoneme sequence based on the candidate standard phonemes and the noise phonemes to obtain an identifiable phoneme sequence corresponding to the first phoneme sequence.
9. The method of claim 1 wherein said obtaining a standard phoneme sequence corresponding to said first phoneme sequence comprises:
taking the characters as units, and performing word segmentation processing on the awakening words to obtain at least two characters;
respectively determining pronunciations corresponding to the characters;
performing phoneme conversion on the characters based on the pronunciations corresponding to the characters to obtain phoneme sequences corresponding to the characters;
and determining a standard phoneme sequence corresponding to the first phoneme sequence based on the phoneme sequence corresponding to each character.
10. The method of claim 9, wherein the performing phoneme conversion on the character based on the pronunciation corresponding to the character to obtain a phoneme sequence corresponding to the character comprises:
performing phoneme conversion on the characters based on pronunciations corresponding to the characters to obtain a middle phoneme sequence corresponding to the characters;
when the number of the phonemes in the intermediate phoneme sequence is one, splicing the phonemes in the intermediate phoneme sequence with a preset modified phoneme to obtain a corresponding target phoneme;
and replacing the phoneme in the intermediate phoneme sequence with the target phoneme to obtain a phoneme sequence corresponding to the character.
11. The method of claim 1, wherein performing speech recognition on the second phoneme sequence in the initial phoneme sequence based on the recognizable phoneme sequence to obtain a speech recognition result corresponding to the speech content comprises:
cutting the initial phoneme sequence based on the recognizable phoneme sequence to obtain a second phoneme sequence;
acquiring a dictionary for phoneme recognition;
and performing voice recognition on the second phoneme sequence based on the dictionary to obtain a voice recognition result corresponding to the voice content.
12. The method of claim 1, wherein after obtaining the speech recognition result corresponding to the speech content, the method further comprises:
performing semantic analysis on the voice recognition result to obtain a semantic analysis result;
and acquiring response information corresponding to the semantic analysis result, and outputting the response information.
13. A speech recognition apparatus, characterized in that the apparatus comprises:
an obtaining module, configured to obtain a speech to be recognized, and perform phoneme conversion on the speech to be recognized to obtain an initial phoneme sequence, where the initial phoneme sequence includes: a first phoneme sequence corresponding to a wakeup word and a second phoneme sequence corresponding to a voice content associated with the wakeup word;
the determining module is used for acquiring a standard phoneme sequence corresponding to the first phoneme sequence and determining difference information between the first phoneme sequence and the standard phoneme sequence;
the adjusting module is used for determining a sequence adjusting mode corresponding to the difference information, and adjusting the standard phoneme sequence by adopting the sequence adjusting mode to obtain an identifiable phoneme sequence corresponding to the first phoneme sequence;
and the recognition module is used for carrying out voice recognition on the second phoneme sequence in the initial phoneme sequence based on the recognizable phoneme sequence to obtain a voice recognition result corresponding to the voice content.
14. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for implementing the speech recognition method of any one of claims 1 to 12 when executing executable instructions stored in the memory.
15. A computer-readable storage medium storing executable instructions, wherein the executable instructions, when executed by a processor, implement the speech recognition method of any one of claims 1 to 12.
16. A computer program product comprising a computer program or instructions, characterized in that the computer program or instructions, when executed by a processor, implement the speech recognition method of any of claims 1 to 12.
CN202210209960.3A 2022-03-04 2022-03-04 Speech recognition method, device, equipment and computer readable storage medium Pending CN114596840A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210209960.3A CN114596840A (en) 2022-03-04 2022-03-04 Speech recognition method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114596840A true CN114596840A (en) 2022-06-07

Family

ID=81807741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210209960.3A Pending CN114596840A (en) 2022-03-04 2022-03-04 Speech recognition method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114596840A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110218805A1 (en) * 2010-03-04 2011-09-08 Fujitsu Limited Spoken term detection apparatus, method, program, and storage medium
JP2017044901A (en) * 2015-08-27 2017-03-02 日本放送協会 Sound production sequence extension device and program thereof
CN112151015A (en) * 2020-09-03 2020-12-29 腾讯科技(深圳)有限公司 Keyword detection method and device, electronic equipment and storage medium
CN112489626A (en) * 2020-11-18 2021-03-12 华为技术有限公司 Information identification method and device and storage medium
CN112509568A (en) * 2020-11-26 2021-03-16 北京华捷艾米科技有限公司 Voice awakening method and device
CN112820281A (en) * 2020-12-31 2021-05-18 北京声智科技有限公司 Voice recognition method, device and equipment
CN113870837A (en) * 2021-09-27 2021-12-31 平安科技(深圳)有限公司 Method, device and equipment for training speech synthesis model and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LYU GUOYUN; JIANG DONGMEI; ZHANG YANNING; ZHAO RONGCHUN; HICHEM SAHLI: "Research on large-vocabulary continuous speech recognition and phoneme segmentation based on dynamic Bayesian networks", Journal of Northwestern Polytechnical University, no. 02, 15 April 2008 (2008-04-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116884399A (en) * 2023-09-06 2023-10-13 深圳市友杰智新科技有限公司 Method, device, equipment and medium for reducing voice misrecognition
CN116884399B (en) * 2023-09-06 2023-12-08 深圳市友杰智新科技有限公司 Method, device, equipment and medium for reducing voice misrecognition

Legal Events

Code Title/Description
PB01 Publication
SE01 Entry into force of request for substantive examination