CN113053359A - Voice recognition method, intelligent terminal and storage medium - Google Patents

Voice recognition method, intelligent terminal and storage medium

Info

Publication number
CN113053359A
CN113053359A
Authority
CN
China
Prior art keywords
character string
syllable sequence
preset character
target
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911403451.9A
Other languages
Chinese (zh)
Inventor
潘弘海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen TCL Digital Technology Co Ltd
Original Assignee
Shenzhen TCL Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen TCL Digital Technology Co Ltd filed Critical Shenzhen TCL Digital Technology Co Ltd
Priority to CN201911403451.9A
Publication of CN113053359A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/027 Syllables being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a voice recognition method, an intelligent terminal and a storage medium. The method acquires the text corresponding to voice information, extracts from it a first character string that should in principle be a proper character string, and matches it against the preset character strings in a target database; when no identical preset character string exists, a target preset character string corresponding to the first character string is acquired and substituted into the text, and the replaced text is taken as the recognition result, thereby improving the recognition accuracy of proper character strings in voice recognition.

Description

Voice recognition method, intelligent terminal and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, an intelligent terminal, and a storage medium.
Background
Compared with text input methods such as pinyin and stroke-based input, voice input is fast and convenient to operate, and is being applied in more and more scenarios. However, owing to factors such as homophones in Chinese, dialects, non-standard pronunciation and noise, speech recognition results are sometimes wrong, which inconveniences users and hinders the popularization of speech recognition products.
Thus, there is still a need for improvement and development of the prior art.
Disclosure of Invention
The inventor has found that, in the prior art, speech recognition errors often occur on proper nouns, which are usually the key point of the user's whole sentence. For example, with a smart television, when a user searches for content by voice, the user speaks a sentence containing a TV series name, a person name, a song name or the like, such as "I want to watch Langya Bang" (琅琊榜, a TV series); the smart television must correctly recognize the proper character string (the TV series name, person name, song name, etc.) in order to execute the correct search and fulfil the user's purpose. However, owing to homophones in Chinese, dialects and ambient noise, the prior art sometimes misrecognizes such proper character strings, for example recognizing "I want to watch Langya Bang" as a homophonous but wrong string. A recognition error in the proper-noun character string clearly reduces the accuracy of speech recognition greatly, and the result may even be far from the user's original intention.
The technical problem to be solved by the present invention is to provide a voice recognition method, an intelligent terminal and a storage medium that address the above defects of the prior art, in particular the low accuracy of speech recognition on proper nouns.
The technical scheme of the invention is as follows:
in a first aspect of the present invention, a speech recognition method is provided, where the speech recognition method includes:
acquiring a text corresponding to voice information, extracting a first character string in the text, and matching the first character string with a preset character string in a target database;
when a preset character string which is the same as the first character string does not exist in the target database, acquiring a target preset character string which corresponds to the first character string in the target database;
and replacing the first character string in the text with the target preset character string, and taking the replaced text as the recognition result of the voice information.
The voice recognition method, wherein the matching the first character string with a preset character string in a target database includes:
acquiring professional categories corresponding to the voice information;
selecting a database corresponding to the professional category from at least one preset database according to the professional category, and taking the database as the target database;
and matching the first character string with a preset character string in the target database.
The voice recognition method, wherein the extracting the first character string from the text specifically includes:
inputting the text into a first model corresponding to the professional category, and acquiring the first character string output by the first model;
the first model is trained according to a first data set, the first data set comprises a plurality of groups of first samples, and each group of first samples comprises sample texts in the professional categories and sample first character strings corresponding to the sample texts.
The voice recognition method, wherein the obtaining of the target preset character string corresponding to the first character string in the target database includes:
acquiring a first syllable sequence corresponding to the first character string;
inputting the first syllable sequence into a pre-trained second model to obtain a second syllable sequence output by the second model;
the second model is trained according to a second data set, the second data set comprises a plurality of groups of second samples, each group of second samples comprises a sample syllable sequence and a sample second syllable sequence corresponding to the sample syllable sequence, and the sample second syllable sequence is a syllable sequence corresponding to a preset character string in the target database;
and determining the target preset character string according to the second syllable sequence.
The voice recognition method, wherein the obtaining the target preset character string according to the second syllable sequence includes:
and when the preset character string with the syllable sequence consistent with the second syllable sequence does not exist in the target database, taking the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string.
The voice recognition method, wherein the obtaining the target preset character string according to the second syllable sequence includes:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is one, taking the preset character string with the syllable sequence consistent with the second syllable sequence as the target preset character string.
The voice recognition method, wherein the target database stores the use frequency of each preset character string in historical use data, and the obtaining of the target preset character string according to the second syllable sequence includes:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is multiple, taking the preset character string with the highest use frequency in the preset character strings with the syllable sequence consistent with the second syllable sequence as the target preset character string.
The voice recognition method described above, wherein the preset character string in the target database having the highest degree of correlation with the second syllable sequence is a preset character string corresponding to a syllable sequence having the smallest edit distance with respect to the second syllable sequence; the step of using the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string comprises:
selecting at least one first preset character string from the target database, wherein the syllable sequence of each first preset character string and the second syllable sequence comprise at least a preset number of same syllables;
respectively acquiring the editing distance between the syllable sequence of the at least one first preset character string and the second syllable sequence;
and taking the first preset character string corresponding to the syllable sequence with the minimum editing distance as the target preset character string.
In a second aspect of the present invention, an intelligent terminal is provided, wherein the intelligent terminal includes: a processor, and a storage medium communicatively coupled to the processor, the storage medium adapted to store a plurality of instructions; the processor is adapted to invoke the instructions in the storage medium to perform the speech recognition method described in any of the above.
In a third aspect of the invention, a computer-readable storage medium is provided, which stores one or more programs executable by one or more processors to implement the speech recognition method described in any of the above.
The invention has the following technical effects: in the voice recognition method provided by the invention, after the acquired voice is converted into text, a first character string that should in principle be a proper character string is extracted from the text; when the first character string is not such a proper character string, the proper character string corresponding to it is acquired, thereby improving the recognition accuracy of proper character strings in voice recognition.
Drawings
FIG. 1 is a flow chart of a first embodiment of a speech recognition method provided by the present invention;
FIG. 2 is a flowchart illustrating sub-steps of step S100 according to a first embodiment of the speech recognition method provided by the present invention;
FIG. 3 is a flow chart of one implementation of obtaining a target default string in the speech recognition method provided by the present invention;
fig. 4 is a functional schematic diagram of an intelligent terminal provided by the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.
The invention provides a voice recognition method applicable to a terminal: when a user speaks to a terminal that supports voice recognition, the terminal recognizes the user's speech according to the voice recognition method provided by the invention and outputs the recognition result. The terminal may be, but is not limited to, a personal computer, notebook computer, mobile phone, tablet computer, vehicle-mounted computer, or portable wearable device.
Example one
Referring to fig. 1, fig. 1 is a simplified flowchart of a speech recognition method according to a first embodiment of the present invention. The voice recognition method comprises the following steps:
s100, obtaining a text corresponding to the voice information, extracting a first character string in the text, and matching the first character string with a preset character string in a target database.
The voice information is information uttered by a user; in particular, it may be uttered when the user wants to use voice input. The text corresponding to the voice information may be produced from the voice information by a separate voice-conversion device and then input to the terminal, or may be obtained by a voice-conversion unit built into the terminal itself.
The first character string in the text is the part of the text that should in principle be a proper character string. A proper character string is a character string corresponding to a specific vocabulary item in a professional field: for example, in the film-and-television field it may be a TV series name or a movie name, and in the sports field it may be the name of a sports move or a player. The target database is a database storing the proper character strings of the professional category corresponding to the voice information, where a professional category is a category of professional field, such as film and television or sports; the preset character strings in the target database are the proper character strings of that professional category. The preset character strings may be collected manually or gathered automatically from the web by a crawler, and the target database may be stored locally on the terminal or in the cloud. A hypothetical record layout for such a database is sketched below.
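Since several of the steps below consult the contents of the target database, the following minimal sketch fixes a record layout for it. The patent does not prescribe a schema, so the fields (preset string, syllable sequence, usage frequency) are assumptions drawn from this description; later sketches reuse these records.

```python
# Hypothetical target-database records; the fields are assumptions,
# not a schema defined by the patent.
from typing import List, Tuple

Entry = Tuple[str, Tuple[str, ...], int]  # (preset string, syllable sequence, usage frequency)

target_db: List[Entry] = [
    ("小猪佩奇", ("x", "iao", "zh", "u", "p", "ei", "q", "i"), 120),  # Peppa Pig
    ("琅琊榜", ("l", "ang", "y", "a", "b", "ang"), 300),              # Langya Bang
]
```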
Specifically, as explained in the Disclosure of Invention section, recognition errors during speech recognition often occur in the proper-character-string part, and the recognition accuracy of proper character strings strongly affects the accuracy of the overall speech recognition result. Therefore, in the present invention, after the text corresponding to the voice information is obtained, the first character string that should in principle be a proper character string is extracted from the text and matched against the preset character strings in the target database to determine whether it is indeed a proper character string.
Different professional categories have different proper character strings. As shown in fig. 2, matching the first character string with a preset character string in the target database includes:
s110, acquiring professional categories corresponding to the voice information;
the professional category corresponding to the voice information may be obtained through a text corresponding to the voice information, for example, if the text corresponding to the voice information is "i want to watch XXX" and "i want to listen to XXX", then the professional category corresponding to the voice information may be obtained as a movie category; the professional type corresponding to the voice information can also be obtained according to interface information of the terminal when the voice information is received, and when the voice information is received, the interface of the terminal is a news interface, so that the professional type corresponding to the voice information can be obtained as news.
S120, selecting a database corresponding to the professional category from at least one preset database according to the professional category, and taking the database as the target database.
In this embodiment, at least one database is established in advance for the different professional categories, each storing the proper character strings of its corresponding category; after the voice information is acquired, the database corresponding to the professional category of the voice information is selected as the target database.
S130, matching the first character string with a preset character string in the target database.
After the target database is obtained, the first character string is matched against the preset character strings in the target database to determine whether the first character string is a proper character string.
In this embodiment, the first character string in the text is obtained through a pre-trained first model. Since different professional categories have different proper character strings, which occupy different positions in a sentence, a different model is used to extract the first character string for each category. Extracting the first character string from the text specifically includes:
and inputting the text into a first model corresponding to the professional category, and acquiring the first character string output by the first model.
The first model is trained on a first data set comprising multiple groups of first samples; so that the trained first model is suited to extracting first character strings from voice information of the professional category, each group of first samples comprises a sample text of that professional category and the sample first character string corresponding to the sample text. In a specific implementation, the first samples may be obtained from the historical input data of existing users: texts of the professional category that users entered by voice or by hand are collected as sample texts, each sample text is labelled, and the proper character string in the text is marked as the sample first character string, which completes a first sample consisting of the sample text and its corresponding sample first character string.
It should be noted that the sample first character string corresponding to a sample text is not necessarily a correct proper character string; it may be a proper character string containing wrongly-written characters, and when labelling the sample text such an erroneous proper character string may still be marked as the sample first character string. That is, the sample first character string is the part of the sample text that should in principle be a proper character string. Consequently, even when the proper character string in the text converted from the voice information contains wrong characters, the first character string can still be extracted, although it may itself contain those wrong characters.
The first model may be a CRF (conditional random field) model or an LSTM (long short-term memory) model; of course, a person skilled in the art may select another natural-language-processing model as the first model as needed, such as a Bi-LSTM (bidirectional long short-term memory) model or a Bi-LSTM+CRF model. Whatever the model, its output can be decoded into the first character string as sketched below.
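The sketch shows only the decoding step shared by these model choices: given per-character B/I/O labels from a trained first model, the first labelled span is extracted as the first character string. The tag sequence in the example is an assumed illustration, not the output of a real trained model.

```python
from typing import List, Sequence

def decode_first_span(text: str, tags: Sequence[str]) -> str:
    """Extract the first B/I-labelled span from per-character tags."""
    chars: List[str] = []
    for ch, tag in zip(text, tags):
        if tag == "B":
            if chars:              # a span was already collected
                break
            chars.append(ch)       # start the span
        elif tag == "I" and chars:
            chars.append(ch)       # extend the span
        elif tag == "O" and chars:
            break                  # first span is complete
    return "".join(chars)

# Tags as a CRF/Bi-LSTM first model might emit them for "我想看琅琊榜".
print(decode_first_span("我想看琅琊榜", ["O", "O", "O", "B", "I", "I"]))  # 琅琊榜
```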
After the first character string is obtained, whether it is a proper character string is determined by matching it against the preset character strings in the target database. When the terminal obtains the first character string, it traverses all preset character strings in the target database to determine whether the first character string exists there. If it does, the proper character string in the voice information has been recognized correctly, and the text is output directly as the recognition result of the voice information; if it does not, the proper character string in the voice information has been recognized incorrectly and needs to be corrected. A sketch of this match-and-dispatch step is given after this paragraph.
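A minimal sketch of the match-and-dispatch logic, with correct() standing in for the correction procedure of steps S200-S300 described below; the function names are assumptions for illustration.

```python
from typing import Callable, Set

def recognize(text: str,
              target_db: Set[str],
              extract: Callable[[str], str],
              correct: Callable[[str, Set[str]], str]) -> str:
    first = extract(text)                  # first string from the first model
    if not first or first in target_db:    # exact hit: proper string is correct
        return text
    target = correct(first, target_db)     # S200: obtain target preset string
    return text.replace(first, target, 1)  # S300: replace and output
```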
Specifically, the speech recognition method further includes:
s200, when the preset character string which is the same as the first character string does not exist in the target database, acquiring a target preset character string corresponding to the first character string in the target database.
The target preset character string is the proper character string corresponding to the first character string; when no preset character string identical to the first character string exists in the target database, the target preset character string corresponding to the first character string is obtained from the target database. Specifically, obtaining the target preset character string corresponding to the first character string in the target database includes:
s210, obtaining a first syllable sequence of the first character string.
The syllable sequence corresponding to a character string is the sequence formed by the syllables (the initial and final) of each character, taken in the order of the characters in the string; the first syllable sequence is the sequence formed by the syllables of the characters of the first character string in order. For example, the syllable sequence corresponding to "小猪佩奇" (Peppa Pig) is "x, iao, zh, u, p, ei, q, i". A sketch of this conversion is shown below.
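The following sketch assumes the third-party pypinyin package as the grapheme-to-syllable tool; any equivalent pinyin tool could play this role.

```python
# A minimal sketch of converting a Chinese character string into the
# interleaved initial/final syllable sequence used in this description.
from pypinyin import lazy_pinyin, Style

def to_syllable_sequence(text: str) -> list:
    """Interleave each character's initial and final, in character order."""
    initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(text, style=Style.FINALS, strict=False)
    seq = []
    for ini, fin in zip(initials, finals):
        if ini:
            seq.append(ini)
        if fin:
            seq.append(fin)
    return seq

print(to_syllable_sequence("小猪佩奇"))
# expected: ['x', 'iao', 'zh', 'u', 'p', 'ei', 'q', 'i']
```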
S220, inputting the first syllable sequence into a pre-trained second model, and obtaining a second syllable sequence output by the second model.
When the first character string is not in the target database, it may be a proper character string containing errors such as missing or wrongly-written characters. In this embodiment, the first character string is corrected according to a pre-trained second model.
Specifically, the second model is trained on a second data set comprising multiple groups of second samples, each comprising a sample syllable sequence and the sample second syllable sequence corresponding to it. The sample second syllable sequence is the syllable sequence corresponding to a preset character string in the target database, and the sample syllable sequence is generated by randomly replacing some syllables of the sample second syllable sequence with other syllables. That is, when training the second model, a number of preset character strings are selected from the target database, random syllable replacement is performed on each, and the second model is trained on the correspondence between the syllable sequence after random replacement and the one before it. The training goal is to give the second model the ability to correct an input syllable sequence to the syllable sequence of a preset character string in the target database, so that when the first syllable sequence is input to the trained second model, the second syllable sequence it outputs has a high probability of being the syllable sequence of a preset character string in the target database.
The second model may be a BERT (Bidirectional Encoder Representations from Transformers) model, although a person skilled in the art may choose another suitable natural-language-processing model, such as an N-gram model. When training the second model, each syllable in the second samples must be converted into a vector format that a computer can operate on; that is, the word vector corresponding to each syllable must be obtained. In this embodiment, the word vectors are obtained by taking the syllable sequences corresponding to all preset character strings in the target database as the data set. Specifically, a word-vector training tool such as word2vec may be used, with this data set as the training corpus, to obtain the word vector of every syllable. A sketch of generating the noisy training pairs and the syllable vectors is given below.
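A minimal sketch, under the assumptions of this description, of building the second data set and the syllable vectors: each training pair is (noisy sequence, original sequence), where the noisy sequence is produced by random syllable replacement. The toy syllable inventory and corruption probability are illustrative, and gensim's Word2Vec (gensim 4.x, a third-party package) is just one possible word-vector training tool.

```python
import random
from gensim.models import Word2Vec

SYLLABLES = ["x", "iao", "zh", "u", "p", "ei", "q", "i", "b", "t"]  # toy inventory

def make_pair(seq, replace_prob=0.2):
    """Randomly corrupt `seq` to imitate a misrecognized syllable sequence."""
    noisy = [random.choice(SYLLABLES) if random.random() < replace_prob else s
             for s in seq]
    return noisy, seq

db_sequences = [["x", "iao", "zh", "u", "p", "ei", "q", "i"]]  # from the target DB
pairs = [make_pair(seq) for seq in db_sequences for _ in range(100)]

# Syllable vectors trained on the database's syllable sequences.
w2v = Word2Vec(sentences=db_sequences, vector_size=32, window=2, min_count=1)
print(w2v.wv["iao"].shape)  # (32,)
```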
And S230, acquiring the target preset character string according to the second syllable sequence.
The second syllable sequence output by the second model is the model's prediction of the syllable sequence of the proper character string corresponding to the first syllable sequence. However, because the sample syllable sequences in the second samples are generated by random replacement over the preset character strings of the target database, a randomly replaced syllable sequence may not correspond to any recognition result of speech a user would actually utter. For example, for the preset character string "小猪佩奇" (Peppa Pig) with syllable sequence "x, iao, zh, u, p, ei, q, i", random replacement might generate sample syllable sequences such as "x, iao, k, u, p, ei, t, i" or "t, iao, sh, u, p, ei, t, i"; in practice a user is unlikely to produce speech recognized as those sequences, and the actual recognition results would mostly be sequences such as "x, iao, z, u, p, ei, q, i" or "x, iao, z, u, b, ei, q, i". That is, the training samples of the second model deviate from real data, which limits the capability of the second model; the second model therefore may not fully reach its training goal, and the second syllable sequence it outputs for a first syllable sequence may not be the syllable sequence of any preset character string. In other words, the target database may or may not contain a preset character string whose syllable sequence is consistent with the second syllable sequence.
In one possible implementation, obtaining the target preset character string according to the second syllable sequence includes:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is one, taking the preset character string with the syllable sequence consistent with the second syllable sequence as the target preset character string.
When exactly one preset character string in the target database has a syllable sequence consistent with the second syllable sequence, the second syllable sequence is the syllable sequence of a proper character string, and that preset character string is directly taken as the target preset character string.
In another possible implementation, the target database stores the usage frequency of each preset character string in historical usage data, and obtaining the target preset character string according to the second syllable sequence includes:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is multiple, taking the preset character string with the highest use frequency in the preset character strings with the syllable sequence consistent with the second syllable sequence as the target preset character string.
The usage frequency is the frequency with which each preset character string in the target database is used by users; it may be obtained from the occurrence frequency of each preset character string in the statistical data gathered when the target database is built, or from the frequency with which users input each preset character string. Several preset character strings in the target database may share the same syllable sequence; when multiple preset character strings have a syllable sequence consistent with the second syllable sequence, the one with the highest usage frequency among them is taken as the target preset character string. A sketch of this exact-match selection is given below.
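A minimal sketch of the exact-match selection just described, reusing the hypothetical Entry records introduced earlier:

```python
from typing import List, Optional, Sequence

def select_exact(second_seq: Sequence[str], db: List[Entry]) -> Optional[str]:
    """Exact syllable match; ties broken by usage frequency."""
    matches = [e for e in db if tuple(second_seq) == e[1]]
    if not matches:
        return None  # fall back to the edit-distance search described below
    return max(matches, key=lambda e: e[2])[0]
```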
As explained above, the target database may contain no preset character string whose syllable sequence is consistent with the second syllable sequence; therefore, in one possible implementation, obtaining the target preset character string according to the second syllable sequence includes:
and when the preset character string with the syllable sequence consistent with the second syllable sequence does not exist in the target database, taking the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string.
Specifically, when no preset character string in the target database has a syllable sequence consistent with the second syllable sequence, the second syllable sequence output by the second model is still not the syllable sequence of a proper character string and needs further correction; in this case the target preset character string is obtained according to the correlation between the second syllable sequence and the preset character strings in the target database.
Specifically, in this embodiment the edit distance between syllable sequences is used to evaluate the correlation between the second syllable sequence and a preset character string: the preset character string with the highest correlation is the one whose syllable sequence has the smallest edit distance to the second syllable sequence. Edit distance is an index used in the field of language processing to measure the degree of difference between two strings; it is the number of single-character edits required to convert one string into the other. The larger the edit distance between two strings, the greater their difference; conversely, the smaller the edit distance, the smaller the difference. A standard dynamic-programming computation of this distance is sketched below.
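The sketch computes the standard Levenshtein edit distance over syllable sequences; this is textbook material rather than anything patent-specific.

```python
from typing import Sequence

def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of single-syllable insertions/deletions/substitutions."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

print(edit_distance(["x", "iao"], ["t", "iao"]))  # 1
```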
As shown in fig. 3, the step of using the preset character string with the highest correlation with the second syllable sequence in the target database as the target preset character string includes:
s231, selecting at least one first preset character string from the target database, wherein the syllable sequence of each preset character string and the second syllable sequence comprise at least a preset number of same syllables.
Specifically, the target database contains many preset character strings and hence many syllable sequences; computing the edit distance between the second syllable sequence and the syllable sequence of every preset character string, and then taking the preset character string whose syllable sequence has the smallest edit distance, would clearly consume a large amount of computing resources. Therefore, only those preset character strings whose syllable sequences share at least a preset number of syllables with the second syllable sequence are selected as first preset character strings. The preset number may be, for example, 3, 6 or 8. Evidently, the larger the preset number, the fewer the first preset character strings and the smaller the cost of finding the syllable sequence with the minimum edit distance to the second syllable sequence; however, because the candidate set is small, the preset character string that actually has the highest correlation with the second syllable sequence may be missed, making the result inaccurate. The smaller the preset number, the more first preset character strings there are and the greater the cost of finding the most correlated preset character string, but the larger candidate set makes the result more accurate.
S232, respectively obtaining the editing distance between the syllable sequence of the at least one first preset character string and the second syllable sequence.
After the at least one first preset character string is obtained, the editing distance between the syllable sequence of each first preset character string and the second syllable sequence is calculated respectively.
And S233, taking the first preset character string corresponding to the syllable sequence with the minimum editing distance as the target preset character string.
If, among the at least one first preset character string, the syllable sequence of a particular first preset character string has the minimum edit distance to the second syllable sequence, then its difference from the second syllable sequence is the smallest of all the first preset character strings. After the syllable sequence with the minimum edit distance is determined, the first preset character string corresponding to it is obtained; this is the preset character string with the highest correlation with the second syllable sequence, and in this embodiment it is taken as the target preset character string.
In one possible implementation, when several syllable sequences share the same minimum edit distance, the usage frequency of the first preset character string corresponding to each of them is looked up in the target database, and the first preset character string with the highest usage frequency is taken as the target preset character string. A sketch of steps S231-S233 including this tie-break follows.
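A minimal sketch of steps S231-S233 including the frequency tie-break, reusing edit_distance() and the Entry records from the earlier sketches; it assumes at least one candidate survives the shared-syllable filter.

```python
from collections import Counter
from typing import List, Sequence

def shared_syllables(a: Sequence[str], b: Sequence[str]) -> int:
    """Syllables common to both sequences, counted with multiplicity."""
    return sum((Counter(a) & Counter(b)).values())

def select_by_distance(second_seq: Sequence[str],
                       db: List[Entry],
                       preset_number: int = 3) -> str:
    # S231: keep only candidates sharing >= preset_number syllables.
    candidates = [e for e in db
                  if shared_syllables(second_seq, e[1]) >= preset_number]
    # S232/S233: minimum edit distance; ties broken by higher usage frequency.
    return min(candidates,
               key=lambda e: (edit_distance(second_seq, e[1]), -e[2]))[0]
```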
Referring to fig. 1 again, after the target preset character string is obtained, the voice recognition method further includes:
s300, replacing the first character string in the text with the target preset character string, and taking the replaced text as the recognition result of the voice information.
As explained above, the target preset character string is the proper character string corresponding to the first character string. Therefore, after the target preset character string is obtained, it is used to replace the first character string in the text; the replaced text then contains the corrected proper character string and is output as the recognition result of the voice information.
It can be seen from the above embodiment that, in the speech recognition method provided by the present invention, after the acquired speech is converted into text, the first character string that should in principle be a proper character string is extracted from the text, and when the first character string is not a proper character string, the proper character string corresponding to it is acquired, thereby improving the recognition accuracy of proper character strings in speech recognition.
Example two
Based on the above embodiment, the present invention further provides an intelligent terminal, a schematic block diagram of which may be as shown in fig. 4. The intelligent terminal comprises a processor, a memory, a network interface, a display screen and a temperature sensor connected through a system bus. The processor of the intelligent terminal provides computing and control capability. The memory of the intelligent terminal comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The network interface of the intelligent terminal is used to connect to and communicate with external terminals through a network. The computer program, when executed by the processor, implements the speech recognition method. The display screen of the intelligent terminal may be a liquid crystal display or an electronic ink display, and the temperature sensor is arranged inside the intelligent terminal in advance to detect the current operating temperature of the internal components.
It will be understood by those skilled in the art that the block diagram shown in fig. 4 is only a block diagram of part of the structure related to the solution of the present invention and does not limit the intelligent terminal to which the solution is applied; a specific intelligent terminal may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.
In one embodiment, an intelligent terminal is provided, which includes a memory and a processor, the memory stores a computer program, and the processor can realize at least the following steps when executing the computer program:
acquiring a text corresponding to voice information, extracting a first character string in the text, and matching the first character string with a preset character string in a target database;
when a preset character string which is the same as the first character string does not exist in the target database, acquiring a target preset character string which corresponds to the first character string in the target database;
and replacing the first character string in the text with the target preset character string, and taking the replaced text as the recognition result of the voice information.
Wherein the matching the first character string with a preset character string in a target database comprises:
acquiring professional categories corresponding to the voice information;
selecting a database corresponding to the professional category from at least one preset database according to the professional category, and taking the database as the target database;
and matching the first character string with a preset character string in the target database.
Wherein the extracting of the first character string in the text specifically includes:
inputting the text into a first model corresponding to the professional category, and acquiring the first character string output by the first model;
the first model is trained according to a first data set, the first data set comprises a plurality of groups of first samples, and each group of first samples comprises sample texts in the professional categories and sample first character strings corresponding to the sample texts.
Wherein the obtaining of the target preset character string corresponding to the first character string in the target database comprises:
acquiring a first syllable sequence corresponding to the first character string;
inputting the first syllable sequence into a pre-trained second model to obtain a second syllable sequence output by the second model;
the second model is trained according to a second data set, the second data set comprises a plurality of groups of second samples, each group of second samples comprises a sample syllable sequence and a sample second syllable sequence corresponding to the sample syllable sequence, and the sample second syllable sequence is a syllable sequence corresponding to a preset character string in the target database;
and determining the target preset character string according to the second syllable sequence.
Wherein the obtaining the target preset character string according to the second syllable sequence comprises:
and when the preset character string with the syllable sequence consistent with the second syllable sequence does not exist in the target database, taking the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string.
Wherein the obtaining the target preset character string according to the second syllable sequence comprises:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is one, taking the preset character string with the syllable sequence consistent with the second syllable sequence as the target preset character string.
Wherein, the target database stores the use frequency of each preset character string in the historical use data, and the obtaining the target preset character string according to the second syllable sequence comprises:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is multiple, taking the preset character string with the highest use frequency in the preset character strings with the syllable sequence consistent with the second syllable sequence as the target preset character string.
The preset character string with the highest correlation degree with the second syllable sequence in the target database is a preset character string corresponding to the syllable sequence with the minimum editing distance with the second syllable sequence; the step of using the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string comprises:
selecting at least one first preset character string from the target database, wherein the syllable sequence of each first preset character string and the second syllable sequence comprise at least a preset number of same syllables;
respectively acquiring the editing distance between the syllable sequence of the at least one first preset character string and the second syllable sequence;
and taking the first preset character string corresponding to the syllable sequence with the minimum editing distance as the target preset character string.
Example three
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The present invention provides a storage medium storing one or more programs executable by one or more processors to implement a speech recognition method according to an embodiment.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (10)

1. A speech recognition method, characterized in that the speech recognition method comprises:
acquiring a text corresponding to voice information, extracting a first character string in the text, and matching the first character string with a preset character string in a target database;
when a preset character string which is the same as the first character string does not exist in the target database, acquiring a target preset character string which corresponds to the first character string in the target database;
and replacing the first character string in the text with the target preset character string, and taking the replaced text as the recognition result of the voice information.
2. The speech recognition method of claim 1, wherein matching the first string with a predetermined string in a target database comprises:
acquiring professional categories corresponding to the voice information;
selecting a database corresponding to the professional category from at least one preset database according to the professional category, and taking the database as the target database;
and matching the first character string with a preset character string in the target database.
3. The speech recognition method according to claim 2, wherein the extracting the first character string from the text specifically comprises:
inputting the text into a first model corresponding to the professional category, and acquiring the first character string output by the first model;
the first model is trained according to a first data set, the first data set comprises a plurality of groups of first samples, and each group of first samples comprises sample texts in the professional categories and sample first character strings corresponding to the sample texts.
4. The speech recognition method according to claim 1, wherein the obtaining of the target preset character string corresponding to the first character string in the target database specifically comprises:
acquiring a first syllable sequence corresponding to the first character string;
inputting the first syllable sequence into a pre-trained second model to obtain a second syllable sequence output by the second model;
the second model is trained according to a second data set, the second data set comprises a plurality of groups of second samples, each group of second samples comprises a sample syllable sequence and a sample second syllable sequence corresponding to the sample syllable sequence, and the sample second syllable sequence is a syllable sequence corresponding to a preset character string in the target database;
and determining the target preset character string according to the second syllable sequence.
5. The speech recognition method of claim 4, wherein the obtaining the target pre-determined string according to the second syllable sequence comprises:
and when the preset character string with the syllable sequence consistent with the second syllable sequence does not exist in the target database, taking the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string.
6. The speech recognition method of claim 4, wherein the obtaining the target pre-determined string according to the second syllable sequence comprises:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is one, taking the preset character string with the syllable sequence consistent with the second syllable sequence as the target preset character string.
7. The speech recognition method of claim 4, wherein the target database stores usage frequencies of respective preset character strings in historical usage data, and the obtaining the target preset character string according to the second syllable sequence comprises:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is multiple, taking the preset character string with the highest use frequency in the preset character strings with the syllable sequence consistent with the second syllable sequence as the target preset character string.
8. The speech recognition method according to claim 5, wherein the predetermined string in the target database having the highest correlation with the second syllable sequence is a predetermined string corresponding to a syllable sequence having the smallest edit distance with respect to the second syllable sequence; the step of using the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string comprises:
selecting at least one first preset character string from the target database, wherein the syllable sequence of each first preset character string and the second syllable sequence comprise at least a preset number of same syllables;
respectively acquiring the editing distance between the syllable sequence of the at least one first preset character string and the second syllable sequence;
and taking the first preset character string corresponding to the syllable sequence with the minimum editing distance as the target preset character string.
9. An intelligent terminal, characterized in that, intelligent terminal includes: a processor, a storage medium communicatively coupled to the processor, the storage medium adapted to store a plurality of instructions; the processor is adapted to invoke instructions in the storage medium to perform a method of speech recognition according to any of the preceding claims 1-8.
10. A storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the speech recognition method of any one of claims 1-8.
CN201911403451.9A 2019-12-27 2019-12-27 Voice recognition method, intelligent terminal and storage medium Pending CN113053359A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911403451.9A CN113053359A (en) 2019-12-27 2019-12-27 Voice recognition method, intelligent terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911403451.9A CN113053359A (en) 2019-12-27 2019-12-27 Voice recognition method, intelligent terminal and storage medium

Publications (1)

Publication Number Publication Date
CN113053359A 2021-06-29

Family

ID=76507514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911403451.9A Pending CN113053359A (en) 2019-12-27 2019-12-27 Voice recognition method, intelligent terminal and storage medium

Country Status (1)

Country Link
CN (1) CN113053359A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120110751A (en) * 2011-03-30 2012-10-10 포항공과대학교 산학협력단 Speech processing apparatus and method
CN107832301A (en) * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN108091328A (en) * 2017-11-20 2018-05-29 北京百度网讯科技有限公司 Speech recognition error correction method, device and readable medium based on artificial intelligence
CN109065054A (en) * 2018-08-31 2018-12-21 出门问问信息科技有限公司 Speech recognition error correction method, device, electronic equipment and readable storage medium storing program for executing
CN109145276A (en) * 2018-08-14 2019-01-04 杭州智语网络科技有限公司 A kind of text correction method after speech-to-text based on phonetic
CN109599114A (en) * 2018-11-07 2019-04-09 重庆海特科技发展有限公司 Method of speech processing, storage medium and device
CN109727598A (en) * 2018-12-28 2019-05-07 浙江省公众信息产业有限公司 Intension recognizing method under big noise context
CN109918485A (en) * 2019-01-07 2019-06-21 口碑(上海)信息技术有限公司 The method and device of speech recognition vegetable, storage medium, electronic device
CN110473523A (en) * 2019-08-30 2019-11-19 北京大米科技有限公司 A kind of audio recognition method, device, storage medium and terminal
CN110517692A (en) * 2019-08-30 2019-11-29 苏州思必驰信息科技有限公司 Hot word audio recognition method and device

Similar Documents

Publication Publication Date Title
CN109635270B (en) Bidirectional probabilistic natural language rewrite and selection
JP5901001B1 (en) Method and device for acoustic language model training
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
JP4705023B2 (en) Speech recognition apparatus, speech recognition method, and program
CN109754809B (en) Voice recognition method and device, electronic equipment and storage medium
US8527272B2 (en) Method and apparatus for aligning texts
CN106570180B (en) Voice search method and device based on artificial intelligence
US11031009B2 (en) Method for creating a knowledge base of components and their problems from short text utterances
CN111402861B (en) Voice recognition method, device, equipment and storage medium
CN101221576B (en) Input method and device capable of implementing automatic translation
CN111611349A (en) Voice query method and device, computer equipment and storage medium
CN111312209A (en) Text-to-speech conversion processing method and device and electronic equipment
CN113450774B (en) Training data acquisition method and device
CN113225612B (en) Subtitle generating method, device, computer readable storage medium and electronic equipment
CN114817465A (en) Entity error correction method and intelligent device for multi-language semantic understanding
CN112530404A (en) Voice synthesis method, voice synthesis device and intelligent equipment
CN111326144A (en) Voice data processing method, device, medium and computing equipment
CN112559725A (en) Text matching method, device, terminal and storage medium
CN110929514B (en) Text collation method, text collation apparatus, computer-readable storage medium, and electronic device
CN112447173A (en) Voice interaction method and device and computer storage medium
CN113470617B (en) Speech recognition method, electronic equipment and storage device
CN113053359A (en) Voice recognition method, intelligent terminal and storage medium
CN115881108A (en) Voice recognition method, device, equipment and storage medium
CN114171000A (en) Audio recognition method based on acoustic model and language model
CN110728137B (en) Method and device for word segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination