CN113053359A - Voice recognition method, intelligent terminal and storage medium - Google Patents

Voice recognition method, intelligent terminal and storage medium

Info

Publication number
CN113053359A
CN113053359A
Authority
CN
China
Prior art keywords
character string
syllable sequence
preset character
target
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911403451.9A
Other languages
Chinese (zh)
Inventor
潘弘海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen TCL Digital Technology Co Ltd
Original Assignee
Shenzhen TCL Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen TCL Digital Technology Co Ltd filed Critical Shenzhen TCL Digital Technology Co Ltd
Priority to CN201911403451.9A
Publication of CN113053359A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/18 Speech classification or search using natural language modelling
    • G10L 15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/027 Syllables being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a voice recognition method, an intelligent terminal and a storage medium. The method acquires the text corresponding to voice information, extracts from it a first character string that should in principle be a proper character string, and matches it against the preset character strings in a target database; when no identical preset character string exists, a target preset character string corresponding to the first character string is acquired and substituted into the text, and the replaced text is taken as the recognition result, thereby improving the recognition accuracy of proper character strings in voice recognition.

Description

Voice recognition method, intelligent terminal and storage medium
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a speech recognition method, an intelligent terminal, and a storage medium.
Background
Compared with text input methods such as pinyin and stroke-based input, voice input is fast and convenient to operate, and is being applied in more and more scenarios. However, owing to factors such as homophones in Chinese, dialects, non-standard pronunciation and noise, speech recognition results are sometimes wrong, which inconveniences users and hinders the popularization of speech recognition products.
Thus, there is still a need for improvement and development of the prior art.
Disclosure of Invention
The inventor has found that, in the prior art, speech recognition errors often occur on proper nouns, which are usually the key point of the user's whole sentence. For example, with a smart television, when a user searches for content by voice, the user speaks a sentence containing a TV series name, a person name, a song name or the like, such as "I want to watch Langya Bang" (琅琊榜, a TV series); the smart television must correctly recognize the proper character string (the TV series name, person name, song name, etc.) in order to execute the correct search and fulfil the user's purpose. However, owing to homophones in Chinese, dialects and ambient noise, the prior art sometimes misrecognizes such proper character strings, for example recognizing "I want to watch Langya Bang" as a homophonous but wrong string. A recognition error in the proper-noun character string clearly reduces the accuracy of speech recognition greatly, and the result may even be far from the user's original intention.
The technical problem to be solved by the present invention is to provide a voice recognition method, an intelligent terminal and a storage medium that address the above defects of the prior art, in particular the low accuracy of speech recognition on proper nouns.
The technical scheme of the invention is as follows:
in a first aspect of the present invention, a speech recognition method is provided, where the speech recognition method includes:
acquiring a text corresponding to voice information, extracting a first character string in the text, and matching the first character string with a preset character string in a target database;
when a preset character string which is the same as the first character string does not exist in the target database, acquiring a target preset character string which corresponds to the first character string in the target database;
and replacing the first character string in the text with the target preset character string, and taking the replaced text as the recognition result of the voice information.
The voice recognition method, wherein the matching the first character string with a preset character string in a target database includes:
acquiring professional categories corresponding to the voice information;
selecting a database corresponding to the professional category from at least one preset database according to the professional category, and taking the database as the target database;
and matching the first character string with a preset character string in the target database.
The voice recognition method, wherein the extracting the first character string from the text specifically includes:
inputting the text into a first model corresponding to the professional category, and acquiring the first character string output by the first model;
the first model is trained according to a first data set, the first data set comprises a plurality of groups of first samples, and each group of first samples comprises sample texts in the professional categories and sample first character strings corresponding to the sample texts.
The voice recognition method, wherein the obtaining of the target preset character string corresponding to the first character string in the target database includes:
acquiring a first syllable sequence corresponding to the first character string;
inputting the first syllable sequence into a pre-trained second model to obtain a second syllable sequence output by the second model;
the second model is trained according to a second data set, the second data set comprises a plurality of groups of second samples, each group of second samples comprises a sample syllable sequence and a sample second syllable sequence corresponding to the sample syllable sequence, and the sample second syllable sequence is a syllable sequence corresponding to a preset character string in the target database;
and determining the target preset character string according to the second syllable sequence.
The voice recognition method, wherein the obtaining the target preset character string according to the second syllable sequence includes:
and when the preset character string with the syllable sequence consistent with the second syllable sequence does not exist in the target database, taking the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string.
The voice recognition method, wherein the obtaining the target preset character string according to the second syllable sequence includes:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is one, taking the preset character string with the syllable sequence consistent with the second syllable sequence as the target preset character string.
The voice recognition method, wherein the target database stores the use frequency of each preset character string in historical use data, and the obtaining of the target preset character string according to the second syllable sequence includes:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is multiple, taking the preset character string with the highest use frequency in the preset character strings with the syllable sequence consistent with the second syllable sequence as the target preset character string.
The voice recognition method described above, wherein the preset character string in the target database having the highest degree of correlation with the second syllable sequence is a preset character string corresponding to a syllable sequence having the smallest edit distance with respect to the second syllable sequence; the step of using the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string comprises:
selecting at least one first preset character string from the target database, wherein the syllable sequence of each first preset character string and the second syllable sequence comprise at least a preset number of same syllables;
respectively acquiring the editing distance between the syllable sequence of the at least one first preset character string and the second syllable sequence;
and taking the first preset character string corresponding to the syllable sequence with the minimum editing distance as the target preset character string.
In a second aspect of the present invention, an intelligent terminal is provided, wherein the intelligent terminal includes: a processor, and a storage medium communicatively coupled to the processor, the storage medium adapted to store a plurality of instructions; the processor is adapted to invoke the instructions in the storage medium to perform the speech recognition method described in any of the above.
In a third aspect of the invention, a computer-readable storage medium is provided, which stores one or more programs executable by one or more processors to implement the speech recognition method described in any of the above.
The invention has the following technical effects: in the voice recognition method provided by the invention, after the acquired voice is converted into text, a first character string that should in principle be a proper character string is extracted from the text; when the first character string is not such a proper character string, the proper character string corresponding to it is acquired, thereby improving the recognition accuracy of proper character strings in voice recognition.
Drawings
FIG. 1 is a flow chart of a first embodiment of a speech recognition method provided by the present invention;
FIG. 2 is a flowchart illustrating sub-steps of step S100 according to a first embodiment of the speech recognition method provided by the present invention;
FIG. 3 is a flow chart of one implementation of obtaining a target default string in the speech recognition method provided by the present invention;
fig. 4 is a functional schematic diagram of an intelligent terminal provided by the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.
The invention provides a voice recognition method applicable to a terminal: when a user speaks to a terminal that supports voice recognition, the terminal recognizes the user's speech according to the voice recognition method provided by the invention and outputs the recognition result. The terminal may be, but is not limited to, a personal computer, notebook computer, mobile phone, tablet computer, vehicle-mounted computer, or portable wearable device.
Example one
Referring to fig. 1, fig. 1 is a simplified flowchart of a speech recognition method according to a first embodiment of the present invention. The voice recognition method comprises the following steps:
s100, obtaining a text corresponding to the voice information, extracting a first character string in the text, and matching the first character string with a preset character string in a target database.
The voice information is information uttered by a user; in particular, it may be uttered when the user wants to use voice input. The text corresponding to the voice information may be produced from the voice information by a separate voice-conversion device and then input to the terminal, or may be obtained by a voice-conversion unit built into the terminal itself.
The first character string in the text is the part of the text that should in principle be a proper character string. A proper character string is a character string corresponding to a specific vocabulary item in a professional field: for example, in the film-and-television field it may be a TV series name or a movie name, and in the sports field it may be the name of a sports move or a player. The target database is a database storing the proper character strings of the professional category corresponding to the voice information, where a professional category is a category of professional field, such as film and television or sports; the preset character strings in the target database are the proper character strings of that professional category. The preset character strings may be collected manually or gathered automatically from the web by a crawler, and the target database may be stored locally on the terminal or in the cloud. A hypothetical record layout for such a database is sketched below.
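Since several of the steps below consult the contents of the target database, the following minimal sketch fixes a record layout for it. The patent does not prescribe a schema, so the fields (preset string, syllable sequence, usage frequency) are assumptions drawn from this description; later sketches reuse these records.

```python
# Hypothetical target-database records; the fields are assumptions,
# not a schema defined by the patent.
from typing import List, Tuple

Entry = Tuple[str, Tuple[str, ...], int]  # (preset string, syllable sequence, usage frequency)

target_db: List[Entry] = [
    ("小猪佩奇", ("x", "iao", "zh", "u", "p", "ei", "q", "i"), 120),  # Peppa Pig
    ("琅琊榜", ("l", "ang", "y", "a", "b", "ang"), 300),              # Langya Bang
]
```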
Specifically, as explained in the Disclosure of Invention section, recognition errors during speech recognition often occur in the proper-character-string part, and the recognition accuracy of proper character strings strongly affects the accuracy of the overall speech recognition result. Therefore, in the present invention, after the text corresponding to the voice information is obtained, the first character string that should in principle be a proper character string is extracted from the text and matched against the preset character strings in the target database to determine whether it is indeed a proper character string.
Different professional categories have different proper character strings. As shown in fig. 2, matching the first character string with a preset character string in the target database includes:
s110, acquiring professional categories corresponding to the voice information;
the professional category corresponding to the voice information may be obtained through a text corresponding to the voice information, for example, if the text corresponding to the voice information is "i want to watch XXX" and "i want to listen to XXX", then the professional category corresponding to the voice information may be obtained as a movie category; the professional type corresponding to the voice information can also be obtained according to interface information of the terminal when the voice information is received, and when the voice information is received, the interface of the terminal is a news interface, so that the professional type corresponding to the voice information can be obtained as news.
S120, selecting a database corresponding to the professional category from at least one preset database according to the professional category, and taking the database as the target database.
In this embodiment, at least one database is established in advance for the different professional categories, each storing the proper character strings of its corresponding category; after the voice information is acquired, the database corresponding to the professional category of the voice information is selected as the target database.
S130, matching the first character string with a preset character string in the target database.
After the target database is obtained, the first character string is matched against the preset character strings in the target database to determine whether the first character string is a proper character string.
In this embodiment, the first character string in the text is obtained through a pre-trained first model. Since different professional categories have different proper character strings, which occupy different positions in a sentence, a different model is used to extract the first character string for each category. Extracting the first character string from the text specifically includes:
and inputting the text into a first model corresponding to the professional category, and acquiring the first character string output by the first model.
The first model is trained on a first data set comprising multiple groups of first samples; so that the trained first model is suited to extracting first character strings from voice information of the professional category, each group of first samples comprises a sample text of that professional category and the sample first character string corresponding to the sample text. In a specific implementation, the first samples may be obtained from the historical input data of existing users: texts of the professional category that users entered by voice or by hand are collected as sample texts, each sample text is labelled, and the proper character string in the text is marked as the sample first character string, which completes a first sample consisting of the sample text and its corresponding sample first character string.
It should be noted that the sample first character string corresponding to a sample text is not necessarily a correct proper character string; it may be a proper character string containing wrongly-written characters, and when labelling the sample text such an erroneous proper character string may still be marked as the sample first character string. That is, the sample first character string is the part of the sample text that should in principle be a proper character string. Consequently, even when the proper character string in the text converted from the voice information contains wrong characters, the first character string can still be extracted, although it may itself contain those wrong characters.
The first model may be a CRF (conditional random field) model or an LSTM (long short-term memory) model; of course, a person skilled in the art may select another natural-language-processing model as the first model as needed, such as a Bi-LSTM (bidirectional long short-term memory) model or a Bi-LSTM+CRF model. Whatever the model, its output can be decoded into the first character string as sketched below.
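The sketch shows only the decoding step shared by these model choices: given per-character B/I/O labels from a trained first model, the first labelled span is extracted as the first character string. The tag sequence in the example is an assumed illustration, not the output of a real trained model.

```python
from typing import List, Sequence

def decode_first_span(text: str, tags: Sequence[str]) -> str:
    """Extract the first B/I-labelled span from per-character tags."""
    chars: List[str] = []
    for ch, tag in zip(text, tags):
        if tag == "B":
            if chars:              # a span was already collected
                break
            chars.append(ch)       # start the span
        elif tag == "I" and chars:
            chars.append(ch)       # extend the span
        elif tag == "O" and chars:
            break                  # first span is complete
    return "".join(chars)

# Tags as a CRF/Bi-LSTM first model might emit them for "我想看琅琊榜".
print(decode_first_span("我想看琅琊榜", ["O", "O", "O", "B", "I", "I"]))  # 琅琊榜
```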
After the first character string is obtained, whether it is a proper character string is determined by matching it against the preset character strings in the target database. When the terminal obtains the first character string, it traverses all preset character strings in the target database to determine whether the first character string exists there. If it does, the proper character string in the voice information has been recognized correctly, and the text is output directly as the recognition result of the voice information; if it does not, the proper character string in the voice information has been recognized incorrectly and needs to be corrected. A sketch of this match-and-dispatch step is given after this paragraph.
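A minimal sketch of the match-and-dispatch logic, with correct() standing in for the correction procedure of steps S200-S300 described below; the function names are assumptions for illustration.

```python
from typing import Callable, Set

def recognize(text: str,
              target_db: Set[str],
              extract: Callable[[str], str],
              correct: Callable[[str, Set[str]], str]) -> str:
    first = extract(text)                  # first string from the first model
    if not first or first in target_db:    # exact hit: proper string is correct
        return text
    target = correct(first, target_db)     # S200: obtain target preset string
    return text.replace(first, target, 1)  # S300: replace and output
```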
Specifically, the speech recognition method further includes:
s200, when the preset character string which is the same as the first character string does not exist in the target database, acquiring a target preset character string corresponding to the first character string in the target database.
The target preset character string is the proper character string corresponding to the first character string; when no preset character string identical to the first character string exists in the target database, the target preset character string corresponding to the first character string is obtained from the target database. Specifically, obtaining the target preset character string corresponding to the first character string in the target database includes:
s210, obtaining a first syllable sequence of the first character string.
The syllable sequence corresponding to a character string is the sequence formed by the syllables (the initial and final) of each character, taken in the order of the characters in the string; the first syllable sequence is the sequence formed by the syllables of the characters of the first character string in order. For example, the syllable sequence corresponding to "小猪佩奇" (Peppa Pig) is "x, iao, zh, u, p, ei, q, i". A sketch of this conversion is shown below.
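The following sketch assumes the third-party pypinyin package as the grapheme-to-syllable tool; any equivalent pinyin tool could play this role.

```python
# A minimal sketch of converting a Chinese character string into the
# interleaved initial/final syllable sequence used in this description.
from pypinyin import lazy_pinyin, Style

def to_syllable_sequence(text: str) -> list:
    """Interleave each character's initial and final, in character order."""
    initials = lazy_pinyin(text, style=Style.INITIALS, strict=False)
    finals = lazy_pinyin(text, style=Style.FINALS, strict=False)
    seq = []
    for ini, fin in zip(initials, finals):
        if ini:
            seq.append(ini)
        if fin:
            seq.append(fin)
    return seq

print(to_syllable_sequence("小猪佩奇"))
# expected: ['x', 'iao', 'zh', 'u', 'p', 'ei', 'q', 'i']
```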
S220, inputting the first syllable sequence into a pre-trained second model, and obtaining a second syllable sequence output by the second model.
When the first character string is not in the target database, it may be a proper character string containing errors such as missing or wrongly-written characters. In this embodiment, the first character string is corrected according to a pre-trained second model.
Specifically, the second model is trained on a second data set comprising multiple groups of second samples, each comprising a sample syllable sequence and the sample second syllable sequence corresponding to it. The sample second syllable sequence is the syllable sequence corresponding to a preset character string in the target database, and the sample syllable sequence is generated by randomly replacing some syllables of the sample second syllable sequence with other syllables. That is, when training the second model, a number of preset character strings are selected from the target database, random syllable replacement is performed on each, and the second model is trained on the correspondence between the syllable sequence after random replacement and the one before it. The training goal is to give the second model the ability to correct an input syllable sequence to the syllable sequence of a preset character string in the target database, so that when the first syllable sequence is input to the trained second model, the second syllable sequence it outputs has a high probability of being the syllable sequence of a preset character string in the target database.
The second model may be a BERT (Bidirectional Encoder Representations from Transformers) model, although a person skilled in the art may choose another suitable natural-language-processing model, such as an N-gram model. When training the second model, each syllable in the second samples must be converted into a vector format that a computer can operate on; that is, the word vector corresponding to each syllable must be obtained. In this embodiment, the word vectors are obtained by taking the syllable sequences corresponding to all preset character strings in the target database as the data set. Specifically, a word-vector training tool such as word2vec may be used, with this data set as the training corpus, to obtain the word vector of every syllable. A sketch of generating the noisy training pairs and the syllable vectors is given below.
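A minimal sketch, under the assumptions of this description, of building the second data set and the syllable vectors: each training pair is (noisy sequence, original sequence), where the noisy sequence is produced by random syllable replacement. The toy syllable inventory and corruption probability are illustrative, and gensim's Word2Vec (gensim 4.x, a third-party package) is just one possible word-vector training tool.

```python
import random
from gensim.models import Word2Vec

SYLLABLES = ["x", "iao", "zh", "u", "p", "ei", "q", "i", "b", "t"]  # toy inventory

def make_pair(seq, replace_prob=0.2):
    """Randomly corrupt `seq` to imitate a misrecognized syllable sequence."""
    noisy = [random.choice(SYLLABLES) if random.random() < replace_prob else s
             for s in seq]
    return noisy, seq

db_sequences = [["x", "iao", "zh", "u", "p", "ei", "q", "i"]]  # from the target DB
pairs = [make_pair(seq) for seq in db_sequences for _ in range(100)]

# Syllable vectors trained on the database's syllable sequences.
w2v = Word2Vec(sentences=db_sequences, vector_size=32, window=2, min_count=1)
print(w2v.wv["iao"].shape)  # (32,)
```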
And S230, acquiring the target preset character string according to the second syllable sequence.
The second syllable sequence output by the second model is the model's prediction of the syllable sequence of the proper character string corresponding to the first syllable sequence. However, because the sample syllable sequences in the second samples are generated by random replacement over the preset character strings of the target database, a randomly replaced syllable sequence may not correspond to any recognition result of speech a user would actually utter. For example, for the preset character string "小猪佩奇" (Peppa Pig) with syllable sequence "x, iao, zh, u, p, ei, q, i", random replacement might generate sample syllable sequences such as "x, iao, k, u, p, ei, t, i" or "t, iao, sh, u, p, ei, t, i"; in practice a user is unlikely to produce speech recognized as those sequences, and the actual recognition results would mostly be sequences such as "x, iao, z, u, p, ei, q, i" or "x, iao, z, u, b, ei, q, i". That is, the training samples of the second model deviate from real data, which limits the capability of the second model; the second model therefore may not fully reach its training goal, and the second syllable sequence it outputs for a first syllable sequence may not be the syllable sequence of any preset character string. In other words, the target database may or may not contain a preset character string whose syllable sequence is consistent with the second syllable sequence.
In one possible implementation, obtaining the target preset character string according to the second syllable sequence includes:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is one, taking the preset character string with the syllable sequence consistent with the second syllable sequence as the target preset character string.
When exactly one preset character string in the target database has a syllable sequence consistent with the second syllable sequence, the second syllable sequence is the syllable sequence of a proper character string, and that preset character string is directly taken as the target preset character string.
In another possible implementation, the target database stores the usage frequency of each preset character string in historical usage data, and obtaining the target preset character string according to the second syllable sequence includes:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is multiple, taking the preset character string with the highest use frequency in the preset character strings with the syllable sequence consistent with the second syllable sequence as the target preset character string.
The usage frequency is the frequency with which each preset character string in the target database is used by users; it may be obtained from the occurrence frequency of each preset character string in the statistical data gathered when the target database is built, or from the frequency with which users input each preset character string. Several preset character strings in the target database may share the same syllable sequence; when multiple preset character strings have a syllable sequence consistent with the second syllable sequence, the one with the highest usage frequency among them is taken as the target preset character string. A sketch of this exact-match selection is given below.
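A minimal sketch of the exact-match selection just described, reusing the hypothetical Entry records introduced earlier:

```python
from typing import List, Optional, Sequence

def select_exact(second_seq: Sequence[str], db: List[Entry]) -> Optional[str]:
    """Exact syllable match; ties broken by usage frequency."""
    matches = [e for e in db if tuple(second_seq) == e[1]]
    if not matches:
        return None  # fall back to the edit-distance search described below
    return max(matches, key=lambda e: e[2])[0]
```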
As explained above, the target database may contain no preset character string whose syllable sequence is consistent with the second syllable sequence; therefore, in one possible implementation, obtaining the target preset character string according to the second syllable sequence includes:
and when the preset character string with the syllable sequence consistent with the second syllable sequence does not exist in the target database, taking the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string.
Specifically, when no preset character string in the target database has a syllable sequence consistent with the second syllable sequence, the second syllable sequence output by the second model is still not the syllable sequence of a proper character string and needs further correction; in this case the target preset character string is obtained according to the correlation between the second syllable sequence and the preset character strings in the target database.
Specifically, in this embodiment the edit distance between syllable sequences is used to evaluate the correlation between the second syllable sequence and a preset character string: the preset character string with the highest correlation is the one whose syllable sequence has the smallest edit distance to the second syllable sequence. Edit distance is an index used in the field of language processing to measure the degree of difference between two strings; it is the number of single-character edits required to convert one string into the other. The larger the edit distance between two strings, the greater their difference; conversely, the smaller the edit distance, the smaller the difference. A standard dynamic-programming computation of this distance is sketched below.
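The sketch computes the standard Levenshtein edit distance over syllable sequences; this is textbook material rather than anything patent-specific.

```python
from typing import Sequence

def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of single-syllable insertions/deletions/substitutions."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

print(edit_distance(["x", "iao"], ["t", "iao"]))  # 1
```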
As shown in fig. 3, the step of using the preset character string with the highest correlation with the second syllable sequence in the target database as the target preset character string includes:
s231, selecting at least one first preset character string from the target database, wherein the syllable sequence of each preset character string and the second syllable sequence comprise at least a preset number of same syllables.
Specifically, the target database contains many preset character strings and hence many syllable sequences; computing the edit distance between the second syllable sequence and the syllable sequence of every preset character string, and then taking the preset character string whose syllable sequence has the smallest edit distance, would clearly consume a large amount of computing resources. Therefore, only those preset character strings whose syllable sequences share at least a preset number of syllables with the second syllable sequence are selected as first preset character strings. The preset number may be, for example, 3, 6 or 8. Evidently, the larger the preset number, the fewer the first preset character strings and the smaller the cost of finding the syllable sequence with the minimum edit distance to the second syllable sequence; however, because the candidate set is small, the preset character string that actually has the highest correlation with the second syllable sequence may be missed, making the result inaccurate. The smaller the preset number, the more first preset character strings there are and the greater the cost of finding the most correlated preset character string, but the larger candidate set makes the result more accurate.
S232, respectively obtaining the editing distance between the syllable sequence of the at least one first preset character string and the second syllable sequence.
After the at least one first preset character string is obtained, the editing distance between the syllable sequence of each first preset character string and the second syllable sequence is calculated respectively.
And S233, taking the first preset character string corresponding to the syllable sequence with the minimum editing distance as the target preset character string.
If, among the at least one first preset character string, the syllable sequence of a particular first preset character string has the minimum edit distance to the second syllable sequence, then its difference from the second syllable sequence is the smallest of all the first preset character strings. After the syllable sequence with the minimum edit distance is determined, the first preset character string corresponding to it is obtained; this is the preset character string with the highest correlation with the second syllable sequence, and in this embodiment it is taken as the target preset character string.
In one possible implementation, when several syllable sequences share the same minimum edit distance, the usage frequency of the first preset character string corresponding to each of them is looked up in the target database, and the first preset character string with the highest usage frequency is taken as the target preset character string. A sketch of steps S231-S233 including this tie-break follows.
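A minimal sketch of steps S231-S233 including the frequency tie-break, reusing edit_distance() and the Entry records from the earlier sketches; it assumes at least one candidate survives the shared-syllable filter.

```python
from collections import Counter
from typing import List, Sequence

def shared_syllables(a: Sequence[str], b: Sequence[str]) -> int:
    """Syllables common to both sequences, counted with multiplicity."""
    return sum((Counter(a) & Counter(b)).values())

def select_by_distance(second_seq: Sequence[str],
                       db: List[Entry],
                       preset_number: int = 3) -> str:
    # S231: keep only candidates sharing >= preset_number syllables.
    candidates = [e for e in db
                  if shared_syllables(second_seq, e[1]) >= preset_number]
    # S232/S233: minimum edit distance; ties broken by higher usage frequency.
    return min(candidates,
               key=lambda e: (edit_distance(second_seq, e[1]), -e[2]))[0]
```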
Referring to fig. 1 again, after the target preset character string is obtained, the voice recognition method further includes:
s300, replacing the first character string in the text with the target preset character string, and taking the replaced text as the recognition result of the voice information.
As explained above, the target preset character string is the proper character string corresponding to the first character string. Therefore, after the target preset character string is obtained, it is used to replace the first character string in the text; the replaced text then contains the corrected proper character string and is output as the recognition result of the voice information.
It can be seen from the above embodiment that, in the speech recognition method provided by the present invention, after the acquired speech is converted into text, the first character string that should in principle be a proper character string is extracted from the text, and when the first character string is not a proper character string, the proper character string corresponding to it is acquired, thereby improving the recognition accuracy of proper character strings in speech recognition.
Example two
Based on the above embodiment, the present invention further provides an intelligent terminal, a schematic block diagram of which may be as shown in fig. 4. The intelligent terminal comprises a processor, a memory, a network interface, a display screen and a temperature sensor connected through a system bus. The processor of the intelligent terminal provides computing and control capability. The memory of the intelligent terminal comprises a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The network interface of the intelligent terminal is used to connect to and communicate with external terminals through a network. The computer program, when executed by the processor, implements the speech recognition method. The display screen of the intelligent terminal may be a liquid crystal display or an electronic ink display, and the temperature sensor is arranged inside the intelligent terminal in advance to detect the current operating temperature of the internal components.
It will be understood by those skilled in the art that the block diagram shown in fig. 4 is only a block diagram of part of the structure related to the solution of the present invention and does not limit the intelligent terminal to which the solution is applied; a specific intelligent terminal may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.
In one embodiment, an intelligent terminal is provided, which includes a memory and a processor, the memory stores a computer program, and the processor can realize at least the following steps when executing the computer program:
acquiring a text corresponding to voice information, extracting a first character string in the text, and matching the first character string with a preset character string in a target database;
when a preset character string which is the same as the first character string does not exist in the target database, acquiring a target preset character string which corresponds to the first character string in the target database;
and replacing the first character string in the text with the target preset character string, and taking the replaced text as the recognition result of the voice information.
Wherein the matching the first character string with a preset character string in a target database comprises:
acquiring professional categories corresponding to the voice information;
selecting a database corresponding to the professional category from at least one preset database according to the professional category, and taking the database as the target database;
and matching the first character string with a preset character string in the target database.
Wherein the extracting of the first character string in the text specifically includes:
inputting the text into a first model corresponding to the professional category, and acquiring the first character string output by the first model;
the first model is trained according to a first data set, the first data set comprises a plurality of groups of first samples, and each group of first samples comprises sample texts in the professional categories and sample first character strings corresponding to the sample texts.
Wherein the obtaining of the target preset character string corresponding to the first character string in the target database comprises:
acquiring a first syllable sequence corresponding to the first character string;
inputting the first syllable sequence into a pre-trained second model to obtain a second syllable sequence output by the second model;
the second model is trained according to a second data set, the second data set comprises a plurality of groups of second samples, each group of second samples comprises a sample syllable sequence and a sample second syllable sequence corresponding to the sample syllable sequence, and the sample second syllable sequence is a syllable sequence corresponding to a preset character string in the target database;
and determining the target preset character string according to the second syllable sequence.
Wherein the obtaining the target preset character string according to the second syllable sequence comprises:
and when the preset character string with the syllable sequence consistent with the second syllable sequence does not exist in the target database, taking the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string.
Wherein the obtaining the target preset character string according to the second syllable sequence comprises:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is one, taking the preset character string with the syllable sequence consistent with the second syllable sequence as the target preset character string.
Wherein, the target database stores the use frequency of each preset character string in the historical use data, and the obtaining the target preset character string according to the second syllable sequence comprises:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is multiple, taking the preset character string with the highest use frequency in the preset character strings with the syllable sequence consistent with the second syllable sequence as the target preset character string.
The preset character string with the highest correlation degree with the second syllable sequence in the target database is a preset character string corresponding to the syllable sequence with the minimum editing distance with the second syllable sequence; the step of using the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string comprises:
selecting at least one first preset character string from the target database, wherein the syllable sequence of each first preset character string and the second syllable sequence comprise at least a preset number of same syllables;
respectively acquiring the editing distance between the syllable sequence of the at least one first preset character string and the second syllable sequence;
and taking the first preset character string corresponding to the syllable sequence with the minimum editing distance as the target preset character string.
Example three
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The present invention provides a storage medium storing one or more programs executable by one or more processors to implement a speech recognition method according to an embodiment.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (10)

1. A speech recognition method, characterized in that the speech recognition method comprises:
acquiring a text corresponding to voice information, extracting a first character string in the text, and matching the first character string with a preset character string in a target database;
when a preset character string which is the same as the first character string does not exist in the target database, acquiring a target preset character string which corresponds to the first character string in the target database;
and replacing the first character string in the text with the target preset character string, and taking the replaced text as the recognition result of the voice information.
2. The speech recognition method of claim 1, wherein matching the first string with a predetermined string in a target database comprises:
acquiring professional categories corresponding to the voice information;
selecting a database corresponding to the professional category from at least one preset database according to the professional category, and taking the database as the target database;
and matching the first character string with a preset character string in the target database.
3. The speech recognition method according to claim 2, wherein the extracting the first character string from the text specifically comprises:
inputting the text into a first model corresponding to the professional category, and acquiring the first character string output by the first model;
the first model is trained according to a first data set, the first data set comprises a plurality of groups of first samples, and each group of first samples comprises sample texts in the professional categories and sample first character strings corresponding to the sample texts.
4. The speech recognition method according to claim 1, wherein the obtaining of the target preset character string corresponding to the first character string in the target database specifically comprises:
acquiring a first syllable sequence corresponding to the first character string;
inputting the first syllable sequence into a pre-trained second model to obtain a second syllable sequence output by the second model;
the second model is trained according to a second data set, the second data set comprises a plurality of groups of second samples, each group of second samples comprises a sample syllable sequence and a sample second syllable sequence corresponding to the sample syllable sequence, and the sample second syllable sequence is a syllable sequence corresponding to a preset character string in the target database;
and determining the target preset character string according to the second syllable sequence.
5. The speech recognition method of claim 4, wherein the obtaining the target pre-determined string according to the second syllable sequence comprises:
and when the preset character string with the syllable sequence consistent with the second syllable sequence does not exist in the target database, taking the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string.
6. The speech recognition method of claim 4, wherein the obtaining the target pre-determined string according to the second syllable sequence comprises:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is one, taking the preset character string with the syllable sequence consistent with the second syllable sequence as the target preset character string.
7. The speech recognition method of claim 4, wherein the target database stores usage frequencies of respective preset character strings in historical usage data, and the obtaining the target preset character string according to the second syllable sequence comprises:
and when the number of the preset character strings with the syllable sequence consistent with the second syllable sequence in the target database is multiple, taking the preset character string with the highest use frequency in the preset character strings with the syllable sequence consistent with the second syllable sequence as the target preset character string.
8. The speech recognition method according to claim 5, wherein the predetermined string in the target database having the highest correlation with the second syllable sequence is a predetermined string corresponding to a syllable sequence having the smallest edit distance with respect to the second syllable sequence; the step of using the preset character string with the highest correlation degree with the second syllable sequence in the target database as the target preset character string comprises:
selecting at least one first preset character string from the target database, wherein the syllable sequence of each first preset character string and the second syllable sequence comprise at least a preset number of same syllables;
respectively acquiring the editing distance between the syllable sequence of the at least one first preset character string and the second syllable sequence;
and taking the first preset character string corresponding to the syllable sequence with the minimum editing distance as the target preset character string.
9. An intelligent terminal, characterized in that, intelligent terminal includes: a processor, a storage medium communicatively coupled to the processor, the storage medium adapted to store a plurality of instructions; the processor is adapted to invoke instructions in the storage medium to perform a method of speech recognition according to any of the preceding claims 1-8.
10. A storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the speech recognition method of any one of claims 1-8.
CN201911403451.9A 2019-12-27 2019-12-27 Voice recognition method, intelligent terminal and storage medium Pending CN113053359A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911403451.9A CN113053359A (en) 2019-12-27 2019-12-27 Voice recognition method, intelligent terminal and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911403451.9A CN113053359A (en) 2019-12-27 2019-12-27 Voice recognition method, intelligent terminal and storage medium

Publications (1)

Publication Number Publication Date
CN113053359A 2021-06-29

Family

ID=76507514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911403451.9A Pending CN113053359A (en) 2019-12-27 2019-12-27 Voice recognition method, intelligent terminal and storage medium

Country Status (1)

Country Link
CN (1) CN113053359A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120110751A (en) * 2011-03-30 2012-10-10 포항공과대학교 산학협력단 Speech processing apparatus and method
CN107832301A (en) * 2017-11-22 2018-03-23 北京百度网讯科技有限公司 Participle processing method, device, mobile terminal and computer-readable recording medium
CN108091328A (en) * 2017-11-20 2018-05-29 北京百度网讯科技有限公司 Speech recognition error correction method, device and readable medium based on artificial intelligence
CN109065054A (en) * 2018-08-31 2018-12-21 出门问问信息科技有限公司 Speech recognition error correction method, device, electronic equipment and readable storage medium storing program for executing
CN109145276A (en) * 2018-08-14 2019-01-04 杭州智语网络科技有限公司 A kind of text correction method after speech-to-text based on phonetic
CN109599114A (en) * 2018-11-07 2019-04-09 重庆海特科技发展有限公司 Method of speech processing, storage medium and device
CN109727598A (en) * 2018-12-28 2019-05-07 浙江省公众信息产业有限公司 Intension recognizing method under big noise context
CN109918485A (en) * 2019-01-07 2019-06-21 口碑(上海)信息技术有限公司 The method and device of speech recognition vegetable, storage medium, electronic device
CN110473523A (en) * 2019-08-30 2019-11-19 北京大米科技有限公司 A kind of audio recognition method, device, storage medium and terminal
CN110517692A (en) * 2019-08-30 2019-11-29 苏州思必驰信息科技有限公司 Hot word audio recognition method and device

Similar Documents

Publication Publication Date Title
CN109635270B (en) Bidirectional probabilistic natural language rewrite and selection
JP5901001B1 (en) Method and device for acoustic language model training
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
JP4705023B2 (en) Speech recognition apparatus, speech recognition method, and program
CN109754809B (en) Voice recognition method and device, electronic equipment and storage medium
US8527272B2 (en) Method and apparatus for aligning texts
CN106570180B (en) Voice search method and device based on artificial intelligence
US11031009B2 (en) Method for creating a knowledge base of components and their problems from short text utterances
CN111402861B (en) Voice recognition method, device, equipment and storage medium
CN101221576B (en) Input method and device capable of implementing automatic translation
CN111611349A (en) Voice query method and device, computer equipment and storage medium
CN111312209A (en) Text-to-speech conversion processing method and device and electronic equipment
CN113450774B (en) Training data acquisition method and device
CN113225612B (en) Subtitle generating method, device, computer readable storage medium and electronic equipment
CN114817465A (en) Entity error correction method and intelligent device for multi-language semantic understanding
CN112530404A (en) Voice synthesis method, voice synthesis device and intelligent equipment
CN111326144A (en) Voice data processing method, device, medium and computing equipment
CN112559725A (en) Text matching method, device, terminal and storage medium
CN110929514B (en) Text collation method, text collation apparatus, computer-readable storage medium, and electronic device
CN112447173A (en) Voice interaction method and device and computer storage medium
CN113470617B (en) Speech recognition method, electronic equipment and storage device
CN113053359A (en) Voice recognition method, intelligent terminal and storage medium
CN115881108A (en) Voice recognition method, device, equipment and storage medium
CN114171000A (en) Audio recognition method based on acoustic model and language model
CN110728137B (en) Method and device for word segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination