CN111128185B - Method, device, terminal and storage medium for converting voice into characters


Info

Publication number
CN111128185B
Authority
CN
China
Prior art keywords
information
target
error correction
text
character
Prior art date
Legal status
Active
Application number
CN201911358259.2A
Other languages
Chinese (zh)
Other versions
CN111128185A (en)
Inventor
曲季 (Qu Ji)
李智勇 (Li Zhiyong)
苏少炜 (Su Shaowei)
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN201911358259.2A
Publication of CN111128185A
Application granted
Publication of CN111128185B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Abstract

The invention provides a method, a device, a terminal and a storage medium for converting speech into text. Speech recognition is performed on a target speech to convert the target speech into first text information; scene information currently related to a preset first error correction rule is acquired, the scene information comprising first scene information related to the terminal and/or second scene information related to the network environment; and homophone error correction is performed on the first text information based on the scene information to generate target text information. By correcting homophones during the speech-to-text process on the basis of the scene information related to the preset first error correction rule, the method and device improve the accuracy of speech-to-text conversion and thereby the user experience.

Description

Method, device, terminal and storage medium for converting voice into characters
Technical Field
The present invention relates to the field of speech-to-text technology, and more particularly to a method, a device, a terminal and a storage medium for converting speech into text.
Background
At present, smart speakers support a voice control function, which is built on a speech-to-text process. Inaccurate speech-to-text results often cause voice control errors and thus degrade the user experience.
Disclosure of Invention
In view of this, the present invention provides a method, a device, a terminal and a storage medium for converting speech into text, which correct homophone errors during the speech-to-text process so as to improve the accuracy of the conversion.
In order to achieve the above object, the following solutions are proposed:
the first aspect of the present invention discloses a method for converting speech into text, which comprises:
performing speech recognition on a target speech to convert the target speech into first text information;
acquiring scene information currently related to a preset first error correction rule, wherein the scene information comprises first scene information related to a terminal and/or second scene information related to a network environment;
and performing homophone error correction on the first text information based on the scene information to generate target text information.
Optionally, the method further includes:
performing homophone error correction on the first text information by adopting a second error correction rule to obtain second text information, wherein the second error correction rule is related to common words, descriptors of functions existing in the terminal, and/or sentence fluency;
the performing homophone error correction on the first text information based on the scene information to generate target text information comprises: performing homophone error correction on the second text information based on the scene information to generate the target text information.
Optionally, the performing homophone error correction on the second text information based on the scene information to generate target text information comprises:
performing homophone error correction on the second text information based on current scene information in the scene information to generate third text information;
detecting whether the second text information is the same as the third text information;
if the second text information is different from the third text information, determining the third text information as the target text information;
if the second text information is the same as the third text information, acquiring historical scene information in the scene information;
and performing homophone error correction on the third text information according to the historical scene information to generate the target text information.
Optionally, if the second text information is different from the third text information, the method further includes:
determining target information in the third text information, wherein the target information is the information that replaces the corrected information in the second text information to generate the third text information;
detecting association information between the target information and the context related to the target information in the third text information, wherein the association information is proportional to the degree of association between the target information and the context;
judging whether the association information exceeds preset target association information;
if the association information does not exceed the target association information, executing the step of acquiring the historical scene information in the scene information;
the determining the third text information as the target text information comprises: if the association information exceeds the target association information, determining the third text information as the target text information.
Optionally, the first scene information relates to a usage scenario of the terminal, a user profile of a registered user of the terminal, content of a display page of the terminal, and/or a state of playing content in the terminal.
Optionally, the usage scenario of the terminal indicates a play category, where the play category is a book category or a picture category; the user profile is related to on-demand behavior of the registered user and/or preferences of the registered user; the state of the playing content in the terminal comprises all information of a playing source to which the playing content belongs.
Optionally, the second scene information is dynamically updated.
Optionally, the second scene information includes the content recommended by the merchant to the terminal within a recent preset historical time period.
The second aspect of the present invention discloses a device for converting speech into text, comprising:
a conversion unit, configured to perform speech recognition on a target speech to convert the target speech into first text information;
an acquisition unit, configured to acquire scene information currently related to a preset first error correction rule, wherein the scene information comprises first scene information related to a terminal and/or second scene information related to a network environment;
and a target text information generating unit, configured to perform homophone error correction on the first text information based on the scene information to generate target text information.
The third aspect of the present invention discloses a terminal for converting speech into text, comprising:
a speech recognition module, configured to perform speech recognition on a target speech and convert the target speech into first text information;
and a speech processing module, configured to acquire scene information currently related to a preset first error correction rule, wherein the scene information comprises first scene information related to the terminal and/or second scene information related to a network environment, and to perform homophone error correction on the first text information based on the scene information to generate target text information.
In a fourth aspect, the present invention discloses a computer-readable storage medium in which computer-executable instructions are stored, the computer-executable instructions being configured to perform the method for converting speech into text disclosed in any one of the implementations of the first aspect of the present invention.
The invention provides a method, a device, a terminal and a storage medium for converting speech into text: speech recognition is performed on a target speech to convert it into first text information; scene information currently related to a preset first error correction rule is acquired, the scene information comprising first scene information related to the terminal and/or second scene information related to the network environment; and homophone error correction is performed on the first text information based on the acquired scene information to generate the target text information. In the technical scheme provided by the invention, homophones are corrected during the speech-to-text process on the basis of the scene information related to the preset first error correction rule, which improves the accuracy of speech-to-text conversion and thereby the user experience.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for converting speech into text according to an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another method for converting speech into text according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a device for converting speech into text according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a terminal for converting speech into text according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another identical element in a process, method, article or apparatus that comprises the element.
It can be seen from the foregoing background that current smart speakers all support a voice control function, and the basis of voice control is the speech-to-text process: speech is first converted into text through Automatic Speech Recognition (ASR), and the text converted from the speech is then interpreted through Natural Language Processing (NLP). Because many words sound identical, text obtained in this way often contains homophone errors.
Therefore, through research the inventor of the present application provides the method for converting speech into text shown in fig. 1. The method is applied to a terminal, and based on it, homophone errors can be corrected during the speech-to-text process, so that the accuracy of the conversion and the user experience are improved.
As a preferred mode of the embodiment of the present invention, the terminal to which the method for converting speech into text is applied may be a smart speaker. The specific terminal may be set as needed, and the embodiment of the present invention is not limited in this respect.
Referring to fig. 1, an embodiment of the present invention provides a flow diagram of a method for converting a voice into a text, where the method specifically includes the following steps:
s101: and performing voice recognition on the target voice to convert the target voice into first character information.
It should be noted that the target speech may be, for example, "hello", "view the previous one" or "play episode 5 of 'Ounce'"; the specific content of the target speech may be set according to the actual situation, which is not limited in the embodiment of the present invention.
In specifically executing step S101, speech recognition is performed on the target speech to convert it into the first text information. For example, when the target speech is "hello", speech recognition converts it into "hello", and "hello" is the first text information.
In the embodiment of the application, the target speech can be converted into the first text information by performing speech recognition on it through ASR.
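For illustration, a minimal sketch of step S101 follows. The embodiment only requires some ASR engine; the third-party Python package speech_recognition, its Google Web Speech backend and the file path used here are assumptions, not part of the disclosed method.

```python
# A minimal sketch of step S101, assuming the third-party "speech_recognition"
# package as the ASR engine (the embodiment does not prescribe one).
import speech_recognition as sr

def speech_to_first_text(audio_path: str) -> str:
    """Convert the target speech (a WAV file) into first text information."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)  # read the whole file
    # language="zh-CN": the homophone examples in this description are Chinese
    return recognizer.recognize_google(audio, language="zh-CN")

# first_text = speech_to_first_text("target_speech.wav")
```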
S102: Acquire scene information currently related to a preset first error correction rule.
The scene information includes first scene information related to the terminal and/or second scene information related to the network environment.
In an embodiment of the present application, the scene information includes first scene information related to the terminal and/or second scene information related to the network environment, that is, either one of the two or a combination of both. The specific content of the scene information may be set as needed, which is not limited herein.
In specifically executing step S102, the scene information currently related to the preset first error correction rule is acquired. When that scene information includes only the first scene information related to the terminal, the first scene information currently related to the rule is acquired; when it includes only the second scene information related to the network environment, the second scene information currently related to the rule is acquired; and when it includes both, the first scene information and the second scene information currently related to the rule are both acquired.
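The three cases above can be modeled by a structure with optional fields, matching the "and/or" of the claim. The sketch below is illustrative only; all class, field and function names are assumptions rather than terminology from the disclosure.

```python
# A sketch of the scene information of step S102. Either part may be absent,
# mirroring "first scene information and/or second scene information".
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Rule:                              # preset first error correction rule
    uses_terminal_scene: bool = True
    uses_network_scene: bool = True

@dataclass
class TerminalScene:                     # first scene information
    usage_scenario: Optional[str] = None      # e.g. "reading" -> book category
    user_profile: dict = field(default_factory=dict)
    display_page: Optional[str] = None
    play_state: Optional[dict] = None         # e.g. {"source": "Ounce", "episodes": 40}

@dataclass
class NetworkScene:                      # second scene information
    trending_content: list = field(default_factory=list)
    merchant_recommendations: list = field(default_factory=list)

@dataclass
class SceneInfo:
    terminal: Optional[TerminalScene] = None
    network: Optional[NetworkScene] = None

def fetch_terminal_scene() -> TerminalScene:
    # Placeholder: a real terminal would report its actual state here.
    return TerminalScene(usage_scenario="reading",
                         play_state={"source": "Ounce", "episodes": 40})

def fetch_network_scene() -> NetworkScene:
    # Placeholder: dynamically refreshed in a real system (see the notes below).
    return NetworkScene(trending_content=["Snow Dog"])

def acquire_scene_info(rule: Rule) -> SceneInfo:
    """Fetch only the parts that the preset first error correction rule references."""
    return SceneInfo(
        terminal=fetch_terminal_scene() if rule.uses_terminal_scene else None,
        network=fetch_network_scene() if rule.uses_network_scene else None,
    )
```

For instance, acquire_scene_info(Rule(uses_network_scene=False)) corresponds to the first case, where only the terminal-related first scene information is acquired.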
In the embodiment of the application, the first scene information is related to the usage scenario of the terminal, the user profile of a registered user of the terminal, the content of the terminal display page and/or the state of the content played in the terminal, that is, any one or more of these items. They may be configured as needed, and the embodiment of the present invention is not limited in this respect.
As a preferred mode of the embodiment of the present application, the usage scenario of the terminal indicates a play category, where the play category may be a book category or a picture category. For example, if the usage scenario of the terminal is reading, the indicated play category is the book category; if the usage scenario is viewing photos, the indicated play category is the picture category. The above is only a preferred form of the usage scenario provided by the embodiment of the present invention; the specific information related to the usage scenario may be set as needed and is not limited herein.
In embodiments of the present application, the user profile is associated with the on-demand behavior of the registered user and/or the preferences of the registered user. The on-demand behavior may be the user's current on-demand behavior or the on-demand behavior over a past period of time; for example, a registered user may have requested series A, series B and so on within the past month. The preferences may be the genres of TV series and/or books that the registered user likes, for example ghost films, comedies, period costume dramas and the like. Both may be set according to the actual situation, and the embodiment of the present invention is not limited.
In the embodiment of the application, the content of the terminal display page may be text information and/or voice information on the terminal display page.
In the embodiment of the present application, the state of the content played in the terminal includes all information of the play source to which the content belongs. For example, when the content currently played in the terminal is episode 5 of drama A, the play source is drama A; if drama A has 40 episodes in total, all information of the play source is the 40 episodes of drama A, and the state of the played content is determined accordingly.
In the embodiment of the application, the second scene information is related to trending network content and/or content recommended by a merchant to the terminal, and the second scene information is dynamically updated.
In the embodiment of the application, the second scene information includes the content recommended by the merchant to the terminal within a recent preset historical time period. For example, this period may be preset to the most recent month; since the second scene information is dynamically updated, the merchant-recommended content it contains is always the content recommended within the most recent month.
It should be noted that the trending network content may be TV series, movies, animations and so on, for example TV series A, TV series B, movie A and movie B. When two homophonic on-demand titles exist, say TV series A and TV series C, and TV series A is the trending network content, then TV series A is used to correct the first text information in the process of performing homophone error correction on the first text information based on the scene information to generate the target text information. The specific trending network content may be set according to the actual situation, and the embodiment of the present invention is not limited thereto.
As a preferred mode of the embodiment of the application, the content currently recommended by the merchant to the terminal and/or the content recommended by the merchant within the recent preset historical time period may be used as the merchant-recommended content. For example, if the merchant recommended series A, series B, movie A and movie B to the terminal within that period, these constitute the content recommended to the terminal by the merchant. The specific content recommended by the merchant may be set according to the actual situation, and the embodiment of the present invention is not limited.
Note that the recent historical time period is set in advance, for example to the most recent month: if the merchant recommended series A, series B, movie A and movie B to the terminal within the last month, the merchant-recommended content is series A, series B, movie A and movie B.
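Because the second scene information is dynamically updated over a sliding window, it can be modeled as a time-stamped cache that discards entries older than the preset period. The sketch below assumes a 30-day window and illustrative titles.

```python
# A sketch of dynamically updated second scene information: entries older
# than the preset recent historical time period are dropped on every read.
import time

WINDOW_SECONDS = 30 * 24 * 3600  # "the most recent month", per the example

class SecondSceneInfo:
    def __init__(self) -> None:
        self._entries = []  # list of (timestamp, title) pairs

    def add(self, title: str) -> None:
        """Record a trending or merchant-recommended title."""
        self._entries.append((time.time(), title))

    def current(self) -> list:
        """Titles trending/recommended within the recent preset period."""
        cutoff = time.time() - WINDOW_SECONDS
        self._entries = [(t, s) for t, s in self._entries if t >= cutoff]
        return [s for _, s in self._entries]

scene = SecondSceneInfo()
scene.add("Snow Dog")      # trending network content
scene.add("series A")      # merchant recommendation
print(scene.current())     # -> ['Snow Dog', 'series A']
```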
In the embodiment of the present application, scene information currently related to a preset first error correction rule may be acquired through NLP.
S103: Perform homophone error correction on the first text information based on the scene information to generate target text information.
In specifically executing step S103, after the scene information related to the preset first error correction rule has been acquired, homophone error correction is performed on the first text information based on that scene information to generate the target text information.
For a better understanding of the above, the following examples are given.
For example, suppose the current usage scenario of the terminal is reading and the target speech is "next chapter". Speech recognition converts the target speech into first text information; the usage scenario of the terminal, which indicates the book play category, is acquired; and homophone error correction on the first text information based on the book category generates the target text information ("next chapter").
Or, when the target speech is "play episode 5 of 'Ounce'", speech recognition converts the target speech into first text information containing a homophonic mis-transcription; the state of the content played in the current terminal is acquired, namely all information of the play source to which that content belongs (the play source "Ounce" has 40 episodes in total); it is determined that the play source has no content matching the mis-transcribed text; and homophone error correction is performed on the first text information based on the state of the played content to generate the target text information ("play episode 5 of 'Ounce'").
Or, when the target speech is "play 'Snow Dog'", speech recognition mis-transcribes it into first text information ("play 'Snow Storm'", a homophone of the intended title); the trending network content ("Snow Dog") related to the preset first error correction rule is acquired, and homophone error correction based on it generates the target text information ("play 'Snow Dog'"). Conversely, if the target speech really is "play 'Snow Storm'" and error correction according to the trending network content still yields "play 'Snow Dog'", the TV series "Snow Dog" is recommended to the user; if the user then inputs the target speech "play 'Snow Storm'" again, the result is switched directly to "play 'Snow Storm'".
In the embodiment of the application, homophone error correction can be performed on the first text information based on the scene information through NLP to generate the target text information.
In the embodiment of the application, after the target text information has been generated, it is displayed on the screen of the terminal for the user to view.
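To make step S103 concrete, the sketch below corrects homophones by checking each homophone group against a vocabulary drawn from the scene information. The homophone groups and titles are assumptions for illustration; the embodiment only specifies that the scene information drives the correction, and in practice candidates would be keyed by pronunciation over a full lexicon.

```python
# A sketch of scene-based homophone error correction (step S103).
# The homophone groups below stand in for words with identical pronunciation.
HOMOPHONE_GROUPS = [
    {"Snow Storm", "Snow Dog"},
    {"next piece", "next chapter", "next sheet"},
]

def correct_homophones(first_text: str, scene_vocab: set) -> str:
    """Replace any word whose homophone appears in the scene vocabulary."""
    target_text = first_text
    for group in HOMOPHONE_GROUPS:
        present = [w for w in group if w in target_text]
        preferred = group & scene_vocab
        for wrong in present:
            if preferred and wrong not in preferred:
                target_text = target_text.replace(wrong, next(iter(preferred)))
    return target_text

# Usage: the trending network content supplies the scene vocabulary.
print(correct_homophones("play Snow Storm", {"Snow Dog"}))  # -> play Snow Dog
```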
The invention provides a method for converting speech into text: speech recognition converts the target speech into first text information, scene information currently related to a preset first error correction rule is acquired, and homophone error correction is performed on the first text information based on it to generate the target text information. In the technical scheme provided by the invention, homophones are corrected during the speech-to-text process on the basis of the scene information related to the preset first error correction rule, which improves the accuracy of the conversion and the user experience.
Further, another method for converting speech into text is provided in the embodiments of the present invention, as shown in fig. 2, which specifically includes the following steps:
s201: and performing voice recognition on the target voice to convert the target voice into first character information.
In the process of specifically executing step S201, the specific implementation principle and the execution process of step S201 are the same as those of step S101 disclosed in fig. 1, and reference may be made to corresponding parts in fig. 1, which is not described herein again.
S202: Perform homophone error correction on the first text information by adopting the second error correction rule to obtain second text information.
The second error correction rule is related to common words, descriptors of functions existing in the terminal, and/or sentence fluency.
In the embodiment of the application, the common words include preset common words and common words related to the user's usage habits. For example, the preset common words may be "good morning" or "good evening", and a usage-habit word may be a dialect term habitually used by users from Northeast China whose standard equivalent is "dirty". The functions existing in the terminal may be a photographing function, a reading function, a video playing function and so on; when the function existing in the terminal is the reading function, the corresponding descriptor may be "chapter" or "page". The above are merely preferred examples of the common words and function descriptors provided by the embodiments of the present application; their specific content may be set as needed and is not limited herein.
In the embodiment of the present application, sentence fluency may be understood as choosing, among homophones, the word that makes the converted sentence meaningful and well-formed. For example, take two words with identical pronunciation, "homework" and "conquer": if the target speech is "I like doing homework", "homework" is taken as the meaningful homophone when correcting the first text information converted from the target speech; if the target speech is "conquer the enemy formation", "conquer" is taken as the meaningful homophone for the correction.
In specifically executing step S202, after the target speech has been converted into the first text information, homophone error correction is performed on the first text information using the second error correction rule to obtain the second text information.
For example, suppose the second error correction rule is related to common words and sentence fluency, and the common words include "good morning" and "good evening". When the target speech is "good morning, you look really beautiful today", speech recognition converts it into first text information in which "morning" is mis-transcribed as a homophone ("good date, you look really beautiful today"); applying the second error correction rule, namely the common words and sentence fluency, yields the second text information ("good morning, you look really beautiful today").
When the second error correction rule is related only to the descriptors of functions existing in the terminal, the terminal has only a reading function, and the descriptor of that function is set to "chapter", then when the target speech is "next chapter", speech recognition converts it into first text information containing the wrong homophone ("next piece"), and the descriptor ("chapter") of the function existing in the terminal is used to perform homophone error correction on the first text information ("next piece") to obtain the second text information ("next chapter").
In the embodiment of the application, homophone error correction on the first text information using the second error correction rule to obtain the second text information can be implemented within ASR.
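A minimal sketch of step S202 follows; it implements only the common-word and function-descriptor parts of the second error correction rule, and the correction pairs are assumptions drawn from the examples above.

```python
# A sketch of step S202: correction using common words and the descriptors
# of functions existing in the terminal. All pairs are illustrative.
COMMON_WORD_FIXES = {
    "good date": "good morning",  # homophonic slip vs. a preset common word
    "dial": "play",               # common-word correction from the examples
}
FUNCTION_DESCRIPTOR_FIXES = {
    "reading": {"next piece": "next chapter"},  # descriptor: "chapter"
    "photo": {"next chapter": "next sheet"},    # descriptor: "sheet"
}

def second_rule_correct(first_text: str, terminal_functions: list) -> str:
    second_text = first_text
    for wrong, right in COMMON_WORD_FIXES.items():
        second_text = second_text.replace(wrong, right)
    for function in terminal_functions:
        for wrong, right in FUNCTION_DESCRIPTOR_FIXES.get(function, {}).items():
            second_text = second_text.replace(wrong, right)
    return second_text

print(second_rule_correct("next piece", ["reading"]))  # -> next chapter
```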
S203: Acquire the scene information currently related to the preset first error correction rule.
The scene information includes first scene information related to the terminal and/or second scene information related to the network environment.
In the embodiment of the application, the first scene information is related to the usage scenario of the terminal, the user profile of a registered user of the terminal, the content of the terminal display page and/or the state of the content currently played in the terminal.
It should be noted that the usage scenario of the terminal indicates a play category, and the play category is a book category or a picture category; the user profile is related to the on-demand behavior of the registered user and/or the preference of the registered user; the state of the current playing content in the terminal comprises all information of a playing source to which the current playing content in the terminal belongs.
In the embodiment of the application, the second scene information is related to trending network content and/or merchant-recommended content, and the second scene information is dynamically updated.
Preferably, the second scenario information includes recommended content from the merchant to the terminal within a recent preset historical time period.
In the process of specifically executing step S203, the specific implementation principle and the execution process of step S203 are the same as those of step S102 disclosed in fig. 1, and reference may be made to corresponding parts in fig. 1, which is not described herein again.
S204: Perform homophone error correction on the second text information based on the current scene information in the scene information to generate third text information.
In the embodiment of the application, the current scene information in the scene information includes current first scene information related to the terminal and/or current second scene information related to the network environment.
In the embodiment of the application, the current first scene information is related to the usage scenario of the current terminal, the user profile of a registered user of the current terminal, the play state of content in the current terminal and/or the content of the current terminal display page.
It should be noted that the usage scenario of the current terminal indicates the current play category, where the play category may be a book category or a picture category; the user profile of the registered user of the current terminal is related to that user's on-demand behavior and/or current preferences; the content of the current terminal display page may be text information and/or voice information on that page; and the state of the content played in the current terminal includes all information of the play source to which the currently played content belongs.
In the embodiment of the application, the current second scene information is related to the currently trending network content and/or the content currently recommended by the merchant to the terminal, and the second scene information is dynamically updated.
In specifically executing step S204, after homophone error correction has been performed on the first text information using the second error correction rule, the scene information related to the preset first error correction rule is acquired, and homophone error correction is performed on the second text information according to the current scene information therein to obtain the third text information.
For a better understanding of the above, the following examples are given.
For example, suppose the functions existing in the terminal include a reading function and a photographing function, the second error correction rule is related to common words and function descriptors, and the current scene information is related to the current usage scenario of the terminal; the descriptor of the reading function is set to "chapter" and the descriptor of the photographing function to "sheet". When the target speech is "next chapter" and the current usage scenario of the terminal is reading (that is, the indicated play category is the book category), speech recognition converts the target speech into first text information containing a wrong homophone ("next cockroach"); the function descriptor ("sheet") corrects the first text information to obtain second text information ("next sheet"); the current usage scenario of the terminal is then acquired, which indicates the book category, and homophone error correction on the second text information based on the book category generates the third text information ("next chapter").
Or, suppose the functions existing in the terminal include a video playing function, the second error correction rule is related to common words, the current scene information is related to trending network content, and the currently trending content is the movie "Luo Xiaohei War Chronicle". When the target speech is "play 'Luo Xiaohei War Chronicle'", speech recognition converts it into first text information in which "play" and part of the title are mis-transcribed as homophones ("dial 'Luo Xiaohei War Chronicle'"); the common word ("play") corrects the former, yielding second text information ("play 'Luo Xiaohei War Chronicle'" with the title still imperfect); the current scene information, namely the currently trending network content, is then acquired, and homophone error correction based on it corrects the title, generating the third text information ("play 'Luo Xiaohei War Chronicle'").
S205: Detect whether the second text information is the same as the third text information; if the second text information is different from the third text information, execute step S206; if they are the same, execute step S210.
In specifically executing step S205, the third text information generated by correcting the second text information based on the current scene information is compared with the second text information. If they are detected to be different, the third text information is determined as the target text information (subject to the association check of steps S206 to S209). If they are detected to be the same, the current scene information has not corrected the second text information; the historical scene information in the scene information is therefore acquired, and homophone error correction is performed on the generated third text information based on it to generate the target text information.
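The overall control flow of fig. 2 (steps S201 to S211) can be summarized as below, composing the earlier sketches. The association check of steps S206 to S209 is reduced here to a naive stand-in; a more faithful attention-based sketch is given after step S207.

```python
# A sketch of the flow of fig. 2 (S201-S211), reusing the earlier sketches
# (speech_to_first_text, second_rule_correct, correct_homophones).
TARGET_ASSOCIATION = 0.8  # preset target association information (assumed 80%)

def find_target_information(second_text: str, third_text: str) -> str:
    """S206: the word that replaced the corrected word in the second text."""
    for old, new in zip(second_text.split(), third_text.split()):
        if old != new:
            return new
    return ""

def naive_association(target: str, third_text: str) -> float:
    """S207 stand-in, purely illustrative: 1.0 when the target has any
    surrounding context, else 0.0 (see the attention sketch after S207)."""
    context = [w for w in third_text.split() if w != target]
    return 1.0 if target and context else 0.0

def speech_to_target_text(audio_path, terminal_functions,
                          current_vocab, historical_vocab):
    first_text = speech_to_first_text(audio_path)                       # S201
    second_text = second_rule_correct(first_text, terminal_functions)   # S202
    third_text = correct_homophones(second_text, current_vocab)         # S203/S204
    if third_text != second_text:                                       # S205
        target = find_target_information(second_text, third_text)       # S206
        if naive_association(target, third_text) > TARGET_ASSOCIATION:  # S207/S208
            return third_text                                           # S209
    # S210/S211: fall back to the historical scene information.
    return correct_homophones(third_text, historical_vocab)
```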
S206: Determine target information in the third text information, where the target information is the information that replaced the corrected information in the second text information to generate the third text information.
In the embodiment of the application, when the second text information and the third text information are detected to be different, the target information that replaced the corrected information in the second text information to generate the third text information is determined.
For example, when the target speech is "the weather is nice today, I want to go on an outing", speech recognition converts it into first text information in which the character "qi" of "weather" is mis-transcribed as its homophone "steam", and homophone error correction with the second error correction rule yields second text information that still contains "steam". Homophone error correction based on the current scene information then generates third text information ("the weather is nice today, I want to go on an outing"). Because the second and third text information differ, the corrected information in the second text information is determined to be "steam", and the information that replaced it to generate the third text information, namely "qi", is determined as the target information in the third text information.
S207: Detect the association information between the target information and the context related to the target information in the third text information.
It should be noted that the association information is proportional to the degree of association between the target information and the context; for example, the association information may be 70% or 80%.
In the embodiment of the application, after the target information in the third text information is determined, the context related to the target information in the third text information is determined, and the association information between the target information and that context is then calculated based on both.
As a preferred mode of the embodiment of the present application, the context related to the target information may be determined as follows: within the sentence of the third text information in which the target information is located, the text information before the target information is taken as the preceding context related to the target information, and the text information after it as the following context.
For example, if the third text information is "the weather is nice today, I want to go on an outing", the first sentence is "the weather is nice today" and the second sentence is "I want to go on an outing". If the target information is determined to be "qi", its sentence is the first one; the text information before "qi" in the first sentence, namely "today", is determined as the preceding context related to the target information, and the text information after "qi", namely "good", as the following context.
As another preferred mode of the embodiment of the present application, the context may be determined as follows: the sentence of the third text information in which the target information is located is determined; the text information before the target information in that sentence, together with the preceding sentence, is taken as the preceding context related to the target information; and the text information after the target information in that sentence, together with the following sentence, is taken as the following context.
For example, if the third text information is "I am full, the weather is nice today, I want to go on an outing", the first sentence is "I am full", the second is "the weather is nice today" and the third is "I want to go on an outing". If the target information is determined to be "qi", its sentence is the second one; the text before "qi" in the second sentence together with the first sentence ("I am full, today") is determined as the preceding context, and the text after "qi" together with the third sentence ("good, I want to go on an outing") as the following context.
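Both context-determination modes amount to splitting the third text information on sentence boundaries around the target. A sketch with assumed names follows; commas and periods stand in for proper sentence segmentation.

```python
# A sketch of the two context-determination modes for step S207.
import re

def split_sentences(text: str) -> list:
    return [s for s in re.split(r"[,.!?]\s*", text) if s]

def context_of(third_text: str, target: str, include_neighbors: bool = False):
    """Return (preceding context, following context) for the target.
    include_neighbors=False gives the first preferred mode (same sentence
    only); True gives the second mode (adjacent sentences included)."""
    sentences = split_sentences(third_text)
    for i, sentence in enumerate(sentences):
        if target in sentence:
            preceding, _, following = sentence.partition(target)
            if include_neighbors and i > 0:
                preceding = sentences[i - 1] + ", " + preceding
            if include_neighbors and i + 1 < len(sentences):
                following = following + ", " + sentences[i + 1]
            return preceding.strip(" ,"), following.strip(" ,")
    return "", ""

text = "I am full, the weather is nice today, I want to go on an outing"
print(context_of(text, "weather", include_neighbors=True))
# -> ('I am full, the', 'is nice today, I want to go on an outing')
```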
As a preferred mode of the embodiment of the present application, the association information between the target information and its related context may be calculated by embedding, through an attention mechanism, the target information together with every other character in the sentence of the third text information in which the target information is located, thereby obtaining the association information.
For example, if the third text information is "the weather is nice today, I want to go on an outing", the target information is "qi", the preceding context is "today" and the following context is "good", then the target information is embedded with "today", "day", "very" and "good" through the attention mechanism, thereby obtaining the association information between the target information and its related context.
As another preferred mode of the embodiment of the present application, the association information may be calculated by embedding, through an attention mechanism, the target information together with each character before it in its sentence and at least one keyword of the preceding sentence, and with each character after it in its sentence and at least one keyword of the following sentence, thereby obtaining the association information between the target information and its related context.
For example, if the third text information is "I am full, the weather is nice today, I want to go on an outing", with the three sentences as above, and the target information "qi" is in the second sentence, then the preceding context is "today" together with the first sentence ("I am full") and the following context is "good" together with the third sentence ("I want to go on an outing"). The target information is embedded, through the attention mechanism, with "today", "day", the keyword "full" of the first sentence, "good", and the keywords "want to go" and "outing" of the third sentence, thereby obtaining the association information between the target information and its related context.
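The description invokes an attention mechanism without specifying its architecture. The sketch below therefore uses randomly initialized toy embeddings and scaled dot-product attention purely to illustrate the shape of the computation; the dimension, vocabulary and scoring are all assumptions.

```python
# A toy sketch of attention-based association scoring for step S207.
# Random embeddings stand in for those of a trained model.
import numpy as np

rng = np.random.default_rng(0)
DIM = 16
_embeddings = {}

def embed(token: str) -> np.ndarray:
    if token not in _embeddings:
        _embeddings[token] = rng.normal(size=DIM)
    return _embeddings[token]

def association_score(target: str, context_tokens: list) -> float:
    """Attend from the target over its context tokens and collapse the
    result to a single value in [0, 1] as the association information."""
    q = embed(target)
    keys = np.stack([embed(t) for t in context_tokens])
    logits = keys @ q / np.sqrt(DIM)          # scaled dot-product attention
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q))
    return float((weights * (sims + 1) / 2).sum())  # map cosine to [0, 1]

score = association_score("qi", ["today", "day", "very", "good"])
print(f"association information: {score:.0%}")
```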
S208: Judge whether the association information exceeds the preset target association information; if it does, execute step S209; if it does not, execute step S210.
In the embodiment of the application, the target association information is preset and represents the required degree of association between the target information and its related context; for example, it may be set to 80%. If the association information between the target information and its related context is 90%, it exceeds the preset target association information, and the third text information is determined as the target text information.
On the contrary, if the association information is only 70%, it does not exceed the preset target association information. This indicates that the sentence of the third text information generated by replacing the corrected information in the second text information with the target information is not fluent and that the target information is weakly associated with its context; the historical scene information in the scene information is therefore acquired, and the third text information is further corrected based on it.
S209: Determine the third text information as the target text information.
S210: Acquire the historical scene information in the scene information.
In the embodiment of the application, the historical scene information in the scene information comprises historical first scene information related to the terminal and/or historical second scene information related to the network environment.
In the embodiment of the application, the historical first scene information is related to the historical usage scenario of the terminal, the state of historically played content in the terminal and/or the content of historical terminal display pages.
It should be noted that the historical usage scenario of the terminal indicates a play category in a preset historical time period, where the play category may be a book category or a picture category; the state of the history playing content in the terminal comprises all information of a playing source to which the history playing content in the terminal belongs; the content of the historical terminal display page can be character information and/or voice information on the terminal display page in a preset historical time period.
In the embodiment of the application, the historical second scene information is related to the trending network content within a preset historical time period and/or the content recommended by the merchant to the terminal within that period, and the second scene information is dynamically updated.
S211: Perform homophone error correction on the third text information according to the historical scene information to generate the target text information.
In specifically executing step S211, when the second text information is detected to be the same as the third text information, the historical scene information in the scene information is acquired, and homophone error correction is performed on the third text information according to the historical scene information to generate the target text information.
For a better understanding of the above, the following description is given by way of example.
For example, suppose the functions existing in the terminal include a video playing function, the second error correction rule is related to common words, the current scene information is related to the currently trending network content (the movies "Luo Xiaohei War Chronicle", "Me and My Country" and so on), and the historical scene information is related to the trending network content of the past month (the movies "The Story of the Loyal Dog Hachiko", "The Avengers" and so on). If the target speech is "play 'The Story of the Loyal Dog Hachiko'", speech recognition converts it into first text information in which the title is homophonically mis-transcribed; the second error correction rule, namely the common word ("play"), yields the second text information; homophone error correction based on the currently trending content leaves the title unchanged, so the generated third text information is the same as the second; the comparison therefore finds them the same, the historical scene information, namely the trending network content of the past month, is acquired, and homophone error correction on the third text information based on it generates the target text information ("play 'The Story of the Loyal Dog Hachiko'").
Or, suppose the functions existing in the terminal include a video playing function, the second error correction rule is related to common words, the current scene information is related to all information of the content currently played in the terminal (the TV series "Story of Yanxi Palace"), and the historical scene information is related to all information of historically played content in the terminal (the movie "Camel Xiangzi", the TV series "Ounce" and so on). If the target speech is "play episode 5 of 'Ounce'", speech recognition converts it into first text information in which the title is homophonically mis-transcribed; the second error correction rule, namely the common word ("play"), yields the second text information; homophone error correction based on all information of the currently played content, namely the 70 episodes of "Story of Yanxi Palace", cannot correct the title, so the generated third text information is the same as the second; the comparison finds them the same, all information of the historically played content is therefore acquired, namely the movie "Camel Xiangzi" and the 40 episodes of the TV series "Ounce", and homophone error correction on the third text information based on it generates the target text information ("play episode 5 of 'Ounce'").
In the embodiment of the invention, the second error correction rule is adopted to perform homophone error correction on the first text information converted from the target speech to obtain the second text information, and the scene information related to the preset first error correction rule is acquired to perform further homophone error correction on the second text information, thereby generating the target text information. In the technical scheme provided by the embodiment of the invention, homophones are corrected during the speech-to-text process by combining the second error correction rule with the scene information related to the preset first error correction rule, which improves the accuracy of speech-to-text conversion and the user experience.
Based on the method for converting voice into text disclosed in the embodiment of the present invention, the embodiment of the present invention also correspondingly discloses a device for converting voice into text, as shown in fig. 3, the device 300 for converting voice into text includes:
the conversion unit 301 is configured to perform voice recognition on the target voice to convert the target voice into the first text information.
An obtaining unit 302, configured to obtain scenario information currently related to a preset first error correction rule, where the scenario information includes first scenario information related to a terminal and/or second scenario information related to a network environment.
A target character information generating unit 303, configured to perform homophone error correction on the first character information based on the scene information to generate target character information.
The specific principle and the execution process of each unit in the voice-to-text device disclosed in the embodiment of the present invention are the same as those of the voice-to-text method disclosed in the embodiment of the present invention, and reference may be made to corresponding parts in the voice-to-text method disclosed in the embodiment of the present invention, which are not described herein again.
The invention provides a device for converting voice into characters. The device performs voice recognition on target voice to convert it into first character information, acquires the scene information currently related to a preset first error correction rule, wherein the scene information comprises first scene information related to a terminal and/or second scene information related to a network environment, and performs homophone error correction on the first character information based on the acquired scene information to generate target character information. With this technical scheme, homophone error correction based on the scene information related to the preset first error correction rule is performed during speech-to-text conversion, improving the accuracy of the conversion and the user experience.
Preferably, the apparatus 300 for converting speech into text further comprises:
and the first error correction unit is used for performing homophone error correction on the first character information by adopting a second error correction rule to obtain second character information, wherein the second error correction rule is related to common words, descriptors of functions existing in the terminal and/or sentence fluency.
Correspondingly, the target character information generating unit comprises: and the first generating unit is used for carrying out homophone error correction on the second character information based on the scene information to generate target character information.
In the embodiment of the invention, the second error correction rule is first applied to perform homophone error correction on the first character information converted from the target voice, yielding second character information, and the scene information currently related to the preset first error correction rule is then acquired to perform further homophone error correction on the second character information, thereby generating the target character information. With this technical scheme, homophone error correction during speech-to-text conversion draws on both the second error correction rule and the scene information related to the preset first error correction rule, which improves the accuracy of the conversion and thus the user experience.
Preferably, the first generating unit includes:
and the second error correction unit is used for carrying out homophone error correction on the second character information based on the current scene information in the scene information to generate third character information.
And the first detection unit is used for detecting whether the second character information is the same as the third character information.
It should be noted that the first determining unit is executed if the second character information is different from the third character information, and the historical scene information acquisition unit is executed if the second character information is the same as the third character information.
And the first determining unit is used for determining the third character information as the target character information.
And the historical scene information acquisition unit is used for acquiring the historical scene information in the scene information.
And the second generating unit is used for carrying out homophone error correction on the third character information according to the historical scene information to generate target character information.
Preferably, the apparatus 300 for converting speech into text further comprises:
and the second determining unit is used for determining target information in the third character information, wherein the target information is the information that replaced the corrected information in the second character information to generate the third character information.
And the second detection unit is used for detecting the association information between the target information and the context related to the target information in the third character information, and the association information is in direct proportion to the association degree between the target information and the context.
A judging unit configured to judge whether the associated information exceeds preset target associated information; if the associated information does not exceed the target associated information, triggering a historical scene information acquisition unit to execute;
correspondingly, the first determining unit is specifically configured to determine the third text information as the target text information if the associated information exceeds the target associated information.
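A minimal sketch of the decision these units implement follows, assuming a token-overlap score as the association information and an arbitrary preset threshold; any score that grows with the degree of association between the target information and its context would satisfy the proportionality stated above.

```python
# Preset target associated information (threshold); the value is assumed.
TARGET_ASSOCIATION = 0.3

def association(target_info: str, context: str) -> float:
    """Toy association score: the share of target tokens also present in
    the context. It grows with the degree of association between the
    target information and its context, as the second detection unit
    requires."""
    target_tokens = set(target_info.lower().split())
    context_tokens = set(context.lower().split())
    if not target_tokens:
        return 0.0
    return len(target_tokens & context_tokens) / len(target_tokens)

def accept_third_text(third_text: str, target_info: str) -> bool:
    """True: the third character information becomes the target character
    information. False: the historical scene information acquisition unit
    is triggered instead."""
    context = third_text.replace(target_info, " ")  # context around the target
    return association(target_info, context) > TARGET_ASSOCIATION
```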
Preferably, the first scene information is related to a usage scenario of the terminal, a user profile of a registered user of the terminal, contents of a display page of the terminal, and/or a state of contents played in the terminal.
Preferably, the use scene of the terminal indicates a play type, and the play type is a book type or a picture type; the user profile is related to the on-demand behavior of the registered user and/or the preference of the registered user; the state of the playing content in the terminal comprises all information of a playing source to which the playing content belongs.
Preferably, the second scene information is dynamically updated.
Preferably, the second scenario information includes recommended content from the merchant to the terminal within a recent preset historical time period.
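One way the second scene information could be kept dynamically updated, sketched under assumed names: timestamp each entry (for example, content a merchant recommends to the terminal) and discard entries that fall outside the preset historical time period.

```python
import time

# "Recent preset historical time period": assumed to be 30 days here.
HISTORY_WINDOW_SECONDS = 30 * 24 * 3600

class SecondSceneInfo:
    """Dynamically updated scene information related to the network environment."""

    def __init__(self) -> None:
        self._entries: list[tuple[float, str]] = []

    def add(self, content: str) -> None:
        """Record newly recommended or newly hot content with its arrival time."""
        self._entries.append((time.time(), content))

    def snapshot(self) -> list[str]:
        """Drop entries older than the window, then return what remains."""
        cutoff = time.time() - HISTORY_WINDOW_SECONDS
        self._entries = [(t, c) for t, c in self._entries if t >= cutoff]
        return [c for _, c in self._entries]
```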
Corresponding to the method for converting speech into text disclosed in the above embodiment of the present invention, referring to fig. 4, an embodiment of the present invention further provides a schematic structural diagram of a terminal capable of converting speech into text, where the terminal 400 for converting speech into text includes: a speech recognition module 401 and a speech processing module 402.
The voice recognition module 401 is configured to perform voice recognition on the target voice to convert the target voice into first text information.
As a preferred mode of the embodiment of the present application, the speech recognition module 401 may be an ASR (Automatic Speech Recognition) module. This is only a preferred mode of the speech recognition module 401 provided in the embodiment of the present application; in practice it may be configured according to actual needs, which is not limited herein.
The voice processing module 402 is configured to obtain the scene information currently related to a preset first error correction rule, where the scene information includes first scene information related to the terminal and/or second scene information related to a network environment, and to perform homophone error correction on the first text information based on the scene information to generate target text information.
As a preferred mode of the embodiment of the present application, the speech processing module 402 may be an NLP (Natural Language Processing) module. This is only a preferred mode of the speech processing module 402 provided in the embodiment of the present application; in practice it may be configured according to actual needs, which is not limited herein.
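Putting the two modules together, the terminal of fig. 4 might be wired as in the following sketch; ASRModule, NLPModule, and Terminal are assumed stand-ins for whatever concrete ASR and NLP engines the terminal integrates, and the corrector callable refers back to the pipeline sketch given earlier.

```python
from typing import Callable

class ASRModule:
    """Speech recognition module 401: converts target voice into first text information."""

    def recognize(self, audio: bytes) -> str:
        raise NotImplementedError  # backed by a concrete speech recognizer

class NLPModule:
    """Voice processing module 402: scene-information-based homophone correction."""

    def __init__(self, corrector: Callable[[str], str]) -> None:
        self.corrector = corrector  # e.g. the generate_target_text sketch above

    def process(self, first_text: str) -> str:
        return self.corrector(first_text)

class Terminal:
    """Terminal 400 of fig. 4, wiring the two modules together."""

    def __init__(self, asr: ASRModule, nlp: NLPModule) -> None:
        self.asr = asr
        self.nlp = nlp

    def speech_to_text(self, audio: bytes) -> str:
        return self.nlp.process(self.asr.recognize(audio))
```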
The specific principle and the execution process of each unit in the voice to text terminal disclosed in the above embodiment of the present invention are the same as those of the voice to text method disclosed in the above embodiment of the present invention, and reference may be made to corresponding parts in the voice to text method disclosed in the above embodiment of the present invention, which are not described herein again.
The invention provides a terminal for converting voice into characters. The voice recognition module performs voice recognition on target voice to convert it into first character information; the voice processing module obtains the scene information currently related to a preset first error correction rule, where the scene information includes first scene information related to the terminal and/or second scene information related to a network environment, and performs homophone error correction on the first character information based on the scene information to generate target character information. With this technical scheme, homophone error correction based on the scene information related to the preset first error correction rule is performed during speech-to-text conversion, improving the accuracy of the conversion and the user experience.
An embodiment of the invention further provides a computer-readable storage medium storing computer-executable instructions for implementing the method for converting voice into text provided by any one of the embodiments of the invention.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (8)

1. A method for converting voice into text is characterized by comprising the following steps:
carrying out voice recognition on target voice to convert the target voice into first character information;
acquiring scene information related to a preset first error correction rule at present, wherein the scene information comprises first scene information related to a terminal and/or second scene information related to a network environment;
performing homophone error correction on the first character information based on the scene information to generate target character information;
performing homophone error correction on the first character information by adopting a second error correction rule to obtain second character information, wherein the second error correction rule is related to common words, descriptors of functions existing in the terminal and/or sentence fluency;
the generating of the target character information by performing homophone error correction on the first character information based on the scene information comprises: performing homophone error correction on the second character information based on the scene information to generate target character information;
the generating of the target character information by performing homophone error correction on the second character information based on the scene information includes:
performing homophone error correction on the second character information based on current scene information in the scene information to generate third character information;
detecting whether the second text information is the same as the third text information;
if the second text information is different from the third text information, determining the third text information as target text information;
if the second character information is the same as the third character information, acquiring historical scene information in the scene information;
performing homophone error correction on the third character information according to the historical scene information to generate target character information;
if the second text information is different from the third text information, the method further comprises:
determining target information in the third text information, wherein the target information is the information that replaced the corrected information in the second text information to generate the third text information;
detecting the associated information between the target information and the context related to the target information in the third text information, wherein the associated information is in direct proportion to the degree of association between the target information and the context;
judging whether the associated information exceeds preset target associated information or not;
if the associated information does not exceed the target associated information, executing the step of acquiring historical scene information in the scene information;
the determining the third text information as target text information includes: and if the associated information exceeds the target associated information, determining the third character information as target character information.
2. The method according to claim 1, wherein the first scenario information relates to a usage scenario of the terminal, a user profile of a registered user of the terminal, a content of a display page of the terminal, and/or a status of playing content in the terminal.
3. The method according to claim 2, wherein the usage scenario of the terminal indicates a play category, and the play category is a book category or a picture category; the user profile is related to on-demand behavior of the registered user and/or preferences of the registered user; the state of the playing content in the terminal comprises all information of a playing source to which the playing content belongs.
4. The method of claim 1, wherein the second scene information is dynamically updated.
5. The method of claim 4, wherein the second context information comprises a recommendation of content from a merchant to the terminal within a recent preset historical period of time.
6. A device for converting voice into text is characterized by comprising:
the conversion unit is used for carrying out voice recognition on target voice to convert the target voice into first character information;
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring scene information related to a preset first error correction rule at present, and the scene information comprises first scene information related to a terminal and/or second scene information related to a network environment;
a target character information generating unit, configured to perform homophone error correction on the first character information based on the scene information to generate target character information;
the first error correction unit is used for performing homophone error correction on the first character information by adopting a second error correction rule to obtain second character information, wherein the second error correction rule is related to common words, descriptors of functions existing in the terminal and/or sentence fluency;
the target character information generation unit includes: the first generating unit is used for carrying out homophone error correction on the second character information based on the scene information to generate target character information;
the first generation unit includes:
the second error correction unit is used for carrying out homophone error correction on the second character information based on the current scene information in the scene information to generate third character information;
the first detection unit is used for detecting whether the second character information is the same as the third character information; if the second text information is different from the third text information, the first determining unit is executed, and if the second text information is the same as the third text information, the historical scene information acquiring unit is executed;
a first determining unit configured to determine the third text information as target text information;
the second determining unit is used for determining target information in the third character information, wherein the target information is the information that replaced the corrected information in the second character information to generate the third character information;
the second detection unit is used for detecting the association information between the target information and the context related to the target information in the third character information, and the association information is in direct proportion to the association degree between the target information and the context;
a judging unit for judging whether the associated information exceeds preset target associated information; if the associated information does not exceed the target associated information, triggering a historical scene information acquisition unit to execute;
the first determining unit is specifically configured to determine the third text information as the target text information if the associated information exceeds the target associated information;
a historical scene information acquiring unit, configured to acquire historical scene information in scene information;
and the second generating unit is used for carrying out homophone error correction on the third character information according to the historical scene information to generate target character information.
7. A terminal for converting voice into text is characterized by comprising:
the voice recognition module is used for carrying out voice recognition on target voice to convert the target voice into first character information;
the voice processing module is used for acquiring scene information related to a preset first error correction rule at present, wherein the scene information comprises first scene information related to a terminal and/or second scene information related to a network environment, and homophone error correction is carried out on the first character information based on the scene information to generate target character information;
the first error correction unit is used for performing homophone error correction on the first character information by adopting a second error correction rule to obtain second character information, wherein the second error correction rule is related to common words, descriptors of functions existing in the terminal and/or sentence fluency;
the voice processing module comprises: the first generating unit is used for carrying out homophone error correction on the second character information based on the scene information to generate target character information;
the first generation unit includes:
the second error correction unit is used for carrying out homophone error correction on the second character information based on the current scene information in the scene information to generate third character information;
the first detection unit is used for detecting whether the second character information is the same as the third character information; if the second character information is different from the third character information, the first determining unit is executed, and if the second character information is the same as the third character information, the historical scene information acquiring unit is executed;
a first determining unit configured to determine the third text information as target text information;
the second determining unit is used for determining target information in the third character information, wherein the target information is the information that replaced the corrected information in the second character information to generate the third character information;
the second detection unit is used for detecting the association information between the target information and the context related to the target information in the third character information, and the association information is in direct proportion to the association degree between the target information and the context;
a judging unit for judging whether the associated information exceeds preset target associated information; if the associated information does not exceed the target associated information, triggering a historical scene information acquisition unit to execute;
the first determining unit is specifically configured to determine the third text information as the target text information if the associated information exceeds the target associated information;
a historical scene information acquiring unit, configured to acquire historical scene information in scene information;
and the second generating unit is used for carrying out homophone error correction on the third character information according to the historical scene information to generate target character information.
8. A computer-readable storage medium having stored thereon computer-executable instructions for performing the method of converting speech into text of any one of claims 1-5.
CN201911358259.2A 2019-12-25 2019-12-25 Method, device, terminal and storage medium for converting voice into characters Active CN111128185B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911358259.2A CN111128185B (en) 2019-12-25 2019-12-25 Method, device, terminal and storage medium for converting voice into characters

Publications (2)

Publication Number Publication Date
CN111128185A CN111128185A (en) 2020-05-08
CN111128185B true CN111128185B (en) 2022-10-21






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant