CN107564526B - Processing method, apparatus and machine-readable medium - Google Patents

Processing method, apparatus and machine-readable medium

Info

Publication number: CN107564526B (granted); earlier published as CN107564526A
Application number: CN201710632930.2A
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 郑宏
Original and current assignee: Beijing Sogou Technology Development Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application filed by Beijing Sogou Technology Development Co Ltd, with priority to CN201710632930.2A
Prior art keywords: replacement, text, target, current, language model
Abstract

Embodiments of the invention provide a processing method, a processing apparatus, and a machine-readable medium. The method specifically comprises the following steps: acquiring, from a source text corresponding to a voice signal, a target word corresponding to a punctuation mark; replacing the target word included in the source text with the corresponding punctuation mark to obtain a target text corresponding to the voice signal; and outputting the target text as the speech recognition result corresponding to the voice signal. Embodiments of the invention can make the speech recognition result conform to the user's punctuation intention, thereby improving the intelligence of the speech recognition service.

Description

Processing method, apparatus and machine-readable medium
Technical Field
The present invention relates to the field of speech recognition technology, and in particular to a processing method, a processing apparatus, an apparatus for processing, and a machine-readable medium.
Background
Speech recognition technology is a technology in which a machine converts speech uttered by a person into corresponding words or symbols through recognition and understanding, or produces a response such as executing a control command or giving an answer. Speech recognition is applied very widely, touching almost every area of daily life, including voice input, voice transcription, voice control, and intelligent dialogue and query. Taking voice input as an example, a voice signal entered by a user can be converted into text, and the resulting speech recognition result can be presented to the user.
In practical applications, to overcome the problem that existing speech recognition results either lack punctuation marks or add them inaccurately, some users try to input the punctuation marks for a text through the voice signal while inputting the text itself. For example, a user who wants to enter "you, what is your name?" will speak "you comma what is your name question mark" as the voice signal.
However, existing solutions do not take this input intention of the user into account and typically provide a literal speech recognition result for the input voice signal. For example, for the voice signal corresponding to "you comma what is your name question mark", the provided speech recognition result is usually the literal text "you comma what is your name question mark"; such a result cannot satisfy the user's input intention.
Disclosure of Invention
In view of the above problems, embodiments of the present invention provide a processing method, a processing apparatus, an apparatus for processing, and a machine-readable medium that overcome, or at least partially solve, the above problems, and that can make the speech recognition result conform to the user's punctuation intention, thereby improving the intelligence of the speech recognition service.
In order to solve the above problem, the present invention discloses a processing method, comprising:
acquiring a target word corresponding to a punctuation mark from a source text corresponding to a voice signal;
replacing target words included in the source text with corresponding punctuations to obtain a target text corresponding to the voice signal;
and outputting the target text as a voice recognition result corresponding to the voice signal.
Optionally, before the outputting the target text as a speech recognition result corresponding to the speech signal, the method further includes:
and determining that the comparison result between the language model score corresponding to the target text and the language model score corresponding to the source text meets a first preset condition.
Optionally, the number of target words is multiple, and replacing the target words included in the source text with corresponding punctuation marks includes:
acquiring, in a preset order, the target word that currently needs to be replaced from the plurality of target words, as the current target word;
replacing the current target word, which is included in the text before replacement corresponding to the current replacement, with the corresponding punctuation mark, so as to obtain the text after replacement corresponding to the current replacement; and obtaining the target text corresponding to the voice signal after the replacements corresponding to all current target words are completed.
Optionally, the condition that the current replacement is successful includes: and the comparison result between the language model score of the text after the replacement corresponding to the current replacement and the language model score of the text before the replacement corresponding to the current replacement meets a second preset condition.
Optionally, if a comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement meets a second preset condition, the text before replacement corresponding to the next replacement is the text after replacement corresponding to the current replacement; or
And if the comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement does not meet the second preset condition, the text before replacement corresponding to the next replacement is the text before replacement corresponding to the current replacement.
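The accept/reject iteration described in the two branches above can be sketched as a greedy loop. This is a minimal illustration, with `lm_score` standing in for any language model scorer; the patent does not prescribe a particular model or function name:

```python
def greedy_replace(source_text, candidates, lm_score):
    """candidates: (target_word, punctuation) pairs in the preset order.
    A replacement is kept only if it does not lower the language model
    score (the 'not lower' form of the second preset condition)."""
    current = source_text
    for word, punct in candidates:
        proposed = current.replace(word, punct, 1)  # one replacement at a time
        if proposed != current and lm_score(proposed) >= lm_score(current):
            current = proposed  # accepted: next replacement starts from here
        # rejected (or word absent): the pre-replacement text carries over
    return current
```

Each proposal is evaluated against the current text, so a rejected replacement leaves the pre-replacement text in place for the next candidate, exactly as the second branch above requires.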
Optionally, the first preset condition includes: the language model score corresponding to the target text is not lower than the language model score corresponding to the source text; or
The second preset condition includes: and the language model score of the text after the replacement corresponding to the current replacement is not lower than the language model score of the text before the replacement corresponding to the current replacement.
Optionally, the first preset condition includes: the increase amplitude of the language model score corresponding to the target text relative to the language model score corresponding to the source text exceeds a first amplitude threshold value; or
The second preset condition includes: the language model score of the text after the replacement corresponding to the current replacement is increased relative to the language model score of the text before the replacement corresponding to the current replacement by more than a second amplitude threshold.
Optionally, the first amplitude threshold or the second amplitude threshold is obtained according to the number of words included in the source text.
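One plausible reading of deriving the amplitude threshold from the word count (the linear form and the margin value below are illustrative assumptions, not specified by the patent) is to scale a per-word margin, since log-probability language model scores grow in magnitude with text length:

```python
def amplitude_threshold(num_words, per_word_margin=0.1):
    """Scale the required language-model score gain with the number of
    words in the source text. Both the linear form and the 0.1 default
    margin are illustrative assumptions."""
    return per_word_margin * num_words
```

With such a rule, a longer source text must show a proportionally larger score improvement before a replacement, or the target text as a whole, is accepted.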
Optionally, before the outputting the target text as a speech recognition result corresponding to the speech signal, the method further includes:
and determining that the syntactic analysis result corresponding to the target text conforms to a preset rule.
In another aspect, the present invention discloses a processing apparatus comprising:
the target word acquisition module is used for acquiring target words corresponding to punctuations from a source text corresponding to the voice signal;
the target word replacing module is used for replacing target words included in the source text with corresponding punctuations so as to obtain a target text corresponding to the voice signal; and
and the recognition result output module is used for outputting the target text as a voice recognition result corresponding to the voice signal.
Optionally, the apparatus further comprises:
and the first determination module is used for determining that the comparison result between the language model score corresponding to the target text and the language model score corresponding to the source text meets a first preset condition before the recognition result output module outputs the target text as the voice recognition result corresponding to the voice signal.
Optionally, the number of the target words is multiple, and the target word replacement module includes:
the sequence acquisition sub-module is used for acquiring a target word needing to be replaced currently from the target words according to a preset sequence and taking the target word as the current target word;
and the sequence replacement submodule is used for replacing the current target word with the corresponding punctuation marks so as to obtain a replaced text corresponding to the current replacement, wherein the current target word is included in the pre-replacement text corresponding to the current replacement, and the target text corresponding to the voice signal is obtained after the replacement corresponding to all the current target words is completed.
Optionally, the condition that the current replacement is successful includes: and the comparison result between the language model score of the text after the replacement corresponding to the current replacement and the language model score of the text before the replacement corresponding to the current replacement meets a second preset condition.
Optionally, if a comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement meets a second preset condition, the text before replacement corresponding to the next replacement is the text after replacement corresponding to the current replacement; or
And if the comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement does not meet the second preset condition, the text before replacement corresponding to the next replacement is the text before replacement corresponding to the current replacement.
Optionally, the first preset condition includes: the language model score corresponding to the target text is not lower than the language model score corresponding to the source text; or
The second preset condition includes: and the language model score of the text after the replacement corresponding to the current replacement is not lower than the language model score of the text before the replacement corresponding to the current replacement.
Optionally, the first preset condition includes: the increase amplitude of the language model score corresponding to the target text relative to the language model score corresponding to the source text exceeds a first amplitude threshold value; or
The second preset condition includes: the language model score of the text after the replacement corresponding to the current replacement is increased relative to the language model score of the text before the replacement corresponding to the current replacement by more than a second amplitude threshold.
Optionally, the first amplitude threshold or the second amplitude threshold is obtained according to the number of words included in the source text.
Optionally, the apparatus further comprises:
and the second determining module is used for determining that the syntactic analysis result corresponding to the target text accords with a preset rule before the recognition result output module outputs the target text as the voice recognition result corresponding to the voice signal.
In yet another aspect, an apparatus for processing is disclosed, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: acquiring a target word corresponding to a punctuation mark from a source text corresponding to a voice signal; replacing the target word included in the source text with the corresponding punctuation mark to obtain a target text corresponding to the voice signal; and outputting the target text as a speech recognition result corresponding to the voice signal.
In yet another aspect, the present disclosure discloses a machine-readable medium having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform the aforementioned processing method.
The embodiment of the invention has the following advantages:
in the embodiment of the invention, target words included in a source text corresponding to a voice signal are replaced with corresponding punctuation marks to obtain a target text corresponding to the voice signal, and the target text is output as the speech recognition result corresponding to the voice signal; thus, the speech recognition result can be made to conform to the user's punctuation intention, and the intelligence of the speech recognition service can be improved. In addition, the embodiment of the invention saves the user the effort of manually editing a speech recognition result that does not conform to the punctuation intention, improving the user's processing efficiency.
Drawings
FIG. 1 is a schematic illustration of an environment in which a process of the present invention is applied;
FIG. 2 is a flow chart of the steps of one embodiment of a processing method of the present invention;
FIG. 3 is a schematic diagram of a punctuation addition process corresponding to a speech recognition result according to an embodiment of the present invention;
FIG. 4 is a flow chart of the steps of one embodiment of a processing method of the present invention;
FIG. 5 is a block diagram of a processing device according to an embodiment of the present invention;
FIG. 6 is a block diagram illustrating an apparatus for processing as a terminal in accordance with an example embodiment; and
FIG. 7 is a block diagram illustrating an apparatus for processing as a server in accordance with an example embodiment.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The embodiment of the invention provides a processing scheme which can obtain a target word corresponding to a punctuation mark from a source text corresponding to a voice signal; replacing target words included in the source text with corresponding punctuations to obtain a target text corresponding to the voice signal; and outputting the target text as a voice recognition result corresponding to the voice signal.
In the embodiment of the present invention, the target word corresponding to a punctuation mark may be used to represent a word that conveys the user's punctuation intention. In practical applications, the target word may be an identification word of the punctuation mark, such as its name or an alias, and the target word may even be set by the user.
In the embodiment of the invention, target words included in a source text corresponding to a voice signal are replaced with corresponding punctuation marks to obtain a target text corresponding to the voice signal, and the target text is output as the speech recognition result corresponding to the voice signal; thus, the speech recognition result can be made to conform to the user's punctuation intention, and the intelligence of the speech recognition service can be improved. In addition, the embodiment of the invention saves the user the effort of manually editing a speech recognition result that does not conform to the punctuation intention, improving the user's processing efficiency.
The embodiment of the invention can be applied to any scenes related to the voice recognition technology, such as voice input, voice transcription and the like, and particularly can be applied to scenes needing to display the voice recognition result. Moreover, the embodiment of the present invention may be applied to application environments of websites and/or application programs to provide a voice recognition service to a user through the application environments, and may improve intelligence of the voice recognition service through a voice recognition result that conforms to a punctuation intention of the user, and it is understood that the embodiment of the present invention is not limited to a specific application environment.
The processing method provided by the embodiment of the present invention can be applied to the application environment shown in fig. 1. As shown in fig. 1, the client 100 and the server 200 are connected through a wired or wireless network and exchange data through that network.
The processing method of the embodiment of the present invention may be executed by any one of the client 100 and the server 200:
for example, the client 100 may receive a voice signal input by a user; specifically, the client 100 may receive the voice signal through a voice collecting device such as a microphone, or may obtain the voice signal from a voice file specified by the user. Next, the client 100 may obtain a source text corresponding to the voice signal by using speech recognition technology; acquire a target word corresponding to a punctuation mark from the source text; replace the target word included in the source text with the corresponding punctuation mark to obtain a target text corresponding to the voice signal; and display the target text as the speech recognition result corresponding to the voice signal.
For another example, after acquiring a voice signal input by a user, the client 100 may also send the voice signal to the server 200, so that the server 200 acquires a source text corresponding to the voice signal by using a voice recognition technology; acquiring a target word corresponding to a punctuation mark from a source text corresponding to a voice signal; replacing target words included in the source text with corresponding punctuations to obtain a target text corresponding to the voice signal; and transmits the target text to the client 100; and the client 100 may present the target text to the user.
If the user's voice signal is denoted S, a series of processing steps on S yields a corresponding speech feature sequence O, denoted O = {O1, O2, …, Oi, …, OT}, where Oi is the i-th speech feature and T is the total number of speech features. The sentence corresponding to the voice signal S can be regarded as a word string composed of many words, denoted W = {w1, w2, …, wn}. The process of speech recognition is to find the most likely word string W given the known speech feature sequence O.
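This search for the most likely word string is conventionally written as a maximum a posteriori decoding problem; the decomposition below via Bayes' rule into an acoustic model term and a language model term is standard in the speech recognition literature and is added here as supplementary context, not as part of the patent text:

```latex
W^{*} = \arg\max_{W} P(W \mid O)
      = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)}
      = \arg\max_{W} \underbrace{P(O \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}}
```

The language model term P(W) is the quantity whose score the first and second preset conditions compare before and after punctuation replacement.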
Specifically, speech recognition is a model matching process: a speech model is first built from human speech characteristics, and the templates required for recognition are established by extracting features from the analyzed input speech signals; recognizing the user's speech is then a process of comparing the features of the input speech against the templates and finally determining the best-matching template, so as to obtain a speech recognition result. The specific recognition algorithm may be a training-and-recognition algorithm based on statistical hidden Markov models, a training-and-recognition algorithm based on neural networks, a recognition algorithm based on dynamic time warping matching, or another algorithm.
Optionally, the client 100 may run on an intelligent terminal, which specifically includes but is not limited to: smart phones, tablet computers, electronic book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, car-mounted computers, desktop computers, set-top boxes, smart televisions, wearable devices, and the like.
Method embodiment
Referring to fig. 2, a flowchart illustrating steps of an embodiment of a processing method according to the present invention is shown, which may specifically include the following steps:
step 201, obtaining a target word corresponding to a punctuation mark from a source text corresponding to a voice signal;
step 202, replacing target words included in the source text with corresponding punctuation marks to obtain a target text corresponding to the voice signal;
and 203, outputting the target text as a voice recognition result corresponding to the voice signal.
The processing method provided by the embodiment of the present invention includes steps 201 to 203, which can be executed by any one of the client and the server.
The source text of the embodiment of the invention may be a text obtained by recognizing the voice signal. The voice signal may be speech input by a user in real time, or speech included in a voice file designated by the user; for example, the voice file may be a call recording of the user, or a voice file received from a voice recorder. In practical applications, the source text may be obtained by recognizing the voice signal, or may be received from another device; it is understood that the embodiment of the present invention does not limit the specific manner of obtaining the source text corresponding to the voice signal.
In practical applications, the punctuation identification words corresponding to punctuation marks can be stored in a punctuation identification word set. Moreover, considering that punctuation marks differ across languages, different punctuation identification word sets can be established for different languages. For example, in the punctuation identification word set corresponding to Chinese, the identification words stored for the full stop "。" may include "句号" (period) and the like; in the punctuation identification word set corresponding to English, the identification words stored for "." may include "period" and the like. It is understood that the embodiment of the present invention does not limit the specific languages to which the punctuation identification word sets apply.
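Per-language punctuation identification word sets of this kind can be sketched as simple lookup tables; the set contents and the function name below are illustrative assumptions, not taken from the patent:

```python
from typing import Optional

# Hypothetical per-language punctuation identification word sets; each maps
# an identification word (as it would be spoken) to its punctuation mark.
PUNCT_ID_WORDS = {
    "zh": {"逗号": "，", "句号": "。", "问号": "？", "感叹号": "！"},
    "en": {"comma": ",", "period": ".", "question mark": "?"},
}

def punctuation_for(word: str, lang: str) -> Optional[str]:
    """Look up the punctuation mark for an identification word, if any."""
    return PUNCT_ID_WORDS.get(lang, {}).get(word)
```

A user-defined mapping, as the later embodiments allow, would simply extend the table for the relevant language.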
It should be noted that punctuation marks may also be added to a speech recognition result through a punctuation addition process; however, the punctuation marks added this way are usually limited to common ones such as the comma, question mark, period, exclamation mark, and space, that is, the punctuation addition process is limited. Referring to fig. 3, a schematic diagram of a punctuation addition process corresponding to a speech recognition result according to an embodiment of the present invention is shown. The word sequence corresponding to the speech recognition result is "hello/I am/Xiaoming/happy/to know you", and punctuation marks may be added between adjacent words of this sequence. In fig. 3, the words "hello", "I am", "Xiaoming", "happy", and "to know you" are represented by rectangles, and the punctuation marks comma, space, exclamation mark, question mark, and period are represented by circles, so there may be multiple paths through the punctuation marks between the first word "hello" and the last word "to know you" of the word sequence.
In the embodiment of the present invention, the target word corresponding to a punctuation mark may be an identification word of the punctuation mark, such as its name or an alias, and the target word may even be set by the user; the embodiment of the present invention can therefore flexibly add richer punctuation marks to the speech recognition result through target words. For example, the user may add the corresponding punctuation mark to the speech recognition result through the target word "dash" or the target word "double quotation mark", thereby realizing input intentions such as emphasis.
In an alternative embodiment of the present invention, the punctuation identification words stored in the set of punctuation identification words may be set by the user. Optionally, the user may further set a mapping relationship between the punctuation mark words and the punctuation marks, so that the embodiment of the present invention may flexibly add richer punctuation marks in the speech recognition result through the mapping relationship.
In an embodiment of the present invention, the process of obtaining the target word corresponding to a punctuation mark from the source text corresponding to the voice signal in step 201 may include: matching characters included in the source text against the punctuation identification words in the punctuation identification word set, and, if the matching succeeds, taking the matched characters in the source text as target words. Assuming the source text is "you comma what is your name question mark", target words corresponding to punctuation marks, such as "comma" and "question mark", can be obtained from it.
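A minimal sketch of the matching in step 201 over a tokenized source text; the token list, identification word set, and longest-match-first strategy are assumptions made for illustration:

```python
# Hypothetical identification word set mapping spoken words to punctuation.
ID_WORDS = {"comma": ",", "question mark": "?", "period": "."}

def find_target_words(tokens):
    """Return (index, word, punctuation) triples for tokens that match
    punctuation identification words, trying longer phrases first."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the two-token phrase first so "question mark" wins over "mark".
        two = " ".join(tokens[i:i + 2])
        if two in ID_WORDS:
            matches.append((i, two, ID_WORDS[two]))
            i += 2
        elif tokens[i] in ID_WORDS:
            matches.append((i, tokens[i], ID_WORDS[tokens[i]]))
            i += 1
        else:
            i += 1
    return matches
```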
In practical applications, the source text may include one target word, a plurality of target words, or no target word at all. The embodiment of the present invention does not limit the specific number of target words included in the source text.
In practical applications, step 202 replaces the target words included in the source text with corresponding punctuation marks, where one replacement may involve one or more target words. If a replacement involves a single target word, that replacement substitutes one target word in the source text with its corresponding punctuation mark; if a replacement involves multiple target words, it substitutes those target words with their corresponding punctuation marks.
Assuming the source text is "you comma what is your name question mark", after the target words "comma" and "question mark" corresponding to punctuation marks are obtained, the target words can be replaced with the corresponding punctuation marks through one or more replacements, finally yielding the target text "you, what is your name?".
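Continuing the sketch, the replacement of step 202 might look as follows for a tokenized text, here as a single pass replacing every matched target word; all names are hypothetical and a leading identification word with no preceding word is simply dropped:

```python
ID_WORDS = {"comma": ",", "question mark": "?", "period": "."}

def replace_target_words(tokens):
    """Replace punctuation identification words with punctuation marks,
    attaching each mark to the preceding word, and rejoin into target text."""
    out = []
    i = 0
    while i < len(tokens):
        two = " ".join(tokens[i:i + 2])
        if two in ID_WORDS:              # two-token identification word
            if out:
                out[-1] += ID_WORDS[two]
            i += 2
        elif tokens[i] in ID_WORDS:      # single-token identification word
            if out:
                out[-1] += ID_WORDS[tokens[i]]
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)
```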
In an optional embodiment of the present invention, a comparison result between the language model score corresponding to the target text and the language model score corresponding to the source text may meet a first preset condition. Accordingly, the process of outputting the target text as the speech recognition result corresponding to the speech signal in step 203 may include: and if the comparison result between the language model score corresponding to the target text and the language model score corresponding to the source text meets a first preset condition, outputting the target text as a voice recognition result corresponding to the voice signal.
Because a language model is an abstract mathematical model of a language built from objective linguistic facts, its score can reflect the linguistic quality of a text (a source text, a target text, and the like). The embodiment of the invention can therefore, to a certain extent, avoid the situation where replacing target words in the source text with corresponding punctuation marks degrades the linguistic quality, and can thus improve the quality of the speech recognition result corresponding to the voice signal.
In the embodiment of the present invention, the language model may include an N-gram language model and/or a neural network language model, where the neural network language model may further include: RNNLM (Recurrent Neural Network Language Model), CNNLM (Convolutional Neural Network Language Model), DNNLM (Deep Neural Network Language Model), and the like.
The N-gram language model is based on the assumption that the occurrence of the N-th word depends only on the preceding N-1 words and is unrelated to any other word, so the probability of a complete sentence is the product of the occurrence probabilities of its words.
Since the N-gram language model predicts the N-th word from a limited context of the preceding N-1 words, it can describe language model scores for semantic segments of length N, where N may be a fixed positive integer smaller than a first length threshold, such as 3 or 5. One advantage of neural network language models such as RNNLM over N-gram language models is that the entire preceding context can be used to predict the next word, so RNNLM can describe language model scores for semantic segments of variable length; that is, RNNLM is suitable for semantic segments in a wider length range, for example from 1 to a second length threshold, where the second length threshold is greater than the first length threshold.
In the embodiment of the present invention, a semantic segment may be used to represent a word sequence to which punctuation marks (including preset marks and the like) have or have not been added. The word sequence may include a plurality of words, the words may be obtained by segmenting a text (a source text or a target text), and the word sequence may be all or part of that text. For example, for the source text "hello% I am% Xiaoming% happy% to know you", the corresponding semantic segments may include "hello% I am", "I am% Xiaoming", "Xiaoming% happy", and the like, where "%" is a symbol introduced for convenience of description to indicate the boundary between words and/or between words and punctuation marks; in practical applications, "%" need not carry any meaning.
According to one embodiment, since RNNLM is suitable for semantic segments in a wide length range, the source text or the target text as a whole may serve as one semantic segment, and RNNLM may determine the corresponding language model score; for example, if all character units included in the source text or the target text are input into the RNNLM, the RNNLM can output the corresponding language model score. A character unit may include: a word and/or a punctuation mark.
According to another embodiment, the determining of the language model score corresponding to the source text or the target text may include: determining corresponding language model scores aiming at semantic fragments contained in a source text or a target text; and fusing the language model scores corresponding to all semantic fragments contained in the source text or the target text to obtain the corresponding language model scores.
Optionally, semantic segments may be obtained from the source text or the target text by sliding a window front to back; different segments may contain the same number of character units, and adjacent segments may share repeated character units. In this case, the language model score of each segment can be determined by the N-gram language model and/or the neural network language model. Assuming N is 5 and the first character unit is numbered 1, segments of length 5 may be taken from the punctuation addition result in the numbering order 1-5, 2-6, 3-7, 4-8, and so on, and the N-gram language model may determine the score of each segment; for example, if a segment is input into the N-gram model, the model can output the corresponding language model score.
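The sliding-window segmentation and score fusion just described can be sketched as follows, under the assumption that fusion is a simple sum of per-segment scores (the embodiment does not fix a fusion method); the function names are illustrative.

```python
def sliding_segments(char_units, n=5):
    """Take length-n windows of character units (words and/or punctuation)
    front to back with stride 1, so adjacent segments share n-1 units
    (numbering order 1-5, 2-6, 3-7, ... as in the text above)."""
    if len(char_units) <= n:
        return [list(char_units)]
    return [list(char_units[i:i + n]) for i in range(len(char_units) - n + 1)]

def fused_score(char_units, segment_scorer, n=5):
    """Score each segment (e.g. with an N-gram model) and fuse the scores;
    summation is one simple choice of fusion."""
    return sum(segment_scorer(seg) for seg in sliding_segments(char_units, n))
```

Here `segment_scorer` stands in for the N-gram model: each length-5 segment is fed to it, and the per-segment scores are combined into a single score for the whole text.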
In an optional embodiment of the present invention, the first preset condition includes: the language model score corresponding to the target text is not lower than the language model score corresponding to the source text; that is, the difference between the language model score corresponding to the target text and the language model score corresponding to the source text is not less than 0.
In another alternative embodiment of the present invention, the first preset condition may include: the increase of the language model score corresponding to the target text relative to the language model score corresponding to the source text exceeds a first amplitude threshold. The increase magnitude can be expressed as (new_lm_score - old_lm_score)/old_lm_score, where new_lm_score denotes the language model score corresponding to the target text and old_lm_score denotes the language model score corresponding to the source text; it can be understood that the embodiment of the present invention does not limit the specific representation of the increase magnitude.
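The increase magnitude above can be computed directly. The sketch below assumes the language model scores are positive (with negative log-probability scores the sign of the ratio would flip, so a real system would adapt the representation, which the embodiment explicitly leaves open); the function names are illustrative.

```python
def increase_magnitude(new_lm_score, old_lm_score):
    """(new_lm_score - old_lm_score) / old_lm_score, where new_lm_score is
    the target text's score and old_lm_score is the source text's score.
    Assumes positive scores; other representations are possible."""
    return (new_lm_score - old_lm_score) / old_lm_score

def meets_first_preset_condition(new_lm_score, old_lm_score, first_amplitude_threshold):
    """First preset condition (amplitude variant): the increase magnitude
    exceeds the first amplitude threshold."""
    return increase_magnitude(new_lm_score, old_lm_score) > first_amplitude_threshold
```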
A person skilled in the art can determine the first amplitude threshold according to actual application requirements; for example, the first amplitude threshold may be an empirical value.
Optionally, the first amplitude threshold may be obtained according to the number of words included in the source text, which can prevent a target word from being erroneously replaced to some extent and thus improve the accuracy of the speech recognition result. Further optionally, the first amplitude threshold may be negatively correlated with the number of words included in the source text; that is, the first amplitude threshold may decrease as the number of words increases. The number of words may be the number of characters included in the source text: for Chinese, specifically the number of single characters included in the source text; for English, specifically the number of words included in the source text.
In an alternative embodiment of the present invention, the number of words may be divided into several word quantity levels, where different word quantity levels may correspond to different first amplitude thresholds. Assuming the word quantity levels, in order of increasing word quantity, include a first word quantity level, a second word quantity level, ..., and an N-th word quantity level, where N is a natural number, the first amplitude threshold corresponding to the (i+1)-th word quantity level may be smaller than the first amplitude threshold corresponding to the i-th word quantity level, where i is a natural number not greater than N.
Referring to table 1, an example of the mapping relationship between word quantity levels and the first amplitude threshold according to the embodiment of the present invention is shown; the levels may specifically include: a first word quantity level, a second word quantity level, a third word quantity level and a fourth word quantity level. As an example, in table 1, the values of the first threshold, the second threshold, and the third threshold may be 2, 10, and 20, respectively. It is understood that the embodiment of the present invention does not limit the specific word quantity levels or the mapping relationship between word quantity levels and the first amplitude threshold.
TABLE 1
[Table 1: mapping between word quantity levels and first amplitude thresholds; the table is provided as images in the original publication and its values are not reproduced here.]
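Since Table 1's values are not reproduced in this text, the sketch below uses invented level boundaries and thresholds purely to illustrate the negative correlation described above: the first amplitude threshold decreases as the word count of the source text grows.

```python
# Hypothetical (word-count upper bound, first amplitude threshold) levels,
# ordered by increasing word count; the real Table 1 values are not shown
# in this text, so these numbers are illustrative only.
WORD_QUANTITY_LEVELS = [
    (2, 0.30),    # first word quantity level
    (10, 0.10),   # second word quantity level
    (20, 0.05),   # third word quantity level
]
DEFAULT_THRESHOLD = 0.02  # fourth level: more than 20 words

def first_amplitude_threshold(word_count):
    """Look up the threshold for the level containing word_count; larger
    word counts map to smaller thresholds (negative correlation)."""
    for upper_bound, threshold in WORD_QUANTITY_LEVELS:
        if word_count <= upper_bound:
            return threshold
    return DEFAULT_THRESHOLD
```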
In another optional embodiment of the present invention, before outputting the target text as the speech recognition result corresponding to the speech signal, the method of the embodiment of the present invention may further include determining that a syntax analysis result corresponding to the target text conforms to a preset rule.
The basic task of syntactic analysis is to determine the syntactic structure of a sentence. For example, in "I came late", "I" is the subject, "came" is the predicate, and "late" is the complement. Optionally, the syntactic structure may be represented by a tree-shaped data structure, and the program module that performs this analysis may be referred to as a syntactic parser.
The target text is changed relative to the source text: specifically, target words in the source text become punctuation marks, and this change will result in a change in the syntactic analysis result.
The embodiment of the invention can perform syntactic analysis on the target text and judge whether the analysis result conforms to the preset rule of the corresponding language. If so, the target text can be output as the speech recognition result corresponding to the speech signal; this can, to a certain extent, avoid syntactically unreasonable results caused by target word replacement and improve the reasonableness of the speech recognition result. It can be understood that if the syntactic analysis result does not conform to the preset rule of the corresponding language, the source text may be output as the speech recognition result corresponding to the speech signal.
In practical applications, the preset rule may include a preset grammar rule. Grammar is the branch of linguistics that studies inflectional changes of parts of speech and words, other means of expressing interrelationships, and the function and relationship of words within sentences in defined usage. The grammar rules may include word formation rules, morphological rules and sentence formation rules; it is understood that the embodiment of the present invention does not limit the specific preset rule.
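The syntax gate described above can be sketched as follows. The single-rule check stands in for a real syntactic parser and preset grammar rules, and the dictionary-shaped parse result is an invented placeholder, not the embodiment's actual representation.

```python
def conforms_to_preset_rule(parse_result):
    """Toy preset rule: require that the parse found a subject. A real
    embodiment would apply word formation, morphological and sentence
    formation rules to a parse tree."""
    return parse_result.get("subject") is not None

def choose_output(source_text, target_text, parser):
    """Output the target text only if its parse conforms to the preset
    rule; otherwise fall back to the source text, as described above."""
    if conforms_to_preset_rule(parser(target_text)):
        return target_text
    return source_text
```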
It should be noted that the source text of the embodiment of the present invention may be subjected to punctuation addition processing, and in this case, step 203 may directly present the target text to the user. Alternatively, the source text of the embodiment of the present invention may not be subjected to punctuation addition processing, in this case, step 203 may first perform punctuation addition processing on the target text, and then output the target text subjected to punctuation addition processing.
In the embodiment of the invention, punctuation addition processing can be used to add punctuation to a text. In an optional embodiment of the present invention, the punctuation addition processing on the text may specifically include: performing word segmentation on the text to obtain the corresponding word sequence; and performing punctuation addition processing on the word sequence through a language model to obtain the punctuation addition result. It can be understood that a person skilled in the art may adopt any required punctuation addition processing manner according to actual application requirements; the embodiment of the present invention does not limit the specific manner.
In the embodiment of the present invention, a plurality of candidate punctuation marks can be added between adjacent words in the word sequence corresponding to the text; that is, punctuation addition processing can consider every case in which a candidate punctuation mark is added between adjacent segmented words, so that the word sequence corresponds to a plurality of punctuation addition schemes and corresponding punctuation addition results. Optionally, the punctuation addition processing may be performed through the language model, so that the punctuation addition result with the optimal language model score is finally obtained.
It should be noted that a person skilled in the art may determine the candidate punctuation marks to be added according to actual application requirements. Optionally, the candidate punctuation marks may include: a comma, a question mark, a period, an exclamation mark, a space, and the like, where the space may either play a word segmentation role or play no role; for example, for English, a space can be used to separate different words, while for Chinese, a space can be a punctuation mark that plays no role.
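A brute-force sketch of the punctuation addition just described: every scheme that places one candidate mark (or nothing) at each word boundary is enumerated, and the scheme whose text the language model scores highest wins. The enumeration is exponential in the number of boundaries, so a real system would prune it (e.g. with beam search over the language model); the candidate set and scorer are illustrative.

```python
from itertools import product

# Candidate marks per boundary; "" means no punctuation is added there.
CANDIDATE_MARKS = ["", ",", "?", ".", "!"]

def best_punctuation_addition(words, lm_score):
    """Try every punctuation addition scheme over the word sequence (one
    candidate per boundary, including after the last word) and return the
    punctuation addition result with the highest language model score."""
    best_text, best_score = None, float("-inf")
    for scheme in product(CANDIDATE_MARKS, repeat=len(words)):
        text = " ".join(word + mark for word, mark in zip(words, scheme))
        score = lm_score(text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```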
To sum up, in the processing method of the embodiment of the present invention, a target word included in a source text corresponding to a speech signal is replaced with a corresponding punctuation mark to obtain a target text corresponding to the speech signal, and the target text is output as a speech recognition result corresponding to the speech signal; thus, the voice recognition result can be made to conform to the punctuation intention of the user, and therefore, the intelligence of the voice recognition service can be improved. In addition, the embodiment of the invention can save the operation cost for manually editing the voice recognition result which does not accord with the punctuation intention by the user and improve the processing efficiency of the user.
Moreover, the comparison result between the language model score corresponding to the target text and the language model score corresponding to the source text can accord with a first preset condition; because the language model is language abstract mathematical modeling according to language objective facts, and the score of the language model can reflect the language quality corresponding to a text (including a source text or a target text and the like), the embodiment of the invention can avoid the condition that the language quality is reduced due to replacing target words included in the source text with corresponding punctuation marks to a certain extent, and further can improve the quality of a voice recognition result corresponding to a voice signal.
Referring to fig. 4, a flowchart illustrating steps of an embodiment of a processing method according to the present invention is shown, which may specifically include the following steps:
step 401, obtaining a target word corresponding to a punctuation mark from a source text corresponding to a voice signal; the number of the target words can be multiple;
step 402, according to a preset sequence, acquiring a target word needing to be replaced currently from a plurality of target words as a current target word;
step 403, replacing the current target word included in the text before replacement corresponding to the current replacement with the corresponding punctuation mark to obtain the text after replacement corresponding to the current replacement; after the replacements corresponding to all current target words are completed, the target text corresponding to the voice signal is obtained;
and step 404, outputting the target text as a voice recognition result corresponding to the voice signal.
In practical application, in the case that the number of the target words is one, the embodiment of the present invention may replace the target words included in the source text with the corresponding punctuation marks through one replacement, so as to obtain the corresponding target text.
Compared with the embodiment shown in fig. 2, this embodiment relates to processing in the case where there are a plurality of target words. Specifically, the embodiment of the present invention may replace the target words included in the source text with the corresponding punctuation marks through a plurality of replacements, where one replacement may involve the replacement of one target word, that is, replacing the current target word included in the text before replacement with the corresponding punctuation mark.
In practical application, the target word that currently needs to be replaced is obtained from the plurality of target words according to a preset order and taken as the current target word, so that the current target words are replaced in the preset order. A person skilled in the art may determine the preset order according to actual application requirements; for example, the preset order may be from front to back, from back to front, or from the middle to both ends.
In this embodiment of the present invention, the text before replacement corresponding to the first replacement may be the source text, and the text before replacement corresponding to the (j+1)-th replacement may be the text after replacement corresponding to the j-th replacement, where j is a natural number. Taking the text after replacement corresponding to the j-th replacement as the text before replacement corresponding to the (j+1)-th replacement realizes the updating of the text from replacement to replacement.
Of course, taking the text after replacement corresponding to the j-th replacement as the text before replacement corresponding to the (j+1)-th replacement is only an optional embodiment; in fact, the text before replacement corresponding to the (j+1)-th replacement may also be the source text.
Assume the source text is "hello comma what is your name question mark". Target words corresponding to punctuation marks, such as "comma" and "question mark", are obtained from the source text and can then be replaced with the corresponding punctuation marks through two replacements: the first replacement replaces one target word in the source text with the corresponding punctuation mark to obtain the text after the first replacement; the second replacement replaces one target word in the text after the first replacement with the corresponding punctuation mark to obtain the target text "hello, what is your name?".
In an alternative embodiment of the present invention, the condition for the current replacement to succeed may include: the comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement meets a second preset condition. If the condition for success is not met, the current replacement is considered to have failed, or the current replacement is abandoned, and the next replacement continues until the replacements corresponding to all current target words are completed. Because a language model is an abstract mathematical model of language built on objective linguistic facts, and its score can reflect the linguistic quality of a text (including the text before replacement, the text after replacement, and the like), the embodiment of the invention can, to a certain extent, avoid a reduction in linguistic quality caused by replacing the current target word with the corresponding punctuation mark, and can thus improve the quality of the speech recognition result corresponding to the speech signal.
Optionally, the second preset condition may include: and the language model score of the text after the replacement corresponding to the current replacement is not lower than the language model score of the text before the replacement corresponding to the current replacement. Further optionally, the second preset condition may include: the language model score of the text after the replacement corresponding to the current replacement is increased relative to the language model score of the text before the replacement corresponding to the current replacement by more than a second amplitude threshold.
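The multi-round replacement of steps 402-403 together with the second preset condition can be sketched as a greedy front-to-back loop: each round tentatively replaces one target word and keeps the result only if the language model score does not drop (plus an optional margin). The target-word table, rendering helper, and simple score comparison are assumptions for illustration, not the embodiment's actual implementation.

```python
# Illustrative table of target words and their punctuation marks.
PUNCT_WORDS = {"comma": ",", "question mark": "?", "period": ".",
               "exclamation mark": "!"}

def render(char_units):
    """Join words with spaces and attach punctuation to the previous word."""
    text = ""
    for unit in char_units:
        if unit in PUNCT_WORDS.values():
            text += unit
        else:
            text += (" " + unit) if text else unit
    return text

def replace_target_words(source_words, lm_score, second_threshold=0.0):
    """source_words is the source text as a list of word units, e.g.
    ["hello", "comma", "what", "is", "your", "name", "question mark"].
    A round succeeds if the post-replacement score is at least the
    pre-replacement score plus second_threshold (the second preset
    condition); a failed round is abandoned, and the pre-replacement
    text carries over to the next round."""
    current = list(source_words)
    for i in range(len(current)):
        word = current[i]
        if word not in PUNCT_WORDS:
            continue  # not a target word corresponding to a punctuation mark
        candidate = current[:i] + [PUNCT_WORDS[word]] + current[i + 1:]
        if lm_score(render(candidate)) >= lm_score(render(current)) + second_threshold:
            current = candidate  # replacement succeeded: text is updated
        # else: this round failed; keep the pre-replacement text
    return render(current)
```

With a toy scorer that simply rewards punctuation, the word list in the docstring yields "hello, what is your name?"; a failed round would leave its target word in place.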
The process of determining the language model score of the text after replacement and the process of determining the language model score of the text before replacement are similar to the process of determining the language model score corresponding to the source text or the target text, and are therefore not repeated here; the descriptions can be referred to mutually.
Since the second preset condition is similar to the first preset condition, it is not described in detail here; the descriptions can be referred to mutually. Specifically, the second amplitude threshold corresponding to the second preset condition may be obtained according to the number of words included in the source text, which can prevent erroneous replacement of target words to a certain extent and improve the accuracy of the speech recognition result. Further optionally, the second amplitude threshold may be negatively correlated with the number of words included in the source text; that is, the second amplitude threshold may decrease as the number of words increases.
In an alternative embodiment of the present invention, the condition for the current replacement to succeed may include: the syntactic analysis result corresponding to the text after replacement corresponding to the current replacement conforms to the preset rule; this can, to a certain extent, avoid syntactically unreasonable results caused by target word replacement and improve the reasonableness of the speech recognition result. The processes of obtaining and judging the syntactic analysis result corresponding to the text after replacement are similar to those for the target text and are therefore not repeated here; the descriptions can be referred to mutually.
When the condition for the current replacement to succeed includes that the comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement meets the second preset condition, the current replacement can be effectively verified. Accordingly, the text before replacement corresponding to the next replacement may fall into the following two cases:
Case 1: if the comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement meets the second preset condition, the text before replacement corresponding to the next replacement is the text after replacement corresponding to the current replacement; or
Case 2: if the comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement does not meet the second preset condition, the text before replacement corresponding to the next replacement is the text before replacement corresponding to the current replacement.
Case 1 corresponds to a successful current replacement; in this case, the current replacement updates the text before replacement corresponding to the next replacement, so the text before replacement corresponding to the next replacement is the text after replacement corresponding to the current replacement.
Case 2 corresponds to a failed current replacement; in this case, the current replacement does not update the text before replacement corresponding to the next replacement, so the text before replacement corresponding to the next replacement is the text before replacement corresponding to the current replacement.
In an application example of the present invention, the text before replacement corresponding to the first replacement is the source text. If the first replacement fails, the text before replacement corresponding to the second replacement may still be the source text; if the first replacement succeeds, the text before replacement corresponding to the second replacement may be the text after replacement corresponding to the first replacement. Further, if the second replacement fails, the text before replacement corresponding to the third replacement may be the text after replacement corresponding to the first replacement; if the second replacement succeeds, the text before replacement corresponding to the third replacement may be the text after replacement corresponding to the second replacement.
To sum up, the processing method according to the embodiment of the present invention may replace the target word included in the source text with the corresponding punctuation mark through multiple replacements, where one replacement may involve replacement of one target word, that is, replace the current target word included in the text before replacement with the corresponding punctuation mark.
In addition, in the embodiment of the present invention, the condition that the current replacement is successful may include: and the comparison result between the language model score of the text after the replacement corresponding to the current replacement and the language model score of the text before the replacement corresponding to the current replacement meets a second preset condition. If the condition that the current replacement is successful is not met, the current replacement is considered to be failed, or the current replacement is abandoned, and the next replacement is continued until the replacement corresponding to all the current target words is completed. Because the language model is language abstract mathematical modeling according to language objective facts, and the score of the language model can reflect the language quality corresponding to a text (including a text before replacement, a text after replacement and the like), the embodiment of the invention can avoid the situation that the language quality is reduced due to the fact that the current target word included in the text before replacement is replaced by the corresponding punctuation mark to a certain extent, and further can improve the quality of a voice recognition result corresponding to a voice signal.
It should be noted that, for simplicity of description, the method embodiments are described as a series of action combinations, but those skilled in the art should understand that the present invention is not limited by the described action sequences, because some steps may be performed in other orders or simultaneously according to the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the actions involved are not necessarily required by the present invention.
Device embodiment
Referring to fig. 5, a block diagram of a processing apparatus according to an embodiment of the present invention is shown, which may specifically include:
a target word obtaining module 501, configured to obtain a target word corresponding to a punctuation mark from a source text corresponding to a voice signal;
a target word replacing module 502, configured to replace a target word included in the source text with the corresponding punctuation mark to obtain a target text corresponding to the voice signal; and
a recognition result output module 503, configured to output the target text as a speech recognition result corresponding to the speech signal.
Optionally, a comparison result between the language model score corresponding to the target text and the language model score corresponding to the source text meets a first preset condition; accordingly, the apparatus may further include:
a first determining module, configured to determine that a comparison result between the language model score corresponding to the target text and the language model score corresponding to the source text meets a first preset condition before the recognition result output module 503 outputs the target text as a speech recognition result corresponding to the speech signal.
Optionally, the number of the target words may be multiple, and the target word replacement module 502 may include:
the sequence acquisition sub-module is used for acquiring a target word needing to be replaced currently from the target words according to a preset sequence and taking the target word as the current target word;
and the sequence replacement submodule is used for replacing the current target word included in the text before replacement corresponding to the current replacement with the corresponding punctuation mark to obtain the text after replacement corresponding to the current replacement; after the replacements corresponding to all current target words are completed, the target text corresponding to the voice signal is obtained.
Optionally, the condition that the current replacement is successful may include: and the comparison result between the language model score of the text after the replacement corresponding to the current replacement and the language model score of the text before the replacement corresponding to the current replacement meets a second preset condition.
Optionally, if a comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement meets a second preset condition, the text before replacement corresponding to the next replacement is the text after replacement corresponding to the current replacement; or
If the comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement does not meet the second preset condition, the text before replacement corresponding to the next replacement is the text before replacement corresponding to the current replacement.
Optionally, the first preset condition may include: the language model score corresponding to the target text is not lower than the language model score corresponding to the source text; or
The second preset condition may include: and the language model score of the text after the replacement corresponding to the current replacement is not lower than the language model score of the text before the replacement corresponding to the current replacement.
Optionally, the first preset condition may include: the increase amplitude of the language model score corresponding to the target text relative to the language model score corresponding to the source text exceeds a first amplitude threshold value; or
The second preset condition may include: the language model score of the text after the replacement corresponding to the current replacement is increased relative to the language model score of the text before the replacement corresponding to the current replacement by more than a second amplitude threshold.
Optionally, the first amplitude threshold or the second amplitude threshold is obtained according to the number of words included in the source text.
Optionally, the apparatus may further include:
a second determining module, configured to determine that a parsing result corresponding to the target text meets a preset rule before the recognition result output module 503 outputs the target text as a speech recognition result corresponding to the speech signal.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Embodiments of the present invention also provide a processing apparatus, comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs configured to be executed by the one or more processors include instructions for: acquiring a target word corresponding to a punctuation mark from a source text corresponding to a voice signal; replacing target words included in the source text with corresponding punctuations to obtain a target text corresponding to the voice signal; and outputting the target text as a voice recognition result corresponding to the voice signal.
Optionally, the device is further configured such that the one or more programs, executed by the one or more processors, include instructions for:
before outputting the target text as a voice recognition result corresponding to the voice signal, determining that a comparison result between a language model score corresponding to the target text and a language model score corresponding to the source text meets a first preset condition.
Optionally, the number of the target words is multiple, and replacing the target words included in the source text with corresponding punctuation marks includes:
according to a preset sequence, acquiring a target word needing to be replaced currently from a plurality of target words to serve as a current target word;
replacing the current target word with a corresponding punctuation mark, wherein the current target word is included in the text before replacement corresponding to the current replacement; and obtaining a replaced text corresponding to the current replacement, and obtaining a target text corresponding to the voice signal after completing the replacement corresponding to all current target words.
Optionally, the condition that the current replacement is successful includes: and the comparison result between the language model score of the text after the replacement corresponding to the current replacement and the language model score of the text before the replacement corresponding to the current replacement meets a second preset condition.
Optionally, if a comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement meets a second preset condition, the text before replacement corresponding to the next replacement is the text after replacement corresponding to the current replacement; or
if the comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement does not meet the second preset condition, the text before replacement corresponding to the next replacement is the text before replacement corresponding to the current replacement.
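Sketched under stated assumptions (the `score_fn` language model scorer is hypothetical, and the "not lower" variant of the second preset condition is used), the sequential replacement with this accept/reject rule could look like:

```python
def replace_target_words(source_text, replacements, score_fn):
    # 'replacements' is an ordered list of (target_word, punctuation_mark)
    # pairs, processed in the preset sequence.
    current = source_text  # text before replacement for the first round
    for word, mark in replacements:
        candidate = current.replace(word, mark, 1)
        if score_fn(candidate) >= score_fn(current):
            current = candidate  # replacement succeeded; build on it
        # otherwise the candidate is discarded and the next replacement
        # starts from the unmodified text before replacement
    return current  # target text once all replacements are attempted

toy_score = lambda text: -len(text)  # illustrative scorer only
result = replace_target_words(
    "hello comma world period",
    [("comma", ","), ("period", ".")],
    toy_score,
)
```

Each rejected replacement leaves the running text untouched, matching the behaviour described in the two branches above.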
Optionally, the first preset condition includes: the language model score corresponding to the target text is not lower than the language model score corresponding to the source text; or
the second preset condition includes: the language model score of the text after replacement corresponding to the current replacement is not lower than the language model score of the text before replacement corresponding to the current replacement.
Optionally, the first preset condition includes: the increase amplitude of the language model score corresponding to the target text relative to the language model score corresponding to the source text exceeds a first amplitude threshold value; or
the second preset condition includes: the increase amplitude of the language model score of the text after replacement corresponding to the current replacement relative to the language model score of the text before replacement corresponding to the current replacement exceeds a second amplitude threshold value.
Optionally, the first amplitude threshold or the second amplitude threshold is obtained according to the number of words included in the source text.
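One way such a word-count-dependent threshold could look; the boundary and threshold values here are purely illustrative assumptions, not taken from the disclosure:

```python
def first_amplitude_threshold(word_count, boundary=10,
                              short_text=2.0, long_text=0.5):
    # Shorter texts get the larger threshold: with fewer words, a single
    # punctuation replacement must raise the language model score more
    # convincingly before the replacement is trusted.
    return short_text if word_count < boundary else long_text
```

This mirrors the claimed relationship that the threshold for the smaller word-quantity level is larger than the threshold for the larger one.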
Optionally, the one or more programs further include instructions for:
before outputting the target text as a voice recognition result corresponding to the voice signal, determining that a syntactic analysis result corresponding to the target text conforms to a preset rule.
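The preset rule itself is not spelled out in the disclosure; as a hypothetical stand-in, a check might reject target texts whose punctuation placement is syntactically implausible:

```python
import re

def syntax_result_ok(text):
    # Hypothetical rule set, for illustration only: reject target texts
    # that begin with punctuation or contain two consecutive marks.
    stripped = text.strip()
    if re.match(r"[,.!?;]", stripped):
        return False
    return re.search(r"[,.!?;]\s*[,.!?;]", stripped) is None
```

A production system would presumably run a full syntactic parser here rather than regular expressions.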
Fig. 6 is a block diagram illustrating an apparatus for processing as a terminal according to an example embodiment. For example, terminal 900 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, fitness device, personal digital assistant, and the like.
Referring to fig. 6, terminal 900 can include one or more of the following components: processing component 902, memory 904, power component 906, multimedia component 908, audio component 910, input/output (I/O) interface 912, sensor component 914, and communication component 916.
Processing component 902 generally controls overall operation of terminal 900, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing element 902 may include one or more processors 920 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 902 can include one or more modules that facilitate interaction between processing component 902 and other components. For example, the processing component 902 can include a multimedia module to facilitate interaction between the multimedia component 908 and the processing component 902.
Memory 904 is configured to store various types of data to support operation at terminal 900. Examples of such data include instructions for any application or method operating on terminal 900, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 904 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power components 906 provide power to the various components of the terminal 900. The power components 906 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal 900.
The multimedia component 908 includes a screen providing an output interface between the terminal 900 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 908 includes a front camera and/or a rear camera. The front camera and/or the rear camera may receive external multimedia data when the terminal 900 is in an operation mode, such as a photographing mode or a video mode. Each of the front camera and the rear camera may be a fixed optical lens system or have focus and optical zoom capability.
The audio component 910 is configured to output and/or input audio signals. For example, audio component 910 includes a Microphone (MIC) configured to receive external audio signals when terminal 900 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 904 or transmitted via the communication component 916. In some embodiments, audio component 910 also includes a speaker for outputting audio signals.
I/O interface 912 provides an interface between processing component 902 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 914 includes one or more sensors for providing various aspects of state assessment for the terminal 900. For example, sensor assembly 914 can detect an open/closed state of terminal 900, a relative positioning of components, such as a display and keypad of terminal 900, a change in position of terminal 900 or a component of terminal 900, the presence or absence of user contact with terminal 900, an orientation or acceleration/deceleration of terminal 900, and a change in temperature of terminal 900. The sensor assembly 914 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 914 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 914 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
Communication component 916 is configured to facilitate communications between terminal 900 and other devices in a wired or wireless manner. Terminal 900 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 916 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 916 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the terminal 900 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as memory 904 comprising instructions, executable by processor 920 of terminal 900 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
FIG. 7 is a block diagram illustrating an apparatus for processing as a server according to an example embodiment. The server 1900 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage media 1930 may be transient or persistent storage. The programs stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processing unit 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input-output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided that includes instructions, such as memory 1932 that includes instructions executable by a processor of server 1900 to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an apparatus (terminal or server), enable the apparatus to perform a processing method, the method comprising: acquiring a target word corresponding to a punctuation mark from a source text corresponding to a voice signal; replacing target words included in the source text with corresponding punctuations to obtain a target text corresponding to the voice signal; and outputting the target text as a voice recognition result corresponding to the voice signal.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
The processing method, processing apparatus, and machine-readable medium provided by the present invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, and the descriptions of the above embodiments are only intended to help understand the method and core idea of the present invention. Meanwhile, for those skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (25)

1. A method of processing, comprising:
acquiring a target word corresponding to a punctuation mark from a source text corresponding to a voice signal;
replacing target words included in the source text with corresponding punctuations to obtain a target text corresponding to the voice signal;
under the condition that the comparison result between the language model score corresponding to the target text and the language model score corresponding to the source text meets a first preset condition, outputting the target text as a voice recognition result corresponding to the voice signal;
the first preset condition includes: the increase amplitude of the language model score corresponding to the target text relative to the language model score corresponding to the source text exceeds a first amplitude threshold value; the first amplitude threshold value is obtained according to the number of words included in the source text; the number of words corresponds to one of a plurality of quantity levels, comprising: a first word-quantity level and a second word-quantity level; the number of words corresponding to the first word-quantity level is smaller than the number of words corresponding to the second word-quantity level, and the first amplitude threshold value corresponding to the first word-quantity level is larger than the first amplitude threshold value corresponding to the second word-quantity level.
2. The method according to claim 1, wherein the number of the target words is plural, and the replacing the target words included in the source text with corresponding punctuation marks comprises:
according to a preset sequence, acquiring a target word needing to be replaced currently from a plurality of target words to serve as a current target word;
replacing the current target word, which is included in the text before replacement corresponding to the current replacement, with the corresponding punctuation mark to obtain a text after replacement corresponding to the current replacement; and obtaining the target text corresponding to the voice signal after the replacements corresponding to all current target words are completed.
3. The method of claim 2, wherein the condition that the current replacement is successful comprises: and the comparison result between the language model score of the text after the replacement corresponding to the current replacement and the language model score of the text before the replacement corresponding to the current replacement meets a second preset condition.
4. The method according to claim 2, wherein if the result of the comparison between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement meets a second preset condition, the text before replacement corresponding to the next replacement is the text after replacement corresponding to the current replacement; or
if the comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement does not meet the second preset condition, the text before replacement corresponding to the next replacement is the text before replacement corresponding to the current replacement.
5. The method according to claim 3 or 4, wherein the second preset condition comprises: and the language model score of the text after the replacement corresponding to the current replacement is not lower than the language model score of the text before the replacement corresponding to the current replacement.
6. The method of claim 5, wherein the second preset condition comprises: the language model score of the text after the replacement corresponding to the current replacement is increased relative to the language model score of the text before the replacement corresponding to the current replacement by more than a second amplitude threshold.
7. The method of claim 6, wherein the second amplitude threshold is derived from a number of words included in the source text.
8. The method according to claim 1, wherein before outputting the target text as a speech recognition result corresponding to the speech signal, the method further comprises:
determining that the syntactic analysis result corresponding to the target text conforms to a preset rule.
9. A processing apparatus, comprising:
the target word acquisition module is used for acquiring target words corresponding to punctuations from a source text corresponding to the voice signal;
the target word replacing module is used for replacing target words included in the source text with corresponding punctuations so as to obtain a target text corresponding to the voice signal; and
a recognition result output module, configured to output the target text as a speech recognition result corresponding to the speech signal when a comparison result between the language model score corresponding to the target text and the language model score corresponding to the source text meets a first preset condition;
the first preset condition includes: the increase amplitude of the language model score corresponding to the target text relative to the language model score corresponding to the source text exceeds a first amplitude threshold value; the first amplitude threshold value is obtained according to the number of words included in the source text; the number of words corresponds to one of a plurality of quantity levels, comprising: a first word-quantity level and a second word-quantity level; the number of words corresponding to the first word-quantity level is smaller than the number of words corresponding to the second word-quantity level, and the first amplitude threshold value corresponding to the first word-quantity level is larger than the first amplitude threshold value corresponding to the second word-quantity level.
10. The apparatus of claim 9, wherein the number of the target words is plural, and wherein the target word replacement module comprises:
the sequence acquisition sub-module is used for acquiring a target word needing to be replaced currently from the target words according to a preset sequence and taking the target word as the current target word;
and the sequence replacement submodule is used for replacing the current target word, which is included in the text before replacement corresponding to the current replacement, with the corresponding punctuation mark to obtain a text after replacement corresponding to the current replacement, and for obtaining the target text corresponding to the voice signal after the replacements corresponding to all current target words are completed.
11. The apparatus of claim 10, wherein the condition that the current replacement is successful comprises: and the comparison result between the language model score of the text after the replacement corresponding to the current replacement and the language model score of the text before the replacement corresponding to the current replacement meets a second preset condition.
12. The apparatus according to claim 10, wherein if the result of the comparison between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement meets a second preset condition, the text before replacement corresponding to the next replacement is the text after replacement corresponding to the current replacement; or
if the comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement does not meet the second preset condition, the text before replacement corresponding to the next replacement is the text before replacement corresponding to the current replacement.
13. The apparatus according to claim 11 or 12, wherein the second preset condition comprises: and the language model score of the text after the replacement corresponding to the current replacement is not lower than the language model score of the text before the replacement corresponding to the current replacement.
14. The apparatus according to claim 11 or 12, wherein the second preset condition comprises: the language model score of the text after the replacement corresponding to the current replacement is increased relative to the language model score of the text before the replacement corresponding to the current replacement by more than a second amplitude threshold.
15. The apparatus of claim 14, wherein the second magnitude threshold is derived from a number of words included in the source text.
16. The apparatus of claim 9, further comprising:
and the second determining module is used for determining that the syntactic analysis result corresponding to the target text accords with a preset rule before the recognition result output module outputs the target text as the voice recognition result corresponding to the voice signal.
17. An apparatus for processing, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs comprising instructions for:
acquiring a target word corresponding to a punctuation mark from a source text corresponding to a voice signal;
replacing target words included in the source text with corresponding punctuations to obtain a target text corresponding to the voice signal;
under the condition that the comparison result between the language model score corresponding to the target text and the language model score corresponding to the source text meets a first preset condition, outputting the target text as a voice recognition result corresponding to the voice signal;
the first preset condition includes: the increase amplitude of the language model score corresponding to the target text relative to the language model score corresponding to the source text exceeds a first amplitude threshold value; the first amplitude threshold value is obtained according to the number of words included in the source text; the number of words corresponds to one of a plurality of quantity levels, comprising: a first word-quantity level and a second word-quantity level; the number of words corresponding to the first word-quantity level is smaller than the number of words corresponding to the second word-quantity level, and the first amplitude threshold value corresponding to the first word-quantity level is larger than the first amplitude threshold value corresponding to the second word-quantity level.
18. The apparatus of claim 17, wherein the number of the target words is plural, and the replacing the target words included in the source text with corresponding punctuation marks comprises:
according to a preset sequence, acquiring a target word needing to be replaced currently from a plurality of target words to serve as a current target word;
replacing the current target word, which is included in the text before replacement corresponding to the current replacement, with the corresponding punctuation mark to obtain a text after replacement corresponding to the current replacement; and obtaining the target text corresponding to the voice signal after the replacements corresponding to all current target words are completed.
19. The apparatus of claim 18, wherein the condition that the current replacement is successful comprises: and the comparison result between the language model score of the text after the replacement corresponding to the current replacement and the language model score of the text before the replacement corresponding to the current replacement meets a second preset condition.
20. The apparatus according to claim 18, wherein if the result of the comparison between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement meets a second preset condition, the text before replacement corresponding to the next replacement is the text after replacement corresponding to the current replacement; or
if the comparison result between the language model score of the text after replacement corresponding to the current replacement and the language model score of the text before replacement corresponding to the current replacement does not meet the second preset condition, the text before replacement corresponding to the next replacement is the text before replacement corresponding to the current replacement.
21. The apparatus according to claim 19 or 20, wherein the second preset condition comprises: and the language model score of the text after the replacement corresponding to the current replacement is not lower than the language model score of the text before the replacement corresponding to the current replacement.
22. The apparatus according to claim 19 or 20, wherein the second preset condition comprises: the language model score of the text after the replacement corresponding to the current replacement is increased relative to the language model score of the text before the replacement corresponding to the current replacement by more than a second amplitude threshold.
23. The apparatus of claim 22, wherein the second magnitude threshold is derived from a number of words included in the source text.
24. The apparatus of claim 17, wherein the one or more programs further comprise instructions for:
before outputting the target text as a voice recognition result corresponding to the voice signal, determining that a syntactic analysis result corresponding to the target text conforms to a preset rule.
25. A machine-readable medium having stored thereon instructions, which when executed by one or more processors, cause an apparatus to perform the processing method of any one of claims 1 to 8.
CN201710632930.2A 2017-07-28 2017-07-28 Processing method, apparatus and machine-readable medium Active CN107564526B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710632930.2A CN107564526B (en) 2017-07-28 2017-07-28 Processing method, apparatus and machine-readable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710632930.2A CN107564526B (en) 2017-07-28 2017-07-28 Processing method, apparatus and machine-readable medium

Publications (2)

Publication Number Publication Date
CN107564526A CN107564526A (en) 2018-01-09
CN107564526B true CN107564526B (en) 2020-10-27

Family

ID=60973895

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710632930.2A Active CN107564526B (en) 2017-07-28 2017-07-28 Processing method, apparatus and machine-readable medium

Country Status (1)

Country Link
CN (1) CN107564526B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108538292B (en) * 2018-04-26 2020-12-22 科大讯飞股份有限公司 Voice recognition method, device, equipment and readable storage medium
CN110020190B (en) * 2018-07-05 2021-06-01 中国科学院信息工程研究所 Multi-instance learning-based suspicious threat index verification method and system
US10789955B2 (en) * 2018-11-16 2020-09-29 Google Llc Contextual denormalization for automatic speech recognition
CN111460836B (en) * 2019-01-18 2024-04-19 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN110321532A (en) * 2019-06-06 2019-10-11 数译(成都)信息技术有限公司 Language pre-processes punctuate method, computer equipment and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11126091A (en) * 1997-10-22 1999-05-11 Atr Onsei Honyaku Tsushin Kenkyusho:Kk Voice language processing unit conversion device
CN1235312A (en) * 1998-05-13 1999-11-17 国际商业机器公司 Automatic punctuating for continuous speech recognition
US6067514A (en) * 1998-06-23 2000-05-23 International Business Machines Corporation Method for automatically punctuating a speech utterance in a continuous speech recognition system
CN103247291A (en) * 2013-05-07 2013-08-14 华为终端有限公司 Updating method, device, and system of voice recognition device
CN105074817A (en) * 2013-03-15 2015-11-18 高通股份有限公司 Systems and methods for switching processing modes using gestures
CN106484134A (en) * 2016-09-20 2017-03-08 深圳Tcl数字技术有限公司 The method and device of the phonetic entry punctuation mark based on Android system


Also Published As

Publication number Publication date
CN107564526A (en) 2018-01-09

Similar Documents

Publication Publication Date Title
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN107632980B (en) Voice translation method and device for voice translation
CN106098060B (en) Method and device for error correction processing of voice
CN107102746B (en) Candidate word generation method and device and candidate word generation device
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN111145756B (en) Voice recognition method and device for voice recognition
CN107291704B (en) Processing method and device for processing
CN108628813B (en) Processing method and device for processing
CN107274903B (en) Text processing method and device for text processing
CN110069624B (en) Text processing method and device
CN107424612B (en) Processing method, apparatus and machine-readable medium
CN111160047A (en) Data processing method and device and data processing device
CN111831806A (en) Semantic integrity determination method and device, electronic equipment and storage medium
CN111369978A (en) Data processing method and device and data processing device
CN109979435B (en) Data processing method and device for data processing
CN112036195A (en) Machine translation method, device and storage medium
CN108073294B (en) Intelligent word forming method and device for intelligent word forming
CN111400443B (en) Information processing method, device and storage medium
CN114462410A (en) Entity identification method, device, terminal and storage medium
CN110780749B (en) Character string error correction method and device
CN108345590B (en) Translation method, translation device, electronic equipment and storage medium
CN108073566B (en) Word segmentation method and device and word segmentation device
CN109388252B (en) Input method and device
CN112149432A (en) Method and device for translating chapters by machine and storage medium
CN113589954A (en) Data processing method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant