CN112331194A - Input method and device and electronic equipment


Info

Publication number: CN112331194A
Application number: CN201910703691.4A
Authority: CN (China)
Prior art keywords: text, voice, user, similarity, voice data
Legal status: Granted; currently active
Other languages: Chinese (zh)
Other versions: CN112331194B (en)
Inventors: 王丹, 崔欣
Current and original assignee: Beijing Sogou Technology Development Co Ltd
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201910703691.4A
Publication of CN112331194A; application granted; publication of CN112331194B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/221 Announcement of recognition results
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science
  • Computational Linguistics
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Machine Translation
  • User Interface Of Digital Computer

Abstract

Embodiments of the invention provide an input method, an input device, and an electronic device. The method comprises the following steps: acquiring first voice data input by a user, recognizing the first voice data as a first text, and displaying the first text; acquiring second voice data input by the user, recognizing the second voice data as a second text, and judging whether the user intends to modify the first text; and, when it is determined that the user intends to modify the first text, modifying the first text according to the second text. After the user finds that the input method has misrecognized the voice, the user can correct the misrecognized text simply by speaking the sentence again, without modifying it manually, which improves input efficiency.

Description

Input method and device and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to an input method, an input device, and an electronic device.
Background
With the development of computer technology, electronic devices such as mobile phones and tablet computers have become increasingly popular, bringing great convenience to people's daily life, study, and work. These electronic devices typically have an input method application (input method for short) installed so that users can enter information through it.
With the progress of voice recognition technology, voice input has gradually developed into an input mode of input methods: a user can trigger voice input on the input method interface and then speak; after receiving the user's voice data, the input method performs voice recognition on it and displays the recognition result in the edit box.
When the voice recognition result is wrong, the user has to correct it manually, for example by deleting the wrong text and typing the correct text. Suppose the user says "today is the seventh day of work" but the recognition result is "today is the first day of work"; the user then has to delete "first" in the edit box and enter "seventh" via the input method keyboard. Another option is voice correction after triggering a dedicated modification mode; in the example above, the user would trigger the modification mode and then say the command "replace 'first' with 'seventh'". In either case, when a voice recognition result is wrong, correcting it requires cumbersome operations and input efficiency is low.
Disclosure of Invention
Embodiments of the invention provide an input method that improves input efficiency.
Correspondingly, embodiments of the invention also provide an input device and an electronic device to ensure the implementation and application of the above method.
In order to solve the above problem, an embodiment of the present invention discloses an input method, which specifically includes: acquiring first voice data input by a user, recognizing the first voice data as a first text, and displaying the first text; acquiring second voice data input by the user, recognizing the second voice data as a second text, and judging whether the user intends to modify the first text; and, when it is determined that the user intends to modify the first text, modifying the first text according to the second text.
Optionally, the judging whether the user intends to modify the first text includes: calculating the voice similarity between the first voice data and the second voice data by using a voice similarity algorithm; and judging, according to the voice similarity, whether the user intends to modify the first text.
Optionally, the judging whether the user intends to modify the first text includes: calculating the text similarity between the first text and the second text by using a text similarity algorithm; and judging, according to the text similarity, whether the user intends to modify the first text.
Optionally, the calculating the voice similarity between the first voice data and the second voice data includes: dividing the first voice data into a plurality of voice segments and generating a plurality of voice segment sets from them, where each voice segment set contains one voice segment or several consecutive voice segments; and calculating the voice similarity between each voice segment set and the second voice data. The judging, according to the voice similarity, whether the user intends to modify the first text then includes: judging, according to the maximum voice similarity, whether the user intends to modify the first text.
Optionally, the modifying the first text according to the second text includes: replacing, with the second text, the text corresponding to the voice segment set with the maximum voice similarity.
Optionally, the calculating the text similarity between the first text and the second text includes: dividing the first text into a plurality of text segments and generating a plurality of text segment sets from them, where each text segment set contains one text segment or several consecutive text segments; and calculating the text similarity between each text segment set and the second text. The judging, according to the text similarity, whether the user intends to modify the first text then includes: judging, according to the maximum text similarity, whether the user intends to modify the first text.
Optionally, the modifying the first text according to the second text includes: replacing, with the second text, the text segment corresponding to the text segment set with the maximum text similarity.
Optionally, the modifying the first text according to the second text includes: replacing the first text with the second text.
Optionally, after the recognition of the second text, the method further includes: displaying the second text in an edit box; and the modifying the first text according to the second text includes: deleting the first text.
Optionally, the modifying the first text according to the second text includes: displaying the second text in a candidate bar; and, upon receiving a screen-up instruction, replacing the first text with the second text corresponding to the screen-up instruction.
Optionally, the method further includes: displaying the second text when it is determined that the user does not intend to modify the first text.
The embodiment of the invention also discloses an input device, which specifically includes: a first acquisition module for acquiring first voice data input by a user, recognizing the first voice data as a first text, and displaying the first text; a second acquisition module for acquiring second voice data input by the user and recognizing the second voice data as a second text; a determining module for judging whether the user intends to modify the first text; and a modification module for modifying the first text according to the second text when it is determined that the user intends to modify the first text.
Optionally, the determining module includes: a voice similarity calculation submodule for calculating the voice similarity between the first voice data and the second voice data by using a voice similarity algorithm; and a first intention judgment submodule for judging, according to the voice similarity, whether the user intends to modify the first text.
Optionally, the determining module includes: a text similarity calculation submodule for calculating the text similarity between the first text and the second text by using a text similarity algorithm; and a second intention judgment submodule for judging, according to the text similarity, whether the user intends to modify the first text.
Optionally, the voice similarity calculation submodule is configured to divide the first voice data into a plurality of voice segments, generate a plurality of voice segment sets from them, where each voice segment set contains one voice segment or several consecutive voice segments, and calculate the voice similarity between each voice segment set and the second voice data; and the first intention judgment submodule is configured to judge, according to the maximum voice similarity, whether the user intends to modify the first text.
Optionally, the modification module includes: a first text modification submodule for replacing, with the second text, the text corresponding to the voice segment set with the maximum voice similarity.
Optionally, the text similarity calculation submodule is configured to divide the first text into a plurality of text segments, generate a plurality of text segment sets from them, where each text segment set contains one text segment or several consecutive text segments, and calculate the text similarity between each text segment set and the second text; and the second intention judgment submodule is configured to judge, according to the maximum text similarity, whether the user intends to modify the first text.
Optionally, the modification module includes: a second text modification submodule for replacing, with the second text, the text segment corresponding to the text segment set with the maximum text similarity.
Optionally, the modification module includes: a third text modification submodule for replacing the first text with the second text.
Optionally, the device further includes: a first presentation module for displaying the second text in an edit box after the recognition of the second text; and the modification module includes: a fourth text modification submodule for deleting the first text.
Optionally, the modification module includes: a fifth text modification submodule for displaying the second text in a candidate bar and, upon receiving a screen-up instruction, replacing the first text with the second text corresponding to the screen-up instruction.
Optionally, the device further includes: a second presentation module for displaying the second text when it is determined that the user does not intend to modify the first text.
The embodiment of the invention also discloses a readable storage medium; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the input method according to any one of the embodiments of the invention.
An embodiment of the present invention also discloses an electronic device, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for: acquiring first voice data input by a user, recognizing the first voice data as a first text, and displaying the first text; acquiring second voice data input by the user, recognizing the second voice data as a second text, and judging whether the user intends to modify the first text; and, when it is determined that the user intends to modify the first text, modifying the first text according to the second text.
Optionally, the judging whether the user intends to modify the first text includes: calculating the voice similarity between the first voice data and the second voice data by using a voice similarity algorithm; and judging, according to the voice similarity, whether the user intends to modify the first text.
Optionally, the judging whether the user intends to modify the first text includes: calculating the text similarity between the first text and the second text by using a text similarity algorithm; and judging, according to the text similarity, whether the user intends to modify the first text.
Optionally, the calculating the voice similarity between the first voice data and the second voice data includes: dividing the first voice data into a plurality of voice segments and generating a plurality of voice segment sets from them, where each voice segment set contains one voice segment or several consecutive voice segments; and calculating the voice similarity between each voice segment set and the second voice data. The judging, according to the voice similarity, whether the user intends to modify the first text then includes: judging, according to the maximum voice similarity, whether the user intends to modify the first text.
Optionally, the modifying the first text according to the second text includes: replacing, with the second text, the text corresponding to the voice segment set with the maximum voice similarity.
Optionally, the calculating the text similarity between the first text and the second text includes: dividing the first text into a plurality of text segments and generating a plurality of text segment sets from them, where each text segment set contains one text segment or several consecutive text segments; and calculating the text similarity between each text segment set and the second text. The judging, according to the text similarity, whether the user intends to modify the first text then includes: judging, according to the maximum text similarity, whether the user intends to modify the first text.
Optionally, the modifying the first text according to the second text includes: replacing, with the second text, the text segment corresponding to the text segment set with the maximum text similarity.
Optionally, the modifying the first text according to the second text includes: replacing the first text with the second text.
Optionally, after the recognition of the second text, the one or more programs further include instructions for: displaying the second text in an edit box; and the modifying the first text according to the second text includes: deleting the first text.
Optionally, the modifying the first text according to the second text includes: displaying the second text in a candidate bar; and, upon receiving a screen-up instruction, replacing the first text with the second text corresponding to the screen-up instruction.
Optionally, the one or more programs further include instructions for: displaying the second text when it is determined that the user does not intend to modify the first text.
The embodiment of the invention has the following advantages:
In summary, in the embodiment of the present invention, the input method can acquire first voice data input by a user, recognize it as a first text, and display the first text. When the user determines that the recognition result of the first voice data is wrong, the user can input second voice data; the input method then acquires the second voice data, recognizes it as a second text, and judges whether the user intends to modify the first text. When it is determined that the user intends to modify the first text, the first text can be modified according to the second text. Thus, after finding a voice recognition error, the user only needs to speak the sentence again to correct the misrecognized text, without manual editing, which improves input efficiency.
Drawings
FIG. 1 is a flow chart of the steps of an input method embodiment of the present invention;
FIG. 2 is a flow chart of the steps of an alternative embodiment of an input method of the present invention;
FIG. 3 is a block diagram of an input device according to an embodiment of the present invention;
FIG. 4 is a block diagram of an alternative embodiment of an input device of the present invention;
FIG. 5 illustrates a block diagram of an electronic device for input, in accordance with an exemplary embodiment;
FIG. 6 is a schematic structural diagram of an electronic device for input according to another exemplary embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features, and advantages of the present invention more comprehensible, embodiments are described in further detail below with reference to the accompanying figures.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of an input method according to the present invention is shown, which may specifically include the following steps:
Step 102: acquiring first voice data input by a user, recognizing the first voice data as a first text, and displaying the first text.
Step 104: acquiring second voice data input by the user, recognizing the second voice data as a second text, and judging whether the user intends to modify the first text.
Step 106: when it is determined that the user intends to modify the first text, modifying the first text according to the second text.
In the embodiment of the invention, the input method provides a function for automatically correcting erroneous voice recognition results. After a user inputs the voice data of a sentence, if the user determines that the input method recognized that sentence wrongly, the user can input voice data for the sentence again, adjusting tone, pronunciation, volume, and so on. After acquiring the newly input voice data, the input method performs voice recognition on it and judges whether the recognition result of the previously input voice data needs to be modified. If so, the previous recognition result is modified according to the current recognition result, thereby automatically correcting the erroneous voice recognition result.
For convenience of description, the voice data successively input by the user are referred to below as first voice data and second voice data.
After the user inputs the first voice data, the input method acquires it, performs voice recognition on it, determines the corresponding first text, and displays it, for example in an edit box.
If, from the displayed first text, the user determines that the recognition of the first voice data is wrong, the user can input second voice data intended to modify the first text. If the user determines that the recognition is correct, the second voice data the user inputs corresponds to the next piece of text instead. Therefore, after receiving the second voice data, the input method on the one hand recognizes it and determines the corresponding second text, and on the other hand judges whether the user intends to modify the first text. This judgment can be made in multiple ways, for example by comparing the first voice data with the second voice data, or, after the second text has been recognized, by comparing the first text with the second text; the embodiment of the present invention is not limited in this regard.
When it is determined that the user intends to modify the first text, the first text can be modified according to the second text, for example by replacing the first text with the second text; the embodiment of the present invention is not limited in this regard either. Conversely, when it is determined that the user does not intend to modify the first text, the second text can be displayed directly.
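To make the overall flow concrete, the following is a minimal control-flow sketch of steps 102 to 106. It is an illustration only: the helpers recognize, has_modify_intent, apply_modification, and display are hypothetical placeholders for the components described in this embodiment, not names taken from the patent.

```python
# Minimal sketch of the method of FIG. 1 (steps 102-106). All helper
# functions are hypothetical placeholders supplied by the caller.
def handle_voice_input(first_voice, second_voice,
                       recognize, has_modify_intent,
                       apply_modification, display):
    first_text = recognize(first_voice)              # step 102
    display(first_text)
    second_text = recognize(second_voice)            # step 104
    if has_modify_intent(first_voice, first_text, second_voice, second_text):
        display(apply_modification(first_text, second_text))  # step 106
    else:
        display(first_text + second_text)            # no intent: append
```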
In one example of the invention, after the user inputs first voice data meaning "no check", the input method acquires the first voice data, recognizes the first text as "no eye done", and displays it. When the user sees that the recognition is wrong, the user adjusts tone, volume, and so on, and speaks the same content again as second voice data. Correspondingly, the input method acquires the second voice data and recognizes the second text as "not to be accepted". It can then determine that the user intends to modify the first text and modify it according to the second text, for example changing "do not use eye done" in the edit box to "do not use acceptance".
In summary, in the embodiment of the present invention, the input method can acquire first voice data input by a user, recognize it as a first text, and display the first text. When the user determines that the recognition result of the first voice data is wrong, the user can input second voice data; the input method then acquires the second voice data, recognizes it as a second text, and judges whether the user intends to modify the first text. When it is determined that the user intends to modify the first text, the first text can be modified according to the second text. Thus, after finding a voice recognition error, the user only needs to speak the sentence again to correct the misrecognized text, without manual editing, which improves input efficiency.
Another embodiment of the present invention describes how to judge whether the user intends to modify the first text and how to modify it.
Referring to fig. 2, a flowchart illustrating steps of an alternative embodiment of the input method of the present invention is shown, which may specifically include the following steps:
step 202, acquiring first voice data input by a user, recognizing the first voice data as a first text and displaying the first text.
In the embodiment of the invention, a user can trigger a voice input function such as clicking a voice input identifier in an input method interface and then input first voice data; after receiving the first voice data, the input method may perform voice recognition on the first voice data, and determine a corresponding first text. For example, the first speech data may be subjected to speech enhancement, and then the speech-enhanced first speech data is input into a speech recognition model to obtain a corresponding first text. The first text may be displayed in an edit box, where names of edit boxes in different application programs are different, for example, an edit box in a chat application may refer to a message input box, an edit box in a browser may refer to a search box, and the like, which is not limited in this embodiment of the present invention.
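A minimal sketch of step 202 under the assumptions just stated (enhance first, then recognize); both enhance and speech_recognition_model are hypothetical placeholders, since the patent names no concrete enhancement method or model:

```python
# Step 202 in sketch form: speech enhancement followed by recognition.
# `enhance` and `speech_recognition_model` are hypothetical placeholders.
def recognize_voice(voice_data, enhance, speech_recognition_model):
    enhanced = enhance(voice_data)                 # e.g. noise suppression
    return speech_recognition_model(enhanced)      # text for the edit box
```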
Step 204: acquiring second voice data input by the user and recognizing the second voice data as a second text.
After the user inputs the second voice data, the input method receives it and performs voice recognition on it to determine the corresponding second text; the recognition procedure is similar to that for the first voice data and is not repeated here.
In the embodiment of the invention, after acquiring the second voice data, the input method can judge whether the user intends to modify the first text either from the first and second voice data (see steps 206-208) or, after recognizing the second text, from the first and second texts (see steps 210-212).
Step 206: calculating the voice similarity between the first voice data and the second voice data using a voice similarity algorithm.
Step 208: judging, according to the voice similarity, whether the user intends to modify the first text.
In one example of the present invention, the voice similarity between the first voice data and the second voice data can be calculated with a voice similarity algorithm. In this calculation, the first and second voice data can each undergo a series of processing steps such as frequency-domain shaping, level adjustment, filtering, and compensation, after which the similarity of the processed signals is scored to obtain the voice similarity. The voice similarity is then compared with a voice similarity threshold: if it is greater than the threshold, it can be determined that the user intends to modify the first text; otherwise, it can be determined that the user does not. The voice similarity threshold can be set as required, which the embodiment of the present invention does not limit.
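The patent does not fix a concrete voice similarity algorithm. As one crude, hedged illustration, the sketch below scores two waveforms by the cosine similarity of their average magnitude spectra and applies a threshold; a real implementation would first perform the level adjustment, filtering, and compensation described above, and the threshold value here is an arbitrary placeholder.

```python
import numpy as np

def speech_similarity(wave_a: np.ndarray, wave_b: np.ndarray,
                      frame: int = 512) -> float:
    """Cosine similarity of average magnitude spectra (illustrative only).

    Assumes each waveform is at least one frame long.
    """
    def mean_spectrum(wave: np.ndarray) -> np.ndarray:
        usable = len(wave) // frame * frame   # drop the trailing partial frame
        frames = wave[:usable].reshape(-1, frame)
        return np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)

    a, b = mean_spectrum(wave_a), mean_spectrum(wave_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def modify_intent_from_voice(wave_a, wave_b, threshold: float = 0.8) -> bool:
    return speech_similarity(wave_a, wave_b) > threshold
```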
In the embodiment of the present invention, one way to calculate the voice similarity between the first voice data and the second voice data is to compute it over the entire first voice data and the entire second voice data.
After a user inputs the voice data of a sentence, the input method may misrecognize only part of that sentence. To further improve input efficiency, the user should not have to repeat the whole sentence: re-inputting only the voice segment corresponding to the misrecognized text should suffice to correct the partially misrecognized sentence.
Therefore, in an example of the present invention, another way to calculate the voice similarity is: divide the first voice data into a plurality of voice segments and generate a plurality of voice segment sets from them, where each voice segment set contains one voice segment or several consecutive voice segments; calculate the voice similarity between each voice segment set and the second voice data; and judge, according to the maximum voice similarity, whether the user intends to modify the first text.
In an example of the present invention, the first voice data can be split into voice segments wherever the time interval between two adjacent frames exceeds an interval threshold; the threshold can be set as required, which the embodiment of the present invention does not limit. For example, suppose the user says "i do all over well, do not want to speak, find me bar tomorrow", the corresponding first text is recognized as "i do all over well, do not want to speak, take me bar tomorrow", and the second voice data is the user repeating "find me bar tomorrow". If the pause between the first and second clauses and the pause between the second and third clauses both exceed the interval threshold, the first voice data is divided into three voice segments: segment A1, the voice corresponding to "i do all over well"; segment A2, the voice corresponding to "do not want to speak"; and segment A3, the voice corresponding to "find me bar tomorrow". These three segments yield 6 voice segment sets: S1{A1}, S2{A2}, S3{A3}, S4{A1, A2}, S5{A2, A3}, and S6{A1, A2, A3}. For each voice segment set, the voice similarity with the second voice data is calculated, the maximum voice similarity is selected, and it is used to judge whether the user intends to modify the first text. The voice segment set with the maximum voice similarity is the part of the first voice data most similar to the second voice data. For example, if the voice similarities of the 6 voice segment sets with the second voice data are 0.25 (S1), 0.21 (S2), 0.94 (S3), 0.18 (S4), 0.67 (S5), and 0.32 (S6), then the maximum voice similarity is 0.94 and the corresponding voice segment set is S3.
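The segment sets in this example are exactly the contiguous sub-sequences of the segment list: n segments yield n*(n+1)/2 sets, i.e. 6 sets for three segments, matching S1 through S6 above. A small sketch follows (illustrative; the patent prescribes no data structure):

```python
def contiguous_segment_sets(segments):
    """All sets of one segment or several consecutive segments."""
    n = len(segments)
    return [segments[i:j] for i in range(n) for j in range(i + 1, n + 1)]

# contiguous_segment_sets(["A1", "A2", "A3"]) returns:
# [["A1"], ["A1","A2"], ["A1","A2","A3"], ["A2"], ["A2","A3"], ["A3"]]
```

The modification target is then the set whose audio scores the highest similarity against the second voice data, here S3.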
Step 210: calculating the text similarity between the first text and the second text using a text similarity algorithm.
Step 212: judging, according to the text similarity, whether the user intends to modify the first text.
In another example of the present invention, a text similarity algorithm can be used to calculate the text similarity between the first text and the second text. For example, the edit distance between the first text and the second text can be calculated and the text similarity determined from it; or the Jaccard coefficient of the first text and the second text can be calculated and the text similarity determined from it; or word vectors of the first text and the second text can be calculated and the text similarity determined from them. Of course, other measures can also be used, which the embodiment of the present invention does not limit. The text similarity is then compared with a text similarity threshold: if it is greater than the threshold, it can be determined that the user intends to modify the first text; otherwise, it can be determined that the user does not. The text similarity threshold can be set as required, which the embodiment of the present invention does not limit.
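For illustration, minimal versions of two of the measures named above follow: an edit-distance-based similarity normalised into [0, 1], and a Jaccard coefficient computed over character sets. Neither formulation is mandated by the patent.

```python
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def edit_similarity(a: str, b: str) -> float:
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)

def jaccard_similarity(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0
```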
In one example of the present invention, one way to calculate the text similarity between the first text and the second text is to compute it over the entire first text and the entire second text.
In another example, the text similarity can be calculated as follows: divide the first text into a plurality of text segments and generate a plurality of text segment sets from them, where each text segment set contains one text segment or several consecutive text segments; calculate the text similarity between each text segment set and the second text; and judge, according to the maximum text similarity, whether the user intends to modify the first text.
In the embodiment of the present invention, after performing voice recognition, the input method may add punctuation marks to the recognition result, so one way to divide the first text into text segments is to split it at the punctuation marks. For example, the first text "i do all over well, do not want to speak, find me bar tomorrow" contains two punctuation marks and can be divided into 3 text segments: text segment B1 "i do all over well", text segment B2 "do not want to speak", and text segment B3 "find me bar tomorrow". These three text segments then form 6 text segment sets, analogous to the 6 voice segment sets formed from three voice segments above, which is not repeated here.
In the embodiment of the invention, the input method may fail to add correct punctuation marks to a recognition result, or a short sentence may contain no punctuation at all. Another way to divide the first text into text segments is therefore to apply word segmentation to it. For example, word-segmenting the first text "i happy today" gives 3 text segments: text segment C1 "I", text segment C2 "today", and text segment C3 "happy". These three text segments generate 6 text segment sets: R1{C1}, R2{C2}, R3{C3}, R4{C1, C2}, R5{C2, C3}, and R6{C1, C2, C3}. The text similarity between each text segment set and the second text is then determined, the maximum text similarity is selected, and it is used to judge whether the user intends to modify the first text. The text segment set with the maximum text similarity is the part of the first text most similar to the second text. For example, if the second text is "fellow happy" and its similarities with the 6 text segment sets are 0.33 (R1), 0.29 (R2), 0.87 (R3), 0.22 (R4), 0.57 (R5), and 0.42 (R6), then the maximum text similarity is 0.87 and the corresponding text segment set is R3.
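Combining the punctuation-based splitting with the set generation and a text similarity measure gives a compact sketch of this step. It reuses contiguous_segment_sets() and edit_similarity() from the sketches above; the scores it produces will not reproduce the illustrative numbers of the example.

```python
import re

def best_matching_span(first_text: str, second_text: str) -> str:
    """Return the contiguous span of first_text most similar to second_text."""
    pieces = [p for p in re.split(r"[，。！？,.!?]", first_text) if p]
    spans = ["".join(s) for s in contiguous_segment_sets(pieces)]
    return max(spans, key=lambda span: edit_similarity(span, second_text))
```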
Of course, the above approaches can also be combined: after the text similarity and the voice similarity are obtained, they can be combined by weighted calculation. The weighted result then serves as the final similarity, which is compared against a joint similarity threshold to judge whether the user intends to modify the first text. The joint similarity threshold, as well as the weights of the text similarity and the voice similarity, can be set as required, which the embodiment of the present invention does not limit.
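A sketch of the weighted combination, with placeholder weights and threshold (the patent leaves all three as free parameters):

```python
def joint_modify_intent(speech_sim: float, text_sim: float,
                        w_speech: float = 0.5, w_text: float = 0.5,
                        joint_threshold: float = 0.7) -> bool:
    final_similarity = w_speech * speech_sim + w_text * text_sim
    return final_similarity > joint_threshold
```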
Step 214: replacing the first text with the second text.
In the embodiment of the invention, the second text is not displayed in the edit box after it is recognized; instead, once it is determined that the user intends to modify the first text, the second text is substituted for the first text.
In one example of the invention, the input method may actively replace the first text with the second text.
If the voice similarity was calculated over the entire first and second voice data in step 206, or the text similarity over the entire first and second text in step 210, the first text can be replaced in two ways. One way is to replace the entire first text with the entire second text: delete the first text in the edit box and insert the second text at its position. The other way is to replace only parts of the first text with the corresponding parts of the second text: compare the first text with the second text to find the error words in the first text and the correct words in the second text, then replace the former with the latter. Here an error word in the first text corresponds positionally to a correct word in the second text: the words of the two texts are compared one by one, and when the word at some position of the first text differs from the word at the same position of the second text, the former is called an error word and the latter a correct word. The error word is then deleted from the edit box and the corresponding correct word inserted at its position.
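The word-by-word replacement just described can be sketched as follows. This naive version assumes both recognition results segment into the same number of words; when the word counts differ, a real system would need an alignment step, which the patent does not detail.

```python
def replace_error_words(first_words, second_words):
    """Replace each differing word of the first text with the second text's word."""
    return [w2 if w1 != w2 else w1
            for w1, w2 in zip(first_words, second_words)]

# Echoing the Background example:
# replace_error_words(["today","is","the","first","day","of","work"],
#                     ["today","is","the","seventh","day","of","work"])
# -> ["today", "is", "the", "seventh", "day", "of", "work"]
```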
If step 206 calculated the voice similarity between each voice segment set of the first voice data and the second voice data, the first text can be modified by replacing, with the second text, the text corresponding to the voice segment set with the maximum voice similarity. Likewise, if step 210 used the text similarity between each text segment set of the first text and the second text, the second text can replace the text segment corresponding to the text segment set with the maximum text similarity.
In one example of the present invention, the input method can also replace the first text based on a user trigger.
In that case, the second text is displayed in a candidate bar; upon receiving a screen-up instruction, the first text is replaced with the second text corresponding to that instruction, in a manner similar to the above, which is not repeated here. When there are multiple second texts, each of them can be shown in the candidate bar, and when the user performs a screen-up operation on one of them, the first text is modified according to that second text.
Alternatively, the second text can be displayed in the edit box right after recognition; once it is determined that the user intends to modify the first text, the first text is simply deleted, the displayed second text then taking its place.
Step 216: displaying the second text.
In the embodiment of the present invention, if it is determined that the user does not intend to modify the first text, the second text is displayed directly after the first text in the edit box.
In summary, in the embodiment of the present invention, the input method can acquire first voice data input by a user, recognize it as a first text, and display the first text. When the user determines that the recognition result of the first voice data is wrong, the user can input second voice data; the input method then acquires the second voice data, recognizes it as a second text, and judges whether the user intends to modify the first text. When it is determined that the user intends to modify the first text, the first text is modified according to the second text. Thus, after finding a voice recognition error, the user only needs to speak the sentence again to correct the misrecognized text, without manual editing, which improves input efficiency and user experience.
Secondly, in the embodiment of the present invention, a voice similarity algorithm can be used to calculate the voice similarity between the first and second voice data, or a text similarity algorithm can be used to calculate the text similarity between the first and second text, in order to judge whether the user intends to modify the first text; the two similarities can also be weighted and combined for this judgment. The user's modification intent can thereby be determined accurately, which reduces the rate of erroneous modifications and further improves input efficiency.
Further, in the embodiment of the present invention, the first voice data can be divided into a plurality of voice segments from which a plurality of voice segment sets are generated, each containing one voice segment or several consecutive voice segments; the voice similarity between each voice segment set and the second voice data is calculated, and the maximum voice similarity is used to judge whether the user intends to modify the first text. Similarly, the first text can be divided into a plurality of text segments from which a plurality of text segment sets are generated, each containing one text segment or several consecutive text segments; the text similarity between each text segment set and the second text is calculated, and the maximum text similarity is used for the judgment. Consequently, when the input method misrecognizes part of a sentence, the user only needs to re-input the voice segment corresponding to the misrecognized text rather than the whole sentence, which makes the operation simple and improves user experience.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 3, a block diagram of an embodiment of an input device according to the present invention is shown, which may specifically include the following modules:
a first obtaining module 302, configured to obtain first voice data input by a user, recognize the first voice data as a first text, and display the first text;
a second obtaining module 304, configured to obtain second voice data input by the user and recognize the second voice data as a second text;
a determining module 306, configured to judge whether the user intends to modify the first text;
a modification module 308, configured to modify the first text according to the second text when it is determined that the user intends to modify the first text.
Referring to fig. 4, a block diagram of an alternative embodiment of an input device of the present invention is shown.
In an optional embodiment of the present invention, the determining module 306 includes:
a voice similarity calculation submodule 3062, configured to calculate the voice similarity between the first voice data and the second voice data using a voice similarity algorithm;
a first intention judgment submodule 3064, configured to judge, according to the voice similarity, whether the user intends to modify the first text.
In an optional embodiment of the present invention, the determining module 306 includes:
a text similarity calculation submodule 3066, configured to calculate the text similarity between the first text and the second text using a text similarity algorithm;
a second intention judgment submodule 3068, configured to judge, according to the text similarity, whether the user intends to modify the first text.
In an optional embodiment of the present invention, the voice similarity calculation submodule 3062 is configured to divide the first voice data into a plurality of voice segments, generate a plurality of voice segment sets from them, where each voice segment set contains one voice segment or several consecutive voice segments, and calculate the voice similarity between each voice segment set and the second voice data; the first intention judgment submodule 3064 is configured to judge, according to the maximum voice similarity, whether the user intends to modify the first text.
In an alternative embodiment of the present invention, the modifying module 308 includes:
a first text modification submodule 3082, configured to replace, with the second text, the text corresponding to the voice segment set with the maximum voice similarity.
In an optional embodiment of the present invention, the text similarity calculation submodule 3066 is configured to divide the first text into a plurality of text segments, generate a plurality of text segment sets from them, where each text segment set contains one text segment or several consecutive text segments, and calculate the text similarity between each text segment set and the second text; the second intention judgment submodule 3068 is configured to judge, according to the maximum text similarity, whether the user intends to modify the first text.
In an alternative embodiment of the present invention, the modifying module 308 includes:
a second text modification submodule 3084, configured to replace, with the second text, the text segment corresponding to the text segment set with the maximum text similarity.
In an alternative embodiment of the present invention, the modifying module 308 includes:
a third text modification submodule 3086, configured to replace the first text with the second text.
In an optional embodiment of the present invention, the apparatus further comprises:
a first presentation module 310, configured to display the second text in an edit box after the recognition of the second text;
the modification module 308 includes: a fourth text modification submodule 3088, configured to delete the first text.
In an alternative embodiment of the present invention, the modifying module 308 includes:
a fifth text modification submodule 30810, configured to display the second text in a candidate bar and, upon receiving a screen-up instruction, replace the first text with the second text corresponding to the screen-up instruction.
In an optional embodiment of the present invention, the apparatus further comprises:
a second presentation module 312, configured to display the second text when it is determined that the user does not intend to modify the first text.
In summary, in the embodiment of the present invention, the input method can acquire first voice data input by a user, recognize it as a first text, and display the first text. When the user determines that the recognition result of the first voice data is wrong, the user can input second voice data; the input device then acquires the second voice data, recognizes it as a second text, and judges whether the user intends to modify the first text. When it is determined that the user intends to modify the first text, the first text can be modified according to the second text. Thus, after finding a voice recognition error, the user only needs to speak the sentence again to correct the misrecognized text, without manual editing, which improves input efficiency.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
FIG. 5 is a block diagram illustrating a structure of an electronic device 500 for input, according to an example embodiment. For example, the electronic device 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, electronic device 500 may include one or more of the following components: a processing component 502, a memory 504, a power component 506, a multimedia component 508, an audio component 510, an input/output (I/O) interface 512, a sensor component 514, and a communication component 516.
The processing component 502 generally controls the overall operation of the electronic device 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 502 may include one or more processors 520 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 502 can include one or more modules that facilitate interaction between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
The memory 504 is configured to store various types of data to support operation at the device 500. Examples of such data include instructions for any application or method operating on the electronic device 500, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 504 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power component 506 provides power to the various components of the electronic device 500. Power components 506 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 500.
The multimedia component 508 includes a screen that provides an output interface between the electronic device 500 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 508 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 500 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 510 is configured to output and/or input audio signals. For example, the audio component 510 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 500 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 504 or transmitted via the communication component 516. In some embodiments, audio component 510 further includes a speaker for outputting audio signals.
The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 514 includes one or more sensors for providing various aspects of status assessment for the electronic device 500. For example, the sensor assembly 514 may detect the open/closed state of the electronic device 500 and the relative positioning of components, such as the display and keypad of the electronic device 500. The sensor assembly 514 may also detect a change in the position of the electronic device 500 or of a component of the electronic device 500, the presence or absence of user contact with the electronic device 500, the orientation or acceleration/deceleration of the electronic device 500, and a change in the temperature of the electronic device 500. The sensor assembly 514 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate wired or wireless communication between the electronic device 500 and other devices. The electronic device 500 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 516 receives a broadcast signal or broadcast-associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 516 further includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium comprising instructions, such as the memory 504 comprising instructions, which are executable by the processor 520 of the electronic device 500 to perform the above-described method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform an input method, the method comprising: acquiring first voice data input by a user, identifying the first voice data as a first text, and displaying the first text; acquiring second voice data input by the user, identifying the second voice data as a second text, and judging whether the user has the intention of modifying the first text; and when it is determined that the user has the intention of modifying the first text, modifying the first text according to the second text.
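For concreteness, the flow recited above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `recognize` stands in for any speech recognition engine, `voice_similarity` is sketched in a later example, and the 0.6 threshold is an assumed value the disclosure does not fix.

```python
# Minimal sketch of the recited flow. `recognize` and `voice_similarity`
# are hypothetical stand-ins; the threshold value is assumed.
SIMILARITY_THRESHOLD = 0.6


def recognize(voice_data: bytes) -> str:
    raise NotImplementedError("plug in a real ASR engine here")


def voice_similarity(a: bytes, b: bytes) -> float:
    raise NotImplementedError("see the MFCC/DTW sketch below")


class InputSession:
    """Tracks the previous utterance so that a re-spoken sentence can be
    detected and used to replace the previously displayed text."""

    def __init__(self) -> None:
        self.last_voice = None
        self.display_text = ""

    def handle(self, voice_data: bytes) -> str:
        text = recognize(voice_data)
        if (self.last_voice is not None
                and voice_similarity(self.last_voice, voice_data)
                >= SIMILARITY_THRESHOLD):
            # Modify intent detected: replace the first text with the second.
            self.display_text = text
        else:
            # Ordinary new input: append the recognized text.
            self.display_text += text
        self.last_voice = voice_data
        return self.display_text
```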
Optionally, the determining whether the user has an intent to modify the first text includes: calculating the voice similarity between the first voice data and the second voice data by using a voice similarity algorithm; and judging, according to the voice similarity, whether the user has the intention of modifying the first text.
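The disclosure does not prescribe a particular voice similarity algorithm. One common choice is to compare MFCC feature sequences with dynamic time warping (DTW); the sketch below assumes the librosa library, and the mapping from alignment cost to a (0, 1] score is an illustrative choice, not taken from the patent.

```python
import numpy as np
import librosa


def voice_similarity(y1: np.ndarray, y2: np.ndarray, sr: int = 16000) -> float:
    """Score two waveforms in (0, 1] via DTW over their MFCC sequences."""
    m1 = librosa.feature.mfcc(y=y1, sr=sr, n_mfcc=13)
    m2 = librosa.feature.mfcc(y=y2, sr=sr, n_mfcc=13)
    D, wp = librosa.sequence.dtw(X=m1, Y=m2, metric="euclidean")
    avg_cost = D[-1, -1] / len(wp)  # normalize by warping-path length
    return 1.0 / (1.0 + avg_cost)   # squash the cost into (0, 1]


def has_modify_intent(y1: np.ndarray, y2: np.ndarray,
                      sr: int = 16000, threshold: float = 0.6) -> bool:
    # A second utterance very similar to the first is treated as a
    # re-spoken correction rather than new content.
    return voice_similarity(y1, y2, sr) >= threshold
```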
Optionally, the determining whether the user has an intent to modify the first text includes: calculating the text similarity between the first text and the second text by using a text similarity algorithm; and judging, according to the text similarity, whether the user has the intention of modifying the first text.
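Likewise, no specific text similarity algorithm is mandated. A character-level matching ratio from Python's standard library is one simple possibility:

```python
from difflib import SequenceMatcher


def text_similarity(a: str, b: str) -> float:
    """Proportion of matching characters, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


# Example: a re-spoken sentence differing in one character scores high,
# e.g. text_similarity("今天天气很好", "今天天气真好") ≈ 0.83.
```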
Optionally, the calculating the voice similarity between the first voice data and the second voice data includes: dividing the first voice data into a plurality of voice segments, and generating a plurality of voice segment sets from the plurality of voice segments, wherein each voice segment set includes one voice segment or a plurality of consecutive voice segments; and calculating the voice similarity between each voice segment set and the second voice data respectively. The judging, according to the voice similarity, whether the user has the intention of modifying the first text includes: judging, according to the maximum voice similarity, whether the user has the intention of modifying the first text.
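A segment set as described here is any run of one or more consecutive segments, so n segments yield n(n+1)/2 candidate sets. A minimal sketch that enumerates them and returns the set most similar to the second utterance follows; it is generic over the segment type, so the same routine also serves the text-segment variant described later. The `similarity` and `combine` callables are assumptions supplied by the caller.

```python
from typing import Callable, Sequence, Tuple


def best_span(segments: Sequence, query,
              similarity: Callable,
              combine: Callable = lambda parts: "".join(parts)
              ) -> Tuple[int, int, float]:
    """Return (start, end, score) of the run of consecutive segments most
    similar to `query`. `combine` joins a run into one comparable unit,
    e.g. "".join for text or np.concatenate for waveforms."""
    best = (0, 0, float("-inf"))
    for i in range(len(segments)):
        for j in range(i + 1, len(segments) + 1):
            score = similarity(combine(segments[i:j]), query)
            if score > best[2]:
                best = (i, j, score)
    return best
```

Whether the user has the intention to modify is then judged by comparing the maximum score against a threshold.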
Optionally, the modifying the first text according to the second text includes: replacing, with the second text, the text corresponding to the voice segment set having the maximum voice similarity.
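This voice-driven replacement can be sketched by pairing each voice segment with the text it was recognized as. The alignment (`pairs`), the helpers reused from the earlier sketches, and the threshold are all assumptions, not details fixed by the disclosure.

```python
import numpy as np


def correct_via_voice(pairs, second_voice, second_text, threshold=0.6):
    """`pairs` is a list of (voice_segment, text_segment) tuples, an assumed
    alignment produced by the recognizer. Finds the run of voice segments
    most similar to the second utterance and replaces the corresponding
    text. Reuses best_span() and voice_similarity() from above."""
    voices = [v for v, _ in pairs]
    texts = [t for _, t in pairs]
    i, j, score = best_span(voices, second_voice, voice_similarity,
                            combine=np.concatenate)
    if score < threshold:
        return "".join(texts)  # no modify intent detected
    return "".join(texts[:i]) + second_text + "".join(texts[j:])
```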
Optionally, the calculating the text similarity between the first text and the second text includes: dividing the first text into a plurality of text segments, and generating a plurality of text segment sets from the plurality of text segments, wherein each text segment set includes one text segment or a plurality of consecutive text segments; and calculating the text similarity between each text segment set and the second text respectively. The judging, according to the text similarity, whether the user has the intention of modifying the first text includes: judging, according to the maximum text similarity, whether the user has the intention of modifying the first text.
Optionally, the modifying the first text according to the second text includes: replacing, with the second text, the text segment corresponding to the text segment set having the maximum text similarity.
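Putting the two previous sketches together for the text-segment variant: split the first text into segments (the disclosure leaves the segmentation granularity open; here the segments are assumed to be pre-split, e.g. by a word segmenter), locate the consecutive span most similar to the second text, and splice the second text over it.

```python
from typing import Sequence


def correct_text(segments: Sequence[str], second_text: str,
                 threshold: float = 0.6) -> str:
    """Replace the span of `segments` most similar to `second_text`.
    Reuses best_span() and text_similarity() from the sketches above;
    the threshold is an assumed value."""
    i, j, score = best_span(segments, second_text, text_similarity)
    if score < threshold:
        return "".join(segments)  # no modify intent detected
    return "".join(segments[:i]) + second_text + "".join(segments[j:])


# e.g. with the first text pre-segmented as ["我想", "去", "天安们"] and a
# re-spoken "天安门", the "天安们" span scores highest (≈0.67) and is
# replaced, yielding "我想去天安门".
```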
Optionally, the modifying the first text according to the second text includes: replacing the first text with the second text.
Optionally, after the second voice data is recognized as the second text, the method further includes: displaying the second text in an edit box. In this case, the modifying the first text according to the second text includes: deleting the first text.
Optionally, the modifying the first text according to the second text includes: displaying the second text in a candidate bar; and receiving a screen-up (commit) instruction, and replacing the first text with the second text corresponding to the screen-up instruction.
Fig. 6 is a schematic structural diagram of an electronic device 600 for input according to another exemplary embodiment of the present invention. The electronic device 600 may be a server, which may vary greatly in configuration or capability and may include one or more central processing units (CPUs) 622 (e.g., one or more processors), memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) storing applications 642 or data 644. The memory 632 and the storage medium 630 may be transient or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 622 may be configured to communicate with the storage medium 630 and to execute, on the server, the series of instruction operations in the storage medium 630.
The server may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, one or more keyboards 656, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
An electronic device comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: acquiring first voice data input by a user, identifying the first voice data as a first text, and displaying the first text; acquiring second voice data input by the user, identifying the second voice data as a second text, and judging whether the user has the intention of modifying the first text; and when it is determined that the user has the intention of modifying the first text, modifying the first text according to the second text.
Optionally, the determining whether the user has an intent to modify the first text includes: calculating the voice similarity between the first voice data and the second voice data by using a voice similarity algorithm; and judging, according to the voice similarity, whether the user has the intention of modifying the first text.
Optionally, the determining whether the user has an intent to modify the first text includes: calculating the text similarity between the first text and the second text by using a text similarity algorithm; and judging, according to the text similarity, whether the user has the intention of modifying the first text.
Optionally, the calculating the voice similarity between the first voice data and the second voice data includes: dividing the first voice data into a plurality of voice segments, and generating a plurality of voice segment sets from the plurality of voice segments, wherein each voice segment set includes one voice segment or a plurality of consecutive voice segments; and calculating the voice similarity between each voice segment set and the second voice data respectively. The judging, according to the voice similarity, whether the user has the intention of modifying the first text includes: judging, according to the maximum voice similarity, whether the user has the intention of modifying the first text.
Optionally, the modifying the first text according to the second text includes: replacing, with the second text, the text corresponding to the voice segment set having the maximum voice similarity.
Optionally, the calculating the text similarity between the first text and the second text includes: dividing the first text into a plurality of text segments, and generating a plurality of text segment sets from the plurality of text segments, wherein each text segment set includes one text segment or a plurality of consecutive text segments; and calculating the text similarity between each text segment set and the second text respectively. The judging, according to the text similarity, whether the user has the intention of modifying the first text includes: judging, according to the maximum text similarity, whether the user has the intention of modifying the first text.
Optionally, the modifying the first text according to the second text includes: replacing, with the second text, the text segment corresponding to the text segment set having the maximum text similarity.
Optionally, the modifying the first text according to the second text includes: replacing the first text with the second text.
Optionally, after the second voice data is recognized as the second text, the one or more programs further include instructions for: displaying the second text in an edit box. In this case, the modifying the first text according to the second text includes: deleting the first text.
Optionally, the modifying the first text according to the second text includes: displaying the second text in a candidate bar; and receiving a screen-up (commit) instruction, and replacing the first text with the second text corresponding to the screen-up instruction.
Optionally, the one or more programs further include instructions for: presenting the second text upon determining that the user does not have an intent to modify the first text.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises the element.
The input method, the input device, and the electronic device provided by the present invention have been described in detail above. The principle and implementation of the present invention are explained herein using specific examples, and the description of the embodiments above is only intended to help understand the method and core idea of the present invention. Meanwhile, for those skilled in the art, variations may exist in the specific implementation and scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. An input method, comprising:
acquiring first voice data input by a user, identifying the first voice data as a first text and displaying the first text;
acquiring second voice data input by the user, identifying the second voice data as a second text and judging whether the user has the intention of modifying the first text;
and when the user is determined to have the intention of modifying the first text, modifying the first text according to the second text.
2. The method of claim 1, wherein determining whether the user has an intent to modify the first text comprises:
calculating the voice similarity between the first voice data and the second voice data by using a voice similarity algorithm;
and judging, according to the voice similarity, whether the user has the intention of modifying the first text.
3. The method of claim 1, wherein determining whether the user has an intent to modify the first text comprises:
calculating the text similarity between the first text and the second text by using a text similarity algorithm;
and judging, according to the text similarity, whether the user has the intention of modifying the first text.
4. The method of claim 2, wherein calculating the voice similarity between the first voice data and the second voice data comprises:
dividing the first voice data into a plurality of voice segments, and generating a plurality of voice segment sets from the plurality of voice segments, wherein each voice segment set comprises one voice segment or a plurality of consecutive voice segments;
calculating the voice similarity between each voice segment set and the second voice data respectively;
wherein the judging whether the user has the intention of modifying the first text according to the voice similarity comprises:
judging, according to the maximum voice similarity, whether the user has the intention of modifying the first text.
5. The method of claim 4, wherein said modifying the first text based on the second text comprises:
replacing, with the second text, the text corresponding to the voice segment set having the maximum voice similarity.
6. The method of claim 3, wherein the calculating the text similarity between the first text and the second text comprises:
dividing the first text into a plurality of text segments, and generating a plurality of text segment sets from the plurality of text segments, wherein each text segment set comprises one text segment or a plurality of consecutive text segments;
calculating the text similarity between each text segment set and the second text respectively;
wherein the judging whether the user has the intention of modifying the first text according to the text similarity comprises:
judging, according to the maximum text similarity, whether the user has the intention of modifying the first text.
7. The method of claim 6, wherein said modifying the first text based on the second text comprises:
replacing, with the second text, the text segment corresponding to the text segment set having the maximum text similarity.
8. An input device, comprising:
the first acquisition module is used for acquiring first voice data input by a user, identifying the first voice data as a first text and displaying the first text;
the second acquisition module is used for acquiring second voice data input by a user and identifying the second voice data as a second text;
a determining module for determining whether a user has an intent to modify the first text;
and the modification module is used for modifying the first text according to the second text when determining that the user has the intention of modifying the first text.
9. A readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the input method according to any one of claims 1 to 7.
10. An electronic device comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring first voice data input by a user, identifying the first voice data as a first text and displaying the first text;
acquiring second voice data input by the user, identifying the second voice data as a second text and judging whether the user has the intention of modifying the first text;
and when the user is determined to have the intention of modifying the first text, modifying the first text according to the second text.
CN201910703691.4A 2019-07-31 2019-07-31 Input method and device and electronic equipment Active CN112331194B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910703691.4A CN112331194B (en) 2019-07-31 2019-07-31 Input method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910703691.4A CN112331194B (en) 2019-07-31 2019-07-31 Input method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN112331194A true CN112331194A (en) 2021-02-05
CN112331194B CN112331194B (en) 2024-06-18

Family

ID=74319624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910703691.4A Active CN112331194B (en) 2019-07-31 2019-07-31 Input method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112331194B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005116992A1 (en) * 2004-05-27 2005-12-08 Koninklijke Philips Electronics N.V. Method of and system for modifying messages
CN1941077A (en) * 2005-09-27 2007-04-04 株式会社东芝 Apparatus and method speech recognition of character string in speech input
EP2685453A1 (en) * 2012-07-12 2014-01-15 Samsung Electronics Co., Ltd Method for correcting voice recognition error and broadcast receiving apparatus applying the same.
CN103106061A (en) * 2013-03-05 2013-05-15 北京车音网科技有限公司 Voice input method and device
CN103645876A (en) * 2013-12-06 2014-03-19 百度在线网络技术(北京)有限公司 Voice inputting method and device
CN104331265A (en) * 2014-09-30 2015-02-04 北京金山安全软件有限公司 Voice input method, device and terminal
CN109192197A (en) * 2018-09-18 2019-01-11 湖北函数科技有限公司 Big data speech recognition system Internet-based
CN111243593A (en) * 2018-11-09 2020-06-05 奇酷互联网络科技(深圳)有限公司 Speech recognition error correction method, mobile terminal and computer-readable storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113177114A (en) * 2021-05-28 2021-07-27 重庆电子工程职业学院 Natural language semantic understanding method based on deep learning
CN117789706A (en) * 2024-02-27 2024-03-29 富迪科技(南京)有限公司 Audio information content identification method
CN117789706B (en) * 2024-02-27 2024-05-03 富迪科技(南京)有限公司 Audio information content identification method

Also Published As

Publication number Publication date
CN112331194B (en) 2024-06-18

Similar Documents

Publication Publication Date Title
CN109961791B (en) Voice information processing method and device and electronic equipment
CN111145756B (en) Voice recognition method and device for voice recognition
CN107944447B (en) Image classification method and device
CN105489220B (en) Voice recognition method and device
CN110210310B (en) Video processing method and device for video processing
CN107564526B (en) Processing method, apparatus and machine-readable medium
US11335348B2 (en) Input method, device, apparatus, and storage medium
CN110764627B (en) Input method and device and electronic equipment
CN111210844B (en) Method, device and equipment for determining speech emotion recognition model and storage medium
CN112562675A (en) Voice information processing method, device and storage medium
CN110931028B (en) Voice processing method and device and electronic equipment
CN111046210B (en) Information recommendation method and device and electronic equipment
CN110069143B (en) Information error correction preventing method and device and electronic equipment
CN109725736B (en) Candidate sorting method and device and electronic equipment
CN112735396A (en) Speech recognition error correction method, device and storage medium
CN112331194B (en) Input method and device and electronic equipment
CN111831132A (en) Information recommendation method and device and electronic equipment
CN114154485A (en) Text error correction method and device
CN105913841B (en) Voice recognition method, device and terminal
CN109887492B (en) Data processing method and device and electronic equipment
CN112363631A (en) Input method, input device and input device
CN114462410A (en) Entity identification method, device, terminal and storage medium
CN112612442B (en) Input method and device and electronic equipment
CN111524505B (en) Voice processing method and device and electronic equipment
CN114154395A (en) Model processing method and device for model processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TG01 Patent term adjustment