CN113589954A - Data processing method and device and electronic equipment - Google Patents

Data processing method and device and electronic equipment

Info

Publication number
CN113589954A
Authority
CN
China
Prior art keywords
input
sequence
pinyin
sentence
statistical model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010368472.8A
Other languages
Chinese (zh)
Inventor
姚波怀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202010368472.8A
Publication of CN113589954A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/02 Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F 3/023 Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F 3/0233 Character input methods
    • G06F 3/0237 Character input methods using prediction or retrieval techniques
    • G06F 3/0236 Character input methods using selection techniques to select from displayed items
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention provide a data processing method, a data processing apparatus, and an electronic device. The method comprises: acquiring an input sequence and input associated information; and performing long sentence prediction with a statistical model based on the input sequence and the input associated information to obtain sentence candidates output by the statistical model. Because the statistical model combines the input sequence with the input associated information when predicting long sentences, the accuracy of long sentence prediction is improved, and the user's input efficiency is improved in turn.

Description

Data processing method and device and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, and an electronic device.
Background
With the development of computer technology, electronic devices such as mobile phones and tablet computers have become increasingly popular, bringing great convenience to people's daily life, study, and work. These devices typically have an input method application (input method for short) installed, which the user relies on to enter information.
During input, the input method can predict various types of candidates matching the input sequence, such as sentence candidates, name candidates, and association candidates, for the user to choose from, thereby improving input efficiency. In the prior art, however, sentence-candidate prediction is inaccurate and does not meet the user's input needs well, so input efficiency is not improved as much as it could be.
Disclosure of Invention
Embodiments of the invention provide a data processing method that improves a user's input efficiency by improving the accuracy of long sentence prediction.
Correspondingly, embodiments of the invention also provide a data processing apparatus and an electronic device to ensure the implementation and application of the method.
In order to solve the above problem, an embodiment of the present invention discloses a data processing method, which specifically includes: acquiring an input sequence and input associated information; and performing long sentence prediction by adopting a statistical model based on the input sequence and the input associated information to obtain sentence candidates output by the statistical model.
Optionally, the performing long sentence prediction by using a statistical model based on the input sequence and the input associated information to obtain a sentence candidate output by the statistical model includes: and inputting the input sequence and the input associated information into a statistical model to obtain sentence candidates output by the statistical model.
Optionally, the input sequence includes a pinyin sequence, and the long sentence prediction is performed by using a statistical model based on the input sequence and the input association information to obtain the sentence candidates output by the statistical model, including: analyzing the pinyin sequence to obtain a corresponding target syllable network; and inputting the target syllable network and the input associated information into a statistical model to obtain sentence candidates output by the statistical model.
Optionally, the analyzing the pinyin sequence to obtain a corresponding target syllable network includes: correcting errors of the pinyin sequence to obtain a corresponding error correction sequence; and analyzing the pinyin sequence and the error correction sequence to obtain a corresponding target syllable network.
Optionally, the analyzing the pinyin sequence to obtain a corresponding target syllable network includes: parsing the pinyin sequence into pinyin of multiple forms; for pinyin of a target form, converting the target-form pinyin into pinyin identifiers whose prefixes match the target-form pinyin, so as to obtain a syllable path corresponding to the target-form pinyin; and generating the target syllable network from the multiple syllable paths.
Optionally, the method further comprises the step of generating the statistical model: collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, a historical input sequence and sentences input by a user under the conditions of the historical input associated information and the historical input sequence; counting the frequency of each sentence under the condition that the historical input associated information is the same and the historical input sequence is the same; and determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
Optionally, the method further comprises the step of generating the statistical model: collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, a syllable network corresponding to the historical pinyin sequence, and a sentence input by a user under the condition that the historical input associated information and the historical pinyin sequence correspond to the syllable network; counting the frequency of each sentence under the condition that historical input association information is the same and syllable networks are the same; and determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
The embodiment of the invention also discloses a data processing device, which specifically comprises: the acquisition module is used for acquiring the input sequence and the input associated information; and the prediction module is used for adopting a statistical model to carry out long sentence prediction based on the input sequence and the input associated information to obtain sentence candidates output by the statistical model.
Optionally, the prediction module includes: and the first sentence candidate prediction submodule is used for inputting the input sequence and the input associated information into a statistical model to obtain the sentence candidates output by the statistical model.
Optionally, the input sequence includes a pinyin sequence, and the prediction module includes: the analysis submodule is used for analyzing the pinyin sequence to obtain a corresponding target syllable network; and the second sentence candidate prediction submodule is used for inputting the target syllable network and the input associated information into a statistical model to obtain the sentence candidates output by the statistical model.
Optionally, the parsing submodule includes: the error correction analysis unit is used for carrying out error correction on the pinyin sequence to obtain a corresponding error correction sequence; and analyzing the pinyin sequence and the error correction sequence to obtain a corresponding target syllable network.
Optionally, the parsing submodule includes: a syllable network conversion unit, configured to parse the pinyin sequence into pinyin of multiple forms; for pinyin of a target form, convert the target-form pinyin into pinyin identifiers whose prefixes match the target-form pinyin, so as to obtain a syllable path corresponding to the target-form pinyin; and generate the target syllable network from the multiple syllable paths.
Optionally, the apparatus further comprises: a first model generation module for collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, a historical input sequence and sentences input by a user under the conditions of the historical input associated information and the historical input sequence; counting the frequency of each sentence under the condition that the historical input associated information is the same and the historical input sequence is the same; and determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
Optionally, the apparatus further comprises: a second model generation module for collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, a syllable network corresponding to the historical pinyin sequence, and a sentence input by a user under the condition that the historical input associated information and the historical pinyin sequence correspond to the syllable network; counting the frequency of each sentence under the condition that historical input association information is the same and syllable networks are the same; and determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
Embodiments of the invention also disclose a readable storage medium; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the data processing method according to any one of the embodiments of the invention.
An embodiment of the present invention also discloses an electronic device, including a memory, and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by one or more processors, and the one or more programs include instructions for: acquiring an input sequence and input associated information; and performing long sentence prediction by adopting a statistical model based on the input sequence and the input associated information to obtain sentence candidates output by the statistical model.
Optionally, the performing long sentence prediction by using a statistical model based on the input sequence and the input associated information to obtain a sentence candidate output by the statistical model includes: and inputting the input sequence and the input associated information into a statistical model to obtain sentence candidates output by the statistical model.
Optionally, the input sequence includes a pinyin sequence, and the long sentence prediction is performed by using a statistical model based on the input sequence and the input association information to obtain the sentence candidates output by the statistical model, including: analyzing the pinyin sequence to obtain a corresponding target syllable network; and inputting the target syllable network and the input associated information into a statistical model to obtain sentence candidates output by the statistical model.
Optionally, the analyzing the pinyin sequence to obtain a corresponding target syllable network includes: correcting errors of the pinyin sequence to obtain a corresponding error correction sequence; and analyzing the pinyin sequence and the error correction sequence to obtain a corresponding target syllable network.
Optionally, the analyzing the pinyin sequence to obtain a corresponding target syllable network includes: parsing the pinyin sequence into pinyin of multiple forms; for pinyin of a target form, converting the target-form pinyin into pinyin identifiers whose prefixes match the target-form pinyin, so as to obtain a syllable path corresponding to the target-form pinyin; and generating the target syllable network from the multiple syllable paths.
Optionally, the method further comprises generating the statistical model by: collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, a historical input sequence and sentences input by a user under the conditions of the historical input associated information and the historical input sequence; counting the frequency of each sentence under the condition that the historical input associated information is the same and the historical input sequence is the same; and determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
Optionally, further comprising instructions for generating the statistical model by: collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, a syllable network corresponding to the historical pinyin sequence, and a sentence input by a user under the condition that the historical input associated information and the historical pinyin sequence correspond to the syllable network; counting the frequency of each sentence under the condition that historical input association information is the same and syllable networks are the same; and determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
The embodiments of the invention have the following advantages:
in embodiments of the invention, an input sequence and input associated information can be acquired, and a statistical model can then perform long sentence prediction based on the input sequence and the input associated information to obtain the sentence candidates output by the statistical model. Because the statistical model combines the input sequence with the input associated information when predicting long sentences, the accuracy of long sentence prediction is improved, and the user's input efficiency is improved in turn.
Drawings
FIG. 1 is a flow chart of the steps of one data processing method embodiment of the present invention;
FIG. 2 is a flow chart of the steps of one embodiment of a model generation method of the present invention;
FIG. 3 is a flow chart of the steps of an alternative embodiment of a data processing method of the present invention;
FIG. 4 is a flow chart of steps of yet another model generation method embodiment of the present invention;
FIG. 5 is a flow chart of the steps of yet another alternative embodiment of a data processing method of the present invention;
FIG. 6 is a block diagram of an embodiment of a data processing apparatus according to the present invention;
FIG. 7 is a block diagram of an alternate embodiment of a data processing apparatus of the present invention;
FIG. 8 illustrates a block diagram of an electronic device for data processing in accordance with an exemplary embodiment;
FIG. 9 is a schematic structural diagram of an electronic device for data processing according to another exemplary embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features, and advantages of the present invention more comprehensible, embodiments are described in further detail below with reference to the accompanying figures.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a data processing method according to the present invention is shown, which may specifically include the following steps:
Step 102: acquiring an input sequence and input associated information.
In embodiments of the invention, long sentence prediction can be carried out while the user is entering the input sequence, and corresponding sentence candidates are generated.
Embodiments of the invention can be applied to long sentence prediction under various input modes: for example, stroke input, pinyin input, or voice input; the embodiments of the invention are not limited in this respect.
In addition, embodiments of the invention can be applied to long sentence prediction in multiple language scenarios, such as Chinese, English, or Korean input; the embodiments of the invention are likewise not limited in this respect.
Correspondingly, the input sequence may include a stroke sequence, a pinyin sequence, a foreign-language character string, and the like, which is not limited in the embodiments of the invention.
While the user is typing with the input method, the input sequence entered by the user and the input associated information can be obtained; corresponding sentence candidates are then predicted based on the acquired input sequence and input associated information.
The input associated information may include information related to the input, such as context information (the preceding text) and input environment information, which is not limited in the embodiments of the invention.
In an example of the present invention, a manner of performing long sentence prediction based on the obtained input sequence and input related information may refer to the following step 104:
Step 104: adopting a statistical model to perform long sentence prediction based on the input sequence and the input associated information, to obtain the sentence candidates output by the statistical model.
In embodiments of the invention, statistics can be gathered in advance to generate a statistical model, and long sentence prediction is then performed with the pre-generated model; the generation process of the statistical model is explained in the embodiments below. The statistical model can score each sentence based on the input associated information and the input sequence, and then output sentence candidates based on the scores. The sentence candidates output by the statistical model may be the sentences whose scores are greater than a first preset threshold, or the first X sentences with the highest scores. The first preset threshold may be set as required, and X is a positive integer that may likewise be set as required; neither is limited in the embodiments of the invention.
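The score-then-select behavior just described can be sketched as follows; the helper name, score values, and sentence labels are illustrative assumptions, not the patent's implementation:

```python
def select_candidates(scores, first_threshold=None, top_x=None):
    """Select sentence candidates from the statistical model's scores.

    scores: mapping from sentence to the score the model assigned it for
    the current (input sequence, input associated information). Keeps
    either the sentences scoring above the first preset threshold, or
    the first X sentences with the highest scores.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if first_threshold is not None:
        ranked = [(s, sc) for s, sc in ranked if sc > first_threshold]
    if top_x is not None:
        ranked = ranked[:top_x]
    return [s for s, _ in ranked]

# Hypothetical scores for three candidate sentences
scores = {"sentence A": 0.02, "sentence B": 0.15, "sentence C": 0.10}
print(select_candidates(scores, first_threshold=0.05))  # ['sentence B', 'sentence C']
print(select_candidates(scores, top_x=1))               # ['sentence B']
```

Either selection rule yields candidates ordered from most to least likely, which is the order an input method would display them in.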
In summary, in embodiments of the invention, an input sequence and input associated information can be acquired, and a statistical model can then perform long sentence prediction based on them to obtain the sentence candidates output by the statistical model. Because the statistical model combines the input sequence with the input associated information when predicting long sentences, the accuracy of long sentence prediction is improved, and the user's input efficiency is improved in turn.
In the embodiment of the present invention, the manner of generating the statistical model may include multiple manners, and one of the manners may be as follows:
referring to FIG. 2, a flow chart of the steps of one embodiment of the model generation method of the present invention is shown.
Step 202, collecting a plurality of sets of training data, each set of training data comprising: history input association information, a history input sequence, and a sentence input by a user on the condition of the history input association information and the history input sequence.
In embodiments of the invention, the input sequences historically entered by the user, the input associated information at the time each input sequence was entered, and the sentence the user input after entering the input sequence under that associated information can be collected.
For convenience of description, an input sequence historically entered by the user may be referred to as a historical input sequence, and the input associated information at the time the user entered it as historical input associated information. A set of training data may then consist of a historical input sequence, its corresponding historical input associated information, and the sentence input by the user under the condition of that historical input associated information and historical input sequence.
Step 204: counting the frequency of each sentence under the condition that the historical input associated information is the same and the historical input sequence is the same.
Step 206, determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
Among the collected sets of training data, there may be multiple sets in which the historical input associated information, the historical input sequence, and the sentence input by the user are all the same; there may also be multiple sets in which the historical input associated information and the historical input sequence are the same but the sentences input by the users differ; and, of course, there may be multiple sets in which all three differ.
Since most users' input habits are similar, during long sentence prediction a sentence whose input frequency under a given historical input sequence and its corresponding historical input associated information meets a preset condition can be predicted as a sentence the user is likely to input under that historical input sequence and associated information. In embodiments of the invention, the frequency of each sentence can therefore be counted under the condition that the historical input associated information is the same and the historical input sequence is the same. Then, for each sentence, its conditional probability can be determined from its frequency, and a statistical model can be generated based on the conditional probabilities of the sentences.
The conditional probability is the probability that an event A occurs given that another event B has occurred, written P(A | B). In embodiments of the invention, the conditional probability is the probability P(sentence | historical input associated information, historical input sequence) that the user inputs a sentence under the condition of the historical input associated information and the historical input sequence.
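This conditional probability can be estimated from the collected frequencies as a simple ratio of counts. The sketch below assumes training data laid out as (historical input associated information, historical input sequence, sentence) triples; the layout and names are illustrative, not the patent's format:

```python
from collections import Counter, defaultdict

def conditional_probabilities(training_data):
    """Estimate P(sentence | historical input associated information,
    historical input sequence) by counting, within each condition, how
    often each sentence was input.

    training_data: iterable of (assoc_info, input_seq, sentence) triples.
    Returns {(assoc_info, input_seq): {sentence: probability}}.
    """
    counts = defaultdict(Counter)
    for assoc_info, input_seq, sentence in training_data:
        counts[(assoc_info, input_seq)][sentence] += 1
    return {
        condition: {s: c / sum(sent_counts.values())
                    for s, c in sent_counts.items()}
        for condition, sent_counts in counts.items()
    }

# Hypothetical data: under context "ctx" and pinyin sequence "h",
# users input "hen hao" twice and "hao de" once.
data = [("ctx", "h", "hen hao"), ("ctx", "h", "hen hao"), ("ctx", "h", "hao de")]
probs = conditional_probabilities(data)
print(probs[("ctx", "h")]["hen hao"])  # 2 of 3 occurrences, roughly 0.667
```

Each condition's probabilities sum to 1, so the estimate is a proper conditional distribution over the sentences observed under that condition.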
In one example, the sentences whose conditional probability is greater than a second preset threshold may be determined, and the sentences whose conditional probability is less than or equal to that threshold filtered out; the statistical model is then generated from the retained sentences. For each retained sentence, the sentence itself, its conditional probability, and its condition (i.e., the corresponding historical input associated information and historical input sequence) may be stored in the statistical model. The second preset threshold may be set as required, which is not limited in the embodiments of the invention.
In another example, the first N sentences with the highest conditional probability may be determined and the other sentences filtered out; the statistical model is then generated from those N sentences. For each of them, the sentence itself, its conditional probability, and its condition (i.e., the corresponding historical input associated information and historical input sequence) may be stored in the statistical model. N is a positive integer that may be set as required, which is not limited in the embodiments of the invention.
For example, suppose that under the condition that the historical input associated information is the context "you are me" and the historical input sequence is the pinyin sequence "h", the three sentences input by users are "you are good to me", "you are important to me", and "you are good to me", with conditional probabilities of 0.02, 0.15, and 0.1 respectively. If the second preset threshold is 0.015, these three sentences, their corresponding conditions, and their conditional probabilities may all be stored in the statistical model.
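Continuing this example, the threshold filtering that decides what gets stored in the statistical model might look like the sketch below; the dictionary layout and the abbreviated sentence labels are illustrative assumptions:

```python
def build_model(cond_probs, second_threshold):
    """Keep, per condition, only the sentences whose conditional
    probability exceeds the second preset threshold; the retained
    sentences, their probabilities, and their condition make up the
    statistical model."""
    model = {}
    for condition, sentence_probs in cond_probs.items():
        kept = {s: p for s, p in sentence_probs.items() if p > second_threshold}
        if kept:
            model[condition] = kept
    return model

# Condition: context "you are me", pinyin sequence "h"; three sentences
# with the conditional probabilities from the example (texts abbreviated).
cond_probs = {("you are me", "h"): {"sentence 1": 0.02,
                                    "sentence 2": 0.15,
                                    "sentence 3": 0.1}}
model = build_model(cond_probs, second_threshold=0.015)
print(len(model[("you are me", "h")]))  # 3, since all exceed 0.015
```

With a higher threshold (say 0.05), only the second and third sentences would survive the filter.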
The following description will take as an example the case where long sentence prediction is performed using the statistical model generated in steps 202 to 206 and a candidate sentence is output.
Referring to fig. 3, a flowchart illustrating steps of an alternative embodiment of the data processing method of the present invention is shown, which may specifically include the following steps:
Step 302: acquiring an input sequence and input associated information.
The input sequence may include a single code or multiple codes, which is not limited in the embodiments of the invention. The input associated information may include the context information and the input environment information, and may also include other information, which is not limited in the embodiments of the invention; the context information may include interaction information and/or the content of the edit box.
In the embodiment of the present invention, a way of obtaining a sentence candidate output by a statistical model by using the statistical model to perform long sentence prediction based on the input sequence and the input association information may refer to step 304:
Step 304: inputting the input sequence and the input associated information into the statistical model to obtain the sentence candidates output by the statistical model.
In embodiments of the invention, the acquired input sequence and input associated information can be input directly into the statistical model generated in steps 202 to 206, and the statistical model calculates, for each sentence, the conditional probability that the user inputs it under the condition of the input sequence and the input associated information. Then, the sentences whose conditional probability is greater than the first preset threshold may be output as sentence candidates; alternatively, the first X sentences with the highest conditional probability may be output as sentence candidates, which is not limited in the embodiments of the invention.
In summary, in embodiments of the invention, after the input sequence and the input associated information are acquired, they can be input directly into the statistical model to obtain the sentence candidates it outputs; since the information input into the statistical model requires no further processing, sentence candidates can be obtained quickly.
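The direct-lookup prediction of step 304 can be sketched as follows; the model layout (a dict keyed by the (input associated information, input sequence) condition, mapping sentences to conditional probabilities) and the names are hypothetical:

```python
def predict(model, assoc_info, input_seq, top_x=3):
    """Look up the current condition in the statistical model and return
    the first X sentence candidates with the highest conditional
    probability; the inputs need no further processing before lookup."""
    sentence_probs = model.get((assoc_info, input_seq), {})
    ranked = sorted(sentence_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [s for s, _ in ranked[:top_x]]

# Hypothetical stored model for one condition
model = {("you are me", "h"): {"sentence A": 0.02,
                               "sentence B": 0.15,
                               "sentence C": 0.1}}
print(predict(model, "you are me", "h", top_x=2))  # ['sentence B', 'sentence C']
```

A condition never seen during training simply returns no candidates, at which point the input method would fall back to its other candidate sources.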
In an optional embodiment of the present invention, the error correction may be performed on the input sequence to obtain an error correction sequence; then inputting the input sequence, the error correction sequence and the input associated information into a statistical model to obtain sentence candidates output by the statistical model; and further, under the condition that the user inputs by mistake, sentence candidates which hit the requirements of the user can be given.
Another way of generating a statistical model according to an embodiment of the present invention is described below by taking an input sequence as a pinyin sequence.
Referring to FIG. 4, a flowchart illustrating the steps of yet another model generation method embodiment of the present invention is shown.
Step 402, collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, syllable networks corresponding to the historical pinyin sequences, and sentences input by a user under the condition of the historical input associated information and the syllable networks corresponding to the historical pinyin sequences.
In the embodiment of the invention, the pinyin sequence historically input by the user, the input associated information when the user inputs the pinyin sequence, and the sentence input after the user inputs the pinyin sequence under the condition of inputting the associated information can be collected.
For convenience of subsequent description, a pinyin sequence input by a user in history may be referred to as a history pinyin sequence, and input association information when the user inputs the history pinyin sequence is referred to as history input association information. The historical pinyin sequence can be converted into a corresponding syllable network, and then the syllable network corresponding to the historical pinyin sequence, the historical input associated information corresponding to the historical pinyin sequence, and the sentence input by the user under the condition that the syllable network corresponding to the historical pinyin sequence and the historical input associated information are used as a group of training data.
In an example of the embodiment of the present invention, a way to convert the historical pinyin sequence to a corresponding syllable network may refer to sub-steps 22-26:
and a substep 22, analyzing the historical pinyin sequence into multiple forms of pinyin.
In the embodiment of the present invention, the same pinyin sequence may correspond to multiple forms of pinyin; for example, for the pinyin sequence "fangan", the corresponding forms of pinyin may include "fang'an", "fan'gan", "fa'n'gan", and the like. Therefore, the historical pinyin sequence can be analyzed and parsed into multiple forms of pinyin; each form of pinyin may include pinyins of at least M syllables, where M is a positive integer. For example: one form of pinyin is "fang'an", corresponding to a pinyin that includes two syllables: "fang" and "an"; one form of pinyin is "fan'gan", corresponding to a pinyin that includes two syllables: "fan" and "gan"; one form of pinyin is "fa'n'gan", corresponding to a pinyin that includes three syllables: "fa", "n", and "gan".
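The parsing of a pinyin sequence into its possible forms can be sketched as an enumeration of all ways to split the sequence into known syllables. The syllable inventory below is a tiny invented subset chosen so that the "fangan" example above works; a real input method would use the full pinyin syllable table.

```python
# Hypothetical subset of the pinyin syllable inventory, for illustration only.
SYLLABLES = {"fa", "fan", "fang", "an", "gan", "n"}

def segmentations(seq):
    """Enumerate all segmentations of seq into known syllables."""
    if not seq:
        return [[]]
    forms = []
    for i in range(1, len(seq) + 1):
        head = seq[:i]
        if head in SYLLABLES:
            for rest in segmentations(seq[i:]):
                forms.append([head] + rest)
    return forms

# "fangan" parses into the three forms named in the text:
# fa'n'gan, fan'gan, and fang'an.
for form in segmentations("fangan"):
    print("'".join(form))
```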
And a substep 24, aiming at the target form of pinyin, converting the target form of pinyin into a pinyin identifier matched with the target form of pinyin prefix, and obtaining a syllable path corresponding to the target form of pinyin.
In the embodiment of the invention, one form of pinyin can be selected from multiple forms of pinyin as the pinyin of a target form; the target form of pinyin is then converted to the corresponding syllable path. Wherein, the pinyin of one form can comprise pinyin of M syllables, and M is a positive integer.
Since most users, when inputting a pinyin sequence, generally input only the first pinyin character or the first few pinyin characters of the target characters, the embodiment of the invention can provide sentence candidates related to the target input even when the user has not input the complete pinyin sequence, thereby improving the user's input efficiency. Accordingly, when the target form of pinyin is converted into the corresponding syllable path, the target form of pinyin is converted into pinyin identifiers matched with the target form pinyin prefix, so as to obtain the syllable paths corresponding to the target form of pinyin; this increases the comprehensiveness of the training data and thus the comprehensiveness of the subsequently predicted sentence candidates.
One way of converting the target form of pinyin into pinyin identifiers matched with the target form pinyin prefix is as follows. The pinyin of the Mth syllable in the target form of pinyin is converted into every pinyin identifier matched with the pinyin prefix of the Mth syllable; here, prefix matching may mean that the pinyin corresponding to the pinyin identifier contains the pinyin of that syllable as a prefix. For the first M-1 syllables in the target form of pinyin, the syllables whose corresponding pinyin contains both an initial consonant and a final vowel, and the syllables whose corresponding pinyin contains only an initial consonant, are determined; syllables whose pinyin contains both an initial consonant and a final vowel are converted into pinyin identifiers that exactly match the corresponding pinyin, and syllables whose pinyin contains only an initial consonant are converted into pinyin identifiers matched with the corresponding initial consonant.
The target form of pinyin can be converted into pinyin identifiers matched with the target form pinyin prefix by querying a mapping relationship between pinyins and pinyin identifiers (such as pinyin IDs). For example, if the pinyin of the Mth syllable in the target form of pinyin is "h", the pinyins matched with the prefix "h" include "h", "hen", "he", "heng", "ha", and so on; then the pinyin identifier 99 corresponding to "h", the pinyin identifier 120 corresponding to "hen", the pinyin identifier 110 corresponding to "he", the pinyin identifier 122 corresponding to "heng", and the pinyin identifier 105 corresponding to "ha" are determined as the pinyin identifiers matching the pinyin prefix of the Mth syllable. For another example, if the pinyin of a syllable among the first M-1 syllables of the target form of pinyin is "he", only the identifier 110 corresponding to "he" is used as the pinyin identifier of that syllable.
Then, the pinyin identifiers corresponding to the syllables in the target form of pinyin can be used as syllable nodes; the Mth syllable in the target form of pinyin may correspond to Y pinyin identifiers, where Y is a positive integer. The first M-1 syllable nodes in the target form of pinyin and each of the Y syllable nodes corresponding to the Mth syllable then form a syllable path, so that Y syllable paths corresponding to the target form of pinyin can be obtained.
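Substeps 24 above can be sketched as follows: the first M-1 syllables map to their exact identifiers, while the Mth (last) syllable expands to every identifier whose pinyin has that syllable as a prefix, yielding Y syllable paths. The identifier table reuses the values from the "h" example in the text but is otherwise an invented assumption (the initial-only handling of the first M-1 syllables is omitted for brevity).

```python
# Hypothetical pinyin -> identifier mapping; the "h"-family IDs follow
# the example in the text, "ni" is invented.
PINYIN_IDS = {"h": 99, "ha": 105, "he": 110, "hen": 120, "heng": 122,
              "ni": 57}

def syllable_paths(form):
    """form: a target form of pinyin as a list of M syllables.

    Returns Y syllable paths: the exact identifiers of the first M-1
    syllables, joined with each identifier prefix-matching the Mth.
    """
    prefix_ids = [PINYIN_IDS[p] for p in sorted(PINYIN_IDS)
                  if p.startswith(form[-1])]
    exact = [PINYIN_IDS[s] for s in form[:-1]]  # first M-1 syllables: exact match
    return [exact + [pid] for pid in prefix_ids]

# "ni h": id(ni) paired with every identifier whose pinyin starts with "h".
for path in syllable_paths(["ni", "h"]):
    print(path)
```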
And a substep 26 of generating a target syllable network by using the plurality of syllable paths.
The syllable paths corresponding to the multiple forms of pinyin are then used to generate the syllable network corresponding to the historical pinyin sequence; the syllable network of the historical pinyin sequence may thus include a plurality of syllable paths.
In another example of the embodiment of the present invention, another way of converting the historical pinyin sequence into a corresponding syllable network may refer to substeps 42-44 as follows:
and a substep 42, performing error correction on the historical pinyin sequence to obtain a corresponding historical error correction sequence.
And a substep 44, analyzing the historical pinyin sequence and the historical error correction sequence to obtain a corresponding syllable network.
A user may make input errors during the input process; therefore, after the historical pinyin sequence is obtained, the historical pinyin sequence can be corrected to determine a corresponding historical error correction sequence. The historical pinyin sequence and the historical error correction sequence can then be analyzed separately to obtain the corresponding syllable networks; thus, even when the user made an input error, the corrected pinyin sequence can be converted into a corresponding syllable network. Both the syllable network obtained by analyzing the historical error correction sequence and the syllable network obtained by analyzing the historical pinyin sequence are called target syllable networks. The method for analyzing the historical error correction sequence to obtain the corresponding syllable network is similar to the method for analyzing the historical pinyin sequence, and is not repeated herein.
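One plausible realization of the error-correction idea in substeps 42-44 is to map a mistyped pinyin sequence to its nearest known sequence by string similarity; the patent does not specify an algorithm, so the use of `difflib`, the dictionary, and the cutoff value below are all assumptions for illustration.

```python
import difflib

# Hypothetical dictionary of well-formed pinyin sequences.
KNOWN = ["fangan", "fangfa", "ganfan"]

def correct(seq):
    """Return the closest known pinyin sequence, or seq itself if none is close."""
    matches = difflib.get_close_matches(seq, KNOWN, n=1, cutoff=0.6)
    return matches[0] if matches else seq

# "fsngan" is a plausible slip of "fangan" ("a" and "s" are adjacent on QWERTY).
print(correct("fsngan"))
```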
And step 404, counting the frequency of each sentence under the condition that the historical input association information is the same and the syllable network is the same.
Step 406, determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
Wherein, steps 404 to 406 are similar to steps 204 to 206, and are not described herein again.
The conditional probability in the embodiment of the present invention refers to the probability P(sentence | historical input associated information, syllable network corresponding to the historical pinyin sequence) of a sentence under the condition of the historical input associated information and the syllable network corresponding to the historical pinyin sequence.
In one example, sentences with conditional probabilities greater than a second preset threshold, conditional probabilities corresponding to sentences with conditional probabilities greater than the second preset threshold, and conditions corresponding to sentences with conditional probabilities greater than the second preset threshold (i.e., historical input association information and a syllable network corresponding to sentences with conditional probabilities greater than the second preset threshold) may be stored in the statistical model.
In another example, the first N sentences with the highest conditional probability, the conditional probability corresponding to each sentence in the first N sentences with the highest conditional probability, and the condition corresponding to each sentence in the first N sentences with the highest conditional probability (i.e., the historical input association information and the syllable network corresponding to each sentence in the first N sentences with the highest conditional probability) may be stored in the statistical model.
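Steps 404 to 406 (and, analogously, steps 204 to 206) can be sketched as a counting procedure: group training samples by their condition, turn sentence frequencies into conditional probabilities, and store only the first N sentences with the highest conditional probability per condition. The data and structure below are invented for illustration.

```python
from collections import Counter, defaultdict

# Each sample: ((historical input associated info, syllable network), sentence).
# Conditions and sentences here are invented placeholders.
training = [
    (("greeting", "net1"), "how are you"),
    (("greeting", "net1"), "how are you"),
    (("greeting", "net1"), "how old are you"),
]

def build_model(samples, top_n=2):
    """Count sentence frequency per condition; keep top-N with their
    conditional probabilities (frequency / total frequency for that condition)."""
    counts = defaultdict(Counter)
    for condition, sentence in samples:
        counts[condition][sentence] += 1
    model = {}
    for condition, ctr in counts.items():
        total = sum(ctr.values())
        model[condition] = {s: c / total for s, c in ctr.most_common(top_n)}
    return model

model = build_model(training)
print(model[("greeting", "net1")])
```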
Compared with the statistical model generated according to steps 202 to 206, the statistical model generated according to steps 402 to 406 requires less storage space and can also perform error correction in the long sentence prediction process.
The following description will take as an example the case where long sentence prediction is performed using the statistical model generated in steps 402 to 406, and a sentence candidate is output.
Referring to fig. 5, a flowchart illustrating steps of another alternative embodiment of the data processing method of the present invention is shown, which may specifically include the following steps:
step 502, obtaining a pinyin sequence and inputting associated information.
Step 502 is similar to step 302 described above and will not be described herein again.
In the embodiment of the invention, a way of predicting long sentences based on the pinyin sequence and the input associated information by adopting a statistical model to obtain sentence candidates output by the statistical model can be as follows, step 504-step 506:
and step 504, analyzing the pinyin sequence to obtain a corresponding target syllable network.
In one example of embodiment of the present invention, the step 504 may include the following substeps 62-66:
and a substep 62 of analyzing the pinyin sequence into multiple forms of pinyin.
And a substep 64 of converting the target form pinyin into a pinyin identifier matched with the target form pinyin prefix to obtain a syllable path corresponding to the target form pinyin.
And a substep 66 of generating a target syllable network using the plurality of syllable paths.
Among them, substeps 62 to 66 are similar to substeps 22 to 26 described above and will not be described herein.
In one example of embodiment of the present invention, the step 504 may include the following substeps 82-84:
substep 82, correcting the pinyin sequence to obtain a corresponding error correction sequence;
and a substep 84 of analyzing the pinyin sequence and the error correction sequence to obtain a corresponding target syllable network.
Among them, substeps 82-84 are similar to substeps 42-44 described above and will not be described again.
Step 506, inputting the target syllable network and the input associated information into a statistical model to obtain sentence candidates output by the statistical model.
In the embodiment of the invention, the syllable network of the pinyin sequence and the input associated information are input into the statistical model generated according to steps 402-406, and the statistical model calculates the conditional probability of each sentence being input by the user under the condition of the target syllable network corresponding to the obtained pinyin sequence and the input associated information. Then, the sentences with a conditional probability greater than a first preset threshold may be output as sentence candidates; alternatively, the first X sentences with the largest conditional probability may be output as sentence candidates, which is not limited in the embodiment of the present invention.
In summary, in the embodiments of the present invention, after a pinyin sequence and input associated information are obtained, the pinyin sequence may be analyzed to obtain a corresponding target syllable network, and the target syllable network and the input associated information may then be input into a statistical model to obtain the sentence candidates output by the statistical model; compared with performing long sentence prediction directly based on the pinyin sequence and the input associated information, the long sentence prediction of the statistical model in the embodiment of the invention is more efficient.
Secondly, in the embodiment of the invention, in the process of analyzing the pinyin sequence to obtain the corresponding target syllable network, the pinyin sequence can be corrected to obtain a corresponding error correction sequence, and the pinyin sequence and the error correction sequence can then be analyzed to obtain the corresponding target syllable network; the statistical model can thus perform conversion based on an accurate pinyin sequence to obtain accurate sentence candidates. Therefore, even when the user makes an input error, accurate sentence candidates can be predicted for the user, further improving the accuracy of the sentence candidates.
Furthermore, in the embodiment of the invention, the pinyin sequence can be analyzed into multiple forms of pinyin; then converting the pinyin of the target form into a pinyin identifier matched with the pinyin prefix of the target form to obtain a syllable path corresponding to the pinyin of the target form; and then, a target syllable network is generated by adopting a plurality of syllable paths, so that a more comprehensive syllable network can be obtained, and the comprehensiveness and the accuracy of sentence candidates are improved.
In addition, the statistical model can output the conditional probability of each sentence candidate while outputting each sentence candidate; and ordering the sentence candidates according to the conditional probability of each sentence candidate, and displaying each sentence candidate according to the ordered result.
In the embodiment of the invention, when there is a large amount of input associated information, the input sequence and part of the input associated information may be input into the statistical model, or the syllable network of the input sequence and part of the input associated information may be input into the statistical model, to obtain the sentence candidates output by the statistical model; this reduces the computational load of the statistical model. After the sentence candidates are obtained, they can be sorted based on the complete input associated information and the conditional probabilities of the sentence candidates, thereby improving the accuracy of the sentence candidate ordering. Of course, sentence candidates may also be ranked based on the complete input associated information alone, which is not limited in this embodiment of the present invention.
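One plausible reading of the paragraph above can be sketched as a two-stage scheme: predict with partial associated information to keep the model's load low, then re-rank the returned candidates using conditional probabilities computed under the complete associated information. The scoring values and names here are stand-in assumptions, not the patent's method.

```python
def rerank(candidates, full_scores):
    """Re-order candidates from the statistical model (partial information)
    by conditional probabilities computed under the complete information."""
    return sorted(candidates, key=lambda s: full_scores.get(s, 0.0),
                  reverse=True)

# First-stage candidates (partial information), then hypothetical
# conditional probabilities under the complete associated information.
partial = ["see you later", "see you", "seesaw"]
full = {"see you": 0.5, "see you later": 0.4, "seesaw": 0.02}
print(rerank(partial, full))
```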
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 6, a block diagram of a data processing apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:
an obtaining module 602, configured to obtain an input sequence and input association information;
and a prediction module 604, configured to perform long sentence prediction based on the input sequence and the input association information by using a statistical model, so as to obtain a sentence candidate output by the statistical model.
Referring to fig. 7, a block diagram of an alternative embodiment of a data processing apparatus of the present invention is shown.
In an alternative embodiment of the present invention, the prediction module 604 includes:
and a first sentence candidate prediction sub-module 6042, configured to input the input sequence and the input association information into a statistical model, so as to obtain a sentence candidate output by the statistical model.
In an alternative embodiment of the present invention, the input sequence includes a pinyin sequence, and the prediction module 604 includes:
an analysis submodule 6044, configured to analyze the pinyin sequence to obtain a corresponding target syllable network;
and a second sentence candidate prediction sub-module 6046, configured to input the target syllable network and the input association information into a statistical model, so as to obtain a sentence candidate output by the statistical model.
In an optional embodiment of the present invention, the parsing sub-module 6044 includes:
an error correction analysis unit 60442, configured to perform error correction on the pinyin sequence to obtain a corresponding error correction sequence; and analyzing the pinyin sequence and the error correction sequence to obtain a corresponding target syllable network.
In an optional embodiment of the present invention, the parsing sub-module 6044 includes:
a syllable network conversion unit 60444 for parsing the pinyin sequence into multiple forms of pinyin; aiming at the pinyin in the target form, converting the pinyin in the target form into a pinyin identifier matched with the pinyin prefix in the target form to obtain a syllable path corresponding to the pinyin in the target form; and generating a target syllable network by adopting a plurality of syllable paths.
In an optional embodiment of the present invention, the apparatus further comprises:
a first model generation module 606 configured to collect a plurality of sets of training data, each set of training data comprising: historical input associated information, a historical input sequence and sentences input by a user under the conditions of the historical input associated information and the historical input sequence; counting the frequency of each sentence under the condition that the historical input associated information is the same and the historical input sequence is the same; and determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
In an optional embodiment of the present invention, the apparatus further comprises:
a second model generation module 608, configured to collect a plurality of sets of training data, each set of training data including: historical input associated information, a syllable network corresponding to the historical pinyin sequence, and a sentence input by a user under the condition that the historical input associated information and the historical pinyin sequence correspond to the syllable network; counting the frequency of each sentence under the condition that historical input association information is the same and syllable networks are the same; and determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
In summary, in the embodiment of the present invention, an input sequence and input associated information may be obtained, and then a statistical model is used to perform long sentence prediction based on the input sequence and the input associated information, so as to obtain candidate sentences output by the statistical model; and then, long sentence prediction is carried out by combining the input sequence and the input associated information through a statistical model, so that the accuracy of the long sentence prediction is improved, and the input efficiency of a user is improved.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Fig. 8 is a block diagram illustrating a structure of an electronic device 800 for data processing according to an example embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 8, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power components 806 provide power to the various components of the electronic device 800. Power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the electronic device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a data processing method, the method comprising: acquiring an input sequence and input associated information; and performing long sentence prediction by adopting a statistical model based on the input sequence and the input associated information to obtain sentence candidates output by the statistical model.
Optionally, the performing long sentence prediction by using a statistical model based on the input sequence and the input associated information to obtain a sentence candidate output by the statistical model includes: and inputting the input sequence and the input associated information into a statistical model to obtain sentence candidates output by the statistical model.
Optionally, the input sequence includes a pinyin sequence, and the long sentence prediction is performed by using a statistical model based on the input sequence and the input association information to obtain the sentence candidates output by the statistical model, including: analyzing the pinyin sequence to obtain a corresponding target syllable network; and inputting the target syllable network and the input associated information into a statistical model to obtain sentence candidates output by the statistical model.
Optionally, the analyzing the pinyin sequence to obtain a corresponding target syllable network includes: correcting errors of the pinyin sequence to obtain a corresponding error correction sequence; and analyzing the pinyin sequence and the error correction sequence to obtain a corresponding target syllable network.
Optionally, the analyzing the pinyin sequence to obtain a corresponding target syllable network includes: analyzing the pinyin sequence into multiple forms of pinyin; aiming at the pinyin in the target form, converting the pinyin in the target form into a pinyin identifier matched with the pinyin prefix in the target form to obtain a syllable path corresponding to the pinyin in the target form; and generating a target syllable network by adopting a plurality of syllable paths.
Optionally, the method further comprises the step of generating the statistical model: collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, a historical input sequence and sentences input by a user under the conditions of the historical input associated information and the historical input sequence; counting the frequency of each sentence under the condition that the historical input associated information is the same and the historical input sequence is the same; and determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
Optionally, the method further comprises the step of generating the statistical model: collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, a syllable network corresponding to a historical pinyin sequence, and a sentence input by a user under the conditions of the historical input associated information and the syllable network; counting the frequency of each sentence under the condition that the historical input associated information is the same and the syllable networks are the same; and determining the conditional probability of each sentence according to the frequency of each sentence, and generating the statistical model based on the conditional probability of each sentence.
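The model-generation steps above, together with the corresponding prediction lookup, can be sketched as a conditional frequency table. The patent leaves the concrete model representation open, so the dictionary keyed on (associated information, input sequence) and the toy training data are assumptions:

```python
# Minimal sketch (under stated assumptions) of building a statistical model
# from counted frequencies and querying it for ranked sentence candidates.
from collections import Counter, defaultdict

def build_model(training_data):
    """training_data: iterable of (assoc_info, input_seq, sentence) triples.
    Returns {(assoc_info, input_seq): {sentence: conditional probability}}."""
    counts = defaultdict(Counter)
    for assoc, seq, sentence in training_data:
        counts[(assoc, seq)][sentence] += 1  # frequency under each condition
    model = {}
    for cond, ctr in counts.items():
        total = sum(ctr.values())
        model[cond] = {s: n / total for s, n in ctr.items()}  # P(sentence | cond)
    return model

def predict(model, assoc, seq):
    """Return sentence candidates ranked by conditional probability."""
    dist = model.get((assoc, seq), {})
    return sorted(dist, key=dist.get, reverse=True)

# Toy training data: same condition observed three times.
data = [
    ("chat", "nihao", "你好"),
    ("chat", "nihao", "你好"),
    ("chat", "nihao", "你好呀"),
]
model = build_model(data)
print(model[("chat", "nihao")]["你好"])    # ≈ 0.667
print(predict(model, "chat", "nihao")[0])  # 你好
```

The same structure covers the syllable-network variant: the second key component becomes a (hashable) syllable network instead of a raw input sequence.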
Fig. 9 is a schematic structural diagram of an electronic device 900 for data processing according to another exemplary embodiment of the present invention. The electronic device 900 may be a server, which may vary widely in configuration or performance, and may include one or more central processing units (CPUs) 922 (e.g., one or more processors), memory 932, and one or more storage media 930 (e.g., one or more mass storage devices) storing applications 942 or data 944. The memory 932 and the storage medium 930 may each be transient storage or persistent storage. The program stored on the storage medium 930 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 922 may be arranged to communicate with the storage medium 930 to execute, on the server, the series of instruction operations in the storage medium 930.
The server may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input-output interfaces 958, one or more keyboards 956, and/or one or more operating systems 941, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: acquiring an input sequence and input associated information; and performing long sentence prediction by adopting a statistical model based on the input sequence and the input associated information to obtain sentence candidates output by the statistical model.
Optionally, the performing long sentence prediction by using a statistical model based on the input sequence and the input associated information to obtain a sentence candidate output by the statistical model includes: inputting the input sequence and the input associated information into the statistical model to obtain the sentence candidates output by the statistical model.
Optionally, the input sequence includes a pinyin sequence, and the performing long sentence prediction by using a statistical model based on the input sequence and the input associated information to obtain the sentence candidates output by the statistical model includes: analyzing the pinyin sequence to obtain a corresponding target syllable network; and inputting the target syllable network and the input associated information into the statistical model to obtain the sentence candidates output by the statistical model.
Optionally, the analyzing the pinyin sequence to obtain a corresponding target syllable network includes: correcting errors of the pinyin sequence to obtain a corresponding error correction sequence; and analyzing the pinyin sequence and the error correction sequence to obtain a corresponding target syllable network.
Optionally, the analyzing the pinyin sequence to obtain a corresponding target syllable network includes: parsing the pinyin sequence into pinyin of multiple forms; for the pinyin of a target form, converting the pinyin of the target form into a pinyin identifier matching the pinyin prefix of the target form, to obtain a syllable path corresponding to the pinyin of the target form; and generating the target syllable network from the multiple syllable paths.
Optionally, the one or more programs further include instructions for generating the statistical model by: collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, a historical input sequence, and sentences input by a user under the conditions of the historical input associated information and the historical input sequence; counting the frequency of each sentence under the condition that the historical input associated information is the same and the historical input sequence is the same; and determining the conditional probability of each sentence according to the frequency of each sentence, and generating the statistical model based on the conditional probability of each sentence.
Optionally, the one or more programs further include instructions for generating the statistical model by: collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, a syllable network corresponding to a historical pinyin sequence, and a sentence input by a user under the conditions of the historical input associated information and the syllable network; counting the frequency of each sentence under the condition that the historical input associated information is the same and the syllable networks are the same; and determining the conditional probability of each sentence according to the frequency of each sentence, and generating the statistical model based on the conditional probability of each sentence.
The embodiments in the present specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts among the embodiments, reference may be made to one another.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or terminal that comprises the element.
The data processing method, the data processing apparatus, and the electronic device provided by the present invention are described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the description of the above embodiments is only intended to help understand the method and its core ideas. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the ideas of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A data processing method, comprising:
acquiring an input sequence and input associated information;
and performing long sentence prediction by adopting a statistical model based on the input sequence and the input associated information to obtain sentence candidates output by the statistical model.
2. The method of claim 1, wherein the performing long sentence prediction by using a statistical model based on the input sequence and the input associated information to obtain the sentence candidates output by the statistical model comprises:
and inputting the input sequence and the input associated information into a statistical model to obtain sentence candidates output by the statistical model.
3. The method of claim 1, wherein the input sequence comprises a pinyin sequence, and the performing long sentence prediction by using a statistical model based on the input sequence and the input associated information to obtain the sentence candidates output by the statistical model comprises:
analyzing the pinyin sequence to obtain a corresponding target syllable network;
and inputting the target syllable network and the input associated information into a statistical model to obtain sentence candidates output by the statistical model.
4. The method of claim 3, wherein the parsing the pinyin sequence to obtain a corresponding target syllable network comprises:
correcting errors of the pinyin sequence to obtain a corresponding error correction sequence;
and analyzing the pinyin sequence and the error correction sequence to obtain a corresponding target syllable network.
5. The method of claim 3, wherein the parsing the pinyin sequence to obtain a corresponding target syllable network comprises:
parsing the pinyin sequence into pinyin of multiple forms;
for the pinyin of a target form, converting the pinyin of the target form into a pinyin identifier matching the pinyin prefix of the target form, to obtain a syllable path corresponding to the pinyin of the target form;
and generating the target syllable network from the plurality of syllable paths.
6. The method of claim 2, further comprising the step of generating the statistical model by:
collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, a historical input sequence and sentences input by a user under the conditions of the historical input associated information and the historical input sequence;
counting the frequency of each sentence under the condition that the historical input associated information is the same and the historical input sequence is the same;
and determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
7. The method of claim 3, further comprising the step of generating the statistical model by:
collecting a plurality of sets of training data, each set of training data comprising: historical input associated information, a syllable network corresponding to a historical pinyin sequence, and a sentence input by a user under the conditions of the historical input associated information and the syllable network;
counting the frequency of each sentence under the condition that the historical input associated information is the same and the syllable networks are the same;
and determining the conditional probability of each sentence according to the frequency of each sentence, and generating a statistical model based on the conditional probability of each sentence.
8. A data processing apparatus, comprising:
the acquisition module is configured to acquire an input sequence and input associated information; and
the prediction module is configured to perform long sentence prediction by using a statistical model based on the input sequence and the input associated information, to obtain sentence candidates output by the statistical model.
9. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring an input sequence and input associated information;
and performing long sentence prediction by adopting a statistical model based on the input sequence and the input associated information to obtain sentence candidates output by the statistical model.
10. A readable storage medium, characterized in that, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the data processing method according to any one of claims 1 to 7.
CN202010368472.8A 2020-04-30 2020-04-30 Data processing method and device and electronic equipment Pending CN113589954A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010368472.8A CN113589954A (en) 2020-04-30 2020-04-30 Data processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010368472.8A CN113589954A (en) 2020-04-30 2020-04-30 Data processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113589954A true CN113589954A (en) 2021-11-02

Family

ID=78237797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010368472.8A Pending CN113589954A (en) 2020-04-30 2020-04-30 Data processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN113589954A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114791769A (en) * 2022-06-24 2022-07-26 湖北云享客数字智能科技有限公司 Big database establishment method for user behavior prediction result

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050255A (en) * 2014-06-13 2014-09-17 上海交通大学 Joint graph model-based error correction method and system
CN104252484A (en) * 2013-06-28 2014-12-31 重庆新媒农信科技有限公司 Pinyin error correction method and system
CN110244860A (en) * 2018-03-08 2019-09-17 北京搜狗科技发展有限公司 A kind of input method, device and electronic equipment
CN110673748A (en) * 2019-09-27 2020-01-10 北京百度网讯科技有限公司 Method and device for providing candidate long sentences in input method
CN110874145A (en) * 2018-08-30 2020-03-10 北京搜狗科技发展有限公司 Input method and device and electronic equipment


Similar Documents

Publication Publication Date Title
CN107291690B (en) Punctuation adding method and device and punctuation adding device
CN109243430B (en) Voice recognition method and device
CN107564526B (en) Processing method, apparatus and machine-readable medium
CN107291704B (en) Processing method and device for processing
CN110069624B (en) Text processing method and device
CN108628813B (en) Processing method and device for processing
CN107291260B (en) Information input method and device for inputting information
CN109558599B (en) Conversion method and device and electronic equipment
CN110781813A (en) Image recognition method and device, electronic equipment and storage medium
CN110795014B (en) Data processing method and device and data processing device
CN109887492B (en) Data processing method and device and electronic equipment
CN113589954A (en) Data processing method and device and electronic equipment
CN109979435B (en) Data processing method and device for data processing
CN111240497A (en) Method and device for inputting through input method and electronic equipment
CN109725736B (en) Candidate sorting method and device and electronic equipment
CN110908523A (en) Input method and device
CN112987941B (en) Method and device for generating candidate words
CN110858099B (en) Candidate word generation method and device
CN113807540A (en) Data processing method and device
CN110780749B (en) Character string error correction method and device
CN108073566B (en) Word segmentation method and device and word segmentation device
CN108345590B (en) Translation method, translation device, electronic equipment and storage medium
CN113589949A (en) Input method and device and electronic equipment
CN107102747B (en) Information input method and device for inputting information
CN112214114A (en) Input method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination