CN113589948A - Data processing method and device and electronic equipment - Google Patents


Info

Publication number
CN113589948A
Authority
CN
China
Prior art keywords
sentence
pinyin
sequence
input
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010366776.0A
Other languages
Chinese (zh)
Inventor
姚波怀
Current Assignee
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202010366776.0A
Publication of CN113589948A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01: Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/02: Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F 3/023: Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F 3/0233: Character input methods
    • G06F 3/0237: Character input methods using prediction or retrieval techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/232: Orthographic correction, e.g. spell checking or vowelisation

Abstract

Embodiments of the invention provide a data processing method and apparatus, and an electronic device. The method includes: acquiring an input sequence and context information (the preceding text); concatenating the word candidates of the input sequence with the context information to obtain a corresponding concatenation result; and inputting the concatenation result into a sentence prediction model to obtain the sentence candidates output by the model. Long-sentence prediction is thus performed by combining the input sequence with the associated input information, which improves the accuracy of long-sentence prediction and hence the user's input efficiency.

Description

Data processing method and device and electronic equipment
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, and an electronic device.
Background
With the development of computer technology, electronic devices such as mobile phones and tablet computers have become increasingly popular, bringing great convenience to people's life, study, and work. These devices typically have an input method application (abbreviated as an input method) installed, so that users can enter information with it.
While the user types, the input method can predict various types of candidates matching the input sequence, such as sentence candidates, name candidates, and association candidates, for the user to pick from, which improves input efficiency. In the prior art, however, sentence-candidate prediction is not accurate and does not meet users' input needs well, so it does little to improve input efficiency.
Disclosure of Invention
The embodiment of the invention provides a data processing method that aims to improve the user's input efficiency by improving the accuracy of long-sentence prediction.
Correspondingly, the embodiment of the invention also provides a data processing apparatus and an electronic device to ensure the implementation and application of the method.
To solve the above problem, an embodiment of the present invention discloses a data processing method, which specifically includes: acquiring an input sequence and context information (the preceding text); concatenating the word candidates of the input sequence with the context information to obtain a corresponding concatenation result; and inputting the concatenation result into a sentence prediction model to obtain the sentence candidates output by the model.
Optionally, concatenating the word candidates of the input sequence with the context information to obtain a corresponding concatenation result includes: converting the input sequence into corresponding word candidates; and appending the word candidates after the context information to obtain the corresponding concatenation results.
Optionally, the input sequence includes a pinyin sequence, and converting the input sequence into corresponding word candidates includes: parsing the pinyin sequence into multiple forms of pinyin; for the target-form pinyin, converting it into the pinyin identifiers that match its prefix, obtaining the syllable paths corresponding to that form; generating a target syllable network from the plurality of syllable paths; and converting the pinyin sequence into corresponding word candidates based on the target syllable network.
Optionally, the input sequence includes a pinyin sequence, and converting the input sequence into corresponding word candidates includes: performing error correction on the pinyin sequence to obtain a corresponding error-corrected sequence; parsing the pinyin sequence and the error-corrected sequence to obtain a corresponding target syllable network; and converting the pinyin sequence into corresponding word candidates based on the target syllable network.
Optionally, after converting the input sequence into corresponding word candidates, the method further includes: acquiring input association information; determining first score information for each word candidate based on the input association information and the context information, the first score information characterizing the plausibility of the word candidate; and selecting the top N word candidates with the highest first scores, N being a positive integer.
Optionally, the output of the sentence prediction model further includes second score information for each sentence candidate, and when there are multiple sentence candidates the method further includes: ranking the sentence candidates based on each candidate's second score information and the first score information of the word candidate it contains.
Optionally, when there are multiple sentence candidates, the method further includes: acquiring input association information; and ranking the sentence candidates based on the input association information.
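The score-based ranking described above can be sketched as follows; the layout of each candidate as a (sentence, second score, first score) tuple and the plain-sum combination rule are assumptions for illustration, since the text only states that both scores are used:

```python
def rank_sentence_candidates(candidates):
    """Rank sentence candidates by combining the sentence prediction
    model's second score with the first score of the word candidate the
    sentence was built from (here: a simple sum, highest first)."""
    return sorted(candidates, key=lambda c: c[1] + c[2], reverse=True)

# Hypothetical candidates: (sentence, second score, first score).
ranked = rank_sentence_candidates([
    ("今天天气很好", 0.6, 0.3),
    ("今天天气很热", 0.5, 0.5),
])
print([c[0] for c in ranked])  # ['今天天气很热', '今天天气很好']
```

In practice the combination could equally be a weighted sum or a learned re-ranker; the method does not fix the choice.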
Optionally, the method further includes a step of training the sentence prediction model: collecting a corpus; performing sentence-granularity division on the corpus to obtain training data, and/or performing word-granularity division on the corpus to obtain training data; and training the sentence prediction model with the training data.
The embodiment of the invention also discloses a data processing apparatus, which specifically includes: an acquisition module for acquiring the input sequence and the context information; a concatenation module for concatenating the word candidates of the input sequence with the context information to obtain a corresponding concatenation result; and a prediction module for inputting the concatenation result into a sentence prediction model to obtain the sentence candidates output by the model.
Optionally, the concatenation module includes: a conversion submodule for converting the input sequence into corresponding word candidates; and a concatenation submodule for appending the word candidates after the context information to obtain the corresponding concatenation results.
Optionally, the input sequence includes a pinyin sequence, and the conversion submodule is configured to parse the pinyin sequence into multiple forms of pinyin; for the target-form pinyin, convert it into the pinyin identifiers that match its prefix, obtaining the syllable paths corresponding to that form; generate a target syllable network from the plurality of syllable paths; and convert the pinyin sequence into corresponding word candidates based on the target syllable network.
Optionally, the input sequence includes a pinyin sequence, and the conversion submodule is configured to perform error correction on the pinyin sequence to obtain a corresponding error-corrected sequence; parse the pinyin sequence and the error-corrected sequence to obtain a corresponding target syllable network; and convert the pinyin sequence into corresponding word candidates based on the target syllable network.
Optionally, the apparatus further includes a selection module configured to: acquire input association information after the input sequence has been converted into corresponding word candidates; determine first score information for each word candidate based on the input association information and the context information, the first score information characterizing the plausibility of the word candidate; and select the top N word candidates with the highest first scores, N being a positive integer.
Optionally, the output of the sentence prediction model further includes second score information for the sentence candidates, and the apparatus further includes a first ranking module configured to, when there are multiple sentence candidates, rank the sentence candidates based on each candidate's second score information and the first score information of the word candidate it contains.
Optionally, the apparatus further includes a second ranking module configured to, when there are multiple sentence candidates, acquire input association information and rank the sentence candidates based on it.
Optionally, the apparatus further includes a training module configured to: collect a corpus; perform sentence-granularity division on the corpus to obtain training data, and/or perform word-granularity division on the corpus to obtain training data; and train the sentence prediction model with the training data.
The embodiment of the invention also discloses a readable storage medium; when the instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the data processing method of any one of the embodiments of the invention.
An embodiment of the present invention also discloses an electronic device, including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: acquiring an input sequence and context information (the preceding text); concatenating the word candidates of the input sequence with the context information to obtain a corresponding concatenation result; and inputting the concatenation result into a sentence prediction model to obtain the sentence candidates output by the model.
Optionally, concatenating the word candidates of the input sequence with the context information to obtain a corresponding concatenation result includes: converting the input sequence into corresponding word candidates; and appending the word candidates after the context information to obtain the corresponding concatenation results.
Optionally, the input sequence includes a pinyin sequence, and converting the input sequence into corresponding word candidates includes: parsing the pinyin sequence into multiple forms of pinyin; for the target-form pinyin, converting it into the pinyin identifiers that match its prefix, obtaining the syllable paths corresponding to that form; generating a target syllable network from the plurality of syllable paths; and converting the pinyin sequence into corresponding word candidates based on the target syllable network.
Optionally, the input sequence includes a pinyin sequence, and converting the input sequence into corresponding word candidates includes: performing error correction on the pinyin sequence to obtain a corresponding error-corrected sequence; parsing the pinyin sequence and the error-corrected sequence to obtain a corresponding target syllable network; and converting the pinyin sequence into corresponding word candidates based on the target syllable network.
Optionally, after the input sequence has been converted into corresponding word candidates, the one or more programs further include instructions for: acquiring input association information; determining first score information for each word candidate based on the input association information and the context information, the first score information characterizing the plausibility of the word candidate; and selecting the top N word candidates with the highest first scores, N being a positive integer.
Optionally, the output of the sentence prediction model further includes second score information for each sentence candidate, and when there are multiple sentence candidates the one or more programs further include instructions for: ranking the sentence candidates based on each candidate's second score information and the first score information of the word candidate it contains.
Optionally, when there are multiple sentence candidates, the one or more programs further include instructions for: acquiring input association information; and ranking the sentence candidates based on the input association information.
Optionally, the one or more programs further include instructions for training the sentence prediction model: collecting a corpus; performing sentence-granularity division on the corpus to obtain training data, and/or performing word-granularity division on the corpus to obtain training data; and training the sentence prediction model with the training data.
The embodiment of the invention has the following advantages:
an input sequence and context information (the preceding text) can be obtained; the word candidates of the input sequence are then concatenated with the context information to obtain a corresponding concatenation result, and the concatenation result is input into a sentence prediction model to obtain the sentence candidates it outputs. Long-sentence prediction is thus performed by combining the input sequence with the associated input information, which improves the accuracy of long-sentence prediction and hence the user's input efficiency.
Drawings
FIG. 1 is a flow chart of the steps of one data processing method embodiment of the present invention;
FIG. 2 is a flow chart of the steps of one embodiment of a model training method of the present invention;
FIG. 3 is a flow chart of the steps of an alternative embodiment of a data processing method of the present invention;
FIG. 4 is a block diagram of an embodiment of a data processing apparatus of the present invention;
FIG. 5 is a block diagram of an alternate embodiment of a data processing apparatus of the present invention;
FIG. 6 illustrates a block diagram of an electronic device for data processing in accordance with an exemplary embodiment;
FIG. 7 is a schematic structural diagram of an electronic device for data processing according to another exemplary embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a data processing method according to the present invention is shown, which may specifically include the following steps:
Step 102: obtain the input sequence and the context information (the preceding text).
In the embodiment of the invention, long-sentence prediction can be performed while the user is entering the input sequence, generating corresponding sentence candidates.
The embodiment of the invention can be applied to long-sentence prediction under various input modes: for example, stroke input, pinyin input, or voice input; the embodiments of the present invention are not limited in this respect.
In addition, the embodiment of the invention can be applied to long-sentence prediction in multiple language scenarios: for example, Chinese input, English input, or Korean input; again, the embodiments of the present invention are not limited in this respect.
Correspondingly, the input sequence may include a stroke sequence, a pinyin sequence, a foreign-language character string, and the like, which the embodiment of the present invention does not limit.
While the user types with the input method, the input sequence and the context information of the input can be acquired, and the corresponding sentence candidates are then predicted based on them. The context information may include the content in the edit box and/or interaction information.
In one example of the present invention, long-sentence prediction based on the obtained input sequence and context information may proceed as in steps 104-106 below.
Step 104: concatenate the word candidates of the input sequence with the context information to obtain the corresponding concatenation results.
Step 106: input the concatenation results into a sentence prediction model to obtain the sentence candidates output by the model.
In the embodiment of the invention, a sentence prediction model can be trained in advance and then used for long-sentence prediction. The training process of the sentence prediction model is explained in the later embodiments.
The word candidates corresponding to the input sequence can be concatenated with the context information, and the resulting concatenation fed to the sentence prediction model, which performs long-sentence prediction on it and outputs the corresponding sentence candidates. The model may output one sentence candidate or several; the embodiment of the present invention does not limit this.
In summary, in the embodiment of the present invention, an input sequence and context information can be obtained; the word candidates of the input sequence are concatenated with the context information, and the concatenation result is input into a sentence prediction model to obtain the sentence candidates it outputs. Long-sentence prediction is thus performed by combining the input sequence with the associated input information, improving its accuracy and the user's input efficiency.
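The acquire, concatenate, and predict flow of steps 102-106 can be sketched as below. `SentencePredictor` is a hypothetical stand-in for the trained sentence prediction model, and joining context and candidate with no separator is an assumption, as the text does not specify the joining scheme:

```python
def build_model_inputs(context, word_candidates):
    """Concatenate the preceding text (context) with each word candidate
    of the input sequence, producing one model input per candidate."""
    return [context + w for w in word_candidates]

class SentencePredictor:
    """Hypothetical stand-in for the trained sentence prediction model."""
    def predict(self, text):
        # A real model would generate a long-sentence continuation; here
        # we simply echo the input to keep the sketch runnable.
        return text

# Hypothetical example: the user has committed "今天天气" and the input
# sequence produced the word candidates "很好" and "很热".
model = SentencePredictor()
inputs = build_model_inputs("今天天气", ["很好", "很热"])
sentence_candidates = [model.predict(x) for x in inputs]
print(sentence_candidates)  # ['今天天气很好', '今天天气很热']
```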
The following describes a training process of the sentence prediction model.
Referring to fig. 2, a flowchart illustrating steps of an embodiment of a model training method according to the present invention is shown, which may specifically include the following steps:
step 202, collecting corpora.
In the embodiment of the invention, a corpus can be collected and training data generated from it, so that the sentence prediction model is trained on that data. The corpus may be collected in several ways: for example, sentences entered by users through the input method may be collected as corpus, or the body text, abstracts, and the like of web pages may be collected as corpus; the embodiments of the present invention are not limited in this respect.
Step 204, performing sentence granularity division on the corpus to obtain training data; and/or performing word granularity division on the corpus to obtain training data.
In the embodiment of the invention, the corpus can be divided to generate training data. One way is sentence-granularity division: the corpus is split with the sentence as the unit, and every two adjacent, semantically related sentences form one group of training data, yielding multiple groups. The sentences may include single sentences and compound sentences: a single sentence is formed from a phrase or a single word and cannot be split into clauses; a clause is a single-sentence-like structure that lacks the complete tone of a full sentence; a compound sentence consists of two or more clauses that are closely related in meaning but do not structurally contain one another.
In addition, to provide the user with more comprehensive sentence candidates and a better input experience, each compound sentence can be divided at its punctuation marks into clauses, and every two adjacent clauses within a compound sentence can then serve as a group of training data, augmenting the training data generated above.
Another way to divide the corpus and generate training data is word-granularity division: with the word as the unit, the corpus is split into words, and two adjacent, semantically related words form a group of training data.
In the embodiment of the invention, word granularity may be determined through natural language processing, the corpus then being divided into words at that granularity; or it may be determined from the user's screen-up (commit) operations during input, and the corpus divided accordingly; the embodiments of the present invention are not limited in this respect.
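The sentence-granularity division described above can be sketched as follows, splitting a corpus at Chinese end-of-sentence punctuation and pairing adjacent sentences as training groups; the boundary rules and the semantic-relatedness check are simplified assumptions:

```python
import re

def sentence_pairs(corpus: str):
    """Split a corpus into sentences at Chinese end-of-sentence
    punctuation and pair each sentence with the one that follows it,
    producing (previous sentence, next sentence) training groups."""
    sentences = [s for s in re.split(r"[。！？]", corpus) if s]
    return list(zip(sentences, sentences[1:]))

pairs = sentence_pairs("今天天气很好。我们去公园吧！记得带伞。")
print(pairs)
# [('今天天气很好', '我们去公园吧'), ('我们去公园吧', '记得带伞')]
```

Word-granularity division would follow the same pairing pattern, with a word segmenter in place of the punctuation split.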
And step 206, training the sentence prediction model by using the training data.
The following description will take an example of training the sentence prediction model using a set of training data.
In an example of the present invention, when the training data is obtained by sentence-granularity division of the corpus, each group of training data may include two sentences (or two clauses). Taking a group of two sentences as an example: the first sentence of the group is input into the sentence prediction model, which performs a forward computation to produce a sentence candidate; the candidate is then compared with the second sentence of the group, and the model's weights are adjusted. The sentence prediction model is trained on multiple groups of training data in this way until a set stopping condition is met.
In an example of the present invention, when the training data is obtained by word-granularity division of the corpus, each group of training data may include two groups of words. The first group of words is input into the sentence prediction model, which performs a forward computation to produce word candidates; these are compared with the second group of words in the training data, and the model's weights are adjusted. The model is again trained on multiple groups of training data in this way until the set stopping condition is met.
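The training procedure just described (forward-compute on the first element of each group, compare with the second, adjust the weights, repeat until a stopping condition) can be outlined as below; the `forward`/`update` interface and the `EchoModel` stand-in are illustrative assumptions, not part of the original text:

```python
def train_sentence_model(model, training_groups, epochs=1):
    """Minimal training-loop sketch: feed the first element of each group
    forward, then let the model compare its prediction with the second
    element and adjust its weights. A fixed epoch count stands in for the
    unspecified stopping condition."""
    for _ in range(epochs):
        for previous, target in training_groups:
            prediction = model.forward(previous)
            model.update(prediction, target)
    return model

class EchoModel:
    """Stand-in model that only counts weight updates, for illustration."""
    def __init__(self):
        self.updates = 0
    def forward(self, text):
        return text
    def update(self, prediction, target):
        self.updates += 1

model = train_sentence_model(EchoModel(), [("a", "b"), ("b", "c")], epochs=2)
print(model.updates)  # 4
```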
Referring to fig. 3, a flowchart illustrating steps of an alternative embodiment of the data processing method of the present invention is shown, which may specifically include the following steps:
Step 302: obtain the input sequence and the context information.
The input sequence may include a single code or multiple codes, which the embodiment of the present invention does not limit.
In an embodiment of the present invention, one way to concatenate the word candidates of the input sequence with the context information is to convert the input sequence into corresponding word candidates and then concatenate those word candidates with the context information to obtain the corresponding concatenation results.
Step 304: convert the input sequence into corresponding word candidates.
In an example of the present invention, when the input sequence is a pinyin sequence, one way to convert it into corresponding word candidates is to parse the pinyin sequence into a corresponding target syllable network (see substeps 22-26) and then convert the pinyin sequence into word candidates based on that network (see substep 28).
Substep 22: parse the pinyin sequence into multiple forms of pinyin.
In the embodiment of the present invention, the same pinyin sequence may correspond to multiple forms of pinyin. For example, the pinyin sequence "fangan" may correspond to the forms "fang'an", "fan'gan", "fa'n'gan", and so on. The pinyin sequence can therefore be parsed into its multiple forms, where each form comprises the pinyin of M syllables, M being a positive integer. For example, the form "fang'an" corresponds to pinyin with two syllables, "fang" and "an"; the form "fan'gan" corresponds to two syllables, "fan" and "gan"; and the form "fa'n'gan" corresponds to three syllables, "fa", "n", and "gan".
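The parsing of one pinyin sequence into multiple forms can be sketched as a recursive segmentation over a syllable inventory. The tiny `SYLLABLES` set is an assumption, just large enough to reproduce the "fangan" example; a real input method would use the full Mandarin syllable table:

```python
# Assumed miniature syllable inventory, sufficient for the "fangan" example.
SYLLABLES = {"fa", "fan", "fang", "an", "gan", "n"}

def segmentations(seq: str):
    """Enumerate every way of parsing a pinyin sequence into syllables
    drawn from the inventory."""
    if not seq:
        return [[]]
    results = []
    for i in range(1, len(seq) + 1):
        head = seq[:i]
        if head in SYLLABLES:
            for rest in segmentations(seq[i:]):
                results.append([head] + rest)
    return results

forms = segmentations("fangan")
print(forms)
# [['fa', 'n', 'gan'], ['fan', 'gan'], ['fang', 'an']]
```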
Substep 24: for the target-form pinyin, convert it into the pinyin identifiers that match its prefix, obtaining the syllable paths corresponding to that form of pinyin.
In the embodiment of the invention, one form of pinyin can be selected from the multiple forms as the target-form pinyin, which is then converted into its corresponding syllable paths; a form of pinyin can comprise the pinyin of M syllables, M being a positive integer.
When typing a pinyin sequence, most users enter only the first pinyin character, or the first few characters, of a target character. The embodiment of the invention can therefore provide sentence candidates related to the target input before the user has typed the complete pinyin sequence, improving input efficiency. To this end, when converting the target-form pinyin into its syllable paths, the pinyin is converted into the pinyin identifiers matching its prefix; this broadens the word candidates determined for the pinyin sequence, and in turn the comprehensiveness of the sentence candidates subsequently predicted from those word candidates.
One way to convert the target-form pinyin into prefix-matched pinyin identifiers is as follows. The pinyin of the Mth syllable is converted into every pinyin identifier matching its prefix, where prefix matching means that the pinyin corresponding to the identifier starts with the pinyin of that syllable. Among the first M-1 syllables of the target-form pinyin, a syllable whose pinyin contains both an initial and a final is converted into the identifier that exactly matches its pinyin, while a syllable whose pinyin contains only an initial is converted into the identifiers matching that initial.
The target-form pinyin can be converted into prefix-matched pinyin identifiers by querying a mapping between pinyins and pinyin identifiers (e.g., pinyin IDs). For example, if the pinyin of the Mth syllable is "h", the pinyins matching the prefix "h" include "h", "hen", "he", "heng", "ha", and so on; the identifiers 99 ("h"), 120 ("hen"), 110 ("he"), 122 ("heng"), and 105 ("ha") are then all taken as pinyin identifiers matching the prefix of the Mth syllable. By contrast, if one of the first M-1 syllables has the pinyin "he", only the identifier 110 corresponding to "he" is used as its pinyin identifier.
Each syllable's pinyin identifiers in the target-form pinyin can then serve as syllable nodes; the Mth syllable may correspond to X pinyin identifiers, X being a positive integer. The first M-1 syllable nodes of the target-form pinyin, combined with each of the X syllable nodes of the Mth syllable in turn, form one syllable path each; X syllable paths corresponding to the target-form pinyin are thus obtained.
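The prefix matching against pinyin identifiers can be sketched as follows, reusing the hypothetical identifier values from the example above:

```python
# Hypothetical pinyin-to-identifier table, using the example IDs above.
PINYIN_IDS = {"h": 99, "ha": 105, "he": 110, "hen": 120, "heng": 122}

def prefix_matched_ids(last_syllable: str):
    """Return the identifiers of every pinyin whose spelling starts with
    the last (possibly incomplete) syllable, as described for the Mth
    syllable of the target-form pinyin."""
    return sorted(
        pid for py, pid in PINYIN_IDS.items() if py.startswith(last_syllable)
    )

print(prefix_matched_ids("h"))   # [99, 105, 110, 120, 122]
print(prefix_matched_ids("he"))  # [110, 120, 122]
```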
And a substep 26 of generating a target syllable network by using the plurality of syllable paths.
Then, generating a target syllable network corresponding to the pinyin sequence by adopting syllable paths corresponding to the pinyins in various forms; the target syllable network may then include a plurality of syllable paths.
And a substep 28 of converting the pinyin sequence into corresponding word candidates based on the target syllable network.
When the target syllable network comprises a plurality of syllable paths, each syllable path can be converted into a corresponding word candidate; wherein each syllable path may be converted into at least one word candidate.
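The conversion from syllable network to word candidates can be illustrated with a minimal sketch in which the network is modeled simply as a collection of syllable paths and each path is looked up in a lexicon keyed by pinyin-ID tuples. The `LEXICON` contents and ID keys are invented for illustration and are not from the embodiment.

```python
# Hypothetical lexicon: tuples of pinyin IDs -> homophone word candidates.
LEXICON = {
    (200, 105): ["你哈"],
    (200, 110): ["你和", "你何"],
    (201,): ["您"],
}

def network_to_word_candidates(network):
    """network: iterable of syllable paths (sequences of pinyin IDs)."""
    candidates = []
    for path in network:
        # A path may map to several homophone words, or to none at all.
        candidates.extend(LEXICON.get(tuple(path), []))
    return candidates

words = network_to_word_candidates([(200, 110), (201,)])
```

As described above, one path can yield more than one word candidate, so the candidate list may be longer than the number of paths.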
In an example of the present invention, the input sequence is again taken as a pinyin sequence to describe another manner of converting the input sequence into corresponding word candidates; reference may be made to sub-steps 42-46:
and a substep 42 of correcting errors of the pinyin sequence to obtain a corresponding error correction sequence.
And a substep 44 of analyzing the pinyin sequence and the error correction sequence to obtain a corresponding target syllable network.
Substep 46, converting the pinyin sequence to corresponding word candidates based on the target syllable network.
A user may make input errors during the input process; therefore, error correction can first be performed on the pinyin sequence, so that the correct pinyin sequence can be converted into corresponding word candidates, and accurate sentence candidates can subsequently be predicted based on the word candidates, which further improves the input efficiency and the user experience.
Of course, when the input sequence is another kind of sequence, such as a foreign-language character string, error correction may likewise be performed on the input sequence to obtain an error correction sequence.
The syllable network obtained by analyzing the error correction sequence and the pinyin sequence can be called a target syllable network. The method for analyzing the error correction sequence to obtain the corresponding target syllable network is similar to the method for analyzing the pinyin sequence to obtain the corresponding target syllable network (refer to substeps 22-26), and is not described herein again.
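Sub-steps 42-46 can be sketched as follows: correct the raw pinyin with a correction table, then build one network that contains paths from both the original sequence and the corrected sequence, so that a mistyped input can still reach the right word. The correction table and the toy parser are stand-ins for the embodiment's unspecified error model and parsing logic.

```python
# Hypothetical correction table, e.g. for a transposition typo.
CORRECTIONS = {"nihso": "nihao"}

def build_target_network(pinyin_seq, parse):
    """parse: function mapping a pinyin string to a list of syllable paths."""
    corrected = CORRECTIONS.get(pinyin_seq, pinyin_seq)
    paths = parse(pinyin_seq)
    if corrected != pinyin_seq:
        # Merge paths from the error-corrected sequence into the same network.
        paths = paths + parse(corrected)
    return paths

demo_parse = lambda s: [[s[:2], s[2:]]]  # toy parser: split after two letters
net = build_target_network("nihso", demo_parse)
```

The resulting network keeps the original (possibly wrong) paths alongside the corrected ones, matching the description that both the pinyin sequence and the error correction sequence are analyzed.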
Among the word candidates, some may be unreasonable, that is, they do not satisfy natural language rules. Therefore, after the word candidates of the input sequence are obtained, reasonable word candidates can be screened out from them to improve the accuracy of subsequent long-sentence prediction. Reference may be made to steps 306-310 as follows:
step 306, obtaining input associated information.
And step 308, determining first score information of each word candidate based on the input associated information and the above information, wherein the first score information is used for representing the reasonability of the word candidates.
And step 310, selecting the first N word candidates with the highest first score information.
In the embodiment of the present invention, input associated information may be obtained, where the input associated information may include information related to the input, such as input environment information and user personalized information; the embodiments of the present invention are not limited in this regard. Then, first score information of each word candidate is determined based on the input associated information and the above information; the first score information is used for representing the reasonability of the word candidates. For example, a language model may be used to score each word candidate based on the input associated information and the above information, such as determining the conditional probability of each word candidate given the input associated information and the above information. The first N word candidates with the highest first score information may then be selected, where N is a positive integer.
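Steps 306-310 amount to score-then-truncate, which can be sketched as follows. The `score_fn` here is a placeholder: a real implementation would call a language model returning something like P(word | above text, associated information), which the embodiment does not specify.

```python
def top_n_candidates(candidates, above, assoc, score_fn, n):
    """Keep the n word candidates with the highest first score information."""
    scored = [(score_fn(w, above, assoc), w) for w in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [w for _, w in scored[:n]]

# Toy scorer: prefer longer candidates; a real one would return a model probability.
demo_score = lambda w, above, assoc: len(w)
best = top_n_candidates(["和", "你好", "哈"], "今天", {}, demo_score, 2)
```

Only the N retained candidates are passed on to the splicing step, which is what reduces the sentence prediction model's calculation amount.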
And step 312, splicing the word candidates to the above information to obtain corresponding splicing results.
Then, each selected word candidate may be spliced after the above information to obtain a corresponding splicing result. When a plurality of word candidates are selected, a plurality of corresponding splicing results are obtained.
And step 314, inputting the splicing result into a sentence prediction model to obtain a sentence candidate output by the sentence prediction model.
When there are a plurality of splicing results, one splicing result is input into the sentence prediction model at a time to obtain the sentence candidate output by the sentence prediction model, until all the splicing results have been input into the sentence prediction model and the corresponding sentence candidates have been obtained.
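Steps 312-314 can be sketched as a simple loop: each selected word candidate is appended to the above text, and each concatenation is fed to the sentence prediction model one at a time. `predict_sentence` is a placeholder for the embodiment's model, whose internals are not specified here.

```python
def predict_candidates(above, word_candidates, predict_sentence):
    sentence_candidates = []
    for word in word_candidates:
        spliced = above + word  # splicing result: above information + candidate
        sentence_candidates.append(predict_sentence(spliced))
    return sentence_candidates

demo_model = lambda prefix: prefix + "……"  # toy model: just extends the prefix
outs = predict_candidates("今天天气", ["很好", "不错"], demo_model)
```

Each word candidate thus acts as an extension of the above information, which is the mechanism the summary below credits for the improved prediction accuracy.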
In summary, in the embodiments of the present invention, an input sequence and the above information may be obtained; the input sequence is then converted into corresponding word candidates, and the word candidates are spliced after the above information to obtain corresponding splicing results; the splicing results are input into a sentence prediction model to obtain the sentence candidates output by the sentence prediction model. In this way, the word candidates of the input sequence are input into the sentence prediction model as a part of the above information for long-sentence prediction, which increases the information content of the above information and further improves the prediction accuracy.
In the embodiment of the invention, when the input sequence is a pinyin sequence, the pinyin sequence can be analyzed into pinyins in various forms; then converting the pinyin of the target form into a pinyin identifier matched with the pinyin prefix of the target form to obtain a syllable path corresponding to the pinyin of the target form; and then generating a target syllable network by adopting a plurality of syllable paths, and converting the pinyin sequence into corresponding word candidates based on the target syllable network, thereby increasing the comprehensiveness of the converted word candidates and further improving the accuracy and comprehensiveness of the sentence candidates.
Further, in the embodiment of the present invention, when the input sequence is a pinyin sequence, error correction may be performed on the pinyin sequence to obtain a corresponding error correction sequence; the pinyin sequence and the error correction sequence are then analyzed to obtain a corresponding target syllable network, and the pinyin sequence is converted into corresponding word candidates based on the target syllable network. The conversion can thus be performed based on an accurate pinyin sequence to obtain accurate word candidates, so that accurate sentence candidates can be predicted even when the user makes an input error, further improving the accuracy of the sentence candidates.
Thirdly, in the embodiment of the present invention, after the input sequence is converted into corresponding word candidates, input associated information can be obtained; first score information of each word candidate is then determined based on the input associated information and the above information, and the first N word candidates with the highest first score information are selected. Since the first score information represents the reasonability of the word candidates, the reasonable word candidates can be screened out, which on the one hand improves the accuracy of the predicted sentence candidates and on the other hand reduces the calculation amount of the sentence prediction model.
In one embodiment of the invention, besides the sentence candidates, the sentence prediction model can also output second score information corresponding to the sentence candidates; the sentence candidates can then be ranked according to the second score information and presented according to the ranking result. However, because the second score information is determined based only on the input sequence and the above information, a ranking based on the second score information alone does not consider the relevant factors comprehensively enough; therefore, more factors can be introduced to rank the sentence candidates.
In one example of the present invention, one way to rank sentence candidates may be: and sequencing each sentence candidate based on the second score information of the sentence candidate and the first score information corresponding to the word candidate in the sentence candidate.
In another example of the present invention, one way to rank the sentence candidates may be: acquiring input associated information; sorting each sentence candidate based on the input association information; and further, the accuracy of ranking each sentence candidate can be improved.
In addition, when the above information is long, the splicing result of the input sequence and part of the above information can be input into the sentence prediction model to obtain the sentence candidates output by the sentence prediction model, so as to reduce the calculation amount of the sentence prediction model. In that case, in the process of ranking the sentence candidates, the sentence candidates can be ranked based on the complete above information. In yet another example of the present invention, one way to rank the sentence candidates may be: acquiring input associated information; ranking each sentence candidate based on the input associated information and/or the above information; the accuracy of ranking the sentence candidates can thereby be improved.
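The first ranking example above — combining each sentence candidate's second score (from the sentence prediction model) with the first score of the word candidate it was built from — can be sketched as a weighted sum. The weights `w1` and `w2` are illustrative assumptions; the embodiment does not prescribe a combination formula.

```python
def rank_sentences(cands, w1=0.4, w2=0.6):
    """cands: list of (sentence, first_score, second_score) triples.

    Returns the candidates sorted by a weighted combination of the word
    candidate's first score and the sentence's second score."""
    return sorted(cands, key=lambda c: w1 * c[1] + w2 * c[2], reverse=True)

ranked = rank_sentences([("句A", 0.9, 0.2), ("句B", 0.5, 0.8)])
```

With these weights, a sentence the model scores highly can outrank one built from a higher-scoring word candidate, which is the point of combining the two signals.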
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 4, a block diagram of a data processing apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:
an obtaining module 402, configured to obtain an input sequence and the above information;
a splicing module 404, configured to splice the word candidates of the input sequence with the above information to obtain corresponding splicing results;
and the prediction module 406 is configured to input the splicing result into a sentence prediction model, so as to obtain a sentence candidate output by the sentence prediction model.
Referring to fig. 5, a block diagram of an alternative embodiment of a data processing apparatus of the present invention is shown.
In an optional embodiment of the present invention, the splicing module 404 includes:
a conversion sub-module 4042, configured to convert the input sequence into corresponding word candidates;
and the information splicing submodule 4044 is configured to splice the word candidates with the above information to obtain a corresponding splicing result.
In an optional embodiment of the present invention, the input sequence includes a pinyin sequence, and the conversion sub-module 4042 is configured to parse the pinyin sequence into multiple forms of pinyin; aiming at the pinyin in the target form, converting the pinyin in the target form into a pinyin identifier matched with the pinyin prefix in the target form to obtain a syllable path corresponding to the pinyin in the target form; generating a target syllable network by adopting a plurality of syllable paths; and converting the pinyin sequence into corresponding word candidates based on the target syllable network.
In an optional embodiment of the present invention, the input sequence includes a pinyin sequence, and the conversion sub-module 4042 is configured to perform error correction on the pinyin sequence to obtain a corresponding error correction sequence; analyzing the pinyin sequence and the error correction sequence to obtain a corresponding target syllable network; and converting the pinyin sequence into corresponding word candidates based on the target syllable network.
In an optional embodiment of the present invention, the apparatus further comprises:
a selecting module 408, configured to obtain input associated information after the input sequence is converted into corresponding word candidates; determine first score information of each word candidate based on the input associated information and the above information, wherein the first score information is used for representing the reasonability of the word candidates; and select the first N word candidates with the highest first score information, wherein N is a positive integer.
In an optional embodiment of the invention, the output of the sentence prediction model further comprises second score information of the sentence candidates, and the apparatus further comprises:
a first sorting module 410, configured to, when the sentence candidates include a plurality of sentences, sort the sentence candidates based on the second score information of the sentence candidates and the first score information corresponding to the word candidates in the sentence candidates.
In an optional embodiment of the present invention, the apparatus further comprises:
a second sorting module 412, configured to, when the sentence candidates include a plurality of sentences, obtain input association information; and sequencing each sentence candidate based on the input association information.
In an optional embodiment of the present invention, the apparatus further comprises:
a training module 414, configured to collect corpora; perform sentence-granularity division on the corpora to obtain training data, and/or perform word-granularity division on the corpora to obtain training data; and train the sentence prediction model by using the training data.
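The training module's data preparation can be sketched as follows: split the collected corpus at sentence granularity and, optionally, at word granularity. The sentence delimiters and the whitespace tokenization are simplifying assumptions; real Chinese word-granularity division would require an actual word segmenter, which the embodiment does not detail.

```python
import re

def sentence_granularity(corpus):
    # Split on common Chinese/Western sentence-ending punctuation.
    parts = re.split(r"[。！？.!?]", corpus)
    return [p.strip() for p in parts if p.strip()]

def word_granularity(sentence):
    # Placeholder for a real word segmenter (whitespace split shown only
    # so the sketch stays self-contained).
    return sentence.split()

samples = sentence_granularity("今天天气很好。我们出去玩！")
```

Sentence-granularity samples teach the model where sentences end, while word-granularity samples teach it plausible word transitions; the embodiment allows either or both.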
In summary, in the embodiment of the present invention, an input sequence and the above information may be obtained, and then word candidates of the input sequence are spliced with the above information to obtain a corresponding splicing result; and inputting the splicing result into a sentence prediction model to obtain a sentence candidate output by the sentence prediction model, and further performing long sentence prediction by combining an input sequence and input associated information, so that the accuracy of the long sentence prediction is improved, and the input efficiency of a user is improved.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
Fig. 6 is a block diagram illustrating a structure of an electronic device 600 for data processing according to an example embodiment. For example, the electronic device 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 6, electronic device 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.
The processing component 602 generally controls overall operation of the electronic device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 can include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operation at the device 600. Examples of such data include instructions for any application or method operating on the electronic device 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 604 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power component 606 provides power to the various components of electronic device 600. Power components 606 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 600.
The multimedia component 608 includes a screen that provides an output interface between the electronic device 600 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 600 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing status assessment of various aspects of the electronic device 600. For example, the sensor component 614 may detect an open/closed state of the device 600 and the relative positioning of components, such as the display and keypad of the electronic device 600; the sensor component 614 may also detect a change in the position of the electronic device 600 or of a component of the electronic device 600, the presence or absence of user contact with the electronic device 600, the orientation or acceleration/deceleration of the electronic device 600, and a change in the temperature of the electronic device 600. The sensor component 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate communications between the electronic device 600 and other devices in a wired or wireless manner. The electronic device 600 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 604 comprising instructions, executable by the processor 620 of the electronic device 600 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a data processing method, the method comprising: acquiring an input sequence and the above information; splicing the word candidates of the input sequence with the above information to obtain corresponding splicing results; and inputting the splicing result into a sentence prediction model to obtain a sentence candidate output by the sentence prediction model.
Optionally, the concatenating the word candidates of the input sequence with the above information to obtain a corresponding concatenation result includes: converting the input sequence into corresponding word candidates; and splicing the word candidates after the above information to obtain corresponding splicing results.
Optionally, the converting the input sequence into a corresponding word candidate includes: analyzing the pinyin sequence into multiple forms of pinyin; aiming at the pinyin in the target form, converting the pinyin in the target form into a pinyin identifier matched with the pinyin prefix in the target form to obtain a syllable path corresponding to the pinyin in the target form; generating a target syllable network by adopting a plurality of syllable paths; and converting the pinyin sequence into corresponding word candidates based on the target syllable network.
Optionally, the converting the input sequence into a corresponding word candidate includes: correcting errors of the pinyin sequence to obtain a corresponding error correction sequence; analyzing the pinyin sequence and the error correction sequence to obtain a corresponding target syllable network; and converting the pinyin sequence into corresponding word candidates based on the target syllable network.
Optionally, after the converting the input sequence into the corresponding word candidates, the method further includes: acquiring input associated information; determining first score information of each word candidate based on the input associated information and the above information, wherein the first score information is used for representing the reasonability of the word candidates; and selecting the first N word candidates with the highest first score information, wherein N is a positive integer.
Optionally, the output of the sentence prediction model further includes second score information of a sentence candidate, and when the sentence candidate includes a plurality of sentences, the method further includes: and sequencing each sentence candidate based on the second score information of the sentence candidate and the first score information corresponding to the word candidate in the sentence candidate.
Optionally, when the sentence candidate includes a plurality of sentences, the method further includes: acquiring input associated information; and sequencing each sentence candidate based on the input association information.
Optionally, the method further comprises the step of training the sentence prediction model: collecting corpora; performing sentence-granularity division on the corpora to obtain training data, and/or performing word-granularity division on the corpora to obtain training data; and training the sentence prediction model by using the training data.
Fig. 7 is a schematic structural diagram of an electronic device 700 for data processing according to another exemplary embodiment of the present invention. The electronic device 700 may be a server, which may vary significantly depending on configuration or performance, and may include one or more Central Processing Units (CPUs) 722 (e.g., one or more processors) and memory 732, one or more storage media 730 (e.g., one or more mass storage devices) storing applications 742 or data 744. Memory 732 and storage medium 730 may be, among other things, transient storage or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processor 722 may be configured to communicate with the storage medium 730, and execute a series of instruction operations in the storage medium 730 on the server.
The server may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input/output interfaces 758, one or more keyboards 756, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: acquiring an input sequence and the above information; splicing word candidates of the input sequence with the above information to obtain corresponding splicing results; and inputting the splicing results into a sentence prediction model to obtain the sentence candidates output by the sentence prediction model.
Optionally, the concatenating the word candidates of the input sequence with the above information to obtain a corresponding concatenation result includes: converting the input sequence into corresponding word candidates; and splicing the word candidates after the above information to obtain corresponding splicing results.
Optionally, the converting the input sequence into a corresponding word candidate includes: analyzing the pinyin sequence into multiple forms of pinyin; aiming at the pinyin in the target form, converting the pinyin in the target form into a pinyin identifier matched with the pinyin prefix in the target form to obtain a syllable path corresponding to the pinyin in the target form; generating a target syllable network by adopting a plurality of syllable paths; and converting the pinyin sequence into corresponding word candidates based on the target syllable network.
Optionally, the converting the input sequence into a corresponding word candidate includes: correcting errors of the pinyin sequence to obtain a corresponding error correction sequence; analyzing the pinyin sequence and the error correction sequence to obtain a corresponding target syllable network; and converting the pinyin sequence into corresponding word candidates based on the target syllable network.
Optionally, after the converting the input sequence into the corresponding word candidates, the one or more programs further include instructions for: acquiring input associated information; determining first score information of each word candidate based on the input associated information and the above information, wherein the first score information is used for representing the reasonability of the word candidates; and selecting the first N word candidates with the highest first score information, wherein N is a positive integer.
Optionally, the output of the sentence prediction model further includes second score information of a sentence candidate, and when the sentence candidate includes a plurality of sentences, further includes instructions for: and sequencing each sentence candidate based on the second score information of the sentence candidate and the first score information corresponding to the word candidate in the sentence candidate.
Optionally, when the sentence candidate includes a plurality of sentences, the method further includes: acquiring input associated information; and sequencing each sentence candidate based on the input association information.
Optionally, the one or more programs further include instructions for training the sentence prediction model: collecting corpora; performing sentence-granularity division on the corpora to obtain training data, and/or performing word-granularity division on the corpora to obtain training data; and training the sentence prediction model by using the training data.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, those skilled in the art may make additional variations and modifications to these embodiments once they are aware of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," and any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or terminal. Without further limitation, an element introduced by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or terminal that comprises the element.
The data processing method, data processing apparatus, and electronic device provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention; the descriptions of the above embodiments are intended only to help understand the method of the present invention and its core ideas. Meanwhile, a person skilled in the art may, following the ideas of the present invention, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A data processing method, comprising:
acquiring an input sequence and context information;
concatenating word candidates of the input sequence with the context information to obtain a corresponding concatenation result;
and inputting the concatenation result into a sentence prediction model to obtain a sentence candidate output by the sentence prediction model.
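The flow of claim 1 can be sketched in a few lines. This is a minimal illustration with hypothetical names throughout (`predict_sentences`, the toy converter, and the toy model are not the patent's implementation; any real system would use a trained sentence prediction model):

```python
# Illustrative sketch of claim 1 (all names are hypothetical):
# concatenate each word candidate with the preceding context, then
# let a sentence prediction model produce a sentence candidate.

def predict_sentences(input_sequence, context, convert, model):
    """convert: maps an input sequence to word candidates (cf. claim 2);
    model: maps a concatenated string to one sentence candidate."""
    candidates = convert(input_sequence)      # e.g. pinyin -> word candidates
    results = []
    for word in candidates:
        spliced = context + word              # the "concatenation result"
        results.append(model(spliced))        # sentence candidate from model
    return results

# Toy stand-ins so the sketch runs end to end:
toy_convert = lambda seq: ["today", "tonight"]
toy_model = lambda text: text + " ..."
print(predict_sentences("jin tian", "I think ", toy_convert, toy_model))
# → ['I think today ...', 'I think tonight ...']
```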
2. The method of claim 1, wherein concatenating the word candidates of the input sequence with the context information to obtain corresponding concatenation results comprises:
converting the input sequence into corresponding word candidates;
and appending the word candidates after the context information to obtain the corresponding concatenation results.
3. The method of claim 2, wherein the input sequence comprises a pinyin sequence, and wherein converting the input sequence into corresponding word candidates comprises:
parsing the pinyin sequence into pinyin of multiple forms;
for pinyin of a target form, converting the pinyin of the target form into a pinyin identifier matching the pinyin prefix of the target form, to obtain a syllable path corresponding to the pinyin of the target form;
generating a target syllable network from the plurality of syllable paths;
and converting the pinyin sequence into the corresponding word candidates based on the target syllable network.
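The idea behind the syllable paths of claim 3 can be illustrated with an ambiguous pinyin string: one raw sequence may split into several valid syllable sequences, and the set of such splits stands in for the syllable network. The tiny syllable inventory below is a made-up assumption for the sketch, not the patent's syllable table:

```python
# Illustrative sketch of claim 3 (hypothetical syllable inventory):
# enumerate every segmentation of a pinyin string into known syllables;
# each segmentation is one "syllable path".

SYLLABLES = {"xi", "an", "xian"}  # tiny toy inventory, not a real lexicon

def syllable_paths(pinyin, prefix=()):
    """Return every way to split `pinyin` into syllables from SYLLABLES."""
    if not pinyin:
        return [prefix]
    paths = []
    for i in range(1, len(pinyin) + 1):
        head = pinyin[:i]
        if head in SYLLABLES:
            paths.extend(syllable_paths(pinyin[i:], prefix + (head,)))
    return paths

# "xian" is ambiguous: one syllable, or "xi" + "an" -> two paths
print(syllable_paths("xian"))  # → [('xi', 'an'), ('xian',)]
```

Merging these paths (e.g. into a lattice sharing common nodes) would give the target syllable network over which word candidates are decoded.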
4. The method of claim 2, wherein the input sequence comprises a pinyin sequence, and wherein converting the input sequence into corresponding word candidates comprises:
performing error correction on the pinyin sequence to obtain a corresponding error-corrected sequence;
parsing the pinyin sequence and the error-corrected sequence to obtain a corresponding target syllable network;
and converting the pinyin sequence into the corresponding word candidates based on the target syllable network.
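One simple way to realize the error-correction step of claim 4 is fuzzy matching of a typed syllable against a syllable lexicon; the lexicon, threshold, and use of `difflib` here are illustrative assumptions, not the patent's method:

```python
# Illustrative sketch of claim 4 (hypothetical lexicon and threshold):
# correct a mistyped pinyin syllable by picking the closest known
# syllable, then keep BOTH the original and the corrected sequence
# for downstream syllable-network parsing, as the claim describes.

from difflib import get_close_matches

LEXICON = ["zhang", "chang", "shang"]  # toy syllable lexicon

def correct(syllable):
    matches = get_close_matches(syllable, LEXICON, n=1, cutoff=0.6)
    return matches[0] if matches else syllable

typed = "zhnag"                      # transposition typo for "zhang"
sequences = [typed, correct(typed)]  # both sequences feed the parser
print(sequences)
```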
5. The method of claim 2, wherein after converting the input sequence into corresponding word candidates, the method further comprises:
acquiring input association information;
determining first score information of each word candidate based on the input association information and the context information, wherein the first score information represents the plausibility of the word candidate;
and selecting the N word candidates with the highest first score information, wherein N is a positive integer.
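The top-N selection of claim 5 is a standard operation; a sketch with a hypothetical scorer (the scores would in practice come from the context and input association information):

```python
# Illustrative sketch of claim 5 (hypothetical scorer): keep the N
# word candidates with the highest first score information.

import heapq

def top_n_candidates(candidates, scorer, n):
    """scorer(word) -> first score (plausibility); return the N best."""
    return heapq.nlargest(n, candidates, key=scorer)

toy_scores = {"today": 0.9, "field": 0.2, "tonight": 0.6}
best = top_n_candidates(list(toy_scores), toy_scores.get, 2)
print(best)  # → ['today', 'tonight']
```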
6. The method according to claim 5, wherein the output of the sentence prediction model further includes second score information of the sentence candidates, and when there are a plurality of sentence candidates, the method further comprises:
ranking the sentence candidates based on the second score information of each sentence candidate and the first score information of the word candidate contained in that sentence candidate.
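The combined ranking of claim 6 can be sketched as a weighted mix of the two scores; the linear combination and the weight `alpha` are assumptions for illustration, since the claim does not fix how the scores are combined:

```python
# Illustrative sketch of claim 6 (hypothetical weighting): rank sentence
# candidates by combining the model's second score with the first score
# of the word candidate each sentence was built from.

def rank_sentences(sentences, alpha=0.5):
    """sentences: list of (text, first_score, second_score) tuples."""
    key = lambda s: alpha * s[1] + (1 - alpha) * s[2]
    return sorted(sentences, key=key, reverse=True)

ranked = rank_sentences([
    ("I think today is fine", 0.9, 0.7),    # 0.5*0.9 + 0.5*0.7  = 0.800
    ("I think tonight works", 0.6, 0.95),   # 0.5*0.6 + 0.5*0.95 = 0.775
])
print([s[0] for s in ranked])  # → ['I think today is fine', 'I think tonight works']
```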
7. The method according to claim 1, wherein when there are a plurality of sentence candidates, the method further comprises:
acquiring input association information;
and ranking the sentence candidates based on the input association information.
8. A data processing apparatus, comprising:
an acquisition module, configured to acquire an input sequence and context information;
a concatenation module, configured to concatenate word candidates of the input sequence with the context information to obtain a corresponding concatenation result;
and a prediction module, configured to input the concatenation result into a sentence prediction model to obtain a sentence candidate output by the sentence prediction model.
9. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:
acquiring an input sequence and context information;
concatenating word candidates of the input sequence with the context information to obtain a corresponding concatenation result;
and inputting the concatenation result into a sentence prediction model to obtain a sentence candidate output by the sentence prediction model.
10. A readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method according to any one of claims 1 to 7.
CN202010366776.0A 2020-04-30 2020-04-30 Data processing method and device and electronic equipment Pending CN113589948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010366776.0A CN113589948A (en) 2020-04-30 2020-04-30 Data processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113589948A true CN113589948A (en) 2021-11-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination