CN112926343A

CN112926343A - Data processing method and device and electronic equipment

Info

Publication number: CN112926343A
Application number: CN201911244080.4A
Authority: CN
Inventors: 许静芳; 翟飞飞; 戴磊; 杨正彪; 戴加明; 李质轩; 王坤; 武静; 王青龙
Original assignee: Beijing Sogou Technology Development Co Ltd; Sogou Hangzhou Intelligent Technology Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2021-06-08

Abstract

The embodiment of the invention provides a data processing method, a data processing device and electronic equipment, wherein the method comprises the following steps: obtaining a source language text; coding each character in the source language text according to character element information to obtain coding information corresponding to the source language text; translating the source language text into a target language text by adopting a machine translation model according to the coding information corresponding to the source language text; compared with the prior art that the characters in the source language text are directly coded, the embodiment of the invention can code the source language text in a finer granularity, thereby improving the translation quality of a machine translation model.

Description

Data processing method and device and electronic equipment

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, and an electronic device.

Background

Artificial intelligence covers a very broad range of science and consists of different fields such as machine learning and computer vision. In general, one of the main goals of artificial intelligence research is to enable machines to perform complex tasks that typically require human intelligence. Since the birth of artificial intelligence, its theories and technologies have steadily matured and its fields of application have continuously expanded, for example into machine translation, such as translating Chinese into English or English into Chinese.

In machine translation, Chinese characters are usually encoded directly, and translation is then performed based on the encoded result. However, compared with European languages such as English, Chinese has more characters with richer semantics, so there is a semantic imbalance between Chinese and other languages, and characters that appear only sparsely in training cannot be translated accurately.

Disclosure of Invention

The embodiment of the invention provides a data processing method for improving the quality of machine translation.

Correspondingly, the embodiment of the invention also provides a data processing device and electronic equipment, which are used for ensuring the realization and application of the method.

In order to solve the above problem, an embodiment of the present invention discloses a data processing method, which specifically includes: obtaining a source language text; coding each character in the source language text according to character element information to obtain coding information corresponding to the source language text; and translating the source language text into a target language text by adopting a machine translation model according to the coding information corresponding to the source language text.

Optionally, the encoding each character in the source language text according to the character element information to obtain the encoding information corresponding to the source language text includes: coding each character in the source language text according to the character pattern of the character to obtain coding information corresponding to each character; and splicing the coded information corresponding to each character to generate the coded information corresponding to the source language text.

Optionally, the encoding each character in the source language text according to the font of the character to obtain the encoding information corresponding to each character includes: performing the following for each word in the source language text: splitting the characters by taking the components as minimum units; coding each component contained in the characters respectively to obtain font coding information corresponding to each component; and generating the coded information of the characters according to the character pattern coded information of each component contained in the characters.

Optionally, the encoding each character in the source language text according to the font of the character to obtain the encoding information corresponding to each character, further includes: after the characters are split by taking the components as the minimum unit, determining the spatial information of each component contained in the characters according to the character shape structure of the characters; coding the spatial information of each component contained in the characters to obtain spatial coding information corresponding to each component; the generating the coding information of the characters according to the character coding information of each component contained in the characters comprises: and adopting the font coding information of each component contained in the characters and the corresponding space coding information to form the coding information of the characters.

Optionally, the forming the coded information of the text by using the font coded information of each component and the corresponding spatial coded information included in the text includes: adopting the font coding information and the corresponding space coding information of each component contained in the characters to form coding information corresponding to each component; and splicing the coding information of each component according to the sequence of each component in the characters to obtain the coding information of the characters.

Optionally, the method further comprises the step of training the machine translation model: obtaining a corpus, wherein the corpus comprises: a source language training text and a corresponding target language training text; coding each character in the source language training text according to the text element information to obtain coding information corresponding to the source language training text; and training the machine translation model according to the coding information corresponding to the source language training text and the target language training text.

The embodiment of the invention also discloses a data processing device, which specifically comprises: the acquisition module is used for acquiring a source language text; the encoding module is used for encoding each character in the source language text according to character element information to obtain encoding information corresponding to the source language text; and the translation module is used for translating the source language text into a target language text by adopting a machine translation model according to the coding information corresponding to the source language text.

Optionally, the encoding module includes: the character coding submodule is used for coding each character in the source language text according to the character pattern of the character to obtain coding information corresponding to each character; and the splicing submodule is used for splicing the coded information corresponding to each character to generate the coded information corresponding to the source language text.

Optionally, the text encoding sub-module includes: the splitting unit is used for splitting each character in the source language text by taking the components as the minimum unit; the character form coding unit is used for coding each component contained in the characters respectively to obtain character form coding information corresponding to each component; and the information generating unit is used for generating the coded information of the characters according to the character pattern coded information of each component contained in the characters.

Optionally, the text encoding sub-module further includes: a spatial information determining unit, configured to determine, after the character is split with the components as the minimum unit, spatial information of each component included in the character according to a font structure of the character; the space coding unit is used for coding the space information of each component contained in the characters to obtain the space coding information corresponding to each component; and the information generating unit is used for adopting the font coding information of each component contained in the characters and the corresponding space coding information to form the coding information of the characters.

Optionally, the information generating unit is configured to use font coding information of each component and corresponding spatial coding information included in the text to form coding information corresponding to each component; and splicing the coding information of each component according to the sequence of each component in the characters to obtain the coding information of the characters.

Optionally, the apparatus further comprises: the training module is used for obtaining training corpora, and the training corpora comprise: a source language training text and a corresponding target language training text; coding each character in the source language training text according to the text element information to obtain coding information corresponding to the source language training text; and training the machine translation model according to the coding information corresponding to the source language training text and the target language training text.

The embodiment of the invention also discloses a readable storage medium, and when the instructions in the storage medium are executed by a processor of the electronic equipment, the electronic equipment can execute the data processing method according to any one of the embodiments of the invention.

An embodiment of the present invention also discloses an electronic device, including a memory, and one or more programs, where the one or more programs are stored in the memory, and configured to be executed by one or more processors, and the one or more programs include instructions for: obtaining a source language text; coding each character in the source language text according to character element information to obtain coding information corresponding to the source language text; and translating the source language text into a target language text by adopting a machine translation model according to the coding information corresponding to the source language text.

Optionally, further comprising instructions for training the machine translation model to: obtaining a corpus, wherein the corpus comprises: a source language training text and a corresponding target language training text; coding each character in the source language training text according to the text element information to obtain coding information corresponding to the source language training text; and training the machine translation model according to the coding information corresponding to the source language training text and the target language training text.

The embodiment of the invention has the following advantages:

In the embodiment of the invention, a source language text can be obtained, then each character in the source language text is coded according to character element information to obtain coded information corresponding to the source language text, and then the source language text is translated into a target language text by adopting a machine translation model according to the coded information corresponding to the source language text; compared with the prior art that the characters in the source language text are directly coded, the embodiment of the invention can code the source language text in a finer granularity, thereby improving the translation quality of a machine translation model.

Drawings

FIG. 1 is a flow chart of the steps of one data processing method embodiment of the present invention;

FIG. 2 is a flow chart of the steps of an alternative embodiment of a data processing method of the present invention;

FIG. 3 is a flow chart of the steps of a method embodiment of the invention for training a machine translation model;

FIG. 4 is a block diagram of an embodiment of a data processing apparatus of the present invention;

FIG. 5 is a block diagram of an alternate embodiment of a data processing apparatus of the present invention;

FIG. 6 illustrates a block diagram of an electronic device for data processing in accordance with an exemplary embodiment;

FIG. 7 is a schematic structural diagram of an electronic device for data processing according to another exemplary embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

One of the core ideas of the embodiment of the invention is that characters are encoded according to character element information, and then the encoded result is input into a machine translation model for translation, so that the characters can be encoded in a finer granularity, and the quality of machine translation is further improved.

Referring to fig. 1, a flowchart illustrating steps of an embodiment of a data processing method according to the present invention is shown, which may specifically include the following steps:

Step 102, obtaining a source language text.

In the embodiment of the invention, when a certain text needs to be translated, the text to be translated can be obtained. The language corresponding to the text to be translated may be referred to as a source language, and the text to be translated may be referred to as a source language text. Steps 104-106 may then be performed to translate the source language text.

And step 104, coding each character in the source language text according to the character element information to obtain coding information corresponding to the source language text.

Step 106, translating the source language text into a target language text by adopting a machine translation model according to the coding information corresponding to the source language text.

In the embodiment of the invention, one method for translating the source language text can be to encode characters in the source language text; and then inputting the coded result into a machine translation model, and translating the source language text into a target language text by the machine translation model based on the coded result. Wherein the target language is a different language than the source language.

In order to improve the translation quality of the machine translation model, when the characters in the source language text are encoded, each character may be encoded according to its character element information, thereby obtaining the coding information corresponding to the source language text. The coding information corresponding to the source language text may then be input into a trained machine translation model (the training process is described later), which translates the coding information and outputs the target language text. In the embodiment of the present invention, the character element information may include multiple elements of a character, such as its font (glyph shape) and its pronunciation, which is not limited in this embodiment of the present invention. Therefore, compared with the prior art in which the characters in the source language text are directly coded, the embodiment of the invention can code the source language text in a finer granularity, thereby improving the translation quality of the machine translation model.

In summary, in the embodiment of the present invention, a source language text may be obtained, then each character in the source language text is encoded according to character element information, so as to obtain encoded information corresponding to the source language text, and then a machine translation model is adopted to translate the source language text into a target language text according to the encoded information corresponding to the source language text; compared with the prior art that the characters in the source language text are directly coded, the embodiment of the invention can code the source language text in a finer granularity, thereby improving the translation quality of a machine translation model.
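The flow of steps 102 to 106 can be sketched as follows. This is only a minimal Python illustration under stated assumptions, not the implementation of the embodiment: the names `ELEMENT_CODES`, `encode_char`, `encode_text`, `translate` and `stand_in_model` are hypothetical, the two table entries reuse the codes from the font example given later in this description, and the stand-in model merely reports how many codes it received.

```python
from typing import Callable, Dict, List

# Hypothetical table of element-level codes per character (illustrative only).
ELEMENT_CODES: Dict[str, List[str]] = {
    "她": ["gdi_0", "rnb_0"],
    "他": ["gdi_1", "rnb_0"],
}


def encode_char(ch: str) -> List[str]:
    """Step 104 for one character: look up its element-level codes."""
    return ELEMENT_CODES[ch]


def encode_text(text: str) -> List[str]:
    """Step 104 for the whole text: splice per-character codes in text order."""
    tokens: List[str] = []
    for ch in text:
        tokens.extend(encode_char(ch))
    return tokens


def translate(text: str, model: Callable[[List[str]], str]) -> str:
    """Steps 102-106: take the text, encode it, pass the codes to the model."""
    return model(encode_text(text))


def stand_in_model(tokens: List[str]) -> str:
    """Placeholder for a trained machine translation model (assumption)."""
    return f"<target text for {len(tokens)} tokens>"


output = translate("她他", stand_in_model)
```

The point of the sketch is the ordering: the model never sees the raw characters, only the coding information derived from their elements.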

In an optional embodiment of the present invention, the text element information may include a font; each character in the source language text can be encoded according to the font, specifically as follows:

Referring to FIG. 2, a flowchart illustrating steps of an alternative embodiment of the data processing method of the present invention is shown, which may specifically include the following steps:

step 202, obtaining a source language text.

In the embodiment of the present invention, the source language may be a language whose characters are ideographic, such as Chinese or Japanese, which is not limited in this embodiment of the present invention. Ideographic characters are a writing system that records information with symbolic written symbols and does not directly, or does not only, represent pronunciation. The source language text may be a word, or one or more sentences, which is not limited in this embodiment of the present invention.

Step 204, coding each character in the source language text according to the font of the character to obtain coding information corresponding to each character.

The following description takes Chinese as the source language and Chinese characters as the characters in the source language text. Step 204 may include the following steps 204-2 to 204-6, which are performed for each character in the source language text:

And step 204-2, splitting the characters by taking the components as minimum units.

Step 204-4, coding each component contained in the characters respectively to obtain font coding information corresponding to each component.

And step 204-6, generating the coded information of the characters according to the character pattern coded information of each component contained in the characters.

In the embodiment of the invention, in terms of glyph shape, the components of a Chinese character can be divided into 3 types: shape components (rational components constructed according to meaning), sound components (rational components constructed according to sound) and matched components (irrational components which are not constructed according to sound). A character with only one component is called a single-component character, and a character with more than one component is called a multi-component character. Therefore, a character can be split by taking the components as the minimum unit, which yields one or more components. Each component contained in the character can then be coded to obtain the font coding information corresponding to that component. Optionally, Wubi (five-stroke) coding or Zhengma root (etymon) coding can be applied to each component to obtain its font coding information. When the Wubi codes or Zhengma root codes of different components are the same, a different identifier can be added for each component; the identifier uniquely identifies one component, so that different components with the same font coding information can be distinguished. The coding information of the character is then generated according to the font coding information of each component contained in the character. One way is to splice the font coding information of the components according to the order of the components in the character, so as to obtain the coding information corresponding to the character. The order of the components in the character may be the order in which the components are written when the character is written.
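The identifier step described above can be illustrated with a short sketch. It assumes that the identifier is a running index appended to the base code, which is one possible reading consistent with codes of the form "gdi_0" and "gdi_1" used in the examples; the function name `assign_unique_codes` is hypothetical, and the base codes below are placeholders rather than real input-method codes.

```python
from collections import defaultdict
from typing import Dict


def assign_unique_codes(base_codes: Dict[str, str]) -> Dict[str, str]:
    """Append a running index to each base code so equal codes stay distinct."""
    seen: Dict[str, int] = defaultdict(int)  # base code -> times already used
    unique: Dict[str, str] = {}
    for component, base in base_codes.items():
        unique[component] = f"{base}_{seen[base]}"
        seen[base] += 1
    return unique


# Placeholder base codes (assumption): two components share the code "gdi".
base_codes = {"女": "gdi", "亻": "gdi", "也": "rnb"}
font_codes = assign_unique_codes(base_codes)
```

With this scheme every component ends up with its own font code even when the underlying base codes collide.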

As an example of the present invention, the Chinese character "她" (she) is split by taking the component as the minimum unit, and the components "女" (woman) and "也" (also) are obtained. Then "女" and "也" are coded respectively to obtain the font coding information "gdi_0" of "女" and the font coding information "rnb_0" of "也". The font coding information "gdi_0" of "女" and the font coding information "rnb_0" of "也" are then spliced to obtain the coding information corresponding to the Chinese character "她": {gdi_0, rnb_0}. For another example, the Chinese character "他" (he) is split by taking the component as the minimum unit, and the components "亻" and "也" are obtained. Then "亻" and "也" are coded respectively to obtain the font coding information "gdi_1" of "亻" and the font coding information "rnb_0" of "也". The font coding information "gdi_1" of "亻" and the font coding information "rnb_0" of "也" are then spliced to obtain the coding information corresponding to the Chinese character "他": {gdi_1, rnb_0}.
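Steps 204-2 to 204-6 for this example can be sketched as below. The sketch assumes that "she" and "he" are the characters 她 (女 + 也) and 他 (亻 + 也); the tables `COMPONENTS` and `FONT_CODE` and the function names are hypothetical, and only the three font codes come from the example.

```python
from typing import Dict, List

# Hypothetical decomposition table: character -> components in writing order.
COMPONENTS: Dict[str, List[str]] = {
    "她": ["女", "也"],
    "他": ["亻", "也"],
}

# Font codes as used in the example.
FONT_CODE: Dict[str, str] = {"女": "gdi_0", "亻": "gdi_1", "也": "rnb_0"}


def split_into_components(ch: str) -> List[str]:
    """Step 204-2: split a character with the component as the minimum unit."""
    return COMPONENTS[ch]


def encode_char_by_font(ch: str) -> List[str]:
    """Steps 204-4 and 204-6: code each component, then splice in order."""
    return [FONT_CODE[component] for component in split_into_components(ch)]


she = encode_char_by_font("她")
he = encode_char_by_font("他")
```

A character missing from the tables raises `KeyError` here; a real system would need a fallback, which the description does not specify.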

In one embodiment of the present invention, the structures of Chinese characters can be divided into two categories: the structure of a single-component character (which may be referred to as a single structure) and the structures of multi-component characters. The structures of multi-component characters include various structures, such as the up-down structure and the left-right structure. For a multi-component character, coding the components alone cannot reflect the spatial information of the components in characters of different structures; thus, the spatial information of the components may also be encoded. Correspondingly, the encoding of each character in the source language text according to the font of the character to obtain the coding information corresponding to each character further includes: after the character is split by taking the components as the minimum unit in step 204-2, determining the spatial information of each component contained in the character according to the font structure of the character; and coding the spatial information of each component contained in the character to obtain spatial coding information corresponding to each component. The generating of the coding information of the character according to the font coding information of each component contained in the character comprises: using the font coding information of each component contained in the character and the corresponding spatial coding information to form the coding information of the character.

In the embodiment of the present invention, after the character is split with the components as the minimum units, the spatial information of each component included in the character can be determined according to the font structure of the character. The spatial information may include: empty, left, middle, right, up, down, delta-shaped up, delta-shaped lower left, delta-shaped lower right, full enclosure, half enclosure and inside. "Empty" may be the spatial information of a single-component character. Left, middle, right, up and down may refer to the spatial information of the components in characters with the up-down structure, the up-middle-down structure, the left-right structure and the left-middle-right structure. Delta-shaped up, delta-shaped lower left and delta-shaped lower right may refer to the spatial information of the components in characters with the delta-shaped (triangular) structure, such as "品" and "森". Full enclosure and inside may refer to the spatial information of the components in characters with the fully enclosed structure (for example, the character meaning "circle"). Half enclosure and inside may refer to the spatial information of the components in characters with the half-enclosed structure (for example, the character meaning "celebration").

Then, the spatial information of each component contained in the character can be coded to obtain the spatial coding information corresponding to each component. For example, in the Chinese character "她" (she), the spatial coding information of the component "女" (woman) is "kdi_1" and the spatial coding information of the component "也" (also) is "kdi_3". Different spatial information corresponds to different spatial coding information. Of course, each piece of spatial information may be encoded in advance, and a word list mapping spatial information to the corresponding spatial coding information may then be created. When the spatial information of each component contained in the character is coded to obtain the spatial coding information corresponding to each component, the word list can be searched according to the spatial information of each component to determine the spatial coding information corresponding to each component; the embodiments of the present invention are not limited in this regard.
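A word list of spatial information could look like the following sketch. Numbering the codes by their position in the list of twelve kinds of spatial information is an assumption; it happens to give "kdi_1" for left and "kdi_3" for right, matching the two codes in the example. The structure names and the mapping `STRUCTURE_POSITIONS` are likewise hypothetical.

```python
from typing import Dict, List

# The twelve kinds of spatial information, in the order listed in the text.
SPATIAL_INFO: List[str] = [
    "empty", "left", "middle", "right", "up", "down",
    "delta_up", "delta_lower_left", "delta_lower_right",
    "full_enclosure", "half_enclosure", "inside",
]

# Word list built in advance: spatial information -> spatial code (assumed numbering).
SPATIAL_VOCAB: Dict[str, str] = {
    name: f"kdi_{index}" for index, name in enumerate(SPATIAL_INFO)
}

# Hypothetical map from font structure to component positions, in writing order.
STRUCTURE_POSITIONS: Dict[str, List[str]] = {
    "single": ["empty"],
    "left_right": ["left", "right"],
    "left_middle_right": ["left", "middle", "right"],
    "up_down": ["up", "down"],
    "up_middle_down": ["up", "middle", "down"],
    "delta": ["delta_up", "delta_lower_left", "delta_lower_right"],
    "full_enclosure": ["full_enclosure", "inside"],
    "half_enclosure": ["half_enclosure", "inside"],
}


def spatial_codes(structure: str) -> List[str]:
    """Look up the spatial code of each component of a character structure."""
    return [SPATIAL_VOCAB[position] for position in STRUCTURE_POSITIONS[structure]]


left_right_codes = spatial_codes("left_right")
```

Because every kind of spatial information has its own entry, different spatial information always maps to different spatial coding information.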

The character pattern coding information of each component contained in the characters and the corresponding space coding information are adopted to form the coding information of the characters; may comprise the following sub-steps:

Substep 22, adopting the font coding information of each component contained in the characters and the corresponding space coding information to form the coding information corresponding to each component.

Substep 24, splicing the coding information of each component according to the sequence of each component in the characters to obtain the coding information of the characters.

In the embodiment of the present invention, for each component included in the character, the font coding information of the component and the corresponding spatial coding information may be used to form the coding information corresponding to the component. This can be done in various ways, for example by splicing the spatial coding information of the component after the font coding information of the component, or by using the spatial coding information of the component as a superscript or subscript of the font coding information of the component, which is not limited in this embodiment of the present invention.

The coding information of the components is then spliced according to the order of the components in the character to obtain the coding information of the character. For example, the components of the Chinese character "她" (she) are "女" (woman) and "也" (also). The font coding information corresponding to "女" is "gdi_0" and its spatial coding information is "kdi_1", so the coding information corresponding to "女" is {gdi_0, kdi_1}. The font coding information corresponding to "也" is "rnb_0" and its spatial coding information is "kdi_3", so the coding information corresponding to "也" is {rnb_0, kdi_3}. The coding information of "女" and "也" is then spliced to obtain the coding information of "她": {gdi_0, kdi_1, rnb_0, kdi_3}. Correspondingly, the coding information of the Chinese character "他" (he) may be {gdi_1, kdi_1, rnb_0, kdi_3}.
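Substeps 22 and 24 can be sketched as follows, using the variant in which the spatial code is spliced after the font code. The tables and function names are hypothetical, and the decomposition of "she" as 她 (女 left, 也 right) and "he" as 他 (亻 left, 也 right) is an assumption made for the illustration; the codes themselves are those of the example.

```python
from typing import Dict, List, Tuple

# Hypothetical table: character -> (component, position) pairs in writing order.
DECOMPOSITION: Dict[str, List[Tuple[str, str]]] = {
    "她": [("女", "left"), ("也", "right")],
    "他": [("亻", "left"), ("也", "right")],
}
FONT_CODE: Dict[str, str] = {"女": "gdi_0", "亻": "gdi_1", "也": "rnb_0"}
SPATIAL_CODE: Dict[str, str] = {"left": "kdi_1", "right": "kdi_3"}


def encode_component(component: str, position: str) -> List[str]:
    """Substep 22: font code followed by the spatial code of one component."""
    return [FONT_CODE[component], SPATIAL_CODE[position]]


def encode_char(ch: str) -> List[str]:
    """Substep 24: splice the component codes in component order."""
    tokens: List[str] = []
    for component, position in DECOMPOSITION[ch]:
        tokens.extend(encode_component(component, position))
    return tokens


she = encode_char("她")
he = encode_char("他")
```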

In the prior art, characters are coded directly; for example, the Chinese character "她" (she) is directly coded to obtain coding information X, and the Chinese character "他" (he) is directly coded to obtain coding information Y. Whether the coding information of a character is generated from the font coding information of its components alone, or from the font coding information of its components together with the corresponding spatial coding information, the granularity of the coding information obtained by the embodiment of the invention is finer than that of the coding information obtained by the prior art. Further, the last three codes "kdi_1, rnb_0, kdi_3" are the same in the coding information {gdi_0, kdi_1, rnb_0, kdi_3} of "她" and the coding information {gdi_1, kdi_1, rnb_0, kdi_3} of "他". For sparsely trained characters, the machine translation model can therefore find the association between the sparsely trained characters and well-trained characters based on the coding information, so that the translation quality of the sparsely trained characters is improved.
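The overlap argument can be checked mechanically. The helper `shared_suffix` below is a hypothetical name introduced only for this illustration; it compares the two code sequences from the example with the direct codes X and Y of the prior art.

```python
from typing import List


def shared_suffix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest run of trailing codes that two sequences have in common."""
    shared: List[str] = []
    for x, y in zip(reversed(a), reversed(b)):
        if x != y:
            break
        shared.append(x)
    return shared[::-1]  # restore the original left-to-right order


# Component-level coding information from the example.
she = ["gdi_0", "kdi_1", "rnb_0", "kdi_3"]
he = ["gdi_1", "kdi_1", "rnb_0", "kdi_3"]
component_overlap = shared_suffix(she, he)

# Direct coding of whole characters shares nothing.
direct_overlap = shared_suffix(["X"], ["Y"])
```

Three of the four codes are shared under component-level coding, while the direct codes have nothing in common, which is the association a model can exploit for rarely seen characters.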

Of course, the embodiment of the present invention may also obtain the source language corpus in advance, and then generate the coding information corresponding to each character in the source language corpus according to the method described in step 204 for each character in the source language corpus; and then establishing a word list of each character and corresponding coding information. Further, when step 204 is executed, the word list can be directly searched based on each character in the source language text, and the coding information corresponding to each character in the source language text is determined; the embodiments of the present invention are not limited in this regard.
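The precomputed word list can be sketched as below. `build_vocabulary`, `encode_text_with_vocabulary` and `toy_encode_char` are hypothetical names; the toy encoder stands in for the per-character coding of step 204 and records its calls only to show that each distinct character is coded once.

```python
from typing import Callable, Dict, Iterable, List


def build_vocabulary(
    corpus: Iterable[str],
    encode_char: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Code every distinct character of the corpus once and store the result."""
    vocabulary: Dict[str, List[str]] = {}
    for text in corpus:
        for ch in text:
            if ch not in vocabulary:
                vocabulary[ch] = encode_char(ch)
    return vocabulary


def encode_text_with_vocabulary(text: str, vocabulary: Dict[str, List[str]]) -> List[str]:
    """Step 204 by lookup, followed by splicing in character order."""
    tokens: List[str] = []
    for ch in text:
        tokens.extend(vocabulary[ch])
    return tokens


calls: List[str] = []  # records which characters were actually coded


def toy_encode_char(ch: str) -> List[str]:
    """Stand-in for the component-based coding of one character (assumption)."""
    calls.append(ch)
    table = {
        "她": ["gdi_0", "kdi_1", "rnb_0", "kdi_3"],
        "他": ["gdi_1", "kdi_1", "rnb_0", "kdi_3"],
    }
    return table[ch]


vocabulary = build_vocabulary(["她他", "他她他"], toy_encode_char)
encoded = encode_text_with_vocabulary("他她", vocabulary)
```

The trade-off is the usual one for lookup tables: more memory in exchange for no repeated splitting and coding at translation time.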

And step 206, splicing the coded information corresponding to each character to generate coded information corresponding to the source language text.

One implementation manner of the embodiment of the present invention may be that the order of each character in the source language text is determined; and then splicing the coded information corresponding to each character according to the sequence of each character to generate the coded information corresponding to the source language text.

Step 208, translating the source language text into a target language text by adopting a machine translation model according to the coding information corresponding to the source language text.

The coding information of the source language text can be input into a machine translation model, and the machine translation model translates the source language text into a target language text according to the coding information of the source language text.

In summary, in the embodiment of the present invention, after a source language text is obtained, each character in the source language text may be encoded according to a font of the character to obtain encoded information corresponding to each character, and then the encoded information corresponding to each character is spliced to generate encoded information corresponding to the source language text; the character can be split by taking the components as the minimum unit, each component contained in the character is coded respectively to obtain font coding information corresponding to each component, and the coding information of the character is generated according to the font coding information of each component contained in the character; because different characters may contain the same components, for the characters with sparse training, the machine translation model can search the relevance between the characters with sparse training and the trained characters based on the coding information, and further improve the translation quality of the characters with sparse training.

Secondly, in the embodiment of the present invention, after the characters are split with the components as the minimum unit, the spatial information of each component included in the characters can be determined according to the font structure of the characters, and the spatial information of each component included in the characters is encoded to obtain spatial encoding information corresponding to each component; then, the character pattern coding information of each component contained in the characters and the corresponding space coding information are adopted to form the coding information of the characters; and further, character codes are further refined, so that the translation quality of sparse training data is further improved.

In an optional embodiment of the present invention, the character element information may further include character pronunciation information. Correspondingly, another way of coding each character in the source language text according to the character element information is to code each character according to its pronunciation to obtain the coding information corresponding to the source language text. Each character in the source language text can be phonetically annotated to obtain a phonetic notation sequence corresponding to the character; the phonetic notation sequence is then coded to obtain the coding information corresponding to the character; and the coding information corresponding to each character is spliced to generate the coding information corresponding to the source language text. Alternatively, each character in a source language corpus can be phonetically annotated in advance and its phonetic notation sequence coded to obtain the corresponding coding information, after which a vocabulary mapping each character to its coding information is established. When the characters in the source language text are then coded according to their pronunciation, the vocabulary is searched for each character in the source language text to determine the coding information corresponding to that character. Compared with a vocabulary generated by coding according to the font of the characters, a vocabulary generated by coding according to the pronunciation of the characters contains less data, which improves the efficiency of determining the coding information corresponding to the source language text.
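The vocabulary-based variant described above can be sketched as follows. This is an illustrative sketch only: the phonetic table `PHONETIC` (pinyin with tone digits), the choice of integer ids, and the reserved id 0 for unknown characters are all assumptions made for the example and are not specified in the embodiment.

```python
# Hypothetical phonetic table (pinyin with tone digits); illustrative only.
PHONETIC = {"你": "ni3", "好": "hao3", "妈": "ma1", "吗": "ma5"}


def build_symbol_ids(table):
    """Assign an integer id to every phonetic symbol seen in the table."""
    symbols = sorted({s for seq in table.values() for s in seq})
    return {s: i + 1 for i, s in enumerate(symbols)}  # id 0 is reserved for unknown


def build_vocabulary(table):
    """Precompute character -> coding information (a list of symbol ids)."""
    ids = build_symbol_ids(table)
    return {ch: [ids[s] for s in seq] for ch, seq in table.items()}


def encode_text(text, vocabulary):
    """Look up each character and splice the per-character codes in order."""
    encoded = []
    for ch in text:
        encoded.extend(vocabulary.get(ch, [0]))
    return encoded
```

In this sketch, characters with similar pronunciation (for example "妈" and "吗") receive codes that share a prefix, and coding a text at run time reduces to one dictionary lookup per character.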

In another embodiment of the present invention, the machine translation model may be trained in advance by using a training corpus, so that the machine translation model learns how to accurately translate coding information obtained by coding according to the character element information. The following describes the training process of the machine translation model.

Referring to fig. 3, a flowchart illustrating steps of an embodiment of a machine translation model training method according to the present invention is shown, which may specifically include the following steps:

Step 302, obtaining a corpus, wherein the corpus includes: a source language training text and a corresponding target language training text;

Step 304, coding each character in the source language training text according to the character element information to obtain coding information corresponding to the source language training text;

Step 306, training the machine translation model according to the coding information corresponding to the source language training text and the target language training text.

In the embodiment of the invention, training corpora can be collected. The corpus comprises source language training texts and corresponding target language training texts; there may be a plurality of source language training texts and, correspondingly, a plurality of target language training texts. The target language training text corresponding to a source language training text may be a target language text obtained by translating that source language training text. Each character in the source language training text is then coded according to the character element information to obtain the coding information corresponding to the source language training text; this is similar to steps 204-206, and the details are not repeated herein.

Then, the machine translation model can be trained according to the coding information corresponding to the source language training text and the target language training text; therefore, the machine translation model can learn how to accurately translate the coded information obtained by coding according to the character element information.

The training process of the machine translation model is described below, taking the coding information corresponding to one source language training text and the corresponding target language training text as an example. Forward training: the coding information corresponding to the source language training text can be input into the machine translation model, and a plurality of target language texts and corresponding probabilities output by the machine translation model are obtained. Reverse training: it is determined whether the target language text with the maximum probability output by the machine translation model matches the target language training text corresponding to the source language training text. If not, the weights of the machine translation model are adjusted until, after the coding information corresponding to the source language training text is input, the target language text with the maximum probability output by the machine translation model matches the target language training text corresponding to the source language training text.
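The forward/reverse procedure above can be illustrated with a toy sketch. `ToyTranslationModel` is a hypothetical stand-in that only ranks a fixed set of candidate target texts; the softmax scoring and the perceptron-style weight update are assumptions chosen for brevity and are not the model or update rule of the embodiment.

```python
import math
from collections import defaultdict


class ToyTranslationModel:
    """Toy stand-in for a machine translation model (an assumption for illustration).

    It scores a fixed set of candidate target texts from the source coding information.
    """

    def __init__(self, candidates):
        self.candidates = list(candidates)
        self.weights = defaultdict(float)  # (code, candidate) -> weight

    def forward(self, codes):
        """Forward pass: return {candidate: probability} via a softmax over summed weights."""
        scores = {c: sum(self.weights[(k, c)] for k in codes) for c in self.candidates}
        top = max(scores.values())
        exps = {c: math.exp(s - top) for c, s in scores.items()}
        total = sum(exps.values())
        return {c: e / total for c, e in exps.items()}

    def adjust(self, codes, predicted, target, step=1.0):
        """Weight adjustment: favor the reference text, penalize the wrong top output."""
        for k in codes:
            self.weights[(k, target)] += step
            self.weights[(k, predicted)] -= step


def train(model, corpus, max_epochs=20):
    """Repeat forward pass and weight adjustment until every top output matches."""
    for _ in range(max_epochs):
        mismatches = 0
        for codes, target in corpus:
            probs = model.forward(codes)
            predicted = max(probs, key=probs.get)
            if predicted != target:
                model.adjust(codes, predicted, target)
                mismatches += 1
        if mismatches == 0:
            return True
    return False
```

A corpus here is a list of `(coding_information, target_text)` pairs; `train` returns `True` once the highest-probability output matches the reference for every pair, and `False` if that does not happen within `max_epochs`.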

It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts involved are not necessarily required by the embodiments of the present invention.

Referring to fig. 4, a block diagram of a data processing apparatus according to an embodiment of the present invention is shown, which may specifically include the following modules:

an obtaining module 402, configured to obtain a source language text;

the encoding module 404 is configured to encode each character in the source language text according to character element information to obtain encoding information corresponding to the source language text;

and the translation module 406 is configured to translate the source language text into a target language text according to the coding information corresponding to the source language text by using a machine translation model.
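As a rough sketch, the three modules above can be wired together as follows. The class name `DataProcessingApparatus`, the two injected callables, and the whitespace trimming in `obtain` are hypothetical choices made for this example; the embodiment does not prescribe an implementation.

```python
from typing import Callable, List


class DataProcessingApparatus:
    """Wires together the obtaining (402), encoding (404) and translation (406) modules."""

    def __init__(self, encode_char: Callable[[str], List[int]],
                 translate: Callable[[List[int]], str]):
        self.encode_char = encode_char  # hypothetical per-character encoder
        self.translate = translate      # hypothetical translation model callable

    def obtain(self, raw: str) -> str:
        """Obtaining module: in this sketch it only trims surrounding whitespace."""
        return raw.strip()

    def encode(self, text: str) -> List[int]:
        """Encoding module: code each character, then splice the codes in order."""
        codes: List[int] = []
        for ch in text:
            codes.extend(self.encode_char(ch))
        return codes

    def process(self, raw: str) -> str:
        """Run obtain, then encode, then translate."""
        return self.translate(self.encode(self.obtain(raw)))
```

Passing the encoder and the model in as callables keeps the sketch independent of any particular coding scheme (font-based or pronunciation-based) or model architecture.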

Referring to fig. 5, a block diagram of an alternative embodiment of a data processing apparatus of the present invention is shown.

In an alternative embodiment of the present invention, the encoding module 404 includes:

the character coding submodule 4042 is used for coding each character in the source language text according to the character pattern of the character to obtain coding information corresponding to each character;

and the splicing submodule 4044 is configured to splice the coded information corresponding to each character to generate coded information corresponding to the source language text.

In an optional embodiment of the present invention, the character coding submodule 4042 includes:

a splitting unit 40422, configured to split, for each character in the source language text, the character with a component as a minimum unit;

a font coding unit 40424, configured to code each component included in the character, to obtain font coding information corresponding to each component;

the information generating unit 40426 is configured to generate the coding information of the character according to the font coding information of each component included in the character.
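The three units above can be sketched as three small functions. The component table `PARTS`, the integer codes in `GLYPH_CODE`, and the use of 0 for unknown components are illustrative assumptions, not values from the embodiment.

```python
# Hypothetical component table and font (glyph) codes, for illustration only.
PARTS = {"明": ["日", "月"], "朋": ["月", "月"]}
GLYPH_CODE = {"日": 10, "月": 11}


def split_unit(char):
    """Splitting unit: components are the minimum unit; unknown characters stay whole."""
    return PARTS.get(char, [char])


def font_coding_unit(components):
    """Font coding unit: one code per component, 0 for unknown components."""
    return [GLYPH_CODE.get(c, 0) for c in components]


def information_generating_unit(font_codes):
    """Information generating unit: the character code is the ordered tuple of codes."""
    return tuple(font_codes)


def encode(char):
    """Chain the three units to code a single character."""
    return information_generating_unit(font_coding_unit(split_unit(char)))
```

With these assumed tables, "明" and "朋" receive different codes that nevertheless share the code of their common component "月".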

In an optional embodiment of the present invention, the character coding submodule 4042 further includes:

a spatial information determining unit 40428, configured to determine, after the character is split with a component as a minimum unit, spatial information of each component included in the character according to a font structure of the character;

a spatial coding unit 404210, configured to code the spatial information of each component included in the character to obtain spatial coding information corresponding to each component;

the information generating unit 40426 is configured to use the font coding information of each component included in the character and the corresponding spatial coding information to form the coding information of the character.

In an optional embodiment of the present invention, the information generating unit 40426 is configured to use the font coding information and the corresponding spatial coding information of each component included in the character to form coding information corresponding to each component, and to splice the coding information of each component according to the order of the components in the character to obtain the coding information of the character.
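A minimal sketch of combining font coding information with spatial coding information is shown below. The tables `DECOMPOSITION`, `GLYPH_ID`, `POSITION_ID` and `SLOTS`, the structure labels, and the fallback for unknown characters are all hypothetical and serve only to make the idea concrete.

```python
# Hypothetical tables; ids and structure labels are illustrative assumptions.
DECOMPOSITION = {
    "好": ("left-right", ["女", "子"]),
    "李": ("top-bottom", ["木", "子"]),
}
GLYPH_ID = {"女": 1, "子": 2, "木": 3}
POSITION_ID = {"whole": 0, "left": 1, "right": 2, "top": 3, "bottom": 4}
SLOTS = {"left-right": ["left", "right"], "top-bottom": ["top", "bottom"]}


def encode_character(char):
    """Return font code and spatial code for each component, spliced in component order."""
    if char not in DECOMPOSITION:
        # Unknown character: treat it as one component occupying the whole cell.
        return [GLYPH_ID.get(char, 0), POSITION_ID["whole"]]
    structure, components = DECOMPOSITION[char]
    codes = []
    for component, slot in zip(components, SLOTS[structure]):
        # Per-component coding information: font code followed by spatial code.
        codes.extend([GLYPH_ID[component], POSITION_ID[slot]])
    return codes
```

In this sketch the component "子" has the same font code in "好" and "李" but a different spatial code (right versus bottom), so the two occurrences remain related while still being distinguishable.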

In an optional embodiment of the present invention, the apparatus further comprises:

the training module 408 is configured to obtain a corpus, where the corpus includes: a source language training text and a corresponding target language training text; code each character in the source language training text according to the character element information to obtain coding information corresponding to the source language training text; and train the machine translation model according to the coding information corresponding to the source language training text and the target language training text.

Because the apparatus embodiment is substantially similar to the method embodiment, it is described only briefly; for relevant details, refer to the corresponding description of the method embodiment.

Fig. 6 is a block diagram illustrating a structure of an electronic device 600 for data processing according to an example embodiment. For example, the electronic device 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.

Referring to fig. 6, the electronic device 600 may include one or more of the following components: a processing component 602, a memory 604, a power component 606, a multimedia component 608, an audio component 610, an input/output (I/O) interface 612, a sensor component 614, and a communication component 616.

The processing component 602 generally controls overall operation of the electronic device 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 can include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.

The memory 604 is configured to store various types of data to support operation of the electronic device 600. Examples of such data include instructions for any application or method operating on the electronic device 600, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 604 may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disks, or optical disks.

The power component 606 provides power to the various components of the electronic device 600. The power component 606 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 600.

The multimedia component 608 includes a screen that provides an output interface between the electronic device 600 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 600 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.

The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.

The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.

The sensor component 614 includes one or more sensors for providing status assessment of various aspects of the electronic device 600. For example, the sensor component 614 may detect an open/closed state of the electronic device 600 and the relative positioning of components, such as the display and keypad of the electronic device 600. The sensor component 614 may also detect a change in the position of the electronic device 600 or of a component of the electronic device 600, the presence or absence of user contact with the electronic device 600, the orientation or acceleration/deceleration of the electronic device 600, and a change in the temperature of the electronic device 600. The sensor component 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor component 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 616 is configured to facilitate communications between the electronic device 600 and other devices in a wired or wireless manner. The electronic device 600 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 616 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.

In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions is also provided, such as the memory 604 comprising instructions, wherein the instructions are executable by the processor 620 of the electronic device 600 to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

A non-transitory computer readable storage medium in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform a data processing method, the method comprising: obtaining a source language text; coding each character in the source language text according to character element information to obtain coding information corresponding to the source language text; and translating the source language text into a target language text by adopting a machine translation model according to the coding information corresponding to the source language text.

Fig. 7 is a schematic structural diagram of an electronic device 700 for data processing according to another exemplary embodiment of the present invention. The electronic device 700 may be a server, which may vary significantly depending on configuration or performance, and may include one or more Central Processing Units (CPUs) 722 (e.g., one or more processors) and memory 732, one or more storage media 730 (e.g., one or more mass storage devices) storing applications 742 or data 744. Memory 732 and storage medium 730 may be, among other things, transient storage or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processor 722 may be configured to communicate with the storage medium 730, and execute a series of instruction operations in the storage medium 730 on the server.

The server may also include one or more power supplies 726, one or more wired or wireless network interfaces 750, one or more input-output interfaces 758, one or more keyboards 756, and/or one or more operating systems 741, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for: obtaining a source language text; coding each character in the source language text according to character element information to obtain coding information corresponding to the source language text; and translating the source language text into a target language text by adopting a machine translation model according to the coding information corresponding to the source language text.

The embodiments in the present specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar among the embodiments, the embodiments may be referred to one another.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The data processing method, the data processing apparatus and the electronic device provided by the present invention are described in detail above, and specific examples are applied herein to illustrate the principles and embodiments of the present invention, and the descriptions of the above embodiments are only used to help understand the method and the core ideas of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A data processing method, comprising:

obtaining a source language text;

coding each character in the source language text according to character element information to obtain coding information corresponding to the source language text;

and translating the source language text into a target language text by adopting a machine translation model according to the coding information corresponding to the source language text.

2. The method of claim 1, wherein the encoding each character in the source language text according to character element information to obtain encoded information corresponding to the source language text comprises:

coding each character in the source language text according to the character pattern of the character to obtain coding information corresponding to each character;

and splicing the coded information corresponding to each character to generate the coded information corresponding to the source language text.

3. The method of claim 2, wherein the coding each character in the source language text according to the character pattern of the character to obtain the coding information corresponding to each character comprises:

performing the following for each character in the source language text:

splitting the characters by taking the components as minimum units;

coding each component contained in the characters respectively to obtain font coding information corresponding to each component;

and generating the coded information of the characters according to the character pattern coded information of each component contained in the characters.

4. The method of claim 3, wherein the encoding each character in the source language text according to the character pattern of the character to obtain the encoded information corresponding to each character, further comprises:

after the characters are split by taking the components as the minimum unit, determining the spatial information of each component contained in the characters according to the character shape structure of the characters;

coding the spatial information of each component contained in the characters to obtain spatial coding information corresponding to each component;

the generating the coding information of the characters according to the character coding information of each component contained in the characters comprises:

and adopting the font coding information of each component contained in the characters and the corresponding space coding information to form the coding information of the characters.

5. The method according to claim 4, wherein said using the glyph encoding information of each component and the corresponding spatial encoding information included in the text to compose the encoding information of the text comprises:

adopting the font coding information and the corresponding space coding information of each component contained in the characters to form coding information corresponding to each component;

and splicing the coding information of each component according to the sequence of each component in the characters to obtain the coding information of the characters.

6. The method of claim 1, further comprising the step of training the machine translation model by:

obtaining a corpus, wherein the corpus comprises: a source language training text and a corresponding target language training text;

coding each character in the source language training text according to the character element information to obtain coding information corresponding to the source language training text;

and training the machine translation model according to the coding information corresponding to the source language training text and the target language training text.

7. A data processing apparatus, comprising:

the acquisition module is used for acquiring a source language text;

the encoding module is used for encoding each character in the source language text according to character element information to obtain encoding information corresponding to the source language text;

and the translation module is used for translating the source language text into a target language text by adopting a machine translation model according to the coding information corresponding to the source language text.

8. The apparatus of claim 7, wherein the encoding module comprises:

the character coding submodule is used for coding each character in the source language text according to the character pattern of the character to obtain coding information corresponding to each character;

and the splicing submodule is used for splicing the coded information corresponding to each character to generate the coded information corresponding to the source language text.

9. A readable storage medium, characterized in that instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method according to any of method claims 1-6.

10. An electronic device comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, the one or more programs including instructions for:

obtaining a source language text;