CN110414013B - Data processing method and device and electronic equipment - Google Patents


Info

Publication number
CN110414013B
Authority
CN
China
Prior art keywords
word
target
target word
processed
translation
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910702513.XA
Other languages
Chinese (zh)
Other versions
CN110414013A (en
Inventor
王明三
张健昶
曾钦松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910702513.XA
Publication of CN110414013A
Application granted
Publication of CN110414013B

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiments of the invention disclose a data processing method, a device and a terminal. The method may comprise the following steps: acquiring a target word from text content to be translated; preprocessing the target word to obtain a processed target word, wherein the processed target word meets the condition of segmentation into word pieces; segmenting the processed target word into at least one word piece; and translating the target word according to the at least one word piece to obtain a translation result. The embodiments of the invention can improve the accuracy of translation.

Description

Data processing method and device and electronic equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method, a data processing apparatus, and an electronic device.
Background
With the wide application of internet technology, economic globalization continues to expand, promoting communication and cooperation among many countries. Many professionals (e.g., foreign trade workers and technical developers) need to communicate with people who speak different languages and to read large amounts of documents written in languages they are not familiar with, which hinders communication between people. Language translation has therefore become particularly important. In practice, however, the accuracy of existing language translation methods is relatively low, and it is difficult for them to achieve the effect users expect.
Disclosure of Invention
The technical problem to be solved by the embodiment of the invention is to provide a data processing method, a device, a storage medium and electronic equipment, which can improve the accuracy of translation.
In one aspect, an embodiment of the present invention provides a data processing method, including:
acquiring a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, wherein the processed target word meets the condition of segmentation into word pieces;
segmenting the processed target word into at least one word piece; and
translating the target word according to the at least one word piece to obtain a translation result.
In another aspect, an embodiment of the present invention provides a data processing apparatus, including:
an acquisition unit for acquiring a target word from text content to be translated;
a processing unit for preprocessing the target word to obtain a processed target word, wherein the processed target word meets the condition of segmentation into word pieces;
a segmentation unit for segmenting the processed target word into at least one word piece; and
a translation unit for translating the target word according to the at least one word piece to obtain a translation result.
In yet another aspect, an embodiment of the present invention provides an electronic device, including an input device and an output device, and further including:
a processor adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the steps of:
acquiring a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, wherein the processed target word meets the condition of segmentation into word pieces;
segmenting the processed target word into at least one word piece; and
translating the target word according to the at least one word piece to obtain a translation result.
In yet another aspect, an embodiment of the present invention provides a computer storage medium storing one or more instructions adapted to be loaded by a processor and to perform the steps of:
acquiring a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, wherein the processed target word meets the condition of segmentation into word pieces;
segmenting the processed target word into at least one word piece; and
translating the target word according to the at least one word piece to obtain a translation result.
In the embodiments of the invention, the target word is preprocessed so that the processed target word meets the condition of segmentation into word pieces. Because the letters of the processed target word share the same format type, the same word is no longer segmented inconsistently merely because it was written with different format types, which would otherwise lower the translation accuracy of the target word. Because the processed target word contains no erroneous letters, it can also be segmented more accurately. In addition, segmenting the processed target word into at least one word piece describes it with units of smaller granularity, which improves the descriptive power for words. Translating the target word according to the at least one word piece therefore improves the translation accuracy of the target word.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an interface of a data processing process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a data processing apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to FIG. 1, which is a schematic flow chart of a data processing method according to an embodiment of the present invention, the data processing method may be performed by an electronic device, including but not limited to: smart phones, tablet computers, portable personal computers, smart watches, bracelets, smart televisions, and the like. As shown in FIG. 1, the data processing method includes the following steps S101 to S104.
S101, acquiring target words from text contents to be translated.
The text content to be translated may be obtained by performing text recognition on a text file, such as a professional technical document or a literary work. Alternatively, the text content may be recognized from an audio file, such as the speech of a presenter. Alternatively, the text content may be obtained from a web page, such as a product introduction web page or a social web page. Alternatively, the text content may be content entered on a translation interface, which may be a web page translation interface or the interface of a translation application. The text content may include at least one word, and the target word may be any one of the at least one word. The language of the target word may be a language, such as English, whose letters distinguish uppercase from lowercase.
S102, preprocessing the target word to obtain a processed target word, wherein the processed target word meets the condition of segmentation into word pieces.
In order to improve the accuracy of translating the target word, the electronic device may preprocess the target word to obtain the processed target word. The preprocessing includes any of the following: format processing, correction processing, or both format processing and correction processing. Format processing normalizes the format type of the letters in the target word, which prevents the same word from being segmented inconsistently because it was written in different formats and thereby avoids the resulting loss of translation accuracy. Correction processing corrects erroneous letters in the target word, which improves the accuracy of the target word and, in turn, the accuracy of its translation. That is, the condition of segmentation into word pieces may specifically include any one or more of the following: each letter in the processed target word has the same format type, and no erroneous letter exists in the processed target word.
S103, segmenting the processed target word into at least one word piece.
In order to describe the target word at a finer granularity, the electronic device may segment the processed target word into at least one word piece according to a word segmentation dictionary or according to the frequency with which the target word has historically been used. A word piece may be composed of at least one letter of the processed target word. Describing the processed target word with word pieces of smaller granularity improves the descriptive power for words. The word segmentation dictionary is a dictionary used for segmenting words into word pieces.
S104, translating the target word according to the at least one word piece to obtain a translation result.
The electronic device may input the at least one word piece into a translation model for translation to obtain the translation result of the target word. The translation model may be obtained by training on the word pieces of a large number of sample words; because the translation model translates word pieces of small granularity, the accuracy of translation can be improved. Alternatively, the translation result of the target word may be looked up in a translation dictionary according to the at least one word piece; word pieces allow a more refined lookup, which also improves the accuracy of translating the target word. Optionally, the target word is translated according to the at least one word piece to obtain at least one candidate translation. If exactly one candidate translation is obtained, it is taken as the translation result of the target word. If a plurality of candidate translations are obtained, the translation result of the target word may be determined according to the translation results of the words adjacent to the target word in the text content; that is, the translation result of the target word is determined by combining the context with the candidate translations of the target word.
In the embodiments of the invention, the target word is preprocessed so that the processed target word meets the condition of segmentation into word pieces. Because the letters of the processed target word share the same format type, the same word is no longer segmented inconsistently merely because it was written with different format types, which would otherwise lower the translation accuracy of the target word. Because the processed target word contains no erroneous letters, it can also be segmented more accurately. In addition, segmenting the processed target word into at least one word piece describes it with units of smaller granularity, which improves the descriptive power for words. Translating the target word according to the at least one word piece therefore improves the translation accuracy of the target word.
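The flow of steps S101 to S104 can be sketched as a small pipeline. This is a minimal illustration only: the function names, the whitespace split used for S101, and the toy preprocessing, segmentation, and translation stand-ins are all assumptions made for the example and are not taken from the patent.

```python
from typing import Callable, List


def get_target_words(text_content: str) -> List[str]:
    """S101 (assumed): treat each whitespace-separated token as a target word."""
    return text_content.split()


def translate_word(
    target_word: str,
    preprocess: Callable[[str], str],
    segment: Callable[[str], List[str]],
    translate_pieces: Callable[[List[str]], str],
) -> str:
    """Run S102 to S104 for one target word using pluggable steps."""
    processed = preprocess(target_word)   # S102: preprocessing
    pieces = segment(processed)           # S103: segmentation into word pieces
    return translate_pieces(pieces)       # S104: translation from word pieces


# Hypothetical toy stand-ins, only to make the sketch executable.
def toy_preprocess(word: str) -> str:
    return word.lower()


def toy_segment(word: str) -> List[str]:
    return [word[:4], word[4:]] if len(word) > 4 else [word]


TOY_TABLE = {("pers", "istently"): "constantly"}


def toy_translate(pieces: List[str]) -> str:
    return TOY_TABLE.get(tuple(pieces), "<unknown>")


if __name__ == "__main__":
    for word in get_target_words("PERSISTENTLY"):
        print(translate_word(word, toy_preprocess, toy_segment, toy_translate))
```

Keeping the three steps as separate callables mirrors the way the description later refines each step independently.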
In one embodiment, the preprocessing includes format processing, and step S102 includes the following steps s11 and s12.
s11, acquiring a word segmentation dictionary and the format type of the words in the word segmentation dictionary, wherein the word segmentation dictionary comprises a plurality of words.
s12, preprocessing the target word according to the format type of the words in the word segmentation dictionary to obtain the processed target word.
In steps s11 and s12, the word segmentation dictionary records a plurality of words and the word pieces corresponding to each word; that is, the word pieces of a word can be looked up in the word segmentation dictionary. Because the size of the word segmentation dictionary is fixed, all words in it share the same format type so that more words can be recorded. For example, the words persistently and PERSISTENTLY are in substance the same word written with different format types. If both were recorded directly in the word segmentation dictionary, they would occupy the storage space of two words; if their formats are normalized to obtain either persistently or PERSISTENTLY, only that one form needs to be recorded, occupying the storage space of a single word. The formats of all words in the word segmentation dictionary are therefore normalized, that is, all words have the same format type, which may be either the uppercase letter type or the lowercase letter type. To obtain the word pieces of the processed target word from the word segmentation dictionary, the electronic device may acquire the word segmentation dictionary and the format type of its words: if the format type is the lowercase letter type, the target word is preprocessed according to the lowercase letter type to obtain the processed target word; if the format type is the uppercase letter type, the target word is preprocessed according to the uppercase letter type to obtain the processed target word.
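Steps s11 and s12 can be illustrated with a short sketch that first determines which format type the dictionary uses and then normalizes the target word to it. The function names and the string labels "lower" and "upper" are hypothetical; the patent does not prescribe how the format type is represented.

```python
from typing import Iterable


def dictionary_format_type(words: Iterable[str]) -> str:
    """s11 (assumed form): return 'lower' or 'upper' if every dictionary
    word uses that case; raise ValueError otherwise."""
    words = list(words)
    if words and all(w.islower() for w in words):
        return "lower"
    if words and all(w.isupper() for w in words):
        return "upper"
    raise ValueError("dictionary words do not share one format type")


def normalize(word: str, format_type: str) -> str:
    """s12 (assumed form): convert the target word to the dictionary's case."""
    return word.lower() if format_type == "lower" else word.upper()


if __name__ == "__main__":
    fmt = dictionary_format_type(["persistently", "big"])
    print(fmt, normalize("PERSISTENTLY", fmt))
```

With a normalized dictionary, only one entry is needed per word regardless of how the word is capitalized in the input text.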
In one embodiment, the target word includes at least one letter, and the format type of the words in the word segmentation dictionary is the lowercase letter type; step s12 may include the following steps s21 and s22.
s21, if the first letter in the target word is an uppercase letter, adding a first identifier to the target word and converting the uppercase letter in the target word into a lowercase letter to obtain the processed target word.
s22, if the letters in the target word are all uppercase letters, adding a second identifier to the target word and converting the uppercase letters in the target word into lowercase letters to obtain the processed target word.
In steps s21 and s22, if the first letter of the target word is an uppercase letter, the format type of the target word differs from the format type of the words in the word segmentation dictionary, so the word pieces of the target word cannot be obtained from the word segmentation dictionary. A first identifier may therefore be added to the target word and the uppercase letter converted into a lowercase letter to obtain the processed target word. Likewise, if all letters in the target word are uppercase letters, the format type of the target word differs from that of the words in the word segmentation dictionary, so a second identifier may be added to the target word and the uppercase letters converted into lowercase letters to obtain the processed target word. If every letter of the target word is already of the lowercase letter type, no format processing is required. Note that the same word may translate differently depending on its format type; for example, the word China is translated as the name of the country, whereas the word china is translated as porcelain. The purpose of adding the first identifier or the second identifier to the target word is therefore to indicate the original format type of the target word, which improves the accuracy of translating it.
The first identifier indicates that the first letter in the target word is an uppercase letter, where the first letter is the first letter of the target word counted from the left. The second identifier indicates that all letters in the target word are uppercase letters. The first identifier and the second identifier may each be formed of at least one of letters, numbers, and symbols, and the first identifier differs from the second identifier. The positions at which the first identifier and the second identifier are added to the target word may be the same or different.
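A minimal sketch of steps s21 and s22 follows. Several details are assumptions made for illustration: the identifier spellings `_u` and `_U`, placing the identifier at the front of the word, and the function name `format_process`. Because an all-uppercase word also has an uppercase first letter, the sketch tests the all-uppercase case first.

```python
FIRST_ID = "_u"    # assumed spelling: first letter was uppercase
SECOND_ID = "_U"   # assumed spelling: all letters were uppercase


def format_process(word: str) -> str:
    """Lowercase the word and prepend an identifier recording its original case."""
    if word.isupper():            # s22: every letter is uppercase
        return SECOND_ID + word.lower()
    if word[:1].isupper():        # s21: only the first letter is known to be uppercase
        return FIRST_ID + word.lower()
    return word                   # lowercase (or uncovered) input: left unchanged


if __name__ == "__main__":
    for w in ("Persistently", "PERSISTENTLY", "persistently"):
        print(w, "->", format_process(w))
```

The identifier preserves the case information that lowercasing would otherwise discard, which is what allows a downstream translator to distinguish pairs such as China and china.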
In one embodiment, the at least one word piece includes a first word piece and a second word piece, and step S103 includes the following steps s31 and s32.
s31, determining the target identifier in the processed target word as the first word piece, wherein the target identifier is the first identifier or the second identifier.
s32, segmenting the letters other than the target identifier in the processed target word to obtain the second word piece.
In steps s31 and s32, the processed target words obtained when the first letter of the target word is uppercase and when all letters of the target word are uppercase differ only in that the first identifier differs from the second identifier. Therefore, in order to describe the processed target word with fewer word pieces, the electronic device may determine the target identifier in the processed target word as the first word piece, where the target identifier is the first identifier or the second identifier, and may segment the letters other than the target identifier to obtain the second word piece. That is, in both cases the target word shares the same second word piece, so fewer word pieces can describe more words, which reduces the complexity of segmenting words. This also prevents the same word from being segmented inconsistently because of different format types.
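Steps s31 and s32 can be sketched as follows, assuming the identifier is a prefix spelled `_u` or `_U` and that the remaining letters are segmented by a lookup in a toy dictionary. The identifier spellings, the dictionary contents, and the function names are illustrative assumptions.

```python
from typing import Callable, List, Optional, Tuple

IDENTIFIERS = ("_u", "_U")  # assumed spellings of the first and second identifiers

# Hypothetical segmentation dictionary entry, for illustration only.
TOY_DICT = {"persistently": ["_pers", "ist", "ently"]}


def split_identifier(processed: str) -> Tuple[Optional[str], str]:
    """s31: separate a leading identifier (if any) from the remaining letters."""
    for ident in IDENTIFIERS:
        if processed.startswith(ident):
            return ident, processed[len(ident):]
    return None, processed


def segment_processed(processed: str,
                      segment_rest: Callable[[str], List[str]]) -> List[str]:
    """s31 + s32: the identifier is the first word piece; the rest is segmented."""
    ident, rest = split_identifier(processed)
    pieces = segment_rest(rest)
    return ([ident] if ident else []) + pieces


def toy_segment_rest(rest: str) -> List[str]:
    return list(TOY_DICT.get(rest, [rest]))


if __name__ == "__main__":
    for p in ("_upersistently", "_Upersistently", "persistently"):
        print(p, "->", segment_processed(p, toy_segment_rest))
```

All three inputs end with the same pieces after the optional identifier, which is the sharing effect the paragraph above describes.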
In one embodiment, the word segmentation dictionary further includes the word pieces corresponding to each word, and step s32 includes the following steps s41 and s42.
s41, if a word matching the processed target word exists in the word segmentation dictionary, determining the word pieces corresponding to the matching word as the second word piece.
s42, if no word matching the processed target word exists in the word segmentation dictionary, determining the second word piece according to the frequency with which the target word has historically been used.
In steps s41 and s42, if a word matching the processed target word exists in the word segmentation dictionary, the word pieces corresponding to the matching word are determined as the second word piece. Here, matching means that the word segmentation dictionary contains a word identical to a first sub-word, where the first sub-word is the word formed by the letters other than the target identifier in the processed target word. If no word matching the processed target word exists in the word segmentation dictionary, that is, the dictionary contains no word identical to the first sub-word, the electronic device may determine the second word piece according to the frequency with which the target word has historically been used.
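The lookup-then-fallback logic of steps s41 and s42 might look like the sketch below. The dictionary contents and the `fallback` callable are hypothetical; the fallback stands in for the frequency-based procedure that s42 refers to.

```python
from typing import Callable, Dict, List


def second_word_pieces(
    first_sub_word: str,
    segmentation_dict: Dict[str, List[str]],
    fallback: Callable[[str], List[str]],
) -> List[str]:
    """s41: use the dictionary's pieces when the first sub-word is recorded;
    s42: otherwise defer to a frequency-based fallback."""
    pieces = segmentation_dict.get(first_sub_word)
    if pieces is not None:
        return list(pieces)
    return fallback(first_sub_word)


if __name__ == "__main__":
    toy_dict = {"persistently": ["_pers", "ist", "ently"]}
    whole_word = lambda w: [w]  # placeholder fallback for the demo
    print(second_word_pieces("persistently", toy_dict, whole_word))
    print(second_word_pieces("zyzzyva", toy_dict, whole_word))
```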
In one embodiment, step s42 includes the following steps s51 to s53.
s51, obtaining the frequency with which the target word has historically been used.
s52, if the frequency with which the target word has historically been used is greater than a first preset frequency, determining the letters other than the target identifier in the processed target word as the second word piece.
s53, if the frequency with which the target word has historically been used is less than or equal to the first preset frequency, acquiring the frequency in the text content of each letter other than the target identifier in the processed target word, and segmenting the letters other than the target identifier in the processed target word according to those frequencies to obtain a plurality of second word pieces.
In steps s51 to s53, the electronic device may count, over a plurality of text contents, the frequency with which the target word has historically been used (i.e., how often the target word occurs in the plurality of text contents). If this frequency is greater than the first preset frequency, the target word is a common word, and the translation model can readily produce its translation. The letters other than the target identifier in the processed target word may therefore be determined as the second word piece; that is, the first sub-word as a whole serves as the second word piece. If the frequency is less than or equal to the first preset frequency, the target word is an uncommon word that the translation model has difficulty translating directly, so the processed target word needs to be segmented more finely: the frequency in the text content of each letter other than the target identifier is obtained, and those letters are segmented according to the frequencies to obtain a plurality of second word pieces.
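The branch structure of steps s51 to s53 can be sketched as below. The patent does not state how letter frequencies determine the cut points, so the rule used here (end a word piece after any letter whose count in the text content is below `letter_cutoff`) is purely an illustrative assumption, as are the function and parameter names.

```python
from collections import Counter
from typing import List


def segment_by_frequency(
    first_sub_word: str,
    word_frequency: int,
    first_preset_frequency: int,
    text_content: str,
    letter_cutoff: int,
) -> List[str]:
    """s52: a common word is kept whole.
    s53: an uncommon word is cut using per-letter counts in the text content
    (the cutting rule itself is an assumption of this sketch)."""
    if word_frequency > first_preset_frequency:
        return [first_sub_word]

    letter_counts = Counter(text_content.lower())
    pieces: List[str] = []
    current = ""
    for letter in first_sub_word:
        current += letter
        # Assumed rule: a rare letter closes the current word piece.
        if letter_counts[letter] < letter_cutoff:
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)
    return pieces


if __name__ == "__main__":
    print(segment_by_frequency("abcab", 10, 5, "aaa bbb c", 2))
    print(segment_by_frequency("abcab", 1, 5, "aaa bbb c", 2))
```

Any other frequency-driven splitting rule could be substituted in the second branch without changing the s52/s53 decision itself.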
In another embodiment, the preprocessing includes correction processing, and step S102 includes the following steps s61 to s63.
s61, obtaining the frequency with which the target phrase corresponding to the target word has historically been used, wherein the target phrase consists of the target word and the words adjacent to the target word in the text content.
s62, if the frequency with which the target phrase has historically been used is less than a second preset frequency, acquiring a word matching the target word.
s63, correcting the target word with the matching word to obtain the processed target word.
In steps s61 to s63, the electronic device may perform correction processing on the target word to guard against the target word having been entered incorrectly. Specifically, the electronic device may count, over a plurality of text contents, the frequency with which the target phrase corresponding to the target word has historically been used. If this frequency is less than the second preset frequency, the target word is relatively likely to contain an error, so a word matching the target word may be obtained. Matching here may mean that the similarity between the target word and the matching word is greater than a preset threshold, or that the distance between them is smaller than a preset distance threshold. The matching word may then be used to correct the target word, that is, to replace the target word, yielding the processed target word. Optionally, if exactly one word matches the target word, the target word is replaced with that word directly; if a plurality of words match the target word, the target word is replaced with the word most similar to it. For example, the target word is bgi, and the target phrase corresponding to the target word is bgi orders. If the frequency with which the target phrase has historically been used is 0, a word matching the target word is acquired; if the word matching the target word is big, big is used to replace bgi.
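Steps s61 to s63 can be illustrated with the sketch below. The patent does not name a similarity measure, so `difflib.SequenceMatcher` from the Python standard library is swapped in here as one possible choice; the single-neighbor phrase, the phrase-count table, the vocabulary, and the threshold value are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable


def correct_word(
    target_word: str,
    neighbor: str,
    phrase_counts: Dict[str, int],
    second_preset_frequency: int,
    vocabulary: Iterable[str],
    similarity_threshold: float = 0.6,
) -> str:
    """s61: look up how often the phrase 'target neighbor' was used.
    s62: if it is rare, collect vocabulary words similar to the target word.
    s63: replace the target word with the most similar match, if any."""
    phrase = f"{target_word} {neighbor}"
    if phrase_counts.get(phrase, 0) >= second_preset_frequency:
        return target_word  # the phrase is common enough; keep the word

    scored = [
        (SequenceMatcher(None, target_word, cand).ratio(), cand)
        for cand in vocabulary
        if cand != target_word
    ]
    matches = [(score, cand) for score, cand in scored if score > similarity_threshold]
    if not matches:
        return target_word  # nothing similar enough; leave the word as is
    return max(matches)[1]


if __name__ == "__main__":
    counts = {"big orders": 12}
    vocab = ["big", "order", "orders", "small"]
    print(correct_word("bgi", "orders", counts, 1, vocab))
```

An edit-distance measure with a "smaller than threshold" test would fit the same structure; only the scoring line would change.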
In one embodiment, step S104 includes the following steps s71 to s74.
s71, encoding the at least one word piece to obtain an encoded value of the at least one word piece.
s72, inputting the encoded value of the at least one word piece into a translation model for translation to obtain at least one candidate translation.
s73, acquiring the translation result of a word adjacent to the target word in the text content.
s74, determining the translation result of the target word according to the at least one candidate translation and the translation result of the adjacent word.
In steps s71 to s74, the electronic device may translate the target word with a translation model. The translation model may be a neural machine translation (NMT) model, which may be composed of at least one neural network model. For example, the NMT model may be composed of two neural network models: one encodes the word pieces of the processed target word, and the other translates the encoded values of the word pieces to obtain the translation result of the target word. Specifically, the electronic device may encode the at least one word piece to obtain its encoded value. The encoded value may include at least one of numbers, letters, symbols, and the like; for example, the encoded value may be an id corresponding to the at least one word piece, and the id may be a number. The encoded value of the at least one word piece may be input into the translation model to obtain at least one candidate translation, the translation result of the word adjacent to the target word in the text content may be acquired, and the translation result of the target word is determined according to the at least one candidate translation and the translation result of the adjacent word. That is, according to the degree of association between each candidate translation and the translation result of the adjacent word, the candidate translation with the greatest degree of association is taken as the translation result of the target word. The target word is thus translated according to the context of the text content, which improves the accuracy and fluency of the translation.
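Steps s71 to s74 can be sketched as below. The id table, the stand-in "model", and the association scores are hypothetical placeholders; the patent does not specify how the degree of association is computed, so a simple lookup table is assumed here.

```python
from typing import Callable, Dict, List, Tuple


def encode_pieces(pieces: List[str], piece_to_id: Dict[str, int]) -> List[int]:
    """s71: map each word piece to its numeric id."""
    return [piece_to_id[p] for p in pieces]


def choose_translation(
    candidates: List[str],
    neighbor_translation: str,
    association: Dict[Tuple[str, str], float],
) -> str:
    """s74: keep the candidate most associated with the neighbor's translation."""
    return max(candidates,
               key=lambda c: association.get((c, neighbor_translation), 0.0))


def translate_with_context(
    pieces: List[str],
    piece_to_id: Dict[str, int],
    model: Callable[[List[int]], List[str]],
    neighbor_translation: str,
    association: Dict[Tuple[str, str], float],
) -> str:
    ids = encode_pieces(pieces, piece_to_id)   # s71
    candidates = model(ids)                    # s72 (model is a stand-in)
    # s73 is represented by the neighbor_translation argument.
    return choose_translation(candidates, neighbor_translation, association)


# Hypothetical data so the sketch runs end to end.
PIECE_TO_ID = {"_u": 11, "_pers": 42, "ist": 7, "ently": 19}
ASSOCIATION = {("constantly", "works"): 0.9, ("once again", "works"): 0.2}


def toy_model(ids: List[int]) -> List[str]:
    return ["once again", "constantly"] if ids == [11, 42, 7, 19] else []


if __name__ == "__main__":
    print(translate_with_context(["_u", "_pers", "ist", "ently"],
                                 PIECE_TO_ID, toy_model, "works", ASSOCIATION))
```

In this sketch an empty candidate list would make `max` raise a `ValueError`; a real system would need to handle that case explicitly.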
The data processing method of the present embodiment will be explained below by taking a translation application as an example. A translation application is installed in the electronic device, and the translation application may be used to translate words, phrases, and sentences in any language, and the translation application will be described below by taking the translation of english words into chinese as an example. As shown in fig. 2, the data processing method includes the following steps 1-3.
1. The target word is acquired. When the user has a translation requirement, touch operation can be executed on the translation application program, the electronic equipment detects the touch operation acted on the translation application program, the translation application program is started, and the interface of the translation application program is displayed. A text content input box 21 may be included on the interface, the text content input box 21 allowing a user to perform editing operations of text content and for receiving text content generated by user editing. After the user inputs text content in the content input box 21, the electronic device may acquire a target word from the text content.
2. And carrying out format processing and segmentation processing on the target word. The electronic device may obtain a format type of a word in the word-segment dictionary, and if the format type of the word in the word-segment dictionary is a lower case letter type, may pre-process the target word according to the lower case letter type. Specifically, if the target word is PERSISTENTLY, the electronic device may convert the uppercase letters in the target word into lowercase letters, and add a first identifier to the target word, where the first identifier may be_u, and the processed target words are_u and PERSISTENTLY. The electronic device may use the first identifier as a first word segment, and segment letters except the first identifier in the processed target word to obtain a second word segment, that is, the word segment corresponding to_ upersistently is: u+ pers + ist + ently. Wherein, symbol "+" in _u+ _ pers +ist+ ently is used for distinguishing word sheets, and has no real meaning. Four word pieces, namely _u, _pers, ist and ently, are included in _u + _ pers + ist + ently. Similarly, if the target word is PERSISTENTLY, the electronic device may convert the uppercase letters in the target word into lowercase letters, and add a second identifier to the target word, where the second identifier may be_u, and the processed target words are_u and PERSISTENTLY. The electronic device may use the second identifier as the first word segment, and segment letters except the second identifier in the processed target word to obtain a second word segment, that is, the word segment corresponding to_ Upersistently is: u+ pers + ist + ently. Similarly, if the target word is PERSISTENTLY, and the format types of the letters of the target word are all lower case letter types, the target word may not be subjected to format processing, the target word may be directly segmented, and the word piece corresponding to the target word is_ pers +ist+ ently. 
The words Persistently, PERSISTENTLY and persistently are essentially the same word written in different format types. Comparing the word pieces of the three words shows that each set includes _pers, ist and ently; that is, the three words share the word pieces _pers, ist and ently. Preprocessing the target word therefore allows more words to be described with fewer word pieces, which improves the ability to describe words. As shown in fig. 2, after obtaining the word pieces, the electronic device may display the word pieces 22 corresponding to the target word on the interface of the translation application.
3. The target word is translated. The at least one word piece is encoded to obtain an encoded value; for example, each word piece is mapped to a number, so that the word pieces _u, _pers, ist and ently are mapped to 1256. The encoded value is input into a translation model to obtain at least one candidate translation, and the candidate translations 23 are output on the interface of the translation application; for example, the candidate translations of the target word Persistently are: once again, constantly, without buckling. The electronic device may obtain the translation result of a word adjacent to the target word in the text content, determine the translation result of the target word according to the translation result of the adjacent word and the candidate translations, and output the translation result 24; for example, the translation result is: constantly. The user may also select among the candidate translations, in which case the electronic device may take the candidate translation selected by the user as the translation result of the target word.
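The encoding in step 3 can be illustrated with a minimal Python sketch. It assumes that encoding is a plain vocabulary lookup; the vocabulary `VOCAB`, its id values, the fallback `unk_id` and the function name `encode_pieces` are all hypothetical and are not taken from the patent.

```python
# Hypothetical vocabulary: the ids are arbitrary and chosen only for illustration.
VOCAB = {"_u": 1, "_U": 2, "_pers": 3, "ist": 4, "ently": 5}


def encode_pieces(pieces, vocab, unk_id=0):
    """Map each word piece to an integer id (assumed scheme: dictionary lookup).

    Pieces missing from the vocabulary fall back to unk_id, which is an
    assumption of this sketch rather than something the description states.
    """
    return [vocab.get(piece, unk_id) for piece in pieces]


# Encode the four word pieces of the processed target word _upersistently.
ids = encode_pieces(["_u", "_pers", "ist", "ently"], VOCAB)
```

The resulting list of ids is what would be passed to the translation model as the encoded value.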
An embodiment of the present invention provides a data processing apparatus, which may be disposed in an electronic device. Referring to fig. 3, the apparatus includes:
An obtaining unit 301, configured to obtain a target word from text content to be translated.
A processing unit 302, configured to preprocess the target word to obtain a processed target word, where the processed target word meets the condition for segmentation into word pieces.
A segmentation unit 303, configured to segment the processed target word into at least one word piece.
A translation unit 304, configured to translate the target word according to the at least one word piece to obtain a translation result.
Optionally, the processing unit 302 is specifically configured to obtain a word segmentation dictionary and the format type of the words in the word segmentation dictionary, where the word segmentation dictionary includes a plurality of words; and to preprocess the target word according to the format type of the words in the word segmentation dictionary to obtain the processed target word.
Optionally, the target word includes at least one letter, and the format type of the words in the word segmentation dictionary is the lowercase letter type; the processing unit 302 is specifically configured to, if the first letter in the target word is a capital letter, add a first identifier to the target word and convert the capital letter in the target word into a lowercase letter to obtain the processed target word; and, if the letters in the target word are all capital letters, add a second identifier to the target word and convert the capital letters in the target word into lowercase letters to obtain the processed target word.
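The format processing performed by the processing unit can be sketched in Python as follows. This is an illustrative sketch, not the patented implementation: the identifier strings `_u` and `_U` follow the example in the description, while the function name `preprocess_format`, the handling of a single capital letter, and the lowercasing of mixed-case words are assumptions.

```python
FIRST_ID = "_u"   # assumed marker: only the first letter was a capital
SECOND_ID = "_U"  # assumed marker: every letter was a capital


def preprocess_format(word):
    """Return (identifier, lowercased word) for a target word.

    The all-capitals case is tested first, because an all-capitals word
    also starts with a capital letter. A one-letter capital word is
    treated as "first letter capital" (an assumption of this sketch).
    """
    if len(word) > 1 and word.isupper():
        return SECOND_ID, word.lower()
    if word[:1].isupper():
        return FIRST_ID, word.lower()
    # Already lowercase (or mixed case, which the description does not
    # cover): no identifier is added; lowercasing here is an assumption.
    return "", word.lower()
```

Under these assumptions, Persistently, PERSISTENTLY and persistently all reduce to the same lowercase body and differ only in the identifier, which is what lets them share word pieces.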
Optionally, the at least one word piece includes a first word piece and a second word piece, and the segmentation unit 303 is specifically configured to determine a target identifier in the processed target word as the first word piece, where the target identifier is the first identifier or the second identifier; and to segment the letters other than the target identifier in the processed target word to obtain the second word piece.
Optionally, the word segmentation dictionary further includes the word pieces corresponding to each word, and the segmentation unit 303 is specifically configured to, if a word matching the processed target word exists in the word segmentation dictionary, determine the word pieces corresponding to the matched word as the second word piece; and, if no word matching the processed target word exists in the word segmentation dictionary, determine the second word piece according to the historical usage frequency of the target word.
Optionally, the segmentation unit 303 is specifically configured to obtain the historical usage frequency of the target word;
if the historical usage frequency of the target word is greater than a first preset frequency, determine the letters other than the target identifier in the processed target word as the second word piece; if the historical usage frequency of the target word is less than or equal to the first preset frequency, acquire the frequency with which each letter other than the target identifier in the processed target word appears in the text content; and segment the letters other than the target identifier in the processed target word according to that frequency to obtain a plurality of second word pieces.
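The segmentation logic above can be sketched in Python. The description does not say how letter frequencies determine the cut points, so the rule used here (a letter that appears fewer than `letter_min` times in the text starts a new word piece) is purely an assumption made for illustration. The names `segment`, `PIECE_DICT`, `first_preset` and `letter_min` and their default values are likewise hypothetical.

```python
from collections import Counter

# Hypothetical word segmentation dictionary: word -> its word pieces.
PIECE_DICT = {"persistently": ["_pers", "ist", "ently"]}


def segment(identifier, body, piece_dict, word_freq, text,
            first_preset=5, letter_min=2):
    """Split a processed target word into word pieces.

    identifier: the target identifier ("" if none); becomes the first piece.
    body: the letters of the processed target word without the identifier.
    word_freq: historical usage counts per word (assumed data source).
    """
    pieces = [identifier] if identifier else []
    # Case 1: the dictionary already knows how to split this word.
    if body in piece_dict:
        return pieces + list(piece_dict[body])
    # Case 2: a historically frequent word is kept as a single piece.
    if word_freq.get(body, 0) > first_preset:
        return pieces + [body]
    # Case 3: cut by letter frequency in the text (assumed rule: a rare
    # letter opens a new piece).
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    current = ""
    for ch in body:
        if current and counts[ch] < letter_min:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces
```

For example, `segment("_u", "persistently", PIECE_DICT, {}, "")` takes the dictionary branch, whereas a word absent from both the dictionary and the usage history falls through to the letter-frequency branch.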
Optionally, the processing unit 302 is specifically configured to obtain the historical usage frequency of a target phrase corresponding to the target word, where the target phrase consists of the target word in the text content and a word adjacent to the target word; if the historical usage frequency of the target phrase is less than a second preset frequency, acquire a word matching the target word; and correct the target word using the matched word to obtain the processed target word.
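The correction process can be sketched as follows. The description does not specify how the matching word is found; this sketch substitutes fuzzy string matching from Python's standard `difflib` module as a stand-in. The function name `correct_word`, the use of the preceding word as the adjacent word, the `phrase_freq` table, and the threshold and cutoff values are all assumptions.

```python
import difflib


def correct_word(word, prev_word, phrase_freq, vocabulary,
                 second_preset=3, cutoff=0.8):
    """Correct a likely misspelled target word.

    The target phrase is formed from the target word and the word before
    it (one possible choice of adjacent word). If the phrase has been
    used often enough historically, the word is kept unchanged.
    """
    phrase = f"{prev_word} {word}"
    if phrase_freq.get(phrase, 0) >= second_preset:
        return word
    # Assumed matching strategy: closest vocabulary word by similarity ratio.
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


VOCABULARY = ["persistently", "consistently"]  # hypothetical word list
fixed = correct_word("persistantly", "working", {}, VOCABULARY)
```

A rarely seen phrase such as "working persistantly" triggers the lookup, while a phrase that is common in the usage history leaves the word untouched.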
Optionally, the translation unit 304 is specifically configured to encode the at least one word piece to obtain an encoded value of the at least one word piece; input the encoded value of the at least one word piece into a translation model for translation to obtain at least one candidate translation; acquire the translation result of a word adjacent to the target word in the text content; and determine the translation result of the target word according to the at least one candidate translation and the translation result of the adjacent word.
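How the final translation is chosen from the candidates is not detailed in the description. One possible reading, shown below as a Python sketch, scores each candidate by how often it co-occurs with the translations of the adjacent words and keeps the best one. The co-occurrence table, the function name `pick_translation` and the example strings are hypothetical.

```python
def pick_translation(candidates, neighbor_translations, cooccurrence):
    """Pick the candidate that best fits the neighbors' translations.

    cooccurrence maps (neighbor translation, candidate) pairs to a count;
    this scoring scheme is an assumption of the sketch. On a tie, the
    earliest candidate in the list wins.
    """
    def score(candidate):
        return sum(cooccurrence.get((neighbor, candidate), 0)
                   for neighbor in neighbor_translations)

    return max(candidates, key=score)


# Hypothetical data: two candidates and one translated neighbor.
COOCCURRENCE = {("works", "constantly"): 7, ("works", "stubbornly"): 2}
best = pick_translation(["stubbornly", "constantly"], ["works"], COOCCURRENCE)
```

If the user selects a candidate directly on the interface, that selection would simply replace the result of this scoring step.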
In the embodiment of the invention, the target word is preprocessed so that the processed target word meets the condition for segmentation into word pieces. Because the letters of the processed target word share one format type, the same word is no longer segmented inconsistently merely because it appears in different format types, which avoids the resulting loss of translation accuracy. In addition, the processed target word contains no wrong letters, which improves the accuracy of its segmentation. Segmenting the processed target word into at least one word piece, that is, describing it with finer-grained word pieces, improves the ability to describe words. Finally, translating the target word according to the at least one word piece improves the translation accuracy of the target word.
An embodiment of the invention provides an electronic device; please refer to fig. 4. The electronic device includes a processor 151, a user interface 152, a network interface 154 and a storage device 155, which are connected via a bus 153.
The user interface 152 enables human-machine interaction and may include a display screen, a keyboard, and the like. The network interface 154 provides a communication connection with external devices. The storage device 155 is coupled to the processor 151 and stores various software programs and/or sets of instructions. In particular implementations, the storage device 155 may include high-speed random access memory and may also include non-volatile memory, such as one or more disk storage devices, flash memory devices, or other non-volatile solid-state storage devices. The storage device 155 may store an operating system (hereinafter referred to as "system"), such as ANDROID, IOS, WINDOWS, or an embedded operating system such as LINUX. The storage device 155 may also store a network communication program that may be used to communicate with one or more additional devices, one or more application servers, and one or more network devices. The storage device 155 may also store a user interface program that presents the content of an application through a graphical operation interface and receives the user's control operations on the application through input controls such as menus, dialog boxes and buttons. The storage device 155 may also store video data and the like.
In one embodiment, the storage device 155 may be used to store one or more instructions, and the processor 151 implements a data processing method when invoking the one or more instructions. Specifically, the processor 151 invokes the one or more instructions to perform the following steps:
acquiring a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, wherein the processed target word meets the condition for segmentation into word pieces;
segmenting the processed target word into at least one word piece;
and translating the target word according to the at least one word piece to obtain a translation result.
Optionally, the processor 151 invokes the one or more instructions to perform the following steps:
acquiring a word segmentation dictionary and the format type of the words in the word segmentation dictionary, wherein the word segmentation dictionary comprises a plurality of words;
and preprocessing the target word according to the format type of the words in the word segmentation dictionary to obtain the processed target word.
Optionally, the processor 151 invokes the one or more instructions to perform the following steps:
if the first letter in the target word is a capital letter, adding a first identifier to the target word and converting the capital letter in the target word into a lowercase letter to obtain the processed target word;
and if the letters in the target word are all capital letters, adding a second identifier to the target word and converting the capital letters in the target word into lowercase letters to obtain the processed target word.
Optionally, the processor 151 invokes the one or more instructions to perform the following steps:
determining a target identifier in the processed target word as the first word piece, wherein the target identifier is the first identifier or the second identifier;
and segmenting the letters other than the target identifier in the processed target word to obtain the second word piece.
Optionally, the processor 151 invokes the one or more instructions to perform the following steps:
if a word matching the processed target word exists in the word segmentation dictionary, determining the word pieces corresponding to the matched word as the second word piece;
and if no word matching the processed target word exists in the word segmentation dictionary, determining the second word piece according to the historical usage frequency of the target word.
Optionally, the processor 151 invokes the one or more instructions to perform the following steps:
acquiring the historical usage frequency of the target word;
if the historical usage frequency of the target word is greater than a first preset frequency, determining the letters other than the target identifier in the processed target word as the second word piece;
if the historical usage frequency of the target word is less than or equal to the first preset frequency, acquiring the frequency with which each letter other than the target identifier in the processed target word appears in the text content; and segmenting the letters other than the target identifier in the processed target word according to that frequency to obtain a plurality of second word pieces.
Optionally, the processor 151 invokes the one or more instructions to perform the following steps:
acquiring the historical usage frequency of a target phrase corresponding to the target word, wherein the target phrase consists of the target word in the text content and a word adjacent to the target word;
if the historical usage frequency of the target phrase is less than a second preset frequency, acquiring a word matching the target word;
and correcting the target word using the matched word to obtain the processed target word.
Optionally, the processor 151 invokes the one or more instructions to perform the following steps:
encoding the at least one word piece to obtain an encoded value of the at least one word piece;
inputting the encoded value of the at least one word piece into a translation model for translation to obtain at least one candidate translation;
acquiring the translation result of a word adjacent to the target word in the text content;
and determining the translation result of the target word according to the at least one candidate translation and the translation result of the adjacent word.
In the embodiment of the invention, the target word is preprocessed so that the processed target word meets the condition for segmentation into word pieces. Because the letters of the processed target word share one format type, the same word is no longer segmented inconsistently merely because it appears in different format types, which avoids the resulting loss of translation accuracy. In addition, the processed target word contains no wrong letters, which improves the accuracy of its segmentation. Segmenting the processed target word into at least one word piece, that is, describing it with finer-grained word pieces, improves the ability to describe words. Finally, translating the target word according to the at least one word piece improves the translation accuracy of the target word.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. For the implementation and beneficial effects of the program, refer to the implementation and beneficial effects of the data processing method described with reference to fig. 1; the details are not repeated here.
The above disclosure is illustrative only of some embodiments of the invention and is not intended to limit the scope of the invention, which is defined by the claims and their equivalents.

Claims (7)

1. A method of data processing, the method comprising:
acquiring target words from text contents to be translated;
preprocessing the target word to obtain a processed target word, wherein the processed target word meets the condition for segmentation into word pieces; the preprocessing comprises format processing, and preprocessing the target word to obtain the processed target word comprises: acquiring a word segmentation dictionary and the format type of the words in the word segmentation dictionary, wherein the word segmentation dictionary comprises a plurality of words; the target word comprises at least one letter, and the format type of the words in the word segmentation dictionary is the lowercase letter type; the word segmentation dictionary further comprises the word pieces corresponding to each word; if the first letter in the target word is a capital letter, adding a first identifier to the target word and converting the capital letter in the target word into a lowercase letter to obtain the processed target word; if the letters in the target word are all capital letters, adding a second identifier to the target word and converting the capital letters in the target word into lowercase letters to obtain the processed target word;
determining a target identifier in the processed target word as a first word piece, wherein the target identifier is the first identifier or the second identifier;
if no word matching the processed target word exists in the word segmentation dictionary, acquiring the historical usage frequency of the target word; if the historical usage frequency of the target word is greater than a first preset frequency, determining the letters other than the target identifier in the processed target word as a second word piece; if the historical usage frequency of the target word is less than or equal to the first preset frequency, acquiring the frequency with which each letter other than the target identifier in the processed target word appears in the text content; segmenting the letters other than the target identifier in the processed target word according to the frequency to obtain a plurality of second word pieces;
translating the target word according to at least one word piece to obtain a translation result; the at least one word piece includes the first word piece and the second word piece.
2. The method of claim 1, wherein the method further comprises:
if a word matching the processed target word exists in the word segmentation dictionary, determining the word pieces corresponding to the matched word as the second word piece.
3. The method of any of claims 1-2, wherein the preprocessing further comprises correction processing, and wherein preprocessing the target word to obtain the processed target word further comprises:
acquiring the historical usage frequency of a target phrase corresponding to the target word, wherein the target phrase consists of the target word in the text content and a word adjacent to the target word;
if the historical usage frequency of the target phrase is less than a second preset frequency, acquiring a word matching the target word;
and correcting the target word using the matched word to obtain the processed target word.
4. The method according to any one of claims 1-2, wherein translating the target word according to at least one word piece to obtain a translation result comprises:
encoding the at least one word piece to obtain an encoded value of the at least one word piece;
inputting the encoded value of the at least one word piece into a translation model for translation to obtain at least one candidate translation;
acquiring the translation result of a word adjacent to the target word in the text content;
and determining the translation result of the target word according to the at least one candidate translation and the translation result of the adjacent word.
5. A data processing apparatus, the apparatus comprising:
an acquisition unit for acquiring a target word from text content to be translated;
a processing unit for preprocessing the target word to obtain a processed target word, wherein the processed target word meets the condition for segmentation into word pieces; the preprocessing comprises format processing, and preprocessing the target word to obtain the processed target word comprises: acquiring a word segmentation dictionary and the format type of the words in the word segmentation dictionary, wherein the word segmentation dictionary comprises a plurality of words; the target word comprises at least one letter, and the format type of the words in the word segmentation dictionary is the lowercase letter type; the word segmentation dictionary further comprises the word pieces corresponding to each word; if the first letter in the target word is a capital letter, adding a first identifier to the target word and converting the capital letter in the target word into a lowercase letter to obtain the processed target word; if the letters in the target word are all capital letters, adding a second identifier to the target word and converting the capital letters in the target word into lowercase letters to obtain the processed target word;
a segmentation unit for determining a target identifier in the processed target word as a first word piece, wherein the target identifier is the first identifier or the second identifier; if no word matching the processed target word exists in the word segmentation dictionary, acquiring the historical usage frequency of the target word; if the historical usage frequency of the target word is greater than a first preset frequency, determining the letters other than the target identifier in the processed target word as a second word piece; if the historical usage frequency of the target word is less than or equal to the first preset frequency, acquiring the frequency with which each letter other than the target identifier in the processed target word appears in the text content; segmenting the letters other than the target identifier in the processed target word according to the frequency to obtain a plurality of second word pieces;
a translation unit for translating the target word according to at least one word piece to obtain a translation result; the at least one word piece includes the first word piece and the second word piece.
6. An electronic device comprising an input device and an output device, further comprising:
A processor adapted to implement one or more instructions; and
A computer storage medium storing one or more instructions adapted to be loaded by the processor and to perform the method of any one of claims 1-4.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 4.
CN201910702513.XA 2019-07-31 2019-07-31 Data processing method and device and electronic equipment Active CN110414013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910702513.XA CN110414013B (en) 2019-07-31 2019-07-31 Data processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110414013A CN110414013A (en) 2019-11-05
CN110414013B CN110414013B (en) 2024-06-21

Family

ID=68364860

Country Status (1)

Country Link
CN (1) CN110414013B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1912865A (en) * 2005-08-10 2007-02-14 英业达股份有限公司 Hermeneutical system and method
CN106156007A (en) * 2015-03-24 2016-11-23 吕海港 A kind of English-Chinese statistical machine translation method of word original shape

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5342760B2 (en) * 2007-09-03 2013-11-13 株式会社東芝 Apparatus, method, and program for creating data for translation learning
CN107015971A (en) * 2017-03-30 2017-08-04 唐亮 The post-processing module of multilingual intelligence pretreatment real-time statistics machine translation system
CN107038160A (en) * 2017-03-30 2017-08-11 唐亮 The pretreatment module of multilingual intelligence pretreatment real-time statistics machine translation system
CN108563644A (en) * 2018-03-29 2018-09-21 河南工学院 A kind of English Translation electronic system
CN108763222B (en) * 2018-05-17 2020-08-04 腾讯科技(深圳)有限公司 Translation missing detection and translation method and device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant