CN110414013A - Data processing method, device and electronic equipment - Google Patents

Info

Publication number
CN110414013A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910702513.XA
Other languages
Chinese (zh)
Other versions
CN110414013B (en)
Inventor
王明三
张健昶
曾钦松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910702513.XA
Publication of CN110414013A
Application granted
Publication of CN110414013B
Legal status: Active

Landscapes

  • Machine Translation (AREA)

Abstract

Embodiments of the invention disclose a data processing method, a device and a terminal. The method may include: obtaining a target word from text content to be translated; preprocessing the target word to obtain a processed target word that satisfies the condition for being segmented into word pieces; segmenting the processed target word into at least one word piece; and translating the target word according to the at least one word piece to obtain a translation result. The embodiments of the invention can improve translation accuracy.

Description

Data processing method, device and electronic equipment
Technical field
The present invention relates to the field of computer technology, and in particular to a data processing method, a data processing device and an electronic device.
Background
The widespread use of Internet technology has extended the reach of economic globalization and promoted exchange and cooperation between countries. Practitioners in many industries (such as foreign trade workers and technical researchers) need to communicate with people who speak other languages and to read large numbers of documents written in languages they are unfamiliar with, which creates obstacles to communication between people. Language translation has therefore become particularly important. In practice, however, existing translation approaches have relatively low accuracy and rarely achieve the result users expect.
Summary of the invention
The technical problem addressed by the embodiments of the invention is to provide a data processing method, device, storage medium and electronic device that can improve translation accuracy.
In one aspect, an embodiment of the invention provides a data processing method, comprising:
obtaining a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, where the processed target word satisfies the condition for being segmented into word pieces;
segmenting the processed target word into at least one word piece; and
translating the target word according to the at least one word piece to obtain a translation result.
In another aspect, an embodiment of the invention provides a data processing device, comprising:
an acquiring unit, configured to obtain a target word from text content to be translated;
a processing unit, configured to preprocess the target word to obtain a processed target word, where the processed target word satisfies the condition for being segmented into word pieces;
a segmentation unit, configured to segment the processed target word into at least one word piece; and
a translation unit, configured to translate the target word according to the at least one word piece to obtain a translation result.
In another aspect, an embodiment of the invention provides an electronic device, including an input device and an output device, and further including:
a processor, adapted to execute one or more instructions; and
a computer storage medium storing one or more instructions, the one or more instructions being adapted to be loaded by the processor to execute the following steps:
obtaining a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, where the processed target word satisfies the condition for being segmented into word pieces;
segmenting the processed target word into at least one word piece; and
translating the target word according to the at least one word piece to obtain a translation result.
In another aspect, an embodiment of the invention provides a computer storage medium storing one or more instructions, the one or more instructions being adapted to be loaded by a processor to execute the following steps:
obtaining a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, where the processed target word satisfies the condition for being segmented into word pieces;
segmenting the processed target word into at least one word piece; and
translating the target word according to the at least one word piece to obtain a translation result.
In the embodiments of the invention, the target word is preprocessed so that the processed target word satisfies the condition for being segmented into word pieces. First, the letters of the processed target word share the same format type, which prevents the same word from being segmented inconsistently merely because its format type differs, and thus from being translated less accurately. Second, the processed target word contains no wrong letters, which improves the accuracy of segmenting it. In addition, the processed target word is segmented into at least one word piece, that is, it is described with smaller-grained word pieces, which improves the ability to describe words. Finally, translating the target word by means of the at least one word piece improves the accuracy of the translation result.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of a data processing method provided by an embodiment of the invention;
Fig. 2 is an interface diagram of a data processing procedure provided by an embodiment of the invention;
Fig. 3 is a structural diagram of a data processing device provided by an embodiment of the invention;
Fig. 4 is a structural diagram of an electronic device provided by an embodiment of the invention.
Detailed description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the invention.
Fig. 1 shows a data processing method provided by an embodiment of the invention. The method may be executed by an electronic device, including but not limited to a smartphone, tablet computer, portable personal computer, smart watch, wristband or smart television. As shown in Fig. 1, the data processing method includes the following steps S101 to S104.
S101: Obtain a target word from text content to be translated.
The text content to be translated is the content that needs translating. It may be obtained by performing text recognition on a text file, such as a professional technical document or a literary work. Alternatively, it may be obtained by recognizing an audio file, for example a recording of a speaker's lecture. Alternatively, it may be collected from a web page, such as a product introduction page or a social network page. Alternatively, it may be content entered on a translation interface, such as a web translation page or the interface of a translation application. The text content includes at least one word, and the target word may be any of those words. The language of the target word may be one that distinguishes letter case, such as English.
S102: Preprocess the target word to obtain a processed target word that satisfies the condition for being segmented into word pieces.
To improve translation accuracy, the electronic device may preprocess the target word to obtain the processed target word. The preprocessing includes any of the following: format processing, correction processing, or both. Format processing normalizes the format type (letter case) of the letters in the target word, which prevents the same word from being segmented inconsistently merely because its format differs, and thus from being translated less accurately. Correction processing corrects wrong letters in the target word, which improves the correctness of the target word and, in turn, the accuracy of its translation. Accordingly, the condition for being segmented into word pieces may include one or both of the following: every letter in the processed target word has the same format type; the processed target word contains no wrong letters.
S103: Segment the processed target word into at least one word piece.
To describe the target word at a finer granularity, the electronic device may segment the processed target word into at least one word piece, either according to a word fragment dictionary or according to how frequently the target word is used. Each word piece consists of at least one letter of the processed target word. Describing the processed target word with smaller-grained word pieces improves the ability to describe words. The word fragment dictionary is a dictionary used to split words into word pieces.
S104: Translate the target word according to the at least one word piece to obtain a translation result.
The electronic device may input the at least one word piece into a translation model to obtain the translation result of the target word. The translation model may be obtained by optimization training on the word pieces of a large number of sample words; translating small-grained word pieces with such a model can improve translation accuracy. Alternatively, the device may search a translation dictionary for the translation result according to the at least one word piece; searching by word piece makes the lookup more fine-grained and thereby improves the accuracy of translating the target word. Optionally, translating the target word according to the at least one word piece yields at least one candidate translation. If there is exactly one candidate translation, it is used as the translation result of the target word. If there are several, the translation result may be determined from the translation results of the words adjacent to the target word in the text content, that is, by combining the context with the candidate translations.
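Steps S101 to S104 can be sketched end to end as follows. This is a minimal illustration only: `PIECE_DICT`, `TRANSLATIONS` and all function names are hypothetical, the preprocessing is reduced to lowercasing, and a small lookup table stands in for the translation model or translation dictionary described above.

```python
import re

# Hypothetical toy resources (assumptions, not taken from the source).
PIECE_DICT = {"persistently": ["_pers", "ist", "ently"]}
TRANSLATIONS = {("_pers", "ist", "ently"): ["持之以恒地"]}


def get_target_words(text):
    """S101: extract the words of the text content to be translated."""
    return re.findall(r"[A-Za-z]+", text)


def preprocess(word):
    """S102: normalize the format type (reduced here to lowercasing)."""
    return word.lower()


def segment(word):
    """S103: look up word pieces; fall back to the whole word as one piece."""
    return PIECE_DICT.get(word, ["_" + word])


def translate(pieces):
    """S104: return candidate translations for the piece sequence."""
    return TRANSLATIONS.get(tuple(pieces), [])


def process(text):
    """Run S101 to S104 for every word in the text."""
    return {w: translate(segment(preprocess(w))) for w in get_target_words(text)}
```

Calling `process("Persistently")` maps the word to the single candidate stored in the toy table; words missing from the table yield an empty candidate list.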
In the embodiments of the invention, the target word is preprocessed so that the processed target word satisfies the condition for being segmented into word pieces. First, the letters of the processed target word share the same format type, which prevents the same word from being segmented inconsistently merely because its format type differs, and thus from being translated less accurately. Second, the processed target word contains no wrong letters, which improves the accuracy of segmenting it. In addition, the processed target word is segmented into at least one word piece, that is, it is described with smaller-grained word pieces, which improves the ability to describe words. Finally, translating the target word by means of the at least one word piece improves the accuracy of the translation result.
In one embodiment, the preprocessing includes format processing, and step S102 includes the following steps s11 and s12.
s11: Obtain the word fragment dictionary and the format type of the words in it; the word fragment dictionary includes multiple words.
s12: Preprocess the target word according to the format type of the words in the word fragment dictionary to obtain the processed target word.
In steps s11 and s12, the word fragment dictionary records multiple words and the word pieces corresponding to each word, so the word pieces of a word can be looked up in it. Because the size of the dictionary is fixed, all words in it share the same format type so that it can record more words. For example, persistently and Persistently are essentially the same word written in different format types. If both were recorded directly in the word fragment dictionary, they would occupy the storage space of two words. If their format is normalized instead, the result is persistently (or PERSISTENTLY), and only that one form needs to be recorded, occupying the storage space of a single word. The format of all words in the word fragment dictionary is therefore normalized, so that they all have the same format type, which may be the uppercase type or the lowercase type. To obtain the word pieces of the processed target word from the word fragment dictionary, the electronic device obtains the dictionary and the format type of its words. If that format type is lowercase, the target word is preprocessed according to the lowercase type to obtain the processed target word; if it is uppercase, the target word is preprocessed according to the uppercase type.
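Steps s11 and s12 can be sketched as below. This is a minimal illustration under assumptions: the function names and the `"lower"`/`"upper"` labels are hypothetical, and the dictionary is represented simply as a collection of words.

```python
def dictionary_format_type(dictionary_words):
    """s11: detect the single format type shared by all dictionary words."""
    words = list(dictionary_words)
    if all(w.islower() for w in words):
        return "lower"
    if all(w.isupper() for w in words):
        return "upper"
    raise ValueError("dictionary words do not share one format type")


def normalize(word, format_type):
    """s12: convert the target word to the dictionary's format type."""
    return word.lower() if format_type == "lower" else word.upper()
```

With a lowercase dictionary, both `Persistently` and `PERSISTENTLY` normalize to `persistently`, so a single dictionary entry serves both spellings.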
In one embodiment, the target word includes at least one letter, and the format type of the words in the word fragment dictionary is the lowercase type. Step s12 may include the following steps s21 and s22.
s21: If the first letter of the target word is uppercase, add a first identifier to the target word and convert the uppercase letter in the target word to lowercase to obtain the processed target word.
s22: If all letters of the target word are uppercase, add a second identifier to the target word and convert the uppercase letters in the target word to lowercase to obtain the processed target word.
In steps s21 and s22, if the first letter of the target word is uppercase, the format type of the target word differs from that of the words in the word fragment dictionary, so the word pieces of the target word cannot be obtained from the dictionary. A first identifier is therefore added to the target word, and the uppercase letter is converted to lowercase to obtain the processed target word. Likewise, if all letters of the target word are uppercase, its format type differs from that of the dictionary words, so a second identifier is added and the uppercase letters are converted to lowercase. If every letter of the target word is already lowercase, no format processing is needed. Note that the same word may have different translation results depending on its format type: for example, China translates to the name of the country, whereas china translates to porcelain. The purpose of adding the first or second identifier is therefore to record the original format type of the target word, which improves the accuracy of translating it.
The first identifier indicates that the first letter of the target word is uppercase, the first letter being the first letter counted from the left of the target word. The second identifier indicates that all letters of the target word are uppercase. Each identifier may consist of at least one of letters, digits and symbols, and the two identifiers differ from each other. The positions at which the first and second identifiers are added to the target word may be the same or different.
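A minimal sketch of steps s21 and s22 follows. The identifier values `_u` and `_U` are the ones used in the worked example later in this description; placing the identifier at the front of the word, and treating a single uppercase letter as the first-letter case, are assumptions of this sketch. Mixed-case words such as `iPhone` are not covered by the description and are returned unchanged here.

```python
FIRST_ID = "_u"   # first letter is uppercase
SECOND_ID = "_U"  # all letters are uppercase


def add_case_identifier(word):
    """Lowercase the word and prefix an identifier that records its case."""
    if len(word) > 1 and word.isupper():
        # s22: every letter is uppercase.
        return SECOND_ID + word.lower()
    if word[:1].isupper():
        # s21: the first letter is uppercase.
        return FIRST_ID + word.lower()
    # Already lowercase (or a form not covered here): no format processing.
    return word
```

Because the identifier preserves the original case, `China` and `china` remain distinguishable after normalization (`_uchina` versus `china`).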
In one embodiment, the at least one word piece includes a first word piece and a second word piece, and step S103 includes the following steps s31 and s32.
s31: Determine the target identifier in the processed target word as the first word piece, the target identifier being the first identifier or the second identifier.
s32: Segment the letters of the processed target word other than the target identifier to obtain the second word piece.
In steps s31 and s32, whether the first letter of the target word is uppercase or all of its letters are uppercase, the two processed target words differ only in their identifier (first versus second). Therefore, to describe the processed target word with fewer word pieces, the electronic device determines the target identifier in the processed target word as the first word piece, and segments the remaining letters to obtain the second word piece. In both cases the target word shares the same second word piece, so more words can be described with fewer word pieces, which reduces the complexity of segmenting words. It also prevents the same word from being segmented inconsistently because of differing format types.
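Steps s31 and s32 can be sketched as follows, assuming the identifier is a prefix (`_u` or `_U`, as in the worked example later in this description). The function names are hypothetical, and `segment_letters` stands for whatever segmentation is applied to the remaining letters.

```python
def split_identifier(processed_word, identifiers=("_u", "_U")):
    """s31: separate the target identifier (if any) from the remaining letters."""
    for ident in identifiers:
        if processed_word.startswith(ident):
            return ident, processed_word[len(ident):]
    return None, processed_word


def to_word_pieces(processed_word, segment_letters):
    """s31 + s32: identifier as first word piece, then the segmented letters."""
    ident, letters = split_identifier(processed_word)
    pieces = segment_letters(letters)
    return ([ident] if ident else []) + pieces
```

Since the identifier is split off first, `_upersistently` and `_Upersistently` are both segmented from the same letters and therefore share their second word pieces.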
In one embodiment, the word fragment dictionary further includes the word pieces corresponding to each word, and step s32 includes the following steps s41 and s42.
s41: If the word fragment dictionary contains a word that matches the processed target word, determine the word pieces corresponding to the matched word as the second word piece.
s42: If the word fragment dictionary contains no word that matches the processed target word, determine the second word piece according to the frequency with which the target word has historically been used.
In steps s41 and s42, if the word fragment dictionary contains a word that matches the processed target word, the word pieces corresponding to the matched word are determined as the second word piece. Matching here means that the dictionary contains a word identical to the first sub-word, where the first sub-word is the word formed by the letters of the processed target word other than the target identifier. If there is no match, that is, no word in the dictionary is identical to the first sub-word, the electronic device determines the second word piece according to the frequency with which the target word has historically been used.
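The lookup with fallback in steps s41 and s42 can be sketched as below. The names are hypothetical; `by_frequency` stands for the frequency-based procedure of step s42, which is passed in as a callable.

```python
def second_word_pieces(first_sub_word, piece_dict, by_frequency):
    """Return dictionary pieces on an exact match, else fall back to s42."""
    if first_sub_word in piece_dict:
        # s41: the dictionary contains a word identical to the first sub-word.
        return list(piece_dict[first_sub_word])
    # s42: no match, so decide from historical usage frequency.
    return by_frequency(first_sub_word)
```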
In one embodiment, step s42 includes the following steps s51 to s53.
s51: Obtain the frequency with which the target word has historically been used.
s52: If that frequency is greater than a first preset frequency, determine the letters of the processed target word other than the target identifier as the second word piece.
s53: If that frequency is less than or equal to the first preset frequency, obtain the frequency with which each letter of the processed target word other than the target identifier occurs in the text content, and segment those letters according to these frequencies to obtain multiple second word pieces.
In steps s51 to s53, the electronic device may count, across multiple text contents, the frequency with which the target word has historically been used (that is, the frequency with which it occurs in those text contents). If this frequency is greater than the first preset frequency, the target word is a common word whose translation the translation model can obtain relatively easily, so the letters of the processed target word other than the target identifier are determined as the second word piece; in other words, the first sub-word is used as the second word piece. If the frequency is less than or equal to the first preset frequency, the target word is an uncommon word whose translation is hard for the translation model to obtain, so the processed target word needs finer segmentation: the frequency with which each letter other than the target identifier occurs in the text content is obtained, and those letters are segmented according to these frequencies to obtain multiple second word pieces.
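The branching in steps s51 to s53 can be sketched as follows. The description does not state how letter frequencies determine the cut points, so the rule used here (a letter rarer than `rare_cutoff` starts a new piece) is purely an assumption for illustration, as are all function and parameter names.

```python
from collections import Counter


def letter_frequencies(text):
    """Relative frequency of each letter in the text content."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    return {c: n / total for c, n in counts.items()} if total else {}


def cut_by_letter_frequency(sub_word, freqs, rare_cutoff):
    """ASSUMED rule: a letter rarer than rare_cutoff starts a new piece."""
    pieces, current = [], ""
    for ch in sub_word:
        if current and freqs.get(ch, 0.0) < rare_cutoff:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces


def pieces_by_frequency(sub_word, word_freq, first_preset, freqs, rare_cutoff):
    """s52: common word -> one piece; s53: uncommon word -> finer pieces."""
    if word_freq > first_preset:
        return [sub_word]
    return cut_by_letter_frequency(sub_word, freqs, rare_cutoff)
```

Whatever cutting rule is used, the pieces should concatenate back to the original letters, which the sketch preserves.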
In another embodiment, the preprocessing includes correction processing, and step S102 includes the following steps s61 to s63.
s61: Obtain the frequency with which the target phrase corresponding to the target word has historically been used; the target phrase consists of the target word in the text content and the words adjacent to it.
s62: If that frequency is less than a second preset frequency, obtain a word that matches the target word.
s63: Correct the target word using the matched word to obtain the processed target word.
In steps s61 to s63, to guard against errors in the target word caused by mistaken input, the electronic device may correct the target word. Specifically, the electronic device may count, across multiple text contents, the frequency with which the target phrase corresponding to the target word has historically been used. If this frequency is less than the second preset frequency, the target word is relatively likely to contain an error, so a word that matches the target word is obtained. Matching here may mean that the similarity between the target word and the matched word is greater than a preset threshold, or that the distance between them is smaller than a preset distance threshold. The target word is then corrected using the matched word, that is, the matched word replaces the target word to give the processed target word. Optionally, if exactly one word matches the target word, it replaces the target word directly; if several words match, the one most similar to the target word replaces it. For example, suppose the target word is bgi and its target phrase is bgi orders. If the frequency with which the target phrase has historically been used is 0, a word that matches the target word is obtained, such as big, and bgi is replaced with big.
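Steps s61 to s63 can be sketched as below. This is an illustration under assumptions: the description does not name a similarity measure, so the ratio from Python's standard `difflib.SequenceMatcher` is used as a stand-in, and the phrase-frequency table, vocabulary, thresholds and function name are all hypothetical.

```python
import difflib


def correct_word(word, next_word, phrase_freq, vocabulary,
                 second_preset=1, min_similarity=0.6):
    """Replace `word` with its most similar vocabulary word if its phrase is rare."""
    phrase = f"{word} {next_word}"
    # s61/s62: a phrase seen often enough is assumed to be spelled correctly.
    if phrase_freq.get(phrase, 0) >= second_preset:
        return word
    # s62: collect vocabulary words whose similarity exceeds the threshold.
    scored = [
        (difflib.SequenceMatcher(None, word, cand).ratio(), cand)
        for cand in vocabulary
    ]
    matches = [(score, cand) for score, cand in scored if score > min_similarity]
    if not matches:
        return word
    # s63: use the most similar match as the corrected word.
    return max(matches)[1]
```

With the example from the text, the phrase `bgi orders` has frequency 0, so `bgi` is replaced by its closest vocabulary word.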
In yet another embodiment, step S104 includes the following steps s71 to s74.
s71: Encode the at least one word piece to obtain the encoded value of the at least one word piece.
s72: Input the encoded value of the at least one word piece into a translation model for translation to obtain at least one candidate translation.
s73: Obtain the translation result of the word adjacent to the target word in the text content.
s74: Determine the translation result of the target word according to the at least one candidate translation and the translation result of the adjacent word.
In steps s71 to s74, the electronic device may obtain the translation result of the target word by means of a translation model. The translation model may be a neural machine translation (NMT) model, which may consist of at least one neural network model. For example, it may consist of two neural network models: one encodes the word pieces of the processed target word, and the other translates the encoded values of the word pieces to obtain the translation result of the target word. Specifically, the electronic device encodes the at least one word piece to obtain its encoded value. The encoded value may consist of at least one of digits, letters and symbols; for example, it may be the id corresponding to each word piece, and the id may be a number. The encoded value is input into the translation model to obtain at least one candidate translation. The device then obtains the translation result of the word adjacent to the target word in the text content and determines the translation result of the target word from the candidate translations and that adjacent translation result: the degree of association between each candidate translation and the translation result of the adjacent word is computed, and the candidate with the greatest degree of association is used as the translation result of the target word. In this way the target word is translated according to the context of the text content, which improves the accuracy and fluency of the translation.
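Steps s71 to s74 can be sketched as follows. Everything concrete here is a stand-in: the id table, the `TOY_MODEL` lookup used in place of a neural translation model, and the `ASSOCIATION` scores are hypothetical values chosen only to make the example runnable.

```python
# Hypothetical id table, stand-in "model", and association scores.
PIECE_IDS = {"_u": 0, "_pers": 1, "ist": 2, "ently": 3}
TOY_MODEL = {(0, 1, 2, 3): ["一再地", "持之以恒地"]}
ASSOCIATION = {("持之以恒地", "工作"): 0.9, ("一再地", "工作"): 0.2}


def encode(pieces):
    """s71: map each word piece to its numeric id."""
    return [PIECE_IDS[p] for p in pieces]


def translate_with_context(pieces, neighbor_translation):
    """s72 to s74: get candidates, then pick the one best associated with context."""
    candidates = TOY_MODEL.get(tuple(encode(pieces)), [])
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    return max(
        candidates,
        key=lambda c: ASSOCIATION.get((c, neighbor_translation), 0.0),
    )
```

In a real system the association score would come from a language model or similar component; the table above only illustrates the selection rule of taking the candidate with the greatest degree of association.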
The data processing method of this solution is explained below using a translation application as an example. A translation application is installed on the electronic device and can translate words, phrases and sentences in any language; the following description takes translating English words into Chinese as an example. As shown in Fig. 2, the data processing method includes the following steps 1 to 3.
1. Obtain the target word. When the user needs a translation, the user performs a touch operation on the translation application. The electronic device detects the touch operation, starts the translation application and displays its interface. The interface may include a text content input box 21, which allows the user to edit text content and receives the text content the user produces. After the user enters text content in the input box 21, the electronic device obtains the target word from that text content.
2. Perform format processing and segmentation on the target word. The electronic device obtains the format type of the words in the word fragment dictionary; if that format type is lowercase, the target word is preprocessed according to the lowercase type. Specifically, if the target word is Persistently, the electronic device converts the uppercase letter in the target word to lowercase and adds a first identifier, which may be _u, so the processed target word is _upersistently. The electronic device takes the first identifier as the first word piece and segments the letters of the processed target word other than the first identifier to obtain the second word pieces, so the word pieces corresponding to _upersistently are: _u+_pers+ist+ently. The symbol "+" here only separates word pieces and has no other meaning; _u+_pers+ist+ently comprises four word pieces, namely _u, _pers, ist and ently. Similarly, if the target word is PERSISTENTLY, the electronic device converts the uppercase letters in the target word to lowercase and adds a second identifier, which may be _U, so the processed target word is _Upersistently. The electronic device takes the second identifier as the first word piece and segments the remaining letters to obtain the second word pieces, so the word pieces corresponding to _Upersistently are: _U+_pers+ist+ently. Finally, if the target word is persistently, all of its letters are already lowercase, so no format processing is needed and the target word is segmented directly; its word pieces are _pers+ist+ently.
The words Persistently, PERSISTENTLY and persistently above are essentially the same word written in different format types. Comparing their segmentations shows that all three include _pers, ist and ently; that is, the three words share the word pieces _pers, ist and ently. Preprocessing the target word thus makes it possible to describe more words with fewer word pieces and improves the ability to describe words. As shown in Fig. 2, after obtaining the word pieces, the electronic device may display the word pieces 22 corresponding to the target word on the interface of the translation application.
3. Translate the target word. The at least one word piece is encoded to obtain encoded values, for example by mapping each word piece to a number; the word pieces _u, _pers, ist and ently are mapped to 1256, for instance. The encoded values are input into the translation model to obtain at least one candidate translation, and the candidate translations 23 are output on the interface of the translation application. For example, if the target word is Persistently, its candidate translations are: again and again, always, with perseverance, with dogged persistence, unflaggingly. The electronic device can obtain the translation result of the word adjacent to the target word in the text content, determine the translation result of the target word according to the translation result of the adjacent word and the candidate translations, and output the translation result 24, for example: with perseverance. The user may also select among these candidate translations, in which case the electronic device can take the candidate translation selected by the user as the translation result of the target word.
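The three steps above can be illustrated with a minimal sketch. It assumes a toy word-piece dictionary, a greedy longest-match segmentation rule and arbitrary integer codes; the names PIECE_IDS, preprocess, segment, word_pieces and encode are hypothetical, and the description does not state that the embodiment segments or numbers word pieces in this way.

```python
# Toy word-piece dictionary. The pieces come from the example in the text;
# the integer codes are illustrative assumptions only.
PIECE_IDS = {"_u": 1, "_U": 2, "_pers": 3, "ist": 4, "ently": 5}

FIRST_IDENTIFIER = "_u"   # marks a word whose first letter was uppercase
SECOND_IDENTIFIER = "_U"  # marks a word whose letters were all uppercase


def preprocess(word):
    """Return (identifier or None, lowercase word) for a lowercase dictionary."""
    if len(word) > 1 and word.isupper():
        return SECOND_IDENTIFIER, word.lower()
    if word[:1].isupper():
        return FIRST_IDENTIFIER, word.lower()
    # Already lowercase (other casings are not covered by this sketch).
    return None, word


def segment(lowered, vocab):
    """Greedy longest-match split; the word start is marked with '_'."""
    text = "_" + lowered
    pieces, start = [], 0
    while start < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(text), start, -1):
            if text[start:end] in vocab:
                pieces.append(text[start:end])
                start = end
                break
        else:
            raise ValueError(f"cannot segment {text[start:]!r}")
    return pieces


def word_pieces(word, vocab=PIECE_IDS):
    """Identifier (if any) as the first piece, then the pieces of the letters."""
    identifier, lowered = preprocess(word)
    pieces = segment(lowered, vocab)
    return ([identifier] if identifier else []) + pieces


def encode(pieces, vocab=PIECE_IDS):
    """Map each word piece to its integer code."""
    return [vocab[piece] for piece in pieces]


for w in ("Persistently", "PERSISTENTLY", "persistently"):
    print(w, "->", "+".join(word_pieces(w)), encode(word_pieces(w)))
```

All three spellings share the pieces _pers, ist and ently in this sketch; only the leading identifier differs, which is the sharing effect described in step 2.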
An embodiment of the present invention provides a data processing apparatus, which may be provided in an electronic device. Referring to Fig. 3, the apparatus includes:
An acquiring unit 301, configured to obtain a target word from text content to be translated.
A processing unit 302, configured to preprocess the target word to obtain a processed target word, the processed target word meeting the condition for being segmented into word pieces.
A segmentation unit 303, configured to segment the processed target word into at least one word piece.
A translation unit 304, configured to translate the target word according to the at least one word piece to obtain a translation result.
Optionally, the processing unit 302 is specifically configured to obtain a word-piece dictionary and the format type of the words in the word-piece dictionary, the word-piece dictionary including multiple words; and to preprocess the target word according to the format type of the words in the word-piece dictionary to obtain the processed target word.
Optionally, the target word includes at least one letter, and the format type of the words in the word-piece dictionary is the lowercase type. The processing unit 302 is specifically configured to: if the first letter of the target word is an uppercase letter, add a first identifier to the target word and convert the uppercase letter in the target word to lowercase to obtain the processed target word; and if the letters in the target word are all uppercase letters, add a second identifier to the target word and convert the uppercase letters in the target word to lowercase to obtain the processed target word.
Optionally, the at least one word piece includes a first word piece and a second word piece. The segmentation unit 303 is specifically configured to determine the target identifier in the processed target word as the first word piece, the target identifier being the first identifier or the second identifier; and to segment the letters of the processed target word other than the target identifier to obtain the second word piece.
Optionally, the word-piece dictionary further includes the word pieces corresponding to each word. The segmentation unit 303 is specifically configured to: if the word-piece dictionary contains a word that matches the processed target word, determine the word pieces corresponding to the matched word as the second word piece; and if the word-piece dictionary contains no word that matches the processed target word, determine the second word piece according to the frequency with which the target word has historically been used.
Optionally, the segmentation unit 303 is specifically configured to obtain the frequency with which the target word has historically been used.
If that frequency is greater than a first preset frequency, the letters of the processed target word other than the target identifier are determined as the second word piece. If that frequency is less than or equal to the first preset frequency, the frequency with which each letter of the processed target word other than the target identifier occurs in the text content is obtained, and the letters of the processed target word other than the target identifier are segmented according to those frequencies to obtain multiple second word pieces.
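The decision order described for the segmentation unit 303 can be sketched as follows. This is only an illustration under stated assumptions: the function name second_pieces, the data structures and the threshold value are hypothetical, and because the text does not spell out how letter frequencies drive the final split, that step is passed in as a caller-supplied function (the demo simply splits into single letters as a placeholder).

```python
def second_pieces(lowered, piece_dict, history_freq, first_threshold, fallback_split):
    """Choose the second word piece(s) for the letters after the identifier.

    lowered         -- processed target word without its identifier
    piece_dict      -- word -> list of word pieces (the word-piece dictionary)
    history_freq    -- word -> how often the word was used historically
    first_threshold -- stands in for the "first preset frequency"
    fallback_split  -- hypothetical stand-in for the letter-frequency split
    """
    # 1. A matching dictionary entry wins.
    if lowered in piece_dict:
        return list(piece_dict[lowered])
    # 2. A frequently used word is kept whole as a single word piece.
    if history_freq.get(lowered, 0) > first_threshold:
        return [lowered]
    # 3. Otherwise fall back to a finer-grained split.
    return fallback_split(lowered)


# Demo data (assumed values, not taken from the patent).
demo_dict = {"persistently": ["_pers", "ist", "ently"]}
demo_freq = {"hello": 50, "zyx": 1}

print(second_pieces("persistently", demo_dict, demo_freq, 10, list))
print(second_pieces("hello", demo_dict, demo_freq, 10, list))
print(second_pieces("zyx", demo_dict, demo_freq, 10, list))
```

Passing the split rule as a parameter keeps the sketch honest about what the description leaves open while still showing the dictionary-first, frequency-second ordering.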
Optionally, the processing unit 302 is specifically configured to obtain the frequency with which the target phrase corresponding to the target word has historically been used, the target phrase consisting of the target word in the text content and the words adjacent to the target word; if that frequency is less than a second preset frequency, obtain a word that matches the target word; and correct the target word using the matching word to obtain the processed target word.
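The correction step can be sketched in a few lines. The sketch assumes that "a word that matches the target word" means the closest spelling in a known vocabulary, found here with the standard-library function difflib.get_close_matches; the text does not say how the match is found, and the names correct_word, phrase_freq and second_threshold are hypothetical.

```python
import difflib


def correct_word(prev_word, word, next_word, phrase_freq, vocabulary, second_threshold):
    """Return the word unchanged if its phrase is common, else the closest known word."""
    # The target phrase is the target word together with its adjacent words.
    phrase = " ".join(w for w in (prev_word, word, next_word) if w)
    if phrase_freq.get(phrase, 0) >= second_threshold:
        return word
    # Rare phrase: assume a spelling error and look for a similar known word.
    matches = difflib.get_close_matches(word, vocabulary, n=1)
    return matches[0] if matches else word


# Demo data (assumed values).
demo_phrases = {"work persistently": 120}
demo_vocab = ["persistently", "persistent", "consistently"]

print(correct_word("work", "persistantly", None, demo_phrases, demo_vocab, 5))
print(correct_word("work", "persistently", None, demo_phrases, demo_vocab, 5))
```

Correcting before segmentation matters because a misspelled word would otherwise be split into pieces that do not match those of the intended word.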
Optionally, the translation unit 304 is specifically configured to encode the at least one word piece to obtain the encoded value of the at least one word piece; input the encoded value of the at least one word piece into a translation model for translation to obtain at least one candidate translation; obtain the translation result of the word adjacent to the target word in the text content; and determine the translation result of the target word according to the at least one candidate translation and the translation result of the adjacent word.
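How the candidate translations and the adjacent word's translation are combined is not detailed here. One possible reading, shown purely as an assumption, is to score each candidate by how often it co-occurs with the neighbouring translations and keep the best one; pick_translation and the co-occurrence table are hypothetical names and data.

```python
def pick_translation(candidates, neighbor_translations, cooccurrence):
    """Pick the candidate that co-occurs most with the neighbouring translations.

    cooccurrence maps (candidate, neighbor) pairs to counts; missing pairs
    count as zero. Ties keep the earliest candidate, i.e. the model's order.
    """
    def score(candidate):
        return sum(cooccurrence.get((candidate, n), 0) for n in neighbor_translations)

    return max(candidates, key=score)


# Demo data (assumed values).
demo_candidates = ["always", "with perseverance", "unflaggingly"]
demo_cooc = {("with perseverance", "work"): 7, ("always", "work"): 2}

print(pick_translation(demo_candidates, ["work"], demo_cooc))
```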
In this embodiment of the present invention, the target word is preprocessed to obtain a processed target word that meets the condition for being segmented into word pieces. That is, the letters of the processed target word all have the same format type, which avoids the situation in which the same word is segmented inconsistently because of inconsistent format types and, as a consequence, translated with low accuracy. Moreover, the processed target word contains no erroneous letters, which improves the accuracy of segmenting it. In addition, segmenting the processed target word into at least one word piece describes it with word pieces of smaller granularity, which improves the ability to describe words. Further, translating the target word using the at least one word piece to obtain the translation result can improve the translation accuracy for the target word.
An embodiment of the present invention provides an electronic device; see Fig. 4. The electronic device includes a processor 151, a user interface 152, a network interface 154 and a storage device 155, which are connected to one another by a bus 153.
The user interface 152 implements human-computer interaction and may include a display screen, a keyboard or the like. The network interface 154 is used for communication connections with external devices. The storage device 155 is coupled to the processor 151 and stores various software programs and/or multiple sets of instructions. In a specific implementation, the storage device 155 may include high-speed random access memory and may also include non-volatile memory, such as one or more disk storage devices, flash memory devices or other non-volatile solid-state storage devices. The storage device 155 can store an operating system (hereinafter "the system"), for example an embedded operating system such as ANDROID, IOS, WINDOWS or LINUX. The storage device 155 can also store a network communication program, which can be used to communicate with one or more additional devices, one or more application servers and one or more network devices. The storage device 155 can also store a user interface program, which can vividly display the content of an application through a graphical operation interface and receive the user's control operations on the application through input controls such as menus, dialog boxes and keys. The storage device 155 can also store video data and the like.
In one embodiment, the storage device 155 can be used to store one or more instructions. The processor 151 can implement the data processing method by calling the one or more instructions. Specifically, the processor 151 calls the one or more instructions to perform the following steps:
Obtaining a target word from text content to be translated;
Preprocessing the target word to obtain a processed target word, the processed target word meeting the condition for being segmented into word pieces;
Segmenting the processed target word into at least one word piece;
Translating the target word according to the at least one word piece to obtain a translation result.
Optionally, the processor 151 calls the one or more instructions to perform the following steps:
Obtaining a word-piece dictionary and the format type of the words in the word-piece dictionary, the word-piece dictionary including multiple words;
Preprocessing the target word according to the format type of the words in the word-piece dictionary to obtain the processed target word.
Optionally, the processor 151 calls the one or more instructions to perform the following steps:
If the first letter of the target word is an uppercase letter, adding a first identifier to the target word and converting the uppercase letter in the target word to lowercase to obtain the processed target word;
If the letters in the target word are all uppercase letters, adding a second identifier to the target word and converting the uppercase letters in the target word to lowercase to obtain the processed target word.
Optionally, the processor 151 calls the one or more instructions to perform the following steps:
Determining the target identifier in the processed target word as the first word piece, the target identifier being the first identifier or the second identifier;
Segmenting the letters of the processed target word other than the target identifier to obtain the second word piece.
Optionally, the processor 151 calls the one or more instructions to perform the following steps:
If the word-piece dictionary contains a word that matches the processed target word, determining the word pieces corresponding to the matched word as the second word piece;
If the word-piece dictionary contains no word that matches the processed target word, determining the second word piece according to the frequency with which the target word has historically been used.
Optionally, the processor 151 calls the one or more instructions to perform the following steps:
Obtaining the frequency with which the target word has historically been used;
If that frequency is greater than a first preset frequency, determining the letters of the processed target word other than the target identifier as the second word piece;
If that frequency is less than or equal to the first preset frequency, obtaining the frequency with which each letter of the processed target word other than the target identifier occurs in the text content, and segmenting the letters of the processed target word other than the target identifier according to those frequencies to obtain multiple second word pieces.
Optionally, the processor 151 calls the one or more instructions to perform the following steps:
Obtaining the frequency with which the target phrase corresponding to the target word has historically been used, the target phrase consisting of the target word in the text content and the words adjacent to the target word;
If that frequency is less than a second preset frequency, obtaining a word that matches the target word;
Correcting the target word using the matching word to obtain the processed target word.
Optionally, the processor 151 calls the one or more instructions to perform the following steps:
Encoding the at least one word piece to obtain the encoded value of the at least one word piece;
Inputting the encoded value of the at least one word piece into a translation model for translation to obtain at least one candidate translation;
Obtaining the translation result of the word adjacent to the target word in the text content;
Determining the translation result of the target word according to the at least one candidate translation and the translation result of the adjacent word.
In this embodiment of the present invention, the target word is preprocessed to obtain a processed target word that meets the condition for being segmented into word pieces. That is, the letters of the processed target word all have the same format type, which avoids the situation in which the same word is segmented inconsistently because of inconsistent format types and, as a consequence, translated with low accuracy. Moreover, the processed target word contains no erroneous letters, which improves the accuracy of segmenting it. In addition, segmenting the processed target word into at least one word piece describes it with word pieces of smaller granularity, which improves the ability to describe words. Further, translating the target word using the at least one word piece to obtain the translation result can improve the translation accuracy for the target word.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. For the implementation by which the program solves the problem, and for its beneficial effects, refer to the implementation and beneficial effects of the data processing method described above with reference to Fig. 1; repeated details are not described again.
The above discloses only some embodiments of the present invention and certainly cannot limit the scope of the rights of the present invention; equivalent changes made in accordance with the claims of the present invention therefore still fall within the scope of the present invention.

Claims (10)

1. A data processing method, characterized in that the method comprises:
obtaining a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, the processed target word meeting the condition for being segmented into word pieces;
segmenting the processed target word into at least one word piece;
translating the target word according to the at least one word piece to obtain a translation result.
2. The method according to claim 1, characterized in that the preprocessing includes format processing, and the preprocessing the target word to obtain a processed target word comprises:
obtaining a word-piece dictionary and the format type of the words in the word-piece dictionary, the word-piece dictionary including multiple words;
preprocessing the target word according to the format type of the words in the word-piece dictionary to obtain the processed target word.
3. The method according to claim 2, characterized in that the target word includes at least one letter, and the format type of the words in the word-piece dictionary is the lowercase type; the preprocessing the target word according to the format type of the words in the word-piece dictionary to obtain the processed target word comprises:
if the first letter of the target word is an uppercase letter, adding a first identifier to the target word and converting the uppercase letter in the target word to lowercase to obtain the processed target word;
if the letters in the target word are all uppercase letters, adding a second identifier to the target word and converting the uppercase letters in the target word to lowercase to obtain the processed target word.
4. The method according to claim 3, characterized in that the at least one word piece includes a first word piece and a second word piece, and the segmenting the processed target word into at least one word piece comprises:
determining the target identifier in the processed target word as the first word piece, the target identifier being the first identifier or the second identifier;
segmenting the letters of the processed target word other than the target identifier to obtain the second word piece.
5. The method according to claim 4, characterized in that the word-piece dictionary further includes the word pieces corresponding to each word, and the segmenting the letters of the processed target word other than the target identifier to obtain the second word piece comprises:
if the word-piece dictionary contains a word that matches the processed target word, determining the word pieces corresponding to the matched word as the second word piece;
if the word-piece dictionary contains no word that matches the processed target word, determining the second word piece according to the frequency with which the target word has historically been used.
6. The method according to claim 5, characterized in that the determining the second word piece according to the frequency with which the target word has historically been used comprises:
obtaining the frequency with which the target word has historically been used;
if that frequency is greater than a first preset frequency, determining the letters of the processed target word other than the target identifier as the second word piece;
if that frequency is less than or equal to the first preset frequency, obtaining the frequency with which each letter of the processed target word other than the target identifier occurs in the text content, and segmenting the letters of the processed target word other than the target identifier according to those frequencies to obtain multiple second word pieces.
7. The method according to any one of claims 1 to 6, characterized in that the preprocessing includes correction processing, and the preprocessing the target word to obtain a processed target word comprises:
obtaining the frequency with which the target phrase corresponding to the target word has historically been used, the target phrase consisting of the target word in the text content and the words adjacent to the target word;
if that frequency is less than a second preset frequency, obtaining a word that matches the target word;
correcting the target word using the matching word to obtain the processed target word.
8. The method according to any one of claims 1 to 6, characterized in that the translating the target word according to the at least one word piece to obtain a translation result comprises:
encoding the at least one word piece to obtain the encoded value of the at least one word piece;
inputting the encoded value of the at least one word piece into a translation model for translation to obtain at least one candidate translation;
obtaining the translation result of the word adjacent to the target word in the text content;
determining the translation result of the target word according to the at least one candidate translation and the translation result of the adjacent word.
9. A data processing apparatus, characterized in that the apparatus comprises:
an acquiring unit, configured to obtain a target word from text content to be translated;
a processing unit, configured to preprocess the target word to obtain a processed target word, the processed target word meeting the condition for being segmented into word pieces;
a segmentation unit, configured to segment the processed target word into at least one word piece;
a translation unit, configured to translate the target word according to the at least one word piece to obtain a translation result.
10. An electronic device, including an input device and an output device, characterized by further comprising:
a processor, adapted to execute one or more instructions; and
a computer storage medium storing one or more instructions, the one or more instructions being adapted to be loaded by the processor to execute the method according to claim 1.
CN201910702513.XA 2019-07-31 2019-07-31 Data processing method and device and electronic equipment Active CN110414013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910702513.XA CN110414013B (en) 2019-07-31 2019-07-31 Data processing method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110414013A true CN110414013A (en) 2019-11-05
CN110414013B CN110414013B (en) 2024-06-21

Family

ID=68364860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910702513.XA Active CN110414013B (en) 2019-07-31 2019-07-31 Data processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110414013B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1912865A (en) * 2005-08-10 2007-02-14 英业达股份有限公司 Hermeneutical system and method
US20090063127A1 (en) * 2007-09-03 2009-03-05 Tatsuya Izuha Apparatus, method, and computer program product for creating data for learning word translation
CN106156007A (en) * 2015-03-24 2016-11-23 吕海港 A kind of English-Chinese statistical machine translation method of word original shape
CN107015971A (en) * 2017-03-30 2017-08-04 唐亮 The post-processing module of multilingual intelligence pretreatment real-time statistics machine translation system
CN107038160A (en) * 2017-03-30 2017-08-11 唐亮 The pretreatment module of multilingual intelligence pretreatment real-time statistics machine translation system
CN108563644A (en) * 2018-03-29 2018-09-21 河南工学院 A kind of English Translation electronic system
CN108763222A (en) * 2018-05-17 2018-11-06 腾讯科技(深圳)有限公司 Detection, interpretation method and device, server and storage medium are translated in a kind of leakage

Also Published As

Publication number Publication date
CN110414013B (en) 2024-06-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant