CN110414013A - Data processing method, device and electronic equipment - Google Patents
Data processing method, device and electronic equipment
- Publication number
- CN110414013A, CN201910702513.XA
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
An embodiment of the invention discloses a data processing method, apparatus and terminal. The method may include: obtaining a target word from text content to be translated; preprocessing the target word to obtain a processed target word that satisfies the condition for being segmented into word pieces; segmenting the processed target word into at least one word piece; and translating the target word according to the at least one word piece to obtain a translation result. The embodiment can improve translation accuracy.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a data processing method, a data processing apparatus and an electronic device.
Background
With the widespread use of Internet technology, economic globalization continues to expand and promotes exchange and cooperation among countries. Practitioners in many industries (such as foreign trade workers and technical researchers) need to communicate with people who speak different languages and to read large numbers of documents written in languages unfamiliar to them, which creates obstacles to communication. Translation has therefore become particularly important. In practice, however, the accuracy of existing translation approaches is relatively low, and it is difficult to meet users' expectations.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide a data processing method, apparatus, storage medium and electronic device that can improve translation accuracy.
In one aspect, an embodiment of the present invention provides a data processing method, comprising:
obtaining a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, where the processed target word satisfies the condition for being segmented into word pieces;
segmenting the processed target word into at least one word piece;
translating the target word according to the at least one word piece to obtain a translation result.
In another aspect, an embodiment of the present invention provides a data processing apparatus, comprising:
an obtaining unit, configured to obtain a target word from text content to be translated;
a processing unit, configured to preprocess the target word to obtain a processed target word, where the processed target word satisfies the condition for being segmented into word pieces;
a segmentation unit, configured to segment the processed target word into at least one word piece;
a translation unit, configured to translate the target word according to the at least one word piece to obtain a translation result.
In another aspect, an embodiment of the present invention provides an electronic device that includes an input device and an output device, and further includes:
a processor, adapted to execute one or more instructions; and
a computer storage medium storing one or more instructions, the one or more instructions being adapted to be loaded by the processor to perform the following steps:
obtaining a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, where the processed target word satisfies the condition for being segmented into word pieces;
segmenting the processed target word into at least one word piece;
translating the target word according to the at least one word piece to obtain a translation result.
In another aspect, an embodiment of the present invention provides a computer storage medium storing one or more instructions, the one or more instructions being adapted to be loaded by a processor to perform the following steps:
obtaining a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, where the processed target word satisfies the condition for being segmented into word pieces;
segmenting the processed target word into at least one word piece;
translating the target word according to the at least one word piece to obtain a translation result.
In the embodiments of the present invention, the target word is preprocessed so that the processed target word satisfies the condition for being segmented into word pieces. First, the letters of the processed target word share the same format type (letter case), which prevents the same word from being segmented inconsistently merely because its case differs, and thus avoids the resulting loss of translation accuracy. Second, the processed target word contains no erroneous letters, which improves the accuracy of segmenting it. In addition, segmenting the processed target word into at least one word piece describes the word at a finer granularity and improves the ability to describe words. Finally, translating the target word using the at least one word piece improves the translation accuracy for the target word.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a data processing method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of an interface in a data processing procedure provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, Fig. 1 shows a data processing method provided by an embodiment of the present invention. The method may be executed by an electronic device, including but not limited to a smartphone, tablet computer, portable personal computer, smart watch, wristband or smart television. As shown in Fig. 1, the data processing method includes the following steps S101 to S104.
S101: Obtain a target word from text content to be translated.
The text content to be translated is the content that needs translating. It may be obtained by text recognition of a text file, such as a professional technical document or a literary work. Alternatively, it may be obtained by recognizing an audio file, such as a recording of a speech. Alternatively, it may be collected from a web page, such as a product introduction page or a social networking page. Alternatively, it may be content entered on a translation interface, such as a web translation page or the interface of a translation application. The text content may include at least one word, and the target word may be any of these words. The language of the target word may be one that distinguishes letter case, such as English.
S102: Preprocess the target word to obtain a processed target word that satisfies the condition for being segmented into word pieces.
To improve the accuracy of translating the target word, the electronic device may preprocess it to obtain the processed target word. The preprocessing is any one of the following: format processing, error correction, or both. Format processing normalizes the format type (letter case) of the letters in the target word; this prevents the same word from being segmented inconsistently because its format differs, which would lower translation accuracy. Error correction corrects wrong letters in the target word, which makes the target word itself more accurate and in turn improves translation accuracy. Accordingly, the condition for being segmented into word pieces may include one or both of the following: every letter in the processed target word has the same format type; the processed target word contains no wrong letters.
S103: Segment the processed target word into at least one word piece.
To describe the target word at a finer granularity, the electronic device may segment the processed target word into at least one word piece according to a word fragment dictionary or according to how frequently the target word is used. A word piece consists of at least one letter of the processed target word. Describing the processed target word with smaller-granularity word pieces improves the ability to describe words. Here, the word fragment dictionary is a dictionary used to segment words into word pieces.
S104: Translate the target word according to the at least one word piece to obtain a translation result.
The electronic device may input the at least one word piece into a translation model to obtain the translation result of the target word. The translation model may be trained on the word pieces of a large number of sample words; translating small-granularity word pieces with such a model can improve translation accuracy. Alternatively, the translation result of the target word may be looked up in a translation dictionary according to the at least one word piece; searching by word piece is more fine-grained and likewise improves the accuracy of translating the target word. Optionally, translating the target word according to the at least one word piece yields at least one candidate translation. If there is exactly one candidate translation, it is used as the translation result of the target word. If there are several, the translation result may be determined from the translation results of the words adjacent to the target word in the text content, that is, by combining the context with the candidate translations.
In the embodiments of the present invention, the target word is preprocessed so that the processed target word satisfies the condition for being segmented into word pieces. First, the letters of the processed target word share the same format type (letter case), which prevents the same word from being segmented inconsistently merely because its case differs, and thus avoids the resulting loss of translation accuracy. Second, the processed target word contains no erroneous letters, which improves the accuracy of segmenting it. In addition, segmenting the processed target word into at least one word piece describes the word at a finer granularity and improves the ability to describe words. Finally, translating the target word using the at least one word piece improves the translation accuracy for the target word.
In one embodiment, the preprocessing includes format processing, and step S102 includes the following steps s11 and s12.
s11: Obtain a word fragment dictionary and the format type of the words in it; the word fragment dictionary includes multiple words.
s12: Preprocess the target word according to the format type of the words in the word fragment dictionary to obtain the processed target word.
In steps s11 and s12, the word fragment dictionary records multiple words and the word pieces corresponding to each word, so the word pieces of a word can be looked up in it. Because the dictionary has a fixed size, all words in it share the same format type so that more distinct words can be recorded. For example, persistently and Persistently are the same word written with different format types. Recording both directly would occupy the space of two entries; if their format is normalized to persistently or PERSISTENTLY, only that single form needs to be recorded, occupying the space of one entry. The format of all words in the word fragment dictionary is therefore normalized, that is, all its words share one format type, either uppercase or lowercase. To look up the word pieces of the processed target word, the electronic device obtains the word fragment dictionary and the format type of its words. If that format type is lowercase, the target word is preprocessed to lowercase to obtain the processed target word; if it is uppercase, the target word is preprocessed to uppercase.
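Steps s11 and s12 can be illustrated with a minimal Python sketch. The function names and the representation of the dictionary as a plain list of words are assumptions made for illustration; the source does not prescribe an implementation.

```python
def dictionary_format_type(words):
    """Return 'lower' or 'upper' for a dictionary whose entries share one case (s11)."""
    if all(w.islower() for w in words):
        return "lower"
    if all(w.isupper() for w in words):
        return "upper"
    # The source assumes a normalized dictionary; anything else is treated as an error here.
    raise ValueError("dictionary entries do not share a single format type")


def normalize_to_dictionary(word, words):
    """Convert the target word to the case used by the dictionary (s12)."""
    if dictionary_format_type(words) == "lower":
        return word.lower()
    return word.upper()
```

With a lowercase dictionary, `normalize_to_dictionary("Persistently", ["persistently", "big"])` gives `"persistently"`, so the lookup no longer depends on how the input was capitalized.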
In one embodiment, the target word includes at least one letter and the format type of the words in the word fragment dictionary is lowercase; step s12 may include the following steps s21 and s22.
s21: If the first letter of the target word is uppercase, add a first identifier to the target word and convert the uppercase letter in the target word to lowercase to obtain the processed target word.
s22: If all letters of the target word are uppercase, add a second identifier to the target word and convert the uppercase letters to lowercase to obtain the processed target word.
In steps s21 and s22, if the first letter of the target word is uppercase, the format type of the target word differs from that of the words in the word fragment dictionary, so its word pieces cannot be looked up there. A first identifier is therefore added to the target word and its uppercase letter is converted to lowercase to obtain the processed target word. Likewise, if all letters of the target word are uppercase, a second identifier is added and the uppercase letters are converted to lowercase. If every letter of the target word is already lowercase, no format processing is needed. Note that the same word may translate differently depending on its format type: for example, China translates to the country name, whereas china translates to ceramic ware. The first or second identifier is added precisely to record the original format type of the target word, which improves the accuracy of translating it.
The first identifier indicates that the first letter of the target word, counted from the left, is uppercase; the second identifier indicates that all letters of the target word are uppercase. Each identifier may consist of at least one of letters, digits and symbols, and the two identifiers differ from each other. They may be added at the same position in the target word or at different positions.
In one embodiment, the at least one word piece includes a first word piece and a second word piece, and step S103 includes the following steps s31 and s32.
s31: Determine the target identifier in the processed target word as the first word piece; the target identifier is the first identifier or the second identifier.
s32: Segment the letters of the processed target word other than the target identifier to obtain the second word piece.
In steps s31 and s32, whether only the first letter of the target word is uppercase or all of its letters are uppercase, the two processed target words differ only in their identifier (first versus second). To describe processed target words with fewer word pieces, the electronic device may therefore take the target identifier as the first word piece and segment the remaining letters to obtain the second word piece. In both cases the target words share the same second word piece, so more words can be described with fewer word pieces and segmentation becomes less complex. This also avoids inconsistent segmentation of the same word caused by differing format types.
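A short Python sketch of steps s31 and s32 follows. It assumes the identifiers `_u` and `_U` are prefixes, and it uses a hypothetical one-entry dictionary whose word pieces are those given in the worked example of this description; words missing from the dictionary are not handled here.

```python
IDENTIFIERS = ("_u", "_U")  # first and second identifier, assumed to be prefixes
WORD_PIECES = {"persistently": ["_pers", "ist", "ently"]}  # hypothetical dictionary


def split_identifier(processed):
    """Separate the target identifier (first word piece) from the remaining letters (s31)."""
    for identifier in IDENTIFIERS:
        if processed.startswith(identifier):
            return identifier, processed[len(identifier):]
    return None, processed


def to_word_pieces(processed, dictionary=WORD_PIECES):
    """Return the identifier followed by the pieces of the remaining letters (s32)."""
    identifier, rest = split_identifier(processed)
    pieces = list(dictionary[rest])  # raises KeyError for words missing from the dictionary
    return ([identifier] if identifier else []) + pieces
```

Both `_upersistently` and `_Upersistently` reuse the pieces `_pers`, `ist` and `ently`; only the leading identifier piece differs.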
In one embodiment, the word fragment dictionary further includes the word pieces corresponding to each word, and step s32 includes the following steps s41 and s42.
s41: If the word fragment dictionary contains a word matching the processed target word, determine the word pieces corresponding to the matched word as the second word piece.
s42: If the word fragment dictionary contains no word matching the processed target word, determine the second word piece according to how frequently the target word has been used historically.
In steps s41 and s42, if the word fragment dictionary contains a word matching the processed target word, the word pieces corresponding to the matched word are determined as the second word piece. Matching here means that the dictionary contains a word identical to the first sub-word, where the first sub-word is the word formed by the letters of the processed target word other than the target identifier. If the dictionary contains no matching word, that is, no word identical to the first sub-word, the electronic device may determine the second word piece according to the historical usage frequency of the target word.
In one embodiment, step s42 includes the following steps s51 to s53.
s51: Obtain the historical usage frequency of the target word.
s52: If the historical usage frequency of the target word is greater than a first preset frequency, determine the letters of the processed target word other than the target identifier as the second word piece.
s53: If the historical usage frequency of the target word is less than or equal to the first preset frequency, obtain the frequency with which each letter of the processed target word, other than the target identifier, occurs in the text content, and segment those letters according to these frequencies to obtain multiple second word pieces.
In steps s51 to s53, the electronic device may count, across multiple text contents, how often the target word has been used historically (that is, how often it occurs in those text contents). If this frequency is greater than the first preset frequency, the target word is a common word whose translation the translation model can obtain relatively easily. The letters of the processed target word other than the target identifier can therefore be determined as the second word piece, that is, the first sub-word is used as the second word piece. If the frequency is less than or equal to the first preset frequency, the target word is an uncommon word whose translation is difficult to obtain from the translation model, so the processed target word needs finer segmentation: the electronic device obtains the frequency with which each letter of the processed target word, other than the target identifier, occurs in the text content, and segments those letters according to these frequencies to obtain multiple second word pieces.
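The decision logic of steps s41, s42 and s51 to s53 can be sketched in Python as below. The source does not state how letter frequencies determine the cut points in s53, so that rule is left as a pluggable function, `split_rare_word`, which is hypothetical; its default simply splits the word into single letters as a placeholder and does not use letter frequencies.

```python
def second_word_pieces(sub_word, dictionary, word_frequency,
                       first_preset_frequency, split_rare_word=list):
    """Choose the second word piece(s) for the letters left after the identifier."""
    if sub_word in dictionary:
        # s41: the dictionary already records the pieces for this word.
        return list(dictionary[sub_word])
    if word_frequency.get(sub_word, 0) > first_preset_frequency:
        # s52: a common word is kept whole as a single word piece.
        return [sub_word]
    # s53: an uncommon word is cut more finely (placeholder rule, see lead-in).
    return split_rare_word(sub_word)
```

Keeping the rare-word rule as a parameter makes it possible to substitute a frequency-driven splitter later without touching the dictionary and threshold checks.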
In another embodiment, the preprocessing includes error correction, and step S102 includes the following steps s61 to s63.
s61: Obtain the historical usage frequency of the target phrase corresponding to the target word; the target phrase consists of the target word in the text content and a word adjacent to it.
s62: If the historical usage frequency of the target phrase is less than a second preset frequency, obtain a word that matches the target word.
s63: Correct the target word using the matching word to obtain the processed target word.
In steps s61 to s63, to guard against input mistakes that introduce errors into the target word, the electronic device may apply error correction to it. Specifically, the electronic device may count, across multiple text contents, how often the target phrase corresponding to the target word has been used historically. If this frequency is less than the second preset frequency, the target word is relatively likely to contain an error, so a word matching the target word is obtained. Matching here may mean that the similarity between the target word and the matching word is greater than a preset threshold, or that the distance between them is less than a preset distance threshold. The target word is then corrected using the matching word, that is, the matching word replaces the target word to give the processed target word. Optionally, if exactly one word matches the target word, it replaces the target word directly; if several words match, the one most similar to the target word is used. For example, suppose the target word is bgi and its target phrase is bgi orders. If the historical usage frequency of the target phrase is 0, a word matching the target word is obtained, such as big, and bgi is replaced with big.
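The correction flow of steps s61 to s63 can be sketched in Python as follows. The source does not name a similarity measure; this sketch assumes the ratio computed by `difflib.SequenceMatcher` from the standard library, and the parameter names, default thresholds and sample data are hypothetical.

```python
from difflib import SequenceMatcher


def correct_word(word, neighbor, phrase_counts, vocabulary,
                 second_preset_frequency=1, similarity_threshold=0.6):
    """Replace a probably misspelled word with its most similar vocabulary word."""
    phrase = f"{word} {neighbor}"
    # s61/s62: a phrase that has been used often enough is taken to be correct.
    if phrase_counts.get(phrase, 0) >= second_preset_frequency:
        return word
    best_word, best_score = word, similarity_threshold
    for candidate in vocabulary:
        score = SequenceMatcher(None, word, candidate).ratio()
        if score > best_score:
            # s63: keep the matching word with the greatest similarity.
            best_word, best_score = candidate, score
    return best_word


# The phrase "bgi orders" has never been seen, so "bgi" is corrected.
print(correct_word("bgi", "orders", {"big orders": 12}, ["big", "small", "order"]))
```

If no vocabulary word exceeds the similarity threshold, the function returns the word unchanged, which is a conservative choice made for this sketch.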
In yet another embodiment, step S104 includes the following steps s71 to s74.
s71: Encode the at least one word piece to obtain the encoded value of the at least one word piece.
s72: Input the encoded value of the at least one word piece into a translation model to obtain at least one candidate translation.
s73: Obtain the translation result of a word adjacent to the target word in the text content.
s74: Determine the translation result of the target word according to the at least one candidate translation and the translation result of the adjacent word.
In steps s71 to s74, the electronic device may obtain the translation result of the target word through a translation model. The translation model may be a neural machine translation (NMT) model, which may consist of at least one neural network model. For example, an NMT model may consist of two neural network models: one encodes the word pieces of the processed target word, and the other translates the encoded values of the word pieces to obtain the translation result of the target word. Specifically, the electronic device may encode the at least one word piece to obtain its encoded value. The encoded value may consist of at least one of digits, letters, symbols and the like; for example, it may be the id corresponding to the word piece, and the id may be a number. The encoded value is input into the translation model to obtain at least one candidate translation. The electronic device then obtains the translation result of a word adjacent to the target word in the text content and determines the translation result of the target word from the candidate translations and the translation result of the adjacent word: the degree of association between each candidate translation and the adjacent word's translation result is evaluated, and the candidate with the greatest degree of association is taken as the translation result of the target word. The target word is thus translated in the context of the text content, which improves the accuracy and fluency of the translation.
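Steps s71 to s74 can be sketched in Python as follows. The translation model is represented by an arbitrary callable, and the degree of association by a lookup table keyed on (candidate, adjacent translation) pairs; both are hypothetical stand-ins, since the source does not specify the model interface or how the degree of association is computed.

```python
def encode(pieces, piece_ids):
    """s71: map each word piece to its numeric id."""
    return [piece_ids[piece] for piece in pieces]


def choose_translation(candidates, neighbor_translation, association):
    """s74: pick the candidate most associated with the adjacent word's translation."""
    if len(candidates) == 1:
        return candidates[0]
    return max(candidates,
               key=lambda c: association.get((c, neighbor_translation), 0.0))


def translate_word(pieces, piece_ids, model, neighbor_translation, association):
    """Run s71 to s74 with a caller-supplied model (any callable taking a list of ids)."""
    candidates = model(encode(pieces, piece_ids))          # s71 and s72
    return choose_translation(candidates, neighbor_translation, association)  # s73 and s74
```

Passing the model in as a callable keeps the sketch independent of any particular NMT library; a real system would supply its own encoder and decoder here.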
The data processing method is explained below using a translation application as an example. A translation application is installed on the electronic device and can translate words, phrases and sentences in any language; the following describes translating an English word into Chinese. As shown in Fig. 2, the data processing method includes the following steps 1 to 3.
1. Obtain the target word. When a user needs a translation, the user may perform a touch operation on the translation application. The electronic device detects the touch operation, starts the translation application and displays its interface. The interface may include a text content input box 21, which lets the user edit text content and receives the text the user produces. After the user enters text content in the input box 21, the electronic device can obtain the target word from that text content.
2. Apply format processing and segmentation to the target word. The electronic device obtains the format type of the words in the word fragment dictionary; if it is lowercase, the target word is preprocessed to lowercase. Specifically, if the target word is Persistently, the electronic device converts its uppercase letter to lowercase and adds the first identifier, which may be _u; the processed target word is then _u followed by persistently, that is, _upersistently. The electronic device takes the first identifier as the first word piece and segments the remaining letters to obtain the second word pieces, so the word pieces of _upersistently are _u+_pers+ist+ently. The symbol "+" merely separates word pieces and has no other meaning; _u+_pers+ist+ently contains four word pieces: _u, _pers, ist and ently. Similarly, if the target word is PERSISTENTLY, the electronic device converts its uppercase letters to lowercase and adds the second identifier, which may be _U; the processed target word is then _Upersistently. Taking the second identifier as the first word piece and segmenting the remaining letters gives the word pieces _U+_pers+ist+ently. If the target word is persistently, all of its letters are already lowercase, so no format processing is needed and the word is segmented directly into _pers+ist+ently. Persistently, PERSISTENTLY and persistently are the same word written with different format types, and comparing their segmentations shows that all three include _pers, ist and ently; that is, the three words share these word pieces. Preprocessing the target word thus allows more words to be described with fewer word pieces, improving the ability to describe words. As shown in Fig. 2, after obtaining the word pieces, the electronic device may display the word pieces 22 corresponding to the target word on the interface of the translation application.
3. Translate the target word. The at least one word piece is encoded to obtain encoded values; for example, each word piece is mapped to a number, such as mapping the word pieces _u, _pers, ist and ently to 1256. The encoded values are input into a translation model for translation to obtain at least one candidate translation, and the candidate translations 23 are output on the interface of the translation application. For example, if the target word is Persistently, the candidate translations of the target word (Chinese expressions, glossed here in English) are: again and again, always, by persistence, perseveringly, unflaggingly. The electronic device can obtain the translation result of the word adjacent to the target word in the text content, determine the translation result of the target word according to the translation result of the adjacent word and the candidate translations, and output the translation result 24, for example: by persistence. Of course, the user may also perform a selection operation on the candidate translations, and the electronic device can take the candidate translation selected by the user as the translation result of the target word.
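The format processing, segmentation and encoding of steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the vocabulary `PIECE_VOCAB`, the greedy longest-match splitting rule, the single-character fallback and the integer IDs are all hypothetical. Only the identifiers `_u` and `_U` and the pieces `_pers`, `ist` and `ently` come from the example in the text.

```python
# Identifiers taken from the example in the text; everything else is assumed.
FIRST_CAPITAL = "_u"   # first identifier: the first letter was a capital
ALL_CAPITALS = "_U"    # second identifier: every letter was a capital
PIECE_VOCAB = ["_pers", "ist", "ently"]               # hypothetical toy vocabulary
ID_TABLE = [FIRST_CAPITAL, ALL_CAPITALS] + PIECE_VOCAB


def add_case_marker(word):
    """Return (marker, lowercased word); marker is "" for lowercase input."""
    if word.isupper():           # checked first, so "A" counts as all capitals
        return ALL_CAPITALS, word.lower()
    if word[:1].isupper():
        return FIRST_CAPITAL, word.lower()
    return "", word


def split_pieces(lowered, vocab=PIECE_VOCAB):
    """Greedy longest-match split; a leading "_" marks the start of the word."""
    text = "_" + lowered
    pieces, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in vocab:
                pieces.append(text[start:end])
                start = end
                break
        else:
            # Assumed fallback: emit a single character when nothing matches.
            pieces.append(text[start])
            start += 1
    return pieces


def to_word_pieces(word):
    """First word piece is the marker (if any), followed by the second word pieces."""
    marker, lowered = add_case_marker(word)
    pieces = split_pieces(lowered)
    return [marker] + pieces if marker else pieces


def encode(pieces):
    """Map each piece to an integer ID; 0 is an assumed 'unknown piece' ID."""
    ids = {piece: index for index, piece in enumerate(ID_TABLE, start=1)}
    return [ids.get(piece, 0) for piece in pieces]


for word in ("Persistently", "PERSISTENTLY", "persistently"):
    pieces = to_word_pieces(word)
    print(word, "+".join(pieces), encode(pieces))
```

All three spellings end in the same three pieces, so the translation model only needs one representation of the stem plus two small case markers.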
An embodiment of the present invention provides a data processing device, which may be provided in an electronic device. Referring to Fig. 3, the device includes:
An acquiring unit 301, configured to obtain a target word from text content to be translated.
A processing unit 302, configured to preprocess the target word to obtain a processed target word, where the processed target word satisfies the condition for being segmented into word pieces.
A segmentation unit 303, configured to segment the processed target word into at least one word piece.
A translation unit 304, configured to translate the target word according to the at least one word piece to obtain a translation result.
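The data flow between the four units can be pictured with the following Python sketch. The class name `DataProcessingDevice`, its field names and the stand-in functions are hypothetical and exist only to show the order in which the units hand data to one another; they are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DataProcessingDevice:
    acquire: Callable[[str], str]            # acquiring unit 301
    preprocess: Callable[[str], str]         # processing unit 302
    segment: Callable[[str], List[str]]      # segmentation unit 303
    translate: Callable[[List[str]], str]    # translation unit 304

    def run(self, text_content: str) -> str:
        target_word = self.acquire(text_content)
        processed = self.preprocess(target_word)
        word_pieces = self.segment(processed)
        return self.translate(word_pieces)


# Stand-in units, chosen only so that the sketch runs end to end.
device = DataProcessingDevice(
    acquire=lambda text: text.split()[0],          # take the first word
    preprocess=str.lower,                          # lowercase only
    segment=lambda word: [word[:4], word[4:]],     # fixed split, for illustration
    translate=lambda pieces: "<translation of " + "".join(pieces) + ">",
)
print(device.run("Persistently working"))
```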
Optionally, the processing unit 302 is specifically configured to obtain a word piece dictionary and the format type of the words in the word piece dictionary, the word piece dictionary including a plurality of words; and to preprocess the target word according to the format type of the words in the word piece dictionary to obtain the processed target word.
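As a rough illustration of preprocessing according to the dictionary's format type, the Python sketch below first classifies the letter case of the dictionary words and then converts the target word to match. The function names and the "uppercase" and "mixed" categories are assumptions; the text itself only describes the lowercase type, and this sketch leaves out the identifiers that record the original case.

```python
def dictionary_format_type(words):
    """Classify the letter case used by the words in the dictionary."""
    if all(word.islower() for word in words):
        return "lowercase"
    if all(word.isupper() for word in words):
        return "uppercase"   # assumed category, not described in the text
    return "mixed"           # assumed category, not described in the text


def match_dictionary_format(target_word, format_type):
    """Convert the target word to the dictionary's format type."""
    if format_type == "lowercase":
        return target_word.lower()
    if format_type == "uppercase":
        return target_word.upper()
    return target_word       # assumed: leave the word unchanged otherwise


format_type = dictionary_format_type(["persistently", "apple"])
print(format_type, match_dictionary_format("Persistently", format_type))
```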
Optionally, the target word includes at least one letter, and the format type of the words in the word piece dictionary is the lowercase type. The processing unit 302 is specifically configured to: if the first letter of the target word is a capital letter, add a first identifier to the target word and convert the capital letter in the target word to lowercase to obtain the processed target word; and if all letters of the target word are capital letters, add a second identifier to the target word and convert the capital letters in the target word to lowercase to obtain the processed target word.
Optionally, the at least one word piece includes a first word piece and a second word piece. The segmentation unit 303 is specifically configured to determine the target identifier in the processed target word as the first word piece, the target identifier being the first identifier or the second identifier; and to segment the letters of the processed target word other than the target identifier to obtain the second word piece.
Optionally, the word piece dictionary further includes the word pieces corresponding to each word. The segmentation unit 303 is specifically configured to: if the word piece dictionary contains a word matching the processed target word, determine the word pieces corresponding to the matched word as the second word piece; and if the word piece dictionary contains no word matching the processed target word, determine the second word piece according to the frequency with which the target word has historically been used.
Optionally, the segmentation unit 303 is specifically configured to obtain the frequency with which the target word has historically been used; if that frequency is greater than a first preset frequency, determine the letters of the processed target word other than the target identifier as the second word piece; and if that frequency is less than or equal to the first preset frequency, obtain the frequency with which each letter of the processed target word other than the target identifier occurs in the text content, and segment the letters of the processed target word other than the target identifier according to those frequencies to obtain a plurality of second word pieces.
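The dictionary lookup and the frequency-based fallback described above could be arranged as in the following Python sketch. The text does not say how the letter frequencies determine the cut points, so the rule used here (start a new piece before any letter that occurs fewer than `rare_letter_count` times in the text content) is purely an assumed placeholder, as are the function name, parameter names and default thresholds.

```python
from collections import Counter


def second_word_pieces(letters, piece_dictionary, usage_counts, text_content,
                       first_preset_frequency=100, rare_letter_count=2):
    """Return the second word pieces for the letters left after the identifier."""
    if not letters:
        return []
    # Case 1: the dictionary already stores pieces for this word.
    if letters in piece_dictionary:
        return list(piece_dictionary[letters])
    # Case 2: a frequently used word is kept whole as one piece.
    if usage_counts.get(letters, 0) > first_preset_frequency:
        return [letters]
    # Case 3: a rarely used word is cut using letter frequencies in the text.
    # Assumed rule: a letter rarer than rare_letter_count starts a new piece.
    letter_counts = Counter(text_content.lower())
    pieces = [letters[0]]
    for letter in letters[1:]:
        if letter_counts[letter] < rare_letter_count:
            pieces.append(letter)
        else:
            pieces[-1] += letter
    return pieces


dictionary = {"persistently": ["_pers", "ist", "ently"]}
print(second_word_pieces("persistently", dictionary, {}, ""))
print(second_word_pieces("hello", {}, {"hello": 500}, "hello world"))
print(second_word_pieces("zebra", {}, {"zebra": 1}, "a zebra ate"))
```

Checking the dictionary first keeps known words on their stored segmentation, and only unseen, rarely used words fall through to the letter-level rule.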
Optionally, the processing unit 302 is specifically configured to obtain the frequency with which the target phrase corresponding to the target word has historically been used, the target phrase consisting of the target word in the text content and the words adjacent to the target word; if that frequency is less than a second preset frequency, obtain a word that matches the target word; and perform correction processing on the target word using the matched word to obtain the processed target word.
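A minimal Python sketch of this correction step follows. It assumes that the target phrase is the target word plus one following word, that historical phrase usage is available as a plain dictionary of counts, and that "a word that matches the target word" can be approximated by the closest spelling found with the standard library's `difflib.get_close_matches`. None of these choices is specified in the text, and the function and parameter names are hypothetical.

```python
import difflib


def correct_target_word(target_word, adjacent_word, phrase_counts, known_words,
                        second_preset_frequency=5):
    """Replace the target word with a close known word if its phrase is rare."""
    phrase = target_word + " " + adjacent_word
    if phrase_counts.get(phrase, 0) >= second_preset_frequency:
        return target_word        # the phrase is common enough: keep the word
    # Assumed matching rule: closest spelling with similarity of at least 0.8.
    matches = difflib.get_close_matches(target_word, known_words, n=1, cutoff=0.8)
    return matches[0] if matches else target_word


counts = {"persistently working": 40}
words = ["persistently", "consistently"]
print(correct_target_word("persistantly", "working", counts, words))
print(correct_target_word("persistently", "working", counts, words))
```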
Optionally, the translation unit 304 is specifically configured to encode the at least one word piece to obtain the encoded value of the at least one word piece; input the encoded value of the at least one word piece into a translation model for translation to obtain at least one candidate translation; obtain the translation result of the word adjacent to the target word in the text content; and determine the translation result of the target word according to the at least one candidate translation and the translation result of the adjacent word.
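The text does not state how the candidate translations and the adjacent word's translation result are combined. One hypothetical way to do it, sketched below in Python, is to score each candidate by how often it has been seen next to the adjacent word's translation and keep the best one. The co-occurrence table `pair_counts`, the function name and the sample values are assumptions made for illustration.

```python
def pick_translation(candidates, adjacent_translation, pair_counts):
    """Choose the candidate seen most often with the adjacent translation."""
    if not candidates:
        raise ValueError("at least one candidate translation is required")
    # max() keeps the earliest candidate on ties, preserving the model's order.
    return max(candidates,
               key=lambda candidate: pair_counts.get((candidate, adjacent_translation), 0))


candidates = ["again and again", "always", "by persistence"]
pair_counts = {("by persistence", "work"): 12, ("always", "work"): 3}
print(pick_translation(candidates, "work", pair_counts))
```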
In the embodiment of the present invention, the target word is preprocessed to obtain a processed target word that satisfies the condition for being segmented into word pieces. That is, the letters of the processed target word have the same format type, which avoids inconsistent segmentation of the same word caused by inconsistent format types and the resulting lower translation accuracy for the target word. Moreover, the processed target word contains no erroneous letters, which improves the accuracy of segmenting the processed target word. In addition, segmenting the processed target word into at least one word piece describes the processed target word using word pieces of smaller granularity, which improves the ability to describe words. Further, translating the target word using the at least one word piece to obtain the translation result can improve the translation accuracy for the target word.
An embodiment of the present invention provides an electronic device; refer to Fig. 4. The electronic device includes a processor 151, a user interface 152, a network interface 154 and a storage device 155, which are connected to one another through a bus 153.
The user interface 152 is used for human-computer interaction and may include a display screen, a keyboard and the like. The network interface 154 is used for communication with external devices. The storage device 155 is coupled to the processor 151 and stores various software programs and/or multiple sets of instructions. In a specific implementation, the storage device 155 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices or other non-volatile solid-state storage devices. The storage device 155 may store an operating system (hereinafter referred to as the system), for example an embedded operating system such as ANDROID, IOS, WINDOWS or LINUX. The storage device 155 may also store a network communication program, which can be used to communicate with one or more additional devices, one or more application servers, and one or more network devices. The storage device 155 may also store a user interface program, which can vividly display the content of an application through a graphical operation interface and receive the user's control operations on the application through input controls such as menus, dialog boxes and buttons. The storage device 155 may also store video data and the like.
In one embodiment, the storage device 155 may be used to store one or more instructions. The processor 151 can implement the data processing method when calling the one or more instructions; specifically, the processor 151 calls the one or more instructions to execute the following steps:
obtaining a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, the processed target word satisfying the condition for being segmented into word pieces;
segmenting the processed target word into at least one word piece;
translating the target word according to the at least one word piece to obtain a translation result.
Optionally, the processor 151 calls the one or more instructions to execute the following steps:
obtaining a word piece dictionary and the format type of the words in the word piece dictionary, the word piece dictionary including a plurality of words;
preprocessing the target word according to the format type of the words in the word piece dictionary to obtain the processed target word.
Optionally, the processor 151 calls the one or more instructions to execute the following steps:
if the first letter of the target word is a capital letter, adding a first identifier to the target word and converting the capital letter in the target word to lowercase to obtain the processed target word;
if all letters of the target word are capital letters, adding a second identifier to the target word and converting the capital letters in the target word to lowercase to obtain the processed target word.
Optionally, the processor 151 calls the one or more instructions to execute the following steps:
determining the target identifier in the processed target word as the first word piece, the target identifier being the first identifier or the second identifier;
segmenting the letters of the processed target word other than the target identifier to obtain the second word piece.
Optionally, the processor 151 calls the one or more instructions to execute the following steps:
if the word piece dictionary contains a word matching the processed target word, determining the word pieces corresponding to the matched word as the second word piece;
if the word piece dictionary contains no word matching the processed target word, determining the second word piece according to the frequency with which the target word has historically been used.
Optionally, the processor 151 calls the one or more instructions to execute the following steps:
obtaining the frequency with which the target word has historically been used;
if that frequency is greater than a first preset frequency, determining the letters of the processed target word other than the target identifier as the second word piece;
if that frequency is less than or equal to the first preset frequency, obtaining the frequency with which each letter of the processed target word other than the target identifier occurs in the text content, and segmenting the letters of the processed target word other than the target identifier according to those frequencies to obtain a plurality of second word pieces.
Optionally, the processor 151 calls the one or more instructions to execute the following steps:
obtaining the frequency with which the target phrase corresponding to the target word has historically been used, the target phrase consisting of the target word in the text content and the words adjacent to the target word;
if that frequency is less than a second preset frequency, obtaining a word that matches the target word;
performing correction processing on the target word using the matched word to obtain the processed target word.
Optionally, the processor 151 calls the one or more instructions to execute the following steps:
encoding the at least one word piece to obtain the encoded value of the at least one word piece;
inputting the encoded value of the at least one word piece into a translation model for translation to obtain at least one candidate translation;
obtaining the translation result of the word adjacent to the target word in the text content;
determining the translation result of the target word according to the at least one candidate translation and the translation result of the adjacent word.
In the embodiment of the present invention, the target word is preprocessed to obtain a processed target word that satisfies the condition for being segmented into word pieces. That is, the letters of the processed target word have the same format type, which avoids inconsistent segmentation of the same word caused by inconsistent format types and the resulting lower translation accuracy for the target word. Moreover, the processed target word contains no erroneous letters, which improves the accuracy of segmenting the processed target word. In addition, segmenting the processed target word into at least one word piece describes the processed target word using word pieces of smaller granularity, which improves the ability to describe words. Further, translating the target word using the at least one word piece to obtain the translation result can improve the translation accuracy for the target word.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. For the implementation by which the program solves the problem and for its beneficial effects, refer to the implementation and beneficial effects of the data processing method described with reference to Fig. 1 above; repeated details are not described again.
What is disclosed above is only some embodiments of the present invention and certainly cannot limit the scope of the rights of the present invention; equivalent changes made according to the claims of the present invention therefore still fall within the scope covered by the present invention.
Claims (10)
1. A data processing method, characterized in that the method comprises:
obtaining a target word from text content to be translated;
preprocessing the target word to obtain a processed target word, the processed target word satisfying the condition for being segmented into word pieces;
segmenting the processed target word into at least one word piece;
translating the target word according to the at least one word piece to obtain a translation result.
2. The method according to claim 1, characterized in that the preprocessing includes format processing, and the preprocessing the target word to obtain a processed target word comprises:
obtaining a word piece dictionary and the format type of the words in the word piece dictionary, the word piece dictionary including a plurality of words;
preprocessing the target word according to the format type of the words in the word piece dictionary to obtain the processed target word.
3. The method according to claim 2, characterized in that the target word includes at least one letter, and the format type of the words in the word piece dictionary is the lowercase type; the preprocessing the target word according to the format type of the words in the word piece dictionary to obtain the processed target word comprises:
if the first letter of the target word is a capital letter, adding a first identifier to the target word and converting the capital letter in the target word to lowercase to obtain the processed target word;
if all letters of the target word are capital letters, adding a second identifier to the target word and converting the capital letters in the target word to lowercase to obtain the processed target word.
4. The method according to claim 3, characterized in that the at least one word piece includes a first word piece and a second word piece, and the segmenting the processed target word into at least one word piece comprises:
determining the target identifier in the processed target word as the first word piece, the target identifier being the first identifier or the second identifier;
segmenting the letters of the processed target word other than the target identifier to obtain the second word piece.
5. The method according to claim 4, characterized in that the word piece dictionary further includes the word pieces corresponding to each word, and the segmenting the letters of the processed target word other than the target identifier to obtain the second word piece comprises:
if the word piece dictionary contains a word matching the processed target word, determining the word pieces corresponding to the matched word as the second word piece;
if the word piece dictionary contains no word matching the processed target word, determining the second word piece according to the frequency with which the target word has historically been used.
6. The method according to claim 5, characterized in that the determining the second word piece according to the frequency with which the target word has historically been used comprises:
obtaining the frequency with which the target word has historically been used;
if that frequency is greater than a first preset frequency, determining the letters of the processed target word other than the target identifier as the second word piece;
if that frequency is less than or equal to the first preset frequency, obtaining the frequency with which each letter of the processed target word other than the target identifier occurs in the text content, and segmenting the letters of the processed target word other than the target identifier according to those frequencies to obtain a plurality of second word pieces.
7. The method according to any one of claims 1 to 6, characterized in that the preprocessing includes correction processing, and the preprocessing the target word to obtain a processed target word comprises:
obtaining the frequency with which the target phrase corresponding to the target word has historically been used, the target phrase consisting of the target word in the text content and the words adjacent to the target word;
if that frequency is less than a second preset frequency, obtaining a word that matches the target word;
performing correction processing on the target word using the matched word to obtain the processed target word.
8. The method according to any one of claims 1 to 6, characterized in that the translating the target word according to the at least one word piece to obtain a translation result comprises:
encoding the at least one word piece to obtain the encoded value of the at least one word piece;
inputting the encoded value of the at least one word piece into a translation model for translation to obtain at least one candidate translation;
obtaining the translation result of the word adjacent to the target word in the text content;
determining the translation result of the target word according to the at least one candidate translation and the translation result of the adjacent word.
9. A data processing device, characterized in that the device comprises:
an acquiring unit, configured to obtain a target word from text content to be translated;
a processing unit, configured to preprocess the target word to obtain a processed target word, the processed target word satisfying the condition for being segmented into word pieces;
a segmentation unit, configured to segment the processed target word into at least one word piece;
a translation unit, configured to translate the target word according to the at least one word piece to obtain a translation result.
10. An electronic device, including an input device and an output device, characterized by further comprising:
a processor, adapted to implement one or more instructions; and
a computer storage medium storing one or more instructions, the one or more instructions being adapted to be loaded by the processor to execute the method according to claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910702513.XA CN110414013B (en) | 2019-07-31 | 2019-07-31 | Data processing method and device and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110414013A true CN110414013A (en) | 2019-11-05 |
CN110414013B CN110414013B (en) | 2024-06-21 |
Family
ID=68364860
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910702513.XA Active CN110414013B (en) | 2019-07-31 | 2019-07-31 | Data processing method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110414013B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1912865A (en) * | 2005-08-10 | 2007-02-14 | 英业达股份有限公司 | Hermeneutical system and method |
US20090063127A1 (en) * | 2007-09-03 | 2009-03-05 | Tatsuya Izuha | Apparatus, method, and computer program product for creating data for learning word translation |
CN106156007A (en) * | 2015-03-24 | 2016-11-23 | 吕海港 | A kind of English-Chinese statistical machine translation method of word original shape |
CN107015971A (en) * | 2017-03-30 | 2017-08-04 | 唐亮 | The post-processing module of multilingual intelligence pretreatment real-time statistics machine translation system |
CN107038160A (en) * | 2017-03-30 | 2017-08-11 | 唐亮 | The pretreatment module of multilingual intelligence pretreatment real-time statistics machine translation system |
CN108563644A (en) * | 2018-03-29 | 2018-09-21 | 河南工学院 | A kind of English Translation electronic system |
CN108763222A (en) * | 2018-05-17 | 2018-11-06 | 腾讯科技(深圳)有限公司 | Detection, interpretation method and device, server and storage medium are translated in a kind of leakage |
Also Published As
Publication number | Publication date |
---|---|
CN110414013B (en) | 2024-06-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |