CN109033042A - BPE encoding method and system based on Chinese sub-word units, and machine translation system - Google Patents
- Publication number
- CN109033042A (application CN201810687736.9A)
- Authority
- CN
- China
- Prior art keywords
- chinese
- word
- sub
- chinese character
- bpe
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention belongs to the technical field of computer software and discloses a BPE encoding method and system based on Chinese sub-word units, and a machine translation system. Chinese characters are split in the way the Wubi (five-stroke) input method decomposes characters into radicals; each word in a sentence is first broken down into smaller, more common sub-word units; and out-of-vocabulary words are translated by translating their sub-word parts. The invention applies additional processing to Chinese characters before the BPE encoding step. Consequently, in practical use its parameter scale and running time are comparable to those of existing BPE, and practical complexity is not increased excessively.
Description
Technical field
The invention belongs to the technical field of computer software, and in particular relates to a BPE encoding method and system based on Chinese sub-word units, and a machine translation system.
Background art
At present, the prior art commonly used in the industry is as follows. Machine translation is the process of using computer algorithms to automatically translate a sentence in a source language into a sentence in a target language. It is a research direction of artificial intelligence with great scientific and practical value. With the deepening of globalization and the rapid development of the Internet, machine translation technology plays an increasingly important role in political, economic, social, and cultural exchange at home and abroad. As computing power has grown and big data has been applied, deep learning has seen wider use, and neural machine translation (NMT), which is based on deep learning, has attracted more and more attention.
In the NMT field, one of the most commonly used translation models is the attention-based encoder-decoder model. Its main idea is as follows. The encoder converts the sentence to be translated (hereinafter the 'source sentence') into a sequence of word vectors that serves as the input of a recurrent neural network, and outputs a dense vector of fixed length, called the context vector. The decoder then takes the context vector output by the encoder as input and, using another recurrent neural network combined with a Softmax classifier, outputs a sequence of word vectors in the target language. Finally, a dictionary maps the word vectors to target-language words, completing the translation. The encoder-decoder framework is a core idea of deep learning and is likewise the basic framework commonly used by NMT systems. In current mainstream NMT systems, both the encoder and the decoder use RNNs (recurrent neural networks), which are well suited to sequential information: they can handle input of arbitrary length and convert it into a vector of fixed dimension.
The most important defect of this encoder-decoder framework is that it translates out-of-vocabulary (OOV) Chinese words poorly. When a translation model is built, source-side and target-side vocabularies are first formed from the training corpus. Because computing power is limited, the vocabulary size is capped (for example, 30000 words in the source-language vocabulary), and every word outside the vocabulary is replaced by the special symbol "UNK". This causes a serious problem for NMT: when the sentence to be translated contains words that are not in the vocabulary, UNK appears in the translation and makes it hard to read. BPE encoding was later proposed to solve the OOV problem, but in practice its effect on Chinese words is not obvious, and it can even reduce translation quality. The present invention therefore proposes a method that addresses these problems.
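The encoder-decoder computation described in this background can be written compactly. The following is a sketch of one common textbook formulation, not a formula taken from the patent: the symbols $x_t$, $h_t$, $s_j$, $c$, $W_o$, $b_o$ and the functions $f_{\text{enc}}$, $f_{\text{dec}}$ are notational assumptions introduced here for illustration.

```latex
% Encoder: an RNN reads the source word vectors x_1, ..., x_T.
h_t = f_{\text{enc}}(x_t, h_{t-1}), \qquad t = 1, \dots, T
% The fixed-length context vector is taken from the final encoder state.
c = h_T
% Decoder: a second RNN is conditioned on the context vector.
s_j = f_{\text{dec}}(y_{j-1}, s_{j-1}, c)
% A Softmax classifier over the target vocabulary picks the next word.
P(y_j \mid y_{<j}, x) = \operatorname{softmax}(W_o s_j + b_o)
```

Because the Softmax ranges only over the target vocabulary, a word outside that vocabulary can never be produced directly, which is where the UNK symbol enters. In attention-based variants the single vector $c$ is typically replaced by a context vector recomputed at every decoding step.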
In summary, the problem with the prior art is that, in the encoder-decoder framework, when the sentence to be translated contains words that are not in the vocabulary, UNK is generated in the translation, which makes the translation hard to read and lowers translation quality.
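The vocabulary cap and the resulting UNK substitution can be illustrated with a minimal Python sketch. The function names `build_vocab` and `replace_oov` and the toy corpus are assumptions made here for illustration; the patent gives no implementation.

```python
from collections import Counter

UNK = "UNK"  # the placeholder symbol named in the text

def build_vocab(sentences, max_size):
    """Keep only the max_size most frequent tokens of a tokenized corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}

def replace_oov(sentence, vocab):
    """Map every token that is outside the vocabulary to the UNK symbol."""
    return [tok if tok in vocab else UNK for tok in sentence]

# Toy corpus (hypothetical); a real vocabulary would hold e.g. 30000 words.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
vocab = build_vocab(corpus, max_size=3)
print(replace_oov(["the", "dog", "sat"], vocab))  # ['the', 'UNK', 'sat']
```

With the cap set to three words, "dog" falls outside the vocabulary and is lost to the model, which is the readability problem described above.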
Summary of the invention
In view of the problems in the prior art, the present invention provides a BPE encoding method and system based on Chinese sub-word units, and a machine translation system.
The invention is realized as follows: a BPE encoding method based on Chinese sub-word units, in which Chinese characters are split in the way the Wubi input method decomposes characters into radicals.
Further, splitting Chinese characters in this way includes: preprocessing such as corpus cleaning; performing Chinese word segmentation on every sentence in the corpus; and mapping each word or Chinese character to the characters that correspond to its Wubi radicals.
Further, after the Chinese characters have been split in this way, the following is required: each word in a sentence is first broken down into smaller, more common sub-word units; out-of-vocabulary words are then translated by translating their sub-word parts.
Another object of the present invention is to provide a BPE encoding system based on Chinese sub-word units that implements the BPE encoding method based on Chinese sub-word units, the system comprising:
a splitting module, for splitting Chinese characters in the way the Wubi input method decomposes characters into radicals;
a disassembling module, for first breaking each word in a sentence down into smaller, more common sub-word units;
a translation module, for translating out-of-vocabulary words by translating their sub-word parts.
Another object of the present invention is to provide a computer program that implements the BPE encoding method based on Chinese sub-word units.
Another object of the present invention is to provide an information data processing terminal that implements the BPE encoding method based on Chinese sub-word units.
Another object of the present invention is to provide a computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to execute the BPE encoding method based on Chinese sub-word units.
In summary, the advantages and positive effects of the present invention are as follows: the invention applies additional processing to Chinese characters before the BPE encoding step. Consequently, in practical use its parameter scale and running time are comparable to those of existing BPE, and practical complexity is not increased excessively.
Brief description of the drawings
Fig. 1 is a flow chart of the BPE encoding method based on Chinese sub-word units provided by an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of the BPE encoding system based on Chinese sub-word units provided by an embodiment of the present invention;
In the figures: 1, splitting module; 2, disassembling module; 3, translation module.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
The present invention improves the existing BPE encoding approach. It provides a better and more theoretically grounded method of generating Chinese BPE codes, so that BPE encoding of Chinese can solve the Chinese out-of-vocabulary problem. It makes full use of the information in Chinese characters while avoiding the shortcomings of conventional BPE encoding. Because the Wubi input method already decomposes Chinese characters into radicals, and those radicals are easily converted into the corresponding English letters of the Wubi input method, the problem of splitting Chinese is solved and the occurrence of Chinese out-of-vocabulary words is effectively reduced.
The application principle of the invention is explained in detail below with reference to the drawings.
As shown in Fig. 1, the BPE encoding method based on Chinese sub-word units provided by an embodiment of the present invention comprises the following steps:
S101: splitting Chinese characters in the way the Wubi input method decomposes characters into radicals;
S102: for each word in a sentence, first breaking the word down into smaller, more common sub-word units;
S103: translating out-of-vocabulary words by translating their sub-word parts.
In a preferred embodiment of the invention, step S101 specifically includes: preprocessing such as corpus cleaning; performing Chinese word segmentation on every sentence in the corpus; and mapping each word (or Chinese character) to the characters corresponding to its Wubi radicals.
The application principle of the invention is described in detail below with reference to a specific embodiment.
Take the sentence 我爱中国 ("I love China") as an example.
Chinese word segmentation is performed first, giving: 我 爱 中国.
Then each word (or Chinese character) is converted into the letters that correspond to its radicals in the Wubi input method:
TRNT EPDC KHK LGYI.
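The character-to-code conversion in this example can be sketched in Python as a table lookup. This is an illustration under stated assumptions: the example sentence is taken to be 我爱中国, the reading consistent with the four codes above; `WUBI_CODES` is a hypothetical four-entry excerpt of a Wubi code table; and `to_wubi` is a name introduced here, not one from the patent.

```python
# Hypothetical excerpt of a Wubi code table; a real system would load a full table.
WUBI_CODES = {"我": "trnt", "爱": "epdc", "中": "khk", "国": "lgyi"}

def to_wubi(words, table=WUBI_CODES):
    """Replace every Chinese character of every segmented word with its Wubi letter code."""
    codes = []
    for word in words:
        for ch in word:
            # Characters without an entry (digits, Latin letters) pass through unchanged.
            codes.append(table.get(ch, ch))
    return codes

segmented = ["我", "爱", "中国"]  # output of the word segmentation step
print(" ".join(to_wubi(segmented)).upper())  # TRNT EPDC KHK LGYI
```

The sketch emits one code per character, matching the four codes shown in the example; how word boundaries are carried through to the BPE step is not specified in the text and is left open here.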
Next, the result above is encoded with the conventional BPE encoding method.
A conventional BPE encoding example is shown in the following table:
In this way, BPE encoding converts beijing into bei@@jing, and nanjing into nan@@jing.
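How conventional BPE arrives at a segmentation such as bei@@jing can be sketched as follows. This is a minimal illustration of the usual merge-learning procedure (repeatedly fusing the most frequent adjacent symbol pair), not the patent's implementation; the function names, the toy word frequencies, and the choice of seven merges are all assumptions made for the example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Fuse every adjacent occurrence of `pair` in a symbol sequence."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(word_freqs, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        merges.append(best)
        vocab = {tuple(merge_pair(s, best)): f for s, f in vocab.items()}
    return merges

def apply_bpe(word, merges):
    """Segment a word with the learned merges; '@@' marks a non-final piece."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return [s + "@@" for s in symbols[:-1]] + symbols[-1:]

# Toy frequencies (hypothetical), chosen so that 'bei', 'nan' and 'jing' become units.
toy = {"beijing": 4, "nanjing": 4, "bei": 3, "nan": 3, "jing": 3}
merges = learn_bpe(toy, num_merges=7)
print(" ".join(apply_bpe("beijing", merges)))  # bei@@ jing
print(" ".join(apply_bpe("nanjing", merges)))  # nan@@ jing
```

The sketch prints the pieces separated by a space, a common convention for BPE output; the text above writes the same segmentation without the space. Under the proposed method, the same procedure would run over Wubi letter codes instead of pinyin-like Latin words.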
The BPE encoding method based on Chinese sub-word units provided by an embodiment of the present invention splits Chinese text into the English letters that correspond to its radicals in the Wubi input method, thereby converting Chinese characters into the form of words composed of English letters. This improved Chinese BPE encoding mechanism reduces the number of Chinese out-of-vocabulary words and can therefore effectively improve translation quality.
As shown in Fig. 2, the BPE encoding system based on Chinese sub-word units provided by an embodiment of the present invention includes:
a splitting module 1, for splitting Chinese characters in the way the Wubi input method decomposes characters into radicals;
a disassembling module 2, for first breaking each word in a sentence down into smaller, more common sub-word units;
a translation module 3, for translating out-of-vocabulary words by translating their sub-word parts.
The above embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented wholly or partly in the form of a computer program product, the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example a floppy disk, hard disk, or magnetic tape), an optical medium (for example a DVD), or a semiconductor medium (for example a solid state disk (SSD)).
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (7)
1. A BPE encoding method based on Chinese sub-word units, characterized in that the method comprises:
corpus cleaning;
splitting Chinese words into sub-word units;
splitting Chinese characters in the way the Wubi input method decomposes characters into radicals;
post-processing by means of BPE encoding.
2. The BPE encoding method based on Chinese sub-word units according to claim 1, characterized in that, after the Chinese characters are split in the way the Wubi input method decomposes characters into radicals, each word or Chinese character is represented by the English characters corresponding to its Wubi radicals.
3. The BPE encoding method based on Chinese sub-word units according to claim 1, characterized in that, after the Chinese characters are split in the way the Wubi input method decomposes characters into radicals, the following is required: each word in a sentence is first broken down into smaller, more common sub-word units; out-of-vocabulary words are then translated by translating their sub-word parts.
4. A BPE encoding system based on Chinese sub-word units that implements the BPE encoding method based on Chinese sub-word units according to claim 1, characterized in that the BPE encoding system based on Chinese sub-word units comprises:
a preprocessing module, for performing corpus cleaning;
a disassembling module, for first breaking each word in a sentence down into smaller, more common sub-word units;
a splitting module, for splitting Chinese characters in the way the Wubi input method decomposes characters into radicals;
a translation module, for translating out-of-vocabulary words by translating their sub-word parts.
5. A computer program implementing the BPE encoding method based on Chinese sub-word units according to any one of claims 1 to 3.
6. An information data processing terminal implementing the BPE encoding method based on Chinese sub-word units according to any one of claims 1 to 3.
7. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to execute the BPE encoding method based on Chinese sub-word units according to any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810687736.9A CN109033042A (en) | 2018-06-28 | 2018-06-28 | BPE encoding method and system based on Chinese sub-word units, and machine translation system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109033042A (en) | 2018-12-18 |
Family
ID=65520753
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810687736.9A Pending CN109033042A (en) | BPE encoding method and system based on Chinese sub-word units, and machine translation system | 2018-06-28 | 2018-06-28 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109033042A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110287483A (en) * | 2019-06-06 | 2019-09-27 | 广东技术师范大学 | A kind of unknown word identification method and system using five-stroke etymon deep learning |
CN112861516A (en) * | 2021-01-21 | 2021-05-28 | 昆明理工大学 | Experimental method for verifying influence of common sub-words on XLM translation model effect |
US11868737B2 (en) | 2020-04-24 | 2024-01-09 | Direct Cursus Technology L.L.C | Method and server for processing text sequence for machine processing task |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090326916A1 (en) * | 2008-06-27 | 2009-12-31 | Microsoft Corporation | Unsupervised chinese word segmentation for statistical machine translation |
CN107608973A (en) * | 2016-07-12 | 2018-01-19 | 华为技术有限公司 | A kind of interpretation method and device based on neutral net |
CN107967262A (en) * | 2017-11-02 | 2018-04-27 | 内蒙古工业大学 | A kind of neutral net covers Chinese machine translation method |
Non-Patent Citations (2)
Title |
---|
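MIXUETAN et al.: "wubi2en: Character-level Chinese-English Translation", 《COMPUTER SCIENCE》 * |
HAN Dong (韩冬) et al.: "Analysis of out-of-vocabulary word translation in neural machine translation based on sub-word units" (基于子字单元的神经机器翻译未登录词翻译分析), 《中文信息学报》 (Journal of Chinese Information Processing) * |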
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110287483A (en) * | 2019-06-06 | 2019-09-27 | 广东技术师范大学 | A kind of unknown word identification method and system using five-stroke etymon deep learning |
CN110287483B (en) * | 2019-06-06 | 2023-12-05 | 广东技术师范大学 | Unregistered word recognition method and system utilizing five-stroke character root deep learning |
US11868737B2 (en) | 2020-04-24 | 2024-01-09 | Direct Cursus Technology L.L.C | Method and server for processing text sequence for machine processing task |
CN112861516A (en) * | 2021-01-21 | 2021-05-28 | 昆明理工大学 | Experimental method for verifying influence of common sub-words on XLM translation model effect |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181218 |