CN109033042A - BPE encoding method and system based on Chinese sub-word units, and machine translation system - Google Patents

Info

Publication number
CN109033042A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810687736.9A
Other languages
Chinese (zh)
Inventor
汪鸣
汪一鸣
谭新
熊德意
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese Translation Language Through Polytron Technologies Inc
Original Assignee
Chinese Translation Language Through Polytron Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese Translation Language Through Polytron Technologies Inc
Priority to CN201810687736.9A
Publication of CN109033042A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of computer software and discloses a BPE encoding method and system based on Chinese sub-word units, and a machine translation system. Chinese characters are split according to the way the Wubi (five-stroke) radical input method enters them; each word in a sentence is first decomposed into smaller, more common sub-word units; and out-of-vocabulary words are translated by translating their sub-word parts. The invention applies additional processing to Chinese characters before the BPE encoding step. In practical applications, its parameter scale and running time are therefore comparable to those of current BPE, and it does not excessively increase complexity in practice.

Description

BPE encoding method and system based on Chinese sub-word units, and machine translation system
Technical field
The invention belongs to the technical field of computer software, and in particular relates to a BPE encoding method and system based on Chinese sub-word units, and to a machine translation system.
Background
At present, the prior art commonly used in the industry is as follows. Machine translation is the process of using a computer algorithm to automatically translate a sentence in a source language into a sentence in another, target language. It is a research direction of artificial intelligence with significant scientific and practical value. As globalization deepens and the Internet develops rapidly, machine translation technology plays an increasingly important role in political, economic, social, and cultural exchange at home and abroad. With growing computing power and the use of big data, deep learning has been applied ever more widely, and neural machine translation (NMT) based on deep learning has attracted more and more attention.
In the NMT field, one of the most commonly used translation models is the encoder-decoder model with an attention mechanism (attention-based). Its main idea is as follows. The sentence to be translated (hereinafter the 'source sentence') is converted into a sequence of word vectors that serves as the input of a recurrent neural network; this encoder outputs a dense vector of fixed length, called the context vector. The decoder then takes the context vector output by the encoder as its input and uses another recurrent neural network combined with a Softmax classifier to output a sequence of word vectors in the target language. Finally, a dictionary maps these word vectors to target-language words, completing the translation. The encoder-decoder framework is a core idea in deep learning, and it is also the basic framework commonly used by NMT systems. In current mainstream NMT systems, both the encoder and the decoder use RNNs (recurrent neural networks). RNNs have an inherent advantage in processing sequential information: they can handle input of arbitrary length and convert it into a vector of fixed dimension.
The main defect of this encoder-decoder framework is that it translates out-of-vocabulary Chinese words poorly. When a translation model is built, source-side and target-side vocabularies are first formed from the training corpus. Because of limits on computing power, the vocabulary size is capped (for example, the source-language vocabulary contains 30000 words), and every word not in the vocabulary is replaced by the special symbol "UNK". This causes a serious problem in NMT output: when the sentence to be translated contains words that are not in the vocabulary, UNK appears in the translation, which makes the translation hard to read. BPE encoding was later proposed to solve the out-of-vocabulary word problem, but in practice this method has no obvious effect on Chinese words and can even reduce translation quality. The present invention therefore proposes a method that can substantially resolve the above problems.
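The vocabulary cap and UNK substitution described above can be illustrated with a minimal sketch. The toy corpus, the cap of 3, and the helper names `build_vocab` and `replace_oov` are assumptions made for illustration; they do not come from the patent.

```python
from collections import Counter

UNK = "UNK"  # special symbol used for every out-of-vocabulary word

def build_vocab(sentences, max_size):
    """Keep only the max_size most frequent tokens of the training corpus."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    return {tok for tok, _ in counts.most_common(max_size)}

def replace_oov(sentence, vocab):
    """Replace every token that is not in the vocabulary with UNK."""
    return " ".join(tok if tok in vocab else UNK for tok in sentence.split())

# Toy training corpus (hypothetical); a real system would cap at e.g. 30000.
corpus = ["the cat sat", "the cat ran", "the dog sat"]
vocab = build_vocab(corpus, max_size=3)
print(replace_oov("the dog ran", vocab))  # prints "the UNK UNK"
```

Any word outside the capped vocabulary collapses to the same symbol, which is why the decoder can only emit UNK for it.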
In conclusion problem of the existing technology is: encoder-decoder frame contains when in sentence to be translated When word not in vocabulary, UNK can be generated in translation, causes translation readability not high, declines translation quality.
Summary of the invention
In view of the problems in the prior art, the invention provides a BPE encoding method and system based on Chinese sub-word units, and a machine translation system.
The invention is realized as follows: a BPE encoding method based on Chinese sub-word units, in which Chinese characters are split according to the way the Wubi (five-stroke) radical input method enters them.
Further, splitting Chinese characters in this way includes: preprocessing such as corpus cleaning; performing Chinese word segmentation on every sentence in the corpus; and mapping each word or Chinese character to the characters corresponding to its Wubi radicals.
Further, after the Chinese characters have been split in this way, each word in a sentence is first decomposed into smaller, more common sub-word units, and out-of-vocabulary words are then translated by translating their sub-word parts.
Another object of the invention is to provide a BPE encoding system based on Chinese sub-word units that implements the above BPE encoding method, the system including:
a splitting module, for splitting Chinese characters according to the way the Wubi radical input method enters them;
a decomposition module, for first decomposing each word in a sentence into smaller, more common sub-word units;
a translation module, for translating out-of-vocabulary words by translating their sub-word parts.
Another object of the invention is to provide a computer program that implements the BPE encoding method based on Chinese sub-word units.
Another object of the invention is to provide an information data processing terminal that implements the BPE encoding method based on Chinese sub-word units.
Another object of the invention is to provide a computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute the BPE encoding method based on Chinese sub-word units.
In summary, the advantages and positive effects of the invention are as follows: the invention applies additional processing to Chinese characters before the BPE encoding step. In practical applications, its parameter scale and running time are therefore comparable to those of current BPE, and it does not excessively increase complexity in practice.
Brief description of the drawings
Fig. 1 is a flowchart of the BPE encoding method based on Chinese sub-word units provided by an embodiment of the invention.
Fig. 2 is a structural schematic diagram of the BPE encoding system based on Chinese sub-word units provided by an embodiment of the invention.
In the figures: 1, splitting module; 2, decomposition module; 3, translation module.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the invention clearer, the invention is further described below with reference to embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
The invention improves on the existing BPE encoding scheme. It provides a better, more theoretically grounded method of generating Chinese BPE encodings, so that BPE encoding of Chinese can address the Chinese out-of-vocabulary word problem. While making full use of Chinese character information, it avoids the shortcomings of traditional BPE encoding. Because the Wubi input method already defines how each Chinese character is built from radicals, and each radical is easily converted to its corresponding English letter in Wubi, the problem of splitting Chinese is solved and the occurrence of Chinese out-of-vocabulary words is effectively reduced.
The application principle of the invention is explained in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the BPE encoding method based on Chinese sub-word units provided by an embodiment of the invention includes the following steps:
S101: Chinese characters are split according to the way the Wubi radical input method enters them;
S102: each word in a sentence is first decomposed into smaller, more common sub-word units;
S103: out-of-vocabulary words are translated by translating their sub-word parts.
In a preferred embodiment of the invention, step S101 specifically includes: preprocessing such as corpus cleaning; performing Chinese word segmentation on every sentence in the corpus; and mapping each word (or Chinese character) to the characters corresponding to its Wubi radicals.
The application principle of the invention is further described below with reference to a specific embodiment.
Take the example sentence: I like China.
Chinese word segmentation is performed first, giving: I like China.
Then each word (or Chinese character) is converted to the letters that correspond to its radicals in Wubi:
TRNT EPDC KHK LGYI.
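The conversion step can be sketched as a table lookup. The translated text no longer shows the Chinese example sentence; the characters below (我 / 爱 / 中国) are an assumption reconstructed from the four Wubi codes given above. The four-entry table and the function name `to_wubi` are likewise illustrative, since a real system would load a complete Wubi code table.

```python
# Hypothetical four-entry excerpt of a Wubi code table (assumed to follow the
# common 86 version); a full table would cover every character in the corpus.
WUBI_CODES = {"我": "TRNT", "爱": "EPDC", "中": "KHK", "国": "LGYI"}

def to_wubi(words, table=WUBI_CODES):
    """Map each character of each segmented word to its Wubi letter code.

    Characters missing from the table are passed through unchanged.
    """
    codes = []
    for word in words:
        for ch in word:
            codes.append(table.get(ch, ch))
    return " ".join(codes)

# Output of a Chinese word segmenter for the example sentence (assumed).
segmented = ["我", "爱", "中国"]
print(to_wubi(segmented))  # prints "TRNT EPDC KHK LGYI"
```

After this step every Chinese character is a short string of Latin letters, so it can be processed by BPE in the same way as an English word.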
Next, the resulting letter sequence is encoded with the traditional BPE encoding scheme.
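As a rough illustration of how traditional BPE learns its merge operations over such letter codes, here is a minimal sketch. The token frequencies, the function name `learn_bpe`, and the tie-breaking rule (highest count, then lexicographically largest pair) are assumptions for illustration; production BPE tools may break ties differently.

```python
from collections import Counter

def learn_bpe(token_counts, num_merges):
    """Learn an ordered list of BPE merge operations from token frequencies."""
    # Start with every token split into single characters.
    vocab = {tuple(tok): n for tok, n in token_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by token frequency.
        pairs = Counter()
        for symbols, n in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += n
        if not pairs:
            break
        # Assumed tie-break: highest count, then largest pair in sort order.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Rewrite every token with the chosen pair fused into one symbol.
        new_vocab = {}
        for symbols, n in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_vocab[key] = new_vocab.get(key, 0) + n
        vocab = new_vocab
    return merges

# Hypothetical frequencies of Wubi-coded characters in a training corpus.
counts = {"TRNT": 5, "EPDC": 2, "KHK": 3, "LGYI": 3}
merges = learn_bpe(counts, num_merges=3)
print(merges)
```

With these made-up counts the three merges rebuild the most frequent code, TRNT, one pair at a time, while rarer codes stay split into smaller units.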
An example of traditional BPE encoding is shown in the following table:
In this way, BPE encoding converts beijing into bei@@jing, and nanjing into nan@@jing.
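The beijing / nanjing example can be reproduced with a minimal sketch of applying an ordered list of merge operations. The merge list here is hand-written and hypothetical (it stands in for merges learned from a corpus), and `apply_bpe` is an illustrative name. Following a common convention, non-final pieces are marked with "@@" and pieces are separated by a space.

```python
def apply_bpe(word, merges, sep="@@"):
    """Segment a word by applying merge operations in their learned order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # fuse the adjacent pair into one symbol
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    # Mark every piece except the last as a word-internal fragment.
    return (sep + " ").join(symbols)

# Hypothetical merge list, assumed to have been learned from a corpus.
MERGES = [("j", "i"), ("n", "g"), ("ji", "ng"),
          ("b", "e"), ("be", "i"), ("n", "a"), ("na", "n")]

print(apply_bpe("beijing", MERGES))  # prints "bei@@ jing"
print(apply_bpe("nanjing", MERGES))  # prints "nan@@ jing"
```

Both words share the piece jing, so a word unseen in training that ends in the same piece can still be represented without falling back to UNK.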
The BPE encoding method based on Chinese sub-word units provided by the embodiment of the invention splits Chinese according to the English letters that correspond to radicals in Wubi, thereby converting Chinese characters into a format composed like English words. With this improved Chinese BPE encoding mechanism, the number of Chinese out-of-vocabulary words decreases, so translation quality can be effectively improved.
As shown in Fig. 2, the BPE encoding system based on Chinese sub-word units provided by an embodiment of the invention includes:
a splitting module 1, for splitting Chinese characters according to the way the Wubi radical input method enters them;
a decomposition module 2, for first decomposing each word in a sentence into smaller, more common sub-word units;
a translation module 3, for translating out-of-vocabulary words by translating their sub-word parts.
The above embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination thereof. When implemented wholly or partly as a computer program product, the computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the processes or functions described in the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or Digital Subscriber Line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
The foregoing is merely a preferred embodiment of the invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the invention shall fall within its scope of protection.

Claims (7)

1. A BPE encoding method based on Chinese sub-word units, characterized in that the method includes:
corpus cleaning;
splitting Chinese words into sub-word units;
splitting Chinese characters according to the way the Wubi (five-stroke) radical input method enters them;
post-processing by means of BPE encoding.
2. The BPE encoding method based on Chinese sub-word units as claimed in claim 1, characterized in that, after the Chinese characters are split according to the way the Wubi radical input method enters them, each word or Chinese character is represented by the English characters corresponding to its Wubi radicals.
3. The BPE encoding method based on Chinese sub-word units as claimed in claim 1, characterized in that, after the Chinese characters are split according to the way the Wubi radical input method enters them, each word in a sentence is first decomposed into smaller, more common sub-word units, and out-of-vocabulary words are then translated by translating their sub-word parts.
4. A BPE encoding system based on Chinese sub-word units that implements the BPE encoding method based on Chinese sub-word units as claimed in claim 1, characterized in that the system includes:
a preprocessing module, for performing corpus cleaning;
a decomposition module, for first decomposing each word in a sentence into smaller, more common sub-word units;
a splitting module, for splitting Chinese characters according to the way the Wubi radical input method enters them;
a translation module, for translating out-of-vocabulary words by translating their sub-word parts.
5. A computer program that implements the BPE encoding method based on Chinese sub-word units as claimed in any one of claims 1 to 3.
6. An information data processing terminal that implements the BPE encoding method based on Chinese sub-word units as claimed in any one of claims 1 to 3.
7. A computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute the BPE encoding method based on Chinese sub-word units as claimed in any one of claims 1 to 3.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810687736.9A CN109033042A (en) 2018-06-28 2018-06-28 BPE encoding method and system based on Chinese sub-word units, and machine translation system

Publications (1)

Publication Number Publication Date
CN109033042A (en) 2018-12-18

Family

ID=65520753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810687736.9A Pending CN109033042A (en) 2018-06-28 2018-06-28 BPE encoding method and system based on Chinese sub-word units, and machine translation system

Country Status (1)

Country Link
CN (1) CN109033042A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090326916A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Unsupervised chinese word segmentation for statistical machine translation
CN107608973A (en) * 2016-07-12 2018-01-19 华为技术有限公司 A translation method and device based on a neural network
CN107967262A (en) * 2017-11-02 2018-04-27 内蒙古工业大学 A neural network Mongolian-Chinese machine translation method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
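MI XUE TAN et al.: "wubi2en: Character-level Chinese-English Translation", COMPUTER SCIENCE *
HAN DONG et al.: "Analysis of out-of-vocabulary word translation in neural machine translation based on sub-word units", Journal of Chinese Information Processing (中文信息学报) *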

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287483A (en) * 2019-06-06 2019-09-27 广东技术师范大学 A kind of unknown word identification method and system using five-stroke etymon deep learning
CN110287483B (en) * 2019-06-06 2023-12-05 广东技术师范大学 Unregistered word recognition method and system utilizing five-stroke character root deep learning
US11868737B2 (en) 2020-04-24 2024-01-09 Direct Cursus Technology L.L.C Method and server for processing text sequence for machine processing task
CN112861516A (en) * 2021-01-21 2021-05-28 昆明理工大学 Experimental method for verifying influence of common sub-words on XLM translation model effect

Similar Documents

Publication Publication Date Title
CN109117483B (en) Training method and device of neural network machine translation model
CN110334361B (en) Neural machine translation method for Chinese language
CN109325112B (en) A kind of across language sentiment analysis method and apparatus based on emoji
CN108132932B (en) Neural machine translation method with replication mechanism
JP7335300B2 (en) Knowledge pre-trained model training method, apparatus and electronic equipment
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
WO2020124674A1 (en) Method and device for vectorizing translator's translation personality characteristics
CN108549646A (en) A kind of neural network machine translation system based on capsule, information data processing terminal
CN109033042A (en) BPE encoding method and system based on Chinese sub-word units, and machine translation system
WO2023051148A1 (en) Method and apparatus for multilingual processing
CN112347796A (en) Mongolian Chinese neural machine translation method based on combination of distillation BERT and improved Transformer
CN110472255A (en) Neural network machine interpretation method, model, electric terminal and storage medium
CN113239710A (en) Multi-language machine translation method and device, electronic equipment and storage medium
Mathur et al. A scaled‐down neural conversational model for chatbots
Xu et al. Research on Uyghur‐Chinese Neural Machine Translation Based on the Transformer at Multistrategy Segmentation Granularity
Kong [Retracted] Artificial Intelligence‐Based Translation Technology in Translation Teaching
CN111401003A (en) Humor text generation method with enhanced external knowledge
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN112380882A (en) Mongolian Chinese neural machine translation method with error correction function
LU502694B1 (en) Semi-supervised sign language production method based on dual transformation and system and storage medium thereof
CN111104806A (en) Construction method and device of neural machine translation model, and translation method and device
CN112685543B (en) Method and device for answering questions based on text
Xu [Retracted] English‐Chinese Machine Translation Based on Transfer Learning and Chinese‐English Corpus
Wu et al. NLP Research Based on Transformer Model
CN113553837A (en) Reading understanding model training method and device and text analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181218