CN110765766B - German lexical analysis method and system for neural network machine translation - Google Patents

Publication number
CN110765766B
Authority
CN
China
Prior art keywords
word
words
german
machine translation
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911029182.4A
Other languages
Chinese (zh)
Other versions
CN110765766A (en)
Inventor
张孝飞
周聪
刘煜
范婷婷
葛昱晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhong Xian Electronic Technology Development Co ltd
Original Assignee
Beijing Zhong Xian Electronic Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhong Xian Electronic Technology Development Co ltd
Priority to CN201911029182.4A
Publication of CN110765766A
Application granted
Publication of CN110765766B

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of machine translation, and in particular to a German lexical analysis method and system for neural network machine translation. The method comprises the following steps: performing a dictionary query on German words one by one; restoring irregularly inflected words; restoring abbreviations to their full word forms; restoring regularly inflected words; and splitting compound words into their independent constituent words. The disclosed method and system use neural network machine translation technology to perform deep learning on the lexical analysis information of German inflected words, abbreviations and compound words. This helps reduce dimensionality and alleviate data sparseness, and addresses the problem that German inflected words, abbreviations and compound words not included in the dictionary are neither recognized nor translated in machine translation. It thereby greatly improves the accuracy and utilization of sentence-aligned corpora and the readability of machine translation, and improves the quality of the machine-translated text.

Description

German lexical analysis method and system for neural network machine translation
Technical Field
The invention relates to the technical field of machine translation, in particular to a German lexical analysis method and system for neural network machine translation.
Background
Machine translation, also known as automatic translation, is the process of converting one natural language (source language) to another (target language) using a computer. With the rapid development of artificial intelligence, deep learning network structures such as a recurrent neural network and the like have been widely applied to the field of natural language processing, and neural network machine translation is one of the products. As a new machine translation technique which has been developed in recent years, neural network machine translation has made a great breakthrough in translation quality as compared with the past rule-based machine translation and statistical-based machine translation, and thus the commercial application of machine translation has become accessible.
Chinese patent CN201810845896.1 provides a training method and apparatus for a neural network machine translation model, which includes: acquiring a plurality of high-resource language pairs and low-resource language pairs; performing a character-level spelling unification operation on the source language of the high-resource language pairs and the source language of the low-resource language pair; taking each processed high-resource language pair as the training set of a corresponding parent model and the processed low-resource language pair as the training set of a child model; training the parent models in a preset order by transfer learning, so that the source-language and target-language word vectors of each parent model are transferred to the next parent model; and training the child model from the last trained parent model to obtain a neural network machine translation model for translating the low-resource language. The performance of the child model trained on the low-resource language pair is significantly improved.
However, conventional German translation systems have the problem that German inflected words, abbreviations and compound words that are not included in the dictionary are neither recognized nor translated in machine translation. To solve this problem, a new German lexical analysis method and system for neural network machine translation is urgently needed.
Disclosure of Invention
The object of the invention is to provide a German lexical analysis method and system for neural network machine translation that solve the problem that German inflected words, abbreviations and compound words not included in the dictionary are neither recognized nor translated in machine translation.
The invention provides the following scheme:
A German lexical analysis method for neural network machine translation comprises the following steps:
performing a dictionary query on German words one by one;
restoring irregularly inflected words;
restoring abbreviations to their full word forms;
restoring regularly inflected words;
and splitting compound words into their independent constituent words to obtain the processed German words, and inputting the processed German words into a neural network for deep learning.
Preferably, the german lexical analysis method for neural network machine translation further includes:
labeling the lexical analysis information of the successfully restored deformed words, abbreviated words and compound words;
and performing deep learning of neural network machine translation on the lexical analysis information of the labeled deformed words, the labeled abbreviated words and the labeled compound words.
Preferably, the step of performing a dictionary query on German words one by one specifically includes:
after the German text is received, performing a dictionary query on each German word; if the feedback result is "true", directly outputting the base form of the word; if the feedback result is "false", executing the next step.
Preferably, the step of restoring irregularly inflected words includes:
querying the special vocabulary; if the feedback result is "true", directly restoring the inflected word to its base form according to the special vocabulary; if the feedback result is "false", executing the next step.
Preferably, the step of restoring abbreviations to their full word forms includes:
querying the abbreviation list; if the feedback result is "true", directly restoring the abbreviation to its full form according to the abbreviation list; if the feedback result is "false", executing the next step.
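The three lookups above form a simple cascade, which can be sketched as follows. This is a minimal illustration only, assuming the dictionary, special vocabulary and abbreviation list are small in-memory collections; the names `DICTIONARY`, `SPECIAL_VOCAB`, `ABBREVIATIONS` and `lookup_cascade`, and the sample entries, are hypothetical and not part of the disclosure.

```python
# Hypothetical in-memory resources; a real system would load far larger tables.
DICTIONARY = {"essen", "Abbildung", "kommen"}   # base forms
SPECIAL_VOCAB = {"aß": "essen"}                 # irregular form -> base form
ABBREVIATIONS = {"Abb": "Abbildung"}            # abbreviation -> full form


def lookup_cascade(word):
    """Return (base_form, source), or (None, None) if all three lookups fail."""
    if word in DICTIONARY:            # dictionary query returns "true"
        return word, "dictionary"
    if word in SPECIAL_VOCAB:         # special vocabulary query returns "true"
        return SPECIAL_VOCAB[word], "special"
    if word in ABBREVIATIONS:         # abbreviation list query returns "true"
        return ABBREVIATIONS[word], "abbreviation"
    return None, None                 # every query returned "false"


print(lookup_cascade("aß"))
print(lookup_cascade("Abb"))
```

A `(None, None)` result corresponds to the case in which the word is handed on to rule-based restoration.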
Preferably, the step of restoring regularly inflected words includes:
determining, by querying the word ending, a morphological restoration rule applicable to the inflected word, and restoring the word with that rule;
performing a dictionary query on the restoration result; if the feedback result is "true", the restoration succeeds; if the feedback result is "false", restoration with that rule has failed and the next restoration rule is tried;
and so on: if the dictionary query on the result of some restoration rule returns "true", the restoration succeeds; if all restoration rules have been tried and every dictionary query returns "false", executing the next step.
Preferably, the step of splitting compound words into their independent constituent words specifically includes:
performing forward maximum matching on the compound word to be processed and querying the dictionary for each resulting constituent-word field one by one; if the feedback is "true", storing the constituent-word field in a memory bank; if the feedback is "false", executing the next step;
performing marker detection and marker processing on the preceding constituent-word field and/or the remaining trailing field and/or the whole field;
performing forward maximum matching again on the remaining trailing field and/or the whole field after marker processing and querying the dictionary for each resulting constituent-word field one by one; if the feedback is "true", storing the constituent-word field in the memory bank; if the feedback is "false", directly outputting the compound word without splitting it;
and post-processing the constituent-word fields in the memory bank.
Preferably, each forward maximum matching result is retained in two forms, the form with a capitalized initial letter taking priority over the form with a lowercase initial letter; in forward maximum matching, a constituent-word field should contain no fewer than three letters.
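Forward maximum matching with the two constraints just stated (capitalized form tried first, fields of at least three letters) can be sketched as follows. This is a simplified, greedy illustration without marker processing or backtracking; the function names and the sample compound "Haustür" are assumptions chosen for the example.

```python
def case_forms(segment):
    """Capitalized-initial form first, then lowercase-initial form."""
    return [segment[:1].upper() + segment[1:], segment[:1].lower() + segment[1:]]


def longest_prefix(text, dictionary, min_len=3):
    """Return (dictionary_entry, rest) for the longest matching prefix, or None."""
    for end in range(len(text), min_len - 1, -1):
        for form in case_forms(text[:end]):
            if form in dictionary:
                return form, text[end:]
    return None


def forward_maximum_match(compound, dictionary, min_len=3):
    """Greedy left-to-right split; None if some remainder cannot be matched."""
    parts, rest = [], compound
    while rest:
        hit = longest_prefix(rest, dictionary, min_len)
        if hit is None:
            return None          # in the method, marker processing would follow here
        parts.append(hit[0])
        rest = hit[1]
    return parts


print(forward_maximum_match("Haustür", {"Haus", "Tür"}))
```

Returning `None` stands in for the "false" feedback that triggers marker detection and marker processing on the remaining field.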
Preferably, the basic formula of the machine deep learning training is $h = g(W^{T}x + b)$, where $x$ is the input value and the values of $W$ and $b$ are adjusted based on the difference calculated by the back-propagation algorithm.
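The formula and the adjustment of W and b can be illustrated with a single layer and one gradient-descent update. The sketch below is not the disclosed training procedure: the choice of tanh for g, the squared-error loss, the learning rate and all function names are assumptions made for the example.

```python
import math


def forward(W, x, b):
    """h = tanh(W^T x + b); W is a list of rows with shape (inputs, outputs)."""
    return [math.tanh(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
            for j in range(len(b))]


def loss(W, x, b, target):
    """Assumed loss: 0.5 * sum((h - target)^2)."""
    h = forward(W, x, b)
    return 0.5 * sum((h[j] - target[j]) ** 2 for j in range(len(b)))


def train_step(W, x, b, target, lr=0.1):
    """One gradient-descent update of W and b for the loss above."""
    h = forward(W, x, b)
    # Error signal at the pre-activation: (h - target) * tanh'(z), tanh'(z) = 1 - h^2.
    delta = [(h[j] - target[j]) * (1.0 - h[j] ** 2) for j in range(len(b))]
    new_W = [[W[i][j] - lr * x[i] * delta[j] for j in range(len(b))]
             for i in range(len(x))]
    new_b = [b[j] - lr * delta[j] for j in range(len(b))]
    return new_W, new_b


W, b = [[0.1], [-0.2]], [0.0]
x, target = [1.0, 2.0], [0.5]
before = loss(W, x, b, target)
W, b = train_step(W, x, b, target)
after = loss(W, x, b, target)
print(before, after)  # the error on this example decreases after the update
```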
Further, the present invention also provides a German lexical analysis system for neural network machine translation, comprising:
a dictionary query module, used for performing a dictionary query on the words or the processed words one by one;
a special vocabulary restoration module, used for restoring irregularly inflected words by querying the special vocabulary;
an abbreviation restoration module, used for restoring abbreviations to their full word forms by querying the abbreviation list;
a rule restoration module, used for restoring regularly inflected words by means of the morphological restoration rule table;
a compound word splitting module, used for splitting compound words into their independent constituent words;
a labeling module, used for labeling the lexical analysis information of the successfully restored inflected words, abbreviations and compound words;
and a deep learning module, used for performing deep learning of neural network machine translation on the lexical analysis information of the labeled inflected words, abbreviations and compound words.
The invention has the following beneficial effects:
The invention discloses a German lexical analysis method and system for neural network machine translation, wherein the method comprises the following steps: performing a dictionary query on German words one by one; restoring irregularly inflected words; restoring abbreviations to their full word forms; restoring regularly inflected words; and splitting compound words into their independent constituent words. The method helps reduce dimensionality and alleviate data sparseness, addresses the problem that German inflected words, abbreviations and compound words not included in the dictionary are neither recognized nor translated in machine translation, greatly improves the accuracy and utilization of sentence-aligned corpora and the readability of machine translation, and improves the quality of the machine-translated text.
Drawings
Fig. 1 is a flow chart of the German lexical analysis method for neural network machine translation according to the present invention.
Fig. 2 is a structural block diagram of the German lexical analysis system for neural network machine translation according to the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to Fig. 1, a German lexical analysis method for neural network machine translation includes the following steps:
1. performing a dictionary query on German words one by one;
2. restoring irregularly inflected words;
3. restoring abbreviations to their full word forms;
4. restoring regularly inflected words;
5. splitting compound words into their independent constituent words.
The German lexical analysis method for neural network machine translation further comprises the following steps:
6. labeling the lexical analysis information of the successfully restored inflected words, abbreviations and compound words;
7. performing deep learning of neural network machine translation on the lexical analysis information of the labeled inflected words, abbreviations and compound words.
Step 1, performing a dictionary query on German words one by one, specifically comprises:
after the German text is received, performing a dictionary query on each German word; if the feedback result is "true", directly outputting the base form of the word; if the feedback result is "false", executing step 2.
Step 2, restoring irregularly inflected words, specifically comprises:
querying the special vocabulary; if the feedback result is "true", directly restoring the inflected word to its base form according to the special vocabulary; if the feedback result is "false", executing step 3.
Step 3, restoring abbreviations to their full word forms, specifically comprises:
querying the abbreviation list; if the feedback result is "true", directly restoring the abbreviation to its full form according to the abbreviation list; if the feedback result is "false", executing step 4.
Step 4, restoring regularly inflected words, specifically comprises:
determining, by querying the word ending, a morphological restoration rule applicable to the inflected word, and restoring the word with that rule;
performing a dictionary query on the restoration result; if the feedback result is "true", the restoration succeeds; if the feedback result is "false", restoration with that rule has failed and the next restoration rule is tried;
and so on: if the dictionary query on the result of some restoration rule returns "true", the restoration succeeds; if all restoration rules have been tried and every dictionary query returns "false", executing step 5.
Step 5, splitting compound words into their independent constituent words, specifically comprises:
performing forward maximum matching on the compound word to be processed and querying the dictionary for each resulting constituent-word field one by one; if the feedback is "true", storing the constituent-word field in a memory bank; if the feedback is "false", executing the next step;
performing marker detection and marker processing on the preceding constituent-word field and/or the remaining trailing field and/or the whole field;
performing forward maximum matching again on the remaining trailing field and/or the whole field after marker processing and querying the dictionary for each resulting constituent-word field one by one; if the feedback is "true", storing the constituent-word field in the memory bank; if the feedback is "false", directly outputting the compound word without splitting it;
and post-processing the constituent-word fields in the memory bank.
Each forward maximum matching result is retained in two forms, the form with a capitalized initial letter taking priority over the form with a lowercase initial letter; in forward maximum matching, a constituent-word field should contain no fewer than three letters.
The basic formula of the machine deep learning training is $h = g(W^{T}x + b)$, where $x$ is the input value and the values of $W$ and $b$ are adjusted based on the difference calculated by the back-propagation algorithm.
The German lexical analysis method for neural network machine translation in the embodiment comprises the following steps:
1. After the German text is received, first perform a dictionary query on each German word; if the feedback result is "true", directly output the base form of the word; if the feedback result is "false", execute the next step;
2. Query the special vocabulary; if the feedback result is "true", directly restore the inflected word to its base form according to the special vocabulary, for example restoring the inflected word aß to essen; if the feedback result is "false", execute the next step;
3. Query the abbreviation list; if the feedback result is "true", directly restore the abbreviation to its full form according to the abbreviation list, for example restoring the abbreviation Abb to Abbildung; if the feedback result is "false", execute the next step;
4. Determine, by querying the word ending, a morphological restoration rule applicable to the inflected word, restore the word with that rule, and perform a dictionary query on the restoration result; if the feedback result is "true", the restoration succeeds; if the feedback result is "false", restoration with that rule has failed and the next restoration rule is tried (the restoration rules are prioritized from top to bottom); and so on: if the dictionary query on the result of some restoration rule returns "true", the restoration succeeds; if all restoration rules have been tried and every dictionary query returns "false", execute the next step. Specific examples of restoration rules are as follows:
*estens->FIND(IL,(HEAD,1),LOWERCASE),INFLEX(-,|[letter image omitted in source] a)
Here *estens denotes that a word ends with this suffix; FIND(IL,(HEAD,1),LOWERCASE) indicates that the search condition is that the first letter from the left of the word is lower case; INFLEX(-,|[letter image omitted in source] a) means that, if the search condition is satisfied, the suffix of the word is removed and the letter [letter image omitted in source] contained in the word is restored to the letter a. For example, the inflected word [word image omitted in source] is restored to alt.
*t->FIND(IL,(HEAD,1),LOWERCASE),INFLEX(-,en)
Here *t denotes that a word ends with this suffix; FIND(IL,(HEAD,1),LOWERCASE) indicates that the search condition is that the first letter from the left of the word is lower case; INFLEX(-,en) means that, if the search condition is satisfied, the suffix t of the word is restored to en. For example, the inflected word kommt is restored to kommen.
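Rule-based restoration with dictionary validation can be sketched as follows. This is a minimal illustration, assuming each rule is reduced to a suffix, a replacement and a lowercase-initial condition; the `Rule` class, the function name and the second-person rule `st -> en` are hypothetical, and only the rule `t -> en` comes from the example above.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    suffix: str                      # word ending that triggers the rule, e.g. "t"
    replacement: str                 # text that replaces the suffix, e.g. "en"
    lowercase_initial: bool = True   # stands in for FIND(IL,(HEAD,1),LOWERCASE)


# Rules are tried from top to bottom (priority order).
RULES = [
    Rule("st", "en"),   # hypothetical rule, added only to show the ordering
    Rule("t", "en"),    # *t->FIND(IL,(HEAD,1),LOWERCASE),INFLEX(-,en)
]


def restore_by_rules(word, rules, dictionary):
    """Return the first rule result found in the dictionary, else None."""
    for rule in rules:
        if rule.lowercase_initial and not word[:1].islower():
            continue
        if len(word) > len(rule.suffix) and word.endswith(rule.suffix):
            candidate = word[: len(word) - len(rule.suffix)] + rule.replacement
            if candidate in dictionary:    # dictionary query returns "true"
                return candidate
    return None                            # all rules failed


print(restore_by_rules("kommt", RULES, {"kommen"}))
```

A `None` result corresponds to handing the word on to compound word splitting.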
5. Perform compound word splitting on the word, specifically as follows:
1) Perform forward maximum matching on the compound word to be processed. Each forward maximum matching result is retained in two forms, the form with a capitalized initial letter taking priority over the form with a lowercase initial letter. In forward maximum matching, a constituent-word field must contain no fewer than three letters. When forward maximum matching finishes, if the remaining trailing field consists only of "en", "er", "n", "e", "s", "es", "ern", "se", "ses", "sen", "d", "de" or "den", it may be deleted. Each resulting constituent-word field is queried in the dictionary one by one; if the feedback is "true", the field is stored in the memory bank; if the feedback is "false", the next step is executed;
2) Perform marker detection and marker processing on the preceding constituent-word field and/or the remaining trailing field and/or the whole field, specifically: if the preceding constituent-word field ends with "ung", [suffix image omitted in source] or "ion" and the remaining trailing field begins with "s", delete the initial letter "s" of the remaining trailing field and proceed to the next step; if the remaining trailing field and/or the whole field contains "n", "e", "er" or "s", morphologically restore the segment running from the first letter of that field to the position ending with that letter; if the field contains at least two of "n", "e", "er" and "s", cut segments in the order "n", "e", "er", "s" and restore them in turn; if the word ending is "ern", "se", "ses" or "sen", restore it with the morphological restoration rules; then proceed to the next step; if neither of these two marker processing methods splits the compound word successfully, append the letters "e", "en" and "n" in turn to the remaining trailing field and proceed to the next step; if the remaining trailing field begins with the letter "s", "n" or "e", or with the two letters "en", "er" or "es", remove that initial letter or those two letters and proceed to the next step.
3) Perform forward maximum matching again on the remaining trailing field and/or the whole field after marker processing. Each resulting constituent-word field is queried in the dictionary one by one; if the feedback is "true", the field is stored in the memory bank; if the feedback is "false", the compound word is output directly without being split;
4) Post-process the constituent-word fields in the memory bank, specifically: if a constituent-word field in the memory bank ends with "en" and its first letter is lower case, convert the first letter to upper case, delete the final letter "n" and perform a dictionary query; if the feedback is "true", the processed field is the constituent word to be output. If the feedback is "false", convert the first letter to upper case, delete the final letters "en" and perform a dictionary query; if the feedback is "true", the processed field is the constituent word to be output. If the feedback is still "false", keep the form ending with "en" from the memory bank and output it as the constituent word. The final output form of the compound word is its constituent words joined by "+"; for example, splitting the German compound word Waschmaschinentür yields Waschmaschine + Tür;
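The post-processing of constituent-word fields ending in "en" can be sketched as follows. This is an illustrative reading of sub-step 4) only; the function name and the sample dictionary entries are assumptions.

```python
def postprocess_constituent(field, dictionary):
    """Try dropping "n", then "en", from a lowercase-initial field ending in "en"."""
    if not (field.endswith("en") and field[:1].islower()):
        return field                       # sub-step 4) does not apply
    capitalized = field[:1].upper() + field[1:]
    for cut in (1, 2):                     # first delete "n", then delete "en"
        candidate = capitalized[:-cut]
        if candidate in dictionary:        # dictionary query returns "true"
            return candidate
    return field                           # keep the form ending with "en"


dictionary = {"Maschine", "Frau"}
print(postprocess_constituent("maschinen", dictionary))
print(postprocess_constituent("frauen", dictionary))
print(" + ".join(["Waschmaschine", "Tür"]))  # output form: constituents joined by "+"
```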
6. Label the lexical analysis information of words that have been restored through the special vocabulary, the abbreviation list or the morphological restoration rule table, or split as compound words;
an example of the output after labeling is as follows: 5zeigen zxst eine Schnittansicht des Draht zxses in4 entlang der Linie V; (original: 5)zeigt eine Schnittansicht des Drahtes in 4 entlang der Linie V;).
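One possible reading of the labeled example is that each restored word is replaced by its base form followed by a marker token built from the removed ending (zeigt becomes zeigen zxst, Drahtes becomes Draht zxses). The sketch below illustrates that reading only: the marker prefix "zxs" and the token layout are inferred from this single example and are assumptions, as are the function and variable names.

```python
MARKER_PREFIX = "zxs"  # assumption inferred from "zxst" and "zxses" in the example


def label(tokens):
    """tokens: list of (surface, base_form, removed_suffix or None)."""
    out = []
    for surface, base_form, suffix in tokens:
        if suffix is None:
            out.append(surface)                       # word was not restored
        else:
            out.extend([base_form, MARKER_PREFIX + suffix])
    return " ".join(out)


analysed = [
    ("zeigt", "zeigen", "t"),
    ("eine", "eine", None),
    ("Schnittansicht", "Schnittansicht", None),
    ("des", "des", None),
    ("Drahtes", "Draht", "es"),
]
print(label(analysed))
```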
7. The neural network machine translation performs deep learning on the labeled lexical analysis information of the words.
Specifically, the basic formula of the machine deep learning training is $h = g(W^{T}x + b)$, where $x$ is the input value (for example, the labeled sentence 5zeigen zxst eine Schnittansicht des Draht zxses in4 entlang der Linie V;) and the values of $W$ and $b$ are adjusted based on the difference calculated by the back-propagation algorithm. Deep learning mainly comprises two processes, encoding (encode) and decoding (decode). In the encoding process, the first layer of the multi-layer neural network is given by $h^{(1)} = g^{(1)}(W^{(1)T}x + b^{(1)})$; the second layer by $h^{(2)} = g^{(2)}(W^{(2)T}h^{(1)} + b^{(2)})$; the third layer by $h^{(3)} = g^{(3)}(W^{(3)T}h^{(2)} + b^{(3)})$; and so on, so that the n-th layer is given by $h^{(n)} = g^{(n)}(W^{(n)T}h^{(n-1)} + b^{(n)})$. The encoding result $h^{(n)}$ is used as the input value of the decoding layers, where multi-layer calculation according to the basic formula yields the Chinese result, and the back-propagation algorithm is used in the calculation to achieve a better learning effect.
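The layer-by-layer encoding can be sketched as a loop in which each layer's output becomes the next layer's input. This is a minimal illustration of the stacked formula only, not of a full encoder-decoder translation model; the function names, the layer sizes and the use of an identity activation in the example are assumptions.

```python
def dense(W, x, b, g):
    """h = g(W^T x + b); W is a list of rows with shape (inputs, outputs)."""
    return [g(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
            for j in range(len(b))]


def encode(x, layers):
    """Apply h(k) = g(k)(W(k)^T h(k-1) + b(k)) for k = 1..n, with h(0) = x."""
    h = x
    for W, b, g in layers:
        h = dense(W, h, b, g)
    return h


def identity(z):
    return z


layers = [
    ([[1, 0], [0, 1]], [1, 1], identity),   # layer 1: 2 inputs -> 2 outputs
    ([[2], [3]], [0], identity),            # layer 2: 2 inputs -> 1 output
]
print(encode([1, 2], layers))
```

In practice a nonlinear activation such as tanh would replace `identity`; the identity is used here only so the arithmetic can be checked by hand.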
Referring to fig. 2, a german lexical analysis system for neural network machine translation includes:
a dictionary query module 210, configured to perform a dictionary query on the words or the processed words one by one;
a special vocabulary restoration module 220, configured to restore irregularly inflected words by querying the special vocabulary;
an abbreviation restoration module 230, configured to restore abbreviations to their full word forms by querying the abbreviation list;
a rule restoration module 240, configured to restore regularly inflected words by means of the morphological restoration rule table;
a compound word splitting module 250, configured to split compound words into their independent constituent words;
a labeling module 260, configured to label the lexical analysis information of the successfully restored inflected words, abbreviations and compound words;
and a deep learning module 270, configured to perform deep learning of neural network machine translation on the lexical analysis information of the labeled inflected words, abbreviations and compound words.
The present embodiment also provides a computer system suitable for implementing the above-described neural network machine translation-oriented german lexical analysis method. The computer system includes a processor and a computer-readable storage medium. The computer system may perform a method according to an embodiment of the invention.
In particular, the processor may comprise, for example, a general purpose microprocessor, an instruction set processor and/or related chip set and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor may also include on-board memory for caching purposes. The processor may be a single processing unit or a plurality of processing units for performing the different actions of the method flow according to embodiments of the present invention.
Computer-readable storage media, for example, may be non-volatile computer-readable storage media, specific examples including, but not limited to: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and so on.
The computer-readable storage medium may comprise a computer program that may comprise code/computer-executable instructions that, when executed by a processor, cause the processor to perform a method according to an embodiment of the invention or any variant thereof.
The computer program may be configured with computer program code, for example comprising computer program modules. For example, in an example embodiment, code in the computer program may include one or more program modules, including, for example, a dictionary lookup module, a special vocabulary reduction module, an abbreviation reduction module, a rule reduction module, and the like. It should be noted that the division and number of modules are not fixed, and those skilled in the art may use suitable program modules or program module combinations according to actual situations, which when executed by a processor, enable the processor to perform the method according to the embodiments of the present invention or any variations thereof.
According to an embodiment of the present invention, at least one of the above modules may be implemented as a computer program module, which when executed by a processor, may implement the respective operations described above.
The present invention also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the present invention.
According to embodiments of the present invention, the computer readable storage medium may be a non-volatile computer readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The German lexical analysis method and system for neural network machine translation in this embodiment comprise the following steps: (1) performing a dictionary query on German words one by one; (2) restoring irregularly deformed words; (3) restoring abbreviations to their prototype words; (4) restoring regularly deformed words; (5) splitting compound words into independent constituent words; (6) labeling the lexical analysis information of the successfully restored deformed words, abbreviations and compound words; and (7) performing deep learning on the lexical analysis information of the German deformed words, abbreviations and compound words by using neural network machine translation technology. The method helps to reduce dimensionality and to alleviate data sparseness, and it overcomes the problem that German deformed words, abbreviations and compound words are neither recognized nor translated in machine translation because they are not recorded in the dictionary. It thereby greatly improves the accuracy and utilization rate of sentence-aligned corpora and the readability of machine translation, and improves the quality of the machine-translated text.
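As a non-limiting illustration, steps 1 to 4 and the labeling of step 6 can be sketched in Python as a cascade in which each stage is tried only when the previous one answers 'false'. The table contents, the ending rules, the label strings and the names `restore_by_rules` and `restore_word` are assumptions made for this sketch and are not taken from the embodiment.

```python
from typing import List, Optional, Tuple

# Hypothetical toy resources; every entry is illustrative only.
DICTIONARY = {"gehen", "gut", "Kind", "Frau", "machen", "zum Beispiel"}
IRREGULAR_FORMS = {"ging": "gehen", "besser": "gut"}  # stands in for the special vocabulary
ABBREVIATIONS = {"z.B.": "zum Beispiel"}              # stands in for the abbreviation list
# (word ending, replacement) pairs tried in order; greatly simplified.
ENDING_RULES: List[Tuple[str, str]] = [
    ("ern", ""),   # Kindern -> Kind
    ("en", ""),    # Frauen  -> Frau
    ("st", "en"),  # machst  -> machen
    ("t", "en"),   # macht   -> machen
]


def restore_by_rules(word: str) -> Optional[str]:
    """Apply each matching ending rule; keep the first result the dictionary confirms."""
    for ending, replacement in ENDING_RULES:
        if word.endswith(ending):
            candidate = word[: len(word) - len(ending)] + replacement
            if candidate in DICTIONARY:  # dictionary query on the restoration result
                return candidate
    return None  # every rule answered 'false'


def restore_word(word: str) -> Tuple[str, str]:
    """Return (prototype, label); the label records which step succeeded."""
    if word in DICTIONARY:             # step 1: dictionary query
        return word, "dictionary"
    if word in IRREGULAR_FORMS:        # step 2: irregularly deformed words
        return IRREGULAR_FORMS[word], "irregular"
    if word in ABBREVIATIONS:          # step 3: abbreviations
        return ABBREVIATIONS[word], "abbreviation"
    restored = restore_by_rules(word)  # step 4: regularly deformed words
    if restored is not None:
        return restored, "regular"
    return word, "unresolved"          # would be handed on to compound splitting (step 5)
```

In this sketch the label returned with each prototype plays the role of the lexical analysis information that is later supplied to the neural network together with the restored word.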
For simplicity of explanation, the method embodiments are described as a series of acts, but those skilled in the art will appreciate that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently in accordance with the embodiments of the invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments and that the acts involved are not necessarily all required to implement the invention.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A German lexical analysis method for neural network machine translation, characterized in that the method comprises the following steps:
restoring German words to prototype words to obtain processed German words, and inputting the processed German words into a neural network for deep learning;
wherein restoring the German words to prototype words comprises:
sequentially executing the following steps until the prototype word is obtained:
performing a dictionary query on the German words one by one;
restoring irregularly deformed words;
restoring abbreviations to their prototype words;
restoring regularly deformed words; and
splitting compound words into independent constituent words;
wherein the step of splitting compound words into independent constituent words specifically comprises:
performing forward maximum matching on the compound word to be processed, wherein the result of each forward maximum matching is retained in two forms, the form with a capitalized first letter taking priority and the form with a lowercase first letter coming second; during forward maximum matching, each constituent-word field is not less than three letters long; performing a dictionary query on the obtained constituent-word fields one by one; if the feedback is 'true', the constituent-word field enters a memory bank, and if the feedback is 'false', the next step is executed;
performing mark judgment and mark processing on the preceding constituent-word field and/or the following remaining field and/or the whole field;
performing forward maximum matching again on the following remaining field and/or the whole field after the mark processing, and performing a dictionary query on the obtained constituent-word fields one by one; if the feedback is 'true', the constituent-word field enters the memory bank, and if the feedback is 'false', the compound word is output directly without splitting; and
post-processing the constituent-word fields in the memory bank.
2. The German lexical analysis method for neural network machine translation of claim 1, further comprising:
labeling the lexical analysis information of the successfully restored deformed words, abbreviations and compound words;
and performing deep learning of neural network machine translation on the lexical analysis information of the labeled deformed words, the labeled abbreviations and the labeled compound words.
3. The German lexical analysis method for neural network machine translation of claim 2, wherein the step of performing a dictionary query on the German words one by one specifically comprises:
after receiving the German text, performing a dictionary query on each German word; if the feedback result is 'true', directly outputting the prototype word; if the feedback result is 'false', executing the next step.
4. The German lexical analysis method for neural network machine translation of claim 3, wherein the step of restoring irregularly deformed words comprises:
querying the special vocabulary; if the feedback result is 'true', directly restoring the deformed word to its prototype according to the special vocabulary, and if the feedback result is 'false', executing the next step.
5. The German lexical analysis method for neural network machine translation of claim 4, wherein the step of restoring abbreviations to their prototype words comprises:
querying the abbreviation list; if the feedback result is 'true', directly restoring the deformed word to its prototype according to the abbreviation list, and if the feedback result is 'false', executing the next step.
6. The German lexical analysis method for neural network machine translation of claim 5, wherein the step of restoring regularly deformed words comprises:
determining a morphological restoration rule applicable to the deformed word through a word-ending query, and restoring the word with that rule;
performing a dictionary query on the restoration result; if the feedback result is 'true', the restoration succeeds; if the feedback result is 'false', restoration by that rule has failed and the next restoration rule is applied;
and so on: if the dictionary query on the restoration result of any restoration rule returns 'true', the restoration succeeds; if all restoration rules have been tried and every dictionary query returns 'false', the next step is executed.
7. The German lexical analysis method for neural network machine translation of claim 1, wherein:
the basic formula of the machine deep learning training is $h = g(W^{T}x + b)$, where x is the input value and the values of W and b are adjusted based on the difference calculated by the back propagation algorithm.
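As a non-limiting illustration, the formula h = g(W^T x + b) and one back-propagation adjustment of W and b can be sketched in plain Python. The choice of a sigmoid for g, the squared-error loss, the learning rate and the names `forward` and `train_step` are assumptions made for this sketch; the claim itself does not fix any of them.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def forward(W: Matrix, x: List[float], b: List[float]) -> List[float]:
    """Compute h = g(W^T x + b) with g = sigmoid; W has len(x) rows and len(b) columns."""
    return [
        sigmoid(sum(W[i][j] * x[i] for i in range(len(x))) + b[j])
        for j in range(len(b))
    ]


def train_step(W: Matrix, x: List[float], b: List[float],
               target: List[float], lr: float = 0.5) -> Tuple[Matrix, List[float]]:
    """One gradient-descent update for the assumed loss 0.5 * sum((h - target)^2)."""
    h = forward(W, x, b)
    # Back-propagated difference: dL/dz_j = (h_j - t_j) * g'(z_j), and g'(z) = h(1 - h) for a sigmoid.
    delta = [(h[j] - target[j]) * h[j] * (1.0 - h[j]) for j in range(len(b))]
    new_W = [[W[i][j] - lr * delta[j] * x[i] for j in range(len(b))]
             for i in range(len(x))]
    new_b = [b[j] - lr * delta[j] for j in range(len(b))]
    return new_W, new_b
```

Starting from all-zero parameters, a single call to `train_step` moves the output of `forward` toward the target, which is the adjustment of W and b that the formula's accompanying text describes.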
8. A German lexical analysis system for neural network machine translation, characterized in that the system comprises:
a dictionary query module, used for performing a dictionary query on the words or the processed words one by one;
a special vocabulary restoration module, used for restoring irregularly deformed words by querying the special vocabulary;
an abbreviation restoration module, used for restoring abbreviations to their prototype words by querying the abbreviation list;
a rule restoration module, used for restoring regularly deformed words through the word-form restoration rule table;
a compound word splitting module, used for splitting compound words into independent constituent words;
a labeling module, used for labeling the lexical analysis information of the successfully restored deformed words, the successfully restored abbreviations and the successfully restored compound words;
a deep learning module, used for performing deep learning of neural network machine translation on the lexical analysis information of the labeled deformed words, the labeled abbreviations and the labeled compound words;
wherein the compound word splitting module splits compound words into independent constituent words by executing the following operations:
performing forward maximum matching on the compound word to be processed, wherein the result of each forward maximum matching is retained in two forms, the form with a capitalized first letter taking priority and the form with a lowercase first letter coming second; during forward maximum matching, each constituent-word field is not less than three letters long; performing a dictionary query on the obtained constituent-word fields one by one; if the feedback is 'true', the constituent-word field enters a memory bank, and if the feedback is 'false', the next step is executed;
performing mark judgment and mark processing on the preceding constituent-word field and/or the following remaining field and/or the whole field;
performing forward maximum matching again on the following remaining field and/or the whole field after the mark processing, and performing a dictionary query on the obtained constituent-word fields one by one; if the feedback is 'true', the constituent-word field enters the memory bank, and if the feedback is 'false', the compound word is output directly without splitting; and
post-processing the constituent-word fields in the memory bank.
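As a non-limiting illustration, the forward maximum matching performed by the compound word splitting operations can be sketched in Python as follows. The dictionary contents, the names `lookup` and `split_compound`, and the greedy strategy without backtracking are assumptions made for this sketch. The two retained forms are read here as "first letter capitalized, then first letter lowercase". The mark judgment, mark processing and post-processing operations are omitted, so a failed lookup returns the compound unsplit immediately.

```python
from typing import List, Optional, Set

MIN_LEN = 3  # constituent-word fields must be at least three letters long

# Hypothetical toy dictionary; a real system would query a full German lexicon.
DICTIONARY: Set[str] = {"Haus", "Tür", "Schlüssel", "Arbeit", "Zimmer", "schnell"}


def lookup(field: str) -> Optional[str]:
    """Try the capitalized form first and the lowercase form second."""
    for form in (field[:1].upper() + field[1:].lower(), field.lower()):
        if form in DICTIONARY:
            return form
    return None


def split_compound(word: str) -> List[str]:
    """Greedy forward maximum matching; returns [word] when no complete split is found."""
    memory: List[str] = []  # plays the role of the memory bank
    pos = 0
    while pos < len(word):
        match = None
        # Try the longest remaining field first, then shrink it from the right.
        for end in range(len(word), pos + MIN_LEN - 1, -1):
            if pos == 0 and end == len(word):
                continue  # the whole word is not a constituent of itself
            hit = lookup(word[pos:end])
            if hit is not None:
                match = (hit, end)
                break
        if match is None:
            return [word]  # a query answered 'false': output the compound unsplit
        memory.append(match[0])
        pos = match[1]
    return memory if memory else [word]
```

With the toy dictionary, `split_compound("Haustürschlüssel")` yields the three constituents Haus, Tür and Schlüssel, whereas `split_compound("Arbeitszimmer")` is returned unsplit because the remaining field "szimmer" matches nothing; this is the kind of case that the omitted mark processing would have to handle before the second matching pass.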
CN201911029182.4A 2019-10-25 2019-10-25 German lexical analysis method and system for neural network machine translation Active CN110765766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911029182.4A CN110765766B (en) 2019-10-25 2019-10-25 German lexical analysis method and system for neural network machine translation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911029182.4A CN110765766B (en) 2019-10-25 2019-10-25 German lexical analysis method and system for neural network machine translation

Publications (2)

Publication Number Publication Date
CN110765766A CN110765766A (en) 2020-02-07
CN110765766B true CN110765766B (en) 2022-05-17

Family

ID=69334332

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911029182.4A Active CN110765766B (en) 2019-10-25 2019-10-25 German lexical analysis method and system for neural network machine translation

Country Status (1)

Country Link
CN (1) CN110765766B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1141465A (en) * 1995-07-26 1997-01-29 深圳科智语言信息处理有限公司北京分公司 Morphological analysis algorithm for a German-Chinese translation system
CN106649288A (en) * 2016-12-12 2017-05-10 北京百度网讯科技有限公司 Translation method and device based on artificial intelligence
CN107608973A (en) * 2016-07-12 2018-01-19 华为技术有限公司 Translation method and device based on a neural network
CN109359294A (en) * 2018-09-18 2019-02-19 湖北文理学院 Ancient Chinese translation method based on neural machine translation
CN110287333A (en) * 2019-06-12 2019-09-27 北京语言大学 Method and system for paraphrase generation based on a knowledge base

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3557605B2 (en) * 2001-09-19 2004-08-25 インターナショナル・ビジネス・マシーンズ・コーポレーション Sentence segmentation method, sentence segmentation processing device using the same, machine translation device, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
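Research on an intelligent translation algorithm for German compound words based on machine learning; Li Yixuan (李奕萱) et al.; Electronics World (《电子世界》); 2019-03-31 (No. 5); pp. 62-65 *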

Also Published As

Publication number Publication date
CN110765766A (en) 2020-02-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant