CN114528861A - Foreign language translation training method and device based on corpus - Google Patents

Info

Publication number
CN114528861A
Authority
CN
China
Prior art keywords
translation
corpus
training
word segmentation
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210204937.5A
Other languages
Chinese (zh)
Inventor
申丽霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou University of Science and Technology
Original Assignee
Zhengzhou University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou University of Science and Technology filed Critical Zhengzhou University of Science and Technology
Priority to CN202210204937.5A priority Critical patent/CN114528861A/en
Publication of CN114528861A publication Critical patent/CN114528861A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a corpus-based foreign language translation training method and device, which relate to the technical field of natural language processing. The method comprises the following steps: randomly extracting a preset number of training corpora from any parallel corpus to construct a first parallel corpus; constructing and training an initial translation model according to the first parallel corpus; obtaining a translation corpus by using the initial translation model; calculating a translation confidence score of any sentence in the translation corpus, and comparing the translation confidence score with a preset evaluation threshold; updating the translation corpus according to the comparison result, and splicing the translation corpus with any monolingual corpus to obtain a second parallel corpus; and obtaining a whole corpus from the parallel corpus and the second parallel corpus, and training the initial translation model again. The invention enlarges the parallel corpus to improve the accuracy of the translation model, and ensures the accuracy of the translated sentences merged into the original parallel corpus, so that the trained translation model is more accurate.

Description

Foreign language translation training method and device based on corpus
Technical Field
The invention relates to the technical field of natural language processing, in particular to a foreign language translation training method and device based on a corpus.
Background
Natural language processing is an important research direction in computer science and artificial intelligence. It studies how to enable people and computers to communicate effectively in natural language, and is a subject integrating linguistics, computer science and mathematics.
Within this field, neural machine translation is an important task that has attracted a great deal of attention in academia and industry in recent years. Neural machine translation models perform well when they can benefit from large-scale, high-quality bilingual parallel training corpora. At present, however, high-quality parallel corpora exist for only a small number of language pairs and are often limited to specific fields, such as government documents and news. Therefore, how to ensure the accuracy of the translation model with a limited parallel training corpus is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for training foreign language translation based on a corpus, which overcome the above-mentioned drawbacks.
In order to achieve the above purpose, the invention provides the following technical scheme:
a foreign language translation training method based on a corpus specifically comprises the following steps:
randomly extracting a preset number of training corpora from any parallel corpus to construct a first parallel corpus;
constructing and training an initial translation model according to the first parallel corpus;
translating a source language sentence in any monolingual corpus into a target language sentence by using an initial translation model to obtain a translation corpus;
calculating a translation confidence score of any statement in the translation corpus, and comparing the translation confidence score with a preset evaluation threshold;
updating the translation corpus according to the comparison result, and splicing the translation corpus with any monolingual corpus to obtain a second parallel corpus;
and acquiring the whole corpus according to any one parallel corpus and the second parallel corpus, and training the initial translation model again.
Optionally, the construction step of the initial translation model is:
preprocessing sentences in the first parallel corpus to obtain a preprocessed text;
performing word segmentation processing on the preprocessed text according to the automatic word segmentation model to obtain word segmentation text information;
and establishing and training an initial translation model with a recurrent neural network based on the word segmentation text information.
Optionally, the step of obtaining the automatic word segmentation model includes:
acquiring a preprocessed text, and performing word segmentation processing on the preprocessed text to obtain word segmentation text information at a character level;
acquiring part-of-speech tags and word segmentation tags of word segmentation text information;
combining part-of-speech labels and word segmentation labels of word segmentation text information to obtain binary label information;
and training by using a recurrent neural network based on the word segmentation text information and the binary label information to construct an automatic word segmentation model.
Optionally, the obtaining step of the translation confidence score of any statement is:
obtaining a translation confidence evaluation index according to historical data;
acquiring the weight of each translation confidence evaluation index;
and obtaining the translation confidence score of any statement according to each translation confidence evaluation index and the corresponding weight.
Optionally, the calculation formula of the translation confidence score is as follows:
[Formula image BDA0003528785900000031, not reproduced in this text]
in the formula, i is the index number of a translation confidence evaluation index; $\lambda_i$ is the weight of the i-th translation confidence evaluation index; $h_i$ is the i-th translation confidence evaluation index.
Optionally, the step of updating the translation corpus specifically includes:
calculating a translation confidence score of any statement in the translation corpus, and comparing the translation confidence score with a preset evaluation threshold; if the evaluation value is larger than or equal to the second evaluation threshold value, the translation corpus is not updated; if the evaluation value is smaller than the first evaluation threshold value, text recognition is carried out on any sentence in the translation corpus according to a preset length;
matching the recognized text with the text in the source language sentence;
acquiring a text to be replaced according to a monolingual corpus of a target language;
replacing the text to be replaced with the corresponding content in the identified text to obtain a second translation sentence;
calculating translation confidence score of the second translation statement, if the translation confidence score is smaller than a first evaluation threshold, replacing the texts to be replaced one by one, and calculating the translation confidence score respectively to obtain the best translation statement and update the translation corpus; and if the evaluation value is larger than or equal to the second evaluation threshold value, storing the second translation sentence into the translation corpus and updating the translation corpus.
A foreign language translation training device based on a corpus comprises an initial training module, an evaluation module, a first corpus construction module, a second corpus construction module and a retraining module;
the initial training module is used for constructing and training an initial translation model according to the first parallel corpus;
the evaluation module is used for calculating the translation confidence score of any statement in the translation corpus and comparing the translation confidence score with a preset evaluation threshold; updating the translation corpus according to the comparison result;
the first corpus construction module is used for splicing the updated translation corpus and any monolingual corpus to obtain a second parallel corpus;
the second corpus construction module is used for acquiring an integral corpus according to any one parallel corpus and a second parallel corpus;
and the retraining module is used for retraining the initial translation model according to the whole corpus.
Optionally, the initial training module includes a corpus extraction module, a preprocessing module, an automatic word segmentation module, and a model training module;
the corpus extraction module is used for extracting training corpuses with preset quantity and constructing a first parallel corpus;
the preprocessing module is used for preprocessing the sentences in the first parallel corpus to obtain a preprocessed text;
the automatic word segmentation module is used for carrying out word segmentation processing on the preprocessed text to obtain word segmentation text information;
and the model training module is used for establishing and training an initial translation model.
Compared with the prior art, the corpus-based foreign language translation training method and device improve the accuracy of the translation model by enlarging the scale of the parallel corpus, and ensure the accuracy of the translated sentences incorporated into the original parallel corpus while it is enlarged, so that the trained translation model is more accurate.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic structural diagram of the apparatus of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention discloses a foreign language translation training method and a foreign language translation training device based on a corpus, wherein the method comprises the following steps as shown in figure 1:
step 1, randomly extracting a preset number of training corpora from any parallel corpus to construct a first parallel corpus;
A parallel corpus, also called a translation corpus, is a corpus formed by pairing original texts with their translations and is used for training, testing and the like of a machine translation model; it can pair, for example, Chinese and Mandarin, Chinese and English, Chinese and Japanese, or Japanese and Chinese texts.
In this embodiment, 1500 Chinese-English sentence pairs are randomly extracted from a Chinese-English translation corpus as training corpora, and a separate corpus is established for them, defined as the first parallel corpus; Chinese is defined as the source language and English as the target language.
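The random extraction in step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_first_parallel_corpus`, the `seed` parameter and the toy corpus are assumptions introduced here; only the sample size of 1500 comes from the embodiment.

```python
import random

def build_first_parallel_corpus(parallel_corpus, sample_size=1500, seed=None):
    """Randomly draw `sample_size` sentence pairs without replacement."""
    if sample_size > len(parallel_corpus):
        raise ValueError("sample_size exceeds the size of the parallel corpus")
    rng = random.Random(seed)  # a seed is optional and only makes the draw repeatable
    return rng.sample(parallel_corpus, sample_size)

# Toy stand-in for a Chinese-English parallel corpus: (source, target) pairs.
toy_corpus = [(f"zh sentence {i}", f"en sentence {i}") for i in range(5000)]
first_parallel_corpus = build_first_parallel_corpus(toy_corpus, sample_size=1500, seed=0)
```

Sampling without replacement keeps every extracted pair distinct, which matches the idea of extracting a subset of the existing corpus.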
Step 2, constructing and training an initial translation model according to the first parallel corpus, which specifically comprises the following steps:
preprocessing sentences in the first parallel corpus to obtain a preprocessed text;
performing word segmentation processing on the preprocessed text according to the automatic word segmentation model to obtain word segmentation text information;
based on the word segmentation text information, establishing and training an initial translation model by using a bidirectional recurrent neural network;
furthermore, the process of training by using the bidirectional recurrent neural network is as follows: the word segmentation text information is encoded in the forward and reverse directions by a bidirectional RNN encoder, and the hidden state of the bidirectional RNN encoder at each time step is determined; the hidden state and semantic vector of the encoder at each time step are then decoded by a unidirectional RNN decoder, so that the initial translation model is established and trained.
In this embodiment, encoding in both the forward and reverse directions with the bidirectional recurrent neural network determines a hidden state and a semantic vector for each time step, which avoids compressing the hidden states and semantic vectors of all time steps into a single fixed-length vector and improves the sentence translation accuracy of the initial translation model.
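To make the idea of per-time-step hidden states concrete, the following is a minimal sketch of a bidirectional encoder using a plain tanh RNN cell in pure Python. The cell type, the random weights, the sizes and all function names are assumptions for illustration; the patent does not specify them, and the decoder is not shown.

```python
import math
import random

def rnn_pass(inputs, w_in, w_rec):
    """Run a plain tanh RNN over `inputs`; return the hidden state at every time step."""
    hidden_size = len(w_rec)
    h = [0.0] * hidden_size
    states = []
    for x in inputs:
        h = [
            math.tanh(
                sum(w_in[j][k] * x[k] for k in range(len(x)))
                + sum(w_rec[j][k] * h[k] for k in range(hidden_size))
            )
            for j in range(hidden_size)
        ]
        states.append(h)
    return states

def bidirectional_encode(inputs, fwd_params, bwd_params):
    """Concatenate forward and backward hidden states for each time step."""
    forward = rnn_pass(inputs, *fwd_params)
    # Encode the reversed sequence, then flip back so index t lines up with token t.
    backward = rnn_pass(inputs[::-1], *bwd_params)[::-1]
    return [f + b for f, b in zip(forward, backward)]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
embed_size, hidden_size = 3, 4  # illustrative sizes only
fwd = (random_matrix(hidden_size, embed_size, rng), random_matrix(hidden_size, hidden_size, rng))
bwd = (random_matrix(hidden_size, embed_size, rng), random_matrix(hidden_size, hidden_size, rng))
tokens = [[rng.uniform(-1, 1) for _ in range(embed_size)] for _ in range(5)]
states = bidirectional_encode(tokens, fwd, bwd)  # one 8-dimensional state per token
```

Because the encoder returns one state per token rather than a single summary vector, a decoder can consult the state of any source position at each decoding step.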
The automatic word segmentation model is constructed by the following steps:
acquiring a preprocessed text, and performing word segmentation processing on the preprocessed text to obtain word segmentation text information at a character level;
acquiring part-of-speech tags and word segmentation tags of word segmentation text information;
combining part-of-speech labels and word segmentation labels of word segmentation text information to obtain binary label information;
training by using a long short-term memory (LSTM) network based on the word segmentation text information and the binary label information to obtain an automatic word segmentation model;
wherein the preprocessing comprises normalization, error correction, number normalization and the like of the training corpus;
in this embodiment, garbled-character filtering, conversion of Chinese half-width characters to full-width characters, Chinese word segmentation and lowercasing of the English corpus are performed in sequence on the data in the first parallel corpus, and a corresponding vocabulary is established.
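The preprocessing chain of this embodiment can be sketched roughly as below. The definition of "garbled" characters (the Unicode replacement character and control characters), the code-point offset used for width conversion and all function names are assumptions made for illustration; Chinese word segmentation is left to the automatic word segmentation model and is not shown.

```python
import unicodedata

def filter_garbled(text):
    """Drop the Unicode replacement character and control characters (tab/newline kept)."""
    return "".join(
        ch for ch in text
        if ch != "\ufffd" and (unicodedata.category(ch) != "Cc" or ch in "\t\n")
    )

def half_to_full_width(text):
    """Convert half-width ASCII characters to their full-width forms."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x20:                 # half-width space -> ideographic space
            out.append("\u3000")
        elif 0x21 <= code <= 0x7E:       # printable ASCII -> full-width block
            out.append(chr(code + 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)

def preprocess_pair(zh, en):
    """Clean one (Chinese, English) sentence pair."""
    zh_clean = half_to_full_width(filter_garbled(zh))
    en_clean = filter_garbled(en).lower()
    return zh_clean, en_clean

def build_vocab(token_lists):
    """Assign an integer id to each distinct token in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab
```

For example, `preprocess_pair("你好!", "Hello World")` converts the half-width exclamation mark on the Chinese side to its full-width form and lowercases the English side.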
Furthermore, the obtained word segmentation text information is used to train the long short-term memory network until the current iteration number is greater than or equal to the preset maximum iteration number, or the accuracy of the binary label information output by the network is greater than a preset accuracy threshold, at which point the automatic word segmentation model is obtained.
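One possible reading of the "binary label information" is a two-part label per character that joins a word segmentation tag with a part-of-speech tag. The sketch below assumes a B/M/E/S segmentation scheme and a "SEG-POS" label format, neither of which is stated in the patent; `should_stop` mirrors the stopping condition described above. All names are hypothetical.

```python
def char_level_labels(segmented_words):
    """Map (word, pos) pairs to per-character (char, "SEG-POS") labels.

    Assumed scheme: B = begin, M = middle, E = end, S = single-character word.
    """
    labelled = []
    for word, pos in segmented_words:
        if len(word) == 1:
            seg_tags = ["S"]
        else:
            seg_tags = ["B"] + ["M"] * (len(word) - 2) + ["E"]
        for ch, seg in zip(word, seg_tags):
            labelled.append((ch, f"{seg}-{pos}"))
    return labelled

def should_stop(iteration, accuracy, max_iterations, accuracy_threshold):
    """Stop when the iteration budget is used up or the label accuracy is high enough."""
    return iteration >= max_iterations or accuracy > accuracy_threshold

labels = char_level_labels([("我", "r"), ("喜欢", "v"), ("自然语言", "n")])
```

Joining the two tags lets a single sequence labeller predict word boundaries and parts of speech together, at the cost of a larger label set.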
Step 3, translating the source language sentences in any monolingual corpus into target language sentences by using the initial translation model to obtain a translation corpus, which specifically comprises the following steps:
selecting any corpus from the existing Chinese corpora, translating all sentences in that corpus into English sentences with the initial translation model, storing all the English sentences in a corpus in translation order, and defining that corpus as the translation corpus;
step 4, calculating a translation confidence score of any statement in the translation corpus, and comparing the translation confidence score with a preset evaluation threshold; the method specifically comprises the following steps:
obtaining a translation confidence evaluation index according to historical data;
acquiring the weight of each translation confidence evaluation index;
obtaining a translation confidence score of any statement according to each translation confidence evaluation index and the corresponding weight;
the translation confidence score is compared to a preset evaluation threshold.
The translation confidence evaluation indexes may include: the fluency of the translated sentence, the translation probability between words in the source language sentence and the translated sentence, and the translation probability between phrases in the source language sentence and the translated sentence;
the translation probability is related to the language habits, fixed collocations and field of the source language sentence and of the translated sentence, which is English in this embodiment.
The calculation formula of the translation confidence score is as follows:
[Formula image BDA0003528785900000071, not reproduced in this text]
in the formula, i is the index number of a translation confidence evaluation index; $\lambda_i$ is the weight of the i-th translation confidence evaluation index; $h_i$ is the i-th translation confidence evaluation index.
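The formula itself is not available in this text. A natural reading of the symbol definitions is a weighted sum of the evaluation indexes, and the sketch below assumes exactly that; the weighted-sum form, the function name and the example values are assumptions, not the patent's confirmed formula.

```python
def translation_confidence(indicators, weights):
    """Weighted sum of evaluation indexes h_i with weights lambda_i (assumed form)."""
    if len(indicators) != len(weights):
        raise ValueError("each evaluation index needs exactly one weight")
    return sum(w * h for w, h in zip(weights, indicators))

# Hypothetical values: fluency, word-level and phrase-level translation probability.
example_indicators = [0.8, 0.6, 0.5]
example_weights = [0.5, 0.3, 0.2]
example_score = translation_confidence(example_indicators, example_weights)
```

With weights that sum to one and indexes in the range 0 to 1, the score stays in the same range, which makes it straightforward to compare against a fixed evaluation threshold.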
Step 5, updating the translation corpus according to the comparison result, and splicing the translation corpus with any monolingual corpus to obtain a second parallel corpus;
the step of updating the translation corpus is as follows:
calculating a translation confidence score of any sentence in the translation corpus, and comparing the translation confidence score with the preset evaluation thresholds; if the score is greater than or equal to the optimal (second) evaluation threshold, the translation corpus is not updated; if the score is smaller than the minimum (first) evaluation threshold, performing text recognition on the sentence in the translation corpus according to a preset length;
matching the recognized text with the text in the source language sentence;
acquiring a text to be replaced according to a monolingual corpus of a target language;
replacing the text to be replaced with the corresponding content in the recognized text to obtain a new translation sentence;
calculating the translation confidence score of the new translation sentence; if the score is smaller than the minimum (first) evaluation threshold, replacing the texts to be replaced one by one and calculating the translation confidence score of each result to obtain the best translation sentence, and updating the translation corpus; if the score is greater than or equal to the optimal (second) evaluation threshold, storing the fully replaced sentence in the translation corpus and updating the translation corpus.
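The update procedure can be sketched as a greedy fragment replacement loop. This is a simplified interpretation under several assumptions: fragments are fixed-length word windows, the caller supplies a scoring function and a hypothetical `candidates_for` lookup standing in for the target-language monolingual corpus, matching against the source sentence is omitted, and sentences scoring between the two thresholds are left unchanged because the text does not say how to treat them.

```python
def split_fragments(sentence, length):
    """Cut a sentence into consecutive fragments of `length` words."""
    words = sentence.split()
    return [" ".join(words[i:i + length]) for i in range(0, len(words), length)]

def update_translation(sentence, score, candidates_for, low, high, length=2):
    """Return the sentence to store in the translation corpus."""
    best_score = score(sentence)
    if best_score >= low:
        # Covers the "no update" case; the in-between case is an assumption of this sketch.
        return sentence
    fragments = split_fragments(sentence, length)
    for idx in range(len(fragments)):
        for replacement in candidates_for(fragments[idx]):
            trial = fragments[:idx] + [replacement] + fragments[idx + 1:]
            trial_score = score(" ".join(trial))
            if trial_score > best_score:      # keep a replacement only if it helps
                fragments, best_score = trial, trial_score
            if best_score >= high:            # good enough: stop early
                return " ".join(fragments)
    return " ".join(fragments)

# Toy scorer and candidate table, purely illustrative.
GOOD_WORDS = {"i", "have", "an", "apple"}

def toy_score(sentence):
    words = sentence.split()
    return sum(w in GOOD_WORDS for w in words) / len(words)

TOY_CANDIDATES = {"i has": ["i have"], "a apple": ["an apple"]}

def toy_candidates(fragment):
    return TOY_CANDIDATES.get(fragment, [])

updated = update_translation("i has a apple", toy_score, toy_candidates, low=0.6, high=0.9)
```

Trying replacements one fragment at a time keeps the number of scoring calls linear in the number of candidates, at the price of possibly missing a better combination of replacements.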
Step 6, acquiring a whole corpus according to any one parallel corpus and the second parallel corpus, and training the initial translation model again.
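The corpus bookkeeping of steps 3, 5 and 6 can be sketched as follows. The function names, the `translate` callable and the toy dictionary "model" are assumptions for illustration; the confidence-based update of the translation corpus would run between translating and splicing and is not repeated here.

```python
def build_translation_corpus(monolingual_corpus, translate):
    """Translate every source sentence, keeping the original order."""
    return [translate(sentence) for sentence in monolingual_corpus]

def splice(monolingual_corpus, translation_corpus):
    """Pair each source sentence with its translation to form the second parallel corpus."""
    if len(monolingual_corpus) != len(translation_corpus):
        raise ValueError("corpora must be aligned sentence by sentence")
    return list(zip(monolingual_corpus, translation_corpus))

def build_whole_corpus(parallel_corpus, second_parallel_corpus):
    """Combine the original parallel corpus with the second parallel corpus."""
    return list(parallel_corpus) + list(second_parallel_corpus)

# Toy stand-in for the initial translation model.
toy_model = {"你好": "hello", "谢谢": "thank you"}
monolingual = ["你好", "谢谢"]
translations = build_translation_corpus(monolingual, lambda s: toy_model.get(s, ""))
second_parallel = splice(monolingual, translations)
whole_corpus = build_whole_corpus([("再见", "goodbye")], second_parallel)
```

Storing translations in translation order is what makes the index-by-index splice valid; the whole corpus is then the training set for retraining the initial model.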
The embodiment further includes a foreign language translation training device based on a corpus, as shown in fig. 2, the structure of which includes an initial training module, an evaluation module, a first corpus construction module, a second corpus construction module, and a retraining module;
the initial training module is used for constructing and training an initial translation model according to the first parallel corpus;
the evaluation module is used for calculating the translation confidence score of any statement in the translation corpus and comparing the translation confidence score with a preset evaluation threshold; updating the translation corpus according to the comparison result;
the first corpus building module is used for splicing the updated translation corpus and any monolingual corpus to obtain a second parallel corpus;
the second corpus construction module is used for acquiring an integral corpus according to any one parallel corpus and the second parallel corpus;
and the retraining module is used for retraining the initial translation model according to the whole corpus.
The initial training module comprises a corpus extraction module, a preprocessing module, an automatic word segmentation module and a model training module;
the corpus extraction module is used for extracting training corpuses with preset quantity and constructing a first parallel corpus;
the preprocessing module is used for preprocessing the sentences in the first parallel corpus to obtain a preprocessed text;
the automatic word segmentation module is used for carrying out word segmentation processing on the preprocessed text to obtain word segmentation text information;
and the model training module is used for establishing and training an initial translation model.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A foreign language translation training method based on a corpus is characterized by comprising the following specific steps:
randomly extracting a preset number of training corpora from any parallel corpus to construct a first parallel corpus;
constructing and training an initial translation model according to the first parallel corpus;
translating a source language sentence in any monolingual corpus into a target language sentence by using an initial translation model to obtain a translation corpus;
calculating a translation confidence score of any statement in the translation corpus, and comparing the translation confidence score with a preset evaluation threshold;
updating the translation corpus according to the comparison result, and splicing the translation corpus with any monolingual corpus to obtain a second parallel corpus;
and acquiring the whole corpus according to any one parallel corpus and the second parallel corpus, and training the initial translation model again.
2. The corpus-based foreign language translation training method according to claim 1, wherein the initial translation model is constructed by the steps of:
preprocessing sentences in the first parallel corpus to obtain a preprocessed text;
performing word segmentation processing on the preprocessed text according to the automatic word segmentation model to obtain word segmentation text information;
and establishing and training an initial translation model with a recurrent neural network based on the word segmentation text information.
3. The corpus-based foreign language translation training method according to claim 2, wherein the automatic word segmentation model is obtained by the steps of:
acquiring a preprocessed text, and performing word segmentation processing on the preprocessed text to obtain word segmentation text information at a character level;
acquiring part-of-speech tags and word segmentation tags of word segmentation text information;
combining part-of-speech labels and word segmentation labels of word segmentation text information to obtain binary label information;
and training by using a recurrent neural network based on the word segmentation text information and the binary label information to construct an automatic word segmentation model.
4. The corpus-based foreign language translation training method according to claim 1, wherein the step of obtaining the translation confidence score of any sentence is:
obtaining a translation confidence evaluation index according to historical data;
acquiring the weight of each translation confidence evaluation index;
and obtaining the translation confidence score of any statement according to each translation confidence evaluation index and the corresponding weight.
5. The corpus-based foreign language translation training method according to any one of claims 1-4, wherein the translation confidence score is calculated by the formula:
[Formula image FDA0003528785890000021, not reproduced in this text]
in the formula, i is the index number of a translation confidence evaluation index; $\lambda_i$ is the weight of the i-th translation confidence evaluation index; $h_i$ is the i-th translation confidence evaluation index.
6. The corpus-based foreign language translation training method according to claim 1, wherein the step of updating the translation corpus specifically comprises:
calculating a translation confidence score of any statement in the translation corpus, and comparing the translation confidence score with a preset evaluation threshold; if the evaluation value is larger than or equal to the second evaluation threshold value, the translation corpus is not updated; if the evaluation value is smaller than the first evaluation threshold value, text recognition is carried out on any sentence in the translation corpus according to a preset length;
matching the recognized text with the text in the source language sentence;
acquiring a text to be replaced according to a monolingual corpus of a target language;
replacing the text to be replaced with the corresponding content in the identified text to obtain a second translation sentence;
calculating translation confidence score of the second translation statement, if the translation confidence score is smaller than a first evaluation threshold, replacing the texts to be replaced one by one, and calculating the translation confidence score respectively to obtain the best translation statement and update the translation corpus; and if the evaluation value is larger than or equal to the second evaluation threshold value, storing the second translation sentence into the translation corpus and updating the translation corpus.
7. A foreign language translation training device based on a corpus is characterized by comprising an initial training module, an evaluation module, a first corpus construction module, a second corpus construction module and a retraining module;
the initial training module is used for constructing and training an initial translation model according to the first parallel corpus;
the evaluation module is used for calculating the translation confidence score of any statement in the translation corpus and comparing the translation confidence score with a preset evaluation threshold; updating the translation corpus according to the comparison result;
the first corpus building module is used for splicing the updated translation corpus with any monolingual corpus to obtain a second parallel corpus;
the second corpus construction module is used for acquiring an integral corpus according to any one parallel corpus and a second parallel corpus;
and the retraining module is used for retraining the initial translation model according to the whole corpus.
8. The corpus-based foreign language translation training device according to claim 7, wherein the initial training module comprises a corpus extraction module, a preprocessing module, an automatic word segmentation module, and a model training module;
the corpus extraction module is used for extracting a preset number of training corpuses and constructing a first parallel corpus;
the preprocessing module is used for preprocessing the sentences in the first parallel corpus to obtain a preprocessed text;
the automatic word segmentation module is used for carrying out word segmentation processing on the preprocessed text to obtain word segmentation text information;
and the model training module is used for establishing and training an initial translation model.
CN202210204937.5A 2022-03-02 2022-03-02 Foreign language translation training method and device based on corpus Withdrawn CN114528861A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210204937.5A CN114528861A (en) 2022-03-02 2022-03-02 Foreign language translation training method and device based on corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210204937.5A CN114528861A (en) 2022-03-02 2022-03-02 Foreign language translation training method and device based on corpus

Publications (1)

Publication Number Publication Date
CN114528861A true CN114528861A (en) 2022-05-24

Family

ID=81626033

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210204937.5A Withdrawn CN114528861A (en) 2022-03-02 2022-03-02 Foreign language translation training method and device based on corpus

Country Status (1)

Country Link
CN (1) CN114528861A (en)

Similar Documents

Publication Publication Date Title
WO2023065544A1 (en) Intention classification method and apparatus, electronic device, and computer-readable storage medium
US8131539B2 (en) Search-based word segmentation method and device for language without word boundary tag
CN110532573B (en) Translation method and system
CN112632997A (en) Chinese entity identification method based on BERT and Word2Vec vector fusion
CN112541356B (en) Method and system for recognizing biomedical named entities
KR102043353B1 (en) Apparatus and method for recognizing Korean named entity using deep-learning
CN111709242A (en) Chinese punctuation mark adding method based on named entity recognition
CN113268576B (en) Deep learning-based department semantic information extraction method and device
CN115759119B (en) Financial text emotion analysis method, system, medium and equipment
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
CN114298010A (en) Text generation method integrating dual-language model and sentence detection
CN115587590A (en) Training corpus construction method, translation model training method and translation method
CN112417823A (en) Chinese text word order adjusting and quantitative word completion method and system
CN109815497B (en) Character attribute extraction method based on syntactic dependency
CN115033753A (en) Training corpus construction method, text processing method and device
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN114564912A (en) Intelligent checking and correcting method and system for document format
Pal et al. Vartani Spellcheck--Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance
CN114722774B (en) Data compression method, device, electronic equipment and storage medium
CN113139050B (en) Text abstract generation method based on named entity identification additional label and priori knowledge
Mekki et al. COTA 2.0: An automatic corrector of Tunisian Arabic social media texts
Ramesh et al. Interpretable natural language segmentation based on link grammar
CN111090720B (en) Hot word adding method and device
CN114528861A (en) Foreign language translation training method and device based on corpus
CN109960720B (en) Information extraction method for semi-structured text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20220524