CN114780667A - Corpus construction and filtering method and system - Google Patents
Info
- Publication number
- CN114780667A (application number CN202210356507.5A)
- Authority
- CN
- China
- Prior art keywords
- corpus
- bilingual
- filtering
- webpage
- parallel corpus
- Prior art date: 2022-04-06
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/353—Clustering; Classification into predefined classes
- G06F16/951—Indexing; Web crawling techniques
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/263—Language identification
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30—Semantic analysis
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a corpus construction and filtering method and system, comprising: step 1: performing document and sentence alignment on the obtained initial parallel corpus to obtain a bilingual parallel corpus; step 2: filtering the bilingual parallel corpus according to its text alignment degree to obtain a corpus. Compared with the prior art, the invention realizes a scheme for automatically collecting multilingual corpora in many languages from the Internet and, based on this scheme, completes automatic alignment of the corpora. In addition, a high-performance filtering method for low-resource languages is designed, which further improves the quality of the generated corpus and provides data guarantees and support for downstream natural language processing tasks.
Description
Technical Field
The invention relates to the technical field of cross-language text translation and alignment, in particular to a corpus construction and filtering method and system, and more particularly to a corpus construction and filtering method for a low-resource-language translation system.
Background
With the continuing build-out of information infrastructure in the regions concerned and the growing complexity of multilingual communication among them, information systems increasingly rely on high-quality cross-language services, so multilingual cross-language information processing services are urgently needed. Many countries use many languages, the language situation is complex, and for most of these languages resources are scarce and acquisition is difficult and costly; such languages are called low-resource languages. In recent years, neural machine translation models have achieved the best translation performance. However, neural machine translation depends on high-quality bilingual corpora, and the quality and scale of bilingual corpus pairs for a given language pair, such as Chinese-Nepali, have an important influence on the training effect of machine translation.
Therefore, an important prerequisite for building a good low-resource translation system is the ability to obtain high-quality bilingual corpora with rich content.
Patent document CN114139561A discloses a method for improving the translation performance of a multi-domain neural machine translation system, comprising the following steps: crawling massive data as model training corpora and dividing them into domain-specific corpora and a multi-domain parallel corpus; calculating the similarity between each sentence in the multi-domain parallel corpus and each domain-specific corpus; selecting sentences with high average similarity to several domain-specific corpora from the multi-domain parallel corpus as the training set of the multi-domain model; constructing a multi-domain deep neural machine translation model and several domain-specific deep neural machine translation models, training them and saving the model parameters; and calculating the similarity between each domain-specific corpus and the multi-domain parallel corpus, and performing cyclic knowledge distillation between the multi-domain model and each domain-specific model, finally obtaining a multi-domain neural machine translation model with improved performance. That patent document proposes extracting texts with web crawlers and selecting training data by sentence-vector similarity. However, it does not specify the crawling targets, the crawling method or the text processing method, does not provide a corpus selection and filtering algorithm for low-resource corpora, and therefore does not solve the problem of selecting and filtering low-resource corpora.
Disclosure of Invention
In view of the defects in the prior art, the object of the invention is to provide a corpus construction and filtering method and system.
The corpus construction and filtering method provided by the invention comprises the following steps:
step 1: performing document and sentence alignment on the obtained initial parallel corpus to obtain a bilingual parallel corpus;
step 2: filtering the bilingual parallel corpus according to its text alignment degree to obtain a corpus.
Preferably, step 1 comprises:
step 101: acquiring the initial parallel corpus from a preset resource library;
step 102: segmenting the initial parallel corpus into sentences and performing sentence alignment to obtain the bilingual parallel corpus.
Preferably, step 2 comprises:
step 201: inputting the bilingual parallel corpus into a translation model to obtain the feature vector output by the translation model;
step 202: inputting the feature vector into a multilayer perceptron to obtain the text alignment degree;
step 203: filtering by the text alignment degree to obtain the corpus.
Preferably, step 1 further comprises:
step 103: performing web page derivation on the preset resource library to obtain a derived web page set;
step 104: adding the derived web page set to the preset resource library.
Preferably, step 203 comprises:
step 2031: if the text alignment degree is greater than or equal to a preset threshold, putting the corresponding bilingual parallel corpus into the corpus;
step 2032: if the text alignment degree is smaller than the preset threshold, discarding the corresponding bilingual parallel corpus.
The corpus construction and filtering system provided by the invention comprises:
module M1: performing document and sentence alignment on the obtained initial parallel corpus to obtain a bilingual parallel corpus;
module M2: filtering the bilingual parallel corpus according to its text alignment degree to obtain a corpus.
Preferably, module M1 comprises:
submodule M101: acquiring the initial parallel corpus from a preset resource library;
submodule M102: segmenting the initial parallel corpus into sentences and performing sentence alignment to obtain the bilingual parallel corpus.
Preferably, module M2 comprises:
submodule M201: inputting the bilingual parallel corpus into a translation model to obtain the feature vector output by the translation model;
submodule M202: inputting the feature vector into a multilayer perceptron to obtain the text alignment degree;
submodule M203: filtering by the text alignment degree to obtain the corpus.
Preferably, module M1 further comprises:
submodule M103: performing web page derivation on the preset resource library to obtain a derived web page set;
submodule M104: adding the derived web page set to the preset resource library.
Preferably, submodule M203 comprises:
unit D2031: if the text alignment degree is greater than or equal to a preset threshold, putting the corresponding bilingual parallel corpus into the corpus;
unit D2032: if the text alignment degree is smaller than the preset threshold, discarding the corresponding bilingual parallel corpus.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention greatly improves the capability of collecting low-resource corpora and continuously explores more monolingual and multilingual language information through seed websites, thereby acquiring more and newer low-resource corpora.
2. The invention can automatically complete sentence segmentation, word alignment and similar work for multiple languages, preliminarily screens out corpora of excessively low quality, and greatly improves the speed of corpus evaluation and filtering.
3. On the basis of the obtained corpus information, the invention evaluates corpus quality with an alignment algorithm, filters out the noisier parts and generates parallel texts with higher utilization value. The performance of this algorithm exceeds that of evaluation methods based on sentence vectors, word vectors, dual cross-entropy and the like.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram illustrating an initial parallel corpus obtaining method according to the present invention;
FIG. 3 is a schematic diagram of a cross-language model architecture based on pre-training according to the present invention.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit it in any way. It should be noted that various changes and modifications, which would be obvious to those skilled in the art, can be made without departing from the spirit of the invention; all of these fall within the scope of the present invention.
High-quality corpora have an important influence on downstream natural language processing tasks, yet some low-resource corpora are scarce and difficult to acquire. Existing algorithms for automatic collection from the Internet gather corpora of a single type and language, and the effectiveness of existing corpus filtering techniques depends too heavily on the quality of an existing bilingual dictionary.
Some existing technologies automatically collect multilingual corpora from the Internet, but they usually crawl corpora from websites with limited content, such as news websites, technical communities and Wikipedia. Because these methods are oriented to fixed websites, it is difficult for them to acquire continuously updated corpus information or to obtain more and richer corpora for low-resource languages.
In addition, existing corpus filtering techniques focus on statistical methods or on word-vector and sentence-vector methods. Statistical methods often filter corpora poorly and cannot effectively screen out high-quality bilingual data for use, while word-vector methods depend on the quality of the bilingual dictionary and their filtering effect is often inconsistent.
In view of these problems, the invention designs a scheme for automatically acquiring multilingual corpora from many Internet sources, and based on this scheme the corpora can be aligned automatically. In addition, a high-performance filtering method for low-resource languages is designed, which further improves the quality of the generated corpus and provides data guarantees and support for downstream natural language processing tasks.
Fig. 1 is a schematic flow diagram of the present invention. As shown in Fig. 1, the invention provides a corpus construction and filtering method comprising the following steps:
Step 1: performing document and sentence alignment on the obtained initial parallel corpus to obtain the bilingual parallel corpus.
Specifically, step 1 comprises: step 101: acquiring the initial parallel corpus from a preset resource library; step 102: segmenting the initial parallel corpus into sentences and performing sentence alignment to obtain the bilingual parallel corpus.
It should be understood that the invention automatically collects Internet multilingual corpora and constructs the initial parallel corpus mainly in the following three cases, each of which corresponds to a preset resource library.
Case 1: crawling initial parallel corpora from Internet corpus resources.
The Internet has massive multilingual text resources, large in quantity and covering a wide range of fields, so parallel corpora can be crawled and mined from Internet web resources. The invention obtains the initial parallel corpus by crawling parallel corpora across web pages. Fig. 2 is a schematic diagram of how the initial parallel corpus is obtained; as shown in Fig. 2, the procedure comprises: acquiring bilingual candidate websites; acquiring bilingual content pairs; judging corpus parallelism; and performing document and sentence alignment.
First, bilingual candidate websites are obtained. A bilingual candidate website can be understood as a seed website to be crawled, that is, a website likely to provide bilingual corpora; the invention needs to find the relevant bilingual candidate websites before corpus collection. The acquisition covers the following two cases.
Case one, single-web-page bilingual corpus collection: a single web page contains aligned bilingual corpora, but the corpora are not concentrated, for example online translation and online dictionary pages.
Case two, multi-web-page bilingual corpus collection: two different web pages contain aligned bilingual corpora; typically the same web page content exists in versions in multiple languages, for example Wikipedia and the different language versions of official websites.
Then, bilingual content pairs are obtained. After the bilingual candidate websites have been obtained, the invention acquires bilingual content from the different types of web pages in the following two ways.
In the first way, web page pairs are obtained. In the single-web-page bilingual corpus collection process, the initial corpus that the invention needs to obtain is a bilingual web page pair, and the following two methods are used to obtain parallel corpora.
Specifically, the first method directly uses links in the web pages: some websites provide directly selectable links for switching to other languages, which can be crawled directly to construct bilingual web page pairs. The second method analyzes the pattern of the Uniform Resource Locator (URL), extracts the language-related part and replaces it with other languages to construct possible bilingual web page pairs.
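As an illustration of the URL-replacement idea just described, the following sketch derives candidate counterpart URLs by swapping a language-related path segment. The language codes, the example URL and the helper name are illustrative assumptions, not taken from the patent.

```python
from urllib.parse import urlparse, urlunparse

# Illustrative language codes; a real deployment would use the site's own codes.
LANG_CODES = ["en", "zh", "ne", "ru"]

def candidate_counterpart_urls(url, source_lang):
    """Swap the language-related path segment to build possible bilingual page pairs."""
    parts = urlparse(url)
    segments = parts.path.split("/")
    candidates = []
    for target in LANG_CODES:
        if target == source_lang:
            continue
        # Replace only the path segments that exactly match the source language code.
        new_segments = [target if seg == source_lang else seg for seg in segments]
        if new_segments != segments:
            candidates.append(urlunparse(parts._replace(path="/".join(new_segments))))
    return candidates

# e.g. https://example.org/en/news/123 -> counterparts under /zh/, /ne/, /ru/
print(candidate_counterpart_urls("https://example.org/en/news/123", "en"))
```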
In the second way, paragraph pairs are obtained. In the two-web-page bilingual corpus collection process, the preliminary corpus to be obtained is a bilingual paragraph pair. On such websites the corresponding paragraphs generally appear regularly at corresponding positions of the web pages, and the URL pattern is related to the query content. Therefore, the invention extracts the language-related portion of the URL by analyzing the URL pattern and analyzes the structure of the HyperText Markup Language (HTML) page, thereby extracting the corresponding bilingual paragraph pairs.
Corpus parallelism is then judged. After the corresponding corpus sequences are obtained, the parallelism of the corpora in a sequence cannot be guaranteed, for the following two reasons.
First, because web page updates are not synchronized, the web page in one language of a bilingual web page pair may have been updated while the web page in the other language has not. Second, the website URLs are constructed by the invention during corpus extraction, so when the websites are actually accessed the web pages, or some of their paragraphs, may turn out not to be aligned.
Therefore, after the simple URL matching, the invention further checks the parallelism of the corpora in order to find high-quality parallel corpora. The parallelism is measured through three features: the length of the web pages, the linear HTML structure of the web pages, and the alignment of the individual sentences in the web pages. The more similar the lengths and HTML linear structures of a bilingual web page pair, and the larger the number of aligned sentences in the pages, the greater the possibility that the pages are aligned and the higher the parallelism of the corpus. In particular, the invention uses a K-Nearest-Neighbors (KNN) classifier to fuse these features for classification and to judge whether the corpora are aligned.
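A minimal sketch of how the three parallelism features could be fused with a KNN classifier is given below; the concrete feature computations, the scikit-learn API and the labelled training pairs are assumptions, since the patent only names the feature types and the KNN fusion.

```python
import re
from sklearn.neighbors import KNeighborsClassifier

def page_pair_features(html_a, html_b, aligned_sentence_count):
    """Three rough features: length similarity, HTML tag-structure overlap, aligned-sentence count."""
    length_ratio = min(len(html_a), len(html_b)) / max(len(html_a), len(html_b))
    tags_a, tags_b = set(re.findall(r"<(\w+)", html_a)), set(re.findall(r"<(\w+)", html_b))
    structure_overlap = len(tags_a & tags_b) / max(1, len(tags_a | tags_b))
    return [length_ratio, structure_overlap, aligned_sentence_count]

def train_parallelism_classifier(train_pairs, labels, k=5):
    """train_pairs: (html_a, html_b, aligned_count) tuples; labels: 1 = parallel, 0 = not."""
    features = [page_pair_features(a, b, c) for a, b, c in train_pairs]
    return KNeighborsClassifier(n_neighbors=k).fit(features, labels)

# classifier.predict([page_pair_features(html_a, html_b, count)]) then decides
# whether a candidate bilingual web page pair is treated as aligned.
```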
Finally, the documents are split into sentences and aligned. The corpus parallelism judgment above judges the alignment of whole web pages and therefore belongs to document-level alignment. In machine translation training, considering computation cost, hardware requirements and the final translation effect, the more common input is parallel sentences rather than documents, so the invention further segments the web page documents extracted from the web pages and performs sentence alignment. To align sentences better, the invention sorts the corresponding bilingual documents by computing sentence similarity and alignment in terms of length, punctuation and a bilingual dictionary, splits paragraphs into sentences of suitable length, and obtains the final bilingual parallel corpus.
Case 2: extracting parallel corpora from bilingual reference corpus datasets.
Besides crawling scattered parallel corpora from the Internet, the invention also directly extracts bilingual corpora that are already collected on the Internet. Many bilingual parallel corpora are directly available, for example the United Nations Parallel Corpus (UN Parallel Corpus) and official literary work corpora (rucorpus); the integrated corpus data can be downloaded directly, or the sentence-aligned bilingual parallel corpora can be extracted indirectly using the characteristics of the websites' URLs.
Specifically, open-source bilingual reference corpora that directly provide a complete data package can be downloaded directly. For the part of the open-source bilingual reference corpora that does not directly provide an integrated data package, the specific processing flow is as follows:
First, web pages are grouped within a site. Because existing bilingual corpus datasets contain different types of data, and to make acquisition and use easier, the invention reduces code complexity by putting web pages with similar URL characteristics into the same group. Specifically, a URL-parsing library is used to split the original URL of a web page into six parts, namely communication protocol (scheme), host domain name (netloc), resource path (path), parameters (params), query (query) and information fragment (fragment), and only the host domain name and the resource path are kept; regular expressions are then used to match and remove the language-related parts of the host domain name and of the resource path. The processed host domain name and resource path are concatenated, and the resulting string is used as the identifier of a web page resource group. All web page URLs that share the same URL characteristics but differ in language yield the same string after this processing and are therefore put into the same web page resource group.
Then, corpora are crawled within each group. For the grouped corpus resources, the text content of suitable length in the relevant web page elements is obtained through the URLs, and after the web page elements are matched and screened, the usable bilingual corpus information is obtained.
Finally, the bilingual sentences are aligned. In some bilingual parallel corpora, sentences on different web pages may not align well. Therefore, to obtain a better sentence alignment result, the invention matches pairs of web pages within the same group whose length ratio falls within a suitable interval, and finally extracts bilingual sentences from them as the preliminarily obtained bilingual parallel corpus.
Although the corpora extracted in this way already have good sentence alignment and do not require document alignment, the Chinese in the given corpora is often in traditional characters, so the invention performs conversion between traditional and simplified Chinese. In this project, the toolkit OpenCC is used as the converter to turn traditional-character corpora into usable simplified-Chinese corpora, after which the initial bilingual parallel corpora can be collected from the bilingual reference corpus data.
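A minimal sketch of this normalization step, assuming the OpenCC Python bindings (e.g. the opencc package) are installed:

```python
from opencc import OpenCC

t2s = OpenCC("t2s")  # "t2s": traditional-to-simplified conversion configuration

def normalize_chinese(sentences):
    """Convert traditional-character sentences into simplified Chinese."""
    return [t2s.convert(s) for s in sentences]

print(normalize_chinese(["這是一幅藍色的畫。"]))  # -> ['这是一幅蓝色的画。']
```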
Case 3: mining parallel corpora from a single web page.
Besides the above two methods of obtaining bilingual corpora, the invention also obtains bilingual corpora by mining parallel corpora on single web pages. Among Internet web resources there are a large number of bilingual web pages that contain pairs of mutually aligned sentences. In order to mine parallel corpora that appear in different languages but with the same semantics within the same web page, the invention mines a single web page with the following procedure.
First, the languages of the web page are detected and screened. Before extracting web page content, it is necessary to detect whether a candidate web page contains the required pair of parallel languages; in addition, to improve the quality and quantity of the collected corpora, the proportion of the two languages in the selected web page must reach a certain threshold. The invention uses the pycld2 library to detect languages and obtain the language proportions of the corresponding text; if the language proportions in a web page are above the set threshold, the next extraction stage is entered, otherwise another web page is selected for detection.
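A hedged sketch of this screening step with pycld2; the required language codes and the proportion threshold are assumptions, since the patent does not fix them.

```python
import pycld2

def passes_language_screen(text, required_langs=("en", "zh"), min_percent=30):
    """Check that each required language makes up at least min_percent of the page text."""
    _, _, details = pycld2.detect(text)
    # details entries look like (languageName, languageCode, percent, score)
    percents = {code: percent for _name, code, percent, _score in details}
    return all(percents.get(lang, 0) >= min_percent for lang in required_langs)
```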
The web page content is then split into sentences. After the screened web pages are obtained, the project extracts the text in the web pages to obtain whole chapters and paragraphs. Because most existing machine translation models are trained on sentence-level parallel corpora, the invention also segments the articles into sentences. The polyglot library is used to split the web page content into sentences, completing the sentence segmentation of the chapters and yielding the corresponding sentence sets. Then, to facilitate the subsequent extraction of bilingual parallel corpora, language detection is again performed on the sentences with the pycld2 library and the language proportions are screened in the same way; high-quality sentences of the same language are placed in the same group, giving sentence groups for multiple languages.
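A sketch of the sentence-splitting and per-language grouping step, assuming the polyglot package and its segmentation models are available; the exact accessor for a sentence's text and the 90% proportion threshold are assumptions.

```python
from collections import defaultdict
from polyglot.text import Text
import pycld2

def sentences_by_language(page_text, min_percent=90):
    """Split page text into sentences and group high-confidence sentences by language code."""
    groups = defaultdict(list)
    for sentence in Text(page_text).sentences:   # polyglot sentence segmentation
        s = str(sentence)
        reliable, _, details = pycld2.detect(s)
        _name, code, percent, _score = details[0]
        if reliable and percent >= min_percent:  # keep sentences dominated by one language
            groups[code].append(s)
    return groups                                # e.g. {"en": [...], "zh": [...]}
```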
Finally, the bilingual sentences are aligned and screened. For the grouped language sentence sets, the target bilingual languages are denoted language A and language B, the two corresponding sentence groups are taken out, the corresponding aligned sentence pairs are matched between the two groups, and the bilingual sentence alignment is further screened. Because sentences in different languages usually appear in corresponding order within a web page, the invention searches for the alignment counterparts of sentences in the different languages in order and matches the sentences sequentially.
For the sentence pairs to be matched, the invention applies the following screening procedure: clean the sentences, segment them into words, filter by part of speech, map language A to language B through a bilingual dictionary, and compute the alignment degree of the sentences from language A to language B; the alignment degree from language B to language A is obtained in the same way. The two are averaged to obtain the final alignment degree of the sentence pair, which is then compared with a preset alignment threshold. If it is above the threshold, the sentence pair is further screened against the requirements defined by the target, such as length limits, and if the requirements are met the sentence pair is added to the final bilingual parallel corpus.
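A minimal sketch of the bidirectional, dictionary-based alignment degree described above, assuming tokenised input and word-level A-to-B and B-to-A dictionaries; cleaning and part-of-speech filtering are omitted.

```python
def directional_alignment(tokens_src, tokens_tgt, dict_src_to_tgt):
    """Fraction of source tokens whose dictionary translations appear in the target sentence."""
    tgt_set = set(tokens_tgt)
    covered = sum(
        1 for tok in tokens_src
        if any(t in tgt_set for t in dict_src_to_tgt.get(tok, ()))
    )
    return covered / max(1, len(tokens_src))

def sentence_alignment_degree(tokens_a, tokens_b, dict_ab, dict_ba):
    a_to_b = directional_alignment(tokens_a, tokens_b, dict_ab)
    b_to_a = directional_alignment(tokens_b, tokens_a, dict_ba)
    return (a_to_b + b_to_a) / 2   # averaged both ways, then compared with a preset threshold
```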
Further, step 1 also comprises: step 103: performing web page derivation on the preset resource library to obtain a derived web page set; step 104: adding the derived web page set to the preset resource library.
Specifically, web page sub-links are used for derivation. After the initial parallel corpus mining process has finished, in order to obtain more bilingual parallel corpus resources conveniently, the invention further derives new pages from the sub-links of the crawled web pages. For each candidate web page, the invention computes the proportion of sentences in the web page that were mined as bilingual parallel corpus in the previous steps, and regards web pages whose proportion is above a threshold as high-quality corpus web pages. For high-quality corpus web pages, their sub-links are crawled to obtain a derived web page set. After the web pages in this set are deduplicated against the existing candidate web pages, they are added to the candidate set to be crawled, further enriching the corpus.
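A rough sketch of this derivation step; the 0.5 ratio threshold, the fetch_html helper and the href-extraction regex are illustrative assumptions, not values given in the patent.

```python
import re

def derive_candidate_pages(candidate_pages, mined_counts, sentence_counts,
                           fetch_html, ratio_threshold=0.5):
    """Crawl sub-links of pages whose mined-parallel-sentence ratio exceeds the threshold."""
    derived = set()
    for url in candidate_pages:
        total = max(1, sentence_counts.get(url, 0))
        if mined_counts.get(url, 0) / total >= ratio_threshold:   # high-quality corpus page
            html = fetch_html(url)
            derived.update(re.findall(r'href="(https?://[^"]+)"', html))
    return derived - set(candidate_pages)        # deduplicate against existing candidates
```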
Step 2: and according to the text alignment degree of the bilingual parallel corpus, performing filtering processing to obtain a corpus.
Specifically, step 2 includes: step 201: inputting bilingual parallel corpora into a translation model to obtain a feature vector output by the translation model; step 202: inputting the feature vectors into a multilayer perceptron to obtain the text alignment; step 203: and filtering the text alignment to obtain a corpus.
In order to further filter bilingual corpus, a text alignment algorithm is further provided, and corresponding text alignment is accurately calculated for each sentence pair. In order to fully utilize semantic information, the method adopts an evaluation method based on a pre-trained Cross-Language Model, fig. 3 is a schematic diagram of an architecture of the pre-trained Cross-Language Model of the invention, as shown in fig. 3, a lower layer is a Cross-Language Model (XLM) structure formed by a transform layer, and an upper layer is a multi-layer sensor for processing features. Wherein, the transformer refers to a codec structure for text processing; the language embedding layer inputs the language of the bilingual corpus into the model, the position embedding layer inputs the position information of each token in each sentence into the model, and the word embedding layer converts each token into an Identity Document (ID) which can be read and analyzed by the model. The three embedded layers embody the input information of the model, i.e., the integration of the information of the three embedded layers.
In particular, the XLM model is a translation model produced by a multilingual training task. Its input is bilingual data: after word segmentation, each word of each language is mapped to a corresponding vector, and language information and word order information vectors are embedded at the same time to form the model input. As an example, one specific translation task proceeds as follows: given a sentence in language A and its counterpart in language B, the model optimizes the translation objective by propagating the embedded vectors of language A through the encoder and decoder to obtain a representation in language B, and maximizing the cosine similarity between this vector and the real language-B vector; the embedded vectors of language B are treated in the same way. After this translation training, the model (XLM) can already represent textual semantic information, i.e., produce text representations, from which the alignment features of the bilingual data can be extracted. The invention takes the last layer of the Transformer structure as the extraction object, takes its vector and inputs it into a multilayer perceptron consisting of 3 fully connected layers and 3 activation layers, which finally outputs a 2-dimensional representation of the alignment degree of the text pair. Tests show that this method can finally distinguish high-quality corpora from low-quality corpora well.
Sentence pairs from languages A and B are input into the XLM model structure formed by Transformer layers. The XLM model is a translation model produced by a multilingual training task; its input is bilingual data, and after word segmentation each word of each language is mapped to a corresponding vector. After the input word vectors are propagated through several Transformer layers, sentence vectors representing the original sentence pair are obtained. The filtering method provided by the invention classifies, with a multilayer perceptron, the text representation obtained from this model, and by completing this classification task judges whether the input corpus is a high-quality aligned corpus. In the tests, the MLP used consists of a 3-layer fully connected neural network and 3 activation layers; the activation function uses softmax, the number of neurons in the first fully connected layer matches the length of the sentence vector, and the output dimension of the last layer is 2. Finally, after training, the multilayer perceptron can judge the degree of parallelism of a sentence pair, i.e., the corpus quality, from the sentence vectors. In this model, the text representation is the input to the multilayer perceptron.
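A minimal PyTorch sketch of this architecture, assuming a HuggingFace-style pre-trained XLM encoder; the hidden sizes, the pooling choice and the intermediate activations are assumptions (the patent only specifies 3 fully connected layers, 3 activation layers and a 2-dimensional output).

```python
import torch
import torch.nn as nn

class AlignmentScorer(nn.Module):
    """Frozen cross-language encoder followed by a 3-layer MLP with a 2-dimensional output."""

    def __init__(self, encoder, hidden_size=1024):
        super().__init__()
        self.encoder = encoder                        # pre-trained XLM-style encoder, kept fixed
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 2), nn.Softmax(dim=-1),    # [p(not aligned), p(aligned)]
        )

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():                         # encoder parameters are not updated
            hidden = self.encoder(input_ids, attention_mask=attention_mask)[0]
        sentence_vec = hidden[:, 0, :]                # pooled vector from the last Transformer layer
        return self.mlp(sentence_vec)
```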
Further, step 203 comprises: step 2031: if the text alignment degree is greater than or equal to a preset threshold, putting the corresponding bilingual parallel corpus into the corpus; step 2032: if the text alignment degree is smaller than the preset threshold, discarding the corresponding bilingual parallel corpus.
In the invention, a classifier is trained to complete the filtering, i.e., to classify the corpora into high quality and low quality.
The core of a good filter is a classifier with good performance. After a classifier with high accuracy has been obtained, it can be used to score the initial parallel corpora obtained by crawling, and the corpus processors can then select the highly scored corpora as high-quality corpora to add to the database.
First, a collected bilingual parallel corpus needs to be prepared for training the downstream classifier. Non-parallel corpora are then constructed by shuffling the word order of the bilingual parallel corpus, deleting part of the content or adding extra content; the label of a parallel pair is set to 1 and the label of a non-parallel pair to 0, and the data are input into the XLM. Experiments show that a fairly accurate classifier can be trained with only a small amount of parallel corpus.
In a concrete implementation, this part of the corpus is obtained from the positive samples, i.e., the parallel-aligned high-quality corpora, by shuffling word order, deleting part of the content or adding extra content. Specifically, irrelevant words are randomly inserted into the existing bilingual parallel corpus, some keywords are deleted, or the correspondence order of languages A and B is disturbed. Three ways are used to construct negative examples: 1. shuffling the word order of a normal sentence: for the correct example "It is a blue painting.", a negative example constructed by shuffling is, for instance, "painting a is an It blue."; 2. randomly deleting keywords: for the correct example "It is a blue painting.", a negative example constructed by random deletion is, for instance, "It is a painting."; 3. randomly inserting irrelevant words: for the correct example "It is a blue painting.", a negative example constructed by random insertion is, for instance, "It is a maybe blue painting.". The negative examples used in the tests of the invention were generated by these three methods.
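A small sketch of the three negative-example constructions; the irrelevant-word list and the uniform choice among the three corruptions are assumptions for illustration.

```python
import random

IRRELEVANT_WORDS = ["red", "maybe", "yesterday"]   # illustrative insertion vocabulary

def make_negative(sentence):
    """Corrupt one side of a parallel pair by shuffling, deletion, or insertion."""
    tokens = sentence.split()
    choice = random.choice(["shuffle", "delete", "insert"])
    if choice == "shuffle":                                       # scramble the word order
        random.shuffle(tokens)
    elif choice == "delete" and len(tokens) > 2:                  # drop part of the content
        del tokens[random.randrange(len(tokens))]
    else:                                                         # insert an irrelevant word
        tokens.insert(random.randrange(len(tokens) + 1), random.choice(IRRELEVANT_WORDS))
    return " ".join(tokens)

# Parallel pairs keep label 1; pairs whose target side is corrupted this way get label 0.
```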
Then, the parameters of the pre-trained XLM model are fixed and the parameters of the downstream classifier are set to an updatable state. The tensor representing a corpus pair is propagated forward to obtain the final output vector of the corpus pair, and the inner product of this output vector with [0, 1] gives the predicted probability that the bilingual pair is aligned.
Next, the binary classification loss of the sample is computed from the predicted value and the true label, the gradients of the loss with respect to the parameters of the multilayer perceptron are computed, and the parameters of the multilayer perceptron are updated by stochastic gradient descent.
Finally, after the parameters of the multilayer perceptron have been updated repeatedly, a classifier with effective discrimination capability is obtained.
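A sketch of this training loop, reusing the AlignmentScorer sketch above: the XLM encoder stays frozen and only the MLP head is updated with stochastic gradient descent on a binary classification loss. Batch preparation and tokenisation are omitted, and the learning rate and epoch count are assumptions.

```python
import torch
import torch.nn as nn

def train_head(scorer, batches, epochs=3, lr=1e-3):
    """Update only the MLP head with SGD on a binary classification loss."""
    for p in scorer.encoder.parameters():
        p.requires_grad = False                      # fix the pre-trained XLM parameters
    optimizer = torch.optim.SGD(scorer.mlp.parameters(), lr=lr)
    loss_fn = nn.NLLLoss()                           # scorer already outputs softmax probabilities
    for _ in range(epochs):
        for input_ids, attention_mask, labels in batches:   # labels: 1 = parallel, 0 = non-parallel
            probs = scorer(input_ids, attention_mask)
            loss = loss_fn(torch.log(probs + 1e-9), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return scorer
```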
Further, the classifier is used as follows:
Using the trained classifier as a discriminator, the text of a bilingual parallel corpus to be checked is input and the output alignment probability, i.e., the text alignment degree, is examined. A value close to 1 indicates that the text pair is aligned, and the pair is put into the corpus as a correct corpus entry. A value close to 0 indicates that the text pair is misaligned, i.e., the source and target languages are not accurate and proper translations of each other, and the corpus pair is discarded from the dataset. With a preset threshold, which may for example be set to 0.8, the corresponding bilingual parallel corpus is put into the corpus when the text alignment degree is greater than or equal to 0.8 and is discarded when the text alignment degree is less than 0.8.
The invention also provides a corpus construction and filtering system, which comprises the following modules:
Module M1: performing document and sentence alignment on the obtained initial parallel corpus to obtain the bilingual parallel corpus.
Specifically, module M1 comprises: submodule M101: acquiring the initial parallel corpus from a preset resource library; submodule M102: segmenting the initial parallel corpus into sentences and performing sentence alignment to obtain the bilingual parallel corpus.
Further, module M1 also comprises: submodule M103: performing web page derivation on the preset resource library to obtain a derived web page set; submodule M104: adding the derived web page set to the preset resource library.
Module M2: filtering the bilingual parallel corpus according to its text alignment degree to obtain the corpus.
Specifically, module M2 comprises: submodule M201: inputting the bilingual parallel corpus into a translation model to obtain the feature vector output by the translation model; submodule M202: inputting the feature vector into a multilayer perceptron to obtain the text alignment degree; submodule M203: filtering by the text alignment degree to obtain the corpus.
Further, submodule M203 comprises: unit D2031: if the text alignment degree is greater than or equal to a preset threshold, putting the corresponding bilingual parallel corpus into the corpus; unit D2032: if the text alignment degree is smaller than the preset threshold, discarding the corresponding bilingual parallel corpus.
The technical problems solved by the invention are as follows:
1. High-quality corpora have an important influence on downstream natural language processing tasks, while some low-resource corpora are scarce and difficult to acquire.
2. The type and language of the corpora collected by existing automatic Internet collection algorithms are limited, and the effect of existing corpus filtering techniques depends too heavily on the quality of an existing bilingual dictionary.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention greatly improves the capability of collecting low-resource corpora and continuously explores more monolingual and multilingual language information through seed websites, thereby acquiring more and newer low-resource corpora.
2. The invention can automatically complete sentence segmentation, word alignment and similar work for multiple languages, preliminarily screens out corpora of excessively low quality, and greatly improves the speed of corpus evaluation and filtering.
3. On the basis of the obtained corpus information, the invention evaluates corpus quality with an alignment algorithm, filters out the noisier parts and generates parallel texts with higher utilization value. The performance of this algorithm exceeds that of evaluation methods based on sentence vectors, word vectors, dual cross-entropy and the like.
Those skilled in the art know that, besides being implemented as pure computer-readable program code, the system and apparatus provided by the invention and their modules can also be implemented entirely in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like by logically programming the method steps. Therefore, the system and apparatus provided by the invention and their modules can be regarded as a kind of hardware component, and the modules included in them for realizing various programs can also be regarded as structures within the hardware component; the modules for realizing various functions can even be regarded both as software programs implementing the method and as structures within the hardware component.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
Claims (10)
1. A corpus construction and filtering method is characterized by comprising the following steps:
step 1: performing document sentence alignment processing on the obtained initial parallel corpus to obtain bilingual parallel corpus;
and 2, step: and filtering according to the text alignment of the bilingual parallel corpus to obtain a corpus.
2. Corpus construction and filtering method according to claim 1, wherein said step 1 comprises:
step 101: acquiring the initial parallel corpus from a preset resource library;
step 102: and carrying out sentence segmentation on the initial parallel corpus, and carrying out sentence alignment processing to obtain the bilingual parallel corpus.
3. The corpus construction and filtering method according to claim 1, wherein said step 2 comprises:
step 201: inputting the bilingual parallel corpus into a translation model to obtain a feature vector output by the translation model;
step 202: inputting the feature vector into a multilayer perceptron to obtain the text alignment degree;
step 203: and filtering the text alignment degree to obtain the corpus.
4. Corpus construction and filtering method according to claim 1 or 2, wherein said step 1 further comprises:
step 103: performing webpage derivation on the preset resource library to obtain a derived webpage set;
step 104: and adding the webpage derivation set into the preset resource library.
5. Corpus construction and filtering method according to claim 3, wherein said step 203 comprises:
step 2031: if the text alignment degree is larger than or equal to a preset threshold value, putting the corresponding bilingual parallel corpus into the corpus;
step 2032: and if the text alignment degree is smaller than the preset threshold value, discarding the corresponding bilingual parallel corpus.
6. A corpus construction and filtering system, comprising:
module M1: performing document clause alignment processing on the obtained initial parallel corpus to obtain bilingual parallel corpus;
module M2: and according to the text alignment degree of the bilingual parallel corpus, performing filtering processing to obtain a corpus.
7. Corpus construction and filtering system according to claim 6, wherein said module M1 comprises:
submodule M101: acquiring the initial parallel corpus from a preset resource library;
submodule M102: and carrying out sentence segmentation on the initial parallel corpus, and carrying out sentence alignment processing to obtain the bilingual parallel corpus.
8. Corpus construction and filtering system according to claim 6, wherein said module M2 comprises:
submodule M201: inputting the bilingual parallel corpus into a translation model to obtain a feature vector output by the translation model;
submodule M202: inputting the feature vector into a multilayer perceptron to obtain the text alignment degree;
submodule M203: filtering by the text alignment degree to obtain the corpus.
9. Corpus construction and filtering system according to claim 6 or 7, wherein said module M1 further comprises:
submodule M103: performing webpage derivation on the preset resource library to obtain a derived webpage set;
sub-module M104: and adding the webpage derivation set into the preset resource library.
10. Corpus construction and filtering system according to claim 8, wherein said module M203 comprises:
unit D2031: if the text alignment degree is larger than or equal to a preset threshold value, putting the corresponding bilingual parallel corpus into the corpus;
unit D2032: and if the text alignment degree is smaller than the preset threshold value, discarding the corresponding bilingual parallel corpus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210356507.5A CN114780667A (en) | 2022-04-06 | 2022-04-06 | Corpus construction and filtering method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210356507.5A CN114780667A (en) | 2022-04-06 | 2022-04-06 | Corpus construction and filtering method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114780667A true CN114780667A (en) | 2022-07-22 |
Family
ID=82427830
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210356507.5A Pending CN114780667A (en) | 2022-04-06 | 2022-04-06 | Corpus construction and filtering method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114780667A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118170927A (en) * | 2024-05-10 | 2024-06-11 | 山东圣剑医学研究有限公司 | Scientific research data knowledge graph construction method for AI digital person |
- 2022-04-06: CN application CN202210356507.5A filed; published as CN114780667A (legal status: pending)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |