CN108536688A

CN108536688A - Method for discovering multilingual websites across the whole web and obtaining parallel corpora

Info

Publication number: CN108536688A
Application number: CN201810365280.4A
Authority: CN
Inventors: 熊德意; 王涛
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2018-04-23
Filing date: 2018-04-23
Publication date: 2018-09-14

Abstract

The present invention relates to a method for discovering multilingual websites across the whole web and obtaining parallel corpora, comprising: obtaining the index files of URL information in Common Crawl; building, according to a preset standard, a dictionary containing various language tags for the obtained index files of URL information; filtering the index files of URL information according to the dictionary containing various language tags, to obtain the URLs of the required candidate multilingual websites; and obtaining parallel corpora through the URLs of the candidate multilingual websites. The method uses an open-source data set and language tags to obtain the URLs of candidate multilingual websites and finally obtain parallel corpora, and it can obtain parallel corpora quickly.

Description

Method for discovering multilingual websites across the whole web and obtaining parallel corpora

Technical field

The present invention relates to the field of machine translation, and in particular to a method for discovering multilingual websites across the whole web and obtaining parallel corpora.

Background technology

Bilingual parallel corpora are a very important resource for many tasks and are widely used, especially in natural language processing. Within natural language processing, machine translation is a task that relies particularly heavily on parallel corpora. The quality of machine translation depends directly on the quantity and quality of the parallel corpora. In machine translation, language pairs such as Chinese-English, Arabic-English and many European language pairs have parallel corpora numbering in the millions or more, whereas for most other language pairs parallel corpora are difficult to obtain. Obtaining parallel corpora is therefore an important first step toward translation between most language pairs.

The large parallel corpora currently available on the web consist mainly of multilingual news in the world's major languages, the multiple translated versions of European Union meetings and documents, Wikipedia content, and some translated books.

In the paper "PaCo2: A Fully Automated tool for gathering Parallel Corpora from the Web", the authors use a search engine to search for keywords containing a bilingual pair and obtain a list of potential candidate multilingual websites. They then download a small portion of each site and check whether it meets the criteria, and finally apply the usual steps to the multilingual websites, such as parallel page extraction and content extraction.

The traditional technology has the following technical problems:

The Internet is a valuable source of parallel text. Many websites are multilingual, which makes them an especially valuable source of parallel corpora. However, because Internet content is complex, disorganized and very large in scale, it is difficult to collect a large amount of parallel corpora in a concentrated way.

Existing techniques have difficulty discovering, on a large scale and comprehensively, all the multilingual websites on the web and crawling parallel corpora from them. Searching for multilingual websites at random wastes a great deal of time, while the search-engine approach is strongly influenced by the search engine itself, and much time may be wasted checking ordinary web pages.

Summary of the invention

Based on this, it is necessary, in view of the above technical problems, to provide a method for discovering multilingual websites across the whole web and obtaining parallel corpora. The method uses an open-source data set and language tags to obtain the URLs of candidate multilingual websites and finally obtain parallel corpora, and it can obtain parallel corpora quickly.

A method for discovering multilingual websites across the whole web and obtaining parallel corpora, comprising:

obtaining the index files of URL information in Common Crawl;

building, according to a preset standard, a dictionary containing various language tags for the obtained index files of URL information;

filtering the index files of URL information according to the dictionary containing various language tags, to obtain the URLs of the required candidate multilingual websites; and

obtaining parallel corpora through the URLs of the candidate multilingual websites.

The above method for discovering multilingual websites across the whole web and obtaining parallel corpora uses an open-source data set and language tags to obtain the URLs of candidate multilingual websites and finally obtain parallel corpora; the method can obtain parallel corpora quickly.

In another embodiment, the preset standard is the ISO 639 standard.

In another embodiment, the step of "obtaining parallel corpora through the URLs of the candidate multilingual websites" is implemented with the Bitextor tool.

A computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any one of the above methods when executing the program.

A computer-readable storage medium has a computer program stored thereon, wherein the program implements the steps of any one of the above methods when executed by a processor.

Description of the drawings

Fig. 1 is a schematic flowchart of a method for discovering multilingual websites across the whole web and obtaining parallel corpora, provided by an embodiment of the present application.

Fig. 2 is a schematic diagram of an index file of URL information in a method for discovering multilingual websites across the whole web and obtaining parallel corpora, provided by an embodiment of the present application.

Fig. 3 is a schematic diagram of the candidate websites obtained in a method for discovering multilingual websites across the whole web and obtaining parallel corpora, provided by an embodiment of the present application.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention and not to limit it.

S110: obtaining the index files of URL information in Common Crawl;

S120: building, according to a preset standard, a dictionary containing various language tags for the obtained index files of URL information;

S130: filtering the index files of URL information according to the dictionary containing various language tags, to obtain the URLs of the required candidate multilingual websites;

S140: obtaining parallel corpora through the URLs of the candidate multilingual websites.

In another embodiment, the preset standard is the ISO 639 standard.

A specific application scenario of the present invention is described below:

The invention aims to obtain the URLs of multilingual websites from Common Crawl, after which parallel corpus processing can be carried out. First, we use the index data in Common Crawl as the source of URLs. We then build a dictionary containing various language tags according to an international standard. Finally, we filter all of the index information to obtain the URLs of the required candidate multilingual websites. Once the candidate URLs have been obtained, further processing can be carried out.

The details of the process are described below:

Common Crawl is a data set provided and maintained by a non-profit organization. It is a data set of the whole web crawled with Apache Nutch, intended to make a large amount of web data available to everyone rather than leaving it monopolized by large companies. A new crawl is released every few months, containing more than a hundred terabytes of web data, and is currently stored on Amazon's cloud servers.

The Common Crawl data include files in four formats. WARC holds the raw crawled web page data and contains the complete metadata, request and response. WAT holds metadata further extracted from the WARC files above, saved in JSON format. WET provides the plain-text content. More detailed information is available at http://commoncrawl.org/the-data/get-started/.

There is also another, easily overlooked kind of information, which is the one used here: the URL index files, that is, index files containing the URL information of all pages. The specific format is shown in Fig. 2:

Note: for each web page there is one corresponding record, containing information such as the domain name, crawl time, URL, media type, returned status and file size. In the December 2017 release of Common Crawl there are 302 such index files in total, about 200 GB in size.
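As a rough illustration of the record described in the note above, the following sketch parses one index line. It assumes the plain-text CDX layout, in which each line holds a reversed-domain key, a timestamp and a JSON object; the sample line, its field values and the helper name `parse_index_line` are made up for illustration and are not taken from Fig. 2.

```python
import json
from urllib.parse import urlsplit

# Hypothetical sample line; the values are illustrative only.
SAMPLE_LINE = (
    'com,apple)/cn 20171211000000 '
    '{"url": "https://www.apple.com/cn/", "mime": "text/html", '
    '"status": "200", "length": "12345"}'
)


def parse_index_line(line):
    """Split one index line into key, crawl time and a dict of JSON fields."""
    key, timestamp, payload = line.strip().split(" ", 2)
    record = json.loads(payload)
    record["key"] = key
    record["timestamp"] = timestamp
    # The domain is derived from the URL rather than from the reversed key.
    record["domain"] = urlsplit(record["url"]).netloc
    return record


record = parse_index_line(SAMPLE_LINE)
print(record["domain"], record["mime"], record["status"])
```

Keeping the parsed record as a plain dictionary makes it easy to apply the later filtering conditions field by field.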

By examining a large number of parallel web pages, we found that the URLs of most parallel web pages contain language-tag information. For example, Apple's official English website is https://www.apple.com/, the Chinese page is https://www.apple.com/cn/, and the Vietnamese page is https://www.apple.com/vn/. This is not unique to Apple; in fact almost all multilingual websites use URLs of this structure, differing only in the language tags they use. For English alone there may be a dozen or so different tags, such as en, en-us, en-af, en-au, en-ca, en-gb, en-ph, en-za and eng. Some websites may even give the pages for their home country no language tag while tagging the pages in other languages.
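The observation above can be sketched as a small lookup over URL path segments. This is only an illustration: it assumes that the tag appears as a whole path segment (as in the Apple URLs), and the tag set `KNOWN_TAGS` and the function name `find_language_tag` are hypothetical.

```python
from urllib.parse import urlsplit

# A small, illustrative subset of tags; a real table would be much larger.
KNOWN_TAGS = {"en", "en-us", "en-gb", "eng", "cn", "vn", "fr", "ru"}


def find_language_tag(url):
    """Return the first path segment that is a known language tag, else None."""
    path = urlsplit(url).path
    for segment in path.lower().split("/"):
        if segment in KNOWN_TAGS:
            return segment
    return None


for url in (
    "https://www.apple.com/",
    "https://www.apple.com/cn/",
    "https://www.apple.com/vn/",
):
    print(url, "->", find_language_tag(url))
```

A page without any tag, such as the English home page here, simply yields `None`, which matches the case of websites that leave their home-country pages untagged.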

We can therefore select some well-known international websites, such as Apple and Microsoft, extract from them the tags of the major languages, and add the language codes defined by the basic ISO 639 standard to form an initial language code table (covering 38 major languages and their various forms; part of it is shown in the table below). This table is used to determine whether a URL contains such a language tag when the URL is analyzed.

It will be appreciated that the websites may also be selected by other criteria, which are not described in detail here.

Of course, for some minor languages we may not be able to determine what their language tags are. However, a multilingual website for a minor language will include at least one major language, such as English. Such websites can therefore simply be added to the candidate URLs during URL filtering so that they can be identified in subsequent steps (the language identification module needs to support the language), thereby achieving complete coverage.

The common tags of various languages are as follows (language, canonical code, then tag-to-code mappings):

English (en): 'en':'en', 'en-us':'en', 'en-af':'en', 'en-au':'en', 'en-ca':'en', 'en-gb':'en', 'en-ph':

French (fr): 'fr':'fr', 'fra':'fr', 'fr-fr':'fr', 'fr-be':'fr', 'fr-ca':'fr', 'fr-ch':'fr', 'fr-lu':'fr'

Vietnamese (vi): 'vi':'vi', 'vn':'vi', 'vnm':'vn', 'vie':'vi'

Chinese (zh): 'cn':'zh', 'hk':'zh', 'mo':'zh', 'tw':'zh', 'chn':'zh', 'zho':'zh'

Russian (ru): 'ru':'ru', 'ru-ru':'ru', 'rus':'ru'
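One way to hold such a table in code is a flat Python dictionary from raw tag to canonical code. The sketch below copies pairs from the table above, leaving out the truncated 'en-ph' entry and the 'vnm' entry, whose target code differs from the other Vietnamese tags; the helper names `normalize_tag` and `tags_by_language` are hypothetical.

```python
# Tag -> canonical language code, copied from the table above
# (the truncated 'en-ph' entry and the 'vnm' entry are left out).
LANGUAGE_TAGS = {
    "en": "en", "en-us": "en", "en-af": "en", "en-au": "en",
    "en-ca": "en", "en-gb": "en",
    "fr": "fr", "fra": "fr", "fr-fr": "fr", "fr-be": "fr",
    "fr-ca": "fr", "fr-ch": "fr", "fr-lu": "fr",
    "vi": "vi", "vn": "vi", "vie": "vi",
    "cn": "zh", "hk": "zh", "mo": "zh", "tw": "zh", "chn": "zh", "zho": "zh",
    "ru": "ru", "ru-ru": "ru", "rus": "ru",
}


def normalize_tag(tag):
    """Map a raw URL tag to its canonical code; None if the tag is unknown."""
    return LANGUAGE_TAGS.get(tag.strip().lower())


def tags_by_language(table):
    """Invert the table: canonical code -> sorted list of its tag variants."""
    grouped = {}
    for tag, code in table.items():
        grouped.setdefault(code, []).append(tag)
    return {code: sorted(tags) for code, tags in grouped.items()}


print(normalize_tag("EN-GB"))
print(tags_by_language(LANGUAGE_TAGS)["zh"])
```

Normalizing every variant to one canonical code means that a site using both `en-us` and `en-gb` is counted as having one language rather than two.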

Filtering then begins on the index files, which have been downloaded locally. This process is implemented with mrjob, a Python framework based on the MapReduce model. mrjob jobs can run on the local machine, on a Hadoop cluster, on Amazon Elastic MapReduce (EMR) or on Google Cloud Dataproc (Dataproc); the latter three are distributed processing frameworks. Because the data volume is not particularly large, the present application can run locally. If files such as WET are to be processed directly, it is recommended to run directly on the cluster service provided by Amazon, which saves the time needed to transfer large amounts of data (Common Crawl is stored in the Amazon S3 cloud storage service).
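The MapReduce shape of this filtering step can be sketched in plain Python. This is not the mrjob job described above, only an in-memory stand-in with the same mapper and reducer roles; the index-line layout (key, timestamp, JSON), the tag subset `TAGS` and the sample lines are assumptions for illustration.

```python
import json
from collections import defaultdict
from urllib.parse import urlsplit

TAGS = {"en": "en", "cn": "zh", "vn": "vi"}  # illustrative subset

# Hypothetical index lines.
LINES = [
    'com,apple)/cn 20171211000000 {"url": "https://www.apple.com/cn/"}',
    'com,apple)/cn/mac 20171211000001 {"url": "https://www.apple.com/cn/mac/"}',
    'com,apple)/vn 20171211000002 {"url": "https://www.apple.com/vn/"}',
    'com,apple)/ 20171211000003 {"url": "https://www.apple.com/"}',
]


def mapper(line):
    """Emit (domain, language) for an index line whose URL carries a known tag."""
    _key, _timestamp, payload = line.strip().split(" ", 2)
    parts = urlsplit(json.loads(payload)["url"])
    for segment in parts.path.lower().split("/"):
        if segment in TAGS:
            yield parts.netloc, TAGS[segment]
            break


def reducer(domain, languages):
    """Count tagged pages per language for one domain."""
    counts = defaultdict(int)
    for language in languages:
        counts[language] += 1
    return domain, dict(counts)


def run_local(lines):
    """Group mapper output by domain, then reduce; the shuffle is done in memory."""
    grouped = defaultdict(list)
    for line in lines:
        for domain, language in mapper(line):
            grouped[domain].append(language)
    return dict(reducer(domain, langs) for domain, langs in grouped.items())


print(run_local(LINES))
```

Keying the mapper output by domain is what lets the per-website page counts be computed independently for each site, which is the property a distributed runner relies on.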

The specific filter conditions are: files that are too large or too small are filtered out (it will be appreciated that this condition is optional); only files of type text/html are kept; only responses whose status is 200 OK are kept; and, most importantly, the URL must contain a language tag. For each website domain name, the number of pages containing the above language tags is counted. After counting, if the number of pages containing certain language tags meets the requirement, the domain name is regarded as a candidate multilingual website address.
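The conditions above can be combined into a single predicate over one parsed index record. In this sketch the field names `mime`, `status` and `length` are assumed to match the JSON fields of the index record, and the size bounds are hypothetical values, since the text does not state them.

```python
MIN_LENGTH = 500        # hypothetical lower bound, in bytes
MAX_LENGTH = 5_000_000  # hypothetical upper bound, in bytes


def keep_record(record, has_language_tag):
    """Apply the filter conditions to one parsed index record."""
    if record.get("mime") != "text/html":
        return False
    if str(record.get("status")) != "200":
        return False
    length = int(record.get("length", 0))
    if not MIN_LENGTH <= length <= MAX_LENGTH:
        return False
    # The language tag is the decisive condition.
    return bool(has_language_tag)


page = {"mime": "text/html", "status": "200", "length": "12345"}
print(keep_record(page, has_language_tag=True))
```

The cheap field checks run first so that records of the wrong type or status are rejected before any URL analysis is needed.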

In our experiments, the requirement for a multilingual website is as follows: a website containing two or more languages is added to the candidate multilingual websites, while a website with only one language tag needs at least 10 pages containing a mainstream language tag (such as English). It will be appreciated that 10 is only one possible example.
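The decision rule above can be written as a short function over the per-language page counts of one domain. The set of mainstream languages is an assumption here, since the text names English only as an example, and the function name `is_candidate` is hypothetical.

```python
MAINSTREAM_LANGUAGES = {"en"}   # hypothetical; the text gives English only as an example
MIN_PAGES_SINGLE_LANGUAGE = 10  # the example threshold from the text


def is_candidate(page_counts):
    """Decide whether a domain is a candidate multilingual website.

    page_counts maps a canonical language code to the number of tagged pages.
    """
    languages = {lang for lang, pages in page_counts.items() if pages > 0}
    if len(languages) >= 2:
        return True
    if len(languages) == 1:
        (only,) = languages
        return (
            only in MAINSTREAM_LANGUAGES
            and page_counts[only] >= MIN_PAGES_SINGLE_LANGUAGE
        )
    return False


print(is_candidate({"zh": 2, "vi": 1}))
print(is_candidate({"en": 3}))
```

The single-language branch keeps sites whose other languages use tags missing from the table, on the assumption that enough tagged mainstream-language pages suggest a multilingual site.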

In total, 468955 candidate websites were obtained, as shown in Fig. 3:

In the paper "Bitextor, a free/open-source software to harvest translation memories from multilingual websites", the authors provide a complete open-source software package for extracting parallel corpora from given websites. The software implements a complete pipeline covering web page download, deduplication, parallel page extraction and parallel sentence alignment.

Bitextor is a free, open-source tool for obtaining parallel corpora. Bitextor abstracts the whole process into the following steps: downloading website pages with the open-source tool httrack, deduplication and text extraction, language identification, parallel page extraction, and parallel sentence extraction.

When obtaining parallel pages, Bitextor needs a dictionary for the corresponding language pair (that is, a word-to-word dictionary) in order to judge the degree of parallelism. The official release provides dictionaries for only a few language pairs, but it includes a tool for generating dictionaries, so users can generate their own when needed.

In addition, in the above judgment the authors assume by default Indo-European languages, which do not require word segmentation. For many languages, such as Chinese and Japanese, the authors do not provide automatic word segmentation, and users must add it themselves as needed. For example, Chinese word segmentation can use the Python library jieba.
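A minimal sketch of such a segmentation step with jieba might look as follows. How the segmenter is wired into Bitextor is not described here, so the sketch covers only the tokenization itself; the function names are hypothetical, and the character-level fallback is an addition so that the example still runs when jieba is not installed.

```python
try:
    import jieba  # third-party library; must be installed separately
except ImportError:
    jieba = None


def segment_chinese(text):
    """Split Chinese text into tokens; fall back to single characters without jieba."""
    if jieba is not None:
        return jieba.lcut(text)
    return list(text)


def to_space_separated(text):
    """Join tokens with spaces so whitespace-based tools can treat them as words."""
    return " ".join(token for token in segment_chinese(text) if token.strip())


print(to_space_separated("机器翻译依赖平行语料"))
```

Emitting space-separated tokens is a common way to feed segmented Chinese text to tools that assume whitespace-delimited words.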

The Bitextor modules can then be used to start the series of subsequent processing steps.

Given that Common Crawl already provides web data for almost the whole web, we can combine the resources it provides with existing crawling technology. The data provided by Common Crawl are analyzed to obtain a list of candidate multilingual websites; other tools are then used to crawl the required websites and process them step by step to obtain parallel corpora. This workflow solves the problem that multilingual websites are scattered and difficult to discover and collect on a large scale. The language identification and statistics steps allow users to carry out subsequent processing on the language pairs they are interested in, according to their needs.

The technical features of the embodiments described above can be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The embodiments described above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.

Claims

1. A method for discovering multilingual websites across the whole web and obtaining parallel corpora, characterized by comprising:

obtaining the index files of URL information in Common Crawl;

filtering the index files of URL information according to the dictionary containing various language tags, to obtain the URLs of the required candidate multilingual websites;

2. The method for discovering multilingual websites across the whole web and obtaining parallel corpora according to claim 1, characterized in that the preset standard is the ISO 639 standard.

3. The method for discovering multilingual websites across the whole web and obtaining parallel corpora according to claim 1, characterized in that the step of "obtaining parallel corpora through the URLs of the candidate multilingual websites" is implemented with the Bitextor tool.

4. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1-3 when executing the program.

5. A computer-readable storage medium having a computer program stored thereon, characterized in that the program implements the steps of the method of any one of claims 1-3 when executed by a processor.