CN108536688A - Method for discovering multilingual websites across the whole web and obtaining parallel corpora - Google Patents

Method for discovering multilingual websites across the whole web and obtaining parallel corpora

Info

Publication number
CN108536688A
Authority
CN
China
Prior art keywords
language
url
parallel corpora
website
language website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810365280.4A
Other languages
Chinese (zh)
Inventor
熊德意
王涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201810365280.4A
Publication of CN108536688A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a method for discovering multilingual websites across the whole web and obtaining parallel corpora, comprising: obtaining the index files of URL information in Common Crawl; building, according to a preset standard, a dictionary containing various language labels for the obtained index files of URL information; filtering the index files of URL information according to the dictionary containing various language labels, to obtain the URLs of the required candidate multilingual websites; and obtaining parallel corpora from the URLs of the candidate multilingual websites. By applying language labels to an open-source dataset, the method obtains the URLs of candidate multilingual websites and ultimately the parallel corpora, so parallel corpora can be obtained quickly.

Description

Method for discovering multilingual websites across the whole web and obtaining parallel corpora
Technical field
The present invention relates to the field of machine translation, and in particular to a method for discovering multilingual websites across the whole web and obtaining parallel corpora.
Background
Bilingual parallel corpora are a very important resource for many tasks and are used widely in natural language processing. Within natural language processing, machine translation is especially dependent on parallel corpora: translation quality depends directly on the quantity and quality of the parallel corpora available. Chinese-English, Arabic-English, and many European language pairs have parallel corpora numbering in the millions or more, but for most other language pairs parallel corpora are hard to obtain. Obtaining parallel corpora is therefore an important first step toward translation between most languages.
The large parallel corpora currently available on the web consist mainly of multilingual news in the world's major languages, the translated versions of European Union meetings and documents, Wikipedia content, and some translated books.
In the paper "PaCo2: A Fully Automated tool for gathering Parallel Corpora from the Web", the authors use a search engine to search for keywords containing a bilingual pair and obtain a list of potential candidate multilingual websites. They then download a small portion of each site to check whether it meets the criteria, and carry out the usual parallel page extraction, content extraction, and related steps on the multilingual websites.
The traditional technology has the following technical problems:
The Internet is a valuable source of parallel text. Many websites are multilingual, which makes them an especially valuable source of parallel corpora. However, because Internet content is complex, disorganized, and very large in scale, it is difficult to collect large amounts of parallel corpora in a concentrated way.
Existing techniques have difficulty discovering, at scale and comprehensively, all the multilingual websites on the web and crawling parallel corpora from them. Searching for multilingual websites at random wastes a great deal of time, while search-engine-based methods are strongly influenced by the search engine and may waste much time checking ordinary web pages.
Summary of the invention
In view of the above technical problems, it is necessary to provide a method for discovering multilingual websites across the whole web and obtaining parallel corpora. The method applies language labels to an open-source dataset to obtain the URLs of candidate multilingual websites and ultimately the parallel corpora, so that parallel corpora can be obtained quickly.
A method for discovering multilingual websites across the whole web and obtaining parallel corpora, comprising:
Obtaining the index files of URL information in Common Crawl;
Building, according to a preset standard, a dictionary containing various language labels for the obtained index files of URL information;
Filtering the index files of URL information according to the dictionary containing various language labels, to obtain the URLs of the required candidate multilingual websites;
Obtaining parallel corpora from the URLs of the candidate multilingual websites.
The above method applies language labels to an open-source dataset to obtain the URLs of candidate multilingual websites and ultimately the parallel corpora, so parallel corpora can be obtained quickly.
In another embodiment, the preset standard is the ISO 639 standard.
In another embodiment, the step "obtaining parallel corpora from the URLs of the candidate multilingual websites" is implemented with the Bitextor tool.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any one of the above methods when executing the program.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any one of the above methods.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a method for discovering multilingual websites across the whole web and obtaining parallel corpora, provided by an embodiment of the present application.
Fig. 2 is a schematic diagram of the index file of URL information in the method for discovering multilingual websites across the whole web and obtaining parallel corpora, provided by an embodiment of the present application.
Fig. 3 is a schematic diagram of the candidate websites obtained in the method for discovering multilingual websites across the whole web and obtaining parallel corpora, provided by an embodiment of the present application.
Detailed description
To make the purpose, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
A method for discovering multilingual websites across the whole web and obtaining parallel corpora, comprising:
S110: obtaining the index files of URL information in Common Crawl;
S120: building, according to a preset standard, a dictionary containing various language labels for the obtained index files of URL information;
S130: filtering the index files of URL information according to the dictionary containing various language labels, to obtain the URLs of the required candidate multilingual websites;
S140: obtaining parallel corpora from the URLs of the candidate multilingual websites.
The above method applies language labels to an open-source dataset to obtain the URLs of candidate multilingual websites and ultimately the parallel corpora, so parallel corpora can be obtained quickly.
In another embodiment, the preset standard is the ISO 639 standard.
In another embodiment, the step "obtaining parallel corpora from the URLs of the candidate multilingual websites" is implemented with the Bitextor tool.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any one of the above methods when executing the program.
A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any one of the above methods.
A concrete application scenario of the invention is described below:
The invention aims to obtain the URLs of multilingual websites from Common Crawl so that parallel corpora can then be processed. First, the index data in Common Crawl is used as the source of URLs. Next, a dictionary containing various language labels is built according to an international standard. Finally, all index information is filtered to obtain the URLs of the required candidate multilingual websites. Once the candidate URLs are obtained, they can be processed further.
The details of the flow are described below:
Common Crawl is a dataset provided and maintained by a non-profit organization. It is a whole-web dataset crawled with Apache Nutch, intended to make large amounts of web data available to everyone rather than leaving it monopolized by large companies. New results are released every few months, each containing hundreds of terabytes of web data, currently stored on Amazon's cloud servers.
The Common Crawl data includes files in four formats. WARC is the raw crawled web page data, containing the complete metadata, request, and response. WAT holds the metadata extracted from the WARC files above, saved in JSON format. WET provides the plain text. More detailed information is available at http://commoncrawl.org/the-data/get-started/.
The fourth, easily overlooked, kind is the URL index file used here, that is, the index file of URL information for all pages. The specific format is shown in Fig. 2:
Note: each web page has one corresponding record, which contains the domain name, crawl time, URL, media type, returned status, file size, and similar information. The December 2017 release of Common Crawl has 302 such index files in total, about 200 GB in size.
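Since Fig. 2 is not reproduced here, the following Python sketch illustrates how one index record might be read. It assumes the publicly documented Common Crawl index layout of a URL key, a timestamp, and a JSON object on each line; the sample line and the helper name `parse_index_line` are made up for illustration.

```python
import json


def parse_index_line(line):
    """Split one index line into (url_key, timestamp, fields).

    Assumed layout: "<url key> <timestamp> <JSON object>".
    """
    url_key, timestamp, payload = line.strip().split(" ", 2)
    fields = json.loads(payload)
    return url_key, timestamp, fields


# Hypothetical sample record; the values are invented for illustration.
sample = (
    'com,apple)/cn 20171211000000 '
    '{"url": "https://www.apple.com/cn/", "mime": "text/html", '
    '"status": "200", "length": "12345"}'
)

url_key, timestamp, fields = parse_index_line(sample)
print(fields["url"], fields["mime"], fields["status"], fields["length"])
```

Splitting on at most two spaces keeps the JSON part intact even if it contains spaces of its own.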
Observation of a large number of parallel web pages shows that the URLs of most parallel pages contain language label information. For example, Apple's official English website is https://www.apple.com/, its Chinese page is https://www.apple.com/cn/, and its Vietnamese page is https://www.apple.com/vn/. This is not unique to Apple: almost all multilingual websites use URLs of this structure, differing only in the language labels used. English alone may have more than ten different labels, such as en, en-us, en-af, en-au, en-ca, en-gb, en-ph, en-za, and eng. Some websites may even give the pages for their home country no language label while labelling the pages in other languages.
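The observation about URL structure can be illustrated with a short Python sketch. It assumes that the label appears as a whole path segment, as in the Apple examples; the text does not state where labels may occur, so this is only one plausible reading. The set `KNOWN_LABELS` and the function `find_language_label` are hypothetical names.

```python
from urllib.parse import urlparse

# Hypothetical subset of labels, taken from the examples in the text.
KNOWN_LABELS = {"en", "en-us", "en-gb", "eng", "cn", "vn", "fr", "ru"}


def find_language_label(url, known_labels=KNOWN_LABELS):
    """Return the first path segment that is a known language label, else None."""
    path = urlparse(url).path.lower()
    for segment in path.split("/"):
        if segment in known_labels:
            return segment
    return None


for url in (
    "https://www.apple.com/",
    "https://www.apple.com/cn/",
    "https://www.apple.com/vn/",
):
    print(url, "->", find_language_label(url))
```

A home-country page without a label, such as the first URL, yields `None`, which matches the remark that some sites leave their default pages unlabelled.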
We can therefore select some well-known international websites, such as Apple and Microsoft, and extract the labels of the major languages from them. Adding the language codes defined by the basic ISO 639 standard yields an initial language code table (covering 38 major languages and their various representations; part of its content is listed below), which is used to decide whether a URL contains a language label when it is analyzed.
It will be appreciated that the websites can also be selected by other criteria, which are not detailed here.
Of course, for some minor languages we may be unable to determine what their language labels are. However, a multilingual website for a minor language will contain at least one major language, such as English. During simple URL filtering such sites can therefore still be added to the candidate URLs, so that the minor languages can be identified in later steps (the language identification module must support the language), which makes the coverage complete.
Common labels for various languages are as follows (each entry maps a label to its language code):
English (en): 'en':'en', 'en-us':'en', 'en-af':'en', 'en-au':'en', 'en-ca':'en', 'en-gb':'en', 'en-ph':
French (fr): 'fr':'fr', 'fra':'fr', 'fr-fr':'fr', 'fr-be':'fr', 'fr-ca':'fr', 'fr-ch':'fr', 'fr-lu':'fr'
Vietnamese (vi): 'vi':'vi', 'vn':'vi', 'vnm':'vi', 'vie':'vi'
Chinese (zh): 'cn':'zh', 'hk':'zh', 'mo':'zh', 'tw':'zh', 'chn':'zh', 'zho':'zh'
Russian (ru): 'ru':'ru', 'ru-ru':'ru', 'rus':'ru'
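A label table of this kind maps naturally onto a Python dictionary. The sketch below copies only some of the entries listed above and is not the full 38-language table; the names `LANGUAGE_LABELS` and `normalize_label` are illustrative assumptions.

```python
# Partial label-to-language table, copied from the entries listed in the text.
LANGUAGE_LABELS = {
    "en": "en", "en-us": "en", "en-af": "en", "en-au": "en",
    "en-ca": "en", "en-gb": "en",
    "fr": "fr", "fra": "fr", "fr-fr": "fr", "fr-be": "fr",
    "fr-ca": "fr", "fr-ch": "fr", "fr-lu": "fr",
    "vi": "vi", "vn": "vi", "vie": "vi",
    "cn": "zh", "hk": "zh", "mo": "zh", "tw": "zh", "chn": "zh", "zho": "zh",
    "ru": "ru", "ru-ru": "ru", "rus": "ru",
}


def normalize_label(label):
    """Map a URL label to its language code, or None if the label is unknown."""
    return LANGUAGE_LABELS.get(label.lower())


print(normalize_label("EN-GB"), normalize_label("tw"), normalize_label("xx"))
```

Lower-casing the label before the lookup is a design choice made here so that `EN-GB` and `en-gb` are treated alike.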
Filtering then starts on the locally downloaded index files. This process is implemented with mrjob, a Python framework based on the MapReduce model. mrjob can run on the local machine, on a Hadoop cluster, on Amazon Elastic MapReduce (EMR), or on Google Cloud Dataproc (Dataproc); the latter three are distributed processing frameworks. Because the data volume here is not especially large, the present application can run on the local machine. When operating directly on files such as WET, running directly on the cluster service provided by Amazon is recommended, which saves the time needed to transfer large amounts of data (Common Crawl is stored in the Amazon S3 cloud storage service).
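The map and reduce steps can be sketched in plain Python without any framework. This is not the patent's actual mrjob job: the functions `mapper` and `reducer` below only mirror the roles that the mapper and reducer methods of a MapReduce job would play, and the label subset and URLs are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical subset of the label table.
LABELS = {"en": "en", "cn": "zh", "vn": "vi", "fr": "fr"}


def mapper(url):
    """Emit ((domain, language), 1) when a path segment is a known label."""
    parsed = urlparse(url.lower())
    for segment in parsed.path.split("/"):
        if segment in LABELS:
            yield (parsed.hostname, LABELS[segment]), 1
            break


def reducer(pairs):
    """Sum the emitted counts for each (domain, language) key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


urls = [
    "https://www.apple.com/cn/",
    "https://www.apple.com/cn/iphone/",
    "https://www.apple.com/vn/",
    "https://www.apple.com/",
]
pairs = [pair for url in urls for pair in mapper(url)]
page_counts = reducer(pairs)
print(page_counts)
```

In a real MapReduce run the framework groups the emitted pairs by key between the two steps; here the grouping is folded into `reducer` to keep the sketch short.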
The specific filter factors are: excluding files that are too large or too small (this factor is optional); keeping only files of type text/html; keeping only responses whose status is 200 OK; and, most importantly, the language labels. For each website domain name, the number of pages containing the above language labels is counted. After counting, if the number of pages containing certain language labels meets the requirement, the domain name is treated as a candidate multilingual website address.
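The first three filter factors can be expressed as a single predicate over one index record. The sketch below assumes the record is a dictionary with `mime`, `status`, and `length` fields, as in the public Common Crawl index; the size bounds are hypothetical, since the text gives no concrete values. The language label check is left to a separate step.

```python
MIN_LENGTH = 1_000      # hypothetical lower bound, in bytes
MAX_LENGTH = 5_000_000  # hypothetical upper bound, in bytes


def keep_record(fields):
    """Return True if the record is an HTML page, returned 200, and has a usable size."""
    try:
        length = int(fields.get("length", 0))
    except (TypeError, ValueError):
        return False
    return (
        fields.get("mime") == "text/html"
        and str(fields.get("status")) == "200"
        and MIN_LENGTH <= length <= MAX_LENGTH
    )


print(keep_record({"mime": "text/html", "status": "200", "length": "12345"}))
print(keep_record({"mime": "image/png", "status": "200", "length": "12345"}))
```

Because the size filter is described as optional, the two bounds could simply be widened or removed.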
In our tests, the requirement for a multilingual website was as follows: a website containing two or more language labels is added to the candidate multilingual websites; a website with only one language label needs at least 10 pages carrying a mainstream language label (such as English). It will be appreciated that 10 is only one possible value.
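This decision rule can be written as a small function over the per-language page counts of one domain. The threshold of 10 comes from the text; the set of mainstream languages is an assumption, since only English is named as an example, and `is_candidate` is an illustrative name.

```python
MIN_PAGES_SINGLE_LANGUAGE = 10   # example value from the text
MAINSTREAM_LANGUAGES = {"en"}    # assumption: only English is named in the text


def is_candidate(pages_per_language):
    """Decide whether a domain is a candidate multilingual website."""
    labelled = {lang: n for lang, n in pages_per_language.items() if n > 0}
    if len(labelled) >= 2:
        return True
    return any(
        lang in MAINSTREAM_LANGUAGES and n >= MIN_PAGES_SINGLE_LANGUAGE
        for lang, n in labelled.items()
    )


print(is_candidate({"en": 3, "zh": 1}))
print(is_candidate({"en": 12}))
print(is_candidate({"en": 4}))
```

Keeping single-language domains with enough mainstream pages is what lets minor-language sites with unknown labels stay in the candidate list for later language identification.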
In total, 468,955 candidate websites were obtained, as shown in Fig. 3:
In the paper "Bitextor, a free/open-source software to harvest translation memories from multilingual websites", the authors provide a complete open-source software package for extracting parallel corpora from a given website. The software implements a series of solutions covering web page download, deduplication, parallel page extraction, and parallel sentence alignment.
Bitextor is a free, open-source tool for obtaining parallel corpora. It abstracts the whole procedure into the following steps: downloading the site's pages with the open-source tool httrack, deduplication and text extraction, language identification, parallel page extraction, and parallel sentence extraction.
When obtaining parallel pages, Bitextor needs a dictionary for the corresponding language pair (a word-to-word dictionary) to judge how parallel two pages are. The official release provides dictionaries for only a few language pairs, but it includes a tool for generating dictionaries, so users generate their own when needed.
In addition, in the above judgment the authors assume by default Indo-European languages that do not require word segmentation. For many languages, such as Chinese and Japanese, the authors provide no automatic word segmentation, so it must be added as needed. For example, Chinese word segmentation can use the Python library jieba.
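As a rough illustration of the added segmentation step, the sketch below calls `jieba.lcut` when the third-party jieba package is installed and otherwise falls back to splitting the text into single characters. The fallback and the function name `segment` are assumptions made so the example runs anywhere; how the tokens would be passed on to Bitextor is not shown.

```python
try:
    import jieba  # third-party package; install with "pip install jieba"
except ImportError:
    jieba = None


def segment(text):
    """Split Chinese text into tokens.

    Uses jieba when available; otherwise falls back to one token per character.
    """
    if jieba is not None:
        return jieba.lcut(text)
    return list(text)


sentence = "我们需要获取平行语料"
tokens = segment(sentence)
print(" ".join(tokens))
```

Joining the tokens with spaces gives text that tools written for space-delimited languages can process word by word.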
The Bitextor modules can then be used to start the series of subsequent processing steps.
Since Common Crawl already provides web data for almost the whole web, the resources it provides can be combined with existing crawling techniques. The data provided by Common Crawl is analyzed to obtain a list of candidate multilingual websites; other tools are then used to crawl the required websites and process them step by step to obtain parallel corpora. This flow solves the problem that multilingual websites are scattered and hard to discover and collect at scale. The language identification and summarizing steps let users carry out subsequent processing on the language pairs they are interested in.
The technical features of the embodiments above can be combined arbitrarily. For brevity, not all possible combinations of these features are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The embodiments above express only several implementations of the invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the invention. The protection scope of this patent is therefore determined by the appended claims.

Claims (5)

1. A method for discovering multilingual websites across the whole web and obtaining parallel corpora, characterized by comprising:
Obtaining the index files of URL information in Common Crawl;
Building, according to a preset standard, a dictionary containing various language labels for the obtained index files of URL information;
Filtering the index files of URL information according to the dictionary containing various language labels, to obtain the URLs of the required candidate multilingual websites;
Obtaining parallel corpora from the URLs of the candidate multilingual websites.
2. The method for discovering multilingual websites across the whole web and obtaining parallel corpora according to claim 1, characterized in that the preset standard is the ISO 639 standard.
3. The method for discovering multilingual websites across the whole web and obtaining parallel corpora according to claim 1, characterized in that the step "obtaining parallel corpora from the URLs of the candidate multilingual websites" is implemented with the Bitextor tool.
4. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1-3 when executing the program.
5. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1-3.
CN201810365280.4A 2018-04-23 2018-04-23 Method for discovering multilingual websites across the whole web and obtaining parallel corpora Pending CN108536688A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810365280.4A CN108536688A (en) 2018-04-23 2018-04-23 Method for discovering multilingual websites across the whole web and obtaining parallel corpora

Publications (1)

Publication Number Publication Date
CN108536688A (en) 2018-09-14

Family

ID=63479122

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810365280.4A Pending CN108536688A (en) 2018-04-23 2018-04-23 Method for discovering multilingual websites across the whole web and obtaining parallel corpora

Country Status (1)

Country Link
CN (1) CN108536688A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1282928A (en) * 1999-07-28 2001-02-07 国际商业机器公司 Method and system for providing national language inquiry service
CN104281711A (en) * 2014-10-27 2015-01-14 浪潮(北京)电子信息产业有限公司 Multi-language processing method and multi-language processing device for WEB application
CN105022728A (en) * 2015-07-13 2015-11-04 广西达译商务服务有限责任公司 Automatic acquisition system of Chinese and Lao bilingual parallel texts and implementation method
CN107077842A (en) * 2014-12-15 2017-08-18 百度(美国)有限责任公司 System and method for phonetic transcription

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ESPLA-GOMIS M. et al.: "Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with Bitextor", 《THE PRAGUE BULLETIN OF MATHEMATICAL LINGUISTICS》 *
GALE W. A. et al.: "A Program for Aligning Sentences in Bilingual Corpora", 《29TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 *
SMITH JASON et al.: "Dirt cheap web-scale parallel text from the common crawl", 《PROCEEDINGS OF THE 51ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 *
韩冬 et al.: "基于子字单元的神经机器翻译未登录词翻译分析", 《中文信息学报》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109815390A (en) * 2018-11-08 2019-05-28 平安科技(深圳)有限公司 Search method, device, computer equipment and the computer storage medium of multilingual information
CN109815390B (en) * 2018-11-08 2023-08-08 平安科技(深圳)有限公司 Method, device, computer equipment and computer storage medium for retrieving multilingual information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180914