CN108536688A - Method for discovering multilingual websites across the whole web and obtaining parallel corpora - Google Patents
Method for discovering multilingual websites across the whole web and obtaining parallel corpora
- Publication number
- CN108536688A (application CN201810365280.4A)
- Authority
- CN
- China
- Prior art keywords
- language
- url
- parallel corpora
- website
- language website
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a method for discovering multilingual websites across the whole web and obtaining parallel corpora, comprising: obtaining the index files of URL information in Common Crawl; building, according to a preset standard, a dictionary containing various language tags for the obtained index files of URL information; filtering the index files of URL information according to the dictionary containing various language tags to obtain the URLs of the required candidate multilingual websites; and obtaining parallel corpora from the URLs of the candidate multilingual websites. By applying language tags to an open-source data set, the method obtains the URLs of candidate multilingual websites and finally obtains parallel corpora, so that parallel corpora can be obtained quickly.
Description
Technical field
The present invention relates to the field of machine translation technology, and in particular to a method for discovering multilingual websites across the whole web and obtaining parallel corpora.
Background
Bilingual parallel corpora are a very important resource for many tasks and are used extensively, especially in natural language processing. Within natural language processing, machine translation depends particularly heavily on parallel corpora: the quality of machine translation depends directly on the quantity and quality of the parallel corpora. In machine translation, Chinese-English, Arabic-English and many European language pairs have parallel corpora on the scale of millions or more, whereas for most other language pairs parallel corpora are difficult to obtain. Obtaining parallel corpora is therefore an important first step towards translation between most languages.
The large parallel corpora currently available on the web consist mainly of multilingual news in the world's major languages, the translated versions of European Union meetings and documents, Wikipedia content, and some translated books.
In the paper "PaCo2: A Fully Automated tool for gathering Parallel Corpora from the Web", the authors use a search engine to search for keywords containing a bilingual pair and obtain a list of potential candidate multilingual websites. They then download a small portion of each website to check whether it meets the criteria, and carry out the usual steps on the multilingual websites, such as parallel page extraction and content extraction.
The conventional technology has the following technical problems:
The Internet is a valuable source of parallel text. Many websites are multilingual, which makes them an especially valuable source of parallel corpora. However, because Internet content is complex, disordered and vast in scale, it is difficult to collect large amounts of parallel corpora in a concentrated way.
Existing techniques struggle to find all the multilingual websites on the web and crawl parallel corpora from them comprehensively and at scale. Searching for multilingual websites at random wastes a great deal of time, while the search-engine approach is heavily influenced by the search engine, and much time may be wasted checking ordinary web pages.
Summary of the invention
In view of the above technical problems, it is necessary to provide a method for discovering multilingual websites across the whole web and obtaining parallel corpora. The method applies language tags to an open-source data set to obtain the URLs of candidate multilingual websites and finally obtains parallel corpora, and it can obtain parallel corpora quickly.
A method for discovering multilingual websites across the whole web and obtaining parallel corpora comprises:
obtaining the index files of URL information in Common Crawl;
building, according to a preset standard, a dictionary containing various language tags for the obtained index files of URL information;
filtering the index files of URL information according to the dictionary containing various language tags to obtain the URLs of the required candidate multilingual websites;
obtaining parallel corpora from the URLs of the candidate multilingual websites.
The above method applies language tags to an open-source data set to obtain the URLs of candidate multilingual websites and finally obtains parallel corpora; it can obtain parallel corpora quickly.
In another embodiment, the preset standard is the ISO 639 standard.
In another embodiment, the step of "obtaining parallel corpora from the URLs of the candidate multilingual websites" is implemented with the Bitextor tool.
A computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any one of the above methods when executing the program.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of any one of the above methods.
Description of the drawings
Fig. 1 is a schematic flowchart of a method for discovering multilingual websites across the whole web and obtaining parallel corpora, provided by an embodiment of the present application.
Fig. 2 is a schematic diagram of the index file of URL information used in the method provided by an embodiment of the present application.
Fig. 3 is a schematic diagram of the candidate websites obtained by the method provided by an embodiment of the present application.
Detailed description
In order to make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely explain the present invention and do not limit it.
A method for discovering multilingual websites across the whole web and obtaining parallel corpora comprises:
S110: obtaining the index files of URL information in Common Crawl;
S120: building, according to a preset standard, a dictionary containing various language tags for the obtained index files of URL information;
S130: filtering the index files of URL information according to the dictionary containing various language tags to obtain the URLs of the required candidate multilingual websites;
S140: obtaining parallel corpora from the URLs of the candidate multilingual websites.
The above method applies language tags to an open-source data set to obtain the URLs of candidate multilingual websites and finally obtains parallel corpora; it can obtain parallel corpora quickly.
In another embodiment, the preset standard is the ISO 639 standard.
In another embodiment, the step of "obtaining parallel corpora from the URLs of the candidate multilingual websites" is implemented with the Bitextor tool.
A computer device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of any one of the above methods when executing the program.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of any one of the above methods.
A specific application scenario of the present invention is described below:
The invention aims to obtain the URLs of multilingual websites from Common Crawl so that parallel corpora can then be processed. First, we use the index data in Common Crawl as the source of URLs; next, we build a dictionary containing various language tags according to an international standard; finally, we filter all of the index data to obtain the URLs of the required candidate multilingual websites. Once the candidate URLs have been obtained, they can be processed further.
The details of the workflow are described below:
Common Crawl is a data set provided and maintained by a non-profit organization. It is a crawl of the whole web produced with Apache Nutch, intended to make large amounts of web data available to everyone rather than monopolized by large companies. A new crawl is released every few months, containing hundreds of terabytes of web data, and it is currently stored on Amazon's cloud servers.
The Common Crawl data comprise files in four formats. WARC files hold the raw crawled web page data, including the complete metadata, request and response; WAT files hold the metadata extracted from the WARC files and stored in JSON format; WET files provide the plain text. For more details see http://commoncrawl.org/the-data/get-started/.
The fourth kind, which is easily overlooked and is the one used here, is the URL index files, that is, the index files of URL information about all pages. The specific format is shown in Fig. 2:
Note: each web page has one corresponding record, which contains information such as the domain name, crawl time, URL, media type, returned status, and file size. The December 2017 release of Common Crawl has 302 such index files in total, with a size of about 200 GB.
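As a minimal sketch of reading one such record, the following assumes the layout commonly used by Common Crawl's CDX-style index: a reversed-domain key, a timestamp, and a JSON object on a single line. The field names (`url`, `mime`, `status`, `length`) and the sample line are assumptions for illustration and are not taken from Fig. 2.

```python
import json
from urllib.parse import urlsplit

def parse_index_line(line):
    """Split one index line into its key, timestamp and JSON fields."""
    # Assumed layout: "<reversed-domain key> <timestamp> <JSON object>"
    _key, timestamp, payload = line.rstrip("\n").split(" ", 2)
    fields = json.loads(payload)
    return {
        "domain": urlsplit(fields["url"]).hostname,
        "crawl_time": timestamp,
        "url": fields["url"],
        "mime": fields.get("mime"),
        "status": fields.get("status"),
        "length": int(fields.get("length", 0)),
    }

# Invented sample line, shaped like the assumed layout.
sample = ('com,apple)/cn 20171212000000 '
          '{"url": "https://www.apple.com/cn/", "mime": "text/html", '
          '"status": "200", "length": "20480"}')
record = parse_index_line(sample)
```

The returned dictionary carries exactly the kinds of information listed in the note above, so later filtering steps can work on it without re-parsing the line.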
Observation of a large number of parallel web pages shows that the URLs of most parallel pages contain language-tag information. For example, Apple's official English website is https://www.apple.com/, the Chinese page is https://www.apple.com/cn/, and the Vietnamese page is https://www.apple.com/vn/. This is not unique to Apple: almost all multilingual websites use URLs of this structure, differing only in the language tags they use. English alone may have a dozen or so different tags, such as en, en-us, en-af, en-au, en-ca, en-gb, en-ph, en-za and eng. Some websites may even give the pages for their home country no language tag while tagging the pages in other languages.
We can therefore select some well-known international websites, such as Apple and Microsoft, extract the tags of the major languages from them, and add the language codes defined by the basic ISO 639 standard to form an initial language code table (covering 38 major languages and their various tag forms; part of it is shown in the table below). The table is used to decide whether a URL contains a language tag when the URL is analyzed.
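The URL check described above could look roughly like the following sketch. It assumes that the tag appears as a whole path segment, as in the Apple examples; the small `TAG_TO_LANG` excerpt and the function name `language_of_url` are illustrative and not part of the original method.

```python
from urllib.parse import urlsplit

# Tiny illustrative excerpt; the table in the text covers 38 languages.
TAG_TO_LANG = {"en": "en", "en-us": "en", "cn": "zh", "vn": "vi", "fr": "fr"}

def language_of_url(url, tag_to_lang=TAG_TO_LANG):
    """Return the language of the first path segment that is a known tag."""
    path = urlsplit(url).path.lower()
    for segment in path.split("/"):
        if segment in tag_to_lang:
            return tag_to_lang[segment]
    return None  # no tag found, e.g. an untagged home-country page
```

Tags carried in a subdomain or a query parameter, and variants such as `en_US`, are not handled here; they would need extra rules.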
It will be appreciated that the websites may also be selected by other criteria, which are not described further here.
Of course, for some less common languages we may not be able to determine what their language tags are. However, a multilingual website in a less common language will include at least one major language, such as English. Therefore, when the URLs are simply filtered, such websites can still be added to the candidate list so that their languages can be identified in a subsequent step (the language identification module must support the language), which gives complete coverage.
The common tags of the various languages are as follows (each tag maps to the language code in the second column):
Language | Code | Tags |
---|---|---|
English | en | en, en-us, en-af, en-au, en-ca, en-gb, en-ph, ... |
French | fr | fr, fra, fr-fr, fr-be, fr-ca, fr-ch, fr-lu |
Vietnamese | vi | vi, vn, vnm, vie |
Chinese | zh | cn, hk, mo, tw, chn, zho |
Russian | ru | ru, ru-ru, rus |
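One possible way to hold these tags in code is sketched below: the tags are grouped per language and then inverted into the flat tag-to-language lookup used during filtering. Only the tags listed above are included (the English list is given only as far as it appears), and the names `LANGUAGE_TAGS` and `build_tag_dictionary` are hypothetical.

```python
# Tags per language, as listed above.
LANGUAGE_TAGS = {
    "en": ["en", "en-us", "en-af", "en-au", "en-ca", "en-gb", "en-ph"],
    "fr": ["fr", "fra", "fr-fr", "fr-be", "fr-ca", "fr-ch", "fr-lu"],
    "vi": ["vi", "vn", "vnm", "vie"],
    "zh": ["cn", "hk", "mo", "tw", "chn", "zho"],
    "ru": ["ru", "ru-ru", "rus"],
}

def build_tag_dictionary(language_tags):
    """Invert {language: [tags]} into a flat {tag: language} lookup."""
    tag_to_lang = {}
    for lang, tags in language_tags.items():
        for tag in tags:
            previous = tag_to_lang.setdefault(tag.lower(), lang)
            if previous != lang:
                # A tag claimed by two languages would make filtering ambiguous.
                raise ValueError(f"tag {tag!r} maps to both {previous} and {lang}")
    return tag_to_lang

TAG_TO_LANG = build_tag_dictionary(LANGUAGE_TAGS)
```

Grouping by language keeps the table easy to extend, and the conflict check catches entries that would otherwise silently overwrite each other.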
Filtering of the index files downloaded to the local machine then begins. This process is implemented with mrjob, a Python framework based on the MapReduce model. An mrjob job can run on the local machine, on a Hadoop cluster, on Amazon Elastic MapReduce (EMR), or on Google Cloud Dataproc (Dataproc); the latter three are distributed processing frameworks. Because the data volume is not especially large, the present application can run the job locally. If files such as WET are processed directly, however, it is recommended to run the job directly on the cluster service provided by Amazon, which saves the time needed to transfer large amounts of data (Common Crawl is stored in the Amazon S3 cloud storage service).
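The MapReduce shape of this filtering step can be illustrated with a plain-Python stand-in. mrjob itself is deliberately not used here so that the sketch runs without the library; in an mrjob job the two functions would correspond roughly to the `mapper` and `reducer` methods of an `MRJob` subclass. The tag excerpt, the example URLs and the helper `run_local` are assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit

TAG_TO_LANG = {"en": "en", "cn": "zh", "vn": "vi"}  # illustrative excerpt

def mapper(url):
    """Emit ((domain, language), 1) for a URL whose path carries a known tag."""
    parts = urlsplit(url)
    for segment in parts.path.lower().split("/"):
        if segment in TAG_TO_LANG:
            yield (parts.hostname, TAG_TO_LANG[segment]), 1
            break

def reducer(key, counts):
    """Sum the page counts for one (domain, language) key."""
    yield key, sum(counts)

def run_local(urls):
    """Group mapper output by key, then reduce (the shuffle a framework does)."""
    grouped = defaultdict(list)
    for url in urls:
        for key, value in mapper(url):
            grouped[key].append(value)
    return dict(pair for key, values in grouped.items()
                for pair in reducer(key, values))

counts = run_local([
    "https://www.example.com/en/a.html",
    "https://www.example.com/en/b.html",
    "https://www.example.com/cn/a.html",
    "https://www.example.com/about.html",
])
```

Keying on the (domain, language) pair means the reducer output is already the per-domain page count for each language tag.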
The specific filter criteria are: filtering out files that are too large or too small (this criterion is optional); keeping only files of type text/html; requiring a response status of 200 OK; and, most importantly, the language tag. For each website domain name, the number of pages containing each of the above language tags is counted. When counting is complete, if the number of pages containing a given language tag meets the requirement, the domain name is regarded as a candidate multilingual website address.
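A per-record version of these criteria might look like the sketch below. The size bounds are hypothetical, since the text gives no values, and the record is assumed to be a dictionary with `url`, `mime`, `status` and `length` keys; none of these names comes from the original.

```python
from urllib.parse import urlsplit

TAG_TO_LANG = {"en": "en", "cn": "zh", "vn": "vi"}  # illustrative excerpt
MIN_LENGTH, MAX_LENGTH = 1_000, 5_000_000  # hypothetical bounds, in bytes

def keep_record(record):
    """Apply the size, media-type, status and language-tag criteria."""
    if not MIN_LENGTH <= record["length"] <= MAX_LENGTH:
        return False  # optional size filter
    if record["mime"] != "text/html" or str(record["status"]) != "200":
        return False  # only successfully fetched HTML pages
    segments = urlsplit(record["url"]).path.lower().split("/")
    return any(segment in TAG_TO_LANG for segment in segments)

good = {"url": "https://www.example.com/cn/", "mime": "text/html",
        "status": "200", "length": 20480}
```

Records that pass this check are the ones counted per domain in the next step.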
In our experiments, the requirement for a multilingual website was as follows: a website whose URLs contain two or more language tags is added to the candidate multilingual websites; a website with only one language tag must have at least 10 pages carrying a mainstream language tag (such as English). It will be appreciated that 10 is just one possible value.
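Read literally, this rule can be sketched as a small function over the per-domain counts. Which languages count as "mainstream" is an assumption here (only English is named in the text), and the function and constant names are hypothetical.

```python
MAJOR_LANGUAGES = {"en"}     # assumption: languages treated as mainstream
MIN_SINGLE_TAG_PAGES = 10    # example threshold from the text

def is_candidate(pages_per_language):
    """Decide candidacy from {language: tagged page count} for one domain."""
    languages = [lang for lang, n in pages_per_language.items() if n > 0]
    if len(languages) >= 2:
        return True  # two or more language tags: candidate
    if len(languages) == 1 and languages[0] in MAJOR_LANGUAGES:
        # Single mainstream tag: require enough tagged pages.
        return pages_per_language[languages[0]] >= MIN_SINGLE_TAG_PAGES
    return False
```

Under this reading, a domain with a single tag in a language outside `MAJOR_LANGUAGES` is not kept; extending the set changes that behaviour.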
In total, 468,955 candidate websites were obtained, as shown in Fig. 3:
In the paper "Bitextor, a free/open-source software to harvest translation memories from multilingual websites", the authors provide a complete open-source software package for extracting parallel corpora from a given website. The software implements a full series of steps: downloading web pages, deduplication, parallel page extraction, and parallel sentence alignment.
Bitextor is a free, open-source tool for obtaining parallel corpora. Bitextor abstracts the whole procedure into the following steps: downloading the website's pages with the open-source tool httrack, deduplication and text extraction, language identification, parallel page extraction, and parallel sentence extraction.
When obtaining parallel pages, Bitextor needs a dictionary for the language pair (that is, a word-to-word dictionary) to judge how parallel two pages are. The official release provides dictionaries for only a few language pairs, but it also provides a tool for generating dictionaries, so users can generate their own when needed.
In addition, in the judgment above the authors assume by default Indo-European languages that need no word segmentation. For many languages, such as Chinese and Japanese, the authors provide no automatic word segmentation, and users must add it themselves as needed. For example, Chinese word segmentation can use Python's jieba segmentation library.
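A minimal sketch of such a segmentation step is shown below. It calls `jieba.lcut` when jieba is installed and otherwise falls back to one token per character, which is only a crude stand-in so the example still runs. How the segmented text is handed to Bitextor is not specified in the text, so joining the tokens with spaces is an assumption.

```python
def segment_chinese(text):
    """Return a list of tokens; uses jieba if available, else single characters."""
    try:
        import jieba
    except ImportError:
        # Crude stand-in so the sketch still runs: one token per character.
        return [ch for ch in text if not ch.isspace()]
    return [tok for tok in jieba.lcut(text) if tok.strip()]

sentence = "我们从多语言网站获取平行语料"
tokens = segment_chinese(sentence)
# Space-separated tokens resemble the word-delimited text assumed for
# languages that need no segmentation.
spaced = " ".join(tokens)
```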
Bitextor's modules can then be used to start the series of subsequent processing steps.
Given that Common Crawl already provides web data for almost the whole web, we can combine the resources it provides with existing crawling techniques. The data provided by Common Crawl are analyzed to obtain a list of candidate multilingual websites; other tools are then used to crawl the required websites and process them step by step to obtain parallel corpora. This workflow solves the problem that multilingual websites are scattered and therefore hard to find and collect at scale. The language identification and summarization steps allow users to carry out subsequent processing on the language pairs they are interested in, according to their needs.
The technical features of the embodiments described above may be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments is described; however, as long as a combination of these technical features contains no contradiction, it should be regarded as within the scope of this specification.
The embodiments described above express only several implementations of the present invention, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the scope of protection of the present invention. The scope of protection of this patent shall therefore be determined by the appended claims.
Claims (5)
1. A method for discovering multilingual websites across the whole web and obtaining parallel corpora, characterized by comprising:
obtaining the index files of URL information in Common Crawl;
building, according to a preset standard, a dictionary containing various language tags for the obtained index files of URL information;
filtering the index files of URL information according to the dictionary containing various language tags to obtain the URLs of the required candidate multilingual websites;
obtaining parallel corpora from the URLs of the candidate multilingual websites.
2. The method for discovering multilingual websites across the whole web and obtaining parallel corpora according to claim 1, characterized in that the preset standard is the ISO 639 standard.
3. The method for discovering multilingual websites across the whole web and obtaining parallel corpora according to claim 1, characterized in that the step of "obtaining parallel corpora from the URLs of the candidate multilingual websites" is implemented with the Bitextor tool.
4. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1-3 when executing the program.
5. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810365280.4A CN108536688A (en) | 2018-04-23 | 2018-04-23 | Method for discovering multilingual websites across the whole web and obtaining parallel corpora |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810365280.4A CN108536688A (en) | 2018-04-23 | 2018-04-23 | Method for discovering multilingual websites across the whole web and obtaining parallel corpora |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108536688A true CN108536688A (en) | 2018-09-14 |
Family
ID=63479122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810365280.4A Pending CN108536688A (en) | 2018-04-23 | 2018-04-23 | Method for discovering multilingual websites across the whole web and obtaining parallel corpora |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108536688A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815390A (en) * | 2018-11-08 | 2019-05-28 | 平安科技(深圳)有限公司 | Search method, device, computer equipment and the computer storage medium of multilingual information |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1282928A (en) * | 1999-07-28 | 2001-02-07 | 国际商业机器公司 | Method and system for providing national language inquiry service |
CN104281711A (en) * | 2014-10-27 | 2015-01-14 | 浪潮(北京)电子信息产业有限公司 | Multi-language processing method and multi-language processing device for WEB application |
CN105022728A (en) * | 2015-07-13 | 2015-11-04 | 广西达译商务服务有限责任公司 | Automatic acquisition system of Chinese and Lao bilingual parallel texts and implementation method |
CN107077842A (en) * | 2014-12-15 | 2017-08-18 | 百度(美国)有限责任公司 | System and method for phonetic transcription |
-
2018
- 2018-04-23 CN CN201810365280.4A patent/CN108536688A/en active Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1282928A (en) * | 1999-07-28 | 2001-02-07 | 国际商业机器公司 | Method and system for providing national language inquiry service |
CN104281711A (en) * | 2014-10-27 | 2015-01-14 | 浪潮(北京)电子信息产业有限公司 | Multi-language processing method and multi-language processing device for WEB application |
CN107077842A (en) * | 2014-12-15 | 2017-08-18 | 百度(美国)有限责任公司 | System and method for phonetic transcription |
CN105022728A (en) * | 2015-07-13 | 2015-11-04 | 广西达译商务服务有限责任公司 | Automatic acquisition system of Chinese and Lao bilingual parallel texts and implementation method |
Non-Patent Citations (4)
Title |
---|
ESPLA-GOMIS M. 等: "Combining content-based and URL-based heuristics to harvest aligned bitexts from multilingual sites with Bitextor", 《THE PRAGUE BULLETIN OF MATHEMATICAL LINGUISTICS》 * |
GALE W. A. 等: "A Program for Aligning Sentences in Bilingual Corpora", 《29TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 * |
SMITH JASON 等: "Dirt cheap web-scale parallel text from the common crawl", 《PROCEEDINGS OF THE 51ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 * |
韩冬 等: "基于子字单元的神经机器翻译未登录词翻译分析", 《中文信息学报》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815390A (en) * | 2018-11-08 | 2019-05-28 | 平安科技(深圳)有限公司 | Search method, device, computer equipment and the computer storage medium of multilingual information |
CN109815390B (en) * | 2018-11-08 | 2023-08-08 | 平安科技(深圳)有限公司 | Method, device, computer equipment and computer storage medium for retrieving multilingual information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180914 |