CN105045862A - System for automatically acquiring bilingual parallel corpus of Chinese-foreign languages and realization method - Google Patents

Info

Publication number
CN105045862A
Authority
CN
China
Prior art keywords
chinese
data
bilingual
automatically
bilingual parallel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510407578.3A
Other languages
Chinese (zh)
Inventor
温家凯
农强
刘连芳
邓姿娴
陆迪茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pingsoft New Technology Co Ltd
Guangxi Daring E-Commerce Services Co Ltd
Original Assignee
Pingsoft New Technology Co Ltd
Guangxi Daring E-Commerce Services Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pingsoft New Technology Co Ltd and Guangxi Daring E-Commerce Services Co Ltd
Priority to CN201510407578.3A
Publication of CN105045862A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A system and an implementation method for automatically acquiring a Chinese-foreign bilingual parallel corpus are disclosed. The method comprises automatic discovery, automatic extraction and automatic arrangement of Chinese-foreign bilingual parallel information. First, keyword phrases for the corpus to be acquired are formulated, websites are searched through a search engine, web pages are collected to obtain search results, the search results are filtered and screened, and the filtered results are stored in a search result database. Second, the web pages in the search result database are visited and Chinese-foreign bilingual parallel information is extracted automatically. Finally, the extracted information is filtered and the filtered Chinese-foreign bilingual parallel data is stored in the Chinese-foreign bilingual parallel corpus. The system and the method provide important basic data for research on Chinese and foreign languages and for machine translation applications, solve the data source problem faced by corpus acquisition personnel and researchers, and contribute to the development of automatic bilingual corpus acquisition and Chinese-foreign natural language processing.

Description

System and implementation method for automatically acquiring a Chinese-foreign bilingual parallel corpus
Technical field
The present invention relates to the field of computer application technology, and in particular to a system and an implementation method for automatically acquiring a Chinese-foreign bilingual parallel corpus.
Background art
"Parallel texts" are texts written in different languages that stand in a translation relation to one another. In computational linguistics they are distinguished from "comparable texts", which are also written in different languages and address the same subject but are not direct translations of one another.
Many kinds of parallel texts have existed throughout human history. The Rosetta Stone unearthed in Egypt, whose inscription is carved in two languages and three scripts, is the most famous parallel text of antiquity. By comparing the texts on the stele, the French scholar Champollion deciphered the hieroglyphs of ancient Egypt. In addition, contracts, agreements, scriptures and literary works written in different languages side by side have been part of people's lives in many periods and fields. In the late 1950s, parallel corpora began to appear in machine translation research. Because the storage and computing power of computers at that time were limited, and entering large amounts of text data was quite difficult, the role of parallel corpora did not receive much attention. In the late 1970s, research centers such as Xerox PARC and Brigham Young carried out extensive collection of translation resources. In 1987, Martin Kay and Martin Roscheisen proposed the earliest algorithm for automatically aligning parallel texts. Many alignment methods followed, and aligned parallel corpora were applied systematically in natural language processing, including building translation memories, compiling dictionaries and bilingual terminology lists, cross-language information retrieval, computer-aided instruction and contrastive language studies.
Corpus construction is an important foundation of statistical learning methods, and in recent years the great value of corpus resources for natural language processing research has been increasingly recognized. Bilingual corpora in particular have become an indispensable resource for research on machine translation, machine-aided translation and translation knowledge acquisition. On the one hand, bilingual corpora have directly driven the development of new machine translation techniques: parallel corpora provide the essential training data for building statistical machine translation models (e.g., Brown et al. 1990; Melamed 2000; Och and Ney 2002), and corpus-based translation methods such as statistics-based and example-based translation have given machine translation research new directions, effectively improved translation quality and set off a new wave of activity in the field. On the other hand, bilingual corpora are an important source of translation knowledge, from which fine-grained translation knowledge such as translation dictionaries (e.g., Gale and Church 1991; Melamed 1997) and translation templates can be mined and learned, thereby improving traditional machine translation techniques. In addition, bilingual corpora are an important basic resource for cross-language information retrieval (e.g., Davis and Dunning 1995; Jian-Yun Nie, TREC8), translation dictionary compilation, automatic extraction of bilingual terminology and multilingual comparative studies.
Building and obtaining bilingual corpora is very difficult. Many countries have invested large amounts of manpower, material and financial resources, yet the sources of bilingual corpora are mainly concentrated in specific domains such as government reports, news and law, which are not suited to real-text applications. Meanwhile, the large volume of bilingual text on the internet is timely and has broad coverage, which offers a potential route to obtaining bilingual corpora.
PTMiner (Parallel Text Miner, 1999), developed by the researcher Jian-Yun Nie at the University of Montreal, Canada, works as follows. A search engine is used to find websites containing particular anchor text, and these form the candidate bilingual websites. Candidate page pairs are then found by URL name similarity using a predefined table of language prefixes and suffixes: if a URL contains a prefix or suffix for one language, that affix is replaced with the one for the other language to construct a new URL, and if the constructed URL exists, a candidate page pair has been found. Finally, non-parallel pairs are filtered out of the candidates using features such as text length, the HTML markup structure of the pages and the language of the pages. In a manual evaluation of several hundred Chinese-English page pairs, PTMiner reached an accuracy of nearly 90%. It collected 137M of English text and 117M of Chinese text.
STRAND (Structural Translation Recognition, Acquiring Natural Data, 2003), developed by the researcher Resnik at the University of Maryland, College Park, USA, also obtains candidate bilingual websites by means of a search engine and predefined selection rules. Unlike PTMiner, when STRAND uses URL name similarity to find candidate page pairs within a website, it deletes the predefined language-related strings from the Chinese and English URLs; if the two URLs are equal after the language-related strings are removed, they are taken as a candidate pair of bilingual parallel pages. In addition, STRAND studies the structural similarity of parallel pages in more depth and uses more features based on page structure to filter out candidate pairs that are not translations of each other. In a manual evaluation of about 400 Chinese-English page pairs, it achieved a precision of 98% and a recall of 61%. The STRAND system obtained about 3,500 Chinese-English parallel page pairs.
BITS (Bilingual Internet Text Search, Ma and Liberman 1999) downloads all websites under specified domain names as candidate websites and determines Chinese-English parallel page pairs with a content similarity measure, namely the proportion of mutually translated words in the total number of words in the text.
PTI (the Parallel Text Identification System, 2004), developed by Jisong Chen et al. at Monash University, first downloads a large number of bilingual pages with a web crawler. A filename comparison model then obtains bilingual parallel page pairs from the similarity of URL names, on the same principle as PTMiner. Pages for which this step finds no aligned counterpart are passed to a file content analysis model, which defines a similarity measure over the text content of pages and thereby obtains further bilingual parallel page pairs. The PTI system obtained 193 Chinese-English parallel text pairs in total, of which 180 were correct, giving a precision of 93% and a recall of 96%.
WPDE (Web Parallel Data Extraction, 2006), developed by Ke Wu et al. at Microsoft Research Asia, uses not only anchor text but also the ALT text of images when obtaining candidate websites through a search engine. When retrieving candidate bilingual parallel page pairs by URL name similarity, it splits each URL into a pathname and a basename. Pathname pairing uses predefined heuristic strings together with a number of matching rules. Basename pairing does not use the predefined string forms adopted by earlier systems but an improved minimum edit distance algorithm, which experiments showed to be more effective. When filtering candidate pairs, in addition to features such as text length and HTML structure, it introduces a content-based feature: the quality of the sentence alignment of the candidate bilingual parallel texts. On the same test set as PTI, the WPDE system achieved a precision of 97% and a recall of 94%.
With the rapid development of the networked information age, internet resources are growing explosively. The internet is an important source of information today, and people can obtain a wealth of information resources from it, but it also contains large amounts of miscellaneous data. How to extract valuable bilingual data from the mass of information on the internet is a major problem facing data acquisition personnel and related enterprises. Research on web-based acquisition of large-scale bilingual corpora is of great significance for solving the problem of obtaining bilingual corpora and for advancing and applying related technologies. At present, tools and methods for collecting Chinese-foreign bilingual parallel corpora are very scarce, and those that can collect automatically are fewer still. A method that can automatically acquire Chinese-foreign bilingual parallel corpora is therefore urgently needed, to free corpus collectors from tedious collection work and to provide valuable corpus resources for enterprises.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a system and an implementation method for automatically acquiring a Chinese-foreign bilingual parallel corpus. It establishes a web-based automatic bilingual corpus acquisition system that automatically collects Chinese-foreign bilingual parallel text from the internet and can automatically obtain both text-level and sentence-level Chinese-foreign bilingual corpora, realizing automatic discovery, automatic extraction and automatic arrangement of Chinese-foreign bilingual parallel information.
The present invention is realized by the following technical solution:
A system for automatically acquiring a Chinese-foreign bilingual parallel corpus comprises an automatic discovery module, an automatic extraction module and an automatic arrangement module for Chinese-foreign bilingual parallel information, wherein:
(1) Automatic discovery module: realizes automatic discovery of Chinese-foreign bilingual parallel text. Keyword phrases for the corpus to be acquired are formulated, websites are searched through a search engine, web pages are collected to obtain search results, the search results are filtered and screened, and the filtered results are stored in a search result database;
(2) Automatic extraction module: realizes automatic extraction of Chinese-foreign bilingual parallel text. The web pages in the search result database are visited and Chinese-foreign bilingual parallel information is extracted automatically;
(3) Automatic arrangement module: filters the automatically extracted Chinese-foreign bilingual parallel information and stores the filtered Chinese-foreign bilingual parallel data in the Chinese-foreign bilingual parallel corpus.
The automatic discovery workflow of the automatic discovery module is: formulate one or more groups of mutually translated Chinese-foreign keyword phrases, obtain search results through a search engine, analyze the search results and take them as the targets of data acquisition.
The design principles of automatic discovery in the automatic discovery module are:
A. The selected keyword phrases should be Chinese-foreign translation pairs within a specific domain;
B. The third-party search engine used should be one that provides a public search service;
C. After results are obtained by searching on a keyword phrase, only the first n pages of results are saved, where n depends on how popular the selected keywords are; the saved content comprises the URL address, title and summary of each search result.
The automatic extraction workflow of the automatic extraction module is: a web robot accesses the target web page, the corresponding Chinese-foreign keyword phrase is used to locate content within the target page, and page data is obtained by traversing forwards and backwards from the located anchor point.
The extraction principles of the automatic extraction module are:
A. The page file types that may be accessed are restricted to common page file types such as "html", "htm" and "shtml"; pages of types not listed are not accessed;
B. Before a target web page is accessed, the robots.txt file of the target website is checked; if the target page is disallowed in robots.txt, it is not accessed;
C. To extract complete bilingual data, HTML tags embedded in the target-language data are also treated as objects of extraction during the extraction process.
The workflow of the automatic extraction module mainly comprises the following steps:
(1) Filtering of non-target-language information: character filtering is applied separately to the collected Chinese and foreign-language data, mainly removing HTML tags, web page code and non-linguistic symbols, so that noise is removed from the collected information and clean Chinese-foreign bilingual parallel data is obtained;
(2) Chinese and foreign-language word segmentation: Chinese and foreign-language word segmentation tools are used to segment the Chinese and foreign-language data, providing the basis for subsequent data processing.
The workflow of the automatic arrangement module mainly comprises the following steps:
(1) Calculation of length ratio and mutual-translation matching rate: to filter the automatically extracted data effectively, the length ratio and the mutual-translation matching rate are calculated for each group of bilingual data in the extracted Chinese-foreign bilingual parallel data; groups whose lengths differ too much are filtered out, and a mutual-translation matching judgment is made on the remaining data to select the correct parallel data;
(2) The processed Chinese-foreign bilingual parallel data is stored in the Chinese-foreign bilingual parallel corpus.
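The two filters in step (1) can be sketched as follows. This is only a minimal illustration: the text does not give formulas or thresholds, so the definitions of `length_ratio` and `match_rate`, the threshold values and the toy `lexicon` are all assumptions made for the example.

```python
def length_ratio(foreign_tokens, chinese_tokens):
    """Shorter token count divided by longer token count (assumed definition)."""
    a, b = len(foreign_tokens), len(chinese_tokens)
    if a == 0 or b == 0:
        return 0.0
    return min(a, b) / max(a, b)


def match_rate(foreign_tokens, chinese_tokens, lexicon):
    """Share of foreign tokens with a dictionary translation on the Chinese side
    (assumed definition of the mutual-translation matching rate)."""
    if not foreign_tokens:
        return 0.0
    chinese = set(chinese_tokens)
    hits = sum(1 for tok in foreign_tokens if chinese & lexicon.get(tok, set()))
    return hits / len(foreign_tokens)


def keep_pair(foreign_tokens, chinese_tokens, lexicon,
              min_length_ratio=0.5, min_match_rate=0.5):
    """Keep a pair only if it passes both filters; thresholds are hypothetical."""
    return (length_ratio(foreign_tokens, chinese_tokens) >= min_length_ratio
            and match_rate(foreign_tokens, chinese_tokens, lexicon) >= min_match_rate)


# Toy bilingual lexicon: foreign word -> set of Chinese translations.
lexicon = {"red": {"红"}, "flower": {"花"}}
print(keep_pair(["red", "flower"], ["红", "花"], lexicon))
print(keep_pair(["red", "flower"], ["今天", "天气", "很", "好", "啊"], lexicon))
```

In practice the thresholds would have to be tuned per language pair, since typical length ratios differ between languages.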
The implementation method of the system for automatically acquiring a Chinese-foreign bilingual parallel corpus is as follows. A data acquisition server, a data processing server, a data storage server, an external network switch and an internal network switch are set up. The automatic discovery module is embedded in the data acquisition server, and the automatic extraction module and automatic arrangement module are embedded in the data processing server. To ensure data security, the internal and external networks are physically isolated. When the data acquisition server needs to access the internet, it is connected to the external network switch and disconnected from the internal network switch. When the data acquisition server needs to access the intranet, it is connected to the internal network switch and disconnected from the external network switch. The external network switch handles communication on the external network, and the internal network switch handles communication on the intranet.
Data acquisition personnel connect the data acquisition server to the external network switch so that it has internet access. Using a desktop computer on the external network, they prepare the keyword phrases for the data to be collected. Once the acquisition task is determined, a request to start automatic acquisition is sent to the data acquisition server. After receiving the keyword phrase data and the task start command from the desktop computer, the data acquisition server starts the automatic discovery program. Once the search results for all keyword phrases have been obtained over the internet, they are saved locally. Data acquisition personnel then disconnect the data acquisition server from the external network switch and connect it to the internal network switch. They start the automatic extraction and automatic arrangement programs on the data processing server, which reads the search results stored on the data acquisition server and performs automatic extraction and automatic arrangement. When the programs have finished, all bilingual data obtained is stored on the data storage server.
Here, "Chinese-foreign bilingual" refers to Chinese together with any other, non-Chinese language.
Chinese (Hànyǔ), also known as "Zhongwen", is the mother tongue of the Han people. It is also an official language of the People's Republic of China and Singapore and an official language of the United Nations, and it is the language with the largest number of users in the world. It is mainly used in China, Singapore and Malaysia, and in overseas Chinese communities in Burma, Thailand, the U.S., Canada, Australia, New Zealand, Japan and other countries. It is also a widely used minority language in countries such as Malaysia, Burma, the U.S., Canada, Australia and New Zealand.
A foreign language is a language of another country, that is, a language other than the native language of the local people. Foreign languages commonly used in China include English, Russian, Japanese, Korean, German, French, Spanish and Italian. Less commonly used foreign languages include Italian, Portuguese, Norwegian, Finnish, Croatian, Slovene, Czech, Albanian, Bulgarian, Dutch, Estonian, Danish, Russian, Georgian, Belarusian, Armenian, Macedonian, Ethiopian, Hungarian, Greek, Serbian, Slovak, Polish, Romanian, Swedish, Latvian, Lithuanian, Ukrainian, Korean, Thai, Japanese, Arabic, Hindi, Persian (Iran), Hebrew (Israel), Bengali, Indonesian, Malay, Turkish, Filipino, Vietnamese, Laotian, Cambodian, Swahili (East Africa), Zulu, Icelandic, Irish and Afrikaans. Others include Basque, Catalan, Cebuano, Chamorro, Chechen, Fijian, Greenlandic, Kazakh, Kurdish, Latin, Luxembourgish, Macedonian, Maltese, Maori, Mongolian, Nepali, Ryukyuan, Javanese, Scots, Telugu, Shona, Urdu, Uzbek, Azerbaijani, Xhosa, Yoruba and Ilocano.
The outstanding substantive features and notable progress of the present invention are:
1. The system and implementation method provided by the invention make full use of web-based automatic discovery, automatic extraction and filtering techniques for Chinese-foreign bilingual parallel text, forming a system for automatically acquiring a Chinese-foreign bilingual parallel corpus;
2. The scheme used by the invention can collect valuable Chinese-foreign bilingual parallel text from the mass of information on the internet for analysis and research. It provides important basic data for research on Chinese and foreign languages and for machine translation applications, solves the data source problem faced by corpus collectors and researchers, and contributes to the development of automatic bilingual corpus acquisition and Chinese-foreign natural language processing;
3. Parallel corpora are an important kind of corpus, and the construction of Chinese-foreign parallel corpora is currently still a gap. The system and implementation method of the invention, comprising automatic discovery, automatic extraction and automatic arrangement of Chinese-foreign bilingual parallel information, can play a unique role in language comparison, translation studies, language teaching and lexicography;
4. By applying the scheme provided by the invention, a parallel corpus between two languages can be obtained, which addresses the scarcity of corpus resources between languages and helps obtain higher-quality translation rules for building statistical machine translation systems;
5. In translation teaching, the parallel corpus of the invention can supply abundant translation examples, show the range of possible translations and support choosing the best one. Information from the parallel corpus can also be used to verify the examples, definitions, usage rules and usage contexts given in bilingual dictionaries, teaching dictionaries and grammar books, and thus to determine the key points of teaching;
6. Building and obtaining Chinese-foreign bilingual corpora is very difficult. Although large amounts of manpower, material and financial resources have been invested, the sources of Chinese-foreign bilingual corpora are mainly concentrated in specific domains such as government reports, news and law, which are not suited to real-text applications. Because the large volume of bilingual text on the internet is timely and has broad coverage, the system and method of the invention can collect valuable Chinese-foreign bilingual parallel text from the mass of information on the internet for analysis and research and build it into a Chinese-foreign bilingual corpus, which is of great significance for advancing and applying related technologies;
7. The bilingual data collected by the system of the invention can be connected to electronic guide devices at tourist attractions, museums, science exhibition halls and the like, so that scenic spots and exhibits are presented in bilingual form with both text and pictures. Visitors can listen while they look, take in knowledge, understand the meaning and enjoy the culture, and can fully appreciate the cultural background of what they are viewing; browsing the two languages side by side also brings out the rich connotations of the sights and exhibits.
Brief description of the drawings
Fig. 1 is a structural diagram of the system for automatically acquiring a Chinese-foreign bilingual parallel corpus of the present invention;
Fig. 2 is a flowchart of the method for automatically acquiring a Chinese-foreign bilingual parallel corpus of the present invention;
Fig. 3 is a flowchart of the method for filtering a Chinese-foreign bilingual parallel corpus of the present invention;
Fig. 4 is a block diagram of an example structure of a personal computer used as the information processing device in embodiments of the present invention;
Fig. 5 is a network topology diagram of the system of the present invention.
Embodiments
Specific implementations of embodiments of the present invention are given below. The preferred embodiments are described in detail in order to disclose the invention fully, and are not intended to limit it.
As shown in Fig. 1, a system for automatically acquiring a Chinese-foreign bilingual parallel corpus comprises automatic discovery, automatic extraction and automatic arrangement of Chinese-foreign bilingual parallel information. The first stage is automatic discovery of Chinese-foreign bilingual parallel data: keyword phrases for the corpus to be acquired are formulated, websites are searched through a search engine, web pages are collected to obtain search results, the search results are filtered and screened, and the filtered results are stored in a search result database. The second stage is automatic extraction: the web pages in the search result database are visited and Chinese-foreign bilingual parallel information is extracted automatically. The final stage is automatic arrangement: the automatically extracted Chinese-foreign bilingual parallel information is filtered, and the filtered Chinese-foreign bilingual parallel data is stored in the Chinese-foreign bilingual parallel corpus.
As shown in Fig. 2, the method for automatically acquiring a Chinese-foreign bilingual parallel corpus of the present invention comprises the following steps:
Automatic discovery of Chinese-foreign bilingual parallel text on web pages:
First, the keyword phrases for the corpus to be acquired are formulated. A keyword phrase here is a Chinese-foreign translation pair. For example, taking the keyword phrase "flower bulaklák" as the starting point, relevant search results are obtained through a search engine.
The search results are then filtered. The main purpose of filtering and screening the search results is to improve acquisition efficiency and quality and to reduce acquisition cost. The specific practice is as follows:
The URL address, title and summary of each result are compared to determine whether it repeats an earlier search result. Results judged to be duplicates are filtered out.
The file type of the web page is analyzed from its URL address, and URL addresses that do not belong to a common web page file type are removed. Only URL addresses of common web page file types such as "html", "htm", "shtml", "jsp" and "php" are kept.
Results are filtered by analyzing the keyword phrase and the summary. The summary information is located by keyword, results are filtered by the ratio of the length of the Thai text to the length of the Chinese text, and results that contain only a single translated phrase pair are removed.
Finally, the filtered search results (comprising keyword phrase, URL address, title and summary) are stored in the search result database.
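The first two filtering rules above (duplicate removal and page-type screening) can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout with `url`, `title` and `summary` keys is hypothetical, duplicates are taken to be exact matches on all three fields, and URLs without a file extension are rejected.

```python
import posixpath
from urllib.parse import urlparse

# Page types named in the text as common web page file types.
ALLOWED_EXTENSIONS = {".html", ".htm", ".shtml", ".jsp", ".php"}


def has_allowed_page_type(url):
    """True if the URL path ends in one of the allowed extensions."""
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in ALLOWED_EXTENSIONS


def filter_search_results(results):
    """Drop exact duplicates and results whose page type is not allowed."""
    seen = set()
    kept = []
    for result in results:
        key = (result["url"], result["title"], result["summary"])
        if key in seen:
            continue  # repeated search result
        seen.add(key)
        if has_allowed_page_type(result["url"]):
            kept.append(result)
    return kept


results = [
    {"url": "http://example.com/a.html", "title": "A", "summary": "s"},
    {"url": "http://example.com/a.html", "title": "A", "summary": "s"},
    {"url": "http://example.com/doc.pdf", "title": "B", "summary": "t"},
    {"url": "http://example.com/b.PHP?id=1", "title": "C", "summary": "u"},
]
kept = filter_search_results(results)
print([r["url"] for r in kept])
```

A real deployment would probably need a looser duplicate test (for example, normalizing URLs before comparison); the exact-match rule is only the simplest reading of the text.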
Automatic extraction of Chinese-foreign bilingual parallel text from web pages:
Bilingual information is extracted automatically by visiting the web pages in the search result database. The implementation is as follows:
First, the queue of newly added URL addresses to be visited is obtained from the search result database, and one URL address is taken from the queue. The system checks whether the target website has a robots.txt file and whether the target URL address is disallowed by it. If access to the URL address is not allowed, the system skips it and takes the next URL address from the queue. If access is allowed, the system accesses and parses the web page at that URL address.
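The queue-and-check loop described above can be sketched with Python's standard `urllib.robotparser`. To keep the example self-contained, the robots.txt rules are passed in as lines instead of being downloaded; the function name `allowed_urls` and the user agent string `corpus-bot` are hypothetical.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def allowed_urls(urls, robots_lines, user_agent="corpus-bot"):
    """Return the URLs from the queue that the given robots.txt rules permit."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # rules supplied directly, no network access
    pending = deque(urls)
    allowed = []
    while pending:
        url = pending.popleft()
        if parser.can_fetch(user_agent, url):
            allowed.append(url)  # this page would be fetched and parsed next
        # otherwise the URL is skipped and the next one is taken
    return allowed


robots_lines = ["User-agent: *", "Disallow: /private/"]
queue = [
    "http://example.com/a.html",
    "http://example.com/private/b.html",
]
print(allowed_urls(queue, robots_lines))
```

Against a live site, the parser would instead be pointed at the site's robots.txt with `set_url()` and `read()`, one parser per host.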
By parsing the web page, the system begins to automatically extract the Chinese-foreign bilingual parallel data on the page. The specific steps are as follows:
1. Rough extraction of Chinese-foreign bilingual data:
(1) The full page content is read into a character string S.
(2) S is split into two character strings, s1 and s2. s1 holds all the foreign-language data in S, and s2 holds all the Chinese data in S.
(3) All saved Chinese and Thai-language data must keep their original order on the page. The html tags that lie between items of Thai-language data and between items of Chinese data are retained, as are the accompanying characters such as punctuation, digits and special symbols (interspersed English text is set aside for now).
2. HTML tag replacement:
All html tags in s1 and s2 are uniformly replaced with a separator mark <T>.
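Steps 1 and 2 can be illustrated with the following sketch. It rests on several assumptions that the text does not fix: Chinese is detected with the CJK Unified Ideographs block, the foreign language is taken to be Thai (the Thai Unicode block), tags are found with a deliberately simple regular expression, and punctuation, digits and symbols are attached to whichever language was seen last within the same text run. The function names are illustrative only.

```python
import re

TAG = re.compile(r"<[^>]+>")  # simplistic html tag pattern, enough for a sketch
MARK = "<T>"                  # separator mark named in the text


def is_chinese(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"   # CJK Unified Ideographs


def is_foreign(ch: str) -> bool:
    return "\u0e00" <= ch <= "\u0e7f"   # Thai block (assumed foreign language)


def _append_text(text: str, s1: list, s2: list) -> None:
    """Distribute one tag-free text run over s1 (foreign) and s2 (Chinese)."""
    current = None
    for ch in text:
        if is_foreign(ch):
            current = s1
        elif is_chinese(ch):
            current = s2
        elif ch.isascii() and ch.isalpha():
            continue  # interspersed English is set aside
        if current is not None:
            current.append(ch)  # language char, or punctuation/digit following it


def rough_split(page: str):
    """Return (s1, s2) with every html tag replaced by the <T> mark."""
    s1, s2 = [], []
    pos = 0
    for match in TAG.finditer(page):
        _append_text(page[pos:match.start()], s1, s2)
        s1.append(MARK)
        s2.append(MARK)
        pos = match.end()
    _append_text(page[pos:], s1, s2)
    return "".join(s1), "".join(s2)
```

Both strings keep the page order and the same sequence of `<T>` marks, so the marks can later serve as sentence-boundary hints before they are removed.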
3. Extraction of Chinese-foreign bilingual parallel data:
(1) s1 is split into foreign-language sentences, giving the string array st1[m]; s2 is split into Chinese sentences, giving the string array st2[n]. Here m and n denote the total number of foreign-language sentences and Chinese sentences respectively.
(2) All separator marks <T> inside st1[m] and st2[n] are removed.
(3) Foreign-language word segmentation is applied to every string in st1[m], and Chinese word segmentation to every string in st2[n].
(4) Sentences in st1[m] and st2[n] that consist of only a single word are filtered out.
(5) Automatic matching of bilingual parallel sentence pairs:
a. Take one segmented foreign-language sentence s_th from st1[m].
b. Use the Chinese-foreign mutual-translation dictionary to translate each foreign-language phrase in s_th into Chinese, giving the sentence s_th_ch.
c. Take one Chinese phrase from s_th_ch and find all sentences in st2[n] that contain it, giving st2[n']. If no sentence in st2[n] contains this phrase, take the next Chinese phrase from s_th_ch and search st2[n] again. If n' > 1, take the next word from s_th_ch and search within st2[n'] for all sentences that contain it. Repeat this step until n' = 1 or all words in s_th_ch have been traversed. If n' = 1, that is, st2[n'] contains only one sentence, that sentence is taken as the best Chinese parallel sentence s_ch for s_th. If all words in s_th_ch have been traversed and n' > 1, the sentence in st2[n'] with the shortest string length is taken as the best Chinese parallel sentence s_ch for s_th.
d. Save s_th and s_ch as a Chinese-foreign bilingual parallel sentence pair, and remove s_th and s_ch from st1[m] and st2[n] respectively.
e. If no corresponding s_ch is found for the s_th taken out, take the next segmented foreign-language sentence from st1[m] and repeat the above steps until st1[m] has been fully traversed.
f. After st1[m] has been traversed, if m > 1 and n > 1, unmatched Chinese-foreign bilingual parallel sentence pairs may still exist; in that case the above steps are applied in the reverse direction, searching st1[m] for the best foreign-language parallel sentence of each sentence in st2[n].
The next URL address to be visited is taken from the URL address queue and the above steps are repeated until the Chinese-foreign bilingual parallel corpus data of all URL addresses to be visited have been extracted. All automatically extracted Chinese-foreign bilingual parallel sentence pairs form the queue of Chinese-foreign bilingual parallel data to be filtered.
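The candidate-narrowing procedure of steps a to e can be sketched as below. Sentences are represented as lists of already-segmented words, and the dictionary is a plain one-to-one mapping; both are simplifying assumptions, as are the function names. The text does not say what happens when a later word matches none of the remaining candidates, so this sketch simply keeps the current candidate set and moves to the next word.

```python
def best_chinese_match(s_th_ch, st2):
    """Return the index in st2 of the best parallel sentence, or None.

    s_th_ch: Chinese words obtained by dictionary lookup of a foreign sentence.
    st2: list of segmented Chinese sentences (each a list of words).
    """
    candidates = list(range(len(st2)))
    matched_any = False
    for word in s_th_ch:
        narrowed = [i for i in candidates if word in st2[i]]
        if not narrowed:
            continue  # assumed: a word with no hit leaves the candidates unchanged
        candidates = narrowed
        matched_any = True
        if len(candidates) == 1:
            return candidates[0]
    if not matched_any:
        return None
    # Several candidates remain: choose the one with the shortest string length.
    return min(candidates, key=lambda i: len("".join(st2[i])))


def align(st1, st2, dictionary):
    """Pair each foreign sentence with its best Chinese sentence (steps a to e)."""
    remaining = list(st2)
    pairs = []
    for s_th in st1:
        s_th_ch = [dictionary[w] for w in s_th if w in dictionary]
        idx = best_chinese_match(s_th_ch, remaining)
        if idx is not None:
            pairs.append((s_th, remaining.pop(idx)))  # matched sentence is removed
    return pairs
```

The reverse pass of step f would call the same routine with the roles of the two languages swapped and a dictionary in the opposite direction.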
Chinese-foreign bilingual parallel data filtering technique: the automatically extracted Chinese-foreign bilingual parallel information is filtered, which improves the quality of the collected information to a large extent.
As shown in Figure 3, the method for filtering the Chinese-foreign bilingual parallel corpus from web pages comprises the following:
Information denoising: to keep the data clean, non-language information in the collected data, including html tags and non-language characters, is filtered out again.
Comparison filtering of collected information: the denoised Chinese-foreign bilingual parallel information is filtered. The following operations are carried out on each group of Chinese-foreign bilingual parallel information:
First, length-ratio filtering is carried out. The foreign-language information and the Chinese information are each segmented into words. Let the number of phrases in the foreign-language information be a and the number of phrases in the Chinese information be b, and set a minimum length ratio μ and a maximum length ratio λ. If a/b > λ or b/a > λ or a/b < μ or b/a < μ, the group is regarded as worthless Chinese-foreign bilingual parallel data and is filtered out.
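The length-ratio test can be written directly from the four inequalities. In this sketch the function name and the threshold values used in the example are assumptions (the text does not give values for μ and λ), and a pair in which either side has no phrases is rejected outright to avoid dividing by zero.

```python
def passes_length_ratio(a: int, b: int, mu: float, lam: float) -> bool:
    """Keep a pair only if both a/b and b/a lie within [mu, lam].

    a: phrase count of the foreign-language side, b: phrase count of the Chinese side.
    """
    if a == 0 or b == 0:
        return False  # assumed: an empty side can never form a valid pair
    return all(mu <= ratio <= lam for ratio in (a / b, b / a))


# Example with assumed thresholds mu = 0.5 and lam = 2.0:
# a pair with 10 and 8 phrases is kept, a pair with 10 and 3 phrases is dropped.
```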
Then, matching-rate filtering is applied to the Chinese-foreign bilingual parallel information that satisfies the length-ratio requirement. Suppose the segmented Chinese information contains M phrases, from which m phrases are extracted and translated via the Chinese-foreign dictionary into m corresponding foreign-language phrases. Of these m foreign-language phrases, n match phrases in the segmented foreign-language information exactly. Then p(cn|th) = m² / (n × M), where p(cn|th) is regarded as the matching rate of the Chinese information with respect to the Thai-language information. Likewise, p(th|cn) is the matching rate of the Thai-language information with respect to the Chinese information. The matching rate of a collected group of bilingual parallel information is then defined as p = (p(th|cn) + p(cn|th)) / 2. The matching rate is calculated for each group of bilingual parallel information, a minimum matching rate ρ is set, and a group is filtered out when p < ρ.
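The final averaging and thresholding step can be sketched as follows. The two directional rates p(th|cn) and p(cn|th) are taken as inputs that have already been computed as the text describes; the function names, the tuple layout and the example threshold are assumptions for illustration.

```python
def combined_matching_rate(p_th_cn: float, p_cn_th: float) -> float:
    """p = (p(th|cn) + p(cn|th)) / 2, as defined in the text."""
    return (p_th_cn + p_cn_th) / 2


def filter_by_matching_rate(scored_pairs, rho: float):
    """Keep the pairs whose combined matching rate is at least rho.

    scored_pairs: iterable of (pair, p_th_cn, p_cn_th) tuples (assumed layout).
    """
    return [
        pair
        for pair, p_th_cn, p_cn_th in scored_pairs
        if combined_matching_rate(p_th_cn, p_cn_th) >= rho
    ]
```

Averaging the two directions means a pair is kept only when the dictionary evidence is reasonably strong on both sides, rather than in one direction alone.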
Finally, the collected Chinese-foreign bilingual parallel information is checked for duplicates against the Chinese-foreign bilingual parallel corpus, and duplicates are filtered out. The processed Chinese-foreign bilingual parallel data are then stored in the Chinese-foreign bilingual parallel corpus.
Application Example 1:
As shown in Figure 4, the CPU, ROM and RAM are connected to one another via a bus. An input/output interface is also connected to the bus, and an input system, an output system, a storage system, a communication system and a drive system are connected to the input/output interface. The input system comprises a keyboard, a mouse and the like; the output system comprises a display, loudspeakers and the like; the storage system comprises a hard disk and the like; the communication system comprises a network interface unit such as a LAN card or a modem, and carries out communication over a network such as the Internet. The drive system is connected to the input/output interface as required; a removable storage medium such as a magnetic disk, optical disc, magneto-optical disk or USB flash drive is mounted in the drive system as required, and a computer program stored on the removable storage medium can be read from it as required.
The central processing unit (CPU) performs various processes according to a program stored in the read-only memory (ROM) or a program loaded from the storage system into the random access memory (RAM). The RAM also stores, as required, the data needed when the CPU performs the various processes.
The instruction code of the present invention can be read from the above media and executed.
Application Example 2:
The network topology shown in Figure 5 is briefly described as follows:
Node description
Switch X: external network switch
Switch Y: internal network switch
Server A: data acquisition server (hosting the automatic discovery module)
Server B: data processing server (hosting the automatic extraction module and the automatic arranging module)
Server C: data storage server
Network communication
To keep the data secure, the internal and external networks are physically isolated, and the servers connect to the Internet through a firewall and a router. When server A needs to access the Internet, it is connected to switch X and disconnected from switch Y. When server A needs to access the internal network, it is connected to switch Y and disconnected from switch X.
Switch X handles communication on the external network.
Switch Y handles communication on the internal network.
The working process is as follows:
(1) The data acquisition staff connect server A to switch X so that server A can access Internet services. Using a desktop computer on the external network, the staff prepare the keyword groups for which data are to be collected. Once the acquisition task has been determined, a request to start automatic acquisition is sent to server A.
(2) After server A receives the keyword group data and the task start command sent by the desktop computer, it starts running the automatic data discovery program. After the search results for all keyword groups have been obtained over the Internet, they are saved locally.
(3) The data acquisition staff disconnect server A from switch X and connect it to switch Y. The staff then start the automatic data extraction and automatic arranging programs on server B; server B reads the search results stored on server A and performs automatic extraction and automatic arranging. When the programs have finished, all the bilingual data obtained are stored on server C.
Application Example 3:
Paris is the capital and largest city of France, the country's political, economic, cultural and commercial centre, and the second-largest city in Europe. It is a hub of European road and rail traffic, one of the centres of world air transport, and a famous tourist destination that many Chinese visitors travel to each year. Because local interpreters, and Chinese-speaking guides in particular, are in short supply, it is difficult to give every visitor a consistent, standardised guide service. Electronic tour-guide systems are therefore installed at some of the better-known sights. With the bilingual data collected by the system of the present invention, the scenic spots and exhibits can be presented with text and pictures in bilingual parallel form, so that visitors can look and listen at the same time, take in knowledge, understand the meaning of what they see and enjoy the culture. Visitors can thus fully appreciate the cultural background of the objects on view, and the rich significance of the sights and exhibits is brought out as they browse the two languages side by side.
Application Example 4:
Lisbon is an industrial and international city. It is now the political, economic, cultural and educational centre of Portugal and a well-known European tourist city. In indoor venues in the city such as museums, science and technology centres and conference and exhibition centres, electronic guide systems replace human guides and the loudspeakers whose noise disturbs other visitors. Such a guide system is connected to the system of the present invention and presents the exhibit content to visitors in bilingual parallel form with text and pictures, so that the meaning of each exhibit is extended and made more vivid as visitors view it. After fully appreciating the appearance of an exhibit, visitors also gain a wealth of knowledge about it. In addition, visitors can use buttons on the touch screen to look up the locations of exhibition areas and the routes to them, and enjoy a self-guided visit.
The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims (8)

1. A system for automatically acquiring a Chinese-foreign bilingual parallel corpus, comprising an automatic discovery module, an automatic extraction module and an automatic arranging module for Chinese-foreign bilingual parallel information, characterized in that:
(1) automatic discovery module: implements automatic discovery of the Chinese-foreign bilingual parallel corpus; keyword groups for which corpus material is to be collected are formulated, websites are searched through a search engine and web pages are gathered to obtain search results, the search result information is filtered and screened, and the search results that remain after filtering are stored in a search results database;
(2) automatic extraction module: implements automatic extraction of the Chinese-foreign bilingual parallel corpus; Chinese-foreign bilingual parallel information is extracted automatically by visiting the web pages recorded in the search results database;
(3) automatic arranging module: filters the automatically extracted Chinese-foreign bilingual parallel information and stores the filtered Chinese-foreign bilingual parallel data in a Chinese-foreign bilingual parallel corpus.
2. The system for automatically acquiring a Chinese-foreign bilingual parallel corpus according to claim 1, characterized in that the automatic discovery workflow of the automatic discovery module is: formulating one or more groups of Chinese-foreign mutual-translation keyword groups, obtaining search results through a search engine, analysing the search results and taking them as the targets of data collection.
3. The system for automatically acquiring a Chinese-foreign bilingual parallel corpus according to claim 1, characterized in that the design principles for automatic discovery in the automatic discovery module are:
A. the selected keyword groups should be Chinese-foreign mutual-translation phrase pairs within a specific domain;
B. the third-party search engine tool used is an openly available search service provider;
C. after results are obtained by keyword-group search, only the first n pages of information are kept, where n is related to the popularity of the selected keywords, and the saved content comprises the search result URL address, the search result title and the search result summary.
4. The system for automatically acquiring a Chinese-foreign bilingual parallel corpus according to claim 1, characterized in that the automatic extraction workflow of the automatic extraction module is: using an Internet robot to access the target web page, using the corresponding Chinese-foreign mutual-translation keyword group to locate content within the target page, and traversing forwards and backwards from the located point to obtain the page data.
5. The system for automatically acquiring a Chinese-foreign bilingual parallel corpus according to claim 1, characterized in that the principles for extracting the bilingual parallel corpus from the network in the automatic extraction module are:
A. the page file types to be accessed are restricted to "html", "htm", "shtml" and other common page file types, and pages of unspecified types are not accessed;
B. before the target web page is accessed, the robots.txt file of the target website is checked, and if the target page is listed in the robots.txt file, the target web page is not accessed;
C. complete bilingual data are to be extracted, and during extraction the html tags contained within the target-language data are also treated as objects of extraction.
6. The system for automatically acquiring a Chinese-foreign bilingual parallel corpus according to claim 1, characterized in that the workflow of the automatic extraction module mainly comprises the following steps:
(1) filtering of non-target-language information: character filtering is applied to the collected Chinese and foreign-language data respectively, mainly filtering out html tags, web page code and certain non-language symbols, so as to remove noise from the collected information and obtain clean Chinese-foreign bilingual parallel data;
(2) Chinese and foreign-language word segmentation: Chinese and foreign-language word segmentation tools are used to segment the Chinese and foreign-language data, providing the basis for subsequent data processing.
7. The system for automatically acquiring a Chinese-foreign bilingual parallel corpus according to claim 1, characterized in that the workflow of the automatic arranging module mainly comprises the following steps:
(1) calculation of length ratio and mutual-translation matching rate: to filter the automatically extracted data effectively, the length ratio and the mutual-translation matching rate are calculated for each group of bilingual data in the extracted Chinese-foreign bilingual parallel data; data with a large length difference are filtered out, and the mutual-translation match of the Chinese-foreign bilingual parallel data is judged so as to select the correct parallel data;
(2) the processed Chinese-foreign bilingual parallel data are stored in the Chinese-foreign bilingual parallel corpus.
8. The system for automatically acquiring a Chinese-foreign bilingual parallel corpus according to claim 1, characterized in that the system is implemented as follows: a data acquisition server, a data processing server, a data storage server, an external network switch and an internal network switch are provided; the automatic discovery module is embedded in the data acquisition server, and the automatic extraction module and the automatic arranging module are embedded in the data processing server;
the data acquisition staff connect the data acquisition server to the external network switch so that the data acquisition server can access Internet services; using a desktop computer on the external network, the staff prepare the keyword groups for which data are to be collected; once the acquisition task has been determined, a request to start automatic acquisition is sent to the data acquisition server; after the data acquisition server receives the keyword group data and the task start command sent by the desktop computer, it starts running the automatic data discovery program, and after the search results for all keyword groups have been obtained over the Internet, the search results are saved locally; the data acquisition staff disconnect the data acquisition server from the external network switch and connect it to the internal network switch;
the data acquisition staff start the automatic data extraction and automatic arranging programs on the data processing server; the data processing server reads the search results stored on the data acquisition server and performs automatic extraction and automatic arranging; when the programs have finished, all the bilingual data obtained are stored on the data storage server.
CN201510407578.3A 2015-07-13 2015-07-13 System for automatically acquiring bilingual parallel corpus of Chinese-foreign languages and realization method Pending CN105045862A (en)

Publications (1)

Publication Number Publication Date
CN105045862A true CN105045862A (en) 2015-11-11

Family

ID=54452409

Country Status (1)

Country Link
CN (1) CN105045862A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151111
