CN105045861A - System for automatically collecting Hanyu and Bahasa Indonesia bilingualism parallel texts, and implementation method - Google Patents

System for automatically collecting Hanyu and Bahasa Indonesia bilingualism parallel texts, and implementation method

Info

Publication number
CN105045861A
Authority
CN
China
Prior art keywords
chinese
data
bilingual
automatically
prints
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510407512.4A
Other languages
Chinese (zh)
Inventor
温家凯
农强
刘连芳
潘媛媛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pingsoft New Technology Co Ltd
Guangxi Daring E-Commerce Services Co Ltd
Original Assignee
Pingsoft New Technology Co Ltd
Guangxi Daring E-Commerce Services Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pingsoft New Technology Co Ltd and Guangxi Daring E-Commerce Services Co Ltd
Priority to CN201510407512.4A
Publication of CN105045861A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a system for automatically collecting Chinese (Hanyu) and Indonesian (Bahasa Indonesia) bilingual parallel texts, and an implementation method. The method comprises the automatic discovery, automatic extraction and automatic sorting of Chinese-Indonesian bilingual parallel information, in the following steps: first, keyword groups for the texts to be collected are formulated, websites are searched through a search engine, web pages are collected to obtain search results, the search results are filtered and screened, and the filtered results are stored in a search result database; second, the web pages in the search result database are accessed and Chinese-Indonesian bilingual parallel information is extracted automatically; finally, the extracted parallel information is filtered, and the filtered Chinese-Indonesian parallel data are stored in a Chinese-Indonesian parallel corpus. The invention provides important basic data for Chinese-Indonesian language research and machine translation applications, solves the data source problem faced by text collection personnel and researchers, and contributes to the development of automatic bilingual text collection and of Chinese-Indonesian natural language processing.

Description

System and implementation method for automatically collecting Chinese-Indonesian bilingual parallel corpora
Technical field
The present invention relates to the field of computer application technology, and in particular to a system and implementation method for automatically collecting Chinese-Indonesian bilingual parallel corpora.
Background art
"Parallel corpora" (parallel texts) are texts written in different languages that stand in a translation relation to one another. In computational linguistics they are distinguished from "comparable corpora" (comparable texts), which are also written in different languages and cover the same subject but are not direct translations of each other.
Many kinds of parallel corpora have appeared in human history. The Rosetta Stone unearthed in Egypt, whose inscription is carved in two languages and three scripts, is a celebrated parallel corpus from antiquity. By comparing the texts on the stone, the French philologist Champollion succeeded in deciphering ancient Egyptian hieroglyphs. Contracts, scriptures and literary works written side by side in different languages have likewise appeared in many periods and fields of human life. In the late 1950s, parallel corpora began to appear in machine translation research. Because computers of the time had limited storage and computing power, and entering large amounts of text was quite difficult, parallel corpora did not receive much attention. In the late 1970s, research centres such as Xerox PARC and Brigham Young carried out extensive collection of translation resources. In 1987, Martin Kay and Martin Roscheisen proposed the earliest automatic alignment algorithm for parallel corpora. Many alignment methods followed, and aligned parallel corpora have since been applied systematically in natural language processing, including building translation memories, compiling dictionaries and bilingual terminology lists, cross-language information retrieval, computer-aided instruction, and contrastive language studies.
Corpus construction is an important foundation of statistical learning methods, and in recent years the great value of corpus resources for natural language processing research has been increasingly recognised. Bilingual corpora in particular have become an indispensable resource for research on machine translation, machine-aided translation and translation knowledge acquisition. On the one hand, bilingual corpora have directly driven the development of new machine translation techniques: parallel corpora supply the training data required to build statistical machine translation models (e.g., Brown et al. 1990; Melamed 2000; Och and Ney 2002), and corpus-based translation methods such as statistics-based and example-based translation have brought new ideas to machine translation research, effectively improved translation quality, and set off a new wave of activity in the field. On the other hand, bilingual corpora are an important source of translation knowledge, from which fine-grained translation knowledge such as translation dictionaries (e.g., Gale and Church 1991; Melamed 1997) and translation templates can be mined and learned, thereby improving traditional machine translation techniques. In addition, bilingual corpora are an important basic resource for cross-language information retrieval (e.g., Davis and Dunning 1995; Jian-Yun Nie, TREC8), translation dictionary compilation, automatic bilingual terminology extraction, and multilingual comparative studies. Building and acquiring bilingual corpora is very difficult. Countries have invested large amounts of manpower, material and financial resources, but the sources of bilingual corpora are concentrated in specific domains such as government reports, news and law, which makes them unsuitable for real-world text applications. Meanwhile, the large volume of bilingual text on the Internet is timely and has broad coverage, which offers a potential route to acquiring bilingual corpora.
PTMiner (Parallel Text Miner, 1999), developed by the researcher Jian-Yun Nie of the University of Montreal, Canada, works as follows. Candidate bilingual websites are found by using a search engine to look for sites containing particular anchor text. Candidate page pairs are then extracted by URL naming similarity, relying on a predefined table of language prefixes and suffixes: if a URL contains a prefix or suffix for one language, that prefix or suffix is replaced with the one for the other language to construct a new URL, and if the constructed URL exists, a candidate page pair has been found. Finally, non-parallel pairs are filtered out of the candidates using features such as text length, the HTML markup structure of the pages, and the language of the pages. On Chinese-English web text, several hundred Chinese-English parallel page pairs selected from the PTMiner output were evaluated manually, giving a precision of nearly 90%. The English text obtained amounted to 137M and the Chinese text to 117M.
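The URL substitution step described for PTMiner can be illustrated with a minimal Python sketch. The marker table `ZH_TO_EN_MARKERS` and the function name `candidate_urls` are hypothetical and chosen only for illustration; the source does not list the actual prefix and suffix table, and the check that a constructed URL really exists (a network request) is left out.

```python
from typing import Dict, List

# Hypothetical language-marker table (assumption): a real system would use a
# larger predefined list of prefixes and suffixes.
ZH_TO_EN_MARKERS: Dict[str, str] = {
    "_zh": "_en",
    "-zh": "-en",
    "/zh/": "/en/",
    "chinese": "english",
}


def candidate_urls(url: str, markers: Dict[str, str] = ZH_TO_EN_MARKERS) -> List[str]:
    """Build candidate URLs by swapping each language marker found in `url`."""
    candidates = []
    for src, dst in markers.items():
        if src in url:
            # Replace the marker of one language with that of the other.
            candidates.append(url.replace(src, dst))
    return candidates


print(candidate_urls("http://example.com/index_zh.html"))
```

Each returned URL would still have to be requested to confirm that the page exists before the two pages are treated as a candidate pair.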
STRAND (Structural Translation Recognition, Acquiring Natural Data, 2003), developed by the researcher Resnik of the University of Maryland, College Park, USA, also obtains candidate bilingual websites using a search engine and predefined rules for selecting candidate sites. Unlike PTMiner, when STRAND uses URL naming similarity to find candidate page pairs within a site, it deletes predefined language-related strings from the Chinese and English URLs; if the two URLs are identical after these strings are removed, they are taken as a candidate pair of bilingual parallel pages. STRAND also examines the structural similarity of parallel pages more closely, using more features based on page structure to filter out candidate pairs that are not translations of each other. In a manual evaluation of about 400 Chinese-English page pairs, it achieved a precision of 98% and a recall of 61%. STRAND obtained about 3,500 Chinese-English parallel page pairs. BITS (Bilingual Internet Text Search, Ma and Liberman 1999) downloads all sites under a specified domain name as candidate sites and determines Chinese-English parallel page pairs with a content similarity measure, namely the proportion of the total words in a text that are mutual translations. PTI (The Parallel Text Identification System, 2004), developed by Chen Jisong et al. of Monash University, first downloads a large number of bilingual pages with a web crawler. A filename comparison module then obtains bilingual parallel page pairs from URL naming similarity, on the same principle as PTMiner. Pages left unpaired by this step are passed to a file content analysis module, which defines a similarity measure over page text to obtain further bilingual parallel page pairs. PTI obtained 193 Chinese-English parallel text pairs in total, of which 180 were correct, for a precision of 93% and a recall of 96%.
WPDE (Web Parallel Data Extraction, 2006), developed by Wu Ke et al. of Microsoft Research Asia, uses not only anchor text but also the ALT text of images when obtaining candidate websites through a search engine. When retrieving candidate bilingual parallel page pairs by URL naming similarity, it splits each URL into a pathname and a basename. Pathname pairing still uses predefined heuristic strings, with a set of matching rules applied during the search. Basename pairing does not use the predefined strings of earlier systems but an improved minimum edit distance algorithm, which experiments showed to work better. When filtering candidate page pairs, in addition to features such as text length and HTML structure, WPDE introduces a content-based feature: the quality of the sentence alignment between the texts of the candidate pages. On the same test set as PTI, WPDE achieved a precision of 97% and a recall of 94%.
With the rapid development of the networked information age, Internet resources are growing explosively. The Internet is an important source of information today and gives access to a vast amount of material, but that material is mixed with a great deal of miscellaneous data. How to extract valuable bilingual data from the mass of information on the Internet is a major problem facing data collection personnel and enterprises. Research on Web-based acquisition of large-scale bilingual corpora is therefore of great significance for solving the difficulty of obtaining bilingual corpora and for advancing related technologies and their practical use. At present, tools and methods for collecting Chinese-Indonesian bilingual parallel corpora are scarce, and those capable of automatic collection are scarcer still. A method for automatically collecting Chinese-Indonesian bilingual parallel corpora is therefore urgently needed, to free corpus collectors from tedious collection work and to provide enterprises with valuable corpus resources.
Summary of the invention
To address the deficiencies of the prior art, the invention provides a system and implementation method for automatically collecting Chinese-Indonesian bilingual parallel corpora. It establishes a Web-based automatic bilingual corpus collection system that gathers Chinese-Indonesian bilingual parallel corpora from the Internet, can automatically acquire both a text-level and a sentence-level Chinese-Indonesian bilingual corpus, and realises automatic discovery, automatic extraction and automatic sorting of Chinese-Indonesian bilingual parallel information.
The present invention is realised by the following technical solution:
A system for automatically collecting Chinese-Indonesian bilingual parallel corpora comprises an automatic discovery module, an automatic extraction module and an automatic sorting module for Chinese-Indonesian bilingual parallel information, wherein:
(1) Automatic discovery module: implements automatic discovery of Chinese-Indonesian bilingual parallel corpora. Keyword groups for the corpora to be collected are formulated, websites are searched through a search engine, web pages are collected to obtain search results, the search results are filtered and screened, and the filtered results are stored in a search result database;
(2) Automatic extraction module: implements automatic extraction of Chinese-Indonesian bilingual parallel corpora. The web pages in the search result database are accessed and Chinese-Indonesian bilingual parallel information is extracted automatically;
(3) Automatic sorting module: filters the automatically extracted Chinese-Indonesian bilingual parallel information and stores the filtered Chinese-Indonesian bilingual parallel data in a Chinese-Indonesian bilingual corpus.
The automatic discovery workflow of the automatic discovery module is: formulate one or more keyword groups of Chinese-Indonesian translation pairs, obtain search results through a search engine, analyse the search results, and take them as the targets for data collection.
The design principles of automatic discovery in the automatic discovery module are:
A. The selected keyword groups should be Chinese-Indonesian translation phrase pairs within a specific domain;
B. The third-party search engine used should be a provider of open search services;
C. After results are obtained by searching with a keyword group, only the first n pages of results are saved, where n depends on how popular the selected keywords are. The saved content comprises the URL, title and summary of each search result.
The automatic extraction workflow of the automatic extraction module is: an Internet robot accesses the target web page, the corresponding Chinese-Indonesian keyword group is used to locate content within the target page, and the page data are obtained by traversing forwards and backwards from the located anchor point.
The extraction principles of the automatic extraction module are:
A. The page file types that may be accessed are restricted to "html", "htm", "shtml" and other common page file types; pages of other types are not accessed;
B. Before a target web page is accessed, the system checks the robots.txt file of the target website; if the target page is listed in robots.txt, it is not accessed;
C. To extract complete bilingual data, HTML tags embedded in the target-language data are also treated as extraction objects during extraction.
The workflow of the automatic extraction module mainly comprises the following steps:
(1) Non-target-language filtering: character filtering is applied separately to the collected Chinese and Indonesian data, mainly removing HTML tags, web page code and non-language symbols, so that noise is removed from the collected information and clean Chinese-Indonesian bilingual parallel data are obtained;
(2) Chinese-Indonesian word segmentation: Chinese and Indonesian word segmentation tools are used to segment the Chinese and Indonesian data, providing the basis for subsequent data processing.
The workflow of the automatic sorting module mainly comprises the following steps:
(1) Length ratio and translation matching rate calculation: to filter the automatically extracted data effectively, the length ratio and the translation matching rate are calculated for each bilingual pair in the extracted Chinese-Indonesian parallel data. Pairs whose lengths differ widely are filtered out, and translation matching is then judged on the remaining pairs so that only correct parallel data are kept;
(2) The processed Chinese-Indonesian bilingual parallel data are stored in the Chinese-Indonesian bilingual corpus.
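The source does not give formulas for the length ratio or the translation matching rate, so the following Python sketch shows only one plausible reading. The function names, the token-count definition of length, the dictionary format (`lexicon` maps a lower-cased Indonesian word to a set of Chinese words) and both thresholds are assumptions made for illustration.

```python
from typing import Dict, List, Set


def length_ratio(id_tokens: List[str], zh_tokens: List[str]) -> float:
    """Shorter token count divided by longer token count, in [0, 1]."""
    if not id_tokens or not zh_tokens:
        return 0.0
    a, b = len(id_tokens), len(zh_tokens)
    return min(a, b) / max(a, b)


def match_rate(id_tokens: List[str], zh_tokens: List[str],
               lexicon: Dict[str, Set[str]]) -> float:
    """Share of Indonesian tokens with a dictionary translation in the Chinese sentence."""
    if not id_tokens:
        return 0.0
    zh_set = set(zh_tokens)
    hits = sum(1 for t in id_tokens if lexicon.get(t.lower(), set()) & zh_set)
    return hits / len(id_tokens)


def keep_pair(id_tokens: List[str], zh_tokens: List[str],
              lexicon: Dict[str, Set[str]],
              min_ratio: float = 0.5, min_match: float = 0.3) -> bool:
    """Keep a pair only if both measures reach their (assumed) thresholds."""
    return (length_ratio(id_tokens, zh_tokens) >= min_ratio
            and match_rate(id_tokens, zh_tokens, lexicon) >= min_match)


lexicon = {"saya": {"我"}, "makan": {"吃"}, "kacang": {"花生", "豆"}}
print(keep_pair(["Saya", "makan", "kacang"], ["我", "吃", "花生"], lexicon))
```

In practice the thresholds would need tuning on held-out pairs, since Chinese and Indonesian sentences of equal meaning can differ considerably in token count.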
The implementation method of the automatic Chinese-Indonesian bilingual parallel corpus collection system is as follows. A data acquisition server, a data processing server, a data storage server, an external network switch and an internal network switch are provided. The automatic discovery module is embedded in the data acquisition server, and the automatic extraction module and automatic sorting module are embedded in the data processing server. To keep the data secure, the internal and external networks are physically isolated. When the data acquisition server needs to access the Internet, it is connected to the external network switch and disconnected from the internal network switch. When the data acquisition server needs to access the internal network, it is connected to the internal network switch and disconnected from the external network switch. The external network switch handles communication on the external network, and the internal network switch handles communication on the internal network.
The data collection personnel connect the data acquisition server to the external network switch so that it can access Internet services, and use a desktop computer on the external network to set up the keyword groups for the data to be collected. Once the collection task is determined, a request to start automatic collection is sent to the data acquisition server. After receiving the keyword group data and the task start command from the desktop computer, the data acquisition server starts the automatic data discovery program. Once the search results for all keyword groups have been obtained from the Internet, they are saved locally. The data collection personnel then disconnect the data acquisition server from the external network switch and connect it to the internal network switch. They start the automatic extraction and automatic sorting programs on the data processing server, which reads the search results stored on the data acquisition server and carries out automatic extraction and automatic sorting. When the programs have finished, all the bilingual data obtained are stored on the data storage server.
In this document, "Chinese-Indonesian bilingual" refers to Chinese and Indonesian.
Chinese (Hànyǔ), also called the "Chinese language", is the mother tongue of the Han people. It is an official language of the People's Republic of China and of Singapore and an official language of the United Nations, and it has the largest number of users of any language in the world. It is used mainly in China, Singapore and Malaysia, and in overseas Chinese communities in Burma, Thailand, the United States, Canada, Australia, New Zealand, Japan and other countries. It is also a commonly used minority language in countries such as Malaysia, Burma, the United States, Canada, Australia and New Zealand.
Indonesian (Bahasa Indonesia) is a form of Malay based on the Riau dialect and is the official language of Indonesia. About 17 million to 30 million people worldwide speak Indonesian as their mother tongue, and about 140 million more use it as a second language and can read and speak it fairly fluently. Indonesian is used throughout all regions of Indonesia, and many people also use it in the Netherlands, the Philippines, Saudi Arabia, Singapore and the United States. To unify its written form with Malay, the Indonesian government introduced in 1972 the Perfected Spelling System (Ejaan Yang Disempurnakan), which takes the Malay spelling system as its standard, so that present-day Indonesian spelling is very close to Malay.
The outstanding substantive features and notable advances of the invention are:
1. The system and implementation method provided by the invention make full use of techniques for automatically discovering, automatically extracting and filtering Chinese-Indonesian bilingual parallel corpora on web pages, forming an automatic acquisition method and system for Chinese-Indonesian bilingual parallel corpora;
2. The scheme can collect valuable Chinese-Indonesian bilingual parallel corpora from the mass of information on the Internet for analysis and research, provides important basic data for Chinese-Indonesian language research and machine translation applications, solves the data source problem faced by corpus collectors and researchers, and contributes to the development of automatic bilingual corpus collection and of Chinese-Indonesian natural language processing;
3. Parallel corpora are an important type of corpus, and the construction of a Chinese-Indonesian parallel corpus is currently still a blank. The system and implementation method of the invention, covering automatic discovery, automatic extraction and automatic sorting of Chinese-Indonesian bilingual parallel information, can play a distinctive role in contrastive linguistics, translation studies, language teaching and lexicography;
4. Applying the scheme yields parallel corpora between the two languages, which addresses the scarcity of corpus resources between them and helps obtain higher-quality translation rules for building statistical machine translation systems;
5. In translation teaching, the parallel corpus of the invention can supply abundant translation examples, establish the range of possible translations and support choosing the best among them. The information in the parallel corpus can also be used to check the examples, definitions, usage rules and usage contexts given in bilingual dictionaries, teaching dictionaries and grammar books, and so to determine teaching priorities;
6. Building and acquiring a Chinese-Indonesian bilingual corpus is very difficult. Although large amounts of manpower, material and financial resources have been invested, the sources of Chinese-Indonesian bilingual corpora are concentrated in specific domains such as government reports, news and law, which are unsuitable for real-world text applications. Given that the large volume of bilingual text on the Internet is timely and has broad coverage, the system and method of the invention can collect valuable Chinese-Indonesian bilingual parallel corpora from the mass of information on the Internet for analysis and research and build a Chinese-Indonesian bilingual corpus, which is of great significance for advancing related technologies and their practical use;
7. The bilingual data collected by the system of the invention can be linked to electronic guide devices at tourist attractions, museums, science exhibition halls and similar venues, so that scenic spots and exhibits are presented with text and pictures in bilingual form. Visitors can watch and listen at the same time, take in knowledge, understand the meaning and enjoy the culture, gaining a fuller appreciation of the cultural depth of what they view, while side-by-side bilingual browsing brings out the rich meaning of the sights and exhibits.
Brief description of the drawings
Fig. 1 is a system structure diagram of the system and implementation method of the invention for automatically collecting Chinese-Indonesian bilingual parallel corpora;
Fig. 2 is a flow chart of the method of the invention for automatically collecting Chinese-Indonesian bilingual parallel corpora;
Fig. 3 is a flow chart of the method of the invention for filtering Chinese-Indonesian bilingual parallel corpora;
Fig. 4 is a block diagram of an example structure of a personal computer serving as the information processing device used in embodiments of the invention;
Fig. 5 is a network topology diagram of the system of the invention.
Embodiments
Specific implementations of embodiments of the invention are given in the description below, in which preferred embodiments are described in sufficient detail to disclose the embodiments of the invention, without limiting them.
As shown in Fig. 1, a system for automatically collecting Chinese-Indonesian bilingual parallel corpora comprises automatic discovery, automatic extraction and automatic sorting of Chinese-Indonesian bilingual parallel information. First comes the automatic discovery process: keyword groups for the corpora to be collected are formulated, websites are searched through a search engine, web pages are collected to obtain search results, the search results are filtered and screened, and the filtered results are stored in a search result database. Next comes the automatic extraction process: the web pages in the search result database are accessed and Chinese-Indonesian bilingual parallel information is extracted automatically. Last comes the automatic sorting process: the automatically extracted Chinese-Indonesian bilingual parallel information is filtered, and the filtered Chinese-Indonesian bilingual parallel data are stored in a Chinese-Indonesian bilingual corpus.
As shown in Fig. 2, the method of the invention for automatically collecting Chinese-Indonesian bilingual parallel corpora comprises the following steps:
Automatic discovery of Chinese-Indonesian bilingual parallel corpora on web pages:
First, the keyword groups for the corpora to be collected are formulated. A keyword group here is a Chinese-Indonesian translation phrase pair, for example "peanut kacang" (the Chinese word for "peanut" paired with the Indonesian word "kacang"). Starting from the keyword groups, relevant search results are obtained through a search engine.
The search results are then filtered. Filtering and screening the search result information improves collection efficiency and quality and reduces collection cost. The specific practice is as follows:
Duplicate search results are identified by comparing URL, title and summary. Any result judged to be a duplicate is filtered out.
The file type of each web page is determined from its URL, and URLs that do not belong to a common web page file type are removed. Only URLs of common web page file types such as "html", "htm", "shtml", "jsp" and "php" are kept.
Results are filtered by analysing the keyword group and the summary. The summary information is located by keyword and filtered by the length ratio of Indonesian to Chinese information, removing results that contain only a single translation phrase pair.
Finally, the filtered search results (comprising keyword group, URL, title and summary) are stored in the search result database.
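The three filtering practices above (duplicate removal, file type check, summary length ratio) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `SearchResult` record, the function names, the ratio bounds `low` and `high`, the rule that a summary with fewer than two Latin-script words counts as a single phrase pair, and the decision to drop URLs without a file extension are all hypothetical and not specified by the source.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List
from urllib.parse import urlparse

ALLOWED_EXT = {"html", "htm", "shtml", "jsp", "php"}  # types named in the text
CJK_CHAR = re.compile(r"[\u4e00-\u9fff]")
LATIN_WORD = re.compile(r"[A-Za-z]+")


@dataclass(frozen=True)
class SearchResult:
    keyword_pair: str
    url: str
    title: str
    summary: str


def has_allowed_type(url: str) -> bool:
    """True if the URL path ends in a common web page file extension."""
    name = urlparse(url).path.rsplit("/", 1)[-1]
    # Assumption: URLs without any file extension are dropped.
    return "." in name and name.rsplit(".", 1)[-1].lower() in ALLOWED_EXT


def summary_is_balanced(summary: str, low: float = 0.2, high: float = 5.0) -> bool:
    """Check the ratio of Latin-script words to Chinese characters in a summary."""
    zh_chars = len(CJK_CHAR.findall(summary))
    id_words = len(LATIN_WORD.findall(summary))
    if zh_chars == 0 or id_words < 2:  # only the keyword pair itself, or one language
        return False
    return low <= id_words / zh_chars <= high


def filter_results(results: Iterable[SearchResult]) -> List[SearchResult]:
    seen = set()
    kept = []
    for r in results:
        key = (r.url, r.title, r.summary)  # duplicates share URL, title and summary
        if key in seen:
            continue
        seen.add(key)
        if has_allowed_type(r.url) and summary_is_balanced(r.summary):
            kept.append(r)
    return kept
```

The kept records would then be written to the search result database together with their keyword group.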
Automatic extraction of Chinese-Indonesian bilingual parallel corpora from web pages:
The web pages in the search result database are accessed and bilingual information is extracted automatically. The implementation is as follows:
First, the queue of newly added URLs to be visited is obtained from the search result database, and one URL is taken from the queue. The system checks whether the target website has a robots.txt file and whether the target URL is listed in it. If access to the URL is not allowed, the system skips it and takes the next URL from the queue. If access is allowed, the system accesses the URL and parses its web page.
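The robots.txt check can be sketched with the `urllib.robotparser` module of the Python standard library. The helper names `build_parser` and `allowed_urls` and the sample rules are illustrative assumptions; a live crawler would normally call `set_url()` and `read()` on the parser to fetch the file from the target site rather than parse a string.

```python
from typing import Iterable, List
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(urls: Iterable[str], parser: RobotFileParser,
                 user_agent: str = "*") -> List[str]:
    """Keep only the URLs that the robots.txt rules allow this agent to fetch."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


# Hypothetical rules and queue, for illustration only.
rules = "User-agent: *\nDisallow: /private/\n"
queue = ["http://example.com/a.html", "http://example.com/private/b.html"]
print(allowed_urls(queue, build_parser(rules)))
```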
By parsing the web page, the system begins to extract the Chinese-Indonesian bilingual parallel data of the page automatically. The specific steps are as follows:
1. Coarse extraction of Chinese-Indonesian bilingual data:
(1) The content of the whole page is read into a character string S.
(2) S is split into two character strings, s1 and s2. s1 holds all the Indonesian data in S, and s2 holds all the Chinese data in S.
(3) All saved Chinese and Indonesian data must keep their original order on the page. The HTML tags and language information characters lying between items of Indonesian data and between items of Chinese data are also retained, including punctuation, digits and special symbols (interspersed English text is set aside for now).
2.HTML label is replaced:
A spaced markings <T> is replaced with by unified for all html tags in s1, s2.
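A rough sketch of steps 1 and 2 follows. It simplifies the described order: tags are replaced with the separator first, and each text segment between separators is then assigned as a whole to s1 or s2. The classification rule is an assumption (any segment containing a CJK character counts as Chinese, any other segment containing a Latin letter counts as Indonesian), and the function name is hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")           # any html tag
CJK_RE = re.compile(r"[\u4e00-\u9fff]")   # common Chinese characters
LATIN_RE = re.compile(r"[A-Za-z]")        # Indonesian is written in Latin script
SEPARATOR = "<T>"


def rough_split(page):
    """Return (s1, s2): Indonesian and Chinese text, segments joined by <T>."""
    marked = TAG_RE.sub(SEPARATOR, page)
    s1_parts, s2_parts = [], []
    for segment in marked.split(SEPARATOR):
        segment = segment.strip()
        if not segment:
            continue
        if CJK_RE.search(segment):
            s2_parts.append(segment)   # page order is preserved
        elif LATIN_RE.search(segment):
            s1_parts.append(segment)
    return SEPARATOR.join(s1_parts), SEPARATOR.join(s2_parts)
```

This rule cannot tell Indonesian from English; the text itself sets mixed-in English aside, so a language identification step would be needed in practice.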
3. Extraction of Chinese-Indonesian bilingual parallel data:
(1) Indonesian sentence splitting is applied to s1, giving the string array st1[m]. Chinese sentence splitting is applied to s2, giving the string array st2[n]. Here m and n are the total number of Indonesian sentences and Chinese sentences, respectively.
(2) All separator marks <T> inside st1[m] and st2[n] are removed.
(3) Indonesian word segmentation is applied to every string in st1[m]. Chinese word segmentation is applied to every string in st2[n].
(4) Sentences in st1[m] and st2[n] that consist of only a single word are filtered out.
(5) Automatic matching of bilingual parallel sentence pairs:
a. Take a segmented Indonesian sentence s_th from st1[m].
b. Use the Chinese-Indonesian translation dictionary to translate each Indonesian phrase in s_th into Chinese, giving the sentence s_th_ch.
c. Take a Chinese phrase from s_th_ch and find all sentences in st2[n] that contain it, giving st2[n']. If no sentence in st2[n] contains this Chinese phrase, take the next Chinese phrase from s_th_ch and search st2[n] again. If n'>1, take the next word from s_th_ch and find all sentences in st2[n'] that contain it. Repeat this step until n'=1 or all words in s_th_ch have been traversed. If n'=1, that is, st2[n'] contains exactly one sentence, that sentence is taken as the best Chinese parallel sentence s_ch for this s_th. If all words in s_th_ch have been traversed and n'>1, the sentence in st2[n'] with the shortest string length is taken as the best Chinese parallel sentence s_ch for this s_th.
d. Save s_th and s_ch as a Chinese-Indonesian bilingual parallel sentence pair, and remove s_th and s_ch from st1[m] and st2[n] respectively.
e. If no corresponding s_ch is found for the s_th taken out, take the next segmented Indonesian sentence from st1[m] and repeat the above steps until st1[m] has been fully traversed.
f. After st1[m] has been traversed, if m>1 and n>1, there may still be unmatched Chinese-Indonesian parallel sentence pairs; in that case, following the above steps in the reverse direction, search st1[m] for the best Indonesian parallel sentence for each sentence in st2[n].
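Steps a to e of the matching procedure can be sketched as below. This is an illustration under stated assumptions rather than the patented code: sentences are lists of already segmented words, the dictionary is a plain Python dict with one Chinese translation per Indonesian word, words missing from the dictionary are skipped, and when a later word would narrow the candidate set to nothing the current candidates are kept (the text does not specify this case). The function names are hypothetical, and the reverse pass of step f is not shown.

```python
def find_best_chinese(s_th, st2, dictionary):
    """Return the best Chinese sentence for s_th, or None if nothing matches."""
    # Step b: word-by-word translation of the Indonesian sentence.
    s_th_ch = [dictionary[word] for word in s_th if word in dictionary]
    candidates = None  # st2[n'] has not been formed yet
    for word in s_th_ch:
        pool = st2 if candidates is None else candidates
        narrowed = [sentence for sentence in pool if word in sentence]
        if not narrowed:
            continue  # no sentence contains this word; try the next one
        candidates = narrowed
        if len(candidates) == 1:
            return candidates[0]  # n' = 1: unique best match
    if candidates:
        # Words exhausted with n' > 1: take the shortest sentence by characters.
        return min(candidates, key=lambda sentence: len("".join(sentence)))
    return None


def match_pairs(st1, st2, dictionary):
    """Steps a, d and e: return (pairs, unmatched Indonesian, unmatched Chinese)."""
    remaining_chinese = list(st2)
    pairs, unmatched_indonesian = [], []
    for s_th in st1:
        s_ch = find_best_chinese(s_th, remaining_chinese, dictionary)
        if s_ch is None:
            unmatched_indonesian.append(s_th)
            continue
        pairs.append((s_th, s_ch))
        remaining_chinese.remove(s_ch)  # a matched sentence is not reused
    return pairs, unmatched_indonesian, remaining_chinese
```

Because each word only narrows the candidate set, the outcome depends on word order and dictionary coverage, which is why the later matching-rate filter is still needed.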
The next URL address to be visited is taken from the URL address queue and the above steps are repeated until the Chinese-Indonesian bilingual parallel corpus data of all URL addresses to be visited have been extracted. All automatically extracted Chinese-Indonesian parallel sentence pairs form the queue of Chinese-Indonesian bilingual parallel data to be filtered.
Chinese-Indonesian bilingual parallel data filtering: data filtering is applied to the automatically extracted Chinese-Indonesian bilingual parallel information. This substantially improves the quality of the collected information.
As shown in Figure 3, the method for filtering Chinese-Indonesian bilingual parallel corpora from web pages comprises the following:
Information denoising: to keep the data clean, non-language information in the collected data, including html tags and non-language characters, is filtered out again.
Comparison filtering of collected information: the denoised Chinese-Indonesian bilingual parallel information is filtered. The following operations are applied to each group of Chinese-Indonesian bilingual parallel information:
First, length-ratio filtering is applied. Word segmentation is applied to the Indonesian information and the Chinese information respectively. Let the number of Indonesian phrases be a and the number of Chinese phrases be b. A minimum length ratio μ and a maximum length ratio λ are set. If a/b>λ or b/a>λ or a/b<μ or b/a<μ, the group is regarded as Chinese-Indonesian bilingual parallel data of no value and is filtered out.
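The length-ratio test can be written directly from the condition above. In this sketch the function name is hypothetical, the threshold values in the usage note are illustrative only (the text gives none), and a pair with an empty side is assumed to be rejected.

```python
def passes_length_ratio(a, b, mu, lam):
    """Return True if a pair with a Indonesian and b Chinese phrases is kept.

    mu is the minimum length ratio and lam the maximum length ratio.
    """
    if a == 0 or b == 0:
        return False  # assumption: an empty side is never a valid pair
    for ratio in (a / b, b / a):
        if ratio > lam or ratio < mu:
            return False
    return True
```

For example, with the hypothetical thresholds mu = 0.5 and lam = 2.0, a pair with 3 Indonesian and 5 Chinese phrases is kept, while a pair with 1 and 3 phrases is filtered out.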
Then, matching-rate filtering is applied to the Chinese-Indonesian bilingual parallel information that meets the length-ratio requirement. Suppose the segmented Chinese information contains M phrases, from which m phrases are extracted and translated with the Chinese-Indonesian dictionary into m corresponding Indonesian phrases. Of these m Indonesian phrases, n match phrases in the segmented Indonesian information exactly. Then p(cn|th)=m2/(n*M), where p(cn|th) is regarded as the matching rate of the Chinese information against the corresponding Indonesian information. Likewise, p(th|cn) is the matching rate of the Indonesian information against the corresponding Chinese information. The matching rate of one group of collected bilingual parallel information is defined as p=(p(th|cn)+p(cn|th))/2. The matching rate is computed for each group of bilingual parallel information and a minimum matching rate ρ is set; when p<ρ, the group is filtered out.
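The counting and thresholding in this step can be sketched as follows. The sketch deliberately takes the two directional rates p(th|cn) and p(cn|th) as inputs instead of re-deriving them, and only shows how M, m and n might be counted and how the averaged rate is compared with ρ. The function names are hypothetical, and choosing the m phrases as those that have a dictionary entry is an assumption; the text does not say how they are selected.

```python
def match_counts(cn_tokens, id_tokens, cn_to_id):
    """Return (M, m, n) for one direction of the matching-rate calculation."""
    M = len(cn_tokens)
    # Assumption: the m extracted phrases are those found in the dictionary.
    translated = [cn_to_id[token] for token in cn_tokens if token in cn_to_id]
    m = len(translated)
    id_set = set(id_tokens)
    n = sum(1 for word in translated if word in id_set)  # exact matches only
    return M, m, n


def combined_matching_rate(p_th_cn, p_cn_th):
    """p = (p(th|cn) + p(cn|th)) / 2, as defined in the text."""
    return (p_th_cn + p_cn_th) / 2


def keep_pair(p_th_cn, p_cn_th, rho):
    """A group is filtered out when p < rho, so it is kept when p >= rho."""
    return combined_matching_rate(p_th_cn, p_cn_th) >= rho
```

Averaging the two directions means a pair that matches well in only one direction can still fall below ρ.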
Finally, the collected Chinese-Indonesian bilingual parallel information is checked for duplicates against the Chinese-Indonesian bilingual corpus. The processed Chinese-Indonesian bilingual parallel data are then stored in the Chinese-Indonesian bilingual corpus.
Application Example 1:
As shown in Figure 4, the CPU, ROM and RAM are connected to one another via a bus. An input/output interface is also connected to the bus. An input system, an output system, a storage system, a communication system and a drive system are connected to the input/output interface. The input system includes a keyboard, a mouse and the like; the output system includes a display, a loudspeaker and the like; the storage system includes a hard disk and the like; the communication system includes a network interface unit such as a LAN card or a modem, and performs communication over a network such as the Internet. The drive system is connected to the input/output interface as required. A removable storage medium such as a magnetic disk, an optical disc, a magneto-optical disk or a USB flash drive is attached to the drive system as required, and the computer program stored on it is read from it as required.
The central processing unit (CPU) performs various processes according to a program stored in the read-only memory (ROM) or a program loaded from the storage system into the random access memory (RAM). The RAM also stores, as required, the data the CPU needs when performing these processes.
The instruction code of the present invention can be read from the above media and executed.
Application Example 2:
As shown in Figure 5, the network topology is briefly described as follows:
Node description
Switch X: external network switch
Switch Y: internal network switch
Server A: data acquisition server (hosts the automatic discovery module)
Server B: data processing server (hosts the automatic extraction module and the automatic arrangement module)
Server C: data storage server
Network communication
To keep data secure, the internal and external networks are physically isolated, and the server connects to the Internet through a firewall and a router. When server A needs to access the Internet, it is connected to switch X and disconnected from switch Y. When server A needs to access the intranet, it is connected to switch Y and disconnected from switch X.
Switch X handles communication on the external network.
Switch Y handles communication on the intranet.
The working process is as follows:
(1) The data acquisition personnel connect server A to switch X so that server A can access the Internet. Using a desktop computer on the external network, they arrange the keyword groups for the data to be collected. Once the acquisition task has been defined, they send a request to server A to start automatic collection.
(2) After server A receives the keyword group data and the task start command sent from the desktop computer, it starts running the automatic data discovery program. Once the search results for all keyword groups have been obtained from the Internet, they are saved locally.
(3) The data acquisition personnel disconnect server A from switch X and connect it to switch Y. They then start the automatic data extraction and automatic arrangement programs on server B. Server B reads the search results stored on server A and performs automatic extraction and automatic arrangement. When the programs have finished, all the bilingual data obtained are stored on server C.
Application Example 3:
Bandung is the third-largest city in Indonesia. It has beautiful scenery, is quiet and elegant, and enjoys spring-like weather all year round; it is known as the most beautiful city in Indonesia and is a famous tourist destination that many Chinese visitors travel to every year. Because local guides are limited in number, Chinese-speaking guides in particular, it is difficult to provide every visitor with a consistent, standardized guide service. Electronic guide systems are therefore installed at some of the better-known scenic spots. Using the relevant bilingual data collected by the system of the present invention, the scenic spots and exhibits can be presented with both pictures and text in Chinese-Indonesian bilingual form, so that visitors can read while they listen, take in knowledge, understand the meaning behind what they see, and enjoy the culture. Visitors can fully appreciate the cultural background of what they are viewing, and browsing the two languages side by side brings out the rich meaning of the scenic spots and exhibits.
Application Example 4:
Jakarta, also known as Coconut City, is the capital and largest city of Indonesia. In indoor venues in the city such as museums, science and technology centers, and convention and exhibition centers, electronic guide systems replace human guides and the loudspeakers whose noise disturbs other visitors. Such a system is networked with the system of the present invention and presents the content of the exhibits to visitors with both pictures and text in Chinese-Indonesian bilingual form, making the meaning of the exhibits fuller and more vivid as visitors view them. Having appreciated the outward appearance of an exhibit, visitors also gain a wealth of knowledge about it. In addition, visitors can use the buttons on the touch screen to look up the location of and route to each exhibition area and enjoy a self-guided visit.
The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited to them. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims (8)

1. A system for automatically collecting Chinese-Indonesian bilingual parallel corpora, comprising an automatic discovery module, an automatic extraction module and an automatic arrangement module for Chinese-Indonesian bilingual parallel information, characterized in that:
(1) Automatic discovery module: implements automatic discovery of Chinese-Indonesian bilingual parallel corpora; keyword groups for the corpora to be collected are formulated, websites are searched through a search engine, web pages are collected to obtain search results, the search result information is filtered and screened, and the filtered search results are stored in the search results database;
(2) Automatic extraction module: implements automatic extraction of Chinese-Indonesian bilingual parallel corpora; Chinese-Indonesian bilingual parallel information is extracted automatically by visiting the web pages recorded in the search results database;
(3) Automatic arrangement module: applies data filtering to the automatically extracted Chinese-Indonesian bilingual parallel information, and stores the filtered Chinese-Indonesian bilingual parallel data in the Chinese-Indonesian bilingual corpus.
2. The system for automatically collecting Chinese-Indonesian bilingual parallel corpora according to claim 1, characterized in that the workflow by which the automatic discovery module automatically discovers Chinese-Indonesian bilingual parallel corpora is: formulate one or more groups of mutually translated Chinese-Indonesian keyword groups, obtain search results through a search engine, analyze the search results, and use them as targets for data collection.
3. The system for automatically collecting Chinese-Indonesian bilingual parallel corpora according to claim 1, characterized in that the design principles by which the automatic discovery module automatically discovers Chinese-Indonesian bilingual parallel corpora are:
a. the selected keyword groups should be mutually translated Chinese-Indonesian phrase pairs within a specific domain;
b. the third-party search engine used is a publicly available search service provider;
c. after results are obtained by searching with a keyword group, only the first n pages of information are saved, where n depends on the popularity of the selected keywords, and the saved content includes the search result URL address, the search result title and the search result summary.
4. The system for automatically collecting Chinese-Indonesian bilingual parallel corpora according to claim 1, characterized in that the workflow by which the automatic extraction module automatically extracts bilingual parallel corpora is: use a web robot to access the target web page, use the corresponding mutually translated Chinese-Indonesian keyword group to locate content in the target page, and traverse forward and backward from the located position to obtain the page data.
5. The system for automatically collecting Chinese-Indonesian bilingual parallel corpora according to claim 1, characterized in that the principles by which the automatic extraction module extracts bilingual parallel corpora from the web are:
a. the page file types that may be accessed are restricted to "html", "htm", "shtml" and other common page file types; pages of types not so specified are not accessed;
b. before the target web page is accessed, the robots.txt file of the target website is checked; if the target page is listed in the robots.txt file, the target web page is not accessed;
c. to extract complete bilingual data, html tags contained in the target language data are also treated as extraction objects during extraction.
6. The system for automatically collecting Chinese-Indonesian bilingual parallel corpora according to claim 1, characterized in that the workflow of the automatic extraction module mainly comprises the following steps:
(1) Filtering of non-target-language information: character filtering is applied to the collected Chinese and Indonesian data respectively, mainly filtering out html tags, web page code and certain non-language symbols, so that noise in the collected information is removed and clean Chinese-Indonesian bilingual parallel data are obtained;
(2) Chinese and Indonesian word segmentation: Chinese and Indonesian word segmentation tools are used to segment the Chinese and Indonesian data, providing the basis for the subsequent data processing.
7. The system for automatically collecting Chinese-Indonesian bilingual parallel corpora according to claim 1, characterized in that the workflow of the automatic arrangement module mainly comprises the following steps:
(1) Length ratio and translation matching rate calculation: to filter the automatically extracted data effectively, the length ratio and the translation matching rate are computed for each group of bilingual data in the extracted Chinese-Indonesian bilingual parallel data; data whose lengths differ greatly are filtered out, and translation matching is judged for the Chinese-Indonesian bilingual parallel data so that correct parallel data are selected;
(2) the processed Chinese-Indonesian bilingual parallel data are stored in the Chinese-Indonesian bilingual corpus.
8. The system for automatically collecting Chinese-Indonesian bilingual parallel corpora according to claim 1, characterized in that the implementation method of the system is: a data acquisition server, a data processing server, a data storage server, an external network switch and an internal network switch are provided; the automatic discovery module is installed on the data acquisition server, and the automatic extraction module and the automatic arrangement module are installed on the data processing server;
the data acquisition personnel connect the data acquisition server to the external network switch so that the data acquisition server can access the Internet; using a desktop computer on the external network, they arrange the keyword groups for the data to be collected; once the acquisition task has been defined, they send a request to the data acquisition server to start automatic collection; after the data acquisition server receives the keyword group data and the task start command sent from the desktop computer, it starts running the automatic data discovery program; once the search results for all keyword groups have been obtained from the Internet, they are saved locally; the data acquisition personnel then disconnect the data acquisition server from the external network switch and connect it to the internal network switch;
the data acquisition personnel start the automatic data extraction and automatic arrangement programs on the data processing server; the data processing server reads the search results stored on the data acquisition server and performs automatic extraction and automatic arrangement; when the programs have finished, all the bilingual data obtained are stored on the data storage server.
CN201510407512.4A 2015-07-13 2015-07-13 System for automatically collecting Hanyu and Bahasa Indonesia bilingualism parallel texts, and implementation method Pending CN105045861A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510407512.4A CN105045861A (en) 2015-07-13 2015-07-13 System for automatically collecting Hanyu and Bahasa Indonesia bilingualism parallel texts, and implementation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510407512.4A CN105045861A (en) 2015-07-13 2015-07-13 System for automatically collecting Hanyu and Bahasa Indonesia bilingualism parallel texts, and implementation method

Publications (1)

Publication Number Publication Date
CN105045861A true CN105045861A (en) 2015-11-11

Family

ID=54452408

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510407512.4A Pending CN105045861A (en) 2015-07-13 2015-07-13 System for automatically collecting Hanyu and Bahasa Indonesia bilingualism parallel texts, and implementation method

Country Status (1)

Country Link
CN (1) CN105045861A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1707476A (en) * 2005-05-06 2005-12-14 贺方升 Auxiliary translation searching engine system and method thereof
CN102043808A (en) * 2009-10-14 2011-05-04 腾讯科技(深圳)有限公司 Method and equipment for extracting bilingual terms using webpage structure
CN102930031A (en) * 2012-11-08 2013-02-13 哈尔滨工业大学 Method and system for extracting bilingual parallel text in web pages
CN103020043A (en) * 2012-11-16 2013-04-03 哈尔滨工业大学 Distributed acquisition system facing web bilingual parallel corpora resources
CN103885939A (en) * 2012-12-19 2014-06-25 新疆信息产业有限责任公司 Uyghur-Chinese bi-directional translation memory system construction method
CN104408078A (en) * 2014-11-07 2015-03-11 北京第二外国语学院 Construction method for key word-based Chinese-English bilingual parallel corpora

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
冯艳卉: ""基于Web的大规模平行语料库构建方法研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
林政: ""Web双语平行语料自动获取及其在统计机器翻译中的应用"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151111
