Summary of the invention
The technical problem to be solved by the present invention is to provide a method and a system for extracting bilingual parallel corpora from webpages, so as to overcome the low efficiency and insufficient scale of existing corpus collection. To this end, the invention provides a method and a system for extracting bilingual parallel text from webpages.
The system for extracting bilingual parallel text from webpages according to the present invention comprises:
A web database, for storing webpages crawled at random on a large scale together with their attributes; and also for applying character-based hashing to the URL of each webpage and storing all processed webpages classified by the similarity of their domain names. Storing all webpages classified by the similarity of their domain names means: a hash value is calculated for the main domain name and for each subdomain name in the domain name of each webpage; all webpages whose main domain names have the same hash value are stored in one major class; within that major class, all webpages whose next-level subdomain names have the same hash value are placed in one subclass; and so on, until all webpages are classified and stored;
A text information extraction module, for extracting the tag string of each webpage and the body content of that webpage, recording the encoding type and text length of the tag string and of the body content, and storing them in the web database;
A webpage type discrimination module, for judging the language category of the body content of every webpage in the web database: if the body content contains texts in two languages of comparable scale, the webpage is judged to be a mixed webpage; otherwise the webpage is judged to be a monolingual webpage;
A mixed webpage processing module, for judging whether the bilingual texts in a mixed webpage are translations of each other and, when they are judged to be mutual translations, organizing the bilingual texts of that webpage into bilingual parallel text format and saving them to the bilingual corpus;
A monolingual webpage processing module, for traversing and processing each monolingual webpage in the web database that is not yet marked as matched. Each monolingual webpage is processed as follows: a mutual-translation judgement is made between the body content of this webpage and the body content of other monolingual webpages in the web database that are not yet marked as matched, where the other webpages are selected with priority given to monolingual webpages in the same subclass; the body contents of two monolingual webpages judged to be mutual translations are organized into bilingual parallel text and saved to the bilingual corpus, and both webpages are marked as matched.
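The domain-based classification performed by the web database can be illustrated with a minimal Python sketch. It is not the implementation of the invention: the use of MD5 as the character-based hash, the rule that the last two labels of the host name form the main domain name, and the names label_hash, domain_levels, class_key and classify are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def label_hash(text: str) -> str:
    """Character-based hash of a domain name part (MD5 is an assumed choice)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:8]


def domain_levels(url: str, main_labels: int = 2) -> list[str]:
    """Return the main domain name followed by each subdomain, nearest first.

    Assumption: the last `main_labels` labels form the main domain name.
    This is wrong for suffixes such as .co.uk, where a public-suffix list
    would be needed.
    """
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    main = ".".join(labels[-main_labels:])
    subdomains = labels[:-main_labels][::-1]  # next-level subdomain first
    return [main] + subdomains


def class_key(url: str) -> tuple[str, ...]:
    """Hash values of the main domain name and of each subdomain name."""
    return tuple(label_hash(level) for level in domain_levels(url))


def classify(urls):
    """Group URLs into major classes (main domain) and subclasses (subdomains)."""
    classes = defaultdict(lambda: defaultdict(list))
    for url in urls:
        key = class_key(url)
        classes[key[0]][key[1:]].append(url)
    return classes
```

With this layout, pages such as http://en.example.com/a and http://zh.example.com/a fall into the same major class but different subclasses, while two pages under en.example.com share a subclass.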
The method for extracting bilingual parallel text from webpages according to the present invention comprises the following steps:
A step of storing webpages crawled at random on a large scale, together with their attributes, in a web database;
A step of applying character-based hashing to the URLs of the stored webpages and storing all processed webpages classified by the similarity of their domain names, which specifically comprises: a step of calculating a hash value for the main domain name and for each subdomain name in the domain name of each webpage; a step of storing all webpages whose main domain names have the same hash value in one major class; a step of placing, within that major class, all webpages whose next-level subdomain names have the same hash value in one subclass; and so on, a step of classifying and storing all webpages;
A step of extracting the tag string of each webpage;
A step of extracting the body content of that webpage; and a step of recording the encoding type and text length of the extracted tag string and of the corresponding body content, and storing them in the web database;
A step of judging the language category of the body content of every webpage in the web database, which further comprises: a step of judging the webpage to be a mixed webpage when the body content is found to contain texts in two languages of comparable scale, and otherwise judging the webpage to be a monolingual webpage;
A step of judging whether the bilingual texts in a mixed webpage are translations of each other, which further comprises: a step of organizing the bilingual texts of that webpage into bilingual parallel text format and saving them to the bilingual corpus when they are judged to be mutual translations;
A step of traversing and processing each monolingual webpage in the web database that is not yet marked as matched, where the processing of each monolingual webpage comprises: a step of making a mutual-translation judgement between the body content of this webpage and the body content of other monolingual webpages in the web database that are not yet marked as matched, in which the other webpages are selected with priority given to monolingual webpages in the same subclass; and a step of organizing the body contents of two monolingual webpages judged to be mutual translations into bilingual parallel text, saving it to the bilingual corpus, and marking both webpages as matched.
The text length of the above body content is calculated from the number of characters in the body content.
The present invention overcomes a technical prejudice in the prior art by taking the Internet as the source from which corpus material is obtained. The resulting technical effects are:
1. Because the Internet contains a large number of bilingual parallel texts, extracting bilingual parallel texts from the Internet and compiling them into a bilingual corpus yields a large amount of information covering a rich variety of languages.
2. Because information on the Internet is constantly updated, a bilingual corpus obtained from the Internet can likewise be continuously updated and keep growing.
Obtaining a bilingual corpus with the present invention can greatly improve the efficiency of corpus collection, and can also solve the problem of insufficient corpus scale when material comes from a particular source.
Detailed description of the embodiments
Embodiment one: the system for extracting bilingual parallel text from webpages described in this embodiment comprises:
A web database, for storing webpages crawled at random on a large scale together with their attributes; and also for applying character-based hashing to the URL of each webpage and storing all processed webpages classified by the similarity of their domain names. Storing all webpages classified by the similarity of their domain names means: a hash value is calculated for the main domain name and for each subdomain name in the domain name of each webpage; all webpages whose main domain names have the same hash value are stored in one major class; within that major class, all webpages whose next-level subdomain names have the same hash value are placed in one subclass; and so on, until all webpages are classified and stored;
A text information extraction module, for extracting the tag string of each webpage and the body content of that webpage, recording the encoding type and text length of the tag string and of the body content, and storing them in the web database;
A webpage type discrimination module, for judging the language category of the body content of every webpage in the web database: if the body content contains texts in two languages of comparable scale, the webpage is judged to be a mixed webpage; otherwise the webpage is judged to be a monolingual webpage;
A mixed webpage processing module, for judging whether the bilingual texts in a mixed webpage are translations of each other and, when they are judged to be mutual translations, organizing the bilingual texts of that webpage into bilingual parallel text format and saving them to the bilingual corpus;
A monolingual webpage processing module, for traversing and processing each monolingual webpage in the web database that is not yet marked as matched. Each monolingual webpage is processed as follows: a mutual-translation judgement is made between the body content of this webpage and the body content of other monolingual webpages in the web database that are not yet marked as matched, where the other webpages are selected with priority given to monolingual webpages in the same subclass; the body contents of two monolingual webpages judged to be mutual translations are organized into bilingual parallel text and saved to the bilingual corpus, and both webpages are marked as matched.
The text length of the body content is calculated from the number of characters in the body content.
Embodiment two: this embodiment further describes the webpage attributes in the system for extracting bilingual parallel text from webpages described in embodiment one. In this embodiment, the webpage attributes include the URL address of the webpage and the time at which it was crawled.
Embodiment three: this embodiment further limits the text information extraction module of the system for extracting bilingual parallel text from webpages described in embodiment one. The text information extraction module is also used for checking the extracted tag string of the webpage; when the tag string is <html>, <body>, <td>, <p>, <span> or <div>, it continues to extract the text information in the webpage.
In this embodiment, a tag-string check is added to the text information extraction module, so that webpage text is extracted selectively. Because text under the tags listed above is more likely to belong to the body content, only the content enclosed by these tags is extracted, which reduces the amount of data to be processed and increases the probability that the extracted information is usable.
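The selective extraction described in this embodiment can be sketched with Python's standard html.parser module. This is only an illustration under stated assumptions: the rule that a piece of text is kept when its innermost enclosing tag is one of the six listed tags, the VOID_TAGS set, and the names BodyTextExtractor and extract_body_text are not taken from the embodiment.

```python
from html.parser import HTMLParser

# Tags named in this embodiment as likely to hold body content.
BODY_TAGS = {"html", "body", "td", "p", "span", "div"}
# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class BodyTextExtractor(HTMLParser):
    """Collect text whose innermost enclosing tag is one of BODY_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # currently open tags, innermost last
        self.chunks = []  # extracted pieces of body text

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in BODY_TAGS:
            self.chunks.append(text)


def extract_body_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Text inside tags such as <script>, <title> or <a> is skipped in this sketch, which is one way of reducing the data passed to the later language and translation checks.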
Embodiment four: this embodiment further limits the text information extraction module of the system for extracting bilingual parallel text from webpages described in embodiment one. The text information extraction module is also used for checking the length of the body content after extracting it. When the length is greater than a threshold chosen from the range of 30 to 80 characters, the module continues to record the corresponding information; otherwise it records the URL of the webpage and deletes the webpage from the web database.
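The length check can be sketched as follows. The in-memory PageStore class, its field names, and the threshold value of 50 characters (one arbitrary value inside the 30 to 80 character range) are assumptions for illustration only; the embodiment does not prescribe a storage layout.

```python
from dataclasses import dataclass, field

MIN_BODY_LENGTH = 50  # assumed value inside the 30 to 80 character range


@dataclass
class PageStore:
    """Toy stand-in for the web database (hypothetical structure)."""
    pages: dict[str, str] = field(default_factory=dict)      # URL -> body content
    discarded_urls: list[str] = field(default_factory=list)  # URLs of deleted pages

    def filter_short_pages(self, min_length: int = MIN_BODY_LENGTH) -> None:
        """Keep pages whose body is longer than min_length; record and delete the rest."""
        for url in list(self.pages):
            if len(self.pages[url]) > min_length:  # length counted in characters
                continue
            self.discarded_urls.append(url)
            del self.pages[url]
```

Recording the URL of each deleted page keeps a trace of what was removed, as the embodiment requires.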
Embodiment five: this embodiment further describes the mutual-translation judgement method in the system for extracting bilingual parallel text from webpages described in embodiment one. The mutual-translation judgement method is: a dictionary is used to traverse the bilingual text and obtain the words in it that translate each other; these words are taken as anchor points, and it is judged whether their positions in the two texts match; if the matching rate is greater than a set value, where the set value ranges from 0.3 to 0.7, the bilingual text is judged to be mutual-translation text.
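One possible reading of the anchor-point judgement is sketched below. The embodiment does not define how positions are compared or how the matching rate is normalized, so the following are assumptions: texts are already tokenized, an anchor is a source word that has a dictionary entry, two positions match when their relative positions in the two texts differ by at most tolerance, and the matching rate is the share of anchors with a matching position. The names matching_rate and is_mutual_translation are hypothetical.

```python
def matching_rate(src_tokens, tgt_tokens, dictionary, tolerance=0.3):
    """Share of dictionary anchors whose translation appears at a similar position.

    dictionary maps a source word to a set of target-language translations.
    """
    if not src_tokens or not tgt_tokens:
        return 0.0
    # Relative positions (0.0 to 1.0) of every target token.
    tgt_positions = {}
    for j, token in enumerate(tgt_tokens):
        tgt_positions.setdefault(token, []).append(j / len(tgt_tokens))
    anchors = matched = 0
    for i, token in enumerate(src_tokens):
        translations = dictionary.get(token)
        if not translations:
            continue  # not an anchor word
        anchors += 1
        rel = i / len(src_tokens)
        if any(abs(rel - pos) <= tolerance
               for t in translations
               for pos in tgt_positions.get(t, [])):
            matched += 1
    return matched / anchors if anchors else 0.0


def is_mutual_translation(src_tokens, tgt_tokens, dictionary, threshold=0.5):
    """threshold is the set value; the embodiment allows 0.3 to 0.7."""
    return matching_rate(src_tokens, tgt_tokens, dictionary) > threshold
```

A lower set value accepts looser translations at the cost of more false pairs; a higher one is stricter.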
Embodiment six: this embodiment further limits the bilingual text of comparable scale in the system for extracting bilingual parallel text from webpages described in embodiment one. In this embodiment, bilingual text of comparable scale means that the length ratio of the texts in the two languages falls within a set range.
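The length-ratio criterion can be sketched for one concrete language pair. The embodiment names neither the languages nor the range, so this sketch assumes a Chinese-English pair, counts CJK ideographs and ASCII letters as the two text lengths, and uses an arbitrary range of 0.2 to 5.0; the names script_lengths and page_type are hypothetical.

```python
def script_lengths(text: str) -> tuple[int, int]:
    """Return (number of CJK ideographs, number of ASCII letters) in text."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    latin = sum(1 for ch in text if ch.isascii() and ch.isalpha())
    return cjk, latin


def page_type(text: str, low: float = 0.2, high: float = 5.0) -> str:
    """Classify body content as 'mixed' or 'monolingual' by length ratio.

    low and high bound the assumed set range for the ratio of English
    length to Chinese length.
    """
    cjk, latin = script_lengths(text)
    if cjk == 0 or latin == 0:
        return "monolingual"
    ratio = latin / cjk
    return "mixed" if low <= ratio <= high else "monolingual"
```

A page that is mostly Chinese with a single English word falls outside the range and is therefore treated as monolingual.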
Embodiment seven: the method for extracting bilingual parallel text from webpages described in this embodiment comprises the following steps:
A step of storing webpages crawled at random on a large scale, together with their attributes, in a web database;
A step of applying character-based hashing to the URLs of the stored webpages and storing all processed webpages classified by the similarity of their domain names, which specifically comprises: a step of calculating a hash value for the main domain name and for each subdomain name in the domain name of each webpage; a step of storing all webpages whose main domain names have the same hash value in one major class; a step of placing, within that major class, all webpages whose next-level subdomain names have the same hash value in one subclass; and so on, a step of classifying and storing all webpages;
A step of extracting the tag string of each webpage;
A step of extracting the body content of that webpage; and a step of recording the encoding type and text length of the extracted tag string and of the corresponding body content, and storing them in the web database;
A step of judging the language category of the body content of every webpage in the web database, which further comprises: a step of judging the webpage to be a mixed webpage when the body content is found to contain texts in two languages of comparable scale, and otherwise judging the webpage to be a monolingual webpage;
A step of judging whether the bilingual texts in a mixed webpage are translations of each other, which further comprises: a step of organizing the bilingual texts of that webpage into bilingual parallel text format and saving them to the bilingual corpus when they are judged to be mutual translations;
A step of traversing and processing each monolingual webpage in the web database that is not yet marked as matched, where the processing of each monolingual webpage comprises: a step of making a mutual-translation judgement between the body content of this webpage and the body content of other monolingual webpages in the web database that are not yet marked as matched, in which the other webpages are selected with priority given to monolingual webpages in the same subclass; and a step of organizing the body contents of two monolingual webpages judged to be mutual translations into bilingual parallel text, saving it to the bilingual corpus, and marking both webpages as matched.
The text length of the body content is calculated from the number of characters in the body content.
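The traversal of unmatched monolingual webpages, with priority given to pages in the same subclass, can be sketched as follows. The Page record, its subclass field (standing for whatever key the domain-based classification produced), and the function pair_monolingual_pages are hypothetical; the mutual-translation judgement is passed in as a function so that any judgement method can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Page:
    url: str
    subclass: str        # key of the domain-based subclass the page is stored in
    text: str            # extracted body content
    matched: bool = False


def pair_monolingual_pages(
    pages: list[Page],
    is_translation: Callable[[str, str], bool],
) -> list[tuple[str, str]]:
    """Pair unmatched monolingual pages, trying same-subclass pages first."""
    corpus = []
    for page in pages:
        if page.matched:
            continue
        candidates = [p for p in pages if p is not page and not p.matched]
        # Stable sort: False sorts before True, so same-subclass pages come first.
        candidates.sort(key=lambda p: p.subclass != page.subclass)
        for other in candidates:
            if is_translation(page.text, other.text):
                corpus.append((page.text, other.text))
                page.matched = other.matched = True
                break
    return corpus
```

Trying same-subclass candidates first reflects the expectation that translated versions of a page usually live on the same site, so a match is typically found before pages from unrelated domains are compared.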
Embodiment eight: this embodiment further limits the webpage attributes in the method for extracting bilingual parallel text from webpages described in embodiment seven. In this embodiment, the webpage attributes include the URL address of the webpage and the time at which it was crawled.
Embodiment nine: this embodiment further limits the method for extracting bilingual parallel text from webpages described in embodiment seven. The step of extracting the tag string of each webpage further comprises: a step of checking the extracted tag string of the webpage; and a step of continuing to extract the body content of the webpage when the tag string is <html>, <body>, <td>, <p>, <span> or <div>.
In this embodiment, a tag-string check is added to the step of extracting the tag string of each webpage, so that webpage text is extracted selectively. Because text under the tags listed above is more likely to belong to the body content, only the content enclosed by these tags is extracted, which reduces the amount of data to be processed and increases the probability that the extracted information is usable.
Embodiment ten: this embodiment further limits the step of extracting the body content of the webpage in the method for extracting bilingual parallel text from webpages described in embodiment seven. The step of extracting the body content of the webpage further comprises: a step of checking the length of the body content after it has been extracted; and a step of continuing to record the corresponding information when the length is greater than a threshold chosen from the range of 30 to 80 characters, and otherwise recording the URL of the webpage and deleting the webpage from the web database.
In this embodiment, a check of the body content length is added to the step of extracting the body content of the webpage, so that webpages whose body content is too short are discarded.
Embodiment eleven: this embodiment limits the mutual-translation judgement step in the method for extracting bilingual parallel text from webpages described in embodiment seven. The mutual-translation judgement method described in this embodiment comprises the following steps: a step of using a dictionary to traverse the bilingual text, obtaining the words in it that translate each other, and taking these words as anchor points; a step of judging whether the positions of these words in the two texts match; and a step of judging the bilingual text to be mutual-translation text if the matching rate is greater than a set value, where the set value ranges from 0.3 to 0.7.
Embodiment twelve: this embodiment further limits the bilingual text of comparable scale in the method for extracting bilingual parallel text from webpages described in embodiment seven. In this embodiment, bilingual text of comparable scale means that the length ratio of the texts in the two languages falls within a set range.
The specific technical solutions described in the above embodiments are detailed descriptions of the technical solution of the present invention and should not be construed as limiting the present invention.