CN102930031B - Method and system for extracting bilingual parallel text from web pages - Google Patents

Info

Publication number
CN102930031B
Authority
CN
China
Prior art keywords
webpage
text
bilingual
body matter
webpages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210442487.XA
Other languages
Chinese (zh)
Other versions
CN102930031A (en)
Inventor
李文强
刘飞
张宇
刘挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Institute of artificial intelligence Co.,Ltd.
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN201210442487.XA
Publication of CN102930031A
Application granted
Publication of CN102930031B
Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

A method and system for extracting bilingual parallel text from web pages, relating to the technical field of corpus acquisition. The invention overcomes the low collection efficiency and insufficient scale of existing corpora. The system comprises: a web page database for storing web pages crawled at random on a large scale, together with their attributes; a text information extraction module for extracting the tag string, body text and related information of each web page; a web page type discrimination module for determining, from the body text of every web page in the web page database, whether the page is a mixed web page or a monolingual web page; a mixed web page processing module for performing mutual-translation judgment on the bilingual text in a mixed web page and saving bilingual text judged to be mutual translations to a bilingual corpus; and a monolingual web page processing module for traversing, for each monolingual web page not yet marked as matched, the other monolingual web pages in the web page database to obtain two monolingual web pages whose texts are mutual translations, and saving the body text of the two pages to the bilingual corpus.

Description

Method and system for extracting bilingual parallel text from web pages
Technical field
The present invention relates to the technical field of corpus acquisition, and in particular to the acquisition of bilingual parallel corpora.
Background art
Statistical machine translation is one approach to machine translation. Its basic idea is to build a statistical translation model by statistically analyzing a large amount of parallel corpora, and then to translate with that model. Over the past decade, research on statistical machine translation has made great progress, and statistical methods have gradually become the mainstream approach in machine translation research worldwide. Most machine translation systems in common use today, such as Google Translate, Bing Translator and Baidu Translate, adopt statistical methods.
Parallel corpora play a vital role in statistical machine translation. Parallel corpora of sufficient quantity and good quality are a necessary condition for building a high-performance statistical machine translation system.
Existing parallel corpora come from specific sources and are limited in scale.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and system for extracting bilingual parallel corpora from web pages, so as to overcome the low collection efficiency and insufficient scale of existing corpora. The invention provides a method and system for extracting bilingual parallel text from web pages.
The system for extracting bilingual parallel text from web pages according to the present invention comprises:
A web page database, for storing web pages crawled at random on a large scale together with their attributes; also for applying character-based hashing to the URL of each web page and storing all processed web pages in classes according to how close their domain names are. Storing all web pages in classes according to the closeness of their domain names means: a hash value is computed for the main domain name and for each subdomain name in the domain name of each web page; all web pages whose main domain names have the same hash value are stored in one major class; within that major class, all web pages whose next-level subdomain names have the same hash value are placed in one subclass; and so on, until all web pages are classified and stored;
A text information extraction module, for extracting the tag string of each web page and the body text of that web page, recording the encoding type and text length of the tag string and of the body text, and storing them in the web page database;
A web page type discrimination module, for judging the language category of the body text of every web page in the web page database: if the body text contains bilingual text of comparable size, the web page is judged to be a mixed web page; otherwise it is judged to be a monolingual web page;
A mixed web page processing module, for performing mutual-translation judgment on the bilingual text in a mixed web page and, when the text is judged to be mutual translations, organizing the bilingual text of that web page into bilingual parallel text format and saving it to the bilingual corpus;
A monolingual web page processing module, for traversing and processing each monolingual web page in the web page database that is not yet marked as matched. Each such page is processed as follows: mutual-translation judgment is performed between the body text of this monolingual web page and the body text of other monolingual web pages in the web page database that are not yet marked as matched, where the other pages are selected by giving priority to monolingual web pages in the same subclass; the body text of two monolingual web pages judged to be mutual translations is organized into bilingual parallel text and saved to the bilingual corpus, and both monolingual web pages are marked as matched.
The method for extracting bilingual parallel text from web pages according to the present invention comprises the following steps:
A step of storing web pages crawled at random on a large scale, together with their attributes, in a web page database;
A step of applying character-based hashing to the URLs of the stored web pages and storing all processed web pages in classes according to how close their domain names are, this step specifically comprising: a step of computing a hash value for the main domain name and for each subdomain name in the domain name of each web page; a step of storing all web pages whose main domain names have the same hash value in one major class; a step of placing, among the web pages in that major class, all web pages whose next-level subdomain names have the same hash value in one subclass; and so on, a step of classifying and storing all web pages;
A step of extracting the tag string of each web page;
A step of extracting the body text of that web page; and a step of recording the encoding type and text length of the extracted tag string and of the corresponding body text, and storing them in the web page database;
A step of judging the language category of the body text of every web page in the web page database, further comprising: a step of judging the web page to be a mixed web page when the body text is found to contain bilingual text of comparable size, and otherwise a step of judging the web page to be a monolingual web page;
A step of performing mutual-translation judgment on the bilingual text in a mixed web page, further comprising: a step of organizing the bilingual text of that web page into bilingual parallel text format and saving it to the bilingual corpus when the text is judged to be mutual translations;
A step of traversing and processing each monolingual web page in the web page database that is not yet marked as matched, the processing of each monolingual web page comprising: a step of performing mutual-translation judgment between the body text of this monolingual web page and the body text of other monolingual web pages in the web page database that are not yet marked as matched, in which the other pages are selected by giving priority to monolingual web pages in the same subclass; and a step of organizing the body text of two monolingual web pages judged to be mutual translations into bilingual parallel text, saving it to the bilingual corpus, and marking both monolingual web pages as matched.
The length of the body text is the text length obtained by counting the characters in the body text.
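The domain-name classification described above can be illustrated with a minimal Python sketch. The patent does not name a hash function or say how the main domain name is delimited, so the choices here are assumptions: CRC32 over the UTF-8 bytes stands in for the character-based hash, the main domain name is taken to be the last two labels of the host name (which is wrong for suffixes such as .com.cn), and the names `domain_levels`, `class_key` and `classify` are hypothetical.

```python
import zlib
from collections import defaultdict
from urllib.parse import urlsplit


def domain_levels(url, main_labels=2):
    """Return [main domain, nearest subdomain, next subdomain, ...] for a URL.

    Assumption: the main domain is the last `main_labels` labels of the host.
    """
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    main = ".".join(labels[-main_labels:])
    subs = list(reversed(labels[:-main_labels]))  # nearest subdomain first
    return [main] + subs


def class_key(url):
    """One hash value per domain level; CRC32 is only a stand-in hash."""
    return tuple(zlib.crc32(level.encode("utf-8")) for level in domain_levels(url))


def classify(urls):
    """Group URLs by class key: equal first element = same major class,
    equal first two elements = same subclass, and so on."""
    buckets = defaultdict(list)
    for url in urls:
        buckets[class_key(url)].append(url)
    return buckets
```

Under these assumptions, two pages on `en.example.com` share the full key, while a page on `zh.example.com` shares only the first element of the key, that is, the major class.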
The present invention overcomes a technical prejudice in the prior art by treating the Internet as the source from which corpora are acquired, which brings the following technical effects:
1. Because the Internet contains a large amount of bilingual parallel text, extracting bilingual parallel text from the Internet to build a bilingual corpus yields a large volume of information covering a rich variety of languages.
2. Because information on the Internet is constantly updated, a bilingual corpus acquired from the Internet can likewise be continuously updated and continue to grow.
Acquiring bilingual corpora with the present invention can greatly speed up corpus collection and can also solve the problem that corpora from specific sources are insufficient in scale.
Brief description of the drawings
Fig. 1 is a schematic diagram of the working principle of the system for extracting bilingual parallel text from web pages according to the present invention.
Detailed description of the embodiments
Embodiment one. The system for extracting bilingual parallel text from web pages according to this embodiment comprises:
A web page database, for storing web pages crawled at random on a large scale together with their attributes; also for applying character-based hashing to the URL of each web page and storing all processed web pages in classes according to how close their domain names are. Storing all web pages in classes according to the closeness of their domain names means: a hash value is computed for the main domain name and for each subdomain name in the domain name of each web page; all web pages whose main domain names have the same hash value are stored in one major class; within that major class, all web pages whose next-level subdomain names have the same hash value are placed in one subclass; and so on, until all web pages are classified and stored;
A text information extraction module, for extracting the tag string of each web page and the body text of that web page, recording the encoding type and text length of the tag string and of the body text, and storing them in the web page database;
A web page type discrimination module, for judging the language category of the body text of every web page in the web page database: if the body text contains bilingual text of comparable size, the web page is judged to be a mixed web page; otherwise it is judged to be a monolingual web page;
A mixed web page processing module, for performing mutual-translation judgment on the bilingual text in a mixed web page and, when the text is judged to be mutual translations, organizing the bilingual text of that web page into bilingual parallel text format and saving it to the bilingual corpus;
A monolingual web page processing module, for traversing and processing each monolingual web page in the web page database that is not yet marked as matched. Each such page is processed as follows: mutual-translation judgment is performed between the body text of this monolingual web page and the body text of other monolingual web pages in the web page database that are not yet marked as matched, where the other pages are selected by giving priority to monolingual web pages in the same subclass; the body text of two monolingual web pages judged to be mutual translations is organized into bilingual parallel text and saved to the bilingual corpus, and both monolingual web pages are marked as matched.
The length of the body text is the text length obtained by counting the characters in the body text.
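The traversal performed by the monolingual web page processing module might be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: each page is a hypothetical dict with `subclass` and `text` keys, `is_translation` is a caller-supplied judgment function, and "priority to the same subclass" is modeled simply by trying same-subclass candidates first.

```python
def pair_monolingual_pages(pages, is_translation):
    """Pair unmatched monolingual pages whose body texts are mutual translations.

    pages: list of dicts with 'subclass' and 'text' keys (hypothetical layout).
    is_translation: callable(text_a, text_b) -> bool, supplied by the caller.
    Returns the bilingual corpus as a list of (text_a, text_b) pairs.
    """
    matched = set()   # indices of pages already marked as matched
    corpus = []
    for i, page in enumerate(pages):
        if i in matched:
            continue
        candidates = [j for j in range(len(pages)) if j != i and j not in matched]
        # Same-subclass candidates come first (False sorts before True).
        candidates.sort(key=lambda j: pages[j]["subclass"] != page["subclass"])
        for j in candidates:
            if is_translation(page["text"], pages[j]["text"]):
                corpus.append((page["text"], pages[j]["text"]))
                matched.update((i, j))  # mark both pages as matched
                break
    return corpus
```

Trying same-subclass pages first reflects the idea that translated versions of a page usually live under the same domain, so a match tends to be found before the rest of the database is scanned.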
Embodiment two. This embodiment further describes the web page attributes in the system for extracting bilingual parallel text from web pages according to embodiment one. In this embodiment, the web page attributes include the URL of the web page and the time at which it was crawled.
Embodiment three. This embodiment further limits the text information extraction module of the system for extracting bilingual parallel text from web pages according to embodiment one. The text information extraction module is also used to judge the tag string extracted from the web page, and to continue extracting the text information in the web page when the tag string is <html>, <body>, <td>, <p>, <span> or <div>.
In this embodiment, a function for judging the tag string is added to the text information extraction module, so that web page text is extracted selectively. Because text under the tags listed above is more likely to be body text, only the content enclosed by those tags is extracted, which reduces the amount of data to be processed and increases the likelihood that the extracted information is usable.
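One way to picture the tag-based selection is the sketch below, built on Python's standard `html.parser`. It is an assumption-laden illustration: it keeps a text fragment only when its innermost enclosing tag is one of the six listed tags, the `VOID_TAGS` set is a partial list added so that unclosed tags such as `<br>` do not disturb the tag stack, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

TEXT_TAGS = {"html", "body", "td", "p", "span", "div"}   # tags named in the text
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}  # partial list, assumption


class BodyTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, innermost last
        self.chunks = []   # collected body-text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in TEXT_TAGS:
            self.chunks.append(text)


def extract_body_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

With this rule, text inside `<script>`, `<title>` or `<a>` elements is skipped, which is one plausible reading of how the selection reduces the amount of data to process.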
Embodiment four. This embodiment further limits the text information extraction module of the system for extracting bilingual parallel text from web pages according to embodiment one. The text information extraction module is also used, after extracting the body text, to judge the length of the body text: when the length is greater than a threshold of 30 to 80 characters, the module continues to record the corresponding information; otherwise it records the URL of the web page and deletes the web page from the web page database.
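A minimal sketch of this length filter follows. It assumes the web page database can be modeled as a dict from URL to body text and picks 50 characters as the threshold, a value chosen only because it lies inside the stated 30 to 80 range; `filter_short_pages` is a hypothetical name.

```python
def filter_short_pages(pages, min_chars=50):
    """Split pages into those kept and the URLs of those deleted.

    pages: dict mapping URL -> body text (hypothetical stand-in for the database).
    min_chars: threshold, assumed here to be 50 (text gives a 30 to 80 range).
    """
    kept, deleted_urls = {}, []
    for url, body in pages.items():
        if len(body) > min_chars:   # length is the character count
            kept[url] = body
        else:
            deleted_urls.append(url)  # record the URL, drop the page
    return kept, deleted_urls
```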
Embodiment five. This embodiment further describes the mutual-translation judgment method in the system for extracting bilingual parallel text from web pages according to embodiment one. The mutual-translation judgment method is: use a dictionary to traverse the bilingual text and obtain the words that are translations of each other, take these words as anchor points, and judge whether their positions in the bilingual text match; if the match rate is greater than a set value, the set value being in the range 0.3 to 0.7, the bilingual text is judged to be mutual translations.
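The anchor-based judgment could look roughly like the sketch below. Several details are not specified in the text and are assumptions here: positions are compared as relative offsets (index divided by length), two positions "match" when they differ by at most a `tolerance` of 0.2, the match rate is the number of matched anchors divided by the number of source words that have a dictionary entry, and the set value defaults to 0.5. The function name and the dictionary layout are hypothetical.

```python
def is_mutual_translation(src_tokens, tgt_text, dictionary,
                          threshold=0.5, tolerance=0.2):
    """Judge whether tgt_text is a translation of the tokenized source text.

    dictionary: maps a lowercase source word to a list of target-language
    translations (hypothetical layout).
    threshold: set value, assumed 0.5 (text gives a 0.3 to 0.7 range).
    tolerance: maximum relative-position difference for a match (assumption).
    """
    candidates = 0   # source words that have a dictionary entry
    matched = 0      # anchors found at a matching relative position
    for i, token in enumerate(src_tokens):
        translations = dictionary.get(token.lower())
        if not translations:
            continue
        candidates += 1
        src_pos = i / len(src_tokens)
        for word in translations:
            j = tgt_text.find(word)
            if j >= 0 and abs(src_pos - j / len(tgt_text)) <= tolerance:
                matched += 1
                break
    return candidates > 0 and matched / candidates > threshold
```

For example, with a three-entry dictionary mapping "i", "love" and "cats" to "我", "爱" and "猫", the token list `["I", "love", "cats"]` and the target text "我爱猫" yield three matched anchors out of three candidates, so the pair is accepted.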
Embodiment six. This embodiment further limits the bilingual text of comparable size in the system for extracting bilingual parallel text from web pages according to embodiment one. In this embodiment, bilingual text of comparable size means that the length ratio of the two languages' text is within a set range.
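As an illustration of the length-ratio test, the sketch below classifies a page as mixed or monolingual. It assumes a Chinese-English language pair, counts Chinese characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF) and English as ASCII letters, and uses a hypothetical set range of 0.1 to 1.0 for the Chinese-to-English character ratio; none of these specifics come from the text.

```python
def classify_page(body, low=0.1, high=1.0):
    """Return 'mixed' if the page holds Chinese and English text of comparable
    size, else 'monolingual'. The (low, high) range is an assumed setting."""
    zh = sum(1 for ch in body if "\u4e00" <= ch <= "\u9fff")    # Chinese characters
    en = sum(1 for ch in body if ch.isascii() and ch.isalpha())  # English letters
    if zh and en and low <= zh / en <= high:
        return "mixed"
    return "monolingual"
```

A page containing only a stray English word among Chinese text falls outside the range and is treated as monolingual, which is the behavior the "comparable size" condition appears intended to give.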
Embodiment seven. The method for extracting bilingual parallel text from web pages according to this embodiment comprises the following steps:
A step of storing web pages crawled at random on a large scale, together with their attributes, in a web page database;
A step of applying character-based hashing to the URLs of the stored web pages and storing all processed web pages in classes according to how close their domain names are, this step specifically comprising: a step of computing a hash value for the main domain name and for each subdomain name in the domain name of each web page; a step of storing all web pages whose main domain names have the same hash value in one major class; a step of placing, among the web pages in that major class, all web pages whose next-level subdomain names have the same hash value in one subclass; and so on, a step of classifying and storing all web pages;
A step of extracting the tag string of each web page;
A step of extracting the body text of that web page; and a step of recording the encoding type and text length of the extracted tag string and of the corresponding body text, and storing them in the web page database;
A step of judging the language category of the body text of every web page in the web page database, further comprising: a step of judging the web page to be a mixed web page when the body text is found to contain bilingual text of comparable size, and otherwise a step of judging the web page to be a monolingual web page;
A step of performing mutual-translation judgment on the bilingual text in a mixed web page, further comprising: a step of organizing the bilingual text of that web page into bilingual parallel text format and saving it to the bilingual corpus when the text is judged to be mutual translations;
A step of traversing and processing each monolingual web page in the web page database that is not yet marked as matched, the processing of each monolingual web page comprising: a step of performing mutual-translation judgment between the body text of this monolingual web page and the body text of other monolingual web pages in the web page database that are not yet marked as matched, in which the other pages are selected by giving priority to monolingual web pages in the same subclass; and a step of organizing the body text of two monolingual web pages judged to be mutual translations into bilingual parallel text, saving it to the bilingual corpus, and marking both monolingual web pages as matched.
The length of the body text is the text length obtained by counting the characters in the body text.
Embodiment eight. This embodiment further limits the web page attributes in the method for extracting bilingual parallel text from web pages according to embodiment seven. In this embodiment, the web page attributes include the URL of the web page and the time at which it was crawled.
Embodiment nine. This embodiment further limits the method for extracting bilingual parallel text from web pages according to embodiment seven. The step of extracting the tag string of each web page further comprises: a step of judging the tag string extracted from the web page and, when the tag string is <html>, <body>, <td>, <p>, <span> or <div>, continuing with the step of extracting the body text of the web page.
In this embodiment, a step of judging the tag string is added to the step of extracting the tag string of each web page, so that web page text is extracted selectively. Because text under the tags listed above is more likely to be body text, only the content enclosed by those tags is extracted, which reduces the amount of data to be processed and increases the likelihood that the extracted information is usable.
Embodiment ten. This embodiment further limits the step of extracting the body text of the web page in the method for extracting bilingual parallel text from web pages according to embodiment seven. The step of extracting the body text of the web page further comprises: a step of judging the length of the body text after it has been extracted and, when the length is greater than a threshold of 30 to 80 characters, continuing to record the corresponding information; otherwise, a step of recording the URL of the web page and deleting the web page from the web page database.
In this embodiment, a function for judging the length of the body text is added to the step of extracting the body text of the web page, so that web pages whose body text is too short are discarded.
Embodiment eleven. This embodiment limits the mutual-translation judgment step in the method for extracting bilingual parallel text from web pages according to embodiment seven. The mutual-translation judgment method of this embodiment comprises the following steps: a step of using a dictionary to traverse the bilingual text, obtaining the words that are translations of each other, and taking these words as anchor points; a step of judging whether their positions in the bilingual text match; and a step of judging the bilingual text to be mutual translations if the match rate is greater than a set value, the set value being in the range 0.3 to 0.7.
Embodiment twelve. This embodiment further limits the bilingual text of comparable size in the method for extracting bilingual parallel text from web pages according to embodiment seven. In this embodiment, bilingual text of comparable size means that the length ratio of the two languages' text is within a set range.
The specific technical solutions described in the above embodiments are detailed descriptions of the technical solution of the present invention and should not be construed as limiting the present invention.

Claims (9)

1. A system for extracting bilingual parallel text from web pages, characterized in that the system comprises:
A web page database, for storing web pages crawled at random on a large scale together with their attributes; also for applying character-based hashing to the URL of each web page and storing all processed web pages in classes according to how close their domain names are; wherein storing all web pages in classes according to the closeness of their domain names means: a hash value is computed for the main domain name and for each subdomain name in the domain name of each web page; all web pages whose main domain names have the same hash value are stored in one major class; within that major class, all web pages whose next-level subdomain names have the same hash value are placed in one subclass; and so on, until all web pages are classified and stored;
A text information extraction module, for extracting the tag string of each web page and the body text of that web page, recording the encoding type and text length of the tag string and of the body text, and storing them in the web page database;
A web page type discrimination module, for judging the language category of the body text of every web page in the web page database: if the body text contains bilingual text of comparable size, the web page is judged to be a mixed web page; otherwise it is judged to be a monolingual web page;
A mixed web page processing module, for performing mutual-translation judgment on the bilingual text in a mixed web page and, when the text is judged to be mutual translations, organizing the bilingual text of that web page into bilingual parallel text format and saving it to the bilingual corpus;
A monolingual web page processing module, for traversing and processing each monolingual web page in the web page database that is not yet marked as matched, each such page being processed as follows: mutual-translation judgment is performed between the body text of this monolingual web page and the body text of other monolingual web pages in the web page database that are not yet marked as matched, where the other pages are selected by giving priority to monolingual web pages in the same subclass; the body text of two monolingual web pages judged to be mutual translations is organized into bilingual parallel text and saved to the bilingual corpus, and both monolingual web pages are marked as matched;
Bilingual text of comparable size means that the length ratio of the two languages' text is within a set range.
2. The system for extracting bilingual parallel text from web pages according to claim 1, characterized in that the text information extraction module is also used to judge the tag string extracted from the web page, and to continue extracting the text information in the web page when the tag string is <html>, <body>, <td>, <p>, <span> or <div>.
3. The system for extracting bilingual parallel text from web pages according to claim 1, characterized in that the text information extraction module is also used, after extracting the body text, to judge the length of the body text: when the length is greater than a threshold of 30 to 80 characters, the module continues to record the corresponding information; otherwise it records the URL of the web page and deletes the web page from the web page database.
4. The system for extracting bilingual parallel text from web pages according to claim 1, characterized in that the mutual-translation judgment method is: use a dictionary to traverse the bilingual text and obtain the words that are translations of each other, take these words as anchor points, and judge whether their positions in the bilingual text match; if the match rate is greater than a set value, the set value being in the range 0.3 to 0.7, the bilingual text is judged to be mutual translations.
5. A method for extracting bilingual parallel text from web pages, characterized in that the method comprises the following steps:
A step of storing web pages crawled at random on a large scale, together with their attributes, in a web page database;
A step of applying character-based hashing to the URLs of the stored web pages and storing all processed web pages in classes according to how close their domain names are, this step specifically comprising: a step of computing a hash value for the main domain name and for each subdomain name in the domain name of each web page; a step of storing all web pages whose main domain names have the same hash value in one major class; a step of placing, among the web pages in that major class, all web pages whose next-level subdomain names have the same hash value in one subclass; and so on, a step of classifying and storing all web pages;
A step of extracting the tag string of each web page;
A step of extracting the body text of that web page; and a step of recording the encoding type and text length of the extracted tag string and of the corresponding body text, and storing them in the web page database;
A step of judging the language category of the body text of every web page in the web page database, further comprising: a step of judging the web page to be a mixed web page when the body text is found to contain bilingual text of comparable size, and otherwise a step of judging the web page to be a monolingual web page;
A step of performing mutual-translation judgment on the bilingual text in a mixed web page, further comprising: a step of organizing the bilingual text of that web page into bilingual parallel text format and saving it to the bilingual corpus when the text is judged to be mutual translations;
A step of traversing and processing each monolingual web page in the web page database that is not yet marked as matched, the processing of each monolingual web page comprising: a step of performing mutual-translation judgment between the body text of this monolingual web page and the body text of other monolingual web pages in the web page database that are not yet marked as matched, in which the other pages are selected by giving priority to monolingual web pages in the same subclass; and a step of organizing the body text of two monolingual web pages judged to be mutual translations into bilingual parallel text, saving it to the bilingual corpus, and marking both monolingual web pages as matched.
6. The method for extracting bilingual parallel text from web pages according to claim 5, characterized in that the web page attributes include the URL of the web page and the time at which it was crawled.
7. The method for extracting bilingual parallel text from web pages according to claim 5, characterized in that the step of extracting the tag string of each web page further comprises: a step of judging the tag string extracted from the web page and, when the tag string is <html>, <body>, <td>, <p>, <span> or <div>, continuing with the step of extracting the body text of the web page.
8. The method for extracting bilingual parallel text from web pages according to claim 5, characterized in that the step of extracting the body text of the web page further comprises: a step of judging the length of the body text after it has been extracted and, when the length is greater than a threshold of 30 to 80 characters, continuing to record the corresponding information; otherwise, a step of recording the URL of the web page and deleting the web page from the web page database.
9. The method for extracting bilingual parallel text from web pages according to claim 5, characterized in that the mutual-translation judgment method comprises the following steps: a step of using a dictionary to traverse the bilingual text, obtaining the words that are translations of each other, and taking these words as anchor points; a step of judging whether their positions in the bilingual text match; and a step of judging the bilingual text to be mutual translations if the match rate is greater than a set value, the set value being in the range 0.3 to 0.7.
CN201210442487.XA 2012-11-08 2012-11-08 Method and system for extracting bilingual parallel text from web pages Active CN102930031B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210442487.XA CN102930031B (en) 2012-11-08 2012-11-08 Method and system for extracting bilingual parallel text from web pages

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210442487.XA CN102930031B (en) 2012-11-08 2012-11-08 Method and system for extracting bilingual parallel text from web pages

Publications (2)

Publication Number Publication Date
CN102930031A CN102930031A (en) 2013-02-13
CN102930031B true CN102930031B (en) 2015-10-07

Family

ID=47644828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210442487.XA Active CN102930031B (en) 2012-11-08 2012-11-08 Method and system for extracting bilingual parallel text from web pages

Country Status (1)

Country Link
CN (1) CN102930031B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077273A (en) * 2013-03-27 2014-10-01 腾讯科技(深圳)有限公司 Method and device for extracting webpage contents
CN103559172B (en) * 2013-11-06 2016-08-31 北京百度网讯科技有限公司 The subordinate sentence method and apparatus of multi-lingual mixing text
CN103646117B (en) * 2013-12-27 2016-09-28 苏州大学 A kind of bilingual parallel web pages recognition methods based on link and system
CN103678714B (en) * 2013-12-31 2017-05-10 北京百度网讯科技有限公司 Construction method and device for entity knowledge base
CN104133848B (en) * 2014-07-01 2017-09-19 中央民族大学 Tibetan language entity mobility models information extraction method
CN105045861A (en) * 2015-07-13 2015-11-11 广西达译商务服务有限责任公司 System for automatically collecting Hanyu and Bahasa Indonesia bilingualism parallel texts, and implementation method
CN104965925A (en) * 2015-07-13 2015-10-07 广西达译商务服务有限责任公司 Automatic Chinese-Khmer bilingual parallel text acquisition system and implementation method
CN104933193A (en) * 2015-07-13 2015-09-23 广西达译商务服务有限责任公司 Chinese and Bahasa Melayu bilingual parallel text automatic acquisition system and realizing method thereof
CN104933194A (en) * 2015-07-13 2015-09-23 广西达译商务服务有限责任公司 Chinese and Vietnamese bilingual parallel text automatic acquisition system and realizing method thereof
CN105022728A (en) * 2015-07-13 2015-11-04 广西达译商务服务有限责任公司 Automatic acquisition system of Chinese and Lao bilingual parallel texts and implementation method
CN105045862A (en) * 2015-07-13 2015-11-11 广西达译商务服务有限责任公司 System for automatically acquiring bilingual parallel corpus of Chinese-foreign languages and realization method
CN105138548A (en) * 2015-07-13 2015-12-09 广西达译商务服务有限责任公司 System for automatically collecting Chinese-Thai bilingual parallel corpus and implementation method
CN104933192A (en) * 2015-07-13 2015-09-23 广西达译商务服务有限责任公司 Automatic Chinese and Filipino bilingual parallel text collection system and implementation method
CN104933195A (en) * 2015-07-13 2015-09-23 广西达译商务服务有限责任公司 Chinese and Burmese bilingual parallel text automatic acquisition system and realizing method thereof
CN105574066A (en) * 2015-10-23 2016-05-11 青岛恒波仪器有限公司 Web page text extraction and comparison method and system thereof
CN112395856B (en) * 2019-07-31 2022-09-13 阿里巴巴集团控股有限公司 Text matching method, text matching device, computer system and readable storage medium
CN111209461A (en) * 2019-12-30 2020-05-29 成都理工大学 Bilingual corpus collection system based on public identification words
CN111310465B (en) * 2020-02-18 2021-07-23 北京字节跳动网络技术有限公司 Parallel corpus acquisition method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060282255A1 (en) * 2005-06-14 2006-12-14 Microsoft Corporation Collocation translation from monolingual and available bilingual corpora
CN101201820A (en) * 2007-11-28 2008-06-18 北京金山软件有限公司 Method and system for filtering bilingualism corpora

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
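Research on acquiring phrase paraphrase instances based on bilingual corpora; Li Weigang; Journal of Chinese Information Processing; 20070930; Vol. 21 (No. 5); 112-116 *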

Also Published As

Publication number Publication date
CN102930031A (en) 2013-02-13

Similar Documents

Publication Publication Date Title
CN102930031B (en) By the method and system extracting bilingual parallel text in webpage
CN104598577B (en) A kind of extracting method of Web page text
CN104991889B (en) A kind of non-multi-character word error auto-collation based on fuzzy participle
CN105630941B (en) Web body matter abstracting methods based on statistics and structure of web page
CN105956052A (en) Building method of knowledge map based on vertical field
CN101727498A (en) Automatic extraction method of web page information based on WEB structure
CN103077164A (en) Text analysis method and text analyzer
CN106446072B (en) The treating method and apparatus of web page contents
CN102135967A (en) Webpage keywords extracting method, device and system
CN107463571A (en) Web color method
CN102591612A (en) General webpage text extraction method based on punctuation continuity and system thereof
CN103530429A (en) Webpage content extracting method
CN108038099A (en) Low frequency keyword recognition method based on term clustering
CN104360993A (en) Method for extracting needed content from text
CN102508901A (en) Content-based massive image search method and content-based massive image search system
CN110008473A (en) A kind of medical text name Entity recognition mask method based on alternative manner
CN105630822A (en) Method for marking similar contents in patent retrieval in red color
CN105279208A (en) Data marking method and management system
CN106528509A (en) Webpage information extracting method and apparatus
CN110362673A (en) Computer vision class papers contents method of discrimination and system based on abstract semantic analysis
CN105022728A (en) Automatic acquisition system of Chinese and Lao bilingual parallel texts and implementation method
CN103761312B (en) Information extraction system and method for multi-recording webpage
CN111581478A (en) Cross-website general news acquisition method for specific subject
CN107451215B (en) Feature text extraction method and device
CN103116607B (en) A kind of text retrieval system based on the Chinese phonetic alphabet newly

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210419

Address after: Room 206-10, building 16, 1616 Chuangxin Road, Songbei District, Harbin City, Heilongjiang Province

Patentee after: Harbin jizuo technology partnership (L.P.)

Patentee after: Harbin Institute of Technology Asset Management Co.,Ltd.

Address before: No. 92 West Dazhi Street, Nangang District, Harbin 150001

Patentee before: HARBIN INSTITUTE OF TECHNOLOGY

TR01 Transfer of patent right

Effective date of registration: 20210617

Address after: Room 206-12, building 16, 1616 Chuangxin Road, Songbei District, Harbin City, Heilongjiang Province

Patentee after: Harbin Institute of Technology Institute of artificial intelligence Co.,Ltd.

Address before: Room 206-10, building 16, 1616 Chuangxin Road, Songbei District, Harbin City, Heilongjiang Province

Patentee before: Harbin jizuo technology partnership (L.P.)

Patentee before: Harbin Institute of Technology Asset Management Co.,Ltd.
