CN111209461A - Bilingual corpus collection system based on public identification words - Google Patents

Publication number
CN111209461A
Authority
CN
China
Prior art keywords
corpus
bilingual
public
module
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911388715.8A
Other languages
Chinese (zh)
Inventor
张洁
王晓珊
李伟彬
刘华
费比
周黎
周辛雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Chengdu University of Technology
Original Assignee
Chengdu University of Information Technology
Chengdu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-30
Filing date
2019-12-30
Publication date
2020-05-29
Application filed by Chengdu University of Information Technology and Chengdu University of Technology
Priority to CN201911388715.8A
Publication of CN111209461A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a bilingual corpus collection system based on public signs (public identification words). The system comprises a corpus collection range setting module, a corpus collection module that collects corpus material within the collection range, a first corpus information storage module, a second corpus information storage module, a public sign extraction module that extracts the public sign portion of the collected corpus, a bilingual comparison translation module, and a third corpus information storage module. The invention collects public-sign-related content from network information and reference works in a targeted way and provides a more detailed basis of comparison for public sign vocabulary, so that paraphrases unrelated to public signs are avoided in subsequent use and translation accuracy in public sign applications is effectively improved.

Description

Bilingual corpus collection system based on public identification words
Technical Field
The invention relates to a bilingual corpus collection system based on public identification words.
Background
Public signs (the "public identification words" of the title), also called public notices, are short indicative texts provided for the convenience of residents and tourists moving around a city. They cover service facilities, organization names, advertising boards, public facilities, public transportation, tourist attractions, street signs, slogans, shop signs and the like, and their function is to give the public useful information in concise language. With economic and cultural development, and the growth of tourism in particular, many cities attract large numbers of foreign visitors, so the translation of public signs has become very important: it reflects a city's linguistic and cultural environment and also helps promote the tourism industry. Correct, carefully prepared translations give visitors from all countries convenient help and improve the overall image of a city, whereas wrong or careless translations can cause comprehension barriers and even misunderstandings for foreign visitors. Accurate translation of public signs is therefore essential.
To improve the accuracy of public sign translation, it is important to establish a sound and accurate bilingual parallel corpus of public signs. Such a corpus is drawn from a much broader base of bilingual parallel material, so how to obtain the required public sign information from a wide range of corpus sources is a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
In view of the above technical problems, the present invention provides a bilingual corpus collection system based on public signs, so that the required public sign corpus can be obtained conveniently and the accuracy of the corpus is improved to a certain extent.
To achieve this purpose, the technical scheme adopted by the invention is as follows:
A bilingual corpus collection system based on public signs comprises:
a corpus collection range setting module, used for setting the collection range of corpus material related to public signs, the collection range comprising web pages and written works related to public signs;
a corpus collection module, used for collecting basic corpus information on a large scale within the collection range by means of web crawlers, manual input and character recognition, the basic corpus information comprising monolingual basic corpus information and bilingual basic corpus information;
a first corpus information storage module, used for storing the collected monolingual basic corpus information;
a second corpus information storage module, used for storing the collected bilingual basic corpus information;
a public sign extraction module, used for extracting monolingual public sign corpus information from the first corpus information storage module and bilingual public sign corpus information from the second corpus information storage module according to the constructed public sign keywords;
a bilingual comparison translation module, used for translating the monolingual public sign corpus information into corresponding bilingual public sign corpus information; and
a third corpus information storage module, used for storing the bilingual public sign corpus information.
Furthermore, the corpus collection range setting module contains a preset collection source set and an extended collection source set, wherein the preset collection source set stores a preset fixed collection range and the extended collection source set stores collection ranges newly entered through an input device.
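The two source sets can be pictured with a minimal sketch. It assumes that each source is identified by a plain string such as a URL or a document title; the class name `CollectionScope`, its methods and the example URLs are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Set


@dataclass
class CollectionScope:
    """Hypothetical model of the collection range setting module."""

    # Fixed collection range configured in advance (preset source set).
    preset_sources: FrozenSet[str]
    # Collection range entered later through an input device (extended set).
    extended_sources: Set[str] = field(default_factory=set)

    def add_source(self, source: str) -> bool:
        """Record a newly entered source; return False if empty or already known."""
        source = source.strip()
        if not source:
            return False
        if source in self.preset_sources or source in self.extended_sources:
            return False
        self.extended_sources.add(source)
        return True

    def all_sources(self) -> List[str]:
        """Full collection range: preset sources plus extensions, sorted."""
        return sorted(self.preset_sources | self.extended_sources)


scope = CollectionScope(preset_sources=frozenset({"https://example.org/tourism"}))
scope.add_source("https://example.org/city-signage")
print(scope.all_sources())
```

Keeping the preset set immutable while only the extended set grows mirrors the distinction the text draws between a fixed range and newly entered ranges.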
Furthermore, the corpus collection module comprises a crawler module for collecting information on the network, an input module for receiving manually entered information, a scanning recognition module for recognizing characters in images, and a corpus category recognition module for recognizing the language category of the collected content, wherein the corpus category recognition module sends the recognized monolingual basic corpus information to the first corpus information storage module for storage and the recognized bilingual basic corpus information to the second corpus information storage module for storage.
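The patent does not state how the language category is recognized. The following sketch shows one possible heuristic under two labeled assumptions: the foreign language is written in Latin script, and a paragraph counts as bilingual when it contains Chinese characters together with at least two Latin-script words. The function names and the threshold are hypothetical.

```python
import re

# Assumption: Chinese text is detected by the basic CJK Unified Ideographs range.
CJK = re.compile(r"[\u4e00-\u9fff]")
# Assumption: a foreign-language word is a run of two or more Latin letters.
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")


def classify_paragraph(text: str, min_latin_words: int = 2) -> str:
    """Return 'bilingual', 'monolingual' or 'empty' for one paragraph."""
    has_cjk = CJK.search(text) is not None
    latin_words = len(LATIN_WORD.findall(text))
    if has_cjk and latin_words >= min_latin_words:
        return "bilingual"
    if has_cjk or latin_words > 0:
        return "monolingual"
    return "empty"


def route(paragraphs, first_store, second_store):
    """Append each paragraph to the store that matches its category."""
    for paragraph in paragraphs:
        kind = classify_paragraph(paragraph)
        if kind == "monolingual":
            first_store.append(paragraph)   # first corpus information store
        elif kind == "bilingual":
            second_store.append(paragraph)  # second corpus information store


first, second = [], []
route(["小心地滑", "小心地滑 Caution: Wet Floor", "Exit"], first, second)
print(first, second)
```

A single stray Latin letter, as in an exit label such as "A口", does not make a paragraph bilingual under this heuristic, which is why the word pattern requires at least two letters.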
Furthermore, the public sign extraction module is also connected to a keyword library.
The keyword library stores public sign keywords, some of which are preset, and new public sign keywords can be entered to extend the library according to actual requirements.
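A keyword library with preset and user-entered keywords might look like the sketch below. It assumes case-insensitive substring matching, which the patent does not specify; `KeywordLibrary`, `extract` and the sample keywords are hypothetical names chosen for illustration.

```python
class KeywordLibrary:
    """Hypothetical keyword library: preset keywords plus later extensions."""

    def __init__(self, preset):
        # Preset public sign keywords, normalized to lowercase.
        self._preset = {k.strip().lower() for k in preset if k.strip()}
        # Keywords entered later according to actual requirements.
        self._extended = set()

    def add(self, keyword: str) -> None:
        """Extend the library with a new keyword, ignoring blanks and duplicates."""
        keyword = keyword.strip().lower()
        if keyword and keyword not in self._preset:
            self._extended.add(keyword)

    def keywords(self):
        """All keywords currently in the library."""
        return self._preset | self._extended

    def matches(self, text: str) -> bool:
        """True if the text contains any keyword (assumed substring match)."""
        lowered = text.lower()
        return any(k in lowered for k in self.keywords())


def extract(entries, library):
    """Keep only the entries that mention at least one public sign keyword."""
    return [entry for entry in entries if library.matches(entry)]


library = KeywordLibrary(["禁止", "exit"])
library.add("Caution")
print(extract(["禁止吸烟", "今天天气很好", "CAUTION: wet floor"], library))
```

Substring matching is the simplest choice and works for Chinese text without word segmentation, at the cost of occasional false positives inside longer words.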
Furthermore, the bilingual corpus collection system based on public signs further comprises a bilingual correction module, which corrects the bilingual public sign corpus information extracted by the public sign extraction module.
Further, the correction process performed by the bilingual correction module comprises:
identifying and extracting the mutually corresponding Chinese part and foreign-language part from the bilingual public sign corpus information, and comparing the meanings of the two parts based on the translation lexicon used by the bilingual comparison translation module;
if the similarity is not less than 85%, the bilingual public sign corpus information is considered usable and is stored in the third corpus information storage module;
if the similarity is not more than 50%, the bilingual public sign corpus information is considered unusable, the Chinese part is translated using the translation lexicon, and the resulting bilingual public sign corpus information is stored in the third corpus information storage module;
if the similarity is between 50% and 85%, the bilingual public sign corpus information is marked as suspect, and the extracted Chinese part, the foreign-language part and the lexicon translation are stored together, in associated form, in the third corpus information storage module.
Further, bilingual public sign corpus information marked as suspect, whether in the bilingual correction module or in the third corpus information storage module, is corrected manually.
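The three-way decision above can be sketched as follows. The patent gives the 85% and 50% thresholds but not the way the similarity percentage is computed, so the sketch uses the character-level ratio from Python's `difflib.SequenceMatcher` purely as a stand-in; the `translate` callable, the `CorpusEntry` record and the status labels are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Optional

ACCEPT_THRESHOLD = 0.85  # "not less than 85%": pair is usable as collected
REJECT_THRESHOLD = 0.50  # "not more than 50%": pair is replaced


@dataclass
class CorpusEntry:
    chinese: str
    foreign: str
    status: str  # "available", "retranslated" or "suspect"
    lexicon_translation: Optional[str] = None


def similarity(reference: str, candidate: str) -> float:
    """Stand-in similarity in [0, 1]; the patent does not define the measure."""
    return SequenceMatcher(None, reference.lower(), candidate.lower()).ratio()


def triage(chinese: str, foreign: str, reference: str, score: float) -> CorpusEntry:
    """Apply the three thresholds to one Chinese/foreign pair."""
    if score >= ACCEPT_THRESHOLD:
        return CorpusEntry(chinese, foreign, "available")
    if score <= REJECT_THRESHOLD:
        # Discard the collected translation and keep the lexicon translation.
        return CorpusEntry(chinese, reference, "retranslated")
    # In between: keep both versions together for manual correction.
    return CorpusEntry(chinese, foreign, "suspect", lexicon_translation=reference)


def correct_pair(chinese: str, foreign: str,
                 translate: Callable[[str], str]) -> CorpusEntry:
    """Translate the Chinese part with the lexicon, score the pair, then triage."""
    reference = translate(chinese)
    return triage(chinese, foreign, reference, similarity(reference, foreign))


entry = correct_pair("禁止吸烟", "No Smoking", translate=lambda text: "No Smoking")
print(entry.status)
```

Both boundary values are inclusive in the text, so a score of exactly 85% is accepted and a score of exactly 50% is replaced; only scores strictly between the two are queued for manual correction.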
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention collects public-sign-related content from network information and reference works in a targeted way and provides a more detailed basis of comparison for public sign vocabulary, so that paraphrases unrelated to public signs are avoided in subsequent use and translation accuracy in public sign applications is effectively improved.
(2) By setting the corpus collection range, the invention can extend a basic collection range with manually entered sources, which facilitates the continuous updating and growth of the bilingual corpus.
(3) The invention uses the keyword library to extract only the content containing the required public signs, excluding content unrelated to public signs and thereby improving the accuracy of the public signs used later; correcting the collected public sign information against the translation lexicon further improves the translation accuracy of the bilingual public signs.
Drawings
FIG. 1 is a block diagram of the present invention.
FIG. 2 is a block diagram of the corpus collection module.
Detailed Description
The present invention is further described below with reference to the drawings and an embodiment; implementations of the invention include, but are not limited to, the following embodiment.
Examples
As shown in FIG. 1 and FIG. 2, the bilingual corpus collection system based on public signs comprises the following modules.
The corpus collection range setting module is used for setting the collection range of corpus material related to public signs. The collection range comprises web pages and written works related to public signs, such as pages of tourism-related websites and official report materials. The module contains a preset collection source set, which stores a preset fixed collection range, and an extended collection source set, which stores collection ranges newly entered through an input device.
The corpus collection module is used for collecting basic corpus information on a large scale within the collection range by means of web crawlers, manual input and character recognition. The basic corpus information comprises monolingual basic corpus information and bilingual basic corpus information and takes the page paragraph as its basic unit. The module comprises a crawler module for collecting information on the network, an input module for receiving manually entered information, a scanning recognition module for recognizing characters in images, and a corpus category recognition module for recognizing the language category of the collected content. The corpus category recognition module sends the recognized monolingual basic corpus information to the first corpus information storage module for storage and the recognized bilingual basic corpus information to the second corpus information storage module for storage.
The first corpus information storage module is used for storing the collected monolingual basic corpus information.
The second corpus information storage module is used for storing the collected bilingual basic corpus information.
The public sign extraction module is used for extracting monolingual public sign corpus information from the first corpus information storage module and bilingual public sign corpus information from the second corpus information storage module according to the constructed public sign keywords. The monolingual public sign corpus information may be in Chinese or in a foreign language, and both the monolingual and the bilingual public sign corpus information extracted take the sentence as the basic unit.
The bilingual comparison translation module is used for translating the monolingual public sign corpus information into corresponding bilingual public sign corpus information and can be connected to a general-purpose bilingual translation lexicon.
The third corpus information storage module is used for storing the bilingual public sign corpus information.
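The collection module described above stores basic corpus information with the page paragraph as its unit. As a rough illustration of that unit only, the sketch below pulls the text of each `<p>` element out of an already downloaded HTML page with Python's standard `html.parser`. Treating `<p>` elements as the paragraphs is an assumption, fetching pages is left out, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text of each <p> element as one basic corpus unit."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = []
        self._in_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush()  # an unclosed <p> ends where the next one begins
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush()

    def handle_data(self, data):
        if self._in_p:
            self._buffer.append(data)

    def flush(self):
        """Store the buffered paragraph text with whitespace normalized."""
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []
        self._in_p = False


def page_paragraphs(html: str) -> list:
    """Return the non-empty paragraph texts of one HTML page."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep a final paragraph that was never closed
    return parser.paragraphs


page = "<h1>Station guide</h1><p>小心地滑 <b>Caution</b>: Wet Floor</p><p>Exit</p>"
print(page_paragraphs(page))
```

Text inside inline elements such as `<b>` stays part of the surrounding paragraph, while headings and other content outside `<p>` are ignored in this sketch.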
The public sign extraction module is also connected to a keyword library, which stores public sign keywords; some of the keywords are preset, and new public sign keywords can be entered to extend the library according to actual requirements.
The bilingual corpus collection system based on public signs further comprises a bilingual correction module, which corrects the bilingual public sign corpus information extracted by the public sign extraction module.
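Because collection works on paragraphs while extraction works on sentences, the extraction step implies a sentence split before keywords are matched. The sketch below assumes that sentences end at Chinese or Western end punctuation and that keywords match as case-insensitive substrings; neither rule is stated in the patent, and the function names are hypothetical.

```python
import re

# Assumption: a sentence is a run of text up to one end-punctuation mark.
_SENTENCE = re.compile(r"[^。！？!?.]+[。！？!?.]?")


def split_sentences(paragraph: str) -> list:
    """Split one stored paragraph into sentence-level units."""
    return [part.strip() for part in _SENTENCE.findall(paragraph) if part.strip()]


def extract_sign_sentences(paragraphs, keywords) -> list:
    """Return the sentences that contain at least one public sign keyword."""
    lowered = [keyword.lower() for keyword in keywords]
    hits = []
    for paragraph in paragraphs:
        for sentence in split_sentences(paragraph):
            text = sentence.lower()
            if any(keyword in text for keyword in lowered):
                hits.append(sentence)
    return hits


stored = ["小心地滑。今天天气很好！Mind the gap."]
print(extract_sign_sentences(stored, ["小心", "mind the gap"]))
```

A punctuation-based split is crude (abbreviations and decimal numbers would be cut), so a real system would likely need a more careful segmenter.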
Specifically, the correction process performed by the bilingual correction module comprises:
identifying and extracting the mutually corresponding Chinese part and foreign-language part from the bilingual public sign corpus information, and comparing the meanings of the two parts based on the translation lexicon used by the bilingual comparison translation module;
if the similarity is not less than 85%, the bilingual public sign corpus information is considered usable and is stored in the third corpus information storage module;
if the similarity is not more than 50%, the bilingual public sign corpus information is considered unusable, the Chinese part is translated using the translation lexicon, and the resulting bilingual public sign corpus information is stored in the third corpus information storage module;
if the similarity is between 50% and 85%, the bilingual public sign corpus information is marked as suspect, and the extracted Chinese part, the foreign-language part and the lexicon translation are stored together, in associated form, in the third corpus information storage module.
Bilingual public sign corpus information marked as suspect, whether in the bilingual correction module or in the third corpus information storage module, is corrected manually.
With the above arrangement, the invention can obtain the required bilingual public sign corpus information from a broad network information environment and provides a sufficient data basis for the subsequent construction of an accurate bilingual parallel corpus of public signs.
The above embodiment is only one of the preferred embodiments of the present invention and should not be used to limit its scope of protection; any insubstantial modification or change made within the spirit and main design concept of the invention, which solves the same technical problem as the invention, shall fall within the scope of protection of the invention.

Claims (7)

1. A bilingual corpus collection system based on public signs, characterized by comprising:
a corpus collection range setting module, used for setting the collection range of corpus material related to public signs, the collection range comprising web pages and written works related to public signs;
a corpus collection module, used for collecting basic corpus information on a large scale within the collection range by means of web crawlers, manual input and character recognition, the basic corpus information comprising monolingual basic corpus information and bilingual basic corpus information;
a first corpus information storage module, used for storing the collected monolingual basic corpus information;
a second corpus information storage module, used for storing the collected bilingual basic corpus information;
a public sign extraction module, used for extracting monolingual public sign corpus information from the first corpus information storage module and bilingual public sign corpus information from the second corpus information storage module according to the constructed public sign keywords;
a bilingual comparison translation module, used for translating the monolingual public sign corpus information into corresponding bilingual public sign corpus information; and
a third corpus information storage module, used for storing the bilingual public sign corpus information.
2. The bilingual corpus collection system based on public signs according to claim 1, wherein the corpus collection range setting module contains a preset collection source set for storing a preset fixed collection range and an extended collection source set for storing collection ranges newly entered through an input device.
3. The bilingual corpus collection system based on public signs according to claim 2, wherein the corpus collection module comprises a crawler module for collecting information on the network, an input module for receiving manually entered information, a scanning recognition module for recognizing characters in images, and a corpus category recognition module for recognizing the language category of the collected content, wherein the corpus category recognition module sends the recognized monolingual basic corpus information to the first corpus information storage module for storage and the recognized bilingual basic corpus information to the second corpus information storage module for storage.
4. The bilingual corpus collection system based on public signs according to claim 3, wherein the public sign extraction module is further connected to a keyword library,
the keyword library being used for storing public sign keywords, some of which are preset, with new public sign keywords being entered to extend the library according to actual requirements.
5. The bilingual corpus collection system based on public signs according to claim 4, further comprising a bilingual correction module for correcting the bilingual public sign corpus information extracted by the public sign extraction module.
6. The bilingual corpus collection system based on public signs according to claim 5, wherein the correction process performed by the bilingual correction module comprises:
identifying and extracting the mutually corresponding Chinese part and foreign-language part from the bilingual public sign corpus information, and comparing the meanings of the two parts based on the translation lexicon used by the bilingual comparison translation module;
if the similarity is not less than 85%, the bilingual public sign corpus information is considered usable and is stored in the third corpus information storage module;
if the similarity is not more than 50%, the bilingual public sign corpus information is considered unusable, the Chinese part is translated using the translation lexicon, and the resulting bilingual public sign corpus information is stored in the third corpus information storage module;
if the similarity is between 50% and 85%, the bilingual public sign corpus information is marked as suspect, and the extracted Chinese part, the foreign-language part and the lexicon translation are stored together, in associated form, in the third corpus information storage module.
7. The bilingual corpus collection system based on public signs according to claim 6, wherein bilingual public sign corpus information marked as suspect in the bilingual correction module or the third corpus information storage module is corrected manually.
CN201911388715.8A 2019-12-30 2019-12-30 Bilingual corpus collection system based on public identification words Pending CN111209461A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911388715.8A CN111209461A (en) 2019-12-30 2019-12-30 Bilingual corpus collection system based on public identification words

Publications (1)

Publication Number Publication Date
CN111209461A (en) 2020-05-29

Family

ID=70788390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911388715.8A Pending CN111209461A (en) 2019-12-30 2019-12-30 Bilingual corpus collection system based on public identification words

Country Status (1)

Country Link
CN (1) CN111209461A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101034395A (en) * 2007-03-30 2007-09-12 传神联合(北京)信息技术有限公司 Document waiting for translating processing system and document processing method using same
CN102930031A (en) * 2012-11-08 2013-02-13 哈尔滨工业大学 Method and system for extracting bilingual parallel text in web pages
CN105045862A (en) * 2015-07-13 2015-11-11 广西达译商务服务有限责任公司 System for automatically acquiring bilingual parallel corpus of Chinese-foreign languages and realization method
CN110008378A (en) * 2019-01-28 2019-07-12 平安科技(深圳)有限公司 Corpus collection method, device, equipment and storage medium based on artificial intelligence
CN110046261A (en) * 2019-04-22 2019-07-23 山东建筑大学 A kind of construction method of the multi-modal bilingual teaching mode of architectural engineering

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881900A (en) * 2020-07-01 2020-11-03 腾讯科技(深圳)有限公司 Corpus generation, translation model training and translation method, apparatus, device and medium
CN111881900B (en) * 2020-07-01 2022-08-23 腾讯科技(深圳)有限公司 Corpus generation method, corpus translation model training method, corpus translation model translation method, corpus translation device, corpus translation equipment and corpus translation medium
CN112183122A (en) * 2020-10-22 2021-01-05 腾讯科技(深圳)有限公司 Character recognition method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200529