CN111209461A

CN111209461A - Bilingual corpus collection system based on public identification words

Info

Publication number: CN111209461A
Application number: CN201911388715.8A
Authority: CN
Inventors: 张洁; 王晓珊; 李伟彬; 刘华; 费比; 周黎; 周辛雨
Original assignee: Chengdu University of Information Technology; Chengdu University of Technology
Current assignee: Chengdu University of Information Technology; Chengdu University of Technology
Priority date: 2019-12-30
Filing date: 2019-12-30
Publication date: 2020-05-29

Abstract

The invention relates to a bilingual corpus collection system based on public signs (public identification words). The system comprises a corpus collection range setting module, a corpus collection module that collects corpus material within the set collection range, a first corpus information storage module, a second corpus information storage module, a public sign extraction module that extracts the public sign portion from the collected corpus, a bilingual comparison translation module, and a third corpus information storage module. The invention purposefully collects public-sign-related content from network information and reference books and provides a more detailed comparison basis for public sign vocabulary, so that paraphrases unrelated to public signs are avoided in subsequent use and translation accuracy in public sign applications is effectively improved.

Description

Bilingual corpus collection system based on public identification words

Technical Field

The invention relates to a bilingual corpus collection system based on public signs (public identification words).

Background

Public signs, also called public notices, are mainly informational text provided in cities for the convenience of residents and tourists. They cover service facilities, organization names, billboards, public facilities, public transportation, tourist attractions, street signs, slogans, shop signs and the like, and their function is to give the public useful information in concise language. With economic and cultural development, and the growth of tourism in particular, many cities attract large numbers of foreign visitors, so the translation of public signs has become very important: it reflects a city's linguistic and cultural environment and plays an important role in promoting the tourism industry. Correct and rigorous public sign translations offer convenient help to tourists from all countries and improve a city's overall image; incorrect or careless translations, by contrast, create comprehension barriers and even misunderstandings for foreign tourists. Accurate public sign translation is therefore essential.

To improve the accuracy of public sign translation, it is important to establish a sound and accurate bilingual parallel corpus of public signs. Such a corpus is derived from a broad base of general bilingual parallel corpus material, so how to obtain the required public sign information from a wide range of corpus sources is a problem that those skilled in the art urgently need to solve.

Disclosure of Invention

In view of the above technical problems, the present invention provides a bilingual corpus collection system based on public signs, so that the required public sign corpus can be obtained conveniently and the accuracy of the corpus can be improved to a certain extent.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

A bilingual corpus collection system based on public signs comprises:

a corpus collection range setting module, used for setting a collection range for corpus material related to public signs, the collection range comprising web pages and documentary works related to public signs;

a corpus collection module, used for large-scale collection of basic corpus information within the collection range by means of web crawlers, manual input and character recognition, the basic corpus information comprising monolingual basic corpus information and bilingual basic corpus information;

a first corpus information storage module, used for storing the collected monolingual basic corpus information;

a second corpus information storage module, used for storing the collected bilingual basic corpus information;

a public sign extraction module, used for extracting, according to the constructed public sign keywords, monolingual public sign corpus information from the first corpus information storage module and bilingual public sign corpus information from the second corpus information storage module;

a bilingual comparison translation module, used for translating the monolingual public sign corpus information into corresponding bilingual public sign corpus information; and

a third corpus information storage module, used for storing the bilingual public sign corpus information.

Furthermore, the corpus collection range setting module contains a preset collection source set and an extended collection source set, wherein the preset collection source set stores a fixed collection range configured in advance, and the extended collection source set stores collection ranges newly entered through an input device.
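The two source sets can be pictured with a minimal Python sketch. The class name `CollectionRange`, its methods, and the example URLs are hypothetical and are not taken from the patent; the sketch only assumes that a source is identified by a string such as a URL or a document title.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CollectionRange:
    """Hypothetical model of the corpus collection range setting module."""

    # Fixed sources configured in advance (the preset collection source set).
    preset_sources: frozenset[str] = frozenset()
    # Sources entered later through an input device (the extended set).
    extended_sources: set[str] = field(default_factory=set)

    def add_source(self, source: str) -> bool:
        """Add a new source; return False if it is blank or already known."""
        source = source.strip()
        known = source in self.preset_sources or source in self.extended_sources
        if not source or known:
            return False
        self.extended_sources.add(source)
        return True

    def all_sources(self) -> list[str]:
        """Return every source the collection module should visit, sorted."""
        return sorted(self.preset_sources | self.extended_sources)


# Example with made-up URLs.
scope = CollectionRange(preset_sources=frozenset({"https://example.org/travel"}))
scope.add_source("https://example.org/signs")
```

Keeping the preset set immutable and the extended set mutable mirrors the distinction drawn above between a fixed range and one that grows through manual input.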

Furthermore, the corpus collection module comprises a crawler module for collecting information on a network, an input module for receiving manually entered information, a scanning recognition module for recognizing characters in an image, and a corpus category recognition module for recognizing the language category of the collected content, wherein the corpus category recognition module transmits the recognized monolingual basic corpus information to the first corpus information storage module for storage, and transmits the recognized bilingual basic corpus information to the second corpus information storage module for storage.
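The patent does not say how the language category is recognized. As one possible illustration, the sketch below assumes that Chinese can be approximated by the CJK Unified Ideographs block and the foreign language by basic Latin letters, and that a paragraph counts as bilingual when both scripts hold a meaningful share of its letters. The function names and the `min_share` threshold are hypothetical.

```python
import re

# Assumption: Chinese is approximated by CJK Unified Ideographs and the
# foreign language by basic Latin letters; the patent specifies no method.
_CJK = re.compile(r"[\u4e00-\u9fff]")
_LATIN = re.compile(r"[A-Za-z]")


def classify_paragraph(text: str, min_share: float = 0.2) -> str:
    """Return 'bilingual', 'monolingual', or 'empty' for one paragraph."""
    cjk = len(_CJK.findall(text))
    latin = len(_LATIN.findall(text))
    total = cjk + latin
    if total == 0:
        return "empty"
    # Bilingual only when the rarer script still reaches min_share of letters.
    if min(cjk, latin) / total >= min_share:
        return "bilingual"
    return "monolingual"


def route(paragraphs, first_store: list, second_store: list) -> None:
    """Send monolingual paragraphs to the first store, bilingual to the second."""
    for paragraph in paragraphs:
        label = classify_paragraph(paragraph)
        if label == "monolingual":
            first_store.append(paragraph)
        elif label == "bilingual":
            second_store.append(paragraph)


first, second = [], []
route(["请勿践踏草坪", "禁止吸烟 No Smoking", "Keep off the grass"], first, second)
```

A character-share heuristic like this only separates Chinese from Latin-script languages; a real system would need a proper language identifier for other pairs.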

Furthermore, the public sign extraction module is also connected with a keyword library. The keyword library is used for storing public sign keywords, wherein part of the public sign keywords are preset, and new public sign keywords can be entered to expand the library according to actual requirements.
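A minimal sketch of keyword-driven extraction follows. It assumes plain case-insensitive substring matching, which the patent does not specify; `KeywordLibrary` and `extract_public_signs` are hypothetical names, and the sample keywords are made up for illustration.

```python
class KeywordLibrary:
    """Hypothetical keyword library: preset keywords plus later additions."""

    def __init__(self, preset):
        # Normalize keywords once so matching is case-insensitive.
        self._keywords = {k.strip().lower() for k in preset if k.strip()}

    def add(self, keyword: str) -> None:
        """Expand the library with a newly entered keyword."""
        keyword = keyword.strip().lower()
        if keyword:
            self._keywords.add(keyword)

    def matches(self, sentence: str) -> bool:
        """True if any stored keyword occurs in the sentence."""
        lowered = sentence.lower()
        return any(k in lowered for k in self._keywords)


def extract_public_signs(sentences, library: KeywordLibrary) -> list:
    """Keep only the sentences that contain at least one keyword."""
    return [s for s in sentences if library.matches(s)]


library = KeywordLibrary(["禁止", "Caution"])
library.add("请勿")
signs = extract_public_signs(
    ["禁止吸烟", "今天天气很好", "CAUTION: wet floor", "请勿践踏草坪"], library
)
```

Substring matching is deliberately crude: an English keyword such as "exit" would also match "exited", so a production system would likely add word-boundary checks or tokenization.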

Furthermore, the bilingual corpus collection system based on public signs further comprises a bilingual correction module, which is used for correcting the bilingual public sign corpus information extracted by the public sign extraction module.

Further, the bilingual correction module performs a correction process including:

identifying and extracting the mutually corresponding Chinese part and foreign language part from the bilingual public sign corpus information, and comparing the paraphrases of the two parts against the translation word stock used by the bilingual comparison translation module;

if the comparison similarity is not less than 85%, that piece of bilingual public sign corpus information is considered usable and is stored in the third corpus information storage module;

if the comparison similarity is not more than 50%, that piece of bilingual public sign corpus information is considered unusable, the Chinese part is retranslated using the translation word stock, and the retranslated bilingual public sign corpus information is stored in the third corpus information storage module;

if the comparison similarity is between 50% and 85%, that piece of bilingual public sign corpus information is flagged as suspect, and the extracted Chinese part, the foreign language part and the translation produced from the translation word stock are stored together, in associated form, in the third corpus information storage module.
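The three-way decision above can be sketched as follows. The patent does not define how the comparison similarity is computed, so this sketch substitutes the character-level ratio from Python's `difflib.SequenceMatcher` purely as a stand-in; the word stock is modeled as a plain dictionary, and `similarity`, `correct_pair` and the status labels are hypothetical names.

```python
from difflib import SequenceMatcher

USABLE = 0.85    # "not less than 85%"
UNUSABLE = 0.50  # "not more than 50%"


def similarity(a: str, b: str) -> float:
    """Assumed stand-in metric: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def correct_pair(chinese: str, foreign: str, word_stock: dict) -> dict:
    """Classify one Chinese/foreign pair against the translation word stock."""
    reference = word_stock.get(chinese)
    record = {"chinese": chinese, "foreign": foreign, "reference": reference}
    if reference is None:
        # Not covered by the source text; this sketch defers to manual review.
        record["status"] = "suspect"
        return record
    score = similarity(foreign, reference)
    if score >= USABLE:
        record["status"] = "usable"
    elif score <= UNUSABLE:
        # Replace the collected translation with the word-stock translation.
        record["foreign"] = reference
        record["status"] = "retranslated"
    else:
        # Keep all three pieces together so a reviewer can compare them.
        record["status"] = "suspect"
    return record


stock = {"禁止吸烟": "No smoking"}  # made-up one-entry word stock
result = correct_pair("禁止吸烟", "No smoke", stock)
```

Every record keeps the Chinese part, the collected foreign part and the reference translation side by side, which corresponds to the associated storage described for suspect entries. A semantic similarity measure would be a more faithful reading of "comparing the paraphrases" than a character ratio.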

Further, the bilingual public sign corpus information flagged as suspect in the bilingual correction module or the third corpus information storage module is corrected manually.

Compared with the prior art, the invention has the following beneficial effects:

(1) The invention purposefully collects public-sign-related content from network information and reference books and provides a more detailed comparison basis for public sign vocabulary, so that paraphrases unrelated to public signs are avoided in subsequent use and translation accuracy in public sign applications is effectively improved.

(2) By combining a basic collection range with manually entered additions when setting the corpus collection range, the invention can extend collection to more sources, which facilitates the continuous updating and growth of the bilingual corpus.

(3) The invention uses the keyword library to extract the content containing the required public signs and to exclude content unrelated to public signs, thereby improving the accuracy of the public signs used subsequently; it further improves the translation accuracy of bilingual public signs by correcting the collected public sign information against the translation word stock.

Drawings

FIG. 1 is a block diagram of the present invention.

FIG. 2 is a block diagram of the corpus collection module.

Detailed Description

The present invention is further described below with reference to the drawings and an embodiment. Implementations of the invention include, but are not limited to, the following embodiment.

Examples

As shown in FIG. 1 and FIG. 2, the bilingual corpus collection system based on public signs includes the following modules.

The corpus collection range setting module is used for setting a collection range for corpus material related to public signs. The collection range comprises web pages and documentary works related to public signs, such as pages of travel industry websites, official report materials and the like. The module contains a preset collection source set and an extended collection source set, wherein the preset collection source set stores a fixed collection range configured in advance, and the extended collection source set stores collection ranges newly entered through an input device.

The corpus collection module is used for large-scale collection of basic corpus information within the collection range by means of web crawlers, manual input and character recognition. The basic corpus information comprises monolingual basic corpus information and bilingual basic corpus information, and takes the page paragraph as its basic unit. The corpus collection module comprises a crawler module for collecting information on a network, an input module for receiving manually entered information, a scanning recognition module for recognizing characters in an image, and a corpus category recognition module for recognizing the language category of the collected content, wherein the corpus category recognition module transmits the recognized monolingual basic corpus information to the first corpus information storage module for storage, and transmits the recognized bilingual basic corpus information to the second corpus information storage module for storage.

The first corpus information storage module is used for storing the collected monolingual basic corpus information.

The second corpus information storage module is used for storing the collected bilingual basic corpus information.

The public sign extraction module is used for extracting, according to the constructed public sign keywords, monolingual public sign corpus information from the first corpus information storage module and bilingual public sign corpus information from the second corpus information storage module. The monolingual public sign corpus information may be in Chinese or in a foreign language, and both the extracted monolingual and the extracted bilingual public sign corpus information take the sentence as their basic unit.
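Because collection works on page paragraphs while extraction works on sentences, some step must split paragraphs into sentences. The patent does not describe that step; the sketch below is one assumed approach that splits on Chinese and Western sentence-final punctuation and on line breaks. The function name `split_sentences` is hypothetical.

```python
from __future__ import annotations

import re

# A sentence is a run of non-terminator characters followed by any
# sentence-final punctuation (Chinese or Western). Line breaks also separate
# sentences, since signs are often listed one per line without punctuation.
_SENTENCE = re.compile(r"[^。！？!?.\n]+[。！？!?.]*")


def split_sentences(paragraph: str) -> list[str]:
    """Split one collected paragraph into sentence-level units."""
    return [m.strip() for m in _SENTENCE.findall(paragraph) if m.strip()]


units = split_sentences("禁止吸烟。请勿靠近！\nNo smoking. Keep out")
```

Treating every period as a terminator is a simplification: decimals such as "3.5" and abbreviations such as "e.g." would be split incorrectly, so a real system would need a more careful tokenizer.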

The bilingual comparison translation module is used for translating the monolingual public sign corpus information into corresponding bilingual public sign corpus information, and can be connected with a general-purpose bilingual translation word stock.

The public sign extraction module is also connected with a keyword library. The keyword library is used for storing public sign keywords, wherein part of the public sign keywords are preset, and new public sign keywords can be entered to expand the library according to actual requirements.

The bilingual corpus collection system based on public signs further comprises a bilingual correction module, which corrects the bilingual public sign corpus information extracted by the public sign extraction module.

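Specifically, the bilingual correction module performs the correction process set out in the Disclosure of Invention above: entries with a comparison similarity of not less than 85% are stored as usable, entries with a comparison similarity of not more than 50% are retranslated using the translation word stock, and entries in between are flagged as suspect.

The bilingual public sign corpus information flagged as suspect in the bilingual correction module or the third corpus information storage module is corrected manually.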

With the above arrangement, the invention can obtain the required bilingual public sign corpus information from a broad network information environment and provides a sufficient data basis for subsequently building an accurate bilingual parallel corpus of public signs.

The above embodiment is only one of the preferred embodiments of the present invention and should not be used to limit its scope of protection. Any insubstantial modification or change made within the spirit and main design of the present invention, and solving the same technical problem as the present invention, falls within the scope of protection of the present invention.

Claims

1. A bilingual corpus collection system based on public identification words is characterized by comprising:

2. The bilingual corpus collection system according to claim 1, wherein the corpus collection range setting module contains a preset collection source set for storing a fixed collection range configured in advance and an extended collection source set for storing collection ranges newly entered through an input device.

3. The bilingual corpus collection system according to claim 2, wherein the corpus collection module comprises a crawler module for collecting information on the network, an input module for receiving manual input information, a scan recognition module for recognizing characters on the image, and a corpus language recognition module for recognizing the language category in the collected information content, wherein the corpus language recognition module transmits the recognized monolingual basic corpus information to the first corpus information storage module for storage, and transmits the recognized bilingual basic corpus information to the second corpus information storage module for storage.

4. The bilingual corpus collection system based on public identification words according to claim 3, wherein the public identification word extraction module is further connected with a keyword library,

5. The bilingual corpus collection system based on public identification words according to claim 4, further comprising a bilingual correction module for correcting the bilingual public identification word corpus information extracted by the public identification word extraction module.

6. The bilingual corpus collection system according to claim 5, wherein the bilingual correction module performs correction according to the following steps:

7. The bilingual corpus collection system according to claim 6, wherein the bilingual public identification word corpus information flagged as suspect in the bilingual correction module or the third corpus information storage module is corrected manually.