CN111177412A - Public logo bilingual parallel corpus system - Google Patents

Info

Publication number
CN111177412A
Authority
CN
China
Prior art keywords
parallel corpus
information
corpus information
public
corpus
Prior art date
Legal status
Granted
Application number
CN201911388415.XA
Other languages
Chinese (zh)
Other versions
CN111177412B (en)
Inventor
李伟彬
张洁
刘小蓉
毛智
田娜
阳程
Current Assignee
Chengdu University of Information Technology
Chengdu University of Technology
Original Assignee
Chengdu University of Information Technology
Chengdu University of Technology
Priority date
2019-12-30
Filing date
2019-12-30
Publication date
2020-05-19
Application filed by Chengdu University of Information Technology and Chengdu University of Technology
Priority to CN201911388415.XA
Publication of CN111177412A
Application granted
Publication of CN111177412B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/381 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using identifiers, e.g. barcodes, RFIDs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services

Abstract

The invention relates to a bilingual parallel corpus system for public signs. The system comprises a corpus collection module, a classification labeling module, parallel corpus information sub-libraries, a category index table, and a query information extraction module. The classification labeling module stores the collected corpus information in the parallel corpus information sub-libraries by category, and the corpus information sub-library uses secondary categories to establish associations with other primary categories, so that the corresponding bilingual parallel corpus information can be found quickly within the required range at query time. Because public signs span many fields, the invention provides an associated multi-level labeling scheme for classifying and labeling the stored public sign corpus information. Combined with semantic labeling, this scheme lets potentially related corpus entries surface quickly during a query, excludes unrelated corpus entries from the search, and improves the efficiency of using the public sign bilingual parallel corpus.

Description

Public logo bilingual parallel corpus system
Technical Field
The invention relates to a bilingual parallel corpus system for public signs.
Background
Public signs, also called public notices, are indicative language provided mainly for the convenience of residents and tourists moving around a city. They cover service facilities, organization names, billboards, public facilities, public transportation, tourist attractions, street signs, slogans, shop signs and the like, and their function is to give the public useful information in concise language. With economic and cultural development, and the growth of tourism in particular, many cities attract large numbers of foreign visitors, so the translation of public signs has become very important. Public signs not only reflect a city's linguistic and cultural environment but also play an important role in promoting the tourism industry. Correct and careful public sign translations give visitors from every country convenient help and improve the overall image of a city. Incorrect or improper translations, by contrast, create comprehension barriers and even misunderstandings for foreign visitors. The accuracy of public sign translation is therefore essential.
In improving the accuracy of public sign translation, establishing a reasonable and accurate bilingual parallel corpus of public signs is also crucial. Because public signs span many fields, how to let a user obtain the required bilingual public sign corpus information quickly and accurately is a problem that those skilled in the art urgently need to solve.
Disclosure of Invention
In view of the above technical problem, the present invention provides a bilingual parallel corpus system for public signs, which uses computer information processing technology to improve the efficiency of obtaining bilingual parallel corpus information for public signs.
To achieve this purpose, the invention adopts the following technical scheme:
A bilingual parallel corpus system for public signs, comprising:
a corpus collection module, used for collecting public sign bilingual parallel corpus information from external information channels;
a classification labeling module, used for labeling the public sign bilingual parallel corpus information acquired by the corpus collection module according to preset categories and outputting the corresponding category identifiers, wherein the preset categories include at least a primary category corresponding to the main classification of the public sign bilingual parallel corpus information and a secondary category corresponding to its secondary classification, and the category identifiers include a primary category identifier corresponding to the primary category and a secondary category identifier corresponding to the secondary category;
parallel corpus information sub-libraries, equal in number to the primary categories of the public sign bilingual parallel corpus information, used for storing the public sign bilingual parallel corpus information separately and independently by main classification;
a corpus information sub-library subordinate to each parallel corpus information sub-library, used for storing the secondary category identifiers generated from the secondary classification of the currently classified public sign bilingual parallel corpus information, and for establishing associations between those secondary category identifiers and other primary categories;
a category index table, used for recording and storing the category identifiers and for configuring a jump interface on each secondary category identifier that is associated with a primary category;
and a query information extraction module, used for labeling the input information at query time with possibly related category identifiers according to its meaning, so that the category index table can be consulted directly and a traversal query performed in the corresponding parallel corpus information sub-libraries.
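The storage structure listed above can be illustrated with a minimal sketch. The patent does not disclose a concrete schema, so the class names `Entry` and `Corpus`, the field names, and the category labels below are all hypothetical; the sketch only shows one way that per-primary-category sub-libraries and an index table mapping secondary identifiers to associated primary categories might fit together.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """One bilingual parallel corpus record (hypothetical schema)."""
    source: str      # sign text in the source language
    target: str      # translated sign text
    primary: str     # primary category identifier
    secondary: str   # secondary category identifier


@dataclass
class Corpus:
    # One independent sub-library per primary category.
    sub_libraries: dict = field(default_factory=lambda: defaultdict(list))
    # Category index table: secondary identifier -> associated primary categories.
    index: dict = field(default_factory=lambda: defaultdict(set))

    def add(self, entry, related_primaries=()):
        """Store an entry and record its category associations."""
        self.sub_libraries[entry.primary].append(entry)
        self.index[entry.secondary].add(entry.primary)
        # Assumed form of the "jump" links to other primary categories.
        self.index[entry.secondary].update(related_primaries)

    def sub_libraries_for(self, secondary):
        """Return the primary categories reachable from a secondary identifier."""
        return sorted(self.index.get(secondary, ()))


corpus = Corpus()
corpus.add(
    Entry("小心地滑", "Caution: Wet Floor", "public_facilities", "safety_warning"),
    related_primaries=["tourist_attractions"],
)
print(corpus.sub_libraries_for("safety_warning"))
```

In this sketch the association with other primary categories is supplied by the caller; how the system derives it from semantic context is not specified in the source.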
Specifically, the secondary category identifier establishes its association with other matching primary categories through the semantic context of the secondary classification.
Furthermore, the public sign bilingual parallel corpus information stored in each parallel corpus information sub-library is configured with priority values, and entries are ordered by priority value according to how frequently they are queried.
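A minimal sketch of frequency-based priority follows. The source states only that priority values are ordered by query frequency; the class name `PrioritizedSubLibrary`, the use of a raw hit count as the priority value, and the sample entries are assumptions made for illustration.

```python
from collections import Counter


class PrioritizedSubLibrary:
    """Sub-library whose entries are ranked by how often they are queried."""

    def __init__(self, entries):
        self.entries = list(entries)   # (source, target) pairs
        self.hits = Counter()          # assumed priority value: query count per source text

    def record_query(self, source):
        self.hits[source] += 1

    def ranked(self):
        # sorted() is stable, so entries with equal counts keep insertion order.
        return sorted(self.entries, key=lambda e: -self.hits[e[0]])


lib = PrioritizedSubLibrary([("出口", "Exit"), ("入口", "Entrance"), ("问讯处", "Information")])
for text in ["入口", "入口", "问讯处"]:
    lib.record_query(text)
print([source for source, _ in lib.ranked()])
```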
Further, the corpus information sub-library is configured with a relevance value indicating the semantic relevance between different items of public sign bilingual parallel corpus information.
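The source does not say how the relevance value is computed. As one possible illustration, the sketch below scores two entries by the Jaccard overlap of their semantic tags; the tag sets, the function names `relevance` and `related_entries`, and the threshold are hypothetical and not taken from the patent.

```python
def relevance(tags_a, tags_b):
    """Jaccard overlap of two tag sets, used here as a stand-in relevance value."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_entries(entry_tags, candidates, threshold=0.2):
    """Return candidate texts whose relevance meets the threshold, best first."""
    scored = [(relevance(entry_tags, tags), text) for text, tags in candidates.items()]
    return [text for score, text in sorted(scored, reverse=True) if score >= threshold]


candidates = {
    "Mind the Step": {"safety", "stairs"},
    "No Smoking": {"prohibition"},
    "Caution: Wet Floor": {"safety", "floor"},
}
print(related_entries({"safety", "floor", "wet"}, candidates))
```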
Compared with the prior art, the invention has the following beneficial effects:
Because public signs span many fields, the invention provides an associated multi-level labeling scheme for classifying and labeling the stored public sign corpus information. Combined with semantic labeling, this scheme lets potentially related corpus entries surface quickly during a query, excludes unrelated corpus entries from the search, improves the efficiency of using the public sign bilingual parallel corpus, and helps promote the application of public signs.
Drawings
Fig. 1 is a schematic block diagram of the present invention.
Detailed Description
The present invention is further described below with reference to the drawing and examples. Embodiments of the invention include, but are not limited to, the following examples.
Examples
As shown in Fig. 1, the bilingual parallel corpus system for public signs includes:
a corpus collection module, used for collecting public sign bilingual parallel corpus information from external information channels;
a classification labeling module, used for labeling the public sign bilingual parallel corpus information acquired by the corpus collection module according to preset categories and outputting the corresponding category identifiers, wherein the preset categories include at least a primary category corresponding to the main classification of the public sign bilingual parallel corpus information and a secondary category corresponding to its secondary classification, and the category identifiers include a primary category identifier corresponding to the primary category and a secondary category identifier corresponding to the secondary category;
parallel corpus information sub-libraries, equal in number to the primary categories of the public sign bilingual parallel corpus information, used for storing the public sign bilingual parallel corpus information separately and independently by main classification;
a corpus information sub-library attached to each parallel corpus information sub-library, used for storing the secondary category identifiers generated from the secondary classification of the currently classified public sign bilingual parallel corpus information, and for letting those secondary category identifiers establish associations with other matching primary categories through the semantic context of the secondary classification, wherein the corpus information sub-library is configured with a relevance value indicating the semantic relevance between different items of public sign bilingual parallel corpus information;
a category index table, used for recording and storing the category identifiers and for configuring a jump interface on each secondary category identifier that is associated with a primary category;
and a query information extraction module, used for labeling the input information at query time with possibly related category identifiers according to its meaning, so that the category index table can be consulted directly and a traversal query performed in the corresponding parallel corpus information sub-libraries.
The public sign bilingual parallel corpus information stored in each parallel corpus information sub-library is configured with a priority value, and entries are ordered by priority value according to how frequently they are queried.
In practical application, when a user queries a piece of public sign text, the query information extraction module labels it with category identifiers. The system then uses those identifiers to select the parallel corpus information sub-libraries to be queried, together with the sub-libraries associated with them, and searches the selected sub-libraries using the actual keyword information, so that the required bilingual parallel corpus information is obtained quickly and accurately.
The above embodiment is only one of the preferred embodiments of the present invention and should not be used to limit its scope of protection. Any insubstantial modification or refinement made within the spirit and main design concept of the present invention, and which solves the same technical problem, falls within the scope of protection of the present invention.
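The query flow described above could be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the keyword lexicon `LEXICON` stands in for the semantic labeling step, and `INDEX`, `SUB_LIBRARIES`, and all category names and sample entries are hypothetical.

```python
# Hypothetical keyword -> secondary category lexicon (stand-in for semantic labeling).
LEXICON = {"slippery": "safety_warning", "smoking": "prohibition", "exit": "direction"}

# Hypothetical category index table: secondary identifier -> sub-libraries to search.
INDEX = {
    "safety_warning": ["public_facilities", "tourist_attractions"],
    "prohibition": ["public_transport"],
    "direction": ["public_facilities"],
}

# Hypothetical sub-libraries holding (source, target) pairs.
SUB_LIBRARIES = {
    "public_facilities": [("小心地滑", "Caution: Slippery Floor"), ("安全出口", "Emergency Exit")],
    "tourist_attractions": [("小心路滑", "Caution: Slippery Path")],
    "public_transport": [("禁止吸烟", "No Smoking")],
}


def label_query(text):
    """Label the query with possibly related secondary category identifiers."""
    return sorted({LEXICON[w] for w in text.lower().split() if w in LEXICON})


def query(text):
    """Search only the sub-libraries selected through the category index table."""
    words = text.lower().split()
    searched, results = [], []
    for category in label_query(text):
        for name in INDEX.get(category, []):
            if name in searched:
                continue
            searched.append(name)
            # Traverse the selected sub-library and match on the query keywords.
            for source, target in SUB_LIBRARIES.get(name, []):
                if any(w in target.lower() for w in words):
                    results.append((name, source, target))
    return searched, results


searched, results = query("slippery floor")
print(searched)
for row in results:
    print(row)
```

Here the `public_transport` sub-library is never traversed for this query, which is the narrowing effect the category identifiers are meant to provide.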

Claims (4)

1. A bilingual parallel corpus system for public signs, comprising:
a corpus collection module, used for collecting public sign bilingual parallel corpus information from external information channels;
a classification labeling module, used for labeling the public sign bilingual parallel corpus information acquired by the corpus collection module according to preset categories and outputting the corresponding category identifiers, wherein the preset categories include at least a primary category corresponding to the main classification of the public sign bilingual parallel corpus information and a secondary category corresponding to its secondary classification, and the category identifiers include a primary category identifier corresponding to the primary category and a secondary category identifier corresponding to the secondary category;
parallel corpus information sub-libraries, equal in number to the primary categories of the public sign bilingual parallel corpus information, used for storing the public sign bilingual parallel corpus information separately and independently by main classification;
a corpus information sub-library subordinate to each parallel corpus information sub-library, used for storing the secondary category identifiers generated from the secondary classification of the currently classified public sign bilingual parallel corpus information, and for establishing associations between those secondary category identifiers and other primary categories;
a category index table, used for recording and storing the category identifiers and for configuring a jump interface on each secondary category identifier that is associated with a primary category;
and a query information extraction module, used for labeling the input information at query time with possibly related category identifiers according to its meaning, so that the category index table can be consulted directly and a traversal query performed in the corresponding parallel corpus information sub-libraries.
2. The system according to claim 1, wherein the secondary category identifiers establish associations with other matching primary categories through the semantic context of the secondary classification.
3. The system according to claim 2, wherein the public sign bilingual parallel corpus information stored in each of the parallel corpus information sub-libraries is configured with a priority value, and entries are ordered by priority value according to the frequency of queries.
4. The system according to claim 3, wherein the corpus information sub-library is configured with a relevance value indicating the semantic relevance between different items of public sign bilingual parallel corpus information.
CN201911388415.XA 2019-12-30 2019-12-30 Public logo bilingual parallel corpus system Active CN111177412B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911388415.XA CN111177412B (en) 2019-12-30 2019-12-30 Public logo bilingual parallel corpus system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911388415.XA CN111177412B (en) 2019-12-30 2019-12-30 Public logo bilingual parallel corpus system

Publications (2)

Publication Number Publication Date
CN111177412A CN111177412A (en) 2020-05-19
CN111177412B CN111177412B (en) 2023-03-31

Family

ID=70655838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911388415.XA Active CN111177412B (en) 2019-12-30 2019-12-30 Public logo bilingual parallel corpus system

Country Status (1)

Country Link
CN (1) CN111177412B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079028A (en) * 2007-05-29 2007-11-28 中国科学院计算技术研究所 On-line translation model selection method of statistic machine translation
WO2010134752A2 (en) * 2009-05-21 2010-11-25 주식회사 아이네크 Semantic search method and system in which a plurality of classification systems are linked
US8145636B1 (en) * 2009-03-13 2012-03-27 Google Inc. Classifying text into hierarchical categories
US20160314201A1 (en) * 2013-05-13 2016-10-27 Groupon, Inc. Method, apparatus, and computer program product for classification and tagging of textual data
CN109145301A (en) * 2018-08-29 2019-01-04 上海汽车集团股份有限公司 Information classification approach and device, computer readable storage medium
CN109948160A (en) * 2019-03-15 2019-06-28 智者四海(北京)技术有限公司 Short text classification method and device
CN110209764A (en) * 2018-09-10 2019-09-06 腾讯科技(北京)有限公司 The generation method and device of corpus labeling collection, electronic equipment, storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张姝等: "面向事件的多语平行语料库构建研究", 《计算机应用研究》 *

Also Published As

Publication number Publication date
CN111177412B (en) 2023-03-31

Similar Documents

Publication Publication Date Title
CN106777275B (en) Entity attribute and property value extracting method based on more granularity semantic chunks
CN102841920A (en) Method and device for extracting webpage frame information
CN109933797A (en) Geocoding and system based on Jieba participle and address dictionary
CN108897887A (en) A kind of teaching resource recommended method of knowledge based map and user's similarity
CN109597895B (en) Knowledge graph-based official document searching method
US20130275454A1 (en) Full Text Search Using R-Trees
CN107908627A (en) A kind of multilingual map POI search systems
CN109740150A (en) Address resolution method, device, computer equipment and computer readable storage medium
CN107463711A (en) A kind of tag match method and device of data
Hall Quantitative and qualitative content analysis
Ahlers et al. Location-based Web search
CN114780680A (en) Retrieval and completion method and system based on place name and address database
Tamburelli et al. Revisiting the classification of Gallo-Italic: a dialectometric approach
CN106095933A (en) A kind of patent information inquiry system and querying method
CN111177412B (en) Public logo bilingual parallel corpus system
EP2783308B1 (en) Full text search based on interwoven string tokens
KR101289082B1 (en) System and method for providing area information service
CN111209461A (en) Bilingual corpus collection system based on public identification words
Chang et al. Enhancing POI search on maps via online address extraction and associated information segmentation
CN104376041B (en) A kind of information extraction method based on microblogging classification
JPH01304575A (en) Document processing device
CN111241784A (en) Processing and sorting method for language material resources of public identification languages
TW201040752A (en) Method and system for providing localized information
Neumaier et al. Geo-semantic labelling of open data. semantics 2018-14th international conference on semantic systems
Perdue et al. Citation searching: A new feature in PsycINFO

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant