CN110046261B - Construction method of multi-modal bilingual parallel corpus of construction engineering


Info

Publication number
CN110046261B
Authority
CN
China
Prior art keywords
corpus
translation
building
modal
bilingual
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910323653.6A
Other languages
Chinese (zh)
Other versions
CN110046261A (en)
Inventor
高金岭
张聪颖
王海峰
丁玫
鲍玉平
高吉云
张晓红
王薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jianzhu University
Original Assignee
Shandong Jianzhu University
Application filed by Shandong Jianzhu University
Priority to CN201910323653.6A
Publication of CN110046261A
Application granted
Publication of CN110046261B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of data processing and relates in particular to a method for constructing a multi-modal bilingual parallel corpus for construction engineering. The method comprises six steps: corpus screening; corpus extraction and proofreading; corpus segmentation and alignment; denoising; obtaining the parallel corpus; and corpus updating and expansion. The corpus is finely segmented and highly precise, and the meanings retrieved for a word or syntactic pattern are all relevant to the construction field, so irrelevant senses are excluded and users are given a large number of construction-domain bilingual translation samples.

Description

Construction method of multi-modal bilingual parallel corpus of construction engineering
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a construction method of a multi-modal bilingual parallel corpus of construction engineering.
Background
Construction English is the combination of the construction industry and English, and covers every stage of the industry, such as prequalification, bidding, construction, and quality evaluation. Its wording and style are characteristic of scientific and technical texts, with their own specialized vocabulary and conventions of expression: the mode of discourse is written and the basic tenor is formal. As China's share of the overseas construction market keeps growing and the domestic and overseas construction markets become more closely connected, construction English is used more and more widely and translations of it appear in large numbers. The viewpoints and methods of corpus linguistics provide tools for research on construction English translation and for teaching it.
Computer-aided translation (CAT) means that, while a translator works, the software continuously and automatically stores the translations the translator enters, building up a database. When identical or similar phrases or sentence segments appear later, the system automatically retrieves the matching content from the database and offers it as a reference translation, which avoids repeated translation work. Combining a multi-modal construction engineering corpus with CAT can therefore greatly improve translation efficiency. However, several problems remain. Specialized construction corpora are extremely rare at home and abroad, and no multi-modal construction corpus exists. In existing construction corpora the collected material is little curated or proofread, sometimes not proofread at all, so corpus format and content are not standardized. The sources are not authoritative enough: some corpora collect all kinds of online text indiscriminately, so they are noisy, of low purity, and cannot really be used in CAT software. Finally, most parallel corpora are aligned at paragraph level, whereas in translation sentences are the most valuable unit of reference, followed by fragments, phrases and terms, so whole-paragraph matches give low accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a method for constructing a multi-modal bilingual parallel corpus for construction engineering that is finely segmented and highly accurate, in which the retrieved meanings of words or syntactic patterns all belong to the construction field, so that irrelevant senses are excluded and users are given a large number of construction-domain bilingual translation samples.
The method of the invention for constructing a multi-modal bilingual parallel corpus for construction engineering specifically comprises the following steps:
(1) corpus screening: raw corpora are obtained by network download, scanning and recognition, manual entry, and web crawling; the main sources are English-Chinese bilingual works on construction formally published by national-level publishers, official government documents and reports, officially certified materials, and audio, video, drawings, pictures and the like from formal conferences of the construction industry;
(2) corpus extraction and proofreading: multi-modal construction engineering information (pictures, charts, drawings, video, audio, text and the like) is collected using modern imaging technology, then mined and organized; the raw corpora on the server are then proofread through add, delete, modify and query operations, cleaned, and stored after proofreading, and the bilingual corpora are aligned in the Tmxmall software in a sentence-alignment mode based mainly on paragraphs;
(3) corpus segmentation and alignment: the sentence segments aligned in step (2) are segmented so that each bilingual parallel sentence pair occupies no more than four lines in the Word document view;
(4) denoising: sentences or paragraphs with inaccurate translations are corrected by manual noise reduction, entered manually, and stored in the corpus, to ensure accurate corpus matching during computer-aided translation;
(5) labeling and transcription: a sound and sufficient data mining scheme is designed according to the research object and research needs; different annotation layers are established in annotation software and the materials are annotated from different perspectives, for example by annotating construction contract corpora; multi-modal corpus annotation and retrieval software is used to present the transcribed content synchronously with the audio and video, and to output results in several forms such as text, audio and video;
(6) obtaining the parallel corpus: the recognized text is machine translated in sequence and corrected by manual translation to obtain the parallel corpus;
(7) corpus updating and expansion: corpus updating is controlled by an updating unit, which pops up recommended entries and their recommendation weights at irregular intervals, and entries are written into the corpus according to these weights; the recommendation weight is the number of times the recommended word or sentence has popped up, for example a word that has popped up 5 times has a weight of 5; when the weight exceeds 10 the word is written into the corpus, thereby updating and expanding the corpus.
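The weight rule in step (7) can be illustrated with a minimal Python sketch. The class name `UpdateUnit`, its method `recommend`, and the in-memory set standing in for the corpus are hypothetical names chosen for illustration; only the rule itself (weight equals the number of pop-ups, write once the weight exceeds 10) comes from the text.

```python
from collections import Counter

WRITE_THRESHOLD = 10  # from the text: an entry is written once its weight exceeds 10


class UpdateUnit:
    """Hypothetical updating unit: counts pop-ups per recommended entry."""

    def __init__(self, threshold=WRITE_THRESHOLD):
        self.threshold = threshold
        self.weights = Counter()  # entry -> number of times it has popped up
        self.corpus = set()       # stand-in for the real corpus store

    def recommend(self, entry):
        """Record one pop-up of `entry`; return True if it is in the corpus."""
        self.weights[entry] += 1
        if self.weights[entry] > self.threshold:
            self.corpus.add(entry)
        return entry in self.corpus
```

With this sketch an entry is first written on its eleventh pop-up, because the text says the weight must exceed 10 rather than reach it.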
In the corpus screening of step (1), the web crawler uses the selenium web-testing package for Python as the basis of the crawler library. It crawls the external links of download pages for files in the relevant field through third-party websites such as Baidu Scholar, visits those links in turn, and downloads the files by simulating clicks on page elements. The file formats are then converted, redundant and erroneous information is cleaned, and the corresponding structural information is extracted. Finally, the converted text is segmented into words, stop words are removed, and sections without semantic content are filtered out, producing a base text that can be analyzed.
In step (1), the national-level publishers include the China Building Industry Press and the like; the official government documents and reports include conference communiqués and the like; the officially certified materials include contract texts and the like; and the other materials include audio, video, drawings and pictures from formal conferences of the construction industry. The fields mainly covered by the corpus include green building, construction theory, building standards, construction contracts, building materials, urban planning and the like.
In step (2), a text-tidying tool on the computer is used to find and replace full-width characters and digits that do not conform to English text conventions, full-width spaces, and redundant line breaks, so that non-standard symbols and formatting are normalized, the raw corpora are cleaned, and the text is kept tidy.
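The text-processing tail of this pipeline (word segmentation, stop-word removal, filtering of sections without semantic content) might look like the following sketch. It is not the patented implementation: it assumes English text, uses a simple regular-expression tokenizer, a tiny illustrative stop-word list, and a hypothetical rule that a section with fewer than `MIN_CONTENT_TOKENS` content words carries no semantic content. The selenium download step is omitted because it depends on a live browser and on the target site's page structure.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for"}  # illustrative only
MIN_CONTENT_TOKENS = 2  # hypothetical cut-off for a section without semantic content


def tokenize(section):
    """Lower-case a section and split it into (optionally hyphenated) words."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", section.lower())


def build_base_text(raw):
    """Return one token list per retained section (sections are blank-line separated)."""
    base = []
    for section in re.split(r"\n\s*\n", raw):
        tokens = [t for t in tokenize(section) if t not in STOP_WORDS]
        if len(tokens) >= MIN_CONTENT_TOKENS:
            base.append(tokens)
    return base
```

Chinese text would need a dedicated word segmenter in place of `tokenize`; the source does not name one.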
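As an illustration of this clean-up, the sketch below converts full-width characters to their half-width equivalents and removes redundant spaces and line breaks. It is a stand-in for the unnamed text-tidying tool, not a description of it; the function names are hypothetical, and it is meant for the English side only, since Chinese text legitimately uses full-width punctuation.

```python
import re

# Full-width ASCII variants occupy U+FF01..U+FF5E; the full-width space is U+3000.
FULLWIDTH_OFFSET = 0xFEE0


def to_halfwidth(text):
    """Map full-width letters, digits, punctuation and spaces to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - FULLWIDTH_OFFSET))
        else:
            out.append(ch)
    return "".join(out)


def tidy_english(text):
    """Normalize width, collapse repeated spaces, and drop redundant line breaks."""
    text = to_halfwidth(text)
    text = re.sub(r"[ \t]+", " ", text)   # collapse runs of spaces and tabs
    text = re.sub(r" *\n *", "\n", text)  # trim spaces around line breaks
    text = re.sub(r"\n{2,}", "\n", text)  # keep a single line break
    return text.strip()
```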
In step (3), each bilingual parallel sentence pair is kept to no more than four lines in the Word document view by the following steps:
S31: first, the characters in the first to third lines of the text to be translated are examined and the last period, comma or semicolon in the third line is sought; if one is found, a line break is inserted at that mark and step S34 is executed; if not, step S32 is executed;
S32: the last period, comma or semicolon in the second line is sought; if one is found, a line break is inserted at that mark and step S34 is executed; if none is found in the second line, step S33 is executed;
S33: the first line is examined in the same way, a line break is inserted at the last period, comma or semicolon in the first line, and step S34 is executed;
S34: the first to third lines of the text following the line break are examined in turn and steps S31-S33 are repeated, thereby segmenting the corpus sentence segments and ensuring that each bilingual parallel sentence pair occupies no more than four lines in the Word document view.
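Steps S31-S34 can be sketched in Python under several stated assumptions. A visual line is modeled as a fixed number of characters (`LINE_WIDTH` is hypothetical; real wrapping in Word depends on font and page settings). The mark sought in each line is taken to be the last period, comma or semicolon in that line. A hard break at the end of the third line is used when no mark is found in any of the three lines, a case the source does not address.

```python
PUNCTUATION = ".,;\u3002\uff0c\uff1b"  # ASCII marks plus their Chinese counterparts
LINE_WIDTH = 40  # hypothetical number of characters per visual line


def split_segments(text, width=LINE_WIDTH):
    """Split text into segments that each fit within three visual lines."""
    segments = []
    rest = text.strip()
    while len(rest) > 3 * width:
        window = rest[: 3 * width]
        cut = None
        # S31-S33: look in line 3 first, then line 2, then line 1.
        for line_no in (2, 1, 0):
            line = window[line_no * width:(line_no + 1) * width]
            idx = max(line.rfind(mark) for mark in PUNCTUATION)
            if idx != -1:
                cut = line_no * width + idx + 1  # break just after the mark
                break
        if cut is None:
            cut = 3 * width  # assumption: hard break when no mark is found
        segments.append(rest[:cut].strip())
        rest = rest[cut:].strip()  # S34: repeat on the remaining text
    if rest:
        segments.append(rest)
    return segments
```

Because every segment is cut within the first three lines, each one stays inside the four-line limit stated in step (3).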
The beneficial effects of the invention are as follows. Compared with the prior art, the method provides abundant comparison samples for construction vocabulary, and the meanings retrieved for a word or syntactic pattern are all related to construction, so irrelevant senses are excluded. It provides users with a large number of bilingual translation samples, giving teachers many examples for teaching, which improves teaching quality, and benefiting students in study and research outside class. It provides abundant professional text data that can serve as a reference for further teaching, learning, research and practice. It also provides a more specialized translation platform for the construction field: the corpus is authoritative in source, finely segmented and highly domain-specific, so it aligns closely with the industry and improves corpus matching, thereby meeting the needs of construction engineering.
Drawings
FIG. 1 is an exemplary diagram of translation results according to the present invention;
FIGS. 2 and 3 are diagrams illustrating the multi-modal translation of materials according to the present invention;
FIG. 4 is an exemplary diagram of a translation result of an online translation platform with a high market share;
FIG. 5 is a diagram illustrating the result of the corpus + CAT translation of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The invention is further illustrated by the following specific examples.
Example:
The method for constructing a multi-modal bilingual parallel corpus for construction engineering according to this embodiment specifically comprises the following steps:
(1) corpus screening: raw corpora are obtained by network download, scanning and recognition, manual entry, and web crawling; the main sources are English-Chinese bilingual works on construction formally published by national-level publishers, official government documents and reports, officially certified materials, and audio, video, drawings, pictures and the like from formal conferences of the construction industry;
(2) corpus extraction and proofreading: multi-modal construction engineering information (pictures, charts, drawings, video, audio, text and the like) is collected using modern imaging technology, then mined and organized; the raw corpora on the server are then proofread through add, delete, modify and query operations, cleaned, and stored after proofreading, and the bilingual corpora are aligned in the Tmxmall software in a sentence-alignment mode based mainly on paragraphs;
(3) corpus segmentation and alignment: the sentence segments aligned in step (2) are segmented so that each bilingual parallel sentence pair occupies no more than four lines in the Word document view;
(4) denoising: sentences or paragraphs with inaccurate translations are corrected by manual noise reduction, entered manually, and stored in the corpus, to ensure accurate corpus matching during computer-aided translation;
(5) labeling and transcription: a sound and sufficient data mining scheme is designed according to the research object and research needs; different annotation layers are established in annotation software and the materials are annotated from different perspectives, for example by annotating construction contract corpora; multi-modal corpus annotation and retrieval software is used to present the transcribed content synchronously with the audio and video, and to output results in several forms such as text, audio and video;
(6) obtaining the parallel corpus: the recognized text is machine translated in sequence and corrected by manual translation to obtain the parallel corpus;
(7) corpus updating and expansion: corpus updating is controlled by an updating unit, which pops up recommended entries and their recommendation weights at irregular intervals, and entries are written into the corpus according to these weights; the recommendation weight is the number of times the recommended word or sentence has popped up, for example a word that has popped up 5 times has a weight of 5; when the weight exceeds 10 the word is written into the corpus, thereby updating and expanding the corpus.
In the corpus screening of step (1) of this embodiment, the web crawler uses the selenium web-testing package for Python as the basis of the crawler library. It crawls the external links of download pages for files in the relevant field through third-party websites such as Baidu Scholar, visits those links in turn, and downloads the files by simulating clicks on page elements. The file formats are then converted, redundant and erroneous information is cleaned, and the corresponding structural information is extracted. Finally, the converted text is segmented into words, stop words are removed, and sections without semantic content are filtered out, producing a base text for analysis.
In step (1), the national-level publishers include the China Building Industry Press and the like; the official government documents and reports include conference communiqués and the like; the officially certified materials include contract texts and the like; and the other materials include audio, video, drawings, pictures and the like from formal conferences of the construction industry. The fields mainly covered by the corpus include green building, construction theory, building standards, construction contracts, building materials, urban planning and the like.
In step (2), a text-tidying tool on the computer is used to find and replace full-width characters and digits that do not conform to English text conventions, full-width spaces, and redundant line breaks, so that non-standard symbols and formatting are normalized, the raw corpora are cleaned, and the text is kept tidy.
In step (3) of this embodiment, each bilingual parallel sentence pair is kept to no more than four lines in the Word document view by the following steps:
S31: first, the characters in the first to third lines of the text to be translated are examined and the last period, comma or semicolon in the third line is sought; if one is found, a line break is inserted at that mark and step S34 is executed; if not, step S32 is executed;
S32: the last period, comma or semicolon in the second line is sought; if one is found, a line break is inserted at that mark and step S34 is executed; if none is found in the second line, step S33 is executed;
S33: the first line is examined in the same way, a line break is inserted at the last period, comma or semicolon in the first line, and step S34 is executed;
S34: the first to third lines of the text following the line break are examined in turn and steps S31-S33 are repeated, thereby segmenting the corpus sentence segments and ensuring that each bilingual parallel sentence pair occupies no more than four lines in the Word document view.
The corpus of this embodiment is provided in TMX and TXT formats that can be imported directly into CAT software, and can also be provided in a human-readable Excel format.
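For readers unfamiliar with TMX, the following sketch writes aligned sentence pairs as a small TMX 1.4 document using only the Python standard library. The function name `build_tmx`, the header values (such as the tool name `corpus-sketch`), and the default language codes are illustrative assumptions; the source does not say how its TMX files are generated.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="en", tgt_lang="zh"):
    """Return a TMX string with one translation unit per (source, target) pair."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "corpus-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

Each `tu` element holds one aligned pair, which is the unit a CAT tool matches against when the file is loaded as a translation memory.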
When the multi-modal bilingual parallel corpus for construction engineering is used, the corpus is imported into computer-aided translation software as a translation memory (TM); when the text being translated contains a sentence pair or glossary term that corresponds to the corpus, the software matches it automatically, which safeguards translation quality for the construction industry and saves a great deal of time. Translation follows a corpus + CAT approach that integrates CAT and machine translation (MT), establishing a "CAT + MT + post-editing (PE)" workflow that improves efficiency while ensuring quality. A translation accuracy confidence is set for the corpus and CAT. The core of CAT is translation memory: when identical or similar phrases appear, the system automatically prompts the user with the closest translation in the memory, and the user adopts, discards or edits it as needed. Accordingly, if the translation accuracy confidence during translation is not lower than 90, the translation result is adopted; if it is lower than 90, the result is discarded and corrected by manual translation. The manual translation replaces the corpus and CAT result, is edited and entered in an editable area, and is stored in the corpus, which realizes the "CAT + MT + PE" workflow. The translation accuracy confidence is computed by a routine preset in the system for judging translation accuracy. Continuous updating and expansion keep the corpus current and sustainable, provide more corpus references for unifying terminology and translator style, and prepare a reliable resource for extracting construction terms. Because the corpus is imported into the CAT software as a translation memory, multi-modal corpora can be stored directly; once multi-modal materials have been entered into the computer, the CAT software can directly call corpora in picture, chart, drawing, video, audio and other formats for translation, teaching, research and the like.
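The adopt-or-correct rule at the threshold of 90 can be sketched as follows. The source does not say how the translation accuracy confidence is computed, so this sketch substitutes a character-level similarity ratio from Python's `difflib` as a stand-in score; `best_match`, `translate` and the `manual_translate` callback are hypothetical names.

```python
from difflib import SequenceMatcher

CONFIDENCE_THRESHOLD = 90  # from the text: adopt at 90 or above, otherwise correct manually


def best_match(source, memory):
    """Return (score, translation) for the closest stored source; score is 0-100."""
    best = (0.0, None)
    for stored_source, translation in memory.items():
        score = 100 * SequenceMatcher(None, source, stored_source).ratio()
        if score > best[0]:
            best = (score, translation)
    return best


def translate(source, memory, manual_translate):
    """Adopt the memory suggestion if confident, else store a manual correction."""
    score, suggestion = best_match(source, memory)
    if suggestion is not None and score >= CONFIDENCE_THRESHOLD:
        return suggestion
    corrected = manual_translate(source, suggestion)
    memory[source] = corrected  # the correction is written back to the memory
    return corrected
```

Writing the manual correction back into `memory` mirrors the statement that the manual translation is stored in the corpus, so the same segment is matched directly the next time it appears.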
The following works have already been entered into the corpus of this embodiment: general theory of construction environment and energy application engineering (Chinese-to-English), "green northern europe: sustainable development cities and buildings" (Chinese-to-English), FIDIC contract translation (Chinese-to-English), < ecological cities and green buildings > < Fulai construction history > < Yikan > < construction materials > < Yikan > < Johnston: one principle, various forms (Chinese-to-English), construction of the opposite climax shishiyasu (Chinese-to-English), design concept (English-to-Chinese), urban sustainable development principle (Chinese-to-English), construction from concept to construction 2 (Chinese-to-English), sustainable development city and construction design (Chinese-to-English), and international civil engineering construction contract (Chinese-to-English). In addition, the corpus of this embodiment has been used in tests on a number of translation projects. These projects involve a large amount of specialized industry knowledge in fields such as architecture and electric power, so the corpus helps to address language accuracy and the logic of scientific and technical language in specialized translation, and safeguards the efficiency and quality of translation services.
FIG. 1 shows a translation result obtained after importing the corpus into CAT software; it can be seen from FIG. 1 that the corpus of this embodiment is matched accurately in the CAT software. FIGS. 2 and 3 show the translation results of applying the multi-modal technique to a roof structure drawing in the corpus. FIGS. 4 and 5 form a comparison: comparing them shows that, for construction texts, the corpus of this embodiment gives higher matching precision and more accurate translation results.
The above embodiments are only specific examples of the present invention, and the protection scope of the present invention includes but is not limited to the product forms and styles of the above embodiments, and any suitable changes or modifications made by those skilled in the art according to the claims of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. A method for constructing a multi-modal bilingual parallel corpus for construction engineering, characterized in that the method specifically comprises the following steps:
(1) corpus screening: raw corpora are obtained by network download, scanning and recognition, manual entry, and web crawling, the raw corpora being sourced from English-Chinese bilingual works on construction formally published by national-level publishers, official government documents and reports, officially certified materials, and audio, video, drawings and pictures from formal conferences of the construction industry;
(2) corpus extraction and proofreading: multi-modal construction engineering information is collected using modern imaging technology, then mined and organized; the raw corpora on the server are then proofread through add, delete, modify and query operations, cleaned, and stored after proofreading, and the bilingual corpora are aligned in the Tmxmall software in a sentence-alignment mode based mainly on paragraphs;
(3) corpus segmentation and alignment: the sentence segments aligned in step (2) are segmented so that each bilingual parallel sentence pair occupies no more than four lines in the Word document view;
(4) denoising: sentences or paragraphs with inaccurate translations are corrected by manual noise reduction, entered manually, and stored in the corpus, to ensure accurate corpus matching during computer-aided translation;
(5) labeling and transcription: different annotation layers are established in annotation software and the materials are annotated from different perspectives; multi-modal corpus annotation and retrieval software is used to present the transcribed content synchronously with the audio and video, and to output results in the forms of text, audio and video;
(6) obtaining the parallel corpus: the recognized text is machine translated in sequence and corrected by manual translation to obtain the parallel corpus;
(7) corpus updating and expansion: corpus updating is controlled by an updating unit, which pops up recommended entries and their recommendation weights at irregular intervals; entries are written into the corpus according to the recommended entries and their recommendation weights, the recommendation weight being determined by the number of times the recommended entry has popped up;
in step (3), each bilingual parallel sentence pair is kept to no more than four lines in the Word document view by the following steps:
S31: first, the characters in the first to third lines of the text to be translated are examined and the last period, comma or semicolon in the third line is sought; if one is found, a line break is inserted at that mark and step S34 is executed; if not, step S32 is executed;
S32: the last period, comma or semicolon in the second line is sought; if one is found, a line break is inserted at that mark and step S34 is executed; if none is found in the second line, step S33 is executed;
S33: the first line is examined in the same way, a line break is inserted at the last period, comma or semicolon in the first line, and step S34 is executed;
S34: the first to third lines of the text following the line break are examined in turn and steps S31-S33 are repeated, thereby segmenting the corpus sentence segments and ensuring that each bilingual parallel sentence pair occupies no more than four lines in the Word document view;
the corpus is imported into computer-aided translation software as a translation memory, and when the text being translated contains a sentence pair or glossary entry present in the corpus, the computer-aided translation software matches it automatically; importing the corpus as a translation memory also places the multi-modal corpora directly in the repository, so that once multi-modal materials have been input into a computer, the computer-aided translation software can directly call corpora in picture, chart, drawing, video and audio formats for translation.
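Steps S31 to S34 of claim 1 can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: a visual line in Word is approximated by a fixed character count (`line_width`, a hypothetical parameter), the punctuation set `PUNCT` is assumed to cover both Chinese and ASCII marks, and the hard break used when none of the three lines contains punctuation is a fallback that the claim does not specify.

```python
# Assumed punctuation set: Chinese and ASCII period, comma, semicolon.
PUNCT = "。，；.,;"


def split_segments(text, line_width=40):
    """Split text into segments of at most three visual lines (S31-S34).

    line_width is a hypothetical stand-in for the number of characters
    that fit on one line of the Word document.
    """
    segments = []
    rest = text
    # S34: keep examining the first three lines of the remaining text.
    while len(rest) > 3 * line_width:
        cut = None
        # S31, S32, S33: inspect the third line, then the second, then the first.
        for line_no in (2, 1, 0):
            start = line_no * line_width
            line = rest[start:start + line_width]
            # Position of the last punctuation mark in this line, or -1.
            idx = max(line.rfind(p) for p in PUNCT)
            if idx != -1:
                cut = start + idx + 1  # break right after the mark
                break
        if cut is None:
            # Assumption (not in the claim): hard break after three lines.
            cut = 3 * line_width
        segments.append(rest[:cut])
        rest = rest[cut:]
    if rest:
        segments.append(rest)  # the tail already fits within three lines
    return segments
```

Because the sketch treats every ASCII period as a sentence mark, a decimal point such as the one in "3.5" could also trigger a break; a production version would presumably need a stricter test.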
2. The method for constructing a multi-modal bilingual parallel corpus of architectural engineering according to claim 1, wherein: in the corpus screening process of step (1), the selenium web testing package for Python serves as the basis of the crawler library; external links to the download pages of files in the relevant fields are crawled via the third-party website Baidu Academic; the external links are then accessed uniformly and the files are downloaded by simulating clicks on page elements; the formats of the files are converted, redundant and erroneous information is cleaned, and the corresponding structural information is extracted; the converted text is then word-segmented, stop words are removed and passages without semantic content are filtered out, so as to construct a base text for analysis.
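The text-preparation part of this claim (cleaning, word segmentation and stop-word removal) can be illustrated with a short sketch. Crawling, downloading and the filtering of passages without semantic content are omitted. The sketch assumes English input, a regular-expression tokenizer and a tiny illustrative stop-word set; `build_base_text` and `STOP_WORDS` are hypothetical names, and Chinese text would need a dedicated word segmenter instead.

```python
import re

# Illustrative stop-word set; a real list would be far longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def build_base_text(raw):
    """Clean a converted document and return its content tokens."""
    # Remove markup left over from format conversion.
    text = re.sub(r"<[^>]+>", " ", raw)
    # Collapse runs of whitespace, including stray line breaks.
    text = re.sub(r"\s+", " ", text).strip()
    # Word segmentation: alphabetic words, keeping internal hyphens.
    tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text.lower())
    # Stop-word removal.
    return [t for t in tokens if t not in STOP_WORDS]
```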
3. The method for constructing a multi-modal bilingual parallel corpus of architectural engineering according to claim 1, wherein: the national-level publishers in step (1) include a Chinese building industry publishing house; the official government documents and reports include conference bulletins; the officially certified materials include contract texts; and the materials in other forms include audio, video, drawings and pictures of formal conferences in the building industry; the fields covered by the corpus include green building, construction theory, building standards, building contracts, building materials and city planning.
4. The method for constructing a multi-modal bilingual parallel corpus of architectural engineering according to claim 1, wherein: in step (2), text organizer software on the computer is used to tidy and replace full-width characters and numerals that do not conform to English text conventions, full-width spaces and redundant line breaks, so that non-standard symbols and formatting in the text are tidied, noise is removed from the original corpus, and the text is kept clean.
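The kind of clean-up described in this claim can be sketched in a few lines. The claim relies on existing text organizer software, so the function below is only an assumed equivalent for the English side of the corpus: it maps the Unicode full-width forms (U+FF01 to U+FF5E) and the ideographic space (U+3000) to their ASCII counterparts and collapses redundant line breaks. The name `normalize_text` is hypothetical.

```python
import re


def normalize_text(text):
    """Convert full-width characters to ASCII and drop redundant breaks."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            # Full-width (ideographic) space becomes an ordinary space.
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            # Full-width letters, digits and symbols sit at a fixed
            # offset of 0xFEE0 from their ASCII counterparts.
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    cleaned = "".join(out)
    # Strip trailing spaces, then collapse repeated line breaks.
    cleaned = re.sub(r"[ \t]+\n", "\n", cleaned)
    cleaned = re.sub(r"\n{2,}", "\n", cleaned)
    return cleaned.strip()
```

Applying the same mapping to the Chinese side would also convert full-width punctuation, which is usually not wanted there, so the sketch is meant for the English text only.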
5. The method for constructing a multi-modal bilingual parallel corpus of architectural engineering according to claim 1, wherein: the corpus formats include the TMX and TXT formats, which can be imported directly into CAT software, and a visual Excel format can also be provided.
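As an illustration of the TMX output mentioned in this claim, the sketch below writes aligned sentence pairs as a minimal TMX 1.4 document using Python's standard `xml.etree.ElementTree` module. The function name `to_tmx`, the header values (for example the tool name `corpus-sketch`) and the language codes are assumptions made for the example; the patent does not specify them.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def to_tmx(pairs, src_lang="zh-CN", tgt_lang="en-US"):
    """Serialize (source, target) sentence pairs as a TMX 1.4 string."""
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes required by TMX; the values are illustrative.
    ET.SubElement(tmx, "header", {
        "creationtool": "corpus-sketch",
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en-US",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        # One translation unit per aligned sentence pair.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

A file produced this way would still need an XML declaration and, depending on the CAT tool, a DOCTYPE line before import; those details are left out of the sketch.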
CN201910323653.6A 2019-04-22 2019-04-22 Construction method of multi-modal bilingual parallel corpus of construction engineering Active CN110046261B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910323653.6A CN110046261B (en) 2019-04-22 2019-04-22 Construction method of multi-modal bilingual parallel corpus of construction engineering

Publications (2)

Publication Number Publication Date
CN110046261A (en) 2019-07-23
CN110046261B (en) 2022-01-21

Family

ID=67278357

Country Status (1)

Country Link
CN (1) CN110046261B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543844A (en) * 2019-08-26 2019-12-06 中电科大数据研究院有限公司 metadata extraction method for government affair metadata PDF file
CN110889295B (en) * 2019-09-12 2021-10-01 华为技术有限公司 Machine translation model, and method, system and equipment for determining pseudo-professional parallel corpora
CN110942765B (en) * 2019-11-11 2022-05-27 珠海格力电器股份有限公司 Method, device, server and storage medium for constructing corpus
CN111221965A (en) * 2019-12-30 2020-06-02 成都信息工程大学 Classification sampling detection method based on bilingual corpus of public identification words
CN111241784A (en) * 2019-12-30 2020-06-05 成都理工大学 Processing and sorting method for language material resources of public identification languages
CN111209461A (en) * 2019-12-30 2020-05-29 成都理工大学 Bilingual corpus collection system based on public identification words
CN112016604B (en) * 2020-08-19 2021-03-26 华东师范大学 Zero-resource machine translation method applying visual information
CN112085985B (en) * 2020-08-20 2022-05-10 安徽七天网络科技有限公司 Student answer automatic scoring method for English examination translation questions
CN114626390A (en) * 2020-12-12 2022-06-14 郑州宝冶钢结构有限公司 Method for improving translation efficiency based on steel structure engineering parallel corpus
CN113268980A (en) * 2021-04-29 2021-08-17 赵天诚 Text recognition method and device, terminal equipment and storage medium
CN115423578B (en) * 2022-09-01 2023-12-05 广东博成网络科技有限公司 Bid bidding method and system based on micro-service containerized cloud platform
CN115688811A (en) * 2022-09-20 2023-02-03 甲骨易(北京)语言科技股份有限公司 Corpus alignment method combining rules and semantics
CN118170933B (en) * 2024-05-13 2024-08-13 之江实验室 Construction method and device of multi-mode corpus data oriented to scientific field

Citations (6)

Publication number Priority date Publication date Assignee Title
US8825466B1 (en) * 2007-06-08 2014-09-02 Language Weaver, Inc. Modification of annotated bilingual segment pairs in syntax-based machine translation
CN104657351A (en) * 2015-02-12 2015-05-27 中国科学院软件研究所 Method and device for processing bilingual alignment corpora
CN105005561A (en) * 2015-07-07 2015-10-28 刘改琳 Bilingual retrieval statistical translation system based on corpus
CN105068997A (en) * 2015-07-15 2015-11-18 清华大学 Parallel corpus construction method and device
CN106066870A (en) * 2016-05-27 2016-11-02 南京信息工程大学 A kind of bilingual teaching mode constructing system of linguistic context mark
CN106919689B (en) * 2017-03-03 2018-05-11 中国科学技术信息研究所 Professional domain knowledge mapping dynamic fixing method based on definitions blocks of knowledge

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
US5611076A (en) * 1994-09-21 1997-03-11 Micro Data Base Systems, Inc. Multi-model database management system engine for databases having complex data models
CN101101752B (en) * 2007-07-19 2010-12-01 华中科技大学 Monosyllabic language lip-reading recognition system based on vision character
US8600730B2 (en) * 2011-02-08 2013-12-03 Microsoft Corporation Language segmentation of multilingual texts
US20130332450A1 (en) * 2012-06-11 2013-12-12 International Business Machines Corporation System and Method for Automatically Detecting and Interactively Displaying Information About Entities, Activities, and Events from Multiple-Modality Natural Language Sources
CN104408078B (en) * 2014-11-07 2019-02-12 北京第二外国语学院 A kind of bilingual Chinese-English parallel corpora base construction method based on keyword
CN105843802A (en) * 2016-03-31 2016-08-10 长安大学 Corpus intervention module and method in translation

Non-Patent Citations (2)

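Title
"互联网+背景下多模态、多语种外交话语平行语料库设计与创建探析" [An analysis of the design and creation of a multi-modal, multilingual parallel corpus of diplomatic discourse in the Internet Plus context]; Yang Mingxing et al.; 《外语教学》 (Foreign Language Education); 2018-11-10; Vol. 39, No. 6; p. 16, section 3.4; p. 17, sections 3.4-3.6 *
"建筑双语平行语料库构建及其对MTI学生思辨能力的开发" [Construction of an architectural bilingual parallel corpus and its development of MTI students' critical thinking ability]; Li Jiakun et al.; 《沈阳建筑大学学报》 (Journal of Shenyang Jianzhu University); 2018-10-15; Vol. 20, No. 5; p. 531, right column, paragraphs 3-4; p. 532 *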

Similar Documents

Publication Publication Date Title
CN110046261B (en) Construction method of multi-modal bilingual parallel corpus of construction engineering
CN109670191B (en) Calibration optimization method and device for machine translation and electronic equipment
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
WO2022179149A1 (en) Machine translation method and apparatus based on translation memory
Vel Pre-processing techniques of text mining using computational linguistics and python libraries
CN108287911A (en) A kind of Relation extraction method based on about fasciculation remote supervisory
CN101464856A (en) Alignment method and apparatus for parallel spoken language materials
JP4304268B2 (en) Third language text generation algorithm, apparatus, and program by inputting bilingual parallel text
CN112836525A (en) Human-computer interaction based machine translation system and automatic optimization method thereof
CN116468009A (en) Article generation method, apparatus, electronic device and storage medium
CN112257442A (en) Policy document information extraction method based on corpus expansion neural network
Jindal et al. Building english-punjabi parallel corpus for machine translation
Gamal et al. Survey of arabic machine translation, methodologies, progress, and challenges
CN117473971A (en) Automatic generation method and system for bidding documents based on purchasing text library
Li Key Technologies for Constructing Bilingual Corpus for English-Chinese Translation [J]
CN106776590A (en) A kind of method and system for obtaining entry translation
Ma et al. Corpus Support for Machine Translation at LDC.
Zhang Research on English machine translation system based on the internet
CN112488593B (en) Auxiliary bid evaluation system and method for bidding
CN114564638A (en) News collection and automatic extraction method based on depth map neural network
Fan Application of computer aided translation in technical English manual
Meurant et al. Modelling a parallel corpus of french and french belgian sign language
NZUANKE et al. Technology and translation: Areas of convergence and divergence between machine translation and computer-assisted translation
CN111046182B (en) Method for constructing emotion annotation corpus of internet short texts
NL2030655B1 (en) Method for building multi-modal bilingual parallel corpus of architectural engineering

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Gao Jinling, Zhang Congying, Wang Haifeng, Ding Mei, Bao Yuping, Gao Jiyun, Zhang Xiaohong, Wang Wei
Inventor before: Zhang Xiaohong, Wang Wei, Zhang Congying, Ding Mei, Gao Jinling, Bao Yuping
GR01 Patent grant