CN110046261B - Construction method of multi-modal bilingual parallel corpus of construction engineering - Google Patents

Info

Publication number
CN110046261B
Authority
CN
China
Prior art keywords
corpus
translation
construction
text
bilingual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910323653.6A
Other languages
Chinese (zh)
Other versions
CN110046261A (en
Inventor
高金岭
张聪颖
王海峰
丁玫
鲍玉平
高吉云
张晓红
王薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jianzhu University
Original Assignee
Shandong Jianzhu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Jianzhu University filed Critical Shandong Jianzhu University
Priority to CN201910323653.6A priority Critical patent/CN110046261B/en
Publication of CN110046261A publication Critical patent/CN110046261A/en
Application granted granted Critical
Publication of CN110046261B publication Critical patent/CN110046261B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of data processing, and in particular relates to a method for constructing a multi-modal bilingual parallel corpus for construction engineering. The method comprises six steps: corpus screening; corpus extraction and proofreading; corpus segmentation and alignment; denoising; obtaining the parallel corpus; and corpus updating and expansion. The corpus provides abundant comparison samples for construction vocabulary. Because segmentation is fine-grained and precision is high, the word and syntax meanings retrieved are all relevant to the construction field, irrelevant meanings are eliminated, and users obtain a large number of construction-domain bilingual translation samples.

Description

Construction method of multi-modal bilingual parallel corpus of construction engineering
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a construction method of a multi-modal bilingual parallel corpus of construction engineering.
Background
Construction English is the combination of the construction industry and English, and covers every stage of the industry, such as prequalification, bidding, construction and quality evaluation. In register, construction English texts are scientific and technical texts with their own specialized vocabulary and conventions of expression: the mode of discourse is written and the tenor is formal. As China's share of the overseas construction market grows and the domestic and overseas markets become more closely connected, construction English is used ever more widely and translations of it appear in large numbers. Corpus linguistics offers concepts and methods that serve as tools for research on, and teaching of, construction English translation.
Computer-aided translation (CAT) means that, while a translator works, the software continuously and automatically stores the translator's translations in the background to build a database. When the same or a similar phrase or segment appears later, the system automatically retrieves the matching content from the database and offers it as a reference translation, which avoids repeated translation work. Combining a multi-modal construction engineering corpus with CAT can therefore greatly improve translation efficiency. However, several problems remain. Specialized construction corpora are extremely rare in China and abroad, and no multi-modal construction corpus has existed so far. Existing construction corpora undergo little collection, organization and proofreading, or no proofreading at all, so their format and content are not standardized. The corpus sources are not authoritative enough: some corpora gather all kinds of web text indiscriminately, so they are noisy and impure and cannot really be used in CAT software. Finally, most parallel corpora are aligned at paragraph level, whereas in translation sentences are the most valuable reference, followed by segments, phrases and terms, so whole-paragraph matching gives low accuracy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a method for constructing a multi-modal bilingual parallel corpus for construction engineering in which segmentation is fine-grained and precision is high, and in which the word and syntax meanings retrieved all belong to the construction field, so that irrelevant meanings are eliminated and users are provided with a large number of construction-domain bilingual translation samples.
The invention relates to a construction method of a multi-modal bilingual parallel corpus of construction engineering, which specifically comprises the following steps:
(1) Corpus screening: original corpora are obtained by network downloading, scanning and recognition, manual input and web crawling. The main sources are English-Chinese bilingual works on construction formally published by national-level publishing houses, government official document reports, official certification materials, and audio, video, drawings, pictures and the like from formal conferences of the construction industry;
(2) Corpus extraction and proofreading: multi-modal construction engineering information (pictures, charts, drawings, video, audio, text and the like) is collected using modern imaging technology, then mined and structured. The original corpora on the server are then proofread through add, delete, modify and query operations, the data are cleaned and unwanted content is removed, and the corpora are saved once proofread without error. The bilingual corpora are then aligned in the Tmxmall software, producing a mainly paragraph-based sentence alignment;
(3) Corpus segmentation and alignment: the segments aligned in step (2) are split so that each bilingual parallel sentence pair occupies no more than four lines in the Word document as displayed;
(4) Denoising: noise is reduced manually. Sentences or paragraphs whose translations are inaccurate are corrected, entered by hand and saved to the corpus, which ensures exact matching of the corpus during computer-aided translation;
(5) Annotation and transcription: a reasonable and sufficient data-mining scheme is designed according to the research object and research needs; different annotation layers are created in the annotation software, and the corpora are annotated from different perspectives and aspects, for example annotation of construction contract corpora. Multi-modal corpus annotation and retrieval software presents the transcribed content, audio and video synchronously, and supports output of results as text, audio and video;
(6) Obtaining the parallel corpus: the recognized text is machine-translated in sequence and then corrected by human translation to obtain the parallel corpus;
(7) Corpus updating and expansion: updating is controlled by an update unit, which pops up recommended entries and their recommendation weights at irregular intervals and writes entries into the corpus according to those weights. The recommendation weight of an entry (a word or sentence) is the number of times it has popped up: for example, a word that has popped up 5 times has a weight of 5. When the weight exceeds 10, the entry is written into the corpus, which updates and expands the corpus.
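The weighting rule of step (7) can be illustrated with a minimal Python sketch. The class name `UpdateUnit` and its methods are hypothetical and are not taken from the patent; only the rule itself (weight equals the number of pop-ups, write when the weight exceeds 10) comes from the text.

```python
WRITE_THRESHOLD = 10  # from the text: an entry is written once its weight exceeds 10


class UpdateUnit:
    """Hypothetical update unit that counts pop-ups and promotes entries."""

    def __init__(self, threshold=WRITE_THRESHOLD):
        self.threshold = threshold
        self.weights = {}    # recommended entry -> number of pop-ups so far
        self.corpus = set()  # entries already written into the corpus

    def pop_up(self, entry):
        """Record one pop-up of a recommended entry and return its new weight."""
        weight = self.weights.get(entry, 0) + 1
        self.weights[entry] = weight
        # The weight must exceed the threshold, so the 11th pop-up triggers the write.
        if weight > self.threshold:
            self.corpus.add(entry)
        return weight
```

Under this reading an entry with a weight of exactly 10 is still pending; whether the real update unit also resets or decays weights is not stated in the source.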
In the corpus screening of step (1), the web crawler is built on the selenium web-testing package for Python. It first crawls third-party websites such as Baidu Scholar for the external links to download pages of files in the relevant fields, then visits those links in turn and downloads the files by simulating clicks on page elements. The file formats are converted, redundant and erroneous information is cleaned out, and the corresponding structural information is extracted. The converted text is then word-segmented, stop words are removed and paragraphs without semantic content are filtered out, yielding a base text that can be analyzed.
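The last part of this pipeline (word segmentation, stop-word removal and filtering of paragraphs without semantic content) might look like the following sketch for English text. The crawling and format conversion are not shown. The stop-word list, the `MIN_TOKENS` cut-off and the function names are assumptions made for illustration; Chinese text would need a dedicated word segmenter instead of the regular expression used here.

```python
import re

# Assumption: a tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}
# Assumption: a paragraph with fewer content words than this counts as semantic-free.
MIN_TOKENS = 3


def tokenize(text):
    """Lower-case the text and split it into alphabetic (optionally hyphenated) words."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def build_base_text(raw):
    """Return one list of content words per paragraph that survives filtering."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", raw)]
    result = []
    for paragraph in paragraphs:
        tokens = [t for t in tokenize(paragraph) if t not in STOP_WORDS]
        if len(tokens) >= MIN_TOKENS:
            result.append(tokens)
    return result
```

Residue such as a stray page label is dropped because it yields too few content words, while ordinary sentences pass through as token lists.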
In step (1), the national-level publishing houses include China Architecture & Building Press and the like; the government official document reports include conference communiqués and the like; the official certification materials include contract texts and the like; and materials in other forms include audio, video, drawings and pictures from formal conferences of the construction industry. The fields mainly covered by the corpus include green building, construction theory, construction tenders, construction contracts, building materials and urban planning.
In step (2), the "text organizer" software on the computer tidies and replaces full-width characters and numerals that do not conform to English text conventions, full-width spaces and redundant carriage returns. This normalizes the non-standard symbols and formatting in the text, accomplishes the data cleaning and removal of the original corpora, and keeps the text clean.
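The source names a "text organizer" tool but does not describe its internals. As a rough sketch of the same kind of cleaning, the following Python maps full-width characters to their half-width equivalents and removes redundant line breaks. The function names are hypothetical, and the code is meant for the English side only, since Chinese text legitimately uses full-width punctuation.

```python
import re

FULLWIDTH_SPACE = "\u3000"


def to_halfwidth(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and U+3000 to half-width."""
    chars = []
    for ch in text:
        code = ord(ch)
        if ch == FULLWIDTH_SPACE:
            chars.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            # Full-width forms sit at a fixed offset of 0xFEE0 from ASCII.
            chars.append(chr(code - 0xFEE0))
        else:
            chars.append(ch)
    return "".join(chars)


def clean_english_text(text):
    """Normalize width, spacing and line breaks in an English segment."""
    text = to_halfwidth(text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[ \t]+", " ", text)   # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)  # trim spaces around line breaks
    text = re.sub(r"\n{2,}", "\n", text)  # drop redundant line breaks
    return text.strip()
```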
In step (3), each bilingual parallel sentence pair is kept to no more than four lines in the Word document as follows:
S31: first, the characters in the first to third lines of the text to be translated are counted, and the third line is searched for its last period, comma or semicolon. If one is found, a line break is inserted at that punctuation mark and step S34 is executed; if not, step S32 is executed;
S32: the second line is searched for its last period, comma or semicolon. If one is found, a line break is inserted there and step S34 is executed; if not, step S33 is executed;
S33: the first line is searched in the same way, a line break is inserted at its last period, comma or semicolon, and step S34 is executed;
S34: the first to third lines of the text following the line break are examined and steps S31 to S33 are repeated. This segments the corpus effectively and ensures that each bilingual parallel sentence pair occupies no more than four lines in the Word document.
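Steps S31 to S34 can be sketched in Python as below. Several points are assumptions rather than part of the patent: a displayed Word line is modelled as a fixed number of characters (`LINE_WIDTH`), Chinese punctuation is added to the break characters, and when none of the three lines contains a break character the three-line window is cut off as it stands, a case the source does not address.

```python
BREAK_CHARS = ".,;\u3002\uff0c\uff1b"  # period, comma, semicolon (English and Chinese)
LINE_WIDTH = 40  # assumption: characters per displayed line


def find_break(window, line_width):
    """Return the index of the break chosen by S31-S33, or -1 if there is none."""
    for line_no in (2, 1, 0):  # third line first, then second, then first
        start = line_no * line_width
        line = window[start:start + line_width]
        last = max(line.rfind(c) for c in BREAK_CHARS)
        if last != -1:
            return start + last
    return -1


def split_segments(text, line_width=LINE_WIDTH):
    """Repeat S31-S33 on the remaining text until it is used up (S34)."""
    segments = []
    rest = text.strip()
    while rest:
        window = rest[:3 * line_width]
        cut = find_break(window, line_width)
        if cut == -1:
            cut = len(window) - 1  # assumption: hard cut when no punctuation is found
        segments.append(rest[:cut + 1].strip())
        rest = rest[cut + 1:].lstrip()
    return segments
```

Every segment is at most three modelled lines long, which keeps it within the four-line limit. The sketch treats any period as a break point, so a decimal such as "3.5" could be split; a production version would need to guard against that.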
The beneficial effects of the invention are as follows. Compared with the prior art, the construction method provides abundant comparison samples for construction vocabulary, and the word and syntax meanings retrieved are all related to construction, so irrelevant meanings are eliminated. It provides users with a large number of bilingual translation samples, giving teachers many examples to use in class, which improves teaching quality, and benefiting students in their extracurricular study and research. It provides users with rich professional text data that can serve as a reference for further teaching, learning, research and practice. It also gives the construction field a more specialized translation platform: the corpus is authoritative in its sources, finely segmented and highly specialized, can be matched precisely to the industry, and improves the corpus match rate, thereby meeting the needs of construction engineering.
Drawings
FIG. 1 is an exemplary diagram of translation results according to the present invention;
FIGS. 2 and 3 are diagrams illustrating the multi-modal translation of materials according to the present invention;
FIG. 4 is an exemplary diagram of a translation result of an online translation platform with a high market share;
FIG. 5 is a diagram illustrating the result of the corpus + CAT translation of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The invention is further illustrated by the following specific examples.
Example:
The construction method of the multi-modal bilingual parallel corpus of construction engineering according to this embodiment specifically comprises the following steps:
(1) Corpus screening: original corpora are obtained by network downloading, scanning and recognition, manual input and web crawling. The main sources are English-Chinese bilingual works on construction formally published by national-level publishing houses, government official document reports, official certification materials, and audio, video, drawings, pictures and the like from formal conferences of the construction industry;
(2) Corpus extraction and proofreading: multi-modal construction engineering information (pictures, charts, drawings, video, audio, text and the like) is collected using modern imaging technology, then mined and structured. The original corpora on the server are then proofread through add, delete, modify and query operations, the data are cleaned and unwanted content is removed, and the corpora are saved once proofread without error. The bilingual corpora are then aligned in the Tmxmall software, producing a mainly paragraph-based sentence alignment;
(3) Corpus segmentation and alignment: the segments aligned in step (2) are split so that each bilingual parallel sentence pair occupies no more than four lines in the Word document as displayed;
(4) Denoising: noise is reduced manually. Sentences or paragraphs whose translations are inaccurate are corrected, entered by hand and saved to the corpus, which ensures exact matching of the corpus during computer-aided translation;
(5) Annotation and transcription: a reasonable and sufficient data-mining scheme is designed according to the research object and research needs; different annotation layers are created in the annotation software, and the corpora are annotated from different perspectives and aspects, for example annotation of construction contract corpora. Multi-modal corpus annotation and retrieval software presents the transcribed content, audio and video synchronously, and supports output of results as text, audio and video;
(6) Obtaining the parallel corpus: the recognized text is machine-translated in sequence and then corrected by human translation to obtain the parallel corpus;
(7) Corpus updating and expansion: updating is controlled by an update unit, which pops up recommended entries and their recommendation weights at irregular intervals and writes entries into the corpus according to those weights. The recommendation weight of an entry (a word or sentence) is the number of times it has popped up: for example, a word that has popped up 5 times has a weight of 5. When the weight exceeds 10, the entry is written into the corpus, which updates and expands the corpus.
In the corpus screening of step (1) of this embodiment, the web crawler is built on the selenium web-testing package for Python. It first crawls third-party websites such as Baidu Scholar for the external links to download pages of files in the relevant fields, then visits those links in turn and downloads the files by simulating clicks on page elements. The file formats are converted, redundant and erroneous information is cleaned out, and the corresponding structural information is extracted. The converted text is then word-segmented, stop words are removed and paragraphs without semantic content are filtered out, yielding a base text that can be analyzed.
In step (1), the national-level publishing houses include China Architecture & Building Press and the like; the government official document reports include conference communiqués and the like; the official certification materials include contract texts and the like; and materials in other forms include audio, video, drawings, pictures and the like from formal conferences of the construction industry. The fields mainly covered by the corpus include green building, construction theory, construction tenders, construction contracts, building materials and urban planning.
In step (2), the "text organizer" software on the computer tidies and replaces full-width characters and numerals that do not conform to English text conventions, full-width spaces and redundant carriage returns. This normalizes the non-standard symbols and formatting in the text, accomplishes the data cleaning and removal of the original corpora, and keeps the text clean.
In step (3) of this embodiment, each bilingual parallel sentence pair is kept to no more than four lines in the Word document as follows:
S31: first, the characters in the first to third lines of the text to be translated are counted, and the third line is searched for its last period, comma or semicolon. If one is found, a line break is inserted at that punctuation mark and step S34 is executed; if not, step S32 is executed;
S32: the second line is searched for its last period, comma or semicolon. If one is found, a line break is inserted there and step S34 is executed; if not, step S33 is executed;
S33: the first line is searched in the same way, a line break is inserted at its last period, comma or semicolon, and step S34 is executed;
S34: the first to third lines of the text following the line break are examined and steps S31 to S33 are repeated. This segments the corpus effectively and ensures that each bilingual parallel sentence pair occupies no more than four lines in the Word document.
The corpus described in this embodiment is available in the TMX and TXT formats, which can be imported directly into CAT software, and can also be provided in Excel format for viewing.
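As an illustration of the TMX format mentioned here, the sketch below writes aligned sentence pairs as a TMX 1.4 document using only the Python standard library. The function name `build_tmx`, the language codes and the header values (such as the tool name) are placeholders chosen for the example, not values given in the patent.

```python
import xml.etree.ElementTree as ET


def build_tmx(pairs, src_lang="zh-CN", tgt_lang="en-US"):
    """Serialize (source, target) sentence pairs as a TMX 1.4 string."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "corpus-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en-US",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {"xml:lang": lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

Each aligned pair becomes one `tu` element with one `tuv` per language, which is the unit a CAT tool matches against when the file is loaded as a translation memory.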
When the multi-modal bilingual parallel corpus of construction engineering is used, it is imported into computer-aided translation software as a translation memory (TM). When the text being translated contains a sentence pair or glossary entry that corresponds to the corpus, the software matches it automatically, which safeguards translation quality for the construction industry and saves a great deal of time.
Translation follows a "corpus + CAT" mode that integrates CAT with machine translation (MT), establishing a "CAT + MT + post-editing (PE)" workflow that improves efficiency while ensuring quality. The core of CAT is translation memory: when the same or a similar phrase appears, the system automatically suggests the closest translation in the memory, and the user adopts, discards or edits it as needed. A translation accuracy confidence is therefore set for the corpus and CAT. If the confidence of a translation is not lower than 90, the result is adopted. If it is lower than 90, the result is discarded and corrected by human translation: the human translation replaces the corpus and CAT result, is edited and entered in the editable area, and is saved to the corpus. This realizes the "CAT + MT + PE" mode. The confidence is computed by a program preset in the system for judging translation accuracy.
Continuous updating and expansion keeps the corpus current and ensures its stable continuation, supplies more reference material for unifying terminology and translator style, and prepares a reliable resource for the extraction of construction terms. Importing the corpus into CAT software as a translation memory also allows multi-modal corpora to be stored directly: once multi-modal materials have been entered into the computer, the CAT software can directly retrieve corpora in picture, chart, drawing, video, audio and similar formats for translation, teaching, research and the like.
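The adopt-or-correct rule built around the confidence value of 90 can be sketched as follows. The source does not say how the confidence is computed, so this example substitutes the similarity ratio from Python's `difflib` as a stand-in score; the function names and the `manual_translate` callback are hypothetical.

```python
from difflib import SequenceMatcher

CONFIDENCE_THRESHOLD = 90  # from the text: adopt the result if confidence is not lower than 90


def match_score(a, b):
    """Similarity on a 0-100 scale (assumption: difflib ratio stands in for the real score)."""
    return round(SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100)


def translate(source, memory, manual_translate):
    """Return (translation, score, origin) for one source segment.

    memory maps stored source segments to their translations;
    manual_translate is a callback that supplies the human correction.
    """
    best_src, best_score = None, -1
    for candidate in memory:
        score = match_score(source, candidate)
        if score > best_score:
            best_src, best_score = candidate, score
    if best_src is not None and best_score >= CONFIDENCE_THRESHOLD:
        return memory[best_src], best_score, "tm"
    # Below the threshold: discard the match, use the human translation and store it.
    corrected = manual_translate(source)
    memory[source] = corrected
    return corrected, best_score, "manual"
```

Because the human correction is written back into the memory, the same segment is served from the translation memory the next time it appears, which mirrors the feedback loop described above.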
The following works have already been entered into the corpus described in this embodiment: General Theory of Construction Environment and Energy Application Engineering (Chinese to English); Green Northern Europe: Sustainable Development Cities and Buildings (Chinese to English); FIDIC Contract Translation (Chinese to English); < ecological cities and green buildings > < Fulai construction history > < Yikan > < construction materials > < Yikan >; Johnston: One Principle, Various Forms (Chinese to English); construction of the opposite climax shishiyasu (Chinese to English); Design Concept (English to Chinese); Urban Sustainable Development Principle (Chinese to English); Construction from Concept to Construction 2 (Chinese to English); Sustainable Development City and Construction Design (Chinese to English); and International Civil Engineering Construction Contract (Chinese to English). In addition, the corpus described in this embodiment has been tested in a number of translation projects. These projects involve a large amount of specialized knowledge from industries such as architecture and electric power, and the corpus handles the linguistic accuracy and the logic of scientific and technical language required in specialized translation well, ensuring the efficiency and quality of the translation service.
This embodiment provides a translation result obtained by importing the corpus into CAT software, shown in FIG. 1; FIG. 1 shows that the corpus of this embodiment is matched exactly in the CAT software. FIGS. 2 and 3 show the translation result obtained by applying the multi-modal technique to a roof structure drawing in the corpus. FIGS. 4 and 5 form a comparison: they show that the corpus of this embodiment achieves higher matching precision and more accurate translation results on construction texts.
The above embodiments are only specific examples of the present invention, and the protection scope of the present invention includes but is not limited to the product forms and styles of the above embodiments, and any suitable changes or modifications made by those skilled in the art according to the claims of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1.一种建筑工程多模态双语平行语料库的构建方法,其特征在于:具体包括以下步骤:1. a construction method of multimodal bilingual parallel corpus of construction engineering, is characterized in that: specifically comprise the following steps: (1)语料筛选:通过网络下载、扫描识别、手工录入和网络爬虫方式获取原始语料,原始语料的来源为国家级出版社正式出版的建筑类英汉双语著作、政府公文报告、官方认证材料、建筑行业正式会议的音频、视频、图纸、图片;(1) Corpus screening: Obtain the original corpus through network download, scanning recognition, manual input and web crawling. The source of the original corpus is the bilingual works of architecture published by national publishing houses, government official documents, official certification materials, architectural Audio, video, drawings and pictures of industry formal meetings; (2)语料提取、校对:利用现代影像技术采集多模态建筑工程类信息,并对其进行挖掘、构建;然后进行校对,对服务器上的原始语料进行增、删、改、查操作,对原始语料进行数据的清洗和去除,校对无误后保存,并将双语语料在Tmxmall软件中做成以段落为主的句对齐方式;(2) Corpus extraction and proofreading: use modern imaging technology to collect multi-modal construction engineering information, mine and construct it; then proofread, add, delete, modify and check the original corpus on the server, The original corpus is cleaned and removed, proofread and saved, and the bilingual corpus is made into a paragraph-based sentence alignment in Tmxmall software; (3)语料切分、对齐:将步骤(2)中对齐后对句段进行切分,使得每一对双语平行句对,在Word的可视文档中不超过四行;(3) corpus segmentation and alignment: segment segments after alignment in step (2), so that each pair of bilingual parallel sentence pairs does not exceed four lines in the visual document of Word; (4)去噪:采用人工降噪的方式,对翻译结果不精准的语句或段落进行修改,手动录入并保存至语料库,确保了语料库在计算机辅助翻译过程中的精确匹配;(4) Denoising: Using artificial noise reduction, the sentences or paragraphs with inaccurate translation results are modified, manually entered and saved to the corpus to ensure the exact matching of the corpus in the computer-assisted translation process; (5)标注与转写:在标注软件上建立不同的标注层,从不同的视角和方面对语料进行标注;并用多模态语料库标注与检索软件,对转写内容、音频和视频进行同步呈现,同时支持文字、音频和视频多种形式结果输出;(5) Annotation and transcription: establish different annotation layers on the annotation software to 
annotate the corpus from different perspectives and aspects; and use the multimodal corpus annotation and retrieval software to simultaneously present the transcribed content, audio and video , while supporting text, audio and video results output in various forms; (6)得到平行语料库:对已识别的文字依次进行机器翻译,并采用人工翻译进行校正后得到平行语料库;(6) Obtaining a parallel corpus: machine translation is performed on the recognized characters in turn, and a parallel corpus is obtained after correction by manual translation; (7)语料库更新和扩容:语料库更新通过更新单元来控制,更新单元不定时弹出推荐词条及其推荐权重,根据推荐词条和推荐权重将推荐词条写入语料库,推荐权重是根据推荐词条弹出的次数来确定;(7) Corpus update and expansion: The update of the corpus is controlled by the update unit. The update unit pops up the recommended entries and their recommended weights from time to time, and writes the recommended entries into the corpus according to the recommended entries and recommended weights. The recommended weight is based on the recommended words. The number of times the bar pops up to determine; 所述步骤(3)中,每一对双语平行句对,在Word的可视文档中不超过四行的具体步骤如下:In described step (3), every pair of bilingual parallel sentences is paired, and the concrete steps that are no more than four rows in the visual document of Word are as follows: S31:首先统计待翻译文本第一至三行的字符,并检测第三行倒数第一个句号或逗号或分号,当检测到第三行倒数第一个句号或逗号或分号时,在句号或逗号或分号处进行换行操作,然后执行步骤S34;若未检测到,执行步骤S32;S31: First count the characters in the first to third lines of the text to be translated, and detect the penultimate period or comma or semicolon in the third line. 
When the penultimate period, comma or semicolon in the third line is detected, in Perform a line feed operation at the period, comma or semicolon, and then execute step S34; if not detected, execute step S32; S32:检测第二行倒数第一个句号或逗号或分号,在第二行倒数第一个句号或逗号或分号处进行换行操作,然后执行步骤S34;若第二行未检测到,则执行步骤S33;S32: Detect the penultimate period or comma or semicolon of the second line, perform a line break operation at the penultimate period or comma or semicolon of the second line, and then perform step S34; if the second line is not detected, then Go to step S33; S33:同样步骤检测第一行,在第一行倒数第一个句号或逗号或分号处进行换行操作,然后执行步骤S34;S33: The first line is detected in the same step, and a line feed operation is performed at the penultimate period or comma or semicolon of the first line, and then step S34 is performed; S34:继续检测待翻译文本换行后的第一至三行的字符,并重复S31-S33的步骤,有效的实现了语料句段的切分,保证了每一对双语平行句对,在Word的可视文档中不超过四行;S34: Continue to detect the characters in the first to third lines after the newline of the text to be translated, and repeat the steps of S31-S33, which effectively realizes the segmentation of the corpus segment, and ensures that each pair of bilingual parallel sentence pairs is no more than four lines in the visual document; 将所述语料库作为翻译记忆库导入计算机辅助翻译软件中,当翻译文本遇到和语料库相对应的句对或词汇表时,计算机辅助翻译软件将自动匹配;将所述语料库作为翻译记忆库导入计算机辅助翻译软件中,实现多模态语料直接入库,以便多模态材料录入计算机后,计算机辅助翻译软件能够直接调取图片、图表、图纸、视频、音频格式的语料进行翻译。Import the corpus as a translation memory into the computer-assisted translation software, when the translated text encounters a sentence pair or vocabulary corresponding to the corpus, the computer-assisted translation software will automatically match; import the corpus as a translation memory into the computer In the auxiliary translation software, the multimodal corpus is directly stored in the database, so that after the multimodal material is entered into the computer, the computer-aided translation software can directly retrieve the corpus in the format of pictures, charts, drawings, videos and audios for translation. 
2. The method for constructing a multimodal bilingual parallel corpus for construction engineering according to claim 1, characterized in that: in the corpus screening process of step (1), the web crawler uses the selenium web testing package under python as the basis of the crawler library. It first crawls, through the third-party website Baidu Academic (百度学术), the external links to the download pages for files in the relevant field, then visits all of these external links and downloads the files by simulating clicks on page elements. The formats of the downloaded files are converted, redundant and erroneous information is cleaned out, and the corresponding structural information is extracted; the converted text is then word-segmented, stop words are removed, and paragraphs without semantic content are filtered out, to build a base text that is ready for analysis.
3. The method for constructing a multimodal bilingual parallel corpus for construction engineering according to claim 1, characterized in that: in step (1), the state-level publishers include China Architecture & Building Press (中国建筑工业出版社); the government documents and reports include conference communiqués; the officially certified materials include contract texts; and the materials in other forms include audio, video, drawings and pictures from formal construction-industry meetings. The fields covered by the corpus include green building, construction theory, construction bid documents, construction contracts, building materials and urban planning.
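The text preparation at the end of claim 2 (word segmentation, stop-word removal, filtering of paragraphs without semantic content) could look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the tiny `STOP_WORDS` set, the regular-expression tokenizer (which only handles English; Chinese text would need a dedicated word segmenter), the `min_tokens` threshold used to decide that a paragraph carries no meaning, and the function names.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be much larger.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}


def tokenize(paragraph):
    """Lower-case the paragraph and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", paragraph.lower())


def build_base_text(paragraphs, min_tokens=3):
    """Return one token list per paragraph that survives filtering.

    Paragraphs with fewer than min_tokens content words (page footers,
    separator lines and similar residue) are dropped.
    """
    base_text = []
    for paragraph in paragraphs:
        tokens = [t for t in tokenize(paragraph) if t not in STOP_WORDS]
        if len(tokens) >= min_tokens:
            base_text.append(tokens)
    return base_text


if __name__ == "__main__":
    raw = [
        "The design of green building envelopes reduces energy use.",
        "Page 3 of 10",
        "-----",
    ]
    for tokens in build_base_text(raw):
        print(tokens)
```

The threshold-based filter is only one possible reading of "paragraphs without semantic content"; the claim does not specify how such paragraphs are recognized.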
4. The method for constructing a multimodal bilingual parallel corpus for construction engineering according to claim 1, characterized in that: in step (2), the "Text Organizer" (文本整理器) software on the computer is used to tidy up and replace full-width characters and digits, full-width spaces and redundant carriage returns that do not conform to English text conventions, so that nonstandard symbols and formatting in the text are corrected, the raw corpus data are cleaned, and the text is kept clean.
5. The method for constructing a multimodal bilingual parallel corpus for construction engineering according to claim 1, characterized in that: the corpus formats include the TMX format and the TXT format, which can be imported directly into CAT software, and a visual EXCEL format can also be provided.
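As an illustration of the TMX format mentioned in claim 5, the sketch below writes aligned sentence pairs as a TMX 1.4 document using Python's standard `xml.etree.ElementTree` module. The function name `build_tmx`, the header values (`creationtool`, `o-tmf` and so on) and the language codes are assumptions chosen for the example; the patent does not state which tool produces the TMX file or which header values it uses.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this qualified name out as the attribute xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang="zh-CN", tgt_lang="en-US"):
    """Return a TMX 1.4 document (as a string) for (source, target) pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "corpus-sketch",   # hypothetical tool name
        "creationtoolversion": "0.1",      # hypothetical version
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en-US",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        # One translation unit per sentence pair, one variant per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


if __name__ == "__main__":
    print(build_tmx([("绿色建筑", "green building")]))
```

Each `tu` element holds one aligned pair, which is the unit a CAT tool matches against when the corpus is used as a translation memory. Whether a given CAT tool accepts this minimal header unchanged would need to be checked against that tool.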
CN201910323653.6A 2019-04-22 2019-04-22 Construction method of multi-modal bilingual parallel corpus of construction engineering Active CN110046261B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910323653.6A CN110046261B (en) 2019-04-22 2019-04-22 Construction method of multi-modal bilingual parallel corpus of construction engineering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910323653.6A CN110046261B (en) 2019-04-22 2019-04-22 Construction method of multi-modal bilingual parallel corpus of construction engineering

Publications (2)

Publication Number Publication Date
CN110046261A (en) 2019-07-23
CN110046261B (en) 2022-01-21

Family

ID=67278357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910323653.6A Active CN110046261B (en) 2019-04-22 2019-04-22 Construction method of multi-modal bilingual parallel corpus of construction engineering

Country Status (1)

Country Link
CN (1) CN110046261B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543844A (en) * 2019-08-26 2019-12-06 中电科大数据研究院有限公司 A Metadata Extraction Method for Government Affairs Metadata PDF Files
CN110889295B (en) * 2019-09-12 2021-10-01 华为技术有限公司 Machine translation model, method, system and device for determining pseudo-professional parallel corpus
CN110942765B (en) * 2019-11-11 2022-05-27 珠海格力电器股份有限公司 Method, device, server and storage medium for constructing corpus
CN111209461A (en) * 2019-12-30 2020-05-29 成都理工大学 Bilingual corpus collection system based on public identification words
CN111221965A (en) * 2019-12-30 2020-06-02 成都信息工程大学 Classification and sampling detection method based on bilingual corpus of public signs
CN111241784A (en) * 2019-12-30 2020-06-05 成都理工大学 Methods of processing and sorting out public signage corpus resources
CN112016604B (en) * 2020-08-19 2021-03-26 华东师范大学 Zero-resource machine translation method applying visual information
CN112085985B (en) * 2020-08-20 2022-05-10 安徽七天网络科技有限公司 Student answer automatic scoring method for English examination translation questions
CN114626390A (en) * 2020-12-12 2022-06-14 郑州宝冶钢结构有限公司 Method for improving translation efficiency based on steel structure engineering parallel corpus
CN113268980A (en) * 2021-04-29 2021-08-17 赵天诚 Text recognition method and device, terminal equipment and storage medium
CN115423578B (en) * 2022-09-01 2023-12-05 广东博成网络科技有限公司 Bid bidding method and system based on micro-service containerized cloud platform
CN115688811A (en) * 2022-09-20 2023-02-03 甲骨易(北京)语言科技股份有限公司 Corpus alignment method combining rules and semantics
CN118170933B (en) * 2024-05-13 2024-08-13 之江实验室 A method and device for constructing multimodal corpus data in scientific fields

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8825466B1 (en) * 2007-06-08 2014-09-02 Language Weaver, Inc. Modification of annotated bilingual segment pairs in syntax-based machine translation
CN104657351A (en) * 2015-02-12 2015-05-27 中国科学院软件研究所 Method and device for processing bilingual alignment corpora
CN105005561A (en) * 2015-07-07 2015-10-28 刘改琳 Bilingual retrieval statistical translation system based on corpus
CN105068997A (en) * 2015-07-15 2015-11-18 清华大学 Parallel corpus construction method and device
CN106066870A (en) * 2016-05-27 2016-11-02 南京信息工程大学 A kind of bilingual teaching mode constructing system of linguistic context mark
CN106919689B (en) * 2017-03-03 2018-05-11 中国科学技术信息研究所 Professional domain knowledge mapping dynamic fixing method based on definitions blocks of knowledge

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5611076A (en) * 1994-09-21 1997-03-11 Micro Data Base Systems, Inc. Multi-model database management system engine for databases having complex data models
CN101101752B (en) * 2007-07-19 2010-12-01 华中科技大学 A lip-reading recognition system for monosyllabic languages based on visual features
US8600730B2 (en) * 2011-02-08 2013-12-03 Microsoft Corporation Language segmentation of multilingual texts
US20130332450A1 (en) * 2012-06-11 2013-12-12 International Business Machines Corporation System and Method for Automatically Detecting and Interactively Displaying Information About Entities, Activities, and Events from Multiple-Modality Natural Language Sources
CN104408078B (en) * 2014-11-07 2019-02-12 北京第二外国语学院 A kind of bilingual Chinese-English parallel corpora base construction method based on keyword
CN105843802A (en) * 2016-03-31 2016-08-10 长安大学 Corpus intervention module and method in translation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
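Title
"Analysis of the design and construction of a multimodal, multilingual parallel corpus of diplomatic discourse in the Internet Plus context" (互联网+背景下多模态、多语种外交话语平行语料库设计与创建探析); Yang Mingxing et al.; Foreign Language Education (《外语教学》); 20181110; Vol. 39, No. 6; p. 16, section 3.4, and p. 17, sections 3.4-3.6 *
"Construction of a bilingual parallel corpus for architecture and its use in developing the critical thinking ability of MTI students" (建筑双语平行语料库构建及其对MTI学生思辨能力的开发); Li Jiakun et al.; Journal of Shenyang Jianzhu University (《沈阳建筑大学学报》); 20181015; Vol. 20, No. 5; p. 531, right column, paragraphs 3-4, and p. 532 *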

Also Published As

Publication number Publication date
CN110046261A (en) 2019-07-23

Similar Documents

Publication Publication Date Title
CN110046261B (en) Construction method of multi-modal bilingual parallel corpus of construction engineering
WO2022179149A1 (en) Machine translation method and apparatus based on translation memory
Vel Pre-processing techniques of text mining using computational linguistics and python libraries
CN110609983B (en) Structured decomposition method for policy file
CN118170933B (en) A method and device for constructing multimodal corpus data in scientific fields
CN104991890A (en) Method for constructing Vietnamese dependency tree bank on basis of Chinese-Vietnamese vocabulary alignment corpora
CN106055623A (en) Cross-language recommendation method and system
CN113159969A (en) Financial long text rechecking system
CN111563372B (en) Typesetting document content self-duplication checking method based on teaching book publishing
CN101464856A (en) Alignment method and apparatus for parallel spoken language materials
CN112257442A (en) Policy document information extraction method based on corpus expansion neural network
CN111897917A (en) Rail transit industry term extraction method based on multi-modal natural language features
CN111353077A (en) Intelligent creation algorithm-based converged media collecting, editing and distributing system
CN112488593B (en) Auxiliary bid evaluation system and method for bidding
CN117194614A (en) Text difference recognition method, device and computer readable medium
CN114064878A (en) Natural language data marking method and system based on reinforcement learning
CN112836067A (en) Intelligent searching method based on knowledge graph
CN108268669A (en) A kind of crucial new word discovery method based on multidimensional words and phrases feature and sentiment analysis
CN103164398A (en) Chinese-Uygur language electronic dictionary and automatic translating Chinese-Uygur language method thereof
Jindal et al. Building english-punjabi parallel corpus for machine translation
CN114239579A (en) Electric power feasibility study document extraction method and device based on regular expression and CRF model
Li Key Technologies for Constructing Bilingual Corpus for English-Chinese Translation
CN111046663B (en) An Intelligent Correction Method for Chinese Forms
CN106776590A (en) A kind of method and system for obtaining entry translation
Gamal et al. Survey of arabic machine translation, methodologies, progress, and challenges

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Gao Jinling, Zhang Congying, Wang Haifeng, Ding Mei, Bao Yuping, Gao Jiyun, Zhang Xiaohong, Wang Wei
Inventor before: Zhang Xiaohong, Wang Wei, Zhang Congying, Ding Mei, Gao Jinling, Bao Yuping
GR01 Patent grant