CN110046261B

CN110046261B - Construction method of multi-modal bilingual parallel corpus of construction engineering

Info

Publication number: CN110046261B
Application number: CN201910323653.6A
Authority: CN
Inventors: 高金岭; 张聪颖; 王海峰; 丁玫; 鲍玉平; 高吉云; 张晓红; 王薇
Original assignee: Shandong Jianzhu University
Current assignee: Shandong Jianzhu University
Priority date: 2019-04-22
Filing date: 2019-04-22
Publication date: 2022-01-21
Anticipated expiration: 2039-04-22
Also published as: CN110046261A

Abstract

The invention belongs to the technical field of data processing and relates in particular to a method for constructing a multi-modal bilingual parallel corpus for construction engineering. The method comprises six steps: corpus screening; corpus extraction and proofreading; corpus segmentation and alignment; denoising; obtaining the parallel corpus; and corpus updating and expansion. The method supplies abundant comparison samples for construction vocabulary. The corpus is finely segmented and highly accurate, and the word and syntax meanings it returns are all relevant to construction, so irrelevant senses are excluded and users are given a large number of construction-domain bilingual translation samples.

Description

Construction method of multi-modal bilingual parallel corpus of construction engineering

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a construction method of a multi-modal bilingual parallel corpus of construction engineering.

Background

Construction English is the combination of the construction industry and English, and it covers every stage of the industry, such as prequalification, bidding, construction and quality evaluation. Its wording and expressions have the character of scientific and technical texts, with their own specialist vocabulary and conventions; the mode of discourse is written and the tenor is formal. As China's share of the overseas construction market grows and the domestic and foreign markets become more closely linked, construction English is used more and more widely and translations of it appear in large numbers. The perspective and methods of corpus linguistics offer tools for research on, and teaching of, construction English translation.

Computer-aided translation (CAT) means that, while a translator works, the system automatically and continuously stores the entered translations in the background and so builds up a database. When the same or a similar phrase or sentence appears later, the system searches the database and offers the stored content as a reference translation, which avoids repeated work. Combining a multi-modal construction engineering corpus with CAT can therefore greatly improve translation efficiency. However, several problems remain. Specialized construction corpora are extremely rare in China and abroad, and no multi-modal construction corpus exists. Existing construction corpora involve little collection, collation and proofreading, and some are not proofread at all, so their format and content are not standardized. The sources are not authoritative enough: assorted texts are gathered from the network without discrimination, so the corpora are noisy, of low purity and cannot really be used in CAT software. Finally, most parallel corpora are aligned at paragraph level, yet in translation sentences are the most valuable reference, followed by fragments, phrases and terms, so translation based on whole paragraphs has low accuracy.

Disclosure of Invention

The invention aims to overcome the shortcomings of the prior art and provides a method for constructing a multi-modal bilingual parallel corpus for construction engineering. The corpus is finely segmented and highly accurate, and the word and syntax meanings it returns all belong to construction-related fields, so irrelevant senses are excluded and users are given a large number of construction-domain bilingual translation samples.

The invention relates to a construction method of a multi-modal bilingual parallel corpus of construction engineering, which specifically comprises the following steps:

(1) corpus screening: the original corpus is obtained by network download, scanning and recognition, manual input and web crawling; its main sources are English-Chinese bilingual works on construction formally published by national-level publishers, official government documents and reports, officially certified materials, and audio, video, drawings, pictures and the like from formal conferences of the construction industry;

(2) corpus extraction and proofreading: multi-modal construction engineering information (pictures, charts, drawings, video, audio, text and the like) is collected using modern imaging technology, then mined and structured; the original corpus on the server is then proofread through add, delete, modify and query operations, cleaned, and saved after proofreading, and the bilingual corpus is put into a paragraph-based sentence-aligned form in Tmxmall software;

(3) corpus segmentation and alignment: the segments aligned in step (2) are split so that each bilingual parallel sentence pair occupies no more than four lines in the Word visual document;

(4) denoising: sentences or paragraphs whose translation is inaccurate are corrected by manual noise reduction, then manually entered and saved in the corpus, to ensure accurate matching of the corpus during computer-aided translation;

(5) annotation and transcription: a sound and sufficient data mining scheme is designed according to the research object and research needs; different annotation layers are set up in the annotation software and the material is annotated from different perspectives and aspects, for example by annotating construction contract corpora; multi-modal corpus annotation and retrieval software is used to present the transcribed content, audio and video synchronously, and supports output of results as text, audio, video and other forms;

(6) obtaining the parallel corpus: the recognized text is machine translated in sequence and then corrected by manual translation to obtain the parallel corpus;

(7) corpus updating and expansion: updating is controlled by an update unit, which from time to time pops up recommended entries together with their recommended weights, and recommended entries are written into the corpus according to those weights; the recommended weight of a word or sentence is the number of times it has been popped up as a recommended entry, so a word popped up 5 times has a weight of 5; when the weight exceeds 10 the entry is written into the corpus, which updates and expands the corpus.
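The update rule in step (7) can be sketched as a counter per recommended entry. This is a minimal illustration, not the update unit of the invention: the class name `UpdateUnit`, its method `recommend` and the in-memory dictionaries are assumptions made for the example; only the rule that an entry is written once its weight exceeds 10 is taken from the text.

```python
from collections import Counter


class UpdateUnit:
    """Hypothetical update unit: counts recommendations and promotes entries."""

    def __init__(self, threshold=10):
        self.threshold = threshold   # weight that must be exceeded (10 in the text)
        self.weights = Counter()     # recommended entry -> number of pop-ups so far
        self.corpus = {}             # source text -> target text

    def recommend(self, source, target):
        """Record one pop-up of an entry; return True once it is in the corpus."""
        self.weights[source] += 1
        if self.weights[source] > self.threshold and source not in self.corpus:
            self.corpus[source] = target
        return source in self.corpus


unit = UpdateUnit()
results = [unit.recommend("curtain wall", "幕墙") for _ in range(11)]
```

With the default threshold, the first ten recommendations leave the entry out of the corpus and the eleventh writes it in.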

In the corpus screening of step (1), the web crawler uses the selenium web testing package for Python as the basis of the crawler library. It first crawls the external links to the download pages of files in the relevant fields through third-party websites such as Baidu Academic, then visits those links in a uniform way and downloads the files by simulating clicks on page elements. The downloaded files are converted to a common format, redundant and erroneous information is cleaned out, and the corresponding structural information is extracted. The converted text is then segmented into words, stop words are removed and paragraphs with no semantic content are filtered out, yielding a base text that can be analyzed.
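The text-processing tail of this pipeline (word segmentation, stop-word removal and filtering of paragraphs with no semantic content) might look like the following sketch. It deliberately leaves out the selenium crawling and the format conversion, handles only the English side (Chinese text would need a dedicated word segmenter), and uses names that are assumptions for the example: `build_base_text`, `STOP_WORDS` and the `min_tokens` cut-off do not come from the source.

```python
import re

# Tiny illustrative stop-word list; a real project would load a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}


def build_base_text(raw, min_tokens=3):
    """Return one token list per paragraph that survives cleaning."""
    paragraphs = []
    for para in re.split(r"\n\s*\n", raw):
        para = re.sub(r"<[^>]+>", " ", para)      # drop leftover markup
        tokens = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", para.lower())
        tokens = [t for t in tokens if t not in STOP_WORDS]
        if len(tokens) >= min_tokens:             # skip paragraphs with too little content
            paragraphs.append(tokens)
    return paragraphs


raw = "<p>The design of green buildings reduces energy use.</p>\n\n*** 12 ***\n\nOf the"
base = build_base_text(raw)
```

Here the page-number line and the stop-word-only fragment are filtered out, and only the tokens of the first paragraph remain.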

In step (1), the national-level publishers include China Construction Industry Press and the like; the official government documents and reports include conference bulletins and the like; the officially certified materials include contract texts and the like; and the other materials include audio, video, drawings and pictures from formal conferences of the construction industry. The fields mainly covered by the corpus include green building, construction theory, construction tenders, construction contracts, construction materials, city planning and the like.

In step (2), "text organizer" software on the computer is used to tidy and replace full-width characters and numerals that do not conform to English text conventions, as well as full-width spaces and redundant carriage returns, so that non-standard symbols and formatting in the text are put in order, the original corpus data are cleaned, and the text is kept clean.

In step (3), the specific steps for keeping each bilingual parallel sentence pair to no more than four lines in the Word visual document are as follows:
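A rough Python equivalent of this tidying step is shown below. It does not reproduce the "text organizer" software; it assumes that Unicode NFKC normalization is an acceptable stand-in for converting full-width letters, digits, punctuation and spaces, and the function name `tidy_english_text` is made up for the example. NFKC also folds other compatibility characters (for instance superscript digits), so the sketch is meant for the English side of the corpus only.

```python
import re
import unicodedata


def tidy_english_text(text):
    """Normalize full-width forms and collapse surplus blank lines."""
    text = unicodedata.normalize("NFKC", text)   # full-width letters, digits, spaces -> ASCII
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)         # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)       # keep at most one empty line
    return text.strip()


cleaned = tidy_english_text("Ｃｏｎｃｒｅｔｅ　Ｃ３０，\n\n\n\nsteel")
```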

S31: first, the characters in the first to third lines of the text to be translated are counted and the last period, comma or semicolon in the third line is looked for; if one is detected, a line break is inserted at that period, comma or semicolon and step S34 is executed; if not, step S32 is executed;

S32: the last but one period, comma or semicolon in the second line is looked for; if one is detected, a line break is inserted there and step S34 is executed; if none is detected in the second line, step S33 is executed;

S33: the first line is checked in the same way, a line break is inserted at the last period, comma or semicolon of the first line, and step S34 is executed;

S34: detection continues on the first to third lines of the text following the line break, and steps S31 to S33 are repeated; this segments the corpus effectively and ensures that each bilingual parallel sentence pair occupies no more than four lines in the Word visual document.
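Steps S31 to S34 can be sketched in Python as follows. This is an illustrative reading rather than the procedure of the invention: it assumes that a visual line holds a fixed number of characters (the hypothetical `width` parameter), that the break falls just after the last period, comma or semicolon found on the line being examined, and that a hard break after three lines is acceptable when none of the three lines contains such a mark. The function name `split_segment` is likewise made up for the example.

```python
PUNCT = ".,;。，；"   # periods, commas and semicolons, ASCII and full-width


def split_segment(text, width=40):
    """Split text into pieces that each fit in at most three lines of `width` characters."""
    pieces = []
    rest = text.strip()
    while len(rest) > 3 * width:
        cut = None
        # S31 to S33: examine the third line, then the second, then the first.
        for line_no in (2, 1, 0):
            start, end = line_no * width, (line_no + 1) * width
            marks = [i for i in range(start, end) if rest[i] in PUNCT]
            if marks:
                cut = marks[-1] + 1      # break just after the last mark on that line
                break
        if cut is None:
            cut = 3 * width              # assumed fallback: hard break after three lines
        pieces.append(rest[:cut].strip())
        rest = rest[cut:].strip()        # S34: continue with the remaining text
    if rest:
        pieces.append(rest)
    return pieces


sample = "a" * 50 + "," + "b" * 50 + "." + "c" * 50
pieces = split_segment(sample, width=40)
```

In the sample, the third line contains the period, so the first piece ends there and the remainder is short enough to be kept whole.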

The beneficial effects of the invention are as follows. Compared with the prior art, the method provides abundant comparison samples for construction vocabulary, and the word and syntax meanings it returns are all related to construction, so irrelevant senses are excluded. It provides users with a large number of bilingual translation samples, giving teachers ample examples for teaching and so improving teaching quality, and giving students much to draw on for study and research outside class. It provides abundant, usable specialist text data that can serve as a reference for further teaching, learning, research and practice. It also offers the construction field a more professional translation platform: the corpus is authoritative in its sources, finely segmented and highly specialized, so it can serve the industry precisely and improve corpus matching, thereby meeting the needs of construction engineering.

Drawings

FIG. 1 is an exemplary diagram of translation results according to the present invention;

FIGS. 2 and 3 are diagrams illustrating the multi-modal translation of materials according to the present invention;

FIG. 4 is an exemplary diagram of a translation result of an online translation platform with a high market share;

FIG. 5 is a diagram illustrating the result of the corpus + CAT translation of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.

The invention is further illustrated by the following specific examples.

Example:

The method for constructing the multi-modal bilingual parallel corpus for construction engineering according to this embodiment specifically comprises the following steps:

In the corpus screening of step (1) of this embodiment, the web crawler uses the selenium web testing package for Python as the basis of the crawler library. It first crawls the external links to the download pages of files in the relevant fields through third-party websites such as Baidu Academic, then visits those links in a uniform way and downloads the files by simulating clicks on page elements. The downloaded files are converted to a common format, redundant and erroneous information is cleaned out, and the corresponding structural information is extracted. The converted text is then segmented into words, stop words are removed and paragraphs with no semantic content are filtered out, yielding a base text for analysis.

In step (1), the national-level publishers include China Construction Industry Press and the like; the official government documents and reports include conference bulletins and the like; the officially certified materials include contract texts and the like; and the other materials include audio, video, drawings, pictures and the like from formal conferences of the construction industry. The fields mainly covered by the corpus include green building, construction theory, construction tenders, construction contracts, construction materials, city planning and the like.

In step (3) of this embodiment, each bilingual parallel sentence pair is kept to no more than four lines in the Word visual document by following steps S31 to S34 set out above.

The corpus formats in this embodiment include TMX and TXT, which can be imported directly into CAT software; a visual EXCEL format can also be provided.
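As an illustration of the TMX output, the sketch below builds a minimal TMX 1.4 document from sentence pairs using only the Python standard library. The function name `pairs_to_tmx`, the `creationtool` value and the language codes are assumptions for the example; the source does not describe how its TMX files are produced.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this qualified name as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def pairs_to_tmx(pairs, src_lang="en", tgt_lang="zh"):
    """Build a minimal TMX 1.4 document from (source, target) sentence pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "corpus-sketch",      # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "sentence",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")        # one translation unit per sentence pair
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


tmx_text = pairs_to_tmx([("green building", "绿色建筑")])
```

The returned string can be written to a `.tmx` file with UTF-8 encoding before it is imported into a CAT tool.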

When the multi-modal bilingual parallel corpus for construction engineering is used, it is imported into computer-aided translation software as a translation memory (TM). When the text being translated contains a sentence pair or glossary term that corresponds to the corpus, the software matches it automatically, which safeguards translation quality in the construction industry and saves a great deal of time. Translation follows a corpus + CAT approach that integrates CAT and machine translation (MT) into a "CAT + MT + post-editing (PE)" workflow, improving efficiency while ensuring quality.

A translation accuracy confidence is set for the corpus and CAT. The core of CAT is translation memory: when the same or a similar phrase appears, the system automatically suggests the closest translation in the memory, and the user adopts, discards or edits it as needed. If the confidence of a translation is not lower than 90, the result is adopted; if it is lower than 90, the result is discarded and corrected by manual translation. The manual translation replaces the corpus and CAT result, is edited and entered in the editable area, and is stored in the corpus, which realizes the "CAT + MT + PE" workflow. The translation accuracy confidence is a routine preset in the system for judging the accuracy of a translation.

Continuous updating and expansion keep the corpus current and stable, supply more reference material for unifying terminology and translator style, and prepare a reliable resource for extracting construction terms. Because the corpus is imported into the CAT software as a translation memory, multi-modal corpora can be stored directly; once multi-modal material has been entered into the computer, the CAT software can call up corpus items in picture, chart, drawing, video, audio and other formats directly for translation, teaching, research and the like.
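The adopt-or-discard rule around the confidence value of 90 can be sketched as below. The source does not say how the confidence is computed, so the example assumes a simple character-similarity score from the standard-library `difflib.SequenceMatcher`, scaled to 0 to 100; real CAT tools use their own fuzzy-match scoring. The function names `best_match` and `translate` are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(sentence, memory):
    """Return (score from 0 to 100, source, target) for the closest memory entry."""
    best = (0.0, None, None)
    for source, target in memory.items():
        score = 100 * SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
        if score > best[0]:
            best = (score, source, target)
    return best


def translate(sentence, memory, threshold=90):
    """Adopt the memory translation at or above the threshold, else request manual work."""
    score, _, target = best_match(sentence, memory)
    if target is not None and score >= threshold:
        return target, "adopted"
    return None, "needs manual translation"


memory = {"The contractor shall submit the programme.": "承包商应提交进度计划。"}
hit = translate("The contractor shall submit the programme.", memory)
miss = translate("Green roofs reduce runoff.", memory)
```

In a fuller version, the manual translation produced for a miss would be added to `memory`, mirroring the step in which the corrected result is stored back in the corpus.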

The following works have already been entered into the corpus of this embodiment: General Theory of Construction Environment and Energy Application Engineering (Chinese to English); Green Northern Europe: Sustainable Development Cities and Buildings (Chinese to English); FIDIC contract translation (Chinese to English); Ecological Cities and Green Buildings; Fulai construction history (Yikan); Construction Materials (Yikan); Johnston: One Principle, Various Forms (Chinese to English); construction of the opposite climax shishiyasu (Chinese to English); Design Concept (English to Chinese); Urban Sustainable Development Principle (Chinese to English); Construction from Concept to Construction 2 (Chinese to English); Sustainable Development City and Construction Design (Chinese to English); and International Civil Engineering Construction Contract (Chinese to English). In addition, the corpus of this embodiment has been tested in a number of translation projects involving a large amount of specialist knowledge from industries such as architecture and electric power. It can therefore better resolve problems of linguistic accuracy and of the logic of scientific and technical language in specialized translation, ensuring the efficiency and quality of the translation service.

This embodiment provides a translation result obtained by importing the corpus into CAT software, shown in FIG. 1; FIG. 1 shows that the corpus of this embodiment achieves accurate matching in the CAT software. FIGS. 2 and 3 show the translation result of applying the multi-modal technique to a roof structure diagram in the corpus. FIGS. 4 and 5 are a comparison: they show that the corpus of this embodiment gives higher matching precision and more accurate translations for construction texts.

The above embodiments are only specific examples of the present invention, and the protection scope of the present invention includes but is not limited to the product forms and styles of the above embodiments, and any suitable changes or modifications made by those skilled in the art according to the claims of the present invention shall fall within the protection scope of the present invention.

Claims

1. A method for constructing a multi-modal bilingual parallel corpus for construction engineering, characterized in that it specifically comprises the following steps:

(1) Corpus screening: the original corpus is obtained by network download, scanning and recognition, manual input and web crawling; the sources of the original corpus are bilingual works on construction published by national-level publishers, official government documents, officially certified materials, and audio, video, drawings and pictures from formal conferences of the construction industry;

(2) Corpus extraction and proofreading: multi-modal construction engineering information is collected using modern imaging technology, then mined and structured; the original corpus on the server is then proofread through add, delete, modify and query operations, cleaned, and saved after proofreading, and the bilingual corpus is put into a paragraph-based sentence-aligned form in Tmxmall software;

(3) Corpus segmentation and alignment: the segments aligned in step (2) are split so that each bilingual parallel sentence pair occupies no more than four lines in the Word visual document;

(4) Denoising: sentences or paragraphs whose translation is inaccurate are corrected by manual noise reduction, then manually entered and saved in the corpus, to ensure accurate matching of the corpus during computer-aided translation;

(5) Annotation and transcription: different annotation layers are set up in the annotation software and the corpus is annotated from different perspectives and aspects; multi-modal corpus annotation and retrieval software is used to present the transcribed content, audio and video synchronously, and supports output of results as text, audio, video and other forms;

(6) Obtaining the parallel corpus: the recognized text is machine translated in sequence, and the parallel corpus is obtained after correction by manual translation;

(7) Corpus updating and expansion: updating of the corpus is controlled by an update unit, which from time to time pops up recommended entries together with their recommended weights, and recommended entries are written into the corpus according to the recommended entries and their weights; the recommended weight is determined by the number of times the recommended entry has been popped up;

In step (3), the specific steps for keeping each bilingual parallel sentence pair to no more than four lines in the Word visual document are as follows:

S31: First, the characters in the first to third lines of the text to be translated are counted and the penultimate period, comma or semicolon in the third line is looked for; if it is detected, a line break is inserted at that period, comma or semicolon and step S34 is executed; if it is not detected, step S32 is executed;

S32: The penultimate period, comma or semicolon in the second line is looked for; a line break is inserted at that period, comma or semicolon and step S34 is executed; if none is detected in the second line, step S33 is executed;

S33: The first line is checked in the same way, a line break is inserted at the penultimate period, comma or semicolon of the first line, and step S34 is executed;

S34: Detection continues on the first to third lines of the text following the line break, and steps S31 to S33 are repeated, which effectively segments the corpus and ensures that each bilingual parallel sentence pair occupies no more than four lines in the visual document;

The corpus is imported into computer-aided translation software as a translation memory; when the text being translated contains a sentence pair or vocabulary item corresponding to the corpus, the computer-aided translation software matches it automatically; because the corpus is imported into the computer-aided translation software as a translation memory, the multi-modal corpus is stored directly, so that once multi-modal material has been entered into the computer, the computer-aided translation software can directly retrieve corpus items in picture, chart, drawing, video and audio formats for translation.

2. The method for constructing a multi-modal bilingual parallel corpus for construction engineering according to claim 1, characterized in that: in the corpus screening of step (1), the web crawler uses the selenium web testing package for Python as the basis of the crawler library; it first crawls the external links to the download pages of files in the relevant fields through the third-party website Baidu Academic, then visits these external links in a uniform way and downloads the files by simulating clicks on page elements; the format of the files is converted, redundant and erroneous information is cleaned out and the corresponding structural information is extracted; the converted text is then segmented into words, stop words are removed and paragraphs with no semantic content are filtered out, so as to construct a base text for analysis.

3. The method for constructing a multi-modal bilingual parallel corpus for construction engineering according to claim 1, characterized in that: in step (1), the national-level publishers include China Construction Industry Press, the official government documents and reports include conference bulletins, the officially certified materials include contract texts, and the other forms of material include audio, video, drawings and pictures from formal conferences of the construction industry; the fields covered by the corpus include green building, construction theory, construction tenders, construction contracts, construction materials and city planning.

4. The method for constructing a multi-modal bilingual parallel corpus for construction engineering according to claim 1, characterized in that: in step (2), "text organizer" software on the computer is used to tidy and replace full-width characters and numerals that do not conform to English text conventions, as well as full-width spaces and redundant carriage returns, so as to put non-standard symbols and formatting in the text in order, clean the original corpus data and keep the text clean.

5. The method for constructing a multi-modal bilingual parallel corpus for construction engineering according to claim 1, characterized in that: the corpus formats include TMX and TXT, which can be imported directly into CAT software, and a visual EXCEL format can also be provided.