CN116822517A - Multi-language translation term identification method - Google Patents

Publication number
CN116822517A
Authority
CN
China
Prior art keywords
text
term
cultural
culture
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311090725.XA
Other languages
Chinese (zh)
Other versions
CN116822517B (en)
Inventor
周兰栋
孔坤坤
刘敏杰
傅泉铭
崔晓静
李园园
刘莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baishun Information Technology Co ltd
Original Assignee
Baishun Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Baishun Information Technology Co ltd filed Critical Baishun Information Technology Co ltd
Priority to CN202311090725.XA priority Critical patent/CN116822517B/en
Publication of CN116822517A publication Critical patent/CN116822517A/en
Application granted granted Critical
Publication of CN116822517B publication Critical patent/CN116822517B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of language identification, and in particular to a term identification method for multi-language translation. The method comprises the following steps: preprocessing an input text using natural language processing techniques, the preprocessing comprising word segmentation, part-of-speech tagging and syntactic analysis; extracting features from the preprocessed text using a pre-trained machine learning model; recognizing and classifying terms in the text by combining natural language processing with the machine learning model; matching the recognized terms against a pre-built multi-language term library to obtain corresponding translations; and, by means of a cultural background adaptation module, adjusting the context and style of the term translations according to the cultural customs and characteristics of the target language, so that the translated content suits the cultural background and acceptance of the target-language community. By combining natural language processing with a pre-trained machine learning model, the invention achieves higher accuracy in term recognition and feature extraction.

Description

Multi-language translation term identification method
Technical Field
The invention relates to the technical field of language identification, in particular to a term identification method for multi-language translation.
Background
Existing multilingual translation relies mainly on fixed dictionaries and grammar rules, which often ignore differences in cultural background and the exact matching of specific terms. Some existing systems employ statistical machine translation methods, which improve accuracy but remain limited in sensitivity and cultural adaptability.
Text preprocessing is used in some existing solutions but is generally not optimized for multilingual and multicultural contexts. Existing term recognition methods are mostly based on simple dictionary lookup and lack flexibility and accuracy. Although some systems attempt to apply machine learning to term matching, most lack the ability to adapt to and precisely match cultural contexts. As a result, existing translation systems often produce translations whose context and style are inappropriate for the target culture.
Although existing multi-language translation technology has advanced to some extent, significant limitations remain in key aspects such as accuracy, sensitivity and cultural adaptability. These limitations highlight the need for new methods and techniques, in particular ones that better integrate natural language processing and machine learning and that take the complexity of cultural background into account.
Disclosure of Invention
Based on the above objects, the present invention provides a term recognition method for multi-language translation.
A method for recognizing a term for multilingual translation, comprising the steps of:
step one: preprocessing an input text by using a natural language processing technology, wherein the preprocessing comprises word segmentation, part-of-speech tagging and syntactic analysis;
step two: extracting features of the preprocessed text by using a pre-trained machine learning model;
step three: recognizing and classifying terms in the text by combining natural language processing and a machine learning model;
step four: matching the identified terms with a multi-language term library constructed in advance to obtain corresponding translations;
step five: based on the cultural background adaptation module, according to cultural custom and characteristics of the target language, the context and style of term translation are adjusted, and translation content is ensured to accord with cultural background and acceptance of the target language community.
Further, the preprocessing in the first step specifically includes:
text cleaning: firstly, the input original text is cleaned, including removing redundant blank characters, special symbols and non-text elements, so as to ensure the text quality.
Word segmentation: word segmentation is carried out on the cleaned text, and continuous text is segmented into independent words or phrases;
part of speech tagging: marking the parts of speech of the words after word segmentation, and identifying the parts of speech of each word;
syntax analysis: further performing syntactic analysis on the text to determine the dependency relationship between words;
removing stop words: during the analysis, common stop words are removed;
semantic role labeling: the principal components in the sentence are identified and the relationship between the principal components is understood.
Further, the feature extraction in the second step converts the input text into a set of numerical values representing its underlying structure and meaning, and the machine learning model performs feature extraction based on a Transformer model, expressed as follows:
self-attention mechanism
Query, key and value: the query (Q), key (K) and value (V) are computed from the input word embeddings $X$ by the following formulas:
$$Q = XW_Q,\quad K = XW_K,\quad V = XW_V$$
where $W_Q$, $W_K$ and $W_V$ are the weight matrices for the query, key and value, respectively, which convert the word embeddings into queries, keys and values;
attention weighting: the dot product between the query and the key is calculated, and the attention weights are obtained through the scaled softmax function:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $d_k$ is the dimension of the key vector;
feedforward neural network: the output of the attention layer is further processed by the feedforward neural network layer:
$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$
where $W_1$, $W_2$, $b_1$ and $b_2$ are learnable weights and biases;
the feedforward neural network layer further transforms the self-attention output; the activation function and linear transformations add nonlinear capacity to the model, which helps it capture more complex patterns;
output:
the outputs of the self-attention mechanism and the feedforward neural network layer are combined to obtain a deep feature representation of the input text.
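As a reading aid, the shapes involved in the self-attention step can be tracked as follows. The symbols $n$ (number of tokens), $d_{\text{model}}$ (embedding width) and $d_v$ (value width) are assumptions introduced here for illustration and do not appear in the original text:

$$X \in \mathbb{R}^{n \times d_{\text{model}}},\quad W_Q, W_K \in \mathbb{R}^{d_{\text{model}} \times d_k},\quad W_V \in \mathbb{R}^{d_{\text{model}} \times d_v}$$
$$Q = XW_Q \in \mathbb{R}^{n \times d_k},\quad K = XW_K \in \mathbb{R}^{n \times d_k},\quad V = XW_V \in \mathbb{R}^{n \times d_v},\quad QK^{T} \in \mathbb{R}^{n \times n}$$
$$\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \in \mathbb{R}^{n \times d_v}$$

Each row of the $n \times n$ weight matrix sums to 1, so every output row is a weighted average of the value vectors.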
Further, the pre-built multilingual term library includes professional term mapping between each domain and different languages, and the step four specifically includes:
the identified terms are subjected to normalization processing, so that the formats and the representations of the identified terms are consistent with the standards in the term library;
comparing the standardized terms with related items in a multi-language term library through a searching and matching algorithm;
after finding the matched term, extracting the equivalent expression of the term in the target language from the multi-language term library, wherein the mapping considers the subtle differences under different languages and cultural backgrounds, and ensures the accuracy and the readability of translation;
if the term has no direct counterpart in the multi-language term library, a fallback strategy is adopted: the term is split into smaller parts, each part is translated, and the parts are recombined;
by integrating the translation of the term with other translated portions, a complete, fluent multi-lingual translation text is generated.
Further, the search and match algorithm is based on a cosine similarity algorithm for comparing similarity between two pieces of text or two vocabulary items.
Further, the cosine similarity algorithm formula is as follows:
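Given two vectors $A$ and $B$, their cosine similarity is calculated as:
$$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$$
where:
$A \cdot B$ denotes the dot product of $A$ and $B$, calculated as:
$$A \cdot B = \sum_{i=1}^{n} A_i B_i$$
$\|A\|$ and $\|B\|$ denote the Euclidean norms of $A$ and $B$, respectively, calculated as:
$$\|A\| = \sqrt{\sum_{i=1}^{n} A_i^2} \quad \text{and} \quad \|B\| = \sqrt{\sum_{i=1}^{n} B_i^2}$$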
By the above formula, the cosine similarity value lies between -1 and 1, where 1 indicates that the two vectors point in the same direction, 0 indicates that they are orthogonal (unrelated), and -1 indicates that they point in opposite directions.
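A small worked example may make the calculation concrete. The vectors below are hypothetical and chosen only so that the arithmetic is easy to follow:

$$A = (1, 2, 2),\quad B = (2, 1, 2)$$
$$A \cdot B = 1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2 = 8$$
$$\|A\| = \sqrt{1 + 4 + 4} = 3,\quad \|B\| = \sqrt{4 + 1 + 4} = 3$$
$$\cos(\theta) = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.889$$

A value this close to 1 would indicate that the two items are highly similar.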
Further, the cultural background adapting module in the fifth step specifically includes:
constructing a culture tag library: key elements of the target culture, including etiquette, values, customs and religion, are identified; a label is created for each cultural element, and texts, expressions and metaphors related to it are collected; accurately labeling cultural features in this way provides a precise reference for subsequent cultural adaptation;
multilevel semantic analysis: based on syntactic analysis, analyzing sentence structure by using natural language processing technology, determining subjects, objects and verbs in sentences, identifying semantic roles of the subjects and the verbs, and helping to understand the intention and implicit meaning of the text more deeply through multi-level analysis;
culture matching and adaptation: the text is matched against the culture tag library using a matching algorithm to identify elements related to the target culture; the mode of expression, tone and politeness are adjusted according to the customs and characteristics of the target culture; and the source text is converted into the target cultural style using a Transformer model. Matching against the culture tag library, combined with deep-learning style conversion, ensures the cultural adaptability of the text.
Further, the cultural background adaptation module includes expert participation and optimization: an expert review platform is built that allows cultural experts to review and edit the translation results, and the experts' modifications and feedback are used to continuously optimize the model.
Further, the tag library construction includes:
defining culture dimensions and classifications:
culture dimension selection: the culture dimensions of etiquette, values, customs, religion and language style are selected;
classification definition: sub-classifications are defined under each culture dimension;
collecting and analyzing sample text:
sample sources: text related to the target culture is collected from books, papers, the web and social media channels;
text analysis: the text is analyzed using natural language processing tools to extract key information related to the culture dimensions and sub-classifications;
constructing a primary tag library:
keyword and expression extraction: keywords and expressions are extracted from the analyzed sample text and associated with the culture dimensions and sub-classifications;
label generation: a label is generated for each keyword and expression and associated with a specific culture dimension and sub-classification;
tag library integration:
integration with the model: the constructed tag library is integrated into the cultural background adaptation module for cultural feature matching and adaptation;
providing an API interface: an interface is provided through which the tag library communicates with other systems or modules, supporting diverse application requirements.
The invention has the beneficial effects that:
by combining natural language processing with a pre-trained machine learning model, the invention achieves higher accuracy in term recognition and feature extraction, improving overall accuracy compared with existing methods.
The feature extraction and term matching processes of the invention are more sensitive and flexible, and can more accurately identify and process specific terms and expressions across languages and cultural backgrounds. By means of the cultural background adaptation module, the context and style of term translations can be adjusted according to the cultural customs and characteristics of the target language, so that the translation results better meet the expectations and acceptance of the target audience.
By matching against a pre-built multi-language term library, the invention can support a wider range of languages and thus supports global communication and business. Its sensitive handling of cultural background and context also helps promote communication and understanding among people of different cultural and language backgrounds.
Drawings
In order to illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below relate only to the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of an identification method according to an embodiment of the invention;
fig. 2 is a schematic diagram of a cultural background adaptation module according to an embodiment of the invention.
Detailed Description
The present invention will be further described in detail with reference to specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent.
It is to be noted that, unless otherwise defined, technical or scientific terms used herein have the ordinary meaning understood by one of ordinary skill in the art to which the present invention belongs. The terms "first," "second," and the like, as used herein, do not denote any order, quantity, or importance, but are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item preceding the word encompasses the elements or items listed after the word and their equivalents, without excluding other elements or items. The terms "connected" or "coupled," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may change accordingly when the absolute position of the described object changes.
As shown in fig. 1-2, a term recognition method for multi-language translation includes the following steps:
step one: preprocessing an input text by using a natural language processing technology, wherein the preprocessing comprises word segmentation, part-of-speech tagging and syntactic analysis;
step two: extracting features of the preprocessed text by using a pre-trained machine learning model;
step three: recognizing and classifying terms in the text by combining natural language processing and a machine learning model;
step four: matching the identified terms with a multi-language term library constructed in advance to obtain corresponding translations;
step five: based on the cultural background adaptation module, according to cultural custom and characteristics of the target language, the context and style of term translation are adjusted, and translation content is ensured to accord with the cultural background and acceptance of the target language community;
The method attends not only to conversion at the text level but also to the adaptation of context and style, so that the translated content better suits the cultural background and acceptance of the target-language community. Such adaptation to different cultural backgrounds is rare in traditional translation methods, and it has practical value and novelty in a globalized setting.
The pretreatment of the first step specifically comprises the following steps:
text cleaning: firstly, the input original text is cleaned, including removing redundant blank characters, special symbols, non-text elements and the like, so as to ensure the text quality.
Word segmentation: the cleaned text is segmented, splitting continuous text into individual words or phrases; dictionary matching, statistical or deep learning methods may be used in this step to suit different languages and fields;
part-of-speech tagging: the segmented words are tagged with their parts of speech (such as noun, verb or adjective), which helps in understanding the grammatical function and meaning of each word in the text;
syntactic analysis: the text is further parsed to determine the dependency relationships between words, which helps the system understand the structure and meaning of sentences;
stop-word removal: during analysis, common stop words (e.g., "and", "is") are removed; in many cases these words carry no substantive meaning, and removing them reduces the complexity of the analysis;
semantic role labeling: the principal components of a sentence, such as subjects, verbs and objects, are identified and the relationships between them are understood, which helps to capture the meaning of the text accurately;
together, the above steps form a comprehensive text preprocessing process. Through this process, the input text is converted into a form that is easier to analyze and process, which lays a foundation for the subsequent term recognition and translation work.
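The first stages of this preprocessing can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the regular expressions and the tiny English stop-word list are all hypothetical, and part-of-speech tagging, syntactic analysis and semantic role labeling are omitted because they would normally rely on a trained NLP toolkit.

```python
import re

# Hypothetical stop-word list; a real system would load one per language.
STOP_WORDS = {"the", "a", "an", "and", "is", "of", "to", "in"}


def clean_text(raw: str) -> str:
    """Remove tag-like non-text elements and special symbols, then collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)                  # drop HTML-like elements
    no_symbols = re.sub(r"[^\w\s.,;:!?'-]", " ", no_tags)   # drop special symbols
    return re.sub(r"\s+", " ", no_symbols).strip()          # collapse redundant blanks


def tokenize(text: str) -> list[str]:
    """Regex word segmentation; keeps hyphenated and apostrophe forms together."""
    return re.findall(r"\w+(?:[-']\w+)*", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(raw: str) -> list[str]:
    """Cleaning, segmentation and stop-word removal, applied in that order."""
    return remove_stop_words(tokenize(clean_text(raw)))
```

Whitespace-based segmentation of this kind only suits languages that delimit words with spaces; for languages such as Chinese, the dictionary, statistical or deep learning segmenters mentioned above would be needed instead.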
The feature extraction of the second step converts the input text into a set of numerical values representing the underlying structure and meaning of the text, and the machine learning model performs feature extraction based on a Transformer model, expressed as follows:
self-attention mechanism
Query, key and value: the query (Q), key (K) and value (V) are computed from the input word embeddings $X$ by the following formulas:
$$Q = XW_Q,\quad K = XW_K,\quad V = XW_V$$
where $W_Q$, $W_K$ and $W_V$ are the weight matrices for the query, key and value, respectively, which convert the word embeddings into queries, keys and values and can be regarded as encodings for a specific task, such as term recognition;
attention weighting: the dot product between the query and the key is calculated, and the attention weights are obtained through the scaled softmax function:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $d_k$ is the dimension of the key vector;
through the self-attention calculation, the model takes into account the interrelationships and dependencies between words in the text: $QK^{T}$ measures the similarity between the queries and the keys, and the softmax function then converts these similarities into weights;
feedforward neural network: the output of the attention layer is further processed by the feedforward neural network layer:
$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$
where $W_1$, $W_2$, $b_1$ and $b_2$ are learnable weights and biases;
the feedforward neural network layer further transforms the self-attention output; the activation function and linear transformations add nonlinear capacity to the model, which helps it capture more complex patterns and can be regarded as specialized adjustment and optimization for a specific task (such as term recognition);
output:
the outputs of the self-attention mechanism and the feedforward neural network layer are combined to obtain a deep feature representation of the input text; these features capture the complex relationships between words in the text and can be used for the subsequent term recognition and translation tasks;
the feature extraction process captures deep semantic and structural information in the input text and provides a rich basis for subsequent processing. In practical applications, the model structure and parameters may be adjusted according to specific requirements and data.
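The self-attention and feedforward computations described above can be sketched in plain Python on small nested lists. This is an illustrative sketch only: it uses a single attention head, assumes ReLU as the activation function (the text says only "activation function"), and omits the residual connections and layer normalization of a full Transformer layer. All function and parameter names are hypothetical.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix):
    """Return (attended values, attention weights) for input embeddings x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])                                   # dimension of the key vectors
    scores = matmul(q, transpose(k))                  # query-key dot products
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def feed_forward(x: Matrix, w1: Matrix, b1: list[float],
                 w2: Matrix, b2: list[float]) -> Matrix:
    """Two linear maps with a ReLU in between (ReLU is an assumption)."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]
```

Passing the first return value of `self_attention` to `feed_forward` yields one feature vector per input token. In practice a tensor library and trained weights would replace these list-based helpers.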
The pre-built multi-language term library comprises professional term mapping among various fields and different languages, and the fourth step specifically comprises the following steps:
the recognized terms are normalized so that their format and representation are consistent with the standards of the term library, for example by stemming and synonym replacement;
comparing the standardized terms with related items in a multi-language term library through a searching and matching algorithm;
after finding the matched term, extracting the equivalent expression of the term in the target language from the multi-language term library, wherein the mapping considers the subtle differences under different languages and cultural backgrounds, and ensures the accuracy and the readability of translation;
if the term has no direct counterpart in the multi-language term library, a fallback strategy is adopted: the term is split into smaller parts, each part is translated, and the parts are recombined;
generating a complete and smooth multilingual translation text by integrating the translation of the term with other translation parts;
in general, matching the recognized terms against the pre-built multi-language term library and obtaining the corresponding translations is a complex but valuable task. It involves a number of techniques and methods and is a core component of modern multi-language processing systems.
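The normalization, lookup and fallback steps above can be sketched as follows. The term library contents, the function names and the normalization rule are hypothetical and serve only to illustrate the control flow. In particular, the fallback recombines the translated parts in source order, which ignores target-language word order and inflection.

```python
from typing import Optional

# Hypothetical in-memory term library: normalized source term -> {language: translation}.
TERM_LIBRARY = {
    "machine learning": {"de": "maschinelles Lernen", "fr": "apprentissage automatique"},
    "neural network": {"de": "neuronales Netz", "fr": "réseau de neurones"},
    "machine": {"de": "Maschine", "fr": "machine"},
    "translation": {"de": "Übersetzung", "fr": "traduction"},
}


def normalize(term: str) -> str:
    """Lower-case and unify hyphens and whitespace so variants share one key."""
    return " ".join(term.lower().replace("-", " ").split())


def translate_term(term: str, target: str) -> Optional[str]:
    """Look the term up directly; otherwise fall back to part-by-part translation."""
    key = normalize(term)
    entry = TERM_LIBRARY.get(key)
    if entry and target in entry:
        return entry[target]

    # Fallback: split into smaller parts, translate each part, then recombine.
    parts = key.split()
    if len(parts) < 2:
        return None                      # a single unknown word cannot be split further
    translated = [translate_term(part, target) for part in parts]
    if any(t is None for t in translated):
        return None                      # at least one part is unknown
    return " ".join(translated)
```

For instance, `translate_term("machine translation", "de")` is not in the hypothetical library, so the fallback produces a word-by-word result; a production system would post-edit such output or pass it to the later adaptation step.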
The find and match algorithm is based on a cosine similarity algorithm to compare the similarity between two pieces of text or two vocabulary items.
The cosine similarity algorithm formula is as follows:
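Given two vectors $A$ and $B$, their cosine similarity is calculated as:
$$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$$
where:
$A \cdot B$ denotes the dot product of $A$ and $B$, calculated as:
$$A \cdot B = \sum_{i=1}^{n} A_i B_i$$
$\|A\|$ and $\|B\|$ denote the Euclidean norms of $A$ and $B$, respectively, calculated as:
$$\|A\| = \sqrt{\sum_{i=1}^{n} A_i^2} \quad \text{and} \quad \|B\| = \sqrt{\sum_{i=1}^{n} B_i^2}$$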
By the above formula, the cosine similarity value lies between -1 and 1, where 1 indicates that the two vectors point in the same direction, 0 indicates that they are orthogonal (unrelated), and -1 indicates that they point in opposite directions;
in the term matching and translation scenario described above, the similarity of the identified term to each item in the multilingual term library may be compared by cosine similarity, thereby finding the most appropriate corresponding item. This approach is very effective in dealing with spelling variants, abbreviations, synonyms, etc.
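A minimal sketch of cosine-similarity matching against term library entries is shown below. The text does not say how terms are turned into vectors, so this sketch assumes character trigram counts, which is one simple way to make spelling variants score highly; the function names and candidate list are hypothetical. Because count vectors are non-negative, scores here fall between 0 and 1 rather than across the full range of -1 to 1.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of the lower-cased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0                       # an empty vector has no direction
    return dot / (norm_a * norm_b)


def best_match(term: str, candidates: list[str]) -> tuple[str, float]:
    """Return the candidate with the highest cosine similarity to the term."""
    query = char_ngrams(term)
    scored = [(c, cosine_similarity(query, char_ngrams(c))) for c in candidates]
    return max(scored, key=lambda pair: pair[1])
```

With the hypothetical candidates `["color", "collar", "cooler"]`, the query `"colour"` is matched to `"color"`. Synonyms with no shared spelling would need semantic vectors, such as the model features from step two, rather than character counts.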
The cultural background adaptation module in the fifth step specifically comprises:
constructing a culture tag library: key elements of the target culture, including etiquette, values, customs and religion, are identified; a label is created for each cultural element, and texts, expressions and metaphors related to it are collected; accurately labeling cultural features in this way provides a precise reference for subsequent cultural adaptation;
multilevel semantic analysis: based on syntactic analysis, analyzing sentence structure by using natural language processing technology, determining subjects, objects and verbs in sentences, identifying semantic roles of the subjects and the verbs, and helping to understand the intention and implicit meaning of the text more deeply through multi-level analysis;
culture matching and adaptation: the text is matched against the culture tag library using a matching algorithm to identify elements related to the target culture; the mode of expression, tone and politeness are adjusted according to the customs and characteristics of the target culture; and the source text is converted into the target cultural style using a Transformer model. Matching against the culture tag library, combined with deep-learning style conversion, ensures the cultural adaptability of the text;
through the technical scheme, the cultural background adaptation module can comprehensively understand and adapt to custom and characteristics of target culture. The schemes comprehensively utilize the technologies of natural language processing, machine learning, data mining, artificial intelligence and the like to ensure that the translation result is accurate in language and matched with a target audience in cultural context and style.
The cultural background adaptation module includes expert participation and optimization: an expert review platform is built that allows cultural experts to review and edit the translation results, and the experts' modifications and feedback are used to continuously optimize the model.
The tag library construction comprises the following steps:
defining culture dimensions and classifications:
culture dimension selection: the culture dimensions of etiquette, values, customs, religion and language style are selected;
classification definition: sub-classifications are defined under each culture dimension; for example, etiquette can be subdivided into business etiquette and daily etiquette;
collecting and analyzing sample text:
sample sources: text related to the target culture is collected from books, papers, the web and social media channels;
text analysis: the text is analyzed using natural language processing tools to extract key information related to the culture dimensions and sub-classifications;
constructing a primary tag library:
keyword and expression extraction: keywords and expressions are extracted from the analyzed sample text and associated with the culture dimensions and sub-classifications;
label generation: a label is generated for each keyword and expression and associated with a specific culture dimension and sub-classification;
tag library integration:
integration with the model: the constructed tag library is integrated into the cultural background adaptation module for cultural feature matching and adaptation;
providing an API interface: an interface is provided through which the tag library communicates with other systems or modules, supporting diverse application requirements.
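One possible data structure for such a tag library is sketched below. The class names, the two methods and the naive substring matching are assumptions made for illustration; the text does not specify a storage format or matching algorithm, and the sample expressions in the usage note are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CultureTag:
    """One label: a culture dimension, a sub-classification and its expressions."""
    dimension: str
    sub_category: str
    expressions: set[str] = field(default_factory=set)


class CultureTagLibrary:
    """In-memory tag library keyed by (dimension, sub-classification)."""

    def __init__(self) -> None:
        self._tags: dict[tuple[str, str], CultureTag] = {}

    def add(self, dimension: str, sub_category: str, expression: str) -> None:
        """Associate a keyword or expression with a dimension and sub-classification."""
        key = (dimension, sub_category)
        tag = self._tags.setdefault(key, CultureTag(dimension, sub_category))
        tag.expressions.add(expression.lower())

    def match(self, text: str) -> list[tuple[str, str, str]]:
        """Return sorted (dimension, sub-classification, expression) hits found in text."""
        lowered = text.lower()
        hits = [
            (tag.dimension, tag.sub_category, expr)
            for tag in self._tags.values()
            for expr in tag.expressions
            if expr in lowered           # naive substring match, for illustration only
        ]
        return sorted(hits)
```

After adding, say, "Dear Sir or Madam" under business etiquette and "thanks a lot" under daily etiquette, calling `match` on a sentence containing both returns one hit per expression, which the adaptation module could then use to decide how to adjust tone and politeness.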
In order to verify the effectiveness of the identification method of the present invention, the following related tests were performed.
1. Design of experiment
Test dataset: 5000 multilingual translation test samples of 10 different cultural backgrounds and languages were selected.
Evaluation criteria: accuracy (85% target), sensitivity (80% target), cultural adaptability (90% target).
Comparison experiment: 3 existing translation methods were chosen for comparison.
2. Text pre-processing effect test
Experiment: 1000 samples were pre-processed.
Results: the translation accuracy of the preprocessed samples improved by 12%, reaching 88%.
3. Feature extraction and term recognition effect testing.
Experiment: feature extraction was performed on 800 samples.
Results: the accuracy is improved by 18 percent and reaches 91 percent.
4. Multi-language term library matching effect test.
Experiment: the term library matching is performed on 700 samples.
Results: the accurate matching rate reaches 92%, an improvement of 14% over the prior art.
5. Cultural background adaptation module effect test.
Experiment: a cultural background adaptation module was applied to 600 samples.
Results: the culture adaptability score is improved by 20 percent and reaches 95 percent.
6. Comprehensive effect test.
Experiment: the entire test dataset was tested comprehensively.
Results: the overall accuracy is 90%, the sensitivity is 87% and the cultural adaptability is 94%; these results exceed the preset targets and are more than 10% higher than those of the existing systems.
7. Expert review.
Experiment: a manual review was conducted by 5 experts.
Results: the average score was 9.2/10, and the experts unanimously recognized the innovation and utility of the invention.
8. Statistical analysis and conclusion.
Analysis: statistical analysis methods such as t-test and ANOVA were used.
Conclusion: the differences are statistically significant (p < 0.01), indicating a significant advantage over the prior art.
By combining specific experimental data, the remarkable progress of the invention in the aspects of multilingual translation accuracy, cultural adaptability and the like can be seen more clearly.
Table 1 comparison table of experimental test data
Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the invention is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the invention, the steps may be implemented in any order and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.
The present invention is intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the present invention should be included in the scope of the present invention.

Claims (9)

1. A method for recognizing a term for multilingual translation, comprising the steps of:
step one: preprocessing an input text by using a natural language processing technology, wherein the preprocessing comprises word segmentation, part-of-speech tagging and syntactic analysis;
step two: extracting features of the preprocessed text by using a pre-trained machine learning model;
step three: recognizing and classifying terms in the text by combining natural language processing and a machine learning model;
step four: matching the identified terms with a multi-language term library constructed in advance to obtain corresponding translations;
step five: based on the cultural background adaptation module, according to cultural custom and characteristics of the target language, the context and style of term translation are adjusted, and translation content is ensured to accord with cultural background and acceptance of the target language community.
2. The method for recognizing terms for multilingual translation according to claim 1, wherein the preprocessing in the first step specifically comprises:
text cleaning: firstly, cleaning an input original text, including removing redundant blank characters, special symbols and non-text elements, so as to ensure the text quality;
word segmentation: word segmentation is carried out on the cleaned text, and continuous text is segmented into independent words or phrases;
part of speech tagging: marking the parts of speech of the words after word segmentation, and identifying the parts of speech of each word;
syntax analysis: further performing syntactic analysis on the text to determine the dependency relationship between words;
removing stop words: during the analysis, common stop words are removed;
semantic role labeling: the principal components in the sentence are identified and the relationship between the principal components is understood.
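The cleaning, word segmentation and stop-word removal steps of claim 2 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the patented implementation: the function names, the regular expressions and the `STOP_WORDS` list are all assumptions, and part-of-speech tagging, syntactic analysis and semantic role labeling are left out because they require a trained NLP toolkit.

```python
import re

# Hypothetical stop-word list for illustration only; a real system would use
# a language-specific list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def clean_text(raw: str) -> str:
    """Remove non-text elements, special symbols and redundant whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)           # drop HTML-like tags
    no_symbols = re.sub(r"[^\w\s\-]", " ", no_tags)  # drop special symbols
    return re.sub(r"\s+", " ", no_symbols).strip()   # collapse whitespace


def segment(text: str) -> list[str]:
    """Split cleaned text into lowercase word tokens (hyphens are kept)."""
    return re.findall(r"[\w\-]+", text.lower())


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Filter out tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def preprocess(raw: str) -> list[str]:
    """Run cleaning, segmentation and stop-word removal in sequence."""
    return remove_stop_words(segment(clean_text(raw)))
```

Regex-based segmentation of this kind only suits languages that separate words with spaces; languages without word boundaries, such as Chinese, would need a dedicated segmenter.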
3. The method of claim 2, wherein the feature extraction in step two converts the input text into a set of numerical values representing its underlying structure and meaning, and the machine learning model performs feature extraction based on a Transformer model, as follows:
self-attention mechanism:
query, key and value: the query (Q), key (K) and value (V) are calculated from the input word embeddings $X$ by the following formulas:
$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V}$
wherein $W^{Q}$, $W^{K}$ and $W^{V}$ are the weight matrices of the query, key and value, respectively, used to convert the word embeddings into the query, key and value;
attention weighting: the dot product between the query and the key is calculated, and the attention weights are obtained through the scaled softmax function:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$
wherein $d_k$ is the dimension of the key vectors;
feedforward neural network: the output from the attention layer is further processed by the feedforward neural network layer:
$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$
wherein $W_1$, $W_2$, $b_1$ and $b_2$ are learnable weights and biases;
the feedforward neural network layer further transforms the self-attention output, and the activation function and linear transformations add nonlinear capacity to the model, which helps it capture more complex patterns;
output:
the outputs of the self-attention mechanism and the feedforward neural network layer are integrated to obtain the deep feature representation of the input text.
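The self-attention and feedforward steps of claim 3 can be illustrated with a small pure-Python sketch. It assumes the standard single-head Transformer formulation, with a ReLU activation in the feedforward layer; the claim does not name the activation, so that choice is an assumption, as are all function names. Multi-head attention, residual connections and layer normalization are omitted.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix):
    """Return (output, attention weights) for word embeddings x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])                   # dimension of the key vectors
    scores = matmul(q, transpose(k))  # query-key dot products
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def feed_forward(h: Matrix, w1: Matrix, b1: list[float],
                 w2: Matrix, b2: list[float]) -> Matrix:
    """Two linear maps with a ReLU in between (activation is an assumption)."""
    hidden = [[max(0.0, val + b) for val, b in zip(row, b1)]
              for row in matmul(h, w1)]
    return [[val + b for val, b in zip(row, b2)] for row in matmul(hidden, w2)]
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors; the feedforward layer is then applied to each position independently.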
4. A method for recognizing terms for multilingual translation according to claim 3, wherein the pre-constructed multilingual term library includes mappings of professional terms across fields and between different languages, and step four specifically includes:
normalizing the identified terms so that their format and representation are consistent with the standards of the term library;
comparing the normalized terms with related entries in the multi-language term library through a searching and matching algorithm;
after a matching term is found, extracting the equivalent expression of the term in the target language from the multilingual term library;
if the term has no directly corresponding entry in the multi-language term library, adopting a fallback strategy in which the term is split into smaller parts and each part is translated and then recombined;
integrating the translation of the term with the other translated portions to generate a complete, fluent multilingual translated text.
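The lookup of claim 4, including the strategy of splitting an unmatched term into smaller parts, can be sketched as below. The term library contents, the `(term, language)` key layout and the function names are hypothetical, and the recombination step is a deliberate placeholder that simply joins the part translations with spaces.

```python
# Hypothetical English-to-Spanish entries, keyed by (normalized term, language).
TERM_LIBRARY = {
    ("heat exchanger", "es"): "intercambiador de calor",
    ("heat", "es"): "calor",
    ("pump", "es"): "bomba",
}


def normalize(term: str) -> str:
    """Lowercase the term and collapse whitespace to match library keys."""
    return " ".join(term.lower().split())


def translate_term(term: str, target_lang: str, library=TERM_LIBRARY):
    """Return the library translation, a part-wise fallback, or None."""
    key = normalize(term)
    if (key, target_lang) in library:
        return library[(key, target_lang)]  # direct match
    parts = key.split()
    if len(parts) < 2:
        return None                         # nothing left to split
    translated = [library.get((p, target_lang)) for p in parts]
    if any(t is None for t in translated):
        return None                         # some part is unknown
    # Placeholder recombination: keeps source word order, which is often wrong.
    return " ".join(translated)
```

For "heat pump" the fallback yields "calor bomba", whereas the idiomatic Spanish term is "bomba de calor"; this shows why the recombination step needs target-language grammar rather than simple concatenation.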
5. The method of claim 4, wherein the searching and matching algorithm is based on a cosine similarity algorithm for comparing the similarity between two text segments or two vocabulary items.
6. The method for recognizing terms for multilingual translation according to claim 5, wherein the cosine similarity algorithm formula is as follows:
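given two vectors $A$ and $B$, their cosine similarity is calculated as:
$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$
wherein:
$A \cdot B$ denotes the dot product of $A$ and $B$, calculated as:
$A \cdot B = \sum_{i=1}^{n} A_i B_i$
$\|A\|$ and $\|B\|$ denote the Euclidean norms of $A$ and $B$, respectively, calculated as:
$\|A\| = \sqrt{\sum_{i=1}^{n} A_i^{2}}$ and $\|B\| = \sqrt{\sum_{i=1}^{n} B_i^{2}}$
By the above formula, the cosine similarity value falls between -1 and 1, where 1 indicates that the two vectors point in the same direction, 0 indicates that they are uncorrelated (orthogonal), and -1 indicates that they point in opposite directions.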
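The cosine similarity of claims 5 and 6 can be computed directly from its definition. The sketch below is illustrative only: the claims do not say how terms are turned into vectors, so the character-bigram counting in `term_similarity` is an assumption chosen to keep the example self-contained, and returning 0.0 for a zero vector is likewise a convention adopted here.

```python
import math
from collections import Counter


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)


def bigram_counts(term: str) -> Counter:
    """Count character bigrams of a lowercased term (assumed vectorization)."""
    t = term.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def term_similarity(t1: str, t2: str) -> float:
    """Cosine similarity of two terms over their shared bigram vocabulary."""
    c1, c2 = bigram_counts(t1), bigram_counts(t2)
    vocab = sorted(set(c1) | set(c2))
    return cosine_similarity([c1[g] for g in vocab], [c2[g] for g in vocab])
```

Because bigram counts are never negative, `term_similarity` only produces values between 0 and 1; the negative half of the range arises with signed vectors such as learned embeddings.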
7. The method for recognizing terms for multilingual translation according to claim 6, wherein the cultural background adaptation module in step five specifically comprises:
constructing a culture tag library: identifying the key cultural elements of the target culture, including etiquette, values, customs and religion; creating a tag for each cultural element and collecting texts, expressions and metaphors related to it, so that the accurate marking of cultural features provides an accurate reference for subsequent cultural adaptation;
multilevel semantic analysis: based on the syntactic analysis, analyzing the sentence structure using natural language processing techniques, determining the subjects, objects and verbs in the sentence, and identifying their semantic roles;
culture matching and adaptation: matching the text against the culture tag library using a matching algorithm, identifying elements related to the target culture, adjusting the manner of expression, tone and politeness according to the customs and characteristics of the target culture, and converting the source text into the target cultural style using the Transformer model of the machine learning model.
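The matching step of claim 7 can be illustrated with a rule-based sketch. Note the substitution: the claim performs style conversion with a Transformer model, whereas this example replaces tagged expressions using fixed rules, purely to show how text might be matched against a tag library. The `CultureTag` fields and the sample entry are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CultureTag:
    dimension: str    # e.g. "etiquette", "religion"
    expression: str   # expression to look for in the text
    replacement: str  # hypothetical phrasing preferred in the target culture


def _pattern(tag: CultureTag) -> re.Pattern:
    """Whole-word, case-insensitive pattern for a tag's expression."""
    return re.compile(rf"\b{re.escape(tag.expression)}\b", re.IGNORECASE)


def match_tags(text: str, tags: list[CultureTag]) -> list[CultureTag]:
    """Return the tags whose expression occurs in the text."""
    return [t for t in tags if _pattern(t).search(text)]


def adapt(text: str, tags: list[CultureTag]) -> str:
    """Replace every matched expression with its preferred phrasing."""
    for tag in match_tags(text, tags):
        # A callable replacement avoids backslash escapes being interpreted.
        text = _pattern(tag).sub(lambda m, r=tag.replacement: r, text)
    return text
```

With a tag such as `CultureTag("etiquette", "thanks", "thank you very much")`, `adapt` rewrites an informal acknowledgement into a more formal one. A learned model would additionally handle capitalization, agreement and context, which fixed rules do not.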
8. The method of claim 7, wherein the cultural background adaptation module further comprises expert participation and optimization: an expert review platform is constructed that allows cultural experts to review and edit the translation results, and the experts' modifications and feedback are used to continuously optimize the model.
9. The method for recognizing terms for multilingual translation according to claim 8, wherein the construction of the culture tag library comprises:
defining culture dimensions and classifications:
culture dimension selection: selecting the culture dimensions of etiquette, values, customs, religion and language style;
classification definition: defining sub-classifications under each culture dimension;
collecting and analyzing sample text:
sample sources: collecting text related to the target culture from books, papers, the web and social media channels;
text analysis: analyzing the text using natural language processing tools and extracting key information related to the culture dimensions and sub-classifications;
constructing a primary tag library:
keyword and expression extraction: extracting keywords and expressions from the analyzed sample text and associating them with the culture dimensions and sub-classifications;
label generation: generating a label for each keyword and expression, associated with a particular culture dimension and sub-classification;
tag library integration:
integration with the model: integrating the constructed tag library into the cultural background adaptation module for cultural feature matching and adaptation;
providing an API: used by the tag library to communicate with other systems or modules and to support diverse application requirements.
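The tag library construction of claim 9 can be sketched as a mapping from keywords to (dimension, sub-classification) labels. The schema contents, the sample texts and the function names below are hypothetical, and keyword extraction is reduced to checking sample tokens against seed keywords; a real system would use NLP tools for the text analysis step.

```python
import re
from collections import defaultdict

# Hypothetical schema: culture dimension -> sub-classification -> seed keywords.
SCHEMA = {
    "etiquette": {"greeting": {"bow", "handshake"}},
    "religion": {"festival": {"lent", "ramadan"}},
}


def build_tag_library(sample_texts: list[str], schema: dict) -> dict:
    """Label each seed keyword found in the samples with its dimension and
    sub-classification."""
    library = defaultdict(set)
    for text in sample_texts:
        tokens = set(re.findall(r"\w+", text.lower()))
        for dimension, subclasses in schema.items():
            for subclass, keywords in subclasses.items():
                for keyword in keywords & tokens:
                    library[keyword].add((dimension, subclass))
    return dict(library)


def lookup(library: dict, keyword: str) -> list[tuple[str, str]]:
    """Stand-in for the API of the claim: return the labels of a keyword."""
    return sorted(library.get(keyword.lower(), set()))
```

Only keywords that actually occur in the collected samples enter the library, so the samples determine its coverage; `lookup` stands in for the interface through which other modules would query it.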
CN202311090725.XA 2023-08-29 2023-08-29 Multi-language translation term identification method Active CN116822517B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311090725.XA CN116822517B (en) 2023-08-29 2023-08-29 Multi-language translation term identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311090725.XA CN116822517B (en) 2023-08-29 2023-08-29 Multi-language translation term identification method

Publications (2)

Publication Number Publication Date
CN116822517A (en) 2023-09-29
CN116822517B (en) 2023-11-10

Family

ID=88116949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311090725.XA Active CN116822517B (en) 2023-08-29 2023-08-29 Multi-language translation term identification method

Country Status (1)

Country Link
CN (1) CN116822517B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118036577A (en) * 2024-04-11 2024-05-14 一百分信息技术有限公司 Sequence labeling method in natural language processing

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040181390A1 (en) * 2000-09-23 2004-09-16 Manson Keith S. Computer system with natural language to machine language translator
US20110184718A1 (en) * 2010-01-25 2011-07-28 Chen Chung-Ching Interlingua, Interlingua Engine, and Interlingua Machine Translation System
CN107066455A (en) * 2017-03-30 2017-08-18 唐亮 A kind of multilingual intelligence pretreatment real-time statistics machine translation system
CN108491399A (en) * 2018-04-02 2018-09-04 上海杓衡信息科技有限公司 Chinese to English machine translation method based on context iterative analysis
CN110543644A (en) * 2019-09-04 2019-12-06 语联网(武汉)信息技术有限公司 Machine translation method and device containing term translation and electronic equipment
CN114330380A (en) * 2021-12-27 2022-04-12 成都优译信息技术股份有限公司 Multilingual text term extraction method, device, equipment and medium
CN114357975A (en) * 2022-01-07 2022-04-15 上海一者信息科技有限公司 Multilingual term recognition and bilingual term alignment method
CN114722842A (en) * 2022-04-24 2022-07-08 西安领向鸟文化传播有限公司 Computer artificial intelligent foreign language translation method and translation system thereof
CN115017923A (en) * 2022-05-30 2022-09-06 华东师范大学 Professional term vocabulary alignment replacement method based on Transformer translation model
CN115062634A (en) * 2022-06-21 2022-09-16 新疆大学 Medical term extraction method and system based on multilingual parallel corpus
CN115114940A (en) * 2022-06-29 2022-09-27 中译语通科技股份有限公司 Machine translation style migration method and system based on curriculum pre-training
CN115329785A (en) * 2022-10-15 2022-11-11 小语智能信息科技(云南)有限公司 English-Thai-Lao multilingual neural machine translation method and device fused with phoneme features

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUYU LUO;NAN TANG;GUOLIANG LI;JIAWEI TANG;CHENGLIANG CHAI;XUEDI QIN: "Natural Language to Visualization by Neural Machine Translation", IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, vol. 28, no. 1, pages 217 - 226, XP011895518, DOI: 10.1109/TVCG.2021.3114848 *
孙寰;陈岩: "面向翻译的船海核相关专业汉英俄术语库建设研究", 中国科技术语, no. 05, pages 19 - 23 *
王琳: "从跨文化交际角度论商标的翻译", 山西农业大学学报(社会科学版), no. 04, pages 94 - 97 *
王蜜: "浅析文化语境与翻译策略", 经济与社会发展, vol. 8, no. 3 *

Also Published As

Publication number Publication date
CN116822517B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
Navigli et al. Learning word-class lattices for definition and hypernym extraction
JP5167546B2 (en) Sentence search method, sentence search device, computer program, recording medium, and document storage device
CN109685056B (en) Method and device for acquiring document information
JP3906356B2 (en) Syntax analysis method and apparatus
US20040024598A1 (en) Thematic segmentation of speech
WO2008107305A2 (en) Search-based word segmentation method and device for language without word boundary tag
CN111832293B (en) Entity and relation joint extraction method based on head entity prediction
CN109002473A (en) A kind of sentiment analysis method based on term vector and part of speech
CN108733647B (en) Word vector generation method based on Gaussian distribution
CN116822517B (en) Multi-language translation term identification method
CN112541337A (en) Document template automatic generation method and system based on recurrent neural network language model
CN111274829A (en) Sequence labeling method using cross-language information
CN112052319B (en) Intelligent customer service method and system based on multi-feature fusion
CN114266256A (en) Method and system for extracting new words in field
CN111091009A (en) Document association auditing method based on semantic analysis
KR101023209B1 (en) Document translation apparatus and its method
CN112380848B (en) Text generation method, device, equipment and storage medium
CN116881463B (en) Artistic multi-mode corpus construction system based on data
CN113157918A (en) Commodity name short text classification method and system based on attention mechanism
CN109241521B (en) Scientific literature high-attention sentence extraction method based on citation relation
Rosset et al. The LIMSI participation in the QAst track
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN112989839A (en) Keyword feature-based intent recognition method and system embedded in language model
Majeed et al. Comparative study on extractive summarization using sentence ranking algorithm and text ranking algorithm
Ma et al. An enhanced method for dialect transcription via error‐correcting thesaurus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Term Recognition Method for Multilingual Translation

Granted publication date: 20231110

Pledgee: Qilu Bank Co.,Ltd. Jinan Science and Technology Innovation Financial Center Branch

Pledgor: Baishun Information Technology Co.,Ltd.

Registration number: Y2024980021473