CN116822517A

CN116822517A - Multi-language translation term identification method

Info

Publication number: CN116822517A
Application number: CN202311090725.XA
Authority: CN
Inventors: 周兰栋; 孔坤坤; 刘敏杰; 傅泉铭; 崔晓静; 李园园; 刘莹
Original assignee: Baishun Information Technology Co ltd
Current assignee: Baishun Information Technology Co ltd
Priority date: 2023-08-29
Filing date: 2023-08-29
Publication date: 2023-09-29
Anticipated expiration: 2043-08-29
Also published as: CN116822517B

Abstract

The invention relates to the technical field of language identification, in particular to a multi-language translation term identification method, which comprises the following steps: preprocessing an input text by using a natural language processing technology, wherein the preprocessing comprises word segmentation, part-of-speech tagging and syntactic analysis; extracting features of the preprocessed text by using a pre-trained machine learning model; recognizing and classifying terms in the text by combining natural language processing and a machine learning model; matching the identified terms with a multi-language term library constructed in advance to obtain corresponding translations; based on the cultural background adaptation module, according to cultural custom and characteristics of the target language, the context and style of term translation are adjusted, and translation content is ensured to accord with cultural background and acceptance of the target language community. The invention achieves higher accuracy in terms of term recognition and feature extraction by combining natural language processing with a pre-trained machine learning model.

Description

Multi-language translation term identification method

Technical Field

The invention relates to the technical field of language identification, in particular to a term identification method for multi-language translation.

Background

Existing multilingual translation relies mainly on fixed dictionaries and grammar rules, which often ignore differences in cultural background and fail to match specific terms exactly. Some existing systems employ statistical machine translation methods which, while improving accuracy, remain limited in sensitivity and cultural adaptability.

Text preprocessing has been used in some existing solutions, but it generally lacks optimization for multilingual and multicultural contexts. Existing term recognition methods are mostly based on simple dictionary lookup and lack flexibility and accuracy. Although some systems attempt to apply machine learning to term matching, most of these methods cannot adapt to and precisely match cultural contexts. Existing translation systems also generally lack the ability to adapt to the target cultural context, which often results in translations with inappropriate context and style.

Although the existing multi-language translation technology has advanced to some extent, significant limitations still exist in key aspects such as accuracy, sensitivity, cultural adaptability and the like. These limitations of the prior art highlight the urgent need for new methods and techniques, particularly the need to better integrate natural language processing and machine learning, and to take into account the complexity of the cultural background.

Disclosure of Invention

Based on the above objects, the present invention provides a term recognition method for multi-language translation.

A method for recognizing a term for multilingual translation, comprising the steps of:

step one: preprocessing an input text by using a natural language processing technology, wherein the preprocessing comprises word segmentation, part-of-speech tagging and syntactic analysis;

step two: extracting features of the preprocessed text by using a pre-trained machine learning model;

step three: recognizing and classifying terms in the text by combining natural language processing and a machine learning model;

step four: matching the identified terms with a multi-language term library constructed in advance to obtain corresponding translations;

step five: based on the cultural background adaptation module, according to cultural custom and characteristics of the target language, the context and style of term translation are adjusted, and translation content is ensured to accord with cultural background and acceptance of the target language community.

Further, the preprocessing in the first step specifically includes:

text cleaning: firstly, the input original text is cleaned, including removing redundant blank characters, special symbols and non-text elements, so as to ensure the text quality.

Word segmentation: word segmentation is carried out on the cleaned text, and continuous text is segmented into independent words or phrases;

part of speech tagging: marking the parts of speech of the words after word segmentation, and identifying the parts of speech of each word;

syntax analysis: further performing syntactic analysis on the text to determine the dependency relationship between words;

removing stop words: during the analysis, common stop words are removed;

semantic role labeling: the principal components in the sentence are identified and the relationship between the principal components is understood.

Further, the feature extraction in step two converts the input text into a set of numerical values representing its underlying structure and meaning; the machine learning model performs feature extraction based on a Transformer model, expressed as follows:

self-attention mechanism

Query, key, and value: the query (Q), key (K) and value (V) are computed from the input word embedding $X$ by the following formulas:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

where $W_Q$, $W_K$ and $W_V$ are the weight matrices of the query, key and value respectively, used to convert the word embedding into the query, key and value;

attention weighting: the dot product between the query and the key is calculated, and the attention weights are obtained through the scaled softmax function:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vector;

feedforward neural network: the output from the attention layer is further processed by the feedforward neural network layer:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

where $W_1$, $W_2$, $b_1$ and $b_2$ are learnable weights and biases;

the feedforward neural network layer further transforms the self-attention output; the activation function and linear transformations increase the nonlinear capacity of the model, which helps it capture more complex patterns;

output:

and integrating the self-attention mechanism and the output of the feedforward neural network layer to obtain the depth characteristic representation of the input text.
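The self-attention and feedforward steps described above can be sketched in plain Python. This is a minimal illustration, not the patented implementation: the helper names (`matmul`, `softmax`, `self_attention`, `feed_forward`), the toy input values, the identity weight matrices and the choice of ReLU as the activation function are all assumptions made for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])                       # dimension of the key vectors
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise feedforward layer; ReLU is an assumed activation."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

# Toy input: 3 tokens with 2-dimensional embeddings (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weight matrices

attended, weights = self_attention(x, identity, identity, identity)
features = feed_forward(attended, identity, [0.0, 0.0], identity, [0.0, 0.0])
```

Each row of `weights` sums to 1, so every token's output is a weighted average of the value vectors; `features` holds one feature vector per input token. A trained model would use learned weight matrices and typically several stacked layers.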

Further, the pre-built multilingual term library includes professional term mapping between each domain and different languages, and the step four specifically includes:

the identified terms are subjected to normalization processing, so that the formats and the representations of the identified terms are consistent with the standards in the term library;

comparing the standardized terms with related items in a multi-language term library through a searching and matching algorithm;

after finding the matched term, extracting the equivalent expression of the term in the target language from the multi-language term library, wherein the mapping considers the subtle differences under different languages and cultural backgrounds, and ensures the accuracy and the readability of translation;

if the target term has no direct counterpart in the multi-language term library, a fallback strategy is adopted: the term is split into smaller parts, and each part is translated and the parts are then recombined;

by integrating the translation of the term with other translated portions, a complete, fluent multi-lingual translation text is generated.
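The normalization, lookup and fallback steps of step four can be sketched as follows. The term library contents, the function names and the word-by-word recombination are hypothetical; in particular, simple concatenation stands in for the "reorganizing" step, which in practice would need target-language grammar.

```python
def normalize(term):
    """Lowercase and collapse whitespace so lookups use one canonical form."""
    return " ".join(term.lower().split())

# Hypothetical English -> German term library (illustrative entries only).
TERM_LIBRARY = {
    "neural network": "neuronales Netz",
    "data set": "Datensatz",
    "data": "Daten",
    "model": "Modell",
}

def translate_term(term, library):
    """Return (translation, used_fallback) for a recognized term."""
    key = normalize(term)
    if key in library:
        return library[key], False
    # Fallback: split into words, translate each known part,
    # and keep unknown parts unchanged.
    parts = [library.get(word, word) for word in key.split()]
    return " ".join(parts), True

direct = translate_term("Neural   Network", TERM_LIBRARY)
fallback = translate_term("Data Model", TERM_LIBRARY)
```

Here `direct` is `("neuronales Netz", False)` and `fallback` is `("Daten Modell", True)`; the flag lets a later stage treat fallback translations with more caution.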

Further, the search and match algorithm is based on a cosine similarity algorithm for comparing similarity between two pieces of text or two vocabulary items.

Further, the cosine similarity algorithm formula is as follows:

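Given two vectors $A$ and $B$, their cosine similarity is calculated as:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where:

$A \cdot B$ denotes the dot product of $A$ and $B$, calculated as:

$$A \cdot B = \sum_{i=1}^{n} A_i B_i$$

$\|A\|$ and $\|B\|$ denote the magnitudes of $A$ and $B$ respectively, calculated as the Euclidean norm:

$$\|A\| = \sqrt{\sum_{i=1}^{n} A_i^2} \quad \text{and} \quad \|B\| = \sqrt{\sum_{i=1}^{n} B_i^2}$$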

By the above formula, the cosine similarity value falls between -1 and 1, where 1 indicates that the two vectors point in exactly the same direction, 0 indicates that they are uncorrelated (orthogonal), and -1 indicates that they point in exactly opposite directions.
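As a worked illustration of the cosine similarity calculation, take two made-up three-dimensional vectors $A = (1, 2, 2)$ and $B = (2, 1, 2)$ (the values are chosen only for easy arithmetic and do not come from the invention):

```latex
\begin{aligned}
A \cdot B &= 1\cdot 2 + 2\cdot 1 + 2\cdot 2 = 8 \\
\|A\| &= \sqrt{1^2 + 2^2 + 2^2} = 3, \qquad
\|B\| = \sqrt{2^2 + 1^2 + 2^2} = 3 \\
\cos(\theta) &= \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.889
\end{aligned}
```

A value this close to 1 indicates that the two vectors point in nearly the same direction.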

Further, the cultural background adapting module in the fifth step specifically includes:

constructing a culture tag library: key elements of the target culture are identified, including etiquette, values, customs and religion; a label is created for each cultural element, and texts, expressions and metaphors related to that element are collected, so that the accurate marking of cultural features provides an accurate reference for subsequent cultural adaptation;

multilevel semantic analysis: based on syntactic analysis, analyzing sentence structure by using natural language processing technology, determining subjects, objects and verbs in sentences, identifying semantic roles of the subjects and the verbs, and helping to understand the intention and implicit meaning of the text more deeply through multi-level analysis;

culture matching and adaptation: the text is matched against the culture tag library based on a matching algorithm, and elements related to the target culture are identified; the mode of expression, tone and politeness are adjusted according to the customs and characteristics of the target culture; the source text is converted into the target cultural style using a Transformer model of the machine learning model; matching against the culture tag library together with deep-learning style conversion ensures the cultural adaptability of the text.

Further, the cultural background adaptation module includes expert participation and optimization: an expert review platform is constructed that allows cultural experts to review and edit the translation results, and the experts' modifications and feedback are used for continuous optimization of the model.

Further, the tag library construction includes:

defining culture dimensions and classifications:

culture dimension selection: the culture dimensions of etiquette, values, customs, religion and language style are selected;

classification definition: sub-classifications are defined under each culture dimension;

collecting and analyzing sample text:

sample sources: text related to the target culture is collected from books, papers, the web and social media channels;

text analysis: the text is analyzed using natural language processing tools, and key information related to the culture dimensions and sub-classifications is extracted;

constructing a primary tag library:

keyword and expression extraction: keywords and expressions are extracted from the analyzed sample text and associated with culture dimensions and sub-classifications;

label generation: a label is generated for each keyword and expression and associated with a particular culture dimension and sub-classification;

tag library integration:

integration with the model: integrating the constructed tag library into a cultural background adaptation module for cultural feature matching and adaptation;

providing an API interface: an interface through which the tag library communicates with other systems or modules, supporting diversified application requirements.
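One possible shape for the tag library described above is a list of records, each linking a keyword or expression to a culture dimension and sub-classification. The sketch below is an assumption for illustration only: the `CultureTag` class, the sample entries and the substring-based `match_tags` function are hypothetical and are not taken from the invention.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CultureTag:
    dimension: str   # culture dimension, e.g. "etiquette"
    sub_class: str   # sub-classification, e.g. "business etiquette"
    keyword: str     # keyword or expression collected from sample text

# Hypothetical entries; a real library would be built from analyzed samples.
TAG_LIBRARY = [
    CultureTag("etiquette", "business etiquette", "business card"),
    CultureTag("etiquette", "daily etiquette", "thank you"),
    CultureTag("custom", "festival", "spring festival"),
]

def match_tags(text, library):
    """Return the tags whose keyword occurs in the text (case-insensitive)."""
    lowered = text.lower()
    return [tag for tag in library if tag.keyword in lowered]

hits = match_tags(
    "Please exchange a Business Card before the Spring Festival.", TAG_LIBRARY
)
matched_keywords = [tag.keyword for tag in hits]
```

The matched tags identify which cultural dimensions are present in a text, which a later adaptation stage could use to decide what to adjust. Substring matching is the simplest possible matcher; a similarity-based matcher could be swapped in behind the same function signature.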

The invention has the beneficial effects that:

the invention achieves higher accuracy in terms of term recognition and feature extraction by combining natural language processing with a pre-trained machine learning model. Compared with the existing method, the overall accuracy is improved.

The feature extraction and term matching processes of the invention are more sensitive and flexible, and can more accurately identify and process specific terms and modes of expression across languages and cultural backgrounds. By means of the cultural background adaptation module, the context and style of term translation can be adjusted according to the cultural customs and characteristics of the target language, so that the translation result better meets the expectations and acceptance of the target audience.

By matching against a pre-constructed multilingual term library, the invention can support a wider range of languages and provides strong support for global communication and business. Its sensitive handling of cultural background and context helps to promote communication and understanding among people of different cultural and linguistic backgrounds; this social benefit should not be overlooked.

Drawings

In order to illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show only embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of an identification method according to an embodiment of the invention;

fig. 2 is a schematic diagram of a cultural background adaptation module according to an embodiment of the invention.

Detailed Description

The present invention will be further described in detail with reference to specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent.

It is to be noted that unless otherwise defined, technical or scientific terms used herein should be taken in a general sense as understood by one of ordinary skill in the art to which the present invention belongs. The terms "first," "second," and the like, as used herein, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.

As shown in FIGS. 1-2, a term recognition method for multi-language translation comprises steps one to five set out above, of which step five is as follows:

step five: based on the cultural background adaptation module, according to cultural custom and characteristics of the target language, the context and style of term translation are adjusted, and translation content is ensured to accord with the cultural background and acceptance of the target language community;

This step is concerned not only with conversion at the text level but also with the adaptation of context and style, so that the translation content better accords with the cultural background and acceptance of the target language community. Such adaptation to different cultural backgrounds is rare in traditional translation methods, but has practical application value and is innovative in the context of globalization.

The pretreatment of the first step specifically comprises the following steps:

text cleaning: firstly, the input original text is cleaned, including removing redundant blank characters, special symbols, non-text elements and the like, so as to ensure the text quality.

Word segmentation: the cleaned text is subjected to word segmentation processing, continuous text is segmented into individual words or phrases, and a dictionary matching method, a statistical method or a deep learning method can be used in the step to adapt to different languages and fields;

part-of-speech tagging: the segmented words are tagged with their parts of speech (such as noun, verb, adjective and the like), which helps in understanding the grammatical function and meaning of each word in the text;

syntax analysis: further parsing the text to determine dependencies between words, which can help the system understand the structure and meaning of sentences;

removing stop words: during analysis, common stop words (e.g., "and", "is", etc.) are removed; in many cases these words carry no substantive meaning, and removing them helps to reduce the complexity of the analysis;

semantic role labeling: identifying principal components in a sentence, such as subjects, verbs, objects, etc., and understanding the relationships between the principal components, helps to accurately capture the meaning of text;

the above together form a comprehensive and careful text preprocessing process. Through this process, the entered text is converted into a form that is easier to analyze and process, which lays a solid foundation for subsequent term recognition and translation work.
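Part of the preprocessing process above (text cleaning, dictionary-based word segmentation and stop-word removal) can be sketched in a few lines. The stop-word list, the phrase dictionary and the function names are hypothetical; forward maximum matching is used here as one simple instance of the "dictionary matching method", and part-of-speech tagging, syntactic analysis and semantic role labeling are omitted because they normally rely on a trained NLP toolkit.

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "is", "for"}     # illustrative only
DICTIONARY = {"machine translation", "term library"}    # hypothetical phrases

def clean(text):
    """Replace symbols with spaces, collapse whitespace, and lowercase."""
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def segment(text, dictionary, max_len=3):
    """Forward maximum matching: prefer the longest phrase found in the dictionary."""
    words = text.split()
    tokens, i = [], 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            # A single word is always accepted, so the loop always advances.
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens

def preprocess(text):
    """Clean, segment, and drop stop words."""
    return [t for t in segment(clean(text), DICTIONARY) if t not in STOP_WORDS]

tokens = preprocess("The term   library is used for machine translation!!")
# tokens == ["term library", "used", "machine translation"]
```

Keeping multi-word phrases such as "term library" as single tokens matters for the later term recognition step, since many technical terms span more than one word.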

The feature extraction of step two converts the input text into a set of numerical values representing the underlying structure and meaning of the text; the machine learning model performs feature extraction based on a Transformer model, expressed as follows:

self-attention mechanism

Query, key, and value: the query (Q), key (K) and value (V) are computed from the input word embedding $X$ by the following formulas:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

where $W_Q$, $W_K$ and $W_V$ are the weight matrices of the query, key and value respectively, used to convert word embeddings into queries, keys and values, which can be regarded as encodings for a specific task, such as term recognition;

attention weighting:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vector;

through the self-attention calculation, the model takes into account the interrelationships and dependencies between words in the text: $QK^{T}$ computes the similarity between the query and the key, and the softmax function then converts this similarity into weights;

feedforward neural network:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

where $W_1$, $W_2$, $b_1$ and $b_2$ are learnable weights and biases;

the feedforward neural network layer further transforms the self-attention output and increases the nonlinear capacity of the model through activation functions and linear transformations, which helps to capture more complex patterns and can be regarded as specialized adjustment and optimization for a specific task (such as term recognition);

output:

combining the self-attention mechanism and the output of the feedforward neural network layer to obtain depth characteristic representation of the input text, wherein the characteristics capture the complex relation between words in the text and can be used for subsequent term recognition and translation tasks;

the feature extraction process can capture deep semantic and structural information in the input text, and provides a rich basis for subsequent processing. In practical applications, the model structure and parameters may be adjusted according to specific requirements and data.

The pre-built multi-language term library comprises professional term mapping among various fields and different languages, and the fourth step specifically comprises the following steps:

the identified terms are subjected to standardization processing, the format and representation of the identified terms are consistent with the standards in a term library, and operations such as stem extraction, synonym replacement and the like are performed;

generating a complete and smooth multilingual translation text by integrating the translation of the term with other translation parts;

in general, matching the identified terms to a library of pre-constructed multilingual terms and obtaining corresponding translations is a complex but highly valuable task. It relates to a number of advanced techniques and methods, which are a core component in modern multi-language processing systems.

The find and match algorithm is based on a cosine similarity algorithm to compare the similarity between two pieces of text or two vocabulary items.

The cosine similarity algorithm formula is as follows:

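Given two vectors $A$ and $B$, their cosine similarity is calculated as:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where:

$A \cdot B$ denotes the dot product of $A$ and $B$, calculated as:

$$A \cdot B = \sum_{i=1}^{n} A_i B_i$$

$\|A\|$ and $\|B\|$ denote the magnitudes of $A$ and $B$ respectively, calculated as the Euclidean norm:

$$\|A\| = \sqrt{\sum_{i=1}^{n} A_i^2} \quad \text{and} \quad \|B\| = \sqrt{\sum_{i=1}^{n} B_i^2}$$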

By the above formula, the cosine similarity value falls between -1 and 1, where 1 indicates that the two vectors point in exactly the same direction, 0 indicates that they are uncorrelated (orthogonal), and -1 indicates that they point in exactly opposite directions;

in the term matching and translation scenario described above, the similarity of the identified term to each item in the multilingual term library may be compared by cosine similarity, thereby finding the most appropriate corresponding item. This approach is very effective in dealing with spelling variants, abbreviations, synonyms, etc.
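To illustrate how cosine similarity can tolerate spelling variants, the sketch below compares terms by their character-bigram counts. The vectorization scheme is an assumption made for the example (the source does not state how terms are turned into vectors; learned embeddings would be another option), and the function names and sample entries are hypothetical.

```python
import math
from collections import Counter

def bigram_vector(term):
    """Represent a term as counts of its character bigrams."""
    s = term.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)

def best_match(term, entries):
    """Return the library entry most similar to the recognized term."""
    vec = bigram_vector(term)
    return max(entries, key=lambda e: cosine_similarity(vec, bigram_vector(e)))

entries = ["tokenization", "translation", "transformer"]
match = best_match("tokenisation", entries)  # -> "tokenization"
```

The British spelling "tokenisation" shares 9 of its 11 bigrams with "tokenization", giving a similarity of 9/11, well above the other entries. Because count vectors have no negative components, similarities computed this way fall between 0 and 1.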

The cultural background adaptation module in the fifth step specifically comprises:

culture matching and adaptation: the text is matched against the culture tag library based on a matching algorithm, and elements related to the target culture are identified; the mode of expression, tone and politeness are adjusted according to the customs and characteristics of the target culture; the source text is converted into the target cultural style using a Transformer model of the machine learning model; matching against the culture tag library together with deep-learning style conversion ensures the cultural adaptability of the text;

through the technical scheme, the cultural background adaptation module can comprehensively understand and adapt to custom and characteristics of target culture. The schemes comprehensively utilize the technologies of natural language processing, machine learning, data mining, artificial intelligence and the like to ensure that the translation result is accurate in language and matched with a target audience in cultural context and style.

The cultural background adaptation module includes expert participation and optimization: an expert review platform is constructed that allows cultural experts to review and edit the translation results, and the experts' modifications and feedback are used for continuous optimization of the model.

The label library construction comprises the following steps:

defining culture dimensions and classifications:

classification definition: sub-classifications are defined under each culture dimension; for example, etiquette can be subdivided into business etiquette and daily etiquette;

collecting and analyzing sample text:

constructing a primary tag library:

tag library integration:

In order to verify the effectiveness of the identification method of the present invention, the following related tests were performed.

1. Design of experiment

Test dataset: 5000 multilingual translation test samples of 10 different cultural backgrounds and languages were selected.

Evaluation criteria: accuracy (85% target), sensitivity (80% target), cultural adaptability (90% target).

Comparison experiment: 3 existing translation methods were chosen for comparison.

2. Text pre-processing effect test

Experiment: 1000 samples were pre-processed.

Results: the translation accuracy of the pre-processed samples improved by 12% and reached 88%.

3. Feature extraction and term recognition effect testing.

Experiment: feature extraction was performed on 800 samples.

Results: the accuracy is improved by 18 percent and reaches 91 percent.

4. Multi-language term library matching effect test.

Experiment: the term library matching is performed on 700 samples.

Results: the accurate matching rate reaches 92%, an improvement of 14% over the prior art.

5. Cultural background adaptation module effect test.

Experiment: a cultural background adaptation module was applied to 600 samples.

Results: the culture adaptability score is improved by 20 percent and reaches 95 percent.

6. Comprehensive effect test.

Experiment: the entire test dataset is tested comprehensively.

Results: the overall accuracy is 90%, the sensitivity is 87%, and the cultural adaptability is 94%; the results exceed the preset targets and are more than 10% higher than those of the conventional systems.

7. Expert review.

Experiment: 5 experts conducted a manual review.

Results: the average score was 9.2/10, and the experts unanimously recognized the innovation and utility of the present invention.

8. Statistical analysis and conclusion.

Analysis: statistical analysis methods such as t-test and ANOVA were used.

Conclusion: the differences are statistically significant (p < 0.01), indicating a significant advantage over the prior art.

By combining specific experimental data, the remarkable progress of the invention in the aspects of multilingual translation accuracy, cultural adaptability and the like can be seen more clearly.

Table 1: Comparison of experimental test data

Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the invention is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the invention, the steps may be implemented in any order and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.

The present invention is intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the present invention should be included in the scope of the present invention.

Claims

1. A method for recognizing a term for multilingual translation, comprising the steps of:

2. The method for recognizing terms for multilingual translation according to claim 1, wherein the preprocessing in the first step specifically comprises:

text cleaning: firstly, cleaning an input original text, including removing redundant blank characters, special symbols and non-text elements, so as to ensure the text quality;

removing stop words: during the analysis, common stop words are removed;

3. The method of claim 2, wherein the feature extraction in step two is used to convert the input text into a set of numerical values representing its underlying structure and meaning, and the machine learning model performs feature extraction based on a Transformer model, as follows:

self-attention mechanism

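Query, key, and value: the query (Q), key (K) and value (V) are computed from the input word embedding $X$ by the following formulas:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$$

where $W_Q$, $W_K$ and $W_V$ are the weight matrices of the query, key and value respectively;

attention weighting:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the key vector;

feedforward neural network:

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

where $W_1$, $W_2$, $b_1$ and $b_2$ are learnable weights and biases;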

output:

4. A method for recognizing terms for multilingual translation according to claim 3, wherein the pre-constructed multilingual term library includes professional term mappings between fields and different languages, and the step four specifically includes:

after finding a matching term, extracting the equivalent expression of the term in the target language from the multilingual term library;

5. The method of claim 4, wherein the find and match algorithm is based on a cosine similarity algorithm for comparing similarity between two text segments or two vocabulary items.

6. The method for recognizing terms for multilingual translation according to claim 5, wherein the cosine similarity algorithm formula is as follows:

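Given two vectors $A$ and $B$, their cosine similarity is calculated as:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where:

$A \cdot B$ denotes the dot product of $A$ and $B$, calculated as:

$$A \cdot B = \sum_{i=1}^{n} A_i B_i$$

$\|A\|$ and $\|B\|$ denote the magnitudes of $A$ and $B$ respectively, calculated as the Euclidean norm:

$$\|A\| = \sqrt{\sum_{i=1}^{n} A_i^2} \quad \text{and} \quad \|B\| = \sqrt{\sum_{i=1}^{n} B_i^2}$$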

7. The method for recognizing terms for multilingual translation according to claim 6, wherein the cultural background adaptation module in the fifth step specifically comprises:

multilevel semantic analysis: based on syntactic analysis, analyzing the sentence structure by using a natural language processing technology, determining subjects, objects and verbs in the sentence, and identifying semantic roles of the subjects, the objects and the verbs;

culture matching and adaptation: the text is matched against the culture tag library based on a matching algorithm, elements related to the target culture are identified, the mode of expression, tone and politeness are adjusted according to the customs and characteristics of the target culture, and the source text is converted into the target cultural style using a Transformer model of the machine learning model.

8. The method of claim 7, wherein the cultural background adaptation module further comprises expert participation and optimization: an expert review platform is constructed that allows cultural experts to review and edit the translation results, and the experts' modifications and feedback are used for continuous optimization of the model.

9. The method for recognizing terms for multilingual translation according to claim 8, wherein the tag library construction comprises:

defining culture dimensions and classifications:

collecting and analyzing sample text:

constructing a primary tag library:

tag library integration: