CN113420570A - Method, system and device for improving translation accuracy - Google Patents

Method, system and device for improving translation accuracy

Info

Publication number
CN113420570A
Authority
CN
China
Prior art keywords
expression
word
translation
result
digital
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110745049.XA
Other languages
Chinese (zh)
Other versions
CN113420570B (en)
Inventor
郝顺平
关祎宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Chuangsijiye Technology Co ltd
Original Assignee
Shenyang Chuangsijiye Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Chuangsijiye Technology Co ltd
Priority to CN202110745049.XA
Publication of CN113420570A
Application granted
Publication of CN113420570B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/51 Translation evaluation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method, a system and a device for improving translation accuracy. The word array of an input original text and the original-text arrays stored in a translation memory are each converted into arrays of numbers, so that a word-by-word comparison only needs to compare numbers. This reduces the number of comparisons and avoids transcoding characters on every comparison. Because a computer processes numbers faster than text, the original-text content is compared faster, which directly improves the performance and calculation speed of the matching algorithm. In addition, each original text in the translation memory is converted into numbers before it is stored, so the stored number arrays can be compared directly in later matching, which reduces conversion cost and further improves performance.

Description

Method, system and device for improving translation accuracy
Technical Field
The invention relates to the field of intelligent translation, in particular to a method, a system and a device for improving translation accuracy.
Background
A translation memory is translation-assistance software commonly used in the translation field. It continuously collects and stores original texts and translations from translation projects that have been reviewed and found free of quality defects, and it provides a matching algorithm that performs similarity matching between an input original text to be translated and the stored original texts, finally returning a batch of stored original texts with high similarity together with their corresponding translations. Because the translations have been reviewed, they provide a high-quality reference for the translator.
Accordingly, the key links affecting the matching performance of a translation memory are the storage of the original text and the matching algorithm. The original-text content is the basis of the matching algorithm and directly determines its computing performance, mainly in terms of how fast original texts can be compared. As the content of the translation memory gradually accumulates with translation work, the overhead of comparing original texts grows, so the processing of the original-text content and the design of its data type are important factors in the matching performance of the translation memory.
In the traditional approach, the whole sentence of original text is stored in the translation memory as a character string. When matching, the stored original text is retrieved and segmented into words, the input original text is segmented as well, the similarity of the two sentences is calculated from the two word arrays by a matching algorithm, and finally the entry with the highest similarity in the translation memory is obtained.
This text-based matching can cause performance problems when each word of the original text is compared. A computer processes a character string by converting each character into its ASCII code and then processing and comparing the code values one by one, so the more characters a word has, the more comparisons are needed, and if the number of words is also large, the matching algorithm can become very slow. As the translation memory grows with translation work, the number of reference original texts and of original-text words increases and the performance problem becomes more pronounced: translators wait longer for reference translations, and translation efficiency falls.
Disclosure of Invention
Because matching original texts as text and retrieving the corresponding reference translations degrades computing performance as the content of the translation memory accumulates, and thereby reduces translation efficiency, the invention discloses an original-text processing method that addresses this technical problem: the overall matching performance of the translation memory is improved by optimizing the storage mode and data structure of the original text.
The invention provides a method for improving translation accuracy, which comprises the following steps:
collecting a first word of first translation data without quality defects and a second word corresponding to the first word, and performing binary conversion on the first word and the second word respectively to obtain a first digital expression of the first word and a second digital expression of the second word, wherein the first word is an original word of the first translation data, and the second word is a translated word of the first translation data;
collecting document data to be translated, carrying out binary conversion on the document data to be translated to obtain a third digital expression of the document data to be translated, and comparing the first similarity of the third digital expression with the first digital expression or the second digital expression to obtain second translation data of the document data to be translated.
Preferably, third translation data are collected and binary conversion is carried out on them to obtain a fourth digital expression of the third translation data, and the translation accuracy of the third translation data is obtained by comparing a second similarity of the fourth digital expression with the first digital expression or the second digital expression, wherein the third translation data are already-translated document data to be corrected.
Preferably, based on the translation accuracy, the first word or the second word is obtained and added to the third translation material, wherein in the process of adding the first word or the second word to the third translation material, the first word or the second word is labeled, and the labeled form at least comprises word font, word size, word color and dialog box.
Preferably, in the process of binary conversion of the first word, the second word, the document material to be translated and the third translation material respectively,
acquiring the word length of a word to be converted, and expressing the word length through a four-digit binary system to obtain a first expression;
collecting the word content of the word to be converted, and expressing the word content through a sixty-digit binary system to obtain a second expression;
and constructing a digital expression based on the first expression and the second expression, wherein the digital expression comprises the first digital expression, the second digital expression, the third digital expression and the fourth digital expression.
Collecting the length of the English words to be converted;
if the length of the English word is equal to 10, expressing the length of the English word in 4-bit binary to obtain a first expression, converting each character of the word content into 6-bit binary and accumulating the results to obtain a second expression, and obtaining a digital expression from the first expression and the second expression;
if the length of the English word is less than 10, expressing the length of the English word in 4-bit binary to obtain a first expression, expressing each vacant character position (where the word falls short of 10 characters) as six 1 bits to obtain a third expression, converting each character of the word content into 6-bit binary and accumulating the results to obtain a second expression, and obtaining a digital expression from the first expression, the second expression and the third expression;
if the length of the English word is greater than 10, expressing the length of the English word in 4-bit binary to obtain a fourth expression, collecting the ASCII code value of each character of the word content, performing base-31 conversion on the ASCII code values and accumulating them to obtain an accumulation result, taking the remainder of the accumulation result divided by 2^60 and converting it into 60-bit binary to obtain a fifth expression, and obtaining a digital expression from the fourth expression and the fifth expression.
In the process of processing the English words to be converted with the English word length being more than 10, the method comprises the following steps:
S101, collecting a first ASCII code value of a first character of the word content, performing base-31 conversion on the first ASCII code value, and adding the result to a second ASCII code value of a second character of the word content to obtain a first result;
S103, performing base-31 conversion on the first result and adding it to a third ASCII code value of a third character of the word content to obtain a second result;
S105, repeating the calculation of S103 from the second result until the last character of the word content has been accumulated, taking the remainder of the accumulated value divided by 2^60, and converting it into 60-bit binary to obtain a fifth expression.
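Steps S101 to S105 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: it reads "base-31 conversion" as multiplying the running result by 31 before adding the next character's code value, and the function name `fifth_expression` is hypothetical.

```python
MODULUS = 1 << 60  # 2**60, so the remainder always fits in 60 bits


def fifth_expression(word: str) -> int:
    """Accumulate character code values in base 31, then reduce modulo 2**60.

    Assumption: "base-31 conversion" is read as "multiply by 31".
    """
    accumulated = 0
    for ch in word:
        # S101/S103: scale the running result by 31, then add the next code value.
        accumulated = accumulated * 31 + ord(ch)
    # S105: remainder after division by 2**60, i.e. a value that fits in 60 bits.
    return accumulated % MODULUS


if __name__ == "__main__":
    value = fifth_expression("internationalization")
    print(format(value, "060b"))  # 60-bit binary form of the word content
```

Because the result is reduced modulo 2^60, two different long words can in principle map to the same value; the sketch does not attempt to detect such collisions.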
Preferably, in the process of binary conversion of the first word, the second word, the document material to be translated and the third translation material respectively,
acquiring the word length of a word to be converted, and expressing the word length through a four-digit binary system to obtain a first expression;
collecting the word content of the word to be converted, and expressing the word content through a sixty-digit binary system to obtain a second expression;
and constructing a digital expression based on the first expression and the second expression, wherein the digital expression comprises the first digital expression, the second digital expression, the third digital expression and the fourth digital expression.
Collecting the length of the Chinese words to be converted;
if the length of the Chinese word is equal to 4, expressing the length of the Chinese word in 4-bit binary to obtain a first expression, subtracting 2000 from the Unicode code value of each character of the word content, converting each result into 15-bit binary and accumulating the results to obtain a second expression, and obtaining a digital expression from the first expression and the second expression;
if the length of the Chinese word is less than 4, expressing the length of the Chinese word in 4-bit binary to obtain a first expression, expressing each vacant character position (where the word falls short of 4 characters) as fifteen 1 bits to obtain a sixth expression, and obtaining a digital expression from the first expression and the sixth expression;
if the length of the Chinese word is greater than 4, expressing the length of the Chinese word in 4-bit binary to obtain a seventh expression, performing base-13131 conversion on the Unicode code value of each character of the word content and accumulating the results, taking the remainder of the accumulated value divided by 2^60 and converting it into 60-bit binary to obtain a ninth expression, and obtaining a digital expression from the seventh expression and the ninth expression.
In the process of processing the word content with the Chinese word length larger than 4, the method comprises the following steps:
S201, extracting a first Unicode code value of a first character, performing base-13131 conversion on it, and adding the result to a second Unicode code value of a second character to obtain a first result;
S203, performing base-13131 conversion on the first result and adding it to a third Unicode code value of a third character to obtain a second result;
S205, repeating the calculation of S203 from the second result until the last character has been accumulated, taking the remainder of the accumulated value divided by 2^60, and converting it into 60-bit binary to obtain a ninth expression.
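Steps S201 to S205 follow the same pattern with Unicode code values and a larger base. The sketch below is illustrative only: it assumes "base-13131 conversion" means multiplying the running result by 13131, and `ninth_expression` is a hypothetical name.

```python
BASE = 13131       # base named in the text for Chinese long words
MODULUS = 1 << 60  # 2**60


def ninth_expression(word: str) -> int:
    """Accumulate Unicode code points in base 13131, then reduce modulo 2**60.

    Assumption: "base-13131 conversion" is read as "multiply by 13131".
    """
    accumulated = 0
    for ch in word:
        # S201/S203: scale the running result, then add the next code point.
        accumulated = accumulated * BASE + ord(ch)
    # S205: keep only the remainder after division by 2**60.
    return accumulated % MODULUS


if __name__ == "__main__":
    print(ninth_expression("中华人民共和国"))
```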
A system for improving translation accuracy, comprising,
the first data acquisition module is used for acquiring a first original text and a first translated text of first translation data without quality defects;
the second data acquisition module is used for acquiring a document to be translated or a second original text of the translation data to be checked;
the first data conversion module is used for converting the first original text and the first translated text into a first digital expression;
the second data conversion module is used for converting the second original text into a second digital expression;
the data processing module is used for comparing the second digital expression with the first digital expression to obtain a second translation;
the display module is used for displaying the second translation;
the storage module is used for storing the first original text, the first translation, the second original text and the second translation, wherein the storage module is also used for fusing the first original text and the second original text to obtain a new first original text and fusing the second translation and the first translation to obtain a new first translation.
An apparatus for improving translation accuracy, comprising,
the input device is used for inputting the document to be translated or the document to be checked and translated;
the display equipment is used for displaying the translation result of the document to be translated or the auditing result of the document to be audited;
the data processing equipment is used for performing binary digital conversion on the document to be translated or the document to be checked to obtain a first digital expression, performing similarity matching according to a second digital expression existing in the data processing equipment, and selecting a word corresponding to at least one second digital expression with the highest similarity according to a matching result to obtain a translation result or a checking result;
and the data storage equipment is used for storing the document to be translated, the verification translation document, the translation result and the verification result and updating the stored data according to the storage result.
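The storage idea described above (convert once when an entry is stored, then compare stored numbers at match time) might be sketched as follows. Everything here is an assumption made for illustration: the class `TranslationMemory`, whitespace word segmentation, the placeholder encoder `stand_in_encoder`, and the set-overlap score are not taken from the source.

```python
from typing import Callable, List, Optional, Tuple


def stand_in_encoder(word: str) -> int:
    """Placeholder word-to-number encoder (base-31 accumulation modulo 2**60)."""
    value = 0
    for ch in word:
        value = (value * 31 + ord(ch)) % (1 << 60)
    return value


class TranslationMemory:
    """Stores each original text together with its pre-computed number array."""

    def __init__(self, encode: Callable[[str], int] = stand_in_encoder) -> None:
        self._encode = encode
        self._entries: List[Tuple[List[int], str, str]] = []

    def _to_numbers(self, text: str) -> List[int]:
        # Assumption: words are separated by whitespace.
        return [self._encode(word) for word in text.split()]

    def add(self, source: str, target: str) -> None:
        # Conversion happens once, at storage time.
        self._entries.append((self._to_numbers(source), source, target))

    def best_match(self, source: str) -> Optional[Tuple[float, str, str]]:
        """Return (score, stored source, stored translation) of the closest entry."""
        query = set(self._to_numbers(source))
        best: Optional[Tuple[float, str, str]] = None
        for numbers, stored_source, target in self._entries:
            stored = set(numbers)
            union = query | stored
            # Illustrative score: shared word numbers over all word numbers.
            score = len(query & stored) / len(union) if union else 1.0
            if best is None or score > best[0]:
                best = (score, stored_source, target)
        return best
```

At match time only the input text is encoded; the stored entries are compared as integers without touching their text.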
The invention discloses the following technical effects:
The invention provides a method for improving the performance of a matching algorithm: the words of the original text are converted into numbers, so that when the matching algorithm calculates the difference between the input original text and an original text stored in the translation memory, it only needs to compare the two arrays of word numbers. This avoids the large number of processing steps incurred by the traditional method of comparing character-string text, which reduces the efficiency and performance of the algorithm.
The method is applicable to any matching algorithm: it only optimizes the input parameter with which the matching algorithm is called, namely the original-text content, so the computational load of the matching algorithm is reduced and its efficiency improved. The invention also provides a data processing and storage scheme that can be applied in other fields, such as database systems, content differentiation and data security.
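To illustrate the point that a matching algorithm can run unchanged on number arrays, here is a minimal sketch that computes a word-level edit distance over two arrays of integers. The choice of edit distance and the names `edit_distance` and `similarity` are assumptions for illustration; the source does not name a specific matching algorithm.

```python
from typing import Sequence


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Word-level Levenshtein distance between two arrays of word numbers."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1  # a single integer comparison per word pair
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Similarity in [0, 1]; 1.0 means the two arrays are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

Each word comparison in the inner loop is one integer equality test, whatever the length of the underlying word.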
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a flow chart of a method according to the present invention;
FIG. 2 is a 64-bit binary diagram according to the present invention;
FIG. 3 is a comparison of the method of the present invention with the prior art.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1-3, the present invention provides a method for improving translation accuracy, comprising the steps of,
collecting a first word of first translation data without quality defects and a second word corresponding to the first word, and performing binary conversion on the first word and the second word respectively to obtain a first digital expression of the first word and a second digital expression of the second word, wherein the first word is an original word of the first translation data, and the second word is a translated word of the first translation data;
collecting document data to be translated, carrying out binary conversion on the document data to be translated to obtain a third digital expression of the document data to be translated, and comparing the first similarity of the third digital expression with the first digital expression or the second digital expression to obtain second translation data of the document data to be translated.
Preferably, third translation data are collected and binary conversion is carried out on them to obtain a fourth digital expression of the third translation data, and the translation accuracy of the third translation data is obtained by comparing a second similarity of the fourth digital expression with the first digital expression or the second digital expression, wherein the third translation data are already-translated document data to be corrected.
Preferably, based on the translation accuracy, the first word or the second word is obtained and added to the third translation material, wherein in the process of adding the first word or the second word to the third translation material, the first word or the second word is labeled, and the labeled form at least comprises word font, word size, word color and dialog box.
Preferably, in the process of binary conversion of the first word, the second word, the document material to be translated and the third translation material respectively,
acquiring the word length of a word to be converted, and expressing the word length through a four-digit binary system to obtain a first expression;
collecting the word content of the word to be converted, and expressing the word content through a sixty-digit binary system to obtain a second expression;
and constructing a digital expression based on the first expression and the second expression, wherein the digital expression comprises the first digital expression, the second digital expression, the third digital expression and the fourth digital expression.
Collecting the length of the English words to be converted;
if the length of the English word is equal to 10, expressing the length of the English word in 4-bit binary to obtain a first expression, converting each character of the word content into 6-bit binary and accumulating the results to obtain a second expression, and obtaining a digital expression from the first expression and the second expression;
if the length of the English word is less than 10, expressing the length of the English word in 4-bit binary to obtain a first expression, expressing each vacant character position (where the word falls short of 10 characters) as six 1 bits to obtain a third expression, converting each character of the word content into 6-bit binary and accumulating the results to obtain a second expression, and obtaining a digital expression from the first expression, the second expression and the third expression;
if the length of the English word is greater than 10, expressing the length of the English word in 4-bit binary to obtain a fourth expression, collecting the ASCII code value of each character of the word content, performing base-31 conversion on the ASCII code values and accumulating them to obtain an accumulation result, taking the remainder of the accumulation result divided by 2^60 and converting it into 60-bit binary to obtain a fifth expression, and obtaining a digital expression from the fourth expression and the fifth expression.
In the process of processing the English words to be converted with the English word length being more than 10, the method comprises the following steps:
S101, collecting a first ASCII code value of a first character of the word content, performing base-31 conversion on the first ASCII code value, and adding the result to a second ASCII code value of a second character of the word content to obtain a first result;
S103, performing base-31 conversion on the first result and adding it to a third ASCII code value of a third character of the word content to obtain a second result;
S105, repeating the calculation of S103 from the second result until the last character of the word content has been accumulated, taking the remainder of the accumulated value divided by 2^60, and converting it into 60-bit binary to obtain a fifth expression.
Preferably, in the process of binary conversion of the first word, the second word, the document material to be translated and the third translation material respectively,
acquiring the word length of a word to be converted, and expressing the word length through a four-digit binary system to obtain a first expression;
collecting the word content of the word to be converted, and expressing the word content through a sixty-digit binary system to obtain a second expression;
and constructing a digital expression based on the first expression and the second expression, wherein the digital expression comprises the first digital expression, the second digital expression, the third digital expression and the fourth digital expression.
Collecting the length of the Chinese words to be converted;
if the length of the Chinese word is equal to 4, expressing the length of the Chinese word in 4-bit binary to obtain a first expression, subtracting 2000 from the Unicode code value of each character of the word content, converting each result into 15-bit binary and accumulating the results to obtain a second expression, and obtaining a digital expression from the first expression and the second expression;
if the length of the Chinese word is less than 4, expressing the length of the Chinese word in 4-bit binary to obtain a first expression, expressing each vacant character position (where the word falls short of 4 characters) as fifteen 1 bits to obtain a sixth expression, and obtaining a digital expression from the first expression and the sixth expression;
if the length of the Chinese word is greater than 4, expressing the length of the Chinese word in 4-bit binary to obtain a seventh expression, performing base-13131 conversion on the Unicode code value of each character of the word content and accumulating the results, taking the remainder of the accumulated value divided by 2^60 and converting it into 60-bit binary to obtain a ninth expression, and obtaining a digital expression from the seventh expression and the ninth expression.
In the process of processing the word content with the Chinese word length larger than 4, the method comprises the following steps:
S201, extracting a first Unicode code value of a first character, performing base-13131 conversion on it, and adding the result to a second Unicode code value of a second character to obtain a first result;
S203, performing base-13131 conversion on the first result and adding it to a third Unicode code value of a third character to obtain a second result;
S205, repeating the calculation of S203 from the second result until the last character has been accumulated, taking the remainder of the accumulated value divided by 2^60, and converting it into 60-bit binary to obtain a ninth expression.
A system for improving translation accuracy, comprising,
the first data acquisition module is used for acquiring a first original text and a first translated text of first translation data without quality defects;
the second data acquisition module is used for acquiring a document to be translated or a second original text of the translation data to be checked;
the first data conversion module is used for converting the first original text and the first translated text into a first digital expression;
the second data conversion module is used for converting the second original text into a second digital expression;
the data processing module is used for comparing the second digital expression with the first digital expression to obtain a second translation;
the display module is used for displaying the second translation;
the storage module is used for storing the first original text, the first translation, the second original text and the second translation, wherein the storage module is also used for fusing the first original text and the second original text to obtain a new first original text and fusing the second translation and the first translation to obtain a new first translation.
An apparatus for improving translation accuracy, comprising,
the input device is used for inputting the document to be translated or the document to be checked and translated;
the display equipment is used for displaying the translation result of the document to be translated or the auditing result of the document to be audited;
the data processing equipment is used for performing binary digital conversion on the document to be translated or the document to be checked to obtain a first digital expression, performing similarity matching according to a second digital expression existing in the data processing equipment, and selecting a word corresponding to at least one second digital expression with the highest similarity according to a matching result to obtain a translation result or a checking result;
and the data storage equipment is used for storing the document to be translated, the verification translation document, the translation result and the verification result and updating the stored data according to the storage result.
Example 1:
1. The technical scheme provided by the invention comprises the following implementation schemes:
1.1. The class of the word is expressed in 4-bit binary, which is used to distinguish languages and word lengths. The language and word length determine the computational granularity of the method.
1.2. The content of the word is expressed in 60-bit binary. Converting the content into 60-bit binary is the core link of the invention and is described in detail in the word content conversion method below.
1.3. The 64-bit binary is converted to a long integer value as the final result of the invention.
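The 4-bit class plus 60-bit content layout of items 1.1 to 1.3 can be sketched with plain bit operations. This is an illustrative sketch; the helper names `pack` and `unpack` are hypothetical and do not come from the source.

```python
from typing import Tuple

CONTENT_BITS = 60
CONTENT_MASK = (1 << CONTENT_BITS) - 1  # sixty 1 bits


def pack(word_class: int, content: int) -> int:
    """Place the 4-bit class in the top bits and the 60-bit content below it."""
    if not 0 <= word_class < 16:
        raise ValueError("word_class must fit in 4 bits")
    if not 0 <= content <= CONTENT_MASK:
        raise ValueError("content must fit in 60 bits")
    return (word_class << CONTENT_BITS) | content


def unpack(value: int) -> Tuple[int, int]:
    """Split a 64-bit value back into (class, content)."""
    return value >> CONTENT_BITS, value & CONTENT_MASK


if __name__ == "__main__":
    packed = pack(0b0011, 5)
    print(format(packed, "064b"))  # the first four digits are the class bits
```

With the classes 0000 to 0011 used in this example, the top two bits are always zero, so the packed value also fits in a signed 64-bit long integer.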
2. Word content conversion method (examples for English and Chinese are given separately)
Description of the symbols
$b_{1-4}$: bits 1 to 4 of the binary value
$b_{5-64}$: bits 5 to 64 of the binary value
$x$: word content
$x_i$: the $i$-th character
$i$: character index
$n$: number of characters in the word
$f_{c-b}(c)$: 6-bit binary expression of a character
$f_{c-a}(c)$: ASCII value of a character
$f_{c-u}(c)$: Unicode value of a character
$f_{b-n}(b)$: conversion of a 64-bit binary value into a number
$f_{n-b}(n)$: conversion of a number into a 60-bit binary value
$f(x)$: the result of the present invention
2.1 English words of length less than or equal to 10: enter the English short-content conversion module
The processing idea is as follows: $b_{1-4}$ is 0000; $b_{5-64}$ is the accumulation of the 6-bit binary value of each character, and if the word has fewer than 10 characters, each empty position is expressed as six 1 bits. The final 64-bit binary value is converted into a number:
Figure BDA0003144113960000151
2.2 English words with length greater than 10: handled by the English long-content conversion module
The processing idea is as follows: b1-4 is 0001; b5-64 is obtained by multiplying the ASCII code value of the first character by 31 and adding the ASCII code value of the second character, multiplying that sum by 31 and adding the ASCII code value of the third character, and so on until the last character has been added, giving:
Figure BDA0003144113960000152
The result a, accumulated through the last character xn-1, is divided by 2 to the power of 60 and the remainder is taken, giving a number b:
Figure BDA0003144113960000153
after converting the result b into a 60-bit binary system, the final 64-bit binary system is converted into a number:
f(x)=fb-n(b1-4+fn-b(b))
2.3 Chinese words with length less than or equal to 4: handled by the Chinese short-content conversion module
The processing idea is as follows: b1-4 is 0010; b5-64 is the concatenation of the 15-bit binary expression of each character's Unicode value minus 2000. When the number of characters is fewer than 4, each empty position is filled with fifteen 1 bits. The final 64-bit binary is converted to a number:
Figure BDA0003144113960000161
2.4 Chinese words with length greater than 4: handled by the Chinese long-content conversion module
The processing idea is as follows: b1-4 is 0011; b5-64 is obtained by multiplying the Unicode code value of the first character by 13131 and adding the Unicode code value of the second character, multiplying that sum by 13131 and adding the Unicode code value of the third character, and so on until the last character has been added, giving:
Figure BDA0003144113960000162
The result a, accumulated through the last character xn-1, is divided by 2 to the power of 60 and the remainder is taken, giving a number b:
Figure BDA0003144113960000163
after converting the result b into a 60-bit binary system, the final 64-bit binary system is converted into a number:
f(x)=fb-n(b1-4+fn-b(b))
Further, regarding the mathematical functions used in the solution: only f(x), as the result of the scheme, requires its calculation process to be explained; the other functions and methods are directly available in common development languages and mathematical formulas.
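The four conversion modules in 2.1 to 2.4 are selected by language and word length. The selection could be sketched as below. The function name `word_class` is hypothetical, and detecting Chinese by the CJK Unified Ideographs block (U+4E00 to U+9FFF) is an assumption; the patent does not say how the language is determined.

```python
def word_class(word):
    """Return the 4-bit class value for a word (sketch, not the patent's code)."""
    # Assumption: a word is treated as Chinese if it contains any character
    # in the CJK Unified Ideographs block.
    is_chinese = any("\u4e00" <= ch <= "\u9fff" for ch in word)
    if is_chinese:
        # 0010: Chinese, length <= 4; 0011: Chinese, length > 4
        return 0b0010 if len(word) <= 4 else 0b0011
    # 0000: English, length <= 10; 0001: English, length > 10
    return 0b0000 if len(word) <= 10 else 0b0001
```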
English words with length less than or equal to 10:
Bits 1-4: 0000, the category "English word with length ≤ 10".
Bits 5-64: the binary expressions of the characters and digits are concatenated, for example 1 is 000001, 2 is 000010, a is 001010, and so on. Capital letters are first converted to lower case and then converted to binary. Because most uses of a translation memory for translation assistance are fuzzy queries, letter case is ignored here to improve computational performance and narrow the character range. With each character expressed in 6 binary bits, a word of the maximum length 10 fills exactly 60 bits. When the word length is less than 10, each empty position is filled with 111111; for example, a word of length 9 is padded with 111111 at the end, and a word of length 8 is padded with 111111 111111.
Binary expression example: for the word "students", the binary expression is:
0000 011100 011101 011110 001101 001110 010111 011101 011100 111111 111111
Converting to a number: after the binary is converted to a long integer, the number is
512698782764617727
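The short-English encoding and the "students" example can be checked with a small sketch. The character table is inferred from the examples in the text (1 is 000001, 2 is 000010, a is 001010), which suggests digits 0-9 map to 0-9 and letters a-z map to 10-35; that inference, the handling of other characters, and the function names are assumptions rather than the patent's code.

```python
def char_to_6bit(ch):
    """Map a digit or letter to its 6-bit value (inferred table, see lead-in)."""
    ch = ch.lower()  # the text says capitals are folded to lower case first
    if "0" <= ch <= "9":
        return ord(ch) - ord("0")
    if "a" <= ch <= "z":
        return ord(ch) - ord("a") + 10
    # The text does not say how other characters are handled.
    raise ValueError("unsupported character: %r" % ch)


def encode_english_short(word):
    """Encode an English word of length 1..10 as a 64-bit integer."""
    if not 1 <= len(word) <= 10:
        raise ValueError("word length must be between 1 and 10")
    value = 0b0000  # class bits 1-4
    for ch in word:
        value = (value << 6) | char_to_6bit(ch)
    for _ in range(10 - len(word)):
        value = (value << 6) | 0b111111  # pad each empty position with six 1s
    return value


print(encode_english_short("students"))  # 512698782764617727
```

The printed value matches the number given in the text for "students", which supports the inferred character table.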
English words with length greater than 10:
Bits 1-4: 0001, the category "English word with length > 10".
Bits 5-64: the ASCII code of the first character is multiplied by 31 and added to the ASCII code of the second character; the result is multiplied by 31 and added to the ASCII code of the third character, and so on until the last character has been processed. The final result is divided by 2 to the power of 60 and the remainder is taken. If the remainder is greater than 0, it is used directly as the numeric value of bits 5-64; otherwise 2 to the power of 60 is added to the remainder to obtain the numeric value of bits 5-64.
Converting to a number: for example, for a word of 11 characters, with the ASCII code of the first character written Ascii_C1, that of the second Ascii_C2, and so on, compute (((Ascii_C1 × 31 + Ascii_C2) × 31 + Ascii_C3) × 31 + Ascii_C4 ...) and take the remainder of dividing by 2 to the power of 60; the result is expressed in 60 bits and then converted to a number.
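The long-English rule amounts to a base-31 polynomial hash reduced modulo 2 to the power of 60. A minimal sketch follows, with a hypothetical function name. It assumes the "add 2 to the power of 60" adjustment exists to handle negative remainders, which can arise in languages with fixed-width signed integers; Python integers do not overflow and its `%` operator already returns a non-negative result for a positive modulus, so no adjustment branch is needed here. Whether long words are lower-cased first is not stated in the text, so the sketch leaves case untouched.

```python
MOD = 1 << 60  # 2 to the power of 60


def encode_english_long(word):
    """Encode an English word longer than 10 characters as a 64-bit integer."""
    if len(word) <= 10:
        raise ValueError("word length must be greater than 10")
    acc = 0
    for ch in word:
        # ((c1 * 31 + c2) * 31 + c3) * 31 + ...
        acc = acc * 31 + ord(ch)
    content = acc % MOD  # non-negative in Python (see lead-in)
    return (0b0001 << 60) | content  # class bits 0001 on top of the 60 bits
```

Unlike the short-word encoding, this mapping is not reversible and two different long words can in principle produce the same number.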
Chinese words with length less than or equal to 4:
Bits 1-4: 0010, the category "Chinese word with length ≤ 4".
Bits 5-64:
How the algorithm was found: in studying character transcoding, it was observed that subtracting 2000 from the Unicode range of Chinese characters gives values that fall exactly between 0 and 32767, and values from 0 to 32767 fit exactly in 15 bits (2 to the power of 15 values). A word of the maximum length 4 therefore fills exactly 60 bits. When the word length is less than 4, each empty position is filled with 15 bits of 1, i.e. 111111111111111.
Converting to a number: a word of length exactly 4 is expressed as the binary of each character's Unicode value minus 2000; a word of length 3 is expressed as the binary of each character's Unicode value minus 2000, followed by fifteen 1s (the empty 15 bits are filled with 1s). The resulting binary is converted to a number.
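A sketch of the short-Chinese encoding follows. One assumption needs flagging: the offset is taken to be hexadecimal 0x2000 rather than decimal 2000. The common Chinese character block runs from U+4E00 to U+9FFF, and 0x9FFF minus 0x2000 is 0x7FFF, which is 32767, the bound quoted in the text; with a decimal offset of 2000 the largest value would be 38959, which does not fit in 15 bits. The function name is hypothetical.

```python
OFFSET = 0x2000       # assumption: "2000" in the text is hexadecimal
PAD_15 = (1 << 15) - 1  # fifteen 1 bits


def encode_chinese_short(word):
    """Encode a Chinese word of length 1..4 as a 64-bit integer."""
    if not 1 <= len(word) <= 4:
        raise ValueError("word length must be between 1 and 4")
    value = 0b0010  # class bits 1-4
    for ch in word:
        code = ord(ch) - OFFSET
        if not 0 <= code <= PAD_15:
            raise ValueError("character does not fit in 15 bits: %r" % ch)
        value = (value << 15) | code
    for _ in range(4 - len(word)):
        value = (value << 15) | PAD_15  # pad each empty position with 15 ones
    return value
```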
Chinese words with length greater than 4:
Bits 1-4: 0011, the category "Chinese word with length > 4".
Bits 5-64: the Unicode code of the first character is multiplied by 13131 and added to the Unicode code of the second character; the result is multiplied by 13131 and added to the Unicode code of the third character, and so on until the last character has been processed. The final result is divided by 2 to the power of 60 and the remainder is taken. If the remainder is greater than 0, it is used directly as the numeric value of bits 5-64; otherwise 2 to the power of 60 is added to the remainder to obtain the numeric value of bits 5-64.
Converting to a number: for example, for a word of 5 characters, with the Unicode code of the first character written Unicode_C1, that of the second Unicode_C2, and so on, compute (((Unicode_C1 × 13131 + Unicode_C2) × 13131 + Unicode_C3) × 13131 + Unicode_C4 ...) and take the remainder of dividing by 2 to the power of 60; the result is expressed in 60 bits and then converted to a number.
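The long-Chinese rule has the same shape as the long-English one, with multiplier 13131 and class bits 0011 as given in section 2.4. A minimal sketch with a hypothetical function name follows; as before, Python's `%` already yields a non-negative remainder for a positive modulus, so the "add 2 to the power of 60" adjustment is assumed to be unnecessary here.

```python
MOD = 1 << 60  # 2 to the power of 60


def encode_chinese_long(word):
    """Encode a Chinese word longer than 4 characters as a 64-bit integer."""
    if len(word) <= 4:
        raise ValueError("word length must be greater than 4")
    acc = 0
    for ch in word:
        # ((u1 * 13131 + u2) * 13131 + u3) * 13131 + ...
        acc = acc * 13131 + ord(ch)
    return (0b0011 << 60) | (acc % MOD)
```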
The invention provides a method for converting words into numbers. The two word arrays (the words of the input original text and the words of the original text stored in the translation memory) are first converted into two number arrays, so that the word-by-word comparison only has to compare numbers; this reduces the number of comparisons and avoids transcoding characters on every comparison. For the original text in the translation memory, the text can also be converted into numbers by the method of the invention each time before it is stored, and the numbers stored with it; the stored arrays are then compared directly in the next match, which reduces the conversion cost and further improves performance.
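Once both texts are arrays of integers, matching reduces to integer comparison. The sketch below illustrates the idea with a simple overlap ratio. The patent does not specify the similarity measure, so the metric, the function names, and the dictionary layout of the stored memory are all assumptions made for illustration.

```python
def overlap_ratio(query_codes, stored_codes):
    """Fraction of query word codes that also occur in a stored segment."""
    if not query_codes:
        return 0.0
    stored = set(stored_codes)  # integer set lookup, no string comparison
    hits = sum(1 for code in query_codes if code in stored)
    return hits / len(query_codes)


def best_match(query_codes, memory):
    """Return the key of the stored segment with the highest overlap.

    `memory` is a hypothetical dict mapping a segment id to the list of
    word codes that was computed once, when the segment was stored.
    """
    return max(memory, key=lambda key: overlap_ratio(query_codes, memory[key]))
```

Storing the codes alongside each segment means only the incoming text has to be converted at query time.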
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions and not to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone skilled in the art can, within the technical scope disclosed herein, modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the present invention and are intended to be covered by its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for improving translation accuracy, comprising the steps of,
collecting a first word of first translation data without quality defects and a second word corresponding to the first word, and performing binary conversion on the first word and the second word respectively to obtain a first digital expression of the first word and a second digital expression of the second word, wherein the first word is an original text word of the first translation data, and the second word is a translated text word of the first translation data;
collecting document data to be translated, carrying out binary conversion on the document data to be translated to obtain a third digital expression of the document data to be translated, and comparing the first similarity of the third digital expression with the first digital expression or the second digital expression to obtain second translation data of the document data to be translated.
2. The method for improving translation accuracy according to claim 1,
collecting third translation data, performing binary conversion on the third translation data to obtain a fourth digital expression of the third translation data, and comparing a second similarity of the fourth digital expression with the first digital expression or the second digital expression to obtain the translation accuracy of the third translation data, wherein the third translation data are document data which are translated well and need to be corrected.
3. The method for improving translation accuracy according to claim 2,
and obtaining the first word or the second word based on the translation accuracy, and adding the first word or the second word into the third translation material, wherein the first word or the second word is labeled in the process of adding the first word or the second word into the third translation material, and the labeled form at least comprises a word font, a word size, a word color and a dialog box.
4. The method for improving translation accuracy according to claim 3,
in the process of binary conversion of the first word, the second word, the document material to be translated and the third translation material respectively,
acquiring the word length of a word to be converted, and expressing the word length through a four-digit binary system to obtain a first expression;
collecting the word content of the word to be converted, and expressing the word content through a sixty-digit binary system to obtain a second expression;
and constructing a digital expression based on the first expression and the second expression, wherein the digital expression comprises the first digital expression, the second digital expression, the third digital expression and the fourth digital expression.
5. The method for improving translation accuracy according to claim 4,
collecting the length of the English words to be converted;
if the length of the English word is equal to 10, the length of the English word is expressed through 4-bit binary expression to obtain the first expression, characters of the word content are subjected to 6-bit binary conversion and accumulated to obtain the second expression, and the digital expression is obtained through the first expression and the second expression;
if the length of the English word is less than 10, expressing the length of the English word through 4-bit binary expression to obtain the first expression, expressing vacancy characters with the length of the English word less than 10 in the word content through 6-bit 1 to obtain a third expression, performing 6-bit binary conversion on the characters of the word content and accumulating to obtain the second expression, and obtaining the digital expression through the first expression, the second expression and the third expression;
if the length of the English word is larger than 10, the length of the English word is expressed through a 4-bit binary expression to obtain a fourth expression, an ASCII code value of each character of the word content is collected, 31-system conversion is carried out on the ASCII code value, accumulation is carried out to obtain an accumulation result, division with remainder by 2^60 and 60-bit binary conversion are performed on the accumulation result to obtain a fifth expression, and the digital expression is obtained according to the fourth expression and the fifth expression.
6. The method for improving translation accuracy according to claim 5,
in the process of processing the English words to be converted with the English word length being more than 10, the method comprises the following steps:
s101, collecting a first ASCII code value of a first character of the word content, and adding the first ASCII code value and a second ASCII code value of a second character of the word content after carrying out 31-system conversion on the first ASCII code value to obtain a first result;
s103, after the first result is subjected to 31-system conversion, adding the first result and a third ASCII code value of a third character of the word content to obtain a second result;
s105, based on the calculation process of S103, accumulating the second result through the last character of the word content, and then performing division with remainder by 2^60 and 60-bit binary conversion to obtain the fifth expression.
7. The method for improving translation accuracy according to claim 4,
collecting the length of the Chinese words to be converted;
if the Chinese word length is equal to 4, obtaining the first expression by expressing the Chinese word length through a 4-bit binary system, subtracting 2000 from the Unicode code value of each character of the word content, converting the code value into a 15-bit binary system for accumulation to obtain the second expression, and obtaining the digital expression through the first expression and the second expression;
if the Chinese word length is less than 4, expressing the Chinese word length through 4-bit binary to obtain the first expression, expressing vacancy characters with the Chinese word length less than 4 in the word content through 15 pieces of 1 to obtain a sixth expression, and obtaining the digital expression through the first expression and the sixth expression;
if the length of the Chinese word is more than 4, the length of the Chinese word is expressed through 4-bit binary expression to obtain a seventh expression, the Unicode code values of all characters of the word content are accumulated after being subjected to 13131-system conversion, division with remainder by 2^60 and 60-bit binary conversion are performed on the accumulated value to obtain a ninth expression, and the digital expression is obtained according to the seventh expression and the ninth expression.
8. The method for improving translation accuracy according to claim 7,
in the process of processing the word content with the Chinese word length larger than 4, the method comprises the following steps:
s201, extracting a first Unicode code value of a first character, carrying out 13131-system conversion, and adding the first Unicode code value of the first character and a second Unicode code value of a second character to obtain a first result;
s203, carrying out 13131-system conversion on the first result, and adding the first result and a third Unicode code value of a third character to obtain a second result;
s205, based on the calculation process of S203, accumulating the second result through the last character, and then performing division with remainder by 2^60 and 60-bit binary conversion to obtain the ninth expression.
9. A system for improving translation accuracy, comprising,
the first data acquisition module is used for acquiring a first original text and a first translated text of first translation data without quality defects;
the second data acquisition module is used for acquiring a document to be translated or a second original text of the translation data to be checked;
the first data conversion module is used for converting the first original text and the first translated text into a first digital expression;
the second data conversion module is used for converting the second original text into a second digital expression;
the data processing module is used for comparing the second digital expression with the first digital expression to obtain a second translation;
the display module is used for displaying the second translation;
and the storage module is used for storing the first original text, the first translation, the second original text and the second translation, wherein the storage module is further used for fusing the first original text and the second original text to obtain a new first original text and fusing the second translation and the first translation to obtain a new first translation.
10. An apparatus for improving translation accuracy, comprising,
the input device is used for inputting the document to be translated or the document to be checked and translated;
the display device is used for displaying the translation result of the to-be-translated document or the auditing result of the to-be-audited translation document;
the data processing equipment is used for performing binary digital conversion on the document to be translated or the document to be checked to obtain a first digital expression, performing similarity matching according to a second digital expression existing in the data processing equipment, and selecting a word corresponding to at least one second digital expression with the highest similarity according to a matching result to obtain the translation result or the checking result;
and the data storage equipment is used for storing the document to be translated, the audited translation document, the translation result and the audit result and updating the stored data according to the storage result.
CN202110745049.XA 2021-07-01 2021-07-01 Method, system and device for improving translation accuracy Active CN113420570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110745049.XA CN113420570B (en) 2021-07-01 2021-07-01 Method, system and device for improving translation accuracy

Publications (2)

Publication Number Publication Date
CN113420570A true CN113420570A (en) 2021-09-21
CN113420570B CN113420570B (en) 2024-04-30

Family

ID=77719954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110745049.XA Active CN113420570B (en) 2021-07-01 2021-07-01 Method, system and device for improving translation accuracy

Country Status (1)

Country Link
CN (1) CN113420570B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6314469B1 (en) * 1999-02-26 2001-11-06 I-Dns.Net International Pte Ltd Multi-language domain name service
US20050027547A1 (en) * 2003-07-31 2005-02-03 International Business Machines Corporation Chinese / Pin Yin / english dictionary
CN101178705A (en) * 2007-12-13 2008-05-14 中国电信股份有限公司 Free-running speech comprehend method and man-machine interactive intelligent system
CN101261633A (en) * 2008-04-02 2008-09-10 深圳市共进电子有限公司 Electronic translation method and system based on engineering
CN102693222A (en) * 2012-05-25 2012-09-26 熊晶 Carapace bone script explanation machine translation method based on example
CN103559172A (en) * 2013-11-06 2014-02-05 北京百度网讯科技有限公司 Phrasing method and device for multi-language mixed text
CN103793527A (en) * 2014-02-25 2014-05-14 惠州Tcl移动通信有限公司 Sign language interpreting method and sign language interpreting system based on gesture tracing
CN104331399A (en) * 2014-07-25 2015-02-04 一朵云(北京)科技有限公司 Dictionary tree translation method
CN105408891A (en) * 2013-06-03 2016-03-16 机械地带有限公司 Systems and methods for multi-user multi-lingual communications
CN105472451A (en) * 2015-03-18 2016-04-06 深圳Tcl数字技术有限公司 Data transmission method and device between terminals
TWM532593U (en) * 2016-08-10 2016-11-21 Nat Taichung University Science & Technology Voice-translation system
CN107329957A (en) * 2017-05-18 2017-11-07 网易(杭州)网络有限公司 Replace the method and computer-readable recording medium of code Chinese character string
CN109492233A (en) * 2018-11-14 2019-03-19 北京捷通华声科技股份有限公司 A kind of machine translation method and device
CN109634869A (en) * 2018-12-21 2019-04-16 中国人民解放军战略支援部队信息工程大学 Binary translation intermediate representation correctness test method and device based on semantic equivalence verifying
CN111753555A (en) * 2020-06-17 2020-10-09 兰州大学 Method and system for translating mathematic formula into Braille based on MathML
CN112818712A (en) * 2021-02-23 2021-05-18 语联网(武汉)信息技术有限公司 Machine translation method and device based on translation memory library

Also Published As

Publication number Publication date
CN113420570B (en) 2024-04-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant