CN113420570B - Method, system and device for improving translation accuracy - Google Patents

Method, system and device for improving translation accuracy

Info

Publication number
CN113420570B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110745049.XA
Other languages
Chinese (zh)
Other versions
CN113420570A (en)
Inventor
郝顺平
关祎宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Chuangsijiye Technology Co ltd
Original Assignee
Shenyang Chuangsijiye Technology Co ltd

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/51: Translation evaluation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/126: Character encoding
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method, a system and a device for improving translation accuracy. The word array of an input text and the word arrays of the texts stored in a translation memory are each converted into arrays of numbers, so that word-by-word comparison only compares numbers. This reduces the number of comparisons and avoids transcoding characters on every comparison; because a computer processes numbers faster than text, text content is compared faster, which directly improves the performance and speed of the matching algorithm. In addition, each text in the translation memory can be converted to numbers by this method before it is stored, so that later matches compare the stored numeric arrays directly, which reduces conversion cost and further improves performance.

Description

Method, system and device for improving translation accuracy
Technical Field
The invention relates to the field of intelligent translation, in particular to a method, a system and a device for improving translation accuracy.
Background
A translation memory is translation-assistance software commonly used in the translation field. It continuously collects and stores reviewed, defect-free source texts and their translations from translation projects, and provides a matching algorithm that compares an input text to be translated against the stored source texts and returns a batch of stored texts with high similarity together with their translations. Because these translations have all been reviewed, they give the translator a high-quality reference.
In summary, the key factors affecting the matching performance of a translation memory are how the source text is stored and the matching algorithm. The source text content is the basis of the matching algorithm and directly determines its computational performance, mainly through the speed at which source texts are compared. As the translation memory grows with the translation business, the system overhead of comparing source texts increases, so the processing of source text content and the design of its data types are important to matching performance.
In the traditional approach, each whole source sentence is stored in the translation memory as a character string. When matching, the stored source text is retrieved and segmented into words, the input source text is segmented as well, and the matching algorithm computes the similarity of the two sentences from the two word arrays; finally, the entries with the highest similarity in the translation memory are returned.
Matching in text form can cause performance problems when each word is compared. A computer processes a character string by converting each character into its ASCII code and then comparing the code values one by one, so the more characters a word has, the more comparisons are needed, and with many words the matching algorithm can become very slow. As the translation memory grows with the translation business, both the number of reference texts and the number of words in them increase, and the performance problem becomes more pronounced: translators wait longer for reference translations, and translation efficiency falls.
Disclosure of Invention
The invention discloses a source text processing method that addresses the above technical problem by optimizing how the source text is stored and its data structure, thereby improving the overall matching performance of a translation memory.
The invention provides a method for improving translation accuracy, comprising the following steps:
collecting a first word of a first translation data without quality defects and a second word corresponding to the first word, and performing binary conversion on each to obtain a first digital expression of the first word and a second digital expression of the second word, wherein the first word is a source-text word of the first translation data and the second word is a translated word of the first translation data;
acquiring document data to be translated, performing binary conversion on it to obtain a third digital expression, and obtaining second translation data for the document data to be translated from a first similarity between the third digital expression and the first or second digital expression.
Preferably, third translation data is collected and binary-converted to obtain a fourth digital expression, and the translation accuracy of the third translation data is obtained from a second similarity between the fourth digital expression and the first or second digital expression, wherein the third translation data is translated document data to be reviewed.
Preferably, based on the translation accuracy, the first word or the second word is obtained and added to the third translation data, wherein the added word is marked, and the form of marking includes at least font, font size, font color and dialog box.
Preferably, in binary-converting the first word, the second word, the document data to be translated and the third translation data:
the length of the word to be converted is acquired and expressed in 4-bit binary to obtain a first expression;
the content of the word to be converted is acquired and expressed in 60-bit binary to obtain a second expression;
a digital expression is constructed from the first expression and the second expression, wherein the digital expression covers the first, second, third and fourth digital expressions.
For an English word to be converted, its length is collected, and:
if the length is equal to 10, the length is expressed in 4-bit binary to obtain the first expression, each character of the word content is converted to 6-bit binary and the results are accumulated to obtain the second expression, and the digital expression is obtained from the first and second expressions;
if the length is less than 10, the length is expressed in 4-bit binary to obtain the first expression, each empty position by which the word falls short of 10 characters is expressed by six 1 bits to obtain a third expression, each character of the word content is converted to 6-bit binary and the results are accumulated to obtain the second expression, and the digital expression is obtained from the first, second and third expressions;
if the length is greater than 10, the length is expressed in 4-bit binary to obtain a fourth expression, the ASCII code values of the characters of the word content are accumulated with base-31 conversion to obtain an accumulation result, the remainder of the accumulation result divided by 2^60 is converted to 60-bit binary to obtain a fifth expression, and the digital expression is obtained from the fourth and fifth expressions.
Processing an English word longer than 10 characters comprises the following steps:
S101, acquiring the first ASCII code value of the first character of the word content, applying base-31 conversion to it (multiplying it by 31), and adding the second ASCII code value of the second character to obtain a first result;
S103, applying base-31 conversion to the first result and adding the third ASCII code value of the third character to obtain a second result;
S105, continuing the calculation of S103 through the last character of the word content, then taking the remainder of the accumulated result divided by 2^60 and converting it to 60-bit binary to obtain the fifth expression.
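Steps S101 to S105 can be read as a multiply-and-add (polynomial) hash over the ASCII codes, reduced modulo 2^60. The following is a minimal sketch under that reading; the function name `hash_long_english` and the constant `MASK_60` are illustrative and do not come from the patent, and the sketch does not lowercase the word because the text does not say whether long words are case-folded.

```python
MASK_60 = (1 << 60) - 1  # x & MASK_60 equals x mod 2**60 for non-negative x


def hash_long_english(word: str, base: int = 31) -> int:
    """Sketch of S101-S105: multiply the running result by the base,
    add the next ASCII code, and reduce modulo 2**60 at the end."""
    result = 0
    for ch in word:
        # First pass gives ord(x1); second gives ord(x1) * 31 + ord(x2); etc.
        result = result * base + ord(ch)
    return result & MASK_60


if __name__ == "__main__":
    print(hash_long_english("internationalization"))
```

Python integers do not overflow, so the reduction can wait until the end exactly as the steps describe; the returned value always fits in 60 bits.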
Preferably, in binary-converting the first word, the second word, the document data to be translated and the third translation data:
the length of the word to be converted is acquired and expressed in 4-bit binary to obtain a first expression;
the content of the word to be converted is acquired and expressed in 60-bit binary to obtain a second expression;
a digital expression is constructed from the first expression and the second expression, wherein the digital expression covers the first, second, third and fourth digital expressions.
For a Chinese word to be converted, its length is collected, and:
if the length is equal to 4, the length is expressed in 4-bit binary to obtain the first expression, 2000 is subtracted from the Unicode code value of each character of the word content and the result is converted to 15-bit binary and accumulated to obtain the second expression, and the digital expression is obtained from the first and second expressions;
if the length is less than 4, the length is expressed in 4-bit binary to obtain the first expression, each empty position by which the word falls short of 4 characters is expressed by fifteen 1 bits to obtain a sixth expression, and the digital expression is obtained from the first and sixth expressions;
if the length is greater than 4, the length is expressed in 4-bit binary to obtain a seventh expression, the Unicode code values of the characters of the word content are accumulated with base-13131 conversion, the remainder of the result divided by 2^60 is converted to 60-bit binary to obtain a ninth expression, and the digital expression is obtained from the seventh and ninth expressions.
Processing a Chinese word longer than 4 characters comprises the following steps:
S201, extracting the first Unicode code value of the first character, applying base-13131 conversion to it (multiplying it by 13131), and adding the second Unicode code value of the second character to obtain a first result;
S203, applying base-13131 conversion to the first result and adding the third Unicode code value of the third character to obtain a second result;
S205, continuing the calculation of S203 through the last character, then taking the remainder of the accumulated result divided by 2^60 and converting it to 60-bit binary to obtain the ninth expression.
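Steps S201 to S205 follow the same multiply-and-add pattern with base 13131 over Unicode code points. The sketch below is one possible reading, not the patent's implementation; the name `hash_long_chinese` is hypothetical. Unlike the steps as written, it reduces modulo 2^60 after every character, which gives the same remainder and is how the calculation would usually be kept inside a fixed-width integer in languages without big integers.

```python
MODULUS = 1 << 60  # 2**60, the size of the 60-bit content field


def hash_long_chinese(word: str, base: int = 13131) -> int:
    """Sketch of S201-S205 with the remainder taken at every step.

    (r * base + c) mod M applied per character yields the same value as
    accumulating everything first and reducing once at the end.
    """
    result = 0
    for ch in word:
        result = (result * base + ord(ch)) % MODULUS
    return result


if __name__ == "__main__":
    print(hash_long_chinese("中华人民共和国"))
```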
A system for improving translation accuracy comprises:
a first data acquisition module for acquiring a first source text and a first translation of first translation data without quality defects;
a second data acquisition module for acquiring a second source text of a document to be translated or a document to be reviewed;
a first data conversion module for converting the first source text and the first translation into a first digital expression;
a second data conversion module for converting the second source text into a second digital expression;
a data processing module for comparing the second digital expression with the first digital expression to obtain a second translation;
a display module for displaying the second translation;
a storage module for storing the first source text, the first translation, the second source text and the second translation, wherein the storage module also merges the first source text and the second source text into a new first source text and merges the second translation and the first translation into a new first translation.
An apparatus for improving translation accuracy comprises:
an input device for inputting the document to be translated or the document to be reviewed;
a display device for displaying the translation result of the document to be translated or the review result of the document to be reviewed;
a data processing device for performing binary digital conversion on the document to be translated or the document to be reviewed to obtain a first digital expression, performing similarity matching against the second digital expressions already held in the data processing device, and selecting the words corresponding to at least one second digital expression with the highest similarity to obtain a translation result or a review result;
a data storage device for storing the document to be translated, the reviewed translation document, the translation result and the review result, and for updating the stored data according to the storage result.
The invention discloses the following technical effects:
The performance of the matching algorithm is improved: the words in the source text are converted into numbers, so that when the matching algorithm calculates the difference between the input source text and a source text stored in the translation memory, it only compares the number arrays of the two sentences. This replaces the traditional comparison of character strings in text form, which requires a large number of processing operations and lowers the efficiency and performance of the algorithm.
The invention is applicable to any matching algorithm, because it only optimizes the input parameters with which the matching algorithm is called, namely the source text content; this reduces the computational burden of the matching algorithm and improves its efficiency. In addition, the invention provides a data processing and storage scheme that can be used in other applications, such as database systems, content differentiation and data security.
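To illustrate what "comparing only the number arrays" can look like, here is a minimal sketch. The patent does not name a matching algorithm, so Python's standard `difflib.SequenceMatcher` is used purely as a stand-in; `to_number` is a hypothetical placeholder encoder (a simple multiply-and-add hash), not the 64-bit scheme of the invention, and the sample sentences are invented.

```python
from difflib import SequenceMatcher


def to_number(word: str) -> int:
    """Hypothetical stand-in encoder: maps a word to an integer below 2**60."""
    n = 0
    for ch in word.lower():
        n = (n * 31 + ord(ch)) % (1 << 60)
    return n


def encode_sentence(sentence: str) -> list[int]:
    """Segment on whitespace and encode every word as a number."""
    return [to_number(w) for w in sentence.split()]


def best_match(query: str, memory: dict[str, list[int]]) -> tuple[float, str]:
    """Compare the query's number array with each stored number array."""
    q = encode_sentence(query)
    scored = [(SequenceMatcher(None, q, nums).ratio(), src)
              for src, nums in memory.items()]
    return max(scored)


# The stored texts are encoded once, when they enter the memory.
memory = {s: encode_sentence(s)
          for s in ("the cat sat on the mat", "open the file menu")}

if __name__ == "__main__":
    print(best_match("the cat sat on a mat", memory))
```

Because the stored sentences are encoded when they are added, each lookup only encodes the query and then compares integers, which is the saving the text describes.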
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described are only some embodiments of the invention; a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method according to the present invention;
FIG. 2 is a 64-bit binary schematic according to the present invention;
FIG. 3 is a diagram comparing the method of the present invention with the prior art.
Detailed Description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The embodiments described are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art from these embodiments without inventive effort fall within the scope of protection of the invention.
As shown in FIGS. 1-3, the invention provides a method, a system and an apparatus for improving translation accuracy, whose steps, modules and devices are as set out in the Disclosure of Invention above.
Example 1:
1. The technical scheme provided by the invention is as follows:
1.1. The category of a word is expressed in 4-bit binary and distinguishes language and word length. The language and word length determine the computational granularity of the method.
1.2. The word content is converted into 60-bit binary; the conversion is described in detail in the section [Word content conversion method].
1.3. The 64-bit binary is converted into a long integer value, which is the final result of the invention.
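Items 1.1 to 1.3 describe a 64-bit layout: a 4-bit category followed by 60 bits of content. The following minimal sketch shows one way to pack and unpack such a value. It assumes that bits 1-4 are the most significant bits, which the text does not state explicitly; the names `CATEGORY`, `pack` and `unpack` are illustrative, and the four category codes are the ones the example gives for short and long English and Chinese words.

```python
CONTENT_BITS = 60
CONTENT_MASK = (1 << CONTENT_BITS) - 1

# Category codes as listed in the example (0000, 0001, 0010, 0011).
CATEGORY = {
    "en_short": 0b0000,  # English word, length <= 10
    "en_long": 0b0001,   # English word, length > 10
    "zh_short": 0b0010,  # Chinese word, length <= 4
    "zh_long": 0b0011,   # Chinese word, length > 4
}


def pack(category: int, content: int) -> int:
    """Place the 4-bit category above the 60-bit content (assumed bit order)."""
    if not 0 <= category < 16:
        raise ValueError("category must fit in 4 bits")
    if not 0 <= content <= CONTENT_MASK:
        raise ValueError("content must fit in 60 bits")
    return (category << CONTENT_BITS) | content


def unpack(value: int) -> tuple[int, int]:
    """Split a packed value back into (category, content)."""
    return value >> CONTENT_BITS, value & CONTENT_MASK


if __name__ == "__main__":
    packed = pack(CATEGORY["zh_long"], 12345)
    print(packed, unpack(packed))
```

With these four codes the two highest bits are always 0, so every packed value stays below 2^62 and fits in a signed 64-bit long integer.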
2. Word content conversion method (English and Chinese as examples)
Symbol description
$b_{1-4}$: bits 1 to 4 of the binary
$b_{5-64}$: bits 5 to 64 of the binary
$x$: word content
$x_i$: the i-th character
$i$: character index
$n$: number of characters in the word
$f_{c-b}(c)$: 6-bit binary expression of a character
$f_{c-a}(c)$: ASCII value of a character
$f_{c-u}(c)$: Unicode value of a character
$f_{b-n}(b)$: conversion of a 64-bit binary into a number
$f_{n-b}(n)$: conversion of a number into 60-bit binary
$f(x)$: the result of the invention
2.1 English words of length less than or equal to 10: English short content conversion module
The processing idea is as follows: $b_{1-4}$ is 0000; for $b_{5-64}$, the 6-bit binary codes of the characters are accumulated, and when the word has fewer than 10 characters each empty position is filled with six 1 bits. The final 64-bit binary is then converted to a number.
2.2 English words of length greater than 10: English long content conversion module
The processing idea is as follows: $b_{1-4}$ is 0001; for $b_{5-64}$, the ASCII code value of the first character is multiplied by 31 and added to the ASCII code value of the second character, the sum is multiplied by 31 and added to the ASCII code value of the third character, and so on through the last character:
$$a_0 = f_{c-a}(x_0), \qquad a_i = 31\,a_{i-1} + f_{c-a}(x_i), \quad i = 1, \dots, n-1$$
The result $a = a_{n-1}$ calculated at $x_{n-1}$ is reduced modulo the 60th power of 2 to obtain a number:
$$b = a \bmod 2^{60}$$
After the result $b$ is converted into 60-bit binary, the final 64-bit binary is converted into a number:
$$f(x) = f_{b-n}\left(b_{1-4} + f_{n-b}(b)\right)$$
2.3 Chinese words of length less than or equal to 4: Chinese short content conversion module
The processing idea is as follows: $b_{1-4}$ is 0010; for $b_{5-64}$, 2000 is subtracted from the Unicode code value of each character and the result is converted to 15-bit binary and accumulated, and when the word has fewer than 4 characters each empty position is filled with fifteen 1 bits. The final 64-bit binary is then converted to a number.
2.4 Chinese words with length greater than 4: enter Chinese long content conversion module
The processing idea is as follows: b 1-4 is denoted by 0011; b 5-64, adding the 13131 system of the Unicode code value of the first character and the Unicode code value of the second character, adding the 13131 system of the added result and the Unicode code value of the third character, and processing to the last character to obtain the final character:
The result a calculated at x n-1 is divided by the 60 th power of 2 to obtain a number:
after converting the result b into a 60-bit binary, converting the final 64-bit binary into a number:
f(x)=fb-n(b1-4+fn-b(b))
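Several formulas for the four cases did not survive in this text. A possible reconstruction from the prose, using the symbols of the symbol description, is given below. It is a reading of the description, not the patent's original notation. Assumptions: $\Vert$ denotes bit-string concatenation (written as + in the surviving formula), $\mathbf{1}_k$ is a run of k one bits, $a_i$ is the running result after the i-th character with characters indexed from 1 to n, and $u(c)$ is a hypothetical name for the 15-bit binary of $f_{c-u}(c) - 2000$.

```latex
\begin{aligned}
\text{2.1:}\quad & f(x) = f_{b-n}\bigl(0000 \,\Vert\, f_{c-b}(x_1) \,\Vert\, \cdots \,\Vert\, f_{c-b}(x_n)
      \,\Vert\, \underbrace{\mathbf{1}_6 \,\Vert\, \cdots \,\Vert\, \mathbf{1}_6}_{10-n}\bigr), \quad n \le 10 \\
\text{2.2:}\quad & a_1 = f_{c-a}(x_1), \quad a_i = 31\,a_{i-1} + f_{c-a}(x_i), \quad
      b = a_n \bmod 2^{60}, \quad f(x) = f_{b-n}\bigl(0001 \,\Vert\, f_{n-b}(b)\bigr) \\
\text{2.3:}\quad & f(x) = f_{b-n}\bigl(0010 \,\Vert\, u(x_1) \,\Vert\, \cdots \,\Vert\, u(x_n)
      \,\Vert\, \underbrace{\mathbf{1}_{15} \,\Vert\, \cdots \,\Vert\, \mathbf{1}_{15}}_{4-n}\bigr), \quad n \le 4 \\
\text{2.4:}\quad & a_1 = f_{c-u}(x_1), \quad a_i = 13131\,a_{i-1} + f_{c-u}(x_i), \quad
      b = a_n \bmod 2^{60}, \quad f(x) = f_{b-n}\bigl(0011 \,\Vert\, f_{n-b}(b)\bigr)
\end{aligned}
```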
Further, the mathematical functions used in the technical solution are described as follows: apart from f(x), the result of the scheme, whose calculation process needs to be explained, the other functions and methods are directly available in common development languages and mathematical formulas.
English words of length less than or equal to 10:
Bits 1-4: 0000, indicating that the category of the word is "English word with length <= 10".
Bits 5-64: the binary expressions of the characters/digits are concatenated; for example, 1 is 000001, 2 is 000010, a is 001010, and so on. Uppercase letters are converted to lowercase before binary processing: most translation-memory lookups used for translation assistance are fuzzy queries, so letter case is ignored in order to narrow the character range and improve computing performance. Because each character takes 6 bits, a word of the maximum length 10 fills exactly 60 bits. When the word length is less than 10, each empty slot is filled with 111111; for example, a word of length 9 is padded with 111111 at the end, and a word of length 8 with 111111 111111.
Binary expression example: for the word "students", the binary expression is:
0000 011100 011101 011110 001101 001110 010111 011101 011100 111111 111111
Converting to a number: converting this binary into a long integer gives
512698782764617727
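The short English case can be sketched as below. This is a minimal illustration, not the patent's implementation. It assumes digits map to their own value and letters continue alphabetically from a = 10, which matches the codes shown for 1, 2, a and for "students"; how other characters are handled is not stated in the text, so the sketch rejects them. The names `char_code` and `encode_short_english` are hypothetical.

```python
PAD = 0b111111  # filler for each empty 6-bit character slot


def char_code(ch: str) -> int:
    """Map '0'-'9' to 0-9 and 'a'-'z' to 10-35 (assumed mapping)."""
    if "0" <= ch <= "9":
        return ord(ch) - ord("0")
    if "a" <= ch <= "z":
        return ord(ch) - ord("a") + 10
    raise ValueError(f"unsupported character: {ch!r}")


def encode_short_english(word: str) -> int:
    """Encode a word of 1 to 10 characters as category 0000 plus 10 six-bit slots."""
    word = word.lower()  # case is ignored, as described in the text
    if not 1 <= len(word) <= 10:
        raise ValueError("expected a word of length 1 to 10")
    value = 0b0000  # category bits
    for ch in word:
        value = (value << 6) | char_code(ch)
    for _ in range(10 - len(word)):
        value = (value << 6) | PAD
    return value


print(encode_short_english("students"))  # 512698782764617727
```

The printed value agrees with the long integer given in the example for "students".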
English words of length greater than 10:
Bits 1-4: 0001, indicating that the category of the word is "English word with length > 10".
Bits 5-64: the ASCII code of the first character is multiplied by 31 and added to the ASCII code of the second character; the result is multiplied by 31 and added to the ASCII code of the third character, and so on until the last character is processed. The final calculation result is divided by 2 to the 60th power and the remainder is taken. If the remainder is greater than 0, it is used directly as the numeric result for bits 5-64; otherwise, 2 to the 60th power is added to the remainder and the sum is used as the numeric result for bits 5-64.
Converting to a number: for example, for a word of 11 characters, with the ASCII code of the first character written Ascii_c1, that of the second character Ascii_c2, and so on: (((Ascii_c1 × 31 + Ascii_c2) × 31 + Ascii_c3) × 31 + Ascii_c4 …) divided by 2 to the 60th power, taking the remainder; the result is expressed in 60 bits and converted into a number.
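A minimal sketch of the long English case follows. It assumes the remainder is meant to be non-negative; Python's `%` already returns a non-negative result for a positive modulus, so the "add 2 to the 60th power" correction described in the text (which matters in languages where signed arithmetic can yield a negative remainder) is not needed here. The text does not say whether long words are lowercased, so the sketch leaves case untouched. The function name is hypothetical.

```python
MODULUS = 1 << 60  # 2 to the 60th power


def encode_long_english(word: str) -> int:
    """Encode a word longer than 10 characters as category 0001 plus a 60-bit remainder."""
    if len(word) <= 10:
        raise ValueError("this path is for words longer than 10 characters")
    acc = 0
    for ch in word:
        # Multiply the running result by 31 and add the next character code.
        # Reducing at every step gives the same remainder as reducing once at the end.
        acc = (acc * 31 + ord(ch)) % MODULUS
    return (0b0001 << 60) | acc
```

Unlike the short case, this mapping is not reversible, and two different long words can in principle produce the same number.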
Chinese words of length less than or equal to 4
Bits 1-4: 0010, indicating that the category of the word is "Chinese word with length <= 4".
Bits 5-64:
Algorithm discovery process: while researching transcoding methods for Chinese characters, it was found that subtracting 2000 from the Unicode values of Chinese characters gives values that fall exactly between 0 and 32767, and any value from 0 to 32767 can be expressed in 15 bits (32768 is the 15th power of 2). A word of the maximum length 4 therefore fills exactly 60 bits. When the word length is less than 4, each empty slot is filled with 15 bits of 1, i.e., 111111111111111.
Converting to a number: a word of length exactly 4 is expressed as the binary corresponding to Unicode-2000 for each character; a word of length 3 is expressed as the binary corresponding to Unicode-2000 followed by fifteen 1s (the empty slot occupies 15 bits). The resulting binary is converted to a number.
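The short Chinese case can be sketched as below. The text gives the offset as 2000; this sketch assumes that means hexadecimal 0x2000 (8192), because the CJK Unified Ideographs block U+4E00 to U+9FFF minus 0x2000 spans 0x2E00 to 0x7FFF and so fits in 15 bits, whereas subtracting decimal 2000 would not. The offset is therefore a parameter, and the function name is hypothetical.

```python
OFFSET = 0x2000        # assumption: "2000" in the text read as hexadecimal
SLOT_BITS = 15
SLOT_PAD = (1 << SLOT_BITS) - 1  # fifteen 1 bits for an empty slot


def encode_short_chinese(word: str, offset: int = OFFSET) -> int:
    """Encode a word of 1 to 4 characters as category 0010 plus 4 fifteen-bit slots."""
    if not 1 <= len(word) <= 4:
        raise ValueError("expected a word of length 1 to 4")
    value = 0b0010  # category bits
    for ch in word:
        code = ord(ch) - offset
        if not 0 <= code <= SLOT_PAD:
            raise ValueError(f"character does not fit in 15 bits: {ch!r}")
        value = (value << SLOT_BITS) | code
    for _ in range(4 - len(word)):
        value = (value << SLOT_BITS) | SLOT_PAD
    return value


print(bin(encode_short_chinese("中文")))
```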
Chinese words of length greater than 4
Bits 1-4: 0011, indicating that the category of the word is "Chinese word with length > 4".
Bits 5-64: the Unicode value of the first character is multiplied by 13131 and added to the Unicode value of the second character; the result is multiplied by 13131 and added to the Unicode value of the third character, and so on until the last character is processed. The final calculation result is divided by 2 to the 60th power and the remainder is taken. If the remainder is greater than 0, it is used directly as the numeric result for bits 5-64; otherwise, 2 to the 60th power is added to the remainder and the sum is used as the numeric result for bits 5-64.
Converting to a number: for example, for a word of 5 characters, with the Unicode value of the first character written Unicode_c1, that of the second character Unicode_c2, and so on: (((Unicode_c1 × 13131 + Unicode_c2) × 13131 + Unicode_c3) × 13131 + Unicode_c4 …) divided by 2 to the 60th power, taking the remainder; the result is expressed in 60 bits and converted into a number.
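A minimal sketch of the long Chinese case, under the same assumption as for long English words that the remainder is meant to be non-negative (Python's `%` guarantees this for a positive modulus). The function name and the `base` parameter are hypothetical; the multiplier 13131 and the category 0011 follow the processing idea in section 2.4.

```python
MODULUS = 1 << 60  # 2 to the 60th power


def encode_long_chinese(word: str, base: int = 13131) -> int:
    """Encode a word longer than 4 characters as category 0011 plus a 60-bit remainder."""
    if len(word) <= 4:
        raise ValueError("this path is for words longer than 4 characters")
    acc = 0
    for ch in word:
        # Multiply the running result by the base and add the next Unicode value,
        # keeping only the remainder modulo 2**60 at each step.
        acc = (acc * base + ord(ch)) % MODULUS
    return (0b0011 << 60) | acc


print(encode_long_chinese("机器翻译系统"))
```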
The invention implements a method of converting words into numbers. The two word arrays to be compared (the word array of the input source text and the source-text arrays stored in the translation memory) are first converted into two number arrays, so that the word-by-word comparison only has to compare numbers. This reduces the number of comparisons and avoids transcoding characters on every comparison. Because a computer processes numbers much faster than text, comparing source-text content becomes faster, and because that comparison is a key step of the matching algorithm, the performance and calculation speed of the matching algorithm improve directly. In addition, each source text can be converted into numbers according to the invention before it is stored in the translation memory, so that the next match compares the stored number arrays directly, which reduces the conversion cost and further improves performance.
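The idea of converting once and then comparing only integers can be sketched as follows. This is an illustration under stated assumptions: `to_number` is a simplified stand-in encoder (a base-31 rolling value reduced to 60 bits), not the full four-case scheme, and the position-by-position `match_ratio` is a hypothetical similarity measure, since the text does not specify how similarity is computed.

```python
def to_number(word: str) -> int:
    """Stand-in encoder: base-31 rolling value reduced to 60 bits (not the full scheme)."""
    acc = 0
    for ch in word:
        acc = (acc * 31 + ord(ch)) % (1 << 60)
    return acc


def encode_words(words):
    """Convert a word array into a number array."""
    return [to_number(w) for w in words]


def match_ratio(query_numbers, stored_numbers):
    """Fraction of query positions whose number equals the stored number at that position."""
    if not query_numbers:
        return 0.0
    hits = sum(1 for q, s in zip(query_numbers, stored_numbers) if q == s)
    return hits / len(query_numbers)


stored = encode_words(["the", "quick", "brown", "fox"])  # computed once, before storage
query = encode_words(["the", "quick", "red", "fox"])     # computed once per input text
print(match_ratio(query, stored))  # 0.75
```

Storing `stored` alongside the source text means later matches skip the conversion step entirely and perform integer comparisons only.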
It should be noted that: like reference numerals and letters in the following figures denote like items, and thus once an item is defined in one figure, no further definition or explanation of it is required in the following figures, and furthermore, the terms "first," "second," "third," etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above examples are only specific embodiments of the present invention, intended to illustrate its technical solution rather than to limit it, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing examples, those skilled in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed by the present invention, still modify or readily conceive of changes to the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of their technical features. Such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be encompassed within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (8)

1. A method for improving translation accuracy is characterized by comprising the following steps,
Collecting a first word of a first translation material without quality defects and a second word corresponding to the first word, respectively performing binary conversion on the first word and the second word to obtain a first digital expression of the first word and a second digital expression of the second word, wherein the first word is an original word of the first translation material, and the second word is a translated word of the first translation material;
Acquiring document data to be translated, performing binary conversion on the document data to be translated to obtain a third digital expression of the document data to be translated, and obtaining second translation data of the document data to be translated by comparing the first similarity of the third digital expression and the first digital expression or the second digital expression;
collecting the length of English words to be converted;
If the length of the English word is equal to 10, the length of the English word is expressed through 4 bits of binary system to obtain a first expression, characters of word contents are subjected to 6 bits of binary system conversion and accumulation to obtain a second expression, and a digital expression is obtained through the first expression and the second expression;
If the English word length is smaller than 10, the English word length is expressed through 4 bits of binary system, the first expression is obtained, the blank characters with the English word length smaller than 10 in the word content are expressed through 6 bits of 1, a third expression is obtained, the characters of the word content are subjected to 6 bits of binary system conversion and accumulation, the second expression is obtained, and the digital expression is obtained through the first expression, the second expression and the third expression;
If the English word length is greater than 10, the English word length is expressed through 4-bit binary to obtain a fourth expression, the ASCII code value of each character of the word content is collected, 31-ary conversion is carried out on the ASCII code values and they are accumulated to obtain an accumulation result, the accumulation result is divided by 2^60 and the remainder is taken, 60-bit binary conversion is carried out to obtain a fifth expression, and the digital expression is obtained according to the fourth expression and the fifth expression;
Collecting the length of a Chinese word to be converted;
If the length of the Chinese word is equal to 4, the length of the Chinese word is expressed through 4 bits of binary system to obtain the first expression, the Unicode code value of each character of the word content is subtracted by 2000 and then converted into 15 bits of binary system accumulation to obtain the second expression, and the digital expression is obtained through the first expression and the second expression;
If the length of the Chinese word is smaller than 4, the length of the Chinese word is expressed through 4 bits of binary system to obtain the first expression, the blank characters with the length smaller than 4 of the Chinese word in the word content are expressed through 15 pieces of 1 to obtain a sixth expression, and the digital expression is obtained through the first expression and the sixth expression;
If the length of the Chinese word is greater than 4, the length of the Chinese word is expressed through 4-bit binary to obtain a seventh expression, 13131-ary conversion is carried out on the Unicode code value of each character of the word content and the values are accumulated, the accumulation result is divided by 2^60 and the remainder is taken, 60-bit binary conversion is carried out to obtain a ninth expression, and the digital expression is obtained according to the seventh expression and the ninth expression.
2. A method for improving translation accuracy as defined in claim 1,
Acquiring third translation data, performing binary conversion on the third translation data to obtain a fourth digital expression of the third translation data, and comparing the second similarity of the fourth digital expression and the first digital expression or the second digital expression to obtain the translation accuracy of the third translation data, wherein the third translation data is translated document data to be checked.
3. A method for improving translation accuracy as defined in claim 2,
And obtaining the first word or the second word based on the translation accuracy, and adding the first word or the second word into the third translation material, wherein the first word or the second word is marked in the process of adding the first word or the second word into the third translation material, and the marked form at least comprises word fonts, word sizes, word colors and dialog boxes.
4. A method for improving translation accuracy as defined in claim 3,
In the process of respectively binary converting the first word, the second word, the document data to be translated and the third translation data,
Acquiring word length of a word to be converted, and expressing through four-bit binary system to obtain a first expression;
acquiring word content of the word to be converted, and expressing the word content through sixty-bit binary system to obtain a second expression;
Constructing a digital expression based on the first expression and the second expression, wherein the digital expression comprises the first digital expression, the second digital expression, the third digital expression and the fourth digital expression.
5. A method for improving translation accuracy as defined in claim 1,
In the process of processing the English word to be converted with the English word length larger than 10, the method comprises the following steps:
S101, acquiring a first ASCII code value of a first character of the word content, and adding the first ASCII code value with a second ASCII code value of a second character of the word content after 31-ary conversion of the first ASCII code value to obtain a first result;
s103, after 31-ary conversion is carried out on the first result, adding the first result with a third ASCII code value of a third character of the word content to obtain a second result;
S105, based on the calculation process of S103, after the second result is accumulated through the last character of the word content, the second result is divided by 2^60, the remainder is taken, and 60-bit binary conversion is carried out to obtain the fifth expression.
6. A method for improving translation accuracy as defined in claim 1,
In processing the word content of the Chinese word with length greater than 4, the method comprises the following steps:
S201, extracting a first Unicode code value of a first character, performing 13131-ary conversion, and adding the result to a second Unicode code value of a second character to obtain a first result;
S203, performing 13131-ary conversion on the first result, and adding it to a third Unicode code value of a third character to obtain a second result;
S205, based on the calculation process of S203, after the second result is accumulated through the last character, dividing the second result by 2^60, taking the remainder, and performing 60-bit binary conversion to obtain the ninth expression.
7. A system for improving translation accuracy, comprising,
The first data acquisition module is used for acquiring a first original text and a first translated text of the first translated data without quality defects;
The second data acquisition module is used for acquiring a second original document of the document to be translated or the document to be checked;
the first data conversion module is used for converting the first original text and the first translated text into a first digital expression;
the second data conversion module is used for converting the second original text into a second digital expression;
collecting the length of English words to be converted;
If the length of the English word is equal to 10, the length of the English word is expressed through 4 bits of binary system to obtain a first expression, characters of word contents are subjected to 6 bits of binary system conversion and accumulation to obtain a second expression, and a digital expression is obtained through the first expression and the second expression;
If the English word length is smaller than 10, the English word length is expressed through 4 bits of binary system, the first expression is obtained, the blank characters with the English word length smaller than 10 in the word content are expressed through 6 bits of 1, a third expression is obtained, the characters of the word content are subjected to 6 bits of binary system conversion and accumulation, the second expression is obtained, and the digital expression is obtained through the first expression, the second expression and the third expression;
If the English word length is greater than 10, the English word length is expressed through 4-bit binary to obtain a fourth expression, the ASCII code value of each character of the word content is collected, 31-ary conversion is carried out on the ASCII code values and they are accumulated to obtain an accumulation result, the accumulation result is divided by 2^60 and the remainder is taken, 60-bit binary conversion is carried out to obtain a fifth expression, and the digital expression is obtained according to the fourth expression and the fifth expression;
Collecting the length of a Chinese word to be converted;
If the length of the Chinese word is equal to 4, the length of the Chinese word is expressed through 4 bits of binary system to obtain the first expression, the Unicode code value of each character of the word content is subtracted by 2000 and then converted into 15 bits of binary system accumulation to obtain the second expression, and the digital expression is obtained through the first expression and the second expression;
If the length of the Chinese word is smaller than 4, the length of the Chinese word is expressed through 4 bits of binary system to obtain the first expression, the blank characters with the length smaller than 4 of the Chinese word in the word content are expressed through 15 pieces of 1 to obtain a sixth expression, and the digital expression is obtained through the first expression and the sixth expression;
If the length of the Chinese word is greater than 4, the length of the Chinese word is expressed through 4-bit binary to obtain a seventh expression, 13131-ary conversion is carried out on the Unicode code value of each character of the word content and the values are accumulated, the accumulation result is divided by 2^60 and the remainder is taken, 60-bit binary conversion is carried out to obtain a ninth expression, and the digital expression is obtained according to the seventh expression and the ninth expression;
The data processing module is used for comparing the second digital expression with the first digital expression to obtain a second translation;
the display module is used for displaying the second translation;
the storage module is used for storing the first original text, the first translation, the second original text and the second translation, wherein the storage module is also used for fusing the first original text and the second original text to obtain a new first original text, and fusing the second translation and the first translation to obtain a new first translation.
8. An apparatus for improving translation accuracy, comprising,
The input device is used for inputting the document to be translated or the document to be verified;
the display device is used for displaying the translation result of the document to be translated or the auditing result of the document to be audited;
The data processing equipment is used for performing binary digital conversion on the document to be translated or the document to be verified to obtain a first digital expression, performing similarity matching according to second digital expressions existing in the data processing equipment, and selecting words corresponding to at least one second digital expression with highest similarity according to a matching result to obtain a translation result or a verification result;
collecting the length of English words to be converted;
If the length of the English word is equal to 10, the length of the English word is expressed through 4 bits of binary system to obtain a first expression, characters of word contents are subjected to 6 bits of binary system conversion and accumulation to obtain a second expression, and a digital expression is obtained through the first expression and the second expression;
If the English word length is smaller than 10, the English word length is expressed through 4 bits of binary system, the first expression is obtained, the blank characters with the English word length smaller than 10 in the word content are expressed through 6 bits of 1, a third expression is obtained, the characters of the word content are subjected to 6 bits of binary system conversion and accumulation, the second expression is obtained, and the digital expression is obtained through the first expression, the second expression and the third expression;
If the English word length is greater than 10, the English word length is expressed through 4-bit binary to obtain a fourth expression, the ASCII code value of each character of the word content is collected, 31-ary conversion is carried out on the ASCII code values and they are accumulated to obtain an accumulation result, the accumulation result is divided by 2^60 and the remainder is taken, 60-bit binary conversion is carried out to obtain a fifth expression, and the digital expression is obtained according to the fourth expression and the fifth expression;
And the data storage device is used for storing the document to be translated, the audit translation document, the translation result and the audit result and updating the stored data according to the storage result.
CN202110745049.XA 2021-07-01 2021-07-01 Method, system and device for improving translation accuracy Active CN113420570B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110745049.XA CN113420570B (en) 2021-07-01 2021-07-01 Method, system and device for improving translation accuracy

Publications (2)

Publication Number Publication Date
CN113420570A CN113420570A (en) 2021-09-21
CN113420570B true CN113420570B (en) 2024-04-30

Family

ID=77719954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110745049.XA Active CN113420570B (en) 2021-07-01 2021-07-01 Method, system and device for improving translation accuracy

Country Status (1)

Country Link
CN (1) CN113420570B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6314469B1 (en) * 1999-02-26 2001-11-06 I-Dns.Net International Pte Ltd Multi-language domain name service
CN101178705A (en) * 2007-12-13 2008-05-14 中国电信股份有限公司 Free-running speech comprehend method and man-machine interactive intelligent system
CN101261633A (en) * 2008-04-02 2008-09-10 深圳市共进电子有限公司 Electronic translation method and system based on engineering
CN102693222A (en) * 2012-05-25 2012-09-26 熊晶 Carapace bone script explanation machine translation method based on example
CN103559172A (en) * 2013-11-06 2014-02-05 北京百度网讯科技有限公司 Phrasing method and device for multi-language mixed text
CN103793527A (en) * 2014-02-25 2014-05-14 惠州Tcl移动通信有限公司 Sign language interpreting method and sign language interpreting system based on gesture tracing
CN104331399A (en) * 2014-07-25 2015-02-04 一朵云(北京)科技有限公司 Dictionary tree translation method
CN105408891A (en) * 2013-06-03 2016-03-16 机械地带有限公司 Systems and methods for multi-user multi-lingual communications
CN105472451A (en) * 2015-03-18 2016-04-06 深圳Tcl数字技术有限公司 Data transmission method and device between terminals
TWM532593U (en) * 2016-08-10 2016-11-21 Nat Taichung University Science & Technology Voice-translation system
CN107329957A (en) * 2017-05-18 2017-11-07 网易(杭州)网络有限公司 Replace the method and computer-readable recording medium of code Chinese character string
CN109492233A (en) * 2018-11-14 2019-03-19 北京捷通华声科技股份有限公司 A kind of machine translation method and device
CN109634869A (en) * 2018-12-21 2019-04-16 中国人民解放军战略支援部队信息工程大学 Binary translation intermediate representation correctness test method and device based on semantic equivalence verifying
CN111753555A (en) * 2020-06-17 2020-10-09 兰州大学 Method and system for translating mathematic formula into Braille based on MathML
CN112818712A (en) * 2021-02-23 2021-05-18 语联网(武汉)信息技术有限公司 Machine translation method and device based on translation memory library

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050027547A1 (en) * 2003-07-31 2005-02-03 International Business Machines Corporation Chinese / Pin Yin / english dictionary

Also Published As

Publication number Publication date
CN113420570A (en) 2021-09-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant