CN109325237B

CN109325237B - Complete sentence recognition method and system for machine translation

Info

Publication number: CN109325237B
Application number: CN201811225110.2A
Authority: CN
Inventors: 何恩培; 郑丽华; 王莲
Original assignee: Transn Iol Technology Co ltd
Current assignee: Transn Iol Technology Co ltd
Priority date: 2018-10-22
Filing date: 2018-10-22
Publication date: 2023-06-13
Anticipated expiration: 2038-10-22
Also published as: CN109325237A

Abstract

The application proposes a complete sentence recognition system for machine translation, the system comprising: (1) a preprocessing system, which preprocesses the text to be translated, including paragraph recognition, end recognition and the like, for example identifying the beginning and the end of each paragraph and the full-text end of the text to be translated, and obtains a set of paragraph sub-parts in units of paragraphs; (2) a paragraph sub-part processing system, which processes the paragraph sub-part set paragraph by paragraph and outputs complete sentences; and (3) a complete sentence uploading system, which uploads the complete sentences output by the paragraph sub-part processing system to a machine translation engine.

Description

Complete sentence recognition method and system for machine translation

Technical Field

The application belongs to the technical field of translation, and particularly relates to a complete sentence identification method and system for machine translation.

Background

With the development of computer-aided technology, various industries have introduced it to improve working efficiency, and translation is no exception. Various machine translation tools have emerged that enable automatic translation between different languages.

However, most machine translation is, after all, a simple process of word lookup, comparison, replacement and splicing; the translation is mechanical and lacks any capability of semantic analysis. In addition, the corpus to be translated is usually a long text with large passages, but many translation engines can only translate a limited number of words at a time, so a translator has to copy a limited amount of text into the translation engine manually, and this process is not segmented automatically. Even where automatic text segmentation techniques exist, they only divide the text evenly according to the quantity limit, without considering that each segmented sub-part needs to have complete semantics.

Further, although existing machine translation engines support uploading a whole passage for translation, the translation result obtained this way is poor, and the inventors found that the larger the passage uploaded at once, the poorer the translation. Moreover, the result of machine translation is not the final result and must be proofread manually afterwards. If whole passages are always input, the number of errors to be proofread in the output becomes huge, possibly exceeding the workload of translating manually, and working efficiency is greatly reduced.

In general, to ensure that machine translation itself outputs a high-quality result, a translator usually uploads a single complete sentence at a time, so that the translation engine outputs a translated sentence with complete meaning that contains relatively few errors and can be checked by the translator in real time. However, this requires the translator to identify manually which sentences in the text to be translated have complete meaning, which amounts to reading the whole text, so the overall efficiency is still low. In addition, although a single sentence may be complete, it may contain too few words, and inputting one short sentence at a time increases the number of inputs and uploads.

The prior art includes techniques for judging whether a sentence has complete meaning. For example, if a run of continuous text ends with a period, a question mark or an exclamation mark, the sentence is considered to have ended and the continuous text is considered to form a complete sentence. On this basis, sentence-end detection can be implemented by detecting specific symbols, thereby completing sentence segmentation. Of course, this approach achieves the intended effect only on the premise that the text to be processed strictly complies with punctuation usage rules.

However, in the current language environment very few people use punctuation marks strictly according to the rules. Most people never use a period except at the end of a paragraph or of an article, and instead use commas throughout or semicolons one after another. In addition, the misuse of question marks and exclamation marks is common in a variety of special writing styles (e.g., the "roaring" style). Therefore, sentences having complete meaning cannot be accurately recognized by the aforementioned judgment alone.

Disclosure of Invention

In order to solve the above problems, in particular the need to accurately segment sentences with complete meaning during translation, the present application proposes a complete sentence recognition method and system for machine translation, which can accurately recognize, within a text to be translated, whether a continuous text segment has ended and forms a sentence, so that the continuous text segment is segmented off as the content of a single upload and input into a machine translation engine.

Here, a complete sentence is a sentence with complete meaning; it is not determined by a terminating period, and its recognition does not depend on whether the text to be translated uses punctuation marks correctly.

In a first aspect of the present invention, there is provided a complete sentence recognition system for machine translation, the system comprising:

(1) Preprocessing system: the preprocessing system preprocesses the text to be translated, including paragraph recognition, end recognition and the like, for example identifying the beginning and the end of each paragraph and the full-text end of the text to be translated, and obtains a set of paragraph sub-parts in units of paragraphs;

(2) Paragraph sub-part processing system: the paragraph sub-part processing system processes the paragraph sub-part set by taking a paragraph as a unit and outputs a complete sentence;

(3) Complete sentence uploading system: the complete sentence uploading system uploads the complete sentence output by the paragraph sub-part processing system to a machine translation engine.

The paragraph sub-part processing system specifically comprises a to-be-translated text reading system, a to-be-translated keyword extraction system, a cloud corpus sentence-grouping system, and a judging and identifying system;

the to-be-translated text reading system reads characters continuously, starting from the first unread character of the current paragraph, until a pause symbol is read; the continuous characters read form a candidate sentence;

reading a pause symbol means that the consecutive characters read so far are likely to form a complete sentence with independent meaning, and they are therefore treated as a potential candidate sentence; however, the potential candidate sentence must be judged further to determine whether it is indeed a complete sentence;
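The reading step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `read_candidate` and the contents of `PAUSE_SYMBOLS` are assumptions chosen for the example, since the text leaves the exact symbol set to the implementer.

```python
# Hypothetical pause-symbol set; the text leaves the exact set to the implementer.
PAUSE_SYMBOLS = set("。？！、，；.?!,;")


def read_candidate(paragraph: str, start: int, pause_symbols=PAUSE_SYMBOLS):
    """Read characters from `start` up to and including the next pause symbol.

    Returns the characters read (the candidate sentence) and the index of the
    first unread character. If no pause symbol follows, reading stops at the
    end of the paragraph.
    """
    end = start
    while end < len(paragraph):
        ch = paragraph[end]
        end += 1
        if ch in pause_symbols:
            break
    return paragraph[start:end], end
```

Calling the function again with the returned index continues reading from the first unread character, which is how a candidate can later be extended.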

the to-be-translated keyword extraction system extracts a plurality of keywords to be translated from the candidate sentence; the keywords to be translated are real words with actual meanings; the keywords to be translated are input into the cloud corpus sentence-grouping system;

the cloud corpus sentence grouping system outputs at least one matched sentence based on the cloud corpus;

based on automatic learning from a large-scale corpus, the cloud corpus sentence-grouping system can learn from text and compose sentences automatically. Of course, a matched sentence generated from the cloud corpus on the basis of the plurality of keywords to be translated extracted from the candidate sentence is a complete, independent sentence.

The matched sentence and the candidate sentence are input into the judging and identifying system, which outputs a complete sentence judgment result.

The specific decision criteria may be one or a combination of the following:

comparing the lengths of the current candidate sentence and the generated matching sentence, and judging whether the length difference is in a first threshold range or not;

comparing the similarity between the current candidate sentence and the generated matching sentence, and judging whether the similarity is within a second threshold value range or not;

obtaining the length difference is simple and easy to implement; the similarity comparison can use any text similarity comparison method existing in the prior art, which is not described again here.
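The two criteria above can be sketched as a small predicate. The source does not name a similarity method, so this example substitutes Python's standard-library `difflib.SequenceMatcher` as a stand-in; the function name `is_complete`, the threshold values, and the `require_both` switch (modelling the "one or a combination" wording) are all assumptions.

```python
from difflib import SequenceMatcher


def is_complete(candidate: str, matched: str,
                max_len_diff: int = 5,        # hypothetical first threshold
                min_similarity: float = 0.6,  # hypothetical second threshold
                require_both: bool = True) -> bool:
    """Judge a candidate sentence against one matched sentence."""
    # Criterion 1: the length difference lies within the first threshold range.
    len_ok = abs(len(candidate) - len(matched)) <= max_len_diff
    # Criterion 2: the similarity lies within the second threshold range.
    # SequenceMatcher.ratio() is a stand-in for any prior-art similarity measure.
    sim_ok = SequenceMatcher(None, candidate, matched).ratio() >= min_similarity
    return (len_ok and sim_ok) if require_both else (len_ok or sim_ok)
```

Requiring both criteria is the stricter choice; setting `require_both=False` accepts a candidate when either criterion holds.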

If the length difference meets the first threshold range condition and/or the similarity meets the second threshold range condition, the judging and identifying system judges that the current candidate sentence is a complete sentence;

at this point, the current sentence to be translated has been processed and recognized and can be used for actual operations (segmentation, uploading, etc.); the technical scheme of the invention then continues to read characters and repeats the above steps, that is, it reads the next sentence to be translated, obtains a new candidate sentence, and judges whether it forms a complete sentence;

if the length difference does not meet the first threshold range condition and/or the similarity does not meet the second threshold range condition, the current candidate sentence is not a complete sentence, which indicates that more characters belonging to the same sentence follow the current candidate sentence; the technical scheme of the present invention therefore further includes: continuing to read the unread characters after the current pause symbol until the next pause symbol is read, and appending the continuous characters read to the current candidate sentence;

as a result, the number of characters in the current candidate sentence increases and more keywords to be translated can be obtained; the above steps are then repeated to judge whether the current candidate sentence is a complete sentence.
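The extend-and-retry behaviour described above can be sketched as a loop that grows the candidate by one pause-delimited chunk at a time. The name `next_complete_sentence`, the `judge` callback (standing in for keyword extraction, sentence grouping and comparison), and the pause-symbol set are assumptions; emitting whatever remains at the end of the paragraph is also an assumption, since the text only says the paragraph end is treated as a pause.

```python
from typing import Callable, Tuple

PAUSE_SYMBOLS = set("，。；？！,.;?!")  # hypothetical pause-symbol set


def next_complete_sentence(paragraph: str, start: int,
                           judge: Callable[[str], bool]) -> Tuple[str, int]:
    """Grow the candidate chunk by chunk until `judge` accepts it.

    Returns the accepted sentence and the index of the first unread character.
    If the paragraph ends first, the remaining text is returned as-is.
    """
    end = start
    while end < len(paragraph):
        # Read up to and including the next pause symbol.
        while end < len(paragraph):
            end += 1
            if paragraph[end - 1] in PAUSE_SYMBOLS:
                break
        candidate = paragraph[start:end]
        if judge(candidate):
            return candidate, end
        # Not complete: keep the characters read and append the next chunk.
    return paragraph[start:end], end
```

Because the candidate keeps its earlier characters, each retry gives the judging step a longer text and therefore more keywords to work with.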

As a further improvement of the invention, the length of the current sentence to be translated to be uploaded can also be controlled.

The inventors found that a sentence to be translated is sometimes recognized as a complete sentence but is too short; inputting it on its own is unnecessary, and it can instead be input together with other complete sentences.

To further solve this problem, the paragraph sub-part processing system according to the present invention further comprises a candidate sentence length judging module, used for judging the length of the current candidate sentence after the to-be-translated text reading system outputs the candidate sentence; if the length of the candidate sentence is smaller than a first set value, the current candidate sentence does not meet the condition for entering the to-be-translated keyword extraction system, and processing returns to the to-be-translated text reading system to continue reading new characters.

In addition, the inventors further found that, even if the candidate sentence meets this condition, the number of keywords to be translated extracted from it may be so small that the cloud corpus sentence-grouping system cannot output a correct matched sentence, which affects the comparison result.

To solve this problem, the paragraph sub-part processing system according to the present invention further comprises a to-be-translated keyword judging module, used for judging, after the to-be-translated keyword extraction system extracts the keywords to be translated, whether their number meets a second set value; if not, the current candidate sentence does not meet the condition for entering the cloud corpus sentence-grouping system, that is, the current candidate sentence cannot form a complete sentence, and processing returns to the to-be-translated text reading system to continue reading new characters.
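The two gating modules can be sketched as one helper that runs the length check before keyword extraction and the keyword-count check after it. The name `gate_candidate`, the `extract_keywords` callback, and the default set values are assumptions; "meets the set value" is interpreted here as "is at least the set value", in line with the "less than" wording of claim 1.

```python
from typing import Callable, List, Optional


def gate_candidate(candidate: str,
                   extract_keywords: Callable[[str], List[str]],
                   min_chars: int = 10,     # hypothetical first set value
                   min_keywords: int = 2    # hypothetical second set value
                   ) -> Optional[List[str]]:
    """Return the keywords if both gates pass, otherwise None.

    None tells the caller to go back to the reading step and read more
    characters before trying again.
    """
    # Length gate: skip keyword extraction for candidates that are too short.
    if len(candidate) < min_chars:
        return None
    keywords = extract_keywords(candidate)
    # Keyword gate: too few keywords cannot yield a useful matched sentence.
    if len(keywords) < min_keywords:
        return None
    return keywords
```

Both gates are cheap compared with a call to the sentence-grouping engine, so failing early avoids comparisons that could not succeed.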

The technical scheme of the invention can be implemented with computer control-flow instructions. The recognition and judgment process is an iterative loop that contains an inner loop for a single sentence to be translated, whose termination condition is that the current sentence to be translated already forms a complete sentence, after which the next sentence to be translated is recognized and judged. When the text to be translated is input paragraph by paragraph, processing terminates when the paragraph end marker is read; when the full text is input, processing terminates when the full-text end marker is read.

Thus, in a second aspect of the present invention, there is provided a computer-implemented method of assisting machine translation, for identifying a sentence having complete and independent meaning in a current text to be translated and uploading the complete sentence to a machine translation engine, the method comprising the steps of:

S1: reading a current unprocessed paragraph of the current text to be translated;

S2: starting to read characters continuously from the first unread character of the current unprocessed paragraph;

S3: judging whether the currently read character is a pause character; if yes, going to step S4; otherwise, repeating step S2;

S4: extracting a plurality of sentence trunk words based on a current sentence to be translated formed by the read characters;

S5: outputting at least one comparison sentence according to the plurality of sentence trunk words;

S6: judging whether the current sentence to be translated forms a complete sentence based on the comparison of the at least one comparison sentence and the current sentence to be translated;

S7: judging whether the current pause symbol is a full-text end marker; if so, ending the processing; otherwise, going to step S8;

S8: judging whether the current pause symbol is a paragraph end marker; if so, going to step S1; otherwise, going to step S2.

The step S5 specifically includes: inputting the plurality of sentence trunk words into a machine learning engine based on a cloud corpus, and outputting at least one comparison sentence;

wherein step S6 includes: comparing the lengths of the current sentence to be translated and the at least one comparison sentence, and judging whether the length difference is within a third threshold range; and/or comparing the similarity between the current sentence to be translated and the at least one comparison sentence, and judging whether the similarity is within a fourth threshold range;

further, if the length difference and/or the similarity are within the corresponding threshold range, judging that the current sentence to be translated forms a complete sentence;

further, the threshold range may be adjustable. A threshold range adjusting module may be configured to adjust the first, second, and third threshold ranges.
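Steps S1 to S8 can be sketched as a single loop over paragraphs. This is an illustrative sketch under stated assumptions, not the claimed implementation: `extract` and `compose` are hypothetical callables standing in for trunk-word extraction and the cloud-corpus machine learning engine, `difflib.SequenceMatcher` stands in for the unspecified similarity measure, both criteria are required together, the thresholds are invented, and any text remaining at the end of a paragraph is emitted as-is.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable, List

PAUSE = set(",.;?!，。；？！、")  # hypothetical pause-symbol set


def segment(paragraphs: Iterable[str],
            extract: Callable[[str], List[str]],
            compose: Callable[[List[str]], List[str]],
            max_len_diff: int = 10,         # hypothetical third threshold
            min_similarity: float = 0.5     # hypothetical fourth threshold
            ) -> List[str]:
    """Return the sentences that would be uploaded to the translation engine."""
    uploads: List[str] = []
    for paragraph in paragraphs:            # S1; running out of paragraphs is S7
        start = pos = 0
        while pos < len(paragraph):
            pos += 1                        # S2: read one more character
            at_end = pos == len(paragraph)  # S8: paragraph end acts as a pause
            if paragraph[pos - 1] not in PAUSE and not at_end:
                continue                    # S3: not a pause symbol, keep reading
            sentence = paragraph[start:pos]
            trunk = extract(sentence)       # S4: sentence trunk words
            comparisons = compose(trunk)    # S5: comparison sentences
            complete = any(                 # S6: length and similarity checks
                abs(len(sentence) - len(c)) <= max_len_diff
                and SequenceMatcher(None, sentence, c).ratio() >= min_similarity
                for c in comparisons)
            if complete or at_end:
                if sentence.strip():
                    uploads.append(sentence.strip())
                start = pos                 # the next sentence starts here
    return uploads
```

When S6 fails, `start` is left unchanged, so the next pause symbol extends the same sentence rather than starting a new one.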

In a third aspect of the present invention, a computer-readable storage medium is provided, on which computer-executable instructions are stored; when executed by a computer having a memory and a processor, the instructions implement the computer-implemented recognition method according to the present invention, for recognizing sentences having complete and independent meaning in a current text to be translated.

The technical scheme of the invention at least achieves the following outstanding effects:

1) Complete sentences are identified based on semantics rather than punctuation, so whether the current text strictly complies with punctuation usage rules does not affect the identification;

2) The recognized complete sentence has controllable length and complete meaning, and is automatically realized based on a computer, so that the efficiency is high;

3) Automatic learning based on a large-scale corpus is used for translation assistance for the first time; in particular, the matched sentence output by the cloud corpus sentence-grouping system is used as the comparison standard, which conforms to objective rules.

Further embodiments and advantages of the present invention will be described in detail in the detailed description.

Drawings

FIG. 1 is a block diagram of a complete sentence recognition system for machine translation in accordance with the present invention

FIG. 2 is a computer-implemented flowchart of the method of the present invention

Detailed Description of Embodiments

Referring to fig. 1, the complete sentence recognition system for machine translation of the present invention includes:

wherein the first unread character of the current paragraph may be a single character, a word, or a punctuation mark that can be used at the beginning of a paragraph or sentence, such as a left single quotation mark (‘), a left double quotation mark (“), etc.;

Normally, if the text to be translated used punctuation marks strictly according to the punctuation rules, a complete sentence could be identified simply by reading a period, a question mark or an exclamation mark; but, as mentioned above, texts to be translated in practice do not strictly follow the standard. Therefore, to solve this problem, the present application abandons the end-symbol judgment of the prior art and instead reads from the first unread character of the current paragraph until a pause symbol is read; the consecutive characters read constitute the sentence to be processed.

The pause symbol here refers to a punctuation mark that, when read, can represent a pause in a sentence, including the period, question mark, exclamation mark, enumeration comma, comma, quotation marks (right single quotation mark, left single quotation mark), semicolon and the like, all of which can cause a temporary pause in the sentence. It can be understood that dashes, title marks, brackets and the like do not cause a pause in the sentence and are not regarded as pause symbols. Although a colon may mark a pause, the part following the colon is generally considered a continuation of the preceding sentence, so a colon is not regarded as a pause symbol either. In addition, since the technical scheme of the application includes a paragraph recognition device, the pause symbols also include the paragraph end marker and the full-text end marker identified by the paragraph recognition device.

The above examples are merely illustrative and not exhaustive, and those skilled in the art may pre-establish a set of pause symbols for subsequent query determinations in particular implementations.
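A pre-established pause-symbol set for later lookup might look like the following. The membership shown is taken from the examples in the text and is not exhaustive; the marker characters `PARAGRAPH_END` and `TEXT_END` are hypothetical placeholders for whatever the preprocessing step inserts.

```python
# Hypothetical markers assumed to be inserted by the preprocessing step.
PARAGRAPH_END = "\n"
TEXT_END = "\x00"

# Period, question mark, exclamation mark, enumeration comma, comma,
# semicolon and single quotation marks, in full-width and ASCII forms,
# plus the two end markers.
PAUSE_SYMBOLS = frozenset("。？！、，；‘’.?!,;") | {PARAGRAPH_END, TEXT_END}

# Symbols the text explicitly excludes: dash, title marks, brackets, colon.
NON_PAUSE_EXAMPLES = frozenset("\u2014《》（）()：:")


def is_pause(ch: str) -> bool:
    """Look up a single character in the pre-established pause-symbol set."""
    return ch in PAUSE_SYMBOLS
```

A frozen set gives constant-time membership tests and cannot be modified by accident while text is being read.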

Specifically, the candidate sentence is composed of a plurality of words, some of which are real words (content words) and some of which are virtual words (function words). Real words are words having actual meanings, such as "today", "work-down", "estimated", "submit", "line", etc.; virtual words generally serve to connect, modify and so on, and on their own do not carry actual meaning, such as "then", "and", "the", "should", "does", "such", etc. In natural language processing, the related prior-art techniques for segmenting or identifying real words and virtual words differ somewhat in their criteria, but the meanings are consistent, and they are not described again here.

Based on the prior-art techniques for segmenting real words and virtual words, a plurality of sentence trunk words, or keywords to be translated, are extracted from the candidate sentence; the sentence trunk words or keywords to be translated may be the real words in the current sentence to be processed.
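As a rough illustration of trunk-word extraction, the sketch below keeps every token that is not in a small function-word list. This is a deliberate simplification and an assumption: the list is seeded with the virtual-word examples given in the text plus a few common English function words, and the regular-expression tokenizer only suits space-delimited languages; Chinese text would need a word segmenter and a part-of-speech tagger instead.

```python
import re
from typing import List

# Hypothetical function-word list, seeded with the examples from the text.
FUNCTION_WORDS = {"then", "and", "the", "should", "does", "such",
                  "a", "an", "of", "to", "is"}


def extract_trunk_words(sentence: str) -> List[str]:
    """Return the tokens of `sentence` that are not function words."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [t for t in tokens if t not in FUNCTION_WORDS]
```

For example, `extract_trunk_words("Then the team should submit the report today.")` keeps "team", "submit", "report" and "today" and drops the rest.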

Based on automatic learning from a large-scale corpus, automatic learning and sentence composition of text can be realized. Of course, similar machine learning technologies exist in the prior art, such as the robot news writers and automatic article-writing robots realized in recent years; these robots can automatically generate a news draft or an article from a few trunk words input by a user (similar to the keywords to be translated and prompt words described in the present invention), with an effect comparable to the level of a professional news writer, such that readers cannot even tell that the article was written by a robot.

The inventors found that these machine learning tools all learn automatically from a large-scale corpus, so the present application can likewise provide a cloud-based corpus for machine learning in order to build a machine learning engine, namely the cloud corpus sentence-grouping system of the invention. The extracted sentence trunk words are input into the cloud corpus sentence-grouping system, which then outputs at least one comparison sentence based on the cloud corpus, working in a manner similar to the robot news writers and automatic article-writing robots described above.

Of course, the invention does not need to output a whole news draft or a whole article, only whole sentences, so the machine learning engine of the invention can be simpler and faster. Its output can be a plurality of completely independent sentences with complete meaning rather than a single result, which, for this purpose, works better than the existing robot news writers and automatic article-writing robots; this is because the inventors creatively applied the technique to the specific needs of translation.

The specific decision criteria may be one or a combination of the following,

at this point, the current sentence to be translated has been processed and recognized and can be used for actual operations (segmentation, uploading, etc.); the technical scheme of the invention then continues to read characters and repeats the above steps, that is, it reads the next sentence to be translated, obtains a new candidate sentence, and judges whether it forms a complete sentence;

as a result, the number of characters in the current candidate sentence increases and more sentence trunk words can be obtained; the above steps are then repeated to judge whether the current sentence to be translated (the candidate sentence) is a complete sentence.

To solve this problem, the paragraph sub-part processing system according to the present invention further comprises a keyword judgment module, used for judging, after the to-be-translated keyword extraction system extracts the keywords to be translated, whether their number meets a second set value; if not, the current candidate sentence does not meet the condition for entering the cloud corpus sentence-grouping system, that is, the current candidate sentence cannot form a complete sentence, and processing then returns to the to-be-translated text reading system to continue reading new characters.

Referring to fig. 2, a computer-implemented recognition method is provided, which in this embodiment comprises steps S1-S8 of fig. 2.

Specifically, each step performs the following functions:

S6: based on the comparison of the at least one comparison sentence and the current sentence to be translated, identifying whether the current sentence to be translated is a complete sentence; if yes, uploading the current sentence to be processed to a machine translation engine;

As a further embodiment, the method further includes, after step S4, the steps of:

S41: judging the number of extracted sentence trunk words, and returning to step S2 if the number is less than a first set value;

As a further preference, after step S3 and before step S4, the method further includes step S31:

S31: judging whether the number of characters read is smaller than a second set value, and if so, returning to step S2.

These further preferred steps tighten the judgment criteria of the algorithm in a concrete implementation and reduce the number of subsequent loop iterations.

(1) Sentences with complete meaning in the text to be processed are identified semantically, rather than by using punctuation marks as the judgment standard;

(2) The judgment standard is based on large-scale semantic learning and combines the advanced technology of machine learning;

(3) Although semantics-based automatic article generation by robots belongs to the prior art, it is applied here to the recognition of translation corpus for the first time; moreover, unlike the prior art, the object of the present invention is not to generate text for its own sake, but to use the generated text as a judgment criterion;

(4) Whereas the prior art generates a whole article from existing keywords, which requires the output article to be unique and as accurate as possible, the invention focuses on the diversity of the results output from a few existing keywords, which makes the judgment more accurate.

Claims

1. A computer-implemented method of assisting machine translation, wherein a set of pause symbols is pre-established, the method comprising the steps of:

S1: reading a current unprocessed paragraph of a current text to be processed;

S3: judging whether the currently read character is a pause character;

if the currently read character is a pause character, judging whether the number of characters read is smaller than a second set value;

if the number of characters read is smaller than the second set value, returning to step S2;

if the number of characters read is not smaller than the second set value, going to step S4;

if the currently read character is not a pause character, repeating step S2;

S4: extracting a plurality of sentence trunk words based on a current sentence to be processed formed by the read characters;

judging the number of extracted sentence trunk words, and returning to step S2 if the number is less than a first set value;

S5: inputting the plurality of sentence trunk words into a machine learning engine based on a cloud corpus, and outputting at least one comparison sentence;

S6: based on the comparison of the at least one comparison sentence and the current sentence to be processed, identifying whether the current sentence to be processed forms a complete sentence; if yes, uploading the current sentence to be processed to a machine translation engine;

2. A complete sentence recognition system for machine translation for implementing the method of claim 1, the system comprising:

(1) Preprocessing system: the preprocessing system preprocesses the text to be translated and outputs a paragraph sub-part set;

(3) Complete sentence uploading system: uploading the complete sentence output by the paragraph sub-part processing system to a machine translation engine;

characterized in that:

a pause symbol set is established in advance and used for subsequent inquiry judgment;

the paragraph sub-part processing system comprises a to-be-translated text reading system, a to-be-translated keyword extracting system, a cloud corpus sentence combining system and a judgment and identification system;

the keyword extraction system to be translated extracts a plurality of sentence trunk words from the candidate sentences based on the segmentation technology of real words or virtual words; inputting the plurality of sentence trunk words into the cloud corpus sentence grouping system;

the judging and identifying system compares the current candidate sentence with the matching sentence, and outputs a judging result based on whether the comparison condition meets a preset condition or not;

whether the comparison condition satisfies a predetermined condition includes: comparing the lengths of the current candidate sentence and the generated matching sentence, and judging whether the length difference is in a first threshold range or not; comparing the similarity between the current candidate sentence and the generated matching sentence, and judging whether the similarity is within a second threshold range;

if the length difference does not meet the first threshold range condition and/or the similarity does not meet the second threshold range condition, the current candidate sentence is not a complete sentence, and the unread characters after the current pause symbol are continuously read until the next pause symbol is read; the read continuous characters are added to the current candidate sentence.

3. The complete sentence recognition system for machine translation of claim 2,

the paragraph sub-part processing system further comprises: a candidate sentence length judging module, used for judging the length of the current candidate sentence after the to-be-translated text reading system outputs the candidate sentence.

4. The complete sentence recognition system for machine translation of claim 2,

the paragraph sub-part processing system further comprises: a to-be-translated keyword judging module, used for judging whether the number of keywords to be translated meets a second set value after the to-be-translated keyword extraction system extracts the keywords to be translated.