CN109284503B

CN109284503B - Translation statement ending judgment method and system

Info

Publication number: CN109284503B
Application number: CN201811226769.XA
Authority: CN
Inventors: 何恩培; 郑丽华; 王莲
Original assignee: Transn Iol Technology Co ltd
Current assignee: Zhongguancun Technology Leasing Co ltd
Priority date: 2018-10-22
Filing date: 2018-10-22
Publication date: 2023-08-18
Anticipated expiration: 2038-10-22
Also published as: CN109284503A

Abstract

The application provides a method and a system for judging the end of a translation sentence, which can accurately identify, from the text to be processed, whether a run of continuous text has ended and forms a sentence, thereby completing the sentence-end judgment. The system comprises a text importing device, a paragraph recognition device, a sentence recognition device, a semantic combining device and a credibility judging device. The application recognizes sentences with complete meanings in the text to be processed on the basis of semantics rather than by using punctuation marks as the judgment standard.

Description

Translation statement ending judgment method and system

Technical Field

The application belongs to the field of machine learning, and particularly relates to a translation sentence ending judgment method and system.

Background

In the translation process, a long text to be translated usually has to be segmented. One requirement of segmentation is that each resulting sub-part be a complete and independent piece of corpus: the first and second halves of one sentence must not end up in different sub-parts. In addition, translation usually relies on the assistance of machine translation, and a translator needs to upload the text to be translated to a machine translation tool. Although existing machine translation engines support uploading and translating a whole passage, the results obtained this way are poor, so the translator usually has to upload complete sentences one at a time to obtain a relatively complete result for comparison. In another scenario, the translated result must be checked for correctness, and the text again has to be uploaded in units of complete sentences for inspection. This process faces an important problem: how to segment the text so as to obtain complete sentences.

A simple approach relies on sentence-ending symbols: it is generally assumed that if a run of continuous text ends with a period, question mark, or exclamation mark, the sentence has ended and the continuous text constitutes a complete sentence. Based on this idea, sentence-end detection can be implemented by detecting specific symbols, which completes sentence segmentation. Of course, this approach achieves the intended effect only on the premise that the text to be processed strictly complies with punctuation usage rules.
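The symbol-based approach described above can be sketched in a few lines. This is only an illustration of the baseline, not code from the application; the function name `split_by_end_marks` and the set `END_MARKS` are assumptions made for the example.

```python
import re

# Sentence-ending symbols assumed for this sketch (ASCII and full-width forms).
END_MARKS = ".?!。？！"


def split_by_end_marks(text: str) -> list[str]:
    """Cut the text after every run of sentence-ending symbols.

    Any trailing text without an ending symbol is kept as a final piece.
    """
    marks = re.escape(END_MARKS)
    pieces = re.findall(rf"[^{marks}]+[{marks}]*", text)
    return [piece.strip() for piece in pieces if piece.strip()]


if __name__ == "__main__":
    # Well-punctuated text is split as expected.
    print(split_by_end_marks("It rained. We stayed home!"))
    # Text that only uses commas comes back as a single piece.
    print(split_by_end_marks("it rained, we stayed home, then we cooked"))
```

The second call shows the weakness of this baseline: a text that never uses a period is returned as one undivided piece, however many complete sentences it contains.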

Obviously, in the current language environment, few people use punctuation marks strictly according to the rules. Most people never use a period except at the end of a paragraph or article, and instead use commas all the way through or chain semicolons one after another; likewise, misuse of question marks and exclamation marks is common in various special styles of writing (e.g., the "roaring" style). Therefore, sentences having complete meanings in the text cannot be accurately recognized by the aforementioned judgment alone.

Disclosure of Invention

In order to solve the problems, particularly the problem that sentences in the complete meaning need to be accurately segmented in the translation process, the application provides a method and a system for judging the end of a translation sentence, which can accurately identify whether a section of continuous text ends to form a sentence from the text to be processed, thereby completing the judgment of the end of the sentence.

In a first aspect of the present application, there is provided a translation sentence end judgment system including a text importing device, a paragraph identifying device, a sentence identifying device, a semantic combining device, and a credibility judging device; in the concrete implementation, the text to be processed is imported into the system through the text importing device; then operating the paragraph identification device;

the paragraph identification device performs preliminary processing on the imported text to be processed to obtain a paragraph sub-part set taking the paragraph as a unit, for example, the beginning and the end of the paragraph are identified, and the full text end of the text to be processed can also be identified; then, the paragraph sub-part set enters a sentence recognition device segment by segment;

the sentence identifying device processes the paragraph sub-part set by taking paragraphs as units, and the specific processing steps comprise:

(1) Continuously reading the remaining characters from the first unread character of the current paragraph until the pause symbol is read; the read continuous characters form a sentence to be processed;

(2) Extracting a plurality of sentence trunk words from the sentence to be processed; the sentence trunk words are real words, that is, words with actual meanings;

(3) Inputting the plurality of sentence trunk words into the semantic combining device, wherein the semantic combining device outputs at least one comparison sentence based on a cloud corpus;

(4) Inputting the sentence to be processed and the at least one comparison sentence into the credibility judging device;

(5) The credibility judging device outputs a judging result.

Detecting a pause sign means that consecutive characters that have been read are likely to form a complete sentence, have independent meaning, and are therefore considered potential candidate sentences; however, further judgment is needed for the potential candidate sentences to determine whether the candidate sentences are truly a complete independent sentence; taking the potential candidate sentences as sentences to be processed, and entering the next step of processing;

this next step of processing the sentence to be processed is the core of the technical scheme of the application. The processing concept is as follows:

extracting a plurality of sentence trunk words from the sentence to be processed;

inputting the plurality of sentence trunk words into the semantic combining device, and outputting at least one comparison sentence by the semantic combining device based on a cloud corpus.

Based on automatic learning from a large-scale corpus, the application can learn texts automatically and compose sentences. Of course, the comparison sentence generated from the cloud corpus, on the basis of the sentence trunk words extracted from the sentence to be processed, is a complete independent sentence.

And then comparing the current sentence to be processed with the generated comparison sentence, so as to judge whether the current sentence to be processed is an independent sentence or not, wherein the process is realized by the credibility judging device.

The method specifically comprises the following steps:

inputting the sentence to be processed and the at least one comparison sentence into the credibility judging device;

the credibility judging device outputs a judging result.

The specific decision criteria may be one or a combination of the following,

comparing the lengths of the current sentence to be processed and the generated comparison sentence, and judging whether the length difference is in a first threshold range or not;

performing similarity comparison on the current sentence to be processed and the generated comparison sentence, and judging whether the similarity is within a second threshold range or not;

the length difference is simple and easy to obtain; the similarity comparison can use any text similarity comparison method existing in the prior art, which is not described again here.

If the length difference meets the first threshold range condition and/or the similarity meets the second threshold range condition, the credibility judging device judges that the current sentence to be processed is a complete sentence;

at this time, the current sentence to be processed of the text to be processed is already processed and recognized, and can be used for actual operations (segmentation or uploading, etc.); then, the technical scheme of the application continues to read the characters, and repeats the steps (1-5), namely, reads the next sentence to be processed, and judges whether the complete sentence is formed;

if the length difference does not meet the first threshold range condition and/or the similarity does not meet the second threshold range condition, the current sentence to be processed is not a complete sentence, and at this time, it indicates that more characters belonging to the sentence follow the current sentence to be processed, so the technical scheme of the present application further includes: continuously reading unread characters after the current pause symbol until the next pause symbol is read; the read continuous characters are added into the current sentence to be processed;

thus, the number of characters of the current sentence to be processed is increased, more sentence trunk words can be obtained, and then the steps (2-5) are repeated, so that the judgment of whether the sentence to be processed is a complete sentence can be realized.
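The two comparison criteria and the and/or decision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the application does not name a similarity method, so `difflib.SequenceMatcher` from the Python standard library is swapped in here, and the function name `judge_complete`, the parameter names, and the default threshold values are all hypothetical. The "threshold range" is read as an upper bound on the length difference and a lower bound on the similarity.

```python
from difflib import SequenceMatcher


def judge_complete(
    candidate: str,
    comparisons: list[str],
    max_length_diff: int = 5,      # assumed form of the "first threshold range"
    min_similarity: float = 0.6,   # assumed form of the "second threshold range"
    require_both: bool = False,    # False = "or", True = "and"
) -> bool:
    """Return True if the candidate matches at least one comparison sentence."""
    for reference in comparisons:
        length_ok = abs(len(candidate) - len(reference)) <= max_length_diff
        similarity = SequenceMatcher(None, candidate, reference).ratio()
        similarity_ok = similarity >= min_similarity
        if require_both:
            matched = length_ok and similarity_ok
        else:
            matched = length_ok or similarity_ok
        if matched:
            return True
    return False


if __name__ == "__main__":
    generated = ["Today I submitted the report to the manager."]
    # A short fragment is far from the generated complete sentence.
    print(judge_complete("Today I", generated))
    # A longer candidate is close in both length and wording.
    print(judge_complete("Today I submitted the report to the manager,", generated))
```

When the function returns False, the caller would append the characters up to the next pause symbol to the candidate and judge again, as the text describes.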

Therefore, the technical scheme of the application can be implemented in a computer language with flow-control instructions. The identification and judgment process is an iterative loop that contains a small inner loop for each single sentence to be processed; the inner loop terminates when the current sentence to be processed already forms a complete sentence, after which the next sentence to be processed is identified and judged. When the text to be processed is input paragraph by paragraph, processing terminates when a paragraph end marker is read; when the text to be processed is input as a full text, processing terminates when the full-text end marker is read.

Accordingly, in a second aspect of the present application, there is provided a computer-implemented translation sentence end judgment method for identifying a sentence having complete and independent meaning in a text currently to be processed, the method comprising the steps of:

S1: reading a current unprocessed paragraph of a current text to be processed;

S2: starting to continuously read characters from the first unread character of the current unprocessed paragraph;

S3: judging whether the currently read character is a pause symbol; if yes, going to step S4; otherwise, repeating step S2;

S4: extracting a plurality of sentence trunk words based on a current sentence to be processed formed by the read characters;

S5: outputting at least one comparison sentence according to the plurality of sentence trunk words;

S6: judging whether the current sentence to be processed forms a complete sentence or not based on the comparison of the at least one comparison sentence and the current sentence to be processed;

S7: judging whether the current pause symbol is a full-text end marker; if so, ending the processing; otherwise, entering step S8;

S8: judging whether the current pause symbol is a paragraph end marker; if so, entering step S1; otherwise, entering step S2.
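The control flow of steps S1 to S8 can be sketched as a pair of nested loops. This is a hedged illustration, not the application's implementation: the function `segment`, the set `PAUSE_SYMBOLS`, and the three injected callables are hypothetical names, the end of each string stands in for the paragraph end marker, and the end of the paragraph list stands in for the full-text end marker. The sketch also assumes that a sentence never crosses a paragraph boundary, so any pending text is emitted when a paragraph ends.

```python
from typing import Callable, Iterable

# Assumed pause symbols for this sketch (ASCII and full-width forms).
PAUSE_SYMBOLS = frozenset(".?!,;。？！，；、")


def segment(
    paragraphs: Iterable[str],
    extract: Callable[[str], list[str]],       # S4: trunk-word extraction
    generate: Callable[[list[str]], list[str]],  # S5: comparison sentences
    judge: Callable[[str, list[str]], bool],   # S6: completeness judgment
) -> list[str]:
    sentences: list[str] = []
    for paragraph in paragraphs:                 # S1: next unprocessed paragraph
        pending = ""                             # characters read, not yet accepted
        for index, char in enumerate(paragraph):  # S2: read the next character
            pending += char
            at_paragraph_end = index == len(paragraph) - 1
            if char not in PAUSE_SYMBOLS and not at_paragraph_end:
                continue                         # S3: not a pause symbol, keep reading
            trunk_words = extract(pending)       # S4
            comparisons = generate(trunk_words)  # S5
            if at_paragraph_end or judge(pending, comparisons):  # S6
                sentences.append(pending.strip())
                pending = ""                     # start a new sentence to be processed
            # S8: otherwise keep reading and extend the pending sentence
    return [s for s in sentences if s]           # S7: all paragraphs consumed


if __name__ == "__main__":
    # Stub components, for demonstration only.
    result = segment(
        ["it rained, we stayed home all day, then we cooked dinner together"],
        extract=lambda text: text.split(),
        generate=lambda words: [" ".join(words)],
        judge=lambda candidate, _: len(candidate.split()) >= 4,
    )
    print(result)
```

Passing the three components in as callables keeps the loop independent of how trunk words are extracted or how comparison sentences are generated, which the application leaves to a machine learning engine.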

The step S5 specifically includes: inputting the plurality of sentence trunk words into a machine learning engine based on a cloud corpus, and outputting at least one comparison sentence;

wherein, step S6 includes: comparing the lengths of the current sentence to be processed and at least one comparison sentence, and judging whether the length difference is in a third threshold range or not; and/or comparing the similarity between the current sentence to be processed and at least one comparison sentence, and judging whether the similarity is within a fourth threshold range;

further, if the length difference and/or the similarity are within the corresponding threshold range, judging that the current sentence to be processed forms a complete sentence;

further, the threshold range may be adjustable. A threshold range adjustment module may be provided for adjusting the size of the first to fourth threshold ranges.

In a third aspect of the present application, a computer readable storage medium is provided, on which computer executable instructions are stored, and the executable instructions are executed by a computer memory and a processor, so as to implement the foregoing method for determining the end of a translation sentence according to the present application, where the method is used for identifying a sentence having complete and independent meaning in a text to be processed currently.

The technical scheme of the application at least achieves the following outstanding effects:

recognizing sentences with complete meanings in the text to be processed on the basis of semantics rather than by using punctuation marks as the judgment standard;

the judgment standard is based on large-scale semantic learning, and advanced technology of machine learning is combined;

although automatic article generation technology based on semantic robots belongs to the prior art, the application applies it to translation corpus recognition for the first time; moreover, unlike the prior art, the purpose of the application is not to generate text for its own sake, but to use the generated text as a judgment criterion;

the prior art generates a whole article from given keywords, which requires the output article to be unique and as accurate as possible, whereas the application, starting from a small number of keywords, focuses on the diversity of the output results, so that the judgment is more accurate.

Further embodiments and advantages of the present application will be described in detail in the detailed description.

Drawings

Fig. 1 is a frame diagram of the translation sentence completion judgment system according to the present application.

Fig. 2 is a computer-implemented flowchart of the method of the present application.

Detailed Description of Embodiments

Referring to fig. 1, the system for judging the end of a translation sentence of the present application includes a text importing device, a paragraph identifying device, a sentence identifying device, a semantic combining device, and a credibility judging device.

In this embodiment, the text to be processed is imported into the system through the text importing device; then operating the paragraph identification device;

(5) The credibility judging device outputs a judging result.

Wherein the first unread character of the current paragraph may be a single character, a word, or a punctuation mark that may be used at the beginning of a paragraph or sentence, such as a left single quotation mark ‘, a left double quotation mark “, etc.;

normally, if the text to be processed used punctuation marks strictly according to the rules of punctuation usage, a complete sentence could be formed only upon reading a period, question mark, or exclamation mark; but as mentioned above, texts to be processed in practice do not strictly follow this standard. Therefore, to solve this problem, the application abandons the prior-art approach of judging by sentence-ending symbols: it starts reading from the first unread character of the current paragraph until a pause symbol is read, and the consecutive characters read constitute the sentence to be processed.

The pause symbol here refers to any punctuation mark that, when read, can indicate a pause in a sentence, including the period, question mark, exclamation mark, enumeration comma, comma, quotation marks (right single quotation mark, left single quotation mark), semicolon, and the like; these symbols can cause a temporary pause in the sentence. It can be understood that the dash, the signature mark, brackets, and the like do not cause a pause in the sentence and are not regarded as pause symbols. Although a colon may mark a pause, the portion following the colon is generally considered a continuation of the preceding sentence, so a colon is also not regarded as a pause symbol. In addition, because the technical scheme of the application includes a paragraph identification device, the pause symbols also include the paragraph end marker and the full-text end marker identified by the paragraph identification device.

The above examples are merely illustrative and not exhaustive, and those skilled in the art may pre-establish a set of pause symbols for subsequent query determinations in particular implementations.
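A pre-established pause symbol set and the read-until-pause step can be sketched as below. The contents of `PAUSE_SYMBOLS` and the function name `read_until_pause` are assumptions for illustration; the set follows the examples given above (colon, dash, and brackets are deliberately left out), and reaching the end of the string is treated as the paragraph end marker.

```python
# Assumed pause symbol set: periods, question marks, exclamation marks,
# commas, enumeration commas, semicolons, and closing quotation marks.
PAUSE_SYMBOLS = frozenset(".?!,;。？！，、；’”")


def read_until_pause(paragraph: str, start: int) -> tuple[str, int]:
    """Read from index `start` up to and including the next pause symbol.

    Returns the characters read and the index of the next unread character.
    If no pause symbol remains, the rest of the paragraph is returned.
    """
    for index in range(start, len(paragraph)):
        if paragraph[index] in PAUSE_SYMBOLS:
            return paragraph[start:index + 1], index + 1
    return paragraph[start:], len(paragraph)


if __name__ == "__main__":
    text = "Note: we left early, then it rained."
    first, position = read_until_pause(text, 0)
    print(repr(first), position)   # the colon does not stop the read
    second, position = read_until_pause(text, position)
    print(repr(second), position)
```

Using a set lookup keeps the query step cheap and lets an implementer adjust the symbol set without touching the reading logic.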

Specifically, the sentence to be processed is composed of a number of words, some of which are real words (content words) and some of which are imaginary words (function words). Real words are words having actual meanings, such as "today", "work-down", "estimated", "submit", "line", etc.; imaginary words generally serve to connect or modify, and on their own do not carry actual meaning, such as "then", "and", "the", "should", "does", "such", etc. In natural language processing, related prior art exists for distinguishing real words from imaginary words; the segmentation or recognition standards may differ, but the underlying meaning is consistent, and it is not described again here.

Based on the prior art for distinguishing real words from imaginary words, the application extracts a plurality of sentence trunk words from the sentence to be processed, wherein the sentence trunk words can be the real words in the current sentence to be processed;

next, the plurality of sentence trunk words are input into the semantic combining device, which outputs at least one comparison sentence based on a cloud corpus.
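The trunk-word extraction step described above can be approximated, for English text, by filtering out a list of function words. This is only a rough sketch under stated assumptions: the application relies on existing word-classification techniques that it does not specify, and the names `FUNCTION_WORDS` and `extract_trunk_words` as well as the small word list are hypothetical.

```python
import re

# A tiny, assumed list of imaginary (function) words; a real system would
# use a part-of-speech tagger or a much larger lexicon.
FUNCTION_WORDS = frozenset(
    {"then", "and", "the", "should", "does", "such", "a", "an", "of", "to", "is", "be"}
)


def extract_trunk_words(sentence: str) -> list[str]:
    """Return the words of the sentence that are not in FUNCTION_WORDS."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", sentence.lower())
    return [token for token in tokens if token not in FUNCTION_WORDS]


if __name__ == "__main__":
    print(extract_trunk_words("Today the report should be submitted, and then reviewed."))
```

The resulting list is what would be passed to the semantic combining device; the generation of comparison sentences itself depends on a trained engine and is not sketched here.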

Based on automatic learning from a large-scale corpus, the application can learn texts automatically and compose sentences. Of course, similar machine learning techniques exist in the prior art, such as the robot news writers and automatic article-writing robots realized in recent years. These robots can automatically generate a news draft or an article from several trunk words (keywords, prompt words) entered by a user; the result is comparable to the level of a professional news writer, and readers may not even be able to tell that the article was written by a robot.

The present inventors have found that such machine learning tools are all based on automatic learning from a large corpus, and thus the application may likewise provide a cloud-based corpus for machine learning in order to build a machine learning engine, namely the semantic combining device of the application. The extracted sentence trunk words are input into the semantic combining device, which then outputs at least one comparison sentence based on the cloud corpus, in a manner similar to the robot news writers and automatic article-writing robots described above.

Of course, the application does not need to output a whole news draft or a whole article; it only needs to output complete sentences. The machine learning engine can therefore be simpler and faster, and its output can be several fully independent sentences with complete meanings rather than a single result, which gives a better effect than the existing robot news writers and automatic article-writing robots. This is because the inventors creatively applied such tools to the specific needs of translation.

Based on a large-scale corpus, the comparison sentence generated on the basis of extracting a plurality of sentence trunk words from the sentence to be processed is a complete independent sentence.

The method specifically comprises the following steps:

the credibility judging device outputs a judging result.

The specific decision criteria may be one or a combination of the following,

Referring to fig. 2, a computer-implemented method for judging the end of a translation sentence is provided, and in this embodiment, the method specifically includes steps S1 to S8 of fig. 2.

Specifically, each step performs the following functions:

s1: reading a current unprocessed paragraph of a current text to be processed;

s6: based on the comparison of the at least one comparison sentence and the current sentence to be processed, identifying whether the current sentence to be processed is a complete sentence;

Claims

1. A translation sentence end judging system comprises a text importing device, a paragraph identifying device, a sentence identifying device, a semantic combining device and a credibility judging device; the text importing device imports a text to be processed, and the paragraph identifying device carries out preliminary processing on the imported text to be processed to obtain a paragraph sub-part set taking a paragraph as a unit;

the method is characterized in that:

the sentence identifying means processes the sub-portion set of paragraphs in units of paragraphs,

the specific processing steps comprise:

(2) Extracting a plurality of sentence trunk words from the sentence to be processed;

(3) Inputting the plurality of sentence trunk words into the semantic combination device, wherein the semantic combination device generates a comparison sentence based on the cloud corpus on the basis of the plurality of sentence trunk words extracted from the sentence to be processed, and the comparison sentence is an independent sentence with complete meaning;

(4) Inputting the sentence to be processed and the comparison sentence into the credibility judging device;

the credibility judging device outputs a judging result based on the comparison condition;

the comparison condition includes one or a combination of the following:

comparing the lengths of the current sentence to be processed and the generated comparison sentence, and judging whether the length difference is in a first threshold range or not; and comparing the similarity between the current sentence to be processed and the generated comparison sentence, and judging whether the similarity is within a second threshold range.

2. The translation sentence ending judgment system according to claim 1, wherein: the device also comprises a preset condition setting module which is used for adjusting the range of the preset conditions.

3. A computer-implemented translation sentence ending judgment method, the method comprising the steps of:

s1: reading a current unprocessed paragraph of a current text to be processed;

s2: continuously reading characters from a first unread character of a current unprocessed paragraph;

s6: comparing the lengths of the current sentence to be processed and at least one comparison sentence, and judging whether the length difference is in a third threshold range or not; and/or comparing the similarity between the current sentence to be processed and at least one comparison sentence, and judging whether the similarity is within a fourth threshold range;

if the length difference and/or the similarity is within the corresponding threshold range, identifying that the current sentence to be processed forms a complete sentence;

s8: judging whether the current pause symbol is a paragraph end marker, if so, entering a step S1; otherwise, entering S2; the step S5 specifically includes: inputting the plurality of sentence trunk words into a machine learning engine based on a cloud corpus, and generating comparison sentences on the basis of the plurality of sentence trunk words, wherein the comparison sentences are independent sentences with complete meanings.

4. The computer-implemented translation sentence ending judgment method according to claim 3, wherein: the threshold range is adjustable.

5. A computer-readable storage medium having stored thereon computer-executable instructions, which are executed by a computer memory and a processor, for implementing all the steps of a computer-implemented translation statement end judgment method as claimed in any one of the preceding claims 3 or 4.