CN108763229B

CN108763229B - Machine translation method and device based on characteristic sentence stem extraction

Info

Publication number: CN108763229B
Application number: CN201810544842.1A
Authority: CN
Inventors: 李晶洁; 胡文杰
Original assignee: Donghua University
Current assignee: Donghua University
Priority date: 2018-05-31
Filing date: 2018-05-31
Publication date: 2020-06-12
Anticipated expiration: 2038-05-31
Also published as: CN108763229A

Abstract

The invention relates to a machine translation method and device based on characteristic sentence stem extraction. The method comprises the following steps: 1) acquiring multi-word sequences from a language A corpus and identifying the sequences whose structure meets the sentence stem requirement; 2) determining characteristic sentence stems based on internal adhesion, external boundary independence and the chapter distribution domain, and screening them with a MIN-MAX normalization algorithm and a local maximum deduplication method; 3) translating the characteristic sentence stems to obtain a characteristic sentence stem database; 4) inputting a language A text to be translated, extracting its sentence stems one by one, looking up the sentence stem translations in the characteristic sentence stem database, translating the words outside the sentence stems, and combining those translations with the sentence stem translations according to the word order of the target language B to obtain the translation. The device comprises a characteristic sentence stem database unit, a language input unit, a sentence stem extraction unit, a sentence stem identification unit, a translation unit and a combination unit. The method and device offer high translation efficiency, short processing time and good application prospects.

Description

Machine translation method and device based on characteristic sentence stem extraction

Technical Field

The invention belongs to the field of machine translation, relates to a machine translation method and device based on characteristic sentence stem extraction, and particularly relates to a machine translation method and device based on corpus extraction of characteristic sentence stems.

Background

Machine translation has progressed from early dictionary matching, to rule-based translation that combines dictionaries with linguistic expert knowledge, to corpus-based statistical machine translation. As computing power has improved and multilingual information has grown explosively, machine translation technology has gradually left the ivory tower and now provides real-time, convenient translation services to ordinary users.

Corpus-based machine translation methods have become the main research direction in the field of machine translation. It is in this context that the corpus-driven translation equivalence research advocated by the Sinclair team emerged. Its core idea is that translation equivalence exists between two (or more) languages; that is, the textual environment of a word in corpus L1 is closely related to its translation equivalent in corpus L2. A computer therefore uses the textual context of a word to determine which word in L2 corresponds to each actual occurrence of the word in L1.

The steps for constructing a machine translation model based on this approach are as follows: 1) concordance evidence is retrieved from JDEST using tools such as WordSmith, the formal and semantic features of the characteristic sentence stem are described, and the correspondence between form and function is established; 2) Chinese or target-language translations are searched in a parallel corpus, and the translations with higher frequency are identified as potential equivalents; 3) the potential equivalents are checked in a Chinese or target-language corpus, their formal and functional features are examined, and their degree of correspondence in context is finally established. In this model, a characteristic sentence stem (sentence stem) is a high-frequency, semi-fixed sentence-level sequence that performs discourse-organizing and stance-expressing functions in an academic English corpus; it is a special clause-level phraseological unit that contains a subject-predicate structure and forms the core of a sentence. Extracting such stems has long been a technical difficulty in machine translation, particularly in translation equivalence work.

In recent years, as computing power has improved and corpus resources have grown, phraseological research has deepened and characteristic sentence stem extraction technology has begun to show promise. Existing methods for automatically extracting phraseological units fall mainly into two types: 1) the frequency threshold method, used mainly to generate preliminary candidate sequences, which has low computational complexity but also low recognition precision and recall; 2) the association measure method, whose problem is that, when applied to academic English text, more than half of the extracted multi-word sequences are technical terms or noun phrases and more than 95% fall within a single linguistic structure, while sequences that span structural units, such as sentence stems and especially characteristic sentence stems, are rare. Sentence stems differ from technical terms and noun phrases: their internal association is weaker and their boundaries are harder to determine, so existing term extraction methods cannot be used directly to identify them. Although automatic extraction of phraseological units has made some progress, the above methods only extract simple phrases and cannot meet the practical need to extract sentence stems for machine translation.

Therefore, how to automatically and effectively identify and extract characteristic sentence stems from massive data and then perform machine translation has become an important problem to be solved urgently.

Disclosure of Invention

The invention aims to overcome the low translation quality and low accuracy of cross-language text translation in the prior art, and provides a machine translation method and device based on characteristic sentence stem extraction that extracts characteristic sentence stems accurately, requires little processing, and translates cross-language text with good quality and high accuracy. The invention extracts characteristic sentence stems by exploiting their distinguishing properties and thereby improves the machine translation result.

In order to achieve the purpose, the invention adopts the technical scheme that:

a machine translation method based on characteristic sentence stem extraction comprises the steps of firstly inputting a language A text to be translated, then extracting sentence stems of the language A text one by one, then searching sentence stem translation in a characteristic sentence stem database, simultaneously translating words outside the sentence stems, and finally combining translations of the words outside the sentence stems into the sentence stem translation according to the language order of a target language B to obtain translations;

the characteristic sentence stem database is established by the following steps:

(1) acquiring a multi-word sequence from a language A corpus;

(2) identifying, among the multi-word sequences, the sequences whose structure meets the sentence stem requirement;

(3) determining the characteristic sentence stems among the sequences whose structure meets the sentence stem requirement, based on internal adhesion, external boundary independence and the chapter distribution domain;

(4) screening the characteristic sentence stems based on the MIN-MAX normalization algorithm and the local maximum deduplication method;

(5) translating the screened characteristic sentence stems into the target language B, and recording each characteristic sentence stem together with its translation to obtain the characteristic sentence stem database.

As a preferred technical scheme:

in the machine translation method based on characteristic sentence stem extraction described above, obtaining the multi-word sequences specifically comprises: first acquiring an untagged academic text corpus in language A and part-of-speech tagging the text with tagging software; then linearly segmenting the tagged text into sequences to generate a set of multi-word sequences of 2 to 7 words, and preprocessing the segmented sequences to obtain the multi-word sequences; the preprocessing comprises deleting garbled characters, deleting punctuation within the sequences and counting the frequency of each sequence.
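The segmentation and counting step can be illustrated with a minimal Python sketch. It assumes the corpus has already been part-of-speech tagged and is supplied as lists of `(word, tag)` pairs; the function name `multiword_sequences`, the lower-casing, and the reading of "deleting punctuation" as dropping punctuation tokens before segmentation are assumptions made for illustration, not details given in the text.

```python
import re
from collections import Counter

# A token made up only of non-word characters is treated as punctuation (assumption).
PUNCT = re.compile(r"^\W+$")


def multiword_sequences(tagged_sentences, min_n=2, max_n=7):
    """Linearly segment tagged sentences into 2..7-word sequences and count them."""
    counts = Counter()
    for sentence in tagged_sentences:
        # Drop punctuation tokens and normalise case before segmenting.
        words = [word.lower() for word, _tag in sentence if not PUNCT.match(word)]
        for n in range(min_n, max_n + 1):
            for start in range(len(words) - n + 1):
                counts[tuple(words[start:start + n])] += 1
    return counts


if __name__ == "__main__":
    demo = [[("It", "PPH1"), ("is", "VBZ"), ("shown", "VVN"), (",", "YCOM"), ("that", "CST")]]
    for sequence, frequency in multiword_sequences(demo).items():
        print(" ".join(sequence), frequency)
```

The returned counter maps each candidate sequence to its corpus frequency, which is the input the later statistical steps need.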

In the machine translation method based on characteristic sentence stem extraction described above, language A and target language B are two languages selected from English, Chinese, French, German, Italian and Japanese;

when language A is English, part-of-speech tagging uses the C7 tagset or the tagging software TreeTagger; when language A is Chinese, the tagging software is ICTCLAS; when language A is French, German or Italian, the tagging software is TreeTagger; and when language A is Japanese, the tagging software is MeCab. Language A is tagged with existing tagging software, but the scope of protection of the invention is not limited to the software listed: other tagging software can also be applied. Nor is language A limited to the languages listed: other languages that can be part-of-speech tagged, such as Russian, Portuguese and Spanish, can also be used, with suitable tagging software selected for the tagging.

In the machine translation method based on characteristic sentence stem extraction described above, identifying the sequences whose structure meets the sentence stem requirement specifically comprises: first searching the multi-word sequences for sentence stem sequences with a subject-predicate structure; then handling separately the cases of ellipsis that the subject-predicate collocation patterns do not cover (such as "if permitted"); during extraction of the subject-predicate structure, the positions of the verbs and nouns in the stem are constrained according to the distribution of the parts of speech in each sentence stem. Through these steps, the multi-word sequences that structurally meet the sentence stem requirement are extracted; sentence stem sequences with a subject-predicate structure include a type with a subject and a type without a subject.

In the machine translation method based on characteristic sentence stem extraction described above, determining the characteristic sentence stems among the sequences whose structure meets the sentence stem requirement, based on internal adhesion, external boundary independence and the chapter distribution domain, specifically comprises:

evaluating, from a statistical perspective, how typical each sentence stem is of academic language by combining the internal adhesion, external boundary independence and chapter distribution domain parameters;

the significance of each extracted sentence stem sequence is evaluated with three indexes: calculating the internal adhesion, measuring the boundary independence and setting the chapter distribution domain parameter; the specific steps are as follows:

1) calculating the internal adhesion;

1.1) according to pseudo-bigram transformation theory, performing a pseudo-bigram transformation on the n-word sequence, where $n \ge 2$, so that multi-word sequences become measurable and comparable;

1.2) for the n-word sequence, selecting each of the $n-1$ discrete points in turn without repetition and calculating the attraction $MI_i$ between the two sides of each discrete point, where $MI_i$ represents the internal adhesion of the corresponding split, $1 \le i \le n-1$, and $i$ is a possible discrete point inside the n-word sequence;

1.3) using a probability mean weighting method, calculating the occurrence probability of the MI value of each pseudo-bigram sequence and weighting the MI value by it;

1.4) summing all weighted MI values according to the following formula:

$$MI = P(MI_1)\,MI_1 + P(MI_2)\,MI_2 + P(MI_3)\,MI_3 + \dots + P(MI_{n-1})\,MI_{n-1} = \sum_{i=1}^{n-1} P(MI_i)\,MI_i$$

where $P(MI_i)$ denotes the probability of $MI_i$;

the internal adhesion $MI(W)$ of the n-word sequence, adjusted by the probability mean weighting method, is computed from the following quantities:

$W = \{w_1, w_2, w_3, \dots, w_n\}$; $i$ is a possible discrete point inside the sequence $W$, dividing $W$ into $\{w_1, w_2, \dots, w_i\}$ and $\{w_{i+1}, \dots, w_n\}$, with $1 \le i \le n-1$ and $n \ge 2$; $w_1, w_2, \dots, w_n$ are respectively the 1st to n-th constituent words of the sequence $W$; $P(W)$ is the actually observed probability of occurrence of the sequence $W$; $P(w_1, w_2, \dots, w_i)$ is the actual probability of occurrence of the sequence $\{w_1, w_2, \dots, w_i\}$; $P(w_{i+1}, \dots, w_n)$ is the actual probability of occurrence of the sequence $\{w_{i+1}, \dots, w_n\}$; and $P(w_1, w_2, \dots, w_i)\,P(w_{i+1}, \dots, w_n)$ is the theoretical expected value of the occurrence probability of the sequence $W$ when the discrete point is $i$.

In the probability mean weighting method of step 1.3), the sequence $W$ is converted into $n-1$ pseudo-bigram sequences: $i$ ranges over the $n-1$ possible discrete points in $W$, and each discrete point divides $W$ into the two parts $\{w_1, w_2, \dots, w_i\}$ and $\{w_{i+1}, \dots, w_n\}$, with $1 \le i \le n-1$ and $n \ge 2$, which together form one pseudo-bigram sequence;
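As a rough illustration of steps 1.1) to 1.4), the Python sketch below splits a sequence at each of its n-1 discrete points, scores every split, and sums the weighted scores. Two points are assumptions rather than statements from the text: each split is scored as the base-2 logarithm of the observed probability over the expected probability (the usual mutual information form), and the weights P(MI_i) are taken to be uniform, 1/(n-1). The helper names `count_ngrams` and `internal_adhesion` are hypothetical.

```python
import math
from collections import Counter


def count_ngrams(tokens, max_n=7):
    """Count every contiguous sequence of 1..max_n tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return counts


def internal_adhesion(seq, counts, total_tokens):
    """Weighted sum of the split scores MI_i over all n-1 discrete points."""
    seq = tuple(seq)
    n = len(seq)
    if n < 2:
        raise ValueError("internal adhesion needs at least two words")
    if counts[seq] == 0:
        raise ValueError("sequence does not occur in the corpus")
    p_w = counts[seq] / total_tokens  # observed probability of W
    mi_values = []
    for i in range(1, n):  # discrete point between word i and word i+1
        p_left = counts[seq[:i]] / total_tokens
        p_right = counts[seq[i:]] / total_tokens
        expected = p_left * p_right  # expected probability of W at this split
        mi_values.append(math.log2(p_w / expected))  # assumed MI form
    weights = [1.0 / (n - 1)] * (n - 1)  # assumed uniform P(MI_i)
    return sum(w * mi for w, mi in zip(weights, mi_values))


if __name__ == "__main__":
    tokens = ["a", "b", "a", "b", "c", "d"]
    counts = count_ngrams(tokens)
    print(internal_adhesion(("a", "b"), counts, len(tokens)))
```

A different weighting scheme can be substituted by replacing the `weights` list; the rest of the computation is unaffected.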

2) measuring boundary independence;

the invention uses entropy to measure the boundary independence of sentence stems; the boundary entropy measures how disordered the context at the boundary of a sequence is; the larger the boundary entropy, the greater the uncertainty at the boundary, the higher the independence of the sequence, and the more likely the sequence is to form an independent chunk;

the method comprises the following specific steps:

2.1) for each candidate sentence stem sequence $W$, automatically generating the two sets of left and right boundary collocates: the set $A = \{a_k \mid k \text{ is a positive integer}\}$ of all words appearing at the position immediately to the left of $W$, where $a_k$ is the k-th such word counted from left to right, and the set $B = \{b_k \mid k \text{ is a positive integer}\}$ of all words appearing at the position immediately to the right of $W$, where $b_k$ is the k-th such word counted from left to right;

2.2) calculating the maximum entropy of the left boundary $H(W)_{left}$ and the maximum entropy of the right boundary $H(W)_{right}$ of each sentence stem:

$$H(W)_{left} = -\sum_{a \in A} P(aW \mid W)\,\log P(aW \mid W)$$

$$H(W)_{right} = -\sum_{b \in B} P(Wb \mid W)\,\log P(Wb \mid W)$$

where $P(aW \mid W)$ is the conditional probability that the word $a$ appears at the left boundary of the sequence $W$, and $P(Wb \mid W)$ is the conditional probability that the word $b$ appears at the right boundary of the sequence $W$;

2.3) refining the algorithm of step 2.2) and calculating the final boundary entropy $H(W)$ of the sentence stem from the maximum entropies of the left and right boundaries together with $F(W)$, the total frequency of occurrence of the sequence $W$;
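The left and right boundary entropies of steps 2.1) and 2.2) can be sketched in Python as follows. The sketch assumes Shannon entropy with a base-2 logarithm and estimates the conditional probabilities from the observed neighbours of each occurrence; the function names are hypothetical. The rule that combines the two values with F(W) into the final H(W) is not reproduced in the text, so the sketch stops at the two one-sided entropies.

```python
import math
from collections import Counter


def neighbour_counts(tokens, seq):
    """Collect the words immediately left and right of every occurrence of seq."""
    seq = list(seq)
    n = len(seq)
    left, right = Counter(), Counter()
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == seq:
            if start > 0:
                left[tokens[start - 1]] += 1
            if start + n < len(tokens):
                right[tokens[start + n]] += 1
    return left, right


def entropy(neighbours):
    """Shannon entropy (base 2, assumed) of a neighbour frequency distribution."""
    total = sum(neighbours.values())
    if total == 0:
        return 0.0
    # total equals F(W) except for occurrences at the very edge of the text.
    return -sum((c / total) * math.log2(c / total) for c in neighbours.values())


def boundary_entropies(tokens, seq):
    """Return (H_left, H_right) for one candidate sentence stem."""
    left, right = neighbour_counts(tokens, seq)
    return entropy(left), entropy(right)


if __name__ == "__main__":
    text = "x a b y z a b y q a b w".split()
    print(boundary_entropies(text, ("a", "b")))
```

A stem that is always preceded or followed by the same word gets an entropy of 0 on that side, which signals that the candidate is probably a fragment of a longer unit.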

3) setting chapter distribution domain parameters;

the chapter distribution domain (D) is the number of articles in which a given sentence stem appears; the chapter distribution domain parameter (text dispersion) is added as an evaluation index to ensure that the occurrences of a sentence stem are not overly concentrated and that the stem is accepted by multiple academic authors;

in summary, thresholds on three parameters define the characteristic sentence stem: internal adhesion (MI, threshold 1.8), boundary independence (H, threshold 0.5) and chapter distribution domain (D, threshold 2); when all three attributes of a sentence stem sequence exceed their thresholds, i.e. the internal adhesion MI(W) is greater than 1.8, the final boundary entropy H(W) is greater than 0.5 and the chapter distribution domain D is greater than 2, the sequence is determined to be a characteristic sentence stem and is extracted.
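The threshold test is simple enough to state directly in code. The Python sketch below uses the three thresholds given above (1.8, 0.5 and 2) as defaults; the `StemStats` record and the function name are hypothetical containers introduced only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StemStats:
    """Statistics of one candidate sentence stem (hypothetical record type)."""
    words: tuple
    mi: float  # internal adhesion MI(W)
    h: float   # final boundary entropy H(W)
    d: int     # chapter distribution domain D (number of articles)


def is_characteristic(stats, mi_threshold=1.8, h_threshold=0.5, d_threshold=2):
    """A candidate passes only if all three values strictly exceed their thresholds."""
    return stats.mi > mi_threshold and stats.h > h_threshold and stats.d > d_threshold


if __name__ == "__main__":
    candidates = [
        StemStats(("it", "is", "shown", "that"), mi=3.2, h=1.1, d=14),
        StemStats(("of", "the"), mi=0.9, h=2.4, d=50),
    ]
    print([c.words for c in candidates if is_characteristic(c)])
```

Keeping the thresholds as parameters mirrors the device description, in which the user can set the three values.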

In the machine translation method based on characteristic sentence stem extraction described above, screening the characteristic sentence stems based on the MIN-MAX normalization algorithm and the local maximum deduplication method specifically comprises:

firstly, normalizing the internal adhesion MI(W) and the final boundary entropy H(W) with the MIN-MAX normalization algorithm to obtain the deduplication parameter;

in the deduplication algorithm, the candidate deduplication parameters fall into 3 types: ① MI (the internal adhesion value); ② H (the boundary entropy value); ③ MI × H (the internal adhesion value combined with the boundary entropy value);

with type ③, the internal adhesion and the boundary entropy act on the deduplication parameter jointly and together determine its size;

the invention selects type ③. The MIN-MAX normalization algorithm is applied separately to the internal adhesion values and the boundary entropy values, linearly transforming each so that both lie between 0 and 1. This balances the contribution of the two factors to the final result without changing the internal properties of the data, so that neither factor decides the result merely because its values are larger. The formulas of the MIN-MAX normalization algorithm are as follows:

$$MI_j' = \frac{MI_j - MI_{min}}{MI_{max} - MI_{min}}, \qquad H_j' = \frac{H_j - H_{min}}{H_{max} - H_{min}}$$

where $MI_j'$ is the normalized internal adhesion MI(W), $MI_{max}$ and $MI_{min}$ are the maximum and minimum values of the internal adhesion MI(W), $MI_j$ is the internal adhesion MI(W) of characteristic sentence stem $j$, $H_j'$ is the normalized final boundary entropy H(W), $H_{max}$ and $H_{min}$ are the maximum and minimum values of the final boundary entropy H(W), and $H_j$ is the final boundary entropy H(W) of characteristic sentence stem $j$; multiplying $MI_j'$ by $H_j'$ gives the deduplication parameter GI;
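A minimal Python sketch of the normalization and of the deduplication parameter GI follows. It applies standard min-max scaling to each list of values and multiplies the scaled pairs; returning 0.0 for every item when all values are equal is an assumption added to avoid division by zero, and the function names are hypothetical.

```python
def min_max(values):
    """Linearly rescale values to the range 0..1."""
    low, high = min(values), max(values)
    if high == low:
        # Degenerate case not covered by the text: all values are identical.
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]


def dedup_parameter(mi_values, h_values):
    """GI_j = MI_j' * H_j' for stems listed in the same order in both inputs."""
    return [mi * h for mi, h in zip(min_max(mi_values), min_max(h_values))]


if __name__ == "__main__":
    mi = [2.0, 4.0, 6.0]
    h = [1.0, 3.0, 2.0]
    print(dedup_parameter(mi, h))
```

Because both factors are scaled to the same range before multiplication, a stem needs to score reasonably on both measures to obtain a high GI.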

then, screening the extracted characteristic sentence stems according to a local maximum duplicate elimination method;

local maximum (LocalMaxs) deduplication: each sentence stem is compared only with its (n-1)-word subsequences and its (n+1)-word supersequences, where n is the number of words in the sentence stem, an (n-1)-word subsequence is a sentence stem sequence of length n-1 contained in the sentence stem, and an (n+1)-word supersequence is a sentence stem sequence of length n+1 that contains the sentence stem; all extracted candidate sentence stem sequences are deduplicated by local maximum, deleting the overlapping sequences of different lengths produced by repeated segmentation, so that each extracted characteristic sentence stem is an independent unit that does not overlap with any (n-1)-word or (n+1)-word sequence;

the specific formula of the local maximum deduplication method is as follows:

$$GI(S_n) > GI(S_{n+1}), \quad \text{if } n = 2;$$

$$GI(S_n) \ge GI(S_{n-1}) \lor GI(S_n) > GI(S_{n+1}), \quad \text{if } 2 < n < 7;$$

$$GI(S_n) \ge GI(S_{n-1}), \quad \text{if } n = 7;$$

in the formulas, $S_n$ denotes a characteristic sentence stem containing n words;
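The local maximum test can be sketched in Python as below, assuming the GI scores are held in a dictionary keyed by word tuples. The printed formula joins the two middle-case conditions with ∨; the sketch defaults to requiring both conditions, as in the LocalMaxs algorithm, which is an assumption, and setting `require_both=False` follows the printed disjunction instead. The function names are hypothetical.

```python
def contains(longer, shorter):
    """True if shorter occurs as a contiguous run inside longer."""
    k = len(shorter)
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))


def local_max_filter(gi, min_n=2, max_n=7, require_both=True):
    """Keep the stems whose GI is a local maximum among their n-1 and n+1 neighbours."""
    kept = []
    for seq, score in gi.items():
        n = len(seq)
        sub_scores = [s for other, s in gi.items()
                      if len(other) == n - 1 and contains(seq, other)]
        super_scores = [s for other, s in gi.items()
                        if len(other) == n + 1 and contains(other, seq)]
        beats_subs = all(score >= s for s in sub_scores)     # GI(S_n) >= GI(S_n-1)
        beats_supers = all(score > s for s in super_scores)  # GI(S_n) >  GI(S_n+1)
        if n == min_n:
            keep = beats_supers
        elif n == max_n:
            keep = beats_subs
        elif require_both:
            keep = beats_subs and beats_supers
        else:
            keep = beats_subs or beats_supers
        if keep:
            kept.append(seq)
    return kept


if __name__ == "__main__":
    scores = {
        ("it", "is"): 0.2,
        ("is", "shown"): 0.3,
        ("shown", "that"): 0.4,
        ("it", "is", "shown"): 0.6,
        ("is", "shown", "that"): 0.5,
        ("it", "is", "shown", "that"): 0.9,
    }
    print(local_max_filter(scores))
```

In the demonstration data only the four-word stem survives the default setting, because every shorter candidate is outscored by a sequence that contains it.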

the screening of the extracted characteristic sentence stems is not limited to the local maximum (LocalMaxs) deduplication method; a global maximum (GlobalMaxs) deduplication method can also be applied to the invention, and the choice can be made according to actual requirements.

Global maximum (GlobalMaxs) deduplication: each sentence stem is compared with all of its subsequences and supersequences of 2 to 7 words, where the subsequences are all sentence stem sequences of 2 to 7 words contained in the sentence stem and the supersequences are all sentence stem sequences of 2 to 7 words that contain the sentence stem; the extracted candidate sentence stem sequences are deduplicated by global maximum, deleting the overlapping sequences of different lengths produced by repeated segmentation, so that each extracted characteristic sentence stem is independent and does not overlap with any other sentence stem; the specific formulas are as follows:

$$GI(S_n) > GI(S_{\text{super-string}}), \quad \text{if } n = 2;$$

$$GI(S_n) \ge GI(S_{\text{sub-string}}) \lor GI(S_n) > GI(S_{\text{super-string}}), \quad \text{if } 2 < n < 7;$$

$$GI(S_n) \ge GI(S_{\text{sub-string}}), \quad \text{if } n = 7;$$

in the formulas, $S_n$ denotes a characteristic sentence stem containing n words, $S_{\text{sub-string}}$ denotes a subsequence of $S_n$, and $S_{\text{super-string}}$ denotes a supersequence of $S_n$.

In the machine translation method based on characteristic sentence stem extraction described above, looking up the sentence stem translation in the characteristic sentence stem database means comparing the sentence stem with the characteristic sentence stems in the database; if the sentence stem is identical to one of them, the translation of that characteristic sentence stem is the sentence stem translation. If the sentence stem matches no characteristic sentence stem in the database, the phrases that make up the sentence stem are translated separately and then combined in target-language order to obtain the translation of the sentence stem.
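The lookup with fallback can be sketched in Python as follows. The database is modelled as a dictionary from word tuples to translations; `translate_phrase` and `reorder` are hypothetical callables standing in for the phrase translator and the target-language ordering step, neither of which is specified in the text, and treating each word as one phrase is a simplifying assumption.

```python
def translate_stem(stem, stem_db, translate_phrase, reorder=lambda parts: parts):
    """Return the stored translation of a stem, or fall back to piecewise translation."""
    key = tuple(stem)
    if key in stem_db:
        # Exact match with a characteristic sentence stem in the database.
        return stem_db[key]
    # Fallback: translate each phrase (here, each word) and reorder for the target language.
    parts = [translate_phrase(phrase) for phrase in key]
    return " ".join(reorder(parts))


if __name__ == "__main__":
    database = {("it", "is", "shown", "that"): "TARGET_STEM"}
    print(translate_stem(["it", "is", "shown", "that"], database, str.upper))
    print(translate_stem(["we", "argue"], database, str.upper))
```

In the demonstration, `str.upper` merely stands in for a real phrase translator so that the fallback path is visible.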

The invention also provides a device adopting the machine translation method based on the characteristic sentence stem extraction, which comprises a characteristic sentence stem database unit, a language input unit, a sentence stem extraction unit, a sentence stem identification unit, a translation unit and a combination unit;

the characteristic sentence stem database unit comprises an input subunit, a core processing subunit and a database subunit; the input subunit is used for acquiring the multi-word sequences; the core processing subunit comprises a word segmentation and statistics calculation module, a threshold screening module and a deduplication module; the word segmentation and statistics calculation module mainly comprises a word segmentation function submodule and a statistics calculation submodule, wherein the word segmentation function submodule is used for identifying the sequences whose structure meets the sentence stem requirement, and the statistics calculation submodule is used for calculating the internal adhesion, external boundary independence and chapter distribution domain of those sequences; the threshold screening module is used for extracting the characteristic sentence stems from the sequences whose structure meets the sentence stem requirement; the deduplication module mainly comprises a normalization submodule and a deduplication submodule, wherein the normalization submodule is used for processing the internal adhesion values and boundary entropy values based on the MIN-MAX normalization algorithm, and the deduplication submodule is used for screening the characteristic sentence stems according to the local maximum deduplication method; the database subunit is used for translating the screened characteristic sentence stems into the target language B and recording the characteristic sentence stems and their translations;

the language input unit is used for inputting a language A text to be translated;

the sentence stem extraction unit extracts the sentence stems of the language A text sentence by sentence;

the sentence stem identification unit looks up the sentence stem translations in the characteristic sentence stem database;

the translation unit translates the words outside the sentence stem into a target language B;

the combination unit is used for combining the sentence stem translation and the translation of the words outside the sentence stem to obtain a translation;

the language input unit, the sentence stem extraction unit, the sentence stem identification unit and the combination unit are connected in sequence; the sentence stem extraction unit, the translation unit and the combination unit are connected in sequence; within the characteristic sentence stem database unit, the input subunit, the word segmentation function submodule, the statistics calculation submodule, the threshold screening module, the normalization submodule, the deduplication submodule and the database subunit are connected in sequence; and the database subunit is connected to the sentence stem identification unit.

In the device described above, the language input unit includes a path selection module: the user can freely choose the paths of the input file and the output file as required, and the software automatically creates an extraction output folder under the chosen output path to store the result files. The word segmentation and statistics calculation module is responsible for generating the initial candidate sequence database. In the word segmentation function submodule, the user can set the length range of the sentence stems as required, anywhere between 2 and 7 words; the software then linearly segments the input file according to that range and generates multi-word sequences of the different lengths. In the statistics calculation submodule, the software automatically calculates the internal adhesion MI value and the boundary entropy H value, records the frequency and text position of the left and right adjacent words of each sequence, and stores them in the corresponding files. The threshold screening module comprises a parameter setting and screening unit: once sentence stem extraction and the statistical calculations are finished, the user can set the three parameters, and the software automatically selects all sentence stems within the parameter range. In the normalization submodule, the user can choose whether to normalize the MI and H values (To Normalise), and the product of MI and H is calculated to obtain the deduplication parameter. If normalization is selected, the software linearly transforms the MI and H values with the MIN-MAX normalization method so that both lie between 0 and 1, minimizing the adverse effect of a large difference between the two values on the screening; if normalization is not selected, the software uses the original MI and H values.

The final result page of the result presentation part comprises four parts: a sentence stem display box, located at the top of the interface, which highlights the sentence stem currently selected by the user and its part-of-speech tags; a sentence stem information table, located on the left of the result interface, which displays 7 columns of data, namely the sentence stem, its part-of-speech tags, the value of the deduplication parameter selected by the user, the sentence stem frequency, the mutual information value, the boundary entropy value and the chapter distribution domain value; a text selection drop-down box, located on the right of the interface; and a text display box, located on the right of the result interface, which displays the original text containing the selected sentence stem and the context of each of its occurrences. The output file of the output function part is a processed file in .txt format, stored under the specified path and ordered by processing time.

In the device described above, the parameters set in the parameter setting and screening unit are the thresholds of the internal adhesion MI(W), the final boundary entropy H(W) and the chapter distribution domain D.

The mechanism of the invention is as follows:

the invention first introduces internal adhesion, external boundary independence and the chapter distribution domain to evaluate the identified multi-word sequences and select characteristic sentence stems from them; it then normalizes the internal adhesion values and boundary entropy values of the characteristic sentence stems with the MIN-MAX normalization algorithm, screens the stems with the local maximum deduplication method, translates them to obtain the characteristic sentence stem database, and finally translates language A text on the basis of that database.

Normalization not only preserves the relationships among the original data as far as possible, but also keeps the influence of each parameter on the extraction result balanced. With the method of the invention, under the same operating environment, processing 1 million words takes only 2 minutes and processing 5 million words takes only 12 minutes (computer model: HP 348 G3; processor: Core™ i7-6500U CPU @ 2.50 GHz 2.60 GHz; memory: 8.00 GB; system type: 64-bit operating system). In addition, the device of the invention is flexible and reliable: it performs the calculations according to the parameters entered by the user; the user can select any text path as required, with no fixed path prescribed; the device can run the extraction on the same text any number of times; and if a result file for the same text already exists, the device automatically indicates that the result can be viewed and asks the user whether to overwrite it.

Advantageous effects:

(1) the machine translation method based on the characteristic sentence stem extraction has high translation efficiency, short processing time and great application prospect;

(2) the machine translation device based on the characteristic sentence stem extraction is flexible and reliable, and a user can set parameters and paths according to actual conditions.

Description of the drawings:

FIG. 1 is a flow chart of the establishment of a database of characteristic stems according to the present invention;

FIG. 2 is a schematic diagram of possible discrete points within a sequence of n words (n ≧ 2);

FIG. 3 is a detailed translation flowchart of a machine translation method based on characteristic sentence stem extraction according to the present invention;

FIG. 4 is a structural composition diagram of a machine translation device based on characteristic sentence stem extraction according to the present invention;

where ". x" are possible discrete points.

Detailed Description

The invention will be further illustrated with reference to specific embodiments. It should be understood that these examples are for illustrative purposes only and are not intended to limit the scope of the present invention. Further, it should be understood that various changes or modifications of the present invention may be made by those skilled in the art after reading the teaching of the present invention, and such equivalents may fall within the scope of the present invention as defined in the appended claims.

A machine translation method based on characteristic sentence stem extraction specifically comprises the following steps:

(1) establishing a characteristic sentence stem database, wherein the steps are shown in fig. 1:

1.1) obtaining a multi-word sequence in a language A corpus:

First, an untagged text corpus of language A is acquired and the text is part-of-speech tagged. The tagged text is segmented linearly into sequences to generate a set of multi-word sequences of 2 to 7 words, and the segmented linear sequences are preprocessed to obtain the multi-word sequences. Preprocessing comprises deleting garbled characters, deleting punctuation within the sequences and counting the frequency of each sequence. When language A is English, the text is tagged using the C7 tag set or the tagging software TreeTagger; when language A is Chinese, the tagging software is ICTCCLAS; when language A is French, German or Italian, the tagging software is TreeTagger; when language A is Japanese, the tagging software is Mecab.
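The segmentation and frequency-counting part of step 1.1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, part-of-speech tagging is assumed to have been done elsewhere, and "deleting punctuation within the sequences" is read here as discarding any window that spans a punctuation token, which is only one possible interpretation.

```python
from collections import Counter
import string


def is_punct(token):
    """True if the token consists only of ASCII punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def multiword_sequences(tokens, min_n=2, max_n=7):
    """Count every contiguous run of min_n..max_n tokens (2 to 7 words in the text)."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            window = tuple(tokens[start:start + n])
            # Assumption: a window that spans punctuation is not a valid sequence.
            if any(is_punct(tok) for tok in window):
                continue
            counts[window] += 1
    return counts


tokens = "it is shown that it is shown .".split()
freq = multiword_sequences(tokens)
print(freq[("it", "is", "shown")])
```

The resulting counter supplies the per-sequence frequencies that the later statistical steps rely on.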

1.2) identifying, among the multi-word sequences, the sequences whose structure meets the sentence stem requirement;

First, sentence stem sequences with a subject-predicate structure are searched for in the multi-word sequences. Cases of predicate omission that are not covered by the subject-predicate collocation types (such as "if permitted") are then handled separately. While the subject-predicate structures are extracted, the positions of verbs and nouns in the sentence are constrained according to the distribution of parts of speech in each sentence. These steps yield the multi-word sequences that structurally meet the sentence stem requirement; the sentence stem sequences with a subject-predicate structure comprise a subject type and a non-subject type;

1.3) determining a characteristic sentence stem in a sequence with a structure meeting the sentence stem requirement based on the internal adhesive force, the external boundary independence and the chapter distribution domain, wherein the specific steps are as follows:

1.3.1) calculating the internal adhesion;

1.3.1.1) converting the n-word sequence into a pseudo-bigram sequence according to the pseudo-bigram serialization theory, where n ≥ 2;

1.3.1.2) for the n-word sequence, selecting the n−1 discrete points without repetition and calculating, one by one, the attraction $MI_i$ between the two sides of each discrete point, where $MI_i$ represents the internal adhesion of the partial sequence, 1 ≤ i ≤ n−1, and i is a possible discrete point inside the n-word sequence;

1.3.1.3) calculating the probability of occurrence of the MI value of each pseudo-bigram sequence using a probability mean weighting method, and weighting the MI value by it;

1.3.1.4) summing all weighted MI values, the formula is as follows;

in the formula, $P(MI_i)$ denotes the probability of $MI_i$;

wherein W represents a sequence of n words, $W = \{w_1, w_2, w_3, \ldots, w_n\}$; i is a possible discrete point within the sequence W, dividing W into $w_1, w_2, \ldots, w_i$ and $w_{i+1}, \ldots, w_n$, with 1 ≤ i ≤ n−1 and n ≥ 2; $w_1, w_2, \ldots, w_n$ are the constituent words of the sequence W; $w_1, w_2, \ldots, w_i$ is the first part of the pseudo-bigram sequence divided by discrete point i, and $w_{i+1}, \ldots, w_n$ is the second part; a schematic diagram of the possible discrete points within an n-word sequence (n ≥ 2) is shown in FIG. 2; $P(W)$ is the actual observed value of the occurrence probability of the sequence W, $P(w_1, w_2, \ldots, w_i)$ is the actual occurrence probability of the sequence $\{w_1, w_2, \ldots, w_i\}$, and $P(w_{i+1}, \ldots, w_n)$ is the actual occurrence probability of the sequence $\{w_{i+1}, \ldots, w_n\}$;

the corresponding term for discrete point i is the theoretical expected value of the occurrence probability of the sequence W;
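The formulas of step 1.3.1 are not reproduced in this text, so the following is only a sketch under stated assumptions. It assumes that each $MI_i$ is the base-2 logarithm of the observed probability $P(W)$ divided by the expected probability $P(w_1,\ldots,w_i) \cdot P(w_{i+1},\ldots,w_n)$, and that the weights $P(MI_i)$ are supplied by the caller, defaulting to a uniform $1/(n-1)$. Neither the logarithm base nor the weighting scheme is confirmed by the source; the function name and the probability table are hypothetical.

```python
import math


def internal_adhesion(seq, prob, weights=None):
    """Weighted sum of the split-point MI values of an n-word sequence (n >= 2).

    prob maps a tuple of words to its observed occurrence probability.
    weights holds the n-1 values P(MI_i); uniform weights are an assumption.
    """
    seq = tuple(seq)
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two words")
    mi_values = []
    for i in range(1, n):  # discrete point between word i and word i+1
        expected = prob[seq[:i]] * prob[seq[i:]]  # theoretical expected probability
        mi_values.append(math.log2(prob[seq] / expected))
    if weights is None:
        weights = [1.0 / (n - 1)] * (n - 1)
    return sum(w * m for w, m in zip(weights, mi_values))


# Toy probabilities, invented for illustration only.
prob = {
    ("in",): 0.1,
    ("paper",): 0.02,
    ("in", "this"): 0.01,
    ("this", "paper"): 0.008,
    ("in", "this", "paper"): 0.004,
}
score = internal_adhesion(("in", "this", "paper"), prob)
print(round(score, 4))
```

Passing an explicit `weights` list lets a different probability mean weighting be plugged in without changing the split-point loop.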

1.3.2) measuring boundary independence;

1.3.2.1) for each candidate sentence stem sequence W, automatically generating two sets of left and right boundary collocates: the set of all words appearing at the position immediately to the left of W, $A = \{a_k \mid k \text{ is a positive integer}\}$, where $a_k$ is the k-th word, counted from left to right, appearing at the left adjacent position of W; and the set of all words appearing at the position immediately to the right of W, $B = \{b_k \mid k \text{ is a positive integer}\}$, where $b_k$ is the k-th word, counted from left to right, appearing at the right adjacent position of W;

1.3.2.2) calculating the left boundary maximum entropy $H(W)_{left}$ and the right boundary maximum entropy $H(W)_{right}$ of each sentence stem, the formula is as follows;

1.3.2.3) calculating the final boundary entropy H (W) of the sentence stem, and the formula is as follows:

wherein F (W) represents the total frequency of occurrence of sequence W;
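The entropy formulas of step 1.3.2 are likewise missing from this text. As a hedged illustration of the idea, the sketch below computes the Shannon entropy, in bits, of the words observed on one side of a sequence W. It assumes that each neighbor's relative frequency is its count divided by the total number of observed neighbors; how the left and right values are combined into the final $H(W)$ cannot be recovered from the text and is not attempted. The function name and counts are hypothetical.

```python
import math
from collections import Counter


def boundary_entropy(neighbor_counts):
    """Shannon entropy (bits) of the words adjacent to a sequence on one side."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in neighbor_counts.values()
        if count > 0
    )


# Toy neighbor counts for one candidate sequence W.
left_neighbors = Counter({"as": 2, "and": 2})   # varied left context
right_neighbors = Counter({"that": 4})          # fixed right context
h_left = boundary_entropy(left_neighbors)
h_right = boundary_entropy(right_neighbors)
print(h_left, h_right)
```

A sequence that is always followed by the same word has zero entropy on that side, which suggests that its boundary is not independent there.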

when all three attributes of a sentence stem sequence exceed their thresholds, namely internal adhesion MI(W) > 1.8, final boundary entropy H(W) > 0.5 and chapter distribution domain value > 2, the sequence is determined to be a characteristic sentence stem and is extracted;
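The threshold test can be expressed as a small filter. The sketch below uses the threshold values stated above as defaults and keeps them as parameters, since the device description lets the user set them; the `Candidate` class, its field names and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    words: tuple
    adhesion: float      # internal adhesion MI(W)
    entropy: float       # final boundary entropy H(W)
    distribution: float  # chapter distribution domain value


def is_characteristic(c, mi_min=1.8, h_min=0.5, dist_min=2):
    """All three attributes must be strictly greater than their thresholds."""
    return c.adhesion > mi_min and c.entropy > h_min and c.distribution > dist_min


candidates = [
    Candidate(("it", "is", "shown", "that"), 2.4, 0.9, 5),
    Candidate(("of", "the"), 2.4, 0.5, 5),   # entropy equals the threshold: rejected
    Candidate(("we", "propose"), 1.2, 0.9, 5),
]
selected = [c for c in candidates if is_characteristic(c)]
print([c.words for c in selected])
```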

1.4) screening the characteristic sentence stems based on an MIN-MAX normalization algorithm and a local maximum duplicate elimination method;

1.4.1) normalizing the internal adhesion value and the boundary entropy value based on the MIN-MAX normalization algorithm to obtain a duplicate elimination parameter;

the formula of MIN-MAX normalization algorithm is as follows;
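The formula itself is not reproduced in this text. From the symbol definitions given in claim 1 and the standard form of min-max normalization, it can plausibly be reconstructed as below. This is a reconstruction rather than a copy of the original formula, and it reads the product term in claim 1 as the product of the two normalized values:

```latex
MI_j' = \frac{MI_j - MI_{min}}{MI_{max} - MI_{min}}, \qquad
H_j' = \frac{H_j - H_{min}}{H_{max} - H_{min}}, \qquad
GI = MI_j' \cdot H_j'
```

Here $MI_j$ and $H_j$ are the internal adhesion and final boundary entropy of characteristic sentence stem j, and the max and min subscripts denote the largest and smallest values over all characteristic sentence stems.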

1.4.2) screening the extracted characteristic sentence stems according to a local maximum duplicate elimination method;

the specific formula is as follows:

$$GI(S_n) > GI(S_{n+1}), \quad \text{if } n = 2;$$

$$GI(S_n) \geq GI(S_{n-1}) \lor GI(S_n) > GI(S_{n+1}), \quad \text{if } 2 < n < 7;$$

$$GI(S_n) \geq GI(S_{n-1}), \quad \text{if } n = 7;$$

wherein $GI(S_n)$ represents the duplicate elimination parameter of a characteristic sentence stem containing n words, $GI(S_{n+1})$ represents the duplicate elimination parameter of a characteristic sentence stem containing n+1 words, $GI(S_{n-1})$ represents the duplicate elimination parameter of a characteristic sentence stem containing n−1 words, and $S_n$ represents a characteristic sentence stem containing n words;
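Step 1.4 can be sketched as follows, under several assumptions that the source does not state. The sketch assumes GI is the product of the min-max normalized adhesion and entropy. It takes $S_{n-1}$ and $S_{n+1}$ to be the candidate stems one word shorter that are contained in $S_n$ and one word longer that contain $S_n$. A comparison with no such neighbor is treated as satisfied. All function names and sample values are hypothetical.

```python
def min_max(values):
    """Scale numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous run inside `longer`."""
    m = len(shorter)
    return any(longer[i:i + m] == shorter for i in range(len(longer) - m + 1))


def local_max_filter(gi, min_n=2, max_n=7):
    """Keep the stems whose GI satisfies the local maximum conditions."""
    kept = []
    for stem, score in gi.items():
        n = len(stem)
        shorter = [g for s, g in gi.items() if len(s) == n - 1 and contains(stem, s)]
        longer = [g for s, g in gi.items() if len(s) == n + 1 and contains(s, stem)]
        ge_shorter = all(score >= g for g in shorter)
        gt_longer = all(score > g for g in longer)
        if n == min_n:
            ok = gt_longer
        elif n == max_n:
            ok = ge_shorter
        else:
            ok = ge_shorter or gt_longer  # the text joins the two conditions with "or"
        if ok:
            kept.append(stem)
    return kept


stems = [("it", "is"), ("it", "is", "shown"), ("it", "is", "shown", "that")]
adhesion = min_max([2.0, 3.0, 5.0])
entropy = min_max([1.0, 2.0, 3.0])
gi = {s: a * h for s, a, h in zip(stems, adhesion, entropy)}
kept = local_max_filter(gi)
print(kept)
```

In this toy run the two-word stem is dropped because a longer stem containing it scores higher, while the three-word stem survives through the first branch of the "or" condition.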

1.5) translating the characteristic sentence stems obtained by screening into a target language, and recording each characteristic sentence stem and the translation thereof to obtain a characteristic sentence stem database;

the specific translation process of the present invention is shown in fig. 3, and the specific steps are as described in steps (2) to (5):

(2) inputting a language A text to be translated;

(3) extracting the sentence stems of the language A text sentence by sentence;

(4) searching for the sentence stem translation in the characteristic sentence stem database, which specifically comprises the following steps:

comparing the sentence stem with the characteristic sentence stems in the characteristic sentence stem database; if the sentence stem is identical to a characteristic sentence stem in the database, the translation of that characteristic sentence stem is taken as the sentence stem translation; if the sentence stem matches no characteristic sentence stem in the database, each phrase forming the sentence stem is translated separately, and the phrase translations are then combined according to the word order of the target language B to obtain the sentence stem translation;

(5) translating the words outside the sentence stem, and combining their translations into the sentence stem translation according to the word order of the target language B to obtain the translation.
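The lookup in step (4) amounts to a dictionary match with a phrase-by-phrase fallback. The sketch below shows only that control flow. The database contents, the toy phrase translator and the English-to-French pair are invented for illustration, and the reordering for the target language B is not modelled: the fallback simply joins the phrase translations in source order.

```python
def translate_stem(stem, stem_db, translate_phrase):
    """Return the stored translation of a known stem, else translate phrase by phrase."""
    if stem in stem_db:
        return stem_db[stem]
    # Fallback: translate each phrase separately. Reordering for the target
    # language is a placeholder here (source order is kept).
    return " ".join(translate_phrase(phrase) for phrase in stem)


# Hypothetical characteristic sentence stem database (English -> French).
stem_db = {("it", "is shown that"): "il est montré que"}

# Hypothetical phrase-level translator backed by a toy glossary.
glossary = {"we": "nous", "propose": "proposons"}


def translate_phrase(phrase):
    return glossary.get(phrase, phrase)


hit = translate_stem(("it", "is shown that"), stem_db, translate_phrase)
miss = translate_stem(("we", "propose"), stem_db, translate_phrase)
print(hit)
print(miss)
```

Keeping the phrase translator as a parameter means the fallback path can use whatever word or phrase translation component the translation unit provides.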

The device adopting the machine translation method has the structure shown in fig. 4, and comprises a characteristic sentence stem database unit, a language input unit, a sentence stem extraction unit, a sentence stem identification unit, a translation unit and a combination unit;

the characteristic sentence stem database unit comprises an input subunit, a core processing subunit and a database subunit;

the input subunit is used for acquiring a multi-word sequence and comprises a path selection module for selecting input and output paths;

the core processing subunit comprises a word segmentation and statistical calculation module, a threshold screening module and a duplicate elimination module;

the word segmentation and statistical calculation module is responsible for generating an initial candidate sequence database and mainly comprises a word segmentation function submodule and a statistical calculation submodule, wherein the word segmentation function submodule identifies the sequences whose structure meets the sentence stem requirement, and the statistical calculation submodule calculates the internal adhesion, external boundary independence and chapter distribution domain of those sequences;

the threshold screening module extracts characteristic sentence stems from the sequences whose structure meets the sentence stem requirement and comprises a parameter setting and screening submodule, in which the parameters set are the thresholds of internal adhesion, boundary entropy and chapter distribution domain;

the duplicate elimination module mainly comprises a normalization submodule and a duplicate elimination submodule, wherein the normalization submodule processes the internal adhesion values and boundary entropy values based on the MIN-MAX normalization algorithm, and the duplicate elimination submodule screens the characteristic sentence stems according to the local maximum duplicate elimination method; the database subunit translates the screened characteristic sentence stems into the target language B and records the characteristic sentence stems and their translations;

the sentence stem extraction unit extracts the sentence stems of the language A text sentence by sentence;

the sentence stem identification unit searches for sentence stem translations in the characteristic sentence stem database;

the language input unit, the sentence stem extraction unit, the sentence stem identification unit and the combination unit are connected in sequence; the sentence stem extraction unit, the translation unit and the combination unit are connected in sequence; the characteristic sentence stem database unit is connected with the sentence stem identification unit; within the characteristic sentence stem database unit, the input subunit, the word segmentation function submodule, the statistical calculation submodule, the threshold screening module, the normalization submodule, the duplicate elimination submodule and the database subunit are connected in sequence, and the database subunit is connected with the sentence stem identification unit.

Claims

1. A machine translation method based on characteristic sentence stem extraction is characterized in that: firstly, inputting a language A text to be translated, extracting sentence stems of the language A text one by one, searching sentence stem translations in a characteristic sentence stem database, simultaneously translating words outside the sentence stems, and finally combining translations of the words outside the sentence stems into the sentence stem translations according to the language order of a target language B to obtain translations;

(1) acquiring a multi-word sequence from a language A corpus;

the method for screening the characteristic sentence stems based on the MIN-MAX normalization algorithm and the local maximum duplicate elimination method specifically comprises the following steps:

normalizing the internal adhesion MI(W) and the final boundary entropy H(W) based on the MIN-MAX normalization algorithm to obtain a duplicate elimination parameter, and screening the extracted characteristic sentence stems according to the local maximum duplicate elimination method;

the formula of the MIN-MAX normalization algorithm is as follows;

wherein $MI_j'$ is the normalized internal adhesion MI(W), $MI_{max}$ and $MI_{min}$ are the maximum and minimum values of the internal adhesion MI(W), $MI_j$ is the internal adhesion MI(W) of characteristic sentence stem j, $H_j'$ is the normalized final boundary entropy H(W), $H_{max}$ and $H_{min}$ are the maximum and minimum values of the final boundary entropy H(W), and $H_j$ is the final boundary entropy H(W) of characteristic sentence stem j; $MI_j'$ and $H_j'$ are multiplied to obtain the duplicate elimination parameter GI;

the formula of the local maximum deduplication method is as follows:

2. The machine translation method based on characteristic sentence stem extraction according to claim 1, wherein obtaining the multi-word sequence specifically comprises: first, acquiring an untagged academic text corpus of language A and part-of-speech tagging the text with tagging software; segmenting the tagged text linearly into sequences to generate a set of multi-word sequences of 2 to 7 words, and preprocessing the segmented linear sequences to obtain the multi-word sequences; the preprocessing comprises deleting garbled characters, deleting punctuation within the sequences and counting the frequency of each sequence.

3. The method for machine translation based on the extraction of characteristic stems of sentences according to claim 2, wherein said language A and target language B are selected from two of English, Chinese, French, German, Italian and Japanese;

when language A is English, the part-of-speech tagging uses the C7 tag set or the tagging software TreeTagger; when language A is Chinese, the tagging software is ICTCCLAS; when language A is French, German or Italian, the tagging software is TreeTagger; and when language A is Japanese, the tagging software is Mecab.

4. The machine translation method based on characteristic sentence stem extraction as claimed in claim 1, wherein identifying the sequences whose structure meets the sentence stem requirement specifically comprises: first, searching for sentence stem sequences with a subject-predicate structure in the multi-word sequences; and then handling cases of predicate omission separately, wherein the sentence stem sequences with a subject-predicate structure comprise a subject type and a non-subject type.

5. The method for machine translation based on the extraction of the characteristic sentence stem as claimed in claim 1, wherein the determining the characteristic sentence stem in the sequence with the structure satisfying the sentence stem requirement based on the internal adhesion, the external boundary independence and the chapter distribution domain is specifically as follows:

1) calculating the internal adhesion;

1.1) converting the n-word sequence into a pseudo-bigram sequence according to the pseudo-bigram serialization theory, where n ≥ 2;

1.2) for the n-word sequence W, selecting the n−1 discrete points without repetition and calculating, one by one, the attraction $MI_i$ between the two sides of each discrete point, where $MI_i$ represents the internal adhesion of the partial sequence, 1 ≤ i ≤ n−1, and i is a possible discrete point inside the n-word sequence;

1.3) calculating the probability of occurrence of the $MI_i$ value of each pseudo-bigram sequence using a probability mean weighting method, and weighting the value by it;

1.4) summing all weighted $MI_i$ values, the formula is as follows;

in the formula, $P(MI_i)$ denotes the probability of $MI_i$;

the probability mean value weighting method is adopted to adjust the internal adhesion MI (W) of the n-word sequence, and the calculation formula is as follows:

2) measuring boundary independence;

2.1) for each n-word sequence W, automatically generating two sets of left and right boundary collocates: the set of all words appearing at the position immediately to the left of W, $A = \{a_k \mid k \text{ is a positive integer}\}$, where $a_k$ is the k-th word, counted from left to right, appearing at the left adjacent position of W; and the set of all words appearing at the position immediately to the right of W, $B = \{b_k \mid k \text{ is a positive integer}\}$, where $b_k$ is the k-th word, counted from left to right, appearing at the right adjacent position of W;

2.2) calculating the left boundary maximum entropy $H(W)_{left}$ and the right boundary maximum entropy $H(W)_{right}$ of each n-word sequence, the formula is as follows;

2.3) calculating the final boundary entropy H (W) of the n word sequences, wherein the formula is as follows:

wherein F (W) represents the total frequency of occurrence of sequence W;

3) setting a chapter distribution domain value;

when the three attributes in the n word sequence W are higher than the threshold value, i.e. the internal adhesion MI (W) is greater than 1.8, the final boundary entropy H (W) is greater than 0.5, and the chapter distribution domain value is greater than 2, the sequence is determined as a characteristic sentence stem and extracted.

6. The machine translation method based on characteristic sentence stem extraction of claim 1, wherein the searching for the sentence stem translation in the characteristic sentence stem database is to compare the sentence stem with the characteristic sentence stem in the characteristic sentence stem database, and if the sentence stem is the same as the characteristic sentence stem in the characteristic sentence stem database, the translation of the characteristic sentence stem is the sentence stem translation.

7. The device for adopting the machine translation method based on the characteristic sentence stem extraction as claimed in any one of claims 1 to 6 is characterized in that: the system comprises a characteristic sentence stem database unit, a language input unit, a sentence stem extraction unit, a sentence stem identification unit, a translation unit and a combination unit;

8. The apparatus of claim 7, wherein the input subunit comprises a path selection module for selecting input and output paths, the word segmentation and statistics calculation module is responsible for generating an initial candidate sequence database, and the threshold value filtering module comprises a parameter setting and filtering sub-module.

9. The apparatus of claim 8, wherein the parameters set in the parameter setting and screening submodule are internal adhesion MI (W), final boundary entropy H (W), and threshold value of chapter distribution domain.