CN106708811A - Data processing method and data processing device - Google Patents
- Publication number
- CN106708811A, CN201611178417.2A
- Authority
- CN
- China
- Prior art keywords
- text
- sentence
- translated
- similarity
- web
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/49—Data-driven translation using very large corpora, e.g. the web
Abstract
The invention provides a data processing method and a data processing device. The data processing method includes: acquiring a text to be translated; obtaining, according to the sentence order of the text to be translated and the sentences of the text to be translated, the similarity between the text to be translated and each web text in a first parallel bilingual corpus of a preset translation type; and determining a target web text of the text to be translated according to the similarity between the text to be translated and each web text. Because the similarity between the text to be translated and the web texts is calculated with the sentence as the unit, the accuracy of obtaining the target web text of the text to be translated is improved.
Description
Technical Field
The present invention relates to computer technologies, and in particular, to a method and an apparatus for processing data.
Background
Machine translation is the process of using a computer to translate text in one natural language into another (target) natural language. Its core is bilingual alignment at each level, that is, obtaining, from the web texts in a parallel bilingual corpus, the target web text with the greatest similarity to the text to be translated.
At present, the target web text of a text to be translated is usually obtained by a chapter alignment method. Specifically, feature values of the text to be translated (for example, numbers, punctuation marks, and names) are obtained, and each web text is checked for matching feature values, which yields the similarity between the text to be translated and each web text. The maximum similarity is then selected, and the web text with the maximum similarity is used as the target web text of the text to be translated.
However, such bilingual alignment methods all align chapters as a whole, so the alignment error is large and the translation result is inaccurate.
Disclosure of Invention
The invention provides a data processing method and a data processing device, which are used for solving the problems of large alignment error and inaccurate translation caused by the fact that the whole text is used as an object to perform integral alignment of the text in the conventional chapter alignment method.
In a first aspect, the present invention provides a data processing method, including:
acquiring a text to be translated;
according to the sentence sequence of the text to be translated and the sentences of the text to be translated, obtaining the similarity between the text to be translated and each web text in a first parallel bilingual corpus of a preset translation type;
and determining a target network text of the text to be translated according to the similarity between the text to be translated and each network text.
In a second possible implementation manner of the first aspect, the obtaining, according to the sentence order of the text to be translated and the sentences of the text to be translated, a similarity between the text to be translated and each web text in a parallel bilingual corpus of a preset translation type specifically includes:
acquiring a first sentence from the text to be translated according to the sentence sequence of the text to be translated and a preset sentence comparison type, and acquiring a second sentence corresponding to the first sentence from each web text in the first parallel bilingual corpus; the sentence comparison type is the ratio of the number of sentences included in the first sentence to the number of sentences included in the second sentence;
acquiring a first similarity between the first sentence and a second sentence of each web text;
and determining the similarity between the text to be translated and each web text according to the first similarity between the first sentence and the second sentence of each web text.
With reference to the first implementation manner, in a third possible implementation manner of the first aspect, the obtaining a first similarity between the first sentence and each second sentence of the web text specifically includes:
determining length normalization parameters of the first sentence and each second sentence of the network text according to the character length of the first sentence, the character length of the second sentence and the ratio of the language length of the text to be translated to the language length of the network text;
and determining a first similarity between the first sentence and the second sentence of each network text according to the length normalization parameters of the first sentence and the second sentence of each network text and a preset comparison type.
With reference to the second implementation manner, in a fourth possible implementation manner of the first aspect, the determining, according to the length normalization parameter and a preset comparison type of the first sentence and the second sentence of each network text, a first similarity between the first sentence and the second sentence of each network text specifically includes:
according to the formula $p((l_s, l_t) \mid \mathrm{type}) = p(|X| \ge |(l_s, l_t)|) = 2\,(1 - p(X < |(l_s, l_t)|))$, determining the first similarity $p((l_s, l_t) \mid \mathrm{type})$ between the first sentence and each second sentence;
where $(l_s, l_t)$ is the length normalization parameter of the first sentence and the second sentence, $l_s$ is the character length of the first sentence, $l_t$ is the character length of the second sentence, $r$ is the ratio of the language length of the text to be translated to the language length of the web text, and type is the sentence comparison type.
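The two-tailed probability in the formula above can be sketched in Python as follows. This is a minimal illustration under an assumption that the text here does not state: that X follows a standard normal distribution. The names `standard_normal_cdf`, `first_similarity`, and `delta` (standing for the length normalization parameter $(l_s, l_t)$) are hypothetical.

```python
import math


def standard_normal_cdf(x: float) -> float:
    """CDF of the standard normal distribution (assumed distribution of X)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def first_similarity(delta: float) -> float:
    """Two-tailed probability 2 * (1 - p(X < |delta|)).

    `delta` is the length normalization parameter of a first sentence and a
    second sentence; a value near 0 yields a similarity near 1.
    """
    return 2.0 * (1.0 - standard_normal_cdf(abs(delta)))
```

Under this assumption the similarity depends only on the magnitude of the parameter, so sentence pairs that are too long and too short relative to expectation are penalized symmetrically.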
With reference to the third implementation manner, in a fourth possible implementation manner of the first aspect, the determining a target web text of the text to be translated according to the similarity between the text to be translated and each web text specifically includes:
determining a first web text set of the text to be translated from the first parallel bilingual corpus according to a first preset number and the similarity between the text to be translated and each web text, wherein the first web text set comprises a plurality of first web texts;
acquiring the similarity between each character of the first sentence and each character of the second sentence of each first network text;
obtaining a second similarity between each first sentence and each second sentence of the first network text according to the similarity between each character of the first sentence and each character of each second sentence of the first network text;
and determining the target web text according to the second similarity of the first sentence and the second sentence of each first web text.
With reference to the first implementation manner, in a fifth possible implementation manner of the first aspect, before the obtaining, according to the sentence order of the text to be translated and the sentences of the text to be translated, a similarity between the text to be translated and each web text in a first parallel bilingual corpus of a preset translation type, the method further includes:
performing chapter alignment on the text to be translated and each web text in a preset second parallel bilingual corpus to obtain a third similarity between the text to be translated and each web text in the second parallel bilingual corpus;
and determining the first parallel bilingual corpus from the second parallel bilingual corpus according to the third similarity between the text to be translated and each web text in the second parallel bilingual corpus.
With reference to the third implementation manner, in a sixth possible implementation manner of the first aspect, the determining, according to the character length of the first sentence, the character length of the second sentence, and the ratio of the language length of the text to be translated to the language length of the web text, the length normalization parameter of the first sentence and the length normalization parameter of the second sentence specifically includes:
according to the formula [formula not reproduced in this text], determining the length normalization parameter $(l_s, l_t)$ of the first sentence and the second sentence;
where $\sigma^2$ is the sample variance of the language of the text to be translated and the language of the web text.
With reference to the fifth implementation manner, in a seventh possible implementation manner of the first aspect, the obtaining a second similarity between the first sentence and each second sentence of the first web text according to a similarity between each character of the first sentence and each character of the second sentence of each first web text specifically includes:
according to the formula [formula not reproduced in this text], obtaining a second similarity between the first sentence and the second sentence of each first web text;
wherein s is a character in the first sentence, t is a character corresponding to s in the second sentence, l is the number of characters in the first sentence, m is the number of characters in the second sentence, and m is a constant.
In a second aspect, the present invention provides an apparatus for processing data, comprising:
the acquisition module is used for acquiring a text to be translated;
the first calculation module is used for acquiring the similarity between the text to be translated and each web text in a first parallel bilingual corpus of a preset translation type according to the sentence sequence of the text to be translated and the sentences of the text to be translated;
and the determining module is used for determining a target network text of the text to be translated according to the similarity between the text to be translated and each network text.
Further, the first calculation module comprises: the first acquisition unit, the first calculation unit and the first determination unit:
the first obtaining unit is used for obtaining a first sentence from the text to be translated according to the sentence sequence of the text to be translated and a preset sentence comparison type, and obtaining a second sentence corresponding to the first sentence from each web text in the first parallel bilingual corpus; the sentence comparison type is the ratio of the number of sentences included in the first sentence to the number of sentences included in the second sentence;
the first computing unit is used for acquiring a first similarity between the first sentence and a second sentence of each web text;
the first determining unit is configured to determine, according to a first similarity between the first sentence and a second sentence of each web text, a similarity between the text to be translated and each web text.
Further, the first calculating unit is specifically configured to determine the length normalization parameter of the first sentence and the second sentence of each web text according to the character length of the first sentence, the character length of the second sentence, and the ratio of the language length of the text to be translated to the language length of the web text; and to determine the first similarity between the first sentence and the second sentence of each web text according to that length normalization parameter and the preset sentence comparison type.
Optionally, the first computing unit is further specifically configured to:
according to the formula $p((l_s, l_t) \mid \mathrm{type}) = p(|X| \ge |(l_s, l_t)|) = 2\,(1 - p(X < |(l_s, l_t)|))$, determine the first similarity $p((l_s, l_t) \mid \mathrm{type})$ between the first sentence and each second sentence;
where $(l_s, l_t)$ is the length normalization parameter of the first sentence and the second sentence, $l_s$ is the character length of the first sentence, $l_t$ is the character length of the second sentence, $r$ is the ratio of the language length of the text to be translated to the language length of the web text, and type is the sentence comparison type.
Further, the determining module comprises: a second acquiring unit and a second calculating unit;
the second obtaining unit is configured to determine a first web text set of the text to be translated from the first parallel bilingual corpus according to a first preset number and the similarity between the text to be translated and each web text, where the first web text set includes a plurality of first web texts;
the second computing unit is configured to obtain similarity between each character of the first sentence and each character of a second sentence of each first web text; according to the similarity between each character of the first sentence and each character of the second sentence of each first network text, obtaining a second similarity between the first sentence and the second sentence of each first network text;
the determining module is further configured to determine the target web text according to a second similarity between the first sentence and each second sentence of the first web text.
Further, the device also comprises a second calculation module;
the second calculation module is configured to perform chapter alignment on the text to be translated and each web text in a preset second parallel bilingual corpus before the first calculation module obtains similarity between the text to be translated and each web text in the first parallel bilingual corpus of a preset translation type according to the sentence sequence of the text to be translated and the sentences of the text to be translated, and obtain third similarity between the text to be translated and each web text in the second parallel bilingual corpus; and determining the first parallel bilingual corpus from the second parallel bilingual corpus according to the third similarity between the text to be translated and each web text in the second parallel bilingual corpus.
Optionally, the first computing unit is specifically configured to:
according to the formula [formula not reproduced in this text], determine the length normalization parameter $(l_s, l_t)$ of the first sentence and the second sentence;
where $\sigma^2$ is the sample variance of the language of the text to be translated and the language of the web text.
Optionally, the second calculating module is specifically configured to:
according to the formula [formula not reproduced in this text], obtain a second similarity between the first sentence and the second sentence of each first web text;
wherein s is a character in the first sentence, t is a character corresponding to s in the second sentence, l is the number of characters in the first sentence, m is the number of characters in the second sentence, and m is a constant.
According to the data processing method and device provided by the invention, the similarity between the text to be translated and each web text in the first parallel bilingual corpus of the preset translation type is obtained through the sentence sequence of the text to be translated and the sentences of the text to be translated, and the target web text of the text to be translated is determined according to the similarity between the text to be translated and each web text. In other words, the method of the embodiment performs similarity calculation between the text to be translated and the web text by using the sentence as a unit, thereby improving the accuracy of obtaining the target web text of the text to be translated.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the following briefly introduces the drawings needed to be used in the description of the embodiments or the prior art, and obviously, the drawings in the following description are some embodiments of the present invention, and those skilled in the art can obtain other drawings according to the drawings without inventive labor.
Fig. 1 is a schematic flow chart of a first embodiment of a data processing method provided by the present invention;
Fig. 2 is a schematic flow chart of a second embodiment of a data processing method provided by the present invention;
Fig. 3 is a schematic flow chart of a third embodiment of a data processing method provided by the present invention;
Fig. 4 is a schematic flow chart of a fourth embodiment of a data processing method provided by the present invention;
Fig. 5 is a schematic flow chart of a fifth embodiment of a data processing method provided by the present invention;
Fig. 6 is a schematic structural diagram of a first embodiment of a data processing apparatus provided by the present invention;
Fig. 7 is a schematic structural diagram of a second embodiment of a data processing apparatus provided by the present invention;
Fig. 8 is a schematic structural diagram of a third embodiment of a data processing apparatus provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Parallel corpora (Parallel Texts) refer to text written in different languages that have a translation relationship with each other. Parallel bilingual corpora are collections of text written in two languages that have a translation relationship with each other.
The invention provides a data processing method and device, which are suitable for a parallel bilingual corpus and used for solving the problems of large alignment error and inaccurate translation caused by the fact that the whole text is used as an object to perform integral alignment of the text in the conventional chapter alignment method.
The method provided by the invention aligns the lengths of the sentences by taking the sentences of the text as units, thereby improving the accuracy of obtaining the target network text of the text to be translated.
It should be noted that the terms "first" and "second" in the present embodiment are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 1 is a schematic flow chart of a first embodiment of a data processing method provided by the present invention. The execution subject of this embodiment may be a data processing device, which may be implemented by software and/or hardware and may be integrated in a processor or be a separate processor. This embodiment relates to a specific process in which the processing device obtains the similarity between a text to be translated and each web text according to the sentence order of the text to be translated, and determines a target web text of the text to be translated according to the similarity. As shown in Fig. 1, the method of this embodiment may include:
S101, obtaining a text to be translated.
Specifically, the processing device obtains a text to be translated, where the text to be translated may be temporarily input to the processing device by a user, or may be a text stored in another storage device, and the user instructs the processing device to obtain the text from the storage device through a network or the like.
S102, according to the sentence sequence of the text to be translated and the sentences of the text to be translated, obtaining the similarity between the text to be translated and each web text in a first parallel bilingual corpus of a preset translation type.
Specifically, the web texts in the first parallel bilingual corpus of the preset translation type are all in the target language of the text to be translated. For example, when a Chinese text to be translated needs to be translated into English, each web text in the preset first parallel bilingual corpus is an English text. The processing device calculates the similarity between the text to be translated and each web text in the first parallel bilingual corpus according to the sentence order of the text to be translated, taking the sentence as the unit. Optionally, the processing device may calculate the similarity sentence by sentence. For example, following the sentence order of the text to be translated, the processing device calculates the similarity between the first sentence of the text to be translated and the first sentence of web text A, denoted p1; the similarity between the second sentence of the text to be translated and the second sentence of web text A, denoted p2; and so on, up to the similarity between the 10th sentence of the text to be translated and the 10th sentence of web text A, denoted p10. The processing device may then determine the similarity between the text to be translated and web text A from these 10 similarities, for example by taking their sum or their weighted average. In the same way, the similarity between the text to be translated and each web text in the first parallel bilingual corpus can be obtained.
Building on the above example, the processing device may also calculate the similarity between one sentence in the text to be translated and two sentences in the web text as a unit; optionally, between two sentences in the text to be translated and two sentences in the web text as a unit; and optionally, between a plurality of sentences in the text to be translated and one or more sentences in the web text as a unit.
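The sentence-by-sentence aggregation described above (p1, p2, ... combined by a sum or a weighted average) can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function name `text_similarity`, the `sentence_similarity` callback, and the `weights` parameter are hypothetical, and the per-sentence similarity measure is left to the caller.

```python
from typing import Callable, Optional, Sequence


def text_similarity(
    source_sentences: Sequence[str],
    web_sentences: Sequence[str],
    sentence_similarity: Callable[[str, str], float],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Combine per-sentence similarities p1, p2, ... into one text-level score."""
    # Compare the i-th sentence of the text to be translated with the
    # i-th sentence of the web text, following sentence order.
    scores = [
        sentence_similarity(s, t)
        for s, t in zip(source_sentences, web_sentences)
    ]
    if weights is None:
        return sum(scores)  # plain sum of the similarities
    used = list(weights)[: len(scores)]
    if len(used) != len(scores) or sum(used) == 0:
        raise ValueError("need one weight per compared sentence pair")
    # Weighted average of the similarities.
    return sum(w * p for w, p in zip(used, scores)) / sum(used)
```

Calling the function once per web text in the corpus yields the list of text-level similarities from which the target web text is later chosen.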
Each web text in the first parallel bilingual corpus is normalized, for example by NekoHTML and XPath. NekoHTML is a simple HTML scanner and tag balancer that enables programs to parse HTML documents and access the information in them through standard XML interfaces. It can parse, trim, and clean up HTML documents, automatically close tags, and fix some common errors, so that text can be extracted from HTML documents. XPath is a language for locating information in XML documents; it can be used to traverse the elements and attributes of an XML document to obtain more standardized web texts, which facilitates the subsequent alignment. In this step, the processing device also constructs a dictionary from each web text and the text to be translated for subsequent use; dictionary construction is prior art and is not described here.
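As a rough illustration of this normalization step, the sketch below extracts visible text from a possibly malformed HTML page. It does not use NekoHTML or XPath; it substitutes Python's standard-library `html.parser` purely to show the idea, and the names `TextExtractor` and `normalize_web_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self._skip_depth == 0 and text:
            self._chunks.append(text)

    def text(self) -> str:
        return " ".join(self._chunks)


def normalize_web_text(html: str) -> str:
    """Return the plain text of an HTML page, tolerating unclosed tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The resulting plain text would then be split into sentences before the sentence-level comparison.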
S103, determining a target network text of the text to be translated according to the similarity between the text to be translated and each network text.
Specifically, after the similarity between the text to be translated and each web text is obtained according to the method in S102, the processing device selects the maximum similarity and takes the web text corresponding to it as the target web text of the text to be translated. Optionally, the similarities may be sorted according to the user's requirements to obtain a certain number of target web texts.
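The selection in S103 amounts to ranking the web texts by similarity and keeping the best one, or the best few. A minimal sketch, assuming the similarities are held in a dictionary keyed by a web text identifier (the function name `select_targets` and the `count` parameter are hypothetical):

```python
from typing import Dict, List


def select_targets(similarities: Dict[str, float], count: int = 1) -> List[str]:
    """Return the identifiers of the `count` most similar web texts."""
    # Sort identifiers from highest to lowest similarity.
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return ranked[:count]
```

With `count=1` this returns the single web text with the maximum similarity; a larger `count` corresponds to the optional case of returning several target web texts.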
According to the data processing method provided by the invention, the similarity between the text to be translated and each web text in the first parallel bilingual corpus of the preset translation type is obtained through the sentence sequence of the text to be translated and the sentences of the text to be translated, and the target web text of the text to be translated is determined according to the similarity between the text to be translated and each web text. In other words, the method of the embodiment performs similarity calculation between the text to be translated and the web text by using the sentence as a unit, thereby improving the accuracy of obtaining the target web text of the text to be translated.
Fig. 2 is a schematic flow chart of a second embodiment of a data processing method provided by the present invention. On the basis of the above embodiments, the present embodiment relates to a specific process in which a processing device obtains, according to a sentence order of a text to be translated and a sentence of the text to be translated, a similarity between the text to be translated and each web text in a parallel bilingual corpus of a preset translation type. That is, the above S102 may specifically include:
S201, acquiring a first sentence from the text to be translated according to the sentence sequence of the text to be translated and a preset sentence comparison type, and acquiring a second sentence corresponding to the first sentence from each web text in the first parallel bilingual corpus; the sentence comparison type is a ratio of the number of sentences included in the first sentence to the number of sentences included in the second sentence.
Specifically, the sentence comparison type preset in this embodiment may be input to the processing device by the user according to actual requirements, or may be determined by the processing device according to characteristics of the text to be translated and the web text, for example, when the chapter of the text to be translated is long, the sentence comparison type may be set to be larger, for example, 2:2, that is, two sentences in the text to be translated are aligned with two sentences in the web text.
In the method of this embodiment, the processing device obtains a first sentence from the text to be translated according to the sentence order of the text to be translated and the preset sentence comparison type, and obtains from each web text a second sentence whose position in sentence order corresponds to the first sentence. Assuming that the preset sentence comparison type is 1:2, each sentence in the text to be translated is taken, in sentence order, as a first sentence, and the two corresponding sentences in each web text are taken as the second sentence. For example, the first sentence of the text to be translated is taken as the first sentence, and the first and second sentences of each web text are taken as the second sentence corresponding to it.
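The pairing in S201 can be sketched as a sliding window over both texts. The sketch below assumes that the window advances one sentence at a time, which matches the 1:2 example (sentence 1 with sentences 1 and 2, sentence 2 with sentences 2 and 3); the step size for other comparison types is an assumption, and the names `pair_sentences`, `source_count`, and `target_count` are hypothetical.

```python
from typing import List, Sequence, Tuple


def pair_sentences(
    source: Sequence[str],
    target: Sequence[str],
    source_count: int = 1,
    target_count: int = 2,
) -> List[Tuple[List[str], List[str]]]:
    """Build (first sentence, second sentence) units for a comparison type.

    A comparison type of source_count:target_count groups that many
    consecutive sentences from each text, starting at the same position.
    """
    pairs = []
    i = 0
    while i + source_count <= len(source) and i + target_count <= len(target):
        first = list(source[i : i + source_count])
        second = list(target[i : i + target_count])
        pairs.append((first, second))
        i += 1  # assumed step of one sentence, as in the 1:2 example
    return pairs
```

Each returned pair is one unit for which a first similarity is then computed in S202.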
S202, acquiring first similarity of the first sentence and the second sentence of each web text.
Specifically, the processing device obtains the first similarity between the first sentence selected in the above step and the corresponding second sentence of each web text. The processing device may determine the first similarity according to the number of characters in the first sentence and the number of characters in the second sentence; for example, when the two numbers are equal, the similarity between the first sentence and the second sentence is considered high.
S203, determining the similarity between the text to be translated and each web text according to the first similarity between the first sentence and the second sentence of each web text.
Specifically, the processing device calculates a first similarity between a first sentence in the text to be translated and a second sentence of each web text according to the method in S202, and then, the processing device sums or averages the first similarities between the first sentence in the text to be translated and the second sentence of a certain web text to obtain the similarity between the text to be translated and the web text. According to the method, the similarity between the text to be translated and each web text can be obtained.
To further illustrate the technical solution of the present invention, the following is exemplified:
Referring to the above example, assume that the text to be translated contains 10 sentences and the preset sentence comparison type is 1:2. The processing device takes the first sentence of the text to be translated as the first sentence, takes the first and second sentences of web text A as the second sentence, and obtains the first similarity between them according to the above method, denoted P1. Next, it takes the second sentence of the text to be translated as the new first sentence and the second and third sentences of web text A as the new second sentence, and obtains the first similarity between them, denoted P2. This continues until, finally, the 10th sentence of the text to be translated is taken as the first sentence and the 10th and 11th sentences of web text A as the second sentence, giving the first similarity P10. The processing device then determines the similarity between the text to be translated and web text A from P1, P2, ..., P10, for example by using their sum, or their weighted average, as the similarity. By repeating these steps, the similarity between the text to be translated and each web text can be obtained accurately.
According to the data processing method provided by the invention, the processing device acquires a first sentence from the text to be translated according to the sentence order of the text to be translated and a preset sentence comparison type, acquires the corresponding second sentence from each web text in the first parallel bilingual corpus, calculates the first similarity between the first sentence and the second sentence of each web text, and from these first similarities accurately obtains the similarity between the text to be translated and each web text. This improves the accuracy with which the target web text of the text to be translated is obtained.
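The 1:2 sliding comparison in the example can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names `sliding_pairs` and `text_similarity` are hypothetical, the first-similarity function is passed in by the caller, and a plain (unweighted) mean stands in for the averaging option.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def sliding_pairs(
    src: Sequence[str], tgt: Sequence[str], n_src: int = 1, n_tgt: int = 2
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (first sentence, second sentence) groups for a comparison type n_src:n_tgt.

    Following the example, both windows advance by one sentence per step.
    """
    for i in range(len(src) - n_src + 1):
        second = list(tgt[i:i + n_tgt])
        if len(second) < n_tgt:
            break  # the web text has no more sentences to pair with
        yield list(src[i:i + n_src]), second


def text_similarity(
    src: Sequence[str],
    tgt: Sequence[str],
    first_similarity: Callable[[List[str], List[str]], float],
    n_src: int = 1,
    n_tgt: int = 2,
    average: bool = False,
) -> float:
    """Combine the first similarities P1..Pk by summing or by a plain mean."""
    scores = [first_similarity(a, b) for a, b in sliding_pairs(src, tgt, n_src, n_tgt)]
    if not scores:
        return 0.0
    total = sum(scores)
    return total / len(scores) if average else total
```

With a 10-sentence text to be translated and an 11-sentence web text, `sliding_pairs` produces exactly the ten pairings P1 to P10 described in the example.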
Fig. 3 is a schematic flow chart of a third embodiment of a data processing method provided by the present invention. On the basis of the foregoing embodiment, this embodiment relates to the specific process by which the processing device obtains the first similarity between the first sentence and the second sentence of each web text. That is, the above S202 may specifically include:
S301, determining the length normalization parameter of the first sentence and the second sentence of each web text according to the character length of the first sentence, the character length of the second sentence, and the ratio of the language length of the text to be translated to the language length of the web text.
Specifically, the processing device determines the length normalization parameter of the first sentence and the second sentence of each web text according to the character length of the first sentence, the character length of the second sentence, and the ratio of the language length of the text to be translated to the language length of the web text. For example, assume that the first sentence is "hit him" and the second sentence is "hit, him", that the character length of the first sentence is 6 and the character length of the second sentence is also 6, and that the ratio of the language length of Chinese to the language length of English is 1.6. Based on these parameters, the processing device obtains the length normalization parameter of the first sentence "hit him" and the second sentence "hit, him" by an existing length normalization method (for example, using a normalization function of the R language).
Optionally, the processing device determines the length normalization parameter $(l_s, l_t)$ of the first sentence and the second sentence according to a formula [formula not reproduced], where $\sigma^2$ is the sample variance of the language of the text to be translated and the language of the web text, $l_s$ is the character length of the first sentence, $l_t$ is the character length of the second sentence, and $r$ is the ratio of the language length of the text to be translated to the language length of the web text. In addition, [formula not reproduced], where $n$ is the number of web texts in the first parallel bilingual corpus and [symbol not reproduced] is the average character length of the second sentences of the web texts used.
Referring to the above example, the first sentence "hit him" has a character length of $l_s = 6$, the second sentence "hit, him" has a character length of $l_t = 6$, the ratio of the Chinese to English language length is $r = 1.6$, and the sample variance of Chinese and English is $\sigma^2 = 3.4$. Substituting these parameters into the above formula gives a length normalization parameter of 0.49 for the first sentence "hit him" and the second sentence "hit, him". By repeating these steps, the length normalization parameter of the first sentence and the second sentence of each web text can be obtained.
S302, determining a first similarity between the first sentence and the second sentence of each web text according to the length normalization parameters of the first sentence and the second sentence of each web text and a preset comparison type.
Specifically, the processing device determines a first similarity between the first sentence and the second sentence of each web text according to the obtained length normalization parameter and the preset comparison type of the first sentence and the second sentence of each web file.
Optionally, the processing device determines the first similarity $p((l_s, l_t) \mid \mathrm{type})$ between the first sentence and each second sentence according to the formula

$$p((l_s, l_t) \mid \mathrm{type}) = p\bigl(|X| \ge |(l_s, l_t)|\bigr) = 2\,\bigl(1 - p(X < |(l_s, l_t)|)\bigr),$$

where type is the sentence comparison type.
Referring to the above example, substituting the parameters obtained in the above steps into this formula gives the first similarity [value not reproduced].
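The two-tailed probability in the formula above can be evaluated as in the following Python sketch. It assumes that X follows a standard normal distribution, which the text does not state explicitly, and the function names are hypothetical.

```python
import math


def standard_normal_cdf(x: float) -> float:
    """P(X < x) for a standard normal variable X, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_tailed_probability(delta: float) -> float:
    """p(|X| >= |delta|) = 2 * (1 - P(X < |delta|)), assuming X ~ N(0, 1)."""
    return 2.0 * (1.0 - standard_normal_cdf(abs(delta)))
```

Under that assumption, the length normalization parameter 0.49 from the example gives a first similarity of roughly 0.62, and a parameter of 0 gives the maximum value 1.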
Optionally, before calculating the first similarity between the first sentence and the second sentence of each web text according to the above formula, the processing device may also use a Gaussian formula [formula not reproduced] or a Poisson distribution formula [formula not reproduced] to calculate the similarity between the first sentence and the second sentence of each web text and filter the web texts in the first parallel bilingual corpus once, thereby reducing the amount of calculation performed by the processing device.
According to the data processing method provided by the invention, the processing device determines the length normalization parameters of the first sentence and the second sentence of each network text according to the character length of the first sentence, the character length of the second sentence and the ratio of the language length of the text to be translated to the language length of the network text, and determines the first similarity of the first sentence and the second sentence of each network text according to the length normalization parameters of the first sentence and the second sentence of each network text and a preset comparison type.
Fig. 4 is a flowchart illustrating a fourth embodiment of a data processing method according to the present invention. On the basis of the above embodiments, the present embodiment relates to a specific process in which the processing device determines a target web text of the text to be translated according to the similarity between the text to be translated and each web text. That is, S101 specifically includes:
S401, determining a first web text set of the text to be translated from the first parallel bilingual corpus according to a first preset number and the similarity between the text to be translated and each web text, wherein the first web text set comprises a plurality of first web texts.
Specifically, the processing device sorts the obtained similarities between the text to be translated and each web text (for example, in descending order), acquires from the first parallel bilingual corpus the first web texts with the highest similarities according to a first preset number (for example, the top 10), and takes these 10 first web texts as the first web text set.
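A minimal Python sketch of this selection step, assuming the similarities are held in a dictionary keyed by a web text identifier (the function name and the data layout are illustrative only):

```python
import heapq
from typing import Dict, List


def top_n_web_texts(similarities: Dict[str, float], n: int = 10) -> List[str]:
    """Return the identifiers of the n web texts with the highest similarity,
    ordered from most to least similar."""
    return heapq.nlargest(n, similarities, key=similarities.get)
```

`heapq.nlargest` avoids sorting the whole corpus when only a small first preset number of web texts is needed.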
S402, obtaining the similarity between each character of the first sentence and each character of the second sentence of each first network text.
S403, obtaining a second similarity between the first sentence and each second sentence of the first network text according to the similarity between each character of the first sentence and each character of each second sentence of the first network text.
Specifically, in order to further improve the accuracy of obtaining the target web text, the processing device obtains each character of the second sentence of each first web text in the first web file set, determines whether the character of the second sentence of each first web text is a translation character of the first sentence according to the translation relationship, and further obtains the similarity between each character of the first sentence and each character of the second sentence of each first web text.
Then, the processing device may obtain a second similarity between each first sentence and each second sentence of the first web text according to the obtained similarity between each character of the first sentence and each character of the second sentence of the first web text. For example, the processing device may determine a second similarity of the first sentence to the second sentence of each of the first web texts according to the IBM model.
Optionally, the processing device may also obtain the second similarity between the first sentence and the second sentence of each first web text according to a formula [formula not reproduced];
wherein s is a character in the first sentence, t is a character corresponding to s in the second sentence, l is the number of characters in the first sentence, m is the number of characters in the second sentence, and m is a constant.
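As an illustration of how character-level translation similarities can be combined into a sentence-level score, the following Python sketch uses the standard IBM Model 1 likelihood. This is an assumption: the formula used by the method is not reproduced in this text, and the names `ibm1_score`, `t_table`, `epsilon`, and the `NULL` token are hypothetical.

```python
from typing import Dict, Sequence, Tuple

NULL = "<null>"  # hypothetical empty source token, as used in IBM Model 1


def ibm1_score(
    source: Sequence[str],
    target: Sequence[str],
    t_table: Dict[Tuple[str, str], float],
    epsilon: float = 1.0,
    floor: float = 1e-9,
) -> float:
    """IBM Model 1 style score: epsilon / (l + 1)^m * prod_j sum_i t(t_j | s_i).

    l is the number of characters in the first sentence, m the number in the
    second sentence, and t_table maps (target char, source char) to a
    translation probability. Unseen pairs fall back to a small floor value.
    """
    l, m = len(source), len(target)
    score = epsilon / (l + 1) ** m
    for t_char in target:
        score *= sum(t_table.get((t_char, s_char), floor) for s_char in (NULL, *source))
    return score
```

In practice the product is usually accumulated in log space to avoid underflow on long sentences; the direct product is kept here for readability.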
S404, determining the target web text according to the second similarity of the first sentence and the second sentence of each first web text.
Specifically, the processing device obtains a second similarity between the first sentence and each second sentence of the first web text according to the above steps, and determines a target web text of the text to be translated according to the second similarity.
According to the data processing method provided by the invention, a plurality of first web texts are obtained from a first parallel bilingual corpus according to a first preset number and the similarity between the text to be translated and each web text, the similarity between each character of a first sentence and each character of a second sentence of each first web text is obtained, the second similarity between the first sentence and each second sentence of the first web text is obtained according to the similarity between each character of the first sentence and each character of the second sentence of each first web text, and then the target web text of the text to be translated is determined according to the second similarity between the first sentence and each second sentence of the first web text, so that the accuracy of obtaining the target web text is further improved.
Fig. 5 is a schematic flow chart of a fifth embodiment of a data processing method provided by the present invention. On the basis of the above embodiments, the present embodiment relates to a specific process of determining the first parallel bilingual corpus by the processing device before obtaining the similarity between the text to be translated and each web text in the first parallel bilingual corpus of the preset translation type. That is, before the foregoing S102, the method of this embodiment may further include:
S501, performing chapter alignment on the text to be translated and each web text in a preset second parallel bilingual corpus, and obtaining a third similarity between the text to be translated and each web text in the second parallel bilingual corpus.
In this embodiment, in order to reduce the computational complexity of the processing device, a chapter alignment method is used to filter the web texts in the parallel bilingual corpus once to obtain a web text with a higher matching degree with the text to be translated, and the web text with the higher matching degree is used to perform the similarity calculation in the above steps.
For example, the processing device may obtain the third similarity $\cos(v_1, v_2)$ between the text to be translated and each web text in the second parallel bilingual corpus by using the cosine similarity method:

$$\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert}$$

where $v$ is a vector of feature values for the numbers, punctuation marks, and named entities that commonly appear in documents; chapters are aligned using these feature values.
S502, determining the first parallel bilingual corpus from the second parallel bilingual corpus according to the third similarity between the text to be translated and each web text in the second parallel bilingual corpus.
Then, according to the third similarity between the text to be translated and each web text in the second parallel bilingual corpus, the first parallel bilingual corpus is determined from the second parallel bilingual corpus, for example, 50 web texts with larger third similarities are obtained, and the 50 web texts are used to form the first parallel bilingual corpus.
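Steps S501 and S502 can be sketched in Python as follows. The sketch makes simplifying assumptions: feature vectors count only numbers and punctuation marks (named entities would require a separate recognizer and are left out), and the names `feature_vector`, `cosine_similarity`, and `prefilter` are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def feature_vector(text: str) -> Counter:
    """Count the numbers and punctuation marks in a text (named entities omitted)."""
    return Counter(re.findall(r"\d+|[^\w\s]", text))


def cosine_similarity(v1: Counter, v2: Counter) -> float:
    """cos(v1, v2) = (v1 . v2) / (|v1| * |v2|); returns 0.0 for an empty vector."""
    dot = sum(v1[k] * v2[k] for k in v1.keys() & v2.keys())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def prefilter(query: str, corpus: Dict[str, str], k: int = 50) -> List[str]:
    """Keep the identifiers of the k web texts most similar to the query text."""
    q = feature_vector(query)
    ranked = sorted(
        corpus,
        key=lambda name: cosine_similarity(q, feature_vector(corpus[name])),
        reverse=True,
    )
    return ranked[:k]
```

Numbers and punctuation tend to survive translation unchanged, which is why such features can be compared across languages without a dictionary.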
Further, when the first sentence and the second sentence in this embodiment belong to the same language family (for example, English and French), the method of this embodiment may additionally perform a word-level similarity alignment.
Optionally, the processing device may perform the alignment between words using a formula [formula not reproduced].
The method finds the letters that words in the two sentences have in common and then calculates the similarity between the words according to the Dice similarity.
For example, the following phrases are aligned using the above formula:
whitehouse
|||///
vitahuset
The alignment in terms of 2-grams is as follows: [figure not reproduced]
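The Dice similarity over character 2-grams can be sketched in Python as below. The set-based definition 2|X ∩ Y| / (|X| + |Y|) is the usual form of the Dice coefficient and is assumed here; the function names are hypothetical.

```python
from typing import Set


def char_bigrams(phrase: str) -> Set[str]:
    """Set of character 2-grams of a phrase, ignoring spaces and letter case."""
    letters = phrase.replace(" ", "").lower()
    return {letters[i:i + 2] for i in range(len(letters) - 1)}


def dice_similarity(a: str, b: str) -> float:
    """Dice coefficient 2|X & Y| / (|X| + |Y|) over the character 2-gram sets."""
    x, y = char_bigrams(a), char_bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))
```

For "white house" and "vita huset" the shared 2-grams are "it", "us", and "se", out of 9 and 8 distinct 2-grams respectively, so the score is 6/17, about 0.35.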
The method of the invention requires no manual intervention from the construction of the dictionary to the final alignment and provides an intelligent alignment platform that is not tied to a particular language pair. This automatic alignment method has great practical advantages and greatly reduces the amount of manual work (for example, no manually built dictionary is required).
The method of the embodiment fully utilizes the related technologies in the IBM alignment model, natural language processing and information retrieval to automatically acquire thousands of grades of dictionaries.
The constructed corpus covers a multi-domain parallel corpus, and mainly comprises: news (News), Novels (Novels), law (Laws), Education (Education), scientific terminology (Science), spoken dialogue captions (Speech/Dialog/Subtitle), microblog (Twitter), conference (Parliament).
After the corpus is obtained, the storage format of the files also matters. To suit the construction of different downstream platforms, the corpus can be stored in two formats, with all texts encoded in UTF-8:
(1) plain text format. This storage format is mainly used for training data for machine translation.
(2) Markup text format. The markup-language storage format is mainly divided into two formats, XML and SGML. In addition, to facilitate use with term bases and translation memories, TMX-format text is generated from the plain text files. Marked-up text makes it easy to query other attributes of the text resources, including more detailed content such as the creation time, author, and title of the text, and facilitates database querying and indexing.
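Generating TMX text from aligned plain-text sentence pairs might look like the following Python sketch. The element layout follows the TMX 1.4 convention of `tu`, `tuv`, and `seg` elements; the header attribute values, the default language codes, and the function name `build_tmx` are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Tuple

# Qualified name of the xml:lang attribute; ElementTree serializes it as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(
    pairs: Iterable[Tuple[str, str]], src_lang: str = "zh", tgt_lang: str = "en"
) -> str:
    """Build a minimal TMX document from (source, target) sentence pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-tool",  # placeholder value
        "creationtoolversion": "0.1",    # placeholder value
        "segtype": "sentence",
        "o-tmf": "plain-text",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src, tgt in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src_lang, src), (tgt_lang, tgt)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

The returned string can be written to disk with `open(path, "w", encoding="utf-8")` to match the UTF-8 storage requirement stated above.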
The data processing method provided by the invention filters the web texts in the parallel bilingual corpus once through the alignment of sections and chapters, thereby reducing the computational complexity of the processing device.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Fig. 6 is a schematic structural diagram of a first embodiment of a data processing apparatus according to the present invention. The data processing device of the present embodiment may be a separate processor, or may be integrated into a processor, for example, a processor of a computer or other devices. As shown in fig. 6, the processing apparatus of the present embodiment may include:
the obtaining module 10 is used for obtaining a text to be translated;
the first calculation module 20 is configured to obtain similarity between the text to be translated and each web text in a first parallel bilingual corpus of a preset translation type according to the sentence sequence of the text to be translated and the sentences of the text to be translated;
the determining module 30 is configured to determine a target web text of the text to be translated according to the similarity between the text to be translated and each web text.
The apparatus of this embodiment may be configured to implement the technical solutions of the above-described method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 7 is a schematic structural diagram of a second embodiment of a data processing apparatus according to the present invention. On the basis of the above embodiment, the first calculation module 20 of the present embodiment includes: a first obtaining unit 201 and a first calculating unit 202.
The first obtaining unit 201 is configured to obtain a first sentence from the text to be translated according to the sentence order of the text to be translated and a preset sentence comparison type, and obtain a second sentence corresponding to the first sentence from each web text in the first parallel bilingual corpus; the sentence comparison type is the ratio of the number of sentences included in the first sentence to the number of sentences included in the second sentence;
the first calculating unit 202 is configured to obtain a first similarity between the first sentence and each second sentence of the web text; and determining the similarity between the text to be translated and each web text according to the first similarity between the first sentence and the second sentence of each web text.
Further, the first calculating unit 202 is further specifically configured to determine, according to the character length of the first sentence, the character length of the second sentence, and a ratio of the language length of the text to be translated to the language length of the web text, a length normalization parameter of the second sentence of each of the first sentence and the web text; and determining a first similarity between the first sentence and each second sentence of the web text according to the length normalization parameters of the first sentence and each second sentence of the web text and a preset comparison type.
Optionally, the first calculating unit 202 is further specifically configured to determine the first similarity $p((l_s, l_t) \mid \mathrm{type})$ between the first sentence and each second sentence according to the formula

$$p((l_s, l_t) \mid \mathrm{type}) = p\bigl(|X| \ge |(l_s, l_t)|\bigr) = 2\,\bigl(1 - p(X < |(l_s, l_t)|)\bigr);$$

where $(l_s, l_t)$ is the length normalization parameter of the first sentence and the second sentence, $l_s$ is the character length of the first sentence, $l_t$ is the character length of the second sentence, $r$ is the ratio of the language length of the text to be translated to the language length of the web text, and type is the sentence comparison type.
The apparatus of this embodiment may be configured to implement the technical solutions of the above-described method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 8 is a schematic structural diagram of a third embodiment of a data processing apparatus according to the present invention. On the basis of the above embodiment, the determining module 30 of the present embodiment includes: a second acquisition unit 301, a second calculation unit 302, a determination unit 303.
The second obtaining unit 301 is configured to determine a first web text set of the to-be-translated text from the first parallel bilingual corpus according to a first preset number and a similarity between the to-be-translated text and each web text, where the first web text set includes a plurality of first web texts;
the second calculating unit 302 is configured to obtain a similarity between each character of the first sentence and each character of the second sentence of each first web text; according to the similarity between each character of the first sentence and each character of the second sentence of each first network text, obtaining a second similarity between the first sentence and the second sentence of each first network text;
the determining unit 303 is configured to determine the target web text according to a second similarity between the first sentence and each second sentence of the first web text.
Further, the second computing module 302 is configured to, before the first computing module 20 obtains the similarity between the text to be translated and each web text in the first parallel bilingual corpus of a preset translation type according to the sentence sequence of the text to be translated and the sentences of the text to be translated, perform chapter alignment on the text to be translated and each web text in the preset second parallel bilingual corpus, and obtain a third similarity between the text to be translated and each web text in the second parallel bilingual corpus; and determining the first parallel bilingual corpus from the second parallel bilingual corpus according to the third similarity between the text to be translated and each web text in the second parallel bilingual corpus.
Optionally, the first calculating unit 202 is specifically configured to:
according to a formula [formula not reproduced], determine the length normalization parameter $(l_s, l_t)$ of the first sentence and the second sentence;
where $\sigma^2$ is the sample variance of the language of the text to be translated and the language of the web text.
Optionally, the second calculating module 302 is specifically configured to:
according to a formula [formula not reproduced], obtain the second similarity between the first sentence and the second sentence of each first web text;
wherein s is a character in the first sentence, t is a character corresponding to s in the second sentence, l is the number of characters in the first sentence, m is the number of characters in the second sentence, and m is a constant.
The apparatus of this embodiment may be configured to implement the technical solutions of the above-described method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for processing data, comprising:
acquiring a text to be translated;
according to the sentence sequence of the text to be translated and the sentences of the text to be translated, obtaining the similarity between the text to be translated and each web text in a first parallel bilingual corpus of a preset translation type;
and determining a target network text of the text to be translated according to the similarity between the text to be translated and each network text.
2. The method according to claim 1, wherein the obtaining, according to the sentence order of the text to be translated and the sentences of the text to be translated, the similarity between the text to be translated and each web text in a parallel bilingual corpus of a preset translation type specifically includes:
acquiring a first sentence from the text to be translated according to the sentence sequence of the text to be translated and a preset sentence comparison type, and acquiring a second sentence corresponding to the first sentence from each web text in the first parallel bilingual corpus; the sentence comparison type is the ratio of the number of sentences included in the first sentence to the number of sentences included in the second sentence;
acquiring a first similarity between the first sentence and a second sentence of each web text;
and determining the similarity between the text to be translated and each web text according to the first similarity between the first sentence and the second sentence of each web text.
3. The method according to claim 2, wherein the obtaining a first similarity between the first sentence and the second sentence of each web text specifically includes:
determining length normalization parameters of the first sentence and each second sentence of the network text according to the character length of the first sentence, the character length of the second sentence and the ratio of the language length of the text to be translated to the language length of the network text;
and determining a first similarity between the first sentence and the second sentence of each network text according to the length normalization parameters of the first sentence and the second sentence of each network text and a preset comparison type.
4. The method according to claim 3, wherein the determining a first similarity between the first sentence and the second sentence of each web text according to the length normalization parameter and the preset comparison type of the first sentence and the second sentence of each web text specifically comprises:
determining the first similarity $p((l_s, l_t) \mid \mathrm{type})$ between the first sentence and each second sentence according to the formula

$$p((l_s, l_t) \mid \mathrm{type}) = p\bigl(|X| \ge |(l_s, l_t)|\bigr) = 2\,\bigl(1 - p(X < |(l_s, l_t)|)\bigr);$$

wherein $(l_s, l_t)$ is the length normalization parameter of the first sentence and the second sentence, $l_s$ is the character length of the first sentence, $l_t$ is the character length of the second sentence, $r$ is the ratio of the language length of the text to be translated to the language length of the web text, and type is the sentence comparison type.
5. The method according to claim 4, wherein the determining the target web text of the text to be translated according to the similarity between the text to be translated and each web text specifically comprises:
determining a first web text set of the text to be translated from the first parallel bilingual corpus according to a first preset number and the similarity between the text to be translated and each web text, wherein the first web text set comprises a plurality of first web texts;
acquiring the similarity between each character of the first sentence and each character of the second sentence of each first network text;
obtaining a second similarity between each first sentence and each second sentence of the first network text according to the similarity between each character of the first sentence and each character of each second sentence of the first network text;
and determining the target web text according to the second similarity of the first sentence and the second sentence of each first web text.
6. The method according to claim 1, wherein before obtaining similarity between the text to be translated and each web text in a first parallel bilingual corpus of a preset translation type according to the sentence order of the text to be translated and the sentences of the text to be translated, the method further comprises:
the text to be translated is aligned with each network text in a preset second parallel bilingual corpus to obtain a third similarity between the text to be translated and each network text in the second parallel bilingual corpus;
and determining the first parallel bilingual corpus from the second parallel bilingual corpus according to the third similarity between the text to be translated and each web text in the second parallel bilingual corpus.
7. The method according to claim 3, wherein the determining the length normalization parameters of the first sentence and the second sentence according to the character length of the first sentence, the character length of the second sentence, and the ratio of the language length of the text to be translated to the language length of the web text comprises:
determining the length normalization parameter $(l_s, l_t)$ of the first sentence and the second sentence according to a formula [formula not reproduced];
wherein $\sigma^2$ is the sample variance of the language of the text to be translated and the language of the web text.
8. The method of claim 5, wherein obtaining the second similarity of the first sentence to the second sentence of each first web text according to the similarity of each character of the first sentence to each character of the second sentence of each first web text comprises:
obtaining the second similarity between the first sentence and the second sentence of each first web text according to a formula [formula not reproduced];
wherein s is a character in the first sentence, t is a character corresponding to s in the second sentence, l is the number of characters in the first sentence, m is the number of characters in the second sentence, and m is a constant.
9. An apparatus for processing data, comprising:
the acquisition module is used for acquiring a text to be translated;
the first calculation module is used for acquiring the similarity between the text to be translated and each web text in a first parallel bilingual corpus of a preset translation type according to the sentence sequence of the text to be translated and the sentences of the text to be translated;
and the determining module is used for determining a target network text of the text to be translated according to the similarity between the text to be translated and each network text.
10. The apparatus of claim 9, wherein the first computing module comprises: the first acquisition unit, the first calculation unit and the first determination unit:
the first obtaining unit is used for obtaining a first sentence from the text to be translated according to the sentence sequence of the text to be translated and a preset sentence comparison type, and obtaining a second sentence corresponding to the first sentence from each web text in the first parallel bilingual corpus; the sentence comparison type is the ratio of the number of sentences included in the first sentence to the number of sentences included in the second sentence;
the first computing unit is used for acquiring a first similarity between the first sentence and a second sentence of each web text;
the first determining unit is configured to determine, according to a first similarity between the first sentence and a second sentence of each web text, a similarity between the text to be translated and each web text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611178417.2A CN106708811A (en) | 2016-12-19 | 2016-12-19 | Data processing method and data processing device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106708811A true CN106708811A (en) | 2017-05-24 |
Family
ID=58939195
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611178417.2A Pending CN106708811A (en) | 2016-12-19 | 2016-12-19 | Data processing method and data processing device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106708811A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109830229A (en) * | 2018-12-11 | 2019-05-31 | 平安科技(深圳)有限公司 | Audio corpus intelligence cleaning method, device, storage medium and computer equipment |
CN113408304A (en) * | 2021-06-30 | 2021-09-17 | 北京百度网讯科技有限公司 | Text translation method and device, electronic equipment and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101187924A (en) * | 2007-11-28 | 2008-05-28 | 北京金山软件有限公司 | Method and system for obtaining word pair translation from bilingual sentence |
CN104391842A (en) * | 2014-12-18 | 2015-03-04 | 苏州大学 | Translation model establishing method and system |
CN104750820A (en) * | 2015-04-24 | 2015-07-01 | 中译语通科技(北京)有限公司 | Filtering method and device for corpuses |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Shaalan et al. | NERA: Named entity recognition for Arabic | |
CN105095204B (en) | The acquisition methods and device of synonym | |
US8606826B2 (en) | Augmenting queries with synonyms from synonyms map | |
US7835903B2 (en) | Simplifying query terms with transliteration | |
CN109062912B (en) | Translation quality evaluation method and device | |
KR101500617B1 (en) | Method and system for Context-sensitive Spelling Correction Rules using Korean WordNet | |
Broda et al. | Measuring Readability of Polish Texts: Baseline Experiments. | |
WO2009035863A2 (en) | Mining bilingual dictionaries from monolingual web pages | |
US9600469B2 (en) | Method for detecting grammatical errors, error detection device for same and computer-readable recording medium having method recorded thereon | |
CN104750820A (en) | Filtering method and device for corpuses | |
US20160071511A1 (en) | Method and apparatus of smart text reader for converting web page through text-to-speech | |
Scheible et al. | A gold standard corpus of Early Modern German | |
US20140149106A1 (en) | Categorization Based on Word Distance | |
CN100361124C (en) | System and method for word analysis | |
Mohamed et al. | Annotating and Learning Morphological Segmentation of Egyptian Colloquial Arabic. | |
CN111950301A (en) | English translation quality analysis method and system for Chinese translation and English translation | |
CN114743012B (en) | Text recognition method and device | |
CN106708811A (en) | Data processing method and data processing device | |
Albilali et al. | Constructing arabic reading comprehension datasets: Arabic wikireading and kaiflematha | |
Pinnis et al. | Tilde MT platform for developing client specific MT solutions | |
Nghiem et al. | Using MathML parallel markup corpora for semantic enrichment of mathematical expressions | |
CN106650803A (en) | Method and device for calculating similarity between strings | |
Chiu et al. | Chinese spell checking based on noisy channel model | |
CN116108181A (en) | Client information processing method and device and electronic equipment | |
US11520989B1 (en) | Natural language processing with keywords |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170524 |