WO2014169857A1 - Data processing apparatus, data processing method, and electronic device - Google Patents

Data processing apparatus, data processing method, and electronic device

Info

Publication number
WO2014169857A1
WO2014169857A1 PCT/CN2014/075776 CN2014075776W WO2014169857A1 WO 2014169857 A1 WO2014169857 A1 WO 2014169857A1 CN 2014075776 W CN2014075776 W CN 2014075776W WO 2014169857 A1 WO2014169857 A1 WO 2014169857A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
target language
semantic role
language
predicate
Prior art date
Application number
PCT/CN2014/075776
Other languages
English (en)
French (fr)
Inventor
张姝
孟遥
于浩
Original Assignee
富士通株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 富士通株式会社
Priority to JP2016508001A priority Critical patent/JP2016519370A/ja
Publication of WO2014169857A1 publication Critical patent/WO2014169857A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language

Definitions

  • Data processing apparatus, data processing method, and electronic device
  • The present invention relates to the field of data processing, and more particularly to a data processing apparatus, a data processing method, and an electronic device.
  • Language data is extremely common in people's daily lives and work. For example, e-mail, short messages sent between mobile phones, and the text contained in the various files that people deal with during study and work are all language data.
  • When the language data described above is processed using existing techniques, especially when language data is converted from one mode into another, the accuracy and/or precision of the processing is often low.
  • The present invention provides a data processing apparatus, a data processing method, and an electronic device, to at least address the problem that existing data processing techniques are not highly accurate.
  • A data processing apparatus comprising: a semantic role labeling unit configured to perform semantic role labeling separately on a source language statement and on a plurality of target language statements that are candidate sequence results for its translation, to obtain a source language semantic role sequence and a plurality of target language semantic role sequences; a matching unit configured to obtain, based on a predetermined bilingual corpus, a matching score between the source language semantic role sequence and each of the target language semantic role sequences, wherein the predetermined bilingual corpus includes a plurality of bilingual statement pairs in the source language and the target language that have been labeled with semantic roles; and a sequence result determining unit configured to determine, as the final sequence result, the candidate sequence result corresponding to the target language semantic role sequence with the highest matching score.
  • A data processing method comprising: performing semantic role labeling separately on a source language statement and on a plurality of target language statements that are candidate sequence results for its translation, to obtain a source language semantic role sequence and a plurality of target language semantic role sequences; obtaining, based on a predetermined bilingual corpus, a matching score between the source language semantic role sequence and each of the target language semantic role sequences, wherein the predetermined bilingual corpus includes a plurality of bilingual statement pairs in the source language and the target language that have been labeled with semantic roles; and determining, as the final sequence result, the candidate sequence result corresponding to the target language semantic role sequence with the highest matching score.
  • An electronic device comprising the data processing apparatus described above.
  • A program product storing machine-readable instruction code which, when executed, causes a machine to perform the data processing method described above.
  • For a plurality of target language statements that are candidate sequence results for the translation of a source language statement, the above data processing apparatus, data processing method, and electronic device can use a predetermined bilingual corpus to obtain a plurality of matching scores corresponding to those target language statements, and thereby determine a final sequence result among them. This yields at least one of the following benefits: high accuracy of the processing result; a small amount of computation and fast calculation; and high processing efficiency.
  • FIG. 1 is a block diagram schematically showing an example structure of a data processing apparatus according to an embodiment of the present invention.
  • FIG. 2 is a block diagram schematically showing one possible example structure of the matching unit shown in FIG. 1.
  • FIG. 3 is a flow chart schematically showing an exemplary process of a data processing method according to an embodiment of the present invention.
  • FIG. 4 is a block diagram showing the hardware configuration of a possible information processing device that can be used to implement the data processing apparatus and the data processing method according to an embodiment of the present invention.
  • An embodiment of the present invention provides a data processing apparatus comprising: a semantic role labeling unit configured to perform semantic role labeling separately on a source language statement and on a plurality of target language statements that are candidate sequence results for its translation, to obtain a source language semantic role sequence and a plurality of target language semantic role sequences; a matching unit configured to obtain, based on a predetermined bilingual corpus, a matching score between the source language semantic role sequence and each of the target language semantic role sequences, wherein the predetermined bilingual corpus includes a plurality of bilingual statement pairs in the source language and the target language that have been labeled with semantic roles; and a sequence result determining unit configured to determine, as the final sequence result, the candidate sequence result corresponding to the target language semantic role sequence with the highest matching score.
  • The source language may be, for example, any one of a number of languages such as English, Chinese, German, French, or Japanese, and the target language may be another of those languages that has the same subject-predicate-object structure as the source language.
  • Here, "the same subject-predicate-object structure" is not limited to the order "subject + predicate + object"; it may be another order, such as "subject + object + predicate", provided that the selected source language and target language share the same structure.
  • For example, the source language and the target language may both use the order "subject + predicate + object", or may both use the order "subject + object + predicate".
  • The data processing apparatus can determine which candidate translation best matches the source language statement by measuring how well the order of the semantic roles in each candidate translation matches that of the source language statement. It should be noted that, in the above data processing, the plurality of candidate translations correspond to the plurality of candidate processing results obtained in the process of converting the source language statement from the source language mode to the target language mode.
  • A data processing apparatus 100 includes a semantic role labeling unit 110, a matching unit 120, and a sequence result determining unit 130.
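  • As a rough illustration of how the three units could be wired together, the following Python sketch treats the labeling and matching steps as pluggable callables. The class name, the field names, and the toy labeler and matcher in the usage lines are assumptions made for illustration only; they are not the labeling or matching procedures of the embodiment.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical representation: a semantic role sequence is a list of labels.
RoleSequence = List[str]


@dataclass
class DataProcessingApparatus:
    # Stands in for the semantic role labeling unit 110.
    label: Callable[[str], RoleSequence]
    # Stands in for the matching unit 120 (higher score = better match).
    match: Callable[[RoleSequence, RoleSequence], float]

    def select(self, source: str, candidates: Sequence[str]) -> str:
        """Stands in for the sequence result determining unit 130."""
        source_seq = self.label(source)
        scored = [(self.match(source_seq, self.label(c)), c) for c in candidates]
        _, best = max(scored, key=lambda pair: pair[0])
        return best


# Toy stand-ins: whitespace "labeling" and position-wise agreement "matching".
apparatus = DataProcessingApparatus(
    label=lambda s: s.split(),
    match=lambda s, t: float(sum(a == b for a, b in zip(s, t))),
)
best = apparatus.select("A V B", ["B A V", "A V B"])
```

  • Keeping the labeler and the matcher as parameters mirrors the separation between units 110, 120, and 130, so either step can be replaced without touching the selection logic.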
  • The semantic role labeling unit 110 obtains the source language semantic role sequence of the source language statement by performing semantic role labeling on the source language statement.
  • The semantic role labeling unit 110 further performs semantic role labeling on the plurality of target language statements to obtain a target language semantic role sequence for each of them, that is, to obtain a plurality of target language semantic role sequences.
  • Semantic role labeling techniques based on resources such as FrameNet, PropBank, or NomBank may be employed to perform semantic role labeling on an English sentence (as an example of a target language statement).
  • Semantic role labeling techniques based on resources such as CPB (Chinese Proposition Bank) can be used to perform semantic role labeling on a Chinese sentence (as an example of a source language statement).
  • The source language statement is not limited to a complete sentence (such as "He is the teacher I saw yesterday."); it may also be a sentence component within a complete sentence (such as the component "the teacher I saw yesterday", which has a subject-predicate-object structure).
  • For example, assume that the source language statement is the Chinese statement meaning "the teacher I saw yesterday", and that the target language statement "The teacher I saw yesterday" and the target language statement "I yesterday saw the teacher" are two candidate sequence results for that source language statement. In this example, the source language is Chinese and the target language is English. It should be noted that, in this example, the two candidate sequence results correspond to the candidate processing results obtained in the process of converting the source language statement from the source language mode to the target language mode.
  • The semantic role labeling unit 110 performs semantic role labeling on the source language statement meaning "the teacher I saw yesterday", and can obtain the source language semantic role sequence S "argL^S see argR^S".
  • The source language semantic role sequence S corresponds to the structure subject + predicate + object.
  • In this example, the source language predicate has only one semantic role on its left side and only one semantic role on its right side.
  • In other examples, the source language predicate may have more than one semantic role on its left and/or right side.
  • The semantic role labeling unit 110 performs semantic role labeling on the target language statement "The teacher I saw yesterday", and can obtain the target language semantic role sequence T1 "argL^{T1}_2 argL^{T1}_1 saw", where argL^{T1}_2 is an object and argL^{T1}_1 is a subject.
  • The target language semantic role sequence T1 corresponds to the structure object + subject + predicate.
  • Similarly, for the target language statement "I yesterday saw the teacher", the target language semantic role sequence T2 "argL^{T2}_1 saw argR^{T2}_1" can be obtained, where argL^{T2}_1 is a subject and argR^{T2}_1 is an object.
  • The target language semantic role sequence T2 corresponds to the structure subject + predicate + object.
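  • One way to hold such labeling results in code is to store the predicate together with the roles on its left and on its right. The sketch below is only an assumed representation (the class names `Role` and `RoleSequence` and the `order` helper are hypothetical); it encodes S, T1, and T2 from the example and prints the structure each one corresponds to.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Role:
    label: str     # e.g. "argL^{T1}_2"
    function: str  # grammatical function: "subject" or "object"


@dataclass(frozen=True)
class RoleSequence:
    left: Tuple[Role, ...]        # roles left of the predicate, in sentence order
    predicate: str
    right: Tuple[Role, ...] = ()  # roles right of the predicate, in sentence order

    def order(self) -> str:
        """Describe the structure, e.g. 'subject + predicate + object'."""
        parts = [r.function for r in self.left]
        parts.append("predicate")
        parts.extend(r.function for r in self.right)
        return " + ".join(parts)


S = RoleSequence((Role("argL^S", "subject"),), "see", (Role("argR^S", "object"),))
T1 = RoleSequence(
    (Role("argL^{T1}_2", "object"), Role("argL^{T1}_1", "subject")), "saw"
)
T2 = RoleSequence(
    (Role("argL^{T2}_1", "subject"),), "saw", (Role("argR^{T2}_1", "object"),)
)

for name, seq in (("S", S), ("T1", T1), ("T2", T2)):
    print(name, "->", seq.order())
```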
  • In this way, for a given source language statement, the source language semantic role sequence S of that statement and the target language semantic role sequences T1, T2, ... of its candidate sequence results can be obtained.
  • The matching unit 120 can then obtain, based on the predetermined bilingual corpus, a matching score between the source language semantic role sequence S and each of the plurality of target language semantic role sequences T1, T2, ....
  • The predetermined bilingual corpus comprises a plurality of bilingual statement pairs in the source language and the target language, which have been labeled with semantic roles in advance.
  • The predetermined bilingual corpus may include a general-domain bilingual corpus and/or a domain-specific bilingual corpus.
  • The matching unit 120 may have an example structure as shown in FIG. 2. As shown in FIG. 2, in this implementation, the matching unit 120 may include a correlation degree obtaining subunit 210 and a matching score determination subunit 220.
  • For each target language semantic role sequence, the correlation degree obtaining subunit 210 may use the predetermined bilingual corpus to obtain, for each target language predicate in that sequence, the degree of correlation between at least some subsequences of the target language semantic role sequence and the source language semantic role sequence.
  • For the target language predicate in a target language semantic role sequence Tg, the correlation degree obtaining subunit 210 can obtain any one or more of the following types of degree of correlation:
  • the degree of correlation between the subsequence of Tg that includes only the target language predicate (that is, the target language predicate itself, hereinafter referred to as the first type of subsequence) and the source language semantic role sequence S;
  • the degree of correlation between a subsequence of Tg that includes at least one semantic role located to the left of the target language predicate (hereinafter referred to as the second type of subsequence) and the source language semantic role sequence S;
  • the degree of correlation between a subsequence of Tg that includes the target language predicate and at least one semantic role located to the left of the target language predicate (hereinafter referred to as the third type of subsequence) and the source language semantic role sequence S.
  • The examples below take the source language semantic role sequence S to be "argL^S see argR^S", the target language semantic role sequence T1 to be "argL^{T1}_2 argL^{T1}_1 saw", and the target language semantic role sequence T2 to be "argL^{T2}_1 saw argR^{T2}_1".
  • For T1, the corresponding first type of subsequence can be, for example, "saw".
  • The second type of subsequence can be, for example, any of "argL^{T1}_2", "argL^{T1}_1", and "argL^{T1}_2 argL^{T1}_1".
  • The third type of subsequence can be, for example, any of "argL^{T1}_1 saw", "argL^{T1}_2 saw", and "argL^{T1}_2 argL^{T1}_1 saw".
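  • The three kinds of subsequence listed for T1 can be enumerated mechanically. The following Python sketch assumes that a second-type subsequence is any non-empty, order-preserving selection of the roles to the left of the predicate, which is how the listed examples read; the function name `subsequences` is hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Subseq = Tuple[str, ...]


def subsequences(
    left_roles: Sequence[str], predicate: str
) -> Tuple[List[Subseq], List[Subseq], List[Subseq]]:
    """Return (first-type, second-type, third-type) subsequences."""
    # First type: the predicate on its own.
    first = [(predicate,)]
    # Second type: every non-empty selection of left-side roles, order kept.
    second = [
        combo
        for size in range(1, len(left_roles) + 1)
        for combo in combinations(left_roles, size)
    ]
    # Third type: a second-type subsequence followed by the predicate.
    third = [combo + (predicate,) for combo in second]
    return first, second, third


first, second, third = subsequences(("argL^{T1}_2", "argL^{T1}_1"), "saw")
for kind, items in (("first", first), ("second", second), ("third", third)):
    print(kind, [" ".join(item) for item in items])
```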
  • The degree of correlation between "saw" and "argL^S see argR^S" can serve, for example, as an example of the degree of correlation between the first type of subsequence and the source language semantic role sequence S (hereinafter referred to as "the first type of degree of correlation").
  • The degree of correlation between "saw" and "argL^S see argR^S" can be reflected, for example, by the probability that "saw" and "argL^S see argR^S" appear together in a bilingual statement pair of the predetermined bilingual corpus, or by the probability that "saw" appears in all the English sentences corresponding to all the Chinese sentences containing the "argL^S see argR^S" structure in the predetermined bilingual corpus.
  • For example, suppose the predetermined bilingual corpus includes a bilingual statement pair C1, consisting of a Chinese sentence meaning "I saw a cat" and the English sentence "I saw a cat", and a bilingual statement pair C2, consisting of a Chinese sentence meaning "He saw many books" and the English sentence "He found many books". Because the semantic role sequences obtained by labeling the Chinese sentences meaning "I saw a cat" and "He saw many books" both have the structure "subject + 'see' + object", it can be determined that "subject + 'see' + object", that is, "argL^S see argR^S", appears both in the statement pair C1 and in the statement pair C2.
  • The first type of subsequence "saw" appears in the English sentence "I saw a cat" of the statement pair C1, but does not appear in the English sentence "He found many books" of the statement pair C2.
  • The probability that "saw" appears in all the English sentences corresponding to all the Chinese sentences containing the "argL^S see argR^S" structure in the predetermined bilingual corpus is therefore, for example, 50% (in the case where the predetermined bilingual corpus includes only the two statement pairs C1 and C2).
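  • The 50% figure in the C1/C2 example can be reproduced with a few lines of Python. The corpus layout used here (a dictionary per statement pair with a `source_pattern` and `target_tokens`) and the function name are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def first_type_correlation(
    corpus: List[Dict], source_pattern: Tuple[str, ...], target_predicate: str
) -> float:
    """Share of matching pairs whose English side contains the predicate."""
    # Keep only the statement pairs whose source side has the given structure.
    matching = [p for p in corpus if p["source_pattern"] == source_pattern]
    if not matching:
        return 0.0
    hits = sum(target_predicate in p["target_tokens"] for p in matching)
    return hits / len(matching)


pattern = ("argL", "see", "argR")
corpus = [
    # C1: English side "I saw a cat"
    {"source_pattern": pattern, "target_tokens": ["I", "saw", "a", "cat"]},
    # C2: English side "He found many books"
    {"source_pattern": pattern, "target_tokens": ["He", "found", "many", "books"]},
]
p_saw = first_type_correlation(corpus, pattern, "saw")
print(p_saw)
```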
  • The degree of correlation between any of "argL^{T1}_2", "argL^{T1}_1", and "argL^{T1}_2 argL^{T1}_1" and "argL^S see argR^S" can serve, for example, as an example of the degree of correlation between the second type of subsequence and the source language semantic role sequence S (hereinafter referred to as "the second type of degree of correlation").
  • The second type of degree of correlation may be reflected, for example, by the probability that the second type of subsequence and the source language semantic role sequence S appear together in a bilingual statement pair of the predetermined bilingual corpus, or by the probability that the second type of subsequence appears in all the English sentences corresponding to all the Chinese sentences containing the source language semantic role sequence S in the predetermined bilingual corpus. The probability can be calculated in a manner similar to that described above, and the details are not repeated here.
  • Likewise, the degree of correlation between any of the third type of subsequences and "argL^S see argR^S" can serve as an example of the degree of correlation between the third type of subsequence and the source language semantic role sequence S (hereinafter referred to as "the third type of degree of correlation").
  • The third type of degree of correlation may be reflected, for example, by the probability that the third type of subsequence and the source language semantic role sequence S appear together in a bilingual statement pair of the predetermined bilingual corpus, or by the probability that the third type of subsequence appears in all the English sentences corresponding to all the Chinese sentences containing the source language semantic role sequence S in the predetermined bilingual corpus. The probability can be calculated in a manner similar to that described above, and the details are not repeated here.
  • The degree of correlation between at least two of the first type of subsequence, the second type of subsequence, and the third type of subsequence, taken together, and the source language semantic role sequence S is referred to as "the fourth type of degree of correlation".
  • For example, take the first type of subsequence and the third type of subsequence as the at least two subsequences, and assume that the first type of subsequence is "saw" and the third type of subsequence is "argL^{T1}_2 saw".
  • The degree of correlation between the first and third types of subsequence and the source language semantic role sequence S may then be reflected by the probability that the first type of subsequence "saw", the third type of subsequence "argL^{T1}_2 saw", and the source language semantic role sequence S appear together in a bilingual statement pair of the predetermined bilingual corpus.
  • Alternatively, it may be reflected by the probability that the first type of subsequence "saw" and the third type of subsequence "argL^{T1}_2 saw" appear together in all the English sentences corresponding to all the Chinese sentences containing the source language semantic role sequence S in the predetermined bilingual corpus. The probability can be calculated in a manner similar to that described above, and the details are not repeated here.
  • Similarly, the degree of correlation between the first and second types of subsequence and the source language semantic role sequence S may be reflected by the probability that the first type of subsequence "saw", the second type of subsequence "argL^{T1}_2 argL^{T1}_1", and the source language semantic role sequence S appear together in a bilingual statement pair of the predetermined bilingual corpus, or by the probability that the first type of subsequence "saw" and the second type of subsequence "argL^{T1}_2 argL^{T1}_1" appear together in all the English sentences corresponding to all the Chinese sentences containing the source language semantic role sequence S in the predetermined bilingual corpus.
  • The correlation degree obtaining subunit 210 can obtain any one or more of the first to fourth types of degree of correlation described above. It is not necessary to calculate all of them.
  • The degrees of correlation calculated by the correlation degree obtaining subunit 210 may include several degrees of correlation of the same type; for example, they may include two second-type degrees of correlation (whose corresponding second-type subsequences can be different), and so on.
  • the matching score determination sub-unit 220 can obtain various degrees of correlation obtained by the sub-unit 210 for each target language semantic role sequence based on the degree of correlation (such as any of the first to fourth categories of correlation described above). One or more of the species or a plurality) to determine a matching score between each target language semantic role sequence and the source language semantic role sequence. In one implementation, for each target language semantic role sequence, the match score determination sub-unit 220 may multiply the values of the degree of relevance associated with the target language semantic role sequence with each other, and use the resulting product as the target language. The matching score between the semantic role sequence and the source language semantic role sequence.
  • the match score determination sub-unit 220 may also perform a weighted calculation (eg, weighted summation) on the value of the degree of relevance associated with the target language semantic role sequence. The result obtained is used as a matching score between the target language semantic role sequence and the source language semantic role sequence.
  • a weighted calculation eg, weighted summation
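  • The two ways of combining degrees of correlation into a matching score, a plain product and a weighted sum, can be sketched as follows. The function names and the sample values are hypothetical; the text does not prescribe particular weights.

```python
import math
from typing import Sequence


def product_score(correlations: Sequence[float]) -> float:
    """Matching score as the product of the degrees of correlation."""
    return math.prod(correlations)


def weighted_sum_score(
    correlations: Sequence[float], weights: Sequence[float]
) -> float:
    """Matching score as a weighted sum of the degrees of correlation."""
    if len(correlations) != len(weights):
        raise ValueError("one weight is required per degree of correlation")
    return sum(w * c for w, c in zip(weights, correlations))


correlations = [0.5, 0.4]  # illustrative values only
print(product_score(correlations))
print(weighted_sum_score(correlations, [1.0, 1.0]))
```

  • A product penalizes a candidate heavily when any single degree of correlation is close to zero, whereas a weighted sum lets strong evidence from one subsequence compensate for weak evidence from another.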
  • The matching score determination subunit 220 can obtain the above matching score according to Equation 1.
  • In Equation 1, S represents the source language semantic role sequence, and T represents any one of the plurality of target language semantic role sequences corresponding to the source language semantic role sequence S. V_T is the target language predicate in T, a_i is the i-th semantic role located on the left side of V_T in T, h is the number of semantic roles located on the left side of V_T, b_j is the j-th semantic role located on the right side of V_T in T, and k is the number of semantic roles located on the right side of V_T.
  • P(V_T | S) is the conditional probability indicating the degree of correlation between S and the subsequence {V_T} of T; P(a_1 | V_T, S) is the conditional probability indicating the degree of correlation between S and the subsequences {V_T} and {a_1, V_T} of T; P(a_i | a_{i-1}, V_T, S) is the conditional probability indicating the degree of correlation between S and the subsequences {a_{i-1}, V_T} and {a_i, a_{i-1}, V_T} of T; P(b_1 | V_T, S) is the conditional probability indicating the degree of correlation between S and the subsequences {V_T} and {V_T, b_1} of T; and P(b_j | b_{j-1}, V_T, S) is the conditional probability indicating the degree of correlation between S and the subsequences {V_T, b_{j-1}} and {V_T, b_{j-1}, b_j} of T.
  • P(V_T | S) may, for example, be equal to the probability that the subsequence {V_T} occurs in all the predetermined target language statements corresponding to all the predetermined source language statements that contain the source language semantic role sequence S in the predetermined bilingual corpus (these predetermined target language statements are hereinafter referred to as the predetermined set).
  • P(a_1 | V_T, S) may, for example, be equal to the probability that the subsequence {a_1, V_T} occurs in those predetermined target language statements of the predetermined set in which the subsequence {V_T} occurs.
  • P(a_i | a_{i-1}, V_T, S) may, for example, be equal to the probability that the subsequence {a_i, a_{i-1}, V_T} occurs in those predetermined target language statements of the predetermined set in which the subsequence {a_{i-1}, V_T} occurs.
  • P(b_1 | V_T, S) may, for example, be equal to the probability that the subsequence {V_T, b_1} occurs in those predetermined target language statements of the predetermined set in which the subsequence {V_T} occurs.
  • P(b_j | b_{j-1}, V_T, S) may, for example, be equal to the probability that the subsequence {V_T, b_{j-1}, b_j} occurs in those predetermined target language statements of the predetermined set in which the subsequence {V_T, b_{j-1}} occurs.
  • In Equation 1, the closer a semantic role is to V_T, the smaller its sequence number.
  • For example, a_1 is the semantic role in T that is to the left of V_T and closest to V_T, a_2 is the semantic role in T that is to the left of V_T and second closest to V_T, and so on.
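  • Reading the term definitions above together with the rule that the degrees of correlation may be multiplied to give the matching score, Equation 1 plausibly takes the following form. This is a reconstruction from the surrounding definitions, and the index ranges of the two products in particular are an assumption.

```latex
\mathrm{Score}(S, T) =
  P(V_T \mid S)\,
  P(a_1 \mid V_T, S) \prod_{i=2}^{h} P(a_i \mid a_{i-1}, V_T, S)\,
  P(b_1 \mid V_T, S) \prod_{j=2}^{k} P(b_j \mid b_{j-1}, V_T, S)
```

  • Under this reading, each side of the predicate is scored like a first-order chain that starts at the role nearest to V_T and moves outward; when h or k is zero, the corresponding factors are simply absent.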
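  • The correlation degree obtaining subunit 210 can use maximum likelihood estimation to obtain P(V_T | S), P(a_1 | V_T, S), P(a_i | a_{i-1}, V_T, S), P(b_1 | V_T, S), and P(b_j | b_{j-1}, V_T, S) according to Equations 2 to 6, respectively.
  • Equation 2: $$P(V_T \mid S) = \frac{c(V_T, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}{c(a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}$$
  • Equation 3: $$P(a_1 \mid V_T, S) = \frac{c(a_1, V_T, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}{c(V_T, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}$$
  • Equation 4: $$P(a_i \mid a_{i-1}, V_T, S) = \frac{c(a_i, a_{i-1}, V_T, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}{c(a_{i-1}, V_T, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}$$
  • Equation 5: $$P(b_1 \mid V_T, S) = \frac{c(V_T, b_1, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}{c(V_T, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}$$
  • Equation 6: $$P(b_j \mid b_{j-1}, V_T, S) = \frac{c(V_T, b_{j-1}, b_j, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}{c(V_T, b_{j-1}, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}$$
  • Here V_S is the source language predicate in S, a'_i is the i-th semantic role to the left of V_S in S, and b'_j is the j-th semantic role to the right of V_S in S, so that the sequence {a'_{h'}, ..., a'_1, V_S, b'_1, ..., b'_{k'}} is the source language semantic role sequence S.
  • c(V_T, a'_{h'}, ..., a'_1, V_S, b'_1, ..., b'_{k'}) denotes the number of occurrences of the sequence {V_T} in all the predetermined target language statements of the bilingual statement pairs to which all the predetermined source language statements containing the source language semantic role sequence S (that is, {a'_{h'}, ..., a'_1, V_S, b'_1, ..., b'_{k'}}) belong. All the bilingual statement pairs to which the predetermined source language statements containing S belong are referred to as the statement pairs to be counted. c(a'_{h'}, ..., a'_1, V_S, b'_1, ..., b'_{k'}) denotes the number of all predetermined source language statements containing the sequence S, and c(a_1, V_T, a'_{h'}, ..., a'_1, V_S, b'_1, ..., b'_{k'}) denotes the number of occurrences of the sequence {a_1, V_T} in the predetermined target language statements of the statement pairs to be counted.
  • The remaining counts are defined in the same way; for example, c(V_T, b_{j-1}, b_j, a'_{h'}, ..., a'_1, V_S, b'_1, ..., b'_{k'}) denotes the number of occurrences of the sequence {V_T, b_{j-1}, b_j} in the predetermined target language statements of the statement pairs to be counted.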
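  • The matching score determination subunit 220 may also obtain the above matching score according to Equation 7.
  • Equation 7: $$\mathrm{Score}(S, T) = P(V_T \mid S)\, P(a_1 \mid V_T, S)\, P(a_2 \mid a_1, V_T, S) \prod_{i=3}^{h} P(a_i \mid a_{i-1}, a_{i-2}, V_T, S)\; P(b_1 \mid V_T, S)\, P(b_2 \mid b_1, V_T, S) \prod_{j=3}^{k} P(b_j \mid b_{j-1}, b_{j-2}, V_T, S)$$
  • In Equation 7, P(a_i | a_{i-1}, a_{i-2}, V_T, S) is the conditional probability indicating the degree of correlation between S and the subsequences {a_{i-1}, a_{i-2}, V_T} and {a_i, a_{i-1}, a_{i-2}, V_T} of T, and P(b_j | b_{j-1}, b_{j-2}, V_T, S) is the conditional probability indicating the degree of correlation between S and the subsequences {V_T, b_{j-2}, b_{j-1}} and {V_T, b_{j-2}, b_{j-1}, b_j} of T.
  • P(V_T | S) in Equation 7 can be calculated, for example, according to Equation 2; P(a_1 | V_T, S) can be calculated, for example, according to Equation 3; P(a_2 | a_1, V_T, S) can be calculated, for example, according to Equation 4; P(b_1 | V_T, S) can be calculated, for example, according to Equation 5; and P(b_2 | b_1, V_T, S) can be calculated, for example, according to Equation 6.
  • Equation 8: $$P(a_i \mid a_{i-1}, a_{i-2}, V_T, S) = \frac{c(a_i, a_{i-1}, a_{i-2}, V_T, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}{c(a_{i-1}, a_{i-2}, V_T, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}$$
  • The corresponding probability for the right side of V_T is obtained in the same way: $$P(b_j \mid b_{j-1}, b_{j-2}, V_T, S) = \frac{c(V_T, b_{j-2}, b_{j-1}, b_j, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}{c(V_T, b_{j-2}, b_{j-1}, a'_{h'}, \ldots, a'_1, V_S, b'_1, \ldots, b'_{k'})}$$
  • Here c(a_i, a_{i-1}, a_{i-2}, V_T, a'_{h'}, ..., a'_1, V_S, b'_1, ..., b'_{k'}) denotes the number of occurrences of the sequence {a_i, a_{i-1}, a_{i-2}, V_T} in the predetermined target language statements of the statement pairs to be counted, c(V_T, b_{j-2}, b_{j-1}, a'_{h'}, ..., a'_1, V_S, b'_1, ..., b'_{k'}) denotes the number of occurrences of the sequence {V_T, b_{j-2}, b_{j-1}} in those statements, and c(V_T, b_{j-2}, b_{j-1}, b_j, a'_{h'}, ..., a'_1, V_S, b'_1, ..., b'_{k'}) denotes the number of occurrences of the sequence {V_T, b_{j-2}, b_{j-1}, b_j} in those statements.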
  • Through the processing of the correlation degree obtaining subunit 210 and the matching score determination subunit 220, the predicate information in both the source language statement and the target language statement can be taken into account at the same time, which can make the processing results more accurate than those of conventional techniques.
  • The matching score between each of the plurality of target language semantic role sequences T1, T2, ... and the source language semantic role sequence S can thus be obtained through the processing of the matching unit 120. The sequence result determination unit 130 can then determine, as the final sequence result, the candidate sequence result corresponding to the target language semantic role sequence that has the highest matching score with the source language semantic role sequence S. It should be noted that this final sequence result corresponds to the final processing result obtained in the process of converting the source language statement from the source language mode to the target language mode.
  • In the example described above, the semantic role labeling unit 110 can obtain the target language semantic role sequence T1 "argL^{T1}_2 argL^{T1}_1 saw" and the target language semantic role sequence T2 "argL^{T2}_1 saw argR^{T2}_1".
  • The matching unit 120 can obtain the matching score between the target language semantic role sequence T1 and the source language semantic role sequence S "argL^S see argR^S", which is assumed here to be 0.8.
  • The matching unit 120 can likewise obtain the matching score between the target language semantic role sequence T2 and the source language semantic role sequence S "argL^S see argR^S", which is assumed here to be 0.5.
  • The sequence result determination unit 130 can then determine the candidate sequence result corresponding to the target language semantic role sequence T1 (that is, "The teacher I saw yesterday"), which has the higher matching score, as the final sequence result.
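  • The selection step amounts to taking the candidate with the largest matching score. A minimal sketch, using the assumed scores 0.8 for T1 and 0.5 for T2 from the example (the function name is hypothetical):

```python
from typing import Dict


def final_sequence_result(scores: Dict[str, float]) -> str:
    """Return the candidate whose matching score is highest."""
    return max(scores, key=scores.get)


scores = {
    "The teacher I saw yesterday": 0.8,  # candidate behind T1
    "I yesterday saw the teacher": 0.5,  # candidate behind T2
}
print(final_sequence_result(scores))
```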
  • When the source language statement contains a plurality of source language predicates, the semantic role labeling unit 110 may, for each source language predicate, take the sequence consisting of that predicate and its associated semantic roles as the source language semantic role sequence corresponding to that source language predicate, and take the sequence consisting of the target language predicate corresponding to that source language predicate and its associated semantic roles as the target language semantic role sequence corresponding to that source language predicate.
  • The matching unit 120 can then obtain the matching score between the source language semantic role sequence and the target language semantic role sequence that correspond to the same source language predicate.
  • For each predicate, the semantic role labeling unit 110 and the matching unit 120 can perform processing similar to that of the semantic role labeling unit 110 and the matching unit 120 described above in connection with FIG. 1 and/or FIG. 2.
  • The "source language semantic role sequence and target language semantic role sequence corresponding to the same source language predicate" are two sequences such that the source language semantic role sequence includes a predicate Vaa, the target language semantic role sequence includes a predicate Vbb, and the predicate Vaa and the predicate Vbb are translations of each other.
  • For example, assume that the source language statement contains two predicates Vs1 and Vs2, and that the target language statement M1 and the target language statement M2 are two candidate sequence results of the source language statement.
  • The target language statement M1 contains the predicates Vta1 (corresponding to Vs1) and Vta2 (corresponding to Vs2), and the target language statement M2 contains the predicates Vtb1 (corresponding to Vs1) and Vtb2 (corresponding to Vs2).
  • The sequence consisting of the predicate Vs1 in the source language statement and the semantic roles related to Vs1 is called the sequence S1, and the sequence consisting of the predicate Vs2 in the source language statement and the semantic roles related to Vs2 is called the sequence S2.
  • The sequence consisting of the predicate Vta1 in the target language statement M1 and the semantic roles related to Vta1 is called the sequence T1a, and the sequence consisting of the predicate Vta2 in the target language statement M1 and the semantic roles related to Vta2 is called the sequence T2a.
  • The sequence consisting of the predicate Vtb1 in the target language statement M2 and the semantic roles related to Vtb1 is called the sequence T1b, and the sequence consisting of the predicate Vtb2 in the target language statement M2 and the semantic roles related to Vtb2 is called the sequence T2b.
  • The matching unit 120 can obtain the matching score between the sequence T1a and the sequence S1 (hereinafter referred to as score one), and the matching score between the sequence T1b and the sequence S1 (hereinafter referred to as score two).
  • The matching unit 120 can also obtain the matching score between the sequence T2a and the sequence S2 (hereinafter referred to as score three), and the matching score between the sequence T2b and the sequence S2 (hereinafter referred to as score four).
  • Which predicate in the source language statement a predicate in the target language statement corresponds to can be determined, for example, by treating predicates (or semantic roles) in the target language statement and the source language statement that are translations of each other as corresponding to each other.
  • the sequence result determination unit 130 may determine the final sequence result by combining the match scores for each source language predicate.
  • the sequence result determination unit 130 may use a weighted sum of score one and score three (for example, with each weight equal to 1) as a value measuring the degree of matching between the semantic roles in the target language sentence M1 and the source language sentence S. The larger the value, the better the two match.
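  • similarly, the sequence result determination unit 130 may use a weighted sum of score two and score four (for example, with each weight equal to 1) as a value measuring the degree of matching between the semantic roles in the target language sentence M2 and the source language sentence S.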
  • the sequence result determination unit 130 can then select, among all the target language sentences, the one that best matches the source language sentence as the final sequence result.
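  • the combination of per-predicate scores described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names and the numeric scores are hypothetical, and the default weight of 1 follows the example given in the text.

```python
from typing import Dict, List, Optional


def combined_score(scores: List[float], weights: Optional[List[float]] = None) -> float:
    """Weighted sum of the per-predicate matching scores of one candidate."""
    if weights is None:
        # The example in the text uses a weight of 1 for every score.
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("one weight is required per score")
    return sum(w * s for w, s in zip(weights, scores))


def select_final_result(candidates: Dict[str, List[float]]) -> str:
    """Return the name of the candidate whose combined score is highest."""
    return max(candidates, key=lambda name: combined_score(candidates[name]))


# Hypothetical numbers: M1 holds (score one, score three),
# M2 holds (score two, score four).
example = {"M1": [0.5, 0.25], "M2": [0.25, 0.125]}
print(select_final_result(example))
```

  • with these assumed numbers the combined values are 0.75 for M1 and 0.375 for M2, so M1 would be selected as the final sequence result.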
  • for a plurality of target language sentences that are candidate sequence results of the translation of a source language sentence, the above data processing apparatus can use the predetermined bilingual corpus to obtain a matching score between each of the target language semantic role sequences corresponding to those target language sentences and the source language semantic role sequence corresponding to the source language sentence, and thereby determine the final sequence result among the target language sentences.
  • the data processing apparatus according to the embodiment of the present invention described above determines the final sequence result based on the consistency of the subject-predicate-object structure between the target language and the source language, so that the processing result it obtains is more accurate than that of traditional methods.
  • the above matching score is obtained by using Equation 1 and/or Equation 2 to Equation 6, so that the calculation amount is small and the calculation speed is fast, thereby making the processing efficiency high.
  • an embodiment of the present invention further provides a data processing method, including: performing semantic role labeling on a source language sentence and on a plurality of target language sentences serving as candidate sequence results of its translation, to obtain a source language semantic role sequence and a plurality of target language semantic role sequences; obtaining, based on a predetermined bilingual corpus, a matching score between the source language semantic role sequence and each of the target language semantic role sequences, wherein the predetermined bilingual corpus includes a plurality of bilingual sentence pairs in the source language and the target language that have undergone semantic role labeling; and determining the candidate sequence result corresponding to the target language semantic role sequence with the highest matching score as the final sequence result.
  • the source language may be, for example, any one of many languages such as English, Chinese, German, French, and Japanese, and the target language may be another of these languages that has the same subject-predicate-object structure as the source language.
  • the "subject-predicate-object structure" referred to here has the same meaning as described above, so its detailed description is omitted.
  • in the following, the case where the source language is Chinese and the target language is English is taken as an example to describe the embodiment of the present invention.
  • the processing flow 300 of the data processing method according to an embodiment of the present invention starts at step S310, and then proceeds to step S320.
  • in step S320, semantic role labeling is performed on the source language sentence and on the plurality of target language sentences serving as candidate sequence results of its translation, to obtain a source language semantic role sequence and a plurality of target language semantic role sequences.
  • then, step S330 is performed.
  • the processing performed in step S320 can be the same as the processing of the semantic role labeling unit 110 described above in connection with FIG. 1 and can achieve similar technical effects, so details are not repeated here.
  • in step S330, a matching score between the source language semantic role sequence and each target language semantic role sequence is obtained based on the predetermined bilingual corpus, wherein the predetermined bilingual corpus includes a plurality of bilingual sentence pairs in the source language and the target language that have undergone semantic role labeling.
  • then, step S340 is performed.
  • the processing performed in step S330 can be the same as the processing of the matching unit 120 described above in connection with FIG. 1 and can achieve similar technical effects, and is not further described herein.
  • the processing in step S330 can be implemented, for example, as follows: for each target language predicate in each target language semantic role sequence, the predetermined bilingual corpus is used to obtain the degree of correlation between at least some of the subsequences of that target language semantic role sequence that contain the target language predicate and the source language semantic role sequence; and, for each target language semantic role sequence, the matching score between that sequence and the source language semantic role sequence is determined based on the degrees of correlation obtained for that sequence.
  • a predetermined bilingual corpus may be utilized to obtain any one or more of the following various degrees of relevance:
  • the degree of correlation between the subsequence of the target language semantic role sequence that includes only the target language predicate and the source language semantic role sequence;
  • the degree of correlation between a subsequence of the target language semantic role sequence that includes at least one semantic role located to the left of the target language predicate and the source language semantic role sequence;
  • the degree of correlation between a subsequence of the target language semantic role sequence that includes the target language predicate and at least one semantic role located to the left of the target language predicate and the source language semantic role sequence; and
  • the degree of correlation between at least two of the above three kinds of subsequences of the target language semantic role sequence and the source language semantic role sequence.
  • the above matching score can be calculated according to Equation 1 described above.
  • the conditional probabilities P(· | S) used in Equation 1 can be obtained, for example, using the maximum likelihood method.
  • for example, these conditional probabilities can be calculated according to Equations 2 through 6 described above.
  • assuming that the source language sentence includes at least two source language predicates, the sequence consisting of each source language predicate and its related semantic roles may be taken as the source language semantic role sequence corresponding to that source language predicate, and the sequence consisting of the target language predicate corresponding to that source language predicate and its related semantic roles may be taken as the target language semantic role sequence corresponding to that source language predicate.
  • then, a matching score between the source language semantic role sequence and the target language semantic role sequence corresponding to the same source language predicate is obtained, and the final sequence result is determined by combining the matching scores for each source language predicate.
  • in step S340, the candidate sequence result corresponding to the target language semantic role sequence with the highest matching score is determined as the final sequence result. Then, step S350 is performed.
  • the processing performed in step S340 can be the same as the processing of the sequence result determining unit 130 described above in connection with FIG. 1, and can achieve similar technical effects, and will not be further described herein.
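  • the flow of steps S320 to S340 can be sketched as a small pipeline. This is only an illustration under stated assumptions: `toy_label` and `toy_score` are hypothetical stand-ins for a real semantic role labeler and for the matching score of Equation 1, and the generic role names SUBJ, PRED, and OBJ are likewise assumed for the example.

```python
from typing import Callable, List, Sequence

RoleSeq = List[str]


def choose_final_result(
    source: str,
    candidates: Sequence[str],
    label: Callable[[str], RoleSeq],
    match_score: Callable[[RoleSeq, RoleSeq], float],
) -> str:
    """Label all sentences, score each candidate, return the best candidate."""
    source_seq = label(source)                                   # step S320, source side
    target_seqs = [label(c) for c in candidates]                 # step S320, target side
    scores = [match_score(source_seq, t) for t in target_seqs]   # step S330
    best = max(range(len(scores)), key=scores.__getitem__)       # step S340
    return candidates[best]


# Stand-ins for illustration only; not the labeler or formula of the text.
TOY_LABELS = {
    "我昨天看见的那个老师": ["SUBJ", "PRED", "OBJ"],
    "The teacher I saw yesterday": ["OBJ", "SUBJ", "PRED"],
    "I yesterday saw the teacher": ["SUBJ", "PRED", "OBJ"],
}


def toy_label(sentence: str) -> RoleSeq:
    """Look up a precomputed role sequence for a known sentence."""
    return TOY_LABELS[sentence]


def toy_score(source_seq: RoleSeq, target_seq: RoleSeq) -> float:
    """Fraction of positions at which the two role sequences agree."""
    agree = sum(s == t for s, t in zip(source_seq, target_seq))
    return agree / max(len(source_seq), len(target_seq))


result = choose_final_result(
    "我昨天看见的那个老师",
    ["The teacher I saw yesterday", "I yesterday saw the teacher"],
    toy_label,
    toy_score,
)
print(result)
```

  • passing the labeler and the scorer as parameters keeps the selection step independent of how the matching score is actually computed, which mirrors the separation between the units 110, 120, and 130.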
  • for a plurality of target language sentences that are candidate sequence results of the translation of a source language sentence, the data processing method can use the predetermined bilingual corpus to obtain a matching score between each of the target language semantic role sequences corresponding to those target language sentences and the source language semantic role sequence corresponding to the source language sentence, and thereby determine the final sequence result among the target language sentences.
  • the data processing method according to the embodiment of the present invention determines the final sequence result according to the consistency of the subject-predicate-object structure between the target language and the source language, so that its processing result is more accurate than that of traditional methods.
  • an embodiment of the present invention also provides an electronic device including the data processing device as described above.
  • the electronic device may be any one of the following devices: a computer (such as a desktop computer or a notebook computer); a tablet computer; a personal digital assistant; a multimedia playback device; a mobile phone (such as a smart phone); an electronic dictionary; an electronic paper book; and so on.
  • the electronic device has various functions and technical effects of the above data processing device, and details are not described herein again.
  • the respective constituent units, sub-units, modules, and the like in the above-described data processing apparatus according to the embodiment of the present invention may be configured by software, firmware, hardware, or any combination thereof.
  • a program constituting the software or firmware may be installed from a storage medium or a network to a machine having a dedicated hardware structure (for example, the general-purpose machine 400 shown in FIG. 4).
  • when the various programs are installed, the machine can execute the various functions of the above-described constituent units and subunits.
  • FIG. 4 is a simplified structural diagram showing the hardware configuration of one possible information processing device that can be used to implement the data processing device and the data processing method according to an embodiment of the present invention.
  • a central processing unit (CPU) 401 performs various processes according to a program stored in a read only memory (ROM) 402 or a program loaded from the storage portion 408 into a random access memory (RAM) 403.
  • in the RAM 403, data required when the CPU 401 performs the various processes is also stored as needed.
  • the CPU 401, the ROM 402, and the RAM 403 are connected to each other via a bus 404.
  • an input/output interface 405 is also connected to the bus 404.
  • the following components are also connected to the input/output interface 405: an input portion 406 (including a keyboard, a mouse, etc.), an output portion 407 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, etc.), a storage portion 408 (including a hard disk, etc.), and a communication portion 409 (including a network interface card such as a LAN card, a modem, etc.).
  • the communication portion 409 performs communication processing via a network such as the Internet.
  • a drive 410 can also be connected to the input/output interface 405 as needed.
  • a removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory or the like can be mounted on the drive 410 as needed, so that the computer program read therefrom can be installed into the storage portion 408 as needed.
  • a program constituting the software can be installed from a network such as the Internet or from a storage medium such as the removable medium 411.
  • such a storage medium is not limited to the removable medium 411 shown in FIG. 4, in which the program is stored and which is distributed separately from the device to provide the program to the user.
  • examples of the removable medium 411 include a magnetic disk (including a floppy disk), an optical disk (including a compact disk read only memory (CD-ROM) and a digital versatile disk (DVD)), a magneto-optical disk (including a mini disk (MD) (registered trademark)), and a semiconductor memory.
  • alternatively, the storage medium may be the ROM 402, the hard disk included in the storage portion 408, or the like, in which the program is stored and which is distributed to the user together with the device containing it.
  • the present invention also proposes a program product for storing an instruction code readable by a machine.
  • when the above-described instruction code is read and executed by a machine, the above-described data processing method according to an embodiment of the present invention can be executed.
  • various storage media such as magnetic disks, optical disks, magneto-optical disks, semiconductor memories, and the like for carrying such a program product are also included in the disclosure of the present invention.
  • the object of the present invention can also be achieved as follows: a storage medium storing the above executable program code is provided directly or indirectly to a system or device, and a computer or a central processing unit (CPU) in the system or device reads and executes the program code.
  • in this case, as long as the system or device has the function of executing a program, the embodiment of the present invention is not limited to a particular program, and the program may be in any form, for example, an object program, a program executed by an interpreter, or a script provided to the operating system.
  • machine-readable storage media include, but are not limited to, various memories and storage units, semiconductor devices, disk units such as optical, magnetic, and magneto-optical disks, as well as other media suitable for storing information and the like.
  • the client computer can also implement the present invention by connecting to a corresponding website on the Internet and downloading and installing the computer program code according to the present invention into a computer and then executing the program.
  • the present invention provides the following solutions but is not limited thereto:
  • a data processing device comprising:
  • a semantic role labeling unit configured to perform semantic role labeling on a source language sentence and on a plurality of target language sentences serving as candidate sequence results of its translation, to obtain a source language semantic role sequence and a plurality of target language semantic role sequences;
  • a matching unit configured to obtain, based on a predetermined bilingual corpus, a matching score between the source language semantic role sequence and each of the target language semantic role sequences, wherein the predetermined bilingual corpus includes a plurality of bilingual sentence pairs in the source language and the target language that have undergone semantic role labeling; and
  • a sequence result determining unit configured to determine the candidate sequence result corresponding to the target language semantic role sequence with the highest matching score as the final sequence result.
  • the matching unit comprises: a correlation degree obtaining subunit configured to obtain, for each target language predicate in each of the target language semantic role sequences, using the predetermined bilingual corpus, the degree of correlation between at least some of the subsequences of that target language semantic role sequence that contain the target language predicate and the source language semantic role sequence; and
  • a matching score determining subunit configured to determine, for each of the target language semantic role sequences, the matching score between that target language semantic role sequence and the source language semantic role sequence based on the obtained degrees of correlation related to that target language semantic role sequence.
  • the target language semantic role sequence includes only the degree of correlation between the sub-sequence of the target language predicate and the source language semantic role sequence;
  • the target language semantic role sequence includes a degree of correlation between a subsequence of at least one semantic role located to the left of the target language predicate and the source language semantic role sequence;
  • the degree of correlation between the source language semantic role sequence and at least two of the following subsequences of the target language semantic role sequence: the subsequence including only the target language predicate, a subsequence including at least one semantic role located to the left of the target language predicate, and a subsequence including the target language predicate and at least one semantic role located to the left of the target language predicate.
  • where T is the target language semantic role sequence, V_T is the target language predicate in T, a_i is the i-th semantic role located to the left of V_T in T, h is the number of semantic roles located to the left of V_T in T, and b_j is the j-th semantic role located to the right of V_T in T;
  • P(V_T | S) is the conditional probability indicating the degree of correlation between S and the subsequence {V_T} of T, and the conditional probabilities conditioned on V_T and S indicate the degree of correlation between S and the subsequences of T that combine V_T with the semantic roles beside it;
  • V_S is the source language predicate in S, a' denotes the semantic roles located to the left of V_S in S, and b' denotes the semantic roles located to the right of V_S in S;
  • each count c(...) whose arguments combine a subsequence of T with the sequence formed by V_S and its semantic roles denotes the number of times that subsequence of T appears in the predetermined target language sentences of the bilingual sentence pairs whose predetermined source language sentences contain the sequence formed by V_S and its semantic roles, and the count taken over that source language sequence alone denotes the number of predetermined source language sentences containing it.
  • Supplementary note 7 The data processing apparatus according to the supplementary note 2, wherein
  • the semantic role tagging unit is configured to use, as the source corresponding to the source language predicate, a sequence composed of each source language predicate and its related semantic role in a case where the source language statement includes at least two source language predicates a sequence of language semantic roles, and a sequence consisting of a target language predicate corresponding to the source language predicate and its associated semantic role as a target language semantic role sequence corresponding to the source language predicate; the matching unit is used to obtain the same a matching score between a source language semantic role sequence and a target language semantic role sequence corresponding to a source language predicate; and the sequence result determining unit is configured to determine a final sequence result by combining matching scores for each source language predicate .
  • a data processing method including:
  • Semantic role labeling is performed on the source language statement and the plurality of target language sentences as the candidate sequence results of the translation, to obtain the source language semantic role sequence and the plurality of target language semantic role sequences;
  • the candidate sequence result corresponding to the target language semantic role sequence with the highest matching score is determined as the final sequence result.
  • a matching score between the target language semantic role sequence and the source language semantic role sequence is determined based on the obtained degree of correlation related to the target language semantic role sequence.
  • the target language semantic role sequence includes only the degree of correlation between the subsequence of the target language predicate and the source language semantic role sequence;
  • the target language semantic role sequence includes a degree of correlation between a subsequence of at least one semantic role located to the left of the target language predicate and the source language semantic role sequence;
  • the degree of correlation between the source language semantic role sequence and at least two of the following subsequences of the target language semantic role sequence: the subsequence including only the target language predicate, a subsequence including at least one semantic role located to the left of the target language predicate, and a subsequence including the target language predicate and at least one semantic role located to the left of the target language predicate.
  • Supplementary note 12 The data processing method according to the supplementary note 10 or 11, wherein the matching score is determined according to the following formula:
  • where V_S is the source language predicate in S, a' denotes the semantic roles located to the left of V_S in S, and b' denotes the semantic roles located to the right of V_S in S;
  • each count c(...) whose arguments combine a subsequence of T with the sequence formed by V_S and its semantic roles denotes the number of times that subsequence of T appears in the predetermined target language sentences of the bilingual sentence pairs whose predetermined source language sentences contain the sequence formed by V_S and its semantic roles, and the count taken over that source language sequence alone denotes the number of predetermined source language sentences containing it.
  • Supplementary note 15 The data processing method according to Supplementary Note 10, further comprising:
  • in a case where the source language sentence includes at least two source language predicates, taking the sequence consisting of each source language predicate and its related semantic roles as the source language semantic role sequence corresponding to that source language predicate, and taking the sequence consisting of the target language predicate corresponding to that source language predicate and its related semantic roles as the target language semantic role sequence corresponding to that source language predicate; obtaining a matching score between the source language semantic role sequence and the target language semantic role sequence corresponding to the same source language predicate; and determining the final sequence result by combining the matching scores for each source language predicate.
  • Supplementary note 17 An electronic device comprising the data processing device of any one of the supplementary notes 1-8.
  • a computer; a tablet computer; a personal digital assistant; a multimedia playback device; a mobile phone; an electronic dictionary; and an electronic paper book.
  • Supplementary note 19 A program product storing a machine readable instruction code, wherein the program product, when executed, enables the machine to execute the data processing method according to any one of Supplementary Notes 9-16.
  • Supplementary note 20 A computer readable storage medium having stored thereon the program product according to the supplementary note 19.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

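The present invention provides a data processing device, a data processing method, and an electronic device, to overcome the problem of low processing precision in existing language data processing techniques. The data processing device includes: a semantic role labeling unit for performing semantic role labeling on a source language sentence and on a plurality of target language sentences serving as candidate sequence (reordering) results of its translation, to obtain a source language semantic role sequence and a plurality of target language semantic role sequences; a matching unit for obtaining, based on a predetermined bilingual corpus, a matching score between the source language semantic role sequence and each of the target language semantic role sequences; and a sequence result determination unit for determining the candidate sequence result corresponding to the target language semantic role sequence with the highest matching score as the final sequence result. The above technique of the present invention can be applied to the field of data processing.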

Description

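Data processing device, data processing method, and electronic device
Technical Field
[01] The present invention relates to the field of data processing, and more particularly to a data processing device, a data processing method, and an electronic device.
Background Art
[02] Data processing is currently a fairly active technical field. In this field, because the kinds of data are rich and diverse, the purposes and requirements of processing also differ.
[03] Language data, as one of many types of data, is extremely common in people's daily life and work. For example, e-mail, short messages exchanged between mobile phones, and the textual information contained in the various documents that people handle in study and work are all language data. When such language data is processed with existing language data processing techniques, especially when language data in one mode is converted into another mode, the accuracy and/or precision of the processing is often low.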
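Summary of the Invention
[04] A brief summary of the present invention is given below in order to provide a basic understanding of certain aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present certain concepts in simplified form as a prelude to the more detailed description discussed later.
[05] In view of this, the present invention provides a data processing device, a data processing method, and an electronic device, to overcome the problem of low processing precision in existing language data processing techniques.
[06] According to one aspect of the present invention, a data processing device is provided, including: a semantic role labeling unit for performing semantic role labeling on a source language sentence and on a plurality of target language sentences serving as candidate sequence results of its translation, to obtain a source language semantic role sequence and a plurality of target language semantic role sequences; a matching unit for obtaining, based on a predetermined bilingual corpus, a matching score between the source language semantic role sequence and each of the target language semantic role sequences, wherein the predetermined bilingual corpus includes a plurality of bilingual sentence pairs in the source language and the target language that have undergone semantic role labeling; and a sequence result determination unit for determining the candidate sequence result corresponding to the target language semantic role sequence with the highest matching score as the final sequence result.
[07] According to another aspect of the present invention, a data processing method is also provided, including: performing semantic role labeling on a source language sentence and on a plurality of target language sentences serving as candidate sequence results of its translation, to obtain a source language semantic role sequence and a plurality of target language semantic role sequences; obtaining, based on a predetermined bilingual corpus, a matching score between the source language semantic role sequence and each of the target language semantic role sequences, wherein the predetermined bilingual corpus includes a plurality of bilingual sentence pairs in the source language and the target language that have undergone semantic role labeling; and determining the candidate sequence result corresponding to the target language semantic role sequence with the highest matching score as the final sequence result.
[08] According to another aspect of the present invention, an electronic device is also provided, which includes the data processing device described above.
[09] According to yet another aspect of the present invention, a program product storing machine-readable instruction code is also provided; when executed, the program product enables the machine to execute the data processing method described above.
[10] In addition, according to other aspects of the present invention, a computer-readable storage medium is also provided, on which the program product described above is stored.
[11] For a plurality of target language sentences that are candidate sequence results of the translation of a source language sentence, the data processing device, data processing method, and electronic device according to embodiments of the present invention can use a predetermined bilingual corpus to obtain matching scores between the target language semantic role sequences corresponding to those target language sentences and the source language semantic role sequence, so as to determine the final sequence result among the target language sentences. At least one of the following benefits can thereby be obtained: higher accuracy of the processing result; a small amount of computation and fast computation; and higher processing efficiency.
[12] These and other advantages of the present invention will become more apparent from the following detailed description of the preferred embodiments in conjunction with the accompanying drawings.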
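Brief Description of the Drawings
[13] The present invention may be better understood by referring to the description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals are used throughout to denote the same or similar components. The drawings, together with the detailed description below, are included in and form part of this specification, and serve to further illustrate the preferred embodiments of the invention and to explain its principles and advantages. In the drawings:
[14] FIG. 1 is a block diagram schematically showing an example structure of a data processing device according to an embodiment of the present invention.
[15] FIG. 2 is a block diagram schematically showing one possible example structure of the matching unit shown in FIG. 1.
[16] FIG. 3 is a flowchart schematically showing an exemplary process of a data processing method according to an embodiment of the present invention.
[17] FIG. 4 is a simplified structural diagram showing the hardware configuration of one possible information processing device that can be used to implement the data processing device and the data processing method according to embodiments of the present invention.
[18] Those skilled in the art will appreciate that elements in the drawings are shown only for simplicity and clarity and are not necessarily drawn to scale. For example, the dimensions of some elements in the drawings may be exaggerated relative to other elements to help improve understanding of the embodiments of the invention.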
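Detailed Description
[19] Exemplary embodiments of the present invention are described below with reference to the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in the specification. It should be understood, however, that many implementation-specific decisions must be made in developing any such actual embodiment in order to achieve the developer's specific goals, for example compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. It should also be understood that, although such development work may be very complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
[20] It should also be noted here that, to avoid obscuring the present invention with unnecessary detail, only the device structures and/or processing steps closely related to the solution of the invention are shown in the drawings, and other details of little relevance to the invention are omitted.
[21] An embodiment of the present invention provides a data processing device, including: a semantic role labeling unit for performing semantic role labeling on a source language sentence and on a plurality of target language sentences serving as candidate sequence results of its translation, to obtain a source language semantic role sequence and a plurality of target language semantic role sequences; a matching unit for obtaining, based on a predetermined bilingual corpus, a matching score between the source language semantic role sequence and each of the target language semantic role sequences, wherein the predetermined bilingual corpus includes a plurality of bilingual sentence pairs in the source language and the target language that have undergone semantic role labeling; and a sequence result determination unit for determining the candidate sequence result corresponding to the target language semantic role sequence with the highest matching score as the final sequence result.
[22] In specific implementations of the data processing device according to embodiments of the present invention, the source language may be, for example, any one of many languages such as English, Chinese, German, French, and Japanese, and the target language may be another of these languages that has the same subject-predicate-object structure as the source language. Note that the "subject-predicate-object structure" referred to here is not limited to the order "subject + predicate + object"; it may be another order, such as "subject + object + predicate", provided that the selected source language and target language have the same structure. For example, the source language and the target language both have the order "subject + predicate + object", or both have the order "subject + object + predicate".
[23] In the following, the embodiments are described mainly taking the case where the source language is Chinese and the target language is English as an example; examples using other languages as the source or target language are not described in detail. Both Chinese and English are languages with the "subject + predicate + object" order.
[24] In some data processing, there may be multiple candidate translations for a given source language sentence; in that case, the one that best matches the source language sentence can be selected from among them. In embodiments of the present invention, the data processing device can determine which candidate translation best matches the source language sentence by determining how well the order of the semantic roles in each candidate translation matches the source language sentence. Note that, in this data processing, the multiple candidate translations correspond to multiple candidate processing results obtained in converting the source language sentence from the source language mode into the target language mode.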
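[25] An example of the data processing device according to an embodiment of the present invention is described in detail below with reference to FIG. 1.
[26] As shown in FIG. 1, the data processing device 100 according to an embodiment of the present invention includes a semantic role labeling unit 110, a matching unit 120, and a sequence result determination unit 130.
[27] In the following, a source language sentence and a plurality of target language sentences serving as its candidate sequence results are taken as an example to describe how to select, among these target language sentences, the one that best matches the source language sentence.
[28] In actual processing, the procedure is similar when processing multiple source language sentences or an entire source language article, and is not described in detail.
[29] As shown in FIG. 1, in the data processing device 100 according to an embodiment of the present invention, the semantic role labeling unit 110 performs semantic role labeling on the source language sentence to obtain the source language semantic role sequence of that sentence. In addition, the semantic role labeling unit 110 performs semantic role labeling on each of the plurality of target language sentences to obtain the target language semantic role sequence of each, that is, a plurality of target language semantic role sequences.
[30] In specific implementations of the data processing device according to embodiments of the present invention, semantic role labeling techniques such as those of FrameNet, PropBank, or NomBank may be used to label English sentences (as examples of target language sentences), and semantic role labeling techniques such as those of CPB (Chinese Proposition Bank) may be used to label Chinese sentences (as examples of source language sentences). Semantic role labeling techniques are known to those skilled in the art from common general knowledge in the field and are therefore not described in detail here.
[31] In addition, in specific implementations of the data processing device according to embodiments of the present invention, the source language sentence is not limited to a complete sentence (such as "他就是我昨天看见的那个老师。", "He is the teacher I saw yesterday."); it may also be a sentence constituent within a complete sentence that has a subject-predicate-object structure (such as "我昨天看见的那个老师", "the teacher I saw yesterday").
[32] For example, suppose the source language sentence is "我昨天看见的那个老师", and suppose the target language sentences "The teacher I saw yesterday" and "I yesterday saw the teacher" are two candidate sequence results of that source language sentence. In this example, the source language is Chinese and the target language is English. Note that the two candidate sequence results correspond to candidate processing results obtained in converting the source language sentence from the source language mode into the target language mode.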
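[33] Performing semantic role labeling on the source language sentence "我昨天看见的那个老师" with the semantic role labeling unit 110 yields the following labeling result:
[34] [我]_argLs 昨天 [看见]_Vs 的 那个 [老师]_argRs
[35] In this labeling result, the content in the square brackets of [...]_Vs is the source language predicate of the source language sentence "我昨天看见的那个老师", the content in the square brackets of [...]_argLs is a semantic role located to the left of the source language predicate, and the content in the square brackets of [...]_argRs is a semantic role located to the right of the source language predicate.
[36] Thus, from the result of labeling the source language sentence "我昨天看见的那个老师", taking the semantic roles and the source language predicate in the order in which they appear in the sentence gives the following source language semantic role sequence S:
[37] argLs 看见 argRs
[38] For example, assuming argLs labels the subject and argRs labels the object, the source language semantic role sequence S corresponds to a subject-predicate-object structure in the order "subject + predicate + object".
[39] Note that, in this example, there is only one semantic role to the left of the source language predicate and only one to its right; in other examples of the data processing device of the embodiments of the present invention, there may be more than one semantic role to the left and/or right of the source language predicate.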
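The step from a labeling result to a semantic role sequence can be sketched in a few lines. This is a minimal illustration that assumes the bracket notation used in the examples here (`[text]_tag`, with predicate tags beginning with `V`); the function name `role_sequence` is hypothetical and not part of the described device.

```python
import re
from typing import List

# Matches one labeled span such as "[看见]_Vs" and captures (text, tag).
TAGGED = re.compile(r"\[([^\]]+)\]_(\w+)")


def role_sequence(labelled: str, predicate_prefix: str = "V") -> List[str]:
    """Build the semantic role sequence from a bracket-labeled sentence.

    Labeled predicates are kept as their surface word, labeled arguments are
    kept as their tag, and unlabeled words are dropped; order is preserved.
    """
    sequence = []
    for text, tag in TAGGED.findall(labelled):
        sequence.append(text if tag.startswith(predicate_prefix) else tag)
    return sequence


print(role_sequence("[我]_argLs 昨天 [看见]_Vs 的 那个 [老师]_argRs"))
```

For the labeled sentence above this yields the three-element sequence argLs, 看见, argRs, which is the sequence S of the example.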
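[40] Similarly, performing semantic role labeling on the target language sentence "The teacher I saw yesterday" with the semantic role labeling unit 110 yields the following labeling result:
[41] The [teacher]_argLT12 [I]_argLT11 [saw]_Vt1 yesterday
[42] In this labeling result, the content in the square brackets of [...]_Vt1 is the target language predicate of the target language sentence "The teacher I saw yesterday", the content in the square brackets of [...]_argLT11 is one semantic role located to the left of the target language predicate, and the content in the square brackets of [...]_argLT12 is another semantic role located to the left of the target language predicate.
[43] Thus, from the result of labeling the target language sentence "The teacher I saw yesterday", taking the semantic roles and the target language predicate in the order in which they appear in the sentence gives the following target language semantic role sequence T1:
[44] argLT12 argLT11 saw
[45] For example, assuming argLT12 labels the object and argLT11 labels the subject, the target language semantic role sequence T1 corresponds to a subject-predicate-object structure in the order "object + subject + predicate".
[46] In addition, for the target language sentence "I yesterday saw the teacher", the following labeling result can be obtained in a similar way:
[47] [I]_argLT21 yesterday [saw]_Vt2 the [teacher]_argRT21
[48] In this labeling result, the content in the square brackets of [...]_Vt2 is the target language predicate of the target language sentence "I yesterday saw the teacher", the content in the square brackets of [...]_argLT21 is a semantic role located to the left of the target language predicate, and the content in the square brackets of [...]_argRT21 is another semantic role, located to the right of the target language predicate.
[49] From this labeling result, taking the semantic roles and the target language predicate in the order in which they appear in the target language sentence "I yesterday saw the teacher" gives the following target language semantic role sequence T2:
[50] argLT21 saw argRT21
[51] For example, assuming argLT21 labels the subject and argRT21 labels the object, the target language semantic role sequence T2 corresponds to a subject-predicate-object structure in the order "subject + predicate + object".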
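[52] In this way, through the processing of the semantic role labeling unit 110, for a given source language sentence, the source language semantic role sequence S of that sentence can be obtained, together with a plurality of target language semantic role sequences T1, T2, ..., TN corresponding to the plurality of target language sentences serving as its candidate sequence results, where N is an integer greater than 1. Then, based on the predetermined bilingual corpus, the matching unit 120 can obtain a matching score between the source language semantic role sequence S and each of the target language semantic role sequences T1, T2, ..., TN.
[53] The predetermined bilingual corpus includes a plurality of bilingual sentence pairs in the source language and the target language, and these pairs have undergone semantic role labeling in advance. Note that the predetermined bilingual corpus may include a general-domain bilingual corpus and/or a domain-specific bilingual corpus, and so on.
[54] In one implementation of the data processing device according to an embodiment of the present invention, the matching unit 120 may have the example structure shown in FIG. 2. As shown in FIG. 2, in this implementation the matching unit 120 may include a correlation degree obtaining subunit 210 and a matching score determining subunit 220.
[55] For each of the target language semantic role sequences T1, T2, ..., TN, the correlation degree obtaining subunit 210 can, for each target language predicate in that sequence, use the predetermined bilingual corpus to obtain the degree of correlation between at least some of the subsequences of that sequence that contain the target language predicate and the source language semantic role sequence.
[56] The following takes any one of the target language semantic role sequences T1, T2, ..., TN as an example to describe how the matching score between it and the source language semantic role sequence is obtained. In the following, Tg denotes this "any one target language semantic role sequence", where Tg is one of T1, T2, ..., TN.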
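[57] In one example, assuming that the target-language semantic role sequence Tg contains at least one target-language predicate, then for each target-language predicate in Tg the correlation obtaining subunit 210 may obtain any one or more of the following degrees of correlation: the degree of correlation between the subsequence of Tg that includes only that target-language predicate (i.e., the predicate itself, hereinafter the "first-type subsequence") and the source-language semantic role sequence S; the degree of correlation between a subsequence of Tg that includes at least one semantic role located to the left of that target-language predicate (hereinafter the "second-type subsequence") and S; the degree of correlation between a subsequence of Tg that includes that target-language predicate and at least one semantic role located to its left (hereinafter the "third-type subsequence") and S; and the degree of correlation between at least two of the first-type, second-type and third-type subsequences of Tg and S.
[58] An example is described below for the case where the source-language semantic role sequence S described above is "argLs 看见 argRs", the target-language semantic role sequence T1 is "argLT12 argLT11 saw", and the target-language semantic role sequence T2 is "argLT21 saw argRT21".
[59] For the target-language semantic role sequence T1 "argLT12 argLT11 saw", the corresponding first-type subsequence may be, for example, "saw"; the second-type subsequence may be, for example, any one of "argLT12", "argLT11" and "argLT12 argLT11"; and the third-type subsequence may be, for example, any one of "argLT11 saw", "argLT12 saw" and "argLT12 argLT11 saw".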
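The three kinds of subsequence listed for T1 can be enumerated mechanically. The following is a minimal sketch, assuming that a subsequence keeps the original left-to-right order and may skip roles; the function name `typed_subsequences` and its argument layout are hypothetical.

```python
from itertools import combinations


def typed_subsequences(left_roles, predicate):
    """Enumerate first-, second- and third-type subsequences for one predicate.

    left_roles lists the roles to the left of the predicate in sentence order.
    """
    # Second type: every non-empty, order-preserving selection of left roles.
    second = [
        list(combo)
        for size in range(1, len(left_roles) + 1)
        for combo in combinations(left_roles, size)
    ]
    # First type: the predicate on its own.
    first = [[predicate]]
    # Third type: a second-type subsequence followed by the predicate.
    third = [roles + [predicate] for roles in second]
    return first, second, third


first, second, third = typed_subsequences(["argLT12", "argLT11"], "saw")
```

For T1 this yields exactly the candidates named in the text: one first-type, three second-type and three third-type subsequences.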
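[60] Thus, the degree of correlation between "saw" and "argLs 看见 argRs" can serve as one example of the degree of correlation between the first-type subsequence and the source-language semantic role sequence S (hereinafter the "first-type correlation"). The degree of correlation between "saw" and "argLs 看见 argRs" may be reflected, for example, by the probability that "saw" and "argLs 看见 argRs" appear together in one bilingual sentence pair of the predetermined bilingual corpus, or alternatively by the probability that "saw" appears in all the English sentences corresponding to all the Chinese sentences in the predetermined bilingual corpus that contain the structure "argLs 看见 argRs".
[61] Note that "argLs 看见 argRs" appearing in a bilingual sentence pair means the following: for the Chinese sentence of that pair, suppose that the semantic role sequence obtained from its semantic role labeling result, by taking the semantic roles and the predicate in the order in which they appear in the sentence, is S0; then "argLs 看见 argRs" is a subsequence of S0. For example, if S0 is "W1 W2 W3 W4 W5", then "argLs 看见 argRs" may be, for example, "W2 W4 W5", and so on. Note that a subsequence of a sequence may be the sequence itself.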
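[62] For example, assuming that argLs denotes the subject and argRs denotes the object, the structure "argLs 看见 argRs" corresponds to the structure "subject + '看见' + object". Suppose that bilingual sentence pair C1 consists of "我看见猫" and "I saw a cat", and that another bilingual sentence pair C2 consists of "他看见许多书" and "He found many books". Since the semantic role sequences obtained from the labeling results of "我看见猫" and "他看见许多书" both have the structure "subject + '看见' + object", it can be determined that "subject + '看见' + object", i.e. "argLs 看见 argRs", appears in pair C1 and also in pair C2. At the same time, the first-type subsequence "saw" appears in the English sentence "I saw a cat" of pair C1, but does not appear in the English sentence "He found many books" of pair C2. Therefore, the probability that "saw" appears in all the English sentences corresponding to all the Chinese sentences in the predetermined bilingual corpus that contain the structure "argLs 看见 argRs" may be, for example, 50% (in the case where the predetermined bilingual corpus contains only the pairs C1 and C2).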
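The 50% figure for the two-pair corpus can be reproduced with a short sketch. It assumes that each sentence pair is stored as two role sequences, that "appears in" means an order-preserving subsequence match as described for S0, and that the target-side labels `argL` and `argR` are generic placeholders; the function names are hypothetical as well.

```python
def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def cooccurrence_probability(corpus, source_pattern, target_pattern):
    """Share of pairs containing source_pattern whose target side contains target_pattern."""
    matching = [pair for pair in corpus if is_subsequence(source_pattern, pair[0])]
    if not matching:
        return 0.0
    hits = sum(is_subsequence(target_pattern, pair[1]) for pair in matching)
    return hits / len(matching)


corpus = [
    (["argLs", "看见", "argRs"], ["argL", "saw", "argR"]),    # C1: "I saw a cat"
    (["argLs", "看见", "argRs"], ["argL", "found", "argR"]),  # C2: "He found many books"
]
p = cooccurrence_probability(corpus, ["argLs", "看见", "argRs"], ["saw"])
```

With only C1 and C2 in the corpus, both source sides match the structure and one of the two target sides contains "saw", so `p` is one half.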
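[63] In addition, the degree of correlation between any one of "argLT12", "argLT11" and "argLT12 argLT11" and "argLs 看见 argRs" can serve as one example of the degree of correlation between the second-type subsequence and the source-language semantic role sequence S (hereinafter the "second-type correlation"). Similarly, the second-type correlation may be reflected, for example, by the probability that the second-type subsequence and the source-language semantic role sequence S appear together in one bilingual sentence pair of the predetermined bilingual corpus, or alternatively by the probability that the second-type subsequence appears in all the English sentences corresponding to all the Chinese sentences in the predetermined bilingual corpus that contain S. The probability can be computed in a manner similar to that described above, which is not repeated here.
[64] In addition, the degree of correlation between any one of "argLT11 saw", "argLT12 saw" and "argLT12 argLT11 saw" and "argLs 看见 argRs" can serve as one example of the degree of correlation between the third-type subsequence and the source-language semantic role sequence S (hereinafter the "third-type correlation"). Similarly, the third-type correlation may be reflected, for example, by the probability that the third-type subsequence and S appear together in one bilingual sentence pair of the predetermined bilingual corpus, or alternatively by the probability that the third-type subsequence appears in all the English sentences corresponding to all the Chinese sentences in the predetermined bilingual corpus that contain S. The probability can be computed in a manner similar to that described above, which is not repeated here.
[65] Similarly, the degree of correlation between at least two of the first-type, second-type and third-type subsequences and the source-language semantic role sequence S can be obtained (hereinafter the "fourth-type correlation").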
[66] 例如, 假设选择第一类子序列和第三类子序列作为上述至少两种子序 列的示例, 并假设第一类子序列为 "saw"、 第三类子序列为 "argLT12 saw" , 则第一类子序列和第三类子序列与源语言语义角色序列 S之间的相 关程度可以由上述第一类子序列 "saw"、 第三类子序列 "argLT12 saw" 以及源语言语义角色序列 S同时出现在上述预定双语语料库的一个双语句 对中的概率来反映, 或者, 也可以由上述预定双语语料库中包含源语言语 义角色序列 S的所有中文语句对应的所有英文语句中同时出现上述第一类 子序列 "saw" 和第三类子序列 "argLT12 saw" 的概率来反映, 其中, 计算概率的方法可以与上文相类似, 这里不再赘述。
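[67] As another example, suppose that the first-type and second-type subsequences are chosen as the at least two kinds of subsequence, and that the first-type subsequence is "saw" and the second-type subsequence is "argLT12 argLT11". The degree of correlation between these two subsequences and the source-language semantic role sequence S may then be reflected by the probability that the first-type subsequence "saw", the second-type subsequence "argLT12 argLT11" and S appear together in one bilingual sentence pair of the predetermined bilingual corpus, or alternatively by the probability that the first-type subsequence "saw" and the second-type subsequence "argLT12 argLT11" both appear in all the English sentences corresponding to all the Chinese sentences in the predetermined bilingual corpus that contain S. The probability can be computed in a manner similar to that described above, which is not repeated here.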
[68] 以上举例说明了如何获得第一类至第四类相关程度, 但需要注意的 是, 相关程度获得子单元 210可以获得上述第一类至第四类相关程度中的 任一种或多种,而不一定需要计算第一类至第四类相关程度的全部。另外, 需要说明的是, 相关程度获得子单元 210所计算的相关程度中可以包括多 个同类别的相关程度, 例如, 可以包括两个第二类相关程度(这两个第二 类相关程度所对应的第二类子序列可以不同), 等等。
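[69] In this way, the matching score determining subunit 220 may determine the matching score between each target-language semantic role sequence and the source-language semantic role sequence based on the various degrees of correlation obtained by the correlation obtaining subunit 210 for that target-language semantic role sequence (for example, any one or more of the first-type to fourth-type correlations described above). In one implementation, for each target-language semantic role sequence, the matching score determining subunit 220 may multiply together the values of the degrees of correlation related to that sequence and take the resulting product as the matching score between that sequence and the source-language semantic role sequence. In another implementation, for each target-language semantic role sequence, the matching score determining subunit 220 may instead take the result of a weighted calculation (for example, a weighted sum) over the values of the degrees of correlation related to that sequence as the matching score between that sequence and the source-language semantic role sequence.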
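The two ways of combining degrees of correlation into a matching score, a product or a weighted sum, can be sketched as follows. The correlation values and weights are made-up numbers for illustration, and the function names are hypothetical.

```python
import math


def score_by_product(correlations):
    """Matching score as the product of all correlation values."""
    return math.prod(correlations)


def score_by_weighted_sum(correlations, weights):
    """Matching score as a weighted sum of the correlation values."""
    if len(correlations) != len(weights):
        raise ValueError("one weight is needed per correlation value")
    return sum(w * c for w, c in zip(weights, correlations))


# Hypothetical correlation values for one target-language sequence.
correlations = [0.5, 0.4, 0.25]
product_score = score_by_product(correlations)
weighted_score = score_by_weighted_sum(correlations, [1.0, 1.0, 2.0])
```

The product treats the values like chained probabilities, so a single low value pulls the score down sharply, whereas the weighted sum lets one kind of correlation be emphasized over another.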
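[70] In one example, the matching score determining subunit 220 may obtain the matching score according to Formula 1 below.

Formula 1:

$$\mathrm{score}(S,T) = P(V_T \mid S)\cdot P(a_1 \mid V_T,S)\cdot\prod_{i=2}^{h}P(a_i \mid a_{i-1},V_T,S)\cdot P(b_1 \mid V_T,S)\cdot\prod_{j=2}^{k}P(b_j \mid V_T,b_{j-1},S)$$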
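[71] In Formula 1, $S$ denotes the source-language semantic role sequence; $T$ denotes any one of the target-language semantic role sequences of the plurality of target-language sentences corresponding to $S$; $V_T$ is the target-language predicate in $T$; $a_i$ is the $i$-th semantic role located to the left of $V_T$ in $T$; $h$ is the number of semantic roles to the left of $V_T$; $b_j$ is the $j$-th semantic role located to the right of $V_T$ in $T$; $k$ is the number of semantic roles to the right of $V_T$; $P(V_T \mid S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequence $\{V_T\}$ of $T$; $P(a_1 \mid V_T,S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T\}$ and $\{a_1,V_T\}$ of $T$; $P(a_i \mid a_{i-1},V_T,S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{a_{i-1},V_T\}$ and $\{a_i,a_{i-1},V_T\}$ of $T$; $P(b_1 \mid V_T,S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T\}$ and $\{V_T,b_1\}$ of $T$; and $P(b_j \mid V_T,b_{j-1},S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T,b_{j-1}\}$ and $\{V_T,b_{j-1},b_j\}$ of $T$.
[72] In one implementation, $P(V_T \mid S)$ may, for example, be equal to the probability that the subsequence $\{V_T\}$ appears in all the predetermined target-language sentences corresponding to all the predetermined source-language sentences in the predetermined bilingual corpus that contain the source-language semantic role sequence $S$. For convenience, the set formed by "all the predetermined target-language sentences corresponding to all the predetermined source-language sentences in the predetermined bilingual corpus that contain the source-language semantic role sequence $S$" is hereinafter called the predetermined set. Then $P(a_1 \mid V_T,S)$ may, for example, be equal to the probability that the subsequence $\{a_1,V_T\}$ appears in those predetermined target-language sentences of the predetermined set in which the subsequence $\{V_T\}$ already appears; $P(a_i \mid a_{i-1},V_T,S)$ may be equal to the probability that $\{a_i,a_{i-1},V_T\}$ appears in those in which $\{a_{i-1},V_T\}$ already appears; $P(b_1 \mid V_T,S)$ may be equal to the probability that $\{V_T,b_1\}$ appears in those in which $\{V_T\}$ already appears; and $P(b_j \mid V_T,b_{j-1},S)$ may be equal to the probability that $\{V_T,b_{j-1},b_j\}$ appears in those in which $\{V_T,b_{j-1}\}$ already appears.
[73] Note that in Formula 1, the closer a semantic role is to $V_T$, the smaller its index. For example, $a_1$ is the semantic role in $T$ that is to the left of $V_T$ and closest to it, $a_2$ is the semantic role in $T$ that is to the left of $V_T$ and second closest to it, and so on.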
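[74] In one implementation, the correlation obtaining subunit 210 may use maximum likelihood estimation to obtain $P(V_T \mid S)$, $P(a_1 \mid V_T,S)$, $P(a_i \mid a_{i-1},V_T,S)$, $P(b_1 \mid V_T,S)$ and $P(b_j \mid V_T,b_{j-1},S)$ in Formula 1. Formulas 2 to 6 give one example of how these quantities may be computed.
[75] Formula 2:

$$P(V_T \mid S) = \frac{C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

[76] Formula 3:

$$P(a_1 \mid V_T,S) = \frac{C(a_1,V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

[77] Formula 4:

$$P(a_i \mid a_{i-1},V_T,S) = \frac{C(a_i,a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

[78] Formula 5:

$$P(b_1 \mid V_T,S) = \frac{C(V_T,b_1,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

[79] Formula 6:

$$P(b_j \mid V_T,b_{j-1},S) = \frac{C(V_T,b_{j-1},b_j,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,b_{j-1},a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$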
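[80] In Formulas 2 to 6 above, $V_S$ is the source-language predicate in $S$, $a'_{h'},\dots,a'_1$ are the $h'$ semantic roles located to the left of $V_S$ in $S$, and $b'_1,\dots,b'_{k'}$ are the $k'$ semantic roles located to the right of $V_S$ in $S$. The sequence $\{a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'}\}$ is therefore the source-language semantic role sequence $S$.
[81] $C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T\}$ appears in all the predetermined target-language sentences of the bilingual sentence pairs to which all the predetermined source-language sentences containing the source-language semantic role sequence $S$ (i.e. $\{a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'}\}$) belong. Below, all the bilingual sentence pairs to which all the predetermined source-language sentences containing $S$ belong are called the sentence pairs to be counted. $C(a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of all predetermined source-language sentences containing the sequence $\{a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'}\}$. $C(a_1,V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{a_1,V_T\}$ appears in the predetermined target-language sentences of the sentence pairs to be counted. In the same way, $C(a_i,a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ counts the sequence $\{a_i,a_{i-1},V_T\}$, $C(a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ counts the sequence $\{a_{i-1},V_T\}$, $C(V_T,b_1,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ counts the sequence $\{V_T,b_1\}$, $C(V_T,b_{j-1},b_j,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ counts the sequence $\{V_T,b_{j-1},b_j\}$, and $C(V_T,b_{j-1},a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ counts the sequence $\{V_T,b_{j-1}\}$, each in the predetermined target-language sentences of the sentence pairs to be counted.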
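[82] In another example, the matching score determining subunit 220 may also obtain the matching score according to Formula 7 below.
[83] Formula 7:

$$\mathrm{score}(S,T) = P(V_T \mid S)\cdot P(a_1 \mid V_T,S)\cdot P(a_2 \mid a_1,V_T,S)\cdot\prod_{i=3}^{h}P(a_i \mid a_{i-1},a_{i-2},V_T,S)\cdot P(b_1 \mid V_T,S)\cdot P(b_2 \mid V_T,b_1,S)\cdot\prod_{j=3}^{k}P(b_j \mid V_T,b_{j-2},b_{j-1},S)$$

[84] Unlike Formula 1, $P(a_i \mid a_{i-1},a_{i-2},V_T,S)$ in Formula 7 is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{a_{i-1},a_{i-2},V_T\}$ and $\{a_i,a_{i-1},a_{i-2},V_T\}$ of $T$, and $P(b_j \mid V_T,b_{j-2},b_{j-1},S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T,b_{j-2},b_{j-1}\}$ and $\{V_T,b_{j-2},b_{j-1},b_j\}$ of $T$.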
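[85] In Formula 7, $P(V_T \mid S)$ may be computed, for example, according to Formula 2; $P(a_1 \mid V_T,S)$ according to Formula 3; $P(a_2 \mid a_1,V_T,S)$ according to Formula 4; $P(b_1 \mid V_T,S)$ according to Formula 5; and $P(b_2 \mid V_T,b_1,S)$ according to Formula 6. In addition, $P(a_i \mid a_{i-1},a_{i-2},V_T,S)$ may be computed, for example, according to Formula 8 below, and $P(b_j \mid V_T,b_{j-2},b_{j-1},S)$ according to Formula 9 below.
[86] Formula 8:

$$P(a_i \mid a_{i-1},a_{i-2},V_T,S) = \frac{C(a_i,a_{i-1},a_{i-2},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(a_{i-1},a_{i-2},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

[87] Formula 9:

$$P(b_j \mid V_T,b_{j-2},b_{j-1},S) = \frac{C(V_T,b_{j-2},b_{j-1},b_j,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,b_{j-2},b_{j-1},a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

[88] Here, $C(a_i,a_{i-1},a_{i-2},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{a_i,a_{i-1},a_{i-2},V_T\}$ appears in the predetermined target-language sentences of the sentence pairs to be counted; $C(a_{i-1},a_{i-2},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{a_{i-1},a_{i-2},V_T\}$ appears in those sentences; $C(V_T,b_{j-2},b_{j-1},b_j,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T,b_{j-2},b_{j-1},b_j\}$ appears in those sentences; and $C(V_T,b_{j-2},b_{j-1},a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T,b_{j-2},b_{j-1}\}$ appears in those sentences.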
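[89] As can be seen from the above description, in the example described with reference to Fig. 2, through the processing of the correlation obtaining subunit 210 and the matching score determining subunit 220, the predicate information in both the source-language sentence and the target-language sentences can be taken into account at the same time, which can make the processing result more accurate than with conventional techniques.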
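[90] Thus, through the processing of the matching unit 120, the matching score between each of the target-language semantic role sequences T1, T2, ..., TN and the source-language semantic role sequence S can be obtained. The reordering result determining unit 130 may then determine, as the final reordering result, the candidate reordering result corresponding to the target-language semantic role sequence that has the highest matching score with S. Note that the final reordering result corresponds to the final processing result obtained in the course of converting the source-language sentence from the source-language pattern into the target-language pattern.
[91] For example, assume that the source-language sentence is "我昨天看见的那个老师", and that the target-language sentences "The teacher I saw yesterday" and "I yesterday saw the teacher" are two candidate reordering results of that source-language sentence. As described above, the semantic role labeling unit 110 can then obtain the target-language semantic role sequence T1 "argLT12 argLT11 saw" and the target-language semantic role sequence T2 "argLT21 saw argRT21".
[92] For the target-language semantic role sequence T1 "argLT12 argLT11 saw", the matching unit 120 can obtain, according to Formulas 1 to 6, the matching score between T1 and the source-language semantic role sequence S "argLs 看见 argRs"; assume this score is 0.8.
[93] Similarly, the matching unit 120 can obtain the matching score between the target-language semantic role sequence T2 and the source-language semantic role sequence S "argLs 看见 argRs"; assume this score is 0.5.
[94] The reordering result determining unit 130 may therefore determine the candidate reordering result corresponding to the target-language semantic role sequence T1 (i.e. "The teacher I saw yesterday"), which has the higher matching score, as the final reordering result.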
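Choosing the final reordering result amounts to taking the candidate with the highest matching score. A minimal sketch, using the assumed scores 0.8 and 0.5 from the example; the function name `final_reordering` and the dictionary layout are hypothetical.

```python
def final_reordering(scores_by_candidate):
    """Return the candidate sentence whose matching score is highest."""
    return max(scores_by_candidate, key=scores_by_candidate.get)


# Matching scores assumed in the example for T1 and T2 respectively.
candidates = {
    "The teacher I saw yesterday": 0.8,
    "I yesterday saw the teacher": 0.5,
}
final = final_reordering(candidates)
```

With these scores the candidate belonging to T1 is selected. If two candidates tie, `max` keeps the one listed first, which is a detail the specification does not address.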
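[95] In another implementation of the data processing apparatus according to an embodiment of the present invention, where the source-language sentence contains two or more predicates, the semantic role labeling unit 110 may take the sequence composed of each source-language predicate and its related semantic roles as the source-language semantic role sequence corresponding to that source-language predicate, and take the sequence composed of the target-language predicate corresponding to that source-language predicate and its related semantic roles as the target-language semantic role sequence corresponding to that source-language predicate. In this case, the matching unit 120 may obtain the matching score between the source-language semantic role sequence and the target-language semantic role sequence that correspond to the same source-language predicate. Thus, in this implementation, the semantic role labeling unit 110 and the matching unit 120 may perform, for each predicate separately, processing similar to that of the semantic role labeling unit 110 and the matching unit 120 described above with reference to Fig. 1 and/or Fig. 2. Note that "the source-language semantic role sequence and the target-language semantic role sequence corresponding to the same source-language predicate" are two sequences such that, if the source-language semantic role sequence contains a predicate Vaa and the target-language semantic role sequence contains a predicate Vbb, then Vaa and Vbb are translations of each other.
[96] For example, assume that a source-language sentence S' contains two predicates Vs1 and Vs2, and that target-language sentences M1 and M2 are two candidate reordering results of S', where M1 contains the predicates Vta1 (corresponding to Vs1) and Vta2 (corresponding to Vs2), and M2 contains the predicates Vtb1 (corresponding to Vs1) and Vtb2 (corresponding to Vs2).
[97] The sequence composed of the predicate Vs1 in the source-language sentence and the semantic roles related to Vs1 is called sequence S1', and the sequence composed of the predicate Vs2 in the source-language sentence and the semantic roles related to Vs2 is called sequence S2'.
[98] The sequence composed of the predicate Vta1 in the target-language sentence M1 and the semantic roles related to Vta1 is called sequence T1a', and the sequence composed of the predicate Vta2 in M1 and the semantic roles related to Vta2 is called sequence T2a'.
[99] The sequence composed of the predicate Vtb1 in the target-language sentence M2 and the semantic roles related to Vtb1 is called sequence T1b', and the sequence composed of the predicate Vtb2 in M2 and the semantic roles related to Vtb2 is called sequence T2b'.
[100] Thus, for the predicate Vs1, the matching unit 120 can obtain the matching score between sequence T1a' and sequence S1' (hereinafter score one), and the matching score between sequence T1b' and sequence S1' (hereinafter score two).
[101] Similarly, for the predicate Vs2, the matching unit 120 can obtain the matching score between sequence T2a' and sequence S2' (hereinafter score three), and the matching score between sequence T2b' and sequence S2' (hereinafter score four).
[102] Which predicate of the source-language sentence a predicate of a target-language sentence corresponds to can be determined from the word correspondence. For example, predicates (or semantic roles) in the target-language sentence and in the source-language sentence that are translations of each other can be determined to correspond to each other.
[103] The reordering result determining unit 130 may determine the final reordering result by combining the matching scores for each source-language predicate.
[104] For example, since sequences T1a' and T2a' relate to the target-language sentence M1, the reordering result determining unit 130 may take the weighted sum of score one and score three (for example, with both weights equal to 1) as a value measuring how well the order of the semantic roles in M1 matches the source-language sentence; the larger the value, the better the match.
[105] Similarly, since sequences T1b' and T2b' relate to the target-language sentence M2, the reordering result determining unit 130 may take the weighted sum of score two and score four (for example, with both weights equal to 1) as a value measuring how well the order of the semantic roles in M2 matches the source-language sentence; the larger the value, the better the match.
[106] In this way, the reordering result determining unit 130 may select, among all the target-language sentences, the one that best matches the source-language sentence as the final reordering result.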
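For sentences with several predicates, the per-predicate scores are combined per candidate sentence before the best candidate is chosen. The sketch below uses a weighted sum with all weights equal to 1 by default; the numeric values standing in for scores one to four are hypothetical, as are the function and variable names.

```python
def sentence_score(per_predicate_scores, weights=None):
    """Weighted sum of the matching scores obtained for each source predicate."""
    if weights is None:
        weights = [1.0] * len(per_predicate_scores)
    return sum(w * s for w, s in zip(weights, per_predicate_scores))


# Hypothetical values: M1 uses score one and score three,
# M2 uses score two and score four.
per_predicate = {
    "M1": [0.6, 0.3],
    "M2": [0.4, 0.7],
}
best = max(per_predicate, key=lambda name: sentence_score(per_predicate[name]))
```

Here M2 wins because its combined value (about 1.1) exceeds that of M1 (about 0.9), even though M1 scores better on the first predicate.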
[107] 通过以上描述可知, 上述根据本发明的实施例的数据处理装置针对作 为源语言语句的译文的候选调序结果的多个目标语言语句, 能够利用预定 双语语料库来获得上述多个目标语言语句对应的多个目标语言语义角色 序列分别与源语言语句对应的源语言语义角色序列之间的匹配分数, 以便 在上述多个目标语言语句中确定最终的调序结果。 上述根据本发明的实施 例的数据处理装置根据目标语言和源语言之间主谓宾结构的一致性来确 定最终的调序结果, 使得利用本发明实施例的上述数据处理装置所得到的 处理结果较传统方法而言更准确。
[108] 此外, 在一些实施例中, 采用如公式一和 /或公式二至公式六来获得上 述匹配分数, 使得计算量小, 计算速度快, 由此使得处理的效率较高。
[109] 此外, 本发明的实施例还提供了一种数据处理方法, 该数据处理方法 包括: 对源语言语句以及作为其译文的候选调序结果的多个目标语言语句 分别进行语义角色标注, 以获得源语言语义角色序列以及多个目标语言语 义角色序列; 基于预定双语语料库获得上述源语言语义角色序列分别与每 个上述目标语言语义角色序列之间的匹配分数, 其中, 上述预定双语语料 库包括多个经过语义角色标注的、 针对源语言和目标语言的双语句对; 以 及将匹配分数最高的目标语言语义角色序列对应的候选调序结果确定为 最终调序结果。
[110] 在根据本发明的实施例的数据处理方法的具体实现方式中, 源语言例 如可以是为英语、 汉语、德语、 法语、 日语等众多语言中的任意一种语言, 而目标语言可以是与作为源语言的语种之间具有相同的主谓宾结构的、 上 述众多语言中的另一种语言。 其中, 这里所说的 "主谓宾结构" 可以具有 与上文描述的 "主谓宾结构" 相同的含义, 故这里省略其详细描述。 下文 中, 将主要以源语言为汉语、 目标语言为英语的情况为例来给出本发明各 实施例的相关描述。
[111] 下面结合图 3来描述上述数据处理方法的一种示例性处理。
[112] 如图 3所示, 根据本发明的实施例的数据处理方法的处理流程 300开 始于步骤 S310, 然后执行步骤 S320。
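[113] In step S320, semantic role labeling is performed on the source-language sentence and on the plurality of target-language sentences serving as candidate reordering results of its translation, so as to obtain the source-language semantic role sequence and the plurality of target-language semantic role sequences. Step S330 is then performed. The processing performed in step S320 may be, for example, the same as that of the semantic role labeling unit 110 described above with reference to Fig. 1, and can achieve similar technical effects, which are not repeated here.
[114] In step S330, the matching score between the source-language semantic role sequence and each target-language semantic role sequence is obtained based on the predetermined bilingual corpus, where the predetermined bilingual corpus includes a plurality of semantic-role-labeled bilingual sentence pairs in the source language and the target language. Step S340 is then performed. The processing performed in step S330 may be, for example, the same as that of the matching unit 120 described above with reference to Fig. 1, and can achieve similar technical effects, which are not repeated here.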
[115] 在一个实现方式中, 例如可以通过如下方式来实现步骤 S330 中的处 理: 针对每个目标语言语义角色序列中的每个目标语言谓词, 利用预定双 语语料库, 获得该目标语言语义角色序列的包含该目标语言谓词的至少部 分子序列与源语言语义角色序列之间的相关程度; 以及针对每个目标语言 语义角色序列, 基于获得的与该目标语言语义角色序列有关的相关程度来 确定该目标语言语义角色序列与源语言语义角色序列之间的匹配分数。
[116] 在一个例子中, 在步骤 S330 中, 针对每个目标语言语义角色序列中 的每个目标语言谓词, 例如可以利用预定双语语料库获得如下多种相关程 度中的任一种或多种: 该目标语言语义角色序列的仅包括该目标语言谓词 的子序列与源语言语义角色序列之间的相关程度; 该目标语言语义角色序 列的包括位于该目标语言谓词左侧的至少一个语义角色的子序列与源语 言语义角色序列之间的相关程度; 该目标语言语义角色序列的包括该目标 语言谓词和位于该目标语言谓词左侧的至少一个语义角色的子序列与源 语言语义角色序列之间的相关程度; 以及该目标语言语义角色序列的仅包 括该目标语言谓词的子序列、 包括位于该目标语言谓词左侧的至少一个语 义角色的子序列、 以及包括该目标语言谓词和位于该目标语言谓词左侧的 至少一个语义角色的子序列中的至少两种子序列与源语言语义角色序列 之间的相关程度。
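[117] In one example, the matching score may be computed according to Formula 1 described above. Similarly, $P(V_T \mid S)$, $P(a_1 \mid V_T,S)$, $P(a_i \mid a_{i-1},V_T,S)$, $P(b_1 \mid V_T,S)$ and $P(b_j \mid V_T,b_{j-1},S)$ in Formula 1 may be obtained, for example, by maximum likelihood estimation. In one example, these five quantities may be computed according to Formulas 2 to 6 described above; the details are not repeated here.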
[118] 需要说明的是, 在根据本发明的实施例的数据处理方法的一个实现方 式中, 假设源语言语句包含至少两个源语言谓词, 则可以将每个源语言谓 词及其相关的语义角色所组成的序列作为与该源语言谓词对应的源语言 语义角色序列, 并将与该源语言谓词对应的目标语言谓词及其相关的语义 角色所组成的序列作为与该源语言谓词对应的目标语言语义角色序列。 然 后, 获得与同一个源语言谓词对应的源语言语义角色序列和目标语言语义 角色序列之间的匹配分数, 并通过结合针对每个源语言谓词的匹配分数来 确定最终调序结果。
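[119] In step S340, the candidate reordering result corresponding to the target-language semantic role sequence with the highest matching score is determined as the final reordering result. Step S350 is then performed. The processing performed in step S340 may be, for example, the same as that of the reordering result determining unit 130 described above with reference to Fig. 1, and can achieve similar technical effects, which are not repeated here.
[120] The processing flow 300 ends at step S350.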
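[121] As can be seen from the above description, for the plurality of target-language sentences serving as candidate reordering results of the translation of the source-language sentence, the data processing method according to an embodiment of the present invention can use the predetermined bilingual corpus to obtain the matching scores between the plurality of target-language semantic role sequences corresponding to those target-language sentences and the source-language semantic role sequence corresponding to the source-language sentence, so as to determine the final reordering result among the plurality of target-language sentences. The data processing method determines the final reordering result according to the consistency of the subject-predicate-object structure between the target language and the source language, so that the processing result obtained with the method is more accurate than with conventional methods.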
[122] 此外, 本发明的实施例还提供了一种电子设备, 该电子设备包括如上 所述的数据处理装置。 在根据本发明的实施例的上述电子设备的具体实现 方式中, 上述电子设备可以是以下设备中的任意一种设备: 计算机(如台 式机、 笔记本电脑等); 平板电脑; 个人数字助理; 多媒体播放设备; 手 机(如智能手机); 电子词典; 以及电纸书等等。 其中, 该电子设备具有 上述数据处理装置的各种功能和技术效果, 这里不再赘述。
[123] 上述根据本发明的实施例的数据处理装置中的各个组成单元、 子单 元、 模块等可以通过软件、 固件、 硬件或其任意组合的方式进行配置。 在 通过软件或固件实现的情况下, 可从存储介质或网络向具有专用硬件结构 的机器(例如图 4所示的通用机器 400 )安装构成该软件或固件的程序, 该机器在安装有各种程序时, 能够执行上述各组成单元、 子单元的各种功 能。
[124] 图 4是示出了可用来实现根据本发明的实施例的数据处理装置和数据 处理方法的一种可能的信息处理设备的硬件配置的结构简图。
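[125] In Fig. 4, a central processing unit (CPU) 401 performs various processing according to programs stored in a read-only memory (ROM) 402 or programs loaded from a storage section 408 into a random access memory (RAM) 403. The RAM 403 also stores, as needed, the data required when the CPU 401 performs the various processing. The CPU 401, the ROM 402 and the RAM 403 are connected to one another via a bus 404. An input/output interface 405 is also connected to the bus 404.
[126] The following components are also connected to the input/output interface 405: an input section 406 (including a keyboard, a mouse and so on), an output section 407 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker and so on), the storage section 408 (including a hard disk and so on), and a communication section 409 (including a network interface card such as a LAN card, a modem and so on). The communication section 409 performs communication processing via a network such as the Internet. A drive 410 may also be connected to the input/output interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory may be mounted on the drive 410 as needed, so that a computer program read from it can be installed into the storage section 408 as needed.
[127] Where the above series of processing is implemented by software, the program constituting the software may be installed from a network such as the Internet or from a storage medium such as the removable medium 411.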
[128] 本领域的技术人员应当理解, 这种存储介质不局限于图 4所示的其中 存储有程序、 与设备相分离地分发以向用户提供程序的可拆卸介质 411。 可拆卸介质 411 的例子包含磁盘 (包含软盘)、 光盘 (包含光盘只读存储器 (CD-ROM)和数字通用盘 (DVD))、 磁光盘(包含迷你盘 (MD)(注册商标)) 和半导体存储器。 或者,存储介质可以是 ROM 402、存储部分 408中包含 的硬盘等等, 其中存有程序, 并且与包含它们的设备一起被分发给用户。
[129] 此外, 本发明还提出了一种存储有机器可读取的指令代码的程序产 品。 上述指令代码由机器读取并执行时, 可执行上述根据本发明的实施例 的数据处理方法。 相应地, 用于承载这种程序产品的例如磁盘、 光盘、 磁 光盘、 半导体存储器等的各种存储介质也包括在本发明的公开中。
[130] 在上面对本发明具体实施例的描述中,针对一种实施方式描述和 /或示 出的特征可以以相同或类似的方式在一个或更多个其它实施方式中使用, 与其它实施方式中的特征相组合, 或替代其它实施方式中的特征。
[131] 此外, 本发明的各实施例的方法不限于按照说明书中描述的或者附图 中示出的时间顺序来执行, 也可以按照其他的时间顺序、 并行地或独立地 执行。 因此, 本说明书中描述的方法的执行顺序不对本发明的技术范围构 成限制。
[132] 此外, 显然, 根据本发明的上述方法的各个操作过程也可以以存储在 各种机器可读的存储介质中的计算机可执行程序的方式实现。
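[133] Moreover, the object of the present invention may also be achieved as follows: a storage medium storing the above executable program code is supplied directly or indirectly to a system or device, and a computer or central processing unit (CPU) in that system or device reads and executes the program code.
[134] In this case, as long as the system or device has the function of executing programs, embodiments of the present invention are not limited to a program, and the program may take any form, for example an object program, a program executed by an interpreter, or a script program supplied to an operating system.
[135] These machine-readable storage media include, but are not limited to: various memories and storage units, semiconductor devices, disk units such as optical, magnetic and magneto-optical disks, and other media suitable for storing information.
[136] In addition, the present invention may also be implemented by a client computer that connects to a corresponding website on the Internet, downloads and installs the computer program code according to the present invention, and then executes the program.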
[137] 最后, 还需要说明的是, 在本文中, 诸如左和右、 第一和第二等之类 的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来, 而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或 者顺序。 而且, 术语 "包括"、 "包含" 或者其任何其他变体意在涵盖非排 他性的包含, 从而使得包括一系列要素的过程、 方法、 物品或者设备不仅 包括那些要素, 而且还包括没有明确列出的其他要素, 或者是还包括为这 种过程、 方法、 物品或者设备所固有的要素。 在没有更多限制的情况下, 由语句 "包括一个…… " 限定的要素, 并不排除在包括所述要素的过程、 方法、 物品或者设备中还存在另外的相同要素。
[138] 综上,在根据本发明的实施例中,本发明提供了如下方案但不限于此:
[139] 附记 1. 一种数据处理装置, 包括:
语义角色标注单元, 用于对源语言语句以及作为其译文的候选调序结 果的多个目标语言语句分别进行语义角色标注, 以获得源语言语义角色序 列以及多个目标语言语义角色序列;
匹配单元, 用于基于预定双语语料库获得所述源语言语义角色序列分 别与每个所述目标语言语义角色序列之间的匹配分数, 其中, 所述预定双 语语料库包括多个经过语义角色标注的、 针对源语言和目标语言的双语句 对; 以及 调序结果确定单元, 用于将所述匹配分数最高的目标语言语义角色序 列对应的候选调序结果确定为最终调序结果。
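[140] Supplementary Note 2. The data processing apparatus according to Supplementary Note 1, wherein the matching unit includes: a correlation obtaining subunit, configured to, for each target-language predicate in each of the target-language semantic role sequences, use the predetermined bilingual corpus to obtain the degree of correlation between at least some subsequences of that target-language semantic role sequence that contain the target-language predicate and the source-language semantic role sequence; and
a matching score determining subunit, configured to, for each of the target-language semantic role sequences, determine the matching score between that target-language semantic role sequence and the source-language semantic role sequence based on the obtained degrees of correlation related to that target-language semantic role sequence.
[141] Supplementary Note 3. The data processing apparatus according to Supplementary Note 2, wherein the correlation obtaining subunit is configured to obtain, for each target-language predicate in each of the target-language semantic role sequences, any one or more of the following degrees of correlation:
the degree of correlation between the subsequence of that target-language semantic role sequence that includes only the target-language predicate and the source-language semantic role sequence;
the degree of correlation between a subsequence of that target-language semantic role sequence that includes at least one semantic role located to the left of the target-language predicate and the source-language semantic role sequence;
the degree of correlation between a subsequence of that target-language semantic role sequence that includes the target-language predicate and at least one semantic role located to its left and the source-language semantic role sequence; and
the degree of correlation between the source-language semantic role sequence and at least two of the following subsequences of that target-language semantic role sequence: the subsequence that includes only the target-language predicate, a subsequence that includes at least one semantic role located to the left of the target-language predicate, and a subsequence that includes the target-language predicate and at least one semantic role located to its left.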
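[142] Supplementary Note 4. The data processing apparatus according to Supplementary Note 2 or 3, wherein the matching score determining subunit is configured to obtain the matching score according to the following formula:

$$\mathrm{score}(S,T) = P(V_T \mid S)\cdot P(a_1 \mid V_T,S)\cdot\prod_{i=2}^{h}P(a_i \mid a_{i-1},V_T,S)\cdot P(b_1 \mid V_T,S)\cdot\prod_{j=2}^{k}P(b_j \mid V_T,b_{j-1},S),$$

where $S$ is the source-language semantic role sequence, $T$ is the target-language semantic role sequence, $V_T$ is the target-language predicate in $T$, $a_i$ is the $i$-th semantic role located to the left of $V_T$ in $T$, $h$ is the number of semantic roles to the left of $V_T$, $b_j$ is the $j$-th semantic role located to the right of $V_T$ in $T$, $k$ is the number of semantic roles to the right of $V_T$, $P(V_T \mid S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequence $\{V_T\}$ of $T$, $P(a_1 \mid V_T,S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T\}$ and $\{a_1,V_T\}$ of $T$, $P(a_i \mid a_{i-1},V_T,S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{a_{i-1},V_T\}$ and $\{a_i,a_{i-1},V_T\}$ of $T$, $P(b_1 \mid V_T,S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T\}$ and $\{V_T,b_1\}$ of $T$, and $P(b_j \mid V_T,b_{j-1},S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T,b_{j-1}\}$ and $\{V_T,b_{j-1},b_j\}$ of $T$.
[143] Supplementary Note 5. The data processing apparatus according to Supplementary Note 4, wherein the correlation obtaining subunit is configured to obtain $P(V_T \mid S)$, $P(a_1 \mid V_T,S)$, $P(a_i \mid a_{i-1},V_T,S)$, $P(b_1 \mid V_T,S)$ and $P(b_j \mid V_T,b_{j-1},S)$ by maximum likelihood estimation.
[144] Supplementary Note 6. The data processing apparatus according to Supplementary Note 5, wherein the correlation obtaining subunit is configured to obtain $P(V_T \mid S)$, $P(a_1 \mid V_T,S)$, $P(a_i \mid a_{i-1},V_T,S)$, $P(b_1 \mid V_T,S)$ and $P(b_j \mid V_T,b_{j-1},S)$ according to the following formulas:

$$P(V_T \mid S) = \frac{C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

$$P(a_1 \mid V_T,S) = \frac{C(a_1,V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

$$P(a_i \mid a_{i-1},V_T,S) = \frac{C(a_i,a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

$$P(b_1 \mid V_T,S) = \frac{C(V_T,b_1,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

$$P(b_j \mid V_T,b_{j-1},S) = \frac{C(V_T,b_{j-1},b_j,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,b_{j-1},a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

where $V_S$ is the source-language predicate in $S$, $a'_{h'},\dots,a'_1$ are the $h'$ semantic roles located to the left of $V_S$ in $S$, and $b'_1,\dots,b'_{k'}$ are the $k'$ semantic roles located to the right of $V_S$ in $S$; $C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T\}$ appears in the predetermined target-language sentences of the bilingual sentence pairs to which the predetermined source-language sentences containing the sequence $\{a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'}\}$ belong; $C(a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of predetermined source-language sentences containing the sequence $\{a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'}\}$; $C(a_1,V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{a_1,V_T\}$ appears in said predetermined target-language sentences; $C(a_i,a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{a_i,a_{i-1},V_T\}$ appears in said predetermined target-language sentences; $C(a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{a_{i-1},V_T\}$ appears in said predetermined target-language sentences; $C(V_T,b_1,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T,b_1\}$ appears in said predetermined target-language sentences; $C(V_T,b_{j-1},b_j,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T,b_{j-1},b_j\}$ appears in said predetermined target-language sentences; and $C(V_T,b_{j-1},a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T,b_{j-1}\}$ appears in said predetermined target-language sentences.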
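[145] Supplementary Note 7. The data processing apparatus according to Supplementary Note 2, wherein
the semantic role labeling unit is configured to, where the source-language sentence contains at least two source-language predicates, take the sequence composed of each source-language predicate and its related semantic roles as the source-language semantic role sequence corresponding to that source-language predicate, and take the sequence composed of the target-language predicate corresponding to that source-language predicate and its related semantic roles as the target-language semantic role sequence corresponding to that source-language predicate;
the matching unit is configured to obtain the matching score between the source-language semantic role sequence and the target-language semantic role sequence that correspond to the same source-language predicate; and
the reordering result determining unit is configured to determine the final reordering result by combining the matching scores for each source-language predicate.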
[146] 附记 8.根据附记 1-7中任一项所述的数据处理装置,其中, 所述源语 言为汉语, 所述目标语言为英语。
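[147] Supplementary Note 9. A data processing method, including:
performing semantic role labeling on a source-language sentence and on a plurality of target-language sentences serving as candidate reordering results of its translation, so as to obtain a source-language semantic role sequence and a plurality of target-language semantic role sequences;
obtaining, based on a predetermined bilingual corpus, a matching score between the source-language semantic role sequence and each of the target-language semantic role sequences, wherein the predetermined bilingual corpus includes a plurality of semantic-role-labeled bilingual sentence pairs in the source language and the target language; and
determining, as the final reordering result, the candidate reordering result corresponding to the target-language semantic role sequence with the highest matching score.
[148] Supplementary Note 10. The data processing method according to Supplementary Note 9, wherein the step of obtaining the matching score between the source-language semantic role sequence and each of the target-language semantic role sequences includes:
for each target-language predicate in each of the target-language semantic role sequences, using the predetermined bilingual corpus to obtain the degree of correlation between at least some subsequences of that target-language semantic role sequence that contain the target-language predicate and the source-language semantic role sequence; and
for each of the target-language semantic role sequences, determining the matching score between that target-language semantic role sequence and the source-language semantic role sequence based on the obtained degrees of correlation related to that target-language semantic role sequence.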
[149] 附记 11.根据附记 10所述的数据处理方法,其中,针对每个所述目标 语言语义角色序列中的每个目标语言谓词, 获得如下多种相关程度中的任 一种或多种:
该目标语言语义角色序列的仅包括该目标语言谓词的子序列与所述源 语言语义角色序列之间的相关程度;
该目标语言语义角色序列的包括位于该目标语言谓词左侧的至少一个 语义角色的子序列与所述源语言语义角色序列之间的相关程度;
该目标语言语义角色序列的包括该目标语言谓词和位于该目标语言谓 词左侧的至少一个语义角色的子序列与所述源语言语义角色序列之间的 相关程度; 以及
该目标语言语义角色序列的仅包括该目标语言谓词的子序列、 包括位 于该目标语言谓词左侧的至少一个语义角色的子序列、 以及包括该目标语 言谓词和位于该目标语言谓词左侧的至少一个语义角色的子序列中的至 少两种子序列与所述源语言语义角色序列之间的相关程度。
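[150] Supplementary Note 12. The data processing method according to Supplementary Note 10 or 11, wherein the matching score is determined according to the following formula:

$$\mathrm{score}(S,T) = P(V_T \mid S)\cdot P(a_1 \mid V_T,S)\cdot\prod_{i=2}^{h}P(a_i \mid a_{i-1},V_T,S)\cdot P(b_1 \mid V_T,S)\cdot\prod_{j=2}^{k}P(b_j \mid V_T,b_{j-1},S),$$

where $S$ is the source-language semantic role sequence, $T$ is the target-language semantic role sequence, $V_T$ is the target-language predicate in $T$, $a_i$ is the $i$-th semantic role located to the left of $V_T$ in $T$, $h$ is the number of semantic roles to the left of $V_T$, $b_j$ is the $j$-th semantic role located to the right of $V_T$ in $T$, $k$ is the number of semantic roles to the right of $V_T$, $P(V_T \mid S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequence $\{V_T\}$ of $T$, $P(a_1 \mid V_T,S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T\}$ and $\{a_1,V_T\}$ of $T$, $P(a_i \mid a_{i-1},V_T,S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{a_{i-1},V_T\}$ and $\{a_i,a_{i-1},V_T\}$ of $T$, $P(b_1 \mid V_T,S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T\}$ and $\{V_T,b_1\}$ of $T$, and $P(b_j \mid V_T,b_{j-1},S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T,b_{j-1}\}$ and $\{V_T,b_{j-1},b_j\}$ of $T$.
[151] Supplementary Note 13. The data processing method according to Supplementary Note 12, wherein $P(V_T \mid S)$, $P(a_1 \mid V_T,S)$, $P(a_i \mid a_{i-1},V_T,S)$, $P(b_1 \mid V_T,S)$ and $P(b_j \mid V_T,b_{j-1},S)$ are obtained by maximum likelihood estimation.
[152] Supplementary Note 14. The data processing method according to Supplementary Note 13, wherein $P(V_T \mid S)$, $P(a_1 \mid V_T,S)$, $P(a_i \mid a_{i-1},V_T,S)$, $P(b_1 \mid V_T,S)$ and $P(b_j \mid V_T,b_{j-1},S)$ are obtained according to the following formulas, respectively:

$$P(V_T \mid S) = \frac{C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

$$P(a_1 \mid V_T,S) = \frac{C(a_1,V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

$$P(a_i \mid a_{i-1},V_T,S) = \frac{C(a_i,a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

$$P(b_1 \mid V_T,S) = \frac{C(V_T,b_1,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

$$P(b_j \mid V_T,b_{j-1},S) = \frac{C(V_T,b_{j-1},b_j,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,b_{j-1},a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

where $V_S$ is the source-language predicate in $S$, $a'_{h'},\dots,a'_1$ are the $h'$ semantic roles located to the left of $V_S$ in $S$, and $b'_1,\dots,b'_{k'}$ are the $k'$ semantic roles located to the right of $V_S$ in $S$; $C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T\}$ appears in the predetermined target-language sentences of the bilingual sentence pairs to which the predetermined source-language sentences containing the sequence $\{a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'}\}$ belong; $C(a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of predetermined source-language sentences containing the sequence $\{a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'}\}$; $C(a_1,V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{a_1,V_T\}$ appears in said predetermined target-language sentences; $C(a_i,a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{a_i,a_{i-1},V_T\}$ appears in said predetermined target-language sentences; $C(a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{a_{i-1},V_T\}$ appears in said predetermined target-language sentences; $C(V_T,b_1,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T,b_1\}$ appears in said predetermined target-language sentences; $C(V_T,b_{j-1},b_j,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T,b_{j-1},b_j\}$ appears in said predetermined target-language sentences; and $C(V_T,b_{j-1},a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T,b_{j-1}\}$ appears in said predetermined target-language sentences.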
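[153] Supplementary Note 15. The data processing method according to Supplementary Note 10, further including:
where the source-language sentence contains at least two source-language predicates, taking the sequence composed of each source-language predicate and its related semantic roles as the source-language semantic role sequence corresponding to that source-language predicate, and taking the sequence composed of the target-language predicate corresponding to that source-language predicate and its related semantic roles as the target-language semantic role sequence corresponding to that source-language predicate;
obtaining the matching score between the source-language semantic role sequence and the target-language semantic role sequence that correspond to the same source-language predicate; and
determining the final reordering result by combining the matching scores for each source-language predicate.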
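[154] Supplementary Note 16. The data processing method according to any one of Supplementary Notes 9 to 15, wherein the source language is Chinese and the target language is English.
[155] Supplementary Note 17. An electronic device, including the data processing apparatus according to any one of Supplementary Notes 1 to 8.
[156] Supplementary Note 18. The electronic device according to Supplementary Note 17, wherein the electronic device is any one of the following devices:
a computer; a tablet computer; a personal digital assistant; a multimedia playback device; a mobile phone; an electronic dictionary; and an e-book reader.
[157] Supplementary Note 19. A program product storing machine-readable instruction code, the program product, when executed, causing the machine to perform the data processing method according to any one of Supplementary Notes 9 to 16.
[158] Supplementary Note 20. A computer-readable storage medium on which the program product according to Supplementary Note 19 is stored.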

Claims

权利 要求 书
1. 一种数据处理装置, 包括:
语义角色标注单元, 用于对源语言语句以及作为其译文的候选调序结 果的多个目标语言语句分别进行语义角色标注, 以获得源语言语义角色序 列以及多个目标语言语义角色序列;
匹配单元, 用于基于预定双语语料库获得所述源语言语义角色序列分 别与每个所述目标语言语义角色序列之间的匹配分数, 其中, 所述预定双 语语料库包括多个经过语义角色标注的、 针对源语言和目标语言的双语句 对; 以及
调序结果确定单元, 用于将所述匹配分数最高的目标语言语义角色序 列对应的候选调序结果确定为最终调序结果。
2.根据权利要求 1所述的数据处理装置, 其中, 所述匹配单元包括: 相关程度获得子单元, 用于针对每个所述目标语言语义角色序列中的 每个目标语言谓词, 利用所述预定双语语料库, 获得该目标语言语义角色 序列的包含该目标语言谓词的至少部分子序列与所述源语言语义角色序 列之间的相关程度; 以及
匹配分数确定子单元, 用于针对每个所述目标语言语义角色序列, 基 于获得的与该目标语言语义角色序列有关的所述相关程度来确定该目标 语言语义角色序列与所述源语言语义角色序列之间的匹配分数。
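3. The data processing apparatus according to claim 2, wherein the correlation obtaining subunit is configured to obtain, for each target-language predicate in each of the target-language semantic role sequences, any one or more of the following degrees of correlation:
the degree of correlation between the subsequence of that target-language semantic role sequence that includes only the target-language predicate and the source-language semantic role sequence;
the degree of correlation between a subsequence of that target-language semantic role sequence that includes at least one semantic role located to the left of the target-language predicate and the source-language semantic role sequence;
the degree of correlation between a subsequence of that target-language semantic role sequence that includes the target-language predicate and at least one semantic role located to its left and the source-language semantic role sequence; and
the degree of correlation between the source-language semantic role sequence and at least two of the following subsequences of that target-language semantic role sequence: the subsequence that includes only the target-language predicate, a subsequence that includes at least one semantic role located to the left of the target-language predicate, and a subsequence that includes the target-language predicate and at least one semantic role located to its left.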
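4. The data processing apparatus according to claim 2 or 3, wherein the matching score determining subunit is configured to obtain the matching score according to the following formula:

$$\mathrm{score}(S,T) = P(V_T \mid S)\cdot P(a_1 \mid V_T,S)\cdot\prod_{i=2}^{h}P(a_i \mid a_{i-1},V_T,S)\cdot P(b_1 \mid V_T,S)\cdot\prod_{j=2}^{k}P(b_j \mid V_T,b_{j-1},S),$$

where $S$ is the source-language semantic role sequence, $T$ is the target-language semantic role sequence, $V_T$ is the target-language predicate in $T$, $a_i$ is the $i$-th semantic role located to the left of $V_T$ in $T$, $h$ is the number of semantic roles to the left of $V_T$, $b_j$ is the $j$-th semantic role located to the right of $V_T$ in $T$, $k$ is the number of semantic roles to the right of $V_T$, $P(V_T \mid S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequence $\{V_T\}$ of $T$, $P(a_1 \mid V_T,S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T\}$ and $\{a_1,V_T\}$ of $T$, $P(a_i \mid a_{i-1},V_T,S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{a_{i-1},V_T\}$ and $\{a_i,a_{i-1},V_T\}$ of $T$, $P(b_1 \mid V_T,S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T\}$ and $\{V_T,b_1\}$ of $T$, and $P(b_j \mid V_T,b_{j-1},S)$ is the conditional probability representing the degree of correlation between $S$ and the subsequences $\{V_T,b_{j-1}\}$ and $\{V_T,b_{j-1},b_j\}$ of $T$.
5. The data processing apparatus according to claim 4, wherein the correlation obtaining subunit is configured to obtain $P(V_T \mid S)$, $P(a_1 \mid V_T,S)$, $P(a_i \mid a_{i-1},V_T,S)$, $P(b_1 \mid V_T,S)$ and $P(b_j \mid V_T,b_{j-1},S)$ by maximum likelihood estimation.
6. The data processing apparatus according to claim 5, wherein the correlation obtaining subunit is configured to obtain $P(V_T \mid S)$, $P(a_1 \mid V_T,S)$, $P(a_i \mid a_{i-1},V_T,S)$, $P(b_1 \mid V_T,S)$ and $P(b_j \mid V_T,b_{j-1},S)$ according to the following formulas:

$$P(V_T \mid S) = \frac{C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

$$P(a_1 \mid V_T,S) = \frac{C(a_1,V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

$$P(a_i \mid a_{i-1},V_T,S) = \frac{C(a_i,a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

$$P(b_1 \mid V_T,S) = \frac{C(V_T,b_1,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

$$P(b_j \mid V_T,b_{j-1},S) = \frac{C(V_T,b_{j-1},b_j,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}{C(V_T,b_{j-1},a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})}$$

where $V_S$ is the source-language predicate in $S$, $a'_{h'},\dots,a'_1$ are the $h'$ semantic roles located to the left of $V_S$ in $S$, and $b'_1,\dots,b'_{k'}$ are the $k'$ semantic roles located to the right of $V_S$ in $S$; $C(V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T\}$ appears in the predetermined target-language sentences of the bilingual sentence pairs to which the predetermined source-language sentences containing the sequence $\{a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'}\}$ belong; $C(a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of predetermined source-language sentences containing the sequence $\{a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'}\}$; $C(a_1,V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{a_1,V_T\}$ appears in said predetermined target-language sentences; $C(a_i,a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{a_i,a_{i-1},V_T\}$ appears in said predetermined target-language sentences; $C(a_{i-1},V_T,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{a_{i-1},V_T\}$ appears in said predetermined target-language sentences; $C(V_T,b_1,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T,b_1\}$ appears in said predetermined target-language sentences; $C(V_T,b_{j-1},b_j,a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T,b_{j-1},b_j\}$ appears in said predetermined target-language sentences; and $C(V_T,b_{j-1},a'_{h'},\dots,a'_1,V_S,b'_1,\dots,b'_{k'})$ denotes the number of times the sequence $\{V_T,b_{j-1}\}$ appears in said predetermined target-language sentences.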
7.根据权利要求 2所述的数据处理装置, 其中,
所述语义角色标注单元用于在所述源语言语句包含至少两个源语言谓 词的情况下, 将每个源语言谓词及其相关的语义角色所组成的序列作为与 该源语言谓词对应的源语言语义角色序列, 并将与该源语言谓词对应的目 标语言谓词及其相关的语义角色所组成的序列作为与该源语言谓词对应 的目标语言语义角色序列; 所述匹配单元用于获得与同一个源语言谓词对应的源语言语义角色 序列和目标语言语义角色序列之间的匹配分数; 以及 所述调序结果确定单元用于通过结合针对每个源语言谓词的匹配分 数来确定最终调序结果。
8.根据权利要求 1-7中任一项所述的数据处理装置, 其中, 所述源语 言为汉语, 所述目标语言为英语。
9. 一种数据处理方法, 包括:
对源语言语句以及作为其译文的候选调序结果的多个目标语言语句 分别进行语义角色标注, 以获得源语言语义角色序列以及多个目标语言语 义角色序列;
基于预定双语语料库获得所述源语言语义角色序列分别与每个所述 目标语言语义角色序列之间的匹配分数, 其中, 所述预定双语语料库包括 多个经过语义角色标注的、 针对源语言和目标语言的双语句对; 以及
将所述匹配分数最高的目标语言语义角色序列对应的候选调序结果 确定为最终调序结果。
10. 一种电子设备, 包括如权利要求 1-8 中任一项所述的数据处理装 置。
PCT/CN2014/075776 2013-04-19 2014-04-21 数据处理装置、数据处理方法以及电子设备 WO2014169857A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2016508001A JP2016519370A (ja) 2013-04-19 2014-04-21 データ処理装置、データ処理方法及び電子機器

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201310138955.9 2013-04-19
CN201310138955.9A CN104111917B (zh) 2013-04-19 2013-04-19 数据处理装置、数据处理方法以及电子设备

Publications (1)

Publication Number Publication Date
WO2014169857A1 true WO2014169857A1 (zh) 2014-10-23

Family

ID=51708713

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/075776 WO2014169857A1 (zh) 2013-04-19 2014-04-21 数据处理装置、数据处理方法以及电子设备

Country Status (3)

Country Link
JP (1) JP2016519370A (zh)
CN (1) CN104111917B (zh)
WO (1) WO2014169857A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808521A (zh) * 2016-03-04 2016-07-27 北京工业大学 一种基于语义特征的语义关系模式获取方法和系统
CN107451158A (zh) * 2016-06-01 2017-12-08 中国科学院地理科学与资源研究所 一种网络文本中交通事件语义角色抽取方法
CN109256128A (zh) * 2018-11-19 2019-01-22 广东小天才科技有限公司 一种根据用户语料自动判定用户角色的方法及系统
CN111460118A (zh) * 2020-03-26 2020-07-28 聚好看科技股份有限公司 一种人工智能冲突语义识别方法及装置

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090106015A1 (en) * 2007-10-23 2009-04-23 Microsoft Corporation Statistical machine translation processing
CN101593174A (zh) * 2009-03-11 2009-12-02 林勋准 一种机器翻译方法及系统
CN103020045A (zh) * 2012-12-11 2013-04-03 中国科学院自动化研究所 一种基于谓词论元结构的统计机器翻译方法
CN103020040A (zh) * 2011-09-27 2013-04-03 富士通株式会社 源语言改写处理方法和设备及机器翻译系统

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005025474A (ja) * 2003-07-01 2005-01-27 Advanced Telecommunication Research Institute International 機械翻訳装置、コンピュータプログラム及びコンピュータ
CN101042692B (zh) * 2006-03-24 2010-09-22 富士通株式会社 基于语义预测的译文获取方法和设备
CN101414310A (zh) * 2008-10-17 2009-04-22 山西大学 一种自然语言搜索的方法和装置
JP5836708B2 (ja) * 2011-09-02 2015-12-24 キヤノン株式会社 端末装置、情報処理方法及びプログラム
JP5780670B2 (ja) * 2011-09-05 2015-09-16 日本電信電話株式会社 翻訳装置、方法、及びプログラム、並びに翻訳モデル学習装置、方法、及びプログラム
JP5911098B2 (ja) * 2012-04-09 2016-04-27 国立研究開発法人情報通信研究機構 翻訳装置、およびプログラム

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808521A (zh) * 2016-03-04 2016-07-27 北京工业大学 一种基于语义特征的语义关系模式获取方法和系统
CN107451158A (zh) * 2016-06-01 2017-12-08 中国科学院地理科学与资源研究所 一种网络文本中交通事件语义角色抽取方法
CN107451158B (zh) * 2016-06-01 2021-01-19 中国科学院地理科学与资源研究所 一种网络文本中交通事件语义角色抽取方法
CN109256128A (zh) * 2018-11-19 2019-01-22 广东小天才科技有限公司 一种根据用户语料自动判定用户角色的方法及系统
CN111460118A (zh) * 2020-03-26 2020-07-28 聚好看科技股份有限公司 一种人工智能冲突语义识别方法及装置
CN111460118B (zh) * 2020-03-26 2023-10-20 聚好看科技股份有限公司 一种人工智能冲突语义识别方法及装置

Also Published As

Publication number Publication date
CN104111917A (zh) 2014-10-22
JP2016519370A (ja) 2016-06-30
CN104111917B (zh) 2017-04-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14785526

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2016508001

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14785526

Country of ref document: EP

Kind code of ref document: A1