WO2019119852A1 - Language processing method and device - Google Patents

Language processing method and device

Info

Publication number
WO2019119852A1
Authority
WO
WIPO (PCT)
Prior art keywords
target language
language
extraction rule
sentence
target
Prior art date
Application number
PCT/CN2018/102498
Other languages
English (en)
French (fr)
Inventor
邢超
陈晓
蔡振林
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP18890375.1A priority Critical patent/EP3719676A4/en
Publication of WO2019119852A1 publication Critical patent/WO2019119852A1/zh
Priority to US16/907,783 priority patent/US11704505B2/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/263 Language identification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/51 Translation evaluation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation

Definitions

  • The embodiments of the present invention relate to the field of computer technologies, and in particular, to a language processing method and device.
  • Key information in a natural language sentence is extracted by using an extraction rule that a linguistic expert has summarized for that natural language.
  • For example, if the date extraction rule summarized by the linguistic expert is [four digits] year [one to two digits] month [one to two digits] day, the system can extract the key information in dates from sentences according to that rule.
  • When the system needs to recognize multiple natural languages, one extraction rule cannot be applied to all of them because of the grammatical differences between natural languages. For each natural language, a linguistic expert of that language is required to summarize the corresponding extraction rules.
  • Because each natural language requires its own linguistic expert to summarize the extraction rules, the time and labor costs are excessive.
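  • As an illustration of such a hand-written rule, the date rule above can be sketched as a regular expression. This is a minimal sketch only: it assumes that "year", "month" and "day" in the rule stand for the Chinese markers 年, 月 and 日, and the names DATE_RULE and extract_dates are hypothetical.

```python
import re

# Assumption: "year", "month" and "day" in the rule stand for the Chinese
# markers 年, 月 and 日. DATE_RULE and extract_dates are hypothetical names.
DATE_RULE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")


def extract_dates(sentence):
    """Return a (year, month, day) tuple for every date the rule matches."""
    return [tuple(int(g) for g in m.groups()) for m in DATE_RULE.finditer(sentence)]


print(extract_dates("会议定于2018年8月27日举行"))
```

  • A rule of this kind depends on the date format of one language, which is why it cannot be reused unchanged for another language.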
  • The present application provides a language processing method and device that can be used to solve the following problem in the prior art: when extraction rules are needed for multiple natural languages, each natural language requires its own linguistic expert to summarize them, which consumes excessive time and labor.
  • The present application provides a language processing method, which includes: obtaining n groups of mutually translated sentence pairs of a source language and a target language, where each group includes a source language sentence and a target language sentence that are translations of each other, and n is an integer greater than 1; extracting a source language segment from each source language sentence of the n groups by using an extraction rule of the source language; extracting, from each target language sentence of the n groups, a target language segment that is a translation of the source language segment; and generating an extraction rule of the target language according to the at least n target language segments extracted from the n target language sentences.
  • At least n target language segments are extracted by using the extraction rule of the source language and the n groups of mutually translated sentence pairs, and the extraction rule of the target language is then generated from those segments. Starting from an existing extraction rule of the source language, the extraction rule of the target language is generated automatically and does not need to be summarized by a linguistic expert, which saves labor and time.
  • Extracting, from each target language sentence of the n groups of mutually translated sentence pairs, the target language segment that is a translation of the source language segment includes: for each group, obtaining, according to the word alignment relationship of the sentence pair, the translated words in the target language sentence that correspond to the words contained in the source language segment extracted from the source language sentence; and combining the translated words to obtain the target language segment of the target language sentence of that pair.
  • The word alignment relationship of the sentence pair, together with the source language segment, allows the target language segment of the target language sentence to be obtained accurately.
  • Each target language segment includes words of k domains, where k is a positive integer. Generating the extraction rule of the target language according to the at least n target language segments extracted from the n target language sentences includes: merging the words that belong to the same domain across the at least n target language segments to obtain the merged words of each domain, where words belonging to the same domain are words that have the same semantics; and generalizing the merged words of each domain to obtain the extraction rule of the target language.
  • The extraction rule of the target language is generated automatically by merging and generalizing the words that belong to the same domain across the at least n target language segments.
  • The method further includes: applying the extraction rule of the source language to a source language corpus to obtain a number a of source language segments, and applying the extraction rule of the target language to a target language corpus to obtain a number b of target language segments, where the source language corpus and the target language corpus contain the same number of sentences, the sentences are translations of each other, and a and b are integers; detecting whether the a source language segments and the b target language segments meet a preset condition; and updating the extraction rule of the target language if the preset condition is met.
  • The preset condition includes: a and b are not equal; and/or there is at least one group of mutually translated sentence pairs for which the source language segment and the target language segment extracted from the pair do not match semantically.
  • The extraction rule of the target language includes an extraction rule corresponding to each of at least one domain, and the extraction rule corresponding to each domain is used to extract words of one semantic of the target language. Updating the extraction rule of the target language includes: reducing the degree of generalization of the extraction rule corresponding to a first domain in the extraction rule of the target language; and/or expanding the degree of generalization of the extraction rule corresponding to a second domain in the extraction rule of the target language.
  • Reducing or expanding the degree of generalization updates the extraction rule of the target language precisely, which ensures its accuracy.
  • The present application provides a language processing device comprising units or means for performing the language processing method provided by the first aspect or any of the possible designs of the first aspect.
  • The present application provides a language processing device, including a processor and a memory, where the memory stores a computer-readable program, and the processor runs the program in the memory to complete the language processing method provided by the first aspect or any of the possible designs of the first aspect.
  • The present application provides a computer storage medium for storing computer software instructions used by a language processing device, including a program designed to perform the above aspects.
  • The present application provides a computer program product which, when executed, performs the language processing method provided by the first aspect or any of the possible designs of the first aspect.
  • The present application provides a chip comprising programmable logic circuitry and/or program instructions which, when the chip is in operation, implement the language processing method provided by the first aspect or any of the possible designs of the first aspect.
  • The present application provides a processing apparatus comprising at least one circuit for performing the language processing method provided by the first aspect or any of the possible designs of the first aspect.
  • The present application provides a processing apparatus for implementing the language processing method provided by the first aspect or any of the possible designs of the first aspect.
  • At least n target language segments are extracted by using the extraction rule of the source language and the n groups of mutually translated sentence pairs of the source language and the target language, and the extraction rule of the target language is then generated according to the at least n target language segments.
  • Based on an existing extraction rule of the source language, the extraction rule of the target language can be generated automatically and does not need to be summarized by a linguistic expert, which saves labor and time.
  • FIG. 1 is a flowchart of a language processing method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a word alignment relationship provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a language processing method provided by another embodiment of the present application.
  • FIG. 4 is a schematic diagram of generating an extraction rule of a target language according to an embodiment of the present application.
  • FIG. 5 is a schematic block diagram of a language processing device according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a language processing device according to an embodiment of the present application.
  • The execution subject of the method provided by the embodiments of the present application may be a computer device, for example a PC (Personal Computer) or a server.
  • When the computer device is the execution subject of the method provided by the embodiments of the present application, it may also be referred to as a language processing device.
  • The computer device includes a database that stores corpora of multiple natural languages and extraction rules. For convenience of description, the following method embodiments describe the execution subject of each step as a computer device, but the present invention is not limited thereto.
  • Refer to FIG. 1, which is a flowchart of a language processing method provided by an embodiment of the present application. The method can include the following steps.
  • Step 101: Acquire n groups of mutually translated sentence pairs of the source language and the target language.
  • The source language is a natural language for which one or more extraction rules have already been summarized, and the target language is a natural language for which extraction rules need to be generated.
  • When the computer device needs to generate an extraction rule of the target language, it first obtains n groups of mutually translated sentence pairs of the source language and the target language, where n is an integer greater than 1.
  • A mutually translated sentence pair of the source language and the target language is a pair consisting of a source language sentence and a target language sentence that are translations of each other.
  • The words in a mutually translated sentence pair are also translations of each other, and the correspondence between the mutually translated words in the pair is called a word alignment relationship.
  • For example, in one group of mutually translated sentence pairs of the source language and the target language, the source language sentence is "Please help me search for photos taken during May 1st" and the corresponding target language sentence is "Find the picture taken in May 1st for me".
  • The words in the pair are also translations of each other: for example, "search" and "find", "photo" and "picture", and "I" and "me".
  • The computer device may obtain corpora of the source language and the target language.
  • The source language corpus and the target language corpus contain the same number of sentences, and the sentences are translations of each other. That is, the corpora of the source language and the target language include multiple groups of mutually translated sentence pairs, and these groups include the above n groups of mutually translated sentence pairs.
  • Alternatively, the computer device directly obtains n groups of mutually translated sentence pairs of the source language and the target language.
  • The computer device stores a correspondence between the mutually translated sentence pairs in the corpora of the source language and the target language and the extraction rules of the source language.
  • According to this correspondence, the computer device can directly acquire the n groups of mutually translated sentence pairs that correspond to an extraction rule.
  • The computer device can determine the word alignment relationship in the n groups of mutually translated sentence pairs by using a word alignment model.
  • For example, the computer device can determine the word alignment relationship in each group of mutually translated sentence pairs by using an IBM (International Business Machines Corporation) model, or by using an attention model. The specific type of word alignment model used by the computer device is not limited in the embodiments of the present application.
  • The computer device may determine the word alignment relationship in the n groups of mutually translated sentence pairs by using only one word alignment model.
  • Alternatively, the computer device determines the word alignment relationship in the n groups of mutually translated sentence pairs by using multiple word alignment models. For one group of mutually translated sentence pairs, the computer device first obtains the word alignment relationship determined by each of the word alignment models, and then selects one of them as the word alignment relationship of that sentence pair according to the respective weights of the models.
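  • The text does not say how the weights are used to select one alignment. One possible reading, shown here purely as an assumption, is a weighted vote in which identical alignments proposed by different models pool their weights. The function name, model names and weights below are hypothetical.

```python
from collections import defaultdict


def select_alignment(candidates, weights):
    """Pick one alignment from several models by weighted voting.

    candidates: dict mapping model name -> alignment, where an alignment is a
                set of (source_index, target_index) pairs.
    weights:    dict mapping model name -> weight of that model.
    """
    score = defaultdict(float)
    for model, alignment in candidates.items():
        # Identical alignments proposed by different models pool their weights.
        score[frozenset(alignment)] += weights[model]
    best = max(score, key=score.get)
    return set(best)


# Hypothetical model names and weights, for illustration only.
candidates = {
    "ibm": {(0, 0), (1, 2)},
    "attention": {(0, 0), (1, 1)},
    "hypothetical_third_model": {(0, 0), (1, 1)},
}
weights = {"ibm": 0.5, "attention": 0.3, "hypothetical_third_model": 0.3}
selected = select_alignment(candidates, weights)
print(selected)
```

  • Simply taking the alignment of the highest-weighted model would be an equally valid reading of the text.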
  • Step 102: Extract a source language segment from each source language sentence of the n groups of mutually translated sentence pairs by using the extraction rule of the source language.
  • Every source language sentence in the n groups of mutually translated sentence pairs corresponds to the same extraction rule of the source language.
  • The computer device uses this extraction rule to extract the source language segments of each source language sentence of the n groups of mutually translated sentence pairs.
  • For example, the extraction rule for month information of the source language corresponds to the source language sentences of two groups of mutually translated sentence pairs: "Today is May" and "Tomorrow is May or June". According to the month extraction rule of the source language, the computer device extracts the source language segment "May" from "Today is May", and the source language segments "May" and "June" from "Tomorrow is May or June".
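  • The month example can be sketched as follows. This is a sketch under assumptions: the rule is written as a regular expression over the English renderings of the example sentences, and MONTH_RULE and extract_segments are hypothetical names.

```python
import re

# Hypothetical source-language rule for month information, written for the
# English renderings of the example sentences.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
MONTH_RULE = re.compile(r"\b(?:" + MONTHS + r")\b")


def extract_segments(rule, sentence):
    """Return every segment of the sentence that the rule matches, in order."""
    return rule.findall(sentence)


source_sentences = ["Today is May", "Tomorrow is May or June"]
segments = [extract_segments(MONTH_RULE, s) for s in source_sentences]
print(segments)
```

  • Two sentences yield three segments here, which is why the number of extracted segments is described as at least n.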
  • The extraction rules of a language can be represented by multiple expression models, for example, a sequence-to-sequence model or a regular rule model.
  • The embodiments of the present application do not specifically limit the type of the expression model.
  • Step 103: Extract, from each target language sentence of the n groups of mutually translated sentence pairs, a target language segment that is a translation of the source language segment.
  • A source language segment is in fact a collection of words. The source language sentence and the target language sentence in a mutually translated sentence pair are translations of each other, and so are the words in the pair. Therefore, after extracting the source language segment, the computer device can extract, from the target language sentence of the pair, the target language segment that is a translation of that source language segment. The computer device does this for each target language sentence of the n groups of mutually translated sentence pairs.
  • For each group, the computer device obtains, according to the word alignment relationship of the sentence pair, the translated words in the target language sentence that correspond to the words in the source language segment extracted from the source language sentence.
  • The computer device then combines the translated words in the order in which they appear in the target language to obtain the target language segment of the target language sentence of the pair.
  • For example, in one group of mutually translated sentence pairs, the source language sentence is "Open the DND from 5 pm" and the target language sentence is "Set no disturb at 5 p.m."
  • Using the extraction rule for time information, the computer device extracts the source language segment "5 pm" from the source language sentence.
  • The translated words in the target language sentence that correspond to "5 pm" are "5" and "p.m."
  • The computer device combines the translated words "5" and "p.m." to obtain the target language segment "5 p.m."
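  • The projection of a source language segment onto the target sentence can be sketched as below. The tokenization, the index-pair form of the alignment and the name project_segment are assumptions made for illustration; the alignment itself is written by hand rather than produced by a word alignment model.

```python
def project_segment(source_span, alignment, target_words):
    """Map a source-language segment onto the target sentence.

    source_span:  indices of the source words that form the extracted segment.
    alignment:    set of (source_index, target_index) pairs for the sentence pair.
    target_words: tokenized target language sentence.
    """
    # Collect the aligned target words and keep them in target-sentence order.
    target_indices = sorted({t for s, t in alignment if s in source_span})
    return " ".join(target_words[i] for i in target_indices)


# English rendering of the source sentence and a hand-written alignment.
source_words = ["Open", "the", "DND", "from", "5", "pm"]
target_words = ["Set", "no", "disturb", "at", "5", "p.m."]
alignment = {(0, 0), (2, 1), (2, 2), (3, 3), (4, 4), (5, 5)}

target_segment = project_segment({4, 5}, alignment, target_words)
print(target_segment)
```

  • Sorting the target indices keeps the translated words in target language order even when the two languages order the words differently.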
  • Step 104: Generate an extraction rule of the target language according to the at least n target language segments extracted from the n target language sentences.
  • For one extraction rule of the source language, the number of source language segments equals the number of target language segments.
  • The computer device may extract more than one source language segment from the source language sentence of one group of mutually translated sentence pairs according to one extraction rule of the source language. Therefore, the computer device extracts at least n target language segments from the n target language sentences.
  • Because the at least n source language segments that correspond to the at least n target language segments are extracted according to the same extraction rule of the source language, the at least n target language segments correspond to the same extraction rule of the target language.
  • That extraction rule of the target language corresponds to the extraction rule of the source language with which the computer device extracted the source language segments. Therefore, the computer device can generate, from the at least n target language segments, the extraction rule of the target language that corresponds to them.
  • Step 104 may include the following sub-steps:
  • Each target language segment includes words of k domains, where k is a positive integer. Different domains represent different semantics, and words belonging to the same domain are words with the same semantics.
  • The computer device merges the words of the at least n target language segments that belong to the same domain into one set, which contains the merged words of that domain.
  • Because a target language segment includes words of k domains, merging yields k sets.
  • For example, given the three target language segments 5 p.m., 11 p.m., and 6 a.m., the computer device merges the words 5, 11, and 6, which belong to the same domain, into the set [5 or 11 or 6], and merges the words p.m., p.m., and a.m., which belong to the same domain, into the set [p.m. or p.m. or a.m.].
  • the "or" in the set may be replaced with other symbols, such as "
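  • The merging sub-step can be sketched as follows. The sketch assumes that every segment has already been split into k words and that the position of a word identifies its domain; how words are assigned to domains is not specified here, and merge_by_domain is a hypothetical name.

```python
def merge_by_domain(segments):
    """Merge words that occupy the same domain across all segments.

    segments: list of segments, each a list of k words (one word per domain).
    Returns a list of k lists holding the merged words of each domain,
    in the order the segments were given.
    """
    k = len(segments[0])
    if any(len(seg) != k for seg in segments):
        raise ValueError("every segment must contain words of the same k domains")
    return [[seg[d] for seg in segments] for d in range(k)]


segments = [["5", "p.m."], ["11", "p.m."], ["6", "a.m."]]
merged = merge_by_domain(segments)
print(merged)
```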
  • The computer device determines the grammar rules needed to generate the extraction rule of the target language based on the expression model that represents the extraction rule of the source language.
  • These grammar rules are the grammar rules of the expression model.
  • For example, if the expression model representing the extraction rule of the source language is a regular rule model, the grammar rules are regular expression grammar rules, which include replacing the original characters in a word with preset symbols.
  • The grammar rules of the expression model are shown in Table-1 below.
  • The computer device can generalize the merged words according to the grammar rules shown in Table-1, for example by using [:alpha:] to indicate English letters.
  • The grammar rules shown in Table-1 are only some of the grammar rules of the expression model; they are exemplary and explanatory and are not intended to limit the application.
  • The computer device generalizes the merged words according to the grammar rules to obtain the extraction rule of the target language. For example, generalizing [5 or 11 or 6] gives [one or two Arabic numerals], and generalizing [p.m. or p.m. or a.m.] gives [p.m. or a.m.], so the extraction rule of the target language is: [one or two Arabic numerals] [p.m. or a.m.].
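  • A minimal sketch of the generalization sub-step under a regular rule model follows. The heuristic used here (digit-only domains become a digit class with a length range, other domains become an alternation of their distinct words) and the optional whitespace between domains are assumptions, not the grammar rules of Table-1. The names generalize and build_rule are hypothetical.

```python
import re


def generalize(words):
    """Generalize the merged words of one domain into a regular expression."""
    if all(w.isdigit() for w in words):
        # Digit-only domain: keep only the observed range of lengths.
        lengths = sorted({len(w) for w in words})
        lo, hi = lengths[0], lengths[-1]
        quantifier = "{%d}" % lo if lo == hi else "{%d,%d}" % (lo, hi)
        return r"\d" + quantifier
    # Otherwise keep the distinct words as alternatives, in first-seen order.
    distinct = list(dict.fromkeys(words))
    return "(?:" + "|".join(re.escape(w) for w in distinct) + ")"


def build_rule(merged):
    """Join the per-domain expressions, allowing optional whitespace between them."""
    return re.compile(r"\s?".join(generalize(words) for words in merged))


rule = build_rule([["5", "11", "6"], ["p.m.", "p.m.", "a.m."]])
matches = rule.findall("Set no disturb at 9 p.m. and wake me at 10 a.m.")
print(rule.pattern)
print(matches)
```

  • The generated rule matches times that never appeared among the merged words, such as 9 p.m., which is the purpose of generalizing.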
  • In the extraction rule representations above, only Chinese characters or English are used, for convenience of description.
  • The representation of an extraction rule differs according to the expression model. For example, in the regular rule model, \d{1} can be used to represent an Arabic digit, and the symbol "
  • At least n target language segments are extracted through the extraction rule of the source language and the n sets of translated sentences of the source language and the target language, and then the extraction rule of the target language is generated according to at least n target language segments.
  • Based on an existing extraction rule of the source language, the extraction rule of the target language can be generated automatically and does not need to be summarized by a linguistic expert, which saves labor and time.
  • After generating the extraction rule of the target language, the computer device can also detect whether the generated rule is accurate. When the rule is not accurate, the computer device updates it.
  • After step 104, the method further includes the following steps.
  • The following describes how the extraction rule of the target language is updated.
  • Step 301: Apply the extraction rule of the source language to the source language corpus to obtain a number a of source language segments, and apply the extraction rule of the target language to the target language corpus to obtain a number b of target language segments.
  • As described above, the computer device extracts at least n source language segments according to the n groups of mutually translated sentence pairs and the extraction rule of the source language, and, in combination with the word alignment relationship, extracts at least n target language segments from which it generates the extraction rule of the target language.
  • After generating the extraction rule of the target language, the computer device detects whether the generated rule is accurate, that is, whether the corresponding target language segments can be extracted accurately according to it.
  • The computer device applies the extraction rule of the target language to the target language corpus to obtain b target language segments.
  • The computer device then applies the extraction rule of the source language to the source language corpus to obtain a source language segments (a being their number), where this extraction rule of the source language is the one that corresponds to the extraction rule of the target language. Both a and b are integers.
  • Step 302: Detect whether the a source language segments and the b target language segments meet a preset condition.
  • The computer device determines whether the extraction rule of the target language is accurate by detecting whether the a source language segments and the b target language segments meet the preset condition. If the preset condition is met, the extraction rule of the target language is inaccurate; if the preset condition is not met, the extraction rule of the target language is accurate and does not need to be updated.
  • The preset condition includes: a and b are not equal, and/or there is at least one group of mutually translated sentence pairs for which the source language segment and the target language segment extracted from the pair do not match semantically. The extraction rule of the source language and the extraction rule of the target language correspond to each other, and the source language corpus and the target language corpus contain the same number of sentences, which are translations of each other. Therefore, if the extraction rule of the target language is accurate, the numbers of source language segments and target language segments are the same, and the source language segment and the target language segment extracted from each pair are semantically consistent.
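  • The check of the preset condition can be sketched as below. How semantic equivalence is judged is not specified, so the sketch takes it as a caller-supplied function; needs_update, same_meaning and the toy lexicon are hypothetical. Treating a per-pair difference in segment counts as a mismatch is also an assumption.

```python
def needs_update(source_segments, target_segments, same_meaning):
    """Return True if the preset condition is met, i.e. the target rule is inaccurate.

    source_segments / target_segments: one list of extracted segments per
        sentence pair, given in the same pair order.
    same_meaning: hypothetical callable (source_segment, target_segment) -> bool
        that judges semantic equivalence.
    """
    a = sum(len(segs) for segs in source_segments)
    b = sum(len(segs) for segs in target_segments)
    if a != b:
        return True
    for src, tgt in zip(source_segments, target_segments):
        # Assumption: differing counts within one pair also count as a mismatch.
        if len(src) != len(tgt):
            return True
        if any(not same_meaning(s, t) for s, t in zip(src, tgt)):
            return True
    return False


# Toy lexicon standing in for a real semantic comparison.
LEXICON = {("5月", "May"), ("6月", "June")}


def same_meaning(s, t):
    return (s, t) in LEXICON


print(needs_update([["5月"], ["5月", "6月"]], [["May"], ["May", "June"]], same_meaning))
```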
  • Step 303: If the preset condition is met, update the extraction rule of the target language.
  • The computer device may update the extraction rule of the target language by reducing the degree of generalization of the extraction rule corresponding to a first domain in the extraction rule of the target language, and/or by expanding the degree of generalization of the extraction rule corresponding to a second domain in the extraction rule of the target language.
  • The extraction rule of the target language includes an extraction rule corresponding to each of at least one domain, and the extraction rule corresponding to each domain is used to extract words of one semantic of the target language.
  • The first domain is a domain whose extraction rule is generalized too much, and the second domain is a domain whose extraction rule is generalized too little.
  • Reducing the degree of generalization of the extraction rule corresponding to a domain means modifying the expression form of the extraction rule so that fewer words are extracted as belonging to the domain, which avoids extracting words that do not belong to it.
  • Expanding the degree of generalization of the extraction rule corresponding to a domain means modifying the expression form of the extraction rule so that more words are extracted as belonging to the domain, which ensures that all words belonging to it can be extracted.
  • The computer device can modify the extraction rule of the target language by directly copying the extraction rule of the source language. This is possible only when the extraction rule of the source language uses only the symbols preset in the grammar rules of the expression model. Because the extraction rule of the source language and the extraction rule of the target language use the same expression model, the symbols preset in the grammar rules of the expression model have the same meaning in both rules and do not become ambiguous across languages, so the extraction rule of the source language can be copied directly.
  • The computer device can also modify the extraction rule of the target language by adding a specified sentence pattern to which the extraction rule applies.
  • The specified sentence pattern may be a predetermined sentence pattern.
  • For example, the source language is English, the target language is Chinese, and the extraction rule is the extraction rule for the month.
  • The extraction rule of the target language is: [one or two Arabic numerals] [month].
  • The computer device takes the sentence pattern "the next word is not 'day'" as a specified sentence pattern to which the extraction rule of the target language applies, and modifies the rule to [one or two Arabic numerals] [month] [not "day"], so that a month immediately followed by "day" is not extracted.
  • When modifying the extraction rule of the target language, the computer device first detects whether the extraction rule of the source language that corresponds to the extraction rule of the target language uses only the symbols preset in the grammar rules of the expression model. If yes, the extraction rule of the target language is modified by directly copying the extraction rule of the source language; if not, it is modified by adding a specified sentence pattern to which the extraction rule applies.
  • a is greater than b.
  • the number of source language fragments is greater than the number of target language fragments, that is, when the computer device extracts according to the extraction rule of the target language, part of the target language fragments are omitted. It is indicated that the generalization degree of the extraction rule corresponding to at least one of the extraction rules of the target language is too small, that is, the extraction rule corresponding to the second domain exists in the extraction rule of the target language. Causes the computer device to fail to extract all the words belonging to the domain. At this time, the computer device needs to expand the generalization degree of the extraction rule corresponding to the domain, and ensure that all the words belonging to the domain can be extracted.
  • the source language is Chinese
  • the target language is English
  • the extraction rule is an extraction rule for month and day.
  • the extraction rule of the source language is: [one or two Arabic numerals] [one or two Arabic numerals].
  • the extraction rule of the target language is: [English month word] [two Arabic numerals].
  • when the computer device extracts target language segments from the corpus of the target language according to this extraction rule, it misses target language segments such as May 5, in which the day is written with a single Arabic numeral, so the number of source language segments is greater than the number of target language segments.
  • the computer device needs to expand the degree of generalization of the extraction rule corresponding to the date domain.
  • the computer device directly copies the extraction rule of the source language to modify the extraction rule of the target language, and the modified extraction rule of the target language is: [English month word] [one or two Arabic numerals].
  • a is less than b.
  • the number of source language segments is smaller than the number of target language segments, that is, redundant target language segments are extracted when the computer device extracts according to the extraction rule of the target language. This indicates that the degree of generalization of the extraction rule corresponding to at least one domain in the extraction rule of the target language is too large, that is, an extraction rule corresponding to a first domain exists in the extraction rule of the target language, so the computer device extracts words that do not belong to that domain. In this case, the computer device needs to reduce the degree of generalization of the extraction rule corresponding to that domain, to avoid extracting words that do not belong to the domain.
  • the source language is Chinese
  • the target language is English
  • the extraction rule is an extraction rule for month and day.
  • the extraction rule of the source language is: [one or two Arabic numerals] [one or two Arabic numerals].
  • the extraction rule of the target language is: [English month word] [Arabic numerals].
  • when the computer device extracts target language segments from the corpus of the target language according to this extraction rule, target language segments such as May 2000, which express a year and month in English, are also extracted, so the number of source language segments is smaller than the number of target language segments.
  • the computer device needs to reduce the degree of generalization of the extraction rule corresponding to the date domain.
  • the computer device directly copies the extraction rule of the source language to modify the extraction rule of the target language, and the modified extraction rule of the target language is: [English month word] [one or two Arabic numerals].
  • a is equal to b, but there is at least one group in which the source language segment and the target language segment extracted from a mutually translated sentence pair do not match semantically.
  • a semantic mismatch between a source language segment and a target language segment extracted from a mutually translated sentence pair means that the words in the target language segment cannot be put in one-to-one correspondence with the words in the source language segment.
  • when the target language segment contains a word that corresponds to no word in the source language segment, the computer device determines that the degree of generalization of the extraction rule corresponding to at least one domain in the extraction rule of the target language is too large, that is, an extraction rule corresponding to a first domain exists in the extraction rule of the target language, and the degree of generalization of that extraction rule needs to be reduced;
  • when the source language segment contains a word that corresponds to no word in the target language segment, the computer device determines that the degree of generalization of the extraction rule corresponding to at least one domain in the extraction rule of the target language is too small, that is, an extraction rule corresponding to a second domain exists in the extraction rule of the target language, and the degree of generalization of that extraction rule needs to be expanded.
  • when both situations occur at the same time, the computer device determines that both an extraction rule corresponding to a first domain and an extraction rule corresponding to a second domain exist in the extraction rule of the target language. In this case, the computer device reduces the degree of generalization of the extraction rule corresponding to the first domain, and expands the degree of generalization of the extraction rule corresponding to the second domain.
  • the computer device may update the extraction rule of the source language by using the same update mode.
  • the computer device applies the extraction rule of the source language to the source language corpus to obtain a source language segments, applies the extraction rule of the target language to the target language corpus to obtain b target language segments, and then detects whether the a source language segments and the b target language segments meet the preset condition; if the preset condition is met, the extraction rule of the source language is updated.
  • after step 303 above, execution may start again from step 301 until the a source language segments and the b target language segments no longer meet the preset condition.
  • the accuracy of the extraction rule of the target language is ensured by repeatedly detecting and updating the extraction rule of the target language.
  • in the above manner, the extraction rule of the target language is detected and updated in combination with the extraction rule of the source language, which ensures the accuracy of the extraction rule of the target language and avoids errors when information is extracted according to it.
  • the technical solution provided by the present application is described from the perspective of a language processing device.
  • the language processing device includes corresponding hardware structures and/or software modules (or units) for performing the respective functions in order to implement the above functions.
  • in combination with the units and algorithm steps of the examples described in the embodiments disclosed in this application, the embodiments of the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the solution. A person skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of the technical solutions of the embodiments of the present application.
  • the embodiment of the present application may divide the language processing device into functional units according to the foregoing method example.
  • each functional unit may be divided according to each function, or two or more functions may be integrated into one processing unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logical function division. In actual implementation, there may be another division manner.
  • FIG. 5 is a structural block diagram of a language processing device 500 provided by an embodiment of the present application.
  • the device includes an obtaining unit 501, an extracting unit 502, and a generating unit 503.
  • the obtaining unit 501 is configured to implement the steps corresponding to step 101 in the foregoing method embodiment, and other explicit or implicit obtaining steps.
  • the extracting unit 502 is configured to implement at least one of the steps 102, 103, and 301 in the foregoing method embodiment, and other explicit or implicit extraction steps.
  • the generating unit 503 is configured to implement at least one step of step 104, step 302 and step 303 in the foregoing method embodiment, and other explicit or implicit generating steps.
  • the obtaining unit 501 can be implemented by a processor, a memory, and a first instruction, a first program segment, a code set, or an instruction set stored in the memory.
  • the extracting unit 502 can be implemented by a processor, a memory, and a second instruction, a second program segment, a code set, or an instruction set stored in the memory.
  • the generating unit 503 can be implemented by a processor, a memory, and a third instruction, a third program segment, a code set, or an instruction set stored in the memory.
  • the language processing device 510 includes a processor 512 and a memory 511.
  • the language processing device 510 can also include a bus 513.
  • the processor 512 and the memory 511 may be connected to each other through a bus 513.
  • the bus 513 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • the bus 513 can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in Figure 6, but it does not mean that there is only one bus or one type of bus.
  • the steps of the method or algorithm described in connection with the disclosure of the embodiments of the present application may be implemented in a hardware manner, or may be implemented by a processor executing software instructions.
  • the software instructions may be composed of corresponding software modules (or units), and the software modules (or units) may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), an erasable programmable ROM (EPROM), an electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art.
  • An exemplary storage medium is coupled to the processor to enable the processor to read information from, and write information to, the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and the storage medium can be located in an ASIC.
  • the ASIC can be located in a language processing device.
  • the processor and the storage medium may also be present in the language processing device as discrete components.
  • the functions described in the embodiments of the present application may be implemented in hardware, software, firmware, or any combination thereof.
  • the embodiment of the present application also provides a computer program product for implementing the above functions when the computer program product is executed.
  • the computer program described above may be stored in a computer readable medium or transmitted as one or more instructions or code on a computer readable medium.
  • computer readable media include computer storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one place to another.
  • a storage medium may be any available media that can be accessed by a general purpose or special purpose computer.
  • the embodiment of the present application provides a chip, which includes programmable logic circuits and/or program instructions, and is used to implement the language processing method provided by the foregoing embodiments when the chip is running.
  • the embodiment of the present application provides a processing apparatus, where the processing apparatus includes at least one circuit for performing the language processing method provided by the foregoing embodiment.
  • the embodiment of the present application provides a processing device, which is used to implement the language processing method provided by the foregoing embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

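A language processing method and device. The method includes: obtaining n groups of mutually translated sentence pairs of a source language and a target language, where each of the n groups includes one source language sentence and one target language sentence that are translations of each other, and n is an integer greater than 1; extracting a source language segment from each source language sentence of the n groups by using an extraction rule of the source language; extracting, from each target language sentence of the n groups, a target language segment that is a translation of the source language segment; and generating an extraction rule of the target language based on at least n target language segments extracted from the n target language sentences. In the solution provided in the embodiments of this application, the extraction rule of the target language can be generated automatically from an already determined extraction rule of the source language, so no language expert is needed to summarize the extraction rule of the target language, which saves labor and time.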

Description

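Language Processing Method and Device
This application claims priority to Chinese Patent Application No. 201711411206.3, filed with the China National Intellectual Property Administration on December 23, 2017 and entitled "Language Processing Method and Device", which is incorporated herein by reference in its entirety.
Technical Field
The embodiments of this application relate to the field of computer technologies, and in particular to a language processing method and device.
Background
With the continuous development of artificial intelligence technology, natural language human-machine interaction systems, which let people and machines interact through natural language, are becoming increasingly important. For such interaction, the system must be able to recognize the specific meaning of human natural language. Typically, the system recognizes the specific meaning of a sentence by extracting key information from the natural language sentence.
In the related art, key information in sentences of a natural language is extracted by using extraction rules that language experts have summarized for that language. For example, for extracting dates from Chinese sentences, the date extraction rule summarized by a language expert is: [four digits]年[one or two digits]月[one or two digits]日, and the system can extract the date from a sentence according to this rule. When the system needs to recognize multiple natural languages, one extraction rule cannot apply to all of them because of the grammatical differences between languages. For each natural language, a language expert of that language has to summarize the corresponding extraction rules.
When extraction rules are needed for multiple natural languages, each language requires its own language expert to summarize the rules, which consumes excessive time and labor.
Summary
This application provides a language processing method and device, which can be used to solve the problem in the prior art that, when extraction rules are needed for multiple natural languages, each language requires its own language expert to summarize the rules, which consumes excessive time and labor.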
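According to a first aspect, this application provides a language processing method. The method includes: obtaining n groups of mutually translated sentence pairs of a source language and a target language, where each of the n groups includes one source language sentence and one target language sentence that are translations of each other, and n is an integer greater than 1; extracting a source language segment from each source language sentence of the n groups by using an extraction rule of the source language; extracting, from each target language sentence of the n groups, a target language segment that is a translation of the source language segment; and generating an extraction rule of the target language based on at least n target language segments extracted from the n target language sentences.
In the solution provided in this application, at least n target language segments are extracted by using the extraction rule of the source language and the n groups of mutually translated sentence pairs, and the extraction rule of the target language is then generated from the at least n target language segments. The extraction rule of the target language can thus be generated automatically from an already determined extraction rule of the source language, without a language expert having to summarize it, which saves labor and time.
In a possible design, extracting, from each target language sentence of the n groups, a target language segment that is a translation of the source language segment includes: for each group of mutually translated sentence pairs, obtaining, according to the word alignment relationship within the sentence pair, the translated words in the target language sentence that correspond to the words contained in the source language segment extracted from the source language sentence; and combining the translated words to obtain the target language segment of the target language sentence.
In the solution provided in this application, the target language segment of the target language sentence is obtained accurately from the word alignment relationship within the sentence pair and the source language segment.
In another possible design, each target language segment includes words of k domains, where k is a positive integer; and generating the extraction rule of the target language based on the at least n target language segments extracted from the n target language sentences includes: merging the words that belong to the same domain in the at least n target language segments to obtain the merged words of each domain, where words belonging to the same domain are words with the same semantics; and generalizing the merged words of each domain to obtain the extraction rule of the target language.
In the solution provided in this application, the extraction rule of the target language is generated automatically by merging and generalizing the words that belong to the same domain in the at least n target language segments.
In yet another possible design, after the extraction rule of the target language is generated based on the at least n target language segments extracted from the n target language sentences, the method further includes: applying the extraction rule of the source language to a source language corpus to obtain a source language segments; applying the extraction rule of the target language to a target language corpus to obtain b target language segments, where the source language sentences in the source language corpus and the target language sentences in the target language corpus are equal in number and are translations of each other, and a and b are both integers; detecting whether the a source language segments and the b target language segments meet a preset condition; and updating the extraction rule of the target language if the preset condition is met.
In the solution provided in this application, updating the extraction rule of the target language ensures its accuracy and avoids errors when information is extracted according to it.
In yet another possible design, the preset condition includes: a and b are not equal; and/or there is at least one group in which the source language segment and the target language segment extracted from a mutually translated sentence pair do not match semantically.
In the solution provided in this application, the preset condition makes it possible to detect accurately whether the extraction rule of the target language is accurate.
In yet another possible design, the extraction rule of the target language includes an extraction rule corresponding to at least one domain, and the extraction rule corresponding to each domain is used to extract words of one kind of semantics in the target language. Updating the extraction rule of the target language includes: reducing the degree of generalization of the extraction rule corresponding to a first domain in the extraction rule of the target language; and/or expanding the degree of generalization of the extraction rule corresponding to a second domain in the extraction rule of the target language.
In the solution provided in this application, the extraction rule of the target language is updated accurately by reducing or expanding its degree of generalization, which ensures its accuracy.
According to a second aspect, this application provides a language processing device. The device includes units or means for performing the language processing method provided in the first aspect or any possible design of the first aspect.
According to a third aspect, this application provides a language processing device including a processor and a memory, where the memory stores a computer readable program, and the processor runs the program in the memory to perform the language processing method provided in the first aspect or any possible design of the first aspect.
According to a fourth aspect, this application provides a computer storage medium configured to store computer software instructions used by the language processing device, including a program designed for executing the foregoing aspects.
According to a fifth aspect, this application provides a computer program product which, when executed, is used to perform the language processing method provided in the first aspect or any possible design of the first aspect.
According to a sixth aspect, this application provides a chip. The chip includes a programmable logic circuit and/or program instructions and, when running, implements the language processing method provided in the first aspect or any possible design of the first aspect.
According to a seventh aspect, this application provides a processing apparatus. The processing apparatus includes at least one circuit configured to perform the language processing method provided in the first aspect or any possible design of the first aspect.
According to an eighth aspect, this application provides a processing apparatus configured to implement the language processing method provided in the first aspect or any possible design of the first aspect.
Compared with the prior art, in the solution provided in this application, at least n target language segments are extracted by using the extraction rule of the source language and the n groups of mutually translated sentence pairs, and the extraction rule of the target language is then generated from the at least n target language segments. The extraction rule of the target language can thus be generated automatically from an already determined extraction rule of the source language, without a language expert having to summarize it, which saves labor and time.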
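Brief Description of Drawings
FIG. 1 is a flowchart of a language processing method according to an embodiment of this application;
FIG. 2 is a schematic diagram of a word alignment relationship according to an embodiment of this application;
FIG. 3 is a flowchart of a language processing method according to another embodiment of this application;
FIG. 4 is a schematic diagram of generating an extraction rule of a target language according to an embodiment of this application;
FIG. 5 is a schematic block diagram of a language processing device according to an embodiment of this application;
FIG. 6 is a schematic structural diagram of a language processing device according to an embodiment of this application.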
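Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
In the method provided in the embodiments of this application, each step may be performed by a computer device, for example a PC (Personal Computer) or a server. When the computer device performs the method provided in the embodiments of this application, it may also be called a language processing device. Optionally, the computer device includes a database that stores corpora and extraction rules of multiple natural languages. For ease of description, the following method embodiments are described with a computer device as the entity performing each step, but this does not constitute a limitation.
Refer to FIG. 1, which shows a flowchart of a language processing method according to an embodiment of this application. The method may include the following steps.
Step 101: Obtain n groups of mutually translated sentence pairs of a source language and a target language.
The source language is a natural language for which one or more extraction rules have already been summarized, and the target language is a natural language for which an extraction rule needs to be generated. When the computer device needs to generate an extraction rule of the target language, it first obtains n groups of mutually translated sentence pairs of the source language and the target language, where n is an integer greater than 1. A mutually translated sentence pair of the source language and the target language is a pair consisting of a source language sentence and a target language sentence that are translations of each other. Optionally, words in a sentence pair are also translations of each other, and the correspondence between such words is called a word alignment relationship. For example, as shown in FIG. 2, for one sentence pair the source language sentence is: 请帮我搜五一期间拍的照片, and the corresponding target language sentence is: Find the picture taken in May 1st for me. Words in the pair are also translations of each other, for example "搜" and "find", "照片" and "picture", and "我" and "me".
In a possible implementation, the computer device obtains corpora of the source language and the target language. The source language sentences in the source language corpus and the target language sentences in the target language corpus are equal in number and are translations of each other; that is, the two corpora contain multiple groups of mutually translated sentence pairs, which include the foregoing n groups.
In another possible implementation, the computer device directly obtains the n groups of mutually translated sentence pairs of the source language and the target language. The computer device stores the correspondence between the different sentence pairs in the corpora and the extraction rules of the source language, and can use this correspondence to directly obtain the n groups of sentence pairs that correspond to one extraction rule.
Optionally, the computer device can determine the word alignment relationships in the n groups of sentence pairs by using a word alignment model. The computer device may determine the word alignment relationship in each sentence pair by using an International Business Machines Corporation (IBM) model, or by using an attention model. The specific type of word alignment model used by the computer device is not limited in the embodiments of this application.
In a possible implementation, the computer device determines the word alignment relationships in the n groups of sentence pairs by using only one word alignment model.
In another possible implementation, the computer device determines the word alignment relationships in the n groups of sentence pairs by using multiple word alignment models. For one sentence pair, the computer device first obtains the word alignment relationship determined by each model, and then, according to the weight ratios of the models, selects one of them as the word alignment relationship of that sentence pair.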
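The source does not say how the weight ratios are used to pick one alignment. The following is a minimal sketch under one possible reading, a weighted vote in which identical alignments pool their weights; the function name `select_alignment`, the weights, and the toy alignments are all assumptions made for illustration.

```python
from collections import defaultdict

def select_alignment(candidates):
    """Pick one alignment from (alignment, weight) pairs by weighted vote.

    Each alignment is a frozenset of (source_index, target_index) links.
    Identical alignments proposed by different models pool their weights.
    """
    scores = defaultdict(float)
    for alignment, weight in candidates:
        scores[alignment] += weight
    return max(scores, key=scores.get)

# Hypothetical outputs of three alignment models for one sentence pair.
model_a = frozenset({(0, 0), (1, 1)})
model_b = frozenset({(0, 0), (1, 2)})
model_c = frozenset({(0, 0), (1, 1)})
chosen = select_alignment([(model_a, 0.3), (model_b, 0.5), (model_c, 0.3)])
print(sorted(chosen))
```

Here the two agreeing models together outweigh the single highest-weighted model, which is the behavior a weighted vote is meant to give; other selection schemes would also fit the description.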
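Step 102: Extract a source language segment from each source language sentence of the n groups of mutually translated sentence pairs by using the extraction rule of the source language.
Each source language sentence in the n groups of sentence pairs corresponds to the same extraction rule of the source language. Using this rule, the computer device extracts the source language segment of each source language sentence. For example, the extraction rule for month information in the source language corresponds to the source language sentences in 2 groups of sentence pairs, and these sentences are "今天是5月" and "明天是5月或者6月". According to the month extraction rule, the source language segment that the computer device extracts from "今天是5月" is "5月", and the source language segments it extracts from "明天是5月或者6月" are "5月" and "6月".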
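Step 102 can be sketched with a regular expression standing in for the source language rule. This is a minimal illustration that assumes the month rule is written as "one or two Arabic digits followed by 月"; the names `MONTH_RULE` and `extract_segments` are hypothetical.

```python
import re

# Assumed regular-expression form of the source-language month rule.
MONTH_RULE = re.compile(r"\d{1,2}月")

def extract_segments(rule, sentences):
    """Return, for each sentence, the list of segments the rule matches."""
    return [rule.findall(sentence) for sentence in sentences]

source_sentences = ["今天是5月", "明天是5月或者6月"]
segments = extract_segments(MONTH_RULE, source_sentences)
print(segments)  # [['5月'], ['5月', '6月']]
```

The second sentence yields two segments, which is why the method speaks of "at least n" segments for n sentences.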
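Optionally, the extraction rules of a language may be represented by various expression models, for example a sequence-to-sequence model or a regular-rule model. The type of expression model is not specifically limited in the embodiments of this application.
Step 103: Extract, from each target language sentence of the n groups of mutually translated sentence pairs, a target language segment that is a translation of the source language segment.
A source language segment is in effect a set of words. Because the source language sentence and the target language sentence in a sentence pair are translations of each other, and the words in the pair are also translations of each other, the computer device can, after extracting a source language segment, extract from the target language sentence of the pair the target language segment that is a translation of that source language segment. For each target language sentence in the n groups of sentence pairs, the computer device can extract its target language segment.
Optionally, for each sentence pair, the computer device obtains, according to the word alignment relationship within the pair, the translated words in the target language sentence that correspond to the words contained in the source language segment extracted from the source language sentence. The computer device then combines the translated words in the order in which they appear in the target language to obtain the target language segment of the target language sentence. For example, in one sentence pair the source language sentence is: 到下午五点开始打开免打扰, and the target language sentence is: Set no disturb at 5p.m. Using the extraction rule for time information, the computer device extracts the source language segment 下午五点 from the source language sentence. According to the word alignment relationship, the translated words corresponding to 下午五点 in the target language sentence are 5 and p.m. The computer device combines the translated words 5 and p.m. to obtain the target language segment: 5p.m.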
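The projection in step 103 can be sketched as follows. The tokenization and the alignment table below are hand-written assumptions for the example sentence pair; in practice they would come from a word alignment model. The function name `project_segment` is hypothetical.

```python
def project_segment(target_tokens, alignment, segment_indices):
    """Map source token positions to target tokens via a word alignment.

    alignment maps a source token index to a list of target token indices.
    The result keeps the order in which the words occur in the target sentence.
    """
    target_indices = sorted(
        {t for s in segment_indices for t in alignment.get(s, [])}
    )
    return [target_tokens[i] for i in target_indices]

# Assumed tokenization of the sentence pair used in the text.
source_tokens = ["到", "下午", "五点", "开始", "打开", "免打扰"]
target_tokens = ["Set", "no", "disturb", "at", "5", "p.m."]
# Assumed alignment: 下午 -> p.m., 五点 -> 5, 打开 -> Set, 免打扰 -> no disturb.
alignment = {0: [3], 1: [5], 2: [4], 4: [0], 5: [1, 2]}

# The source rule extracted 下午五点, i.e. source tokens 1 and 2.
segment = project_segment(target_tokens, alignment, [1, 2])
print(segment)  # ['5', 'p.m.']
```

Sorting the target indices is what puts "5" before "p.m." even though the source order is 下午 (p.m.) then 五点 (5).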
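Step 104: Generate the extraction rule of the target language based on at least n target language segments extracted from the n target language sentences.
Within one sentence pair, for one extraction rule of the source language, the number of source language segments equals the number of target language segments. In addition, the computer device may extract more than one source language segment from the source language sentence of a sentence pair according to one extraction rule. Therefore, the computer device extracts at least n target language segments from the n target language sentences.
Because the at least n source language segments that correspond to the at least n target language segments were extracted according to the same extraction rule of the source language, the at least n target language segments correspond to the same extraction rule of the target language. That rule corresponds to the source language rule used to extract the source language segments. The computer device can therefore generate, from the at least n target language segments, the extraction rule of the target language to which they correspond.
Optionally, step 104 may include the following sub-steps:
1. Merge the words that belong to the same domain in the at least n target language segments to obtain the merged words of each domain.
Each target language segment includes words of k domains, where k is a positive integer. Different domains represent different semantics, and words belonging to the same domain are words with the same semantics. The computer device merges the words that belong to the same domain in the at least n target language segments into one set, which contains the merged words of that domain. If the target language segments include words of k domains, k sets are obtained after merging. For example, if the 3 target language segments are 5p.m., 11p.m., and 6a.m., the computer device merges 5, 11, and 6 as words of one domain into the set [5 or 11 or 6], and merges p.m., p.m., and a.m. as words of another domain into the set [p.m. or p.m. or a.m.]. Note that "or" in a set may be replaced with another symbol, for example "|".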
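Sub-step 1 can be sketched as below, assuming each target language segment has already been split into one word per domain. The function name `merge_by_domain` is hypothetical, and the sketch drops duplicates during merging, whereas the text keeps them until generalization.

```python
def merge_by_domain(segments):
    """Merge the words of equal-length segments into one word list per domain.

    segments: list of tuples with one word per domain (k words each).
    Returns k lists, each keeping first-seen order without duplicates.
    """
    k = len(segments[0])
    merged = [[] for _ in range(k)]
    for segment in segments:
        for domain, word in enumerate(segment):
            if word not in merged[domain]:
                merged[domain].append(word)
    return merged

# The three target segments from the text, split into (hour, period).
target_segments = [("5", "p.m."), ("11", "p.m."), ("6", "a.m.")]
merged = merge_by_domain(target_segments)
print(merged)  # [['5', '11', '6'], ['p.m.', 'a.m.']]
```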
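2. Generalize the merged words of each domain to obtain the extraction rule of the target language.
After obtaining the merged sets, the computer device determines, according to the expression model that represents the extraction rule of the source language, the grammar rules needed to generate the extraction rule of the target language. These grammar rules are the grammar rules of the expression model. For example, if the expression model that represents the extraction rule of the source language is a regular-rule model, the grammar rules are regular expression grammar rules, which include replacing the original characters in a word with preset symbols.
For example, the grammar rules of the expression model are shown in Table 1.
Symbol        Meaning
[:alpha:]     Any upper-case or lower-case English letter, that is, A-Z and a-z
[:xdigit:]    Hexadecimal digits
[:alnum:]     Upper-case and lower-case English letters and digits
[:digit:]     Digits, that is, 0-9
[:lower:]     Lower-case letters, that is, a-z
[:upper:]     Upper-case letters, that is, A-Z
[:punct:]     Punctuation marks
Table 1
The computer device may generalize the merged words according to the grammar rules shown in Table 1, for example by using [:alpha:] to represent English letters. The grammar rules shown in Table 1 are only some of the grammar rules of the expression model, are only exemplary and explanatory, and are not intended to limit this application.
The computer device generalizes the merged words according to the grammar rules to obtain the extraction rule of the target language. For example, generalizing [5 or 11 or 6] gives [one or two Arabic numerals], and generalizing [p.m. or p.m. or a.m.] gives [p.m. or a.m.], so the extraction rule of the target language is: [one or two Arabic numerals] [p.m. or a.m.]. Note that in the embodiments of this application, extraction rules are written in Chinese or English only for ease of description. In practice, the form of an extraction rule depends on the expression model; for example, in a regular-rule model, \d{1} may be used to represent one Arabic numeral and the symbol "|" may be used to represent "or".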
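Sub-step 2 can be sketched for a regular-rule model as follows. The generalization heuristic (digit-only domains become a digit-count range, other domains become a literal alternation) and the optional whitespace between domains are assumptions made for this illustration, not the procedure defined by the source.

```python
import re

def generalize_domain(words):
    """Turn the merged words of one domain into a regex fragment."""
    if all(word.isdigit() for word in words):
        lengths = sorted({len(word) for word in words})
        return r"\d{%d,%d}" % (lengths[0], lengths[-1])
    # Non-numeric domain: alternation over the distinct words, in order.
    distinct = list(dict.fromkeys(words))
    return "(?:" + "|".join(re.escape(word) for word in distinct) + ")"

merged = [["5", "11", "6"], ["p.m.", "p.m.", "a.m."]]
# Assumption: domains may be separated by at most one whitespace character.
rule = re.compile(r"\s?".join(generalize_domain(words) for words in merged))

print(rule.pattern)
print(rule.findall("Set no disturb at 5p.m. and wake me at 11 a.m."))
```

The resulting pattern plays the role of [one or two Arabic numerals] [p.m. or a.m.] in the text.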
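In the solution provided in the embodiments of this application, at least n target language segments are extracted by using the extraction rule of the source language and the n groups of mutually translated sentence pairs of the source language and the target language, and the extraction rule of the target language is then generated from the at least n target language segments. The extraction rule of the target language can thus be generated automatically from an already determined extraction rule of the source language, without a language expert having to summarize it, which saves labor and time.
After generating the extraction rule of the target language, the computer device may further detect whether the generated rule is accurate. When the extraction rule of the target language is inaccurate, the computer device updates it.
In an optional embodiment based on the embodiment of FIG. 1, as shown in FIG. 3, the following steps are further included after step 104. This embodiment describes how the extraction rule of the target language is updated.
Step 301: Apply the extraction rule of the source language to the source language corpus to obtain a source language segments, and apply the extraction rule of the target language to the target language corpus to obtain b target language segments.
Refer to the schematic diagram shown in FIG. 4. The computer device extracts at least n source language segments according to the n groups of sentence pairs and the extraction rule of the source language, then extracts at least n target language segments by using the word alignment relationships, and generates the extraction rule of the target language.
After generating the extraction rule of the target language, the computer device detects whether the rule is accurate, that is, whether the corresponding target language segments can be extracted accurately according to it. The computer device applies the extraction rule of the target language to the target language corpus to obtain b target language segments. The computer device then applies the extraction rule of the source language to the source language corpus to obtain a source language segments; this source language rule is the rule that corresponds to the extraction rule of the target language. Both a and b are integers.
Step 302: Detect whether the a source language segments and the b target language segments meet the preset condition.
The computer device determines whether the extraction rule of the target language is accurate by detecting whether the a source language segments and the b target language segments meet the preset condition. If the preset condition is met, the extraction rule of the target language is inaccurate; if it is not met, the extraction rule of the target language is accurate and does not need to be updated.
The preset condition includes: a and b are not equal; and/or there is at least one group in which the source language segment and the target language segment extracted from a mutually translated sentence pair do not match semantically. The extraction rule of the source language and the extraction rule of the target language correspond to each other, and the source language sentences in the source language corpus and the target language sentences in the target language corpus are equal in number and are translations of each other. Therefore, if the extraction rule of the target language is accurate, the numbers of source language segments and target language segments are equal, and the source language segment and the target language segment extracted from every sentence pair also agree semantically.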
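The check in step 302 can be sketched as below. The source does not say how a semantic match is tested, so the sketch takes it as a caller-supplied function; `is_translation`, the toy lexicon, and the per-pair length check are assumptions.

```python
def meets_preset_condition(source_segments, target_segments, is_translation):
    """Return True if the target rule looks inaccurate.

    source_segments / target_segments: one list of segments per sentence pair.
    is_translation(src, tgt) -> bool is a hypothetical semantic-match test.
    """
    a = sum(len(segments) for segments in source_segments)
    b = sum(len(segments) for segments in target_segments)
    if a != b:
        return True
    for src_list, tgt_list in zip(source_segments, target_segments):
        if len(src_list) != len(tgt_list):
            return True
        if any(not is_translation(s, t) for s, t in zip(src_list, tgt_list)):
            return True
    return False

# Toy lexicon standing in for a real semantic comparison.
lexicon = {("5月", "May"), ("6月", "June")}

def is_translation(src, tgt):
    return (src, tgt) in lexicon

accurate = meets_preset_condition(
    [["5月"], ["5月", "6月"]], [["May"], ["May", "June"]], is_translation)
count_mismatch = meets_preset_condition([["5月"]], [[]], is_translation)
meaning_mismatch = meets_preset_condition([["5月"]], [["June"]], is_translation)
print(accurate, count_mismatch, meaning_mismatch)
```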
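Step 303: If the preset condition is met, update the extraction rule of the target language.
If the preset condition is met, the extraction rule of the target language is inaccurate and the computer device needs to update it. The computer device may update the extraction rule of the target language by reducing the degree of generalization of the extraction rule corresponding to a first domain in the extraction rule of the target language, and/or by expanding the degree of generalization of the extraction rule corresponding to a second domain in the extraction rule of the target language. The extraction rule of the target language includes an extraction rule corresponding to at least one domain, and the extraction rule corresponding to each domain is used to extract words of one kind of semantics in the target language. The first domain is a domain whose extraction rule is too general, and the second domain is a domain whose extraction rule is not general enough.
Reducing the degree of generalization of the extraction rule corresponding to a domain means modifying the form of the rule so that it extracts fewer words as belonging to the domain, which avoids extracting words that do not belong to the domain. Expanding the degree of generalization of the extraction rule corresponding to a domain means modifying the form of the rule so that it extracts more words as belonging to the domain, which ensures that all the words belonging to the domain can be extracted.
The computer device may modify the extraction rule of the target language by directly copying the extraction rule of the source language. The computer device can do so only when the extraction rule of the source language uses nothing but the symbols preset in the grammar rules of the expression model. Because the extraction rule of the source language and the extraction rule of the target language use the same expression model, the preset symbols of the grammar rules have the same meaning in both rules and do not become ambiguous because the languages differ. The computer device can therefore directly copy the extraction rule of the source language to modify the extraction rule of the target language.
The computer device may also modify the extraction rule of the target language by adding a specified sentence pattern to which the extraction rule applies. The specified sentence pattern may be a preset sentence pattern.
For example, the source language is English, the target language is Chinese, and the extraction rule is an extraction rule for months. The extraction rule of the target language is: [one or two Arabic numerals] [月]. When the computer device extracts target language segments from the target language corpus according to this rule, it extracts the 六月 in "六月天的演唱会" as a month, although this 六月 does not denote a month. In this case, the computer device uses the source language sentence that corresponds to "六月天的演唱会" in the source language corpus to detect that 六月天 corresponds to June Day, where 六月 corresponds to June and 天 corresponds to Day. The computer device takes the sentence pattern "the next word is not 天" as the specified sentence pattern to which the extraction rule of the target language applies, and modifies the extraction rule of the target language to [one or two Arabic numerals] [月] [not "天"], so that the 六月 in 六月天 is no longer extracted.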
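In a regular-rule model, the added sentence pattern "the next word is not 天" can be sketched as a negative lookahead. The numeral class below accepts both Arabic digits and Chinese numerals so that the 六月 of the example matches; that class, the extra corpus sentences, and the names are assumptions for illustration.

```python
import re

# Assumption: numerals may be Arabic digits or Chinese numerals.
NUMERAL = r"(?:\d{1,2}|[一二三四五六七八九十]{1,2})"

rule_before = re.compile(NUMERAL + "月")
# Added sentence pattern: the character after 月 must not be 天.
rule_after = re.compile(NUMERAL + "月(?!天)")

corpus = ["六月天的演唱会", "今天是5月", "六月去旅行"]
before = [rule_before.findall(sentence) for sentence in corpus]
after = [rule_after.findall(sentence) for sentence in corpus]
print(before)  # the band name 六月天 yields a false month
print(after)   # the false month is gone, real months remain
```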
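In addition, when modifying the extraction rule of the target language, the computer device first detects whether the extraction rule of the source language that corresponds to the extraction rule of the target language uses only the symbols preset in the grammar rules of the expression model. If yes, the extraction rule of the target language is modified by directly copying the extraction rule of the source language; if not, the extraction rule of the target language is modified by adding a specified sentence pattern to which the extraction rule applies.
In a possible implementation, a is greater than b. The number of source language segments is greater than the number of target language segments, that is, some target language segments are missed when the computer device extracts according to the extraction rule of the target language. This indicates that the degree of generalization of the extraction rule corresponding to at least one domain in the extraction rule of the target language is too small, that is, an extraction rule corresponding to a second domain exists in the extraction rule of the target language, so the computer device fails to extract all the words belonging to that domain. In this case, the computer device needs to expand the degree of generalization of the extraction rule corresponding to that domain, to ensure that all the words belonging to the domain can be extracted.
For example, the source language is Chinese, the target language is English, and the extraction rule is an extraction rule for month and day. The extraction rule of the source language is: [one or two Arabic numerals] [one or two Arabic numerals], and the extraction rule of the target language is: [English month word] [two Arabic numerals]. When the computer device extracts target language segments from the target language corpus according to this rule, it misses target language segments such as May 5, in which the day is written with a single Arabic numeral, so the number of source language segments is greater than the number of target language segments. The computer device needs to expand the degree of generalization of the extraction rule corresponding to the date domain. The computer device directly copies the extraction rule of the source language to modify the extraction rule of the target language, and the modified extraction rule of the target language is: [English month word] [one or two Arabic numerals].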
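The a > b case can be illustrated with regular expressions. The month alternation, the two-sentence corpus, and the assumed source-side count a = 2 are all made up for this sketch.

```python
import re

MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")

# Target rule before the update: the day must have exactly two digits.
narrow_rule = re.compile(MONTHS + r" \d{2}\b")
# After the update the day may have one or two digits, as in the source rule.
widened_rule = re.compile(MONTHS + r" \d{1,2}\b")

target_corpus = ["Photos taken on May 5", "Meeting on June 12"]
a = 2  # assumed number of segments found by the source-language rule
b_narrow = sum(len(narrow_rule.findall(s)) for s in target_corpus)
b_widened = sum(len(widened_rule.findall(s)) for s in target_corpus)
print(a, b_narrow, b_widened)  # 2 1 2
```

With the narrow rule a exceeds b, which signals that the day domain is not general enough; after widening, the counts agree.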
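In another possible implementation, a is less than b. The number of source language segments is smaller than the number of target language segments, that is, redundant target language segments are extracted when the computer device extracts according to the extraction rule of the target language. This indicates that the degree of generalization of the extraction rule corresponding to at least one domain in the extraction rule of the target language is too large, that is, an extraction rule corresponding to a first domain exists in the extraction rule of the target language, so the computer device extracts words that do not belong to that domain. In this case, the computer device needs to reduce the degree of generalization of the extraction rule corresponding to that domain, to avoid extracting words that do not belong to the domain.
For example, the source language is Chinese, the target language is English, and the extraction rule is an extraction rule for month and day. The extraction rule of the source language is: [one or two Arabic numerals] [one or two Arabic numerals], and the extraction rule of the target language is: [English month word] [Arabic numerals]. When the computer device extracts target language segments from the target language corpus according to this rule, target language segments such as May 2000, which express a year and month in English, are also extracted, so the number of source language segments is smaller than the number of target language segments. The computer device needs to reduce the degree of generalization of the extraction rule corresponding to the date domain. The computer device directly copies the extraction rule of the source language to modify the extraction rule of the target language, and the modified extraction rule of the target language is: [English month word] [one or two Arabic numerals].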
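The a < b case can be illustrated in the same way. The corpus, the assumed source-side count a = 1, and the trailing word boundary are assumptions of this sketch; the boundary is needed in a regular expression because a bare "one or two digits" would still match the "20" in "May 2000".

```python
import re

MONTHS = (r"(?:January|February|March|April|May|June|July|August|"
          r"September|October|November|December)")

# Target rule before the update: any run of digits counts as a day.
loose_rule = re.compile(MONTHS + r" \d+")
# After the update: one or two digits, not followed by further digits.
tight_rule = re.compile(MONTHS + r" \d{1,2}\b")

target_corpus = ["Photos taken on May 5", "I graduated in May 2000"]
a = 1  # assumed number of month-day segments found by the source rule
b_loose = sum(len(loose_rule.findall(s)) for s in target_corpus)
b_tight = sum(len(tight_rule.findall(s)) for s in target_corpus)
print(a, b_loose, b_tight)  # 1 2 1
```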
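In yet another possible implementation, a is equal to b, but there is at least one group in which the source language segment and the target language segment extracted from a mutually translated sentence pair do not match semantically. Such a semantic mismatch means that the words in the target language segment cannot be put in one-to-one correspondence with the words in the source language segment. When the target language segment contains a word that corresponds to no word in the source language segment, the computer device determines that the degree of generalization of the extraction rule corresponding to at least one domain in the extraction rule of the target language is too large, that is, an extraction rule corresponding to a first domain exists in the extraction rule of the target language, and the degree of generalization of that rule needs to be reduced. When the source language segment contains a word that corresponds to no word in the target language segment, the computer device determines that the degree of generalization of the extraction rule corresponding to at least one domain in the extraction rule of the target language is too small, that is, an extraction rule corresponding to a second domain exists in the extraction rule of the target language, and the degree of generalization of that rule needs to be expanded. When both situations occur at the same time, the computer device determines that both an extraction rule corresponding to a first domain and an extraction rule corresponding to a second domain exist in the extraction rule of the target language. In this case, the computer device reduces the degree of generalization of the extraction rule corresponding to the first domain and expands the degree of generalization of the extraction rule corresponding to the second domain.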
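The two mismatch directions can be sketched as a small diagnostic. Representing the word alignment as a set of (source word, target word) pairs and returning the labels "reduce" and "expand" are choices made for this illustration only.

```python
def diagnose(source_words, target_words, alignment):
    """Decide how a target rule should change for one segment pair.

    alignment: set of (source_word, target_word) links for the sentence pair.
    Returns a list that may contain "reduce" and/or "expand".
    """
    covered_targets = {t for s, t in alignment if s in source_words}
    covered_sources = {s for s, t in alignment if t in target_words}
    actions = []
    if any(t not in covered_targets for t in target_words):
        actions.append("reduce")  # extra target word: first domain, too general
    if any(s not in covered_sources for s in source_words):
        actions.append("expand")  # uncovered source word: second domain, too narrow
    return actions

# Target segment carries an extra word that the source segment lacks.
too_general = diagnose(["5月"], ["May", "2000"], {("5月", "May")})
# Source segment has a word whose translation the target rule missed.
too_narrow = diagnose(["5月", "5日"], ["May"], {("5月", "May"), ("5日", "5")})
print(too_general, too_narrow)  # ['reduce'] ['expand']
```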
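Optionally, after updating the extraction rule of the target language, the computer device may update the extraction rule of the source language in the same manner. The computer device applies the extraction rule of the source language to the source language corpus to obtain a source language segments, applies the extraction rule of the target language to the target language corpus to obtain b target language segments, and then detects whether the a source language segments and the b target language segments meet the preset condition; if the preset condition is met, the extraction rule of the source language is updated.
Optionally, after step 303, execution may start again from step 301 until the a source language segments and the b target language segments no longer meet the preset condition. Repeatedly detecting and updating the extraction rule of the target language ensures its accuracy.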
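The repetition of steps 301 to 303 can be sketched as a loop. The callback names, the toy corpus, the string-replacement update, and the `max_rounds` guard are assumptions added for illustration; the source only states that the steps repeat until the preset condition is no longer met.

```python
import re

def refine_rule(count_source, count_target, needs_update, update,
                target_rule, max_rounds=10):
    """Repeat steps 301-303 until the preset condition is no longer met."""
    for _ in range(max_rounds):
        a = count_source()              # step 301, source side
        b = count_target(target_rule)   # step 301, target side
        if not needs_update(a, b):      # step 302
            break
        target_rule = update(target_rule, a, b)  # step 303
    return target_rule

target_corpus = ["May 5", "June 12"]
final_rule = refine_rule(
    count_source=lambda: 2,  # assumed number of source segments
    count_target=lambda rule: sum(
        len(re.findall(rule, s)) for s in target_corpus),
    needs_update=lambda a, b: a != b,
    update=lambda rule, a, b: rule.replace(r"\d{2}", r"\d{1,2}"),
    target_rule=r"[A-Z][a-z]+ \d{2}\b",
)
print(final_rule)
```

In this run the first round finds one target segment against two source segments, the rule is widened once, and the second round ends the loop.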
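In the above manner, the extraction rule of the target language is detected and updated in combination with the extraction rule of the source language, which ensures the accuracy of the extraction rule of the target language and avoids errors when information is extracted according to it.
In the foregoing method embodiments, the technical solutions provided in this application are described from the perspective of the language processing device. It can be understood that, to implement the foregoing functions, the language processing device includes corresponding hardware structures and/or software modules (or units) for performing each function. In combination with the units and algorithm steps of the examples described in the embodiments disclosed in this application, the embodiments of this application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered to go beyond the scope of the technical solutions of the embodiments of this application.
In the embodiments of this application, the language processing device may be divided into functional units according to the foregoing method examples. For example, each functional unit may correspond to one function, or two or more functions may be integrated into one processing unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. Note that the division into units in the embodiments of this application is schematic and is merely a division by logical function; other divisions are possible in actual implementation.
FIG. 5 shows a structural block diagram of a language processing device 500 according to an embodiment of this application. The device includes an obtaining unit 501, an extracting unit 502, and a generating unit 503.
The obtaining unit 501 is configured to implement the step corresponding to step 101 in the foregoing method embodiments, and other explicit or implicit obtaining steps.
The extracting unit 502 is configured to implement at least one of step 102, step 103, and step 301 in the foregoing method embodiments, and other explicit or implicit extraction steps.
The generating unit 503 is configured to implement at least one of step 104, step 302, and step 303 in the foregoing method embodiments, and other explicit or implicit generating steps.
The obtaining unit 501 may be implemented by a processor, a memory, and a first instruction, a first program segment, a code set, or an instruction set in the memory. The extracting unit 502 may be implemented by a processor, a memory, and a second instruction, a second program segment, a code set, or an instruction set in the memory. The generating unit 503 may be implemented by a processor, a memory, and a third instruction, a third program segment, a code set, or an instruction set in the memory.
As shown in FIG. 6, the language processing device 510 includes a processor 512 and a memory 511. Optionally, the language processing device 510 may further include a bus 513. The processor 512 and the memory 511 may be connected to each other through the bus 513. The bus 513 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 513 may be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is used in FIG. 6, but this does not mean that there is only one bus or one type of bus.
The steps of the methods or algorithms described in connection with the disclosure of the embodiments of this application may be implemented in hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules (or units), and the software modules (or units) may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), an erasable programmable ROM (EPROM), an electrically erasable programmable read-only memory (EEPROM), a register, a hard disk, a removable hard disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. The storage medium may also be a component of the processor. The processor and the storage medium may be located in an ASIC. In addition, the ASIC may be located in the language processing device. The processor and the storage medium may also exist in the language processing device as discrete components.
A person skilled in the art should be aware that, in one or more of the foregoing examples, the functions described in the embodiments of this application may be implemented in hardware, software, firmware, or any combination thereof. The embodiments of this application further provide a computer program product which, when executed, implements the foregoing functions. In addition, the computer program may be stored in a computer readable medium or transmitted as one or more instructions or code on a computer readable medium. Computer readable media include computer storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.
An embodiment of this application provides a chip. The chip includes a programmable logic circuit and/or program instructions and, when running, implements the language processing method provided in the foregoing embodiments.
An embodiment of this application provides a processing apparatus. The processing apparatus includes at least one circuit configured to perform the language processing method provided in the foregoing embodiments.
An embodiment of this application provides a processing apparatus configured to implement the language processing method provided in the foregoing embodiments.
The specific implementations described above further describe the objectives, technical solutions, and beneficial effects of the embodiments of this application in detail. It should be understood that the foregoing are merely specific implementations of the embodiments of this application and are not intended to limit their protection scope. Any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the embodiments of this application shall fall within the protection scope of the embodiments of this application.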

Claims (18)

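  1. A language processing method, wherein the method comprises:
    obtaining n groups of mutually translated sentence pairs of a source language and a target language, wherein each of the n groups comprises one source language sentence and one target language sentence that are translations of each other, and n is an integer greater than 1;
    extracting a source language segment from each source language sentence of the n groups of mutually translated sentence pairs by using an extraction rule of the source language;
    extracting, from each target language sentence of the n groups of mutually translated sentence pairs, a target language segment that is a translation of the source language segment; and
    generating an extraction rule of the target language based on at least n target language segments extracted from the n target language sentences.
  2. The method according to claim 1, wherein the extracting, from each target language sentence of the n groups of mutually translated sentence pairs, a target language segment that is a translation of the source language segment comprises:
    for each group of mutually translated sentence pairs, obtaining, according to a word alignment relationship within the sentence pair, translated words in the target language sentence of the sentence pair that correspond to the words contained in the source language segment extracted from the source language sentence of the sentence pair; and
    combining the translated words to obtain the target language segment of the target language sentence of the sentence pair.
  3. The method according to claim 1 or 2, wherein each target language segment comprises words of k domains, and k is a positive integer; and
    the generating an extraction rule of the target language based on at least n target language segments extracted from the n target language sentences comprises:
    merging the words that belong to a same domain in the at least n target language segments to obtain merged words of each domain, wherein words that belong to a same domain are words with the same semantics; and
    generalizing the merged words of each domain to obtain the extraction rule of the target language.
  4. The method according to any one of claims 1 to 3, wherein after the generating an extraction rule of the target language based on at least n target language segments extracted from the n target language sentences, the method further comprises:
    applying the extraction rule of the source language to a source language corpus to obtain a source language segments, and applying the extraction rule of the target language to a target language corpus to obtain b target language segments, wherein the source language sentences in the source language corpus and the target language sentences in the target language corpus are equal in number and are translations of each other, and a and b are both integers;
    detecting whether the a source language segments and the b target language segments meet a preset condition; and
    updating the extraction rule of the target language if the preset condition is met.
  5. The method according to claim 4, wherein the preset condition comprises: a and b are not equal; and/or there is at least one group in which a source language segment and a target language segment extracted from a mutually translated sentence pair do not match semantically.
  6. The method according to claim 4, wherein the extraction rule of the target language comprises an extraction rule corresponding to at least one domain, and the extraction rule corresponding to each domain is used to extract words of one kind of semantics in the target language; and
    the updating the extraction rule of the target language comprises:
    reducing a degree of generalization of the extraction rule corresponding to a first domain in the extraction rule of the target language;
    and/or
    expanding a degree of generalization of the extraction rule corresponding to a second domain in the extraction rule of the target language.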
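  7. A language processing device, wherein the device comprises:
    an obtaining unit, configured to obtain n groups of mutually translated sentence pairs of a source language and a target language, wherein each of the n groups comprises one source language sentence and one target language sentence that are translations of each other, and n is an integer greater than 1;
    an extracting unit, configured to extract a source language segment from each source language sentence of the n groups of mutually translated sentence pairs by using an extraction rule of the source language;
    wherein the extracting unit is further configured to extract, from each target language sentence of the n groups of mutually translated sentence pairs, a target language segment that is a translation of the source language segment; and
    a generating unit, configured to generate an extraction rule of the target language based on at least n target language segments extracted from the n target language sentences.
  8. The device according to claim 7, wherein
    the extracting unit is configured to: for each group of mutually translated sentence pairs, obtain, according to a word alignment relationship within the sentence pair, translated words in the target language sentence of the sentence pair that correspond to the words contained in the source language segment extracted from the source language sentence of the sentence pair; and combine the translated words to obtain the target language segment of the target language sentence of the sentence pair.
  9. The device according to claim 7 or 8, wherein each target language segment comprises words of k domains, and k is a positive integer; and
    the generating unit is configured to merge the words that belong to a same domain in the at least n target language segments to obtain merged words of each domain, wherein words that belong to a same domain are words with the same semantics; and
    generalize the merged words of each domain to obtain the extraction rule of the target language.
  10. The device according to any one of claims 7 to 9, wherein
    the extracting unit is further configured to apply the extraction rule of the source language to a source language corpus to obtain a source language segments, and apply the extraction rule of the target language to a target language corpus to obtain b target language segments, wherein the source language sentences in the source language corpus and the target language sentences in the target language corpus are equal in number and are translations of each other, and a and b are both integers; and
    the generating unit is further configured to detect whether the a source language segments and the b target language segments meet a preset condition, and update the extraction rule of the target language if the preset condition is met.
  11. The device according to claim 10, wherein the preset condition comprises: a and b are not equal; and/or there is at least one group in which a source language segment and a target language segment extracted from a mutually translated sentence pair do not match semantically.
  12. The device according to claim 10, wherein the extraction rule of the target language comprises an extraction rule corresponding to at least one domain, and the extraction rule corresponding to each domain is used to extract words of one kind of semantics in the target language;
    the generating unit is configured to reduce a degree of generalization of the extraction rule corresponding to a first domain in the extraction rule of the target language;
    and/or
    the generating unit is configured to expand a degree of generalization of the extraction rule corresponding to a second domain in the extraction rule of the target language.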
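  13. A language processing device, wherein the device comprises a processor and a memory, wherein
    the memory stores a computer readable program; and
    the processor runs the program in the memory to implement the method according to any one of claims 1 to 6.
  14. A computer storage medium, wherein the computer storage medium stores executable instructions, and the executable instructions are used to perform the method according to any one of claims 1 to 6.
  15. A computer program product, wherein when the computer program product is executed, it is used to perform the method according to any one of claims 1 to 6.
  16. A chip, wherein the chip comprises a programmable logic circuit and/or program instructions, and when the chip runs, it implements the method according to any one of claims 1 to 6.
  17. A processing apparatus, wherein the processing apparatus comprises at least one circuit, and the at least one circuit is configured to perform the method according to any one of claims 1 to 6.
  18. A processing apparatus, wherein the processing apparatus is configured to implement the method according to any one of claims 1 to 6.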
PCT/CN2018/102498 2017-12-23 2018-08-27 Language processing method and device WO2019119852A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP18890375.1A EP3719676A4 (en) 2017-12-23 2018-08-27 METHOD AND DEVICE FOR VOICE PROCESSING
US16/907,783 US11704505B2 (en) 2017-12-23 2020-06-22 Language processing method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711411206.3 2017-12-23
CN201711411206.3A CN109960812B (zh) 2017-12-23 2017-12-23 Language processing method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/907,783 Continuation US11704505B2 (en) 2017-12-23 2020-06-22 Language processing method and device

Publications (1)

Publication Number Publication Date
WO2019119852A1 true WO2019119852A1 (zh) 2019-06-27

Family

ID=66994363

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102498 WO2019119852A1 (zh) 2017-12-23 2018-08-27 Language processing method and device

Country Status (4)

Country Link
US (1) US11704505B2 (zh)
EP (1) EP3719676A4 (zh)
CN (1) CN109960812B (zh)
WO (1) WO2019119852A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021017951A1 (en) * 2019-07-26 2021-02-04 Beijing Didi Infinity Technology And Development Co., Ltd. Dual monolingual cross-entropy-delta filtering of noisy parallel data and use thereof

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11763018B2 (en) * 2021-02-22 2023-09-19 Imperva, Inc. System and method for policy control in databases

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391842A (zh) * 2014-12-18 2015-03-04 苏州大学 一种翻译模型构建方法和系统
CN104572634A (zh) * 2014-12-25 2015-04-29 中国科学院合肥物质科学研究院 一种交互式抽取可比语料与双语词典的方法及其装置
CN105446958A (zh) * 2014-07-18 2016-03-30 富士通株式会社 词对齐方法和词对齐设备

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6345243B1 (en) * 1998-05-27 2002-02-05 Lionbridge Technologies, Inc. System, method, and product for dynamically propagating translations in a translation-memory system
US8874431B2 (en) * 2001-03-16 2014-10-28 Meaningful Machines Llc Knowledge system method and apparatus
KR100530154B1 (ko) * 2002-06-07 2005-11-21 인터내셔널 비지네스 머신즈 코포레이션 변환방식 기계번역시스템에서 사용되는 변환사전을생성하는 방법 및 장치
CN1567297A (zh) * 2003-07-03 2005-01-19 中国科学院声学研究所 一种从双语语料库中自动抽取多词翻译等价单元的方法
US7672831B2 (en) * 2005-10-24 2010-03-02 Invention Machine Corporation System and method for cross-language knowledge searching
CN102207938A (zh) * 2010-03-31 2011-10-05 北京金山软件有限公司 一种互译词条的获取方法及系统
KR101356417B1 (ko) * 2010-11-05 2014-01-28 고려대학교 산학협력단 병렬 말뭉치를 이용한 동사구 번역 패턴 구축 장치 및 그 방법
US8781810B2 (en) * 2011-07-25 2014-07-15 Xerox Corporation System and method for productive generation of compound words in statistical machine translation
CN103246641A (zh) 2013-05-16 2013-08-14 李营 一种文本语义信息分析系统和方法
RU2642343C2 (ru) * 2013-12-19 2018-01-24 Общество с ограниченной ответственностью "Аби Продакшн" Автоматическое построение семантического описания целевого языка
CN104239290B (zh) * 2014-08-08 2017-02-15 中国科学院计算技术研究所 基于依存树的统计机器翻译方法及系统

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105446958A (zh) * 2014-07-18 2016-03-30 富士通株式会社 词对齐方法和词对齐设备
CN104391842A (zh) * 2014-12-18 2015-03-04 苏州大学 一种翻译模型构建方法和系统
CN104572634A (zh) * 2014-12-25 2015-04-29 中国科学院合肥物质科学研究院 一种交互式抽取可比语料与双语词典的方法及其装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3719676A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021017951A1 (en) * 2019-07-26 2021-02-04 Beijing Didi Infinity Technology And Development Co., Ltd. Dual monolingual cross-entropy-delta filtering of noisy parallel data and use thereof
US11288452B2 (en) 2019-07-26 2022-03-29 Beijing Didi Infinity Technology And Development Co., Ltd. Dual monolingual cross-entropy-delta filtering of noisy parallel data and use thereof

Also Published As

Publication number Publication date
US11704505B2 (en) 2023-07-18
US20200320255A1 (en) 2020-10-08
CN109960812A (zh) 2019-07-02
EP3719676A4 (en) 2020-11-25
CN109960812B (zh) 2021-05-04
EP3719676A1 (en) 2020-10-07

Similar Documents

Publication Publication Date Title
Sadvilkar et al. PySBD: Pragmatic sentence boundary disambiguation
Karimi et al. Machine transliteration survey
US8706472B2 (en) Method for disambiguating multiple readings in language conversion
Yuret et al. Learning morphological disambiguation rules for Turkish
US9224103B1 (en) Automatic annotation for training and evaluation of semantic analysis engines
CN112232074B (zh) Entity relationship extraction method and apparatus, electronic device, and storage medium
US20220414345A1 (en) Official document processing method, device, computer equipment and storage medium
JP6404511B2 (ja) Translation support system, translation support method, and translation support program
US20150066474A1 (en) Method and Apparatus for Matching Misspellings Caused by Phonetic Variations
Berg-Kirkpatrick et al. Improved typesetting models for historical OCR
Green et al. Entity clustering across languages
Tursun et al. Noisy Uyghur text normalization
WO2019119852A1 (zh) Language processing method and device
KR20240006688A (ko) Multilingual grammatical error correction
Mammadzada A review of existing transliteration approaches and methods
Raymond et al. Markup reconsidered
Yang et al. Spell Checking for Chinese.
JP2016133960A (ja) Keyword extraction system, keyword extraction method, and computer program
CN102135957A (zh) Method and device for translating short sentences
US11620319B2 (en) Search platform for unstructured interaction summaries
Marton et al. Transliteration normalization for information extraction and machine translation
Abudouwaili et al. Research on the Uyghur morphological segmentation model with an attention mechanism
CN110083817B (zh) Name disambiguation method and apparatus, and computer-readable storage medium
Nguyen et al. An in-depth analysis of OCR errors for unconstrained Vietnamese handwriting
Buriachok et al. Implementation of an index optimize technology for highly specialized terms based on the phonetic algorithm metaphone

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 18890375

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2018890375

Country of ref document: EP

Effective date: 20200629