CN110807337B - Patent double sentence pair processing method and system - Google Patents
- Publication number: CN110807337B
- Application number: CN201911064809.XA
- Authority: CN (China)
- Prior art keywords: sentence, bilingual, alignment, content, level
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Machine Translation (AREA)
Abstract
The invention relates to the technical field of machine translation, and in particular to a patent bilingual sentence pair processing method and system. The method comprises the following steps: acquiring a patent discourse-level bilingual aligned corpus; extracting content features of the patent bilingual discourse-level articles; according to the content features, segmenting the discourse-level bilingual aligned corpus into content modules and grading them; segmenting the graded content modules into paragraph modules; classifying and sorting the paragraph modules; and classifying and sorting sentence-level alignments according to the classification results of the paragraph modules. Building on statistical translation and starting from the patent field, the disclosed method and system form a patent knowledge base through patent document analysis and, combined with neural network translation, automatically extract and generate a patent bilingual sentence pair library, thereby greatly improving efficiency and accuracy and facilitating the rapid construction of a bilingual sentence pair corpus for the patent field.
Description
Technical Field
The invention relates to the technical field of machine translation, in particular to a patent bilingual sentence pair processing method and system.
Background
Machine translation, also known as automatic translation, is the process of converting one natural language (source language) to another (target language) using a computer. Machine translation systems are divided into two types according to deployment modes: open systems for mass users and localized deployment systems for specific users.
Chinese patent CN201810845896.1 provides a training method and apparatus for a neural network machine translation model, which includes: acquiring a plurality of high-resource language pairs and low-resource language pairs; performing a character-level spelling unification operation on the source language of the high-resource language pairs and the source language of the low-resource language pair; taking each processed high-resource language pair as the training set of a corresponding parent model and the processed low-resource language pair as the training set of a child model; training the parent models in a preset order by transfer learning, so that the source-language and target-language word vectors of each parent model are transferred to the next parent model; and training the child model from the last trained parent model to obtain a neural network machine translation model for translating low-resource languages. The method helps to significantly improve the performance of a child model trained on low-resource language pairs.
However, a bilingual sentence-pair parallel corpus is a very important data resource for machine translation based on neural network algorithms. A parallel aligned corpus is a bilingual corpus composed of original text and the corresponding translated text, and can be classified by degree of alignment into word level, sentence level, paragraph level and discourse level. In raw corpus data, however, the original text and the translation do not correspond one to one. Because of differences between Chinese and foreign-language article structures, habits of expression, the writing habits of authors and the translation habits of translators, the numbers of paragraphs and sentences may differ: for example, 10 Chinese paragraphs may correspond to 15 English paragraphs, or 10 English paragraphs to 15 Chinese paragraphs; likewise, 10 Chinese sentences may correspond to 12 English sentences, or 10 English sentences to 12 Chinese sentences.
At present, large amounts of accurate bilingual corpus are obtained mainly by combining statistical translation methods with manual work to produce one-to-one sentences. This approach consumes a great deal of labor and time and depends on a background dictionary; moreover, because the skill levels of the processing personnel differ, the accuracy is uncertain. It is therefore not conducive to improving the efficiency and accuracy of sentence-aligned corpus alignment.
Therefore, in order to solve the above problems, a new method for processing the patent bilingual sentence pair is urgently needed.
Disclosure of Invention
The invention aims to provide a patent bilingual sentence pair processing method that solves the low efficiency and low quality of prior-art methods for processing patent bilingual aligned corpora.
The invention provides the following scheme:
a patent bilingual sentence pair processing method comprises the following steps:
extracting content characteristics of the patent discourse-level bilingual alignment corpus from the patent discourse-level bilingual alignment corpus;
according to the content characteristics, segmenting the patent discourse-level bilingual alignment corpus into content modules, and carrying out grading processing to obtain a plurality of content grading modules;
segmenting each content grading module into paragraph modules to obtain a plurality of paragraph modules;
classifying and sorting the plurality of paragraph modules respectively, and calibrating the category of each paragraph module;
according to the category of each paragraph module, carrying out classification and arrangement of sentence-level alignment;
according to the classification and arrangement result of sentence-level alignment, sentence alignment is carried out in combination with patent big data statistics;
and screening sentence alignment results to form patent bilingual alignment corpora, and adding the patent bilingual alignment corpora into the corpus to form the corpus with the patent bilingual alignment corpora.
Preferably, the step of extracting the content features of the patent bilingual discourse-level article from the patent discourse-level bilingual aligned corpus specifically includes:
forming a content feature alignment library according to the content features of the patent; the content features of the patent include the abstract of the specification, the abstract drawings, the specification drawings and the claims.
Preferably, the step of segmenting the patent discourse-level bilingual aligned corpus into content modules according to the content features and grading them to obtain a plurality of content grading modules specifically includes:
according to the patent content feature alignment library, dividing content modules and classifying them as follows:
the first-level classification comprises the abstract of the specification, the claims, the specification and the drawings of the specification;
the second-level classification comprises the technical field, the background, the summary of the invention, the description of the drawings and the detailed description;
the third-level classification comprises the abstract drawings and the embodiments.
Preferably, the step of performing paragraph module segmentation on each content grading module to obtain a plurality of paragraph modules specifically includes:
each divided content module is further divided into paragraph modules: if the paragraph counts are consistent, the paragraph modules are matched one by one to form paragraph-level bilingual aligned corpora D1, D2, D3 … DN; if the paragraph counts are inconsistent, the process returns to content module alignment to form the content module bilingual aligned corpus ND1.
Preferably, the step of classifying and sorting the plurality of paragraph modules respectively and calibrating the category of each paragraph module specifically includes:
dividing the formed paragraph-level bilingual aligned corpora D1, D2, D3 … DN, according to whether the sentence counts are consistent, into sentence-count-consistent libraries J1, J2, J3 … JN and sentence-count-inconsistent libraries NJ1, NJ2, NJ3 … NJN;
dividing the formed content module bilingual aligned corpus ND1 into a content module sentence-level corpus ND1-J3.
Preferably, the step of performing sentence alignment according to the sentence-level classification results in combination with patent big data statistics specifically includes:
performing sentence alignment separately on the formed sentence-count-consistent libraries J1, J2, J3 … JN, the sentence-count-inconsistent libraries NJ1, NJ2, NJ3 … NJN, and the content module sentence-level corpus ND1-J3.
Preferably, the step of sentence alignment of the formed sentence-count-consistent libraries J1, J2, J3 … JN includes:
firstly, forming a Chinese sentence list and an English sentence list with the same number of sentences from the Chinese-English bilingual corpus;
secondly, matching the Chinese sentence list and the English sentence list one to one to form sentence beads, wherein the sentence beads are one-to-one and the formed sentence beads are treated as corresponding by default.
Preferably, the step of sentence alignment of the formed sentence-count-inconsistent libraries NJ1, NJ2, NJ3 … NJN includes:
firstly, forming a Chinese sentence list and an English sentence list, whose sentence counts are inconsistent, from the Chinese-English bilingual corpus;
secondly, matching the Chinese sentence list and the English sentence list to form sentence beads, wherein the sentence beads are one-to-one, one-to-many or many-to-one and the formed sentence beads are treated as corresponding by default.
Preferably, the step of sentence alignment of the formed content module sentence-level corpus ND1-J3 includes:
firstly, forming a Chinese sentence list and an English sentence list, whose sentence counts are uncertain, from the Chinese-English bilingual corpus;
secondly, matching the Chinese sentence list and the English sentence list to form sentence beads, wherein the sentence beads are one-to-one, one-to-many or many-to-one and the formed sentence beads are treated as corresponding by default.
Further, the present invention also provides a patent bilingual sentence pair processing system, which comprises:
a content acquisition module 210, configured to acquire the sentence-level corpora formed under different constraints, including the sentence-count-consistent libraries J1, J2, J3 … JN, the sentence-count-inconsistent libraries NJ1, NJ2, NJ3 … NJN, and the content module sentence-level corpus ND1-J3;
a first sentence alignment module 220, configured to form unique sentence beads from the sentence-count-consistent libraries J1, J2, J3 … JN and screen them for accuracy, obtaining an accurate and reliable bilingual parallel corpus by applying linguistic constraints, computing sentence similarity in combination with the patent knowledge base, and screening by threshold;
a second sentence alignment module 230, configured to form unique sentence beads from the sentence-count-inconsistent libraries NJ1, NJ2, NJ3 … NJN and screen them for accuracy, obtaining an accurate and reliable bilingual parallel corpus by applying linguistic constraints, computing sentence similarity in combination with the patent knowledge base, and screening by threshold;
and a third sentence alignment module 240, configured to form unique sentence beads from the content module sentence-level corpus ND1-J3 and screen them for accuracy, obtaining an accurate and reliable bilingual parallel corpus by applying linguistic constraints, computing sentence similarity in combination with the patent knowledge base, and screening by threshold.
The invention has the following beneficial effects:
The invention discloses a patent bilingual sentence pair processing method and system. The method comprises the following steps: acquiring a patent discourse-level bilingual aligned corpus; extracting content features of the patent bilingual discourse-level articles; according to the content features, segmenting the discourse-level bilingual aligned corpus into content modules and grading them; segmenting the graded content modules into paragraph modules; classifying and sorting the paragraph modules; classifying and sorting sentence-level alignments according to the classification results of the paragraph modules; performing sentence alignment according to the sentence-level classification results in combination with patent big data statistics; and screening the sentence alignment results as necessary to form patent bilingual aligned corpora, which are added to the corpus. Building on statistical translation and starting from the patent field, the invention forms a patent knowledge base through patent document analysis and, combined with neural network translation, provides a more in-depth bilingual sentence pair processing method for the patent field. The patent bilingual sentence pair library is thus extracted and generated automatically, efficiency and accuracy are greatly improved, and a bilingual sentence pair corpus for the patent field can be established quickly.
Drawings
FIG. 1 is a flow chart of a patent bilingual sentence pair processing method according to the present invention.
FIG. 2 is a block diagram of a patent bilingual sentence pair processing system according to the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Referring to fig. 1, a patent bilingual sentence pair processing method includes the following steps:
S1, acquiring a patent discourse-level bilingual aligned corpus;
S2, extracting content features of the patent bilingual discourse-level articles;
S3, according to the content features, segmenting the discourse-level bilingual aligned corpus into content modules and grading them;
S4, segmenting the graded content modules into paragraph modules;
S5, classifying and sorting the paragraph modules;
S6, classifying and sorting sentence-level alignments according to the classification results of the paragraph modules;
S7, performing sentence alignment according to the sentence-level classification results in combination with patent big data statistics;
S8, S9 and S10, screening the sentence alignment results as necessary to form patent bilingual aligned corpora, and adding them to the corpus.
The method for extracting the content characteristics of the patent bilingual discourse-level articles comprises the following steps:
forming a content feature alignment library according to the content features of the patent; the content features of the patent include the abstract of the specification, the abstract drawings, the specification drawings and the claims.
According to the content characteristics, the steps of segmenting and grading the content module of the bilingual alignment chapter-level corpus are as follows:
according to the patent content feature alignment library, dividing content modules, and classifying as follows:
the first-level classification comprises the abstract of the specification, the claims, the specification and the drawings of the specification;
the second-level classification comprises the technical field, the background, the summary of the invention, the description of the drawings and the detailed description;
the third-level classification comprises the abstract drawings and the embodiments.
The steps of segmenting the graded content modules into paragraph modules and classifying and sorting the paragraph modules are as follows:
each divided content module is further divided into paragraph modules: if the paragraph counts are consistent, the paragraph modules are matched one by one to form paragraph-level bilingual aligned corpora D1, D2, D3 … DN; if the paragraph counts are inconsistent, the process returns to content module alignment to form the content module bilingual aligned corpus ND1.
The step of classifying and sorting sentence-level alignments according to the classification results of the paragraph modules specifically comprises:
dividing the formed paragraph-level bilingual aligned corpora D1, D2, D3 … DN, according to whether the sentence counts are consistent, into sentence-count-consistent libraries J1, J2, J3 … JN and sentence-count-inconsistent libraries NJ1, NJ2, NJ3 … NJN;
dividing the formed content module bilingual aligned corpus ND1 into a content module sentence-level corpus ND1-J3.
Sentence alignment is performed according to the sentence-level classification results in combination with patent big data statistics, specifically:
sentence alignment is performed separately on the formed sentence-count-consistent libraries J1, J2, J3 … JN, the sentence-count-inconsistent libraries NJ1, NJ2, NJ3 … NJN, and the content module sentence-level corpus ND1-J3.
The step of sentence alignment of the formed sentence-count-consistent libraries J1, J2, J3 … JN includes:
firstly, forming a Chinese sentence list and an English sentence list, whose sentence counts are consistent, from the Chinese-English bilingual corpus;
secondly, matching the Chinese sentence list and the English sentence list one to one to form sentence beads, wherein the sentence beads are one-to-one and the formed sentence beads are treated as corresponding by default.
The step of sentence alignment of the formed sentence-count-inconsistent libraries NJ1, NJ2, NJ3 … NJN includes:
firstly, forming a Chinese sentence list and an English sentence list, whose sentence counts are inconsistent, from the Chinese-English bilingual corpus;
secondly, matching the Chinese sentence list and the English sentence list to form sentence beads, wherein the sentence beads are one-to-one, one-to-many or many-to-one and the formed sentence beads are treated as corresponding by default.
The step of sentence alignment of the formed content module sentence-level corpus ND1-J3 includes:
firstly, forming a Chinese sentence list and an English sentence list, whose sentence counts are uncertain, from the Chinese-English bilingual corpus;
secondly, matching the Chinese sentence list and the English sentence list to form sentence beads, wherein the sentence beads are one-to-one, one-to-many or many-to-one and the formed sentence beads are treated as corresponding by default.
The patent bilingual sentence pair processing method described in this embodiment includes the following specific procedures:
Step S1: determining the discourse-level patent bilingual corpus.
Step S2: forming a content feature alignment library according to patent content features such as the abstract of the specification, the abstract drawings, the specification drawings and the claims.
Step S3: dividing content modules according to the patent content feature alignment library. For example:
the first-level classification is: abstract, claims, description, drawings;
the second-level classification is: technical field, background, summary, brief description of the drawings, detailed description, etc.;
the third-level classification is: figures, examples, etc.
Step S3 further subdivides the patent content on the basis of step S2. The classification of content modules at each level is established from statistics over the full patent data, and the identification vocabularies of corresponding content modules are matched one to one. For example, the Chinese heading rendered as "background technology" may appear in the corresponding English patent as "BACKGROUND ART" and the like. The statistically collected identification vocabulary is imported into the program, which enables fully automatic and accurate classification of content modules.
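The heading-vocabulary lookup of step S3 can be illustrated with a minimal Python sketch. It assumes that each content module starts with a heading on its own line. Only "BACKGROUND ART" is taken from the text; the other vocabulary entries, the module keys and the function name `split_into_modules` are hypothetical, and a real vocabulary would be derived from patent statistics as described above.

```python
# Hypothetical identification vocabulary: module key -> heading variants.
# Only "BACKGROUND ART" appears in the text; every other entry is illustrative.
MODULE_HEADINGS = {
    "technical_field": ["TECHNICAL FIELD", "技术领域"],
    "background": ["BACKGROUND ART", "BACKGROUND", "背景技术"],
    "summary": ["SUMMARY OF THE INVENTION", "SUMMARY", "发明内容"],
}

# Reverse lookup: normalized heading text -> module key.
HEADING_TO_MODULE = {
    heading.upper(): key
    for key, headings in MODULE_HEADINGS.items()
    for heading in headings
}


def split_into_modules(text):
    """Group the lines of one patent text under the heading that precedes them."""
    modules = {}
    current = None
    for line in text.splitlines():
        key = HEADING_TO_MODULE.get(line.strip().upper())
        if key is not None:
            current = key
            modules.setdefault(current, [])
        elif current is not None and line.strip():
            modules[current].append(line.strip())
    return modules


zh_text = "背景技术\n机器翻译是一种自动翻译。\n发明内容\n本发明提供一种方法。"
en_text = (
    "BACKGROUND ART\nMachine translation is automatic translation.\n"
    "SUMMARY\nThe invention provides a method."
)

zh_modules = split_into_modules(zh_text)
en_modules = split_into_modules(en_text)
# Modules found on both sides can be paired as aligned content modules.
shared = [key for key in zh_modules if key in en_modules]
print(shared)
```

Mapping both languages onto the same module key is what lets the Chinese and English sides be paired without comparing their text.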
Step S4: each content module divided in step S3 is further divided into paragraph modules. The content module is split into paragraph modules at marks such as carriage returns or line feeds within it, and the paragraphs are counted. Specifically:
S401: if the paragraph counts are consistent, the paragraph modules are matched one by one to form paragraph-level bilingual aligned corpora D1, D2, D3 … DN;
S402: if the paragraph counts are inconsistent, the process returns to step S3 and aligns at the content module level to form the content module bilingual aligned corpus ND1, which consists of content modules whose internal paragraph counts are inconsistent.
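The branch between S401 and S402 can be sketched as follows. This is a minimal illustration that assumes paragraphs are separated by line breaks; the function names and the "D" / "ND1" return labels are hypothetical conveniences, not part of the disclosed method.

```python
def split_paragraphs(module_text):
    """Split a content module on carriage returns / line feeds, dropping blanks."""
    return [p.strip() for p in module_text.splitlines() if p.strip()]


def align_paragraphs(zh_module, en_module):
    """Return ("D", paragraph pairs) when counts match, else ("ND1", module pair)."""
    zh_paras = split_paragraphs(zh_module)
    en_paras = split_paragraphs(en_module)
    if len(zh_paras) == len(en_paras):
        # S401: equal paragraph counts, pair the paragraphs one by one.
        return "D", list(zip(zh_paras, en_paras))
    # S402: counts differ, fall back to alignment at the content module level.
    return "ND1", [(zh_module, en_module)]


kind, pairs = align_paragraphs("第一段。\n第二段。", "First paragraph.\r\nSecond paragraph.")
print(kind, pairs)
```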
Step S5: the paragraph-level bilingual aligned corpora D1, D2, D3 … DN formed in step S401 are divided into sentence-level aligned corpora, which are sorted, according to whether the sentence counts are consistent, into sentence-count-consistent libraries J1, J2, J3 … JN and sentence-count-inconsistent libraries NJ1, NJ2, NJ3 … NJN. Judging whether the sentence counts are consistent depends on how sentences are segmented. In Chinese patents, fixed marks such as the full stop and the semicolon are generally used as sentence boundaries; punctuation that does not appear in patents, such as question marks and exclamation marks, is not considered. In English patents, the full stop, the semicolon and the colon are generally used as sentence boundaries, but exceptions such as "No." and "U.S.A." must be excluded.
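The segmentation rules of step S5 can be sketched in Python as below. The sketch assumes the Chinese boundaries are the full stop and semicolon (。；) and the English boundaries are `.`, `;` and `:`; the exception list contains only the two abbreviations named in the text, and the function names are hypothetical. It does not handle decimals such as "3.5", which a production splitter would need to protect as well.

```python
import re

# Abbreviations that must not end an English sentence. "No." and "U.S.A."
# come from the text; a real list would be longer.
EN_EXCEPTIONS = ["No.", "U.S.A."]


def split_chinese(text):
    """Split after the Chinese full stop or semicolon."""
    parts = re.findall(r"[^。；]+[。；]?", text)
    return [p.strip() for p in parts if p.strip()]


def split_english(text):
    """Split after '.', ';' or ':' while protecting the listed abbreviations."""
    protected = text
    for i, abbr in enumerate(EN_EXCEPTIONS):
        # Swap each abbreviation for a placeholder that contains no boundary mark.
        protected = protected.replace(abbr, f"\x00{i}\x00")
    sentences = []
    for part in re.findall(r"[^.;:]+[.;:]?", protected):
        for i, abbr in enumerate(EN_EXCEPTIONS):
            part = part.replace(f"\x00{i}\x00", abbr)
        if part.strip():
            sentences.append(part.strip())
    return sentences


def classify_pair(zh_paragraph, en_paragraph):
    """Return "J" when both sides have the same sentence count, else "NJ"."""
    same = len(split_chinese(zh_paragraph)) == len(split_english(en_paragraph))
    return "J" if same else "NJ"


print(split_english("See U.S.A. patent No. 5. The device works; it is stable."))
```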
Step S6: the content module bilingual aligned corpus ND1 formed in step S402 is divided into the sentence-level corpus ND1-J3.
Step S7: sentence alignment is performed separately on the three kinds of sentence-level corpora "J1, J2, J3 … JN", "NJ1, NJ2, NJ3 … NJN" and "ND1-J3".
Step S8: sentence alignment is performed on the sentence-level corpora "J1, J2, J3 … JN". Corpora of this type are formed under three conditions: the content modules are aligned, the paragraph counts are consistent, and the sentence counts are consistent. Taking a Chinese-English bilingual corpus as an example, first, a Chinese sentence list and an English sentence list are formed from the bilingual corpus: the Chinese sentence list consists of the Chinese original text in the sentence-level corpus, the English sentence list consists of the English original text, and the two lists contain the same number of sentences. Secondly, the Chinese sentence list and the English sentence list are matched one to one to form sentence beads; the sentence beads are one-to-one, and the formed sentence beads are treated as corresponding by default. To avoid omissions, when each sentence in the Chinese sentence list is matched, the previous and next sentences of its corresponding sentence in the English sentence list are selected in addition to the corresponding sentence itself, forming a sentence-bead matrix A:
[Sentence-bead matrix A is not reproduced in this text.]
wherein n is the number of sentences in the Chinese sentence list (equal to that in the English sentence list) formed in this step, and n ≥ 1. In the sentence-bead matrix A, the correspondence probability and the similarity are calculated for each candidate sentence bead in each row, the two are multiplied, and the candidate with the largest product is selected, so that each row of matrix A yields a unique sentence bead. This step maximizes recall.
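One reading of the sentence-bead matrix A, consistent with the description above, is an n-row structure in which row i pairs Chinese sentence i with English sentences i-1, i and i+1. The sketch below follows that reading. The text does not define how the probability and the similarity are computed, so `toy_score` and its small lexicon are purely hypothetical stand-ins for their product.

```python
# Hypothetical toy lexicon standing in for the patent knowledge base.
LEXICON = {"电机": "motor", "齿轮": "gear", "外壳": "housing"}


def toy_score(zh, en):
    """Placeholder for probability x similarity: count lexicon hits."""
    return sum(1 for z, e in LEXICON.items() if z in zh and e in en.lower())


def build_matrix(zh_list, en_list):
    """Row i pairs zh_list[i] with en_list[i], en_list[i-1] and en_list[i+1]."""
    n = len(zh_list)
    assert n == len(en_list), "J-type corpora have equal sentence counts"
    # The default position i is listed first so that ties keep the default bead.
    return [
        [(zh_list[i], en_list[j]) for j in (i, i - 1, i + 1) if 0 <= j < n]
        for i in range(n)
    ]


def pick_beads(matrix, score):
    """Keep the highest-scoring candidate in each row."""
    return [max(row, key=lambda bead: score(*bead)) for row in matrix]


zh = ["电机转动。", "齿轮啮合。", "外壳封闭。"]
en = ["The motor rotates.", "The gears mesh.", "The housing is closed."]
matrix = build_matrix(zh, en)
beads = pick_beads(matrix, toy_score)
print(beads)
```

Each row is resolved independently here, so two Chinese sentences could in principle select the same English sentence; the source does not say whether that is prevented.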
After the unique sentence beads are formed, they are screened for accuracy. Linguistic constraints are applied, and the similarity of each sentence bead is computed by sentence similarity calculation combined with the patent knowledge base, giving a corresponding similarity list. A threshold in the range 0 to 1 is then chosen according to the actual condition of the patents and used for screening, which yields an accurate and reliable bilingual parallel corpus. This step maximizes accuracy at the cost of some recall.
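The threshold screening can be sketched as a simple filter. The similarity function is supplied by the caller because the text does not specify it; the precomputed scores below are invented for illustration, and keeping beads at or above the threshold (rather than strictly above) is an assumption.

```python
def screen_beads(beads, similarity, threshold):
    """Keep beads whose similarity is at or above a threshold in [0, 1]."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    similarities = [similarity(zh, en) for zh, en in beads]
    kept = [bead for bead, s in zip(beads, similarities) if s >= threshold]
    return kept, similarities


# Illustrative beads with made-up similarity values.
beads = [("电机转动。", "The motor rotates."), ("外壳封闭。", "The gears mesh.")]
made_up_scores = {beads[0]: 0.9, beads[1]: 0.3}

kept, similarities = screen_beads(
    beads, lambda zh, en: made_up_scores[(zh, en)], threshold=0.5
)
print(kept)
```

Raising the threshold discards more beads, which is the accuracy-for-recall trade described above.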
Step S9: sentence alignment is performed on the sentence-level corpora "NJ1, NJ2, NJ3 … NJN". Corpora of this type are formed under two conditions: the content modules are aligned and the paragraph counts are consistent. Taking a Chinese-English bilingual corpus as an example, first, a Chinese sentence list and an English sentence list are formed from the bilingual corpus: the Chinese sentence list consists of the Chinese original text in the sentence-level corpus, the English sentence list consists of the English original text, and the two lists contain different numbers of sentences. Secondly, the Chinese sentence list and the English sentence list are matched to form sentence beads; a sentence bead may be one-to-one, one-to-many or many-to-one, and the formed sentence beads are treated as corresponding by default.
In the corpora "NJ1, NJ2, NJ3 … NJN", the Chinese and English sentence lists contain different numbers of sentences, so the guaranteed one-to-one sentence beads of step S8 cannot be formed; sentence beads must therefore be formed with the one-to-one, one-to-many and many-to-one cases all in mind. The one-to-one case may be handled as in step S8. In the one-to-many and many-to-one cases, forming a sentence bead requires considering the error range of each sentence of the source language sentence list (here the Chinese sentence list) within the target language sentence list (here the English sentence list). In theory, the error should lie within the difference i between the sentence counts of the source and target lists, and i is smaller than the sentence count of either list. In practice, to distinguish this from the one-to-one case, the minimum value of i is set to 1 and its maximum value is the larger of the two sentence counts. Thus, for each source language sentence, a plurality of candidate sentence beads within the range of the difference i are formed, giving sentence-bead columns B1 … BN, wherein N is the smaller of the sentence counts of the source and target lists.
In the sentence-bead columns B1 … BN, the correspondence probability and the similarity are calculated for each candidate sentence bead in each column, the two are multiplied, and the candidate with the largest product is selected, so that each column yields a unique sentence bead. This step maximizes recall.
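The text leaves the exact construction of the candidate columns open. The sketch below is one possible interpretation, covering only the one-to-many direction: for source sentence k it enumerates target spans that start within ±i of position k and contain up to i + 1 consecutive target sentences, where i is the difference in sentence counts with a minimum of 1. The span construction, the function names and `toy_score` (a lexicon-hit count with a small length penalty) are all assumptions made for illustration.

```python
# Hypothetical toy lexicon standing in for the patent knowledge base.
LEXICON = {"电机": "motor", "齿轮": "gear", "外壳": "housing"}


def toy_score(zh, en):
    """Placeholder for probability x similarity, with a small length penalty."""
    hits = sum(1 for z, e in LEXICON.items() if z in zh and e in en.lower())
    return hits - 0.01 * len(en.split())


def candidate_spans(k, n_src, n_tgt):
    """Target spans (start, end) considered for source sentence k."""
    i = max(1, abs(n_src - n_tgt))  # difference i, never below 1
    spans = []
    for start in range(max(0, k - i), min(n_tgt, k + i + 1)):
        for length in range(1, i + 2):
            if start + length <= n_tgt:
                spans.append((start, start + length))
    return spans


def best_bead(k, src, tgt, score):
    """Pick the highest-scoring candidate span for source sentence k."""
    spans = candidate_spans(k, len(src), len(tgt))
    a, b = max(spans, key=lambda s: score(src[k], " ".join(tgt[s[0]:s[1]])))
    return src[k], tgt[a:b]


src = ["电机带动齿轮转动。", "外壳封闭。"]
tgt = ["The motor rotates.", "It drives the gear.", "The housing is closed."]
for k in range(len(src)):
    print(best_bead(k, src, tgt, toy_score))
```

With this toy data the first Chinese sentence is matched to the first two English sentences (one-to-many) and the second to the last English sentence (one-to-one). The many-to-one case would need the roles of the two lists swapped.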
After the unique sentence beads are formed, they are screened for accuracy. Linguistic constraints are applied, and the similarity of each sentence bead is computed by sentence similarity calculation combined with the patent knowledge base, giving a corresponding similarity list. A threshold in the range 0 to 1 is then chosen according to the actual condition of the patents and used for screening, which yields an accurate and reliable bilingual parallel corpus. This step maximizes accuracy at the cost of some recall.
Step S10: Sentence alignment is performed on the sentence-level corpus "ND1-J3", which is the corpus formed under content-module alignment. Taking a Chinese-English bilingual corpus as an example, first, a Chinese sentence list and an English sentence list are formed from the bilingual corpus: the Chinese sentence list consists of the Chinese original text in the sentence-level corpus, the English sentence list consists of the English original text, and the number of sentences in each list is uncertain. Second, the Chinese sentence list and the English sentence list are put into correspondence to form sentence beads; a sentence bead may be one-to-one, one-to-many, or many-to-one, and the formed sentence beads are treated as corresponding by default.
In the corpus "ND1-J3", the numbers of sentences in the Chinese sentence list and the English sentence list are uncertain, so the strictly one-to-one sentence beads of step S8 cannot be formed directly. In this step, sentence beads are therefore formed by considering the one-to-one, one-to-many, and many-to-one cases. The one-to-one case is handled as in step S8. For the one-to-many and many-to-one cases, forming a sentence bead requires considering the error range within which a sentence of the source language sentence list (here the Chinese sentence list) may fall in the target language sentence list (here the English sentence list). In theory, this error should lie within the difference i between the number of sentences in the source language sentence list and the number in the target language sentence list, and the difference i is smaller than the number of sentences in either list. In practice, to distinguish these cases from the one-to-one case, the minimum value of the difference i is set to 1, and its maximum value is the larger of the two sentence counts. Thus, for each source language sentence in the source language sentence list, several candidate sentence beads within the range of the difference i are formed, producing the sentence bead columns C1 to CN, where N is the smaller of the source and target sentence counts.
In the sentence bead columns C1 to CN, the probability and the similarity of each candidate sentence bead in each column are calculated and multiplied, and the candidate with the largest product is selected as the correspondence, so that each column yields a unique sentence bead. This step maximizes the recall rate.
After the unique sentence beads are formed, they are screened for accuracy. Linguistic constraints are applied, and the similarity of each sentence bead is calculated by sentence similarity calculation combined with the patent knowledge base, yielding a corresponding similarity list. A threshold in the range 0 to 1 is then determined according to the actual condition of the patent, and the sentence beads are screened against it; the sentence beads that pass form an accurate and reliable bilingual comparison corpus. This step maximizes accuracy at the cost of some recall.
In the above steps, the division into content modules, paragraph modules, and sentence modules relies partly on the long-term collection and curation of big data, and partly on the fixed tags of XML, which explicitly mark paragraphs and similar units. For example, in standardized XML patent data, a pair of indication-title tags delimits the title, a pair of abstract tags delimits the abstract, a pair of right-text tags delimits the claims, and paragraph tags delimit paragraphs. Using these tags as cues greatly improves the accuracy of the division into content modules, paragraph modules, and sentence modules.
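As an illustration of tag-driven segmentation, the sketch below splits a small XML document into content modules and paragraphs with Python's standard `xml.etree.ElementTree`. The sample document, the tag names `title`, `abstract`, `claims`, and `p`, and the function `split_modules` are assumptions chosen for the example; actual patent XML schemas use their own element names.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Sequence

# Hypothetical, simplified patent document.
SAMPLE = """<patent>
  <title>Sentence alignment method</title>
  <abstract><p>First paragraph.</p></abstract>
  <claims><p>Claim one.</p><p>Claim two.</p></claims>
</patent>"""


def split_modules(xml_text: str,
                  module_tags: Sequence[str] = ("title", "abstract", "claims")
                  ) -> Dict[str, List[str]]:
    """Map each content module tag to the list of its paragraph texts."""
    root = ET.fromstring(xml_text)
    modules: Dict[str, List[str]] = {}
    for tag in module_tags:
        node = root.find(tag)           # direct child of the root element
        if node is None:
            continue
        paragraphs = ["".join(p.itertext()).strip() for p in node.iter("p")]
        if not paragraphs:              # module without paragraph tags
            paragraphs = ["".join(node.itertext()).strip()]
        modules[tag] = paragraphs
    return modules


modules = split_modules(SAMPLE)
```

Running the same function on the source-language and target-language files gives two dictionaries keyed by module, whose paragraph lists can then be compared module by module.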
The method is applicable to, but not limited to, bilingual sentence alignment between Chinese and English, German, Japanese, Korean, French, and other languages.
In the patent bilingual sentence pair processing method described in this embodiment, the patent content feature library, the patent linguistic constraint library, the patent knowledge base, and the like are built from big-data statistics combined with the accumulated experience of patent translators, and can be applied to various fields including but not limited to patents.
Referring to Fig. 2, a patent bilingual sentence pair processing system includes:
the content acquisition module 210, configured to acquire sentence-level corpora formed under different limiting conditions, including the sentence-pair-number consistent libraries J1, J2, J3 … JN, the sentence-pair-number inconsistent libraries NJ1, NJ2, NJ3 … NJN, and the content module sentence-level corpus ND1-J3;
the first sentence alignment module 220, configured to form unique sentence beads from the sentence-pair-number consistent libraries J1, J2, J3 … JN and screen them for accuracy, using linguistic constraints and sentence similarity calculation combined with a patent knowledge base, and to obtain an accurate and reliable bilingual comparison corpus after threshold screening;
the second sentence alignment module 230, configured to form unique sentence beads from the sentence-pair-number inconsistent libraries NJ1, NJ2, NJ3 … NJN and screen them for accuracy, using linguistic constraints and sentence similarity calculation combined with a patent knowledge base, and to obtain an accurate and reliable bilingual comparison corpus after threshold screening;
and the third sentence alignment module 240, configured to form unique sentence beads from the content module sentence-level corpus ND1-J3 and screen them for accuracy, using linguistic constraints and sentence similarity calculation combined with a patent knowledge base, and to obtain an accurate and reliable bilingual comparison corpus after threshold screening.
In the patent bilingual sentence pair processing system described in this embodiment, specifically:
The content acquisition module 210 is configured to acquire sentence-level corpora formed under different limiting conditions, including "J1, J2, J3 … JN", "NJ1, NJ2, NJ3 … NJN", and "ND1-J3", where each sentence-level corpus is obtained through steps S1-S6.
The first sentence alignment module 220 is configured to form unique sentence beads in the sentence-level corpus J1, J2, J3 … JN and screen them for accuracy, using linguistic constraints and sentence similarity calculation combined with a patent knowledge base, and to obtain an accurate and reliable bilingual comparison corpus after threshold screening. The module maximally improves the accuracy and the recall rate.
The second sentence alignment module 230 is configured to form unique sentence beads in the sentence-level corpus NJ1, NJ2, NJ3 … NJN and screen them for accuracy, using linguistic constraints and sentence similarity calculation combined with a patent knowledge base, and to obtain an accurate and reliable bilingual comparison corpus after threshold screening. The module maximally improves the accuracy and the recall rate.
The third sentence alignment module 240 is configured to form unique sentence beads in the sentence-level corpus ND1-J3 and screen them for accuracy, using linguistic constraints and sentence similarity calculation combined with a patent knowledge base, and to obtain an accurate and reliable bilingual comparison corpus after threshold screening. The module maximally improves the accuracy and the recall rate.
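The division of labour among the modules can be sketched as a simple dispatch table. Everything named here is hypothetical: the keys "J", "NJ", and "ND" stand for the three corpus types, `align_one_to_one` stands in for module 220, and `align_whole_block` is only a placeholder for the more elaborate bead selection of modules 230 and 240.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]
Aligner = Callable[[Sequence[str], Sequence[str]], List[Pair]]


def align_one_to_one(src: Sequence[str], tgt: Sequence[str]) -> List[Pair]:
    """Stand-in for module 220: equal sentence counts, pair by position."""
    return list(zip(src, tgt))


def align_whole_block(src: Sequence[str], tgt: Sequence[str]) -> List[Pair]:
    """Placeholder for modules 230/240: treat the block as a single bead."""
    return [(" ".join(src), " ".join(tgt))]


# Corpus type -> alignment module (keys are illustrative labels).
MODULES: Dict[str, Aligner] = {
    "J": align_one_to_one,
    "NJ": align_whole_block,
    "ND": align_whole_block,
}


def run_system(corpora: Sequence[Tuple[str, Sequence[str], Sequence[str]]]) -> List[Pair]:
    """Route each (kind, source sentences, target sentences) item to its module."""
    pairs: List[Pair] = []
    for kind, src, tgt in corpora:
        pairs.extend(MODULES[kind](src, tgt))
    return pairs


result = run_system([
    ("J", ["a", "b"], ["x", "y"]),
    ("NJ", ["c", "d"], ["z"]),
])
```

Keeping the routing in one table makes it straightforward to swap a placeholder for a real aligner without touching the acquisition step.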
The patent bilingual sentence pair processing method and system provided by this embodiment reduce manual participation and realize automatic sentence alignment, while also improving alignment accuracy and recall and greatly improving the efficiency of patent bilingual sentence alignment.
In summary, the patent bilingual sentence pair processing method described in this embodiment comprises the following steps: acquiring patent discourse-level bilingual aligned corpora; extracting content features of the patent bilingual discourse-level articles; according to the content features, segmenting the bilingual aligned discourse-level corpora into content modules and grading them; segmenting paragraph modules according to the result of the content grading; classifying and sorting the different paragraph modules; classifying and sorting for sentence-level alignment according to the result for the paragraph modules; performing sentence alignment according to the sentence-level classification results and the patent big data statistics; and screening the sentence alignment results as necessary to form patent bilingual aligned corpora, which are added to the corpus. Building on statistical translation and starting from the patent field, a patent knowledge base is formed through patent document analysis and combined with neural network translation, providing a more in-depth method for processing bilingual sentence pairs in the patent field. The method realizes automatic extraction and generation of a patent bilingual sentence pair library, greatly improves efficiency and accuracy, and facilitates the rapid construction of a bilingual sentence pair corpus for the patent field.
This embodiment also provides a computer system suitable for implementing the patent bilingual sentence pair processing method and system. The computer system includes a processor and a computer-readable storage medium. The computer system may perform a method according to an embodiment of the invention.
In particular, the processor may comprise, for example, a general purpose microprocessor, an instruction set processor and/or related chip set and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor may also include on-board memory for caching purposes. The processor may be a single processing unit or a plurality of processing units for performing the different actions of the method flow according to embodiments of the present invention.
Computer-readable storage media, for example, may be non-volatile computer-readable storage media, specific examples including, but not limited to: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and so on.
The computer-readable storage medium may comprise a computer program that may comprise code/computer-executable instructions that, when executed by a processor, cause the processor to perform a method according to an embodiment of the invention or any variant thereof.
The computer program may be configured with computer program code, for example comprising computer program modules. For example, in an example embodiment, code in the computer program may include one or more program modules, including, for example, a content acquisition module 210, a first sentence alignment module 220, a second sentence alignment module 230, and a third sentence alignment module 240. It should be noted that the division and number of modules are not fixed, and those skilled in the art may use suitable program modules or program module combinations according to actual situations, which when executed by a processor, enable the processor to perform the method according to the embodiments of the present invention or any variations thereof.
According to an embodiment of the present invention, at least one of the above modules may be implemented as a computer program module, which when executed by a processor, may implement the respective operations described above.
The present invention also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the present invention.
According to embodiments of the present invention, the computer readable storage medium may be a non-volatile computer readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
For simplicity of explanation, the method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will appreciate that the embodiments are not limited by the order of acts described, as some steps may occur in other orders or concurrently with other steps in accordance with the embodiments of the invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments and that the acts involved are not necessarily all required to implement the invention.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (8)
1. A patent bilingual sentence pair processing method, characterized in that the method comprises the following steps:
extracting content characteristics of the patent discourse-level bilingual alignment corpus from the patent discourse-level bilingual alignment corpus;
according to the content characteristics, segmenting the patent discourse-level bilingual alignment corpus into content modules, and carrying out grading processing to obtain a plurality of content grading modules;
segmenting each content grading module into paragraph modules to obtain a plurality of paragraph modules;
classifying and sorting the plurality of paragraph modules respectively, and calibrating the category of each paragraph module;
according to the category of each paragraph module, carrying out classification and arrangement of sentence-level alignment;
according to the classification and arrangement result of sentence-level alignment, sentence alignment is carried out in combination with patent big data statistics; screening sentence alignment results to form patent bilingual alignment corpora, and adding the patent bilingual alignment corpora into a corpus to form a corpus with the patent bilingual alignment corpora;
wherein the step of segmenting each content grading module into paragraph modules to obtain a plurality of paragraph modules specifically comprises:
each divided content module is further divided into paragraph modules: if the paragraph numbers are consistent, the paragraph modules are put into one-to-one correspondence to form paragraph-level bilingual aligned corpora D1, D2, D3 … DN; if the paragraph numbers are not consistent, alignment falls back to the content module level to form a content module bilingual aligned corpus ND1;
wherein the step of classifying and sorting the plurality of paragraph modules respectively and calibrating the category of each paragraph module comprises: dividing the formed content module bilingual aligned corpus ND1 into a content module sentence-level corpus ND1-J3;
wherein the step of performing sentence alignment according to the classification and arrangement result of sentence-level alignment in combination with patent big data statistics comprises: performing sentence alignment on the formed content module sentence-level corpus ND1-J3;
wherein the step of performing sentence alignment on the formed content module sentence-level corpus ND1-J3 comprises:
first, forming a Chinese sentence list and an English sentence list from the Chinese-English bilingual corpus, the numbers of sentences in the Chinese sentence list and the English sentence list being uncertain;
second, putting the Chinese sentence list and the English sentence list into correspondence to form sentence beads, the sentence beads being one-to-one, one-to-many, or many-to-one, and the formed sentence beads being treated as corresponding by default; and forming a unique sentence bead based on the probability and the similarity corresponding to each sentence bead.
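The paragraph-count branch recited in claim 1 can be sketched as follows. This is an illustration only, not the claimed implementation; the function name `route_content_module` and the string labels "D" and "ND" are hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def route_content_module(src_paragraphs: Sequence[str],
                         tgt_paragraphs: Sequence[str]) -> Tuple[str, List[Pair]]:
    """Return paragraph-level pairs when counts match, else one module-level pair."""
    if len(src_paragraphs) == len(tgt_paragraphs):
        # Consistent paragraph numbers: one-to-one paragraph correspondence.
        return "D", list(zip(src_paragraphs, tgt_paragraphs))
    # Inconsistent paragraph numbers: fall back to content-module alignment.
    return "ND", [("\n".join(src_paragraphs), "\n".join(tgt_paragraphs))]


matched = route_content_module(["段落一", "段落二"], ["Paragraph one", "Paragraph two"])
fallback = route_content_module(["段落一", "段落二"], ["Only one paragraph"])
```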
2. The patent bilingual sentence pair processing method according to claim 1, wherein: the step of extracting the content features of the patent bilingual discourse-level article from the patent discourse-level bilingual aligned corpus specifically comprises the following steps:
forming a content feature alignment library according to the content features of the patent; the content features of the patent include the abstract of the specification, the abstract drawings, the specification drawings and the claims.
3. The patent bilingual sentence pair processing method according to claim 2, wherein: the step of segmenting the patent discourse-level bilingual alignment corpus into content modules according to the content characteristics and carrying out grading processing to obtain a plurality of content grading modules specifically comprises:
according to the patent content feature alignment library, dividing content modules, and classifying as follows:
the first class comprises the abstract of the specification, the claims, the specification and the drawings of the specification;
the secondary classification comprises the technical field, the background technology, the invention content, the figure description and the specific implementation mode;
the three-level classification comprises abstract drawings and embodiments.
4. The patent bilingual sentence pair processing method according to claim 1, wherein: the step of classifying and sorting the plurality of paragraph modules respectively and calibrating the category of each paragraph module further comprises:
and dividing the formed paragraph-level bilingual aligned corpora D1, D2, D3 … DN, according to whether the numbers of sentence pairs are consistent, into sentence-pair-number consistent libraries J1, J2, J3 … JN and sentence-pair-number inconsistent libraries NJ1, NJ2, NJ3 … NJN.
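The division recited in claim 4 can be illustrated with a short sketch. The punctuation-based sentence splitter, the function names, and the sample paragraphs are assumptions for the example; real patent text would need a more careful splitter, for instance one that does not break on decimal points or abbreviations.

```python
import re
from typing import List, Sequence, Tuple

SentenceLists = Tuple[List[str], List[str]]

# Naive splitter (assumption): a sentence ends at Chinese or Western end punctuation.
_SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]?")


def split_sentences(text: str) -> List[str]:
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


def split_by_sentence_count(paragraph_pairs: Sequence[Tuple[str, str]]
                            ) -> Tuple[List[SentenceLists], List[SentenceLists]]:
    """Return (consistent, inconsistent) lists of sentence-list pairs."""
    consistent: List[SentenceLists] = []
    inconsistent: List[SentenceLists] = []
    for src, tgt in paragraph_pairs:
        s, t = split_sentences(src), split_sentences(tgt)
        (consistent if len(s) == len(t) else inconsistent).append((s, t))
    return consistent, inconsistent


j_library, nj_library = split_by_sentence_count([
    ("本发明涉及机器翻译。该方法提高效率。",
     "The invention relates to machine translation. The method improves efficiency."),
    ("该系统包括处理器和存储介质。",
     "The system includes a processor. It also includes a storage medium."),
])
```

The first paragraph pair has two sentences on each side and lands in the consistent list; the second has one sentence against two and lands in the inconsistent list.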
5. The patent bilingual sentence pair processing method according to claim 4, wherein: according to the sentence-level classification and arrangement result and in combination with the patent big data statistical result, sentence alignment is performed, and the method further comprises the following steps:
sentence alignment is performed on the formed sentence-pair-number consistent libraries J1, J2, J3 … JN and the sentence-pair-number inconsistent libraries NJ1, NJ2, NJ3 … NJN, respectively.
6. The patent bilingual sentence pair processing method according to claim 5, wherein: the step of sentence alignment of the formed sentence-pair-number consistent libraries J1, J2, J3 … JN comprises the following steps:
firstly, forming a Chinese sentence list and an English sentence list with the same sentence number by using bilingual corpora of Chinese and English;
secondly, the Chinese sentence list and the English sentence list are in one-to-one correspondence to form sentence beads, the sentence beads are in one-to-one correspondence, and the formed sentence beads are in correspondence by default.
7. The patent bilingual sentence pair processing method according to claim 6, wherein: the step of sentence alignment of the formed sentence pair number inconsistency libraries NJ1, NJ2, NJ3 … NJN includes:
firstly, forming a Chinese sentence list and an English sentence list by bilingual corpus in which Chinese and English are compared, wherein the sentence numbers of the Chinese sentence list and the English sentence list are inconsistent;
secondly, the Chinese sentence list and the English sentence list are corresponding to form sentence beads, the sentence beads are one-to-one, one-to-many or many-to-one, and the formed sentence beads are corresponding by default.
8. A patent bilingual sentence pair processing system, characterized in that the system comprises:
the content acquisition module, used for acquiring sentence-level corpora formed under different limiting conditions, including sentence-pair-number consistent libraries J1, J2, J3 … JN, sentence-pair-number inconsistent libraries NJ1, NJ2, NJ3 … NJN, and a content module sentence-level corpus ND1-J3; wherein the sentence-level corpora are obtained according to the patent bilingual sentence pair processing method of any one of claims 1 to 6;
the first sentence alignment module, used for forming unique sentence beads from the sentence-pair-number consistent libraries J1, J2, J3 … JN and screening them for accuracy, using linguistic constraints and sentence similarity calculation combined with a patent knowledge base, and obtaining an accurate and reliable bilingual comparison corpus after threshold screening;
the second sentence alignment module, used for forming unique sentence beads from the sentence-pair-number inconsistent libraries NJ1, NJ2, NJ3 … NJN and screening them for accuracy, using linguistic constraints and sentence similarity calculation combined with a patent knowledge base, and obtaining an accurate and reliable bilingual comparison corpus after threshold screening;
and the third sentence alignment module, used for forming unique sentence beads from the content module sentence-level corpus ND1-J3 and screening them for accuracy, using linguistic constraints and sentence similarity calculation combined with a patent knowledge base, and obtaining an accurate and reliable bilingual comparison corpus after threshold screening.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911064809.XA CN110807337B (en) | 2019-11-01 | 2019-11-01 | Patent double sentence pair processing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110807337A (en) | 2020-02-18 |
CN110807337B (en) | 2021-11-12 |
Family
ID=69500989
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911064809.XA Active CN110807337B (en) | 2019-11-01 | 2019-11-01 | Patent double sentence pair processing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110807337B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111428522B (en) * | 2020-03-23 | 2023-06-30 | 腾讯科技(深圳)有限公司 | Translation corpus generation method, device, computer equipment and storage medium |
CN113722497A (en) * | 2020-05-26 | 2021-11-30 | 阿里巴巴集团控股有限公司 | Corpus generation method and apparatus based on patent data |
CN112560511B (en) * | 2020-12-14 | 2024-04-23 | 北京奇艺世纪科技有限公司 | Method and device for translating speech and method and device for training translation model |
CN114742077A (en) * | 2022-04-15 | 2022-07-12 | 中国电子科技集团公司第十研究所 | Generation method of domain parallel corpus and training method of translation model |
CN115688811A (en) * | 2022-09-20 | 2023-02-03 | 甲骨易(北京)语言科技股份有限公司 | Corpus alignment method combining rules and semantics |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103176966A (en) * | 2011-12-22 | 2013-06-26 | 苏州威世博知识产权服务有限公司 | Method and system used for realizing translation of basic patent information |
CN105183722A (en) * | 2015-09-17 | 2015-12-23 | 成都优译信息技术有限公司 | Chinese-English bilingual translation corpus alignment method |
CN106126506A (en) * | 2016-06-22 | 2016-11-16 | 上海者信息科技有限公司 | A kind of online language material alignment schemes and system |
Non-Patent Citations (1)
Title |
---|
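Research on Chinese-English Text-Level Sentence Alignment Technology; Sun Kunjie; China Master's Theses Full-text Database, Information Science and Technology; 2016-03-15 (No. 3); pp. 1, 10, 17-37, Table 2.2, Figs. 4.1-4.4 * |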
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||