CN106815188B - Method for acquiring Chinese and bilingual structure - Google Patents


Info

Publication number
CN106815188B
Authority
CN
China
Prior art keywords
bilingual
tcorpus
sentences
sobase
conjunctive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510846489.9A
Other languages
Chinese (zh)
Other versions
CN106815188A (en)
Inventor
符建辉
王卫明
曹阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Guoli (zhenjiang) Intelligent Technology Co Ltd
Original Assignee
Zhongke Guoli (zhenjiang) Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongke Guoli (zhenjiang) Intelligent Technology Co Ltd
Priority to CN201510846489.9A
Publication of CN106815188A
Application granted
Publication of CN106815188B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention relates to a method for acquiring Chinese pivotal structures (兼语结构, the "bilingual structure" of the title), in which one noun phrase serves both as the object of a first verb and as the subject of a second verb. The method comprises: performing word segmentation on an original training corpus Corpus to form a segmented corpus TCorpus; identifying the verbs in each sentence $S_i$ of TCorpus; analyzing the sentences in TCorpus with a set of pivot patterns, forming a candidate pivotal structure for each sentence that matches a pattern, and placing it in a candidate pivotal-structure library SOBase for verification; and verifying SOBase and outputting the final result SOBaseResult. Introducing the pivot patterns keeps the variety of pivot forms under control without reducing acquisition quality. Because Chinese word formation and sentence construction are complex, the invention verifies every acquired structure from two angles, the diversity of its collocations and the commonness of its collocations, to ensure accuracy.

Description

Method for acquiring Chinese and bilingual structure
Technical Field
The invention relates to Chinese natural language processing and the automatic identification of Chinese grammatical structures, and in particular to a method for automatically identifying Chinese pivotal (兼语) structures.
Background
Chinese pivotal sentences (兼语句) are a special class of linguistic phenomena. Consider the following three sentences, shown with words separated by spaces and tagged with parts of speech so that the pivot in each sentence stands out:
S1: "group-committee/n invite/v they/r attend/v meeting/n"
S2: "school/n support/v graduate/n start-business/v"
S3: "who/r let/v this/r [classifier]/q bottle/n fall/v on-the-ground/s [particle]/u ?/w"
In S1, "they" is both the object of the verb "invite" and the subject of the verb "attend", so "they" is the pivot of S1. In S2, "graduate" is the object of the verb "support" and the subject of the verb "start-business", so "graduate" is the pivot of S2. Similarly, in S3 "bottle" is the object of "let" and the subject of the verb "fall", so "bottle" is the pivot of S3.
As these three typical examples show, the pivotal sentence is a common phenomenon in Chinese. For more than 30 years, well-known Chinese scholars such as Zhudelxi, Dingshu, Huangborong, Luji Heiping and Wu inspiring have studied Chinese pivotal sentences systematically from grammatical and semantic perspectives, and their work has played an important role in the understanding of such sentences.
Beyond its value for theoretical research and for Chinese teaching and training, research on pivotal structures has many important uses as Internet applications develop.
For example, Chinese pivotal structures can form part of a language model in speech recognition and help in building such language models automatically.
As another example, recognizing unknown words has long been an important problem: given a dictionary, words that do not appear in it are called unknown (unregistered) words. Because any dictionary starts with a limited vocabulary, it must be supplemented continuously in practice. One technical difficulty in unknown-word recognition and dictionary supplementation is determining the left and right boundaries of an unknown word accurately.
How can pivotal structures be acquired effectively from a large corpus, by processing and analyzing it, to form a pivotal-structure library? How can one verify which verbs, under what conditions, combine with which nouns to form a pivotal structure? These problems have not been adequately studied or solved.
Disclosure of Invention
The invention addresses two problems: how to acquire pivotal (兼语) structures effectively from a large corpus through processing and analysis so as to form a pivotal-structure library, and how to verify which verbs combine with which nouns, and under what conditions, to form a pivotal structure. To this end the invention provides a method for acquiring Chinese pivotal structures.
To solve these problems, the invention adopts the following technical scheme:
A method for acquiring Chinese pivotal structures, characterized in that the method comprises the following steps:
The first step: perform word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus;
the open-source ICTCLAS system is used to segment each input text D in Corpus, and each text is split at its natural sentence boundaries into simple sentences without sentence-delimiting punctuation; each sentence in TCorpus thus has the form $S_i$ = "$W_1/pos_1\ W_2/pos_2 \dots W_i/pos_i \dots W_n/pos_n$", where each $W_i$ is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and $pos_i$ is its part of speech;
part-of-speech tagging is standard practice in word segmentation; common part-of-speech tags include a, b, c, d, h, j, k, m, n, p, q, r, u and z;
The second step: identify the verbs or verb phrases in each sentence $S_i$ of the segmented corpus TCorpus;
when "$W_1$/v $W_2$/v" appears, it is merged into "$W_1W_2$/v", that is, two or more adjacent verbs are merged into one verb; this is called verb merging; after merging, adverbs that modify a verb are eliminated, that is, all modifying adverbs before a verb are deleted; the processed sentence is put back into TCorpus;
The third step: analyze the sentences in TCorpus with the pivot patterns, form a candidate pivotal (兼语) structure for each sentence that matches a pattern, and put it into the candidate pivotal-structure library SOBase for verification;
analyzing the sentences in TCorpus with the pivot patterns means that five pivot patterns are used: sentences in TCorpus that match one of the patterns are selected and placed in SOBase for verification;
specifically, any sentence $SO_i$ in TCorpus that contains more than two verbs, or only one verb, is discarded; otherwise $SO_i$ is taken to have the form "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$", where the subscript i denotes the i-th sentence; the main task is then to check whether $N_{i,2}$ matches one of the five pivot patterns; if it does, the pair <"$V_{i,1} \dots V_{i,2}$", "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$"> is put into SOBase; otherwise $SO_i$ is discarded;
The five pivot patterns are as follows: let the general form of a pivotal sentence be "$N_1 V_1 N_2 V_2 N_3$", where $N_2$ is the pivot; when acquiring pivotal structures, only pivotal sentences whose pivot $N_2$ matches one of the following patterns are considered, on the assumption that, when the corpus is sufficiently large, the structures of pivotal sentences with other pivot forms can also be obtained from pivotal sentences whose pivot matches one of these five patterns:
pattern 1: numeral + noun;
pattern 2: numeral + quantifier + noun;
pattern 3: {this, this (event), this (occasion), this (person), this one, these, that, that (event), that (occasion), that (person), that one, those, it, they (non-human)}; the elements of the set are common pronouns, typically used to refer to non-living objects or animals, and each of them is a pattern by itself;
pattern 4: {this, this (event), this (occasion), this (person), this one, these, that, that (event), that (occasion), that (person), that one, those} + noun, a pivot pattern consisting of a pronoun and a noun;
pattern 5: {he, they (masculine), I, we, she, they (feminine)}; the elements of the set are common pronouns, generally used to refer to people, and each of them is a pattern by itself;
The fourth step: verify the candidate pivotal-structure library SOBase and output the final result SOBaseResult;
for each record <"$V_{i,1} \dots V_{i,2}$", "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$"> in SOBase, two verification techniques are applied: verification of collocation commonness and verification of collocation diversity, both of which are necessary conditions for a correct pivotal structure;
verification of collocation commonness means that, if $SO_i$ = "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$" is a correct pivotal sentence, the pivotal structure "$V_{i,1} \dots V_{i,2}$" occurs in other sentences of TCorpus as well, not only in the pivotal sentence $SO_i$;
verification of collocation diversity means that, if $SO_i$ = "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$" is a correct pivotal sentence, then pivotal sentences of the form $SO'_i$ = "$N'_{i,1} V_{i,1} N'_{i,2} V_{i,2} N'_{i,3}$", $SO''_i$ = "$N''_{i,1} V_{i,1} N''_{i,2} V_{i,2} N''_{i,3}$" and so on should also appear multiple times in TCorpus;
the fourth step is implemented as follows:
first, two thresholds a and b are introduced, where $a \in (0, 1)$ and $b \in (0, 1)$;
step D1: set SOBaseResult to empty; it stores the verified, correct pivotal structures;
step D2: if SOBase is empty, go to step D6;
step D3: take any record <"$V_{i,1} \dots V_{i,2}$", "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$"> from SOBase and remove it from SOBase;
step D4: if cof("$V_{i,1} \dots V_{i,2}$") > a, put "$V_{i,1} \dots V_{i,2}$" into the set SOBaseResult and go to step D2;
cof("$V_{i,1} \dots V_{i,2}$") reflects how common the pivotal structure "$V_{i,1} \dots V_{i,2}$" is, and is calculated as follows: cof("$V_{i,1} \dots V_{i,2}$") = (number of sentences in TCorpus that contain the structure "$V_{i,1} \dots V_{i,2}$") / (number of sentences in TCorpus); when cof("$V_{i,1} \dots V_{i,2}$") > a, "$V_{i,1} \dots V_{i,2}$" is regarded as a correct pivotal structure;
step D5: if muf("$V_{i,1} \dots V_{i,2}$") > b, put "$V_{i,1} \dots V_{i,2}$" into the set SOBaseResult;
muf("$V_{i,1} \dots V_{i,2}$") is a numerical measure of collocation diversity, computed in the following sub-steps; initially, the sets $V_{*,1}$ and $V_{*,2}$ are empty;
step D51: if SOBase contains a record <"$V_x \dots V_{i,2}$", "$N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}$">, put $V_x$ into the set $V_{*,1}$;
step D52: if SOBase contains a record <"$V_{i,1} \dots V_y$", "$N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}$">, put $V_y$ into the set $V_{*,2}$;
step D53: calculate muf("$V_{i,1} \dots V_{i,2}$") by the following formula:
Figure GDA0002121334540000041
step D6: output the final pivotal-structure result SOBaseResult.
Advantageous effects:
The invention provides a system and a method for acquiring Chinese pivotal (兼语) structures by combining linguistics and computer technology. To cope with the complexity and diversity of pivot forms, pivot patterns are introduced, which keep that complexity under control without reducing acquisition quality. Because Chinese word formation and sentence construction are complex, the invention verifies every acquired structure from two angles, the diversity of its collocations and the commonness of its collocations, to ensure accuracy.
In a test on a 1 TB corpus, the system acquired 139,600 pivotal structures, and analysis showed an accuracy of 98.2%. The invention thus achieves good recognition performance and meets the needs of practical application.
Drawings
Fig. 1 is a flow chart of the method for acquiring Chinese pivotal (兼语) structures.
Detailed Description
In order to be able to explain the invention more clearly, the following terms are defined and explained below:
(1) Chinese parts of speech: the part of speech is an attribute of a Chinese word. Common parts of speech are: nouns (e.g., children, graduates; denoted by n), verbs (e.g., invite, attend; denoted by v), adverbs (e.g., frequently; denoted by d), adjectives (e.g., beautiful, nice, simple; denoted by a), pronouns (e.g., these, this, it, they; denoted by r), numerals (e.g., one, twelve, 12; denoted by m) and quantifiers, i.e. measure words (glossed literally in this translation as "one", "root", "only", "bar"; denoted by q).
(2) Five pivot patterns: to control the complexity of acquiring pivotal (兼语) structures while still acquiring as many of them as possible, the invention introduces five pivot patterns. For ease of presentation, the general form of a pivotal sentence is taken to be "$N_1 V_1 N_2 V_2 N_3$", where $N_2$ is called the pivot. When acquiring pivotal structures, the invention considers only pivotal sentences whose pivot $N_2$ matches one of the following patterns (that is, it assumes that, when the corpus is sufficiently large, the structures of pivotal sentences with other pivot forms can also be obtained from pivotal sentences whose pivot matches one of these five patterns):
Pattern 1: numeral + noun. For example: "three/m persons/n" and "3/m items/n".
Pattern 2: numeral + quantifier + noun. For example: "three/m [classifier]/q persons/n" and "3/m [classifier]/q plants/n".
Pattern 3: {this, this (event), this (occasion), this (person), this one, these, that, that (event), that (occasion), that (person), that one, those, it, they (non-human)}. The elements of the set are common pronouns, typically used to refer to non-living objects or animals, and each of them is a pattern by itself.
Pattern 4: {this, this (event), this (occasion), this (person), this one, these, that, that (event), that (occasion), that (person), that one, those} + noun. This is a pivot pattern consisting of a pronoun and a noun. "this/r race/n" and "these/r stock/n" are two specific examples.
Pattern 5: {he, they (masculine), I, we, she, they (feminine)}. The elements of the set are common pronouns, generally used to refer to people, and each of them is a pattern by itself.
(3) ICTCLAS system: a free, open-source word segmentation system. It takes a text as input and outputs the segmented word sequence of the text. It can be downloaded from http://ictclas.nlpir.org. After segmentation, each token is tagged with its part of speech: a denotes an adjective, b a distinguishing word, c a conjunction, d an adverb, h a prefix, j an abbreviation, k a suffix, m a numeral, n a noun, p a preposition, q a quantifier, r a pronoun, u an auxiliary word (particle), z a status word, and so on.
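The five pivot patterns in definition (2) can be sketched as a small classifier over a tagged pivot candidate $N_2$. This is only an illustrative sketch: the function name `match_pivot_pattern` is hypothetical, and the English word sets stand in for the Chinese word lists of the patent, which this translation does not reproduce.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Assumption: simplified English glosses standing in for the Chinese pronoun lists.
DEMONSTRATIVES = {"this", "these", "that", "those"}
PATTERN3_WORDS = DEMONSTRATIVES | {"it", "they"}
PATTERN5_WORDS = {"he", "she", "i", "we", "they"}


def match_pivot_pattern(n2: List[Token]) -> Optional[int]:
    """Return the number (1-5) of the pivot pattern that n2 matches, or None."""
    tags = [pos for _, pos in n2]
    words = [word.lower() for word, _ in n2]
    if tags == ["m", "n"]:            # pattern 1: numeral + noun
        return 1
    if tags == ["m", "q", "n"]:       # pattern 2: numeral + quantifier + noun
        return 2
    if tags == ["r", "n"] and words[0] in DEMONSTRATIVES:
        return 4                      # pattern 4: demonstrative pronoun + noun
    if tags == ["r"]:
        if words[0] in PATTERN3_WORDS:
            return 3                  # pattern 3: pronoun for things or animals
        if words[0] in PATTERN5_WORDS:
            return 5                  # pattern 5: pronoun for people
    return None


example = match_pivot_pattern([("those", "r"), ("student", "n")])
```

In English the gloss "they" is ambiguous between the non-human and human pronouns, so this sketch reports it as pattern 3; with the Chinese word lists the two sets would not overlap in this way.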
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. In the following, a short sentence or a long article is generally referred to as text.
The method for acquiring Chinese pivotal (兼语) structures is divided into four main modules:
Module A: perform word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus.
Module B: identify the verbs in each sentence $S_i$ of the segmented corpus TCorpus.
Module C: analyze the sentences in TCorpus with the pivot patterns, form a candidate pivotal structure for each sentence that matches a pattern, and put it into the candidate pivotal-structure library SOBase for verification.
Module D: verify the candidate pivotal-structure library SOBase and output the final result SOBaseResult.
The workflow or method of each module is explained in detail below.
Module A: perform word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus.
The open-source ICTCLAS system is used to segment each input text D in Corpus, and each text is split at its natural sentence boundaries into simple sentences without sentence-delimiting punctuation. Each sentence in TCorpus thus has the form $S_i$ = "$W_1/pos_1\ W_2/pos_2 \dots W_i/pos_i \dots W_n/pos_n$", where each $W_i$ is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and $pos_i$ is its part of speech.
Part-of-speech tagging is standard practice in word segmentation. Typical tags are: a (adjective), b (distinguishing word), c (conjunction), d (adverb), h (prefix), j (abbreviation), k (suffix), m (numeral), n (noun), p (preposition), q (quantifier), r (pronoun), u (auxiliary word) and z (status word). For example, the segmentation result of the sentence "the teacher carefully speaks to those students for them to listen" is: "teacher/n carefully/d speak/v give/v those/r student/n listen/v".
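As a minimal sketch of how a sentence in the "word/pos" form above can be read into a program, the following helper splits a tagged sentence into (word, tag) pairs. The name `parse_tagged` is hypothetical, and the input is assumed to be already segmented and tagged, with tokens separated by spaces; the segmenter itself is not called here.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def parse_tagged(sentence: str) -> List[Token]:
    """Split 'w1/pos1 w2/pos2 ...' into a list of (word, pos) pairs."""
    tokens: List[Token] = []
    for item in sentence.split():
        # rpartition splits at the last '/', so a '/' inside the word is kept.
        word, _, pos = item.rpartition("/")
        if not word:
            raise ValueError(f"token without a part-of-speech tag: {item!r}")
        tokens.append((word, pos))
    return tokens


tokens = parse_tagged(
    "teacher/n carefully/d speak/v give/v those/r student/n listen/v"
)
```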
Module B: identify the verbs or verb phrases in each sentence $S_i$ of the segmented corpus TCorpus.
For each sentence $S_i$ in TCorpus, if "$W_1$/v $W_2$/v" appears, it is merged into "$W_1W_2$/v", that is, two or more adjacent verbs are merged into one verb; this is called verb merging. For example, in the sentence "teacher/n speak/v give/v those/r student/n listen/v", merging "speak/v give/v" yields a single verb (glossed here as "speak"), which gives the pivotal sentence "teacher/n speak/v those/r student/n listen/v". The purpose is to obtain as many pivotal structures as possible.
After verb merging, adverbs that modify a verb are eliminated, that is, all modifying adverbs before a verb are deleted. For example, verb merging turns the sentence "teacher/n carefully/d speak/v give/v those/r student/n listen/v" into "teacher/n carefully/d speak/v those/r student/n listen/v", and adverb deletion then yields the pivotal sentence "teacher/n speak/v those/r student/n listen/v".
The processed sentence is put back into TCorpus.
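The two operations of module B can be sketched as follows. The function names are hypothetical, and two details are assumptions: merged verbs are joined with a hyphen so that the English glosses stay readable (in Chinese the words would simply be concatenated), and "modifying adverbs before a verb" is read as adverbs standing directly in front of a verb.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def merge_verbs(tokens: List[Token], joiner: str = "-") -> List[Token]:
    """Collapse every run of adjacent verbs (tag 'v') into one verb token."""
    merged: List[Token] = []
    for word, pos in tokens:
        if pos == "v" and merged and merged[-1][1] == "v":
            merged[-1] = (merged[-1][0] + joiner + word, "v")
        else:
            merged.append((word, pos))
    return merged


def drop_preverbal_adverbs(tokens: List[Token]) -> List[Token]:
    """Delete adverbs (tag 'd') that stand directly before a verb."""
    kept: List[Token] = []
    pending: List[Token] = []  # adverbs seen since the last non-adverb token
    for token in tokens:
        if token[1] == "d":
            pending.append(token)
            continue
        if token[1] != "v":
            kept.extend(pending)  # these adverbs do not precede a verb
        pending = []
        kept.append(token)
    kept.extend(pending)
    return kept


sentence = [("teacher", "n"), ("carefully", "d"), ("speak", "v"), ("give", "v"),
            ("those", "r"), ("student", "n"), ("listen", "v")]
processed = drop_preverbal_adverbs(merge_verbs(sentence))
```

Merging runs before adverb deletion, following the order given in the text.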
Module C: analyze the sentences in TCorpus with the pivot patterns, form a candidate pivotal (兼语) structure for each sentence that matches a pattern, and put it into the candidate pivotal-structure library SOBase for verification.
Analyzing the sentences in TCorpus with the pivot patterns means that the five pivot patterns defined above are used: sentences in TCorpus that match one of the patterns are selected and placed in SOBase for verification.
Specifically, any sentence $SO_i$ in TCorpus that contains more than two verbs, or only one verb, is discarded; otherwise $SO_i$ is taken to have the form "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$" (the subscript i denotes the i-th sentence). The main task is then to check whether $N_{i,2}$ matches one of the five pivot patterns. If it does, the pair <"$V_{i,1} \dots V_{i,2}$", "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$"> is put into SOBase; otherwise $SO_i$ is discarded.
For example, in the pivotal sentence "teacher/n speak/v those/r student/n listen/v", "those/r student/n" matches pivot pattern 4, so the invention takes "speak … listen" as a candidate pivotal structure obtained from this sentence and puts <"speak … listen", "teacher/n speak/v those/r student/n listen/v"> into the candidate library SOBase for further verification by module D.
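The candidate extraction of module C can be sketched as below. The names `extract_candidate` and `n2_matches` are hypothetical; the pattern test is passed in as a function so that any implementation of the five patterns can be plugged in, and the stand-in used in the example checks only pattern 4. Sentences with no verb at all are also discarded, which the text does not state explicitly but which follows from the required form.

```python
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str]   # (word, part-of-speech tag)
Record = Tuple[str, str]  # ("V1…V2", tagged sentence)


def extract_candidate(
    tokens: List[Token],
    n2_matches: Callable[[List[Token]], bool],
) -> Optional[Record]:
    """Return the SOBase record for a sentence, or None if it is discarded."""
    verb_positions = [i for i, (_, pos) in enumerate(tokens) if pos == "v"]
    if len(verb_positions) != 2:
        return None  # more than two verbs, or fewer than two
    first, second = verb_positions
    n2 = tokens[first + 1:second]  # the pivot candidate between V1 and V2
    if not n2 or not n2_matches(n2):
        return None
    structure = f"{tokens[first][0]}…{tokens[second][0]}"
    sentence = " ".join(f"{word}/{pos}" for word, pos in tokens)
    return (structure, sentence)


def is_pattern4(n2: List[Token]) -> bool:
    """Stand-in for the full pattern check: pronoun + noun only."""
    return [pos for _, pos in n2] == ["r", "n"]


record = extract_candidate(
    [("teacher", "n"), ("speak", "v"), ("those", "r"),
     ("student", "n"), ("listen", "v")],
    is_pattern4,
)
```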
Module D: verify the candidate pivotal-structure library SOBase and output the final result SOBaseResult.
For each record <"$V_{i,1} \dots V_{i,2}$", "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$"> in the candidate library SOBase, the invention provides two verification techniques: verification of collocation commonness and verification of collocation diversity, both of which are necessary conditions for a correct pivotal (兼语) structure.
Verification of collocation commonness means that, if $SO_i$ = "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$" is a correct pivotal sentence, the pivotal structure "$V_{i,1} \dots V_{i,2}$" occurs in other sentences of TCorpus as well, not only in the pivotal sentence $SO_i$.
For example, let $SO_i$ = "the group committee invites them to attend the meeting". Then $SO_{i,1}$ = "the host invites them to take part in the interaction" and $SO_{i,2}$ = "the owner invites them to have lunch together" may also appear among the other sentences of TCorpus; that is, sentences such as $SO_{i,1}$ and $SO_{i,2}$ show that the pattern does not depend only on the particular pivotal sentence $SO_i$.
Verification of collocation diversity means that, if $SO_i$ = "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$" is a correct pivotal sentence, then pivotal sentences of the form $SO'_i$ = "$N'_{i,1} V_{i,1} N'_{i,2} V_{i,2} N'_{i,3}$", $SO''_i$ = "$N''_{i,1} V_{i,1} N''_{i,2} V_{i,2} N''_{i,3}$" and so on should also appear multiple times in TCorpus.
For example, for $SO_i$ = "the group committee invites them to attend the meeting", possible variants are $SO'_i$ = "friends invite us to attend the birthday party", $SO''_i$ = "a friend invites her to attend the wedding" and $SO'''_i$ = "the professor invites these students to attend the conference". That is, in $SO_i$ the pivot "them" can be replaced by words of various forms, and the pivotal structure "invite … attend" of $SO_i$ remains reasonable and correct.
Based on the above analysis of collocation commonness and collocation diversity, module D is implemented as follows. First, two thresholds a and b are introduced, where $a \in (0, 1)$ and $b \in (0, 1)$.
Step D1: set SOBaseResult to empty; it stores the verified, correct pivotal structures.
Step D2: if SOBase is empty, go to step D6.
Step D3: take any record <"$V_{i,1} \dots V_{i,2}$", "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$"> from SOBase and remove it from SOBase.
Step D4: if cof("$V_{i,1} \dots V_{i,2}$") > a, put "$V_{i,1} \dots V_{i,2}$" into the set SOBaseResult and go to step D2.
cof("$V_{i,1} \dots V_{i,2}$") reflects how common the pivotal structure "$V_{i,1} \dots V_{i,2}$" is, and is calculated as follows: cof("$V_{i,1} \dots V_{i,2}$") = (number of sentences in TCorpus that contain the structure "$V_{i,1} \dots V_{i,2}$") / (number of sentences in TCorpus). When cof("$V_{i,1} \dots V_{i,2}$") > a, "$V_{i,1} \dots V_{i,2}$" is regarded as a correct pivotal structure.
Step D5: if muf("$V_{i,1} \dots V_{i,2}$") > b, put "$V_{i,1} \dots V_{i,2}$" into the set SOBaseResult.
muf("$V_{i,1} \dots V_{i,2}$") is a numerical measure of collocation diversity, computed in the following sub-steps. Initially, the sets $V_{*,1}$ and $V_{*,2}$ are empty.
Step D51: if SOBase contains a record <"$V_x \dots V_{i,2}$", "$N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}$">, put $V_x$ into the set $V_{*,1}$.
Step D52: if SOBase contains a record <"$V_{i,1} \dots V_y$", "$N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}$">, put $V_y$ into the set $V_{*,2}$.
Step D53: calculate muf("$V_{i,1} \dots V_{i,2}$") by the following formula:
Figure GDA0002121334540000081
Step D6: output the final pivotal-structure result SOBaseResult.
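Steps D1 to D6 can be sketched as a loop over the candidate structures. Several points are assumptions rather than part of the text: a sentence is taken to "contain" a structure when verb V1 is followed later by verb V2; the sets of steps D51 and D52 are built by comparing verbs only, over a snapshot of all candidates; control is assumed to return to step D2 after step D5; and because the muf formula is given only as a figure, `muf` is passed in as a caller-supplied function of the two sets. All function names are hypothetical.

```python
from typing import Callable, Iterable, List, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)
Pair = Tuple[str, str]   # (V1, V2) of a candidate structure "V1…V2"


def cof(pair: Pair, corpus: List[List[Token]]) -> float:
    """Share of corpus sentences in which verb V1 is followed later by verb V2."""
    v1, v2 = pair
    hits = 0
    for sentence in corpus:
        verbs = [word for word, pos in sentence if pos == "v"]
        if v1 in verbs and v2 in verbs[verbs.index(v1) + 1:]:
            hits += 1
    return hits / len(corpus) if corpus else 0.0


def partner_sets(pair: Pair, candidates: Iterable[Pair]) -> Tuple[Set[str], Set[str]]:
    """Steps D51 and D52: first verbs seen with V2, second verbs seen with V1."""
    v1, v2 = pair
    cands = list(candidates)
    v_star_1 = {x for x, y in cands if y == v2}
    v_star_2 = {y for x, y in cands if x == v1}
    return v_star_1, v_star_2


def verify(candidates: Iterable[Pair], corpus: List[List[Token]], a: float,
           b: float, muf: Callable[[Set[str], Set[str]], float]) -> Set[Pair]:
    cands = list(candidates)
    result: Set[Pair] = set()          # D1: empty result set
    pending = list(cands)              # working copy of SOBase
    while pending:                     # D2: stop when SOBase is empty
        pair = pending.pop()           # D3: take a record out of SOBase
        if cof(pair, corpus) > a:      # D4: commonness check
            result.add(pair)
            continue
        v_star_1, v_star_2 = partner_sets(pair, cands)  # D51, D52
        if muf(v_star_1, v_star_2) > b:                 # D53, D5
            result.add(pair)
    return result                      # D6: output SOBaseResult


corpus = [
    [("teacher", "n"), ("speak", "v"), ("those", "r"), ("student", "n"), ("listen", "v")],
    [("committee", "n"), ("invite", "v"), ("they", "r"), ("attend", "v"), ("meeting", "n")],
]
accepted = verify([("invite", "attend"), ("speak", "sing")], corpus,
                  a=0.4, b=0.5, muf=lambda s1, s2: 0.0)
```

The constant `muf` in the example is a placeholder that disables the diversity check; it is not the formula of step D53.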
Experimental results
Preliminary experiments show that the acquired pivotal structures are better when the collocation-commonness threshold a is set to 0.0006 and the collocation-diversity threshold b is set to 0.0015. In a test on a 1 TB corpus, the system acquired 139,600 pivotal structures, and analysis showed an accuracy of 98.2%. The invention thus achieves good recognition performance and meets the needs of practical application.
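To make the commonness threshold concrete, the cof definition of step D4 can be worked through for a corpus size chosen purely for illustration. The figure of 1,000,000 sentences is an assumption and does not come from the experiment above.

```latex
\[
\mathrm{cof}(V_{1} \dots V_{2})
  = \frac{\#\{\, S \in \mathit{TCorpus} : S \text{ contains the structure } V_{1} \dots V_{2} \,\}}
         {\#\,\mathit{TCorpus}}
  > a = 0.0006
\]
\[
\#\,\mathit{TCorpus} = 1\,000\,000
  \;\Longrightarrow\;
  \#\{\, S : S \text{ contains } V_{1} \dots V_{2} \,\} > 0.0006 \times 1\,000\,000 = 600
\]
```

Under this assumption, a structure passes the commonness check only if it occurs in more than 600 sentences.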

Claims (1)

1. A method for acquiring Chinese pivotal (兼语) structures, characterized in that the method comprises the following steps:
the first step: perform word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus;
the open-source ICTCLAS system is used to segment each input text D in Corpus, and each text is split at its natural sentence boundaries into simple sentences without sentence-delimiting punctuation; each sentence in TCorpus thus has the form $S_i$ = "$W_1/pos_1\ W_2/pos_2 \dots W_i/pos_i \dots W_n/pos_n$", where each $W_i$ is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and $pos_i$ is its part of speech;
part-of-speech tagging is standard practice in word segmentation; common part-of-speech tags include a, b, c, d, h, j, k, m, n, p, q, r, u and z;
the second step: identify the verbs or verb phrases in each sentence $S_i$ of the segmented corpus TCorpus;
when "$W_1$/v $W_2$/v" appears, it is merged into "$W_1W_2$/v", that is, two or more adjacent verbs are merged into one verb; this is called verb merging; after merging, adverbs that modify a verb are eliminated, that is, all modifying adverbs before a verb are deleted; the processed sentence is put back into TCorpus;
the third step: analyze the sentences in TCorpus with the pivot patterns, form a candidate pivotal (兼语) structure for each sentence that matches a pattern, and put it into the candidate pivotal-structure library SOBase for verification;
analyzing the sentences in TCorpus with the pivot patterns means that five pivot patterns are used: sentences in TCorpus that match one of the patterns are selected and placed in SOBase for verification;
any sentence $SO_i$ in TCorpus that contains more than two verbs, or only one verb, is discarded; otherwise $SO_i$ is taken to have the form "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$", where the subscript i denotes the i-th sentence; the main task is then to check whether $N_{i,2}$ matches one of the five pivot patterns; if it does, the pair <"$V_{i,1} \dots V_{i,2}$", "$N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}$"> is put into SOBase; otherwise $SO_i$ is discarded;
the five pivot patterns are as follows: let the general form of a pivotal sentence be "$N_1 V_1 N_2 V_2 N_3$", where $N_2$ is the pivot; when acquiring pivotal structures, only pivotal sentences whose pivot $N_2$ matches one of the following patterns are considered, on the assumption that, when the corpus is sufficiently large, the structures of pivotal sentences with other pivot forms can also be obtained from pivotal sentences whose pivot matches one of these five patterns:
pattern 1: numeral + noun;
pattern 2: numeral + quantifier + noun;
pattern 3: {this, this (event), this (occasion), this (person), this one, these, that, that (event), that (occasion), that (person), that one, those, it, they (non-human)}; the elements of the set are common pronouns, typically used to refer to non-living objects or animals, and each of them is a pattern by itself;
pattern 4: {this, this (event), this (occasion), this (person), this one, these, that, that (event), that (occasion), that (person), that one, those} + noun, a pivot pattern consisting of a pronoun and a noun;
pattern 5: {he, they (masculine), I, we, she, they (feminine)}; the elements of the set are common pronouns, generally used to refer to people, and each of them is a pattern by itself;
the fourth step: verifying a candidate bilingual structure library SOBase and outputting a final result SOBaseResult;
for each record in candidate bilingual structure library SOBase<“Vi,1…Vi,2”,“Ni,1Vi,1Ni,2Vi,2Ni,3”>Two verification techniques are employed: common verification of bilingual collocation and diversity of bilingual collocation, which are necessary conditions for ensuring correct structure of bilingual;
the common verification of the collocation of the accompanied words means that when SOi=“Ni,1Vi,1Ni,2Vi,2Ni,3"is a correct conjunctive sentence, the conjunctive sentence structure" Vi,1…Vi,2"appear in other statements in TCorpus, not just in the cum statement SOiPerforming the following steps;
the verification of the diversity of the bilingual collocation refers to if SOi=“Ni,1Vi,1Ni,2Vi,2Ni,3"is a correct and concurrent statement, then is shaped as SO'i=“N′i,1Vi,1N′i,2Vi,2N′i,3”、SO″i=“N″i,1Vi,1N″i,2Vi,2N″i,3"the doublet statement should also appear multiple times in TCorpus;
the fourth step is implemented by the following specific steps:
first, two non-negative thresholds a and b are introduced, where a ∈ (0, 1) and b ∈ (0, 1);
step D1: set SOBaseResult to empty; it is used to save the verified, correct bilingual structures;
step D2: if SOBase is empty, go to step D6;
step D3: take any record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> out of SOBase;
step D4: if cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2;
here cof("V_{i,1}…V_{i,2}") reflects how common the bilingual structure "V_{i,1}…V_{i,2}" is, and it is calculated as follows: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus containing the structure "V_{i,1}…V_{i,2}") / (number of sentences in TCorpus); when cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is considered a correct bilingual structure;
step D5: if muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult; here muf("V_{i,1}…V_{i,2}") is a mathematical measure of the diversity of the bilingual collocation, and it is calculated by the following sub-steps; initially, the sets V_{*,1} and V_{*,2} are empty;
step D51: if SOBase contains a record <"V_x…V_{i,2}", "N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}">, put V_x into the set V_{*,1};
step D52: if SOBase contains a record <"V_{i,1}…V_y", "N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}">, put V_y into the set V_{*,2};
step D53: calculate muf("V_{i,1}…V_{i,2}"); the calculation formula is as follows:
step D6: output the final bilingual structure result SOBaseResult.
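Steps D1 to D6 can be sketched roughly as follows. This is only an illustrative Python sketch under several assumptions that are not in the claim text: sentences in TCorpus are already segmented into token lists; a sentence "contains" a structure when V1 occurs somewhere before V2; the sets V_{*,1} and V_{*,2} are collected from a snapshot of the full SOBase, ignoring the nouns; and, because the muf formula is not given in this text, muf is supplied by the caller as a hypothetical function of those two sets. The names `contains`, `collocation_sets` and `verify` are likewise made up for illustration.

```python
from typing import Callable, List, Sequence, Set, Tuple

Structure = Tuple[str, str]                  # (V1, V2)
Record = Tuple[Structure, Tuple[str, ...]]   # (structure, tokens of the sentence)


def contains(sentence: Sequence[str], structure: Structure) -> bool:
    """Assumed matching rule: V1 occurs in the sentence and V2 occurs after it."""
    v1, v2 = structure
    if v1 not in sentence:
        return False
    first = sentence.index(v1)
    return v2 in sentence[first + 1:]


def cof(structure: Structure, tcorpus: List[Sequence[str]]) -> float:
    """Share of TCorpus sentences that contain the structure (step D4)."""
    if not tcorpus:
        return 0.0
    hits = sum(1 for sentence in tcorpus if contains(sentence, structure))
    return hits / len(tcorpus)


def collocation_sets(structure: Structure,
                     sobase: List[Record]) -> Tuple[Set[str], Set[str]]:
    """Steps D51 and D52: verbs seen in the first slot with V2, and in the second slot with V1."""
    v1, v2 = structure
    v_star_1 = {x for (x, y), _ in sobase if y == v2}
    v_star_2 = {y for (x, y), _ in sobase if x == v1}
    return v_star_1, v_star_2


def verify(sobase: List[Record],
           tcorpus: List[Sequence[str]],
           a: float,
           b: float,
           muf: Callable[[Set[str], Set[str]], float]) -> Set[Structure]:
    """Steps D1 to D6, with muf passed in because its formula is not in this text."""
    result: Set[Structure] = set()            # D1
    pending = list(sobase)
    while pending:                            # D2
        structure, _ = pending.pop()          # D3
        if cof(structure, tcorpus) > a:       # D4
            result.add(structure)
            continue
        v_star_1, v_star_2 = collocation_sets(structure, sobase)
        if muf(v_star_1, v_star_2) > b:       # D5
            result.add(structure)
    return result                             # D6


# Toy data: two of the three sentences contain the structure ("让", "回答").
toy_corpus = [
    ["老师", "让", "学生", "回答", "问题"],
    ["妈妈", "让", "孩子", "回答", "问题"],
    ["他", "喜欢", "苹果"],
]
toy_cof = cof(("让", "回答"), toy_corpus)
```

With a stand-in such as `muf=lambda v1s, v2s: 0.0`, the diversity check never fires and only structures passing the cof threshold are kept; a real run would need the formula from step D53.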
CN201510846489.9A 2015-11-27 2015-11-27 Method for acquiring Chinese and bilingual structure Active CN106815188B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510846489.9A CN106815188B (en) 2015-11-27 2015-11-27 Method for acquiring Chinese and bilingual structure

Publications (2)

Publication Number Publication Date
CN106815188A CN106815188A (en) 2017-06-09
CN106815188B CN106815188B (en) 2020-02-18

Family

ID=59103490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510846489.9A Active CN106815188B (en) 2015-11-27 2015-11-27 Method for acquiring Chinese and bilingual structure

Country Status (1)

Country Link
CN (1) CN106815188B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937430A (en) * 2010-09-03 2011-01-05 Tsinghua University Method for extracting event sentence pattern from Chinese sentence
CN102737013A (en) * 2011-04-02 2012-10-17 Samsung Electronics (China) R&D Center Device and method for identifying statement emotion based on dependency relation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
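Automatic recognition of pivotal (兼语) structures based on conditional random fields (基于条件随机场的兼语结构自动识别); Chen Jing et al.; Information Science (《情报科学》); 2012-03-31; Vol. 30, No. 3; pp. 439-442 *
Machine detection of pivotal (兼语) structures in modern Chinese (现代汉语兼语结构的机器探测); Fu Chenghong; Journal of Hefei University (Social Sciences) (《合肥学院学报(社会科学版)》); 2011-11-30; Vol. 28, No. 6; pp. 52-56 *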

Also Published As

Publication number Publication date
CN106815188A (en) 2017-06-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 212009 Zhenjiang high tech Industrial Development Zone, Jiangsu, No. 668, No. twelve road.

Applicant after: Zhongke national power (Zhenjiang) Intelligent Technology Co., Ltd.

Address before: 212009 18 building, North Tower, Twin Tower Rd 468, twelve road 468, Ding Mo Jing, Jiangsu.

Applicant before: Knowology Intelligent Technology Co., Ltd.

GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Fu Jianhui

Inventor after: Wang Weimin

Inventor after: Cao Yang

Inventor before: Fu Jianhui

Inventor before: Wang Weiming

Inventor before: Cao Yang
