Disclosure of Invention
The invention addresses the problem of how to process and analyze a large corpus so as to acquire bilingual structures effectively and assemble them into a bilingual structure library. It provides a method for acquiring Chinese bilingual structures, whose purpose is to determine which verbs and nouns combine, and under what conditions, to form a bilingual structure.
In order to solve the problems, the invention adopts the following technical scheme:
A method for acquiring Chinese bilingual structures, characterized in that the method comprises the following steps:
Step 1: performing word segmentation on the original training corpus Corpus to form the word-segmented corpus TCorpus;
the open-source ICTCLAS system is used to segment each input text D in Corpus, and each text is split at its natural sentence boundaries into simple sentences that contain no sentence-final punctuation; each sentence in TCorpus therefore has the form S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and pos_i is its part of speech;
part-of-speech tagging in word segmentation is already widely used in the computing field; the common part-of-speech tags include a, b, c, d, h, j, k, m, n, p, q, r, u, and z;
Step 2: identifying the verb or verb phrase in each sentence S_i of the word-segmented corpus TCorpus;
when "W_1/v W_2/v" appears, it is merged into "W_1W_2/v", that is, two or more consecutive verbs are merged into a single verb; this process is called verb merging; after merging, the adverbs that modify verbs are eliminated, that is, all modifying adverbs preceding a verb are deleted; the processed sentence is still placed in TCorpus;
Step 3: analyzing the sentences in TCorpus with the bilingual patterns, forming a candidate bilingual structure for each sentence that satisfies a pattern, and putting the candidates into the bilingual structure library SOBase for verification;
analyzing the sentences in TCorpus with the bilingual patterns means that 5 bilingual patterns are used: the sentences in TCorpus that conform to one of the patterns are selected and placed in the library SOBase to await verification;
specifically, any sentence SO_i in TCorpus that contains more than 2 verbs, or only 1 verb, is discarded; otherwise, let SO_i have the form "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}", where the subscript i denotes the i-th sentence; the main task is then to check whether N_{i,2} satisfies one of the 5 bilingual patterns; if it does, the pair <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> is put into SOBase; otherwise, SO_i is discarded;
The 5 bilingual patterns are as follows: let the general form of a bilingual sentence be "N_1 V_1 N_2 V_2 N_3", where N_2 is the bilingual word; when acquiring bilingual structures, only bilingual sentences whose bilingual word N_2 satisfies one of the following patterns are considered, on the assumption that, when the corpus is sufficiently large, the structures of bilingual sentences whose bilingual word takes other forms can also be obtained from bilingual sentences that satisfy these 5 patterns:
pattern 1: numeral + noun;
pattern 2: numeral + quantifier + noun;
pattern 3: { this, this field, this time, this bit, this, these, that field, that time, that bit, that, those, it, they }; the elements of the set are common pronouns, typically used to refer to non-living objects or animals, and each of them is a pattern in itself;
pattern 4: { this, this field, this time, this bit, this, these, that, that field, that time, that bit, that, those } + noun; this is a bilingual pattern made up of a pronoun and a noun;
pattern 5: { he, they, I, we, she, they }; the elements of the set are common pronouns, typically used to refer to persons, and each of them is a pattern in itself;
Step 4: verifying the candidate bilingual structure library SOBase and outputting the final result SOBaseResult;
for each record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate library SOBase, two verification techniques are employed: verification of the commonness of the bilingual collocation and verification of the diversity of the bilingual collocation, which are necessary conditions for a bilingual structure to be correct;
verification of the commonness of the bilingual collocation means that, if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct bilingual sentence, then the bilingual structure "V_{i,1}…V_{i,2}" also occurs in other sentences in TCorpus, not only in the bilingual sentence SO_i;
verification of the diversity of the bilingual collocation means that, if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct bilingual sentence, then bilingual sentences of the form SO'_i = "N'_{i,1} V_{i,1} N'_{i,2} V_{i,2} N'_{i,3}" and SO''_i = "N''_{i,1} V_{i,1} N''_{i,2} V_{i,2} N''_{i,3}" should also appear multiple times in TCorpus;
Step 4 is implemented by the following specific steps:
first, two non-negative thresholds a and b are introduced, where a ∈ (0, 1) and b ∈ (0, 1);
step D1: set SOBaseResult to empty; it stores the verified, correct bilingual structures;
step D2: if SOBase is empty, go to step D6;
step D3: take any record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> out of SOBase;
step D4: if cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2;
cof("V_{i,1}…V_{i,2}") reflects how common the bilingual structure "V_{i,1}…V_{i,2}" is, and is calculated as follows: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (number of sentences in TCorpus); when cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is considered a correct bilingual structure;
step D5: if muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult;
muf("V_{i,1}…V_{i,2}") is a mathematical measure of the diversity of the bilingual collocation; it is calculated in the following sub-steps, with the sets V_{*,1} and V_{*,2} initially empty;
step D51: if SOBase contains a record <"V_x…V_{i,2}", "N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}">, put V_x into the set V_{*,1};
step D52: if SOBase contains a record <"V_{i,1}…V_y", "N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}">, put V_y into the set V_{*,2};
step D53: calculate muf("V_{i,1}…V_{i,2}"); the calculation formula is as follows:
step D6: output the final bilingual structure result SOBaseResult.
Beneficial effects:
The invention provides a system and a method for acquiring Chinese bilingual structures by means of linguistics and computer technology. To handle the complexity and diversity of bilingual forms, bilingual patterns are introduced, which keep that complexity under control without reducing the acquisition effect. To handle the complexity of Chinese word formation and sentence construction, and to ensure the accuracy of the bilingual structures, the invention strictly verifies each acquired structure from two angles: the diversity and the commonness of the bilingual collocation.
In tests on a 1 TB corpus, the system of the invention obtained 139,600 bilingual structures, and analysis showed an accuracy of 98.2%. The invention therefore achieves good recognition performance and is suitable for practical application.
Detailed Description
In order to explain the invention more clearly, the following terms are first defined and explained:
(1) Chinese parts of speech: the part of speech is an attribute of a Chinese word. The common ones are: nouns (e.g., children, graduates; denoted by n), verbs (e.g., invite, add; denoted by v), adverbs (e.g., frequently; denoted by d), adjectives (e.g., beautiful, nice, simple; denoted by a), pronouns (e.g., these, this, it, they; denoted by r), numerals (e.g., one, twelve, 12; denoted by m), and quantifiers (e.g., one, root, only, bar; denoted by q).
(2) The 5 bilingual patterns: to control the complexity of acquiring bilingual structures while ensuring that as many bilingual structures as possible can be obtained, the invention introduces 5 bilingual patterns. For ease of presentation, the general form of a bilingual sentence is assumed below to be "N_1 V_1 N_2 V_2 N_3", where N_2 is called the bilingual word. When acquiring bilingual structures, the invention considers only bilingual sentences whose bilingual word N_2 satisfies one of the following patterns (that is, it is assumed that, when the corpus is sufficiently large, the structures of bilingual sentences whose bilingual word takes other forms can also be obtained from bilingual sentences that satisfy the following 5 patterns):
Pattern 1: numeral + noun. "three/m persons/n" and "3/m items/n" are two specific examples.
Pattern 2: numeral + quantifier + noun. "three/m/q persons/n" and "3/m plants/q plants/n" are two specific examples.
Pattern 3: { this, this field, this time, this bit, this, these, that field, that time, that bit, that, those, it, they }. The elements of the set are common pronouns, typically used to refer to non-living objects or animals, and each of them is a pattern in itself.
Pattern 4: { this, this field, this time, this bit, this, these, that field, that time, that bit, that, those } + noun. This is a bilingual pattern made up of a pronoun and a noun. "this/r race/n" and "these/r stock/n" are two specific examples.
Pattern 5: { he, they, I, we, she, they }. The elements of the set are common pronouns, typically used to refer to persons, and each of them is a pattern in itself.
(3) ICTCLAS system: a free, open-source word segmentation system. It takes a text as input and outputs the segmented word sequence of that text. The ICTCLAS system can be downloaded from http://ictlas.nlpir.org. After segmentation, each token is tagged with its part of speech, where a denotes an adjective, b a distinguishing word, c a conjunction, d an adverb, h a prefix, j an abbreviation, k a suffix, m a numeral, n a noun, p a preposition, q a quantifier, r a pronoun, u an auxiliary word, z a status word, and so on.
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. In the following, a short sentence or a long article is generally referred to as text.
The method for acquiring Chinese bilingual structures is divided into four main modules:
Module A: perform word segmentation on the original training corpus Corpus to form the word-segmented corpus TCorpus.
Module B: identify the verbs in each sentence S_i of the word-segmented corpus TCorpus.
Module C: analyze the sentences in TCorpus with the bilingual patterns, form a candidate bilingual structure for each sentence that satisfies a pattern, and put the candidates into the bilingual structure library SOBase for verification.
Module D: verify the candidate bilingual structure library SOBase and output the final result SOBaseResult.
The workflow of each module is explained in detail below.
Module A: perform word segmentation on the original training corpus Corpus to form the word-segmented corpus TCorpus.
The open-source ICTCLAS system is used to segment each input text D in Corpus, and each text is split at its natural sentence boundaries into simple sentences that contain no sentence-final punctuation. Each sentence in TCorpus therefore has the form S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word, or letter, and pos_i is its part of speech.
Part-of-speech tagging in word segmentation is already widely used in the computing field. Typical tags include a (adjective), b (distinguishing word), c (conjunction), d (adverb), h (prefix), j (abbreviation), k (suffix), m (numeral), n (noun), p (preposition), q (quantifier), r (pronoun), u (auxiliary word), and z (status word). For example, the sentence glossed word for word as "teacher carefully speak to those students listen" is segmented as: "teacher/n carefully/d speak/v to/v those/r students/n listen/v".
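The sentence format described above can be read into a simple data structure. The following is a minimal Python sketch, not the invention's implementation: it assumes tokens are separated by whitespace (the separator is not specified in the text), and the name `parse_tagged` is hypothetical.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def parse_tagged(sentence: str) -> List[Token]:
    """Split a 'W1/pos1 W2/pos2 ...' string into (word, tag) pairs.

    Assumption: tokens are whitespace-separated. The tag is taken to be
    whatever follows the LAST slash, so a word that itself contains '/'
    is still handled.
    """
    tokens: List[Token] = []
    for item in sentence.split():
        word, sep, tag = item.rpartition("/")
        if not sep:
            raise ValueError(f"token without a part-of-speech tag: {item!r}")
        tokens.append((word, tag))
    return tokens


example = "teacher/n carefully/d speak/v to/v those/r students/n listen/v"
tokens = parse_tagged(example)
verbs = [word for word, tag in tokens if tag == "v"]
```

Keeping each token as a (word, tag) pair makes the later steps (verb merging, adverb deletion, pattern matching) simple list operations.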
Module B: identify the verb or verb phrase in each sentence S_i of the word-segmented corpus TCorpus.
For each sentence S_i in TCorpus, if "W_1/v W_2/v" appears, it is merged into "W_1W_2/v", that is, two or more consecutive verbs are merged into a single verb; this process is called verb merging. For example, applying it to the sentence "teacher/n speak/v to/v those/r students/n listen/v" yields the verb phrase "speak-to" and hence the bilingual sentence "teacher/n speak-to/v those/r students/n listen/v". The purpose is to obtain as many bilingual structures as possible.
After this processing, the adverbs that modify verbs are eliminated, that is, all modifying adverbs preceding a verb are deleted. For example, verb merging turns the sentence "teacher/n carefully/d speak/v to/v those/r students/n listen/v" into the bilingual sentence "teacher/n carefully/d speak-to/v those/r students/n listen/v", and adverb deletion then yields "teacher/n speak-to/v those/r students/n listen/v".
The processed sentences are still placed in TCorpus.
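The two operations of module B can be sketched as follows. This is an illustrative Python sketch under stated assumptions: sentences are lists of (word, tag) pairs, "modifying adverbs before the verb" is read as any run of d-tagged tokens immediately preceding a v-tagged token, and the function names and the `joiner` parameter are hypothetical (for Chinese text the joiner would be the empty string, matching "W_1W_2/v").

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def merge_verbs(tokens: List[Token], joiner: str = "") -> List[Token]:
    """Merge every run of consecutive verbs (tag 'v') into one verb token."""
    merged: List[Token] = []
    for word, tag in tokens:
        if tag == "v" and merged and merged[-1][1] == "v":
            merged[-1] = (merged[-1][0] + joiner + word, "v")
        else:
            merged.append((word, tag))
    return merged


def drop_preverbal_adverbs(tokens: List[Token]) -> List[Token]:
    """Delete runs of adverbs (tag 'd') that immediately precede a verb."""
    kept: List[Token] = []
    pending: List[Token] = []  # adverbs not yet known to precede a verb
    for word, tag in tokens:
        if tag == "d":
            pending.append((word, tag))
            continue
        if tag != "v":
            kept.extend(pending)  # these adverbs did not precede a verb
        pending = []
        kept.append((word, tag))
    kept.extend(pending)  # trailing adverbs are kept
    return kept


sentence = [("teacher", "n"), ("carefully", "d"), ("speak", "v"), ("to", "v"),
            ("those", "r"), ("students", "n"), ("listen", "v")]
# Order follows the text: verb merging first, then adverb deletion.
processed = drop_preverbal_adverbs(merge_verbs(sentence, joiner="-"))
```

Applying the steps in this order mirrors the description; an adverb sitting between two verbs would block their merging, which is a consequence of the stated order rather than a design decision of this sketch.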
Module C: analyze the sentences in TCorpus with the bilingual patterns, form a candidate bilingual structure for each sentence that satisfies a pattern, and put the candidates into the bilingual structure library SOBase for verification.
Analyzing the sentences in TCorpus with the bilingual patterns means that the 5 bilingual patterns defined above are used: the sentences in TCorpus that conform to one of the patterns are selected and placed in the library SOBase to await verification.
Specifically, any sentence SO_i in TCorpus that contains more than 2 verbs, or only 1 verb, is discarded; otherwise, let SO_i have the form "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" (here the subscript i denotes the i-th sentence). The main task is then to check whether N_{i,2} satisfies one of the 5 bilingual patterns. If it does, the pair <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> is put into SOBase; otherwise, SO_i is discarded.
For example, in the bilingual sentence "teacher/n speak-to/v those/r students/n listen/v", "those/r students/n" satisfies pattern 4, so the invention takes "speak-to … listen" as a candidate bilingual structure and puts <"speak-to … listen", "teacher/n speak-to/v those/r students/n listen/v"> into the candidate library SOBase for further verification by module D.
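The selection performed by module C can be sketched as below. This is a minimal Python sketch, not the invention's implementation: the word sets are English glosses standing in for the Chinese pronoun lists of patterns 3 to 5 (an assumption, and abbreviated), sentences are lists of (word, tag) pairs, and the names `match_pattern` and `extract_candidate` are hypothetical.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Placeholder glosses; a real system would list the Chinese words themselves.
DEMONSTRATIVES = {"this", "these", "that", "those"}
THING_PRONOUNS = DEMONSTRATIVES | {"it"}
PERSON_PRONOUNS = {"he", "she", "I", "we", "they"}


def match_pattern(n2: List[Token]) -> Optional[int]:
    """Return the number (1-5) of the first pattern that N2 satisfies, else None."""
    words = [w for w, _ in n2]
    tags = [t for _, t in n2]
    if tags == ["m", "n"]:
        return 1  # numeral + noun
    if tags == ["m", "q", "n"]:
        return 2  # numeral + quantifier + noun
    if len(n2) == 1 and words[0] in THING_PRONOUNS:
        return 3  # a pronoun for things, standing alone
    if len(n2) == 2 and words[0] in DEMONSTRATIVES and tags[1] == "n":
        return 4  # demonstrative pronoun + noun
    if len(n2) == 1 and words[0] in PERSON_PRONOUNS:
        return 5  # a personal pronoun, standing alone
    return None


def extract_candidate(tokens: List[Token]) -> Optional[Tuple[str, str]]:
    """Return the pair (structure, sentence) for SOBase, or None to discard."""
    verb_positions = [i for i, (_, tag) in enumerate(tokens) if tag == "v"]
    if len(verb_positions) != 2:  # keep only sentences with exactly 2 verbs
        return None
    first, second = verb_positions
    if match_pattern(tokens[first + 1:second]) is None:
        return None
    structure = f"{tokens[first][0]}…{tokens[second][0]}"
    sentence = " ".join(f"{w}/{t}" for w, t in tokens)
    return structure, sentence


candidate = extract_candidate(
    [("teacher", "n"), ("speak-to", "v"), ("those", "r"),
     ("students", "n"), ("listen", "v")])
```

The sketch treats everything between the two verbs as N_2, so N_1 and N_3 may be empty, as N_3 is in the example sentence.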
Module D: verify the candidate bilingual structure library SOBase and output the final result SOBaseResult.
For each record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate library SOBase, the invention provides two verification techniques: verification of the commonness of the bilingual collocation and verification of the diversity of the bilingual collocation, which are necessary conditions for a bilingual structure to be correct.
Verification of the commonness of the bilingual collocation means that, if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct bilingual sentence, then the bilingual structure "V_{i,1}…V_{i,2}" also occurs in other sentences in TCorpus, not only in the bilingual sentence SO_i.
For example, let SO_i be "the group committee invites them to the meeting". Then SO_{i,1} "the host invites them to participate in an interaction" and SO_{i,2} "the owners invite them to have lunch together" may also appear among the other sentences in TCorpus; that is, the structure is attested by SO_{i,1} and SO_{i,2} as well and does not depend solely on the particular bilingual sentence SO_i.
Verification of the diversity of the bilingual collocation means that, if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct bilingual sentence, then bilingual sentences of the form SO'_i = "N'_{i,1} V_{i,1} N'_{i,2} V_{i,2} N'_{i,3}", SO''_i = "N''_{i,1} V_{i,1} N''_{i,2} V_{i,2} N''_{i,3}", and so on should also appear multiple times in TCorpus.
For example, if SO_i is "the group committee invites them to the meeting", then SO'_i may be "friends invite us to attend a birthday party", SO''_i "a friend invites her to a wedding", and SO'''_i "the professor invites these students to participate in the conference". That is, in SO_i "the group committee invites them to the meeting", the bilingual word "them" may be replaced by words of various forms, and the bilingual structure "invite … to participate" in SO_i remains reasonable and correct.
Based on the above analysis of the commonness and the diversity of bilingual collocations, module D is implemented as follows. First, two non-negative thresholds a and b are introduced, where a ∈ (0, 1) and b ∈ (0, 1).
Step D1: set SOBaseResult to empty; it stores the verified, correct bilingual structures.
Step D2: if SOBase is empty, go to step D6.
Step D3: take any record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> out of SOBase.
Step D4: if cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2.
cof("V_{i,1}…V_{i,2}") reflects how common the bilingual structure "V_{i,1}…V_{i,2}" is, and is calculated as follows: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (number of sentences in TCorpus). When cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is considered a correct bilingual structure.
Step D5: if muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult.
muf("V_{i,1}…V_{i,2}") is a mathematical measure of the diversity of the bilingual collocation. It is calculated in the following sub-steps, with the sets V_{*,1} and V_{*,2} initially empty.
Step D51: if SOBase contains a record <"V_x…V_{i,2}", "N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}">, put V_x into the set V_{*,1}.
Step D52: if SOBase contains a record <"V_{i,1}…V_y", "N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}">, put V_y into the set V_{*,2}.
Step D53: calculate muf("V_{i,1}…V_{i,2}"); the calculation formula is as follows:
Step D6: output the final bilingual structure result SOBaseResult.
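Steps D1 to D6 can be sketched in Python as follows. This is a sketch under several assumptions, not the invention's implementation: each TCorpus sentence is reduced to its ordered list of verbs; a sentence "contains the structure" when V_1 occurs and V_2 occurs somewhere after it; the sets of steps D51 and D52 are collected by structure only, ignoring the nouns; and every record is examined before the result is output. Because the formula of step D53 is not given in the text, `muf` is passed in as a caller-supplied function rather than implemented. All function names are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Structure = Tuple[str, str]  # (V1, V2) of a candidate structure "V1…V2"


def cof(v1: str, v2: str, corpus_verbs: List[List[str]]) -> float:
    """Share of corpus sentences in which v1 occurs and v2 occurs after it."""
    if not corpus_verbs:
        return 0.0
    hits = 0
    for verbs in corpus_verbs:
        if v1 in verbs and v2 in verbs[verbs.index(v1) + 1:]:
            hits += 1
    return hits / len(corpus_verbs)


def collocation_sets(v1: str, v2: str,
                     so_base: List[Structure]) -> Tuple[Set[str], Set[str]]:
    """Steps D51/D52: first verbs seen with v2, and second verbs seen with v1."""
    v_star_1 = {x for x, y in so_base if y == v2}
    v_star_2 = {y for x, y in so_base if x == v1}
    return v_star_1, v_star_2


def verify(so_base: List[Structure], corpus_verbs: List[List[str]],
           a: float, b: float,
           muf: Callable[[str, str, List[Structure]], float]) -> Set[Structure]:
    """Accept a structure if cof > a (step D4) or, failing that, muf > b (step D5)."""
    result: Set[Structure] = set()          # step D1
    for v1, v2 in so_base:                  # steps D2/D3
        if cof(v1, v2, corpus_verbs) > a or muf(v1, v2, so_base) > b:
            result.add((v1, v2))
    return result                           # step D6


corpus = [["invite", "attend"], ["invite", "attend"], ["eat"], ["speak-to", "listen"]]
base = [("invite", "attend"), ("speak-to", "listen")]
accepted = verify(base, corpus, a=0.3, b=0.5, muf=lambda v1, v2, sb: 0.0)
```

A caller would typically build `muf` on top of `collocation_sets`, using whatever formula step D53 prescribes.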
Experimental Results
Several preliminary experiments showed that the acquired bilingual structures are better when the threshold a for the commonness of the bilingual collocation is set to 0.0006 (a = 0.0006) and the threshold b for the diversity of the bilingual collocation is set to 0.0015 (b = 0.0015). In tests on a 1 TB corpus, the system of the invention obtained 139,600 bilingual structures, and analysis showed an accuracy of 98.2%. The invention therefore achieves good recognition performance and is suitable for practical application.