CN106815188B

CN106815188B - Method for acquiring Chinese and bilingual structure

Info

Publication number: CN106815188B
Application number: CN201510846489.9A
Authority: CN
Inventors: 符建辉; 王卫明; 曹阳
Original assignee: Zhongke Guoli (zhenjiang) Intelligent Technology Co Ltd
Current assignee: Zhongke Guoli (zhenjiang) Intelligent Technology Co Ltd
Priority date: 2015-11-27
Filing date: 2015-11-27
Publication date: 2020-02-18
Anticipated expiration: 2035-11-27
Also published as: CN106815188A

Abstract

The invention relates to a method for acquiring Chinese bilingual structures, comprising: performing word segmentation on an original training corpus Corpus to form a segmented corpus TCorpus; identifying the verbs in each sentence S_i of the segmented corpus TCorpus; analyzing the sentences in TCorpus with the bilingual modes, forming a candidate bilingual structure for each sentence that satisfies a bilingual mode, and putting it into the candidate bilingual structure library SOBase to be verified; and verifying the candidate library SOBase and outputting the final result SOBaseResult. By introducing bilingual modes, the invention keeps the complexity of bilingual forms under control without reducing the acquisition effect. Given the complexity of Chinese word formation and sentence construction, and to ensure the accuracy of the acquired bilingual structures, the invention strictly verifies each acquired structure from two angles: 'diversity of bilingual structure collocation' and 'commonness of bilingual structure collocation'.

Description

Method for acquiring Chinese and bilingual structure

Technical Field

The invention relates to the field of Chinese natural language processing and automatic identification of Chinese grammatical structures, and in particular to a method for automatically identifying Chinese bilingual structures.

Background

Chinese bilingual sentences (兼语句, also called pivotal sentences) are a special class of linguistic phenomena. Consider the following three sentences (words are separated by spaces and tagged with their parts of speech, which makes the bilingual word in each sentence easier to see):

S1: "group committee/n invite/v they/r attend/v meeting/n"

S2: "school/n support/v graduate/n startup/v"

S3: "who/r let/v this/r [measure word]/q bottle/n fall/v ground/s [particle]/u ?/w"

In S1, "they" is the object of the verb "invite" and also the subject of the verb "attend", so "they" is the bilingual word of S1. In S2, "graduate" is the object of the verb "support" and also the subject of the verb "startup" (start a business), so "graduate" is the bilingual word of S2. Similarly, in S3, "bottle" is the object of "let" and also the subject of the verb "fall", so "bottle" is the bilingual word of S3.

As these three typical examples show, the Chinese bilingual sentence is a common linguistic phenomenon. For more than 30 years, well-known Chinese scholars such as Zhu Dexi, Dingshu, Huang Borong, Luji Heiping and Wu inspiring have carried out systematic research on Chinese bilingual sentences from the perspective of grammar or semantics, which has played an important role in people's understanding of such sentences.

Besides its value for theoretical research and for Chinese teaching and training, research on bilingual structures also has many important uses as Internet applications develop across the board.

For example, Chinese bilingual structures can serve as part of the language model in speech recognition and play an important auxiliary role in building such language models automatically.

As another example, unknown-word recognition has long been an important problem: given a dictionary, words that do not appear in it are called unknown (unregistered) words. Because any dictionary initially contains only a limited set of words, it has to be supplemented continuously in practical use. One technical difficulty in unknown-word recognition or dictionary supplementation is how to accurately determine the left and right boundaries of an unknown word.

How can bilingual structures be acquired effectively from a large corpus, by processing and analyzing it, so as to form a bilingual structure library? How can one verify which verbs combine with which nouns, and under what conditions, to form a bilingual structure? These problems have not yet been adequately studied or solved.

Disclosure of Invention

The invention addresses two problems: how to effectively acquire bilingual structures from a large corpus by processing and analyzing it so as to form a bilingual structure library, and how to verify which verbs combine with which nouns, and under what conditions, to form a bilingual structure. To this end it provides a method for acquiring Chinese bilingual structures.

To solve these problems, the invention adopts the following technical scheme:

A method for acquiring Chinese bilingual structures, comprising the following steps:

The first step: performing word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus.

The open-source ICTCLAS system is used to segment each input text D in Corpus, and each text is split at its natural sentence boundaries into simple sentences without sentence-final punctuation. Each sentence in TCorpus thus has the form S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and pos_i is its part of speech.

Part-of-speech tagging is already standard practice in word segmentation; the common part-of-speech tags include a, b, c, d, h, j, k, m, n, p, q, r, u and z.
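The sentence format produced by the first step can be illustrated with a small parser. This is only a sketch: it assumes the tagged tokens are separated by whitespace (as in the worked examples of this document), and the function name parse_tagged is hypothetical rather than part of ICTCLAS.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def parse_tagged(sentence: str) -> List[Token]:
    """Split a 'word/pos word/pos ...' string into (word, pos) pairs.

    Assumption: tokens are whitespace-separated and the tag follows
    the last '/' of each token.
    """
    tokens: List[Token] = []
    for item in sentence.split():
        # rpartition splits on the LAST '/', so a word containing '/' stays intact
        word, sep, pos = item.rpartition("/")
        if not sep:
            raise ValueError(f"token without a POS tag: {item!r}")
        tokens.append((word, pos))
    return tokens


tokens = parse_tagged("teacher/n carefully/d speak/v to/v those/r student/n listen/v")
```

Each sentence then becomes a list of (W_i, pos_i) pairs, which is a convenient shape for the later steps.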

The second step: identifying the verb or verb phrase in each sentence S_i of the segmented corpus TCorpus.

When "W_1/v W_2/v" appears, it is merged into "W_1W_2/v"; that is, two or more consecutive verbs are merged into one verb, a process called verb merging. After merging, the adverbs that modify a verb are eliminated; that is, all modifying adverbs before the verb are deleted. The processed sentence is put back into TCorpus.
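Verb merging followed by adverb deletion might be sketched as below. The function names and the sep parameter are hypothetical; sep defaults to plain concatenation ("W_1W_2") and is set to a space here only because the example uses English glosses.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def merge_verbs(tokens: List[Token], sep: str = "") -> List[Token]:
    """Merge every run of consecutive verbs (tag 'v') into one verb token."""
    merged: List[Token] = []
    for word, pos in tokens:
        if pos == "v" and merged and merged[-1][1] == "v":
            prev_word, _ = merged.pop()
            merged.append((prev_word + sep + word, "v"))
        else:
            merged.append((word, pos))
    return merged


def drop_adverbs_before_verbs(tokens: List[Token]) -> List[Token]:
    """Delete adverbs (tag 'd') that directly precede a verb."""
    kept: List[Token] = []
    for i, (word, pos) in enumerate(tokens):
        if pos == "d":
            j = i
            while j < len(tokens) and tokens[j][1] == "d":
                j += 1  # skip over the whole run of adverbs
            if j < len(tokens) and tokens[j][1] == "v":
                continue  # the run modifies a verb, so drop this adverb
        kept.append((word, pos))
    return kept


sentence = [("teacher", "n"), ("carefully", "d"), ("speak", "v"), ("to", "v"),
            ("those", "r"), ("student", "n"), ("listen", "v")]
result = drop_adverbs_before_verbs(merge_verbs(sentence, sep=" "))
```

Merging runs first, as in the text, so that an adverb is judged against the merged verb that follows it.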

The third step: analyzing the sentences in TCorpus with the bilingual modes, forming a candidate bilingual structure for each sentence that satisfies a bilingual mode, and putting it into the candidate bilingual structure library SOBase to be verified.

Analyzing the sentences in TCorpus with the bilingual modes means that 5 bilingual modes are adopted, and the sentences in TCorpus that match one of them are selected and placed in the library SOBase to be verified.

Specifically, any sentence SO_i in TCorpus that contains more than 2 verbs, or only 1 verb, is discarded. Otherwise, let SO_i have the form "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}", where the subscript i denotes the i-th sentence. The main task is then to check whether N_{i,2} satisfies one of the 5 bilingual modes. If it does, the pair <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> is put into SOBase; otherwise SO_i is discarded.

The 5 bilingual modes are as follows. Let the general form of a bilingual sentence be "N_1 V_1 N_2 V_2 N_3", where N_2 is the bilingual word. When acquiring bilingual structures, only bilingual sentences whose bilingual word N_2 satisfies one of the following modes are considered; the assumption is that, when the corpus is sufficiently large, the structures of bilingual sentences whose bilingual word takes another form can also be obtained from bilingual sentences whose bilingual word satisfies one of these 5 modes:

Mode 1: number + noun;

Mode 2: number + quantifier + noun;

Mode 3: { this, this field, this time, this bit, this, these, that field, that time, that bit, that, those, it, they }; the elements of this set are common pronouns, typically used to refer to inanimate objects or animals, and each of them is a mode by itself;

Mode 4: { this, this field, this time, this bit, this, these, that field, that time, that bit, that, those } + noun; this is a bilingual mode consisting of a pronoun and a noun;

Mode 5: { he, they, I, we, she, they }; the elements of this set are common pronouns, commonly used to refer to people, and each of them is a mode by itself.
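The five modes can be read as a small classifier over the tokens of N_2. The sketch below is illustrative only: the word sets reuse the English glosses of this translation as stand-ins for the Chinese pronouns (which the text does not list in Chinese), "they" is kept only in the human set to avoid the ambiguity the glosses introduce, and match_mode is a hypothetical name.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# English glosses used as placeholders for the Chinese demonstratives.
DEMONSTRATIVES = {"this", "this field", "this time", "this bit", "these",
                  "that field", "that time", "that bit", "that", "those"}
# Mode 3 also lists a non-human "they"; it is omitted here because the
# gloss collides with the human "they" of mode 5.
NON_HUMAN_PRONOUNS = DEMONSTRATIVES | {"it"}
HUMAN_PRONOUNS = {"he", "they", "I", "we", "she"}


def match_mode(n2: List[Token]) -> Optional[int]:
    """Return the number (1-5) of the mode that N_2 satisfies, else None."""
    words = [w for w, _ in n2]
    tags = [pos for _, pos in n2]
    if tags == ["m", "n"]:
        return 1  # number + noun
    if tags == ["m", "q", "n"]:
        return 2  # number + quantifier + noun
    if tags == ["r"] and words[0] in NON_HUMAN_PRONOUNS:
        return 3  # pronoun for objects or animals
    if tags == ["r", "n"] and words[0] in DEMONSTRATIVES:
        return 4  # demonstrative pronoun + noun
    if tags == ["r"] and words[0] in HUMAN_PRONOUNS:
        return 5  # pronoun for people
    return None


mode = match_mode([("those", "r"), ("student", "n")])
```

A sentence is kept as a candidate only when match_mode returns a mode number for its N_2.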

The fourth step: verifying the candidate bilingual structure library SOBase and outputting the final result SOBaseResult.

For each record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate library SOBase, two verification techniques are employed: verification of the commonness of the bilingual collocation and verification of the diversity of the bilingual collocation. Both are necessary conditions for a bilingual structure to be correct.

Verification of the commonness of the bilingual collocation means that when SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct bilingual sentence, the bilingual structure "V_{i,1}…V_{i,2}" also occurs in other sentences in TCorpus, not only in the bilingual sentence SO_i.

Verification of the diversity of the bilingual collocation means that if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct bilingual sentence, then bilingual sentences of the form SO'_i = "N'_{i,1} V_{i,1} N'_{i,2} V_{i,2} N'_{i,3}", SO''_i = "N''_{i,1} V_{i,1} N''_{i,2} V_{i,2} N''_{i,3}" and so on should also appear multiple times in TCorpus.

The fourth step is implemented by the following specific steps.

First, two non-negative thresholds a and b are introduced, where a ∈ (0, 1) and b ∈ (0, 1).

Step D1: set SOBaseResult to empty; it stores the verified, correct bilingual structures;

Step D2: if SOBase is empty, go to step D6;

Step D3: take any record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> out of SOBase;

Step D4: if cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2;

cof("V_{i,1}…V_{i,2}") reflects how common the bilingual structure "V_{i,1}…V_{i,2}" is, and is calculated as follows: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (number of sentences in TCorpus); when cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is considered a correct bilingual structure;

Step D5: if muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult;

muf("V_{i,1}…V_{i,2}") is a mathematical measure of the diversity of the bilingual collocation, calculated by the following sub-steps; initially, the sets V_{*,1} and V_{*,2} are empty;

Step D51: if SOBase contains a record <"V_x…V_{i,2}", "N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}">, put V_x into the set V_{*,1};

Step D52: if SOBase contains a record <"V_{i,1}…V_y", "N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}">, put V_y into the set V_{*,2};

Step D53: calculate muf("V_{i,1}…V_{i,2}"); the calculation formula is as follows:

Step D6: output the final bilingual structure result SOBaseResult.
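Steps D1 to D6 can be read as a simple loop over SOBase. The following is a sketch under stated assumptions, not the patented implementation: a structure is represented as a (V_1, V_2) tuple, a sentence "contains" a structure when V_1 occurs as a verb before V_2, control is assumed to return to step D2 after step D5, and because the muf formula of step D53 is not reproduced in this text, muf is passed in as a caller-supplied function. The thresholds in the example are illustrative values only.

```python
from typing import Callable, List, Set, Tuple

Token = Tuple[str, str]       # (word, part-of-speech tag)
Structure = Tuple[str, str]   # (V1, V2)


def contains_structure(tokens: List[Token], v1: str, v2: str) -> bool:
    """True if v1 occurs as a verb somewhere before v2 in the sentence."""
    verbs = [w for w, pos in tokens if pos == "v"]
    return v1 in verbs and v2 in verbs[verbs.index(v1) + 1:]


def cof(structure: Structure, tcorpus: List[List[Token]]) -> float:
    """Share of TCorpus sentences that contain the structure."""
    if not tcorpus:
        return 0.0
    v1, v2 = structure
    hits = sum(contains_structure(s, v1, v2) for s in tcorpus)
    return hits / len(tcorpus)


def verify(sobase: List[Structure], tcorpus: List[List[Token]],
           muf: Callable[[Structure], float], a: float, b: float) -> Set[Structure]:
    result: Set[Structure] = set()        # D1
    pending = list(sobase)
    while pending:                        # D2
        structure = pending.pop()         # D3
        if cof(structure, tcorpus) > a:   # D4
            result.add(structure)
            continue
        if muf(structure) > b:            # D5 (muf is a placeholder hook)
            result.add(structure)
    return result                         # D6


tcorpus = [
    [("committee", "n"), ("invite", "v"), ("them", "r"), ("attend", "v"), ("meeting", "n")],
    [("friend", "n"), ("invite", "v"), ("us", "r"), ("attend", "v"), ("party", "n")],
    [("who", "r"), ("let", "v"), ("bottle", "n"), ("fall", "v")],
    [("school", "n"), ("support", "v"), ("graduate", "n"), ("startup", "v")],
]
accepted = verify([("invite", "attend"), ("let", "fall")], tcorpus,
                  muf=lambda s: 0.0, a=0.4, b=0.5)
```

Note that, as written in the steps, a structure is accepted when either check passes.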

Beneficial effects:

The invention provides a system and a method for acquiring Chinese bilingual structures by means of linguistics and computer technology. To cope with the complexity and diversity of bilingual forms, bilingual modes are introduced, which keep that complexity under control without reducing the acquisition effect. Given the complexity of Chinese word formation and sentence construction, and to ensure the accuracy of the acquired bilingual structures, the invention strictly verifies each acquired structure from two angles: 'diversity of bilingual structure collocation' and 'commonness of bilingual structure collocation'.

In tests on a 1 TB corpus, the system of the invention acquired 139,600 bilingual structures, and analysis showed an accuracy of 98.2%. The invention therefore achieves good recognition performance and meets the needs of practical application.

Drawings

Fig. 1 is a flow chart of the method for acquiring Chinese bilingual structures.

Detailed Description

To explain the invention more clearly, the following terms are first defined:

(1) Chinese part of speech: part of speech is an attribute of Chinese words. The common ones are: nouns (e.g., child, graduate; denoted by n), verbs (e.g., invite, attend; denoted by v), adverbs (e.g., often; denoted by d), adjectives (e.g., beautiful, nice, simple; denoted by a), pronouns (e.g., these, this, it, they; denoted by r), numerals (e.g., one, twelve, 12; denoted by m), and quantifiers, that is, measure words (e.g., one, root, only, bar; denoted by q).

(2) The 5 bilingual modes: to control the complexity of acquiring bilingual structures while still acquiring as many of them as possible, the invention introduces 5 bilingual modes. For ease of presentation, the general form of a bilingual sentence is assumed to be "N_1 V_1 N_2 V_2 N_3", where N_2 is called the bilingual word. When acquiring bilingual structures, the invention considers only bilingual sentences whose bilingual word N_2 satisfies one of the following modes (that is, it is assumed that, when the corpus is sufficiently large, the structures of bilingual sentences whose bilingual word takes another form can also be obtained from bilingual sentences that satisfy one of the following 5 modes):

Mode 1: number + noun. For example, "three/m persons/n" and "3/m items/n" are two specific examples.

Mode 2: number + quantifier + noun. For example, "three/m [measure word]/q persons/n" and "3/m [measure word]/q plants/n" are two specific examples.

Mode 3: { this, this field, this time, this bit, this, these, that field, that time, that bit, that, those, it, they }. The elements of this set are common pronouns, typically used to refer to inanimate objects or animals, and each of them is a mode by itself.

Mode 4: { this, this field, this time, this bit, this, these, that field, that time, that bit, that, those } + noun. This is a bilingual mode consisting of a pronoun and a noun. "this/r race/n" and "these/r stock/n" are two specific examples.

Mode 5: { he, they, I, we, she, they }. The elements of this set are common pronouns, commonly used to refer to people, and each of them is a mode by itself.

(3) ICTCLAS system: a free, open-source word segmentation system. It takes a text as input and outputs the segmented word sequence of that text. The download site of the ICTCLAS system is http://ictlas.nlpir.org. After segmentation, each word is tagged with its part of speech, where a denotes an adjective, b a distinguishing word, c a conjunction, d an adverb, h a prefix, j an abbreviation, k a suffix, m a numeral, n a noun, p a preposition, q a quantifier, r a pronoun, u an auxiliary word, z a status word, and so on.

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. In the following, a short sentence or a long article is generally referred to as text.

The method for acquiring Chinese bilingual structures is divided into four main modules:

Module A: performing word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus.

Module B: identifying the verbs in each sentence S_i of the segmented corpus TCorpus.

Module C: analyzing the sentences in TCorpus with the bilingual modes, forming a candidate bilingual structure for each sentence that satisfies a bilingual mode, and putting it into the candidate bilingual structure library SOBase to be verified.

Module D: verifying the candidate bilingual structure library SOBase and outputting the final result SOBaseResult.

The workflow or method of each module is explained in detail below.

Module A: performing word segmentation on the original training corpus Corpus to form the segmented corpus TCorpus.

The open-source ICTCLAS system is used to segment each input text D in Corpus, and each text is split at its natural sentence boundaries into simple sentences without sentence-final punctuation. Each sentence in TCorpus thus has the form S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and pos_i is its part of speech.

Part-of-speech tagging is already standard practice in word segmentation. Typical tags include a (adjective), b (distinguishing word), c (conjunction), d (adverb), h (prefix), j (abbreviation), k (suffix), m (numeral), n (noun), p (preposition), q (quantifier), r (pronoun), u (auxiliary word) and z (status word). For example, the segmentation result of the sentence "the teacher carefully speaks to those students for them to listen" is: "teacher/n carefully/d speak/v to/v those/r student/n listen/v".

Module B: identifying the verb or verb phrase in each sentence S_i of the segmented corpus TCorpus.

For each sentence S_i in TCorpus, if "W_1/v W_2/v" appears, it is merged into "W_1W_2/v"; that is, two or more consecutive verbs are merged into one verb, a process called verb merging. For example, in the sentence "teacher/n speak/v to/v those/r student/n listen/v", this processing yields the verb phrase "speak to" and thus the bilingual sentence "teacher/n speak to/v those/r student/n listen/v". The purpose is to acquire as many bilingual structures as possible.

After this processing, the adverbs that modify a verb are eliminated; that is, all modifying adverbs before the verb are deleted. For example, verb merging turns the sentence "teacher/n carefully/d speak/v to/v those/r student/n listen/v" into the bilingual sentence "teacher/n carefully/d speak to/v those/r student/n listen/v". Adverb deletion then yields the bilingual sentence "teacher/n speak to/v those/r student/n listen/v".

The processed sentence is put back into TCorpus.

Module C: analyzing the sentences in TCorpus with the bilingual modes means that the 5 bilingual modes defined above are adopted, and the sentences in TCorpus that match one of them are selected and placed in the library SOBase to be verified.

Specifically, any sentence SO_i in TCorpus that contains more than 2 verbs, or only 1 verb, is discarded. Otherwise, let SO_i have the form "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" (the subscript i denotes the i-th sentence). The main task is then to check whether N_{i,2} satisfies one of the 5 bilingual modes. If it does, the pair <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> is put into SOBase; otherwise SO_i is discarded.

For example, in the bilingual sentence "teacher/n speak to/v those/r student/n listen/v", "those/r student/n" satisfies bilingual mode 4. The invention therefore treats "speak to … listen", obtained from this sentence, as a candidate bilingual structure and puts <"speak to … listen", "teacher/n speak to/v those/r student/n listen/v"> into the candidate bilingual structure library SOBase for further verification by module D.
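The candidate extraction just described could look roughly like the sketch below. It assumes verb merging and adverb deletion have already been applied, keeps only sentences with exactly two verbs, and takes the mode check as a caller-supplied predicate; extract_candidate and n2_ok are hypothetical names, and the lambda in the example is only a stand-in for mode 4.

```python
from typing import Callable, List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def extract_candidate(
    tokens: List[Token],
    n2_ok: Callable[[List[Token]], bool],
) -> Optional[Tuple[str, str]]:
    """Return (structure, sentence) for a candidate, or None if discarded."""
    verb_idx = [i for i, (_, pos) in enumerate(tokens) if pos == "v"]
    if len(verb_idx) != 2:
        return None  # more than 2 verbs, or fewer than 2: discard
    i1, i2 = verb_idx
    n2 = tokens[i1 + 1:i2]  # the words between V1 and V2
    if not n2 or not n2_ok(n2):
        return None  # N2 matches none of the modes
    structure = f"{tokens[i1][0]}…{tokens[i2][0]}"
    sentence = " ".join(f"{w}/{p}" for w, p in tokens)
    return structure, sentence


tokens = [("teacher", "n"), ("speak to", "v"), ("those", "r"),
          ("student", "n"), ("listen", "v")]
# Stand-in for bilingual mode 4: pronoun followed by noun.
candidate = extract_candidate(tokens, lambda n2: [p for _, p in n2] == ["r", "n"])
```

The returned pair corresponds to one record of SOBase.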

Module D: for each record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> in the candidate bilingual structure library SOBase, the invention provides two verification techniques: verification of the commonness of the bilingual collocation and verification of the diversity of the bilingual collocation. Both are necessary conditions for a bilingual structure to be correct.

Verification of the commonness of the bilingual collocation means that if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct bilingual sentence, the bilingual structure "V_{i,1}…V_{i,2}" also appears in other sentences in TCorpus, not only in the bilingual sentence SO_i.

For example, let SO_i be "the group committee invites them to attend the meeting". Then SO_{i,1} = "the host invites them to take part in the interaction" and SO_{i,2} = "the owner invites them to have lunch together" may also appear among the other sentences in TCorpus; that is, SO_{i,1} and SO_{i,2} show that the structure does not depend only on the particular bilingual sentence SO_i.

Verification of the diversity of the bilingual collocation means that if SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct bilingual sentence, then bilingual sentences of the form SO'_i = "N'_{i,1} V_{i,1} N'_{i,2} V_{i,2} N'_{i,3}", SO''_i = "N''_{i,1} V_{i,1} N''_{i,2} V_{i,2} N''_{i,3}" and so on should also appear multiple times in TCorpus.

For example, if SO_i is "the group committee invites them to attend the meeting", then SO'_i may be "friends invite us to attend the birthday party", SO''_i "a friend invites her to attend the wedding", and SO'''_i "the professor invites these students to attend the conference". That is, in SO_i the bilingual word "they" can be replaced by words of various forms, and the bilingual structure "invite … attend" in SO_i remains reasonable and correct.

Based on the above analysis of the commonness and the diversity of bilingual collocations, module D is implemented as follows.

First, two non-negative thresholds a and b are introduced, where a ∈ (0, 1) and b ∈ (0, 1).

Step D1: set SOBaseResult to empty; it stores the verified, correct bilingual structures.

Step D2: if SOBase is empty, go to step D6.

Step D3: take any record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> out of SOBase.

Step D4: if cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2.

cof("V_{i,1}…V_{i,2}") reflects how common the bilingual structure "V_{i,1}…V_{i,2}" is, and is calculated as follows: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (number of sentences in TCorpus). When cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is considered a correct bilingual structure.

Step D5: if muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult.

muf("V_{i,1}…V_{i,2}") is a mathematical measure of the diversity of the bilingual collocation, calculated by the following sub-steps. Initially, the sets V_{*,1} and V_{*,2} are empty.

Step D51: if SOBase contains a record <"V_x…V_{i,2}", "N_{i,1} V_x N_{i,2} V_{i,2} N_{i,3}">, put V_x into the set V_{*,1}.

Step D52: if SOBase contains a record <"V_{i,1}…V_y", "N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}">, put V_y into the set V_{*,2}.

Step D6: output the final bilingual structure result SOBaseResult.
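Sub-steps D51 and D52 collect the verbs that can stand in for V_{i,1} and V_{i,2}. The sketch below follows the notation literally, so a record counts only when its nouns N_1, N_2, N_3 equal those of the target record; that literal reading, the Candidate type and the function name are assumptions. The sketch stops at the two sets, since the formula that combines them into muf is not reproduced in this text.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Candidate(NamedTuple):
    """One SOBase record, split into the parts N1 V1 N2 V2 N3."""
    n1: str
    v1: str
    n2: str
    v2: str
    n3: str


def partner_verbs(target: Candidate,
                  sobase: Iterable[Candidate]) -> Tuple[Set[str], Set[str]]:
    """Collect V_{*,1} (alternatives for V1) and V_{*,2} (alternatives for V2)."""
    v_star_1: Set[str] = set()
    v_star_2: Set[str] = set()
    for rec in sobase:
        if (rec.n1, rec.n2, rec.n3) != (target.n1, target.n2, target.n3):
            continue  # literal reading: the nouns must be identical
        if rec.v2 == target.v2:
            v_star_1.add(rec.v1)  # step D51: record <"V_x…V2", ...>
        if rec.v1 == target.v1:
            v_star_2.add(rec.v2)  # step D52: record <"V1…V_y", ...>
    return v_star_1, v_star_2


target = Candidate("committee", "invite", "them", "attend", "meeting")
sobase = [
    Candidate("committee", "ask", "them", "attend", "meeting"),
    Candidate("committee", "invite", "them", "join", "meeting"),
    Candidate("school", "support", "graduate", "startup", ""),
]
v_star_1, v_star_2 = partner_verbs(target, sobase)
```

The target record itself is assumed to have been taken out of SOBase already (step D3), so it is not counted.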

Experimental Results

Preliminary experiments showed that the acquired bilingual structures are of better quality when the threshold a for the commonness of the bilingual collocation is set to 0.0006 (a = 0.0006) and the threshold b for the diversity of the bilingual collocation is set to 0.0015 (b = 0.0015). In tests on a 1 TB corpus, the system of the invention acquired 139,600 bilingual structures, and analysis showed an accuracy of 98.2%. The invention therefore achieves good recognition performance and meets the needs of practical application.
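To give a feel for the commonness threshold, the condition of step D4 can be rewritten as a count. In the worked lines below, n(·) is shorthand introduced here for the number of sentences containing the structure, and the corpus size of one million sentences is a hypothetical figure chosen only for the arithmetic.

```latex
\[
\mathrm{cof}(V_{1}\ldots V_{2})
  = \frac{n(V_{1}\ldots V_{2})}{|\mathit{TCorpus}|} > a
\quad\Longleftrightarrow\quad
n(V_{1}\ldots V_{2}) > a \cdot |\mathit{TCorpus}|
\]
\[
a = 0.0006,\quad |\mathit{TCorpus}| = 10^{6}
\quad\Longrightarrow\quad
n(V_{1}\ldots V_{2}) > 600
\]
```

Under that assumption, a structure would need to occur in more than 600 of the one million sentences to pass the commonness check.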

Claims

1. A method for acquiring Chinese bilingual structures, characterized in that the method comprises the following steps:

performing word segmentation on each input text D in Corpus by using the open-source ICTCLAS system, and splitting each text at its natural sentence boundaries into simple sentences without sentence-final punctuation; each sentence in TCorpus thus has the form S_i = "W_1/pos_1 W_2/pos_2 … W_i/pos_i … W_n/pos_n", where each W_i is a Chinese word, Chinese character, punctuation mark, Arabic numeral, English word or letter, and pos_i is its part of speech;

for any sentence SO_i in TCorpus, discarding the sentence when it contains more than 2 verbs or only 1 verb; otherwise, letting SO_i have the form "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}", where the subscript i denotes the i-th sentence; checking whether N_{i,2} satisfies one of the 5 bilingual modes; if it does, putting the pair <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> into SOBase; otherwise discarding SO_i;

mode 1: number + noun;

mode 2: number + quantifier + noun;

verification of the commonness of the bilingual collocation means that when SO_i = "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}" is a correct bilingual sentence, the bilingual structure "V_{i,1}…V_{i,2}" also appears in other sentences in TCorpus, not only in the bilingual sentence SO_i;

the fourth step is implemented by the following specific steps:

step D2: if SOBase is empty, go to step D6;

step D3: take any record <"V_{i,1}…V_{i,2}", "N_{i,1} V_{i,1} N_{i,2} V_{i,2} N_{i,3}"> out of SOBase;

step D4: if cof("V_{i,1}…V_{i,2}") > a, put "V_{i,1}…V_{i,2}" into the set SOBaseResult and go to step D2;

cof("V_{i,1}…V_{i,2}") reflects how common the bilingual structure "V_{i,1}…V_{i,2}" is, and is calculated as follows: cof("V_{i,1}…V_{i,2}") = (number of sentences in TCorpus that contain the structure "V_{i,1}…V_{i,2}") / (number of sentences in TCorpus); when cof("V_{i,1}…V_{i,2}") > a, "V_{i,1}…V_{i,2}" is considered a correct bilingual structure;

step D5: if muf("V_{i,1}…V_{i,2}") > b, put "V_{i,1}…V_{i,2}" into the set SOBaseResult; muf("V_{i,1}…V_{i,2}") is a mathematical measure of the diversity of the bilingual collocation, calculated by the following sub-steps; initially, the sets V_{*,1} and V_{*,2} are empty;

step D52: if SOBase contains a record <"V_{i,1}…V_y", "N_{i,1} V_{i,1} N_{i,2} V_y N_{i,3}">, put V_y into the set V_{*,2};

step D6: output the final bilingual structure result SOBaseResult.