CN105912522A

CN105912522A - Automatic extraction method and extractor of English corpora based on constituent analyses

Info

Publication number: CN105912522A
Application number: CN201610202321.9A
Authority: CN
Inventors: 白晓文; 陈春纬; 刘庆
Original assignee: Changan University
Current assignee: Changan University
Priority date: 2016-03-31
Filing date: 2016-03-31
Publication date: 2016-08-31

Abstract

The invention discloses an automatic extraction method and extractor for English corpora based on constituent analysis, with the aim of rapidly extracting all corpus entries from English text and increasing extraction accuracy. The automatic extractor comprises a segmentation module for segmenting English text into multiple sentences; a constituent analysis module for analyzing the constituents of each sentence to obtain the primary constituents of all sentences and the internal constituents of those primary constituents, and for marking and recognizing the noun phrases among all constituents; and a corpus export module for exporting all marked and recognized noun phrases to form a corpus list.

Description

Automatic English corpus extraction method and extractor based on constituent analysis

Technical field

The invention belongs to the field of computational linguistics and translation technology, and relates to an automatic English corpus extraction method and extractor based on constituent analysis.

Background technology

In natural language processing, tools and techniques for language retrieval have advanced quickly, and chunk identification has moved from manual recognition to machine recognition. Chunk retrieval technology started with extracting continuous, fixed word strings from corpora and, after several years of development, has gradually reached a more advanced stage: extracting discontinuous, variable chunks. From the perspective of corpus research, the identification and retrieval techniques and tools for English chunks have been summarized and reviewed in terms of both continuous chunks and discontinuous chunks.

Corpus retrieval methods have been used to compute and analyze the usage frequency and distribution of academic vocabulary in an information engineering English corpus. That research shows that academic vocabulary covers 10.39% of the information engineering English corpus, which verifies the applicability of academic vocabulary to the information engineering discipline. On this basis, the commonly used methods for retrieving high-frequency academic words from corpora were compared, and an optimization strategy for retrieving high-frequency academic words in specialized English was proposed to address the shortcomings of the existing methods. From 570 academic word families, 248 high-frequency academic word families for information engineering English were extracted, providing an objective basis for teaching specialized English for academic purposes vocabulary and making that teaching significantly better targeted.

Multi-word expressions (MWEs) are used not only to improve the quality of current machine translation systems but also in other natural language processing fields such as cross-language retrieval and data mining. To this end, a method combining semantic templates with statistical tools has been proposed to automatically extract native English MWEs from triple-based comparable corpora. The similarity between words is computed from lexical and positional patterns in order to expand MWE coverage. The GIZA++ alignment algorithm is used to extract the corresponding Chinese MWEs, translation probabilities are computed statistically, and the best English-Chinese MWE translation pairs are selected according to those probabilities. Experimental results indicate that this method can effectively improve the accuracy of MWE extraction and alignment.

Componential analysis is a systematic analysis method that combines the macro and micro levels and is suited to translation quality assessment involving multiple assessment elements. Based on componential analysis, translation quality assessment is divided into four components: "target language expression", "text function", "textual content (non-specialized)", and "textual content (specialized) and terminology". Depending on the text type, a weight, grade, and score are set for each component, so that an assessment combining qualitative and quantitative analysis can be carried out, making translation quality assessment more objective and more operable. From the perspective of semantic componential analysis, the correspondence between English and Chinese words has been explored, and attempts have been made to apply componential analysis theory to translation practice, so that a translation conveys word meaning as accurately as possible while better satisfying the three translation principles of "xin, da, qie" (faithfulness, expressiveness, closeness). However, existing research on English constituent analysis is mostly limited to human translation and teaching and is seldom combined with computer technology; corpus research concentrates on the structure of the corpus itself and its application prospects and rarely addresses the construction of specific, practical corpora; and English constituent analysis methods have not been used for corpus construction.

Summary of the invention

To solve the problems of the prior art, the present invention proposes an automatic English corpus extraction method and extractor based on constituent analysis that can rapidly extract all corpus entries from English text with high extraction accuracy.

To achieve the above object, the technical solution adopted by the present invention is as follows:

An automatic English corpus extractor based on constituent analysis, comprising:

a sentence segmentation module, configured to split English text into several sentences;

a constituent analysis module, configured to perform constituent analysis on each sentence, obtain the primary constituents of all sentences and the internal constituents of the primary constituents, and mark and recognize the noun phrases among all constituents;

and a corpus export module, configured to export all marked and recognized noun phrases to form a corpus list.

An automatic English corpus extraction method based on constituent analysis, comprising the following steps:

1) open the English text and use the sentence segmentation module to split it into several sentences according to sentence-splitting rules;

2) use the constituent analysis module to first break each sentence into individual words and look them up in the dictionary to determine the part of speech of each word in the sentence; once the part of speech of every word is determined, perform phrase recognition; after phrase recognition, perform phrase merging; after phrase merging, obtain the primary constituents of all sentences and the internal constituents of the primary constituents according to grammar rules, and mark and recognize the noun phrases among all constituents;

3) use the corpus export module to export all marked and recognized noun phrases to form a corpus list.

In step 1), the sentence segmentation module defines sentence terminators according to punctuation rules, treats each terminator it encounters as the end of a sentence, and thereby splits the English text into several sentences.

For an English period, the sentence segmentation module must additionally determine whether the period belongs to an abbreviation. The dictionary contains abbreviations; the module searches the dictionary for the word preceding the period together with the period, and if a match is found, the period is an abbreviation mark and is not treated as a sentence terminator.
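The terminator rule and the abbreviation check can be illustrated with a minimal Python sketch. The names `ABBREVIATIONS`, `TERMINATORS`, and `split_sentences` are hypothetical, and the small abbreviation set merely stands in for the dictionary, whose actual contents the text does not list.

```python
# Hypothetical stand-in for the abbreviation entries of the dictionary.
ABBREVIATIONS = {"mr.", "dr.", "no.", "fig."}
# Sentence terminators assumed for this sketch: period, exclamation mark, question mark.
TERMINATORS = {".", "!", "?"}


def split_sentences(text):
    """Split text at terminators, skipping periods that close a known abbreviation."""
    sentences = []
    start = 0
    for i, ch in enumerate(text):
        if ch not in TERMINATORS:
            continue
        if ch == ".":
            # Walk back to the start of the word that precedes the period.
            j = i
            while j > start and not text[j - 1].isspace():
                j -= 1
            # Look up "word + period" in the abbreviation set.
            if text[j:i + 1].lower() in ABBREVIATIONS:
                continue  # abbreviation mark, not a sentence terminator
        sentence = text[start:i + 1].strip()
        if sentence:
            sentences.append(sentence)
        start = i + 1
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)  # trailing text without a terminator
    return sentences


print(split_sentences("Dr. Smith arrived. He met Mr. Lee! Did it work?"))
```

The sketch looks up only the single word immediately before a period, as the text describes; abbreviations containing internal periods would need a wider lookup.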

In step 1), the English text is obtained with a general file-reading module; for a Word document, the COM interface of Word is called to obtain the text, and for an Excel document, the COM interface of Excel is called to obtain the text.

In step 2), the constituent analysis module retrieves the part of speech of each word from the dictionary. If a word has only one part of speech, its part of speech is determined; if a word has several parts of speech, part-of-speech recognition is performed using the other words of the sentence, so that the word is finally assigned a single part of speech in that sentence.

In step 3), the corpus export module sorts the corpus list and traverses it from back to front; if two adjacent rows contain identical characters, they are duplicates and the latter row is deleted.
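As a minimal sketch of this sort-and-deduplicate step, the Python function below sorts a list of extracted phrases and removes adjacent duplicates while walking from the last row to the first. The function name `sort_and_dedupe` is hypothetical, and Python's built-in `sorted` stands in for whatever sorting routine an implementation actually uses.

```python
def sort_and_dedupe(entries):
    """Sort corpus entries, then drop duplicates by scanning from back to front."""
    rows = sorted(entries)  # identical entries become adjacent after sorting
    # Walking backwards means a deletion never shifts the rows still to be visited.
    for i in range(len(rows) - 1, 0, -1):
        if rows[i] == rows[i - 1]:
            del rows[i]  # delete the latter of the two identical rows
    return rows


print(sort_and_dedupe(["the red car", "a dog", "the red car", "a dog", "a cat"]))
```

Traversing from the end is a natural fit for in-place deletion, which may be why the text specifies that direction.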

Compared with the prior art, in the present invention the sentence segmentation module splits the English text into several sentences according to sentence-splitting rules. The constituent analysis module then breaks each sentence into individual words, looks them up in the dictionary to determine the part of speech of each word, performs phrase recognition once the parts of speech are determined, and performs phrase merging after phrase recognition. After merging, the primary constituents of all sentences and the internal constituents of the primary constituents are obtained according to grammar rules, and the noun phrases among all constituents are marked and recognized. The corpus export module exports all marked and recognized noun phrases to form a corpus list. The invention is based on English constituent analysis: the analysis yields the primary constituents, and each primary constituent is checked to see whether it is a noun phrase; if so, it is a corpus entry. Each primary constituent is then analyzed internally to obtain all of its internal constituents, and each internal constituent is likewise checked to see whether it is a noun phrase; if so, it is a corpus entry. Exporting all noun phrases found by the analysis yields the required corpus. The English constituent analysis of the invention is based on a dictionary and a rule base, and the maturity and completeness of the rules ensure high constituent analysis accuracy, which reduces translation time and improves translation efficiency. The invention can rapidly extract all corpus entries from English text with high constituent analysis accuracy and therefore higher corpus accuracy, and can be widely applied in natural language research and in the development of translation aids.

Further, the sentence segmentation module defines sentence terminators according to punctuation rules and splits the material to be translated into sentences, treating each terminator it encounters as the end of a sentence. For an English period, it must also determine whether the period belongs to an abbreviation: the dictionary contains abbreviations, the module searches the dictionary for the word preceding the period together with the period, and if a match is found, the period is an abbreviation mark and is not treated as a sentence terminator. This further improves the accuracy of sentence splitting and improves translation efficiency.

Further, the constituent analysis module retrieves the part of speech of each word from the dictionary. If the part of speech is unique, the part of speech of the word is determined. If the word has several parts of speech, part-of-speech recognition is performed using the other words of the sentence, so that the word is finally assigned a single part of speech in that sentence. For example, in the pattern article + adjective + word to be determined, where the word to be determined can be either a noun or a verb, the word is determined to be a noun. The part-of-speech recognition rules are written by professional linguists, and priorities are assigned to the rules; the program calls the rule base to match the optimal rule, and a word that matches no rule is assigned its default part of speech.
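The dictionary lookup, prioritized rule matching, and default fallback can be sketched in Python as follows. Everything named here (`LEXICON`, `DEFAULT_TAG`, `RULES`, `tag_sentence`, the tag labels, and the second rule) is a hypothetical illustration; only the article + adjective rule comes from the text. The sketch also assumes that rules inspect only the tags of preceding words and that unknown words are treated as nouns.

```python
# Hypothetical dictionary: each word maps to its set of possible parts of speech.
LEXICON = {
    "the": {"ART"},
    "quick": {"ADJ"},
    "they": {"PRON"},
    "run": {"NOUN", "VERB"},
    "record": {"NOUN", "VERB"},
}
# Hypothetical default part of speech for ambiguous words that match no rule.
DEFAULT_TAG = {"run": "VERB", "record": "NOUN"}
# (priority, tags of the preceding words, tag to assign); higher priority wins.
RULES = [
    (2, ("ART", "ADJ"), "NOUN"),  # article + adjective + ambiguous word -> noun
    (1, ("PRON",), "VERB"),       # assumed extra rule, for illustration only
]


def tag_sentence(words):
    """Assign one part of speech per word using dictionary lookup plus rules."""
    tags = []
    for word in words:
        key = word.lower()
        candidates = LEXICON.get(key, {"NOUN"})  # assumption: unknown words are nouns
        if len(candidates) == 1:
            tags.append(next(iter(candidates)))  # unique part of speech
            continue
        chosen = None
        for _, context, tag in sorted(RULES, key=lambda rule: -rule[0]):
            n = len(context)
            if tag in candidates and len(tags) >= n and tuple(tags[-n:]) == context:
                chosen = tag  # best-priority rule that matches the context
                break
        tags.append(chosen or DEFAULT_TAG.get(key, sorted(candidates)[0]))
    return tags


print(tag_sentence(["the", "quick", "run"]))
print(tag_sentence(["they", "run"]))
```

Keeping the rules as data rather than code mirrors the idea that linguists, not programmers, maintain the rule base.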

Further, the corpus export module sorts the corpus list and traverses it from back to front; if two adjacent rows contain identical characters, they are duplicates and the latter row is deleted. Sorting and deduplication simplify the subsequent translation work, avoid repeated work, and improve translation efficiency.

Detailed description of the invention

The present invention is further explained below with reference to specific embodiments.

An automatic English corpus extractor based on constituent analysis comprises: a sentence segmentation module, configured to split English text into several sentences; a constituent analysis module, configured to perform constituent analysis on each sentence, obtain the primary constituents of all sentences and the internal constituents of the primary constituents, and mark and recognize the noun phrases among all constituents; and a corpus export module, configured to export all marked and recognized noun phrases to form a corpus list.

1) Obtain the English text with the general file-reading module; for a Word document, call the COM interface of Word to obtain the text, and for an Excel document, call the COM interface of Excel to obtain the text. Use the sentence segmentation module to split the English text into several sentences according to sentence-splitting rules: the module defines sentence terminators according to punctuation rules, treats each terminator it encounters as the end of a sentence, and splits the text accordingly. For an English period, the module must also determine whether the period belongs to an abbreviation: the dictionary contains abbreviations, the module searches the dictionary for the word preceding the period together with the period, and if a match is found, the period is an abbreviation mark and is not treated as a sentence terminator;

2) Use the constituent analysis module to first break each sentence into individual words and look them up in the dictionary to determine the part of speech of each word in the sentence. If a word has only one part of speech, its part of speech is determined; if a word has several parts of speech, part-of-speech recognition is performed using the other words of the sentence, so that the word is finally assigned a single part of speech in that sentence. Once the part of speech of every word is determined, perform phrase recognition; after phrase recognition, perform phrase merging; after phrase merging, obtain the primary constituents of all sentences and the internal constituents of the primary constituents according to grammar rules, and mark and recognize the noun phrases among all constituents;

3) Use the corpus export module to export all marked and recognized noun phrases to form a corpus list. The corpus export module sorts the corpus list and traverses it from back to front; if two adjacent rows contain identical characters, they are duplicates and the latter row is deleted.

The specific English constituent analysis method of the invention is as follows:

1) split the English text into individual sentences according to sentence-splitting rules;

2) break each sentence into individual words;

3) look up the dictionary and build the full attribute configuration for each word;

4) according to the rule base, identify the predicate part of the sentence; then, again according to the rule base, evaluate all words and word combinations to determine what type of phrase each word or word combination forms, and determine the constituent role of each phrase (subject, object, predicative, adverbial, and so on) from the rule base, the position of the phrase in the sentence, and its relations to the related constituents;

5) according to the rule base, determine the internal constituents within each constituent already identified, repeating the cycle until the smallest linguistic unit is reached;

6) the determination of all constituents is complete.

The English constituent analysis of the invention first determines all primary constituents, that is, the largest constituents of the sentence, including the subject part, predicate part, object part, adverbial part, appositive part, predicative part, and so on. It then determines the internal constituents within each constituent, and so on, down to the smallest linguistic unit. Each primary constituent or internal constituent may itself be a noun phrase, and the internal constituents it contains may also be noun phrases; exporting these noun phrases completes the corpus extraction for the text.
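The descent from primary constituents to internal constituents can be sketched as a recursive tree walk. In the Python sketch below, the `Constituent` class, its field names, and the hand-built example tree are all assumptions made for illustration; the text does not specify how the analysis result is stored, and in practice the tree would come from the rule-based analysis rather than be written by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Constituent:
    role: str          # e.g. "subject", "predicate", "object", "modifier"
    phrase_type: str   # e.g. "NP" (noun phrase), "VP", "PP"
    text: str
    children: list = field(default_factory=list)  # internal constituents


def collect_noun_phrases(constituents):
    """Return every noun phrase among the constituents and their descendants."""
    found = []
    for item in constituents:
        if item.phrase_type == "NP":
            found.append(item.text)
        # Recurse into internal constituents until no children remain.
        found.extend(collect_noun_phrases(item.children))
    return found


# Hand-built analysis of "The manager of the factory approved the new plan."
primary = [
    Constituent("subject", "NP", "The manager of the factory", [
        Constituent("head", "NP", "The manager"),
        Constituent("modifier", "PP", "of the factory", [
            Constituent("object", "NP", "the factory"),
        ]),
    ]),
    Constituent("predicate", "VP", "approved"),
    Constituent("object", "NP", "the new plan"),
]

print(collect_noun_phrases(primary))
```

In this example both the full subject and the noun phrases nested inside it are collected, which matches the statement that a constituent and its internal constituents may each be a corpus entry.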

The modules of the invention include:

1. English sentence segmentation module:

According to punctuation marks and rules, the English text is split into individual sentences. Sentence terminators are defined, such as the English period, exclamation mark, and question mark, and each terminator encountered is treated as the end of a sentence. For an English period, the module must also determine whether it belongs to an abbreviation: the dictionary contains abbreviations, the module searches the dictionary for the word preceding the period together with the period, and if a match is found, the period is an abbreviation mark and is not treated as a sentence terminator;

2. Constituent analysis module:

Constituent analysis and internal constituent analysis are performed on each sentence to obtain all primary constituents and the internal constituents of all primary constituents, and the noun phrases among all constituents are marked:

1) determining the part of speech of each word in the sentence: the part of speech of each word is retrieved from the dictionary. If the part of speech is unique, the part of speech of the word is determined; if the word has several parts of speech, part-of-speech recognition is performed using the other words of the sentence, so that the word is finally assigned a single part of speech in that sentence. For example, in the pattern article + adjective + word to be determined, where the word to be determined can be either a noun or a verb, the word can be determined to be a noun. The part-of-speech recognition rules are written by professional linguists, and priorities are assigned to the rules; the program calls the rule base to match the optimal rule, and a word that matches no rule is assigned its default part of speech;

2) phrase recognition: once the parts of speech are determined, phrases are recognized according to the phrase rule base; for example, article + adjective + noun forms a noun phrase. The words of the sentence are matched against the phrase rule base, and multiple words are recognized as one phrase;

3) phrase merging: on the basis of phrase recognition, phrases are merged according to the phrase-merging rule base; for example, a noun phrase and the prepositional phrase that follows and modifies it are merged into one noun phrase. After merging, the primary constituents of the sentence (subject, predicate, object, attributive, adverbial, complement, predicative, and so on) are obtained according to grammar rules; for example, a sentence of the form noun phrase + predicate phrase + noun phrase can be recognized as subject + predicate + object;
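Phrase recognition, phrase merging, and the final role assignment can be sketched with three small Python functions operating on already tagged words. The function names, the tag labels, and the tiny rule set hard-coded here are assumptions; the actual phrase rule base and phrase-merging rule base are not given in the text.

```python
def chunk(tagged):
    """Phrase recognition: optional article + adjectives + noun forms an NP."""
    phrases, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "ART":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        if j < len(tagged) and tagged[j][1] == "NOUN":
            phrases.append(("NP", " ".join(word for word, _ in tagged[i:j + 1])))
            i = j + 1
        else:
            word, tag = tagged[i]
            # A lone verb is treated as a predicate phrase; "P" marks a preposition.
            phrases.append(({"VERB": "VP", "PREP": "P"}.get(tag, tag), word))
            i += 1
    return phrases


def merge(phrases):
    """Phrase merging: preposition + NP -> PP, then NP + following PP -> NP."""
    merged = []
    for kind, text in phrases:
        if kind == "NP" and merged and merged[-1][0] == "P":
            _, prep = merged.pop()
            kind, text = "PP", f"{prep} {text}"
        if kind == "PP" and merged and merged[-1][0] == "NP":
            _, head = merged.pop()
            kind, text = "NP", f"{head} {text}"
        merged.append((kind, text))
    return merged


def assign_roles(phrases):
    """Grammar rule: NP + VP + NP is read as subject + predicate + object."""
    if [kind for kind, _ in phrases] == ["NP", "VP", "NP"]:
        return list(zip(["subject", "predicate", "object"], (t for _, t in phrases)))
    return []


tagged = [("the", "ART"), ("old", "ADJ"), ("manager", "NOUN"), ("of", "PREP"),
          ("the", "ART"), ("factory", "NOUN"), ("approved", "VERB"),
          ("the", "ART"), ("plan", "NOUN")]
print(assign_roles(merge(chunk(tagged))))
```

This sketch keeps only the merged phrases; a fuller implementation would also retain the pre-merge phrases as internal constituents so that nested noun phrases such as "the factory" remain available for export.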

3. Corpus export module: exports all recognized noun phrases to form a corpus list.

The specific steps for using the invention include:

1) run the tool;

2) open the file from which corpus entries are to be extracted; it may be in Word, Excel, plain text, or a similar format. Plain text is read directly with the general file-reading module, a Word document is read by calling the COM interface of Word to obtain the text in the document, and an Excel document is read by calling the COM interface of Excel to obtain the text in the spreadsheet;

3) click "Corpus extraction" to call the English sentence segmentation module and the constituent analysis module and obtain the corpus entries; the extracted entries are saved as a list, one entry per row;

4) sort and deduplicate the corpus: sort the corpus list with the quicksort algorithm; once the list is ordered, traverse it from back to front, and if two adjacent rows are the same, that is, their characters are identical, they are duplicates and the latter row is deleted;

5) export the corpus: export a corpus file in plain text format; for a Word or Excel document, call the corresponding COM interface to export.

The invention can rapidly extract all corpus entries from English text with high constituent analysis accuracy and high corpus accuracy, and can be widely applied in natural language research and in the development of translation aids.

Claims

1. An automatic English corpus extractor based on constituent analysis, characterized by comprising:

a sentence segmentation module, configured to split English text into several sentences;

2. An automatic English corpus extraction method based on constituent analysis, characterized by comprising the following steps:

3. The automatic English corpus extraction method based on constituent analysis according to claim 2, characterized in that, in step 1), the sentence segmentation module defines sentence terminators according to punctuation rules, treats each terminator it encounters as the end of a sentence, and splits the English text into several sentences.

4. The automatic English corpus extraction method based on constituent analysis according to claim 3, characterized in that the sentence segmentation module determines, for an English period, whether the period belongs to an abbreviation: the dictionary contains abbreviations, the word preceding the period together with the period is searched for in the dictionary, and if a match is found, the period is an abbreviation mark and is not treated as a sentence terminator.

5. The automatic English corpus extraction method based on constituent analysis according to claim 2, characterized in that, in step 1), the English text is obtained with a general file-reading module; for a Word document, the COM interface of Word is called to obtain the text, and for an Excel document, the COM interface of Excel is called to obtain the text.

6. The automatic English corpus extraction method based on constituent analysis according to claim 2, characterized in that, in step 2), the constituent analysis module retrieves the part of speech of each word from the dictionary; if a word has only one part of speech, its part of speech is determined; if a word has several parts of speech, part-of-speech recognition is performed using the other words of the sentence, so that the word is finally assigned a single part of speech in that sentence.

7. The automatic English corpus extraction method based on constituent analysis according to claim 2, characterized in that, in step 3), the corpus export module sorts the corpus list and traverses it from back to front; if two adjacent rows contain identical characters, they are duplicates and the latter row is deleted.