CN105912522A - Automatic extraction method and extractor of English corpora based on constituent analyses - Google Patents

Automatic extraction method and extractor of English corpora based on constituent analyses

Info

Publication number
CN105912522A
Authority
CN
China
Prior art keywords
word
english
language material
sentence
component analysis
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610202321.9A
Other languages
Chinese (zh)
Inventor
白晓文
陈春纬
刘庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changan University
Original Assignee
Changan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changan University filed Critical Changan University
Priority to CN201610202321.9A priority Critical patent/CN105912522A/en
Publication of CN105912522A publication Critical patent/CN105912522A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses an automatic extraction method and extractor of English corpora based on constituent analyses with the view of rapidly extracting all English corpora and increasing corpora-extracting accuracy. The adopted technical scheme is characterized in that the automatic extractor of English corpora based on constituent analyses comprises a segmentation module used for segmenting English texts into multiple sentences, a constituent analysis module used for analyzing compositions of all sentences in order to obtain primary constituents and internal constituents of primary constituents of all the sentences and marking and recognizing noun phrases in all constituents, and a corpus export module used for exporting all marked and recognized noun phrases in order to form corpus lists.

Description

Automatic extraction method and extractor of English corpora based on constituent analysis
Technical field
The invention belongs to the fields of computational linguistics and translation technology, and relates to an automatic extraction method and extractor of English corpora based on constituent analysis.
Background
In natural language processing, tools and techniques for language retrieval have advanced rapidly, and chunk identification has moved from manual identification to machine identification. Chunk retrieval began with extracting continuous, fixed word strings from corpora and, after several years of development, has gradually reached a more advanced stage: extracting discontinuous, variable chunks. From the perspective of corpus research, existing work has surveyed and reviewed the techniques and tools for identifying and retrieving English chunks in two respects, continuous chunks and discontinuous chunks.
Using corpus retrieval methods, the frequency and distribution of academic vocabulary in an information engineering English corpus have been counted and analysed. The research shows that academic vocabulary covers 10.39% of the information engineering English corpus, which verifies the applicability of academic vocabulary to the information engineering discipline. On this basis, the commonly used methods for retrieving high-frequency academic words from corpora were compared, and an optimisation strategy for retrieving high-frequency academic words in specialised English was proposed to address the shortcomings of existing methods. From 570 academic word families, 248 high-frequency academic word families of information engineering English were extracted, providing an objective basis for teaching specialised academic English vocabulary and making such teaching significantly better targeted.
Multi-word expressions (MWEs) are used not only to improve the quality of current machine translation systems but also in other areas of natural language processing such as cross-language retrieval and data mining. To this end, a method combining semantic templates with statistical tools has been proposed to automatically extract native English MWEs from triple comparable corpora. Similarity between words is computed from vocabulary and position to expand MWE coverage. The GIZA++ alignment algorithm is used to extract the corresponding Chinese MWE translations, translation probabilities are computed statistically, and the best English-Chinese MWE translation pairs are selected by probability. Experimental results show that the method effectively improves the accuracy of MWE extraction and alignment.
Componential analysis is a systematic analysis method that combines the macro and micro levels and is suited to assessing translation quality where multiple assessment factors are involved. Based on componential analysis, translation quality assessment is divided into four components: "target language expression", "text function", "textual content (non-specialised)" and "textual content (specialised) and terminology". A weight, grade and score are set for each component according to the text type, so that qualitative and quantitative assessment can be combined, making translation quality assessment more objective and more practicable. From the perspective of semantic componential analysis, the correspondence between English and Chinese words has also been explored and componential analysis theory has been applied to translation practice, so that a translation conveys word meaning accurately while better satisfying the three translation principles of faithfulness, expressiveness and closeness (信达切). However, existing research on English constituent analysis is mostly confined to human translation and teaching and is seldom combined with computer technology; corpus research focuses on the structure of corpora themselves and their application prospects, and rarely addresses the construction of specific, applicable corpora; and English constituent analysis has not been used for corpus construction.
Summary of the invention
To solve the problems of the prior art, the present invention proposes an automatic extraction method and extractor of English corpora based on constituent analysis that can rapidly extract all corpus entries from English text with high extraction accuracy.
To achieve the above object, the invention adopts the following technical solution:
An automatic extractor of English corpora based on constituent analysis, comprising:
a segmentation module for splitting English text into a number of sentences;
a constituent analysis module for performing constituent analysis on each sentence to obtain the primary constituents of every sentence and the internal constituents of each primary constituent, and for marking and identifying the noun phrases among all constituents;
and a corpus export module for exporting all marked and identified noun phrases to form a corpus list.
An automatic extraction method of English corpora based on constituent analysis, comprising the following steps:
1) opening an English text and using the segmentation module to split it into a number of sentences according to sentence-splitting rules;
2) using the constituent analysis module to first break each sentence into words and look each word up in the lexicon to determine its part of speech in the sentence; then, once the parts of speech are determined, to identify phrases; next, to merge the identified phrases; and finally, once merging is complete, to obtain the primary constituents of every sentence and the internal constituents of each primary constituent according to grammar rules, and to mark and identify the noun phrases among all constituents;
3) using the corpus export module to export all marked and identified noun phrases to form a corpus list.
In step 1), the segmentation module defines sentence terminators according to punctuation rules, treats each terminator it encounters as the end of a sentence, and thereby splits the English text into sentences.
For an English period, the segmentation module must decide whether it belongs to an abbreviation. The lexicon contains abbreviations; the module searches the lexicon for the word preceding the period together with the period, and if a match is found the period is an abbreviation period and is not treated as a sentence terminator.
In step 1), a general-purpose file-reading module obtains the English text; for a Word document the text is obtained through the COM interface of Word, and for an Excel document through the COM interface of Excel.
In step 2), the constituent analysis module obtains the part of speech of each word from the lexicon. If a word has only one part of speech, that part of speech is used; if it has several, part-of-speech identification is performed using the other words of the sentence to determine the word's unique part of speech in that sentence.
In step 3), the corpus export module sorts the corpus list and traverses it from back to front; if two adjacent rows contain identical characters they are duplicates, and the later row is deleted.
Compared with the prior art, in the present invention the segmentation module splits the English text into sentences according to sentence-splitting rules. The constituent analysis module then breaks each sentence into words, looks each word up in the lexicon to determine its part of speech in the sentence, identifies phrases once the parts of speech are determined, merges the identified phrases, and finally obtains the primary constituents of every sentence and the internal constituents of each primary constituent according to grammar rules, marking and identifying the noun phrases among all constituents. The corpus export module exports all marked and identified noun phrases to form a corpus list. The invention is built on English constituent analysis: the analysis yields the primary constituents, and each primary constituent is checked to see whether it is a noun phrase; if it is, it is a corpus entry. Each primary constituent is then analysed internally to obtain all of its internal constituents, and each internal constituent is likewise checked; if it is a noun phrase, it too is a corpus entry. Exporting all noun phrases found in this way yields the required corpus. The English constituent analysis of the invention is based on a lexicon and a rule base, and the maturity and completeness of the rules ensure high constituent analysis accuracy, which reduces translation time and improves translation efficiency. The invention can rapidly extract all corpus entries from English text with high constituent analysis accuracy, and hence higher corpus accuracy, and can be widely used in natural language research and in the development of translation aids.
Further, the segmentation module defines sentence terminators according to punctuation rules and splits the material to be translated into sentences, treating each terminator it encounters as the end of a sentence. For an English period, it must decide whether the period belongs to an abbreviation: the lexicon contains abbreviations, and the module searches it for the word preceding the period together with the period. If a match is found, the period is an abbreviation period and is not treated as a sentence terminator. This further improves the accuracy of sentence splitting and improves translation efficiency.
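The abbreviation check described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the `ABBREVIATIONS` set stands in for the abbreviation entries of the lexicon, and the name `split_sentences` is hypothetical. It assumes words are separated by whitespace and that terminators are attached to the preceding word.

```python
# Hypothetical stand-in for the abbreviation entries of the lexicon.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc.", "fig."}

# Sentence terminators named in the description: period, exclamation mark, question mark.
TERMINATORS = ".!?"


def split_sentences(text):
    """Split text into sentences, ignoring periods that belong to abbreviations."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        if token[-1] not in TERMINATORS:
            continue
        # A period whose word (with the period) is in the lexicon is an
        # abbreviation period, so it does not end the sentence.
        if token.endswith(".") and token.lower() in ABBREVIATIONS:
            continue
        sentences.append(" ".join(current))
        current = []
    if current:  # trailing text without a terminator
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    for s in split_sentences("Dr. Smith arrived. He met Mr. Lee! Was it late?"):
        print(s)
```

One limitation of this sketch is that an abbreviation that genuinely ends a sentence is never treated as a terminator; the source does not say how that case is handled.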
Further, the constituent analysis module obtains the part of speech of each word from the lexicon. If the part of speech is unique, it is already determined; if the word has several parts of speech, part-of-speech identification is performed using the other words of the sentence to determine the word's unique part of speech in that sentence. For example, in the pattern article + adjective + undetermined word, where the undetermined word can be a noun or a verb, the word is determined to be a noun. The part-of-speech identification rules are written by professional linguists and assigned priorities; the program queries the rule base for the best-matching rule, and a word that matches no rule is given its default part of speech.
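A prioritised rule lookup of this kind might look like the sketch below. The lexicon contents, the tag names, the rule format and the function name `tag_sentence` are all assumptions made for illustration; the source does not disclose its rule base. For simplicity the rules here look only at the tags of the preceding words, whereas the description allows any other words of the sentence to be used.

```python
# Hypothetical lexicon: word -> candidate parts of speech, default listed first.
LEXICON = {
    "the": ["ART"], "a": ["ART"],
    "old": ["ADJ"], "big": ["ADJ"],
    "they": ["PRON"],
    "record": ["NOUN", "VERB"],
    "run": ["VERB", "NOUN"],
}

# Hypothetical rules: (priority, tags of the immediately preceding words, tag to assign).
RULES = [
    (10, ("ART", "ADJ"), "NOUN"),  # article + adjective + ? -> noun
    (5, ("PRON",), "VERB"),        # pronoun + ? -> verb
]


def tag_sentence(words):
    """Assign one part of speech per word, resolving ambiguity with prioritised rules."""
    tags = []
    for word in words:
        # Unknown words are assumed to be nouns (an assumption of this sketch).
        candidates = LEXICON.get(word.lower(), ["NOUN"])
        chosen = candidates[0]  # default part of speech
        if len(candidates) > 1:
            # Try rules from highest to lowest priority; the first match wins.
            for _, context, result in sorted(RULES, key=lambda r: r[0], reverse=True):
                n = len(context)
                if (result in candidates and len(tags) >= n
                        and tuple(tags[-n:]) == context):
                    chosen = result
                    break
        tags.append(chosen)
    return tags


if __name__ == "__main__":
    print(tag_sentence(["the", "old", "record"]))
    print(tag_sentence(["they", "record"]))
```

Listing the default part of speech first in each lexicon entry keeps the "no rule matched" case trivial: the word simply keeps its first candidate.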
Further, the corpus export module sorts the corpus list and traverses it from back to front; if two adjacent rows contain identical characters they are duplicates, and the later row is deleted. Sorting and deduplication facilitate subsequent translation work, avoid repeated work and improve translation efficiency.
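The sort-then-deduplicate pass can be sketched as below. The function name `sort_and_dedupe` is hypothetical, and the sketch uses Python's built-in `sorted` rather than the quicksort named later in the description; any sort that places identical rows next to each other gives the same result.

```python
def sort_and_dedupe(entries):
    """Sort corpus entries, then delete the later row of each adjacent identical pair."""
    rows = sorted(entries)
    # Walk from back to front so that deleting a row never shifts
    # the indexes of rows that have not been visited yet.
    for i in range(len(rows) - 1, 0, -1):
        if rows[i] == rows[i - 1]:
            del rows[i]
    return rows


if __name__ == "__main__":
    corpus = ["noun phrase", "corpus list", "noun phrase", "abstract", "corpus list"]
    print(sort_and_dedupe(corpus))
```

Traversing backwards is what makes in-place deletion safe here; a forward pass would have to adjust its index after every deletion.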
Detailed description of the invention
The invention is further explained below with reference to specific embodiments.
An automatic extractor of English corpora based on constituent analysis comprises: a segmentation module for splitting English text into a number of sentences; a constituent analysis module for performing constituent analysis on each sentence to obtain the primary constituents of every sentence and the internal constituents of each primary constituent, and for marking and identifying the noun phrases among all constituents; and a corpus export module for exporting all marked and identified noun phrases to form a corpus list.
An automatic extraction method of English corpora based on constituent analysis comprises the following steps:
1) A general-purpose file-reading module obtains the English text; for a Word document the text is obtained through the COM interface of Word, and for an Excel document through the COM interface of Excel. The segmentation module then splits the English text into a number of sentences according to sentence-splitting rules: it defines sentence terminators according to punctuation rules and treats each terminator it encounters as the end of a sentence. For an English period, the segmentation module must decide whether it belongs to an abbreviation; the lexicon contains abbreviations, and the module searches it for the word preceding the period together with the period. If a match is found, the period is an abbreviation period and is not treated as a sentence terminator;
2) The constituent analysis module first breaks each sentence into words and looks each word up in the lexicon to determine its part of speech in the sentence. If a word has only one part of speech, that part of speech is used; if it has several, part-of-speech identification is performed using the other words of the sentence to determine the word's unique part of speech in that sentence. Once the parts of speech are determined, phrases are identified; the identified phrases are then merged; and finally, once merging is complete, the primary constituents of every sentence and the internal constituents of each primary constituent are obtained according to grammar rules, and the noun phrases among all constituents are marked and identified;
3) The corpus export module exports all marked and identified noun phrases to form a corpus list. The module sorts the corpus list and traverses it from back to front; if two adjacent rows contain identical characters they are duplicates, and the later row is deleted.
The constituent analysis of the invention proceeds as follows:
1) split the English text into individual sentences according to the sentence-splitting rules;
2) break each sentence into individual words;
3) look each word up in the lexicon and build the full set of attributes for it;
4) according to the rule base, identify the predicate of the sentence; then, again according to the rule base, examine every word and word combination to decide what kind of phrase it is, and determine the constituent role of each phrase (subject, object, predicative, adverbial and so on) from the rule base, the position of the phrase in the sentence and its relation to the related constituents;
5) according to the rule base, identify the internal constituents of each constituent already determined, repeating the process until the smallest linguistic unit is reached;
6) the identification of all constituents is then complete.
The constituent analysis of the invention first identifies all primary constituents, that is, the largest constituents of the sentence, including the subject, predicate, object, adverbial, appositive and predicative parts. It then identifies the internal constituents of each of these, and so on down to the smallest linguistic unit. Any primary constituent or internal constituent may itself be a noun phrase, and the internal constituents it contains may also be noun phrases; exporting these noun phrases completes the extraction of the corpus from the text.
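The recursive descent from primary constituents to internal constituents can be illustrated with a small tree walk. The `Constituent` class and `collect_noun_phrases` function are hypothetical names, and the example tree is built by hand; the sketch only shows how noun phrases at every depth end up in the corpus list once the analysis has produced the tree.

```python
from dataclasses import dataclass, field


@dataclass
class Constituent:
    """One constituent of a sentence (hypothetical structure for illustration)."""
    text: str
    is_noun_phrase: bool
    role: str = ""  # e.g. "subject", "predicate", "object"; empty for internal constituents
    children: list = field(default_factory=list)  # internal constituents


def collect_noun_phrases(constituents):
    """Return every noun phrase among the constituents and, recursively, their children."""
    found = []
    for c in constituents:
        if c.is_noun_phrase:
            found.append(c.text)
        found.extend(collect_noun_phrases(c.children))
    return found


if __name__ == "__main__":
    # Hand-built analysis of:
    # "The segmentation module of the extractor splits English texts."
    primary = [
        Constituent("The segmentation module of the extractor", True, "subject", [
            Constituent("The segmentation module", True),
            Constituent("of the extractor", False, children=[
                Constituent("the extractor", True),
            ]),
        ]),
        Constituent("splits", False, "predicate"),
        Constituent("English texts", True, "object"),
    ]
    for phrase in collect_noun_phrases(primary):
        print(phrase)
```

Because the walk visits a constituent before its children, a long noun phrase is listed before the shorter noun phrases nested inside it.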
The invention comprises the following modules:
1. English segmentation module:
Splits the English text into individual sentences according to punctuation and rules. Sentence terminators are defined, such as the English period, exclamation mark and question mark, and each terminator encountered is treated as the end of a sentence. For an English period, the module must also decide whether it belongs to an abbreviation: the lexicon contains abbreviations, and the module searches it for the word preceding the period together with the period; if a match is found, the period is an abbreviation period and is not treated as a sentence terminator;
2. Constituent analysis module:
Performs constituent analysis and internal constituent analysis on each sentence to obtain all primary constituents and the internal constituents of every primary constituent, and marks the noun phrases among all constituents:
1) Part-of-speech determination: the part of speech of each word in the sentence is obtained from the lexicon. If the part of speech is unique, it is already determined; if the word has several parts of speech, part-of-speech identification is performed using the other words of the sentence to determine the word's unique part of speech in that sentence. For example, in the pattern article + adjective + undetermined word, where the undetermined word can be a noun or a verb, the word can be determined to be a noun. The part-of-speech identification rules are written by professional linguists and assigned priorities; the program queries the rule base for the best-matching rule, and a word that matches no rule is given its default part of speech;
2) Phrase identification: once the parts of speech are determined, phrases are identified according to the phrase rule base; for example, article + adjective + noun forms a noun phrase. The words of the sentence are matched against the phrase rule base, and groups of words are identified as phrases;
3) Phrase merging: after phrase identification, phrases are merged according to the phrase-merging rule base; for example, a noun phrase followed by a prepositional phrase that modifies it is merged into a single noun phrase. Once merging is complete, the primary constituents of the sentence, such as subject, predicate, object, attributive, adverbial, complement and predicative, are obtained according to grammar rules; for example, a sentence of the form noun phrase + predicate phrase + noun phrase is identified as subject + predicate + object;
3. Corpus export module: exports all identified noun phrases to form a corpus list.
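Phrase identification and phrase merging in the constituent analysis module can be sketched with two toy rules taken from the examples above: article + adjective + noun forms a noun phrase, and a noun phrase followed by a modifying prepositional phrase is merged into one noun phrase. The tag names and function names are assumptions, the real rule bases are not disclosed, and the prepositional phrase is simplified here to a preposition followed by a noun phrase.

```python
def chunk_noun_phrases(tagged):
    """Group each run of optional ART, any ADJ and one or more NOUN into an NP chunk."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "ART":
            j += 1
        while j < len(tagged) and tagged[j][1] == "ADJ":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1] == "NOUN":
            k += 1
        if k > j:  # the run contains at least one noun, so it is a noun phrase
            chunks.append(("NP", [word for word, _ in tagged[i:k]]))
            i = k
        else:  # no noun phrase starts here; keep the word as a one-word chunk
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks


def merge_phrases(chunks):
    """Merge NP + PREP + NP into a single NP, scanning left to right."""
    merged = []
    for label, words in chunks:
        if (label == "NP" and len(merged) >= 2
                and merged[-1][0] == "PREP" and merged[-2][0] == "NP"):
            _, prep = merged.pop()
            _, head = merged.pop()
            merged.append(("NP", head + prep + words))
        else:
            merged.append((label, words))
    return merged


if __name__ == "__main__":
    tagged = [("the", "ART"), ("automatic", "ADJ"), ("extractor", "NOUN"),
              ("of", "PREP"), ("English", "ADJ"), ("corpora", "NOUN"),
              ("marks", "VERB"), ("noun", "NOUN"), ("phrases", "NOUN")]
    for label, words in merge_phrases(chunk_noun_phrases(tagged)):
        print(label, " ".join(words))
```

The merged result has the shape noun phrase + verb + noun phrase, which corresponds to the subject + predicate + object pattern given in the description. Note that this sketch discards the smaller noun phrases once they are merged, whereas the described method also keeps them as internal constituents.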
The invention is used as follows:
1) run the tool;
2) open the file from which the corpus is to be extracted, which may be in Word, Excel, plain-text or a similar format; for a plain-text file the text is obtained directly with the general-purpose file-reading module, for a Word document the text is obtained through the COM interface of Word, and for an Excel document the text in the spreadsheet is obtained through the COM interface of Excel;
3) click "Extract corpus", which calls the English segmentation module and the constituent analysis module to obtain the corpus entries; the extracted entries are saved as a list, one entry per row;
4) sort the corpus and remove duplicates: the corpus list is sorted with the quicksort algorithm; once the list is in order it is traversed from back to front, and if two adjacent rows are the same, that is, their characters are identical, they are duplicates and the later row is deleted;
5) export the corpus as a plain-text corpus file; for a Word or Excel document, the corresponding COM interface is called for the export.
The invention can rapidly extract all corpus entries from English text, with high constituent analysis accuracy and high corpus accuracy, and can be widely used in natural language research and in the development of translation aids.

Claims (7)

1. An automatic extractor of English corpora based on constituent analysis, characterized by comprising:
a segmentation module for splitting English text into a number of sentences;
a constituent analysis module for performing constituent analysis on each sentence to obtain the primary constituents of every sentence and the internal constituents of each primary constituent, and for marking and identifying the noun phrases among all constituents;
and a corpus export module for exporting all marked and identified noun phrases to form a corpus list.
2. An automatic extraction method of English corpora based on constituent analysis, characterized by comprising the following steps:
1) opening an English text and using a segmentation module to split it into a number of sentences according to sentence-splitting rules;
2) using a constituent analysis module to first break each sentence into words and look each word up in the lexicon to determine its part of speech in the sentence; then, once the parts of speech are determined, to identify phrases; next, to merge the identified phrases; and finally, once merging is complete, to obtain the primary constituents of every sentence and the internal constituents of each primary constituent according to grammar rules, and to mark and identify the noun phrases among all constituents;
3) using a corpus export module to export all marked and identified noun phrases to form a corpus list.
3. The automatic extraction method of English corpora based on constituent analysis according to claim 2, characterized in that, in step 1), the segmentation module defines sentence terminators according to punctuation rules, treats each terminator it encounters as the end of a sentence, and splits the English text into a number of sentences.
4. The automatic extraction method of English corpora based on constituent analysis according to claim 3, characterized in that the segmentation module decides whether an English period belongs to an abbreviation: the lexicon contains abbreviations, the word preceding the period is searched for in the lexicon together with the period, and if a match is found the period is an abbreviation period and is not treated as a sentence terminator.
5. The automatic extraction method of English corpora based on constituent analysis according to claim 2, characterized in that, in step 1), a general-purpose file-reading module obtains the English text, the text of a Word document being obtained through the COM interface of Word and the text of an Excel document through the COM interface of Excel.
6. The automatic extraction method of English corpora based on constituent analysis according to claim 2, characterized in that, in step 2), the constituent analysis module obtains the part of speech of each word from the lexicon; if a word has only one part of speech, that part of speech is used; if a word has several parts of speech, part-of-speech identification is performed using the other words of the sentence to determine the word's unique part of speech in that sentence.
7. The automatic extraction method of English corpora based on constituent analysis according to claim 2, characterized in that, in step 3), the corpus export module sorts the corpus list and traverses it from back to front; if two adjacent rows contain identical characters they are duplicates, and the later row is deleted.
CN201610202321.9A 2016-03-31 2016-03-31 Automatic extraction method and extractor of English corpora based on constituent analyses Pending CN105912522A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610202321.9A CN105912522A (en) 2016-03-31 2016-03-31 Automatic extraction method and extractor of English corpora based on constituent analyses

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610202321.9A CN105912522A (en) 2016-03-31 2016-03-31 Automatic extraction method and extractor of English corpora based on constituent analyses

Publications (1)

Publication Number Publication Date
CN105912522A true CN105912522A (en) 2016-08-31

Family

ID=56745199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610202321.9A Pending CN105912522A (en) 2016-03-31 2016-03-31 Automatic extraction method and extractor of English corpora based on constituent analyses

Country Status (1)

Country Link
CN (1) CN105912522A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100223047A1 (en) * 2009-03-02 2010-09-02 Sdl Plc Computer-assisted natural language translation
CN104035968A (en) * 2014-05-20 2014-09-10 微梦创科网络科技(中国)有限公司 Method and device for constructing training corpus set based on social network
CN104298665A (en) * 2014-10-16 2015-01-21 苏州大学 Identification method and device of evaluation objects of Chinese texts
CN104679885A (en) * 2015-03-17 2015-06-03 北京理工大学 User search string organization name recognition method based on semantic feature model
CN104933164A (en) * 2015-06-26 2015-09-23 华南理工大学 Method for extracting relations among named entities in Internet massive data and system thereof
CN104915443A (en) * 2015-06-29 2015-09-16 北京信息科技大学 Extraction method of Chinese Microblog evaluation object

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
俄拉扎提.巴合达吾列提: "哈萨克语短语库构建及管理系统的研究与实现", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
景元 等: "基于规则的英汉翻译技术报告", 《第四届全国机器翻译研讨会》 *
许大姐: "面向中国学生的英文作文错误检查的研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649564A (en) * 2016-11-10 2017-05-10 中科院合肥技术创新工程院 Inter-translation multi-word expression extraction method and device
CN108628819A (en) * 2017-03-16 2018-10-09 北京搜狗科技发展有限公司 Treating method and apparatus, the device for processing
CN108628819B (en) * 2017-03-16 2022-09-20 北京搜狗科技发展有限公司 Processing method and device for processing
CN107797995A (en) * 2017-11-20 2018-03-13 语联网(武汉)信息技术有限公司 A kind of Chinese and English fragment language material generation method
CN108519974A (en) * 2018-03-31 2018-09-11 华南理工大学 English composition automatic detection of syntax error and analysis method
CN109166407A (en) * 2018-08-06 2019-01-08 李勤骞 The nominal structure representation training system of English system and its method
CN109166407B (en) * 2018-08-06 2021-06-04 李勤骞 English system nominal structure expression training system and method thereof
CN111581953A (en) * 2019-01-30 2020-08-25 武汉慧人信息科技有限公司 Method for automatically analyzing grammar phenomenon of English text
CN110457676A (en) * 2019-06-26 2019-11-15 平安科技(深圳)有限公司 Extracting method and device, storage medium, the computer equipment of evaluation information
CN110457676B (en) * 2019-06-26 2022-06-21 平安科技(深圳)有限公司 Evaluation information extraction method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160831