CN106649263A - Multi-word expression extraction method and device - Google Patents

Multi-word expression extraction method and device

Info

Publication number
CN106649263A
Authority
CN
China
Prior art keywords
mutual information
information
words
mutual
expression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610990921.6A
Other languages
Chinese (zh)
Inventor
朱泽德
曾新华
郑守国
孙熊伟
翁士状
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Technology Innovation Engineering Institute of CAS
Original Assignee
Hefei Technology Innovation Engineering Institute of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei Technology Innovation Engineering Institute of CAS
Priority to CN201610990921.6A
Publication of CN106649263A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Abstract

The invention relates to a multi-word expression extraction method and device. In the method, a document library is preprocessed to form a word set; the mutual information of every two adjacent words in the documents is calculated; the jump information before and after each mutual information value is obtained; each mutual information value and its jump information form a two-dimensional mutual information point; the two-dimensional mutual information is clustered to screen out multi-word expressions; and a multi-word expression library is then constructed. The method avoids the need to set a threshold manually for one-dimensional mutual information and the poor adaptability of one-dimensional mutual information to different data. It is also not limited to two-word (binary) structures, so multi-word expressions made up of more than two words can be obtained in a single pass. In addition, the method does not need to be carried out in several stages, which effectively increases the utilization of multi-word expressions and improves the accuracy of multi-word expression library construction.

Description

Multi-word expression extraction method and device
Technical field
The present invention relates to the technical field of statistical machine translation and cross-language information retrieval, and in particular to a multi-word expression extraction method and device.
Background
A multi-word expression is a combination of multiple words that is meaningful and complete and that has grammatical, semantic, or pragmatic characteristics. Identifying multi-word expressions can substantially improve the efficiency and accuracy of tasks such as word segmentation, part-of-speech tagging, and machine translation. In machine translation, correctly identifying the multi-word expressions in the source language helps select a suitable translation and avoids the unnatural or even unintelligible target-language output that results from translating the words separately.
Methods for extracting multi-word expressions fall broadly into statistics-based methods and rule-based methods. Rule-based methods usually study one specific type, such as verb-phrase structures, or are confined to a particular domain. Statistics-based methods can extract multi-word expressions independent of form, that is, they use statistical information to extract expressions of all structures and domains without distinction. However, existing statistical methods face the following problems: one-dimensional mutual information requires a manually set threshold and adapts poorly to different data; the methods are limited to two-word (binary) structures and cannot obtain expressions of more than two words in a single pass; and they must be carried out in several stages, so the accuracy of the resulting multi-word expression library is low.
Summary of the invention
The primary object of the present invention is to provide a multi-word expression extraction method that obtains expressions of more than two words in a single pass, does not need to be carried out in several stages, effectively improves the utilization of the extracted multi-word expressions, and improves the accuracy of multi-word expression library construction.
To achieve this object, the present invention adopts the following technical solution: a multi-word expression extraction method comprising the following steps in order:
(1) The document library is preprocessed by word segmentation and part-of-speech tagging to form the source documents;
(2) The mutual information of adjacent words in the documents is calculated, and the jump information before and after each value in the mutual information sequence is then calculated;
(3) The mutual information sequence and the jump information sequence are combined into a two-dimensional mutual information set;
(4) A classifier divides the two-dimensional mutual information set into interior points and exterior points of multi-word expressions, and consecutive interior points are selected and linked to build the multi-word expressions.
Further, in step (1), all documents collected in the document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form an ordered candidate word set.
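The part-of-speech selection in step (1) can be sketched as a simple filter. This is only an illustration: it assumes the documents have already been segmented and tagged by some external tool, and the tag set `KEEP_POS` and the function name `candidate_vocabulary` are hypothetical, since the patent does not say which parts of speech are retained.

```python
from typing import List, Tuple

# Hypothetical tag set; the patent does not list the parts of speech it keeps.
KEEP_POS = {"n", "v", "a"}


def candidate_vocabulary(tagged_doc: List[Tuple[str, str]]) -> List[str]:
    """Keep the words whose POS tag is in KEEP_POS, preserving document order."""
    return [word for word, pos in tagged_doc if pos in KEEP_POS]


# Input is assumed to be (word, tag) pairs produced by a segmenter and tagger.
tagged = [("机器", "n"), ("翻译", "v"), ("的", "u"), ("质量", "n")]
print(candidate_vocabulary(tagged))
```

Keeping the original order matters because the later steps work on adjacent words.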
Further, step (2) comprises the following steps in order:
(a) calculating the mutual information of all adjacent words in the documents;
(b) calculating the jump information before and after each value in the mutual information sequence.
Further, in step (3), a two-dimensional mutual information point (MI_i, f_i) is built from the values at corresponding positions of the mutual information sequence and the jump information sequence; the points together constitute the two-dimensional mutual information set.
Further, in step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, interior points and exterior points of multi-word expressions, and the adjacent words that contain interior points are linked to form multi-word expressions.
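The patent does not name the classifier used in step (4); elsewhere it describes the step as clustering. As one possible reading, the sketch below splits the points with a small two-cluster k-means written in plain Python. The choice of k-means, the seeding from the lowest and highest mutual information points, and the function name `two_means` are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (MI_i, f_i)


def _dist2(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def _centroid(points: List[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def two_means(points: List[Point], iters: int = 20) -> List[bool]:
    """Split points into two clusters.

    True marks the cluster seeded from the highest-MI point, which this
    sketch treats as the interior class (an assumption, not the patent's rule).
    """
    if not points:
        return []
    lo, hi = min(points), max(points)  # tuples compare on MI first
    labels = [False] * len(points)
    for _ in range(iters):
        labels = [_dist2(p, hi) < _dist2(p, lo) for p in points]
        inside = [p for p, flag in zip(points, labels) if flag]
        outside = [p for p, flag in zip(points, labels) if not flag]
        if not inside or not outside:
            break  # one cluster is empty, nothing left to refine
        hi, lo = _centroid(inside), _centroid(outside)
    return labels


pts = [(5.0, 0.1), (4.8, 0.2), (0.5, 2.0), (0.3, 1.8)]
print(two_means(pts))
```

Because the split is learned from the data, no mutual information threshold has to be chosen by hand, which is the advantage the patent claims over one-dimensional mutual information.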
Further, in step (a), the mutual information of adjacent words in the documents is calculated to form the mutual information sequence MI, where the mutual information MI_i (0 ≤ i < len(MI) − α) of adjacent words x and y is calculated as follows:

$$MI_i = \log\left[\frac{M \times p(x, y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right]$$

where x and y denote adjacent words; MI_i denotes the i-th mutual information value, formed by adjacent words x and y; len(MI) denotes the length of the mutual information sequence MI; α denotes a constant; M denotes the total number of words in all documents; p(x, y) denotes the number of co-occurrences of words x and y in all documents; p(x) denotes the number of occurrences of word x in all documents; p(y) denotes the number of occurrences of word y in all documents; N denotes the number of documents in the document set; and N_{x,y} denotes the number of documents in which x and y co-occur.
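A minimal sketch of the mutual information calculation, following the formula stated in claim 6. Several details are assumptions because the patent leaves them open: the logarithm is taken as the natural log, "co-occurrence" is read as the two words appearing next to each other in that order, and pairs are not formed across document boundaries. The function name `mutual_information_sequence` is hypothetical.

```python
import math
from collections import Counter
from typing import List


def mutual_information_sequence(docs: List[List[str]]) -> List[float]:
    """Return MI for every adjacent word pair, in corpus order."""
    word_count = Counter()  # p(x): occurrences of each word
    pair_count = Counter()  # p(x, y): adjacent co-occurrences of (x, y)
    pair_docs = Counter()   # N_{x,y}: documents containing the pair
    for doc in docs:
        pairs = list(zip(doc, doc[1:]))
        word_count.update(doc)
        pair_count.update(pairs)
        pair_docs.update(set(pairs))  # count each pair once per document

    total_words = sum(word_count.values())  # M
    num_docs = len(docs)                    # N

    mi = []
    for doc in docs:
        for x, y in zip(doc, doc[1:]):
            ratio = (total_words * pair_count[(x, y)]) / (word_count[x] * word_count[y])
            mi.append(math.log(ratio * pair_docs[(x, y)] / num_docs))
    return mi


docs = [["statistical", "machine", "translation"], ["machine", "translation", "quality"]]
print(mutual_information_sequence(docs))
```

The document-frequency factor N_{x,y}/N lowers the score of pairs that are frequent in only one document.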
Further, in step (b), the jump information before and after each value in the mutual information sequence is calculated to form the jump information sequence f, where the jump information f_i of adjacent mutual information values is calculated as follows:

$$f_i = \left|\frac{1}{\alpha}\sum_{j=1}^{\alpha} MI_{i+j} - MI_i\right|, \quad 0 \le i < len(MI) - \alpha$$

where f_i denotes the jump information between the current mutual information value and the subsequent mutual information values in the mutual information sequence, and |·| denotes the absolute value.
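The jump information of claim 7 is the absolute difference between the current mutual information value and the mean of the α values that follow it. A short sketch, where the function name `jump_information` is hypothetical and the default `alpha=2` follows the value the patent gives for α:

```python
from typing import List


def jump_information(mi: List[float], alpha: int = 2) -> List[float]:
    """f_i = |mean(MI_{i+1} .. MI_{i+alpha}) - MI_i| for 0 <= i < len(mi) - alpha."""
    return [
        abs(sum(mi[i + 1 : i + 1 + alpha]) / alpha - mi[i])
        for i in range(len(mi) - alpha)
    ]


print(jump_information([1.0, 3.0, 5.0, 2.0]))  # [3.0, 0.5]
```

The result is α entries shorter than the input, which matches the index range 0 ≤ i < len(MI) − α in the claim.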
Further, α is 2.
Another object of the present invention is to provide a multi-word expression extraction device, comprising:
Candidate word acquisition device: performs Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection on all documents collected in the document library to form an ordered candidate word set;
Mutual information and jump information acquisition device: calculates the mutual information of adjacent candidate words in the documents, and calculates the jump information of the mutual information sequence from adjacent mutual information values;
Two-dimensional mutual information acquisition device: pairs each mutual information value with the jump information at the corresponding position of the two sequences to form the two-dimensional mutual information;
Classification and screening device: uses a classifier to divide all points in the two-dimensional mutual information set into two classes, interior points and exterior points of multi-word expressions, and links the adjacent words that have interior points to form multi-word expressions.
As the above technical solution shows, the present invention converts the mutual information between adjacent words into two-dimensional mutual information and clusters the two-dimensional mutual information to screen out multi-word expressions. This avoids the need to set a threshold manually for one-dimensional mutual information and the resulting poor adaptability to different data. The invention is also not limited to two-word (binary) structures: it obtains expressions of more than two words in a single pass and does not need to be carried out in several stages, which effectively improves the utilization of multi-word expressions and the accuracy of multi-word expression library construction.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a structural block diagram of the device of the invention.
Detailed description
As shown in Fig. 1, a multi-word expression extraction method comprises the following steps in order: (1) the document library is preprocessed by word segmentation, part-of-speech tagging, and similar operations to form the source documents; (2) the mutual information of adjacent words in the documents is calculated, and the jump information before and after each value in the mutual information sequence is then calculated; (3) the mutual information sequence and the jump information sequence are combined into a two-dimensional mutual information set; (4) a classifier divides the two-dimensional mutual information set into interior points and exterior points of multi-word expressions, and consecutive interior points are selected and linked to build the multi-word expressions.
The invention is further described below with reference to Fig. 1.
In step (1), all texts collected in the document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form an ordered candidate word set.
Step (2) comprises the following steps in order: (a) calculating the mutual information of all adjacent words in the documents; (b) calculating the jump information before and after each value in the mutual information sequence.
In step (a), the mutual information of adjacent words in the documents is calculated to form the mutual information sequence MI, where the mutual information MI_i (0 ≤ i < len(MI) − α) of adjacent words x and y is calculated as follows:

$$MI_i = \log\left[\frac{M \times p(x, y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right]$$

where x and y denote adjacent words; MI_i denotes the i-th mutual information value, formed by adjacent words x and y; len(MI) denotes the length of the mutual information sequence MI; α denotes a constant; M denotes the total number of words in all documents; p(x, y) denotes the number of co-occurrences of words x and y in all documents; p(x) denotes the number of occurrences of word x in all documents; p(y) denotes the number of occurrences of word y in all documents; N denotes the number of documents in the document set; and N_{x,y} denotes the number of documents in which x and y co-occur. The constant α is 2.
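As a worked instance of the mutual information formula of claim 6, take the following counts, which are invented purely for illustration and do not come from the patent: M = 1000, p(x, y) = 8, p(x) = 10, p(y) = 20, N = 50, and N_{x,y} = 5. The natural logarithm is assumed, since the patent does not state the base.

```latex
\begin{aligned}
MI_i &= \log\left[\frac{M \times p(x,y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right]
      = \log\left[\frac{1000 \times 8}{10 \times 20} \times \frac{5}{50}\right] \\
     &= \log\left[40 \times 0.1\right] = \log 4 \approx 1.386
\end{aligned}
```

The first factor rewards pairs that co-occur more often than their individual frequencies would suggest, and the second factor scales that score by the share of documents in which the pair appears.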
In step (b), the jump information before and after each value in the mutual information sequence is calculated to form the jump information sequence f, where the jump information f_i of adjacent mutual information values is calculated as follows:

$$f_i = \left|\frac{1}{\alpha}\sum_{j=1}^{\alpha} MI_{i+j} - MI_i\right|, \quad 0 \le i < len(MI) - \alpha$$

where f_i denotes the jump information between the current mutual information value and the subsequent mutual information values in the mutual information sequence, and |·| denotes the absolute value.
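A worked instance of the jump information formula of claim 7 with α = 2, using an invented mutual information sequence MI = (4.0, 3.8, 4.2, 0.6, 0.4) chosen only for illustration. Since len(MI) = 5, the index i runs over 0, 1, 2.

```latex
\begin{aligned}
f_0 &= \left|\tfrac{1}{2}(3.8 + 4.2) - 4.0\right| = 0 \\
f_1 &= \left|\tfrac{1}{2}(4.2 + 0.6) - 3.8\right| = 1.4 \\
f_2 &= \left|\tfrac{1}{2}(0.6 + 0.4) - 4.2\right| = 3.7
\end{aligned}
```

In this example f_i grows as the window of following values moves into the low mutual information region, which suggests (as an interpretation, not a statement from the patent) that a large f_i marks a position where word association is about to change sharply.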
In step (3), a two-dimensional mutual information point (MI_i, f_i) is built from the values at corresponding positions of the mutual information sequence and the jump information sequence; the points together constitute the two-dimensional mutual information set.
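Pairing the two sequences position by position can be sketched in one line. The function name `build_points` is hypothetical; the sketch relies on the jump information sequence being α entries shorter than the mutual information sequence, so the trailing mutual information values have no partner and are dropped.

```python
from typing import List, Tuple


def build_points(mi: List[float], f: List[float]) -> List[Tuple[float, float]]:
    """Pair MI_i with f_i; zip stops at the shorter sequence (the jump information)."""
    return list(zip(mi, f))


print(build_points([1.0, 3.0, 5.0, 2.0], [3.0, 0.5]))  # [(1.0, 3.0), (3.0, 0.5)]
```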
In step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, interior points and exterior points of multi-word expressions, and the adjacent words that contain interior points are linked to form multi-word expressions.
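The linking part of step (4) can be sketched as follows, assuming the classifier has already produced one boolean per point and that point i describes the adjacent pair (words[i], words[i+1]). That indexing convention and the function name `link_expressions` are assumptions; the patent only says that adjacent words containing interior points are linked.

```python
from typing import List


def link_expressions(words: List[str], interior: List[bool]) -> List[List[str]]:
    """Chain consecutive interior points into multi-word expressions.

    interior[i] refers to the adjacent pair (words[i], words[i + 1]),
    so len(interior) must not exceed len(words) - 1.
    """
    expressions: List[List[str]] = []
    current: List[str] = []
    for i, inside in enumerate(interior):
        if inside:
            if not current:
                current = [words[i]]      # start a new expression
            current.append(words[i + 1])  # extend it by the right-hand word
        elif current:
            expressions.append(current)   # an exterior point closes the expression
            current = []
    if current:
        expressions.append(current)
    return expressions


words = ["w0", "w1", "w2", "w3", "w4"]
print(link_expressions(words, [True, True, False, True]))
# [['w0', 'w1', 'w2'], ['w3', 'w4']]
```

Two consecutive interior points yield a three-word expression directly, which is how a single pass can produce expressions longer than two words.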
As shown in Fig. 2, the device of the invention comprises: a candidate word acquisition device, which performs Chinese word segmentation, part-of-speech tagging, named entity recognition, part-of-speech selection, and similar preprocessing on all texts collected in the document library to form an ordered candidate word set; a mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words in the documents and calculates the jump information of the mutual information sequence from adjacent mutual information values; a two-dimensional mutual information acquisition device, which pairs each mutual information value with the jump information at the corresponding position of the two sequences to form the two-dimensional mutual information; and a classification and screening device, which uses a classifier to divide all points in the two-dimensional mutual information set into two classes, interior points and exterior points of multi-word expressions, and links the adjacent words that have interior points to form multi-word expressions.
In summary, the present invention converts the mutual information between adjacent words into two-dimensional mutual information and clusters the two-dimensional mutual information to screen out multi-word expressions. This avoids the need to set a threshold manually for one-dimensional mutual information and the resulting poor adaptability to different data. The invention is also not limited to two-word (binary) structures: it obtains expressions of more than two words in a single pass and does not need to be carried out in several stages, which effectively improves the utilization of multi-word expressions and the accuracy of multi-word expression library construction.

Claims (9)

1. A multi-word expression extraction method, characterized in that the method comprises the following steps in order:
(1) the document library is preprocessed by word segmentation and part-of-speech tagging to form the source documents;
(2) the mutual information of adjacent words in the documents is calculated, and the jump information before and after each value in the mutual information sequence is then calculated;
(3) the mutual information sequence and the jump information sequence are combined into a two-dimensional mutual information set;
(4) a classifier divides the two-dimensional mutual information set into interior points and exterior points of multi-word expressions, and consecutive interior points are selected and linked to build the multi-word expressions.
2. The method according to claim 1, characterized in that: in step (1), all documents in the document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form an ordered candidate word set.
3. The method according to claim 1, characterized in that step (2) comprises the following steps in order:
(a) calculating the mutual information of all adjacent words in the documents;
(b) calculating the jump information before and after each value in the mutual information sequence.
4. The method according to claim 1, characterized in that: in step (3), a two-dimensional mutual information point (MI_i, f_i) is built from the values at corresponding positions of the mutual information sequence and the jump information sequence, and the points together constitute the two-dimensional mutual information set.
5. The method according to claim 1, characterized in that: in step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, interior points and exterior points of multi-word expressions, and the adjacent words that contain interior points are linked to form multi-word expressions.
6. The method according to claim 3, characterized in that: in step (a), the mutual information of adjacent words in the documents is calculated to form the mutual information sequence MI, where the mutual information MI_i (0 ≤ i < len(MI) − α) of adjacent words x and y is calculated as follows:

$$MI_i = \log\left[\frac{M \times p(x, y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right],$$

where x and y denote adjacent words; MI_i denotes the i-th mutual information value, formed by adjacent words x and y; len(MI) denotes the length of the mutual information sequence MI; α denotes a constant; M denotes the total number of words in all documents; p(x, y) denotes the number of co-occurrences of words x and y in all documents; p(x) denotes the number of occurrences of word x in all documents; p(y) denotes the number of occurrences of word y in all documents; N denotes the number of documents in the document set; and N_{x,y} denotes the number of documents in which x and y co-occur.
7. The method according to claim 3, characterized in that: in step (b), the jump information before and after each value in the mutual information sequence is calculated to form the jump information sequence f, where the jump information f_i of adjacent mutual information values is calculated as follows:

$$f_i = \left|\frac{1}{\alpha}\sum_{j=1}^{\alpha} MI_{i+j} - MI_i\right|, \quad 0 \le i < len(MI) - \alpha$$

where f_i denotes the jump information between the current mutual information value and the subsequent mutual information values in the mutual information sequence, and |·| denotes the absolute value.
8. The method according to claim 6, characterized in that: α is 2.
9. A multi-word expression extraction device, comprising:
a candidate word acquisition device, which performs Chinese word segmentation, part-of-speech tagging, named entity recognition, part-of-speech selection, and similar preprocessing on all documents collected in the document library to form an ordered candidate word set;
a mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words in the documents and calculates the jump information of the mutual information sequence from adjacent mutual information values;
a two-dimensional mutual information acquisition device, which pairs each mutual information value with the jump information at the corresponding position of the two sequences to form the two-dimensional mutual information;
a classification and screening device, which uses a classifier to divide all points in the two-dimensional mutual information set into two classes, interior points and exterior points of multi-word expressions, and links the adjacent words that have interior points to form multi-word expressions.
CN201610990921.6A 2016-11-10 2016-11-10 Multi-word expression extraction method and device Pending CN106649263A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610990921.6A CN106649263A (en) 2016-11-10 2016-11-10 Multi-word expression extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610990921.6A CN106649263A (en) 2016-11-10 2016-11-10 Multi-word expression extraction method and device

Publications (1)

Publication Number Publication Date
CN106649263A (en) 2017-05-10

Family

ID=58806046

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610990921.6A Pending CN106649263A (en) 2016-11-10 2016-11-10 Multi-word expression extraction method and device

Country Status (1)

Country Link
CN (1) CN106649263A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549631A (en) * 2018-03-30 2018-09-18 北京智慧正安科技有限公司 Noun dictionary extracting method, electronic device and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040044528A1 (en) * 2002-09-03 2004-03-04 Chelba Ciprian I. Method and apparatus for generating decision tree questions for speech processing
CN1567297A (en) * 2003-07-03 2005-01-19 中国科学院声学研究所 Method for extracting multi-word translation equivalent cells from bilingual corpus automatically
US20050216443A1 (en) * 2000-07-06 2005-09-29 Streamsage, Inc. Method and system for indexing and searching timed media information based upon relevance intervals
JP2006178536A (en) * 2004-12-20 2006-07-06 Oki Electric Ind Co Ltd Parallel translation expression extraction device
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170510