CN106649263A

CN106649263A - Multi-word expression extraction method and device

Info

Publication number: CN106649263A
Application number: CN201610990921.6A
Authority: CN
Inventors: 朱泽德; 曾新华; 郑守国; 孙熊伟; 翁士状
Original assignee: Hefei Technology Innovation Engineering Institute of CAS
Current assignee: Hefei Technology Innovation Engineering Institute of CAS
Priority date: 2016-11-10
Filing date: 2016-11-10
Publication date: 2017-05-10

Abstract

The invention relates to a multi-word expression extraction method and device. A document library is preprocessed to form a vocabulary set; the mutual information of every pair of adjacent words across multiple documents is calculated; the jump information before and after each mutual information value is obtained; each mutual information value and its jump information together form a two-dimensional mutual information point; multi-word expressions are screened out by clustering the two-dimensional mutual information; and a multi-word expression library is then constructed. The method avoids two problems of one-dimensional mutual information: the need to set a threshold manually, and poor adaptability to different data. It is also not limited to two-word structures, so multi-word expressions made up of several words can be obtained in a single pass without a step-by-step procedure, which effectively increases the utilization rate of multi-word expressions and improves the accuracy of multi-word expression library construction.

Description

A multi-word expression extraction method and device

Technical field

The present invention relates to the technical field of statistical machine translation and cross-language information retrieval, and in particular to a multi-word expression extraction method and device.

Background art

A multi-word expression is a meaningful, complete combination of several words that has grammatical, semantic, or pragmatic characteristics. Identifying multi-word expressions can considerably improve the efficiency and accuracy of tasks such as word segmentation, part-of-speech tagging, and machine translation. In machine translation, correctly identifying the multi-word expressions in the source language helps in selecting a suitable translation and avoids the unnatural or even unintelligible target-language output that results when the words are translated one by one.

Methods for extracting multi-word expressions fall broadly into statistics-based methods and rule-based methods. Rule-based methods usually study one specific type, such as verb phrase structures, or are confined to a specific domain. Statistics-based methods can extract multi-word expressions independently of form; that is, they use statistical information to extract multi-word expressions of any structure and from any domain without distinction. However, existing statistical methods face the following problems: one-dimensional mutual information requires a manually set threshold and adapts poorly to different data; the methods are confined to two-word structures and cannot obtain multi-word expressions made up of several words in a single pass; and they must be carried out step by step, so the accuracy of the resulting multi-word expression library is low.

Summary of the invention

The primary object of the present invention is to provide a multi-word expression extraction method that obtains multi-word expressions made up of several words in a single pass, without a step-by-step procedure, effectively increasing the utilization rate of multi-word expression extraction and improving the accuracy of multi-word expression library construction.

To achieve the above object, the present invention adopts the following technical solution: a multi-word expression extraction method comprising the following steps in order:

(1) the document library is preprocessed by word segmentation and part-of-speech tagging to form source documents;

(2) the mutual information of adjacent words in multiple documents is calculated, and the jump information before and after each position in the mutual information sequence is further calculated;

(3) the mutual information sequence and the jump information sequence are combined into a two-dimensional mutual information set;

(4) a classifier divides the two-dimensional mutual information set into interior points and exterior points of multi-word expressions, and the interior points are selected and linked to build multi-word expressions.

Further, in step (1), all documents collected in the document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form an ordered candidate vocabulary set.
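The part-of-speech selection in step (1) can be sketched as follows. This is a minimal illustration that assumes the documents have already been segmented and tagged by an external tool; the tag set `KEPT_POS` and the function name `build_candidates` are hypothetical, since the source does not state which parts of speech are retained.

```python
from typing import Iterable, List, Tuple

# Hypothetical set of retained part-of-speech tags (noun, verb, adjective).
# The source does not specify which tags survive part-of-speech selection.
KEPT_POS = {"n", "v", "a"}


def build_candidates(tagged_doc: Iterable[Tuple[str, str]]) -> List[str]:
    """Return the words whose tag is in KEPT_POS, preserving document order."""
    return [word for word, pos in tagged_doc if pos in KEPT_POS]


# One pre-segmented, pre-tagged document given as (word, tag) pairs.
doc = [("机器", "n"), ("翻译", "v"), ("的", "u"), ("质量", "n"), ("很", "d"), ("高", "a")]
candidates = build_candidates(doc)
```

Keeping the original order matters because the later steps work on adjacent words in the candidate sequence.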

Further, step (2) comprises the following steps in order:

(a) calculating the mutual information of all adjacent words in the multiple documents;

(b) calculating the jump information before and after each position in the mutual information sequence.

Further, in step (3), a two-dimensional mutual information point (MI_i, f_i) is built from the values at corresponding positions of the mutual information sequence and the jump information sequence, and the multiple two-dimensional mutual information points constitute the two-dimensional mutual information set.

Further, in step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, interior points and exterior points of multi-word expressions, and the adjacent words covered by interior points are linked to form multi-word expressions.

Further, in step (a), the mutual information of adjacent words in the multiple documents is calculated to form the mutual information sequence MI, in which the mutual information MI_i (0 ≤ i < len(MI) - α) of adjacent words x and y is calculated as follows:

$$MI_i = \log\left[\frac{M \times p(x, y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right]$$

where x and y denote adjacent words; MI_i denotes the i-th mutual information value, formed by adjacent words x and y; len(MI) denotes the length of the mutual information sequence MI; α denotes a constant; M denotes the total number of words in all documents; p(x, y) denotes the number of times x and y co-occur in all documents; p(x) denotes the number of occurrences of word x in all documents; p(y) denotes the number of occurrences of word y in all documents; N denotes the number of documents in the document set; and N_{x,y} denotes the number of documents in which x and y co-occur.
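A minimal sketch of the mutual information calculation for adjacent words, MI_i = log[(M × p(x, y)) / (p(x) × p(y)) × N_{x,y} / N]. It rests on several assumptions that the source does not confirm: the logarithm is natural, "co-occurrence" means that x is immediately followed by y, and one MI sequence is produced per document from corpus-wide counts. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def corpus_stats(docs: List[List[str]]) -> Tuple[Counter, Counter, Counter, int, int]:
    """Collect the counts used in the mutual-information formula."""
    word_count: Counter = Counter()  # p(x): occurrences of each word
    pair_count: Counter = Counter()  # p(x, y): adjacent co-occurrences
    pair_docs: Counter = Counter()   # N_{x,y}: documents containing the pair
    for doc in docs:
        word_count.update(doc)
        pairs = list(zip(doc, doc[1:]))
        pair_count.update(pairs)
        pair_docs.update(set(pairs))  # count each pair once per document
    total_words = sum(word_count.values())  # M
    return word_count, pair_count, pair_docs, total_words, len(docs)


def mi_sequence(doc: List[str], stats) -> List[float]:
    """Mutual information of every adjacent pair in one document."""
    word_count, pair_count, pair_docs, m, n = stats
    seq = []
    for x, y in zip(doc, doc[1:]):
        ratio = (m * pair_count[(x, y)]) / (word_count[x] * word_count[y])
        seq.append(math.log(ratio * pair_docs[(x, y)] / n))
    return seq


docs = [["a", "b", "c"], ["a", "b", "d"]]
stats = corpus_stats(docs)
mi = mi_sequence(docs[0], stats)
```

For the first document, the pair (a, b) gives (6 × 2) / (2 × 2) × 2/2 = 3 and the pair (b, c) gives (6 × 1) / (2 × 1) × 1/2 = 1.5 inside the logarithm.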

Further, in step (b), the jump information before and after each position in the mutual information sequence is calculated to form the jump information sequence f, in which the jump information f_i of adjacent mutual information values is calculated as follows:

$$f_i = \left|\frac{1}{\alpha}\sum_{j=1}^{\alpha} MI_{i+j} - MI_i\right|, \quad 0 \leq i < \operatorname{len}(MI) - \alpha$$

where f_i denotes the jump information between the current mutual information value and the subsequent mutual information values in the mutual information sequence, and | | denotes taking the absolute value.

Further, α is 2.
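The jump information can be sketched as the absolute difference between the mean of the next α mutual information values and the current one, for 0 ≤ i < len(MI) - α. The function name `jump_sequence` is hypothetical; the default of 2 for `alpha` is the value stated in the source.

```python
from typing import List


def jump_sequence(mi: List[float], alpha: int = 2) -> List[float]:
    """f_i = |mean(MI[i+1 .. i+alpha]) - MI[i]| for 0 <= i < len(mi) - alpha."""
    return [
        abs(sum(mi[i + 1 : i + 1 + alpha]) / alpha - mi[i])
        for i in range(len(mi) - alpha)
    ]


mi = [1.0, 3.0, 2.0, 6.0]
jumps = jump_sequence(mi)
```

With these illustrative values, f_0 = |(3.0 + 2.0) / 2 - 1.0| = 1.5 and f_1 = |(2.0 + 6.0) / 2 - 3.0| = 1.0. The jump sequence is α entries shorter than the mutual information sequence.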

Another object of the present invention is to provide a multi-word expression extraction device, comprising:

a candidate vocabulary acquisition device, which preprocesses all documents collected in the document library by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form an ordered candidate vocabulary set;

a mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words in multiple documents and, from adjacent mutual information values, calculates the jump information before and after each position in the mutual information sequence;

a two-dimensional mutual information acquisition device, which pairs the mutual information and jump information at corresponding positions of the mutual information sequence and the jump information sequence to form two-dimensional mutual information;

a classification and screening device, which uses a classifier to classify all points in the two-dimensional mutual information set into two classes, interior points and exterior points of multi-word expressions, and links the adjacent words covered by interior points to form multi-word expressions.

As can be seen from the above technical solution, the present invention converts the mutual information between adjacent words into two-dimensional mutual information and screens out multi-word expressions by clustering the two-dimensional mutual information. This avoids the need to set a threshold manually for one-dimensional mutual information and its poor adaptability to different data. The invention is also not limited to two-word structures: it obtains multi-word expressions made up of several words in a single pass, without a step-by-step procedure, which effectively increases the utilization rate of multi-word expressions and improves the accuracy of multi-word expression library construction.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the method of the present invention;

Fig. 2 is a structural block diagram of the device of the present invention.

Detailed description of the embodiments

As shown in Fig. 1, a multi-word expression extraction method comprises the following steps in order: (1) the document library is preprocessed by word segmentation, part-of-speech tagging, and similar operations to form source documents; (2) the mutual information of adjacent words in multiple documents is calculated, and the jump information before and after each position in the mutual information sequence is further calculated; (3) the mutual information sequence and the jump information sequence are combined into a two-dimensional mutual information set; (4) a classifier divides the two-dimensional mutual information set into interior points and exterior points of multi-word expressions, and consecutive interior points are selected and linked to build multi-word expressions.

The present invention is further described below with reference to Fig. 1.

In step (1), all texts collected in the document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form an ordered candidate vocabulary set.

Step (2) comprises the following steps in order: (a) calculating the mutual information of all adjacent words in the multiple documents; (b) calculating the jump information before and after each position in the mutual information sequence.

In step (a), the mutual information of adjacent words in the multiple documents is calculated to form the mutual information sequence MI, in which the mutual information MI_i (0 ≤ i < len(MI) - α) of adjacent words x and y is calculated as follows:

$$MI_i = \log\left[\frac{M \times p(x, y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right]$$

where x and y denote adjacent words; MI_i denotes the i-th mutual information value, formed by adjacent words x and y; len(MI) denotes the length of the mutual information sequence MI; α denotes a constant; M denotes the total number of words in all documents; p(x, y) denotes the number of times x and y co-occur in all documents; p(x) denotes the number of occurrences of word x in all documents; p(y) denotes the number of occurrences of word y in all documents; N denotes the number of documents in the document set; and N_{x,y} denotes the number of documents in which x and y co-occur. The constant α is 2.
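As an illustrative calculation with made-up counts (they are not taken from the patent), and assuming a natural logarithm because the source does not state the base, take M = 1000, p(x, y) = 10, p(x) = 20, p(y) = 25, N = 50, and N_{x,y} = 5:

$$MI_i = \log\left[\frac{1000 \times 10}{20 \times 25} \times \frac{5}{50}\right] = \log\left[20 \times 0.1\right] = \log 2 \approx 0.693$$

The first factor is the usual co-occurrence ratio, and the second factor scales it down when the pair appears in only a small share of the documents.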

In step (b), the jump information before and after each position in the mutual information sequence is calculated to form the jump information sequence f, in which the jump information f_i of adjacent mutual information values is calculated as follows:

$$f_i = \left|\frac{1}{\alpha}\sum_{j=1}^{\alpha} MI_{i+j} - MI_i\right|, \quad 0 \leq i < \operatorname{len}(MI) - \alpha$$

In step (3), a two-dimensional mutual information point (MI_i, f_i) is built from the values at corresponding positions of the mutual information sequence and the jump information sequence, and the multiple two-dimensional mutual information points constitute the two-dimensional mutual information set.
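Pairing the two sequences by position can be sketched as below. Because f_i is defined only for i < len(MI) - α, the jump sequence is shorter than the mutual information sequence; this sketch assumes the trailing mutual information values without a matching f_i are simply left out, which the source does not spell out. The function name `build_points` is hypothetical.

```python
from typing import List, Tuple


def build_points(mi: List[float], jumps: List[float]) -> List[Tuple[float, float]]:
    """Pair MI_i with f_i at matching positions; zip stops at the shorter sequence."""
    return list(zip(mi, jumps))


mi = [1.0, 3.0, 2.0, 6.0]
jumps = [1.5, 1.0]  # f_i exists only for i < len(mi) - alpha, with alpha = 2
points = build_points(mi, jumps)
```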

In step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, interior points and exterior points of multi-word expressions, and the adjacent words covered by interior points are linked to form multi-word expressions.
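Step (4) can be sketched with a small two-cluster k-means, chosen here only as a stand-in because the source does not name the classifier or clustering algorithm. Two further assumptions are made: the cluster with the higher mean mutual information is treated as the interior class, and point i is taken to describe the adjacent pair (words[i], words[i + 1]). All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def dist2(p: Point, c: Point) -> float:
    """Squared Euclidean distance between a point and a cluster centre."""
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2


def classify_interior(points: List[Point], iters: int = 20) -> List[bool]:
    """Two-means clustering; True marks points in the higher-MI cluster.

    Assumes at least one point is supplied.
    """
    centers = [min(points), max(points)]  # deterministic start: lowest and highest MI
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [0 if dist2(p, centers[0]) <= dist2(p, centers[1]) else 1 for p in points]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    interior = 0 if centers[0][0] > centers[1][0] else 1
    return [lab == interior for lab in labels]


def link_expressions(words: List[str], interior: List[bool]) -> List[List[str]]:
    """Merge runs of consecutive interior points into multi-word expressions."""
    expressions: List[List[str]] = []
    current: List[str] = []
    for i, flag in enumerate(interior):
        if flag:
            if not current:
                current = [words[i]]
            current.append(words[i + 1])
        elif current:
            expressions.append(current)
            current = []
    if current:
        expressions.append(current)
    return expressions


words = ["statistical", "machine", "translation", "is", "useful", "today"]
points = [(5.0, 0.3), (4.6, 0.4), (0.4, 0.2)]  # illustrative (MI_i, f_i) values
flags = classify_interior(points)
expressions = link_expressions(words, flags)
```

Because runs of any length are merged, an expression of three or more words comes out in one pass rather than being assembled from two-word pieces.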

As shown in Fig. 2, the device of the present invention comprises: a candidate vocabulary acquisition device, which preprocesses all texts collected in the document library by Chinese word segmentation, part-of-speech tagging, named entity recognition, part-of-speech selection, and similar operations to form an ordered candidate vocabulary set; a mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words in multiple documents and, from adjacent mutual information values, calculates the jump information before and after each position in the mutual information sequence; a two-dimensional mutual information acquisition device, which pairs the mutual information and jump information at corresponding positions of the mutual information sequence and the jump information sequence to form two-dimensional mutual information; and a classification and screening device, which uses a classifier to classify all points in the two-dimensional mutual information set into two classes, interior points and exterior points of multi-word expressions, and links the adjacent words covered by interior points to form multi-word expressions.
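The data flow between the four devices can be sketched as a small composition in which each device is an injected callable. The class name `MweExtractor`, its field names, and the stub implementations are hypothetical and serve only to show the order in which results are passed along; they are not the patent's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class MweExtractor:
    """Wires the four devices in the order given in the source."""

    acquire_candidates: Callable[[Sequence[Tuple[str, str]]], List[str]]
    compute_mi_and_jumps: Callable[[List[str]], Tuple[List[float], List[float]]]
    build_points: Callable[[List[float], List[float]], List[Point]]
    classify_and_link: Callable[[List[str], List[Point]], List[List[str]]]

    def run(self, tagged_doc: Sequence[Tuple[str, str]]) -> List[List[str]]:
        words = self.acquire_candidates(tagged_doc)
        mi, jumps = self.compute_mi_and_jumps(words)
        points = self.build_points(mi, jumps)
        return self.classify_and_link(words, points)


# Placeholder stubs: they return fixed-shape dummy values, not real statistics.
extractor = MweExtractor(
    acquire_candidates=lambda doc: [w for w, _ in doc],
    compute_mi_and_jumps=lambda ws: ([1.0] * (len(ws) - 1), [0.0] * max(len(ws) - 3, 0)),
    build_points=lambda mi, jumps: list(zip(mi, jumps)),
    classify_and_link=lambda ws, pts: [ws[: len(pts) + 1]] if pts else [],
)
result = extractor.run([("a", "n"), ("b", "n"), ("c", "n"), ("d", "n")])
```

Injecting the stages keeps each device replaceable, for example swapping in a different classifier without touching the other three.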

In summary, the present invention converts the mutual information between adjacent words into two-dimensional mutual information and screens out multi-word expressions by clustering the two-dimensional mutual information. This avoids the need to set a threshold manually for one-dimensional mutual information and its poor adaptability to different data. The invention is also not limited to two-word structures: it obtains multi-word expressions made up of several words in a single pass, without a step-by-step procedure, which effectively increases the utilization rate of multi-word expressions and improves the accuracy of multi-word expression library construction.

Claims

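1. A multi-word expression extraction method, characterized in that the method comprises the following steps in order:

(1) the document library is preprocessed by word segmentation and part-of-speech tagging to form source documents;

(2) the mutual information of adjacent words in multiple documents is calculated, and the jump information before and after each position in the mutual information sequence is further calculated;

(3) the mutual information sequence and the jump information sequence are combined into a two-dimensional mutual information set;

(4) a classifier divides the two-dimensional mutual information set into interior points and exterior points of multi-word expressions, and the interior points are selected and linked to build multi-word expressions.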

2. The method according to claim 1, characterized in that: in step (1), all documents of the document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form an ordered candidate vocabulary set.

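3. The method according to claim 1, characterized in that step (2) comprises the following steps in order:

(a) calculating the mutual information of all adjacent words in the multiple documents;

(b) calculating the jump information before and after each position in the mutual information sequence.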

4. The method according to claim 1, characterized in that: in step (3), a two-dimensional mutual information point (MI_i, f_i) is built from the values at corresponding positions of the mutual information sequence and the jump information sequence, and the multiple two-dimensional mutual information points constitute the two-dimensional mutual information set.

5. The method according to claim 1, characterized in that: in step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, interior points and exterior points of multi-word expressions, and the adjacent words covered by interior points are linked to form multi-word expressions.

6. The method according to claim 3, characterized in that: in step (a), the mutual information of adjacent words in the multiple documents is calculated to form the mutual information sequence MI, in which the mutual information MI_i (0 ≤ i < len(MI) - α) of adjacent words x and y is calculated as follows:

$$MI_i = \log\left[\frac{M \times p(x, y)}{p(x) \times p(y)} \times \frac{N_{x,y}}{N}\right]$$

where x and y denote adjacent words; MI_i denotes the i-th mutual information value, formed by adjacent words x and y; len(MI) denotes the length of the mutual information sequence MI; α denotes a constant; M denotes the total number of words in all documents; p(x, y) denotes the number of times x and y co-occur in all documents; p(x) denotes the number of occurrences of word x in all documents; p(y) denotes the number of occurrences of word y in all documents; N denotes the number of documents in the document set; and N_{x,y} denotes the number of documents in which x and y co-occur.

7. The method according to claim 3, characterized in that: in step (b), the jump information before and after each position in the mutual information sequence is calculated to form the jump information sequence f, in which the jump information f_i of adjacent mutual information values is calculated as follows:

$$f_i = \left|\frac{1}{\alpha}\sum_{j=1}^{\alpha} MI_{i+j} - MI_i\right|, \quad 0 \leq i < \operatorname{len}(MI) - \alpha$$

where f_i denotes the jump information between the current mutual information value and the subsequent mutual information values in the mutual information sequence, and | | denotes taking the absolute value.

8. The method according to claim 6, characterized in that α is 2.

9. A multi-word expression extraction device, comprising:

a candidate vocabulary acquisition device, which preprocesses all documents collected in the document library by Chinese word segmentation, part-of-speech tagging, named entity recognition, part-of-speech selection, and similar operations to form an ordered candidate vocabulary set;

a mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words in multiple documents and, from adjacent mutual information values, calculates the jump information before and after each position in the mutual information sequence;

a two-dimensional mutual information acquisition device, which pairs the mutual information and jump information at corresponding positions of the mutual information sequence and the jump information sequence to form two-dimensional mutual information;

a classification and screening device, which uses a classifier to classify all points in the two-dimensional mutual information set into two classes, interior points and exterior points of multi-word expressions, and links the adjacent words covered by interior points to form multi-word expressions.