CN106649263A - Multi-word expression extraction method and device - Google Patents
Multi-word expression extraction method and device
- Publication number
- CN106649263A
- Authority
- CN
- China
- Prior art keywords
- mutual information
- information
- words
- mutual
- expression
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Abstract
The invention relates to a multi-word expression extraction method and device. In the method, a document library is preprocessed to form a vocabulary set; the mutual information of every two adjacent words across multiple documents is calculated; the jump information before and after each mutual information value is obtained; each mutual information value and its jump information form a two-dimensional mutual information point; multi-word expressions are screened out by clustering the two-dimensional mutual information; and a multi-word expression library is then constructed. The method avoids two problems of one-dimensional mutual information: the need to set a threshold manually, and poor adaptability to different data. It is also not limited to two-word structures, so multi-word expressions combining several words can be obtained in a single pass. In addition, the method does not need to be carried out in stages, which effectively increases the utilization rate of multi-word expressions and improves the accuracy of the constructed multi-word expression library.
Description
Technical field
The present invention relates to the technical fields of statistical machine translation and cross-language information retrieval, and in particular to a multi-word expression extraction method and device.
Background
A multi-word expression is a complete, meaningful combination of multiple words that has grammatical, semantic, or pragmatic characteristics. Identifying multi-word expressions can substantially improve the efficiency and accuracy of tasks such as word segmentation, part-of-speech tagging, and machine translation. In machine translation, correctly identifying the multi-word expressions in the source language helps in selecting a suitable translation and avoids translating the words separately, which would make the target-language output unnatural or even fail to convey the meaning.
Methods for extracting multi-word expressions fall broadly into statistics-based methods and rule-based methods. Rule-based methods usually study one specific type, such as verb phrase structures, or are confined to a specific domain. Statistics-based methods can extract multi-word expressions independently of form; that is, they use statistical information to extract multi-word expressions of any structure and domain without distinction. However, existing statistical methods face the following problems: one-dimensional mutual information requires a manually set threshold and adapts poorly to different data; the methods are confined to two-word structures and cannot obtain multi-word expressions combining several words in a single pass; they must be carried out in stages; and the resulting multi-word expression library has low accuracy.
Summary of the invention
The primary object of the present invention is to provide a multi-word expression extraction method that obtains multi-word expressions combining several words in a single pass, does not need to be carried out in stages, effectively improves the utilization rate of multi-word expression extraction, and improves the accuracy of the constructed multi-word expression library.
To achieve the above object, the present invention adopts the following technical solution: a multi-word expression extraction method comprising the following steps in order:
(1) the document library is preprocessed by word segmentation and part-of-speech tagging to form the source documents;
(2) the mutual information of adjacent words across multiple documents is calculated, and the jump information before and after each value in the mutual information sequence is then calculated;
(3) the mutual information sequence and the jump information sequence are combined into a two-dimensional mutual information set;
(4) a classifier divides the two-dimensional mutual information set into interior points of multi-word expressions and exterior points, and consecutive interior points are selected and linked to build multi-word expressions.
Further, in step (1), all documents collected in the document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form an ordered set of candidate words.
Further, step (2) comprises the following steps in order:
(a) calculating the mutual information of all adjacent words across multiple documents;
(b) calculating the jump information before and after each value in the mutual information sequence.
Further, in step (3), a two-dimensional mutual information point (MI_i, f_i) is built from the values at corresponding positions of the mutual information sequence and the jump information sequence; these points together constitute the two-dimensional mutual information set.
Further, in step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, interior points of multi-word expressions and exterior points, and the adjacent words covered by linked interior points constitute a multi-word expression.
Further, in step (a), the mutual information of adjacent words across multiple documents is calculated to form the mutual information sequence MI, in which the mutual information MI_i (0 ≤ i < len(MI) - α) of adjacent words x and y is calculated by the following formula:
where x and y denote adjacent words; MI_i denotes the i-th mutual information value, formed by adjacent words x and y; len(MI) denotes the length of the mutual information sequence MI; α denotes a constant; M denotes the total number of words in all documents; P(x, y) denotes the number of times words x and y co-occur in all documents; P(x) denotes the number of occurrences of word x in all documents; P(y) denotes the number of occurrences of word y in all documents; N denotes the number of documents in the document set; and N_{x,y} denotes the number of documents in which x and y co-occur.
Further, in step (b), the jump information before and after each value in the mutual information sequence is calculated to form the jump information sequence f, in which the jump information f_i of adjacent mutual information values is calculated by the following formula:
where f_i denotes the jump information between the current mutual information value and the subsequent mutual information value in the mutual information sequence, and | | denotes the absolute value.
Further, α is 2.
Another object of the present invention is to provide a multi-word expression extraction device, comprising:
a candidate word acquisition device, which preprocesses all documents collected in the document library by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form an ordered set of candidate words;
a mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words across multiple documents and, from adjacent mutual information values, calculates the jump information before and after each value in the mutual information sequence;
a two-dimensional mutual information acquisition device, which pairs the mutual information and jump information at corresponding positions of the two sequences to form two-dimensional mutual information;
a classification and screening device, which uses a classifier to divide all points in the two-dimensional mutual information set into two classes, interior points of multi-word expressions and exterior points, and links the adjacent words that have interior points to form multi-word expressions.
As can be seen from the above technical solution, the present invention converts the mutual information between adjacent words into two-dimensional mutual information and screens out multi-word expressions by clustering it. This avoids the need to set a threshold manually for one-dimensional mutual information and its poor adaptability to different data. The method is also not limited to two-word structures, obtains multi-word expressions combining several words in a single pass, and does not need to be carried out in stages, which effectively improves the utilization rate of multi-word expressions and the accuracy of the constructed multi-word expression library.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention;
Fig. 2 is a structural block diagram of the device of the present invention.
Detailed description
As shown in Fig. 1, a multi-word expression extraction method comprises the following steps in order: (1) the document library is preprocessed by word segmentation, part-of-speech tagging, and similar operations to form the source documents; (2) the mutual information of adjacent words across multiple documents is calculated, and the jump information before and after each value in the mutual information sequence is then calculated; (3) the mutual information sequence and the jump information sequence are combined into a two-dimensional mutual information set; (4) a classifier divides the two-dimensional mutual information set into interior points of multi-word expressions and exterior points, and consecutive interior points are selected and linked to build multi-word expressions.
The present invention is further described below with reference to Fig. 1.
In step (1), all texts collected in the document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form an ordered set of candidate words.
Step (2) comprises the following steps in order: (a) calculating the mutual information of all adjacent words across multiple documents; (b) calculating the jump information before and after each value in the mutual information sequence.
In step (a), the mutual information of adjacent words across multiple documents is calculated to form the mutual information sequence MI, in which the mutual information MI_i (0 ≤ i < len(MI) - α) of adjacent words x and y is calculated by the following formula:
where x and y denote adjacent words; MI_i denotes the i-th mutual information value, formed by adjacent words x and y; len(MI) denotes the length of the mutual information sequence MI; α denotes a constant; M denotes the total number of words in all documents; P(x, y) denotes the number of times words x and y co-occur in all documents; P(x) denotes the number of occurrences of word x in all documents; P(y) denotes the number of occurrences of word y in all documents; N denotes the number of documents in the document set; N_{x,y} denotes the number of documents in which x and y co-occur; and the constant α is 2.
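As a rough illustration of step (a), the following Python sketch builds a mutual information sequence from documents that have already been segmented into words. The formula itself is not reproduced in this text, so the sketch assumes plain pointwise mutual information, log(M · P(x, y) / (P(x) · P(y))), computed from raw counts. It does not use the document counts N and N_{x,y}, and the function name adjacent_mutual_information is hypothetical.

```python
import math
from collections import Counter


def adjacent_mutual_information(documents):
    """Return [((x, y), MI)] for every adjacent word pair, in reading order.

    `documents` is a list of token lists (already segmented).
    ASSUMPTION: MI = log(M * P(x, y) / (P(x) * P(y))), i.e. standard
    pointwise mutual information from raw counts. The patent's own
    formula also involves N and N_{x,y} and may differ.
    """
    word_count = Counter()
    pair_count = Counter()
    for tokens in documents:
        word_count.update(tokens)
        pair_count.update(zip(tokens, tokens[1:]))
    total_words = sum(word_count.values())  # M in the text

    sequence = []
    for tokens in documents:
        for x, y in zip(tokens, tokens[1:]):
            mi = math.log(
                total_words * pair_count[(x, y)]
                / (word_count[x] * word_count[y])
            )
            sequence.append(((x, y), mi))
    return sequence


docs = [
    ["machine", "translation", "is", "hard"],
    ["machine", "translation", "works"],
]
mi_sequence = adjacent_mutual_information(docs)
```

Keeping the pairs in reading order matters here, because the later steps compare each value with its neighbour in the sequence.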
In step (b), the jump information before and after each value in the mutual information sequence is calculated to form the jump information sequence f, in which the jump information f_i of adjacent mutual information values is calculated by the following formula:
where f_i denotes the jump information between the current mutual information value and the subsequent mutual information value in the mutual information sequence, and | | denotes the absolute value.
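The jump information of step (b) can be sketched as follows. Since the formula is not reproduced in this text, the sketch assumes the simplest reading of the definitions, f_i = |MI_{i+1} - MI_i|, the absolute difference between the current and the subsequent mutual information value. The function name jump_information is hypothetical, and the role of the constant α in the index range is not modelled.

```python
def jump_information(mi_values):
    """Return the jump sequence f for a list of mutual information values.

    ASSUMPTION: f_i = |MI_{i+1} - MI_i|. The result is one element shorter
    than the input because the last value has no successor.
    """
    return [abs(nxt - cur) for cur, nxt in zip(mi_values, mi_values[1:])]


jumps = jump_information([1.0, 3.0, 2.5])
```

Under this assumption a large f_i marks a sharp change in association strength between neighbouring word pairs, which is the kind of boundary signal the second dimension is meant to carry.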
In step (3), a two-dimensional mutual information point (MI_i, f_i) is built from the values at corresponding positions of the mutual information sequence and the jump information sequence; these points together constitute the two-dimensional mutual information set.
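Step (3) amounts to pairing the two sequences position by position. A minimal sketch, assuming both sequences are plain Python lists and using the hypothetical name two_dimensional_mi:

```python
def two_dimensional_mi(mi_values, jump_values):
    """Pair MI_i with f_i at the same position to form points (MI_i, f_i).

    zip stops at the shorter sequence, so trailing mutual information
    values that have no jump value are dropped.
    """
    return list(zip(mi_values, jump_values))


points = two_dimensional_mi([1.0, 3.0, 2.5], [2.0, 0.5])
```

The truncation is a design choice of this sketch; it is one way to respect an index range that stops short of the end of the mutual information sequence.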
In step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, interior points of multi-word expressions and exterior points, and the adjacent words covered by linked interior points constitute a multi-word expression.
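Step (4) can be illustrated with a small two-cluster k-means written in plain Python. The text does not name the classifier, so k-means is a stand-in here, and treating the cluster with the higher mean mutual information as the interior class is likewise an assumption. The names two_means and link_interior_points are hypothetical; points[i] is taken to describe the adjacent pair (words[i], words[i+1]).

```python
def two_means(points, iterations=20):
    """Cluster 2-D points into two groups; return (labels, centroids)."""
    # Deterministic start: points with the lowest and highest first coordinate.
    centroids = [min(points), max(points)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid (squared distance).
        labels = [
            min((0, 1), key=lambda k: (p[0] - centroids[k][0]) ** 2
                                      + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels, centroids


def link_interior_points(words, points):
    """Link runs of consecutive interior points into multi-word expressions."""
    if not points:
        return []
    labels, centroids = two_means(points)
    # ASSUMPTION: the cluster with the higher mean MI holds the interior points.
    interior = 0 if centroids[0][0] > centroids[1][0] else 1
    expressions, start = [], None
    for i, lab in enumerate(labels):
        if lab == interior and start is None:
            start = i                      # a run of interior points begins
        elif lab != interior and start is not None:
            expressions.append(words[start:i + 1])
            start = None                   # the run ended at point i - 1
    if start is not None:
        expressions.append(words[start:len(labels) + 1])
    return expressions


words = ["we", "use", "statistical", "machine", "translation", "today"]
points = [(0.5, 0.2), (0.4, 0.3), (3.0, 0.1), (3.1, 0.2), (0.3, 0.1)]
expressions = link_interior_points(words, points)
```

Because a run of several interior points is linked into one expression, combinations of more than two words come out in a single pass, without merging two-word units step by step.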
As shown in Fig. 2, the device of the present invention comprises: a candidate word acquisition device, which preprocesses all texts collected in the document library by Chinese word segmentation, part-of-speech tagging, named entity recognition, part-of-speech selection, and similar operations to form an ordered set of candidate words; a mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words across multiple documents and, from adjacent mutual information values, calculates the jump information before and after each value in the mutual information sequence; a two-dimensional mutual information acquisition device, which pairs the mutual information and jump information at corresponding positions of the two sequences to form two-dimensional mutual information; and a classification and screening device, which uses a classifier to divide all points in the two-dimensional mutual information set into two classes, interior points of multi-word expressions and exterior points, and links the adjacent words that have interior points to form multi-word expressions.
In summary, the present invention converts the mutual information between adjacent words into two-dimensional mutual information and screens out multi-word expressions by clustering it. This avoids the need to set a threshold manually for one-dimensional mutual information and its poor adaptability to different data. The method is also not limited to two-word structures, obtains multi-word expressions combining several words in a single pass, and does not need to be carried out in stages, which effectively improves the utilization rate of multi-word expressions and the accuracy of the constructed multi-word expression library.
Claims (9)
1. A multi-word expression extraction method, characterized in that the method comprises the following steps in order:
(1) the document library is preprocessed by word segmentation and part-of-speech tagging to form the source documents;
(2) the mutual information of adjacent words across multiple documents is calculated, and the jump information before and after each value in the mutual information sequence is then calculated;
(3) the mutual information sequence and the jump information sequence are combined into a two-dimensional mutual information set;
(4) a classifier divides the two-dimensional mutual information set into interior points of multi-word expressions and exterior points, and consecutive interior points are selected and linked to build multi-word expressions.
2. The method according to claim 1, characterized in that: in step (1), all documents of the document library are preprocessed by Chinese word segmentation, part-of-speech tagging, named entity recognition, and part-of-speech selection to form an ordered set of candidate words.
3. The method according to claim 1, characterized in that step (2) comprises the following steps in order:
(a) calculating the mutual information of all adjacent words across multiple documents;
(b) calculating the jump information before and after each value in the mutual information sequence.
4. The method according to claim 1, characterized in that: in step (3), a two-dimensional mutual information point (MI_i, f_i) is built from the values at corresponding positions of the mutual information sequence and the jump information sequence, and these points together constitute the two-dimensional mutual information set.
5. The method according to claim 1, characterized in that: in step (4), a classifier divides all points in the two-dimensional mutual information set into two classes, interior points of multi-word expressions and exterior points, and the adjacent words covered by linked interior points constitute a multi-word expression.
6. The method according to claim 3, characterized in that: in step (a), the mutual information of adjacent words across multiple documents is calculated to form the mutual information sequence MI, in which the mutual information MI_i (0 ≤ i < len(MI) - α) of adjacent words x and y is calculated by the following formula:
where x and y denote adjacent words; MI_i denotes the i-th mutual information value, formed by adjacent words x and y; len(MI) denotes the length of the mutual information sequence MI; α denotes a constant; M denotes the total number of words in all documents; P(x, y) denotes the number of times words x and y co-occur in all documents; P(x) denotes the number of occurrences of word x in all documents; P(y) denotes the number of occurrences of word y in all documents; N denotes the number of documents in the document set; and N_{x,y} denotes the number of documents in which x and y co-occur.
7. The method according to claim 3, characterized in that: in step (b), the jump information before and after each value in the mutual information sequence is calculated to form the jump information sequence f, in which the jump information f_i of adjacent mutual information values is calculated by the following formula:
where f_i denotes the jump information between the current mutual information value and the subsequent mutual information value in the mutual information sequence, and | | denotes the absolute value.
8. The method according to claim 6, characterized in that α is 2.
9. A multi-word expression extraction device, comprising:
a candidate word acquisition device, which preprocesses all documents collected in the document library by Chinese word segmentation, part-of-speech tagging, named entity recognition, part-of-speech selection, and similar operations to form an ordered set of candidate words;
a mutual information and jump information acquisition device, which calculates the mutual information of adjacent candidate words across multiple documents and, from adjacent mutual information values, calculates the jump information before and after each value in the mutual information sequence;
a two-dimensional mutual information acquisition device, which pairs the mutual information and jump information at corresponding positions of the two sequences to form two-dimensional mutual information;
a classification and screening device, which uses a classifier to divide all points in the two-dimensional mutual information set into two classes, interior points of multi-word expressions and exterior points, and links the adjacent words that have interior points to form multi-word expressions.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610990921.6A CN106649263A (en) | 2016-11-10 | 2016-11-10 | Multi-word expression extraction method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610990921.6A CN106649263A (en) | 2016-11-10 | 2016-11-10 | Multi-word expression extraction method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106649263A true CN106649263A (en) | 2017-05-10 |
Family
ID=58806046
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610990921.6A Pending CN106649263A (en) | 2016-11-10 | 2016-11-10 | Multi-word expression extraction method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649263A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108549631A (en) * | 2018-03-30 | 2018-09-18 | 北京智慧正安科技有限公司 | Noun dictionary extracting method, electronic device and computer readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040044528A1 (en) * | 2002-09-03 | 2004-03-04 | Chelba Ciprian I. | Method and apparatus for generating decision tree questions for speech processing |
CN1567297A (en) * | 2003-07-03 | 2005-01-19 | 中国科学院声学研究所 | Method for extracting multi-word translation equivalent cells from bilingual corpus automatically |
US20050216443A1 (en) * | 2000-07-06 | 2005-09-29 | Streamsage, Inc. | Method and system for indexing and searching timed media information based upon relevance intervals |
JP2006178536A (en) * | 2004-12-20 | 2006-07-06 | Oki Electric Ind Co Ltd | Parallel translation expression extraction device |
CN106095736A (en) * | 2016-06-07 | 2016-11-09 | 华东师范大学 | A kind of method of field neologisms extraction |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107463607B (en) | Method for acquiring and organizing upper and lower relations of domain entities by combining word vectors and bootstrap learning | |
Ljubešić et al. | {bs, hr, sr} wac-web corpora of Bosnian, Croatian and Serbian | |
CN106844346A (en) | Short text Semantic Similarity method of discrimination and system based on deep learning model Word2Vec | |
Shindo et al. | Bayesian symbol-refined tree substitution grammars for syntactic parsing | |
CN101261623A (en) | Word splitting method and device for word border-free mark language based on search | |
CN101303692B (en) | All-purpose numeral semantic library for translation of mechanical language | |
CN107391486A (en) | A kind of field new word identification method based on statistical information and sequence labelling | |
CN106569993A (en) | Method and device for mining hypernym-hyponym relation between domain-specific terms | |
CN105068997B (en) | The construction method and device of parallel corpora | |
CN105975454A (en) | Chinese word segmentation method and device of webpage text | |
CN106610951A (en) | Improved text similarity solving algorithm based on semantic analysis | |
CN107153640A (en) | A kind of segmenting method towards elementary mathematics field | |
CN109002473A (en) | A kind of sentiment analysis method based on term vector and part of speech | |
CN106611041A (en) | New text similarity solution method | |
CN107943786A (en) | A kind of Chinese name entity recognition method and system | |
Chang | A new approach for automatic Chinese spelling correction | |
CN112948543A (en) | Multi-language multi-document abstract extraction method based on weighted TextRank | |
CN104573030A (en) | Textual emotion prediction method and device | |
CN101763403A (en) | Query translation method facing multi-lingual information retrieval system | |
Wong et al. | isentenizer-: Multilingual sentence boundary detection model | |
CN108595413B (en) | Answer extraction method based on semantic dependency tree | |
Millour et al. | Unsupervised data augmentation for less-resourced languages with no standardized spelling | |
Bhat | Morpheme segmentation for kannada standing on the shoulder of giants | |
Kanjirangat et al. | Optimizing the size of subword vocabularies in dialect classification | |
CN106649263A (en) | Multi-word expression extraction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20170510 |