CN107133207A - Information extraction method and device - Google Patents
Information extraction method and device
- Publication number
- CN107133207A (CN201610108274.1A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application discloses an information extraction method and device. The method includes: obtaining original information; performing word segmentation on the original information to obtain each segmented word it contains; determining, according to the order of the segmented words in the original information, a word sequence composed of the segmented words; determining, according to the word sequence, the observation feature sequence corresponding to each segmented word; determining, according to those observation feature sequences and through a pre-established conditional random field model, the classification label of each segmented word that maximizes the joint probability of the observation feature sequences; and extracting the required information from the original information according to the determined classification labels. With this method, the information a user needs can be extracted effectively whether or not the information to be processed follows a fixed information format, which improves the accuracy of information extraction.
Description
Technical field
This application relates to the field of computer technology, and in particular to an information extraction method and device.
Background
With the continuous development of network technology, users can obtain more and more information. Usually, however, only part of the information obtained is what the user actually needs, so that part has to be extracted. For example, when court judgments are being organized, information such as the litigants, the time of litigation, and the judgment result usually has to be sorted out, and therefore has to be extracted from each judgment.
At present, information in most fields is presented to users in some information format. In the prior art, therefore, a template for information extraction can be set in advance when the information a user needs is to be extracted. The template carries a fixed information format, and the required information can be extracted from the information to be processed by means of the template.
However, a template is generally fixed once it has been designed. If the information to be processed is not in the standard form, extraction accuracy drops. For example, suppose the format designed in the template is "plaintiff: XXX": once the server identifies "plaintiff:" in the information to be processed, it extracts whatever follows "plaintiff:". When the information instead reads "the plaintiff is XX", the server can neither recognize nor extract the plaintiff's name. Moreover, information in some fields has no specific information format at all, so it cannot be extracted by setting a template.
Summary of the invention
The embodiments of this application provide an information extraction method and device to solve the problem of low information extraction accuracy in the prior art.
An information extraction method provided by an embodiment of this application includes:
obtaining original information;
performing word segmentation on the original information to obtain each segmented word in the original information;
determining, according to the order of the segmented words in the original information, a word sequence composed of the segmented words;
determining, according to the word sequence, the observation feature sequence corresponding to each segmented word;
determining, according to the observation feature sequence corresponding to each segmented word and through a pre-established conditional random field model, the classification label of each segmented word that maximizes the joint probability of the observation feature sequences;
extracting the required information from the original information according to the determined classification label of each segmented word.
An information extraction device provided by an embodiment of this application includes:
an acquisition module, configured to obtain original information;
a word segmentation module, configured to perform word segmentation on the original information to obtain each segmented word in the original information;
a word sequence determining module, configured to determine, according to the order of the segmented words in the original information, a word sequence composed of the segmented words;
a feature sequence determining module, configured to determine, according to the word sequence, the observation feature sequence corresponding to each segmented word;
a classification label determining module, configured to determine, according to the observation feature sequence corresponding to each segmented word and through a pre-established conditional random field model, the classification label of each segmented word that maximizes the joint probability of the observation feature sequences;
an extraction module, configured to extract the required information from the original information according to the determined classification label of each segmented word.
The embodiments of this application provide an information extraction method and device. The method obtains original information, performs word segmentation on it to obtain each segmented word, determines the word sequence composed of the segmented words according to their order in the original information, determines the observation feature sequence corresponding to each segmented word according to the word sequence, determines through a pre-established conditional random field model the classification label of each segmented word that maximizes the joint probability of the observation feature sequences, and extracts the required information from the original information according to the determined classification labels. With this method, the information a user needs can be extracted effectively whether or not the information to be processed follows a fixed information format, which improves the accuracy of information extraction.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of this application and form part of it. The illustrative embodiments of this application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the information extraction process provided by an embodiment of this application;
Fig. 2 is a schematic structural diagram of the information extraction device provided by an embodiment of this application.
Detailed description
To make the purpose, technical solutions, and advantages of this application clearer, the technical solutions of this application are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are evidently only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of this application.
Fig. 1 shows the information extraction process provided by an embodiment of this application, which includes the following steps:
S101: Obtain original information.
In the embodiments of this application, extracting the information a user needs first requires obtaining the original information. The original information may be obtained by a server or by another device with data processing capability. The original information contains the information the user needs, which the server has to extract, and the original information includes text information.
For example, legal staff need to extract the defendant's name, sex, and date of birth from a large number of electronic legal documents (which contain text information). The server therefore obtains the information contained in the electronic legal documents (that is, the original information). To keep the description simple and avoid unnecessary steps, it is assumed here that each legal document contains only "Defendant: XX, sex: X, born on XXXX (year) XX (month) XX (day)", and a single legal document is used as the example.
S102: Perform word segmentation on the original information to obtain each segmented word in the original information.
In practice, the information a user needs is often only a word or a character within a sentence. In this application, therefore, word segmentation can be performed on the obtained original information so that a classification label can later be determined for each segmented word and the information corresponding to the required classification labels can then be extracted.
During word segmentation, treating every single character as a segmented word would make the amount of computation very large, while treating too many characters as one segmented word would reduce extraction accuracy. In this application, therefore, characters can be grouped into complete words according to language habits. For example, segmenting "we love China" yields the segmented words "we", "love", and "China". In addition, if the original information contains punctuation marks, each punctuation mark can be taken out separately as a segmented word.
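The segmentation step can be illustrated with a small sketch. The application does not name a segmentation algorithm, so the forward maximum matching used below, the toy dictionary, and the names `segment`, `TOY_DICTIONARY`, and `PUNCTUATION` are all assumptions made for illustration. Unspaced lowercase text stands in for a language written without spaces between words.

```python
# Hypothetical toy lexicon; a real system would use a full dictionary.
TOY_DICTIONARY = {"we", "love", "china"}
# Punctuation marks that are always emitted as their own segmented word.
PUNCTUATION = {",", ":", ".", "?", "!"}


def segment(text, dictionary=TOY_DICTIONARY, punctuation=PUNCTUATION):
    """Split unspaced text with forward maximum matching.

    At each position the longest dictionary word is taken; punctuation is
    emitted separately; an unknown character becomes a one-character word.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        if text[i] in punctuation:
            words.append(text[i])
            i += 1
            continue
        # Try the longest candidate first and fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


print(segment("welovechina"))
print(segment("we,china"))
```

Grouping characters into dictionary words keeps the sequence short, while the single-character fallback guarantees that the loop always advances.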
S103: Determine, according to the order of the segmented words in the original information, the word sequence composed of the segmented words.
In the embodiments of this application, after obtaining and segmenting the original information, the server determines the sequence composed of the segmented words according to their order in the original information.
Furthermore, in natural language the same word may play different roles in different sentences. When it does, its meaning differs and so does its part of speech. For example, in "Can you tell me the process of what happened? Today I passed by your doorway", the words rendered as "process" and "passed by" are the same word in the original language: it is a noun in the first sentence and a verb in the second, and the phrase "your doorway" that directly follows the verb could not appear after the noun in the first sentence. To improve extraction accuracy, therefore, after determining the word sequence the server can determine the part of speech of each segmented word in it. A segmented word may carry several different kinds of part-of-speech tag; Table 1 lists only two, the first part of speech and the second part of speech. The server may also determine the part of speech of each segmented word right after word segmentation. For a segmented word that is a punctuation mark, the part of speech can be denoted by w.
Continuing the example above, assume the information contained in the legal document obtained by the server is "Defendant: Zhang San, sex: male, born on October 21, 1985". The server segments this text in the manner described above, determines the word sequence from the order of the segmented words in the text, and determines the part of speech (that is, the first and second part of speech) of each segmented word in the sequence, which gives the data shown in Table 1:
Segmented word | First part of speech | Second part of speech |
Defendant | n | /basic word-Chinese |
: | w | other |
Zhang San | n | /name-Chinese personal name |
, | w | other |
Sex | n | /basic word-Chinese |
: | w | other |
Male | n | /product type qualifier |
, | w | other |
October 21, 1985 | n | /DATE |
Born | vi | /basic word-Chinese |
Table 1
S104: Determine, according to the word sequence, the observation feature sequence corresponding to each segmented word.
Because this application is implemented with a conditional random field model, after determining the word sequence the server needs to determine the observation feature sequence corresponding to each segmented word in it.
Throughout this step, each time the server reads a segmented word from the word sequence it determines one observation feature sequence by means of pre-established feature templates, until the observation feature sequences of all segmented words have been determined. As an example, this application gives the following five feature templates:
#Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
In addition, to improve extraction accuracy, the server also determines the part of speech of each segmented word after determining the word sequence. The part of speech of a segmented word also shapes its observation feature sequence and thus which label the segmented word ultimately receives. Therefore, when the observation feature sequence of each segmented word is determined, the part of speech has to be taken into account as well. Specifically, the observation feature sequence of each segmented word is determined from the word content, the word order, and the part of speech in the word sequence, where word content refers to the segmented words themselves and word order refers to their order in the sequence.
Continuing the example above, assume the five feature templates given above are used. After reading the word sequence "Defendant", ":", "Zhang San", ",", "Sex", ":", "Male", ",", "October 21, 1985", "Born", the server generates the observation feature sequences shown in Table 2:
Table 2
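How the five unigram templates turn a word sequence into observation features can be sketched as follows. This is an illustration under assumptions, not the application's implementation: it reads `%x[row,col]` as "the value in column `col` of the word `row` positions away from the current word", and the `_PAD` placeholder for positions outside the sequence, like the names `expand` and `observation_features`, is hypothetical.

```python
import re

TEMPLATES = ["U00:%x[-2,0]", "U01:%x[-1,0]", "U02:%x[0,0]",
             "U03:%x[1,0]", "U04:%x[2,0]"]
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, position, pad="_PAD"):
    """Replace each %x[row,col] macro with the referenced cell of `rows`."""
    def substitute(match):
        row = position + int(match.group(1))  # offset relative to current word
        col = int(match.group(2))             # column: 0 = word, 1 and 2 = parts of speech
        return rows[row][col] if 0 <= row < len(rows) else pad
    return MACRO.sub(substitute, template)


def observation_features(rows, templates=TEMPLATES):
    """Return one list of expanded features per segmented word."""
    return [[expand(t, rows, i) for t in templates] for i in range(len(rows))]


# Word sequence of the running example: (word, first POS, second POS).
ROWS = [
    ("Defendant", "n", "basic"), (":", "w", "other"),
    ("Zhang San", "n", "name"), (",", "w", "other"),
    ("Sex", "n", "basic"), (":", "w", "other"),
    ("Male", "n", "qualifier"), (",", "w", "other"),
    ("October 21, 1985", "n", "DATE"), ("Born", "vi", "basic"),
]

for word_features in observation_features(ROWS):
    print(word_features)
```

Templates that reference column 1 or 2 instead of column 0 would bring the parts of speech into the features in the same way.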
S105: Determine, according to the observation feature sequence corresponding to each segmented word and through the pre-established conditional random field model, the classification label of each segmented word that maximizes the joint probability of the observation feature sequences.
In this application, the segmented words are later classified by means of classification labels, and the information corresponding to the classification labels of the required information is extracted. The kinds of classification label are set in advance. After the word sequence and the observation feature sequences have been determined, therefore, the classification label of each segmented word in the word sequence has to be determined.
This application determines those labels with a conditional random field model. The core of such a model is to determine the joint probability of an output sequence, given an input sequence (the observation feature sequences of all segmented words in the word sequence) and the output sequence (the classification labels of the segmented words arranged in the order of the words, that is, the classification label sequence). The larger the joint probability, the more likely the output sequence is correct; the smaller the joint probability, the less likely. When determining the classification label of each segmented word, therefore, the server can directly determine, for the given input sequence, the output sequence with the largest joint probability, and thereby the classification label of each segmented word.
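Finding the output sequence with the largest probability is commonly done with the Viterbi algorithm. The application does not name a decoding algorithm, so the sketch below is an assumption, and the scoring function `toy_score` with its hand-set values merely stands in for a trained model. Because the normalization term is the same for every label sequence and the exponential is monotonic, maximizing the summed score also maximizes the probability.

```python
LABELS = ["0", "dName", "dSex", "dBirthday"]

# Second part of speech of each word in the running example (abbreviated).
SECOND_POS = ["basic", "other", "name", "other", "basic",
              "other", "qualifier", "other", "DATE", "basic"]
# Hypothetical hand-set hints standing in for learned feature weights.
HINT = {"name": "dName", "qualifier": "dSex", "DATE": "dBirthday"}


def toy_score(prev, cur, i):
    """Score of assigning label `cur` at position i after label `prev`."""
    score = 2.0 if cur == HINT.get(SECOND_POS[i], "0") else 0.0
    if prev == cur and cur != "0":
        score -= 1.0  # discourage repeating a field label
    return score


def viterbi(n, labels, score, start="<START>"):
    """Return the length-n label sequence with the highest total score."""
    if n == 0:
        return []
    best = [{y: score(start, y, 0) for y in labels}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for y in labels:
            prev = max(labels, key=lambda p: best[i - 1][p] + score(p, y, i))
            best[i][y] = best[i - 1][prev] + score(prev, y, i)
            back[i][y] = prev
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[n - 1][y])]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path


print(viterbi(len(SECOND_POS), LABELS, toy_score))
```

Dynamic programming keeps the cost proportional to the sequence length times the square of the number of labels, instead of enumerating every possible label sequence.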
Continuing the example, assume the pre-established kinds of classification label are: 0 (information that does not need attention), dName (name information of interest), dSex (sex information of interest), and dBirthday (birthday information of interest). From the observation feature sequences in Table 2 and through the pre-established conditional random field model, the server determines the classification label sequence that maximizes the joint probability of those observation feature sequences, and thereby the classification label of each segmented word. Assume the labels determined are as shown in Table 3:
Segmented word | First part of speech | Second part of speech | Classification label |
Defendant | n | /basic word-Chinese | 0 |
: | w | other | 0 |
Zhang San | n | /name-Chinese personal name | dName |
, | w | other | 0 |
Sex | n | /basic word-Chinese | 0 |
: | w | other | 0 |
Male | n | /product type qualifier | dSex |
, | w | other | 0 |
October 21, 1985 | n | /DATE | dBirthday |
Born | vi | /basic word-Chinese | 0 |
Table 3
The classification label sequence finally obtained is therefore "0", "0", "dName", "0", "0", "0", "dSex", "0", "dBirthday", "0".
S106: Extract the required information from the original information according to the determined classification label of each segmented word.
In this application, after determining the classification label sequence that maximizes the joint probability for the word sequence, the server can determine the classification labels corresponding to the information the user needs, take them as the specified labels, and extract from the original information the information corresponding to the specified labels.
Continuing the example, assume the classification labels corresponding to the information the user needs are dName, dSex, and dBirthday, so the server takes these three as the specified labels. Having determined that the classification label sequence maximizing the joint probability of the observation feature sequences in Table 2 is "0", "0", "dName", "0", "0", "0", "dSex", "0", "dBirthday", "0", the server extracts "Zhang San" for dName, "Male" for dSex, and "October 21, 1985" for dBirthday.
With the above method, the information a user needs can be extracted effectively from the information to be processed whether or not that information follows a fixed information format, which improves the accuracy of information extraction.
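The extraction step amounts to pairing each segmented word with its classification label and keeping the words whose label is one of the specified labels. A minimal sketch, in which the function name `extract` and the dictionary-of-lists return shape are assumptions:

```python
def extract(words, labels, specified):
    """Collect the segmented words whose label is among the specified labels."""
    result = {}
    for word, label in zip(words, labels):
        if label in specified:
            # A label may cover several words, so values are lists.
            result.setdefault(label, []).append(word)
    return result


WORDS = ["Defendant", ":", "Zhang San", ",", "Sex", ":", "Male", ",",
         "October 21, 1985", "Born"]
LABEL_SEQUENCE = ["0", "0", "dName", "0", "0", "0", "dSex", "0",
                  "dBirthday", "0"]

print(extract(WORDS, LABEL_SEQUENCE, {"dName", "dSex", "dBirthday"}))
```

Because the lookup depends only on the labels, the same code works whether the input read "Defendant: Zhang San" or "the defendant is Zhang San", which is the advantage over fixed templates that the description claims.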
In addition, for determining, through the pre-established conditional random field model, the classification label of each segmented word that maximizes the joint probability of the observation feature sequences, this application also gives the core formula of the conditional random field model. Specifically, the server determines, according to the formula
$$P(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\left(\sum_{i} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i)\right)$$
the classification label of each segmented word that maximizes $P(y \mid x, \lambda)$, where $Z(x)$ is the normalization function, $\lambda_j$ is the weight of the j-th feature function, $f_j$ is the j-th feature function in the conditional random field model, $y_{i-1}$ is the classification label of the (i-1)-th segmented word in the word sequence, $y_i$ is the classification label of the i-th segmented word in the word sequence, and $x$ is the observation feature sequences corresponding to the segmented words.
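The role of the normalization function can be made concrete by brute force on a very short sequence. The sketch below assumes the usual linear-chain form, in which each feature function sees the previous label, the current label, the observations, and the position; the two feature functions and their weights are invented for illustration and are not taken from the application.

```python
import itertools
import math


def f_name(prev, cur, x, i):
    """Fires when the word 'Zhang San' is labelled dName."""
    return 1.0 if x[i] == "Zhang San" and cur == "dName" else 0.0


def f_stay_outside(prev, cur, x, i):
    """Fires when two consecutive words are both labelled 0."""
    return 1.0 if prev == "0" and cur == "0" else 0.0


FEATURES = [f_name, f_stay_outside]
WEIGHTS = [2.0, 0.5]  # hypothetical lambda_j values


def sequence_score(y, x, features, weights, start="<START>"):
    """Sum over positions i and features j of lambda_j * f_j(y[i-1], y[i], x, i)."""
    total, prev = 0.0, start
    for i, cur in enumerate(y):
        total += sum(w * f(prev, cur, x, i) for w, f in zip(weights, features))
        prev = cur
    return total


def probability(y, x, labels, features, weights):
    """P(y | x): exp(score of y) divided by Z(x), the sum over all label sequences."""
    z = sum(math.exp(sequence_score(c, x, features, weights))
            for c in itertools.product(labels, repeat=len(x)))
    return math.exp(sequence_score(tuple(y), x, features, weights)) / z


x = ["Defendant", ":", "Zhang San"]
labels = ["0", "dName"]
for y in itertools.product(labels, repeat=len(x)):
    print(y, round(probability(y, x, labels, FEATURES, WEIGHTS), 4))
```

Enumerating every label sequence is only feasible for toy inputs; it is shown here because it mirrors the formula term by term.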
Further, this application also provides a way of establishing the conditional random field model. The server can later use the established model directly: given the observation feature sequences of all segmented words in a word sequence (that is, the input sequence), it determines the classification label of each segmented word that maximizes the joint probability of the observation feature sequences.
The conditional random field model is established as follows: sample information is obtained in advance and segmented to obtain the segmented words in the sample information; a sample sequence composed of the segmented words is determined according to their order in the sample information; the part of speech of each segmented word in the sample sequence is determined; and the conditional random field model is trained from the order of the segmented words in the sample sequence, their parts of speech, and their known classification labels.
It should be noted that training the conditional random field model specifically means training $\lambda_j$ and $f_j$ in the core formula
$$P(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\left(\sum_{i} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i)\right)$$
from the order of the segmented words in the sample sequence, their parts of speech, and their known classification labels. All feature functions f are determined from the order of the segmented words in the sample sequence, their parts of speech, their known classification labels, and the pre-established feature templates. The feature templates used in training are the same as those in step S104: the server obtains all feature functions f of the conditional random field model from the five feature templates above together with the order, parts of speech, and known classification labels of the segmented words in the sample sequence.
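Labelled samples for training are often written as one segmented word per line, with the known classification label in the last column and a blank line between samples. The application does not specify a file format or toolkit, so this column layout, the tab separator, the replacement of spaces inside a word with underscores, and the name `to_training_rows` are all assumptions made for illustration.

```python
def to_training_rows(words, first_pos, second_pos, labels):
    """Serialize one labelled sample: word, two POS columns, label last."""
    if not (len(words) == len(first_pos) == len(second_pos) == len(labels)):
        raise ValueError("all columns must have the same length")
    lines = []
    for columns in zip(words, first_pos, second_pos, labels):
        # Spaces inside a cell would be read as column breaks, so replace them.
        lines.append("\t".join(cell.replace(" ", "_") for cell in columns))
    # A trailing blank line marks the end of the sample.
    return "\n".join(lines) + "\n\n"


sample = to_training_rows(
    ["Defendant", ":", "Zhang San"],
    ["n", "w", "n"],
    ["basic", "other", "name"],
    ["0", "0", "dName"],
)
print(sample, end="")
```

Keeping the word order, both part-of-speech columns, and the known label on each row preserves exactly the three inputs that the description lists for training.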
The above is the information extraction method provided by the embodiments of this application. Based on the same idea, the embodiments of this application also provide an information extraction device, as shown in Fig. 2.
Fig. 2 is a schematic structural diagram of the information extraction device provided by an embodiment of this application. The device includes:
an acquisition module 201, configured to obtain original information;
a word segmentation module 202, configured to perform word segmentation on the original information to obtain each segmented word in the original information;
a word sequence determining module 203, configured to determine, according to the order of the segmented words in the original information, the word sequence composed of the segmented words;
a feature sequence determining module 204, configured to determine, according to the word sequence, the observation feature sequence corresponding to each segmented word;
a classification label determining module 205, configured to determine, according to the observation feature sequence corresponding to each segmented word and through a pre-established conditional random field model, the classification label of each segmented word that maximizes the joint probability of the observation feature sequences;
an extraction module 206, configured to extract the required information from the original information according to the determined classification label of each segmented word.
The device further includes:
a part-of-speech determining module 207, configured to determine the part of speech of each segmented word in the word sequence before the feature sequence determining module 204 determines the observation feature sequence corresponding to each segmented word.
The feature sequence determining module 204 is specifically configured to determine the observation feature sequence corresponding to each segmented word from the word content, word order, and part of speech in the word sequence.
The classification label determining module 205 is specifically configured to obtain sample information in advance, segment the sample information to obtain its segmented words, determine the sample sequence composed of the segmented words according to their order in the sample information, determine the part of speech of each segmented word in the sample sequence, and train the conditional random field model from the order of the segmented words in the sample sequence, their parts of speech, and their known classification labels.
The classification label determining module 205 is specifically configured to determine, according to the formula
$$P(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\left(\sum_{i} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i)\right)$$
the classification label of each segmented word that maximizes $P(y \mid x, \lambda)$, where $Z(x)$ is the normalization function, $\lambda_j$ is the weight of the j-th feature function, $f_j$ is the j-th feature function in the conditional random field model, $y_{i-1}$ is the classification label of the (i-1)-th segmented word in the word sequence, $y_i$ is the classification label of the i-th segmented word in the word sequence, and $x$ is the observation feature sequences corresponding to the segmented words.
The extraction module 206 is specifically configured to determine the classification labels corresponding to the required information, take them as the specified labels, and extract from the original information the information corresponding to the specified labels.
In a typical configuration, computing device includes one or more processors (CPU), input/output interface, net
Network interface and internal memory.
Internal memory potentially includes the volatile memory in computer-readable medium, random access memory (RAM) and/or
The forms such as Nonvolatile memory, such as read-only storage (ROM) or flash memory (flash RAM).Internal memory is computer-readable medium
Example.
Computer-readable medium includes permanent and non-permanent, removable and non-removable media can be by any method
Or technology come realize information store.Information can be computer-readable instruction, data structure, the module of program or other data.
The example of the storage medium of computer includes, but are not limited to phase transition internal memory (PRAM), static RAM (SRAM), moved
State random access memory (DRAM), other kinds of random access memory (RAM), read-only storage (ROM), electric erasable
Programmable read only memory (EEPROM), fast flash memory bank or other memory techniques, read-only optical disc read-only storage (CD-ROM),
Digital versatile disc (DVD) or other optical storages, magnetic cassette tape, the storage of tape magnetic rigid disk or other magnetic storage apparatus
Or any other non-transmission medium, the information that can be accessed by a computing device available for storage.Define, calculate according to herein
Machine computer-readable recording medium does not include temporary computer readable media (transitory media), such as data-signal and carrier wave of modulation.
It should also be noted that, term " comprising ", "comprising" or its any other variant are intended to nonexcludability
Comprising so that process, method, commodity or equipment including a series of key elements are not only including those key elements, but also wrap
Include other key elements being not expressly set out, or also include for this process, method, commodity or equipment intrinsic want
Element.In the absence of more restrictions, the key element limited by sentence "including a ...", it is not excluded that wanted including described
Also there is other identical element in process, method, commodity or the equipment of element.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.
The foregoing describes only embodiments of the present application and is not intended to limit the present application. Those skilled in the art may make various modifications and variations to the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.
Claims (10)
1. An information extraction method, characterized in that the method comprises:
obtaining raw information;
performing word segmentation on the raw information to obtain each word segment in the raw information;
determining, according to the order of the word segments in the raw information, a segmentation sequence composed of the word segments;
determining, according to the segmentation sequence, the observation feature sequence corresponding to each word segment;
determining, according to the observation feature sequences corresponding to the word segments and by means of a pre-established conditional random field model, the classification label of each word segment that maximizes the joint probability of the observation feature sequences;
extracting the required information from the raw information according to the determined classification label of each word segment.
2. The method according to claim 1, characterized in that, before determining the observation feature sequence corresponding to each word segment, the method further comprises:
determining the part of speech of each word segment in the segmentation sequence;
wherein determining, according to the segmentation sequence, the observation feature sequence corresponding to each word segment specifically comprises:
determining the observation feature sequence corresponding to each word segment according to the word segment content, the word segment order, and the word segment part of speech in the segmentation sequence.
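As an illustration of claim 2, the following minimal sketch builds one observation feature record per word segment from its content, its position in the sequence, and its part of speech. The function name `observation_features`, the feature keys, the `<BOS>`/`<EOS>` padding symbols, and the one-token context window are assumptions made for this example; the claim does not prescribe a concrete feature layout.

```python
from typing import Dict, List, Tuple


def observation_features(tokens: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build one feature dict per (word, part_of_speech) pair.

    Each dict records the word content, its position (word order), and its
    part of speech, plus the same information for the neighbouring tokens.
    The neighbour window is an assumption of this sketch.
    """
    features: List[Dict[str, str]] = []
    last = len(tokens) - 1
    for i, (word, pos) in enumerate(tokens):
        feat = {"word": word, "pos": pos, "position": str(i)}
        # Previous token, or a hypothetical begin-of-sequence marker.
        if i > 0:
            feat["prev_word"], feat["prev_pos"] = tokens[i - 1]
        else:
            feat["prev_word"] = feat["prev_pos"] = "<BOS>"
        # Next token, or a hypothetical end-of-sequence marker.
        if i < last:
            feat["next_word"], feat["next_pos"] = tokens[i + 1]
        else:
            feat["next_word"] = feat["next_pos"] = "<EOS>"
        features.append(feat)
    return features


if __name__ == "__main__":
    # Hypothetical segmented and POS-tagged input.
    for feat in observation_features([("invoice", "n"), ("number", "n"), ("12345", "m")]):
        print(feat)
```

Records of this shape are what a conditional random field toolkit would typically consume, one list of records per segmentation sequence.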
3. The method according to claim 1, characterized in that pre-establishing the conditional random field model specifically comprises:
obtaining sample information in advance;
performing word segmentation on the sample information to obtain each word segment in the sample information;
determining, according to the order of the word segments in the sample information, a sample sequence composed of the word segments;
determining the part of speech of each word segment in the sample sequence;
training the conditional random field model according to the order of the word segments in the sample sequence, the part of speech of each word segment, and the known classification label of each word segment.
4. The method according to claim 1, characterized in that determining the classification label of each word segment that maximizes the joint probability of the observation feature sequences specifically comprises:
determining, according to the formula

$$P(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \right),$$

the classification label of each word segment that maximizes $P(y \mid x, \lambda)$, where $Z(x)$ denotes the normalization function, $\lambda_j$ denotes the weight corresponding to the j-th feature function, $f_j$ denotes the j-th feature function in the conditional random field model, $y_{i-1}$ denotes the classification label corresponding to the (i-1)-th word segment in the segmentation sequence, $y_i$ denotes the classification label corresponding to the i-th word segment in the segmentation sequence, and $x$ denotes the observation feature sequences corresponding to the word segments.
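The formula of claim 4 can be evaluated directly on a toy example. The sketch below is not the patented implementation: the label set, the four feature functions, their weights, and the `<START>` symbol standing in for the label before the first word segment are all hypothetical, and the observation sequence x is simplified to the raw tokens. The normalization function Z(x) and the maximizing label sequence are found by brute-force enumeration, which is only feasible for very short sequences; practical systems generally use the forward algorithm and Viterbi decoding instead.

```python
import itertools
import math
from typing import Callable, List, Sequence, Tuple

LABELS: List[str] = ["O", "KEY", "VAL"]  # hypothetical classification labels
START = "<START>"                        # assumed label preceding position 0

# A feature function f_j(y_prev, y_cur, x, i) returning 0.0 or 1.0.
FeatureFn = Callable[[str, str, Sequence[str], int], float]


def f_start_is_other(y_prev: str, y_cur: str, x: Sequence[str], i: int) -> float:
    return 1.0 if y_prev == START and y_cur == "O" else 0.0


def f_word_number_is_key(y_prev: str, y_cur: str, x: Sequence[str], i: int) -> float:
    return 1.0 if x[i] == "number" and y_cur == "KEY" else 0.0


def f_digits_are_val(y_prev: str, y_cur: str, x: Sequence[str], i: int) -> float:
    return 1.0 if x[i].isdigit() and y_cur == "VAL" else 0.0


def f_key_then_val(y_prev: str, y_cur: str, x: Sequence[str], i: int) -> float:
    return 1.0 if y_prev == "KEY" and y_cur == "VAL" else 0.0


FEATURES: List[FeatureFn] = [f_start_is_other, f_word_number_is_key,
                             f_digits_are_val, f_key_then_val]
WEIGHTS: List[float] = [1.0, 2.0, 2.0, 1.5]  # hypothetical lambda_j values


def score(y: Sequence[str], x: Sequence[str]) -> float:
    """Inner double sum: sum_i sum_j lambda_j * f_j(y_{i-1}, y_i, x, i)."""
    total, prev = 0.0, START
    for i, cur in enumerate(y):
        total += sum(w * f(prev, cur, x, i) for w, f in zip(WEIGHTS, FEATURES))
        prev = cur
    return total


def partition(x: Sequence[str]) -> float:
    """Z(x): sum of exp(score) over every possible label sequence."""
    return sum(math.exp(score(c, x))
               for c in itertools.product(LABELS, repeat=len(x)))


def conditional_probability(y: Sequence[str], x: Sequence[str]) -> float:
    """P(y | x, lambda) = exp(score(y, x)) / Z(x)."""
    return math.exp(score(y, x)) / partition(x)


def most_probable_labels(x: Sequence[str]) -> Tuple[str, ...]:
    """Label sequence maximizing P(y | x, lambda); Z(x) is constant in y."""
    return max(itertools.product(LABELS, repeat=len(x)),
               key=lambda c: score(c, x))


if __name__ == "__main__":
    tokens = ["invoice", "number", "12345"]
    best = most_probable_labels(tokens)
    print(best, conditional_probability(best, tokens))
```

Because Z(x) does not depend on y, maximizing the unnormalized score is equivalent to maximizing P(y | x, λ), which is why `most_probable_labels` never needs to compute the normalization term.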
5. The method according to claim 1, characterized in that extracting the required information from the raw information according to the determined classification label of each word segment specifically comprises:
determining the classification label corresponding to the required information as a designated label;
extracting the information corresponding to the designated label from the raw information.
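The extraction step of claim 5 can be sketched as a filter over the labeled segmentation sequence. In this hedged example the function name `extract_by_label`, the label names, and the choice to merge consecutive word segments that share the designated label into one extracted item are assumptions; the claim itself only requires that the information corresponding to the designated label be extracted.

```python
from typing import List


def extract_by_label(tokens: List[str], labels: List[str],
                     target: str, joiner: str = "") -> List[str]:
    """Return the text of each maximal run of tokens labeled `target`.

    `joiner` is the string placed between merged tokens: "" suits Chinese
    text, " " suits space-delimited languages.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    results: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == target:
            current.append(token)
        elif current:
            # The run of designated-label tokens has ended; emit it.
            results.append(joiner.join(current))
            current = []
    if current:
        results.append(joiner.join(current))
    return results


if __name__ == "__main__":
    # Hypothetical labeled output of the classification step.
    toks = ["ship", "to", "Alice", "Smith", "phone", "555", "0100"]
    labs = ["O", "O", "NAME", "NAME", "O", "PHONE", "PHONE"]
    print(extract_by_label(toks, labs, "NAME", joiner=" "))
    print(extract_by_label(toks, labs, "PHONE", joiner="-"))
```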
6. An information extraction device, characterized in that the device comprises:
an acquisition module, configured to obtain raw information;
a word segmentation module, configured to perform word segmentation on the raw information to obtain each word segment in the raw information;
a segmentation sequence determining module, configured to determine, according to the order of the word segments in the raw information, a segmentation sequence composed of the word segments;
a feature sequence determining module, configured to determine, according to the segmentation sequence, the observation feature sequence corresponding to each word segment;
a classification label determining module, configured to determine, according to the observation feature sequences corresponding to the word segments and by means of a pre-established conditional random field model, the classification label of each word segment that maximizes the joint probability of the observation feature sequences;
an extraction module, configured to extract the required information from the raw information according to the determined classification label of each word segment.
7. The device according to claim 6, characterized in that the device further comprises:
a part-of-speech determining module, configured to determine the part of speech of each word segment in the segmentation sequence before the feature sequence determining module determines the observation feature sequence corresponding to each word segment;
wherein the feature sequence determining module is specifically configured to determine the observation feature sequence corresponding to each word segment according to the word segment content, the word segment order, and the word segment part of speech in the segmentation sequence.
8. The device according to claim 6, characterized in that the classification label determining module is specifically configured to: obtain sample information in advance; perform word segmentation on the sample information to obtain each word segment in the sample information; determine, according to the order of the word segments in the sample information, a sample sequence composed of the word segments; determine the part of speech of each word segment in the sample sequence; and train the conditional random field model according to the order of the word segments in the sample sequence, the part of speech of each word segment, and the known classification label of each word segment.
9. The device according to claim 6, characterized in that the classification label determining module is specifically configured to determine, according to the formula

$$P(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \right),$$

the classification label of each word segment that maximizes $P(y \mid x, \lambda)$, where $Z(x)$ denotes the normalization function, $\lambda_j$ denotes the weight corresponding to the j-th feature function, $f_j$ denotes the j-th feature function in the conditional random field model, $y_{i-1}$ denotes the classification label corresponding to the (i-1)-th word segment in the segmentation sequence, $y_i$ denotes the classification label corresponding to the i-th word segment in the segmentation sequence, and $x$ denotes the observation feature sequences corresponding to the word segments.
10. The device according to claim 6, characterized in that the extraction module is specifically configured to determine the classification label corresponding to the required information as a designated label, and to extract the information corresponding to the designated label from the raw information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610108274.1A CN107133207A (en) | 2016-02-26 | 2016-02-26 | A kind of information extracting method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610108274.1A CN107133207A (en) | 2016-02-26 | 2016-02-26 | A kind of information extracting method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107133207A (en) | 2017-09-05 |
Family
ID=59721296
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610108274.1A Pending CN107133207A (en) | 2016-02-26 | 2016-02-26 | A kind of information extracting method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107133207A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109299472A (en) * | 2018-11-09 | 2019-02-01 | 天津开心生活科技有限公司 | Text data processing method, device, electronic equipment and computer-readable medium |
CN110209831A (en) * | 2018-02-13 | 2019-09-06 | 北京京东尚科信息技术有限公司 | Model generation, the method for semantics recognition, system, equipment and storage medium |
-
2016
- 2016-02-26 CN CN201610108274.1A patent/CN107133207A/en active Pending
Non-Patent Citations (1)
Title |
---|
黄绍杉 (Huang Shaoshan) et al.: "Research on patent abstract information extraction based on conditional random fields", 《DIGITAL LIBRARY FORUM》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110163478B (en) | Risk examination method and device for contract clauses | |
CN112035653B (en) | Policy key information extraction method and device, storage medium and electronic equipment | |
US20190370296A1 (en) | Method and device for mining an enterprise relationship | |
CN109101489B (en) | Text automatic summarization method and device and electronic equipment | |
US8380489B1 (en) | System, methods, and data structure for quantitative assessment of symbolic associations in natural language | |
CN104636466B (en) | Entity attribute extraction method and system for open webpage | |
CN109344234A (en) | Machine reads understanding method, device, computer equipment and storage medium | |
WO2019205308A1 (en) | Information input method and apparatus, and terminal device and medium | |
CN108509482A (en) | Question classification method, device, computer equipment and storage medium | |
CN113837531A (en) | Product quality problem finding and risk assessment method based on network comments | |
CN111930895B (en) | MRC-based document data retrieval method, device, equipment and storage medium | |
Braz et al. | Document classification using a Bi-LSTM to unclog Brazil's supreme court | |
CN113312480B (en) | Scientific and technological thesis level multi-label classification method and device based on graph volume network | |
CN108446295A (en) | Information retrieval method, device, computer equipment and storage medium | |
CN111782759B (en) | Question-answering processing method and device and computer readable storage medium | |
CN112149387A (en) | Visualization method and device for financial data, computer equipment and storage medium | |
CN105808726A (en) | Method and apparatus for measuring similarity of documents | |
CN110019820B (en) | Method for detecting time consistency of complaints and symptoms of current medical history in medical records | |
CN108427667A (en) | A kind of segmentation method and device of legal documents | |
CN107133207A (en) | A kind of information extracting method and device | |
CN105786929A (en) | Information monitoring method and device | |
CN114911936A (en) | Model training and comment recognition method and device, electronic equipment and medium | |
CN112434126B (en) | Information processing method, device, equipment and storage medium | |
Tschirschwitz et al. | A dataset for analysing complex document layouts in the digital humanities and its evaluation with krippendorff’s alpha | |
CN114153939A (en) | Text recognition method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170905 |