CN107133207A - Information extraction method and device - Google Patents

Information extraction method and device

Info

Publication number
CN107133207A
Authority
CN
China
Prior art keywords
word segment
sequence
information
classification label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610108274.1A
Other languages
Chinese (zh)
Inventor
景艺亮
代斌
隋豌辰
赵科科
王晓光
杨旭
蔡宁
张凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201610108274.1A
Publication of CN107133207A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/253: Grammatical analysis; Style critique

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses an information extraction method and device. The method comprises: obtaining raw information; performing word segmentation on the raw information to obtain the word segments it contains; determining, from the order of the word segments in the raw information, a word-segment sequence composed of those word segments; determining, from the word-segment sequence, the observation feature sequence corresponding to each word segment; determining, from these observation feature sequences and through a pre-established conditional random field model, the classification label of each word segment that maximizes the joint probability of the observation feature sequences; and extracting the required information from the raw information according to the determined classification labels. With this method, the information a user needs can be extracted effectively whether or not the information to be extracted follows a fixed format, which improves the accuracy of information extraction.

Description

Information extraction method and device
Technical field
The present application relates to the field of computer technology, and in particular to an information extraction method and device.
Background
With the continuous development of network technology, users can obtain more and more information. Usually, however, only part of the information obtained is what the user actually needs, so that part has to be extracted. For example, when compiling court judgments, it is usually necessary to sort out the litigants, the litigation time, the judgment result and similar items contained in each judgment, so this information must be extracted from the judgment.
At present, information in most fields is presented to users in a certain information format. In the prior art, therefore, a template for information extraction can be set up in advance when the information a user needs must be extracted. The template carries a fixed information format, and the information the user needs can be extracted from the information to be processed by means of the template.
However, once designed, a template is generally fixed. If the information to be extracted is irregular, extraction accuracy is low. For example, suppose the format designed in the template is "Plaintiff: XXX". When the server identifies "Plaintiff:" in the information to be extracted, it directly extracts whatever follows "Plaintiff:". But when the information to be extracted reads "the plaintiff is XX", the server can neither recognize nor extract the plaintiff's name. Moreover, information in some fields has no specific format at all, so it cannot be extracted by setting a template.
Summary of the invention
The embodiments of the present application provide an information extraction method and device, to solve the problem of low information extraction accuracy in the prior art.
An information extraction method provided by an embodiment of the present application comprises:
obtaining raw information;
performing word segmentation on the raw information to obtain each word segment in the raw information;
determining, according to the order of the word segments in the raw information, a word-segment sequence composed of the word segments;
determining, according to the word-segment sequence, the observation feature sequence corresponding to each word segment;
determining, according to the observation feature sequence corresponding to each word segment and through a pre-established conditional random field model, the classification label of each word segment that maximizes the joint probability of the observation feature sequences;
extracting the required information from the raw information according to the determined classification label of each word segment.
An information extraction device provided by an embodiment of the present application comprises:
an acquisition module, configured to obtain raw information;
a word segmentation module, configured to perform word segmentation on the raw information to obtain each word segment in the raw information;
a word-segment sequence determining module, configured to determine, according to the order of the word segments in the raw information, a word-segment sequence composed of the word segments;
a feature sequence determining module, configured to determine, according to the word-segment sequence, the observation feature sequence corresponding to each word segment;
a classification label determining module, configured to determine, according to the observation feature sequence corresponding to each word segment and through a pre-established conditional random field model, the classification label of each word segment that maximizes the joint probability of the observation feature sequences;
an extraction module, configured to extract the required information from the raw information according to the determined classification label of each word segment.
The embodiments of the present application provide an information extraction method and device. The method obtains raw information, segments it into word segments, and determines a word-segment sequence from the order of the word segments in the raw information. From that sequence it determines the observation feature sequence corresponding to each word segment and, through a pre-established conditional random field model, the classification label of each word segment that maximizes the joint probability of the observation feature sequences. The required information is then extracted from the raw information according to the determined classification labels. With this method, the information a user needs can be extracted effectively whether or not the information to be extracted follows a fixed format, which improves the accuracy of information extraction.
Brief description of the drawings
The accompanying drawings described here are provided for further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their description are used to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the information extraction process provided by an embodiment of the present application;
Fig. 2 is a schematic structural diagram of the information extraction device provided by an embodiment of the present application.
Detailed description
To make the purpose, technical solution and advantages of the present application clearer, the technical solution is described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are obviously only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present application.
Fig. 1 shows the information extraction process provided by an embodiment of the present application, which comprises the following steps:
S101: Obtain raw information.
In the embodiments of the present application, extracting the information a user needs first requires obtaining the raw information. The raw information may be obtained by a server or by any other device with a data processing function. The raw information contains the information the user needs, and the server has to extract that information from it. The raw information includes text information.
For example, legal staff need to extract the defendant's name, sex and date of birth from a large number of electronic legal documents (whose content is text information). The server therefore obtains the information contained in the electronic legal documents (that is, the raw information). To present the scheme simply and avoid unnecessary steps, the information contained in each legal document is here taken to be only "Defendant: XX, sex: X, born on XX (month) XX (day), XXXX", and a single legal document is used as the example.
S102: Perform word segmentation on the raw information to obtain each word segment in the raw information.
In practice, the information a user needs is often just a word or a character within a sentence. In the present application, therefore, the obtained raw information can be segmented into word segments, a classification label is subsequently determined for each word segment, and the information corresponding to the required classification labels can then be extracted.
When segmenting the raw information, treating every single character as a word segment would make the amount of computation huge, while treating too many characters as one word segment would reduce extraction accuracy. In the present application, therefore, characters are grouped into complete words according to language habits. For example, segmenting "we love China" yields the word segments "we", "love" and "China". In addition, if the raw information contains punctuation marks, each punctuation mark can be taken out separately as a word segment of its own.
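The segmentation described in S102 can be sketched as follows. The source does not name a segmentation algorithm, so this minimal sketch swaps in forward maximum matching against a small lexicon; the function `segment`, the parameter `max_len` and the contents of `LEXICON` are all hypothetical and chosen only for illustration.

```python
import unicodedata

# Hypothetical toy lexicon; a real system would use a full dictionary
# or a statistical segmenter.
LEXICON = {"我们", "中国", "被告", "张三", "性别"}


def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon word.

    Punctuation marks are always emitted as word segments of their own,
    and an unknown character falls back to a one-character segment.
    """
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
            continue
        if unicodedata.category(ch).startswith("P"):
            tokens.append(ch)
            i += 1
            continue
        # Try the longest candidate first, down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


print(segment("我们爱中国", LEXICON))  # ['我们', '爱', '中国']
```

The example input is the Chinese sentence for "we love China"; the three segments correspond to "we", "love" and "China".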
S103: Determine, according to the order of the word segments in the raw information, the word-segment sequence composed of the word segments.
In the embodiments of the present application, after obtaining the raw information and segmenting it, the server determines the sequence composed of the word segments according to their order in the raw information.
Furthermore, in natural language the same word may play different roles in different sentences. When it does, both its meaning and its part of speech differ. For example, in "Can you tell me about the process of the matter? Today I passed by your doorway", the same source-language word is rendered as the noun "process" in the first sentence and as the verb "pass by" in the second, and the phrase "your doorway" that follows the verb could not appear after the noun "process" in the first sentence. To improve extraction accuracy, in the embodiments of the present application the server may therefore determine the part of speech of each word segment after determining the word-segment sequence. Several different kinds of part of speech may be used; Table 1 lists only two, namely the first part of speech and the second part of speech. The server may of course also determine the part of speech of each word segment immediately after segmenting the raw information. For a word segment that is a punctuation mark, the part of speech may be denoted by w.
Continuing the example above, suppose the information contained in the legal document obtained by the server is "Defendant: Zhang San, sex: male, born on October 21, 1985" (in the original word order the date precedes the verb "born"). The server segments this text in the manner described above, determines the word-segment sequence according to the order of the word segments in the text, and determines the part of speech of each word segment in the sequence (that is, its first part of speech and second part of speech), obtaining the data shown in Table 1:
Word segment        First part of speech    Second part of speech
Defendant           n                       basic word-Chinese
:                   w                       other
Zhang San           n                       name-Chinese personal name
,                   w                       other
sex                 n                       basic word-Chinese
:                   w                       other
male                n                       product type qualifier
,                   w                       other
October 21, 1985    n                       DATE
born                vi                      basic word-Chinese
Table 1
S104: Determine, according to the word-segment sequence, the observation feature sequence corresponding to each word segment.
Because the present application is implemented with a conditional random field model, after determining the word-segment sequence the server needs to determine the observation feature sequence corresponding to each word segment in that sequence.
Throughout this process, each time the server reads a word segment from the word-segment sequence, it determines one observation feature sequence by means of pre-established feature templates, until the observation feature sequences of all word segments have been determined. As an example, the present application gives the following five feature templates:
# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
In addition, to improve extraction accuracy, the server also determines the part of speech of each word segment after determining the word-segment sequence. The part of speech of a word segment thus also influences its observation feature sequence, and hence which label the word segment corresponds to. Therefore, when determining the observation feature sequence corresponding to each word segment, the part of speech of each word segment must also be taken into account. Specifically, the observation feature sequence corresponding to each word segment is determined according to the word-segment content, word order and part of speech in the word-segment sequence, where the word-segment content refers to the word segments themselves and the word order refers to the order in which they appear in the sequence.
Continuing the example, suppose the five feature templates given above are used. After reading the word-segment sequence "Defendant", ":", "Zhang San", ",", "sex", ":", "male", ",", "October 21, 1985", "born", the server generates the observation feature sequences shown in Table 2:
Table 2
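How templates of this form can be expanded into observation features may be sketched as follows. The sketch assumes the convention used by template-driven CRF toolkits, in which %x[row,col] denotes the value in column col of the word segment located row positions away from the current one. The names `expand` and `ROWS` and the padding markers "<BOS>" and "<EOS>" are hypothetical; the source does not state how positions outside the sequence are handled.

```python
import re

TEMPLATES = [
    "U00:%x[-2,0]",
    "U01:%x[-1,0]",
    "U02:%x[0,0]",
    "U03:%x[1,0]",
    "U04:%x[2,0]",
]
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

# One row per word segment: (content, first part of speech), as in Table 1.
ROWS = [
    ("Defendant", "n"), (":", "w"), ("Zhang San", "n"), (",", "w"),
    ("sex", "n"), (":", "w"), ("male", "n"), (",", "w"),
    ("October 21, 1985", "n"), ("born", "vi"),
]


def expand(templates, rows, pos):
    """Return the observation features of the word segment at index pos."""
    def lookup(match):
        offset, col = int(match.group(1)), int(match.group(2))
        idx = pos + offset
        if idx < 0:
            return "<BOS>"  # hypothetical marker: before the sequence start
        if idx >= len(rows):
            return "<EOS>"  # hypothetical marker: after the sequence end
        return rows[idx][col]

    return [MACRO.sub(lookup, template) for template in templates]


for feature in expand(TEMPLATES, ROWS, 2):
    print(feature)
```

For the word segment "Zhang San" (index 2) the five features are built from the two preceding word segments, the word segment itself and the two following word segments. A template such as U05:%x[0,1] would bring the first part of speech into the features in the same way.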
S105: Determine, according to the observation feature sequence corresponding to each word segment and through the pre-established conditional random field model, the classification label of each word segment that maximizes the joint probability of the observation feature sequences.
In the present application, the word segments are subsequently classified by means of classification labels, and the information corresponding to the classification labels of the required information is extracted. The kinds of classification label are set in advance. Therefore, once the word-segment sequence and the observation feature sequences have been determined, the classification label of each word segment in the sequence needs to be determined.
The present application determines these labels with a conditional random field model. The core of such a model is to determine the joint probability of an output sequence given an input sequence, where the input sequence consists of the observation feature sequences of all word segments in the word-segment sequence and the output sequence is the classification label sequence, that is, the classification labels of the word segments arranged in the order of the word segments. The larger the joint probability of an output sequence, the more likely that sequence is correct; the smaller the joint probability, the less likely it is correct. In the present application, therefore, the classification label of each word segment is determined by directly finding, for the given input sequence, the output sequence with the largest joint probability.
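The search for the most probable label sequence is commonly carried out with the Viterbi algorithm; the source does not say which search procedure is used, so the following is only a minimal sketch under that assumption. The functions `viterbi`, `emit` and `trans` and the score tables `EMIT` and `TRANS` are hypothetical, and the scores are invented for illustration rather than learned weights.

```python
def viterbi(n_tokens, labels, emit, trans):
    """Return the highest-scoring label sequence of length n_tokens.

    emit(i, y) scores label y at position i; trans(p, y) scores moving
    from label p to label y. Scores are unnormalized log-potentials.
    """
    if n_tokens == 0:
        return []
    best = [{y: emit(0, y) for y in labels}]
    back = []
    for i in range(1, n_tokens):
        scores, pointers = {}, {}
        for y in labels:
            # Best previous label for reaching y at position i.
            prev = max(labels, key=lambda p: best[-1][p] + trans(p, y))
            scores[y] = best[-1][prev] + trans(prev, y) + emit(i, y)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


TOKENS = ["Defendant", ":", "Zhang San"]
LABELS = ["0", "dName"]
# Hypothetical scores, for illustration only.
EMIT = {("Defendant", "0"): 1.0, (":", "0"): 1.0, ("Zhang San", "dName"): 2.0}
TRANS = {("dName", "dName"): -1.0}

result = viterbi(
    len(TOKENS),
    LABELS,
    lambda i, y: EMIT.get((TOKENS[i], y), 0.0),
    lambda p, y: TRANS.get((p, y), 0.0),
)
print(result)  # ['0', '0', 'dName']
```

Because the normalization term is the same for every candidate label sequence of a given input, maximizing the unnormalized score is equivalent to maximizing the probability, which is why the sketch never computes the normalization.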
Continuing the example, suppose the pre-established kinds of classification label are: 0 (the label for information that need not be attended to), dName (the label for the name information of interest), dSex (the label for the sex information of interest) and dBirthday (the label for the date-of-birth information of interest). From the observation feature sequences determined in Table 2, and through the pre-established conditional random field model, the server determines the classification label sequence that maximizes the joint probability of those observation feature sequences, and thereby the classification label of each word segment. Suppose the classification labels determined are as shown in Table 3:
Word segment        First part of speech    Second part of speech         Classification label
Defendant           n                       basic word-Chinese            0
:                   w                       other                         0
Zhang San           n                       name-Chinese personal name    dName
,                   w                       other                         0
sex                 n                       basic word-Chinese            0
:                   w                       other                         0
male                n                       product type qualifier        dSex
,                   w                       other                         0
October 21, 1985    n                       DATE                          dBirthday
born                vi                      basic word-Chinese            0
Table 3
The classification label sequence finally obtained is therefore "0", "0", "dName", "0", "0", "0", "dSex", "0", "dBirthday", "0".
S106: Extract the required information from the raw information according to the determined classification label of each word segment.
In the present application, after determining the classification label sequence with the largest joint probability for the word-segment sequence, the server can determine the classification labels corresponding to the information the user needs, take them as specified labels, and extract from the raw information the information corresponding to the specified labels.
Continuing the example, suppose the classification labels corresponding to the information the user needs are "dName, dSex, dBirthday". The server takes "dName, dSex, dBirthday" as the specified labels. After determining that the classification label sequence maximizing the joint probability of the observation feature sequences in Table 2 is "0", "0", "dName", "0", "0", "0", "dSex", "0", "dBirthday", "0", the server extracts "Zhang San", which corresponds to "dName", "male", which corresponds to "dSex", and "October 21, 1985", which corresponds to "dBirthday".
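The extraction step S106 amounts to filtering the word segments by their labels, which can be sketched as follows. The function name `extract` and the choice of a dictionary of lists as the result type are assumptions made for illustration.

```python
TOKENS = ["Defendant", ":", "Zhang San", ",", "sex", ":", "male", ",",
          "October 21, 1985", "born"]
LABELS = ["0", "0", "dName", "0", "0", "0", "dSex", "0", "dBirthday", "0"]


def extract(tokens, labels, specified):
    """Collect the word segments whose label is one of the specified labels."""
    found = {}
    for token, label in zip(tokens, labels):
        if label in specified:
            # A label may occur more than once, so keep a list per label.
            found.setdefault(label, []).append(token)
    return found


print(extract(TOKENS, LABELS, {"dName", "dSex", "dBirthday"}))
```

With the label sequence of Table 3 this yields "Zhang San" for dName, "male" for dSex and "October 21, 1985" for dBirthday, and everything labelled 0 is discarded.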
With the above method, the information a user needs can be extracted effectively from the information to be extracted, whether or not that information follows a fixed format, which improves the accuracy of information extraction.
In addition, for the process of determining, through the pre-established conditional random field model, the classification label of each word segment that maximizes the joint probability of the observation feature sequences, the present application also gives the core formula of the conditional random field model. Specifically, the server uses the formula
$$P(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big( \sum_{i} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \Big)$$
to determine the classification label of each word segment that maximizes P(y | x, λ), where Z(x) is the normalization function, λ_j is the weight of the j-th feature function, f_j is the j-th feature function in the conditional random field model, y_{i-1} is the classification label of the (i-1)-th word segment in the word-segment sequence, y_i is the classification label of the i-th word segment in the word-segment sequence, and x denotes the observation feature sequences corresponding to the word segments.
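To make the roles of Z(x), λ_j and f_j concrete, the probability of a label sequence can be computed by brute force for a very short input. This is only a sketch under the assumption of a linear-chain model in which each feature function sees the previous label, the current label, the input and the position; the two feature functions, their weights and the use of None as the label before the first word segment are hypothetical. Enumerating all label sequences is feasible only for toy inputs.

```python
import itertools
import math


def score(y, x, weights, features):
    """Sum of lambda_j * f_j(y_{i-1}, y_i, x, i) over positions i and features j."""
    total = 0.0
    for i in range(len(x)):
        y_prev = y[i - 1] if i > 0 else None  # hypothetical start marker
        for lam, f in zip(weights, features):
            total += lam * f(y_prev, y[i], x, i)
    return total


def probability(y, x, labels, weights, features):
    """P(y | x, lambda), with Z(x) computed by enumerating every label sequence."""
    z = sum(
        math.exp(score(candidate, x, weights, features))
        for candidate in itertools.product(labels, repeat=len(x))
    )
    return math.exp(score(y, x, weights, features)) / z


# Two hypothetical indicator feature functions and weights.
FEATURES = [
    lambda y_prev, y_cur, x, i: 1.0 if x[i] == "Zhang San" and y_cur == "dName" else 0.0,
    lambda y_prev, y_cur, x, i: 1.0 if y_prev == "dName" and y_cur == "dName" else 0.0,
]
WEIGHTS = [2.0, -1.0]
LABELS = ["0", "dName"]
X = ["Defendant", "Zhang San"]

best = max(
    itertools.product(LABELS, repeat=len(X)),
    key=lambda y: probability(y, X, LABELS, WEIGHTS, FEATURES),
)
print(best, probability(best, X, LABELS, WEIGHTS, FEATURES))
```

Dividing by Z(x) makes the probabilities of all candidate label sequences sum to one, and the sequence with the largest probability is the one the model outputs.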
Further, the present application also provides a way of establishing the conditional random field model. The server can subsequently use the established model directly: given the observation feature sequences of all word segments in a word-segment sequence (that is, the input sequence), it determines the classification label of each word segment that maximizes the joint probability of the observation feature sequences.
The conditional random field model is established as follows: sample information is obtained in advance and segmented to obtain each word segment in the sample information; a sample sequence composed of the word segments is determined according to their order in the sample information; the part of speech of each word segment in the sample sequence is determined; and the conditional random field model is obtained by training on the order of the word segments in the sample sequence, the part of speech of each word segment and the known classification label of each word segment.
It should be noted that training the conditional random field model specifically means using the order of the word segments in the sample sequence, the part of speech of each word segment and the known classification label of each word segment to train λ_j and f_j in the core formula of the model,
$$P(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big( \sum_{i} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \Big)$$
All feature functions f are determined from the order of the word segments in the sample sequence, the part of speech of each word segment, the known classification label of each word segment and the pre-established feature templates. The feature templates used in training are the same as those involved in step S104: the server trains all feature functions f of the conditional random field model from the five feature templates above together with the order, parts of speech and known classification labels of the word segments in the sample sequence.
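How feature functions can be derived from labelled samples and the feature templates may be sketched as follows. The sketch assumes that each distinct pair of an expanded template string and a classification label seen in the samples defines one indicator feature function with its own weight; the names `build_feature_keys` and `SAMPLES` and the padding marker "<PAD>" are hypothetical. Estimating the weights, typically by maximizing the conditional likelihood of the labelled samples, is not shown.

```python
OFFSETS = (-2, -1, 0, 1, 2)  # row offsets of templates U00 to U04


def build_feature_keys(samples, offsets=OFFSETS):
    """Collect (expanded template string, label) pairs from labelled samples.

    samples is a list of (tokens, labels) pairs. Each returned key stands
    for one indicator feature function of the model.
    """
    keys = set()
    for tokens, labels in samples:
        for i, label in enumerate(labels):
            for k, offset in enumerate(offsets):
                j = i + offset
                value = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
                keys.add(("U%02d:%s" % (k, value), label))
    return sorted(keys)


SAMPLES = [
    (["Defendant", ":", "Zhang San"], ["0", "0", "dName"]),
]

keys = build_feature_keys(SAMPLES)
weights = {key: 0.0 for key in keys}  # starting point for weight estimation
print(len(keys))
```

After training, a large positive weight on a key such as ("U01::", "dName") would express that a word segment preceded by a colon is likely to be a name.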
The above is the information extraction method provided by the embodiments of the present application. Based on the same idea, the embodiments of the present application also provide an information extraction device, as shown in Fig. 2.
Fig. 2 is a schematic structural diagram of the information extraction device provided by an embodiment of the present application. The device comprises:
an acquisition module 201, configured to obtain raw information;
a word segmentation module 202, configured to perform word segmentation on the raw information to obtain each word segment in the raw information;
a word-segment sequence determining module 203, configured to determine, according to the order of the word segments in the raw information, a word-segment sequence composed of the word segments;
a feature sequence determining module 204, configured to determine, according to the word-segment sequence, the observation feature sequence corresponding to each word segment;
a classification label determining module 205, configured to determine, according to the observation feature sequence corresponding to each word segment and through a pre-established conditional random field model, the classification label of each word segment that maximizes the joint probability of the observation feature sequences;
an extraction module 206, configured to extract the required information from the raw information according to the determined classification label of each word segment.
The device further comprises:
a part-of-speech determining module 207, configured to determine the part of speech of each word segment in the word-segment sequence before the feature sequence determining module 204 determines the observation feature sequence corresponding to each word segment.
The feature sequence determining module 204 is specifically configured to determine the observation feature sequence corresponding to each word segment according to the word-segment content, word order and part of speech in the word-segment sequence.
The classification label determining module 205 is specifically configured to obtain sample information in advance, segment the sample information to obtain each word segment in it, determine a sample sequence composed of the word segments according to their order in the sample information, determine the part of speech of each word segment in the sample sequence, and obtain the conditional random field model by training on the order of the word segments in the sample sequence, the part of speech of each word segment and the known classification label of each word segment.
The classification label determining module 205 is specifically configured to determine, according to the formula
$$P(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big( \sum_{i} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \Big)$$
the classification label of each word segment that maximizes P(y | x, λ), where Z(x) is the normalization function, λ_j is the weight of the j-th feature function, f_j is the j-th feature function in the conditional random field model, y_{i-1} is the classification label of the (i-1)-th word segment in the word-segment sequence, y_i is the classification label of the i-th word segment in the word-segment sequence, and x denotes the observation feature sequences corresponding to the word segments.
The extraction module 206 is specifically configured to determine the classification label corresponding to the required information, take it as a specified label, and extract from the raw information the information corresponding to the specified label.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, commodity or device that comprises the element.
Those skilled in the art will understand that the embodiments of the present application may be provided as a method, a system or a computer program product. The present application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical storage) containing computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit it. Various modifications and variations of the present application are possible for those skilled in the art. Any modification, equivalent replacement or improvement made within the spirit and principle of the present application shall fall within the scope of its claims.

Claims (10)

1. An information extracting method, characterized in that the method comprises:
Obtaining raw information;
Performing word segmentation on the raw information to obtain each participle in the raw information;
Determining, according to the order of each participle in the raw information, a segmentation sequence composed of the participles;
Determining, according to the segmentation sequence, the observation feature sequence corresponding to each participle;
Determining, according to the observation feature sequence corresponding to each participle and by means of a pre-established conditional random field model, the classification label of each participle that maximizes the joint probability of the observation feature sequences;
Extracting the required information from the raw information according to the determined classification label of each participle.
2. The method according to claim 1, characterized in that, before determining the observation feature sequence corresponding to each participle, the method further comprises:
Determining the part of speech of each participle in the segmentation sequence;
Determining, according to the segmentation sequence, the observation feature sequence corresponding to each participle specifically comprises:
Determining the observation feature sequence corresponding to each participle according to the participle content, participle word order, and participle part of speech in the segmentation sequence.
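As an illustration of the feature construction recited in claim 2, the following is a minimal sketch, assuming the input has already been segmented and part-of-speech tagged. The function name `observation_features`, the feature keys, the `<BOS>`/`<EOS>` padding markers, and the one-word context window are all hypothetical choices, not details given in the application.

```python
from typing import Dict, List, Tuple


def observation_features(tagged: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build one feature dict per participle from its content, its position
    in the segmentation sequence (word order), and its part of speech."""
    features = []
    for i, (word, pos) in enumerate(tagged):
        # Neighbouring participles encode word order; pad at the boundaries.
        prev_word, prev_pos = tagged[i - 1] if i > 0 else ("<BOS>", "<BOS>")
        next_word, next_pos = tagged[i + 1] if i < len(tagged) - 1 else ("<EOS>", "<EOS>")
        features.append({
            "word": word,            # participle content
            "pos": pos,              # participle part of speech
            "index": str(i),         # position in the segmentation sequence
            "prev_word": prev_word,
            "prev_pos": prev_pos,
            "next_word": next_word,
            "next_pos": next_pos,
        })
    return features


# Hypothetical example input: (participle, part-of-speech tag) pairs.
tagged = [("ship", "v"), ("to", "p"), ("Hangzhou", "ns")]
feats = observation_features(tagged)
```

Each dict in `feats` plays the role of the observation feature sequence for one participle; a real system would likely use a richer template (wider windows, character-level features).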
3. The method according to claim 1, characterized in that pre-establishing the conditional random field model specifically comprises:
Obtaining sample information in advance;
Performing word segmentation on the sample information to obtain each participle in the sample information;
Determining, according to the order of each participle in the sample information, a sample sequence composed of the participles;
Determining the part of speech of each participle in the sample sequence;
Training to obtain the conditional random field model according to the order of each participle in the sample sequence, the part of speech of each participle, and the known classification label of each participle.
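The training step of claim 3 can be sketched as follows. Note the substitution: this sketch uses a structured perceptron rather than the maximum-likelihood estimation normally used for conditional random fields, because it fits in a few lines while learning the same kind of weights over the same inputs (participle order, part of speech, and known labels). All names (`feats`, `decode`, `train`), the feature template, the label names, and the sample data are assumptions for illustration only.

```python
from collections import defaultdict
from typing import List, Tuple

Sample = List[Tuple[str, str, str]]  # (participle, part of speech, known label)


def feats(words, poses, i, prev_label, label):
    """Indicator features over content, part of speech, and label transition."""
    return [("word", words[i], label),
            ("pos", poses[i], label),
            ("trans", prev_label, label)]


def decode(words, poses, labels, w):
    """Viterbi search for the highest-scoring label sequence under weights w."""
    best = {"<START>": (0.0, [])}
    for i in range(len(words)):
        nxt = {}
        for lab in labels:
            nxt[lab] = max(
                ((score + sum(w.get(f, 0.0) for f in feats(words, poses, i, prev, lab)),
                  path + [lab])
                 for prev, (score, path) in best.items()),
                key=lambda t: t[0],
            )
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]


def train(samples: List[Sample], epochs: int = 5):
    """Perceptron updates: reward features of the known labels and
    penalize features of a wrong prediction."""
    labels = sorted({lab for s in samples for _, _, lab in s})
    w = defaultdict(float)
    for _ in range(epochs):
        for s in samples:
            words = [t[0] for t in s]
            poses = [t[1] for t in s]
            gold = [t[2] for t in s]
            pred = decode(words, poses, labels, w)
            if pred == gold:
                continue
            for seq, delta in ((gold, 1.0), (pred, -1.0)):
                prev = "<START>"
                for i, lab in enumerate(seq):
                    for f in feats(words, poses, i, prev, lab):
                        w[f] += delta
                    prev = lab
    return labels, w


# Hypothetical labelled sample sequences.
samples = [
    [("ship", "v", "O"), ("to", "p", "O"), ("Hangzhou", "ns", "PLACE")],
    [("fly", "v", "O"), ("to", "p", "O"), ("Beijing", "ns", "PLACE")],
]
labels, weights = train(samples)
prediction = decode(["drive", "to", "Shanghai"], ["v", "p", "ns"], labels, weights)
```

Because the unseen words carry no learned weight, the prediction on the last line relies on the part-of-speech and transition features, which is the reason claim 3 includes part of speech among the training inputs.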
4. The method according to claim 1, characterized in that determining the classification label of each participle that maximizes the joint probability of the observation feature sequences specifically comprises:
According to the formula $P(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \right)$, determining the classification label of each participle that maximizes $P(y \mid x, \lambda)$, wherein $Z(x)$ denotes the normalization function, $\lambda_j$ denotes the weight corresponding to the j-th feature function, $f_j$ denotes the j-th feature function in the conditional random field model, $y_{i-1}$ denotes the classification label corresponding to the (i-1)-th participle in the segmentation sequence, $y_i$ denotes the classification label corresponding to the i-th participle in the segmentation sequence, and $x$ denotes the observation feature sequence corresponding to each participle.
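To make the formula of claim 4 concrete, the following sketch evaluates P(y | x, λ) for a toy linear-chain model and finds the maximizing label sequence with the Viterbi algorithm. The label set, the four feature functions, and their weights are invented for illustration, and the observation x is simplified to the list of participles themselves. Z(x) is computed by brute-force enumeration, which is only feasible for very short sequences.

```python
import itertools
import math

LABELS = ["O", "CITY"]  # hypothetical label set


# Hypothetical feature functions f_j(y_prev, y_cur, x, i).
def f_city_after_to(y_prev, y_cur, x, i):
    return 1.0 if y_cur == "CITY" and i > 0 and x[i - 1] == "to" else 0.0


def f_capitalized_city(y_prev, y_cur, x, i):
    return 1.0 if y_cur == "CITY" and x[i][:1].isupper() else 0.0


def f_default_other(y_prev, y_cur, x, i):
    return 1.0 if y_cur == "O" else 0.0


def f_city_follows_city(y_prev, y_cur, x, i):
    return 1.0 if y_prev == "CITY" and y_cur == "CITY" else 0.0


# (lambda_j, f_j) pairs; the weights are assumed, not learned.
WEIGHTED_FEATURES = [(1.5, f_city_after_to), (1.0, f_capitalized_city),
                     (0.8, f_default_other), (-2.0, f_city_follows_city)]


def local_score(y_prev, y_cur, x, i):
    """Inner sum over j of lambda_j * f_j at position i."""
    return sum(w * f(y_prev, y_cur, x, i) for w, f in WEIGHTED_FEATURES)


def sequence_score(y, x):
    """Outer sum over positions; the label before the first one is <START>."""
    return sum(local_score(y[i - 1] if i > 0 else "<START>", y[i], x, i)
               for i in range(len(x)))


def probability(y, x):
    """P(y | x, lambda) with Z(x) obtained by enumerating all label sequences."""
    z = sum(math.exp(sequence_score(c, x))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(sequence_score(y, x)) / z


def viterbi(x):
    """Label sequence maximizing P(y | x, lambda), without computing Z(x)."""
    if not x:
        return []
    best = {lab: (local_score("<START>", lab, x, 0), [lab]) for lab in LABELS}
    for i in range(1, len(x)):
        best = {
            cur: max(((s + local_score(prev, cur, x, i), path + [cur])
                      for prev, (s, path) in best.items()), key=lambda t: t[0])
            for cur in LABELS
        }
    return max(best.values(), key=lambda t: t[0])[1]


x = ["ship", "to", "Hangzhou"]
best_labels = viterbi(x)
```

Since Z(x) does not depend on y, maximizing the probability is equivalent to maximizing the unnormalized score, which is why `viterbi` never needs the normalization term.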
5. The method according to claim 1, characterized in that extracting the required information from the raw information according to the determined classification label of each participle specifically comprises:
Determining the classification label corresponding to the required information as a specified label;
Extracting the information corresponding to the specified label from the raw information.
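The extraction step of claim 5 amounts to filtering participles by the specified label. A minimal sketch follows, assuming the labels have already been determined; the function name `extract_by_label`, the label names, and the choice to join consecutive participles with a space are assumptions (for Chinese text an empty separator would be more natural).

```python
from typing import List


def extract_by_label(participles: List[str], labels: List[str], target: str,
                     sep: str = " ") -> List[str]:
    """Return each run of consecutive participles whose label equals `target`."""
    spans, current = [], []
    for word, label in zip(participles, labels):
        if label == target:
            current.append(word)
        elif current:
            # A run of target-labelled participles has just ended.
            spans.append(sep.join(current))
            current = []
    if current:
        spans.append(sep.join(current))
    return spans


# Hypothetical participles and their determined classification labels.
words = ["ship", "to", "Hangzhou", "West", "Lake", "today"]
word_labels = ["O", "O", "ADDR", "ADDR", "ADDR", "O"]
extracted = extract_by_label(words, word_labels, "ADDR")
```

Grouping adjacent participles into one span is a design choice; returning every matching participle individually would equally satisfy the wording of the claim.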
6. An information extracting device, characterized in that the device comprises:
An acquisition module, configured to obtain raw information;
A word segmentation module, configured to perform word segmentation on the raw information to obtain each participle in the raw information;
A segmentation sequence determining module, configured to determine, according to the order of each participle in the raw information, a segmentation sequence composed of the participles;
A feature sequence determining module, configured to determine, according to the segmentation sequence, the observation feature sequence corresponding to each participle;
A classification label determining module, configured to determine, according to the observation feature sequence corresponding to each participle and by means of a pre-established conditional random field model, the classification label of each participle that maximizes the joint probability of the observation feature sequences;
An extraction module, configured to extract the required information from the raw information according to the determined classification label of each participle.
7. The device according to claim 6, characterized in that the device further comprises:
A part-of-speech determining module, configured to determine the part of speech of each participle in the segmentation sequence before the feature sequence determining module determines the observation feature sequence corresponding to each participle;
The feature sequence determining module is specifically configured to determine the observation feature sequence corresponding to each participle according to the participle content, participle word order, and participle part of speech in the segmentation sequence.
8. The device according to claim 6, characterized in that the classification label determining module is specifically configured to obtain sample information in advance, perform word segmentation on the sample information to obtain each participle in the sample information, determine, according to the order of each participle in the sample information, a sample sequence composed of the participles, determine the part of speech of each participle in the sample sequence, and train to obtain the conditional random field model according to the order of each participle in the sample sequence, the part of speech of each participle, and the known classification label of each participle.
9. The device according to claim 6, characterized in that the classification label determining module is specifically configured to determine, according to the formula $P(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{n} \sum_{j} \lambda_j f_j(y_{i-1}, y_i, x, i) \right)$, the classification label of each participle that maximizes $P(y \mid x, \lambda)$, wherein $Z(x)$ denotes the normalization function, $\lambda_j$ denotes the weight corresponding to the j-th feature function, $f_j$ denotes the j-th feature function in the conditional random field model, $y_{i-1}$ denotes the classification label corresponding to the (i-1)-th participle in the segmentation sequence, $y_i$ denotes the classification label corresponding to the i-th participle in the segmentation sequence, and $x$ denotes the observation feature sequence corresponding to each participle.
10. The device according to claim 6, characterized in that the extraction module is specifically configured to determine the classification label corresponding to the required information as a specified label, and to extract the information corresponding to the specified label from the raw information.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610108274.1A CN107133207A (en) 2016-02-26 2016-02-26 A kind of information extracting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610108274.1A CN107133207A (en) 2016-02-26 2016-02-26 A kind of information extracting method and device

Publications (1)

Publication Number Publication Date
CN107133207A true CN107133207A (en) 2017-09-05

Family

ID=59721296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610108274.1A Pending CN107133207A (en) 2016-02-26 2016-02-26 A kind of information extracting method and device

Country Status (1)

Country Link
CN (1) CN107133207A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299472A (en) * 2018-11-09 2019-02-01 天津开心生活科技有限公司 Text data processing method, device, electronic equipment and computer-readable medium
CN110209831A (en) * 2018-02-13 2019-09-06 北京京东尚科信息技术有限公司 Model generation, the method for semantics recognition, system, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄绍杉 (Huang Shaoshan) et al.: "Research on patent abstract information extraction based on conditional random fields", 《DIGITAL LIBRARY FORUM》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170905
