
CN105808523A - Method and apparatus for identifying document

Info

Publication number: CN105808523A
Application number: CN201610130559.5A
Authority: CN
Inventors: 王明君; 王茂帅; 柳廷娜; 高峰; 于文才
Original assignee: Inspur Software Co Ltd
Current assignee: Inspur Software Co Ltd
Priority date: 2016-03-08
Filing date: 2016-03-08
Publication date: 2016-07-27

Abstract

The invention provides a method and apparatus for identifying a document. The method comprises the steps of presetting labeling rules of words, wherein the labeling rules comprise labeling rules of named entities and labeling rules of common words; acquiring a document to be processed; according to the labeling rules, labeling common words and named entities in the document to be processed; and according to the labeled document to be processed, identifying the common words and the named entities in the document to be processed. According to the method and apparatus for identifying the document, which are provided by the invention, the document can be more simply identified.

Description

Method and apparatus for identifying a document

Technical field

The present invention relates to the field of computer technology, and in particular to a method and apparatus for identifying a document.

Background

Word segmentation, as the most basic problem in natural language processing, has long received attention from experts in the field. Chinese word segmentation differs fundamentally from English segmentation: English uses spaces to separate words, whereas Chinese has no such convention. To regroup a continuous character sequence into a word sequence according to a given specification, experts in the field have proposed various solutions, some statistics-based and some rule-based. To date, however, no algorithm solves the word segmentation problem perfectly. Named entity recognition is another major problem in natural language processing; basic named entities include time expressions, person names, place names, organization names, and so on. It is difficult to recognize named entities of all categories in a single model or a single process.

In the prior-art method of identifying a document, the document to be processed is first segmented into words, and after segmentation is complete, named entities are identified from among all the words. The segmentation process of the prior art cannot itself recognize named entities. For example, if the document to be processed contains the word "Peking University", segmentation generally splits it into the two words "Beijing" and "University"; preset rules then merge the two words into "Peking University", thereby identifying the named entity "Peking University".
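The two-stage prior-art flow described above can be sketched as follows. This is an illustrative sketch only: the segmenter output is hard-coded, the rule table `MERGE_RULES` and the function `merge_entities` are hypothetical names, and the Chinese strings are assumed back-translations of the example, which this text gives only in English.

```python
# Stage 1 (assumed): a segmenter has already split the text into words.
# Stage 2: adjacent words are merged into named entities by preset rules.

# Hypothetical merge rules: adjacent words -> (merged entity, entity type).
MERGE_RULES = {
    ("北京", "大学"): ("北京大学", "organization"),  # "Beijing" + "University"
}

def merge_entities(words):
    """Scan segmented words left to right and merge pairs that match a rule."""
    result = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in MERGE_RULES:
            result.append(MERGE_RULES[pair])
            i += 2
        else:
            result.append((words[i], "common"))
            i += 1
    return result

segmented = ["我们", "去", "北京", "大学", "参观"]  # assumed stage-1 output
print(merge_entities(segmented))
```

The second pass over the segmented words is the extra step that the method of this document aims to avoid.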

As the above description shows, the prior-art method of identifying a document must first perform word segmentation and only then identify named entities, which is relatively complicated.

Summary of the invention

Embodiments of the present invention provide a method and apparatus for identifying a document, which allow a document to be identified more simply.

In one aspect, an embodiment of the present invention provides a method for identifying a document, comprising:

S0: presetting labeling rules for words, wherein the labeling rules comprise labeling rules for named entities and labeling rules for common words;

S1: acquiring a document to be processed;

S2: labeling the common words and named entities in the document to be processed according to the labeling rules;

S3: identifying the common words and named entities in the document to be processed according to the labeled document to be processed.

Further, after S0, the method also comprises:

setting a labeling dictionary according to the labeling rules, wherein the labeling dictionary comprises labeled named entities and labeled common words;

S2 comprises: labeling the common words and named entities in the document to be processed according to the labeling dictionary.

Further, before S1, the method also comprises:

acquiring a corpus and labeling the corpus according to the labeling rules;

training a hidden Markov model according to the labeled corpus;

S2 comprises: taking the document to be processed as the input of the trained hidden Markov model, and labeling, by means of the trained hidden Markov model and according to the labeling rules, the common words and named entities in the document to be processed;

S3 comprises: identifying, by means of the trained hidden Markov model and according to the labeled document to be processed, the common words and named entities in the document to be processed.

Further, training the hidden Markov model according to the labeled corpus comprises:

determining, according to the labeled corpus and formula one, the value of θ at which P(Y | x : θ) is maximal;

wherein formula one is:

P(Y \mid x : \theta) = \frac{1}{Z_x(\theta)} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i) \right\}

where Y is the hidden state, x is the observed state, P(Y | x : θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, f_i(D_i) is a feature function, D_i is the argument of the feature function, and Z_x(θ) is the normalization factor.
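A toy numeric instance may make formula one easier to read. The sketch below is not the patented training procedure: the two-label set, the feature functions `f1` and `f2`, and the weights in `THETA` are all invented for illustration, D_i is assumed to be the pair (Y, x), and Z_x(θ) is computed by brute-force enumeration of every label sequence, which is only feasible for very short inputs.

```python
import itertools
import math

LABELS = ["B", "E"]   # toy label set (assumption)
THETA = [1.0, 0.5]    # toy parameters theta_1..theta_K with K = 2 (assumption)

def f1(y, x):
    """Hypothetical feature: number of positions where B is followed by E."""
    return sum(1 for a, b in zip(y, y[1:]) if (a, b) == ("B", "E"))

def f2(y, x):
    """Hypothetical feature: 1 if the first label is B, else 0."""
    return 1 if y[0] == "B" else 0

FEATURES = [f1, f2]

def score(y, x):
    """Unnormalized score exp(sum_i theta_i * f_i(D_i))."""
    return math.exp(sum(t * f(y, x) for t, f in zip(THETA, FEATURES)))

def probability(y, x):
    """P(Y | x : theta), with Z_x(theta) summed over all label sequences."""
    z = sum(score(c, x) for c in itertools.product(LABELS, repeat=len(x)))
    return score(tuple(y), x) / z

x = "参观"  # two observed characters (assumed back-translation of "visit")
for candidate in itertools.product(LABELS, repeat=len(x)):
    print(candidate, round(probability(candidate, x), 4))
```

Training in the sense of formula one would then search for the θ that maximizes this probability over the labeled corpus; the sketch only evaluates the probability for a fixed θ.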

Further, the named entities comprise person-name named entities, place-name named entities, and organization-name named entities;

the labeling rules comprise:

B denotes the beginning of a common word, E denotes the end of a common word, M denotes the middle part of a common word, and S denotes a special symbol, punctuation mark, or single-character word; RB denotes the beginning of a person-name named entity, RE denotes the end of a person-name named entity, and RM denotes the middle part of a person-name named entity; SB denotes the beginning of a place-name named entity, SE denotes the end of a place-name named entity, and SM denotes the middle part of a place-name named entity; TB denotes the beginning of an organization-name named entity, TE denotes the end of an organization-name named entity, and TM denotes the middle part of an organization-name named entity.
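The label set above can be written down as a small table, together with a consistency check on a label sequence. This is a sketch under assumptions: the rule itself states no such check, the helper names `split_label` and `is_well_formed` are hypothetical, and the check encodes only the beginning, middle, and end semantics given above.

```python
# The 13 labels of the labeling rule and their meanings.
LABEL_MEANING = {
    "B": "beginning of a common word",
    "M": "middle part of a common word",
    "E": "end of a common word",
    "S": "special symbol, punctuation mark, or single-character word",
    "RB": "beginning of a person-name named entity",
    "RM": "middle part of a person-name named entity",
    "RE": "end of a person-name named entity",
    "SB": "beginning of a place-name named entity",
    "SM": "middle part of a place-name named entity",
    "SE": "end of a place-name named entity",
    "TB": "beginning of an organization-name named entity",
    "TM": "middle part of an organization-name named entity",
    "TE": "end of an organization-name named entity",
}

def split_label(label):
    """Split a label into (type prefix, position), e.g. 'TB' -> ('T', 'B')."""
    if label == "S":
        return "", "S"
    return label[:-1], label[-1]

def is_well_formed(labels):
    """Check that each beginning is closed by an end of the same type."""
    open_prefix = None  # type prefix of the word being read, None if no word is open
    for label in labels:
        prefix, position = split_label(label)
        if position in ("S", "B"):
            if open_prefix is not None:
                return False  # a new unit starts before the previous word ended
            if position == "B":
                open_prefix = prefix
        else:  # "M" or "E" must continue a word of the same type
            if open_prefix != prefix:
                return False
            if position == "E":
                open_prefix = None
    return open_prefix is None

print(is_well_formed(["RB", "RE", "S", "SB", "SE"]))
```

Note that `S` alone is the single-unit label, while `SB`, `SM`, and `SE` carry the place-name prefix `S`; `split_label` separates the two cases.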

In another aspect, an embodiment of the present invention provides an apparatus for identifying a document, comprising:

a setting unit, configured to set labeling rules for words, wherein the labeling rules comprise labeling rules for named entities and labeling rules for common words;

an acquiring unit, configured to acquire a document to be processed;

a labeling unit, configured to label the common words and named entities in the document to be processed according to the labeling rules;

an identifying unit, configured to identify the common words and named entities in the document to be processed according to the labeled document to be processed.

Further, the apparatus also comprises:

a dictionary unit, configured to set a labeling dictionary according to the labeling rules, wherein the labeling dictionary comprises labeled named entities and labeled common words;

the labeling unit is configured to label the common words and named entities in the document to be processed according to the labeling dictionary.

Further, the apparatus also comprises:

a corpus labeling unit, configured to acquire a corpus and label the corpus according to the labeling rules;

a training unit, configured to train a hidden Markov model according to the labeled corpus;

the labeling unit is configured to take the document to be processed as the input of the trained hidden Markov model, and to label, by means of the trained hidden Markov model and according to the labeling rules, the common words and named entities in the document to be processed;

the identifying unit is configured to identify, by means of the trained hidden Markov model and according to the labeled document to be processed, the common words and named entities in the document to be processed.

Further, the training unit is configured to determine, according to the labeled corpus and formula one, the value of θ at which P(Y | x : θ) is maximal;

wherein formula one is:

P(Y \mid x : \theta) = \frac{1}{Z_x(\theta)} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i) \right\}

The labeling rules comprise: B denotes the beginning of a common word, E denotes the end of a common word, M denotes the middle part of a common word, and S denotes a special symbol, punctuation mark, or single-character word; RB denotes the beginning of a person-name named entity, RE denotes the end of a person-name named entity, and RM denotes the middle part of a person-name named entity; SB denotes the beginning of a place-name named entity, SE denotes the end of a place-name named entity, and SM denotes the middle part of a place-name named entity; TB denotes the beginning of an organization-name named entity, TE denotes the end of an organization-name named entity, and TM denotes the middle part of an organization-name named entity.

In embodiments of the present invention, labeling rules for words are preset, comprising labeling rules for named entities and labeling rules for common words. The common words and named entities in the document to be processed are labeled according to the labeling rules and are thereby identified. There is no need to segment words first and identify named entities afterwards: common words and named entities are identified together, so the document is identified more simply.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for identifying a document provided by an embodiment of the present invention;

Fig. 2 is a flowchart of another method for identifying a document provided by an embodiment of the present invention;

Fig. 3 is a schematic diagram of an apparatus for identifying a document provided by an embodiment of the present invention;

Fig. 4 is a schematic diagram of another apparatus for identifying a document provided by an embodiment of the present invention.

Detailed description

To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

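As shown in Fig. 1, an embodiment of the present invention provides a method for identifying a document, which may comprise the following steps:

S0: presetting labeling rules for words, wherein the labeling rules comprise labeling rules for named entities and labeling rules for common words;

S1: acquiring a document to be processed;

S2: labeling the common words and named entities in the document to be processed according to the labeling rules;

S3: identifying the common words and named entities in the document to be processed according to the labeled document to be processed.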

In one possible implementation, after S0, the method also comprises:

setting a labeling dictionary according to the labeling rules, wherein the labeling dictionary comprises labeled named entities and labeled common words;

S2 comprises: labeling the common words and named entities in the document to be processed according to the labeling dictionary.

In this embodiment, the labeling dictionary comprises labeled named entities and labeled common words. For example, the labeling dictionary contains the named entity "Peking University" and the common word "we". For the sentence "We go to Peking University to visit." in the document to be processed, labeling by step S2 labels "we" and "Peking University". Words that are not in the labeling dictionary, such as "go" and "visit" in this sentence, can be labeled manually.

To make the identification of the document to be processed more accurate, in one possible implementation, before S1, the method also comprises:

acquiring a corpus and labeling the corpus according to the labeling rules;

training a hidden Markov model according to the labeled corpus;

In embodiments of the present invention, the hidden Markov model (CRF, conditional random field algorithm) is trained on the corpus, and the trained hidden Markov model is used to label and identify the document to be processed.
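The train-then-label flow can be sketched as follows. This sketch assumes a plain first-order hidden Markov model estimated from smoothed counts and decoded with the Viterbi algorithm, a common textbook approach; it does not implement the formula-one objective. The functions `train` and `viterbi`, the smoothing constant `alpha`, and the one-sentence corpus (an assumed back-translation of the example sentence) are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(corpus, alpha=0.01):
    """Estimate start, transition and emission log-probabilities from counts."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    labels, chars = set(), set()
    for sentence in corpus:  # a sentence is a list of (character, label) pairs
        prev = None
        for ch, lab in sentence:
            labels.add(lab)
            chars.add(ch)
            emit[lab][ch] += 1
            if prev is None:
                start[lab] += 1
            else:
                trans[prev][lab] += 1
            prev = lab
    labels = sorted(labels)
    n_chars = len(chars) + 1  # one extra slot for unseen characters

    def logp(count, total, size):
        # Add-alpha smoothing keeps unseen events at a small non-zero probability.
        return math.log((count + alpha) / (total + alpha * size))

    return {
        "labels": labels,
        "start": {l: logp(start[l], sum(start.values()), len(labels)) for l in labels},
        "trans": {a: {b: logp(trans[a][b], sum(trans[a].values()), len(labels))
                      for b in labels} for a in labels},
        "emit": lambda l, ch: logp(emit[l][ch], sum(emit[l].values()), n_chars),
    }

def viterbi(model, text):
    """Return the most probable label sequence for a non-empty text."""
    labels = model["labels"]
    best = {l: model["start"][l] + model["emit"](l, text[0]) for l in labels}
    back = []
    for ch in text[1:]:
        prev_best, best, pointers = best, {}, {}
        for l in labels:
            p = max(labels, key=lambda a: prev_best[a] + model["trans"][a][l])
            best[l] = prev_best[p] + model["trans"][p][l] + model["emit"](l, ch)
            pointers[l] = p
        back.append(pointers)
    path = [max(labels, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

text = "张三去北京的清华大学参观"
gold = ["RB", "RE", "S", "SB", "SE", "S", "TB", "TM", "TM", "TE", "B", "E"]
model = train([list(zip(text, gold))])
print("".join(viterbi(model, text)))
```

A corpus of one sentence only shows the mechanics; a usable model would need a much larger labeled corpus.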

In embodiments of the present invention, the corpus is labeled according to the labeling rules; the quality of the whole model is jointly determined by the quality of the corpus and the quality of its labeling. To reduce manual work, embodiments of the present invention may pre-label the corpus automatically and then have annotators correct the mislabeled items.

In one possible implementation, training the hidden Markov model according to the labeled corpus comprises:

determining, according to the labeled corpus and formula one, the value of θ at which P(Y | x : θ) is maximal;

wherein formula one is:

P(Y \mid x : \theta) = \frac{1}{Z_x(\theta)} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i) \right\}

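In one possible implementation, the named entities comprise person-name named entities, place-name named entities, and organization-name named entities;

the labeling rules comprise:

B denotes the beginning of a common word, E denotes the end of a common word, M denotes the middle part of a common word, and S denotes a special symbol, punctuation mark, or single-character word; RB denotes the beginning of a person-name named entity, RE denotes the end of a person-name named entity, and RM denotes the middle part of a person-name named entity; SB denotes the beginning of a place-name named entity, SE denotes the end of a place-name named entity, and SM denotes the middle part of a place-name named entity; TB denotes the beginning of an organization-name named entity, TE denotes the end of an organization-name named entity, and TM denotes the middle part of an organization-name named entity.

In embodiments of the present invention, the labels set for person-name, place-name, and organization-name named entities allow all three kinds of named entity to be labeled at the same time, which speeds up identification of the document to be processed.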

For example, the person-name named entities include "Zhang San", the place-name named entities include "Beijing", and the organization-name named entities include "Tsinghua University". For the sentence "Zhang San goes to Tsinghua University of Beijing to visit.", labeling yields: RBRESSBSESTBTMTMTEBE.
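The label string in this example can be reproduced from a sentence whose words and word types are already known. The sketch below assumes the original Chinese sentence is 张三去北京的清华大学参观 (a back-translation, since this text gives the sentence only in English) and that every single-character word receives the label S regardless of its type; `PREFIX`, `label_word`, and `label_sentence` are hypothetical names.

```python
# Label prefix for each word type; common words have no prefix.
PREFIX = {"common": "", "person": "R", "place": "S", "organization": "T"}

def label_word(word, kind):
    """Return one label per character of the word, following the labeling rule."""
    if len(word) == 1:
        return ["S"]  # single-character word (assumed for every word type)
    p = PREFIX[kind]
    return [p + "B"] + [p + "M"] * (len(word) - 2) + [p + "E"]

def label_sentence(words):
    """words is a list of (word, kind) pairs; returns the joined label string."""
    return "".join(l for word, kind in words for l in label_word(word, kind))

sentence = [
    ("张三", "person"),            # "Zhang San"
    ("去", "common"),              # "go"
    ("北京", "place"),             # "Beijing"
    ("的", "common"),              # "of"
    ("清华大学", "organization"),  # "Tsinghua University"
    ("参观", "common"),            # "visit"
]
print(label_sentence(sentence))  # RBRESSBSESTBTMTMTEBE
```

Twelve characters yield twelve labels; the final full stop of the sentence is not labeled in the example string.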

In embodiments of the present invention, the corpus is labeled in a unified way, so that the word segmentation problem and the named entity recognition problem are solved together. Solving similar problems in a unified way reduces the workload and helps lower labor costs.

In embodiments of the present invention, the labeling rules are designed so that they cover every kind of label while keeping the number of label categories minimal. This minimizes the computation required for model training, so that, for the same effect, the algorithm executes as efficiently as possible.

A common word in embodiments of the present invention is any word in the document to be processed other than a named entity. In addition, embodiments of the present invention may output the result for the document to be processed as a label sequence.

As shown in Fig. 2, an embodiment of the present invention provides a method for identifying a document, which may comprise the following steps:

Step 201: presetting labeling rules for words, wherein the labeling rules comprise labeling rules for named entities and labeling rules for common words.

The named entities comprise person-name named entities, place-name named entities, and organization-name named entities.

Step 202: setting a labeling dictionary according to the labeling rules, wherein the labeling dictionary comprises labeled named entities and labeled common words.

For example, in the labeling dictionary, "Tsinghua University" is labeled TBTMTMTE; "Zhang San" is labeled RBRE; "Beijing" is labeled SBSE; "go" is labeled S; "of" is labeled S; and "visit" is labeled BE. Here "Tsinghua University" is an organization-name named entity, "Zhang San" is a person-name named entity, and "Beijing" is a place-name named entity; these three words are named entities. "Go", "of", and "visit" are words other than named entities and are common words.

Step 203: acquiring a document to be processed.

For example, the document to be processed contains the sentence "Zhang San goes to Tsinghua University of Beijing to visit." This sentence is identified by the embodiment of the present invention.

Step 204: labeling the common words and named entities in the document to be processed according to the labeling dictionary.

For example, for the sentence "Zhang San goes to Tsinghua University of Beijing to visit." in the document to be processed, matching against the words in the dictionary yields the labeling of the sentence: RBRESSBSESTBTMTMTEBE.
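Dictionary-based labeling of an unsegmented sentence can be sketched as follows. The text does not say how dictionary entries are matched against the document; this sketch assumes forward maximum matching (longest entry first), and it assumes the Chinese forms of the dictionary entries, which are back-translations. `LABEL_DICT` and `label_text` are hypothetical names.

```python
# Labeling dictionary from the example: word -> labels for its characters.
LABEL_DICT = {
    "清华大学": ["TB", "TM", "TM", "TE"],  # "Tsinghua University"
    "张三": ["RB", "RE"],                  # "Zhang San"
    "北京": ["SB", "SE"],                  # "Beijing"
    "去": ["S"],                           # "go"
    "的": ["S"],                           # "of"
    "参观": ["B", "E"],                    # "visit"
}
MAX_LEN = max(len(word) for word in LABEL_DICT)

def label_text(text):
    """Label each character by matching the longest dictionary entry first."""
    labels = []
    i = 0
    while i < len(text):
        for n in range(min(MAX_LEN, len(text) - i), 0, -1):
            word = text[i:i + n]
            if word in LABEL_DICT:
                labels.extend(LABEL_DICT[word])
                i += n
                break
        else:
            labels.append(None)  # not in the dictionary; left for manual labeling
            i += 1
    return labels

print("".join(label_text("张三去北京的清华大学参观")))  # RBRESSBSESTBTMTMTEBE
```

Characters that match no entry come back as `None`, which corresponds to the words that are left for manual labeling.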

Step 205: identifying the common words and named entities in the document to be processed according to the labeled document to be processed.

Different labels represent different word types, so each word in the document to be processed can be identified from the labels in the document.

For example, the sentence "Zhang San goes to Tsinghua University of Beijing to visit." in the document to be processed is labeled RBRESSBSESTBTMTMTEBE. Combined with the labeling rules, it can be determined that "Tsinghua University" is an organization-name named entity, "Zhang San" is a person-name named entity, and "Beijing" is a place-name named entity; these three words are named entities. "Go", "of", and "visit" are words other than named entities and are common words.
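Reading the words and their types back from a label sequence can be sketched as follows. The sketch assumes the label sequence is well formed, treats every unit labeled S as a common word, and uses an assumed back-translation of the example sentence; `KIND` and `decode` are hypothetical names.

```python
# Word type for each label prefix; common words have no prefix.
KIND = {"": "common", "R": "person", "S": "place", "T": "organization"}

def decode(text, labels):
    """Group characters into words by their labels; return (word, kind) pairs."""
    words = []
    start = 0
    for i, label in enumerate(labels):
        if label == "S":
            words.append((text[i], "common"))  # single character or symbol
        elif label.endswith("B"):
            start = i                          # a multi-character word begins
        elif label.endswith("E"):
            words.append((text[start:i + 1], KIND[label[:-1]]))
        # labels ending in "M" continue the current word; nothing to do
    return words

text = "张三去北京的清华大学参观"
labels = ["RB", "RE", "S", "SB", "SE", "S", "TB", "TM", "TM", "TE", "B", "E"]
for word, kind in decode(text, labels):
    print(word, kind)
```

Word boundaries and entity types come out of the same pass over the labels, with no separate segmentation step.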

It can be seen that the solution provided by the embodiments of the present invention identifies not only the named entities and common words in the document to be processed but also the type of each named entity. Named entities of different types are identified in a unified way, which improves the efficiency of identifying the document to be processed.

As shown in Fig. 3 and Fig. 4, an embodiment of the present invention provides an apparatus for identifying a document. The apparatus embodiment may be implemented in software, in hardware, or in a combination of software and hardware. At the hardware level, Fig. 3 is a hardware structure diagram of the device on which the apparatus for identifying a document provided by an embodiment of the present invention resides. In addition to the processor, memory, network interface, and non-volatile storage shown in Fig. 3, the device on which the apparatus resides may generally also include other hardware, such as a forwarding chip responsible for processing messages. Taking a software implementation as an example, as shown in Fig. 4, the apparatus in the logical sense is formed when the CPU of the device on which it resides reads the corresponding computer program instructions from non-volatile storage into memory and runs them. The apparatus for identifying a document provided by this embodiment comprises:

a setting unit 401, configured to set labeling rules for words, wherein the labeling rules comprise labeling rules for named entities and labeling rules for common words;

an acquiring unit 402, configured to acquire a document to be processed;

a labeling unit 403, configured to label the common words and named entities in the document to be processed according to the labeling rules;

an identifying unit 404, configured to identify the common words and named entities in the document to be processed according to the labeled document to be processed.

In one possible implementation, the apparatus also comprises:

the labeling unit 403 is configured to label the common words and named entities in the document to be processed according to the labeling dictionary.

In one possible implementation, the apparatus also comprises:

the labeling unit 403 is configured to take the document to be processed as the input of the trained hidden Markov model, and to label, by means of the trained hidden Markov model and according to the labeling rules, the common words and named entities in the document to be processed;

the identifying unit 404 is configured to identify, by means of the trained hidden Markov model and according to the labeled document to be processed, the common words and named entities in the document to be processed.

In one possible implementation, the training unit is configured to determine, according to the labeled corpus and formula one, the value of θ at which P(Y | x : θ) is maximal;

wherein formula one is:

P(Y \mid x : \theta) = \frac{1}{Z_x(\theta)} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i) \right\}

Details such as the information exchange between the units of the above apparatus and their execution processes are based on the same concept as the method embodiments of the present invention; for specifics, refer to the description of the method embodiments, which is not repeated here.

The embodiments of the present invention have at least the following advantages:

1. In embodiments of the present invention, labeling rules for words are preset, comprising labeling rules for named entities and labeling rules for common words. The common words and named entities in the document to be processed are labeled according to the labeling rules and are thereby identified. There is no need to segment words first and identify named entities afterwards: common words and named entities are identified together, so the document is identified more simply.

2. In embodiments of the present invention, the labels set for person-name, place-name, and organization-name named entities allow all three kinds of named entity to be labeled at the same time, which speeds up identification of the document to be processed.

3. In embodiments of the present invention, the corpus is labeled in a unified way, so that the word segmentation problem and the named entity recognition problem are solved together. Solving similar problems in a unified way reduces the workload and helps lower labor costs.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element introduced by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.

A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, and optical discs.

Finally, it should be noted that the foregoing describes only preferred embodiments of the present invention, is intended merely to illustrate its technical solutions, and does not limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within its scope of protection.

Claims

1. A method for identifying a document, characterized by comprising:

S1: acquiring a document to be processed;

2. The method according to claim 1, characterized in that, after S0, the method further comprises:

3. The method according to claim 1, characterized in that, before S1, the method further comprises:

training a hidden Markov model according to the labeled corpus;

4. The method according to claim 3, characterized in that training the hidden Markov model according to the labeled corpus comprises:

wherein formula one is:

P(Y \mid x : \theta) = \frac{1}{Z_x(\theta)} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i) \right\}

5. The method according to any one of claims 1 to 4, characterized in that

the labeling rules comprise:

6. An apparatus for identifying a document, characterized by comprising:

an acquiring unit, configured to acquire a document to be processed;

7. The apparatus according to claim 6, characterized by further comprising:

8. The apparatus according to claim 6, characterized by further comprising:

9. The apparatus according to claim 8, characterized in that the training unit is configured to determine, according to the labeled corpus and formula one, the value of θ at which P(Y | x : θ) is maximal;

wherein formula one is:

P(Y \mid x : \theta) = \frac{1}{Z_x(\theta)} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i) \right\}

10. The apparatus according to any one of claims 6 to 9, characterized in that the named entities comprise person-name named entities, place-name named entities, and organization-name named entities;