CN105808523A - Method and apparatus for identifying document - Google Patents

Publication number
CN105808523A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610130559.5A
Other languages
Chinese (zh)
Inventor
王明君
王茂帅
柳廷娜
高峰
于文才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Software Co Ltd
Original Assignee
Inspur Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Software Co Ltd
Priority to CN201610130559.5A
Publication of CN105808523A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and apparatus for identifying a document. The method comprises the steps of presetting labeling rules of words, wherein the labeling rules comprise labeling rules of named entities and labeling rules of common words; acquiring a document to be processed; according to the labeling rules, labeling common words and named entities in the document to be processed; and according to the labeled document to be processed, identifying the common words and the named entities in the document to be processed. According to the method and apparatus for identifying the document, which are provided by the invention, the document can be more simply identified.

Description

Method and apparatus for identifying a document
Technical field
The present invention relates to the field of computer technology, and in particular to a method and apparatus for identifying a document.
Background
Word segmentation, as the most basic problem in natural language processing, has long received attention from experts in the field. Chinese word segmentation differs fundamentally from English string splitting: English uses spaces to separate words, whereas Chinese has no such convention. To regroup a continuous sequence of characters into a sequence of words according to a given specification, experts in the field have proposed various solutions, some statistics-based and some rule-based. To date, no algorithm solves the word segmentation problem perfectly. Named entity recognition is another major problem in natural language processing; the basic named entities include time expressions, person names, place names, organization names, and so on. It is difficult to recognize named entities of all categories with a single model or a single process.
In the prior-art method of identifying a document, the document to be processed is first segmented into words, and after segmentation ends, named entities are identified from among all the words. The prior-art segmentation process itself cannot recognize named entities. For example, if the document to be processed contains the term "Peking University", segmentation generally splits it into the two words "Beijing" and "university"; preset rules then merge the two words into "Peking University", thereby identifying the named entity "Peking University".
As the above description shows, the prior-art method of identifying a document must first perform word segmentation and then perform named entity recognition, which is relatively complicated.
Summary of the invention
Embodiments of the present invention provide a method and apparatus for identifying a document, which can identify a document more simply.
In one aspect, an embodiment of the present invention provides a method for identifying a document, including:
S0: presetting labeling rules for words, wherein the labeling rules include labeling rules for named entities and labeling rules for common words;
S1: acquiring a document to be processed;
S2: labeling the common words and named entities in the document to be processed according to the labeling rules;
S3: identifying the common words and named entities in the document to be processed according to the labeled document to be processed.
Further, after S0, the method also includes:
setting a labeling dictionary according to the labeling rules, wherein the labeling dictionary includes labeled named entities and labeled common words;
S2 includes: labeling the common words and named entities in the document to be processed according to the labeling dictionary.
Further, before S1, the method also includes:
acquiring a corpus, and labeling the corpus according to the labeling rules;
training a hidden Markov model according to the labeled corpus;
S2 includes: taking the document to be processed as the input of the trained hidden Markov model, and labeling, by the trained hidden Markov model and according to the labeling rules, the common words and named entities in the document to be processed;
S3 includes: identifying, by the trained hidden Markov model and according to the labeled document to be processed, the common words and named entities in the document to be processed.
Further, training the hidden Markov model according to the labeled corpus includes:
determining, according to the labeled corpus and formula one, the value of θ at which P(Y | x : θ) is maximal;
wherein formula one is:
$$P(Y \mid x : \theta) = \frac{1}{Z_x(\theta)} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i) \right\}$$
wherein Y is the hidden state, x is the observed state, P(Y | x : θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, f_i(D_i) is a feature function, D_i is the argument of the feature function, and Z_x(θ) is the normalization factor.
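Formula one names a normalization factor and a maximization over θ without writing either out. Under the usual log-linear convention (an assumption here, since the text defines neither), they could be written as follows, where Y' ranges over the candidate hidden-state sequences for x, D_i' is the corresponding argument of f_i, and (x^{(n)}, Y^{(n)}), n = 1, ..., N, are hypothetical names for the labeled corpus sentences:

$$Z_x(\theta) = \sum_{Y'} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i') \right\}, \qquad \hat{\theta} = \arg\max_{\theta} \sum_{n=1}^{N} \log P\left(Y^{(n)} \mid x^{(n)} : \theta\right)$$

With this choice of Z_x(θ), the probabilities of all candidate sequences for a given x sum to 1.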
Further, the named entities include: person-name named entities, place-name named entities, and organization-name named entities;
the labeling rules include:
B represents the beginning of a common word, E represents the end of a common word, M represents the middle part of a common word, and S represents a special symbol, a punctuation mark, or a single-character word; RB, RE, and RM represent the beginning, end, and middle part of a person-name named entity, respectively; SB, SE, and SM represent the beginning, end, and middle part of a place-name named entity, respectively; TB, TE, and TM represent the beginning, end, and middle part of an organization-name named entity, respectively.
In another aspect, an embodiment of the present invention provides an apparatus for identifying a document, including:
a setting unit, configured to set labeling rules for words, wherein the labeling rules include labeling rules for named entities and labeling rules for common words;
an acquiring unit, configured to acquire a document to be processed;
a labeling unit, configured to label the common words and named entities in the document to be processed according to the labeling rules;
an identification unit, configured to identify the common words and named entities in the document to be processed according to the labeled document to be processed.
Further, the apparatus also includes:
a dictionary unit, configured to set a labeling dictionary according to the labeling rules, wherein the labeling dictionary includes labeled named entities and labeled common words;
the labeling unit is configured to label the common words and named entities in the document to be processed according to the labeling dictionary.
Further, the apparatus also includes:
a corpus labeling unit, configured to acquire a corpus and label the corpus according to the labeling rules;
a training unit, configured to train a hidden Markov model according to the labeled corpus;
the labeling unit is configured to take the document to be processed as the input of the trained hidden Markov model and to label, by the trained hidden Markov model and according to the labeling rules, the common words and named entities in the document to be processed;
the identification unit is configured to identify, by the trained hidden Markov model and according to the labeled document to be processed, the common words and named entities in the document to be processed.
Further, the training unit is configured to determine, according to the labeled corpus and formula one, the value of θ at which P(Y | x : θ) is maximal;
wherein formula one is:
$$P(Y \mid x : \theta) = \frac{1}{Z_x(\theta)} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i) \right\}$$
wherein Y is the hidden state, x is the observed state, P(Y | x : θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, f_i(D_i) is a feature function, D_i is the argument of the feature function, and Z_x(θ) is the normalization factor.
Further, the named entities include: person-name named entities, place-name named entities, and organization-name named entities;
the labeling rules include: B represents the beginning of a common word, E represents the end of a common word, M represents the middle part of a common word, and S represents a special symbol, a punctuation mark, or a single-character word; RB, RE, and RM represent the beginning, end, and middle part of a person-name named entity, respectively; SB, SE, and SM represent the beginning, end, and middle part of a place-name named entity, respectively; TB, TE, and TM represent the beginning, end, and middle part of an organization-name named entity, respectively.
In the embodiments of the present invention, labeling rules for words are preset, including labeling rules for named entities and labeling rules for common words. The common words and named entities in the document to be processed are labeled according to the labeling rules, and are thereby identified. There is no need to segment words first and then identify named entities: common words and named entities are identified together, so the document is identified more simply.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for identifying a document provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another method for identifying a document provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of an apparatus for identifying a document provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of another apparatus for identifying a document provided by an embodiment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a method for identifying a document. The method may include the following steps:
S0: presetting labeling rules for words, wherein the labeling rules include labeling rules for named entities and labeling rules for common words;
S1: acquiring a document to be processed;
S2: labeling the common words and named entities in the document to be processed according to the labeling rules;
S3: identifying the common words and named entities in the document to be processed according to the labeled document to be processed.
In the embodiments of the present invention, labeling rules for words are preset, including labeling rules for named entities and labeling rules for common words. The common words and named entities in the document to be processed are labeled according to the labeling rules, and are thereby identified. There is no need to segment words first and then identify named entities: common words and named entities are identified together, so the document is identified more simply.
In a possible implementation, after S0, the method also includes:
setting a labeling dictionary according to the labeling rules, wherein the labeling dictionary includes labeled named entities and labeled common words;
S2 includes: labeling the common words and named entities in the document to be processed according to the labeling dictionary.
In this embodiment, the labeling dictionary includes labeled named entities and labeled common words. For example, the labeling dictionary includes the named entity "Peking University" and the common word "we". For the sentence "We go to Peking University to visit." in the document to be processed, the labeling in step S2 labels "we" and "Peking University". In addition, words that are not in the labeling dictionary can be labeled manually, for example "go" and "visit" in that sentence.
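A labeling dictionary entry pairs a word with one label per character. The sketch below shows one way such entries could be generated from a word and its kind. The function name `labels_for`, the kind names in `PREFIX`, and the Chinese strings (back-translated from the English rendering) are assumptions for illustration and do not come from the patent text.

```python
# Prefix letter per entry kind; "" marks a common word. Kind names are hypothetical.
PREFIX = {"common": "", "person": "R", "place": "S", "organization": "T"}

def labels_for(word: str, kind: str) -> list[str]:
    """Return one label per character of a dictionary entry."""
    if len(word) == 1:
        # The rules give S to single-character words, punctuation and symbols.
        # A single-character named entity is not covered by the rules;
        # treating it as S is an assumption of this sketch.
        return ["S"]
    prefix = PREFIX[kind]
    middle = [prefix + "M"] * (len(word) - 2)
    return [prefix + "B"] + middle + [prefix + "E"]

# Assumed Chinese originals of "Peking University", "we" and "go".
dictionary = {
    "北京大学": labels_for("北京大学", "organization"),
    "我们": labels_for("我们", "common"),
    "去": labels_for("去", "common"),
}
print(dictionary)
```

Generating the labels from the word length keeps the dictionary consistent with the labeling rules, so entries do not have to be typed in by hand.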
To make the identification of the document to be processed more accurate, in a possible implementation, before S1, the method also includes:
acquiring a corpus, and labeling the corpus according to the labeling rules;
training a hidden Markov model according to the labeled corpus;
S2 includes: taking the document to be processed as the input of the trained hidden Markov model, and labeling, by the trained hidden Markov model and according to the labeling rules, the common words and named entities in the document to be processed;
S3 includes: identifying, by the trained hidden Markov model and according to the labeled document to be processed, the common words and named entities in the document to be processed.
In the embodiments of the present invention, the hidden Markov model (CRF, conditional random field algorithm) is trained on the corpus, and the trained hidden Markov model labels and identifies the document to be processed.
In the embodiments of the present invention, the corpus is labeled according to the labeling rules; the quality of the whole model is determined jointly by the quality of the corpus and the quality of its labeling. To save manual work, the corpus can first be pre-labeled automatically, after which mislabeled items are corrected manually.
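The pre-label-then-correct workflow can be sketched as follows. This is a minimal illustration under stated assumptions: the function `apply_corrections`, the shape of the `corrections` mapping (character index to reviewer-chosen label), and the sample sentence fragment are all hypothetical and are not specified by the patent.

```python
def apply_corrections(pre_labels, corrections):
    """Overlay manual corrections on automatically produced pre-labels.

    `corrections` maps a character index to the label chosen by a reviewer.
    Returns the final labels and the number of labels that actually changed.
    """
    final = list(pre_labels)
    for index, label in corrections.items():
        final[index] = label
    changed = sum(1 for before, after in zip(pre_labels, final) if before != after)
    return final, changed

# Assumed fragment "Zhang San goes to Beijing"; the pre-labeler is imagined to
# have mistaken the person name for a common word.
chars = list("张三去北京")
pre_labels = ["B", "E", "S", "SB", "SE"]
corrections = {0: "RB", 1: "RE"}

final, changed = apply_corrections(pre_labels, corrections)
training_rows = list(zip(chars, final))  # (character, label) pairs for training
print(training_rows, changed)
```

Counting the changed labels gives a rough measure of how much manual work the pre-labeling saved.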
In a possible implementation, training the hidden Markov model according to the labeled corpus includes:
determining, according to the labeled corpus and formula one, the value of θ at which P(Y | x : θ) is maximal;
wherein formula one is:
$$P(Y \mid x : \theta) = \frac{1}{Z_x(\theta)} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i) \right\}$$
wherein Y is the hidden state, x is the observed state, P(Y | x : θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, f_i(D_i) is a feature function, D_i is the argument of the feature function, and Z_x(θ) is the normalization factor.
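To make formula one concrete, the sketch below evaluates it on a toy case. It rests on several assumptions that the patent does not state: D_i is taken to be the pair (Y, x), the normalization factor sums over a small hand-picked candidate list rather than over all label sequences, and the feature functions, weights, and the Chinese string (assumed original of "Beijing") are invented for illustration.

```python
import math

def conditional_probability(target, candidates, theta, features, x):
    """P(target | x : theta) per formula one, normalized over `candidates` only."""
    def score(y):
        # Weighted feature sum: sum_i theta_i * f_i(D_i), assuming D_i = (y, x).
        return sum(t * f(y, x) for t, f in zip(theta, features))
    z = sum(math.exp(score(y)) for y in candidates)  # normalization factor Z_x(theta)
    return math.exp(score(target)) / z

# Hypothetical feature functions and weights, chosen only for illustration.
def f_place(y, x):
    # Fires when a known place name receives the place-name labels.
    return 1.0 if x == "北京" and y == ("SB", "SE") else 0.0

def f_singles(y, x):
    # Counts single-character labels in the candidate sequence.
    return float(sum(1 for label in y if label == "S"))

features = [f_place, f_singles]
theta = [2.0, -1.0]
x = "北京"
candidates = [("SB", "SE"), ("B", "E"), ("S", "S")]

probs = {y: conditional_probability(y, candidates, theta, features, x) for y in candidates}
best = max(probs, key=probs.get)
print(best, probs[best])
```

Training would adjust `theta` so that the labeled corpus sequences receive high probability; here the weights are fixed by hand to show only the scoring and normalization.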
In a possible implementation, the named entities include: person-name named entities, place-name named entities, and organization-name named entities;
the labeling rules include:
B represents the beginning of a common word, E represents the end of a common word, M represents the middle part of a common word, and S represents a special symbol, a punctuation mark, or a single-character word; RB, RE, and RM represent the beginning, end, and middle part of a person-name named entity, respectively; SB, SE, and SM represent the beginning, end, and middle part of a place-name named entity, respectively; TB, TE, and TM represent the beginning, end, and middle part of an organization-name named entity, respectively.
In the embodiments of the present invention, by setting labels for person-name, place-name, and organization-name named entities, all three kinds of named entity can be labeled at the same time, which speeds up identification of the document to be processed.
For example, the person-name named entities include "Zhang San", the place-name named entities include "Beijing", and the organization-name named entities include "Tsinghua University". For the sentence "Zhang San goes to Beijing's Tsinghua University to visit.", labeling yields: RBRESSBSESTBTMTMTEBE (that is, RB RE S SB SE S TB TM TM TE B E, one label per character of the Chinese sentence).
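The label string in the example above is written without separators, and a naive left-to-right split can go wrong: "SBE" could be read as SB, E or as S, B, E. The sketch below recovers the individual labels by backtracking and accepting only a well-formed sequence. The well-formedness constraints (a word start must be continued by a middle or end label of the same kind) are inferred from the labeling rules, and all function names are hypothetical.

```python
from typing import List, Optional

LABELS = ("RB", "RM", "RE", "SB", "SM", "SE", "TB", "TM", "TE", "B", "M", "E", "S")

def can_follow(prev: Optional[str], cur: str) -> bool:
    """True if `cur` may directly follow `prev` in a well-formed label sequence."""
    cur_prefix, cur_pos = cur[:-1], cur[-1]
    if prev is None or prev[-1] in "ES":
        # At a word boundary only a word start or the single label S is allowed.
        return cur_pos in "BS"
    # Inside a word: continue the same kind of word with a middle or end label.
    return cur_prefix == prev[:-1] and cur_pos in "ME"

def split_labels(packed: str, prev: Optional[str] = None) -> Optional[List[str]]:
    """Split a separator-free label string, or return None if no valid split exists."""
    if not packed:
        # The sequence must not stop in the middle of a word.
        return [] if prev is None or prev[-1] in "ES" else None
    for size in (2, 1):
        cand = packed[:size]
        if len(cand) == size and cand in LABELS and can_follow(prev, cand):
            rest = split_labels(packed[size:], cand)
            if rest is not None:
                return [cand] + rest
    return None

print(split_labels("RBRESSBSESTBTMTMTEBE"))
print(split_labels("SBE"))
```

Storing labels with an explicit separator avoids the ambiguity altogether; the backtracking is only needed when the packed form is all that is available.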
In the embodiments of the present invention, the corpus is labeled in a unified way, so that the word segmentation problem and the named entity recognition problem are solved together. Solving similar problems in a unified way reduces the workload and helps lower labor costs.
In the embodiments of the present invention, the labeling rules are designed so that they cover all the labels that need to be distinguished while keeping the number of label categories to a minimum. This minimizes the computation required for model training and, for the same effect, allows the algorithm to execute as efficiently as possible.
The common words in the embodiments of the present invention are the words in the document to be processed other than named entities. In addition, the embodiments of the present invention can output the result for the document to be processed as a labeled sequence.
As shown in Fig. 2, an embodiment of the present invention provides a method for identifying a document. The method may include the following steps:
Step 201: presetting labeling rules for words, wherein the labeling rules include labeling rules for named entities and labeling rules for common words.
The named entities include: person-name named entities, place-name named entities, and organization-name named entities;
the labeling rules include: B represents the beginning of a common word, E represents the end of a common word, M represents the middle part of a common word, and S represents a special symbol, a punctuation mark, or a single-character word; RB, RE, and RM represent the beginning, end, and middle part of a person-name named entity, respectively; SB, SE, and SM represent the beginning, end, and middle part of a place-name named entity, respectively; TB, TE, and TM represent the beginning, end, and middle part of an organization-name named entity, respectively.
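The label inventory of step 201 is small and regular, so it can be enumerated programmatically. The sketch below is one possible encoding; the dictionary keys and function name are illustrative names and not part of the patent.

```python
# Prefix letters come from the labeling rules; the keys are illustrative names.
ENTITY_PREFIX = {"person": "R", "place": "S", "organization": "T"}
POSITIONS = ("B", "M", "E")  # beginning, middle, end

def build_label_set() -> list[str]:
    """Enumerate every label defined by the labeling rules."""
    # Common-word labels plus S for symbols, punctuation and single-character words.
    labels = list(POSITIONS) + ["S"]
    for prefix in ENTITY_PREFIX.values():
        labels.extend(prefix + pos for pos in POSITIONS)
    return labels

LABELS = build_label_set()
print(len(LABELS), LABELS)
```

Note that the standalone label S and the place-name prefix S share a letter but are distinct labels: S, SB, SM, and SE all appear in the set.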
Step 202: setting a labeling dictionary according to the labeling rules, wherein the labeling dictionary includes labeled named entities and labeled common words.
For example, the labeling dictionary includes: "Tsinghua University", labeled TBTMTMTE; "Zhang San", labeled RBRE; "Beijing", labeled SBSE; "go", labeled S; the possessive particle ("'s"), labeled S; and "visit", labeled BE. Here "Tsinghua University" is an organization-name named entity, "Zhang San" is a person-name named entity, and "Beijing" is a place-name named entity; these three words are named entities. "Go", the possessive particle, and "visit" are words other than named entities and are common words.
Step 203: acquiring a document to be processed.
For example, the document to be processed contains the sentence: "Zhang San goes to Beijing's Tsinghua University to visit." This sentence is identified by the embodiment of the present invention.
Step 204: labeling the common words and named entities in the document to be processed according to the labeling dictionary.
For example, for the sentence "Zhang San goes to Beijing's Tsinghua University to visit" in the document to be processed, matching against the words in the dictionary yields the labels of this sentence: RBRESSBSESTBTMTMTEBE.
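Step 204 says only that the sentence is matched against the dictionary; it does not say how. The sketch below uses forward maximum matching, which is a swapped-in technique chosen for illustration and not necessarily what the patent intends. The Chinese strings are assumed back-translations of the English rendering, and `label_with_dictionary` and `max_len` are hypothetical names.

```python
def label_with_dictionary(text, dictionary, max_len=4):
    """Label `text` character by character using forward maximum matching."""
    labels = []
    i = 0
    while i < len(text):
        # Try the longest dictionary entry starting at position i first.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary:
                labels.extend(dictionary[piece])
                i += size
                break
        else:
            # Unknown character: None marks a gap left for manual labeling.
            labels.append(None)
            i += 1
    return labels

# Assumed Chinese originals of the dictionary entries from step 202.
dictionary = {
    "清华大学": ["TB", "TM", "TM", "TE"],  # Tsinghua University
    "张三": ["RB", "RE"],                  # Zhang San
    "北京": ["SB", "SE"],                  # Beijing
    "去": ["S"],                           # go
    "的": ["S"],                           # possessive particle
    "参观": ["B", "E"],                    # visit
}

labels = label_with_dictionary("张三去北京的清华大学参观", dictionary)
print(" ".join(labels))
```

Characters with no dictionary entry come back as `None`, which matches the earlier remark that words missing from the labeling dictionary can be labeled manually.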
Step 205: identifying the common words and named entities in the document to be processed according to the labeled document to be processed.
Because different labels represent different word types, each word in the document to be processed can be identified from the labels in the document.
For example, the sentence "Zhang San goes to Beijing's Tsinghua University to visit" in the document to be processed is labeled: RBRESSBSESTBTMTMTEBE. Combined with the labeling rules, it can be determined that "Tsinghua University" is an organization-name named entity, "Zhang San" is a person-name named entity, and "Beijing" is a place-name named entity; these three words are named entities. "Go", the possessive particle, and "visit" are words other than named entities and are common words.
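The identification in step 205 amounts to reading words and their types back off the label sequence. A minimal sketch follows, assuming the labels are already split into a list and are well formed; the function name `decode`, the type names in `KIND`, and the Chinese sentence (an assumed back-translation) are illustrative.

```python
# Human-readable type per label prefix; the wording of the names is illustrative.
KIND = {"": "common word", "R": "person name", "S": "place name", "T": "organization name"}

def decode(chars, labels):
    """Group characters into (word, type) pairs according to their labels."""
    items = []
    buffer = ""
    for ch, label in zip(chars, labels):
        if label == "S":
            # Symbol, punctuation mark or single-character word.
            items.append((ch, "common word"))
            continue
        prefix, position = label[:-1], label[-1]
        buffer += ch
        if position == "E":
            # An end label closes the current word.
            items.append((buffer, KIND[prefix]))
            buffer = ""
    return items

chars = "张三去北京的清华大学参观"
labels = ["RB", "RE", "S", "SB", "SE", "S", "TB", "TM", "TM", "TE", "B", "E"]
for word, kind in decode(chars, labels):
    print(word, kind)
```

Word segmentation and entity typing fall out of the same pass, which is the point the embodiment makes about identifying common words and named entities together.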
As can be seen, the solution provided by the embodiments of the present invention not only identifies the named entities and common words in the document to be processed, but also identifies the type of each named entity. It thus identifies different types of named entities in a unified way and improves the efficiency of identifying the document to be processed.
As shown in Fig. 3 and Fig. 4, an embodiment of the present invention provides an apparatus for identifying a document. The apparatus embodiment may be implemented in software, in hardware, or in a combination of software and hardware. At the hardware level, Fig. 3 is a hardware structure diagram of the device on which an apparatus for identifying a document provided by an embodiment of the present invention resides. In addition to the processor, memory, network interface, and non-volatile storage shown in Fig. 3, the device on which the apparatus resides may generally also include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as shown in Fig. 4, the apparatus in the logical sense is formed by the CPU of the device on which it resides reading the corresponding computer program instructions from non-volatile storage into memory and running them. The apparatus for identifying a document provided by this embodiment includes:
a setting unit 401, configured to set labeling rules for words, wherein the labeling rules include labeling rules for named entities and labeling rules for common words;
an acquiring unit 402, configured to acquire a document to be processed;
a labeling unit 403, configured to label the common words and named entities in the document to be processed according to the labeling rules;
an identification unit 404, configured to identify the common words and named entities in the document to be processed according to the labeled document to be processed.
In a possible implementation, the apparatus also includes:
a dictionary unit, configured to set a labeling dictionary according to the labeling rules, wherein the labeling dictionary includes labeled named entities and labeled common words;
the labeling unit 403 is configured to label the common words and named entities in the document to be processed according to the labeling dictionary.
In a possible implementation, the apparatus also includes:
a corpus labeling unit, configured to acquire a corpus and label the corpus according to the labeling rules;
a training unit, configured to train a hidden Markov model according to the labeled corpus;
the labeling unit 403 is configured to take the document to be processed as the input of the trained hidden Markov model and to label, by the trained hidden Markov model and according to the labeling rules, the common words and named entities in the document to be processed;
the identification unit 404 is configured to identify, by the trained hidden Markov model and according to the labeled document to be processed, the common words and named entities in the document to be processed.
In a possible implementation, the training unit is configured to determine, according to the labeled corpus and formula one, the value of θ at which P(Y | x : θ) is maximal;
wherein formula one is:
$$P(Y \mid x : \theta) = \frac{1}{Z_x(\theta)} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i) \right\}$$
wherein Y is the hidden state, x is the observed state, P(Y | x : θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, f_i(D_i) is a feature function, D_i is the argument of the feature function, and Z_x(θ) is the normalization factor.
In a possible implementation, the named entities include: person-name named entities, place-name named entities, and organization-name named entities;
the labeling rules include: B represents the beginning of a common word, E represents the end of a common word, M represents the middle part of a common word, and S represents a special symbol, a punctuation mark, or a single-character word; RB, RE, and RM represent the beginning, end, and middle part of a person-name named entity, respectively; SB, SE, and SM represent the beginning, end, and middle part of a place-name named entity, respectively; TB, TE, and TM represent the beginning, end, and middle part of an organization-name named entity, respectively.
The information exchange between the units of the above apparatus, their execution processes, and similar details are based on the same concept as the method embodiments of the present invention. For specifics, refer to the description of the method embodiments; they are not repeated here.
The embodiments of the present invention have at least the following beneficial effects:
1. In the embodiments of the present invention, labeling rules for words are preset, including labeling rules for named entities and labeling rules for common words. The common words and named entities in the document to be processed are labeled according to the labeling rules, and are thereby identified. There is no need to segment words first and then identify named entities: common words and named entities are identified together, so the document is identified more simply.
2. In the embodiments of the present invention, by setting labels for person-name, place-name, and organization-name named entities, all three kinds of named entity can be labeled at the same time, which speeds up identification of the document to be processed.
3. In the embodiments of the present invention, the corpus is labeled in a unified way, so that the word segmentation problem and the named entity recognition problem are solved together. Solving similar problems in a unified way reduces the workload and helps lower labor costs.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as ROM, RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the above are merely preferred embodiments of the present invention, intended only to illustrate its technical solutions and not to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the protection scope of the present invention.

Claims (10)

1. A method for identifying a document, characterized by including:
S0: presetting labeling rules for words, wherein the labeling rules include labeling rules for named entities and labeling rules for common words;
S1: acquiring a document to be processed;
S2: labeling the common words and named entities in the document to be processed according to the labeling rules;
S3: identifying the common words and named entities in the document to be processed according to the labeled document to be processed.
2. The method according to claim 1, characterized in that, after S0, the method further comprises:
setting up a labeled dictionary according to the labeling rules, the labeled dictionary including labeled named entities and labeled generic words;
wherein S2 comprises: labeling the generic words and named entities in the document to be processed according to the labeled dictionary.
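As an illustration of dictionary-based labeling, the following is a minimal sketch rather than the claimed implementation. It assumes forward maximum matching against the dictionary, which the claim does not specify, and it borrows the per-character tags listed in claim 5. The dictionary entries and the names `LABELED_DICT`, `tags_for`, and `label` are hypothetical.

```python
# Hypothetical labeled dictionary: entry -> kind prefix, where "" is a generic
# word, "R" a person name, "S" a place name and "T" an organization name.
LABELED_DICT = {
    "王明": "R",
    "济南": "S",
    "浪潮软件": "T",
    "工作": "",
}

def tags_for(word, kind):
    """Return one tag per character of `word` (B/M/E with an optional prefix)."""
    if len(word) == 1:
        return ["S"]  # symbols, punctuation and single-character words
    return [kind + "B"] + [kind + "M"] * (len(word) - 2) + [kind + "E"]

def label(text, dictionary=LABELED_DICT):
    """Label `text` character by character (assumed: forward maximum matching)."""
    max_len = max(len(w) for w in dictionary)
    out, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + size]
            if size == 1 or word in dictionary:
                kind = dictionary.get(word, "")
                out.extend(zip(word, tags_for(word, kind)))
                i += size
                break
    return out
```

Calling `label("王明在济南工作")` yields a list of (character, tag) pairs that a later step can group back into words and entities. Characters not covered by any dictionary entry fall through to the single-character tag `S`.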
3. The method according to claim 1, characterized in that, before S1, the method further comprises:
obtaining a corpus and labeling the corpus according to the labeling rules;
training a hidden Markov model on the labeled corpus;
wherein S2 comprises: taking the document to be processed as the input of the trained hidden Markov model, and labeling, by the trained hidden Markov model and according to the labeling rules, the generic words and named entities in the document to be processed;
and S3 comprises: identifying, by the trained hidden Markov model and according to the labeled document, the generic words and named entities in the document to be processed.
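The train-then-label flow of claim 3 can be sketched with a textbook hidden Markov model. This is an assumption-laden illustration: it estimates parameters by counting with add-one smoothing and decodes with the Viterbi algorithm, neither of which is stated in the claims, and it is not the maximization of formula one described in claim 4. The function names and the one-sentence corpus are hypothetical, and `viterbi` expects non-empty text.

```python
import math
from collections import Counter, defaultdict

def train_hmm(corpus):
    """Count start, transition and emission events in a labeled corpus.

    `corpus` is a list of sentences, each a list of (character, tag) pairs.
    """
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for sentence in corpus:
        prev = None
        for ch, tag in sentence:
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            emit[tag][ch] += 1
            prev = tag
    return start, trans, emit

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(text, model):
    """Return the most probable tag sequence for non-empty `text`."""
    start, trans, emit = model
    tags = sorted(emit)
    vocab = len({ch for t in tags for ch in emit[t]}) + 1  # +1 for unseen characters
    # best[t] = (log score, tag path) of the best path ending in tag t
    best = {
        t: (log_prob(start, t, len(tags)) + log_prob(emit[t], text[0], vocab), [t])
        for t in tags
    }
    for ch in text[1:]:
        step = {}
        for t in tags:
            score, path = max(
                (best[p][0] + log_prob(trans[p], t, len(tags)), best[p][1])
                for p in tags
            )
            step[t] = (score + log_prob(emit[t], ch, vocab), path + [t])
        best = step
    return max(best.values())[1]

# Hypothetical one-sentence labeled corpus, tagged per character.
corpus = [[("王", "RB"), ("明", "RE"), ("在", "S"), ("济", "SB"), ("南", "SE")]]
model = train_hmm(corpus)
predicted = viterbi("王明在济南", model)
```

Here the hidden states are the per-character tags and the observed states are the characters themselves. A realistic corpus would be far larger; the single sentence only shows the data shape.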
4. The method according to claim 3, characterized in that training the hidden Markov model on the labeled corpus comprises:
determining, according to the labeled corpus and formula one, the value of θ at which P(Y | x : θ) is maximal;
wherein formula one is:
$$P(Y \mid x : \theta) = \frac{1}{Z_x(\theta)} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i) \right\}$$
wherein Y is the hidden state, x is the observed state, P(Y | x : θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, $f_i(D_i)$ is a feature function, $D_i$ is the argument of the feature function, and $Z_x(\theta)$ is the normalization factor.
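To make formula one concrete, the sketch below evaluates it for a small, made-up set of candidate labelings. It assumes that the normalization factor sums the exponentiated scores over all candidate Y for the given x, which is the usual reading of such a factor but is not spelled out in the claim. The feature values, the weights in `theta`, and the function name are hypothetical.

```python
import math

def conditional_probabilities(theta, features_by_candidate):
    """Return P(Y | x : theta) for every candidate labeling Y.

    `features_by_candidate` maps each candidate Y to its K feature values
    f_1(D_1), ..., f_K(D_K), already evaluated for the observed x.
    """
    scores = {
        y: math.exp(sum(t * f for t, f in zip(theta, feats)))
        for y, feats in features_by_candidate.items()
    }
    z = sum(scores.values())  # normalization factor (assumed: sum over candidates)
    return {y: s / z for y, s in scores.items()}

theta = [1.0, 0.5]            # hypothetical weights, K = 2
candidates = {
    ("RB", "RE"): [1.0, 1.0],  # hypothetical feature values
    ("B", "E"): [0.0, 1.0],
}
probs = conditional_probabilities(theta, candidates)
```

Training as described in the claim would then search for the θ that makes the probability of the labeled corpus as large as possible; that optimization step is not shown here.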
5. The method according to any one of claims 1-4, characterized in that
the named entities include: person-name named entities, place-name named entities, and organization-name named entities;
the labeling rules include:
B represents the beginning of a generic word, E represents the end of a generic word, M represents the middle part of a generic word, S represents a special symbol, punctuation mark, or single-character word, RB represents the beginning of a person-name named entity, RE represents the end of a person-name named entity, RM represents the middle part of a person-name named entity, SB represents the beginning of a place-name named entity, SE represents the end of a place-name named entity, SM represents the middle part of a place-name named entity, TB represents the beginning of an organization-name named entity, TE represents the end of an organization-name named entity, and TM represents the middle part of an organization-name named entity.
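The tag set above is enough to recover words and entities from a labeled character sequence. The following is a minimal sketch of that grouping step under the assumption that the tag sequence is well formed (every beginning tag is eventually closed by a matching end tag); the names `KIND` and `decode` and the output kind strings are hypothetical.

```python
# Map the tag prefix to a readable kind; "" is the prefix of generic-word tags.
KIND = {"": "word", "R": "person", "S": "place", "T": "organization"}

def decode(chars, tags):
    """Group characters into (text, kind) spans according to their tags."""
    spans, buf, kind = [], "", ""
    for ch, tag in zip(chars, tags):
        if tag == "S":            # symbol, punctuation or single-character word
            spans.append((ch, "single"))
            buf = ""
        elif tag.endswith("B"):   # beginning of a generic word or entity
            buf, kind = ch, tag[:-1]
        elif tag.endswith("M"):   # middle part: keep accumulating
            buf += ch
        elif tag.endswith("E"):   # end: emit the completed span
            spans.append((buf + ch, KIND[kind]))
            buf = ""
    return spans

spans = decode("王明在济南工作", ["RB", "RE", "S", "SB", "SE", "B", "E"])
```

Note that the bare tag `S` is checked before the suffix tests, because the place-name tags `SB`, `SM`, and `SE` share its first letter.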
6. An apparatus for identifying a document, characterized by comprising:
a setting unit, configured to set labeling rules for vocabulary, wherein the labeling rules include a labeling rule for named entities and a labeling rule for generic words;
an acquiring unit, configured to obtain a document to be processed;
a labeling unit, configured to label the generic words and named entities in the document to be processed according to the labeling rules;
a recognition unit, configured to identify the generic words and named entities in the document to be processed according to the labeled document.
7. The apparatus according to claim 6, characterized by further comprising:
a dictionary unit, configured to set up a labeled dictionary according to the labeling rules, the labeled dictionary including labeled named entities and labeled generic words;
wherein the labeling unit is configured to label the generic words and named entities in the document to be processed according to the labeled dictionary.
8. The apparatus according to claim 6, characterized by further comprising:
a corpus labeling unit, configured to obtain a corpus and label the corpus according to the labeling rules;
a training unit, configured to train a hidden Markov model on the labeled corpus;
wherein the labeling unit is configured to take the document to be processed as the input of the trained hidden Markov model and to label, by the trained hidden Markov model and according to the labeling rules, the generic words and named entities in the document to be processed;
and the recognition unit is configured to identify, by the trained hidden Markov model and according to the labeled document, the generic words and named entities in the document to be processed.
9. The apparatus according to claim 8, characterized in that the training unit is configured to determine, according to the labeled corpus and formula one, the value of θ at which P(Y | x : θ) is maximal;
wherein formula one is:
$$P(Y \mid x : \theta) = \frac{1}{Z_x(\theta)} \exp\left\{ \sum_{i=1}^{K} \theta_i f_i(D_i) \right\}$$
wherein Y is the hidden state, x is the observed state, P(Y | x : θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, $f_i(D_i)$ is a feature function, $D_i$ is the argument of the feature function, and $Z_x(\theta)$ is the normalization factor.
10. The apparatus according to any one of claims 6-9, characterized in that the named entities include: person-name named entities, place-name named entities, and organization-name named entities;
the labeling rules include: B represents the beginning of a generic word, E represents the end of a generic word, M represents the middle part of a generic word, S represents a special symbol, punctuation mark, or single-character word, RB represents the beginning of a person-name named entity, RE represents the end of a person-name named entity, RM represents the middle part of a person-name named entity, SB represents the beginning of a place-name named entity, SE represents the end of a place-name named entity, SM represents the middle part of a place-name named entity, TB represents the beginning of an organization-name named entity, TE represents the end of an organization-name named entity, and TM represents the middle part of an organization-name named entity.
CN201610130559.5A 2016-03-08 2016-03-08 Method and apparatus for identifying document Pending CN105808523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610130559.5A CN105808523A (en) 2016-03-08 2016-03-08 Method and apparatus for identifying document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610130559.5A CN105808523A (en) 2016-03-08 2016-03-08 Method and apparatus for identifying document

Publications (1)

Publication Number Publication Date
CN105808523A true CN105808523A (en) 2016-07-27

Family

ID=56466930

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610130559.5A Pending CN105808523A (en) 2016-03-08 2016-03-08 Method and apparatus for identifying document

Country Status (1)

Country Link
CN (1) CN105808523A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168946A (en) * 2017-04-14 2017-09-15 北京化工大学 A kind of name entity recognition method of medical text data
WO2018059302A1 (en) * 2016-09-29 2018-04-05 腾讯科技(深圳)有限公司 Text recognition method and device, and storage medium
CN107943786A (en) * 2017-11-16 2018-04-20 广州市万隆证券咨询顾问有限公司 A kind of Chinese name entity recognition method and system
CN108009229A (en) * 2017-11-29 2018-05-08 厦门市美亚柏科信息股份有限公司 Method, terminal device and the storage medium that public sentiment event data is found
CN109190110A (en) * 2018-08-02 2019-01-11 厦门快商通信息技术有限公司 A kind of training method of Named Entity Extraction Model, system and electronic equipment
CN109992766A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 The method and apparatus for extracting target word
CN110837737A (en) * 2019-11-11 2020-02-25 中国电子科技集团公司信息科学研究院 Method for recognizing ability word entity
CN112784593A (en) * 2020-06-05 2021-05-11 珠海金山办公软件有限公司 Document processing method and device, electronic equipment and readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033879A (en) * 2009-09-27 2011-04-27 腾讯科技(深圳)有限公司 Method and device for identifying Chinese name
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033879A (en) * 2009-09-27 2011-04-27 腾讯科技(深圳)有限公司 Method and device for identifying Chinese name
CN104615589A (en) * 2015-02-15 2015-05-13 百度在线网络技术(北京)有限公司 Named-entity recognition model training method and named-entity recognition method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
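Yu Hongkui et al.: "Chinese named entity recognition based on a cascaded hidden Markov model" (基于层叠隐马尔可夫模型的中文命名实体识别), 《通信学报》 (Journal on Communications) *
Wang Chunyu et al.: "Research on agricultural named entity recognition based on conditional random fields" (基于条件随机场的农业命名实体识别研究), 《河北农业大学学报》 (Journal of Hebei Agricultural University) *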

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018059302A1 (en) * 2016-09-29 2018-04-05 腾讯科技(深圳)有限公司 Text recognition method and device, and storage medium
US11068655B2 (en) 2016-09-29 2021-07-20 Tencent Technology (Shenzhen) Company Limited Text recognition based on training of models at a plurality of training nodes
CN107168946A (en) * 2017-04-14 2017-09-15 北京化工大学 A kind of name entity recognition method of medical text data
CN107943786A (en) * 2017-11-16 2018-04-20 广州市万隆证券咨询顾问有限公司 A kind of Chinese name entity recognition method and system
CN108009229A (en) * 2017-11-29 2018-05-08 厦门市美亚柏科信息股份有限公司 Method, terminal device and the storage medium that public sentiment event data is found
CN109992766A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 The method and apparatus for extracting target word
CN109992766B (en) * 2017-12-29 2024-02-06 北京京东尚科信息技术有限公司 Method and device for extracting target words
CN109190110A (en) * 2018-08-02 2019-01-11 厦门快商通信息技术有限公司 A kind of training method of Named Entity Extraction Model, system and electronic equipment
CN109190110B (en) * 2018-08-02 2023-08-22 厦门快商通信息技术有限公司 Named entity recognition model training method and system and electronic equipment
CN110837737A (en) * 2019-11-11 2020-02-25 中国电子科技集团公司信息科学研究院 Method for recognizing ability word entity
CN112784593A (en) * 2020-06-05 2021-05-11 珠海金山办公软件有限公司 Document processing method and device, electronic equipment and readable storage medium
CN112784593B (en) * 2020-06-05 2023-02-03 珠海金山办公软件有限公司 Document processing method and device, electronic equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN105808523A (en) Method and apparatus for identifying document
CN112801010B (en) Visual rich document information extraction method for actual OCR scene
CN106777275B (en) Entity attribute and property value extracting method based on more granularity semantic chunks
WO2018028077A1 (en) Deep learning based method and device for chinese semantics analysis
Tabassum et al. A survey on text pre-processing & feature extraction techniques in natural language processing
CN107943911A (en) Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing
CN105243129A (en) Commodity property characteristic word clustering method
CN108959566B (en) A kind of medical text based on Stacking integrated study goes privacy methods and system
CN108519974A (en) English composition automatic detection of syntax error and analysis method
CN108984661A (en) Entity alignment schemes and device in a kind of knowledge mapping
CN109299269A (en) A kind of file classification method and device
CN106649666A (en) Left-right recursion-based new word discovery method
WO2022226716A1 (en) Deep learning-based java program internal annotation generation method and system
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN109213998A (en) Chinese wrongly written character detection method and system
Shanmugalingam et al. Language identification at word level in Sinhala-English code-mixed social media text
CN110008807A (en) A kind of training method, device and the equipment of treaty content identification model
CN112818693A (en) Automatic extraction method and system for electronic component model words
CN108511036A (en) A kind of method and system of Chinese symptom mark
CN102591850A (en) Method and system for error text statement correction based on conditional statements
CN105808522A (en) Method and apparatus for semantic association
CN107133207A (en) A kind of information extracting method and device
Suriyachay et al. Thai named entity tagged corpus annotation scheme and self verification
CN106681982B (en) English novel abstraction generating method
CN110717029A (en) Information processing method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160727