CN105808523A - Method and apparatus for identifying document - Google Patents
Method and apparatus for identifying document
- Publication number
- CN105808523A (application number CN201610130559.5A)
- Authority
- CN
- China
- Prior art keywords
- name
- mark
- entity
- generic word
- name entity
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a method and apparatus for identifying a document. The method comprises the steps of presetting labeling rules of words, wherein the labeling rules comprise labeling rules of named entities and labeling rules of common words; acquiring a document to be processed; according to the labeling rules, labeling common words and named entities in the document to be processed; and according to the labeled document to be processed, identifying the common words and the named entities in the document to be processed. According to the method and apparatus for identifying the document, which are provided by the invention, the document can be more simply identified.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method and an apparatus for identifying a document.
Background art
Word segmentation, as the most basic problem in natural language processing, has long received attention from experts in the field. Chinese word segmentation differs fundamentally from English word segmentation: English uses spaces to separate words, whereas Chinese has no such convention. To recombine a continuous character sequence into a word sequence according to a given specification, researchers have proposed various solutions, some statistics-based and some rule-based. So far, no algorithm solves the word segmentation problem perfectly. Named entity recognition is another major problem in natural language processing; basic named entities include time expressions, person names, place names, organization names and the like. It is difficult for a single model or a single process to recognize named entities of all categories.
In prior-art methods for identifying a document, the document to be processed is first segmented into words and, after segmentation is complete, named entities are identified from among the resulting words. The word segmentation process of the prior art cannot itself recognize named entities. For example, if the document to be processed contains the word "Peking University", segmentation generally splits it into the two words "Beijing" and "university"; preset rules then merge the two words into "Peking University", thereby identifying the named entity "Peking University".
As can be seen from the above, prior-art methods for identifying a document must first perform word segmentation and only then identify named entities, which is relatively complex.
Summary of the invention
Embodiments of the present invention provide a method and an apparatus for identifying a document, which can identify a document more simply.
In one aspect, an embodiment of the present invention provides a method for identifying a document, comprising:
S0: presetting labeling rules for words, wherein the labeling rules comprise labeling rules for named entities and labeling rules for common words;
S1: acquiring a document to be processed;
S2: labeling, according to the labeling rules, the common words and named entities in the document to be processed;
S3: identifying, according to the labeled document to be processed, the common words and named entities in the document to be processed.
Further, after S0, the method also comprises:
setting a labeling dictionary according to the labeling rules, the labeling dictionary comprising labeled named entities and labeled common words;
S2 then comprises: labeling, according to the labeling dictionary, the common words and named entities in the document to be processed.
Further, before S1, the method also comprises:
acquiring a corpus and labeling the corpus according to the labeling rules;
training a hidden Markov model on the labeled corpus;
S2 then comprises: taking the document to be processed as the input of the trained hidden Markov model, and labeling, by the trained hidden Markov model and according to the labeling rules, the common words and named entities in the document to be processed;
S3 then comprises: identifying, by the trained hidden Markov model and according to the labeled document to be processed, the common words and named entities in the document to be processed.
Further, training the hidden Markov model on the labeled corpus comprises:
determining, according to the labeled corpus and formula one, the value of θ at which P(Y | x; θ) is maximal;
where formula one is: [equation image not reproduced in this text]
where Y is the hidden state, x is the observed state, P(Y | x; θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, f_i(D_i) is a feature function, D_i is the parameter of the feature function, and Z_x(θ) is the normalization factor.
Further, the named entities comprise person-name named entities, place-name named entities and organization-name named entities;
the labeling rules comprise:
B denotes the beginning of a common word, E the end of a common word, and M the middle part of a common word;
S denotes a special symbol, a punctuation mark or a single-character word;
RB denotes the beginning of a person-name named entity, RE its end, and RM its middle part;
SB denotes the beginning of a place-name named entity, SE its end, and SM its middle part;
TB denotes the beginning of an organization-name named entity, TE its end, and TM its middle part.
In another aspect, an embodiment of the present invention provides an apparatus for identifying a document, comprising:
a setting unit, configured to set labeling rules for words, wherein the labeling rules comprise labeling rules for named entities and labeling rules for common words;
an acquiring unit, configured to acquire a document to be processed;
a labeling unit, configured to label, according to the labeling rules, the common words and named entities in the document to be processed;
a recognition unit, configured to identify, according to the labeled document to be processed, the common words and named entities in the document to be processed.
Further, the apparatus also comprises:
a dictionary unit, configured to set a labeling dictionary according to the labeling rules, the labeling dictionary comprising labeled named entities and labeled common words;
the labeling unit is then configured to label, according to the labeling dictionary, the common words and named entities in the document to be processed.
Further, the apparatus also comprises:
a corpus labeling unit, configured to acquire a corpus and label the corpus according to the labeling rules;
a training unit, configured to train a hidden Markov model on the labeled corpus;
the labeling unit is then configured to take the document to be processed as the input of the trained hidden Markov model and to label, by the trained hidden Markov model and according to the labeling rules, the common words and named entities in the document to be processed;
the recognition unit is then configured to identify, by the trained hidden Markov model and according to the labeled document to be processed, the common words and named entities in the document to be processed.
Further, the training unit is configured to determine, according to the labeled corpus and formula one, the value of θ at which P(Y | x; θ) is maximal;
where formula one is: [equation image not reproduced in this text]
where Y is the hidden state, x is the observed state, P(Y | x; θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, f_i(D_i) is a feature function, D_i is the parameter of the feature function, and Z_x(θ) is the normalization factor.
Further, the named entities comprise person-name named entities, place-name named entities and organization-name named entities;
the labeling rules comprise:
B denotes the beginning of a common word, E the end of a common word, and M the middle part of a common word;
S denotes a special symbol, a punctuation mark or a single-character word;
RB denotes the beginning of a person-name named entity, RE its end, and RM its middle part;
SB denotes the beginning of a place-name named entity, SE its end, and SM its middle part;
TB denotes the beginning of an organization-name named entity, TE its end, and TM its middle part.
In the embodiments of the present invention, labeling rules for words are preset, comprising labeling rules for named entities and labeling rules for common words; the common words and named entities in the document to be processed are labeled according to the labeling rules and thereby identified. There is no need to segment words first and identify named entities afterwards: common words and named entities are identified together, so the document is identified more simply.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for identifying a document provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another method for identifying a document provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of an apparatus for identifying a document provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of another apparatus for identifying a document provided by an embodiment of the present invention.
Detailed description of the invention
To make the objectives, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a method for identifying a document, which may comprise the following steps:
S0: presetting labeling rules for words, wherein the labeling rules comprise labeling rules for named entities and labeling rules for common words;
S1: acquiring a document to be processed;
S2: labeling, according to the labeling rules, the common words and named entities in the document to be processed;
S3: identifying, according to the labeled document to be processed, the common words and named entities in the document to be processed.
In the embodiments of the present invention, labeling rules for words are preset, comprising labeling rules for named entities and labeling rules for common words; the common words and named entities in the document to be processed are labeled according to the labeling rules and thereby identified. There is no need to segment words first and identify named entities afterwards: common words and named entities are identified together, so the document is identified more simply.
In one possible implementation, after S0, the method also comprises:
setting a labeling dictionary according to the labeling rules, the labeling dictionary comprising labeled named entities and labeled common words;
S2 then comprises: labeling, according to the labeling dictionary, the common words and named entities in the document to be processed.
In this embodiment, the labeling dictionary comprises labeled named entities and labeled common words. For example, the labeling dictionary contains the named entity "Peking University" and the common word "we". For the sentence "We go to Peking University to visit." in the document to be processed, step S2 labels "we" and "Peking University". Words that are not in the labeling dictionary can be labeled manually, for example the remaining words of this sentence, such as "visit".
To make the identification of the document to be processed more accurate, in one possible implementation, before S1, the method also comprises:
acquiring a corpus and labeling the corpus according to the labeling rules;
training a hidden Markov model on the labeled corpus;
S2 then comprises: taking the document to be processed as the input of the trained hidden Markov model, and labeling, by the trained hidden Markov model and according to the labeling rules, the common words and named entities in the document to be processed;
S3 then comprises: identifying, by the trained hidden Markov model and according to the labeled document to be processed, the common words and named entities in the document to be processed.
In the embodiments of the present invention, the hidden Markov model (CRF, conditional random field algorithm) is trained on the corpus, and the trained hidden Markov model labels and identifies the document to be processed.
In the embodiments of the present invention, the corpus is labeled according to the labeling rules; the quality of the whole model is jointly determined by the quality of the corpus and the quality of its labeling. To reduce manual work, the corpus may first be pre-labeled automatically and then corrected manually wherever the pre-labeling is wrong.
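The supervised training step can be sketched as follows. This is a minimal, generic first-order hidden Markov model, estimated by counting with add-one smoothing and decoded with the Viterbi algorithm; it is not the model of formula one. The function names (`train_hmm`, `log_prob`, `viterbi`), the smoothing choice and the one-sentence toy corpus are assumptions made purely for illustration.

```python
import math
from collections import Counter, defaultdict


def train_hmm(corpus):
    """Count start, transition and emission events in a labeled corpus.

    `corpus` is a list of sentences; each sentence is a list of
    (character, label) pairs produced according to the labeling rules.
    """
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    vocab = set()
    for sentence in corpus:
        prev = None
        for char, label in sentence:
            vocab.add(char)
            emit[label][char] += 1
            if prev is None:
                start[label] += 1
            else:
                trans[prev][label] += 1
            prev = label
    return start, trans, emit, vocab


def log_prob(counter, key, size):
    """Add-one smoothed log probability of `key` among `size` outcomes."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + size))


def viterbi(text, model):
    """Return the most probable label sequence for the characters of `text`."""
    if not text:
        return []
    start, trans, emit, vocab = model
    labels = sorted(emit)
    n, v = len(labels), len(vocab) + 1  # +1 leaves room for unseen characters
    best = {y: log_prob(start, y, n) + log_prob(emit[y], text[0], v) for y in labels}
    back = []
    for char in text[1:]:
        step, pointers = {}, {}
        for y in labels:
            # Best predecessor label for y at this position.
            prev = max(labels, key=lambda p: best[p] + log_prob(trans[p], y, n))
            step[y] = best[prev] + log_prob(trans[prev], y, n) + log_prob(emit[y], char, v)
            pointers[y] = prev
        best = step
        back.append(pointers)
    path = [max(labels, key=best.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy corpus: one hand-labeled sentence (illustrative only).
corpus = [list(zip("张三去北京的清华大学参观",
                   "RB RE S SB SE S TB TM TM TE B E".split()))]
model = train_hmm(corpus)
print(viterbi("张三去北京", model))
```

A realistic corpus would contain many sentences, and in the workflow described above the pre-labeled output would be corrected manually before being used for training.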
In one possible implementation, training the hidden Markov model on the labeled corpus comprises:
determining, according to the labeled corpus and formula one, the value of θ at which P(Y | x; θ) is maximal;
where formula one is: [equation image not reproduced in this text]
where Y is the hidden state, x is the observed state, P(Y | x; θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, f_i(D_i) is a feature function, D_i is the parameter of the feature function, and Z_x(θ) is the normalization factor.
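Formula one itself is not reproduced in this text. One form that is consistent with the symbols defined above, assuming the usual log-linear parameterization of a conditional random field, is sketched below. The weights θ_i, the sum over candidate label sequences Y' and the notation D_i' (the feature parameter evaluated for Y') are assumptions and are not taken from the source.

```latex
P(Y \mid x; \theta) = \frac{1}{Z_x(\theta)} \exp\!\left( \sum_{i=1}^{K} \theta_i \, f_i(D_i) \right),
\qquad
Z_x(\theta) = \sum_{Y'} \exp\!\left( \sum_{i=1}^{K} \theta_i \, f_i(D_i') \right)
```

Under this assumed form, training selects the θ that maximizes P(Y | x; θ) over the labeled corpus, and Z_x(θ) ensures that the probabilities of all candidate label sequences for a given x sum to one.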
In one possible implementation, the named entities comprise person-name named entities, place-name named entities and organization-name named entities;
the labeling rules comprise:
B denotes the beginning of a common word, E the end of a common word, and M the middle part of a common word;
S denotes a special symbol, a punctuation mark or a single-character word;
RB denotes the beginning of a person-name named entity, RE its end, and RM its middle part;
SB denotes the beginning of a place-name named entity, SE its end, and SM its middle part;
TB denotes the beginning of an organization-name named entity, TE its end, and TM its middle part.
In the embodiments of the present invention, because labels are provided for person-name, place-name and organization-name named entities, all three kinds of named entity can be labeled at the same time, which speeds up identification of the document to be processed.
For example, the person-name named entities include "Zhang San", the place-name named entities include "Beijing", and the organization-name named entities include "Tsinghua University". Labeling the sentence "Zhang San goes to Tsinghua University of Beijing to visit." yields one label per character of the original sentence: RB RE S SB SE S TB TM TM TE B E (written without separators: RBRESSBSESTBTMTMTEBE).
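The way a word's category determines its labels can be sketched in a few lines. The helper `labels_for`, the category names and the Chinese sentence 张三去北京的清华大学参观 (an assumed original of the example sentence, chosen because it has exactly twelve characters) are illustrative assumptions. The source does not say how a one-character named entity is labeled, so the sketch rejects that case.

```python
# Label prefix per category, following the labeling rules in the text:
# "" = common word, R = person name, S = place name, T = organization name.
PREFIX = {"common": "", "person": "R", "place": "S", "org": "T"}


def labels_for(word: str, kind: str) -> list[str]:
    """Return one label per character of `word` for the given category."""
    prefix = PREFIX[kind]
    if len(word) == 1:
        if kind != "common":
            # Not covered by the labeling rules in the source.
            raise ValueError("single-character entities are not covered by the rules")
        return ["S"]  # special symbol, punctuation mark or single-character word
    middle = [prefix + "M"] * (len(word) - 2)
    return [prefix + "B", *middle, prefix + "E"]


# Assumed segmentation of the assumed original sentence.
sentence = [("张三", "person"), ("去", "common"), ("北京", "place"),
            ("的", "common"), ("清华大学", "org"), ("参观", "common")]
labels = [label for word, kind in sentence for label in labels_for(word, kind)]
print("".join(labels))
```

Concatenating the labels without separators is ambiguous in general (for instance "SB" could also be read as "S" followed by "B"), which is why the sketch keeps them in a list.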
The embodiments of the present invention label the corpus in a unified way and solve the word segmentation problem and the named entity recognition problem together. Solving similar problems in a unified way reduces the workload and helps to lower labor costs.
In the embodiments of the present invention, the labeling rules are designed so that they cover every kind of label that is needed while keeping the number of label categories to a minimum. This minimizes the amount of computation in model training, so that, for the same effect, the algorithm runs as efficiently as possible.
A common word in the embodiments of the present invention is any word in the document to be processed other than a named entity. In addition, the embodiments of the present invention may output the result for the document to be processed as a label sequence.
As shown in Fig. 2, an embodiment of the present invention provides a method for identifying a document, which may comprise the following steps:
Step 201: presetting labeling rules for words, wherein the labeling rules comprise labeling rules for named entities and labeling rules for common words.
The named entities comprise person-name named entities, place-name named entities and organization-name named entities;
the labeling rules comprise:
B denotes the beginning of a common word, E the end of a common word, and M the middle part of a common word;
S denotes a special symbol, a punctuation mark or a single-character word;
RB denotes the beginning of a person-name named entity, RE its end, and RM its middle part;
SB denotes the beginning of a place-name named entity, SE its end, and SM its middle part;
TB denotes the beginning of an organization-name named entity, TE its end, and TM its middle part.
Step 202: setting a labeling dictionary according to the labeling rules, the labeling dictionary comprising labeled named entities and labeled common words.
For example, the labeling dictionary contains: "Tsinghua University", labeled TB TM TM TE; "Zhang San", labeled RB RE; "Beijing", labeled SB SE; "go", labeled S; the possessive particle ("of"), labeled S; and "visit", labeled B E. Here "Tsinghua University" is an organization-name named entity, "Zhang San" is a person-name named entity and "Beijing" is a place-name named entity; these three words are named entities. "Go", the possessive particle and "visit" are words other than named entities and are therefore common words.
Step 203: acquiring a document to be processed.
For example, the document to be processed contains the sentence: "Zhang San goes to Tsinghua University of Beijing to visit." This sentence is identified by the embodiment of the present invention.
Step 204: labeling, according to the labeling dictionary, the common words and named entities in the document to be processed.
For example, for the sentence "Zhang San goes to Tsinghua University of Beijing to visit." in the document to be processed, matching against the words in the dictionary yields the labels for this sentence: RB RE S SB SE S TB TM TM TE B E.
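Dictionary-based labeling can be sketched as follows. The text only says that the sentence is matched against the words in the dictionary; forward maximum matching (longest entry first) is one common way to do that and is an assumption here, as are the name `label_with_dictionary` and the Chinese strings, which are assumed originals of the translated examples.

```python
from typing import Optional

# Hypothetical labeling dictionary: word -> one label per character.
LABEL_DICT = {
    "张三": ["RB", "RE"],                    # person name
    "北京": ["SB", "SE"],                    # place name
    "清华大学": ["TB", "TM", "TM", "TE"],    # organization name
    "去": ["S"],                             # single-character common word
    "的": ["S"],                             # single-character common word
    "参观": ["B", "E"],                      # two-character common word
}
MAX_LEN = max(len(word) for word in LABEL_DICT)


def label_with_dictionary(text: str) -> list[Optional[str]]:
    """Label every character; characters not covered by the dictionary get None."""
    labels: list[Optional[str]] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that longer entries win.
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            entry = LABEL_DICT.get(text[i:i + length])
            if entry is not None:
                labels.extend(entry)
                i += length
                break
        else:
            labels.append(None)  # no entry matched: left for manual labeling
            i += 1
    return labels


print(label_with_dictionary("张三去北京的清华大学参观"))
```

Characters that come back as `None` correspond to words missing from the labeling dictionary, which would have to be labeled by other means.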
Step 205: identifying, according to the labeled document to be processed, the common words and named entities in the document to be processed.
Different labels represent different word types, so each word in the document to be processed can be identified from the labels in the document.
For example, the sentence "Zhang San goes to Tsinghua University of Beijing to visit." in the document to be processed is labeled: RB RE S SB SE S TB TM TM TE B E. Combined with the labeling rules, it can be determined that "Tsinghua University" is an organization-name named entity, "Zhang San" is a person-name named entity and "Beijing" is a place-name named entity; these three words are named entities. "Go", the possessive particle ("of") and "visit" are words other than named entities and are therefore common words.
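Reading words and entity types back from a label sequence can be sketched as below. The function name `decode`, the type strings and the Chinese sentence (an assumed original of the translated example) are illustrative assumptions. The sketch trusts the labels to be well formed and does not validate them.

```python
ENTITY_TYPES = {"R": "person name", "S": "place name", "T": "organization name"}


def decode(chars: str, labels: list[str]) -> list[tuple[str, str]]:
    """Group characters into (text, type) pairs according to their labels."""
    if len(chars) != len(labels):
        raise ValueError("need exactly one label per character")
    spans = []
    start = 0
    for i, label in enumerate(labels):
        if label == "S":
            # A symbol, punctuation mark or single-character word stands alone.
            spans.append((chars[i], "common word"))
            start = i + 1
        elif label.endswith("E"):
            # An end label closes the span opened by the matching begin label.
            kind = ENTITY_TYPES.get(label[:-1], "common word")
            spans.append((chars[start:i + 1], kind))
            start = i + 1
        elif label.endswith("B"):
            start = i
        # Middle labels (M, RM, SM, TM) simply extend the current span.
    return spans


example = decode("张三去北京的清华大学参观",
                 "RB RE S SB SE S TB TM TM TE B E".split())
for text, kind in example:
    print(text, kind)
```

Because the entity type is carried by the label prefix, segmentation and entity typing fall out of the same pass over the labels.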
It can be seen that the solution provided by the embodiments of the present invention can not only identify the named entities and common words in the document to be processed but also identify the type of each named entity. Different types of named entity are identified in a unified way, which improves the efficiency of identifying the document to be processed.
As shown in Fig. 3 and Fig. 4, an embodiment of the present invention provides an apparatus for identifying a document. The apparatus embodiment may be implemented in software, in hardware, or in a combination of the two. At the hardware level, Fig. 3 is a hardware structure diagram of the device in which the apparatus for identifying a document is located; in addition to the processor, memory, network interface and non-volatile memory shown in Fig. 3, the device may also include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as shown in Fig. 4, the apparatus in the logical sense is formed when the CPU of the device in which it is located reads the corresponding computer program instructions from the non-volatile memory into memory and runs them. The apparatus for identifying a document provided by this embodiment comprises:
a setting unit 401, configured to set labeling rules for words, wherein the labeling rules comprise labeling rules for named entities and labeling rules for common words;
an acquiring unit 402, configured to acquire a document to be processed;
a labeling unit 403, configured to label, according to the labeling rules, the common words and named entities in the document to be processed;
a recognition unit 404, configured to identify, according to the labeled document to be processed, the common words and named entities in the document to be processed.
In one possible implementation, the apparatus also comprises:
a dictionary unit, configured to set a labeling dictionary according to the labeling rules, the labeling dictionary comprising labeled named entities and labeled common words;
the labeling unit 403 is then configured to label, according to the labeling dictionary, the common words and named entities in the document to be processed.
In one possible implementation, the apparatus also comprises:
a corpus labeling unit, configured to acquire a corpus and label the corpus according to the labeling rules;
a training unit, configured to train a hidden Markov model on the labeled corpus;
the labeling unit 403 is then configured to take the document to be processed as the input of the trained hidden Markov model and to label, by the trained hidden Markov model and according to the labeling rules, the common words and named entities in the document to be processed;
the recognition unit 404 is then configured to identify, by the trained hidden Markov model and according to the labeled document to be processed, the common words and named entities in the document to be processed.
In one possible implementation, the training unit is configured to determine, according to the labeled corpus and formula one, the value of θ at which P(Y | x; θ) is maximal;
where formula one is: [equation image not reproduced in this text]
where Y is the hidden state, x is the observed state, P(Y | x; θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, f_i(D_i) is a feature function, D_i is the parameter of the feature function, and Z_x(θ) is the normalization factor.
In one possible implementation, the named entities comprise person-name named entities, place-name named entities and organization-name named entities;
the labeling rules comprise:
B denotes the beginning of a common word, E the end of a common word, and M the middle part of a common word;
S denotes a special symbol, a punctuation mark or a single-character word;
RB denotes the beginning of a person-name named entity, RE its end, and RM its middle part;
SB denotes the beginning of a place-name named entity, SE its end, and SM its middle part;
TB denotes the beginning of an organization-name named entity, TE its end, and TM its middle part.
Because the information exchange between the units of the above apparatus and their execution processes are based on the same concept as the method embodiments of the present invention, reference may be made to the description of the method embodiments for details, which are not repeated here.
The embodiments of the present invention have at least the following beneficial effects:
1. In the embodiments of the present invention, labeling rules for words are preset, comprising labeling rules for named entities and labeling rules for common words; the common words and named entities in the document to be processed are labeled according to the labeling rules and thereby identified. There is no need to segment words first and identify named entities afterwards: common words and named entities are identified together, so the document is identified more simply.
2. In the embodiments of the present invention, because labels are provided for person-name, place-name and organization-name named entities, all three kinds of named entity can be labeled at the same time, which speeds up identification of the document to be processed.
3. The embodiments of the present invention label the corpus in a unified way and solve the word segmentation problem and the named entity recognition problem together. Solving similar problems in a unified way reduces the workload and helps to lower labor costs.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to be non-exclusive, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disc.
Finally, it should be noted that the above are merely preferred embodiments of the present invention, intended only to illustrate its technical solutions and not to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within the scope of protection of the present invention.
Claims (10)
1. the method identifying document, it is characterised in that including:
S0: pre-setting the mark rule of vocabulary, wherein, described mark rule includes: the mark of name entity is regular, the mark rule of generic word;
S1: obtain pending document;
S2: according to described mark rule, is labeled generic word in described pending document and name entity;
S3: according to the described pending document after mark, identifies the generic word in described pending document and name entity.
2. method according to claim 1, it is characterised in that after described S0, also include:
According to described mark rule, arranging mark dictionary, described mark dictionary includes: the generic word naming entity, mark of mark;
Described S2, including: according to described mark dictionary, the generic word in described pending document and name entity are labeled.
3. method according to claim 1, it is characterised in that before described S1, also include:
Obtain language material, according to described mark rule, described language material is labeled;
According to the language material after mark, train hidden Markov model;
Described S2, including: using described pending document as the input of described hidden Markov model after training, by the described hidden Markov model after training, according to described mark rule, generic word in described pending document and name entity are labeled;
Described S3, including: by the described hidden Markov model after training, according to the described pending document after mark, identify the generic word in described pending document and name entity.
4. The method according to claim 3, characterized in that training the hidden Markov model according to the labeled corpus comprises:
determining, according to the labeled corpus and Formula One, the value of θ at which P(Y | x: θ) is maximal;
wherein Formula One is:
wherein Y is the hidden state, x is the observed state, P(Y | x: θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, f_i(D_i) is a feature function, D_i is the parameter of the feature function, and Z_x(θ) is the normalization factor.
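Formula One itself does not appear in this text. A log-linear form that uses only the symbols defined in claim 4 is sketched below. It is offered as an assumption, not as the patent's confirmed formula; in particular, pairing one weight θ_i with each feature function f_i is a guess based on the stated role of θ, K, and Z_x(θ).

```latex
P(Y \mid x : \theta)
  = \frac{1}{Z_x(\theta)}
    \exp\!\left( \sum_{i=1}^{K} \theta_i \, f_i(D_i) \right),
\qquad
\theta^{*} = \arg\max_{\theta} \; P(Y \mid x : \theta)
```

Under this reading, Z_x(θ) is the sum of the same exponential term over every candidate hidden-state sequence for x, so that the probabilities over Y sum to one, and training searches for the θ that makes the labeled corpus most probable.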
5. The method according to any one of claims 1 to 4, characterized in that:
the named entities include: person-name named entities, place-name named entities, and organization-name named entities;
the labeling rules include:
B represents the beginning of a generic word; E represents the end of a generic word; M represents the middle part of a generic word; S represents a special symbol, a punctuation mark, or a single-character word; RB represents the beginning of a person-name named entity; RE represents the end of a person-name named entity; RM represents the middle part of a person-name named entity; SB represents the beginning of a place-name named entity; SE represents the end of a place-name named entity; SM represents the middle part of a place-name named entity; TB represents the beginning of an organization-name named entity; TE represents the end of an organization-name named entity; TM represents the middle part of an organization-name named entity.
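Read literally, these thirteen tags imply ordering constraints: a span opens with a beginning tag, optionally continues with middle tags of the same kind, and closes with the matching end tag, while S stands alone. The checker below encodes that interpretation. The constraints are inferred from the tag definitions rather than stated in the claim, and the function name `is_well_formed` is hypothetical.

```python
from typing import Optional, Sequence

# Tag prefixes from claim 5: "" generic word, R person, S place, T organization.
ENTITY_PREFIXES = ("", "R", "S", "T")


def is_well_formed(tags: Sequence[str]) -> bool:
    """Return True if every span runs xB, xM*, xE with one prefix x."""
    open_prefix: Optional[str] = None  # prefix of the span currently open
    for tag in tags:
        if not tag:
            return False
        if tag == "S":  # stand-alone tag; illegal inside an open span
            if open_prefix is not None:
                return False
            continue
        prefix, position = tag[:-1], tag[-1]
        if prefix not in ENTITY_PREFIXES or position not in ("B", "M", "E"):
            return False
        if position == "B":
            if open_prefix is not None:  # previous span was never closed
                return False
            open_prefix = prefix
        else:
            if open_prefix != prefix:  # no open span, or kinds do not match
                return False
            if position == "E":
                open_prefix = None
    return open_prefix is None  # the last span must be closed


checks = [
    is_well_formed(["RB", "RM", "RE", "S", "SB", "SE"]),
    is_well_formed(["B", "RE"]),
    is_well_formed(["TB", "TM"]),
]
```

A check like this can be run over a hand-labeled corpus before training, or over model output before spans are extracted, to catch sequences such as a generic-word beginning closed by a person-name end.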
6. A device for identifying a document, characterized by comprising:
a setting unit, configured to set labeling rules for vocabulary, wherein the labeling rules include: a labeling rule for named entities and a labeling rule for generic words;
an acquiring unit, configured to obtain a document to be processed;
a labeling unit, configured to label the generic words and named entities in the document to be processed according to the labeling rules;
a recognition unit, configured to identify the generic words and named entities in the document to be processed according to the labeled document.
7. The device according to claim 6, characterized by further comprising:
a dictionary unit, configured to set a labeling dictionary according to the labeling rules, the labeling dictionary including: labeled named entities and labeled generic words;
the labeling unit being configured to label the generic words and named entities in the document to be processed according to the labeling dictionary.
8. The device according to claim 6, characterized by further comprising:
a corpus labeling unit, configured to obtain a corpus and label the corpus according to the labeling rules;
a training unit, configured to train a hidden Markov model according to the labeled corpus;
the labeling unit being configured to take the document to be processed as the input of the trained hidden Markov model, and to label, by the trained hidden Markov model and according to the labeling rules, the generic words and named entities in the document to be processed;
the recognition unit being configured to identify, by the trained hidden Markov model and according to the labeled document, the generic words and named entities in the document to be processed.
9. The device according to claim 8, characterized in that the training unit is configured to determine, according to the labeled corpus and Formula One, the value of θ at which P(Y | x: θ) is maximal;
wherein Formula One is:
wherein Y is the hidden state, x is the observed state, P(Y | x: θ) is the probability of obtaining Y given x, θ is the parameter of the hidden Markov model to be determined, K is the number of feature functions, f_i(D_i) is a feature function, D_i is the parameter of the feature function, and Z_x(θ) is the normalization factor.
10. The device according to any one of claims 6 to 9, characterized in that the named entities include: person-name named entities, place-name named entities, and organization-name named entities;
the labeling rules include: B represents the beginning of a generic word; E represents the end of a generic word; M represents the middle part of a generic word; S represents a special symbol, a punctuation mark, or a single-character word; RB represents the beginning of a person-name named entity; RE represents the end of a person-name named entity; RM represents the middle part of a person-name named entity; SB represents the beginning of a place-name named entity; SE represents the end of a place-name named entity; SM represents the middle part of a place-name named entity; TB represents the beginning of an organization-name named entity; TE represents the end of an organization-name named entity; TM represents the middle part of an organization-name named entity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610130559.5A CN105808523A (en) | 2016-03-08 | 2016-03-08 | Method and apparatus for identifying document |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105808523A (en) | 2016-07-27 |
Family
ID=56466930
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610130559.5A Pending CN105808523A (en) | 2016-03-08 | 2016-03-08 | Method and apparatus for identifying document |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105808523A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102033879A (en) * | 2009-09-27 | 2011-04-27 | 腾讯科技(深圳)有限公司 | Method and device for identifying Chinese name |
CN104615589A (en) * | 2015-02-15 | 2015-05-13 | 百度在线网络技术(北京)有限公司 | Named-entity recognition model training method and named-entity recognition method and device |
Non-Patent Citations (2)
Title |
---|
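Yu Hongkui et al.: "Chinese named entity recognition based on cascaded hidden Markov models", 《通信学报》 (Journal on Communications) * |
Wang Chunyu et al.: "Research on agricultural named entity recognition based on conditional random fields", 《河北农业大学学报》 (Journal of Hebei Agricultural University) * |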
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018059302A1 (en) * | 2016-09-29 | 2018-04-05 | 腾讯科技(深圳)有限公司 | Text recognition method and device, and storage medium |
US11068655B2 (en) | 2016-09-29 | 2021-07-20 | Tencent Technology (Shenzhen) Company Limited | Text recognition based on training of models at a plurality of training nodes |
CN107168946A (en) * | 2017-04-14 | 2017-09-15 | 北京化工大学 | A kind of name entity recognition method of medical text data |
CN107943786A (en) * | 2017-11-16 | 2018-04-20 | 广州市万隆证券咨询顾问有限公司 | A kind of Chinese name entity recognition method and system |
CN108009229A (en) * | 2017-11-29 | 2018-05-08 | 厦门市美亚柏科信息股份有限公司 | Method, terminal device and the storage medium that public sentiment event data is found |
CN109992766A (en) * | 2017-12-29 | 2019-07-09 | 北京京东尚科信息技术有限公司 | The method and apparatus for extracting target word |
CN109992766B (en) * | 2017-12-29 | 2024-02-06 | 北京京东尚科信息技术有限公司 | Method and device for extracting target words |
CN109190110A (en) * | 2018-08-02 | 2019-01-11 | 厦门快商通信息技术有限公司 | A kind of training method of Named Entity Extraction Model, system and electronic equipment |
CN109190110B (en) * | 2018-08-02 | 2023-08-22 | 厦门快商通信息技术有限公司 | Named entity recognition model training method and system and electronic equipment |
CN110837737A (en) * | 2019-11-11 | 2020-02-25 | 中国电子科技集团公司信息科学研究院 | Method for recognizing ability word entity |
CN112784593A (en) * | 2020-06-05 | 2021-05-11 | 珠海金山办公软件有限公司 | Document processing method and device, electronic equipment and readable storage medium |
CN112784593B (en) * | 2020-06-05 | 2023-02-03 | 珠海金山办公软件有限公司 | Document processing method and device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105808523A (en) | Method and apparatus for identifying document | |
Tabassum et al. | A survey on text pre-processing & feature extraction techniques in natural language processing | |
CN106777275B (en) | Entity attribute and property value extracting method based on more granularity semantic chunks | |
WO2018028077A1 (en) | Deep learning based method and device for chinese semantics analysis | |
CN107943911A (en) | Data pick-up method, apparatus, computer equipment and readable storage medium storing program for executing | |
CN112801010A (en) | Visual rich document information extraction method for actual OCR scene | |
CN105243129A (en) | Commodity property characteristic word clustering method | |
CN107436864A (en) | A kind of Chinese question and answer semantic similarity calculation method based on Word2Vec | |
CN108959566B (en) | A kind of medical text based on Stacking integrated study goes privacy methods and system | |
CN108519974A (en) | English composition automatic detection of syntax error and analysis method | |
CN107392143A (en) | A kind of resume accurate Analysis method based on SVM text classifications | |
CN108984661A (en) | Entity alignment schemes and device in a kind of knowledge mapping | |
CN109299269A (en) | A kind of file classification method and device | |
WO2022226716A1 (en) | Deep learning-based java program internal annotation generation method and system | |
CN106649666A (en) | Left-right recursion-based new word discovery method | |
CN102693279A (en) | Method, device and system for fast calculating comment similarity | |
CN109101489A (en) | A kind of text automatic abstracting method, device and a kind of electronic equipment | |
CN112926345A (en) | Multi-feature fusion neural machine translation error detection method based on data enhancement training | |
Shanmugalingam et al. | Language identification at word level in Sinhala-English code-mixed social media text | |
CN111160026A (en) | Model training method and device, and method and device for realizing text processing | |
CN112818693A (en) | Automatic extraction method and system for electronic component model words | |
CN110717029A (en) | Information processing method and system | |
CN102591850A (en) | Method and system for error text statement correction based on conditional statements | |
Tschirschwitz et al. | A dataset for analysing complex document layouts in the digital humanities and its evaluation with Krippendorff’s alpha | |
CN105808522A (en) | Method and apparatus for semantic association |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20160727 |