CN103995885A - Method and device for recognizing entity names - Google Patents


Info

Publication number
CN103995885A
Authority
CN
China
Prior art keywords
identified
text
root
name
instance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410234622.0A
Other languages
Chinese (zh)
Other versions
CN103995885B (en)
Inventor
陈丽欧
徐明泉
韩锋
姜世超
周寰
王平
雷绍泽
周丰乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201410234622.0A
Publication of CN103995885A
Application granted
Publication of CN103995885B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and device for recognizing entity names. The method comprises: obtaining a text to be recognized and source information of the text to be recognized; obtaining a first entity name in the text to be recognized according to the source information of the text and a recognition model; and obtaining a second entity name from the content of the text other than the first entity name, according to a pre-built root table and a preset constraint rule. The method improves the accuracy and recall of entity name recognition, is applicable to various language types, and has stronger generality. In addition, by effectively recognizing entity names in creative text, it largely meets the personalization requirements of creatives.

Description

Method and device for recognizing entity names
Technical field
The present invention relates to the field of Internet information processing, and in particular to a method and device for recognizing entity names.
Background art
With the widespread use of computers and the rapid development of the Internet, Internet resources have grown steadily richer and the volume of information has increased sharply. To let users quickly find the information they actually need in massive information sources, documents must be processed so that the entity names in them are recognized automatically, allowing users to search for information by entity name. At present, automatic recognition of entity names remains a technical challenge. Different types of entity names differ in recognition difficulty and call for different recognition methods. Entity names are mainly recognized in two ways: statistical learning methods and rule-based methods. Specifically:
Statistical learning methods comprise a training stage and a recognition stage. In the training stage, relevant features are extracted from an annotated corpus and a suitable machine learning strategy is selected to train a proper-name recognition model. In the recognition stage, the model obtained in training is used to automatically recognize proper names in new text. However, the training corpus must be annotated and proofread manually, which is very time-consuming and laborious. Moreover, entity names change constantly and new ones appear frequently, so the training corpus also has to be updated often. This consumes considerable human resources, and the accuracy is not high.
The idea of rule-based recognition is to write human linguistic knowledge for recognizing entity names as a set of rules, and to let a machine automatically recognize the entity names in text according to those rules. Such rules generally depend on a specific language, such as Chinese or English. However, rules for recognizing entity names are very complex, and there is currently no unified guiding method for encoding this knowledge. Rule-based methods therefore require recognition rules to be written separately for each language, which involves a heavy workload and offers poor generality.
Consequently, existing entity name recognition methods have poor generality and require a large amount of preparatory work, and it is difficult to achieve high accuracy and low human-resource cost at the same time.
Summary of the invention
The present invention aims to solve the above technical problems at least to some extent.
To this end, a first object of the present invention is to propose a method for recognizing entity names that can improve the accuracy and generality of entity name recognition.
A second object of the present invention is to propose a device for recognizing entity names.
To achieve the above objects, an embodiment of the first aspect of the present invention proposes a method for recognizing entity names, comprising: obtaining a text to be recognized and source information of the text to be recognized; obtaining a first entity name in the text to be recognized according to the source information of the text to be recognized and a recognition model; and obtaining a second entity name from the content of the text to be recognized other than the first entity name, according to a pre-built root table and a preset constraint rule.
The method for recognizing entity names of the embodiment of the present invention obtains the first entity name in the text to be recognized according to the source information of the text and a recognition model, and obtains the second entity name in the text according to the root table and the preset constraint rule. It thereby combines the advantages of statistical learning methods and rule-based recognition methods, improves the accuracy and recall of entity name recognition, is applicable to various language types, and has stronger generality. In addition, by effectively recognizing entity names in creative text, it largely meets the personalization requirements of creatives and satisfies the need to identify vocabulary that carries legal risk.
An embodiment of the second aspect of the present invention provides a device for recognizing entity names, comprising: an acquisition module, configured to obtain a text to be recognized and source information of the text to be recognized; a first recognition module, configured to obtain a first entity name in the text to be recognized according to the source information of the text to be recognized and a recognition model; and a second recognition module, configured to obtain a second entity name from the content of the text to be recognized other than the first entity name, according to a pre-built root table and a preset constraint rule.
The device for recognizing entity names of the embodiment of the present invention obtains the first entity name in the text to be recognized according to the source information of the text and a recognition model, and obtains the second entity name in the text according to the root table and the preset constraint rule. It thereby combines the advantages of statistical learning methods and rule-based recognition methods, improves the accuracy and recall of entity name recognition, is applicable to various language types, and has stronger generality. In addition, by effectively recognizing entity names in creative text, it largely meets the personalization requirements of creatives and satisfies the need to identify vocabulary that carries legal risk.
Additional aspects and advantages of the present invention will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of a method for recognizing entity names according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for obtaining the first entity name in the text to be recognized according to the source information of the text and a recognition model, according to an embodiment of the present invention;
Fig. 3 is a flowchart of obtaining the second entity name from the content of the text to be recognized other than the first entity name, according to a pre-built root table and a preset constraint rule, according to an embodiment of the present invention;
Fig. 4 is a flowchart of a method for building the root table and the affix table according to an embodiment of the present invention;
Fig. 5 is a flowchart of a method for building the root recognition model according to an embodiment of the present invention;
Fig. 6 is a flowchart of a method for building the entity recognition model according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a device for recognizing entity names according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of a device for recognizing entity names according to another embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended only to explain the present invention and are not to be construed as limiting it.
In the description of the present invention, it should be understood that the term "a plurality of" means two or more, and that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The method and device for recognizing entity names according to embodiments of the present invention are described below with reference to the accompanying drawings.
To reduce the human resources consumed in recognizing entity names and to improve recognition accuracy, the present invention proposes a method for recognizing entity names, comprising: obtaining a text to be recognized and source information of the text to be recognized; obtaining a first entity name in the text to be recognized according to the source information of the text to be recognized; and obtaining a second entity name from the content of the text to be recognized other than the first entity name, according to a pre-built root table and a preset constraint rule.
In embodiments of the present invention, an entity name is the name of any distinguishable, identifiable thing in the real world, for example an organization name, a brand name, a place name, or a person's name.
Fig. 1 is a flowchart of a method for recognizing entity names according to an embodiment of the present invention. As shown in Fig. 1, the method for recognizing entity names according to the embodiment of the present invention comprises:
S101, obtaining a text to be recognized and source information of the text to be recognized.
In one embodiment of the present invention, the source information of the text to be recognized is the name of the company, website, or the like that published the text, for example "Shenzhen Lian Xunda Electronic Technology Development Co., Ltd.".
In embodiments of the present invention, the text to be recognized is natural language text. The source information may be supplied by the user together with the text to be recognized, or it may be obtained from the publication information recorded when the text was published, such as the publisher's account information, because the account information in most cases includes the location of the publisher's account or the organization the publisher represents.
S102, obtaining a first entity name in the text to be recognized according to the source information of the text to be recognized and a recognition model.
In embodiments of the present invention, the first entity name is an entity name related to the source information of the text to be recognized. For instance, in one embodiment of the present invention, the first entity name may be an organization name. For example, if the source information of the text to be recognized is "Shenzhen Lian Xunda Electronic Technology Development Co., Ltd.", the first entity name may be "Lian Xunda Electronic Technology Development Co., Ltd.".
Specifically, in one embodiment of the present invention, the first entity name in the text to be recognized may be obtained by the steps shown in Fig. 2. As shown in Fig. 2, the method for obtaining the first entity name in the text to be recognized according to the source information of the text and the recognition model comprises:
S201, recognizing the source information of the text to be recognized according to a root recognition model, so as to obtain the root in the source information.
In embodiments of the present invention, the root recognition model is built in advance. More specifically, the root recognition model may be trained before the text to be recognized is processed, or an already trained root recognition model may be copied or downloaded from another storage device. The root recognition model is trained according to the root table and is able to recognize the root in the source information of the text to be recognized. For instance, for the source information "Shenzhen Lian Xunda Electronic Technology Development Co., Ltd.", the root recognition model can recognize the root "Lian Xunda".
S202, obtaining the first entity name in the text to be recognized according to the root and a pre-built affix table.
In embodiments of the present invention, the affix table is a stored table containing the suffixes of a plurality of first entity names. For instance, the affix table may contain entity name suffixes such as "company limited", "electromechanical accessory factory", and "Ltd.".
In one embodiment of the present invention, the first entity name may be an entity name with a suffix, such as "Lian Xunda Co., Ltd.", or an entity name without a suffix, such as "Lian Xunda". Therefore, the root may first be searched for in the text to be recognized; if it is present, the root is a first entity name in the text. Then, according to the root and the affix table, the text may be searched for strings formed by combining the root with any affix in the affix table; each such string found is also a first entity name.
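The lookup in S202 can be sketched as a plain substring search. This is a minimal illustration, not the patented implementation: the function name, the `joiner` parameter (empty for Chinese, a space for the English example here), and the sample affix table and text are all assumptions introduced for the example.

```python
def find_first_entity_names(text, root, affixes, joiner=""):
    """Return the root, plus every root+affix string, that occurs in text."""
    found = []
    if root not in text:
        return found
    # The bare root counts as a first entity name on its own.
    found.append(root)
    for affix in affixes:
        candidate = root + joiner + affix
        if candidate in text:
            found.append(candidate)
    return found


# Hypothetical affix table and creative text, for illustration only.
AFFIXES = ["Co., Ltd.", "Accessory Factory"]
TEXT = "Lian Xunda Co., Ltd. sells network cable. Choose Lian Xunda."

names = find_first_entity_names(TEXT, "Lian Xunda", AFFIXES, joiner=" ")
print(names)
```

A production system would more likely match against a token sequence produced by word segmentation than against raw substrings, but the control flow (root first, then root plus each affix) is the same.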
In another embodiment of the present invention, because many entities may have aliases, the root obtained from the source information may fail to cover every entity name in the text to be recognized; for example, "Fanke" may also be written as "VANCL". To recognize the entity names in the text comprehensively, the method for obtaining the first entity name according to the source information of the text may, in addition to steps S201 to S202, further comprise:
S203, recognizing the text to be recognized according to an entity recognition model, so as to obtain the first entity name in the text to be recognized.
In embodiments of the present invention, the entity recognition model is built in advance. More specifically, the entity recognition model may be trained before the text to be recognized is processed, or an already trained entity recognition model may be copied or downloaded from another storage device. The entity recognition model is trained according to the root table and the affix table and is able to recognize entities in the text to be recognized. For instance, "VANCL Chengpin" in the text to be recognized can be recognized as a first entity name by the entity recognition model.
S103, obtaining a second entity name from the content of the text to be recognized other than the first entity name, according to a pre-built root table and a preset constraint rule.
In embodiments of the present invention, the second entity name is an entity name related to what the first entity acts as agent for, produces, or deals in. For instance, if the first entity name is an organization name, the second entity name may be a brand name. Specifically, the second entity name in the text to be recognized may be recognized by the method shown in Fig. 3. As shown in Fig. 3, obtaining the second entity name from the content of the text other than the first entity name, according to the pre-built root table and the preset constraint rule, comprises:
S301, looking up, according to the pre-built root table, the roots contained in the content of the text to be recognized other than the first entity name.
S302, filtering the roots contained in that content according to the preset constraint rule, so as to obtain the second entity name from the content of the text to be recognized other than the first entity name.
In one embodiment of the present invention, the roots in the root table may be divided into strongly constrained roots and weakly constrained roots. A strongly constrained root is a root that can serve as an entity name in any context, whereas a weakly constrained root is a root that can serve as an entity name only when a certain contextual constraint is satisfied. For instance, "Fanke" is a strongly constrained root, while "seven days" can serve as an entity name only when combined with an affix such as "hotel" or "holiday inn"; in other contexts "seven days" is merely a numeral-classifier phrase. Preset constraint rules therefore need to be established for weakly constrained roots. A preset constraint rule restricts the conditions under which a weakly constrained root is used, so that the root counts as an entity name only when the rule is satisfied. Because weakly constrained roots are of different types, the preset constraint rules are matched to the individual weakly constrained roots, and the present invention does not limit the concrete form of the preset constraint rules.
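Steps S301 and S302 can be illustrated with a small sketch. It assumes one particular form of constraint rule (a weakly constrained root must be immediately followed by one of a set of affixes), which is only one possibility since the source leaves the rule format open. The table contents, including the use of "VANCL" as a strongly constrained root, and all function names are hypothetical.

```python
# Hypothetical root tables, for illustration only.
STRONG_ROOTS = ["VANCL"]                                # always an entity name
WEAK_ROOTS = {"seven days": ("hotel", "holiday inn")}   # root -> required following affix


def find_second_entity_names(text, first_entity_names):
    """Look up roots outside the first entity names and apply constraint rules."""
    # Blank out first entity names so their content is not searched again.
    remaining = text
    for name in sorted(first_entity_names, key=len, reverse=True):
        remaining = remaining.replace(name, " " * len(name))

    # S301 + S302 for strongly constrained roots: presence is enough.
    found = [root for root in STRONG_ROOTS if root in remaining]

    # S301 + S302 for weakly constrained roots: the context must satisfy the rule.
    for root, required_affixes in WEAK_ROOTS.items():
        start = remaining.find(root)
        while start != -1:
            following = remaining[start + len(root):].lstrip()
            if following.startswith(required_affixes):
                found.append(root)
                break
            start = remaining.find(root, start + 1)
    return found


print(find_second_entity_names(
    "Lian Xunda staff book seven days hotel and wear VANCL", ["Lian Xunda"]))
print(find_second_entity_names("The sale lasts seven days", []))
```

In the second call "seven days" is found by the table lookup but rejected by the constraint rule, which is the behavior the paragraph above describes for numeral-classifier uses.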
The method for recognizing entity names of the embodiment of the present invention obtains the first entity name in the text to be recognized according to the source information of the text and a recognition model, and obtains the second entity name in the text according to the root table and the preset constraint rule. It thereby combines the advantages of statistical learning methods and rule-based recognition methods, improves the accuracy and recall of entity name recognition, is applicable to various language types, and has stronger generality. In addition, by effectively recognizing entity names in creative text, it largely meets the personalization requirements of creatives and satisfies the need to identify vocabulary that carries legal risk.
In one embodiment of the present invention, after the entity names are recognized, each is marked with a label corresponding to its type. For instance, the label for an organization name is <ORG></ORG>, and the label for a brand name is <BRD></BRD>. For example, if "Shenzhen Lian Xunda Electronic Technology Development Co., Ltd." is a company name, the entity names in a creative it publishes are labeled as follows:
Creative: ....<BRD>Naikesen</BRD> network cable, first choice: Shenzhen <ORG>Lian Xunda</ORG>
Here, "Lian Xunda" is the organization name, while "Naikesen" is the name of a product the company deals in and should be recognized as a brand name.
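The labeling step can be sketched as a single regular-expression substitution. The `tag_entities` helper and the brand name "BrandX" are hypothetical; only the `ORG` and `BRD` label names come from the description above.

```python
import re


def tag_entities(text, entities):
    """Wrap each recognized entity name in a tag named after its type.

    entities maps an entity name to its label, e.g. {"Lian Xunda": "ORG"}.
    """
    if not entities:
        return text
    # Longest names first, so a short name never splits a longer one.
    names = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    def wrap(match):
        name = match.group(0)
        label = entities[name]
        return f"<{label}>{name}</{label}>"

    return pattern.sub(wrap, text)


tagged = tag_entities(
    "BrandX network cable, first choice: Shenzhen Lian Xunda",
    {"BrandX": "BRD", "Lian Xunda": "ORG"},
)
print(tagged)
```

Doing the replacement in one pass avoids re-tagging text that an earlier replacement has already wrapped.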
Fig. 4 is a flowchart of a method for building the root table and the affix table according to an embodiment of the present invention. Specifically, as shown in Fig. 4, the method for building the root table and the affix table comprises:
S401, collecting a plurality of registered entity names.
In embodiments of the present invention, a registered entity name is an entity name that has already been established, such as a registered company name, a product name, or a registered brand.
S402, performing word segmentation on each of the plurality of registered entity names, so as to obtain a plurality of segmented words.
The registered entity names may be segmented using any word segmentation method in the related art or any that may appear in the future; the present invention does not limit the segmentation method used.
S403, obtaining attribute features of the plurality of segmented words.
In embodiments of the present invention, the attribute features of a segmented word include its part of speech, its length, its frequency of occurrence across all registered entity names, its position within the registered entity name, and similar features.
S404, selecting, from the plurality of segmented words according to the attribute features, a plurality of roots for the root table and a plurality of affixes for the affix table, so as to build the root table and the affix table.
In embodiments of the present invention, roots have attribute features such as a low frequency of occurrence and a typical position between a region word and a product word, whereas affixes have attribute features such as a high frequency of occurrence and a typical position at the tail of a company name. The plurality of roots and the plurality of affixes can therefore be selected from the segmented words according to the attribute features that roots and affixes respectively have.
For instance, a plurality of roots may be selected from the segmented words by the following rules:
A. The characters forming the word cannot be separated by other words;
B. The word is not a region word;
C. The frequency × position of the word must satisfy a certain threshold;
D. The total length of the word must be less than a certain length threshold.
A plurality of affixes may be selected from the segmented words by the following rules:
a. The word is at the tail of a company name (or at the tail of a recursive structure);
b. The frequency of occurrence of the word must be greater than a certain frequency threshold;
c. The characters forming the word must satisfy certain part-of-speech restrictions.
It should be understood that the above rules are merely exemplary. In other embodiments of the present invention, those skilled in the art may also set screening rules for roots and affixes according to other attribute features of roots and affixes not listed above.
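A simplified version of the screening in S403 and S404 might look like the sketch below. It assumes the names are already segmented, uses only frequency, tail position, length, and a region-word list, and replaces the source's "frequency × position" criterion with a plain frequency ceiling. The thresholds, function name, and sample names (other than "Lian Xunda") are hypothetical.

```python
from collections import Counter


def build_tables(segmented_names, region_words,
                 affix_min_freq=2, root_max_freq=1, root_max_len=12):
    """Select roots and affixes from pre-segmented registered entity names."""
    # Attribute feature: frequency of each word across all registered names.
    freq = Counter(word for name in segmented_names for word in name)

    # Affix rules (simplified): at the tail of a name and frequent enough.
    tail_words = {name[-1] for name in segmented_names if name}
    affixes = {w for w in tail_words if freq[w] >= affix_min_freq}

    # Root rules (simplified): not a region word, not an affix, rare, short.
    roots = set()
    for name in segmented_names:
        for word in name:
            if word in affixes or word in region_words:
                continue
            if freq[word] <= root_max_freq and len(word) <= root_max_len:
                roots.add(word)
    return roots, affixes


# Hypothetical registered names, already segmented.
NAMES = [
    ["Shenzhen", "Lian Xunda", "Electronics", "Co., Ltd."],
    ["Shanghai", "Fanke", "Electronics", "Co., Ltd."],
    ["Beijing", "Dawn", "Andrology", "Hospital"],
    ["Guangzhou", "Hongda", "Andrology", "Hospital"],
]
REGIONS = {"Shenzhen", "Shanghai", "Beijing", "Guangzhou"}

roots, affixes = build_tables(NAMES, REGIONS)
print(sorted(roots), sorted(affixes))
```

Product words such as "Electronics" are excluded from the roots here only because they recur across names; a fuller implementation would also use position and part of speech, as the rules above suggest.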
In one embodiment of the present invention, because entity names are of many kinds, the root table holds a very large amount of data. To speed up queries against the root table, a compressed index is built for it; for instance, roots that share a prefix can be given a common index entry based on that prefix, which improves query efficiency. In addition, as in the preceding embodiment, roots are divided into strongly constrained roots and weakly constrained roots, so the root table may be split into a strong root table and a weak root table.
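One common way to share an index entry among roots with the same prefix is a trie. The source does not name the data structure, so the sketch below is an assumption, and the class name, method names, and sample roots are hypothetical.

```python
_END = ""  # sentinel key marking the end of a stored root


class RootIndex:
    """Character trie: roots with a common prefix share the same path."""

    def __init__(self, roots):
        self._trie = {}
        for root in roots:
            node = self._trie
            for ch in root:
                node = node.setdefault(ch, {})
            node[_END] = True

    def longest_match(self, text, start):
        """Return the longest root beginning at text[start], or None."""
        node, best = self._trie, None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if _END in node:
                best = text[start:i + 1]
        return best

    def find_all(self, text):
        """Scan left to right, taking the longest root at each position."""
        matches, i = [], 0
        while i < len(text):
            match = self.longest_match(text, i)
            if match:
                matches.append(match)
                i += len(match)
            else:
                i += 1
        return matches


index = RootIndex(["Lian", "Lian Xunda", "Lianhua"])
print(index.find_all("Shenzhen Lian Xunda and Lianhua"))
```

Each lookup costs time proportional to the length of the matched root rather than to the size of the table, which is the practical benefit of indexing by shared prefix.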
Fig. 5 is a flowchart of a method for building the root recognition model according to an embodiment of the present invention. Specifically, as shown in Fig. 5, the method for building the root recognition model comprises:
S501, obtaining a first training corpus.
In embodiments of the present invention, the first training corpus is the corpus used to train the root recognition model. Specifically, a small number of entity names, for instance 1000, may be sampled from the established entities; the sampled entity names are then proofread manually to obtain the first training corpus, which can bring the recognition accuracy of the trained model above 95%. Because very few entity names are needed to obtain the first training corpus, the manual proofreading workload is also very small and can be completed in a few minutes, which greatly saves manpower and time while keeping accuracy high.
S502, building a first feature template according to the word features of the first training corpus.
In embodiments of the present invention, for each word in the entity names of the first training corpus, two kinds of features are extracted: the word itself and its part of speech. The two kinds of features of different words in the first training corpus are then combined to obtain a first feature template having a first preset number of feature items.
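The feature combination in S502 can be sketched as a function that turns each word into a dictionary of feature items built from the word and part of speech of the current and neighboring words. The window size, feature names, and part-of-speech tags below are assumptions; the source only states that the two kinds of features of different words are combined.

```python
def token_features(sentence, i):
    """Feature items for position i of a list of (word, part_of_speech) pairs."""
    word, pos = sentence[i]
    feats = {"w0": word, "p0": pos, "w0|p0": f"{word}|{pos}"}
    if i > 0:
        prev_word, prev_pos = sentence[i - 1]
        feats.update({"w-1": prev_word, "p-1": prev_pos,
                      "p-1|p0": f"{prev_pos}|{pos}"})
    else:
        feats["BOS"] = True  # beginning of the entity name
    if i < len(sentence) - 1:
        next_word, next_pos = sentence[i + 1]
        feats.update({"w+1": next_word, "p+1": next_pos,
                      "p0|p+1": f"{pos}|{next_pos}"})
    else:
        feats["EOS"] = True  # end of the entity name
    return feats


def sentence_features(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]


# Hypothetical segmented and POS-tagged entity name; the tags are illustrative.
NAME = [("Shenzhen", "ns"), ("Lian Xunda", "nz"), ("Co., Ltd.", "n")]
features = sentence_features(NAME)
print(features[1])
```

Per-token dictionaries of this kind are a common input format for conditional random field toolkits; paired with a label sequence marking which word is the root, they would form one training example.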
S503, training the root recognition model according to the first feature template and a conditional random field model.
A conditional random field model is a discriminative model that predicts the most probable label sequence by defining the conditional probability of a label sequence given an observation sequence. Therefore, in embodiments of the present invention, the conditional random field model can be used, together with the constructed first feature template that reflects the features of roots, to obtain the root recognition model.
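For reference, the conditional probability mentioned above takes the following standard form in a linear-chain conditional random field. The source does not state which variant is used, so the linear-chain form is an assumption. Here $x$ is the observation (word) sequence of length $T$, $y$ is the label sequence, $f_k$ are the feature functions instantiated from the feature template, and $\lambda_k$ are the weights learned in training.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

\hat{y} = \arg\max_{y} \; p(y \mid x)
```

The predicted label sequence $\hat{y}$ is the one with the highest conditional probability, typically found with the Viterbi algorithm.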
Fig. 6 is a flowchart of a method for building the entity recognition model according to an embodiment of the present invention. Specifically, as shown in Fig. 6, the method for building the entity recognition model comprises:
S601, obtaining a second training corpus according to the root table and the affix table.
In embodiments of the present invention, the second training corpus can be constructed automatically from the root table and the affix table. Specifically, a large number of creative fragments are first segmented and tagged with parts of speech, and the root and affix tables are then used for regular-expression matching. The longest "root + affix" match that satisfies the format requirements (for example: no stop words, contiguous with no gaps, length within a threshold) is taken as an organization name with a suffix. The results obtained after matching fall into the following four cases:
1. Creative fragments containing "root + affix", for example: Beijing Dawn (root) Andrology Hospital (affix) has senior experts online.
2. Creative fragments containing only a "root", for example: Beijing Jundu (root) adopts a new technology, five-chamber ion peptide therapy.
3. Creative fragments containing only an "affix", for example: Which hospital (affix) is good for treating prostatitis?
4. Creative fragments containing neither a root nor an affix, for example: No injections. No oral medication. No surgery. No pain.
Of these four cases, the first two contain an entity and are called "positive examples", while the last two contain no entity and are called "negative examples". Because a creative fragment may or may not contain an entity, the second training corpus used to train the entity recognition model should include both positive and negative examples; otherwise the trained model will be biased. The numbers of positive and negative examples must satisfy a certain ratio. In one embodiment of the present invention, based on the distribution of creative fragments that do and do not contain entities, the ratio of positive to negative examples in the second training corpus may be set to 1:3.
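The four-way classification and the 1:3 sampling can be sketched as follows. The sketch works on raw strings rather than segmented, part-of-speech-tagged fragments, omits the stop-word and length checks, and assumes both tables are non-empty. The function names, tables, and fragments are hypothetical.

```python
import random
import re


def _alternation(words):
    # Longest alternatives first so the regex prefers the longest match.
    return "|".join(re.escape(w) for w in sorted(words, key=len, reverse=True))


def classify_fragment(fragment, roots, affixes):
    """Return the case number (1 to 4) described above for a creative fragment."""
    root_alt, affix_alt = _alternation(roots), _alternation(affixes)
    if re.search(f"(?:{root_alt})\\s*(?:{affix_alt})", fragment):
        return 1  # root + affix
    if re.search(root_alt, fragment):
        return 2  # root only
    if re.search(affix_alt, fragment):
        return 3  # affix only
    return 4      # neither


def sample_corpus(fragments, roots, affixes, negatives_per_positive=3, seed=0):
    """Split fragments into positives and negatives at roughly a 1:3 ratio."""
    positives, negatives = [], []
    for fragment in fragments:
        case = classify_fragment(fragment, roots, affixes)
        (positives if case <= 2 else negatives).append(fragment)
    wanted = min(len(negatives), negatives_per_positive * len(positives))
    return positives, random.Random(seed).sample(negatives, wanted)


ROOTS = {"Dawn", "Jundu"}
AFFIXES = {"Andrology Hospital", "Hospital"}
FRAGMENTS = [
    "Beijing Dawn Andrology Hospital has senior experts online",
    "Beijing Jundu adopts a new therapy",
    "Which Hospital is good for prostatitis?",
    "No injections, no surgery, no pain",
]
print([classify_fragment(f, ROOTS, AFFIXES) for f in FRAGMENTS])
positives, negatives = sample_corpus(FRAGMENTS, ROOTS, AFFIXES)
```

When fewer negatives are available than the ratio calls for, the sketch simply keeps all of them; a real pipeline would draw from a much larger pool of creative fragments.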
S602, building a second feature template according to the word features of the second training corpus.
In embodiments of the present invention, for each word in the second training corpus, four kinds of features are extracted: the word itself, its part of speech, its position, and its length. The four kinds of features of different words in the second training corpus are then combined to obtain a second feature template having a second preset number of feature items.
S603, training the entity recognition model according to the second feature template and a conditional random field model.
A conditional random field model is a discriminative model that predicts the most probable label sequence by defining the conditional probability of a label sequence given an observation sequence. Therefore, in embodiments of the present invention, the conditional random field model can be used, together with the constructed second feature template that reflects the features of entity names, to obtain the entity recognition model.
As the embodiments shown in Fig. 4, Fig. 5, and Fig. 6 show, in the method for recognizing entity names of the embodiments of the present invention, the construction of the training corpora, the training of the recognition models, and the building of the root table and the affix table can be carried out almost entirely automatically. Although manual proofreading is needed when obtaining the first training corpus for the root recognition model, the manpower and time required are very small and the dependence on manual work is extremely low, which greatly reduces the consumption of human and material resources and saves time.
To implement the above embodiments, the present invention also proposes a device for recognizing entity names.
A device for recognizing entity names comprises: an acquisition module, configured to obtain a text to be recognized and source information of the text to be recognized; a first recognition module, configured to obtain a first entity name in the text to be recognized according to the source information of the text to be recognized and a recognition model; and a second recognition module, configured to obtain a second entity name from the content of the text to be recognized other than the first entity name, according to a pre-built root table and a preset constraint rule.
Fig. 7 is a schematic structural diagram of a device for recognizing entity names according to an embodiment of the present invention.
As shown in Fig. 7, the device for recognizing entity names according to the embodiment of the present invention comprises: an acquisition module 10, a first recognition module 20, and a second recognition module 30.
Specifically, the acquisition module 10 is configured to obtain the text to be recognized and the source information of the text to be recognized. In one embodiment of the present invention, the source information of the text to be recognized is the name of the company, website, or the like that publishes the text, for example "Shenzhen Lian Xunda Electronic Technology Development Co., Ltd.".
In an embodiment of the present invention, the text to be recognized is natural language text. The source information may be supplied by the user together with the text to be recognized, or may be obtained from the publishing information recorded when the text was published, such as the publisher's account information, because the publisher's account information usually includes the location of the publisher's account or the organization that the publisher represents.
The first recognition module 20 is configured to obtain the first entity name in the text to be recognized according to the source information of the text to be recognized and a recognition model. In an embodiment of the present invention, the first entity name is an entity name related to the source information of the text to be recognized. For instance, in one embodiment of the present invention, the first entity name may be an organization name. For example, if the source information of the text to be recognized is "Shenzhen Lian Xunda Electronic Technology Development Co., Ltd.", the first entity name may be "Lian Xunda Electronic Technology Development Co., Ltd.".
More specifically, in one embodiment of the present invention, the first recognition module 20 is configured to recognize the source information of the text to be recognized according to a root recognition model so as to obtain the root in the source information, and to obtain the first entity name in the text to be recognized according to the root and a pre-built affix table.
In an embodiment of the present invention, the root recognition model is built in advance. More specifically, the root recognition model may be trained before the text to be recognized is processed, or an already trained root recognition model may be copied or downloaded from another storage device. The root recognition model is obtained by training according to the root table and is a recognition model capable of identifying the root in the source information of the text to be recognized. For example, for the source information "Shenzhen Lian Xunda Electronic Technology Development Co., Ltd.", the root recognition model can identify the root "Lian Xunda". In an embodiment of the present invention, the affix table is a stored table containing the suffixes of a plurality of first entity names. For example, the affix table may contain entity name suffixes such as "Co., Ltd.", "Electromechanical Accessory Factory" and "Ltd".
In one embodiment of the present invention, the first entity name may be an entity name with a suffix, such as "Lian Xunda Co., Ltd.", or an entity name without a suffix, such as "Lian Xunda". Therefore, the first recognition module 20 may first search for the root in the text to be recognized; if the root is present, it is a first entity name in the text to be recognized. The first recognition module 20 may then search the text to be recognized for any character string formed by combining the root with an affix in the affix table; each such string is also a first entity name.
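The lookup just described (the bare root first, then the root combined with each affix) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the `sep` parameter are hypothetical, and plain substring search stands in for whatever matching the actual implementation uses.

```python
from typing import Iterable, Set


def find_first_entity_names(text: str, root: str,
                            affixes: Iterable[str], sep: str = "") -> Set[str]:
    """Return the first entity names found in `text` for one root.

    `sep` is an assumption: "" suits languages written without spaces,
    " " suits space-delimited text.
    """
    found: Set[str] = set()
    # The bare root, if present, is itself a first entity name.
    if root in text:
        found.add(root)
    # Every root + affix combination that occurs in the text is one as well.
    for affix in affixes:
        candidate = root + sep + affix
        if candidate in text:
            found.add(candidate)
    return found
```

For instance, `find_first_entity_names("Lian Xunda Co., Ltd. sells cables", "Lian Xunda", ["Co., Ltd."], sep=" ")` returns both the bare root and the suffixed form.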
In another embodiment of the present invention, many entities may have aliases, so the root obtained from the source information may not cover every entity name in the text to be recognized; for example, "Fanke" (凡客) may also be written as "VANCL". To identify the entity names in the text to be recognized comprehensively, the first recognition module 20 may also be configured to recognize the text to be recognized according to an entity recognition model so as to obtain the first entity name in the text. The entity recognition model is built in advance. More specifically, the entity recognition model may be trained before the text to be recognized is processed, or an already trained entity recognition model may be copied or downloaded from another storage device. The entity recognition model is obtained by training according to the root table and the affix table and is a recognition model capable of identifying the entities in the text to be recognized. For example, "VANCL Chengpin" in the text to be recognized can be identified as a first entity name by the entity recognition model.
The second recognition module 30 is configured to obtain the second entity name, according to the pre-built root table and the preset constraint rule, from the content of the text to be recognized other than the first entity name. In an embodiment of the present invention, the second entity name is an entity name related to the agency business, products, or operations of the entity denoted by the first entity name. For example, if the first entity name is an organization name, the second entity name may be a brand name.
More specifically, the second recognition module 30 is configured to look up, according to the pre-built root table, the roots contained in the content of the text to be recognized other than the first entity name, and to filter those roots according to the preset constraint rule so as to obtain the second entity name from that content. In one embodiment of the present invention, the roots in the root table can be divided into strongly constrained roots and weakly constrained roots. A strongly constrained root is a root that can serve as an entity name in any context, whereas a weakly constrained root is a root that can serve as an entity name only when a certain contextual constraint is satisfied. For example, "Fanke" (凡客) is a strongly constrained root, whereas "Seven Days" can serve as an entity name only when combined with an affix such as "Hotel" or "Holiday Inn"; in other cases, "seven days" is merely a numeral-classifier phrase. Preset constraint rules therefore need to be established for weakly constrained roots. A preset constraint rule restricts the conditions under which a weakly constrained root may be used, so that the root can serve as an entity name when the rule is satisfied. Because weakly constrained roots differ in type, each preset constraint rule is matched to its particular weakly constrained root, and the present invention does not limit the specific form of the preset constraint rules.
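The distinction between strongly and weakly constrained roots can be sketched as below. The specification does not limit the form of a constraint rule, so the rule used here (a weak root counts only when it is directly followed by one of a list of affixes) is purely an assumed example, and all names are hypothetical.

```python
from typing import Dict, Iterable, List


def find_second_entity_names(text: str,
                             strong_roots: Iterable[str],
                             weak_rules: Dict[str, List[str]]) -> List[str]:
    """Return roots in `text` that qualify as second entity names.

    `text` is assumed to be the content left after first entity names are excluded.
    `weak_rules` maps each weakly constrained root to the affixes that must follow it
    (an assumed rule form, chosen only for illustration).
    """
    names: List[str] = []
    # Strongly constrained roots qualify wherever they occur.
    for root in sorted(strong_roots):
        if root in text:
            names.append(root)
    # Weakly constrained roots qualify only if some occurrence satisfies the rule.
    for root in sorted(weak_rules):
        start = text.find(root)
        while start != -1:
            following = text[start + len(root):].lstrip()
            if any(following.startswith(affix) for affix in weak_rules[root]):
                names.append(root)
                break
            start = text.find(root, start + 1)
    return names
```

With `strong_roots={"VANCL"}` and `weak_rules={"Seven Days": ["Hotel", "Holiday Inn"]}`, the phrase "Seven Days Hotel" yields the root "Seven Days", while "within Seven Days" on its own does not.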
The device for recognizing entity names according to the embodiment of the present invention obtains the first entity name in the text to be recognized according to the source information of the text and a recognition model, and obtains the second entity name in the text according to the root table and the preset rule. It thereby combines the advantages of statistical learning methods and rule-based recognition methods, improves the precision and recall of named entity recognition, is applicable to various forms of language, and is highly versatile. In addition, effective recognition of the entity names in creative text greatly helps to satisfy personalization needs in creatives and satisfies the need to identify vocabulary that carries legal risk.
In one embodiment of the invention, after identifying physical name, according to the type of the physical name identifying, stamp corresponding label.For instance, the label of mechanism's name is <ORG></ORGGreatT .GreaT.GT, and the label of brand name is <BRD></BRDGreatT .GreaT.GT.For example, if " Shenzhen Lian Xunda electronic technology development corporation, Ltd. " is an exabyte, but the label of the physical name in the intention of its issue is as follows:
Intention: ... .<BRD> the gloomy </BRD> netting twine of Nike-first-selected Shenzhen <ORG> even interrogates and reaches </ORG>
Wherein, " DataExpert reaches " is mechanism's name; And " Nike is gloomy " is the ProductName of its operation, should be identified as brand name.
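The tagging step can be sketched as follows. This is an assumed illustration, not the patented implementation: the function name, the type labels "organization" and "brand", and the placeholder brand "BrandX" are hypothetical, and only the tag names ORG and BRD come from the description.

```python
import re
from typing import Dict

# Maps an entity type to its tag name (ORG and BRD follow the description).
TAGS = {"organization": "ORG", "brand": "BRD"}


def tag_entities(text: str, entities: Dict[str, str]) -> str:
    """Wrap every recognized entity name in the tag for its type.

    `entities` maps an entity name to "organization" or "brand".
    Longer names are tried first so that a name nested in a longer one
    is not tagged separately; the text is rewritten in a single pass.
    """
    if not entities:
        return text
    names = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))

    def wrap(match: "re.Match[str]") -> str:
        name = match.group(0)
        tag = TAGS[entities[name]]
        return f"<{tag}>{name}</{tag}>"

    return pattern.sub(wrap, text)
```

Applied to "BrandX network cable - first choice: Shenzhen Lian Xunda" with "BrandX" marked as a brand and "Lian Xunda" as an organization, the function produces the same tag layout as the creative shown above.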
Fig. 8 is a schematic structural diagram of a device for recognizing entity names according to another embodiment of the present invention. As shown in Fig. 8, the device comprises: an acquisition module 10, a first recognition module 20, a second recognition module 30, a vocabulary building module 40, a first model training module 50 and a second model training module 60.
Specifically, the vocabulary building module 40 is configured to:
Collect a plurality of registered entity names, where a registered entity name is an established entity name, such as a registered company name, a product name, or a registered brand;
Perform word segmentation on each of the registered entity names to obtain a plurality of segmented words, where any segmentation method in the related art, or any that may appear in the future, can be used, and the present invention does not limit the segmentation method;
Obtain the attribute features of the segmented words, where the attribute features of a segmented word include its part of speech, its length, its frequency of occurrence across all registered entity names, and its position within the registered entity name;
Select, according to the attribute features, a plurality of roots for the root table and a plurality of affixes for the affix table from the segmented words, so as to build the root table and the affix table.
In an embodiment of the present invention, roots have attribute features such as a low frequency of occurrence and a usual position between the location word and the product word, whereas affixes have attribute features such as a high frequency of occurrence and a usual position at the tail of a company name. Roots and affixes can therefore be selected from the segmented words according to their respective attribute features.
For example, roots can be selected from the segmented words by the following rules:
A. The characters that form the word must not be separated by other words;
B. The word is not a location word;
C. The product of the word's frequency and position must satisfy a certain threshold restriction;
D. The total length of the word must be less than a certain length threshold.
Affixes can be selected from the segmented words by the following rules:
a. The word is at the tail of a company name (or at the tail of a recursive structure);
b. The word's frequency of occurrence must be greater than a certain frequency threshold;
c. The characters that form the word must satisfy a certain part-of-speech restriction.
It should be understood that the above rules are merely exemplary. In other embodiments of the present invention, those skilled in the art may also set the selection rules for roots and affixes according to other attribute features of roots and affixes that are not listed above.
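The selection rules listed above can be sketched as a simple filter over per-word statistics. Everything concrete here is an assumption made for illustration: the field names, the part-of-speech tags, the threshold values, and in particular the direction of the frequency-times-position test (the description only says that a threshold restriction must be satisfied). Rule A, which concerns characters being separated by other words, is not modelled.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set, Tuple


@dataclass
class TokenStats:
    word: str
    pos: str              # part-of-speech tag (tag names are hypothetical)
    frequency: int        # occurrences across all registered entity names
    mean_position: float  # assumed scale: 0.0 = head of name, 1.0 = tail
    at_tail: bool         # whether the word typically ends a name


def screen(tokens: Iterable[TokenStats], *,
           max_freq_times_pos: float = 5.0,   # rule C threshold (assumed direction: at most)
           max_root_len: int = 6,             # rule D threshold
           min_affix_freq: int = 100,         # rule b threshold
           affix_pos: FrozenSet[str] = frozenset({"n"}),  # rule c restriction
           location_pos: str = "ns") -> Tuple[Set[str], Set[str]]:
    """Split segmented words into candidate roots and candidate affixes."""
    roots: Set[str] = set()
    affixes: Set[str] = set()
    for t in tokens:
        if t.at_tail and t.frequency > min_affix_freq and t.pos in affix_pos:
            affixes.add(t.word)   # rules a, b, c
        elif (t.pos != location_pos
              and t.frequency * t.mean_position <= max_freq_times_pos
              and len(t.word) < max_root_len):
            roots.add(t.word)     # rules B, C, D
    return roots, affixes
```

In practice the thresholds would be tuned on the collected registered entity names rather than fixed in advance.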
In one embodiment of the present invention, because entity names are of many kinds, the root table contains a very large amount of data. To improve lookup speed when the root table is used, a compressed index is built for the root table. For example, for roots that share the same prefix, a common index can be built from that shared prefix, which improves lookup efficiency. In addition, as in the foregoing embodiment, roots are divided into strongly constrained roots and weakly constrained roots, so the root table can be divided into a strong root table and a weak root table.
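One common way to share an index among roots with the same prefix is a trie (prefix tree). The sketch below is an assumption about how such an index could look; the description does not name a specific data structure, and the class and method names are hypothetical.

```python
from typing import Dict, List


class RootTrie:
    """Prefix tree over roots: roots with a common prefix share the same path."""

    def __init__(self) -> None:
        self._children: Dict[str, "RootTrie"] = {}
        self._is_root = False  # True if the path from the top spells a complete root

    def add(self, root: str) -> None:
        node = self
        for ch in root:
            node = node._children.setdefault(ch, RootTrie())
        node._is_root = True

    def matches_at(self, text: str, start: int) -> List[str]:
        """Return every root that begins at text[start], shortest first."""
        node = self
        found: List[str] = []
        for end in range(start, len(text)):
            node = node._children.get(text[end])
            if node is None:
                break
            if node._is_root:
                found.append(text[start:end + 1])
        return found
```

Scanning a text then costs one walk down the tree per start position, regardless of how many roots the table holds. Separate tries could be kept for the strong root table and the weak root table.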
The first model training module 50 is configured to:
Obtain the first corpus, where the first corpus is the corpus used to train the root recognition model. Specifically, a small number of entity names can be sampled from the established entity names; for example, 1000 entity names can be sampled and then manually proofread to obtain the first corpus, which allows the recognition accuracy of the trained recognition model to reach 95% or more. Because very few entity names are needed to obtain the first corpus, the manual proofreading workload is also very small and can be completed in just a few minutes, which greatly saves manpower and time while keeping accuracy high;
Build a first feature template according to the word features of the first corpus, where two types of features are extracted for each word of the entity names in the first corpus, namely the word itself and its part of speech, and the two types of features of different words in the first corpus are then combined to obtain a first feature template with a first predetermined number of feature items;
Train the root recognition model according to the first feature template and a conditional random field model, where the conditional random field model is a discriminative model that predicts the most probable label sequence by defining the conditional probability of a label sequence given an observation sequence. Therefore, in an embodiment of the present invention, the conditional random field model can be applied to the constructed first feature template, which captures the features of roots, to obtain the root recognition model.
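For reference, the conditional probability that a linear-chain conditional random field defines can be written in its standard textbook form. The specification itself gives no formula, so the notation below is an assumption: x is the observation sequence (the words), y is the label sequence, f_k are feature functions instantiated from the feature template, and λ_k are the weights learned in training.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)

\hat{y} = \arg\max_{y} \; p(y \mid x)
```

The second line is the prediction step: the most probable label sequence is the one that maximizes the conditional probability, which is typically found with the Viterbi algorithm.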
The second model training module 60 is configured to:
Obtain the second corpus according to the root table and the affix table, where the results obtained after matching fall into the following four cases:
1. A creative fragment containing "root + affix", for example: Beijing Dawn (root) Andrology Hospital (affix) has senior experts online.
2. A creative fragment containing only a "root", for example: Beijing Jundu (root) adopts the new five-chamber ion peptide therapy.
3. A creative fragment containing only an "affix", for example: Which hospital (affix) is good for treating prostatitis?
4. A creative fragment containing neither a root nor an affix, for example: No injections. No oral medication. No surgery. No pain.
Of these four cases, the first two contain an entity and are called "positive examples", whereas the last two contain no entity and are called "negative examples". Because a creative fragment may or may not contain an entity, the second corpus used to train the entity recognition model should include both positive and negative examples; otherwise the trained model will be biased. The numbers of positive and negative examples must be in a certain proportion. In one embodiment of the present invention, based on the distribution in creatives of fragments that contain an entity and fragments that do not, the ratio of positive examples to negative examples in the second corpus can be set to 1:3;
Build a second feature template according to the word features of the second corpus, where four types of features are extracted for each word in the second corpus, namely the word itself, its part of speech, its position, and its length, and the four types of features of different words in the second corpus are then combined to obtain a second feature template with a second predetermined number of feature items;
Train the entity recognition model according to the second feature template and a conditional random field model, where the conditional random field model is a discriminative model that predicts the most probable label sequence by defining the conditional probability of a label sequence given an observation sequence. Therefore, in an embodiment of the present invention, the conditional random field model can be applied to the constructed second feature template, which captures the features of entity names, to obtain the entity recognition model.
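The construction of the second corpus handled by this module (four matching cases, positive and negative examples, a 1:3 ratio) can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, plain substring containment stands in for the real matching against the root and affix tables, and negatives are down-sampled at random with a fixed seed.

```python
import random
from typing import Iterable, List, Tuple


def classify(fragment: str, roots: Iterable[str], affixes: Iterable[str]) -> int:
    """Return the case number (1 to 4) of a fragment, as listed in the description."""
    has_root = any(r in fragment for r in roots)
    has_affix = any(a in fragment for a in affixes)
    if has_root and has_affix:
        return 1  # root + affix
    if has_root:
        return 2  # root only
    if has_affix:
        return 3  # affix only
    return 4      # neither


def build_second_corpus(fragments: List[str], roots: List[str], affixes: List[str],
                        neg_per_pos: int = 3, seed: int = 0) -> List[Tuple[str, bool]]:
    """Label fragments and keep at most `neg_per_pos` negatives per positive."""
    positives = [f for f in fragments if classify(f, roots, affixes) in (1, 2)]
    negatives = [f for f in fragments if classify(f, roots, affixes) in (3, 4)]
    wanted = min(len(negatives), neg_per_pos * len(positives))
    sampled = random.Random(seed).sample(negatives, wanted)
    return [(f, True) for f in positives] + [(f, False) for f in sampled]
```

If fewer negatives are available than the ratio calls for, this sketch simply keeps all of them, so the 1:3 ratio is then an upper bound rather than an exact proportion.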
In the device for recognizing entity names according to the embodiment of the present invention, the preparation of the training corpora, the training of the recognition models, and the building of the root table and the affix table can be performed almost entirely automatically. Manual proofreading is needed only when obtaining the first corpus used to train the root recognition model, and the manpower and time it requires are very small. The dependence on manual work is therefore extremely low, which greatly reduces the consumption of human and material resources, saves time, and keeps accuracy high.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present invention includes other implementations in which functions may be performed out of the order shown or discussed, including substantially simultaneously or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flowchart or otherwise described herein, for example an ordered list of executable instructions for implementing logical functions, may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate, or transport the program for use by, or in connection with, the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those skilled in the art will understand that all or part of the steps of the above method embodiments can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination of them.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In this specification, a description that refers to the terms "an embodiment", "some embodiments", "an example", "a specific example", "some examples", or the like means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. Such terms, as used in this specification, do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in a suitable manner in any one or more embodiments or examples.
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions, and variations can be made to these embodiments without departing from the principle and spirit of the present invention. The scope of the present invention is defined by the claims and their equivalents.

Claims (16)

1. A method for recognizing entity names, characterized by comprising:
Obtaining text to be recognized and source information of the text to be recognized;
Obtaining a first entity name in the text to be recognized according to the source information of the text to be recognized and a recognition model;
Obtaining a second entity name, according to a pre-built root table and a preset constraint rule, from the content of the text to be recognized other than the first entity name.
2. The method of claim 1, characterized in that
The first entity name is an organization name;
The second entity name is a brand name.
3. The method of claim 1 or 2, characterized in that obtaining the first entity name in the text to be recognized according to the source information of the text to be recognized and the recognition model specifically comprises:
Recognizing the source information of the text to be recognized according to a root recognition model, to obtain the root in the source information of the text to be recognized;
Obtaining the first entity name in the text to be recognized according to the root and a pre-built affix table.
4. The method of claim 3, characterized by further comprising:
Recognizing the text to be recognized according to an entity recognition model, to obtain the first entity name in the text to be recognized.
5. The method of claim 1 or 2, characterized in that obtaining the second entity name, according to the pre-built root table and the preset constraint rule, from the content of the text to be recognized other than the first entity name specifically comprises:
Looking up, according to the pre-built root table, the roots contained in the content of the text to be recognized other than the first entity name;
Filtering, according to the preset constraint rule, the roots contained in the content of the text to be recognized other than the first entity name, to obtain the second entity name from that content.
6. The method of claim 3, characterized by further comprising, before obtaining the text to be recognized and the source information of the text to be recognized:
Collecting a plurality of registered entity names;
Performing word segmentation on each of the registered entity names, to obtain a plurality of segmented words;
Obtaining the attribute features of the segmented words;
Selecting, according to the attribute features, a plurality of roots for the root table and a plurality of affixes for the affix table from the segmented words, to build the root table and the affix table.
7. The method of claim 3, characterized by further comprising:
Obtaining a first corpus;
Building a first feature template according to the word features of the first corpus;
Training the root recognition model according to the first feature template and a conditional random field model.
8. The method of claim 3, characterized by further comprising:
Obtaining a second corpus according to the root table and the affix table;
Building a second feature template according to the word features of the second corpus;
Training the entity recognition model according to the second feature template and the conditional random field model.
9. A device for recognizing entity names, characterized by comprising:
An acquisition module, configured to obtain text to be recognized and source information of the text to be recognized;
A first recognition module, configured to obtain a first entity name in the text to be recognized according to the source information of the text to be recognized and a recognition model;
A second recognition module, configured to obtain a second entity name, according to a pre-built root table and a preset constraint rule, from the content of the text to be recognized other than the first entity name.
10. The device of claim 9, characterized in that
The first entity name is an organization name;
The second entity name is a brand name.
11. The device of claim 9 or 10, characterized in that the first recognition module is specifically configured to:
Recognize the source information of the text to be recognized according to a root recognition model, to obtain the root in the source information of the text to be recognized;
Obtain the first entity name in the text to be recognized according to the root and a pre-built affix table.
12. The device of claim 11, characterized in that the first recognition module is further configured to recognize the text to be recognized according to an entity recognition model, to obtain the first entity name in the text to be recognized.
13. The device of claim 9 or 10, characterized in that the second recognition module is specifically configured to:
Look up, according to the pre-built root table, the roots contained in the content of the text to be recognized other than the first entity name;
Filter, according to the preset constraint rule, the roots contained in the content of the text to be recognized other than the first entity name, to obtain the second entity name from that content.
14. The device of claim 11, characterized by further comprising a vocabulary building module, the vocabulary building module being configured to:
Collect a plurality of registered entity names;
Perform word segmentation on each of the registered entity names, to obtain a plurality of segmented words;
Obtain the attribute features of the segmented words;
Select, according to the attribute features, a plurality of roots for the root table and a plurality of affixes for the affix table from the segmented words, to build the root table and the affix table.
15. The device of claim 11, characterized by further comprising a first model training module, the first model training module being configured to:
Obtain a first corpus;
Build a first feature template according to the word features of the first corpus;
Train the root recognition model according to the first feature template and a conditional random field model.
16. The device of claim 11, characterized by further comprising a second model training module, the second model training module being configured to:
Obtain a second corpus according to the root table and the affix table;
Build a second feature template according to the word features of the second corpus;
Train the entity recognition model according to the second feature template and the conditional random field model.
CN201410234622.0A 2014-05-29 2014-05-29 Method and device for recognizing entity names Active CN103995885B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410234622.0A CN103995885B (en) 2014-05-29 2014-05-29 The recognition methods of physical name and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410234622.0A CN103995885B (en) 2014-05-29 2014-05-29 The recognition methods of physical name and device

Publications (2)

Publication Number Publication Date
CN103995885A true CN103995885A (en) 2014-08-20
CN103995885B CN103995885B (en) 2017-11-17

Family

ID=51310050

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410234622.0A Active CN103995885B (en) 2014-05-29 2014-05-29 The recognition methods of physical name and device

Country Status (1)

Country Link
CN (1) CN103995885B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550372A (en) * 2016-01-28 2016-05-04 浪潮软件集团有限公司 Sentence training device and method and information extraction system
CN106503192A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
CN107943786A (en) * 2017-11-16 2018-04-20 广州市万隆证券咨询顾问有限公司 A kind of Chinese name entity recognition method and system
CN108241621A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 The search method and device of legal knowledge
CN108595430A (en) * 2018-04-26 2018-09-28 携程旅游网络技术(上海)有限公司 Boat becomes information extracting method and system
CN108829681A (en) * 2018-06-28 2018-11-16 北京神州泰岳软件股份有限公司 A kind of name entity extraction method and device
CN109582975A (en) * 2019-01-31 2019-04-05 北京嘉和美康信息技术有限公司 It is a kind of name entity recognition methods and device
CN110688841A (en) * 2019-09-30 2020-01-14 广州准星信息科技有限公司 Mechanism name identification method, mechanism name identification device, mechanism name identification equipment and storage medium
CN110750991A (en) * 2019-09-18 2020-02-04 平安科技(深圳)有限公司 Entity identification method, device, equipment and computer readable storage medium
CN111178073A (en) * 2018-10-23 2020-05-19 北京嘀嘀无限科技发展有限公司 Text processing method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101901235A (en) * 2009-05-27 2010-12-01 国际商业机器公司 Method and system for document processing
CN102103594A (en) * 2009-12-22 2011-06-22 北京大学 Character data recognition and processing method and device
US20130346069A1 (en) * 2012-06-15 2013-12-26 Canon Kabushiki Kaisha Method and apparatus for identifying a mentioned person in a dialog

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105550372A (en) * 2016-01-28 2016-05-04 浪潮软件集团有限公司 Sentence training device and method and information extraction system
CN106503192A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Named entity recognition method and device based on artificial intelligence
CN106503192B (en) * 2016-10-31 2019-10-15 北京百度网讯科技有限公司 Named entity recognition method and device based on artificial intelligence
CN108241621A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 Legal knowledge search method and device
CN107943786A (en) * 2017-11-16 2018-04-20 广州市万隆证券咨询顾问有限公司 Chinese named entity recognition method and system
CN108595430A (en) * 2018-04-26 2018-09-28 携程旅游网络技术(上海)有限公司 Flight change information extraction method and system
CN108595430B (en) * 2018-04-26 2022-02-22 携程旅游网络技术(上海)有限公司 Flight change information extraction method and system
CN108829681A (en) * 2018-06-28 2018-11-16 北京神州泰岳软件股份有限公司 Named entity extraction method and device
CN111178073A (en) * 2018-10-23 2020-05-19 北京嘀嘀无限科技发展有限公司 Text processing method and device, electronic equipment and storage medium
CN109582975A (en) * 2019-01-31 2019-04-05 北京嘉和美康信息技术有限公司 Named entity recognition method and device
CN109582975B (en) * 2019-01-31 2023-05-23 北京嘉和海森健康科技有限公司 Named entity identification method and device
CN110750991A (en) * 2019-09-18 2020-02-04 平安科技(深圳)有限公司 Entity identification method, device, equipment and computer readable storage medium
CN110688841A (en) * 2019-09-30 2020-01-14 广州准星信息科技有限公司 Organization name identification method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN103995885B (en) 2017-11-17

Similar Documents

Publication Publication Date Title
CN103995885A (en) Method and device for recognizing entity names
WO2020007224A1 (en) Knowledge graph construction and smart response method and apparatus, device, and storage medium
CN106570180B (en) Voice search method and device based on artificial intelligence
US9652719B2 (en) Authoring system for bayesian networks automatically extracted from text
KR101707369B1 (en) Construction method and device for event repository
CN102253930B (en) A kind of method of text translation and device
CN104679850B (en) Address structure method and device
CN104035975B (en) It is a kind of to realize the method that remote supervisory character relation is extracted using Chinese online resource
CN108052659A (en) Searching method, device and electronic equipment based on artificial intelligence
TW202020691A (en) Feature word determination method and device and server
CN105183923A (en) New word discovery method and device
WO2018201600A1 (en) Information mining method and system, electronic device and readable storage medium
CN105389349A (en) Dictionary updating method and apparatus
CN103092979A (en) Processing method and device for searching of natural language by remote sensing data
CN103686244A (en) Video data managing method and system
CN103778200A (en) Method for extracting information source of message and system thereof
CN102750282B (en) Synonym template mining method and device as well as synonym mining method and device
CN103927299A (en) Method for providing candidate sentences in input method and method and device for recommending input content
CN102339294A (en) Searching method and system for preprocessing keywords
CN107577713B (en) Text handling method based on electric power dictionary
CN103150409B (en) Method and system for recommending user search word
WO2023040493A1 (en) Event detection
CN103294820A (en) WEB page classifying method and system based on semantic extension
CN104516870A (en) Translation check method and system
CN104462272A (en) Search requirement analysis method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant