CN109710925A

CN109710925A - Name entity recognition method and device

Info

Publication number: CN109710925A
Application number: CN201811518491.3A
Authority: CN
Inventors: 樊芳利
Original assignee: New H3C Big Data Technologies Co Ltd
Current assignee: New H3C Big Data Technologies Co Ltd
Priority date: 2018-12-12
Filing date: 2018-12-12
Publication date: 2019-05-03

Abstract

This disclosure relates to a kind of name entity recognition method and device, including text is labeled according to default mark collection to obtain training corpus, the notation methods of the default type of the mark collection comprising name entity and notation methods and non-name entity；According to the training corpus, the preset multiple identification models of training；Text to be identified is identified using the multiple identification model respectively, obtains the recognition result of each identification model；The recognition result of weight and each identification model based on each identification model, obtains the recognition result of the text to be identified.Text to be identified is identified respectively by using multiple identification models, obtains the recognition result of each identification model, and the precision of name Entity recognition can be effectively able to ascend according to the name entity recognition method and device of the embodiment of the present disclosure.

Description

Name entity recognition method and device

Technical field

This disclosure relates to pattern-recognition and sorting technique field more particularly to a kind of name entity recognition method and device.

Background technique

Name Entity recognition be intended to identify proper noun in natural language text and significant numeral classifier phrase (refer to by The phrase that number and quantifier combination are constituted), and classified.Name Entity recognition belongs to unknown word identification in morphological analysis Scope is essential group of a variety of natural language processing techniques such as information extraction, information retrieval, machine translation and question answering system At part.

In the related technology, the method for naming Entity recognition is divided into: rule-based and dictionary method, Statistics-Based Method With the method etc. based on deep learning.Rule-based and dictionary method depends on concrete syntax, field and text style, is easy Mistake is generated, it is portable poor.Statistics-Based Method also compares the dependence of corpus the more demanding of Feature Selection Greatly.There are problems that gradient disappearance in training based on the method for deep learning.These methods are difficult to meet to a large amount of at present random Then, the identification of multi-field name entity.

Summary of the invention

In view of this, being able to ascend name Entity recognition the present disclosure proposes a kind of name entity recognition method and device Precision.

According to the one side of the disclosure, a kind of name entity recognition method is provided, which comprises according to pre- bidding Note collection is labeled to obtain training corpus for text, the default type of the mark collection comprising name entity and notation methods, And the notation methods of non-name entity；According to the training corpus, the preset multiple identification models of training；Using the multiple Identification model identifies text to be identified respectively, obtains the recognition result of each identification model；Weight based on each identification model with And the recognition result of each identification model, obtain the recognition result of the text to be identified.

According to another aspect of the present disclosure, a kind of name entity recognition device is provided, described device includes: mark mould Block obtains training corpus for being labeled according to default mark collection for text, and the default mark collection includes name entity Type and notation methods and it is non-name entity notation methods；Training module, for according to the training corpus, training Preset multiple identification models；Identification module obtains every for identifying text to be identified respectively using the multiple identification model The recognition result of a identification model；Weighting block, the identification knot for weight and each identification model based on each identification model Fruit obtains the recognition result of the text to be identified.

In the embodiments of the present disclosure, text is labeled to obtain training corpus according to default mark collection, according to training Corpus, the preset multiple identification models of training, identifies text to be identified using multiple identification models respectively, obtains each identification mould The recognition result of type combines the advantage of different identification models by the recognition result of the multiple identification models of Weighted Fusion, is promoted The precision of name Entity recognition.

According to below with reference to the accompanying drawings to detailed description of illustrative embodiments, the other feature and aspect of the disclosure be will be apparent.

Detailed description of the invention

Comprising in the description and constituting the attached drawing of part of specification and specification together illustrates the disclosure Exemplary embodiment, feature and aspect, and for explaining the principles of this disclosure.

Fig. 1 shows the flow chart of the name entity recognition method according to one embodiment of the disclosure.

Fig. 2 shows the identification model fusion architecture schematic diagrames according to the embodiment of the present disclosure.

Fig. 3 shows the flow chart of the name entity recognition method according to one embodiment of the disclosure.

Fig. 4 shows the flow chart of the step S12 according to one embodiment of the disclosure.

Fig. 5 shows the flow chart of the name entity recognition method according to one embodiment of the disclosure.

Fig. 6 shows the flow chart of the step S12 according to one embodiment of the disclosure.

Fig. 7 shows the block diagram of the name entity recognition device according to one embodiment of the disclosure.

Fig. 8 shows the block diagram of the name entity recognition device according to one embodiment of the disclosure.

Fig. 9 is shown according to an exemplary embodiment a kind of for naming the block diagram of entity recognition device 900.

Specific embodiment

Various exemplary embodiments, feature and the aspect of the disclosure are described in detail below with reference to attached drawing.It is identical in attached drawing Appended drawing reference indicate element functionally identical or similar.Although the various aspects of embodiment are shown in the attached drawings, remove It non-specifically points out, it is not necessary to attached drawing drawn to scale.

Dedicated word " exemplary " means " being used as example, embodiment or illustrative " herein.Here as " exemplary " Illustrated any embodiment should not necessarily be construed as preferred or advantageous over other embodiments.

In addition, giving numerous details in specific embodiment below to better illustrate the disclosure. It will be appreciated by those skilled in the art that without certain details, the disclosure equally be can be implemented.In some instances, for Method, means, element and circuit well known to those skilled in the art are not described in detail, in order to highlight the purport of the disclosure.

Fig. 1 shows the flow chart of the name entity recognition method according to one embodiment of the disclosure.As shown in Figure 1, this method Can include:

Step S11 is labeled text according to default mark collection to obtain training corpus, and the default mark collection includes Name type, notation methods and the non-notation methods for naming entity of entity.

Step S12, according to the training corpus, the preset multiple identification models of training.

Step S13 identifies text to be identified using the multiple identification model respectively, obtains the identification of each identification model As a result.

Step S14, the recognition result of weight and each identification model based on each identification model, obtains the text to be identified This recognition result.

In the embodiments of the present disclosure, by the recognition result of the multiple identification models of Weighted Fusion, different identification moulds are combined The advantage of type improves the precision of name Entity recognition.

Before model is trained and is tested, the text used when needing to training and test is labeled, and is instructed Practice corpus.Default mark collection can be used for being labeled text.Default mark collection may include the type and mark of name entity The notation methods of note mode and non-name entity.Wherein, the type of entity is named to can be used for identifying that different names is real Body.The notation methods of name entity can serve to indicate that how to mark name entity all types of in text.Non- name entity is real The notation methods of body can serve to indicate that the non-name entity how marked in text.

It in the embodiment of the present disclosure, is illustrated by taking medical electronics medical record data as an example, the name entity of the embodiment of the present disclosure Recognition methods can be also used for the identification of other data.

It is illustrated for using medical electronics medical record data as text.By analyzing medical electronics medical record data, doctor Diagnosis and treatment activity for patient may be summarized to be: (what do to check) performance (what disease of discovery disease by detection methods Shape), it provides diagnosis (what disease), and be based on diagnosis, provides remedy measures (how treating).During from this As can be seen that curative activity relates generally to 4 class important informations: inspection, symptom, disease and treatment.On this basis, mark is preset Concentrate the type that may include 5 name entities, comprising: disease type, symptom type, checks class at medical diagnosis on disease classification type Type and treatment type.

The notation methods for including according to default mark collection: the name entity of disease type can be labeled as disease；Disease The name entity of diagnostic classification type can be labeled as disease_type；The name entity of symptom type can be labeled as symptom；The name entity of inspect-type can be labeled as test；The name entity for treating type can be labeled as treatment。

Table 1 shows the mark collection of the embodiment of the present disclosure and the example of training corpus.In one example, as shown in table 1, Disease can be labeled as " hypertension, the diabetes " in text " old female suffers from, and denies hypertension, diabetic history ".

Default mark collection further comprises the notation methods of non-name entity, for example, non-name entity can be labeled as O.Its In, non-name entity can indicate the part in corpus in addition to naming entity.

Table 1

In one possible implementation, the notation methods of entity is named to also can indicate that the first character of name entity (or the last one divides for symbol (either first participle), intermediate character (either intermediate participle) and last character Word).As shown in table 1, in one example, name the first character, intermediate character and last character of entity can be with It is respectively labeled as B, I and E.For example, the first character of the name entity of disease type can be labeled as disease-B, The character of the centre of the name entity of disease type can be labeled as disease-I, the name entity of disease type Last character can be labeled as disease-E.

Can be according to the training corpus marked, the preset multiple identification models of training.Each identification model can be used for Identification name entity.

In one possible implementation, preset multiple identification models may include CRF model and/or BILSTM_ CRF model and/or IDCNN_CRF model.Wherein, CRF model is probabilistic model, in the embodiments of the present disclosure, the spy of CRF model Sign includes the part of speech of analysis, and the contextual information being made of the context of the context and part of speech that segment；BILSTM_CRF Model and IDCNN_CRF model are deep learning model, in the embodiments of the present disclosure, BILSTM_CRF model and IDCNN_CRF The feature vector of model is the character feature vector that the word vector sum Character segmentation vector of character forms.It should be noted that this Multiple identification models of open embodiment can also be other disaggregated models, with no restrictions to this disclosure.

Text to be identified can indicate the text for needing to identify name entity.In the instruction for completing multiple preset identification models After white silk, server can identify the name entity in text to be identified using the identification model that training is completed, and provide and identify Name entity recognition result.

Fig. 2 shows the identification model fusion architecture schematic diagrames according to the embodiment of the present disclosure.As shown in Fig. 2, can will be wait know Other text inputs each identification model respectively, obtains the recognition result of each identification model, according to the weight of each identification model, to each The recognition result of identification model is weighted summation, and final recognition result is determined according to summed result.

Wherein, the recognition result obtained by identification model is that each character of text to be identified is identified as naming entity Probability.

In one possible implementation, when summed result is greater than specified threshold, determination belongs to name entity, when asking When being less than or equal to specified threshold with result, determines and be not belonging to name entity.Wherein, specified threshold can according to need progress It determines, such as can be 0.5.

In an example it is assumed that multiple identification models include model 1, model 2 and model 3, weight is respectively 0.1,0.3 With 0.7, specified threshold 0.5.After text input model 1 to be identified, model 2 and model 3, the to be identified of the output of model 1 is obtained Recognition result (i.e. each character is identified as naming the probability of entity), the model 2 of each character export to be identified in text The recognition result of each character in the text to be identified that the recognition result of each character and model 3 export in text.Wait know For the first character of other text, it is assumed that it is 0.7 that the first character that model 1 exports, which is identified as naming the probability of entity, It is 0.8 that the first character that model 2 exports, which is identified as naming the probability of entity, and the first character that model 3 exports is identified Probability to name entity is 0.2, is weighted summation to these three probability, available summed result is 0.45, the summation As a result it is less than specified threshold 0.5, accordingly, it is determined that the first character of text to be identified is not belonging to name entity.For to be identified Other characters of text are referred to first character, and which is not described herein again.

It, can be with Loss (loss) minimum optimization aim of fused model, in reality in a kind of possible realization During existing Loss value is the smallest, continuous iteration finally searches out optimal weight combination, so that it is determined that each identification model out Weight.

By taking identification model is CRF model as an example.Fig. 3 shows the name entity recognition method according to the disclosure one embodiment Flow chart.As shown in figure 3, step S11 is labeled text according to default mark collection and is instructed according to the training corpus Practicing corpus may include step S111 and step S112.

Step S111 carries out word segmentation processing to the text, obtains multiple participles.

The participle for belonging to name entity is labeled as phase according to the notation methods that the default mark collection includes by step S112 The type answered, and the participle for belonging to non-name entity is labeled, obtain training corpus.

For each participle, it can determine whether the participle is noted as name entity in the text；If the participle is marked Note is name entity, then can determine that the participle belongs to name entity, can be true if the participle is noted as non-name entity The fixed participle belongs to non-name entity.

When participle belongs to name entity, the type of the corresponding name entity of the participle can be determined, according to default mark The notation methods of the name entity of the type that collection includes, mark the participle.

When participle is not belonging to name entity, the mark side for the non-name entity for including can be concentrated according to default mark Formula marks the participle.For example, being labeled as O.

In one example, the name entity in text " carried out gallbladder before 4 years and assist resection " is known, " gallbladder removal Art " is the name entity for treating type, collects " cholecystectomy " according to default mark and is noted as treatment, forms training Corpus.Word segmentation processing is carried out to training corpus, obtains following participle: " 4 ", " year ", " preceding ", " progress ", " gall-bladder " and " excision Art ".Participle is matched with the name entity " cholecystectomy " for being noted as treatment, is obtained: segment " gall-bladder " and " resection " belongs to the name entity of disease type, and participle " gall-bladder " is first participle of the name entity, participle " excision Art " is the last one participle of the name entity.Therefore, " gall-bladder " can be labeled as to treatment-B, " resection " is marked Note is treatment-E.It can be these participle addition O marks, to all points in text for other participles in text After the completion of word mark, available training corpus.

In one possible implementation, according to the training corpus, the preset multiple identification models of training include: root According to the training corpus, the training CRF model.On this basis, Fig. 4 shows the step S12 according to one embodiment of the disclosure Flow chart.As shown in figure 4, step S12 can include:

Step S121 obtains the part of speech feature of each participle in the training corpus.

The training corpus tissue is matrix form, wherein first is classified as participle, middle column in matrix by step S122 For the part of speech feature of each participle, last is classified as the annotation results of each participle, and respectively segments in first row according to each participle Position in the training corpus is ranked up.

Step S123 determines that the CRF template file that training process uses, the CRF template file are made of multiple template, Each template is for specifying when extracting contextual information for current participle, and extracted contextual information is relative to current point The absolute position of the line displacement of word and extracted contextual information column, the contextual information include the upper of participle The contextual feature of following traits and part of speech.

Step S124 generates CRF model according to the matrix and the CRF template file.

The part of speech feature of participle can be used to indicate that the part of speech of participle.By analyzing medical electronics medical record data, discovery is very More disease names are made of the connection of multiple nouns, i.e., the name entity of disease type is usually the participle of noun by multiple parts of speech Composition.Referring to " the Chinese electronic health record standard of word segmentation ", part of speech mainly includes noun (n), verb (v), preposition (p), numeral-classifier compound (m), symbol (x) and adverbial word (d) etc..

Table 2 shows the example of part of speech feature.As shown in table 2, the name entity " abdominal mass " of disease type is by two words Property for noun participle " abdomen " and " mass " form.It should be noted that nr indicates proper noun, and proper noun can in table 2 Using as the specified participle in contextual feature.

Table 2

Word	Part of speech	Mark
			Cause	p	O
It was found that	v	O
			Abdomen	n	disease-B
Mass	n	disease-E
			1	m	O
Year	m	O
			It is remaining	m	O
It is admitted to hospital	n	O
			。	x	O
Physical examination	nr	O
			:	x	O
Cardiopulmonary	n	test-B
			Auscultation	v	test-E
Nothing	v	O
			It is abnormal	d	O

It can be matrix form by training corpus tissue, wherein first be classified as participle in matrix, centre is classified as each participle Part of speech feature, last is classified as the annotation results of each participle, and respectively segmented in first row according to it is each participle in the training Position in corpus is ranked up.As shown in table 2, first row is that text " is admitted to hospital for abdominal mass more than 1 year because of discovery.Physical examination: cardiopulmonary Auscultate without exception " each participle for including, secondary series is the part of speech feature of each participle, and third column are the mark knots of each participle Fruit.Therefore, table 2 can indicate the matrix that training corpus is organized into, wherein first row shown in table 2 can be used as the of matrix One arranges, and secondary series shown in table 2 can be used as the secondary series of matrix, and the column of third shown in table 2 can be used as the third column of matrix.

Wherein, contextual information includes the contextual feature of participle and the contextual feature of part of speech.

The contextual feature of the part of speech of participle indicates the feature of the context of participle.For example, the symptom of disease generally occurs within After verb, i.e. the name entity of symptom type typically occurs in after the participle that part of speech is verb.For example, such as table 2 Shown, the name entity " heart and lung auscultation " of inspect-type appears in after the participle " physical examination " that part of speech is verb.

The contextual feature of participle can be used to indicate that the feature for the participle that the context of participle includes.The context of participle Feature can be generated by the feature templates used when training CRF model.It can be in the feature templates used in training CRF model It is specified participle comprising the participle before or after specified participle.For example, " physical examination: " can be used as the upper of " heart and lung auscultation " Following traits, " physical examination: " back very maximum probability will appear the name entity of inspect-type.The character modules used in CRF model May include in plate connected participle " physical examination " and segment ": " after participle be inspect-type name entity.

CRF template file is made of multiple template, and each template extracts context for current participle for specifying how Information.In CRF algorithm, when extracting contextual information for current participle, extracted contextual information is relative to current The absolute position of the line displacement of participle and extracted contextual information column.

Contextual information can be expressed as, wherein the initial position of row and column is all 0.Table 3 shows CRF template file One example.Every a line in CRF template file shown in table 3 is a template, and each template is come by %x [row, col] The specified contextual information extracted.Wherein, row indicates that the relative displacement of row, col indicate the absolute position of column.Wherein, row When for negative, biased forwards being indicated, indicating to deviate backward when being positive number, current participle is indicated when being 0.For example, referring to table 2 and table 3, it is illustrated so that current participle is " abdomen " as an example.In table 3 %x [- 2,0] be in table 2 row it is opposite with " abdomen " offset be- 2 row is classified as the content of the 0th column, i.e. participle " because ".%x [0,1] is to go to be in table 2 relative to the offset of " abdomen " in table 3 0 row is classified as the content of the 1st column, that is, segments the part of speech feature " n " of " abdomen ".Table 4 is shown, when current participle is " abdomen ", table The corresponding contextual information of each template shown in 3.

Table 3

#Unigram
	U01:%x [- 2,0]
U02:%x [- 1,0]
	U03:%x [0,0]
U04:%x [1,0]
	U05:%x [2,0]
U06:%x [- 1,0]/%x [0,0]
	U07:%x [0,0]/%x [1,0]
U08:%x [- 1,0]/%x [0,0]/%x [1,0]
	U09:%x [- 2,0]/%x [- 1,0]/%x [0,0]
U10:%x [0,0]/%x [1,0]/%x [2,0]
	U11:%x [0,1]
U12:%x [0,0]/%x [0,1]
	#Bigram
B

Table 4

Corresponding each template, CRF algorithm can generate a series of characteristic function set, the feelings to response training corpus Condition, and then generate corresponding CRF model.Characteristic function includes transfer function and function of state in CRF algorithm, wherein transfer function To include the characteristic function before and after current location, function of state be include characteristic function on current location.In general, feature Function value is 1 or 1.

In one example, table 1 shows 21 kinds of annotation results, according to template U01%x [- 2,0] available following shape State characteristic function:

1 else of Func1=if (output=' disease-B ' and feature=' U01: because ') return return 0

1 else of Func2=if (output=' disease-I ' and feature=' U01: because ') return return 0

1 else of Func3=if (output=' disease-E ' and feature=' U01: because ') return return 0

……

1 else return 0 of Func21=if (output=' O ' and feature=' U01: because ') return

Wherein, output is the observation label of current participle " abdomen ", and each template can be defeated all possible label It all lists out, the weight of each characteristic function is then determined by training.Time that reasonable label occurs in training sample Number is more, and the weight of corresponding characteristic function is higher, otherwise fewer.

According to the matrix and the CRF template file, CRF model can be generated.Matrix and CRF template file are inputted After CRF algorithm, CRF algorithm can train automatically and obtain CRF model.

In one example, order " crf_learn template train.txt model " training CRF can be passed through Model.Wherein, template representing matrix, train.txt indicate CRF template file.The order can generate model after having executed File, the model file are CRF model.

By taking BILSTM_CRF model or IDCNN_CRF model as an example.Fig. 5 shows the name according to one embodiment of the disclosure The flow chart of entity recognition method.As shown in figure 5, step S11 collects according to default mark for text according to the training corpus It is labeled to obtain training corpus to may include step S113 and step S114.

The text is divided into multiple characters by step S113.

Step S114, the notation methods for including are integrated according to the default mark will belong to the character label of name entity as phase The type answered, and the character for belonging to non-name entity is labeled, obtain training corpus.

For each character, it can determine whether the character is noted as name entity in training corpus；If the character It is noted as name entity, then can determine that the character belongs to name entity；If the character is noted as non-name entity, can To determine that the character belongs to non-name entity.

When character belongs to name entity, the type of the corresponding name entity of the character can be determined, according to default mark The notation methods of the name entity of the type that collection includes, mark the character.

When character is not belonging to name entity, the mark side for the non-name entity for including can be concentrated according to default mark Formula marks the character.For example, being labeled as O.

In one example, the name entity in text " carried out gallbladder before 4 years and assist resection " is known, " gallbladder removal Art " is the name entity for treating type, collects " cholecystectomy " according to default mark and is noted as treatment, forms training Corpus.Training corpus is split as character, obtains character " 4 ", " year ", " preceding ", " into ", " row ", " gallbladder ", " assisting ", " cutting ", " removing " " art ".Character is matched with the name entity " cholecystectomy " for being noted as treatment, wherein character " gallbladder ", " assisting ", " cutting ", " removing " and " art " belongs to the name entity of disease type, and character " gallbladder " is the first character of the name entity, Character " assisting ", " cutting " and " removing " is the character of the centre of the name entity, and character " art " is the last character of the name entity Symbol.Therefore, it can be labeled as treatment-B for character " gallbladder ", character " assisting ", " cutting " and " removing " is respectively labeled as Treatment-I, character " art " are labeled as treatment-E.For other participles in training corpus, server can be this A little participles are labeled as O, after the completion of character labels all in text, available training corpus.

In one possible implementation, according to the training corpus, the preset multiple identification models of training include: root According to the training corpus, the training BILSTM_CRF model or the IDCNN_CRF model.On this basis, Fig. 6 is shown According to the flow chart of the step S12 of one embodiment of the disclosure.As shown in fig. 6, step S12 can include:

Step S125 obtains the character feature vector of each character in the training corpus, the character feature vector packet Include word vector sum Character segmentation vector.

Step S126 is tied according to the corresponding mark of each character of character feature vector sum of character each in the training corpus Fruit, the training BILSTM_CRF model or the IDCNN_CRF model.

Character feature vector can indicate the feature vector of character, and character feature vector may include word vector sum Character segmentation Vector.

Wherein, word vector is the vector that can indicate the feature of character, and word vector can represent one per one-dimensional value With feature that is certain semantic and grammatically explaining.Wherein, feature can be used for the fundamental of word (such as radical, radical, Stroke, meaning etc.) the various information that are characterized.Can according to external word vector table (such as pass through word2vec training complete Word vector table) to each character carry out id coding, obtain the word id of each character.It, can be outer using word id matching when training pattern Portion's word vector table, obtains word vector.

Character segmentation vector can be used for identifying the character for belonging to the same participle.Training corpus is being split into character In the process, numeric coding, such as " Kazakhstan " " that " " shore " three character categories can be carried out to the character for belonging to the same name entity In a name entity, numeric coding can be expressed as " 1 " " 2 " " 3 ", and " north " " capital " two characters belong to a name entity, Numeric coding can be expressed as " 1 " " 3 ", and " Hubei Province " " that " " more " " this " four characters belong to a name entity, and numeric coding can To be expressed as " 1 " " 2 " " 2 " " 3 ".During training corpus is split into character, to the character for being not belonging to name entity, number Value coding can be labeled as 0.For example, the numeric coding of " I is Harbin people " can be expressed as " 001230 ".It can be right After each character code of training corpus, for each character, random initializtion can be carried out to the character, generate the first of the character Beginning vector.It in training pattern, is continued to optimize by back-propagation algorithm, finally trains an optimal character for the character Split vector.

Table 5 shows the example of word id and numeric coding.Table 6 shows the word vector sum character point obtained on the basis of table 5 Cut the example of vector.Wherein, the dimension of word vector is 100, and the dimension of Character segmentation vector is 20.

Table 5

Training corpus	[because sending out, existing, abdomen, portion, packet, block ... ...]
		Word id	[230,16,511,14,1052,363 ... ...]
Numeric coding	[0,0,0,1,2,2,3 ... ...]

Table 6

Training corpus	Word vector	Character segmentation vector
			Cause	[0.5,0.6 ... 0.25,0.5]_1*100	[0.5,0.6 ... 0.25]_1*20
Hair	[0.7,0.8 ... 0.32,0.3]_1*100	[0.5,0.6 ... 0.25]_1*20
			It is existing	[0.9,0.5 ... 0.72,0.8]_1*100	[0.5,0.6 ... 0.25]_1*20
Abdomen	[0.9,0.8 ... 0.75,0.9]_1*100	[0.9,0.8 ... 0.75]_1*20
			Portion	[0.2,0.6 ... 0.22,0.6]_1*100	[0.2,0.6 ... 0.22]_1*20
Packet	[0.5,0.3 ... 0.52,0.7]_1*100	[0.2,0.6 ... 0.22]_1*20
			Block	[0.4,0.7 ... 0.23,0.9]_1*100	[0.4,0.7 ... 0.23]_1*20
……	……	…

According to the corresponding annotation results of each character of character feature vector sum of character each in the training corpus, described in training BILSTM_CRF model or the IDCNN_CRF model.Wherein, the character feature vector of each character can be used as The input of BILSTM_CRF model or IDCNN_CRF model, the annotation results of each character can be used for supervising BILSTM_CRF The output result of model and IDCNN_CRF model.

With the multiple identification model include CRF model and/or BILSTM_CRF model and/or IDCNN_CRF model is Example.

Wherein, text to be identified is identified using the CRF model, obtains recognition result can include: to the text to be identified This progress word segmentation processing obtains multiple participles；Determine the part of speech feature of each participle；The multiple of the part of speech feature will be carried The segmentation sequence of participle composition inputs in the CRF model, obtains the recognition result of the CRF model.

Text to be identified is identified using the BILSTM_CRF model or the IDCNN_CRF model, obtains recognition result Can include: the text to be identified is split into character, obtains multiple characters；Determine the character feature vector of each character；It will The sequence of the character feature vector composition of multiple characters is inputted respectively in BILSTM_CRF model or IDCNN_CRF model, is obtained The recognition result of BILSTM_CRF model or the recognition result of IDCNN_CRF model.

Text to be identified can be segmented, and determine the part of speech feature of each participle, point of part of speech feature will be carried The segmentation sequence of word composition is input in the CRF model of training completion, obtains the recognition result of CRF model.

The text to be identified can be split into character, the character feature vector of each character be obtained, by each character The sequence of character feature vector composition is separately input in the BILSTM_CRF model and IDCNN_CRF model of training completion, is obtained To the recognition result of BILSTM_CRF model and the recognition result of IDCNN_CRF model.

Table 7 shows an example of recognition result.As shown in table 6, for some character in text to be identified, three kinds of moulds The probability that type determines that the character belongs to name entity is respectively 0.7,0.8 and 0.2, with 0.5 for for classification thresholds, three kinds of models Predict that classification is respectively 1,1,0.Wherein, it 1 indicates, which belongs to name entity, and 0 indicates that the character is not belonging to name entity. After being weighted, the probability for determining that the character belongs to name entity is 0.45, can determine that final recognition result is 0, the i.e. word Symbol is not belonging to name entity.

Table 7

Fig. 7 shows the block diagram of the name entity recognition device according to one embodiment of the disclosure.As shown in fig. 7, the device 30 May include:

Labeling module 31, for being labeled to obtain training corpus, the pre- bidding for text according to default mark collection The notation methods of type of the note collection comprising name entity and notation methods and non-name entity；

Training module 32, for according to the training corpus, the preset multiple identification models of training；

Identification module 33 obtains each identification mould for identifying text to be identified respectively using the multiple identification model The recognition result of type；

Weighting block 34 obtains described for the recognition result of weight and each identification model based on each identification model The recognition result of text to be identified.

In one possible implementation, the labeling module 31 is particularly used in:

When the identification model is CRF model, word segmentation processing is carried out to the text, obtains multiple participles；

The participle for belonging to name entity is labeled as corresponding type according to the notation methods that the default mark collection includes, And the participle for belonging to non-name entity is labeled, obtain training corpus.

Fig. 8 shows the block diagram of the name entity recognition device according to one embodiment of the disclosure.As shown in figure 8, one kind can In the implementation of energy, the training module 32 can include:

First training submodule 321, for according to the training corpus, the training CRF model；

The first training submodule 321 is particularly used in:

Obtain the part of speech feature of each participle in the training corpus；

It is matrix form by the training corpus tissue, wherein first be classified as participle in matrix, centre is classified as each participle Part of speech feature, last is classified as the annotation results of each participle, and respectively segmented in first row according to it is each participle in the training Position in corpus is ranked up；

Determine that the CRF template file that training process uses, the CRF template file are made of multiple template, each template For specifying when extracting contextual information for current participle, extracted contextual information is inclined relative to the row currently segmented It moves and the absolute position of extracted contextual information column, the contextual information includes the contextual feature of participle With the contextual feature of part of speech；

According to the matrix and the CRF template file, CRF model is generated.

In one possible implementation, the labeling module 31 is particularly used in:

When the identification model is BILSTM_CRF model or IDCNN_CRF model, the text is divided into multiple Character；

The character label for belonging to name entity is corresponding type by the notation methods for including according to the default mark collection, And the character for belonging to non-name entity is labeled, obtain training corpus.

In one possible implementation, the training module 32 can include:

Second training submodule 322, for according to the training corpus, the training BILSTM_CRF model or described IDCNN_CRF model；

The second training submodule 322 is particularly used in:

The character feature vector of each character in the training corpus is obtained, the character feature vector includes word vector sum Character segmentation vector；

According to the corresponding annotation results of each character of character feature vector sum of character each in the training corpus, described in training BILSTM_CRF model or the IDCNN_CRF model.

In one possible implementation, the multiple identification model includes CRF model and/or BILSTM_CRF model And/or the IDCNN_CRF model；The identification module 33 can include:

First identification submodule 331 obtains recognition result for identifying text to be identified using the CRF model；

Second identification submodule 332, for using the BILSTM_CRF model or the IDCNN_CRF model identification to It identifies text, obtains recognition result；

Wherein, the first identification submodule 331 is particularly used in:

Word segmentation processing is carried out to the text to be identified, obtains multiple participles；

Determine the part of speech feature of each participle；

The segmentation sequence for the multiple participles composition for carrying the part of speech feature is inputted in the CRF model, institute is obtained State the recognition result of CRF model；

The second identification submodule 332 is particularly used in:

The text to be identified is split into character, obtains multiple characters；

Determine the character feature vector of each character；

The sequence that the character feature vector of multiple characters forms is inputted into BILSTM_CRF model or IDCNN_CRF mould respectively In type, the recognition result of BILSTM_CRF model or the recognition result of IDCNN_CRF model are obtained.

Fig. 9 is shown according to an exemplary embodiment a kind of for naming the block diagram of entity recognition device 900.Referring to figure 9, which may include processor 901, the machine readable storage medium 902 for being stored with machine-executable instruction.Processor 901 can communicate with machine readable storage medium 902 via system bus 903.Also, processor 901 is readable by read machine Machine-executable instruction corresponding with name Entity recognition logic is known in storage medium 902 with executing name entity described above Other method.

Machine readable storage medium 902 referred to herein can be any electronics, magnetism, optics or other physical stores Device may include or store information, such as executable instruction, data, etc..For example, machine readable storage medium may is that RAM (Radom Access Memory, random access memory), volatile memory, nonvolatile memory, flash memory, storage are driven Dynamic device (such as hard disk drive), solid state hard disk, any kind of storage dish (such as CD, dvd) or similar storage are situated between Matter or their combination.

The presently disclosed embodiments is described above, above description is exemplary, and non-exclusive, and It is not limited to disclosed each embodiment.Without departing from the scope and spirit of illustrated each embodiment, for this skill Many modifications and changes are obvious for the those of ordinary skill in art field.The selection of term used herein, purport In the principle, practical application or technological improvement to the technology in market for best explaining each embodiment, or lead this technology Other those of ordinary skill in domain can understand each embodiment disclosed herein.

Claims

1. a kind of name entity recognition method, which is characterized in that the described method includes:

Text is labeled according to default mark collection to obtain training corpus, the default class of the mark collection comprising name entity The notation methods of type and notation methods and non-name entity；

According to the training corpus, the preset multiple identification models of training；

Text to be identified is identified using the multiple identification model respectively, obtains the recognition result of each identification model；

The recognition result of weight and each identification model based on each identification model, obtains the identification knot of the text to be identified Fruit.

2. the method according to claim 1, wherein the basis is default when the identification model is CRF model Mark collection is labeled text to obtain training corpus, comprising:

Word segmentation processing is carried out to the text, obtains multiple participles；

The participle for belonging to name entity is labeled as corresponding type according to the notation methods that the default mark collection includes, and The participle for belonging to non-name entity is labeled, training corpus is obtained.

3. according to the method described in claim 2, it is characterized in that, training preset multiple identifications according to the training corpus Model includes:

According to the training corpus, the training CRF model, comprising:

Obtain the part of speech feature of each participle in the training corpus；

It is matrix form by the training corpus tissue, wherein first is classified as participle, the intermediate word for being classified as each participle in matrix Property feature, last is classified as the annotation results of each participle, and respectively segmented in first row according to it is each participle in the training corpus In position be ranked up；

Determine that the CRF template file that training process uses, the CRF template file are made of multiple template, each template is used for Specify when extracting contextual information for current participle, extracted contextual information relative to the line displacement currently segmented, And the absolute position of extracted contextual information column, the contextual information include the contextual feature and word of participle The contextual feature of property；

According to the matrix and the CRF template file, CRF model is generated.

4. the method according to claim 1, wherein the identification model be BILSTM_CRF model or When IDCNN_CRF model, the default mark collection of the basis is labeled text to obtain training corpus, comprising:

The text is divided into multiple characters；

The character label for belonging to name entity is corresponding type by the notation methods for including according to the default mark collection, and The character for belonging to non-name entity is labeled, training corpus is obtained.

5. according to the method described in claim 4, it is characterized in that, training preset multiple identifications according to the training corpus Model includes:

According to the training corpus, the training BILSTM_CRF model or the IDCNN_CRF model, comprising:

The character feature vector of each character in the training corpus is obtained, the character feature vector includes word vector sum character Split vector；

6. the method according to claim 3 or 5, which is characterized in that the multiple identification model include CRF model and/or BILSTM_CRF model and/or IDCNN_CRF model；

Text to be identified is identified using the CRF model, obtaining recognition result includes:

Determine the part of speech feature of each participle；

The segmentation sequence for the multiple participles composition for carrying the part of speech feature is inputted in the CRF model, the CRF is obtained The recognition result of model；

Text to be identified is identified using the BILSTM_CRF model or the IDCNN_CRF model, obtaining recognition result includes:

Determine the character feature vector of each character；

The sequence that the character feature vector of multiple characters forms is inputted into BILSTM_CRF model or IDCNN_CRF model respectively In, obtain the recognition result of BILSTM_CRF model or the recognition result of IDCNN_CRF model.

7. a kind of name entity recognition device, which is characterized in that described device includes:

Labeling module, for being labeled to obtain training corpus, the default mark Ji Bao for text according to default mark collection The notation methods of the type and notation methods of the entity containing name and non-name entity；

Training module, for according to the training corpus, the preset multiple identification models of training；

Identification module obtains the knowledge of each identification model for identifying text to be identified respectively using the multiple identification model Other result；

Weighting block obtains described to be identified for the recognition result of weight and each identification model based on each identification model The recognition result of text.

8. device according to claim 7, which is characterized in that the labeling module is specifically used for:

9. device according to claim 8, which is characterized in that the training module includes:

First training submodule, for according to the training corpus, the training CRF model；

The first training submodule is specifically used for:

Obtain the part of speech feature of each participle in the training corpus；

According to the matrix and the CRF template file, CRF model is generated.

10. device according to claim 7, which is characterized in that the labeling module is specifically used for:

When the identification model is BILSTM_CRF model or IDCNN_CRF model, the text is divided into multiple characters；

11. device according to claim 10, which is characterized in that the training module includes:

Second training submodule, for according to the training corpus, the training BILSTM_CRF model or the IDCNN_ CRF model；

The second training submodule is specifically used for:

12. the device according to claim 9 or 11, which is characterized in that the multiple identification model include CRF model and/ Or BILSTM_CRF model and/or IDCNN_CRF model；The identification module includes:

First identification submodule obtains recognition result for identifying text to be identified using the CRF model；

Second identification submodule, for identifying text to be identified using the BILSTM_CRF model or the IDCNN_CRF model This, obtains recognition result；

Wherein, the first identification submodule is specifically used for:

Determine the part of speech feature of each participle；

The second identification submodule is specifically used for:

Determine the character feature vector of each character；