CN107797992A - Named entity recognition method and device - Google Patents
Named entity recognition method and device
- Publication number
- CN107797992A (application CN201711102742.5A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Abstract
An embodiment of the present application provides a named entity recognition method and device. The method includes: obtaining an input sequence; vectorizing the characters in the input sequence to obtain a character vector sequence corresponding to the input sequence; processing the character vector sequence with a neural network algorithm to obtain a text feature sequence for the input sequence; and processing the text feature sequence with a conditional random field to obtain the named entity recognition result corresponding to the input sequence. Because characters capture finer-grained features than words and the number of distinct characters is far smaller than the number of distinct words, because the neural network algorithm can take the context of each character in the input sequence into account, and because the conditional random field avoids the label bias problem, the technical scheme combines character vectors, a neural network algorithm, and a conditional random field to achieve a good named entity recognition effect.
Description
Technical field
The present application relates to the field of computer technology, and in particular to a named entity recognition method and device.
Background art
Natural language processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics concerned with the interaction between computers and human language, and it is an important direction within computer science and artificial intelligence. NLP research covers the theories and methods that enable effective communication between people and computers in natural language; the fields involved include natural language understanding, retrieval, information extraction, machine translation, automatic question answering systems, and so on.
As a basic task in NLP, named entity recognition (NER) refers to the technology of identifying entities of particular categories, such as person names, place names, organization names, and proper nouns, from text. NER is an important foundational tool for application fields such as information retrieval, query classification, question answering systems, syntactic analysis, and machine translation, and its recognition quality directly affects downstream processing in those fields. Providing a named entity recognition technology with a better recognition effect has therefore become a technical problem that those skilled in the art urgently need to solve.
Summary of the invention
The purpose of the embodiments of the present application is to provide a named entity recognition method and device, so as to achieve a better named entity recognition effect.
To reach the above technical purpose, the embodiments of the present application are realized as follows.
According to a first aspect of the embodiments of the present application, a named entity recognition method is provided. The method includes:
obtaining an input sequence;
vectorizing the characters in the input sequence to obtain a character vector sequence corresponding to the input sequence;
processing the character vector sequence with a neural network algorithm to obtain a text feature sequence for the input sequence; and
processing the text feature sequence with a conditional random field to obtain the named entity recognition result corresponding to the input sequence.
In one embodiment of the present application, processing the character vector sequence with a neural network algorithm to obtain the text feature sequence of the input sequence includes:
processing the character vector sequence with a bidirectional long short-term memory (Bi-LSTM) network to obtain the text feature sequence of the input sequence.
In one embodiment of the present application, vectorizing the characters in the input sequence to obtain the character vector sequence corresponding to the input sequence includes:
obtaining a character-to-vector mapping dictionary, the dictionary recording correspondences between characters and vectors;
looking up, in the character-to-vector mapping dictionary, the vector corresponding to each character in the input sequence;
processing the looked-up vectors with an attention mechanism to obtain a weight value corresponding to each vector; and
multiplying each vector by its corresponding weight value to obtain the character vector sequence corresponding to the input sequence.
In one embodiment of the present application, generating the character-to-vector mapping dictionary includes:
obtaining a training corpus;
splitting the training corpus character by character to obtain a split result;
applying at least one of the following pre-processing steps to the split result: filtering garbage characters, filtering stop characters, filtering low-frequency characters, and filtering meaningless symbols, to obtain a pre-processed result; and
training on the pre-processed result with the word2vec algorithm to obtain the character-to-vector mapping dictionary.
In one embodiment of the present application, training on the pre-processed result with the word2vec algorithm to obtain the character-to-vector mapping dictionary includes:
training on the pre-processed result with the skip-gram model to obtain the character-to-vector mapping dictionary.
According to a second aspect of the embodiments of the present application, a named entity recognition device is provided. The device includes:
an acquiring unit for obtaining an input sequence;
a first processing unit for vectorizing the characters in the input sequence to obtain the character vector sequence corresponding to the input sequence;
a second processing unit for processing the character vector sequence with a neural network algorithm to obtain the text feature sequence of the input sequence; and
a third processing unit for processing the text feature sequence with a conditional random field to obtain the named entity recognition result corresponding to the input sequence.
In one embodiment of the present application, the second processing unit includes:
a character vector sequence processing subunit for processing the character vector sequence with a bidirectional long short-term memory network to obtain the text feature sequence of the input sequence.
In one embodiment of the present application, the first processing unit includes:
a mapping dictionary acquiring subunit for obtaining a character-to-vector mapping dictionary that records correspondences between characters and vectors;
a lookup subunit for looking up, in the character-to-vector mapping dictionary, the vector corresponding to each character in the input sequence;
an attention mechanism processing subunit for processing the looked-up vectors with an attention mechanism to obtain a weight value corresponding to each vector; and
a character vector sequence obtaining subunit for multiplying each vector by its corresponding weight value to obtain the character vector sequence corresponding to the input sequence.
In one embodiment of the present application, the device further includes a mapping dictionary generation unit, which includes:
a training corpus acquiring subunit for obtaining a training corpus;
a character splitting subunit for splitting the training corpus character by character to obtain a split result;
a pre-processing subunit for applying at least one of the following pre-processing steps to the split result: filtering garbage characters, filtering stop characters, filtering low-frequency characters, and filtering meaningless symbols, to obtain a pre-processed result; and
a mapping dictionary training subunit for training on the pre-processed result with the word2vec algorithm to obtain the character-to-vector mapping dictionary.
In one embodiment of the present application, the mapping dictionary training subunit is specifically configured to train on the pre-processed result with the skip-gram model to obtain the character-to-vector mapping dictionary.
According to a third aspect of the embodiments of the present application, an electronic device is provided, including a processor and a memory arranged to store computer-executable instructions which, when executed, cause the processor to:
obtain an input sequence;
vectorize the characters in the input sequence to obtain a character vector sequence corresponding to the input sequence;
process the character vector sequence with a neural network algorithm to obtain the text feature sequence of the input sequence; and
process the text feature sequence with a conditional random field to obtain the named entity recognition result corresponding to the input sequence.
According to a fourth aspect of the embodiments of the present application, a computer storage medium is provided, storing one or more programs which, when executed by an electronic device including multiple application programs, cause the electronic device to:
obtain an input sequence;
vectorize the characters in the input sequence to obtain a character vector sequence corresponding to the input sequence;
process the character vector sequence with a neural network algorithm to obtain the text feature sequence of the input sequence; and
process the text feature sequence with a conditional random field to obtain the named entity recognition result corresponding to the input sequence.
As can be seen from the technical scheme above, the embodiments of the present application convert each character in the input sequence to be recognized into a corresponding vector, process the vectors with a neural network algorithm to extract the text feature sequence of the input sequence, and finally process the text feature sequence with a conditional random field to obtain the named entity recognition result corresponding to the input sequence. Because characters capture finer-grained features than words and the number of distinct characters is far smaller than the number of distinct words, because the neural network algorithm can take the context of each character in the input sequence into account, and because the conditional random field avoids the label bias problem, the embodiments of the present application combine character vectors, a neural network algorithm, and a conditional random field to achieve a good named entity recognition effect.
Brief description of the drawings
To describe the technical schemes of the embodiments of the present application or of the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments described in this specification; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.
Fig. 1 is a schematic diagram of the CBOW model provided by the present application;
Fig. 2 is a schematic diagram of the skip-gram model provided by the present application;
Fig. 3 is a flow chart of a named entity recognition method according to an embodiment of the present application;
Fig. 4 is a flow chart of a character-to-vector mapping dictionary generation method according to an embodiment of the present application;
Fig. 5 is a structural schematic diagram of a named entity recognition device according to an embodiment of the present application;
Fig. 6 is a structural schematic diagram of an electronic device according to an embodiment of the present application.
Embodiment
In order that those skilled in the art more fully understand the technical scheme in this specification, below in conjunction with the application
Accompanying drawing in embodiment, the technical scheme in the embodiment of the present application is clearly and completely described, it is clear that described reality
It is only this specification part of the embodiment to apply example, rather than whole embodiments.Based on the embodiment in this specification, ability
The every other embodiment that domain those of ordinary skill is obtained under the premise of creative work is not made, should all belong to this theory
The scope of bright book protection.
The embodiments of the present application provide a named entity recognition method and device.
To facilitate understanding, some of the technical terms and concepts involved in the embodiments of the present application are introduced first.
Named entity recognition (NER) refers to the technology of identifying entities of particular categories, such as person names, place names, organization names, and proper nouns, from text. NER is in essence a sequence labeling problem: given an input text sequence, a tag is assigned to each character (or word).
The tags can be defined as follows. Taking person names (PER), place names (LOC), and organization names (ORG) as the categories to be recognized, the input text "张三来自西安，毕业于北京大学" ("Zhang San comes from Xi'an and graduated from Peking University") is labeled character by character as:
张/B-PER 三/I-PER 来/O 自/O 西/B-LOC 安/I-LOC ，/O 毕/O 业/O 于/O 北/B-ORG 京/I-ORG 大/I-ORG 学/I-ORG
After parsing, the NER result is:
张三/PER (Zhang San) comes from 西安/LOC (Xi'an) and graduated from 北京大学/ORG (Peking University). The meanings of the tags are given in Table 1 below:
Tag | Meaning
B-PER | Beginning character of a person name
I-PER | Middle or ending character of a person name
B-LOC | Beginning character of a place name
I-LOC | Middle or ending character of a place name
B-ORG | Beginning character of an organization name
I-ORG | Middle or ending character of an organization name
O | Other characters
Table 1
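To make the tagging scheme above concrete, here is a minimal sketch of how a per-character tag sequence like the one in Table 1 can be parsed back into entities. The function name and grouping logic are illustrative, not taken from the patent:

```python
def parse_bio(chars, tags):
    """Group per-character B-/I-/O tags into (entity_text, type) pairs."""
    entities, current, etype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):            # a new entity begins here
            if current:
                entities.append(("".join(current), etype))
            current, etype = [ch], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == etype:
            current.append(ch)              # continue the current entity
        else:                               # "O" or an inconsistent tag ends it
            if current:
                entities.append(("".join(current), etype))
            current, etype = [], None
    if current:
        entities.append(("".join(current), etype))
    return entities

chars = list("张三来自西安，毕业于北京大学")
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O",
        "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "I-ORG"]
print(parse_bio(chars, tags))
# → [('张三', 'PER'), ('西安', 'LOC'), ('北京大学', 'ORG')]
```

This is the "further parsing" step: the tag sequence produced by the model is deterministic to decode, so recognition quality depends entirely on how well the tags are predicted.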
The word2vec algorithm, developed by Google, turns each word into a vector of several hundred dimensions through unsupervised training. These vectors capture the semantic correlations between words (or characters) and are also known as word vectors or word embeddings.
The skip-gram model is one variant of the word2vec algorithm. It predicts the surrounding words from the current word and is particularly suited to prediction on large data sets. As shown in Fig. 2, the word w(t) is used to predict the surrounding words w(t-2), w(t-1), w(t+1), and w(t+2).
The CBOW model is another variant of the word2vec algorithm. It predicts the center word from its context. As shown in Fig. 1, the words w(t-2), w(t-1), w(t+1), and w(t+2) around the word w(t) are used to predict w(t); the vectors of these context words are concatenated, which fully preserves the contextual information.
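The two model variants differ only in which side of each (input, target) training pair is the center word. A minimal sketch of generating such pairs from a character sequence with a window of 2 (a simplification of what word2vec does internally; the function names are illustrative):

```python
def skipgram_pairs(seq, window=2):
    """Skip-gram: predict each context character from the center character."""
    pairs = []
    for t, center in enumerate(seq):
        for k in range(-window, window + 1):
            if k != 0 and 0 <= t + k < len(seq):
                pairs.append((center, seq[t + k]))   # (input, target)
    return pairs

def cbow_pairs(seq, window=2):
    """CBOW: predict the center character from its surrounding context."""
    pairs = []
    for t, center in enumerate(seq):
        ctx = tuple(seq[t + k] for k in range(-window, window + 1)
                    if k != 0 and 0 <= t + k < len(seq))
        pairs.append((ctx, center))                  # (input context, target)
    return pairs

seq = list("北京大学")
print(skipgram_pairs(seq)[:3])  # → [('北', '京'), ('北', '大'), ('京', '北')]
```

A real word2vec trainer then fits the embedding vectors so that each input is predictive of its target; only the pair-generation step is shown here.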
The attention mechanism simulates the attention model of the human brain. When we read an article, our eyes in fact focus only on the words currently being looked at, and the brain mainly attends to those words; that is, the brain's attention over the whole article is not uniform but differently weighted. The attention mechanism brings a large improvement in sequence learning tasks: in an encoder-decoder framework, adding an attention model in the encoding stage applies a weighted transformation to the source data sequence, which can effectively improve system performance in the natural sequence-to-sequence setting.
A long short-term memory (LSTM) network is a kind of recurrent neural network suited to processing and predicting important events separated by relatively long intervals and delays in a time series. It controls the retention of historical information through a "memory gate" and a "forget gate", which effectively solves the long-range dependency problem of conventional recurrent neural networks.
The conditional random field (CRF) has in recent years been one of the most commonly used algorithms in natural language processing, applied to tasks such as syntactic analysis, named entity recognition, and part-of-speech tagging. A CRF is a probabilistic transition model that takes a Markov chain as the hidden variable and discriminates the hidden variables from observable conditions; it belongs to the class of discriminative models.
Fig. 3 is a flow chart of a named entity recognition method according to an embodiment of the present application. The method can be performed by a server side, which may include a server or a server cluster, or by a terminal device, which may include a smartphone, tablet computer, notebook or desktop computer, and the like. As shown in Fig. 3, the method may include the following steps:
In step 301, an input sequence is obtained.
In the embodiments of the present application, the input sequence can be a text sequence or a speech segment.
In step 302, the characters in the input sequence are vectorized to obtain the character vector sequence corresponding to the input sequence.
In the embodiments of the present application, when the input sequence is a text sequence, the input sequence is directly segmented into characters to obtain the character sequence (x1, x2, ..., xn) of the input sequence, where xi is the i-th character in the input sequence and n is the number of characters in the input sequence.
When the input sequence is a speech segment, the speech segment is first converted into the corresponding text sequence, and the text sequence is then segmented into characters to obtain the character sequence (x1, x2, ..., xn) of the text sequence, which is also the character sequence of the input sequence, where xi is the i-th character in the text sequence, n is the number of characters in the text sequence, and 1 ≤ i ≤ n.
In an optional embodiment, step 302 can include steps S31 and S32:
In S31, a character-to-vector mapping dictionary is obtained, the dictionary recording correspondences between characters and vectors.
In S32, the vector corresponding to each character in the input sequence is looked up in the character-to-vector mapping dictionary, and the sequence formed by the looked-up vectors is taken as the character vector sequence corresponding to the input sequence.
In this embodiment, after the character sequence (x1, x2, ..., xn) of the input sequence is obtained, the character-to-vector mapping dictionary is obtained, and the vector vi corresponding to each character xi is looked up in it; the sequence (v1, v2, ..., vn) formed by the looked-up vectors is taken as the character vector sequence (v'1, v'2, ..., v'n) corresponding to the input sequence, where v'i = vi.
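Under the scheme just described, steps S31 and S32 amount to a table lookup. A hedged sketch follows; the dictionary here is a toy stand-in for a trained character-to-vector mapping, and the dimensions, values, and unknown-character fallback are illustrative assumptions:

```python
char2vec = {          # toy character-to-vector mapping dictionary
    "北": [0.1, 0.9], "京": [0.2, 0.8],
    "大": [0.7, 0.3], "学": [0.6, 0.4],
}
UNK = [0.0, 0.0]      # assumed fallback for characters missing from the dictionary

def to_vector_sequence(text):
    """Look up the vector v_i for each character x_i of the input sequence."""
    return [char2vec.get(ch, UNK) for ch in text]

print(to_vector_sequence("北京"))  # → [[0.1, 0.9], [0.2, 0.8]]
```

A dictionary trained on a large corpus would map each character to a vector of several hundred dimensions; the lookup itself is unchanged.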
In a preferred embodiment, step 302 can include steps S33 through S36:
In S33, a character-to-vector mapping dictionary is obtained, the dictionary recording correspondences between characters and vectors.
In S34, the vector corresponding to each character in the input sequence is looked up in the character-to-vector mapping dictionary.
In S35, the looked-up vectors are processed with an attention mechanism to obtain the weight value corresponding to each vector.
It should be noted that the weight value of a vector reflects that vector's importance: the larger the weight value, the more important the vector.
In this embodiment, the attention mechanism can be realized with a Bi-LSTM (bidirectional long short-term memory recurrent neural network).
Concretely, after the character sequence (x1, x2, ..., xn) of the input sequence is obtained, the character-to-vector mapping dictionary is obtained, and the vector vi corresponding to each character xi is looked up in it; the sequence (v1, v2, ..., vn) formed by the looked-up vectors is then input into the attention model, which outputs (at1, at2, ..., atn), where ati is the weight value corresponding to vi.
In S36, each vector is multiplied by its corresponding weight value to obtain the character vector sequence corresponding to the input sequence.
After the vector vi corresponding to each character xi and its weight value ati are obtained, vi * ati is computed, and (v1*at1, v2*at2, ..., vn*atn) is taken as the character vector sequence (v'1, v'2, ..., v'n) corresponding to the input sequence, where v'i = vi * ati.
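The weighting step S36 is then a per-vector scaling. A minimal sketch, where the weights are fixed stand-ins for the output of an attention model (in the text above they would come from a Bi-LSTM scorer):

```python
def apply_attention(vectors, weights):
    """Scale each character vector v_i element-wise by its weight at_i."""
    assert len(vectors) == len(weights)
    return [[x * w for x in vec] for vec, w in zip(vectors, weights)]

vecs = [[0.1, 0.9], [0.2, 0.8]]   # looked-up character vectors
ats = [0.5, 2.0]                  # illustrative attention weights at_1, at_2
print(apply_attention(vecs, ats))  # → [[0.05, 0.45], [0.4, 1.6]]
```

The second character's vector is amplified and the first's attenuated, so downstream layers see the more important character more strongly.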
In the embodiments of the present application, the character-to-vector mapping dictionary can be generated in advance. Since the word2vec algorithm can turn each character into a vector in a low-dimensional space (typically a few hundred dimensions), so that the semantic correlation between characters can be approximately described by the distance between their vectors, the training corpus can be trained with the word2vec algorithm to generate the character-to-vector mapping dictionary. Fig. 4 shows a flow chart of a character-to-vector mapping dictionary generation method based on the skip-gram model, which may include the following steps:
In S401, a training corpus is obtained.
In the embodiments of the present application, the training corpus includes multiple sentences.
Since word2vec is an unsupervised learning algorithm, when collecting the training corpus, the more data the better; in addition, the corpus should mainly target the intended application scenario and cover most of the data types of that scenario as far as possible. In practical applications, the training corpus can be labeled or unlabeled; the embodiments of the present application do not limit this.
In S402, the training corpus is split character by character to obtain a split result.
In the embodiments of the present application, every sentence in the training corpus is divided into individual characters.
In S403, at least one of the following pre-processing steps is applied to the split result: filtering garbage characters, filtering stop characters, filtering low-frequency characters, and filtering meaningless symbols, to obtain a pre-processed result.
In the embodiments of the present application, to improve processing efficiency and quality, garbage characters, stop characters, low-frequency characters, and meaningless symbols can be filtered out of the split result, and the result is organized into the input format required by the word2vec algorithm, that is, its inputs and outputs are specified, in preparation for establishing the training objective.
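A hedged sketch of steps S402 and S403 follows. The stop-character set, junk set, and frequency threshold are illustrative choices, not values specified by the patent:

```python
from collections import Counter

STOP = {"的", "了"}           # illustrative stop characters
JUNK = {"\u200b", "\ufeff"}   # illustrative garbage characters (zero-width, BOM)
MIN_FREQ = 2                  # illustrative low-frequency threshold

def preprocess(corpus_sentences):
    """Split each sentence into characters, then filter the split result."""
    split = [list(s) for s in corpus_sentences]
    freq = Counter(ch for sent in split for ch in sent)
    return [[ch for ch in sent
             if ch not in STOP and ch not in JUNK
             and not ch.isspace()          # drop meaningless whitespace symbols
             and freq[ch] >= MIN_FREQ]     # drop low-frequency characters
            for sent in split]

corpus = ["北京大学 的", "北京 大学"]
print(preprocess(corpus))  # → [['北', '京', '大', '学'], ['北', '京', '大', '学']]
```

The filtered character lists are exactly the sentence-shaped input that a word2vec trainer expects.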
In S404, the pre-processed result is trained with the word2vec algorithm to obtain the character-to-vector mapping dictionary.
In the embodiments of the present application, the character-to-vector mapping dictionary can be obtained by training on the pre-processed result with either the CBOW model or the skip-gram model of the word2vec algorithm.
Considering that the larger the training corpus, the more complete and accurate the content of the resulting character-to-vector mapping dictionary, and that the skip-gram model is particularly suited to big data, training on the pre-processed result with the skip-gram model is the preferred way to obtain the character-to-vector mapping dictionary.
Compared with the word vectors of the related art, the character-based vectorization of the embodiments of the present application brings the following advantages: finer-grained character features can be represented; because the number of distinct characters is far smaller than the number of distinct words, the resulting model occupies very little space, which greatly improves model loading speed; and while new words keep emerging over time, so that previously trained word vector models suffer an increasingly severe drop in feature hit rate, character-based vectors effectively avoid this problem, because relatively few new characters are created each year.
In step 303, the character vector sequence is processed with a neural network algorithm to obtain the text feature sequence of the input sequence.
It can be understood that assigning a tag to each character in the input sequence can be abstracted as a sequence labeling problem, which is in essence a classification task: the class of each character needs to be determined.
The core idea of the neural network algorithm used in the embodiments of the present application is that, when classifying each current character, the preceding historical information is taken into account as input, which solves the problem of treating characters independently.
In the embodiments of the present application, the character vector sequence (v'1, v'2, ..., v'n) corresponding to the input sequence is input into the neural network algorithm for processing, and the text feature sequence (h1, h2, ..., hn) of the input sequence is obtained, where hi is the character feature vector of xi and contains the feature information of xi.
As an example, a recurrent neural network (RNN) can be used to process the character vector sequence corresponding to the input sequence.
However, a standard RNN suffers from the long-range dependency problem: historical information that lies far back along the path has little influence on the classification of the current character, even when that information is directly relevant to the current problem. To illustrate, consider the input "I studied abroad in the United States ..., and can speak fluent *English*", where the ellipsis stands for other, longer context and the italicized word is the one to be predicted. Having seen the word before "English", we can predict that the next word is probably the name of a language, but which language it is must be determined from the context. Because of its structure, a standard RNN may be unable to remember the highly useful earlier information "I studied abroad in the United States", and therefore cannot predict that the next word is probably "English"; this phenomenon is called the long-range dependency problem. The LSTM was introduced to solve it. Its main idea is to use "gates", on top of the standard RNN, to control the flow of contextual information: the input and output of historical information are controlled by several gates, each of which applies a sigmoid nonlinearity normalized between 0 and 1. The closer a gate's value is to 0, the less historical information passes through the gate; conversely, the closer it is to 1, the more information passes through. These gates can both "remember" useful information and "forget" useless information. In this way, relevant information from relatively far away is selectively retained for the classification of the current character, which improves the prediction effect.
Further, to improve the processing effect, a bidirectional long short-term memory network can be used to process the character vector sequence and obtain the text feature sequence of the input sequence.
Concretely, the character vector sequence (v'1, v'2, ..., v'n) corresponding to the input sequence is input into a bidirectional LSTM. The forward LSTM outputs (hf1, hf2, ..., hfn) and the backward LSTM outputs (hb1, hb2, ..., hbn); the two are concatenated per position to obtain (h1, h2, ..., hn). Here hfi, the forward LSTM output corresponding to v'i, characterizes the historical context of xi, while hbi, the backward LSTM output corresponding to v'i, characterizes the future context of xi.
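The bidirectional concatenation just described can be sketched with a toy recurrent cell standing in for each LSTM direction. This is only an assumption-laden illustration of how the forward outputs hf_i and backward outputs hb_i are aligned and joined; a real implementation would use an LSTM cell with memory and forget gates and vector-valued states:

```python
def run_recurrent(vectors, cell, reverse=False):
    """Run a toy recurrent cell over the sequence, optionally right-to-left."""
    seq = list(reversed(vectors)) if reverse else vectors
    h, outputs = 0.0, []
    for v in seq:
        h = cell(h, v)              # new state from previous state and input
        outputs.append(h)
    # re-align backward outputs so outputs[i] corresponds to position i
    return list(reversed(outputs)) if reverse else outputs

def toy_cell(h_prev, v):
    return 0.5 * h_prev + sum(v)    # stand-in for one LSTM step

def bidirectional(vectors):
    """Concatenate per position: h_i = (hf_i, hb_i)."""
    hf = run_recurrent(vectors, toy_cell)                # sees the past of x_i
    hb = run_recurrent(vectors, toy_cell, reverse=True)  # sees the future of x_i
    return list(zip(hf, hb))

vs = [[1.0], [2.0], [3.0]]
print(bidirectional(vs))  # → [(1.0, 2.75), (2.5, 3.5), (4.25, 3.0)]
```

Note that each position's pair combines a state computed from everything to its left with one computed from everything to its right, which is exactly what lets h_i capture both historical and future context.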
In step 304, use condition random field processing text feature sequence, name entity corresponding to list entries is obtained
Recognition result.
Conditional random field models are a kind of undirected graph models, it be marked in given needs observation sequence (word, sentence,
Numerical value etc.) under conditions of, calculate the joint probability distribution of whole flag sequence.
In the embodiment of the present application, the sequence learning algorithm of condition random field can be improved iteration method of scales, condition random
The prediction algorithm of field can be viterbi algorithm.
In the embodiment of the present application, the text feature sequence (h_1, h_2, ..., h_n) corresponding to the input sequence may be fed into a linear-chain conditional random field. Specifically, during learning, the text feature sequence (h_1, h_2, ..., h_n) is used, via the learning algorithm of the conditional random field (such as improved iterative scaling), to obtain an output sequence (s_1, s_2, ..., s_n) and a state transition matrix, where s_i is the output corresponding to h_i; s_i is a 1×K vector in which each element represents the confidence score of x_i with respect to a different label, and the state transition matrix holds the transition probabilities between labels. During prediction, the task is converted into a maximum-probability path problem: the Viterbi algorithm processes the output sequence (s_1, s_2, ..., s_n) and the state transition matrix to obtain the label sequence (y_1, y_2, ..., y_n), where y_i corresponds to x_i. Further parsing as required then yields the final named entity recognition result.
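The Viterbi decoding step described above can be sketched as follows. The label set, score matrix, and transition matrix in the example are toy values chosen for illustration, not taken from the application:

```python
import numpy as np

def viterbi(scores, transition):
    """
    scores:     n x K matrix; scores[i][k] is the confidence of character x_i
                for label k (the s_i vectors produced by the CRF layer).
    transition: K x K matrix; transition[j][k] is the score of moving from
                label j to label k.
    Returns the highest-scoring label sequence (y_1, ..., y_n) as label indices.
    """
    n, K = scores.shape
    dp = np.zeros((n, K))              # best cumulative score ending in each label
    back = np.zeros((n, K), dtype=int)  # backpointers for path recovery
    dp[0] = scores[0]
    for i in range(1, n):
        for k in range(K):
            cand = dp[i - 1] + transition[:, k] + scores[i, k]
            back[i, k] = int(np.argmax(cand))
            dp[i, k] = cand[back[i, k]]
    # Trace the best path backwards from the highest-scoring final label.
    path = [int(np.argmax(dp[-1]))]
    for i in range(n - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]

# Toy example with K=2 labels (0 = outside, 1 = entity) and n=3 characters.
s = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]])
T = np.array([[0.5, 0.0], [0.0, 0.5]])  # staying in the same label is rewarded
print(viterbi(s, T))  # → [0, 1, 1]
```

Real CRF implementations work with log-potentials learned during training; the additive scoring above mirrors that log-space formulation.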
As can be seen from the above embodiment, this embodiment can convert each character in the input sequence to be recognized into a corresponding vector, process the vectors corresponding to the characters using a neural network algorithm to extract the text feature sequence of the input sequence to be recognized, and finally process the text feature sequence using a conditional random field to obtain the named entity recognition result corresponding to the input sequence to be recognized. Because characters can characterize finer-grained features and the number of distinct characters is much smaller than the number of distinct words, because the neural network algorithm can take into account the context of each character in the input sequence, and because the conditional random field can avoid the label bias problem, the embodiments of the present application achieve a good recognition effect by combining character vectors, a neural network algorithm, and a conditional random field to realize named entity recognition.
Fig. 5 is a schematic structural diagram of a named entity recognition device according to an embodiment of the present application. Referring to Fig. 5, in one software implementation, the named entity recognition device 500 may include:
an acquiring unit 501, configured to obtain an input sequence;
a first processing unit 502, configured to vectorize the characters in the input sequence to obtain a character vector sequence corresponding to the input sequence;
a second processing unit 503, configured to process the character vector sequence using a neural network algorithm to obtain a text feature sequence of the input sequence;
a third processing unit 504, configured to process the text feature sequence using a conditional random field to obtain a named entity recognition result corresponding to the input sequence.
As can be seen from the above embodiment, this embodiment can convert each character in the input sequence to be recognized into a corresponding vector, process the vectors corresponding to the characters using a neural network algorithm to extract the text feature sequence of the input sequence to be recognized, and finally process the text feature sequence using a conditional random field to obtain the named entity recognition result corresponding to the input sequence to be recognized. Because characters can characterize finer-grained features and the number of distinct characters is much smaller than the number of distinct words, because the neural network algorithm can take into account the context of each character in the input sequence, and because the conditional random field can avoid the label bias problem, the embodiments of the present application achieve a good recognition effect by combining character vectors, a neural network algorithm, and a conditional random field to realize named entity recognition.
Optionally, as an embodiment, the second processing unit 503 may include:
a character vector sequence processing subunit, configured to process the character vector sequence using a bidirectional long short-term memory neural network to obtain the text feature sequence of the input sequence.
Optionally, as an embodiment, the first processing unit 502 may include:
a mapping dictionary acquiring subunit, configured to obtain a character-to-vector mapping dictionary, wherein the character-to-vector mapping dictionary records correspondences between characters and vectors;
a lookup subunit, configured to look up, in the character-to-vector mapping dictionary, the vector corresponding to each character in the input sequence;
an attention mechanism processing subunit, configured to process the vectors corresponding to the characters in the input sequence using an attention mechanism to obtain a weight value corresponding to each vector;
a character vector sequence obtaining subunit, configured to perform a dot multiplication of the vector corresponding to each character in the input sequence with the weight value corresponding to that vector, to obtain the character vector sequence corresponding to the input sequence.
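One possible instantiation of the attention-weighting step is sketched below. Scoring each character vector against the mean of the sequence is an illustrative assumption, since the application does not fix a particular attention scoring function:

```python
import numpy as np

def attention_weights(vectors):
    """Score each character vector against the mean of the sequence and
    normalize with softmax -- one simple instantiation of the attention step."""
    V = np.stack(vectors)              # n x d matrix of looked-up vectors
    query = V.mean(axis=0)             # a global query vector (assumption)
    scores = V @ query                 # one relevance score per character
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

def weighted_char_vectors(vectors):
    """Multiply each looked-up vector by its attention weight (the dot
    multiplication step) to form the final character vector sequence."""
    w = attention_weights(vectors)
    return [wi * v for wi, v in zip(w, vectors)]

# Toy lookup result for a 3-character input sequence, d = 4.
looked_up = [np.array([1.0, 0, 0, 0]),
             np.array([0, 2.0, 0, 0]),
             np.array([0, 0, 1.0, 0])]
out = weighted_char_vectors(looked_up)
print(len(out), out[0].shape)  # 3 weighted vectors of dimension 4
```

The weights sum to one, so characters the scoring function deems more relevant contribute proportionally larger vectors to the downstream neural network.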
Optionally, as an embodiment, the named entity recognition device 500 may further include a mapping dictionary generating unit.
The mapping dictionary generating unit may include:
a training corpus acquiring subunit, configured to obtain a training corpus;
a character splitting subunit, configured to split the training corpus in units of characters to obtain a splitting result;
a pretreatment subunit, configured to perform at least one of the following pretreatments on the splitting result to obtain a pretreatment result: filtering junk characters, filtering stop characters, filtering low-frequency characters, and filtering meaningless symbols;
a mapping dictionary training subunit, configured to train on the pretreatment result using the word2vec algorithm to obtain the character-to-vector mapping dictionary.
Optionally, as an embodiment, the mapping dictionary training subunit is specifically configured to:
train on the pretreatment result using a skip-gram model to obtain the character-to-vector mapping dictionary.
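The character splitting, filtering, and skip-gram training-pair generation can be sketched as follows. The filter sets and window size here are illustrative assumptions; a real implementation would hand the generated (center, context) pairs to a word2vec trainer to produce the character-to-vector mapping dictionary:

```python
from collections import Counter

def split_and_filter(corpus, stop_chars=frozenset(" \n\t"), min_count=1):
    """Split the training corpus into characters and apply the pretreatment
    filters (stop characters and low-frequency characters)."""
    chars = [c for c in corpus if c not in stop_chars]
    counts = Counter(chars)
    return [c for c in chars if counts[c] >= min_count]

def skipgram_pairs(chars, window=1):
    """Generate (center, context) training pairs as used by the skip-gram
    model: each character predicts its neighbours within the window."""
    pairs = []
    for i, center in enumerate(chars):
        for j in range(max(0, i - window), min(len(chars), i + window + 1)):
            if j != i:
                pairs.append((center, chars[j]))
    return pairs

chars = split_and_filter("ab ba")
print(chars)                  # ['a', 'b', 'b', 'a']
print(skipgram_pairs(chars))  # [('a','b'), ('b','a'), ('b','b'), ('b','b'), ('b','a'), ('a','b')]
```

After training, each character maps to the learned embedding for its center-word role, and those (character, vector) entries form the mapping dictionary.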
The named entity recognition device 500 can also perform the method of the embodiment shown in Fig. 3 and implement the functions of the named entity recognition device in the embodiment shown in Fig. 5; details are not repeated here in the embodiments of the present application.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to Fig. 6, at the hardware level the electronic device includes a processor and, optionally, an internal bus, a network interface, and a memory. The memory may include internal memory, such as high-speed random access memory (Random-Access Memory, RAM), and may also include non-volatile memory, for example at least one disk memory. Of course, the electronic device may also include hardware required by other services.
The processor, the network interface, and the memory may be interconnected via the internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one double-headed arrow is shown in Fig. 6, but this does not mean that there is only one bus or one type of bus.
The memory is used to store a program. Specifically, the program may include program code, and the program code includes computer operation instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the named entity recognition device at the logical level. The processor executes the program stored in the memory and is specifically configured to perform the following operations:
obtaining an input sequence;
vectorizing the characters in the input sequence to obtain a character vector sequence corresponding to the input sequence;
processing the character vector sequence using a neural network algorithm to obtain a text feature sequence of the input sequence;
processing the text feature sequence using a conditional random field to obtain a named entity recognition result corresponding to the input sequence.
The method performed by the named entity recognition device disclosed in the embodiment shown in Fig. 6 of the present application may be applied to a processor, or implemented by a processor. The processor may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The above processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like. The steps of the methods disclosed in the embodiments of the present application may be embodied as being executed directly by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The electronic device may also perform the method of Fig. 3 and implement the functions of the named entity recognition device in the embodiment shown in Fig. 3; details are not repeated here in the embodiments of the present application.
Of course, in addition to software implementations, the electronic device of this specification does not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to individual logic units, and may also be hardware or logic devices.
The embodiments of the present application further provide a computer-readable storage medium storing one or more programs. The one or more programs include instructions which, when executed by a portable electronic device including a plurality of application programs, cause the portable electronic device to perform the method of the embodiment shown in Fig. 3, and specifically to perform the following method:
obtaining an input sequence;
vectorizing the characters in the input sequence to obtain a character vector sequence corresponding to the input sequence;
processing the character vector sequence using a neural network algorithm to obtain a text feature sequence of the input sequence;
processing the text feature sequence using a conditional random field to obtain a named entity recognition result corresponding to the input sequence.
In short, the above descriptions are merely preferred embodiments of this specification and are not intended to limit its protection scope. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of this specification shall fall within its protection scope.
The systems, devices, modules, or units illustrated in the above embodiments may be implemented by computer chips or entities, or by products having certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an e-mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium usable for storing information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements, but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively simple, and reference may be made to the description of the method embodiments for the relevant parts.
Claims (10)
1. A named entity recognition method, wherein the method comprises:
obtaining an input sequence;
vectorizing the characters in the input sequence to obtain a character vector sequence corresponding to the input sequence;
processing the character vector sequence using a neural network algorithm to obtain a text feature sequence of the input sequence;
processing the text feature sequence using a conditional random field to obtain a named entity recognition result corresponding to the input sequence.
2. The method according to claim 1, wherein the processing the character vector sequence using a neural network algorithm to obtain a text feature sequence of the input sequence comprises:
processing the character vector sequence using a bidirectional long short-term memory neural network to obtain the text feature sequence of the input sequence.
3. The method according to claim 1, wherein the vectorizing the characters in the input sequence to obtain a character vector sequence corresponding to the input sequence comprises:
obtaining a character-to-vector mapping dictionary, wherein the character-to-vector mapping dictionary records correspondences between characters and vectors;
looking up, in the character-to-vector mapping dictionary, the vector corresponding to each character in the input sequence;
processing the vectors corresponding to the characters in the input sequence using an attention mechanism to obtain a weight value corresponding to each vector;
performing a dot multiplication of the vector corresponding to each character in the input sequence with the weight value corresponding to that vector, to obtain the character vector sequence corresponding to the input sequence.
4. The method according to claim 3, wherein the generating process of the character-to-vector mapping dictionary comprises:
obtaining a training corpus;
splitting the training corpus in units of characters to obtain a splitting result;
performing at least one of the following pretreatments on the splitting result to obtain a pretreatment result: filtering junk characters, filtering stop characters, filtering low-frequency characters, and filtering meaningless symbols;
training on the pretreatment result using the word2vec algorithm to obtain the character-to-vector mapping dictionary.
5. The method according to claim 4, wherein the training on the pretreatment result using the word2vec algorithm to obtain the character-to-vector mapping dictionary comprises:
training on the pretreatment result using a skip-gram model to obtain the character-to-vector mapping dictionary.
6. A named entity recognition device, wherein the device comprises:
an acquiring unit, configured to obtain an input sequence;
a first processing unit, configured to vectorize the characters in the input sequence to obtain a character vector sequence corresponding to the input sequence;
a second processing unit, configured to process the character vector sequence using a neural network algorithm to obtain a text feature sequence of the input sequence;
a third processing unit, configured to process the text feature sequence using a conditional random field to obtain a named entity recognition result corresponding to the input sequence.
7. The device according to claim 6, wherein the second processing unit comprises:
a character vector sequence processing subunit, configured to process the character vector sequence using a bidirectional long short-term memory neural network to obtain the text feature sequence of the input sequence.
8. The device according to claim 6, wherein the first processing unit comprises:
a mapping dictionary acquiring subunit, configured to obtain a character-to-vector mapping dictionary, wherein the character-to-vector mapping dictionary records correspondences between characters and vectors;
a lookup subunit, configured to look up, in the character-to-vector mapping dictionary, the vector corresponding to each character in the input sequence;
an attention mechanism processing subunit, configured to process the vectors corresponding to the characters in the input sequence using an attention mechanism to obtain a weight value corresponding to each vector;
a character vector sequence obtaining subunit, configured to perform a dot multiplication of the vector corresponding to each character in the input sequence with the weight value corresponding to that vector, to obtain the character vector sequence corresponding to the input sequence.
9. An electronic device, comprising:
a processor; and
a memory arranged to store computer-executable instructions which, when executed, cause the processor to perform the steps of the method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing one or more programs which, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform the steps of the method according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711102742.5A CN107797992A (en) | 2017-11-10 | 2017-11-10 | Name entity recognition method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711102742.5A CN107797992A (en) | 2017-11-10 | 2017-11-10 | Name entity recognition method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107797992A true CN107797992A (en) | 2018-03-13 |
Family
ID=61534832
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711102742.5A Pending CN107797992A (en) | 2017-11-10 | 2017-11-10 | Name entity recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107797992A (en) |
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108536679A (en) * | 2018-04-13 | 2018-09-14 | 腾讯科技(成都)有限公司 | Name entity recognition method, device, equipment and computer readable storage medium |
CN108628823A (en) * | 2018-03-14 | 2018-10-09 | 中山大学 | In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training |
CN108874997A (en) * | 2018-06-13 | 2018-11-23 | 广东外语外贸大学 | A kind of name name entity recognition method towards film comment |
CN108920445A (en) * | 2018-04-23 | 2018-11-30 | 华中科技大学鄂州工业技术研究院 | A kind of name entity recognition method and device based on Bi-LSTM-CRF model |
CN109002436A (en) * | 2018-07-12 | 2018-12-14 | 上海金仕达卫宁软件科技有限公司 | Medical text terms automatic identifying method and system based on shot and long term memory network |
CN109241330A (en) * | 2018-08-20 | 2019-01-18 | 北京百度网讯科技有限公司 | The method, apparatus, equipment and medium of key phrase in audio for identification |
CN109241275A (en) * | 2018-07-05 | 2019-01-18 | 广东工业大学 | A kind of text subject clustering algorithm based on natural language processing |
CN109389091A (en) * | 2018-10-22 | 2019-02-26 | 重庆邮电大学 | The character identification system and method combined based on neural network and attention mechanism |
CN109446514A (en) * | 2018-09-18 | 2019-03-08 | 平安科技(深圳)有限公司 | Construction method, device and the computer equipment of news property identification model |
CN109614614A (en) * | 2018-12-03 | 2019-04-12 | 焦点科技股份有限公司 | A kind of BILSTM-CRF name of product recognition methods based on from attention |
CN109657239A (en) * | 2018-12-12 | 2019-04-19 | 电子科技大学 | The Chinese name entity recognition method learnt based on attention mechanism and language model |
CN109858041A (en) * | 2019-03-07 | 2019-06-07 | 北京百分点信息科技有限公司 | A kind of name entity recognition method of semi-supervised learning combination Custom Dictionaries |
CN109871535A (en) * | 2019-01-16 | 2019-06-11 | 四川大学 | A kind of French name entity recognition method based on deep neural network |
CN109871538A (en) * | 2019-02-18 | 2019-06-11 | 华南理工大学 | A kind of Chinese electronic health record name entity recognition method |
CN109918680A (en) * | 2019-03-28 | 2019-06-21 | 腾讯科技(上海)有限公司 | Entity recognition method, device and computer equipment |
CN110321547A (en) * | 2018-03-30 | 2019-10-11 | 北京四维图新科技股份有限公司 | A kind of name entity determines method and device |
CN110348016A (en) * | 2019-07-15 | 2019-10-18 | 昆明理工大学 | Text snippet generation method based on sentence association attention mechanism |
CN110348021A (en) * | 2019-07-17 | 2019-10-18 | 湖北亿咖通科技有限公司 | Character string identification method, electronic equipment, storage medium based on name physical model |
CN110362597A (en) * | 2019-06-28 | 2019-10-22 | 华为技术有限公司 | A kind of structured query language SQL injection detection method and device |
CN110516228A (en) * | 2019-07-04 | 2019-11-29 | 湖南星汉数智科技有限公司 | Name entity recognition method, device, computer installation and computer readable storage medium |
CN110543638A (en) * | 2019-09-10 | 2019-12-06 | 杭州橙鹰数据技术有限公司 | Named entity identification method and device |
WO2020048292A1 (en) * | 2018-09-04 | 2020-03-12 | 腾讯科技(深圳)有限公司 | Method and apparatus for generating network representation of neural network, storage medium, and device |
CN111079437A (en) * | 2019-12-20 | 2020-04-28 | 深圳前海达闼云端智能科技有限公司 | Entity identification method, electronic equipment and storage medium |
CN111222335A (en) * | 2019-11-27 | 2020-06-02 | 上海眼控科技股份有限公司 | Corpus correction method and device, computer equipment and computer-readable storage medium |
CN111222334A (en) * | 2019-11-15 | 2020-06-02 | 广州洪荒智能科技有限公司 | Named entity identification method, device, equipment and medium |
CN111291566A (en) * | 2020-01-21 | 2020-06-16 | 北京明略软件系统有限公司 | Event subject identification method and device and storage medium |
CN111782768A (en) * | 2020-06-30 | 2020-10-16 | 首都师范大学 | Fine-grained entity identification method based on hyperbolic space representation and label text interaction |
CN111885000A (en) * | 2020-06-22 | 2020-11-03 | 网宿科技股份有限公司 | Network attack detection method, system and device based on graph neural network |
CN112115258A (en) * | 2019-06-20 | 2020-12-22 | 腾讯科技(深圳)有限公司 | User credit evaluation method, device, server and storage medium |
CN112154509A (en) * | 2018-04-19 | 2020-12-29 | 皇家飞利浦有限公司 | Machine learning model with evolving domain-specific dictionary features for text annotation |
CN112215005A (en) * | 2020-10-12 | 2021-01-12 | 小红书科技有限公司 | Entity identification method and device |
WO2021146831A1 (en) * | 2020-01-20 | 2021-07-29 | 京东方科技集团股份有限公司 | Entity recognition method and apparatus, dictionary creation method, device, and medium |
CN113221884A (en) * | 2021-05-13 | 2021-08-06 | 中国科学技术大学 | Text recognition method and system based on low-frequency word storage memory |
CN113221885A (en) * | 2021-05-13 | 2021-08-06 | 中国科学技术大学 | Hierarchical modeling method and system based on whole words and radicals |
CN113362540A (en) * | 2021-06-11 | 2021-09-07 | 江苏苏云信息科技有限公司 | Traffic ticket business processing device, system and method based on multimode interaction |
CN113570480A (en) * | 2021-07-19 | 2021-10-29 | 北京华宇元典信息服务有限公司 | Judging document address information identification method and device and electronic equipment |
CN117034942A (en) * | 2023-10-07 | 2023-11-10 | 之江实验室 | Named entity recognition method, device, equipment and readable storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106569998A (en) * | 2016-10-27 | 2017-04-19 | 浙江大学 | Text named entity recognition method based on Bi-LSTM, CNN and CRF |
-
2017
- 2017-11-10 CN CN201711102742.5A patent/CN107797992A/en active Pending
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106569998A (en) * | 2016-10-27 | 2017-04-19 | 浙江大学 | Text named entity recognition method based on Bi-LSTM, CNN and CRF |
Non-Patent Citations (4)
Title |
---|
AKASH BHARADWAJ等: ""Phonologically Aware Neural Model for Named Entity Recognition in Low Resource Transfer Settings"", 《PROCEEDINGS OF THE 2016 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING》 * |
CHUANHAI DONG等: ""Character-Based LSTM-CRF with Radical-Level Features for Chinese Named Entity Recognition"", 《NATURAL LANGUAGE UNDERSTANDING AND INTELLIGENT APPLICATIONS(NLPCC 2016)》 * |
ROBERT_AI: "" 神经网络结构在命名实体识别(NER)中的应用"", 《HTTPS://WWW.CNBLOGS.COM/ROBERT-DLUT/P/6847401.HTML》 * |
机器之心: ""如何用深度学习做自然语言处理?这里有份最佳实践清单"", 《HTTPS://WWW.JIQIZHIXIN.COM/ARTICLES/2017-07-26-5》 * |
Cited By (57)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108628823A (en) * | 2018-03-14 | 2018-10-09 | 中山大学 | In conjunction with the name entity recognition method of attention mechanism and multitask coordinated training |
CN108628823B (en) * | 2018-03-14 | 2022-07-01 | 中山大学 | Named entity recognition method combining attention mechanism and multi-task collaborative training |
CN110321547B (en) * | 2018-03-30 | 2024-06-11 | 北京四维图新科技股份有限公司 | Named entity determination method and device |
CN110321547A (en) * | 2018-03-30 | 2019-10-11 | 北京四维图新科技股份有限公司 | A kind of name entity determines method and device |
CN108536679B (en) * | 2018-04-13 | 2022-05-20 | 腾讯科技(成都)有限公司 | Named entity recognition method, device, equipment and computer readable storage medium |
CN108536679A (en) * | 2018-04-13 | 2018-09-14 | 腾讯科技(成都)有限公司 | Name entity recognition method, device, equipment and computer readable storage medium |
CN112154509A (en) * | 2018-04-19 | 2020-12-29 | 皇家飞利浦有限公司 | Machine learning model with evolving domain-specific dictionary features for text annotation |
CN108920445A (en) * | 2018-04-23 | 2018-11-30 | 华中科技大学鄂州工业技术研究院 | A kind of name entity recognition method and device based on Bi-LSTM-CRF model |
CN108920445B (en) * | 2018-04-23 | 2022-06-17 | 华中科技大学鄂州工业技术研究院 | Named entity identification method and device based on Bi-LSTM-CRF model |
CN108874997A (en) * | 2018-06-13 | 2018-11-23 | 广东外语外贸大学 | A kind of name name entity recognition method towards film comment |
CN109241275A (en) * | 2018-07-05 | 2019-01-18 | 广东工业大学 | A kind of text subject clustering algorithm based on natural language processing |
CN109241275B (en) * | 2018-07-05 | 2022-02-11 | 广东工业大学 | Text topic clustering algorithm based on natural language processing |
CN109002436A (en) * | 2018-07-12 | 2018-12-14 | 上海金仕达卫宁软件科技有限公司 | Medical text terms automatic identifying method and system based on shot and long term memory network |
US11308937B2 (en) * | 2018-08-20 | 2022-04-19 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for identifying key phrase in audio, device and medium |
CN109241330A (en) * | 2018-08-20 | 2019-01-18 | 北京百度网讯科技有限公司 | The method, apparatus, equipment and medium of key phrase in audio for identification |
US11875220B2 (en) | 2018-09-04 | 2024-01-16 | Tencent Technology (Shenzhen) Company Limited | Method, apparatus, and storage medium for generating network representation for neural network |
WO2020048292A1 (en) * | 2018-09-04 | 2020-03-12 | 腾讯科技(深圳)有限公司 | Method and apparatus for generating network representation of neural network, storage medium, and device |
CN109446514A (en) * | 2018-09-18 | 2019-03-08 | 平安科技(深圳)有限公司 | Construction method, device and the computer equipment of news property identification model |
CN109389091A (en) * | 2018-10-22 | 2019-02-26 | 重庆邮电大学 | The character identification system and method combined based on neural network and attention mechanism |
CN109389091B (en) * | 2018-10-22 | 2022-05-03 | 重庆邮电大学 | Character recognition system and method based on combination of neural network and attention mechanism |
CN109614614A (en) * | 2018-12-03 | 2019-04-12 | 焦点科技股份有限公司 | A kind of BILSTM-CRF name of product recognition methods based on from attention |
CN109614614B (en) * | 2018-12-03 | 2021-04-02 | 焦点科技股份有限公司 | BILSTM-CRF product name identification method based on self-attention |
CN109657239A (en) * | 2018-12-12 | 2019-04-19 | 电子科技大学 | The Chinese name entity recognition method learnt based on attention mechanism and language model |
CN109657239B (en) * | 2018-12-12 | 2020-04-21 | 电子科技大学 | Chinese named entity recognition method based on attention mechanism and language model learning |
CN109871535B (en) * | 2019-01-16 | 2020-01-10 | 四川大学 | French named entity recognition method based on deep neural network |
CN109871535A (en) * | 2019-01-16 | 2019-06-11 | 四川大学 | A kind of French name entity recognition method based on deep neural network |
CN109871538A (en) * | 2019-02-18 | 2019-06-11 | 华南理工大学 | A kind of Chinese electronic health record name entity recognition method |
CN109858041B (en) * | 2019-03-07 | 2023-02-17 | 北京百分点科技集团股份有限公司 | Named entity recognition method combining semi-supervised learning with user-defined dictionary |
CN109858041A (en) * | 2019-03-07 | 2019-06-07 | 北京百分点信息科技有限公司 | A kind of name entity recognition method of semi-supervised learning combination Custom Dictionaries |
CN109918680A (en) * | 2019-03-28 | 2019-06-21 | 腾讯科技(上海)有限公司 | Entity recognition method, device and computer equipment |
CN112115258A (en) * | 2019-06-20 | 2020-12-22 | 腾讯科技(深圳)有限公司 | User credit evaluation method, device, server and storage medium |
CN112115258B (en) * | 2019-06-20 | 2023-09-26 | 腾讯科技(深圳)有限公司 | Credit evaluation method and device for user, server and storage medium |
CN110362597A (en) * | 2019-06-28 | 华为技术有限公司 | Structured query language (SQL) injection detection method and device
CN110516228A (en) * | 2019-07-04 | 湖南星汉数智科技有限公司 | Named entity recognition method and device, computer apparatus, and computer-readable storage medium
CN110348016A (en) * | 2019-07-15 | 2019-10-18 | 昆明理工大学 | Text snippet generation method based on sentence association attention mechanism |
CN110348016B (en) * | 2019-07-15 | 2022-06-14 | 昆明理工大学 | Text abstract generation method based on sentence correlation attention mechanism |
CN110348021A (en) * | 2019-07-17 | 湖北亿咖通科技有限公司 | Character string recognition method based on named entity model, electronic device, and storage medium
CN110543638B (en) * | 2019-09-10 | 2022-12-27 | 杭州橙鹰数据技术有限公司 | Named entity identification method and device |
CN110543638A (en) * | 2019-09-10 | 2019-12-06 | 杭州橙鹰数据技术有限公司 | Named entity identification method and device |
CN111222334A (en) * | 2019-11-15 | 2020-06-02 | 广州洪荒智能科技有限公司 | Named entity identification method, device, equipment and medium |
CN111222335A (en) * | 2019-11-27 | 2020-06-02 | 上海眼控科技股份有限公司 | Corpus correction method and device, computer equipment and computer-readable storage medium |
CN111079437A (en) * | 2019-12-20 | 2020-04-28 | 深圳前海达闼云端智能科技有限公司 | Entity identification method, electronic equipment and storage medium |
WO2021146831A1 (en) * | 2020-01-20 | 2021-07-29 | 京东方科技集团股份有限公司 | Entity recognition method and apparatus, dictionary creation method, device, and medium |
CN111291566A (en) * | 2020-01-21 | 2020-06-16 | 北京明略软件系统有限公司 | Event subject identification method and device and storage medium |
CN111291566B (en) * | 2020-01-21 | 2023-04-28 | 北京明略软件系统有限公司 | Event main body recognition method, device and storage medium |
CN111885000A (en) * | 2020-06-22 | 2020-11-03 | 网宿科技股份有限公司 | Network attack detection method, system and device based on graph neural network |
CN111782768A (en) * | 2020-06-30 | 2020-10-16 | 首都师范大学 | Fine-grained entity identification method based on hyperbolic space representation and label text interaction |
WO2022001333A1 (en) * | 2020-06-30 | 2022-01-06 | 首都师范大学 | Hyperbolic space representation and label text interaction-based fine-grained entity recognition method |
CN112215005A (en) * | 2020-10-12 | 2021-01-12 | 小红书科技有限公司 | Entity identification method and device |
CN113221885A (en) * | 2021-05-13 | 2021-08-06 | 中国科学技术大学 | Hierarchical modeling method and system based on whole words and radicals |
CN113221884A (en) * | 2021-05-13 | 2021-08-06 | 中国科学技术大学 | Text recognition method and system based on low-frequency word storage memory |
CN113221885B (en) * | 2021-05-13 | 2022-09-06 | 中国科学技术大学 | Hierarchical modeling method and system based on whole words and radicals |
CN113221884B (en) * | 2021-05-13 | 2022-09-06 | 中国科学技术大学 | Text recognition method and system based on low-frequency word storage memory |
CN113362540A (en) * | 2021-06-11 | 2021-09-07 | 江苏苏云信息科技有限公司 | Traffic ticket business processing device, system and method based on multimode interaction |
CN113570480A (en) * | 2021-07-19 | 北京华宇元典信息服务有限公司 | Method and device for recognizing address information in judgment documents, and electronic device
CN117034942A (en) * | 2023-10-07 | 2023-11-10 | 之江实验室 | Named entity recognition method, device, equipment and readable storage medium |
CN117034942B (en) * | 2023-10-07 | 2024-01-09 | 之江实验室 | Named entity recognition method, device, equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107797992A (en) | Named entity recognition method and device | |
CN107679039B (en) | Method and device for determining statement intention | |
CN108629687B (en) | Anti-money laundering method, device and equipment | |
CN110276023B (en) | POI transition event discovery method, device, computing equipment and medium | |
CN110619044B (en) | Emotion analysis method, system, storage medium and equipment | |
Wang et al. | Dialogue intent classification with character-CNN-BGRU networks | |
Li et al. | A method of emotional analysis of movie based on convolution neural network and bi-directional LSTM RNN | |
Alexandridis et al. | A knowledge-based deep learning architecture for aspect-based sentiment analysis | |
CN109271624A (en) | Target word determination method, apparatus and storage medium | |
CN108694183A (en) | Search method and device | |
CN112132238A (en) | Method, device, equipment and readable medium for identifying private data | |
CN113947086A (en) | Sample data generation method, training method, corpus generation method and apparatus | |
CN116975271A (en) | Text relevance determining method, device, computer equipment and storage medium | |
Park et al. | Sensitive data identification in structured data through genner model based on text generation and ner | |
US20200110834A1 (en) | Dynamic Linguistic Assessment and Measurement | |
CN114328841A (en) | Question-answer model training method and device, question-answer method and device | |
CN116702784B (en) | Entity linking method, entity linking device, computer equipment and storage medium | |
CN103514194B (en) | Method and apparatus for determining relevance between corpus and entity, and classifier training method | |
CN116955579A (en) | Chat reply generation method and device based on keyword knowledge retrieval | |
CN116719915A (en) | Intelligent question-answering method, device, equipment and storage medium | |
CN117216617A (en) | Text classification model training method, device, computer equipment and storage medium | |
Arbaatun et al. | Hate speech detection on Twitter through Natural Language Processing using LSTM model | |
Zhu et al. | A named entity recognition model based on ensemble learning | |
Porjazovski et al. | Attention-based end-to-end named entity recognition from speech | |
CN115129885A (en) | Entity chain pointing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | CB02 | Change of applicant information | Address after: 100081 No. 101, 1st Floor, Building 14, 27 Jiancai Chengzhong Road, Haidian District, Beijing. Applicant after: Beijing PERCENT Technology Group Co., Ltd. Address before: 100081 16/F, Block A, Beichen Century Center, Building 2, Courtyard 8, Beichen West Road, Chaoyang District, Beijing. Applicant before: BEIJING BAIFENDIAN INFORMATION SCIENCE & TECHNOLOGY Co., Ltd.
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180313