CN109522553A

CN109522553A - Method and device for recognizing named entities

Info

Publication number: CN109522553A
Application number: CN201811332914.2A
Authority: CN
Inventors: 聂镭; 徐泓洋; 郑权; 张峰; 聂颖
Original assignee: Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd
Current assignee: Dragon Horse Zhixin (zhuhai Hengqin) Technology Co Ltd
Priority date: 2018-11-09
Filing date: 2018-11-09
Publication date: 2019-03-26
Anticipated expiration: 2038-11-09
Also published as: CN109522553B

Abstract

The invention discloses a method and a device for recognizing named entities. The method comprises: performing information extraction on a text image using a convolutional neural network (CNN) model to obtain the font vector corresponding to the text in the text image; concatenating the font vector with the text vector corresponding to the text, and obtaining a feature vector from the resulting concatenated vector; obtaining a named entity set from the feature vector, wherein the named entity set includes multiple named entities; and constructing a posed question corresponding to the text image and locating, based on the posed question, the named entity to be obtained, wherein the named entity to be obtained belongs to the named entity set. The invention solves the technical problem in the related art that information recognized from some documents by traditional information extraction is not usable information.

Description

Method and device for recognizing named entities

Technical field

The present invention relates to the technical field of natural language processing, and in particular to a method and a device for recognizing named entities.

Background art

Traditional nationally certified certificates, including CET-4 and CET-6 certificates, graduation certificates, and degree certificates, all have a fixed layout, fixed positions, and specific content. In certificate recognition it is therefore enough to extract the text at the relevant positions, which can be matched directly to the corresponding information; that is, recognition directly yields the information.

As the state has relaxed its requirements on the form and content of certificates, universities and research institutions have begun designing their own distinctive certificates, especially graduation and degree certificates. Different schools use different forms and content, and even within one school the content and form of different certificates are not the same. This poses a problem for traditional certificate recognition: even when the text on a certificate has been extracted, it still cannot be matched to information; that is, the text is recognized but does not yield usable information.

No effective solution has yet been proposed for the above problem in the related art that information recognized from some documents by traditional information extraction is not usable information.

Summary of the invention

Embodiments of the present invention provide a method and a device for recognizing named entities, to at least solve the technical problem in the related art that information recognized from some documents by traditional information extraction is not usable information.

According to one aspect of the embodiments of the present invention, a method for recognizing named entities is provided, comprising: performing information extraction on a text image using a convolutional neural network (CNN) model to obtain the font vector corresponding to the text in the text image; concatenating the font vector with the text vector corresponding to the text, and obtaining a feature vector from the resulting concatenated vector; obtaining a named entity set from the feature vector, wherein the named entity set includes multiple named entities; and constructing a posed question corresponding to the text image and locating, based on the posed question, the named entity to be obtained, wherein the named entity to be obtained belongs to the named entity set.

Optionally, the font vector is an N*1-dimensional vector and the text vector is an M*1-dimensional vector, where N denotes the number of font attributes of the text corresponding to the font vector, and M denotes the number of word attributes of the text in the text vector.

Optionally, concatenating the font vector with the text vector corresponding to the text and obtaining a feature vector from the resulting concatenated vector comprises: concatenating the N*1-dimensional font vector with the M*1-dimensional text vector to obtain an (N+M)*1-dimensional concatenated vector; using the (N+M)*1-dimensional concatenated vector as the input of a bidirectional long short-term memory network (Bi-LSTM) model; obtaining the output of the Bi-LSTM model; and obtaining the feature vector from the output, wherein the feature vector is a 2(N+M)*1-dimensional vector.

Optionally, obtaining the named entity set from the feature vector comprises: using the feature vector as the input of a conditional random field (CRF) model; obtaining the output of the CRF model; and obtaining the named entity set from the output of the CRF model.

Optionally, constructing the posed question corresponding to the text image comprises: extracting key information from the text corresponding to the text image, wherein the key information is a feature word associated with the named entity; and using the key information as the posed question.

Optionally, locating the named entity to be obtained based on the posed question comprises: determining, by means of a matching neural network model, the identifier of the text segment corresponding to the posed question, wherein the matching neural network model is obtained by machine-learning training on multiple groups of data, each group of data including a posed question and the identifier of the text segment corresponding to that posed question; and extracting the named entity to be obtained according to the identifier of the text segment.

Optionally, before locating the named entity to be obtained based on the posed question, the method further comprises: recognizing the text corresponding to the text image to obtain multiple text segments; and adding identifiers to the multiple text segments based on a predetermined rule; wherein recognizing the text corresponding to the text image to obtain multiple text segments comprises: identifying predetermined punctuation marks in the text; and recognizing the text corresponding to the text image according to the predetermined punctuation marks to obtain the multiple text segments.

According to another aspect of the embodiments of the present invention, a device for recognizing named entities is further provided, comprising: an extraction unit, configured to perform information extraction on a text image using a convolutional neural network (CNN) model to obtain the font vector corresponding to the text in the text image; a first acquisition unit, configured to concatenate the font vector with the text vector corresponding to the text and to obtain a feature vector from the resulting concatenated vector; a second acquisition unit, configured to obtain a named entity set from the feature vector, wherein the named entity set includes multiple named entities; and a third acquisition unit, configured to construct a posed question corresponding to the text image and to locate, based on the posed question, the named entity to be obtained, wherein the named entity to be obtained belongs to the named entity set.

Optionally, the first acquisition unit includes: a concatenation module, configured to concatenate the N*1-dimensional font vector with the M*1-dimensional text vector to obtain an (N+M)*1-dimensional concatenated vector; a first determining module, configured to use the (N+M)*1-dimensional concatenated vector as the input of a bidirectional long short-term memory network (Bi-LSTM) model; a first obtaining module, configured to obtain the output of the Bi-LSTM model; and a second obtaining module, configured to obtain the feature vector from the output, wherein the feature vector is a 2(N+M)*1-dimensional vector.

Optionally, the second acquisition unit includes: a second determining module, configured to use the feature vector as the input of a conditional random field (CRF) model; a third obtaining module, configured to obtain the output of the CRF model; and a fourth obtaining module, configured to obtain the named entity set from the output of the CRF model.

Optionally, the third acquisition unit includes: an abstraction module, configured to extract key information from the text corresponding to the text image, wherein the key information is a feature word associated with the named entity; and a third determining module, configured to use the key information as the posed question.

Optionally, the third acquisition unit includes: a fourth determining module, configured to determine, by means of a matching neural network model, the identifier of the text segment corresponding to the posed question, wherein the matching neural network model is obtained by machine-learning training on multiple groups of data, each group of data including a posed question and the identifier of the text segment corresponding to that posed question; and an extraction module, configured to extract the named entity to be obtained according to the identifier of the text segment.

Optionally, the device for recognizing named entities further includes: a fourth acquisition unit, configured to recognize the text corresponding to the text image to obtain multiple text segments before the named entity to be obtained is located based on the posed question; and an adding unit, configured to add identifiers to the multiple text segments based on a predetermined rule; wherein the fourth acquisition unit includes: an identification module, configured to identify predetermined punctuation marks in the text; and a fifth obtaining module, configured to recognize the text corresponding to the text image according to the predetermined punctuation marks to obtain the multiple text segments.

According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium includes a stored program, wherein the program executes the method for recognizing named entities described in any one of the above.

According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, wherein the program, when run, executes the method for recognizing named entities described in any one of the above.

In the embodiments of the present invention, named entity recognition is performed as follows: information extraction is performed on a text image using a convolutional neural network (CNN) model to obtain the font vector corresponding to the text in the text image; the font vector is concatenated with the text vector corresponding to the text, and a feature vector is obtained from the resulting concatenated vector; a named entity set, which includes multiple named entities, is obtained from the feature vector; and a posed question corresponding to the text image is constructed and the named entity to be obtained, which belongs to the named entity set, is located based on the posed question. The method for recognizing named entities provided by the embodiments of the present invention can thus concatenate the font vector of the extracted font information with the text vector corresponding to the text information, and obtain the named entity set from the concatenated vector. Both the spatial information and the contextual information of the text are taken into account, which improves the efficiency of recognizing useful information and in turn solves the technical problem in the related art that information recognized from some documents by traditional information extraction is not usable information.

Brief description of the drawings

The drawings described here are provided for a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of the method for recognizing named entities according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the device for recognizing named entities according to an embodiment of the present invention.

Detailed description of the embodiments

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product, or device.

Embodiment 1

According to an embodiment of the present invention, a method embodiment of a method for recognizing named entities is provided. It should be noted that the steps shown in the flowchart of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in an order different from the one given here.

Fig. 1 is a flowchart of the method for recognizing named entities according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:

Step S102: perform information extraction on a text image using a convolutional neural network (CNN) model to obtain the font vector corresponding to the text in the text image.

A convolutional neural network (CNN) is a deep feedforward artificial neural network whose artificial neurons respond to surrounding units, which makes it suitable for large-scale image processing. It includes convolutional layers, pooling layers, activation layers, dropout layers, and so on. CNNs come in one-dimensional, two-dimensional, and three-dimensional variants: one-dimensional CNNs are commonly used for sequence data; two-dimensional CNNs are commonly used for recognizing text in images; and three-dimensional CNNs are mainly used for medical images and video data.

In the embodiments of the present invention, the CNN model can be used to extract the font information in the text image and to output the font vector corresponding to each character in the image. Take CET-4 and CET-6 certificates, graduation certificates, and degree certificates as examples: because the content of a certificate falls into different types, the text on it is set in different fonts. Information such as names, dates, and institutions differs from ordinary text in typeface, font size, font weight, and so on. Such text is usually part, or even all, of the key information on the certificate, so the font information of the text needs to be extracted first. A CNN is commonly used to extract spatial information from images; in practice, CNN models of different complexity can be used here depending on the application scenario and requirements.
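As a rough illustration of how a convolutional model can turn a character image into an N-dimensional font vector, the following minimal sketch applies N kernels to a small glyph bitmap, followed by ReLU and global max pooling. It is a toy under stated assumptions, not the network described in the patent: the kernels are random and untrained, and the names `conv2d_valid`, `font_vector`, and the 8x8 glyph are hypothetical.

```python
import random

def conv2d_valid(image, kernel):
    """Valid (no padding) 2-D cross-correlation of a 2-D list with a square kernel."""
    k = len(kernel)
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(cols - k + 1)
        ]
        for r in range(rows - k + 1)
    ]

def font_vector(glyph, kernels):
    """One feature per kernel: convolution, then ReLU, then global max pooling."""
    features = []
    for kernel in kernels:
        fmap = conv2d_valid(glyph, kernel)
        features.append(max(max(0.0, v) for row in fmap for v in row))
    return features

rng = random.Random(0)
N = 4  # hypothetical number of font features
# Random 3x3 kernels stand in for trained convolution filters (assumption).
kernels = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)] for _ in range(N)]
# Hypothetical 8x8 glyph bitmap: a short vertical stroke.
glyph = [[1.0 if 2 <= r <= 5 and c == 3 else 0.0 for c in range(8)] for r in range(8)]

vec = font_vector(glyph, kernels)
print(len(vec))  # one value per kernel
```

In a trained model the kernels would be learned so that the pooled features separate typefaces, sizes, and weights; here they only demonstrate the data flow from image to fixed-length vector.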

Step S104: concatenate the font vector with the text vector corresponding to the text, and obtain a feature vector from the resulting concatenated vector.

In step S104, the font information extracted in step S102 can be fed, as part of the text vector, into a Bi-LSTM+CRF model to perform named entity recognition.

Bi-LSTM here is the bidirectional LSTM model. LSTM is a variant of the recurrent neural network (RNN): it modifies the memory cell of the basic RNN model by adding an input gate, a forget gate, and an output gate, which allows it to learn sequential information more effectively. Bi-LSTM adds a backward pass over the sequence to the original forward LSTM (forward as opposed to backward); at the output, the forward and backward vectors are usually concatenated to give the final output vector.

The input of the Bi-LSTM is the vector of each word or character. It can be a simple one-hot vector or a pre-trained word vector (Word2vec, GloVe). Because the embodiments of the present invention add the font information of each character, a pre-trained M*1-dimensional character/word vector is used and concatenated with the font information vector to obtain an (N+M)*1-dimensional input vector. The output vector obtained after the Bi-LSTM has dimension 2(N+M)*1; this is the concatenated feature vector.

Preferably, the font vector is an N*1-dimensional vector and the text vector is an M*1-dimensional vector, where N denotes the number of font attributes of the text corresponding to the font vector and M denotes the number of word attributes of the text in the text vector. A font attribute here is an attribute that characterizes the font, such as the typeface or font size of the text. A word attribute indicates whether the text is a verb, noun, predicate, subject, person name, place name, and so on.

As an optional embodiment, concatenating the font vector with the text vector corresponding to the text and obtaining a feature vector from the resulting concatenated vector comprises: concatenating the N*1-dimensional font vector with the M*1-dimensional text vector to obtain an (N+M)*1-dimensional concatenated vector; using the (N+M)*1-dimensional concatenated vector as the input of the bidirectional long short-term memory network (Bi-LSTM) model; obtaining the output of the Bi-LSTM model; and obtaining the feature vector from the output, wherein the feature vector is a 2(N+M)*1-dimensional vector.
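The dimension bookkeeping in this step can be checked with a small pure-Python sketch. It assumes that each LSTM direction uses a hidden size of N+M, which is one way to arrive at the stated 2(N+M)-dimensional output; the weights are random and untrained, biases are omitted for brevity, and all function names and the values of N and M are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_lstm(input_size, hidden_size, rng):
    """Random, untrained weights for the input, forget, output, and candidate gates."""
    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(input_size + hidden_size)]
                for _ in range(hidden_size)]
    return {gate: matrix() for gate in ("i", "f", "o", "g")}

def lstm_run(weights, inputs, hidden_size):
    """Run one LSTM direction over a sequence; return the hidden state at every step."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    outputs = []
    for x in inputs:
        z = x + h  # list concatenation of the current input and previous hidden state
        pre = {gate: [sum(w * v for w, v in zip(row, z)) for row in weights[gate]]
               for gate in weights}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        g = [math.tanh(v) for v in pre["g"]]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        outputs.append(h)
    return outputs

def bilstm(inputs, fwd, bwd, hidden_size):
    """Concatenate forward and backward hidden states at each position."""
    forward = lstm_run(fwd, inputs, hidden_size)
    backward = lstm_run(bwd, inputs[::-1], hidden_size)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]

rng = random.Random(0)
N, M, length = 3, 5, 4  # hypothetical sizes
font_vecs = [[rng.random() for _ in range(N)] for _ in range(length)]
text_vecs = [[rng.random() for _ in range(M)] for _ in range(length)]
concat = [f + t for f, t in zip(font_vecs, text_vecs)]  # (N+M)-dimensional inputs

hidden = N + M  # assumption: hidden size per direction equals the input size
features = bilstm(concat, make_lstm(hidden, hidden, rng),
                  make_lstm(hidden, hidden, rng), hidden)
print(len(concat[0]), len(features[0]))
```

With N = 3 and M = 5, each concatenated input has 8 components and each Bi-LSTM output has 16, matching the (N+M) and 2(N+M) dimensions described above.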

Step S106: obtain a named entity set from the feature vector, wherein the named entity set includes multiple named entities.

A conditional random field (CRF) is a probabilistic undirected graphical model. It models the conditional probability distribution of one set of output random variables given a set of input random variables, and is characterized by the assumption that the output random variables form a Markov random field. In contrast to the HMM, it is a discriminative model that predicts hidden variables from the observation sequence, and it is commonly used for syntactic analysis, named entity recognition, part-of-speech tagging, and similar tasks. Here the CRF is used as the layer after the Bi-LSTM: its input is the 2(N+M)*1-dimensional feature vectors output by the Bi-LSTM, and its output is the corresponding label sequence, that is, the various named entities.

In step S106, obtaining the named entity set from the feature vector may comprise: using the feature vector as the input of the conditional random field (CRF) model; obtaining the output of the CRF model; and obtaining the named entity set from the output of the CRF model.
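To make the step from CRF output to named entities concrete, the sketch below decodes the best label sequence with the Viterbi algorithm, a standard decoding method for linear-chain CRFs, and then collects entities from the resulting tags. The patent does not specify the tag scheme or decoding method, so the BIO tags, the hand-written emission and transition scores, and the function names are assumptions made for illustration; in the described pipeline the emission scores would be derived from the Bi-LSTM feature vectors.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions[t][y] is the score of label y at position t;
    transitions[p][y] is the score of moving from label p to label y.
    """
    n = len(labels)
    score = list(emissions[0])
    back = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for y in range(n):
            best_prev = max(range(n), key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y] + emit[y])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    path = [max(range(n), key=lambda y: score[y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return [labels[y] for y in path]

def extract_entities(tokens, tags):
    """Collect (text, type) pairs from BIO tags."""
    entities, current, kind = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), kind))
            current, kind = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == kind:
            current.append(token)
        else:
            if current:
                entities.append((" ".join(current), kind))
            current, kind = [], None
    if current:
        entities.append((" ".join(current), kind))
    return entities

labels = ["O", "B-PER", "I-PER"]  # hypothetical tag set
tokens = ["Student", "Zhang", "San", "graduated"]
emissions = [[2.0, 0.5, 0.1], [0.3, 2.0, 1.0], [0.5, 0.4, 1.5], [2.0, 0.2, 0.3]]
transitions = [[0.0, 0.0, -10.0],  # from O: O -> I-PER is strongly penalized
               [0.0, -1.0, 1.0],   # from B-PER
               [0.0, -1.0, 0.5]]   # from I-PER

tags = viterbi(emissions, transitions, labels)
entity_set = extract_entities(tokens, tags)
print(tags, entity_set)
```

The transition scores let the model rule out invalid tag sequences such as an inside tag directly after an outside tag, which is the main benefit of placing a CRF after the Bi-LSTM instead of classifying each position independently.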

Step S108: construct a posed question corresponding to the text image, and locate, based on the posed question, the named entity to be obtained, wherein the named entity to be obtained belongs to the named entity set.

In this embodiment, information extraction can be performed on a text image using a convolutional neural network (CNN) model to obtain the font vector corresponding to the text in the text image; the font vector is then concatenated with the text vector corresponding to the text, and a feature vector is obtained from the resulting concatenated vector; a named entity set, which includes multiple named entities, is obtained from the feature vector; and a posed question corresponding to the text image is constructed and the named entity to be obtained, which belongs to the named entity set, is located based on the posed question. In the related art, certificates come in many varieties: certificates issued by different institutions differ in form and content, and even certificates issued by different departments of the same institution at different times differ in content and form. This poses a problem for traditional certificate recognition, in that even when the text of a certificate has been extracted it still cannot be matched to information. In contrast, the method for recognizing named entities provided by the embodiments of the present invention can concatenate the font vector of the extracted font information with the text vector corresponding to the text information, and obtain the named entity set from the concatenated vector. Both the spatial information and the contextual information of the text are thus taken into account, which improves the efficiency of recognizing useful information and in turn solves the technical problem in the related art that information recognized from some documents by traditional information extraction is not usable information.

In step S108, constructing the posed question corresponding to the text image may comprise: extracting key information from the text corresponding to the text image, wherein the key information is a feature word associated with the named entity; and using the key information as the posed question. In other words, a question is posed for each item of key information. The purpose of this step is to treat the information to be extracted like a reading-comprehension problem: through the posed question, the text segment relevant to the question is found in the original text, so as to locate the position of the answer.

Taking a graduation certificate as an example, the key information to be extracted should include: name, graduation date, graduating institution, degree level, date of birth, length of schooling, and so on. Questions can then be posed accordingly:

A: What is the student's name?

B: What is the student's graduating institution?

……
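The question list can be generated mechanically from the key-information fields. The following is a minimal sketch; the field names and the question template are hypothetical and only mirror the examples A and B above.

```python
# Hypothetical key-information fields for a graduation certificate.
KEY_FIELDS = [
    "name",
    "graduation date",
    "graduating institution",
    "degree level",
    "date of birth",
    "length of schooling",
]

def build_questions(fields, subject="student"):
    """Map each key-information field to a simple "what is" question."""
    return {field: f"What is the {subject}'s {field}?" for field in fields}

questions = build_questions(KEY_FIELDS)
for field, question in questions.items():
    print(field, "->", question)
```

Keeping the mapping from field to question makes it straightforward to attach each located answer back to the field it belongs to.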

In addition, in step S108, locating the named entity to be obtained based on the posed question may comprise: determining, by means of a matching neural network model, the identifier of the text segment corresponding to the posed question, wherein the matching neural network model is obtained by machine-learning training on multiple groups of data, each group of data including a posed question and the identifier of the text segment corresponding to that posed question; and extracting the named entity to be obtained according to the identifier of the text segment.

For example, a model similar to Match-LSTM can be used to understand the text and locate the segment relevant to the question. Certificate text is extremely terse, with one item of content per segment and segments separated by commas. Accordingly, the text segments are numbered in the order in which they appear in the text, and the final output is the number of the segment relevant to the question.

The training process of the matching neural network model is similar to that of Match-LSTM and has four steps. First, the question and the original text are embedded to generate word vectors. Second, a bidirectional LSTM is used to encode the question and the original text. Third, the attention distribution of each word of the original text over the question is computed and used to summarize the question representation; the representation of the original-text word and the corresponding question representation are fed into another LSTM layer for encoding, giving a query-aware representation of that word. Fourth, a further attention layer is added to obtain the vector representation of the text. Finally, a softmax layer is used to compute the probability P_i of each word, and the optimization objective is to maximize the product of the probabilities of the words in the target segment. In the loss formula, l denotes the loss function, k denotes the number of the text segment, and i denotes the i-th word in the segment. The loss function is mainly used to optimize the parameters of the functions in the network layers of the matching neural network model. Note that because certificate text is relatively short and the named entities are fairly conspicuous, there is no need to locate a start position. That is, the training process of the matching neural network model is similar to that of Match-LSTM but differs in the final output: the target can be found directly once the corresponding position is found, without locating a start position.
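The objective stated in words above, maximizing the product of the probabilities of the words in the target segment, is equivalent to minimizing the negative sum of their log-probabilities. The sketch below assumes that negative-log form; it is one common way to write such a loss and is not claimed to be the exact formula of the patent. The function name, the `segments` mapping, and the example probabilities are hypothetical.

```python
import math

def segment_loss(word_probs, segments, k):
    """Negative log of the product of the probabilities of the words in segment k.

    word_probs[i] is the model probability P_i for word i;
    segments maps a segment number to the indices of its words.
    """
    return -sum(math.log(word_probs[i]) for i in segments[k])

# Hypothetical per-word probabilities and a two-segment split.
word_probs = [0.9, 0.8, 0.1, 0.2]
segments = {1: [0, 1], 2: [2, 3]}

loss_target_1 = segment_loss(word_probs, segments, k=1)
loss_target_2 = segment_loss(word_probs, segments, k=2)
print(loss_target_1, loss_target_2)
```

When segment 1 is the target, the high probabilities of its words give a small loss; if segment 2 were the target, the same model output would be penalized much more heavily.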

Embedding refers to the embedding layer in the network structure, which mainly converts positive integers into vectors of fixed size. There are two reasons for using an embedding layer: (1) vectors encoded with the one-hot method are very high-dimensional and very sparse. Suppose a natural language processing task involves a dictionary of 2000 words: with one-hot encoding, each word is represented by a vector of 2000 integers, 1999 of which are 0; (2) during neural network training, each embedded vector is updated.
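The contrast between the two encodings can be shown with the 2000-word dictionary from the paragraph above. This is a minimal sketch: the `Embedding` class is a hypothetical stand-in for a trainable lookup table, its values are random and untrained, and the embedding size of 16 is an arbitrary choice.

```python
import random

def one_hot(index, vocab_size):
    """Sparse encoding: a single 1 at the word's index, 0 everywhere else."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

class Embedding:
    """Lookup table mapping an integer word index to a dense fixed-size vector."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # In a real network these rows are parameters updated during training.
        self.table = [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
                      for _ in range(vocab_size)]

    def __call__(self, index):
        return self.table[index]

vocab_size = 2000
sparse = one_hot(7, vocab_size)
dense = Embedding(vocab_size, dim=16)(7)
print(len(sparse), sparse.count(0), len(dense))
```

The one-hot vector has 2000 entries of which 1999 are zero, while the embedded vector has only 16 dense entries that training can adjust.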

The softmax function, also known as the normalized exponential function, is in mathematics, especially probability theory and related fields, the gradient-log-normalization of a finite discrete probability distribution. It "compresses" a K-dimensional vector Z of arbitrary real numbers into another K-dimensional real vector in which every element lies in the interval (0, 1) and all elements sum to 1.
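A minimal sketch of the softmax function follows. Subtracting the maximum score before exponentiating is a standard numerical-stability measure that does not change the result; the example scores are arbitrary.

```python
import math

def softmax(scores):
    """Map K real scores to K values in (0, 1) that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Larger scores receive larger probabilities, and the ordering of the inputs is preserved in the outputs.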

As an optional embodiment, before the named entity to be obtained is located based on the posed question, the method for recognizing named entities may further comprise: recognizing the text corresponding to the text image to obtain multiple text segments; and adding identifiers to the multiple text segments based on a predetermined rule; wherein recognizing the text corresponding to the text image to obtain multiple text segments comprises: identifying predetermined punctuation marks in the text; and recognizing the text corresponding to the text image according to the predetermined punctuation marks to obtain the multiple text segments.
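Splitting the text at predetermined punctuation marks and numbering the segments in order can be sketched as follows. The punctuation set, the use of consecutive integers starting at 1 as identifiers, and the function name are assumptions made for illustration.

```python
import re

def number_segments(text, punctuation=",，;；。"):
    """Split text at the given punctuation marks and number the segments from 1."""
    pattern = "[" + re.escape(punctuation) + "]"
    parts = [part.strip() for part in re.split(pattern, text)]
    non_empty = [part for part in parts if part]
    return {number: part for number, part in enumerate(non_empty, start=1)}

segments = number_segments("Student Zhang San, male, born in May 1996")
print(segments)
```

The matching model then only has to output one of these segment numbers for each posed question.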

In addition, because certificate text is terse and the extracted named entities are the target content, locating the text segment in the text content based on the posed question, as described above, is enough to find the core answer to the question. That is, the position of the answer to the posed question is located first, and then the named entity at that position is extracted.
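The "locate first, then extract" idea can be sketched as a simple lookup once the segment number is known. In this sketch the segment number for a question is supplied directly, standing in for the output of the matching neural network model; the segment texts, the entity set, and the function name are hypothetical.

```python
def entities_in_segment(segments, entity_set, segment_id):
    """Return the named entities whose text occurs in the located segment."""
    text = segments[segment_id]
    return [(surface, kind) for surface, kind in entity_set if surface in text]

# Hypothetical numbered segments and named entity set for one certificate.
segments = {
    1: "Student Zhang San",
    2: "graduated from Peking University",
    3: "June 2018",
}
entity_set = [
    ("Zhang San", "PER"),
    ("Peking University", "ORG"),
    ("June 2018", "TIME"),
]

# Assume the matching model returned segment 1 for "What is the student's name?".
answer = entities_in_segment(segments, entity_set, segment_id=1)
print(answer)
```

Restricting the search to the located segment is what ties each recognized entity to the question it answers, so that a person name is returned for the name question and not an institution or a date.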

With the method for recognizing named entities provided by the embodiments of the present invention, the font information of the text image can be extracted and, combined with that font information, a Bi-LSTM+CRF model can be used for named entity recognition to extract named entities such as times, person names, institution names, and place names from the text; "questions" are posed whose answers are the key information; a Bi-LSTM+Attention model is then used to understand the text and predict the sentence relevant to the question; and the named entity in the relevant sentence is matched as the answer. To address the problem of extracting text information after recognizing today's certificates with their varied content, the approach combines the font information of the text with currently popular deep learning methods to perform named entity recognition, so that both the spatial information and the contextual information of the text are considered. Text extraction is then converted into simple "what is" questions of the kind found in reading comprehension, and a model construction method similar to Match-LSTM is proposed that no longer predicts the start of the answer or the answer words, but instead locates the position of the answer segment after the text has been segmented by punctuation. Information is extracted by combining the text position with named entity recognition.

Embodiment 2

According to an embodiment of the present invention, a device for recognizing named entities is further provided. It should be noted that the device of this embodiment can be used to execute the method for recognizing named entities provided by the embodiments of the present invention. The device for recognizing named entities provided by the embodiments of the present invention is introduced below.

Fig. 2 is a schematic diagram of the device for recognizing named entities according to an embodiment of the present invention. As shown in Fig. 2, the device may include: an extraction unit 21, a first acquisition unit 23, a second acquisition unit 25, and a third acquisition unit 27. The device is described in detail below.

The extraction unit 21 is configured to perform information extraction on a text image using a convolutional neural network (CNN) model to obtain the font vector corresponding to the text in the text image.

The first acquisition unit 23 is connected to the extraction unit 21 and is configured to concatenate the font vector with the text vector corresponding to the text and to obtain a feature vector from the resulting concatenated vector.

The second acquisition unit 25 is connected to the first acquisition unit 23 and is configured to obtain a named entity set from the feature vector, wherein the named entity set includes multiple named entities.

The third acquisition unit 27 is connected to the second acquisition unit 25 and is configured to construct a posed question corresponding to the text image and to locate, based on the posed question, the named entity to be obtained, wherein the named entity to be obtained belongs to the named entity set.

It should be noted that the extracting unit 21 in this embodiment can be used to execute step S102 in the embodiment of the present invention, the first acquisition unit 23 can be used to execute step S104, the second acquisition unit 25 can be used to execute step S106, and the third acquisition unit 27 can be used to execute step S108. The above modules implement the same examples and application scenarios as the corresponding steps, but are not limited to what is disclosed in the above embodiment.

In this embodiment, the extracting unit 21 may perform information extraction on a character image using a convolutional neural network (CNN) model to obtain the font vector corresponding to the text in the character image; the first acquisition unit 23 then splices the font vector with the text vector corresponding to the text and obtains a feature vector according to the spliced vector obtained by the splicing; the second acquisition unit 25 obtains a named entity set according to the feature vector, wherein the named entity set includes a plurality of named entities; and the third acquisition unit 27 constructs a rhetorical question corresponding to the character image and locates the named entity to be obtained based on the rhetorical question, wherein the named entity to be obtained belongs to the named entity set. In the related art, certificates come in many kinds: certificates issued by different organizations differ in form and content, and even certificates issued by the same organization at different times or by different departments are not quite the same in content and form. This poses a problem for traditional certificate recognition: even when the text of a certificate has been extracted, the information still cannot be matched. By contrast, the named entity recognition device provided by the embodiment of the present invention splices the font vector corresponding to the extracted font information with the text vector corresponding to the text information to obtain a spliced vector, and obtains the named entity set according to the spliced vector, so that both the spatial information and the contextual information of the text are considered. This improves the recognition efficiency of effective information, and thereby solves the technical problem in the related art that the information identified by applying traditional information extraction to some files is not usable information.

As an optional embodiment, the font vector is an N*1-dimensional vector and the text vector is an M*1-dimensional vector, where N denotes the number of font attributes of the text corresponding to the font vector, and M denotes the number of word attributes of the text in the text vector.

As an optional embodiment, the first acquisition unit includes: a splicing module, configured to splice the font vector of dimension N*1 with the text vector of dimension M*1 to obtain a spliced vector of dimension (N+M)*1; a first determining module, configured to use the spliced vector of dimension (N+M)*1 as the input of a bidirectional long short-term memory (Bi-LSTM) network model; a first obtaining module, configured to obtain the output of the Bi-LSTM network model; and a second obtaining module, configured to obtain the feature vector according to the output, wherein the feature vector is a vector of dimension 2(N+M)*1.
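The dimension bookkeeping of the splicing step can be illustrated with a minimal Python sketch. It is not the patented implementation: a toy tanh recurrent cell stands in for the LSTM cell, the weights are random rather than trained, and the hidden size of N+M per direction is an assumption chosen so that the concatenated forward and backward states have dimension 2(N+M), as stated above. All function names are hypothetical.

```python
import math
import random


def splice(font_vec, text_vec):
    """Concatenate an N-dim font vector with an M-dim text vector into (N+M) dims."""
    return list(font_vec) + list(text_vec)


def make_cell(dim, rng):
    """Random weights for a toy tanh recurrent cell (stand-in for an LSTM cell)."""
    w_in = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    w_h = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    return w_in, w_h


def run_direction(seq, cell):
    """Run the toy cell over the sequence, returning one hidden state per position."""
    w_in, w_h = cell
    dim = len(w_in)
    h = [0.0] * dim
    states = []
    for x in seq:
        h = [math.tanh(sum(w_in[i][j] * x[j] for j in range(dim))
                       + sum(w_h[i][j] * h[j] for j in range(dim)))
             for i in range(dim)]
        states.append(h)
    return states


def bidirectional_features(seq, fwd_cell, bwd_cell):
    """Concatenate forward and backward states: each feature has 2*(N+M) dims."""
    fwd = run_direction(seq, fwd_cell)
    bwd = run_direction(seq[::-1], bwd_cell)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


# Usage: N = 3 font attributes, M = 4 word attributes, 5 characters of text.
rng = random.Random(0)
sequence = [splice([0.1] * 3, [0.2] * 4) for _ in range(5)]
features = bidirectional_features(sequence, make_cell(7, rng), make_cell(7, rng))
print(len(sequence[0]), len(features[0]))
```

In a real system the two directions would be trained LSTM layers; the sketch only shows why an (N+M)-dimensional input leads to a 2(N+M)-dimensional feature vector per character.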

As an optional embodiment, the second acquisition unit includes: a second determining module, configured to use the feature vector as the input of a conditional random field (CRF) model; a third obtaining module, configured to obtain the output of the CRF model; and a fourth obtaining module, configured to obtain the named entity set according to the output of the CRF model.
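How a named entity set might be read off the CRF output can be sketched as follows. The sketch assumes that the CRF is decoded with the standard Viterbi algorithm and that its labels follow a BIO scheme (B-PER, I-PER, O and so on); neither detail is stated in the source, and the scores, tag names and function names are hypothetical.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag path.

    emissions: one dict {tag: score} per character.
    transitions: dict {(prev_tag, cur_tag): score}; missing pairs score 0.
    """
    best = [{t: (emissions[0][t], None) for t in tags}]
    for i in range(1, len(emissions)):
        column = {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[i - 1][p][0]
                       + transitions.get((p, cur), 0.0))
            score = (best[i - 1][prev][0] + transitions.get((prev, cur), 0.0)
                     + emissions[i][cur])
            column[cur] = (score, prev)
        best.append(column)
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(emissions) - 1, 0, -1):
        path.append(best[i][path[-1]][1])  # follow back-pointers
    return path[::-1]


def collect_entities(chars, labels):
    """Group characters labelled B-X / I-X into (entity text, entity type) pairs."""
    entities, current, kind = [], [], None
    for ch, lab in zip(chars, labels):
        if lab.startswith("B-"):
            if current:
                entities.append(("".join(current), kind))
            current, kind = [ch], lab[2:]
        elif lab.startswith("I-") and current and lab[2:] == kind:
            current.append(ch)
        else:
            if current:
                entities.append(("".join(current), kind))
            current, kind = [], None
    if current:
        entities.append(("".join(current), kind))
    return entities
```

The pairs returned by `collect_entities` play the role of the named entity set; the emission scores would come from the feature vectors and the transition scores from the trained CRF.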

As an optional embodiment, the third acquisition unit includes: an abstraction module, configured to extract the key information of the text corresponding to the character image, wherein the key information is a feature word that has an association with the named entity; and a third determining module, configured to use the key information as the rhetorical question.

As an optional embodiment, the third acquisition unit includes: a fourth determining module, configured to determine, by means of a matching neural network model, the identifier of the text segment corresponding to the rhetorical question, wherein the matching neural network model is obtained through machine learning training using multiple groups of data, and each group of data includes a rhetorical question and the identifier of the text segment corresponding to that rhetorical question; and an extraction module, configured to extract the named entity to be obtained according to the identifier of the text segment.
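The locate-then-extract flow can be sketched in Python. The trained matching neural network model is not reproduced here: a simple word-overlap score is swapped in as a stand-in so that the example runs, and the segment texts, question and entity list are invented purely for illustration.

```python
def locate_segment(question, segments):
    """Stand-in for the matching model: return the identifier of the segment
    that shares the most words with the question."""
    q_words = set(question.lower().split())
    return max(segments,
               key=lambda sid: len(q_words & set(segments[sid].lower().split())))


def extract_entities(segment_id, segments, entity_set):
    """Keep only the named entities that occur in the located segment."""
    segment = segments[segment_id]
    return [entity for entity in entity_set if entity in segment]


# Usage with hypothetical data: identifiers map to text segments.
segments = {
    1: "this certifies that Zhang San",
    2: "issued by ABC Institute",
    3: "date of issue 2018-11-09",
}
entity_set = ["Zhang San", "ABC Institute", "2018-11-09"]
segment_id = locate_segment("issued by", segments)
print(segment_id, extract_entities(segment_id, segments, entity_set))
```

Restricting the answer to entities already in the named entity set reflects the constraint that the named entity to be obtained belongs to that set.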

As an optional embodiment, the named entity recognition device further includes: a fourth acquisition unit, configured to mark the text corresponding to the character image to obtain a plurality of text segments before the named entity to be obtained is located based on the rhetorical question; and an adding unit, configured to add identifiers to the plurality of text segments based on a predetermined rule. The fourth acquisition unit includes: an identification module, configured to identify predetermined punctuation marks in the text; and a fifth obtaining module, configured to mark the text corresponding to the character image according to the predetermined punctuation marks to obtain the plurality of text segments.
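The segmentation and identifier-adding steps can be sketched in Python. The particular punctuation set and the use of sequential integers as identifiers are assumptions; the source only speaks of "predetermined punctuation marks" and a "predetermined rule".

```python
import re


def segment_text(text, punctuation="，。；：、,.;:"):
    """Split text at the given punctuation marks and number the segments from 1."""
    pattern = "[" + re.escape(punctuation) + "]"
    pieces = [piece.strip() for piece in re.split(pattern, text)]
    non_empty = (piece for piece in pieces if piece)
    return {identifier: piece for identifier, piece in enumerate(non_empty, start=1)}


segments = segment_text("Issued to Zhang San, born 1990. Awarded by ABC Institute;")
print(segments)
```

The resulting mapping from identifier to text segment is the kind of structure a matching model could be trained against, since each training group pairs a question with a segment identifier.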

The above named entity recognition device includes a processor and a memory. The extracting unit 21, the first acquisition unit 23, the second acquisition unit 25 and the third acquisition unit 27 are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.

The processor includes a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, a rhetorical question corresponding to the character image is constructed, and the named entity to be obtained is located based on the rhetorical question, wherein the named entity to be obtained belongs to the named entity set.

The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.

According to another aspect of the embodiment of the present invention, a storage medium is also provided. The storage medium includes a stored program, wherein the program executes the named entity recognition method of any one of the above.

According to another aspect of the embodiment of the present invention, a processor is also provided. The processor is configured to run a program, wherein the named entity recognition method of any one of the above is executed when the program runs.

An embodiment of the present invention also provides a device. The device includes a processor, a memory, and a program stored in the memory and executable on the processor. When executing the program, the processor implements the following steps: performing information extraction on a character image using a convolutional neural network (CNN) model to obtain the font vector corresponding to the text in the character image; splicing the font vector with the text vector corresponding to the text, and obtaining a feature vector according to the spliced vector obtained by the splicing; obtaining a named entity set according to the feature vector, wherein the named entity set includes a plurality of named entities; and constructing a rhetorical question corresponding to the character image, and locating the named entity to be obtained based on the rhetorical question, wherein the named entity to be obtained belongs to the named entity set.

An embodiment of the present invention also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: performing information extraction on a character image using a convolutional neural network (CNN) model to obtain the font vector corresponding to the text in the character image; splicing the font vector with the text vector corresponding to the text, and obtaining a feature vector according to the spliced vector obtained by the splicing; obtaining a named entity set according to the feature vector, wherein the named entity set includes a plurality of named entities; and constructing a rhetorical question corresponding to the character image, and locating the named entity to be obtained based on the rhetorical question, wherein the named entity to be obtained belongs to the named entity set.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of units or modules may be electrical or in other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.

The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A named entity recognition method, characterized by comprising:

Performing information extraction on a character image using a convolutional neural network (CNN) model to obtain a font vector corresponding to the text in the character image;

Splicing the font vector with a text vector corresponding to the text, and obtaining a feature vector according to the spliced vector obtained by the splicing;

Obtaining a named entity set according to the feature vector, wherein the named entity set includes a plurality of named entities;

Constructing a rhetorical question corresponding to the character image, and locating, based on the rhetorical question, the named entity to be obtained, wherein the named entity to be obtained belongs to the named entity set.

2. The method according to claim 1, characterized in that the font vector is an N*1-dimensional vector and the text vector is an M*1-dimensional vector, wherein N denotes the number of font attributes of the text corresponding to the font vector, and M denotes the number of word attributes of the text in the text vector.

3. The method according to claim 2, characterized in that splicing the font vector with the text vector corresponding to the text, and obtaining the feature vector according to the spliced vector obtained by the splicing, comprises:

Splicing the font vector of dimension N*1 with the text vector of dimension M*1 to obtain a spliced vector of dimension (N+M)*1;

Using the spliced vector of dimension (N+M)*1 as the input of a bidirectional long short-term memory (Bi-LSTM) network model;

Obtaining the output of the Bi-LSTM network model;

Obtaining the feature vector according to the output, wherein the feature vector is a vector of dimension 2(N+M)*1.

4. The method according to claim 1, characterized in that obtaining the named entity set according to the feature vector comprises:

Using the feature vector as the input of a conditional random field (CRF) model;

Obtaining the output of the CRF model;

Obtaining the named entity set according to the output of the CRF model.

5. The method according to claim 1, characterized in that constructing the rhetorical question corresponding to the character image comprises:

Extracting the key information of the text corresponding to the character image, wherein the key information is a feature word that has an association with the named entity;

Using the key information as the rhetorical question.

6. The method according to claim 1, characterized in that locating, based on the rhetorical question, the named entity to be obtained comprises:

Determining, by means of a matching neural network model, the identifier of the text segment corresponding to the rhetorical question, wherein the matching neural network model is obtained through machine learning training using multiple groups of data, and each group of data in the multiple groups of data includes a rhetorical question and the identifier of the text segment corresponding to that rhetorical question;

Extracting the named entity to be obtained according to the identifier of the text segment.

7. The method according to claim 6, characterized in that, before locating, based on the rhetorical question, the named entity to be obtained, the method further comprises:

Marking the text corresponding to the character image to obtain a plurality of text segments;

Adding identifiers to the plurality of text segments based on a predetermined rule;

Wherein marking the text corresponding to the character image to obtain the plurality of text segments comprises:

Identifying predetermined punctuation marks in the text;

Marking the text corresponding to the character image according to the predetermined punctuation marks to obtain the plurality of text segments.

8. A named entity recognition device, characterized by comprising:

An extracting unit, configured to perform information extraction on a character image using a convolutional neural network (CNN) model to obtain a font vector corresponding to the text in the character image;

A first acquisition unit, configured to splice the font vector with a text vector corresponding to the text, and to obtain a feature vector according to the spliced vector obtained by the splicing;

A second acquisition unit, configured to obtain a named entity set according to the feature vector, wherein the named entity set includes a plurality of named entities;

A third acquisition unit, configured to construct a rhetorical question corresponding to the character image, and to locate, based on the rhetorical question, the named entity to be obtained, wherein the named entity to be obtained belongs to the named entity set.

9. A storage medium, characterized in that the storage medium includes a stored program, wherein the program executes the named entity recognition method according to any one of claims 1 to 7.

10. A processor, characterized in that the processor is configured to run a program, wherein the named entity recognition method according to any one of claims 1 to 7 is executed when the program runs.