CN106991085A - Abbreviation generation method and device for an entity - Google Patents

Abbreviation generation method and device for an entity

Info

Publication number
CN106991085A
CN106991085A (application CN201710212978.8A)
Authority
CN
China
Prior art keywords
word
abbreviation
entity
initial
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710212978.8A
Other languages
Chinese (zh)
Other versions
CN106991085B (en)
Inventor
刘华杰
周寅
封令爽
盛丽晔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd (ICBC)
Priority to CN201710212978.8A
Publication of CN106991085A
Application granted
Publication of CN106991085B
Legal status: Active
Anticipated expiration


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/279 — Recognition of textual entities
    • G06F 40/289 — Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 — Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an abbreviation generation method and device for an entity, relating to the field of deep learning technology in computer information systems. The method includes: obtaining the full-name data of an entity, performing word-segmentation preprocessing on it to split the full name into multiple words, and generating word-frequency codes representing each word's frequency of occurrence in a preset corpus together with part-of-speech tagging information representing each word's attributes; generating a first initial abbreviation and a second initial abbreviation from the preprocessed words, word-frequency codes and part-of-speech tagging information, through a pre-trained first deep learning model and a pre-trained second deep learning model respectively; applying preset verification and correction rules to the first and second initial abbreviations to generate a first abbreviation result and a second abbreviation result respectively; and comparing the first abbreviation result with the second abbreviation result and generating the final abbreviation of the entity's full name according to the comparison result.

Description

Abbreviation generation method and device for an entity
Technical field
The present invention relates to the field of deep learning technology in computer information systems, and in particular to an abbreviation generation method and device for an entity.
Background technology
At present, with the widespread adoption and development of Internet and computer information technology, the Internet and computing have entered the big-data era. In this era, the number of text-based reports and evaluations of all kinds of entities (such as enterprises, government bodies and social organizations) keeps growing, and an entity needs to collect and recognize the news information associated with its own name (such as an enterprise name or a government body name), which is then applied in scenarios such as business risk identification and public opinion analysis. The currently common collection and recognition approach is to gather all kinds of news reports in full, identify entity names (enterprise names, government body names, social organization names) in the collected text, and then mark the corresponding reports as associated news. In much news reporting, however, for reasons such as conciseness of writing, the media often refer to an entity by its abbreviation. An abbreviation is an appellation composed of representative words extracted from the original term, i.e. the full name (for example, the People's Republic of China is abbreviated as "China"; the Industrial and Commercial Bank of China is abbreviated as "Industrial and Commercial Bank"). Thus, when collecting and analyzing large volumes of information, judging whether a news item relates to an entity requires accurate judgment of abbreviations.
Therefore, a method for intelligently generating abbreviations is urgently needed, so that data related to an entity's abbreviation can be accurately collected from network data such as news reports of all kinds.
Summary of the invention
Embodiments of the invention provide an abbreviation generation method and device for an entity, to solve the problems of the prior art that abbreviation generation has low accuracy and non-unique results.
To achieve the above purpose, the present invention adopts the following technical scheme:
An abbreviation generation method for an entity, including:
obtaining the full-name data of an entity, performing word-segmentation preprocessing on the full-name data to split it into multiple words, and generating word-frequency codes representing each word's frequency of occurrence in a preset corpus and part-of-speech tagging information representing each word's attributes;
generating a first initial abbreviation and a second initial abbreviation from the preprocessed words, the word-frequency codes and the part-of-speech tagging information, through a pre-trained first deep learning model and a pre-trained second deep learning model respectively;
applying preset verification and correction rules to the first initial abbreviation and the second initial abbreviation, generating a first abbreviation result and a second abbreviation result respectively;
comparing the first abbreviation result with the second abbreviation result, and generating the final abbreviation of the entity's full-name data according to the comparison result.
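The four-step method above can be sketched as a minimal pipeline. Everything here is illustrative rather than authoritative: the names `tokenize`, `model_a`, `model_b`, `correct` and `AbbrResult` are stand-ins, and the model internals are left to the patent's later figures.

```python
from dataclasses import dataclass

@dataclass
class AbbrResult:
    text: str           # the (initial or corrected) abbreviation
    probability: float  # generation probability reported by the model

def generate_abbreviation(full_name, tokenize, model_a, model_b, correct):
    # Step 1: split the full name into words; word-frequency codes and
    # part-of-speech tags would be attached to each word at this point.
    words = tokenize(full_name)
    # Step 2: two independently trained deep learning models each
    # propose an initial abbreviation with a generation probability.
    init_a, init_b = model_a(words), model_b(words)
    # Step 3: apply the preset verification/correction rules to each.
    res_a, res_b = correct(init_a, words), correct(init_b, words)
    # Step 4: identical results are final; otherwise keep the result
    # whose model reported the higher generation probability.
    if res_a.text == res_b.text:
        return res_a.text
    return res_a.text if res_a.probability >= res_b.probability else res_b.text
```

The sketch makes the control flow concrete while leaving each component replaceable, which mirrors how the method and the device claims below factor the same four steps into separate units.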
Specifically, obtaining the full-name data of the entity, performing word-segmentation preprocessing, splitting the full name into multiple words, and generating the word-frequency codes representing each word's frequency of occurrence in the preset corpus and the part-of-speech tagging information representing each word's attributes includes:
obtaining, from a preset word-frequency information table, the frequency corresponding to each word obtained by splitting the entity's full-name data;
arranging the words in descending order of frequency according to each word's frequency, and generating the word-frequency code corresponding to each word;
storing the word-frequency code corresponding to each word in a preset word-frequency code table.
Further, obtaining the full-name data of the entity, performing word-segmentation preprocessing, splitting the full name into multiple words, and generating the word-frequency codes and the part-of-speech tagging information also includes:
determining, according to a preset prefix-and-suffix dictionary, region dictionary, industry dictionary and keyword dictionary, the dictionary to which each word of the split full-name data belongs, performing part-of-speech tagging on each word of the split full-name data, and generating the part-of-speech tagging information; the part-of-speech tagging information includes prefix-and-suffix words, region words, industry words and keywords.
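As a hedged sketch of this dictionary lookup, assuming each lexicon is a set of words keyed by its tag name (the tag names and the fallback to "keyword" for words found in no dictionary are illustrative assumptions, not stated in the patent):

```python
def tag_words(words, lexicons):
    """Tag each word with the preset dictionary it belongs to.

    `lexicons` maps a tag ('prefix_suffix', 'region', 'industry',
    'keyword') to a set of words.
    """
    tags = []
    for word in words:
        for tag, vocab in lexicons.items():
            if word in vocab:
                tags.append(tag)
                break
        else:
            # Fallback for words found in no dictionary (an assumption).
            tags.append("keyword")
    return tags
```

The first matching dictionary wins, so the iteration order of `lexicons` doubles as a tag-priority order.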
Further, the abbreviation generation method for an entity also includes:
performing machine learning training on the first deep learning model and the second deep learning model respectively according to a preset training corpus; the training corpus includes preset entity full-name data and the abbreviation corresponding to each full name.
Specifically, applying the preset verification and correction rules to the first initial abbreviation and the second initial abbreviation to generate the first abbreviation result and the second abbreviation result respectively includes:
judging whether two adjacent single-character words exist among the preprocessed words;
if two adjacent single-character words exist among the preprocessed words and the first initial abbreviation is the word formed by those two adjacent single characters, determining the first initial abbreviation to be the first abbreviation result; if two adjacent single-character words exist among the preprocessed words and the second initial abbreviation is the word formed by those two adjacent single characters, determining the second initial abbreviation to be the second abbreviation result;
if two adjacent single-character words exist among the preprocessed words but the first initial abbreviation is not the word formed by them, obtaining the lowest-frequency preprocessed word according to the word-frequency codes, and composing the two adjacent single-character words and that lowest-frequency word into the first abbreviation result; if two adjacent single-character words exist among the preprocessed words but the second initial abbreviation is not the word formed by them, obtaining the lowest-frequency preprocessed word according to the word-frequency codes, and composing the two adjacent single-character words and that lowest-frequency word into the second abbreviation result.
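A minimal sketch of this adjacent-single-character rule, under two stated assumptions: the word-frequency code is the descending-frequency rank (1 = most frequent, so the largest code marks the rarest word), and the fallback result is ordered "pair + rarest word" (the patent only says the pieces are composed together):

```python
def apply_adjacent_pair_rule(initial_abbr, words, freq_code):
    """Check an initial abbreviation against the adjacent-pair rule."""
    pair = None
    for left, right in zip(words, words[1:]):
        if len(left) == 1 and len(right) == 1:
            pair = left + right  # word formed by two adjacent single chars
            break
    if pair is None or initial_abbr == pair:
        # Rule does not apply, or the model already produced the pair.
        return initial_abbr
    rarest = max(words, key=lambda w: freq_code[w])  # lowest-frequency word
    return pair + rarest
```

The same function would be applied once to the first initial abbreviation and once to the second, matching the symmetric wording of the rule.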
Specifically, applying the preset verification and correction rules to the first initial abbreviation and the second initial abbreviation to generate the first abbreviation result and the second abbreviation result respectively includes:
judging whether the first initial abbreviation or the second initial abbreviation is a single character;
if the first initial abbreviation is a single character, obtaining the two lowest-frequency preprocessed words according to the word-frequency codes and composing the single character and those two words into the first abbreviation result; if the second initial abbreviation is a single character, obtaining the two lowest-frequency preprocessed words according to the word-frequency codes and composing the single character and those two words into the second abbreviation result.
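This single-character rule can be sketched in the same style; the ordering of the padded result and the rank interpretation of the word-frequency code are, again, illustrative assumptions:

```python
def pad_single_char_abbr(initial_abbr, words, freq_code):
    """If the initial abbreviation is one character, extend it with the
    two lowest-frequency words of the preprocessed full name."""
    if len(initial_abbr) != 1:
        return initial_abbr
    # Largest frequency codes mark the rarest words.
    rarest_two = sorted(words, key=lambda w: freq_code[w], reverse=True)[:2]
    return initial_abbr + "".join(rarest_two)
```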
Specifically, applying the preset verification and correction rules to the first initial abbreviation and the second initial abbreviation to generate the first abbreviation result and the second abbreviation result respectively includes:
judging whether the length of the first initial abbreviation or the second initial abbreviation exceeds a preset length threshold;
if the length of the first initial abbreviation exceeds the preset length threshold, arranging the words in the first initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and deleting the last word of the descending arrangement repeatedly until the length of the first initial abbreviation is less than or equal to the preset length threshold, with the trimmed first initial abbreviation serving as the first abbreviation result;
if the length of the second initial abbreviation exceeds the preset length threshold, arranging the words in the second initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and deleting the last word of the descending arrangement repeatedly until the length of the second initial abbreviation is less than or equal to the preset length threshold, with the trimmed second initial abbreviation serving as the second abbreviation result.
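A sketch of this length rule, assuming a single `score` per word standing in for the combined POS/TF-IDF weight (how the patent combines the two is not specified), and assuming the surviving words keep their original order in the final abbreviation:

```python
def enforce_length_threshold(abbr_words, score, max_len):
    """Trim an over-long abbreviation by deleting its lowest-scoring
    words from the tail of a descending-score arrangement."""
    ranked = sorted(abbr_words, key=lambda w: score[w], reverse=True)
    while ranked and sum(len(w) for w in ranked) > max_len:
        ranked.pop()  # delete the last word of the descending arrangement
    kept = set(ranked)
    # Reassemble the survivors in their original word order (an assumption).
    return "".join(w for w in abbr_words if w in kept)
```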
Further, after generating the first initial abbreviation and the second initial abbreviation, the method also includes:
storing the first abbreviation result and the second abbreviation result in the training corpus, to update the training corpus.
Specifically, comparing the first abbreviation result with the second abbreviation result and generating the final abbreviation of the entity's full-name data according to the comparison result includes:
comparing the first abbreviation result with the second abbreviation result, and generating a comparison result;
if the comparison result is that the first abbreviation result and the second abbreviation result are inconsistent, obtaining the first generation probability with which the first deep learning model output its abbreviation and the second generation probability with which the second deep learning model output its abbreviation;
comparing the first generation probability with the second generation probability, and selecting the abbreviation result corresponding to the higher of the two probabilities as the final abbreviation of the entity's full-name data.
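The final-selection step reduces to a few lines; the tie-break toward the first result when the probabilities are equal is an assumption (the patent only speaks of selecting the higher value):

```python
def pick_final_abbreviation(result_a, prob_a, result_b, prob_b):
    """Agreement between the two corrected results wins outright;
    otherwise the result with the higher generation probability is kept."""
    if result_a == result_b:
        return result_a
    return result_a if prob_a >= prob_b else result_b
```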
An abbreviation generating device for an entity, including:
a word-segmentation preprocessing unit, for obtaining the full-name data of an entity, performing word-segmentation preprocessing on the full-name data to split it into multiple words, and generating word-frequency codes representing each word's frequency of occurrence in a preset corpus and part-of-speech tagging information representing each word's attributes;
an initial abbreviation generation unit, for generating a first initial abbreviation and a second initial abbreviation from the preprocessed words, word-frequency codes and part-of-speech tagging information, through a pre-trained first deep learning model and a pre-trained second deep learning model respectively;
a verification and correction processing unit, for applying preset verification and correction rules to the first initial abbreviation and the second initial abbreviation, generating a first abbreviation result and a second abbreviation result respectively;
a final abbreviation generation unit, for comparing the first abbreviation result with the second abbreviation result and generating the final abbreviation of the entity's full-name data according to the comparison result.
In addition, the word-segmentation preprocessing unit is specifically configured to:
obtain, from a preset word-frequency information table, the frequency corresponding to each word of the split full-name data;
arrange the words in descending order of frequency according to each word's frequency, and generate the word-frequency code corresponding to each word;
store the word-frequency code corresponding to each word in a preset word-frequency code table.
In addition, the word-segmentation preprocessing unit is further configured to:
determine, according to the preset prefix-and-suffix dictionary, region dictionary, industry dictionary and keyword dictionary, the dictionary to which each word of the split full-name data belongs, perform part-of-speech tagging on each word of the split full-name data, and generate the part-of-speech tagging information; the part-of-speech tagging information includes prefix-and-suffix words, region words, industry words and keywords.
Further, the abbreviation generating device for an entity also includes:
a machine learning training unit, for performing machine learning training on the first deep learning model and the second deep learning model respectively according to a preset training corpus; the training corpus includes preset entity full-name data and the abbreviation corresponding to each full name.
In addition, the verification and correction processing unit is specifically configured to:
judge whether two adjacent single-character words exist among the preprocessed words;
if two adjacent single-character words exist among the preprocessed words and the first initial abbreviation is the word formed by those two adjacent single characters, determine the first initial abbreviation to be the first abbreviation result; if two adjacent single-character words exist among the preprocessed words and the second initial abbreviation is the word formed by those two adjacent single characters, determine the second initial abbreviation to be the second abbreviation result;
if two adjacent single-character words exist among the preprocessed words but the first initial abbreviation is not the word formed by them, obtain the lowest-frequency preprocessed word according to the word-frequency codes, and compose the two adjacent single-character words and that lowest-frequency word into the first abbreviation result; if two adjacent single-character words exist among the preprocessed words but the second initial abbreviation is not the word formed by them, obtain the lowest-frequency preprocessed word according to the word-frequency codes, and compose the two adjacent single-character words and that lowest-frequency word into the second abbreviation result.
In addition, the verification and correction processing unit is specifically configured to:
judge whether the first initial abbreviation or the second initial abbreviation is a single character;
if the first initial abbreviation is a single character, obtain the two lowest-frequency preprocessed words according to the word-frequency codes and compose the single character and those two words into the first abbreviation result; if the second initial abbreviation is a single character, obtain the two lowest-frequency preprocessed words according to the word-frequency codes and compose the single character and those two words into the second abbreviation result.
In addition, the verification and correction processing unit is specifically configured to:
judge whether the length of the first initial abbreviation or the second initial abbreviation exceeds a preset length threshold;
if the length of the first initial abbreviation exceeds the preset length threshold, arrange the words in the first initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and delete the last word of the descending arrangement repeatedly until the length of the first initial abbreviation is less than or equal to the preset length threshold, with the trimmed first initial abbreviation serving as the first abbreviation result;
if the length of the second initial abbreviation exceeds the preset length threshold, arrange the words in the second initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and delete the last word of the descending arrangement repeatedly until the length of the second initial abbreviation is less than or equal to the preset length threshold, with the trimmed second initial abbreviation serving as the second abbreviation result.
Further, the abbreviation generating device for an entity also includes:
a storage unit, for storing the first abbreviation result and the second abbreviation result in the training corpus, to update the training corpus.
In addition, the final abbreviation generation unit is specifically configured to:
compare the first abbreviation result with the second abbreviation result, and generate a comparison result;
if the comparison result is that the first abbreviation result and the second abbreviation result are inconsistent, obtain the first generation probability with which the first deep learning model output its abbreviation and the second generation probability with which the second deep learning model output its abbreviation;
compare the first generation probability with the second generation probability, and select the abbreviation result corresponding to the higher of the two probabilities as the final abbreviation of the entity's full-name data.
With the abbreviation generation method and device for an entity provided by the embodiments of the present invention, the full-name data of an entity is first obtained and preprocessed by word segmentation, the full name is split into multiple words, and word-frequency codes representing each word's frequency of occurrence in a preset corpus and part-of-speech tagging information representing each word's attributes are generated; then, a first initial abbreviation and a second initial abbreviation are generated from the preprocessed words, word-frequency codes and part-of-speech tagging information, through a pre-trained first deep learning model and a pre-trained second deep learning model respectively; the first and second initial abbreviations are verified and corrected according to preset rules, generating a first abbreviation result and a second abbreviation result respectively; and the first abbreviation result is compared with the second abbreviation result, with the final abbreviation of the entity's full-name data generated according to the comparison result. By using deep learning, the embodiments of the present invention exploit the advantage that a deep learning model needs no hand-engineered features, fuse part-of-speech and word-frequency information into the model to extend the feature range, and, after autonomous iterative learning over full names, ultimately form an accurate, unique correspondence between an entity's full name and its abbreviation, thereby solving the prior-art problems of low abbreviation generation accuracy and non-unique results.
Brief description of the drawings
To explain the embodiments of the present invention or the technical schemes of the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a first flow chart of an abbreviation generation method for an entity provided by an embodiment of the present invention;
Fig. 2 is a second flow chart of an abbreviation generation method for an entity provided by an embodiment of the present invention;
Fig. 3 is a schematic frame diagram of the first deep learning model in an embodiment of the present invention;
Fig. 4 is a schematic frame diagram of the second deep learning model in an embodiment of the present invention;
Fig. 5 is a schematic flow diagram of training the first deep learning model in an embodiment of the present invention;
Fig. 6 is a schematic flow diagram of training the second deep learning model in an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of an abbreviation generating device for an entity provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical schemes in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the invention.
As shown in Fig. 1, an embodiment of the present invention provides an abbreviation generation method for an entity, including:
Step 101: obtain the full-name data of an entity, perform word-segmentation preprocessing on the full-name data to split it into multiple words, and generate word-frequency codes representing each word's frequency of occurrence in a preset corpus and part-of-speech tagging information representing each word's attributes.
Step 102: generate a first initial abbreviation and a second initial abbreviation from the preprocessed words, word-frequency codes and part-of-speech tagging information, through a pre-trained first deep learning model and a pre-trained second deep learning model respectively.
Step 103: apply preset verification and correction rules to the first initial abbreviation and the second initial abbreviation, generating a first abbreviation result and a second abbreviation result respectively.
Step 104: compare the first abbreviation result with the second abbreviation result, and generate the final abbreviation of the entity's full-name data according to the comparison result.
With the abbreviation generation method for an entity provided by this embodiment of the present invention, the full-name data of an entity is first obtained and preprocessed by word segmentation, the full name is split into multiple words, and word-frequency codes representing each word's frequency of occurrence in a preset corpus and part-of-speech tagging information representing each word's attributes are generated; then, a first initial abbreviation and a second initial abbreviation are generated from the preprocessed words, word-frequency codes and part-of-speech tagging information, through a pre-trained first deep learning model and a pre-trained second deep learning model respectively; the first and second initial abbreviations are verified and corrected according to preset rules, generating a first abbreviation result and a second abbreviation result respectively; and the first abbreviation result is compared with the second abbreviation result, with the final abbreviation of the entity's full-name data generated according to the comparison result. By using deep learning, this embodiment exploits the advantage that a deep learning model needs no hand-engineered features, fuses part-of-speech and word-frequency information into the model to extend the feature range, and, after autonomous iterative learning over full names, ultimately forms an accurate, unique correspondence between an entity's full name and its abbreviation, thereby solving the prior-art problems of low abbreviation generation accuracy and non-unique results.
To help those skilled in the art better understand the present invention, a more detailed embodiment is set forth below. As shown in Fig. 2, an embodiment of the present invention provides an abbreviation generation method for an entity, including:
Step 201: obtain the full-name data of an entity, perform word-segmentation preprocessing on the full-name data, and split the full name into multiple words.
Here, the full-name data is split into multiple words because taking the word as the basic semantic unit conforms to the grammar of the Chinese language. Word segmentation can be performed, for example, with the Chinese word-segmentation tool Jieba, but is not limited to it. For example, "大唐电信科技股份有限公司" (Datang Telecom Technology Co., Ltd.) can be split into multiple words: "大唐" (Datang), "电信" (telecom), "科技" (technology), "股份" (shares), "有限" (limited), "公司" (company).
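As an illustrative stand-in for a real tokenizer such as Jieba, a toy forward-maximum-matching segmenter over a preset vocabulary reproduces the split above (this is a sketch of dictionary-based segmentation in general, not of Jieba's actual algorithm):

```python
def fmm_segment(text, vocab, max_word_len=4):
    """Greedy forward-maximum-matching word segmentation."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in vocab or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words
```

With `vocab = {"大唐", "电信", "科技", "股份", "有限", "公司"}`, segmenting "大唐电信科技股份有限公司" yields exactly the six words listed above.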
Step 202, from the word frequency information table pre-set obtain the full name data of entity be split after each word pair The frequency answered.
It is worth noting that the preset word-frequency information table stores the frequency information of a large number of words, where the frequency may be the number of times a word occurs in the full-name data of various entities. It may be represented in the form word:number, the number denoting the frequency, for example: credit: 8; power generation: 54; Zhangzhou: 2; Audi: 1; packaging: 19; Fuling: 2; Henan: 34; copper foil: 1; Ping An: 2; carbon black: 1; craft: 3; paradise: 2; lead-zinc: 1; Baichuan: 1; and so on.
Step 203: according to the frequency corresponding to each word, sort the words in descending order of frequency and generate the word-frequency code corresponding to each word.
For example, taking the enterprise name "Datang Telecom Technology Co., Ltd.", suppose the frequency of "Limited" is 100, the frequency of "Shares" is 20, and the frequency of "Company" is 150; the word-frequency codes would then be as follows: Company: 1; Limited: 2; Shares: 3.
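Steps 202 to 204 can be sketched as follows: rank the words by descending corpus frequency and assign each word its rank as a word-frequency code (1 = most frequent). The frequency counts are the illustrative ones from the example above.

```python
# Assign word-frequency codes: sort by descending frequency, code = rank.
def word_freq_codes(freq_table):
    ranked = sorted(freq_table, key=freq_table.get, reverse=True)
    return {word: rank for rank, word in enumerate(ranked, start=1)}

freqs = {"Limited": 100, "Shares": 20, "Company": 150}  # illustrative counts
codes = word_freq_codes(freqs)
print(codes)  # {'Company': 1, 'Limited': 2, 'Shares': 3}
```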
Step 204: store the word-frequency code corresponding to each word in a preset word-frequency coding table.
In this way, the word-frequency code of each word can be stored in the word-frequency coding table for later use, for example in the training corpus.
Step 205: according to a preset affix (prefix/suffix) dictionary, region dictionary, industry dictionary and keyword dictionary, determine the dictionary to which each word of the split full-name data belongs, perform part-of-speech tagging on each word of the split full-name data, and generate the part-of-speech tagging information.
The part-of-speech tagging information includes affix words, region words, industry words and keywords.
For example, the affix dictionary may record words such as "Company", "Liability", "Limited" and "Industry" that often appear at the beginning or end of the names of enterprises, institutions and other entities. The region dictionary may record words representing regions, such as "China", "Beijing", "Shanghai" and "Henan". The industry dictionary may record words representing industries, such as "intellectual property", "patent agency", "bank" and "pharmacy". The keyword dictionary may record the keywords of specific entity full names; for example, the keyword of "Industrial and Commercial Bank of China" is "Industrial and Commercial", the keyword of "San You Patent Agency (11127)" is "San You", and the keyword of "Beijing Hyundai Motor Co., Ltd." is "Hyundai". By querying the preset affix, region, industry and keyword dictionaries, the dictionary to which each split word belongs can be determined, and thus the part of speech of each word of the split full-name data. For example, in the full name corresponding to "San You Patent Agency (11127)", "Beijing" is a region word, "San You" is the keyword, "intellectual property agency" is an industry word, and "Co., Ltd." is an affix word.
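The dictionary lookup of Step 205 can be sketched as follows; the dictionary contents are small illustrative assumptions, not the actual preset dictionaries of the embodiment.

```python
# Tag each segmented word by looking it up in the preset dictionaries.
AFFIXES  = {"Limited", "Company", "Co., Ltd.", "Industry"}
REGIONS  = {"China", "Beijing", "Shanghai", "Henan"}
INDUSTRY = {"bank", "pharmacy", "intellectual property agency"}
KEYWORDS = {"San You", "Hyundai", "Industrial and Commercial"}

def pos_tag(words):
    tags = []
    for w in words:
        if w in AFFIXES:
            tags.append((w, "affix"))
        elif w in REGIONS:
            tags.append((w, "region"))
        elif w in INDUSTRY:
            tags.append((w, "industry"))
        elif w in KEYWORDS:
            tags.append((w, "keyword"))
        else:
            tags.append((w, "other"))
    return tags

print(pos_tag(["Beijing", "San You", "intellectual property agency", "Co., Ltd."]))
# [('Beijing', 'region'), ('San You', 'keyword'),
#  ('intellectual property agency', 'industry'), ('Co., Ltd.', 'affix')]
```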
Step 206: according to a preset training corpus, perform machine learning training on the first deep learning model and the second deep learning model respectively.
The training corpus includes preset full-name data of entities and the abbreviation corresponding to each full name. The abbreviation corresponding to each full name may be obtained by manual annotation, or by storing the first abbreviation result and the second abbreviation result of step 209 into the training corpus. It is worth noting that the larger the data volume, the more accurate the features learned by the deep learning models. The training corpus usually accounts for about 80% of the whole sample library.
In addition, in order to verify the generalization ability of the deep learning models, a verification corpus also needs to be set. The verification corpus must be independently and identically distributed with the training corpus, so that a deep learning model trained on the training corpus also performs well on the verification corpus. The verification corpus usually accounts for about 20% of the whole sample library.
Here, the specific training process may refer to the subsequent Fig. 5 and Fig. 6.
Step 207: according to the preprocessed words, the word-frequency codes and the part-of-speech tagging information, generate a first initial abbreviation and a second initial abbreviation through the pre-trained first deep learning model and second deep learning model, respectively.
The process of generating the first initial abbreviation by the first deep learning model is consistent with steps 504 to 505 in the subsequent Fig. 5, and the process of generating the second initial abbreviation by the second deep learning model is consistent with steps 604 to 606 in the subsequent Fig. 6; in the generation phase, error back-propagation is no longer performed.
Step 208: according to preset verification-and-correction rules, perform verification and correction on the first initial abbreviation and the second initial abbreviation, and generate a first abbreviation result and a second abbreviation result respectively.
Specifically, step 208 may perform verification and correction according to, for example, the words of each initial abbreviation and its length, in the following ways:
1. Judge whether two adjacent single-character words exist among the preprocessed words.
If two adjacent single-character words exist among the preprocessed words, and the first initial abbreviation is the word composed of the two adjacent single-character words, the first initial abbreviation is determined to be the first abbreviation result; likewise, if the second initial abbreviation is the word composed of the two adjacent single-character words, the second initial abbreviation is determined to be the second abbreviation result.
If two adjacent single-character words exist among the preprocessed words, but the first initial abbreviation is not the word composed of the two adjacent single-character words, the word with the lowest frequency among the preprocessed words is obtained according to the word-frequency codes, and the two adjacent single-character words together with that lowest-frequency word constitute the first abbreviation result; likewise, if the second initial abbreviation is not the word composed of the two adjacent single-character words, the lowest-frequency word is obtained according to the word-frequency codes, and the two adjacent single-character words together with that lowest-frequency word constitute the second abbreviation result.
2. Judge whether the first initial abbreviation or the second initial abbreviation is a single character.
If the first initial abbreviation is a single character, the two words with the lowest frequencies among the preprocessed words are obtained according to the word-frequency codes, and the single character together with the two lowest-frequency words constitutes the first abbreviation result; if the second initial abbreviation is a single character, the two lowest-frequency words are obtained in the same way, and the single character together with the two lowest-frequency words constitutes the second abbreviation result.
3. Judge whether the abbreviation length of the first initial abbreviation or the second initial abbreviation is greater than a preset length threshold.
If the abbreviation length of the first initial abbreviation is greater than the preset length threshold, the words in the first initial abbreviation are sorted in descending order according to the part-of-speech tagging information and the TF-IDF values of the preprocessed words, and the last word of the sorted abbreviation is deleted repeatedly until the abbreviation length of the first initial abbreviation is less than or equal to the preset length threshold; the truncated first initial abbreviation serves as the first abbreviation result.
If the abbreviation length of the second initial abbreviation is greater than the preset length threshold, the words in the second initial abbreviation are sorted and truncated in the same way until its abbreviation length is less than or equal to the preset length threshold; the truncated second initial abbreviation serves as the second abbreviation result.
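Correction rule 3 (length truncation) can be sketched as follows. Two simplifying assumptions: the importance scores stand in for the POS-and-TF-IDF ordering described above, and the threshold is a word count rather than a character count.

```python
# Rule 3 sketch: drop the least important words of an over-long initial
# abbreviation until it fits the preset threshold.
def truncate_abbreviation(words, score, max_words=2):
    kept = sorted(words, key=score.get, reverse=True)  # most important first
    while len(kept) > max_words:
        kept.pop()                                     # drop lowest-scored word
    return [w for w in words if w in kept]             # restore original order

score = {"Ping An": 0.9, "Bank": 0.7, "Shares": 0.2}   # assumed importance scores
print(truncate_abbreviation(["Ping An", "Bank", "Shares"], score))
# ['Ping An', 'Bank']
```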
Step 209: store the first abbreviation result and the second abbreviation result into the training corpus to update the training corpus.
Step 210: compare the first abbreviation result with the second abbreviation result, generate a comparison result, and determine whether the first abbreviation result is consistent with the second abbreviation result.
If the comparison result is that the first abbreviation result is consistent with the second abbreviation result, perform step 211. Otherwise, if the comparison result is that the first abbreviation result and the second abbreviation result are inconsistent, perform step 212.
Step 211: select the first abbreviation result or the second abbreviation result as the final abbreviation of the entity's full-name data.
Step 212: obtain the first generating probability with which the first deep learning model outputs its abbreviation and the second generating probability with which the second deep learning model outputs its abbreviation, compare the first generating probability with the second generating probability, and select the abbreviation result corresponding to the larger of the two as the final abbreviation of the entity's full-name data.
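Steps 210 to 212 amount to the following selection logic; the abbreviation strings and probabilities below are illustrative.

```python
# If the two models agree, take either result; otherwise take the result
# whose model reported the higher generating probability.
def final_abbreviation(result1, prob1, result2, prob2):
    if result1 == result2:
        return result1
    return result1 if prob1 >= prob2 else result2

print(final_abbreviation("Ping An Bank", 0.91, "Ping An Bank", 0.84))  # Ping An Bank
print(final_abbreviation("Ping An Bank", 0.62, "Ping An", 0.88))       # Ping An
```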
So that those skilled in the art may understand how step 206 above trains the first deep learning model and the second deep learning model respectively, the two models and their training processes are elaborated below:
The two deep learning models are independent of each other; they are trained and produce output in parallel, without interfering with each other.
As shown in Fig. 3, which is a schematic diagram of the first deep learning model, the first deep learning model uses a multi-layer network structure, including: an input layer 301, a single Embedding layer 302, multi-layer BRNN 303, multi-layer RNN 304 and a single Softmax layer 305. The two BRNN layers 303 and one RNN layer 304 in Fig. 3 are merely an example of the present invention and can be adjusted according to actual training results.
The input layer 301 obtains the training corpus from the database and passes it to the Embedding layer 302 of the neural network, which performs word-vector encoding before passing the result on for further processing. Because the neural network requires a fixed input dimension, the Embedding layer 302 can reduce the high-dimensional word features to 128-dimensional word vectors; the dimension-reduction technique may use the existing Word2Vec method.
The words in the corpus of the embodiment of the present invention are represented with the existing one-hot representation, which is the most intuitive and most common representation in current natural language processing. This method represents each word as a very long vector whose dimension is the vocabulary size; most elements are 0 and only one dimension has the value 1, and this vector represents the current word. Examples are as follows:
" water power " can be expressed as [0 001000000000000 ...]
" electric power " can be expressed as [0 000000010000000 ...]
Word-vector encoding can discover the relation between two words by computing the distance between their vectors. In the one-hot representation each word is 1 only at its own position and 0 everywhere else, so any two words are isolated from each other: nothing about their relation can be seen from the two vectors, and even similar words such as "hydropower" and "electric power" show no connection. After word-vector encoding into 128-dimensional vectors, suppose "hydropower" is represented by a vector u and "electric power" by a vector v; the cosine distance cos θ = (u · v) / (|u| |v|) measures the difference between the two vectors, and the closer the value is to 1, the closer the two words are semantically.
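The cosine measure above can be computed as follows; the 4-dimensional vectors are toy stand-ins for the 128-dimensional embeddings.

```python
# Cosine similarity between two word vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [0.9, 0.1, 0.3, 0.4]   # "hydropower" (toy embedding)
v = [0.8, 0.2, 0.3, 0.5]   # "electric power" (toy embedding)
w = [0.0, 1.0, 0.0, 0.0]   # an unrelated word
print(round(cosine(u, v), 3))   # close to 1 -> semantically similar
print(round(cosine(u, w), 3))   # much smaller -> semantically distant
```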
A recurrent neural network RNN (Recurrent Neural Network) is a neural network with a feedback structure, whose main characteristic is that the output of each hidden layer serves as the input of the hidden layer at the next time step. A bidirectional recurrent neural network BRNN (Bidirectional Recurrent Neural Network) adds a reversed input sequence on the basis of the RNN: for an entity full-name word sequence W1 W2 ... Wn, the BRNN output is the average of the RNN output over the forward sequence W1 W2 ... Wn and the RNN output over the reversed sequence Wn Wn-1 ... W1. Compared with a traditional neural network, a recurrent neural network has a memory function between layers; words are no longer independent of one another, and the ability to learn relations between words is stronger. As shown in Fig. 3, a Dropout operation is added to the inter-layer input. The motivation for Dropout comes from ensemble training of models: with massive data, training different networks on different portions of the data and averaging the outputs of the multiple networks at prediction time clearly improves prediction accuracy and prevents over-fitting. Training multiple different networks with different parameters can achieve the same purpose, but the training cost is higher; the Dropout operation is a simple way to realize this idea, randomly disabling some neurons in the network with a certain probability and thereby implicitly producing many network variants.
The Softmax layer 305 uses the softmax classification activation function and is trained with a multi-class cross-entropy loss function.
The multi-class activation function is as follows:
y_i = exp(net_i) / Σ_o exp(net_o)
where net_i denotes the output of the i-th neuron, o is the output variable traversing all neurons, and net_o denotes the output of the o-th neuron.
The multi-class cross-entropy loss function E (E for Error) is as follows:
E = - Σ_o t_o · log(y_o)
where o is the output variable traversing all neurons, t_o denotes the label value of the o-th neuron in the sample, and y_o denotes the output value. Solving the derivative of the cross-entropy loss function, the parameter updates can be obtained according to the back-propagation (BP) algorithm:
∂E/∂net_i = y_i - t_i
where net_i denotes the output of the i-th neuron, y_i is the softmax value corresponding to net_i, and t_i denotes the label value of the i-th neuron in the sample. The label used by model 1 is an R^(10×3) two-dimensional matrix, i.e. the input full name has at most 10 keywords, and each keyword corresponds to 3 possible label values (the word appears in full, only its first character appears, or it does not appear). Besides the output values y_i, the model also outputs the confidence corresponding to each y_i.
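The softmax activation, the cross-entropy loss and its gradient y_i - t_i can be written out directly; the pre-activations and one-hot label below are toy values for three label classes.

```python
# Softmax and multi-class cross-entropy; the gradient of the loss w.r.t.
# the pre-activations net_i is y_i - t_i, which back-propagation uses.
import math

def softmax(net):
    exps = [math.exp(n) for n in net]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, t):
    return -sum(ti * math.log(yi) for yi, ti in zip(y, t))

net = [2.0, 1.0, 0.1]  # toy pre-activations for 3 label classes
t   = [1, 0, 0]        # one-hot label: first class is correct
y   = softmax(net)
grad = [yi - ti for yi, ti in zip(y, t)]
print(round(cross_entropy(y, t), 3))
print([round(g, 3) for g in grad])
```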
Fig. 4 is a schematic diagram of the second deep learning model used by the embodiment of the present invention. The second deep learning model mainly includes an input layer 401 (previous word, current word, part of speech of the current word, word-frequency code of the current word), an RNN layer 402, a fully connected layer 403 and an output layer 404. The second deep learning model is similar in structure to the first deep learning model; the main difference is that part-of-speech and word-frequency information is added, thereby bringing in industry, region and keyword information. Unlike the words, the part-of-speech and word-frequency inputs pass directly through a fully connected layer and are connected to the final output layer, while the other layers are similar to those of the first deep learning model. In addition, the label value output by the second deep learning model is 0 or 1, where 0 indicates the word does not appear and 1 indicates that it appears; similarly, the second deep learning model also outputs the confidence of its output labels.
In order to improve the generalization ability of the second deep learning model, the word-frequency range needs to be compressed into 5 class intervals. The processing procedure and design method are as follows:
First, min-max normalization is applied: a linear transformation of the original sample word frequency X maps the result values into [0, 1]. The transfer function is as follows:
Y = (X - MinValue) / (MaxValue - MinValue);
where MaxValue is the maximum of the sample data and MinValue is the minimum of the sample data.
The transformed value Y is then mapped to 5 class intervals according to the TF-IDF (Term Frequency-Inverse Document Frequency, a common weighting technique for information retrieval and data mining) method. In entity abbreviation generation, the term frequency TF (Term Frequency) is the frequency with which each word occurs within the entity, and the document frequency DF (Document Frequency) is the frequency of occurrence of the word across all samples. Taking enterprise names as an example, TF in an enterprise name is generally 1, and the larger the DF, the smaller the probability that the word appears in the abbreviation, so the inverse document frequency IDF (Inverse Document Frequency) is generally used in the calculation. The traditional IDF is the logarithm of the total number of samples divided by DF, i.e. IDF = log(|D| / DF), where |D| is the total number of entries in the corpus and DF is the number of entries containing the word. If the word is not in the corpus, the denominator would be 0, so in the ordinary case 1 is added to the denominator: IDF = log(|D| / (1 + DF)). The normalized word-frequency range Y maps the frequency to values from 1 to 5, as shown in Table 1 below.
Table 1:Word frequency mapping table
Word frequency scope (Y) Mapping value (IDF)
[0,0.2] 5
(0.2,0.4] 4
(0.4,0.6] 3
(0.6,0.8] 2
(0.8,1] 1
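The two-stage compression above (min-max normalization, then mapping into the 5 intervals of Table 1) can be sketched as follows; the raw frequency list is illustrative, and the boundary handling assumes the interval endpoints shown in Table 1.

```python
# Compress raw word frequencies into 5 class intervals per Table 1:
# Y in [0, 0.2] -> 5, (0.2, 0.4] -> 4, ..., (0.8, 1] -> 1.
import math

def compress_frequencies(freqs):
    lo, hi = min(freqs), max(freqs)
    bins = []
    for x in freqs:
        y = (x - lo) / (hi - lo)            # Y = (X - Min) / (Max - Min), in [0, 1]
        bucket = max(1, math.ceil(y * 5))   # [0,0.2]->1, (0.2,0.4]->2, ..., (0.8,1]->5
        bins.append(6 - bucket)             # invert: rarest words get value 5
    return bins

print(compress_frequencies([1, 20, 54, 100, 150]))  # [5, 5, 4, 2, 1]
```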
The following, as shown in Fig. 5, is the training flow of the first deep learning model, comprising the following steps:
Step 501: initialize the network parameters of the first deep learning model. The present invention adopts random initialization.
Step 502: read the training corpus from the database.
Step 503: perform word-vector encoding on the input corpus.
Step 504: input the word vectors into the neural network for forward calculation.
Step 505: perform the classification calculation of the softmax layer.
Step 506: calculate the error between the output and the manually annotated abbreviation.
Step 507: calculate gradients, propagate the error back layer by layer, and update the network parameters.
Step 508: judge whether training should stop; stop training when the iteration count is reached or a stop condition is met in advance, otherwise return to step 502 to continue training.
Step 509: training ends; store the parameters of the first deep learning model.
The following, as shown in Fig. 6, is the training flow of the second deep learning model, comprising the following steps:
Step 601: initialize the network parameters of the second deep learning model. The present invention adopts random initialization.
Step 602: read the training corpus from the database.
Step 603: perform word-vector encoding on the input corpus, and at the same time read the industry, region, keyword and word-frequency information to perform part-of-speech encoding and word-frequency encoding.
Step 604: input the word vectors into the RNN layer for forward calculation.
Step 605: process the RNN output with a fully connected layer, and process the part of speech and frequency of the current word with another fully connected layer.
Step 606: process the output layer and output the result using the softmax function.
Step 607: calculate the error between the output and the manually annotated abbreviation.
Step 608: calculate gradients, propagate the error back layer by layer, and update the network parameters.
Step 609: judge whether training should stop; stop training when the iteration count is reached or a stop condition is met in advance, otherwise return to step 602 to continue training.
Step 610: training ends; store the parameters of the second deep learning model.
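The training flows of Figs. 5 and 6 share the loop skeleton below. The one-parameter model is a deliberate toy so that only the loop structure (read corpus, forward pass, error, gradient update, stop condition) is shown, not the actual networks of the embodiment.

```python
# Skeleton of the training loop: iterate, forward, compare to annotation,
# update, stop on convergence or iteration limit.
def train(samples, lr=0.1, max_iters=200, tol=1e-6):
    w = 0.0                              # Step x01: parameter initialization
    for _ in range(max_iters):           # Step x09: iteration limit
        total_err = 0.0
        for x, target in samples:        # Step x02: read training corpus
            y = w * x                    # Steps x03-x06: forward calculation
            err = y - target             # Step x07: error vs. annotation
            w -= lr * err * x            # Step x08: gradient update
            total_err += err * err
        if total_err < tol:              # early stop condition
            break
    return w                             # final step: store parameters

w = train([(1.0, 2.0), (2.0, 4.0)])
print(round(w, 3))  # converges near 2.0
```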
Through the embodiment of the present invention, the present invention can break the bottleneck of over-reliance on manually designed rules in abbreviation generation, and can automatically, quickly and massively generate abbreviation information for entity full names, using deep learning to autonomously learn the rules of abbreviation formation. With continuous follow-up iterative learning, the accuracy of the generated abbreviations steadily improves during repeated learning, and each generated abbreviation has a unique correspondence with its entity, so that the structured and unstructured fragments of information around an entity can quickly be connected into a 360-degree knowledge network.
According to the results of many experiments, the present invention observes the following regularities in the enterprise-name abbreviation scenario:
The suffix "Co., Ltd." hardly ever appears in an abbreviation. For example, the abbreviation "Ping An Bank" of "Ping An Bank Co., Ltd." omits the "Co., Ltd." suffix.
Province, city and district information in the full enterprise name sometimes appears in the abbreviation and sometimes does not. For example, the abbreviation "Qingdao Doublestar" of "Qingdao Doublestar Co., Ltd." contains the region, while the abbreviation "Wannianqing" of "Jiangxi Wannianqing Cement Co., Ltd." does not.
Associated keywords can normally be obtained from context; for example, "Group", "Shares" and "Holding" sometimes appear in the abbreviation and sometimes do not. The abbreviation "Fangda Group" of "Fangda Group Co., Ltd." contains "Group", while the abbreviation "China Baoan" of "China Baoan Group Co., Ltd." does not.
When core keywords are extracted, the industry word appears in some abbreviations but not in others. The abbreviation "Sierte" of the full enterprise name "Anhui Sierte Fertilizer Co., Ltd." contains no industry word, while the abbreviation "Xindu Chemical" of "Chengdu Xindu Chemical Co., Ltd." contains one.
After the full company name is split into words, each word may contribute no character, its first character, or two characters to the abbreviation. In the abbreviation "Guyue Longshan" of "Zhejiang Guyue Longshan Shaoxing Wine Co., Ltd.", "Zhejiang" contributes nothing while "Guyue" contributes two characters; in the abbreviation "SAIC Group" of "SAIC Motor Corporation Limited", "Shanghai" and "Automobile" each contribute one character.
Corresponding to the method embodiments shown in Fig. 1 and Fig. 2 above, as shown in Fig. 7, an embodiment of the present invention further provides an abbreviation generating means for an entity, including:
A word-segmentation preprocessing unit 71, configured to obtain the full-name data of the entity, perform word-segmentation preprocessing on the full-name data, split the full name into multiple words, and generate the word-frequency code representing each word's frequency of occurrence in the preset corpus and the part-of-speech tagging information representing each word's attributes.
An initial abbreviation generation unit 72, configured to generate the first initial abbreviation and the second initial abbreviation through the pre-trained first deep learning model and second deep learning model respectively, according to the preprocessed words, the word-frequency codes and the part-of-speech tagging information.
A verification and correction processing unit 73, configured to perform verification and correction on the first initial abbreviation and the second initial abbreviation according to the preset verification-and-correction rules, and generate the first abbreviation result and the second abbreviation result respectively.
A final abbreviation generation unit 74, configured to compare the first abbreviation result with the second abbreviation result and generate the final abbreviation of the entity's full-name data according to the comparison result.
In addition, the word-segmentation preprocessing unit 71 is specifically configured to:
obtain, from the preset word-frequency information table, the frequency corresponding to each word after the full-name data of the entity is split;
sort the words in descending order of frequency according to the frequency corresponding to each word, and generate the word-frequency code corresponding to each word;
store the word-frequency code corresponding to each word in the preset word-frequency coding table.
In addition, the word-segmentation preprocessing unit 71 is further configured to:
determine, according to the preset affix, region, industry and keyword dictionaries, the dictionary to which each word of the split full-name data belongs, perform part-of-speech tagging on each word of the split full-name data, and generate the part-of-speech tagging information; the part-of-speech tagging information includes affix words, region words, industry words and keywords.
Further, as shown in Fig. 7, the abbreviation generating means for an entity further includes:
A machine learning training unit 75, configured to perform machine learning training on the first deep learning model and the second deep learning model respectively according to the preset training corpus; the training corpus includes preset full-name data of entities and the abbreviation corresponding to each full name.
In addition, the verification and correction processing unit 73 is specifically configured to:
judge whether two adjacent single-character words exist among the preprocessed words;
when two adjacent single-character words exist among the preprocessed words and the first initial abbreviation is the word composed of the two adjacent single-character words, determine the first initial abbreviation to be the first abbreviation result; likewise, when the second initial abbreviation is the word composed of the two adjacent single-character words, determine the second initial abbreviation to be the second abbreviation result;
when two adjacent single-character words exist among the preprocessed words but the first initial abbreviation is not the word composed of the two adjacent single-character words, obtain the lowest-frequency word among the preprocessed words according to the word-frequency codes, and form the first abbreviation result from the two adjacent single-character words together with that lowest-frequency word; likewise, when the second initial abbreviation is not the word composed of the two adjacent single-character words, obtain the lowest-frequency word according to the word-frequency codes and form the second abbreviation result from the two adjacent single-character words together with that lowest-frequency word.
In addition, the verification and correction processing unit 73 is specifically configured to:
judge whether the first initial abbreviation or the second initial abbreviation is a single character;
when the first initial abbreviation is a single character, obtain the two lowest-frequency words among the preprocessed words according to the word-frequency codes, and form the first abbreviation result from the single character together with the two lowest-frequency words; when the second initial abbreviation is a single character, obtain the two lowest-frequency words in the same way and form the second abbreviation result from the single character together with the two lowest-frequency words.
In addition, the verification and correction processing unit 73 is specifically configured to:
judge whether the abbreviation length of the first initial abbreviation or the second initial abbreviation is greater than the preset length threshold;
when the abbreviation length of the first initial abbreviation is greater than the preset length threshold, sort the words in the first initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and delete the last word of the sorted abbreviation repeatedly until the abbreviation length of the first initial abbreviation is less than or equal to the preset length threshold, the truncated first initial abbreviation serving as the first abbreviation result;
when the abbreviation length of the second initial abbreviation is greater than the preset length threshold, sort and truncate the second initial abbreviation in the same way until its abbreviation length is less than or equal to the preset length threshold, the truncated second initial abbreviation serving as the second abbreviation result.
Further, as shown in Fig. 7, the abbreviation generating means further includes:
A storage unit 76, configured to store the first abbreviation result and the second abbreviation result into the training corpus so as to update the training corpus.
In addition, the final abbreviation generation unit 74, specifically for:
The first abbreviation result and the second abbreviation result are compared, comparative result is generated.
When the comparative result is the first abbreviation result and inconsistent the second abbreviation result, the first deep learning mould is obtained The first generating probability and the second deep learning model of type output abbreviation export the second generating probability of abbreviation.
First generating probability and the second generating probability are compared, and select the first generating probability and the second generation The corresponding abbreviation result of higher value in probability as the full name data of entity final abbreviation.
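The selection step described above admits a short sketch. The function name and the tie-breaking behaviour when both probabilities are equal are assumptions; the patent only specifies choosing the result whose model reported the higher generating probability when the two results disagree:

```python
# Sketch of the final-selection step: when the two check-corrected results
# disagree, the abbreviation whose generating model reported the higher
# output probability wins. Names and the >=-tie-break are illustrative.
def select_final_abbreviation(result1, prob1, result2, prob2):
    if result1 == result2:          # consistent results need no tie-break
        return result1
    return result1 if prob1 >= prob2 else result2
```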
It should be noted that for a specific implementation of the entity abbreviation generating device provided in the embodiments of the present invention, reference may be made to the method embodiments corresponding to Fig. 1 and Fig. 2 above, which is not repeated here.
In the entity abbreviation generating device provided by the embodiments of the present invention, the full entity name data is first obtained and subjected to word-segmentation pre-processing: the full name is split into multiple words, and for each word a word frequency code representing its frequency of occurrence in a pre-set corpus and part-of-speech tagging information representing its attributes are generated. Then, according to the pre-processed words, the word frequency codes and the part-of-speech tagging information, a first initial abbreviation and a second initial abbreviation are generated through a pre-trained first deep learning model and second deep learning model, respectively. Check-correction processing is applied to the first and second initial abbreviations according to pre-set check-correction rules, generating a first abbreviation result and a second abbreviation result, respectively. Finally, the two abbreviation results are compared, and the final abbreviation of the full entity name data is generated according to the comparison result. By adopting deep learning, the embodiments of the present invention exploit the advantage that deep learning models require no hand-engineered features, fuse part-of-speech and word-frequency information into the models to extend the feature scope, and, after autonomous iterative learning over full names, ultimately form an accurate, unique correspondence between an entity's full name and its abbreviation, thereby solving the problems of low accuracy and non-unique results in the existing abbreviation generation technology.
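The overall data flow just summarized (segmentation, two independent generators, check correction, comparison) can be sketched as a skeleton. Every function below is a hypothetical stand-in showing only how the components connect; the patent does not disclose concrete model internals:

```python
# High-level sketch of the described pipeline with stand-in components:
# preprocessing, two independent generators, check correction, and final
# comparison. All callables are hypothetical stubs illustrating data flow.
def generate_abbreviation(full_name, model_a, model_b, correct, select):
    words = full_name.split()              # stand-in for word segmentation
    init_a, init_b = model_a(words), model_b(words)
    result_a, result_b = correct(init_a), correct(init_b)
    return select(result_a, result_b)

# Minimal stand-ins to exercise the flow (first-character "models",
# identity-like correction, agree-or-prefer-first selection):
abbr = generate_abbreviation(
    "industrial commercial bank",
    model_a=lambda ws: [w[:1] for w in ws],
    model_b=lambda ws: [w[:1] for w in ws],
    correct=lambda initial: "".join(initial),
    select=lambda a, b: a if a == b else a,
)
```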
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The principles and embodiments of the present invention have been set forth herein through specific examples; the description of the above embodiments is intended only to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific embodiments and the scope of application according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (18)

1. An abbreviation generation method of an entity, characterized by comprising:
obtaining full name data of the entity, performing word-segmentation pre-processing on the full entity name data, splitting the full entity name data into multiple words, and generating a word frequency code representing the frequency of occurrence of each word in a pre-set corpus and part-of-speech tagging information representing the attributes of each word;
generating a first initial abbreviation and a second initial abbreviation respectively through a pre-trained first deep learning model and a pre-trained second deep learning model, according to the pre-processed words, the word frequency codes and the part-of-speech tagging information;
performing check-correction processing on the first initial abbreviation and the second initial abbreviation according to pre-set check-correction rules, to generate a first abbreviation result and a second abbreviation result respectively; and
comparing the first abbreviation result with the second abbreviation result, and generating a final abbreviation of the full entity name data according to the comparison result.
2. The abbreviation generation method of an entity according to claim 1, characterized in that obtaining the full name data of the entity, performing word-segmentation pre-processing on the full entity name data, splitting it into multiple words, and generating the word frequency code representing the frequency of occurrence in the pre-set corpus and the part-of-speech tagging information representing the word attributes comprises:
obtaining, from a pre-set word frequency information table, the frequency corresponding to each word after the full entity name data is split;
sorting the words in descending order of frequency according to the frequency corresponding to each word, and generating the word frequency code corresponding to each word; and
storing the word frequency code corresponding to each word in a pre-set word frequency coding table.
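The frequency coding of claim 2 can be sketched as follows; treating the descending rank as the code is an assumption, since the claim only states that a code representing corpus frequency is generated for each word:

```python
# Sketch of the word-frequency coding of claim 2: look up each word's
# frequency, sort the words in descending order of frequency, and assign
# rank-based codes. Using integer ranks as the codes is an assumption.
def build_word_freq_codes(words, freq_table):
    ranked = sorted(words, key=lambda w: freq_table.get(w, 0), reverse=True)
    return {w: code for code, w in enumerate(ranked, start=1)}
```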
3. The abbreviation generation method of an entity according to claim 2, characterized in that obtaining the full name data of the entity, performing word-segmentation pre-processing on the full entity name data, splitting it into multiple words, and generating the word frequency code and the part-of-speech tagging information further comprises:
determining, according to a pre-set prefix-suffix dictionary, regional dictionary, industry dictionary and keyword dictionary, the dictionary in which each word obtained by splitting the full entity name data is located, and performing part-of-speech tagging on each word obtained by splitting the full entity name data to generate the part-of-speech tagging information; the part-of-speech tagging information includes prefix-suffix words, regional words, industry words and keywords.
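The dictionary-driven tagging of claim 3 admits a brief sketch; the lookup order, the fallback label and the sample dictionary contents are assumptions, since the claim only requires determining which dictionary each word belongs to:

```python
# Sketch of the dictionary-driven part-of-speech tagging of claim 3: each
# word is labelled by whichever pre-set dictionary (prefix/suffix, region,
# industry, keyword) it appears in. Lookup order and fallback are assumed.
def tag_words(words, affixes, regions, industries, keywords):
    tags = {}
    for w in words:
        if w in affixes:
            tags[w] = "affix"
        elif w in regions:
            tags[w] = "region"
        elif w in industries:
            tags[w] = "industry"
        elif w in keywords:
            tags[w] = "keyword"
        else:
            tags[w] = "other"   # assumed fallback for out-of-dictionary words
    return tags
```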
4. The abbreviation generation method of an entity according to claim 3, characterized by further comprising:
performing machine learning training on the first deep learning model and the second deep learning model respectively according to a pre-set training corpus; the training corpus includes pre-set full entity name data and the abbreviation corresponding to each full entity name.
5. The abbreviation generation method of an entity according to claim 4, characterized in that performing check-correction processing on the first initial abbreviation and the second initial abbreviation according to the pre-set check-correction rules to generate the first abbreviation result and the second abbreviation result respectively comprises:
judging whether two adjacent single-character words exist in the words obtained by the word-segmentation pre-processing;
if two adjacent single-character words exist in the pre-processed words and the first initial abbreviation is the word composed of the two adjacent single-character words, determining the first initial abbreviation as the first abbreviation result; if two adjacent single-character words exist in the pre-processed words and the second initial abbreviation is the word composed of the two adjacent single-character words, determining the second initial abbreviation as the second abbreviation result; and
if two adjacent single-character words exist in the pre-processed words and the first initial abbreviation is not the word composed of the two adjacent single-character words, obtaining the pre-processed word with the lowest frequency according to the word frequency codes, and composing the two adjacent single-character words and the lowest-frequency word into the first abbreviation result; if two adjacent single-character words exist in the pre-processed words and the second initial abbreviation is not the word composed of the two adjacent single-character words, obtaining the pre-processed word with the lowest frequency according to the word frequency codes, and composing the two adjacent single-character words and the lowest-frequency word into the second abbreviation result.
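The adjacent-single-character rule of claim 5 can be sketched as follows; the helper names, the concatenation order and the sample data are assumptions made for illustration:

```python
# Sketch of the adjacent-single-character rule of claim 5. If the segmented
# full name contains two adjacent one-character words, an initial abbreviation
# equal to their concatenation is accepted as-is; otherwise the character pair
# is joined with the lowest-frequency word of the segmentation.
def adjacent_pair(words):
    for a, b in zip(words, words[1:]):
        if len(a) == 1 == len(b):
            return a + b
    return None

def correct_by_pair(initial, words, freq_table):
    pair = adjacent_pair(words)
    if pair is None or initial == pair:
        return initial               # rule does not apply, or already matches
    rarest = min(words, key=lambda w: freq_table.get(w, 0))
    return pair + rarest             # assumed ordering: pair first
```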
6. The abbreviation generation method of an entity according to claim 4, characterized in that performing check-correction processing on the first initial abbreviation and the second initial abbreviation according to the pre-set check-correction rules to generate the first abbreviation result and the second abbreviation result respectively comprises:
judging whether the first initial abbreviation or the second initial abbreviation is a single character;
if the first initial abbreviation is a single character, obtaining the two pre-processed words with the lowest frequencies according to the word frequency codes, and composing the single character and the two lowest-frequency words into the first abbreviation result; if the second initial abbreviation is a single character, obtaining the two pre-processed words with the lowest frequencies according to the word frequency codes, and composing the single character and the two lowest-frequency words into the second abbreviation result.
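The single-character rule of claim 6 admits a minimal sketch; the ordering of the padded words and the function name are assumptions:

```python
# Sketch of the single-character rule of claim 6: a one-character initial
# abbreviation is padded with the two lowest-frequency words of the segmented
# full name. The concatenation order (character first) is assumed.
def correct_single_char(initial, words, freq_table):
    if len(initial) != 1:
        return initial               # rule only applies to single characters
    rarest_two = sorted(words, key=lambda w: freq_table.get(w, 0))[:2]
    return initial + "".join(rarest_two)
```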
7. The abbreviation generation method of an entity according to claim 4, characterized in that performing check-correction processing on the first initial abbreviation and the second initial abbreviation according to the pre-set check-correction rules to generate the first abbreviation result and the second abbreviation result respectively comprises:
judging whether the length of the first initial abbreviation or of the second initial abbreviation exceeds a pre-set length threshold;
if the length of the first initial abbreviation exceeds the pre-set length threshold, sorting the words in the first initial abbreviation in descending order according to the part-of-speech tagging information and the TF-IDF values of the pre-processed words, and deleting the last word of the sorted first initial abbreviation repeatedly until the length of the first initial abbreviation is less than or equal to the pre-set length threshold, the first initial abbreviation remaining after deletion being taken as the first abbreviation result; and
if the length of the second initial abbreviation exceeds the pre-set length threshold, sorting the words in the second initial abbreviation in descending order according to the part-of-speech tagging information and the TF-IDF values of the pre-processed words, and deleting the last word of the sorted second initial abbreviation repeatedly until the length of the second initial abbreviation is less than or equal to the pre-set length threshold, the second initial abbreviation remaining after deletion being taken as the second abbreviation result.
8. The abbreviation generation method of an entity according to any one of claims 5 to 7, characterized by further comprising, after the first initial abbreviation and the second initial abbreviation are generated:
storing the first abbreviation result and the second abbreviation result into the training corpus so as to update the training corpus.
9. The abbreviation generation method of an entity according to claim 8, characterized in that comparing the first abbreviation result with the second abbreviation result and generating the final abbreviation of the full entity name data according to the comparison result comprises:
comparing the first abbreviation result with the second abbreviation result to generate a comparison result;
if the comparison result is that the first abbreviation result and the second abbreviation result are inconsistent, obtaining a first generating probability of the abbreviation output by the first deep learning model and a second generating probability of the abbreviation output by the second deep learning model; and
comparing the first generating probability with the second generating probability, and selecting the abbreviation result corresponding to the larger of the two probabilities as the final abbreviation of the full entity name data.
10. An abbreviation generating device of an entity, characterized by comprising:
a word-segmentation pre-processing unit, configured to obtain full name data of the entity, perform word-segmentation pre-processing on the full entity name data, split the full entity name data into multiple words, and generate a word frequency code representing the frequency of occurrence of each word in a pre-set corpus and part-of-speech tagging information representing the attributes of each word;
an initial abbreviation generation unit, configured to generate a first initial abbreviation and a second initial abbreviation respectively through a pre-trained first deep learning model and a pre-trained second deep learning model, according to the pre-processed words, the word frequency codes and the part-of-speech tagging information;
a check-correction processing unit, configured to perform check-correction processing on the first initial abbreviation and the second initial abbreviation according to pre-set check-correction rules, to generate a first abbreviation result and a second abbreviation result respectively; and
a final abbreviation generation unit, configured to compare the first abbreviation result with the second abbreviation result, and generate a final abbreviation of the full entity name data according to the comparison result.
11. The abbreviation generating device of an entity according to claim 10, characterized in that the word-segmentation pre-processing unit is specifically configured to:
obtain, from a pre-set word frequency information table, the frequency corresponding to each word after the full entity name data is split;
sort the words in descending order of frequency according to the frequency corresponding to each word, and generate the word frequency code corresponding to each word; and
store the word frequency code corresponding to each word in a pre-set word frequency coding table.
12. The abbreviation generating device of an entity according to claim 11, characterized in that the word-segmentation pre-processing unit is further configured to:
determine, according to a pre-set prefix-suffix dictionary, regional dictionary, industry dictionary and keyword dictionary, the dictionary in which each word obtained by splitting the full entity name data is located, and perform part-of-speech tagging on each word obtained by splitting the full entity name data to generate the part-of-speech tagging information; the part-of-speech tagging information includes prefix-suffix words, regional words, industry words and keywords.
13. The abbreviation generating device of an entity according to claim 12, characterized by further comprising:
a machine learning training unit, configured to perform machine learning training on the first deep learning model and the second deep learning model respectively according to a pre-set training corpus; the training corpus includes pre-set full entity name data and the abbreviation corresponding to each full entity name.
14. The abbreviation generating device of an entity according to claim 13, characterized in that the check-correction processing unit is specifically configured to:
judge whether two adjacent single-character words exist in the words obtained by the word-segmentation pre-processing;
if two adjacent single-character words exist in the pre-processed words and the first initial abbreviation is the word composed of the two adjacent single-character words, determine the first initial abbreviation as the first abbreviation result; if two adjacent single-character words exist in the pre-processed words and the second initial abbreviation is the word composed of the two adjacent single-character words, determine the second initial abbreviation as the second abbreviation result; and
if two adjacent single-character words exist in the pre-processed words and the first initial abbreviation is not the word composed of the two adjacent single-character words, obtain the pre-processed word with the lowest frequency according to the word frequency codes, and compose the two adjacent single-character words and the lowest-frequency word into the first abbreviation result; if two adjacent single-character words exist in the pre-processed words and the second initial abbreviation is not the word composed of the two adjacent single-character words, obtain the pre-processed word with the lowest frequency according to the word frequency codes, and compose the two adjacent single-character words and the lowest-frequency word into the second abbreviation result.
15. The abbreviation generating device of an entity according to claim 13, characterized in that the check-correction processing unit is specifically configured to:
judge whether the first initial abbreviation or the second initial abbreviation is a single character;
if the first initial abbreviation is a single character, obtain the two pre-processed words with the lowest frequencies according to the word frequency codes, and compose the single character and the two lowest-frequency words into the first abbreviation result; if the second initial abbreviation is a single character, obtain the two pre-processed words with the lowest frequencies according to the word frequency codes, and compose the single character and the two lowest-frequency words into the second abbreviation result.
16. The abbreviation generating device of an entity according to claim 13, characterized in that the check-correction processing unit is specifically configured to:
judge whether the length of the first initial abbreviation or of the second initial abbreviation exceeds a pre-set length threshold;
if the length of the first initial abbreviation exceeds the pre-set length threshold, sort the words in the first initial abbreviation in descending order according to the part-of-speech tagging information and the TF-IDF values of the pre-processed words, and delete the last word of the sorted first initial abbreviation repeatedly until the length of the first initial abbreviation is less than or equal to the pre-set length threshold, the first initial abbreviation remaining after deletion being taken as the first abbreviation result; and
if the length of the second initial abbreviation exceeds the pre-set length threshold, sort the words in the second initial abbreviation in descending order according to the part-of-speech tagging information and the TF-IDF values of the pre-processed words, and delete the last word of the sorted second initial abbreviation repeatedly until the length of the second initial abbreviation is less than or equal to the pre-set length threshold, the second initial abbreviation remaining after deletion being taken as the second abbreviation result.
17. The abbreviation generating device of an entity according to any one of claims 14 to 16, characterized by further comprising:
a storage unit, configured to store the first abbreviation result and the second abbreviation result into the training corpus so as to update the training corpus.
18. The abbreviation generating device of an entity according to claim 17, characterized in that the final abbreviation generation unit is specifically configured to:
compare the first abbreviation result with the second abbreviation result to generate a comparison result;
if the comparison result is that the first abbreviation result and the second abbreviation result are inconsistent, obtain a first generating probability of the abbreviation output by the first deep learning model and a second generating probability of the abbreviation output by the second deep learning model; and
compare the first generating probability with the second generating probability, and select the abbreviation result corresponding to the larger of the two probabilities as the final abbreviation of the full entity name data.
CN201710212978.8A 2017-04-01 2017-04-01 Entity abbreviation generation method and device Active CN106991085B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710212978.8A CN106991085B (en) 2017-04-01 2017-04-01 Entity abbreviation generation method and device


Publications (2)

Publication Number Publication Date
CN106991085A true CN106991085A (en) 2017-07-28
CN106991085B CN106991085B (en) 2020-08-04

Family

ID=59415942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710212978.8A Active CN106991085B (en) 2017-04-01 2017-04-01 Entity abbreviation generation method and device

Country Status (1)

Country Link
CN (1) CN106991085B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107783960A (en) * 2017-10-23 2018-03-09 百度在线网络技术(北京)有限公司 Method, apparatus and equipment for Extracting Information
CN108460014A (en) * 2018-02-07 2018-08-28 百度在线网络技术(北京)有限公司 Recognition methods, device, computer equipment and the storage medium of business entity
CN108460016A (en) * 2018-02-09 2018-08-28 中云开源数据技术(上海)有限公司 A kind of entity name analysis recognition method
CN109033082A (en) * 2018-07-19 2018-12-18 深圳创维数字技术有限公司 The learning training method, apparatus and computer readable storage medium of semantic model
RU2678716C1 (en) * 2017-12-11 2019-01-31 Общество с ограниченной ответственностью "Аби Продакшн" Use of autoencoders for learning text classifiers in natural language
WO2019095899A1 (en) * 2017-11-17 2019-05-23 中兴通讯股份有限公司 Material annotation method and apparatus, terminal, and computer readable storage medium
CN110096571A (en) * 2019-04-10 2019-08-06 北京明略软件系统有限公司 A kind of mechanism name abbreviation generation method and device, computer readable storage medium
CN110134779A (en) * 2019-05-13 2019-08-16 极智(上海)企业管理咨询有限公司 A kind of method of enterprise name processing
CN111507108A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Alias generation method and device, electronic equipment and computer readable storage medium
CN112613299A (en) * 2020-12-25 2021-04-06 北京知因智慧科技有限公司 Method and device for constructing enterprise synonym library and electronic equipment
TWI733012B (en) * 2018-03-29 2021-07-11 中華電信股份有限公司 Dialogue system and method of integrating intentions and hot keys
CN113642867A (en) * 2021-07-30 2021-11-12 南京星云数字技术有限公司 Method and system for assessing risk
CN114020973A (en) * 2021-11-26 2022-02-08 盐城金堤科技有限公司 Public opinion data acquisition method and device, computer storage medium and electronic equipment
CN114707500A (en) * 2022-03-17 2022-07-05 深圳前海微众银行股份有限公司 Work unit name verification method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN104572625A (en) * 2015-01-21 2015-04-29 北京云知声信息技术有限公司 Recognition method of named entity
WO2015080558A1 (en) * 2013-11-27 2015-06-04 Mimos Berhad A method and system for automated entity recognition
CN105975555A (en) * 2016-05-03 2016-09-28 成都数联铭品科技有限公司 Enterprise abbreviation extraction method based on bidirectional recurrent neural network
CN106503192A (en) * 2016-10-31 2017-03-15 北京百度网讯科技有限公司 Name entity recognition method and device based on artificial intelligence
CN106598950A (en) * 2016-12-23 2017-04-26 东北大学 Method for recognizing named entity based on mixing stacking model


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN Liping, GUO Ge, TANG Wenwu, XU Yongbin: "Prediction of enterprise abbreviations based on constitution patterns and conditional random fields", Journal of Computer Applications *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11288593B2 (en) 2017-10-23 2022-03-29 Baidu Online Network Technology (Beijing) Co., Ltd. Method, apparatus and device for extracting information
CN107783960A (en) * 2017-10-23 2018-03-09 百度在线网络技术(北京)有限公司 Method, apparatus and equipment for Extracting Information
WO2019095899A1 (en) * 2017-11-17 2019-05-23 中兴通讯股份有限公司 Material annotation method and apparatus, terminal, and computer readable storage medium
RU2678716C1 (en) * 2017-12-11 2019-01-31 Общество с ограниченной ответственностью "Аби Продакшн" Use of autoencoders for learning text classifiers in natural language
CN108460014B (en) * 2018-02-07 2022-02-25 百度在线网络技术(北京)有限公司 Enterprise entity identification method and device, computer equipment and storage medium
CN108460014A (en) * 2018-02-07 2018-08-28 百度在线网络技术(北京)有限公司 Recognition methods, device, computer equipment and the storage medium of business entity
CN108460016A (en) * 2018-02-09 2018-08-28 中云开源数据技术(上海)有限公司 A kind of entity name analysis recognition method
TWI733012B (en) * 2018-03-29 2021-07-11 中華電信股份有限公司 Dialogue system and method of integrating intentions and hot keys
CN109033082A (en) * 2018-07-19 2018-12-18 深圳创维数字技术有限公司 The learning training method, apparatus and computer readable storage medium of semantic model
CN109033082B (en) * 2018-07-19 2022-06-10 深圳创维数字技术有限公司 Learning training method and device of semantic model and computer readable storage medium
CN110096571A (en) * 2019-04-10 2019-08-06 北京明略软件系统有限公司 A kind of mechanism name abbreviation generation method and device, computer readable storage medium
CN110096571B (en) * 2019-04-10 2021-06-08 北京明略软件系统有限公司 Mechanism name abbreviation generation method and device and computer readable storage medium
CN110134779A (en) * 2019-05-13 2019-08-16 极智(上海)企业管理咨询有限公司 A kind of method of enterprise name processing
CN111507108A (en) * 2020-04-17 2020-08-07 腾讯科技(深圳)有限公司 Alias generation method and device, electronic equipment and computer readable storage medium
CN111507108B (en) * 2020-04-17 2021-03-19 腾讯科技(深圳)有限公司 Alias generation method and device, electronic equipment and computer readable storage medium
CN112613299A (en) * 2020-12-25 2021-04-06 北京知因智慧科技有限公司 Method and device for constructing enterprise synonym library and electronic equipment
CN112613299B (en) * 2020-12-25 2024-07-02 北京知因智慧科技有限公司 Method and device for constructing enterprise synonym library and electronic equipment
CN113642867A (en) * 2021-07-30 2021-11-12 南京星云数字技术有限公司 Method and system for assessing risk
CN114020973A (en) * 2021-11-26 2022-02-08 盐城金堤科技有限公司 Public opinion data acquisition method and device, computer storage medium and electronic equipment
CN114707500A (en) * 2022-03-17 2022-07-05 深圳前海微众银行股份有限公司 Work unit name verification method and device

Also Published As

Publication number Publication date
CN106991085B (en) 2020-08-04

Similar Documents

Publication Publication Date Title
CN106991085A (en) Abbreviation generation method and device for an entity
Bakhtin et al. Real or fake? learning to discriminate machine from human generated text
CN107133213B (en) Method and system for automatically extracting text summaries based on an algorithm
CN110990564B (en) Negative news identification method based on affective computing and a multi-head attention mechanism
CN110427623A (en) Semi-structured document knowledge extraction method, device, electronic equipment and storage medium
Terechshenko et al. A comparison of methods in political science text classification: Transfer learning language models for politics
CN107247702A (en) Text sentiment analysis and processing method and system
CN106855853A (en) Entity relation extraction system based on deep neural network
CN106445919A (en) Sentiment classification method and device
CN109885670A (en) Interactive attention encoding sentiment analysis method for topic text
CN106650789A (en) Image description generation method based on a deep LSTM network
CN109977234A (en) Knowledge graph completion method based on topic keyword filtering
CN106598950A (en) Named entity recognition method based on a hybrid stacking model
CN110598219A (en) Sentiment analysis method for Douban movie reviews
CN106096664A (en) Sentiment analysis method based on social network data
CN112800239B (en) Training method of intention recognition model, and intention recognition method and device
Li et al. TSQA: tabular scenario based question answering
CN107357785A (en) Topic feature word extraction method and system, and sentiment polarity determination method and system
CN116010581A (en) Knowledge graph question-answering method and system for power grid hidden danger troubleshooting scenarios
CN114648029A (en) Named entity recognition method for the electric power domain based on a BiLSTM-CRF model
CN115630156A (en) Mongolian sentiment analysis method and system fusing Prompt and SRU
CN115269834A (en) High-precision text classification method and device based on BERT
WO2022061877A1 (en) Event extraction and extraction model training method, apparatus and device, and medium
Basri et al. A deep learning based sentiment analysis on bang-lish disclosure
CN111507101A (en) Irony detection method based on multi-level semantic capsule routing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant