CN106991085A - Method and device for generating an abbreviation of an entity - Google Patents
- Publication number
- CN106991085A CN106991085A CN201710212978.8A CN201710212978A CN106991085A CN 106991085 A CN106991085 A CN 106991085A CN 201710212978 A CN201710212978 A CN 201710212978A CN 106991085 A CN106991085 A CN 106991085A
- Authority
- CN
- China
- Prior art keywords
- word
- abbreviation
- entity
- initial
- result
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a method and device for generating an abbreviation of an entity, relating to the field of deep learning in computer information systems. The method includes: obtaining full-name data of an entity, performing word-processing pre-treatment on the full-name data, splitting the full-name data into multiple words, and generating word-frequency codes representing each word's frequency of occurrence in a pre-set corpus and part-of-speech tagging information representing the attributes of each word; generating a first initial abbreviation and a second initial abbreviation from the pre-processed words, word-frequency codes and part-of-speech tagging information by means of a pre-trained first deep learning model and a pre-trained second deep learning model, respectively; performing check-and-correct processing on the first and second initial abbreviations according to pre-set correction rules, to generate a first abbreviation result and a second abbreviation result, respectively; and comparing the first abbreviation result with the second abbreviation result and generating the final abbreviation of the entity's full-name data according to the comparison result.
Description
Technical field
The present invention relates to the field of deep learning in computer information systems, and in particular to a method and device for generating an abbreviation of an entity.
Background art
With the rapid spread and development of Internet and computer information technology, the Internet has entered the big-data era. In this era, the number of text-based reports on and evaluations of all kinds of entities (such as enterprises, government agencies and social organizations) keeps growing, and an entity needs to collect and recognize the news information associated with its own name (such as an enterprise name or a government agency name) for use in scenarios such as business risk identification and public opinion analysis. The conventional approach is to collect all news reports in full, identify entity names (enterprise names, government agency names, social organization names and so on) in the collected text, and then classify the corresponding reports as associated news. In many news reports, however, for reasons such as conciseness and economy of expression, the media often refer to an entity by an abbreviation. An abbreviation is an appellation formed by extracting the representative words from the original words, i.e. the full name (for example, the People's Republic of China is abbreviated to China, and the Industrial and Commercial Bank of China to ICBC). When collecting and analyzing large volumes of information, judging whether a news item relates to an entity therefore depends on accurately recognizing the entity's abbreviation.
Consequently, a method for intelligently generating abbreviations is urgently needed, so that data related to an entity's abbreviation can be accurately collected from network data such as news reports.
Summary of the invention
Embodiments of the present invention provide a method and device for generating an abbreviation of an entity, to solve the prior-art problems of low abbreviation generation accuracy and non-unique results.
To achieve the above object, the present invention adopts the following technical solutions:
A method for generating an abbreviation of an entity, including:
obtaining full-name data of an entity, performing word-processing pre-treatment on the full-name data, splitting the full-name data into multiple words, and generating word-frequency codes representing each word's frequency of occurrence in a pre-set corpus and part-of-speech tagging information representing the attributes of each word;
generating a first initial abbreviation and a second initial abbreviation from the pre-processed words, word-frequency codes and part-of-speech tagging information by means of a pre-trained first deep learning model and a pre-trained second deep learning model, respectively;
performing check-and-correct processing on the first and second initial abbreviations according to pre-set correction rules, to generate a first abbreviation result and a second abbreviation result, respectively;
comparing the first abbreviation result with the second abbreviation result, and generating the final abbreviation of the entity's full-name data according to the comparison result.
Specifically, obtaining the full-name data of the entity, performing word-processing pre-treatment, splitting the full-name data into multiple words, and generating the word-frequency codes and part-of-speech tagging information includes:
obtaining, from a pre-set word-frequency information table, the frequency of each word obtained by splitting the full-name data;
sorting the words in descending order of frequency according to the frequency of each word, and generating the word-frequency code of each word;
storing the word-frequency code of each word in a pre-set word-frequency coding table.
Further, obtaining the full-name data of the entity, performing word-processing pre-treatment, splitting the full-name data into multiple words, and generating the word-frequency codes and part-of-speech tagging information also includes:
determining, according to a pre-set affix dictionary, region dictionary, industry dictionary and keyword dictionary, the dictionary to which each word obtained by splitting the full-name data belongs, performing part-of-speech tagging on each split word, and generating the part-of-speech tagging information; the part-of-speech tagging information includes affix words, region words, industry words and keywords.
Further, the method for generating an abbreviation of an entity also includes:
performing machine-learning training on the first deep learning model and the second deep learning model respectively, using a pre-set training corpus; the training corpus includes pre-set entity full-name data and the abbreviation corresponding to each entity full name.
Specifically, performing check-and-correct processing on the first and second initial abbreviations according to the pre-set correction rules to generate the first and second abbreviation results includes:
judging whether the pre-processed words contain two adjacent single-character words;
if the pre-processed words contain two adjacent single-character words and the first initial abbreviation is the word formed by those two characters, determining the first initial abbreviation to be the first abbreviation result; if the pre-processed words contain two adjacent single-character words and the second initial abbreviation is the word formed by those two characters, determining the second initial abbreviation to be the second abbreviation result;
if the pre-processed words contain two adjacent single-character words but the first initial abbreviation is not the word formed by those two characters, obtaining the lowest-frequency pre-processed word according to the word-frequency codes and composing the first abbreviation result from the two adjacent single-character words and that lowest-frequency word; if the pre-processed words contain two adjacent single-character words but the second initial abbreviation is not the word formed by those two characters, obtaining the lowest-frequency pre-processed word according to the word-frequency codes and composing the second abbreviation result from the two adjacent single-character words and that lowest-frequency word.
Specifically, performing check-and-correct processing on the first and second initial abbreviations according to the pre-set correction rules to generate the first and second abbreviation results includes:
judging whether the first initial abbreviation or the second initial abbreviation is a single character;
if the first initial abbreviation is a single character, obtaining the two lowest-frequency pre-processed words according to the word-frequency codes and composing the first abbreviation result from the single character and those two words; if the second initial abbreviation is a single character, obtaining the two lowest-frequency pre-processed words according to the word-frequency codes and composing the second abbreviation result from the single character and those two words.
Specifically, performing check-and-correct processing on the first and second initial abbreviations according to the pre-set correction rules to generate the first and second abbreviation results includes:
judging whether the length of the first initial abbreviation or the second initial abbreviation exceeds a pre-set length threshold;
if the length of the first initial abbreviation exceeds the pre-set length threshold, sorting the words of the first initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the pre-processed words, and deleting the last word of the sorted first initial abbreviation repeatedly until its length is less than or equal to the pre-set length threshold, the trimmed first initial abbreviation serving as the first abbreviation result; if the length of the second initial abbreviation exceeds the pre-set length threshold, sorting the words of the second initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the pre-processed words, and deleting the last word of the sorted second initial abbreviation repeatedly until its length is less than or equal to the pre-set length threshold, the trimmed second initial abbreviation serving as the second abbreviation result.
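Purely as an illustration, and not as part of the claimed subject matter, the three check-and-correct rules above can be sketched in Python roughly as follows. The function name `correct`, the convention that a word-frequency code is a frequency rank (a larger code meaning a rarer word), and the character-by-character trimming in the length rule are simplifying assumptions; the patent itself trims whole words ordered by part-of-speech and TF-IDF, which is elided here.

```python
def correct(initial, words, freq_code, max_len=4):
    """Sketch of the check-and-correct rules applied to one initial abbreviation.

    words     -- the pre-processed (segmented) words of the full name
    freq_code -- word -> frequency rank (assumption: higher = rarer)
    """
    # Rule 1: the segmentation contains two adjacent single-character words.
    singles = [w1 + w2 for w1, w2 in zip(words, words[1:])
               if len(w1) == 1 and len(w2) == 1]
    if singles and initial != singles[0]:
        # Not the word formed by the two characters: append the rarest word.
        rarest = max(words, key=lambda w: freq_code.get(w, 0))
        initial = singles[0] + rarest
    # Rule 2: the initial abbreviation is a single character.
    if len(initial) == 1:
        rare2 = sorted(words, key=lambda w: freq_code.get(w, 0))[-2:]
        initial += "".join(rare2)
    # Rule 3: the abbreviation exceeds the length threshold.
    # (Simplified to character trimming; the patent deletes whole words
    # ranked by part-of-speech and TF-IDF.)
    while len(initial) > max_len:
        initial = initial[:-1]
    return initial

print(correct("股份", ["大", "唐", "电信", "科技", "股份"],
              {"大": 7, "唐": 8, "电信": 50, "科技": 10, "股份": 3}))
# → 大唐电信
```

In this sketch each rule only fires when its trigger condition holds, so an abbreviation that already satisfies all three checks is returned unchanged, matching the "determine the initial abbreviation to be the abbreviation result" branches above.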
Further, after the first initial abbreviation and the second initial abbreviation are generated, the method also includes:
storing the first abbreviation result and the second abbreviation result in the training corpus, so as to update the training corpus.
Specifically, comparing the first abbreviation result with the second abbreviation result and generating the final abbreviation of the entity's full-name data according to the comparison result includes:
comparing the first abbreviation result with the second abbreviation result to generate a comparison result;
if the comparison result indicates that the first abbreviation result and the second abbreviation result are inconsistent, obtaining the first generation probability with which the first deep learning model outputs its abbreviation and the second generation probability with which the second deep learning model outputs its abbreviation;
comparing the first generation probability with the second generation probability, and selecting the abbreviation result corresponding to the larger of the two probabilities as the final abbreviation of the entity's full-name data.
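As an illustration only, the final selection step can be sketched as follows. The function name `choose_final` and the tie-breaking behavior (the first result wins when the probabilities are equal) are assumptions, since the patent does not define tie handling.

```python
def choose_final(result1, prob1, result2, prob2):
    """Return the final abbreviation from the two corrected results.

    If the results agree, the common result is final; otherwise the
    result whose model reported the higher generation probability wins.
    """
    if result1 == result2:
        return result1
    return result1 if prob1 >= prob2 else result2

print(choose_final("大唐电信", 0.82, "大唐", 0.91))  # → 大唐
```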
A device for generating an abbreviation of an entity, including:
a word-processing pre-treatment unit, configured to obtain full-name data of an entity, perform word-processing pre-treatment on the full-name data, split the full-name data into multiple words, and generate word-frequency codes representing each word's frequency of occurrence in a pre-set corpus and part-of-speech tagging information representing the attributes of each word;
an initial abbreviation generation unit, configured to generate a first initial abbreviation and a second initial abbreviation from the pre-processed words, word-frequency codes and part-of-speech tagging information by means of a pre-trained first deep learning model and a pre-trained second deep learning model, respectively;
a check-and-correct processing unit, configured to perform check-and-correct processing on the first and second initial abbreviations according to pre-set correction rules, to generate a first abbreviation result and a second abbreviation result, respectively;
a final abbreviation generation unit, configured to compare the first abbreviation result with the second abbreviation result and generate the final abbreviation of the entity's full-name data according to the comparison result.
In addition, the word-processing pre-treatment unit is specifically configured to:
obtain, from a pre-set word-frequency information table, the frequency of each word obtained by splitting the full-name data;
sort the words in descending order of frequency according to the frequency of each word, and generate the word-frequency code of each word;
store the word-frequency code of each word in a pre-set word-frequency coding table.
In addition, the word-processing pre-treatment unit is further configured to:
determine, according to a pre-set affix dictionary, region dictionary, industry dictionary and keyword dictionary, the dictionary to which each word obtained by splitting the full-name data belongs, perform part-of-speech tagging on each split word, and generate the part-of-speech tagging information; the part-of-speech tagging information includes affix words, region words, industry words and keywords.
Further, the device for generating an abbreviation of an entity also includes:
a machine-learning training unit, configured to perform machine-learning training on the first deep learning model and the second deep learning model respectively, using a pre-set training corpus; the training corpus includes pre-set entity full-name data and the abbreviation corresponding to each entity full name.
In addition, the check-and-correct processing unit is specifically configured to:
judge whether the pre-processed words contain two adjacent single-character words;
if the pre-processed words contain two adjacent single-character words and the first initial abbreviation is the word formed by those two characters, determine the first initial abbreviation to be the first abbreviation result; if the pre-processed words contain two adjacent single-character words and the second initial abbreviation is the word formed by those two characters, determine the second initial abbreviation to be the second abbreviation result;
if the pre-processed words contain two adjacent single-character words but the first initial abbreviation is not the word formed by those two characters, obtain the lowest-frequency pre-processed word according to the word-frequency codes and compose the first abbreviation result from the two adjacent single-character words and that lowest-frequency word; if the pre-processed words contain two adjacent single-character words but the second initial abbreviation is not the word formed by those two characters, obtain the lowest-frequency pre-processed word according to the word-frequency codes and compose the second abbreviation result from the two adjacent single-character words and that lowest-frequency word.
In addition, the check-and-correct processing unit is specifically configured to:
judge whether the first initial abbreviation or the second initial abbreviation is a single character;
if the first initial abbreviation is a single character, obtain the two lowest-frequency pre-processed words according to the word-frequency codes and compose the first abbreviation result from the single character and those two words; if the second initial abbreviation is a single character, obtain the two lowest-frequency pre-processed words according to the word-frequency codes and compose the second abbreviation result from the single character and those two words.
In addition, the check-and-correct processing unit is specifically configured to:
judge whether the length of the first initial abbreviation or the second initial abbreviation exceeds a pre-set length threshold;
if the length of the first initial abbreviation exceeds the pre-set length threshold, sort the words of the first initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the pre-processed words, and delete the last word of the sorted first initial abbreviation repeatedly until its length is less than or equal to the pre-set length threshold, the trimmed first initial abbreviation serving as the first abbreviation result; if the length of the second initial abbreviation exceeds the pre-set length threshold, sort the words of the second initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the pre-processed words, and delete the last word of the sorted second initial abbreviation repeatedly until its length is less than or equal to the pre-set length threshold, the trimmed second initial abbreviation serving as the second abbreviation result.
Further, the device for generating an abbreviation of an entity also includes:
a storage unit, configured to store the first abbreviation result and the second abbreviation result in the training corpus, so as to update the training corpus.
In addition, the final abbreviation generation unit is specifically configured to:
compare the first abbreviation result with the second abbreviation result to generate a comparison result;
if the comparison result indicates that the first abbreviation result and the second abbreviation result are inconsistent, obtain the first generation probability with which the first deep learning model outputs its abbreviation and the second generation probability with which the second deep learning model outputs its abbreviation;
compare the first generation probability with the second generation probability, and select the abbreviation result corresponding to the larger of the two probabilities as the final abbreviation of the entity's full-name data.
With the method and device for generating an abbreviation of an entity provided by the embodiments of the present invention, the full-name data of an entity is first obtained, word-processing pre-treatment is performed on it, the full-name data is split into multiple words, and word-frequency codes representing each word's frequency of occurrence in a pre-set corpus and part-of-speech tagging information representing the attributes of each word are generated. A first initial abbreviation and a second initial abbreviation are then generated from the pre-processed words, word-frequency codes and part-of-speech tagging information by means of a pre-trained first deep learning model and a pre-trained second deep learning model, respectively; check-and-correct processing is performed on the two initial abbreviations according to pre-set correction rules, generating a first abbreviation result and a second abbreviation result; and the two abbreviation results are compared, the final abbreviation of the entity's full-name data being generated according to the comparison result. By adopting deep learning, the embodiments of the present invention exploit the advantage that deep learning models require no hand-engineered features, and by fusing part-of-speech and word-frequency information into the models they extend the feature range. After iterative self-learning on full names, an accurate, unique correspondence between an entity's full name and its abbreviation is ultimately formed, which solves the prior-art problems of low abbreviation generation accuracy and non-unique results.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative labor.
Fig. 1 is a first flow chart of a method for generating an abbreviation of an entity according to an embodiment of the present invention;
Fig. 2 is a second flow chart of a method for generating an abbreviation of an entity according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the framework of the first deep learning model in an embodiment of the present invention;
Fig. 4 is a schematic diagram of the framework of the second deep learning model in an embodiment of the present invention;
Fig. 5 is a schematic flow chart of training the first deep learning model in an embodiment of the present invention;
Fig. 6 is a schematic flow chart of training the second deep learning model in an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a device for generating an abbreviation of an entity according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative labor fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a method for generating an abbreviation of an entity, including:
Step 101: obtain full-name data of an entity, perform word-processing pre-treatment on the full-name data, split the full-name data into multiple words, and generate word-frequency codes representing each word's frequency of occurrence in a pre-set corpus and part-of-speech tagging information representing the attributes of each word.
Step 102: generate a first initial abbreviation and a second initial abbreviation from the pre-processed words, word-frequency codes and part-of-speech tagging information by means of a pre-trained first deep learning model and a pre-trained second deep learning model, respectively.
Step 103: perform check-and-correct processing on the first and second initial abbreviations according to pre-set correction rules, to generate a first abbreviation result and a second abbreviation result, respectively.
Step 104: compare the first abbreviation result with the second abbreviation result, and generate the final abbreviation of the entity's full-name data according to the comparison result.
With the method for generating an abbreviation of an entity provided by this embodiment of the present invention, the full-name data of an entity is first obtained, word-processing pre-treatment is performed on it, the full-name data is split into multiple words, and word-frequency codes representing each word's frequency of occurrence in a pre-set corpus and part-of-speech tagging information representing the attributes of each word are generated. A first initial abbreviation and a second initial abbreviation are then generated from the pre-processed words, word-frequency codes and part-of-speech tagging information by means of a pre-trained first deep learning model and a pre-trained second deep learning model, respectively; check-and-correct processing is performed on the two initial abbreviations according to pre-set correction rules, generating a first abbreviation result and a second abbreviation result; and the two abbreviation results are compared, the final abbreviation of the entity's full-name data being generated according to the comparison result. By adopting deep learning, this embodiment exploits the advantage that deep learning models require no hand-engineered features, and by fusing part-of-speech and word-frequency information into the models it extends the feature range. After iterative self-learning on full names, an accurate, unique correspondence between an entity's full name and its abbreviation is ultimately formed, which solves the prior-art problems of low abbreviation generation accuracy and non-unique results.
To help those skilled in the art better understand the present invention, a more detailed embodiment is set forth below. As shown in Fig. 2, an embodiment of the present invention provides a method for generating an abbreviation of an entity, including:
Step 201: obtain full-name data of an entity, perform word-processing pre-treatment on the full-name data, and split the full-name data into multiple words.
The full-name data is split into words here because the word is the basic semantic unit, which conforms to the grammar of the Chinese language. Word segmentation can, for example, be performed with the Chinese word segmentation tool Jieba, although the invention is not limited to this. For example, "Datang Telecom Technology Co., Ltd." (大唐电信科技股份有限公司) can be split into the words "Datang" (大唐), "Telecom" (电信), "Technology" (科技), "Shares" (股份), "Limited" (有限) and "Company" (公司).
Step 202: obtain, from a pre-set word-frequency information table, the frequency of each word obtained by splitting the full-name data.
It should be noted that the pre-set word-frequency information table records frequency information for a large number of words and characters; the frequency can be the number of times the word or character appears in the full-name data of various entities. It can be represented in the form "word: number", where the number denotes the frequency, for example: credit: 8; new master: 1; cyclopentadienyl: 13; power generation: 54; Zhangzhou: 2; Audi: 1; packaging: 19; Fuling: 2; second: 3; she: 2; Henan: 34; copper foil: 1; safety: 2; carbon black: 1; clean-soft: 1; anti-: 1; technique: 3; micro-: 1; paradise: 2; lead-zinc: 1; hundred rivers: 1.
Step 203: sort the words in descending order of frequency according to the frequency corresponding to each word, and generate the word frequency code corresponding to each word.
For example, taking the enterprise name "Datang Telecom Technology Co., Ltd.": suppose the frequency of "Limited" is 100, the frequency of "Share" is 20, and the frequency of "Company" is 150; then the word frequency codes can be as follows: Company: 1; Limited: 2; Share: 3.
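Step 203 can be sketched as follows: the rank of each word under a descending frequency sort becomes its word frequency code. The function name and the sample counts are illustrative only (the counts echo the example in the text).

```python
def word_freq_codes(freqs):
    # Rank words by descending raw frequency; the 1-based rank becomes
    # the word frequency code stored in the coding table (step 204).
    ranked = sorted(freqs, key=lambda w: freqs[w], reverse=True)
    return {w: i + 1 for i, w in enumerate(ranked)}

freqs = {"公司": 150, "有限": 100, "股份": 20}   # Company / Limited / Share
print(word_freq_codes(freqs))                     # Company→1, Limited→2, Share→3
```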
Step 204: store the word frequency code corresponding to each word in a preset word frequency coding table.
In this way, the word frequency code of each word is kept in the word frequency coding table, ready for later use, for example when building the training corpus.
Step 205: determine, according to a preset affix dictionary, region dictionary, industry dictionary and keyword dictionary, the dictionary to which each word of the split full name data belongs; perform part-of-speech tagging on each word after the full name data is split, and generate the part-of-speech tagging information.
The part-of-speech tagging information covers affix words, region words, industry words and keywords.
For example, the affix dictionary may record words such as "Company", "Liability", "Limited" and "Industry" that frequently appear at the beginning or end of the names of entities such as enterprises and institutions. The region dictionary may record words representing regions, such as "China", "Beijing", "Shanghai" and "Henan". The industry dictionary may record words representing industries, such as "intellectual property", "patent agency", "bank" and "pharmacy". The keyword dictionary may record the keyword of the full name of a specific entity; for example, the keyword of "Industrial and Commercial Bank of China" is "Industrial and Commercial", the keyword of "Beijing Sanyou Intellectual Property Agency Co., Ltd." is "Sanyou", and the keyword of "Beijing Hyundai Motor Co., Ltd." is "Hyundai". By querying the preset affix, region, industry and keyword dictionaries, the dictionary to which each split word belongs, and hence the part of speech of each word, can be determined. For example, for "Beijing Sanyou Intellectual Property Agency Co., Ltd.", "Beijing" is a region word, "Sanyou" is a keyword, "intellectual property agency" is an industry word, and "Co., Ltd." is an affix word.
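The dictionary lookup in step 205 reduces to set membership tests. The sketch below is illustrative only: the four lexicons are tiny stand-ins for the preset dictionaries, and the tag names are assumptions.

```python
# Illustrative lexicons; the patent's preset dictionaries are far larger.
AFFIX = {"有限", "公司", "股份", "责任", "实业"}
REGION = {"北京", "上海", "河南", "中国"}
INDUSTRY = {"知识产权", "代理", "银行", "制药"}
KEYWORD = {"三友", "工商", "现代"}

def tag_words(words):
    # Look each segmented word up in the four lexicons to produce the
    # part-of-speech annotation used later as a model feature.
    tags = []
    for w in words:
        if w in AFFIX:
            tags.append((w, "affix"))
        elif w in REGION:
            tags.append((w, "region"))
        elif w in INDUSTRY:
            tags.append((w, "industry"))
        elif w in KEYWORD:
            tags.append((w, "keyword"))
        else:
            tags.append((w, "other"))
    return tags
```

For example, `tag_words(["北京", "三友", "知识产权", "代理", "有限", "公司"])` tags the region word, the keyword, the industry words and the affix words of the "Beijing Sanyou Intellectual Property Agency Co., Ltd." example in turn.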
Step 206: perform machine learning training on the first deep learning model and the second deep learning model respectively, according to a preset training corpus.
The training corpus includes preset entity full name data and the abbreviation corresponding to each entity full name. The abbreviation corresponding to each entity full name can be labeled manually, or the first abbreviation result and second abbreviation result from step 209 can be stored back into the training corpus. It should be noted that the larger the data volume, the more accurate the features learned by the deep learning models. The training corpus usually accounts for about 80% of the whole sample set.
In addition, in order to verify the generalization ability of the deep learning models, a validation corpus also needs to be set. The validation corpus must be independently and identically distributed with the training corpus, so that a model trained on the training corpus also performs comparably well on the validation corpus. The validation corpus usually accounts for about 20% of the whole sample set.
Here, the specific training process may refer to subsequent Fig. 5 and Fig. 6.
Step 207: generate the first initial abbreviation and the second initial abbreviation respectively through the pre-trained first deep learning model and second deep learning model, according to the preprocessed words, word frequency codes and part-of-speech tagging information.
The process of generating the first initial abbreviation through the first deep learning model is consistent with steps 504 to 505 in subsequent Fig. 5, and the process of generating the second initial abbreviation through the second deep learning model is consistent with the corresponding forward steps in subsequent Fig. 6 (steps 604 to 606); the generation process differs from training mainly in that no error back-propagation is performed.
Step 208: perform verification and correction processing on the first initial abbreviation and the second initial abbreviation according to preset verification and correction rules, and generate the first abbreviation result and the second abbreviation result respectively.
Specifically, step 208 can verify and correct according to the word length of the initial abbreviation, for example in the following ways:
1. Judge whether two adjacent single-character words exist among the preprocessed words.
If two adjacent single-character words exist among the preprocessed words, and the first initial abbreviation is the word composed of the two adjacent single-character words, the first initial abbreviation is determined to be the first abbreviation result; if two adjacent single-character words exist among the preprocessed words, and the second initial abbreviation is the word composed of the two adjacent single-character words, the second initial abbreviation is determined to be the second abbreviation result.
If two adjacent single-character words exist among the preprocessed words, and the first initial abbreviation is not the word composed of the two adjacent single-character words, the word with the lowest frequency among the preprocessed words is obtained according to the word frequency codes, and the two adjacent single-character words and the lowest-frequency word are composed into the first abbreviation result; if two adjacent single-character words exist among the preprocessed words, and the second initial abbreviation is not the word composed of the two adjacent single-character words, the lowest-frequency word is obtained according to the word frequency codes, and the two adjacent single-character words and the lowest-frequency word are composed into the second abbreviation result.
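Rule 1 above can be sketched as follows. The function name is an assumption; "lowest frequency" is read as the largest word frequency code, since code 1 denotes the most frequent word (steps 203-204).

```python
def correct_adjacent_single_chars(words, initial_abbr, freq_codes):
    # Rule 1 sketch: if the segmented full name contains two adjacent
    # single-character words, their concatenation is the expected
    # abbreviation; otherwise the lowest-frequency word (largest word
    # frequency code) is appended to the pair to form the result.
    for a, b in zip(words, words[1:]):
        if len(a) == 1 and len(b) == 1:
            pair = a + b
            if initial_abbr == pair:
                return initial_abbr            # abbreviation confirmed as-is
            rarest = max(words, key=lambda w: freq_codes.get(w, 0))
            return pair + rarest               # corrected abbreviation
    return initial_abbr                        # rule does not apply
```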
2. Judge whether the first initial abbreviation or the second initial abbreviation is a single character.
If the first initial abbreviation is a single character, the two words with the lowest frequency among the preprocessed words are obtained according to the word frequency codes, and the single character and the two lowest-frequency words are composed into the first abbreviation result; if the second initial abbreviation is a single character, the two lowest-frequency words are obtained according to the word frequency codes, and the single character and the two lowest-frequency words are composed into the second abbreviation result.
3. Judge whether the abbreviation length of the first initial abbreviation or the second initial abbreviation exceeds a preset length threshold.
If the abbreviation length of the first initial abbreviation exceeds the preset length threshold, the words in the first initial abbreviation are sorted in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and the last word of the sorted first initial abbreviation is deleted repeatedly until the abbreviation length is less than or equal to the preset length threshold; the first initial abbreviation after deletion serves as the first abbreviation result.
If the abbreviation length of the second initial abbreviation exceeds the preset length threshold, the words in the second initial abbreviation are sorted in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and the last word of the sorted second initial abbreviation is deleted repeatedly until the abbreviation length is less than or equal to the preset length threshold; the second initial abbreviation after deletion serves as the second abbreviation result.
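Rule 3 can be sketched as below. As a simplifying assumption, the descending sort uses TF-IDF weight alone; the patent's actual rule also takes the part-of-speech tagging information into account.

```python
def truncate_abbr(abbr_words, tfidf, max_len):
    # Rule 3 sketch: when the abbreviation exceeds the length threshold,
    # sort its words by descending TF-IDF weight and drop the least
    # important (last) word repeatedly until the length fits.
    ranked = sorted(abbr_words, key=lambda w: tfidf.get(w, 0.0), reverse=True)
    while ranked and sum(len(w) for w in ranked) > max_len:
        ranked.pop()
    return "".join(ranked)
```

For example, with a length threshold of 4, a three-word initial abbreviation keeps only its two highest-weighted words.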
Step 209: store the first abbreviation result and the second abbreviation result into the training corpus to update the training corpus.
Step 210: compare the first abbreviation result and the second abbreviation result, generate a comparison result, and determine whether the first abbreviation result is consistent with the second abbreviation result.
If the comparison result is that the first abbreviation result is consistent with the second abbreviation result, step 211 is executed; otherwise, if the comparison result is that the first abbreviation result and the second abbreviation result are inconsistent, step 212 is executed.
Step 211: select the first abbreviation result or the second abbreviation result as the final abbreviation of the entity full name data.
Step 212: obtain the first generation probability with which the first deep learning model outputs its abbreviation and the second generation probability with which the second deep learning model outputs its abbreviation, compare the first generation probability and the second generation probability, and select the abbreviation result corresponding to the larger of the two as the final abbreviation of the entity full name data.
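Steps 210 to 212 amount to a simple selection rule, sketched below; the function name is an assumption, and ties are broken in favor of the first model for illustration.

```python
def final_abbreviation(result1, p1, result2, p2):
    # Steps 210-212 sketch: if the two models agree, take either result;
    # otherwise take the result whose model reported the higher
    # generation probability.
    if result1 == result2:
        return result1
    return result1 if p1 >= p2 else result2
```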
To help those skilled in the art understand how step 206 trains the first deep learning model and the second deep learning model respectively, the two models and their training processes are elaborated below.
The two deep learning models are independent of each other: they are trained and produce output in parallel, without interfering with each other.
As shown in Fig. 3, which is a structural diagram of the first deep learning model, the first deep learning model adopts a multi-layer network structure, including: an input layer 301, a single Embedding layer 302, multi-layer BRNN 303, multi-layer RNN 304 and a single Softmax layer 305. The two BRNN layers 303 and one RNN layer 304 shown in Fig. 3 are only an example of the present invention and can be adjusted according to actual training results.
The input layer 301 obtains the training corpus from a database and passes it into the Embedding layer 302 of the neural network, which performs word vector encoding before passing the data on to the rest of the network. Because the neural network input requires a fixed dimension, the Embedding layer 302 can reduce the high-dimensional feature vectors to 128-dimensional word vectors; the dimensionality reduction can use the existing Word2Vec method.
In the embodiment of the present invention, words in the corpus are represented with the existing One-hot Representation method, which is the most intuitive and most common representation in current natural language processing. This method represents each word as a very long vector whose dimension is the vocabulary size; most elements are 0 and the value of only one dimension is 1, and this vector represents the current word. Examples:
"hydropower" can be expressed as [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 ...]
"electric power" can be expressed as [0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 ...]
Word vector encoding can discover the relation between two words by computing the distance between their vectors. In one-hot form, each word is 1 at its own position and 0 everywhere else, so any two words are isolated from each other: the two vectors alone reveal no relation, even between similar words such as "hydropower" and "electric power". After being encoded into 128-dimensional word vectors, suppose "hydropower" is represented by the vector u and "electric power" by the vector v; the cosine distance cos(u, v) = (u · v) / (|u| |v|) measures the difference between the two vectors, and the closer the value is to 1, the closer the two words are semantically.
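The contrast between one-hot vectors and dense word vectors can be demonstrated directly; the indices chosen for the two example words are arbitrary.

```python
import math

def one_hot(index, size):
    # One-hot Representation: a |V|-dimensional vector with a single 1
    # at the word's vocabulary index.
    v = [0.0] * size
    v[index] = 1.0
    return v

def cosine(u, v):
    # cos(u, v) = u.v / (|u| |v|); values near 1 mean semantic closeness.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

hydro = one_hot(3, 16)   # "hydropower"
power = one_hot(8, 16)   # "electric power"
print(cosine(hydro, power))   # one-hot vectors of distinct words give 0
```

This shows why the Embedding layer matters: only after dense encoding can cosine distance reflect that two words are related.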
A recurrent neural network RNN (Recurrent Neural Network) is a neural network with a feedback structure, whose main characteristic is that the output of each hidden layer serves as the input of the hidden layer at the next time step. A bidirectional recurrent neural network BRNN (Bidirectional Recurrent Neural Network) adds a reversed input sequence on the basis of the RNN: for an entity full name word sequence W1W2...Wn, the BRNN output is the average of the RNN output over the forward sequence W1W2...Wn and the RNN output over the reversed sequence WnWn-1...W1. Compared with a traditional neural network, a recurrent neural network has a memory function between layers; words are no longer independent of each other, giving the network stronger ability to learn relations between words. As shown in Fig. 3, Dropout operations are added to the inter-layer inputs. The motivation of the Dropout operation comes from ensemble training of models: given massive data, training different networks on different pieces of data and averaging the outputs of multiple networks at prediction time clearly improves prediction accuracy and prevents over-fitting. Training multiple different networks with different parameters can achieve the same purpose, but the training cost is higher; the Dropout operation is a simple way to realize this idea, i.e. randomly disabling some neurons in the network with a certain probability, which in effect produces networks with various changes.
The Softmax layer 305 uses the softmax classification activation function and trains with the multi-class cross-entropy loss function.
The multi-class activation function is as follows:
y_i = e^(net_i) / Σ_o e^(net_o)
where net_i represents the output of the i-th neuron, o is the output variable traversing all neurons, and net_o represents the output of the o-th neuron.
The multi-class cross-entropy loss function E (abbreviation of Error) is as follows:
E = -Σ_o t_o · log(y_o)
where o is the output variable traversing all neurons, t_o represents the label value of the o-th neuron in the sample, and y_o represents the output value. Taking the derivative of the cross-entropy loss function with respect to net_i gives:
∂E/∂net_i = y_i - t_i
from which the parameter offset can be obtained according to the principle of the back-propagation (BP) algorithm. Here net_i represents the output of the i-th neuron, y_i is the softmax function value corresponding to net_i, and t_i represents the label value of the i-th neuron in the sample. The label used by model 1 is an R^(10×3) two-dimensional matrix, i.e. the input full name contains at most 10 keywords, and each keyword corresponds to 3 kinds of label values (the whole word appears, only the first character appears, the word does not appear). While outputting the y_i values, the model also outputs the confidence corresponding to each y_i value.
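The softmax and cross-entropy formulas above can be sketched in pure Python; the max-shift inside `softmax` is a standard numerical-stability trick not stated in the text.

```python
import math

def softmax(net):
    # y_i = exp(net_i) / sum_o exp(net_o), shifted by max for stability.
    m = max(net)
    exps = [math.exp(x - m) for x in net]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(t, y):
    # E = -sum_o t_o * log(y_o) against the one-hot label vector t.
    return -sum(ti * math.log(yi) for ti, yi in zip(t, y))

y = softmax([2.0, 1.0, 0.1])
t = [1.0, 0.0, 0.0]
# The gradient of E with respect to net_i is simply y_i - t_i:
grad = [yi - ti for yi, ti in zip(y, t)]
```

Note that the gradient components sum to zero, since the y_i sum to 1 and the one-hot labels t_i also sum to 1.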
Fig. 4 is a structural diagram of the second deep learning model used by the embodiment of the present invention. The second deep learning model mainly includes an input layer 401 (previous word, current word, current word part of speech, current word word frequency code), an RNN layer 402, a fully connected layer 403 and an output layer 404. The second deep learning model is similar in structure to the first deep learning model, but additionally incorporates the part-of-speech and word frequency information, and thereby the industry, region and keyword information. Unlike the words, the part-of-speech and word frequency features pass directly through the fully connected layer, which connects to the last output layer; the other layers are similar to the first deep learning model. In addition, the label value output by the second deep learning model is 0 or 1, where 0 indicates that the word does not appear and 1 indicates that it appears; similarly, the second deep learning model also outputs the confidence of its output labels.
In order to improve the generalization ability of the second deep learning model, the word frequency range needs to be compressed into 5 class intervals. The processing procedure and design method are as follows:
First, min-max standardization is used as the normalization method, applying a linear transformation to the original sample word frequency X so that the result is mapped into [0, 1]. The transfer function is as follows:
Y = (X - MinValue) / (MaxValue - MinValue);
where MaxValue is the maximum of the sample data and MinValue is the minimum of the sample data.
The transformed value Y is then mapped to 5 class intervals according to the TF-IDF (Term Frequency-Inverse Document Frequency, a common weighting technique for information retrieval and data mining) method. In entity abbreviation generation, the term frequency TF (Term Frequency) is the frequency with which each word occurs in the entity, and the document frequency DF (Document Frequency) is the frequency of occurrence of the word across all samples. Taking enterprise names as an example, the TF value in an enterprise name is generally 1, and the larger the DF value, the smaller the probability that the word appears in the abbreviation, so the inverse document frequency IDF (Inverse Document Frequency) is generally used for the calculation. Traditional IDF takes the logarithm of the total number of samples divided by DF, i.e.
IDF_j = log(|D| / |{d : word j ∈ d}|)
where |D| is the total number of entries in the corpus and the denominator is the number of entries containing word j. If the word does not exist in the corpus, the denominator would be 0, so in the usual case 1 is added to the denominator. By normalization, the word frequency range Y is mapped to values between 1 and 5, as shown in Table 1 below.
Table 1: Word frequency mapping table
Word frequency range (Y) | Mapping value (IDF)
[0, 0.2]   | 5
(0.2, 0.4] | 4
(0.4, 0.6] | 3
(0.6, 0.8] | 2
(0.8, 1]   | 1
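The normalization formula and the interval mapping of Table 1 can be sketched together; the function names are assumptions for illustration.

```python
def normalize(x, lo, hi):
    # Min-max standardization: Y = (X - MinValue) / (MaxValue - MinValue)
    return (x - lo) / (hi - lo)

def freq_bucket(y):
    # Map normalized frequency Y in [0, 1] to the 5 intervals of Table 1:
    # rare words map to 5, very common words map to 1.
    for k, upper in enumerate((0.2, 0.4, 0.6, 0.8, 1.0), start=1):
        if y <= upper:
            return 6 - k
    return 1
```

For example, a raw frequency of 10 against a sample range of [0, 100] normalizes to 0.1 and falls into the interval [0, 0.2], giving the mapping value 5.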
The following is the training flow of the first deep learning model, as shown in Fig. 5, comprising the following steps:
Step 501: initialize the network parameters of the first deep learning model. Random initialization is adopted in the present invention.
Step 502: read the training corpus from the database.
Step 503: perform word vector encoding on the input corpus.
Step 504: input the word vectors into the neural network for a forward calculation.
Step 505: perform the classification calculation of the softmax layer.
Step 506: calculate the error between the output and the manually labeled abbreviation.
Step 507: calculate the gradient, propagate the error back layer by layer, and update the network parameters.
Step 508: judge whether training should stop: stop training when the number of iterations is reached or a stop condition is met in advance; otherwise return to step 502 to continue training.
Step 509: training ends; store the parameters of the first deep learning model.
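The loop of steps 501 to 509 can be illustrated with a deliberately tiny analogue: a single softmax layer trained by gradient descent on a toy two-class corpus. This is not the patent's BRNN architecture, only a sketch of the random-init / forward / softmax / error / update cycle.

```python
import math
import random

def softmax(net):
    m = max(net)
    e = [math.exp(x - m) for x in net]
    s = sum(e)
    return [v / s for v in e]

def train(samples, n_in, n_out, epochs=200, lr=0.5):
    # Minimal analogue of Fig. 5: random initialization (step 501),
    # forward pass and softmax (steps 504-505), cross-entropy gradient
    # y_i - t_i (step 506), parameter update (step 507), repeated until
    # the iteration budget is reached (steps 508-509).
    random.seed(0)
    W = [[random.uniform(-0.1, 0.1) for _ in range(n_in)]
         for _ in range(n_out)]
    for _ in range(epochs):
        for x, label in samples:
            net = [sum(w * xi for w, xi in zip(row, x)) for row in W]
            y = softmax(net)
            for i in range(n_out):
                g = y[i] - (1.0 if i == label else 0.0)
                for j in range(n_in):
                    W[i][j] -= lr * g * x[j]
    return W

# Toy corpus: two linearly separable one-hot "words"
samples = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
W = train(samples, 2, 2)
```

A real implementation would delegate steps 504 to 507 to a deep learning framework; the skeleton of the loop is the same.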
The following is the training flow of the second deep learning model, as shown in Fig. 6, comprising the following steps:
Step 601: initialize the network parameters of the second deep learning model. Random initialization is adopted in the present invention.
Step 602: read the training corpus from the database.
Step 603: perform word vector encoding on the input corpus, and at the same time read the industry, region, keyword and word frequency information to perform part-of-speech encoding and word frequency encoding.
Step 604: input the word vectors into the RNN layer for a forward calculation.
Step 605: process the RNN output with a fully connected layer, and process the part of speech and frequency of the current word with another fully connected layer.
Step 606: process the output layer and output the result with the softmax function.
Step 607: calculate the error between the output and the manually labeled abbreviation.
Step 608: calculate the gradient, propagate the error back layer by layer, and update the network parameters.
Step 609: judge whether training should stop: stop training when the number of iterations is reached or a stop condition is met in advance; otherwise return to step 602 to continue training.
Step 610: training ends; store the parameters of the second deep learning model.
Through the embodiment of the present invention, the invention can break the bottleneck of over-reliance on manually designed rules in the abbreviation generation process, and can automatically, quickly and massively generate abbreviation information for entity full names. Deep learning autonomously learns the rules of abbreviation generation, and with continuous subsequent iterative learning, the accuracy of the generated abbreviations steadily improves. The generated abbreviation has a unique correspondence with its entity, so that the structured and unstructured fragment information around the entity can be linked together to form a 360-degree knowledge network.
According to the results of many experiments, the present invention observes the following rules in the enterprise name abbreviation scenario:
The suffix "Co., Ltd." hardly ever appears in abbreviations. For example, the abbreviation "Ping An Bank" of "Ping An Bank Co., Ltd." drops the "Co., Ltd." suffix.
The province, city or district information in the full enterprise name sometimes appears in the abbreviation and sometimes does not. For example, the abbreviation "Qingdao Doublestar" of "Qingdao Doublestar Co., Ltd." contains the region, while the abbreviation "Wannianqing" of "Jiangxi Wannianqing Cement Co., Ltd." does not.
Associated keywords such as "Group", "Share" and "Holding" can normally be obtained based on context; they sometimes appear in the abbreviation and sometimes do not. The abbreviation "Fangda Group" of "Fangda Group Co., Ltd." contains "Group", while the abbreviation "China Baoan" of "China Baoan Group Co., Ltd." does not.
When the core keyword is extracted, the industry word appears in some abbreviations and not in others. The abbreviation "Sierte" of the full name "Anhui Sierte Fertilizer Co., Ltd." does not include the industry word, while the abbreviation "Xindu Chemical" of "Chengdu Xindu Chemical Co., Ltd." does.
After the full company name is split into words, each word may contribute no character, its first character, or two characters to the abbreviation. In the abbreviation "Guyue Longshan" of "Zhejiang Guyuelongshan Shaoxing Wine Co., Ltd.", "Zhejiang" contributes nothing while "Guyue" contributes two characters; in the abbreviation "SAIC Group" of "SAIC Motor Corporation Limited", "Shanghai" and "Motor" each contribute one character.
Corresponding to the method embodiments shown in above Fig. 1 and Fig. 2, as shown in Fig. 7, the embodiment of the present invention also provides an abbreviation generating apparatus for an entity, including:
a word segmentation preprocessing unit 71, for obtaining the full name data of an entity, performing word segmentation preprocessing on the full name data, splitting the full name data into multiple words, and generating the word frequency codes representing the occurrence frequency of the words in a preset corpus and the part-of-speech tagging information representing the attributes of the words;
an initial abbreviation generation unit 72, for generating the first initial abbreviation and the second initial abbreviation respectively through the pre-trained first deep learning model and second deep learning model, according to the preprocessed words, word frequency codes and part-of-speech tagging information;
a verification correction processing unit 73, for performing verification and correction processing on the first initial abbreviation and the second initial abbreviation according to preset verification and correction rules, and generating the first abbreviation result and the second abbreviation result respectively;
a final abbreviation generation unit 74, for comparing the first abbreviation result and the second abbreviation result, and generating the final abbreviation of the entity full name data according to the comparison result.
In addition, the word segmentation preprocessing unit 71 is specifically configured to:
obtain, from a preset word frequency information table, the frequency corresponding to each word obtained by splitting the full name data of the entity;
sort the words in descending order of frequency according to the frequency corresponding to each word, and generate the word frequency code corresponding to each word;
store the word frequency code corresponding to each word in a preset word frequency coding table.
In addition, the word segmentation preprocessing unit 71 is specifically further configured to:
determine, according to a preset affix dictionary, region dictionary, industry dictionary and keyword dictionary, the dictionary to which each word of the split full name data belongs, perform part-of-speech tagging on each word after splitting, and generate the part-of-speech tagging information; the part-of-speech tagging information covers affix words, region words, industry words and keywords.
Further, as shown in Fig. 7, the abbreviation generating apparatus for an entity also includes:
a machine learning training unit 75, for performing machine learning training on the first deep learning model and the second deep learning model respectively according to a preset training corpus; the training corpus includes preset entity full name data and the abbreviation corresponding to each entity full name.
In addition, the verification correction processing unit 73 is specifically configured to:
judge whether two adjacent single-character words exist among the preprocessed words;
when two adjacent single-character words exist among the preprocessed words and the first initial abbreviation is the word composed of the two adjacent single-character words, determine the first initial abbreviation to be the first abbreviation result; if two adjacent single-character words exist among the preprocessed words and the second initial abbreviation is the word composed of the two adjacent single-character words, determine the second initial abbreviation to be the second abbreviation result;
when two adjacent single-character words exist among the preprocessed words and the first initial abbreviation is not the word composed of the two adjacent single-character words, obtain the word with the lowest frequency among the preprocessed words according to the word frequency codes, and compose the two adjacent single-character words and the lowest-frequency word into the first abbreviation result; if two adjacent single-character words exist among the preprocessed words and the second initial abbreviation is not the word composed of the two adjacent single-character words, obtain the lowest-frequency word according to the word frequency codes, and compose the two adjacent single-character words and the lowest-frequency word into the second abbreviation result.
In addition, the verification correction processing unit 73 is specifically configured to:
judge whether the first initial abbreviation or the second initial abbreviation is a single character;
when the first initial abbreviation is a single character, obtain the two words with the lowest frequency among the preprocessed words according to the word frequency codes, and compose the single character and the two lowest-frequency words into the first abbreviation result; if the second initial abbreviation is a single character, obtain the two lowest-frequency words according to the word frequency codes, and compose the single character and the two lowest-frequency words into the second abbreviation result.
In addition, the verification correction processing unit 73 is specifically configured to:
judge whether the abbreviation length of the first initial abbreviation or the second initial abbreviation exceeds a preset length threshold;
when the abbreviation length of the first initial abbreviation exceeds the preset length threshold, sort the words in the first initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and delete the last word of the sorted first initial abbreviation repeatedly until the abbreviation length is less than or equal to the preset length threshold, taking the first initial abbreviation after deletion as the first abbreviation result;
when the abbreviation length of the second initial abbreviation exceeds the preset length threshold, sort the words in the second initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and delete the last word of the sorted second initial abbreviation repeatedly until the abbreviation length is less than or equal to the preset length threshold, taking the second initial abbreviation after deletion as the second abbreviation result.
Further, as shown in Fig. 7, the abbreviation generating apparatus for an entity also includes:
a storage unit 76, for storing the first abbreviation result and the second abbreviation result into the training corpus, so as to update the training corpus.
In addition, the final abbreviation generation unit 74 is specifically configured to:
compare the first abbreviation result and the second abbreviation result, and generate a comparison result;
when the comparison result is that the first abbreviation result and the second abbreviation result are inconsistent, obtain the first generation probability with which the first deep learning model outputs its abbreviation and the second generation probability with which the second deep learning model outputs its abbreviation;
compare the first generation probability and the second generation probability, and select the abbreviation result corresponding to the larger of the two as the final abbreviation of the entity full name data.
It should be noted that for a specific implementation of the abbreviation generating apparatus for an entity provided by the embodiment of the present invention, reference can be made to the method embodiments corresponding to above Fig. 1 and Fig. 2, which are not repeated here.
The abbreviation generating apparatus for an entity provided by the embodiment of the present invention first obtains the full name data of an entity, performs word segmentation preprocessing on the full name data, splits the full name data into multiple words, and generates the word frequency codes representing the occurrence frequency of the words in a preset corpus and the part-of-speech tagging information representing the attributes of the words; then, according to the preprocessed words, word frequency codes and part-of-speech tagging information, it generates the first initial abbreviation and the second initial abbreviation respectively through the pre-trained first deep learning model and second deep learning model; it performs verification and correction processing on the first initial abbreviation and the second initial abbreviation according to preset verification and correction rules, generating the first abbreviation result and the second abbreviation result respectively; and it compares the first abbreviation result and the second abbreviation result, generating the final abbreviation of the entity full name data according to the comparison result. By using the method of deep learning, the embodiment of the present invention exploits the advantage of deep learning models requiring no hand-engineered features, fuses part-of-speech and word frequency information into the models to extend the feature scope, and, through autonomous iterative learning on full names, ultimately forms an accurate, unique correspondence between entity full names and abbreviations, which can solve the problems of the currently existing technology that the accuracy of abbreviation generation is low and the result is not unique.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus which implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The principles and embodiments of the present invention have been set forth herein through specific examples; the description of the above embodiments is intended only to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific embodiments and the scope of application according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.
Claims (18)
1. An abbreviation generation method for an entity, characterized by comprising:
obtaining entity full-name data and performing word-segmentation preprocessing on the entity full-name data: splitting the entity full-name data into multiple words, and generating a word-frequency code representing the frequency of occurrence of each word in a preset corpus and part-of-speech tagging information representing word attributes;
generating a first initial abbreviation and a second initial abbreviation, respectively, from the preprocessed words, the word-frequency codes and the part-of-speech tagging information, by a pre-trained first deep learning model and a pre-trained second deep learning model;
performing verification and correction on the first initial abbreviation and the second initial abbreviation according to preset verification and correction rules, to generate a first abbreviation result and a second abbreviation result, respectively;
comparing the first abbreviation result with the second abbreviation result, and generating the final abbreviation of the entity full-name data according to the comparison result.
2. The abbreviation generation method for an entity according to claim 1, characterized in that obtaining the entity full-name data, performing word-segmentation preprocessing on the entity full-name data, splitting the entity full-name data into multiple words, and generating the word-frequency code representing the frequency of occurrence of each word in the preset corpus and the part-of-speech tagging information representing word attributes comprises:
obtaining, from a preset word-frequency information table, the frequency corresponding to each word obtained by splitting the entity full-name data;
sorting the words in descending order of frequency according to the frequency corresponding to each word, and generating the word-frequency code corresponding to each word;
storing the word-frequency code corresponding to each word into a preset word-frequency code table.
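The word-frequency coding in the claim above can be sketched as follows. This is a hedged illustration, not the patented implementation: the claim specifies only that words are sorted in descending order of corpus frequency and assigned codes; the rank-as-code convention and all names below are assumptions.

```python
def build_word_frequency_codes(words, frequency_table):
    """Return a dict mapping each word to its frequency-rank code.

    words: words obtained by splitting the entity full name.
    frequency_table: preset word -> corpus-frequency lookup.
    Code 0 is assigned to the most frequent word (descending order).
    """
    freqs = {w: frequency_table.get(w, 0) for w in words}
    # Sort distinct words in descending order of frequency.
    ordered = sorted(set(words), key=lambda w: freqs[w], reverse=True)
    # The rank in this ordering serves as the word-frequency code.
    return {w: code for code, w in enumerate(ordered)}
```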
3. The abbreviation generation method for an entity according to claim 2, characterized in that obtaining the entity full-name data, performing word-segmentation preprocessing on the entity full-name data, splitting the entity full-name data into multiple words, and generating the word-frequency code representing the frequency of occurrence of each word in the preset corpus and the part-of-speech tagging information representing word attributes further comprises:
determining, according to a preset prefix/suffix dictionary, region dictionary, industry dictionary and keyword dictionary, the dictionary in which each word obtained by splitting the entity full-name data is located, performing part-of-speech tagging on each word obtained by splitting the entity full-name data, and generating the part-of-speech tagging information; the part-of-speech tagging information includes prefix/suffix words, region words, industry words and keywords.
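The dictionary-based tagging in the claim above amounts to a lookup per word. The sketch below is illustrative only: the dictionary contents are made-up examples, and the lookup order among dictionaries is an assumption not specified by the claim.

```python
# Hypothetical preset dictionaries (contents are illustrative examples).
AFFIX_DICT = {"有限公司", "股份"}
REGION_DICT = {"北京", "上海"}
INDUSTRY_DICT = {"银行", "科技"}
KEYWORD_DICT = {"百度", "华为"}

def tag_words(words):
    """Tag each word with the category of the dictionary containing it."""
    tags = {}
    for w in words:
        if w in AFFIX_DICT:
            tags[w] = "affix"
        elif w in REGION_DICT:
            tags[w] = "region"
        elif w in INDUSTRY_DICT:
            tags[w] = "industry"
        elif w in KEYWORD_DICT:
            tags[w] = "keyword"
        else:
            tags[w] = "other"
    return tags
```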
4. The abbreviation generation method for an entity according to claim 3, characterized by further comprising:
performing machine learning training on the first deep learning model and the second deep learning model respectively according to a preset training corpus; the training corpus includes preset entity full-name data and the abbreviation corresponding to each entity full name.
5. The abbreviation generation method for an entity according to claim 4, characterized in that performing verification and correction on the first initial abbreviation and the second initial abbreviation according to the preset verification and correction rules, to generate the first abbreviation result and the second abbreviation result respectively, comprises:
judging whether two adjacent single-character words exist among the preprocessed words;
if two adjacent single-character words exist among the preprocessed words and the first initial abbreviation is the word composed of the two adjacent single-character words, determining the first initial abbreviation as the first abbreviation result; if two adjacent single-character words exist among the preprocessed words and the second initial abbreviation is the word composed of the two adjacent single-character words, determining the second initial abbreviation as the second abbreviation result;
if two adjacent single-character words exist among the preprocessed words and the first initial abbreviation is not the word composed of the two adjacent single-character words, obtaining, according to the word-frequency codes, the preprocessed word with the lowest frequency, and composing the first abbreviation result from the two adjacent single-character words and the lowest-frequency word; if two adjacent single-character words exist among the preprocessed words and the second initial abbreviation is not the word composed of the two adjacent single-character words, obtaining, according to the word-frequency codes, the preprocessed word with the lowest frequency, and composing the second abbreviation result from the two adjacent single-character words and the lowest-frequency word.
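The adjacent single-character rule above can be sketched as one function applied to either initial abbreviation. This is a hedged sketch: helper names and the rank-code convention (larger code = rarer word, matching a descending-frequency coding) are assumptions for illustration.

```python
def correct_with_adjacent_singles(words, initial_abbr, freq_code):
    """words: preprocessed words in order; freq_code: word -> rank code
    (0 = most frequent, larger = rarer)."""
    for a, b in zip(words, words[1:]):
        if len(a) == 1 and len(b) == 1:  # two adjacent single-character words
            pair = a + b
            if initial_abbr == pair:
                return initial_abbr  # already matches: accept as-is
            # Otherwise compose the result from the pair plus the
            # lowest-frequency (rarest) preprocessed word.
            rarest = max(words, key=lambda w: freq_code.get(w, -1))
            return pair + rarest
    return initial_abbr  # no adjacent single-character words: rule not applied
```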
6. The abbreviation generation method for an entity according to claim 4, characterized in that performing verification and correction on the first initial abbreviation and the second initial abbreviation according to the preset verification and correction rules, to generate the first abbreviation result and the second abbreviation result respectively, comprises:
judging whether the first initial abbreviation or the second initial abbreviation is a single character;
if the first initial abbreviation is a single character, obtaining, according to the word-frequency codes, the two preprocessed words with the lowest frequencies, and composing the first abbreviation result from the single character and the two lowest-frequency words; if the second initial abbreviation is a single character, obtaining, according to the word-frequency codes, the two preprocessed words with the lowest frequencies, and composing the second abbreviation result from the single character and the two lowest-frequency words.
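The single-character correction above can likewise be sketched briefly. As before, this is an illustrative assumption: the claim does not fix the composition order, and the rank-code convention (larger code = rarer) is assumed.

```python
def pad_single_char_abbreviation(words, initial_abbr, freq_code):
    """If the initial abbreviation is a single character, append the two
    lowest-frequency (rarest) preprocessed words to it."""
    if len(initial_abbr) != 1:
        return initial_abbr  # rule applies only to single-character abbreviations
    # Rarest words have the largest rank codes under a descending-frequency coding.
    rarest_two = sorted(words, key=lambda w: freq_code.get(w, -1), reverse=True)[:2]
    return initial_abbr + "".join(rarest_two)
```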
7. The abbreviation generation method for an entity according to claim 4, characterized in that performing verification and correction on the first initial abbreviation and the second initial abbreviation according to the preset verification and correction rules, to generate the first abbreviation result and the second abbreviation result respectively, comprises:
judging whether the abbreviation length of the first initial abbreviation or the second initial abbreviation exceeds a preset length threshold;
if the abbreviation length of the first initial abbreviation exceeds the preset length threshold, sorting the words in the first initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and deleting the last word of the sorted first initial abbreviation one at a time until the abbreviation length of the first initial abbreviation is less than or equal to the preset length threshold, the first initial abbreviation remaining after deletion being used as the first abbreviation result;
if the abbreviation length of the second initial abbreviation exceeds the preset length threshold, sorting the words in the second initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and deleting the last word of the sorted second initial abbreviation one at a time until the abbreviation length of the second initial abbreviation is less than or equal to the preset length threshold, the second initial abbreviation remaining after deletion being used as the second abbreviation result.
8. The abbreviation generation method for an entity according to any one of claims 5 to 7, characterized by, after generating the first initial abbreviation and the second initial abbreviation, further comprising:
storing the first abbreviation result and the second abbreviation result into the training corpus to update the training corpus.
9. The abbreviation generation method for an entity according to claim 8, characterized in that comparing the first abbreviation result with the second abbreviation result and generating the final abbreviation of the entity full-name data according to the comparison result comprises:
comparing the first abbreviation result with the second abbreviation result to generate a comparison result;
if the comparison result indicates that the first abbreviation result and the second abbreviation result are inconsistent, obtaining the first generation probability with which the first deep learning model outputs its abbreviation and the second generation probability with which the second deep learning model outputs its abbreviation;
comparing the first generation probability with the second generation probability, and selecting the abbreviation result corresponding to the larger of the two probabilities as the final abbreviation of the entity full-name data.
10. An abbreviation generating apparatus for an entity, characterized by comprising:
a word-segmentation preprocessing unit, configured to obtain entity full-name data, perform word-segmentation preprocessing on the entity full-name data, split the entity full-name data into multiple words, and generate a word-frequency code representing the frequency of occurrence of each word in a preset corpus and part-of-speech tagging information representing word attributes;
an initial abbreviation generation unit, configured to generate a first initial abbreviation and a second initial abbreviation, respectively, from the preprocessed words, the word-frequency codes and the part-of-speech tagging information, by a pre-trained first deep learning model and a pre-trained second deep learning model;
a verification and correction processing unit, configured to perform verification and correction on the first initial abbreviation and the second initial abbreviation according to preset verification and correction rules, to generate a first abbreviation result and a second abbreviation result, respectively;
a final abbreviation generation unit, configured to compare the first abbreviation result with the second abbreviation result and generate the final abbreviation of the entity full-name data according to the comparison result.
11. The abbreviation generating apparatus for an entity according to claim 10, characterized in that the word-segmentation preprocessing unit is specifically configured to:
obtain, from a preset word-frequency information table, the frequency corresponding to each word obtained by splitting the entity full-name data;
sort the words in descending order of frequency according to the frequency corresponding to each word, and generate the word-frequency code corresponding to each word;
store the word-frequency code corresponding to each word into a preset word-frequency code table.
12. The abbreviation generating apparatus for an entity according to claim 11, characterized in that the word-segmentation preprocessing unit is further specifically configured to:
determine, according to a preset prefix/suffix dictionary, region dictionary, industry dictionary and keyword dictionary, the dictionary in which each word obtained by splitting the entity full-name data is located, perform part-of-speech tagging on each word obtained by splitting the entity full-name data, and generate the part-of-speech tagging information; the part-of-speech tagging information includes prefix/suffix words, region words, industry words and keywords.
13. The abbreviation generating apparatus for an entity according to claim 12, characterized by further comprising:
a machine learning training unit, configured to perform machine learning training on the first deep learning model and the second deep learning model respectively according to a preset training corpus; the training corpus includes preset entity full-name data and the abbreviation corresponding to each entity full name.
14. The abbreviation generating apparatus for an entity according to claim 13, characterized in that the verification and correction processing unit is specifically configured to:
judge whether two adjacent single-character words exist among the preprocessed words;
if two adjacent single-character words exist among the preprocessed words and the first initial abbreviation is the word composed of the two adjacent single-character words, determine the first initial abbreviation as the first abbreviation result; if two adjacent single-character words exist among the preprocessed words and the second initial abbreviation is the word composed of the two adjacent single-character words, determine the second initial abbreviation as the second abbreviation result;
if two adjacent single-character words exist among the preprocessed words and the first initial abbreviation is not the word composed of the two adjacent single-character words, obtain, according to the word-frequency codes, the preprocessed word with the lowest frequency, and compose the first abbreviation result from the two adjacent single-character words and the lowest-frequency word; if two adjacent single-character words exist among the preprocessed words and the second initial abbreviation is not the word composed of the two adjacent single-character words, obtain, according to the word-frequency codes, the preprocessed word with the lowest frequency, and compose the second abbreviation result from the two adjacent single-character words and the lowest-frequency word.
15. The abbreviation generating apparatus for an entity according to claim 13, characterized in that the verification and correction processing unit is specifically configured to:
judge whether the first initial abbreviation or the second initial abbreviation is a single character;
if the first initial abbreviation is a single character, obtain, according to the word-frequency codes, the two preprocessed words with the lowest frequencies, and compose the first abbreviation result from the single character and the two lowest-frequency words; if the second initial abbreviation is a single character, obtain, according to the word-frequency codes, the two preprocessed words with the lowest frequencies, and compose the second abbreviation result from the single character and the two lowest-frequency words.
16. The abbreviation generating apparatus for an entity according to claim 13, characterized in that the verification and correction processing unit is specifically configured to:
judge whether the abbreviation length of the first initial abbreviation or the second initial abbreviation exceeds a preset length threshold;
if the abbreviation length of the first initial abbreviation exceeds the preset length threshold, sort the words in the first initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and delete the last word of the sorted first initial abbreviation one at a time until the abbreviation length of the first initial abbreviation is less than or equal to the preset length threshold, the first initial abbreviation remaining after deletion being used as the first abbreviation result;
if the abbreviation length of the second initial abbreviation exceeds the preset length threshold, sort the words in the second initial abbreviation in descending order according to the part-of-speech tagging information and TF-IDF values of the preprocessed words, and delete the last word of the sorted second initial abbreviation one at a time until the abbreviation length of the second initial abbreviation is less than or equal to the preset length threshold, the second initial abbreviation remaining after deletion being used as the second abbreviation result.
17. The abbreviation generating apparatus for an entity according to any one of claims 14 to 16, characterized by further comprising:
a storage unit, configured to store the first abbreviation result and the second abbreviation result into the training corpus to update the training corpus.
18. The abbreviation generating apparatus for an entity according to claim 17, characterized in that the final abbreviation generation unit is specifically configured to:
compare the first abbreviation result with the second abbreviation result to generate a comparison result;
if the comparison result indicates that the first abbreviation result and the second abbreviation result are inconsistent, obtain the first generation probability with which the first deep learning model outputs its abbreviation and the second generation probability with which the second deep learning model outputs its abbreviation;
compare the first generation probability with the second generation probability, and select the abbreviation result corresponding to the larger of the two probabilities as the final abbreviation of the entity full-name data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710212978.8A CN106991085B (en) | 2017-04-01 | 2017-04-01 | Entity abbreviation generation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710212978.8A CN106991085B (en) | 2017-04-01 | 2017-04-01 | Entity abbreviation generation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106991085A true CN106991085A (en) | 2017-07-28 |
CN106991085B CN106991085B (en) | 2020-08-04 |
Family
ID=59415942
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710212978.8A Active CN106991085B (en) | 2017-04-01 | 2017-04-01 | Entity abbreviation generation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106991085B (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107783960A (en) * | 2017-10-23 | 2018-03-09 | 百度在线网络技术(北京)有限公司 | Method, apparatus and equipment for Extracting Information |
CN108460014A (en) * | 2018-02-07 | 2018-08-28 | 百度在线网络技术(北京)有限公司 | Recognition methods, device, computer equipment and the storage medium of business entity |
CN108460016A (en) * | 2018-02-09 | 2018-08-28 | 中云开源数据技术(上海)有限公司 | A kind of entity name analysis recognition method |
CN109033082A (en) * | 2018-07-19 | 2018-12-18 | 深圳创维数字技术有限公司 | The learning training method, apparatus and computer readable storage medium of semantic model |
RU2678716C1 (en) * | 2017-12-11 | 2019-01-31 | Общество с ограниченной ответственностью "Аби Продакшн" | Use of autoencoders for learning text classifiers in natural language |
WO2019095899A1 (en) * | 2017-11-17 | 2019-05-23 | 中兴通讯股份有限公司 | Material annotation method and apparatus, terminal, and computer readable storage medium |
CN110096571A (en) * | 2019-04-10 | 2019-08-06 | 北京明略软件系统有限公司 | A kind of mechanism name abbreviation generation method and device, computer readable storage medium |
CN110134779A (en) * | 2019-05-13 | 2019-08-16 | 极智(上海)企业管理咨询有限公司 | A kind of method of enterprise name processing |
CN111507108A (en) * | 2020-04-17 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Alias generation method and device, electronic equipment and computer readable storage medium |
CN112613299A (en) * | 2020-12-25 | 2021-04-06 | 北京知因智慧科技有限公司 | Method and device for constructing enterprise synonym library and electronic equipment |
TWI733012B (en) * | 2018-03-29 | 2021-07-11 | 中華電信股份有限公司 | Dialogue system and method of integrating intentions and hot keys |
CN113642867A (en) * | 2021-07-30 | 2021-11-12 | 南京星云数字技术有限公司 | Method and system for assessing risk |
CN114020973A (en) * | 2021-11-26 | 2022-02-08 | 盐城金堤科技有限公司 | Public opinion data acquisition method and device, computer storage medium and electronic equipment |
CN114707500A (en) * | 2022-03-17 | 2022-07-05 | 深圳前海微众银行股份有限公司 | Work unit name verification method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049501A (en) * | 2012-12-11 | 2013-04-17 | 上海大学 | Chinese domain term recognition method based on mutual information and conditional random field model |
CN104572625A (en) * | 2015-01-21 | 2015-04-29 | 北京云知声信息技术有限公司 | Recognition method of named entity |
WO2015080558A1 (en) * | 2013-11-27 | 2015-06-04 | Mimos Berhad | A method and system for automated entity recognition |
CN105975555A (en) * | 2016-05-03 | 2016-09-28 | 成都数联铭品科技有限公司 | Enterprise abbreviation extraction method based on bidirectional recurrent neural network |
CN106503192A (en) * | 2016-10-31 | 2017-03-15 | 北京百度网讯科技有限公司 | Name entity recognition method and device based on artificial intelligence |
CN106598950A (en) * | 2016-12-23 | 2017-04-26 | 东北大学 | Method for recognizing named entity based on mixing stacking model |
- 2017-04-01: CN application CN201710212978.8A filed; granted as patent CN106991085B (status: Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049501A (en) * | 2012-12-11 | 2013-04-17 | 上海大学 | Chinese domain term recognition method based on mutual information and conditional random field model |
WO2015080558A1 (en) * | 2013-11-27 | 2015-06-04 | Mimos Berhad | A method and system for automated entity recognition |
CN104572625A (en) * | 2015-01-21 | 2015-04-29 | 北京云知声信息技术有限公司 | Recognition method of named entity |
CN105975555A (en) * | 2016-05-03 | 2016-09-28 | 成都数联铭品科技有限公司 | Enterprise abbreviation extraction method based on bidirectional recurrent neural network |
CN106503192A (en) * | 2016-10-31 | 2017-03-15 | 北京百度网讯科技有限公司 | Name entity recognition method and device based on artificial intelligence |
CN106598950A (en) * | 2016-12-23 | 2017-04-26 | 东北大学 | Method for recognizing named entity based on mixing stacking model |
Non-Patent Citations (1)
Title |
---|
Sun Liping, Guo Ge, Tang Wenwu, Xu Yongbin: "Enterprise Abbreviation Prediction Based on Composition Patterns and Conditional Random Fields", Journal of Computer Applications (《计算机应用》) *
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11288593B2 (en) | 2017-10-23 | 2022-03-29 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus and device for extracting information |
CN107783960A (en) * | 2017-10-23 | 2018-03-09 | 百度在线网络技术(北京)有限公司 | Method, apparatus and equipment for Extracting Information |
WO2019095899A1 (en) * | 2017-11-17 | 2019-05-23 | 中兴通讯股份有限公司 | Material annotation method and apparatus, terminal, and computer readable storage medium |
RU2678716C1 (en) * | 2017-12-11 | 2019-01-31 | Общество с ограниченной ответственностью "Аби Продакшн" | Use of autoencoders for learning text classifiers in natural language |
CN108460014B (en) * | 2018-02-07 | 2022-02-25 | 百度在线网络技术(北京)有限公司 | Enterprise entity identification method and device, computer equipment and storage medium |
CN108460014A (en) * | 2018-02-07 | 2018-08-28 | 百度在线网络技术(北京)有限公司 | Recognition methods, device, computer equipment and the storage medium of business entity |
CN108460016A (en) * | 2018-02-09 | 2018-08-28 | 中云开源数据技术(上海)有限公司 | A kind of entity name analysis recognition method |
TWI733012B (en) * | 2018-03-29 | 2021-07-11 | 中華電信股份有限公司 | Dialogue system and method of integrating intentions and hot keys |
CN109033082A (en) * | 2018-07-19 | 2018-12-18 | 深圳创维数字技术有限公司 | The learning training method, apparatus and computer readable storage medium of semantic model |
CN109033082B (en) * | 2018-07-19 | 2022-06-10 | 深圳创维数字技术有限公司 | Learning training method and device of semantic model and computer readable storage medium |
CN110096571A (en) * | 2019-04-10 | 2019-08-06 | 北京明略软件系统有限公司 | A kind of mechanism name abbreviation generation method and device, computer readable storage medium |
CN110096571B (en) * | 2019-04-10 | 2021-06-08 | 北京明略软件系统有限公司 | Mechanism name abbreviation generation method and device and computer readable storage medium |
CN110134779A (en) * | 2019-05-13 | 2019-08-16 | 极智(上海)企业管理咨询有限公司 | A kind of method of enterprise name processing |
CN111507108A (en) * | 2020-04-17 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Alias generation method and device, electronic equipment and computer readable storage medium |
CN111507108B (en) * | 2020-04-17 | 2021-03-19 | 腾讯科技(深圳)有限公司 | Alias generation method and device, electronic equipment and computer readable storage medium |
CN112613299A (en) * | 2020-12-25 | 2021-04-06 | 北京知因智慧科技有限公司 | Method and device for constructing enterprise synonym library and electronic equipment |
CN112613299B (en) * | 2020-12-25 | 2024-07-02 | 北京知因智慧科技有限公司 | Method and device for constructing enterprise synonym library and electronic equipment |
CN113642867A (en) * | 2021-07-30 | 2021-11-12 | 南京星云数字技术有限公司 | Method and system for assessing risk |
CN114020973A (en) * | 2021-11-26 | 2022-02-08 | 盐城金堤科技有限公司 | Public opinion data acquisition method and device, computer storage medium and electronic equipment |
CN114707500A (en) * | 2022-03-17 | 2022-07-05 | 深圳前海微众银行股份有限公司 | Work unit name verification method and device |
Also Published As
Publication number | Publication date |
---|---|
CN106991085B (en) | 2020-08-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106991085A (en) | | The abbreviation generation method and device of a kind of entity |
Bakhtin et al. | | Real or fake? learning to discriminate machine from human generated text |
CN107133213B (en) | | Method and system for automatically extracting text abstract based on algorithm |
CN110990564B (en) | | Negative news identification method based on emotion calculation and multi-head attention mechanism |
CN110427623A (en) | | Semi-structured document knowledge extraction method, device, electronic equipment and storage medium |
Terechshenko et al. | | A comparison of methods in political science text classification: Transfer learning language models for politics |
CN107247702A (en) | | A kind of text emotion analysis and processing method and system |
CN106855853A (en) | | Entity relation extraction system based on deep neural network |
CN106445919A (en) | | Sentiment classifying method and device |
CN109885670A (en) | | A kind of interaction attention coding sentiment analysis method towards topic text |
CN106650789A (en) | | Image description generation method based on depth LSTM network |
CN109977234A (en) | | A kind of knowledge mapping complementing method based on subject key words filtering |
CN106598950A (en) | | Method for recognizing named entity based on mixing stacking model |
CN110598219A (en) | | Emotion analysis method for broad-bean-net movie comment |
CN106096664A (en) | | A kind of sentiment analysis method based on social network data |
CN112800239B (en) | | Training method of intention recognition model, and intention recognition method and device |
Li et al. | | TSQA: tabular scenario based question answering |
CN107357785A (en) | | Theme feature word abstracting method and system, feeling polarities determination methods and system |
CN116010581A (en) | | Knowledge graph question-answering method and system based on power grid hidden trouble shooting scene |
CN114648029A (en) | | Electric power field named entity identification method based on BiLSTM-CRF model |
CN115630156A (en) | | Mongolian emotion analysis method and system fusing Prompt and SRU |
CN115269834A (en) | | High-precision text classification method and device based on BERT |
WO2022061877A1 (en) | | Event extraction and extraction model training method, apparatus and device, and medium |
Basri et al. | | A deep learning based sentiment analysis on bang-lish disclosure |
CN111507101A (en) | | Ironic detection method based on multi-level semantic capsule routing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||