CN107329951A - Method, device, storage medium and computer equipment for building a named entity annotation resource library - Google Patents
- Publication number
- CN107329951A (application number CN201710447680.5A)
- Authority
- CN
- China
- Prior art keywords
- text
- mark
- name entity
- bank
- marked
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Abstract
The present invention relates to a method, device, storage medium and computer equipment for building a named entity annotation resource library. A small seed library is combined with unlabeled texts from an unlabeled text set to form the resource library of the current iteration round. The average utility value of each named entity in the unlabeled texts is calculated to generate the seed library of the next iteration round; that seed library and further unlabeled texts then form the resource library of the next round, from which the following seed library is calculated. This continues until all unlabeled texts have been processed, so that new named entities are discovered and a named entity annotation resource library is generated. The method is simple to compute, yields results of high confidence, and is suited to processing large-scale text. Text data is unstructured, and evaluating results on unstructured data is generally difficult; this method makes it possible to evaluate the named entities in text quantitatively.
Description
Technical field
The present invention relates to the technical field of information processing, and in particular to a method, device, storage medium and computer equipment for building a named entity annotation resource library.
Background art
A named entity is a person name, organization name, place name, or any other entity identified by a name; in the broader sense, named entities also include numbers, dates, currencies, addresses and the like. Named entity recognition (NER) is one of the basic technologies of natural language processing and plays an important role in improving the performance of many natural language processing applications. At present NER mainly relies on statistical models such as the hidden Markov model (HMM) and the conditional random field (CRF). Such models require a large annotated resource library as a training set, and manually annotated resources such as the People's Daily corpus are commonly used. The resources in these manually annotated libraries are very limited and cannot support large-scale application scenarios such as machine translation. Moreover, as society develops, new named entities keep appearing, such as organization names, film titles, product names and book titles, so manually annotated resource libraries fall far short of the needs of named entity recognition. Building and maintaining a named entity annotation resource library is therefore central to many natural language processing applications, such as search systems and machine translation systems.
Summary of the invention
In view of the above technical problem, it is necessary to provide a method, device, storage medium and computer equipment for building a named entity annotation resource library.
A method for building a named entity annotation resource library, the method comprising:
obtaining a labeled text set as the seed library of the current iteration round, the labeled text set comprising labeled texts;
obtaining an unlabeled text set, the unlabeled text set comprising unlabeled texts, and choosing a preset number of unlabeled texts from the unlabeled text set to form, together with the seed library, the resource library of the current iteration round;
calculating the average utility value of each named entity in the unlabeled texts;
sorting the average utility values in descending order and taking the preset number of top-ranked named entities as candidate words;
selecting the texts that contain the candidate words and have the largest utility value and adding them to the seed library to form the seed library of the next iteration round, then choosing a preset number of unlabeled texts from the unlabeled text set to form, together with the seed library, the resource library of the next iteration round, until all unlabeled texts in the unlabeled text set have been iterated over, thereby obtaining an annotation resource library;
scoring the candidate words in the annotation resource library;
obtaining the texts containing the candidate words whose scores exceed a set threshold, the set of these texts serving as the named entity annotation resource library.
In one of the embodiments, calculating the average utility value of each named entity in the unlabeled texts comprises:
performing word segmentation on the unlabeled texts in the resource library to obtain segmented unlabeled texts;
training a conditional random field (CRF) model on the labeled texts in the resource library to obtain a prediction model, using the prediction model to predict label sequences for the unlabeled texts in the resource library, and obtaining from those label sequences the optimal and suboptimal label sequences together with their conditional probabilities;
for each unlabeled text, calculating the utility value of each named entity in that text from the conditional probabilities by means of a utility evaluation function;
obtaining the utility value of each named entity in every unlabeled text that contains it, and calculating the average utility value of each named entity from these utility values.
In one of the embodiments, before obtaining the labeled text set as the seed library of the current iteration round, the method further comprises:
collecting texts;
choosing a preset number of texts from the collected texts and annotating the named entities in them to generate the labeled text set, the remaining unannotated texts among the collected texts forming the unlabeled text set.
In one of the embodiments, the utility evaluation function is
[formula]
Its terms are: the optimal label sequence of x; the suboptimal label sequence of x; the model parameter θ; the adjustment factor λ, with 0 ≤ λ ≤ 1; the conditional probability of the optimal label sequence of x; the conditional probability of the suboptimal label sequence of x; and x, a text label-sequence sample.
In one of the embodiments, the average utility is calculated as
[formula]
where X_t is the set of samples containing entity candidate word t, |X_t| is the number of samples containing t, the left-hand side is the average utility value of t over the sample set X_t, and x_t is a text label-sequence sample containing t.
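Read literally, the definitions above describe an arithmetic mean over the samples that contain t. Under that assumption, and writing φ(t, x_t) for the utility value of t in sample x_t (both φ and the bar notation are notational assumptions introduced here, not symbols confirmed by the source), the average utility would take the following form:

```latex
\bar{\phi}(t) \;=\; \frac{1}{|X_t|} \sum_{x_t \in X_t} \phi(t, x_t)
```

For instance, if t occurs in two samples with utility values 0.75 and 0.25, this form gives an average utility of 0.5.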
A device for building a named entity annotation resource library, the device comprising:
a seed library acquisition module, configured to obtain a labeled text set as the seed library of the current iteration round, the labeled text set comprising labeled texts;
a resource library acquisition module, configured to obtain an unlabeled text set, the unlabeled text set comprising unlabeled texts, and to choose a preset number of unlabeled texts from the unlabeled text set to form, together with the seed library, the resource library of the current iteration round;
an average utility value computing module, configured to calculate the average utility value of each named entity in the unlabeled texts;
a named entity candidate word acquisition module, configured to sort the average utility values in descending order and take the preset number of top-ranked named entities as candidate words;
an annotation resource library generation module, configured to select the texts that contain the candidate words and have the largest utility value and add them to the seed library to form the seed library of the next iteration round, then to choose a preset number of unlabeled texts from the unlabeled text set to form, together with the seed library, the resource library of the next iteration round, until all unlabeled texts in the unlabeled text set have been iterated over, thereby obtaining an annotation resource library;
a candidate word scoring module, configured to score the candidate words in the annotation resource library;
a named entity annotation resource library generation module, configured to obtain the texts containing the candidate words whose scores exceed a set threshold, the set of these texts serving as the named entity annotation resource library.
In one of the embodiments, the average utility value computing module comprises:
a word segmentation module, configured to perform word segmentation on the unlabeled texts in the resource library to obtain segmented unlabeled texts;
a conditional probability computing module, configured to train a conditional random field (CRF) model on the labeled texts in the resource library to obtain a prediction model, to use the prediction model to predict label sequences for the unlabeled texts in the resource library, and to obtain from those label sequences the optimal and suboptimal label sequences together with their conditional probabilities;
a utility value computing module, configured to calculate, for each unlabeled text, the utility value of each named entity in that text from the conditional probabilities by means of a utility evaluation function;
an average utility value acquisition module, configured to obtain the utility value of each named entity in every unlabeled text that contains it, and to calculate the average utility value of each named entity from these utility values.
In one of the embodiments, the device further comprises:
a text collection module, configured to collect texts;
a text classification module, configured to choose a preset number of texts from the collected texts and annotate the named entities in them to generate the labeled text set, the remaining unannotated texts among the collected texts forming the unlabeled text set.
A computer-readable storage medium on which a computer program is stored, the program implementing the following steps when executed by a processor:
obtaining a labeled text set as the seed library of the current iteration round, the labeled text set comprising labeled texts;
obtaining an unlabeled text set, the unlabeled text set comprising unlabeled texts, and choosing a preset number of unlabeled texts from the unlabeled text set to form, together with the seed library, the resource library of the current iteration round;
calculating the average utility value of each named entity in the unlabeled texts;
sorting the average utility values in descending order and taking the preset number of top-ranked named entities as candidate words;
selecting the texts that contain the candidate words and have the largest utility value and adding them to the seed library to form the seed library of the next iteration round, then choosing a preset number of unlabeled texts from the unlabeled text set to form, together with the seed library, the resource library of the next iteration round, until all unlabeled texts in the unlabeled text set have been iterated over, thereby obtaining an annotation resource library;
scoring the candidate words in the annotation resource library;
obtaining the texts containing the candidate words whose scores exceed a set threshold, the set of these texts serving as the named entity annotation resource library.
Computer equipment comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, the processor implementing the following steps when executing the computer program:
obtaining a labeled text set as the seed library of the current iteration round, the labeled text set comprising labeled texts;
obtaining an unlabeled text set, the unlabeled text set comprising unlabeled texts, and choosing a preset number of unlabeled texts from the unlabeled text set to form, together with the seed library, the resource library of the current iteration round;
calculating the average utility value of each named entity in the unlabeled texts;
sorting the average utility values in descending order and taking the preset number of top-ranked named entities as candidate words;
selecting the texts that contain the candidate words and have the largest utility value and adding them to the seed library to form the seed library of the next iteration round, then choosing a preset number of unlabeled texts from the unlabeled text set to form, together with the seed library, the resource library of the next iteration round, until all unlabeled texts in the unlabeled text set have been iterated over, thereby obtaining an annotation resource library;
scoring the candidate words in the annotation resource library;
obtaining the texts containing the candidate words whose scores exceed a set threshold, the set of these texts serving as the named entity annotation resource library.
In the above method, device, storage medium and computer equipment for building a named entity annotation resource library, the labeled text set serves as the seed library of the current iteration round, and a preset number of unlabeled texts from the unlabeled text set are combined with the seed library to form the resource library of the current round. The average utility value of each named entity in the unlabeled texts is calculated, the average utility values are sorted in descending order, and the preset number of top-ranked named entities are taken as candidate words. The texts that contain the candidate words and have the largest utility value are then added to the seed library to form the seed library of the next round, and a preset number of unlabeled texts are again chosen from the unlabeled text set to form, together with the seed library, the resource library of the next round. This continues until all unlabeled texts in the unlabeled text set have been iterated over, which yields an annotation resource library. Finally, the candidate words in the annotation resource library are scored, the texts containing the candidate words whose scores exceed a set threshold are obtained, and the set of these texts serves as the named entity annotation resource library. The invention thus starts from a small seed library, combines it with unlabeled texts to form the resource library of each round, generates the seed library of the next round from it, and repeats until all unlabeled texts have been processed, discovering new named entities and generating the named entity annotation resource library. The method is simple to implement, fast, and deployable at large scale; the named entity annotation resource library can be expanded without limit to meet the needs of various scenarios.
Brief description of the drawings
Fig. 1 is a flow chart of the method for building a named entity annotation resource library in one embodiment;
Fig. 2 is a flow chart of the method for building a named entity annotation resource library in one embodiment;
Fig. 3 is a flow chart of the method for building a named entity annotation resource library in one embodiment;
Fig. 4 is a structural diagram of the device for building a named entity annotation resource library in one embodiment;
Fig. 5 is a structural diagram of the average utility value computing module in Fig. 4;
Fig. 6 is a structural diagram of the device for building a named entity annotation resource library in one embodiment.
Detailed description of the embodiments
To make the above objects, features and advantages of the present invention easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings. Many specific details are set forth in the following description to provide a thorough understanding of the invention. The invention can, however, be implemented in many ways other than those described here, and those skilled in the art can make similar improvements without departing from its spirit; the invention is therefore not limited to the specific embodiments disclosed below.
Unless otherwise defined, all technical and scientific terms used herein have the meanings commonly understood by those skilled in the technical field of the invention. The terms used in this description are intended only to describe specific embodiments and not to limit the invention. The technical features of the embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, any combination that involves no contradiction should be considered within the scope of this specification.
In one embodiment, as shown in Fig. 1, a method for building a named entity annotation resource library is provided, including:
Step 110: obtain a labeled text set as the seed library of the current iteration round; the labeled text set comprises labeled texts.
Internet text, such as news and comments, is first collected with a crawler program to serve as the raw corpus. A portion of the texts in the raw corpus is then selected and annotated with named entities by hand. Annotating only a small number of texts by hand saves labor cost; these annotated texts form the labeled text set. For example, if the raw corpus contains 1000 texts, 100 of them are chosen for manual annotation. The labeled text set serves as the seed library of the current iteration round. Manual annotation means marking which kind of named entity each named-entity word in a text belongs to. For example, manually annotating the sentence "Calf Online was founded in June 2013." gives: (Calf Online, organization name) was founded in (June 2013, time). That is, "Calf Online" in this sentence is labeled "organization name" and "June 2013" is labeled "time". The raw corpus may of course contain a different number of texts.
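The annotation result in this example can be held in a small data structure. The following is a minimal sketch only; the names `EntitySpan`, `LabeledText` and `labels_of` are illustrative assumptions and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EntitySpan:
    """One manually annotated named entity (hypothetical structure)."""
    text: str   # surface form of the entity
    label: str  # entity type, e.g. "organization name" or "time"


@dataclass
class LabeledText:
    """A sentence together with its annotated entities."""
    sentence: str
    entities: List[EntitySpan]


def labels_of(item: LabeledText) -> Dict[str, str]:
    """Map each annotated entity to its entity type."""
    return {span.text: span.label for span in item.entities}


example = LabeledText(
    sentence="Calf Online was founded in June 2013.",
    entities=[
        EntitySpan("Calf Online", "organization name"),
        EntitySpan("June 2013", "time"),
    ],
)
```

A labeled text set is then simply a list of such `LabeledText` items.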
Step 120: obtain an unlabeled text set, which comprises unlabeled texts, and choose a preset number of unlabeled texts from it to form, together with the seed library, the resource library of the current iteration round.
Removing the labeled text set from the raw corpus leaves the unlabeled text set. A preset number of unlabeled texts is chosen from the unlabeled text set and combined with the seed library to form the resource library of the current iteration round. For example, if the raw corpus contains 1000 texts and 100 of them have been annotated by hand to form the seed library, the remaining 900 form the unlabeled text set. In the current round, 1/9 of these 900 texts, that is 100 texts, are chosen and combined with the seed library to form the resource library of the current round. A different proportion of texts may of course be chosen.
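The bookkeeping in this example can be sketched as follows. This is an illustration under simple assumptions: texts are plain strings, the first texts are the ones sent for manual annotation, and the function names `split_corpus` and `next_batch` are hypothetical.

```python
from typing import List, Tuple


def split_corpus(texts: List[str], n_seed: int) -> Tuple[List[str], List[str]]:
    """Return (texts chosen for manual annotation, remaining unlabeled texts)."""
    return texts[:n_seed], texts[n_seed:]


def next_batch(unlabeled: List[str], batch_size: int) -> Tuple[List[str], List[str]]:
    """Take the next batch of unlabeled texts; return (batch, rest)."""
    return unlabeled[:batch_size], unlabeled[batch_size:]


# Numbers taken from the example in the text: 1000 texts, 100 annotated by hand.
corpus = [f"text {i}" for i in range(1000)]
seed_texts, unlabeled = split_corpus(corpus, 100)

# The resource library of one round is the seed library plus one batch.
batch, remaining = next_batch(unlabeled, 100)
resource_library = {"seed": seed_texts, "unlabeled": batch}
```

After the first round 800 unlabeled texts remain, which matches the iteration example given for step 150.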
Step 130: calculate the average utility value of each named entity in the unlabeled texts.
First, word segmentation is performed on the unlabeled texts in the resource library to obtain segmented unlabeled texts. Methods such as maximum matching or the hidden Markov model (HMM) can be used for the segmentation. For example, segmenting the text "Calf Online ranks first in South China." yields the segmented text "Calf Online ranks first in South China".
Second, a conditional random field (CRF) model is trained on the labeled texts in the resource library to obtain a prediction model. The prediction model is used to predict label sequences for the unlabeled texts in the resource library, and the optimal and suboptimal label sequences and their conditional probabilities are obtained from those label sequences. The conditional random field is one of the algorithms commonly used in natural language processing in recent years, for tasks such as syntactic analysis, named entity recognition and part-of-speech tagging. Each unlabeled text in the resource library of the current iteration round is processed with the CRF model to obtain its label sequences. The optimal and suboptimal label sequences of each unlabeled text are obtained, and their conditional probabilities are calculated.
Third, for each unlabeled text, the utility value of each named entity in that text is calculated from the conditional probabilities by means of the utility evaluation function. Finally, the utility value of each named entity in every unlabeled text that contains it is obtained, and the average utility value of each named entity is calculated from these utility values.
Step 140: sort the average utility values in descending order and take the preset number of top-ranked named entities as candidate words.
The calculated average utility values are sorted in descending order, and the first preset number of named entities are taken as named entity candidate words. For example, the top 10 named entities may be taken as the named entity candidate words of the current round, such as "Calf Online, Tsinghua University, Baidu, Alibaba, DJI, drone, intelligent robot, glasses, cosmetics, RMB".
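Averaging and ranking as described in steps 130 and 140 can be sketched as below. The sketch assumes a plain arithmetic mean over the texts containing each entity; the utility numbers are made up for illustration, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def average_utilities(records: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average the utility values of each entity over the texts containing it.

    Each record is (entity, utility value of that entity in one text).
    """
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for entity, utility in records:
        totals[entity] += utility
        counts[entity] += 1
    return {entity: totals[entity] / counts[entity] for entity in totals}


def top_candidates(averages: Dict[str, float], k: int) -> List[str]:
    """Sort entities by average utility, largest first, and keep the first k."""
    return sorted(averages, key=averages.get, reverse=True)[:k]


# Illustrative values only; they are not taken from the source.
records = [
    ("Calf Online", 0.75),
    ("Calf Online", 0.25),
    ("Baidu", 0.875),
    ("glasses", 0.125),
]
averages = average_utilities(records)
candidates = top_candidates(averages, 2)
```

With `k` set to 10 this corresponds to taking the top 10 named entities as the candidate words of the round.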
Step 150: select the texts that contain the candidate words and have the largest utility value and add them to the seed library to form the seed library of the next iteration round; then choose a preset number of unlabeled texts from the unlabeled text set to form, together with the seed library, the resource library of the next iteration round, until all unlabeled texts in the unlabeled text set have been iterated over, thereby obtaining an annotation resource library.
For each named entity candidate word, the texts containing that candidate word are selected from the resource library of the current iteration round, and from these the text in which the candidate word has the largest utility value is selected. The maximum-utility text of each named entity candidate word is added to the seed library, which becomes the seed library of the next iteration round. A preset number of unlabeled texts is then chosen again from the unlabeled text set to form, together with the seed library, the resource library of the next iteration round, and this continues until all unlabeled texts in the unlabeled text set have been iterated over, which yields the annotation resource library. Because the seed library is expanded with unlabeled texts from the internet, the named entity annotation resource library can be expanded without limit to meet the needs of various scenarios.
For example, if 800 unlabeled texts remain, the next iteration round chooses another 100 unlabeled texts from these 800 and combines them with the seed library obtained in the previous round to form the resource library of that round. The conditional probabilities, utility values and average utility values are calculated as before, and the texts that contain the candidate words and have the largest utility value are added to the seed library to form the seed library of the following round. Another 100 texts are then chosen from the remaining 700 unlabeled texts and combined with the seed library obtained in the previous round to form the resource library of that round. The iteration repeats in this way until all remaining unlabeled texts have been iterated over, and the final result is the annotation resource library.
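The iteration just described can be outlined as a loop. This is a sketch under stated assumptions, not the patented implementation: `score_batch` is a hypothetical callback standing in for CRF training, prediction and utility evaluation, and texts are treated as plain strings.

```python
from typing import Callable, Dict, List, Tuple

# (entity, text, utility value of the entity in that text)
Utility = Tuple[str, str, float]


def bootstrap(
    seed: List[str],
    unlabeled: List[str],
    batch_size: int,
    top_k: int,
    score_batch: Callable[[List[str], List[str]], List[Utility]],
) -> Tuple[List[str], List[str]]:
    """Grow the seed library batch by batch until no unlabeled text is left."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    seed_library = list(seed)
    pool = list(unlabeled)
    candidates: List[str] = []
    while pool:
        batch, pool = pool[:batch_size], pool[batch_size:]
        # Group the utility values of this round by entity.
        by_entity: Dict[str, List[Tuple[float, str]]] = {}
        for entity, text, utility in score_batch(seed_library, batch):
            by_entity.setdefault(entity, []).append((utility, text))
        average = {e: sum(u for u, _ in v) / len(v) for e, v in by_entity.items()}
        ranked = sorted(average, key=average.get, reverse=True)[:top_k]
        for entity in ranked:
            # Keep the text in which this candidate has the largest utility.
            _, best_text = max(by_entity[entity], key=lambda pair: pair[0])
            if best_text not in seed_library:
                seed_library.append(best_text)
            if entity not in candidates:
                candidates.append(entity)
    return seed_library, candidates


def toy_score(seed_library: List[str], batch: List[str]) -> List[Utility]:
    """Stand-in scorer: first word is the entity, text length is the utility."""
    return [(text.split()[0], text, float(len(text))) for text in batch]


library, found = bootstrap(
    ["S labeled"], ["A aa", "A aaaa", "B b", "C cc"], 2, 1, toy_score
)
```

Passing the scorer in as a parameter keeps the loop independent of any particular CRF library; a real scorer would retrain on the current seed library in each round.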
Step 160: score the candidate words in the annotation resource library.
The named entity candidate words in the annotation resource library are scored with a scoring formula according to how they fare in actual named entity recognition, giving a scoring result. The scoring formula is:
[formula]
It is defined in terms of two quantities: the frequency with which entity candidate word t is recognized as a named entity in the samples, and N_t, the total frequency with which t occurs in the corpus. The corpus comprises a named entity part and a common word part: the named entity part is the part of the corpus regarded as named entities, and the common word part is the part not regarded as named entities. In statistical natural language processing it is in practice impossible to observe language instances at full scale. A collection of texts is generally called a corpus, and several such collections together are called corpora.
Step 170: obtain the texts containing the candidate words whose scores exceed a set threshold; the set of these texts serves as the named entity annotation resource library.
A threshold is set for the scores, the scores are sorted in descending order, and the named entity candidate words whose scores exceed the set threshold are obtained. The texts containing these candidate words are then retrieved from the annotation resource library. The set of these texts is the named entity annotation resource library.
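Steps 160 and 170 can be sketched as below. The scoring formula itself is not legible in this text, so the sketch assumes, purely for illustration, that the score is the ratio of the recognized frequency to the total frequency N_t; that ratio, the frequency counts, and the function names are all assumptions rather than the source's definition.

```python
from typing import Dict, List


def score(recognized_freq: int, total_freq: int) -> float:
    """ASSUMED score: share of occurrences recognized as a named entity."""
    return recognized_freq / total_freq if total_freq else 0.0


def build_library(
    texts: List[str],
    recognized: Dict[str, int],
    total: Dict[str, int],
    threshold: float,
) -> List[str]:
    """Keep the texts containing a candidate whose score exceeds the threshold."""
    kept = {
        candidate
        for candidate, freq in recognized.items()
        if score(freq, total.get(candidate, 0)) > threshold
    }
    return [text for text in texts if any(c in text for c in kept)]


# Illustrative counts only; they are not taken from the source.
texts = ["Calf Online ranks first", "new glasses arrived"]
recognized = {"Calf Online": 9, "glasses": 1}
total = {"Calf Online": 10, "glasses": 20}
library = build_library(texts, recognized, total, threshold=0.5)
```

Any other scoring function can be substituted for `score` without changing the thresholding step.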
In the present embodiment, the labeled text set serves as the seed library of the current iteration round, and a preset number of unlabeled texts from the unlabeled text set are combined with the seed library to form the resource library of the current round. The average utility value of each named entity in the unlabeled texts is calculated, the average utility values are sorted in descending order, and the preset number of top-ranked named entities are taken as candidate words. The texts that contain the candidate words and have the largest utility value are then added to the seed library to form the seed library of the next round, and a preset number of unlabeled texts are again chosen from the unlabeled text set to form, together with the seed library, the resource library of the next round, until all unlabeled texts in the unlabeled text set have been iterated over, which yields an annotation resource library. Finally, the candidate words in the annotation resource library are scored, the texts containing the candidate words whose scores exceed a set threshold are obtained, and the set of these texts serves as the named entity annotation resource library. The invention thus starts from a small seed library, combines it with unlabeled texts to form the resource library of each round, generates the seed library of the next round from it, and repeats until all unlabeled texts have been processed, discovering new named entities and generating the named entity annotation resource library. The method is simple to implement, fast, and deployable at large scale; the named entity annotation resource library can be expanded without limit to meet the needs of various scenarios.
In one embodiment, as shown in Fig. 2, calculating the average utility value of each named entity candidate word in the resource library over the set of texts containing that named entity includes:
Step 131: perform word segmentation on the unlabeled texts in the resource library to obtain segmented unlabeled texts.
Methods such as maximum matching or the hidden Markov model (HMM) can be used to segment the unlabeled texts. Maximum matching is a mechanical segmentation method: following a given strategy, the Chinese character string to be analyzed is matched against the entries of a sufficiently large machine dictionary, and if a string is found in the dictionary the match succeeds and a word is identified. Hidden Markov models have proved very valuable in fields such as speech recognition, natural language processing and bioinformatics, and have so far been regarded as among the most successful methods for building fast and accurate speech recognition and natural language processing systems. For example, segmenting the text "Calf Online ranks first in South China." yields the segmented text "Calf Online ranks first in South China".
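The maximum matching method described above can be sketched as forward maximum matching. The source does not say whether forward or backward matching is used, so the forward variant, the toy dictionary and the unspaced Latin-letter input are assumptions made for illustration.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int) -> List[str]:
    """Segment text by repeatedly taking the longest dictionary word."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary
        # has a match; a single character is always accepted as fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy data: an unspaced string stands in for a Chinese character string.
toy_dictionary = {"calfonline", "in", "southchina"}
tokens = forward_max_match("calfonlineinsouthchina", toy_dictionary, max_len=10)
```

The single-character fallback guarantees progress on strings that contain out-of-dictionary material, which is where new named entities are expected to appear.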
Step 133: train a conditional random field (CRF) model on the labeled text in the resource bank to obtain a prediction model; use the prediction model to predict the label sequences of the unlabeled text in the resource bank; and obtain, from the label sequences of the unlabeled text, the optimal and second-best label sequences together with their conditional probabilities.
A CRF (Conditional Random Field) model is trained on the labeled text in the resource bank to obtain a prediction model, and the prediction model is used to predict the label sequences of the unlabeled text in the resource bank. Predicting the label sequence of a single unlabeled text can produce multiple different label sequences. From these, the optimal and second-best label sequences of each text are obtained, and their conditional probabilities are calculated: $P(y_1^* \mid x, \theta)$ and $P(y_2^* \mid x, \theta)$, where $y_1^*$ and $y_2^*$ are the optimal and second-best label sequences, θ is the model parameter, and x is a text label sequence sample.
The conditional random field is one of the algorithms most commonly used in natural language processing in recent years, typically for syntactic analysis, named entity recognition, part-of-speech tagging, and similar tasks. The CRF model is applied to each text in the resource bank of the current iteration to obtain the label sequence produced by labeling that text.
For example, when the prediction model predicts the label sequence of the segmented unlabeled text "Calf Online ranks first in South China", the possible labeling results and conditional probabilities are:
[(Calf Online, organization name, 0.9), (South China, place name, 0.89)],
[(Calf Online, place name, 0.09), (South China, time, 0.02)],
[(Calf Online, time, 0.01), (South China, organization name, 0.09)], and so on. The optimal label sequence for "Calf Online" is (Calf Online, organization name, 0.9), and the second-best label sequence for "Calf Online" is (Calf Online, place name, 0.09). That is, $P(y_1^* \mid x, \theta)$ is 0.9 and $P(y_2^* \mid x, \theta)$ is 0.09.
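To make the notion of optimal and second-best label sequences concrete, the sketch below enumerates every label sequence of a tiny linear-chain model, normalizes the exponentiated scores into conditional probabilities, and returns the two most probable sequences. It is a brute-force illustration under stated assumptions, not a CRF trainer: the emission and transition scores, the label names, and the function `top2_sequences` are all hypothetical, and a practical system would use n-best decoding rather than full enumeration.

```python
import itertools
import math


def top2_sequences(emission, transition, labels):
    """Return ((best_seq, p_best), (second_seq, p_second)).

    emission:   list with one dict per token, mapping label -> score
    transition: dict mapping (previous_label, label) -> score (missing = 0)
    labels:     list of possible labels
    """
    scored = []
    for seq in itertools.product(labels, repeat=len(emission)):
        score = sum(emission[i][label] for i, label in enumerate(seq))
        score += sum(transition.get(pair, 0.0) for pair in zip(seq, seq[1:]))
        scored.append((seq, score))
    # Partition function: sum of exponentiated scores over all sequences.
    z = sum(math.exp(score) for _, score in scored)
    ranked = sorted(
        ((seq, math.exp(score) / z) for seq, score in scored),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[0], ranked[1]


# Hypothetical scores for a two-token text and two labels.
emission = [{"ORG": 2.0, "O": 0.0}, {"ORG": 0.0, "O": 1.0}]
best, second = top2_sequences(emission, {}, ["ORG", "O"])
print(best, second)
```

The two returned probabilities play the roles of the conditional probabilities of the optimal and second-best label sequences used in the following step.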
Step 135: for each unlabeled text, calculate the utility value of each named entity in the unlabeled text from the conditional probabilities by means of the utility evaluation function.
The utility evaluation function is:
where $y_1^*$ is the optimal label sequence of x, $y_2^*$ is the second-best label sequence of x, θ is the model parameter, 0 ≤ λ ≤ 1 is the adjustment factor, $P(y_1^* \mid x, \theta)$ is the conditional probability of the optimal label sequence of x, $P(y_2^* \mid x, \theta)$ is the conditional probability of the second-best label sequence of x, and x is a text label sequence sample.
For each unlabeled text, the utility value of each named entity in the text is calculated with the utility evaluation function from the conditional probabilities obtained above. For example, the unlabeled text "Calf Online ranks first in South China" contains two named entity candidate words: "Calf Online" and "South China". The optimal label sequence for "Calf Online" is (Calf Online, organization name, 0.9) and the second-best label sequence is (Calf Online, place name, 0.09); that is, $P(y_1^* \mid x, \theta)$ is 0.9 and $P(y_2^* \mid x, \theta)$ is 0.09. With λ = 0.5, the utility evaluation function gives the utility value of "Calf Online" in this unlabeled text as 1 - (0.9 - (1 - 0.5) × 0.09) = 0.145. The utility value of "South China" in this unlabeled text is calculated in the same way. The utility value of each named entity in the other unlabeled texts is then calculated in turn.
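The arithmetic of the worked example can be checked with a few lines of code. The sketch below follows the computation exactly as it is written out in the example, 1 - (p_best - (1 - λ) × p_second); the function name `utility` and its parameter names are illustrative assumptions.

```python
def utility(p_best, p_second, lam=0.5):
    """Utility value as computed in the worked example:
    1 - (p_best - (1 - lam) * p_second), with 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return 1.0 - (p_best - (1.0 - lam) * p_second)


# Values from the example: best 0.9, second-best 0.09, adjustment factor 0.5.
print(round(utility(0.9, 0.09, lam=0.5), 3))  # 0.145
```

A candidate whose best label is only slightly more probable than the runner-up receives a higher utility, which is what makes it informative to add to the seed bank.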
Step 137: obtain the utility value of each named entity candidate word in every unlabeled text that contains the named entity, and calculate the average utility value of each named entity from these utility values.
The average utility is calculated as:
$$\bar{U}(t) = \frac{1}{|X_t|} \sum_{n=1}^{|X_t|} U_M(x_t), \quad x_t \in X_t,$$
where $X_t$ is the sample set containing entity candidate word t, $|X_t|$ is the number of samples containing entity candidate word t, $\bar{U}(t)$ is the average utility value of entity candidate word t over the sample set $X_t$, and $x_t$ is a text label sequence sample containing entity candidate word t.
The utility values calculated above for each named entity are averaged using the average utility formula to obtain the average utility value of each named entity candidate word. This embodiment uses the conditional probabilities of the optimal and second-best label sequences of each text, as output by the trained CRF model: for each unlabeled text, the utility value of each named entity in the text is calculated with the utility evaluation function from these conditional probabilities. The utility value of each named entity candidate word in every unlabeled text containing the named entity is then obtained, and the average utility value of each named entity is calculated from these utility values.
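The averaging step, together with the descending sort used to pick candidate words, can be sketched as follows. This is a minimal illustration assuming that the per-text utility values are already available as dictionaries; the names `average_utilities` and `top_candidates` and the sample values are hypothetical.

```python
from collections import defaultdict


def average_utilities(per_text_utilities):
    """per_text_utilities: one dict per text, mapping entity candidate -> utility.
    Returns the mean utility of each entity over the texts that contain it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for utilities in per_text_utilities:
        for entity, value in utilities.items():
            totals[entity] += value
            counts[entity] += 1
    return {entity: totals[entity] / counts[entity] for entity in totals}


def top_candidates(averages, k):
    """Entities sorted by average utility, largest first, truncated to k."""
    return sorted(averages, key=averages.get, reverse=True)[:k]


# Hypothetical utilities for two texts.
texts = [{"A": 0.2, "B": 0.4}, {"A": 0.4}]
averages = average_utilities(texts)
print(top_candidates(averages, 1))
```

Each entity is averaged only over the texts in which it occurs, matching the division by the number of samples that contain the candidate word.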
In one embodiment, as shown in Figure 3, before obtaining the labeled text set as the seed bank of the current iteration, the method further includes:
Step 180: collect text information.
Before the labeled text set is obtained as the seed bank of the current iteration, a crawler program is used to collect text information from the internet, such as news and comments, as the source material library.
Step 190: select a preset number of text items from the collected text information and label the named entities in them to generate the labeled text set; the remaining unlabeled text in the collected text information forms the unlabeled text set.
Part of the text in the source material library is selected, and its named entities are labeled manually. The manually labeled text forms the labeled text set; after this labeled text set is removed from the source material library, all of the remaining unlabeled text forms the unlabeled text set.
In this embodiment, a certain amount of text is first obtained with a crawler program, and named entities are then manually labeled in part of it. This labeled text set is used as part of the seed bank for subsequent training, which can improve the accuracy of the subsequent training results.
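The split into a labeled set and an unlabeled set described above can be sketched as below. The manual labeling step is represented by a callback; `split_corpus`, `annotate`, and the fixed random seed are illustrative assumptions, and the text does not specify how the items to be labeled are chosen (random sampling is used here only as a placeholder).

```python
import random


def split_corpus(texts, n_labeled, annotate, seed=0):
    """Pick `n_labeled` texts for labeling; the rest stay unlabeled.

    annotate(text) stands in for manual named entity labeling and should
    return the labels for one text.
    """
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(texts)), n_labeled))
    labeled = [(texts[i], annotate(texts[i])) for i in sorted(chosen)]
    unlabeled = [text for i, text in enumerate(texts) if i not in chosen]
    return labeled, unlabeled


# Hypothetical corpus and a dummy annotator that returns no entities.
corpus = ["text 1", "text 2", "text 3", "text 4", "text 5"]
labeled, unlabeled = split_corpus(corpus, 2, annotate=lambda text: [])
print(len(labeled), len(unlabeled))
```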
In one embodiment, the utility evaluation function is:
where $y_1^*$ is the optimal label sequence of x, $y_2^*$ is the second-best label sequence of x, θ is the model parameter, 0 ≤ λ ≤ 1 is the adjustment factor, $P(y_1^* \mid x, \theta)$ is the conditional probability of the optimal label sequence of x, $P(y_2^* \mid x, \theta)$ is the conditional probability of the second-best label sequence of x, and x is a text label sequence sample.
This embodiment proposes a utility evaluation function for calculating the utility value of each named entity in a text label sequence, taking the conditional probabilities output by the CRF model as input. The method is computationally simple, yields results with high confidence, and is suitable for processing large-scale text. Text data is unstructured, and evaluating effectiveness on unstructured data is generally difficult; this method makes quantitative evaluation of named entities in text possible.
In one embodiment, the average utility is calculated as:
$$\bar{U}(t) = \frac{1}{|X_t|} \sum_{n=1}^{|X_t|} U_M(x_t), \quad x_t \in X_t,$$
where $X_t$ is the sample set containing entity candidate word t, $|X_t|$ is the number of samples containing entity candidate word t, $\bar{U}(t)$ is the average utility value of entity candidate word t over the sample set $X_t$, and $x_t$ is a text label sequence sample containing entity candidate word t.
In this embodiment, using the calculated utility value of each named entity in the text label sequences, the utility values of each named entity candidate word over the texts in the resource bank that contain the named entity are summed and averaged to obtain the average utility value. Likewise, this method is computationally simple and easy to put into practice.
In one embodiment, as shown in Figure 4, a device 400 for building a named entity annotation resource bank is also provided. The device includes: a seed bank acquisition module 410, a resource bank acquisition module 420, an average utility value calculation module 430, a named entity candidate word acquisition module 440, an annotation resource bank generation module 450, a candidate word scoring module 460, and a named entity annotation resource bank generation module 470.
The seed bank acquisition module 410 is configured to obtain the labeled text set as the seed bank of the current iteration, the labeled text set including labeled text.
The resource bank acquisition module 420 is configured to obtain the unlabeled text set, which includes unlabeled text, and to select a preset number of unlabeled texts from the unlabeled text set to form, together with the seed bank, the resource bank of the current iteration.
The average utility value calculation module 430 is configured to calculate the average utility value of each named entity in the unlabeled text.
The named entity candidate word acquisition module 440 is configured to sort the average utility values in descending order and take a preset number of top-ranked named entities as candidate words.
The annotation resource bank generation module 450 is configured to select the text that contains a candidate word and has the largest utility value and add it to the seed bank as the seed bank of the next iteration, and then to select a preset number of unlabeled texts from the unlabeled text set to form, together with the seed bank, the resource bank of the next iteration, until all unlabeled texts in the unlabeled text set have been iterated over, yielding the annotation resource bank.
The candidate word scoring module 460 is configured to score the candidate words in the annotation resource bank.
The named entity annotation resource bank generation module 470 is configured to obtain the texts containing the candidate words whose scores exceed a set threshold, and to use the set of these texts as the named entity annotation resource bank.
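The way the modules above cooperate across iterations can be sketched as a single loop. This is one possible reading under stated assumptions, not the device's actual implementation: `build_annotation_bank` and the `text_utilities` callback are hypothetical names, the callback stands in for segmentation, CRF prediction, and the utility evaluation function, and the final scoring and threshold filtering of candidate words are left out because the text does not say how the score is computed.

```python
from collections import defaultdict


def build_annotation_bank(seed_bank, unlabeled, batch_size, top_k, text_utilities):
    """Iterate until every unlabeled text has been processed.

    text_utilities(seed_bank, batch) returns one dict per text in `batch`,
    mapping entity candidate -> utility value in that text.
    """
    seed_bank = list(seed_bank)
    remaining = list(unlabeled)
    candidates = set()
    while remaining:
        # Resource bank of this iteration: seed bank plus the next batch.
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        per_text = text_utilities(seed_bank, batch)
        # Average utility of each entity over the texts that contain it.
        sums, counts = defaultdict(float), defaultdict(int)
        for utilities in per_text:
            for entity, value in utilities.items():
                sums[entity] += value
                counts[entity] += 1
        averages = {entity: sums[entity] / counts[entity] for entity in sums}
        # Top-ranked entities (descending average utility) become candidates.
        top = sorted(averages, key=averages.get, reverse=True)[:top_k]
        candidates.update(top)
        # For each candidate, add the text where its utility is largest.
        for entity in top:
            holders = [(u[entity], i) for i, u in enumerate(per_text) if entity in u]
            _, best_index = max(holders)
            if batch[best_index] not in seed_bank:
                seed_bank.append(batch[best_index])
    return seed_bank, candidates
```

Passing the utility computation in as a callback keeps the loop independent of the particular segmenter and CRF toolkit, which the text leaves open.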
In one embodiment, as shown in Figure 5, the average utility value calculation module 430 includes: a word segmentation module 431, a conditional probability calculation module 433, a utility value calculation module 435, and an average utility value acquisition module 437.
The word segmentation module 431 is configured to segment the unlabeled text in the resource bank to obtain segmented unlabeled text.
The conditional probability calculation module 433 is configured to train a conditional random field (CRF) model on the labeled text in the resource bank to obtain a prediction model, to predict the label sequences of the unlabeled text in the resource bank with the prediction model, and to obtain, from the label sequences of the unlabeled text, the optimal and second-best label sequences together with their conditional probabilities.
The utility value calculation module 435 is configured to calculate, for each unlabeled text, the utility value of each named entity in the unlabeled text from the conditional probabilities by means of the utility evaluation function.
The average utility value acquisition module 437 is configured to obtain the utility value of each named entity in every unlabeled text that contains the named entity, and to calculate the average utility value of each named entity from these utility values.
In one embodiment, as shown in Figure 6, the device 400 for building a named entity annotation resource bank further includes: a text information collection module 480 and a text information classification module 490.
The text information collection module 480 is configured to collect text information.
The text information classification module 490 is configured to select a preset number of text items from the collected text information and label the named entities in them to generate the labeled text set; the remaining unlabeled text in the collected text information forms the unlabeled text set.
In one embodiment, a computer-readable storage medium is also provided, on which a computer program is stored. When executed by a processor, the program implements the following steps: obtaining a labeled text set as the seed bank of the current iteration, the labeled text set including labeled text; obtaining an unlabeled text set, the unlabeled text set including unlabeled text, and selecting a preset number of unlabeled texts from the unlabeled text set to form, together with the seed bank, the resource bank of the current iteration; calculating the average utility value of each named entity in the unlabeled text; sorting the average utility values in descending order and taking a preset number of top-ranked named entities as candidate words; selecting the text that contains a candidate word and has the largest utility value and adding it to the seed bank as the seed bank of the next iteration, then selecting a preset number of unlabeled texts from the unlabeled text set to form, together with the seed bank, the resource bank of the next iteration, until all unlabeled texts in the unlabeled text set have been iterated over, to obtain the annotation resource bank; scoring the candidate words in the annotation resource bank; and obtaining the texts containing the candidate words whose scores exceed a set threshold, and using the set of these texts as the named entity annotation resource bank.
In one embodiment, when executed by the processor, the program further implements the following steps: segmenting the unlabeled text in the resource bank to obtain segmented unlabeled text; training a conditional random field (CRF) model on the labeled text in the resource bank to obtain a prediction model, predicting the label sequences of the unlabeled text in the resource bank with the prediction model, and obtaining, from the label sequences of the unlabeled text, the optimal and second-best label sequences together with their conditional probabilities; for each unlabeled text, calculating the utility value of each named entity in the unlabeled text from the conditional probabilities by means of the utility evaluation function; and obtaining the utility value of each named entity in every unlabeled text that contains the named entity, and calculating the average utility value of each named entity from these utility values.
In one embodiment, when executed by the processor, the program further implements the following steps: collecting text information; and selecting a preset number of text items from the collected text information and labeling the named entities in them to generate the labeled text set, the remaining unlabeled text in the collected text information forming the unlabeled text set.
In one embodiment, when executed by the processor, the program further implements the following: the utility evaluation function is
where $y_1^*$ is the optimal label sequence of x, $y_2^*$ is the second-best label sequence of x, θ is the model parameter, 0 ≤ λ ≤ 1 is the adjustment factor, $P(y_1^* \mid x, \theta)$ is the conditional probability of the optimal label sequence of x, $P(y_2^* \mid x, \theta)$ is the conditional probability of the second-best label sequence of x, and x is a text label sequence sample.
In one embodiment, when executed by the processor, the program further implements the following: the average utility is calculated as
$$\bar{U}(t) = \frac{1}{|X_t|} \sum_{n=1}^{|X_t|} U_M(x_t), \quad x_t \in X_t,$$
where $X_t$ is the sample set containing entity candidate word t, $|X_t|$ is the number of samples containing entity candidate word t, $\bar{U}(t)$ is the average utility value of entity candidate word t over the sample set $X_t$, and $x_t$ is a text label sequence sample containing entity candidate word t.
In one embodiment, a computer device is also provided. The computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, the following steps are implemented:
obtaining a labeled text set as the seed bank of the current iteration, the labeled text set including labeled text; obtaining an unlabeled text set, the unlabeled text set including unlabeled text, and selecting a preset number of unlabeled texts from the unlabeled text set to form, together with the seed bank, the resource bank of the current iteration; calculating the average utility value of each named entity in the unlabeled text; sorting the average utility values in descending order and taking a preset number of top-ranked named entities as candidate words; selecting the text that contains a candidate word and has the largest utility value and adding it to the seed bank as the seed bank of the next iteration, then selecting a preset number of unlabeled texts from the unlabeled text set to form, together with the seed bank, the resource bank of the next iteration, until all unlabeled texts in the unlabeled text set have been iterated over, to obtain the annotation resource bank; scoring the candidate words in the annotation resource bank; and obtaining the texts containing the candidate words whose scores exceed a set threshold, and using the set of these texts as the named entity annotation resource bank.
In one embodiment, when the processor executes the computer program, the following steps are further implemented: segmenting the unlabeled text in the resource bank to obtain segmented unlabeled text; training a conditional random field (CRF) model on the labeled text in the resource bank to obtain a prediction model, predicting the label sequences of the unlabeled text in the resource bank with the prediction model, and obtaining, from the label sequences of the unlabeled text, the optimal and second-best label sequences together with their conditional probabilities; for each unlabeled text, calculating the utility value of each named entity in the unlabeled text from the conditional probabilities by means of the utility evaluation function; and obtaining the utility value of each named entity in every unlabeled text that contains the named entity, and calculating the average utility value of each named entity from these utility values.
In one embodiment, when the processor executes the computer program, the following steps are further implemented: collecting text information; and selecting a preset number of text items from the collected text information and labeling the named entities in them to generate the labeled text set, the remaining unlabeled text in the collected text information forming the unlabeled text set.
In one embodiment, when the processor executes the computer program, the following is further implemented: the utility evaluation function is
where $y_1^*$ is the optimal label sequence of x, $y_2^*$ is the second-best label sequence of x, θ is the model parameter, 0 ≤ λ ≤ 1 is the adjustment factor, $P(y_1^* \mid x, \theta)$ is the conditional probability of the optimal label sequence of x, $P(y_2^* \mid x, \theta)$ is the conditional probability of the second-best label sequence of x, and x is a text label sequence sample.
In one embodiment, when the processor executes the computer program, the following is further implemented: the average utility is calculated as
$$\bar{U}(t) = \frac{1}{|X_t|} \sum_{n=1}^{|X_t|} U_M(x_t), \quad x_t \in X_t,$$
where $X_t$ is the sample set containing entity candidate word t, $|X_t|$ is the number of samples containing entity candidate word t, $\bar{U}(t)$ is the average utility value of entity candidate word t over the sample set $X_t$, and $x_t$ is a text label sequence sample containing entity candidate word t.
The embodiments described above represent only several implementations of the present invention. Although their description is relatively specific and detailed, they should not be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and all of these fall within the scope of protection of the present invention. The scope of protection of this patent shall therefore be determined by the appended claims.
Claims (10)
1. A method for building a named entity annotation resource bank, the method comprising:
obtaining a labeled text set as the seed bank of the current iteration, the labeled text set including labeled text;
obtaining an unlabeled text set, the unlabeled text set including unlabeled text, and selecting a preset number of unlabeled texts from the unlabeled text set to form, together with the seed bank, the resource bank of the current iteration;
calculating the average utility value of each named entity in the unlabeled text;
sorting the average utility values in descending order and taking a preset number of top-ranked named entities as candidate words;
selecting the text that contains the candidate word and has the largest utility value and adding it to the seed bank as the seed bank of the next iteration, then selecting a preset number of unlabeled texts from the unlabeled text set to form, together with the seed bank, the resource bank of the next iteration, until all unlabeled texts in the unlabeled text set have been iterated over, to obtain an annotation resource bank;
scoring the candidate words in the annotation resource bank; and
obtaining the texts containing the candidate words whose scores exceed a set threshold, and using the set of these texts as the named entity annotation resource bank.
2. The method according to claim 1, characterized in that calculating the average utility value of each named entity in the unlabeled text comprises:
segmenting the unlabeled text in the resource bank to obtain segmented unlabeled text;
training a conditional random field (CRF) model on the labeled text in the resource bank to obtain a prediction model, predicting the label sequences of the unlabeled text in the resource bank with the prediction model, and obtaining, from the label sequences of the unlabeled text, the optimal and second-best label sequences together with the conditional probabilities of the optimal and second-best label sequences;
for each unlabeled text, calculating the utility value of each named entity in the unlabeled text from the conditional probabilities by means of a utility evaluation function; and
obtaining the utility value of each named entity in every unlabeled text that contains the named entity, and calculating the average utility value of each named entity from the utility values.
3. The method according to claim 1, characterized in that, before obtaining the labeled text set as the seed bank of the current iteration, the method further comprises:
collecting text information; and
selecting a preset number of text items from the collected text information and labeling the named entities in the preset number of text items to generate the labeled text set, the remaining unlabeled text in the collected text information forming the unlabeled text set.
4. The method according to claim 2, characterized in that the utility evaluation function is:
$$U_M(x) = 1 - \left(P(y_1^* \mid x, \theta)\right) - (1 - \lambda)\left(P(y_2^* \mid x, \theta)\right)$$
where $y_1^*$ is the optimal label sequence of x, $y_2^*$ is the second-best label sequence of x, θ is the model parameter, 0 ≤ λ ≤ 1 is the adjustment factor, $P(y_1^* \mid x, \theta)$ is the conditional probability of the optimal label sequence of x, $P(y_2^* \mid x, \theta)$ is the conditional probability of the second-best label sequence of x, and x is a text label sequence sample.
5. The method according to claim 2, characterized in that the average utility is calculated as:
$$\bar{U}(t) = \frac{1}{|X_t|} \sum_{n=1}^{|X_t|} U_M(x_t), \quad x_t \in X_t,$$
where $X_t$ is the sample set containing entity candidate word t, $|X_t|$ is the number of samples containing entity candidate word t, $\bar{U}(t)$ is the average utility value of entity candidate word t over the sample set $X_t$, and $x_t$ is a text label sequence sample containing entity candidate word t.
6. A device for building a named entity annotation resource bank, characterized in that the device comprises:
a seed bank acquisition module, configured to obtain a labeled text set as the seed bank of the current iteration, the labeled text set including labeled text;
a resource bank acquisition module, configured to obtain an unlabeled text set, the unlabeled text set including unlabeled text, and to select a preset number of unlabeled texts from the unlabeled text set to form, together with the seed bank, the resource bank of the current iteration;
an average utility value calculation module, configured to calculate the average utility value of each named entity in the unlabeled text;
a named entity candidate word acquisition module, configured to sort the average utility values in descending order and take a preset number of top-ranked named entities as candidate words;
an annotation resource bank generation module, configured to select the text that contains the candidate word and has the largest utility value and add it to the seed bank as the seed bank of the next iteration, and then to select a preset number of unlabeled texts from the unlabeled text set to form, together with the seed bank, the resource bank of the next iteration, until all unlabeled texts in the unlabeled text set have been iterated over, to obtain an annotation resource bank;
a candidate word scoring module, configured to score the candidate words in the annotation resource bank; and
a named entity annotation resource bank generation module, configured to obtain the texts containing the candidate words whose scores exceed a set threshold, and to use the set of these texts as the named entity annotation resource bank.
7. The device according to claim 6, characterized in that the average utility value calculation module comprises:
a word segmentation module, configured to segment the unlabeled text in the resource bank to obtain segmented unlabeled text;
a conditional probability calculation module, configured to train a conditional random field (CRF) model on the labeled text in the resource bank to obtain a prediction model, to predict the label sequences of the unlabeled text in the resource bank with the prediction model, and to obtain, from the label sequences of the unlabeled text, the optimal and second-best label sequences together with the conditional probabilities of the optimal and second-best label sequences;
a utility value calculation module, configured to calculate, for each unlabeled text, the utility value of each named entity in the unlabeled text from the conditional probabilities by means of a utility evaluation function; and
an average utility value acquisition module, configured to obtain the utility value of each named entity in every unlabeled text that contains the named entity, and to calculate the average utility value of each named entity from the utility values.
8. The device according to claim 6, characterized in that the device further comprises:
a text information collection module, configured to collect text information; and
a text information classification module, configured to select a preset number of text items from the collected text information and label the named entities in the preset number of text items to generate the labeled text set, the remaining unlabeled text in the collected text information forming the unlabeled text set.
9. A computer-readable storage medium on which a computer program is stored, characterized in that, when executed by a processor, the program implements the method for building a named entity annotation resource bank according to any one of claims 1 to 5.
10. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the method for building a named entity annotation resource bank according to any one of claims 1 to 5 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710447680.5A CN107329951A (en) | 2017-06-14 | 2017-06-14 | Build name entity mark resources bank method, device, storage medium and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107329951A true CN107329951A (en) | 2017-11-07 |
Family
ID=60194667
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710447680.5A Pending CN107329951A (en) | 2017-06-14 | 2017-06-14 | Build name entity mark resources bank method, device, storage medium and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107329951A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2506157A1 (en) * | 2011-03-30 | 2012-10-03 | British Telecommunications Public Limited Company | Textual analysis system |
CN103268339A (en) * | 2013-05-17 | 2013-08-28 | 中国科学院计算技术研究所 | Recognition method and system of named entities in microblog messages |
Non-Patent Citations (1)
Title |
---|
江会星: "汉语命名实体识别研究", 《中国博士学位论文全文数据库信息科技辑》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109859813A (en) * | 2019-01-30 | 2019-06-07 | 新华三大数据技术有限公司 | A kind of entity modification word recognition method and device |
CN110245757A (en) * | 2019-06-14 | 2019-09-17 | 上海商汤智能科技有限公司 | A kind of processing method and processing device of image pattern, electronic equipment and storage medium |
CN110245757B (en) * | 2019-06-14 | 2022-04-01 | 上海商汤智能科技有限公司 | Image sample processing method and device, electronic equipment and storage medium |
CN110750993A (en) * | 2019-10-15 | 2020-02-04 | 成都数联铭品科技有限公司 | Word segmentation method, word segmentation device, named entity identification method and system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110633409B (en) | Automobile news event extraction method integrating rules and deep learning | |
CN104268197B (en) | Fine-grained sentiment analysis method for industry comment data | |
CN106649272B (en) | Named entity recognition method based on a hybrid model | |
CN108829801A (en) | Event trigger word extraction method based on a document-level attention mechanism | |
CN107315738B (en) | Method for assessing the degree of innovation of text information | |
CN107330011A (en) | Multi-strategy fusion named entity recognition method and device | |
CN107220237A (en) | Method for business entity relation extraction based on convolutional neural networks | |
CN108255813B (en) | Text matching method based on word frequency-inverse document and CRF | |
CN103870000B (en) | Method and device for ranking candidate items generated by an input method | |
CN108549634A (en) | Chinese patent text similarity calculation method | |
CN102081602B (en) | Method and equipment for determining category of unlisted word | |
CN103823857B (en) | Space information searching method based on natural language processing | |
CN108388914B (en) | Classifier construction method based on semantic calculation and classifier | |
CN110096575B (en) | Psychological profiling method for microblog users | |
CN108874896B (en) | Humor identification method based on neural network and humor characteristics | |
CN107092605A (en) | Entity linking method and device | |
CN103324700A (en) | Ontology concept attribute learning method based on Web information | |
CN103631858A (en) | Science and technology project similarity calculation method | |
CN106547864A (en) | Personalized search method based on query expansion | |
CN112069312B (en) | Text classification method based on entity recognition and electronic device | |
CN104699797A (en) | Structured parsing method and device for webpage data | |
CN110851593A (en) | Complex value word vector construction method based on position and semantics | |
CN112328800A (en) | System and method for automatically generating programming specification question answers | |
CN115357719A (en) | Power audit text classification method and device based on improved BERT model | |
CN107329951A (en) | Build name entity mark resources bank method, device, storage medium and computer equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20171107 |