CN107193959A - A plain-text-oriented enterprise entity classification method - Google Patents

Info

Publication number
CN107193959A
Authority
CN
China
Prior art keywords
business entity
entity
word
training
mark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710371464.7A
Other languages
Chinese (zh)
Other versions
CN107193959B (en)
Inventor
张雷
陈嘉伟
谢璐遥
王崇骏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201710371464.7A
Publication of CN107193959A
Application granted
Publication of CN107193959B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Abstract

The invention discloses a plain-text-oriented enterprise entity classification method comprising the following steps: S1, annotating the enterprise entities in the collected plain-text data to form the training set of the enterprise entity recognition module, and labeling the enterprise entities in the collected plain-text data by industry type to form the training sample set of the enterprise entity classification module; S2, training an enterprise entity recognition model with a conditional random field model; S3, constructing semantic vectors from the text data of the original training set; S4, training an enterprise entity classification model, using the category-labeled training set data after semantic vectorization as the training input; S5, classifying the enterprise entities in the text to be predicted with the enterprise entity classification model. The method uses the resulting semantic vectors as entity features, which reduces dependence on hand-crafted features and external data and helps ensure generality and robustness.

Description

A plain-text-oriented enterprise entity classification method
Technical field
The invention belongs to the technical field of named entity recognition and fine-grained entity classification, and in particular relates to a plain-text-oriented enterprise entity classification method.
Background
In recent years, with the rise of "internet finance", more and more corporate decision makers urgently need more advanced information processing methods to extract and analyze massive internet data so that they can make better decisions. Within this mass of data, plain-text data such as court documents and news and public-opinion articles has become the primary source from which enterprises obtain high-value information.
Named entity recognition is the foundation for tasks such as entity semantic analysis and entity relation extraction. Current mainstream named entity recognition techniques simply divide entities into person names, place names, organization names and the like, so entity types lack semantics. Meanwhile, entity classification depends excessively on hand-crafted features and external data, so its generality and robustness cannot be guaranteed.
Summary of the invention
Current mainstream named entity recognition techniques simply divide entities into person names, place names, organization names and the like, so entity types lack semantics. In addition, entity classification depends excessively on hand-crafted features and external data, so its generality and robustness cannot be guaranteed. To solve these problems, the invention proposes a plain-text-oriented enterprise entity classification method that adopts a finer-grained division of enterprise entities and builds features from the semantics of the text itself to arrive at the classification of enterprise entities. Here, plain text means text that contains business activity information, such as news articles and court documents.
As shown in Fig. 1, the plain-text-oriented enterprise entity classification method disclosed by the invention comprises the following steps:
S1, annotate the enterprise entities in the collected plain-text data, and use the annotated data as the training set of the enterprise entity recognition module; label the enterprise entities in the collected plain-text data by industry type, and use the labeled data as the training sample set of the enterprise entity classification module;
S2, train an enterprise entity recognition model with a conditional random field model to obtain the enterprise entity recognition model;
S3, construct semantic vectors from the text data of the original training set;
S4, train the enterprise entity classification model, using the category-labeled training set data after semantic vectorization as the training input;
S5, classify the enterprise entities in the text to be predicted with the enterprise entity classification model.
Further, in S1, the collected plain-text data undergoes sentence splitting, word segmentation and part-of-speech tagging, and the enterprise entities and industry categories in the plain-text data are annotated manually.
Further, the open-source word segmentation and part-of-speech tagging software HanLP is used to perform sentence splitting, word segmentation and part-of-speech tagging on the plain-text data.
Further, enterprise entities in the plain-text data are annotated in the "BIO" scheme: the first word of an enterprise entity is labeled "B", the remaining words of the entity are labeled "I", and words unrelated to any enterprise entity are labeled "O".
Further, in the manual annotation, each enterprise entity in the plain-text data is given a category label according to its industry type, judged from the context.
Further, in S2, the enterprise entity recognition model is trained with a conditional random field model that introduces boundary features.
Further, the conditional random field model with boundary features involves: segmenting enterprise names with HanLP and compiling left and right boundary dictionaries from the result; training prediction models for the left and right boundaries with the open-source libSVM; taking words from the training set one by one and using the boundary prediction models to judge whether each word is a left or right boundary word; and feeding training set data comprising the word itself, its part-of-speech tag, its left/right boundary marks and its entity label into an open-source conditional random field tool to train and obtain the enterprise entity recognition model.
Further, in S3, a word vector tool is used to obtain the word vectors of all words in the training sample set; the inverse document frequency (IDF) value of every word in the training sample set is computed; the word vectors and TF-IDF values are used to compute the vector of the enterprise entity and the vector of its context in each sentence containing an enterprise entity; and the entity vector and the context vector are concatenated to obtain an enterprise entity semantic vector that includes contextual semantics.
Further, the open-source word2vec tool is used to compute the word vectors of all words in the training set.
Further, in S4, a softmax model is trained on the category-labeled training set data to obtain the enterprise entity classification model.
The invention has the following advantages:
1) The left and right boundaries of entities are determined in advance with dictionary rules and an SVM classifier, and the boundary judgments are then introduced into the conditional random field model as new features; the improved method achieves a large gain in recall and F1 score.
2) Entities and their contexts are represented as semantic vectors by weighting word embeddings, so that the semantics between entities can be measured by semantic vector distance. Using the resulting semantic vectors as entity features reduces dependence on hand-crafted features and external data.
3) Entity boundary features are introduced into the existing conditional random field model, which strengthens the model's control over entity boundaries; recognition recall, for example, improves markedly, and generality and robustness are also ensured.
Brief description of the drawings
Fig. 1 is a flow diagram of the plain-text-oriented enterprise entity classification method disclosed by the invention.
Fig. 2 is a flow chart of training set construction in the embodiment.
Fig. 3 is a flow chart of enterprise entity recognition model training based on the improved conditional random field in the embodiment.
Fig. 4 is a flow chart of entity semantic vector construction based on word vectors weighted by TF-IDF values in the embodiment.
Fig. 5 is a flow chart of enterprise entity classification model training.
Fig. 6 is a flow chart of enterprise entity classification.
Detailed description of the embodiments
For a better understanding of the technical content of the invention, a specific embodiment of the enterprise entity classification method, applied to court documents, is described below with reference to the accompanying drawings.
As shown in Fig. 2, the training sample set is built before the method is applied. In the embodiment, the training sample set is built as follows:
Step 1-0, start of training set construction.
Step 1-1, collect court documents from the internet with a web crawler as the raw corpus.
Step 1-2, apply the open-source word segmentation and part-of-speech tagging software HanLP to the collected documents to perform sentence splitting, word segmentation and part-of-speech tagging. Any general open-source segmenter could be used instead, for example the Chinese Academy of Sciences segmenter; HanLP was chosen in the embodiment because its segmentation quality compares well with other current open-source segmenters and because it supports user-defined dictionaries, which makes it more convenient.
Step 1-3, because an enterprise entity in the text (that is, the name of an enterprise, mainly in its full or abbreviated form) is usually cut into several words by segmentation, the enterprise entities in the document text must be marked manually. The "BIO" scheme is used: the first word of an enterprise entity is labeled "B", the remaining words of the entity are labeled "I", and words unrelated to any enterprise entity are labeled "O", for example "defendant (O) Jiangsu (B) Eurasian (I) film (I) Co., Ltd. (I)". The annotated data serves as the training set of the enterprise entity recognition model.
At the same time, each enterprise entity in the collected document text is given a category label according to its industry type, judged from the context. The labeled data serves as the training set of the enterprise entity classification model: each item consists of a sentence containing an enterprise name together with the label of the industry that enterprise belongs to, and the whole training set is a collection of such sentence-plus-label pairs. For the labeling standard, the accurate and authoritative Industrial Classification for National Economic Activities (GB/T 4754-2011) can be used.
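The BIO labelling described in step 1-3 can be illustrated with a short Python sketch. It assumes the annotator supplies the token positions of each enterprise name; the function name `bio_tags` and the span format are illustrative choices, not part of the patent.

```python
def bio_tags(tokens, entity_spans):
    """Return one BIO tag per token.

    entity_spans is a list of (start, end) token index pairs, end exclusive,
    marking where each enterprise name sits in the segmented sentence.
    The span format is an assumption made for this sketch.
    """
    tags = ["O"] * len(tokens)            # default: unrelated to any enterprise
    for start, end in entity_spans:
        tags[start] = "B"                 # first word of the enterprise name
        for i in range(start + 1, end):
            tags[i] = "I"                 # remaining words of the name
    return tags


# Segmented sentence from the example in step 1-3 (English glosses).
tokens = ["defendant", "Jiangsu", "Eurasian", "film", "Co., Ltd."]
tags = bio_tags(tokens, [(1, 5)])
labelled = list(zip(tokens, tags))
```

Here `labelled` pairs each word with its tag, reproducing the annotated example given in the text.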
Step 1-4, end of training set construction.
As shown in Fig. 3, once the training set has been built, the enterprise entity recognition model is trained with an improved maximum matching approach, that is, with a conditional random field model that introduces boundary features.
Step 2-0, start of enterprise entity recognition model training.
Step 2-1, input the training set data that has undergone sentence splitting, word segmentation, part-of-speech tagging and entity annotation (the annotated data from step 1-3).
Step 2-2, crawl a number of enterprise directories from the internet, segment the enterprise names with HanLP and compile left and right boundary dictionaries from the result. A left boundary word is the first word of a segmented enterprise name, and a right boundary word is its last word. All left and right boundary words are collected into the left and right boundary dictionaries.
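As a minimal sketch of step 2-2, the following Python collects the first and last word of each segmented enterprise name. It assumes segmentation has already been done (by HanLP in the embodiment); the function name and the sample names are hypothetical.

```python
def build_boundary_dictionaries(segmented_names):
    """Collect the first and last word of every segmented enterprise name."""
    left, right = set(), set()
    for words in segmented_names:
        if not words:                 # skip names that segmented to nothing
            continue
        left.add(words[0])            # left boundary word: first word of the name
        right.add(words[-1])          # right boundary word: last word of the name
    return left, right


# Illustrative, already-segmented enterprise names (English glosses).
segmented_names = [
    ["Jiangsu", "Eurasian", "film", "Co., Ltd."],
    ["Nanjing", "steel", "group"],
]
left_dict, right_dict = build_boundary_dictionaries(segmented_names)
```

Sets are used so that membership checks in the later boundary judgment are constant time.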
Step 2-3, train prediction models for the left and right boundaries with the open-source libSVM. The features used to train the left boundary model are the word form and part of speech of the current word and of the two words that follow it; the features used to train the right boundary model are the word form and part of speech of the current word and of the two words that precede it. libSVM was chosen for its good robustness and well-behaved classification boundaries.
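The feature windows in step 2-3 can be sketched as follows. This only shows which words and part-of-speech tags feed each boundary model; the feature names, the padding symbol and the part-of-speech tags in the example are assumptions, and the resulting symbolic features would still need to be encoded numerically (for example one-hot) before being handed to an SVM trainer such as libSVM.

```python
PAD = ("<PAD>", "<PAD>")  # assumed placeholder for positions outside the sentence


def boundary_features(sentence, i, side):
    """Features for word i of a sentence given as (word, pos) pairs.

    side="left":  current word plus the two following words.
    side="right": current word plus the two preceding words.
    """
    if side == "left":
        offsets = (0, 1, 2)
    elif side == "right":
        offsets = (-2, -1, 0)
    else:
        raise ValueError("side must be 'left' or 'right'")
    feats = {}
    for off in offsets:
        j = i + off
        word, pos = sentence[j] if 0 <= j < len(sentence) else PAD
        feats[f"word[{off:+d}]"] = word   # the word form itself
        feats[f"pos[{off:+d}]"] = pos     # its part-of-speech tag
    return feats


# Illustrative sentence; the part-of-speech tags are made up for the example.
sentence = [("defendant", "n"), ("Jiangsu", "ns"), ("Eurasian", "nz"),
            ("film", "n"), ("Co., Ltd.", "n")]
left_feats = boundary_features(sentence, 1, "left")
right_feats = boundary_features(sentence, 4, "right")
```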
Step 2-4, take words from the training set one by one and use the boundary prediction models to judge whether each word is a left or right boundary word.
The current word is judged to be a left boundary word as follows: if the word appears in the left boundary dictionary, and some word within the two-word window to its right is judged a left boundary word by the SVM, it is accepted as a left boundary word; otherwise it is rejected. Every word has a judgment under both the dictionary method and the SVM method, but each method has shortcomings on its own, so this step of the embodiment combines the two results to reach a more reasonable decision.
The current word is judged to be a right boundary word as follows: if the word appears in the right boundary dictionary, and some word within the two-word window to its left is judged a right boundary word by the SVM, it is accepted as a right boundary word; otherwise it is rejected.
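The combined dictionary-plus-SVM rule of step 2-4 might look like the sketch below. The text does not say whether the two-word window includes the current word itself; this sketch assumes it does, and that assumption is marked in the code. The SVM predictions are passed in as a plain list of booleans rather than produced by a real model.

```python
def confirm_boundaries(words, svm_flags, dictionary, side, window=2):
    """Return one bool per word: True where dictionary and SVM evidence agree.

    svm_flags[i] is the SVM's boundary prediction for word i.
    side="left" looks at the word and the `window` words to its right;
    side="right" looks at the word and the `window` words to its left.
    ASSUMPTION: the window includes the current word itself.
    """
    n = len(words)
    confirmed = []
    for i, word in enumerate(words):
        if side == "left":
            lo, hi = i, min(n, i + window + 1)
        elif side == "right":
            lo, hi = max(0, i - window), i + 1
        else:
            raise ValueError("side must be 'left' or 'right'")
        confirmed.append(word in dictionary and any(svm_flags[lo:hi]))
    return confirmed


words = ["defendant", "Jiangsu", "Eurasian", "film", "Co., Ltd."]
left_marks = confirm_boundaries(
    words, [False, True, False, False, False], {"Jiangsu"}, "left")
right_marks = confirm_boundaries(
    words, [False, False, False, False, True], {"Co., Ltd."}, "right")
```

A word that is in the dictionary but has no SVM support in its window is rejected, which is how the two imperfect signals temper each other.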
Step 2-5, check whether all words have been traversed; if so go to step 2-7, otherwise go to step 2-6.
Step 2-6, increment counter i by 1 and take the next word from the text. In effect, the steps above decide for each word whether it is a left or right boundary word.
Step 2-7, feed the training set data into the open-source conditional random field tool CRF++ to train the enterprise entity recognition model, and output the trained model. The features in the training data are the word itself, its part-of-speech tag, its left/right boundary marks and its entity label.
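To make the training input of step 2-7 concrete, the sketch below serialises the four features into the column layout CRF++ reads: one token per line, whitespace-separated columns with the label last, and an empty line between sentences. The column order and the "L"/"R"/"N" boundary symbols are assumptions for illustration; CRF++ additionally needs a feature template file, which is not shown.

```python
def to_crfpp_rows(sentences):
    """Serialise sentences into CRF++-style training text.

    Each token is (word, pos, is_left_boundary, is_right_boundary, bio_tag).
    Words must not contain whitespace, because columns are whitespace-separated.
    """
    lines = []
    for sentence in sentences:
        for word, pos, is_left, is_right, tag in sentence:
            lines.append("\t".join([
                word,
                pos,
                "L" if is_left else "N",    # assumed symbol for left boundary mark
                "R" if is_right else "N",   # assumed symbol for right boundary mark
                tag,                        # label column goes last
            ]))
        lines.append("")                    # blank line ends the sentence
    return "\n".join(lines) + "\n"


training_text = to_crfpp_rows([[
    ("defendant", "n", False, False, "O"),
    ("Jiangsu", "ns", True, False, "B"),
    ("Co.Ltd", "n", False, True, "I"),
]])
```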
Step 2-8, end of enterprise entity recognition model training.
In short, the invention introduces entity boundary features into the existing conditional random field model: before the conditional random field model is applied, each word is judged to be a boundary word or not, and that judgment is then used as a feature of the model. Introducing entity boundary features strengthens the model's control over entity boundaries, which shows up as a marked improvement in recognition recall.
Fig. 4 shows the flow for constructing semantic vectors from the text data of the original training set.
Step 3-0, start of semantic vector construction for the training set text.
Step 3-1, input the training set texts that have undergone sentence splitting, word segmentation, part-of-speech tagging and category labeling.
Step 3-2, compute the word vectors of all words in the training set with the open-source word2vec tool. word2vec is a word vector tool open-sourced by Google; many such tools exist, word2vec being among the best known, and alternatives such as the Java tool word2vec4j could also be used.
Step 3-3, compute the inverse document frequency (IDF) value of every word in the training set with the following formula:
$$\mathrm{idf}(w) = \log \frac{|D|}{1 + |\{d \in D : w \in d\}|}$$
where, in the fraction inside the logarithm, the numerator $|D|$ is the total number of documents in the document collection and the denominator is the number of documents containing the word $w$ plus 1.
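A small Python sketch of the IDF computation in step 3-3 follows. It takes the description literally (total document count divided by document frequency plus 1, inside a logarithm) and assumes the natural logarithm, since the text does not state the base. The function name and toy documents are illustrative.

```python
import math


def inverse_document_frequency(documents):
    """Map each word to log(N / (1 + df)), where df is its document frequency."""
    n_docs = len(documents)
    doc_freq = {}
    for words in documents:
        for word in set(words):          # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    # Natural logarithm is an assumption; the source does not give the base.
    return {w: math.log(n_docs / (1 + df)) for w, df in doc_freq.items()}


documents = [
    ["court", "company"],
    ["company", "contract"],
    ["company", "loan"],
    ["loan", "bank"],
]
idf = inverse_document_frequency(documents)
```

With the plus-one in the denominator, a word that occurs in every document would receive a slightly negative value, so very common words contribute little or even negatively when used as weights.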
Step 3-4, starting from the first text, take each text in the training set in turn.
Step 3-5, use the enterprise entity recognition model to judge whether the current text contains an enterprise entity; if so go to step 3-6, otherwise go to step 3-10.
Step 3-6, once step 3-5 has found an enterprise entity in the text, compute the semantic vector of the entity part. Let the vector of an entity be $v_m$ and the vectors of the words that make it up be $w_1, w_2, \ldots, w_n$; $v_m$ is computed from $w_1, \ldots, w_n$ by the following formula:
[formula missing in the source text]
Step 3-7, after the entity's semantic vector has been computed in step 3-6, compute the semantic vector of the entity's context as follows:
[formula missing in the source text]
where $v(\mathrm{context})$ is the vector representation of the context, $\mathrm{tfidf}(w_i)$ is the TF-IDF value of word $w_i$, $v(w_i)$ is the word vector of $w_i$, and $k$ is the window size (the $k$ context words immediately preceding the central entity are taken). The TF value of a word is its frequency in the text, and its TF-IDF value is the product of its TF and IDF values.
Step 3-8, concatenate the entity and context semantic vectors obtained in steps 3-6 and 3-7: the k-dimensional entity vector and the k-dimensional context vector are joined, entity vector first and context vector second, to give a 2k-dimensional vector.
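Steps 3-6 to 3-8 can be sketched in Python as below. The exact formulas are not available in this text, so two assumptions are made and marked in the code: the entity vector is taken as the element-wise mean of its word vectors, and the context vector as the TF-IDF-weighted sum of the vectors of the k words preceding the entity. Only the concatenation order (entity first, context second) comes directly from the description.

```python
def entity_vector(word_vectors):
    """ASSUMPTION: combine the entity's word vectors by element-wise mean."""
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]


def context_vector(context_words, vectors, tfidf, k):
    """ASSUMPTION: TF-IDF-weighted sum over the k words before the entity (k >= 1)."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in context_words[-k:]:          # the k words closest to the entity
        if word not in vectors:              # skip out-of-vocabulary words
            continue
        weight = tfidf.get(word, 0.0)
        for d in range(dim):
            total[d] += weight * vectors[word][d]
    return total


def semantic_vector(entity_words, context_words, vectors, tfidf, k):
    """Concatenate entity vector (first) and context vector (second)."""
    ent = entity_vector([vectors[w] for w in entity_words])
    ctx = context_vector(context_words, vectors, tfidf, k)
    return ent + ctx                         # list concatenation: twice the dimension


# Toy 2-dimensional word vectors and TF-IDF weights, made up for illustration.
vectors = {"defendant": [1.0, 0.0], "Jiangsu": [0.0, 2.0], "film": [2.0, 0.0]}
tfidf = {"defendant": 0.5}
vec = semantic_vector(["Jiangsu", "film"], ["defendant"], vectors, tfidf, k=2)
```

Because both halves have the word-vector dimension, the result has twice that dimension, matching the concatenation described in step 3-8.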
Step 3-9, check whether all sentences in the training set text have been traversed; if so go to step 3-11, otherwise go to step 3-10.
Step 3-10, increment counter i by 1 and take the next sentence from the training set text.
Step 3-11, output the resulting entity vectors, which incorporate contextual semantics, as the training data of the enterprise entity classification model. Note that the data annotated in step 1-3 consists of plain text plus label; the preceding steps turn the text into vectors, so the data at this point consists of vector plus label.
Step 3-12, end of semantic vector construction for the training set text.
As shown in Fig. 5, after semantic vectors have been constructed for the original corpus (the data set obtained after step 1-3), the enterprise entity classification model is trained with the softmax multi-class algorithm. Softmax is a commonly used multi-class method; compared with other methods it is fast to compute and uses little memory, and it yields the probability of a test sample for every category.
Step 4-0, start of enterprise entity classification model training.
Step 4-1, feed the category-labeled training set data, after semantic vectorization, into the softmax classification model as the training input.
Step 4-2, train the multi-class model with the softmax algorithm and output the trained softmax model, which is used for subsequent category prediction.
Step 4-3, end of enterprise entity classification model training.
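The softmax training in steps 4-1 and 4-2 can be illustrated with a minimal softmax regression written in plain Python. This is a sketch under stated assumptions, not the embodiment's implementation: the optimiser (per-sample gradient descent), learning rate, epoch count and the toy two-dimensional vectors are all chosen for the example; real inputs would be the concatenated semantic vectors with industry labels.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities (shifted for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(x, weights, bias):
    return [sum(w * xi for w, xi in zip(wc, x)) + b for wc, b in zip(weights, bias)]


def train_softmax(samples, labels, n_classes, epochs=200, lr=0.5):
    """Fit weights and biases by per-sample gradient descent on cross-entropy."""
    dim = len(samples[0])
    weights = [[0.0] * dim for _ in range(n_classes)]
    bias = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            probs = softmax(class_scores(x, weights, bias))
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == y else 0.0)   # d(loss)/d(score_c)
                for d in range(dim):
                    weights[c][d] -= lr * grad * x[d]
                bias[c] -= lr * grad
    return weights, bias


def predict_proba(x, weights, bias):
    """Probability of x under each class."""
    return softmax(class_scores(x, weights, bias))


# Toy vectors standing in for semantic vectors; labels 0 and 1 are placeholders.
samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
weights, bias = train_softmax(samples, labels, n_classes=2)
p0 = predict_proba([0.95, 0.05], weights, bias)
p1 = predict_proba([0.05, 0.95], weights, bias)
```

`predict_proba` returns one probability per category, which is the property the description highlights as a reason for choosing softmax.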
Fig. 6 shows the flow for classifying with the enterprise entity classification model once it has been obtained.
Step 5-0, start of enterprise entity classification.
Step 5-1, input the text whose entity categories are to be predicted.
Step 5-2, use the enterprise entity recognition model to judge whether the input text contains an enterprise entity; if so go to step 5-3, otherwise go to step 5-5.
Step 5-3, build the entity semantic vector for the text containing an enterprise entity by following steps 3-1 to 3-12, then feed the vector into the trained enterprise entity classification model to obtain the classification result for the entity in the text.
Step 5-4, output the classification result of step 5-3.
Step 5-5, end of enterprise entity classification.
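The control flow of steps 5-1 to 5-4 can be summarised in a few lines of Python. The three components are passed in as functions, and the toy versions below are stand-ins invented purely so the flow can be run; they are not the trained recognition or classification models, and the label "manufacturing" is a placeholder.

```python
def classify_enterprise_entities(text, recognise, vectorise, classify):
    """Follow steps 5-1 to 5-4 for one input text.

    recognise(text)          -> list of entity strings  (hypothetical interface)
    vectorise(text, entity)  -> semantic vector         (hypothetical interface)
    classify(vector)         -> industry label          (hypothetical interface)
    """
    entities = recognise(text)
    if not entities:                 # step 5-2: no enterprise entity found
        return {}
    return {entity: classify(vectorise(text, entity)) for entity in entities}


# Stand-in components, invented for the example only.
def toy_recognise(text):
    return [name for name in ("Jiangsu Eurasian Film Co., Ltd.",) if name in text]


def toy_vectorise(text, entity):
    return [float(len(entity)), float(len(text))]


def toy_classify(vector):
    return "manufacturing" if vector[0] > 10 else "other"


result = classify_enterprise_entities(
    "defendant Jiangsu Eurasian Film Co., Ltd.",
    toy_recognise, toy_vectorise, toy_classify)
empty = classify_enterprise_entities(
    "no company here", toy_recognise, toy_vectorise, toy_classify)
```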
In summary, the invention uses word vectors and the TF-IDF values of document words to obtain a vector representation of enterprise entities that includes contextual semantics, and then classifies on that representation. This addresses the problem that current enterprise entity classification offers few types and lacks semantics, giving enterprise entity types finer granularity and stronger semantic character.
A person of ordinary skill in the art to which the invention belongs may make various modifications and variations without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims (10)

1. A plain-text-oriented enterprise entity classification method, characterized by comprising the following steps:
S1, annotating the enterprise entities in the collected plain-text data, and using the annotated data as the training set of the enterprise entity recognition module; labeling the enterprise entities in the collected plain-text data by industry type, and using the labeled data as the training sample set of the enterprise entity classification module;
S2, training an enterprise entity recognition model with a conditional random field model to obtain the enterprise entity recognition model;
S3, constructing semantic vectors from the text data of the original training set;
S4, training the enterprise entity classification model, using the category-labeled training set data after semantic vectorization as the training input;
S5, classifying the enterprise entities in the text to be predicted with the enterprise entity classification model.
2. The enterprise entity classification method of claim 1, characterized in that, in S1, the collected plain-text data undergoes sentence splitting, word segmentation and part-of-speech tagging, and the enterprise entities and industry categories in the plain-text data are annotated manually.
3. The enterprise entity classification method of claim 2, characterized in that the open-source word segmentation and part-of-speech tagging software HanLP is used to perform sentence splitting, word segmentation and part-of-speech tagging on the plain-text data.
4. The enterprise entity classification method of claim 2, characterized in that enterprise entities in the plain-text data are annotated in the "BIO" scheme, wherein the first word of an enterprise entity is labeled "B", the remaining words of the entity are labeled "I", and words unrelated to any enterprise entity are labeled "O".
5. The enterprise entity classification method of claim 2, characterized in that, in the manual annotation, each enterprise entity in the plain-text data is given a category label according to its industry type, judged from the context.
6. The enterprise entity classification method of claim 1, characterized in that, in S2, the enterprise entity recognition model is trained with a conditional random field model that introduces boundary features.
7. The enterprise entity classification method of claim 6, characterized in that the conditional random field model with boundary features involves: segmenting enterprise names with HanLP and compiling left and right boundary dictionaries from the result; training prediction models for the left and right boundaries with the open-source libSVM; taking words from the training set one by one and using the boundary prediction models to judge whether each word is a left or right boundary word; and feeding training set data comprising the word itself, its part-of-speech tag, its left/right boundary marks and its entity label into an open-source conditional random field tool to train and obtain the enterprise entity recognition model.
8. The enterprise entity classification method of claim 1, characterized in that, in S3, a word vector tool is used to obtain the word vectors of all words in the training sample set; the inverse document frequency (IDF) value of every word in the training sample set is computed; the word vectors and TF-IDF values are used to compute the vector of the enterprise entity and the vector of its context in each sentence containing an enterprise entity; and the entity vector and the context vector are concatenated to obtain an enterprise entity semantic vector that includes contextual semantics.
9. The enterprise entity classification method of claim 8, characterized in that the open-source word2vec tool is used to compute the word vectors of all words in the training set.
10. The enterprise entity classification method of claim 1, characterized in that, in S4, a softmax model is trained on the category-labeled training set data to obtain the enterprise entity classification model.
CN201710371464.7A 2017-05-24 2017-05-24 Pure text-oriented enterprise entity classification method Active CN107193959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710371464.7A CN107193959B (en) 2017-05-24 2017-05-24 Pure text-oriented enterprise entity classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710371464.7A CN107193959B (en) 2017-05-24 2017-05-24 Pure text-oriented enterprise entity classification method

Publications (2)

Publication Number Publication Date
CN107193959A true CN107193959A (en) 2017-09-22
CN107193959B CN107193959B (en) 2020-11-27

Family

ID=59874712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710371464.7A Active CN107193959B (en) 2017-05-24 2017-05-24 Pure text-oriented enterprise entity classification method

Country Status (1)

Country Link
CN (1) CN107193959B (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423264A (en) * 2017-07-10 2017-12-01 广东华联建设投资管理股份有限公司 A kind of engineering material borrowing-word extracting method
CN107894986A (en) * 2017-09-26 2018-04-10 北京纳人网络科技有限公司 A kind of business connection division methods, server and client based on vectorization
CN108255813A (en) * 2018-01-23 2018-07-06 重庆邮电大学 A kind of text matching technique based on term frequency-inverse document and CRF
CN108460014A (en) * 2018-02-07 2018-08-28 百度在线网络技术(北京)有限公司 Recognition methods, device, computer equipment and the storage medium of business entity
CN108733778A (en) * 2018-05-04 2018-11-02 百度在线网络技术(北京)有限公司 The industry type recognition methods of object and device
CN108763402A (en) * 2018-05-22 2018-11-06 广西师范大学 Class center vector Text Categorization Method based on dependence, part of speech and semantic dictionary
CN108763201A (en) * 2018-05-17 2018-11-06 南京大学 A kind of open field Chinese text name entity recognition method based on semi-supervised learning
CN109408827A (en) * 2018-11-07 2019-03-01 南京理工大学 A kind of software entity recognition methods based on machine learning
CN110083704A (en) * 2019-05-06 2019-08-02 重庆天蓬网络有限公司 A kind of company's information processing method, storage medium and equipment based on main business
CN110297913A (en) * 2019-06-12 2019-10-01 中电科大数据研究院有限公司 A kind of electronic government documents entity abstracting method
CN110472062A (en) * 2019-07-11 2019-11-19 新华三大数据技术有限公司 The method and device of identification name entity
CN110502638A (en) * 2019-08-30 2019-11-26 重庆誉存大数据科技有限公司 A kind of Company News classification of risks method based on target entity
CN110990587A (en) * 2019-12-04 2020-04-10 电子科技大学 Enterprise relation discovery method and system based on topic model
CN111209392A (en) * 2018-11-20 2020-05-29 百度在线网络技术(北京)有限公司 Method, device and equipment for excavating polluted enterprises
CN111539209A (en) * 2020-04-15 2020-08-14 北京百度网讯科技有限公司 Method and apparatus for entity classification
CN111881685A (en) * 2020-07-20 2020-11-03 南京中孚信息技术有限公司 Small-granularity strategy mixed model-based Chinese named entity identification method and system
CN112418681A (en) * 2020-11-26 2021-02-26 北京上奇数字科技有限公司 Method and apparatus for analyzing industrial development, electronic device, and storage medium
CN113065343A (en) * 2021-03-25 2021-07-02 天津大学 Enterprise research and development resource information modeling method based on semantics
CN113408273A (en) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 Entity recognition model training and entity recognition method and device
WO2021238337A1 (en) * 2020-05-29 2021-12-02 华为技术有限公司 Method and device for entity tagging
CN114036933A (en) * 2022-01-10 2022-02-11 湖南工商大学 Information extraction method based on legal documents
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160148116A1 (en) * 2014-11-21 2016-05-26 International Business Machines Corporation Extraction of semantic relations using distributional relation detection
CN104965992A (en) * 2015-07-13 2015-10-07 南开大学 Text mining method based on online medical question and answer information
CN105630768A (en) * 2015-12-23 2016-06-01 北京理工大学 Cascaded conditional random field-based product name recognition method and device
CN105787461A (en) * 2016-03-15 2016-07-20 浙江大学 Text-classification-and-condition-random-field-based adverse reaction entity identification method in traditional Chinese medicine literature
CN106503035A (en) * 2016-09-14 2017-03-15 海信集团有限公司 A kind of data processing method of knowledge mapping and device
CN106570179A (en) * 2016-11-10 2017-04-19 中国科学院信息工程研究所 Evaluative text-oriented kernel entity identification method and apparatus

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
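MUHAMMAD ASHRAF KHAN NIAZI et al.: "Signature automation of UMLS concepts: An un-supervised named entity recognition framework for classification of DNA and RNA in biological text", 2015 SCIENCE AND INFORMATION CONFERENCE (SAI) *
庄成龙 (Zhuang Chenglong): "Research on Chinese entity semantic relation extraction methods based on tree kernel functions", China Master's Theses Full-text Database, Information Science and Technology *
李芳 (Li Fang): "Research on two-stage Chinese microblog named entity recognition based on conditional random fields", China Master's Theses Full-text Database, Information Science and Technology *
王树伟 (Wang Shuwei): "Research on entity recognition and relation extraction for financial text", China Master's Theses Full-text Database, Information Science and Technology *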

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423264A (en) * 2017-07-10 2017-12-01 广东华联建设投资管理股份有限公司 A kind of engineering material borrowing-word extracting method
CN107894986A (en) * 2017-09-26 2018-04-10 北京纳人网络科技有限公司 A kind of business connection division methods, server and client based on vectorization
CN107894986B (en) * 2017-09-26 2021-03-30 北京纳人网络科技有限公司 Enterprise relation division method based on vectorization, server and client
CN108255813A (en) * 2018-01-23 2018-07-06 重庆邮电大学 A kind of text matching technique based on term frequency-inverse document and CRF
CN108255813B (en) * 2018-01-23 2021-11-16 重庆邮电大学 Text matching method based on word frequency-inverse document and CRF
CN108460014A (en) * 2018-02-07 2018-08-28 百度在线网络技术(北京)有限公司 Recognition methods, device, computer equipment and the storage medium of business entity
CN108460014B (en) * 2018-02-07 2022-02-25 百度在线网络技术(北京)有限公司 Enterprise entity identification method and device, computer equipment and storage medium
CN108733778A (en) * 2018-05-04 2018-11-02 百度在线网络技术(北京)有限公司 The industry type recognition methods of object and device
CN108733778B (en) * 2018-05-04 2022-05-17 百度在线网络技术(北京)有限公司 Industry type identification method and device of object
CN108763201B (en) * 2018-05-17 2021-07-23 南京大学 Method for identifying text named entities in open domain based on semi-supervised learning
CN108763201A (en) * 2018-05-17 2018-11-06 南京大学 A kind of open field Chinese text name entity recognition method based on semi-supervised learning
CN108763402A (en) * 2018-05-22 2018-11-06 广西师范大学 Class center vector Text Categorization Method based on dependence, part of speech and semantic dictionary
CN108763402B (en) * 2018-05-22 2021-08-27 广西师范大学 Class-centered vector text classification method based on dependency relationship, part of speech and semantic dictionary
CN109408827A (en) * 2018-11-07 2019-03-01 南京理工大学 A kind of software entity recognition methods based on machine learning
CN111209392A (en) * 2018-11-20 2020-05-29 百度在线网络技术(北京)有限公司 Method, device and equipment for excavating polluted enterprises
CN110083704A (en) * 2019-05-06 2019-08-02 重庆天蓬网络有限公司 A kind of company's information processing method, storage medium and equipment based on main business
CN110297913A (en) * 2019-06-12 2019-10-01 中电科大数据研究院有限公司 A kind of electronic government documents entity abstracting method
CN110472062A (en) * 2019-07-11 2019-11-19 新华三大数据技术有限公司 The method and device of identification name entity
CN110502638A (en) * 2019-08-30 2019-11-26 重庆誉存大数据科技有限公司 A kind of Company News classification of risks method based on target entity
CN110502638B (en) * 2019-08-30 2023-05-16 重庆誉存大数据科技有限公司 Enterprise news risk classification method based on target entity
CN110990587A (en) * 2019-12-04 2020-04-10 电子科技大学 Enterprise relation discovery method and system based on topic model
CN110990587B (en) * 2019-12-04 2023-04-18 电子科技大学 Enterprise relation discovery method and system based on topic model
CN111539209B (en) * 2020-04-15 2023-09-15 北京百度网讯科技有限公司 Method and apparatus for entity classification
CN111539209A (en) * 2020-04-15 2020-08-14 北京百度网讯科技有限公司 Method and apparatus for entity classification
CN113743117A (en) * 2020-05-29 2021-12-03 华为技术有限公司 Method and device for entity marking
WO2021238337A1 (en) * 2020-05-29 2021-12-02 华为技术有限公司 Method and device for entity tagging
CN113743117B (en) * 2020-05-29 2024-04-09 华为技术有限公司 Method and device for entity labeling
CN111881685A (en) * 2020-07-20 2020-11-03 南京中孚信息技术有限公司 Small-granularity strategy mixed model-based Chinese named entity identification method and system
CN112418681A (en) * 2020-11-26 2021-02-26 北京上奇数字科技有限公司 Method and apparatus for analyzing industrial development, electronic device, and storage medium
CN113065343A (en) * 2021-03-25 2021-07-02 天津大学 Enterprise research and development resource information modeling method based on semantics
CN113408273A (en) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 Entity recognition model training and entity recognition method and device
CN113408273B (en) * 2021-06-30 2022-08-23 北京百度网讯科技有限公司 Training method and device of text entity recognition model and text entity recognition method and device
CN114036933A (en) * 2022-01-10 2022-02-11 湖南工商大学 Information extraction method based on legal documents
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition

Also Published As

Publication number Publication date
CN107193959B (en) 2020-11-27

Similar Documents

Publication Publication Date Title
CN107193959A (en) A kind of business entity's sorting technique towards plain text
CN106919673B (en) Text mood analysis system based on deep learning
WO2019200806A1 (en) Device for generating text classification model, method, and computer readable storage medium
CN109685056B (en) Method and device for acquiring document information
CN106844349B (en) Comment spam recognition method based on coordinated training
US8170969B2 (en) Automated computation of semantic similarity of pairs of named entity phrases using electronic document corpora as background knowledge
CN110276054B (en) Insurance text structuring realization method
CN106776581A (en) Subjective texts sentiment analysis method based on deep learning
US7386544B2 (en) Database search system
CN110532563A (en) The detection method and device of crucial paragraph in text
CN102541838B (en) Method and equipment for optimizing emotional classifier
CN106649597A (en) Method for automatically establishing back-of-book indexes of book based on book contents
CN110688836A (en) Automatic domain dictionary construction method based on supervised learning
CN106933800A (en) A kind of event sentence abstracting method of financial field
CN112101027A (en) Chinese named entity recognition method based on reading understanding
CN106126502A (en) A kind of emotional semantic classification system and method based on support vector machine
CN112051986A (en) Code search recommendation device and method based on open source knowledge
CN110134799A (en) A kind of text corpus based on BM25 algorithm build and optimization method
US20230028664A1 (en) System and method for automatically tagging documents
CN115238040A (en) Steel material science knowledge graph construction method and system
Saravanan et al. Automatic identification of rhetorical roles using conditional random fields for legal document summarization
CN103473356B (en) Document-level emotion classifying method and device
CN104794209A (en) Chinese microblog sentiment classification method and system based on Markov logic network
CN112257442B (en) Policy document information extraction method based on corpus expansion neural network
CN111460147A (en) Title short text classification method based on semantic enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant