CN106649662A - Construction method of domain dictionary - Google Patents

Info

Publication number
CN106649662A
Authority
CN
China
Prior art keywords
text
word
domain
dictionary
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611149314.3A
Other languages
Chinese (zh)
Inventor
张晓霞
刘世林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Business Big Data Technology Co Ltd
Original Assignee
Chengdu Business Big Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Business Big Data Technology Co Ltd
Priority to CN201611149314.3A
Publication of CN106649662A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the field of natural language processing, and in particular to a method for constructing a domain dictionary. The method comprises the following steps: on the basis of automatically extracted text keywords, clustering the texts to be processed into different topic text sets; selecting a number of seed words by manual review from the domain text set of the dictionary to be constructed; analyzing how closely each clustered topic text set is related to the selected domain seed words, and retaining only the closely related topic text sets for expanding the domain dictionary; and, within the retained text sets, automatically expanding the domain dictionary with a word association algorithm to obtain the corresponding dictionary. With the method, the domain dictionary to be constructed can be expanded automatically from a small number of seed words once the domains of the text topics have been distinguished automatically. The dictionary is built efficiently and accurately and is well targeted to the domain. The method has broad application prospects in text analysis and natural language processing.

Description

A construction method of a domain dictionary
Technical field
The present invention relates to the field of natural language processing, and more particularly to a method for constructing a domain dictionary.
Background
With the rapid development of the internet, a large amount of publicly available web data has been generated, which has in turn given rise to various new industries based on big data technology, such as internet healthcare, internet education, and enterprise or personal credit reporting. The rise and growth of these internet industries depend on the analysis of large amounts of data. Natural language processing occupies an important place in big data analysis. Faced with massive network text resources, natural language processing methods can automatically and intelligently determine the sentiment orientation of a text or of its publisher, which is of great practical importance for both public opinion analysis and business surveys. These analysis results can be used to anticipate how events will develop, so that corresponding measures can be taken in advance to achieve a greater positive effect.
Sentiment analysis methods fall into two main classes: methods based on machine learning and methods based on dictionaries. A machine-learning method first builds a classifier and then feeds the text to be analyzed into the classifier. Its limitation is that building the classifier requires a large-scale corpus for training, and selecting classification features is challenging, since the quality of feature selection directly affects classifier performance. A dictionary-based method uses the words in a dictionary as features: the corresponding feature words are extracted by dictionary matching, and a preset model or algorithm is then applied to the extracted feature words to determine the orientation or nature of the text, which greatly increases the reliability of the analysis.
For targeted analysis and mining, sentiment analysis based on a sentiment dictionary must use dictionaries that differ greatly from one domain to another. Existing domain dictionaries lack applicability to specific problems and are not well targeted. When analyzing a specific domain or topic, a large, broad existing domain dictionary does not give good results. Building a targeted domain dictionary is therefore necessary, but manual construction is time-consuming and labor-intensive and cannot meet the demands of massive text analysis.
Summary of the invention
The object of the invention is to overcome the above deficiencies of the prior art by providing a domain dictionary construction method. On the basis of automatically extracted text keywords, the texts to be processed are clustered into text sets of different domains or topics. According to the needs of the analysis, a small number of seed words for the target domain are chosen, and the closeness of the relation between each clustered domain or topic text set and the chosen seed words is analyzed. Only the closely related domain or topic text sets are retained as the source for expanding the domain dictionary. The domain dictionary is then expanded automatically using a word association analysis algorithm, yielding the corresponding domain dictionary.
To achieve the above object, the invention provides the following technical solution: a domain dictionary construction method comprising the following steps:
(1) extracting the keywords of each text in the set of texts to be processed;
(2) clustering the texts to be processed to generate N topic text sets, where N is an integer and N ≥ 2;
(3) choosing seed words for the domain;
(4) counting the frequency with which the seed words occur in each topic text set, and retaining the topic text sets whose frequency exceeds a threshold as the source text sets for expanding the domain dictionary;
(5) calculating the degree of association between the seed words and each candidate word in the texts of the source text sets, and storing the candidate words whose degree of association reaches the set threshold in the dictionary to be expanded as domain words.
Specifically, the method includes preprocessing steps of word segmentation, high-frequency word removal, and stop word removal.
Further, in step (1) the keywords in a text are extracted using the following formula:
$$TR(v_i) = \frac{1-d}{N} + d \sum_{v_j \in relat\{v_i\}} \frac{TR(v_j)}{N(p_j)}$$
where TR(v_i) is the importance of word v_i in the text; d is the damping coefficient, usually set to 0.85; N is the number of all words in the undirected graph; relat{v_i} is the set of words that co-occur with v_i; v_j is any word in relat{v_i}; TR(v_j) is the importance of v_j; and N(p_j) is the number of words that co-occur with v_j.
Further, clustering the texts to be processed in step (2) comprises the following process:
(2-1) initially, each text to be processed forms a class of its own;
The distance between two classes is defined as the maximum of the distances between all pairs of texts drawn from the two classes. The distance between texts is calculated as follows:
$$C(t_1, t_2) = \frac{t_1 \cap t_2}{\mathrm{mid}(t_1, t_2)}$$
where C(t1, t2) is the distance between text 1 and text 2, t1 ∩ t2 is the number of keywords that text 1 and text 2 have in common, and mid(t1, t2) is the mean number of keywords in text 1 and text 2. The distance between classes is calculated as follows:
$$\mathrm{Dist}(c_a, c_b) = \max\{C(t_a, t_b),\ t_a \in c_a,\ t_b \in c_b\}$$
where Dist(c_a, c_b) is the distance between any two class clusters, c_a and c_b are the two classes, C(t_a, t_b) is the distance between two texts, and t_a and t_b are two texts with t_a ∈ c_a and t_b ∈ c_b;
(2-2) calculating the distance between every pair of classes, merging the two classes with the smallest distance, and naming the result c_new;
(2-3) deleting the merged original class clusters from the set of texts to be processed and adding the new class cluster c_new to the clustering result;
(2-4) repeating steps (2-1) to (2-3) and stopping when the set of texts to be processed contains only N class clusters. The set then contains the N topics formed by clustering, where the value of N is set according to the actual application.
As a preference, the degree of association between a candidate word and the seed words in step (5) is calculated as:
$$MI(word1, word2) = \log \frac{p(word1, word2)}{p(word1)\, p(word2)}$$
where p(word1, word2) is the probability that word1 and word2 occur together, and p(word1) and p(word2) are the probabilities that word1 and word2 occur, respectively.
As a preference, in step (2), N = 3.
As a preference, in step (3), the number of seed words chosen is 50 to 200.
Further, step (3) may instead be performed before step (1) or before step (2).
As a preference, in step (4), only the topic text set in which the seed words occur most frequently is retained as the source text set for dictionary expansion.
As a preference, in step (5) the threshold for the degree of association between a candidate word and the seed words is set to MI(word1, word2) = 0.2: when the degree of association between a word in the text set and a seed word is ≥ 0.2, the word is added to the dictionary under construction as an expansion word.
Compared with the prior art, the invention has the following beneficial effects. The invention provides a domain dictionary construction method in which, on the basis of automatically extracted text keywords, the texts to be processed are clustered into different topic text sets, and a certain number of domain seed words are chosen. The seed words are used to determine automatically how closely each clustered text set is related to the domain to be expanded. Once the domain type of the clustered texts has been identified automatically, only the closely related topic text sets are retained for expanding the domain dictionary. The dictionary is generated more accurately and built more efficiently.
The method chooses a set of seed words, and because the seed words can be chosen according to the specific direction of the analysis, the result is better targeted. On the basis of seed word selection and automatic domain discovery, the degree of association between the seed words and the words in the texts of the source text sets is calculated, and closely related words are retained as expansion words of the domain dictionary. Compared with a general domain dictionary, a domain dictionary built by this method is more flexible and more practical, and adapts to the text analysis of a specific problem or topic.
Description of the drawings:
Fig. 1 is a block diagram of the domain dictionary construction method.
Fig. 2 is a flow chart of step (5) of the domain dictionary construction method.
Detailed description of the embodiments
The invention is described in further detail below with reference to test examples and specific embodiments. This should not be understood as limiting the scope of the above subject matter of the invention to the following embodiments; all techniques realized on the basis of the content of the invention fall within the scope of the invention.
A domain dictionary construction method is provided. On the basis of automatically extracted text keywords, the texts to be processed are clustered into different topic text sets. A number of seed words are chosen by manual review from the domain text set of the dictionary to be built. The closeness of the relation between each clustered topic text set and the chosen domain seed words is then analyzed, and only the closely related topic text sets are retained for expanding the domain dictionary. The domain dictionary is then expanded automatically with an algorithm to obtain the corresponding domain dictionary. Once the domains of the text topics have been distinguished automatically, the method expands the desired domain dictionary automatically from a small number of seed words. The dictionary is built efficiently and accurately and is well targeted to the domain, and the method has broad application prospects in text analysis and natural language processing.
To achieve the above object, the invention provides the following technical solution: a domain dictionary construction method comprising the following steps, as shown in Fig. 1:
(1) extracting the keywords of each text in the set of texts to be processed;
(2) clustering the texts to be processed to form N topic text sets, where N is an integer and N ≥ 2;
(3) choosing a small number of domain seed words. Words with obvious domain features are chosen. Selecting the seed words manually makes the dictionary better targeted to a specific domain or problem and more flexible to apply.
(4) counting the frequency with which the seed words occur in each topic text set, and retaining the topic text sets in which the seed word frequency exceeds a threshold as the source text sets for expanding the domain dictionary. The texts to be processed are classified by clustering into text sets of different topics. Texts within the same topic are more closely related to one another, which prepares and screens the corpus for the subsequent dictionary expansion.
After the topic text sets have been formed by clustering, the frequency with which the seed words occur among the keywords of each topic text set is calculated, and from it the closeness of the relation between each topic and the domain of the dictionary under construction is analyzed. Text sets that are only distantly related are discarded, so that dictionary expansion is carried out only in topics close to the domain. This greatly improves the quality of the source corpus and markedly improves the accuracy of the expansion. Because expansion is carried out only in the text sets closest to the domain, the scope and amount of computation are reduced and the efficiency of dictionary expansion is improved.
(5) calculating the degree of association between the seed words and each word of the source text sets, and storing the words whose degree of association reaches the set threshold in the dictionary to be expanded as domain words.
Specifically, the method includes preprocessing steps of word segmentation, high-frequency word removal, and stop word removal.
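The preprocessing step can be illustrated with a minimal sketch. It assumes the texts have already been segmented into word lists (the patent does not name a segmenter), and it treats a word as "high-frequency" when it occurs in more than a given share of the texts. The function name `preprocess`, the parameter `max_doc_ratio`, and its default value are hypothetical choices, not taken from the patent.

```python
from collections import Counter

def preprocess(tokenized_texts, stop_words, max_doc_ratio=0.8):
    """Remove stop words and high-frequency words from segmented texts.

    Assumption: a word is "high-frequency" if it appears in more than
    max_doc_ratio of all texts; the patent does not define the cutoff.
    """
    n = len(tokenized_texts)
    # Count in how many texts each word appears (document frequency).
    doc_freq = Counter(w for text in tokenized_texts for w in set(text))
    too_common = {w for w, c in doc_freq.items() if c / n > max_doc_ratio}
    drop = set(stop_words) | too_common
    # Keep the original word order inside every text.
    return [[w for w in text if w not in drop] for text in tokenized_texts]
```

A document-frequency cutoff is only one possible reading of "high-frequency words"; a cutoff on raw corpus counts would fit the text equally well.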
Further, in step (1) the keywords in a text are extracted using the following formula:
$$TR(v_i) = \frac{1-d}{N} + d \sum_{v_j \in relat\{v_i\}} \frac{TR(v_j)}{N(p_j)}$$
where TR(v_i) is the importance of word v_i in the text. d is the damping coefficient, usually set to 0.85. N is the number of all words in the undirected graph (after word segmentation, the text is abstracted into an undirected graph in which each word of the text is a node). relat{v_i} is the set of words that co-occur with v_i. v_j is any word in relat{v_i}, TR(v_j) is the importance of v_j, and N(p_j) is the number of words that co-occur with v_j.
The formula is applied iteratively, and the words whose TR(v_i) exceeds a threshold are extracted as the keywords of the text. This automatic keyword extraction prepares for text clustering.
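The iterative keyword scoring described above can be sketched as follows. This is an illustration under assumptions: two words are taken to co-occur when they appear within `window` positions of each other, and the iteration runs for a fixed number of rounds. The patent specifies neither, and the names `textrank`, `window`, and `iterations` are hypothetical.

```python
from collections import defaultdict

def textrank(tokens, window=2, d=0.85, iterations=50):
    """Score words with TR(v_i) = (1 - d)/N + d * sum_j TR(v_j) / N(p_j)."""
    # Build the undirected co-occurrence graph: relat{v} for every word v.
    relat = defaultdict(set)
    for i, w in enumerate(tokens):
        for u in tokens[i + 1 : i + 1 + window]:
            if u != w:
                relat[w].add(u)
                relat[u].add(w)
    words = sorted(set(tokens))
    n = len(words)
    scores = {w: 1.0 / n for w in words}
    for _ in range(iterations):
        # len(relat[u]) is N(p_j): the number of words co-occurring with u.
        scores = {
            w: (1 - d) / n + d * sum(scores[u] / len(relat[u]) for u in relat[w])
            for w in words
        }
    return scores
```

Keywords would then be the words whose score exceeds a chosen threshold, for example `[w for w, s in textrank(tokens).items() if s > 0.25]`, where 0.25 is an arbitrary illustrative value.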
Further, clustering the texts to be processed in step (2) comprises the following process:
(2-1) initially, each text to be processed forms a class of its own;
The distance between two classes is defined as the maximum of the distances between all pairs of texts drawn from the two classes. The distance between texts is calculated as follows:
$$C(t_1, t_2) = \frac{t_1 \cap t_2}{\mathrm{mid}(t_1, t_2)}$$
where C(t1, t2) is the distance between text 1 and text 2, t1 ∩ t2 is the number of keywords that text 1 and text 2 have in common, and mid(t1, t2) is the mean number of keywords in text 1 and text 2. The distance between classes is calculated as follows:
$$\mathrm{Dist}(c_a, c_b) = \max\{C(t_a, t_b),\ t_a \in c_a,\ t_b \in c_b\}$$
where Dist(c_a, c_b) is the distance between any two class clusters, c_a and c_b are the two classes, C(t_a, t_b) is the distance between two texts, and t_a and t_b are two texts with t_a ∈ c_a and t_b ∈ c_b;
(2-2) calculating the distance between every pair of classes, merging the two classes with the smallest distance, and naming the result c_new;
(2-3) deleting the merged original class clusters from the set of texts to be processed and adding the new class cluster c_new to the clustering result;
(2-4) repeating steps (2-1) to (2-3) and stopping when the set of texts to be processed contains only N class clusters. The set then contains the N topics formed by clustering, where the value of N is set according to the actual application.
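Steps (2-1) to (2-4) amount to bottom-up clustering with a maximum-pairwise (complete-linkage) class distance, which can be sketched as below. One point is an assumption rather than the patent's wording: the quantity C as written grows with keyword overlap, so this sketch uses 1 - C as the distance so that "smallest distance" means "most similar". All function names are hypothetical.

```python
from itertools import combinations

def text_distance(k1, k2):
    """Assumed distance: 1 - (shared keywords / mean keyword count)."""
    mean_size = (len(k1) + len(k2)) / 2
    if mean_size == 0:
        return 1.0
    return 1.0 - len(k1 & k2) / mean_size

def cluster_distance(ca, cb, keywords):
    """Class distance: the maximum text distance over all cross-class pairs."""
    return max(text_distance(keywords[a], keywords[b]) for a in ca for b in cb)

def cluster_texts(keywords, n_clusters):
    """Merge the closest pair of classes until n_clusters classes remain.

    keywords: one keyword set per text. Returns lists of text indices.
    """
    clusters = [[i] for i in range(len(keywords))]  # (2-1) one class per text
    while len(clusters) > n_clusters:
        # (2-2) find the pair of classes with the smallest distance
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], keywords),
        )
        merged = clusters[a] + clusters[b]
        # (2-3) drop the two merged classes and add the new one
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return clusters
```

Recomputing every class distance in each round is cubic or worse in the number of texts; caching pairwise text distances would be the first optimization for a large corpus.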
As a preference, N = 3 in step (2-4): the set of texts to be processed is divided into only three topics, which simplifies subsequent calculation.
As a preference, in step (3) the number of domain seed words chosen is 50 to 200. Too few seed words reduces the accuracy of the domain dictionary expansion, while too many increases the labor and time cost of selection.
As a preference, in step (4) only the topic text set in which the seed words occur most frequently is retained as the source text set for dictionary expansion. This step selects, from the topic text sets, the one most closely related to the seed words, so that the corpus used for expansion better matches the characteristics of the domain, and the expanded dictionary is of higher quality and better targeted.
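The selection of source text sets in step (4) can be sketched as follows, covering both the threshold variant and the preferred variant that keeps only the top-ranked set. The frequency is assumed here to be the share of a topic set's keyword occurrences that are seed words; the patent does not state how the frequency is normalized, and the names `seed_frequency` and `select_source_sets` are hypothetical.

```python
def seed_frequency(topic_keywords, seeds):
    """Share of keyword occurrences in one topic text set that are seed words.

    topic_keywords: one keyword collection per text in the topic set.
    """
    total = sum(len(kws) for kws in topic_keywords)
    hits = sum(1 for kws in topic_keywords for k in kws if k in seeds)
    return hits / total if total else 0.0

def select_source_sets(topic_sets, seeds, threshold=None):
    """Return the indices of the topic text sets kept as expansion sources.

    With threshold=None only the set with the highest seed frequency is
    kept; otherwise every set whose frequency exceeds the threshold is kept.
    """
    freqs = [seed_frequency(t, seeds) for t in topic_sets]
    if threshold is None:
        return [max(range(len(freqs)), key=freqs.__getitem__)]
    return [i for i, f in enumerate(freqs) if f > threshold]
```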
As a preference, in step (5) the degree of association between a word and the seed words is calculated on the principle of mutual information, using the formula:
$$MI(word1, word2) = \log \frac{p(word1, word2)}{p(word1)\, p(word2)}$$
where p(word1, word2) is the probability that word1 and word2 occur together, and p(word1) and p(word2) are the probabilities that word1 and word2 occur, respectively. The mutual information algorithm is used to analyze the degree of association between words; it is simple, easy to implement, and computationally efficient. Mutual information is an analysis method from computational linguistics that measures the mutual dependence between two objects; in filtering problems it is used to measure how well a feature discriminates a topic. When building a domain dictionary, once the seed words have been chosen, mutual information is used to calculate the correlation between the candidate words and the existing seed words; a higher value indicates that the word is more strongly associated with the seed words.
As a preference, the threshold in step (5) is set to MI(word1, word2) = 0.2: when the degree of association between a candidate word in the text set and a seed word is ≥ 0.2, the word is added to the dictionary under construction as an expansion word. The calculation process of step (5) is shown in Fig. 2.
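The expansion in step (5) can be sketched with the mutual information score and the 0.2 threshold given above. Two choices are assumptions: probabilities are estimated as the share of source texts containing a word (or both words), and the logarithm is natural, since the patent fixes neither. The function name `expand_dictionary` is hypothetical.

```python
import math
from collections import Counter

def expand_dictionary(texts, seeds, threshold=0.2):
    """Return candidate words whose MI with any seed word reaches threshold.

    texts: one set of words per source text. Probabilities are estimated
    from text-level occurrence counts (an assumption, see the lead-in).
    """
    n = len(texts)
    doc_freq = Counter(w for text in texts for w in set(text))
    expanded = set()
    for seed in seeds:
        if doc_freq[seed] == 0:
            continue  # seed never occurs in the source texts
        for cand in doc_freq:
            if cand in seeds:
                continue
            joint = sum(1 for text in texts if seed in text and cand in text)
            if joint == 0:
                continue  # log(0) is undefined; the words never co-occur
            p_joint = joint / n
            p_seed = doc_freq[seed] / n
            p_cand = doc_freq[cand] / n
            mi = math.log(p_joint / (p_seed * p_cand))
            if mi >= threshold:
                expanded.add(cand)
    return expanded
```

With text-level estimates, rare words that co-occur once with a seed score high; a minimum co-occurrence count would be a natural extra filter, though the patent does not mention one.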

Claims (9)

1. A domain dictionary construction method, characterized by comprising the following steps:
(1) extracting the keywords of each text in the set of texts to be processed;
(2) clustering the texts to be processed to generate N topic text sets, where N is an integer and N ≥ 2;
(3) choosing seed words for the domain;
(4) counting the frequency with which the seed words occur in each topic text set, and retaining the topic text sets whose frequency exceeds a threshold as the source text sets for expanding the domain dictionary;
(5) calculating the degree of association between the seed words and each candidate word in the texts of the source text sets, and storing the candidate words whose degree of association reaches a threshold in the dictionary to be expanded as domain words.
2. The method of claim 1, characterized in that, before step (1), the method includes preprocessing steps of word segmentation, high-frequency word removal, and stop word removal.
3. The method of claim 1, wherein in step (1) the keywords are extracted using the following formula:
$$TR(v_i) = \frac{1-d}{N} + d \sum_{v_j \in relat\{v_i\}} \frac{TR(v_j)}{N(p_j)}$$
where TR(v_i) is the importance of word v_i in the text; d is the damping coefficient, usually set to 0.85; N is the number of all words in the undirected graph; relat{v_i} is the set of words that co-occur with v_i; v_j is any word in relat{v_i}; TR(v_j) is the importance of v_j; and N(p_j) is the number of words that co-occur with v_j.
4. The method of claim 3, characterized in that clustering the texts to be processed in step (2) comprises the following process:
(2-1) initially, each text to be processed forms a class of its own;
The distance between two classes is defined as the maximum of the distances between all pairs of texts drawn from the two classes. The distance between texts is calculated as follows:
$$C(t_1, t_2) = \frac{t_1 \cap t_2}{\mathrm{mid}(t_1, t_2)}$$
where C(t1, t2) is the distance between text 1 and text 2, t1 ∩ t2 is the number of keywords that text 1 and text 2 have in common, and mid(t1, t2) is the mean number of keywords in text 1 and text 2;
The distance between classes is calculated as follows:
$$\mathrm{Dist}(c_a, c_b) = \max\{C(t_a, t_b),\ t_a \in c_a,\ t_b \in c_b\}$$
where Dist(c_a, c_b) is the distance between any two class clusters, c_a and c_b are the two classes, C(t_a, t_b) is the distance between two texts, and t_a and t_b are two texts with t_a ∈ c_a and t_b ∈ c_b;
(2-2) calculating the distance between every pair of classes, merging the two classes with the smallest distance, and naming the result c_new;
(2-3) deleting the merged class clusters from the set of texts to be processed and adding the new class cluster c_new to the clustering result;
(2-4) repeating steps (2-1) to (2-3) and stopping when the set of texts to be processed contains only N class clusters.
5. The method of claim 4, characterized in that the degree of association between a candidate word and the seed words in step (5) is calculated as:
$$MI(word1, word2) = \log \frac{p(word1, word2)}{p(word1)\, p(word2)}$$
where p(word1, word2) is the probability that word1 and word2 occur together, and p(word1) and p(word2) are the probabilities that word1 and word2 occur, respectively.
6. The method of claim 5, characterized in that, in step (2), N = 3.
7. The method of claim 6, characterized in that, in step (3), the number of seed words chosen is 50 to 200.
8. The method of claim 7, characterized in that, in step (4), only the topic text set in which the seed words occur most frequently is retained as the source text set for dictionary expansion.
9. The method of claim 8, characterized in that, in step (5), the threshold for the degree of association between a candidate expansion word and the seed words is set to 0.2.
CN201611149314.3A 2016-12-13 2016-12-13 Construction method of domain dictionary Pending CN106649662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611149314.3A CN106649662A (en) 2016-12-13 2016-12-13 Construction method of domain dictionary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611149314.3A CN106649662A (en) 2016-12-13 2016-12-13 Construction method of domain dictionary

Publications (1)

Publication Number Publication Date
CN106649662A true CN106649662A (en) 2017-05-10

Family

ID=58825933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611149314.3A Pending CN106649662A (en) 2016-12-13 2016-12-13 Construction method of domain dictionary

Country Status (1)

Country Link
CN (1) CN106649662A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559174A (en) * 2013-09-30 2014-02-05 东软集团股份有限公司 Semantic emotion classification characteristic value extraction method and system
CN105893444A (en) * 2015-12-15 2016-08-24 乐视网信息技术(北京)股份有限公司 Sentiment classification method and apparatus

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
LIZHOU ZHENG: "Multi-dimensional Sentiment Analysis for Large-Scale E-commerce Reviews", 《INTERNATIONAL CONFERENCE ON DATABASE AND EXPERT SYSTEMS APPLICATIONS》 *
唐浩浩等: "基于词亲和度的微博词语语义倾向识别算法", 《数据采集与处理》 *
董丽丽等: "基于领域本体、情感词典的商品评论倾向性分析", 《计算机应用与软件》 *
蒋盛益等: "面向微博的社会情绪词典构建及情绪分析方法研究", 《中文信息学报》 *
赵军等: "一种改进的融合关联词典的微博倾向性分析方法", 《数据采集与处理》 *
顾益军: "融合LDA与TextRank的关键词抽取研究", 《现代图书情报技术》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107402909A (en) * 2017-06-16 2017-11-28 合肥龙图腾信息技术有限公司 A kind of encyclopaedia content input method and system
CN107992509A (en) * 2017-10-12 2018-05-04 如是科技(大连)有限公司 method and device for generating job dictionary information
CN107992509B (en) * 2017-10-12 2022-05-13 如是人力科技集团股份有限公司 Method and device for generating job dictionary information
CN108038101A (en) * 2017-12-07 2018-05-15 杭州迪普科技股份有限公司 A kind of recognition methods for distorting text and device
CN108038101B (en) * 2017-12-07 2021-04-27 杭州迪普科技股份有限公司 Method and device for identifying tampered text
CN110704638A (en) * 2019-09-30 2020-01-17 南京邮电大学 Clustering algorithm-based electric power text dictionary construction method
CN115080752A (en) * 2022-08-18 2022-09-20 湖南大学 Numerical value feature discovery method and system based on automatic acquisition of feature field knowledge
CN115080752B (en) * 2022-08-18 2022-12-02 湖南大学 Numerical value feature discovery method and system based on automatic acquisition of feature field knowledge
CN115270774A (en) * 2022-09-27 2022-11-01 吉奥时空信息技术股份有限公司 Big data keyword dictionary construction method for semi-supervised learning

Similar Documents

Publication Publication Date Title
CN106610955A (en) Dictionary-based multi-dimensional emotion analysis method
Rathi et al. Sentiment analysis of tweets using machine learning approach
CN106649662A (en) Construction method of domain dictionary
CN108052593B (en) Topic keyword extraction method based on topic word vector and network structure
Prusa et al. The effect of dataset size on training tweet sentiment classifiers
CN104102626B (en) A kind of method for short text Semantic Similarity Measurement
CN104750844B (en) Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device
CN102411563B (en) Method, device and system for identifying target words
CN107301171A (en) A kind of text emotion analysis method and system learnt based on sentiment dictionary
CN106681985A (en) Establishment system of multi-field dictionaries based on theme automatic matching
CN108021555A (en) A kind of Question sentence parsing measure based on depth convolutional neural networks
CN111160037A (en) Fine-grained emotion analysis method supporting cross-language migration
CN106682128A (en) Method for automatic establishment of multi-field dictionaries
CN109885675B (en) Text subtopic discovery method based on improved LDA
CN110851176B (en) Clone code detection method capable of automatically constructing and utilizing pseudo-clone corpus
CN103034627B (en) Calculate the method and apparatus of sentence similarity and the method and apparatus of machine translation
CN105740404A (en) Label association method and device
CN110825850B (en) Natural language theme classification method and device
CN106681986A (en) Multi-dimensional sentiment analysis system
CN108038106B (en) Fine-grained domain term self-learning method based on context semantics
CN106682089A (en) RNNs-based method for automatic safety checking of short message
CN105740382A (en) Aspect classification method for short comment texts
US8762300B2 (en) Method and system for document classification
CN107463715A (en) English social media account number classification method based on information gain
CN107967337A (en) A kind of cross-cutting sentiment analysis method semantic based on feeling polarities enhancing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (Application publication date: 20170510)