CN106776569A - Tourist hot spot and its Feature Extraction Method and system in mass text - Google Patents
- Publication number
- CN106776569A (application CN201611219439.9A)
- Authority
- CN
- China
- Prior art keywords
- hot topic
- topic word
- feature extraction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/289: Natural language analysis; recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
- G06F40/216: Natural language analysis; parsing using statistical methods
- G06F40/30: Semantic analysis
- G06Q50/14: ICT specially adapted for business processes of specific sectors; services; travel agencies
- G06F2216/03: Indexing scheme for additional aspects of information retrieval; data mining
Abstract
The present invention relates to the field of data mining and provides a method and system for extracting tourist hot spots and their features from mass text. The method comprises text preprocessing, hot topic word discovery, and hot topic feature extraction. Besides reducing computational complexity, the proposed technical scheme improves the correlation between local feature words and central topic words, and can effectively shield against interference from high-frequency words.
Description
Technical field
The invention belongs to the field of data mining, and more particularly relates to a method and system for extracting tourist hot spots and their features from mass text.
Background art
Topic mining from natural-language text has long been an active research direction in information retrieval, and related work is extensive. Almost all of it revolves around the basic component of text, the word, in view of both the differing (semantic) status of words within the same document and the sparseness and noise introduced by an author's choice of words. The first task of document processing is therefore to identify the words that are semantically most important, i.e. to extract feature words. On the basis of the feature words, further analysis and mining of document information, such as text classification and topic summarization, can be carried out.
In feature extraction, given the composition of text and the characteristics of natural language, researchers first attacked the problem from aspects such as the part of speech, syntactic features, and text patterns of candidate words. To improve accuracy, other intelligent algorithms have also been brought into text feature extraction, for example document frequency, mutual information, rough-set strategies, TF-IDF, information gain, the χ² statistic, and conditional random field models. Automating the feature extraction process over mass text data is an efficient approach. One proposal extracts product features automatically by combining synonym expansion with the PageRank algorithm; another, based on unsupervised learning, achieved good experimental results on product-review corpora in the electronics domain. Machine learning with accurately annotated seed words is also an effective way to carry out feature extraction.
In topic discovery, the basic idea of classical research is to start from the word level: first find a suitable measure to represent the relation between words, then introduce an intelligent algorithm to summarize topics. What such research considers first is the semantic relations between words, with emphasis on co-occurrence and word-frequency relations (such as TF-IDF and entropy), semantic similarity methods (e.g. clustering and classification algorithms), and their combinations; for example, extracting topic words with TF-IDF and a document growth-rate factor, building a word graph from the relations between the topic words, and finally recognizing topics from the connectivity of the graph. In these text computations, a document is usually represented with a vector space model (VSM), the words of the document forming the dimensions of the vector, so that each document is viewed as a vector in word space. However, representing documents as vectors discards the order in which words occur in a document, and the model in principle assumes statistical independence between words. Because of these two shortcomings, topic summarization over the extracted feature words is heavily influenced by word frequency and by the words' original semantics, and easily overlooks the information a word carries within a domain topic. Later, realizing that the occurrences of words in a document are not fully independent, researchers proposed topic models that consider the positional distribution of words. Such topic models are usually realized as probabilistic models and require prior information about the corpus, so when processing UGC documents, whose writing style is relatively unconstrained, they are easily disturbed by noise (synonyms, polysemous words, typos).
Summary of the invention
【Technical problem to be solved】
The object of the present invention is to provide a method and system for extracting tourist hot spots and their features from mass text, so as to effectively capture the hot topic words of a field and their local features from large-scale text data.
【Technical scheme】
The present invention is achieved through the following technical solutions.
The present invention first relates to a method for extracting tourist hot spots and their features from mass text, comprising the following steps:
A. Text preprocessing
Extract documents related to an information domain from the network, and preprocess their contents to form a data set.
B. Hot topic word discovery
Given the proprietary vocabulary of the information domain, mine the set of hot topic words of that domain from the data set.
C. Hot topic feature extraction
Perform vector cutting based on the hot topic word set; analyze the dependence between candidate feature words and hot topic words to obtain the local features of each hot topic word.
In a preferred embodiment, in step C the data set is cut using the hot topic words and the punctuation marks around them, yielding the cut sub-itemsets of all hot topic words.
In another preferred embodiment, in step C the dependence between candidate feature words and hot topic words is analyzed with the max-confidence measure.
In another preferred embodiment, the max-confidence threshold is 0.6 to 0.95.
In another preferred embodiment, in step A the documents related to the information domain are extracted from the network using crawler technology.
In another preferred embodiment, the preprocessing in step A at least includes splitting the text at full stops, question marks, and exclamation marks.
The invention further relates to a system for extracting tourist hot spots and their features from mass text, comprising:
a text preprocessing module configured to extract documents related to an information domain from the network and preprocess their contents to form a data set;
a hot topic word discovery module configured to mine, given the proprietary vocabulary of the information domain, the set of hot topic words of that domain from the data set;
a hot topic feature extraction module configured to perform vector cutting based on the hot topic word set, and to analyze the dependence between candidate feature words and hot topic words to obtain the local features of each hot topic word.
In a preferred embodiment, the hot topic feature extraction module is specifically configured to cut the data set using the hot topic words and the punctuation marks around them, obtaining the cut sub-itemsets of all hot topic words.
In another preferred embodiment, the hot topic feature extraction module is specifically configured to analyze the dependence between candidate feature words and hot topic words with the max-confidence measure.
In another preferred embodiment, the text preprocessing module is specifically configured to extract the domain-related documents from the network using crawler technology and to split the text at full stops, question marks, and exclamation marks.
The present invention is described in detail below.
The invention first expresses the semantic relations of the research object in three layers, "domain - topic - feature", defined as follows:
Domain: To keep the retrieved information focused and easy to understand, the UGC under study is restricted to a certain subject scope, whose subject matter is called the background domain, for example traffic UGC, touring UGC, or health UGC. The domain is the background against which documents are interpreted.
Topic: It is assumed that one or more topics exist in a UGC document, each expressed through corresponding topic words. The topic words corresponding to hot topics in UGC are called hot topic words.
Feature: When people discuss a certain hot spot (word) in a document, they like to cite specific feature words to describe it or to provide auxiliary information. A word that describes one aspect of a hot topic is called one of its feature words. If, within a domain, a feature word is used only to describe a specific hot-spot word, it is called a local feature word of that hot-spot word (as opposed to a global one). For example, if "Beijing" is a hot-spot word, then "the Great Wall" is one of its local features. The distinguishing trait of a local feature word in a document is that it semantically serves the hot-spot word, and its position in the document is typically near that word. In particular, in this work features are features of topics.
The method provided by the present invention mainly comprises three parts, text preprocessing, hot topic word discovery, and hot topic feature extraction, which are described in detail below.
(1) Document content preprocessing
The document content preprocessing subsystem mainly performs document extraction, word segmentation, and data cleaning. Document extraction mainly uses web crawler technology to obtain data from websites related to the information domain, forming the initial document data set. The texts in the data set are then segmented with a word segmentation tool, vectorizing each document. Data cleaning applies semantic judgment to the segmentation results and retains the following two kinds of output:
Semantic words (phrases): the present invention mainly seeks the hot topics in UGC and the local features related to them. Since these are made up of nouns or noun phrases, the semantic words retained here are the nouns in the segmentation results.
Punctuation marks: in text, the three punctuation marks full stop, question mark, and exclamation mark signal the end of a sentence. In the text processing of the invention, these punctuation marks are retained as boundaries between sentences and are uniformly represented by the symbol "||".
After content preprocessing, the document data set becomes a series of term vectors composed of semantic words and punctuation marks:
B = {b1, …, bi, …, b|B|}   (1)
The data set B is composed of the term vectors bi; every item of B can be written bi = {bi1, …, bij, …}, where the element bij is the j-th word of the i-th document in the data set.
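The preprocessing described above (keep only the nouns, replace sentence-ending punctuation with the unified "||" boundary) can be sketched as follows. A real pipeline would need a Chinese word segmenter and POS tagger such as jieba; here the (word, POS) pairs and the tag convention ("n…" for nouns) are assumed as given input, and the example words are purely illustrative.

```python
# Sentence-ending punctuation, ASCII and CJK full-width forms.
SENTENCE_ENDS = {".", "?", "!", "\u3002", "\uff1f", "\uff01"}

def preprocess(tagged_doc):
    """Turn one segmented document [(word, pos), ...] into a term vector b_i:
    a list of nouns interleaved with '||' sentence boundaries."""
    vec = []
    for word, pos in tagged_doc:
        if word in SENTENCE_ENDS:
            if vec and vec[-1] != "||":   # collapse consecutive boundaries
                vec.append("||")
        elif pos.startswith("n"):          # keep nouns / noun phrases only
            vec.append(word)
    if vec and vec[-1] == "||":            # drop a trailing boundary
        vec.pop()
    return vec

doc = [("Hong Kong", "ns"), ("is", "v"), ("beautiful", "a"), (".", "x"),
       ("Disneyland", "n"), ("ticket", "n"), ("!", "x")]
print(preprocess(doc))  # ['Hong Kong', '||', 'Disneyland', 'ticket']
```

Only the noun/boundary skeleton of each document survives, which is exactly the term-vector form of formula (1).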
(2) Hot topic word discovery
1. Domain term filtering
Given the proprietary vocabulary of an information domain (Domain Name Table, abbreviated DNT), its intersection with each document generates a new term vector:
b_i^G = b_i ∩ DNT   (2)
Clearly, the word elements of this new term vector all consist of domain information words. All the term vectors b_i^G generate a new data set:
B^G = {b_1^G, …, b_i^G, …, b_|B|^G}   (3)
All items in the data set B^G can be expressed as b_i^G = {b_i1, …, b_ij, …}, where the element b_ij denotes the j-th domain information word in the i-th document of B^G. Unlike conventional methods, the term vector b_i^G in the data set B^G represents not only all the domain-related words mentioned in the i-th document but, more importantly, the order in which those words occur in the document. Suppose the document data set B forms 5 term vectors (as shown in Table 1); then, given the local information vocabulary DNT = {A B C D}, the data set B^G of domain term vectors obtained by formula (2) is as shown in Table 2.
Table 1: the data set B
Table 2: the data set B^G
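The order-preserving intersection of formula (2) can be sketched as follows; the sample vectors are illustrative (they are not the actual Table 1 data, which is not reproduced here).

```python
def domain_vector(b_i, dnt):
    """b_i^G: the subsequence of b_i whose elements lie in the domain
    vocabulary DNT, with order and multiplicity preserved (formula (2))."""
    return [w for w in b_i if w in dnt]

B = [["a1", "A", "a2"],
     ["A", "a1", "a2"],
     ["A", "a1", "C", "c1", "A", "c2"],
     ["d1", "E"],
     ["a1", "A", "a2"]]
DNT = {"A", "B", "C", "D"}
BG = [domain_vector(b, DNT) for b in B]
print(BG)  # [['A'], ['A'], ['A', 'C', 'A'], [], ['A']]
```

Note that a plain set intersection would lose exactly the positional information the text says must be kept; the subsequence form retains it.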
2. Mining hot topic words in the domain
The n-frequent sets of an arbitrary data set T can be expressed as:
FP^(n)(T) = {X | supp(X) ≥ mini_supp, |X| = n}   (4)
where supp(X) denotes the support of the itemset X and mini_supp is a preset minimum support threshold. From a database point of view, each vector b_i^G in the data set B^G can be regarded as a transactional record (Transactional Data). Then the 1-frequent set FP^(1)(B^G) of B^G is exactly the set of single hot topic words we want to obtain. For the data in Table 2, if mini_supp = 60%, we obtain FP^(1)(B^G) = {A B C D} as the hot topic words of the domain.
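Mining the 1-frequent set of formula (4) can be sketched as follows: a word's support is the fraction of term vectors that contain it, and the words reaching mini_supp become hot topic words. The sample data is constructed so that A, B, C, and D all reach 60% support; it is illustrative only.

```python
from collections import Counter

def hot_topic_words(BG, mini_supp):
    """1-frequent set FP^(1)(B^G): words whose document support (fraction of
    term vectors containing them) reaches the mini_supp threshold."""
    n = len(BG)
    counts = Counter()
    for vec in BG:
        counts.update(set(vec))  # each vector contributes at most 1 per word
    return {w for w, c in counts.items() if c / n >= mini_supp}

BG = [["A", "B"], ["A", "C", "B"], ["A", "B", "C", "D"],
      ["B", "C", "D"], ["A", "D"]]
print(sorted(hot_topic_words(BG, 0.6)))  # ['A', 'B', 'C', 'D']
```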
(3) Extracting the local feature words of hot topic words
This part is introduced below from three aspects: vector cutting based on hot topic words, the dependence between frequently co-occurring words, and the local feature word extraction method.
1. Vector cutting based on hot topic words
As is well known, a document of the data set B contains a large number of words (with hot topic words interspersed among them). Traditional mining algorithms do not consider the semantic relations between word elements, and mining directly on B would yield a large number of noisy results. We have observed, however, the following writing habits of most ordinary users: if a topic is of general concern, the words expressing it gradually become "hot topic words"; and when writing a document, an author usually organizes the wording around a certain topic to express his or her ideas. This leads to a desirable consequence: feature words tend to be distributed around the "hot topic words" in a document. If adjacent sentences all describe themes related to a certain hot topic ("hot topic word"), those sentences form a topic region related to that word; when the topic changes, the adjacent sentences are also separated by the corresponding punctuation marks.
It can therefore be assumed that each "hot topic word" has associated content, that this content is distributed in the sentences adjacent to it, and that it forms a bounded topic region delimited by punctuation marks.
According to formula (2), the elements of b_i fall into two classes, hot topic words and other words (overlap being allowed). Hence b_i can be cut into several sub-itemsets (each itself a term vector) according to the hot topic words and the punctuation marks around them. Since the sentences of a document end with the corresponding punctuation marks, the start of the j-th sub-itemset of b_i should be the first punctuation mark before the hot topic word, and its end should be the first punctuation mark before the next hot topic word. The term vector b_i of document i can thus be divided as follows:
b_i = {…, ||, CUT(b_H1 | b_i), ||, CUT(b_H2 | b_i), ||, …}   (5)
where "||" denotes the position of a punctuation mark. After such a division, the content related to each hot topic word (including the local features of that word) is contained, with high probability, in the corresponding sub-itemset.
Given a hot topic word b_H ∈ FP^(1)(B^G), its cut vector within b_i is:
CUT(b_H | b_i) = {b_ijS, …, b_H, …, b_ijE}   (6)
Then the cut vector set of b_H over all vectors of the data set B can be expressed as:
CUT(b_H) = ∪_i {CUT(b_H | b_i)}, b_i ∈ B   (7)
The set of all elements of CUT(b_H) is {b_i1, …, b_ij, …}, where b_ij is the j-th word element of CUT(b_H | b_i). In effect, the cut vector CUT(b_H) can be regarded as consisting of the domain hot topic word b_H together with a series of words related to it: words with a high degree of co-occurrence (Co-occurrence) with it, words with a grammatical dependence on it, or both. All content in the data set B potentially related to b_H (its local feature words) is likely to lie within CUT(b_H). For the data in Table 1, with FP^(1)(B^G) = {A B C D} and mini_supp = 60%, the cut result for the hot-spot word "A" and its content is shown in Table 3 below; the cut results for B, C, and D are obtained similarly.
Table 3: CUT(A)

Term vector b_i | CUT(A | b_i)
b1              | {a1 A a2}
b2              | {A a1 a2}
b3              | {A a1 C c1 A c2}
b4              | {}
b5              | {a1 A a2}
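The cutting of formulas (6) and (7) can be sketched as follows, under one simplification: each term vector is split at its "||" boundaries and the segments that mention b_H are kept, whereas the patent's scheme also bounds a sub-itemset at the next hot topic word. The vectors are illustrative, so the result for b3 differs from Table 3 where the intervening C-segment is also retained.

```python
def cut(b_H, b_i):
    """Simplified CUT(b_H | b_i): split b_i at the '||' sentence boundaries
    and keep the segments that mention the hot topic word b_H."""
    out, seg = [], []
    for w in b_i + ["||"]:       # sentinel boundary flushes the last segment
        if w == "||":
            if b_H in seg:
                out.extend(seg)
            seg = []
        else:
            seg.append(w)
    return out

def cut_set(b_H, B):
    """CUT(b_H) over the whole data set: one cut vector per document."""
    return [cut(b_H, b) for b in B]

B = [["a1", "A", "a2"],
     ["A", "a1", "a2"],
     ["A", "a1", "||", "C", "c1", "||", "A", "c2"],
     ["d1", "D"],
     ["a1", "A", "a2"]]
print(cut_set("A", B))
# [['a1', 'A', 'a2'], ['A', 'a1', 'a2'], ['A', 'a1', 'A', 'c2'], [], ['a1', 'A', 'a2']]
```

A document with no occurrence of b_H contributes an empty cut vector, matching the {} entry for b4 in Table 3.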
2. Dependence between frequently co-occurring words
When the author of a document describes a certain hot topic b_H of the domain, its important local feature b is likely to be mentioned. In this situation, the occurrence frequency supp(b) of the feature b in the domain depends on the occurrence supp(b_H) of the hot topic word b_H in documents, and on the frequency supp({b_H, b}) with which the two are mentioned together. In other words, the occurrence of a local feature word related to a certain topic in the domain depends on the associated hot topic word.
Given an itemset X = {b_H, b}, the present invention measures the dependence between b_H and b with max-confidence (Max-confidence), defined as follows:
max_conf(X) = supp({b_H, b}) / max(supp(b_H), supp(b))   (8)
Given a threshold θ0 ∈ [0, 1], if a candidate word b ∈ CUT(b_H) satisfies the following condition, then b is a local feature of the hot topic word b_H:
max_conf({b_H, b}) ≥ θ0   (9)
3. Local feature word extraction method
For a given hot topic word b_H, the computation that extracts its potential features b is broadly divided into two steps: first obtain the hot-spot words and cut the term vectors (the first sub-step); then analyze the dependence between each candidate feature word b and b_H with the max-confidence measure (the second sub-step).
The first sub-step mainly has three stages: first compute the domain word set B^G referred to by the documents gathered in B; then compute the set of hot topic words over the whole domain; finally, cut the data set using the hot topic words and punctuation marks, obtaining all the cut sub-itemsets.
Then, the above results can be used to analyze the dependence between every item of a cut sub-itemset CUT(b_H) and the hot topic word b_H, yielding the local features of b_H (the second sub-step).
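The two sub-steps can be combined into one end-to-end sketch, under the same simplifying assumptions used throughout this section: documents arrive as pre-segmented term vectors with "||" boundaries, the DNT is an illustrative set, and cutting keeps the sentences that mention the topic word rather than the patent's full between-topic-words scheme.

```python
from collections import Counter

def extract_local_features(B, dnt, mini_supp, theta0):
    """Mine hot topic words from B via the DNT, then return, per topic word,
    the candidate words whose max-confidence with it reaches theta0."""
    n = len(B)
    # Sub-step 1: hot topic word mining over the DNT-filtered vectors.
    counts = Counter()
    for b in B:
        counts.update(set(b) & dnt)
    topics = {w for w, c in counts.items() if c / n >= mini_supp}
    result = {}
    for t in topics:
        # Cut each vector at '||' and keep the segments mentioning t.
        cuts = []
        for b in B:
            segs, seg = [], []
            for w in b + ["||"]:
                if w == "||":
                    segs.append(seg)
                    seg = []
                else:
                    seg.append(w)
            cuts.append([w for s in segs if t in s for w in s])
        # Sub-step 2: max-confidence filtering of candidate feature words.
        def supp(items):
            return sum(1 for c in cuts if items <= set(c)) / n
        feats = set()
        for cand in {w for c in cuts for w in c} - {t}:
            denom = max(supp({t}), supp({cand}))
            if denom and supp({t, cand}) / denom >= theta0:
                feats.add(cand)
        result[t] = feats
    return result

B = [["A", "ticket", "||", "hotel"], ["A", "ticket"],
     ["A", "line"], ["A", "ticket", "||", "park"]]
print(extract_local_features(B, {"A", "B"}, 0.6, 0.6))  # {'A': {'ticket'}}
```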
【Beneficial effects】
The technical scheme proposed by the present invention has the following beneficial effects.
Unlike prior-art methods, the feature words extracted by the present invention are not only related to individual documents but also take into account the semantic status of each word in the field, such as its degree of popularity and whether it can represent a topic. Furthermore, after determining a topic, the invention centers on the topic word and searches the documents of the whole domain for the distribution of surrounding vocabulary; the words closely related to the topic word (not necessarily domain terms) are recognized as the (local) feature words of that specific topic. Besides reducing computational complexity, this divide-and-conquer approach based on central topic words also improves the correlation between local feature words and central topic words, and can effectively shield against the interference of high-frequency words.
Brief description of the drawings
Fig. 1 is a schematic block diagram of the system for extracting tourist hot spots and their features from mass text provided by embodiment one of the present invention.
Fig. 2 shows the ranked hot tourism place names of Hong Kong in embodiment three of the present invention.
Fig. 3 shows the local features of the topic "Disneyland" in embodiment three (max-confidence threshold θ0 = 0.6).
Fig. 4 shows the local features of the topic "Disneyland" in embodiment three (max-confidence threshold θ0 = 0.8).
Fig. 5 shows the local features of the topic "Disneyland" in embodiment three (max-confidence threshold θ0 = 0.95).
Specific embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, specific embodiments of the invention are described below clearly and completely.
Embodiment one
Fig. 1 is a schematic block diagram of the system for extracting tourist hot spots and their features from mass text provided by embodiment one of the present invention. As shown in Fig. 1, the system comprises a text preprocessing module, a hot topic word discovery module, and a hot topic feature extraction module.
The text preprocessing module is configured to extract documents related to an information domain from the network and preprocess their contents to form a data set. Specifically, the text preprocessing module extracts the domain-related documents from the network using crawler technology and splits the text at full stops, question marks, and exclamation marks.
The hot topic word discovery module is configured to mine, given the proprietary vocabulary of the information domain, the set of hot topic words of that domain from the data set.
The hot topic feature extraction module is configured to perform vector cutting based on the hot topic word set, and to analyze the dependence between candidate feature words and hot topic words to obtain the local features of each hot topic word. Specifically, the module first cuts the data set using the hot topic words and the punctuation marks around them, obtaining the cut sub-itemsets of all hot topic words, and then analyzes the dependence between candidate feature words and hot topic words with the max-confidence measure.
For the method of extracting tourist hot spots and their features from mass text realized with the system of embodiment one, reference may be made to the following method embodiments.
Embodiment two
Embodiment two is a method for extracting tourist hot spots and their features from mass text, comprising the following steps:
(1) Text preprocessing
This step mainly includes document extraction, word segmentation, and data cleaning. Document extraction mainly uses web crawler technology to obtain data from websites related to the information domain, forming the initial document data set. The texts in the data set are then segmented with a word segmentation tool, vectorizing each document. Data cleaning applies semantic judgment to the segmentation results and retains the following two kinds of output:
Semantic words (phrases): this embodiment mainly seeks the hot topics in UGC and the local features related to them. Since these are made up of nouns or noun phrases, the semantic words retained here are the nouns in the segmentation results.
Punctuation marks: in text, the three punctuation marks full stop, question mark, and exclamation mark signal the end of a sentence. In the text processing of this embodiment, these punctuation marks are retained as boundaries between sentences and are uniformly represented by the symbol "||".
After content preprocessing, the document data set becomes a series of term vectors composed of semantic words and punctuation marks.
(2) Hot topic word discovery
Given the proprietary vocabulary of the information domain, the set of hot topic words of that domain is mined from the data set. Specifically, the intersection of the proprietary vocabulary with each document generates a new term vector, and these term vectors constitute a new data set from which the hot topic word set is obtained.
(3) Hot topic feature extraction
Vector cutting is performed based on the hot topic word set, and the dependence between candidate feature words and hot topic words is analyzed to obtain the local features of each hot topic word. Specifically, in this step the data set is first cut using the hot topic words and the punctuation marks around them, obtaining the cut sub-itemsets of all hot topic words; then the dependence between each candidate feature word and the hot topic word is analyzed with the max-confidence measure, the max-confidence of the candidate feature word being compared with the preset max-confidence threshold. If the max-confidence of a candidate feature word is greater than or equal to the preset threshold, the candidate feature word is a local feature of the hot topic word.
Embodiment three
Embodiment three is tourist hot spot and its Feature Extraction Method in a kind of mass text.Especially, embodiment three is real
The materialization of example two is applied, the method is comprised the following steps:
(1) Text pretreatment
Documents related to the information domain are extracted from the network, and their content is preprocessed to form a data set. Specifically, crawler technology can be used to extract the domain-related documents, after which the text is segmented at full stops, question marks and exclamation marks. In this embodiment, documents whose destination is "Hong Kong" were extracted from Mafengwo (www.mafengwo.com), a prominent domestic travel-information sharing website, and their content was preprocessed. The geographical term vocabulary DNT was drawn from the complete list of Hong Kong tourist-attraction names on the well-known travel site www.tripadvisor.com.
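The punctuation-based segmentation in this step might look like the following sketch. Splitting on both full-width (Chinese) and ASCII marks is an assumption, and the sample sentence is invented.

```python
import re

# Sketch of the pretreatment step: split crawled text into sentence units
# at full stops, question marks and exclamation marks.

def split_sentences(text):
    parts = re.split(r"[。？！.?!]", text)
    return [p.strip() for p in parts if p.strip()]

doc = "迪士尼乐园很好玩！门票不贵。你去过海洋公园吗？"
sentences = split_sentences(doc)
# three sentence units, one per punctuation-delimited span
```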
(2) Hot-topic word discovery
After data preprocessing, the frequency distribution of Hong Kong tourism nouns is shown in Fig. 2. The curve has two obvious turning points, at "Central" and "Lo Wu". Because only two hot words rank ahead of "Central", it provides too little information. Therefore the 15 nouns ranked by frequency ahead of "Lo Wu" were chosen as the hot-topic words of Hong Kong tourism.
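The frequency-based selection can be illustrated as follows. The token list and the cutoff k are invented for the example; in the embodiment, k = 15 is read off the turning point of the curve in Fig. 2 rather than computed.

```python
from collections import Counter

# Illustrative sketch of hot-topic word selection: rank domain nouns by
# frequency and keep the top k. All tokens are made up.

def top_hot_words(tokens, k):
    return [w for w, _ in Counter(tokens).most_common(k)]

tokens = ["Disneyland"] * 5 + ["Ocean Park"] * 4 + ["Central"] * 3 + ["Lo Wu"]
hot = top_hot_words(tokens, k=3)
# -> ["Disneyland", "Ocean Park", "Central"]
```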
(3) Hot-topic feature extraction
To demonstrate local-feature extraction, "Disneyland" was chosen from the 15 extracted hot tourism place names for the experiment. First, segmentation at punctuation marks yields the sub-itemset CUT("Disneyland") related to "Disneyland". Then the 1-frequent and 2-frequent itemsets within this sub-itemset are mined, and finally the features related to "Disneyland" are extracted; see Figs. 3 to 5.
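Mining the 1- and 2-frequent itemsets from such a sub-itemset could look like this minimal sketch; the transactions, words and support threshold are all illustrative.

```python
from collections import Counter
from itertools import combinations

# Sketch of mining 1- and 2-frequent itemsets from the segmented
# sub-itemsets CUT(t) of a topic word t.

def frequent_itemsets(transactions, min_support):
    c1, c2 = Counter(), Counter()
    for t in transactions:
        items = sorted(set(t))          # dedupe within a transaction
        c1.update(items)
        c2.update(combinations(items, 2))
    f1 = {i for i, n in c1.items() if n >= min_support}
    f2 = {p for p, n in c2.items() if n >= min_support}
    return f1, f2

cut = [["ticket", "parade"], ["ticket", "castle"], ["ticket", "parade"]]
f1, f2 = frequent_itemsets(cut, min_support=2)
# f1 == {"ticket", "parade"}; f2 == {("parade", "ticket")}
```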
As shown in Fig. 3, when the maximum-confidence threshold θ0 = 0.6, many features related to "Disneyland" can be extracted. "Ocean Park", another attraction that draws many tourists, and even the features of the scenic spots around it are mined out, because "Disneyland" and "Ocean Park" are often mentioned by travelers at the same time. Many features of "Disneyland" itself are also mined, such as "admission ticket" and the transport route "Tung Chung line". These features, strongly correlated with "Disneyland", can provide informational support for travelers drawing up their tour plans.
On the other hand, although a looser θ0 threshold can mine more local features, it also lets the dependences among the features become complicated. To obtain clearer dependences and the local features most related to "Disneyland", the threshold θ0 was raised to 0.8 and then to 0.95, at which point several popular attractions inside "Disneyland" are mined out (as shown in Figs. 4 and 5). In particular, at θ0 = 0.95 the mined features are almost exactly the core attractions of "Disneyland". Such information is of great help to potential users' travel planning and decision-making.
Contrast experiment
To verify the adaptability of the method provided in Embodiment Two, this section discusses experimental results on text data sets from different information domains. The experimental data come from the entertainment, sports, hotel-review, economics, computer and art domains. The sources of the data, listed in Table 4 below, are clearly diverse: the specialized journal articles (economics, computer and art) and the domain-news reports (entertainment and sports) are longer documents, whereas the hotel reviews are short texts generated by ordinary users' online commenting behavior.
Table 4 Description of the data sets used in the experiments
Next, the method of Embodiment Two, here named TVS (Term Vector Subdividing), is compared with the classical topic- and feature-extraction methods FP, TF-IDF and LDA. The four methods each extract the top 5, 10, 20, 50 and 80 hot words on the six data sets.
The average-accuracy results show that TVS extracts the features of hot words better than the other three methods across the different text domains, indicating that TVS is good at extracting feature words that depend strongly on the hot words.
In addition, on the tourism-blog data set the semantic differences among the features extracted by TF-IDF, LDA and TVS were compared, each method extracting the top 5 hot words; the results are given in Table 5.
Table 5 Comparison of the semantic content of the local features extracted by TF-IDF, LDA and TVS
The semantic content of the features extracted by the three methods shows that the feature-information granularity of TVS is moderate and that it extracts fewer overly general features, so it represents the local features of hot-topic words well. For example, for "Disneyland" it extracted "Sunny Bay, It's a Small World, Jones, Sleeping Beauty, Stitch", features that are exactly the distinctive local features of the hot-topic word "Disneyland".
The above embodiments and their verification experiments show that the feature words extracted by the embodiments of the present invention are not only document-related but also reflect a word's semantic status in the domain, such as its degree of popularity and whether it can represent a topic. Moreover, once a topic is determined, the invention takes the topic word as the center and searches all domain documents for the distribution of the surrounding vocabulary; words closely related to the topic word (not necessarily domain terms) are identified as the (local) feature words of that specific topic. This topic-word-centered divide-and-conquer approach not only reduces computational complexity but also improves the correlation between the local feature words and the central topic word, effectively shielding the interference of high-frequency words.
It should be appreciated that the embodiments described above are only some, not all, of the embodiments of the invention, and do not limit it. All other embodiments obtained by persons of ordinary skill in the art on the basis of the embodiments of the invention without creative effort fall within the protection scope of the invention.
Claims (10)
1. A method for extracting tourist hotspots and their features from mass text, characterized in that it comprises the following steps:
A. text pretreatment
extracting documents related to an information domain from the network, and preprocessing their content to form a data set;
B. hot-topic word discovery
mining the hot-topic word set of the information domain from the data set by means of the proprietary vocabulary of the given information domain;
C. hot-topic feature extraction
performing vector segmentation on the basis of the hot-topic word set, and analyzing the dependence between candidate feature words and the hot-topic words to obtain the local features of the hot-topic words.
2. The method for extracting tourist hotspots and their features from mass text according to claim 1, characterized in that in step C the data set is segmented using the hot-topic words and the punctuation marks surrounding them, yielding the segmented sub-itemsets of all hot-topic words.
3. The method for extracting tourist hotspots and their features from mass text according to claim 2, characterized in that in step C the dependence between candidate feature words and hot-topic words is analyzed with the maximum-confidence index.
4. The method for extracting tourist hotspots and their features from mass text according to claim 3, characterized in that the maximum-confidence threshold is 0.6 to 0.95.
5. The method for extracting tourist hotspots and their features from mass text according to claim 1, characterized in that in step A crawler technology is used to extract the documents related to the information domain from the network.
6. The method for extracting tourist hotspots and their features from mass text according to claim 1, characterized in that the pretreatment in step A at least includes segmenting the text at the full stops, question marks and exclamation marks in the text.
7. A system for extracting tourist hotspots and their features from mass text, characterized in that it comprises:
a text pretreatment module configured to extract documents related to an information domain from the network and to preprocess their content to form a data set;
a hot-topic word discovery module configured to mine the hot-topic word set of the information domain from the data set by means of the proprietary vocabulary of the given information domain; and
a hot-topic feature extraction module configured to perform vector segmentation on the basis of the hot-topic word set, and to analyze the dependence between candidate feature words and the hot-topic words to obtain the local features of the hot-topic words.
8. The system for extracting tourist hotspots and their features from mass text according to claim 7, characterized in that the hot-topic feature extraction module is specifically configured to segment the data set using the hot-topic words and the punctuation marks surrounding them, yielding the segmented sub-itemsets of all hot-topic words.
9. The system for extracting tourist hotspots and their features from mass text according to claim 7, characterized in that the hot-topic feature extraction module is specifically configured to analyze the dependence between candidate feature words and hot-topic words with the maximum-confidence index.
10. The system for extracting tourist hotspots and their features from mass text according to claim 7, characterized in that the text pretreatment module is specifically configured to extract the documents related to the information domain from the network using crawler technology, and to segment the text at the full stops, question marks and exclamation marks in the text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611219439.9A CN106776569A (en) | 2016-12-26 | 2016-12-26 | Tourist hot spot and its Feature Extraction Method and system in mass text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106776569A true CN106776569A (en) | 2017-05-31 |
Family
ID=58925202
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611219439.9A Pending CN106776569A (en) | 2016-12-26 | 2016-12-26 | Tourist hot spot and its Feature Extraction Method and system in mass text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106776569A (en) |
Non-Patent Citations (1)
Title |
---|
Xu Hualin, "Research on topic-feature relation extraction from domain UGC text and its applications", China Doctoral Dissertations Full-text Database, Information Science and Technology series |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112667884A (en) * | 2019-10-16 | 2021-04-16 | 财团法人工业技术研究院 | System and method for generating a ruled book |
CN112667884B (en) * | 2019-10-16 | 2023-11-28 | 财团法人工业技术研究院 | System and method for generating enterprise book |
CN111783438A (en) * | 2020-05-22 | 2020-10-16 | 贵州电网有限责任公司 | Hot word detection method for realizing work order analysis |
CN112819659A (en) * | 2021-02-09 | 2021-05-18 | 西南交通大学 | Tourist attraction development and evaluation method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111143479B (en) | Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm | |
CN106777274B (en) | A kind of Chinese tour field knowledge mapping construction method and system | |
CN110825721B (en) | Method for constructing and integrating hypertension knowledge base and system in big data environment | |
CN104391942B (en) | Short essay eigen extended method based on semantic collection of illustrative plates | |
CN103605729B (en) | A kind of method based on local random lexical density model POI Chinese Text Categorizations | |
CN106776562A (en) | A kind of keyword extracting method and extraction system | |
CN107066553A (en) | A kind of short text classification method based on convolutional neural networks and random forest | |
CN111190900B (en) | JSON data visualization optimization method in cloud computing mode | |
CN106372208B (en) | A kind of topic viewpoint clustering method based on statement similarity | |
CN102306204B (en) | Subject area identifying method based on weight of text structure | |
CN105528437B (en) | A kind of question answering system construction method extracted based on structured text knowledge | |
Zheng et al. | Template-independent news extraction based on visual consistency | |
CN107122413A (en) | A kind of keyword extracting method and device based on graph model | |
CN111177591A (en) | Knowledge graph-based Web data optimization method facing visualization demand | |
CN103593474B (en) | Image retrieval sort method based on deep learning | |
CN107391706A (en) | A kind of city tour's question answering system based on mobile Internet | |
CN109558492A (en) | A kind of listed company's knowledge mapping construction method and device suitable for event attribution | |
CN113553429A (en) | Normalized label system construction and text automatic labeling method | |
CN105654144B (en) | A kind of social network ontologies construction method based on machine learning | |
CN105677638B (en) | Web information abstracting method | |
CN106156287A (en) | Analyze public sentiment satisfaction method based on the scenic spot evaluating data of tourism demand template | |
Sadr et al. | Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms | |
CN110888991A (en) | Sectional semantic annotation method in weak annotation environment | |
CN112036178A (en) | Distribution network entity related semantic search method | |
CN106776569A (en) | Tourist hot spot and its Feature Extraction Method and system in mass text |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170531 |