CN106776569A - Tourism hot spots in massive text and method and system for extracting their features - Google Patents


Info

Publication number
CN106776569A
CN106776569A CN201611219439.9A
Authority
CN
China
Prior art keywords
hot topic
topic word
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611219439.9A
Other languages
Chinese (zh)
Inventor
袁华
钱宇
徐华林
印如意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201611219439.9A priority Critical patent/CN106776569A/en
Publication of CN106776569A publication Critical patent/CN106776569A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/14 Travel agencies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Tourism & Hospitality (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to the field of data mining, and provides a method and system for extracting tourism hot spots and their features from massive text. The method includes: text preprocessing; hot topic word discovery; and hot topic feature extraction. Besides reducing computational complexity, the proposed technical scheme also improves the correlation between local feature words and the central topic word, and can effectively shield the interference of high-frequency words.

Description

Tourism hot spots in massive text and method and system for extracting their features
Technical field
The invention belongs to the field of data mining, and in particular relates to a method and system for extracting tourism hot spots and their features from massive text.
Background technology
Topic mining in natural-language text has long been a hot research direction in the field of information retrieval, and a large body of related work exists. Almost all of this research is built around the basic component of text, the word. Different words have different (semantic) status within the same document, and authors write with sparseness and noise. The first task of document processing is therefore to identify the words that are semantically most important, that is, to extract feature words. On the basis of feature words, further analysis and mining of document information can be carried out, such as text categorization and topic summarization.
In feature extraction, considering the composition of text and the characteristics of natural language, researchers first approached the problem from aspects such as the part of speech, syntactic features, and text patterns of feature words. To improve the accuracy of the results, other intelligent algorithms have also been attached to text feature extraction, for example document frequency, mutual information, rough-set strategies, TF-IDF, information gain, the χ² statistic, and conditional random field models. Automating the feature extraction process on massive text data is an effective way to raise efficiency. Some researchers have proposed automatic product-feature extraction combining synonym expansion with the PageRank algorithm. Others have proposed an automatic product-feature extraction method based on unsupervised learning, which achieved good experimental results on product-review corpora in the electronics field. Machine learning with accurately labeled seed words is another effective approach to feature extraction.
In topic discovery, the basic idea of classical research is to start from the word level: first find a suitable measure to represent the relations between words, then introduce an intelligent algorithm to summarize topics. Such research first considers the semantic relations between words, with emphasis on co-occurrence and frequency relations (such as TF-IDF and entropy), semantic similarity (e.g., clustering and classification algorithms), and their combinations, for example extracting subject words using TF-IDF and a document growth-rate factor, building a word graph from the relations among subject words, and finally recognizing topics from graph connectivity. In these text computations, a document is generally represented by a vector space model, whose dimensions are the words of the document. In the vector space model, each document is treated as a vector in word space. However, representing documents as vectors loses the order in which words appear in a document; in addition, the model theoretically assumes statistical independence between words. These two shortcomings mean that when topics are summarized over the extracted feature words, the result is strongly influenced by word frequency and raw semantics, and a word's role within a field topic is easily ignored. Later, it was gradually recognized that the occurrences of words in a document are not fully independent, and topic models that consider the positional distribution relations between words were proposed. Such topic models are usually realized as probabilistic models and require prior information about the corpus, so when processing UGC documents with more casual writing patterns they are easily disturbed by noise (such as synonyms, polysemous words, and typos).
Summary of the invention
【The technical problem to be solved】
It is an object of the present invention to provide a method and system for extracting tourism hot spots and their features from massive text, so as to effectively capture the hot subject words of a field and their local features from large-scale text data.
【Technical scheme】
The present invention is achieved by the following technical solutions.
The present invention firstly relates to a method for extracting tourism hot spots and their features from massive text, comprising the following steps:
A. Text preprocessing
Extracting documents related to an information domain from the network, and preprocessing the document contents to form a data set;
B. Hot topic word discovery
Using a given proprietary vocabulary of the information domain, mining the set of hot topic words of that domain from the data set;
C. Hot topic feature extraction
Performing vector cutting based on the hot topic word set; analyzing the dependence relations between candidate feature words and hot topic words to obtain the local features of the hot topic words.
In a preferred embodiment, in step C the data set is cut using the hot topic words and the punctuation marks around them, yielding the cut sub-itemsets of all hot topic words.
In another preferred embodiment, in step C the dependence between candidate feature words and hot topic words is analyzed with the max-confidence measure.
In another preferred embodiment, the max-confidence threshold is 0.6 to 0.95.
In another preferred embodiment, in step A the documents related to the information domain are extracted from the network using crawler technology.
In another preferred embodiment, the preprocessing in step A at least includes splitting the text at full stops, question marks, and exclamation marks.
The invention further relates to a system for extracting tourism hot spots and their features from massive text, including:
a text preprocessing module configured to: extract documents related to an information domain from the network, and preprocess the document contents to form a data set;
a hot topic word discovery module configured to: using a given proprietary vocabulary of the information domain, mine the set of hot topic words of that domain from the data set;
a hot topic feature extraction module configured to: perform vector cutting based on the hot topic word set, and analyze the dependence relations between candidate feature words and hot topic words to obtain the local features of the hot topic words.
In a preferred embodiment, the hot topic feature extraction module is specifically configured to cut the data set using the hot topic words and the punctuation marks around them, yielding the cut sub-itemsets of all hot topic words.
In another preferred embodiment, the hot topic feature extraction module is specifically configured to analyze the dependence between candidate feature words and hot topic words with the max-confidence measure.
In another preferred embodiment, the text preprocessing module is specifically configured to extract documents related to the information domain from the network using crawler technology, and to split the text at full stops, question marks, and exclamation marks.
The present invention is described in detail below.
The present invention first expresses the semantic relations of the research object as three layers, "domain - topic - feature", defined as follows:
Domain: to keep the content of information retrieval focused and easy to understand, the UGC under study is restricted to the scope of a certain theme, whose subject matter is called a background domain, for example traffic UGC, travel UGC, or health UGC. The domain is the background against which documents are interpreted.
Topic: a UGC document is assumed to contain one or more topics, each expressed by corresponding topic words. The topic words corresponding to hot topics in UGC are called hot topic words.
Feature: when people discuss a certain hot spot (word) in a document, they like to mention specific feature words to describe it or to provide auxiliary information. A word describing one aspect of a hot topic is called its feature word. If a feature word is used within a domain solely to describe a specific hot word, it is called a local feature word of that hot word (as opposed to a global one). For example, if "Beijing" is a hot word, then "Great Wall" is one of its local features. The distinguishing property of a local feature word in a document is that it semantically serves the hot word, and its position in the document generally falls around the hot word. In this work in particular, features are features of topics.
The method provided by the present invention mainly includes three parts: text preprocessing; hot topic word discovery; and hot topic feature extraction. The three parts are described in detail below.
(1) Document content preprocessing
The document content preprocessing subsystem mainly performs document extraction, word segmentation, and data cleaning. Document extraction mainly uses web crawler technology to obtain data from websites related to the information domain, forming the initial document data set. Then a segmentation tool is applied to the texts in the data set, vectorizing each document. Data cleaning mainly performs semantic judgment on the segmentation results and retains the following two kinds of results:
Semantic words (phrases): the present invention mainly finds the hot topics in UGC and the local features related to them. Since these contents are composed of nouns or noun phrases, the semantic words retained here are the nouns in the segmentation results.
Punctuation marks: in text, the three punctuation marks full stop, question mark, and exclamation mark signal the end of a sentence. In the text processing of the invention, these punctuation marks are retained as boundaries between sentences and are uniformly represented by the symbol "||".
After content preprocessing, the document data set is transformed into a series of term vectors composed of semantic words and punctuation marks:
B = {b1, …, bi, …, b|B|},  formula (1)
The data set B is composed of term vectors bi; the set of all items appearing in B is the union of the bi, where element bij is the j-th word of the i-th document in the data set.
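The preprocessing step above can be sketched as follows. This is a minimal illustration that assumes the documents have already been segmented and part-of-speech tagged (e.g., by a Chinese segmenter whose noun tags start with "n"); the tokens and tags in the example are invented.

```python
# Keep nouns; collapse sentence-ending punctuation into the boundary '||'.
SENTENCE_ENDERS = {"。", "？", "！", ".", "?", "!"}

def to_term_vector(tagged_doc):
    """Build the term vector of formula (1) from (token, pos-tag) pairs."""
    vec = []
    for token, pos in tagged_doc:
        if token in SENTENCE_ENDERS:
            if vec and vec[-1] != "||":   # avoid consecutive boundaries
                vec.append("||")
        elif pos.startswith("n"):         # retain noun-like tokens only
            vec.append(token)
    return vec

doc = [("Disneyland", "n"), ("is", "v"), ("fun", "a"), ("!", "w"),
       ("tickets", "n"), ("cheap", "a"), (".", "w")]
print(to_term_vector(doc))   # ['Disneyland', '||', 'tickets', '||']
```

The resulting vectors of nouns and "||" boundaries are what the later cutting step operates on.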
(2) Hot topic word discovery
1. Domain-term filtering
Given the proprietary vocabulary of an information domain (Domain Name Table, abbreviated DNT), its intersection with each document generates a new term vector:
biG = bi ∩ DNT,  formula (2)
Clearly, every element of this new term vector is a domain information word. All of the term vectors biG produce a new data set:
BG = {b1G, …, biG, …},  formula (3)
All items in the data set BG can be expressed as the union of the biG, where element bij denotes the j-th domain information word element in the i-th document of BG. Unlike conventional methods, the term vector biG in the data set BG not only records all the domain information words mentioned in the i-th document but, more importantly, also preserves the order of positions in which these words occur in the document. Suppose the document data set B forms 5 term vectors (as shown in Table 1); then, given the domain vocabulary DNT = {A B C D}, the data set BG of domain term vectors obtained by formula (2) is as shown in Table 2.
Table 1: the data set B
Table 2: the data set BG
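Formula (2) can be sketched as below: each term vector is intersected with the DNT while preserving word order and repetition, which is the positional information the method relies on. Since the contents of Tables 1 and 2 are not reproduced here, the vectors are invented for the example.

```python
# Filter each term vector down to its domain information words (formula (2)),
# keeping order and duplicates.
DNT = {"A", "B", "C", "D"}

def domain_vector(term_vector, dnt):
    return [w for w in term_vector if w in dnt]

B = [["a1", "A", "a2", "B"],
     ["A", "a1", "a2", "C"],
     ["A", "a1", "C", "c1", "A", "c2"]]
BG = [domain_vector(b, DNT) for b in B]
print(BG)   # [['A', 'B'], ['A', 'C'], ['A', 'C', 'A']]
```

Note that the repeated "A" in the third vector survives, unlike in a plain set intersection.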
2. Mining hot topic words in the domain
The n-frequent sets in an arbitrary data set T can be expressed as:
FP(n)(T) = {X : supp(X) ≥ mini_supp, |X| = n},  formula (4)
where supp(X) denotes the support of itemset X and mini_supp is a preset minimum support threshold. From the database point of view, each vector biG in the data set BG can be regarded as a transaction (transactional data). Then the 1-frequent set FP(1)(BG) of BG is exactly the set of single hot topic words we need. For the data in Table 2, if mini_supp = 60%, we obtain FP(1)(BG) = {A B C D} as the hot topic words of the domain.
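Mining the 1-frequent set FP(1)(BG) of formula (4) can be sketched as follows. The transactions are invented for the example (Table 2's content is not reproduced above), and support is counted per transaction, so a word occurring twice in one vector counts once.

```python
# Support of a single word = fraction of transactions that contain it.
def support(word, transactions):
    return sum(word in t for t in transactions) / len(transactions)

def one_frequent_set(transactions, mini_supp):
    """FP(1): all words whose support reaches mini_supp (formula (4), n=1)."""
    vocab = set().union(*map(set, transactions))
    return {w for w in vocab if support(w, transactions) >= mini_supp}

BG = [["A", "B"], ["A", "C"], ["A", "C", "A"], ["B", "D"], ["A", "B", "D"]]
print(sorted(one_frequent_set(BG, 0.6)))   # ['A', 'B']
```

Here "A" (support 0.8) and "B" (support 0.6) pass the 60% threshold, while "C" and "D" (support 0.4) do not.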
(3) Extracting the local feature words of hot topic words
This part is introduced below from three aspects: vector cutting based on hot topic words, the dependence between frequently co-occurring words, and the local feature word extraction method.
1. Vector cutting based on hot topic words
As is well known, a document in the data set B contains a large number of words (with hot topic words interspersed among them). Traditional mining algorithms do not consider the semantic relations between word elements, and directly mining on B would produce a large amount of noisy results. We have found, however, that most ordinary users follow these writing habits: if a certain topic concerns everybody, the words expressing it gradually become "hot topic words"; and when writing a document, an author usually organizes words around a certain topic to express his or her own ideas. This yields a convenient consequence: the positions of feature words in a document tend to be distributed around the "hot topic words". If adjacent sentences all describe themes related to a certain hot topic ("hot topic word"), these sentences constitute a topic region related to that hot topic word; when the topic changes, the adjacent sentences are also separated by corresponding punctuation marks.
So it can be assumed that each "hot topic word" has associated content, that this content is distributed in positionally adjacent sentences, and that it forms a bounded subject region delimited by punctuation marks.
According to formula (2), the elements of bi can be divided into two classes, hot topic words and other words (overlap allowed). Therefore bi can be cut into multiple sub-itemsets according to the hot topic words and the punctuation marks around them (each sub-itemset is also a term vector). Since the sentences of a document end with corresponding punctuation marks, the start position of the j-th sub-itemset in bi should be the first punctuation mark before the hot topic word, and the end position should be the first punctuation mark before the next hot topic word. Thus the term vector bi of document i can be divided as follows:
bi = {…, ||, …, bH, …, ||, …},  formula (5)
where "||" marks the positions of the punctuation symbols. After such a division, it can be considered that the content related to a hot topic word bH (including the local features of that hot topic word contained in this content) is, with high probability, included in the divided sub-itemsets.
Given a hot topic word bH ∈ FP(1)(BG), let the cut vector of bH in bi be:
CUT(bH | bi) = {bijS, …, bH, …, bijE},  formula (6)
Then the set of cut vectors of the hot topic word bH over all vectors in the data set B can be expressed as:
CUT(bH) = ∪i {CUT(bH | bi)}, bi ∈ B,  formula (7)
The set of all elements of CUT(bH) is the union of its cut vectors, where bij is the j-th word element in CUT(bH | bi). In fact, the cut vector set CUT(bH) can be regarded as composed of the domain hot topic word bH together with a series of words related to it: words with a high degree of co-occurrence with bH, with a grammatical dependence on it, or both. In the data set B, all content potentially related to the hot topic word bH (its local feature words) is likely to lie within CUT(bH). For the data in Table 1, when FP(1)(BG) = {A B C D} and mini_supp = 60%, the cutting result for the hot word "A" and its content is shown in Table 3 below. The cutting results for B, C, and D can be obtained similarly.
Table 3: CUT(A)
Term vector bi    CUT(A | bi)
b1                {a1 A a2}
b2                {A a1 a2}
b3                {A a1 C c1 A c2}
b4                {}
b5                {a1 A a2}
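The cutting of formulas (5) to (7) can be sketched as below. As a simplification, the sub-itemsets of all occurrences of bH in one document are merged into a single cut vector running from the "||" boundary before the first occurrence to the boundary after the last; the sample vector is modeled on row b3 of Table 3.

```python
def cut(b_h, vec, sep="||"):
    """Cut vector CUT(b_H | b_i): the span of vec around all occurrences
    of b_h, bounded by the nearest '||' separators, with separators removed."""
    if b_h not in vec:
        return []
    first = vec.index(b_h)
    last = len(vec) - 1 - vec[::-1].index(b_h)
    start = 0
    for i in range(first - 1, -1, -1):       # boundary before first occurrence
        if vec[i] == sep:
            start = i + 1
            break
    end = len(vec)
    for i in range(last + 1, len(vec)):      # boundary after last occurrence
        if vec[i] == sep:
            end = i
            break
    return [w for w in vec[start:end] if w != sep]

b3 = ["x", "||", "A", "a1", "||", "C", "c1", "A", "c2", "||", "y"]
print(cut("A", b3))   # ['A', 'a1', 'C', 'c1', 'A', 'c2']
```

Documents that never mention the hot word yield the empty cut, matching row b4 of Table 3.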
2. Dependence between frequently co-occurring words
When the author of a document describes a certain hot topic bH in the domain, he or she may mention an important local feature b of it. In this situation, the frequency of occurrence supp(b) of the feature b in the domain depends on the occurrence supp(bH) of the hot topic word bH in documents, and on the frequency supp({bH, b}) with which they are mentioned together. In other words, the occurrence of a local feature word related to a certain topic in the domain depends on the associated hot topic word.
Given the itemset X = {bH, b}, the present invention measures the dependence between bH and b using max-confidence (Max-confidence), defined as follows:
max_conf(X) = supp({bH, b}) / min(supp(bH), supp(b)),  formula (8)
Given a threshold θ0 ∈ [0, 1], if a candidate b satisfies the following condition, then b is a local feature of the hot topic word bH:
max_conf({bH, b}) ≥ θ0,  formula (9)
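Formulas (8) and (9) can be sketched as follows: supports are computed over a set of transactions and the resulting max-confidence is compared against θ0. The transactions are invented for the example.

```python
def max_confidence(b_h, b, transactions):
    """max_conf({b_H, b}) = supp({b_H, b}) / min(supp(b_H), supp(b))."""
    n = len(transactions)
    s_h = sum(b_h in t for t in transactions) / n
    s_b = sum(b in t for t in transactions) / n
    s_hb = sum(b_h in t and b in t for t in transactions) / n
    if min(s_h, s_b) == 0:
        return 0.0
    return s_hb / min(s_h, s_b)

cuts = [["A", "a1"], ["A", "a2"], ["A", "a1"], ["B", "a1"]]
mc = max_confidence("A", "a1", cuts)
print(round(mc, 3))          # 0.667
print(mc >= 0.6, mc >= 0.8)  # True False: 'a1' qualifies at θ0 = 0.6, not at 0.8
```

Dividing by the smaller of the two supports makes the measure the larger of the two rule confidences, so a rare feature word that almost always accompanies the hot word still scores highly.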
3. Local feature word extraction method
For a localized hot topic word bH, the computation that extracts its potential features b falls largely into two steps: first, obtain the hot words and cut the term vectors (the first sub-step); then analyze the dependence between each candidate feature word b and bH using the max-confidence measure (the second sub-step).
The first sub-step has three stages: first compute the domain word set BG mentioned by the documents aggregated in B; then compute the hot topic word set of the whole domain; finally, cut the data set using the hot topic words and punctuation marks to obtain all cut sub-itemsets.
Then the above results can be used to analyze the dependence between all items in a cut sub-itemset CUT(bH) and the hot topic word bH, yielding the local features of bH (the second sub-step).
【Beneficial effects】
The technical scheme proposed by the present invention has the following advantages:
Unlike prior-art methods, the feature words extracted by the present invention are not only related to the documents but also reflect a word's semantic status in the field, such as its degree of popularity and whether it can represent a topic. In addition, after determining a topic, the present invention takes the topic word as the center and examines the distribution of surrounding words across all documents of the domain; words closely related to the topic word (not necessarily domain terms) are recognized as (local) feature words of the specific topic. This divide-and-conquer approach centered on topic words not only reduces computational complexity but also improves the correlation between local feature words and the central topic word, and can effectively shield the interference of high-frequency words.
Brief description of the drawings
Fig. 1 is a schematic block diagram of the system for extracting tourism hot spots and their features from massive text provided by Embodiment 1 of the present invention.
Fig. 2 shows the ranked frequencies of hot tourism place names in Hong Kong in Embodiment 3 of the present invention.
Fig. 3 shows the local features of the topic "Disneyland" in Embodiment 3 of the present invention (max-confidence threshold θ0 = 0.6).
Fig. 4 shows the local features of the topic "Disneyland" in Embodiment 3 of the present invention (max-confidence threshold θ0 = 0.8).
Fig. 5 shows the local features of the topic "Disneyland" in Embodiment 3 of the present invention (max-confidence threshold θ0 = 0.95).
Specific embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, specific embodiments of the invention are described clearly and completely below.
Embodiment one
Fig. 1 is a schematic block diagram of the system for extracting tourism hot spots and their features from massive text provided by Embodiment 1 of the present invention. As shown in Fig. 1, the system includes a text preprocessing module, a hot topic word discovery module, and a hot topic feature extraction module.
The text preprocessing module is configured to extract documents related to an information domain from the network and preprocess their contents to form a data set. Specifically, the text preprocessing module extracts the domain-related documents from the network using crawler technology and splits the text at full stops, question marks, and exclamation marks.
The hot topic word discovery module is configured to mine the hot topic word set of the information domain from the data set using a given proprietary vocabulary of that domain.
The hot topic feature extraction module is configured to perform vector cutting based on the hot topic word set, and to analyze the dependence between candidate feature words and hot topic words to obtain the local features of the hot topic words. Specifically, the hot topic feature extraction module first cuts the data set using the hot topic words and the punctuation marks around them, obtaining the cut sub-itemsets of all hot topic words, and then analyzes the dependence between candidate feature words and hot topic words with the max-confidence measure.
For the method for extracting tourism hot spots and their features from massive text realized by the system of Embodiment 1, reference may be made to the following method embodiments.
Embodiment two
Embodiment 2 is a method for extracting tourism hot spots and their features from massive text, comprising the following steps:
(1) Text preprocessing
This step mainly includes document extraction, word segmentation, and data cleaning. Document extraction mainly uses web crawler technology to obtain data from websites related to the information domain, forming the initial document data set. Then a segmentation tool is applied to the texts in the data set, vectorizing each document. Data cleaning mainly performs semantic judgment on the segmentation results and retains the following two kinds of results:
Semantic words (phrases): this embodiment mainly finds the hot topics in UGC and the local features related to them. Since these contents are composed of nouns or noun phrases, the semantic words retained here are the nouns in the segmentation results.
Punctuation marks: in text, the three punctuation marks full stop, question mark, and exclamation mark signal the end of a sentence. In the text processing of this embodiment, these punctuation marks are retained as boundaries between sentences and are uniformly represented by the symbol "||".
After content preprocessing, the document data set is transformed into a series of term vectors composed of semantic words and punctuation marks.
(2) Hot topic word discovery
Using a given proprietary vocabulary of the information domain, the hot topic word set of that domain is mined from the data set. Specifically, the intersection of the proprietary vocabulary with each document generates a new term vector; these term vectors constitute a new data set, from which the hot topic word set is obtained.
(3) Hot topic feature extraction
Vector cutting is performed based on the hot topic word set; the dependence between candidate feature words and hot topic words is analyzed to obtain the local features of the hot topic words. Specifically, in this step the data set is first cut using the hot topic words and the punctuation marks around them, obtaining the cut sub-itemsets of all hot topic words; then the dependence between candidate feature words and hot topic words is analyzed with the max-confidence measure: the max-confidence of each candidate feature word is compared with a preset max-confidence threshold, and if it is greater than or equal to the threshold, the candidate feature word is a local feature of the hot topic word.
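The three steps of this embodiment can be tied together in a compact end-to-end sketch. It is an illustration under simplifying assumptions, not the claimed implementation: a document's cut for a hot topic word is taken as the union of its "||"-delimited segments containing that word, supports are counted over whole documents, and all data are invented.

```python
def segments(vec, sep="||"):
    """Split a term vector into sentence segments at the '||' boundaries."""
    out, seg = [], []
    for w in vec + [sep]:
        if w == sep:
            if seg:
                out.append(seg)
            seg = []
        else:
            seg.append(w)
    return out

def tvs(B, dnt, mini_supp, theta0):
    """Toy pipeline: hot topic words by support, local features by max-confidence."""
    n = len(B)
    supp = lambda w: sum(w in b for b in B) / n
    hot = {w for w in dnt if supp(w) >= mini_supp}          # step (2)
    features = {}
    for b_h in hot:                                         # step (3)
        cand = {w for b in B for s in segments(b) if b_h in s
                for w in s if w != b_h and w not in dnt}
        feats = set()
        for w in cand:
            s_hw = sum(b_h in b and w in b for b in B) / n
            if s_hw / min(supp(b_h), supp(w)) >= theta0:
                feats.add(w)
        features[b_h] = feats
    return hot, features

B = [["a1", "A", "||", "B", "b1"],
     ["A", "a1", "||", "x"],
     ["A", "a1", "c1"],
     ["B", "b1", "||", "a1", "c1"]]
hot, features = tvs(B, dnt={"A", "B", "C", "D"}, mini_supp=0.6, theta0=0.8)
print(hot, features)   # {'A'} {'A': {'a1'}}
```

In the toy run, "A" is the only hot word (support 0.75 ≥ 0.6); "a1" always appears near "A" and passes θ0 = 0.8, while "c1" co-occurs with "A" only once and is rejected.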
Embodiment three
Embodiment 3 is a method for extracting tourism hot spots and their features from massive text. In particular, Embodiment 3 is a concrete instantiation of Embodiment 2, comprising the following steps:
(1) Text preprocessing
Documents related to the information domain are extracted from the network, and their contents are preprocessed to form a data set. Specifically, crawler technology can be used to extract the domain-related documents from the network, after which the text is split at full stops, question marks, and exclamation marks. In this step, documents whose destination is "Hong Kong" were extracted from the prominent domestic travel information sharing website Mafengwo (www.mafengwo.com) and their contents preprocessed. The geographical term vocabulary DNT was drawn from the full list of Hong Kong tourist attraction names on the well-known travel site www.tripadvisor.com.
(2) Hot topic word discovery
After data preprocessing, the frequency distribution of Hong Kong tourism nouns is shown in Fig. 2. The curve has two obvious turning points, at "Central" and "Lo Wu". Since there are only two hot words before "Central", they provide too little information; therefore the nouns ranked before "Lo Wu", i.e. the top 15 by frequency, were chosen as the hot topic words of Hong Kong tourism.
(3) much-talked-about topic feature extraction
The step have chosen " Disneyland " to demonstrate local feature extraction in the 15 focuses tourism place name for extracting Experiment.First by punctuation mark, cutting obtains the Son item set CUT (" Disneyland ") related to " Disneyland ".So The 1- Frequent Set and 2- Frequent Set in Son item set are excavated afterwards, the feature related to " Disneyland " is finally extracted, and see Fig. 3 To Fig. 5.
As shown in figure 3, working as maximum confidence threshold θ0When=0.6, can extract more related to " Disneyland " Feature.And " ocean park " and its feature that can see even its periphery beauty spot that attract more tourists all are mined out.That be because For " Disneyland " and " ocean park " is often mentioned by traveller simultaneously.And also excavate many and " Disneyland " Related feature itself.Such as " admission ticket " and traffic route " gushing line in east ".These can with the feature of " Disneyland " strong correlation Informational support is provided with for traveller formulates its tour plan.
On the other hand, although a looser θ0 threshold can mine more local features, it also makes the dependencies between features more complex. To obtain clearer dependencies and the local features most correlated with "Disneyland", the threshold θ0 was raised to 0.8 and 0.95 respectively, and some popular attractions inside "Disneyland" were mined out (as shown in Fig. 4 and Fig. 5). In particular, at θ0 = 0.95, the mined features are almost exactly the core attractions of "Disneyland". Such information is very helpful for potential users' travel planning and decision making.
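A sketch of the threshold filtering, assuming "maximum confidence" is the standard association measure max(P(word|topic), P(topic|word)) = count(pair) / min(count(topic), count(word)); the counts below are hypothetical:

```python
def max_confidence(n_pair, n_topic, n_word):
    # max(P(word|topic), P(topic|word)) for the pair {topic, word}.
    return n_pair / min(n_topic, n_word)

def local_features(pair_counts, word_counts, topic, theta0):
    # Keep candidate words whose maximum confidence with the topic
    # word reaches the threshold theta0 (0.6-0.95 in the experiments).
    return sorted(
        w for w, n in pair_counts.items()
        if max_confidence(n, word_counts[topic], word_counts[w]) >= theta0
    )

# Hypothetical occurrence and co-occurrence counts.
word_counts = {"Disneyland": 100, "ticket": 40, "Ocean Park": 50, "bus": 80}
pair_counts = {"ticket": 30, "Ocean Park": 35, "bus": 20}
print(local_features(pair_counts, word_counts, "Disneyland", 0.6))
```

Raising theta0 shrinks the feature set toward the words most tightly bound to the topic, mirroring the behaviour reported for 0.6, 0.8 and 0.95.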
Comparative experiment
To verify the adaptability of the method provided in Embodiment 2, this part discusses the experimental results on text data sets from different information domains. The experimental data come from the entertainment, sports, hotel-review, economics, computer and art domains. The data sources are shown in Table 4 below; it can be seen that they are diverse. The specialised journal articles (economics, computer and art) and domain news reports (entertainment and sports) are all longer documents, while the hotel reviews are short texts generated by ordinary users' online commenting behaviour.
Table 4. Description of the data sets used in the experiments
Next, the method of Embodiment 2, named TVS (Term Vector Subdividing), is compared with the classical topic and feature extraction methods FP, TF-IDF and LDA. The four methods each extract the top 5, 10, 20, 50 and 80 hot words on the six classes of data sets.
The average accuracy results show that TVS outperforms the other three methods at extracting features of hot words across the different text domains, indicating that TVS is good at extracting feature words that clearly depend on the hot word.
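The patent does not spell out the accuracy metric; a common choice for this kind of comparison, sketched here as an assumption, is precision at the cut-offs 5, 10, 20, 50 and 80 against a human-labelled gold set, averaged over the cut-offs:

```python
def precision_at_k(extracted, gold, k):
    # Fraction of the top-k extracted words found in the gold set.
    return sum(1 for w in extracted[:k] if w in gold) / k

def average_accuracy(extracted, gold, ks=(5, 10, 20, 50, 80)):
    # Average the precision over the cut-offs used in the experiments.
    return sum(precision_at_k(extracted, gold, k) for k in ks) / len(ks)

# Hypothetical ranked output and gold labels.
extracted = ["a", "b", "c", "d", "e"]
gold = {"a", "c", "e"}
print(precision_at_k(extracted, gold, 5))
```

Each of the four methods would be scored this way on each of the six data sets, then compared.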
In addition, on the travel blog document data set, the semantic differences between the features extracted by the three methods are compared, with each method extracting the top 5 hot words. The results are given in Table 5.
Table 5. Semantic comparison of the local features extracted by TF-IDF, LDA and TVS
The semantic content of the features extracted by the three methods shows that the information granularity of the TVS features is moderate and fewer over-general features are extracted, so TVS represents the local features of hot topic words well. For example, for "Disneyland" it extracted "Sunny Bay", "It's a Small World", "Jones", "Sleeping Beauty" and "Stitch"; these features are exactly the distinctive local features of the hot topic word "Disneyland".
As can be seen from the above embodiments and their verification experiments, the feature words extracted by the embodiments of the present invention are not only related to the document but also take into account the semantic status of the word in the domain, such as its degree of popularity and whether it can represent a topic. In addition, after the topic is determined, the invention searches the vocabulary distribution around the topic word in all domain documents; the words closely related to the topic word (not necessarily domain terms) are identified as the (local) feature words of the specific topic. In this way, the divide-and-conquer approach centred on topic words not only reduces computational complexity but also improves the correlation between local feature words and the central topic word, effectively shielding the interference of high-frequency words.
It is to be appreciated that the embodiments described above are only some of the embodiments of the invention, not all of them, and do not limit the invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.

Claims (10)

1. A method for extracting tourist hot spots and their features from mass text, characterised by comprising the following steps:
A. Text preprocessing
extracting documents related to the information domain from the network, and preprocessing the document contents to form a data set;
B. Hot topic word discovery
using the proprietary vocabulary of the given information domain, mining the set of hot topic words of the information domain from the data set;
C. Hot topic feature extraction
performing vector cutting based on the hot topic word set, and analysing the dependencies between candidate feature words and hot topic words to obtain the local features of the hot topic words.
2. The method for extracting tourist hot spots and their features from mass text according to claim 1, characterised in that in step C the data set is cut using the hot topic words and the punctuation marks around the hot topic words, to obtain the cut sub-itemsets of all the hot topic words.
3. The method for extracting tourist hot spots and their features from mass text according to claim 2, characterised in that in step C the dependencies between candidate feature words and hot topic words are analysed with a maximum confidence index.
4. The method for extracting tourist hot spots and their features from mass text according to claim 3, characterised in that the maximum confidence threshold is 0.6 to 0.95.
5. The method for extracting tourist hot spots and their features from mass text according to claim 1, characterised in that in step A the documents related to the information domain are extracted from the network using crawler technology.
6. The method for extracting tourist hot spots and their features from mass text according to claim 1, characterised in that the preprocessing in step A at least includes segmenting the text at the full stops, question marks and exclamation marks in the text.
7. A system for extracting tourist hot spots and their features from mass text, characterised by comprising:
a text preprocessing module configured to: extract documents related to the information domain from the network, and preprocess the document contents to form a data set;
a hot topic word discovery module configured to: use the proprietary vocabulary of the given information domain to mine the set of hot topic words of the information domain from the data set;
a hot topic feature extraction module configured to: perform vector cutting based on the hot topic word set, and analyse the dependencies between candidate feature words and hot topic words to obtain the local features of the hot topic words.
8. The system for extracting tourist hot spots and their features from mass text according to claim 7, characterised in that the hot topic feature extraction module is specifically configured to: cut the data set using the hot topic words and the punctuation marks around the hot topic words, to obtain the cut sub-itemsets of all the hot topic words.
9. The system for extracting tourist hot spots and their features from mass text according to claim 7, characterised in that the hot topic feature extraction module is specifically configured to: analyse the dependencies between candidate feature words and hot topic words with a maximum confidence index.
10. The system for extracting tourist hot spots and their features from mass text according to claim 7, characterised in that the text preprocessing module is specifically configured to: extract documents related to the information domain from the network using crawler technology, and segment the text at the full stops, question marks and exclamation marks in the text.
CN201611219439.9A 2016-12-26 2016-12-26 Tourist hot spot and its Feature Extraction Method and system in mass text Pending CN106776569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611219439.9A CN106776569A (en) 2016-12-26 2016-12-26 Tourist hot spot and its Feature Extraction Method and system in mass text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611219439.9A CN106776569A (en) 2016-12-26 2016-12-26 Tourist hot spot and its Feature Extraction Method and system in mass text

Publications (1)

Publication Number Publication Date
CN106776569A true CN106776569A (en) 2017-05-31

Family

ID=58925202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611219439.9A Pending CN106776569A (en) 2016-12-26 2016-12-26 Tourist hot spot and its Feature Extraction Method and system in mass text

Country Status (1)

Country Link
CN (1) CN106776569A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783438A (en) * 2020-05-22 2020-10-16 贵州电网有限责任公司 Hot word detection method for realizing work order analysis
CN112667884A (en) * 2019-10-16 2021-04-16 财团法人工业技术研究院 System and method for generating a ruled book
CN112819659A (en) * 2021-02-09 2021-05-18 西南交通大学 Tourist attraction development and evaluation method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xu Hualin: "Research on Topic-Feature Relation Extraction from Domain UGC Text and Its Applications", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667884A (en) * 2019-10-16 2021-04-16 财团法人工业技术研究院 System and method for generating a ruled book
CN112667884B (en) * 2019-10-16 2023-11-28 财团法人工业技术研究院 System and method for generating enterprise book
CN111783438A (en) * 2020-05-22 2020-10-16 贵州电网有限责任公司 Hot word detection method for realizing work order analysis
CN112819659A (en) * 2021-02-09 2021-05-18 西南交通大学 Tourist attraction development and evaluation method

Similar Documents

Publication Publication Date Title
CN111143479B (en) Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm
CN106777274B (en) A kind of Chinese tour field knowledge mapping construction method and system
CN110825721B (en) Method for constructing and integrating hypertension knowledge base and system in big data environment
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN103605729B (en) A kind of method based on local random lexical density model POI Chinese Text Categorizations
CN106776562A (en) A kind of keyword extracting method and extraction system
CN107066553A (en) A kind of short text classification method based on convolutional neural networks and random forest
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN106372208B (en) A kind of topic viewpoint clustering method based on statement similarity
CN102306204B (en) Subject area identifying method based on weight of text structure
CN105528437B (en) A kind of question answering system construction method extracted based on structured text knowledge
Zheng et al. Template-independent news extraction based on visual consistency
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN111177591A (en) Knowledge graph-based Web data optimization method facing visualization demand
CN103593474B (en) Image retrieval sort method based on deep learning
CN107391706A (en) A kind of city tour's question answering system based on mobile Internet
CN109558492A (en) A kind of listed company's knowledge mapping construction method and device suitable for event attribution
CN113553429A (en) Normalized label system construction and text automatic labeling method
CN105654144B (en) A kind of social network ontologies construction method based on machine learning
CN105677638B (en) Web information abstracting method
CN106156287A (en) Analyze public sentiment satisfaction method based on the scenic spot evaluating data of tourism demand template
Sadr et al. Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
CN112036178A (en) Distribution network entity related semantic search method
CN106776569A (en) Tourist hot spot and its Feature Extraction Method and system in mass text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170531