CN106776569A - Tourist hot spot and its Feature Extraction Method and system in mass text - Google Patents
- Publication number
- CN106776569A (application CN201611219439.9A)
- Authority
- CN
- China
- Prior art keywords
- hot topic
- topic word
- feature extraction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F40/289: Natural language analysis; recognition of textual entities; phrasal analysis, e.g. finite state techniques or chunking
- G06F40/216: Natural language analysis; parsing using statistical methods
- G06F40/30: Semantic analysis
- G06Q50/14: ICT specially adapted for business processes of specific sectors; services; travel agencies
- G06F2216/03: Indexing scheme for additional aspects of information retrieval; data mining
Abstract
The present invention relates to the field of data mining and provides a method and system for extracting tourist hot spots and their features from mass text. The method comprises text preprocessing, hot topic word discovery, and hot topic feature extraction. Besides reducing computational complexity, the proposed technical scheme improves the correlation between local feature words and central topic words, and can effectively shield against interference from high-frequency words.
Description
Technical field
The invention belongs to the field of data mining, and more particularly relates to a method and system for extracting tourist hot spots and their features from mass text.
Background art
Topic mining from natural-language text has long been an active research direction in information retrieval, and related work is extensive. Almost all of it revolves around the basic component of text, the word, in view of both the differing (semantic) status of words within the same document and the sparseness and noise introduced by an author's choice of words. The first task of document processing is therefore to identify the words that are semantically most important, i.e. to extract feature words. On the basis of the feature words, further analysis and mining of document information, such as text classification and topic summarization, can be carried out.
In feature extraction, given the composition of text and the characteristics of natural language, researchers first attacked the problem from aspects such as the part of speech, syntactic features, and text patterns of candidate words. To improve accuracy, other intelligent algorithms have also been brought into text feature extraction, for example document frequency, mutual information, rough-set strategies, TF-IDF, information gain, the χ² statistic, and conditional random field models. Automating the feature extraction process over mass text data is an efficient approach. One proposal extracts product features automatically by combining synonym expansion with the PageRank algorithm; another, based on unsupervised learning, achieved good experimental results on product-review corpora in the electronics domain. Machine learning with accurately annotated seed words is also an effective way to carry out feature extraction.
In topic discovery, the basic idea of classical research is to start from the word level: first find a suitable measure to represent the relation between words, then introduce an intelligent algorithm to summarize topics. What such research considers first is the semantic relations between words, with emphasis on co-occurrence and word-frequency relations (such as TF-IDF and entropy), semantic similarity methods (e.g. clustering and classification algorithms), and their combinations; for example, extracting topic words with TF-IDF and a document growth-rate factor, building a word graph from the relations between the topic words, and finally recognizing topics from the connectivity of the graph. In these text computations, a document is usually represented with a vector space model (VSM), the words of the document forming the dimensions of the vector, so that each document is viewed as a vector in word space. However, representing documents as vectors discards the order in which words occur in a document, and the model in principle assumes statistical independence between words. Because of these two shortcomings, topic summarization over the extracted feature words is heavily influenced by word frequency and by the words' original semantics, and easily overlooks the information a word carries within a domain topic. Later, realizing that the occurrences of words in a document are not fully independent, researchers proposed topic models that consider the positional distribution of words. Such topic models are usually realized as probabilistic models and require prior information about the corpus, so when processing UGC documents, whose writing style is relatively unconstrained, they are easily disturbed by noise (synonyms, polysemous words, typos).
Summary of the invention
【Technical problem to be solved】
The object of the present invention is to provide a method and system for extracting tourist hot spots and their features from mass text, so as to effectively capture the hot topic words of a field and their local features from large-scale text data.
【Technical scheme】
The present invention is achieved through the following technical solutions.
The present invention first relates to a method for extracting tourist hot spots and their features from mass text, comprising the following steps:
A. Text preprocessing
Extract documents related to an information domain from the network, and preprocess their contents to form a data set.
B. Hot topic word discovery
Given the proprietary vocabulary of the information domain, mine the set of hot topic words of that domain from the data set.
C. Hot topic feature extraction
Perform vector cutting based on the hot topic word set; analyze the dependence between candidate feature words and hot topic words to obtain the local features of each hot topic word.
In a preferred embodiment, in step C the data set is cut using the hot topic words and the punctuation marks around them, yielding the cut sub-itemsets of all hot topic words.
In another preferred embodiment, in step C the dependence between candidate feature words and hot topic words is analyzed with the max-confidence measure.
In another preferred embodiment, the max-confidence threshold is 0.6 to 0.95.
In another preferred embodiment, in step A the documents related to the information domain are extracted from the network using crawler technology.
In another preferred embodiment, the preprocessing in step A at least includes splitting the text at full stops, question marks, and exclamation marks.
The invention further relates to a system for extracting tourist hot spots and their features from mass text, comprising:
a text preprocessing module configured to extract documents related to an information domain from the network and preprocess their contents to form a data set;
a hot topic word discovery module configured to mine, given the proprietary vocabulary of the information domain, the set of hot topic words of that domain from the data set;
a hot topic feature extraction module configured to perform vector cutting based on the hot topic word set, and to analyze the dependence between candidate feature words and hot topic words to obtain the local features of each hot topic word.
In a preferred embodiment, the hot topic feature extraction module is specifically configured to cut the data set using the hot topic words and the punctuation marks around them, obtaining the cut sub-itemsets of all hot topic words.
In another preferred embodiment, the hot topic feature extraction module is specifically configured to analyze the dependence between candidate feature words and hot topic words with the max-confidence measure.
In another preferred embodiment, the text preprocessing module is specifically configured to extract the domain-related documents from the network using crawler technology and to split the text at full stops, question marks, and exclamation marks.
The present invention is described in detail below.
The invention first expresses the semantic relations of the research object in three layers, "domain - topic - feature", defined as follows:
Domain: To keep the retrieved information focused and easy to understand, the UGC under study is restricted to a certain subject scope, whose subject matter is called the background domain, for example traffic UGC, touring UGC, or health UGC. The domain is the background against which documents are interpreted.
Topic: It is assumed that one or more topics exist in a UGC document, each expressed through corresponding topic words. The topic words corresponding to hot topics in UGC are called hot topic words.
Feature: When people discuss a certain hot spot (word) in a document, they like to cite specific feature words to describe it or to provide auxiliary information. A word that describes one aspect of a hot topic is called one of its feature words. If, within a domain, a feature word is used only to describe a specific hot-spot word, it is called a local feature word of that hot-spot word (as opposed to a global one). For example, if "Beijing" is a hot-spot word, then "the Great Wall" is one of its local features. The distinguishing trait of a local feature word in a document is that it semantically serves the hot-spot word, and its position in the document is typically near that word. In particular, in this work features are features of topics.
The method provided by the present invention mainly comprises three parts, text preprocessing, hot topic word discovery, and hot topic feature extraction, which are described in detail below.
(1) Document content preprocessing
The document content preprocessing subsystem mainly performs document extraction, word segmentation, and data cleaning. Document extraction mainly uses web crawler technology to obtain data from websites related to the information domain, forming the initial document data set. The texts in the data set are then segmented with a word segmentation tool, vectorizing each document. Data cleaning applies semantic judgment to the segmentation results and retains the following two kinds of output:
Semantic words (phrases): the present invention mainly seeks the hot topics in UGC and the local features related to them. Since these are made up of nouns or noun phrases, the semantic words retained here are the nouns in the segmentation results.
Punctuation marks: in text, the three punctuation marks full stop, question mark, and exclamation mark signal the end of a sentence. In the text processing of the invention, these punctuation marks are retained as boundaries between sentences and are uniformly represented by the symbol "||".
After content preprocessing, the document data set becomes a series of term vectors composed of semantic words and punctuation marks:
B = {b1, …, bi, …, b|B|}   (1)
The data set B is composed of the term vectors bi; every item of B can be written bi = {bi1, …, bij, …}, where the element bij is the j-th word of the i-th document in the data set.
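The preprocessing described above (keep only the nouns, replace sentence-ending punctuation with the unified "||" boundary) can be sketched as follows. A real pipeline would need a Chinese word segmenter and POS tagger such as jieba; here the (word, POS) pairs and the tag convention ("n…" for nouns) are assumed as given input, and the example words are purely illustrative.

```python
# Sentence-ending punctuation, ASCII and CJK full-width forms.
SENTENCE_ENDS = {".", "?", "!", "\u3002", "\uff1f", "\uff01"}

def preprocess(tagged_doc):
    """Turn one segmented document [(word, pos), ...] into a term vector b_i:
    a list of nouns interleaved with '||' sentence boundaries."""
    vec = []
    for word, pos in tagged_doc:
        if word in SENTENCE_ENDS:
            if vec and vec[-1] != "||":   # collapse consecutive boundaries
                vec.append("||")
        elif pos.startswith("n"):          # keep nouns / noun phrases only
            vec.append(word)
    if vec and vec[-1] == "||":            # drop a trailing boundary
        vec.pop()
    return vec

doc = [("Hong Kong", "ns"), ("is", "v"), ("beautiful", "a"), (".", "x"),
       ("Disneyland", "n"), ("ticket", "n"), ("!", "x")]
print(preprocess(doc))  # ['Hong Kong', '||', 'Disneyland', 'ticket']
```

Only the noun/boundary skeleton of each document survives, which is exactly the term-vector form of formula (1).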
(2) Hot topic word discovery
1. Domain term filtering
Given the proprietary vocabulary of an information domain (Domain Name Table, abbreviated DNT), its intersection with each document generates a new term vector:
b_i^G = b_i ∩ DNT   (2)
Clearly, the word elements of this new term vector all consist of domain information words. All the term vectors b_i^G generate a new data set:
B^G = {b_1^G, …, b_i^G, …, b_|B|^G}   (3)
All items in the data set B^G can be expressed as b_i^G = {b_i1, …, b_ij, …}, where the element b_ij denotes the j-th domain information word in the i-th document of B^G. Unlike conventional methods, the term vector b_i^G in the data set B^G represents not only all the domain-related words mentioned in the i-th document but, more importantly, the order in which those words occur in the document. Suppose the document data set B forms 5 term vectors (as shown in Table 1); then, given the local information vocabulary DNT = {A B C D}, the data set B^G of domain term vectors obtained by formula (2) is as shown in Table 2.
Table 1: the data set B
Table 2: the data set B^G
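The order-preserving intersection of formula (2) can be sketched as follows; the sample vectors are illustrative (they are not the actual Table 1 data, which is not reproduced here).

```python
def domain_vector(b_i, dnt):
    """b_i^G: the subsequence of b_i whose elements lie in the domain
    vocabulary DNT, with order and multiplicity preserved (formula (2))."""
    return [w for w in b_i if w in dnt]

B = [["a1", "A", "a2"],
     ["A", "a1", "a2"],
     ["A", "a1", "C", "c1", "A", "c2"],
     ["d1", "E"],
     ["a1", "A", "a2"]]
DNT = {"A", "B", "C", "D"}
BG = [domain_vector(b, DNT) for b in B]
print(BG)  # [['A'], ['A'], ['A', 'C', 'A'], [], ['A']]
```

Note that a plain set intersection would lose exactly the positional information the text says must be kept; the subsequence form retains it.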
2. Mining hot topic words in the domain
The n-frequent sets of an arbitrary data set T can be expressed as:
FP^(n)(T) = {X | supp(X) ≥ mini_supp, |X| = n}   (4)
where supp(X) denotes the support of the itemset X and mini_supp is a preset minimum support threshold. From a database point of view, each vector b_i^G in the data set B^G can be regarded as a transactional record (Transactional Data). Then the 1-frequent set FP^(1)(B^G) of B^G is exactly the set of single hot topic words we want to obtain. For the data in Table 2, if mini_supp = 60%, we obtain FP^(1)(B^G) = {A B C D} as the hot topic words of the domain.
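Mining the 1-frequent set of formula (4) can be sketched as follows: a word's support is the fraction of term vectors that contain it, and the words reaching mini_supp become hot topic words. The sample data is constructed so that A, B, C, and D all reach 60% support; it is illustrative only.

```python
from collections import Counter

def hot_topic_words(BG, mini_supp):
    """1-frequent set FP^(1)(B^G): words whose document support (fraction of
    term vectors containing them) reaches the mini_supp threshold."""
    n = len(BG)
    counts = Counter()
    for vec in BG:
        counts.update(set(vec))  # each vector contributes at most 1 per word
    return {w for w, c in counts.items() if c / n >= mini_supp}

BG = [["A", "B"], ["A", "C", "B"], ["A", "B", "C", "D"],
      ["B", "C", "D"], ["A", "D"]]
print(sorted(hot_topic_words(BG, 0.6)))  # ['A', 'B', 'C', 'D']
```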
(3) Extracting the local feature words of hot topic words
This part is introduced below from three aspects: vector cutting based on hot topic words, the dependence between frequently co-occurring words, and the local feature word extraction method.
1. Vector cutting based on hot topic words
As is well known, a document of the data set B contains a large number of words (with hot topic words interspersed among them). Traditional mining algorithms do not consider the semantic relations between word elements, and mining directly on B would yield a large number of noisy results. We have observed, however, the following writing habits of most ordinary users: if a topic is of general concern, the words expressing it gradually become "hot topic words"; and when writing a document, an author usually organizes the wording around a certain topic to express his or her ideas. This leads to a desirable consequence: feature words tend to be distributed around the "hot topic words" in a document. If adjacent sentences all describe themes related to a certain hot topic ("hot topic word"), those sentences form a topic region related to that word; when the topic changes, the adjacent sentences are also separated by the corresponding punctuation marks.
It can therefore be assumed that each "hot topic word" has associated content, that this content is distributed in the sentences adjacent to it, and that it forms a bounded topic region delimited by punctuation marks.
According to formula (2), the elements of b_i fall into two classes, hot topic words and other words (overlap being allowed). Hence b_i can be cut into several sub-itemsets (each itself a term vector) according to the hot topic words and the punctuation marks around them. Since the sentences of a document end with the corresponding punctuation marks, the start of the j-th sub-itemset of b_i should be the first punctuation mark before the hot topic word, and its end should be the first punctuation mark before the next hot topic word. The term vector b_i of document i can thus be divided as follows:
b_i = {…, ||, CUT(b_H1 | b_i), ||, CUT(b_H2 | b_i), ||, …}   (5)
where "||" denotes the position of a punctuation mark. After such a division, the content related to each hot topic word (including the local features of that word) is contained, with high probability, in the corresponding sub-itemset.
Given a hot topic word b_H ∈ FP^(1)(B^G), its cut vector within b_i is:
CUT(b_H | b_i) = {b_ijS, …, b_H, …, b_ijE}   (6)
Then the cut vector set of b_H over all vectors of the data set B can be expressed as:
CUT(b_H) = ∪_i {CUT(b_H | b_i)}, b_i ∈ B   (7)
The set of all elements of CUT(b_H) is {b_i1, …, b_ij, …}, where b_ij is the j-th word element of CUT(b_H | b_i). In effect, the cut vector CUT(b_H) can be regarded as consisting of the domain hot topic word b_H together with a series of words related to it: words with a high degree of co-occurrence (Co-occurrence) with it, words with a grammatical dependence on it, or both. All content in the data set B potentially related to b_H (its local feature words) is likely to lie within CUT(b_H). For the data in Table 1, with FP^(1)(B^G) = {A B C D} and mini_supp = 60%, the cut result for the hot-spot word "A" and its content is shown in Table 3 below; the cut results for B, C, and D are obtained similarly.
Table 3: CUT(A)

Term vector b_i | CUT(A | b_i)
b1              | {a1 A a2}
b2              | {A a1 a2}
b3              | {A a1 C c1 A c2}
b4              | {}
b5              | {a1 A a2}
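The cutting of formulas (6) and (7) can be sketched as follows, under one simplification: each term vector is split at its "||" boundaries and the segments that mention b_H are kept, whereas the patent's scheme also bounds a sub-itemset at the next hot topic word. The vectors are illustrative, so the result for b3 differs from Table 3 where the intervening C-segment is also retained.

```python
def cut(b_H, b_i):
    """Simplified CUT(b_H | b_i): split b_i at the '||' sentence boundaries
    and keep the segments that mention the hot topic word b_H."""
    out, seg = [], []
    for w in b_i + ["||"]:       # sentinel boundary flushes the last segment
        if w == "||":
            if b_H in seg:
                out.extend(seg)
            seg = []
        else:
            seg.append(w)
    return out

def cut_set(b_H, B):
    """CUT(b_H) over the whole data set: one cut vector per document."""
    return [cut(b_H, b) for b in B]

B = [["a1", "A", "a2"],
     ["A", "a1", "a2"],
     ["A", "a1", "||", "C", "c1", "||", "A", "c2"],
     ["d1", "D"],
     ["a1", "A", "a2"]]
print(cut_set("A", B))
# [['a1', 'A', 'a2'], ['A', 'a1', 'a2'], ['A', 'a1', 'A', 'c2'], [], ['a1', 'A', 'a2']]
```

A document with no occurrence of b_H contributes an empty cut vector, matching the {} entry for b4 in Table 3.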
2. Dependence between frequently co-occurring words
When the author of a document describes a certain hot topic b_H of the domain, its important local feature b is likely to be mentioned. In this situation, the occurrence frequency supp(b) of the feature b in the domain depends on the occurrence supp(b_H) of the hot topic word b_H in documents, and on the frequency supp({b_H, b}) with which the two are mentioned together. In other words, the occurrence of a local feature word related to a certain topic in the domain depends on the associated hot topic word.
Given an itemset X = {b_H, b}, the present invention measures the dependence between b_H and b with max-confidence (Max-confidence), defined as follows:
max_conf(X) = supp({b_H, b}) / max(supp(b_H), supp(b))   (8)
Given a threshold θ0 ∈ [0, 1], if a candidate word b ∈ CUT(b_H) satisfies the following condition, then b is a local feature of the hot topic word b_H:
max_conf({b_H, b}) ≥ θ0   (9)
3. Local feature word extraction method
For a given hot topic word b_H, the computation that extracts its potential features b is broadly divided into two steps: first obtain the hot-spot words and cut the term vectors (the first sub-step); then analyze the dependence between each candidate feature word b and b_H with the max-confidence measure (the second sub-step).
The first sub-step mainly has three stages: first compute the domain word set B^G referred to by the documents gathered in B; then compute the set of hot topic words over the whole domain; finally, cut the data set using the hot topic words and punctuation marks, obtaining all the cut sub-itemsets.
Then, the above results can be used to analyze the dependence between every item of a cut sub-itemset CUT(b_H) and the hot topic word b_H, yielding the local features of b_H (the second sub-step).
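The two sub-steps can be combined into one end-to-end sketch, under the same simplifying assumptions used throughout this section: documents arrive as pre-segmented term vectors with "||" boundaries, the DNT is an illustrative set, and cutting keeps the sentences that mention the topic word rather than the patent's full between-topic-words scheme.

```python
from collections import Counter

def extract_local_features(B, dnt, mini_supp, theta0):
    """Mine hot topic words from B via the DNT, then return, per topic word,
    the candidate words whose max-confidence with it reaches theta0."""
    n = len(B)
    # Sub-step 1: hot topic word mining over the DNT-filtered vectors.
    counts = Counter()
    for b in B:
        counts.update(set(b) & dnt)
    topics = {w for w, c in counts.items() if c / n >= mini_supp}
    result = {}
    for t in topics:
        # Cut each vector at '||' and keep the segments mentioning t.
        cuts = []
        for b in B:
            segs, seg = [], []
            for w in b + ["||"]:
                if w == "||":
                    segs.append(seg)
                    seg = []
                else:
                    seg.append(w)
            cuts.append([w for s in segs if t in s for w in s])
        # Sub-step 2: max-confidence filtering of candidate feature words.
        def supp(items):
            return sum(1 for c in cuts if items <= set(c)) / n
        feats = set()
        for cand in {w for c in cuts for w in c} - {t}:
            denom = max(supp({t}), supp({cand}))
            if denom and supp({t, cand}) / denom >= theta0:
                feats.add(cand)
        result[t] = feats
    return result

B = [["A", "ticket", "||", "hotel"], ["A", "ticket"],
     ["A", "line"], ["A", "ticket", "||", "park"]]
print(extract_local_features(B, {"A", "B"}, 0.6, 0.6))  # {'A': {'ticket'}}
```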
【Beneficial effects】
The technical scheme proposed by the present invention has the following beneficial effects.
Unlike prior-art methods, the feature words extracted by the present invention are not only related to individual documents but also take into account the semantic status of each word in the field, such as its degree of popularity and whether it can represent a topic. Furthermore, after determining a topic, the invention centers on the topic word and searches the documents of the whole domain for the distribution of surrounding vocabulary; the words closely related to the topic word (not necessarily domain terms) are recognized as the (local) feature words of that specific topic. Besides reducing computational complexity, this divide-and-conquer approach based on central topic words also improves the correlation between local feature words and central topic words, and can effectively shield against the interference of high-frequency words.
Brief description of the drawings
Fig. 1 is a schematic block diagram of the system for extracting tourist hot spots and their features from mass text provided by embodiment one of the present invention.
Fig. 2 shows the ranked hot tourism place names of Hong Kong in embodiment three of the present invention.
Fig. 3 shows the local features of the topic "Disneyland" in embodiment three (max-confidence threshold θ0 = 0.6).
Fig. 4 shows the local features of the topic "Disneyland" in embodiment three (max-confidence threshold θ0 = 0.8).
Fig. 5 shows the local features of the topic "Disneyland" in embodiment three (max-confidence threshold θ0 = 0.95).
Specific embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, specific embodiments of the invention are described below clearly and completely.
Embodiment one
Fig. 1 is a schematic block diagram of the system for extracting tourist hot spots and their features from mass text provided by embodiment one of the present invention. As shown in Fig. 1, the system comprises a text preprocessing module, a hot topic word discovery module, and a hot topic feature extraction module.
The text preprocessing module is configured to extract documents related to an information domain from the network and preprocess their contents to form a data set. Specifically, the text preprocessing module extracts the domain-related documents from the network using crawler technology and splits the text at full stops, question marks, and exclamation marks.
The hot topic word discovery module is configured to mine, given the proprietary vocabulary of the information domain, the set of hot topic words of that domain from the data set.
The hot topic feature extraction module is configured to perform vector cutting based on the hot topic word set, and to analyze the dependence between candidate feature words and hot topic words to obtain the local features of each hot topic word. Specifically, the module first cuts the data set using the hot topic words and the punctuation marks around them, obtaining the cut sub-itemsets of all hot topic words, and then analyzes the dependence between candidate feature words and hot topic words with the max-confidence measure.
For the method of extracting tourist hot spots and their features from mass text realized with the system of embodiment one, reference may be made to the following method embodiments.
Embodiment two
Embodiment two is a method for extracting tourist hot spots and their features from mass text, comprising the following steps:
(1) Text preprocessing
This step mainly includes document extraction, word segmentation, and data cleaning. Document extraction mainly uses web crawler technology to obtain data from websites related to the information domain, forming the initial document data set. The texts in the data set are then segmented with a word segmentation tool, vectorizing each document. Data cleaning applies semantic judgment to the segmentation results and retains the following two kinds of output:
Semantic words (phrases): this embodiment mainly seeks the hot topics in UGC and the local features related to them. Since these are made up of nouns or noun phrases, the semantic words retained here are the nouns in the segmentation results.
Punctuation marks: in text, the three punctuation marks full stop, question mark, and exclamation mark signal the end of a sentence. In the text processing of this embodiment, these punctuation marks are retained as boundaries between sentences and are uniformly represented by the symbol "||".
After content preprocessing, the document data set becomes a series of term vectors composed of semantic words and punctuation marks.
(2) Hot topic word discovery
Given the proprietary vocabulary of the information domain, the set of hot topic words of that domain is mined from the data set. Specifically, the intersection of the proprietary vocabulary with each document generates a new term vector, and these term vectors constitute a new data set from which the hot topic word set is obtained.
(3) Hot topic feature extraction
Vector cutting is performed based on the hot topic word set, and the dependence between candidate feature words and hot topic words is analyzed to obtain the local features of each hot topic word. Specifically, in this step the data set is first cut using the hot topic words and the punctuation marks around them, obtaining the cut sub-itemsets of all hot topic words; then the dependence between each candidate feature word and the hot topic word is analyzed with the max-confidence measure, the max-confidence of the candidate feature word being compared with the preset max-confidence threshold. If the max-confidence of a candidate feature word is greater than or equal to the preset threshold, the candidate feature word is a local feature of the hot topic word.
Embodiment three
Embodiment three is tourist hot spot and its Feature Extraction Method in a kind of mass text.Especially, embodiment three is real
The materialization of example two is applied, the method is comprised the following steps:
(1) Text pretreatment
Documents related to the information domain are extracted from the network, and their content is preprocessed to form a data set. Specifically, crawler technology can be used to extract the domain-related documents, after which the text is segmented at full stops, question marks and exclamation marks. In this embodiment, documents whose destination is "Hong Kong" were extracted from Mafengwo (www.mafengwo.com), a prominent domestic travel-information sharing website, and their content was preprocessed. The geographical term vocabulary DNT was drawn from the complete list of Hong Kong tourist-attraction names on the well-known travel site www.tripadvisor.com.
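The punctuation-based segmentation in this step might look like the following sketch. Splitting on both full-width (Chinese) and ASCII marks is an assumption, and the sample sentence is invented.

```python
import re

# Sketch of the pretreatment step: split crawled text into sentence units
# at full stops, question marks and exclamation marks.

def split_sentences(text):
    parts = re.split(r"[。？！.?!]", text)
    return [p.strip() for p in parts if p.strip()]

doc = "迪士尼乐园很好玩！门票不贵。你去过海洋公园吗？"
sentences = split_sentences(doc)
# three sentence units, one per punctuation-delimited span
```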
(2) Hot-topic word discovery
After data preprocessing, the frequency distribution of Hong Kong tourism nouns is shown in Fig. 2. The curve has two obvious turning points, at "Central" and "Lo Wu". Because only two hot words rank ahead of "Central", it provides too little information. Therefore the 15 nouns ranked by frequency ahead of "Lo Wu" were chosen as the hot-topic words of Hong Kong tourism.
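The frequency-based selection can be illustrated as follows. The token list and the cutoff k are invented for the example; in the embodiment, k = 15 is read off the turning point of the curve in Fig. 2 rather than computed.

```python
from collections import Counter

# Illustrative sketch of hot-topic word selection: rank domain nouns by
# frequency and keep the top k. All tokens are made up.

def top_hot_words(tokens, k):
    return [w for w, _ in Counter(tokens).most_common(k)]

tokens = ["Disneyland"] * 5 + ["Ocean Park"] * 4 + ["Central"] * 3 + ["Lo Wu"]
hot = top_hot_words(tokens, k=3)
# -> ["Disneyland", "Ocean Park", "Central"]
```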
(3) Hot-topic feature extraction
To demonstrate local-feature extraction, "Disneyland" was chosen from the 15 extracted hot tourism place names for the experiment. First, segmentation at punctuation marks yields the sub-itemset CUT("Disneyland") related to "Disneyland". Then the 1-frequent and 2-frequent itemsets within this sub-itemset are mined, and finally the features related to "Disneyland" are extracted; see Figs. 3 to 5.
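Mining the 1- and 2-frequent itemsets from such a sub-itemset could look like this minimal sketch; the transactions, words and support threshold are all illustrative.

```python
from collections import Counter
from itertools import combinations

# Sketch of mining 1- and 2-frequent itemsets from the segmented
# sub-itemsets CUT(t) of a topic word t.

def frequent_itemsets(transactions, min_support):
    c1, c2 = Counter(), Counter()
    for t in transactions:
        items = sorted(set(t))          # dedupe within a transaction
        c1.update(items)
        c2.update(combinations(items, 2))
    f1 = {i for i, n in c1.items() if n >= min_support}
    f2 = {p for p, n in c2.items() if n >= min_support}
    return f1, f2

cut = [["ticket", "parade"], ["ticket", "castle"], ["ticket", "parade"]]
f1, f2 = frequent_itemsets(cut, min_support=2)
# f1 == {"ticket", "parade"}; f2 == {("parade", "ticket")}
```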
As shown in Fig. 3, when the maximum-confidence threshold θ0 = 0.6, many features related to "Disneyland" can be extracted. "Ocean Park", another attraction that draws many tourists, and even the features of the scenic spots around it are mined out, because "Disneyland" and "Ocean Park" are often mentioned by travelers at the same time. Many features of "Disneyland" itself are also mined, such as "admission ticket" and the transport route "Tung Chung line". These features, strongly correlated with "Disneyland", can provide informational support for travelers drawing up their tour plans.
On the other hand, although a looser θ0 threshold can mine more local features, it also lets the dependences among the features become complicated. To obtain clearer dependences and the local features most related to "Disneyland", the threshold θ0 was raised to 0.8 and then to 0.95, at which point several popular attractions inside "Disneyland" are mined out (as shown in Figs. 4 and 5). In particular, at θ0 = 0.95 the mined features are almost exactly the core attractions of "Disneyland". Such information is of great help to potential users' travel planning and decision-making.
Contrast experiment
To verify the adaptability of the method provided in Embodiment Two, this section discusses experimental results on text data sets from different information domains. The experimental data come from the entertainment, sports, hotel-review, economics, computer and art domains. The sources of the data, listed in Table 4 below, are clearly diverse: the specialized journal articles (economics, computer and art) and the domain-news reports (entertainment and sports) are longer documents, whereas the hotel reviews are short texts generated by ordinary users' online commenting behavior.
Table 4 Description of the data sets used in the experiments
Next, the method of Embodiment Two, here named TVS (Term Vector Subdividing), is compared with the classical topic- and feature-extraction methods FP, TF-IDF and LDA. The four methods each extract the top 5, 10, 20, 50 and 80 hot words on the six data sets.
The average-accuracy results show that TVS extracts the features of hot words better than the other three methods across the different text domains, indicating that TVS is good at extracting feature words that depend strongly on the hot words.
In addition, on the tourism-blog data set the semantic differences among the features extracted by TF-IDF, LDA and TVS were compared, each method extracting the top 5 hot words; the results are given in Table 5.
Table 5 Comparison of the semantic content of the local features extracted by TF-IDF, LDA and TVS
The semantic content of the features extracted by the three methods shows that the feature-information granularity of TVS is moderate and that it extracts fewer overly general features, so it represents the local features of hot-topic words well. For example, for "Disneyland" it extracted "Sunny Bay, It's a Small World, Jones, Sleeping Beauty, Stitch", features that are exactly the distinctive local features of the hot-topic word "Disneyland".
The above embodiments and their verification experiments show that the feature words extracted by the embodiments of the present invention are not only document-related but also reflect a word's semantic status in the domain, such as its degree of popularity and whether it can represent a topic. Moreover, once a topic is determined, the invention takes the topic word as the center and searches all domain documents for the distribution of the surrounding vocabulary; words closely related to the topic word (not necessarily domain terms) are identified as the (local) feature words of that specific topic. This topic-word-centered divide-and-conquer approach not only reduces computational complexity but also improves the correlation between the local feature words and the central topic word, effectively shielding the interference of high-frequency words.
It should be appreciated that the embodiments described above are only some, not all, of the embodiments of the invention, and do not limit it. All other embodiments obtained by persons of ordinary skill in the art on the basis of the embodiments of the invention without creative effort fall within the protection scope of the invention.
Claims (10)
1. A method for extracting tourist hotspots and their features from mass text, characterized in that it comprises the following steps:
A. text pretreatment
extracting documents related to an information domain from the network, and preprocessing their content to form a data set;
B. hot-topic word discovery
mining the hot-topic word set of the information domain from the data set by means of the proprietary vocabulary of the given information domain;
C. hot-topic feature extraction
performing vector segmentation on the basis of the hot-topic word set, and analyzing the dependence between candidate feature words and the hot-topic words to obtain the local features of the hot-topic words.
2. The method for extracting tourist hotspots and their features from mass text according to claim 1, characterized in that in step C the data set is segmented using the hot-topic words and the punctuation marks surrounding them, yielding the segmented sub-itemsets of all hot-topic words.
3. The method for extracting tourist hotspots and their features from mass text according to claim 2, characterized in that in step C the dependence between candidate feature words and hot-topic words is analyzed with the maximum-confidence index.
4. The method for extracting tourist hotspots and their features from mass text according to claim 3, characterized in that the maximum-confidence threshold is 0.6 to 0.95.
5. The method for extracting tourist hotspots and their features from mass text according to claim 1, characterized in that in step A crawler technology is used to extract the documents related to the information domain from the network.
6. The method for extracting tourist hotspots and their features from mass text according to claim 1, characterized in that the pretreatment in step A at least includes segmenting the text at the full stops, question marks and exclamation marks in the text.
7. A system for extracting tourist hotspots and their features from mass text, characterized in that it comprises:
a text pretreatment module configured to extract documents related to an information domain from the network and to preprocess their content to form a data set;
a hot-topic word discovery module configured to mine the hot-topic word set of the information domain from the data set by means of the proprietary vocabulary of the given information domain; and
a hot-topic feature extraction module configured to perform vector segmentation on the basis of the hot-topic word set, and to analyze the dependence between candidate feature words and the hot-topic words to obtain the local features of the hot-topic words.
8. The system for extracting tourist hotspots and their features from mass text according to claim 7, characterized in that the hot-topic feature extraction module is specifically configured to segment the data set using the hot-topic words and the punctuation marks surrounding them, yielding the segmented sub-itemsets of all hot-topic words.
9. The system for extracting tourist hotspots and their features from mass text according to claim 7, characterized in that the hot-topic feature extraction module is specifically configured to analyze the dependence between candidate feature words and hot-topic words with the maximum-confidence index.
10. The system for extracting tourist hotspots and their features from mass text according to claim 7, characterized in that the text pretreatment module is specifically configured to extract the documents related to the information domain from the network using crawler technology, and to segment the text at the full stops, question marks and exclamation marks in the text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611219439.9A CN106776569A (en) | 2016-12-26 | 2016-12-26 | Tourist hot spot and its Feature Extraction Method and system in mass text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106776569A true CN106776569A (en) | 2017-05-31 |
Family
ID=58925202
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611219439.9A Pending CN106776569A (en) | 2016-12-26 | 2016-12-26 | Tourist hot spot and its Feature Extraction Method and system in mass text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106776569A (en) |
Non-Patent Citations (1)
Title |
---|
Xu Hualin, "Research on topic-feature relation extraction from domain UGC text and its applications", China Doctoral Dissertations Full-text Database, Information Science and Technology series |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112667884A (en) * | 2019-10-16 | 2021-04-16 | 财团法人工业技术研究院 | System and method for generating a ruled book |
CN112667884B (en) * | 2019-10-16 | 2023-11-28 | 财团法人工业技术研究院 | System and method for generating enterprise book |
CN111783438A (en) * | 2020-05-22 | 2020-10-16 | 贵州电网有限责任公司 | Hot word detection method for realizing work order analysis |
CN112819659A (en) * | 2021-02-09 | 2021-05-18 | 西南交通大学 | Tourist attraction development and evaluation method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111143479B (en) | Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm | |
CN106777274B (en) | A kind of Chinese tour field knowledge mapping construction method and system | |
CN110825721B (en) | Method for constructing and integrating hypertension knowledge base and system in big data environment | |
CN104391942B (en) | Short essay eigen extended method based on semantic collection of illustrative plates | |
CN103605729B (en) | A kind of method based on local random lexical density model POI Chinese Text Categorizations | |
CN106776562A (en) | A kind of keyword extracting method and extraction system | |
CN107066553A (en) | A kind of short text classification method based on convolutional neural networks and random forest | |
CN111190900B (en) | JSON data visualization optimization method in cloud computing mode | |
CN106372208B (en) | A kind of topic viewpoint clustering method based on statement similarity | |
CN102306204B (en) | Subject area identifying method based on weight of text structure | |
CN105528437B (en) | A kind of question answering system construction method extracted based on structured text knowledge | |
Zheng et al. | Template-independent news extraction based on visual consistency | |
CN107122413A (en) | A kind of keyword extracting method and device based on graph model | |
CN111177591A (en) | Knowledge graph-based Web data optimization method facing visualization demand | |
CN103593474B (en) | Image retrieval sort method based on deep learning | |
CN107391706A (en) | A kind of city tour's question answering system based on mobile Internet | |
CN109558492A (en) | A kind of listed company's knowledge mapping construction method and device suitable for event attribution | |
CN113553429A (en) | Normalized label system construction and text automatic labeling method | |
CN105654144B (en) | A kind of social network ontologies construction method based on machine learning | |
CN105677638B (en) | Web information abstracting method | |
CN106156287A (en) | Analyze public sentiment satisfaction method based on the scenic spot evaluating data of tourism demand template | |
Sadr et al. | Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms | |
CN110888991A (en) | Sectional semantic annotation method in weak annotation environment | |
CN112036178A (en) | Distribution network entity related semantic search method | |
CN106776569A (en) | Tourist hot spot and its Feature Extraction Method and system in mass text |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170531 |