CN109710944A - Hot word extracting method, device, electronic equipment and computer readable storage medium - Google Patents
Hot word extracting method, device, electronic equipment and computer readable storage medium
- Publication number
- CN109710944A (application number CN201811638597.7A)
- Authority
- CN
- China
- Prior art keywords
- word
- weight
- extracted
- text data
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
An embodiment of the present application provides a hot word extraction method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises: determining a first weight for each word to be extracted in a text data set; determining the weight of the feature words among the words to be extracted; determining a second weight for each word to be extracted based on its first weight and the feature word weights; and extracting the predetermined number of words with the highest second weights as hot words. By combining the first weights with the feature word weights, the scheme provided by this embodiment makes the computation of the second weights more reasonable, so that extracting hot words by the second weight improves the accuracy of hot word extraction and meets users' needs.
Description
Technical field
This application relates to the field of data processing, and in particular to a hot word extraction method and apparatus, an electronic device, and a computer-readable storage medium.
Background technique
Hot words, i.e. trending vocabulary, represent what netizens care about and mark the focus of an event or topic. Hot word extraction generates trending information through content aggregation, helping netizens find focal events within massive amounts of information. As online life grows richer and the volume of information increases, hot word extraction becomes ever more important.
Existing hot word extraction techniques are mainly based on statistical methods, so their accuracy is poor and they cannot meet users' needs.
Summary of the invention
This application provides a hot word extraction method and apparatus, an electronic device, and a computer-readable storage medium, to address the poor accuracy of existing hot word extraction, which cannot meet users' needs. The technical solutions adopted by this application are as follows:
In a first aspect, this application provides a hot word extraction method, comprising:
determining a first weight for each word to be extracted in a text data set;
determining the weight of the feature words among the words to be extracted, and determining a second weight for each word to be extracted based on its first weight and the feature word weights;
sorting the words to be extracted by their second weights, and extracting the predetermined number of words with the highest second weights after sorting as hot words.
In a second aspect, this application provides a hot word extraction apparatus, comprising:
a first weight determination module, for determining a first weight for each word to be extracted in a text data set;
a second weight determination module, for determining the weight of the feature words among the words to be extracted, and determining a second weight for each word to be extracted based on its first weight and the feature word weights;
a hot word extraction module, for sorting the words to be extracted by their second weights and extracting the predetermined number of words with the highest second weights after sorting as hot words.
In a third aspect, this application provides an electronic device comprising a processor and a memory;
the memory is configured to store operation instructions;
the processor is configured to execute the hot word extraction method of the first aspect by calling the operation instructions.
In a fourth aspect, this application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the hot word extraction method of the first aspect.
The technical solutions provided by the embodiments of this application have the following benefits:
The scheme provided by this embodiment determines a first weight for each word to be extracted, identifies the feature words among them together with their weights, and determines a second weight from the first weight and the feature word weights. This makes the computation of the second weights more reasonable, so that when hot words are extracted by the second weight, the accuracy of hot word extraction is improved and users' needs are met.
Detailed description of the invention
To explain the technical solutions in the embodiments of this application more clearly, the drawings used in the description of the embodiments are briefly introduced below.
Fig. 1 is a schematic flowchart of a hot word extraction method provided by an embodiment of this application;
Fig. 2 is a schematic flowchart of determining the first weights;
Fig. 3 is a schematic flowchart of determining the title feature word weights;
Fig. 4 is a schematic flowchart of determining the theme feature word weights;
Fig. 5 is a schematic design flowchart of a hot word extraction method provided by an embodiment of this application;
Fig. 6 is a schematic processing flowchart of a hot word extraction service system;
Fig. 7 is a schematic structural diagram of a hot word extraction apparatus provided by an embodiment of this application;
Fig. 8 is a schematic structural diagram of an electronic device provided by an embodiment of this application.
Specific embodiment
Embodiments of this application are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar labels throughout denote the same or similar elements, or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are used only to explain this application, and are not to be construed as limiting the claims.
Those skilled in the art will appreciate that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the wording "comprising" used in this description indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connection" or "coupling" used herein may include wireless connection or wireless coupling. The wording "and/or" used herein includes all or any units and all combinations of one or more of the associated listed items.
To make the purposes, technical solutions and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to the drawings.
Existing hot word extraction techniques mainly involve document-set feature extraction and document-set topic distribution. Document-set feature extraction is based mainly on word frequency and inverse document frequency statistics, computing word weights with term frequency-inverse document frequency (tf-idf) and its refinements to extract hot words. Document-set topic distribution generally uses a topic model to establish semantic associations among words, documents and topics, and extracts hot words from them.
Both of these hot word extraction approaches compute hot word weights purely statistically. They take little account of word position and article topic, and they cannot avoid interference from words of an auxiliary nature (such as particles and conjunctions) in hot word extraction. The resulting accuracy is therefore somewhat low and cannot meet users' needs.
The hot word extraction method and apparatus, electronic device and computer-readable storage medium provided by this application aim to solve the above technical problems of the prior art.
The technical solutions of this application, and how they solve the above technical problems, are described in detail below with specific embodiments. The specific embodiments below may be combined with each other, and the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of this application are described below with reference to the drawings.
An embodiment of this application provides a hot word extraction method. As shown in Fig. 1, the method mainly comprises:
Step S110: determine a first weight for each word to be extracted in the text data set.
In this embodiment, the text data set is the base text from which hot words are extracted. It may include multiple articles; the words to be extracted are the words identified from those articles, among which the hot words are contained. The first weight is the initial weight of each word to be extracted and provides the basis for subsequent calculation.
Step S120: determine the weight of the feature words among the words to be extracted, and determine a second weight for each word to be extracted based on its first weight and the feature word weights.
Step S130: extract the predetermined number of words with the highest second weights as hot words.
In this embodiment, feature words are words that characterize the text data set; a feature word is more likely to be a hot word. Determining the second weight of each word to be extracted from its first weight and the feature word weights makes the computation of the second weights more reasonable.
After the second weights are determined, the predetermined number of words with the highest second weights can be extracted as hot words. Specifically, the words to be extracted can be sorted by second weight from high to low, and the top predetermined number of words after sorting extracted as hot words. The predetermined number can be set according to actual needs.
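As a minimal sketch of step S130, assuming the second weights are held in a dict keyed by word (the function and variable names are illustrative, not from the patent):

```python
def extract_hot_words(second_weights, top_n):
    """Sort words by second weight (descending) and keep the top_n as hot words."""
    ranked = sorted(second_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:top_n]]

weights = {"vaccine": 0.9, "company": 0.4, "fraud": 0.7, "the": 0.1}
print(extract_hot_words(weights, 2))  # ['vaccine', 'fraud']
```

The predetermined number `top_n` corresponds to the "predetermined number" set according to actual needs.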
The hot word extraction method provided by this embodiment determines a first weight for each word to be extracted, identifies the feature words among them together with their weights, and determines a second weight from the first weight and the feature word weights. This makes the computation of the second weights more reasonable, improves the accuracy of hot word extraction, and meets users' needs.
In a possible implementation of this embodiment, the feature words may include at least one of the following:
the title feature words of the article titles in the text data set;
the theme feature words of the text data set;
the search keywords of the text data set;
where a search keyword is a word used in a search that determined the text data set.
In this embodiment, each article in the text data set may have one or more titles. Since an article's title generally reflects its purport, title feature words that characterize the title can be determined from it and used as feature words.
The theme feature words of the text data set can be determined from the topic distribution of the articles; they characterize the articles' topics. Specifically, the n theme feature words with the highest probabilities can be selected as feature words, where the specific value of n can be set according to actual needs.
A search keyword is a word used in a search that determined the text data set; it characterizes the theme of the text data set, so search keywords can also serve as feature words.
In this embodiment, the feature words are not limited to the above; other types of feature words can be added as needed. Because the first weights of the words to be extracted are computed by a statistical method (such as tf-iwf), they are biased toward word frequency statistics. To avoid word frequency dominating the hot word extraction, the computation of the second weights boosts the weights of the feature words in the text data set and can combine multiple types of feature words, so that the various feature words are considered comprehensively and the accuracy of hot word extraction is improved.
In a possible implementation of this embodiment, determining the second weight of each word to be extracted based on its first weight and the feature word weights may include:
adding the weight of each feature word to the first weight of the matching word among the words to be extracted, to obtain the second weights of the words to be extracted;
where the feature word weights include at least one of the following:
the weights of the title feature words;
the weights of the theme feature words;
the preset weights of the search keywords.
In this embodiment, when the feature words are title feature words, the second weights of the words to be extracted can be determined by formula (1) below:
score_2(w) = score_1(w) + score(w_t)    formula (1)
where w denotes the set of words to be extracted, score_1(w) the set of their first weights, score_2(w) the set of their second weights, w_t the set of title feature words, and score(w_t) the set of title feature word weights.
In the actual calculation, each title feature word is matched against the set of words to be extracted, and its weight is added only to the first weight of the matching word to obtain that word's second weight; for the other words in the set, the first weight is taken directly as the second weight. This determines the set of second weights of the words to be extracted.
In this embodiment, when the feature words are theme feature words, the second weights of the words to be extracted can be determined by formula (2) below:
score_2(w) = score_1(w) + score(w_topic)    formula (2)
where w_topic denotes the set of theme feature words and score(w_topic) the set of theme feature word weights. The specific way of adding the theme feature word weights to the first weights follows the same matching procedure as for the title feature words above.
In this embodiment, when the feature words are search keywords, the second weights of the words to be extracted can be determined by formula (3) below:
score_2(w) = score_1(w) + score(w_key)    formula (3)
where w_key denotes the set of search keywords and score(w_key) the set of search keyword weights. The addition of the search keyword weights to the first weights again follows the same matching procedure as for the title feature words.
When the feature words include title feature words, theme feature words and search keywords, the second weights of the words to be extracted can be determined by formula (4) below:
score_2(w) = score_1(w) + score(w_t) + score(w_topic) + score(w_key)    formula (4)
In the actual calculation, following the same matching procedure as for the title feature words, the weights of the title feature words, the theme feature words and the search keywords are each added to the first weights of the matching words to obtain the second weights.
By using title feature words, theme feature words and search keywords together as feature words in the weight calculation, the popularity of a word to be extracted is measured not only by word frequency and inverse word frequency but also by its position in the text, whether it is a theme feature word, and whether it is a search keyword. Considering these factors comprehensively can greatly improve the accuracy and the effect of hot word extraction.
In this embodiment, to facilitate subsequent processing, the second weight of each word to be extracted can be normalized to obtain a third weight, and the words to be extracted then sorted by third weight to determine the hot words.
Specifically, the third weight of any word to be extracted can be determined by formula (5) below:
F(w_l) = score_2(w_l) / score_2(w)_max    formula (5)
where score_2(w_l) denotes the second weight of any word w_l to be extracted, score_2(w)_max the largest second weight in the set w of words to be extracted, and F(w_l) the third weight of w_l.
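The feature word weighting of formula (4) and the normalization of formula (5) can be sketched as follows, under the assumption that all weights are stored in dicts keyed by word (the names are illustrative, not from the patent):

```python
def second_weights(first, title_w, topic_w, key_w):
    """Formula (4): add each feature word's weight to the matching word's first weight."""
    second = dict(first)
    for feats in (title_w, topic_w, key_w):
        for word, w in feats.items():
            if word in second:  # only matching words to be extracted are boosted
                second[word] += w
    return second

def third_weights(second):
    """Formula (5): normalize each second weight by the largest second weight."""
    m = max(second.values())
    return {word: w / m for word, w in second.items()}

first = {"vaccine": 0.5, "fraud": 0.4, "city": 0.1}
s = second_weights(first, {"vaccine": 0.3}, {"fraud": 0.2}, {})
print(third_weights(s))  # 'vaccine' normalizes to 1.0
```

Words that match no feature word keep their first weight as the second weight, as described above.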
In a possible implementation of this embodiment, determining the first weights of the words to be extracted in the text data set may include:
determining the term frequency-inverse word frequency (tf-iwf) value of each word to be extracted in the text data set as its first weight;
and/or
determining the first weights of the words to be extracted from the article bodies in the text data set.
When the statistics-based weighting method tf-idf computes the importance of a word relative to a document set, the word's importance is proportional to its frequency in the text and inversely proportional to its frequency across the document set. Within a set of documents of the same class this has a serious drawback: the feature words shared by such documents are often suppressed.
In this embodiment the first weights are therefore determined by tf-iwf, i.e. weighting by word frequency and inverse word frequency. This method both reduces the weight of high-frequency useless words and avoids the drawback of suppressing the feature words of same-class documents.
In this embodiment, the words to be extracted may be taken only from the body of each article in the text data set, and their first weights determined from the bodies. Separating the body of an article from its title, and extracting the words to be extracted from the body, makes the title feature words more effective when determining the second weights.
Fig. 2 shows the flow of determining the first weights, as follows:
The text of the article bodies in the text data set is segmented. An entity word dictionary is built, e.g. mapping the abbreviation "ABCD" to "ABCD company", to strengthen the segmentation; the resulting words are stored in a list whose elements are the word lists of the body of each document.
The jieba segmenter can be used, preloading a custom dictionary file in advance. For example, for the sentence "ABCD company's rabies vaccine fraud was investigated.", the segmentation result without the custom dictionary is "AB / CD / rabies / vaccine / fraud / was investigated / .": the common word "rabies vaccine" is split into two words, and "ABCD", the abbreviation of "ABCD company", is also split, although these common words should not be separated. After loading the custom dictionary, the segmentation result is "ABCD / rabies vaccine / fraud / was investigated / .".
Part-of-speech tags are then added to the words. Words of parts of speech such as adjectives, adjectival descriptives, adnouns (descriptive words with noun function), nouns, person names, place names, group names, organization names, other proper nouns, verbs, auxiliary verbs, nominal verbs, abbreviations, adverbs and idioms are kept, while useless words such as particles, modal particles and stop words are filtered out.
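The effect of the custom dictionary can be illustrated with a stdlib-only greedy longest-match segmenter, a toy stand-in for jieba (the dictionary entries are illustrative assumptions):

```python
def segment(text, dictionary):
    """Greedy forward longest-match segmentation over a word dictionary."""
    longest = max(map(len, dictionary))
    words, i = [], 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                words.append(text[i:i + size])
                i += size
                break
        else:  # no dictionary entry matches: fall back to a single character
            words.append(text[i])
            i += 1
    return words

base = {"AB", "CD", "vaccine", "fraud"}
print(segment("ABCDvaccinefraud", base))             # ['AB', 'CD', 'vaccine', 'fraud']
print(segment("ABCDvaccinefraud", base | {"ABCD"}))  # ['ABCD', 'vaccine', 'fraud']
```

Adding the entity word "ABCD" to the dictionary keeps the abbreviation whole, mirroring the jieba custom-dictionary example above.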
The tf-iwf algorithm can then be used to compute the weight values of the part-of-speech-tagged words, i.e. the first weights of the words to be extracted. Specifically, the first weight of any word to be extracted can be determined by formula (6) below:
score_1(w_l) = tf(w_l) * iwf(w_l)    formula (6)
where score_1(w_l) denotes the first weight of any word w_l to be extracted, tf(w_l) its word frequency, and iwf(w_l) its inverse word frequency.
Specifically, the word frequency of any word to be extracted can be determined by formula (7):
tf(w_l) = n(w_l) / D(w)    formula (7)
where n(w_l) denotes the number of occurrences of w_l in the text data (i.e. the article bodies in the text data set) and D(w) the total number of words in the text data.
The inverse word frequency of any word to be extracted can be determined by formula (8):
iwf(w_l) = log(M / N(w_l))    formula (8)
where N(w_l) denotes the total frequency of occurrences of w_l in the text data and M the total number of words to be extracted in the text data.
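A minimal sketch of the tf-iwf computation of formulas (6)-(8), assuming the input is a flat list of tokens; the logarithmic form of iwf is a reconstruction suggested by the definitions above, not a verbatim formula from the patent:

```python
import math
from collections import Counter

def tf_iwf(tokens):
    """First weight per formula (6): tf(w) * iwf(w) over one token list."""
    counts = Counter(tokens)
    total = len(tokens)  # D(w) and M coincide here: the total word count
    return {w: (n / total) * math.log(total / n) for w, n in counts.items()}

tokens = ["vaccine", "fraud", "vaccine", "company", "the", "the", "the", "the", "the"]
weights = tf_iwf(tokens)
print(max(weights, key=weights.get))  # 'vaccine' outranks the very frequent 'the'
```

The iwf factor dampens words whose frequency dominates the corpus, which is the stated motivation for preferring tf-iwf over raw frequency.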
In a possible implementation of this embodiment, when the feature words include the title feature words of the article titles in the text data set, determining the weight of the feature words among the words to be extracted may include: determining the tf-iwf values of the title feature words in the titles of the text data set as the title feature word weights. In this embodiment, the weights of the title feature words in the article titles can also be determined by tf-iwf; the specific calculation follows that of the first weights of the words to be extracted above.
Fig. 3 shows the flow of determining the title feature word weights, as follows:
The titles in the text data set are segmented to obtain the title feature words, and the word frequency and inverse word frequency of each title feature word are counted.
In a possible implementation of this embodiment, when the feature words include theme feature words, determining the weight of the feature words among the words to be extracted may include:
determining the probabilities of the theme feature words in the text data set based on a Latent Dirichlet Allocation (LDA) model, and taking the probability of each theme feature word as its weight.
In this embodiment, the LDA model of the gensim library can be used to extract theme feature words from each article of the text data set, generating the topics and the corresponding probability distributions over theme feature words. The top n theme feature words by probability are taken as feature words, and the probability of each extracted theme feature word is taken as its theme feature word weight. In actual use, only the top ten topics of each article by weight may be retained.
Fig. 4 shows the flow of determining the theme feature word weights, as follows:
Topics are extracted from the text data in the text data set, yielding the topics and the corresponding probability distributions over theme feature words; the top n theme feature words by probability are taken as feature words, and the probability of each extracted theme feature word is determined to be its theme feature word weight.
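Selecting the top-n theme feature words from a topic's word distribution can be sketched as follows; the distribution here is illustrative, whereas in practice it would come from a topic model such as gensim's LdaModel:

```python
def top_theme_words(topic_dist, n):
    """Keep the n most probable words of a topic; their probabilities become weights."""
    ranked = sorted(topic_dist.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:n])

dist = {"vaccine": 0.31, "fraud": 0.22, "company": 0.09, "city": 0.02}
print(top_theme_words(dist, 2))  # {'vaccine': 0.31, 'fraud': 0.22}
```

The returned dict can feed directly into the second-weight addition of formula (4).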
In a possible implementation of this embodiment, before determining the first weights of the words to be extracted in the text data set, the method may further include:
performing noise cleaning on the text data set.
In actual use, crawled web page text usually contains a certain amount of redundant and noisy information, such as non-Chinese characters like "@", web links, redundant characters and some punctuation marks. In this embodiment, this redundant and noisy information can be filtered out before the first weights of the words to be extracted are determined, reducing interference with the subsequent calculation. In a specific implementation, the noise cleaning can be performed on the text data set by a regular expression matching algorithm.
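A minimal regex-based cleaning pass might look like this; the patterns are illustrative assumptions, not the patent's actual rules:

```python
import re

def clean(text):
    """Strip URLs, @-mentions and stray symbols, then collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text)  # web links
    text = re.sub(r"@\S+", " ", text)          # @-mentions and similar noise
    text = re.sub(r"[#*^~|]+", " ", text)      # redundant characters
    return re.sub(r"\s+", " ", text).strip()

print(clean("Vaccine fraud @user http://example.com was #investigated"))
# Vaccine fraud was investigated
```

In practice the pattern set would be tuned to the crawler's actual noise profile.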
Fig. 5 shows the design flow of a hot word extraction method provided by an embodiment of this application. The specific flow is: the text data in the text data set undergoes data cleaning (i.e. noise cleaning); the word weights in the text data set, i.e. the first weights of the words to be extracted, are determined by tf-iwf; the title feature word weights are computed and title weighting is applied, i.e. the title feature word weights are added to the first weights; the theme feature word weights are determined by the LDA model and theme weighting is applied, i.e. the theme feature word weights are added to the result of the title weighting; keyword weighting is then applied based on the search keyword weights, i.e. the search keyword weights are added to the result of the previous additions, yielding the second weights; the second weights are normalized to obtain the third weights; and the words to be extracted are output sorted by third weight.
Fig. 6 shows the processing flow of a hot word extraction service system. It is implemented on a big data Spark cluster, managed by the zookeeper service, and realized with Spark Streaming, meeting the near-real-time hot word extraction requirements of massive data. The hot word extraction service system monitors input files on the Hadoop Distributed File System (HDFS) in real time, reads the text content as a discretized stream, calls the hot word extraction algorithm to analyze the discretized data, extracts the top-N hot words, and finally writes the hot word extraction results into a topic of the kafka message queue.
This big-data-based hot word extraction service system, using the big data Spark platform combined with an advanced messaging system, lets the service support massive data and provide service in real time.
Based on the same principle as the method shown in Fig. 1, an embodiment of the present application further provides a hot word extraction apparatus. As shown in Fig. 7, the hot word extraction apparatus 20 may include:
a first weight determination module 210, configured to determine a first weight of each word to be extracted in the text data set;
a second weight determination module 220, configured to determine the weights of the feature words among the words to be extracted, and to determine a second weight of each word to be extracted based on its first weight and the feature-word weights;
a hot word extraction module 230, configured to extract, as hot words, a predetermined number of words to be extracted with the highest second weights.
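The top-N selection performed by the hot word extraction module can be sketched as follows (a hypothetical helper, not the patent's code):

```python
import heapq

def extract_hot_words(second_weights, n):
    """Return the n candidate words with the highest second weight,
    in descending weight order."""
    return [w for w, _ in heapq.nlargest(
        n, second_weights.items(), key=lambda kv: kv[1])]
```

Using a heap keeps the selection at O(k log n) for k candidates, which matters when the candidate vocabulary from a streaming corpus is large.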
The hot word extraction apparatus provided in this embodiment determines the first weight of each word to be extracted, identifies the feature words among them together with their weights, and determines the second weight from the first weight and the feature-word weights. This makes the computation of the second weight more reasonable, so extracting hot words based on the second weight improves extraction accuracy and better meets users' needs.
Optionally, the above feature words include at least one of the following:
title feature words from the article titles in the text data set;
theme feature words of the text data set;
search keywords of the text data set;
where a search keyword is a word through which the text data set was retrieved by search.
Optionally, when determining the second weight of a word to be extracted based on its first weight and the weights of the feature words, the above second weight determination module is specifically configured to:
add the weight of each feature word to the first weight of the word to be extracted that matches that feature word, to obtain the second weight of the word to be extracted;
wherein the weights of the feature words include at least one of the following:
the weight of the title feature word;
the weight of the theme feature word;
a preset weight of the search keyword.
Optionally, the above first weight determination module is specifically configured to:
determine the tf-iwf value of each word to be extracted in the text data set as its first weight;
and/or
determine the first weight of each word to be extracted from the article body text in the text data set.
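tf-iwf (term frequency-inverse word frequency) is not defined in this excerpt; a common formulation multiplies a word's in-document frequency by the log of the ratio of total background-corpus tokens to the word's corpus frequency. A sketch under that assumption:

```python
import math
from collections import Counter

def tf_iwf(doc_tokens, corpus_counts):
    """First-weight sketch: tf(w) * log(total corpus tokens / corpus freq of w).

    `corpus_counts` is a Counter of word frequencies over a background corpus.
    The exact tf-iwf formulation is an assumption, not quoted from the patent.
    """
    tf = Counter(doc_tokens)
    total = sum(corpus_counts.values())
    weights = {}
    for word, count in tf.items():
        # Rarer corpus words get a larger inverse-word-frequency factor.
        iwf = math.log(total / corpus_counts.get(word, 1))
        weights[word] = (count / len(doc_tokens)) * iwf
    return weights
```

Like tf-idf, this down-weights words that are frequent everywhere (e.g. function words) and up-weights words that are frequent in the current text but rare in the background corpus.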
Optionally, when the feature words include title feature words from the article titles in the text data set, the above second weight determination module, when determining the weights of the feature words among the words to be extracted, is specifically configured to:
determine the tf-iwf value of each title feature word in the titles of the text data set as the weight of that title feature word.
Optionally, when the feature words include theme feature words, the above second weight determination module, when determining the weights of the feature words among the words to be extracted, is specifically configured to:
determine, based on LDA (latent Dirichlet allocation), the probability of each theme feature word in the text data set, and determine the weight of the theme feature word from that probability.
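A trained LDA model yields document-topic and topic-word probability distributions; one plausible reading of "determine the weight of the theme feature word from its probability" is the word's expected probability under the document set's topic mixture. A hedged sketch over such precomputed distributions (the aggregation rule is an assumption):

```python
def topic_word_weights(doc_topic_probs, topic_word_probs):
    """Derive theme-feature-word weights from an LDA model's outputs.

    doc_topic_probs:  {topic_id: P(topic | document set)}
    topic_word_probs: {topic_id: {word: P(word | topic)}}
    The weight of a word is its expected probability under the topic
    mixture, i.e. sum over topics of P(topic) * P(word | topic).
    """
    weights = {}
    for topic, p_t in doc_topic_probs.items():
        for word, p_w in topic_word_probs.get(topic, {}).items():
            weights[word] = weights.get(word, 0.0) + p_t * p_w
    return weights
```

In practice the two input distributions would come from an LDA library fitted on the text data set; only the weight derivation is sketched here.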
Optionally, before determining the first weight of the words to be extracted in the text data set, the above method may further include:
performing noise cleaning on the text data set.
Optionally, performing noise cleaning on the text data set may include:
performing noise cleaning on the text data set based on a regular-expression matching algorithm.
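A minimal sketch of regex-based noise cleaning (the specific noise patterns are illustrative assumptions; the patent only specifies regular-expression matching generically):

```python
import re

# Hypothetical noise patterns: HTML markup, URLs, and runs of whitespace.
NOISE_PATTERNS = [
    (re.compile(r'<[^>]+>'), ' '),       # strip HTML tags
    (re.compile(r'https?://\S+'), ' '),  # strip URLs
    (re.compile(r'\s+'), ' '),           # collapse whitespace
]

def clean_text(text):
    """Apply each noise regex in turn and trim the result."""
    for pattern, repl in NOISE_PATTERNS:
        text = pattern.sub(repl, text)
    return text.strip()
```

For crawled news text, this kind of pass would typically run once per document before tokenization and weight computation.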
It can be understood that the modules of the hot word extraction apparatus in this embodiment have the functions of implementing the corresponding steps of the hot word extraction method in the embodiment shown in Fig. 1. These functions may be implemented in hardware, or in software executed by hardware; the hardware or software comprises one or more modules corresponding to the above functions. The modules may be software and/or hardware, and each module may be implemented separately or multiple modules may be implemented in an integrated manner. For detailed functional descriptions of the modules of the hot word extraction apparatus, refer to the corresponding description of the hot word extraction method in the embodiment shown in Fig. 1, which is not repeated here.
An embodiment of the present application provides an electronic device. As shown in Fig. 8, the electronic device 2000 includes a processor 2001 and a memory 2003, where the processor 2001 is connected to the memory 2003, for example via a bus 2002. Optionally, the electronic device 2000 may further include a transceiver 2004. Note that in practice there may be more than one transceiver 2004, and the structure of the electronic device 2000 does not constitute a limitation on the embodiments of the present application.
The processor 2001 is used in the embodiments of the present application to implement the method shown in the above method embodiments. The transceiver 2004 may include a receiver and a transmitter, and is used in the embodiments of the present application to perform the communication of the electronic device with other devices.
The processor 2001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in this disclosure. The processor 2001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 2002 may include a path for transferring information between the above components. The bus 2002 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented by only one thick line in Fig. 8, but this does not mean that there is only one bus or only one type of bus.
The memory 2003 may be a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
Optionally, the memory 2003 is used to store application code for executing the solution of the present application, under the control of the processor 2001. The processor 2001 executes the application code stored in the memory 2003 to implement the hot word extraction method shown in the above method embodiments.
The electronic device provided by the embodiments of the present application is applicable to any of the above method embodiments, and details are not repeated here.
An embodiment of the present application provides an electronic device. Compared with the prior art, it can determine the first weight of each word to be extracted, determine the feature words among them and their weights, and determine the second weight of each word to be extracted from the first weight and the feature-word weights. This makes the computation of the second weight more reasonable, so extracting hot words based on the second weight improves extraction accuracy and meets users' needs.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the hot word extraction method shown in the above method embodiments is implemented.
The computer-readable storage medium provided by the embodiments of the present application is applicable to any of the above method embodiments, and details are not repeated here.
An embodiment of the present application provides a computer-readable storage medium. Compared with the prior art, it can determine the first weight of each word to be extracted, determine the feature words among them and their weights, and determine the second weight of each word to be extracted from the first weight and the feature-word weights. This makes the computation of the second weight more reasonable, so extracting hot words based on the second weight improves extraction accuracy and meets users' needs.
It should be understood that although the steps in the flowcharts of the drawings are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated herein, there is no strict ordering constraint on their execution, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which need not be completed at the same moment but may be executed at different times; their execution order also need not be sequential, and they may be executed in turn or alternately with at least some of the sub-steps or stages of other steps.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (12)
1. A hot word extraction method, comprising:
determining a first weight of each word to be extracted in a text data set;
determining weights of feature words among the words to be extracted, and determining a second weight of each word to be extracted based on the first weight of the word to be extracted and the weights of the feature words;
extracting, as hot words, a predetermined number of words to be extracted with the highest second weights.
2. The hot word extraction method according to claim 1, wherein the feature words include at least one of the following:
title feature words from article titles in the text data set;
theme feature words of the text data set;
search keywords of the text data set;
wherein a search keyword is a word through which the text data set is determined by search.
3. The hot word extraction method according to claim 2, wherein determining the second weight of the word to be extracted based on the first weight of the word to be extracted and the weights of the feature words comprises:
adding the weight of each feature word to the first weight of the word to be extracted that matches that feature word, to obtain the second weight of the word to be extracted;
wherein the weights of the feature words include at least one of the following:
the weight of the title feature word;
the weight of the theme feature word;
a preset weight of the search keyword.
4. The hot word extraction method according to claim 1, wherein determining the first weight of each word to be extracted in the text data set comprises:
determining the term frequency-inverse word frequency (tf-iwf) value of the word to be extracted in the text data set as the first weight;
and/or
determining the first weight of the word to be extracted from the article body text in the text data set.
5. The hot word extraction method according to claim 2, wherein when the feature words include title feature words from the article titles in the text data set, determining the weights of the feature words among the words to be extracted comprises:
determining the tf-iwf value of each title feature word in the titles of the text data set as the weight of that title feature word.
6. The hot word extraction method according to claim 2, wherein when the feature words include the theme feature words, determining the weights of the feature words among the words to be extracted comprises:
determining, based on a latent Dirichlet allocation (LDA) model, the probability of each theme feature word in the text data set, and determining the weight of the theme feature word from that probability.
7. The hot word extraction method according to any one of claims 1-6, further comprising, before determining the first weight of the words to be extracted in the text data set:
performing noise cleaning on the text data set.
8. The hot word extraction method according to claim 7, wherein performing noise cleaning on the text data set comprises:
performing noise cleaning on the text data set based on a regular-expression matching algorithm.
9. A hot word extraction apparatus, comprising:
a first weight determination module, configured to determine a first weight of each word to be extracted in a text data set;
a second weight determination module, configured to determine weights of feature words among the words to be extracted, and to determine a second weight of each word to be extracted based on the first weight of the word to be extracted and the weights of the feature words;
a hot word extraction module, configured to extract, as hot words, a predetermined number of words to be extracted with the highest second weights.
10. The hot word extraction apparatus according to claim 9, wherein the feature words include at least one of the following:
title feature words from article titles in the text data set;
theme feature words of the text data set;
search keywords of the text data set;
wherein a search keyword is a word through which the text data set is determined by search.
11. An electronic device, comprising a processor and a memory;
the memory being configured to store operation instructions;
the processor being configured to execute, by invoking the operation instructions, the hot word extraction method according to any one of claims 1-8.
12. A computer-readable storage medium on which a computer program is stored, wherein when the program is executed by a processor, the hot word extraction method according to any one of claims 1-8 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811638597.7A CN109710944A (en) | 2018-12-29 | 2018-12-29 | Hot word extracting method, device, electronic equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109710944A true CN109710944A (en) | 2019-05-03 |
Family
ID=66260188
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811638597.7A Pending CN109710944A (en) | 2018-12-29 | 2018-12-29 | Hot word extracting method, device, electronic equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109710944A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110334268A (en) * | 2019-07-05 | 2019-10-15 | 北京国创动力文化传媒有限公司 | A kind of block chain project hot word generation method and device |
CN110874530A (en) * | 2019-10-30 | 2020-03-10 | 深圳价值在线信息科技股份有限公司 | Keyword extraction method and device, terminal equipment and storage medium |
CN110874530B (en) * | 2019-10-30 | 2023-06-13 | 深圳价值在线信息科技股份有限公司 | Keyword extraction method, keyword extraction device, terminal equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120226641A1 (en) * | 2010-09-29 | 2012-09-06 | Yahoo! Inc. | Training a search query intent classifier using wiki article titles and a search click log |
CN104615593A (en) * | 2013-11-01 | 2015-05-13 | 北大方正集团有限公司 | Method and device for automatic detection of microblog hot topics |
CN105354333A (en) * | 2015-12-07 | 2016-02-24 | 天云融创数据科技(北京)有限公司 | Topic extraction method based on news text |
CN108228808A (en) * | 2017-12-29 | 2018-06-29 | 东软集团股份有限公司 | Determine the method, apparatus of focus incident and storage medium and electronic equipment |
CN108363694A (en) * | 2018-02-23 | 2018-08-03 | 北京窝头网络科技有限公司 | Keyword extracting method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190503 |