CN109710944A - Hot word extracting method, device, electronic equipment and computer readable storage medium

Publication number
CN109710944A
Authority
CN
China
Legal status
Pending
Application number
CN201811638597.7A
Other languages
Chinese (zh)
Inventor
韩勇
赵立永
吴新丽
Current Assignee
XINHUA NETWORK CO Ltd
Original Assignee
XINHUA NETWORK CO Ltd

Abstract

The embodiments of the present application provide a hot word extraction method, apparatus, electronic device, and computer-readable storage medium. The method comprises: determining a first weight of each word to be extracted in a text data set; determining the weights of the feature words among the words to be extracted, and determining a second weight of each word to be extracted based on its first weight and the weights of the feature words; and extracting the predetermined number of words to be extracted with the highest second weights as hot words. Because the second weight combines the first weight with the feature word weights, it is calculated more reasonably, so extracting hot words by second weight improves extraction accuracy and meets users' needs.

Description

Hot word extracting method, device, electronic equipment and computer readable storage medium
Technical field
The present application relates to the technical field of data processing, and in particular to a hot word extraction method, apparatus, electronic device, and computer-readable storage medium.
Background art
Hot words, i.e. popular vocabulary, can represent the content that Internet users care about, as well as the central point of an event or theme. Hot word extraction technology generates hot-spot information through content aggregation techniques, helping Internet users find trending events within massive amounts of information. As people's online lives grow richer and the volume of information grows, hot word extraction technology becomes increasingly important.
Existing hot word extraction techniques are mainly implemented with statistics-based methods, which results in poor extraction accuracy and fails to meet users' needs.
Summary of the invention
The present application provides a hot word extraction method, apparatus, electronic device, and computer-readable storage medium to solve the problem that hot word extraction has poor accuracy and cannot meet users' needs. The technical solutions adopted by the present application are as follows:
In a first aspect, the present application provides a hot word extraction method, comprising:
determining a first weight of each word to be extracted in a text data set;
determining the weights of the feature words among the words to be extracted, and determining a second weight of each word to be extracted based on the first weight of the word to be extracted and the weights of the feature words;
sorting the words to be extracted by second weight, and extracting the predetermined number of top-ranked words with the highest second weights as hot words.
In a second aspect, the present application provides a hot word extraction apparatus, comprising:
a first weight determination module, configured to determine a first weight of each word to be extracted in a text data set;
a second weight determination module, configured to determine the weights of the feature words among the words to be extracted, and to determine a second weight of each word to be extracted based on the first weight of the word to be extracted and the weights of the feature words;
a hot word extraction module, configured to sort the words to be extracted by second weight and to extract the predetermined number of top-ranked words with the highest second weights as hot words.
In a third aspect, the present application provides an electronic device, comprising a processor and a memory;
the memory is configured to store operation instructions;
the processor is configured to execute the hot word extraction method of the first aspect of the present application by calling the operation instructions.
In a fourth aspect, the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the hot word extraction method of the first aspect of the present application.
The technical solutions provided by the embodiments of the present application have the following beneficial effects:
The solution provided by this embodiment determines the first weight of each word to be extracted, identifies the feature words among the words to be extracted together with their weights, and determines the second weight of each word to be extracted from the first weight and the feature word weights. This makes the second weight calculation more reasonable, so extracting hot words by second weight improves the accuracy of hot word extraction and meets users' needs.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed to describe the embodiments are briefly introduced below.
Fig. 1 is a flow diagram of a hot word extraction method provided by an embodiment of the present application;
Fig. 2 is a flow diagram of determining the first weight;
Fig. 3 is a flow diagram of determining the title feature word weights;
Fig. 4 is a flow diagram of determining the theme feature word weights;
Fig. 5 is a design flow diagram of a hot word extraction method provided by an embodiment of the present application;
Fig. 6 is a processing flow diagram of a hot word extraction service system;
Fig. 7 is a structural diagram of a hot word extraction apparatus provided by an embodiment of the present application;
Fig. 8 is a structural diagram of an electronic device provided by an embodiment of the present application.
Detailed description of the embodiments
Embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote, throughout, the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present application and are not to be construed as limiting the claims.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said", and "the" used herein may also include the plural. It should be further understood that the word "comprising" used in the specification of the present application indicates the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein includes all or any unit of, and all combinations of, one or more of the associated listed items.
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Existing hot word extraction techniques mainly involve document-set feature extraction and document-set theme distribution. Document-set feature extraction builds on word frequency statistics and inverse document frequency statistics, using term frequency-inverse document frequency (tf-idf) and its improved variants to calculate word weights and thereby extract hot words. Document-set theme distribution generally uses a topic model that establishes semantic associations among words, documents, and themes to extract hot words.
Both of these hot word extraction methods calculate hot word weights with statistics-based methods. They take little account of word position or article theme, and they cannot prevent auxiliary words (such as particles and conjunctions) from interfering with hot word extraction. As a result, hot word extraction accuracy is somewhat low and fails to meet users' needs.
The hot word extraction method, apparatus, electronic device, and computer-readable storage medium provided by the present application are intended to solve the above technical problems of the prior art.
The technical solutions of the present application, and how they solve the above technical problems, are described in detail below through specific embodiments. The specific embodiments below may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application are described below with reference to the accompanying drawings.
An embodiment of the present application provides a hot word extraction method. As shown in Fig. 1, the method may mainly include:
Step S110: determine the first weight of each word to be extracted in the text data set.
In this embodiment, the text data set is the base text on which hot word extraction is performed. The text data set may include multiple articles, and the words to be extracted may be words identified from the articles of the text data set, among which the hot words are found. The first weight is the initial weight of each word to be extracted and provides the basis for subsequent calculation.
Step S120: determine the weights of the feature words among the words to be extracted, and determine the second weight of each word to be extracted based on its first weight and the weights of the feature words.
Step S130: extract the predetermined number of words to be extracted with the highest second weights as hot words.
In this embodiment, a feature word is a word that can characterize the text data set, and a feature word is more likely to be a hot word. Determining the second weight of a word to be extracted from both its first weight and the feature word weights makes the second weight calculation more reasonable.
In this embodiment, after the second weights are determined, the predetermined number of words to be extracted with the highest second weights can be extracted as hot words. Specifically, the words to be extracted can be sorted from high to low by second weight, and the first predetermined number of words after sorting are extracted as hot words. The predetermined number can be set according to actual needs.
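The sorting and selection in step S130 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `extract_hot_words`, the dictionary representation of the second weights, and the alphabetical tie-break are all assumptions made for the example.

```python
def extract_hot_words(second_weights, top_n):
    """Return the top_n words with the highest second weight, best first.

    second_weights: dict mapping each word to be extracted to its second weight
    (hypothetical representation chosen for this sketch).
    """
    # Sort by weight, highest first; ties are broken alphabetically so the
    # result is deterministic (the tie-break rule is an assumption).
    ranked = sorted(second_weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


weights = {"vaccine": 0.9, "company": 0.4, "fraud": 0.7, "report": 0.2}
hot_words = extract_hot_words(weights, top_n=2)
print(hot_words)  # ['vaccine', 'fraud']
```

If `top_n` exceeds the number of candidate words, the slice simply returns all of them in ranked order.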
The hot word extraction method provided by this embodiment determines the first weight of each word to be extracted, identifies the feature words among the words to be extracted together with their weights, and determines the second weight of each word to be extracted from the first weight and the feature word weights. This makes the second weight calculation more reasonable, which improves the accuracy of hot word extraction and meets users' needs.
In one possible implementation of the embodiment of the present application, the feature words may include at least one of the following:
title feature words of the article titles in the text data set;
theme feature words of the text data set;
search keywords of the text data set;
where a search keyword is a word used in a search to determine the text data set.
In this embodiment, an article in the text data set may have one or more titles. Because an article's title generally reflects the gist of the article, title feature words that characterize the title can be determined from the article's title and used as feature words.
The theme feature words of the text data set can be determined from the theme distribution of the articles, and they characterize the article themes. Specifically, the n theme feature words with the highest probabilities can be determined and used as feature words. The value of n can be set according to actual needs.
A search keyword is a word used in the search that determines the text data set. Because search keywords can characterize the theme of the text data set, they can be used as feature words.
In this embodiment, the feature words are not limited to the types above, and further types of feature words can be added according to actual needs. Because the first weight of a word to be extracted is calculated by a statistical method (such as tf-iwf), it leans heavily on word frequency statistics. To keep word frequency from dominating hot word extraction, the weights of the feature words in the text data set can be boosted during the second weight calculation, and multiple types of feature words can be combined so that they are considered together, improving the accuracy of hot word extraction.
In one possible implementation of the embodiment of the present application, determining the second weight of a word to be extracted based on its first weight and the weights of the feature words may include:
adding the weight of each feature word to the first weight of the word among the words to be extracted that is identical to that feature word, to obtain the second weight of the word to be extracted;
where the weights of the feature words include at least one of the following:
the weight of a title feature word;
the weight of a theme feature word;
the preset weight of a search keyword.
In this embodiment, when the feature words are title feature words, the second weight of the words to be extracted can be determined by formula (1) below:
$$\mathrm{score}_2(w) = \mathrm{score}_1(w) + \mathrm{score}(w_t) \tag{1}$$
where $w$ denotes the set of words to be extracted; $\mathrm{score}_1(w)$ denotes the set of first weights of the words to be extracted; $\mathrm{score}_2(w)$ denotes the set of second weights of the words to be extracted; $w_t$ denotes the set of title feature words; and $\mathrm{score}(w_t)$ denotes the set of title feature word weights.
In the actual calculation, each title feature word is looked up in the set of words to be extracted to find the identical word. The weight of each title feature word is added only to the first weight of its identical word, giving the second weight of that word. For every other word in the set of words to be extracted, the first weight is used directly as the second weight. This determines the set of second weights of the words to be extracted.
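The lookup-and-add procedure just described can be sketched in a few lines. The sketch assumes weights are held in dictionaries keyed by word, and that a title feature word with no identical word among the words to be extracted is simply ignored; the source does not state what happens in that case, and the name `add_feature_weights` is hypothetical.

```python
def add_feature_weights(first_weights, feature_weights):
    """Add each feature word's weight to the first weight of the identical word."""
    # Words without a matching feature word keep their first weight unchanged.
    second_weights = dict(first_weights)
    for word, weight in feature_weights.items():
        # Assumption: feature words absent from the candidate set are skipped.
        if word in second_weights:
            second_weights[word] += weight
    return second_weights


first = {"vaccine": 0.5, "company": 0.25, "fraud": 0.125}
title = {"vaccine": 0.25, "headline": 1.0}
second = add_feature_weights(first, title)
print(second)  # {'vaccine': 0.75, 'company': 0.25, 'fraud': 0.125}
```

Copying `first_weights` before updating keeps the first weights available for later steps, which matters if several feature word types are applied separately.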
In this embodiment, when the feature words are theme feature words, the second weight of the words to be extracted can be determined by formula (2) below:
$$\mathrm{score}_2(w) = \mathrm{score}_1(w) + \mathrm{score}(w_{topic}) \tag{2}$$
where $w_{topic}$ denotes the set of theme feature words and $\mathrm{score}(w_{topic})$ denotes the set of theme feature word weights.
The theme feature word weights are added to the first weights of the words to be extracted in the same way as the title feature word weights described above.
In this embodiment, when the feature words are search keywords, the second weight of the words to be extracted can be determined by formula (3) below:
$$\mathrm{score}_2(w) = \mathrm{score}_1(w) + \mathrm{score}(w_{key}) \tag{3}$$
where $w_{key}$ denotes the set of search keywords and $\mathrm{score}(w_{key})$ denotes the set of search keyword weights.
The search keyword weights are added to the first weights of the words to be extracted in the same way as the title feature word weights described above.
When the feature words include title feature words, theme feature words, and search keywords, the second weight of the words to be extracted can be determined by formula (4) below:
$$\mathrm{score}_2(w) = \mathrm{score}_1(w) + \mathrm{score}(w_t) + \mathrm{score}(w_{topic}) + \mathrm{score}(w_{key}) \tag{4}$$
In the actual calculation, following the way the title feature word weights are added to the first weights, the title feature word weights, the theme feature word weights, and the search keyword weights are each added to the first weights of the words to be extracted, giving the sequence of second weights.
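A worked example may make formula (4) concrete. The sketch below assumes all four weight sets are dictionaries keyed by word and that a word missing from a feature set contributes 0 for that set; the function name `combined_second_weights` and every numeric value, including the preset search keyword weight, are illustrative only.

```python
def combined_second_weights(first, title, theme, keyword):
    """Formula (4): first weight plus every matching feature word weight."""
    return {
        word: w1
        + title.get(word, 0.0)    # title feature word weight, if any
        + theme.get(word, 0.0)    # theme feature word weight, if any
        + keyword.get(word, 0.0)  # preset search keyword weight, if any
        for word, w1 in first.items()
    }


first = {"vaccine": 0.5, "fraud": 0.25, "company": 0.125}
title = {"vaccine": 0.25}
theme = {"vaccine": 0.125, "fraud": 0.5}
keyword = {"fraud": 1.0}  # illustrative preset weight

result = combined_second_weights(first, title, theme, keyword)
print(result)  # {'vaccine': 0.875, 'fraud': 1.75, 'company': 0.125}
```

Here "fraud" overtakes "vaccine" even though its first weight is lower, because it is both a theme feature word and a search keyword.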
Using title feature words, theme feature words, and search keywords together as feature words in the weight calculation means that, on top of measuring the popularity of a word to be extracted by word frequency and inverse word frequency, several further factors are considered together: the position of the word in the text, whether it is a theme feature word, and whether it is a search keyword. This can greatly improve the accuracy of hot word extraction and improve its results.
In this embodiment, to facilitate subsequent processing, the second weight of each word to be extracted can be normalized to obtain a third weight, and the words to be extracted are then sorted by third weight to determine the hot words.
Specifically, the third weight of any word to be extracted can be determined by formula (5) below:
$$F(w_l) = \frac{\mathrm{score}_2(w_l)}{\mathrm{score}_2(w)_{max}} \tag{5}$$
where $\mathrm{score}_2(w_l)$ denotes the second weight of any word to be extracted $w_l$, $\mathrm{score}_2(w)_{max}$ denotes the largest second weight in the set $w$ of words to be extracted, and $F(w_l)$ denotes the third weight of any word to be extracted $w_l$.
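The normalization can be sketched as below, assuming the third weight is the second weight divided by the largest second weight in the set, which is what the definitions of the maximum second weight and of F suggest. The function name `normalize_weights` and the handling of an empty or all-zero input are assumptions added for the example.

```python
def normalize_weights(second_weights):
    """Scale second weights by the maximum so the top word has third weight 1.0."""
    if not second_weights:
        return {}
    max_weight = max(second_weights.values())
    if max_weight == 0:
        # Assumption: with no positive weight, every third weight is 0.
        return {word: 0.0 for word in second_weights}
    return {word: score / max_weight for word, score in second_weights.items()}


third = normalize_weights({"fraud": 2.0, "vaccine": 1.0, "company": 0.5})
print(third)  # {'fraud': 1.0, 'vaccine': 0.5, 'company': 0.25}
```

Dividing by the maximum preserves the ranking of the second weights, so sorting by third weight selects the same hot words while putting scores from different text data sets on a common 0 to 1 scale.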
In one possible implementation of the embodiment of the present application, determining the first weight of the words to be extracted in the text data set may include:
determining the term frequency-inverse word frequency (tf-iwf) value of each word to be extracted in the text data set as its first weight;
and/or
determining the first weight of the words to be extracted from the article body text in the text data set.
When the statistics-based weight calculation method tf-idf calculates the importance of a word relative to a document set, the importance of the word is directly proportional to the number of times it appears in the text and inversely proportional to its frequency of occurrence in the document set. This method has a serious drawback for a document set of a single category: words that are characteristic of that category occur in most of its documents, so they receive a low inverse document frequency and are often masked.
This embodiment determines the first weight of the words to be extracted based on tf-iwf, that is, by weighting the word frequency with the inverse word frequency. This method both lowers the weight of useless high-frequency words and avoids masking the feature words of same-category documents.
In this embodiment, the words to be extracted may be extracted only from the body of each article in the text data set, and their first weights determined. Separating the body of an article from its title, and extracting the words to be extracted from the body, makes the title feature words more effective in determining the second weight.
Fig. 2 shows a flow diagram of determining the first weight, as follows:
The text data of the article bodies in the text data set is segmented into words. A dictionary of abbreviations and entity words is built (for example, "ABCD" is the abbreviation of "ABCD company") to strengthen segmentation, and the segmented words are stored in a list, each element of which is the list of all words in the body of one document.
The jieba segmenter can be used, with a custom dictionary preloaded. For example, without the custom dictionary, the sentence "ABCD company rabies vaccine fraud is investigated." is segmented as "AB/CD/rabies/vaccine/fraud/investigated/.". Here "rabies vaccine" is a common term that has been split into two words, and "ABCD" is the abbreviation of "ABCD company"; common terms like these should not be split. After the custom dictionary is loaded, the segmentation result is "ABCD/rabies vaccine/fraud/investigated/.".
Part-of-speech tags are added to the words above. Words with the following parts of speech are kept: adjectives, adverbial adjectives, nominal adjectives (adjectives that function as nouns), nouns, personal names, place names, organization and institution names, other proper nouns, verbs, adverbial verbs, nominal verbs, abbreviations, adverbs, idioms, and the like. Useless words such as particles, modal particles, and stop words are filtered out.
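The part-of-speech filter can be sketched without any segmenter installed by starting from words that have already been tagged. The tag codes below (`a`, `ad`, `an`, `n`, `nr`, `ns`, `nt`, `nz`, `v`, `vd`, `vn`, `j`, `d`, `i`) are an assumed mapping of the listed parts of speech onto the ICTCLAS-style tag set that jieba's tagger uses; the source does not give the codes, and the names `KEPT_TAGS` and `filter_candidates` are hypothetical.

```python
# Assumed ICTCLAS-style codes for the parts of speech the text says to keep.
KEPT_TAGS = frozenset({
    "a", "ad", "an",          # adjective, adverbial adjective, nominal adjective
    "n", "nr", "ns", "nt",    # noun, personal name, place name, organization
    "nz",                     # other proper noun
    "v", "vd", "vn",          # verb, adverbial verb, nominal verb
    "j", "d", "i",            # abbreviation, adverb, idiom
})


def filter_candidates(tagged_words, stop_words=frozenset()):
    """Keep words whose tag is in KEPT_TAGS and that are not stop words."""
    return [
        word
        for word, tag in tagged_words
        if tag in KEPT_TAGS and word not in stop_words
    ]


# Hand-tagged tokens for illustration; a real pipeline would get these from a tagger.
tagged = [
    ("ABCD", "j"), ("rabies vaccine", "n"), ("fraud", "vn"),
    ("of", "u"), ("investigated", "v"), (".", "x"),
]
candidates = filter_candidates(tagged)
print(candidates)  # ['ABCD', 'rabies vaccine', 'fraud', 'investigated']
```

Filtering by an allow-list of tags, rather than a deny-list, means any tag not explicitly named (particles, punctuation, and so on) is dropped by default.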
The tf-iwf algorithm can be used to calculate the weight value (i.e. the first weight) of the words remaining after the part-of-speech tagging step above, i.e. the words to be extracted. Specifically, the first weight of any word to be extracted can be determined by formula (6) below:
$$\mathrm{score}_1(w_l) = \mathrm{tf}(w_l) \cdot \mathrm{iwf}(w_l) \tag{6}$$
where $\mathrm{score}_1(w_l)$ denotes the first weight of any word to be extracted $w_l$; $\mathrm{tf}(w_l)$ denotes the word frequency of $w_l$; and $\mathrm{iwf}(w_l)$ denotes the inverse word frequency of $w_l$.
Specifically, the word frequency of any word to be extracted can be determined by formula (7) below:
$$\mathrm{tf}(w_l) = \frac{n(w_l)}{D(w)} \tag{7}$$
where $n(w_l)$ denotes the number of times any word to be extracted $w_l$ appears in the text data (i.e. the article bodies in the text data set), and $D(w)$ denotes the total number of words contained in the text data.
The inverse word frequency of any word to be extracted can be determined by formula (8) below:
where $N(w_l)$ denotes the total frequency with which any word to be extracted $w_l$ appears in the text data, and $M$ denotes the total number of words to be extracted in the text data.
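The calculation can be sketched as follows. Formula (8) is not reproduced in this text, so the sketch assumes the commonly used form of the inverse word frequency, the logarithm of the summed frequencies of all M words divided by the frequency of the word itself; the natural logarithm and the name `tf_iwf` are likewise assumptions. The sketch also treats the whole text data set as a single corpus, so the word frequency is the share of all word occurrences.

```python
import math
from collections import Counter


def tf_iwf(documents):
    """Compute an assumed tf-iwf first weight for every word.

    documents: list of token lists, one per article body.
    """
    counts = Counter(token for doc in documents for token in doc)
    total = sum(counts.values())  # total number of word occurrences
    weights = {}
    for word, n in counts.items():
        tf = n / total              # word frequency
        iwf = math.log(total / n)   # assumed inverse word frequency
        weights[word] = tf * iwf    # first weight = tf * iwf
    return weights


docs = [["vaccine", "fraud", "vaccine"], ["vaccine", "company"]]
first_weights = tf_iwf(docs)
for word, weight in sorted(first_weights.items()):
    print(f"{word}: {weight:.4f}")
```

Under this assumed form, a word that accounts for every occurrence in the corpus gets an inverse word frequency of log(1) = 0, which is how very frequent, uninformative words are pushed down.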
In one possible implementation of the embodiment of the present application, when the feature words include title feature words of the article titles in the text data set, determining the weights of the feature words among the words to be extracted may include: determining the tf-iwf value of each title feature word in the titles of the text data set as the weight of that title feature word. In this embodiment, the weight of a title feature word in an article title can also be determined based on tf-iwf, and the calculation can follow that of the first weight of the words to be extracted described above.
Fig. 3 shows a flow diagram of determining the title feature word weights, as follows:
The titles of the text data set are segmented to obtain the title feature words, and the word frequency and inverse word frequency are calculated for each title feature word.
In one possible implementation of the embodiment of the present application, when the feature words include theme feature words, determining the weights of the feature words among the words to be extracted may include:
determining, based on a latent Dirichlet allocation (LDA) model, the probability of each theme feature word in the text data set, and determining the probability of the theme feature word as its weight.
In this embodiment, the LDA model in the gensim library can be used to extract theme feature words from each article of the text data set, generating themes and the corresponding probability distributions over theme feature words. The top n theme feature words by probability are extracted as feature words, and the probability of each extracted theme feature word is determined as its weight. In actual use, only the top ten themes by weight may be retained for each article.
Fig. 4 shows a flow diagram of determining the theme feature word weights, as follows:
Themes are extracted from the text data in the text data set, yielding the themes and the corresponding probability distributions over theme feature words. The top n theme feature words by probability are extracted as feature words, and the probability of each extracted theme feature word is determined as its theme feature word weight.
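The selection of the top n theme feature words can be sketched independently of any LDA library, starting from theme-to-word probability lists such as those an LDA implementation would produce. The input layout, the name `top_theme_feature_words`, and the rule of keeping a word's highest probability when it appears under several themes are all assumptions; the source does not say how such overlaps are resolved.

```python
def top_theme_feature_words(theme_word_probs, n):
    """Pick the n theme feature words with the highest probability.

    theme_word_probs: list of themes, each a list of (word, probability) pairs.
    Returns a dict mapping each selected word to its weight (its probability).
    """
    best = {}
    for theme in theme_word_probs:
        for word, prob in theme:
            # Assumption: a word found under several themes keeps its highest probability.
            if prob > best.get(word, 0.0):
                best[word] = prob
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return dict(ranked[:n])


themes = [
    [("vaccine", 0.5), ("fraud", 0.25), ("company", 0.125)],
    [("fraud", 0.375), ("recall", 0.0625)],
]
theme_weights = top_theme_feature_words(themes, n=2)
print(theme_weights)  # {'vaccine': 0.5, 'fraud': 0.375}
```

The returned dictionary has the same shape as the other weight sets, so it can be added to the first weights in the same way as the title feature word weights.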
In one possible implementation of the embodiment of the present application, before the first weight of the words to be extracted in the text data set is determined, the method may further include:
performing noise cleaning on the text data set.
In actual use, crawled Internet web page text usually contains a certain amount of redundant and noisy information, such as the @ symbol, non-Chinese characters, website links, redundant characters, and some punctuation marks. In this embodiment, the redundant and noisy information can be filtered out before the first weight of the words to be extracted is determined, reducing interference with subsequent calculation.
In a specific implementation, noise cleaning can be performed on the text data set based on a regular expression matching algorithm.
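A regular-expression cleaning pass might look like the sketch below. The specific patterns (links, @ mentions, runs of repeated punctuation, extra whitespace) and the name `clean_text` are assumptions chosen for illustration; the source does not list the expressions it uses, and a production cleaner for Chinese web text would need rules tuned to that data.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")                  # website links
MENTION_PATTERN = re.compile(r"@\w+")                      # @ mentions
REPEATED_PUNCT_PATTERN = re.compile(r"([!?.,;:~\-_*#])\1+")  # runs like "!!!" or "..."
WHITESPACE_PATTERN = re.compile(r"\s+")                    # redundant whitespace


def clean_text(text):
    """Remove links and mentions, collapse repeated punctuation and whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    text = REPEATED_PUNCT_PATTERN.sub(r"\1", text)
    return WHITESPACE_PATTERN.sub(" ", text).strip()


raw = "Vaccine fraud!!! see https://example.com/a?id=1 via @newsbot   for details..."
cleaned = clean_text(raw)
print(cleaned)  # Vaccine fraud! see via for details.
```

Links are removed before mentions and punctuation so that characters inside a URL are not partially rewritten by the later patterns.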
Fig. 5 shows a design flow diagram of a hot word extraction method provided by an embodiment of the present application. The process is as follows. Data cleaning (i.e. noise cleaning) is performed on the text data in the text data set, and the word weights of the text data set, i.e. the first weights of the words to be extracted, are determined based on tf-iwf. The title feature word weights are calculated, and title weighting is then performed, i.e. the title feature word weights are added to the first weights. The theme feature word weights are determined based on the LDA model, and theme weighting is then performed, i.e. the theme feature word weights are added to the first weights already increased by the title feature word weights. Search keyword weighting is then performed based on the search keyword weights, i.e. the search keyword weights are added to the first weights already increased by the title and theme feature word weights, giving the second weights. The second weights are normalized to obtain the third weights, and the words (the words to be extracted) are output sorted by third weight.
Fig. 6 shows a processing flow diagram of a hot word extraction service system. The system is implemented on a big data Spark cluster managed by the ZooKeeper service, and uses Spark Streaming to implement the hot word extraction service, meeting the requirement for near-real-time hot word extraction over massive data. The hot word extraction service system monitors in real time the files input to the Hadoop Distributed File System (HDFS), reads the text content as discrete data, calls the hot word extraction algorithm to analyze the discrete data and extract the top-N hot words, and finally writes the hot word extraction results to a topic of the message queue Kafka.
The big-data-based hot word extraction service system uses the big data Spark platform together with an advanced messaging system, so that the service can support massive data and provide service in real time.
Based on the same principle as the method shown in Fig. 1, an embodiment of the present application also provides a hot word extraction apparatus. As shown in Fig. 7, the hot word extraction apparatus 20 may include:
a first weight determination module 210, configured to determine the first weight of each word to be extracted in the text data set;
a second weight determination module 220, configured to determine the weights of the feature words among the words to be extracted, and to determine the second weight of each word to be extracted based on the first weight of the word to be extracted and the weights of the feature words;
a hot word extraction module 230, configured to extract the predetermined number of words to be extracted with the highest second weights as hot words.
The hot word extraction apparatus provided by this embodiment determines the first weight of each word to be extracted, identifies the feature words among the words to be extracted together with their weights, and determines the second weight of each word to be extracted from the first weight and the feature word weights. This makes the second weight calculation more reasonable, so extracting hot words by second weight improves the accuracy of hot word extraction and meets users' needs.
Optionally, above-mentioned Feature Words include at least one of the following:
The title feature word of article title in text data set;
The theme feature word that text data is concentrated;
The search key of text data set;
Wherein, search key is for the word by search to determine text data set.
Optionally, the second above-mentioned weight determination module is based on the first weight of word to be extracted and the power of Feature Words Weight, when determining the second weight of word to be extracted, is specifically used for:
By the weight of Feature Words, first weight of equivalent in word to be extracted is added with Feature Words, obtains word to be extracted Second weight;
Wherein, the weight of Feature Words includes at least one of the following:
The weight of title feature word;
The weight of theme feature word;
The default weight of search key.
Optionally, the first above-mentioned weight determination module is specifically used for:
The tf-iwf value for the word to be extracted that text data is concentrated is determined as the first weight;
And/or
Determine the first weight of the word to be extracted of article text in text data set.
Optionally, when the feature words include title feature words from the article titles in the text data set, the second weight determination module, when determining the weights of the feature words among the words to be extracted, is specifically configured to:
Determine the tf-iwf value of a title feature word within the titles of the text data set as the weight of that title feature word.
Optionally, when the feature words include topic feature words, the second weight determination module, when determining the weights of the feature words among the words to be extracted, is specifically configured to:
Based on LDA, determine the probability of each topic feature word in the text data set, and determine the weight of the topic feature word based on that probability.
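As a hedged illustration of turning LDA output into topic feature word weights, the sketch below assumes that an LDA model has already been trained elsewhere and that its per-topic word probabilities are available as dictionaries. Taking the most probable words of each topic as topic feature words, and using the highest probability across topics as the weight, are assumptions of this example; the name `topic_feature_weights` is hypothetical.

```python
def topic_feature_weights(topic_word_probs, words_per_topic):
    """Derive topic feature word weights from per-topic word probabilities.

    topic_word_probs: list of dicts, one per topic, mapping word -> probability
    (as would be read from a trained LDA model).
    words_per_topic: how many top words of each topic count as feature words.
    """
    weights = {}
    for probs in topic_word_probs:
        # Rank by probability (descending), then alphabetically for stability.
        ranked = sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))
        for word, prob in ranked[:words_per_topic]:
            # Keep the highest probability seen for the word across topics.
            weights[word] = max(weights.get(word, 0.0), prob)
    return weights


topics = [
    {"summit": 0.5, "trade": 0.3, "rain": 0.2},
    {"trade": 0.5, "tariff": 0.375, "summit": 0.125},
]
topic_weights = topic_feature_weights(topics, 2)
```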
Optionally, before the first weight of the words to be extracted in the text data set is determined, the above method may further include:
Performing noise cleaning on the text data set.
Optionally, performing noise cleaning on the text data set may include:
Performing noise cleaning on the text data set based on a regular expression matching algorithm.
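A minimal sketch of regular-expression-based noise cleaning is given below. The kinds of noise handled (HTML tags, URLs, HTML entities, redundant whitespace) and the name `clean_text` are assumptions chosen for illustration; the embodiment does not specify which patterns are treated as noise.

```python
import re

# Illustrative noise patterns; the actual patterns are not specified in the text.
_NOISE_PATTERNS = [
    re.compile(r"<[^>]+>"),             # HTML tags
    re.compile(r"https?://\S+"),        # URLs
    re.compile(r"&[a-zA-Z]+;|&#\d+;"),  # HTML entities such as &nbsp;
]


def clean_text(text):
    """Remove the noise patterns above and collapse runs of whitespace."""
    for pattern in _NOISE_PATTERNS:
        text = pattern.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


raw = "<p>Summit opens&nbsp;today</p> see https://example.com/a?b=1 \n now"
cleaned = clean_text(raw)
```

Each article in the text data set would be passed through such a function before word segmentation and weight calculation.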
It can be understood that each module of the hot word extraction device in this embodiment has the function of implementing the corresponding step of the hot word extraction method in the embodiment shown in Fig. 1. These functions may be implemented in hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions. The modules may be software and/or hardware, and each module may be implemented separately or several modules may be implemented in an integrated manner. For the functional description of each module of the hot word extraction device, refer to the corresponding description of the hot word extraction method in the embodiment shown in Fig. 1; details are not repeated here.
An embodiment of the present application provides an electronic device. As shown in Fig. 8, the electronic device 2000 includes a processor 2001 and a memory 2003. The processor 2001 is connected to the memory 2003, for example through a bus 2002. Optionally, the electronic device 2000 may further include a transceiver 2004. It should be noted that, in practical applications, the transceiver 2004 is not limited to one, and the structure of the electronic device 2000 does not limit the embodiments of the present application.
The processor 2001 is applied in the embodiments of the present application to implement the method shown in the above method embodiments. The transceiver 2004 may include a receiver and a transmitter; it is applied in the embodiments of the present application to implement, when executed, communication between the electronic device of the embodiments of the present application and other devices.
The processor 2001 may be a CPU, a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logic blocks, modules, and circuits described in connection with the disclosure of the present application. The processor 2001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 2002 may include a path that transfers information between the above components. The bus 2002 may be a PCI bus, an EISA bus, or the like, and may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the bus is represented by only one thick line in Fig. 8, but this does not mean that there is only one bus or only one type of bus.
The memory 2003 may be a ROM or another type of static storage device that can store static information and instructions, a RAM or another type of dynamic storage device that can store information and instructions, an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
Optionally, the memory 2003 is used to store the application code for executing the solution of the present application, and its execution is controlled by the processor 2001. The processor 2001 is configured to execute the application code stored in the memory 2003 to implement the hot word extraction method shown in the above method embodiments.
The electronic device provided by the embodiments of the present application is applicable to any of the above method embodiments; details are not repeated here.
An embodiment of the present application provides an electronic device which, compared with the prior art, can determine the first weight of each word to be extracted, identify the feature words among the words to be extracted and determine their weights, and then determine the second weight of each word to be extracted based on the first weight and the weights of the feature words. This makes the second-weight calculation more reasonable, so extracting hot words based on the second weight improves the accuracy of hot word extraction and meets the user's needs.
An embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the hot word extraction method shown in the above method embodiments is implemented.
The computer-readable storage medium provided by the embodiments of the present application is applicable to any of the above method embodiments; details are not repeated here.
An embodiment of the present application provides a computer-readable storage medium which, compared with the prior art, can determine the first weight of each word to be extracted, identify the feature words among the words to be extracted and determine their weights, and then determine the second weight of each word to be extracted based on the first weight and the weights of the feature words. This makes the second-weight calculation more reasonable, so extracting hot words based on the second weight improves the accuracy of hot word extraction and meets the user's needs.
It should be understood that although the steps in the flowcharts of the drawings are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless expressly stated herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; nor are they necessarily executed sequentially, but may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (12)

1. A hot word extraction method, characterized by comprising:
Determining a first weight of each word to be extracted in a text data set;
Determining weights of feature words among the words to be extracted, and determining a second weight of each word to be extracted based on the first weight of the word to be extracted and the weights of the feature words;
Extracting, as hot words, a predetermined number of the words to be extracted that have the highest second weights.
2. The hot word extraction method according to claim 1, characterized in that the feature words include at least one of the following:
Title feature words from the article titles in the text data set;
Topic feature words in the text data set;
Search keywords of the text data set;
Wherein a search keyword is a word used in a search to determine the text data set.
3. The hot word extraction method according to claim 2, characterized in that determining the second weight of the word to be extracted based on the first weight of the word to be extracted and the weights of the feature words comprises:
Adding the weight of a feature word to the first weight of the word to be extracted that is identical to that feature word, to obtain the second weight of the word to be extracted;
Wherein the weights of the feature words include at least one of the following:
The weight of the title feature word;
The weight of the topic feature word;
The preset weight of the search keyword.
4. The hot word extraction method according to claim 1, characterized in that determining the first weight of the words to be extracted in the text data set comprises:
Determining the term frequency-inverse word frequency (tf-iwf) value of a word to be extracted from the article body text in the text data set as the first weight;
And/or
Determining the first weight of the words to be extracted from the article body text in the text data set.
5. The hot word extraction method according to claim 2, characterized in that, when the feature words include the title feature words from the article titles in the text data set, determining the weights of the feature words among the words to be extracted comprises:
Determining the tf-iwf value of a title feature word within the titles of the text data set as the weight of that title feature word.
6. The hot word extraction method according to claim 2, characterized in that, when the feature words include the topic feature words, determining the weights of the feature words among the words to be extracted comprises:
Based on a latent Dirichlet allocation (LDA) model, determining the probability of each topic feature word in the text data set, and determining the probability of the topic feature word as the weight of the topic feature word.
7. The hot word extraction method according to any one of claims 1-6, characterized in that, before determining the first weight of the words to be extracted in the text data set, the method further comprises:
Performing noise cleaning on the text data set.
8. The hot word extraction method according to claim 7, characterized in that performing noise cleaning on the text data set comprises:
Performing noise cleaning on the text data set based on a regular expression matching algorithm.
9. A hot word extraction device, characterized by comprising:
A first weight determination module, configured to determine a first weight of each word to be extracted in a text data set;
A second weight determination module, configured to determine weights of feature words among the words to be extracted, and to determine a second weight of each word to be extracted based on the first weight of the word to be extracted and the weights of the feature words;
A hot word extraction module, configured to extract, as hot words, a predetermined number of the words to be extracted that have the highest second weights.
10. The hot word extraction device according to claim 9, characterized in that the feature words include at least one of the following:
Title feature words from the article titles in the text data set;
Topic feature words in the text data set;
Search keywords of the text data set;
Wherein a search keyword is a word used in a search to determine the text data set.
11. An electronic device, characterized by comprising a processor and a memory;
The memory is configured to store operation instructions;
The processor is configured to execute the hot word extraction method according to any one of claims 1-8 by invoking the operation instructions.
12. A computer-readable storage medium on which a computer program is stored, characterized in that the hot word extraction method according to any one of claims 1-8 is implemented when the program is executed by a processor.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811638597.7A CN109710944A (en) 2018-12-29 2018-12-29 Hot word extracting method, device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN109710944A true CN109710944A (en) 2019-05-03

Family

ID=66260188

Country Status (1)

Country Link
CN (1) CN109710944A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120226641A1 (en) * 2010-09-29 2012-09-06 Yahoo! Inc. Training a search query intent classifier using wiki article titles and a search click log
CN104615593A (en) * 2013-11-01 2015-05-13 北大方正集团有限公司 Method and device for automatic detection of microblog hot topics
CN105354333A (en) * 2015-12-07 2016-02-24 天云融创数据科技(北京)有限公司 Topic extraction method based on news text
CN108228808A (en) * 2017-12-29 2018-06-29 东软集团股份有限公司 Determine the method, apparatus of focus incident and storage medium and electronic equipment
CN108363694A (en) * 2018-02-23 2018-08-03 北京窝头网络科技有限公司 Keyword extracting method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334268A (en) * 2019-07-05 2019-10-15 北京国创动力文化传媒有限公司 A kind of block chain project hot word generation method and device
CN110874530A (en) * 2019-10-30 2020-03-10 深圳价值在线信息科技股份有限公司 Keyword extraction method and device, terminal equipment and storage medium
CN110874530B (en) * 2019-10-30 2023-06-13 深圳价值在线信息科技股份有限公司 Keyword extraction method, keyword extraction device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190503