CN108710664A - A kind of hot word analysis method, computer readable storage medium and terminal device - Google Patents

Hot word analysis method, computer-readable storage medium and terminal device

Info

Publication number
CN108710664A
Authority
CN
China
Prior art keywords
exposure
hot word
text message
sequence
threshold
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810456973.4A
Other languages
Chinese (zh)
Other versions
CN108710664B (en)
Inventor
张依
汪伟
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201810456973.4A (CN108710664B)
Priority to PCT/CN2018/096267 (WO2019218452A1)
Publication of CN108710664A
Application granted
Publication of CN108710664B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of computer technology, and in particular relates to a hot word analysis method, a computer-readable storage medium and a terminal device. The method crawls, through a search engine, the web pages published on target websites during the current statistical period; performs word segmentation on the text information in the web pages to obtain the words that make up the text information; counts the exposure frequency of each word in the text information; determines the words whose exposure frequency in the text information exceeds a preset first exposure threshold to be hot words; counts the exposure frequency of each enterprise name in preferred text information; and calculates the degree of association between each enterprise name and the hot word according to the exposure frequency of each enterprise name in the preferred text information. The invention provides an objective evaluation standard for determining hot words and, once the hot words are obtained, takes the relationship between enterprises and hot words into account, so that the analysis result offers stronger guidance for enterprises.

Description

Hot word analysis method, computer-readable storage medium and terminal device
Technical field
The invention belongs to the field of computer technology, and in particular relates to a hot word analysis method, a computer-readable storage medium and a terminal device.
Background
A hot word, that is, a popular Internet term, is a lexical phenomenon that reflects the issues and things of common concern to the people of a country or region during a given period. Hot words bear the mark of their time and can represent the trending topics and livelihood issues of a period.
At present, hot words are mainly determined by network analysts on the basis of the information they happen to browse on the Internet. This approach relies on the personal judgment of the analysts and is highly subjective, so it is difficult for it to reflect the real hot word situation objectively. Moreover, once hot words are obtained, they are usually analyzed in isolation; the analysis has a single dimension, and its results offer little guidance for enterprises.
Summary of the invention
In view of this, embodiments of the present invention provide a hot word analysis method, a computer-readable storage medium and a terminal device, to solve the problems that, in the prior art, the process of determining hot words is highly subjective and the analysis results offer little guidance for enterprises.
A first aspect of the embodiments of the present invention provides a hot word analysis method, which may include:
crawling, through a search engine, the web pages published on a target website during the current statistical period, the target website being a website whose pageviews exceed a preset pageview threshold;
performing word segmentation on the text information in the web pages to obtain the words that make up the text information;
counting the exposure frequency of each word in the text information;
determining the words whose exposure frequency in the text information exceeds a preset first exposure threshold to be hot words;
counting the exposure frequency of each enterprise name in preferred text information, the preferred text information being the text information that contains the hot word;
calculating the degree of association between each enterprise name and the hot word according to the exposure frequency of each enterprise name in the preferred text information.
A second aspect of the embodiments of the present invention provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the following steps:
crawling, through a search engine, the web pages published on a target website during the current statistical period, the target website being a website whose pageviews exceed a preset pageview threshold;
performing word segmentation on the text information in the web pages to obtain the words that make up the text information;
counting the exposure frequency of each word in the text information;
determining the words whose exposure frequency in the text information exceeds a preset first exposure threshold to be hot words;
counting the exposure frequency of each enterprise name in preferred text information, the preferred text information being the text information that contains the hot word;
calculating the degree of association between each enterprise name and the hot word according to the exposure frequency of each enterprise name in the preferred text information.
A third aspect of the embodiments of the present invention provides a hot word analysis terminal device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, the processor implementing the following steps when executing the computer-readable instructions:
crawling, through a search engine, the web pages published on a target website during the current statistical period, the target website being a website whose pageviews exceed a preset pageview threshold;
performing word segmentation on the text information in the web pages to obtain the words that make up the text information;
counting the exposure frequency of each word in the text information;
determining the words whose exposure frequency in the text information exceeds a preset first exposure threshold to be hot words;
counting the exposure frequency of each enterprise name in preferred text information, the preferred text information being the text information that contains the hot word;
calculating the degree of association between each enterprise name and the hot word according to the exposure frequency of each enterprise name in the preferred text information.
Compared with the prior art, the embodiments of the present invention have the following beneficial effects. The embodiments first crawl, through a search engine, the web pages published on the target website during the current statistical period and perform word segmentation on the text information in the web pages to obtain the words that make up the text information. They then count the exposure frequency of each word in the text information and determine the words whose exposure frequency exceeds a preset first exposure threshold to be hot words. Finally, they count the exposure frequency of each enterprise name in the preferred text information and, from those frequencies, calculate the degree of association between each enterprise name and the hot word. On the one hand, the embodiments provide an objective evaluation standard for determining hot words and remove the dependence on the personal judgment of network analysts, so the hot words determined reflect the real situation more objectively. On the other hand, once the hot words are obtained, the relationship between enterprises and hot words is taken into account, so the analysis result offers stronger guidance for enterprises.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of one embodiment of a hot word analysis method in an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the process of setting the first exposure threshold and the second exposure threshold in one specific implementation;
Fig. 3 is a schematic flowchart of the process of setting the first exposure threshold and the second exposure threshold in another specific implementation;
Fig. 4 is a structural diagram of one embodiment of a hot word analysis apparatus in an embodiment of the present invention;
Fig. 5 is a schematic block diagram of a hot word analysis terminal device in an embodiment of the present invention.
Detailed description
To make the purpose, features and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, one embodiment of a hot word analysis method in an embodiment of the present invention may include:
Step S101: crawl, through a search engine, the web pages published on the target website during the current statistical period.
The target website is a website whose pageviews exceed a preset pageview threshold. The pageview threshold can be set according to actual conditions to, for example, 100,000, 500,000 or 1,000,000 views. The target website can be Baidu News (http://news.baidu.com/), NetEase News (http://news.163.com/), Tencent News (http://news.qq.com/), Phoenix News (http://news.ifeng.com/) or another news website.
The statistical period can be set to one day, one week, two weeks, one month and so on, according to actual conditions.
Step S102: perform word segmentation on the text information in the web pages to obtain the words that make up the text information.
Word segmentation means cutting a sentence into individual words. In this embodiment, the sentence can be segmented according to a general-purpose dictionary so that every resulting word is a normal vocabulary item; for example, if a word is not in the dictionary, it is split into single characters.
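The dictionary-based segmentation in step S102 can be sketched as follows. The text does not name a segmentation algorithm, so forward maximum matching is used here purely as an illustrative assumption; the function name `segment`, the `max_len` parameter and the toy dictionary are hypothetical.

```python
def segment(sentence, dictionary, max_len=4):
    """Forward maximum matching (an assumed technique, not named in the text).

    At each position, take the longest dictionary word; if none matches,
    fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest window first and shrink it until a word is found.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"人工", "智能", "人工智能", "热词"}
tokens = segment("人工智能热词榜", toy_dictionary)
```

Here "人工智能" and "热词" are found in the dictionary and kept whole, while "榜" is not and is emitted as a single character.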
Step S103: count the exposure frequency of each word in the text information.
That is, count the number of times each word occurs in the text information.
Step S104: determine the words whose exposure frequency in the text information exceeds the preset first exposure threshold to be hot words.
For example, the first exposure threshold can be set to 10000.
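Steps S103 and S104 amount to counting occurrences and applying a threshold. A minimal sketch, assuming the text information has already been segmented into lists of words (the function name `find_hot_words` and the sample data are hypothetical):

```python
from collections import Counter


def find_hot_words(segmented_texts, first_threshold):
    """Count how often each word occurs and keep those above the threshold."""
    counts = Counter()
    for tokens in segmented_texts:
        counts.update(tokens)
    # A word is a hot word only if its count strictly exceeds the threshold.
    hot = {word for word, count in counts.items() if count > first_threshold}
    return hot, counts


# Tiny illustrative corpus; a real first threshold would be far larger.
texts = [["a", "b", "a"], ["a", "c"], ["b", "a"]]
hot, counts = find_hot_words(texts, first_threshold=3)
```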
Optionally, words whose exposure frequency in the text information is less than or equal to the first exposure threshold and greater than a preset second exposure threshold can be determined to be candidate words. The exposure frequency of each candidate word in the T statistical periods before the current statistical period is then obtained from the historical statistics records, and the candidate words that satisfy the following condition are determined to be hot words:
For every value of t, the inequality (not reproduced in this text) holds.
Where n is the serial number of the candidate word, 1 ≤ n ≤ N, N is the total number of candidate words, t is the serial number of each statistical period in chronological order, 1 ≤ t ≤ T, T is a positive integer, $ExpNum_{n,t}$ is the exposure frequency of the n-th candidate word in the t-th statistical period, $ExpNum_{n,T+1}$ is the exposure frequency of the n-th candidate word in the current statistical period, ln is the natural logarithm function, and ThreshRatio is a preset ratio threshold.
For example, the second exposure threshold can be set to 2000, the ratio threshold to 2, and T = 1. If, in the current statistical period, the exposure frequency of "Xiong'an New Area" is 9000, which is less than the first exposure threshold but greater than the second exposure threshold, its exposure frequency in the one statistical period before the current one is obtained. If the exposure frequency in that previous period was 1000, the inequality holds, and the word is also determined to be a hot word.
If, in the current statistical period, the exposure frequency of "artificial intelligence" is 1500, which is below both the first exposure threshold and the second exposure threshold, it is directly determined to be an ordinary word.
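The candidate-word rule can be sketched as below. The inequality itself is missing from this text, so the form used here, $\ln(ExpNum_{n,T+1} / ExpNum_{n,t}) > ThreshRatio$ for every t, is an assumption chosen only because it uses the defined symbols and agrees with the worked example (ln(9000/1000) ≈ 2.197 > 2). The function name and the handling of zero counts are likewise hypothetical.

```python
import math


def is_hot_word(current, history, first_threshold, second_threshold, thresh_ratio):
    """Classify one word from its current count and its counts in T earlier periods."""
    if current > first_threshold:
        return True            # hot word outright (step S104)
    if current <= second_threshold:
        return False           # ordinary word, not even a candidate
    if not history:
        return False           # T is a positive integer, so history is required
    for previous in history:
        if previous <= 0:
            continue           # assumption: a zero count is treated as satisfying the test
        # ASSUMED inequality: ln(current / previous) > thresh_ratio
        if math.log(current / previous) <= thresh_ratio:
            return False
    return True


# Values from the worked example: thresholds 10000 and 2000, ratio threshold 2, T = 1.
xiongan = is_hot_word(9000, [1000], 10000, 2000, 2.0)
ai = is_hot_word(1500, [1000], 10000, 2000, 2.0)
```

Under this assumed inequality, the first call reproduces the "Xiong'an New Area" outcome (promoted to hot word) and the second reproduces the "artificial intelligence" outcome (ordinary word).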
Further, the hot words determined can also be filtered, that is, interfering words can be removed from them. Common interference hot words can be preset, for example "we", "everyone", "this" and so on. The exposure frequency of these interference hot words has nothing to do with the news content: whatever the news is about, their exposure frequency may still exceed the first exposure threshold. If they are not filtered out when hot word statistics are compiled, the accuracy of the analysis result suffers, so the interference hot words need to be removed from the hot words determined to obtain the filtered hot words, that is, the hot words actually needed. Specifically, after the hot words are determined, the preset interference hot words can be obtained from a data list, and every determined hot word is compared one by one with all the interference hot words. If a hot word matches an interference hot word, it is filtered out; if it matches none of them, it is retained. The hot words finally retained are the filtered hot words.
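The filtering step can be sketched as a simple exclusion against the preset list. The function name `filter_hot_words` and the sample lists are hypothetical; a set is used instead of pairwise comparison, which gives the same result.

```python
def filter_hot_words(hot_words, interference_words):
    """Drop every hot word that matches a preset interference word."""
    blocked = set(interference_words)
    # Keep the original order of the hot words that survive.
    return [word for word in hot_words if word not in blocked]


interference = ["we", "everyone", "this"]
filtered = filter_hot_words(["we", "artificial intelligence", "this"], interference)
```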
Step S105: count the exposure frequency of each enterprise name in the preferred text information.
The preferred text information is the text information that contains the hot word.
Optionally, the current statistical period may include M sub-periods, where M is a positive integer; in that case, the exposure frequency of each enterprise name in the preferred text information of each sub-period is counted.
Step S106: calculate the degree of association between each enterprise name and the hot word according to the exposure frequency of each enterprise name in the preferred text information.
Specifically, the degree of association between each enterprise name and the hot word can be calculated according to the following formula (not reproduced in this text):
Where q is the serial number of the enterprise name, 1 ≤ q ≤ Q, Q is the total number of enterprise names, p is the serial number of the hot word, 1 ≤ p ≤ P, P is the total number of hot words, m is the serial number of each sub-period in chronological order, 1 ≤ m ≤ M, $EntExpNum_{q,p,m}$ is the exposure frequency of the q-th enterprise name in the preferred text information containing the p-th hot word in the m-th sub-period, $k_m$ is a preset weight coefficient with $k_m < k_{m+1}$ (a further condition on the weights is not reproduced in this text), and $Rel_{q,p}$ is the degree of association between the q-th enterprise name and the p-th hot word.
In particular, if the degree of association between an enterprise name A and a hot word B exceeds a preset association threshold, the two can be regarded as a unique match. The association threshold can be set to 80%, 90%, 95% and so on, according to actual conditions.
For example, to analyze enterprise associations for the hot word "king's honor", enterprise names are searched for in the text information containing that hot word, the exposure frequency of each enterprise name is counted, and the degree of association between each enterprise name and the hot word is calculated according to the above formula. If Tencent ranks first by degree of association and its degree of association with the hot word is 98%, which exceeds the association threshold, the enterprise name uniquely matching the hot word "king's honor" is determined to be Tencent.
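Because the association formula is missing from this text, the following is only a sketch under stated assumptions: the degree of association is taken to be a weighted sum, over the M sub-periods, of each enterprise's share of all enterprise-name exposures in that sub-period, with increasing weights $k_m$ that are assumed to sum to 1. That choice yields a value between 0 and 1, which is consistent with the percentage thresholds mentioned above, but it is not confirmed by the source. All names and numbers are hypothetical.

```python
def association_degrees(exposure, weights):
    """exposure[name] holds M per-sub-period counts; weights holds k_1..k_M.

    ASSUMED formula: Rel = sum over m of k_m * count[m] / total[m].
    """
    m_count = len(weights)
    # Total exposures of all enterprise names in each sub-period.
    totals = [sum(counts[m] for counts in exposure.values()) for m in range(m_count)]
    degrees = {}
    for name, counts in exposure.items():
        degrees[name] = sum(
            weights[m] * counts[m] / totals[m]
            for m in range(m_count)
            if totals[m] > 0   # skip sub-periods with no exposures at all
        )
    return degrees


# Two sub-periods; later sub-periods weigh more (k_1 < k_2), weights assumed to sum to 1.
exposure = {"Tencent": [90, 98], "OtherCo": [10, 2]}
weights = [0.4, 0.6]
degrees = association_degrees(exposure, weights)
# Enterprises whose degree exceeds a 90% association threshold.
matched = [name for name, degree in degrees.items() if degree > 0.9]
```

The increasing weights reflect the stated condition $k_m < k_{m+1}$, which gives more recent sub-periods more influence on the result.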
The above associates hot words with the corresponding enterprises; another angle is to associate an enterprise with the corresponding hot words. Specifically, the web pages are searched for text information in which the enterprise name appears, hot words are then searched for in that text information, the frequency of each hot word is counted, and the hot words are sorted in descending order of frequency. The higher a hot word ranks, the higher its degree of association with the enterprise; the lower it ranks, the lower its degree of association.
For example, to analyze hot word associations for Tencent, the web pages are searched for text information in which Tencent appears, hot words are then searched for in that text information, the frequency of each hot word is counted, and the hot words are sorted in descending order of frequency. If the ranking is, in order, "king's honor", "seeking survival danger spot", "Missions", ..., then the hot words currently most associated with Tencent are, in order, "king's honor", "seeking survival danger spot", "Missions", ....
Further, when carrying out enterprise association analysis, an enterprise name should include its nicknames. For example, when analyzing Tencent, the search should cover not only "Tencent" but also nicknames such as "goose factory"; when analyzing Alibaba, the search should cover not only "Alibaba" but also short forms such as "Ali". Specifically, a nickname list can be preset for enterprise names to record the correspondence between the formal name of each enterprise and its nicknames. When enterprise association analysis is carried out, the nicknames of the enterprise are obtained from this list and included in the statistics for that enterprise.
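The enterprise-to-hot-word direction, including the nickname list, can be sketched as follows. The function name, the alias table and the sample texts are hypothetical, and plain substring matching stands in for whatever search the implementation actually uses.

```python
from collections import Counter


def hot_words_for_enterprise(enterprise, texts, hot_words, aliases):
    """Rank hot words by frequency in the texts that mention the enterprise."""
    # Formal name plus nicknames; fall back to the formal name alone.
    names = aliases.get(enterprise, [enterprise])
    counts = Counter()
    for text in texts:
        if any(name in text for name in names):
            for word in hot_words:
                counts[word] += text.count(word)
    # most_common() sorts by descending frequency; drop words never seen.
    return [word for word, count in counts.most_common() if count > 0]


# Hypothetical nickname list and texts, for illustration only.
aliases = {"Tencent": ["Tencent", "goose factory"]}
texts = [
    "Tencent released king's honor. king's honor is popular.",
    "The goose factory is testing Missions.",
    "Alibaba also mentioned king's honor.",
]
ranking = hot_words_for_enterprise("Tencent", texts, ["king's honor", "Missions"], aliases)
```

The third text is ignored because it mentions neither "Tencent" nor its nickname, so its hot word occurrence does not count toward Tencent's ranking.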
In one specific implementation of the embodiment of the present invention, the process of setting the first exposure threshold and the second exposure threshold may include the steps shown in Fig. 2:
Step S201: obtain the exposure frequency of each historical hot word in each statistical period from the historical statistics records.
A historical hot word is a hot word that was determined before the current statistical period.
Step S202: construct the first exposure sequence of each historical hot word.
Specifically, the first exposure sequence of each historical hot word can be constructed according to the following formula:
$$ExpSeq1_{nh} = \{HsExpNum_{nh,1}, HsExpNum_{nh,2}, \ldots, HsExpNum_{nh,th}, \ldots, HsExpNum_{nh,TH_{nh}}\}$$
Where nh is the serial number of the historical hot word, 1 ≤ nh ≤ NH, NH is the total number of historical hot words, th is the serial number of each statistical period in chronological order, $1 \le th \le TH_{nh}$, $TH_{nh}$ is the total number of statistical periods of the nh-th historical hot word, $HsExpNum_{nh,th}$ is the exposure frequency of the nh-th historical hot word in the th-th statistical period, and $ExpSeq1_{nh}$ is the first exposure sequence of the nh-th historical hot word.
Step S203: calculate the mean of each first exposure sequence.
Specifically, the mean of each first exposure sequence can be calculated according to the following formula:
$$AvExpSeq1_{nh} = \frac{1}{TH_{nh}} \sum_{th=1}^{TH_{nh}} HsExpNum_{nh,th}$$
Where $AvExpSeq1_{nh}$ is the mean of the nh-th first exposure sequence.
Step S204: construct the sequence in which the means of the first exposure sequences are arranged in descending order.
Specifically, the sequence in which the means of the first exposure sequences are arranged in descending order can be constructed according to the following formula:
$$\{AvExpSeq1'_{1}, AvExpSeq1'_{2}, \ldots, AvExpSeq1'_{nh1}, \ldots, AvExpSeq1'_{NH}\}$$
Where $AvExpSeq1'_{nh1}$ is the mean of the first exposure sequence ranked at position nh1 in descending order, 1 ≤ nh1 ≤ NH.
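Steps S201 to S204 can be sketched as follows, assuming the historical statistics are available as a mapping from each historical hot word to its list of per-period exposure frequencies (the function name and the sample data are hypothetical):

```python
def sorted_sequence_means(history):
    """history[word] is the first exposure sequence of one historical hot word."""
    # Step S203: arithmetic mean of each non-empty first exposure sequence.
    means = {word: sum(seq) / len(seq) for word, seq in history.items() if seq}
    # Step S204: the means arranged in descending order.
    ordered = sorted(means.values(), reverse=True)
    return ordered, means


# Sequences may have different lengths, one per historical hot word.
history = {"w1": [100, 300], "w2": [1000, 2000, 3000], "w3": [50]}
ordered, means = sorted_sequence_means(history)
```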
Step S205: calculate the first exposure threshold and the second exposure threshold.
Specifically, the first exposure threshold can be calculated according to the following formula (not reproduced in this text):
Where $NMAX = \mathrm{floor}(\xi_{max} \times NH)$, $\xi_{max}$ is a preset coefficient with $0 < \xi_{max} < 1$, floor is the round-down function, and Threshold1 is the first exposure threshold;
The second exposure threshold is calculated according to the following formula (not reproduced in this text):
Where $NMIN = \mathrm{floor}(\xi_{min} \times NH)$, $\xi_{min}$ is a preset coefficient with $0 < \xi_{min} < 1$, and Threshold2 is the second exposure threshold.
In another specific implementation of the embodiment of the present invention, the process of setting the first exposure threshold and the second exposure threshold may include the steps shown in Fig. 3:
Step S301: obtain the exposure frequency of each historical hot word in each statistical period from the historical statistics records.
Step S302: construct the first exposure sequence of each historical hot word.
Step S303: calculate the mean of each first exposure sequence.
Steps S301 to S303 are the same as steps S201 to S203; refer to the description above, which is not repeated here.
Step S304: construct the second exposure sequence of each historical hot word.
Specifically, the second exposure sequence of each historical hot word can be constructed according to the following formula:
$$ExpSeq2_{nh} = \{HsExpNum'_{nh,1}, HsExpNum'_{nh,2}, \ldots, HsExpNum'_{nh,th1}, \ldots, HsExpNum'_{nh,TH_{nh}}\}$$
Where $HsExpNum'_{nh,th1} \in ExpSeq1_{nh}$, $1 \le th1 \le TH_{nh}$, $HsExpNum'_{nh,th1} \ge HsExpNum'_{nh,th1+1}$, and $ExpSeq2_{nh}$ is the second exposure sequence of the nh-th historical hot word. In other words, the second exposure sequence contains the elements of the first exposure sequence arranged in descending order.
Step S305: calculate the mean of each second exposure sequence.
Specifically, the mean of each second exposure sequence can be calculated according to the following formula (not reproduced in this text):
Where $AvExpSeq2_{nh}$ is the mean of the nh-th second exposure sequence, and $TH1_{nh}$ satisfies the following conditions: $HsExpNum'_{nh,TH1_{nh}} \ge AvExpSeq1_{nh}$ and $HsExpNum'_{nh,TH1_{nh}+1} < AvExpSeq1_{nh}$.
Step S306: construct the sequence in which the means of the second exposure sequences are arranged in descending order.
Specifically, the sequence in which the means of the second exposure sequences are arranged in descending order can be constructed according to the following formula:
$$\{AvExpSeq2'_{1}, AvExpSeq2'_{2}, \ldots, AvExpSeq2'_{nh1}, \ldots, AvExpSeq2'_{NH}\}$$
Where $AvExpSeq2'_{nh1}$ is the mean of the second exposure sequence ranked at position nh1 in descending order.
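Steps S304 and S305 can be sketched as below. The formula for the mean of the second exposure sequence is missing from this text; the sketch assumes it averages only the first $TH1_{nh}$ elements of the descending sequence, that is, the exposure frequencies that are at least the mean of the first exposure sequence. That reading follows from the conditions stated for $TH1_{nh}$ but is not confirmed by the source, and the function name is hypothetical.

```python
def second_sequence_mean(first_sequence):
    """ASSUMED reading: average of the values that are >= the plain mean."""
    first_mean = sum(first_sequence) / len(first_sequence)
    # Step S304: the second exposure sequence is the first one in descending order.
    second_sequence = sorted(first_sequence, reverse=True)
    # The first TH1 elements are exactly those not below the first mean.
    head = [value for value in second_sequence if value >= first_mean]
    # head is never empty, because the maximum is always >= the mean.
    return sum(head) / len(head)


# Mean of [100, 900, 500] is 500, so only 900 and 500 are averaged.
peak_mean = second_sequence_mean([100, 900, 500])
```

Under this reading, the second mean discounts the low-exposure periods of a historical hot word and characterizes its exposure while it was actually trending.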
Step S307: calculate the first exposure threshold and the second exposure threshold.
Specifically, the first exposure threshold can be calculated according to the following formula (not reproduced in this text):
The second exposure threshold is calculated according to the following formula (not reproduced in this text):
In summary, the embodiments of the present invention first crawl, through a search engine, the web pages published on the target website during the current statistical period and perform word segmentation on the text information in the web pages to obtain the words that make up the text information. They then count the exposure frequency of each word in the text information and determine the words whose exposure frequency exceeds a preset first exposure threshold to be hot words. Finally, they count the exposure frequency of each enterprise name in the preferred text information and, from those frequencies, calculate the degree of association between each enterprise name and the hot word. On the one hand, the embodiments provide an objective evaluation standard for determining hot words and remove the dependence on the personal judgment of network analysts, so the hot words determined reflect the real situation more objectively. On the other hand, once the hot words are obtained, the relationship between enterprises and hot words is taken into account, so the analysis result offers stronger guidance for enterprises.
It should be understood that the serial numbers of the steps in the above embodiments do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and the numbering should not limit the implementation of the embodiments of the present invention in any way.
Corresponding to the hot word analysis method described in the foregoing embodiments, Fig. 4 shows a structural diagram of one embodiment of a hot word analysis apparatus provided by an embodiment of the present invention.
In this embodiment, a hot word analysis apparatus may include:
a web page crawling module 401, configured to crawl, through a search engine, the web pages published on a target website during the current statistical period, the target website being a website whose pageviews exceed a preset pageview threshold;
a word segmentation module 402, configured to perform word segmentation on the text information in the web pages to obtain the words that make up the text information;
a first statistics module 403, configured to count the exposure frequency of each word in the text information;
a first hot word determination module 404, configured to determine the words whose exposure frequency in the text information exceeds a preset first exposure threshold to be hot words;
a second statistics module 405, configured to count the exposure frequency of each enterprise name in preferred text information, the preferred text information being the text information that contains the hot word;
an association calculation module 406, configured to calculate the degree of association between each enterprise name and the hot word according to the exposure frequency of each enterprise name in the preferred text information.
Further, the hot word analysis apparatus may also include:
a candidate word determination module, configured to determine the words whose exposure frequency in the text information is less than or equal to the first exposure threshold and greater than a preset second exposure threshold to be candidate words;
a third statistics module, configured to obtain, from the historical statistics records, the exposure frequency of each candidate word in the T statistical periods before the current statistical period, where T is a positive integer;
a second hot word determination module, configured to determine the candidate words that satisfy the following condition to be hot words:
For every value of t, the inequality (not reproduced in this text) holds, where n is the serial number of the candidate word, 1 ≤ n ≤ N, N is the total number of candidate words, t is the serial number of each statistical period in chronological order, 1 ≤ t ≤ T, $ExpNum_{n,t}$ is the exposure frequency of the n-th candidate word in the t-th statistical period, $ExpNum_{n,T+1}$ is the exposure frequency of the n-th candidate word in the current statistical period, ln is the natural logarithm function, and ThreshRatio is a preset ratio threshold.
Further, the hot word analytical equipment can also include:
4th statistical module, for obtaining exposure of each history hot word in each measurement period in being recorded from historical statistics Optical frequency time, the history hot word is the hot word being had determined before the current statistic period;
First exposure sequence constructing module, the first exposure sequence for constructing each history hot word according to the following formula:
ExpSeq1nh={ HsExpNumnh,1,HsExpNumnh,2,......,HsExpNumnh,th,......, HsExpNumnh,THnh}
Wherein, nh is the serial number of the history hot word, and 1≤nh≤NH, NH are the sum of the history hot word, and th is each The serial number that measurement period is arranged in order according to chronological order, 1≤th≤THnh, THnhFor the statistics of n-th h history hot word The sum in period, HsExpNumnh,thFor the exposure frequency of n-th h history hot word in the th measurement period, ExpSeq1nh For the first exposure sequence of n-th h history hot word;
First exposure serial mean computing module, the mean value for calculating each first exposure sequence according to the following formula:
Wherein, AvExpSeq1nhFor the mean value of the n-th h first exposure sequence;
A first mean sequence construction module, configured to construct the following sequence, in which the means of the first exposure sequences are arranged in descending order:
$$\{\mathrm{AvExpSeq1}'_{1}, \mathrm{AvExpSeq1}'_{2}, \ldots, \mathrm{AvExpSeq1}'_{nh1}, \ldots, \mathrm{AvExpSeq1}'_{NH}\}$$
where $\mathrm{AvExpSeq1}'_{nh1}$ is the mean of the first exposure sequence ranked $nh1$-th in descending order, $1 \le nh1 \le NH$;
A first exposure threshold calculation module, configured to calculate the first exposure threshold according to the following formula:
[formula not reproduced in the source]
where $NMAX = \mathrm{floor}(\xi_{max} \times NH)$, $\xi_{max}$ is a preset coefficient with $0 < \xi_{max} < 1$, floor is the round-down function, and Threshold1 is the first exposure threshold;
A second exposure threshold calculation module, configured to calculate the second exposure threshold according to the following formula:
[formula not reproduced in the source]
where $NMIN = \mathrm{floor}(\xi_{min} \times NH)$, $\xi_{min}$ is a preset coefficient with $0 < \xi_{min} < 1$, and Threshold2 is the second exposure threshold.
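The quantities that feed the two exposure thresholds can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `rank_first_exposure_means` and the dictionary layout of `history` are assumptions introduced here, and because the threshold formulas themselves are not reproduced in this text, the sketch stops at the descending mean sequence and the ranks NMAX and NMIN.

```python
from math import floor
from statistics import fmean


def rank_first_exposure_means(history, xi_max, xi_min):
    """Compute the inputs of the exposure-threshold step.

    history maps each historical hot word to its first exposure sequence,
    i.e. its exposure frequencies per statistical period in chronological
    order (hypothetical data layout).
    """
    if not (0 < xi_max < 1 and 0 < xi_min < 1):
        raise ValueError("xi_max and xi_min must lie strictly between 0 and 1")
    # Mean of each first exposure sequence (AvExpSeq1).
    means = {word: fmean(seq) for word, seq in history.items()}
    # Means arranged in descending order (AvExpSeq1').
    ranked = sorted(means.values(), reverse=True)
    nh = len(ranked)
    # NMAX = floor(xi_max * NH) and NMIN = floor(xi_min * NH).
    return means, ranked, floor(xi_max * nh), floor(xi_min * nh)


history = {"a": [10, 20, 30], "b": [4, 6], "c": [100], "d": [1, 1, 1, 1]}
means, ranked, nmax, nmin = rank_first_exposure_means(history, 0.5, 0.25)
```

How Threshold1 and Threshold2 are derived from `ranked`, `nmax` and `nmin` depends on the omitted formulas and is deliberately left out.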
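Further, the hot word analysis apparatus may also include:
A second exposure sequence construction module, configured to construct the second exposure sequence of each historical hot word according to the following formula:
$$\mathrm{ExpSeq2}_{nh} = \{\mathrm{HsExpNum}'_{nh,1}, \mathrm{HsExpNum}'_{nh,2}, \ldots, \mathrm{HsExpNum}'_{nh,th1}, \ldots, \mathrm{HsExpNum}'_{nh,TH_{nh}}\}$$
where $\mathrm{HsExpNum}'_{nh,th1} \in \mathrm{ExpSeq1}_{nh}$, $1 \le th1 \le TH_{nh}$, $\mathrm{HsExpNum}'_{nh,th1} \ge \mathrm{HsExpNum}'_{nh,th1+1}$, and $\mathrm{ExpSeq2}_{nh}$ is the second exposure sequence of the $nh$-th historical hot word; in other words, the second exposure sequence contains the elements of the first exposure sequence rearranged in descending order;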
A second exposure sequence mean calculation module, configured to calculate the mean of each second exposure sequence according to the following formula:
[formula not reproduced in the source]
where $\mathrm{AvExpSeq2}_{nh}$ is the mean of the $nh$-th second exposure sequence, and $TH1_{nh}$ satisfies the following conditions: $\mathrm{HsExpNum}'_{nh,TH1_{nh}} \ge \mathrm{AvExpSeq1}_{nh}$ and $\mathrm{HsExpNum}'_{nh,TH1_{nh}+1} < \mathrm{AvExpSeq1}_{nh}$;
A second mean sequence construction module, configured to construct the following sequence, in which the means of the second exposure sequences are arranged in descending order:
$$\{\mathrm{AvExpSeq2}'_{1}, \mathrm{AvExpSeq2}'_{2}, \ldots, \mathrm{AvExpSeq2}'_{nh1}, \ldots, \mathrm{AvExpSeq2}'_{NH}\}$$
where $\mathrm{AvExpSeq2}'_{nh1}$ is the mean of the second exposure sequence ranked $nh1$-th in descending order;
A first exposure threshold calculation module, configured to calculate the first exposure threshold according to the following formula:
[formula not reproduced in the source]
A second exposure threshold calculation module, configured to calculate the second exposure threshold according to the following formula:
[formula not reproduced in the source]
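The construction of the second exposure sequence and the cut-off index TH1 can be illustrated with a short Python sketch. The function name `second_exposure_mean` is hypothetical, and the final averaging step is an assumption: the formula for AvExpSeq2 is not reproduced in this text, so the sketch assumes that it averages only the first TH1 elements of the second exposure sequence, which is the reading suggested by the definition of TH1.

```python
from statistics import fmean


def second_exposure_mean(first_sequence):
    """Return (ExpSeq2, TH1, assumed AvExpSeq2) for one historical hot word.

    first_sequence holds integer exposure frequencies per statistical period.
    """
    if not first_sequence:
        raise ValueError("the first exposure sequence must not be empty")
    av1 = fmean(first_sequence)                  # AvExpSeq1
    seq2 = sorted(first_sequence, reverse=True)  # ExpSeq2: descending order
    # TH1 is the last 1-based position whose value is still >= AvExpSeq1;
    # the next position (if any) holds a value below AvExpSeq1.
    th1 = sum(1 for value in seq2 if value >= av1)
    # ASSUMPTION: AvExpSeq2 is the mean of the first TH1 elements only.
    av2 = fmean(seq2[:th1])
    return seq2, th1, av2


seq2, th1, av2 = second_exposure_mean([5, 1, 9, 3, 2])
```

Under this assumed reading, periods in which a historical hot word was barely exposed no longer drag down its mean, so the thresholds derived from the second mean sequence would reflect peak exposure rather than lifetime exposure.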
Further, the current statistical period includes M sub-periods, where M is a positive integer, and the second statistics module may include:
A sub-period statistics unit, configured to count the exposure frequency of each enterprise name in the preferred text information of each sub-period;
The degree-of-association calculation module may include:
A first calculation unit, configured to calculate the degree of association between each enterprise name and the hot word according to the following formula:
[formula not reproduced in the source]
where $q$ is the index of an enterprise name, $1 \le q \le Q$, and $Q$ is the total number of enterprise names; $p$ is the index of a hot word, $1 \le p \le P$, and $P$ is the total number of hot words; $m$ is the index of the sub-periods arranged in chronological order, $1 \le m \le M$; $\mathrm{EntExpNum}_{q,p,m}$ is the exposure frequency of the $q$-th enterprise name in the preferred text information that contains the $p$-th hot word in the $m$-th sub-period; $k_m$ is a preset weight coefficient satisfying $k_m < k_{m+1}$ and [condition not reproduced in the source]; and $\mathrm{Rel}_{q,p}$ is the degree of association between the $q$-th enterprise name and the $p$-th hot word.
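A time-weighted degree of association of the kind described above might look like the following Python sketch. The association formula is not reproduced in this text, so the weighted sum used here is an assumption, as is the function name `association_degree`; only the requirement that the weights increase with the sub-period index comes from the description.

```python
def association_degree(sub_period_counts, weights):
    """Combine per-sub-period exposure frequencies into one association score.

    sub_period_counts[m] is the exposure frequency of one enterprise name in
    the preferred text of sub-period m (chronological order); weights[m] is
    the preset coefficient k_m.
    """
    if len(sub_period_counts) != len(weights):
        raise ValueError("one weight is required per sub-period")
    # The description requires k_m < k_{m+1}: later sub-periods weigh more.
    if any(a >= b for a, b in zip(weights, weights[1:])):
        raise ValueError("weights must be strictly increasing")
    # ASSUMPTION: the degree of association is the weighted sum of the counts.
    return sum(k * c for k, c in zip(weights, sub_period_counts))


rel = association_degree([2, 0, 4], [0.2, 0.3, 0.5])
```

With increasing weights, an enterprise name that co-occurs with the hot word mostly in recent sub-periods scores higher than one with the same total exposure concentrated early in the period.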
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus, modules and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not detailed or recorded in one embodiment, refer to the related descriptions of the other embodiments.
Fig. 5 shows a schematic block diagram of a hot word analysis terminal device provided by an embodiment of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown.
In this embodiment, the hot word analysis terminal device 5 may be a computing device such as a mobile phone, a tablet computer, a desktop computer, a notebook computer or a cloud server. The hot word analysis terminal device 5 may include a processor 50, a memory 51, and computer-readable instructions 52 that are stored in the memory 51 and executable on the processor 50, for example computer-readable instructions for executing the hot word analysis method described above. When executing the computer-readable instructions 52, the processor 50 implements the steps of the foregoing hot word analysis method embodiments, for example steps S101 to S106 shown in Fig. 1. Alternatively, when executing the computer-readable instructions 52, the processor 50 implements the functions of the modules/units in the foregoing apparatus embodiments, for example the functions of modules 401 to 406 shown in Fig. 4.
Illustratively, the computer-readable instructions 52 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the processor 50 to carry out the present invention. The one or more modules/units may be a series of computer-readable instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer-readable instructions 52 in the hot word analysis terminal device 5.
The processor 50 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 51 may be an internal storage unit of the hot word analysis terminal device 5, such as a hard disk or memory of the hot word analysis terminal device 5. The memory 51 may also be an external storage device of the hot word analysis terminal device 5, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card fitted to the hot word analysis terminal device 5. Further, the memory 51 may include both an internal storage unit and an external storage device of the hot word analysis terminal device 5. The memory 51 is used to store the computer-readable instructions and other instructions and data required by the hot word analysis terminal device 5. The memory 51 may also be used to temporarily store data that has been output or is about to be output.
The functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several computer-readable instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store computer-readable instructions, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The embodiments described above are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A hot word analysis method, characterized by comprising:
crawling, by a search engine, the web pages published on target websites within the current statistical period, where a target website is a website whose page views exceed a preset page-view threshold;
performing word segmentation on the text information in the web pages to obtain the segmented words that make up the text information;
counting the exposure frequency of each segmented word in the text information;
determining the segmented words whose exposure frequency in the text information is greater than a preset first exposure threshold as hot words;
counting the exposure frequency of each enterprise name in preferred text information, where the preferred text information is text information that contains the hot word; and
calculating the degree of association between each enterprise name and the hot word according to the exposure frequency of each enterprise name in the preferred text information.
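By way of illustration only, the counting steps of claim 1 can be sketched in Python. The function names `find_hot_words` and `enterprise_exposure` are hypothetical, crawling and word segmentation are assumed to have already produced one token list per text, and each enterprise name is assumed to appear as a single token.

```python
from collections import Counter


def find_hot_words(documents, first_threshold):
    """Return the segmented words whose exposure frequency exceeds the threshold.

    documents is a list of token lists, one per piece of text information.
    """
    counts = Counter(token for doc in documents for token in doc)
    return {word for word, count in counts.items() if count > first_threshold}


def enterprise_exposure(documents, hot_word, enterprise_names):
    """Count each enterprise name in the preferred text for one hot word."""
    # Preferred text information: only the texts that contain the hot word.
    preferred = [doc for doc in documents if hot_word in doc]
    counts = Counter(token for doc in preferred for token in doc)
    return {name: counts[name] for name in enterprise_names}


docs = [["ai", "acme", "ai"], ["ai", "globex"], ["weather", "acme"]]
hot = find_hot_words(docs, first_threshold=2)
exposure = enterprise_exposure(docs, "ai", ["acme", "globex", "initech"])
```

The per-enterprise counts returned by `enterprise_exposure` are the inputs from which the degree of association in the last step would be calculated.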
2. The hot word analysis method according to claim 1, characterized in that, after counting the exposure frequency of each segmented word in the text information, the method further comprises:
determining the segmented words whose exposure frequency in the text information is less than or equal to the first exposure threshold and greater than a preset second exposure threshold as candidate segmented words;
obtaining, from the historical statistics records, the exposure frequency of each candidate segmented word in the T statistical periods before the current statistical period, where T is a positive integer; and
determining the candidate segmented words that satisfy the following condition as hot words:
for every value of $t$, the inequality [formula not reproduced in the source] holds, where $n$ is the index of a candidate segmented word, $1 \le n \le N$, and $N$ is the total number of candidate segmented words; $t$ is the index of the statistical periods arranged in chronological order, $1 \le t \le T$; $\mathrm{ExpNum}_{n,t}$ is the exposure frequency of the $n$-th candidate segmented word in the $t$-th statistical period; $\mathrm{ExpNum}_{n,T+1}$ is the exposure frequency of the $n$-th candidate segmented word in the current statistical period; $\ln$ is the natural logarithm function; and ThreshRatio is a preset ratio threshold.
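The two-threshold classification and the "holds for every t" check of claim 2 can be sketched in Python as follows. The function names are hypothetical, and because the inequality itself is not reproduced in this text, it is represented by a caller-supplied predicate `is_trending(past, current)`; the predicate used in the usage lines is a stand-in, not the claimed inequality. Treating a candidate with no recorded history as not promoted is also an assumption.

```python
def classify_tokens(counts, first_threshold, second_threshold):
    """Split segmented words into hot words and candidate segmented words."""
    hot = {w for w, c in counts.items() if c > first_threshold}
    candidates = {
        w for w, c in counts.items() if second_threshold < c <= first_threshold
    }
    return hot, candidates


def promote_candidates(candidates, current_counts, history, is_trending):
    """Promote candidates for which the condition holds in every past period.

    history[w] lists the exposure frequencies of w in the T previous
    statistical periods; is_trending(past, current) stands in for the
    inequality, which is not reproduced in the source.
    """
    promoted = set()
    for word in candidates:
        past = history.get(word, [])
        # ASSUMPTION: a candidate without any history is not promoted.
        if past and all(is_trending(p, current_counts[word]) for p in past):
            promoted.add(word)
    return promoted


counts = {"a": 50, "b": 20, "c": 5}
hot, candidates = classify_tokens(counts, first_threshold=30, second_threshold=10)
stand_in = lambda past, current: current > 2 * past  # placeholder predicate
promoted = promote_candidates(candidates, counts, {"b": [2, 3]}, stand_in)
```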
3. The hot word analysis method according to claim 2, characterized in that the process of setting the first exposure threshold and the second exposure threshold comprises:
obtaining, from the historical statistics records, the exposure frequency of each historical hot word in each statistical period, where a historical hot word is a hot word determined before the current statistical period;
constructing the first exposure sequence of each historical hot word according to the following formula:
$$\mathrm{ExpSeq1}_{nh} = \{\mathrm{HsExpNum}_{nh,1}, \mathrm{HsExpNum}_{nh,2}, \ldots, \mathrm{HsExpNum}_{nh,th}, \ldots, \mathrm{HsExpNum}_{nh,TH_{nh}}\}$$
where $nh$ is the index of a historical hot word, $1 \le nh \le NH$, and $NH$ is the total number of historical hot words; $th$ is the index of the statistical periods arranged in chronological order, $1 \le th \le TH_{nh}$; $TH_{nh}$ is the total number of statistical periods of the $nh$-th historical hot word; $\mathrm{HsExpNum}_{nh,th}$ is the exposure frequency of the $nh$-th historical hot word in the $th$-th statistical period; and $\mathrm{ExpSeq1}_{nh}$ is the first exposure sequence of the $nh$-th historical hot word;
calculating the mean of each first exposure sequence according to the following formula:
$$\mathrm{AvExpSeq1}_{nh} = \frac{1}{TH_{nh}} \sum_{th=1}^{TH_{nh}} \mathrm{HsExpNum}_{nh,th}$$
where $\mathrm{AvExpSeq1}_{nh}$ is the mean of the $nh$-th first exposure sequence;
constructing the following sequence, in which the means of the first exposure sequences are arranged in descending order:
$$\{\mathrm{AvExpSeq1}'_{1}, \mathrm{AvExpSeq1}'_{2}, \ldots, \mathrm{AvExpSeq1}'_{nh1}, \ldots, \mathrm{AvExpSeq1}'_{NH}\}$$
where $\mathrm{AvExpSeq1}'_{nh1}$ is the mean of the first exposure sequence ranked $nh1$-th in descending order, $1 \le nh1 \le NH$;
calculating the first exposure threshold according to the following formula:
[formula not reproduced in the source]
where $NMAX = \mathrm{floor}(\xi_{max} \times NH)$, $\xi_{max}$ is a preset coefficient with $0 < \xi_{max} < 1$, floor is the round-down function, and Threshold1 is the first exposure threshold; and
calculating the second exposure threshold according to the following formula:
[formula not reproduced in the source]
where $NMIN = \mathrm{floor}(\xi_{min} \times NH)$, $\xi_{min}$ is a preset coefficient with $0 < \xi_{min} < 1$, and Threshold2 is the second exposure threshold.
4. The hot word analysis method according to claim 3, characterized in that, after calculating the mean of each first exposure sequence, the method further comprises:
constructing the second exposure sequence of each historical hot word according to the following formula:
$$\mathrm{ExpSeq2}_{nh} = \{\mathrm{HsExpNum}'_{nh,1}, \mathrm{HsExpNum}'_{nh,2}, \ldots, \mathrm{HsExpNum}'_{nh,th1}, \ldots, \mathrm{HsExpNum}'_{nh,TH_{nh}}\}$$
where $\mathrm{HsExpNum}'_{nh,th1} \in \mathrm{ExpSeq1}_{nh}$, $1 \le th1 \le TH_{nh}$, $\mathrm{HsExpNum}'_{nh,th1} \ge \mathrm{HsExpNum}'_{nh,th1+1}$, and $\mathrm{ExpSeq2}_{nh}$ is the second exposure sequence of the $nh$-th historical hot word;
calculating the mean of each second exposure sequence according to the following formula:
[formula not reproduced in the source]
where $\mathrm{AvExpSeq2}_{nh}$ is the mean of the $nh$-th second exposure sequence, and $TH1_{nh}$ satisfies the following conditions: $\mathrm{HsExpNum}'_{nh,TH1_{nh}} \ge \mathrm{AvExpSeq1}_{nh}$ and $\mathrm{HsExpNum}'_{nh,TH1_{nh}+1} < \mathrm{AvExpSeq1}_{nh}$;
constructing the following sequence, in which the means of the second exposure sequences are arranged in descending order:
$$\{\mathrm{AvExpSeq2}'_{1}, \mathrm{AvExpSeq2}'_{2}, \ldots, \mathrm{AvExpSeq2}'_{nh1}, \ldots, \mathrm{AvExpSeq2}'_{NH}\}$$
where $\mathrm{AvExpSeq2}'_{nh1}$ is the mean of the second exposure sequence ranked $nh1$-th in descending order;
calculating the first exposure threshold according to the following formula:
[formula not reproduced in the source]
calculating the second exposure threshold according to the following formula:
[formula not reproduced in the source]
5. The hot word analysis method according to any one of claims 1 to 4, characterized in that the current statistical period includes M sub-periods, where M is a positive integer, and counting the exposure frequency of each enterprise name in the preferred text information comprises:
counting the exposure frequency of each enterprise name in the preferred text information of each sub-period;
and calculating the degree of association between each enterprise name and the hot word according to the exposure frequency of each enterprise name in the preferred text information comprises:
calculating the degree of association between each enterprise name and the hot word according to the following formula:
[formula not reproduced in the source]
where $q$ is the index of an enterprise name, $1 \le q \le Q$, and $Q$ is the total number of enterprise names; $p$ is the index of a hot word, $1 \le p \le P$, and $P$ is the total number of hot words; $m$ is the index of the sub-periods arranged in chronological order, $1 \le m \le M$; $\mathrm{EntExpNum}_{q,p,m}$ is the exposure frequency of the $q$-th enterprise name in the preferred text information that contains the $p$-th hot word in the $m$-th sub-period; $k_m$ is a preset weight coefficient satisfying $k_m < k_{m+1}$ and [condition not reproduced in the source]; and $\mathrm{Rel}_{q,p}$ is the degree of association between the $q$-th enterprise name and the $p$-th hot word.
6. A computer-readable storage medium storing computer-readable instructions, characterized in that the steps of the hot word analysis method according to any one of claims 1 to 5 are implemented when the computer-readable instructions are executed by a processor.
7. A hot word analysis terminal device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, characterized in that the processor implements the following steps when executing the computer-readable instructions:
crawling, by a search engine, the web pages published on target websites within the current statistical period, where a target website is a website whose page views exceed a preset page-view threshold;
performing word segmentation on the text information in the web pages to obtain the segmented words that make up the text information;
counting the exposure frequency of each segmented word in the text information;
determining the segmented words whose exposure frequency in the text information is greater than a preset first exposure threshold as hot words;
counting the exposure frequency of each enterprise name in preferred text information, where the preferred text information is text information that contains the hot word; and
calculating the degree of association between each enterprise name and the hot word according to the exposure frequency of each enterprise name in the preferred text information.
8. The hot word analysis terminal device according to claim 7, characterized in that, after counting the exposure frequency of each segmented word in the text information, the steps further comprise:
determining the segmented words whose exposure frequency in the text information is less than or equal to the first exposure threshold and greater than a preset second exposure threshold as candidate segmented words;
obtaining, from the historical statistics records, the exposure frequency of each candidate segmented word in the T statistical periods before the current statistical period, where T is a positive integer; and
determining the candidate segmented words that satisfy the following condition as hot words:
for every value of $t$, the inequality [formula not reproduced in the source] holds, where $n$ is the index of a candidate segmented word, $1 \le n \le N$, and $N$ is the total number of candidate segmented words; $t$ is the index of the statistical periods arranged in chronological order, $1 \le t \le T$; $\mathrm{ExpNum}_{n,t}$ is the exposure frequency of the $n$-th candidate segmented word in the $t$-th statistical period; $\mathrm{ExpNum}_{n,T+1}$ is the exposure frequency of the $n$-th candidate segmented word in the current statistical period; $\ln$ is the natural logarithm function; and ThreshRatio is a preset ratio threshold.
9. The hot word analysis terminal device according to claim 8, characterized in that the process of setting the first exposure threshold and the second exposure threshold comprises:
obtaining, from the historical statistics records, the exposure frequency of each historical hot word in each statistical period, where a historical hot word is a hot word determined before the current statistical period;
constructing the first exposure sequence of each historical hot word according to the following formula:
$$\mathrm{ExpSeq1}_{nh} = \{\mathrm{HsExpNum}_{nh,1}, \mathrm{HsExpNum}_{nh,2}, \ldots, \mathrm{HsExpNum}_{nh,th}, \ldots, \mathrm{HsExpNum}_{nh,TH_{nh}}\}$$
where $nh$ is the index of a historical hot word, $1 \le nh \le NH$, and $NH$ is the total number of historical hot words; $th$ is the index of the statistical periods arranged in chronological order, $1 \le th \le TH_{nh}$; $TH_{nh}$ is the total number of statistical periods of the $nh$-th historical hot word; $\mathrm{HsExpNum}_{nh,th}$ is the exposure frequency of the $nh$-th historical hot word in the $th$-th statistical period; and $\mathrm{ExpSeq1}_{nh}$ is the first exposure sequence of the $nh$-th historical hot word;
calculating the mean of each first exposure sequence according to the following formula:
$$\mathrm{AvExpSeq1}_{nh} = \frac{1}{TH_{nh}} \sum_{th=1}^{TH_{nh}} \mathrm{HsExpNum}_{nh,th}$$
where $\mathrm{AvExpSeq1}_{nh}$ is the mean of the $nh$-th first exposure sequence;
constructing the following sequence, in which the means of the first exposure sequences are arranged in descending order:
$$\{\mathrm{AvExpSeq1}'_{1}, \mathrm{AvExpSeq1}'_{2}, \ldots, \mathrm{AvExpSeq1}'_{nh1}, \ldots, \mathrm{AvExpSeq1}'_{NH}\}$$
where $\mathrm{AvExpSeq1}'_{nh1}$ is the mean of the first exposure sequence ranked $nh1$-th in descending order, $1 \le nh1 \le NH$;
calculating the first exposure threshold according to the following formula:
[formula not reproduced in the source]
where $NMAX = \mathrm{floor}(\xi_{max} \times NH)$, $\xi_{max}$ is a preset coefficient with $0 < \xi_{max} < 1$, floor is the round-down function, and Threshold1 is the first exposure threshold; and
calculating the second exposure threshold according to the following formula:
[formula not reproduced in the source]
where $NMIN = \mathrm{floor}(\xi_{min} \times NH)$, $\xi_{min}$ is a preset coefficient with $0 < \xi_{min} < 1$, and Threshold2 is the second exposure threshold.
10. The hot word analysis terminal device according to any one of claims 7 to 9, characterized in that the current statistical period includes M sub-periods, where M is a positive integer, and counting the exposure frequency of each enterprise name in the preferred text information comprises:
counting the exposure frequency of each enterprise name in the preferred text information of each sub-period;
and calculating the degree of association between each enterprise name and the hot word according to the exposure frequency of each enterprise name in the preferred text information comprises:
calculating the degree of association between each enterprise name and the hot word according to the following formula:
[formula not reproduced in the source]
where $q$ is the index of an enterprise name, $1 \le q \le Q$, and $Q$ is the total number of enterprise names; $p$ is the index of a hot word, $1 \le p \le P$, and $P$ is the total number of hot words; $m$ is the index of the sub-periods arranged in chronological order, $1 \le m \le M$; $\mathrm{EntExpNum}_{q,p,m}$ is the exposure frequency of the $q$-th enterprise name in the preferred text information that contains the $p$-th hot word in the $m$-th sub-period; $k_m$ is a preset weight coefficient satisfying $k_m < k_{m+1}$ and [condition not reproduced in the source]; and $\mathrm{Rel}_{q,p}$ is the degree of association between the $q$-th enterprise name and the $p$-th hot word.
CN201810456973.4A 2018-05-14 2018-05-14 Hot word analysis method, computer readable storage medium and terminal device Active CN108710664B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810456973.4A CN108710664B (en) 2018-05-14 2018-05-14 Hot word analysis method, computer readable storage medium and terminal device
PCT/CN2018/096267 WO2019218452A1 (en) 2018-05-14 2018-07-19 Method, computer readable storage medium, terminal apparatus, and device for analyzing trending terms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810456973.4A CN108710664B (en) 2018-05-14 2018-05-14 Hot word analysis method, computer readable storage medium and terminal device

Publications (2)

Publication Number Publication Date
CN108710664A true CN108710664A (en) 2018-10-26
CN108710664B CN108710664B (en) 2023-04-18

Family

ID=63868099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810456973.4A Active CN108710664B (en) 2018-05-14 2018-05-14 Hot word analysis method, computer readable storage medium and terminal device

Country Status (2)

Country Link
CN (1) CN108710664B (en)
WO (1) WO2019218452A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110427381A (en) * 2019-08-07 2019-11-08 北京嘉和海森健康科技有限公司 A kind of data processing method and relevant device
CN111310018A (en) * 2018-12-11 2020-06-19 阿里巴巴集团控股有限公司 Determining method of timeliness search vocabulary and search engine
CN111737553A (en) * 2020-06-16 2020-10-02 苏州朗动网络科技有限公司 Method and device for selecting enterprise associated words and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117034904B (en) * 2023-10-09 2023-12-08 北京睿企信息科技有限公司 Method for obtaining hot words with stable heat, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727494A (en) * 2009-12-29 2010-06-09 华中师范大学 Network hot word generating system in specific area
CN103186675A (en) * 2013-04-03 2013-07-03 南京安讯科技有限责任公司 Automatic webpage classification method based on network hot word identification
US20170169062A1 (en) * 2015-12-14 2017-06-15 Le Holdings (Beijing) Co., Ltd. Method and electronic device for recommending video

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246499B (en) * 2008-03-27 2010-10-13 腾讯科技(深圳)有限公司 Network information search method and system
CN103106227A (en) * 2012-08-03 2013-05-15 人民搜索网络股份公司 System and method of looking up new word based on webpage text
CN105045882B (en) * 2015-07-21 2018-09-25 无锡天脉聚源传媒科技有限公司 A kind of hot word processing method and processing device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111310018A (en) * 2018-12-11 2020-06-19 阿里巴巴集团控股有限公司 Determining method of timeliness search vocabulary and search engine
CN111310018B (en) * 2018-12-11 2024-03-01 阿里巴巴集团控股有限公司 Method for determining timeliness search vocabulary and search engine
CN110427381A (en) * 2019-08-07 2019-11-08 北京嘉和海森健康科技有限公司 A kind of data processing method and relevant device
CN111737553A (en) * 2020-06-16 2020-10-02 苏州朗动网络科技有限公司 Method and device for selecting enterprise associated words and storage medium

Also Published As

Publication number Publication date
CN108710664B (en) 2023-04-18
WO2019218452A1 (en) 2019-11-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant