CN107783961A - Method, apparatus and readable storage medium for hot topic identification - Google Patents

Method, apparatus and readable storage medium for hot topic identification

Info

Publication number
CN107783961A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711092187.2A
Other languages
Chinese (zh)
Inventor
毕银龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Application filed by Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201711092187.2A
Publication of CN107783961A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hot topic identification method: collect the text corresponding to a forum; divide the text into words with a word segmentation tool; screen the words against a corpus and calculate, for each retained word in turn, the frequency with which it occurs among all the retained words; select the words whose frequency is greater than a set value as hot topics; wherein the dictionary of the word segmentation tool includes words in a preset standard format. Because the dictionary of the word segmentation tool includes words in a preset standard format, everyday expressions and internet slang can be entered as such words. When the text is segmented, these words are recognized and output as segmented words, and the segmented words are then screened against the corpus. Hot topics can therefore be identified more accurately. The invention also discloses an apparatus and a computer-readable storage medium for hot topic identification, which have the same effect.

Description

Method, apparatus and readable storage medium for hot topic identification
Technical field
The present invention relates to the field of computers, and in particular to a method, an apparatus and a computer-readable storage medium for hot topic identification.
Background
With the development of computer networks, viewpoints and comments of every kind appear on the network in an endless stream. In order to understand hot social events in time, observe social trends, and guide enterprises, governments and other bodies toward appropriate decisions, it is often necessary to analyze the comments and viewpoints on the network and identify the hot topics.
In the prior art, the acquired text is generally divided into words, the frequency with which each word occurs is counted directly, and the high-frequency words are chosen as hot topics. On network forums, however, users' comments contain a great deal of internet slang and everyday expressions, which are often phrased in non-standard ways and easily cause segmentation errors. Moreover, some of the words obtained after segmentation cannot serve as topics at all, so the high-frequency words finally selected may not be usable as hot topics.
Therefore, how to identify hot topics more accurately is a problem that those skilled in the art currently need to solve.
Summary of the invention
The object of the present invention is to provide a method, an apparatus and a computer-readable storage medium for hot topic identification that identify hot social topics more accurately and effectively.
In order to solve the above technical problem, the present invention provides a hot topic identification method, comprising:
collecting the text corresponding to a forum;
dividing the text into words with a word segmentation tool;
screening the words against a corpus, and calculating, for each retained word in turn, the frequency with which it occurs among all the retained words;
selecting the words whose frequency is greater than a set value as hot topics;
wherein the dictionary of the word segmentation tool includes words in a preset standard format.
Preferably, after collecting the text corresponding to the forum, the method further comprises:
preprocessing the collected text, and then proceeding to the step of dividing the text into words with the word segmentation tool.
Preferably, preprocessing the collected text specifically comprises:
identifying the wrongly written characters and emoticons in the text, and correcting the text;
deleting the stop words in the text.
Preferably, after dividing the text into words with the word segmentation tool, the method further comprises:
merging the words affected by a segmentation error, and then proceeding to the step of screening the words against the corpus.
Preferably, after selecting the words whose frequency is greater than the set value as hot topics, the method further comprises:
analyzing the text that includes the hot topic with a sentiment dictionary to obtain the sentiment orientation of the corresponding user.
Preferably, collecting the text corresponding to the forum specifically comprises:
obtaining the URL links of the web pages corresponding to the forum by iterating with a crawler;
obtaining the web pages according to the URL links;
matching regular expressions against the web pages to obtain the required text.
The present invention also provides a hot topic identification apparatus, comprising:
a collection unit for collecting the text corresponding to a forum;
a division unit for dividing the text into words with a word segmentation tool;
a screening and calculation unit for screening the words against a corpus and calculating, for each retained word in turn, the frequency with which it occurs among all the retained words;
a selection unit for selecting the words whose frequency is greater than a set value as hot topics;
wherein the dictionary of the word segmentation tool includes words in a preset standard format.
Preferably, the apparatus further comprises:
a preprocessing unit for preprocessing the collected text.
The present invention also provides a hot topic identification apparatus comprising a processor, the processor being configured to implement the steps of any of the hot topic identification methods described above when executing a program stored in a memory.
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
collecting the text corresponding to a forum;
dividing the text into words with a word segmentation tool;
screening the words against a corpus, and calculating, for each retained word in turn, the frequency with which it occurs among all the retained words;
selecting the words whose frequency is greater than a set value as hot topics;
wherein the dictionary of the word segmentation tool includes words in a preset standard format.
In the method provided by the invention, the text corresponding to a forum is collected; the text is divided into words with a word segmentation tool; the words are screened against a corpus and, for each retained word in turn, the frequency with which it occurs among all the retained words is calculated; and the words whose frequency is greater than a set value are selected as hot topics, the dictionary of the word segmentation tool including words in a preset standard format. Because the dictionary includes such words, everyday expressions and internet slang can be entered in it as words in the preset standard format. When the text is segmented, these words are recognized and output as segmented words. The segmented words are then screened against the corpus, so that words which cannot serve as topics have no frequency calculated and are never selected as hot topics. Hot topics can therefore be identified more accurately. The hot topic identification apparatus and the computer-readable storage medium provided by the invention have the same effect.
Brief description of the drawings
In order to explain the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a hot topic identification method provided by an embodiment of the present invention;
Fig. 2 is a structural diagram of a hot topic identification apparatus provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of another hot topic identification apparatus provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
The object of the present invention is to provide a method, an apparatus and a computer-readable storage medium for hot topic identification that identify hot social topics more accurately and effectively.
In order that those skilled in the art may better understand the technical solution, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flow chart of a hot topic identification method provided by an embodiment of the present invention. As shown in Fig. 1, the hot topic identification method comprises the following steps:
S10: Collect the text corresponding to a forum.
A forum is a platform on which users post and reply in order to discuss topics; users can publish information or put forward views in a forum. Texts in a forum are generally short and contain many everyday expressions or internet slang.
S11: Divide the text into words with a word segmentation tool.
A text generally contains several sentences. The word segmentation tool divides each sentence of the collected text into a number of independent words. The word segmentation tool may include a dictionary, and the words in the dictionary can be used as the reference when the text is segmented.
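The source does not say which segmentation algorithm the tool uses. As one possible illustration of dictionary-referenced segmentation, the sketch below uses forward maximum matching, a technique chosen here and not named by the patent. The function name `segment`, the `max_len` parameter and the sample dictionary (including the slang entry "给力") are all hypothetical.

```python
def segment(text, dictionary, max_len=6):
    """Forward maximum matching: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical dictionary that lists a slang term ("给力") as a standard entry.
DICTIONARY = {"这个", "手机", "真的", "给力"}
print(segment("这个手机真的给力", DICTIONARY))
```

Because the slang term is a dictionary entry, it comes out as one word instead of being split into single characters.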
S12: Screen the words against a corpus, and calculate, for each retained word in turn, the frequency with which it occurs among all the retained words.
The corpus may contain preset words, each of which is able to serve as a hot topic. From all the words obtained by dividing all the sentences, the words contained in the corpus are retained, and the words not contained in the corpus may be deleted. The total number of occurrences of each retained word is then counted in turn, and this count is divided by the total number of words remaining after screening to give the frequency with which each word occurs.
In this way, a word that is not in the corpus cannot become a hot topic.
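The screening and frequency calculation of step S12 can be sketched as follows. This is a minimal illustration under the definition given above (occurrences of a word divided by the number of words left after screening); the function name `screened_frequencies` and the sample data are assumptions.

```python
from collections import Counter


def screened_frequencies(words, corpus):
    """Keep only the words found in the corpus, then return each word's relative frequency."""
    kept = [w for w in words if w in corpus]
    total = len(kept)
    if total == 0:
        return {}  # nothing survived screening, so there is nothing to count
    counts = Counter(kept)
    # Frequency = occurrences of the word / total number of words after screening.
    return {word: count / total for word, count in counts.items()}


# Hypothetical segmented words and corpus.
words = ["housing", "lol", "housing", "exam", "the"]
corpus = {"housing", "exam"}
frequencies = screened_frequencies(words, corpus)
```

Note that the denominator is the number of retained words, not the number of words before screening, so the frequencies of the retained words sum to 1.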
S13: Select the words whose frequency is greater than a set value as hot topics.
The greater the frequency, the more often the word occurs and the more it is being discussed by users. A set value is chosen as required, and the words whose frequency is greater than the set value are selected as hot topics.
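Step S13 reduces to a strict threshold comparison. The sketch below assumes the frequencies are held in a dictionary mapping each word to its frequency; the function name `select_hot_topics`, the ordering by descending frequency and the sample values are illustrative choices, not taken from the source.

```python
def select_hot_topics(frequencies, threshold):
    """Return the words whose frequency is strictly greater than the set value."""
    hot = [word for word, freq in frequencies.items() if freq > threshold]
    # Sorting by descending frequency is a presentation choice, not part of the method.
    return sorted(hot, key=lambda word: -frequencies[word])


# Hypothetical frequencies produced by the screening step.
sample = {"housing": 0.5, "exam": 0.3, "traffic": 0.2}
print(select_hot_topics(sample, 0.25))
```

A word whose frequency equals the set value is not selected, since the method requires the frequency to be greater than the set value.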
The dictionary of the word segmentation tool includes words in a preset standard format.
Everyday expressions, internet slang and other non-standard expressions are entered as words in the preset standard format. In this way, when the text is divided into words with the word segmentation tool, everyday expressions, internet slang and non-standard expressions can be recognized and split off as independent words, which avoids segmentation errors.
To summarize this embodiment: the text corresponding to a forum is collected; the text is divided into words with a word segmentation tool; the words are screened against a corpus and, for each retained word in turn, the frequency with which it occurs among all the retained words is calculated; and the words whose frequency is greater than a set value are selected as hot topics, the dictionary of the word segmentation tool including words in a preset standard format. Because everyday expressions and internet slang can be entered in the dictionary as words in the preset standard format, they are recognized and split off as independent words when the text is segmented. The segmented words are then screened against the corpus, so that words which cannot serve as topics have no frequency calculated and are never selected as hot topics. Hot topics can therefore be identified more accurately.
On the basis of the above embodiment, in order to segment the text more accurately and thus identify hot topics more accurately, the method further comprises, after collecting the text corresponding to the forum: preprocessing the collected text, and then proceeding to step S11.
Preferably, preprocessing the collected text specifically comprises identifying the wrongly written characters and emoticons in the text, correcting the text, and deleting the stop words in the text.
Because users often express themselves in non-standard ways, forum text may contain wrongly written characters or emoticons. A library of wrongly written characters and a library of emoticons can be set up in advance. The library of wrongly written characters may also hold the corrected form of each entry, and the emoticon library may also hold the word corresponding to the meaning each emoticon expresses. The wrongly written characters and emoticons in the text are detected with these preset libraries, and the text is corrected according to them. Stop words in the text can simply be deleted.
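The preprocessing described above can be sketched with three small lookup tables. The library contents, the names `TYPO_LIBRARY`, `EMOTICON_LIBRARY` and `STOP_WORDS`, and the use of plain substring replacement are all assumptions made for illustration; the source does not specify how the libraries are stored or matched.

```python
# Hypothetical libraries; a real system would load much larger ones.
TYPO_LIBRARY = {"帐号": "账号"}                   # wrongly written form -> corrected form
EMOTICON_LIBRARY = {":)": "开心", ":(": "难过"}   # emoticon -> word for its meaning
STOP_WORDS = {"的", "了", "啊"}


def preprocess(text):
    """Correct wrongly written characters, replace emoticons by words, delete stop words."""
    for wrong, right in TYPO_LIBRARY.items():
        text = text.replace(wrong, right)
    for emoticon, meaning in EMOTICON_LIBRARY.items():
        text = text.replace(emoticon, meaning)
    for stop_word in STOP_WORDS:
        text = text.replace(stop_word, "")
    return text


print(preprocess("我的帐号被封了:("))
```

Deleting stop words by substring, as done here before segmentation, can also remove characters that are part of longer words; an implementation might prefer to drop stop words after segmentation, which is a design choice the source leaves open.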
On the basis of the above embodiment, in order to identify hot topics more accurately, in a preferred embodiment the method further comprises, after dividing the text into words with the word segmentation tool: merging the words affected by a segmentation error, and then proceeding to step S12.
After the text has been divided into words with the word segmentation tool, the original text may be divided into words again with the word segmentation tool in order to check whether the resulting words are wrong, and the result is compared with the words obtained the first time. If the words obtained for some phrase differ between the two segmentations, those words are merged and the merged result is used as the segmented word; in other words, the phrase itself is taken as the word after segmentation.
Of course, the original text may additionally be segmented a third time with the word segmentation tool, and the words obtained from the third segmentation compared with those obtained from the first two. If the words obtained for some phrase differ in every segmentation, they are merged, that is, the phrase is used directly as the word after segmentation. The present invention does not limit the number of segmentations.
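One way to realize this merging is to keep only the cut points on which every segmentation agrees, so that any stretch of text where the runs disagree collapses into a single phrase. The sketch below is an interpretation under that assumption; the helper names `boundaries` and `merge_disagreements` are hypothetical, and the source does not explain how repeated runs of the tool come to differ (the two sample segmentations are simply given).

```python
from itertools import accumulate


def boundaries(words):
    """Character offsets at which each word of a segmentation ends."""
    return set(accumulate(len(word) for word in words))


def merge_disagreements(text, *segmentations):
    """Re-split the text using only the cut points shared by all segmentations."""
    common = set.intersection(*(boundaries(seg) for seg in segmentations))
    merged, start = [], 0
    for end in sorted(common):
        merged.append(text[start:end])
        start = end
    return merged


text = "我爱南京市长江大桥"
first = ["我", "爱", "南京市", "长江大桥"]
second = ["我", "爱", "南京", "市长", "江大桥"]
print(merge_disagreements(text, first, second))
```

The function accepts any number of segmentations, which matches the statement that the number of segmentation passes is not limited. It assumes every segmentation concatenates back to the original text.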
On the basis of the above embodiment, in order to understand people's attitudes and views toward a hot topic, in a preferred embodiment the method further comprises, after selecting the words whose frequency is greater than the set value as hot topics: analyzing the text that includes the hot topic with a sentiment dictionary to obtain the sentiment orientation of the corresponding user.
The sentiments covered by the sentiment dictionary can be roughly divided into three categories: positive, negative and neutral. A corresponding dictionary can be built for each category. The text that includes the hot topic is analyzed, and if a word from the dictionary of a given sentiment category appears in the text, the user's attitude is recorded as the sentiment type corresponding to that dictionary.
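A minimal sketch of the dictionary lookup follows. The source only says that a matching word assigns the corresponding sentiment type; it does not say what to do when words from several categories appear, so picking the category with the most matches is an assumption made here, as are the dictionary contents and the function name `sentiment_of`.

```python
# Hypothetical sentiment dictionaries, one per category.
SENTIMENT_DICTIONARIES = {
    "positive": {"support", "great", "happy"},
    "negative": {"angry", "terrible", "oppose"},
    "neutral": {"okay", "average"},
}


def sentiment_of(words):
    """Return the category whose dictionary matches the most words, or None if none match."""
    hits = {
        category: sum(word in vocabulary for word in words)
        for category, vocabulary in SENTIMENT_DICTIONARIES.items()
    }
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


print(sentiment_of(["housing", "prices", "terrible", "angry", "great"]))
```

The input is the list of segmented words of one text that mentions the hot topic; applying the function to each such text gives one orientation per user post.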
On the basis of the above embodiment, in order to collect forum text more efficiently and accurately, collecting the text corresponding to the forum specifically comprises: obtaining the URL links of the web pages corresponding to the forum by iterating with a crawler, obtaining the web pages according to the URL links, and matching regular expressions against the web pages to obtain the required text.
The URLs of the web pages are obtained with crawler technology, the structure of the web pages is then analyzed, and regular expressions are matched against the web pages, so that the text corresponding to the forum is captured and saved locally.
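The crawl, fetch and regular-expression steps can be sketched with the Python standard library. The page layout matched by `POST_RE` (posts inside `<div class="post">`), the same-site restriction, the `max_pages` limit and all function names are assumptions for illustration; a real forum would need patterns fitted to its own markup.

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

LINK_RE = re.compile(r'href="([^"]+)"')
# Hypothetical layout: each post body sits in <div class="post">...</div>.
POST_RE = re.compile(r'<div class="post">(.*?)</div>', re.S)


def fetch(url):
    """Download one page and decode it as UTF-8."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=50):
    """Iteratively follow links under start_url and collect the text of every post."""
    queue = deque([start_url])
    seen = {start_url}
    posts = []
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        fetched += 1
        posts.extend(post.strip() for post in POST_RE.findall(html))
        for href in LINK_RE.findall(html):
            link = urljoin(url, href)
            # Stay inside the forum and never queue the same URL twice.
            if link.startswith(start_url) and link not in seen:
                seen.add(link)
                queue.append(link)
    return posts
```

Passing the page loader as the `fetch_page` argument lets the crawl logic be exercised against canned pages without network access; the returned list can then be written to local storage.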
The embodiments of the hot topic identification method have been described in detail above. Based on the method described in those embodiments, an embodiment of the present invention provides a hot topic identification apparatus corresponding to the method. Because the apparatus embodiments correspond to the method embodiments, reference may be made to the description of the method embodiments for the apparatus embodiments, which are not described in detail again here.
Fig. 2 is a structural diagram of a hot topic identification apparatus provided by an embodiment of the present invention. As shown in Fig. 2, the hot topic identification apparatus includes:
a collection unit 20 for collecting the text corresponding to a forum;
a division unit 21 for dividing the text into words with a word segmentation tool;
a screening and calculation unit 22 for screening the words against a corpus and calculating, for each retained word in turn, the frequency with which it occurs among all the retained words;
a selection unit 23 for selecting the words whose frequency is greater than a set value as hot topics;
wherein the dictionary of the word segmentation tool includes words in a preset standard format.
In operation, the collection unit collects the text corresponding to a forum; the division unit divides the text into words with the word segmentation tool; the screening and calculation unit screens the words against the corpus and calculates, for each retained word in turn, the frequency with which it occurs among all the retained words; and the selection unit selects the words whose frequency is greater than the set value as hot topics. Because the dictionary of the word segmentation tool includes words in a preset standard format, everyday expressions and internet slang can be entered as such words, and the division unit recognizes them and splits them off as independent words when segmenting the text. The screening and calculation unit then screens the segmented words against the corpus, so that words which cannot serve as topics have no frequency calculated and are never selected as hot topics. Hot topics can therefore be identified more accurately.
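The way the four units of Fig. 2 hand data to one another can be sketched as a small class. This is only an illustration of the structure: the class name `HotTopicDevice`, its field names and the sample inputs are assumptions, and whitespace splitting stands in for a real word segmentation tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass
class HotTopicDevice:
    collect: Callable[[], Iterable[str]]   # role of collection unit 20
    divide: Callable[[str], List[str]]     # role of division unit 21
    corpus: Set[str]                       # used by screening and calculation unit 22
    threshold: float                       # set value used by selection unit 23

    def run(self) -> List[str]:
        words = [w for text in self.collect() for w in self.divide(text)]
        kept = [w for w in words if w in self.corpus]   # screening (unit 22)
        counts = Counter(kept)
        # Frequency among retained words, compared with the set value (unit 23).
        return [w for w, c in counts.items() if c / len(kept) > self.threshold]


device = HotTopicDevice(
    collect=lambda: ["housing prices rise", "housing exam"],
    divide=str.split,
    corpus={"housing", "exam", "prices"},
    threshold=0.3,
)
print(device.run())
```

Keeping the collection and division steps as injected callables mirrors the unit boundaries in the figure and lets either unit be replaced independently.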
On the basis of the above embodiment, in order to segment the text more accurately and thus identify hot topics more accurately, the hot topic identification apparatus further includes:
a preprocessing unit for preprocessing the collected text.
Preferably, the preprocessing unit is specifically configured to identify the wrongly written characters and emoticons in the text, correct the text, and delete the stop words in the text.
Based on the hot topic identification method described in the above embodiments, an embodiment of the present invention provides a further hot topic identification apparatus corresponding to the method. Because the apparatus embodiments correspond to the method embodiments, reference may be made to the description of the method embodiments for the apparatus embodiments, which are not described in detail again here.
Fig. 3 is a structural diagram of a hot topic identification apparatus provided by an embodiment of the present invention. As shown in Fig. 3, the hot topic identification apparatus includes:
a memory 30 and a processor 31.
The memory 30 is used to store a computer program.
The processor 31, when executing the computer program stored in the memory 30, can implement the following steps:
collecting the text corresponding to a forum;
dividing the text into words with a word segmentation tool;
screening the words against a corpus, and calculating, for each retained word in turn, the frequency with which it occurs among all the retained words;
selecting the words whose frequency is greater than a set value as hot topics;
wherein the dictionary of the word segmentation tool includes words in a preset standard format.
In some embodiments of the present invention, the processor 31 may also execute the computer program in the memory 30 to implement the following step:
preprocessing the collected text, and then proceeding to the step of dividing the text into words with the word segmentation tool.
In some embodiments of the present invention, the processor 31 may also execute the computer program in the memory 30 to implement the following steps:
identifying the wrongly written characters and emoticons in the text, and correcting the text;
deleting the stop words in the text.
In some embodiments of the present invention, the processor 31 may also execute the computer program in the memory 30 to implement the following step:
merging the words affected by a segmentation error, and then proceeding to the step of screening the words against the corpus.
In some embodiments of the present invention, the processor 31 may also execute the computer program in the memory 30 to implement the following step:
analyzing the text that includes the hot topic with a sentiment dictionary to obtain the sentiment orientation of the corresponding user.
In some embodiments of the present invention, the processor 31 may also execute the computer program in the memory 30 to implement the following steps:
obtaining the URL links of the web pages corresponding to the forum by iterating with a crawler;
obtaining the web pages according to the URL links;
matching regular expressions against the web pages to obtain the required text.
In the hot topic identification apparatus provided by this embodiment, the processor, when executing the computer program in the memory, collects the text corresponding to a forum; divides the text into words with a word segmentation tool; screens the words against a corpus and calculates, for each retained word in turn, the frequency with which it occurs among all the retained words; and selects the words whose frequency is greater than a set value as hot topics, the dictionary of the word segmentation tool including words in a preset standard format. Because everyday expressions and internet slang can be entered in the dictionary as words in the preset standard format, they are recognized and split off as independent words when the text is segmented. The segmented words are then screened against the corpus, so that words which cannot serve as topics have no frequency calculated and are never selected as hot topics. Hot topics can therefore be identified more accurately.
The present invention also provides a computer-readable storage medium corresponding to the above embodiments of the hot topic identification method. Because the embodiments of the computer-readable storage medium correspond to the method embodiments, reference may be made to the description of the method embodiments, and the details are not repeated here.
A computer program is stored on the computer-readable storage medium, and the computer program is executed by a processor to implement the following steps:
collecting the text corresponding to a forum;
dividing the text into words with a word segmentation tool;
screening the words against a corpus, and calculating, for each retained word in turn, the frequency with which it occurs among all the retained words;
selecting the words whose frequency is greater than a set value as hot topics;
wherein the dictionary of the word segmentation tool includes words in a preset standard format.
It should be noted that the computer-readable storage medium in the present invention may be a medium such as a USB flash drive or an optical disc, and is not specifically limited.
When the computer program in the computer-readable storage medium provided by the invention is executed by a processor, the text corresponding to a forum is collected; the text is divided into words with a word segmentation tool; the words are screened against a corpus and, for each retained word in turn, the frequency with which it occurs among all the retained words is calculated; and the words whose frequency is greater than a set value are selected as hot topics, the dictionary of the word segmentation tool including words in a preset standard format. Because everyday expressions and internet slang can be entered in the dictionary as words in the preset standard format, they are recognized and split off as independent words when the text is segmented. The segmented words are then screened against the corpus, so that words which cannot serve as topics have no frequency calculated and are never selected as hot topics. Hot topics can therefore be identified more accurately.
The method, apparatus and computer-readable storage medium for hot topic identification provided by the present invention have been described in detail above. The embodiments in this specification are described progressively: each embodiment focuses on its differences from the other embodiments, and the parts that are identical or similar between embodiments may be understood by cross-reference.
It should be pointed out that those skilled in the art can make improvements and modifications to the present invention without departing from its principles, and such improvements and modifications also fall within the scope of protection of the claims of the present invention.
It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that comprises the element.

Claims (10)

1. A method for identifying hot topics, characterized by comprising:
collecting text from a forum;
dividing the text into words using a word segmentation tool;
screening the words against a corpus, and calculating, for each screened word in turn, the frequency with which it occurs among all of the screened words;
selecting the words whose frequency is greater than a set value as hot topics;
wherein the dictionary of the word segmentation tool includes words in a preset standard form.
2. The method according to claim 1, characterized in that, after collecting the text from the forum, the method further comprises:
preprocessing the collected text, and then proceeding to the step of dividing the text into words using the word segmentation tool.
3. The method according to claim 2, characterized in that preprocessing the collected text specifically comprises:
obtaining the wrongly written words and the emoticons in the text, and correcting the text accordingly;
deleting the stop words in the text.
4. The method according to claim 1, characterized in that, after dividing the text into words using the word segmentation tool, the method further comprises:
merging the words that were segmented incorrectly, and then proceeding to the step of screening the words against the corpus.
5. The method according to claim 1, characterized in that, after selecting the words whose frequency is greater than the set value as hot topics, the method further comprises:
analyzing, according to a sentiment dictionary, the text that contains the hot topics to obtain the sentiment orientation of the corresponding users.
6. The method according to claim 1, characterized in that collecting the text from the forum specifically comprises:
iteratively obtaining, by means of a web crawler, the URL links of the web pages of the forum;
fetching the web pages according to the URL links;
matching regular expressions against the web pages to obtain the required text.
7. A device for identifying hot topics, characterized by comprising:
a collecting unit, configured to collect text from a forum;
a dividing unit, configured to divide the text into words using a word segmentation tool;
a screening and calculating unit, configured to screen the words against a corpus and to calculate, for each screened word in turn, the frequency with which it occurs among all of the screened words;
a selecting unit, configured to select the words whose frequency is greater than a set value as hot topics;
wherein the dictionary of the word segmentation tool includes words in a preset standard form.
8. The device according to claim 7, characterized by further comprising:
a preprocessing unit, configured to preprocess the collected text.
9. A device for identifying hot topics, characterized by comprising a processor, wherein the processor, when executing a program stored in a memory, implements the steps of the method for identifying hot topics according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the following steps:
collecting text from a forum;
dividing the text into words using a word segmentation tool;
screening the words against a corpus, and calculating, for each screened word in turn, the frequency with which it occurs among all of the screened words;
selecting the words whose frequency is greater than a set value as hot topics;
wherein the dictionary of the word segmentation tool includes words in a preset standard form.
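
For illustration only, the segmentation, screening, frequency and selection steps recited in the claims might be sketched in Python as below. This is a minimal sketch under stated assumptions and not the applicant's implementation: the claims name no particular segmentation tool, so `segment` is a hypothetical stand-in that uses forward maximum matching against a user dictionary; the corpus is assumed to be a plain set of admissible words; and frequency is assumed to be a word's count divided by the total number of screened words. The names `segment`, `hot_topics`, `max_len` and the sample data are all invented for the example.

```python
from collections import Counter


def segment(text, dictionary, max_len=4):
    """Hypothetical stand-in for a word segmentation tool.

    Uses forward maximum matching: at each position, take the longest
    piece (up to max_len characters) found in the dictionary, falling
    back to a single character when nothing matches.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def hot_topics(posts, dictionary, corpus, threshold):
    """Return {word: frequency} for screened words above the threshold.

    Assumption: "screening against a corpus" means keeping only words
    that appear in the corpus set, and frequency is the word's share of
    all screened words.
    """
    kept = [w for post in posts
            for w in segment(post, dictionary) if w in corpus]
    total = len(kept)
    if total == 0:
        return {}
    counts = Counter(kept)
    return {w: c / total for w, c in counts.items() if c / total > threshold}


# Invented sample data: the dictionary holds words in their standard form.
posts = ["云海OS很好用", "云海OS升级失败", "服务器升级"]
dictionary = {"云海OS", "服务器", "升级", "失败", "好用"}
corpus = {"云海OS", "服务器", "升级"}

print(hot_topics(posts, dictionary, corpus, threshold=0.3))
```

With this sample data, five words survive screening, so "云海OS" and "升级" each have a frequency of 0.4 and are selected, while "服务器" at 0.2 is not. Putting the standard form "云海OS" in the dictionary is what keeps the product name from being split into single characters, which appears to be the purpose of the preset standard form in the claims.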

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711092187.2A CN107783961A (en) 2017-11-08 2017-11-08 A kind of method, apparatus and readable storage medium storing program for executing of much-talked-about topic identification

Publications (1)

Publication Number Publication Date
CN107783961A true CN107783961A (en) 2018-03-09

Family

ID=61433147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711092187.2A Pending CN107783961A (en) 2017-11-08 2017-11-08 A kind of method, apparatus and readable storage medium storing program for executing of much-talked-about topic identification

Country Status (1)

Country Link
CN (1) CN107783961A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299248A (en) * 2018-12-12 2019-02-01 成都航天科工大数据研究院有限公司 A kind of business intelligence collection method based on natural language processing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101719122A (en) * 2009-12-04 2010-06-02 中国人民解放军信息工程大学 Method for extracting Chinese named entity from text data
CN101980199A (en) * 2010-10-28 2011-02-23 北京交通大学 Method and system for discovering network hot topic based on situation assessment
CN104731770A (en) * 2015-03-23 2015-06-24 中国科学技术大学苏州研究院 Chinese microblog emotion analysis method based on rules and statistical model
CN105183765A (en) * 2015-07-30 2015-12-23 成都鼎智汇科技有限公司 Big data-based topic extraction method
JP2016040660A (en) * 2014-08-12 2016-03-24 日本電信電話株式会社 Content recommendation device, content recommendation method, and content recommendation program
CN105574092A (en) * 2015-12-10 2016-05-11 百度在线网络技术(北京)有限公司 Information mining method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180309