CN107783961A

CN107783961A - Method, apparatus, and readable storage medium for hot topic identification

Info

Publication number: CN107783961A
Application number: CN201711092187.2A
Authority: CN
Inventors: 毕银龙
Original assignee: Zhengzhou Yunhai Information Technology Co Ltd
Current assignee: Zhengzhou Yunhai Information Technology Co Ltd
Priority date: 2017-11-08
Filing date: 2017-11-08
Publication date: 2018-03-09

Abstract

The invention discloses a hot topic identification method comprising: collecting text from a forum; dividing the text into words using a word segmentation tool; screening the words against a corpus and calculating, for each screened word in turn, its frequency of occurrence among all the screened words; and selecting the words whose frequency exceeds a set value as hot topics; wherein the dictionary of the word segmentation tool includes words in a preset standard format. Because the dictionary of the segmentation tool includes words in a preset standard format, everyday expressions and internet slang can be registered as such words; when the text is segmented, these words are recognized and emitted as words in their own right, and the segmented words are then screened against the corpus. Hot topics can therefore be identified more accurately. The invention also discloses a hot topic identification apparatus and a computer-readable storage medium, which achieve the same effect.

Description

Method, apparatus, and readable storage medium for hot topic identification

Technical field

The present invention relates to the computer field, and in particular to a hot topic identification method, apparatus, and computer-readable storage medium.

Background

With the development of computer networks, viewpoints and comments of all kinds emerge online in an endless stream. To keep abreast of social hot events, observe social trends, and guide enterprises, governments, and other bodies toward appropriate decisions, it is often necessary to analyze online comments and viewpoints and identify hot topics.

In the prior art, the acquired text is typically divided into words, the frequency of each word is counted directly, and high-frequency words are taken as hot topics. On online forums, however, users' comments contain a great deal of internet slang and everyday expressions, which are often non-standard and easily cause segmentation errors. Moreover, some of the words produced by segmentation cannot serve as topics at all, so the high-frequency words finally selected may not be usable as hot topics.

How to identify hot topics more accurately is therefore a problem that those skilled in the art currently need to solve.

Summary of the invention

An object of the invention is to provide a hot topic identification method, apparatus, and computer-readable storage medium that identify social hot topics more accurately and effectively.

To solve the above technical problem, the invention provides a hot topic identification method, comprising:

Collecting text from a forum;

Dividing the text into words using a word segmentation tool;

Screening the words against a corpus, and calculating, for each screened word in turn, its frequency of occurrence among all the screened words;

Selecting the words whose frequency exceeds a set value as hot topics;

Wherein the dictionary of the word segmentation tool includes words in a preset standard format.

Preferably, after collecting the text from the forum, the method further comprises:

Pre-processing the collected text, and then proceeding to the step of dividing the text into words using the word segmentation tool.

Preferably, pre-processing the collected text specifically comprises:

Identifying misspelled words and emoticons in the text, and correcting the text;

Deleting the stop words in the text.

Preferably, after dividing the text into words using the word segmentation tool, the method further comprises:

Merging words for which a segmentation error exists, and then proceeding to the step of screening the words against the corpus.

Preferably, after selecting the words whose frequency exceeds the set value as hot topics, the method further comprises:

Analyzing the text that contains the hot topic according to a sentiment dictionary to obtain the sentiment orientation of the corresponding user.

Preferably, collecting the text from the forum specifically comprises:

Obtaining the URL links of the forum's web pages iteratively by means of a crawler;

Fetching the web pages according to the URL links;

Matching regular expressions against the web pages to obtain the required text.

The invention further provides a hot topic identification apparatus, comprising:

A collecting unit, configured to collect text from a forum;

A division unit, configured to divide the text into words using a word segmentation tool;

A screening and computing unit, configured to screen the words against a corpus and to calculate, for each screened word in turn, its frequency of occurrence among all the screened words;

A selecting unit, configured to select the words whose frequency exceeds a set value as hot topics.

Preferably, the apparatus further comprises:

A pre-processing unit, configured to pre-process the collected text.

The invention further provides a hot topic identification apparatus comprising a processor, the processor being configured to implement the steps of any of the hot topic identification methods described above when executing a program stored in a memory.

The invention further provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the following steps:

Collecting text from a forum;

Dividing the text into words using a word segmentation tool;

Screening the words against a corpus, and calculating, for each screened word in turn, its frequency of occurrence among all the screened words;

Selecting the words whose frequency exceeds a set value as hot topics.

In the hot topic identification method provided by the invention, text is collected from a forum; the text is divided into words using a word segmentation tool; the words are screened against a corpus, and the frequency of each screened word among all the screened words is calculated in turn; and the words whose frequency exceeds a set value are selected as hot topics; wherein the dictionary of the word segmentation tool includes words in a preset standard format. Because the dictionary includes words in a preset standard format, everyday expressions and internet slang can be registered as such words; when the text is segmented, these words are recognized and emitted as words in their own right. The segmented words are then screened against the corpus, so that words that cannot serve as topics no longer have their frequency calculated and can no longer end up as hot topics. Hot topics can therefore be identified more accurately. The hot topic identification apparatus and the computer-readable storage medium provided by the invention achieve the same effect.

Brief description of the drawings

To illustrate the embodiments of the invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow chart of a hot topic identification method provided by an embodiment of the invention;

Fig. 2 is a structural diagram of a hot topic identification apparatus provided by an embodiment of the invention;

Fig. 3 is a structural diagram of a hot topic identification apparatus provided by an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

To help those skilled in the art better understand the technical solution, the invention is described in further detail below with reference to the drawings and specific embodiments.

Fig. 1 is a flow chart of a hot topic identification method provided by an embodiment of the invention. As shown in Fig. 1, the hot topic identification method comprises the following steps:

S10: Collect text from a forum.

A forum is a platform on which users post and reply; users can publish information or put forward opinions there. Forum texts are generally short and contain many everyday expressions and much internet slang.

S11: Divide the text into words using a word segmentation tool.

A text generally contains multiple sentences; the word segmentation tool divides each sentence of the collected text into several independent words. The segmentation tool may include a dictionary, whose words serve as a reference when the text is segmented.

S12: Screen the words against a corpus, and calculate, for each screened word in turn, its frequency of occurrence among all the screened words.

The corpus may contain preset words that are eligible to serve as hot topics. From all the words obtained by dividing all the sentences, the words contained in the corpus are screened out, and words not contained in the corpus may be deleted. The total number of occurrences of each screened word is then counted in turn, and that count divided by the total number of words remaining after screening gives the frequency of the word.

Words absent from the corpus therefore cannot become hot topics.
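As an illustration of step S12, the following minimal Python sketch screens a word list against a corpus and divides each word's count by the number of words that survive screening. The function name `screened_frequencies`, the sample words, and the sample corpus are hypothetical and are not taken from the patent.

```python
from collections import Counter

def screened_frequencies(words, corpus):
    """Keep only corpus words, then return each word's share of the kept words."""
    kept = [w for w in words if w in corpus]   # words outside the corpus are deleted
    total = len(kept)
    counts = Counter(kept)
    # The comprehension is empty when nothing survives, so no division by zero occurs.
    return {w: n / total for w, n in counts.items()}

# Hypothetical segmented forum text (English stand-ins for readability).
words = ["housing", "price", "the", "housing", "lol", "housing", "price", "traffic"]
corpus = {"housing", "price", "traffic"}       # preset topic-eligible words

frequencies = screened_frequencies(words, corpus)
for word, freq in frequencies.items():
    print(word, round(freq, 3))
```

Note that the denominator is the number of screened words (6 here), not the number of words before screening (8), which matches the description above.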

S13: Select the words whose frequency exceeds a set value as hot topics.

The higher the frequency, the more often the word occurs and the more it is being discussed by users. A set value is chosen as needed, and the words whose frequency exceeds it are selected as hot topics.
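Step S13 can be sketched as a simple threshold filter. In the Python sketch below, the name `select_hot_topics`, the sample frequencies, and the set value 0.25 are hypothetical; sorting the result by descending frequency is an added convenience that the patent does not require.

```python
def select_hot_topics(frequencies, set_value):
    """Return words whose frequency is strictly greater than set_value, most frequent first."""
    hot = [w for w, f in frequencies.items() if f > set_value]
    return sorted(hot, key=lambda w: frequencies[w], reverse=True)

# Hypothetical frequencies produced by the screening step.
frequencies = {"housing": 0.5, "price": 0.3, "traffic": 0.2}
print(select_hot_topics(frequencies, set_value=0.25))
```

The comparison is strict ("exceeds"), so a word whose frequency equals the set value is not selected.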

Here, the dictionary of the word segmentation tool includes words in a preset standard format.

Everyday expressions, internet slang, and other non-standard expressions are registered as words in the preset standard format. In this way, when the text is divided into words using the segmentation tool, these everyday expressions, slang terms, or non-standard expressions can be picked out and emitted as independent words, which avoids segmentation errors.
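The effect of adding such entries to the dictionary can be shown with a small sketch. The patent does not name a segmentation algorithm, so the Python code below assumes forward maximum matching, a common dictionary-based technique; the function `segment`, the parameter `max_len`, and both dictionaries are hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

base_dictionary = {"房价", "上涨"}                  # "housing price", "rise"
extended_dictionary = base_dictionary | {"给力"}    # adds a slang term ("awesome")

text = "房价上涨给力"
print(segment(text, base_dictionary))       # slang term falls apart into single characters
print(segment(text, extended_dictionary))   # slang term is kept as one word
```

With the base dictionary the slang term is split into two meaningless characters; once it is registered in the dictionary it is emitted as a single word, which is the behavior the paragraph above describes.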

In this embodiment, text is collected from a forum; the text is divided into words using a word segmentation tool; the words are screened against a corpus, and the frequency of each screened word among all the screened words is calculated in turn; and the words whose frequency exceeds a set value are selected as hot topics; wherein the dictionary of the word segmentation tool includes words in a preset standard format. Because the dictionary includes words in a preset standard format, everyday expressions and internet slang can be registered as such words; when the text is segmented, these words are recognized and divided off as independent words. The segmented words are then screened against the corpus, so that words that cannot serve as topics no longer have their frequency calculated and can no longer end up as hot topics. Hot topics can therefore be identified more accurately.

On the basis of the above embodiment, to segment the text more accurately and thereby identify hot topics more accurately, the method further comprises, after collecting the text from the forum: pre-processing the collected text, and then proceeding to step S11.

Preferably, pre-processing the collected text specifically comprises identifying misspelled words and emoticons in the text, correcting the text, and deleting the stop words in the text.

Because users' wording is often non-standard, forum text may contain misspelled words or emoticons. A misspelling library and an emoticon library can be set up in advance: the misspelling library may also contain the corrected form of each word, and the emoticon library may also contain the word corresponding to the meaning each emoticon expresses. The misspelled words and emoticons in the text are detected using these preset libraries, and the text is corrected according to them. Stop words in the text can simply be deleted.
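The pre-processing described above can be sketched with three small lookup tables. In the Python sketch below, the contents of `TYPO_FIXES`, `EMOTICON_WORDS`, and `STOP_WORDS` are hypothetical, and deleting stop words by plain string replacement is a simplification: it would also strip a stop character that happens to sit inside a longer word.

```python
# Hypothetical preset libraries; real ones would be far larger.
TYPO_FIXES = {"房阶": "房价"}                      # misspelling -> corrected word
EMOTICON_WORDS = {":)": "开心", "T_T": "难过"}     # emoticon -> word for its meaning
STOP_WORDS = {"的", "了"}                          # function words carrying no topic

def preprocess(text):
    """Correct known misspellings, replace emoticons with words, delete stop words."""
    for wrong, right in TYPO_FIXES.items():
        text = text.replace(wrong, right)
    for emoticon, word in EMOTICON_WORDS.items():
        text = text.replace(emoticon, word)
    for stop in STOP_WORDS:
        text = text.replace(stop, "")
    return text

print(preprocess("房阶上涨了T_T"))
```

The cleaned string is what would then be passed to the segmentation step S11.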

On the basis of the above embodiment, to identify hot topics more accurately, in a preferred implementation the method further comprises, after dividing the text into words using the word segmentation tool: merging words for which a segmentation error exists, and then proceeding to step S12.

After the text has been divided into words using the segmentation tool, the segmentation can be checked by dividing the original text into words again with the segmentation tool and comparing the result with the words obtained the first time. If a phrase is segmented inconsistently in the two passes, the words produced from it are merged and the merged result is used as the segmented word; in other words, the phrase itself is treated as a single word.

On this basis, the original text may of course be segmented yet again, i.e. a third time, and the words obtained compared with those from the first two passes. If a phrase is segmented differently in every pass, the words produced from it are merged, that is, the phrase is used directly as a word. The invention places no limit on the number of segmentation passes.
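One way to realize this merging, offered only as a sketch, is to compare the word boundaries of the passes and keep a boundary only where every pass agrees; wherever the passes disagree, the pieces collapse into the enclosing phrase. The Python function `merge_inconsistent` and the sample segmentations below are hypothetical, and the patent does not prescribe this particular comparison.

```python
from itertools import accumulate

def merge_inconsistent(text, first, *others):
    """Merge words wherever the segmentation passes disagree on a boundary."""
    common = None
    for words in (first, *others):
        if "".join(words) != text:
            raise ValueError("every pass must segment the same text")
        # End offsets of each word, e.g. ["ab", "c"] -> {2, 3}.
        cuts = set(accumulate(len(w) for w in words))
        common = cuts if common is None else common & cuts
    merged, start = [], 0
    for cut in sorted(common):          # only boundaries shared by all passes survive
        merged.append(text[start:cut])
        start = cut
    return merged

text = "房价上涨给力"
pass_1 = ["房价", "上涨", "给", "力"]    # first pass splits the slang term
pass_2 = ["房价", "上涨", "给力"]        # second pass keeps it whole
print(merge_inconsistent(text, pass_1, pass_2))
```

Because the function accepts any number of passes, a third or later segmentation can be added as a further argument, in line with the statement that the number of passes is not limited.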

On the basis of the above embodiment, to understand people's attitudes toward and views of a hot topic, in a preferred implementation the method further comprises, after selecting the words whose frequency exceeds the set value as hot topics: analyzing the text that contains the hot topic according to a sentiment dictionary to obtain the sentiment orientation of the corresponding user.

The emotions covered by the sentiment dictionary can be roughly divided into three categories: positive, negative, and neutral. A dictionary can be built for each category, and the text containing the hot topic is analyzed: if a word from the dictionary of a given category appears in the text, the attitude of that user is recorded as the sentiment type of that dictionary.
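The dictionary lookup described above can be sketched as follows. The dictionary contents and the function name `sentiment_of` are hypothetical. The patent does not say what happens when words from several categories appear in one text; this sketch assumes the first matching category in a fixed order wins and returns `None` when nothing matches.

```python
# Hypothetical per-category sentiment dictionaries (English stand-ins).
SENTIMENT_DICTIONARIES = {
    "positive": {"great", "happy"},
    "negative": {"terrible", "angry"},
    "neutral": {"okay"},
}

def sentiment_of(text, hot_topic):
    """Return the sentiment category of a text that mentions the hot topic."""
    if hot_topic not in text:
        return None                      # only texts containing the topic are analyzed
    for label, words in SENTIMENT_DICTIONARIES.items():
        if any(word in text for word in words):
            return label                 # assumption: first matching category wins
    return None

print(sentiment_of("housing prices are terrible", "housing"))
print(sentiment_of("housing is okay", "housing"))
```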

On the basis of the above embodiment, to collect forum text more efficiently and accurately, collecting the text from the forum specifically comprises: obtaining the URL links of the forum's web pages iteratively by means of a crawler, fetching the web pages according to the URL links, and matching regular expressions against the web pages to obtain the required text.

Crawler technology is used to obtain the URLs of the web pages; the structure of the pages is then analyzed and regular expressions are matched against them, so that the forum text is captured and saved locally.
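The collection step can be sketched as a breadth-first crawl that discovers links iteratively and extracts post text with a regular expression. Everything specific in the Python sketch below is an assumption: the page markup, the two patterns, the `max_pages` limit, and the in-memory `PAGES` table that stands in for real HTTP requests. In practice the `fetch` argument could wrap a call such as `urllib.request.urlopen`.

```python
import re
from collections import deque

LINK_RE = re.compile(r'href="([^"]+)"')                        # hypothetical link pattern
POST_RE = re.compile(r'<div class="post">(.*?)</div>', re.S)   # hypothetical post markup

def crawl(start_url, fetch, max_pages=100):
    """Iteratively follow links from start_url and collect the matched post texts."""
    seen = {start_url}
    queue = deque([start_url])
    posts = []
    while queue:
        url = queue.popleft()
        html = fetch(url)                         # fetch the page behind the URL
        posts.extend(POST_RE.findall(html))       # regular-expression matching
        for link in LINK_RE.findall(html):        # newly discovered URLs
            if link not in seen and len(seen) < max_pages:
                seen.add(link)
                queue.append(link)
    return posts

# In-memory stand-in for a forum, so the sketch runs without network access.
PAGES = {
    "/forum": '<a href="/thread/1">t1</a><a href="/thread/2">t2</a>',
    "/thread/1": '<div class="post">housing prices up again</div><a href="/forum">back</a>',
    "/thread/2": '<div class="post">traffic is terrible</div>',
}
posts = crawl("/forum", PAGES.__getitem__)
print(posts)
```

The `seen` set keeps the crawler from revisiting the index page when a thread links back to it.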

The embodiments of the hot topic identification method have been described in detail above. Based on the method described in those embodiments, an embodiment of the invention provides a corresponding hot topic identification apparatus. Because the apparatus embodiments correspond to the method embodiments, reference may be made to the description of the method embodiments, and the details are not repeated here.

Fig. 2 is a structural diagram of a hot topic identification apparatus provided by an embodiment of the invention. As shown in Fig. 2, the hot topic identification apparatus comprises:

A collecting unit 20, configured to collect text from a forum.

A division unit 21, configured to divide the text into words using a word segmentation tool.

A screening and computing unit 22, configured to screen the words against a corpus and to calculate, for each screened word in turn, its frequency of occurrence among all the screened words.

A selecting unit 23, configured to select the words whose frequency exceeds a set value as hot topics.

The collecting unit collects text from a forum; the division unit divides the text into words using a word segmentation tool; the screening and computing unit screens the words against a corpus and calculates, for each screened word in turn, its frequency among all the screened words; and the selecting unit selects the words whose frequency exceeds a set value as hot topics; wherein the dictionary of the word segmentation tool includes words in a preset standard format. Because the dictionary includes words in a preset standard format, everyday expressions and internet slang can be registered as such words; when the text is segmented, the division unit recognizes these words and divides them off as independent words. The screening and computing unit then screens the segmented words against the corpus, so that words that cannot serve as topics no longer have their frequency calculated and can no longer end up as hot topics. Hot topics can therefore be identified more accurately.
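The cooperation of the four units can be sketched as a single object whose fields play the roles of units 20 to 23. The Python class `HotTopicDevice`, its field names, and the sample data below are hypothetical; whitespace splitting stands in for a real word segmentation tool.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set

@dataclass
class HotTopicDevice:
    collect: Callable[[], Iterable[str]]   # role of collecting unit 20
    divide: Callable[[str], List[str]]     # role of division unit 21
    corpus: Set[str]                       # used by screening and computing unit 22
    set_value: float                       # used by selecting unit 23

    def run(self):
        words = [w for text in self.collect() for w in self.divide(text)]
        kept = [w for w in words if w in self.corpus]      # screening
        counts = Counter(kept)                             # computing
        return [w for w, n in counts.items() if n / len(kept) > self.set_value]

device = HotTopicDevice(
    collect=lambda: ["housing price up", "housing lol"],   # hypothetical forum texts
    divide=str.split,                                      # stand-in segmentation tool
    corpus={"housing", "price"},
    set_value=0.5,
)
print(device.run())
```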

On the basis of the above embodiment, to segment the text more accurately and thereby identify hot topics more accurately, the hot topic identification apparatus further comprises:

A pre-processing unit, configured to pre-process the collected text.

Preferably, the pre-processing unit is specifically configured to identify misspelled words and emoticons in the text, correct the text, and delete the stop words in the text.

Fig. 3 is a structural diagram of a hot topic identification apparatus provided by an embodiment of the invention. As shown in Fig. 3, the hot topic identification apparatus comprises:

A memory 30 and a processor 31.

The memory 30 is configured to store a computer program.

The processor 31, when executing the computer program stored in the memory 30, can implement the following steps:

Collecting text from a forum;

Dividing the text into words using a word segmentation tool;

Screening the words against a corpus, and calculating, for each screened word in turn, its frequency of occurrence among all the screened words;

Selecting the words whose frequency exceeds a set value as hot topics.

In some embodiments of the invention, the processor 31 may further be configured to execute the computer program in the memory 30 to implement the following steps:

Pre-processing the collected text, and then proceeding to the step of dividing the text into words using the word segmentation tool;

Identifying misspelled words and emoticons in the text, and correcting the text;

Deleting the stop words in the text;

Merging words for which a segmentation error exists, and then proceeding to the step of screening the words against the corpus;

Analyzing the text that contains the hot topic according to a sentiment dictionary to obtain the sentiment orientation of the corresponding user;

Obtaining the URL links of the forum's web pages iteratively by means of a crawler;

Fetching the web pages according to the URL links;

Matching regular expressions against the web pages to obtain the required text.

In the hot topic identification apparatus provided by this embodiment, the processor, when executing the computer program in the memory, collects text from a forum; divides the text into words using a word segmentation tool; screens the words against a corpus and calculates, for each screened word in turn, its frequency among all the screened words; and selects the words whose frequency exceeds a set value as hot topics; wherein the dictionary of the word segmentation tool includes words in a preset standard format. Because the dictionary includes words in a preset standard format, everyday expressions and internet slang can be registered as such words; when the text is segmented, these words are recognized and divided off as independent words. The segmented words are then screened against the corpus, so that words that cannot serve as topics no longer have their frequency calculated and can no longer end up as hot topics. Hot topics can therefore be identified more accurately.

The invention also provides a computer-readable storage medium corresponding to the embodiments of the hot topic identification method described above. Because the embodiments of the computer-readable storage medium correspond to the method embodiments, reference may be made to the description of the method embodiments, and the details are not repeated here.

A computer program is stored on the computer-readable storage medium, and the computer program is executed by a processor to implement the following steps:

Collecting text from a forum.

Dividing the text into words using a word segmentation tool.

Screening the words against a corpus, and calculating, for each screened word in turn, its frequency of occurrence among all the screened words.

Selecting the words whose frequency exceeds a set value as hot topics.

It should be noted that the computer-readable storage medium of the invention may be a USB flash drive, an optical disc, or a similar medium; no specific limitation is imposed.

When the computer program on the computer-readable storage medium provided by the invention is executed by a processor, text is collected from a forum; the text is divided into words using a word segmentation tool; the words are screened against a corpus, and the frequency of each screened word among all the screened words is calculated in turn; and the words whose frequency exceeds a set value are selected as hot topics; wherein the dictionary of the word segmentation tool includes words in a preset standard format. Because the dictionary includes words in a preset standard format, everyday expressions and internet slang can be registered as such words; when the text is segmented, these words are recognized and divided off as independent words. The segmented words are then screened against the corpus, so that words that cannot serve as topics no longer have their frequency calculated and can no longer end up as hot topics. Hot topics can therefore be identified more accurately.

The hot topic identification method, apparatus, and computer-readable storage medium provided by the invention have been described in detail above. The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and for the parts that are identical or similar the embodiments may be referred to one another.

It should be pointed out that a person of ordinary skill in the art may make improvements and modifications to the invention without departing from its principles, and such improvements and modifications also fall within the scope of protection of the claims of the invention.

It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises it.

Claims

1. A hot topic identification method, characterized by comprising:

Collecting text from a forum;

Dividing the text into words using a word segmentation tool;

Screening the words against a corpus, and calculating, for each screened word in turn, its frequency of occurrence among all the screened words;

Selecting the words whose frequency exceeds a set value as hot topics;

Wherein the dictionary of the word segmentation tool includes words in a preset standard format.

2. The method according to claim 1, characterized in that, after collecting the text from the forum, the method further comprises:

Pre-processing the collected text, and then proceeding to the step of dividing the text into words using the word segmentation tool.

3. The method according to claim 2, characterized in that pre-processing the collected text specifically comprises:

Identifying misspelled words and emoticons in the text, and correcting the text;

Deleting the stop words in the text.

4. The method according to claim 1, characterized in that, after dividing the text into words using the word segmentation tool, the method further comprises:

Merging words for which a segmentation error exists, and then proceeding to the step of screening the words against the corpus.

5. The method according to claim 1, characterized in that, after selecting the words whose frequency exceeds the set value as hot topics, the method further comprises:

Analyzing the text that contains the hot topic according to a sentiment dictionary to obtain the sentiment orientation of the corresponding user.

6. The method according to claim 1, characterized in that collecting the text from the forum specifically comprises:

Obtaining the URL links of the forum's web pages iteratively by means of a crawler;

Fetching the web pages according to the URL links;

Matching regular expressions against the web pages to obtain the required text.

7. A hot topic identification apparatus, characterized by comprising:

A collecting unit, configured to collect text from a forum;

A division unit, configured to divide the text into words using a word segmentation tool;

A screening and computing unit, configured to screen the words against a corpus and to calculate, for each screened word in turn, its frequency of occurrence among all the screened words;

A selecting unit, configured to select the words whose frequency exceeds a set value as hot topics;

8. The apparatus according to claim 7, characterized by further comprising:

A pre-processing unit, configured to pre-process the collected text.

9. A hot topic identification apparatus, characterized by comprising a processor, the processor being configured to implement the steps of the hot topic identification method according to any one of claims 1 to 6 when executing a program stored in a memory.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, the computer program being executed by a processor to implement the following steps:

Collecting text from a forum;

Dividing the text into words using a word segmentation tool;