CN109359233A - Public network massive information monitoring method and system based on natural language processing technique - Google Patents
- Publication number: CN109359233A (application CN201811067750.5A)
- Authority: CN (China)
- Prior art keywords: public network, word, text data, effective, network text
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The present invention relates to a public network massive information monitoring method and system, computer equipment, and a computer storage medium based on natural language processing technology. The method includes: using preset high-frequency words as keywords, crawling the public network text data within a first set time period and analyzing it with natural language processing technology; performing word segmentation on each piece of public network text data, identifying the effective words of the public network text data, determining the word weight of each effective word in each piece of public network text data from the word weights recorded in an effective dictionary, and building a term vector that records, in order, the word weights of the effective words in the public network text data; wherein an effective word is a word in the public network text data other than a stop word, and the effective dictionary is a database recording the word weight corresponding to each word; and classifying the public network text data according to the term vectors and monitoring each class of public network text data separately. The method has high monitoring efficiency and effectively improves the monitoring effect.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a public network massive information monitoring method and system, computer equipment, and a computer storage medium based on natural language processing technology.
Background art
With the rapid development of Internet technology, the number of Internet users has grown quickly, and more and more users express their views on Internet platforms such as Weibo and WeChat. Information on the network can spread around the world within a few hours, so grasping public network massive information, such as relevant public sentiment, in a timely manner is very important both for enterprises and for the relevant regulatory bodies. Traditional means of monitoring network information (public network massive information), such as public sentiment monitoring, first identify and screen information according to its topic and then monitor the network information based on what remains after screening. This easily causes information to be omitted and makes the monitoring effect poor.
Summary of the invention
In view of this, to address the technical problem that traditional schemes easily omit information and therefore monitor public network massive information poorly, it is necessary to provide a public network massive information monitoring method and system, computer equipment, and a computer storage medium based on natural language processing technology.
A public network massive information monitoring method based on natural language processing technology, comprising:
using preset high-frequency words as keywords, crawling the public network text data within a first set time period;
performing word segmentation on each piece of public network text data, identifying the effective words of the public network text data, determining the word weight of each effective word in each piece of public network text data from the word weights recorded in an effective dictionary, and building a term vector that records, in order, the word weights of the effective words in the public network text data; wherein an effective word is a word in the public network text data other than a stop word, and the effective dictionary is a database recording the word weight corresponding to each word;
classifying the public network text data according to the term vectors, and monitoring each class of public network text data separately.
The above method crawls the public network text data within the first set time period according to the preset high-frequency words, identifies the effective words of that text data, determines the word weight of each effective word from the word weights recorded in the effective dictionary, builds a term vector recording those word weights in order, classifies the public network text data, and then monitors each class separately. Because the text data is monitored class by class, monitoring efficiency is high; because the monitoring process is based on the effective words contained in the text data, the monitoring effect is effectively improved.
In one embodiment, the term vector is an n-dimensional vector that records, in order, the word weights of the effective words in the corresponding public network text data followed by a set value, where n is the number of words in the effective dictionary;
the process of classifying the public network text data according to the term vectors includes:
calculating the cosine value between every two term vectors and, when the cosine value is greater than a similarity threshold, determining the two pieces of public network text data corresponding to that cosine value to be text data of the same class.
This embodiment helps ensure the accuracy of the classification of public network text data.
As one embodiment, after the process of calculating the cosine value between every two term vectors and, when the cosine value is greater than the similarity threshold, determining the two corresponding pieces of public network text data to be text data of the same class, the method further includes:
determining multiple classes of text data that contain the same piece of public network text data to be one class of text data.
In this embodiment, pieces of public network text data that are each in the same class as one common piece of public network text data are highly similar to one another. Determining them to be one class of text data allows the same or a similar monitoring scheme to be applied to several pieces of public network text data at once, which improves monitoring efficiency.
In one embodiment, the process of monitoring each class of public network text data separately includes:
identifying the sentiment orientation parameter of each piece of public network text data; wherein the sentiment orientation parameter is a parameter characterizing how positive the corresponding public network text data is;
determining public network text data whose sentiment orientation parameter is less than an emotion threshold to be passive text data;
counting the number of pieces of passive text data in each class of public network text data, and performing network information monitoring on the public network text data of the corresponding class according to that number.
In this embodiment, a large number of pieces of passive text data indicates that the class of public network text data may cause a related public opinion crisis, so processing such as early warning is needed to keep the handling of related public sentiment timely.
As one embodiment, the process of identifying the sentiment orientation parameter of a piece of public network text data includes:
extracting the feature emotion words in the public network text data, and determining the emotion deviation value of each feature emotion word from the emotion deviation values recorded for each emotion word in an emotion dictionary; wherein the emotion dictionary is a database recording the emotion deviation value corresponding to each emotion word;
calculating the average of the emotion deviation values corresponding to the public network text data, and determining the sentiment orientation parameter of the public network text data according to that average.
This embodiment allows the sentiment orientation parameter of public network text data to be determined accurately.
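The averaging step above can be illustrated with a minimal Python sketch. The dictionary contents, the function name `sentiment_orientation`, and the handling of texts that contain no emotion words are assumptions made for illustration; the patent does not specify them.

```python
from statistics import mean

# Hypothetical emotion dictionary: emotion word -> emotion deviation value.
# Values are assumed to lie in [0, 1], with 0 most negative and 1 most positive.
EMOTION_DICT = {
    "excellent": 0.9,
    "good": 0.8,
    "slow": 0.3,
    "broken": 0.1,
    "terrible": 0.0,
}


def sentiment_orientation(words, emotion_dict=EMOTION_DICT, default=None):
    """Average the emotion deviation values of the emotion words in a segmented text."""
    # Feature emotion words are taken to be the words found in the dictionary.
    values = [emotion_dict[w] for w in words if w in emotion_dict]
    if not values:
        return default  # assumption: no emotion words means no parameter
    return mean(values)
```

For example, `sentiment_orientation(["screen", "broken", "terrible"])` averages 0.1 and 0.0. Using the plain average as the parameter is the simplest reading of "determined according to the average"; other mappings are possible.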
As one embodiment, the process of performing network information monitoring on the public network text data of the corresponding class according to the number of pieces of passive text data includes:
generating warning information if the number of pieces of passive text data is greater than or equal to a set ratio of the total number of pieces of public network text data in the corresponding class.
In this embodiment, warning information is generated when the number of pieces of passive text data reaches the set ratio of the class total, so that the relevant users learn of the warning in time and can respond accordingly to prevent a public opinion crisis.
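A minimal Python sketch of the counting and ratio check follows. The function name `should_alert`, the default set ratio of 0.5, and the treatment of an empty class are assumptions; the default emotion threshold of 0.3 follows the example value given later in the description.

```python
def should_alert(sentiment_params, emotion_threshold=0.3, set_ratio=0.5):
    """Return (alert, passive_count) for one class of public network text data.

    sentiment_params holds one sentiment orientation parameter per text in the class.
    """
    if not sentiment_params:
        return False, 0  # assumption: an empty class never triggers a warning
    # A text is passive when its parameter is below the emotion threshold.
    passive_count = sum(1 for s in sentiment_params if s < emotion_threshold)
    alert = passive_count >= set_ratio * len(sentiment_params)
    return alert, passive_count
```

With four texts scored 0.1, 0.2, 0.8, and 0.9, two are passive, which meets a set ratio of 0.5, so a warning would be generated.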
In one embodiment, before the process of crawling the public network text data within the first set time period using the preset high-frequency words as keywords, the method further includes:
collecting the public sentiment events within a second set time period, obtaining the effective words whose number of occurrences in the public sentiment events is greater than a frequency threshold, and determining the obtained effective words to be high-frequency words.
This embodiment collates the text information contained in the public sentiment events within the second set time period and uses statistical methods to obtain the effective words that occur in those events more often than the frequency threshold. The high-frequency words of public sentiment events identified in this way are used to crawl the public network text data within the first set time period, which helps ensure that the crawled text data is relevant.
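The frequency-based selection of high-frequency words can be sketched as follows. This is an illustration only: the function name `high_frequency_words` is hypothetical, and the event texts are assumed to be already segmented into words.

```python
from collections import Counter


def high_frequency_words(event_texts, stop_words, frequency_threshold):
    """Return effective words occurring more than frequency_threshold times.

    event_texts is a list of texts, each already segmented into a list of words.
    """
    counts = Counter(
        word
        for words in event_texts
        for word in words
        if word not in stop_words  # effective words exclude stop words
    )
    return sorted(w for w, c in counts.items() if c > frequency_threshold)
```

The returned words would then serve as the crawl keywords for the first set time period.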
A public network massive information monitoring system based on natural language processing technology, comprising:
a crawling module, for crawling the public network text data within a first set time period using preset high-frequency words as keywords;
an identification module, for performing word segmentation on each piece of public network text data, identifying the effective words of the public network text data, determining the word weight of each effective word in each piece of public network text data from the word weights recorded in an effective dictionary, and building a term vector that records, in order, the word weights of the effective words in the public network text data; wherein an effective word is a word in the public network text data other than a stop word, and the effective dictionary is a database recording the word weight corresponding to each word;
a monitoring module, for classifying the public network text data according to the term vectors and monitoring each class of public network text data separately.
The above system crawls the public network text data within the first set time period according to the preset high-frequency words, identifies the effective words of that text data, determines the word weight of each effective word from the word weights recorded in the effective dictionary, builds a term vector recording those word weights in order, classifies the public network text data, and then monitors each class separately. Because the text data is monitored class by class, monitoring efficiency is high; because the monitoring process is based on the effective words contained in the text data, the monitoring effect is effectively improved.
A computer equipment, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the public network massive information monitoring method based on natural language processing technology provided by any of the above embodiments.
A computer storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the public network massive information monitoring method based on natural language processing technology provided by any of the above embodiments.
In accordance with the public network massive information monitoring method based on natural language processing technology of the present invention, the present invention also provides a computer equipment and a computer storage medium for implementing the above network information monitoring method by means of a program. The computer equipment and computer storage medium can improve the network information monitoring effect.
Brief description of the drawings
Fig. 1 is a flow chart of the public network massive information monitoring method based on natural language processing technology of one embodiment;
Fig. 2 is a structural schematic diagram of the public network massive information monitoring system based on natural language processing technology of one embodiment;
Fig. 3 is a block diagram of the computer system of one embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein only explain the invention and do not limit its scope of protection.
It should be noted that the terms "first", "second", and "third" in the embodiments of the present invention only distinguish similar objects and do not represent a particular ordering of those objects. Where permitted, "first", "second", and "third" may be interchanged in specific order or sequence, so that the embodiments of the invention described herein can be implemented in an order other than the one illustrated or described.
The terms "include" and "have" in the embodiments of the present invention, and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or equipment that contains a series of steps or modules is not limited to the listed steps or modules; it may optionally include steps or modules that are not listed, or other steps or modules inherent to the process, method, product, or equipment.
A reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. Occurrences of the phrase at various places in the description do not necessarily all refer to the same embodiment, nor to independent or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
Referring to Fig. 1, Fig. 1 is a flow chart of the public network massive information monitoring method based on natural language processing technology of one embodiment, which includes:
S10: using preset high-frequency words as keywords, crawl the public network text data within a first set time period;
The first set time period may be chosen according to the required monitoring precision, for example the previous two days or the previous three days, or a period such as the 40 hours ending at the current time. The public network text data is public text, such as news, comments, or messages, published over the public network. Optionally, the public network text data is divided by user: on a public platform, one text published by a user is one piece of public network text data, for example a comment that a user posts about a product, a news report published by a reporter, or a single message left by a user.
The network information may include specific information, such as public sentiment, that easily causes a public opinion crisis. There may be one or more high-frequency words. They may be obtained from specific network information such as the public sentiment events that occurred in a past period; specifically, the words that occur most frequently in each piece of specific network information over that period can be counted and the high-frequency words determined from the result.
In one embodiment, step S10 may build a web crawler system based on mainstream media platforms, crawl the public network text data within the first set time period using the preset high-frequency words as keywords, and store that text data. A text processing system based on natural language processing may also be built to clean the stored public network text data by filtering out junk data such as advertisements, and then to run natural language processing algorithms over the remaining text data (for example, correcting errors and deleting duplicate content). This preprocessing helps ensure the consistency of the public network text data. Various data indexes may also be derived from the preprocessed public network text data and stored in discretized form.
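The cleaning part of this preprocessing (dropping advertisements and duplicates) can be sketched in Python as follows. The advertisement patterns, the function name `clean_texts`, and the case-insensitive duplicate check are illustrative assumptions; a real filter would be derived from sample data.

```python
import re

# Hypothetical advertisement patterns, for illustration only.
AD_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"buy now", r"click here", r"discount code")
]


def clean_texts(raw_texts, ad_patterns=AD_PATTERNS):
    """Drop empty texts, advertisement-like texts, and duplicates."""
    seen = set()
    cleaned = []
    for text in raw_texts:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        if not normalized:
            continue
        if any(p.search(normalized) for p in ad_patterns):
            continue  # junk data such as advertisements
        key = normalized.lower()
        if key in seen:
            continue  # duplicate content
        seen.add(key)
        cleaned.append(normalized)
    return cleaned
```

The function keeps the first occurrence of each text and preserves input order, so later steps see each distinct text exactly once.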
S20: perform word segmentation on each piece of public network text data, identify the effective words of the public network text data, determine the word weight of each effective word in each piece of public network text data from the word weights recorded in an effective dictionary, and build a term vector that records, in order, the word weights of the effective words in the public network text data; wherein an effective word is a word in the public network text data other than a stop word, and the effective dictionary is a database recording the word weight corresponding to each word;
Before word segmentation is performed on each piece of public network text data, junk data filtering rules may be formulated from a large amount of network sample data to delete junk data such as advertisements, and duplicate data may be removed using database technology. Word segmentation divides a piece of public network text data into multiple words, which comprise stop words and effective words. Stop words are words with a high frequency of use, such as articles, prepositions, adverbs, and/or conjunctions; words such as "inside", "it", and "for" are all stop words. Because they are used so often, they appear on almost every web page, and processing large numbers of them would waste considerable resources. Ignoring stop words saves resources and improves data processing efficiency. Effective words are the words in the public network text data other than stop words, and they carry specific meaning in the text. The effective dictionary is a large, comprehensive dictionary that records the word weight corresponding to each effective word (a word weight is a specific numerical value); the word weight may be determined from factors such as the emotional characteristics of the word and its context of use in network information. Each effective word of a piece of public network text data can be looked up in the effective dictionary to obtain its word weight. Each piece of public network text data corresponds to one term vector, which records the word weights of the effective words in the order in which the effective words appear in the text. If the effective dictionary contains n words, the term vector may be an n-dimensional vector: after the word weights of the effective words have been recorded, the term vector is padded with a set value so that it has n dimensions.
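The construction of a term vector described above can be sketched in Python. This is a minimal illustration under stated assumptions: whitespace splitting stands in for a real word segmenter, the stop-word set and dictionary contents are invented, words missing from the effective dictionary are skipped, and the function name `term_vector` is hypothetical.

```python
# Hypothetical stop words and effective dictionary (word -> word weight).
STOP_WORDS = {"the", "a", "of", "it", "for", "in"}
EFFECTIVE_DICT = {
    "battery": 0.8,
    "exploded": 0.9,
    "phone": 0.4,
    "recall": 0.7,
    "great": 0.5,
}


def term_vector(text, effective_dict=EFFECTIVE_DICT,
                stop_words=STOP_WORDS, set_value=0.0):
    """Record effective-word weights in order of appearance, padded to n dimensions."""
    n = len(effective_dict)
    words = text.lower().split()  # stand-in for word segmentation
    weights = [
        effective_dict[w]
        for w in words
        if w not in stop_words and w in effective_dict
    ]
    if len(weights) > n:
        raise ValueError("text has more effective words than the dictionary size n")
    # Pad with the set value so every vector has exactly n dimensions.
    return weights + [set_value] * (n - len(weights))
```

Note that this layout follows the description literally: position i holds the weight of the i-th effective word in the text, not the weight of the i-th dictionary entry.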
S30: classify the public network text data according to the term vectors, and monitor each class of public network text data separately.
Step S30 may identify the direction of each term vector, determine public network text data whose term vectors point in similar directions (for example, term vectors whose included angle is less than a set angle), and which are therefore highly similar, to be one class of public network text data, and then monitor each class separately to keep the network information monitoring efficient.
In one embodiment, monitoring rules may be formulated according to the type of public network text data, and the public network text data is monitored according to those rules. Specifically, the public sentiment events that occurred in a past period can be collected, the data they generated collated, and the common features of the events summarized to formulate network information monitoring rules. For example, if historical data shows that a public opinion crisis questioning product quality can break out once a product receives more than 10 negative reviews within 3 days, the monitoring rule is: the number of negative review texts for a product within 3 days is greater than 10. When the number of negative review texts for a product within 3 days exceeds 10, the corresponding early warning is issued.
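The example rule (more than 10 negative reviews of a product within 3 days) can be written as a small Python check. The function name `negative_review_alert` and the use of review timestamps as input are assumptions for illustration.

```python
from datetime import datetime, timedelta


def negative_review_alert(review_times, now,
                          window=timedelta(days=3), limit=10):
    """Return True when more than `limit` negative reviews fall within `window` of `now`.

    review_times holds the timestamps of negative reviews for one product.
    """
    recent = [t for t in review_times if now - window <= t <= now]
    return len(recent) > limit
```

Usage: call the function with the negative-review timestamps collected for one product and the current time; the window and limit default to the values in the example rule and would be tuned from historical data.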
The public network massive information monitoring method based on natural language processing technology provided by the present invention crawls the public network text data within the first set time period according to the preset high-frequency words, identifies the effective words of that text data, determines the word weight of each effective word from the word weights recorded in the effective dictionary, builds a term vector recording those word weights in order, classifies the public network text data, and then monitors each class separately. Because the text data is monitored class by class, monitoring efficiency is high; because the monitoring process is based on the effective words contained in the text data, the monitoring effect is effectively improved.
In one embodiment, the term vector is an n-dimensional vector that records, in order, the word weights of the effective words in the corresponding public network text data followed by a set value, where n is the number of words in the effective dictionary;
the process of classifying the public network text data according to the term vectors includes:
calculating the cosine value between every two term vectors and, when the cosine value is greater than a similarity threshold, determining the two pieces of public network text data corresponding to that cosine value to be text data of the same class.
Specifically, the set value may be a value that is convenient for vector operations, such as 0 or 1. The similarity threshold may be set according to the required classification precision, for example 0.9. After the term vector of a piece of public network text data has recorded the word weights of its effective words, it is padded with the set value to n dimensions. For example, suppose the first public network text data contains a effective words (a < n) and the second public network text data contains b effective words (b < n). The term vector of the first public network text data is $A = [A_1, A_2, \ldots, A_a, \ldots, A_n]$ and the term vector of the second is $B = [B_1, B_2, \ldots, B_b, \ldots, B_n]$. In $A$, $A_1$ to $A_a$ record, in order, the word weights of the effective words in the first public network text data, and $A_{a+1}$ to $A_n$ are the set value; in $B$, $B_1$ to $B_b$ record, in order, the word weights of the effective words in the second public network text data, and $B_{b+1}$ to $B_n$ are the set value. The cosine value between term vector A and term vector B may be:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
When the cosine value $\cos\theta$ is greater than the similarity threshold, the angle between term vector A and term vector B is small and the two vectors point in similar directions, so the public network text data corresponding to term vector A and term vector B are text data of the same class.
This embodiment helps ensure the accuracy of the classification of public network text data.
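The pairwise cosine comparison can be sketched in Python as follows. The function names `cosine` and `similar_pairs` are hypothetical, and returning 0.0 for an all-zero vector is an assumption the patent does not address.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def similar_pairs(vectors, similarity_threshold=0.9):
    """Return index pairs (i, j), i < j, whose cosine exceeds the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(vectors), 2)
        if cosine(a, b) > similarity_threshold
    ]
```

Each returned pair corresponds to two pieces of text data determined to be of the same class. Comparing every pair costs time quadratic in the number of texts, which matters for large crawls.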
As one embodiment, after the process of calculating the cosine value between every two term vectors and, when the cosine value is greater than the similarity threshold, determining the two corresponding pieces of public network text data to be text data of the same class, the method further includes:
determining multiple classes of text data that contain the same piece of public network text data to be one class of text data.
In this embodiment, pieces of public network text data that are each in the same class as one common piece of public network text data are highly similar to one another. Determining them to be one class of text data allows the same or a similar monitoring scheme to be applied to several pieces of public network text data at once, which improves monitoring efficiency. For example, if the first and second public network text data are of the same class, and the first and third public network text data are of the same class, then the first, second, and third public network text data can be placed in one class, and network information monitoring is performed on that class.
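One way to merge classes that share a piece of text data is a union-find (disjoint-set) structure. The patent does not name a technique, so the sketch below is only one possible realization; the function name `merge_classes` is hypothetical, and the input is assumed to be index pairs of texts already judged similar.

```python
def merge_classes(num_texts, pairs):
    """Merge pairwise classes that share a text, using union-find.

    pairs holds index pairs (i, j) of texts determined to be of the same class.
    Returns a sorted list of classes, each a sorted list of text indexes.
    """
    parent = list(range(num_texts))

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    classes = {}
    for idx in range(num_texts):
        classes.setdefault(find(idx), []).append(idx)
    return sorted(classes.values())
```

With five texts and the pairs (0, 1) and (0, 2), texts 0, 1, and 2 end up in one class, matching the first/second/third example above, while texts 3 and 4 each remain in a class of their own.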
In one embodiment, the process of monitoring each class of public network text data separately includes:
identifying the sentiment orientation parameter of each piece of public network text data; wherein the sentiment orientation parameter is a parameter characterizing how positive the corresponding public network text data is;
determining public network text data whose sentiment orientation parameter is less than an emotion threshold to be passive text data;
counting the number of pieces of passive text data in each class of public network text data, and performing network information monitoring on the public network text data of the corresponding class according to that number.
Above-mentioned Sentiment orientation parameter value may range from 0 to 1,0 representative absolutely negatively (passiveness), and 1 represents absolutely front
The emotion deviation value of (positive), active text data is high, and the emotion deviation value of passive text data is low.Above-mentioned emotion threshold value can be with
It is arranged according to network information monitoring feature, is such as set as 0.3 equivalence.
In the present embodiment, if the number of passive text data is more, correlation can be caused by characterizing such public network text data
Public opinion crisis needs to carry out the processing such as early warning, to guarantee the timeliness of related public sentiment processing.
As one embodiment, the process of identifying the sentiment orientation parameter of a public network text data may include:
Extracting the feature emotion words in the public network text data, and determining the emotion deviation value of each feature emotion word according to the emotion deviation values recorded for each emotion word in an emotion dictionary; wherein the emotion dictionary is a database recording the emotion deviation value corresponding to each emotion word;
Calculating the average of the emotion deviation values corresponding to the public network text data (for example, summing the emotion deviation values and then averaging), and determining the sentiment orientation parameter of the public network text data according to the average.
The feature emotion words are words that can characterize a positive or passive attitude in the text, such as far-sighted, approachable, dejected, sad and disappointed. The emotion deviation value characterizes how positive the corresponding word is and may range from 0 to 1, where 0 represents absolutely negative (passive) and 1 represents absolutely positive (active); the emotion deviation value of a positive emotion word is high and that of a negative emotion word is low. The emotion words recorded in the emotion dictionary include the feature emotion words that appear in the public network text data. Specifically, the public network text data can first be preprocessed by filtering junk data and removing duplicate data; the preprocessed text is then segmented and its stop words are removed, after which the feature emotion words are extracted.
This embodiment can determine the sentiment orientation parameter of public network text data accurately.
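The averaging step can be illustrated with a short sketch. The dictionary contents, the deviation values and the function names below are invented for illustration only; the input is assumed to be already segmented and stripped of stop words, and returning `None` for a text with no feature emotion words is an assumption the patent text does not address.

```python
# Hypothetical emotion dictionary: emotion word -> emotion deviation value
# in [0, 1], where 0 is absolutely negative and 1 is absolutely positive.
EMOTION_DICTIONARY = {
    "far-sighted": 0.9,
    "approachable": 0.8,
    "disappointed": 0.2,
    "sad": 0.1,
}


def sentiment_orientation(words, emotion_dictionary=EMOTION_DICTIONARY):
    """Average the deviation values of the feature emotion words in a text.

    Returns None when the text contains no word from the dictionary.
    """
    values = [emotion_dictionary[w] for w in words if w in emotion_dictionary]
    if not values:
        return None
    return sum(values) / len(values)


def is_passive(words, emotion_threshold=0.3):
    """True when the sentiment orientation parameter is below the threshold."""
    score = sentiment_orientation(words)
    return score is not None and score < emotion_threshold
```

A text containing only "sad" and "disappointed" averages to about 0.15 and is classed as passive under the example threshold of 0.3, while a text mixing "far-sighted" and "sad" averages to 0.5 and is not.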
As one embodiment, the process of carrying out network information monitoring on the public network text data of the corresponding class according to the number of passive text data may include:
If the number of passive text data is greater than or equal to a set ratio of the total number of public network text data in the corresponding class, generating warning information.
The set ratio can be determined according to the type of public network text data, for example a value such as 70%. A number of passive text data greater than or equal to the set ratio of the total number of public network text data in the corresponding class indicates that the class may trigger a related public opinion crisis and that early warning is needed. After the warning information is generated, the relevant users can be notified by means such as an alarm device, system notification, online push service, short message service and/or mail service, so that they can take corresponding measures in time. Specifically, the users responsible for monitoring can be set as early-warning notification objects; the generated warning information is filled into a preset notice template, and the notice template carrying the warning information is sent to the early-warning notification objects, to ensure that they obtain the warning information promptly and efficiently.
This embodiment generates warning information when the number of passive text data is greater than or equal to the set ratio of the total number of public network text data in the corresponding class, so that the relevant users learn of the warning in time and can respond accordingly to prevent a public opinion crisis.
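The ratio check can be sketched as below. This is only an illustration under stated assumptions: each class is represented as a list of already-computed sentiment orientation parameters, the class names are made up, and the defaults of 0.3 and 0.7 simply reuse the example values from the text.

```python
def should_warn(sentiment_params, emotion_threshold=0.3, setting_ratio=0.7):
    """True when passive texts reach the set ratio of the class total."""
    if not sentiment_params:
        return False  # assumption: an empty class never triggers a warning
    passive_count = sum(1 for s in sentiment_params if s < emotion_threshold)
    return passive_count >= setting_ratio * len(sentiment_params)


def classes_needing_warning(classes):
    """Return the names of the classes for which a warning is generated."""
    return [name for name, params in classes.items() if should_warn(params)]


# Hypothetical classes: name -> sentiment orientation parameter per text.
example_classes = {
    "recall": [0.1, 0.2, 0.25, 0.9],   # 3 of 4 texts are passive
    "launch": [0.8, 0.7, 0.1, 0.6],    # 1 of 4 texts is passive
}
```

In this example only the "recall" class reaches the 70% ratio, so only that class would produce warning information to be filled into the notice template.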
In one embodiment, before the process of crawling the public network text data within the first set period of time using preset high-frequency words as keywords, the method further includes:
Acquiring the public opinion events within a second set period of time, obtaining the effective words whose number of occurrences in the public opinion events is greater than a frequency threshold, and determining the obtained effective words to be high-frequency words.
The second set period of time can be a relatively long period, such as the previous month or the previous two months. The frequency threshold can be determined according to the characteristics and the total amount of the public opinion events, for example a value such as 50 or 60. The text information contained in the public opinion events within the second set period of time can be collated and statistical methods applied to obtain the effective words whose number of occurrences is greater than the frequency threshold. This identifies the high-frequency words of the public opinion events, which are then used to crawl public network text data within the first set period of time, ensuring that the crawled public network text data is relevant.
This embodiment determines the high-frequency words from the public opinion events within the second set period of time, so that the public network text data crawled with these high-frequency words as keywords is public opinion data within the first set period of time. Corresponding data processing is then carried out on that public opinion data to realize public opinion monitoring, which can effectively prevent a related public opinion crisis from breaking out.
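Counting occurrences against a frequency threshold can be sketched with the standard library. The example assumes the event texts have already been segmented into word lists; the function name, the sample words and the threshold of 1 are illustrative assumptions, not values from the patent.

```python
from collections import Counter


def high_frequency_words(event_texts, stop_words, frequency_threshold):
    """Return effective words occurring more than frequency_threshold times.

    event_texts is a list of word lists, one list per public opinion event.
    """
    counts = Counter(
        word
        for words in event_texts
        for word in words
        if word not in stop_words  # effective words exclude stop words
    )
    return sorted(w for w, c in counts.items() if c > frequency_threshold)


# Hypothetical segmented texts of three public opinion events.
events = [
    ["the", "recall", "milk"],
    ["milk", "recall", "the"],
    ["milk", "price"],
]
keywords = high_frequency_words(events, stop_words={"the"}, frequency_threshold=1)
```

With these inputs "milk" occurs three times and "recall" twice, so both exceed the threshold and would serve as crawl keywords, while "price" and the stop word "the" are excluded.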
As one embodiment, after the public opinion events within the second set period of time are acquired, the data corresponding to the public opinion events can also be collated, the features the events have in common summarized, and public opinion early-warning rules formulated; during the corresponding network information monitoring, if a public opinion early-warning rule is triggered, the corresponding early warning is issued.
The public network massive information monitoring method based on natural language processing technique provided in this embodiment relies on the powerful computing capability of computers, can output public opinion information automatically, efficiently and continuously, and features high accuracy and strong timeliness.
With reference to Fig. 2, Fig. 2 is a structural schematic diagram of the public network massive information monitoring system based on natural language processing technique of one embodiment, comprising:
A crawling module 10, for crawling the public network text data within the first set period of time using preset high-frequency words as keywords;
An identification module 20, for performing word segmentation on each public network text data, identifying the effective words of the public network text data, determining the word weight of each effective word in each public network text data according to the word weights recorded in an effective dictionary, and generating a term vector that records, in order, the word weights of the effective words in the public network text data; wherein the effective words are the words in the public network text data other than stop words, and the effective dictionary is a database recording the word weight corresponding to each word;
A monitoring module 30, for classifying the public network text data according to the term vectors and monitoring each class of public network text data separately.
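The term vector built by the identification module can be sketched as follows. The sketch rests on several assumptions that the text leaves open: the input is already segmented, each distinct effective word contributes its weight once in order of first appearance, words missing from the effective dictionary are skipped, and the setting value used for padding is 0. The function name and the dictionary contents are hypothetical.

```python
def build_term_vector(words, effective_dictionary, stop_words, setting_value=0.0):
    """Record effective-word weights in order, padded to n dimensions.

    n is the number of words in the effective dictionary.
    """
    n = len(effective_dictionary)
    seen = set()
    weights = []
    for word in words:
        # Skip stop words, words without a recorded weight, and repeats.
        if word in stop_words or word not in effective_dictionary or word in seen:
            continue
        seen.add(word)
        weights.append(effective_dictionary[word])
    # Fill the remaining positions with the setting value.
    return weights + [setting_value] * (n - len(weights))


# Hypothetical effective dictionary: word -> word weight.
effective_dictionary = {"milk": 0.6, "recall": 0.3, "price": 0.2, "safety": 0.5}
vector = build_term_vector(
    ["the", "milk", "recall", "milk"], effective_dictionary, stop_words={"the"}
)
```

Because every text is padded to the same length n, any two resulting vectors can be compared directly with a cosine value.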
In one embodiment, the term vector is an n-dimensional vector that records, in order, the word weights of the effective words in the corresponding public network text data followed by the setting value, where n is the number of words in the effective dictionary;
The monitoring module includes a computing module:
The computing module is used to calculate the cosine value between any two term vectors and, when the cosine value is greater than the similarity threshold, to determine the two public network text data corresponding to the cosine value to be one class of text data.
As one embodiment, the monitoring module includes a determining module:
The determining module is used to determine multiple classes of text data that include the same public network text data to be one class of text data.
In one embodiment, the process of monitoring each class of public network text data separately includes:
Identifying the sentiment orientation parameter of each public network text data; wherein the sentiment orientation parameter is a parameter characterizing how positive the corresponding public network text data is;
Determining the public network text data whose sentiment orientation parameter is less than an emotion threshold to be passive (negative) text data;
Counting the number of passive text data in each class of public network text data, and carrying out network information monitoring on the public network text data of the corresponding class according to the number of passive text data.
As one embodiment, the process of identifying the sentiment orientation parameter of a public network text data includes:
Extracting the feature emotion words in the public network text data, and determining the emotion deviation value of each feature emotion word according to the emotion deviation values recorded for each emotion word in an emotion dictionary; wherein the emotion dictionary is a database recording the emotion deviation value corresponding to each emotion word;
Calculating the average of the emotion deviation values corresponding to the public network text data, and determining the sentiment orientation parameter of the public network text data according to the average.
As one embodiment, the process of carrying out network information monitoring on the public network text data of the corresponding class according to the number of passive text data includes:
If the number of passive text data is greater than or equal to a set ratio of the total number of public network text data in the corresponding class, generating warning information.
In one embodiment, the public network massive information monitoring system based on natural language processing technique further includes:
An acquisition module, for acquiring the public opinion events within the second set period of time, obtaining the effective words whose number of occurrences in the public opinion events is greater than the frequency threshold, and determining the obtained effective words to be high-frequency words.
Fig. 3 is a block diagram of a computer system 1000 capable of implementing the embodiments of the present invention. Computer system 1000 is only one example of a computer environment suitable for the invention and should not be taken as imposing any limitation on the scope of use of the invention. Nor should computer system 1000 be construed as depending on, or requiring, any one component or combination of components of the illustrated exemplary computer system 1000.
Computer system 1000 shown in Fig. 3 is one example of a computer system suitable for the invention. Other architectures with different subsystem configurations can also be used. For example, widely known devices such as desktop computers and notebooks can be suitable for some embodiments of the present invention, although the invention is not limited to the devices enumerated above.
As shown in Fig. 3, computer system 1000 includes a processor 1010, a memory 1020 and a system bus 1022. The various system components, including memory 1020 and processor 1010, are connected to system bus 1022. Processor 1010 is hardware that executes computer program instructions through the basic arithmetic and logical operations of the computer system. Memory 1020 is a physical device for temporarily or permanently storing computer programs or data (for example, program state information). System bus 1022 can be any of several types of bus structure, including a memory bus or memory controller, a peripheral bus and a local bus. Processor 1010 and memory 1020 can communicate data over system bus 1022. Memory 1020 includes read-only memory (ROM) or flash memory (neither is shown in the figure) and random access memory (RAM); RAM typically refers to the main memory into which the operating system and application programs are loaded.
Computer system 1000 further includes a display interface 1030 (for example, a graphics processing unit), a display device 1040 (for example, a liquid crystal display), an audio interface 1050 (for example, a sound card) and an audio device 1060 (for example, a loudspeaker). Display device 1040 and audio device 1060 can be used to present the related warning information.
Computer system 1000 generally includes a storage device 1070. Storage device 1070 can be selected from a variety of computer-readable media; a computer-readable medium is any available medium that can be accessed by computer system 1000, including both removable and fixed media. For example, computer-readable media include, but are not limited to, flash memory (micro SD card), CD-ROM, digital versatile disc (DVD) or other optical disc storage, cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the required information and can be accessed by computer system 1000.
Computer system 1000 further includes an input unit 1080 and an input interface 1090 (for example, an I/O controller). A user can enter instructions and information into computer system 1000 through input unit 1080, such as a keyboard, a mouse, or a touch panel on display device 1040. Input unit 1080 is usually connected to system bus 1022 through input interface 1090, but it can also be connected through other interfaces or bus structures, such as a universal serial bus (USB).
Computer system 1000 can be logically connected to one or more network devices in a network environment. A network device can be a personal computer, a server, a router, a tablet computer or another common network node. Computer system 1000 is connected to network devices through a local area network (LAN) interface 1100 or a mobile communication unit 1110. A local area network (LAN) is a computer network formed by interconnecting computers over network media within a limited area, such as a home, a school, a computer laboratory or an office building. WiFi and twisted-pair Ethernet are the two technologies most commonly used to build local area networks. WiFi is a technology that lets computer systems 1000 exchange data with one another or connect to a wireless network by radio waves. Mobile communication unit 1110 can make and receive telephone calls over a radio communication link while moving within a wide geographic area. In addition to calls, mobile communication unit 1110 also supports Internet access over 2G, 3G or 4G cellular communication systems that provide mobile data services.
It should be pointed out that other computer systems including more or fewer subsystems than computer system 1000 can also be suitable for the invention. As detailed above, computer system 1000 suitable for the invention can perform the specified operations of the public network massive information monitoring method based on natural language processing technique. Computer system 1000 performs these operations by having processor 1010 run software instructions held in a computer-readable medium. These software instructions can be read into memory 1020 from storage device 1070, or from another device through LAN interface 1100. The software instructions stored in memory 1020 cause processor 1010 to execute the above public network massive information monitoring method based on natural language processing technique. In addition, the present invention can equally be implemented by hardware circuits or by hardware circuits combined with software instructions. Therefore, implementation of the present invention is not limited to any specific combination of hardware circuits and software.
The public network massive information monitoring system based on natural language processing technique of the present invention corresponds one-to-one with the public network massive information monitoring method based on natural language processing technique of the present invention; the technical features and beneficial effects described in the embodiments of the method also apply to the embodiments of the system.
Based on the examples described above, a computer device is also provided in one embodiment. The computer device includes a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the public network massive information monitoring method based on natural language processing technique of any one of the above embodiments.
Through the computer program running on the processor, the above computer device effectively improves the public network massive information monitoring based on natural language processing technique.
Those of ordinary skill in the art will appreciate that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium; in embodiments of the present invention, the program can be stored in a storage medium of a computer system and executed by at least one processor of that computer system, so as to implement the processes of the embodiments of the above public network massive information monitoring method based on natural language processing technique. The storage medium can be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
Accordingly, a computer storage medium is also provided in one embodiment, on which a computer program is stored, wherein the program, when executed by a processor, implements the public network massive information monitoring method based on natural language processing technique of any one of the above embodiments.
Through the computer program it stores, the above computer storage medium can improve the efficiency and effect of the public network massive information monitoring based on natural language processing technique.
The technical features of the embodiments described above can be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments is described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The embodiments described above express only several embodiments of the present invention, and their description is relatively specific and detailed, but they are not therefore to be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (10)
1. A public network massive information monitoring method based on natural language processing technique, characterized by comprising:
Crawling the public network text data within a first set period of time, using preset high-frequency words as keywords;
Performing word segmentation on each public network text data, identifying the effective words of the public network text data, determining the word weight of each effective word in each public network text data according to the word weights recorded in an effective dictionary, and generating a term vector that records, in order, the word weights of the effective words in the public network text data; wherein the effective words are the words in the public network text data other than stop words, and the effective dictionary is a database recording the word weight corresponding to each word;
Classifying the public network text data according to the term vectors, and monitoring each class of public network text data separately.
2. The public network massive information monitoring method based on natural language processing technique according to claim 1, characterized in that the term vector is an n-dimensional vector that records, in order, the word weights of the effective words in the corresponding public network text data followed by a setting value, where n is the number of words in the effective dictionary;
The process of classifying the public network text data according to the term vectors includes:
Calculating the cosine value between any two term vectors and, when the cosine value is greater than a similarity threshold, determining the two public network text data corresponding to the cosine value to be one class of text data.
3. The public network massive information monitoring method based on natural language processing technique according to claim 2, characterized in that, after the process of calculating the cosine value between any two term vectors and, when the cosine value is greater than the similarity threshold, determining the two public network text data corresponding to the cosine value to be one class of text data, the method further includes:
Determining multiple classes of text data that include the same public network text data to be one class of text data.
4. The public network massive information monitoring method based on natural language processing technique according to any one of claims 1 to 3, characterized in that the process of monitoring each class of public network text data separately includes:
Identifying the sentiment orientation parameter of each public network text data; wherein the sentiment orientation parameter is a parameter characterizing how positive the corresponding public network text data is;
Determining the public network text data whose sentiment orientation parameter is less than an emotion threshold to be passive text data;
Counting the number of passive text data in each class of public network text data, and carrying out network information monitoring on the public network text data of the corresponding class according to the number of passive text data.
5. The public network massive information monitoring method based on natural language processing technique according to claim 4, characterized in that the process of identifying the sentiment orientation parameter of a public network text data includes:
Extracting the feature emotion words in the public network text data, and determining the emotion deviation value of each feature emotion word according to the emotion deviation values recorded for each emotion word in an emotion dictionary; wherein the emotion dictionary is a database recording the emotion deviation value corresponding to each emotion word;
Calculating the average of the emotion deviation values corresponding to the public network text data, and determining the sentiment orientation parameter of the public network text data according to the average.
6. The public network massive information monitoring method based on natural language processing technique according to claim 4, characterized in that the process of carrying out network information monitoring on the public network text data of the corresponding class according to the number of passive text data includes:
If the number of passive text data is greater than or equal to a set ratio of the total number of public network text data in the corresponding class, generating warning information.
7. The public network massive information monitoring method based on natural language processing technique according to any one of claims 1 to 3, characterized in that, before the process of crawling the public network text data within the first set period of time using preset high-frequency words as keywords, the method further includes:
Acquiring the public opinion events within a second set period of time, obtaining the effective words whose number of occurrences in the public opinion events is greater than a frequency threshold, and determining the obtained effective words to be high-frequency words.
8. A public network massive information monitoring system based on natural language processing technique, characterized by comprising:
A crawling module, for crawling the public network text data within a first set period of time using preset high-frequency words as keywords;
An identification module, for performing word segmentation on each public network text data, identifying the effective words of the public network text data, determining the word weight of each effective word in each public network text data according to the word weights recorded in an effective dictionary, and generating a term vector that records, in order, the word weights of the effective words in the public network text data; wherein the effective words are the words in the public network text data other than stop words, and the effective dictionary is a database recording the word weight corresponding to each word;
A monitoring module, for classifying the public network text data according to the term vectors and monitoring each class of public network text data separately.
9. A computer device, including a memory, a processor and a computer program stored on the memory and runnable on the processor, characterized in that the processor, when executing the computer program, implements the public network massive information monitoring method based on natural language processing technique according to any one of claims 1 to 7.
10. A computer storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the public network massive information monitoring method based on natural language processing technique according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811067750.5A CN109359233A (en) | 2018-09-13 | 2018-09-13 | Public network massive information monitoring method and system based on natural language processing technique |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109359233A true CN109359233A (en) | 2019-02-19 |
Family
ID=65350660
Family Applications (1)
Application Number | Status | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201811067750.5A CN109359233A (en) | Pending | 2018-09-13 | 2018-09-13 | Public network massive information monitoring method and system based on natural language processing technique |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109359233A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110992169B (en) | Risk assessment method, risk assessment device, server and storage medium | |
US10108741B2 (en) | Automatic browser tab groupings | |
Adedoyin-Olowe et al. | A rule dynamics approach to event detection in twitter with its application to sports and politics | |
CN103177090B (en) | Topic detection method and device based on big data | |
CN109145216A (en) | Network public-opinion monitoring method, device and storage medium | |
JP2018523885A (en) | Classifying user behavior as abnormal | |
Jiang et al. | Recommending new features from mobile app descriptions | |
CN109359233A (en) | Public network massive information monitoring method and system based on natural language processing technique | |
CN111600874A (en) | User account detection method, device, electronic equipment, medium and program product | |
US10762089B2 (en) | Open ended question identification for investigations | |
US11095953B2 (en) | Hierarchical video concept tagging and indexing system for learning content orchestration | |
CN111178701B (en) | Risk control method and device based on feature derivation technology and electronic equipment | |
CN110263817B (en) | Risk grade classification method and device based on user account | |
CN115576834A (en) | Software test multiplexing method, system, terminal and medium for supporting fault recovery | |
CN115514558A (en) | Intrusion detection method, device, equipment and medium | |
CN111383072A (en) | User credit scoring method, storage medium and server | |
CN112231444A (en) | Processing method and device for corpus data combining RPA and AI and electronic equipment | |
CN113746780A (en) | Abnormal host detection method, device, medium and equipment based on host image | |
CN110347934A (en) | Text data filtering method, device and medium | |
CN113961811B (en) | Event map-based conversation recommendation method, device, equipment and medium | |
Janer et al. | Incorporating space, time, and magnitude measures in a network characterization of earthquake events | |
CN105786929A (en) | Information monitoring method and device | |
CN114443738A (en) | Abnormal data mining method, device, equipment and medium | |
CN114547257A (en) | Class matching method and device, computer equipment and storage medium | |
KR20230059364A (en) | Public opinion poll system using language model and method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190219 |