CN109359233A - Public network massive information monitoring method and system based on natural language processing technique

Info

Publication number
CN109359233A
Authority
CN
China
Prior art keywords
public network
word
text data
effective
network text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811067750.5A
Other languages
Chinese (zh)
Inventor
江颖
钟山
沈超
张馨
陈锦聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Wislife Intelligent Technology Co Ltd
Original Assignee
Guangzhou Wislife Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Wislife Intelligent Technology Co Ltd
Priority to CN201811067750.5A
Publication of CN109359233A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Human Resources & Organizations (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a public network massive information monitoring method and system based on natural language processing technology, together with a computer device and a computer storage medium. The method includes: crawling public network text data within a first set time period using preset high-frequency words as keywords, and analyzing the text using natural language processing technology; performing word segmentation on each piece of public network text data, identifying the effective words of the public network text data, determining the word weight of each effective word in each piece of public network text data according to the word weights recorded in an effective dictionary, and building a term vector that records, in order, the word weights of the effective words in the public network text data, where an effective word is a word in the public network text data other than a stop word and the effective dictionary is a database recording the word weight corresponding to each word; and classifying the public network text data according to the term vectors and monitoring each class of public network text data separately. The method has high monitoring efficiency and improves the corresponding monitoring effect.

Description

Public network massive information monitoring method and system based on natural language processing technique
Technical field
The present invention relates to the field of Internet technology, and in particular to a public network massive information monitoring method and system based on natural language processing technology, a computer device, and a computer storage medium.
Background
With the rapid development of Internet technology, the number of Internet users has expanded rapidly. More and more users express their views through Internet platforms such as microblogs and WeChat, and information can spread across the network to the whole world within a few hours. Grasping massive public network information, such as related public opinion, in a timely manner is therefore vital both to enterprises and to the relevant regulatory bodies. Traditional means of monitoring network information (massive public network information), such as public opinion monitoring, first identify and screen information according to its topic and then monitor only the screened information. This easily leads to information being omitted, which makes the monitoring effect poor.
Summary of the invention
In view of this, and to address the technical problem that traditional schemes easily omit information and therefore monitor massive public network information poorly, it is necessary to provide a public network massive information monitoring method and system based on natural language processing technology, a computer device, and a computer storage medium.
A public network massive information monitoring method based on natural language processing technology, comprising:
crawling public network text data within a first set time period, using preset high-frequency words as keywords;
performing word segmentation on each piece of public network text data, identifying the effective words of the public network text data, determining the word weight of each effective word in each piece of public network text data according to the word weights recorded in an effective dictionary, and recording, in order, the word weights of the effective words in the public network text data as a term vector; wherein an effective word is a word in the public network text data other than a stop word, and the effective dictionary is a database recording the word weight corresponding to each word;
classifying the public network text data according to the term vectors, and monitoring each class of public network text data separately.
The above method crawls the public network text data within the first set time period according to preset high-frequency words, identifies the effective words of that data, determines the word weight of each effective word according to the word weights recorded in the effective dictionary, records those word weights in order as a term vector, classifies the public network text data, and then monitors each class separately. Because public network text data is monitored class by class, monitoring efficiency is high; and because the monitoring process is based on the effective words contained in the public network text data, the corresponding monitoring effect is improved.
In one embodiment, the term vector is an n-dimensional vector that records, in order, the word weights of the effective words in the corresponding public network text data followed by a setting value, where n is the number of words in the effective dictionary;
the process of classifying the public network text data according to the term vectors includes:
calculating the cosine value between every two term vectors and, when the cosine value is greater than a similarity threshold, determining the two pieces of public network text data corresponding to that cosine value to be one class of text data.
This embodiment helps ensure the accuracy of the classification of the public network text data.
As one embodiment, after the process of calculating the cosine value between any two term vectors and, when the cosine value is greater than the similarity threshold, determining the two corresponding pieces of public network text data to be one class of text data, the method further includes:
determining multiple classes of text data that contain the same piece of public network text data to be one class of text data.
In this embodiment, pieces of public network text data that are each in the same class as one common piece of public network text data are highly similar to one another. Determining them to be one class of text data allows the same or a similar monitoring scheme to be used to monitor several pieces of public network text data at once, which can improve the corresponding monitoring efficiency.
In one embodiment, the process of monitoring each class of public network text data separately includes:
identifying the sentiment orientation parameter of each piece of public network text data, wherein the sentiment orientation parameter characterizes how positive the corresponding public network text data is;
determining public network text data whose sentiment orientation parameter is less than an emotion threshold to be passive text data;
counting the passive text data in each class of public network text data, and performing network information monitoring on the public network text data of the corresponding class according to the number of passive text data.
In this embodiment, a large number of passive text data indicates that the class of public network text data may trigger a related public opinion crisis, so processing such as early warning is needed to ensure that the related public opinion is handled in time.
As one embodiment, the process of identifying the sentiment orientation parameter of public network text data includes:
extracting the feature emotion words in the public network text data, and determining the emotion deviation value of each feature emotion word according to the emotion deviation values recorded for each emotion word in an emotion dictionary; wherein the emotion dictionary is a database recording the emotion deviation value corresponding to each emotion word;
calculating the average of the emotion deviation values corresponding to the public network text data, and determining the sentiment orientation parameter of the public network text data according to the average.
This embodiment allows the sentiment orientation parameter of public network text data to be determined accurately.
As one embodiment, the process of performing network information monitoring on the public network text data of the corresponding class according to the number of passive text data includes:
generating warning information if the number of passive text data is greater than or equal to a set ratio of the total number of public network text data in the corresponding class.
In this embodiment, warning information is generated when the number of passive text data reaches the set ratio of the total number of public network text data in the corresponding class, so that the relevant users learn of the warning in time and can respond accordingly to prevent a public opinion crisis.
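The warning condition above can be sketched in a few lines. This is only an illustration: the function name `needs_warning` and the default ratio of 0.5 are assumptions, since the text leaves the set ratio open.

```python
def needs_warning(passive_count, total_count, set_ratio=0.5):
    """Return True when passive texts reach the set ratio of the class total.

    set_ratio=0.5 is a hypothetical default; the source does not fix a value.
    """
    if total_count == 0:
        # An empty class cannot trigger a warning.
        return False
    return passive_count >= set_ratio * total_count


if __name__ == "__main__":
    print(needs_warning(5, 10))  # 5 of 10 texts are passive
    print(needs_warning(4, 10))  # 4 of 10 texts are passive
```

The comparison uses "greater than or equal to", matching the wording of the embodiment.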
In one embodiment, before the process of crawling the public network text data within the first set time period using preset high-frequency words as keywords, the method further includes:
collecting the public opinion events within a second set time period, obtaining the effective words whose number of occurrences in the public opinion events is greater than a frequency threshold, and determining the obtained effective words to be high-frequency words.
This embodiment can organize the text information of the public opinion events within the second set time period and use statistical methods to obtain the effective words whose number of occurrences in those events is greater than the frequency threshold. The high-frequency words identified in this way are then used to crawl the public network text data within the first set time period, which helps ensure that the crawled public network text data is relevant.
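A minimal sketch of this high-frequency word selection follows. It assumes the public opinion event texts have already been segmented into tokens; the function name `high_frequency_words` and the sample data are hypothetical.

```python
from collections import Counter


def high_frequency_words(events, stop_words, freq_threshold):
    """Return effective words that occur more than freq_threshold times.

    events: list of token lists, one per public opinion event text
            (assumed to be segmented already).
    """
    counts = Counter(
        word
        for tokens in events
        for word in tokens
        if word not in stop_words  # keep effective words only
    )
    return sorted(word for word, count in counts.items() if count > freq_threshold)


events = [
    ["the", "battery", "overheats"],
    ["battery", "recall", "announced"],
    ["the", "battery", "recall"],
]

if __name__ == "__main__":
    print(high_frequency_words(events, stop_words={"the"}, freq_threshold=1))
```

The returned words would then serve as the crawl keywords for the first set time period.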
A public network massive information monitoring system based on natural language processing technology, comprising:
a crawling module, configured to crawl public network text data within a first set time period using preset high-frequency words as keywords;
an identification module, configured to perform word segmentation on each piece of public network text data, identify the effective words of the public network text data, determine the word weight of each effective word in each piece of public network text data according to the word weights recorded in an effective dictionary, and record, in order, the word weights of the effective words in the public network text data as a term vector; wherein an effective word is a word in the public network text data other than a stop word, and the effective dictionary is a database recording the word weight corresponding to each word;
a monitoring module, configured to classify the public network text data according to the term vectors and monitor each class of public network text data separately.
The above system crawls the public network text data within the first set time period according to preset high-frequency words, identifies the effective words of that data, determines the word weight of each effective word according to the word weights recorded in the effective dictionary, records those word weights in order as a term vector, classifies the public network text data, and then monitors each class separately. Because public network text data is monitored class by class, monitoring efficiency is high; and because the monitoring process is based on the effective words contained in the public network text data, the corresponding monitoring effect is improved.
A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the public network massive information monitoring method based on natural language processing technology provided by any of the above embodiments.
A computer storage medium storing a computer program, wherein the program, when executed by a processor, implements the public network massive information monitoring method based on natural language processing technology provided by any of the above embodiments.
In accordance with the public network massive information monitoring method based on natural language processing technology of the present invention, the present invention also provides a computer device and a computer storage medium for implementing the above network information monitoring method by means of a program. The computer device and computer storage medium can improve the network information monitoring effect.
Brief description of the drawings
Fig. 1 is a flowchart of the public network massive information monitoring method based on natural language processing technology according to one embodiment;
Fig. 2 is a schematic structural diagram of the public network massive information monitoring system based on natural language processing technology according to one embodiment;
Fig. 3 is a block diagram of the computer system according to one embodiment.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein only explain the present invention and do not limit its scope of protection.
It should be noted that the terms "first", "second" and "third" in the embodiments of the present invention only distinguish similar objects and do not imply a particular ordering of the objects. It will be understood that, where permitted, "first", "second" and "third" may be interchanged in a specific order or sequence, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein.
The terms "include" and "have" in the embodiments of the present invention, and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or modules is not limited to the listed steps or modules, but optionally also includes steps or modules that are not listed, or optionally also includes other steps or modules inherent to the process, method, product or device.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. Appearances of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
Referring to Fig. 1, Fig. 1 is a flowchart of the public network massive information monitoring method based on natural language processing technology according to one embodiment, which includes:
S10, crawling public network text data within a first set time period, using preset high-frequency words as keywords;
The first set time period can be chosen according to the required monitoring accuracy, for example the last few days or the last three days, or a period such as the 40 hours ending at the current time. The public network text data is text published on the public network, such as news, comments or messages. Optionally, the public network text data is divided by user: each text that a user publishes on a public platform is one piece of public network text data, for example a comment that a user posts about a product, a news report published by a reporter, or a message left by a user.
The network information may include specific information, such as public opinion, that easily triggers a public opinion crisis. There may be one or more high-frequency words. They can be obtained from specific network information, such as the public opinion events that occurred in a past period: specifically, the words that occur most frequently in each item of specific network information during that period are counted, and the high-frequency words are determined from them.
In one embodiment, step S10 may build a web crawler system based on mainstream media platforms, crawl the public network text data within the first set time period using preset high-frequency words as keywords, and store the public network text data. A text processing system based on natural language processing may also be built to clean the stored public network text data by filtering out junk data such as advertisements. The remaining public network text data is then processed with natural language processing algorithms (for example correcting errors and deleting duplicate content), which preprocesses the public network text data and ensures its consistency. Data indicators may also be derived from the preprocessed public network text data and stored in discretized form.
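The cleaning step described above (filtering junk data and removing duplicates) might look like the following sketch. The substring-based junk rule and the names `clean_texts` and `junk_markers` are assumptions made for illustration; the text does not specify how junk data is recognized.

```python
def clean_texts(texts, junk_markers):
    """Drop texts containing a junk marker, then drop exact duplicates.

    junk_markers is a hypothetical list of substrings that flag advertising
    text. Order of first appearance is preserved.
    """
    seen = set()
    cleaned = []
    for text in texts:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        if any(marker in normalized for marker in junk_markers):
            continue  # junk data such as advertisements
        if normalized in seen:
            continue  # duplicate content
        seen.add(normalized)
        cleaned.append(normalized)
    return cleaned


if __name__ == "__main__":
    raw = ["Great phone", "Buy now!!! cheap", "Great  phone", "Screen cracked"]
    print(clean_texts(raw, junk_markers=["Buy now"]))
```

A production system would likely use learned or rule-based filters built from sample data rather than fixed substrings.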
S20, performing word segmentation on each piece of public network text data, identifying the effective words of the public network text data, determining the word weight of each effective word in each piece of public network text data according to the word weights recorded in an effective dictionary, and recording, in order, the word weights of the effective words in the public network text data as a term vector; wherein an effective word is a word in the public network text data other than a stop word, and the effective dictionary is a database recording the word weight corresponding to each word;
Before word segmentation is performed on each piece of public network text data, junk data filtering rules can be formulated from a large amount of network sample data, junk data such as advertisements deleted, and duplicate data removed using database techniques. Word segmentation divides the public network text data into multiple words, which comprise stop words and effective words. Stop words are frequently used words such as articles, prepositions, adverbs and/or conjunctions, for example "inside", "it" and "for". Because these words are used so often, they appear on almost every web page; if a website contains many of them, considerable resources are wasted during data processing. Ignoring these words (stop words) saves resources and improves data processing efficiency. An effective word is a word in the public network text data other than a stop word, and it carries a specific meaning in the corresponding text data. The effective dictionary is a large and comprehensive dictionary that records the word weight corresponding to each effective word (a word weight is a specific numerical value). The word weight can be determined from factors such as the emotional features of the word and the context in which it is used in network information. Each effective word of the public network text data can be looked up in the effective dictionary to find its word weight. Each piece of public network text data corresponds to one term vector, which records, in order, the word weights of the effective words in the public network text data (the effective words are arranged in the order in which they appear in the public network text data). If the effective dictionary contains n words, the term vector can be an n-dimensional vector: after the word weights of the effective words of a piece of public network text data have been recorded, the term vector is padded with a setting value so that it becomes an n-dimensional vector.
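As a rough illustration of step S20, the sketch below builds a padded term vector from tokens that have already been segmented. The names are hypothetical, and two behaviors are assumptions not stated in the text: words missing from the effective dictionary are skipped, and texts with more than n effective words are truncated to n entries.

```python
def term_vector(tokens, stop_words, effective_dictionary, fill_value=0.0):
    """Record word weights in order of appearance, padded to n dimensions.

    n is the number of words in the effective dictionary. Skipping unknown
    words and truncating to n entries are assumptions of this sketch.
    """
    n = len(effective_dictionary)
    weights = [
        effective_dictionary[token]
        for token in tokens
        if token not in stop_words and token in effective_dictionary
    ]
    weights = weights[:n]
    # Pad with the setting value so every vector has the same length n.
    return weights + [fill_value] * (n - len(weights))


effective_dictionary = {"battery": 0.8, "recall": 0.9, "good": 0.3, "screen": 0.5}

if __name__ == "__main__":
    tokens = ["the", "battery", "recall"]
    print(term_vector(tokens, {"the"}, effective_dictionary))
```

Padding to a common length is what allows any two term vectors to be compared in the classification step.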
S30, classifying the public network text data according to the term vectors, and monitoring each class of public network text data separately.
Step S30 can identify the direction of each term vector and determine public network text data whose term vectors point in similar directions (for example, term vectors whose angle is less than a set angle), and which are therefore highly similar, to be one class of public network text data. Each class of public network text data is then monitored separately, which ensures the efficiency of the network information monitoring.
In one embodiment, monitoring rules can be formulated according to the type of public network text data, and the public network text data is monitored according to those rules. Specifically, the public opinion events that occurred in a past period can be collected, the data they generated organized, and the features the events have in common summarized in order to formulate network information monitoring rules. For example, if historical data shows that a public opinion crisis questioning product quality breaks out when a product receives more than 10 negative comments within 3 days, the monitoring rule is: the number of negative comment texts about a product within 3 days is greater than 10. When the number of negative comment texts about a product within 3 days exceeds 10, the corresponding early warning is issued.
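The example rule above (more than 10 negative comments on a product within 3 days) can be expressed as a small check. The function name `should_warn` and the use of timestamps per negative comment are assumptions made for this sketch.

```python
from datetime import datetime, timedelta


def should_warn(negative_comment_times, now, window=timedelta(days=3), max_count=10):
    """Return True when more than max_count negative comments fall in the window.

    negative_comment_times: publication times of negative comments about one
    product (hypothetical input format).
    """
    recent = [t for t in negative_comment_times if now - window <= t <= now]
    return len(recent) > max_count


if __name__ == "__main__":
    now = datetime(2018, 9, 13, 12, 0)
    times = [now - timedelta(hours=h) for h in range(11)]  # 11 recent comments
    print(should_warn(times, now))
```

The window length and the count limit are parameters so that rules derived from other historical events can reuse the same check.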
The public network massive information monitoring method based on natural language processing technology provided by the present invention crawls the public network text data within the first set time period according to preset high-frequency words, identifies the effective words of that data, determines the word weight of each effective word according to the word weights recorded in the effective dictionary, records those word weights in order as a term vector, classifies the public network text data, and then monitors each class separately. Because public network text data is monitored class by class, monitoring efficiency is high; and because the monitoring process is based on the effective words contained in the public network text data, the corresponding monitoring effect is improved.
In one embodiment, the term vector is an n-dimensional vector that records, in order, the word weights of the effective words in the corresponding public network text data followed by a setting value, where n is the number of words in the effective dictionary;
the process of classifying the public network text data according to the term vectors includes:
calculating the cosine value between every two term vectors and, when the cosine value is greater than a similarity threshold, determining the two pieces of public network text data corresponding to that cosine value to be one class of text data.
Specifically, the setting value can be a value that is convenient for vector operations, such as 0 or 1. The similarity threshold can be set according to the required classification accuracy, for example to a value such as 0.9. After the word weights of the effective words of a piece of public network text data have been recorded, its term vector can be padded with the setting value so that it becomes an n-dimensional vector. For example, suppose the first public network text data contains a effective words (a < n) and the second public network text data contains b effective words (b < n). The term vector of the first public network text data is $A = [A_1, A_2, \ldots, A_a, \ldots, A_n]$ and the term vector of the second public network text data is $B = [B_1, B_2, \ldots, B_b, \ldots, B_n]$. In $A$, the entries $A_1$ to $A_a$ record, in order, the word weights of the effective words in the first public network text data, and $A_{a+1}$ to $A_n$ are the setting value; in $B$, the entries $B_1$ to $B_b$ record, in order, the word weights of the effective words in the second public network text data, and $B_{b+1}$ to $B_n$ are the setting value. The cosine value between term vector A and term vector B can be:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
When the cosine value cos θ is greater than the similarity threshold, the angle between term vector A and term vector B is small and the two vectors point in similar directions, so the public network text data corresponding to term vector A and term vector B belong to one class of text data.
This embodiment helps ensure the accuracy of the classification of the public network text data.
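The cosine comparison can be sketched as follows, using the standard cosine similarity between two equal-length vectors. The function names are hypothetical, the default threshold of 0.9 follows the example value given in the text, and returning 0 for an all-zero vector is an assumption of this sketch.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def same_class(a, b, similarity_threshold=0.9):
    """Two texts are one class when the cosine exceeds the threshold."""
    return cosine(a, b) > similarity_threshold


if __name__ == "__main__":
    print(same_class([0.8, 0.9, 0.0, 0.0], [0.8, 0.7, 0.0, 0.0]))
    print(same_class([1.0, 0.0], [0.0, 1.0]))
```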
As one embodiment, after the process of calculating the cosine value between any two term vectors and, when the cosine value is greater than the similarity threshold, determining the two corresponding pieces of public network text data to be one class of text data, the method further includes:
determining multiple classes of text data that contain the same piece of public network text data to be one class of text data.
In this embodiment, pieces of public network text data that are each in the same class as one common piece of public network text data are highly similar to one another. Determining them to be one class of text data allows the same or a similar monitoring scheme to be used to monitor several pieces of public network text data at once, which can improve the corresponding monitoring efficiency. For example, if the first public network text data and the second public network text data are in the same class, and the first public network text data and the third public network text data are in the same class, then the first, second and third public network text data can be grouped into one class of text data, and network information monitoring is performed on that class.
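One way to merge classes that share a piece of text is a union-find (disjoint-set) structure. This is a swapped-in illustration technique rather than something the source names. Texts are referred to by index, `similar_pairs` holds the pairs whose cosine value exceeded the threshold, and all names are hypothetical.

```python
def merge_classes(num_texts, similar_pairs):
    """Group text indices so that pairs sharing a text end up in one class.

    Texts with no similar partner form a class of their own (an assumption
    of this sketch).
    """
    parent = list(range(num_texts))

    def find(i):
        # Follow parent links to the class representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for i in range(num_texts):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    # Text 0 is similar to text 1 and to text 2, so all three form one class.
    print(merge_classes(5, [(0, 1), (0, 2)]))
```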
In one embodiment, the process of monitoring each class of public network text data separately includes:
identifying the sentiment orientation parameter of each piece of public network text data, wherein the sentiment orientation parameter characterizes how positive the corresponding public network text data is;
determining public network text data whose sentiment orientation parameter is less than an emotion threshold to be passive text data;
counting the passive text data in each class of public network text data, and performing network information monitoring on the public network text data of the corresponding class according to the number of passive text data.
The sentiment orientation parameter can range from 0 to 1, where 0 represents absolutely negative (passive) and 1 represents absolutely positive (active). The emotion deviation value of active text data is high and that of passive text data is low. The emotion threshold can be set according to the characteristics of the network information being monitored, for example to a value such as 0.3.
In this embodiment, a large number of passive text data indicates that the class of public network text data may trigger a related public opinion crisis, so processing such as early warning is needed to ensure that the related public opinion is handled in time.
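Counting the passive text data per class can be sketched as below. The input layout (classes as lists of text indices and a parallel list of sentiment orientation parameters) and the name `passive_counts` are assumptions; the default threshold of 0.3 follows the example value in the text.

```python
def passive_counts(classes, sentiment, emotion_threshold=0.3):
    """Count, for each class, the texts whose parameter is below the threshold.

    classes: list of lists of text indices (hypothetical layout).
    sentiment: sentiment orientation parameter per text, in [0, 1].
    """
    return [
        sum(1 for index in members if sentiment[index] < emotion_threshold)
        for members in classes
    ]


if __name__ == "__main__":
    classes = [[0, 1, 2], [3]]
    sentiment = [0.1, 0.25, 0.8, 0.3]
    print(passive_counts(classes, sentiment))
```

A text whose parameter equals the threshold is not counted, because the embodiment requires the parameter to be less than the emotion threshold.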
As one embodiment, the process of identifying the sentiment orientation parameter of public network text data may include:
Extracting the feature emotion words in the public network text data, and determining the emotion deviation value of each feature emotion word according to the emotion deviation values recorded in an emotion dictionary; wherein the emotion dictionary is a database that records the emotion deviation value corresponding to each emotion word;
Calculating the average of the emotion deviation values corresponding to the public network text data (for example, summing the emotion deviation values and then taking the mean), and determining the sentiment orientation parameter of the public network text data according to that average.
A feature emotion word is a word that can characterize a positive or negative attitude in speech, such as "far-sighted", "affinity", "lost", "sad", or "disappointed". The emotion deviation value characterizes how positive the corresponding word is and may range from 0 to 1, where 0 represents absolutely negative and 1 represents absolutely positive; positive emotion words have a high emotion deviation value and negative emotion words have a low one. The emotion dictionary is a database that records the emotion deviation value corresponding to each emotion word, and the emotion words it records include the feature emotion words found in the public network text data. Specifically, the public network text data can first be preprocessed, for example by filtering junk data and removing duplicate data; the preprocessed text is then segmented into words and its stop words are removed, after which the feature emotion words are extracted.
This embodiment can accurately determine the sentiment orientation parameter of public network text data.
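A minimal sketch of the dictionary lookup and averaging, assuming the text has already been preprocessed and segmented into tokens. The tiny lexicon and its values are invented for illustration, every occurrence of an emotion word is counted (the text does not say whether repeats count once or several times), and returning `None` for a text without emotion words is an assumption, since that case is not specified.

```python
def sentiment_parameter(tokens, emotion_lexicon):
    """Average the emotion deviation values of the feature emotion words.

    tokens:          already segmented, stop words removed (assumed)
    emotion_lexicon: {word: deviation value in [0, 1]} (illustrative)
    """
    values = [emotion_lexicon[t] for t in tokens if t in emotion_lexicon]
    if not values:
        # No feature emotion word found; this case is not specified.
        return None
    return sum(values) / len(values)


lexicon = {"disappointed": 0.1, "sad": 0.2, "far-sighted": 0.9}
score = sentiment_parameter(["service", "disappointed", "sad"], lexicon)
```

Here only "disappointed" and "sad" are in the lexicon, so the parameter is the mean of 0.1 and 0.2.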
As one embodiment, the process of performing network information monitoring on the public network text data of the corresponding class according to the number of negative (passive) text data items may include:
If the number of negative text data items is greater than or equal to a set ratio of the total number of public network text data items in the corresponding class, generating warning information.
The set ratio can be determined according to the type of public network text data, for example a value such as 70%. When the number of negative text data items reaches or exceeds the set ratio of the total number of public network text data items in the corresponding class, that class may trigger a related public opinion crisis and an early warning is needed. After the warning information is generated, the relevant users can be notified through an alarm device, system notifications, online push services, SMS services, and/or mail services, so that they can take appropriate measures in time. Specifically, the monitored users can be set as warning notification recipients, the generated warning information is filled into a preset notification template, and the notification template carrying the warning information is sent to the warning notification recipients, ensuring that they obtain the warning information promptly and efficiently.
In this embodiment, warning information is generated when the number of negative text data items is greater than or equal to the set ratio of the total number of public network text data items in the corresponding class, so that the relevant users learn of it in time and can respond accordingly, preventing a public opinion crisis from occurring.
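The warning rule and the notification template can be sketched as follows. The function name, the template wording, and the data layout are illustrative assumptions; the default ratio of 0.7 mirrors the 70% example in the text. Delivery (alarm, push, SMS, mail) is left out.

```python
def generate_warnings(negative_counts, class_totals, set_ratio=0.7,
                      template="Warning: class {cls} has {neg} negative "
                               "texts out of {total}."):
    """Return one filled-in notification per class that meets the ratio.

    negative_counts: {class_id: number of negative texts}
    class_totals:    {class_id: total number of texts in the class}
    """
    warnings = []
    for cls, total in class_totals.items():
        neg = negative_counts.get(cls, 0)
        # Warn when negative texts are at least set_ratio of the class.
        if total > 0 and neg / total >= set_ratio:
            warnings.append(template.format(cls=cls, neg=neg, total=total))
    return warnings


messages = generate_warnings({"a": 8, "b": 3}, {"a": 10, "b": 10})
```

Only class "a" (8 of 10 negative) produces a message; class "b" (3 of 10) stays below the ratio.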
In one embodiment, before the process of crawling the public network text data in the first set time period using preset high-frequency words as keywords, the method further includes:
Collecting the public opinion events in a second set time period, obtaining the effective words whose number of occurrences in the public opinion events is greater than a frequency threshold, and determining the obtained effective words to be high-frequency words.
The second set time period can be a relatively long period, such as the previous month or the previous two months. The frequency threshold can be determined according to the characteristics and the total number of the public opinion events, for example a value such as 50 or 60. The text information included in the public opinion events in the second set time period can be collated, and statistical methods used to obtain the effective words whose number of occurrences in those events is greater than the frequency threshold. This identifies the high-frequency words of the public opinion events, which are then used to crawl the public network text data in the first set time period and ensure that the crawled data is valid.
This embodiment determines the high-frequency words from the public opinion events in the second set time period, so that the public network text data crawled with those high-frequency words as keywords is the public opinion data in the first set time period. Corresponding data processing can then be performed on that public opinion data to achieve public opinion monitoring and effectively prevent the outbreak of a related public opinion crisis.
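The high-frequency word selection can be sketched as a simple count over the collected event texts. This assumes the event texts are already segmented into tokens and that a stop-word set is available; the names and the small sample data are illustrative only.

```python
from collections import Counter


def high_frequency_words(event_tokens, stop_words, frequency_threshold):
    """Return effective words occurring more often than the threshold.

    event_tokens: list of token lists, one per public opinion event text
    stop_words:   set of words to ignore (effective words exclude these)
    """
    counts = Counter(
        tok for tokens in event_tokens for tok in tokens
        if tok not in stop_words
    )
    # Strictly greater than the threshold, as in the described step.
    return sorted(w for w, c in counts.items() if c > frequency_threshold)


events = [["the", "recall", "milk"], ["milk", "recall", "the"], ["milk", "the"]]
keywords = high_frequency_words(events, {"the"}, frequency_threshold=2)
```

With a threshold of 2, "milk" (3 occurrences) qualifies while "recall" (2 occurrences) does not.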
As one embodiment, after the public opinion events in the second set time period are collected, the data corresponding to the public opinion events can also be collated to summarize what the events have in common and to formulate public opinion early-warning rules. During the corresponding network information monitoring, a warning is issued whenever one of these rules is triggered.
The public network massive information monitoring method based on natural language processing technique provided by this embodiment relies on the powerful computing capability of computers and can output public opinion information automatically, efficiently, and continuously, with high accuracy and strong timeliness.
Referring to Fig. 2, which is a schematic structural diagram of a public network massive information monitoring system based on natural language processing technique according to one embodiment, the system includes:
A crawling module 10, configured to crawl the public network text data in the first set time period using preset high-frequency words as keywords;
An identification module 20, configured to perform word segmentation on each public network text data item, identify the effective words of the public network text data, determine the word weight of each effective word in each public network text data item according to the word weights recorded in an effective dictionary, and record a term vector that lists, in order, the word weights of the effective words in the public network text data; wherein the effective words are the words in the public network text data other than stop words, and the effective dictionary is a database that records the word weight corresponding to each word;
A monitoring module 30, configured to classify the public network text data according to the term vectors and to monitor each class of public network text data separately.
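The identification step can be illustrated with a short sketch that turns one segmented text into a term vector. Several points are assumptions made for illustration: the text is already segmented into tokens, the vector has one position per word of the effective dictionary in the dictionary's stored order, and positions for words absent from the text are filled with a default value of 0. The function name `term_vector` is hypothetical.

```python
def term_vector(tokens, effective_lexicon, stop_words, default=0.0):
    """Build a fixed-length vector of word weights for one text.

    effective_lexicon: {word: weight}; its insertion order fixes the
                       meaning of each vector position (assumed layout)
    default:           value used for lexicon words absent from the text
    """
    # Effective words are the tokens that are not stop words.
    present = {t for t in tokens if t not in stop_words}
    return [
        weight if word in present else default
        for word, weight in effective_lexicon.items()
    ]


weights = {"milk": 0.8, "recall": 0.6, "price": 0.3}
vector = term_vector(["the", "milk", "recall"], weights, {"the"})
```

Because every text is mapped onto the same positions, the resulting vectors can be compared directly with one another.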
In one embodiment, the term vector is an n-dimensional vector that records, in order, the word weights of the effective words in the corresponding public network text data together with a set value, where n is the number of words in the effective dictionary;
The monitoring module includes a calculation module:
The calculation module is configured to calculate the cosine value between every two term vectors and, when the cosine value is greater than a similarity threshold, to determine the two public network text data items corresponding to that cosine value to be one class of text data.
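The pairwise cosine comparison can be sketched as follows, assuming the term vectors are plain lists of equal length. The function names and the sample threshold of 0.9 are illustrative, and treating a zero vector as having cosine 0 with everything is an assumption, since that case is not addressed.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty vector matches nothing
    return dot / (norm_u * norm_v)


def similar_pairs(vectors, threshold):
    """Return index pairs whose cosine value exceeds the threshold."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) > threshold:
                pairs.append((i, j))
    return pairs


pairs = similar_pairs([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]], threshold=0.9)
```

Comparing every pair costs a number of cosine computations that grows quadratically with the number of texts, which may matter for large crawls.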
As one embodiment, the monitoring module includes a determination module:
The determination module is configured to determine multiple classes of text data that include the same public network text data to be one class of text data.
In one embodiment, the process of separately monitoring each class of public network text data includes:
Identifying the sentiment orientation parameter of each public network text data item; wherein the sentiment orientation parameter characterizes how positive the corresponding public network text data is;
Determining the public network text data whose sentiment orientation parameter is less than an emotion threshold to be negative (passive) text data;
Counting the negative text data items in each class of public network text data, and performing network information monitoring on the public network text data of the corresponding class according to the number of negative text data items.
As one embodiment, the process of identifying the sentiment orientation parameter of public network text data includes:
Extracting the feature emotion words in the public network text data, and determining the emotion deviation value of each feature emotion word according to the emotion deviation values recorded in an emotion dictionary; wherein the emotion dictionary is a database that records the emotion deviation value corresponding to each emotion word;
Calculating the average of the emotion deviation values corresponding to the public network text data, and determining the sentiment orientation parameter of the public network text data according to that average.
As one embodiment, the process of performing network information monitoring on the public network text data of the corresponding class according to the number of negative text data items includes:
If the number of negative text data items is greater than or equal to a set ratio of the total number of public network text data items in the corresponding class, generating warning information.
In one embodiment, the public network massive information monitoring system based on natural language processing technique further includes:
An acquisition module, configured to collect the public opinion events in a second set time period, obtain the effective words whose number of occurrences in the public opinion events is greater than a frequency threshold, and determine the obtained effective words to be high-frequency words.
Fig. 3 is a block diagram of a computer system 1000 in which embodiments of the present invention can be implemented. The computer system 1000 is only one example of a computer environment suitable for the present invention and should not be construed as imposing any limitation on the scope of use of the present invention. Nor should the computer system 1000 be interpreted as depending on or requiring any one or any combination of the components of the illustrated exemplary computer system 1000.
The computer system 1000 shown in Fig. 3 is one example of a computer system suitable for the present invention. Other architectures with different subsystem configurations can also be used. For example, well-known devices such as desktop computers and notebooks may be suitable for some embodiments of the present invention, although suitable devices are not limited to those listed.
As shown in Fig. 3, the computer system 1000 includes a processor 1010, a memory 1020, and a system bus 1022. Various system components, including the memory 1020 and the processor 1010, are connected to the system bus 1022. The processor 1010 is hardware that executes computer program instructions through the basic arithmetic and logical operations of the computer system. The memory 1020 is a physical device for temporarily or permanently storing computer programs or data (for example, program state information). The system bus 1022 can be any of several types of bus structure, including a memory bus or memory controller, a peripheral bus, and a local bus. The processor 1010 and the memory 1020 can exchange data through the system bus 1022. The memory 1020 includes read-only memory (ROM) or flash memory (neither shown in the figure) and random access memory (RAM); RAM typically refers to the main memory into which the operating system and application programs are loaded.
The computer system 1000 further includes a display interface 1030 (for example, a graphics processing unit), a display device 1040 (for example, a liquid crystal display), an audio interface 1050 (for example, a sound card), and an audio device 1060 (for example, a loudspeaker). The display device 1040 and the audio device 1060 can be used to present the related warning information.
The computer system 1000 generally includes a storage device 1070. The storage device 1070 can be selected from a variety of computer-readable media, meaning any available media that the computer system 1000 can access, both removable and fixed. For example, computer-readable media include, but are not limited to, flash memory (micro SD card), CD-ROM, digital versatile disc (DVD) or other optical disc storage, cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the required information and can be accessed by the computer system 1000.
The computer system 1000 further includes an input device 1080 and an input interface 1090 (for example, an I/O controller). A user can enter instructions and information into the computer system 1000 through the input device 1080, such as a keyboard, a mouse, or a touch panel on the display device 1040. The input device 1080 is usually connected to the system bus 1022 through the input interface 1090, but it can also be connected through other interfaces or bus structures, such as a universal serial bus (USB).
The computer system 1000 can be logically connected to one or more network devices in a network environment. A network device can be a personal computer, a server, a router, a tablet computer, or another common network node. The computer system 1000 is connected to the network devices through a local area network (LAN) interface 1100 or a mobile communication unit 1110. A local area network (LAN) is a computer network formed by interconnecting computers within a limited area, such as a home, a school, a computer laboratory, or an office building that uses network media. WiFi and twisted-pair Ethernet are the two most commonly used technologies for building a local area network. WiFi is a technology that enables computer systems 1000 to exchange data with one another or to connect to a wireless network by radio waves. The mobile communication unit 1110 can make and receive calls over a radio communication link while moving within a wide geographic area. In addition to calls, the mobile communication unit 1110 also supports Internet access in 2G, 3G, or 4G cellular communication systems that provide mobile data services.
It should be noted that other computer systems that include more or fewer subsystems than the computer system 1000 are also applicable to the invention. As described in detail above, the computer system 1000 suitable for the present invention can perform the specified operations of the public network massive information monitoring method based on natural language processing technique. The computer system 1000 performs these operations by having the processor 1010 run software instructions held in a computer-readable medium. These software instructions can be read into the memory 1020 from the storage device 1070 or from another device through the local area network interface 1100. The software instructions stored in the memory 1020 cause the processor 1010 to execute the above public network massive information monitoring method based on natural language processing technique. In addition, the present invention can equally be implemented by hardware circuits or by hardware circuits combined with software instructions. Therefore, implementation of the present invention is not limited to any specific combination of hardware circuits and software.
The public network massive information monitoring system based on natural language processing technique of the present invention corresponds one-to-one with the public network massive information monitoring method based on natural language processing technique of the present invention. The technical features and beneficial effects described in the embodiments of the method apply equally to the embodiments of the system.
Based on the examples described above, one embodiment further provides a computer device. The computer device includes a memory, a processor, and a computer program that is stored in the memory and can run on the processor, wherein the processor, when executing the program, implements the public network massive information monitoring method based on natural language processing technique according to any one of the above embodiments.
Through the computer program running on the processor, the above computer device effectively improves the monitoring of massive public network information based on natural language processing technique.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium; in the embodiments of the present invention, the program can be stored in the storage medium of a computer system and executed by at least one processor in that computer system to implement the processes of the embodiments of the above public network massive information monitoring method based on natural language processing technique. The storage medium can be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
Accordingly, one embodiment further provides a computer storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the public network massive information monitoring method based on natural language processing technique according to any one of the above embodiments.
Through the computer program it stores, the above computer storage medium can improve the efficiency and effect of monitoring massive public network information based on natural language processing technique.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not every possible combination of the technical features in the above embodiments is described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.
The above embodiments express only several implementations of the present invention. Although their description is relatively specific and detailed, they should not therefore be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A public network massive information monitoring method based on natural language processing technique, characterized by comprising:
Crawling the public network text data in a first set time period using preset high-frequency words as keywords;
Performing word segmentation on each public network text data item, identifying the effective words of the public network text data, determining the word weight of each effective word in each public network text data item according to the word weights recorded in an effective dictionary, and recording a term vector that lists, in order, the word weights of the effective words in the public network text data; wherein the effective words are the words in the public network text data other than stop words, and the effective dictionary is a database that records the word weight corresponding to each word;
Classifying the public network text data according to the term vectors, and monitoring each class of public network text data separately.
2. The public network massive information monitoring method based on natural language processing technique according to claim 1, characterized in that the term vector is an n-dimensional vector that records, in order, the word weights of the effective words in the corresponding public network text data together with a set value, where n is the number of words in the effective dictionary;
The process of classifying the public network text data according to the term vectors comprises:
Calculating the cosine value between every two term vectors and, when the cosine value is greater than a similarity threshold, determining the two public network text data items corresponding to that cosine value to be one class of text data.
3. The public network massive information monitoring method based on natural language processing technique according to claim 2, characterized in that, after the process of calculating the cosine value between every two term vectors and, when the cosine value is greater than the similarity threshold, determining the two public network text data items corresponding to that cosine value to be one class of text data, the method further comprises:
Determining multiple classes of text data that include the same public network text data to be one class of text data.
4. The public network massive information monitoring method based on natural language processing technique according to any one of claims 1 to 3, characterized in that the process of monitoring each class of public network text data separately comprises:
Identifying the sentiment orientation parameter of each public network text data item; wherein the sentiment orientation parameter characterizes how positive the corresponding public network text data is;
Determining the public network text data whose sentiment orientation parameter is less than an emotion threshold to be negative (passive) text data;
Counting the negative text data items in each class of public network text data, and performing network information monitoring on the public network text data of the corresponding class according to the number of negative text data items.
5. The public network massive information monitoring method based on natural language processing technique according to claim 4, characterized in that the process of identifying the sentiment orientation parameter of public network text data comprises:
Extracting the feature emotion words in the public network text data, and determining the emotion deviation value of each feature emotion word according to the emotion deviation values recorded in an emotion dictionary; wherein the emotion dictionary is a database that records the emotion deviation value corresponding to each emotion word;
Calculating the average of the emotion deviation values corresponding to the public network text data, and determining the sentiment orientation parameter of the public network text data according to that average.
6. The public network massive information monitoring method based on natural language processing technique according to claim 4, characterized in that the process of performing network information monitoring on the public network text data of the corresponding class according to the number of negative text data items comprises:
If the number of negative text data items is greater than or equal to a set ratio of the total number of public network text data items in the corresponding class, generating warning information.
7. The public network massive information monitoring method based on natural language processing technique according to any one of claims 1 to 3, characterized in that, before the process of crawling the public network text data in the first set time period using preset high-frequency words as keywords, the method further comprises:
Collecting the public opinion events in a second set time period, obtaining the effective words whose number of occurrences in the public opinion events is greater than a frequency threshold, and determining the obtained effective words to be high-frequency words.
8. A public network massive information monitoring system based on natural language processing technique, characterized by comprising:
A crawling module, configured to crawl the public network text data in a first set time period using preset high-frequency words as keywords;
An identification module, configured to perform word segmentation on each public network text data item, identify the effective words of the public network text data, determine the word weight of each effective word in each public network text data item according to the word weights recorded in an effective dictionary, and record a term vector that lists, in order, the word weights of the effective words in the public network text data; wherein the effective words are the words in the public network text data other than stop words, and the effective dictionary is a database that records the word weight corresponding to each word;
A monitoring module, configured to classify the public network text data according to the term vectors and to monitor each class of public network text data separately.
9. A computer device, comprising a memory, a processor, and a computer program that is stored in the memory and can run on the processor, characterized in that the processor, when executing the computer program, implements the public network massive information monitoring method based on natural language processing technique according to any one of claims 1 to 7.
10. A computer storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the public network massive information monitoring method based on natural language processing technique according to any one of claims 1 to 7.
CN201811067750.5A 2018-09-13 2018-09-13 Public network massive information monitoring method and system based on natural language processing technique Pending CN109359233A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811067750.5A CN109359233A (en) 2018-09-13 2018-09-13 Public network massive information monitoring method and system based on natural language processing technique

Publications (1)

Publication Number Publication Date
CN109359233A true CN109359233A (en) 2019-02-19

Family

ID=65350660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811067750.5A Pending CN109359233A (en) 2018-09-13 2018-09-13 Public network massive information monitoring method and system based on natural language processing technique

Country Status (1)

Country Link
CN (1) CN109359233A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751458A (en) * 2009-12-31 2010-06-23 暨南大学 Network public sentiment monitoring system and method
US20110060733A1 (en) * 2009-09-04 2011-03-10 Alibaba Group Holding Limited Information retrieval based on semantic patterns of queries
CN106599065A (en) * 2016-11-16 2017-04-26 北京化工大学 Food safety online public opinion early warning system based on Storm distributed framework
CN107832344A (en) * 2017-10-16 2018-03-23 广州大学 A kind of food security Internet public opinion analysis method based on storm stream calculation frameworks

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840300A (en) * 2019-03-04 2019-06-04 深信服科技股份有限公司 Internet public opinion analysis method, apparatus, equipment and computer readable storage medium
CN112686035A (en) * 2019-10-18 2021-04-20 北京沃东天骏信息技术有限公司 Method and device for vectorizing unknown words
CN112256974A (en) * 2020-11-13 2021-01-22 泰康保险集团股份有限公司 Public opinion information processing method and device
CN112256974B (en) * 2020-11-13 2023-11-17 泰康保险集团股份有限公司 Public opinion information processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190219
