CN104809253B

CN104809253B - Internet data analysis system

Info

Publication number: CN104809253B
Application number: CN201510257964.9A
Authority: CN
Inventors: 张鹏
Original assignee: BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Current assignee: Guangzhou Kunchuan Network Technology Co ltd
Priority date: 2015-05-20
Filing date: 2015-05-20
Publication date: 2017-12-08
Anticipated expiration: 2035-05-20
Also published as: CN104809253A

Abstract

The invention provides an internet data analysis system. The system includes: a correlation calculation module, which takes a randomly selected text to be identified as the observation sequence and the remaining texts to be identified as the state sequence, and calculates the correlation probability between the selected text and each remaining text; and a classification and identification module, which merges the most highly correlated text in the state sequence with the selected text to form a first type, takes the least correlated text as a second type, and then iterates with the first and second types as the new state sequence and the remaining texts to be identified as the new observation sequence, so as to identify sensitive vocabulary. The invention proposes an information monitoring and analysis system that performs multidimensional monitoring of internet public opinion and effectively collects and analyzes sensitive information, improving precision and recall.

Description

Internet data analysis system

Technical field

The present invention relates to data collection and analysis, and more particularly to an internet data analysis system.

Background

Compared with the conventional internet, today's internet has changed greatly. Portable mobile devices are constantly being updated, people are no longer tied to wired access from desktop computers, and mobile devices offer an ever-growing range of functions, of which taking photos and shooting video are the most basic. People can record events around them with a mobile device and upload the result directly to the internet, where it spreads extremely quickly. Without reasonable monitoring, invalid information may appear, mislead the public's judgment, and steer public opinion in the wrong direction. In public opinion monitoring, data acquisition is particularly important. Because the volume of acquired data is very large, useful data must be extracted by technical means within a limited time. Existing information monitoring systems, however, use only a single acquisition mode and can serve only one specific kind of information analysis; they cannot meet the need for diversified information analysis on today's internet.

Summary of the invention

To solve the above problems of the prior art, the present invention proposes an internet data analysis system, including:

A correlation calculation module, for taking a randomly selected text to be identified as the observation sequence and the remaining texts to be identified as the state sequence, and calculating the correlation probability between the selected text and each remaining text;

A classification and identification module, for merging the most highly correlated text in the state sequence with the selected text to form a first type, and taking the least correlated text as a second type; and for iterating with the first and second types as the new state sequence and the remaining texts to be identified as the new observation sequence, so as to identify sensitive vocabulary.

Preferably, the correlation calculation module further comprises:

A text representation module, for representing each text to be identified with a vector space model. Every text to be identified is expressed as T_n = {t₁, w₁; t₂, w₂; …; t_i, w_i}, where each feature word t_i appears in both the text to be identified and the dictionary; its weight w_i is calculated and a sensitivity coefficient β_i is introduced for the weight:

where tf_ni is the frequency with which keyword t_i appears in the n-th document, K is the total number of documents, and k_i is the number of documents containing keyword t_i. The sensitivity coefficient β_i is expressed as:

$$\beta_i = -P(C_m)\log P(C_m) + P(t_i \mid C_m)\log P(t_i \mid C_m) + P(t'_i \mid C_m)\log P(t'_i \mid C_m)$$

where P(C_m) is the number of texts belonging to the m-th class of sensitive vocabulary; P(t_i|C_m) is the number of texts belonging to the m-th class and containing keyword t_i; and P(t'_i|C_m) is the number of texts belonging to the m-th class but not containing keyword t_i.

Preferably, the correlation calculation module is further used for:

Taking y₁, y₂, ..., y_n as the features of a sensitive vocabulary type, y = {y₁, y₂, ..., y_n} is a sensitive vocabulary type represented with the vector space model; taking x₁, x₂, ..., x_n as the features of a text to be identified, x = {x₁, x₂, ..., x_n} is a text to be identified represented with the vector space model. Given the parameter set Λ = {λ₁, ..., λ_j}, the conditional probability of state y for the observation sequence x is:

where f_j is a feature function; λ_j is the weight of the feature function, obtained by training; Z(x) is the normalization coefficient, and:

Preferably, the classification and identification module is further configured to:

Randomly choose one of the K texts to be identified as the observation input sequence s and take the remaining K-1 texts to be identified as K-1 output class state sequences; calculate the probability between the document in the input sequence and each document in the output sequence; and iterate until the types of all sensitive vocabulary are identified:

A) Sort the K-1 probability values obtained; merge the text corresponding to the maximum probability with the text in the input observation sequence into one class, denoted C₁, and denote the text corresponding to the minimum probability as class C₂;

B) Take the remaining K-3 texts to be identified as the input observation sequence and C₁ and C₂ as the output class state sequences, obtaining for each text to be identified two probability values of membership in C₁ and C₂;

C) For each text to be identified, compute the variance of its probability values over the output class state sequences, and sort the variances;

D) Examine all probability values of the text with the minimum variance; if its smallest probability value is below a threshold θ, make it a new class C₃; otherwise examine the text with the second-smallest variance, and so on, until a text with a probability value below θ is found; at the same time, merge the text with the maximum variance into the class for which its probability is highest;

E) Repeat steps b) to d) until all texts are classified.

Compared with the prior art, the present invention has the following advantages:

The present invention proposes an information monitoring and analysis system that performs multidimensional monitoring of internet public opinion and effectively collects and analyzes sensitive information, improving precision and recall.

Brief description of the drawings

Fig. 1 is a module diagram of an internet data analysis system according to an embodiment of the present invention.

Detailed description

A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practiced according to the claims without some or all of these details.

One aspect of the present invention provides an information monitoring and analysis system. Fig. 1 is a module diagram of the information monitoring and analysis system according to an embodiment of the present invention.

The present invention combines multiple modes of information collection and applies them to information monitoring. For content monitoring, the invention also needs to audit sensitive information in the content. Large websites on the internet recommend many trending terms, and these are very likely the key vocabulary, that is, the keywords, that the invention is concerned with, so this data must also be collected promptly. As for monitoring targets, the objects selected for information monitoring are portal websites on the internet, with real-time pushing of information for a given region. According to configured rules, the system periodically crawls the information of interest and, using various analysis means, prompts the user to audit the matching data.

The information monitoring system is divided into four layers, which from bottom to top are the basic data layer, the data processing layer, the monitoring service layer, and the presentation layer.

The basic data layer provides: database management, which requires a reasonable storage plan for the collected data; distributed computing, which implements object references between subsystems and across the different nodes within each subsystem; and system maintenance, which allows parameters to be configured for each subsystem, the running state of each system component to be monitored, and users and their permissions to be managed.

The data processing layer provides: data acquisition, which can incrementally crawl the web content and audio and video content of key websites; data storage, which can manage external storage systems and perform data migration, backup, and cleanup; and data management, which can manage (query, modify, delete, add) the basic information of monitored objects such as websites, web content, and online audio and video content, as well as the information obtained by further analysis. It also supports manually importing online audio and video content.

The monitoring service layer provides: content analysis, which analyzes the collected text, audio, and video data, extracts features, builds data indexes, identifies harmful information, and tracks hot topics and sensitive vocabulary; information collection, which acquires content based on keywords, sample pictures, sample audio, and sample video; and information statistics, which classifies and counts the collected audio and video websites, web content, and harmful information according to the needs of the monitoring service.

The presentation layer provides a friendly operation interface for each management function, displays the results of information collection, information statistics, and harmful-information identification and analysis, and supports operations such as system maintenance.

The system interface provides unified services to related systems, making it easy to integrate other systems of the monitoring service and improving the integration and extensibility of each service system.

During information collection, the video acquisition module can acquire content from online video using keywords submitted by service personnel, returning the video files that contain the specified keywords together with the time positions within those files. Using key frames submitted by service personnel, it can likewise return the video files that contain the specified key frames together with the time positions within those files. Using a specific video clip submitted by service personnel, it can search the local video database for online video content containing a similar or identical clip, and then find how that content is distributed online. The clips found match the content of the query sample but may differ in form. Through a web interface, service personnel can view the summary and key-frame panorama of the resulting videos; a key frame can be used to position playback, and the resulting video can be watched and saved. Online video content in which the clip appears can be opened directly from the large video screen.

The audio collection module builds a content index for internet voice and audio files and supports user collection of audio with specific content. Collecting audio with specific content (in other words, sensitive information) enables monitoring of online audio. Such content can take several forms: a specific keyword, a specific speaker, or a specific audio clip.

When the user submits keyword text, the system returns the internet audio files containing the specified keyword, with the time positions within the files. When the user submits a speech sample of a specific speaker, the system returns the internet audio files containing that speaker, with the time positions within the files. When the user submits a specific audio clip, the system returns the internet audio files containing that clip, with the time positions within the files.

The text collection module includes a topic collection and identification unit, a topic trend analysis unit, and a keyword filtering and matching unit, where:

The topic collection and identification unit collects, as required by the monitoring service, traffic and visit statistics and ranking positions for specified websites, automatically obtaining the relevant data from web content published by third parties and from channels such as portal websites, search engines, and large websites. Given a specified website name, it can collect the web content in which a third party publishes ranking information and automatically parse out the ranking data and other data needed by the monitoring service.

The topic trend analysis unit uses a method based on the statistical tendency values of sentiment words: it computes a weighted sum of the semantic tendency values of all sentiment words associated with the keywords in comments, and classifies topic sentiment by comparing and analyzing the tendency vectors of user topics.

The keyword filtering and matching unit uses keyword matching to detect whether web content contains harmful content and to filter it. Keywords are configured according to the needs of the monitoring service and can be combined with AND, OR, and NOT conditions; a validity period can also be configured to reflect the time sensitivity of a keyword.
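The combined AND/OR/NOT conditions and the validity period described for the keyword filtering and matching unit can be sketched as follows. This is only an illustration under stated assumptions: the `KeywordRule` class, its field names, and plain substring matching are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class KeywordRule:
    """Hypothetical rule: all `all_of` terms, at least one `any_of` term, no `none_of` term."""
    all_of: tuple = ()
    any_of: tuple = ()
    none_of: tuple = ()
    valid_from: date = date.min
    valid_until: date = date.max

    def matches(self, text: str, today: date) -> bool:
        # A rule outside its validity period never fires.
        if not (self.valid_from <= today <= self.valid_until):
            return False
        # AND condition: every required term must be present.
        if not all(term in text for term in self.all_of):
            return False
        # OR condition: if alternatives are given, at least one must be present.
        if self.any_of and not any(term in text for term in self.any_of):
            return False
        # NOT condition: no excluded term may be present.
        return not any(term in text for term in self.none_of)
```

A rule such as `KeywordRule(all_of=("leak",), any_of=("password", "token"), none_of=("drill",))` would then flag a page mentioning a password leak but skip one describing a leak drill.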

According to a further aspect of the present invention, the topic collection and identification unit includes the following:

The present invention constructs a sensitive vocabulary identification model. Each text to be identified is represented with a vector space model with reference to a dictionary, a series of probability values is calculated, and these probability values are used to identify sensitive vocabulary.

With reference to the dictionary, every text to be identified on the network can be expressed with the vector space model as T_n = {t₁, w₁; t₂, w₂; …; t_i, w_i}, where each feature word t_i must appear in both the text to be identified and the dictionary. Its weight w_i is calculated and a sensitivity coefficient β_i is introduced for the weight:

where tf_ni is the frequency with which keyword t_i appears in the n-th document, K is the total number of documents, and k_i is the number of documents containing keyword t_i. The sensitivity coefficient β_i is expressed as an information gain:

$$\beta_i = -P(C_m)\log P(C_m) + P(t_i \mid C_m)\log P(t_i \mid C_m) + P(t'_i \mid C_m)\log P(t'_i \mid C_m)$$
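The weight formula itself is not reproduced in this text, so the sketch below is only one plausible reading: it assumes a standard TF-IDF weight built from tf_ni, K, and k_i and scaled by β_i, and it assumes the P(·) terms are proportions between 0 and 1. The function names are hypothetical.

```python
import math


def plogp(p: float) -> float:
    """Return p * log(p), with the usual convention that 0 * log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0


def sensitivity_coefficient(p_class: float, p_term: float, p_no_term: float) -> float:
    """beta_i as written in the description, from P(C_m), P(t_i|C_m), P(t'_i|C_m)."""
    return -plogp(p_class) + plogp(p_term) + plogp(p_no_term)


def term_weight(tf: float, total_docs: int, docs_with_term: int, beta: float) -> float:
    # Assumption: TF-IDF (tf_ni * log(K / k_i)) scaled by the sensitivity coefficient.
    return tf * math.log(total_docs / docs_with_term) * beta
```

Under this reading, a keyword that is frequent in one document, rare across the K documents, and has a large sensitivity coefficient receives the largest weight.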

Taking y₁, y₂, ..., y_n as the features of a sensitive vocabulary type, y = {y₁, y₂, ..., y_n} is a sensitive vocabulary type represented with the vector space model; taking x₁, x₂, ..., x_n as the features of a text to be identified, x = {x₁, x₂, ..., x_n} is a text to be identified represented with the vector space model. Given the parameter set Λ = {λ₁, ..., λ_j}, the conditional probability of state y for the observation sequence x is given by the following formula.

where f_j is a feature function, a unified representation of the transition feature functions and the state feature functions; λ_j is the weight of the feature function, obtained by training; Z(x) is the normalization coefficient, and:
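The two formulas referred to here did not survive in this text. The definitions given (transition and state feature functions f_j, trained weights λ_j, a normalization term Z(x)) match the usual linear-chain conditional random field, so a form consistent with them, offered as an assumption rather than as the patent's exact expression, would be:

```latex
P(y \mid x, \Lambda) = \frac{1}{Z(x)} \exp\Big( \sum_{i} \sum_{j} \lambda_j \, f_j(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{i} \sum_{j} \lambda_j \, f_j(y'_{i-1}, y'_i, x, i) \Big)
```

Here i runs over positions in the sequence and y' over all candidate state sequences, so that the probabilities sum to one.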

Sensitive vocabulary identification first represents the texts to be identified as the observation input sequence and the output class state sequences of the model. One of the K texts to be identified is chosen at random as the observation input sequence s, and the remaining K-1 texts serve as K-1 output class state sequences. The probability between the document in the input sequence and each document in the output sequence is then calculated, and the subsequent steps iterate in a similar way until the types of all sensitive vocabulary are identified. Specifically:

A) Sort the K-1 probability values obtained; merge the text corresponding to the maximum probability with the text in the input observation sequence into one class, denoted C₁, and denote the text corresponding to the minimum probability as class C₂.

B) Take the remaining K-3 texts to be identified as the input observation sequence and C₁ and C₂ as the output class state sequences, obtaining for each text to be identified two probability values of membership in C₁ and C₂.

C) For each text to be identified, compute the variance of its probability values over the output class state sequences, and sort the variances. The larger the variance, the more sharply the text discriminates between types.

D) Examine all probability values of the text with the minimum variance; if its smallest probability value is below a threshold θ, make it a new class C₃; otherwise examine the text with the second-smallest variance, and so on, until a text with a probability value below θ is found. At the same time, merge the text with the maximum variance into the class for which its probability is highest.

E) Repeat steps b) to d) until all texts are classified.
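Steps a) to e) can be sketched as follows. This is a simplified illustration, not the patented implementation: bag-of-words cosine similarity stands in for the model's conditional probability, the class score is taken as the mean similarity to the class members, and the names `cosine`, `membership`, and `cluster` are hypothetical.

```python
import math
import random
from collections import Counter
from statistics import pvariance


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for the model probability)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def membership(text: str, members: list) -> float:
    """Assumed class score: mean similarity between a text and the class members."""
    return sum(cosine(text, m) for m in members) / len(members)


def cluster(texts: list, theta: float, seed: int = 0) -> list:
    if len(texts) < 3:  # simplification: too few texts to form C1 and C2
        return [[t] for t in texts]
    rng = random.Random(seed)
    pool = list(range(len(texts)))
    first = pool.pop(rng.randrange(len(pool)))
    # Step a): rank the other K-1 texts against the randomly chosen one.
    ranked = sorted(pool, key=lambda i: cosine(texts[first], texts[i]))
    classes = [[first, ranked[-1]], [ranked[0]]]  # C1 and C2
    pool = ranked[1:-1]                           # the remaining K-3 texts
    while pool:
        # Step b): score every remaining text against every current class.
        probs = {i: [membership(texts[i], [texts[m] for m in c]) for c in classes] for i in pool}
        # Step c): sort the texts by the variance of their class scores.
        by_variance = sorted(pool, key=lambda i: pvariance(probs[i]))
        # Step d): scanning from the lowest variance, the first text whose
        # smallest score is below theta starts a new class.
        outlier = next((i for i in by_variance if min(probs[i]) < theta), None)
        clearest = by_variance[-1]
        if outlier is not None:
            classes.append([outlier])
            pool.remove(outlier)
        # The highest-variance text joins the class where it scores highest.
        if clearest != outlier:
            scores = probs[clearest]
            best = max(range(len(scores)), key=scores.__getitem__)
            classes[best].append(clearest)
            pool.remove(clearest)
    # Step e) is the loop itself: it repeats until every text is classified.
    return [[texts[i] for i in c] for c in classes]
```

Each pass removes at least one text from the pool, so the loop terminates. With θ at or below 0 no new class is ever opened and everything ends up in C₁ or C₂, while a large θ opens a new class on every pass.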

The threshold θ controls whether a new type needs to be added. The larger θ is, the less distinct the differences between types become and the more types are obtained, so texts belonging to one type may be wrongly split apart; the smaller θ is, the fewer types are obtained, so texts of different types may be wrongly merged into one type. θ therefore needs to be estimated from how the distance between types varies with θ.

According to another aspect of the present invention, the topic trend analysis unit builds a network of semantic tendency values between emoticons and words, and then uses the similarity between the semantic tendency value vectors of words and emoticons to compute the semantic tendency strength of each word, thereby identifying the semantic tendency of network words. The topic trend analysis unit is divided into two modules: data preprocessing and lexical semantic tendency identification.

The data preprocessing module first screens network texts using emoticons with a clearly positive or negative tendency, and then extracts a candidate word set from the screened texts.

The semantic tendency identification module first uses the word co-occurrence value model to build a lexical semantic tendency value network over the candidate word set obtained by preprocessing. It then selects, as candidate words, the emoticons in the sentiment set whose frequency in the candidate word set exceeds a preset value, expands low-frequency words using a synonym thesaurus, and extracts sentiment words. Finally, it uses the candidate words and the constructed semantic tendency value network to compute the semantic tendency strength of each word, completing lexical semantic tendency identification.

After word segmentation and part-of-speech tagging, each word is represented as a pair (word, freq), giving the candidate set W = {w₁, w₂, ..., w_N}, where N is the total number of candidate words.

The word co-occurrence value reflects the degree of ordered co-occurrence of two words within their global neighborhood, and represents the weight with which the occurrence of one word activates the other. For given words i and j, the co-occurrence value of word i with respect to word j is defined as:

$$waf_{ij} = \frac{(f_{ij}/f_i)\,(f_{ij}/f_j)}{d_{ij}^{2}}$$

where f_i and f_j are the frequencies with which the two words appear in the document; f_ij is the frequency with which words i and j appear together within the set co-occurrence window; and d_ij is the average co-occurrence distance of the two words. By definition, waf_ij lies in the interval [0, 1]: 0 means that word j never appears within d_ij words after word i in the document, and 1 means that word j always appears immediately after word i. Using this definition, a document can be represented as a word co-occurrence value matrix WAF.
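The definition of waf_ij can be sketched as follows. How co-occurrences are counted is an assumption here: every occurrence of word j within `window` positions after an occurrence of word i counts once toward f_ij, and d_ij is the mean of those distances. The function name `waf_matrix` and the default window size are hypothetical.

```python
from collections import Counter


def waf_matrix(tokens: list, window: int = 5) -> dict:
    """Return {(i, j): waf_ij} for ordered word pairs seen within the window."""
    freq = Counter(tokens)   # f_i
    pair_count = Counter()   # f_ij: j seen within `window` words after i
    pair_dist = Counter()    # summed distances, used to average d_ij
    for pos, wi in enumerate(tokens):
        for d in range(1, window + 1):
            if pos + d >= len(tokens):
                break
            wj = tokens[pos + d]
            if wj == wi:
                continue
            pair_count[wi, wj] += 1
            pair_dist[wi, wj] += d
    waf = {}
    for (wi, wj), f_ij in pair_count.items():
        d_ij = pair_dist[wi, wj] / f_ij  # average co-occurrence distance
        waf[wi, wj] = (f_ij / freq[wi]) * (f_ij / freq[wj]) / d_ij ** 2
    return waf
```

Because pairs are ordered, `waf[("a", "b")]` and `waf[("b", "a")]` generally differ, which is the asymmetry the description attributes to the WAF matrix. With a wide window this simple counting can give f_ij larger than f_i or f_j, so a production version would need a stricter counting rule to stay within [0, 1].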

Because waf_ij is a directed value, the word co-occurrence value matrix is asymmetric. The element waf_ij indicates that word i activates word j with weight waf_ij. The lexical semantic tendency value between words is calculated on this directed word network and is defined as follows:

$$A_{ij} = \Big(\frac{1}{|K_{ij}|}\sum_{k \in K_{ij}} OR(waf_{ki}, waf_{kj})\Big)^{1/2} \cdot \Big(\frac{1}{|L_{ij}|}\sum_{l \in L_{ij}} OR(waf_{il}, waf_{jl})\Big)^{1/2}$$

where K_ij = {k | waf_ki > 0 or waf_kj > 0} is the set of other words that have a co-occurrence value toward word i or word j; L_ij = {l | waf_il > 0 or waf_jl > 0} is the set of other words toward which word i or word j has a co-occurrence value; and OR(x, y) = min(x, y)/max(x, y) is the overlap rate. The lexical semantic tendency value A_ij is the geometric mean of the overlap rates of all co-occurrence values of words i and j in the word co-occurrence value matrix, and reflects how close the two words are across the whole document.
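A small sketch of A_ij under the reading above, in which each factor is the average overlap rate over the set K_ij or L_ij. The averaging, the exclusion of words i and j themselves from both sets, and the name `semantic_tendency` are assumptions made for illustration.

```python
import math


def overlap(x: float, y: float) -> float:
    """OR(x, y) = min(x, y) / max(x, y), taken as 0 when both values are 0."""
    hi = max(x, y)
    return min(x, y) / hi if hi > 0 else 0.0


def semantic_tendency(waf: dict, i: str, j: str, vocab: list) -> float:
    """A_ij from a sparse co-occurrence map {(source, target): waf value}."""
    def value(a, b):
        return waf.get((a, b), 0.0)

    others = [w for w in vocab if w not in (i, j)]
    # K_ij: words that activate i or j (incoming co-occurrence values).
    incoming = [overlap(value(k, i), value(k, j))
                for k in others if value(k, i) > 0 or value(k, j) > 0]
    # L_ij: words activated by i or j (outgoing co-occurrence values).
    outgoing = [overlap(value(i, l), value(j, l))
                for l in others if value(i, l) > 0 or value(j, l) > 0]
    if not incoming or not outgoing:
        return 0.0
    return math.sqrt(sum(incoming) / len(incoming)) * math.sqrt(sum(outgoing) / len(outgoing))
```

Swapping i and j leaves the result unchanged, which agrees with the statement that the semantic tendency value matrix is symmetric.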

Likewise it is possible to by document representation it is semantic to tend to value matrix with lexical semantic trend value Aij.Lexical semantic tends to Value matrix is a undirected symmetrical matrix, wherein the i-th row represents the semantic trend value of other all words and vocabulary i.In vocabulary , can be using vocabulary as node in semantic tendency identification process, the semantic trend value between each node builds semantic tend to as side It is worth network, the more strong then node semantics tendency of semantic trend value is more close.If node set is W={ w₁, w₂..., w_N, node< w_i, w_j>Between semantic trend value be A_ij。

Before mood word extraction, it is necessary first to carry out the selection of candidate word, two methods can be used：One kind is to select word The frequency highest and obvious one group of word of tendency is as candidate word；Another kind be selected based on dictionary resources be inclined in dictionary it is most obvious One group of word is as candidate word.The present invention chooses the emotag conduct of the positive and negative tendency of frequency of occurrence highest in a document in network Candidate word.

Occupied the majority by low-frequency word in pretreated document, low-frequency word and candidate's Term co-occurrence number are less, and the present invention draws Enter synonym clump, when mood word extracts, low-frequency word be extended using synonym clump, at the same consider low-frequency word and its Semantic trend value between synset and candidate word is extracted to complete mood word.

The positive and negative tendency strengths of a word are measured by the similarity between its semantic tendency value vector and those of the positive and negative candidate words, from which the semantic tendency strength of the word is obtained. If the sentiment word set OPW contains N' words, the semantic tendency strength of word c_j (c_j ∈ OPW, j ∈ [1, 2, ..., N']) can be expressed as

$$SO_j = SO^{+}_j - \beta \cdot SO^{-}_j$$

where SO⁺_j and SO⁻_j are the semantic tendency similarities of word c_j to the positive and negative candidate word sets respectively, and β is the ratio of the total SO⁺_j to the total SO⁻_j, that is, the ratio of positive to negative tendency strength in the document.

SO⁺_j can be calculated by the following formula:

where v_cj is the semantic tendency value vector of word c_j, and v_t←p_i is the row vector corresponding to positive candidate word p_i in the semantic tendency value matrix, t being the row corresponding to p_i in that matrix.

Similarly, SO⁻_j can be calculated in the same way. Substituting SO⁺_j and SO⁻_j gives SO_j. The word is classified as positive when SO_j > γ_p and negative when SO_j < γ_n; otherwise it is classified as neutral. γ_p and γ_n are the decision thresholds for positive and negative words respectively.
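The scoring and thresholding can be sketched as follows. The formula for SO⁺_j is not reproduced in this text, so the sketch assumes it is the mean cosine similarity between the word's row vector and the row vectors of the candidate words; that choice and all function names are hypothetical.

```python
import math


def vector_cosine(u: list, v: list) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def polarity_similarity(word_vec: list, candidate_vecs: list) -> float:
    # Assumption: SO+ (or SO-) is the mean cosine similarity to the candidate rows.
    return sum(vector_cosine(word_vec, c) for c in candidate_vecs) / len(candidate_vecs)


def classify(so_pos: float, so_neg: float, beta: float,
             gamma_p: float, gamma_n: float) -> str:
    """Apply SO_j = SO+_j - beta * SO-_j and the two decision thresholds."""
    so = so_pos - beta * so_neg
    if so > gamma_p:
        return "positive"
    if so < gamma_n:
        return "negative"
    return "neutral"
```

For example, with β = 1, γ_p = 0.2, and γ_n = -0.2, a word with SO⁺ = 0.9 and SO⁻ = 0.1 is labeled positive, while one with equal similarities is labeled neutral.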

In summary, the present invention proposes an information monitoring and analysis system that performs multidimensional monitoring of internet public opinion and effectively collects and analyzes sensitive information, improving precision and recall.

Obviously, those skilled in the art should understand that each module or step of the invention described above can be implemented with a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The invention is thus not restricted to any specific combination of hardware and software.

It should be understood that the above embodiments of the invention are intended only to illustrate or explain the principles of the invention and do not limit it. Any modification, equivalent substitution, or improvement made without departing from the spirit and scope of the invention shall therefore fall within its scope of protection. In addition, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims or the equivalents of such scope and boundaries.

Claims

1. An internet data analysis system, characterized by comprising:

A correlation calculation module, for taking a randomly selected text to be identified as the observation sequence and the remaining texts to be identified as the state sequence, and calculating the correlation probability between the selected text and each remaining text;

A classification and identification module, for merging the most highly correlated text in the state sequence with the selected text to form a first type, and taking the least correlated text as a second type; and for iterating with the first and second types as the new state sequence and the remaining texts to be identified as the new observation sequence, so as to identify sensitive vocabulary;

The correlation calculation module is further used for:

Taking y₁, y₂, ..., y_n as the features of a sensitive vocabulary type, y = {y₁, y₂, ..., y_n} is a sensitive vocabulary type represented with the vector space model; taking x₁, x₂, ..., x_n as the features of a text to be identified, x = {x₁, x₂, ..., x_n} is a text to be identified represented with the vector space model. Given the parameter set Λ = {λ₁, ..., λ_j}, the conditional probability of state y for the observation sequence x is:

where f_j is a feature function; λ_j is the weight of the feature function, obtained by training; Z(x) is the normalization coefficient; n is the dimension of the sensitive vocabulary type features and of the features of the text to be identified, and:

The classification and identification module is further configured to:

Randomly choose one of the K texts to be identified as the observation input sequence s and take the remaining K-1 texts to be identified as K-1 output class state sequences; calculate the probability between the document in the input sequence and each document in the output sequence; and iterate until the types of all sensitive vocabulary are identified:

A) Sort the K-1 probability values obtained; merge the text corresponding to the maximum probability with the text in the input observation sequence into one class, denoted C₁, and denote the text corresponding to the minimum probability as class C₂;

B) Take the remaining K-3 texts to be identified as the input observation sequence and C₁ and C₂ as the output class state sequences, obtaining for each text to be identified two probability values of membership in C₁ and C₂;

C) For each text to be identified, compute the variance of its probability values over the output class state sequences, and sort the variances;

D) Examine all probability values of the text with the minimum variance; if its smallest probability value is below a threshold θ, make it a new class C₃; otherwise examine the text with the second-smallest variance, and so on, until a text with a probability value below θ is found; at the same time, merge the text with the maximum variance into the class for which its probability is highest;

E) Repeat steps b) to d) until all texts are classified.