CN104809253B - Internet data analysis system - Google Patents

Info

Publication number
CN104809253B
Authority
CN
China
Prior art keywords
text
identified
value
vocabulary
correlation
Prior art date
Legal status
Active
Application number
CN201510257964.9A
Other languages
Chinese (zh)
Other versions
CN104809253A (en)
Inventor
张鹏
Current Assignee
Guangzhou Kunchuan Network Technology Co ltd
Original Assignee
BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority date
Filing date
Publication date
Application filed by BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority to CN201510257964.9A
Publication of CN104809253A
Application granted
Publication of CN104809253B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an internet data analysis system. The system includes a correlation calculation module, which takes a randomly selected text to be identified and the remaining texts to be identified as the observation sequence and the state sequence, respectively, and calculates the correlation probability values between the selected text and the remaining texts; and a classification and identification module, which merges the text in the state sequence that has the highest correlation with the selected text, designates the result as a first type, and takes the text with the lowest correlation as a second type. The first and second types then serve as the new state sequence, the remaining texts to be identified serve as the new observation sequence, and the process iterates to identify sensitive vocabulary. The invention proposes an information monitoring and analysis system that performs multidimensional monitoring of internet public opinion, collects and analyzes sensitive information effectively, and improves precision and recall.

Description

Internet data analysis system
Technical field
The present invention relates to data collection and analysis, and in particular to an internet data analysis system.
Background
Compared with the traditional internet, today's internet has changed greatly. Portable mobile devices are constantly renewed, people have moved away from the wired access model of the old desktop computer, and mobile device functions keep multiplying, with taking photos and shooting video now among the most basic. People can record what happens around them with a mobile device and upload it directly to the internet, where information spreads extremely fast. Without reasonable monitoring, invalid information may appear, mislead the public's judgment, and steer public opinion in the wrong direction. In public opinion monitoring, data acquisition is particularly important. Because the volume of acquired data is very large, useful data must be extracted by technical means within a limited time. Existing information monitoring systems, however, use only a single acquisition mode and can serve only one specific kind of information analysis, so they cannot meet the need for diversified information analysis on today's internet.
Summary of the invention
To solve the above problems in the prior art, the present invention proposes an internet data analysis system, including:
A correlation calculation module, which takes a randomly selected text to be identified and the remaining texts to be identified as the observation sequence and the state sequence, respectively, and calculates the correlation probability values between the selected text and the remaining texts;
A classification and identification module, which merges the text in the state sequence that has the highest correlation with the selected text, designates the result as a first type, and takes the text with the lowest correlation as a second type; the first and second types then serve as the new state sequence, the remaining texts to be identified serve as the new observation sequence, and the process iterates to identify sensitive vocabulary.
Preferably, the correlation calculation module further comprises:
A text representation module, for representing each text to be identified with the vector space model, wherein every text to be identified is expressed as $T_n = \{t_1, w_1;\ t_2, w_2;\ \dots;\ t_i, w_i\}$, each feature word $t_i$ appears in both the text to be identified and the dictionary, and its weight $w_i$ is calculated with a sensitivity coefficient $\beta_i$ introduced for the weight,
where $tf_{ni}$ is the frequency with which keyword $t_i$ occurs in the n-th document, $K$ is the total number of documents, and $k_i$ is the number of documents containing keyword $t_i$. The sensitivity coefficient $\beta_i$ is expressed as:
$$\beta_i = -P(C_m)\log P(C_m) + P(t_i \mid C_m)\log P(t_i \mid C_m) + P(t'_i \mid C_m)\log P(t'_i \mid C_m)$$
where $P(C_m)$ denotes the number of texts belonging to the m-th class of sensitive vocabulary; $P(t_i \mid C_m)$ denotes the number of texts that belong to the m-th class and contain keyword $t_i$; and $P(t'_i \mid C_m)$ denotes the number of texts that belong to the m-th class but do not contain keyword $t_i$.
Preferably, the correlation calculation module is further used for:
Taking $y_1, y_2, \dots, y_n$ as the features of a sensitive vocabulary type, so that $y = \{y_1, y_2, \dots, y_n\}$ is a sensitive vocabulary type represented with the vector space model, and taking $x_1, x_2, \dots, x_n$ as the features of a text to be identified, so that $x = \{x_1, x_2, \dots, x_n\}$ is a text to be identified represented with the vector space model. The conditional probability of the designated state $y$ given the observation sequence $x$, under the parameter set $\Lambda = \{\lambda_1, \dots, \lambda_j\}$, is:
where $f_j$ is a feature function, $\lambda_j$ is the weight of that feature function obtained by training, and $Z(x)$ is the normalization coefficient, with:
Preferably, the classification and identification module is further configured to:
Randomly select one of the K texts to be identified as the observation input sequence and use the remaining K-1 texts to be identified as K-1 output class states; calculate the probability values between the document in the input sequence and the documents in the output sequence, and iterate until the types of all sensitive vocabulary are identified:
A) Sort the K-1 probability values obtained; merge the text with the largest probability value and the text in the input observation sequence into one class, denoted $C_1$, and denote the text with the smallest probability value as class $C_2$;
B) Take the remaining K-3 texts to be identified as the input observation sequence and $C_1$ and $C_2$ as the output class states, obtaining for each text to be identified two probability values of membership in $C_1$ and $C_2$;
C) For each text to be identified, compute the variance of its probability values across the output class states, and sort the texts by variance;
D) Examine all probability values of the text with the smallest variance: if its smallest probability value is below a threshold θ, make that text a new class $C_3$; otherwise examine the text with the second-smallest variance, and so on, until a text whose probability value is below θ is found. At the same time, merge the text with the largest variance into the type for which its probability is largest;
E) Repeat steps b) to d) until all texts are classified.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes an information monitoring and analysis system that performs multidimensional monitoring of internet public opinion, collects and analyzes sensitive information effectively, and improves precision and recall.
Brief description of the drawings
Fig. 1 is a module diagram of the internet data analysis system according to an embodiment of the present invention.
Detailed description
A detailed description of one or more embodiments of the invention is provided below, together with the accompanying drawing that illustrates the principle of the invention. The invention is described in connection with these embodiments but is not limited to any of them. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications, and equivalents. Many specific details are set out in the following description to provide a thorough understanding of the invention. These details are provided as examples, and the invention can be practiced according to the claims without some or all of them.
One aspect of the present invention provides an information monitoring and analysis system. Fig. 1 is a module diagram of the information monitoring and analysis system according to an embodiment of the present invention.
The present invention combines multiple information collection methods and applies them to information monitoring. For content monitoring, the invention also needs to audit the sensitive information in the content. Large websites on the internet recommend many hot words; these are very likely the key vocabulary, i.e. the keywords, that the invention is concerned with, so the invention needs to collect them promptly as well. The objects selected for monitoring are internet portal websites, and information for particular regions is pushed in real time. According to the configured rules, the system periodically crawls the information of interest and, using various analysis methods, prompts the user to review the matching data.
The information monitoring system is divided into four layers, from bottom to top: the basic data layer, the data processing layer, the monitoring service layer, and the presentation layer.
The basic data layer provides database management, which requires a reasonable storage plan for the collected data; distributed computing, which implements object references between different nodes within each subsystem and between subsystems; and system maintenance, which covers configuring the parameters of each subsystem, monitoring the running state of each component, and managing users and their permissions.
The data processing layer provides data acquisition, which incrementally crawls web content and audio/video content from key websites; data storage, which manages external storage systems and supports data migration, backup, and cleanup; and data management, which manages the basic information and further analysis information of monitored objects such as websites, web content, and network audio/video content, with operations such as query, modify, delete, and add. Manual import of network audio/video content is also supported.
The monitoring service layer performs content analysis, which analyzes the collected text, audio, and video data, extracts features, builds data indexes, identifies harmful information, and tracks hot topics and sensitive vocabulary; information collection, which acquires content based on keywords, sample images, sample audio, and sample video; and information statistics, which compiles and classifies statistics on the collected audio/video websites, web content, and harmful information as the monitoring business requires.
The presentation layer provides a user-friendly interface for each management function, displays the results of information collection, information statistics, and harmful information identification and analysis, and supports operations such as system maintenance.
The system interface provides unified services to related systems, which makes it easier to integrate other monitoring systems and improves the integration and extensibility of each business system.
During information collection, the video acquisition module can use keywords submitted by business personnel to collect network video content, returning the video files that contain the specified keywords together with the time positions within those files. It can likewise use key frames submitted by business personnel, returning the video files that contain the specified key frames together with the time positions within those files. Given a particular video clip submitted by business personnel, it can collect from the local video database the network video content that contains similar or identical clips, and then determine how that content is distributed online. The clips found match the query sample in content but may differ in form. Through a web interface, business personnel can view the summary and key-frame panorama of each result video, jump to a key frame and play from it, and save the result video. On the video wall, they can click directly on the network video content in which the clip appears.
The audio acquisition module builds a content index over internet speech and audio files and lets users collect audio with specific content. Collecting audio with specific content (in other words, sensitive information) enables monitoring of network audio. Such content can take several forms: a particular keyword, a particular speaker, or a particular audio fragment.
When a user submits keyword text, the system returns the internet audio files that contain the specified keyword and locates the time positions within each file. When a user submits a speech sample of a particular speaker, the system returns the internet audio files that contain that speaker and locates the time positions within each file. When a user submits a particular audio fragment, the system returns the internet audio files that contain that fragment and locates the time positions within each file.
The text acquisition module includes a topic collection and identification unit, a topic tendency analysis unit, and a keyword filtering and matching unit, wherein:
The topic collection and identification unit gathers, as the monitoring business requires, statistics such as traffic, visit counts, and ranking position for specified websites, and automatically obtains related data from web content published by third parties, portal websites, search engines, and major websites. Given the name of a specified website, it can collect the web pages in which third parties publish ranking information and automatically extract from them the ranking data and other data the monitoring business needs.
The topic tendency analysis unit uses a method based on the tendency values of sentiment words: it computes a weighted sum of the semantic tendency values of all sentiment words among the keywords in a comment, and classifies the sentiment of a topic by comparing and analyzing the tendency vectors of users' topics.
The keyword filtering and matching unit uses keyword matching to detect whether web content contains harmful content and filters it. Keywords are configured as the monitoring business requires; they can be combined with AND, OR, and NOT conditions, and each configuration can be given a validity period that reflects how time-sensitive the keyword is.
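The combined keyword conditions and validity period described above can be sketched as follows. This is only an illustration: the class name `KeywordRule`, its fields, and the use of plain substring matching are assumptions, not details given in the patent.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class KeywordRule:
    """Hypothetical rule: AND / OR / NOT keyword lists plus a validity period."""
    all_of: list = field(default_factory=list)   # AND: every keyword must occur
    any_of: list = field(default_factory=list)   # OR: at least one must occur
    none_of: list = field(default_factory=list)  # NOT: none may occur
    valid_from: date = date.min
    valid_to: date = date.max

    def matches(self, text: str, today: date) -> bool:
        # A rule outside its validity period never matches.
        if not (self.valid_from <= today <= self.valid_to):
            return False
        if any(k in text for k in self.none_of):
            return False
        if not all(k in text for k in self.all_of):
            return False
        # An empty OR list places no further constraint on the text.
        return not self.any_of or any(k in text for k in self.any_of)


rule = KeywordRule(all_of=["flood"], any_of=["video", "photo"],
                   none_of=["exercise"],
                   valid_from=date(2015, 1, 1), valid_to=date(2015, 12, 31))
print(rule.matches("flood video spreads online", date(2015, 5, 20)))
```

A real deployment would match on segmented words rather than substrings; substring matching is used here only to keep the sketch self-contained.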
According to a further aspect of the present invention, the topic collection and identification unit includes:
A correlation calculation module, which takes a randomly selected text to be identified and the remaining texts to be identified as the observation sequence and the state sequence, respectively, and calculates the correlation probability values between the selected text and the remaining texts;
A classification and identification module, which merges the text in the state sequence that has the highest correlation with the selected text, designates the result as a first type, and takes the text with the lowest correlation as a second type; the first and second types then serve as the new state sequence, the remaining texts to be identified serve as the new observation sequence, and the process iterates to identify sensitive vocabulary.
The present invention constructs a sensitive vocabulary identification model. With reference to a dictionary, each text to be identified is represented with the vector space model, a series of probability values is calculated, and these values are used to identify sensitive vocabulary.
With reference to the dictionary, every text to be identified in the network can be expressed with the vector space model as $T_n = \{t_1, w_1;\ t_2, w_2;\ \dots;\ t_i, w_i\}$. Each feature word $t_i$ must appear in both the text to be identified and the dictionary. Its weight $w_i$ is calculated with a sensitivity coefficient $\beta_i$ introduced for the weight:
where $tf_{ni}$ is the frequency with which keyword $t_i$ occurs in the n-th document, $K$ is the total number of documents, and $k_i$ is the number of documents containing keyword $t_i$. The sensitivity coefficient $\beta_i$ is expressed using information gain:
$$\beta_i = -P(C_m)\log P(C_m) + P(t_i \mid C_m)\log P(t_i \mid C_m) + P(t'_i \mid C_m)\log P(t'_i \mid C_m)$$
where $P(C_m)$ denotes the number of texts belonging to the m-th class of sensitive vocabulary; $P(t_i \mid C_m)$ denotes the number of texts that belong to the m-th class and contain keyword $t_i$; and $P(t'_i \mid C_m)$ denotes the number of texts that belong to the m-th class but do not contain keyword $t_i$.
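The sensitivity coefficient can be sketched in a few lines. The sketch assumes the three quantities have been normalized to probabilities in [0, 1], which the logarithms suggest but the text does not state. The weight equation itself is not reproduced in this text, so `weighted_tfidf` below uses a common TF-IDF form scaled by the coefficient purely as an assumption; both function names are hypothetical.

```python
import math


def xlogx(p: float) -> float:
    """Return p * log(p), using the convention 0 * log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0


def sensitivity_coefficient(p_class: float, p_with: float, p_without: float) -> float:
    """beta_i = -P(C)logP(C) + P(t|C)logP(t|C) + P(t'|C)logP(t'|C), as written above."""
    return -xlogx(p_class) + xlogx(p_with) + xlogx(p_without)


def weighted_tfidf(tf: float, total_docs: int, docs_with_term: int, beta: float) -> float:
    """ASSUMED weight form beta * tf * log(K / k_i); not taken from the patent text."""
    return beta * tf * math.log(total_docs / docs_with_term)


beta = sensitivity_coefficient(p_class=0.5, p_with=0.5, p_without=0.5)
print(beta, weighted_tfidf(tf=3, total_docs=100, docs_with_term=10, beta=beta))
```

A term that occurs in every document gets weight zero under the assumed form, which matches the usual TF-IDF behaviour.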
Let $y_1, y_2, \dots, y_n$ be the features of a sensitive vocabulary type, so that $y = \{y_1, y_2, \dots, y_n\}$ is a sensitive vocabulary type represented with the vector space model, and let $x_1, x_2, \dots, x_n$ be the features of a text to be identified, so that $x = \{x_1, x_2, \dots, x_n\}$ is a text to be identified represented with the vector space model. The conditional probability of the designated state $y$ given the observation sequence $x$, under the parameter set $\Lambda = \{\lambda_1, \dots, \lambda_j\}$, is given by the following formula,
where $f_j$ is a feature function, the unified representation of the transition feature function and the state feature function; $\lambda_j$ is the weight of that feature function obtained by training; and $Z(x)$ is the normalization coefficient, with:
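The two equations referred to above are not reproduced in this text. Assuming the model is a standard linear-chain conditional random field, which the description of $f_j$ as unifying transition and state feature functions suggests, they would typically take the following form. The position index $i$ and the summation over candidate state sequences $y'$ are part of that assumption rather than of the patent text.

```latex
P(y \mid x, \Lambda) = \frac{1}{Z(x)}
  \exp\Big( \sum_{i} \sum_{j} \lambda_j \, f_j(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{i} \sum_{j} \lambda_j \, f_j(y'_{i-1}, y'_i, x, i) \Big)
```

Under this form, $Z(x)$ sums the unnormalized score over every possible state sequence, so the conditional probabilities for a given $x$ add up to 1.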
Sensitive vocabulary identification first requires expressing the texts to be identified as the observation input sequence and the output class states of the model. One of the K texts to be identified is selected at random as the observation input sequence, and the remaining K-1 texts to be identified serve as K-1 output class states. The probability values between the document in the input sequence and the documents in the output sequence are then calculated, and the subsequent steps iterate in a similar way until the types of all sensitive vocabulary are identified. Specifically:
A) Sort the K-1 probability values obtained; merge the text with the largest probability value and the text in the input observation sequence into one class, denoted $C_1$, and denote the text with the smallest probability value as class $C_2$.
B) Take the remaining K-3 texts to be identified as the input observation sequence and $C_1$ and $C_2$ as the output class states, obtaining for each text to be identified two probability values of membership in $C_1$ and $C_2$.
C) For each text to be identified, compute the variance of its probability values across the output class states, and sort the texts by variance. A larger variance indicates that the text discriminates strongly between the types.
D) Examine all probability values of the text with the smallest variance: if its smallest probability value is below a threshold θ, make that text a new class $C_3$; otherwise examine the text with the second-smallest variance, and so on, until a text whose probability value is below θ is found. At the same time, merge the text with the largest variance into the type for which its probability is largest.
E) Repeat steps b) to d) until all texts are classified.
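Steps a) to e) can be sketched as below. To keep the example runnable, the trained conditional probability is replaced by a plainly different stand-in, the mean cosine similarity between a text vector and the members of a class; the names `cosine`, `class_score`, and `classify` and the fixed random seed are all assumptions made for illustration.

```python
import math
import random
from statistics import pvariance


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def class_score(vec, members):
    """Hypothetical stand-in for P(class | text): mean similarity to members."""
    return sum(cosine(vec, m) for m in members) / len(members)


def classify(texts, theta, seed=0):
    """Group text vectors following steps a) to e); returns lists of indices."""
    if len(texts) < 3:
        raise ValueError("need at least three texts")
    rng = random.Random(seed)
    remaining = list(range(len(texts)))
    picked = remaining.pop(rng.randrange(len(remaining)))
    # a) rank the K-1 other texts by their score against the picked text
    remaining.sort(key=lambda i: cosine(texts[picked], texts[i]))
    classes = [[picked, remaining[-1]], [remaining[0]]]  # C1, C2
    remaining = remaining[1:-1]
    while remaining:  # e) repeat until every text is classified
        # b) score each remaining text against every current class
        probs = {i: [class_score(texts[i], [texts[m] for m in c]) for c in classes]
                 for i in remaining}
        # c) sort texts by the variance of their class scores
        by_var = sorted(remaining, key=lambda i: pvariance(probs[i]))
        # d) first text, in variance order, whose lowest score is below theta
        founder = next((i for i in by_var if min(probs[i]) < theta), None)
        if founder is not None:
            classes.append([founder])
            remaining.remove(founder)
        top = by_var[-1]  # largest variance: merge into its best-scoring class
        if top != founder:
            best = max(range(len(probs[top])), key=probs[top].__getitem__)
            classes[best].append(top)
            remaining.remove(top)
    return classes


texts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.1]]
groups = classify(texts, theta=0.0)
print(groups)
```

With `theta=0.0` no score can fall below the threshold, so no new class is ever created and the five texts are split between the two initial classes.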
The threshold θ controls whether a new type needs to be added. The larger θ is, the less distinct the differences between types become and the more types are obtained, so texts that belong to one type may be wrongly separated. The smaller θ is, the fewer types are obtained, so texts may be wrongly merged into one type. θ therefore needs to be estimated from how the distance between types varies with θ.
According to another aspect of the present invention, the topic tendency analysis unit builds a semantic trend value network between emoticons and vocabulary, then uses the similarity of the semantic trend value vectors between a word and the emoticons to calculate the word's semantic orientation strength, thereby identifying the semantic orientation of network vocabulary. The unit comprises two modules: data preprocessing and lexical semantic orientation identification.
The data preprocessing module first screens network text using emoticons with a clearly positive or negative orientation, and then extracts a candidate word set from the screened text.
The semantic orientation identification module first builds the lexical semantic trend value network over the candidate word set obtained from preprocessing, using the term co-occurrence value model. It then selects as candidate words the emoticons in the sentiment set whose frequency in the candidate word set exceeds a preset value, expands low-frequency words using a synonym thesaurus, and extracts sentiment words. Finally, it uses the candidate words and the semantic trend value network to calculate each word's semantic orientation strength, completing lexical semantic orientation identification.
After word segmentation and part-of-speech tagging, each word is represented as a pair (word, freq), giving the candidate set $W = \{w_1, w_2, \dots, w_N\}$, where N is the total number of candidate words.
The term co-occurrence value reflects the degree of ordered co-occurrence of two words across the text and represents the weight with which one word activates the occurrence of the other. For given words i and j, the term co-occurrence value of word i for word j is defined as:
$$waf_{ij} = \frac{(f_{ij}/f_i)\cdot(f_{ij}/f_j)}{d_{ij}^{2}}$$
where $f_i$ and $f_j$ are the frequencies with which the two words occur in the document, $f_{ij}$ is the frequency with which words i and j co-occur within the set co-occurrence window, and $d_{ij}$ is the average co-occurrence distance of the two words. By definition, $waf_{ij}$ lies in [0, 1]: 0 means that word j never occurs within $d_{ij}$ words after word i in the document, and 1 means that word j always occurs immediately after word i. With this definition, a document can be represented as a term co-occurrence value matrix WAF.
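A direct transcription of the definition, given precomputed counts, might look like this. The function name `waf` and the choice to return 0 when the words never co-occur are assumptions made for the sketch; how the counts and the average distance are gathered from a corpus is left out.

```python
def waf(f_i: int, f_j: int, f_ij: int, d_ij: float) -> float:
    """Term co-occurrence value (f_ij / f_i) * (f_ij / f_j) / d_ij ** 2.

    f_i, f_j: frequencies of words i and j in the document
    f_ij:     ordered co-occurrences of j after i within the window
    d_ij:     average distance between i and j over those co-occurrences
    """
    if f_i == 0 or f_j == 0 or f_ij == 0:
        return 0.0  # no co-occurrence: the value is taken as 0
    return (f_ij / f_i) * (f_ij / f_j) / (d_ij ** 2)


# j always follows i directly: every count is equal and the distance is 1.
print(waf(4, 4, 4, 1.0))
# j follows i in half of i's occurrences, two words away on average.
print(waf(4, 2, 2, 2.0))
```

The first call reaches the upper bound of the [0, 1] range described in the text, and the distance term in the second call shows how looser co-occurrence is penalized quadratically.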
Because $waf_{ij}$ is directed, the term co-occurrence value matrix is asymmetric. The element $waf_{ij}$ means that word i activates word j with weight $waf_{ij}$. The lexical semantic trend value between words is calculated on this directed word network and is defined as:
$$A_{ij} = \Big(\frac{1}{|K_{ij}|}\sum_{k \in K_{ij}} OR(waf_{ki}, waf_{kj})\Big)^{1/2} \cdot \Big(\frac{1}{|L_{ij}|}\sum_{l \in L_{ij}} OR(waf_{il}, waf_{jl})\Big)^{1/2}$$
where $K_{ij} = \{k \mid waf_{ki} > 0 \text{ or } waf_{kj} > 0\}$ is the set of words that have a nonzero co-occurrence value toward word i or word j; $L_{ij} = \{l \mid waf_{il} > 0 \text{ or } waf_{jl} > 0\}$ is the set of words toward which word i or word j has a nonzero co-occurrence value; and $OR(x, y) = \min(x, y)/\max(x, y)$ is the overlap rate. The lexical semantic trend value $A_{ij}$ is the geometric mean of the overlap rates of all co-occurrence values of word i and word j in the term co-occurrence value matrix, and reflects how close the two words are across the whole document.
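The semantic trend value can be computed from a term co-occurrence value matrix as sketched below. The sketch reads the definition as averaging the overlap rates over $K_{ij}$ and over $L_{ij}$ before taking the geometric mean; the function names and the nested-list matrix layout (`w[a][b]` holding the value of word a for word b) are assumptions.

```python
import math


def overlap(x: float, y: float) -> float:
    """Overlap rate OR(x, y) = min(x, y) / max(x, y), taken as 0 when both are 0."""
    m = max(x, y)
    return min(x, y) / m if m > 0 else 0.0


def trend_value(w, i: int, j: int) -> float:
    """Semantic trend value A_ij from a square co-occurrence value matrix w."""
    n = len(w)
    k_set = [k for k in range(n) if w[k][i] > 0 or w[k][j] > 0]  # words pointing at i or j
    l_set = [l for l in range(n) if w[i][l] > 0 or w[j][l] > 0]  # words i or j point at
    if not k_set or not l_set:
        return 0.0
    in_mean = sum(overlap(w[k][i], w[k][j]) for k in k_set) / len(k_set)
    out_mean = sum(overlap(w[i][l], w[j][l]) for l in l_set) / len(l_set)
    return math.sqrt(in_mean * out_mean)


w = [
    [0.0, 0.5, 0.2],
    [0.5, 0.0, 0.2],
    [0.4, 0.4, 0.0],
]
print(trend_value(w, 0, 1), trend_value(w, 1, 0))
```

Swapping i and j leaves both sets and every overlap rate unchanged, which is why the resulting matrix is symmetric even though the input matrix is not.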
Likewise it is possible to by document representation it is semantic to tend to value matrix with lexical semantic trend value Aij.Lexical semantic tends to Value matrix is a undirected symmetrical matrix, wherein the i-th row represents the semantic trend value of other all words and vocabulary i.In vocabulary , can be using vocabulary as node in semantic tendency identification process, the semantic trend value between each node builds semantic tend to as side It is worth network, the more strong then node semantics tendency of semantic trend value is more close.If node set is W={ w1, w2..., wN, node< wi, wj>Between semantic trend value be Aij
Before mood word extraction, it is necessary first to carry out the selection of candidate word, two methods can be used:One kind is to select word The frequency highest and obvious one group of word of tendency is as candidate word;Another kind be selected based on dictionary resources be inclined in dictionary it is most obvious One group of word is as candidate word.The present invention chooses the emotag conduct of the positive and negative tendency of frequency of occurrence highest in a document in network Candidate word.
Occupied the majority by low-frequency word in pretreated document, low-frequency word and candidate's Term co-occurrence number are less, and the present invention draws Enter synonym clump, when mood word extracts, low-frequency word be extended using synonym clump, at the same consider low-frequency word and its Semantic trend value between synset and candidate word is extracted to complete mood word.
The positive and negative orientation strengths of a word are measured by the similarity of the semantic trend value vectors between the word and the positive and negative candidate words, from which the word's semantic orientation strength is obtained. If the sentiment word set OPW contains N′ words in total, the semantic orientation strength of word $c_j$ ($c_j \in OPW$, $j \in [1, 2, \dots, N']$) can be expressed as:
$$SO_j = SO^{+}_j - \beta \cdot SO^{-}_j$$
where $SO^{+}_j$ and $SO^{-}_j$ are the semantic orientation similarities of word $c_j$ with the positive and negative candidate word sets, respectively, and β is the ratio of the total $SO^{+}_j$ to the total $SO^{-}_j$, i.e. the ratio of positive to negative orientation strength in the document.
$SO^{+}_j$ can be calculated by the following formula:
where $v_{c_j}$ is the semantic trend value vector of word $c_j$, and $v_t \leftarrow p_i$ is the row vector corresponding to positive candidate word $p_i$ in the semantic trend value matrix, t being the row of $p_i$ in that matrix.
$SO^{-}_j$ can be calculated in the same way. Substituting $SO^{+}_j$ and $SO^{-}_j$ gives $SO_j$. A word is classified as positive when $SO_j > \gamma_p$ and as negative when $SO_j < \gamma_n$; in all other cases it is classified as neutral. Here $\gamma_p$ and $\gamma_n$ are the decision thresholds for positive and negative words, respectively.
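The combination and thresholding step can be sketched as follows, taking the per-word similarities $SO^{+}_j$ and $SO^{-}_j$ as given inputs, since the equation that produces them is not reproduced in this text. The function names and the sample numbers are assumptions for illustration.

```python
def orientation_strengths(so_pos, so_neg):
    """SO_j = SO+_j - beta * SO-_j, with beta = total SO+ / total SO-."""
    total_neg = sum(so_neg)
    beta = sum(so_pos) / total_neg if total_neg else 0.0
    return [p - beta * n for p, n in zip(so_pos, so_neg)]


def label(so: float, gamma_p: float, gamma_n: float) -> str:
    """Positive above gamma_p, negative below gamma_n, neutral otherwise."""
    if so > gamma_p:
        return "positive"
    if so < gamma_n:
        return "negative"
    return "neutral"


so_pos = [0.8, 0.2, 0.5]  # similarity of each word to the positive candidates
so_neg = [0.2, 0.8, 0.5]  # similarity of each word to the negative candidates
strengths = orientation_strengths(so_pos, so_neg)
labels = [label(s, gamma_p=0.1, gamma_n=-0.1) for s in strengths]
print(labels)
```

Scaling the negative term by β keeps a document with an overall positive or negative skew from pushing every word toward the same label.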
In summary, the present invention proposes an information monitoring and analysis system that performs multidimensional monitoring of internet public opinion, collects and analyzes sensitive information effectively, and improves precision and recall.
Those skilled in the art should understand that the modules or steps of the invention described above can be implemented with a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Alternatively, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The invention is therefore not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the invention are used only to illustrate or explain the principle of the invention and do not limit it. Any modification, equivalent substitution, or improvement made without departing from the spirit and scope of the invention should therefore fall within its scope of protection. In addition, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.

Claims (1)

  1. A kind of 1. internet data analysis system, it is characterised in that including:
    Correlation calculations module, for using randomly selected text to be identified and the remaining text to be identified of being chosen as sight Sequencing row and status switch, calculate selected correlation probabilities value between text to be identified and remaining text to be identified;
    Classification and identification module, for correlation highest text in status switch and selected text to be identified to be merged, characterize For the first kind, while using the minimum text of correlation as Second Type;Using the first and second types as new state sequence Row, remaining text to be identified are iterated as new observation sequence, to realize the identification of sensitive vocabulary;
    The correlation calculations module is further used for:
    By y1, y2..., ynAs sensitive vocabulary type feature, y={ y1, y2..., yiAs vector space model represent one The type of individual sensitive vocabulary;By x1, x2..., xnAs the feature of text to be identified, x={ x1, x2..., xiIt is to use vector space The text to be identified that model represents, observation sequence x correspond to parameter sets Λ={ λ1..., λjDesignated state y condition Probability is:
    Wherein:fjIt is characterized function;λjFor the weights by training obtained characteristic function;Z (x) is regularization coefficient, and n is quick Feel lexical types feature and the dimension of text feature to be identified, and:
    wherein the classification and identification module is further configured to:
    randomly select 1 of the K texts to be identified as the observation input sequence s, take the remaining K-1 texts to be identified as K-1 output class states, and calculate the probability value between the document in the input sequence and each document in the output sequence, until all types of sensitive vocabulary have been identified:
    a) sort the K-1 probability values obtained, merge the text corresponding to the maximum probability value with the text in the input observation sequence into one class, denoted class C_1, and denote the text corresponding to the minimum probability value as class C_2;
    b) take the remaining K-3 texts to be identified as input observation sequences and C_1 and C_2 as output class states, thereby obtaining, for each text to be identified, two probability values of belonging to class C_1 and class C_2;
    c) compute the variance of the probability values between each text to be identified and the output class states, and sort the variances;
    d) examine all probability values of the text corresponding to the minimum variance; if the smallest of these probability values is less than a threshold θ, take that text as a new class C_3; otherwise, examine the text with the second-smallest variance, and so on, until a text whose probability value is less than the threshold θ is found; at the same time, merge the text corresponding to the maximum variance into the class for which its probability is highest;
    e) repeat steps b) to d) until all texts have been classified.
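
The iterative procedure of steps a) to e) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the conditional probability is assumed to be a log-linear model, p(y|x) = exp(sum_j λ_j f_j(x, y)) / Z(x), with the hypothetical feature functions f_j(x, y) = x_j * y_j; each class is assumed to be represented by the centroid of its members' feature vectors; and the names conditional_probs, centroid, and classify are invented for the example.

```python
import math
import random
from statistics import pvariance


def conditional_probs(x, states, weights):
    """p(y | x) for every candidate state y, assuming a log-linear model
    with the hypothetical feature functions f_j(x, y) = x_j * y_j."""
    scores = [sum(w * xj * yj for w, xj, yj in zip(weights, x, y))
              for y in states]
    top = max(scores)                      # shift scores for numerical stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)                          # normalization coefficient Z(x)
    return [e / z for e in exps]


def centroid(vectors):
    """Mean feature vector; an assumed way to represent a class as one state."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def classify(texts, weights, theta, seed=0):
    """Group K >= 3 feature vectors following steps a) to e)."""
    if len(texts) < 3:
        raise ValueError("need at least 3 texts")
    rng = random.Random(seed)
    remaining = list(range(len(texts)))
    s = remaining.pop(rng.randrange(len(remaining)))  # random observation text

    # a) most similar text joins s in C1; least similar text seeds C2
    probs = conditional_probs(texts[s], [texts[i] for i in remaining], weights)
    best = remaining[probs.index(max(probs))]
    worst = remaining[probs.index(min(probs))]
    classes = [[s, best]]
    if worst != best:
        classes.append([worst])
    remaining = [i for i in remaining if i not in (best, worst)]

    # e) repeat b) to d) until every text has a class
    while remaining:
        # b) probability of each remaining text under each current class
        states = [centroid([texts[i] for i in c]) for c in classes]
        table = {i: conditional_probs(texts[i], states, weights)
                 for i in remaining}
        # c) sort texts by the variance of their class probabilities
        by_var = sorted(remaining, key=lambda i: pvariance(table[i]))
        # d) the highest-variance text joins its most probable class
        confident = by_var[-1]
        classes[table[confident].index(max(table[confident]))].append(confident)
        remaining.remove(confident)
        # d) the first low-variance text with a probability below theta
        #    starts a new class
        for i in by_var[:-1]:
            if min(table[i]) < theta:
                classes.append([i])
                remaining.remove(i)
                break
    return classes
```

Calling classify(texts, weights, theta) with a list of equal-length feature vectors returns a list of classes, each a list of indices into texts. Every iteration assigns at least the highest-variance text, so the loop terminates once all texts are classified.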
CN201510257964.9A 2015-05-20 2015-05-20 Internet data analysis system Active CN104809253B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510257964.9A CN104809253B (en) 2015-05-20 2015-05-20 Internet data analysis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510257964.9A CN104809253B (en) 2015-05-20 2015-05-20 Internet data analysis system

Publications (2)

Publication Number Publication Date
CN104809253A CN104809253A (en) 2015-07-29
CN104809253B true CN104809253B (en) 2017-12-08

Family

ID=53694075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510257964.9A Active CN104809253B (en) 2015-05-20 2015-05-20 Internet data analysis system

Country Status (1)

Country Link
CN (1) CN104809253B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105610640B (en) * 2015-12-21 2019-09-24 中国电子科技集团公司第十五研究所 Method and apparatus for internet information spreading path reduction
CN105893582B (en) * 2016-04-01 2019-06-28 深圳市未来媒体技术研究院 Social network user mood discrimination method
CN109034389A (en) * 2018-08-02 2018-12-18 黄晓鸣 Man-machine interactive modification method, device, equipment and the medium of information recommendation system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1158460A (en) * 1996-12-31 1997-09-03 复旦大学 Multiple languages automatic classifying and searching method
CN101727500A (en) * 2010-01-15 2010-06-09 清华大学 Text classification method of Chinese web page based on steam clustering
CN104216954B (en) * 2014-08-20 2017-07-14 北京邮电大学 Prediction device and prediction method for accident topic state

Also Published As

Publication number Publication date
CN104809253A (en) 2015-07-29

Similar Documents

Publication Publication Date Title
CN104809108B (en) Information monitoring analysis system
US11580104B2 (en) Method, apparatus, device, and storage medium for intention recommendation
CN109271512B (en) Emotion analysis method, device and storage medium for public opinion comment information
Nagy et al. Crowd sentiment detection during disasters and crises.
Purohit et al. Emergency-relief coordination on social media: Automatically matching resource requests and offers
Keneshloo et al. Predicting the popularity of news articles
Li et al. Using text mining and sentiment analysis for online forums hotspot detection and forecast
US9990368B2 (en) System and method for automatic generation of information-rich content from multiple microblogs, each microblog containing only sparse information
CN107577759A (en) User comment auto recommending method
CN107862022B (en) Culture resource recommendation system
CN111737495A (en) Middle-high-end talent intelligent recommendation system and method based on domain self-classification
KR101695011B1 (en) System for Detecting and Tracking Topic based on Topic Opinion and Social-influencer and Method thereof
US10387805B2 (en) System and method for ranking news feeds
Weiler et al. Survey and experimental analysis of event detection techniques for twitter
CN106156372B (en) Classification method and device for internet sites
CN108733791B (en) Network event detection method
Sharma et al. Detecting hate speech and insults on social commentary using nlp and machine learning
CN104809253B (en) Internet data analysis system
KR101543680B1 (en) Entity searching and opinion mining system of hybrid-based using internet and method thereof
Hazimeh et al. SocialMatching++: A Novel Approach for Interlinking User Profiles on Social Networks.
CN107330076A (en) Network public opinion information display system and method
Masood et al. Semantic analysis to identify students’ feedback
CN115018255A (en) Tourist attraction evaluation information quality validity analysis method based on integrated learning data mining technology
CN113971213A (en) Smart city management public information sharing system
Pino et al. Assessment and visualization of geographically distributed event-related sentiments by mining social networks and news

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by SIPO to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20180730

Address after: 510623 room 3301 -3302, 1 Jinsui Road, Tianhe District, Guangzhou, Guangdong (for office use only)

Patentee after: GUANGZHOU FENGSHEN NETWORK TECHNOLOGY Co.,Ltd.

Address before: 610041 No. 1, No. 3 Shen Xian Nan Road, Chengdu high tech Zone, Sichuan, China.

Patentee before: CHENGDU BLTSAFE INFORMATION TECHNOLOGY Co.,Ltd.

PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Internet data analysis system

Effective date of registration: 20210223

Granted publication date: 20171208

Pledgee: Zhujiang Branch of Guangzhou Bank Co.,Ltd.

Pledgor: GUANGZHOU FENGSHEN NETWORK TECHNOLOGY Co.,Ltd.

Registration number: Y2021980001275

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20220420

Granted publication date: 20171208

Pledgee: Zhujiang Branch of Guangzhou Bank Co.,Ltd.

Pledgor: GUANGZHOU FENGSHEN NETWORK TECHNOLOGY Co.,Ltd.

Registration number: Y2021980001275

TR01 Transfer of patent right

Effective date of registration: 20240226

Address after: Room 499, 4th Floor, No. 89 Yanling Road, Tianhe District, Guangzhou City, Guangdong Province 510000. Self made No. 134 (for office only)

Patentee after: Guangzhou Kunchuan Network Technology Co.,Ltd.

Country or region after: China

Address before: 510623 room 3301 -3302, 1 Jinsui Road, Tianhe District, Guangzhou, Guangdong (for office use only)

Patentee before: GUANGZHOU FENGSHEN NETWORK TECHNOLOGY Co.,Ltd.

Country or region before: China
