CN104281663A

CN104281663A - Method and system for analyzing events on basis of non-negative matrix factorization

Info

Publication number: CN104281663A
Application number: CN201410495959.7A
Authority: CN
Inventors: 张日崇; 邰振赢; 于伟仁; 刘俊伟; 李建欣
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2014-09-24
Filing date: 2014-09-24
Publication date: 2015-01-14

Abstract

The invention provides a method and a system for analyzing events based on non-negative matrix factorization. The method includes: acquiring to-be-processed data comprising at least one data text; segmenting each data text into words to obtain a text space matrix corresponding to the to-be-processed data; performing non-negative matrix factorization on the text space matrix; determining, from the basis matrix obtained by the factorization, the events contained in the to-be-processed data and the keywords describing each event; and determining, from the coefficient matrix obtained by the factorization, the data texts corresponding to each event. Because the text space matrix of the to-be-processed data is constructed and then factorized, one very large matrix is decomposed into two smaller matrices, and the matrix elements are non-negative both before and after the factorization. The accuracy of the event mining result is thereby preserved, the events in the to-be-processed data are found through dimensionality reduction, computation is simple, and the method and system scale well.

Description

Method and system for analyzing events based on non-negative matrix factorization

Technical field

The invention belongs to the technical field of data mining, and in particular relates to a method and system for analyzing events based on non-negative matrix factorization.

Background

With the rapid development of Internet technology, more and more users publish news or express personal opinions on social phenomena through social network platforms such as forums and microblogs. As a result, the volume of data on the Internet is growing explosively, and how to mine events effectively from this massive data is a major research problem for every search engine.

One existing data mining approach is hierarchical clustering, which decomposes a given set of data objects level by level until a stopping condition is met. It comes in two forms. Agglomerative hierarchical clustering is a bottom-up strategy: each data object starts as its own cluster, and clusters are merged into progressively larger clusters according to the similarity between data objects until a stopping condition is met. Divisive hierarchical clustering is a top-down strategy: all data objects start in a single cluster, which is split into progressively smaller clusters until a stopping condition is met.

However, hierarchical clustering inherently has high computational complexity, which limits its scalability and makes it unsuitable for event mining on massive data.

Summary of the invention

To address the above problem, the invention provides a method and system for analyzing events based on non-negative matrix factorization, in order to overcome the high computational complexity and poor scalability of the hierarchical clustering approach in the prior art.

The invention provides a method for analyzing events based on non-negative matrix factorization, comprising:

acquiring to-be-processed data, the to-be-processed data comprising at least one data text;

performing word segmentation on each of the at least one data text to obtain a text space matrix corresponding to the to-be-processed data, the text space matrix describing the word information contained in the at least one data text;

performing non-negative matrix factorization on the text space matrix, determining, according to the basis matrix obtained by the factorization, the events contained in the to-be-processed data and the keywords describing each of the events, and determining, according to the coefficient matrix obtained by the factorization, the data texts corresponding to each of the events.

The invention provides a system for analyzing events based on non-negative matrix factorization, comprising:

an acquisition module, configured to acquire to-be-processed data, the to-be-processed data comprising at least one data text;

a processing module, configured to perform word segmentation on each of the at least one data text to obtain a text space matrix corresponding to the to-be-processed data, the text space matrix describing the word information contained in the at least one data text;

a computing module, configured to perform non-negative matrix factorization on the text space matrix, determine, according to the basis matrix obtained by the factorization, the events contained in the to-be-processed data and the keywords describing each of the events, and determine, according to the coefficient matrix obtained by the factorization, the data texts corresponding to each of the events.

With the method and system provided by the invention, after to-be-processed data comprising multiple data texts is acquired, word segmentation is performed on each data text, yielding a text space matrix that describes the data texts contained in the to-be-processed data and all the words contained in those data texts. Non-negative matrix factorization is then performed on the text space matrix. The events contained in the to-be-processed data, and the keywords describing each event, are obtained from the basis matrix produced by the factorization, and the data texts corresponding to each event, that is, the data texts that contain the event, are determined from the coefficient matrix. By constructing the text space matrix of the to-be-processed data and factorizing it, one large matrix is decomposed into two smaller matrices while the matrix elements remain non-negative before and after the factorization, that is, the elements of the original matrix and of both factor matrices are all non-negative. The accuracy of the event mining result is thereby preserved, the events contained in the to-be-processed data are found through dimensionality reduction, computation is simple, and scalability is good.

Brief description of the drawings

Fig. 1 is a flowchart of embodiment one of the event analysis method based on non-negative matrix factorization according to the invention;

Fig. 2 is a flowchart of embodiment two of the event analysis method based on non-negative matrix factorization according to the invention;

Fig. 3 is a schematic structural diagram of embodiment one of the event analysis system based on non-negative matrix factorization according to the invention;

Fig. 4 is a schematic structural diagram of embodiment two of the event analysis system based on non-negative matrix factorization according to the invention.

Detailed description

Fig. 1 is a flowchart of embodiment one of the event analysis method based on non-negative matrix factorization according to the invention. As shown in Fig. 1, the method comprises:

Step 101: acquiring to-be-processed data, the to-be-processed data comprising at least one data text;

Step 102: performing word segmentation on each of the at least one data text to obtain a text space matrix corresponding to the to-be-processed data, the text space matrix describing the word information contained in the at least one data text;

Step 103: performing non-negative matrix factorization on the text space matrix, determining, according to the basis matrix obtained by the factorization, the events contained in the to-be-processed data and the keywords describing each of the events, and determining, according to the coefficient matrix obtained by the factorization, the data texts corresponding to each of the events.

The method provided by this embodiment is suitable for event mining on the massive data generated by various Internet applications, and is particularly suitable for social networks such as microblogs and forums. The method may be executed by a processing device, which may be, for example, the management platform of a given application.

Take microblogs as an example. A large amount of varied data spreads on microblogs every day. To let ordinary users quickly and effectively find the information they need in this massive microblog data, or to let users such as the general public and government agencies learn of social hot spots in time, event mining must be performed on the massive microblog data. Note that this embodiment mainly processes textual data, referred to as data texts. Note also that an event in this embodiment is not a complete incident or news story in the usual sense, but a set of words characterized by a number of keywords. The keywords in an event are usually related in some way, for example because they appear together in many data texts, so they also reflect, to some extent, the current focus of attention on the microblog platform.

Specifically, after the processing device acquires the to-be-processed data, for example the microblog data of a given day, it performs word segmentation on each data text in the to-be-processed data, for example using the existing NLPIR Chinese word segmentation system, so that each data text is divided into words and the words contained in each data text are obtained. Segmenting every data text in the to-be-processed data yields all the words contained in the to-be-processed data, from which a text space matrix formed by all the data texts and all the words is constructed. Each column vector of this matrix represents the words contained in the data text corresponding to that column.
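The construction of the text space matrix can be illustrated with a minimal Python sketch. It is not the patented implementation: the whitespace `segment` function is a stand-in for a real segmenter such as NLPIR, the function names are hypothetical, and the choice of raw word counts as matrix entries is an assumption, since this embodiment does not state what each entry holds.

```python
from collections import Counter


def segment(text):
    """Stand-in segmenter (assumption): lowercase and split on whitespace.

    A real system would call a word segmentation tool such as NLPIR here.
    """
    return text.lower().split()


def build_text_space_matrix(texts):
    """Return (vocabulary, A), where A[i][j] counts word i in data text j.

    Rows correspond to words and columns to data texts, so each column
    describes the words contained in one data text.
    """
    tokenized = [segment(t) for t in texts]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    index = {w: i for i, w in enumerate(vocabulary)}
    A = [[0] * len(texts) for _ in vocabulary]
    for j, tokens in enumerate(tokenized):
        for word, count in Counter(tokens).items():
            A[index[word]][j] = count
    return vocabulary, A


# Three toy data texts (illustrative only).
texts = ["flood hits city", "city council votes", "flood flood warning"]
vocabulary, A = build_text_space_matrix(texts)
```

Here `A` has one row per distinct word and one column per data text, which matches the M × N layout used later in the description.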

Non-negative matrix factorization is then performed on the text space matrix. Non-negative matrix factorization is an existing matrix factorization method in the prior art and is not described in detail here. The factorization yields two matrices, a basis matrix and a coefficient matrix. Note that the text space matrix is very large, so processing it directly would require a very large amount of computation; once it is decomposed into two smaller matrices, processing based on those two matrices requires far less computation. Moreover, the product of the basis matrix and the coefficient matrix is an approximation of the text space matrix. The factorization keeps the elements non-negative before and after decomposition, so that each element of the product is approximately equal to the element at the same position before decomposition. Accordingly, the events contained in the to-be-processed data and the keywords describing each event are determined from the basis matrix, and the data texts corresponding to each event are determined from the coefficient matrix. That is, the number of column vectors in the basis matrix is the number of events contained in the to-be-processed data, and the words contained in each column vector form the keywords of that event. Each row vector of the coefficient matrix characterizes one event, and the data texts in that row vector form the set of data texts corresponding to the event, that is, the data texts containing the keywords of the event. From the basis matrix and the coefficient matrix one can therefore learn how many events the to-be-processed data contains, what the keywords of each event are, and which data texts contain the keywords of each event.
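The embodiment treats non-negative matrix factorization as prior art and does not fix an algorithm. As a minimal sketch, assuming the widely used multiplicative update rules for the Frobenius-norm objective (one possible choice, not stated in the source), the decomposition A ≈ W·H could be computed as follows in Python. The function names, the iteration count, the random initialization and the small constant `eps` are all illustrative assumptions.

```python
import random


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    Yt = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]


def transpose(X):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*X)]


def frobenius_error(A, W, H):
    """Frobenius norm of A - W*H, used to judge the approximation."""
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5


def nmf(A, k, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative m x n matrix A into W (m x k) and H (k x n).

    Uses multiplicative updates (an assumed algorithm choice). Because every
    update multiplies non-negative numbers, W and H stay non-negative.
    """
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    W = [[rng.random() + eps for _ in range(k)] for _ in range(m)]
    H = [[rng.random() + eps for _ in range(n)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num = matmul(Wt, A)
        den = matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n)]
             for i in range(k)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num = matmul(A, Ht)
        den = matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(m)]
    return W, H
```

Calling `nmf(A, K)` on a text space matrix with a chosen number of events K returns a basis matrix with K columns (one per event) and a coefficient matrix with K rows. This pure-Python version is only meant for small illustrative inputs; a practical system would use an optimized numerical library.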

In this embodiment, after to-be-processed data comprising multiple data texts is acquired, word segmentation is performed on each data text, yielding a text space matrix that describes the data texts contained in the to-be-processed data and all the words contained in those data texts. Non-negative matrix factorization is then performed on the text space matrix. The events contained in the to-be-processed data, and the keywords describing each event, are obtained from the basis matrix produced by the factorization, and the data texts corresponding to each event, that is, the data texts that contain the event, are determined from the coefficient matrix. By constructing the text space matrix of the to-be-processed data and factorizing it, one large matrix is decomposed into two smaller matrices while non-negativity is preserved, that is, every element of the two resulting matrices is non-negative. The accuracy of the event mining result is thereby preserved, the large matrix is turned into two small matrices, the events contained in the to-be-processed data are found through dimensionality reduction, computation is simple, and scalability is good.

Fig. 2 is a flowchart of embodiment two of the event analysis method based on non-negative matrix factorization according to the invention. As shown in Fig. 2, the method provided by this embodiment comprises the following steps:

Step 201: acquiring to-be-processed data, the to-be-processed data comprising at least one data text;

Step 202: performing semantic parsing on each data text to determine the nouns and verbs contained in each data text;

Step 203: marking the determined nouns and verbs, and determining the weight value of each noun and verb according to the following formula, to obtain the text space matrix A_{M×N} corresponding to the to-be-processed data:

R(w) = (number of occurrences of w among the M words) × log(total number of data texts N / number of data texts containing w)

where w is any one of the nouns or verbs and R(w) is the weight value of w.

In this embodiment, semantic parsing is performed on each data text in the to-be-processed data to determine which words it contains. Each data text contains a great many words. Some of them have no practical meaning and are referred to as function words, while others, such as "municipal administration" and "attack", are nouns or verbs with practical meaning. Therefore, to distinguish the importance of different words in each data text, after semantic parsing the nouns and verbs contained in the data text are selected and assigned higher weight values, while function words are assigned lower weight values. The weight value of each noun and verb can be determined from its number of occurrences in the to-be-processed data.
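The weight formula of step 203 can be sketched in Python as follows. This is a hedged illustration: it assumes the data texts have already been segmented and filtered down to nouns and verbs (the part-of-speech step is not shown), it assumes the natural logarithm because the source does not state the base, and the function name `weight_values` is hypothetical.

```python
import math
from collections import Counter


def weight_values(tokenized_texts):
    """Compute R(w) = occurrences(w) * log(N / texts_containing(w)).

    tokenized_texts: list of data texts, each a list of nouns and verbs.
    occurrences(w) is counted over all words of the to-be-processed data,
    and N is the total number of data texts.
    """
    n_texts = len(tokenized_texts)
    occurrences = Counter(w for tokens in tokenized_texts for w in tokens)
    containing = Counter(w for tokens in tokenized_texts for w in set(tokens))
    return {
        w: occurrences[w] * math.log(n_texts / containing[w])
        for w in occurrences
    }
```

Under this formula a word that appears in every data text receives a weight of zero, because log(N / N) = 0, while a word concentrated in a few data texts receives a larger weight.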

Step 204: performing non-negative matrix factorization on the text space matrix A_{M×N} to obtain a basis matrix W_{M×K} and a coefficient matrix H_{K×N}, where K is the total number of events contained in the to-be-processed data;

Step 205: determining that each column vector of the basis matrix W_{M×K} characterizes a first event, and that the target words contained in each column vector are the keywords describing the corresponding first event, the target words being the first preset number of nouns and verbs that rank highest when the words contained in the column vector are arranged in descending order of weight value;

Step 206: determining that each row vector of the coefficient matrix H_{K×N} characterizes a second event, and that the data texts contained in each row vector are the data texts corresponding to the second event characterized by that row vector.

When the nouns and verbs contained in each data text have been assigned higher weight values, the words contained in each column vector of the basis matrix after non-negative matrix factorization carry different weight values: some are nouns and verbs with higher weight values, and some are function words with lower weight values. Optionally, the nouns and verbs whose weight values exceed a certain threshold may be taken as the keywords of the event corresponding to that column vector. However, the number of such nouns and verbs may still be large, and if the event mining result is to be presented, presenting that many keywords would give a poor user experience. Therefore, in this embodiment the words contained in each column vector of the basis matrix are arranged in descending order of weight value, and the preset number of top-ranked words are chosen as the keywords of the corresponding event. Note that descending order is only an example; the words may also be sorted in ascending order, in which case the preset number of words ranked last are selected.
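The keyword selection described above can be sketched as follows. The sketch assumes that the "weight value" of a word within a column vector is the corresponding entry of the basis matrix W, which is one reading of the text; the function name and the parameter `first_preset_number` are hypothetical.

```python
def top_keywords(W, vocabulary, first_preset_number):
    """For each column (event) of W, return the highest-weighted words.

    W: basis matrix as a list of rows, one row per word in `vocabulary`.
    Returns one keyword list per event, ordered by descending weight.
    """
    events = []
    for k in range(len(W[0])):
        ranked = sorted(range(len(vocabulary)),
                        key=lambda i: W[i][k], reverse=True)
        events.append([vocabulary[i] for i in ranked[:first_preset_number]])
    return events


# Toy basis matrix with four words and two events (illustrative values).
W = [[0.9, 0.0], [0.1, 0.8], [0.7, 0.2], [0.0, 0.6]]
vocabulary = ["flood", "council", "warning", "vote"]
keywords = top_keywords(W, vocabulary, 2)
```

Sorting in ascending order and taking the last entries, as the text also allows, selects the same words.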

Step 207: taking each of the events in turn as a to-be-processed event, choosing a second preset number of keywords from the keywords corresponding to the to-be-processed event as the identifier of the to-be-processed event, and determining the ratio of the number of data texts corresponding to the to-be-processed event to the total number of data texts in the to-be-processed data;

Step 208: presenting the to-be-processed event, according to the identifier and the ratio, in any one of the following presentation forms: table, pie chart, histogram, line chart, word cloud.

In this embodiment, so that different users can understand the event mining result intuitively, that is, see the current focus of attention on the microblog platform more directly, the event mining result can be presented visually. To this end, the event mining result needs some simple analysis or processing. For example, to ensure a good visual effect, a certain number of keywords may be further chosen from the keywords of each event as the identifier of that event. The keywords used as the event identifier may be chosen at random from the keywords of the event, or the keywords with larger weight values may be chosen according to the weight value of each keyword. As another example, to show more intuitively the importance or popularity of each event in the to-be-processed data, the ratio of the data texts corresponding to each event to the total number of data texts in the to-be-processed data may be computed.

Then, according to the identifier and the ratio, the to-be-processed event is presented in any one of the following presentation forms: table, pie chart, histogram, line chart, word cloud. For example, a table may show the identifier of each event, the corresponding number of data texts, and the corresponding proportion of data texts; in a word cloud, the font size used to display the identifier of each event may be determined by the proportion of data texts of that event; and so on.
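The ratio computation and the table presentation can be sketched as below. The source does not specify how a data text is matched to an event from the coefficient matrix, so the sketch assumes, purely for illustration, that each data text (column of H) belongs to the event (row) with the largest coefficient. The function names and the table layout are hypothetical.

```python
def event_shares(H):
    """Return a (count, ratio) pair per event.

    Assumption: a data text is assigned to the event whose row of H holds
    the largest coefficient in that data text's column.
    """
    n_texts = len(H[0])
    counts = [0] * len(H)
    for j in range(n_texts):
        best = max(range(len(H)), key=lambda k: H[k][j])
        counts[best] += 1
    return [(c, c / n_texts) for c in counts]


def render_table(identifiers, shares):
    """Render event identifiers, text counts and ratios as a plain-text table."""
    lines = ["event | texts | share"]
    for name, (count, share) in zip(identifiers, shares):
        lines.append(f"{name} | {count} | {share:.0%}")
    return "\n".join(lines)


# Toy coefficient matrix with two events and four data texts.
H = [[0.9, 0.1, 0.8, 0.7], [0.1, 0.7, 0.3, 0.6]]
shares = event_shares(H)
table = render_table(["flood/warning", "council/vote"], shares)
```

The same `shares` values could drive the other presentation forms, for example as slice sizes of a pie chart or font sizes in a word cloud.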

Fig. 3 is a schematic structural diagram of embodiment one of the event analysis system based on non-negative matrix factorization according to the invention. As shown in Fig. 3, the system comprises:

an acquisition module 11, configured to acquire to-be-processed data, the to-be-processed data comprising at least one data text;

a processing module 12, configured to perform word segmentation on each of the at least one data text to obtain a text space matrix corresponding to the to-be-processed data, the text space matrix describing the word information contained in the at least one data text;

a computing module 13, configured to perform non-negative matrix factorization on the text space matrix, determine, according to the basis matrix obtained by the factorization, the events contained in the to-be-processed data and the keywords describing each of the events, and determine, according to the coefficient matrix obtained by the factorization, the data texts corresponding to each of the events.

The system of this embodiment can be used to carry out the technical solution of the method embodiment shown in Fig. 1. Its implementation principle and technical effect are similar and are not repeated here.

Fig. 4 is a schematic structural diagram of embodiment two of the event analysis system based on non-negative matrix factorization according to the invention. As shown in Fig. 4, the system provided by this embodiment builds on the embodiment shown in Fig. 3. The to-be-processed data comprises N data texts, the total number of words contained in the N data texts is M, the text space matrix A_{M×N} is an M × N matrix, and N is an integer greater than or equal to 1;

the computing module 13 comprises:

a computing unit 131, configured to perform non-negative matrix factorization on the text space matrix A_{M×N} to obtain a basis matrix W_{M×K} and a coefficient matrix H_{K×N}, where K is the total number of events contained in the to-be-processed data;

a determining unit 132, configured to determine that each column vector of the basis matrix W_{M×K} characterizes a first event, and that the words contained in each column vector are the keywords describing the corresponding first event;

the determining unit 132 being further configured to determine that each row vector of the coefficient matrix H_{K×N} characterizes a second event, and that the data texts contained in each row vector are the data texts corresponding to the second event characterized by that row vector.

Further, the processing module 12 comprises:

a parsing unit 121, configured to perform semantic parsing on each data text to determine the nouns and verbs contained in each data text;

a marking unit 122, configured to mark the determined nouns and verbs, and determine the weight value of each noun and verb according to the following formula:

R(w) = (number of occurrences of w among the M words) × log(total number of data texts N / number of data texts containing w);

where w is any one of the nouns or verbs and R(w) is the weight value of w.

Specifically, the determining unit 132 is configured to:

determine that each column vector of the basis matrix W_{M×K} characterizes a first event, and that the target words contained in each column vector are the keywords describing the corresponding first event, the target words being the first preset number of nouns and verbs that rank highest when the words contained in the column vector are arranged in descending order of weight value.

Further, the system also comprises:

an analysis module 21, configured to take each of the events in turn as a to-be-processed event and choose a second preset number of keywords from the keywords corresponding to the to-be-processed event as the identifier of the to-be-processed event;

the analysis module 21 being further configured to determine the ratio of the number of data texts corresponding to the to-be-processed event to the total number of data texts in the to-be-processed data;

a presentation module 22, configured to present the to-be-processed event, according to the identifier and the ratio, in any one of the following presentation forms:

table, pie chart, histogram, line chart, word cloud.

The system of this embodiment can be used to carry out the technical solution of the method embodiment shown in Fig. 2. Its implementation principle and technical effect are similar and are not repeated here.

A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk or an optical disc.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the invention.

Claims

1. A method for analyzing events based on non-negative matrix factorization, characterized by comprising:

acquiring to-be-processed data, the to-be-processed data comprising at least one data text;

2. The method according to claim 1, characterized in that the to-be-processed data comprises N data texts, the total number of words contained in the N data texts is M, the text space matrix A_{M×N} is an M × N matrix, and N is an integer greater than or equal to 1;

the performing non-negative matrix factorization on the text space matrix, determining, according to the basis matrix obtained by the factorization, the events contained in the to-be-processed data and the keywords describing each of the events, and determining, according to the coefficient matrix obtained by the factorization, the data texts corresponding to each of the events, comprises:

performing non-negative matrix factorization on the text space matrix A_{M×N} to obtain a basis matrix W_{M×K} and a coefficient matrix H_{K×N}, where K is the total number of events contained in the to-be-processed data;

determining that each column vector of the basis matrix W_{M×K} characterizes a first event, and that the words contained in each column vector are the keywords describing the corresponding first event;

determining that each row vector of the coefficient matrix H_{K×N} characterizes a second event, and that the data texts contained in each row vector are the data texts corresponding to the second event characterized by that row vector.

3. The method according to claim 2, characterized in that the performing word segmentation on each of the at least one data text comprises:

performing semantic parsing on each data text to determine the nouns and verbs contained in each data text;

marking the determined nouns and verbs, and determining the weight value of each noun and verb according to the following formula:

4. The method according to claim 3, characterized in that the determining that each column vector of the basis matrix W_{M×K} characterizes a first event, and that the words contained in each column vector are the keywords describing the corresponding first event, comprises:

5. The method according to any one of claims 1 to 4, characterized in that, after the performing non-negative matrix factorization on the text space matrix, determining, according to the basis matrix obtained by the factorization, the events contained in the to-be-processed data and the keywords describing each of the events, and determining, according to the coefficient matrix obtained by the factorization, the data texts corresponding to each of the events, the method further comprises:

taking each of the events in turn as a to-be-processed event, and choosing a second preset number of keywords from the keywords corresponding to the to-be-processed event as the identifier of the to-be-processed event;

determining the ratio of the number of data texts corresponding to the to-be-processed event to the total number of data texts in the to-be-processed data;

presenting the to-be-processed event, according to the identifier and the ratio, in any one of the following presentation forms:

table, pie chart, histogram, line chart, word cloud.

6. A system for analyzing events based on non-negative matrix factorization, characterized by comprising:

7. The system according to claim 6, characterized in that the to-be-processed data comprises N data texts, the total number of words contained in the N data texts is M, the text space matrix A_{M×N} is an M × N matrix, and N is an integer greater than or equal to 1;

the computing module comprises:

a computing unit, configured to perform non-negative matrix factorization on the text space matrix A_{M×N} to obtain a basis matrix W_{M×K} and a coefficient matrix H_{K×N}, where K is the total number of events contained in the to-be-processed data;

a determining unit, configured to determine that each column vector of the basis matrix W_{M×K} characterizes a first event, and that the words contained in each column vector are the keywords describing the corresponding first event;

the determining unit being further configured to determine that each row vector of the coefficient matrix H_{K×N} characterizes a second event, and that the data texts contained in each row vector are the data texts corresponding to the second event characterized by that row vector.

8. The system according to claim 7, characterized in that the processing module comprises:

a parsing unit, configured to perform semantic parsing on each data text to determine the nouns and verbs contained in each data text;

a marking unit, configured to mark the determined nouns and verbs, and determine the weight value of each noun and verb according to the following formula:

9. The system according to claim 8, characterized in that the determining unit is specifically configured to:

10. The system according to any one of claims 6 to 9, characterized by further comprising:

an analysis module, configured to take each of the events in turn as a to-be-processed event and choose a second preset number of keywords from the keywords corresponding to the to-be-processed event as the identifier of the to-be-processed event;

the analysis module being further configured to determine the ratio of the number of data texts corresponding to the to-be-processed event to the total number of data texts in the to-be-processed data;

a presentation module, configured to present the to-be-processed event, according to the identifier and the ratio, in any one of the following presentation forms:

table, pie chart, histogram, line chart, word cloud.