JP4476078B2: Time series data judgment program
 Publication number
 JP4476078B2 (JP2004254856A)
 Authority
 JP
 Japan
 Prior art keywords
 data
 series data
 time
 occurrence matrix
 test
 Prior art date
 Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
 Expired - Fee Related
Description
The present invention relates to a time-series data determination program for determining whether or not time-series data belongs to one or more predetermined categories.
To detect so-called “spoofing”, in which an attacker steals a user's password and uses a computer while impersonating that user, it is effective to use an anomaly detection system to detect whether the time-series data input to the computer is abnormal, that is, whether the input time-series data was created by an impersonator. A known anomaly detection system first creates a profile that captures typical user behavior (the features that appear in time-series data created by the user). It then compares the profile of the input time-series data under test with the user's profile to identify whether the data is normal time-series data created by the legitimate user or abnormal time-series data created by an impersonator.
Typical input data to be inspected is time-series data such as the UNIX (registered trademark) commands used or the files accessed. The process of identifying whether input time-series data is normal or abnormal is divided into two steps. In the first step, features are extracted from the time-series data. In the second step, the extracted features are identified as normal or abnormal.
Typical conventional feature extraction methods for the first step are the histogram and the N-gram. With a histogram, the feature vector to be extracted is the vector of appearance frequencies of the items (events) that appear in the time-series data. With an N-gram, each run of N consecutive items forms one feature [Non-Patent Documents 1 to 3].
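The two conventional feature extractors described above can be sketched as follows. This is a minimal illustration, not code from the patent: the function names and the sample command sequence are assumptions made for the example.

```python
from collections import Counter

def histogram_features(events, vocabulary):
    """Appearance-frequency vector: one count per known event type."""
    counts = Counter(events)
    return [counts[e] for e in vocabulary]

def ngram_features(events, n):
    """Count every run of n consecutive events as one feature."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))

# Hypothetical command sequence, used only for illustration.
commands = ["cd", "ls", "less", "ls", "less", "cd", "ls", "cd", "cd", "ls"]
vocabulary = ["cd", "ls", "less"]

hist = histogram_features(commands, vocabulary)
bigrams = ngram_features(commands, 2)
```

The histogram discards the order of events entirely, and the bigram counts only capture pairs of adjacent events, which is the limitation the description goes on to address.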
Various methods have been proposed for the second step, in which the extracted features are identified as normal or abnormal. Typical methods include the rule base [Non-Patent Document 4], the automaton [Non-Patent Document 5], the Bayesian network [Non-Patent Document 6], Naive Bayes [Non-Patent Document 7], the neural network [Non-Patent Document 8], the Markov model [Non-Patent Document 9], and the hidden Markov model [Non-Patent Document 10].
[Non-Patent Document 1] N. Ye, X. Li, Q. Chen, S. M. Emran, and M. Xu, "Probabilistic Techniques for Intrusion Detection Based on Computer Audit Data", IEEE Transactions on Systems, Man, and Cybernetics, vol. 31, pp. 266-274, 2001.
[Non-Patent Document 2] S. A. Hofmeyr, S. Forrest, and A. Somayaji, "Intrusion Detection using Sequences of System Calls", Journal of Computer Security, vol. 6, pp. 151-180, 1998.
[Non-Patent Document 3] W. Lee and S. J. Stolfo, "A framework for constructing features and models for intrusion detection systems", Information and System Security, vol. 3, pp. 227-261, 2000.
[Non-Patent Document 4] N. Habra, B. Le Charlier, A. Mounji, and I. Mathieu, "ASAX: Software Architecture and Rule-Based Language for Universal Audit Trail Analysis", in Proc. of European Symposium on Research in Computer Security (ESORICS), pp. 435-450, 1992.
[Non-Patent Document 5] R. Sekar, M. Bendre, and P. Bollineni, "A Fast Automaton-Based Method for Detecting Anomalous Program Behaviors", in Proceedings of the 2001 IEEE Symposium on Security and Privacy, pp. 144-155, Oakland, May 2001.
[Non-Patent Document 6] W. DuMouchel, "Computer Intrusion Detection Based on Bayes Factors for Comparing Command Transition Probabilities", Technical Report TR91, National Institute of Statistical Sciences (NISS), 1999.
[Non-Patent Document 7] R. A. Maxion and T. N. Townsend, "Masquerade Detection Using Truncated Command Lines", in Proc. of the International Conference on Dependable Systems and Networks (DSN-02), pp. 219-228, 2002.
[Non-Patent Document 8] A. K. Ghosh, A. Schwartzbard, and M. Schatz, "A study in using neural networks for anomaly and misuse detection", in Proc. of USENIX Security Symposium, pp. 141-151, 1999.
[Non-Patent Document 9] S. Jha, K. M. C. Tan, and R. A. Maxion, "Markov Chains, Classifiers, and Intrusion Detection", in Proc. of 14th IEEE Computer Security Foundations Workshop, pp. 206-219, 2001.
[Non-Patent Document 10] C. Warrender, S. Forrest, and B. A. Pearlmutter, "Detecting Intrusions using System Calls: Alternative Data Models", in IEEE Symposium on Security and Privacy, pp. 133-145, 1999.
However, a histogram uses only the appearance-frequency vector of the items (events) in the time-series data as its feature, and an N-gram treats N consecutive items as one feature. These conventional methods therefore have the following problems: the dynamic information on user behavior in the time-series data (information on user behavior viewed as a time series, that is, the types of events that appear in the event time series and the order in which they appear) cannot be used or is lost, and only features of single events or of adjacent events can be represented.
An object of the present invention is to provide a time-series data determination program capable of determining whether time-series data includes a predetermined category (feature) by capturing the dynamic information contained in the time-series data.
Another object of the present invention is to provide a time-series data determination method with higher determination accuracy than before.
Another object of the present invention is to provide a time-series data abnormality determination method capable of determining whether or not time-series data is abnormal.
The present invention is based on the development of the Eigen Co-occurrence Matrix (ECM) method. The ECM method first associates the events contained in time-series data while taking time-series information into account. This association focuses on the relationship between two events and expresses the relationships of all event pairs as a co-occurrence matrix. The co-occurrence matrix can express all relationships between the items (events) appearing in the time-series data. This is a feature of time-series data that could not be expressed by a histogram or an N-gram. In a specific form of the invention, principal component analysis is performed on the co-occurrence matrices to generate an orthogonal principal component vector space. The features of each co-occurrence matrix are extracted as a vector in this principal component vector space. Because the features are extracted as vectors, various vector discrimination functions can be used.
The time-series data determination program of the present invention uses a feature extraction method and an identification method to determine whether time-series data including a plurality of types of events belongs to one or more predetermined categories. In particular, the feature extraction method is a statistical feature extraction method that takes as input a plurality of time-series data converted into matrix data, in which the relationship between each pair of event types among the plurality of types of events is represented as a co-occurrence matrix. The identification method uses the feature vector extracted by the statistical feature extraction method. Here, the plurality of types of events means the items that constitute the time-series data; when the time-series data consists of a plurality of commands, the commands are the events. A category, viewed as a higher-level concept, is a type of time-series data; viewed as a lower-level concept, it is the class to which a set of feature vectors (described later) obtained from time-series data belongs. For example, whether certain time-series data is normal can be determined by whether the time-series data belongs to one or more predetermined categories. In terms of the relationship between feature vectors and categories, a category corresponds to a partial region of the space in which the feature vectors exist. Any statistical feature extraction method may be used as long as it can extract a feature vector; for example, principal component analysis can be used. The identification method that uses the feature vector to determine which category the time-series data belongs to is also arbitrary, and the various known identification methods described in the prior-art section can of course be used.
The co-occurrence matrix employed in the program of the present invention can express all relationships between the items (events) appearing in time-series data. In other words, the co-occurrence matrix expresses the strength of the relevance between every pair of events in terms of their distance and appearance frequency. Therefore, according to the present invention, the dynamic information contained in the time-series data can be used to determine, with higher accuracy than before, whether or not the time-series data belongs to a predetermined category.
Converting a plurality of time-series input data into matrix data represented by co-occurrence matrices involves a window data extraction step, a scope data extraction step, and a co-occurrence matrix conversion step. In the window data extraction step, a plurality of window data are extracted by cutting the time-series input data with a window of a predetermined data length. The data length of the window may be determined according to the length of the time-series data. In the scope data extraction step, a plurality of scope data, each with a data length shorter than that of the window data, are extracted sequentially from the window data with a time lag. Specifically, one or more scope data for one type of event can be extracted by taking, as a reference position, each position at which that type of event (selected from the plurality of types of events) occurs in the window data. In the co-occurrence matrix conversion step, the window data are converted, on the basis of the scope data, into co-occurrence matrices that indicate the strength of the time-series relationship between the types of events contained in the window data. Specifically, the total number of events of a given type (the same type or another type) contained in the one or more scope data for one type of event is taken as the frequency of the given type with respect to that one type, and the window data is converted into a co-occurrence matrix by using this frequency as the value indicating the strength of the relevance between the two types. Converting to a co-occurrence matrix in this way yields a matrix that more appropriately indicates the time-series relationship between events.
To distinguish an impersonator from a legitimate user by executing the program of the present invention on a computer system, it is appropriate to treat the co-occurrence matrix as a pattern and apply a statistical pattern recognition method (identification method). The simplest pattern recognition method is based on matching between patterns. However, when the co-occurrence matrix itself is treated as a pattern, the dimension of the pattern becomes enormous. It is therefore more effective to extract features (which also compresses the information) and perform recognition on them. Effective feature extraction from the pattern can be expected to give a recognition result that is robust against fluctuations in the input pattern. Accordingly, in a more specific method of the present invention, principal component analysis is used as the feature extraction method to extract feature vectors from the co-occurrence matrix. Principal component analysis is a statistical feature extraction method that allows vector-format data to be represented by a small number of features (principal components). A widely known example of successful recognition using principal component analysis is the recognition of face images by Eigenface, by Turk et al. [M. Turk and A. Pentland, “Eigenfaces for Recognition”, Journal of Cognitive Neuroscience, vol. 3, no. 1, 1991]. The distinctive viewpoint of the specific method of the present invention is that a co-occurrence matrix is regarded as a face image.
Accordingly, the specific time-series data determination program of the present invention, which determines whether or not time-series data including a plurality of types of events belongs to one or more predetermined categories, causes a computer system to execute, in addition to the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step described above, an eigen co-occurrence matrix group determination step, a profile co-occurrence matrix conversion step, a determination feature vector extraction step, a test co-occurrence matrix conversion step, a test feature vector extraction step, and a determination step.
In the eigen co-occurrence matrix group determination step, the eigen co-occurrence matrix group that serves as the basis for obtaining feature vectors is determined by principal component analysis, with a plurality of co-occurrence matrices as input. In the profile co-occurrence matrix conversion step, the same steps as the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step are performed on one or more profile learning time-series data including one or more categories, converting them into one or more profile co-occurrence matrices. In the determination feature vector extraction step, one or more determination feature vectors for the profile learning time-series data are extracted based on the profile co-occurrence matrices and the eigen co-occurrence matrix group. In the test co-occurrence matrix conversion step, the same steps as the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step are performed on the test time-series data to be tested, converting it into a test co-occurrence matrix. The test feature vector extraction step extracts a test feature vector for the test time-series data based on the test co-occurrence matrix and the eigen co-occurrence matrix group. In the determination step, it is determined whether or not the test time-series data includes one or more categories, based on the one or more determination feature vectors and the test feature vector. When the eigen co-occurrence matrix group (Eigen Co-occurrence Matrix), which corresponds to the eigenfaces, is created through principal component analysis as in the specific method of the present invention, the original co-occurrence matrix can be expressed approximately in a low dimension.
In the determination step, specifically, a predetermined vector discrimination function is used to determine whether the test time-series data includes one or more categories, according to whether or not the Euclidean distance between the test feature vector (obtained from the test time-series data) and the determination feature vector is within a threshold. Using such a vector discrimination function makes the determination simple and more accurate.
To construct a highly accurate anomaly detection system, the user profile must be updated to keep up with concept drift. In the conventional method, updating the user profile requires the result of the discrimination function (feedback update). The profile is therefore not updated correctly when the result of the discrimination function is wrong. In the present invention, by contrast, the eigen co-occurrence matrix group is updated by including the test time-series data in the plurality of time-series data for learning, so the profile can be updated without using the result of the discrimination function (feed-forward update). The update can therefore be performed reliably.
Further, when the time-series data determination method of the present invention is used to determine abnormality in the time-series data input to a computer system, abnormal time-series data can be identified with higher accuracy than before.
According to the present invention, the dynamic information contained in time-series data can be used to determine, with higher accuracy than before, whether the time-series data includes a predetermined category.
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. FIG. 1 shows the structure of a program that implements an example embodiment of the time-series determination method of the present invention, which uses principal component analysis to determine whether or not time-series data including a plurality of types of events belongs to one or more predetermined categories. In the present embodiment, the plurality of time-series data for learning (used to obtain the eigen co-occurrence matrix group from which feature vectors are derived), the time-series data for profile learning (hereinafter, profile learning time-series data), and the time-series data to be tested (hereinafter, test time-series data) are each converted into co-occurrence matrices. Here, a co-occurrence matrix is matrix data that represents the relationship between each pair of event types among the plurality of types of events constituting the time-series data.
The step of converting time-series data into a co-occurrence matrix will now be described. FIG. 2 shows an example of the structure of a plurality of time-series data for learning [here, three time-series data sent from users 1 to 3, respectively, where a user is a person or another computer that accesses the computer and transmits time-series data]. In this example, the time-series data from each user consists of 20 commands (events). As described later, in this embodiment the time-series data of 20 commands is divided by a window with a data length of 10 commands (window data extraction step). In this window data extraction step, each time-series input data is cut out with a window of the predetermined data length (10 commands), yielding two window data. Note that the data length of the window may be determined according to the length of the time-series data.
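The window data extraction step can be sketched as below. This is a minimal illustration under stated assumptions: windows are taken as non-overlapping and an incomplete trailing window is dropped (the text only says that 20 commands yield two windows of 10), and the function name and the command sequence are hypothetical.

```python
def extract_windows(events, window_size):
    """Cut a time series into consecutive, non-overlapping windows.

    Assumption: an incomplete window at the end of the series is discarded.
    """
    last_start = len(events) - window_size
    return [events[i:i + window_size]
            for i in range(0, last_start + 1, window_size)]

# Hypothetical 20-command series for one user (not the data of FIG. 2).
series = ["cd", "ls", "less", "ls", "less", "cd", "ls", "cd", "cd", "ls",
          "emacs", "gcc", "gdb", "emacs", "ls", "gcc", "gdb", "ls", "ls", "emacs"]

windows = extract_windows(series, window_size=10)
```

With a window size of 10, the 20-command series yields two window data, matching the example in the text.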
Next, to represent the causal relationship between two events appearing in the time-series data of a given section, the data is converted into a co-occurrence matrix. Each element of the co-occurrence matrix represents the strength of the causal relationship between two events. To create a co-occurrence matrix, a window size w, a scope size s, and an event set B = {b1, b2, b3, ..., bm} are defined, where m is the number of events. The window size w determines the size of the event time series from which one feature vector is extracted, and the scope size s determines the width of the interval over which the causal relationship between two events is considered. In the data example shown in FIG. 2, w is 10 and s is 6. B is taken to be the eight unique commands (events) (m = 8) appearing in the time-series data for learning (domain data) of all three users: cd, ls, less, emacs, gcc, gdb, mkdir, and cp. The strength of the causal relationship between two events is defined by the distance between the events and the frequency with which they appear. That is, it is defined by counting how often the event of interest appears within the scope size (6) inside the window size (10). In the example of FIG. 2, two co-occurrence matrices are created for each user. In window 1 of FIG. 3, the element (frequency) 7 for event cd and event ls indicates that ls appeared seven times after cd within the scope size (6) inside the window size (10). The event pairs (cd, ls) and (ls, cd) have the largest elements in window 1 of FIG. 3, which indicates that these events are strongly related in this time series. The co-occurrence matrix thus represents the strength of the causal relationship between every pair of events appearing in the time-series data.
FIG. 3 will now be described in detail. First, as shown in FIG. 3, a plurality of scope data are extracted from the window data described above for each user's time-series data (scope data extraction step). In this step, scope data with a data length shorter than that of the window data are extracted sequentially from the window data with a time lag. In this example, scope data with a data length of six commands are extracted. Specifically, for one type of event (for example, cd) selected from the types of events contained in the 10 commands of the window data (cd, ls, and less in FIG. 3), one or more scope data are extracted, taking each position at which that event occurs in the window data as a reference position. In the example of FIG. 3, focusing on the event cd, the first scope data consists of the six commands (events) that follow the cd at the head of window 1 (the reference position), not including that cd itself. Next, the commands that follow the sixth event from the head, which is also cd, are extracted as the second scope data, again not including the reference cd. Because window 1 contains only 10 events, only four events are extracted as the second scope data. Similarly, the third and fourth scope data are extracted with the eighth and ninth events from the head (both cd) as reference positions.
Next, based on the plurality of scope data extracted from the window data, the strength of the time-series relationship between the types of events contained in the window data (the strength of the relationship between two events) is expressed by the frequency and distance at which the two events in question appear. For example, the total number of events of one type (in FIG. 3, the same type, cd) contained in the one or more (in FIG. 3, four) scope data for one type of event, cd, is taken as the frequency of that type of event with respect to cd. The window data is then converted into a co-occurrence matrix by using this frequency as the value indicating the strength of the relevance between the two types of events. In the example of FIG. 3, consider the frequency of the event cd with respect to the event cd in window 1. The first scope data contains one cd, the second contains two, the third contains one, and the fourth contains none. The frequency of the event cd with respect to the event cd is therefore 1 + 2 + 1 + 0 = 4. Similarly, for the event ls with respect to the event cd, the first scope data contains three ls, the second contains two, the third contains one, and the fourth contains one. The frequency of the event ls with respect to the event cd is therefore 3 + 2 + 1 + 1 = 7. Because they are counted within scope data, these frequencies reflect time or distance relationships, that is, the dynamic information contained in the time-series data. The right-hand area of FIG. 3 shows the matrix data obtained by converting windows 1 and 2 into co-occurrence matrices. Expressing time-series data by a co-occurrence matrix in this way makes it possible to model the fluid behavior of a human user.
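The scope data extraction and co-occurrence matrix conversion steps can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name is hypothetical, and the 10-command window is an assumed sequence chosen only so that its counts agree with the figures quoted in the text (cd at positions 1, 6, 8, and 9; cd to cd = 4; cd to ls = 7). It is not taken from FIG. 3 itself.

```python
def cooccurrence_matrix(window, event_set, scope_size):
    """Count, for every ordered pair (a, b), how often b occurs within
    scope_size events after an occurrence of a inside the window."""
    index = {event: k for k, event in enumerate(event_set)}
    matrix = [[0] * len(event_set) for _ in event_set]
    for pos, first in enumerate(window):
        # Scope data: the events following the reference position,
        # truncated where the window ends.
        scope = window[pos + 1:pos + 1 + scope_size]
        for second in scope:
            matrix[index[first]][index[second]] += 1
    return matrix

# Assumed window 1, consistent with the counts described in the text.
window1 = ["cd", "ls", "less", "ls", "less", "cd", "ls", "cd", "cd", "ls"]
events = ["cd", "ls", "less"]

ecm_input = cooccurrence_matrix(window1, events, scope_size=6)
cd, ls = events.index("cd"), events.index("ls")
```

Here `ecm_input[cd][cd]` reproduces 1 + 2 + 1 + 0 = 4 and `ecm_input[cd][ls]` reproduces 3 + 2 + 1 + 1 = 7. In the embodiment the event set would be all eight commands of the domain data rather than the three used here.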
To distinguish an impersonator from a legitimate user using the method of the present invention, the co-occurrence matrix is treated as a pattern, a feature vector is obtained using principal component analysis as the statistical feature extraction method, and this feature vector is then used for identification. Principal component analysis is a statistical feature extraction method that makes it possible to represent vector-format data with a small number of features (principal components). It composes new variables expressed as linear combinations of the variables in multivariate data and summarizes them into "principal components" that are uncorrelated with one another. In this embodiment, the co-occurrence matrix is treated in the same way as a face image in the Eigenface method proposed by Turk et al. For this reason, the time-series data determination method of the present invention is called the Eigen Co-occurrence Matrix (ECM) method in the present application.
As shown in FIG. 1, learning time-series data for creating the eigen co-occurrence matrix group is selected from the time-series data and used as domain data. The co-occurrence matrix converted from each window is treated like a face image in the Eigenface method published by M. Turk et al., and eigenvalues and the corresponding eigenvectors are obtained by principal component analysis. The eigenvalues are then arranged in descending order, and the N eigenvectors corresponding to the largest eigenvalues are selected and reshaped into matrices to form the eigen co-occurrence matrix group.
Feature vector extraction from the co-occurrence matrix using principal component analysis is performed by the following procedure. First, the i-th co-occurrence matrix among the p learning co-occurrence matrices obtained from the learning time-series data is expressed as an N-dimensional vector x_i in which the values of its elements are arranged. Here, p is the number of samples, and N is the square of the number of events. The average vector of the p co-occurrence matrices is obtained as the average co-occurrence matrix by the following equation. Here, the average co-occurrence matrix indicates the relationship between event pairs (between two terms).
$$\bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i \qquad \text{[Equation 1]}$$
The vector obtained by subtracting the average co-occurrence matrix (average vector) from each co-occurrence matrix is represented by
$$\tilde{x}_i = x_i - \bar{x} \qquad \text{[Equation 2]}$$
Subtracting the average co-occurrence matrix moves the origin of the coordinate axes to the average. The average co-occurrence matrix (an m × m matrix) is subtracted from each co-occurrence matrix, the result is vectorized (each m × m matrix is changed into an m^2-dimensional column vector), and the resulting vectors are arranged as the columns of the matrix represented by
$$X = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_p] \qquad \text{[Equation 3]}$$
The matrix obtained by multiplying this matrix by its transpose, $X X^{T}$, is the covariance matrix (an m^2 × m^2 matrix) shown in the figure.
Next, an orthonormal basis a that optimally approximates the set of learning co-occurrence matrices is formed from the eigenvectors of the covariance matrix of the matrix X expressed by [Equation 3]. For this purpose, eigenvalues and eigenvectors are calculated from the covariance matrix (the eigenvectors of the m^2 × m^2 matrix are calculated). Here, an eigenvalue represents the strength of a feature, and the eigenvectors represent feature axes that are uncorrelated with one another. Each eigenvector a_l of a is called an eigen co-occurrence matrix, and their set is called the eigen co-occurrence matrix group (principal components).
Specifically, the eigenvalues are sorted in descending order and the corresponding eigenvectors are obtained (only N of the m^2 eigenvectors are selected). Sorting the eigenvectors by eigenvalue arranges the axes in order of feature strength. Each of the N eigenvectors can be turned back into a matrix (each m^2-dimensional vector is converted into an m × m matrix), and the resulting set is defined as the eigen co-occurrence matrix group. The feature vector C (the principal component scores) is obtained by calculating the inner product of the co-occurrence matrix x, converted into a column vector, with the orthonormal basis a. Each component c_1, c_2, ..., c_N of C represents the contribution of the corresponding eigen co-occurrence matrix to representing the co-occurrence matrix x. When feature vectors are extracted from the co-occurrence matrix as in the present embodiment, various vector-space techniques can be used to identify the feature vectors.
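The procedure above (mean subtraction, vectorization, covariance matrix, leading eigenvectors, inner-product projection) can be sketched in plain Python as below. This is an illustrative sketch, not the patent's implementation: power iteration with deflation stands in for a library eigen-solver, all names and the three tiny 2 × 2 sample matrices are hypothetical, and the number of components requested is assumed not to exceed the rank of the centered data.

```python
import math
import random

def flatten(matrix):
    """Arrange the elements of an m x m matrix as one m^2-dimensional vector."""
    return [value for row in matrix for value in row]

def unflatten(vector, m):
    """Reshape an m^2-dimensional vector back into an m x m matrix."""
    return [vector[r * m:(r + 1) * m] for r in range(m)]

def mat_vec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fit_eigen_group(matrices, n_components, iters=200, seed=0):
    """Return (mean vector, eigenvalues, eigenvectors), largest eigenvalue first."""
    xs = [flatten(mat) for mat in matrices]
    p, dim = len(xs), len(xs[0])
    mean = [sum(col) / p for col in zip(*xs)]
    centered = [[x - mu for x, mu in zip(vec, mean)] for vec in xs]
    # X X^T, where the centered vectors are the columns of X.
    cov = [[sum(v[i] * v[j] for v in centered) for j in range(dim)]
           for i in range(dim)]
    rng = random.Random(seed)
    eigenvalues, eigenvectors = [], []
    for _ in range(n_components):
        v = normalize([rng.random() + 0.1 for _ in range(dim)])
        for _ in range(iters):  # power iteration
            w = mat_vec(cov, v)
            if math.sqrt(sum(x * x for x in w)) < 1e-12:
                break  # no variance left to extract
            v = normalize(w)
        lam = sum(x * y for x, y in zip(v, mat_vec(cov, v)))
        eigenvalues.append(lam)
        eigenvectors.append(v)
        # Deflation: remove the component just found.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(dim)]
               for i in range(dim)]
    return mean, eigenvalues, eigenvectors

def feature_vector(matrix, mean, eigenvectors):
    """Inner products of the mean-subtracted matrix with each eigenvector."""
    centered = [x - mu for x, mu in zip(flatten(matrix), mean)]
    return [sum(a * c for a, c in zip(vec, centered)) for vec in eigenvectors]

# Hypothetical learning co-occurrence matrices (m = 2, p = 3).
samples = [[[4, 7], [7, 5]], [[2, 1], [3, 0]], [[0, 4], [2, 1]]]
mean, eigenvalues, eigenvectors = fit_eigen_group(samples, n_components=2)
eigen_matrices = [unflatten(vec, 2) for vec in eigenvectors]
features = [feature_vector(mat, mean, eigenvectors) for mat in samples]
```

With p = 3 samples the centered data has rank at most 2, so two components reconstruct each sample exactly. In practice a numerical library's symmetric eigen-solver would normally replace the power-iteration loop.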
The parts relating to the time-series data determination method of the present invention are described below. In the determination method, in addition to the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step used in the co-occurrence matrix conversion described above, an eigen co-occurrence matrix group determination step, a profile co-occurrence matrix conversion step, a determination feature vector extraction step, a test co-occurrence matrix conversion step, a test feature vector extraction step, and a determination step are performed.
First, in the eigen co-occurrence matrix group determination step, as described above, the eigen co-occurrence matrix group (the set of eigen co-occurrence matrices, that is, the principal components) that serves as the basis for obtaining feature vectors is determined by principal component analysis, with a plurality of co-occurrence matrices (obtained by converting the learning time-series data) as input.
In the profile co-occurrence matrix conversion step, the same steps as the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step described above are performed on one or more profile learning time-series data including one or more categories, converting them into one or more profile co-occurrence matrices. Time-series data that is clearly known to have been created by a normal user is used as the profile learning time-series data. The profile learning time-series data may of course be selected from the learning time-series data. If 100 users access a given computer, the time-series data created by those 100 users is converted into profile co-occurrence matrices as profile learning time-series data.
Next, in the determination feature vector extraction step, a determination feature vector for each profile learning time-series data is extracted based on the profile co-occurrence matrix and the eigen co-occurrence matrix group. The determination feature vectors extracted in this way are stored in advance in computer memory. FIG. 1 does not show the profile learning time-series data explicitly, but it is converted into a co-occurrence matrix, and its feature vector is obtained, by the same route as the test time-series data.
Next, in the test co-occurrence matrix conversion step, the same steps as the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step are performed on the test time-series data to be tested, converting it into a test co-occurrence matrix. The test feature vector extraction step extracts a test feature vector for the test time-series data based on the test co-occurrence matrix and the eigen co-occurrence matrix group. To extract the test feature vector, as shown in FIG. 1, the average co-occurrence matrix is subtracted from the test co-occurrence matrix and the result is vectorized; the inner product of this vector with each vectorized member of the previously obtained eigen co-occurrence matrix group is then calculated.
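The test feature vector extraction can be sketched directly on matrices, since the inner product of two vectorized matrices equals the sum of their element-wise products. This is a minimal illustration: the function name, the average matrix, and the two orthonormal "eigen co-occurrence matrices" are hypothetical values chosen so the arithmetic is easy to follow, not results from the patent.

```python
def extract_feature_vector(cooc_matrix, mean_matrix, eigen_matrices):
    """Subtract the average co-occurrence matrix, then take the inner
    product with each eigen co-occurrence matrix."""
    m = len(cooc_matrix)
    diff = [[cooc_matrix[r][c] - mean_matrix[r][c] for c in range(m)]
            for r in range(m)]
    return [sum(diff[r][c] * eig[r][c] for r in range(m) for c in range(m))
            for eig in eigen_matrices]

# Hypothetical 2 x 2 inputs for illustration only.
mean_matrix = [[2, 4], [4, 2]]
eigen_matrices = [
    [[0.5, 0.5], [0.5, 0.5]],    # unit norm
    [[0.5, -0.5], [0.5, -0.5]],  # unit norm, orthogonal to the first
]
test_matrix = [[3, 5], [5, 3]]

test_features = extract_feature_vector(test_matrix, mean_matrix, eigen_matrices)
```

The same function serves for the determination feature vectors, because profile learning data follows the same route as test data.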
In the determination step, it is determined whether or not the test time-series data includes one or more categories, based on the test feature vector and the determination feature vectors obtained and stored in advance. Specifically, a predetermined vector discrimination function is used to determine whether the test time-series data includes one or more categories (that is, whether it is time-series data created by the user or time-series data created by an impersonator) according to whether or not the Euclidean distance between the test feature vector and the determination feature vector is within a threshold.
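The determination step can be sketched as a threshold on Euclidean distance. This is a minimal illustration under an assumption: the text does not say how several stored determination feature vectors are combined, so the sketch uses the distance to the nearest one. The function names, vectors, and threshold are hypothetical.

```python
import math

def euclidean_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_normal(test_vector, determination_vectors, threshold):
    """Accept the test data as the user's own if its feature vector lies
    within the threshold of at least one stored determination vector.

    Assumption: nearest-vector rule; the source only states that the
    distance to the determination feature vector is compared to a threshold.
    """
    nearest = min(euclidean_distance(test_vector, d)
                  for d in determination_vectors)
    return nearest <= threshold

# Hypothetical stored profile vectors and test vectors.
profile = [[2.0, 0.0], [1.0, 1.0]]
accepted = is_normal([2.0, 0.5], profile, threshold=1.0)
rejected = is_normal([10.0, 10.0], profile, threshold=1.0)
```

Data that falls outside the threshold would be treated as created by an impersonator.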
To build an accurate anomaly detection system (a method for discriminating anomalous time-series data), the user profile (the user's determination feature vector) must be updated to keep up with concept drift. In the conventional method shown in FIG. 4, updating the user profile requires the output of the discrimination function (feedback update), so the profile is not updated correctly when that output is wrong. In this embodiment, by contrast, as shown in FIG. 5, the eigen co-occurrence matrix group is updated by including the test time-series data in the plurality of learning time-series data (the domain). The profile can thus be updated without using the output of the discrimination function (feed-forward update), so the update is performed reliably.
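The feed-forward update can be illustrated with a small sketch in which the test data is appended to the learning set and the basis is recomputed, with no reference to any classification result. As a stand-in for the full principal component analysis, the example extracts only the leading principal axis by power iteration. The function names, the two-dimensional "vectorized matrices", and the iteration count are assumptions made for illustration.

```python
def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def leading_component(vectors, iterations=200):
    """Leading principal axis by power iteration (one basis vector only)."""
    mean = mean_vector(vectors)
    centered = [[x - m for x, m in zip(v, mean)] for v in vectors]
    axis = [1.0] * len(mean)  # start vector; must not be orthogonal to the true axis
    for _ in range(iterations):
        # Apply the scatter matrix without forming it: sum_i (c_i . axis) * c_i
        scores = [sum(c * a for c, a in zip(row, axis)) for row in centered]
        new = [sum(s * row[k] for s, row in zip(scores, centered))
               for k in range(len(axis))]
        norm = sum(x * x for x in new) ** 0.5
        if norm == 0.0:
            break
        axis = [x / norm for x in new]
    return mean, axis


def feed_forward_update(learning_vectors, test_vector):
    """Add the test data to the learning set and recompute the basis.

    No discrimination result is consulted, which is the point of the update.
    """
    updated = learning_vectors + [test_vector]
    return updated, leading_component(updated)


learning = [[0.0, 0.0], [2.0, 0.0]]            # vectorized learning matrices
_, axis_before = leading_component(learning)
learning, (_, axis_after) = feed_forward_update(learning, [1.0, 4.0])
print(axis_before, axis_after)
```

In this toy case the leading axis turns from the first coordinate to the second once the test data is included, showing that the basis follows the incoming data regardless of how that data was classified.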
Further, when the time-series data determination program of the present invention is executed on a computer system to determine abnormality in time-series data input to that system, abnormal time-series data can be determined with higher accuracy than before.
An impersonation detection experiment was carried out for this embodiment using the UNIX (registered trademark) command data provided by Schonlau et al. (M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus and Y. Vardi, "Computer intrusion: Detecting masquerades", Statistical Science, 16(1): 58-74, 2001). The purpose of the experiment was to examine how the size of the learning time-series data (domain data) affects the accuracy of impersonation detection. FIG. 6 and FIG. 7 show the results. In Experiment 1, the first 50 windows of all users were used as domain data; in Experiment 2, the first 75 windows of all users were used. The results show that the detection rate in Experiment 2, with the larger domain data, was better than that in Experiment 1.
In the above embodiment, principal component analysis is used as the statistical feature extraction method, but other statistical feature extraction methods can of course be used in the method of the present invention. Likewise, this embodiment uses the Euclidean distance between feature vectors as the discrimination method, but various other vector discrimination methods can be used.
Claims (5)
1. A time-series data determination program for determining whether time-series data that is input to a computer, such as commands and files, and that includes a plurality of types of events belongs to one or more predetermined categories, thereby determining whether the time-series data is abnormal time-series data created by an impersonator, the program causing a computer system to execute:
a window data extraction step of extracting a plurality of window data by cutting out, in advance, a plurality of time-series data for learning with a window having a predetermined data length;
a scope data extraction step of sequentially extracting, with a time lag, a plurality of scope data having a data length shorter than said data length from the window data;
a co-occurrence matrix conversion step of converting, based on the plurality of scope data, the plurality of window data into a plurality of co-occurrence matrices indicating the strength of relevance, viewed in time series, of the plurality of types of events included in the window data;
an eigen co-occurrence matrix group determining step of determining, by a statistical feature extraction method using the plurality of co-occurrence matrices as input, an eigen co-occurrence matrix group serving as a basis for obtaining a feature vector;
a profile co-occurrence matrix conversion step of performing the same steps as the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step on one or more profile-learning time-series data belonging to the one or more categories, thereby converting the one or more profile-learning time-series data into one or more profile co-occurrence matrices;
a determination feature vector extraction step of extracting one or more determination feature vectors for the one or more profile-learning time-series data based on the one or more profile co-occurrence matrices and the eigen co-occurrence matrix group;
a test co-occurrence matrix conversion step of performing the same steps as the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step on test time-series data to be tested, thereby converting the test time-series data into a test co-occurrence matrix;
a test feature vector extraction step of extracting a test feature vector for the test time-series data based on the test co-occurrence matrix and the eigen co-occurrence matrix group; and
a determination step of determining, based on the one or more determination feature vectors and the test feature vector, whether the test time-series data belongs to the one or more categories,
wherein the eigen co-occurrence matrix group is updated by including the test time-series data in the plurality of time-series data for learning.

2. The time-series data determination program according to claim 1, wherein, in the scope data extraction step, one or more scope data for one type of event selected from the plurality of types of events are extracted using, as a reference position, a position at which that type of event is included in the window data, and
in the co-occurrence matrix conversion step, the total number of events of another type included in the one or more scope data for the one type of event is calculated as the frequency of the other type of event with respect to the one type of event, and the window data is converted into the co-occurrence matrix with the frequency as the value indicating the strength of relevance of the other type of event to the one type of event.

3. The time-series data determination program, wherein, in the determination feature vector extraction step, the profile co-occurrence matrix and the eigen co-occurrence matrix group are vectorized and their inner product is then obtained to extract the determination feature vector, and
in the test feature vector extraction step, the test co-occurrence matrix and the eigen co-occurrence matrix group are vectorized and their inner product is then obtained to extract the test feature vector.

4. The time-series data determination program according to claim 1, wherein, in the determination step, whether the test time-series data belongs to the one or more categories is determined, using a predetermined vector discrimination function, according to whether a Euclidean distance between the test time-series data and the determination feature vector is within a threshold.

5. A time-series data abnormality determination method, wherein an abnormality of time-series data such as commands or files input to a computer system is determined by a computer using the time-series data determination program according to any one of claims 1 to 4.
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title 

JP2004254856A JP4476078B2 (en)  2004-09-01  2004-09-01  Time series data judgment program 
Publications (2)
Publication Number  Publication Date 

JP2006072666A (en)  2006-03-16 
JP4476078B2 (en)  2010-06-09 
Family
ID=36153236
Family Applications (1)
Application Number  Title  Priority Date  Filing Date 

JP2004254856A (Expired, Fee Related)  JP4476078B2 (en)  2004-09-01  2004-09-01  Time series data judgment program 
Country Status (1)
Country  Link 

JP (1)  JP4476078B2 (en) 
Families Citing this family (4)
Publication number  Priority date  Publication date  Assignee  Title 

JP5088233B2 (en)  2008-05-21  2012-12-05  Fujitsu Limited  Operation management apparatus, display method, and program 
JP5928165B2 (en) *  2012-06-01  2016-06-01  Fujitsu Limited  Abnormal transition pattern detection method, program, and apparatus 
JP5936240B2 (en)  2014-09-12  2016-06-22  International Business Machines Corporation  Data processing apparatus, data processing method, and program 
EP3358570A4 (en) *  2015-09-29  2018-08-08  Fujitsu Limited  Program, information processing method, and information processing device 

Also Published As
Publication number  Publication date 

JP2006072666A (en)  2006-03-16 
Legal Events
Date  Code  Title  Description 

A131  Notification of reasons for refusal 
Free format text: JAPANESE INTERMEDIATE CODE: A131 Effective date: 2007-06-12 

A521  Written amendment 
Free format text: JAPANESE INTERMEDIATE CODE: A523 Effective date: 2007-08-10 

A02  Decision of refusal 
Free format text: JAPANESE INTERMEDIATE CODE: A02 Effective date: 2007-10-02 

A521  Written amendment 
Free format text: JAPANESE INTERMEDIATE CODE: A523 Effective date: 2007-11-08 

A911  Transfer of reconsideration by examiner before appeal (zenchi) 
Free format text: JAPANESE INTERMEDIATE CODE: A911 Effective date: 2007-12-07 

A912  Removal of reconsideration by examiner before appeal (zenchi) 
Free format text: JAPANESE INTERMEDIATE CODE: A912 Effective date: 2008-09-26 

A01  Written decision to grant a patent or to grant a registration (utility model) 
Free format text: JAPANESE INTERMEDIATE CODE: A01 

A61  First payment of annual fees (during grant procedure) 
Free format text: JAPANESE INTERMEDIATE CODE: A61 Effective date: 2010-03-09 

R150  Certificate of patent or registration of utility model 
Free format text: JAPANESE INTERMEDIATE CODE: R150 

FPAY  Renewal fee payment (event date is renewal date of database) 
Free format text: PAYMENT UNTIL: 2013-03-19 Year of fee payment: 3 

FPAY  Renewal fee payment (event date is renewal date of database) 
Free format text: PAYMENT UNTIL: 2014-03-19 Year of fee payment: 4 

LAPS  Cancellation because of no payment of annual fees 