JP4476078B2 - Time series data judgment program - Google Patents


Info

Publication number
JP4476078B2
Authority
JP
Japan
Prior art keywords
data
series data
time
occurrence matrix
test
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP2004254856A
Other languages
Japanese (ja)
Other versions
JP2006072666A (en
Inventor
和彦 加藤
恵弘 大山
瑞起 岡
Original Assignee
独立行政法人科学技術振興機構
Application filed by 独立行政法人科学技術振興機構
Priority to JP2004254856A
Publication of JP2006072666A
Application granted
Publication of JP4476078B2
Application status is Expired - Fee Related

Description

The present invention relates to a time-series data determination program for determining whether or not time-series data belongs to one or more predetermined categories.

In so-called "spoofing", an attacker steals a user's password and uses a computer while impersonating that user. To detect it, it is effective to use an anomaly detection system to determine whether the time-series data input to the computer contains an abnormality, that is, whether the input time-series data was created by an impersonator. In a known anomaly detection system, a profile representing typical user behavior (the features that appear in time-series data created by the user) is created first. The profile of the input data (time-series data) to be tested is then compared with the user's profile to identify whether the data is time-series data created by the normal user or abnormal time-series data created by an impersonator.

Typical input data to be inspected are time-series data such as the UNIX (registered trademark) commands used and the files accessed. The process of identifying whether the input time-series data is normal or abnormal is divided into two steps. In the first step, features are extracted from the time-series data. In the second step, the extracted features are identified as normal or abnormal.

Typical conventional methods for performing feature extraction in the first step include the histogram and the N-gram. In the histogram, the appearance frequency vector of the items (events) appearing in the time-series data is the feature vector to be extracted. In the N-gram, each run of N consecutive items is treated as one feature [Non-Patent Documents 1 to 3].
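The two conventional feature extractors can be illustrated with a minimal sketch. The command sequence and the function names `histogram_features` and `ngram_features` are hypothetical and chosen only for illustration; they do not come from the cited documents.

```python
from collections import Counter


def histogram_features(events, vocabulary):
    """Appearance-frequency vector: one count per event type, order ignored."""
    counts = Counter(events)
    return [counts[e] for e in vocabulary]


def ngram_features(events, n):
    """Count every run of n consecutive events (adjacent events only)."""
    return Counter(tuple(events[i:i + n]) for i in range(len(events) - n + 1))


# Hypothetical command sequence, for illustration only.
commands = ["cd", "ls", "less", "ls", "less", "cd", "ls", "cd", "cd", "ls"]
vocabulary = ["cd", "ls", "less"]

hist = histogram_features(commands, vocabulary)
bigrams = ngram_features(commands, 2)
print(hist)
print(bigrams.most_common(3))
```

The histogram discards the order of events entirely, while the 2-gram keeps order only between immediately adjacent events.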

Various methods have been proposed for identifying, in the second step, whether the extracted features are normal or abnormal. Typical methods include the rule base [Non-Patent Document 4], the automaton [Non-Patent Document 5], the Bayesian network [Non-Patent Document 6], Naive Bayes [Non-Patent Document 7], the neural network [Non-Patent Document 8], the Markov model [Non-Patent Document 9], and the hidden Markov model [Non-Patent Document 10].

Non-Patent Document 1: N. Ye, X. Li, Q. Chen, S. M. Emran, and M. Xu, "Probabilistic Techniques for Intrusion Detection Based on Computer Audit Data", IEEE Transactions on Systems, Man, and Cybernetics, vol. 31, pp. 266-274, 2001.
Non-Patent Document 2: S. A. Hofmeyr, S. Forrest, and A. Somayaji, "Intrusion Detection using Sequences of System Calls", Journal of Computer Security, vol. 6, pp. 151-180, 1998.
Non-Patent Document 3: W. Lee and S. J. Stolfo, "A framework for constructing features and models for intrusion detection systems", ACM Transactions on Information and System Security, vol. 3, pp. 227-261, 2000.
Non-Patent Document 4: N. Habra, B. Le Charlier, A. Mounji, and I. Mathieu, "ASAX: Software Architecture and Rule-Based Language for Universal Audit Trail Analysis", in Proc. of European Symposium on Research in Computer Security (ESORICS), pp. 435-450, 1992.
Non-Patent Document 5: R. Sekar, M. Bendre, and P. Bollineni, "A Fast Automaton-Based Method for Detecting Anomalous Program Behaviors", in Proceedings of the 2001 IEEE Symposium on Security and Privacy, pp. 144-155, Oakland, May 2001.
Non-Patent Document 6: W. DuMouchel, "Computer Intrusion Detection Based on Bayes Factors for Comparing Command Transition Probabilities", Technical Report TR91, National Institute of Statistical Sciences (NISS), 1999.
Non-Patent Document 7: R. A. Maxion and T. N. Townsend, "Masquerade Detection Using Truncated Command Lines", in Proc. of the International Conference on Dependable Systems and Networks (DSN-02), pp. 219-228, 2002.
Non-Patent Document 8: A. K. Ghosh, A. Schwartzbard, and M. Schatz, "A study in using neural networks for anomaly and misuse detection", in Proc. of USENIX Security Symposium, pp. 141-151, 1999.
Non-Patent Document 9: S. Jha, K. M. C. Tan, and R. A. Maxion, "Markov Chains, Classifiers and Intrusion Detection", in Proc. of 14th IEEE Computer Security Foundations Workshop, pp. 206-219, 2001.
Non-Patent Document 10: C. Warrender, S. Forrest, and B. A. Pearlmutter, "Detecting Intrusions using System Calls: Alternative Data Models", in IEEE Symposium on Security and Privacy, pp. 133-145, 1999.

However, the histogram uses only the appearance frequency vector of the items (events) appearing in the time-series data as a feature, and the N-gram treats N consecutive items as one feature. These conventional methods therefore have the problem that the dynamic information on user behavior in the time-series data (information on user behavior viewed in time series, that is, the types of events appearing in the event time series and the order in which they appear) cannot be used or is lost, because only the features of single events, or only the features between adjacent events, can be represented.

An object of the present invention is to provide a time-series data determination program capable of determining whether time-series data includes a predetermined category (feature) by capturing the dynamic information included in the time-series data.

  Another object of the present invention is to provide a time-series data determination method with higher determination accuracy than in the past.

  Another object of the present invention is to provide a time-series data abnormality determination method capable of determining whether or not time-series data has an abnormality.

The present invention is based on the development of the Eigen Co-occurrence Matrix (ECM) method. The ECM method first associates the events included in time-series data with one another while taking time-series information into account. The association focuses on the relationship between two events and expresses the relationships of all event pairs as a co-occurrence matrix. The co-occurrence matrix can express all relationships between the items (events) appearing in the time-series data, a feature of time-series data that could not be expressed by a histogram or an N-gram. In a specific form of the invention, principal component analysis is performed on the co-occurrence matrices to generate an orthogonal principal component vector space, and the features of each co-occurrence matrix are extracted as a vector in that space. Extracting the features as vectors makes it possible to use various vector discriminant functions.

The time-series data determination program of the present invention uses a feature extraction method and an identification method to determine whether time-series data including a plurality of types of events belongs to one or more predetermined categories. In particular, the feature extraction method is a statistical feature extraction method that takes as input a plurality of time-series data converted into matrix data in which the relationship between each two types of events among the plurality of types of events is represented as a co-occurrence matrix. The identification method is a method that uses the feature vectors extracted by the statistical feature extraction method. Here, the plurality of types of events are the items that constitute the time-series data; when the time-series data consists of commands, the commands are the events. A category, viewed as a higher-level concept, is a type of time-series data; viewed as a lower-level concept, it is the class to which a set of the feature vectors (described later) obtained from time-series data belongs. For example, whether certain time-series data is normal can be determined by whether it belongs to one or more predetermined categories. In terms of the relationship between feature vectors and categories, a category corresponds to a partial region of the space in which the feature vectors exist. Any statistical feature extraction method may be used as long as it can extract feature vectors; for example, principal component analysis can be used. The identification method for determining which category the time-series data belongs to using the feature vectors is also arbitrary, and the various known identification methods described in the background section can of course be used.

The co-occurrence matrix employed in the program of the present invention can express all relationships between the items (events) appearing in time-series data. In other words, the co-occurrence matrix expresses the strength of the relationship between every pair of events in terms of their distance and appearance frequency. Therefore, according to the present invention, whether time-series data belongs to a predetermined category can be determined with higher accuracy than before by using the dynamic information included in the time-series data.

Converting a plurality of time-series input data into matrix data represented by co-occurrence matrices involves a window data extraction step, a scope data extraction step, and a co-occurrence matrix conversion step. In the window data extraction step, a plurality of window data are extracted by cutting the time-series input data with a window having a predetermined data length. The data length of the window may be determined according to the length of the time-series data. In the scope data extraction step, a plurality of scope data, each having a data length shorter than that of the window data, are sequentially extracted from the window data with a time lag. In a specific scope data extraction step, one type of event is selected from the plurality of types of events, and one or more scope data for that type of event are extracted using each position at which it appears in the window data as a reference position. In the co-occurrence matrix conversion step, each window data is converted, based on the plurality of scope data, into a co-occurrence matrix indicating the strength of the relationships, viewed in time series, between the plurality of types of events included in the window data. Specifically, the total number of events of the one type, or of another type, included in the one or more scope data for the one type of event is taken as the frequency of that event with respect to the one type of event, and this frequency is converted into a value indicating the strength of the relationship between the two; in this way the window data is converted into a co-occurrence matrix. Converting the data in this way yields a co-occurrence matrix that more appropriately indicates the relationships between events viewed in time series.

To identify an impersonator posing as a legitimate user by executing the program of the present invention on a computer system, it is appropriate to treat the co-occurrence matrix as a pattern and apply a statistical pattern recognition method (identification method). The simplest pattern recognition method is matching between patterns. However, when the co-occurrence matrix itself is treated as a pattern, the dimension of the pattern becomes enormous. It is therefore more effective to extract features (which also compresses the information) and perform recognition on them. Effective feature extraction from the pattern can be expected to give recognition results that are robust against fluctuations in the input pattern. In a more specific method of the present invention, principal component analysis is therefore used as the feature extraction method to extract feature vectors from the co-occurrence matrices. Principal component analysis is a statistical feature extraction method that allows data in vector form to be represented by a small number of features (principal components). A widely known example of successful recognition using principal component analysis is the recognition of face images by Eigenface [M. Turk and A. Pentland, "Eigenfaces for Recognition", Journal of Cognitive Neuroscience, vol. 3, no. 1, 1991]. The distinctive idea of the specific method of the present invention is to regard a co-occurrence matrix as the counterpart of a face image.

Accordingly, the specific time-series data determination program of the present invention, which determines whether time-series data including a plurality of types of events belongs to one or more predetermined categories, causes a computer system to execute, in addition to the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step described above, an eigen co-occurrence matrix group determination step, a profile co-occurrence matrix conversion step, a determination feature vector extraction step, a test co-occurrence matrix conversion step, a test feature vector extraction step, and a determination step.

In the eigen co-occurrence matrix group determination step, the eigen co-occurrence matrix group that serves as the basis for obtaining feature vectors is determined by principal component analysis with a plurality of co-occurrence matrices as input. In the profile co-occurrence matrix conversion step, the same steps as the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step are performed on one or more profile learning time-series data belonging to the one or more categories, converting them into one or more profile co-occurrence matrices. In the determination feature vector extraction step, one or more determination feature vectors for the one or more profile learning time-series data are extracted based on the one or more profile co-occurrence matrices and the eigen co-occurrence matrix group. In the test co-occurrence matrix conversion step, the same three steps are performed on the test time-series data to be tested, converting it into a test co-occurrence matrix. In the test feature vector extraction step, a test feature vector for the test time-series data is extracted based on the test co-occurrence matrix and the eigen co-occurrence matrix group. In the determination step, whether the test time-series data belongs to the one or more categories is determined based on the one or more determination feature vectors and the test feature vector. When the eigen co-occurrence matrix group, which corresponds to the eigenfaces, is created by principal component analysis as in this specific method, the original co-occurrence matrix can be expressed approximately in a lower dimension.

Specifically, in the determination step, a predetermined vector discriminant function is used to determine whether the test time-series data belongs to the one or more categories according to whether the Euclidean distance between the test feature vector and the determination feature vector is within a threshold. Using such a vector discriminant function makes the determination simple and more accurate.

To construct a highly accurate anomaly detection system, the user profile must be updated to follow concept drift. In the conventional method, updating the user profile requires the result of the discriminant function (feedback update), so the profile is not updated correctly when that result is wrong. In the present invention, by contrast, the eigen co-occurrence matrix group is updated by including the test time-series data in the plurality of learning time-series data, so the profile can be updated without using the result of the discriminant function (feed-forward update). The update can therefore be performed reliably.

  Further, when the abnormality of the time series data input to the computer system is determined using the time series data determination method of the present invention, the abnormal time series data can be determined with higher accuracy than before.

  According to the present invention, it is possible to determine whether time-series data includes a predetermined category with higher accuracy than before by using dynamic information included in the time-series data.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. FIG. 1 is a diagram showing the structure of a program for implementing an example of an embodiment of the time-series data determination method of the present invention, which uses principal component analysis to determine whether time-series data including a plurality of types of events belongs to one or more predetermined categories. In the present embodiment, the plurality of learning time-series data used to obtain the eigen co-occurrence matrix group from which feature vectors are obtained, the time-series data for profile learning (hereinafter, profile learning time-series data), and the time-series data to be tested (hereinafter, test time-series data) are each converted into co-occurrence matrices. Here, a co-occurrence matrix is obtained by converting the relationship between each two types of events, among the plurality of types of events constituting the time-series data, into matrix data.

The step of converting time-series data into a co-occurrence matrix will now be described. FIG. 2 shows an example of the structure of a plurality of learning time-series data [here, three time-series data sent from users 1 to 3, respectively, a user being a person or another computer that accesses the computer and transmits time-series data]. In this example, the time-series data from each user consists of 20 commands (events). In this embodiment, the time-series data of 20 commands is divided by a window with a data length of 10 commands (window data extraction step). That is, each time-series input data is cut out with a window having a predetermined data length (10 commands), and two window data are extracted. Note that the data length of the window may be determined according to the length of the time-series data.
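The window data extraction step can be sketched as follows. This is a minimal illustration that assumes consecutive, non-overlapping windows and drops any trailing partial window; the 20-command sequence and the function name `extract_windows` are hypothetical and are not the data of FIG. 2.

```python
def extract_windows(events, window_size):
    """Cut the sequence into consecutive, non-overlapping windows.

    Assumption for this sketch: a trailing partial window is dropped.
    """
    last_start = len(events) - window_size
    return [events[i:i + window_size] for i in range(0, last_start + 1, window_size)]


# Hypothetical 20-command sequence from one user (not the data of FIG. 2).
commands = [
    "cd", "ls", "less", "ls", "less", "cd", "ls", "cd", "cd", "ls",
    "cd", "ls", "emacs", "gcc", "gdb", "emacs", "gcc", "gdb", "ls", "cd",
]

windows = extract_windows(commands, window_size=10)
for number, window in enumerate(windows, start=1):
    print(number, window)
```

With 20 commands and a window of 10 commands, this yields the two window data described above.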

Next, the time-series data of a given section is converted into a co-occurrence matrix in order to represent the causal relationships between the pairs of events appearing in it. Each element of the co-occurrence matrix represents the strength of the causal relationship between two events. To create a co-occurrence matrix, a window size w, a scope size s, and an event set B = {b1, b2, b3, ..., bm} are defined. Here, m is the number of events. The window size w determines the length of the event time series from which one feature vector is extracted, and the scope size s determines the width of the interval over which the causal relationship between two events is considered. In the data example shown in FIG. 2, w is 10 and s is 6. B consists of the eight unique commands (events) (m = 8) appearing in the learning time-series data (domain data) of all three users: cd, ls, less, emacs, gcc, gdb, mkdir, and cp. The strength of the causal relationship between two events is defined by the distance between the events and the frequency with which they appear. That is, the strength is obtained by counting how often the event of interest appears within the scope size (6) inside the window size (10). In the example of FIG. 2, two co-occurrence matrices are created for each user. The element (frequency) 7 for event cd and event ls in window 1 of FIG. 3 indicates that ls appeared seven times after cd within the scope size (6) inside the window size (10). The event pairs (cd, ls) and (ls, cd) have the largest element (frequency) in window 1 of FIG. 3, which indicates that these events are strongly related in this time series. The co-occurrence matrix thus represents the strength of the causal relationship between every pair of events appearing in the time-series data.

FIG. 3 will now be described in detail in relation to the present invention. First, as shown in FIG. 3, a plurality of scope data are extracted from the window data described above for each user's time-series data (scope data extraction step). In this step, a plurality of scope data, each with a data length shorter than that of the window data, are sequentially extracted from the window data with a time lag. In this example, scope data with a data length of six commands are sequentially extracted. Specifically, one type of event (for example, cd) is selected from the plurality of types of events included in the 10 commands constituting the window data (cd, ls, and less in the case of FIG. 3), and one or more scope data for that type of event are extracted using each position at which it appears in the window data as a reference position. In the example of FIG. 3, focusing on the event cd, the first scope data consists of the six commands (events) following the cd at the head of window 1 (the reference position), not including that cd itself. Next, the commands (events) following the sixth event from the head, which is also cd (the reference position), are extracted as the second scope data, again not including that cd. Because window 1 contains only 10 events, as in the example of FIG. 3, only four events are extracted as the second scope data. Similarly, the third and fourth scope data are extracted using the eighth and ninth events from the head, both cd, as reference positions.

Next, based on the plurality of scope data extracted from the window data, the strength of the time-series relationship between the types of events included in the window data (the strength of the relationship between two events) is expressed in terms of the frequency and distance with which the two events appear. For example, the total number of events of one type (in the case of FIG. 3, cd itself) included in the one or more scope data for the event cd (four scope data in the case of FIG. 3) is taken as the frequency of that event with respect to cd. The window data is then converted into a co-occurrence matrix by converting each frequency into a value indicating the strength of the relationship between the two events. In the example of FIG. 3, consider first the frequency of the event cd with respect to the event cd in window 1. The first scope data includes one cd, the second includes two, the third includes one, and the fourth includes none. The frequency of cd with respect to cd is therefore 1 + 2 + 1 + 0 = 4. Similarly, for the event ls with respect to the event cd, the first scope data includes three ls, the second includes two, the third includes one, and the fourth includes one. The frequency of ls with respect to cd is therefore 3 + 2 + 1 + 1 = 7. Because the counts are taken over scope data, these frequencies reflect time or distance relationships, that is, the dynamic information included in the time-series data. The right-hand area of FIG. 3 shows the matrix data obtained by converting windows 1 and 2 into co-occurrence matrices. Expressing time-series data as co-occurrence matrices in this way makes it possible to model the dynamics of human behavior.
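The counting procedure can be sketched as follows. FIG. 3 is not reproduced in the text, so the contents of `window1` are an assumption: a 10-command sequence chosen to be consistent with the positions of cd (first, sixth, eighth, and ninth) and with the per-scope counts stated above. The function name `cooccurrence_matrix` is hypothetical, and the sketch stores raw frequencies without any further conversion into a strength value.

```python
def cooccurrence_matrix(window, events, scope_size):
    """Count, for each ordered event pair (a, b), how often b appears
    within the scope_size events that follow an occurrence of a."""
    index = {event: k for k, event in enumerate(events)}
    matrix = [[0] * len(events) for _ in events]
    for pos, first in enumerate(window):
        # Scope data: the events after the reference position, truncated
        # at the end of the window.
        scope = window[pos + 1:pos + 1 + scope_size]
        for second in scope:
            matrix[index[first]][index[second]] += 1
    return matrix


events = ["cd", "ls", "less"]
# Assumed contents of window 1 (consistent with the counts in the text).
window1 = ["cd", "ls", "less", "ls", "less", "cd", "ls", "cd", "cd", "ls"]

M = cooccurrence_matrix(window1, events, scope_size=6)
cd, ls = events.index("cd"), events.index("ls")
print(M[cd][cd], M[cd][ls], M[ls][cd])
```

Under these assumptions the (cd, cd) element is 4 and the (cd, ls) element is 7, matching the sums 1 + 2 + 1 + 0 and 3 + 2 + 1 + 1 given above.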

To identify an impersonator posing as a legitimate user using the method of the present invention, the co-occurrence matrix is treated as a pattern, a feature vector is obtained using principal component analysis as the statistical feature extraction method, and the feature vector is then used for identification. Principal component analysis is a statistical feature extraction method that makes it possible to represent data in vector form with a small number of features (principal components). Based on the statistics of multivariate data, it constructs new variables expressed as linear combinations of the original variables and summarizes the data into "principal components" that are uncorrelated with each other. In this embodiment, the co-occurrence matrix is regarded as the counterpart of a face image in the Eigenface method proposed by Turk et al. For this reason, the time-series data determination method of the present invention is called the Eigen Co-occurrence Matrix (ECM) method in the present application.

As shown in FIG. 1, the learning time-series data used to create the eigen co-occurrence matrix group is selected from the time-series data and used as domain data. Each co-occurrence matrix converted from one window is regarded as the counterpart of a face image in the Eigenface method of M. Turk et al. Eigenvalues and the corresponding eigenvectors are obtained by principal component analysis. The eigenvalues are then arranged in descending order, the N eigenvectors corresponding to the largest eigenvalues are selected, and each is converted into a matrix to form the eigen co-occurrence matrix group.

Feature vectors are extracted from the co-occurrence matrices by principal component analysis according to the following procedure. First, the i-th of the p learning co-occurrence matrices obtained from the learning time-series data is expressed as an N-dimensional vector $x_i$ in which the values of its elements are arranged. Here, p is the number of samples, and N is the square of the number of events ($N = m^2$). The average vector of the p co-occurrence matrices is obtained as the average co-occurrence matrix by the following equation [Equation 1]. The average co-occurrence matrix indicates the average relationship between event pairs (between two terms).

\[ \bar{x} = \frac{1}{p} \sum_{i=1}^{p} x_i \]

The vector obtained by subtracting the average co-occurrence matrix (average vector) from each co-occurrence matrix is represented by [Equation 2].

\[ \tilde{x}_i = x_i - \bar{x} \]

Subtracting the average co-occurrence matrix moves the origin of the coordinate axes to the mean. The average co-occurrence matrix (an m × m matrix) is thus subtracted from each co-occurrence matrix, the result is vectorized (each m × m matrix is changed into an $m^2$-dimensional column vector), and the matrix whose columns are these vectors is represented by [Equation 3].

\[ X = [\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_p] \]

The matrix obtained by multiplying this matrix by its transpose, $X X^{T}$, is the covariance matrix (an $m^2 \times m^2$ matrix) shown in the figure.

Next, an orthonormal basis a that optimally approximates the set of learning co-occurrence matrices is formed from the eigenvectors of the covariance matrix of the matrix X expressed by [Equation 3]. Eigenvalues and eigenvectors are therefore calculated from the covariance matrix (the eigenvectors of the $m^2 \times m^2$ matrix are calculated). Here, an eigenvalue represents the strength of a feature, and the eigenvectors represent feature axes that are uncorrelated with each other. Each eigenvector $a_l$ of a is an eigen co-occurrence matrix, and the set of them is called the eigen co-occurrence matrix group (principal components).

Specifically, the eigenvalues are sorted in descending order and the corresponding eigenvectors are obtained (only N of the $m^2$ eigenvectors are selected, where N here denotes the number of eigenvectors retained). Sorting the eigenvectors by eigenvalue orders the axes by the strength of their features. Each of the N eigenvectors is converted into a matrix (each $m^2$-dimensional vector is converted into an m × m matrix), and the result is defined as the eigen co-occurrence matrix group. The feature vector C (the principal component scores) of a co-occurrence matrix x is obtained by calculating the inner product of x, converted into a column vector with the average vector subtracted, with the orthonormal basis a:

\[ C = [c_1, c_2, \ldots, c_N]^{T} = a^{T} (x - \bar{x}) \]

Each component $c_1, c_2, \ldots, c_N$ represents the contribution of the corresponding eigen co-occurrence matrix to the representation of the co-occurrence matrix x. When feature vectors are extracted from the co-occurrence matrices as in the present embodiment, various vector-space techniques can be used to identify the feature vectors.
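The procedure above can be sketched with NumPy as follows. This is a minimal illustration, not the patented implementation: the training matrices are random hypothetical data, the function names `eigen_cooccurrence_basis` and `feature_vector` are assumptions, and `numpy.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np


def eigen_cooccurrence_basis(matrices, n_components):
    """Return the average vector and the leading eigenvectors (as columns)."""
    # Vectorize each m x m co-occurrence matrix into an m^2-dimensional vector.
    x = np.stack([np.asarray(mat, dtype=float).ravel() for mat in matrices])
    mean = x.mean(axis=0)                       # average co-occurrence matrix
    centered = (x - mean).T                     # matrix X, shape (m^2, p)
    cov = centered @ centered.T                 # covariance matrix, (m^2, m^2)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order]              # strongest feature axes first


def feature_vector(matrix, mean, basis):
    """Project a mean-subtracted co-occurrence matrix onto the basis."""
    return basis.T @ (np.asarray(matrix, dtype=float).ravel() - mean)


# Hypothetical learning data: p = 5 co-occurrence matrices over m = 3 events.
rng = np.random.default_rng(0)
m = 3
training = rng.integers(0, 8, size=(5, m, m))

mean, basis = eigen_cooccurrence_basis(training, n_components=2)
c = feature_vector(training[0], mean, basis)
# Each retained eigenvector reshaped into an m x m eigen co-occurrence matrix.
eigen_matrices = basis.T.reshape(-1, m, m)
print(c)
```

For large event sets the m² × m² covariance matrix becomes expensive to decompose; this sketch ignores that concern.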
The portions relating to the time-series data determination method of the present invention are described below. In addition to the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step used in the conversion to co-occurrence matrices described above, the determination method performs an eigen co-occurrence matrix group determination step, a profile co-occurrence matrix conversion step, a determination feature vector extraction step, a test co-occurrence matrix conversion step, a test feature vector extraction step, and a determination step.

First, in the eigen co-occurrence matrix group determination step, as described above, the eigen co-occurrence matrix group (the set of eigen co-occurrence matrices, that is, the principal components) that serves as the basis for obtaining feature vectors is determined by principal component analysis with a plurality of co-occurrence matrices (obtained by converting the learning time-series data into co-occurrence matrices) as input.

In the profile co-occurrence matrix conversion step, the same steps as the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step described above are performed on one or more profile learning time-series data belonging to the one or more categories, converting them into one or more profile co-occurrence matrices. Time-series data clearly known to have been created by a normal user is used as the profile learning time-series data. The profile learning time-series data may of course be selected from the learning time-series data. If 100 users access a certain computer, the time-series data created by each of the 100 users is converted into profile co-occurrence matrices as profile learning time-series data.

Next, in the determination feature vector extraction step, a determination feature vector for each profile learning time-series data is extracted based on the profile co-occurrence matrix and the eigen co-occurrence matrix group. The determination feature vectors extracted in this way are stored in advance in a computer memory. FIG. 1 does not explicitly show the profile learning time-series data, but it is converted into co-occurrence matrices, and its feature vectors are obtained, by the same route as the test time-series data.

  Next, in the test co-occurrence matrix conversion step, the same steps as the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step are performed on the test time-series data to be tested, converting it into a test co-occurrence matrix. The test feature vector extraction step then extracts a test feature vector for the test time-series data based on the test co-occurrence matrix and the eigen co-occurrence matrix group. As shown in FIG. 1, the test feature vector is obtained by subtracting the average co-occurrence matrix from the test co-occurrence matrix, vectorizing the result and each matrix of the previously obtained eigen co-occurrence matrix group, and computing their inner products.
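  The feature vector extraction described above amounts to one inner product per eigen co-occurrence matrix. The following sketch illustrates this; the function name `feature_vector` and the plain nested-list matrix representation are assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def feature_vector(cooc: Matrix, mean: Matrix, eigen_mats: List[Matrix]) -> List[float]:
    """Project a co-occurrence matrix onto each eigen co-occurrence matrix.

    The average co-occurrence matrix is subtracted first, then both sides
    are vectorized row by row and their inner product is taken.
    """
    diff = [c - m for c_row, m_row in zip(cooc, mean) for c, m in zip(c_row, m_row)]
    features = []
    for eigen in eigen_mats:
        flat = [value for row in eigen for value in row]
        features.append(sum(d * e for d, e in zip(diff, flat)))
    return features
```

  The same function would serve for both the determination feature vectors (from profile co-occurrence matrices) and the test feature vector, since both follow the same route.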

  In the determination step, whether the test time-series data belongs to the one or more categories is determined based on the test feature vector and the determination feature vectors obtained and stored in advance. Specifically, using a predetermined vector identification function, the step decides whether the test time-series data belongs to the one or more categories (that is, whether it was created by the user or by an impersonator other than the user) according to whether the Euclidean distance between the test feature vector and the determination feature vector is within a threshold.
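  A minimal sketch of such a threshold test is given below. It assumes that the test data is accepted when its feature vector lies within the threshold of at least one stored determination feature vector; the text does not say how several stored vectors are combined, so that rule, like the name `belongs_to_category`, is an assumption.

```python
import math
from typing import Sequence


def belongs_to_category(
    test_vec: Sequence[float],
    profile_vecs: Sequence[Sequence[float]],
    threshold: float,
) -> bool:
    """Return True if the test vector is close enough to any profile vector.

    Closeness is the Euclidean distance; the any-match rule is an assumption.
    """
    return any(math.dist(test_vec, p) <= threshold for p in profile_vecs)
```

  A result of False would correspond to data judged to have been created by an impersonator. The threshold controls the trade-off between detection rate and misdetection.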

  To build an accurate anomaly detection system (a method for discriminating abnormal time-series data), the user profile (the user's determination feature vector) must be updated to follow concept drift. In the conventional method shown in FIG. 4, updating the user profile requires the output of the discrimination function (feedback update), so the profile is not updated correctly when that output is wrong. In this embodiment, by contrast, as shown in FIG. 5, the eigen co-occurrence matrix group is updated by including the test time-series data in the plurality of learning time-series data (the domain data). The profile can thus be updated without using the output of the discrimination function (feed-forward update), so the update can be performed reliably.
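  The difference from a feedback update can be illustrated with a small sketch in which new test data is added to the domain data unconditionally and the basis is then refitted. The class `FeedForwardUpdater` and the stand-in `average_matrix` are hypothetical; in a real system the `fit_basis` argument would be the routine that computes the eigen co-occurrence matrix group.

```python
from typing import Callable, List, TypeVar

Matrix = List[List[float]]
Basis = TypeVar("Basis")


def average_matrix(matrices: List[Matrix]) -> Matrix:
    """Stand-in for the basis fit: element-wise average of the matrices."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[r][c] for m in matrices) / n for c in range(cols)] for r in range(rows)]


class FeedForwardUpdater:
    """Keeps the domain data and refits the basis whenever test data arrives."""

    def __init__(self, learning: List[Matrix], fit_basis: Callable[[List[Matrix]], Basis]):
        self.learning = list(learning)
        self.fit_basis = fit_basis
        self.basis = fit_basis(self.learning)

    def observe(self, test_matrix: Matrix) -> Basis:
        # The classification result is never consulted: the test data is
        # always added to the domain data before the basis is refitted.
        self.learning.append(test_matrix)
        self.basis = self.fit_basis(self.learning)
        return self.basis
```

  Because `observe` takes no classification result as input, a wrong decision by the discrimination function cannot corrupt the update, which is the property the embodiment relies on.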

Further, when the time-series data determination program of the present invention is executed on a computer system to determine abnormalities in time-series data input to that system, abnormal time-series data can be identified with higher accuracy than before.

  An impersonation detection experiment was performed on this embodiment using the UNIX (registered trademark) command data provided by Schonlau et al. (M. Schonlau, W. DuMouchel, W.-H. Ju, A. F. Karr, M. Theus and Y. Vardi, "Computer intrusion: Detecting masquerades", Statistical Science, 16(1): 58-74, 2001). The purpose of the experiment was to examine how the accuracy of impersonation detection varies with the size of the learning time-series data (domain data). FIG. 6 and FIG. 7 show Experiment 1, in which the first 50 windows of all users were used as the domain data, and Experiment 2, in which the first 75 windows of all users were used as the domain data. The results showed that the detection rate in Experiment 2, with the larger domain data, was better than that in Experiment 1.

  In the above embodiment, principal component analysis is used as the statistical feature extraction method, but statistical feature extraction methods other than principal component analysis can of course be used in the method of the present invention. Likewise, this embodiment uses the Euclidean distance between feature vectors as the identification method, but various vector identification methods other than the Euclidean distance can be used.

FIG. 1 is a diagram showing the structure of a program for implementing an example embodiment of the time-series data determination method of the present invention, which uses principal component analysis to determine whether time-series data including a plurality of types of events belongs to one or more predetermined categories.
FIG. 2 shows an example of the configuration of three time-series data sent from users 1 to 3, respectively.
FIG. 3 is a diagram used to explain the conversion into a co-occurrence matrix.
FIG. 4 is a diagram used to explain the conventional profile update.
FIG. 5 is a diagram used to explain the profile update according to an embodiment of the invention.
FIG. 6 is a diagram showing the basic concept of Experiments 1 and 2.
FIG. 7 is a diagram showing the relationship between the detection rate and misdetection in Experiments 1 and 2.

Claims (5)

  1. A time-series data determination program that determines whether time-series data, such as commands and files input to a computer, that includes a plurality of types of events belongs to one or more predetermined categories, and thereby determines whether the time-series data is abnormal time-series data created by an impersonator, by causing a computer system to execute:
    a window data extraction step of extracting, in advance, a plurality of window data by cutting out a plurality of time-series data for learning with a window having a predetermined data length;
    a scope data extraction step of sequentially extracting, from the window data with a time lag, a plurality of scope data having a data length shorter than the data length of the window;
    a co-occurrence matrix conversion step of converting, based on the plurality of scope data, the plurality of window data into a plurality of co-occurrence matrices indicating the strength of the time-series relevance among the plurality of types of events included in the window data;
    an eigen co-occurrence matrix group determination step of determining, by a statistical feature extraction method taking the plurality of co-occurrence matrices as input, an eigen co-occurrence matrix group serving as a basis for obtaining feature vectors;
    a profile co-occurrence matrix conversion step of performing the same steps as the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step on one or more profile learning time-series data including the one or more categories, thereby converting the one or more profile learning time-series data into one or more profile co-occurrence matrices;
    a determination feature vector extraction step of extracting one or more determination feature vectors for the one or more profile learning time-series data based on the one or more profile co-occurrence matrices and the eigen co-occurrence matrix group;
    a test co-occurrence matrix conversion step of performing the same steps as the window data extraction step, the scope data extraction step, and the co-occurrence matrix conversion step on test time-series data to be tested, thereby converting the test time-series data into a test co-occurrence matrix;
    a test feature vector extraction step of extracting a test feature vector for the test time-series data based on the test co-occurrence matrix and the eigen co-occurrence matrix group; and
    a determination step of determining, based on the one or more determination feature vectors and the test feature vector, whether the test time-series data belongs to the one or more categories,
    wherein the eigen co-occurrence matrix group is updated by including the test time-series data in the plurality of time-series data for learning.
  2. The time-series data determination program according to claim 1, wherein, in the scope data extraction step, one or more scope data for one type of event selected from the plurality of types of events are extracted using, as a reference position, each position at which that one type of event appears in the window data, and
    in the co-occurrence matrix conversion step, the total number of occurrences of another type of event contained in the one or more scope data for the one type of event is calculated as the frequency of the other type of event with respect to the one type of event, and the window data is converted into the co-occurrence matrix by taking that frequency as the value indicating the strength of the relevance of the other type of event to the one type of event.
  3. In the determination feature vector extraction step, the profile co-occurrence matrix and the eigen co-occurrence matrix group are vectorized and then the inner product is determined to determine the determination feature vector,
    2. The test feature vector extracting step includes: vectorizing the test co-occurrence matrix and the eigen co-occurrence matrix group and then obtaining an inner product thereof to extract the test feature vector. Program for judging time-series data.
  4. The time-series data determination program according to claim 1, wherein, in the determination step, a predetermined vector discrimination function is used to determine whether the test time-series data belongs to the one or more categories according to whether the Euclidean distance between the test time-series data and the determination feature vector is within a threshold.
  5. A time-series data abnormality determination method, wherein a computer executes the time-series data determination program according to any one of claims 1 to 4 to determine an abnormality in time-series data, such as commands or files, input to a computer system.
JP2004254856A 2004-09-01 2004-09-01 Time series data judgment program Expired - Fee Related JP4476078B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2004254856A JP4476078B2 (en) 2004-09-01 2004-09-01 Time series data judgment program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP2004254856A JP4476078B2 (en) 2004-09-01 2004-09-01 Time series data judgment program

Publications (2)

Publication Number Publication Date
JP2006072666A JP2006072666A (en) 2006-03-16
JP4476078B2 true JP4476078B2 (en) 2010-06-09

Family

ID=36153236

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2004254856A Expired - Fee Related JP4476078B2 (en) 2004-09-01 2004-09-01 Time series data judgment program

Country Status (1)

Country Link
JP (1) JP4476078B2 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5088233B2 (en) 2008-05-21 2012-12-05 富士通株式会社 Operation management apparatus, display method, and program
JP5928165B2 (en) * 2012-06-01 2016-06-01 富士通株式会社 Abnormal transition pattern detection method, program, and apparatus
JP5936240B2 (en) 2014-09-12 2016-06-22 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation Data processing apparatus, data processing method, and program
EP3358570A4 (en) * 2015-09-29 2018-08-08 Fujitsu Limited Program, information processing method, and information processing device

Also Published As

Publication number Publication date
JP2006072666A (en) 2006-03-16

Legal Events

Date Code Title Description
A131 Notification of reasons for refusal

Free format text: JAPANESE INTERMEDIATE CODE: A131

Effective date: 20070612

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20070810

A02 Decision of refusal

Free format text: JAPANESE INTERMEDIATE CODE: A02

Effective date: 20071002

A521 Written amendment

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20071108

A911 Transfer of reconsideration by examiner before appeal (zenchi)

Free format text: JAPANESE INTERMEDIATE CODE: A911

Effective date: 20071207

A912 Removal of reconsideration by examiner before appeal (zenchi)

Free format text: JAPANESE INTERMEDIATE CODE: A912

Effective date: 20080926

A01 Written decision to grant a patent or to grant a registration (utility model)

Free format text: JAPANESE INTERMEDIATE CODE: A01

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20100309

R150 Certificate of patent or registration of utility model

Free format text: JAPANESE INTERMEDIATE CODE: R150

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130319

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20140319

Year of fee payment: 4

LAPS Cancellation because of no payment of annual fees