GR20220100245A

GR20220100245A - Method for detecting and treating errors in log files with use of deep learning neural networks

Info

Publication number: GR20220100245A
Application number: GR20220100245A
Authority: GR
Inventors: Κωνσταντινος Νικολαου Καραμιτσιος
Original assignee: My Company Projects O.E.; Κωνσταντινος Νικολαου Καραμιτσιος
Priority date: 2022-03-21
Filing date: 2022-03-21
Publication date: 2023-10-10

Abstract

A method is disclosed that detects abnormal entries (potential errors) in program log data in information system infrastructures that act as servers providing online services and applications. Given a log file, the data are transformed and processed appropriately to train a deep machine learning neural network. The goal of training is to reconstruct the input data with the least possible error while, at the same time, the reconstruction of the training data is limited by the dimensions and structure of the neural network. As a consequence, the neural network succeeds in modeling the most important features of the training data. New input data that have similar characteristics to those of the training data are reconstructed with little error. Conversely, data that do not possess similar features are reconstructed with a larger error and, consequently, are categorized as abnormal based on a threshold value. This indication is generated in real time and is used to notify server administrators of possible failures and to perform self-healing actions.

Description

A method for detecting and handling anomalies in the log data of information system infrastructures using a deep learning neural network.

The present invention relates to a method that aims to detect abnormal entries in the log data of information system infrastructures that act as servers, and to take self-healing actions. These servers are infrastructures for hosting websites and providing online services and applications. Anomaly detection is performed using a fully connected deep machine learning neural network.

The information system infrastructures in question are servers intended to host websites and provide related online services and applications. Because servers are computers, a multitude of errors and problems can occur during their operation. An expert, hereinafter referred to as the system administrator, is called upon to deal with the errors that arise; in addition, the detection of known errors can be automated by the administrator or by specialist programmers who write appropriate software. However, servers are complex systems. As a consequence, previously unknown errors may occur that the automated detection of known errors will not catch and that may be unfamiliar even to an expert. Moreover, the system administrator may not be readily available to deal with the situation.

The object of the present invention is to provide an innovative method for automatically detecting errors in program log files and carrying out self-healing actions. The method is able to recognize errors that have not appeared in the past by modeling exclusively the normal behavior of the system. It can be applied to any system log file of a server program that provides online services (for example, the log files of a specific, critical piece of software such as Apache Web Server or MySQL Database). In this way, through the indications produced by the deep machine learning neural network, the system administrator is informed when the server system does not behave as usual; at the same time, an automated action, namely a restart of the program, is performed on the server component that produces the log data. As a result, server downtime is reduced, because errors are handled automatically, and the system administrator is informed sooner.

Every server has system log files, which are an important means of monitoring the behavior of the components that produce them, as well as the overall behavior of the system. The present invention concerns log files produced by software components. The method is applied separately to each log file of a software program deemed critical to the operation of the server (for example, the program responsible for running the database). For a log file to which the method is applied, the data are collected by supplying the appropriate paths to the file. The data consist of text files in which each line holds one separate log entry.

The data are then preprocessed. The log data are converted into a two-dimensional table in which each row holds one entire log entry, with the words of the entry split into separate cells wherever a whitespace character occurs. The number of rows equals the number of log entries, and the number of columns equals the number of words in the longest log entry. For entries with fewer words than this maximum, the remaining cells are filled with the value zero (0). Whitespace characters inside a pair of parentheses or square brackets are not treated as column separators, because the part of an entry enclosed in parentheses or square brackets constitutes a single piece of information.
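The tokenization and padding step can be illustrated with a minimal Python sketch. The function names `split_log_line` and `build_table` are hypothetical and do not appear in the original disclosure, and storing the padding value as the string "0" is an assumption made so that all cells share one type.

```python
def split_log_line(line):
    """Split a log entry on whitespace, except inside (...) or [...]."""
    tokens, current, depth = [], [], 0
    for ch in line.strip():
        if ch in "([":
            depth += 1
            current.append(ch)
        elif ch in ")]":
            depth = max(depth - 1, 0)  # tolerate unbalanced closers
            current.append(ch)
        elif ch.isspace() and depth == 0:
            # Whitespace outside brackets ends the current cell.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def build_table(lines, pad="0"):
    """Build a rectangular table: one row per entry, padded to the longest row."""
    rows = [split_log_line(line) for line in lines if line.strip()]
    width = max((len(row) for row in rows), default=0)
    return [row + [pad] * (width - len(row)) for row in rows]
```

Calling `build_table` on the lines read from a log file yields a list of equally long rows, which is the form the later preprocessing steps operate on.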

If the server has not exhibited problematic behavior up to the time the method is installed, the collected data are stored unchanged for the subsequent training of the deep machine learning neural network model. Otherwise, the user can set a maximum date, up to which no problem or error has occurred, so that only log entries earlier than this date are used. The goal is to train the machine learning model only on data from normal operation, which is the behavior to be modeled, and to keep abnormal entries as few as possible or eliminate them from the data entirely. The first column is taken to be the date column, unless the user of the method specifies otherwise. In either case, the date column is deleted from the table, as it is not useful for anomaly detection.
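A possible sketch of the date cutoff and the removal of the date column follows. The disclosure does not state a timestamp format, so this example assumes ISO 8601 timestamps that `datetime.fromisoformat` can parse; the function name `filter_and_drop_date` and its parameters are hypothetical, and keeping only entries strictly earlier than the cutoff is one reading of "earlier than this date".

```python
from datetime import datetime


def filter_and_drop_date(table, cutoff=None, date_col=0,
                         parse=datetime.fromisoformat):
    """Keep rows dated before `cutoff` (if given), then drop the date column.

    Assumption: the date cell is an ISO 8601 string; pass another `parse`
    callable for logs that use a different timestamp format.
    """
    kept = []
    for row in table:
        if cutoff is not None and parse(row[date_col]) >= cutoff:
            continue  # entry falls on or after the user-defined maximum date
        kept.append(row[:date_col] + row[date_col + 1:])
    return kept
```

When no cutoff is supplied, every row is kept and only the date column is removed, which corresponds to the case of a server with no known problematic period.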

For each column of the table describing the system log data, it is determined whether the column contains only numeric values. If so, the values of the column are normalized to the range [0, 1] based on the column's maximum and minimum values. Otherwise, if the column also contains alphabetic characters and/or symbols, the data of the entire column are treated as categorical and are represented using one-hot encoding.
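The column-wise encoding described above might look like the following sketch. The names `is_numeric` and `encode_columns` are hypothetical; treating any string accepted by `float()` as numeric, and mapping a constant numeric column to 0.0, are assumptions that the disclosure does not address.

```python
def is_numeric(value):
    """Return True if the cell can be read as a number (assumption: float())."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def encode_columns(table):
    """Min-max scale numeric columns to [0, 1]; one-hot encode all others."""
    encoded_cols = []
    for col in zip(*table):  # iterate over columns
        if all(is_numeric(v) for v in col):
            nums = [float(v) for v in col]
            lo, hi = min(nums), max(nums)
            span = hi - lo
            # A constant column has zero span; map it to 0.0 (assumption).
            encoded_cols.append([[(x - lo) / span if span else 0.0] for x in nums])
        else:
            categories = sorted(set(col))
            index = {c: i for i, c in enumerate(categories)}
            encoded_cols.append(
                [[1.0 if index[v] == i else 0.0 for i in range(len(categories))]
                 for v in col]
            )
    # Concatenate the per-column encodings back into one vector per row.
    return [[x for col in encoded_cols for x in col[r]] for r in range(len(table))]
```

In a real deployment the minimum, maximum, and category lists learned here would have to be stored, so that new log entries can be encoded in the same way at detection time.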

The table is partitioned by rows into two parts, hereafter called the training set and the validation set, or the training data and the validation data. The training set comprises eighty percent (80%) of the rows of the original table, and the validation set comprises the remaining twenty percent (20%). So that both sets contain data from all time periods in which the log data were produced, rather than the training set holding the chronologically earliest entries and the validation set the latest, the table is not split in the original chronological order of its rows. Instead, the rows of the original table are shuffled randomly using a pseudorandom number generator; the first 80% are then stored as the training set and the rest as the validation set. The two sets do not overlap.
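The shuffled 80/20 split can be sketched as follows. The function name `shuffle_split` and the fixed `seed` are hypothetical choices for reproducibility; the disclosure does not prescribe a particular pseudorandom number generator.

```python
import random


def shuffle_split(rows, train_fraction=0.8, seed=0):
    """Shuffle the rows, then take the first 80% for training, the rest for validation."""
    shuffled = list(rows)                    # copy, so the caller's table is untouched
    random.Random(seed).shuffle(shuffled)    # seeded pseudorandom shuffle
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Because the two parts are slices of the same shuffled list, every row lands in exactly one of them, which gives the non-overlap property stated above.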

The deep machine learning neural network, hereafter also referred to as the model, consists of four (4) fully connected layers. The first layer consists of fourteen (14) neurons, the second of seven (7) neurons, the third of fourteen (14) neurons, and the fourth of as many neurons as the dimension of the input data, that is, the number of columns of the training and validation data. The first three layers use the Rectified Linear Unit (ReLU) as the activation function. The fourth and last layer, which is the output of the neural network, uses no activation function.
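To make the layer sizes concrete, the following dependency-free Python sketch builds the four layers and runs a forward pass. The names `init_autoencoder` and `forward` are hypothetical, and the uniform weight initialization is an assumption; the disclosure does not specify an initialization scheme or a software framework.

```python
import random


def init_autoencoder(input_dim, hidden=(14, 7, 14), seed=0):
    """Create weights for layers input_dim -> 14 -> 7 -> 14 -> input_dim."""
    rng = random.Random(seed)
    sizes = [input_dim, *hidden, input_dim]
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        bound = (1.0 / n_in) ** 0.5  # assumed initialization range
        weights = [[rng.uniform(-bound, bound) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x):
    """Apply each fully connected layer; ReLU on all layers except the last."""
    last = len(layers) - 1
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < last:
            x = [max(0.0, v) for v in x]  # ReLU activation
    return x
```

The output vector has the same length as the input vector, so the two can be compared element by element to obtain a reconstruction error.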

The model is trained through the following process. For more efficient training, each input consists of sixty-four (64) log entries that are processed in parallel by the model, that is, the batch size is set to sixty-four (64). The output of the model is intended to be a reconstruction of its input. The mean squared error between the input and the output is calculated and, using the Adam optimizer, the model is trained to minimize this mean squared error, hereafter also referred to as the reconstruction error. The process is repeated over the entire training data set fifty (50) times, that is, the number of epochs is set to fifty (50). The learning rate is set to ten to the power of minus four (10^-4).
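The stated hyperparameters, the batching, and the reconstruction error can be written down as a short sketch. The constant and function names are hypothetical, and the Adam parameter update itself is deliberately omitted here; in practice it would be supplied by a deep learning framework.

```python
BATCH_SIZE = 64        # log entries processed together
EPOCHS = 50            # full passes over the training set
LEARNING_RATE = 1e-4   # step size passed to the Adam optimizer


def reconstruction_error(x, x_hat):
    """Mean squared error between an input vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def iter_batches(rows, batch_size=BATCH_SIZE):
    """Yield consecutive batches; the last one may be smaller."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]
```

One epoch then consists of iterating `iter_batches` over the training set once, computing the reconstruction error of each batch, and letting the optimizer adjust the weights to reduce it.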

After the training process, the model will achieve a good reconstruction of the data given as input, provided that they are similar in form to the training data. However, because the dimension of the information is reduced as it flows through the model, the dimension of the input data being expected to be much larger than the number of neurons in the first three layers, the reconstruction will not be perfect but will contain a low error. Therefore, the training described above leads the model to capture the most important and representative features of the input data rather than to copy the input perfectly. As a consequence, if the model is given input data that do not have the same characteristics as the training data, i.e. abnormal data, it will not achieve a good reconstruction and a large reconstruction error will result. The invention exploits this inability of the model to reconstruct abnormal data. In essence, the trained neural network is an autoencoder architecture.

To detect abnormal data, a threshold is calculated for the size of the reconstruction error; if the threshold is exceeded, the data that produced the error, i.e. the log entries, are categorized as abnormal. Using the validation set, the mean reconstruction error over the entire validation set is calculated at the end of each epoch, as defined above. Based on this error, the model from the epoch with the lowest value is selected, and the threshold for detecting anomalies in new data is set to the value that is five percent (5%) greater than the mean reconstruction error of the validation data at that epoch. The validation data, rather than the training data, are used to select the best epoch and set the threshold value in order to avoid overfitting: the reconstruction error will be smaller for the training data, whereas the reconstruction error for the validation data, on which the model has not been trained, gives a more representative value for the error that log data will exhibit when the method is applied in practice.
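The epoch selection and threshold rule can be expressed as a minimal sketch, assuming the mean validation reconstruction error has been recorded once per epoch. The function name `select_epoch_and_threshold` and the example values in the usage note are hypothetical.

```python
def select_epoch_and_threshold(val_errors_per_epoch, margin=0.05):
    """Pick the epoch with the lowest mean validation error.

    Returns the zero-based epoch index and a threshold set `margin` (5%)
    above that epoch's mean validation reconstruction error.
    """
    best_epoch = min(range(len(val_errors_per_epoch)),
                     key=val_errors_per_epoch.__getitem__)
    best_error = val_errors_per_epoch[best_epoch]
    return best_epoch, best_error * (1.0 + margin)
```

For instance, with recorded errors of 0.30, 0.12, 0.10, and 0.11, the third epoch would be selected and the threshold would be about 0.105. The model weights saved at the selected epoch are the ones used for detection.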

The same preprocessing that was applied to the training and validation data is applied, in real time while the method is in operation, to the new data generated by the software in question, which are likewise system log entries. The one difference is that if an entry has more columns than the maximum observed in the original data, it is categorized directly as anomalous, since this is unusual and needs attention (note that the training data set, being sufficiently large, will include all the usual entries of normal operation). The reconstruction error for the input is then calculated, and it is checked whether the error exceeds the anomaly detection threshold and by how much. If the error is more than five percent (5%) greater than the threshold, the method notifies the user of the system by sending an email message. If the error is sufficiently large, specifically more than ten percent (10%) greater than the threshold, the method also performs an automatic action, namely restarting the program that produces the log files (for example, restarting the email server program if that is the component whose system logs the method examines).
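The two-level response can be sketched as a small decision function. The name `decide_actions` and the action labels are hypothetical; sending the email and restarting the program are left to the caller, since the disclosure does not specify how either is carried out. Errors between the threshold and 5% above it trigger no action in this sketch, which is one reading of the text.

```python
def decide_actions(error, threshold):
    """Map a reconstruction error to the actions described in the method."""
    actions = []
    if error > threshold * 1.05:
        actions.append("notify")    # email the administrator
    if error > threshold * 1.10:
        actions.append("restart")   # restart the program producing the log
    return actions
```

A critical exceedance therefore yields both actions, matching the description that the restart is accompanied by a notification.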

The architecture of the neural network is shown in Figure 1. The input to the neural network is referred to as the input layer. The input enters the first fully connected layer, and the information then flows on through the subsequent fully connected layers. In the figure, the fully connected layers are labeled ΠΣΕ1, ΠΣΕ2, ΠΣΕ3, and ΠΣΕ4 (from the Greek abbreviation for "fully connected layer"). The number of neurons of each fully connected layer is written inside it. The fourth and last fully connected layer (ΠΣΕ4) restores the information to the dimensions it had in the input layer. The steps of the overall method, as described above, are shown in Figure 2.

Claims

1. A method for detecting and dealing with anomalies in log files related to a server program, comprising the following phases:

1. Collection of existing system log data for a specific program, which is critical to the operation of the server.

2. Preprocessing of the data.

3. Splitting of the data into training data and validation data.

4. Deep machine learning neural network training.

5. Definition of anomaly detection threshold.

6. Detection of anomalies in new system log data in real time, and taking of action.

characterized in that on a server, which functions as an infrastructure for hosting web pages and providing internet services, a deep machine learning neural network, hereinafter referred to as the model, is trained while the server is in operation, the model thereby becoming capable of detecting anomalies in new system log entries of the critical program by calculating a reconstruction error and checking whether this error exceeds a certain threshold.

2. Method according to claim 1, characterized in that the collection of the data concerns the system log data found at specific paths on the server, given by the user of the method, while the method can concern more than one system log file. In this case, the method is applied separately to each individual system log file, each file relating to a separate server component.

3. Method according to claim 1, characterized in that the pre-processing of the data includes the following phases:

1. Convert the log data into a two-dimensional table, where each row of the table contains an entire log entry, separating the words of the entry where a whitespace character occurs. The number of rows will be equal to the number of entries in the file, and the number of columns will be equal to the number of words in the longest entry appearing in the file. For entries with fewer than the maximum number of words, zero (0) is filled in the remaining cells. Whitespace characters inside a pair of parentheses or square brackets are not taken into account when separating columns, since the parts of the entries between parentheses or square brackets are a single piece of information.

2. If a maximum date is set by the user, entries after this date are ignored. Unless the user defines otherwise, the cells of the first column are taken to contain the date of each entry. In any case, after this phase, the date column is deleted from the table, as it is not useful information for the detection of anomalies.

3. Detect for each column if it contains only numeric values. If this is true, the column values are normalized to the range [0, 1] based on the column's maximum and minimum values. If the column also contains alphabetic characters or symbols, then the data of the entire column is considered as categorical data, and the one-hot encoding method is applied.

4. Randomly split the data, using a pseudorandom number generator (the particular generator algorithm is immaterial), into training data and validation data. The training data are eighty percent (80%) of the data set and the validation data are the remaining twenty percent (20%) of the data set.

4. Method according to claim 1, characterized in that the deep neural network consists of four (4) fully connected layers. The first layer consists of fourteen (14) neurons and uses the Rectified Linear Unit function, hereafter referred to as ReLU, as its activation function. The second layer consists of seven (7) neurons and uses the ReLU activation function. The third layer consists of fourteen (14) neurons and uses the ReLU activation function. The fourth layer consists of as many neurons as the dimension of the input, i.e. the number of columns of the training set, and does not use any activation function.

5. Method according to claim 1, characterized in that, during the training of the model, the hyperparameters are defined as: sixty-four (64) for the batch size, fifty (50) for the number of epochs, and ten to the power of minus four (10^-4) for the learning rate.

6. Method according to claim 1, characterized in that in the training of the neural network the mean squared error between the input and the output, hereafter referred to as the reconstruction error, is minimized over the entries of each data set using the Adam optimizer.

7. Method according to claim 1 or 6, characterized in that the training of the neural network leads to a reconstruction of the input by learning the most important and representative features of the input through a gradual reduction of the dimensions of the information inside the neural network. The neural network will not achieve a perfect reconstruction of the input, but this reconstruction will contain a small error for the training data.

8. Method according to claim 1 or 7, characterized in that the reconstruction error will be larger than normal for input data drawn from a different distribution than the training data, i.e. the reconstruction error will be larger for abnormal data.

9. Method according to claim 1, characterized in that the validation data are given as input to the neural network and the mean reconstruction error is calculated over the entire set of validation data for each training epoch, and the model resulting from the epoch with the lowest value of the reconstruction error for the validation data is selected.

10. Method according to claim 1 or 9, characterized in that the mean reconstruction error for the validation data, for the selected epoch, is used to set a threshold value that determines whether an entry is categorized as normal or abnormal and whether a real-time action is taken. Specifically, a new system log entry is given as input to the neural network and its reconstruction error, i.e. the mean squared error between the input and the output of the neural network, is calculated. Errors exceeding the threshold value by five percent (5%) are categorized as problematic, and the system issues a warning by sending an email message to the administrator. Errors exceeding the threshold value by ten percent (10%) are categorized as critical, and the system issues a critical warning, sending a message similar to the above, and automatically restarts the program that generated the log entries in question. Otherwise, the values are defined as normal and no action is taken.