CN115860008B - Data processing method, electronic equipment and medium for determining abnormal log information - Google Patents


Info

Publication number
CN115860008B
CN115860008B (application CN202310160654.XA)
Authority
CN
China
Prior art keywords
log information
log
vector
clustering result
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310160654.XA
Other languages
Chinese (zh)
Other versions
CN115860008A (en)
Inventor
李峰
孙晓鹏
时伟强
夏国栋
宋衍龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Yuntian Safety Technology Co ltd
Original Assignee
Shandong Yuntian Safety Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Yuntian Safety Technology Co ltd filed Critical Shandong Yuntian Safety Technology Co ltd
Priority to CN202310160654.XA priority Critical patent/CN115860008B/en
Publication of CN115860008A publication Critical patent/CN115860008A/en
Application granted granted Critical
Publication of CN115860008B publication Critical patent/CN115860008B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present disclosure relates to the field of data processing, and in particular to a data processing method, an electronic device, and a medium for determining abnormal log information. The method includes: acquiring an original log information set A to be processed corresponding to a target network; traversing A and, if TAGi is a target log type identifier, deleting the fixed phrase segment corresponding to TAGi from ATXTi, to obtain a target log information set B to be processed; performing word segmentation on each Bi to obtain a word segmentation list set C; performing one-hot encoding on C to obtain a first code list set D; vectorizing D to obtain a first word vector list set E; performing vector conversion on E to obtain log semantic vectors F1, F2, …, Fi, …, Fn corresponding to each piece of original log information to be processed; and determining the abnormal log information set Y from A according to F1, F2, …, Fi, …, Fn. The method and device can improve the efficiency of identifying abnormal log information.

Description

Data processing method, electronic equipment and medium for determining abnormal log information
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a data processing method, an electronic device, and a medium for determining abnormal log information.
Background
Within a network system, a large amount of log information is generated every day. Anomalies can be located through identified abnormal log information, so that potential network security risks can be addressed. However, existing methods for determining abnormal logs detect each piece of log information in turn with a large number of regular expressions. Because each piece of log information must be matched against all of the regular expressions, this approach is inefficient.
Disclosure of Invention
In view of the foregoing, the present application provides a data processing method, an electronic device, and a medium for determining abnormal log information, which at least partially solve the problems existing in the prior art.
In one aspect of the present application, there is provided a data processing method for exception log information determination, including:
S100, acquiring an original log information set A = (A1, A2, …, Ai, …, An) to be processed corresponding to a target network, where Ai = (TAGi, ATXTi); Ai is the ith original log information to be processed generated in time sequence in the target network, TAGi is the log type identifier corresponding to Ai, and ATXTi is the original log content information corresponding to Ai;
s200, traversing A, and deleting a fixed speech segment corresponding to TAGI in ATXTi if TAGI is a target log type identifier to obtain a target log information set B= (B1, B2, …, bi, …, bn) to be processed, wherein Bi= (TAGI, BTXTi); wherein Bi is target log information to be processed corresponding to Ai, and BTXTi is target log content information corresponding to Bi; the target log type identifier is a log type identifier containing a fixed language segment in the corresponding original log content information;
s300, performing word segmentation processing on Bi to obtain word segmentation list sets C= (C1, C2, …, ci, …, cn), ci= (TAGI, C) i 1 ,C i 2 ,…,C i j ,…,C i g(i) ) J=1, 2, …, g (i); wherein Ci is a word segmentation list corresponding to Bi, and C i j For the j-th word in BTXTi, g (i) is the number of words in BTXTi;
s400, performing heat single coding processing on the C to obtain a first coding list set D= (D1, D2, …, di, …, dn), di= (D) i 1 ,D i 2 ,…,D i p ,…,D i g(i) ,D i g(i)+1 ) P=1, 2, …, g (i) +1; wherein Di is a first code list corresponding to Ci, D i 1 D for the first code obtained after the heat-unique coding treatment of TAGI i p To C i p-1 Performing a hot-independent coding treatment to obtain a first code; each first code is composed ofA one-dimensional vector composed of s characteristic elements; s is the number of words in a dictionary formed by using the history log information set;
s500, vectorizing the D to obtain a first word vector list set E= (E1, E2, …, ei, …, en), ei= (E) i 1 ,E i 2 ,…,E i p ,…,E i g(i) ,E i g(i)+1 ) The method comprises the steps of carrying out a first treatment on the surface of the Wherein Ei is a first word vector list corresponding to Di, E i p For D i p Corresponding first word vector, E i p Containing contextual semantic information, E i p =D i p X VecT; vecT is a target word vector matrix obtained by processing the history log information set according to a preset word2vec model; the size of VecT is s.h, s is the number of rows of VecT, h is the number of columns of VecT, and h < s;
s600, carrying out vector conversion processing on the E to obtain log semantic vectors F1, F2, …, fi, … and Fn corresponding to each piece of original log information to be processed; fi is a log semantic vector corresponding to Ai obtained by performing vector conversion processing on Ei;
s700, determining an anomaly log information set y= (Y1, Y2, …, yq, …, ym), q=1, 2, …, m, from a according to F1, F2, …, fi, …, fn; wherein Yq is the q-th abnormal log information determined from the a, and m is the number of abnormal log information determined from the a; m is less than n.
In another aspect of the present application, there is provided an electronic device comprising a processor and a memory;
the processor is configured to perform the steps of any of the methods described above by invoking a program or instruction stored in the memory.
In another aspect of the present application, there is provided a non-transitory computer readable storage medium storing a program or instructions that cause a computer to perform the steps of any of the methods described above.
The beneficial effects of the application are as follows.
According to the data processing method of the present application, after the original log information set A to be processed is obtained, the fixed phrase segments carried in the original log content corresponding to some log type identifiers are easily mistaken for important semantic information during semantic extraction because they occur with high frequency, even though their actual importance is low. Therefore, in the present application, the fixed phrase segments in the original log content information corresponding to target log type identifiers are deleted, which achieves a denoising effect.
Meanwhile, the target log information to be processed in the target log information set B is arranged in its generation order, so the content of B can be regarded as having a structure similar to an "article" (i.e., ordered rather than unordered). After being converted with the target word vector matrix, each first word vector contains contextual semantic information, so the generation order of the original log information can be taken into account when determining abnormal log information from the first word vectors and the log semantic vectors generated from them. As a result, abnormal log information that can only be determined by jointly analyzing several logs can be effectively identified.
Further, in the present application, through the processing of steps S200 to S500 and by means of clustering or semantic recognition, the abnormal log information set Y can be quickly determined from A. Compared with identification using regular expressions, the processing load is greatly reduced, and because identification can incorporate contextual semantic information, the identification accuracy is higher.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a data processing method for determining exception log information provided in the present application.
Detailed Description
Embodiments of the present application are described in detail below with reference to the accompanying drawings.
It should be noted that, without conflict, the following embodiments and features in the embodiments may be combined with each other; and, based on the embodiments in this disclosure, all other embodiments that may be made by one of ordinary skill in the art without inventive effort are within the scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
Referring to fig. 1, in an exemplary embodiment of the present application, a data processing method for determining exception log information is provided, including the steps of:
S100, acquiring an original log information set A = (A1, A2, …, Ai, …, An) to be processed corresponding to a target network, where Ai = (TAGi, ATXTi); Ai is the ith original log information to be processed generated in time sequence in the target network, TAGi is the log type identifier corresponding to Ai, and ATXTi is the original log content information corresponding to Ai. Specifically, the original log information to be processed may be Linux system log information, such as "localhost sshd[1630]: Accepted password for root from 192.168.0.104 port 4229 ssh2", where the log type identifier may be "localhost sshd[1630]" or "[1630]", the original log content information is "Accepted password for root from 192.168.0.104 port 4229 ssh2", and "Accepted password for root from" is the fixed phrase segment corresponding to "[1630]".
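As an illustration only, splitting such a Linux log line into its log type identifier and its content might look like the following sketch; the tag pattern is a hypothetical assumption based on the sshd example above, not something specified by the patent:

```python
import re

def parse_log(line):
    """Split a raw syslog-style line into (log_type_tag, content).

    The "name[pid]:" tag shape is an assumption drawn from the sshd
    example; real deployments may need other patterns.
    """
    m = re.match(r"^(\S+\s+\S+\[\d+\]):\s*(.*)$", line)
    if not m:
        # No recognizable tag: treat the whole line as content.
        return None, line
    return m.group(1), m.group(2)

tag, txt = parse_log(
    "localhost sshd[1630]: Accepted password for root "
    "from 192.168.0.104 port 4229 ssh2"
)
```

In terms of the patent's notation, `tag` plays the role of TAGi and `txt` the role of ATXTi.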
S200, traversing A and, if TAGi is a target log type identifier, deleting the fixed phrase segment corresponding to TAGi from ATXTi, to obtain a target log information set B = (B1, B2, …, Bi, …, Bn) to be processed, where Bi = (TAGi, BTXTi); Bi is the target log information to be processed corresponding to Ai, and BTXTi is the target log content information corresponding to Bi; a target log type identifier is a log type identifier whose corresponding original log content information contains a fixed phrase segment. The target log type identifiers can be determined through a preset identifier list, in which all log type identifiers that have a corresponding fixed phrase segment are stored. For example, it may be checked whether TAGi belongs to the identifier list, and if so, TAGi is determined to be a target log type identifier. Meanwhile, the fixed phrase segment corresponding to each log type identifier can be recorded in the identifier list, which makes it convenient to delete the fixed phrase segment from ATXTi. It will be appreciated that if TAGi is not a target log type identifier, ATXTi is directly taken as BTXTi.
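The identifier-list lookup described above can be sketched as follows; the `FIXED_SEGMENTS` mapping and its single entry are illustrative placeholders, not values taken from the patent:

```python
# Hypothetical identifier list: maps each target log type identifier
# to its fixed phrase segment (entries invented for illustration).
FIXED_SEGMENTS = {
    "[1630]": "Accepted password for root from ",
}

def strip_fixed_segment(tag, txt):
    """Return BTXTi: the content with the tag's fixed phrase segment
    removed. A tag absent from the list is not a target log type
    identifier, so the content is returned unchanged."""
    seg = FIXED_SEGMENTS.get(tag)
    return txt.replace(seg, "") if seg else txt

btxt = strip_fixed_segment(
    "[1630]",
    "Accepted password for root from 192.168.0.104 port 4229 ssh2",
)
```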
S300, performing word segmentation on each Bi to obtain a word segmentation list set C = (C1, C2, …, Ci, …, Cn), Ci = (TAGi, C_i^1, C_i^2, …, C_i^j, …, C_i^g(i)), j = 1, 2, …, g(i); where Ci is the word segmentation list corresponding to Bi, C_i^j is the jth word in BTXTi, and g(i) is the number of words in BTXTi. Specifically, each log type identifier is treated as a separate word, and the target log content information can be segmented using a preset dictionary. The preset dictionary may be constructed from a history log information set, which stores a number of pieces of history log information, understood as all or at least part of the log information generated before the current time.
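A minimal sketch of the segmentation in S300, under the simplifying assumption that whitespace tokenization stands in for the preset dictionary the patent actually uses:

```python
def segment(tag, btxt):
    """Build Ci for one log entry: the log type identifier is kept as
    a separate leading token, followed by the tokens of BTXTi.
    Whitespace splitting is an assumption; the patent segments with a
    preset dictionary built from the history log information set."""
    return [tag] + btxt.split()

Ci = segment("[1630]", "192.168.0.104 port 4229 ssh2")
```

Here `len(Ci)` equals g(i) + 1, matching the list length used in steps S400 and S500.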
S400, performing one-hot encoding on C to obtain a first code list set D = (D1, D2, …, Di, …, Dn), Di = (D_i^1, D_i^2, …, D_i^p, …, D_i^g(i), D_i^g(i)+1), p = 1, 2, …, g(i)+1; where Di is the first code list corresponding to Ci, D_i^1 is the first code obtained by one-hot encoding TAGi, and D_i^p is the first code obtained by one-hot encoding C_i^(p-1); each first code is a one-dimensional vector composed of s feature elements; s is the number of words in the dictionary built from the history log information set.
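One-hot encoding against a small toy dictionary can be sketched as below; the dictionary contents are invented for illustration, and a real s would be the size of the dictionary built from the history log information set:

```python
def one_hot(word, dictionary):
    """One-hot encode a token against a dictionary of s words:
    a one-dimensional vector of s feature elements, with a single 1
    at the token's dictionary index."""
    vec = [0] * len(dictionary)
    vec[dictionary[word]] = 1
    return vec

# Toy dictionary (s = 5); real dictionaries are far larger.
dictionary = {"[1630]": 0, "192.168.0.104": 1, "port": 2, "4229": 3, "ssh2": 4}
tokens = ["[1630]", "192.168.0.104", "port", "4229", "ssh2"]

# Di: the tag's code first, then one code per word, g(i)+1 codes total.
D_i = [one_hot(t, dictionary) for t in tokens]
```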
S500, vectorizing D to obtain a first word vector list set E = (E1, E2, …, Ei, …, En), Ei = (E_i^1, E_i^2, …, E_i^p, …, E_i^g(i), E_i^g(i)+1); where Ei is the first word vector list corresponding to Di, E_i^p is the first word vector corresponding to D_i^p, E_i^p contains contextual semantic information, and E_i^p = D_i^p × VecT; VecT is the target word vector matrix obtained by processing the history log information set with a preset word2vec model; the size of VecT is s×h, where s is the number of rows of VecT, h is the number of columns of VecT, and h < s. h may be 300.
Since the number of feature elements in a first code obtained by one-hot encoding is extremely large (typically tens of thousands or more) and carries no contextual semantic information, directly using the first codes for semantic recognition or anomaly recognition would require an enormous amount of processing and yield low accuracy. Therefore, in this embodiment, each first code is vectorized with the target word vector matrix and converted into a first word vector, which reduces the number of feature elements while giving each first word vector contextual semantic information.
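Because each first code is one-hot, the product E_i^p = D_i^p × VecT simply selects one row of VecT, i.e. the h-dimensional embedding of that word. A toy sketch with s = 5 and h = 3 follows; the matrix values are arbitrary, whereas a real VecT would come from a trained word2vec model:

```python
# Toy target word vector matrix VecT of size s×h (values arbitrary).
s, h = 5, 3
VecT = [[0.1 * (r + c) for c in range(h)] for r in range(s)]

def to_word_vector(one_hot_code, vec_t):
    """Multiply a one-hot row vector by VecT. Since exactly one
    element of the code is 1, this reduces to picking the matching
    row of VecT: the word's embedding."""
    row = one_hot_code.index(1)
    return vec_t[row]

E_p = to_word_vector([0, 0, 1, 0, 0], VecT)
```

This is why the conversion shrinks each code from s feature elements down to h (e.g. 300) while keeping one distinct vector per dictionary word.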
Specifically, when training with the history log information set, the history log information in the set needs to be arranged in order of generation time from earliest to latest and cut into a number of subsets in units of a "day" or a "week". Each subset can be regarded as one "paragraph" of an "article", and the pieces of history log information within the same subset can be regarded as several "sentences" of a "paragraph". In this way, the target word vector matrix obtained by processing the history log information set with the word2vec model can learn the correlation between logs in their generation order, so that the first word vectors obtained through conversion by the target word vector matrix carry contextual semantic information.
S600, carrying out vector conversion processing on the E to obtain log semantic vectors F1, F2, …, fi, … and Fn corresponding to each piece of original log information to be processed; fi is a log semantic vector corresponding to Ai obtained by performing vector conversion processing on Ei.
Specifically, the step S600 includes:
s610, obtaining vector weights corresponding to each first word vector to obtain a vector weight list set Q= (Q1, Q2, …, qi, …, qn), qi= (Q) i 1 ,Q i 2 ,…,Q i p ,…,Q i g(i) ,Q i g(i)+1 ) The method comprises the steps of carrying out a first treatment on the surface of the Wherein Qi is a vector weight list corresponding to Ei, Q i p For E i p Corresponding vector weights, Q i p Is according to E i p The word frequency of the corresponding word in the history log information set is obtained, and E i p Word frequency and Q of corresponding word segmentation in history log information set i p Inversely proportional to the size of (a);
s620, obtain fi= (Σ) p=1 g(i)+1 Q i p * E i p ) /(g (i) +1) to give F1, F2, …, fi, …, fn.
Research shows that the higher the word frequency of a word in system log information, the lower its importance. Therefore, when all first word vectors corresponding to each piece of target log information to be processed are converted into the corresponding log semantic vector, the vector weights of first word vectors with high word frequency are reduced, so that the resulting log semantic vector carries more effective information and the accuracy of determining abnormal log information is improved.
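A sketch of S610/S620 under the assumption that the inverse-proportional weight is implemented as Q_i^p = 1/frequency; the patent only requires inverse proportionality, so other weighting schemes are possible:

```python
def log_semantic_vector(word_vectors, frequencies):
    """Compute Fi = (sum_p Q_i^p * E_i^p) / (g(i)+1), where each
    weight Q_i^p is taken as 1/word_frequency (one simple choice of
    inverse proportionality)."""
    n = len(word_vectors)          # g(i) + 1
    dim = len(word_vectors[0])     # h
    fi = [0.0] * dim
    for vec, freq in zip(word_vectors, frequencies):
        q = 1.0 / freq             # rarer word -> larger weight
        for k in range(dim):
            fi[k] += q * vec[k]
    return [x / n for x in fi]

# Two toy word vectors with corpus frequencies 2 and 4.
Fi = log_semantic_vector([[1.0, 0.0], [0.0, 2.0]], [2, 4])
```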
S700, determining an abnormal log information set Y = (Y1, Y2, …, Yq, …, Ym), q = 1, 2, …, m, from A according to F1, F2, …, Fi, …, Fn; where Yq is the qth abnormal log information determined from A, m is the number of abnormal log information entries determined from A, and m < n.
Specifically, the abnormal log information set Y may be determined from A according to F1, F2, …, Fi, …, Fn either by performing semantic recognition on F1, F2, …, Fi, …, Fn or by clustering F1, F2, …, Fi, …, Fn.
According to the data processing method provided by this embodiment, after the original log information set A to be processed is obtained, the fixed phrase segments carried in the original log content corresponding to some log type identifiers are easily mistaken for important semantic information during semantic extraction because they occur with high frequency, even though their actual importance is low. Therefore, in this embodiment, the fixed phrase segments in the original log content information corresponding to target log type identifiers are deleted, which achieves a denoising effect.
Meanwhile, in this embodiment, the target log information to be processed in the target log information set B is arranged in its generation order, so the content of B can be regarded as having a structure similar to an "article" (i.e., ordered rather than unordered). After being converted with the target word vector matrix, each first word vector contains contextual semantic information, so the generation order of the original log information can be taken into account when determining abnormal log information from the first word vectors and the log semantic vectors generated from them. As a result, abnormal log information that can only be determined by jointly analyzing several logs can be effectively identified.
Further, in this embodiment, through the processing of steps S200 to S600, log semantic vectors carrying contextual semantic information can be obtained, and by means of clustering or semantic recognition, the abnormal log information set Y can be quickly determined from A. Compared with identification using regular expressions, the processing load is greatly reduced, and because identification can incorporate contextual semantic information, the identification accuracy is higher.
In an exemplary embodiment of the present application, the step S700 includes:
s710, carrying out random initialization to obtain a first unit vector W1 and a second unit vector W2; w1 and W2 are each one-dimensional vectors composed of h feature elements.
S720, a first parameter t=0 is acquired.
S730, if t = h, setting t = 1; otherwise, setting t = t + 1.
S740, according to Ft, obtaining Dist1 = sqrt(Σ_{k=1}^{h} (X1k − X2k)²) and Dist2 = sqrt(Σ_{k=1}^{h} (X1k − X3k)²); where X1k is the kth feature element in Ft, X2k is the kth feature element in W1, and X3k is the kth feature element in W2.
S750, if Dist1 < Dist2, setting W1 = W1 + β(Ft − W1); otherwise, setting W2 = W2 + β(Ft − W2); where β is the learning rate, and the magnitude of β changes as the number of executions of step S750 increases. Specifically, β may be determined by the Mexican hat function or the top-hat function.
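One competitive-learning step combining S740 and S750 might look like this sketch (Euclidean distance, fixed β for simplicity; the β schedule from the Mexican hat or top-hat function is omitted):

```python
import math

def nearest_and_update(ft, w1, w2, beta):
    """One competitive step: compute the Euclidean distances of S740,
    then move whichever unit vector is closer to Ft toward it by
    W = W + beta * (Ft - W), as in S750."""
    d1 = math.sqrt(sum((a - b) ** 2 for a, b in zip(ft, w1)))
    d2 = math.sqrt(sum((a - b) ** 2 for a, b in zip(ft, w2)))
    if d1 < d2:
        w1 = [w + beta * (f - w) for w, f in zip(w1, ft)]
    else:
        w2 = [w + beta * (f - w) for w, f in zip(w2, ft)]
    return w1, w2

# Ft = [1,1] is closer to W1 = [0,0] than to W2 = [5,5], so only W1 moves.
w1, w2 = nearest_and_update([1.0, 1.0], [0.0, 0.0], [5.0, 5.0], 0.5)
```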
S760, clustering F1, F2, …, Fi, …, Fn according to W1 and W2 to obtain a first clustering result and a second clustering result. The first clustering result is obtained with W1 as the cluster center, and the second clustering result is obtained with W2 as the cluster center. Each clustering result contains at least one log semantic vector.
S770, determining, according to the first clustering result and the second clustering result, whether the clustering termination condition is currently met; if so, proceeding to step S780, otherwise returning to step S730; the clustering termination condition includes: the ratio of the number of log semantic vectors in the first clustering result to the number of log semantic vectors in the second clustering result is smaller than 0.0001 or larger than 10000.
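The ratio-based termination test of S770 reduces to a one-line check; the function name is illustrative:

```python
def termination_met(n_cluster1, n_cluster2):
    """Ratio-based stop rule: terminate once one cluster is at least
    10000 times smaller than the other, i.e. the ratio of the two
    cluster sizes falls below 0.0001 or exceeds 10000."""
    ratio = n_cluster1 / n_cluster2
    return ratio < 0.0001 or ratio > 10000
```

The highly unbalanced split this enforces matches the expectation that abnormal logs (step S780 keeps the smaller cluster) are rare compared with normal logs.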
S780, determining, of the first clustering result and the second clustering result, the one containing the smaller number of log semantic vectors as the target clustering result.
S790, determining the original log information to be processed corresponding to each log semantic vector in the target clustering result as abnormal log information, thereby obtaining Y.
In this embodiment, by cyclically executing steps S730 to S760, the first unit vector W1 and the second unit vector W2 are continuously optimized, so that they gradually move apart from each other and each approaches a cluster center. In the optimization process, each time t increases to h can be regarded as one round of optimization, and in each round the log semantic vectors are input in order of the generation time of the corresponding original logs to be processed, so that context-related information can be learned during optimization, further improving the accuracy of subsequent clustering. Specifically, an SOM model may also be used to perform the optimization and clustering of W1 and W2. In a conventional clustering process, clustering can end as soon as W1 and W2 reach a convergence condition (e.g., the changes of W1 and W2 become stable). However, research shows that using the conventional convergence condition as the clustering termination condition may result in too much abnormal log information being determined (i.e., low accuracy). Therefore, in this embodiment, the ratio of the numbers of log semantic vectors in the two clustering results is used as the termination condition, so that the ratio of normal log information to abnormal log information has converged by the time the clustering results are manually inspected, which improves the accuracy of the abnormal log information. Experiments show that the accuracy of the clustering result obtained with the clustering termination condition provided in this embodiment is significantly higher than that obtained with the conventional convergence condition.
In one exemplary embodiment of the present application, the cluster termination condition includes:
W1 and W2 meet a preset convergence condition; and the ratio of the number of log semantic vectors in the first clustering result to the number of log semantic vectors in the second clustering result is smaller than 0.0001 or larger than 10000.
In this embodiment, the clustering termination condition is set to two conditions that must be satisfied simultaneously: only after W1 and W2 satisfy a preset convergence condition (e.g., the changes of W1 and W2 become stable) is it judged whether the ratio of the number of log semantic vectors in the first clustering result to the number in the second clustering result is smaller than 0.0001 or larger than 10000. In this way, the accuracy of clustering can be further improved.
In an exemplary embodiment of the present application, the size of β decreases as the number of executions of step S750 increases.
And the cluster termination condition includes:
beta is smaller than a preset learning rate threshold; and the number of the semantic vectors in the first clustering result and the number of the semantic vectors in the second clustering result are smaller than 0.0001 or larger than 10000.
In this embodiment, β decreases as the number of learning iterations increases and gradually tends to 0; when the learning rate tends to 0, W1 and W2 can likewise be considered stable. At that point, it is judged whether the ratio of the number of log semantic vectors in the first clustering result to the number in the second clustering result is smaller than 0.0001 or larger than 10000. In this way, the accuracy of clustering can be further improved.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
Those skilled in the art will appreciate that the various aspects of the present application may be implemented as a system, method, or program product. Accordingly, aspects of the present application may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.) or an embodiment combining hardware and software aspects may be referred to herein as a "circuit," module "or" system.
An electronic device according to this embodiment of the present application. The electronic device is only one example and should not impose any limitation on the functionality and scope of use of the embodiments of the present application.
The electronic device is in the form of a general purpose computing device. Components of an electronic device may include, but are not limited to: the at least one processor, the at least one memory, and a bus connecting the various system components, including the memory and the processor.
Wherein the memory stores program code that is executable by the processor to cause the processor to perform steps according to various exemplary embodiments of the present application described in the above section of the "exemplary method" of the present specification.
The memory may include readable media in the form of volatile memory, such as random access memory (RAM) and/or cache memory, and may further include read-only memory (ROM).
The storage may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The bus may be one or more of several types of bus structures including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any device (e.g., router, modem, etc.) that enables the electronic device to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface. And, the electronic device may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through a network adapter. The network adapter communicates with other modules of the electronic device via a bus. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with an electronic device, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of the embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions to cause a computing device (such as a personal computer, a server, a terminal device, or a network device) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium having stored thereon a program product capable of implementing the method described above in the present specification is also provided. In some possible implementations, the various aspects of the present application may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the present application as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
Furthermore, the above-described figures are only illustrative of the processes involved in the method according to exemplary embodiments of the present application, and are not intended to be limiting. It will be readily appreciated that the processes shown in the above figures do not indicate or limit the temporal order of these processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, for example, among a plurality of modules.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily conceivable by those skilled in the art within the technical scope of the present application should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A data processing method for determining abnormal log information, comprising:
s100, acquiring An original log information set a= (A1, A2, …, ai, …, an), ai= (TAGi, atati); ai is the ith original log information to be processed generated in time sequence in the target network, TAGi is the log type identifier corresponding to Ai, and ATXTi is the original log content information corresponding to Ai;
s200, traversing A, and deleting a fixed speech segment corresponding to TAGI in ATXTi if TAGI is a target log type identifier to obtain a target log information set B= (B1, B2, …, bi, …, bn) to be processed, wherein Bi= (TAGI, BTXTi); bi is target log information to be processed corresponding to Ai, and BTXTi is target log content information corresponding to Bi; the target log type identifier is a log type identifier containing a fixed language segment in the corresponding original log content information;
s300, performing word segmentation processing on Bi to obtain word segmentation list sets C= (C1, C2, …, ci, …, cn), ci= (TAGI, C) i 1 ,C i 2 ,…,C i j ,…,C i g(i) ) J=1, 2, …, g (i); ci is a word segmentation list corresponding to Bi, C i j For the j-th word in BTXTi, g (i) is the number of words in BTXTi;
s400, performing heat single coding processing on the C to obtain a first coding list set D= (D1, D2, …, di, …, dn), di= (D) i 1 ,D i 2 ,…,D i p ,…,D i g(i) ,D i g(i)+1 ) P=1, 2, …, g (i) +1; di is a first code list corresponding to Ci, D i 1 D for the first code obtained after the heat-unique coding treatment of TAGI i p To C i p-1 Performing a hot-independent coding treatment to obtain a first code; each first code is a one-dimensional vector consisting of s characteristic elements; s is the number of words in a dictionary formed by using the history log information set;
s500, vectorizing the D to obtain a first word vector list set E= (E1, E2, …, ei, …, en), ei= (E) i 1 ,E i 2 ,…,E i p ,…,E i g(i) ,E i g(i)+1 ) The method comprises the steps of carrying out a first treatment on the surface of the Ei is a first word vector list corresponding to Di, E i p For D i p Corresponding first word vector, E i p Containing contextual semantic information, E i p =D i p X VecT; vecT is a target word vector matrix obtained by processing the history log information set according to a preset word2vec model; the size of VecT is s×h, s is the number of rows of VecT, h is the number of columns of VecT, and h ≪ s; e is used for determining an abnormal log information set Y from the A;
s600, carrying out vector conversion processing on the E to obtain log semantic vectors F1, F2, …, fi, … and Fn corresponding to each piece of original log information to be processed; fi is a log semantic vector corresponding to Ai obtained by performing vector conversion processing on Ei;
s700, determining an anomaly log information set y= (Y1, Y2, …, yq, …, ym), q=1, 2, …, m, from a according to F1, F2, …, fi, …, fn; wherein Yq is the q-th abnormal log information determined from the a, and m is the number of abnormal log information determined from the a; m is less than n.
2. The data processing method according to claim 1, wherein the step S600 includes:
s610, obtaining vector weights corresponding to each first word vector to obtain a vector weight list set Q= (Q1, Q2, …, qi, …, qn), qi= (Q) i 1 ,Q i 2 ,…,Q i p ,…,Q i g(i) ,Q i g(i)+1 ) The method comprises the steps of carrying out a first treatment on the surface of the Wherein Qi is a vector weight list corresponding to Ei, Q i p For E i p Corresponding vector weights, Q i p Is according to E i p The word frequency of the corresponding word in the history log information set is obtained, and E i p Word frequency and Q of corresponding word segmentation in history log information set i p Inversely proportional to the size of (a);
s620, obtain fi= (Σ) p=1 g(i)+1 Q i p * E i p ) /(g (i) +1) to give F1, F2, …, fi, …, fn.
3. The data processing method according to claim 2, wherein the step S700 includes:
s710, carrying out random initialization to obtain a first unit vector W1 and a second unit vector W2; w1 and W2 are one-dimensional vectors composed of h characteristic elements;
s720, acquiring a first parameter t=0;
s730, if t=n, then t=1 is obtained, otherwise, t=t+1 is obtained;
s740, according to Ft, obtain dist1=sqrt (Σ) k=1 h (X1 k-X2 k)), and dist2=sqrt (Σ) k=1 h (X1 k-X3 k)); wherein X1k is the kth characteristic element in Ft, X2k is the kth characteristic element in W1, and X3k is the kth characteristic element in W2;
s750, if Dist1 < Dist2, obtaining w1=w1+β (Ft-W1); otherwise, obtain w2=w2+β (Ft-W2); wherein β is the learning rate, and the magnitude of β changes with the increase of the execution times of step S750;
s760, clustering F1, F2, …, fi, … and Fn according to W1 and W2 to obtain a first clustering result and a second clustering result;
s770, determining whether the clustering termination condition is met currently according to the first clustering result and the second clustering result, if so, entering step S780, otherwise, entering step S730; the clustering termination condition includes: the number of the semantic vectors in the first clustering result and the number of the semantic vectors in the second clustering result are smaller than 0.0001 or larger than 10000;
s780, determining that the number of the log semantic vectors in the first clustering result and the second clustering result is smaller as a target clustering result;
s790, determining the original log information to be processed corresponding to each log semantic vector in the target clustering result as abnormal log information, so as to obtain Y.
4. The data processing method according to claim 3, wherein the clustering termination condition comprises:
W1 and W2 meet a preset convergence condition; and the ratio of the number of log semantic vectors in the first clustering result to the number of log semantic vectors in the second clustering result is smaller than 0.0001 or larger than 10000.
5. The data processing method according to claim 3, wherein the magnitude of β decreases as the number of executions of step S750 increases.
6. The data processing method according to claim 5, wherein the clustering termination condition includes:
β is smaller than a preset learning rate threshold; and the ratio of the number of log semantic vectors in the first clustering result to the number of log semantic vectors in the second clustering result is smaller than 0.0001 or larger than 10000.
7. The data processing method according to claim 5, wherein the original log information to be processed is Linux system log information.
8. An electronic device comprising a processor and a memory;
the processor is adapted to perform the steps of the method according to any of claims 1 to 7 by invoking a program or instruction stored in the memory.
9. A non-transitory computer-readable storage medium storing a program or instructions that cause a computer to perform the steps of the method of any one of claims 1 to 7.
CN202310160654.XA 2023-02-24 2023-02-24 Data processing method, electronic equipment and medium for determining abnormal log information Active CN115860008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310160654.XA CN115860008B (en) 2023-02-24 2023-02-24 Data processing method, electronic equipment and medium for determining abnormal log information


Publications (2)

Publication Number Publication Date
CN115860008A CN115860008A (en) 2023-03-28
CN115860008B true CN115860008B (en) 2023-05-12

Family

ID=85658835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310160654.XA Active CN115860008B (en) 2023-02-24 2023-02-24 Data processing method, electronic equipment and medium for determining abnormal log information

Country Status (1)

Country Link
CN (1) CN115860008B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116866047A (en) * 2023-07-18 2023-10-10 Shandong Suyuan Safety Technology Co., Ltd. Method, medium and device for determining malicious equipment in industrial equipment network
CN116599778B (en) * 2023-07-18 2023-09-26 Shandong Suyuan Safety Technology Co., Ltd. Data processing method for determining malicious device

Citations (2)

Publication number Priority date Publication date Assignee Title
CN113986864A (en) * 2021-11-11 2022-01-28 CCB Fintech Co., Ltd. Log data processing method and device, electronic equipment and storage medium
CN115600607A (en) * 2022-11-02 2023-01-13 Agricultural Bank of China Co., Ltd. (CN) Log detection method and device, electronic equipment and medium

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN112613309A (en) * 2020-12-24 2021-04-06 Beijing Inspur Data Technology Co., Ltd. Log classification analysis method, device and equipment and readable storage medium
CN113434357B (en) * 2021-05-17 2023-04-11 Institute of Information Engineering, Chinese Academy of Sciences Log anomaly detection method and device based on sequence prediction
CN114610515B (en) * 2022-03-10 2022-09-13 University of Electronic Science and Technology of China Multi-feature log anomaly detection method and system based on log full semantics
CN114785606B (en) * 2022-04-27 2024-02-02 Harbin Institute of Technology Log anomaly detection method based on pretrained LogXLnet model, electronic equipment and storage medium
CN115454788A (en) * 2022-08-09 2022-12-09 Changsha University of Science and Technology Log anomaly detection method, device, equipment and storage medium
CN115454706A (en) * 2022-10-17 2022-12-09 Agricultural Bank of China Co., Ltd. System abnormity determining method and device, electronic equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant