Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
The term "preset feature model" as used herein covers 26 preset code block features and file features, where the term "code block feature" describes a feature of the code block in which a logging statement is located, for example the source lines of code (SLOC) of the block or the names of the methods called in the block, and the term "file feature" describes a feature of the file in which the logging statement is located, for example the density of logging statements in the file. The term "log level" as used herein is defined uniformly and includes a fatal level, an error level, a warning level, an information level, a debug level, and a trace level.
Example one
Fig. 1 is a flowchart of a log-level prediction method in a first embodiment of the present invention, where the present embodiment is applicable to a case where log-level prediction is performed on a to-be-inserted log record statement, and the method may be executed by a log-level prediction apparatus, which may be implemented in software and/or hardware, and may be generally integrated in a computer device. As shown in fig. 1, the method of the embodiment of the present invention specifically includes:
Step 110: obtain a code block into which a logging statement is to be inserted, where the logging statement forms a log record after being triggered and executed.
In this embodiment, the logging statement is used to record information or to handle errors encountered while the system is running, and forms a log record once triggered and executed. There are 10 types of code blocks into which a logging statement may be inserted: CatchBlock, TryBlock, IfBlock, SwitchBlock, ForBlock, WhileBlock, DoBlock, MethodBlock, SynchronizedBlock, and ClassBlock. Each code block contains only one logging statement for recording the relevant information, and because of syntactic access restrictions, a logging statement can only access variables inside its own code block.
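For illustration, the ten block types can be collected into an enumeration; this sketch is not part of the claimed method, and only the ten block names come from the description:

```python
from enum import Enum

class BlockType(Enum):
    """The ten kinds of code blocks that can host a logging statement."""
    CATCH = "CatchBlock"
    TRY = "TryBlock"
    IF = "IfBlock"
    SWITCH = "SwitchBlock"
    FOR = "ForBlock"
    WHILE = "WhileBlock"
    DO = "DoBlock"
    METHOD = "MethodBlock"
    SYNCHRONIZED = "SynchronizedBlock"
    CLASS = "ClassBlock"
```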
Step 120: extract features of the code block and of the file to which the code block belongs according to a preset feature model, obtaining the code block features and the file features.
The feature model comprises the 26 preset code block features and file features. The code block features describe the code block in which the logging statement is located; since the code block determines the trigger condition of the logging statement, it must be considered when choosing a log level for that statement. The file features describe the file in which the logging statement is located; logging statements in the same file often share the same logging purpose and may record the same functional information.
Optionally, extracting features of the code block and of the file to which it belongs comprises: inputting the code block and its file into a source code analysis tool to obtain the code block features and the file features. The source code analysis tool may be Javalogextra (JLE), a lightweight tool built around the JavaParser toolkit, which can be used to analyze source code and extract features.
Optionally, the code block features include text content features and syntactic features, where the text content features in turn include the structural features. Regarding the structural features: because the source code of a code block has a clear structure, the structural features of the source code can be used to extract context information. Regarding the text content features: since source code is also text, all textual content in the code block is extracted, including method names, variable names, exception types, and the like; combined with the extracted structural features, these can form the full text content. Regarding the syntactic features: developers usually react to runtime errors by setting flags, re-throwing exceptions, or returning special values, so the key syntactic features must be extracted from each code block in order to capture these contextual factors.
Optionally, the structural features include at least: the source lines of code (SLOC) of the code block, the number of methods called by the code block, and the number of variables declared in the code block. The text content features include at least: the structural features, the names of the methods called by the code block, the names of the variables declared in the code block, the type of the code block, and the type of the trigger policy. The syntactic features include at least: whether there is a throw statement and whether there is a return value. The file features include at least: the density of logging statements in the file, the average length of logging statements in the file, and the class name of the file.
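As a rough sketch of how the structural features named above could be computed, the snippet below counts SLOC, called methods, and declared variables with regular expressions. This is an approximation of this description's own making; a real extractor such as JLE would walk the JavaParser syntax tree instead:

```python
import re

def structural_features(block_source: str) -> dict:
    """Count SLOC, called methods, and declared variables in a Java block.
    Regex-based approximation; a real extractor (e.g. JLE on top of
    JavaParser) would walk the parsed syntax tree instead."""
    # non-blank lines as a stand-in for source lines of code (SLOC)
    sloc = sum(1 for line in block_source.splitlines() if line.strip())
    # an identifier directly followed by '(' that is not a control keyword
    calls = [m for m in re.findall(r"\b(\w+)\s*\(", block_source)
             if m not in {"if", "for", "while", "switch", "catch", "synchronized"}]
    # crude declaration pattern: "Type name =" or "Type name;"
    decls = re.findall(
        r"\b(?:int|long|boolean|double|float|char|byte|short|[A-Z]\w*(?:<[^>]*>)?)\s+(\w+)\s*[=;]",
        block_source)
    return {"sloc": sloc,
            "num_called_methods": len(calls),
            "num_declared_variables": len(decls)}

block = """\
try {
    Connection conn = driver.connect(url);
    int retries = 0;
    logger.info("connected");
} catch (IOException e) {
}"""
print(structural_features(block))
```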
Step 130: predict a log level according to the code block features and the file features, where the log level describes the level of detail of the information recorded in the log.
Optionally, the log levels include, in order of decreasing importance: a fatal level, an error level, a warning level, an information level, a debug level, and a trace level; the lower the importance of a level, the more detailed the information recorded in the log.
In this embodiment, the fatal level refers to very serious error events that may cause the application to abort. The error level refers to error events that still allow the application to continue running. The warning level refers to potentially harmful situations. The information level refers to informational messages that highlight the progress of the application at a coarse level of granularity. The debug level refers to fine-grained informational events that are most useful for debugging an application. The trace level refers to informational events at an even finer granularity than "debug".
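The six levels and their ordering can be sketched as an enumeration; the numeric values below are arbitrary rank markers introduced for illustration, not defined by the source:

```python
from enum import IntEnum

class LogLevel(IntEnum):
    """Importance decreases from FATAL to TRACE; the lower the importance,
    the more detailed the information recorded at that level."""
    FATAL = 6
    ERROR = 5
    WARN = 4
    INFO = 3
    DEBUG = 2
    TRACE = 1
```

A comparison such as `LogLevel.TRACE < LogLevel.DEBUG` then expresses that trace records finer-grained detail than debug.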
Optionally, predicting the log level according to the code block features and the file features may include: setting a standard feature vector for each log level in advance, computing the similarity between the feature vector formed from the code block features and file features of the logging statement to be inserted and the standard vector of each level, and selecting the log level with the highest similarity as the level of the logging statement to be inserted.
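A minimal sketch of this similarity-based selection, using cosine similarity; the standard vectors, their dimensions, and the choice of cosine similarity are hypothetical, as the source does not fix a similarity measure:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def predict_level(feature_vec, standard_vecs):
    """Pick the level whose preset standard vector is most similar to the
    feature vector of the statement to be inserted."""
    return max(standard_vecs,
               key=lambda lvl: cosine(feature_vec, standard_vecs[lvl]))

# hypothetical 4-dimensional standard vectors per level
standard = {
    "error": [0.9, 0.1, 0.8, 0.2],
    "info":  [0.2, 0.9, 0.1, 0.7],
    "debug": [0.1, 0.3, 0.2, 0.9],
}
print(predict_level([0.85, 0.15, 0.7, 0.25], standard))  # closest to "error"
```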
Alternatively, considering that the code block features and file features can be divided into three types, namely numerical features, Boolean features, and numerical text features, feature vectors of standard numerical features, standard Boolean features, and standard numerical text features can be set for each log level in advance. The Euclidean distances between the numerical, Boolean, and numerical text feature vectors of the logging statement to be inserted and the corresponding standard vectors of each level are then computed, the three distances are combined by weighting, and the log level with the smallest weighted distance is selected.
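A sketch of this weighted-distance variant; the weights and standard vectors below are placeholders, since the source does not prescribe specific values:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_level_weighted(numeric, boolean, text, standards,
                           weights=(0.4, 0.2, 0.4)):
    """For each level, compute the Euclidean distance between the statement's
    numerical / Boolean / numerical-text vectors and that level's standard
    vectors, combine the three distances with weights, and return the level
    with the smallest weighted distance (i.e. the closest level)."""
    wn, wb, wt = weights

    def weighted(lvl):
        sn, sb, st = standards[lvl]
        return (wn * euclidean(numeric, sn)
                + wb * euclidean(boolean, sb)
                + wt * euclidean(text, st))

    return min(standards, key=weighted)

# hypothetical (numerical, Boolean, numerical-text) standard vectors per level
standards = {
    "error": ([12.0, 3.0], [1.0, 0.0], [0.8, 0.1]),
    "debug": ([4.0, 1.0], [0.0, 1.0], [0.2, 0.7]),
}
print(predict_level_weighted([11.0, 2.5], [1.0, 0.0], [0.75, 0.2], standards))
```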
Optionally, predicting the log level according to the code block features and the file features includes: inputting the code block features and the file features into a pre-trained log level prediction model to obtain the log level.
In this embodiment, the log level of the logging statement to be inserted may be determined by computing feature-vector similarity, by computing a weighted Euclidean distance over each type of feature vector, by a trained log level prediction model, or in another manner, which this embodiment does not limit.
Step 140: insert the logging statement into the code block according to the log level.
After the log level of the logging statement to be inserted is determined, the logging statement is inserted into the corresponding code block and its level is set to the determined log level so that the appropriate information is recorded.
According to the technical solution of this embodiment of the invention, a code block into which a logging statement is to be inserted is obtained, the logging statement forming a log record after being triggered and executed; features of the code block and of the file to which it belongs are extracted according to a preset feature model, obtaining code block features and file features; a log level, which describes the level of detail of the information recorded in the log, is predicted from the code block features and file features; and the logging statement is inserted into the code block according to that level. This solves the prior-art problem that the level of a logging statement is decided solely on the basis of developers' experience and domain knowledge, which lowers development efficiency; it realizes log level prediction for the logging statement to be inserted, shortens the time needed to decide the log level, and improves development efficiency.
On the basis of the foregoing embodiment, optionally, before the code block into which the logging statement is to be inserted is obtained, the method may further include: searching a training project for logging statements and the code blocks containing them; extracting features of those code blocks and of the files to which they belong according to the preset feature model, obtaining code block features and file features matching the preset feature model; and inputting the extracted code block features and file features into a preset algorithm model for training to obtain a log level prediction model.
The benefit of this arrangement is that a prediction of the log level can be obtained directly from the prediction model; because the model is trained on code block features and file features, it combines the influence of both the code block and its file on the log level, which yields higher prediction accuracy.
Optionally, after searching the training project for the logging statements and the code blocks containing them, the method may further include: acquiring the log level and the contributor of each logging statement; and screening the logging statements for validity according to their log levels and contributors.
The benefit of this arrangement is that the reliability of the data is further ensured, bias caused by data quality problems is avoided, and the learned classification rules are guaranteed to come from stable, high-quality, valid data.
Optionally, screening the logging statements for validity according to their log levels and contributors can be implemented as follows: a logging statement is kept if its contributor is in a preset contributor list and its log level is consistent with the log level that contributor outputs; a logging statement whose log level has remained consistent since it was output is kept if its contributor is not in the preset contributor list and the number of files containing all of that contributor's logging statements is greater than or equal to a file-count threshold; and such a logging statement is likewise kept if its contributor is not in the preset contributor list, the number of logging statements in the file to which it belongs is less than or equal to a statement-count threshold, and the density of logging statements in that file is less than or equal to a statement-density threshold.
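One reading of these screening rules, sketched as a filter; the threshold defaults and the `trusted_levels` structure are placeholders introduced for illustration, not values fixed by the source:

```python
def keep_statement(stmt, trusted_levels, file_count, stmt_count, density,
                   file_threshold=5, count_threshold=30, density_threshold=0.1):
    """Validity screening for a mined logging statement.
    `trusted_levels` maps each contributor in the preset list to the log
    level(s) that contributor is known to output; all threshold defaults
    are placeholders, not values fixed by the source."""
    if stmt["contributor"] in trusted_levels:
        # trusted contributor: keep only if the level matches expectations
        return stmt["level"] in trusted_levels[stmt["contributor"]]
    # contributor outside the list: keep prolific contributors...
    if file_count >= file_threshold:
        return True
    # ...or statements coming from sparsely logged files
    return stmt_count <= count_threshold and density <= density_threshold

stmt = {"contributor": "alice", "level": "error"}
print(keep_statement(stmt, {"alice": {"error", "warn"}}, file_count=2,
                     stmt_count=10, density=0.05))  # True
```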
Optionally, before the extracted code block features and file features are input into the preset algorithm model for training to obtain the log level prediction model, the method may further include: sequentially applying camel-case splitting, lowercasing, stop-word removal, stemming and lemmatization, and term frequency-inverse document frequency (TF-IDF) processing to the text content features among the code block features; and reducing the dimensionality of the TF-IDF-processed text content features with a text mining classifier to obtain numerical text features.
The benefit of this arrangement is that redundant information in the text content features can be removed and the features converted into a numerical representation, which solves the problem that text content features cannot be fed directly into a machine learning model.
Optionally, reducing the dimensionality of the TF-IDF-processed text content features with a text mining classifier to obtain numerical text features can be implemented as follows: dividing the TF-IDF-processed text content features into a first sample and a second sample by stratified random sampling; learning, with a naive Bayes algorithm, a first text mining classifier on the first sample and a second text mining classifier on the second sample; assigning a second confidence score matrix to the second sample using the first text mining classifier, and assigning a first confidence score matrix to the first sample using the second text mining classifier; the first and second confidence score matrices are the numerical text features.
The benefit of this arrangement is that the features generated by preprocessing can be reduced in dimensionality to produce numerical text features, which solves the problem that the overly large feature dimensionality produced from the text content features during preprocessing dilutes the contribution of the numerical and Boolean features in the model.
Optionally, before the extracted code block features and file features are input into the preset algorithm model for training to obtain the log level prediction model, the method may further include: constructing the algorithm model in advance using any one of a decision tree algorithm, a support vector machine algorithm, a logistic regression algorithm, or a convolutional neural network algorithm.
The benefit of this arrangement is that the respective strengths of different algorithms in predicting logging statement levels can be exploited more comprehensively, so that the recommendation is as accurate as possible.
Optionally, searching the training project for the logging statements and the code blocks containing them can be implemented as follows: acquiring all source code files in the training project; finding the logging statements in each source code file by regular-expression matching; and traversing toward the root node by syntax-tree analysis to find the code block containing each logging statement.
Example two
Fig. 2a is a flowchart of a log level prediction method according to a second embodiment of the present invention. This embodiment may be combined with the various alternatives of the embodiments above. In this embodiment of the invention, before the code block into which the logging statement is to be inserted is obtained, the method further includes: searching a training project for logging statements and the code blocks containing them; extracting features of those code blocks and of the files to which they belong according to the preset feature model, obtaining code block features and file features matching the preset feature model; and inputting the extracted code block features and file features into a preset algorithm model for training to obtain a log level prediction model.
Step 210: search the training project for logging statements and the code blocks containing them.
In this embodiment, to ensure that the code of the training project is of high enough quality that sound decision rules for logging statement levels can be learned, Java projects among the top one hundred on GitHub that are jointly developed by many contributors, have run for a long time, and span a wide range of domains can be selected as training projects.
Optionally, searching the training project for logging statements and the code blocks containing them includes: acquiring all source code files in the training project; finding the logging statements in each source code file by regular-expression matching; and traversing toward the root node by syntax-tree analysis to find the code block containing each logging statement.
This embodiment can traverse the directory of the current training project to find all source code files, extract the contributor and content of each source line, splice the content into Java source code and feed it into the JLE tool, then locate the logging statements in the project by regular-expression matching, and traverse upward through each logging statement's parent nodes until a node is determined to be one of the ten code block types.
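The two steps can be sketched as follows: regular-expression matching to find logging calls, and upward traversal to the enclosing block. The regex assumes log4j/SLF4J-style `logger.<level>(...)` calls, and the toy parent-linked nodes stand in for a JavaParser syntax tree; both are this description's own simplifications:

```python
import re

# matches calls such as logger.error("...") or log.debug("...")
LOG_CALL = re.compile(
    r"\blog(?:ger)?\s*\.\s*(fatal|error|warn|info|debug|trace)\s*\(",
    re.IGNORECASE)

BLOCK_KINDS = {"CatchBlock", "TryBlock", "IfBlock", "SwitchBlock", "ForBlock",
               "WhileBlock", "DoBlock", "MethodBlock", "SynchronizedBlock",
               "ClassBlock"}

def find_log_levels(source: str):
    """Regex pass over a source file: return the level of every logging call."""
    return [m.group(1).lower() for m in LOG_CALL.finditer(source)]

def enclosing_block(node):
    """Walk parent links toward the root until one of the ten block kinds
    is reached (toy stand-in for syntax-tree traversal)."""
    while node is not None and node["kind"] not in BLOCK_KINDS:
        node = node["parent"]
    return node

src = 'try { logger.error("boom"); } finally { log.debug("done"); }'
print(find_log_levels(src))  # ['error', 'debug']

method = {"kind": "MethodBlock", "parent": None}
stmt = {"kind": "ExpressionStmt",
        "parent": {"kind": "BlockStmt", "parent": method}}
print(enclosing_block(stmt)["kind"])  # MethodBlock
```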
Optionally, after searching the training project for the logging statements and the code blocks containing them, the method further includes: acquiring the log level and the contributor of each logging statement; and screening the logging statements for validity according to their log levels and contributors.
To further ensure data reliability and avoid bias caused by data quality problems, the logging statements collected from the training project need to be screened to filter out statements that may have been changed or whose levels are inappropriate, ensuring that the learned classification rules come from stable, high-quality, valid data.
Optionally, screening the logging statements for validity according to their log levels and contributors may include: keeping a logging statement if its contributor is in a preset contributor list and its log level is consistent with the log level that contributor outputs; keeping a logging statement whose log level has remained consistent since it was output if its contributor is not in the preset contributor list and the number of files containing all of that contributor's logging statements is greater than or equal to a file-count threshold; and likewise keeping such a logging statement if its contributor is not in the preset contributor list, the number of logging statements in the file to which it belongs is less than or equal to a statement-count threshold, and the density of logging statements in that file is less than or equal to a statement-density threshold.
Step 220: extract features of the code blocks and of the files to which they belong according to the preset feature model, obtaining code block features and file features matching the preset feature model.
In this embodiment, the preset feature model includes the following features to be extracted: the trigger policy type, the type of the code block, the exception type of the code block, the names of the methods called by the code block, the names of the callers of those methods, the number of variables declared in the code block, the names of the variables declared in the code block, the types of the variables declared in the code block, whether there is an assert statement, whether there is a throw statement, whether there is a JDBC statement, whether there are other log statements, whether there is a thread statement, whether there is a return value, whether there is a flag statement, the source lines of code (SLOC) of the code block, the number of other log statements, the number of methods called by the code block, the number of parameters of the methods called by the code block, the density of logging statements in the file, the number of logging statements in the file, the average length of logging statement parameters in the file, the maximum log level in the file, the class name of the file, and the package name of the file.
Step 230: input the extracted code block features and file features into the preset algorithm model for training to obtain a log level prediction model.
Optionally, before the extracted code block features and file features are input into the preset algorithm model for training to obtain the log level prediction model, the method further includes: sequentially applying camel-case splitting, lowercasing, stop-word removal, stemming and lemmatization, and term frequency-inverse document frequency (TF-IDF) processing to the text content features among the code block features; and reducing the dimensionality of the TF-IDF-processed text content features with a text mining classifier to obtain numerical text features.
In this embodiment, as shown in fig. 2b, since the text content features cannot be learned directly as input data for a machine learning model, a series of preprocessing steps is required to remove redundant information and convert them into a numerical representation. However, the feature dimensionality generated when preprocessing the text content features is so large that it dilutes the contribution of the numerical and Boolean features in the model, so the preprocessed features must additionally be reduced in dimensionality with a Bayesian text mining classifier to produce the numerical text features.
In this embodiment, as shown in fig. 2b, the specific preprocessing steps for the extracted text content features are as follows. 1) Camel-case splitting: since the extracted text content features are all identifier names, such as method names and variable names, and the Java naming convention defaults to camel case, the camel-case rule can be used to conveniently separate the concatenated words; this operation is essentially word segmentation. 2) Lowercasing: since capital letters in Java code exist only to satisfy the camel-case convention and carry no special meaning as they do in English grammar, all capital letters in the extracted text features are converted to lowercase for uniform processing. 3) Stop-word removal: stop words are mainly adverbs, adjectives, and some conjunctions, such as "the" and "is", which are usually meaningless for text classification and are removed. 4) Stemming and lemmatization: to unify the information extracted from the text content features, stemming and lemmatization are applied so that the basic meaning of each word is retained. 5) TF-IDF processing: weights are set for the text content features, completing the conversion from text to numbers.
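The five steps can be sketched end to end; the stop-word list and the suffix-stripping stemmer below are drastic simplifications introduced for illustration (a real pipeline would use a full stop-word list and, for example, a Porter stemmer):

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and"}  # illustrative subset

def camel_split(identifier: str):
    """1) camel-case splitting: 'getUserName' -> ['get', 'User', 'Name']."""
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", identifier)

def crude_stem(word: str):
    """4) crude suffix stripping as a stand-in for real stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(identifiers):
    """Steps 1-4: split, lowercase, drop stop words, stem."""
    tokens = [t.lower() for ident in identifiers for t in camel_split(ident)]
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

def tf_idf(docs):
    """5) TF-IDF: weight each term by its frequency in the document,
    scaled by how rare the term is across all documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    out = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        out.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return out

doc_a = preprocess(["getUserName", "parsedConfigFiles"])
doc_b = preprocess(["getUserName", "closeConnection"])
weights = tf_idf([doc_a, doc_b])
print(doc_a)
```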
Optionally, as shown in fig. 2c, reducing the dimensionality of the TF-IDF-processed text content features with a text mining classifier to obtain numerical text features includes: dividing the TF-IDF-processed text content features into a first sample and a second sample by stratified random sampling; learning, with a naive Bayes algorithm, a first text mining classifier on the first sample and a second text mining classifier on the second sample; assigning a second confidence score matrix to the second sample using the first text mining classifier, and assigning a first confidence score matrix to the first sample using the second text mining classifier; the first and second confidence score matrices are the numerical text features. A confidence score matrix gives, for each statement, the probability of belonging to each logging statement level.
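A self-contained sketch of this cross-assignment scheme; the tiny naive Bayes classifier below stands in for a library implementation (e.g. scikit-learn's MultinomialNB), and the toy count vectors and three levels are invented for illustration:

```python
import math
from collections import defaultdict

class TinyMultinomialNB:
    """Minimal multinomial naive Bayes with Laplace smoothing, standing in
    for a library classifier such as scikit-learn's MultinomialNB."""
    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_feat = len(X[0])
        self.log_prior, self.log_lik = {}, {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(rows) / len(X))
            totals = [sum(col) for col in zip(*rows)]
            denom = sum(totals) + n_feat  # Laplace smoothing
            self.log_lik[c] = [math.log((t + 1) / denom) for t in totals]
        return self

    def predict_proba(self, X):
        out = []
        for x in X:
            joint = {c: self.log_prior[c]
                     + sum(v * w for v, w in zip(x, self.log_lik[c]))
                     for c in self.classes}
            m = max(joint.values())
            exp = {c: math.exp(j - m) for c, j in joint.items()}
            z = sum(exp.values())
            out.append([exp[c] / z for c in self.classes])
        return out

def stratified_halves(X, y):
    """Split sample indices in two while keeping every class in both halves."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    first, second = [], []
    for idxs in by_class.values():
        first += idxs[::2]
        second += idxs[1::2]
    return first, second

# toy TF-IDF-like count vectors labeled with three invented levels
X = [[3, 0, 1, 0], [2, 1, 0, 0], [0, 3, 0, 1], [0, 2, 1, 1],
     [1, 0, 3, 0], [0, 1, 2, 2], [2, 0, 1, 1], [0, 2, 0, 2]]
y = ["error", "error", "info", "info", "debug", "debug", "error", "info"]

ia, ib = stratified_halves(X, y)
clf_a = TinyMultinomialNB().fit([X[i] for i in ia], [y[i] for i in ia])
clf_b = TinyMultinomialNB().fit([X[i] for i in ib], [y[i] for i in ib])
# cross-assign: each half is scored by the classifier learned on the other
scores_b = clf_a.predict_proba([X[i] for i in ib])  # second confidence matrix
scores_a = clf_b.predict_proba([X[i] for i in ia])  # first confidence matrix
```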
Through the above processing, most text content features can be successfully converted into numerical text features usable as algorithm input. Text content features with category attributes, however, such as the code block type, are discrete data and need to be converted to numbers by a more suitable method, for example further processing with one-hot encoding, after which the numerical text features are normalized.
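A minimal sketch of one-hot encoding the block type and min-max normalizing a numerical column; the choice of min-max scaling is an assumption here, as the source only says the numerical text features are normalized:

```python
BLOCK_TYPES = ["CatchBlock", "TryBlock", "IfBlock", "SwitchBlock", "ForBlock",
               "WhileBlock", "DoBlock", "MethodBlock", "SynchronizedBlock",
               "ClassBlock"]

def one_hot(block_type: str):
    """Encode the categorical block type as a 10-dimensional 0/1 vector,
    so that no spurious ordering is imposed on discrete categories."""
    return [1.0 if t == block_type else 0.0 for t in BLOCK_TYPES]

def min_max_normalize(values):
    """Scale a numerical feature column into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(one_hot("CatchBlock"))          # 1.0 in the first position
print(min_max_normalize([2, 4, 10]))  # [0.0, 0.25, 1.0]
```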
Optionally, before the extracted code block features and file features are input into the preset algorithm model for training to obtain the log level prediction model, the method further includes: constructing the algorithm model in advance using any one of a decision tree algorithm, a support vector machine algorithm, a logistic regression algorithm, or a convolutional neural network algorithm.
Interfaces implementing these algorithms in scikit-learn are called, and the processed code block features and file features, namely the numerical text features, numerical features, and Boolean features, are fed together into the algorithm model for training; the model learns the mapping rules between the features and the logging statement levels, and once these mapping rules are learned, the log level prediction model is generated.
Step 240: acquire a code block into which a logging statement is to be inserted, where the logging statement forms a log record after being triggered and executed.
Step 250: extract features of the code block and of the file to which it belongs according to the preset feature model, obtaining the code block features and the file features.
Step 260: predict a log level according to the code block features and the file features, where the log level describes the level of detail of the information recorded in the log.
Step 270: insert the logging statement into the code block according to the log level.
According to the technical solution of this embodiment of the invention, a code block into which a logging statement is to be inserted is obtained, the logging statement forming a log record after being triggered and executed; features of the code block and of the file to which it belongs are extracted according to a preset feature model, obtaining code block features and file features; a log level, which describes the level of detail of the information recorded in the log, is predicted from the code block features and file features; and the logging statement is inserted into the code block according to that level. This solves the prior-art problem that the level of a logging statement is decided solely on the basis of developers' experience and domain knowledge, which lowers development efficiency; it realizes log level prediction for the logging statement to be inserted, shortens the time needed to decide the log level, and improves development efficiency.
Example three
Fig. 3 is a schematic structural diagram of a log level prediction apparatus according to a third embodiment of the present invention. The apparatus may be implemented in software and/or hardware and may generally be integrated in a computer device. As shown in fig. 3, the apparatus includes: a code block acquisition module 310, a feature acquisition module 320, a level determination module 330, and a log statement insertion module 340;
a code block acquisition module 310, configured to obtain a code block into which a logging statement is to be inserted, where the logging statement forms a log record after being triggered and executed;
the feature obtaining module 320 is configured to perform feature extraction on the code block and the file to which the code block belongs according to a preset feature model to obtain a code block feature and a file feature;
a level determining module 330, configured to predict a log level according to the code block characteristics and the file characteristics, where the log level is used to describe a detailed degree of information recorded in the log;
and a log statement insertion module 340, configured to insert the logging statement into the code block according to the log level.
According to the technical solution of this embodiment of the invention, a code block into which a logging statement is to be inserted is obtained, the logging statement forming a log record after being triggered and executed; features of the code block and of the file to which it belongs are extracted according to a preset feature model, obtaining code block features and file features; a log level, which describes the level of detail of the information recorded in the log, is predicted from the code block features and file features; and the logging statement is inserted into the code block according to that level. This solves the prior-art problem that the level of a logging statement is decided solely on the basis of developers' experience and domain knowledge, which lowers development efficiency; it realizes log level prediction for the logging statement to be inserted, shortens the time needed to decide the log level, and improves development efficiency.
On the basis of the above embodiments, the code block features include text content features and syntactic features. The text content features include at least: the structural features, the name of a method called by the code block, the name of a variable declared in the code block, the type of the code block, and the type of its trigger policy. The structural features include at least: the source lines of code (SLOC) of the code block, the number of methods called by the code block, and the number of variables declared in the code block. The syntactic features include at least: whether there is a throw statement and whether there is a return value. The file features include at least: the density of logging statements in the file, the average length of logging statements in the file, and the class name of the file.
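As a sketch, the feature taxonomy above maps naturally onto two record types. The field names and the example values are illustrative assumptions, not names fixed by the text:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CodeBlockFeatures:
    # structural features (a subset of the text content features)
    sloc: int
    num_called_methods: int
    num_declared_vars: int
    # remaining text content features
    called_method_names: List[str]
    declared_var_names: List[str]
    block_type: str            # e.g. "if", "try", "loop" (illustrative values)
    trigger_policy_type: str
    # syntactic features
    has_throw: bool
    has_return: bool

@dataclass
class FileFeatures:
    log_density: float         # logging statements per source line
    avg_log_length: float      # average length of logging statements
    class_name: str

feats = CodeBlockFeatures(12, 3, 2, ["parse", "close"], ["buf", "n"],
                          "try", "on-exception", True, False)
print(feats.sloc, feats.has_throw)
```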
On the basis of the foregoing embodiments, the feature acquisition module 320 is specifically configured to input the code block and the file to which it belongs into a source code analysis tool to obtain the code block features and the file features.
On the basis of the above embodiments, the log level is one of: a fatal level, an error level, an alarm level, an information level, a debug level, and a trace level, with sequentially decreasing importance; the lower the importance, the more detailed the information recorded in the log.
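The uniform level definition can be expressed as an ordered enumeration. This is a minimal sketch; the integer values are arbitrary beyond their ordering:

```python
from enum import IntEnum

class LogLevel(IntEnum):
    """Importance decreases from FATAL to TRACE; the lower the
    importance, the more detailed the recorded information."""
    FATAL = 6
    ERROR = 5
    ALARM = 4   # often called "warn" in logging frameworks
    INFO = 3
    DEBUG = 2
    TRACE = 1

# Ordering follows importance, so comparisons work directly.
print(LogLevel.FATAL > LogLevel.TRACE)  # True
```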
On the basis of the foregoing embodiments, the level determination module 330 is specifically configured to input the code block features and the file features into a pre-trained log-level prediction model to obtain the log level.
On the basis of the above embodiments, the apparatus further includes: a search module, configured to find, before the code block into which the logging statement is to be inserted is obtained, the logging statements in a training project and the code blocks containing those statements; a feature extraction module, configured to extract, according to the preset feature model, the code block features and file features matching that model from the code blocks and the files to which they belong; and a training module, configured to input the extracted code block features and file features into a preset algorithm model for training, to obtain the log-level prediction model.
On the basis of the above embodiments, the apparatus further includes a screening module, configured to obtain, after the logging statements and the code blocks containing them have been found in the training project, the log level and the contributor of each logging statement, and to screen the logging statements for validity according to their log levels and contributors.
On the basis of the above embodiments, the screening module is specifically configured to: keep a logging statement if its contributor is in a preset contributor list and its log level is consistent with the output log level; keep the statement if its log level is consistent with the output log level, its contributor is not in the preset contributor list, and the number of files to which all of that contributor's logging statements belong is greater than or equal to a file-count threshold; and keep the statement if its log level is consistent with the output log level, its contributor is not in the preset contributor list, the number of logging statements in the file to which it belongs is less than or equal to a statement-count threshold, and the density of logging statements in that file is less than or equal to a statement-density threshold.
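The three keep rules can be condensed into one predicate. The parameter names and the default threshold values are illustrative assumptions, since the text does not fix them:

```python
def keep_logging_statement(level_consistent: bool,
                           contributor_trusted: bool,
                           contributor_file_count: int,
                           stmts_in_file: int,
                           log_density_in_file: float,
                           file_count_threshold: int = 5,
                           stmt_count_threshold: int = 20,
                           density_threshold: float = 0.1) -> bool:
    """Validity screening: every rule requires a consistent log level,
    after which any one of three alternative conditions suffices."""
    if not level_consistent:
        return False
    if contributor_trusted:                             # rule 1: listed contributor
        return True
    if contributor_file_count >= file_count_threshold:  # rule 2: prolific contributor
        return True
    # rule 3: statement sits in a sparsely logged file
    return (stmts_in_file <= stmt_count_threshold
            and log_density_in_file <= density_threshold)

print(keep_logging_statement(True, False, 2, 8, 0.05))   # kept by rule 3
print(keep_logging_statement(True, False, 2, 30, 0.3))   # rejected
```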
On the basis of the above embodiments, the apparatus further includes: a preprocessing module, configured to perform, before the extracted code block features and file features are input into the preset algorithm model for training to obtain the log-level prediction model, camel-case splitting, lowercase conversion, stop-word removal, stemming, and term frequency-inverse document frequency (TF-IDF) processing, in that order, on the text content features among the code block features; and a dimension reduction module, configured to reduce the dimensionality of the TF-IDF-processed text content features through a text-mining classifier to obtain numerical text features.
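A minimal sketch of the text preprocessing steps that precede TF-IDF weighting. The stop-word list and the crude suffix stripper are illustrative stand-ins; a real pipeline would use a full stop-word list and a proper stemmer (e.g. Porter):

```python
import re

STOP_WORDS = {"the", "a", "get", "set"}  # illustrative stop-word list

def split_camel_case(token: str) -> list:
    # Camel-case splitting: "getUserName" -> ["get", "User", "Name"]
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token)

def crude_stem(word: str) -> str:
    # Toy suffix stripper standing in for real stemming.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tokens: list) -> list:
    words = [w for t in tokens for w in split_camel_case(t)]  # 1: camel-case split
    words = [w.lower() for w in words]                        # 2: lowercase
    words = [w for w in words if w not in STOP_WORDS]         # 3: stop-word removal
    return [crude_stem(w) for w in words]                     # 4: stemming; TF-IDF follows

print(preprocess(["getUserName", "parsingErrors"]))
# ['user', 'name', 'pars', 'error']
```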
On the basis of the foregoing embodiments, the dimension reduction module is specifically configured to: divide the TF-IDF-processed text content features into a first sample and a second sample by stratified random sampling; learn, with a naive Bayes algorithm, a first text-mining classifier on the first sample and a second text-mining classifier on the second sample; and assign a second confidence-score matrix to the second sample using the first text-mining classifier, and a first confidence-score matrix to the first sample using the second text-mining classifier; the first and second confidence-score matrices are the numerical text features.
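The cross-scoring idea can be sketched with a tiny hand-rolled naive Bayes over token lists (a real pipeline would operate on the TF-IDF matrix, e.g. with scikit-learn). The toy data and the deterministic even/odd per-class split standing in for stratified random sampling are assumptions:

```python
import math
from collections import Counter, defaultdict

class TinyNB:
    """Minimal multinomial naive Bayes over token lists, Laplace-smoothed."""
    def fit(self, docs, labels):
        self.labels = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.labels}
        self.counts = {c: Counter() for c in self.labels}
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc)
        self.vocab = {w for cnt in self.counts.values() for w in cnt}
        return self

    def confidence(self, doc):
        """Normalized per-label confidence vector for one document."""
        logp = {}
        for c in self.labels:
            total = sum(self.counts[c].values()) + len(self.vocab)
            lp = math.log(self.prior[c])
            for w in doc:
                lp += math.log((self.counts[c][w] + 1) / total)
            logp[c] = lp
        m = max(logp.values())
        exp = {c: math.exp(v - m) for c, v in logp.items()}
        z = sum(exp.values())
        return [exp[c] / z for c in self.labels]

docs = [["error", "throw"], ["error", "fail"], ["loop", "index"], ["loop", "count"]] * 4
labels = ["error", "error", "debug", "debug"] * 4

# Stand-in for stratified random sampling: even/odd indices per class,
# so both halves preserve the class proportions.
by_class = defaultdict(list)
for i, c in enumerate(labels):
    by_class[c].append(i)
first = [i for c in by_class for i in by_class[c][::2]]
second = [i for c in by_class for i in by_class[c][1::2]]

clf1 = TinyNB().fit([docs[i] for i in first], [labels[i] for i in first])
clf2 = TinyNB().fit([docs[i] for i in second], [labels[i] for i in second])

# Cross-assignment: each half is scored by the classifier trained on the
# other half, so every document receives held-out confidence features.
matrix2 = [clf1.confidence(docs[i]) for i in second]
matrix1 = [clf2.confidence(docs[i]) for i in first]
print(len(matrix1), len(matrix1[0]))  # rows = half-sample size, one column per level
```

The dimension reduction comes from replacing a high-dimensional TF-IDF row with one confidence score per log level.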
On the basis of the above embodiments, the apparatus further includes a model construction module, configured to construct the algorithm model in advance, before the extracted code block features and file features are input into it for training to obtain the log-level prediction model, using any one of a decision tree algorithm, a support vector machine algorithm, a logistic regression algorithm, or a convolutional neural network algorithm.
On the basis of the foregoing embodiments, the search module is specifically configured to: acquire all source code files in the training project; find the logging statements in each source code file by regular-expression matching; and traverse toward the root node of the syntax tree to find the code block containing each logging statement.
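A sketch of the search procedure on a Python source file (the document's throw/return features suggest Java, so the regex and the AST node types here are assumptions): step one finds logging statements by regular expression, step two climbs from each match toward the root of the syntax tree until it reaches an enclosing block:

```python
import ast
import re

LOG_RE = re.compile(r"\blogger\.(fatal|error|warning|info|debug)\s*\(")

SOURCE = '''
def connect(url):
    if url is None:
        logger.error("missing url")
        return None
    logger.debug("connecting")
'''

# Step 1: regular-expression matching finds the logging statements.
log_lines = [i + 1 for i, line in enumerate(SOURCE.splitlines())
             if LOG_RE.search(line)]

# Step 2: parse the source and, for each match, climb from the statement
# toward the root until an enclosing block node is reached.
tree = ast.parse(SOURCE)
parents = {child: node for node in ast.walk(tree)
           for child in ast.iter_child_nodes(node)}

def enclosing_block(lineno):
    stmt = next(n for n in ast.walk(tree)
                if isinstance(n, ast.Expr) and getattr(n, "lineno", None) == lineno)
    node = stmt
    while not isinstance(node, (ast.If, ast.For, ast.While,
                                ast.FunctionDef, ast.Module)):
        node = parents[node]          # climb toward the root
    return type(node).__name__

result = [(ln, enclosing_block(ln)) for ln in log_lines]
print(result)  # [(4, 'If'), (6, 'FunctionDef')]
```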
The log-level prediction apparatus can execute the log-level prediction method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of that method.
Example four
Fig. 4 is a schematic structural diagram of a computer device in the fourth embodiment of the present invention. FIG. 4 illustrates a block diagram of an exemplary computer device 412 suitable for use in implementing embodiments of the present invention. The computer device 412 shown in FIG. 4 is only one example and should not impose any limitations on the functionality or scope of use of embodiments of the present invention.
As shown in FIG. 4, computer device 412 is in the form of a general purpose computing device. Components of computer device 412 may include, but are not limited to: one or more processors 416, a memory 428, and a bus 418 that couples the various system components (including the memory 428 and the processors 416).
Bus 418 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 412 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 412 and includes both volatile and nonvolatile media, removable and non-removable media.
The memory 428 is used to store instructions. Memory 428 can include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 430 and/or cache memory 432. The computer device 412 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 434 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 4, commonly referred to as a "hard drive"). Although not shown in FIG. 4, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 418 by one or more data media interfaces. Memory 428 can include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.
A program/utility 440 having a set (at least one) of program modules 442 may be stored, for instance, in memory 428, such program modules 442 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may comprise an implementation of a network environment. The program modules 442 generally perform the functions and/or methodologies of the described embodiments of the invention.
The computer device 412 may also communicate with one or more external devices 414 (e.g., keyboard, pointing device, display 424, etc.), with one or more devices that enable a user to interact with the computer device 412, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 412 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 422. Also, computer device 412 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet) through network adapter 420. As shown, network adapter 420 communicates with the other modules of computer device 412 over bus 418. It should be appreciated that although not shown in FIG. 4, other hardware and/or software modules may be used in conjunction with the computer device 412, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
The processor 416 performs various functional applications and data processing by executing instructions stored in the memory 428, for example: obtaining a code block into which a logging statement is to be inserted, where the logging statement forms a log record once it is triggered and executed; performing feature extraction on the code block and the file to which it belongs, according to a preset feature model, to obtain code block features and file features; predicting from those features a log level, which describes how detailed the information recorded in the log is; and inserting the logging statement into the code block according to the log level.
On the basis of the above embodiments, the code block features include text content features and syntactic features. The text content features include at least: the structural features, the name of a method called by the code block, the name of a variable declared in the code block, the type of the code block, and the type of its trigger policy. The structural features include at least: the source lines of code (SLOC) of the code block, the number of methods called by the code block, and the number of variables declared in the code block. The syntactic features include at least: whether there is a throw statement and whether there is a return value. The file features include at least: the density of logging statements in the file, the average length of logging statements in the file, and the class name of the file.
On the basis of the above embodiments, the processor 416 is configured to obtain the code block features and the file features by inputting the code block and the file to which it belongs into a source code analysis tool.
On the basis of the above embodiments, the log level is one of: a fatal level, an error level, an alarm level, an information level, a debug level, and a trace level, with sequentially decreasing importance; the lower the importance, the more detailed the information recorded in the log.
On the basis of the above embodiments, the processor 416 is configured to predict the log level by inputting the code block features and the file features into a pre-trained log-level prediction model.
On the basis of the foregoing embodiments, before obtaining the code block into which the logging statement is to be inserted, the processor 416 is further configured to: find the logging statements in a training project and the code blocks containing them; extract, according to the preset feature model, the code block features and file features matching that model from the code blocks and the files to which they belong; and input the extracted code block features and file features into a preset algorithm model for training, to obtain the log-level prediction model.
On the basis of the foregoing embodiments, after finding the logging statements in the training project and the code blocks containing them, the processor 416 is further configured to: obtain the log level and the contributor of each logging statement; and screen the logging statements for validity according to their log levels and contributors.
On the basis of the above embodiments, the processor 416 is configured to screen the logging statements for validity by: keeping a logging statement if its contributor is in a preset contributor list and its log level is consistent with the output log level; keeping the statement if its log level is consistent with the output log level, its contributor is not in the preset contributor list, and the number of files to which all of that contributor's logging statements belong is greater than or equal to a file-count threshold; and keeping the statement if its log level is consistent with the output log level, its contributor is not in the preset contributor list, the number of logging statements in the file to which it belongs is less than or equal to a statement-count threshold, and the density of logging statements in that file is less than or equal to a statement-density threshold.
On the basis of the foregoing embodiments, before the extracted code block features and file features are input into the preset algorithm model for training to obtain the log-level prediction model, the processor 416 is further configured to: perform camel-case splitting, lowercase conversion, stop-word removal, stemming, and term frequency-inverse document frequency (TF-IDF) processing, in that order, on the text content features among the code block features; and reduce the dimensionality of the TF-IDF-processed text content features through a text-mining classifier to obtain numerical text features.
On the basis of the above embodiments, the processor 416 is configured to obtain the numerical text features by: dividing the TF-IDF-processed text content features into a first sample and a second sample by stratified random sampling; learning, with a naive Bayes algorithm, a first text-mining classifier on the first sample and a second text-mining classifier on the second sample; and assigning a second confidence-score matrix to the second sample using the first text-mining classifier, and a first confidence-score matrix to the first sample using the second text-mining classifier; the first and second confidence-score matrices are the numerical text features.
On the basis of the foregoing embodiments, before the extracted code block features and file features are input into the preset algorithm model for training to obtain the log-level prediction model, the processor 416 is further configured to construct the algorithm model in advance using any one of a decision tree algorithm, a support vector machine algorithm, a logistic regression algorithm, or a convolutional neural network algorithm.
On the basis of the above embodiments, the processor 416 is configured to find the logging statements in the training project and the code blocks containing them by: acquiring all source code files in the training project; finding the logging statements in each source code file by regular-expression matching; and traversing toward the root node of the syntax tree to find the code block containing each logging statement.
Example five
A fifth embodiment of the present invention provides a computer-readable storage medium storing computer instructions for executing the log-level prediction method provided in any embodiment of the present invention.
Any combination of one or more computer-readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.