CN103984756B - Semi-supervised probabilistic latent semantic analysis based software change log classification method - Google Patents

Info

Publication number
CN103984756B
Authority
CN
China
Prior art keywords
change log
software
category
probability
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410234156.6A
Other languages
Chinese (zh)
Other versions
CN103984756A (en
Inventor
张小洪
鄢萌
傅颖
徐玲
杨梦宁
洪明坚
葛永新
杨丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN201410234156.6A
Publication of CN103984756A
Application granted
Publication of CN103984756B
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a software change log classification method based on semi-supervised probabilistic latent semantic analysis. Using a word dictionary determined from prior knowledge, the method classifies software change logs objectively according to the probabilistic dependencies between words, between words and change log categories, and between software change logs and change log categories. Classification therefore no longer relies on weight values derived from word-frequency features, which improves accuracy and effectively addresses the errors and low accuracy that manually set weight values cause when software change logs are classified in the prior art.

Description

Software change log classification method based on semi-supervised probability latent semantic analysis
Technical Field
The invention belongs to the technical field of computer information technology and software engineering, and particularly relates to a software change log classification method based on semi-supervised probability latent semantic analysis.
Background
In computing, it is common practice to record operations that have been performed in a processing log, so that the operations carried out can be reviewed later and subsequent operation policies can be determined from the recorded log.
While computer software is being run, managed, and maintained, it may need to be repaired because of bugs, errors, or defects; it may have functions or features added to adapt to new environments or new requirements; or it may be re-edited or restructured (also called software refactoring) to improve its readability, reusability, maintainability, and so on. Each of these operations changes the program code and correspondingly generates a software change log. In later management and maintenance, the change history of the software can be learned from the software change logs, so that problems occurring in the software can be counted and located, and the quality indicators, life cycle, operational risk, and so on of the software product can be analyzed further. The log database of a piece of software may contain a large number of software change logs; to carry out software-related analysis from them, the software change logs must first be classified so that the type of change operation recorded in each is known.
In the prior art, software change logs are classified by computer using a software change log classification method, which avoids the heavy workload, long duration, and low efficiency of manual classification. Some related research on software change log classification has also been carried out in the field. A common classification method for software change logs is as follows:
Step one: extract software change logs from the log database of the software.
Step two: apply an existing stem extraction algorithm to the software change log to obtain the words contained in its stem. Stem extraction yields the feature words that represent the main content described by the software change log; words that carry no content meaning, such as "the", "on", "a", and "which", are generally removed in the process.
Step three: based on word-frequency features, assign each word a weight value for a category according to how often the word appears in software change logs of that category; the more frequently a word appears, the larger the weight value it is assigned in that category. Then compare each word in the stem of the software change log to be classified against the weighted words; where the same word appears, the category of the software change log is judged from the category weight values of that word.
However, in such a classification method the category weight value of a word is often set manually from experience, and for synonyms and polysemous words an inappropriate category weight value easily leads to misclassification. For example, suppose two words that are synonyms of each other both occur with high frequency in two different categories, and one of them occurs in a software change log that actually belongs to category B; if that synonym has a larger weight value in category A, the software change log is misclassified as category A. As another example, suppose a polysemous word appears with very high probability in category A and has a large weight value there; when it appears in a software change log that actually belongs to category B, it does not carry its usual meaning, but because it has been assigned an excessive weight value in category A, the software change log is misclassified as category A. Such misclassifications can easily lead log managers to erroneous analysis results. Overcoming the error introduced by manually set weight values, and thereby improving the accuracy of software change log classification, is therefore the primary problem to be solved in software change log classification technology.
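The word-weight approach criticized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the prior-art implementation: the function names `train_weights` and `classify` are hypothetical, and the weights are taken to be relative word frequencies per category, whereas the text says they are often set by hand.

```python
from collections import Counter, defaultdict

def train_weights(labelled_stems):
    """labelled_stems: list of (category, [stems]).
    Assumed weighting: relative frequency of each word within its category."""
    freq = defaultdict(Counter)
    for category, stems in labelled_stems:
        freq[category].update(stems)
    weights = {}
    for category, counter in freq.items():
        total = sum(counter.values())
        weights[category] = {w: n / total for w, n in counter.items()}
    return weights

def classify(stems, weights):
    """Sum the category weights of the matching words and pick the largest score."""
    scores = {c: sum(wts.get(s, 0.0) for s in stems) for c, wts in weights.items()}
    return max(scores, key=scores.get)

training = [("A", ["fix", "bug", "fix"]), ("B", ["add", "feature"])]
weights = train_weights(training)
# "fix" weighs 2/3 in A while "feature" weighs 1/2 in B, so A wins
# even if the log was really about a new feature.
print(classify(["fix", "feature"], weights))  # A
```

A single heavily weighted word decides the outcome here, which is the failure mode the synonym and polysemy examples describe.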
Disclosure of Invention
In view of the above problems in the prior art, the invention provides a software change log classification method based on semi-supervised probabilistic latent semantic analysis. Using a word dictionary determined from prior knowledge, the method classifies software change logs objectively according to the probability correlation between words, between words and change log categories, and between software change logs and change log categories. It avoids classifying software change logs by weight values of word-frequency features, overcomes the error caused by manually set weight values, improves classification accuracy, and thus solves the prior-art problem of errors and low accuracy caused by manually set weight values.
To achieve this purpose, the invention adopts the following technical means:
The software change log classification method based on semi-supervised probabilistic latent semantic analysis comprises the following steps:
A) Divide the change log categories according to prior knowledge, determine the key words corresponding to each change log category, and take the set of all key words of all change log categories as the word dictionary. Each key word corresponding to a change log category in the word dictionary is a word in the stem obtained by stem extraction from software change logs that are known, from prior knowledge, to belong to that category. The change log categories are divided into three:
Change log category 1, $z_1$: software change logs generated by repairing software bugs, errors, or defects;
Change log category 2, $z_2$: software change logs generated by adding software functions or software features;
Change log category 3, $z_3$: software change logs generated by re-editing or restructuring (refactoring) the software;
B) Acquire a plurality of software change logs that belong to the three change log categories and whose categories are known, as training samples, and take the set of all training samples as the training database. Count the number $n_k$ of training samples in the training database that belong to the $k$-th change log category $z_k$, $k \in \{1, 2, \ldots, K\}$, where $K$ is the number of change log categories, i.e. $K = 3$. Perform stem extraction on each training sample in the training database to obtain the words contained in the stem of each training sample;
C) Establish a probabilistic latent semantic analysis model among the key words in the word dictionary, the software change logs, and the change log categories:
$$P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i)$$
where $P(w_j \mid z_k)$ denotes the probability relation between the $j$-th key word $w_j$ in the word dictionary and the $k$-th change log category $z_k$, $k \in \{1, 2, 3\}$; $P(z_k \mid d_i)$ denotes the probability relation between the $k$-th change log category $z_k$ and the $i$-th software change log $d_i$; and $P(d_i)$ denotes the probability of the $i$-th software change log $d_i$ in terms of its share of the words in the training database, i.e. $P(d_i) = n_i / N_{bace}$, where $n_i$ is the number of words contained in the stem of the $i$-th software change log $d_i$ and $N_{bace}$ is the sum of the numbers of words contained in the stems of all training samples in the training database;
D) Construct the likelihood function $L$ of the probabilistic latent semantic analysis model:
$$L = \sum_{i=1}^{M} \sum_{j=1}^{N} n(w_j, d_i) \log \Big[ P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i) \Big]$$
where $i \in \{1, 2, \ldots, M\}$ and $M$ is the total number of software change logs; $j \in \{1, 2, \ldots, N\}$ and $N$ is the total number of key words in the word dictionary; and $n(w_j, d_i)$ is the number of occurrences of the $j$-th key word $w_j$ of the word dictionary in the software change log $d_i$;
E) Taking each training sample in the training database in turn as a software change log $d_i$, substitute it into the likelihood function $L$ constructed in step D, and use the expectation-maximization algorithm to solve for the probability relation between each key word $w_j$ in the word dictionary and each change log category $z_k$, and for the probability relation between each change log category $z_k$ and each training sample serving as software change log $d_i$. The probability relation between each key word $w_j$ and each change log category $z_k$ obtained when the expectation-maximization algorithm converges is denoted $P_c(w_j \mid z_k)$, and the probability relation between each change log category $z_k$ and each training sample serving as software change log $d_i$ obtained at convergence is denoted $P_c(z_k \mid d_i)$, with $j \in \{1, 2, \ldots, N\}$, $i \in \{1, 2, \ldots, M\}$, $k \in \{1, 2, \ldots, K\}$. The sample center probability relation of each change log category $z_k$ is then calculated; for the $k$-th change log category $z_k$ it is:
[formula not preserved in this text]
Here the total number $M$ of software change logs is the total number of training samples in the training database;
F) acquiring a software change log of a change log category to be determined as a sample to be tested, and taking a set of all samples to be tested as a test database; respectively carrying out stem extraction processing on each sample to be tested in the test database to obtain each word contained in the stem of each sample to be tested;
G) Taking each sample to be tested in the test database in turn as a software change log $d_i$, substitute it into the likelihood function $L$ constructed in step D, and use the expectation-maximization algorithm to solve for the probability relation between each change log category $z_k$ and each sample to be tested serving as software change log $d_i$;
H) From the probability relation between each change log category $z_k$ and each sample to be tested obtained in step G, and from the sample center probability relation of each change log category $z_k$, calculate the similarity $Sim(d_{x,m}, z_k)$ between each sample to be tested and the sample center probability of each change log category $z_k$, and thereby determine the change log category to which each sample to be tested belongs:
$$X_m = \arg\max_{k \in \{1, 2, \ldots, K\}} Sim(d_{x,m}, z_k)$$
where $X_m$ denotes the change log category to which an arbitrary $m$-th sample to be tested $d_{x,m}$ belongs. The similarity $Sim(d_{x,m}, z_k)$ is:
[formula not preserved in this text]
where $P_x(z_k \mid d_{x,m})$ denotes the probability relation between the $k$-th change log category $z_k$ and the $m$-th sample to be tested $d_{x,m}$ obtained in step G;
A category label is then added to each software change log serving as a sample to be tested, according to the determined change log category to which it belongs.
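The final decision in step H, assigning each sample to the category with the largest similarity, can be sketched as below. The similarity values are taken as given inputs because the similarity formula itself is not preserved in this text; the function name `assign_categories` and the sample numbers are hypothetical.

```python
def assign_categories(similarities):
    """similarities[m][k] holds Sim(d_{x,m}, z_{k+1}) for test sample m.
    Returns the 1-based index of the category with maximum similarity.
    Ties are resolved in favor of the lowest category index (an assumption)."""
    labels = []
    for row in similarities:
        best_k = max(range(len(row)), key=lambda k: row[k])
        labels.append(best_k + 1)
    return labels

# Hypothetical similarity values for two samples over categories z1, z2, z3.
sims = [[0.2, 0.7, 0.1],
        [0.9, 0.05, 0.05]]
print(assign_categories(sims))  # [2, 1]
```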
In the above software change log classification method based on semi-supervised probabilistic latent semantic analysis, step E specifically comprises:
e1) Taking each training sample in the training database in turn as a software change log $d_i$, substitute it into the likelihood function $L$ constructed in step D, $i \in \{1, 2, \ldots, M\}$, where the total number $M$ of software change logs is the total number of training samples in the training database. The initial value of the probability relation $P(z_k \mid d_i)$ between the $k$-th change log category $z_k$ and each training sample serving as software change log $d_i$ is chosen randomly, and the initial value of the probability relation $P(w_j \mid z_k)$ between the $j$-th key word $w_j$ in the word dictionary and the $k$-th change log category $z_k$ is:
[formula not preserved in this text]
where $n_k$ denotes the number of training samples in the training database that belong to the $k$-th change log category $z_k$, $k \in \{1, 2, 3\}$, and $n_{j,k}$ denotes the number of occurrences of the $j$-th key word $w_j$ of the word dictionary in the training samples that belong to the $k$-th change log category $z_k$;
e2) In the E-step of the expectation-maximization algorithm, compute the conditional distribution probability $P(z_k \mid d_i, w_j)$ of each change log category $z_k$, $k \in \{1, 2, \ldots, K\}$, from the current probability relations $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$:
$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\, P(z_l \mid d_i)}$$
e3) In the M-step of the expectation-maximization algorithm, use the conditional distribution probability $P(z_k \mid d_i, w_j)$ obtained in step e2 to update, for each key word $w_j$ in the word dictionary, $j \in \{1, 2, \ldots, N\}$, each training sample serving as software change log $d_i$ in the training database, $i \in \{1, 2, \ldots, M\}$, and each change log category $z_k$, $k \in \{1, 2, \ldots, K\}$, the values of the probability relations $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$:
$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} n(w_j, d_i)\, P(z_k \mid d_i, w_j)}{\sum_{j'=1}^{N} \sum_{i=1}^{M} n(w_{j'}, d_i)\, P(z_k \mid d_i, w_{j'})}$$
$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} n(w_j, d_i)\, P(z_k \mid d_i, w_j)}{n(d_i)}$$
where $n(w_j, d_i)$ denotes the number of occurrences of the $j$-th key word $w_j$ of the word dictionary in the software change log $d_i$, and $n(d_i) = \sum_{j=1}^{N} n(w_j, d_i)$ denotes the total number of occurrences of key words of the word dictionary in the software change log $d_i$;
e4) Repeat steps e2 to e3 until the expectation-maximization algorithm converges. The probability relation between each key word $w_j$ in the word dictionary and each change log category $z_k$ obtained at convergence is denoted $P_c(w_j \mid z_k)$, and the probability relation between each change log category $z_k$ and each training sample serving as software change log $d_i$ obtained at convergence is denoted $P_c(z_k \mid d_i)$, with $j \in \{1, 2, \ldots, N\}$, $i \in \{1, 2, \ldots, M\}$, $k \in \{1, 2, \ldots, K\}$. The sample center probability relation of each change log category $z_k$ is then calculated; for the $k$-th change log category $z_k$ it is:
[formula not preserved in this text]
In the above software change log classification method based on semi-supervised probabilistic latent semantic analysis, step G specifically comprises:
g1) Taking each sample to be tested in the test database in turn as a software change log $d_i$, substitute it into the likelihood function $L$ constructed in step D, $i \in \{1, 2, \ldots, M\}$, where the total number $M$ of software change logs is now the total number of samples to be tested in the test database. The initial value of the probability relation $P(z_k \mid d_i)$ between the $k$-th change log category $z_k$ and each sample to be tested serving as software change log $d_i$ is chosen randomly, and the initial value of the probability relation $P(w_j \mid z_k)$ between the $j$-th key word $w_j$ in the word dictionary and the $k$-th change log category $z_k$ is:
$$P(w_j \mid z_k) = P_c(w_j \mid z_k);$$
g2) In the E-step of the expectation-maximization algorithm, compute the conditional distribution probability $P(z_k \mid d_i, w_j)$ of each change log category $z_k$, $k \in \{1, 2, \ldots, K\}$, from the current probability relations $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$:
$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\, P(z_l \mid d_i)}$$
g3) In the M-step of the expectation-maximization algorithm, use the conditional distribution probability $P(z_k \mid d_i, w_j)$ obtained in step g2 to update, for each key word $w_j$ in the word dictionary, $j \in \{1, 2, \ldots, N\}$, each sample to be tested serving as software change log $d_i$ in the test database, $i \in \{1, 2, \ldots, M\}$, and each change log category $z_k$, $k \in \{1, 2, \ldots, K\}$, the values of the probability relations $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$:
$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} n(w_j, d_i)\, P(z_k \mid d_i, w_j)}{\sum_{j'=1}^{N} \sum_{i=1}^{M} n(w_{j'}, d_i)\, P(z_k \mid d_i, w_{j'})}$$
$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} n(w_j, d_i)\, P(z_k \mid d_i, w_j)}{n(d_i)}$$
where $n(w_j, d_i)$ denotes the number of occurrences of the $j$-th key word $w_j$ of the word dictionary in the software change log $d_i$, and $n(d_i) = \sum_{j=1}^{N} n(w_j, d_i)$ denotes the total number of occurrences of key words of the word dictionary in the software change log $d_i$;
g4) Repeat steps g2 to g3 until the expectation-maximization algorithm converges, thereby obtaining the converged probability relation between each change log category $z_k$ and each sample to be tested serving as software change log $d_i$.
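Steps g1 to g4 can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions rather than the patented implementation: the function name `em_on_test_logs`, the optional `p_z_d_init` argument, the fixed iteration count (in place of a convergence test), and the toy numbers are all hypothetical. The E-step and M-step follow the standard probabilistic latent semantic analysis updates.

```python
import random

def em_on_test_logs(counts, p_w_z_trained, p_z_d_init=None, iterations=50, seed=0):
    """counts[i][j] = n(w_j, d_i) for the test logs;
    p_w_z_trained[k][j] = P_c(w_j | z_k) from the training phase."""
    M, N, K = len(counts), len(counts[0]), len(p_w_z_trained)
    p_w_z = [row[:] for row in p_w_z_trained]      # g1: start from P_c(w_j | z_k)
    if p_z_d_init is None:                         # g1: random start for P(z_k | d_i)
        rng = random.Random(seed)
        p_z_d = []
        for _ in range(M):
            raw = [rng.random() + 1e-3 for _ in range(K)]
            total = sum(raw)
            p_z_d.append([r / total for r in raw])
    else:
        p_z_d = [row[:] for row in p_z_d_init]
    for _ in range(iterations):
        # g2 (E-step): posterior P(z_k | d_i, w_j)
        post = [[[0.0] * K for _ in range(N)] for _ in range(M)]
        for i in range(M):
            for j in range(N):
                joint = [p_w_z[k][j] * p_z_d[i][k] for k in range(K)]
                total = sum(joint)
                if total > 0:
                    post[i][j] = [v / total for v in joint]
        # g3 (M-step): update P(w_j | z_k), then P(z_k | d_i)
        for k in range(K):
            num = [sum(counts[i][j] * post[i][j][k] for i in range(M)) for j in range(N)]
            total = sum(num)
            if total > 0:
                p_w_z[k] = [v / total for v in num]
        for i in range(M):
            n_d = sum(counts[i])                   # n(d_i)
            if n_d > 0:
                p_z_d[i] = [sum(counts[i][j] * post[i][j][k] for j in range(N)) / n_d
                            for k in range(K)]
    return p_w_z, p_z_d

# Toy data: two key words, two categories, two test logs.
trained = [[0.9, 0.1], [0.1, 0.9]]
test_counts = [[5, 0], [0, 4]]
_, p_z_d = em_on_test_logs(test_counts, trained, p_z_d_init=[[0.5, 0.5], [0.5, 0.5]])
```

Passing an explicit `p_z_d_init` keeps the toy run deterministic; leaving it out gives the random start that step g1 describes.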
Compared with the prior art, the invention has the following beneficial effects:
1. The software change log classification method based on semi-supervised probabilistic latent semantic analysis uses a word dictionary determined from prior knowledge and classifies software change logs objectively according to the probability correlation between words, between words and change log categories, and between software change logs and change log categories. It avoids classifying software change logs by weight values of word-frequency features, improves classification accuracy, and effectively solves the prior-art problem of errors and low accuracy caused by manually set weight values.
2. In the software change log classification method based on semi-supervised probabilistic latent semantic analysis, the expectation-maximization algorithm is used to determine the probability correlation characteristics of the different key words from the training database obtained through prior knowledge, and the initial value of the probability relation $P(w_j \mid z_k)$ is set from training-database statistics [formula not preserved in this text]. Compared with a randomly set initial value of $P(w_j \mid z_k)$, this initial value reflects the objective distribution of the key word $w_j$ in the training samples that belong to the $k$-th change log category $z_k$, which helps to improve the convergence speed of the expectation-maximization algorithm. The result objectively reflects the probability correlation characteristics of the different key words in the word dictionary, based on the training database, and the probability correlation characteristics between the training samples in the training database and the change log categories.
3. In the software change log classification method based on semi-supervised probabilistic latent semantic analysis, the expectation-maximization algorithm is used to determine the probability correlation characteristics between each sample to be tested in the test database and each change log category, and the initial value of the probability relation $P(w_j \mid z_k)$ between the $j$-th key word $w_j$ in the word dictionary and the $k$-th change log category $z_k$ is set to $P(w_j \mid z_k) = P_c(w_j \mid z_k)$. This makes use of the probability correlation characteristics of the different key words in the word dictionary obtained from the training database, improves the convergence speed of the expectation-maximization algorithm, and grounds the determination of the probability correlation between each sample to be tested and the change log categories in the actual situation of the training database.
4. In the software change log classification method based on semi-supervised probability latent semantic analysis, the similarity of the sample to be detected on each change log category is comprehensively considered in the process of confirming the change log category to which the sample to be detected belongs, and the change log category to which the sample to be detected belongs is determined according to the maximum similarity, so that the software change logs of the category to be determined are objectively classified, the classification of the software change logs is avoided only according to the weight value of a word given to a certain category, and the classification judgment of the software change logs is more comprehensive and accurate.
5. The software change log classification method based on semi-supervised probability latent semantic analysis has good feasibility and effectiveness in practical application.
Drawings
FIG. 1 is a flow chart of a software change log classification method based on semi-supervised probability latent semantic analysis according to the present invention.
FIG. 2 is a statistical result chart of classification accuracy in a validation experiment according to the present invention.
Detailed Description
In existing software change log classification methods, the weight value given to a word in a change log category is set manually according to word-frequency features, and software change logs are classified according to these weight values. When synonyms and polysemous words occur, misclassification easily results, which reduces the classification accuracy of software change logs and affects their analysis by log managers. To address this problem, the invention provides a software change log classification method based on semi-supervised probabilistic latent semantic analysis. Using a word dictionary determined from prior knowledge, the method classifies software change logs objectively according to the probability correlation between words, between words and change log categories, and between software change logs and change log categories. It avoids classifying software change logs by weight values of word-frequency features, overcomes the error caused by manually set weight values, and thereby improves the classification accuracy of software change logs.
The invention relates to a software change log classification method based on semi-supervised probability latent semantic analysis, which has a specific flow as shown in figure 1 and comprises the following steps:
A) Divide the change log categories according to prior knowledge, determine the key words corresponding to each change log category, and take the set of all key words of all change log categories as the word dictionary. Each key word corresponding to a change log category in the word dictionary is a word in the stem obtained by stem extraction from software change logs that are known, from prior knowledge, to belong to that category. The change log categories are divided into three:
Change log category 1, $z_1$: software change logs generated by repairing software bugs, errors, or defects;
Change log category 2, $z_2$: software change logs generated by adding software functions or software features;
Change log category 3, $z_3$: software change logs generated by re-editing or restructuring (refactoring) the software.
In this step, the key words corresponding to each change log category must be determined from prior knowledge, for example by manual identification or from key words already known and confirmed for the three change log categories above. For example, following Swanson's classification of software modifications, a word dictionary as shown in Table 1 can be constructed based on the work of Mauczka et al.
TABLE 1
Of course, in different software change log description environments, or different software types for specific applications, the specific resulting word dictionary may also be different.
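One way to hold such a word dictionary in code is a mapping from category to key-word stems, flattened into an indexed list $w_1, \ldots, w_N$. The sketch below is illustrative only: the key words are hypothetical placeholders and are not the contents of Table 1, and the names `WORD_DICTIONARY` and `flatten` are assumptions.

```python
# Hypothetical key-word stems for illustration only; NOT the contents of Table 1.
WORD_DICTIONARY = {
    "z1": ["fix", "bug", "error"],          # repairs of bugs, errors, defects
    "z2": ["add", "new", "support"],        # added functions or features
    "z3": ["refactor", "clean", "rename"],  # re-editing or restructuring
}

def flatten(dictionary):
    """Return the ordered key-word list w_1..w_N and a word -> index map."""
    vocab = []
    for category in sorted(dictionary):
        for word in dictionary[category]:
            if word not in vocab:           # a word shared by categories is kept once
                vocab.append(word)
    return vocab, {word: j for j, word in enumerate(vocab)}

vocab, index = flatten(WORD_DICTIONARY)
print(len(vocab))  # 9
```

The index map gives each key word a fixed column position, which is what the counts $n(w_j, d_i)$ in the later steps rely on.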
B) Acquire a plurality of software change logs that belong to the three change log categories and whose categories are known, as training samples, and take the set of all training samples as the training database. Count the number $n_k$ of training samples in the training database that belong to the $k$-th change log category $z_k$, $k \in \{1, 2, \ldots, K\}$, where $K$ is the number of change log categories, i.e. $K = 3$. Perform stem extraction on each training sample in the training database to obtain the words contained in the stem of each training sample.
In this step, the software change log as the training sample needs to belong to one of the three change log categories, and the category of the change log to which the software change log belongs is known. These software change logs as training samples also need to be acquired and judged by prior knowledge, for example, by manual identification and judgment, or by prior identification means, so as to obtain the change log category to which the software change logs belong. Meanwhile, in the three change log categories, a plurality of training samples belonging to each change log category should be provided; of course, the more training samples are obtained for each change log category, the more accurate the classification effect of the method of the present invention is. The stem extraction processing for each training sample can be realized by using a stem extraction algorithm in the prior art, and words without actual content representation meanings, such as "the", "on", "a", "which", and the like, need to be removed. The stem extraction algorithm itself is not a technical contribution of the present invention, and therefore, redundant description is not provided.
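The preprocessing described here, removing words without content meaning, stemming, and counting dictionary key words, can be sketched as follows. The suffix-stripping function `toy_stem` is a hypothetical stand-in for an existing stem extraction algorithm, and the three-word dictionary and the sample log message are invented for illustration.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "on", "a", "which"}    # the examples named in the text

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stems_of(log_message):
    """Lowercase, tokenize on letters, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", log_message.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]

def keyword_counts(log_message, dictionary):
    """Row of n(w_j, d_i): occurrences of each dictionary key word in one log."""
    counts = Counter(stems_of(log_message))
    return [counts[w] for w in dictionary]

dictionary = ["fix", "add", "refactor"]     # hypothetical key words
row = keyword_counts("Fixed the crash which occurred on adding a file; fixes #12", dictionary)
print(row)  # [2, 1, 0]
```

In practice an established stemmer would replace `toy_stem`; the counting step stays the same.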
C) Establish a probabilistic latent semantic analysis model among the key words in the word dictionary, the software change logs, and the change log categories:
$$P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i)$$
where $P(w_j \mid z_k)$ denotes the probability relation between the $j$-th key word $w_j$ in the word dictionary and the $k$-th change log category $z_k$, $k \in \{1, 2, 3\}$; $P(z_k \mid d_i)$ denotes the probability relation between the $k$-th change log category $z_k$ and the $i$-th software change log $d_i$; and $P(d_i)$ denotes the probability of the $i$-th software change log $d_i$ in terms of its share of the words in the training database, i.e. $P(d_i) = n_i / N_{bace}$, where $n_i$ is the number of words contained in the stem of the $i$-th software change log $d_i$ and $N_{bace}$ is the sum of the numbers of words contained in the stems of all training samples in the training database.
The method of the invention utilizes a probabilistic latent semantic analysis model to establish the probability correlation between words, between words and change log categories and between software change logs and change log categories.
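As a small worked example of the standard probabilistic latent semantic analysis mixture, consider one log $d_i$ and one key word $w_j$ with $K = 3$. All numbers below are hypothetical and chosen only to make the arithmetic easy to follow.

```latex
% Hypothetical values: P(d_i) = 0.1,
% P(w_j | z_1..z_3) = (0.5, 0.2, 0.1), P(z_1..z_3 | d_i) = (0.6, 0.3, 0.1)
\begin{aligned}
\sum_{k=1}^{3} P(w_j \mid z_k)\, P(z_k \mid d_i)
  &= 0.5 \cdot 0.6 + 0.2 \cdot 0.3 + 0.1 \cdot 0.1 = 0.37, \\
P(d_i, w_j) &= P(d_i) \cdot 0.37 = 0.1 \cdot 0.37 = 0.037 .
\end{aligned}
```

The category $z_1$ contributes most of the mass here (0.30 of 0.37), which is the kind of dependency the model uses in place of a hand-set word weight.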
D) Construct the likelihood function $L$ of the probabilistic latent semantic analysis model:
$$L = \sum_{i=1}^{M} \sum_{j=1}^{N} n(w_j, d_i) \log \Big[ P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i) \Big]$$
where $i \in \{1, 2, \ldots, M\}$ and $M$ is the total number of software change logs; $j \in \{1, 2, \ldots, N\}$ and $N$ is the total number of key words in the word dictionary; and $n(w_j, d_i)$ is the number of occurrences of the $j$-th key word $w_j$ of the word dictionary in the software change log $d_i$.
The method solves the probability latent semantic analysis model by using the likelihood function L.
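A log-likelihood of this kind can be evaluated as in the sketch below, assuming the standard probabilistic latent semantic analysis form (count-weighted log of the joint probability). The function names `joint_probability` and `log_likelihood` and the list-of-lists layout are hypothetical choices for illustration.

```python
import math

def joint_probability(p_d_i, p_w_z, p_z_d_i, j):
    """P(d_i, w_j) = P(d_i) * sum_k P(w_j | z_k) * P(z_k | d_i).
    p_w_z[k][j] = P(w_j | z_k); p_z_d_i[k] = P(z_k | d_i)."""
    return p_d_i * sum(p_w_z[k][j] * p_z_d_i[k] for k in range(len(p_w_z)))

def log_likelihood(counts, p_d, p_w_z, p_z_d):
    """L = sum_i sum_j n(w_j, d_i) * log P(d_i, w_j); zero counts are skipped."""
    total = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij:
                total += n_ij * math.log(joint_probability(p_d[i], p_w_z, p_z_d[i], j))
    return total

# Toy check with a single category: every joint probability is 0.5 * 0.5 = 0.25.
ll = log_likelihood([[2, 0], [0, 1]], [0.5, 0.5], [[0.5, 0.5]], [[1.0], [1.0]])
print(ll == 3 * math.log(0.25))  # True
```

Tracking this value across iterations is a common way to decide when an expectation-maximization run has converged.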
E) Taking each training sample in the training database in turn as a software change log $d_i$, substitute it into the likelihood function $L$ constructed in step D, and use the expectation-maximization algorithm to solve for the probability relation between each key word $w_j$ in the word dictionary and each change log category $z_k$, and for the probability relation between each change log category $z_k$ and each training sample serving as software change log $d_i$. This step solves, for the training database, the probability correlation between key words and change log categories and between training samples and change log categories, thereby determining the probability correlation characteristics of the different key words based on the training database obtained from prior knowledge. The specific process of this step is as follows:
e1) Taking each training sample in the training database in turn as a software change log $d_i$, substitute it into the likelihood function $L$ constructed in step D, $i \in \{1, 2, \ldots, M\}$, where the total number $M$ of software change logs is the total number of training samples in the training database. The initial value of the probability relation $P(z_k \mid d_i)$ between the $k$-th change log category $z_k$ and each training sample serving as software change log $d_i$ is chosen randomly, and the initial value of the probability relation $P(w_j \mid z_k)$ between the $j$-th key word $w_j$ in the word dictionary and the $k$-th change log category $z_k$ is:
[formula not preserved in this text]
where $n_k$ denotes the number of training samples in the training database that belong to the $k$-th change log category $z_k$, $k \in \{1, 2, 3\}$, and $n_{j,k}$ denotes the number of occurrences of the $j$-th key word $w_j$ of the word dictionary in the training samples that belong to the $k$-th change log category $z_k$;
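e2) In the E-step of the expectation-maximization algorithm, the conditional distribution probability $P(z_k \mid d_i, w_j)$ of each change log category $z_k$, $k \in \{1, 2, \ldots, K\}$, is calculated from the current values of the probability relationships $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$:

$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{k=1}^{K} \left[ P(w_j \mid z_k)\, P(z_k \mid d_i) \right]};$$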
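e3) In the M-step of the expectation-maximization algorithm, the conditional distribution probabilities $P(z_k \mid d_i, w_j)$ obtained in step e2 are used to update the values of the probability relationships $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ for each key word $w_j$ in the word dictionary, $j \in \{1, 2, \ldots, N\}$, each training sample taken as a software change log $d_i$ in the training database, $i \in \{1, 2, \ldots, M\}$, and each change log category $z_k$, $k \in \{1, 2, \ldots, K\}$:

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} \left[ n(w_j, d_i)\, P(z_k \mid d_i, w_j) \right]}{\sum_{j=1}^{N} \sum_{i=1}^{M} \left[ n(w_j, d_i)\, P(z_k \mid d_i, w_j) \right]};$$

$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} \left[ n(w_j, d_i)\, P(z_k \mid d_i, w_j) \right]}{n(d_i)};$$

where $n(w_j, d_i)$ represents the number of occurrences of the $j$-th key word $w_j$ of the word dictionary in the software change log $d_i$, and $n(d_i)$ represents the total number of occurrences of the key words of the word dictionary in the software change log $d_i$;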
e4) Steps e2 and e3 are repeated until the expectation-maximization algorithm converges. The probability relationship between each key word $w_j$ in the word dictionary and each change log category $z_k$ obtained at convergence is denoted $P_c(w_j \mid z_k)$, and the probability relationship between each change log category $z_k$ and each training sample taken as a software change log $d_i$ obtained at convergence is denoted $P_c(z_k \mid d_i)$, with $j \in \{1, 2, \ldots, N\}$, $i \in \{1, 2, \ldots, M\}$ and $k \in \{1, 2, \ldots, K\}$. The sample center probability relationship of each change log category $z_k$ is then calculated; for the $k$-th change log category $z_k$, the sample center probability relationship $\bar{P}_c(z_k)$ is:

$$\bar{P}_c(z_k) = \frac{\sum_{i=1}^{M} P_c(z_k \mid d_i)}{M}.$$
This step is solved with the well-established expectation-maximization algorithm, and the initial value of the probability relationship $P(w_j \mid z_k)$ is set to $n_{j,k}/n_k$. Compared with a randomly chosen initial value, this initial value reflects the objective distribution of the key word $w_j$ in the training samples of the training database that belong to the $k$-th change log category $z_k$, which helps the expectation-maximization algorithm converge faster. The probability relationships obtained at convergence, between each key word $w_j$ in the word dictionary and each change log category $z_k$ and between each change log category $z_k$ and each training sample taken as a software change log $d_i$, objectively reflect the probability correlation characteristics of the different key words of the word dictionary on the basis of the training database, as well as the probability correlation characteristics between each training sample in the training database and each change log category; these characteristics serve as the basis for the subsequent classification of the software change logs to be tested.
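Steps e1 to e4 can be illustrated with a minimal pure-Python sketch. It is an illustration under stated assumptions rather than the implementation of the invention: the function name `train_plsa`, the fixed iteration count used in place of a convergence test, and the guards against division by zero are all choices of this sketch. `counts[i][j]` stands for $n(w_j, d_i)$ and `labels[i]` for the known category index of training sample $d_i$.

```python
import random


def train_plsa(counts, labels, num_categories, iterations=50, seed=0):
    """Sketch of steps e1-e4 on a labelled training database."""
    rng = random.Random(seed)
    M, N, K = len(counts), len(counts[0]), num_categories

    # e1) P(w_j|z_k) = n_{j,k} / n_k, taken from the labelled samples.
    # max(..., 1) is this sketch's guard for an empty category.
    n_k = [max(labels.count(k), 1) for k in range(K)]
    p_w_z = [[sum(counts[i][j] for i in range(M) if labels[i] == k) / n_k[k]
              for j in range(N)] for k in range(K)]

    # e1) P(z_k|d_i) starts from random values, normalised over k.
    p_z_d = []
    for _ in range(M):
        row = [rng.random() + 1e-3 for _ in range(K)]
        total = sum(row)
        p_z_d.append([v / total for v in row])

    # Assumption: a fixed number of iterations replaces a convergence test.
    for _ in range(iterations):
        # e2) E-step: conditional probability P(z_k | d_i, w_j).
        post = [[[0.0] * K for _ in range(N)] for _ in range(M)]
        for i in range(M):
            for j in range(N):
                joint = [p_w_z[k][j] * p_z_d[i][k] for k in range(K)]
                total = sum(joint)
                if total > 0:
                    post[i][j] = [v / total for v in joint]

        # e3) M-step: update P(w_j|z_k) ...
        for k in range(K):
            weights = [sum(counts[i][j] * post[i][j][k] for i in range(M))
                       for j in range(N)]
            total = sum(weights)
            if total > 0:
                p_w_z[k] = [w / total for w in weights]
        # ... and P(z_k|d_i), dividing by n(d_i).
        for i in range(M):
            n_d = sum(counts[i])
            if n_d > 0:
                p_z_d[i] = [sum(counts[i][j] * post[i][j][k]
                                for j in range(N)) / n_d for k in range(K)]

    # e4) Sample center: mean of P_c(z_k|d_i) over all M training samples.
    centers = [sum(p_z_d[i][k] for i in range(M)) / M for k in range(K)]
    return p_w_z, p_z_d, centers
```

The returned `p_w_z` and `p_z_d` play the roles of $P_c(w_j \mid z_k)$ and $P_c(z_k \mid d_i)$; a stricter variant could stop once the change between successive iterations falls below a tolerance.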
F) Software change logs whose change log category is to be determined are acquired as samples to be tested, and the set of all samples to be tested is taken as a test database; stem extraction processing is performed on each sample to be tested in the test database to obtain the words contained in the stem of each sample to be tested.
The number of software change logs serving as samples to be tested in the test database is determined by the actual number of software change logs whose category needs to be determined. The classification method of the invention can be applied when the test database contains any number of software change logs whose change log categories are to be determined. The stem extraction processing of each sample to be tested can be carried out with a stem extraction algorithm from the prior art; words that carry no actual content meaning, such as "the", "on", "a", "while" and "which", also need to be removed.
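The preprocessing described here (stop-word removal, stem extraction, then counting key words) might be sketched as follows. This is only an illustration: `toy_stem` and its suffix list are hypothetical stand-ins for a real stemming algorithm such as the Porter stemmer, the stop-word set contains only the examples named in the text, and the function names are assumptions of this sketch.

```python
import re
from collections import Counter

# Only the example stop words named in the text; a real list would be longer.
STOP_WORDS = {"the", "on", "a", "while", "which"}

# Hypothetical toy suffix rules, NOT a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(log_text):
    """Lower-case, tokenise, drop stop words, and stem the remaining words."""
    tokens = re.findall(r"[a-z]+", log_text.lower())
    return [toy_stem(t) for t in tokens if t not in STOP_WORDS]


def keyword_counts(log_text, dictionary):
    """Return n(w_j, d_i) for every key word w_j of the word dictionary."""
    counts = Counter(preprocess(log_text))
    return [counts[w] for w in dictionary]
```

For example, `keyword_counts("Fixed the crash; fixes which crashed", ["fix", "add", "crash"])` yields one row of the count matrix that the expectation-maximization steps consume.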
G) Each sample to be tested in the test database is taken as a software change log $d_i$ and substituted into the likelihood function $L$ constructed in step D, and the expectation-maximization algorithm is used to solve for the probability relationship between each change log category $z_k$ and each sample to be tested taken as a software change log $d_i$. In this step, the probability correlation characteristics between each sample to be tested in the test database and each change log category are determined from the probability correlation characteristics of the different key words of the word dictionary established on the training database. The specific process of this step is as follows:
g1) Each sample to be tested in the test database is taken as a software change log $d_i$, $i \in \{1, 2, \ldots, M\}$, and substituted into the likelihood function $L$ constructed in step D; here the total number $M$ of software change logs is the total number of samples to be tested in the test database. The probability relationship $P(z_k \mid d_i)$ between the $k$-th change log category $z_k$ and each sample to be tested taken as a software change log $d_i$ is given a random initial value, and the initial value of the probability relationship $P(w_j \mid z_k)$ between the $j$-th key word $w_j$ in the word dictionary and the $k$-th change log category $z_k$ is set to:

$$P(w_j \mid z_k) = P_c(w_j \mid z_k);$$
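g2) In the E-step of the expectation-maximization algorithm, the conditional distribution probability $P(z_k \mid d_i, w_j)$ of each change log category $z_k$, $k \in \{1, 2, \ldots, K\}$, is calculated from the current values of the probability relationships $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$:

$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{k=1}^{K} \left[ P(w_j \mid z_k)\, P(z_k \mid d_i) \right]};$$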
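g3) In the M-step of the expectation-maximization algorithm, the conditional distribution probabilities $P(z_k \mid d_i, w_j)$ obtained in step g2 are used to update the values of the probability relationships $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ for each key word $w_j$ in the word dictionary, $j \in \{1, 2, \ldots, N\}$, each sample to be tested taken as a software change log $d_i$ in the test database, $i \in \{1, 2, \ldots, M\}$, and each change log category $z_k$, $k \in \{1, 2, \ldots, K\}$:

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} \left[ n(w_j, d_i)\, P(z_k \mid d_i, w_j) \right]}{\sum_{j=1}^{N} \sum_{i=1}^{M} \left[ n(w_j, d_i)\, P(z_k \mid d_i, w_j) \right]};$$

$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} \left[ n(w_j, d_i)\, P(z_k \mid d_i, w_j) \right]}{n(d_i)};$$

where $n(w_j, d_i)$ represents the number of occurrences of the $j$-th key word $w_j$ of the word dictionary in the software change log $d_i$, and $n(d_i)$ represents the total number of occurrences of the key words of the word dictionary in the software change log $d_i$;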
g4) Steps g2 and g3 are repeated until the expectation-maximization algorithm converges, which yields the probability relationship between each change log category $z_k$ and each sample to be tested taken as a software change log $d_i$.
This step is likewise solved with the expectation-maximization algorithm. During the solution, the initial value of the probability relationship $P(w_j \mid z_k)$ between the $j$-th key word $w_j$ in the word dictionary and the $k$-th change log category $z_k$ is set to $P(w_j \mid z_k) = P_c(w_j \mid z_k)$; that is, the probability correlation characteristics of the different key words of the word dictionary established on the training database are used to solve for the probability relationship between each change log category $z_k$ and each sample to be tested. On the one hand, this helps the expectation-maximization algorithm converge faster; on the other hand, it allows the probability correlation characteristics between each sample to be tested and the change log categories to be determined with the actual condition of the training database as an objective basis.
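Steps g1 to g4 might be sketched in pure Python as below, again as an illustration under assumptions and not as the implementation of the invention. The function name `solve_test_samples`, the fixed iteration count and the accumulate-as-you-go layout of the E-step and M-step are choices of this sketch. `p_w_z_trained[k][j]` stands for the converged training value $P_c(w_j \mid z_k)$ and `test_counts[i][j]` for $n(w_j, d_i)$ in the test database.

```python
import random


def solve_test_samples(test_counts, p_w_z_trained, iterations=50, seed=0):
    """Sketch of steps g1-g4: returns P(z_k|d_i) for every test sample."""
    rng = random.Random(seed)
    M, N, K = len(test_counts), len(test_counts[0]), len(p_w_z_trained)

    # g1) P(w_j|z_k) starts from the converged training values P_c(w_j|z_k).
    p_w_z = [row[:] for row in p_w_z_trained]
    # g1) P(z_k|d_i) starts from random values, normalised over k.
    p_z_d = []
    for _ in range(M):
        row = [rng.random() + 1e-3 for _ in range(K)]
        total = sum(row)
        p_z_d.append([v / total for v in row])

    # Assumption: a fixed number of iterations replaces a convergence test.
    for _ in range(iterations):
        word_mass = [[0.0] * N for _ in range(K)]
        doc_mass = [[0.0] * K for _ in range(M)]
        for i in range(M):
            for j in range(N):
                n = test_counts[i][j]
                if n == 0:
                    continue
                # g2) E-step: P(z_k | d_i, w_j) for this (sample, word) pair.
                joint = [p_w_z[k][j] * p_z_d[i][k] for k in range(K)]
                total = sum(joint)
                if total == 0:
                    continue
                for k in range(K):
                    weighted = n * joint[k] / total
                    word_mass[k][j] += weighted
                    doc_mass[i][k] += weighted
        # g3) M-step: update P(w_j|z_k) and P(z_k|d_i).
        for k in range(K):
            total = sum(word_mass[k])
            if total > 0:
                p_w_z[k] = [v / total for v in word_mass[k]]
        for i in range(M):
            n_d = sum(test_counts[i])
            if n_d > 0:
                p_z_d[i] = [v / n_d for v in doc_mass[i]]
    return p_z_d
```

Starting from the trained word distributions is what ties each latent category index to a known change log category; a purely random start would leave the category order arbitrary.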
H) According to the probability relationship between each change log category $z_k$ and each sample to be tested obtained in step G, and the sample center probability relationship $\bar{P}_c(z_k)$ of each change log category $z_k$, the similarity $\mathrm{Sim}(d_{x,m}, z_k)$ between each sample to be tested and the sample center probability of each change log category $z_k$ is calculated, and the change log category to which each sample to be tested belongs is thereby determined:

$$X_m = \arg\max_{k} \left[ \mathrm{Sim}(d_{x,m}, z_k) \right], \quad k \in \{1, 2, 3\};$$

where $X_m$ represents the change log category to which an arbitrary $m$-th sample to be tested $d_{x,m}$ belongs, and the similarity $\mathrm{Sim}(d_{x,m}, z_k)$ is:

$$\mathrm{Sim}(d_{x,m}, z_k) = \frac{\sum_{k=1}^{K} \left[ P_x(z_k \mid d_{x,m}) \cdot \bar{P}_c(z_k) \right]}{\sum_{k=1}^{K} P_x(z_k \mid d_{x,m})^2 \cdot \sum_{k=1}^{K} \bar{P}_c(z_k)^2};$$

where $P_x(z_k \mid d_{x,m})$ represents the probability relationship, obtained in step G, between the $k$-th change log category $z_k$ and the $m$-th sample to be tested $d_{x,m}$.

A category label is then added to each software change log serving as a sample to be tested, according to the change log category determined for it.
It can be seen that, in determining the change log category to which a sample to be tested belongs, the method of the invention jointly considers the probability relationship between each change log category $z_k$ and each sample to be tested obtained in step G and the sample center probability relationship $\bar{P}_c(z_k)$ of each change log category $z_k$. The similarity of the sample to each change log category is computed, and the sample is assigned to the change log category with the maximum similarity. The software change logs whose category is to be determined are thus classified objectively, rather than only according to the weight assigned to a word in a certain category, so that the classification decision is more comprehensive and accurate.
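The maximum-similarity decision of step H might be sketched as follows. The sketch adopts one possible reading and labels it as such: the similarity expression given in claim 1 sums over $k$ on both sides, so here each category is assumed to be represented by a center vector (the mean of the $P_c(z_k \mid d_i)$ vectors of the training samples labelled with that category), and the similarity is assumed to be the cosine similarity between a test sample's probability vector and that center. The function names, the per-category centers and the square-root normalisation are assumptions of this sketch, not statements of the formula of the invention.

```python
import math


def category_centers(p_z_d_train, labels, num_categories):
    """Assumed reading: mean probability vector of each category's samples."""
    K = num_categories
    centers = []
    for c in range(K):
        members = [p for p, lab in zip(p_z_d_train, labels) if lab == c]
        if not members:
            centers.append([0.0] * K)  # guard chosen by this sketch
            continue
        centers.append([sum(p[k] for p in members) / len(members)
                        for k in range(K)])
    return centers


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(p_z_d_test, centers):
    """X_m = argmax over categories of the similarity to each center."""
    return [max(range(len(centers)), key=lambda c: cosine(p, centers[c]))
            for p in p_z_d_test]
```

The returned indices are zero-based, so index 0 corresponds to the 1st change log category.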
Compared with the existing software change log classification method, the software change log classification method based on semi-supervised probability latent semantic analysis combines the word dictionary determined by the priori knowledge, objectively classifies the software change logs according to the probability correlation between words, the probability correlation between words and change log categories and the probability correlation between the software change logs and the change log categories, avoids classifying the software change logs according to the weighted value of the word frequency characteristic, improves the classification accuracy, and effectively solves the problems of error and lower accuracy of software change log classification caused by artificially setting the weighted value in the prior art.
Verification experiment:
The invention was also verified experimentally. Taking Cohen's Kappa value as the measure of classification accuracy, the software change log classification method based on semi-supervised probabilistic latent semantic analysis was compared with the "firstkey" classification method proposed by Hattori et al., which classifies log data by assigning weights to word-frequency features. In this experiment, both classification methods were applied to five existing large open source projects, namely Bugzilla, Wireshark, Boost, Firebird and Python; the software change log data sets of the five open source projects are shown in Table 2.
TABLE 2
The software change log data of the five open source projects are classified into the following three categories by respectively adopting a 'firstkey' classification method and the classification method of the invention:
1st change log category $z_1$: software change logs generated by repairing software corruption, errors, or defects;
2nd change log category $z_2$: software change logs generated by adding software functions or software features;
3rd change log category $z_3$: software change logs generated by re-editing or refactoring the software.
Cohen's Kappa value is used as the measure of classification accuracy, and the resulting classification accuracy statistics are shown in FIG. 2. As can be seen from FIG. 2, the classification method of the invention outperforms the "firstkey" classification method proposed by Hattori et al. in the classification accuracy of the software change logs of every open source project, which shows that the classification accuracy of the software change log classification method based on semi-supervised probabilistic latent semantic analysis is clearly superior to that of the existing software change log classification method. Moreover, the classification results for the software change logs of the five open source projects reach an average Cohen's Kappa value of 0.53; according to the measurement standard proposed by El Emam, an average Cohen's Kappa value higher than 0.50 indicates that the classification result agrees very closely with the true result. The software change log classification method based on semi-supervised probabilistic latent semantic analysis therefore has very good feasibility and effectiveness in practical application.
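For reference, Cohen's Kappa compares the observed agreement $p_o$ between predicted and true labels with the agreement $p_e$ expected by chance, $\kappa = (p_o - p_e) / (1 - p_e)$. A minimal Python sketch is given below; the function name and the handling of the degenerate case $p_e = 1$ are choices of this sketch, and the labels in the usage example are made up for illustration and are not data from the experiment.

```python
from collections import Counter


def cohens_kappa(true_labels, predicted_labels):
    """Cohen's Kappa between two equally long label sequences."""
    n = len(true_labels)
    # Observed agreement p_o: fraction of samples with identical labels.
    observed = sum(t == p for t, p in zip(true_labels, predicted_labels)) / n
    # Chance agreement p_e: product of the marginal label frequencies.
    true_freq = Counter(true_labels)
    pred_freq = Counter(predicted_labels)
    expected = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    if expected == 1.0:
        # Kappa is undefined when both sequences use one identical label;
        # returning 1.0 here is a convention of this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels only (categories 1, 2, 3), not experimental data.
kappa = cohens_kappa([1, 1, 2, 2, 3, 3], [1, 1, 2, 3, 3, 3])
```

In this made-up example five of six labels agree ($p_o = 5/6$) and the chance agreement is $p_e = 1/3$, so the resulting value is 0.75.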
Finally, the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, and all of them should be covered in the claims of the present invention.

Claims (2)

1. The software change log classification method based on semi-supervised probability latent semantic analysis is characterized by comprising the following steps of:
A) according to the priori knowledge, the change log categories are divided, the key words corresponding to each change log category are determined, and the set of all the key words corresponding to each change log category is used as a word dictionary; a key word corresponding to each change log category in the word dictionary is one word in a word stem obtained by performing word stem extraction on the software change logs belonging to the corresponding change log categories according to prior knowledge; the change log categories are specifically divided into three categories, namely:
1st change log category $z_1$: software change logs generated by repairing software corruption, errors, or defects;
2nd change log category $z_2$: software change logs generated by adding software functions or software features;
3rd change log category $z_3$: software change logs generated by re-editing or refactoring the software;
B) acquiring, as training samples, a plurality of software change logs that belong to the three change log categories and whose change log categories are known, and taking the set of all training samples as a training database; counting the number $n_k$ of training samples in the training database that belong to the $k$-th change log category $z_k$, $k \in \{1, 2, \ldots, K\}$, where $K$ is the number of change log categories, that is, $K = 3$; and performing stem extraction processing on each training sample in the training database to obtain the words contained in the stem of each training sample;
C) establishing a probabilistic latent semantic analysis model among key words, software change logs and change log categories in a word dictionary:
$$P(w_j, d_i) = P(d_i) \sum_{k=1}^{K} \left[ P(z_k \mid d_i)\, P(w_j \mid z_k) \right];$$

wherein $P(w_j \mid z_k)$ represents the probability relationship between the $j$-th key word $w_j$ in the word dictionary and the $k$-th change log category $z_k$, $k \in \{1, 2, 3\}$; $P(z_k \mid d_i)$ represents the probability relationship between the $k$-th change log category $z_k$ and the $i$-th software change log $d_i$; $P(d_i)$ represents the probability of the number of words of the $i$-th software change log $d_i$ relative to the training database, i.e. $P(d_i) = n_i / N_{bace}$, where $n_i$ represents the number of words contained in the stem of the $i$-th software change log $d_i$ and $N_{bace}$ represents the sum of the numbers of words contained in the stems of all training samples in the training database;
D) constructing a likelihood function L of the probability latent semantic analysis model:
$$L = \prod_{i=1}^{M} \prod_{j=1}^{N} P(d_i) \sum_{k=1}^{K} \left[ P(z_k \mid d_i)\, P(w_j \mid z_k)^{n(w_j, d_i)} \right];$$

wherein $i \in \{1, 2, \ldots, M\}$ and $M$ represents the total number of software change logs; $j \in \{1, 2, \ldots, N\}$ and $N$ represents the total number of key words in the word dictionary; and $n(w_j, d_i)$ represents the number of occurrences of the $j$-th key word $w_j$ of the word dictionary in the software change log $d_i$;
E) taking each training sample in the training database as a software change log $d_i$, substituting it into the likelihood function $L$ constructed in step D, and solving, by an expectation-maximization algorithm, for the probability relationship between each key word $w_j$ in the word dictionary and each change log category $z_k$ and for the probability relationship between each change log category $z_k$ and each training sample taken as a software change log $d_i$; denoting as $P_c(w_j \mid z_k)$ the probability relationship between each key word $w_j$ in the word dictionary and each change log category $z_k$ obtained when the expectation-maximization algorithm converges, and denoting as $P_c(z_k \mid d_i)$ the probability relationship between each change log category $z_k$ and each training sample taken as a software change log $d_i$ obtained when the expectation-maximization algorithm converges, $j \in \{1, 2, \ldots, N\}$, $i \in \{1, 2, \ldots, M\}$, $k \in \{1, 2, \ldots, K\}$; and calculating the sample center probability relationship of each change log category $z_k$, wherein the sample center probability relationship $\bar{P}_c(z_k)$ of the $k$-th change log category $z_k$ is:

$$\bar{P}_c(z_k) = \frac{\sum_{i=1}^{M} P_c(z_k \mid d_i)}{M};$$

here the total number $M$ of software change logs is the total number of training samples in the training database;

the step specifically comprises:
e1) taking each training sample in the training database as a software change log $d_i$, $i \in \{1, 2, \ldots, M\}$, and substituting it into the likelihood function $L$ constructed in step D, the total number $M$ of software change logs being the total number of training samples in the training database; giving a random initial value to the probability relationship $P(z_k \mid d_i)$ between the $k$-th change log category $z_k$ and each training sample taken as a software change log $d_i$; and setting the initial value of the probability relationship $P(w_j \mid z_k)$ between the $j$-th key word $w_j$ in the word dictionary and the $k$-th change log category $z_k$ to:

$$P(w_j \mid z_k) = \frac{n_{j,k}}{n_k};$$

wherein $n_k$ represents the number of training samples in the training database that belong to the $k$-th change log category $z_k$, $k \in \{1, 2, 3\}$; and $n_{j,k}$ represents the number of occurrences of the $j$-th key word $w_j$ of the word dictionary in the training samples of the training database that belong to the $k$-th change log category $z_k$;
e2) in the E-step of the expectation-maximization algorithm, calculating the conditional distribution probability $P(z_k \mid d_i, w_j)$ of each change log category $z_k$, $k \in \{1, 2, \ldots, K\}$, from the current values of the probability relationships $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$:

$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{k=1}^{K} \left[ P(w_j \mid z_k)\, P(z_k \mid d_i) \right]};$$
e3) in the M-step of the expectation-maximization algorithm, using the conditional distribution probabilities $P(z_k \mid d_i, w_j)$ obtained in step e2 to update the values of the probability relationships $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ for each key word $w_j$ in the word dictionary, $j \in \{1, 2, \ldots, N\}$, each training sample taken as a software change log $d_i$ in the training database, $i \in \{1, 2, \ldots, M\}$, and each change log category $z_k$, $k \in \{1, 2, \ldots, K\}$:

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} \left[ n(w_j, d_i)\, P(z_k \mid d_i, w_j) \right]}{\sum_{j=1}^{N} \sum_{i=1}^{M} \left[ n(w_j, d_i)\, P(z_k \mid d_i, w_j) \right]};$$

$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} \left[ n(w_j, d_i)\, P(z_k \mid d_i, w_j) \right]}{n(d_i)};$$

wherein $n(w_j, d_i)$ represents the number of occurrences of the $j$-th key word $w_j$ of the word dictionary in the software change log $d_i$; and $n(d_i)$ represents the total number of occurrences of the key words of the word dictionary in the software change log $d_i$;
e4) repeating steps e2 and e3 until the expectation-maximization algorithm converges; denoting as $P_c(w_j \mid z_k)$ the probability relationship between each key word $w_j$ in the word dictionary and each change log category $z_k$ obtained at convergence, and denoting as $P_c(z_k \mid d_i)$ the probability relationship between each change log category $z_k$ and each training sample taken as a software change log $d_i$ obtained at convergence, $j \in \{1, 2, \ldots, N\}$, $i \in \{1, 2, \ldots, M\}$, $k \in \{1, 2, \ldots, K\}$; and calculating the sample center probability relationship of each change log category $z_k$, wherein the sample center probability relationship $\bar{P}_c(z_k)$ of the $k$-th change log category $z_k$ is:

$$\bar{P}_c(z_k) = \frac{\sum_{i=1}^{M} P_c(z_k \mid d_i)}{M};$$
F) acquiring a software change log of a change log category to be determined as a sample to be tested, and taking a set of all samples to be tested as a test database; respectively carrying out stem extraction processing on each sample to be tested in the test database to obtain each word contained in the stem of each sample to be tested;
G) taking each sample to be tested in the test database as a software change log $d_i$, substituting it into the likelihood function $L$ constructed in step D, and solving, by an expectation-maximization algorithm, for the probability relationship between each change log category $z_k$ and each sample to be tested taken as a software change log $d_i$;
H) according to the probability relationship between each change log category $z_k$ and each sample to be tested obtained in step G, and the sample center probability relationship $\bar{P}_c(z_k)$ of each change log category $z_k$, calculating the similarity $\mathrm{Sim}(d_{x,m}, z_k)$ between each sample to be tested and the sample center probability of each change log category $z_k$, thereby determining the change log category to which each sample to be tested belongs:

$$X_m = \arg\max_{k} \left[ \mathrm{Sim}(d_{x,m}, z_k) \right], \quad k \in \{1, 2, 3\};$$

wherein $X_m$ represents the change log category to which an arbitrary $m$-th sample to be tested $d_{x,m}$ belongs; and the similarity $\mathrm{Sim}(d_{x,m}, z_k)$ is:

$$\mathrm{Sim}(d_{x,m}, z_k) = \frac{\sum_{k=1}^{K} \left[ P_x(z_k \mid d_{x,m}) \cdot \bar{P}_c(z_k) \right]}{\sum_{k=1}^{K} P_x(z_k \mid d_{x,m})^2 \cdot \sum_{k=1}^{K} \bar{P}_c(z_k)^2};$$

wherein $P_x(z_k \mid d_{x,m})$ represents the probability relationship, obtained in step G, between the $k$-th change log category $z_k$ and the $m$-th sample to be tested $d_{x,m}$;
thus, a category label is added to each software change log as a sample to be tested according to the determined change log category to which each sample to be tested belongs.
2. The software change log classification method based on semi-supervised probability latent semantic analysis according to claim 1, wherein the step G specifically comprises:
g1) taking each sample to be tested in the test database as a software change log $d_i$, $i \in \{1, 2, \ldots, M\}$, and substituting it into the likelihood function $L$ constructed in step D, the total number $M$ of software change logs being the total number of samples to be tested in the test database; giving a random initial value to the probability relationship $P(z_k \mid d_i)$ between the $k$-th change log category $z_k$ and each sample to be tested taken as a software change log $d_i$; and setting the initial value of the probability relationship $P(w_j \mid z_k)$ between the $j$-th key word $w_j$ in the word dictionary and the $k$-th change log category $z_k$ to:

$$P(w_j \mid z_k) = P_c(w_j \mid z_k);$$
g2) in the E-step of the expectation-maximization algorithm, calculating the conditional distribution probability $P(z_k \mid d_i, w_j)$ of each change log category $z_k$, $k \in \{1, 2, \ldots, K\}$, from the current values of the probability relationships $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$:

$$P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{k=1}^{K} \left[ P(w_j \mid z_k)\, P(z_k \mid d_i) \right]};$$
g3) in the M-step of the expectation-maximization algorithm, using the conditional distribution probabilities $P(z_k \mid d_i, w_j)$ obtained in step g2 to update the values of the probability relationships $P(w_j \mid z_k)$ and $P(z_k \mid d_i)$ for each key word $w_j$ in the word dictionary, $j \in \{1, 2, \ldots, N\}$, each sample to be tested taken as a software change log $d_i$ in the test database, $i \in \{1, 2, \ldots, M\}$, and each change log category $z_k$, $k \in \{1, 2, \ldots, K\}$:

$$P(w_j \mid z_k) = \frac{\sum_{i=1}^{M} \left[ n(w_j, d_i)\, P(z_k \mid d_i, w_j) \right]}{\sum_{j=1}^{N} \sum_{i=1}^{M} \left[ n(w_j, d_i)\, P(z_k \mid d_i, w_j) \right]};$$

$$P(z_k \mid d_i) = \frac{\sum_{j=1}^{N} \left[ n(w_j, d_i)\, P(z_k \mid d_i, w_j) \right]}{n(d_i)};$$

wherein $n(w_j, d_i)$ represents the number of occurrences of the $j$-th key word $w_j$ of the word dictionary in the software change log $d_i$; and $n(d_i)$ represents the total number of occurrences of the key words of the word dictionary in the software change log $d_i$;
g4) repeating steps g2 and g3 until the expectation-maximization algorithm converges, thereby obtaining the probability relationship between each change log category $z_k$ and each sample to be tested taken as a software change log $d_i$.
CN201410234156.6A 2014-05-29 2014-05-29 Semi-supervised probabilistic latent semantic analysis based software change log classification method Active CN103984756B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410234156.6A CN103984756B (en) 2014-05-29 2014-05-29 Semi-supervised probabilistic latent semantic analysis based software change log classification method

Publications (2)

Publication Number Publication Date
CN103984756A (en) 2014-08-13
CN103984756B (en) 2017-04-12

Family

ID=51276729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410234156.6A Active CN103984756B (en) 2014-05-29 2014-05-29 Semi-supervised probabilistic latent semantic analysis based software change log classification method

Country Status (1)

Country Link
CN (1) CN103984756B (en)

Also Published As

Publication number Publication date
CN103984756A (en) 2014-08-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant