CN103729197A

CN103729197A - Multi-granularity layer software clustering method based on LDA (Latent Dirichlet Allocation) model

Info

Publication number: CN103729197A
Application number: CN201410028677.6A
Authority: CN
Inventors: 孙小兵; 刘湘月; 李斌; 杨智松
Original assignee: Yangzhou University
Current assignee: Dongguan Aipeike Technology Co ltd
Priority date: 2014-01-22
Filing date: 2014-01-22
Publication date: 2014-04-16
Anticipated expiration: 2034-01-22
Also published as: CN103729197B

Abstract

The invention discloses a multi-granularity-level software clustering method based on the LDA (Latent Dirichlet Allocation) model, in the technical field of software engineering. It addresses the problem that prior-art software clustering techniques ignore the functional features of software, so that developers cannot quickly understand a software system from the clustering result. Topics are extracted with the LDA model at two levels, class and method, so that clustering proceeds from the coarse-grained level to the fine-grained level. This gives developers a system structure that is easier to understand and makes the clustering result more effective and more practical. With the method, developers can clearly identify the function points of a software program and quickly locate the source code that implements a required function. The method supports program comprehension during software maintenance and evolution, offering developers a step-by-step path of understanding from the system down to individual methods. It offers good clustering performance, strong practicality, and high working efficiency.

Description

A multi-granularity-level software clustering method based on the LDA model

Technical field

The present invention relates to a clustering method, in particular a multi-granularity-level software clustering method based on the LDA model, and belongs to the technical field of software engineering.

Background art

To keep up with users' constantly changing requirements, software products generally need continual upgrading and maintenance. To carry out a maintenance request, a developer must first understand the whole software system, and in particular the program itself. As software systems evolve, however, their scale grows and their complexity inevitably rises; in general, program comprehension accounts for 60% of the time spent on software maintenance. To assist this work, software clustering techniques have been proposed. Their purpose is to extract from the source code subsystems that are smaller, more cohesive, and easier to understand, together with the relations between them, so that legacy systems can be understood, analyzed, and restructured more efficiently.

Most prior-art software clustering techniques rely on the static structural dependencies between program elements. A comprehension-based clustering method has also been proposed, in which code matching the same pattern is grouped into one cluster and the result is given a meaningful name. Both approaches, however, ignore the functional features of the system. The goal of program comprehension is to understand the function points of a system and how each function point is implemented by different pieces of source code, so neither approach helps developers understand a program quickly and efficiently.

In a software system, the change requests raised during upgrading and maintenance are usually called features or topics. A feature or topic represents a function, and that function is defined by the requirements and acceptance criteria of developers and users. If features or topics are available at the very start of the software clustering process, they can effectively give developers an overall picture of the system.

The LDA (Latent Dirichlet Allocation) model is currently the most representative and most popular probabilistic topic model, and it has been widely applied in text mining, knowledge discovery, topic tracking, multi-document summarization, and related fields. LDA is an unsupervised machine learning technique that can identify the hidden topic information in large document collections or corpora, and it offers considerable flexibility and automation. Given a data set, an LDA model can mine a specified number of latent topics, uncover the implicit links between pieces of semantic information, and represent a text in terms of these topics, thereby reducing the dimensionality of the features.
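For orientation, the generative process usually associated with LDA can be written as below. This is the standard textbook formulation, not something stated in the patent; the symbols K (number of topics), D (number of documents), α and β (Dirichlet hyperparameters), θ_d (topic mixture of document d), φ_k (word distribution of topic k), z (topic assignment), and w (observed word) are the conventional notation and are assumptions here. In the method described in this document, K would correspond to the preset number of topics.

```latex
\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\beta), && k = 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), && d = 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{aligned}
```

Fitting the model means inferring θ and φ from the observed words; the highest-probability words of each φ_k are what is typically reported as the words of topic k.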

Summary of the invention

The object of the invention is to provide a multi-granularity-level software clustering method based on the LDA model, intended to solve the technical problem that prior-art software clustering techniques ignore the functional features of software, so that developers cannot quickly understand a software system from the clustering result.

The object of the invention is achieved as follows: a multi-granularity-level software clustering method based on the LDA model, comprising the following steps:

(1) From the software system to be clustered, select class names, method names, and comments as the screening objects and screen the software system accordingly; use the LDA model to extract k system topics from the screened software system, where k is a user-defined value;

(2) For each system topic, calculate the ratio of the number of its topic words to the total number of words in the document of the software system. If the ratio equals 1, assign the class containing the corresponding topic word to the class-level initial cluster that matches the corresponding system topic. If the ratio is less than 1, sort the ratios in descending order, select the classes containing the topic words whose ratios rank in the top M, and assign those classes to the class-level initial cluster that matches the corresponding system topic, where M is a user-defined value;

(3) Analyze, one by one, the relations between the unassigned classes and the assigned classes in the software system. If an unassigned class has a dependency relation with an assigned class, assign the unassigned class to the initial cluster containing that assigned class. Continue until every unassigned class has been placed in a corresponding initial cluster, which yields the class-level clustering result of the software system based on the system topics;

(4) Based on the clustering result of step (3), use the LDA model to extract j class topics from a class, where j is a user-defined value;

(5) For each class topic, calculate the ratio of the number of its topic words to the total number of words in the document of the class. If the ratio equals 1, assign the method containing the corresponding topic word to the method-level initial cluster that matches the corresponding class topic. If the ratio is less than 1, sort the ratios in descending order, select the methods containing the topic words whose ratios rank in the top N, and assign those methods to the method-level initial cluster that matches the corresponding class topic, where N is a user-defined value;

(6) Analyze, one by one, the relations between the unassigned methods and the assigned methods in the class. If an unassigned method has a dependency relation with an assigned method, assign the unassigned method to the initial cluster containing that assigned method. Continue until every unassigned method has been placed in a corresponding initial cluster, which yields the method-level clustering result of the class based on the class topics.
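The screening in step (1) can be illustrated with a minimal sketch that turns class names, method names, and comments into a bag of words suitable as input to a topic model. The patent does not say how identifiers are tokenized or which words are filtered, so the function names `split_identifier` and `screen_class`, the camelCase/snake_case splitting rule, and the stop list are all assumptions made for illustration.

```python
import re

# Hypothetical stop list; the patent does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "get", "set"}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    # First alternative keeps acronyms such as "XML" together;
    # second alternative matches an optionally capitalised word.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", name)
    return [p.lower() for p in parts]


def screen_class(class_name, method_names, comments):
    """Build the word list for one class from its names and comments only."""
    words = split_identifier(class_name)
    for method in method_names:
        words += split_identifier(method)
    for comment in comments:
        words += [w.lower() for w in re.findall(r"[A-Za-z]+", comment)]
    # Drop stop words and single letters (assumed filtering rule).
    return [w for w in words if w not in STOP_WORDS and len(w) > 1]


doc = screen_class(
    "OrderManager",
    ["addOrder", "get_total_price"],
    ["Computes the total price of an order."],
)
print(doc)
```

One such word list per class would form the corpus for the system-level topics; building one list per method in the same way would serve the class-level topics of step (4).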

The beneficial effects of the invention are as follows. Topics are extracted with the LDA model at two different levels, class and method, so that clustering proceeds from the coarse-grained level to the fine-grained level; this gives developers a system structure that is easier to understand and makes the clustering result more effective and more practical. Because the clustering results at both the class level and the method level expose the functional features of the software system, developers can clearly identify the function points of a program from the clustering result and quickly locate the required source code from the method-level result. The invention realizes a top-down, stepwise-refinement software clustering process that better matches how developers actually come to understand software, helping them understand the whole system simply, progressively, and quickly. Applied to program comprehension during software maintenance and evolution, the method offers developers a step-by-step path of understanding from the system down to individual methods, with good clustering performance, strong practicality, and high working efficiency.

Brief description of the drawings

Fig. 1 is a flowchart of the class-level initial clustering in the present invention.

Fig. 2 is a flowchart of the method-level initial clustering in the present invention.

Fig. 3 is a schematic diagram of the software system structure as understood through topics.

Embodiment

Fig. 1 shows the flowchart of the class-level initial clustering in the present invention, which comprises the following steps:

(1) From the software system to be clustered, select class names, method names, and comments as the screening objects and screen the software system accordingly; use the LDA model to extract k system topics from the screened software system. The value k is user-defined and must be set in advance. In this embodiment, let the extracted system topics be t1, t2, t3, ..., tk; let the topic words of topic tx be tx0, tx1, tx2, ...; and let the class containing topic word txy be cxy, where 1 ≤ x ≤ k and 0 ≤ y.

(2) Calculate the ratio of the number of topic words of each system topic to the total number of words in the document of the software system, denoted P; the value of P can be computed and supplied by the LDA model. If P = 1, assign the class containing the topic word to the class-level initial cluster that matches the corresponding system topic. If P < 1, sort the ratios in descending order, select the classes containing the topic words whose ratios rank in the top M, and assign those classes to the class-level initial cluster that matches the corresponding system topic, where M is a user-defined value. As shown in Table 1, the P value of topic word t10 of system topic t1 is 1, so the class c10 containing t10 is assigned directly to the class-level initial cluster matching t1. If M = 2 and the topic words of t1 with the two highest P values are t11 and t12, then classes c11 and c12 are also assigned to the class-level initial cluster matching t1.

[Table 1 (image not reproduced)]

(3) Analyze, one by one, the relations between the unassigned classes and the assigned classes in the software system. Here an unassigned class is a class containing a topic word for which P < 1 and whose P value does not rank in the top M. If an unassigned class has a dependency relation with an assigned class, assign the unassigned class to the initial cluster containing that assigned class. Continue until every unassigned class has been placed in a corresponding initial cluster, which yields the class-level clustering result of the software system based on the system topics. For example, suppose the unassigned classes are c0, c1, c2, c3, c4, c5, c6, and c7. If c12 has a calling or called relation with classes c0 and c5, c21 has one with classes c1, c3, and c4, and ck1 has one with classes c2, c6, and c7, then c0 and c5 are assigned to the class-level initial cluster containing c12; c1, c3, and c4 are assigned to the class-level initial cluster containing c21; and c2, c6, and c7 are assigned to the class-level initial cluster containing ck1, as shown in Table 2.

[Table 2 (image not reproduced)]

After these three steps, the class-level initial clustering of the software program is complete, and the function points of the system can be understood. If it is necessary to understand further how each function point is implemented by different source code, proceed to the next step; otherwise, clustering can stop here.
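The selection rule of step (2) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function name `initial_clusters` and the input layout are hypothetical, P is treated as a value attached to each topic word (as in the worked example for t10, t11, and t12), and the P values 0.8, 0.6, and 0.3 are invented because Table 1 is not available here.

```python
def initial_clusters(topic_words, top_m):
    """Select the classes that seed each topic's initial cluster.

    topic_words maps a topic name to a list of (word, ratio, owner) tuples,
    where owner is the class that contains the topic word.
    """
    clusters = {}
    for topic, entries in topic_words.items():
        # Words whose ratio equals 1 are assigned unconditionally.
        certain = [e for e in entries if e[1] >= 1.0]
        # Among the rest, keep only the top_m highest ratios.
        partial = sorted(
            (e for e in entries if e[1] < 1.0),
            key=lambda e: e[1],
            reverse=True,
        )[:top_m]
        owners = []
        for _word, _ratio, owner in certain + partial:
            if owner not in owners:  # a class may hold several topic words
                owners.append(owner)
        clusters[topic] = owners
    return clusters


# Ratios below 1 are made-up illustration values, not data from Table 1.
topic_words = {
    "t1": [
        ("t10", 1.0, "c10"),
        ("t11", 0.8, "c11"),
        ("t12", 0.6, "c12"),
        ("t13", 0.3, "c13"),
    ]
}
clusters = initial_clusters(topic_words, top_m=2)
print(clusters)
```

With M = 2 the sketch places c10, c11, and c12 in the cluster for t1 and leaves c13 unassigned, so c13 would be handled by the dependency analysis of step (3). The text does not say what happens when one class qualifies for several topics, and the sketch does not resolve that case either. The same rule applies at the method level with Q and N in place of P and M.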

Fig. 2 shows the flowchart of the method-level initial clustering in the present invention. The steps are as follows:

(4) Based on the clustering result of step (3), use the LDA model to extract j class topics from a class. The value j must be set in advance, and its concrete value can be adjusted as needed. Taking class c10 of system topic t1 as an example, let the class topics extracted from c10 be m1, m2, m3, ..., mj; let the topic words of class topic mn be mn0, mn1, mn2, ...; and let the method containing topic word mnr be gnr, where 1 ≤ n ≤ j and 0 ≤ r.

(5) Calculate the ratio of the number of topic words of each class topic to the total number of words in the document of the class, denoted Q. If Q = 1, assign the method containing the topic word to the method-level initial cluster that matches the corresponding class topic. If Q < 1, sort the ratios in descending order, select the methods containing the topic words whose ratios rank in the top N, and assign those methods to the method-level initial cluster that matches the corresponding class topic, where N is a user-defined value. As shown in Table 3, the Q value of topic word m11 of class topic m1 is 1, so the method g11 containing m11 is assigned directly to the method-level initial cluster matching m1. If N = 2 and the topic words of m1 with the two highest Q values are m10 and m15, then methods g10 and g15 are also assigned to the method-level initial cluster matching m1.

[Table 3 (image not reproduced)]

(6) Analyze, one by one, the relations between the unassigned methods and the assigned methods in the class. Here an unassigned method is a method containing a topic word for which Q < 1 and whose Q value does not rank in the top N. If an unassigned method has a dependency relation with an assigned method, assign the unassigned method to the initial cluster containing that assigned method. Continue until every unassigned method has been placed in a corresponding initial cluster, which yields the method-level clustering result of the class based on the class topics. For example, suppose the unassigned methods are g0, g1, g2, g3, and g4. If the assigned method g10 has a calling or called relation with methods g0 and g3, g11 has one with method g1, and g26 has one with methods g2 and g4, then g0 and g3 are assigned to the method-level initial cluster containing g10; g1 is assigned to the method-level initial cluster containing g11; and g2 and g4 are assigned to the method-level initial cluster containing g26, as shown in Table 4.

[Table 4 (image not reproduced)]

This completes the method-level initial clustering. Developers can use the method-level clustering result to quickly understand the source code behind each function point.
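The dependency-based assignment used in steps (3) and (6) can be sketched as a breadth-first propagation from the already assigned elements, shown here on the method-level example (g10, g11, g15 in the cluster of m1; g26 in the cluster of m2, following the gnr naming convention). The function name `propagate`, the edge-list representation of call relations, and the tie-breaking rule (an element joins the first cluster that reaches it) are assumptions; the text only states that an unassigned element joins the cluster of an assigned element it depends on.

```python
from collections import deque


def propagate(clusters, dependencies, unassigned):
    """Attach unassigned elements to the cluster of an assigned neighbour.

    clusters:     {topic: [assigned elements]}
    dependencies: iterable of (caller, callee) pairs
    unassigned:   elements not yet in any cluster
    Returns the extended clusters and the elements that could not be reached.
    """
    owner = {e: topic for topic, members in clusters.items() for e in members}
    # "Calls or is called by" is treated as an undirected relation.
    neighbours = {}
    for a, b in dependencies:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    result = {topic: list(members) for topic, members in clusters.items()}
    pending = set(unassigned)
    queue = deque(e for members in clusters.values() for e in members)
    while queue:
        current = queue.popleft()
        for n in sorted(neighbours.get(current, ())):
            if n in pending:
                pending.discard(n)
                owner[n] = owner[current]
                result[owner[current]].append(n)
                queue.append(n)  # newly assigned elements can pull in others
    return result, pending


clusters = {"m1": ["g10", "g11", "g15"], "m2": ["g26"]}
calls = [("g10", "g0"), ("g3", "g10"), ("g11", "g1"), ("g26", "g2"), ("g4", "g26")]
result, leftover = propagate(clusters, calls, ["g0", "g1", "g2", "g3", "g4"])
print(result, leftover)
```

Because newly assigned elements are re-queued, an element that depends only on another previously unassigned element is still reached, which is one way to read "until all unassigned elements are assigned". Elements with no dependency path to any cluster stay in `leftover`; how to treat them is not specified in the text.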

Fig. 3 is a schematic diagram of the software system structure as understood through topics: the software system is associated with its classes through system topics, and each class is associated with its methods through class topics.
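The resulting two-level structure (system topic, class, class topic, method) can be represented with a small nested data model. The sketch below is only one possible representation; the type names `SystemTopic`, `ClassNode`, and `ClassTopic` and the helper `methods_for` are hypothetical, and the sample data reuses the identifiers from the worked examples.

```python
from dataclasses import dataclass, field


@dataclass
class ClassTopic:
    name: str
    methods: list = field(default_factory=list)  # method names


@dataclass
class ClassNode:
    name: str
    topics: list = field(default_factory=list)   # ClassTopic entries


@dataclass
class SystemTopic:
    name: str
    classes: list = field(default_factory=list)  # ClassNode entries


def methods_for(system_topics, topic_name):
    """Walk system topic -> class -> class topic -> method."""
    found = []
    for system_topic in system_topics:
        if system_topic.name != topic_name:
            continue
        for cls in system_topic.classes:
            for class_topic in cls.topics:
                found += [f"{cls.name}.{m}" for m in class_topic.methods]
    return found


system = [
    SystemTopic("t1", [
        ClassNode("c10", [
            ClassTopic("m1", ["g10", "g11", "g15"]),
            ClassTopic("m2", ["g26"]),
        ]),
    ]),
]
print(methods_for(system, "t1"))
```

A developer looking for the code behind a function point would start from the system topic and drill down, which mirrors the top-down understanding process the document describes.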

The present invention is not limited to the embodiments described above. On the basis of the technical solution disclosed herein, those skilled in the art can, without creative effort, make substitutions and variations to some of its technical features according to the disclosed technical content; all such substitutions and variations fall within the scope of protection of the present invention.

Claims

1. A multi-granularity-level software clustering method based on the LDA model, characterized in that it comprises the following steps: