CN103729197A - Multi-granularity layer software clustering method based on LDA (latent dirichlet allocation) model - Google Patents
Multi-granularity layer software clustering method based on LDA (Latent Dirichlet Allocation) model
- Publication number
- CN103729197A
- Authority
- CN
- China
- Prior art keywords
- class
- theme
- software
- ratio
- unappropriated
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a multi-granularity layer software clustering method based on the LDA (Latent Dirichlet Allocation) model, in the technical field of software engineering. It addresses the problem that prior-art software clustering techniques ignore the functional characteristics of software, so that developers cannot quickly understand a software system from the clustering result. Topics are extracted with the LDA model at two levels, class and method, so that clustering proceeds from the coarse-grained level to the fine-grained level. This gives developers a system structure that is easier to understand and makes the clustering result more effective and more practical. With this method, developers can clearly identify the function points of a software program and quickly locate the source code that implements a required function. Applied to program comprehension during software maintenance and evolution, the method gives developers a step-by-step path of understanding from the system down to individual methods. The method offers good clustering performance, strong practicability and high working efficiency.
Description
Technical field
The present invention relates to a clustering method, in particular a multi-granularity layer software clustering method based on the LDA model, and belongs to the technical field of software engineering.
Background
To meet users' constantly changing requirements, software products generally need continual upgrading and maintenance. To carry out a maintenance request, a developer first needs to understand the whole software system, and in particular its programs. As software systems evolve, however, they grow in scale and inevitably in complexity; in general, program comprehension accounts for 60% of the time spent on software maintenance. To assist this work, developers have proposed software clustering techniques, whose purpose is to extract from the source code smaller, more cohesive and more understandable subsystems together with the relations between them, thereby improving the efficiency with which people understand, analyze and transform legacy systems.
In the prior art, most software clustering techniques rely on static structural dependencies between program elements. Comprehension-based clustering methods have also been proposed, which group code matching the same pattern into one cluster and give the result a meaningful name. Both approaches, however, ignore the functional characteristics of the system. The goal of program comprehension is to understand the function points of a system and how each function point is implemented by different source code, so neither approach helps developers understand a program quickly and efficiently.
In software systems, change requests for upgrades and maintenance are usually called features or topics. A feature or topic can represent a function, which is defined according to the requirements of developers and users and to its acceptability. If features or topics could be provided at the very start of the software clustering process, developers would be effectively helped to obtain an overall picture of the system.
The LDA (Latent Dirichlet Allocation) model is currently the most representative and most popular probabilistic topic model, and has been widely and successfully applied in fields such as text mining, knowledge discovery, topic tracking and multi-document summarization. LDA is an unsupervised machine learning technique that can identify hidden topic information in large document collections or corpora, with high flexibility and a high degree of automation. An LDA model can mine a specified number of latent topics from a data set, effectively uncover the implicit relations within semantic information, and represent a text by these topics, thereby achieving feature dimensionality reduction.
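For reference, the generative process that LDA assumes can be written as follows. This is the standard textbook formulation rather than anything specific to this patent; the symbols \alpha, \beta, \theta, \varphi, z and w are the usual ones from the LDA literature and do not appear in the patent text. K is the specified number of topics, which would correspond to k at the system level and j at the class level.

```latex
\begin{align*}
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K
  && \text{(word distribution of topic } k\text{)} \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D
  && \text{(topic mixture of document } d\text{)} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), & n &= 1,\dots,N_d
  && \text{(topic of the } n\text{-th word)} \\
w_{d,n} \mid z_{d,n}, \varphi &\sim \mathrm{Categorical}(\varphi_{z_{d,n}})
  &&&& \text{(observed word)}
\end{align*}
```

Inference recovers \theta and \varphi from the observed words; the highest-probability words under each \varphi_k are what the description below calls the topic words of a topic.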
Summary of the invention
The object of the present invention is to provide a multi-granularity layer software clustering method based on the LDA model, intended to solve the technical problem that prior-art software clustering techniques ignore the functional characteristics of software, so that developers cannot quickly understand a software system from the clustering result.
This object is achieved as follows: a multi-granularity layer software clustering method based on the LDA model, comprising the following steps:
(1) select class names, method names and comments from the software system to be clustered as the screening objects and screen the software system; use the LDA model to extract k system topics from the screened software system, where k is a user-defined value;
(2) for each system topic, compute the ratio of its topic-word count to the total word count of the software system's documents; if the ratio equals 1, assign the class containing the corresponding topic word to the class-level initial cluster matching the corresponding system topic; if the ratio is less than 1, sort the ratios in descending order, select the classes containing the topic words whose ratios rank in the top M, and assign those classes to the class-level initial cluster matching the corresponding system topic, where M is a user-defined value;
(3) analyze, one by one, the relation between each unassigned class in the software system and the assigned classes; if an unassigned class has a dependency relation with an assigned class, assign the unassigned class to the initial cluster containing that assigned class; continue until all unassigned classes have been assigned to initial clusters, which yields the class-level clustering result of the software system based on system topics;
(4) based on the clustering result of step (3), use the LDA model to extract j class topics from a class, where j is a user-defined value;
(5) for each class topic, compute the ratio of its topic-word count to the total word count of the class's document; if the ratio equals 1, assign the method containing the corresponding topic word to the method-level initial cluster matching the corresponding class topic; if the ratio is less than 1, sort the ratios in descending order, select the methods containing the topic words whose ratios rank in the top N, and assign those methods to the method-level initial cluster matching the corresponding class topic, where N is a user-defined value;
(6) analyze, one by one, the relation between each unassigned method in the class and the assigned methods; if an unassigned method has a dependency relation with an assigned method, assign the unassigned method to the initial cluster containing that assigned method; continue until all unassigned methods have been assigned to initial clusters, which yields the method-level clustering result of the class based on class topics.
The beneficial effects of the invention are as follows. By extracting topics with the LDA model at two different levels, class and method, the invention clusters from the coarse-grained level to the fine-grained level, giving developers a system structure that is easier to understand and making the clustering result more effective and more practical. Because the clustering results at both the class level and the method level expose the functional characteristics of the software system, developers can clearly identify the function points of a software program from the clustering result and quickly locate the required source code from the method-level result. The invention realizes a top-down, stepwise-refinement software clustering process that better matches how developers actually come to understand software, helping them understand the whole system simply, progressively and quickly. Applied to program comprehension during software maintenance and evolution, the method gives developers a step-by-step path of understanding from the system down to individual methods, and offers good clustering performance, strong practicability and high working efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of the class-level initial clustering of the present invention.
Fig. 2 is a flowchart of the method-level initial clustering of the present invention.
Fig. 3 is a schematic diagram of the software system structure as understood through topics.
Embodiment
Fig. 1 shows the flowchart of the class-level initial clustering of the present invention, which comprises the following steps:
(1) Select class names, method names and comments from the software system to be clustered as the screening objects and screen the software system; use the LDA model to extract k system topics from the screened software system, where k is a user-defined value that must be set in advance. In this embodiment, let the extracted system topics be t1, t2, t3, ..., tk; the topic words of topic tx are tx0, tx1, tx2, ..., and the class containing topic word txy is cxy, where 1 ≤ x ≤ k and y ≥ 0.
(2) For each system topic, compute the ratio of its topic-word count to the total word count of the software system's documents, denoted P; the P value can be computed and provided by the LDA model. If P = 1, assign the class containing the topic word to the class-level initial cluster matching the corresponding system topic. If P < 1, sort the ratios in descending order, select the classes containing the topic words whose ratios rank in the top M, and assign those classes to the class-level initial cluster matching the corresponding system topic, where M is a user-defined value. As shown in Table 1, the P value of topic word t10 of system topic t1 is 1, so the class c10 containing t10 is assigned directly to the class-level initial cluster matching t1. If M = 2 and the topic words of t1 with the two highest P values are t11 and t12, then classes c11 and c12 are also assigned to the class-level initial cluster matching t1.
(3) Analyze, one by one, the relation between each unassigned class in the software system and the assigned classes. Here an unassigned class is a class containing a topic word with P < 1 whose P value does not rank in the top M. If an unassigned class has a dependency relation with an assigned class, assign the unassigned class to the initial cluster containing that assigned class; continue until all unassigned classes have been assigned to initial clusters, which yields the class-level clustering result of the software system based on system topics. Suppose the unassigned classes are c0, c1, c2, c3, c4, c5, c6 and c7. If c12 has a calling or called relation with classes c0 and c5, c21 has one with classes c1, c3 and c4, and ck1 has one with classes c2, c6 and c7, then c0 and c5 are assigned to the class-level initial cluster containing c12, c1, c3 and c4 to the one containing c21, and c2, c6 and c7 to the one containing ck1, as shown in Table 2.
After these three steps, the class-level initial clustering of the software program is complete, and the function points of the system can be understood. If it is necessary to understand further how each function point is implemented by different source code, proceed to the next step; otherwise, clustering can stop here.
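The seeding rule of step (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `seed_clusters` and the input layout are hypothetical, the ratios are taken as already computed, and because Table 1 is not reproduced here the values below 1 are made up for the example.

```python
from collections import defaultdict

def seed_clusters(topic_words, m):
    """Build class-level initial clusters from per-topic-word ratios.

    topic_words maps a system topic to a list of
    (topic_word, ratio_P, owning_class) tuples (hypothetical layout).
    """
    clusters = defaultdict(set)
    for topic, entries in topic_words.items():
        # P == 1: the owning class goes straight into the topic's cluster.
        for _word, ratio, owner in entries:
            if ratio == 1:
                clusters[topic].add(owner)
        # P < 1: rank in descending order and keep the top m.
        below = sorted((e for e in entries if e[1] < 1),
                       key=lambda e: e[1], reverse=True)
        for _word, _ratio, owner in below[:m]:
            clusters[topic].add(owner)
    return dict(clusters)

# Example following the text: t10 has P = 1; with M = 2, t11 and t12 rank
# highest among the rest. The ratios 0.8, 0.6, 0.1 are invented.
topic_words = {
    "t1": [("t10", 1.0, "c10"), ("t11", 0.8, "c11"),
           ("t12", 0.6, "c12"), ("t13", 0.1, "c13")],
}
clusters = seed_clusters(topic_words, m=2)
```

Classes left out at this point (c13 in the example) are the unassigned classes that step (3) then places by dependency. The same rule applies at the method level in step (5), with Q and N in place of P and M.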
Fig. 2 shows the flowchart of the method-level initial clustering of the present invention. The steps are as follows:
(4) Based on the clustering result of step (3), use the LDA model to extract j class topics from a class. The value of j must be set in advance and can be adjusted as needed. Taking class c10 of system topic t1 as an example, let the class topics extracted from c10 be m1, m2, m3, ..., mj; the topic words of class topic mn are mn0, mn1, mn2, ..., and the method containing topic word mnr is gnr, where 1 ≤ n ≤ j and r ≥ 0.
(5) For each class topic, compute the ratio of its topic-word count to the total word count of the class's document, denoted Q. If Q equals 1, assign the method containing the topic word to the method-level initial cluster matching the corresponding class topic. If Q is less than 1, sort the ratios in descending order, select the methods containing the topic words whose ratios rank in the top N, and assign those methods to the method-level initial cluster matching the corresponding class topic, where N is a user-defined value. As shown in Table 3, the Q value of topic word m11 of class topic m1 is 1, so the method g11 containing m11 is assigned directly to the method-level initial cluster matching m1. If N = 2 and the topic words of m1 with the two highest Q values are m10 and m15, then methods g10 and g15 are also assigned to the method-level initial cluster matching m1.
(6) Analyze, one by one, the relation between each unassigned method in the class and the assigned methods. Here an unassigned method is a method containing a topic word with Q < 1 whose Q value does not rank in the top N. If an unassigned method has a dependency relation with an assigned method, assign the unassigned method to the initial cluster containing that assigned method; continue until all unassigned methods have been assigned to initial clusters, which yields the method-level clustering result of the class based on class topics. Suppose the unassigned methods are g0, g1, g2, g3 and g4. If the assigned method g10 has a calling or called relation with methods g0 and g3, g11 has one with method g1, and g26 has one with methods g2 and g4, then g0 and g3 are assigned to the method-level initial cluster containing g10, g1 to the one containing g11, and g2 and g4 to the one containing g26, as shown in Table 4.
This completes the method-level initial clustering. Developers can use the method-level clustering result to understand quickly the source code behind each function point.
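The dependency-based assignment used in steps (3) and (6) follows one pattern, sketched here with the method-level example. This is an illustrative sketch only: the name `assign_by_dependency` and the data layout are hypothetical, calling and called relations are modelled as unordered pairs, and placing g26 in the cluster of class topic m2 is inferred from the gnr naming convention. The text does not say what happens when an item relates to several clusters or to none; the sketch assumes the first matching cluster wins and returns unplaceable items rather than looping forever.

```python
def assign_by_dependency(clusters, unassigned, related):
    """Attach unassigned items to the cluster of an item they depend on.

    clusters:   topic -> set of already assigned items
    unassigned: list of items still to be placed
    related:    set of frozenset pairs, one per calling/called relation
    Returns (new_clusters, leftover_items).
    """
    clusters = {topic: set(members) for topic, members in clusters.items()}
    pending = list(unassigned)
    progress = True
    # Repeat passes so items can chain onto newly assigned ones.
    while pending and progress:
        progress = False
        for item in list(pending):
            for members in clusters.values():
                if any(frozenset((item, other)) in related for other in members):
                    members.add(item)  # first matching cluster wins (assumption)
                    pending.remove(item)
                    progress = True
                    break
    return clusters, pending

# Example from step (6): g10-{g0, g3}, g11-{g1}, g26-{g2, g4}.
seeded = {"m1": {"g10", "g11", "g15"}, "m2": {"g26"}}
related = {frozenset(pair) for pair in [
    ("g10", "g0"), ("g10", "g3"), ("g11", "g1"), ("g26", "g2"), ("g26", "g4"),
]}
result, leftover = assign_by_dependency(
    seeded, ["g0", "g1", "g2", "g3", "g4"], related)
```

Passing class names and class-level call relations instead reproduces step (3) with the same function.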
Fig. 3 is a schematic diagram of the software system structure as understood through topics: the software system is linked to its classes through system topics, and each class is linked to its methods through class topics.
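One way to hold the structure of Fig. 3 in memory is a three-level nesting of system topics, classes and class topics. The sketch below is only an assumed representation; the type names `SystemTopic`, `ClusteredClass` and `ClassTopic` and the helper `find_methods` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ClassTopic:
    name: str
    methods: List[str] = field(default_factory=list)  # method-level cluster

@dataclass
class ClusteredClass:
    name: str
    class_topics: List[ClassTopic] = field(default_factory=list)

@dataclass
class SystemTopic:
    name: str
    classes: List[ClusteredClass] = field(default_factory=list)  # class-level cluster

def find_methods(system, topic_name, class_name, class_topic_name):
    """Walk system topic -> class -> class topic and return its methods."""
    for topic in system:
        if topic.name != topic_name:
            continue
        for cls in topic.classes:
            if cls.name != class_name:
                continue
            for class_topic in cls.class_topics:
                if class_topic.name == class_topic_name:
                    return list(class_topic.methods)
    return []  # nothing matched

# Names reuse the examples in the description.
system = [
    SystemTopic("t1", [
        ClusteredClass("c10", [
            ClassTopic("m1", ["g10", "g11", "g15"]),
            ClassTopic("m2", ["g26"]),
        ]),
        ClusteredClass("c11"),
        ClusteredClass("c12"),
    ]),
]
methods = find_methods(system, "t1", "c10", "m1")
```

A developer looking for the code behind a function point would start from a system topic, narrow to a class, and then to a class topic, which mirrors the top-down reading order the description aims for.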
The present invention is not limited to the above embodiment. On the basis of the technical solution disclosed herein, those skilled in the art can, from the disclosed technical content and without creative effort, make substitutions and variations to some of its technical features; all such substitutions and variations fall within the protection scope of the present invention.
Claims (1)
1. A multi-granularity layer software clustering method based on the LDA model, characterized by comprising the following steps:
(1) selecting class names, method names and comments from the software system to be clustered as the screening objects and screening said software system; using the LDA model to extract k system topics from the screened software system, where k is a user-defined value;
(2) for each said system topic, computing the ratio of its topic-word count to the total word count of said software system's documents; if said ratio equals 1, assigning the class containing the corresponding topic word to the class-level initial cluster matching the corresponding system topic; if said ratio is less than 1, sorting the ratios in descending order, selecting the classes containing the topic words whose ratios rank in the top M, and assigning said classes to the class-level initial cluster matching the corresponding system topic, where M is a user-defined value;
(3) analyzing, one by one, the relation between each unassigned class in said software system and the assigned classes; if an unassigned class has a dependency relation with an assigned class, assigning the unassigned class to the initial cluster containing said assigned class; continuing until all unassigned classes have been assigned to initial clusters, thereby obtaining the class-level clustering result of said software system based on system topics;
(4) based on the clustering result of step (3), using the LDA model to extract j class topics from a class, where j is a user-defined value;
(5) for each said class topic, computing the ratio of its topic-word count to the total word count of said class's document; if said ratio equals 1, assigning the method containing the corresponding topic word to the method-level initial cluster matching the corresponding class topic; if said ratio is less than 1, sorting the ratios in descending order, selecting the methods containing the topic words whose ratios rank in the top N, and assigning said methods to the method-level initial cluster matching the corresponding class topic, where N is a user-defined value;
(6) analyzing, one by one, the relation between each unassigned method in said class and the assigned methods; if an unassigned method has a dependency relation with an assigned method, assigning the unassigned method to the initial cluster containing said assigned method; continuing until all unassigned methods have been assigned to initial clusters, thereby obtaining the method-level clustering result of said class based on class topics.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410028677.6A CN103729197B (en) | 2014-01-22 | 2014-01-22 | Multi-granularity layer software clustering method based on LDA (latent dirichlet allocation) model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410028677.6A CN103729197B (en) | 2014-01-22 | 2014-01-22 | Multi-granularity layer software clustering method based on LDA (latent dirichlet allocation) model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103729197A (en) | 2014-04-16 |
CN103729197B (en) | 2017-01-18 |
Family
ID=50453283
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410028677.6A, Active, CN103729197B (en) | 2014-01-22 | 2014-01-22 | Multi-granularity layer software clustering method based on LDA (latent dirichlet allocation) model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103729197B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104090775A (en) * | 2014-07-24 | 2014-10-08 | 扬州大学 | Software evolution modeling method based on dynamic topic model |
CN104572111A (en) * | 2015-01-20 | 2015-04-29 | 扬州大学 | Program understanding and characteristic locating method based on correlated topic model |
CN104850311A (en) * | 2015-05-26 | 2015-08-19 | 中山大学 | Generation method and system of graphical descriptions of version updates of mobile applications |
CN109165155A (en) * | 2018-06-20 | 2019-01-08 | 扬州大学 | A kind of software defect recovery template extracting method based on clustering |
CN109992271A (en) * | 2019-03-31 | 2019-07-09 | 东南大学 | Layered architecture recognition method based on code vocabulary and structure dependence |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9009134B2 (en) * | 2010-03-16 | 2015-04-14 | Microsoft Technology Licensing, Llc | Named entity recognition in query |
CN102902700B (en) * | 2012-04-05 | 2015-02-25 | 中国人民解放军国防科学技术大学 | Online-increment evolution topic model based automatic software classifying method |
CN102760149B (en) * | 2012-04-05 | 2015-02-25 | 中国人民解放军国防科学技术大学 | Automatic annotating method for subjects of open source software |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104090775A (en) * | 2014-07-24 | 2014-10-08 | 扬州大学 | Software evolution modeling method based on dynamic topic model |
CN104090775B (en) * | 2014-07-24 | 2017-05-03 | 扬州大学 | Software evolution modeling method based on dynamic topic model |
CN104572111A (en) * | 2015-01-20 | 2015-04-29 | 扬州大学 | Program understanding and characteristic locating method based on correlated topic model |
CN104572111B (en) * | 2015-01-20 | 2017-12-01 | 扬州大学 | A kind of program comprehension and characteristic positioning method based on related subject model |
CN104850311A (en) * | 2015-05-26 | 2015-08-19 | 中山大学 | Generation method and system of graphical descriptions of version updates of mobile applications |
CN104850311B (en) * | 2015-05-26 | 2018-05-01 | 中山大学 | Graphical the explanation generation method and system of a kind of mobile application version updating |
CN109165155A (en) * | 2018-06-20 | 2019-01-08 | 扬州大学 | A kind of software defect recovery template extracting method based on clustering |
CN109165155B (en) * | 2018-06-20 | 2021-06-22 | 扬州大学 | Software defect repairing template extraction method based on cluster analysis |
CN109992271A (en) * | 2019-03-31 | 2019-07-09 | 东南大学 | Layered architecture recognition method based on code vocabulary and structure dependence |
CN109992271B (en) * | 2019-03-31 | 2022-05-13 | 东南大学 | Layered architecture recognition method based on code vocabulary and structure dependence |
Also Published As
Publication number | Publication date |
---|---|
CN103729197B (en) | 2017-01-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107609052B (en) | A kind of generation method and device of the domain knowledge map based on semantic triangle | |
CN104008166A (en) | Dialogue short text clustering method based on form and semantic similarity | |
CN101963995B (en) | Image marking method based on characteristic scene | |
CN103729197A (en) | Multi-granularity layer software clustering method based on LDA (latent dirichlet allocation) model | |
CN105893551A (en) | Method and device for processing data and knowledge graph | |
CN105468371B (en) | A kind of business process map merging method based on Subject Clustering | |
CN104199972A (en) | Named entity relation extraction and construction method based on deep learning | |
CN111191012B (en) | Knowledge graph generation device and method and computer readable storage medium thereof | |
CN105469789A (en) | Voice information processing method and voice information processing terminal | |
CN103778200A (en) | Method for extracting information source of message and system thereof | |
CN110209809B (en) | Text clustering method and device, storage medium and electronic device | |
CN103678714B (en) | Construction method and device for entity knowledge base | |
CN102650995A (en) | Multi-dimensional data analyzing model generating system and method | |
CN104866308A (en) | Scenario image generation method and apparatus | |
CN104298683A (en) | Theme digging method and equipment and query expansion method and equipment | |
CN105183742A (en) | Resume identification method | |
CN104978332A (en) | UGC label data generating method, UGC label data generating device, relevant method and relevant device | |
NZ757969A (en) | Quantifying robustness by analyzing a property graph data model | |
CN107480137A (en) | With semantic iterative extraction network accident and the method that identifies extension event relation | |
Butler et al. | Sax discretization does not guarantee equiprobable symbols | |
CN110442730A (en) | A kind of knowledge mapping construction method based on deepdive | |
CN103164393A (en) | Method and system of report formula processing | |
CN105159927A (en) | Method and device for selecting subject term of target text and terminal | |
CN106547765A (en) | Data base management method and device based on SQL | |
CN111914859A (en) | Service multiplexing method, computing device and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
TR01 | Transfer of patent right |
Effective date of registration: 2023-09-20; Address after: 523000 building 18, 780 Xie Cao Road, Xiegang Town, Dongguan City, Guangdong Province; Patentee after: Dongguan aipeike Technology Co.,Ltd.; Address before: 225009 No. 88, South University Road, Jiangsu, Yangzhou; Patentee before: YANGZHOU University |