CN103729197A - Multi-granularity layer software clustering method based on LDA (latent dirichlet allocation) model - Google Patents

Multi-granularity layer software clustering method based on LDA (latent Dirichlet allocation) model

Info

Publication number
CN103729197A
Authority
CN
China
Prior art keywords
class
theme
software
ratio
unappropriated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410028677.6A
Other languages
Chinese (zh)
Other versions
CN103729197B (en)
Inventor
孙小兵
刘湘月
李斌
杨智松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongguan Aipeike Technology Co ltd
Original Assignee
Yangzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangzhou University
Priority to CN201410028677.6A
Publication of CN103729197A
Application granted
Publication of CN103729197B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a multi-granularity-level software clustering method based on the LDA (latent Dirichlet allocation) model, in the technical field of software engineering. It addresses the problem that prior-art software clustering techniques ignore the functional features of software, so that developers cannot quickly understand a software system from the clustering result. The LDA model is used to extract topics at two levels, the class level and the method level, so that clustering proceeds from a coarse-grained level to a fine-grained level. This gives developers a system structure that is easier to understand and makes the clustering result more effective and more practical. With the method, developers can clearly identify the function points of a software program and quickly find the source code that implements a required function. The method is applied to program comprehension during software maintenance and evolution, and offers developers a step-by-step understanding process from the system down to its methods. It is characterized by good clustering performance, strong practicability and high working efficiency.

Description

A multi-granularity-level software clustering method based on the LDA model
Technical field
The present invention relates to a clustering method, in particular to a multi-granularity-level software clustering method based on the LDA model, and belongs to the technical field of software engineering.
Background
To meet users' constantly changing requirements, software products generally need continual updating and maintenance. To carry out a maintenance request, a developer must first understand the whole software system, and in particular its program code. As software systems evolve, however, their scale grows and their complexity inevitably rises; program comprehension typically accounts for about 60% of the time spent on software maintenance. To support this work, software clustering techniques have been proposed. Their purpose is to extract from the source code smaller, more cohesive and more understandable subsystems, together with the relations among them, so as to improve the efficiency with which people understand, analyze and re-engineer legacy systems.
In the prior art, most software clustering techniques rely on static structural dependencies between program elements. Comprehension-based clustering methods have also been proposed, in which code matching the same pattern is grouped into one cluster and the result is given a meaningful name. Both approaches ignore the functional features of the system. The goal of program comprehension, however, is to understand the function points of a system and how each function point is implemented by different pieces of source code, so neither approach helps developers understand a program quickly and efficiently.
In a software system, requests for upgrades and maintenance changes are commonly called features or topics. A feature or topic represents a function, defined according to the requirements and acceptance criteria of developers and users. If features or topics are available from the very beginning of software clustering, developers can effectively obtain an overview of the system.
The LDA (latent Dirichlet allocation) model is currently the most representative and most popular probabilistic topic model. It has been applied widely and successfully in text mining, knowledge discovery, topic tracking, multi-document summarization and other fields. LDA is an unsupervised machine learning technique that can identify the topic information hidden in large document collections or corpora, and it offers high flexibility and a high degree of automation. From a data set, the LDA model can mine a specified number of latent topics, effectively uncovering the implicit links between pieces of semantic information. A text can then be represented by these topics, which achieves feature dimensionality reduction.
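For reference, the generative process that LDA assumes is usually written as follows. This is the standard textbook formulation rather than anything specific to this patent; the symbols K (number of topics), D (number of documents), α, β, θ, φ, z and w are the conventional ones and do not appear in the original text.

```latex
\begin{aligned}
\varphi_k &\sim \operatorname{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \operatorname{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \operatorname{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \varphi &\sim \operatorname{Categorical}(\varphi_{z_{d,n}})
\end{aligned}
```

Here θ_d is the topic mixture of document d, φ_k is the word distribution of topic k, z_{d,n} is the topic assigned to the n-th word of document d, and w_{d,n} is that word. Representing a document by θ_d instead of by its raw word counts is the dimensionality reduction mentioned above.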
Summary of the invention
The object of the invention is to provide a multi-granularity-level software clustering method based on the LDA model. It aims to solve the technical problem that prior-art software clustering techniques ignore the functional features of software, so that developers cannot quickly understand a software system from the clustering result.
This object is achieved by a multi-granularity-level software clustering method based on the LDA model, comprising the following steps:
(1) Select class names, method names and comments from the software system to be clustered as the objects of screening, screen the software system accordingly, and use the LDA model to extract k system topics from the screened software system, where k is a user-defined value;
(2) For each system topic, calculate the ratio of the number of its topic words to the total number of words in the software system's documents. If the ratio equals 1, assign the class containing the corresponding topic word to the class-level initial cluster that matches the corresponding system topic. If the ratio is less than 1, sort the ratios in descending order, take the classes containing the topic words whose ratios rank in the top M, and assign those classes to the class-level initial cluster that matches the corresponding system topic, where M is a user-defined value;
(3) Analyze one by one the relations between the unassigned classes and the assigned classes in the software system. If an unassigned class has a dependency relation with an assigned class, assign the unassigned class to the initial cluster containing that assigned class. Continue until all unassigned classes have been assigned to initial clusters, which yields the class-level clustering result of the software system based on system topics;
(4) On the basis of the clustering result of step (3), use the LDA model to extract j class topics from a class, where j is a user-defined value;
(5) For each class topic, calculate the ratio of the number of its topic words to the total number of words in the class's document. If the ratio equals 1, assign the method containing the corresponding topic word to the method-level initial cluster that matches the corresponding class topic. If the ratio is less than 1, sort the ratios in descending order, take the methods containing the topic words whose ratios rank in the top N, and assign those methods to the method-level initial cluster that matches the corresponding class topic, where N is a user-defined value;
(6) Analyze one by one the relations between the unassigned methods and the assigned methods in the class. If an unassigned method has a dependency relation with an assigned method, assign the unassigned method to the initial cluster containing that assigned method. Continue until all unassigned methods have been assigned to initial clusters, which yields the method-level clustering result of the class based on class topics.
The beneficial effects of the invention are as follows. By using the LDA model to extract topics at two different levels, classes and methods, the invention clusters from a coarse-grained level down to a fine-grained level, giving developers a system structure that is easier to understand and making the clustering result more effective and more practical. Because the clustering results at both the class level and the method level expose the functional features of the software system, developers can clearly identify the function points of the program from the clustering result and quickly locate the required source code from the method-level result. The invention realizes a top-down, stepwise-refinement software clustering process that better matches how developers actually come to understand software, helping them understand the whole system simply, progressively and quickly. Applied to program comprehension during software maintenance and evolution, the method gives developers a step-by-step understanding process from the system down to its methods, and is characterized by good clustering performance, strong practicability and high working efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of the class-level initial clustering of the invention.
Fig. 2 is a flowchart of the method-level initial clustering of the invention.
Fig. 3 is a schematic diagram of the software system structure as understood through topics.
Detailed description of the embodiments
Fig. 1 shows the flowchart of the class-level initial clustering of the invention, which comprises the following steps:
(1) Select class names, method names and comments from the software system to be clustered as the objects of screening, screen the software system accordingly, and use the LDA model to extract k system topics from the screened software system. The value k is user-defined and must be set in advance. In this embodiment, let the extracted system topics be t1, t2, t3, ..., tk; let the topic words of topic tx be tx0, tx1, tx2, ...; and let the class containing topic word txy be cxy, where 1 ≤ x ≤ k and y ≥ 0.
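The screening in step (1) could be prepared for LDA along the following lines. This is a minimal sketch, not the patent's implementation: the function names `split_identifier` and `screen`, the stop-word list, and the camelCase/snake_case splitting rules are all assumptions made for illustration. The resulting word lists would then be handed to any LDA implementation with the number of topics set to k.

```python
import re

# Hypothetical stop-word list; a real one would be tuned to the project.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "from", "is", "for", "in"}

def split_identifier(text):
    """Split camelCase, snake_case or free text into lowercase words."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)        # loadOrder -> load Order
    spaced = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", spaced)   # XMLParser -> XML Parser
    return [w.lower() for w in re.split(r"[^A-Za-z]+", spaced) if w]

def screen(class_name, method_names, comments):
    """Build the word list for one class from its name, method names and comments."""
    words = []
    for text in [class_name, *method_names, *comments]:
        words.extend(split_identifier(text))
    # Drop stop words and single letters, which carry no topical meaning.
    return [w for w in words if w not in STOP_WORDS and len(w) > 1]

document = screen(
    "OrderService",
    ["loadOrder", "saveOrder"],
    ["Loads the order from the database"],
)
print(document)
```

Keeping only names and comments reflects the idea in step (1) that these are the parts of the source code where developers record what the code is for.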
(2) For each system topic, calculate the ratio of the number of its topic words to the total number of words in the software system's documents, denoted P. The value of P can be computed and supplied by the LDA model. If P = 1, assign the class containing the topic word to the class-level initial cluster that matches the corresponding system topic. If P < 1, sort the ratios in descending order, take the classes containing the topic words whose ratios rank in the top M, and assign those classes to the class-level initial cluster that matches the corresponding system topic, where M is a user-defined value. As shown in Table 1, the P value of topic word t10 of system topic t1 is 1, so the class c10 containing t10 is assigned directly to the class-level initial cluster matching t1. If M = 2 and the two topic words of t1 with the highest P values are t11 and t12, then classes c11 and c12 are also assigned to the class-level initial cluster matching t1.
[Table 1: provided as an image in the original publication; not reproduced here]
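The seeding rule of step (2) can be sketched as below. This is an illustration under stated assumptions: the function name `seed_clusters` and its data layout are hypothetical, and the P values in the example are made up, since Table 1 is not available in this text. Only the rule itself (take every topic word with P = 1, then the top M of the rest) comes from the description.

```python
def seed_clusters(topic_word_ratio, word_owner, top_m):
    """Return {topic: [classes]} holding the initial members of each cluster.

    topic_word_ratio: {topic: {topic_word: P}}
    word_owner:       {topic_word: class containing that word}
    top_m:            the user-defined value M
    """
    clusters = {}
    for topic, ratios in topic_word_ratio.items():
        # Topic words with P = 1 are always selected.
        certain = [w for w, p in ratios.items() if p >= 1.0]
        # The remaining words are ranked by P, largest first.
        ranked = sorted((w for w, p in ratios.items() if p < 1.0),
                        key=lambda w: ratios[w], reverse=True)
        members = []
        for word in certain + ranked[:top_m]:
            owner = word_owner[word]
            if owner not in members:      # a class may own several topic words
                members.append(owner)
        clusters[topic] = members
    return clusters

# Example mirroring the text: t10 has P = 1; t11 and t12 rank highest among the rest.
ratios = {"t1": {"t10": 1.0, "t11": 0.4, "t12": 0.3, "t13": 0.05}}
owners = {"t10": "c10", "t11": "c11", "t12": "c12", "t13": "c13"}
print(seed_clusters(ratios, owners, top_m=2))
```

In this example c13 stays unassigned and is left for step (3). The description does not say what happens when one class is selected for two different topics, so the sketch does not try to resolve that case.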
(3) Analyze one by one the relations between the unassigned classes and the assigned classes in the software system. Here an unassigned class is a class containing a topic word for which P < 1 and whose P value does not rank in the top M. If an unassigned class has a dependency relation with an assigned class, assign it to the initial cluster containing that assigned class. Continue until all unassigned classes have been assigned to initial clusters, which yields the class-level clustering result of the software system based on system topics. Suppose the unassigned classes are c0, c1, c2, c3, c4, c5, c6 and c7. If c12 calls or is called by classes c0 and c5, c21 calls or is called by classes c1, c3 and c4, and ck1 calls or is called by classes c2, c6 and c7, then c0 and c5 are assigned to the class-level initial cluster containing c12; c1, c3 and c4 are assigned to the class-level initial cluster containing c21; and c2, c6 and c7 are assigned to the class-level initial cluster containing ck1, as shown in Table 2.
[Table 2: provided as an image in the original publication; not reproduced here]
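The dependency-based assignment of step (3) can be sketched as follows. The function name `propagate`, the representation of dependencies as caller/callee pairs, and the tie-breaking rule are assumptions; the description only states that an unassigned class joins the cluster of an assigned class it calls or is called by, repeated until every class is placed.

```python
from collections import defaultdict

def propagate(clusters, calls, unassigned):
    """Attach unassigned items to the cluster of an assigned item they depend on.

    clusters:   {topic: [already assigned items]}
    calls:      iterable of (caller, callee) pairs; direction is ignored
    unassigned: items still to be placed
    Returns (clusters after propagation, items that could not be placed).
    """
    neighbours = defaultdict(set)
    for caller, callee in calls:
        neighbours[caller].add(callee)
        neighbours[callee].add(caller)

    owner = {item: topic for topic, members in clusters.items() for item in members}
    result = {topic: list(members) for topic, members in clusters.items()}
    pending = list(unassigned)

    changed = True
    while pending and changed:            # repeat so chains of dependencies resolve
        changed = False
        for item in list(pending):
            linked = sorted(n for n in neighbours[item] if n in owner)
            if linked:
                # Assumption: when several clusters qualify, the alphabetically
                # first linked item decides. The description gives no rule here.
                topic = owner[linked[0]]
                owner[item] = topic
                result[topic].append(item)
                pending.remove(item)
                changed = True
    return result, pending

# Example from the text.
seeds = {"t1": ["c10", "c11", "c12"], "t2": ["c21"], "tk": ["ck1"]}
calls = [("c12", "c0"), ("c5", "c12"), ("c21", "c1"), ("c3", "c21"),
         ("c21", "c4"), ("ck1", "c2"), ("c6", "ck1"), ("ck1", "c7")]
final, left_over = propagate(seeds, calls, [f"c{i}" for i in range(8)])
print(final, left_over)
```

The second return value makes visible any class with no dependency path to an assigned class, a case the description does not discuss. The same routine applies unchanged to methods in step (6).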
After these three steps, the class-level initial clustering of the software program is complete, and the function points of the system can be understood. If a further understanding of how each function point is implemented by different source code is needed, proceed to the next step; otherwise clustering can stop here.
Fig. 2 shows the flowchart of the method-level initial clustering of the invention, with the following steps:
(4) On the basis of the clustering result of step (3), use the LDA model to extract j class topics from a class. The value j must be set in advance and can be adjusted as needed. Taking class c10 of system topic t1 as an example, let the class topics extracted from c10 be m1, m2, m3, ..., mj; let the topic words of class topic mn be mn0, mn1, mn2, ...; and let the method containing topic word mnr be gnr, where 1 ≤ n ≤ j and r ≥ 0.
(5) For each class topic, calculate the ratio of the number of its topic words to the total number of words in the class's document, denoted Q. If Q = 1, assign the method containing the topic word to the method-level initial cluster that matches the corresponding class topic. If Q < 1, sort the ratios in descending order, take the methods containing the topic words whose ratios rank in the top N, and assign those methods to the method-level initial cluster that matches the corresponding class topic, where N is a user-defined value. As shown in Table 3, the Q value of topic word m11 of class topic m1 is 1, so the method g11 containing m11 is assigned directly to the method-level initial cluster matching m1. If N = 2 and the two topic words of m1 with the highest Q values are m10 and m15, then methods g10 and g15 are also assigned to the method-level initial cluster matching m1.
[Table 3: provided as an image in the original publication; not reproduced here]
(6) Analyze one by one the relations between the unassigned methods and the assigned methods in the class. Here an unassigned method is a method containing a topic word for which Q < 1 and whose Q value does not rank in the top N. If an unassigned method has a dependency relation with an assigned method, assign it to the initial cluster containing that assigned method. Continue until all unassigned methods have been assigned to initial clusters, which yields the method-level clustering result of the class based on class topics. Suppose the unassigned methods are g0, g1, g2, g3 and g4. If the assigned method g10 calls or is called by methods g0 and g3, g11 calls or is called by method g1, and g26 calls or is called by methods g2 and g4, then g0 and g3 are assigned to the method-level initial cluster containing g10; g1 is assigned to the method-level initial cluster containing g11; and g2 and g4 are assigned to the method-level initial cluster containing g26, as shown in Table 4.
[Table 4: provided as an image in the original publication; not reproduced here]
This completes the method-level initial clustering. From the method-level clustering result, developers can quickly understand the source code of each function point.
Fig. 3 is a schematic diagram of the software system structure as understood through topics: the software system is linked to its classes through system topics, and each class is linked to its methods through class topics.
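The two-level structure described for Fig. 3 could be held in a small data model such as the one below. The names `ClassNode`, `SystemModel` and `outline` are hypothetical, and the example data simply reuses identifiers from the embodiment (t1, c10, m1, g11, g10, g15); it is a sketch of one possible representation, not the figure itself.

```python
from dataclasses import dataclass, field

@dataclass
class ClassNode:
    name: str
    # class topic -> methods clustered under that topic
    method_clusters: dict = field(default_factory=dict)

@dataclass
class SystemModel:
    # system topic -> classes clustered under that topic
    class_clusters: dict = field(default_factory=dict)

def outline(model):
    """Render the system as an indented outline, coarse level first."""
    lines = []
    for system_topic, classes in model.class_clusters.items():
        lines.append(system_topic)
        for node in classes:
            lines.append("  " + node.name)
            for class_topic, methods in node.method_clusters.items():
                lines.append("    " + class_topic)
                lines.extend("      " + method for method in methods)
    return lines

model = SystemModel({"t1": [ClassNode("c10", {"m1": ["g11", "g10", "g15"]})]})
print("\n".join(outline(model)))
```

Reading the outline from the top follows the stepwise understanding process the description aims for: system topic, then class, then class topic, then method.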
The invention is not limited to the embodiment described above. On the basis of the disclosed technical solution, a person skilled in the art can, without creative effort, make substitutions and modifications to some of its technical features according to the disclosed technical content; all such substitutions and modifications fall within the scope of protection of the invention.

Claims (1)

1. A multi-granularity-level software clustering method based on the LDA model, characterized in that it comprises the following steps:
(1) selecting class names, method names and comments from the software system to be clustered as the objects of screening, screening said software system accordingly, and using the LDA model to extract k system topics from the screened software system, where k is a user-defined value;
(2) for each said system topic, calculating the ratio of the number of its topic words to the total number of words in the documents of said software system; if said ratio equals 1, assigning the class containing the corresponding topic word to the class-level initial cluster that matches the corresponding system topic; if said ratio is less than 1, sorting the ratios in descending order, taking the classes containing the topic words whose ratios rank in the top M, and assigning said classes to the class-level initial cluster that matches the corresponding system topic, where M is a user-defined value;
(3) analyzing one by one the relations between the unassigned classes and the assigned classes in said software system; if an unassigned class has a dependency relation with an assigned class, assigning the unassigned class to the initial cluster containing said assigned class, until all unassigned classes have been assigned to initial clusters, thereby obtaining the class-level clustering result of said software system based on system topics;
(4) on the basis of the clustering result of step (3), using the LDA model to extract j class topics from a class, where j is a user-defined value;
(5) for each said class topic, calculating the ratio of the number of its topic words to the total number of words in the document of said class; if said ratio equals 1, assigning the method containing the corresponding topic word to the method-level initial cluster that matches the corresponding class topic; if said ratio is less than 1, sorting the ratios in descending order, taking the methods containing the topic words whose ratios rank in the top N, and assigning said methods to the method-level initial cluster that matches the corresponding class topic, where N is a user-defined value;
(6) analyzing one by one the relations between the unassigned methods and the assigned methods in said class; if an unassigned method has a dependency relation with an assigned method, assigning the unassigned method to the initial cluster containing said assigned method, until all unassigned methods have been assigned to initial clusters, thereby obtaining the method-level clustering result of said class based on class topics.
CN201410028677.6A 2014-01-22 2014-01-22 Multi-granularity layer software clustering method based on LDA (latent dirichlet allocation) model Active CN103729197B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410028677.6A CN103729197B (en) 2014-01-22 2014-01-22 Multi-granularity layer software clustering method based on LDA (latent dirichlet allocation) model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410028677.6A CN103729197B (en) 2014-01-22 2014-01-22 Multi-granularity layer software clustering method based on LDA (latent dirichlet allocation) model

Publications (2)

Publication Number Publication Date
CN103729197A true CN103729197A (en) 2014-04-16
CN103729197B CN103729197B (en) 2017-01-18

Family

ID=50453283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410028677.6A Active CN103729197B (en) 2014-01-22 2014-01-22 Multi-granularity layer software clustering method based on LDA (latent dirichlet allocation) model

Country Status (1)

Country Link
CN (1) CN103729197B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090775A (en) * 2014-07-24 2014-10-08 扬州大学 Software evolution modeling method based on dynamic topic model
CN104572111A (en) * 2015-01-20 2015-04-29 扬州大学 Program understanding and characteristic locating method based on correlated topic model
CN104850311A (en) * 2015-05-26 2015-08-19 中山大学 Generation method and system of graphical descriptions of version updates of mobile applications
CN109165155A (en) * 2018-06-20 2019-01-08 扬州大学 A kind of software defect recovery template extracting method based on clustering
CN109992271A (en) * 2019-03-31 2019-07-09 东南大学 Layered architecture recognition method based on code vocabulary and structure dependence

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9009134B2 (en) * 2010-03-16 2015-04-14 Microsoft Technology Licensing, Llc Named entity recognition in query
CN102902700B (en) * 2012-04-05 2015-02-25 中国人民解放军国防科学技术大学 Online-increment evolution topic model based automatic software classifying method
CN102760149B (en) * 2012-04-05 2015-02-25 中国人民解放军国防科学技术大学 Automatic annotating method for subjects of open source software

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104090775A (en) * 2014-07-24 2014-10-08 扬州大学 Software evolution modeling method based on dynamic topic model
CN104090775B (en) * 2014-07-24 2017-05-03 扬州大学 Software evolution modeling method based on dynamic topic model
CN104572111A (en) * 2015-01-20 2015-04-29 扬州大学 Program understanding and characteristic locating method based on correlated topic model
CN104572111B (en) * 2015-01-20 2017-12-01 扬州大学 A kind of program comprehension and characteristic positioning method based on related subject model
CN104850311A (en) * 2015-05-26 2015-08-19 中山大学 Generation method and system of graphical descriptions of version updates of mobile applications
CN104850311B (en) * 2015-05-26 2018-05-01 中山大学 Graphical the explanation generation method and system of a kind of mobile application version updating
CN109165155A (en) * 2018-06-20 2019-01-08 扬州大学 A kind of software defect recovery template extracting method based on clustering
CN109165155B (en) * 2018-06-20 2021-06-22 扬州大学 Software defect repairing template extraction method based on cluster analysis
CN109992271A (en) * 2019-03-31 2019-07-09 东南大学 Layered architecture recognition method based on code vocabulary and structure dependence
CN109992271B (en) * 2019-03-31 2022-05-13 东南大学 Layered architecture recognition method based on code vocabulary and structure dependence

Also Published As

Publication number Publication date
CN103729197B (en) 2017-01-18

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230920

Address after: 523000 building 18, 780 Xie Cao Road, Xiegang Town, Dongguan City, Guangdong Province

Patentee after: Dongguan aipeike Technology Co.,Ltd.

Address before: 225009 No. 88, South University Road, Jiangsu, Yangzhou

Patentee before: YANGZHOU University