CN113159196B - Software demand clustering method and system based on regular variation embedding - Google Patents

Software demand clustering method and system based on regular variation embedding

Info

Publication number
CN113159196B
CN113159196B (application CN202110455004.9A)
Authority
CN
China
Prior art keywords
clustering
software
vector
regular
embedded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110455004.9A
Other languages
Chinese (zh)
Other versions
CN113159196A (en)
Inventor
崔国荣
康雁
李媛
张晓颖
李晋源
贾雪彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University YNU filed Critical Yunnan University YNU
Priority to CN202110455004.9A priority Critical patent/CN113159196B/en
Publication of CN113159196A publication Critical patent/CN113159196A/en
Application granted granted Critical
Publication of CN113159196B publication Critical patent/CN113159196B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a software demand clustering method and system based on regular variational embedding. The method comprises the following steps: acquiring software requirement data of different types of software; performing text preprocessing on the software requirement data; mapping the software requirement text to a vector space by using a BERT pre-trained sentence vector model; and clustering the sentence vectors by using a regular variational embedded clustering model. The clustering steps are as follows: Dropout regularization is performed on the sentence vectors to determine regularized vectors; feature compression is performed on the regularized vectors by using a fully connected layer; embedded features are determined with an encoder from the compressed regularized vectors; the embedded features are decoded with a decoder; a cluster partition is determined with the K-means algorithm from the embedded features; and the corresponding loss functions are determined from the embedded features, the original vector and the cluster partition and are back-propagated to determine the clustering result. The method and system improve the accuracy of software requirement text clustering.

Description

Regular variational embedding-based software demand clustering method and system
Technical Field
The invention relates to the field of data mining, in particular to a regular variational embedding-based software demand clustering method and system.
Background
The software development process comprises requirement analysis, system design, detailed design, testing and evaluation, and requirement analysis is the first task when designing good software; yet in actual development its importance is often neglected in favour of the design stage. Errors introduced during requirement analysis are invisible for most of the development process and are discovered only at the testing stage, where correcting them costs far more. The software requirement description is not only a bridge for communication between users and developers but also the basis for functional design and performance indicators; it runs through the whole software development process, and because requirements change over time it can bring great risk to later development.
The requirement analysis stage often suffers from the following problems: (1) users and developers work in different fields, each familiar with its own field but not with the other's, so communication between them is obstructed; (2) potential requirements hidden in the software requirement description cannot be mined, and because users have limited knowledge of the computer field, their descriptions of the required functions and performance are incomplete, so requirements may be omitted during analysis; (3) software requirement texts are vaguely described and suffer from sparsity, ambiguity and non-verifiability, and expressions with the same meaning are understood differently in different fields; these shortcomings show up as redundant function descriptions, a complicated development process and high module coupling, which degrade user experience and operability and make software cost and development efficiency hard to guarantee, ultimately causing the project to fail.
These problems in the requirement analysis stage hinder software development. If similar descriptions are aggregated by a clustering method, only the descriptions within the same class need to be examined, and each class can then be used to characterize the functions and application fields of the software.
Traditional text clustering first preprocesses the text, then extracts features from the preprocessed word sequences, maps them into a vector space with a word vector tool, and partitions the samples based on a distance metric. Traditional feature extraction is based on statistical analysis: an evaluation function assigns weights to the existing feature parameters, and each feature weight represents the importance of a word within the whole sentence. This approach, however, usually ignores the positional relations and the semantic information of the sequence; words in the text are treated as isolated individuals, semantics cannot be expressed accurately from frequency alone, and the context is ignored.
The commonly used clustering methods are mostly based on partitioning, density and hierarchical approaches, such as K-means, DBSCAN and Agglomerative Clustering. These methods achieve good results in the text and image fields, but they are not necessarily suitable for every domain, and traditional clustering also depends heavily on the quality of the extracted features and on the training of the word vector model. Deep learning focuses on representation learning: a deep model learns the feature representation of a text, offers stronger nonlinearity and fitting ability than traditional feature extraction, greatly improves on the features used by traditional clustering, and can reach higher clustering accuracy by tuning the network parameters. However, most deep clustering is two-stage: features are first extracted from the text, and the compressed features are then partitioned by a conventional clustering method. The two-stage structure is clear, and the feature extractor compresses high-dimensional features into low-dimensional ones, reducing data sparsity, retaining more textual semantics and giving the clustering better continuity and interpretability. Its obvious drawback is that representation learning and clustering are carried out in two separate steps, so the cluster centers and the extracted features cannot be improved according to the clustering result; the method easily falls into local optima, the forward clustering pass is one-off, and the clustering output can only be improved within a small range.
Traditional text clustering methods are easily influenced by the initial cluster centers, most are based on distance metrics, and they still tend to fall into local optima. Deep clustering combines deep learning with traditional clustering; its model structure is easy to understand and its results improve somewhat on traditional clustering, but most deep clustering methods are two-stage and cannot back-propagate to optimize the cluster centers and the sample assignment.
At present, clustering methods for software requirement texts are rare, and most existing work focuses on improving the clustering method itself while neglecting the feature extraction scheme. In practice no clustering method is universal: the feature extraction scheme and the clustering method must be chosen according to the data distribution, the feature extractor is difficult to decide on, and it indirectly determines the sample partition to some extent. Training the word vector model is equally important because the training is unsupervised, but large-scale software requirement corpora are scarce; without good word vectors the clustering method becomes meaningless.
Software requirement texts are unordered, unstructured data with much redundant information, so they cannot be fed into a machine learning model directly; abbreviation expansion, spelling correction, stemming and lemmatization are required. Traditional feature extractors apply a linear mapping whose results lack a reasonable explanation, and the text must be mapped into a vector form a computer can recognize before cluster partitioning can be performed. After analyzing the software requirement data, one must decide, according to its sample distribution, which clustering method and feature extractor to adopt and how to design a neural clustering model that can be optimized by back propagation, while ensuring that the clustering loss does not damage the embedding space of the feature vectors and that the local structure of the samples is preserved.
Therefore, a software demand clustering method is urgently needed that improves the accuracy of software requirement text clustering in view of the high dispersion, high noise and sparsity of software requirement texts.
Disclosure of Invention
The invention aims to provide a regular variational embedding-based software demand clustering method and a regular variational embedding-based software demand clustering system, which are used for improving the accuracy of software demand text clustering.
In order to achieve the purpose, the invention provides the following scheme:
a regular variational embedding-based software demand clustering method comprises the following steps:
acquiring software requirement data of different types of software;
performing text preprocessing on the software requirement data to determine a software requirement text;
mapping the software requirement text to a vector space by using a sentence vector model pre-trained by BERT, and determining a sentence vector;
clustering the sentence vectors by using a regular variational embedded clustering model to determine a clustering result;
the step of clustering the sentence vectors by using the regular variation embedded clustering model comprises the following steps:
the regular variational embedded clustering model carries out Dropout regularization processing on the sentence vectors to determine regularized vectors;
performing feature compression on the regularization vectors by utilizing a full connection layer; determining embedding characteristics by adopting an encoder according to the compressed regularized vector;
decoding the embedded features by using a decoder to determine an original vector;
determining clustering division results by adopting a K-means algorithm according to the embedding characteristics;
and determining a corresponding loss function according to the embedded feature, the original vector and the clustering result, and performing back propagation on the embedded feature, the original vector and the loss function corresponding to the clustering result to determine the clustering result.
Optionally, the acquiring software requirement data of different types of software specifically includes:
acquiring software requirement data of 11 types of software of the Windows platform from the Softpedia website by using the Scrapy technology;
storing each type of software requirement data separately in a csv format, and labeling each type of software requirement data at the same time.
Optionally, the text preprocessing is performed on the software requirement data, and the determining of the software requirement text specifically includes:
eliminating html tags in the software requirement data by using the regular expression;
correcting the abbreviation and the scrambled words of the software requirement data with the html tags removed;
carrying out stem extraction and morphological restoration on the corrected software requirement data;
and storing the processed data in the csv file.
Optionally, the regularization vectors are feature compressed by using a full connection layer; and determining embedding characteristics by using an encoder according to the compressed regularization vector, specifically comprising:
determining the embedded features by using the formula z = u + exp(δ) · ε;
wherein u and δ are the two parameters, namely the mean and the variance, with which the encoder converts the compressed regularized vector into the hidden space, and ε is a tensor following the normal distribution, ε ~ N(0, 1).
Optionally, the clustering the sentence vectors by using a regular variational embedded clustering model to determine a clustering result, and then further comprising:
and evaluating the clustering result by using the clustering index.
A regular variational embedding-based software demand clustering system comprises:
the software requirement data acquisition module is used for acquiring software requirement data of different types of software;
the text preprocessing module is used for performing text preprocessing on the software requirement data and determining a software requirement text;
the sentence vector determining module is used for mapping the software requirement text to a vector space by using a BERT pre-trained sentence vector model and determining a sentence vector;
the clustering result determining module is used for clustering the sentence vectors by utilizing a regular variation embedded clustering model to determine a clustering result;
the step of clustering the sentence vectors by using the regular variational embedded clustering model comprises the following steps:
the regular variational embedded clustering model carries out Dropout regularization processing on the sentence vectors to determine regularized vectors;
performing feature compression on the regularization vectors by utilizing a full connection layer; determining embedding characteristics by adopting an encoder according to the compressed regularized vector;
decoding the embedded features by using a decoder to determine an original vector;
determining clustering division results by adopting a K-means algorithm according to the embedding characteristics;
and determining a corresponding loss function according to the embedded feature, the original vector and the clustering result, and performing back propagation on the embedded feature, the original vector and the loss function corresponding to the clustering result to determine the clustering result.
Optionally, the software requirement data acquiring module specifically includes:
the software demand data acquisition unit is used for acquiring software demand data of different types of software of the Windows platform from the Softpedia website by utilizing the Scrapy technology;
and the software requirement data storage and labeling unit is used for separately storing each type of software requirement data in a csv format and labeling each type of software requirement data.
Optionally, the text preprocessing module specifically includes:
the text removing unit is used for removing the html tags in the software requirement data by using the regular expression;
the text correction unit is used for correcting the software requirement data from which the html tags are removed by the abbreviative words and the messy code words;
the text extraction unit is used for carrying out stem extraction and morphological restoration on the corrected software requirement data;
and the text storage unit is used for storing the processed data in the csv file.
Optionally, the method further comprises:
and the clustering evaluation module is used for evaluating the clustering result by utilizing the clustering index.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a software demand clustering method and system based on regular variational embedding, which utilize a full connection layer to carry out feature compression on the regular vector; determining embedding characteristics by adopting an encoder according to the compressed regularized vector; the scheme solves the problem that the two-end type clustering can not reversely propagate and optimize the clustering center and the sample distribution, the local structure of the data is reserved by learning the internal hidden distribution of the software demand data, the original data is simulated by the skills of the heavy parameters, the essential characteristics of the data are reflected, the characteristic quality and the clustering accuracy are improved, and the optimal effect is achieved. In text representation, a BERT sentence vector model is used for mapping an original text to a vector space, as the original text is not subjected to noise processing, Dropout regularization is fused at the input end of the model to randomly inhibit the work of part of neurons, noise interference is reduced, overfitting of the model is prevented, robustness of the model is enhanced, the Dropout regularized vector is input into a variational embedded clustering method, the embedded space can learn original data distribution through an encoder, a recomparametric skill ensures that a hidden layer can better abstract input data characteristics, then the embedded space vector is used for decoding to learn characteristics on one hand, and on the other hand, a K-means clustering method is used for sample division to achieve iteration stop condition output clustering results through a back propagation optimization loss function.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a regular variational embedding-based software demand clustering method provided by the invention;
FIG. 2 is a flowchart of an algorithm for regular variational-based embedded software demand clustering provided by the present invention;
FIG. 3 is a network architecture diagram of a regular variational-embedding-based software demand clustering provided by the present invention;
FIG. 4 is a schematic diagram of Dropout regularization in an embodiment;
FIG. 5 is a diagram illustrating a conventional clustering comparison;
FIG. 6 is a schematic diagram of depth clustering comparison;
fig. 7 is a schematic structural diagram of a regular variational-embedding-based software demand clustering system provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention aims to provide a regular variational embedding-based software demand clustering method and a regular variational embedding-based software demand clustering system, which are used for improving the accuracy of software demand text clustering.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic flow chart of a regular-variation-embedding-based software demand clustering method, as shown in fig. 1, the regular-variation-embedding-based software demand clustering method provided by the present invention includes:
s101, acquiring software requirement data of different types of software.
S101 specifically comprises the following steps:
and acquiring software requirement data of 11 types of software of the Windows platform under the Softpedia website by using a Scapy technology.
Each type of software requirement data is stored separately in a csv format, and each type of software requirement data is labeled. The software requirement descriptions with the same function are output and stored in csv format, and the requirement description information with the same function is labeled.
Because little open-source data describing software requirements exists on the internet, 11 classes of data under the Windows platform were obtained from the Softpedia website using the Scrapy crawler technology. The 11 classes are anti, Authoring-Tools, CD-DVD-Blu-ray-Tools, Compression-Tools, Desktop-Enhancements, File-managers, Gaming-Related, iPod-Tools, Maps & GPS, Mobile-Phone-Tools and Network-Tools. Because the site runs an anti-crawling program, an IP proxy pool and request delays were configured. Each class of data is stored separately in a csv format, the samples of the 11 classes are labeled at the same time, the class names are sorted by their ASCII codes and the samples of each class are labeled with a number from 0 to 10, and the 11 labeled classes are finally merged into a single file.
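For illustration only, a minimal sketch of the labeling and merging step described above might look as follows (the directory layout, file names and the use of pandas are assumptions, not details given in the patent):

```python
import glob
import pandas as pd

# Assumed layout: one csv file per crawled software category under data/.
class_files = sorted(glob.glob("data/*.csv"))   # sorted names stand in for the ASCII ordering

frames = []
for label, path in enumerate(class_files):      # labels 0-10 for the 11 classes
    df = pd.read_csv(path)
    df["label"] = label                          # label every sample of this class
    frames.append(df)

# Merge the labeled classes into a single csv file.
pd.concat(frames, ignore_index=True).to_csv("requirements_all.csv", index=False)
```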
And S102, performing text preprocessing on the software requirement data, and determining a software requirement text.
S102 specifically includes:
and eliminating the html tags in the software requirement data by using the regular expression.
The abbreviated words and scrambled words in the software requirement data from which the html tags have been removed are corrected.
And carrying out stem extraction and morphological restoration on the corrected software requirement data.
And storing the processed data in the csv file.
Because the software requirement data is obtained directly from the website, html tag content is unavoidable. The tag formats include the following types: <div class="test"></div>, <img/>, and custom tags such as <My-Tag></My-Tag>. The html tags in the English text are removed with the regular expression module re: one pattern matches paired tags such as </div>, and another matches self-closing tags such as <img/>. Contractions are expanded with re.sub, for example re.sub(r"can't", "can not", text), and the pyenchant spell checker is used to find misspelled words and correct them. SnowballStemmer completes stemming (for example, "protecting" becomes "protect" after extraction) and WordNetLemmatizer completes lemmatization. Finally, the data of all classes are stored in a csv file.
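A minimal preprocessing sketch along these lines is shown below; the exact regular expressions and the contraction handled are illustrative assumptions, while re, pyenchant, SnowballStemmer and WordNetLemmatizer are the tools named in the text:

```python
import re
import enchant                      # pyenchant spell checker
from nltk.stem import SnowballStemmer, WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()
checker = enchant.Dict("en_US")

def preprocess(text: str) -> str:
    # Strip html tags such as <div class="test"></div>, <img/> and custom tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Expand contractions, e.g. "can't" -> "can not".
    text = re.sub(r"can't", "can not", text)
    tokens = []
    for word in text.split():
        # Replace misspelled or garbled words with the first spelling suggestion.
        if word.isalpha() and not checker.check(word):
            suggestions = checker.suggest(word)
            if suggestions:
                word = suggestions[0]
        # Stemming (e.g. "protecting" -> "protect") followed by lemmatization.
        word = lemmatizer.lemmatize(stemmer.stem(word.lower()))
        tokens.append(word)
    return " ".join(tokens)
```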
S103, mapping the software requirement text to a vector space by using a sentence vector model pre-trained with BERT, and determining the sentence vectors. A sentence vector represents the textual information of the whole sentence and takes the positional relations and the internal relations between words into account. The BERT pre-trained sentence vector model outputs sentence vector representations of the same dimension, so that after training each original software requirement description is mapped into a vector space of that fixed dimension.
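As an illustration of this step (the patent only states that a BERT pre-trained sentence vector model producing 768-dimensional vectors is used; the specific checkpoint below is an assumption), sentence vectors could be obtained with the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer

# Illustrative BERT-based sentence embedding checkpoint, not the one used by the authors.
model = SentenceTransformer("bert-base-nli-mean-tokens")

requirements = [
    "protects your computer from viruses and malware",
    "compresses and extracts archive files in zip and rar format",
]
sentence_vectors = model.encode(requirements)   # shape: (n_samples, 768)
print(sentence_vectors.shape)
```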
S104, clustering the sentence vectors by using a regular variational embedded clustering model to determine a clustering result. The processed sentence vectors are input into the regular variational embedded clustering model, the high-dimensional sparse features are compressed into low-dimensional dense vectors by encoding, samples similar to the original data are generated in probabilistic form to reflect the essential features of the data, and feature learning and cluster partitioning are then performed simultaneously on the newly generated vectors.
As shown in fig. 2 and fig. 3, the step of clustering the sentence vectors by using the regular variational embedded clustering model is as follows:
S401, the regular variational embedded clustering model performs Dropout regularization on the sentence vectors, and the regularized vectors are determined.
Dropout regularization is fused at the input: part of the neurons of the neural network stop working with a certain probability, which prevents the model from greedily learning unimportant features in the data and from absorbing too much noise while the features are compressed, which would harm the text clustering. Randomly suppressing neurons is equivalent to denoising, makes the model robust during learning, and enhances its generalization ability. Neural network training cannot escape two problems: (1) it is time-consuming; (2) it overfits easily. Fusing Dropout addresses both problems: some neurons are stopped at random and do not take part in training while the remaining neurons work together, which simplifies the network model, reduces the time complexity, and removes and weakens the joint adaptation between neuron nodes. Assume that the input vector set is x = [x_1, x_2, x_3, ..., x_m], where x_i represents each sample, the hidden-layer vector is z = [z_1, z_2, z_3, ..., z_k], and the reconstructed vector is x' = [x'_1, x'_2, x'_3, ..., x'_m].
The activation function of the fully connected layers of the network is the ReLU function. Regularization is fused at the input, and Dropout randomly maps the input to the regularized vector x̃:
r_j ~ Bernoulli(p), x̃ = r ⊙ x,
where r is a random 0/1 mask vector and p is the retention probability.
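A minimal PyTorch sketch of the input-side Dropout regularization (the dropout probability p = 0.2 and the batch size are assumed values, not taken from the patent):

```python
import torch
import torch.nn as nn

# Dropout fused at the input of the clustering network: during training a random
# subset of the 768 sentence-vector dimensions is zeroed, which acts as noise
# suppression; at inference time Dropout is a no-op.
input_dropout = nn.Dropout(p=0.2)

x = torch.randn(32, 768)          # a batch of BERT sentence vectors
x_reg = input_dropout(x)          # regularized vectors fed to the encoder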
S402, performing feature compression on the regularized vector by using a full connection layer; and determining embedding characteristics by adopting an encoder according to the compressed regularization vector.
S402 specifically includes:
The embedded features are determined using the formula z = u + exp(δ) · ε.
Here u and δ are the two parameters, namely the mean and the variance, with which the encoder converts the compressed regularized vector into the hidden space, and ε is a tensor following the normal distribution, ε ~ N(0, 1).
Feature compression is performed on the high-dimensional sparse vector with a fully connected layer, and a constraint is applied to the encoder network so that it generates a sample set following a Gaussian distribution. With the reparameterization trick, any related data can be obtained from the mean u and the variance parameter δ of the Gaussian distribution, so the embedding space learns the original data distribution and δ dynamically adjusts the noise intensity; the sample is then reconstructed by the decoding layer, generating data similar to the original sample:
u = f(W_u · x̃ + b_u), δ = f(W_δ · x̃ + b_δ);
ε ~ N(0, 1);
z = u + exp(δ) · ε.
S403, decoding the embedded features with a decoder to determine the original vector. The embedded feature z formed by the reparameterization is decoded; the decoder also uses fully connected layers, the decoding process is the reverse of the encoding process, and decoding restores the original vector x':
x' = f(W_x' · z + b_x').
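Putting S402 and S403 together, a minimal PyTorch sketch of the compression, reparameterization and decoding steps might look as follows; the hidden and embedding dimensions are illustrative assumptions, not values given in the patent:

```python
import torch
import torch.nn as nn

class VariationalEmbedding(nn.Module):
    """Sketch of the regularized variational embedding: encode -> reparameterize -> decode."""

    def __init__(self, in_dim=768, hidden_dim=256, embed_dim=10):
        super().__init__()
        self.dropout = nn.Dropout(p=0.2)                  # input regularization
        self.compress = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.to_u = nn.Linear(hidden_dim, embed_dim)      # mean u
        self.to_delta = nn.Linear(hidden_dim, embed_dim)  # log-scale delta
        self.decode = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, in_dim))

    def forward(self, x):
        h = self.compress(self.dropout(x))
        u, delta = self.to_u(h), self.to_delta(h)
        eps = torch.randn_like(delta)                     # epsilon ~ N(0, 1)
        z = u + torch.exp(delta) * eps                    # z = u + exp(delta) * epsilon
        x_rec = self.decode(z)                            # reconstructed vector x'
        return z, x_rec, u, delta
```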
S404, determining the cluster partition with the K-means algorithm from the embedded features. The embedded feature z is used as the input of the K-means algorithm, and K-means assigns each sample to the nearest class according to the nearest-neighbour principle, i.e. if distance(x_i, c_j) = min{distance(x_i, c_t), t = 1, 2, 3, ..., k}, then x_i ∈ y_j. The cluster center c_j is recomputed in every training round:
c_j = (1 / |y_j|) · Σ_{x_i ∈ y_j} x_i.
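For this partitioning step, the cluster centers c_j and the nearest-center assignments can be obtained, for example, with scikit-learn's KMeans; the embeddings below are random stand-ins for the real embedded features:

```python
import numpy as np
from sklearn.cluster import KMeans

# z: embedded features of all samples, shape (n_samples, embed_dim); 11 classes.
z = np.random.randn(1000, 10)           # stand-in for the real embeddings
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0).fit(z)
labels = kmeans.labels_                  # nearest-center assignment per sample
centers = kmeans.cluster_centers_        # cluster centers c_j, recomputed each round
```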
The similarity between an embedded feature z_i and a cluster center c_j is characterized with a Student's t-distribution:
q_ij = (1 + ||z_i − c_j||²)^(−1) / Σ_{j'} (1 + ||z_i − c_{j'}||²)^(−1).
Using this similarity, the target distribution is defined as:
p_ij = (q_ij² / f_j) / Σ_{j'} (q_{ij'}² / f_{j'}), where f_j = Σ_i q_ij.
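A NumPy sketch of the soft assignment q_ij and the target distribution p_ij as written above (assuming one degree of freedom for the Student's t-distribution):

```python
import numpy as np

def soft_assignment(z, centers):
    """Student's t-distribution similarity q_ij between embeddings and cluster centers."""
    dist_sq = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    q = 1.0 / (1.0 + dist_sq)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target distribution p_ij derived from q_ij."""
    weight = q ** 2 / q.sum(axis=0)          # q_ij^2 / f_j with f_j = sum_i q_ij
    return weight / weight.sum(axis=1, keepdims=True)
```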
S405, determining the corresponding loss functions according to the embedded features, the original vector and the cluster partition result, and back-propagating these loss functions to determine the clustering result.
In S405 the loss function is optimized: after decoding and cluster partitioning, the loss function is optimized by back propagation. Three losses are involved: the loss generated by sample reconstruction, the loss generated by clustering, and the loss of the new samples generated by the reparameterization:
Reconstruction Loss = ||x − x'||²;
Cluster Loss = KL(P ‖ Q) = Σ_i Σ_j p_ij · log(p_ij / q_ij);
KL Loss = KL(N(u, exp(δ)²) ‖ N(0, 1)) = −(1/2) · Σ (1 + 2δ − u² − exp(2δ)).
The total loss function is defined as: L = KL Loss + Reconstruction Loss + α · Cluster Loss.
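A hedged PyTorch sketch of the combined loss; the KL term below assumes that δ is interpreted as the logarithm of the standard deviation (consistent with z = u + exp(δ)·ε), and the weight α = 0.1 is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def total_loss(x, x_rec, u, delta, q, p, alpha=0.1):
    # Reconstruction loss: ||x - x'||^2 (averaged over the batch).
    recon = F.mse_loss(x_rec, x)
    # KL loss of N(u, exp(delta)^2) against the standard normal prior N(0, 1).
    kl = -0.5 * torch.mean(1 + 2 * delta - u.pow(2) - torch.exp(2 * delta))
    # Clustering loss: KL(P || Q) between the target and the soft assignment.
    cluster = torch.mean(torch.sum(p * torch.log(p / q), dim=1))
    return kl + recon + alpha * cluster
```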
After S104, further comprising:
and evaluating the clustering result by using the clustering index.
The samples of each class are calculated and counted according to the principle that "the minority obeys the majority", and the quality of the method is measured with evaluation indices such as the Silhouette Coefficient (SC), the Calinski-Harabasz index (CH) and the Davies-Bouldin index (DBI), whose formulas are as follows:
SC: s(i) = (b(i) − a(i)) / max{a(i), b(i)};
CH: CH = [tr(B_k) / (k − 1)] / [tr(W_k) / (n − k)], where B_k and W_k are the between-class and within-class dispersion matrices;
DBI: DBI = (1/k) · Σ_{i=1}^{k} max_{j≠i} (s̄_i + s̄_j) / d(c_i, c_j), where s̄_i is the average distance of the samples of class i to its center and d(c_i, c_j) is the distance between centers.
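These three indices are available in scikit-learn; a minimal usage sketch (with random stand-in embeddings and labels, since the real data is not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

z = np.random.randn(1000, 10)                        # embedded features (stand-in)
labels = KMeans(n_clusters=11, n_init=10, random_state=0).fit_predict(z)

print("SC :", silhouette_score(z, labels))
print("CH :", calinski_harabasz_score(z, labels))
print("DBI:", davies_bouldin_score(z, labels))
```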
as a specific embodiment, the invention adopts the script crawler technology to acquire software requirements on a Softpedia platform, and the requirements are objective description of software functions. The Softpedia platform provides users with various tools under various system platforms, including Windows, Linux, Android, IOS and the like. There are dozens of software categories such as antivirus software, compression software, file management, game software and map positioning software under each platform, and the tool usage descriptions under different platforms are different. By 2019, 5 and 27 in Beijing, 113016 total applications are recorded in the website, and the total number of times of downloading the applications is 3320687342.
The software requirement data are functional descriptions of software under the Windows platform. The platform hosts different types of software, each type contains a large number of apps, and the functions and behaviour of each app have corresponding description information. In total 11 classes and 15598 pieces of software data were crawled, with varying text lengths; the crawled software categories and their counts are shown in Table 1.
TABLE 1 software requirements data sheet
(The contents of Table 1 are provided as an image in the original publication and are not reproduced here.)
The invention mainly compares clustering accuracy and adopts purity as the accuracy measure; the formula is as follows:
Purity = (1/N) · Σ_k max_j |ω_k ∩ c_j|, where N is the number of samples, ω_k is the set of samples assigned to cluster k, and c_j is the set of samples of true class j.
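A small NumPy sketch of the purity computation ("the minority obeys the majority"); the argument names are illustrative:

```python
import numpy as np

def purity(true_labels, cluster_labels):
    """Each cluster votes for its majority true class; purity is the fraction of samples covered by those majorities."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    majority_total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        majority_total += np.bincount(members).max()   # size of the largest true class in this cluster
    return majority_total / len(true_labels)

# Example: purity([0, 0, 1, 1, 2], [0, 0, 0, 1, 1]) == 0.6
```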
in the invention, after texts with different lengths are trained by BERT, the characteristic dimension of each sample is 768 dimensions, the problem of vector irregularity caused by different text lengths is solved, context semantic information is considered, and in order to verify that an embedded model of the BERT sentence is superior to an average vector, the traditional Clustering algorithms DBSCAN, Spectral Clustering, Hierarchical Clustering, GMM, K-means and SOM are compared with the method disclosed by the invention, as shown in FIG. 5.
According to Fig. 5, the software requirement data suit the K-means and SOM clustering algorithms, while the improvement of the other traditional clustering algorithms is not obvious; the clustering accuracy improves by 7.79% over the average vector with the SOM algorithm and by 5.16% with the K-means algorithm, showing that the sentence vector model effectively improves the performance of a clustering algorithm compared with the average vector. Compared with the traditional clustering algorithms, the DVEC model of this chapter has the highest clustering accuracy: 60.11% with the average vector and 62.92% with the sentence vector. Compared with the traditional K-means algorithm, which had the highest accuracy among the traditional methods, the accuracy improves by 5.72% on the average vector and by 3.37% on the sentence vector, which shows that the embedding-space vectors of the regular variational embedded clustering algorithm can improve the accuracy of the clustering algorithm while learning the feature representation, and that the joint optimization of the two greatly improves the algorithm performance. The clustering model of this chapter is also better suited to software requirement text data.
The invention also compares the accuracy with the more popular deep clustering algorithms AE + K-means, DEC and IDEC, and the comparison is shown in FIG. 6.
The comparison shows that AE + K-means has the lowest accuracy: the auto-encoder only compresses the data features to reduce dimensionality, the compressed features are clustered by the K-means algorithm, which lowers the algorithm complexity, but no other improvement is made, and compared with the model of this chapter it is 4.09% lower on the average vector and 3% lower on the sentence vector. The DEC model is 2.03% lower than the model of this chapter on the average vector and 2.75% lower on the sentence vector; compared with DEC, the decoder structure is not removed here, so the local feature structure is maintained and the clustering loss does not distort the feature space. The IDEC model only learns the error between sample input and output and does not consider vector noise or the sample distribution, whereas the DVEC model proposed here removes noise at the input while using VEC to learn the sample distribution and the cluster partition. On Average Embedding, the clustering accuracy of DVEC is slightly lower than that of IDEC, because the weighted sum of average vectors cannot fully express the semantics, so the accuracy neither rises nor falls. On Sentence Embedding, DVEC performs best with an accuracy of 62.92%, which is 2.81% higher than on Average Embedding and 1.14% higher than IDEC. The sentence vector considers the positional relations between words but not the influence of noise; the Dropout regularization fused at the DVEC input solves this problem, while the embedding space reconstructs samples from the normal distribution with the reparameterization trick, which suits situations with little data. The encoding-decoding process is similar to a generative-adversarial process, which makes the feature vectors extracted in the embedding space better, and the method is effective in improving the accuracy of the clustering algorithm.
The performance of the clustering results was evaluated on SC, CH, DBI as shown in table 2.
TABLE 2 comparison of various evaluation indices
(The contents of Table 2 are provided as an image in the original publication and are not reproduced here.)
As can be seen from Table 2, the Silhouette Coefficient of DVEC is second only to AE + K-means and higher than those of DEC and IDEC, indicating that the intra-class distances of DVEC are smaller and the inter-class distances larger, so the classes are separated better. On the Calinski-Harabasz index, the value of DVEC is the smallest among the compared models, which shows that the intra-class covariance of the algorithm model of this chapter is the smallest, the inter-class covariance the largest, and the cluster partitioning effect the best. On the Davies-Bouldin index, the value of the DVEC model is also the smallest; the smaller the value, the lower the similarity between different classes, so the cluster partitioning is clear.
Fig. 7 is a schematic structural diagram of a software demand clustering system based on a canonical variate embedding type, as shown in fig. 7, the software demand clustering system based on the canonical variate embedding type provided by the present invention includes:
a software requirement data obtaining module 701, configured to obtain software requirement data of different types of software;
a text preprocessing module 702, configured to perform text preprocessing on the software requirement data to determine a software requirement text;
a sentence vector determining module 703, configured to map the software requirement file to a vector space by using a sentence vector model pre-trained by BERT, and determine a sentence vector;
a clustering result determining module 704, configured to cluster the sentence vectors by using a regular variational embedded clustering model, and determine a clustering result;
the step of clustering the sentence vectors by using the regular variation embedded clustering model comprises the following steps:
the regular variational embedded clustering model carries out Dropout regularization processing on the sentence vectors and determines regularized vectors;
performing feature compression on the regularized vector by using a full connection layer; determining embedding characteristics by adopting an encoder according to the compressed regularized vector;
decoding the embedded features by using a decoder to determine an original vector;
determining clustering division results by adopting a K-means algorithm according to the embedding characteristics;
and determining corresponding loss functions according to the embedded features, the original vectors and the clustering division results, and performing back propagation on the embedded features, the original vectors and the loss functions corresponding to the clustering division results to determine the clustering results.
The software requirement data obtaining module 701 specifically includes:
the software demand data acquisition unit is used for acquiring software demand data of different types of software of the Windows platform from the Softpedia website by utilizing the Scrapy technology;
and the software requirement data storage and labeling unit is used for independently storing each type of software requirement data in a csv format and labeling each type of software requirement data.
The text preprocessing module 702 specifically includes:
the text removing unit is used for removing the html tags in the software requirement data by using the regular expression;
the text correction unit is used for correcting the software requirement data from which the html tags are removed by the abbreviative words and the messy code words;
the text extraction unit is used for carrying out stem extraction and morphological restoration on the corrected software requirement data;
and the text storage unit is used for storing the processed data in the csv file.
The invention provides a regular variational embedding-based software demand clustering system, which further comprises:
and the clustering evaluation module is used for evaluating the clustering result by utilizing the clustering index.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the description of the method part.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the foregoing, the description is not to be taken in a limiting sense.

Claims (8)

1. A regular variational embedding-based software demand clustering method is characterized by comprising the following steps:
acquiring software requirement data of different types of software;
performing text preprocessing on the software requirement data to determine a software requirement text;
mapping the software requirement text to a vector space by using a sentence vector model pre-trained by BERT, and determining a sentence vector;
clustering the sentence vectors by using a regular variation embedded clustering model to determine a clustering result;
the step of clustering the sentence vectors by using the regular variational embedded clustering model comprises the following steps:
the regular variational embedded clustering model carries out Dropout regularization processing on the sentence vectors and determines regularized vectors;
performing feature compression on the regularization vectors by utilizing a full connection layer; determining embedding characteristics by adopting an encoder according to the compressed regularized vector;
decoding the embedded features by using a decoder to determine an original vector;
determining clustering division results by adopting a K-means algorithm according to the embedding characteristics;
determining a corresponding loss function according to the embedded feature, the original vector and the clustering division result, and performing back propagation on the embedded feature, the original vector and the loss function corresponding to the clustering division result to determine the clustering result;
performing feature compression on the regularized vector by using a full connection layer; and determining embedding characteristics by using an encoder according to the compressed regularization vector, specifically comprising:
determining the embedded features by using the formula z = u + exp(δ) · ε;
determining a loss function by using the formula L = KL Loss + Reconstruction Loss + α · Cluster Loss;
wherein u and δ are the two parameters, namely the mean and the variance, with which the encoder converts the compressed regularized vector into the hidden space; ε is a tensor following the normal distribution, ε ~ N(0, 1); L is the loss function; Reconstruction Loss is the loss generated by sample reconstruction, Reconstruction Loss = ||x − x'||², where x is the input vector set and x' is the original vector; Cluster Loss is the loss caused by clustering,
Cluster Loss = KL(P ‖ Q) = Σ_i Σ_j p_ij · log(p_ij / q_ij),
where p_ij is the target distribution and q_ij is the similarity between the embedded feature and the cluster center; KL Loss is the loss of the new samples generated by the reparameterization,
KL Loss = KL(N(u, exp(δ)²) ‖ N(0, 1)) = −(1/2) · Σ (1 + 2δ − u² − exp(2δ)).
2. the regular variational-embedding-based software demand clustering method according to claim 1, wherein the acquiring of software demand data of different types of software specifically comprises:
acquiring software requirement data of 11 types of software of the Windows platform from the Softpedia website by utilizing the Scrapy technology;
storing each type of software requirement data separately in a csv format, and labeling each type of software requirement data at the same time.
3. The regular-variation-embedding-based software demand clustering method according to claim 1, wherein the step of performing text preprocessing on the software demand data to determine a software demand text specifically comprises:
eliminating html tags in the software requirement data by using the regular expression;
correcting the abbreviation and the scrambled words of the software requirement data with the html tags removed;
carrying out stem extraction and morphological restoration on the corrected software requirement data;
and storing the processed data in the csv file.
4. The regular variational embedded-based software demand clustering method according to claim 1, wherein the regular variational embedded clustering model is used to cluster the sentence vectors to determine a clustering result, and then the method further comprises:
and evaluating the clustering result by using the clustering index.
5. A regular variational embedding-based software demand clustering system is characterized by comprising:
the software requirement data acquisition module is used for acquiring software requirement data of different types of software;
the text preprocessing module is used for performing text preprocessing on the software requirement data and determining a software requirement text;
the sentence vector determining module is used for mapping the software requirement text to a vector space by using a sentence vector model pre-trained by BERT to determine a sentence vector;
the clustering result determining module is used for clustering the sentence vectors by utilizing a regular variation embedded clustering model to determine a clustering result;
the step of clustering the sentence vectors by using the regular variational embedded clustering model comprises the following steps:
the regular variational embedded clustering model carries out Dropout regularization processing on the sentence vectors to determine regularized vectors;
performing feature compression on the regularized vector by using a full connection layer; determining embedding characteristics by adopting an encoder according to the compressed regularized vector;
decoding the embedded features by using a decoder to determine an original vector;
determining clustering division results by adopting a K-means algorithm according to the embedding characteristics;
determining a corresponding loss function according to the embedded feature, the original vector and the clustering division result, and performing back propagation on the embedded feature, the original vector and the loss function corresponding to the clustering division result to determine the clustering result;
determining the embedded features by using the formula z = u + exp(δ) · ε;
determining a loss function by using the formula L = KL Loss + Reconstruction Loss + α · Cluster Loss;
wherein u and δ are the two parameters, namely the mean and the variance, with which the encoder converts the compressed regularized vector into the hidden space; ε is a tensor following the normal distribution, ε ~ N(0, 1); L is the loss function; Reconstruction Loss is the loss generated by sample reconstruction, Reconstruction Loss = ||x − x'||², where x is the input vector set and x' is the original vector; Cluster Loss is the loss caused by clustering,
Cluster Loss = KL(P ‖ Q) = Σ_i Σ_j p_ij · log(p_ij / q_ij),
where p_ij is the target distribution and q_ij is the similarity between the embedded feature and the cluster center; KL Loss is the loss of the new samples generated by the reparameterization,
KL Loss = KL(N(u, exp(δ)²) ‖ N(0, 1)) = −(1/2) · Σ (1 + 2δ − u² − exp(2δ)).
6. the regular variational embedding-based software demand clustering system according to claim 5, wherein the software demand data acquisition module specifically comprises:
the software demand data acquisition unit is used for acquiring software demand data of different types of software of the Windows platform from the Softpedia website by utilizing the Scrapy technology;
and the software requirement data storage and labeling unit is used for separately storing each type of software requirement data in a csv format and labeling each type of software requirement data.
7. The regular-variation-embedding-based software demand clustering system of claim 5, wherein the text preprocessing module specifically comprises:
the text rejection unit is used for rejecting html tags in the software requirement data by using the regular expression;
the text correction unit is used for correcting the software requirement data from which the html tags are removed by the abbreviative words and the messy code words;
the text extraction unit is used for carrying out stem extraction and morphological restoration on the corrected software requirement data;
and the text storage unit is used for storing the processed data in the csv file.
8. The regular variational embedding-based software demand clustering system according to claim 5, further comprising:
and the clustering evaluation module is used for evaluating the clustering result by utilizing the clustering index.
CN202110455004.9A 2021-04-26 2021-04-26 Software demand clustering method and system based on regular variation embedding Active CN113159196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110455004.9A CN113159196B (en) 2021-04-26 2021-04-26 Software demand clustering method and system based on regular variation embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110455004.9A CN113159196B (en) 2021-04-26 2021-04-26 Software demand clustering method and system based on regular variation embedding

Publications (2)

Publication Number Publication Date
CN113159196A CN113159196A (en) 2021-07-23
CN113159196B true CN113159196B (en) 2022-09-09

Family

ID=76870951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110455004.9A Active CN113159196B (en) 2021-04-26 2021-04-26 Software demand clustering method and system based on regular variation embedding

Country Status (1)

Country Link
CN (1) CN113159196B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165309A (en) * 2018-08-06 2019-01-08 北京邮电大学 Negative training sample acquisition method, device and model training method, device
CN111581385A (en) * 2020-05-06 2020-08-25 西安交通大学 Chinese text type identification system and method for unbalanced data sampling

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107516110B (en) * 2017-08-22 2020-02-18 华南理工大学 Medical question-answer semantic clustering method based on integrated convolutional coding
CN109062763B (en) * 2018-07-31 2022-03-04 云南大学 Method for dynamically mining software process activities in real time from SVN log event stream
US20200250304A1 (en) * 2019-02-01 2020-08-06 Nec Laboratories America, Inc. Detecting adversarial examples
WO2020190772A1 (en) * 2019-03-15 2020-09-24 Futurewei Technologies, Inc. Neural network model compression and optimization
GB2599831A (en) * 2019-06-14 2022-04-13 Quantum Interface Llc Predictive virtual training systems, apparatuses, interfaces, and methods for implementing same
CN110347835B (en) * 2019-07-11 2021-08-24 招商局金融科技有限公司 Text clustering method, electronic device and storage medium
CN112417289B (en) * 2020-11-29 2023-04-07 中国科学院电子学研究所苏州研究院 Information intelligent recommendation method based on deep clustering
CN112417893A (en) * 2020-12-16 2021-02-26 江苏徐工工程机械研究院有限公司 Software function demand classification method and system based on semantic hierarchical clustering

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165309A (en) * 2018-08-06 2019-01-08 北京邮电大学 Negative training sample acquisition method, device and model training method, device
CN111581385A (en) * 2020-05-06 2020-08-25 西安交通大学 Chinese text type identification system and method for unbalanced data sampling

Also Published As

Publication number Publication date
CN113159196A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN112966074B (en) Emotion analysis method and device, electronic equipment and storage medium
CN111459491B (en) Code recommendation method based on tree neural network
CN111709518A (en) Method for enhancing network representation learning based on community perception and relationship attention
CN112215013B (en) Clone code semantic detection method based on deep learning
CN112905795A (en) Text intention classification method, device and readable medium
CN113127339B (en) Method for acquiring Github open source platform data and source code defect repair system
CN115081437B (en) Machine-generated text detection method and system based on linguistic feature contrast learning
CN112200664A (en) Repayment prediction method based on ERNIE model and DCNN model
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
CN115392252A (en) Entity identification method integrating self-attention and hierarchical residual error memory network
CN114490954B (en) Document level generation type event extraction method based on task adjustment
CN113254581A (en) Financial text formula extraction method and device based on neural semantic analysis
CN115658846A (en) Intelligent search method and device suitable for open-source software supply chain
CN115757695A (en) Log language model training method and system
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN116522912B (en) Training method, device, medium and equipment for package design language model
CN113159196B (en) Software demand clustering method and system based on regular variation embedding
CN116975634A (en) Micro-service extraction method based on program static attribute and graph neural network
CN114492458A (en) Multi-head attention and word co-occurrence based aspect-level emotion analysis method
CN114239555A (en) Training method of keyword extraction model and related device
CN112632229A (en) Text clustering method and device
CN113988083A (en) Factual information coding and evaluating method for shipping news abstract generation
CN113392929A (en) Biological sequence feature extraction method based on word embedding and self-encoder fusion
CN112329933A (en) Data processing method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant