CN113159196B - Software demand clustering method and system based on regular variation embedding - Google Patents

Software demand clustering method and system based on regular variation embedding

Info

Publication number
CN113159196B
CN113159196B (application CN202110455004.9A)
Authority
CN
China
Prior art keywords
clustering
software
vector
regular
embedded
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110455004.9A
Other languages
Chinese (zh)
Other versions
CN113159196A (en)
Inventor
崔国荣
康雁
李媛
张晓颖
李晋源
贾雪彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University YNU filed Critical Yunnan University YNU
Priority to CN202110455004.9A priority Critical patent/CN113159196B/en
Publication of CN113159196A publication Critical patent/CN113159196A/en
Application granted granted Critical
Publication of CN113159196B publication Critical patent/CN113159196B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a software demand clustering method and system based on regular variational embedding. The method comprises the following steps: acquiring software requirement data of different types of software; performing text preprocessing on the software requirement data; mapping the software requirement text to a vector space by using a BERT pre-trained sentence vector model; and clustering the sentence vectors by using a regular variational embedded clustering model. The clustering steps are as follows: Dropout regularization is performed on the sentence vectors to determine regularized vectors; feature compression is performed on the regularized vectors by using a fully connected layer; embedded features are determined with an encoder from the compressed regularized vectors; the embedded features are decoded with a decoder; a cluster partition is determined with the K-means algorithm from the embedded features; and the corresponding loss functions are determined from the embedded features, the original vector and the cluster partition and are back-propagated to determine the clustering result. The method and system improve the accuracy of software requirement text clustering.

Description

Regular variational embedding-based software demand clustering method and system
Technical Field
The invention relates to the field of data mining, in particular to a regular variational embedding-based software demand clustering method and system.
Background
The software development process comprises requirement analysis, system design, detailed design, testing and evaluation, and requirement analysis is the first task when designing good software; yet in actual development its importance is often neglected in favour of the design stage. Errors introduced during requirement analysis are invisible for most of the development process and are discovered only at the testing stage, where correcting them costs far more. The software requirement description is not only a bridge for communication between users and developers but also the basis for functional design and performance indicators; it runs through the whole software development process, and because requirements change over time it can bring great risk to later development.
The requirement analysis stage often suffers from the following problems: (1) users and developers work in different fields, each familiar with its own field but not with the other's, so communication between them is obstructed; (2) potential requirements hidden in the software requirement description cannot be mined, and because users have limited knowledge of the computer field, their descriptions of the required functions and performance are incomplete, so requirements may be omitted during analysis; (3) software requirement texts are vaguely described and suffer from sparsity, ambiguity and non-verifiability, and expressions with the same meaning are understood differently in different fields; these shortcomings show up as redundant function descriptions, a complicated development process and high module coupling, which degrade user experience and operability and make software cost and development efficiency hard to guarantee, ultimately causing the project to fail.
These problems in the requirement analysis stage hinder software development. If similar descriptions are aggregated by a clustering method, only the descriptions within the same class need to be examined, and each class can then be used to characterize the functions and application fields of the software.
Traditional text clustering first preprocesses the text, then extracts features from the preprocessed word sequences, maps them into a vector space with a word vector tool, and partitions the samples based on a distance metric. Traditional feature extraction is based on statistical analysis: an evaluation function assigns weights to the existing feature parameters, and each feature weight represents the importance of a word within the whole sentence. This approach, however, usually ignores the positional relations and the semantic information of the sequence; words in the text are treated as isolated individuals, semantics cannot be expressed accurately from frequency alone, and the context is ignored.
The commonly used clustering methods are mostly based on partitioning, density and hierarchical approaches, such as K-means, DBSCAN and Agglomerative Clustering. These methods achieve good results in the text and image fields, but they are not necessarily suitable for every domain, and traditional clustering also depends heavily on the quality of the extracted features and on the training of the word vector model. Deep learning focuses on representation learning: a deep model learns the feature representation of a text, offers stronger nonlinearity and fitting ability than traditional feature extraction, greatly improves on the features used by traditional clustering, and can reach higher clustering accuracy by tuning the network parameters. However, most deep clustering is two-stage: features are first extracted from the text, and the compressed features are then partitioned by a conventional clustering method. The two-stage structure is clear, and the feature extractor compresses high-dimensional features into low-dimensional ones, reducing data sparsity, retaining more textual semantics and giving the clustering better continuity and interpretability. Its obvious drawback is that representation learning and clustering are carried out in two separate steps, so the cluster centers and the extracted features cannot be improved according to the clustering result; the method easily falls into local optima, the forward clustering pass is one-off, and the clustering output can only be improved within a small range.
Traditional text clustering methods are easily influenced by the initial cluster centers, most are based on distance metrics, and they still tend to fall into local optima. Deep clustering combines deep learning with traditional clustering; its model structure is easy to understand and its results improve somewhat on traditional clustering, but most deep clustering methods are two-stage and cannot back-propagate to optimize the cluster centers and the sample assignment.
At present, clustering methods for software requirement texts are rare, and most existing work focuses on improving the clustering method itself while neglecting the feature extraction scheme. In practice no clustering method is universal: the feature extraction scheme and the clustering method must be chosen according to the data distribution, the feature extractor is difficult to decide on, and it indirectly determines the sample partition to some extent. Training the word vector model is equally important because the training is unsupervised, but large-scale software requirement corpora are scarce; without good word vectors the clustering method becomes meaningless.
Software requirement texts are unordered, unstructured data with much redundant information, so they cannot be fed into a machine learning model directly; abbreviation expansion, spelling correction, stemming and lemmatization are required. Traditional feature extractors apply a linear mapping whose results lack a reasonable explanation, and the text must be mapped into a vector form a computer can recognize before cluster partitioning can be performed. After analyzing the software requirement data, one must decide, according to its sample distribution, which clustering method and feature extractor to adopt and how to design a neural clustering model that can be optimized by back propagation, while ensuring that the clustering loss does not damage the embedding space of the feature vectors and that the local structure of the samples is preserved.
Therefore, a software demand clustering method is urgently needed that improves the accuracy of software requirement text clustering in view of the high dispersion, high noise and sparsity of software requirement texts.
Disclosure of Invention
The invention aims to provide a regular variational embedding-based software demand clustering method and a regular variational embedding-based software demand clustering system, which are used for improving the accuracy of software demand text clustering.
In order to achieve the purpose, the invention provides the following scheme:
a regular variational embedding-based software demand clustering method comprises the following steps:
acquiring software requirement data of different types of software;
performing text preprocessing on the software requirement data to determine a software requirement text;
mapping the software requirement text to a vector space by using a sentence vector model pre-trained by BERT, and determining a sentence vector;
clustering the sentence vectors by using a regular variational embedded clustering model to determine a clustering result;
the step of clustering the sentence vectors by using the regular variation embedded clustering model comprises the following steps:
the regular variational embedded clustering model carries out Dropout regularization processing on the sentence vectors to determine regularized vectors;
performing feature compression on the regularization vectors by utilizing a full connection layer; determining embedding characteristics by adopting an encoder according to the compressed regularized vector;
decoding the embedded features by using a decoder to determine an original vector;
determining clustering division results by adopting a K-means algorithm according to the embedding characteristics;
and determining a corresponding loss function according to the embedded feature, the original vector and the clustering result, and performing back propagation on the embedded feature, the original vector and the loss function corresponding to the clustering result to determine the clustering result.
Optionally, the acquiring software requirement data of different types of software specifically includes:
acquiring software requirement data of 11 types of software of the Windows platform from the Softpedia website by using the Scrapy technology;
storing each type of software requirement data separately in a csv format, and labeling each type of software requirement data at the same time.
Optionally, the text preprocessing is performed on the software requirement data, and the determining of the software requirement text specifically includes:
eliminating html tags in the software requirement data by using the regular expression;
correcting the abbreviation and the scrambled words of the software requirement data with the html tags removed;
carrying out stem extraction and morphological restoration on the corrected software requirement data;
and storing the processed data in the csv file.
Optionally, the regularization vectors are feature compressed by using a full connection layer; and determining embedding characteristics by using an encoder according to the compressed regularization vector, specifically comprising:
determining the embedded features by using the formula z = u + exp(δ) · ε;
wherein u and δ are the two parameters, namely the mean and the variance, with which the encoder converts the compressed regularized vector into the hidden space, and ε is a tensor following the normal distribution, ε ~ N(0, 1).
Optionally, the clustering the sentence vectors by using a regular variational embedded clustering model to determine a clustering result, and then further comprising:
and evaluating the clustering result by using the clustering index.
A regular variational embedding-based software demand clustering system comprises:
the software requirement data acquisition module is used for acquiring software requirement data of different types of software;
the text preprocessing module is used for performing text preprocessing on the software requirement data and determining a software requirement text;
the sentence vector determining module is used for mapping the software requirement text to a vector space by using a BERT pre-trained sentence vector model and determining a sentence vector;
the clustering result determining module is used for clustering the sentence vectors by utilizing a regular variation embedded clustering model to determine a clustering result;
the step of clustering the sentence vectors by using the regular variational embedded clustering model comprises the following steps:
the regular variational embedded clustering model carries out Dropout regularization processing on the sentence vectors to determine regularized vectors;
performing feature compression on the regularization vectors by utilizing a full connection layer; determining embedding characteristics by adopting an encoder according to the compressed regularized vector;
decoding the embedded features by using a decoder to determine an original vector;
determining clustering division results by adopting a K-means algorithm according to the embedding characteristics;
and determining a corresponding loss function according to the embedded feature, the original vector and the clustering result, and performing back propagation on the embedded feature, the original vector and the loss function corresponding to the clustering result to determine the clustering result.
Optionally, the software requirement data acquiring module specifically includes:
the software demand data acquisition unit is used for acquiring software demand data of different types of software of the Windows platform from the Softpedia website by utilizing the Scrapy technology;
and the software requirement data storage and labeling unit is used for separately storing each type of software requirement data in a csv format and labeling each type of software requirement data.
Optionally, the text preprocessing module specifically includes:
the text removing unit is used for removing the html tags in the software requirement data by using the regular expression;
the text correction unit is used for correcting the software requirement data from which the html tags are removed by the abbreviative words and the messy code words;
the text extraction unit is used for carrying out stem extraction and morphological restoration on the corrected software requirement data;
and the text storage unit is used for storing the processed data in the csv file.
Optionally, the method further comprises:
and the clustering evaluation module is used for evaluating the clustering result by utilizing the clustering index.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a software demand clustering method and system based on regular variational embedding, which utilize a full connection layer to carry out feature compression on the regular vector; determining embedding characteristics by adopting an encoder according to the compressed regularized vector; the scheme solves the problem that the two-end type clustering can not reversely propagate and optimize the clustering center and the sample distribution, the local structure of the data is reserved by learning the internal hidden distribution of the software demand data, the original data is simulated by the skills of the heavy parameters, the essential characteristics of the data are reflected, the characteristic quality and the clustering accuracy are improved, and the optimal effect is achieved. In text representation, a BERT sentence vector model is used for mapping an original text to a vector space, as the original text is not subjected to noise processing, Dropout regularization is fused at the input end of the model to randomly inhibit the work of part of neurons, noise interference is reduced, overfitting of the model is prevented, robustness of the model is enhanced, the Dropout regularized vector is input into a variational embedded clustering method, the embedded space can learn original data distribution through an encoder, a recomparametric skill ensures that a hidden layer can better abstract input data characteristics, then the embedded space vector is used for decoding to learn characteristics on one hand, and on the other hand, a K-means clustering method is used for sample division to achieve iteration stop condition output clustering results through a back propagation optimization loss function.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a regular variational embedding-based software demand clustering method provided by the invention;
FIG. 2 is a flowchart of an algorithm for regular variational-based embedded software demand clustering provided by the present invention;
FIG. 3 is a network architecture diagram of a regular variational-embedding-based software demand clustering provided by the present invention;
FIG. 4 is a schematic diagram of Dropout regularization in an embodiment;
FIG. 5 is a diagram illustrating a conventional clustering comparison;
FIG. 6 is a schematic diagram of depth clustering comparison;
fig. 7 is a schematic structural diagram of a regular variational-embedding-based software demand clustering system provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention aims to provide a regular variational embedding-based software demand clustering method and a regular variational embedding-based software demand clustering system, which are used for improving the accuracy of software demand text clustering.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Fig. 1 is a schematic flow chart of a regular-variation-embedding-based software demand clustering method, as shown in fig. 1, the regular-variation-embedding-based software demand clustering method provided by the present invention includes:
s101, acquiring software requirement data of different types of software.
S101 specifically comprises the following steps:
and acquiring software requirement data of 11 types of software of the Windows platform under the Softpedia website by using a Scapy technology.
Each type of software requirement data is stored separately in a csv format, and each type of software requirement data is labeled. The software requirement descriptions with the same function are output and stored in csv format, and the requirement description information with the same function is labeled.
Because little open-source data describing software requirements exists on the internet, 11 classes of data under the Windows platform were obtained from the Softpedia website using the Scrapy crawler technology. The 11 classes are anti, Authoring-Tools, CD-DVD-Blu-ray-Tools, Compression-Tools, Desktop-Enhancements, File-managers, Gaming-Related, iPod-Tools, Maps & GPS, Mobile-Phone-Tools and Network-Tools. Because the site runs an anti-crawling program, an IP proxy pool and request delays were configured. Each class of data is stored separately in a csv format, the samples of the 11 classes are labeled at the same time, the class names are sorted by their ASCII codes and the samples of each class are labeled with a number from 0 to 10, and the 11 labeled classes are finally merged into a single file.
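For illustration only, a minimal sketch of the labeling and merging step described above might look as follows (the directory layout, file names and the use of pandas are assumptions, not details given in the patent):

```python
import glob
import pandas as pd

# Assumed layout: one csv file per crawled software category under data/.
class_files = sorted(glob.glob("data/*.csv"))   # sorted names stand in for the ASCII ordering

frames = []
for label, path in enumerate(class_files):      # labels 0-10 for the 11 classes
    df = pd.read_csv(path)
    df["label"] = label                          # label every sample of this class
    frames.append(df)

# Merge the labeled classes into a single csv file.
pd.concat(frames, ignore_index=True).to_csv("requirements_all.csv", index=False)
```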
And S102, performing text preprocessing on the software requirement data, and determining a software requirement text.
S102 specifically includes:
and eliminating the html tags in the software requirement data by using the regular expression.
The abbreviated words and scrambled words in the software requirement data from which the html tags have been removed are corrected.
And carrying out stem extraction and morphological restoration on the corrected software requirement data.
And storing the processed data in the csv file.
Because the software requirement data is obtained directly from the website, html tag content is unavoidable. The tag formats include the following types: <div class="test"></div>, <img/>, and custom tags such as <My-Tag></My-Tag>. The html tags in the English text are removed with the regular expression module re: one pattern matches paired tags such as </div>, and another matches self-closing tags such as <img/>. Contractions are expanded with re.sub, for example re.sub(r"can't", "can not", text), and the pyenchant spell checker is used to find misspelled words and correct them. SnowballStemmer completes stemming (for example, "protecting" becomes "protect" after extraction) and WordNetLemmatizer completes lemmatization. Finally, the data of all classes are stored in a csv file.
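A minimal preprocessing sketch along these lines is shown below; the exact regular expressions and the contraction handled are illustrative assumptions, while re, pyenchant, SnowballStemmer and WordNetLemmatizer are the tools named in the text:

```python
import re
import enchant                      # pyenchant spell checker
from nltk.stem import SnowballStemmer, WordNetLemmatizer

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()
checker = enchant.Dict("en_US")

def preprocess(text: str) -> str:
    # Strip html tags such as <div class="test"></div>, <img/> and custom tags.
    text = re.sub(r"<[^>]+>", " ", text)
    # Expand contractions, e.g. "can't" -> "can not".
    text = re.sub(r"can't", "can not", text)
    tokens = []
    for word in text.split():
        # Replace misspelled or garbled words with the first spelling suggestion.
        if word.isalpha() and not checker.check(word):
            suggestions = checker.suggest(word)
            if suggestions:
                word = suggestions[0]
        # Stemming (e.g. "protecting" -> "protect") followed by lemmatization.
        word = lemmatizer.lemmatize(stemmer.stem(word.lower()))
        tokens.append(word)
    return " ".join(tokens)
```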
S103, mapping the software requirement text to a vector space by using a sentence vector model pre-trained with BERT, and determining the sentence vectors. A sentence vector represents the textual information of the whole sentence and takes the positional relations and the internal relations between words into account. The BERT pre-trained sentence vector model outputs sentence vector representations of the same dimension, so that after training each original software requirement description is mapped into a vector space of that fixed dimension.
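As an illustration of this step (the patent only states that a BERT pre-trained sentence vector model producing 768-dimensional vectors is used; the specific checkpoint below is an assumption), sentence vectors could be obtained with the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer

# Illustrative BERT-based sentence embedding checkpoint, not the one used by the authors.
model = SentenceTransformer("bert-base-nli-mean-tokens")

requirements = [
    "protects your computer from viruses and malware",
    "compresses and extracts archive files in zip and rar format",
]
sentence_vectors = model.encode(requirements)   # shape: (n_samples, 768)
print(sentence_vectors.shape)
```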
S104, clustering the sentence vectors by using a regular variational embedded clustering model to determine a clustering result. The processed sentence vectors are input into the regular variational embedded clustering model, the high-dimensional sparse features are compressed into low-dimensional dense vectors by encoding, samples similar to the original data are generated in probabilistic form to reflect the essential features of the data, and feature learning and cluster partitioning are then performed simultaneously on the newly generated vectors.
As shown in fig. 2 and fig. 3, the step of clustering the sentence vectors by using the regular variational embedded clustering model is as follows:
S401, the regular variational embedded clustering model performs Dropout regularization on the sentence vectors, and the regularized vectors are determined.
Dropout regularization is fused at the input: part of the neurons of the neural network stop working with a certain probability, which prevents the model from greedily learning unimportant features in the data and from absorbing too much noise while the features are compressed, which would harm the text clustering. Randomly suppressing neurons is equivalent to denoising, makes the model robust during learning, and enhances its generalization ability. Neural network training cannot escape two problems: (1) it is time-consuming; (2) it overfits easily. Fusing Dropout addresses both problems: some neurons are stopped at random and do not take part in training while the remaining neurons work together, which simplifies the network model, reduces the time complexity, and removes and weakens the joint adaptation between neuron nodes. Assume that the input vector set is x = [x_1, x_2, x_3, ..., x_m], where x_i represents each sample, the hidden-layer vector is z = [z_1, z_2, z_3, ..., z_k], and the reconstructed vector is x' = [x'_1, x'_2, x'_3, ..., x'_m].
The activation function of the fully connected layers of the network is the ReLU function. Regularization is fused at the input, and Dropout randomly maps the input to the regularized vector x̃:
r_j ~ Bernoulli(p), x̃ = r ⊙ x,
where r is a random 0/1 mask vector and p is the retention probability.
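A minimal PyTorch sketch of the input-side Dropout regularization (the dropout probability p = 0.2 and the batch size are assumed values, not taken from the patent):

```python
import torch
import torch.nn as nn

# Dropout fused at the input of the clustering network: during training a random
# subset of the 768 sentence-vector dimensions is zeroed, which acts as noise
# suppression; at inference time Dropout is a no-op.
input_dropout = nn.Dropout(p=0.2)

x = torch.randn(32, 768)          # a batch of BERT sentence vectors
x_reg = input_dropout(x)          # regularized vectors fed to the encoder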
S402, performing feature compression on the regularized vector by using a full connection layer; and determining embedding characteristics by adopting an encoder according to the compressed regularization vector.
S402 specifically includes:
The embedded features are determined using the formula z = u + exp(δ) · ε.
Here u and δ are the two parameters, namely the mean and the variance, with which the encoder converts the compressed regularized vector into the hidden space, and ε is a tensor following the normal distribution, ε ~ N(0, 1).
Feature compression is performed on the high-dimensional sparse vector with a fully connected layer, and a constraint is applied to the encoder network so that it generates a sample set following a Gaussian distribution. With the reparameterization trick, any related data can be obtained from the mean u and the variance parameter δ of the Gaussian distribution, so the embedding space learns the original data distribution and δ dynamically adjusts the noise intensity; the sample is then reconstructed by the decoding layer, generating data similar to the original sample:
u = f(W_u · x̃ + b_u), δ = f(W_δ · x̃ + b_δ);
ε ~ N(0, 1);
z = u + exp(δ) · ε.
S403, decoding the embedded features with a decoder to determine the original vector. The embedded feature z formed by the reparameterization is decoded; the decoder also uses fully connected layers, the decoding process is the reverse of the encoding process, and decoding restores the original vector x':
x' = f(W_x' · z + b_x').
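Putting S402 and S403 together, a minimal PyTorch sketch of the compression, reparameterization and decoding steps might look as follows; the hidden and embedding dimensions are illustrative assumptions, not values given in the patent:

```python
import torch
import torch.nn as nn

class VariationalEmbedding(nn.Module):
    """Sketch of the regularized variational embedding: encode -> reparameterize -> decode."""

    def __init__(self, in_dim=768, hidden_dim=256, embed_dim=10):
        super().__init__()
        self.dropout = nn.Dropout(p=0.2)                  # input regularization
        self.compress = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.to_u = nn.Linear(hidden_dim, embed_dim)      # mean u
        self.to_delta = nn.Linear(hidden_dim, embed_dim)  # log-scale delta
        self.decode = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU(),
                                    nn.Linear(hidden_dim, in_dim))

    def forward(self, x):
        h = self.compress(self.dropout(x))
        u, delta = self.to_u(h), self.to_delta(h)
        eps = torch.randn_like(delta)                     # epsilon ~ N(0, 1)
        z = u + torch.exp(delta) * eps                    # z = u + exp(delta) * epsilon
        x_rec = self.decode(z)                            # reconstructed vector x'
        return z, x_rec, u, delta
```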
S404, determining the cluster partition with the K-means algorithm from the embedded features. The embedded feature z is used as the input of the K-means algorithm, and K-means assigns each sample to the nearest class according to the nearest-neighbour principle, i.e. if distance(x_i, c_j) = min{distance(x_i, c_t), t = 1, 2, 3, ..., k}, then x_i ∈ y_j. The cluster center c_j is recomputed in every training round:
c_j = (1 / |y_j|) · Σ_{x_i ∈ y_j} x_i.
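For this partitioning step, the cluster centers c_j and the nearest-center assignments can be obtained, for example, with scikit-learn's KMeans; the embeddings below are random stand-ins for the real embedded features:

```python
import numpy as np
from sklearn.cluster import KMeans

# z: embedded features of all samples, shape (n_samples, embed_dim); 11 classes.
z = np.random.randn(1000, 10)           # stand-in for the real embeddings
kmeans = KMeans(n_clusters=11, n_init=10, random_state=0).fit(z)
labels = kmeans.labels_                  # nearest-center assignment per sample
centers = kmeans.cluster_centers_        # cluster centers c_j, recomputed each round
```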
The similarity between an embedded feature z_i and a cluster center c_j is characterized with a Student's t-distribution:
q_ij = (1 + ||z_i − c_j||²)^(−1) / Σ_{j'} (1 + ||z_i − c_{j'}||²)^(−1).
Using this similarity, the target distribution is defined as:
p_ij = (q_ij² / f_j) / Σ_{j'} (q_{ij'}² / f_{j'}), where f_j = Σ_i q_ij.
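A NumPy sketch of the soft assignment q_ij and the target distribution p_ij as written above (assuming one degree of freedom for the Student's t-distribution):

```python
import numpy as np

def soft_assignment(z, centers):
    """Student's t-distribution similarity q_ij between embeddings and cluster centers."""
    dist_sq = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    q = 1.0 / (1.0 + dist_sq)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target distribution p_ij derived from q_ij."""
    weight = q ** 2 / q.sum(axis=0)          # q_ij^2 / f_j with f_j = sum_i q_ij
    return weight / weight.sum(axis=1, keepdims=True)
```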
S405, determining the corresponding loss functions according to the embedded features, the original vector and the cluster partition result, and back-propagating these loss functions to determine the clustering result.
In S405 the loss function is optimized: after decoding and cluster partitioning, the loss function is optimized by back propagation. Three losses are involved: the loss generated by sample reconstruction, the loss generated by clustering, and the loss of the new samples generated by the reparameterization:
Reconstruction Loss = ||x − x'||²;
Cluster Loss = KL(P ‖ Q) = Σ_i Σ_j p_ij · log(p_ij / q_ij);
KL Loss = KL(N(u, exp(δ)²) ‖ N(0, 1)) = −(1/2) · Σ (1 + 2δ − u² − exp(2δ)).
The total loss function is defined as: L = KL Loss + Reconstruction Loss + α · Cluster Loss.
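A hedged PyTorch sketch of the combined loss; the KL term below assumes that δ is interpreted as the logarithm of the standard deviation (consistent with z = u + exp(δ)·ε), and the weight α = 0.1 is an illustrative choice:

```python
import torch
import torch.nn.functional as F

def total_loss(x, x_rec, u, delta, q, p, alpha=0.1):
    # Reconstruction loss: ||x - x'||^2 (averaged over the batch).
    recon = F.mse_loss(x_rec, x)
    # KL loss of N(u, exp(delta)^2) against the standard normal prior N(0, 1).
    kl = -0.5 * torch.mean(1 + 2 * delta - u.pow(2) - torch.exp(2 * delta))
    # Clustering loss: KL(P || Q) between the target and the soft assignment.
    cluster = torch.mean(torch.sum(p * torch.log(p / q), dim=1))
    return kl + recon + alpha * cluster
```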
After S104, further comprising:
and evaluating the clustering result by using the clustering index.
The samples of each class are calculated and counted according to the principle that "the minority obeys the majority", and the quality of the method is measured with evaluation indices such as the Silhouette Coefficient (SC), the Calinski-Harabasz index (CH) and the Davies-Bouldin index (DBI), whose formulas are as follows:
SC: s(i) = (b(i) − a(i)) / max{a(i), b(i)};
CH: CH = [tr(B_k) / (k − 1)] / [tr(W_k) / (n − k)], where B_k and W_k are the between-class and within-class dispersion matrices;
DBI: DBI = (1/k) · Σ_{i=1}^{k} max_{j≠i} (s̄_i + s̄_j) / d(c_i, c_j), where s̄_i is the average distance of the samples of class i to its center and d(c_i, c_j) is the distance between centers.
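These three indices are available in scikit-learn; a minimal usage sketch (with random stand-in embeddings and labels, since the real data is not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score

z = np.random.randn(1000, 10)                        # embedded features (stand-in)
labels = KMeans(n_clusters=11, n_init=10, random_state=0).fit_predict(z)

print("SC :", silhouette_score(z, labels))
print("CH :", calinski_harabasz_score(z, labels))
print("DBI:", davies_bouldin_score(z, labels))
```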
as a specific embodiment, the invention adopts the script crawler technology to acquire software requirements on a Softpedia platform, and the requirements are objective description of software functions. The Softpedia platform provides users with various tools under various system platforms, including Windows, Linux, Android, IOS and the like. There are dozens of software categories such as antivirus software, compression software, file management, game software and map positioning software under each platform, and the tool usage descriptions under different platforms are different. By 2019, 5 and 27 in Beijing, 113016 total applications are recorded in the website, and the total number of times of downloading the applications is 3320687342.
The software requirement data are functional descriptions of software under the Windows platform. The platform hosts different types of software, each type contains a large number of apps, and the functions and behaviour of each app have corresponding description information. In total 11 classes and 15598 pieces of software data were crawled, with varying text lengths; the crawled software categories and their counts are shown in Table 1.
TABLE 1 software requirements data sheet
(The contents of Table 1 are provided as an image in the original publication and are not reproduced here.)
The invention mainly compares clustering accuracy and adopts purity as the accuracy measure; the formula is as follows:
Purity = (1/N) · Σ_k max_j |ω_k ∩ c_j|, where N is the number of samples, ω_k is the set of samples assigned to cluster k, and c_j is the set of samples of true class j.
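A small NumPy sketch of the purity computation ("the minority obeys the majority"); the argument names are illustrative:

```python
import numpy as np

def purity(true_labels, cluster_labels):
    """Each cluster votes for its majority true class; purity is the fraction of samples covered by those majorities."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    majority_total = 0
    for c in np.unique(cluster_labels):
        members = true_labels[cluster_labels == c]
        majority_total += np.bincount(members).max()   # size of the largest true class in this cluster
    return majority_total / len(true_labels)

# Example: purity([0, 0, 1, 1, 2], [0, 0, 0, 1, 1]) == 0.6
```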
in the invention, after texts with different lengths are trained by BERT, the characteristic dimension of each sample is 768 dimensions, the problem of vector irregularity caused by different text lengths is solved, context semantic information is considered, and in order to verify that an embedded model of the BERT sentence is superior to an average vector, the traditional Clustering algorithms DBSCAN, Spectral Clustering, Hierarchical Clustering, GMM, K-means and SOM are compared with the method disclosed by the invention, as shown in FIG. 5.
According to Fig. 5, the software requirement data suit the K-means and SOM clustering algorithms, while the improvement of the other traditional clustering algorithms is not obvious; the clustering accuracy improves by 7.79% over the average vector with the SOM algorithm and by 5.16% with the K-means algorithm, showing that the sentence vector model effectively improves the performance of a clustering algorithm compared with the average vector. Compared with the traditional clustering algorithms, the DVEC model of this chapter has the highest clustering accuracy: 60.11% with the average vector and 62.92% with the sentence vector. Compared with the traditional K-means algorithm, which had the highest accuracy among the traditional methods, the accuracy improves by 5.72% on the average vector and by 3.37% on the sentence vector, which shows that the embedding-space vectors of the regular variational embedded clustering algorithm can improve the accuracy of the clustering algorithm while learning the feature representation, and that the joint optimization of the two greatly improves the algorithm performance. The clustering model of this chapter is also better suited to software requirement text data.
The invention also compares the accuracy with the more popular deep clustering algorithms AE + K-means, DEC and IDEC, and the comparison is shown in FIG. 6.
The comparison shows that AE + K-means has the lowest accuracy: the auto-encoder only compresses the data features to reduce dimensionality, the compressed features are clustered by the K-means algorithm, which lowers the algorithm complexity, but no other improvement is made, and compared with the model of this chapter it is 4.09% lower on the average vector and 3% lower on the sentence vector. The DEC model is 2.03% lower than the model of this chapter on the average vector and 2.75% lower on the sentence vector; compared with DEC, the decoder structure is not removed here, so the local feature structure is maintained and the clustering loss does not distort the feature space. The IDEC model only learns the error between sample input and output and does not consider vector noise or the sample distribution, whereas the DVEC model proposed here removes noise at the input while using VEC to learn the sample distribution and the cluster partition. On Average Embedding, the clustering accuracy of DVEC is slightly lower than that of IDEC, because the weighted sum of average vectors cannot fully express the semantics, so the accuracy neither rises nor falls. On Sentence Embedding, DVEC performs best with an accuracy of 62.92%, which is 2.81% higher than on Average Embedding and 1.14% higher than IDEC. The sentence vector considers the positional relations between words but not the influence of noise; the Dropout regularization fused at the DVEC input solves this problem, while the embedding space reconstructs samples from the normal distribution with the reparameterization trick, which suits situations with little data. The encoding-decoding process is similar to a generative-adversarial process, which makes the feature vectors extracted in the embedding space better, and the method is effective in improving the accuracy of the clustering algorithm.
The performance of the clustering results was evaluated on SC, CH, DBI as shown in table 2.
TABLE 2 comparison of various evaluation indices
(The contents of Table 2 are provided as an image in the original publication and are not reproduced here.)
As can be seen from Table 2, the Silhouette Coefficient of DVEC is second only to AE + K-means and higher than those of DEC and IDEC, indicating that the intra-class distances of DVEC are smaller and the inter-class distances larger, so the classes are separated better. On the Calinski-Harabasz index, the value of DVEC is the smallest among the compared models, which shows that the intra-class covariance of the algorithm model of this chapter is the smallest, the inter-class covariance the largest, and the cluster partitioning effect the best. On the Davies-Bouldin index, the value of the DVEC model is also the smallest; the smaller the value, the lower the similarity between different classes, so the cluster partitioning is clear.
Fig. 7 is a schematic structural diagram of a software demand clustering system based on a canonical variate embedding type, as shown in fig. 7, the software demand clustering system based on the canonical variate embedding type provided by the present invention includes:
a software requirement data obtaining module 701, configured to obtain software requirement data of different types of software;
a text preprocessing module 702, configured to perform text preprocessing on the software requirement data to determine a software requirement text;
a sentence vector determining module 703, configured to map the software requirement file to a vector space by using a sentence vector model pre-trained by BERT, and determine a sentence vector;
a clustering result determining module 704, configured to cluster the sentence vectors by using a regular variational embedded clustering model, and determine a clustering result;
the step of clustering the sentence vectors by using the regular variation embedded clustering model comprises the following steps:
the regular variational embedded clustering model carries out Dropout regularization processing on the sentence vectors and determines regularized vectors;
performing feature compression on the regularized vector by using a full connection layer; determining embedding characteristics by adopting an encoder according to the compressed regularized vector;
decoding the embedded features by using a decoder to determine an original vector;
determining clustering division results by adopting a K-means algorithm according to the embedding characteristics;
and determining corresponding loss functions according to the embedded features, the original vectors and the clustering division results, and performing back propagation on the embedded features, the original vectors and the loss functions corresponding to the clustering division results to determine the clustering results.
The software requirement data obtaining module 701 specifically includes:
the software demand data acquisition unit is used for acquiring software demand data of different types of software of the Windows platform from the Softpedia website by utilizing the Scrapy technology;
and the software requirement data storage and labeling unit is used for independently storing each type of software requirement data in a csv format and labeling each type of software requirement data.
The text preprocessing module 702 specifically includes:
the text removing unit is used for removing the html tags in the software requirement data by using the regular expression;
the text correction unit is used for correcting the software requirement data from which the html tags are removed by the abbreviative words and the messy code words;
the text extraction unit is used for carrying out stem extraction and morphological restoration on the corrected software requirement data;
and the text storage unit is used for storing the processed data in the csv file.
The invention provides a regular variational embedding-based software demand clustering system, which further comprises:
and the clustering evaluation module is used for evaluating the clustering result by utilizing the clustering index.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the description of the method part.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the foregoing, the description is not to be taken in a limiting sense.

Claims (8)

1. A regular variational embedding-based software demand clustering method is characterized by comprising the following steps:
acquiring software requirement data of different types of software;
performing text preprocessing on the software requirement data to determine a software requirement text;
mapping the software requirement text to a vector space by using a sentence vector model pre-trained by BERT, and determining a sentence vector;
clustering the sentence vectors by using a regular variation embedded clustering model to determine a clustering result;
the step of clustering the sentence vectors by using the regular variational embedded clustering model comprises the following steps:
the regular variational embedded clustering model carries out Dropout regularization processing on the sentence vectors and determines regularized vectors;
performing feature compression on the regularization vectors by utilizing a full connection layer; determining embedding characteristics by adopting an encoder according to the compressed regularized vector;
decoding the embedded features by using a decoder to determine an original vector;
determining clustering division results by adopting a K-means algorithm according to the embedding characteristics;
determining a corresponding loss function according to the embedded feature, the original vector and the clustering division result, and performing back propagation on the embedded feature, the original vector and the loss function corresponding to the clustering division result to determine the clustering result;
performing feature compression on the regularized vector by using a full connection layer; and determining embedding characteristics by using an encoder according to the compressed regularization vector, specifically comprising:
determining the embedded features by using the formula z = u + exp(δ) · ε;
determining a loss function by using the formula L = KL Loss + Reconstruction Loss + α · Cluster Loss;
wherein u and δ are the two parameters, namely the mean and the variance, with which the encoder converts the compressed regularized vector into the hidden space; ε is a tensor following the normal distribution, ε ~ N(0, 1); L is the loss function; Reconstruction Loss is the loss generated by sample reconstruction, Reconstruction Loss = ||x − x'||², where x is the input vector set and x' is the original vector; Cluster Loss is the loss caused by clustering,
Cluster Loss = KL(P ‖ Q) = Σ_i Σ_j p_ij · log(p_ij / q_ij),
where p_ij is the target distribution and q_ij is the similarity between the embedded feature and the cluster center; KL Loss is the loss of the new samples generated by the reparameterization,
KL Loss = KL(N(u, exp(δ)²) ‖ N(0, 1)) = −(1/2) · Σ (1 + 2δ − u² − exp(2δ)).
2. the regular variational-embedding-based software demand clustering method according to claim 1, wherein the acquiring of software demand data of different types of software specifically comprises:
acquiring software requirement data of 11 types of software of the Windows platform from the Softpedia website by utilizing the Scrapy technology;
storing each type of software requirement data separately in a csv format, and labeling each type of software requirement data at the same time.
3. The regular-variation-embedding-based software demand clustering method according to claim 1, wherein the step of performing text preprocessing on the software demand data to determine a software demand text specifically comprises:
eliminating html tags in the software requirement data by using the regular expression;
correcting the abbreviation and the scrambled words of the software requirement data with the html tags removed;
carrying out stem extraction and morphological restoration on the corrected software requirement data;
and storing the processed data in the csv file.
4. The regular variational embedded-based software demand clustering method according to claim 1, wherein the regular variational embedded clustering model is used to cluster the sentence vectors to determine a clustering result, and then the method further comprises:
and evaluating the clustering result by using the clustering index.
5. A regular variational embedding-based software demand clustering system is characterized by comprising:
the software requirement data acquisition module is used for acquiring software requirement data of different types of software;
the text preprocessing module is used for performing text preprocessing on the software requirement data and determining a software requirement text;
the sentence vector determining module is used for mapping the software requirement text to a vector space by using a sentence vector model pre-trained by BERT to determine a sentence vector;
the clustering result determining module is used for clustering the sentence vectors by utilizing a regular variation embedded clustering model to determine a clustering result;
the step of clustering the sentence vectors by using the regular variational embedded clustering model comprises the following steps:
the regular variational embedded clustering model carries out Dropout regularization processing on the sentence vectors to determine regularized vectors;
performing feature compression on the regularized vector by using a full connection layer; determining embedding characteristics by adopting an encoder according to the compressed regularized vector;
decoding the embedded features by using a decoder to determine an original vector;
determining clustering division results by adopting a K-means algorithm according to the embedding characteristics;
determining a corresponding loss function according to the embedded feature, the original vector and the clustering division result, and performing back propagation on the embedded feature, the original vector and the loss function corresponding to the clustering division result to determine the clustering result;
determining the embedded features by using the formula z = u + exp(δ) · ε;
determining a loss function by using the formula L = KL Loss + Reconstruction Loss + α · Cluster Loss;
wherein u and δ are the two parameters, namely the mean and the variance, with which the encoder converts the compressed regularized vector into the hidden space; ε is a tensor following the normal distribution, ε ~ N(0, 1); L is the loss function; Reconstruction Loss is the loss generated by sample reconstruction, Reconstruction Loss = ||x − x'||², where x is the input vector set and x' is the original vector; Cluster Loss is the loss caused by clustering,
Cluster Loss = KL(P ‖ Q) = Σ_i Σ_j p_ij · log(p_ij / q_ij),
where p_ij is the target distribution and q_ij is the similarity between the embedded feature and the cluster center; KL Loss is the loss of the new samples generated by the reparameterization,
KL Loss = KL(N(u, exp(δ)²) ‖ N(0, 1)) = −(1/2) · Σ (1 + 2δ − u² − exp(2δ)).
6. the regular variational embedding-based software demand clustering system according to claim 5, wherein the software demand data acquisition module specifically comprises:
the software demand data acquisition unit is used for acquiring software demand data of different types of software of the Windows platform from the Softpedia website by utilizing the Scrapy technology;
and the software requirement data storage and labeling unit is used for separately storing each type of software requirement data in a csv format and labeling each type of software requirement data.
7. The regular-variation-embedding-based software demand clustering system of claim 5, wherein the text preprocessing module specifically comprises:
the text rejection unit is used for rejecting html tags in the software requirement data by using the regular expression;
the text correction unit is used for correcting the software requirement data from which the html tags are removed by the abbreviative words and the messy code words;
the text extraction unit is used for carrying out stem extraction and morphological restoration on the corrected software requirement data;
and the text storage unit is used for storing the processed data in the csv file.
8. The regular variational embedding-based software demand clustering system according to claim 5, further comprising:
and the clustering evaluation module is used for evaluating the clustering result by utilizing the clustering index.
CN202110455004.9A 2021-04-26 2021-04-26 Software demand clustering method and system based on regular variation embedding Active CN113159196B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110455004.9A CN113159196B (en) 2021-04-26 2021-04-26 Software demand clustering method and system based on regular variation embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110455004.9A CN113159196B (en) 2021-04-26 2021-04-26 Software demand clustering method and system based on regular variation embedding

Publications (2)

Publication Number Publication Date
CN113159196A CN113159196A (en) 2021-07-23
CN113159196B true CN113159196B (en) 2022-09-09

Family

ID=76870951

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110455004.9A Active CN113159196B (en) 2021-04-26 2021-04-26 Software demand clustering method and system based on regular variation embedding

Country Status (1)

Country Link
CN (1) CN113159196B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165309A (en) * 2018-08-06 2019-01-08 北京邮电大学 Negative training sample acquisition method, device and model training method, device
CN111581385A (en) * 2020-05-06 2020-08-25 西安交通大学 Chinese text type identification system and method for unbalanced data sampling

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107516110B (en) * 2017-08-22 2020-02-18 华南理工大学 Medical question-answer semantic clustering method based on integrated convolutional coding
CN109062763B (en) * 2018-07-31 2022-03-04 云南大学 Method for dynamically mining software process activities in real time from SVN log event stream
US20200250304A1 (en) * 2019-02-01 2020-08-06 Nec Laboratories America, Inc. Detecting adversarial examples
WO2020190772A1 (en) * 2019-03-15 2020-09-24 Futurewei Technologies, Inc. Neural network model compression and optimization
GB2599831A (en) * 2019-06-14 2022-04-13 Quantum Interface Llc Predictive virtual training systems, apparatuses, interfaces, and methods for implementing same
CN110347835B (en) * 2019-07-11 2021-08-24 招商局金融科技有限公司 Text clustering method, electronic device and storage medium
CN112417289B (en) * 2020-11-29 2023-04-07 中国科学院电子学研究所苏州研究院 Information intelligent recommendation method based on deep clustering
CN112417893A (en) * 2020-12-16 2021-02-26 江苏徐工工程机械研究院有限公司 Software function demand classification method and system based on semantic hierarchical clustering

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109165309A (en) * 2018-08-06 2019-01-08 北京邮电大学 Negative training sample acquisition method, device and model training method, device
CN111581385A (en) * 2020-05-06 2020-08-25 西安交通大学 Chinese text type identification system and method for unbalanced data sampling

Also Published As

Publication number Publication date
CN113159196A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
CN112966074B (en) Emotion analysis method and device, electronic equipment and storage medium
CN111459491B (en) Code recommendation method based on tree neural network
CN111709518A (en) Method for enhancing network representation learning based on community perception and relationship attention
CN112215013B (en) Clone code semantic detection method based on deep learning
CN112905795A (en) Text intention classification method, device and readable medium
CN113127339B (en) Method for acquiring Github open source platform data and source code defect repair system
CN115081437B (en) Machine-generated text detection method and system based on linguistic feature contrast learning
CN112200664A (en) Repayment prediction method based on ERNIE model and DCNN model
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN115017299A (en) Unsupervised social media summarization method based on de-noised image self-encoder
CN115392252A (en) Entity identification method integrating self-attention and hierarchical residual error memory network
CN114490954B (en) Document level generation type event extraction method based on task adjustment
CN113254581A (en) Financial text formula extraction method and device based on neural semantic analysis
CN115658846A (en) Intelligent search method and device suitable for open-source software supply chain
CN115757695A (en) Log language model training method and system
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN116522912B (en) Training method, device, medium and equipment for package design language model
CN113159196B (en) Software demand clustering method and system based on regular variation embedding
CN116975634A (en) Micro-service extraction method based on program static attribute and graph neural network
CN114492458A (en) Multi-head attention and word co-occurrence based aspect-level emotion analysis method
CN114239555A (en) Training method of keyword extraction model and related device
CN112632229A (en) Text clustering method and device
CN113988083A (en) Factual information coding and evaluating method for shipping news abstract generation
CN113392929A (en) Biological sequence feature extraction method based on word embedding and self-encoder fusion
CN112329933A (en) Data processing method, device, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant