CN114706559A

CN114706559A - Software scale measurement method based on demand identification

Info

Publication number: CN114706559A
Application number: CN202210319424.9A
Authority: CN
Inventors: 李刚; 郑成鹏; 李敏; 周鸣乐; 韩德隆
Original assignee: Qilu University of Technology; Shandong Computer Science Center National Super Computing Center in Jinan
Current assignee: Qilu University of Technology; Shandong Computer Science Center National Super Computing Center in Jinan
Priority date: 2022-03-29
Filing date: 2022-03-29
Publication date: 2022-07-05

Abstract

The invention discloses a software scale measurement method based on requirement identification. Working from a software requirement specification document, the method extracts the software requirements, classifies them, and obtains the software scale by identifying and counting the function points in the functional requirements. The method comprises: acquiring the requirement specification document of the target software; preprocessing the target software requirement document to obtain requirement statement data; identifying and classifying the requirements with an automatic requirement classification model built from a graph attention network and BERT; calculating function points for the functional requirements, and performing attribute embedding and global statistics for the non-functional requirements; and finally measuring the software scale, with the function point scale estimate as the main component and the global non-functional requirement classification statistics and system characteristics as adjustment coefficients.

Description

Software scale measurement method based on demand identification

Technical Field

The invention relates to the technical field of software scale measurement, and particularly provides a software scale measurement method based on demand identification.

Background

Measuring the scale of a software development project is the basis for estimating project workload, budgeting cost, and planning a reasonable schedule. Inaccurate scale measurement is one of the major causes of software project failure. With the development of computer technology and software engineering, more and more software projects are emerging. Successful software development means delivering a system that meets user needs on time and within budget. Development experience at home and abroad shows that many factors influence the success or failure of system development, and that software scale estimation and management control are among the key ones. If the scale is overestimated, the cost is set too high and resources are wasted; if it is underestimated, the cost is set too low and the project may run out of control, far exceeding its budget and delivery date. Accurately measuring software scale is therefore of great significance.

Although researchers have explored software scale measurement methods, problems remain in the field. Software scale measurement requires processing a large amount of text. Existing scale estimation relies mainly on manually identifying and counting function points in complicated project materials, which is inefficient, and the uneven skill of the people doing the identification makes the accuracy of the estimate hard to guarantee. Accurately identifying user requirements and their corresponding function points from a requirement specification document is therefore time-consuming and challenging. With the rise of machine learning and natural language processing, more and more researchers are trying to process requirement documents with such methods instead of manually extracting user requirements and counting function points. Traditional machine-learning-based requirement classification extracts feature information through manual preprocessing and classifies requirements with a shallow classifier; some researchers have obtained higher classification precision with semantic similarity, custom dictionaries, manual preprocessing, and similar techniques. However, as expectations for classification quality rise, these traditional methods prove time-consuming and insufficiently accurate.

Current requirement classification research still has limitations. First, existing requirement classification techniques ignore structural features and syntactic information. Traditional methods depend heavily on feature engineering: the model treats a text as a set of words in which each word stands alone and has no relationship with the others, so traditional feature extraction captures only word features and shallow information about the requirement sentence and struggles to capture grammatical and syntactic information. Second, most requirement classification models generalize poorly. In particular, the performance of models that rely on manual preprocessing drops sharply on unseen software projects, which makes them hard to apply in practice.

Compared with traditional networks, a graph attention network captures the feature information of a requirement statement better because it works on graph-structured data, and it is widely applied to tasks such as recommendation systems, image processing, knowledge reasoning, and knowledge graph construction. Applying advanced natural language processing to requirement identification can therefore improve the accuracy and efficiency of software scale measurement. This provides the ideas and techniques for automatic software scale measurement and makes the method feasible.

Disclosure of Invention

The purpose of the invention is as follows: in order to solve the defects of the prior art, the invention provides a software scale measuring method based on demand identification.

Technical solution: the invention provides a software scale measurement method based on requirement identification, comprising the following steps:

Step 1: Acquire the requirement specification document of the target software.

Step 2: Preprocess the target software requirement document to obtain requirement statement data.

Step 3: Input the preprocessed requirement statement data into an automatic requirement classification model built from a graph attention network and BERT, and output the category of each requirement statement.

Step 4: According to the requirement identification result, if a requirement statement contains a functional requirement, obtain its function points; if it contains a non-functional requirement, embed the non-functional requirement attribute into the function point corresponding to the statement to calculate the function point scale, and count the non-functional attribute into the global non-functional attributes.

Step 5: Finally, calculate the software scale metric from the obtained function point scale, the global non-functional requirement classification statistics, and expert judgment.

Further, the automatic requirement classification model based on the graph attention network and BERT constructed in step 3 comprises the following modules:

Step 3-1) Data preprocessing module. The acquired requirement text is cleaned: non-requirement sentences are removed, and the requirement sentences undergo de-duplication, garbled-character removal, whitespace removal, and word segmentation.

Step 3-2) Requirement graph construction module. A syntactic parse tree is constructed for each requirement statement to reveal the rich syntactic information in the statement and the dependency relationships between its words; a dependency graph is then constructed from the syntactic parse tree.

Step 3-3) Graph embedding module. First, BERT is used to initialize the node embeddings of the constructed dependency parse graph, giving vectorized node representations. The graph attention network then learns the feature information of the requirement sentence: the information in the forward and backward neighborhoods of each node is aggregated to obtain bidirectional node embeddings, and a graph embedding is built on top of the learned node embeddings to represent the feature information of the whole graph.

Step 3-4) Graph classification module. The output graph embedding is fed into a multilayer perceptron (MLP) for classification, which outputs scores for the candidate requirement subclasses. The scores are then normalized with a Softmax function to output the requirement classification result.

The specific construction process of the data preprocessing module in the step 3-1) is as follows:

The sentences in the requirement document are collected into a corpus S = {s₁, s₂, ..., s_n}. The corpus is de-duplicated, leading and trailing spaces are deleted from each sentence, and non-requirement sentences are removed from S; the remaining requirement sentences then undergo garbled-character removal, whitespace removal, sentence de-duplication, and word segmentation.
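The corpus cleaning described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `clean_corpus` and `tokenize`, the `is_requirement` predicate, and the regex-based tokenizer are all hypothetical stand-ins (the text does not say how non-requirement sentences are detected or which segmenter is used).

```python
import re

def clean_corpus(sentences, is_requirement=lambda s: True):
    """Return cleaned, de-duplicated requirement sentences in their original order.

    `is_requirement` is a hypothetical predicate standing in for whatever rule
    or model decides that a sentence is not a requirement.
    """
    seen = set()
    cleaned = []
    for raw in sentences:
        # Collapse runs of whitespace and trim leading/trailing spaces.
        text = re.sub(r"\s+", " ", raw).strip()
        # Drop non-printable debris and the Unicode replacement character,
        # a simple stand-in for garbled-character removal.
        text = "".join(ch for ch in text if ch.isprintable() and ch != "\ufffd")
        if not text or not is_requirement(text):
            continue
        if text in seen:  # sentence-level de-duplication
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

def tokenize(sentence):
    # Placeholder word segmentation: word runs or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

corpus = [
    "  The system shall log in users. ",
    "The system shall log in users.",
    "Chapter 1\x00",
    "The  system shall\texport reports.",
]
requirements = clean_corpus(corpus, is_requirement=lambda s: "shall" in s)
tokens = [tokenize(s) for s in requirements]
```

For Chinese requirement text the regex tokenizer would be replaced by a real word segmenter; the cleaning and de-duplication logic stays the same.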

The process of constructing the module of the demand graph in the step 3-2) is as follows:

Step 3-2-1): Construct the syntactic parse graph. A dependency parser using a linear-time scanning method constructs the syntactic parse graph. The requirement sentences are converted one by one until every sentence has a corresponding syntactic parse graph.

Step 3-2-2): Construct the dependency parse graph. Each word in the syntactic parse graph is set as a node of the dependency parse graph, giving the node set; each node's information is stored as a dictionary containing the node value, the node position ID (the position of the word in the original text sequence), and the node ID. The syntactic relations between words are set as the edges of the dependency parse graph. The dependency relation between two nodes is then represented by a new node, so that every edge is treated as a graph node and a bipartite graph is constructed.
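The conversion from dependency edges to a bipartite graph can be illustrated as follows. This is a sketch only: the parse is supplied by hand as `(head, relation, dependent)` triples rather than produced by a real dependency parser, and the function name `build_bipartite_graph` and the dictionary keys are assumptions chosen to mirror the node value, position ID, and node ID mentioned in the text.

```python
def build_bipartite_graph(tokens, dependencies):
    """Turn a dependency parse into a bipartite word/relation graph.

    tokens:       list of words in sentence order
    dependencies: list of (head_index, relation_label, dependent_index)
    """
    # One dictionary per word node: value, position in the sentence, node ID.
    nodes = [
        {"node_id": i, "position_id": i, "value": tok, "kind": "word"}
        for i, tok in enumerate(tokens)
    ]
    edges = []
    for head, relation, dependent in dependencies:
        rel_id = len(nodes)
        # The labelled edge becomes a node of its own ...
        nodes.append(
            {"node_id": rel_id, "position_id": None, "value": relation, "kind": "relation"}
        )
        # ... connected as head -> relation -> dependent.
        edges.append((head, rel_id))
        edges.append((rel_id, dependent))
    return nodes, edges

# Hand-written parse of "user uploads file" (illustrative labels).
tokens = ["user", "uploads", "file"]
dependencies = [(1, "nsubj", 0), (1, "obj", 2)]
nodes, edges = build_bipartite_graph(tokens, dependencies)
```

Every edge in the result joins a word node and a relation node, which is what makes the graph bipartite.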

The specific construction process of the graph embedding module in the step 3-3) is as follows:

Step 3-3-1: Node initialization embedding using BERT. The requirement sentence is restructured into a form suitable for BERT input by adding "[CLS]" at the beginning and "[SEP]" at the end of the sentence, and the input sequence is adjusted to the appropriate length.
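A small sketch of this input formatting, following the standard BERT convention of a leading "[CLS]" token and a trailing "[SEP]" token. The function name `to_bert_input`, the `max_len` value, and the attention mask are illustrative assumptions; a real pipeline would normally use the tokenizer shipped with the BERT model.

```python
def to_bert_input(tokens, max_len=16, pad_token="[PAD]"):
    """Wrap tokens in BERT special tokens and pad or truncate to max_len."""
    # Reserve two positions for the special tokens, truncating if needed.
    body = tokens[: max_len - 2]
    sequence = ["[CLS]"] + body + ["[SEP]"]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(sequence)
    padding = max_len - len(sequence)
    return sequence + [pad_token] * padding, attention_mask + [0] * padding

sequence, mask = to_bert_input(["user", "logs", "in"], max_len=8)
truncated, _ = to_bert_input(list("abcdefgh"), max_len=5)
```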

Step 3-3-2: bipartite graph attention network feature learning

(1) According to the edge direction between nodes, the neighbor nodes whose edges point to a node w are called its forward neighbors, denoted $N_f(w)$; the neighbor nodes reached by the edges flowing out of w are called its backward neighbors, denoted $N_b(w)$.

(2) The vectors of each forward neighbor are passed through a fully connected neural network, and an aggregator then performs a max pooling operation. The forward representations $h^{k-1}_{u,f}$ of the forward neighbors $u \in N_f(w_i)$ of node $w_i$ are aggregated into a single vector $h^{k}_{N_f(w_i)}$, where $k \in \{1, ..., K\}$ is the iteration index. The forward aggregation formula is:

$$h^{k}_{N_f(w_i)} = \max\left(\{\sigma(W_{pool}\, h^{k-1}_{u,f}) : u \in N_f(w_i)\}\right)$$

where max is the element-wise maximum operator, σ is an activation function, and $W_{pool}$ is the pooling parameter matrix.

(3) The newly generated neighborhood vector $h^{k}_{N_f(w_i)}$ is concatenated with the forward representation $h^{k-1}_{w_i,f}$ of the node, and the concatenated vector is fed into a fully connected layer to update the node's forward representation for the next iteration.

(4) The backward representation is updated by a procedure analogous to the forward operation in steps (2) and (3). The backward aggregation formula is:

$$h^{k}_{N_b(w_i)} = \max\left(\{\sigma(W_{pool}\, h^{k-1}_{u,b}) : u \in N_b(w_i)\}\right)$$

(5) Steps (2) to (4) are repeated K times, and the final forward and backward representations of each node are combined into its final bidirectional representation, which completes the node embedding operation.

(6) Finally, graph embedding is performed with a node-based method. A new core node w_s is added to the input graph, and all other nodes are connected directly to w_s to generate the graph embedding.
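The max pooling aggregation of step (2) and the update of step (3) can be sketched in plain Python. The sketch makes several assumptions that the text does not fix: the sigmoid is used as the activation σ, a node with no neighbors receives a zero neighborhood vector, the update layer applies the same activation, and `w_update` is a hypothetical name for the weights of the fully connected update layer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def max_pool_aggregate(neighbor_vectors, w_pool):
    """Element-wise max over sigma(W_pool h_u) for every neighbor u."""
    transformed = [[sigmoid(v) for v in matvec(w_pool, h)] for h in neighbor_vectors]
    if not transformed:
        # Assumption: an isolated node gets a zero neighborhood vector.
        return [0.0] * len(w_pool)
    return [max(column) for column in zip(*transformed)]

def update_forward(h_self, neighbor_vectors, w_pool, w_update):
    """Concatenate the node's own vector with its neighborhood vector,
    then pass the result through a fully connected layer."""
    neighborhood = max_pool_aggregate(neighbor_vectors, w_pool)
    concatenated = h_self + neighborhood
    return [sigmoid(v) for v in matvec(w_update, concatenated)]

w_pool = [[1.0, 0.0], [0.0, 1.0]]            # identity, for readability
forward_neighbors = [[0.0, 2.0], [1.0, -1.0]]
aggregated = max_pool_aggregate(forward_neighbors, w_pool)
w_update = [[0.25, 0.25, 0.25, 0.25], [0.0, 0.0, 0.0, 0.0]]
updated = update_forward([0.5, 0.5], forward_neighbors, w_pool, w_update)
```

The backward pass would call the same functions with the backward neighbors, and K rounds of both passes give the representations that are combined into the bidirectional node embedding.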

The specific construction process of the graph classification module in the step 3-4) is as follows:

Step 3-4-1: Classification using a multilayer perceptron. The MLP model is used to calculate the probabilities. Here r ∈ R^(M×1) denotes the fixed-length graph representation, W₁ ∈ R^(H×M) is the fully connected weight matrix, H is the dimension of the hidden layer, b₁ is a bias term, f is the activation function, and O ∈ R^(H×1) is the output of the fully connected layer. The calculation formula is:

O = f(W₁r + b₁)

Step 3-4-2: Normalization of the classification result using Softmax. The raw feature space is transformed into a confidence space, and the Softmax layer is then applied for classification. The input to the Softmax layer, I ∈ R^(C×1), is expressed as:

I = W₂O

where W₂ ∈ R^(C×H) is the transformation matrix and C is the number of classes; if requirements are divided into four subclasses, then C = 4. I holds the confidence of the sample for each class. The final classification result is determined from the normalized confidence values output by the Softmax layer.
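The two formulas above, O = f(W₁r + b₁) and I = W₂O followed by Softmax, can be traced with a small pure-Python example. The ReLU activation, the weight values, and the four class labels are illustrative assumptions; the text only fixes the shapes of the matrices.

```python
import math

def dense(weights, vector, bias=None):
    # weights is a list of rows; returns weights @ vector (+ bias).
    out = [sum(w * x for w, x in zip(row, vector)) for row in weights]
    if bias is not None:
        out = [o + b for o, b in zip(out, bias)]
    return out

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(r, w1, b1, w2, labels):
    o = [max(0.0, v) for v in dense(w1, r, b1)]  # O = f(W1 r + b1), f = ReLU (assumed)
    i = dense(w2, o)                             # I = W2 O
    probs = softmax(i)                           # normalized confidences
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs

# M = 2 input features, H = 3 hidden units, C = 4 classes (hypothetical labels).
labels = ["functional", "performance", "security", "reliability"]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
predicted, probs = classify([1.0, 2.0], w1, b1, w2, labels)
```

With these toy weights the hidden output is O = [1, 2, 3] and the Softmax input is I = [1, 2, 3, 0], so the third class receives the highest confidence.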

In step 4, based on the requirement classification result, function points are obtained from the functional requirement statements; the function points are divided into data functions and transaction functions. A data function is a function provided to the user to satisfy internal or external data requirements, and a transaction function is a function provided to the user to process data. For non-functional requirement statements, the non-functional requirement attribute is obtained, and the statements are divided into five non-functional requirement types: performance, reliability, availability, security, and maintainability. The non-functional requirement attribute is embedded into the function point corresponding to the statement and takes part in the function point calculation; finally, it is counted into the global non-functional attributes, which serve as one of the adjustment factors of the software workload scale.

In step 5, the software scale metric is calculated from the function point scale, the global non-functional requirement classification statistics, and expert judgment. The function point scale and the global non-functional requirement classification statistics are obtained in step 4. The expert judgment consists of experts assessing and scoring the type and nature of the software based on their project experience.

The beneficial effects of the invention are as follows. A requirement text preprocessing method is designed around the characteristics of requirement text. On this basis, a dependency parse graph is constructed for each requirement sentence to express its sentence structure and syntactic features. A pre-trained BERT model initializes the embeddings of the nodes in the dependency parse graph, producing dynamic word vectors that carry contextual semantic information, and a graph attention network then mines the implicit structural and syntactic features of the requirement. Feature fusion lets BERT and the graph attention network complement each other and capture more requirement information, which improves the accuracy of automatic requirement classification and plays an important role in practical requirement analysis.

Drawings

FIG. 1 is a flow chart of the software scale measurement based on requirement identification according to the present invention.

FIG. 2 is a flow chart of demand identification provided by the present invention.

FIG. 3 is a framework diagram of an automatic demand classification model based on BERT and a graph attention network according to the present invention.

Detailed description of the invention

The invention is further described below with reference to the accompanying drawings and examples, it being understood that the examples described below are intended to facilitate the understanding of the invention, and are not intended to limit it in any way.

Based on a thorough understanding of the actual software scale measurement process and an in-depth study of methods such as function point analysis, the invention realizes intelligent function point analysis by learning from the existing function point analysis process, realizes requirement identification with a natural language processing model, and finally combines the function point measurement method with requirement identification to achieve an efficient, accurate, and reliable software scale measurement process, as shown in FIG. 1.

First, the invention builds its basic data from existing public data sources in the software scale measurement field, and performs data cleaning, sentence de-duplication, garbled-character removal, whitespace removal, and word segmentation on the terms and function point vocabulary of specific domains (such as e-government, e-commerce, and transportation). Then, for different application types and application targets, it combines BERT with graph neural network technology, around techniques such as requirement identification, requirement classification, named entity recognition, and function point extraction, to provide analysis and data support for tasks such as requirement identification and function point extraction.

Next, the invention combines requirement identification with the function point algorithm to realize intelligent function point identification. Following the way the function point algorithm is applied in software scale measurement, it combines natural language processing and deep learning to realize a software scale measurement method that requires no manual effort and is efficient and accurate. The invention is further illustrated by the following specific embodiment.

As shown in FIG. 1, the software scale measurement process based on requirement identification mainly includes the following steps:

Step 1: Acquire the requirement specification document of the target software.

Step 2: Preprocess the target software requirement document. The software requirement document of the project is first normalized: meaningless or redundant characters are removed, and upper/lower case conversion and traditional/simplified Chinese conversion are performed to clean the data. Missing data are then completed, data noise is filtered, data formats are made consistent, and other such operations are carried out.

Step 3: Input the preprocessed requirement statement data into the automatic requirement classification model built from a graph attention network and BERT, and output the category of each requirement statement. The structure of the model is shown in FIG. 3.

Step 3-1) Data preprocessing module. The acquired requirement text is processed as follows. The sentences in the requirement document are collected into a corpus S = {s₁, s₂, …, s_n}. The corpus is de-duplicated, leading and trailing spaces are deleted from each sentence, and non-requirement sentences are removed from S; the remaining requirement sentences then undergo garbled-character removal, whitespace removal, sentence de-duplication, and word segmentation.

Word segmentation is the basis of requirement statement processing. Existing word segmentation methods fall into three categories: methods based on string matching, methods based on understanding, and methods based on statistics. Word segmentation was performed with a mainstream segmentation lexicon, and experiments showed that the existing lexicon segments the specialized vocabulary of a given domain poorly. The specialized vocabulary is therefore added to the existing lexicon and a stop word list is introduced, which improves the segmentation accuracy of requirement sentences.
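The effect of extending the lexicon with domain vocabulary and adding a stop word list can be shown with forward maximum matching, a simple string-matching segmenter. This is an illustrative sketch only: the text does not say which segmenter or lexicon is used, and the function name `forward_max_match` and the tiny word lists are assumptions.

```python
def forward_max_match(text, vocabulary, stop_words=frozenset()):
    """Greedy left-to-right segmentation that always takes the longest
    vocabulary word starting at the current position."""
    max_len = max((len(w) for w in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i : i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    # Drop stop words and stray whitespace after segmentation.
    return [t for t in tokens if t not in stop_words and not t.isspace()]

# "The system counts the number of function points"
text = "系统统计功能点的数量"
base_vocabulary = {"系统", "统计", "功能", "数量"}    # system, count, function, number
domain_vocabulary = base_vocabulary | {"功能点"}      # adds the term "function point"
stop_words = {"的"}                                   # possessive particle

without_domain = forward_max_match(text, base_vocabulary, stop_words)
with_domain = forward_max_match(text, domain_vocabulary, stop_words)
```

Without the domain entry the term 功能点 (function point) is split into 功能 and 点; with it, the term survives as a single token.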

The model training data take the count item content as the starting point: because the count item content is extracted from the function point description text, it can serve as a basis for judging the function point type. After the text is segmented, each resulting word is checked for whether it contains count item information, and the segmented text is then matched against the count item content item by item. Each word has a corresponding label.

Step 3-2) Requirement graph construction module. A syntactic parse tree is constructed for each requirement statement to reveal the rich syntactic information in the statement and the dependency relationships between its words; a dependency graph is then constructed from the syntactic parse tree.

Step 3-3-2: Bipartite graph attention network feature learning:

(1) According to the edge direction between nodes, the neighbor nodes whose edges point to a node w are called its forward neighbors, denoted $N_f(w)$; the neighbor nodes reached by the edges flowing out of w are called its backward neighbors, denoted $N_b(w)$.

(2) The vectors of each forward neighbor are passed through a fully connected neural network, and an aggregator then performs a max pooling operation. The forward representations $h^{k-1}_{u,f}$ of the forward neighbors $u \in N_f(w_i)$ of node $w_i$ are aggregated into a single vector $h^{k}_{N_f(w_i)}$, where $k \in \{1, ..., K\}$ is the iteration index.

The forward aggregation formula is:

$$h^{k}_{N_f(w_i)} = \max\left(\{\sigma(W_{pool}\, h^{k-1}_{u,f}) : u \in N_f(w_i)\}\right)$$

(3) The newly generated neighborhood vector $h^{k}_{N_f(w_i)}$ is concatenated with the forward representation $h^{k-1}_{w_i,f}$ of the node to update the node's forward representation.

(4) The backward representation is updated by a procedure analogous to the forward operation in steps (2) and (3).

The backward aggregation formula is:

$$h^{k}_{N_b(w_i)} = \max\left(\{\sigma(W_{pool}\, h^{k-1}_{u,b}) : u \in N_b(w_i)\}\right)$$

Step 3-4-1: Classification using a multilayer perceptron. The MLP model is used to calculate the probabilities. Here r ∈ R^(M×1) denotes the fixed-length graph representation, W₁ ∈ R^(H×M) is the fully connected weight matrix, H is the dimension of the hidden layer, b₁ is a bias term, and O ∈ R^(H×1) is the output of the fully connected layer. The calculation formula is:

O = f(W₁r + b₁)

The input to the Softmax layer is then:

I = W₂O

Step 4: According to the requirement identification result, if a requirement statement contains a functional requirement, function points are obtained from the statement; if it contains a non-functional requirement, the non-functional requirement attribute is embedded into the function point corresponding to the statement to calculate the function point scale, and the non-functional attribute is counted into the global non-functional attributes. The specific operations are as follows:

The requirement identification result determines which requirement statements contain function points, and the function points contained in those functional requirement statements are then extracted. The number of unadjusted function points reflects the amount of functionality the application provides to the user. In the counting process it covers two major categories: data functions and transaction functions. A data function is a function provided to the user to satisfy internal or external data requirements, and a transaction function is a function provided to the user to process data.

If the requirement statement is a non-functional requirement, it is assigned to one of five non-functional requirement types: performance, reliability, availability, security, and maintainability. The statements are counted by type, and the resulting statistics serve as an adjustment factor for the subsequent software scale measurement.

If the requirement statement contains both a functional and a non-functional requirement, both are handled. For example, the statement "the user's identity and authority must be verified when the user logs in to the system" contains the data function of user login and also implies a non-functional security requirement. The non-functional requirement attribute is embedded into the function point corresponding to the requirement statement and takes part in the function point calculation; it is also counted into the global non-functional attributes, which serve as an adjustment factor for the subsequent software scale measurement.
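The routing described in step 4 can be sketched as a single pass over the classified statements. The label strings, the function name `route_requirements`, and the dictionary layout are hypothetical; the sketch only shows how a statement can contribute a function point, a non-functional attribute embedded in that function point, and an entry in the global non-functional statistics at the same time.

```python
from collections import Counter

NFR_TYPES = ("performance", "reliability", "availability", "security", "maintainability")

def route_requirements(classified):
    """classified: list of (sentence, set_of_labels) pairs.

    Returns the function point candidates (each carrying the non-functional
    attributes embedded in it) and the global non-functional statistics.
    """
    function_points = []
    global_nfr = Counter()
    for sentence, labels in classified:
        nfr = sorted(label for label in labels if label in NFR_TYPES)
        # Every non-functional attribute is counted globally ...
        global_nfr.update(nfr)
        # ... and also embedded in the function point when the same
        # statement contains a functional requirement.
        if "functional" in labels:
            function_points.append({"sentence": sentence, "nfr_attributes": nfr})
    return function_points, global_nfr

classified = [
    ("Verify identity and authority when the user logs in.", {"functional", "security"}),
    ("Pages shall load within two seconds.", {"performance"}),
    ("The user can export a monthly report.", {"functional"}),
]
function_points, global_nfr = route_requirements(classified)
```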

Step 5: Finally, calculate the software scale metric from the obtained function point scale, the global non-functional requirement classification statistics, and expert opinion.

Step 5-1: Calculate the function point scale. On the basis of requirement identification, the function points in the functional requirements are counted. If a requirement statement contains non-functional requirement attributes, they are embedded as non-functional attribute factors that adjust the size of the function point. All adjusted function points are then summed to obtain the function point scale of the software.

Function point scale = ∑ [function point × (1 + non-functional requirement attribute)]
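As a worked example of this sum, the sketch below assumes that each function point has a numeric size and that its embedded non-functional requirement attribute has already been converted into a numeric factor. The text does not specify that conversion, so the field names `size` and `nfr_factor` and the sample values are purely illustrative.

```python
def function_point_scale(function_points):
    """Sum of size * (1 + non-functional factor) over all function points."""
    return sum(fp["size"] * (1.0 + fp["nfr_factor"]) for fp in function_points)

function_points = [
    {"name": "user login", "size": 4, "nfr_factor": 0.1},      # carries a security attribute
    {"name": "monthly report", "size": 7, "nfr_factor": 0.0},  # purely functional
]
scale = function_point_scale(function_points)  # 4 * 1.1 + 7 * 1.0, about 11.4
```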

Step 5-2: an adjustment factor is determined. The adjustment coefficient is mainly used for evaluating the global non-functional demand statistics and the general system characteristics and the influence degree thereof in the functional point algorithm.

The characteristics and the influence degree of the general system are judged and scored by an experienced software expert, and the specific scoring form is as follows:

General system characteristic	Level of influence	Remarks
1. System security		
2. Distributed data processing		
3. Performance		
4. Heavily used configuration		
5. Response speed		
6. Online data entry		
7. End-user efficiency		
8. Online update		
9. Software complexity		
10. Reusability		
11. Ease of installation		
12. Ease of operation		
13. Multi-site deployment		
14. Change requests		

The degree of influence of each system characteristic is rated on six levels: 0, not present or no influence; 1, incidental influence; 2, minor influence; 3, moderate influence; 4, significant influence; 5, strong influence.

The adjustment coefficient is based on the system's non-functional requirement statistics and on the 14 general system characteristics, which are used to evaluate the functionality of the application being analyzed. Each characteristic is scored according to its own rules. The scores of the 14 general system characteristics are summed, and the final adjustment coefficient is then calculated; it can adjust the function point scale by up to plus or minus 40%.

Adjustment coefficient = non-functional requirement influence factor + system characteristic influence factor

Step 5-3: Calculate the software scale metric. Once the number of software function points and the adjustment coefficient have been determined, the final software scale is obtained by calculation. The formula is: software scale = function point scale × adjustment coefficient.
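The last two formulas can be combined into a short calculation. How the expert scores and the global non-functional statistics are turned into the two influence factors is not specified in the text, so the sketch takes both factors as inputs; the helper `total_influence` only validates and sums the 14 expert scores, and all numeric values are made up for illustration.

```python
def total_influence(scores):
    """Validate and sum the expert scores for the 14 general system characteristics."""
    if len(scores) != 14:
        raise ValueError("expected one score per general system characteristic (14)")
    if any(not 0 <= s <= 5 for s in scores):
        raise ValueError("each score must lie between 0 and 5")
    return sum(scores)

def adjustment_coefficient(nfr_influence, system_influence):
    # Adjustment coefficient = non-functional factor + system characteristic factor.
    return nfr_influence + system_influence

def software_scale(fp_scale, coefficient):
    # Software scale = function point scale x adjustment coefficient.
    return fp_scale * coefficient

scores = [3, 2, 4, 1, 3, 5, 2, 3, 2, 1, 0, 2, 1, 3]   # hypothetical expert scores
tdi = total_influence(scores)                          # 32 of a possible 70
coefficient = adjustment_coefficient(nfr_influence=0.15, system_influence=0.95)
scale = software_scale(100.0, coefficient)             # about 110 for these inputs
```

A coefficient of 1.10 enlarges the function point scale by 10%, which stays inside the plus or minus 40% range stated for the adjustment.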

Claims

1. A software scale measurement method based on requirement identification, characterized by comprising the following steps:

step 1: acquiring a requirement specification document of target software;

step 2: preprocessing the target software requirement document to obtain requirement statement data;

step 3: inputting the preprocessed requirement statement data into an automatic requirement classification model built from a graph attention network and BERT, and outputting the category of each requirement statement;

step 4: according to the requirement identification result, if a requirement statement contains a functional requirement, obtaining its function points; if it contains a non-functional requirement, embedding the non-functional requirement attribute into the function point corresponding to the statement to calculate the function point scale, and counting the non-functional attribute into the global non-functional attributes;

step 5: finally, calculating the software scale metric from the obtained function point scale, the global non-functional requirement classification statistics, and the system characteristics.

2. The method according to claim 1, wherein the step 3 of constructing the automatic demand classification model based on the graph attention network and the BERT comprises the following steps:

1) a data preprocessing module: cleaning the acquired requirement text, removing non-requirement sentences, and performing de-duplication, garbled-character removal, whitespace removal, and word segmentation on the requirement sentences;

2) a requirement graph construction module: constructing a syntactic parse tree for each requirement statement to reveal the rich syntactic information in the statement and the dependency relationships between its words, and constructing a dependency graph from the syntactic parse tree;

3) a graph embedding module: first, using BERT to initialize the node embeddings of the constructed dependency parse graph to obtain vectorized node representations; second, using the graph attention network to learn the feature information of the requirement statement, aggregating the information in the forward and backward neighborhoods of each node to obtain bidirectional node embeddings, and building a graph embedding on top of the learned node embeddings to represent the feature information of the whole graph;

4) a graph classification module: feeding the output graph embedding into a multilayer perceptron (MLP) for classification and outputting scores for the candidate requirement subclasses;

and then normalizing the scores with a Softmax function to output the requirement classification result.

3. The method according to claim 2, wherein the requirement graph building module in step 2) comprises the following steps:

a dependency parser using a linear-time scanning method constructs a syntactic parse graph; each word in the syntactic parse graph is then set as a node of the dependency parse graph to obtain the node set; and the syntactic relations between words are set as the edges of the dependency parse graph.

4. The method according to claim 2, wherein the graph embedding module in the step 3) is specifically constructed as follows:

performing node initialization embedding using BERT: the requirement sentence is restructured into a form suitable for BERT input, and the input sequence is adjusted to an appropriate length; then performing bidirectional graph attention network feature learning: the forward and backward neighbor nodes of each node are determined according to the direction of the edges between nodes, the vectors of the forward (or backward) neighbors are passed through a fully connected neural network, and an aggregator performs a max pooling operation to aggregate the neighbor information of the node into its neighborhood vector; and finally performing graph embedding with a node-based method: a new core node is added to the input graph, and all other nodes are connected directly to the new core node to generate the graph embedding.

5. The method according to claim 2, wherein the graph classification module in step 4) is specifically constructed as follows:

first, the probabilities are calculated using an MLP model; the classification result is then normalized through Softmax; and the final classification result is determined from the normalized confidence values output by the Softmax layer.

6. The method according to claim 1, wherein in step 4, based on the requirement classification result, function points are obtained from the functional requirement statements, the function points being divided into data functions and transaction functions, a data function being a function provided to the user to satisfy internal or external data requirements and a transaction function being a function provided to the user to process data; for the non-functional requirement statements, the non-functional requirement attribute is obtained, and the statements are divided into five non-functional requirement types, namely performance, reliability, availability, security, and maintainability; the non-functional requirement attribute is embedded into the function point corresponding to the statement and takes part in the function point calculation; and finally the non-functional requirement attribute is counted into the global non-functional attributes as one of the adjustment factors of the software workload scale.

7. The method according to claim 1, wherein in step 5, the software scale metric is calculated from the function point scale, the global non-functional requirement classification statistics, and the system characteristics; wherein the function point scale and the global non-functional requirement classification statistics are obtained according to claim 6; and the system characteristics are assessed and scored by experts, who judge the type and nature of the software based on their project experience.