CN114780411A - Software configuration item preselection method oriented to performance tuning - Google Patents


Info

Publication number
CN114780411A
Authority
CN
China
Prior art keywords
configuration item
label
configuration
intention
software
Prior art date
Legal status
Granted
Application number
CN202210450353.6A
Other languages
Chinese (zh)
Other versions
CN114780411B (en)
Inventor
李姗姗
贾周阳
马俊
李小玲
何浩辰
董威
陈立前
陈振邦
周成龙
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202210450353.6A priority Critical patent/CN114780411B/en
Publication of CN114780411A publication Critical patent/CN114780411A/en
Application granted granted Critical
Publication of CN114780411B publication Critical patent/CN114780411B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/362Software debugging
    • G06F11/3628Software debugging of optimised code
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/443Optimisation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a software configuration item preselection method oriented to performance tuning, aiming at the problems that existing configuration-item performance tuning methods are time-consuming and consider only a single intention. The technical scheme is: construct a performance-tuning-oriented software configuration item preselection system composed of a configuration item intention data automatic amplification module and a configuration item preselection module; select some configuration items from the data set source software and label them to obtain a labeled configuration item set; the configuration item intention data automatic amplification module iteratively amplifies the labeled configuration item set to obtain the amplified labeled configuration item set D; train the configuration item preselection module with D; the trained configuration item preselection module then classifies the configuration items of the target software according to the target software's configuration documents. The invention can tune the configuration item set of the corresponding category according to different intentions, greatly reducing overhead while recommending performance-related configuration items more comprehensively, thereby improving both efficiency and accuracy.

Description

Software configuration item preselection method oriented to performance tuning
Technical Field
The invention relates to the field of performance tuning of large-scale software, and in particular to a software configuration item preselection method.
Background
In order to adapt software to different application scenarios and production environments without modifying its source code, developers typically provide configuration items that give users an interface for adjusting software behavior. However, as application scenarios and user requirements grow more diverse, the size and complexity of modern software increase, and so does the number of software configuration items. For example, MySQL has more than 900 configuration items, and GCC has more than 1,000. This huge number of configuration items makes configuring the software very difficult and raises the barrier to using it: it is hard for users to satisfy their intentions by adjusting software configuration items.
Users usually have various intentions when using software, such as improving software performance (e.g., throughput, execution time, read/write speed) and reliability, preventing information leakage, and so on. Among these, improving software performance is one of the most common intentions and the one users care about most. Since software performance is easier to measure quantitatively than other intentions, how to adjust software configuration items to achieve the best software performance, i.e., performance tuning by adjusting software configuration, is a hot issue of current research.
Current configuration tuning work generally takes all configuration items as input and, under a specific workload, performs a large number of performance tests with varying configuration item values to obtain the correspondence between configuration item values and software performance. This work suffers from a large configuration search space, so performance tuning is time-consuming, and a great deal of time is needed to obtain the configuration corresponding to optimal performance.
For the problem of an overly large configuration search space, the prior art mainly reduces the search space by pre-screening the configuration items that have an important influence on performance; there are two representative methods. The first is Carver, published by Zhen Cao et al. at FAST 2020, which selects key configuration items for storage system performance tuning. The Carver method (background method one for short) samples the configuration space by Latin Hypercube Sampling (LHS), evaluates the importance of different configuration items to performance after performance testing using a variance-based performance metric, and finally uses a greedy algorithm to select the N configuration items with the largest influence on performance (N is specified by the user), presenting these preselected configuration items to the user as the input of an automatic tuning tool. This research proves that different configuration items differ in how strongly they influence performance and that a small number of configuration items are especially important for improving software performance, which establishes the importance of software configuration item preselection oriented to performance tuning. The second is the "Too Many Knobs to Tune?" method (background method two for short). This method first samples the configuration space with Latin hypercube sampling, then tests the correspondence between software configuration and the performance of two database system software, Cassandra and PostgreSQL, under different workloads, analyzes the importance ranking of different configuration items' influence on software performance, and compares the top 15 configuration items with the largest influence on performance under different workloads, showing that the few configuration items with the largest influence on performance in a piece of software are usually fixed.
They experimentally demonstrated that tuning only the top 5 configuration items with the largest performance influence in Cassandra achieves throughput, read latency, and write latency similar to tuning 30 configuration items, and can even achieve better read and write latency. Both methods achieve preselection of software configuration items, but obtaining the data, and further the importance of configuration items to software performance, still requires a large number of performance tests; the preselected configuration items differ somewhat across workloads, and the preselection result depends strongly on the workload chosen for performance testing. In addition, these methods only consider how to improve software performance, without considering whether hidden dangers are introduced for software reliability and security, and thus lack comprehensive consideration of user intentions.
In summary, how to construct a multi-intention-aware, workload-independent, lightweight configuration item preselection method that assists existing performance tuning work and warns users of possible side effects of tuning is a problem that researchers in the field urgently need to solve.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a software configuration item preselection method oriented to performance tuning, addressing the problems that existing configuration-item performance tuning methods are time-consuming and consider only the single intention of performance (i.e., when a user has an intention other than performance, the existing technology cannot work). The method has a short running time, comprehensively considers a user's various intentions, and assists the user in configuration tuning for intentions beyond performance.
In order to solve the technical problems, the technical scheme of the invention is: first, construct a performance-tuning-oriented software configuration item preselection system composed of a configuration item intention data automatic amplification module and a configuration item preselection module; then randomly select some configuration items from the data set source software and manually label their intentions to obtain a labeled configuration item set; the configuration item intention data automatic amplification module iteratively amplifies the labeled configuration item set to obtain an amplified labeled configuration item set; then train the configuration item preselection module with the amplified labeled configuration item set; finally, the trained configuration item preselection module classifies the configuration items of the target software according to the target software's configuration documents and selects the configuration item sets corresponding to the different intention categories. These sets reflect the main factors users consider when performing performance tuning; according to their tuning requirements, users can tune with the configuration item set of the corresponding intention category so as to improve software performance.
The invention comprises the following steps:
the first step is to construct a software configuration item preselection system oriented to performance tuning, wherein the software configuration item preselection system oriented to performance tuning is composed of a configuration item intention data automatic amplification module and a configuration item preselection module.
The configuration item intention data automatic amplification module is connected with the configuration item preselection module and with the data set source software. The data set source software comprises two parts: a labeled configuration item set and an unlabeled configuration item set. The labeled configuration item set is a data set constructed by manually labeling the intention category of each configuration item according to its configuration item document. The configuration item intention data automatic amplification module preprocesses the labeled configuration item set, labels the configuration items in the unlabeled configuration item set, and moves newly labeled data from the unlabeled configuration item set into the labeled configuration item set until the number of configuration items in the labeled configuration item set no longer changes, obtaining the amplified labeled configuration item set, which it sends to the configuration item preselection module.
The configuration item preselection module is connected with the configuration item intention data automatic amplification module and receives the amplified labeled configuration item set from it. The configuration item preselection module comprises a TF-IDF encoder and a configuration item preselection model RF. The encoder encodes the sentences in a configuration item document to obtain the vectors corresponding to the sentences; RF is a random forest model with a two-layer structure, whose parameters are obtained by training the model with the amplified labeled configuration item set. The configuration item preselection module classifies the configuration items of the target software according to the target software's configuration documents (generally obtained by manual extraction) and preselects the configuration items corresponding to the different intention categories, obtaining the preselected configuration item sets.
Second, randomly select some configuration items from the configuration item set D0 of the data set source software and label their intentions, obtaining the labeled configuration item set D1.
2.1 The data set source software includes 13 pieces of software: MySQL, Cassandra, MariaDB, Apache-Httpd, Nginx, Hadoop-Common, MapReduce, Apache-Flink, HDFS, Keystone, Nova, GCC, and Clang. Select configuration items from the data set source software according to the following conditions: 1) the software is server-side software, which generally has higher requirements on performance, reliability, security, etc., and is favorable for studying the influence of configuration items on the software; 2) the software has more than 2,000 stars on GitHub, the world's largest code hosting platform (a star reflects a user's attention to the software; a larger number of stars indicates that more users use and follow it), so labeling its configuration items has greater impact; 3) the software has more than 100 configuration items, since software with many configuration items needs performance tuning more. From the configuration item set D0, consisting of more than 7,000 configuration items of the software satisfying all 3 conditions simultaneously, randomly select configuration items with proportion s (where s ≥ 0.2). Record the total number of configuration items as S and the number of randomly selected configuration items as N, where N = S × s, rounded to an integer (e.g., if S = 7,000 and s = 0.2, then N = 1,400).
2.2 According to the official document descriptions of the selected configuration items, label the intentions of the N configuration items to obtain the labeled configuration item set D1. The method is: according to the document description of a configuration item, if adjusting the configuration item can improve software performance but the improvement simultaneously reduces software reliability, the intention label of the configuration item is Label1; if adjusting the configuration item can improve software performance but the improvement simultaneously reduces software security, the intention label is Label2; if adjusting the configuration item can improve software performance but the improvement simultaneously degrades software functionality, the intention label is Label3; if adjusting the configuration item can improve software performance but the improvement simultaneously increases the cost of using the software, the intention label is Label4; if adjusting the configuration item can improve software performance but the improvement simultaneously degrades performance for other users of the software, the intention label is Label5; if adjusting the configuration item can improve software performance without causing any of the first five side effects, the intention label is Label6; if adjusting the configuration item does not affect software performance, the intention label is Label7.
2.3 The labeled configuration item set D1 = {<(c_n, d_n), label_n> | 1 ≤ n ≤ N, label_n ∈ Labels}, where c_n is the name of the n-th configuration item in D1 and d_n is the document of configuration item c_n, which can be expressed as d_n = (w^n_1, …, w^n_{w_n}, …, w^n_{W_n}), where W_n is the total number of words in d_n; label_n is the intention category of configuration item c_n, and Labels = {Label_i | 1 ≤ i ≤ 7} is the set of intention label categories.
The set of S − N configuration items not selected in step 2.1 is recorded as the unlabeled configuration item set D2 = {<(cc_t, dd_t)> | 1 ≤ t ≤ T}, where T = S − N, cc_t is the t-th configuration item name in D2, and dd_t is the document of configuration item cc_t, which can be expressed as dd_t = (u^t_1, …, u^t_{u_t}, …, u^t_{U_t}), where U_t is the total number of words in dd_t.
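The shape of the two sets can be pictured with a minimal Python sketch (the configuration item names, document words, and label assignments below are illustrative assumptions for exposition, not the patent's actual annotated data):

```python
# Labels = {Label_i | 1 <= i <= 7}: the seven intention categories of step 2.2.
LABELS = {f"Label{i}" for i in range(1, 8)}

# D1 = {<(c_n, d_n), label_n>}: each document d_n is a tuple of words;
# label assignments here are hypothetical examples.
D1 = [
    (("innodb_buffer_pool_size",
      ("size", "of", "the", "buffer", "pool")), "Label4"),   # perf up, cost up
    (("max_connections",
      ("maximum", "number", "of", "client", "connections")), "Label1"),  # perf up, reliability down
]

# D2 = {<(cc_t, dd_t)>}: same shape, but with no intention label yet.
D2 = [
    ("tmp_table_size", ("maximum", "size", "of", "internal", "tables")),
]

assert all(label in LABELS for (_, label) in D1)
```

The augmentation procedure in step 3 will move entries from D2 into D1 once it can infer their labels.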
Third, the configuration item intention data automatic amplification module preprocesses the labeled configuration item set D1, iteratively labels the unlabeled configuration items in the unlabeled configuration item set D2, and amplifies D1 with the newly labeled configuration items, obtaining the amplified labeled configuration item set D. The method is:
3.1 The configuration item intention data automatic amplification module preprocesses D1. The method is:
3.1.1 Define a dictionary-type variable f_label for encoding intention label categories (for the dictionary type see https://docs.python.org/3/c-api/dict.html; a dictionary-type variable dict consists of several key-value pairs (key_1, value_1), …, (key_k, value_k), …, (key_K, value_K) and satisfies dict[key_k] = value_k, where K ≥ 0 is the number of key-value pairs in dict, the dictionary is empty when K = 0, and key_1, …, key_k, …, key_K are distinct and constitute the key set of dict). f_label satisfies f_label[Label_1] = 1, …, f_label[Label_i] = i, …, f_label[Label_7] = 7 (1 ≤ i ≤ 7);
3.1.2 Initialize the word-mapping maximum index: index = 8;
3.1.3 Define a dictionary-type variable f_token for encoding words; initialize f_token as an empty dictionary, i.e., its key set is empty. In subsequent steps, pairs of the form <part of speech, root> will be added to its key set step by step, so that words are encoded according to their part of speech and root;
3.1.4 Encode words and build f_token step by step. The method is:
3.1.4.1 Initialize variable n = 1;
3.1.4.2 Encode the W_n words in d_n to obtain the encoded d'_n. The method is:
3.1.4.2.1 Initialize the word index variable w_n = 1;
3.1.4.2.2 Convert w^n_{w_n} (the w_n-th word of d_n) into the pair <pos^n_{w_n}, root^n_{w_n}>, where pos^n_{w_n} is the part of speech of w^n_{w_n} (e.g., noun, verb, adjective, adverb) and root^n_{w_n} is the root of w^n_{w_n}; for example, write and writes have the same root write.
3.1.4.2.3 Judge whether <pos^n_{w_n}, root^n_{w_n}> is in the key set of f_token. If not, encode w^n_{w_n} as index and add the key-value pair (<pos^n_{w_n}, root^n_{w_n}>, index) into f_token, i.e., let f_token[<pos^n_{w_n}, root^n_{w_n}>] = index, and go to 3.1.4.2.4. If so, encode w^n_{w_n} as the value corresponding to the key <pos^n_{w_n}, root^n_{w_n}>, i.e., encode w^n_{w_n} as f_token[<pos^n_{w_n}, root^n_{w_n}>], a natural number no less than 8 (values 1 to 7 are reserved for intention label codes), and go to 3.1.4.2.5;
3.1.4.2.4 let index be index + 1;
3.1.4.2.5 If w_n = W_n, the encoding of every word in d_n is complete and the encoded form of d_n is obtained: d'_n = (token^n_1, …, token^n_{w_n}, …, token^n_{W_n}), where token^n_{w_n} = f_token[<pos^n_{w_n}, root^n_{w_n}>] is the code of w^n_{w_n}; go to 3.1.4.3. If w_n < W_n, go to 3.1.4.2.6;
3.1.4.2.6 Let w_n = w_n + 1, go to 3.1.4.2.2;
3.1.4.3 If n = N, replace each d_n in D1 with its encoded d'_n, obtaining the preprocessed labeled configuration item set D'1 = {<(c_n, d'_n), label_n> | 1 ≤ n ≤ N, label_n ∈ Labels}, and go to 3.2; if n < N, go to 3.1.4.4;
3.1.4.4 Let n = n + 1, go to 3.1.4.2;
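The word-encoding loop of step 3.1.4 can be sketched in Python as follows. The sketch assumes a simple <part-of-speech, root> analyzer; a real implementation would use an NLP toolkit for tagging and stemming, which is an assumption here, not part of the patent text:

```python
def encode_documents(docs, analyze):
    """Encode every word as an integer id shared by all words with the same
    <part-of-speech, root> pair. Ids 1..7 are reserved for intention labels,
    so word ids start at 8 (the 'index' variable of step 3.1.2)."""
    f_token = {}   # dictionary variable f_token: (pos, root) -> id
    index = 8      # word-mapping index starts after the 7 label codes
    encoded = []
    for doc in docs:
        coded = []
        for word in doc:
            key = analyze(word)        # <pos, root> pair for this word
            if key not in f_token:     # unseen pair: assign a fresh id (3.1.4.2.3)
                f_token[key] = index
                index += 1             # step 3.1.4.2.4
            coded.append(f_token[key])
        encoded.append(tuple(coded))
    return encoded, f_token

# Toy analyzer: treat every word as a noun and strip a trailing "s" as a crude
# root, so "write" and "writes" share one root (the example in 3.1.4.2.2).
toy = lambda w: ("noun", w[:-1] if w.endswith("s") else w)
enc, f_token = encode_documents([("write", "writes", "cache")], toy)
# "write" and "writes" receive the same code; "cache" gets the next one.
```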
3.2 The configuration item intention data automatic amplification module mines sequence patterns from D'1 to obtain the sequence pattern set SP. The method is:
3.2.1 Use D'1 to construct the sequence set SeqDB = {seq_1, …, seq_n, …, seq_N}, where seq_n is the sequence formed by appending the code f_label[label_n] of the intention label of configuration item c_n to the encoded document d'_n, i.e., seq_n = (token^n_1, …, token^n_{W_n}, f_label[label_n]).
3.2.2 Perform sequence pattern mining on SeqDB using the FEAT algorithm from "Efficient mining of frequent sequence generators", published by Chuancong Gao et al. at WWW 2008, obtaining the sequence set P = {p_1, …, p_m, …, p_M}, where M is the total number of sequence patterns. Each p_m is a frequently occurring sequence in the sequence set SeqDB, corresponding to expressions commonly used in configuration documents (frequently occurring words and phrases); p_m = (pp_1, …, pp_x, …, pp_X), where X, the length of p_m, is calculated by the FEAT algorithm, and pp_x, the x-th item of p_m, is the code of a word or an intention label, satisfying 1 ≤ pp_x < index. Specifically, 1 ≤ pp_x ≤ 7 means pp_x is the mapping of intention label Label_{pp_x} under f_label, and 8 ≤ pp_x < index means pp_x is the mapping under f_token of some <pos, root> pair, the form into which a word is converted by step 3.1.4.2.2, i.e., pp_x = f_token[<pos, root>].
3.2.3 Process P: keep the sequences in P that are related to intention categories and compute the support and confidence corresponding to each sequence, obtaining the sequence pattern set SP. The method is:
3.2.3.1 initializing sequence pattern set SP as an empty set;
3.2.3.2 initialization sequence traversal variable m ═ 1;
3.2.3.3 initializing sequence pattern count variable m' ═ 0;
3.2.3.4 Judge whether the last item pp_X of p_m satisfies 1 ≤ pp_X ≤ 7. If so, pp_X is the code of an intention category and p_m is related to determining the intention categories of unlabeled configuration items; go to 3.2.3.5. Otherwise, p_m is unrelated to determining unlabeled configuration item intention categories; go directly to 3.2.3.6;
3.2.3.5 Let m' = m' + 1 and p_{m'} = p_m. Compute the support and confidence of p_{m'} and add the processed sequence pattern to the sequence pattern set SP. The method is:
3.2.3.5.1 Initialize the configuration item subscript loop variable n = 1;
3.2.3.5.2 Initialize the support variable support_{m'} = 0;
3.2.3.5.3 Initialize the match-count variable matched_{m'} = 0, which counts the number of configuration items matched by the pattern;
3.2.3.5.4 Let the intention category reflected by p_{m'} be l_{m'} = Label_{pp_X} (the intention label whose f_label code equals pp_X), and let the sequence pattern related to l_{m'} be pattern_{m'} = (pp_1, …, pp_x, …, pp_{X-1});
3.2.3.5.5 Judge whether pattern_{m'} is a subsequence of d'_n. If so, a matching sequence is found: let matched_{m'} = matched_{m'} + 1 and go to 3.2.3.5.6. If not, go to 3.2.3.5.7;
3.2.3.5.6 If l_{m'} = label_n, the intention label is matched correctly along with the sequence: let support_{m'} = support_{m'} + 1 and go to 3.2.3.5.7. If l_{m'} ≠ label_n, a sequence is matched but its intention label is not: go to 3.2.3.5.7;
3.2.3.5.7 If n = N, go to 3.2.3.5.9; if n < N, go to 3.2.3.5.8;
3.2.3.5.8 Let n = n + 1, go to 3.2.3.5.5;
3.2.3.5.9 Compute the confidence of p_{m'}: confidence_{m'} = support_{m'} / matched_{m'} (the FEAT algorithm guarantees that p_{m'} is a subsequence of at least one sequence in SeqDB, so matched_{m'} ≥ 1 always). Denote the processed sequence pattern as Pattern_{m'} = (pattern_{m'}, l_{m'}, confidence_{m'}) and add Pattern_{m'} to the sequence pattern set SP;
3.2.3.6 If m = M, the sequence pattern set SP = {Pattern_{m'} | 1 ≤ m' ≤ M'} is obtained, where M' is the total number of patterns in SP and M' ≤ M; go to 3.3. If m < M, let m = m + 1 and go to 3.2.3.4;
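The support/confidence computation of step 3.2.3 can be sketched as follows (a stand-in with hypothetical helper names, not the patent's code; the toy document codes are made up):

```python
def is_subsequence(pattern, seq):
    """True if pattern occurs in seq in order, not necessarily contiguously."""
    it = iter(seq)
    return all(item in it for item in pattern)   # 'in' consumes the iterator

def score_pattern(p, labeled_docs):
    """p = (pp_1, .., pp_X); labeled_docs = [(encoded_doc, label_code), ...].
    Returns (body, label_code, confidence), or None when the last item of p
    is not an intention-label code (step 3.2.3.4)."""
    *body, label_code = p
    if not 1 <= label_code <= 7:   # pattern says nothing about intention
        return None
    matched = support = 0
    for doc, label in labeled_docs:
        if is_subsequence(body, doc):   # step 3.2.3.5.5: sequence match
            matched += 1
            if label == label_code:     # step 3.2.3.5.6: label also agrees
                support += 1
    # FEAT guarantees matched >= 1 for mined patterns (step 3.2.3.5.9)
    return tuple(body), label_code, support / matched

# Toy encoded documents with label codes: two docs labeled 3, one labeled 5.
docs = [((8, 9, 10), 3), ((8, 11, 9), 3), ((9, 12), 5)]
result = score_pattern((8, 9, 3), docs)   # body (8, 9) carrying label code 3
```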
3.3 The configuration item intention data automatic amplification module encodes D2. The method is:
3.3.1 initializing variable t ═ 1;
3.3.2 Encode the U_t words in dd_t. The method is:
3.3.2.1 initialize word index variable ut=1;
3.3.2.2 Convert u^t_{u_t} (the u_t-th word of dd_t) into the pair <pos^t_{u_t}, root^t_{u_t}>, where pos^t_{u_t} is the part of speech of u^t_{u_t} (e.g., noun, verb, adjective, adverb) and root^t_{u_t} is the root of u^t_{u_t}.
3.3.2.3 Judge whether <pos^t_{u_t}, root^t_{u_t}> is in the key set of f_token. If so, encode u^t_{u_t} as f_token[<pos^t_{u_t}, root^t_{u_t}>] and go to 3.3.2.4. If not, f_token cannot encode u^t_{u_t}; directly encode u^t_{u_t} as 0 and go to 3.3.2.4;
3.3.2.4 If u_t = U_t, the encoding of dd_t is complete and the encoded form of dd_t is obtained: dd'_t = (utoken^t_1, …, utoken^t_{u_t}, …, utoken^t_{U_t}), where utoken^t_{u_t} is the code of u^t_{u_t} obtained in 3.3.2.3; go to 3.3.3. If u_t < U_t, let u_t = u_t + 1 and go to 3.3.2.2;
3.3.3 If t = T, take each binary tuple (cc_t, dd'_t), corresponding to <(cc_t, dd_t)> in D2, into the encoded unlabeled configuration item set D'2, obtaining D'2 = {(cc_t, dd'_t) | 1 ≤ t ≤ T}; go to 3.4. If t < T, go to 3.3.4;
3.3.4 Let t = t + 1, go to 3.3.2;
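Step 3.3 differs from the training-set preprocessing only in its handling of unseen words, which can be sketched briefly (helper names and the toy analyzer are assumptions, not from the patent):

```python
def encode_unlabeled(doc, f_token, analyze):
    """Encode an unlabeled document with the f_token dictionary built during
    preprocessing; <part-of-speech, root> pairs never seen in the labeled set
    get code 0, as specified in step 3.3.2.3."""
    return tuple(f_token.get(analyze(w), 0) for w in doc)

# Toy analyzer matching the earlier preprocessing sketch: every word a noun,
# trailing "s" stripped as a crude root.
toy = lambda w: ("noun", w[:-1] if w.endswith("s") else w)
f_token = {("noun", "write"): 8, ("noun", "cache"): 9}   # assumed prior state
dd_encoded = encode_unlabeled(("writes", "cache", "latency"), f_token, toy)
# "latency" was never seen in the labeled set, so it is encoded as 0.
```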
3.4 The configuration item intention data automatic amplification module labels D'2 using SP. The method is:
3.4.1 setting a confidence threshold, with 0< threshold ≦ 1, which is preferably set to 0.7< threshold ≦ 1;
3.4.2 initializing variable t ═ 1;
3.4.3 Initialize the set R1 of configuration items with labels as an empty set;
3.4.4 Initialize the set R2 of configuration items without labels as an empty set;
3.4.5 Initialize the dictionary-type variable selector for selecting an intention label for the t-th unlabeled configuration item: let selector[Label_1] = 0, …, selector[Label_i] = 0, …, selector[Label_7] = 0, where selector[Label_i] denotes the confidence of labeling the t-th unlabeled configuration item as Label_i;
3.4.6 update selector according to pattern set SP obtained by 3.2, the method is:
3.4.6.1 initializing variable m' ═ 1;
3.4.6.2 Read confidence_{m'}, l_{m'}, and pattern_{m'} from the sequence pattern Pattern_{m'}. If confidence_{m'} ≥ threshold, go to 3.4.6.3 to judge whether the pattern matches; if confidence_{m'} < threshold, Pattern_{m'} does not meet the confidence requirement, go to 3.4.6.5;
3.4.6.3 If pattern_{m'} is a subsequence of dd'_t, the pattern matches; go to 3.4.6.4. If not, go to 3.4.6.5;
3.4.6.4 If confidence_{m'} > selector[l_{m'}], update selector[l_{m'}], i.e., let selector[l_{m'}] = confidence_{m'}, and go to 3.4.6.5; otherwise go directly to 3.4.6.5;
3.4.6.5 If m' = M', all sequence patterns have been traversed and the update of selector is complete; go to 3.4.7. If m' < M', let m' = m' + 1 and go to 3.4.6.2;
3.4.7 Select a label for dd'_t according to selector. The method is:
3.4.7.1 Initialize the candidate label LC_t = Label_1;
3.4.7.2 Initialize the label subscript variable i = 2;
3.4.7.3 If selector[Label_i] > selector[LC_t], the confidence of choosing Label_i as the label is higher than that of choosing LC_t, so let LC_t = Label_i and go to 3.4.7.4; if selector[Label_i] ≤ selector[LC_t], go directly to 3.4.7.4;
3.4.7.4 If i = 7, go to 3.4.7.5; if i < 7, let i = i + 1 and go to 3.4.7.3;
3.4.7.5 If selector[LC_t] > 0, take LC_t as the intention label of the t-th unlabeled configuration item and add <(cc_t, dd_t), LC_t> to R1; go to 3.4.8. If selector[LC_t] = 0, no pattern in SP matches dd'_t, so no intention label is selected for the t-th unlabeled configuration item; add <(cc_t, dd_t)> to R2 and go to 3.4.8;
3.4.8 if T is T, the set D of configuration items not marked is completed2Is marked to obtain R1And R2Turning to 3.4.10; if t<T, go to 3.4.9;
3.4.9 converting t to t +1 to 3.4.5;
3.4.10 determination of R1If the result is an empty set, finishing pair D1The iterative amplification is terminated to obtain an amplified labeled configuration item set, and 3.4.12 is transferred; if not, turning to 3.4.11;
3.4.11 order D1=D1+R1Let D2=R2And then, rotating to 3.1;
3.4.12 set D of labeled configuration items at the time of this step1The set of label placement items after amplification is denoted as D ═<(cn′,dn′),labeln′>|1≤n′≤N′,labeln′E.g. Labels, wherein dn′As configuration item cn′Description of (1), labeln′As configuration item cn′N' is the number of configuration items in the amplified labeled configuration item set D. N' is more than or equal to N.
Fourth, train the configuration item preselection module of the performance-tuning-oriented software configuration item preselection system using the amplified labeled configuration item set D. The method for training the configuration item preselection module is:
4.1 Use the N′ configuration documents d1, …, dn′, …, dN′ in D as a training set and, following the TF-IDF method of the article "Using TF-IDF to determine word relevance in document queries" published by Ramos et al. at the 1st Instructional Conference on Machine Learning (2003), train the TF-IDF encoder in the configuration item preselection module to encode configuration item documents; the encoder takes a sentence as input and outputs the vector corresponding to the sentence;
4.2 Encode the N′ documents in D with encoder to obtain the encoded vector set V′, as follows:
4.2.1 Initialize the vector set V′ as an empty set;
4.2.2 Initialize the loop index variable n′ = 1;
4.2.3 Use encoder to encode dn′ as the n′-th vector vn′;
4.2.4 Add vn′ to V′;
4.2.5 If n′ = N′, the encoding of the N′ configuration item documents in D is complete, yielding the encoded vector set V′; go to 4.3. If n′ < N′, let n′ = n′ + 1 and go to 4.2.3;
4.3 Use the training set {<vn′, labeln′> | 1 ≤ n′ ≤ N′} and the hierarchical random forest algorithm proposed in "Comparing the performance of flat and hierarchical habitat/land-cover classification models in a NATURA 2000 site" published by Yoni Gavish et al. in the ISPRS Journal of Photogrammetry and Remote Sensing (vol. 136, 2018) to train the configuration item preselection model RF, obtaining the configuration item preselection model parameters.
Fifth, the trained configuration item preselection module preselects configuration items for the target software, obtaining the preselected configuration item set. The target software configuration item data set is DT = {<dtca, dta> | 1 ≤ a ≤ A}, where A is the number of configuration items in the target software, dtca is the name of the a-th configuration item and dta is the document of the a-th configuration item. The method is:
5.1 Use the encoder obtained by the training in 4.1 to encode the A configuration item documents of the target software; denote the encoded vector set of the target software as Vdt. The method is:
5.1.1 Initialize the vector set Vdt of the target software as an empty set;
5.1.2 Initialize the loop index variable a = 1;
5.1.3 Use encoder to encode dta as the a-th vector vva of the target software;
5.1.4 Add vva to Vdt;
5.1.5 If a = A, the encoding of the A configuration item documents in DT is complete, yielding the encoded vector set Vdt of the target software; go to 5.2. If a < A, let a = a + 1 and go to 5.1.3;
5.2 The trained configuration item preselection module generates the corresponding intent label for the vector of each configuration item in Vdt, obtaining the predicted intent label list O, as follows:
5.2.1 Initialize the predicted intent label list O as an empty list;
5.2.2 Initialize the loop index variable a = 1;
5.2.3 Input vva into the model RF of the trained configuration item preselection module to predict the intent label oa of the a-th configuration item of the target software, as follows:
5.2.3.1 Initialize the candidate intent label oa of the a-th configuration item: let oa = Label7;
5.2.3.2 Input vva into the model RF of the trained configuration item preselection module to obtain the first-layer output [pprob, npprob] and the second-layer output [prob1, prob2, prob3, prob4, prob5, prob6], where pprob is the probability that the configuration item to be predicted is performance-related, npprob is the probability that it is not performance-related, and probi is the probability that the intent label of the configuration item to be predicted is Labeli;
5.2.3.3 If pprob < npprob, RF predicts that the probability that the configuration item is not performance-related is greater than the probability that it is performance-related; let oa = Label7 and go to 5.2.4. If pprob ≥ npprob, RF predicts that the configuration item is performance-related; go to 5.2.3.4 to further determine which other user intent the configuration item affects besides performance, i.e. determine which of Label1, …, Labeli, …, Label6 is the intent label of the configuration item;
5.2.3.4 Determine the intent label of the performance-related configuration item, as follows:
5.2.3.4.1 Initialize the candidate intent label index ci = 1;
5.2.3.4.2 Initialize the loop index variable i = 1;
5.2.3.4.3 If probi > probci, let ci = i and go to 5.2.3.4.4; otherwise, go directly to 5.2.3.4.4;
5.2.3.4.4 If i = 6, the traversal of the RF second-layer output is complete; let oa = Labelci and go to 5.2.4. If i < 6, let i = i + 1 and go to 5.2.3.4.3;
5.2.4 Add oa to the predicted intent label list O;
5.2.5 If a = A, the prediction of all configuration items in DT is complete, yielding the predicted intent label list O; go to 5.3. If a < A, let a = a + 1 and go to 5.2.3;
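The two-layer decision of steps 5.2.3.2 to 5.2.3.4 can be sketched as follows; the function below is a minimal illustration (the name predict_intent is ours, not from the patent), assuming the two layer outputs are already available:

```python
# Sketch of steps 5.2.3.2-5.2.3.4: combine the two RF output layers into
# one intent label. Layer 1 gives (pprob, npprob): the probability that
# the configuration item is / is not performance-related; layer 2 gives
# [prob1..prob6] over Label1..Label6.

def predict_intent(pprob, npprob, probs):
    """Return 'Label1'..'Label7' for one configuration item."""
    if pprob < npprob:          # more likely NOT performance-related
        return "Label7"
    # performance-related: argmax over the six side-effect labels
    ci = max(range(len(probs)), key=lambda i: probs[i])
    return f"Label{ci + 1}"

print(predict_intent(0.2, 0.8, [0.1, 0.2, 0.3, 0.1, 0.2, 0.1]))  # Label7
print(predict_intent(0.9, 0.1, [0.1, 0.2, 0.3, 0.1, 0.2, 0.1]))  # Label3
```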
5.3 Classify the configuration items according to their intent labels to obtain the sets of configuration items with the same intent category, as follows:
5.3.1 Initialize the configuration item sets corresponding to the 7 intent labels as empty sets, i.e. let C_Label1 = ∅, …, C_Labeli = ∅, …, C_Label7 = ∅, where C_Labeli is the configuration item set corresponding to the i-th intent label;
5.3.2 Initialize the loop index variable a = 1;
5.3.3 According to the intent label oa of the a-th configuration item, add the name dtca of the a-th configuration item to the corresponding configuration item set C_oa;
5.3.4 If a < A, let a = a + 1 and go to 5.3.3. If a = A, the classification of all configuration items in DT is complete, yielding the preselected configuration item set C = {C_Label1, …, C_Labeli, …, C_Label7}, where C_Labeli = {dtc_{i,1}, …, dtc_{i,j}, …, dtc_{i,Ji}} is the set of configuration items whose intent label predicted by the trained preselection model RF is Labeli, and Ji is the total number of configuration items whose predicted intent label is Labeli.
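The grouping in step 5.3 is a straightforward bucketing of configuration item names by predicted label. A minimal sketch (function name and the sample configuration item names are ours, for illustration only):

```python
# Sketch of step 5.3: bucket target-software configuration item names by
# their predicted intent labels O, yielding one set per label category.

def group_by_label(names, labels):
    """names: [dtc_1..dtc_A]; labels: predicted list O, same length."""
    groups = {f"Label{i}": set() for i in range(1, 8)}  # 7 empty sets
    for name, label in zip(names, labels):
        groups[label].add(name)
    return groups

C = group_by_label(["innodb_buffer_pool_size", "ssl_cipher", "log_error"],
                   ["Label6", "Label2", "Label7"])
print(C["Label6"])  # {'innodb_buffer_pool_size'}
```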
So far, the trained configuration item pre-selection module completes the goal of pre-selecting configuration items according to the intention categories.
The user can select the configuration item set corresponding to the appropriate intent category for tuning according to the intent of the performance tuning. For example, when the user needs to guarantee the reliability of the software during performance tuning: the configuration items in C_Label1 can bring performance improvement but reduce software reliability, contrary to the user's intent of guaranteeing reliability; the configuration items in C_Label7 are irrelevant to software performance, and adjusting them has no influence on performance, contrary to the intent of performance tuning; therefore, the user can perform performance tuning on the configuration items in the preselected sets C_Label2, …, C_Label6 to meet the performance tuning requirement.
Compared with the prior art, the invention can achieve the following beneficial effects:
1. With the invention, configuration item preselection can be performed on target software according to its configuration documents, without performance-testing the configuration items; the result is independent of the performance test load, so preselection is more lightweight, the time consumed by performance tuning is reduced, and the problem of incomplete preselected configuration items caused by the load limitations of the prior art is greatly alleviated.
2. With the invention, configuration items vital to performance can be preselected while also considering the diverse intents of users during performance tuning, preselecting corresponding configuration items for different intent categories. Compared with the background-art method ("Too Many Knobs to Tune? Towards Faster Database Tuning by Pre-selecting Important Knobs", published by Konstantinos Kanellis et al. at HotStorage 2020), which only concerns performance and ignores the multi-intent characteristics of users, the invention comprehensively considers the potential influence on other intents brought by performance tuning, so that users tuning with the preselected configuration items can satisfy their intent regarding performance.
3. The invention can be used for automatically amplifying the data set. The third step of the invention provides a method for mining sequence patterns from labeled data and amplifying unlabeled data, which can effectively reduce the manpower and time consumed in the data labeling process. Experiments prove that when the labeled data accounts for 20% of the total data (s = 0.2) and the confidence threshold is set to 0.85 (threshold = 0.85), 59.4% of the unlabeled data can be amplified with an accuracy of 86.4%, greatly reducing the labor and time consumed in data labeling and improving labeling efficiency. Compared with the prior art, the method can greatly reduce the dependence of model training on labeled data while keeping the accuracy at the same level as the prior art.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a logical block diagram of a first step of the present invention to build a multiple intent sensitive software configuration item preselection system;
FIG. 3 is a flow chart of the third step of the present invention, in which the configuration item intent data automatic amplification module preprocesses the labeled configuration item set D1, iteratively labels the unlabeled configuration items in the unlabeled configuration item set D2, and amplifies D1 with the newly labeled configuration items to obtain the amplified labeled configuration item set D.
Detailed Description
The present invention will be described with reference to the accompanying drawings.
As shown in fig. 1, the present invention comprises the steps of:
First, construct the software configuration item preselection system oriented to performance tuning. As shown in fig. 2, the system consists of a configuration item intent data automatic amplification module and a configuration item preselection module.
The automatic configuration item intention data amplification module is connected with the configuration item preselection module and is also connected with data set source software. The data set source software comprises two parts: a set of annotated configuration items and a set of unlabeled configuration items. The annotation configuration item set refers to a data set constructed by performing intention type annotation on configuration items according to each configuration item document in a manual annotation mode. The configuration item intention data automatic amplification module preprocesses a labeling configuration item set, labels the unlabeled configuration items in the unlabeled configuration item set, adds newly labeled data into the labeling configuration item set from the unlabeled configuration item set until the number of configuration items in the labeling configuration item set is not changed any more, obtains an amplified labeling configuration item set, and sends the amplified labeling configuration item set to the configuration item preselection module.
The configuration item preselection module is connected with the configuration item intention data automatic amplification module, and receives the amplified annotation configuration item set from the configuration item intention data automatic amplification module. The configuration item preselection module comprises a TF-IDF coder encoder and a configuration item preselection model RF. The encoder encodes sentences in the configuration item documents to obtain vectors corresponding to the sentences; and the RF is a random forest model with a two-layer structure, and the model is trained by using the amplified label configuration item set to obtain parameters of the random forest model. The configuration item pre-selection module classifies the configuration items of the target software according to the configuration data of the target software, pre-selects the configuration items corresponding to different intention categories, and obtains a pre-selected configuration item set.
Second, randomly select some configuration items from the configuration item set D0 of the data set source software and label their intents, obtaining the labeled configuration item set D1:
2.1 The data set source software comprises 13 types of software: MySQL, Cassandra, MariaDB, Apache-Httpd, Nginx, Hadoop-Common, MapReduce, Apache-Flink, HDFS, Keystone, Nova, GCC and Clang. Configuration items are selected from the data set source software according to the following conditions: 1) the software belongs to server-side software, which generally has higher requirements on performance, reliability, security and the like, and is conducive to studying the influence of configuration items on the software; 2) the software has a large user base and more than 2,000 stars on GitHub, the world's largest code hosting platform, so labeling its configuration items has greater impact; 3) the software has more than 100 configuration items, so performance tuning is more needed. From the configuration item set D0, consisting of more than 7 thousand configuration items of software satisfying all 3 conditions simultaneously, configuration items are manually selected at random in proportion s (where s ≥ 0.2). The total number of configuration items is denoted S and the number of randomly selected configuration items is N, where N = S × s rounded to an integer.
2.2 According to the official document descriptions of the selected configuration items, label the intents of the N configuration items to obtain the labeled configuration item set D1. The method is: according to the document description of a configuration item, if adjusting the configuration item can improve software performance but the improvement simultaneously reduces software reliability, the intent label of the configuration item is Label1; if adjusting it can improve performance but the improvement reduces software security, its intent label is Label2; if adjusting it can improve performance but the improvement degrades software functionality, its intent label is Label3; if adjusting it can improve performance but the improvement increases the cost of using the software, its intent label is Label4; if adjusting it can improve performance but the improvement degrades performance for other users of the software, its intent label is Label5; if adjusting it can improve software performance without causing any of the first five side effects, its intent label is Label6; if adjusting it does not affect software performance, its intent label is Label7.
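The seven intent categories of step 2.2 can be summarized as a small lookup table; the wording below paraphrases the patent's definitions:

```python
# The 7 intent label categories of step 2.2: Label1-Label5 trade a
# performance gain against a specific side effect, Label6 is a gain with
# none of those side effects, Label7 is performance-irrelevant.
INTENT_LABELS = {
    "Label1": "improves performance but reduces reliability",
    "Label2": "improves performance but reduces security",
    "Label3": "improves performance but degrades functionality",
    "Label4": "improves performance but increases usage cost",
    "Label5": "improves performance but degrades it for other users",
    "Label6": "improves performance with none of the above side effects",
    "Label7": "does not affect performance",
}
print(len(INTENT_LABELS))  # 7
```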
2.3 The labeled configuration item set is D1 = {<(cn, dn), labeln> | 1 ≤ n ≤ N, labeln ∈ Labels}, where cn is the name of the n-th configuration item in D1 and dn is the document of configuration item cn; dn can be expressed as the word sequence dn = (word_{n,1}, …, word_{n,wn}, …, word_{n,Wn}), where Wn is the total number of words in dn; labeln is the intent category of configuration item cn, and Labels = {Labeli | 1 ≤ i ≤ 7} is the set of intent label categories.
Denote the set of the T = S − N configuration items not selected in step 2.1 as the unlabeled configuration item set D2 = {<(cct, ddt)> | 1 ≤ t ≤ T}, where cct is the name of the t-th configuration item in D2 and ddt is the document of configuration item cct; ddt can be expressed as the word sequence ddt = (word_{t,1}, …, word_{t,ut}, …, word_{t,Ut}), where Ut is the total number of words in ddt.
Third, the configuration item intent data automatic amplification module preprocesses the labeled configuration item set D1, iteratively labels the unlabeled configuration items in the unlabeled configuration item set D2, and amplifies D1 with the newly labeled configuration items to obtain the amplified labeled configuration item set D, with the method shown in fig. 3:
3.1 The configuration item intent data automatic amplification module preprocesses D1, as follows:
3.1.1 Define the dictionary-type variable flabel for encoding intent label categories, satisfying flabel[Label1] = 1, …, flabel[Labeli] = i, …, flabel[Label7] = 7 (1 ≤ i ≤ 7);
3.1.2 Initialize the maximum word-mapping index: index = 8;
3.1.3 Define the dictionary-type variable ftoken for encoding words; initialize ftoken as an empty dictionary, i.e. its key set is an empty set; in the subsequent steps, <part of speech, root word> tuples will be added to the key set step by step, encoding words according to their part of speech and root;
3.1.4 Encode words and build ftoken step by step, as follows:
3.1.4.1 Initialize the variable n = 1;
3.1.4.2 Encode each of the Wn words in dn to obtain the encoded d′n, as follows:
3.1.4.2.1 Initialize the word index variable wn = 1;
3.1.4.2.2 Convert word_{n,wn} into the tuple <pos_{n,wn}, root_{n,wn}>, where pos_{n,wn} is the part of speech of word_{n,wn} and root_{n,wn} is its root word.
3.1.4.2.3 Judge whether <pos_{n,wn}, root_{n,wn}> is in ftoken. If not, encode <pos_{n,wn}, root_{n,wn}> as index, add the key-value pair <<pos_{n,wn}, root_{n,wn}>, index> to ftoken, let token_{n,wn} = index, and go to 3.1.4.2.4. If so, encode <pos_{n,wn}, root_{n,wn}> as the value corresponding to the key <pos_{n,wn}, root_{n,wn}>, i.e. let token_{n,wn} = ftoken[<pos_{n,wn}, root_{n,wn}>], and go to 3.1.4.2.5;
3.1.4.2.4 Let index = index + 1;
3.1.4.2.5 If wn = Wn, the encoding of each word in dn is complete, and the encoded d′n = (token_{n,1}, …, token_{n,wn}, …, token_{n,Wn}) is obtained; go to 3.1.4.3. If wn < Wn, go to 3.1.4.2.6;
3.1.4.2.6 Let wn = wn + 1 and go to 3.1.4.2.2;
3.1.4.3 If n = N, replace each dn in D1 by its encoding d′n to obtain the preprocessed labeled configuration item set D′1 = {<(cn, d′n), labeln> | 1 ≤ n ≤ N, labeln ∈ Labels}, and go to 3.2. If n < N, go to 3.1.4.4;
3.1.4.4 Let n = n + 1 and go to 3.1.4.2.
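The word encoding of step 3.1.4 (and the fallback-to-0 lookup used later in step 3.3) can be sketched as follows. In practice the part of speech and root word would come from an NLP toolkit; the analyze stub below is a crude stand-in we invented for illustration:

```python
# Sketch of step 3.1.4: encode words as <part of speech, root> tuples,
# assigning fresh codes starting at index = 8 (codes 1..7 are reserved
# for the intent labels via f_label).

def analyze(word):
    # Stand-in for a real POS tagger + stemmer: crude suffix stripping.
    pos = "verb" if word.endswith("ing") else "noun"
    root = word[:-3] if word.endswith("ing") else word
    return (pos, root)

def encode_docs(docs, f_token, index=8):
    """Encode labeled documents, growing f_token (step 3.1.4)."""
    encoded = []
    for doc in docs:
        out = []
        for word in doc:
            key = analyze(word)
            if key not in f_token:      # 3.1.4.2.3: new tuple -> new code
                f_token[key] = index
                index += 1
            out.append(f_token[key])
        encoded.append(out)
    return encoded, index

def encode_unlabeled(doc, f_token):
    """Step 3.3: unknown <pos, root> tuples are encoded as 0."""
    return [f_token.get(analyze(w), 0) for w in doc]

f_token = {}
d_enc, index = encode_docs([["caching", "improves", "throughput"]], f_token)
print(d_enc)                                              # [[8, 9, 10]]
print(encode_unlabeled(["caching", "latency"], f_token))  # [8, 0]
```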
3.2 The configuration item intent data automatic amplification module mines sequence patterns from D′1 to obtain the sequence pattern set SP, as follows:
3.2.1 Use D′1 to construct the sequence set SeqDB = {seq1, …, seqn, …, seqN}, where seqn is the sequence formed by concatenating d′n, the encoding of the document dn of configuration item cn, with flabel(labeln), the code of the intent label labeln of cn, i.e. seqn = (token_{n,1}, …, token_{n,wn}, …, token_{n,Wn}, flabel(labeln));
3.2.2 Perform sequence pattern mining on SeqDB using the FEAT algorithm from "Efficient mining of frequent sequence generators" published by Chuancong Gao et al. at WWW 2008, obtaining the sequence set P = {p1, …, pm, …, pM}, where M is the total number of sequence patterns and pm is a frequently occurring sequence in SeqDB. pm = (pp1, …, ppx, …, ppX) corresponds to common expressions in configuration documents, such as frequently occurring words and phrases; X is the number of items of pm, computed by the FEAT algorithm; ppx, the x-th item of pm, is the code corresponding to a word or an intent label and satisfies 1 ≤ ppx < index. Specifically, 1 ≤ ppx ≤ 7 means that ppx is the flabel mapping of an intent label Labeli, and 8 ≤ ppx < index means that ppx is the ftoken mapping of some tuple <pos, root> obtained by the conversion of step 3.1.4.2.2, i.e. ppx = ftoken[<pos, root>];
3.2.3 Process P, retaining the sequences related to intent categories, and compute the support and confidence of each retained sequence to obtain the sequence pattern set SP, as follows:
3.2.3.1 Initialize the sequence pattern set SP as an empty set;
3.2.3.2 Initialize the sequence traversal variable m = 1;
3.2.3.3 Initialize the sequence pattern count variable m′ = 0;
3.2.3.4 Judge whether the last item ppX of pm satisfies 1 ≤ ppX ≤ 7. If so, ppX is the code of an intent category and pm is relevant to determining the intent categories of unlabeled configuration items; go to 3.2.3.5. Otherwise, pm is irrelevant to determining the unlabeled configuration item intent category; go directly to 3.2.3.6;
3.2.3.5 Let m′ = m′ + 1 and let pm′ = pm. Compute the confidence of pm′ and add the processed sequence pattern to the sequence pattern set SP, as follows:
3.2.3.5.1 Initialize the configuration item index loop variable n = 1;
3.2.3.5.2 Initialize the support variable support_m′ = 0;
3.2.3.5.3 Initialize the match-count variable matched_m′ = 0, which counts the number of configuration items matched by the pattern;
3.2.3.5.4 Let l_m′ be the intent category corresponding to pm′, i.e. the intent label whose flabel code is the last item ppX, and let the l_m′-related sequence pattern reflected by pm′ be pattern_m′ = (pp1, …, ppx, …, ppX−1);
3.2.3.5.5 Judge whether pattern_m′ is a subsequence of d′n. If so, a matching sequence is found: let matched_m′ = matched_m′ + 1 and go to 3.2.3.5.6. If not, go to 3.2.3.5.7;
3.2.3.5.6 If l_m′ = labeln, the intent label is matched correctly along with the sequence: let support_m′ = support_m′ + 1 and go to 3.2.3.5.7. If l_m′ ≠ labeln, the sequence matches but its corresponding intent label does not; go to 3.2.3.5.7;
3.2.3.5.7 If n = N, go to 3.2.3.5.9; if n < N, go to 3.2.3.5.8;
3.2.3.5.8 Let n = n + 1 and go to 3.2.3.5.5;
3.2.3.5.9 Compute the confidence of pm′: confidence_m′ = support_m′ / matched_m′ (the FEAT algorithm guarantees that pm′ is a subsequence of at least one sequence in SeqDB, i.e. matched_m′ ≥ 1 always holds). Record the processed sequence pattern as Pattern_m′ = (pattern_m′, l_m′, confidence_m′) and add Pattern_m′ to the sequence pattern set SP;
3.2.3.6 If m = M, the sequence pattern set SP = {Pattern_m′ | 1 ≤ m′ ≤ M′} is obtained, where M′ (M′ ≤ M) is the total number of patterns in SP; go to 3.3. If m < M, let m = m + 1 and go to 3.2.3.4;
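The support/confidence computation of step 3.2.3.5 reduces to counting, over the labeled sequences, how often a pattern matches a document and how often the pattern's label also matches. A minimal sketch (FEAT itself, which generates the candidate patterns, is not reproduced; the candidate pattern and toy data below are hand-picked for illustration):

```python
# Sketch of step 3.2.3.5: given a candidate pattern and its intent-label
# code, confidence = (#docs matching pattern AND label) / (#docs matching
# pattern). A pattern matches a document if it is an order-preserving,
# not necessarily contiguous, subsequence of it.

def is_subsequence(pattern, seq):
    it = iter(seq)
    return all(any(p == x for x in it) for p in pattern)

def pattern_confidence(pattern, label, labeled_docs):
    """labeled_docs: list of (encoded_doc, label_code) pairs."""
    matched = sum(1 for d, _ in labeled_docs if is_subsequence(pattern, d))
    support = sum(1 for d, l in labeled_docs
                  if is_subsequence(pattern, d) and l == label)
    return support / matched if matched else 0.0

seqdb = [([8, 9, 12], 6), ([8, 11, 9], 6), ([8, 9, 13], 7)]
print(pattern_confidence([8, 9], 6, seqdb))  # 2 of 3 matches -> 0.666...
```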
3.3 The configuration item intent data automatic amplification module encodes D2, as follows:
3.3.1 Initialize the variable t = 1;
3.3.2 Encode the Ut words in ddt, as follows:
3.3.2.1 Initialize the word index variable ut = 1;
3.3.2.2 Convert word_{t,ut} into the tuple <pos_{t,ut}, root_{t,ut}>, where pos_{t,ut} is the part of speech of word_{t,ut} (such as noun, verb, adjective or adverb) and root_{t,ut} is its root word.
3.3.2.3 Judge whether <pos_{t,ut}, root_{t,ut}> is in ftoken. If so, encode word_{t,ut} as ftoken[<pos_{t,ut}, root_{t,ut}>] and go to 3.3.2.4. If not, word_{t,ut} cannot be encoded with ftoken; encode it directly as 0 and go to 3.3.2.4;
3.3.2.4 If ut = Ut, the encoding of ddt is complete, and the encoded dd′t = (token_{t,1}, …, token_{t,ut}, …, token_{t,Ut}) is obtained; go to 3.3.3. If not, let ut = ut + 1 and go to 3.3.2.2;
3.3.3 If t = T, use the tuple (cct, dd′t) in place of <(cct, ddt)> in D2 to obtain the encoded unlabeled configuration item set D′2 = {(cct, dd′t) | 1 ≤ t ≤ T}, and go to 3.4. If t < T, go to 3.3.4;
3.3.4 Let t = t + 1 and go to 3.3.2;
3.4 The configuration item intent data automatic amplification module labels D′2 using SP, as follows:
3.4.1 Set the confidence threshold such that 0 < threshold ≤ 1; preferably, 0.7 < threshold ≤ 1;
3.4.2 Initialize the variable t = 1;
3.4.3 Initialize the labeled configuration item set R1 as an empty set;
3.4.4 Initialize the unlabeled configuration item set R2 as an empty set;
3.4.5 Initialize the dictionary-type variable selector, used to select an intent label for the t-th unlabeled configuration item: let selector[Label1] = 0, …, selector[Labeli] = 0, …, selector[Label7] = 0, where selector[Labeli] denotes the confidence that the t-th unlabeled configuration item is labeled Labeli;
3.4.6 Update selector according to the sequence pattern set SP obtained in 3.2, as follows:
3.4.6.1 Initialize the variable m′ = 1;
3.4.6.2 Read confidence_m′, l_m′ and pattern_m′ from the sequence pattern Pattern_m′. If confidence_m′ ≥ threshold, go to 3.4.6.3 to judge whether the pattern matches; if confidence_m′ < threshold, Pattern_m′ does not meet the confidence requirement, go to 3.4.6.5;
3.4.6.3 If pattern_m′ is a subsequence of dd′t, the pattern matches; go to 3.4.6.4. If not, go to 3.4.6.5;
3.4.6.4 If confidence_m′ > selector[l_m′], update selector[l_m′], i.e. let selector[l_m′] = confidence_m′, and go to 3.4.6.5; otherwise, go directly to 3.4.6.5;
3.4.6.5 If m′ = M′, all sequence patterns have been traversed and the update of selector is complete; go to 3.4.7. If m′ < M′, let m′ = m′ + 1 and go to 3.4.6.2;
3.4.7 Select a label for dd′t according to selector, as follows:
3.4.7.1 Initialize the candidate label LCt = Label1;
3.4.7.2 Initialize the label index variable i = 2;
3.4.7.3 If selector[Labeli] > selector[LCt], the confidence of choosing Labeli as the label is higher than that of choosing LCt, so let LCt = Labeli and go to 3.4.7.4; if selector[Labeli] ≤ selector[LCt], go directly to 3.4.7.4;
3.4.7.4 If i = 7, go to 3.4.7.5; if i < 7, let i = i + 1 and go to 3.4.7.3;
3.4.7.5 If selector[LCt] > 0, take LCt as the intent label of the t-th unlabeled configuration item, add <(cct, ddt), LCt> to R1, and go to 3.4.8. If selector[LCt] = 0, no pattern matching dd′t was found in SP, so no intent label is selected for the t-th unlabeled configuration item; add <(cct, ddt)> to R2 and go to 3.4.8;
3.4.8 If t = T, the labeling of the unlabeled configuration item set D2 is complete, yielding R1 and R2; go to 3.4.10. If t < T, go to 3.4.9;
3.4.9 Let t = t + 1 and go to 3.4.5;
3.4.10 Judge whether R1 is an empty set. If so, the iterative amplification of D1 terminates and the amplified labeled configuration item set is obtained; go to 3.4.12. If not, go to 3.4.11;
3.4.11 Let D1 = D1 + R1 and D2 = R2, then go to 3.1;
3.4.12 Record the labeled configuration item set D1 at this step as the amplified labeled configuration item set D = {<(cn′, dn′), labeln′> | 1 ≤ n′ ≤ N′, labeln′ ∈ Labels}, where dn′ is the document of configuration item cn′ and labeln′ is the intent label of configuration item cn′; N′ is the number of configuration items in the amplified labeled configuration item set D, and N′ ≥ N.
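The labeling loop of steps 3.4.5 to 3.4.7 can be sketched as: for one encoded unlabeled document, keep, per label, the highest confidence among matching patterns that pass the threshold, then pick the best label, or none if nothing matched. A minimal illustration (function names and toy data are ours):

```python
# Sketch of steps 3.4.5-3.4.7: select an intent label for one encoded
# unlabeled document dd using the mined sequence pattern set SP.

def is_subsequence(pattern, seq):
    it = iter(seq)
    return all(any(p == x for x in it) for p in pattern)

def select_label(dd, sp, threshold=0.85):
    """sp: list of (pattern, label, confidence) triples (the set SP).
    Returns (label, confidence) for R1, or None for R2."""
    selector = {}
    for pattern, label, conf in sp:
        if conf >= threshold and is_subsequence(pattern, dd):
            if conf > selector.get(label, 0.0):   # 3.4.6.4: keep the max
                selector[label] = conf
    if not selector:
        return None                               # no match -> R2
    best = max(selector, key=selector.get)        # 3.4.7: argmax label
    return best, selector[best]

SP = [([8, 9], "Label6", 0.9), ([8, 13], "Label7", 0.95)]
print(select_label([8, 9, 12], SP))  # ('Label6', 0.9)
print(select_label([10, 11], SP))    # None
```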
And fourthly, training a configuration item preselection module of the software configuration item preselection system oriented to performance tuning by using the amplified labeling configuration item set D. The method for training the configuration item pre-selection module comprises the following steps:
4.1 use N' Profile D in D1,…,dn′,…,dN′As a training set, a TF-IDF method is utilized, a TF-IDF encoder in a training configuration item preselection module is used for encoding a configuration item document, the encoder inputs sentences, and the encoder outputs vectors corresponding to the sentences;
4.2 Encode the N' documents in D with the encoder to obtain the encoded vector set V', as follows:
4.2.1 Initialize the vector set V' as an empty set;
4.2.2 Initialize the loop subscript variable n' = 1;
4.2.3 Use the encoder to encode d_n' as the n'-th vector v_n';
4.2.4 Add v_n' to V';
4.2.5 If n' = N', the encoding of the N' configuration item documents in D is complete and the encoded vector set V' is obtained; go to 4.3. If n' < N', let n' = n' + 1 and go to 4.2.3;
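A minimal sketch of the TF-IDF encoding in 4.1–4.2, using the common raw-frequency tf and log idf weighting; the patent names the TF-IDF method but does not fix the exact variant, so this weighting is an assumption:

```python
import math
from collections import Counter

def fit_tfidf(docs):
    """Build the vocabulary and IDF table from the N' training documents (4.1).
    Each document is a list of word tokens."""
    vocab = sorted({w for doc in docs for w in doc})
    n = len(docs)
    # idf(w) = log(N' / document frequency of w)
    idf = {w: math.log(n / sum(1 for doc in docs if w in doc)) for w in vocab}
    return vocab, idf

def encode(doc, vocab, idf):
    """Encode one configuration item document as a TF-IDF vector (4.2.3)."""
    counts = Counter(doc)
    total = max(len(doc), 1)
    return [counts[w] / total * idf[w] for w in vocab]
```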
4.3 Use the training set {<v_n', label_n'> | 1 ≤ n' ≤ N'} and the hierarchical random forest algorithm to train the configuration item preselection model RF, obtaining the configuration item preselection model parameters.
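The two-layer structure of the model RF described in 4.3 can be sketched generically: the first layer separates performance-unrelated items (Label_7) from performance-related ones, and the second layer distinguishes Label_1 through Label_6. A toy nearest-neighbor learner stands in here for the random forests, which are the patent's actual base learners; the factory interface is an assumption of this sketch.

```python
class NN1:
    """Stand-in base learner for illustration: 1-nearest-neighbor."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self
    def predict(self, x):
        return min(zip(self.X, self.y),
                   key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[1]

class TwoLayerClassifier:
    """Sketch of the two-layer model RF (4.3): the first layer decides
    performance-related vs not (Label_7); the second layer picks one of
    Label_1..Label_6 for performance-related items."""
    def __init__(self, make_clf):
        self.first = make_clf()
        self.second = make_clf()

    def fit(self, X, y):                       # y holds label indices 1..7
        y_bin = [1 if lab == 7 else 0 for lab in y]
        self.first.fit(X, y_bin)
        perf = [(x, lab) for x, lab in zip(X, y) if lab != 7]
        self.second.fit([x for x, _ in perf], [lab for _, lab in perf])
        return self

    def predict(self, x):
        if self.first.predict(x) == 1:         # not performance-related
            return 7
        return self.second.predict(x)          # one of 1..6
```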
Fifth, the trained configuration item preselection module preselects configuration items according to the target software configuration items to obtain a preselected configuration item set. Denote the target software configuration item data set DT = {<dtc_a, dt_a> | 1 ≤ a ≤ A}, where A is the number of configuration items in the target software, dtc_a is the name of the a-th configuration item, and dt_a is the document of the a-th configuration item. The method is:
5.1 Use the encoder obtained by the training in 4.1 to encode the A configuration item documents of the target software, and denote the encoded vector set of the target software as V_dt, as follows:
5.1.1 Initialize the vector set V_dt of the target software as an empty set;
5.1.2 Initialize the loop subscript variable a = 1;
5.1.3 Use the encoder to encode dt_a as vv_a, the a-th vector of the target software;
5.1.4 Add vv_a to V_dt;
5.1.5 If a = A, the encoding of the A configuration item documents in DT is complete and the encoded vector set V_dt of the target software is obtained; go to 5.2. If a < A, let a = a + 1 and go to 5.1.3;
5.2 The trained configuration item preselection module generates a corresponding intention label from the vector of each configuration item in V_dt to obtain a predicted intention label list O, as follows:
5.2.1 Initialize the predicted intention label list O as an empty list;
5.2.2 Initialize the loop subscript variable a = 1;
5.2.3 Input vv_a into the model RF of the trained configuration item preselection module to obtain o_a, the predicted intention label of the a-th configuration item of the target software, as follows:
5.2.3.1 Initialize o_a, the candidate intention label of the a-th configuration item; let o_a = Label_7;
5.2.3.2 Input vv_a into the model RF of the trained configuration item preselection module to obtain the first-layer output [pprob, npprob] and the second-layer output [prob_1, prob_2, prob_3, prob_4, prob_5, prob_6], where pprob is the probability that the configuration item to be predicted is related to performance, npprob is the probability that the configuration item to be predicted is not related to performance, and prob_i is the probability that the intention label of the configuration item to be predicted is Label_i;
5.2.3.3 If pprob < npprob, RF predicts that the probability that the configuration item is not related to performance is greater than the probability that it is related to performance; let o_a = Label_7 and go to 5.2.4. If pprob > npprob, RF predicts that the probability that the configuration item is related to performance is greater than the probability that it is not, i.e. the configuration item is a performance-related configuration item; go to 5.2.3.4 to further determine which other user intention the configuration item affects besides performance, i.e. determine which of Label_1, …, Label_i, …, Label_6 is the intention label of the configuration item;
5.2.3.4 Determine the intention label of the performance-related configuration item, as follows:
5.2.3.4.1 Initialize the candidate intention label subscript ci = 1;
5.2.3.4.2 Initialize the loop subscript variable i = 1;
5.2.3.4.3 If prob_i > prob_ci, let ci = i and go to 5.2.3.4.4; otherwise go directly to 5.2.3.4.4;
5.2.3.4.4 If i = 6, the traversal of the RF second-layer output is complete; let o_a = Label_ci and go to 5.2.4. If i < 6, let i = i + 1 and go to 5.2.3.4.3;
5.2.4 Add o_a to the predicted intention label list O;
5.2.5 If a = A, the prediction for all configuration items in DT is complete and the predicted intention label list O is obtained; go to 5.3. If a < A, let a = a + 1 and go to 5.2.3;
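The decision rule of 5.2.3.2 through 5.2.3.4 reduces to a two-way comparison followed by an argmax; a small sketch, with integer indices 1–7 standing for Label_1–Label_7:

```python
def predict_intent(first_out, second_out):
    """Map the two-layer RF outputs to a label index (5.2.3.3-5.2.3.4).
    first_out  = [pprob, npprob]  (related / not related to performance)
    second_out = [prob_1, ..., prob_6]."""
    pprob, npprob = first_out
    if pprob < npprob:                    # performance-unrelated branch
        return 7                          # Label_7
    # performance-related: argmax over Label_1..Label_6 (5.2.3.4)
    return max(range(6), key=lambda i: second_out[i]) + 1
```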
5.3 Classify the configuration items according to the intention labels to obtain sets composed of configuration items with the same intention category, as follows:
5.3.1 Initialize the configuration item set corresponding to each of the 7 kinds of intention labels as an empty set; the i-th such set is the configuration item set corresponding to the i-th intention label;
5.3.2 Initialize the loop subscript variable a = 1;
5.3.3 According to o_a, the intention label of the a-th configuration item, add dtc_a, the name of the a-th configuration item, to the corresponding configuration item set;
5.3.4 If a < A, let a = a + 1 and go to 5.3.3. If a = A, the classification of all configuration items in DT is complete and the preselected configuration item set, namely the 7 per-label configuration item sets, is obtained, where the j-th configuration item of the i-th set is a configuration item whose intention label the trained configuration item preselection model RF predicts to be Label_i, and J_i is the total number of configuration items whose intention label RF predicts to be Label_i.
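Step 5.3 is a straightforward bucketing of configuration item names by predicted label; a minimal sketch:

```python
from collections import defaultdict

def group_by_intent(names, labels):
    """Step 5.3: bucket configuration item names by predicted intention
    label (indices 1..7), yielding the preselected configuration item sets."""
    groups = defaultdict(list)
    for name, label in zip(names, labels):
        groups[label].append(name)
    return dict(groups)
```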
To verify the effect of the invention, a comparative experiment between the invention and the background art was carried out on a computer with the Ubuntu 18.04 operating system, a 48-core Intel Xeon 2.2 GHz CPU, a Tesla V100 GPU, and 64 GB of memory. The primary coding language is Python 3.8.6. The training process was carried out according to the steps in the specification, with PostgreSQL and Cassandra used as the target software for testing; PostgreSQL and Cassandra yield 252 and 117 configuration item documents respectively. Since the first background art does not disclose its source code or experimental results, the comparison is made only with background art II. As shown in Table 1, the experiment shows that when the labeled data accounts for 20% of the total data (s = 0.2) and the confidence threshold is set to 0.85 (threshold = 0.85), 59.4% of the unlabeled data can be amplified using the method of the present invention with an accuracy of 86.4%, greatly reducing the manpower and time consumed in the data labeling process and improving the efficiency of data labeling. The invention can recommend performance-related configuration items more comprehensively while greatly reducing overhead. Meanwhile, configuration items can be recommended for user intentions other than performance, assisting the user in tuning and satisfying a variety of user intentions.
TABLE 1 Comparison of the software configuration item preselection method of the present invention and background art II
The performance-tuning-oriented software configuration item preselection method provided by the invention has been described in detail above. The principles and embodiments of the present invention are explained herein; the above description is intended to assist in understanding the core concepts of the invention. It should be noted that those skilled in the art can make various improvements and modifications to the invention without departing from its principles, and such improvements and modifications also fall within the scope of the claims of the present invention.

Claims (9)

1. A software configuration item preselection method oriented to performance tuning, characterized by comprising the following steps:
First, construct a performance-tuning-oriented software configuration item preselection system, which is composed of a configuration item intention data automatic amplification module and a configuration item preselection module;
the configuration item intention data automatic amplification module is connected with the configuration item preselection module and is also connected with the data set source software; the data set source software comprises two parts: a labeled configuration item set and an unlabeled configuration item set; the labeled configuration item set refers to a data set constructed by performing intention category labeling on configuration items according to each configuration item's document; the configuration item intention data automatic amplification module preprocesses the labeled configuration item set, labels the unlabeled configuration items in the unlabeled configuration item set, and adds newly labeled data from the unlabeled configuration item set to the labeled configuration item set until the number of configuration items in the labeled configuration item set no longer changes, obtaining the amplified labeled configuration item set, which it sends to the configuration item preselection module;
the configuration item preselection module is connected with the configuration item intention data automatic amplification module and receives the amplified labeled configuration item set from it; the configuration item preselection module comprises a TF-IDF encoder and a configuration item preselection model RF; the encoder encodes the sentences in the configuration item documents to obtain the vectors corresponding to the sentences; RF is a random forest model with a two-layer structure, trained with the amplified labeled configuration item set to obtain the random forest model parameters; the configuration item preselection module classifies the configuration items of the target software according to the configuration data of the target software and preselects the configuration items corresponding to different intention categories, obtaining a preselected configuration item set;
Second, randomly select some configuration items from the configuration item set D_0 of the data set source software and label their intentions to obtain the labeled configuration item set D_1, as follows:
2.1 Select partial configuration items from the data set source software according to the following conditions: 1) the software belongs to the server side; 2) the software has a large number of users, i.e. more than 2,000 stars on the code hosting platform GitHub; 3) the software has more than 100 configuration items. From the configuration item set D_0, composed of more than 7 thousand configuration items of software satisfying these 3 conditions simultaneously, randomly select configuration items in proportion s; denote the total number of configuration items as S and the number of randomly selected configuration items as N, N = S × s rounded to an integer;
2.2 According to the official document descriptions of the selected configuration items, perform intention labeling on the N configuration items to obtain the labeled configuration item set D_1; the intention labels of configuration items are of seven kinds in total: Label_1, Label_2, Label_3, Label_4, Label_5, Label_6, Label_7;
2.3 The labeled configuration item set D_1 = {<(c_n, d_n), label_n> | 1 ≤ n ≤ N, label_n ∈ Labels}, where c_n is the name of the n-th configuration item in D_1 and d_n is the document of configuration item c_n; d_n is expressed as a sequence of words, where W_n is the total number of words in d_n; label_n is the intention category of configuration item c_n, and Labels = {Label_i | 1 ≤ i ≤ 7} is the set formed by the intention label categories;
Note that the set formed by the T = S − N configuration items not selected in step 2.1 is denoted the unlabeled configuration item set D_2, D_2 = {<(cc_t, dd_t)> | 1 ≤ t ≤ T}, where cc_t is the name of the t-th configuration item in D_2 and dd_t is the document of configuration item cc_t; dd_t is expressed as a sequence of words, where U_t is the total number of words in dd_t;
Third, the configuration item intention data automatic amplification module preprocesses the labeled configuration item set D_1, iteratively labels the unlabeled configuration items in the unlabeled configuration item set D_2, and amplifies D_1 with the newly labeled configuration items to obtain the amplified labeled configuration item set D, as follows:
3.1 The configuration item intention data automatic amplification module preprocesses D_1, as follows:
3.1.1 Define a dictionary-type variable f_label for encoding the intention label categories, satisfying f_label[Label_1] = 1, …, f_label[Label_i] = i, …, f_label[Label_7] = 7, 1 ≤ i ≤ 7;
3.1.2 Initialize the word-mapping maximum index: index = 8;
3.1.3 Define a dictionary-type variable f_token for encoding words; initialize f_token as an empty dictionary, i.e. the key set of f_token is an empty set; in the subsequent steps, <part of speech, root> two-tuples are gradually added to the key set, and words are encoded according to their part of speech and root;
3.1.4 Encode the words and build f_token step by step, as follows:
3.1.4.1 Initialize variable n = 1;
3.1.4.2 Encode each of the W_n words in d_n to obtain d'_n, the encoded form of d_n;
3.1.4.3 If n = N, replace each d_n in D_1 with its encoded d'_n to obtain the preprocessed labeled configuration item set D'_1 = {<(c_n, d'_n), label_n> | 1 ≤ n ≤ N, label_n ∈ Labels}, and go to 3.2; if n < N, go to 3.1.4.4;
3.1.4.4 Let n = n + 1 and go to 3.1.4.2;
3.2 The configuration item intention data automatic amplification module mines sequence patterns from D'_1 to obtain a sequence pattern set SP, as follows:
3.2.1 Use D'_1 to construct the sequence set SeqDB = {seq_1, …, seq_n, …, seq_N}, where seq_n is the sequence formed by concatenating d'_n, the encoded form of the document d_n of configuration item c_n, with f_label(label_n), the code corresponding to c_n's intention label label_n;
3.2.2 Perform sequence pattern mining on the sequence set SeqDB with the FEAT algorithm to obtain a sequence set P = {p_1, …, p_m, …, p_M}, where M is the total number of sequence patterns and p_m is a frequently occurring sequence in SeqDB, p_m = (pp_1, …, pp_x, …, pp_X); X is the length of p_m as computed by the FEAT algorithm, and pp_x, the x-th item of p_m, is the code corresponding to a word or an intention label and satisfies 1 ≤ pp_x < index; 1 ≤ pp_x ≤ 7 means that pp_x is the f_label mapping of an intention label, and 8 ≤ pp_x < index means that pp_x is the f_token code of the <part of speech, root> two-tuple into which a word was transformed by step 3.1.4.2;
3.2.3 Process P, retaining the sequences related to the intention categories, and compute the support and confidence corresponding to each sequence to obtain the sequence pattern set SP, as follows:
3.2.3.1 Initialize the sequence pattern set SP as an empty set;
3.2.3.2 Initialize the sequence traversal variable m = 1;
3.2.3.3 Initialize the sequence pattern count variable m' = 0;
3.2.3.4 Determine whether the last item pp_X of p_m satisfies 1 ≤ pp_X ≤ 7; if so, pp_X is the code of an intention category and p_m is related to determining unlabeled configuration item intention categories, go to 3.2.3.5; otherwise p_m is unrelated to determining unlabeled configuration item intention categories, go to 3.2.3.6;
3.2.3.5 Let m' = m' + 1 and p_m' = p_m; compute confidence_m', the confidence of p_m', and add the processed sequence pattern Pattern_m' to the sequence pattern set SP, Pattern_m' = (pattern_m', l_m', confidence_m'), where pattern_m' is the sequence reflected by p_m' that is related to l_m', and l_m' is the intention category corresponding to p_m';
3.2.3.6 If m = M, the sequence pattern set SP = {Pattern_m' | 1 ≤ m' ≤ M'} is obtained, where M' is the total number of patterns in SP and M' ≤ M; go to 3.3. If not, let m = m + 1 and go to 3.2.3.4;
3.3 The configuration item intention data automatic amplification module encodes D_2, as follows:
3.3.1 Initialize variable t = 1;
3.3.2 Encode each of the U_t words in dd_t to obtain dd'_t, the encoded form of dd_t;
3.3.3 If t = T, take the two-tuple (cc_t, dd'_t) as the encoding of <(cc_t, dd_t)> in D_2, obtaining the encoded unlabeled configuration item set D'_2 = {(cc_t, dd'_t) | 1 ≤ t ≤ T}, and go to 3.4; if t < T, go to 3.3.4;
3.3.4 Let t = t + 1 and go to 3.3.2;
3.4 The configuration item intention data automatic amplification module labels D'_2 using SP, as follows:
3.4.1 Set a confidence threshold threshold, 0 < threshold ≤ 1;
3.4.2 Initialize variable t = 1;
3.4.3 Initialize the set R_1 of configuration items with labels as an empty set;
3.4.4 Initialize the set R_2 of configuration items without labels as an empty set;
3.4.5 Initialize the dictionary-type variable selector used to select an intention label for the t-th unlabeled configuration item: let selector[Label_1] = 0, …, selector[Label_i] = 0, …, selector[Label_7] = 0, where selector[Label_i] is the confidence of labeling the t-th unlabeled configuration item as Label_i;
3.4.6 Update selector according to the pattern set SP obtained in 3.2, as follows:
3.4.6.1 Initialize variable m' = 1;
3.4.6.2 Read confidence_m', l_m', pattern_m' from the sequence pattern Pattern_m'; if confidence_m' ≥ threshold, go to 3.4.6.3 to determine whether pattern matching is possible; if confidence_m' < threshold, Pattern_m' does not meet the confidence requirement, go to 3.4.6.5;
3.4.6.3 If pattern_m' is a subsequence of dd'_t, the pattern matches, go to 3.4.6.4; if not, go to 3.4.6.5;
3.4.6.4 If confidence_m' > selector[l_m'], update selector[l_m'], i.e. let selector[l_m'] = confidence_m', and go to 3.4.6.5; otherwise go directly to 3.4.6.5;
3.4.6.5 If m' = M', all sequence patterns have been traversed and the update of selector is complete, go to 3.4.7; if m' < M', let m' = m' + 1 and go to 3.4.6.2;
3.4.7 Select a label for dd'_t according to selector, as follows:
3.4.7.1 Initialize the candidate label LC_t = Label_1;
3.4.7.2 Initialize the label subscript variable i = 2;
3.4.7.3 If selector[Label_i] > selector[LC_t], the confidence of selecting Label_i as the label is higher than that of selecting LC_t, so let LC_t = Label_i and go to 3.4.7.4; if selector[Label_i] ≤ selector[LC_t], go directly to 3.4.7.4;
3.4.7.4 If i = 7, go to 3.4.7.5; if i < 7, let i = i + 1 and go to 3.4.7.3;
3.4.7.5 If selector[LC_t] > 0, take LC_t as the intention label of the t-th unlabeled configuration item, add <(cc_t, dd_t), LC_t> to R_1, and go to 3.4.8; if selector[LC_t] = 0, no pattern in SP matches dd'_t, so no intention label is selected for the t-th unlabeled configuration item; add <(cc_t, dd_t)> to R_2 and go to 3.4.8;
3.4.8 If t = T, the labeling of the unlabeled configuration item set D_2 is complete, yielding R_1 and R_2; go to 3.4.10. If t < T, go to 3.4.9;
3.4.9 Let t = t + 1 and go to 3.4.5;
3.4.10 Determine whether R_1 is an empty set. If so, the iterative amplification of D_1 terminates and the amplified labeled configuration item set is obtained; go to 3.4.12. If not, go to 3.4.11;
3.4.11 Let D_1 = D_1 + R_1 and D_2 = R_2, then go to 3.1;
3.4.12 The labeled configuration item set D_1 at the time this step is reached is the amplified labeled configuration item set, denoted D = {<(c_n', d_n'), label_n'> | 1 ≤ n' ≤ N', label_n' ∈ Labels}, where d_n' is the document of configuration item c_n', label_n' is the intention category of configuration item c_n', and N' is the number of configuration items in the amplified labeled configuration item set D; N' ≥ N;
Fourth, train the configuration item preselection module of the performance-tuning-oriented software configuration item preselection system using the amplified labeled configuration item set D, as follows:
4.1 Use the N' configuration item documents d_1, …, d_n', …, d_N' in D as a training set and, using the TF-IDF method, train the TF-IDF encoder in the configuration item preselection module to encode configuration item documents; the encoder's input is a sentence and its output is the vector corresponding to that sentence;
4.2 Encode the N' documents in D with the encoder to obtain the encoded vector set V'; V' contains N' encoded vectors, and the n'-th vector v_n' is the vector obtained by encoding d_n' with the encoder;
4.3 Use the training set {<v_n', label_n'> | 1 ≤ n' ≤ N'} and the hierarchical random forest algorithm to train the configuration item preselection model RF, obtaining the configuration item preselection model parameters;
Fifth, the trained configuration item preselection module preselects configuration items according to the target software configuration items to obtain a preselected configuration item set. Denote the target software configuration item data set DT = {<dtc_a, dt_a> | 1 ≤ a ≤ A}, where A is the number of configuration items in the target software, dtc_a is the name of the a-th configuration item, and dt_a is the document of the a-th configuration item. The method is:
5.1 Use the encoder obtained by the training in 4.1 to encode the A configuration item documents of the target software, and denote the encoded vector set of the target software as V_dt, as follows:
5.1.1 Initialize the vector set V_dt of the target software as an empty set;
5.1.2 Initialize the loop subscript variable a = 1;
5.1.3 Use the encoder to encode dt_a as vv_a, the a-th vector of the target software;
5.1.4 Add vv_a to V_dt;
5.1.5 If a = A, the encoding of the A configuration item documents in DT is complete and the encoded vector set V_dt of the target software is obtained; go to 5.2. If a < A, let a = a + 1 and go to 5.1.3;
5.2 The trained configuration item preselection module generates a corresponding intention label from the vector of each configuration item in V_dt to obtain a predicted intention label list O, as follows:
5.2.1 Initialize the predicted intention label list O as an empty list;
5.2.2 Initialize the loop subscript variable a = 1;
5.2.3 Input vv_a into the model RF of the trained configuration item preselection module to obtain o_a, the predicted intention label of the a-th configuration item of the target software;
5.2.4 Add o_a to the predicted intention label list O;
5.2.5 If a = A, the prediction for all configuration items in DT is complete and the predicted intention label list O is obtained; go to 5.3. If a < A, let a = a + 1 and go to 5.2.3;
5.3 Classify the configuration items according to the intention labels to obtain sets composed of configuration items with the same intention category, as follows:
5.3.1 Initialize the configuration item set corresponding to each of the 7 intention labels as an empty set; the i-th such set is the configuration item set corresponding to the i-th intention label;
5.3.2 Initialize the loop subscript variable a = 1;
5.3.3 According to o_a, the intention label of the a-th configuration item, add dtc_a, the name of the a-th configuration item, to the corresponding configuration item set;
5.3.4 If a < A, let a = a + 1 and go to 5.3.3. If a = A, the classification of all configuration items in DT is complete and the preselected configuration item set, namely the 7 per-label configuration item sets, is obtained, where the j-th configuration item of the i-th set is a configuration item whose intention label the trained configuration item preselection model RF predicts to be Label_i, and J_i is the total number of configuration items whose intention label RF predicts to be Label_i.
2. The performance-tuning-oriented software configuration item preselection method of claim 1, wherein the data set source software of the second step comprises 13 kinds of software: MySQL, Cassandra, MariaDB, Apache-Httpd, Nginx, Hadoop-Common, MapReduce, Apache-Flink, HDFS, Keystone, Nova, GCC, and Clang.
3. The performance-tuning-oriented software configuration item preselection method of claim 1, wherein the proportion s in step 2.1 satisfies 1 ≥ s ≥ 0.2, and the confidence threshold in step 3.4.1 satisfies 0.7 < threshold ≤ 1.
4. The method of claim 1, wherein the method of performing intention labeling on the N configuration items in step 2.2 is: according to the document description of a configuration item, if adjusting the configuration item can improve software performance but the performance improvement reduces software reliability, the intention label of the configuration item is Label_1; if adjusting the configuration item can improve software performance but the performance improvement reduces software security, the intention label of the configuration item is Label_2; if adjusting the configuration item can improve software performance but the performance improvement degrades software functionality, the intention label of the configuration item is Label_3; if adjusting the configuration item can improve software performance but the performance improvement increases the cost of using the software, the intention label of the configuration item is Label_4; if adjusting the configuration item can improve software performance but the performance improvement degrades performance for other users of the software, the intention label of the configuration item is Label_5; if adjusting the configuration item can improve software performance without causing the first five side effects, the intention label of the configuration item is Label_6; if adjusting the configuration item does not affect software performance, the intention label of the configuration item is Label_7.
5. The method of claim 1, wherein the step 3.1.4.2 of encoding each of the W_n words in d_n to obtain d'_n, the encoded form of d_n, comprises:
3.1.4.2.1 Initialize the word subscript variable w_n = 1;
3.1.4.2.2 Convert the w_n-th word of d_n into a two-tuple consisting of the word's part of speech and the word's root;
3.1.4.2.3 Determine whether this two-tuple is among the keys of f_token; if not, encode the word as index, add the key-value pair <two-tuple, index> to f_token, and go to 3.1.4.2.4; if so, encode the word as the value corresponding to this two-tuple key in f_token, and go to 3.1.4.2.5;
3.1.4.2.4 Let index = index + 1;
3.1.4.2.5 If w_n = W_n, the encoding of each word in d_n is complete and d'_n, the encoded form of d_n, is obtained; end. If w_n < W_n, go to 3.1.4.2.6;
3.1.4.2.6 Let w_n = w_n + 1 and go to 3.1.4.2.2.
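The encoding procedure of claim 5 can be sketched as follows. `pos_root` is a hypothetical helper standing in for whatever part-of-speech tagger and stemmer produce the <part of speech, root> two-tuple; the patent does not name a specific tool.

```python
def encode_words(words, pos_root, f_token, index):
    """Claim 5 (3.1.4.2): encode each word by its <part of speech, root>
    two-tuple, assigning a fresh code via `index` when the tuple is new.
    `pos_root` is an assumed helper mapping a word to that two-tuple."""
    encoded = []
    for w in words:
        key = pos_root(w)
        if key not in f_token:      # 3.1.4.2.3: unseen two-tuple
            f_token[key] = index
            index += 1              # 3.1.4.2.4
        encoded.append(f_token[key])
    return encoded, index
```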
6. The method of claim 1, wherein the step 3.2.3.5 of computing confidence_m', the confidence of p_m', and adding the processed sequence pattern to the sequence pattern set SP comprises:
3.2.3.5.1 Initialize the configuration item subscript loop variable n = 1, and let m' = m' − 1;
3.2.3.5.2 Let m' = m' + 1;
3.2.3.5.3 Initialize the support variable support_m' = 0;
3.2.3.5.4 Initialize the match count variable matched_m' = 0, used to count the number of configuration items matching the pattern;
3.2.3.5.5 Let l_m', the intention category corresponding to p_m', be the intention label whose f_label code is pp_X; let pattern_m' = (pp_1, …, pp_x, …, pp_{X−1}) be the sequence reflected by p_m' that is related to l_m';
3.2.3.5.6 Determine whether pattern_m' is a subsequence of d'_n; if so, a matching sequence is found, let matched_m' = matched_m' + 1 and go to 3.2.3.5.7; if not, go to 3.2.3.5.8;
3.2.3.5.7 If l_m' = label_n, the intention label is matched correctly at the same time as the sequence, so let support_m' = support_m' + 1 and go to 3.2.3.5.8; if l_m' ≠ label_n, the sequence matches but the intention label corresponding to the sequence does not, go to 3.2.3.5.8;
3.2.3.5.8 If n = N, go to 3.2.3.5.10; if n < N, go to 3.2.3.5.9;
3.2.3.5.9 Let n = n + 1 and go to 3.2.3.5.6;
3.2.3.5.10 Compute confidence_m', the confidence of p_m': confidence_m' = support_m' / matched_m'; the processed sequence pattern is Pattern_m' = (pattern_m', l_m', confidence_m'); add Pattern_m' to the sequence pattern set SP.
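The support/matched computation of claim 6 can be sketched as follows; a minimal illustration (with in-order subsequence matching), not the patented implementation:

```python
def pattern_confidence(pattern, label, encoded_docs, labels):
    """Claim 6 (3.2.3.5): confidence = support / matched, where matched
    counts labeled documents containing the pattern as a subsequence and
    support counts those whose intention label also agrees."""
    def is_subseq(p, doc):
        it = iter(doc)
        return all(x in it for x in p)
    matched = support = 0
    for doc, lab in zip(encoded_docs, labels):
        if is_subseq(pattern, doc):
            matched += 1
            if lab == label:
                support += 1
    return support / matched if matched else 0.0
```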
7. The performance-tuning-oriented software configuration item preselection method according to claim 1, wherein the method in step 3.3.2 for encoding each of the U_t words in dd_t comprises:
3.3.2.1 initialize the word index variable u_t = 1;
3.3.2.2 convert the u_t-th word word_t^{u_t} into the triple (word_t^{u_t}, pos_t^{u_t}, stem_t^{u_t}), where pos_t^{u_t} is the part of speech of word_t^{u_t} and stem_t^{u_t} is the root (stem) of word_t^{u_t};
3.3.2.3 judge whether (word_t^{u_t}, pos_t^{u_t}, stem_t^{u_t}) is in f_token; if so, encode word_t^{u_t} as its code in f_token and go to 3.3.2.4; if not, f_token cannot encode word_t^{u_t}, so encode word_t^{u_t} as 0 directly and go to 3.3.2.4;
3.3.2.4 if u_t = U_t, the encoding of dd_t is completed and dd_t is encoded as dd'_t; end. If u_t < U_t, let u_t = u_t + 1 and go to 3.3.2.2.
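A minimal sketch of the per-word encoding in steps 3.3.2.1 through 3.3.2.4, assuming the token dictionary f_token maps (word, part-of-speech, stem) triples to positive integer codes; the helper name `encode_description` and the dictionary layout are illustrative, not taken from the claim.

```python
def encode_description(tagged_words, f_token):
    """tagged_words: list of (word, pos, stem) triples for one configuration
    item description; f_token: dict mapping such triples to positive codes.
    Triples absent from f_token are encoded as 0 (step 3.3.2.3)."""
    return [f_token.get(triple, 0) for triple in tagged_words]
```

Unknown words therefore all collapse to the same out-of-vocabulary code 0, which keeps the encoded document dd'_t the same length as dd_t.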
8. The method according to claim 1, wherein in step 4.2 the encoder is used to encode the N' documents in D into the encoded vector set V' as follows:
4.2.1 initialize the vector set V' as an empty set;
4.2.2 initialize the loop index variable n' = 1;
4.2.3 use the encoder to encode d_n' as the n'-th vector v_n';
4.2.4 add v_n' to V';
4.2.5 if n' = N', the encoding of the N' configuration item documents in D is completed, yielding the encoded vector set V'; end. If n' < N', let n' = n' + 1 and go to 4.2.3.
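Steps 4.2.1 through 4.2.5 amount to mapping the trained encoder over the document set in order; a hedged one-function sketch (the names `encode_corpus` and `encoder` are illustrative):

```python
def encode_corpus(encoder, documents):
    """Apply the (already trained) encoder to each of the N' documents in D,
    collecting the resulting vectors into V' in order (steps 4.2.1-4.2.5)."""
    vectors = []                      # 4.2.1: V' starts empty
    for doc in documents:             # 4.2.2 / 4.2.5: loop n' = 1..N'
        vectors.append(encoder(doc))  # 4.2.3 / 4.2.4: encode and collect
    return vectors
```

Any callable that maps one document to one vector can stand in for the encoder here.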
9. The performance-tuning-oriented software configuration item preselection method according to claim 1, wherein step 5.2.3, inputting vv_a into the trained model RF of the configuration item preselection module to obtain the predicted intention label o_a of the a-th configuration item of the target software, comprises:
5.2.3.1 initialize the candidate intention label o_a of the a-th configuration item: let o_a = Label_7;
5.2.3.2 input vv_a into the trained model RF of the configuration item preselection module to obtain the first-layer output [pprob, npprob] and the second-layer output [prob_1, prob_2, prob_3, prob_4, prob_5, prob_6], where pprob is the probability that the configuration item to be predicted is performance-related, npprob is the probability that it is not performance-related, and prob_i is the probability that the intention label of the configuration item to be predicted is Label_i;
5.2.3.3 if pprob < npprob, RF predicts that the configuration item is more likely to be performance-independent than performance-related: let o_a = Label_7 and end. If pprob > npprob, RF predicts that the configuration item is more likely to be performance-related, i.e. it is a performance-related configuration item; go to 5.2.3.4 to further determine which other user intention the configuration item affects besides performance, i.e. which of Label_1, ..., Label_i, ..., Label_6 is its intention label;
5.2.3.4 determine the intention label of the performance-related configuration item as follows:
5.2.3.4.1 initialize the candidate intention label index ci = 1;
5.2.3.4.2 initialize the loop index variable i = 1;
5.2.3.4.3 if prob_i > prob_ci, let ci = i and go to 5.2.3.4.4; otherwise go directly to 5.2.3.4.4;
5.2.3.4.4 if i = 6, the traversal of the RF second-layer output is completed: let o_a = Label_ci and end; if i < 6, let i = i + 1 and go to 5.2.3.4.3.
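The two-layer decision of steps 5.2.3.1 through 5.2.3.4.4 can be sketched as follows; the list-based interface and the label values are illustrative assumptions about the RF model's outputs, not the patented implementation itself.

```python
def predict_intention(first_layer, second_layer, labels, not_related_label):
    """first_layer: [pprob, npprob], the probabilities that the configuration
    item is / is not performance-related; second_layer: [prob_1, ..., prob_6]
    over the six performance intention labels; labels: the six label names."""
    pprob, npprob = first_layer
    if pprob < npprob:                 # 5.2.3.3: performance-independent wins
        return not_related_label       # o_a = Label_7
    ci = 0                             # 5.2.3.4.1: candidate label index
    for i in range(1, len(second_layer)):   # 5.2.3.4.2-5.2.3.4.4: argmax scan
        if second_layer[i] > second_layer[ci]:
            ci = i
    return labels[ci]
```

The inner scan is just an argmax over the second-layer probabilities; it only runs when the first layer judges the item performance-related.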
CN202210450353.6A 2022-04-26 2022-04-26 Software configuration item preselection method oriented to performance tuning Active CN114780411B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210450353.6A CN114780411B (en) 2022-04-26 2022-04-26 Software configuration item preselection method oriented to performance tuning

Publications (2)

Publication Number Publication Date
CN114780411A true CN114780411A (en) 2022-07-22
CN114780411B CN114780411B (en) 2023-04-07

Family

ID=82432902

Country Status (1)

Country Link
CN (1) CN114780411B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116225965A (en) * 2023-04-11 2023-06-06 中国人民解放军国防科技大学 IO size-oriented database performance problem detection method
CN116561002A (en) * 2023-05-16 2023-08-08 中国人民解放军国防科技大学 Database performance problem detection method for I/O concurrency

Citations (2)

Publication number Priority date Publication date Assignee Title
CN108804136A (en) * 2018-05-31 2018-11-13 中国人民解放军国防科技大学 Configuration item type constraint inference method based on name semantics
CN111611177A (en) * 2020-06-29 2020-09-01 中国人民解放军国防科技大学 Software performance defect detection method based on configuration item performance expectation

Non-Patent Citations (1)

Title
SHANSHAN LI ET AL.: "Detecting Performance Bottlenecks Guided by Resource Usage" *


Similar Documents

Publication Publication Date Title
CN111310438B (en) Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN109992782B (en) Legal document named entity identification method and device and computer equipment
CN110020438A (en) Enterprise or tissue Chinese entity disambiguation method and device based on recognition sequence
CN112329465A (en) Named entity identification method and device and computer readable storage medium
US11954435B2 (en) Text generation apparatus, text generation learning apparatus, text generation method, text generation learning method and program
CN114780411B (en) Software configuration item preselection method oriented to performance tuning
CN111581973A (en) Entity disambiguation method and system
CN111782961B (en) Answer recommendation method oriented to machine reading understanding
CN112800776A (en) Bidirectional GRU relation extraction data processing method, system, terminal and medium
CN115422369B (en) Knowledge graph completion method and device based on improved TextRank
CN117648469A (en) Cross double-tower structure answer selection method based on contrast learning
CN113807079A (en) End-to-end entity and relation combined extraction method based on sequence-to-sequence
CN117453861A (en) Code search recommendation method and system based on comparison learning and pre-training technology
CN117610562B (en) Relation extraction method combining combined category grammar and multi-task learning
CN117932066A Pre-training-based 'extraction-generation' answer generation model and method
CN111309849B (en) Fine-grained value information extraction method based on joint learning model
CN115408506B (en) NL2SQL method combining semantic analysis and semantic component matching
CN116029261B (en) Chinese text grammar error correction method and related equipment
CN114969279A (en) Table text question-answering method based on hierarchical graph neural network
CN113204679B (en) Code query model generation method and computer equipment
CN114548090A (en) Fast relation extraction method based on convolutional neural network and improved cascade labeling
CN117371447A (en) Named entity recognition model training method, device and storage medium
Kishimoto et al. MHG-GNN: Combination of Molecular Hypergraph Grammar with Graph Neural Network
CN113239192B (en) Text structuring technology based on sliding window and random discrete sampling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant