CN112835798A - Cluster learning method, test step clustering method and related device - Google Patents
- Publication number
- CN112835798A (application CN202110152871.5A)
- Authority
- CN
- China
- Prior art keywords
- word
- target
- vector
- clustering
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3684—Test management for test design, e.g. generating new test cases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G06F11/3672—Test management
- G06F11/3688—Test management for test execution, e.g. scheduling of test suites
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Quality & Reliability (AREA)
- General Health & Medical Sciences (AREA)
- Computer Hardware Design (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiment of the application relates to the technical field of software testing, and provides a cluster learning method, a test step clustering method and a related device.
Description
Technical Field
The embodiment of the application relates to the technical field of software testing, in particular to a cluster learning method, a test step clustering method and a related device.
Background
An automated test script consists of three parts: operation behaviors, test data, and expected assertions. The test steps are the operation behaviors in the automated test script, and each test step must correspond to one group of operation sequences. The text phrase describing a test step is formatted as subject (e.g., user) + behavior predicate (e.g., click) + object (e.g., confirmation). Because the Chinese descriptions used by different people to express the same meaning may differ, many operations may be described differently while their corresponding operations are exactly the same, which causes operation redundancy and increases the coding workload.
Disclosure of Invention
An object of the embodiments of the present application is to provide a cluster learning method, a test step clustering method, and a related apparatus, so as to reduce redundant operation sequences and reduce encoding workload.
In order to achieve the above purpose, the embodiments of the present application employ the following technical solutions:
in a first aspect, an embodiment of the present application provides a cluster learning method, where the method includes:
acquiring a training text, wherein the training text comprises a plurality of test cases, and the test cases comprise a plurality of test steps;
inputting the training text into a pre-constructed processing model to obtain a first similarity metric value between each pair of test steps;
clustering the training text according to each first similarity metric value to generate a cluster tree, wherein the cluster tree comprises a plurality of clusters, and each cluster comprises a plurality of test steps;
and updating parameters of the processing model according to the cluster tree until the cluster tree reaches a set condition, so as to obtain a trained processing model and a final cluster tree, wherein one cluster of the final cluster tree corresponds to a group of preset operation sequences.
In a second aspect, an embodiment of the present application further provides a method for clustering test steps, where the method includes:
acquiring a text to be processed, wherein the text to be processed comprises a plurality of testing steps to be processed;
inputting the text to be processed into a processing model trained by the above cluster learning method to obtain a second similarity metric value between each pair of test steps to be processed;
clustering the text to be processed according to each second similarity metric value to generate a cluster tree to be processed, wherein the cluster tree to be processed comprises a plurality of clusters to be processed, and each cluster to be processed comprises a plurality of test steps to be processed;
and comparing the cluster tree to be processed with the final cluster tree to obtain an operation sequence corresponding to each cluster to be processed, wherein the final cluster tree is generated by the above cluster learning method.
In a third aspect, an embodiment of the present application further provides a cluster learning apparatus, where the apparatus includes:
a first acquisition module, configured to acquire a training text, wherein the training text comprises a plurality of test cases, and the test cases comprise a plurality of test steps;
a first processing module, configured to input the training text into a pre-constructed processing model and obtain a first similarity metric value between each pair of test steps;
a first clustering module, configured to cluster the training text according to each first similarity metric value to generate a cluster tree, wherein the cluster tree comprises a plurality of clusters, and each cluster comprises a plurality of test steps;
and an execution module, configured to update parameters of the processing model according to the cluster tree until the cluster tree reaches a set condition, obtaining a trained processing model and a final cluster tree, wherein one cluster of the final cluster tree corresponds to a group of preset operation sequences.
In a fourth aspect, an embodiment of the present application further provides a device for clustering test steps, where the device includes:
a second acquisition module, configured to acquire a text to be processed, wherein the text to be processed comprises a plurality of test steps to be processed;
a second processing module, configured to input the text to be processed into the processing model trained by the cluster learning method, to obtain a second similarity metric value between each pair of test steps to be processed;
a second clustering module, configured to cluster the text to be processed according to each second similarity metric value and generate a cluster tree to be processed, wherein the cluster tree to be processed comprises a plurality of clusters to be processed, and each cluster to be processed comprises a plurality of test steps to be processed;
and a comparison module, configured to compare the cluster tree to be processed with the final cluster tree to obtain an operation sequence corresponding to each cluster to be processed, wherein the final cluster tree is generated by the cluster learning method.
In a fifth aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes: one or more processors; a memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the cluster learning method or the test step clustering method described above.
In a sixth aspect, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the cluster learning method or the test step clustering method described above.
Compared with the prior art, the cluster learning method, the test step clustering method, and the related device provided by the embodiments of the present application work as follows. During cluster learning, the training text is first input into a pre-constructed processing model to obtain a first similarity metric value between each pair of test steps in the training text. The training text is then clustered according to each first similarity metric value to generate a cluster tree. The parameters of the processing model are then updated according to the cluster tree until the cluster tree reaches a set condition, yielding a trained processing model and a final cluster tree. One cluster of the final cluster tree corresponds to one group of preset operation sequences, i.e., all test steps in the same cluster correspond to one group of preset operation sequences, so redundant operation sequences can be reduced and the coding workload decreased.
Drawings
Fig. 1 shows a schematic flow chart of a cluster learning method provided in an embodiment of the present application.
Fig. 2 is a schematic flowchart of step S102 in the cluster learning method shown in fig. 1.
Fig. 3 is a schematic flow chart of sub-step S1023 in step S102 shown in fig. 2.
Fig. 4 is another schematic flow chart of sub-step S1023 in step S102 shown in fig. 2.
Fig. 5 is a schematic flowchart of sub-step S1024 in step S102 shown in fig. 2.
Fig. 6 is a schematic flowchart of step S103 in the cluster learning method shown in fig. 1.
Fig. 7 shows a schematic flow chart of a test step clustering method provided in the embodiment of the present application.
Fig. 8 is a block diagram illustrating a cluster learning apparatus according to an embodiment of the present application.
Fig. 9 is a block diagram illustrating a test step clustering apparatus according to an embodiment of the present application.
Fig. 10 shows a block schematic diagram of an electronic device provided in an embodiment of the present application.
Icon: 10-an electronic device; 11-a processor; 12-a memory; 13-a bus; 100-cluster learning means; 110-a first acquisition module; 120-a first processing module; 130-a first clustering module; 140-an execution module; 200-testing step clustering device; 210-a second obtaining module; 220-a second processing module; 230-a second clustering module; 240-alignment module.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
An automated test script comprises operation behaviors, test data, and expected assertions; the test steps are the operation behaviors in the automated test script, and each test step must correspond to one group of operation sequences. Because the Chinese descriptions used by different people to express the same meaning may differ, many situations arise in which the descriptions differ but the corresponding operation sequences are exactly the same, causing redundant operation sequences and increasing the coding workload. Therefore, the test steps need to be processed to reduce the generation of redundant operation sequences, thereby reducing unnecessary coding work and improving overall automation construction efficiency.
On this basis, the prior art generally adopts the following two methods to process the test steps. First, clustering by the mechanical similarity of texts: a locality-sensitive hashing algorithm compares short sentences pairwise and computes similarity character by character, and if the similarity reaches a set threshold, the two short sentences are clustered into one test step. Second, performing word segmentation and data cleaning on the test steps and then de-duplicating them to obtain a processed test-step list. However, both methods can only handle a small number of test steps and cannot be used to process test steps at scale.
Based on this, the embodiment of the application provides a cluster learning method, a test step clustering method and a related device, a trained processing model and a final clustering tree are obtained through cluster learning, and one clustering cluster of the final clustering tree corresponds to a group of preset operation sequences, so that the trained processing model and the final clustering tree can be used for clustering the original test steps in application, thereby reducing redundant operation sequences and reducing the encoding workload. As described in detail below.
Referring to fig. 1, fig. 1 shows a schematic flow chart of a cluster learning method provided in an embodiment of the present application, where the cluster learning method may include the following steps:
s101, a training text is obtained, wherein the training text comprises a plurality of test cases, and the test cases comprise a plurality of test steps.
The training text may be generated according to all system cases collected by the tester. The system use case is a requirement document of a certain function (for example, ordering) under a project, and the system use case can comprise contents such as a target system, a main executor, an auxiliary executor, a path step and the like. The target system refers to a specific service system for implementing the function, such as a live broadcast system. The main performer refers to an object that mainly performs this function, for example, a user, a timer, and the like.
After the system case is obtained, the system case can be analyzed into a test case. Generally, one system case can be analyzed into a plurality of test cases, and one test case can have a plurality of test steps. Because a system case is associated with a main executor and a target system, all test cases under the system case are associated with the corresponding main executor and target system, and the test steps in each test case are also associated with the corresponding main executor and target system.
S102, inputting the training text into a pre-constructed processing model, and obtaining a first similarity metric value between every two testing steps.
After the training text is obtained, it is input into a pre-constructed processing model; after the processing model processes the training text, it can output a first similarity metric value between each pair of test steps in the training text. The first similarity metric value characterizes the similarity between each pair of test steps, and the similarity metric may be the Euclidean distance, cosine similarity, RWMD (relaxed word mover's distance), WCD (word centroid distance), and the like.
S103, clustering the training text according to each first similarity metric value to generate a cluster tree, wherein the cluster tree comprises a plurality of clusters, and each cluster comprises a plurality of test steps.
After the first similarity metric value between each pair of test steps in the training text is obtained, the test steps in the training text can be clustered according to the obtained first similarity metric values, thereby generating a cluster tree. The clustering algorithm may be a bottom-up hierarchical clustering algorithm, a k-means clustering algorithm, or the like. The cluster tree may comprise a plurality of clusters, one cluster may comprise a plurality of test steps, and the test steps in the same cluster express the same meaning.
And S104, updating parameters of the processing model according to the cluster tree until the cluster tree reaches a set condition, obtaining the trained processing model and a final cluster tree, wherein one cluster of the final cluster tree corresponds to a group of preset operation sequences.
After the cluster tree is generated, the cluster tree can be tested and checked to obtain the cluster accuracy of the cluster tree. And if the clustering accuracy does not reach the expected value, updating the parameters of the processing model, iterating until the obtained clustering accuracy reaches the expected value, and stopping updating the parameters to obtain the trained processing model and the final clustering tree. And the final clustering tree is the clustering tree generated by the last iteration, and the clustering accuracy of the final clustering tree reaches an expected value. Since each first similarity metric value is associated with a primary actor, a target system, and a testing step, the updated parameters may include the weight of the primary actor, the weight of the target system, and the weight of the testing step.
Meanwhile, for the final cluster tree, one cluster of the final cluster tree can be set to correspond to one group of preset operation sequences, i.e., all test steps in the same cluster correspond to one group of preset operation sequences, thereby reducing redundant operation sequences and the coding workload.
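Steps S101 to S104 thus form an iterative loop: compute pairwise similarities, cluster, check the cluster tree, and update the weights. The following Python sketch illustrates that loop under stated assumptions; the helper callables (process_model, build_cluster_tree, evaluate_accuracy, update_weights), the weight names, and the accuracy threshold are illustrative stand-ins rather than components prescribed by this application.

```python
from typing import Callable, Dict, Tuple

Weights = Dict[str, float]

def cluster_learning(
    training_text: list,
    process_model: Callable[[list, Weights], dict],
    build_cluster_tree: Callable[[dict], object],
    evaluate_accuracy: Callable[[object], float],
    update_weights: Callable[[Weights, object], Weights],
    expected_accuracy: float = 0.95,
    max_iters: int = 50,
) -> Tuple[Weights, object]:
    """Iterate S102-S104 until the cluster tree reaches the set condition."""
    # Hypothetical parameters: weights of the main executor, the target
    # system and the test step, as described for the parameter update above.
    weights: Weights = {"main_executor": 1.0, "target_system": 1.0, "test_step": 1.0}
    tree = None
    for _ in range(max_iters):
        similarities = process_model(training_text, weights)  # S102
        tree = build_cluster_tree(similarities)               # S103
        if evaluate_accuracy(tree) >= expected_accuracy:      # S104 stop condition
            break
        weights = update_weights(weights, tree)               # parameter update
    return weights, tree  # trained model + final cluster tree
```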
Step S102 will be described in detail below.
The processing model includes a preprocessing module, a word embedding module, a vectorization module and a similarity measurement module, referring to fig. 2, step S102 may include the following sub-steps:
and S1021, performing word segmentation and filtering processing on the training text by using the preprocessing module to obtain word segmentation results, wherein the word segmentation results comprise each word in the training text.
Because the training text comprises a plurality of test cases and each test case comprises a plurality of test steps, where each test step is a sentence — and in Chinese a sentence is composed of a string of words with no spaces between them — the training text must first be segmented into words. The segmentation can be performed with the Jieba library, a word-segmentation algorithm; after processing is completed, each test case becomes an ordered list of words. After the training text has been segmented, a general corpus containing all words in the training text is obtained; letting m denote the number of words in the general corpus, the words in the general corpus are {w_i : 1 ≤ i ≤ m}.
Meanwhile, to avoid bias caused by low-frequency words and irregular expressions, filtering can be performed after word segmentation: for example, filtering out all words that appear only once in the general corpus, filtering out common stop words (words that appear frequently but carry no meaning), and so on, finally obtaining the word segmentation result.
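A minimal Python sketch of this preprocessing (sub-step S1021), assuming the Jieba library is used for segmentation as described; the stop-word set and the appears-only-once filter threshold are illustrative.

```python
import collections
import jieba

def preprocess(test_steps, stopwords=frozenset({"的", "了", "在"})):
    """Segment each test step with Jieba, then filter rare words and stop words."""
    segmented = [jieba.lcut(step) for step in test_steps]
    counts = collections.Counter(w for words in segmented for w in words)
    # Drop words that appear only once in the corpus, stop words and whitespace.
    return [[w for w in words
             if counts[w] > 1 and w not in stopwords and w.strip()]
            for words in segmented]

print(preprocess(["用户点击确认按钮", "用户点击取消按钮"]))
```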
And S1022, inputting the word segmentation result into a word embedding module to obtain a word embedding result, wherein the word embedding result comprises word embedding of each word in the training text.
Word embedding refers to high-dimensional real number vectors, words can be represented through word embedding, and trained word embedding can keep semantic and grammatical meanings of the words in a vector space. Therefore, after the word segmentation result is obtained, the word embedding module can be trained according to the word segmentation result, and a word embedding result is obtained based on the trained word embedding module, wherein the word embedding result comprises word embedding of each word in the training text.
The word embedding module may be a neural network language model, such as a skip-gram neural network model. Meanwhile, adjacent word pairs in the word segmentation result are used as the input of the word embedding module, where adjacent word pairs are pairs of words whose positions within a sentence or phrase fall inside a set window. For example, let w = w_0 w_1 ... w_{len(w)-1} denote a sentence or phrase and let c denote a predefined window-size parameter; then all word pairs (w_t, w_{t+j}) in the word segmentation result satisfying -c ≤ j ≤ c, 0 ≤ t, and t + j < len(w) are used as input.
Because the input of the word embedding module is the adjacent word group in the word segmentation result, the testing steps are relatively short, and the testing steps in each test case are arranged in order, in order to obtain the adjacent word group in the word segmentation result, each testing step in the training text can be connected in series according to the sequence to form a long sentence, and then the adjacent word group is obtained according to the long sentence.
Meanwhile, when the Word embedding module is trained, the Word embedding module can be initialized by adopting the weight of the pre-trained Word2vec model, for example, the weight value of the pre-trained Word2vec model is used as the initial weight value of the skip-gram neural network model, so that the training efficiency can be improved.
In order to further improve training efficiency, the training cutoff condition of the word embedding module can be set as follows: the word-pair likelihood value likelihood(i, j) = sigmoid(v_i^T v_j) = 1 / (1 + exp(-v_i^T v_j)) is maximized, where v_i and v_j respectively denote the word embeddings of any two words w_i and w_j in the word segmentation result. A higher word-pair likelihood value means a better word embedding.
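As a sketch of sub-step S1022 under stated assumptions, the skip-gram model can be trained with gensim's Word2Vec (sg=1 selects skip-gram); the window pairs (w_t, w_{t+j}) are generated internally from the window parameter, and the word-pair likelihood check below mirrors the cutoff condition. All parameter values are illustrative, and initialization from a pre-trained Word2vec model is assumed to be handled separately.

```python
import numpy as np
from gensim.models import Word2Vec

# Ordered word lists: the test steps of a test case concatenated in sequence.
sentences = [["用户", "点击", "确认", "按钮"], ["用户", "点击", "取消", "按钮"]]

# sg=1: skip-gram; window=5 plays the role of the window-size parameter c.
model = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)

def pair_likelihood(model: Word2Vec, wi: str, wj: str) -> float:
    """likelihood(i, j) = sigmoid(v_i^T v_j); higher means better embeddings."""
    vi, vj = model.wv[wi], model.wv[wj]
    return 1.0 / (1.0 + np.exp(-float(np.dot(vi, vj))))

print(pair_likelihood(model, "点击", "确认"))
```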
And S1023, inputting the word embedding result into a vectorization module to obtain the space vectorization representation of each test step.
After the word embedding result is obtained, the word embedding result may be input to a vectorization module, and the vectorization module obtains a spatial vectorization representation of each test step based on the word embedding result. For details, refer to the following detailed description of sub-step S1023.
And S1024, inputting the space vectorization representation of each test step into the similarity measurement module to obtain a first similarity measurement value between each two test steps.
After each test step is vectorized, an ordinary similarity measurement can be performed on the test steps, that is, a first similarity metric value between each pair of test steps is calculated. The similarity metric may be the Euclidean distance (Euc) or cosine similarity (cos), which are common calculation methods and are not described here again.
Since the calculated first similarity metric values are subsequently used to cluster the test steps, choosing the Euclidean distance for the similarity measurement is superior to cosine similarity in terms of clustering accuracy, usually by 5 to 10 times. The main reasons are: the cosine similarity measures only the angular difference between two vectors, which may reflect only a very small part of the vector information, and its value range is very small, namely [-1, 1]; in contrast, the Euclidean distance integrates and measures all the differences between the two vectors in every dimension.
Additionally, the similarity metric may also be RWMD or WCD. The main idea of RWMD is to measure the minimum effort required to convert one sentence into another. The conversion of a sentence is performed through word-level conversions, and the effort required to convert one word into another can be measured by the Euclidean distance between the word embeddings of the two words, called the word-move distance. Direct calculation of the word mover's distance is very hardware- and time-consuming, but calculating only a strict lower bound of it is fast and efficient, so the similarity metric can use loose lower bounds of the word mover's distance, i.e., RWMD and WCD. For details, refer to the detailed description of sub-step S1024 below.
The following sub-step S1023 is described in detail.
As one embodiment, the spatial vectorization representation of each test step obtained by the vectorization module may be based on TF-IDF (term frequency–inverse text frequency) vectorization. TF-IDF is a statistical method for assessing the importance of a word to a document in a document set or corpus.
TF-IDF-based vectorization mainly considers two domains: the main executor and the target system. Because one test case in the training text corresponds to one main executor and one target system, the main-executor domain covers all test cases with different main executors and the same target system, and the target-system domain covers all test cases with different target systems and the same main executor.
Referring to fig. 3, the sub-step S1023 may include the following sub-steps:
s10231, calculating a first domain vector of the training text, wherein the first domain vector comprises a vector of each word in the training text in a main performer domain, and the main performer domain indicates that main performers are different and target systems are the same.
The main actor domain is a test case level domain, so the first domain vector is calculated based on the test cases in the training text. The way of calculating the first domain vector of the training text may include the steps of:
1. Acquiring any one target word in the training text. The target word may be denoted by w_i.
2. And calculating a first word frequency of the target word in the field of the main executor, wherein the first word frequency is the frequency of the target word appearing in all test cases of the corresponding target system.
The first word frequency may be denoted by tf_{i,ME}; tf_{i,ME} is the frequency with which w_i occurs in all test cases of its corresponding target system (e.g., a live APP), i.e., the ratio of the number of the live APP's test cases in which w_i occurs to the number of all test cases of the live APP.
3. Using the formula idf_{i,ME} = log(N_case / n_{i,ME}), calculating a first inverse text frequency of the target word in the main-executor domain, where n_{i,ME} is the frequency of occurrence of the target word in all test cases of its corresponding main executor and N_case is the number of all test cases of the main executor corresponding to the target word.
For example, if the main executor of the live APP is a user, then n_{i,ME} is the frequency of occurrence of the target word in all test cases whose main executor is the user, and N_case is the number of all test cases whose main executor is the user.
4. And solving the product of the first word frequency and the first inverse text frequency to obtain the vector of the target word in the main executor field.
tfidf_{i,ME} denotes the vector of w_i in the main-executor domain: tfidf_{i,ME} = tf_{i,ME} × idf_{i,ME}. TF measures the local word frequency, while IDF measures the global importance of a word, since words present in fewer documents of the entire training-text corpus are generally more valuable. tfidf_{i,ME} characterizes the adjusted weight of the target word in the main-executor domain.
5. And traversing each word in the training text to obtain a vector of each word in the training text in the main executor field, and forming a first field vector.
The vector of each word in the training text in the main-executor domain is calculated in order starting from the first word; that is, after step 4 is executed, it is determined whether the target word is the last word: if so, execution stops; if not, the process returns to step 1 until the vector of every word in the training text in the main-executor domain is obtained. The first domain vector may be denoted by [tfidf_{1,ME}, tfidf_{2,ME}, ..., tfidf_{m,ME}]^T.
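A short Python sketch of steps 2–4 above under stated assumptions: the test cases are represented as (main executor, target system, word list) tuples for illustration, and the IDF follows the reconstruction idf_{i,ME} = log(N_case / n_{i,ME}) given above.

```python
import math

def tfidf_me(word: str, executor: str, system: str, cases: list) -> float:
    """tfidf_{i,ME} of one word for a given main executor and target system."""
    sys_cases = [words for me, ts, words in cases if ts == system]
    exe_cases = [words for me, ts, words in cases if me == executor]
    # tf_{i,ME}: share of the target system's test cases containing the word.
    tf = sum(word in words for words in sys_cases) / len(sys_cases)
    # idf_{i,ME} = log(N_case / n_{i,ME}) over the main executor's test cases.
    n = max(1, sum(word in words for words in exe_cases))
    return tf * math.log(len(exe_cases) / n)

cases = [("user", "live_app", ["click", "confirm"]),
         ("user", "live_app", ["click", "cancel"]),
         ("user", "shop_app", ["click", "order"])]
print(tfidf_me("confirm", "user", "live_app", cases))
```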
And S10232, calculating a second field vector of the training text, wherein the second field vector comprises a vector of each word in the training text in the field of a target system, and the field of the target system indicates that the target systems are different and the main performers are the same.
The target system domain is a test step level domain, so the second domain vector is calculated based on the test steps in the training text. The manner of calculating the second domain vector of the training text may include the steps of:
1. Acquiring any one target word in the training text. The target word may be denoted by w_i.
2. And calculating a second word frequency of the target word in the field of the target system, wherein the second word frequency is the frequency of the target word in all the testing steps of the test case where the target word is located.
The second word frequency may be denoted by tf_{i,TS}; tf_{i,TS} is the frequency with which w_i occurs in all test steps of the test case (e.g., A) in which it resides, i.e., the ratio of the number of occurrences of w_i in all test steps of A to the total number of words in all test steps of A.
3. Using the formula idf_{i,TS} = log(N_case / n_{i,TS}), calculating a second inverse text frequency of the target word in the target-system domain, where n_{i,TS} is the frequency of occurrence of the target word in all test steps of its corresponding target system and N_case is the number of all test steps of the target system corresponding to the target word.
4. And solving the product of the second word frequency and the second inverse text frequency to obtain the vector of the target word in the field of the target system.
tfidf_{i,TS} denotes the vector of w_i in the target-system domain: tfidf_{i,TS} = tf_{i,TS} × idf_{i,TS}. tfidf_{i,TS} characterizes the adjusted weight of the target word in the target-system domain.
5. And traversing each word in the training text to obtain a vector of each word in the training text in the field of the target system, and forming a second field vector.
The vector of each word in the training text in the target-system domain is calculated in order starting from the first word; that is, after step 4 is executed, it is determined whether the target word is the last word: if so, execution stops; if not, the process returns to step 1 until the vector of every word in the training text in the target-system domain is obtained. The second domain vector may be denoted by [tfidf_{1,TS}, tfidf_{2,TS}, ..., tfidf_{m,TS}]^T.
And S10233, acquiring any target test step in the training text.
S10234, based on the first field vector, determining the first vector of the target test step in the main executor field.
The first vector may include a vector of each word in the target testing step in the main actor domain, and thus, the vector of each word in the target testing step in the main actor domain may be obtained from the first domain vector to obtain the first vector.
For example, if the target test step includes the three words w_a, w_b, and w_c, then the entries tfidf_{a,ME}, tfidf_{b,ME}, and tfidf_{c,ME} corresponding to w_a, w_b, and w_c are respectively obtained from the first domain vector [tfidf_{1,ME}, tfidf_{2,ME}, ..., tfidf_{m,ME}]^T, yielding the first vector [tfidf_{a,ME}, tfidf_{b,ME}, tfidf_{c,ME}]^T.
S10235, based on the second field vector, determining a second vector of the target test step in the target system field.
The second vector may include a vector of each word in the target system domain in the target test step, and thus, the vector of each word in the target test step in the target system domain may be obtained from the second domain vector, resulting in the second vector.
For example, if the target test step includes the three words w_a, w_b, and w_c, then the entries tfidf_{a,TS}, tfidf_{b,TS}, and tfidf_{c,TS} corresponding to w_a, w_b, and w_c are respectively obtained from the second domain vector [tfidf_{1,TS}, tfidf_{2,TS}, ..., tfidf_{m,TS}]^T, yielding the second vector [tfidf_{a,TS}, tfidf_{b,TS}, tfidf_{c,TS}]^T.
And S10236, obtaining a space vectorization representation of the target test step according to the first vector and the second vector.
The formula v = w_ME · (v_1 / ||v_1||) ⊕ (v_2 / ||v_2||) may be used to perform a weighted normalized concatenation of the first vector and the second vector, obtaining the spatial vectorization representation of the target test step, where w_ME is the weight of the main-executor domain, v_1 is the first vector, v_2 is the second vector, and ⊕ denotes the concatenation (cascade) operation.
And S10237, traversing each test step in the training text to obtain the space vectorization representation of each test step.
The spatial vectorization representation of each test step in the training text may be calculated in order starting from the first test step; that is, after sub-step S10236 is executed, it may be determined whether the target test step is the last one: if so, execution stops; if not, the process returns to sub-step S10233 until the spatial vectorization representation of every test step in the training text is obtained.
Meanwhile, because the main-executor domain and the target-system domain of the training text contain few words, the obtained spatial vectorization representations are relatively sparse, and principal component analysis (PCA) can be applied to reduce their dimensionality.
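A sketch of the weighted normalized concatenation of sub-step S10236 together with the PCA dimension reduction, assuming the concatenation formula reconstructed above; numpy and scikit-learn are illustrative choices, and the dimensions are arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

def step_vector(v1: np.ndarray, v2: np.ndarray, w_me: float) -> np.ndarray:
    """v = w_ME * (v1/||v1||) concatenated with v2/||v2||."""
    n1 = v1 / (np.linalg.norm(v1) or 1.0)  # guard against zero vectors
    n2 = v2 / (np.linalg.norm(v2) or 1.0)
    return np.concatenate([w_me * n1, n2])

rng = np.random.default_rng(0)
vectors = np.stack([step_vector(rng.random(50), rng.random(50), w_me=0.8)
                    for _ in range(200)])              # one row per test step
reduced = PCA(n_components=20).fit_transform(vectors)  # densify the sparse space
```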
As another embodiment, the spatial vectorization representation of each test step obtained by the vectorization module may also be word embedding + IDF vectorization. Because IDF is a good indicator of word importance, IDF and word embedding can be combined, and the spatial vectorization representation can be obtained by calculating a weighted combination of the two. In this case, the IDF needs to be a general measure of word importance over the whole context rather than being specific to the main-executor domain or the target-system domain, i.e., it is shared by both.
Referring to fig. 4, the sub-step S1023 may include the following sub-steps:
s1023a, constructing a main executor field word set and a target system field word set based on the training text, wherein the main executor field indicates that main executors are different and target systems are the same, and the target system field indicates that target systems are different and main executors are the same.
The main-executor domain word set and the target-system domain word set are subsets of the general corpus, which includes all words in the training text. The main-executor domain word set may be denoted by S_1, and the target-system domain word set by S_2.
And S1023b, acquiring any target test step in the training text.
S1023c, according to the word embedding of each first word in the target testing step, calculating a first vector of the target testing step in the main performer field, wherein the first word belongs to the word set of the main performer field.
In this embodiment, the manner of calculating the first vector of the target testing step in the main performer field according to the word embedding of each first word in the target testing step may include:
first, using a formulaCalculating an inverse text frequency for each first word, wherein niFor the frequency of occurrence of each first word in the training text, i.e. the number of occurrences of the first word in the training text and the ratio of all words in the training text, N is the number of all words in the training text.
Then, using the formulaCalculating a first vector of target test steps in the main actor domain, wherein v1Is a first vector, wiIs the ith first word, S1Word set, idf, in the field of the primary actoriIs wiInverse text frequency of viIs wiThe word of (2) is embedded.
S1023d, according to the word embedding of each second word in the target test step, calculating a second vector of the target test step in the target system field, wherein the second word belongs to the target system field word set.
In this embodiment, the manner of calculating the second vector of the target testing step in the target system field according to the word embedding of each second word in the target testing step may include:
first, using a formulaCalculating an inverse text frequency for each second word, wherein njN is the number of all words in the training text for the frequency of occurrence of each second word in the training text.
Then, using the formulaCalculating a second vector of the target test step in the target system domain, wherein v2Is a second vector, wjIs the ith second word, S2Is a target system domain word set, idfjIs wjInverse text frequency of vjIs wjThe word of (2) is embedded.
And S1023e, obtaining a space vectorization representation of the target test step according to the first vector and the second vector.
And S1023f, traversing each test step in the training text to obtain the space vectorization representation of each test step.
The substeps S1023 e-S1023 f are similar to the substeps S10236-S10237 described above, and are not described again here.
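A sketch of sub-steps S1023c and S1023d under stated assumptions, using the reconstructed formulas idf = log(N / n) and v = Σ idf · (word embedding); the embeddings dictionary stands in for the trained word embedding module.

```python
import math
from collections import Counter

import numpy as np

def domain_vector(step_words, domain_set, embeddings, corpus_words):
    """IDF-weighted sum of embeddings over the step's words in one domain set."""
    counts = Counter(corpus_words)
    total = len(corpus_words)  # N: the number of all words in the training text
    dim = len(next(iter(embeddings.values())))
    v = np.zeros(dim)
    for w in step_words:
        if w in domain_set and w in embeddings:
            idf = math.log(total / max(1, counts[w]))  # idf = log(N / n)
            v += idf * embeddings[w]
    return v
```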
The sub-step S1024 will be described in detail below.
Referring to fig. 5, the sub-step S1024 may include the following sub-steps:
S10241, acquiring any two test steps in the training text, denoted the first test step and the second test step.
S10242, calculating a first distance between the first test step and the second test step of the main performer field.
In this embodiment, the manner of calculating the first distance between the first testing step and the second testing step in the main performer area may include:
Calculate a first relaxed word mover's distance between the first test step and the second test step in the main-executor domain; calculate a first word centroid distance between the first test step and the second test step in the main-executor domain; and take the maximum of the first relaxed word mover's distance and the first word centroid distance as the first distance.
The first relaxed word mover's distance may be calculated by the following formulas:

RWMD(x, x') = max(l_1(x, x'), l_2(x, x')), where
l_1(x, x') = Σ_{w_i ∈ x} tf_i · ||v_i − v_{j*}||_2, j* = argmin_j ||v_i − v_j||_2,
l_2(x, x') = Σ_{w_j ∈ x'} tf_j · ||v_{i*} − v_j||_2, i* = argmin_i ||v_i − v_j||_2,

where RWMD(x, x') is the first relaxed word mover's distance, x is the first test step, x' is the second test step, w_i is any word in the first test step, w_j is any word in the second test step, and ||·||_2 denotes the L2 norm; tf_i is the frequency of occurrence of w_i in the main-executor domain word set, and tf_j is the frequency of occurrence of w_j in the main-executor domain word set; j* = argmin_j ||v_i − v_j||_2 denotes the value of j that minimizes the L2 norm of v_i − v_j, and i* = argmin_i ||v_i − v_j||_2 denotes the value of i that minimizes it.
The first word centroid distance may be calculated by the following formula:

WCD(x, x') = ||Σ_{w_i ∈ x} tf_i · v_i − Σ_{w_j ∈ x'} tf_j · v_j||_2,

where WCD(x, x') is the first word centroid distance, x is the first test step, x' is the second test step, w_i is any word in the first test step, w_j is any word in the second test step, and ||·||_2 denotes the L2 norm; tf_i is the frequency of occurrence of w_i in the main-executor domain word set, and tf_j is the frequency of occurrence of w_j in the main-executor domain word set.
The maximum of the first relaxed word mover's distance and the first word centroid distance is taken as the first distance, i.e., d(x, x') = max(WCD(x, x'), RWMD(x, x')), where d(x, x') denotes the first distance.
S10243, calculating a second distance between the first test step and the second test step of the target system field.
In this embodiment, the manner of calculating the second distance between the first testing step and the second testing step in the target system field may include:
Calculate a second relaxed word mover's distance between the first test step and the second test step in the target-system domain; calculate a second word centroid distance between the first test step and the second test step in the target-system domain; and take the maximum of the second relaxed word mover's distance and the second word centroid distance as the second distance.
It should be noted that the second distance is calculated in a manner similar to the first distance in sub-step S10242, except that, when calculating the second distance, tf_i is the frequency of occurrence of w_i in the target-system domain word set and tf_j is the frequency of occurrence of w_j in the target-system domain word set; the details are not repeated here.
S10244, the first distance and the second distance are weighted and summed to obtain a first similarity metric value between the first testing step and the second testing step.
The formula d = w_ME · d_1 + d_2 may be used to calculate the weighted sum of the first distance and the second distance, obtaining the first similarity metric value, where w_ME is the weight of the main-executor domain, d is the first similarity metric value, d_1 is the first distance, and d_2 is the second distance.
S10245, traversing every two test steps in the training text until a first similarity metric value between every two test steps is obtained.
It will be appreciated that the above sub-steps S10241 to S10244 are a cyclic process until a first similarity measure between each two test steps is obtained.
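A numpy sketch of sub-steps S10242–S10244 under the formulas reconstructed above: RWMD as the maximum of the two one-sided relaxed lower bounds, WCD as the distance between tf-weighted centroids, and the final value d = w_ME · d_1 + d_2. Representing each test step per domain by an embedding matrix and a tf weight vector is an illustrative data layout.

```python
import numpy as np

def rwmd(x_emb, x_tf, y_emb, y_tf):
    """Relaxed word mover's distance: max of the two relaxed lower bounds."""
    dists = np.linalg.norm(x_emb[:, None, :] - y_emb[None, :, :], axis=2)
    l1 = float(x_tf @ dists.min(axis=1))  # each word of x to its nearest in y
    l2 = float(y_tf @ dists.min(axis=0))  # each word of y to its nearest in x
    return max(l1, l2)

def wcd(x_emb, x_tf, y_emb, y_tf):
    """Word centroid distance between tf-weighted centroids."""
    return float(np.linalg.norm(x_tf @ x_emb - y_tf @ y_emb))

def first_similarity(x_me, y_me, x_ts, y_ts, w_me=0.8):
    """d = w_ME * d_1 + d_2, with d_k = max(WCD, RWMD) in its domain."""
    d1 = max(wcd(*x_me, *y_me), rwmd(*x_me, *y_me))  # main-executor domain
    d2 = max(wcd(*x_ts, *y_ts), rwmd(*x_ts, *y_ts))  # target-system domain
    return w_me * d1 + d2
```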
Step S103 will be described in detail below.
Referring to fig. 6, step S103 may include the following sub-steps:
and S1031, clustering the training texts by adopting a bottom-up hierarchical clustering algorithm according to each first similarity metric value to obtain an initial clustering tree.
The bottom-up hierarchical clustering algorithm starts from single-sample clusters, iteratively merges the two closest clusters, and records the whole merging history until only a single cluster remains. A binary tree labeled with the merging order can therefore be constructed to represent the bottom-up hierarchical clustering process. Using this property, the clustering result of the bottom-up hierarchical clustering algorithm is easy to adjust, e.g., by splitting, merging, or individual-attribution adjustment, where splitting replaces an original cluster by its two sub-clusters, merging replaces two sub-clusters by their parent cluster, and individual-attribution adjustment moves one sub-tree onto another sub-tree.
With the bottom-up hierarchical clustering algorithm, the criterion for merging the two closest clusters is to select the two clusters with the smallest average distance and merge them into one. The average distance between two clusters is the average distance over all pairs of samples drawn one from each cluster, as given by the following formula:

d_avg(cls_A, cls_B) = (1 / (|cls_A| · |cls_B|)) · Σ_{e_i ∈ cls_A} Σ_{e_j ∈ cls_B} d(e_i, e_j),

where d_avg(cls_A, cls_B) is the average distance between the two clusters cls_A and cls_B, d(e_i, e_j) is the distance between two samples e_i and e_j, e_i is any sample in cluster cls_A, e_j is any sample in cluster cls_B, and |cls_A| · |cls_B| is the number of sample pairs formed between cls_A and cls_B.
In this embodiment, since the first similarity metric value between each pair of test steps in the training text has already been calculated in step S102, the training text can be clustered with the bottom-up hierarchical clustering algorithm according to those values using the above formula, where the samples in the formula are the test steps and the distance is the first similarity metric value.
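The average-linkage merging described above matches scipy's agglomerative clustering; a minimal sketch, assuming the first similarity metric values are already arranged as a symmetric distance matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# dist[i][j]: first similarity metric value between test steps i and j.
dist = np.array([[0.0, 0.2, 0.9],
                 [0.2, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])

# 'average' linkage merges the two clusters with the smallest average
# distance, recording the full merge history (the initial cluster tree).
tree = linkage(squareform(dist), method="average")
labels = fcluster(tree, t=0.5, criterion="distance")  # flat clusters, if needed
print(tree, labels)
```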
S1032, performing post-optimization processing on the initial cluster tree by using a K-means clustering algorithm to obtain the cluster tree.
Clustering the training text with the bottom-up hierarchical clustering algorithm yields the initial cluster tree, but the algorithm has the following defect: the merging criterion considers only the overall optimum, not the optimum for individual samples, which affects clustering precision. For example, given a very large cluster and a very small but distinctive cluster, the large cluster tends to swallow the small one directly: because of the difference in sample counts, the average distance between the very large cluster and the very small but distinctive cluster can be smaller than the average distance between the very large cluster and other distinctive clusters that are not small. And once the very large cluster has swallowed the very small but distinctive cluster, the small cluster has almost no effect on the average characteristics of the large cluster because of the large size difference between the two, as if it had never existed.
To alleviate this defect, a K-means clustering algorithm may be used to perform post-optimization processing on the initial cluster tree. K-means is a very common and mature clustering algorithm, and its clustering process is roughly as follows: first, k initial means are generated by randomly selecting samples from the data set to be clustered; then, each iteration performs the following two steps: 1. assign each sample to the closest cluster; 2. recompute the mean of each cluster based on the assigned samples. Execution stops when the algorithm converges, i.e., when no cluster attribution changes.
In this embodiment, the initial clustering tree obtained by the bottom-up hierarchical clustering algorithm may be used as the initial clustering attribution state of the K-means clustering algorithm, and then the K-means clustering algorithm is iterated until the algorithm converges and then stops executing. And, in an iterative process, if there are no samples in some clusters, these empty clusters are removed.
On the one hand, the K-means clustering algorithm strictly guarantees that each sample is assigned to the closest cluster — otherwise it could not converge — so it brings per-sample optimization, and removing empty clusters can reduce the number of clusters. On the other hand, the bottom-up hierarchical clustering algorithm acts as an initialization technique that provides a good starting point for the K-means algorithm; compared with random selection of the initial means, it keeps the K-means algorithm from converging to certain unreasonable local optima.
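A plain numpy sketch of this post-optimization under stated assumptions: K-means is seeded with the hierarchical clustering's attributions, iterates assignment and mean updates until no attribution changes, and drops clusters that become empty. The sample matrix X and the vector representation are illustrative.

```python
import numpy as np

def kmeans_post_optimize(X: np.ndarray, init_labels: np.ndarray, max_iters=100):
    """Refine hierarchical-clustering attributions; remove empty clusters."""
    labels = np.unique(init_labels, return_inverse=True)[1]  # relabel 0..k-1
    for _ in range(max_iters):
        centers = np.stack([X[labels == k].mean(axis=0)
                            for k in np.unique(labels)])
        # Assign every sample to the cluster with the closest mean.
        new_labels = np.linalg.norm(
            X[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        if np.array_equal(new_labels, labels):  # converged: attributions fixed
            break
        # Re-indexing removes clusters left empty by the reassignment.
        labels = np.unique(new_labels, return_inverse=True)[1]
    return labels
```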
Meanwhile, since the post-optimization processing of the K-means clustering algorithm destroys the tree structure of the initial clustering tree, the clustering result of the K-means clustering algorithm after the post-optimization processing needs to be reconstructed into the tree structure, thereby obtaining the clustering tree. The process of reconstructing the tree structure may be:
firstly, running local bottom-up clusters in each cluster to enable each cluster to generate a bottom-up hierarchical clustering binary tree to obtain a local hierarchical tree of each cluster;
secondly, running global bottom-up clustering, wherein each cluster is regarded as a single point when running the global clustering, the clustering distance is the average distance between all clustering samples, and finally combining all clusters into a single cluster to obtain a global hierarchical tree;
finally, the local hierarchical tree of each cluster is concatenated into the global hierarchical tree. It should be noted that because only the leaf nodes of a local hierarchical tree are clusters, while both the leaf nodes and the root nodes of the global hierarchical tree are clusters, each local hierarchical tree is grafted directly onto the corresponding leaf node of the global hierarchical tree when concatenating. In the finally formed cluster tree, the leaf nodes are single samples, e.g., the test steps in this embodiment.
Referring to fig. 7, fig. 7 is a schematic flow chart illustrating a test step clustering method according to an embodiment of the present application, where the test step clustering method may include the following steps:
s201, obtaining a text to be processed, wherein the text to be processed comprises a plurality of testing steps to be processed.
The text to be processed may be any test text that needs to be clustered, the text to be processed may include a plurality of test steps to be processed, and one test step to be processed may correspond to one main performer (e.g., a user) and a target system (e.g., a live APP).
S202, inputting the text to be processed into the processing model trained by the clustering learning method to obtain a second similarity metric value between every two testing steps to be processed.
After the text to be processed is obtained, the text to be processed is input into the processing model trained by the clustering learning method, and after the trained processing model processes the text to be processed, a second similarity metric value between every two test steps to be processed in the text to be processed can be output.
The process of processing the text to be processed by the trained processing model may be: firstly, performing word segmentation and filtering processing on a text to be processed by using a preprocessing module to obtain each word in the text to be processed; then, inputting each word in the text to be processed into a word embedding module to obtain word embedding of each word in the text to be processed; embedding the word of each word in the text to be processed into an input vectorization module to obtain a spatial vectorization representation of each test step to be processed; and finally, inputting the spatial vectorization representation of each test step to be processed into a similarity measurement module to obtain a second similarity measurement value between each two test steps to be processed.
As one implementation, when the word embeddings of the words in the text to be processed are input into the vectorization module to obtain the spatial vectorization representation of each to-be-processed test step, TF-IDF-based vectorization may be used. Taking any one to-be-processed test step as an example, the processing procedure is as follows:
first, based on the first domain vector obtained in sub-step S10231, i.e., the main-executor domain vector [tfidf_{1,ME}, tfidf_{2,ME}, ..., tfidf_{m,ME}]^T, the vector of each word of the to-be-processed test step in the main-executor domain is obtained, and these vectors form the vector of the to-be-processed test step in the main-executor domain;
then, based on the second domain vector obtained in sub-step S10232, i.e., the target-system domain vector [tfidf_{1,TS}, tfidf_{2,TS}, ..., tfidf_{m,TS}]^T, the vector of each word of the to-be-processed test step in the target-system domain is obtained, and these vectors form the vector of the to-be-processed test step in the target-system domain;
finally, the formula v = w_ME · (v_1 / ||v_1||) ⊕ (v_2 / ||v_2||) is used to perform a weighted normalized concatenation of the to-be-processed test step's vector in the main-executor domain (v_1) and its vector in the target-system domain (v_2), obtaining the spatial vectorization representation of the to-be-processed test step, where w_ME is the finally updated weight of the main-executor domain.
As another implementation, when the word embeddings of the words in the text to be processed are input into the vectorization module to obtain the spatial vectorization representation of each to-be-processed test step, word embedding + IDF vectorization may be used. Taking any one to-be-processed test step as an example, the processing procedure is as follows:
firstly, determining words of each main performer field from the to-be-processed testing step, wherein the words of the main performer field belong to a main performer field word set, and the main performer field word set is obtained in the substep S1023 a; determining words of each target system field from the to-be-processed testing step, wherein the words of the target system field belong to a target system field word set, and the target system field word set is obtained in the substep S1023 a;
then, calculating the vector of the test step to be processed in the main executor domain according to the word embedding of each main-executor-domain word in that step, and calculating the vector of the test step to be processed in the target system domain according to the word embedding of each target-system-domain word in that step;
and finally, using the same formula $v = w_{ME}\,\frac{v_1}{\lVert v_1 \rVert} \oplus (1-w_{ME})\,\frac{v_2}{\lVert v_2 \rVert}$, carrying out a weighted normalized cascade of the vector of the test step to be processed in the main executor domain and its vector in the target system domain to obtain the spatial vectorization representation of the test step to be processed.
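A companion sketch of this embedding-plus-IDF variant; the IDF-weighted sum (rather than an average) follows the formulas reconstructed in the apparatus description below but remains an assumption. The resulting domain vectors feed the same `weighted_normalized_cascade` sketched above.

```python
import numpy as np

def idf_weighted_vector(step_tokens, domain_words, embedding, idf):
    """Sum the embeddings of the step's domain words, weighted by their IDF.
    embedding and idf map a word to its vector / scalar IDF."""
    vecs = [idf[w] * embedding[w] for w in step_tokens if w in domain_words]
    if not vecs:  # no domain word occurs in this step
        return np.zeros_like(next(iter(embedding.values())))
    return np.sum(vecs, axis=0)
```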
And S203, clustering the texts to be processed according to each second similarity metric value to generate a cluster tree to be processed, wherein the cluster tree to be processed comprises a plurality of cluster clusters to be processed, and one cluster to be processed comprises a plurality of test steps to be processed.
The specific processing procedures of steps S202 to S203 are similar to the processing procedures of steps S102 to S103, and are not described again here.
And S204, comparing the cluster tree to be processed with the final cluster tree to obtain the operation sequence corresponding to each cluster to be processed, wherein the final cluster tree is generated by the cluster learning method described above.
Since one cluster of the final cluster tree generated by the cluster learning method corresponds to one group of preset operation sequences, that is, all the test steps in the same cluster correspond to one group of preset operation sequences, the operation sequence corresponding to each cluster to be processed can be obtained by comparing the cluster tree to be processed with the final cluster tree.
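The text does not specify the comparison procedure. One plausible reading, sketched below under that assumption, assigns each cluster to be processed to the final cluster with the nearest centroid and inherits that cluster's preset operation sequence:

```python
import numpy as np

def match_operation_sequences(pending_clusters, final_clusters, op_sequences):
    """pending_clusters and final_clusters map cluster ids to lists of test
    step vectors; op_sequences maps each final cluster id to its preset
    operation sequence. Centroid-based matching is an assumption."""
    centroids = {cid: np.mean(vs, axis=0) for cid, vs in final_clusters.items()}
    matched = {}
    for pid, vectors in pending_clusters.items():
        c = np.mean(vectors, axis=0)
        nearest = min(centroids, key=lambda k: np.linalg.norm(c - centroids[k]))
        matched[pid] = op_sequences[nearest]  # one cluster -> one preset sequence
    return matched
```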
Compared with the prior art, the embodiment of the application has the following beneficial effects:
firstly, one cluster of a final cluster tree obtained through cluster learning corresponds to a group of preset operation sequences, namely all test steps in the same cluster correspond to a group of preset operation sequences, so that the original test steps can be clustered by using a trained processing model and the final cluster tree in application, redundant operation sequences can be reduced, and the coding workload is reduced;
secondly, when the Word embedding module is trained, the weight of the pre-trained Word2vec model can be adopted to initialize the parameters of the Word embedding module, so that the training efficiency can be improved.
Referring to fig. 8, fig. 8 is a block diagram illustrating a cluster learning apparatus 100 according to an embodiment of the present disclosure. The cluster learning apparatus 100 includes a first obtaining module 110, a first processing module 120, a first clustering module 130, and an executing module 140.
The first obtaining module 110 is configured to obtain a training text, where the training text includes a plurality of test cases, and the test cases include a plurality of test steps.
The first processing module 120 is configured to input a training text into a pre-constructed processing model, and obtain a first similarity metric between each two testing steps.
The first clustering module 130 is configured to cluster the training texts according to each first similarity metric value to generate a cluster tree, where the cluster tree includes a plurality of cluster clusters, and each cluster includes a plurality of test steps.
And the execution module 140 is configured to update parameters of the processing model according to the cluster tree until the cluster tree reaches a set condition, so as to obtain a trained processing model and a final cluster tree, where one cluster of the final cluster tree corresponds to a group of preset operation sequences.
Optionally, the processing model includes a preprocessing module, a word embedding module, a vectorization module, and a similarity measurement module, and the first processing module 120 is specifically configured to:
utilizing a preprocessing module to perform word segmentation and filtering processing on the training text to obtain word segmentation results, wherein the word segmentation results comprise each word in the training text; inputting the word segmentation result into a word embedding module to obtain a word embedding result, wherein the word embedding result comprises word embedding of each word in the training text; inputting the word embedding result into a vectorization module to obtain a spatial vectorization representation of each test step; and inputting the spatial vectorization representation of each test step into a similarity measurement module to obtain a first similarity measurement value between every two test steps.
Optionally, the first processing module 120 performs a manner of inputting the word embedding result into the vectorization module to obtain the spatial vectorization representation of each test step, including:
calculating a first domain vector of the training text, wherein the first domain vector comprises a vector of each word in the training text in the main executor domain, and the main executor domain indicates that the main executors differ while the target system is the same; calculating a second domain vector of the training text, wherein the second domain vector comprises a vector of each word in the training text in the target system domain, and the target system domain indicates that the target systems differ while the main executor is the same; acquiring any target testing step in the training text; determining a first vector of the target testing step in the main executor domain based on the first domain vector; determining a second vector of the target testing step in the target system domain based on the second domain vector; and obtaining a spatial vectorization representation of the target testing step according to the first vector and the second vector.
Optionally, the first processing module 120 performs a manner of calculating the first domain vector of the training text, including:
acquiring any target word in the training text; calculating a first word frequency of the target word in the main executor domain, wherein the first word frequency is the frequency with which the target word appears in all test cases of its corresponding target system; using the formula $idf_{i,ME} = \log\frac{N_{case}}{n_{i,ME}}$ to calculate a first inverse text frequency of the target word in the main executor domain, wherein $n_{i,ME}$ is the frequency with which the target word appears in all test cases of its corresponding main executor and $N_{case}$ is the number of all test cases of the main executor corresponding to the target word; taking the product of the first word frequency and the first inverse text frequency as the vector of the target word in the main executor domain; and traversing each word in the training text to obtain the vector of each word in the training text in the main executor domain, forming the first domain vector.
Optionally, the first processing module 120 performs a manner of calculating the second domain vector of the training text, including:
acquiring any target word in the training text; calculating a second word frequency of the target word in the target system domain, wherein the second word frequency is the frequency of the target word in all the test steps of the test case where the target word is located; using the formula $idf_{i,TS} = \log\frac{N_{case}}{n_{i,TS}}$ to calculate a second inverse text frequency of the target word in the target system domain, wherein $n_{i,TS}$ is the frequency with which the target word appears in all test steps of its corresponding target system and $N_{case}$ is the number of all test steps of the target system corresponding to the target word; taking the product of the second word frequency and the second inverse text frequency as the vector of the target word in the target system domain; and traversing each word in the training text to obtain the vector of each word in the training text in the target system domain, forming the second domain vector.
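A short sketch of the reconstructed IDF formulas above. The text describes $n$ as the word's frequency of occurrence; the document-count form below is the standard IDF variant and is an assumption.

```python
import math

def inverse_text_frequency(docs, word):
    """idf = log(N / n): N documents, n of them containing the word.
    Call with docs = all test cases of one main executor (for idf_ME),
    or all test steps of one target system (for idf_TS)."""
    n = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n) if n else 0.0
```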
Optionally, the first processing module 120 performs a manner of determining the first vector of the target testing step in the main executor domain based on the first domain vector, including: obtaining, from the first domain vector, the vector of each word in the target testing step in the main executor domain to obtain the first vector.
Optionally, the first processing module 120 performs a manner of determining the second vector of the target testing step in the target system domain based on the second domain vector, including: obtaining, from the second domain vector, the vector of each word in the target testing step in the target system domain to obtain the second vector.
Optionally, the first processing module 120 performs a manner of obtaining the spatial vectorization representation of the target testing step according to the first vector and the second vector, including: using the formula $v = w_{ME}\,\frac{v_1}{\lVert v_1 \rVert} \oplus (1-w_{ME})\,\frac{v_2}{\lVert v_2 \rVert}$ to carry out a weighted normalized cascade of the first vector and the second vector to obtain the spatial vectorization representation of the target testing step, wherein $w_{ME}$ is the weight of the main executor domain, $v_1$ is the first vector, $v_2$ is the second vector, and $\oplus$ is the concatenation (cascade) operation.
Optionally, the first processing module 120 performs a manner of inputting the word embedding result into the vectorization module to obtain the spatial vectorization representation of each test step, including:
constructing a main executor domain word set and a target system domain word set based on the training text, wherein the main executor domain indicates that the main executors differ while the target system is the same, and the target system domain indicates that the target systems differ while the main executor is the same; acquiring any target testing step in the training text; calculating a first vector of the target testing step in the main executor domain according to the word embedding of each first word in the target testing step, wherein the first words belong to the main executor domain word set; calculating a second vector of the target testing step in the target system domain according to the word embedding of each second word in the target testing step, wherein the second words belong to the target system domain word set; obtaining a spatial vectorization representation of the target testing step according to the first vector and the second vector; and traversing each testing step in the training text to obtain the spatial vectorization representation of each testing step.
Optionally, the first processing module 120 performs a manner of calculating the first vector of the target testing step in the main executor domain according to the word embedding of each first word in the target testing step, including:
using the formula $idf_i = \log\frac{N}{n_i}$ to calculate an inverse text frequency for each first word, wherein $n_i$ is the frequency with which each first word appears in the training text and $N$ is the number of all words in the training text; and using the formula $v_1 = \sum_{w_i \in S_1} idf_i \cdot v_i$ to calculate the first vector of the target testing step in the main executor domain, wherein $v_1$ is the first vector, $w_i$ is the i-th first word, $S_1$ is the main executor domain word set, $idf_i$ is the inverse text frequency of $w_i$, and $v_i$ is the word embedding of $w_i$.
Optionally, the first processing module 120 performs a manner of calculating a second vector of the target testing step in the target system domain according to the word embedding of each second word in the target testing step, including:
using the formula $idf_j = \log\frac{N}{n_j}$ to calculate an inverse text frequency for each second word, wherein $n_j$ is the frequency with which each second word appears in the training text and $N$ is the number of all words in the training text; and using the formula $v_2 = \sum_{w_j \in S_2} idf_j \cdot v_j$ to calculate the second vector of the target testing step in the target system domain, wherein $v_2$ is the second vector, $w_j$ is the j-th second word, $S_2$ is the target system domain word set, $idf_j$ is the inverse text frequency of $w_j$, and $v_j$ is the word embedding of $w_j$.
Optionally, the first processing module 120 performs a manner of inputting the spatial vectorized representation of each test step into the similarity metric module to obtain a first similarity metric value between each two test steps, including:
acquiring any two test steps in the training text, denoted a first testing step and a second testing step; calculating a first distance between the first testing step and the second testing step in the main executor domain; calculating a second distance between the first testing step and the second testing step in the target system domain; and carrying out weighted summation of the first distance and the second distance to obtain a first similarity metric value between the first testing step and the second testing step.
Optionally, the first processing module 120 performs a manner of calculating the first distance between the first testing step and the second testing step in the main executor domain, including: calculating a first relaxed word mover's distance between the first testing step and the second testing step in the main executor domain; calculating a first word center distance between the first testing step and the second testing step in the main executor domain; and taking the maximum of the first relaxed word mover's distance and the first word center distance as the first distance.
Optionally, the first processing module 120 performs a manner of calculating the second distance between the first testing step and the second testing step in the target system domain, including: calculating a second relaxed word mover's distance between the first testing step and the second testing step in the target system domain; calculating a second word center distance between the first testing step and the second testing step in the target system domain; and taking the maximum of the second relaxed word mover's distance and the second word center distance as the second distance.
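A sketch of this two-domain distance under stated assumptions: the relaxation of the word mover's distance below is one common form (each word travels to its nearest counterpart), and the default weight and its (1 - w_me) complement are assumptions.

```python
import numpy as np

def relaxed_wmd(a, b):
    """Relaxed word mover's distance between two lists of word embeddings:
    take the larger of the two one-sided nearest-neighbor costs."""
    def one_side(xs, ys):
        return float(np.mean([min(np.linalg.norm(x - y) for y in ys) for x in xs]))
    return max(one_side(a, b), one_side(b, a))

def word_center_distance(a, b):
    """Distance between the mean word embeddings of the two test steps."""
    return float(np.linalg.norm(np.mean(a, axis=0) - np.mean(b, axis=0)))

def domain_distance(a, b):
    """Per the scheme above: the larger of the two distances in one domain."""
    return max(relaxed_wmd(a, b), word_center_distance(a, b))

def first_similarity_metric(me_a, me_b, ts_a, ts_b, w_me=0.5):
    """Weighted sum over the two domains (the default weight is an assumption)."""
    return w_me * domain_distance(me_a, me_b) + (1 - w_me) * domain_distance(ts_a, ts_b)
```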
Optionally, the first clustering module 130 is specifically configured to: cluster the training texts by a bottom-up hierarchical clustering algorithm according to each first similarity metric value to obtain an initial cluster tree; and perform post-optimization processing on the initial cluster tree by a K-means clustering algorithm to obtain the cluster tree.
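A sketch of this two-stage clustering with SciPy and scikit-learn; the average linkage and the use of the hierarchical centroids to seed K-means are assumptions, since the text specifies only bottom-up clustering followed by K-means post-optimization.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans

def cluster_with_post_optimization(vectors, n_clusters):
    """Bottom-up hierarchical clustering builds the initial cluster tree;
    a K-means pass then post-optimizes the flat assignments."""
    X = np.asarray(vectors)
    tree = linkage(X, method="average")                        # initial cluster tree
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")
    seeds = np.array([X[labels == k].mean(axis=0) for k in np.unique(labels)])
    km = KMeans(n_clusters=len(seeds), init=seeds, n_init=1).fit(X)
    return tree, km.labels_
```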
Referring to fig. 9, fig. 9 is a block diagram illustrating a clustering apparatus 200 for testing steps according to an embodiment of the present disclosure. The clustering apparatus 200 for testing steps includes a second obtaining module 210, a second processing module 220, a second clustering module 230, and a comparing module 240.
The second obtaining module 210 is configured to obtain a to-be-processed text, where the to-be-processed text includes a plurality of to-be-processed test steps.
The second processing module 220 is configured to input the text to be processed into the processing model trained by using the cluster learning method, so as to obtain a second similarity metric between each two testing steps to be processed.
The second clustering module 230 is configured to cluster the texts to be processed according to each second similarity metric value, so as to generate a cluster tree to be processed, where the cluster tree to be processed includes multiple cluster clusters to be processed, and one cluster to be processed includes multiple test steps to be processed.
And a comparison module 240, configured to compare the cluster tree to be processed with the final cluster tree to obtain an operation sequence corresponding to each cluster to be processed, where the final cluster tree is generated by using the cluster learning method.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the cluster learning apparatus 100 and the test step clustering apparatus 200 described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
Referring to fig. 10, fig. 10 is a block diagram illustrating an electronic device 10 according to an embodiment of the present disclosure. The electronic device 10 includes a processor 11, a memory 12, and a bus 13, and the processor 11 and the memory 12 are connected by the bus 13.
The memory 12 is used for storing a program, such as the cluster learning apparatus 100, the test step clustering apparatus 200, or the cluster learning apparatus 100 and the test step clustering apparatus 200, and the processor 11 executes the program after receiving an execution instruction to implement the cluster learning method or the test step clustering method disclosed in the above embodiments of the present invention.
The electronic device 10 may be a general-purpose computer or a special-purpose computer, and both of them may be used to implement the cluster learning method or the test step clustering method of the embodiments of the present application, that is, the execution subject of the cluster learning method and the test step clustering method may be the same computer or different computers. Although only one computer is shown in the embodiments of the present application, for convenience, the functions described in the embodiments of the present application may be implemented in a distributed manner on a plurality of similar platforms to balance the processing load.
The Memory 12 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
The processor 11 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 11. The Processor 11 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components.
The embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by the processor 11, the cluster learning method or the test step clustering method disclosed in the foregoing embodiment is implemented.
To sum up, when performing cluster learning, the embodiment of the present application inputs a training text into a pre-constructed processing model to obtain a first similarity metric value between every two testing steps in the training text, clusters the training text according to each first similarity metric value to generate a cluster tree, and then updates parameters of the processing model according to the cluster tree until the cluster tree reaches a set condition, so as to obtain the trained processing model and a final cluster tree. One cluster of the final cluster tree corresponds to a set of preset operation sequences, that is, all testing steps in the same cluster correspond to one set of preset operation sequences; in application, the trained processing model and the final cluster tree can therefore be used to cluster the original testing steps, reducing redundant operation sequences and reducing the coding workload.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (19)
1. A method of cluster learning, the method comprising:
acquiring a training text, wherein the training text comprises a plurality of test cases, and the test cases comprise a plurality of test steps;
inputting the training text into a pre-constructed processing model to obtain a first similarity metric value between every two testing steps;
clustering the training texts according to each first similarity metric value to generate a cluster tree, wherein the cluster tree comprises a plurality of cluster clusters, and one cluster comprises a plurality of test steps;
and updating parameters of the processing model according to the clustering tree until the clustering tree reaches a set condition to obtain a trained processing model and a final clustering tree, wherein one clustering cluster of the final clustering tree corresponds to a group of preset operation sequences.
2. The method of claim 1, wherein the processing model comprises a pre-processing module, a word embedding module, a vectorization module, and a similarity metric module;
the step of inputting the training text into a pre-constructed processing model and obtaining a first similarity metric between each two of the testing steps comprises:
utilizing the preprocessing module to perform word segmentation and filtering processing on the training text to obtain word segmentation results, wherein the word segmentation results comprise each word in the training text;
inputting the word segmentation result into the word embedding module to obtain a word embedding result, wherein the word embedding result comprises word embedding of each word in the training text;
inputting the word embedding result into the vectorization module to obtain a spatial vectorization representation of each testing step;
and inputting the spatial vectorization representation of each test step into the similarity measurement module to obtain a first similarity measurement value between each two test steps.
3. The method of claim 2, wherein one of the test cases corresponds to one main executor and one target system;
the step of inputting the word embedding result into the vectorization module to obtain a spatial vectorization representation of each of the testing steps includes:
calculating a first domain vector of the training text, wherein the first domain vector comprises a vector of each word in the training text in a main executor domain, and the main executor domain indicates that the main executors differ while the target system is the same;
calculating a second domain vector of the training text, wherein the second domain vector comprises a vector of each word in the training text in a target system domain, and the target system domain indicates that the target systems differ while the main executor is the same;
acquiring any target testing step in the training text;
determining a first vector of the target testing step in the main executor domain based on the first domain vector;
determining a second vector of the target testing step in the target system domain based on the second domain vector;
obtaining a spatial vectorization representation of the target testing step according to the first vector and the second vector;
and traversing each testing step in the training text to obtain the spatial vectorization representation of each testing step.
4. The method of claim 3, wherein the step of computing the first domain vector for the training text comprises:
acquiring any one target word in the training text;
calculating a first word frequency of the target word in the main executor domain, wherein the first word frequency is the frequency with which the target word appears in all test cases of the target system corresponding to the target word;
using the formula $idf_{i,ME} = \log\frac{N_{case}}{n_{i,ME}}$ to calculate a first inverse text frequency of the target word in the main executor domain, wherein $n_{i,ME}$ is the frequency with which the target word appears in all test cases of its corresponding main executor, and $N_{case}$ is the number of all test cases of the main executor corresponding to the target word;
solving the product of the first word frequency and the first inverse text frequency to obtain the vector of the target word in the main executor domain;
and traversing each word in the training text to obtain the vector of each word in the training text in the main executor domain, forming the first domain vector.
5. The method of claim 3, wherein the step of computing the second domain vector for the training text comprises:
acquiring any one target word in the training text;
calculating a second word frequency of the target word in the target system domain, wherein the second word frequency is the frequency of the target word in all the testing steps of the test case where the target word is located;
using the formula $idf_{i,TS} = \log\frac{N_{case}}{n_{i,TS}}$ to calculate a second inverse text frequency of the target word in the target system domain, wherein $n_{i,TS}$ is the frequency with which the target word appears in all the testing steps of the target system corresponding to the target word, and $N_{case}$ is the number of all the testing steps of the target system corresponding to the target word;
solving the product of the second word frequency and the second inverse text frequency to obtain the vector of the target word in the target system domain;
and traversing each word in the training text to obtain the vector of each word in the training text in the target system domain, forming the second domain vector.
6. The method of claim 3, wherein the step of determining the first vector of the target testing step in the main executor domain based on the first domain vector comprises:
obtaining, from the first domain vector, the vector of each word in the target testing step in the main executor domain to obtain the first vector;
and the step of determining the second vector of the target testing step in the target system domain based on the second domain vector comprises:
obtaining, from the second domain vector, the vector of each word in the target testing step in the target system domain to obtain the second vector.
7. The method of claim 2, wherein one of the test cases corresponds to one main executor and one target system;
the step of inputting the word embedding result into the vectorization module to obtain a spatial vectorization representation of each of the testing steps includes:
constructing a main executor domain word set and a target system domain word set based on the training text, wherein the main executor domain indicates that the main executors differ while the target system is the same, and the target system domain indicates that the target systems differ while the main executor is the same;
acquiring any target testing step in the training text;
calculating a first vector of the target testing step in the main executor domain according to the word embedding of each first word in the target testing step, wherein the first words belong to the main executor domain word set;
calculating a second vector of the target testing step in the target system domain according to the word embedding of each second word in the target testing step, wherein the second words belong to the target system domain word set;
obtaining a spatial vectorization representation of the target testing step according to the first vector and the second vector;
and traversing each testing step in the training text to obtain the spatial vectorization representation of each testing step.
8. The method of claim 7, wherein said step of calculating a first vector of said target testing step in said main executor domain based on the word embedding of each first word in said target testing step comprises:
using the formula $idf_i = \log\frac{N}{n_i}$ to calculate an inverse text frequency for each of the first words, wherein $n_i$ is the frequency with which each first word appears in the training text and $N$ is the number of all words in the training text; and using the formula $v_1 = \sum_{w_i \in S_1} idf_i \cdot v_i$ to calculate the first vector of the target testing step in the main executor domain, wherein $v_1$ is the first vector, $w_i$ is the i-th first word, $S_1$ is the main executor domain word set, $idf_i$ is the inverse text frequency of $w_i$, and $v_i$ is the word embedding of $w_i$.
9. The method of claim 7, wherein said step of computing a second vector of said target testing step in said target system domain based on word embedding of each second word in said target testing step comprises:
using the formula $idf_j = \log\frac{N}{n_j}$ to calculate an inverse text frequency for each of the second words, wherein $n_j$ is the frequency with which each second word appears in the training text and $N$ is the number of all words in the training text; and using the formula $v_2 = \sum_{w_j \in S_2} idf_j \cdot v_j$ to calculate the second vector of the target testing step in the target system domain, wherein $v_2$ is the second vector, $w_j$ is the j-th second word, $S_2$ is the target system domain word set, $idf_j$ is the inverse text frequency of $w_j$, and $v_j$ is the word embedding of $w_j$.
10. The method according to claim 3 or 7, wherein the step of deriving a spatial vectorized representation of the target testing step from the first vector and the second vector comprises:
using the formula $v = w_{ME}\,\frac{v_1}{\lVert v_1 \rVert} \oplus (1-w_{ME})\,\frac{v_2}{\lVert v_2 \rVert}$ to perform a weighted normalized cascade of the first vector and the second vector to obtain the spatial vectorization representation of the target testing step, wherein $w_{ME}$ is the weight of the main executor domain, $v_1$ is the first vector, $v_2$ is the second vector, and $\oplus$ is the concatenation (cascade) operation.
11. A method according to claim 3 or 7, wherein said step of inputting the spatially vectorized representation of each of said testing steps into said similarity metric module, resulting in a first similarity metric value between each two of said testing steps, comprises:
acquiring any two testing steps in the training text, denoted a first testing step and a second testing step;
calculating a first distance between the first testing step and the second testing step in the main executor domain;
calculating a second distance between the first testing step and the second testing step in the target system domain;
carrying out weighted summation on the first distance and the second distance to obtain a first similarity metric value between the first testing step and the second testing step;
and traversing every two testing steps in the training text until a first similarity metric value between every two testing steps is obtained.
12. The method of claim 11, wherein said step of calculating the first distance between said first testing step and said second testing step in said main executor domain comprises:
calculating a first relaxed word mover's distance between the first testing step and the second testing step in the main executor domain;
calculating a first word center distance between the first testing step and the second testing step in the main executor domain;
and taking the maximum of the first relaxed word mover's distance and the first word center distance as the first distance.
13. The method of claim 11, wherein said step of calculating the second distance between said first testing step and said second testing step in said target system domain comprises:
calculating a second relaxed word mover's distance between the first testing step and the second testing step in the target system domain;
calculating a second word center distance between the first testing step and the second testing step in the target system domain;
and taking the maximum of the second relaxed word mover's distance and the second word center distance as the second distance.
14. The method of claim 1, wherein the step of clustering the training texts according to each of the first similarity metric values to generate a cluster tree comprises:
clustering the training texts by a bottom-up hierarchical clustering algorithm according to each first similarity metric value to obtain an initial cluster tree;
and performing post-optimization processing on the initial cluster tree by a K-means clustering algorithm to obtain the cluster tree.
15. A method of clustering test steps, the method comprising:
acquiring a text to be processed, wherein the text to be processed comprises a plurality of testing steps to be processed;
inputting the text to be processed into a processing model trained by the cluster learning method according to any one of claims 1 to 14 to obtain a second similarity metric between each two testing steps to be processed;
clustering the texts to be processed according to each second similarity measurement value to generate a clustering tree to be processed, wherein the clustering tree to be processed comprises a plurality of clustering clusters to be processed, and one clustering cluster to be processed comprises a plurality of testing steps to be processed;
and comparing the cluster tree to be processed with a final cluster tree to obtain an operation sequence corresponding to each cluster to be processed, wherein the final cluster tree is generated by using the cluster learning method of any one of claims 1 to 14.
16. An apparatus for cluster learning, the apparatus comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a training text, the training text comprises a plurality of test cases, and the test cases comprise a plurality of test steps;
the first processing module is used for inputting the training text into a pre-constructed processing model and obtaining a first similarity metric value between every two testing steps;
the first clustering module is used for clustering the training texts according to each first similarity metric value to generate a clustering tree, wherein the clustering tree comprises a plurality of clustering clusters, and one clustering cluster comprises a plurality of testing steps;
and the execution module is used for updating parameters of the processing model according to the clustering tree until the clustering tree reaches a set condition, so as to obtain a trained processing model and a final clustering tree, wherein one clustering cluster of the final clustering tree corresponds to a group of preset operation sequences.
17. A test-step clustering apparatus, the apparatus comprising:
the second acquisition module is used for acquiring a text to be processed, and the text to be processed comprises a plurality of test steps to be processed;
a second processing module, configured to input the text to be processed into a processing model trained by using the cluster learning method according to any one of claims 1 to 14, so as to obtain a second similarity metric between each two test steps to be processed;
a second clustering module, configured to cluster the texts to be processed according to each of the second similarity metric values, and generate a cluster tree to be processed, where the cluster tree to be processed includes multiple cluster clusters to be processed, and one cluster to be processed includes multiple test steps to be processed;
a comparison module, configured to compare the cluster tree to be processed with a final cluster tree to obtain an operation sequence corresponding to each cluster to be processed, where the final cluster tree is generated by using the cluster learning method according to any one of claims 1 to 14.
18. An electronic device, characterized in that the electronic device comprises:
one or more processors;
memory for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the cluster learning method of any one of claims 1-14 or the test step clustering method of claim 15.
19. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out a cluster learning method according to any one of claims 1 to 14, or a test-step clustering method according to claim 15.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110152871.5A CN112835798B (en) | 2021-02-03 | 2021-02-03 | Clustering learning method, testing step clustering method and related devices |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112835798A true CN112835798A (en) | 2021-05-25 |
CN112835798B CN112835798B (en) | 2024-02-20 |
Family
ID=75931992
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110152871.5A Active CN112835798B (en) | 2021-02-03 | 2021-02-03 | Clustering learning method, testing step clustering method and related devices |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112835798B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106874199A (en) * | 2017-02-10 | 2017-06-20 | 腾讯科技(深圳)有限公司 | Test case treating method and apparatus |
CN107992596A (en) * | 2017-12-12 | 2018-05-04 | 百度在线网络技术(北京)有限公司 | A kind of Text Clustering Method, device, server and storage medium |
CN109933515A (en) * | 2017-12-18 | 2019-06-25 | 大唐移动通信设备有限公司 | A kind of optimization method and automatic optimizing equipment of regression test case collection |
WO2019174422A1 (en) * | 2018-03-16 | 2019-09-19 | 北京国双科技有限公司 | Method for analyzing entity association relationship, and related apparatus |
CN109543036A (en) * | 2018-11-20 | 2019-03-29 | 四川长虹电器股份有限公司 | Text Clustering Method based on semantic similarity |
CN109739978A (en) * | 2018-12-11 | 2019-05-10 | 中科恒运股份有限公司 | A kind of Text Clustering Method, text cluster device and terminal device |
CN111680161A (en) * | 2020-07-07 | 2020-09-18 | 腾讯科技(深圳)有限公司 | Text processing method and device and computer readable storage medium |
CN112256874A (en) * | 2020-10-21 | 2021-01-22 | 平安科技(深圳)有限公司 | Model training method, text classification method, device, computer equipment and medium |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117476165A (en) * | 2023-12-26 | 2024-01-30 | 贵州维康子帆药业股份有限公司 | Intelligent management method and system for Chinese patent medicine medicinal materials |
CN117476165B (en) * | 2023-12-26 | 2024-03-12 | 贵州维康子帆药业股份有限公司 | Intelligent management method and system for Chinese patent medicine medicinal materials |
CN117808126A (en) * | 2024-02-29 | 2024-04-02 | 浪潮电子信息产业股份有限公司 | Machine learning method, device, equipment, federal learning system and storage medium |
CN117808126B (en) * | 2024-02-29 | 2024-05-28 | 浪潮电子信息产业股份有限公司 | Machine learning method, device, equipment, federal learning system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN112835798B (en) | 2024-02-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |