CN113064597B - Redundant code identification method, device and equipment - Google Patents


Info

Publication number
CN113064597B
Authority
CN
China
Prior art keywords
execution plan
information set
fragments
redundant
stock
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110439936.4A
Other languages
Chinese (zh)
Other versions
CN113064597A (en
Inventor
夏雯君
李海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110439936.4A
Publication of CN113064597A
Application granted
Publication of CN113064597B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/443 Optimisation
    • G06F8/4434 Reducing the memory space required by the program code
    • G06F8/4435 Detection or removal of dead or redundant code

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of this specification provide a method, device and equipment for identifying redundant code. The method comprises: acquiring an execution plan information set of target source code; determining a target feature vector for each execution plan fragment in the execution plan information set; based on the target feature vectors, performing similarity matching on the execution plan fragments using a clustering algorithm to obtain at least one group of successfully matched similar execution plan fragments; acquiring production operation and maintenance information for the similar execution plan fragments; and determining, according to the production operation and maintenance information, the redundant execution plan fragments to be adjusted in the at least one group of similar execution plan fragments. Because the identification process does not rely on manual operation, redundant, high-consumption logic on a big data platform can be identified efficiently, which effectively improves the platform's self-tuning capability.

Description

Redundant code identification method, device and equipment
Technical Field
The embodiments of this specification relate to the technical field of big data, and in particular to a method, device and equipment for identifying redundant code.
Background
As the application scenarios for data multiply, large enterprises are investing in building big data platforms. In the early stage of business development, however, a siloed ("chimney-style") development mode is often adopted so that business functions can be delivered quickly. This leaves a large amount of duplicated logic code on the big data platform and wastes a great deal of its computing resources. How to identify this redundant, high-consumption logic has become a bottleneck that obstructs the development of big data platforms.
In the prior art, because the data applications are separated from one another, operation and maintenance personnel have to re-examine the business data, redesign the data model, build data middle-platform tables and rewrite the existing logic. The project cycle is therefore long, and redundant code cannot be identified in time to optimize the big data platform.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of this specification provide a method, device and equipment for identifying redundant code, so as to solve the prior-art problem that redundant code cannot be identified in time to tune a big data platform.
The embodiment of the specification provides a redundant code identification method, which comprises the following steps: acquiring an execution plan information set of the target source code; wherein the execution plan information set contains at least one execution plan segment corresponding to each structured query statement in the target source code; determining target feature vectors of all execution plan fragments in the execution plan information set; based on the target feature vectors of the execution plan fragments, performing similarity matching on the execution plan fragments by using a clustering algorithm to obtain at least one group of similar execution plan fragments successfully matched; acquiring production operation and maintenance information of the similar execution plan fragments; and determining redundant execution plan fragments to be adjusted in the at least one group of similar execution plan fragments according to the production operation and maintenance information.
The embodiment of the specification also provides a redundant code identification device, which comprises: the first acquisition module is used for acquiring an execution plan information set of the target source code; wherein the execution plan information set contains at least one execution plan segment corresponding to each structured query statement in the target source code; the first determining module is used for determining target feature vectors of all the execution plan fragments in the execution plan information set; the matching module is used for carrying out similarity matching on each execution plan segment by utilizing a clustering algorithm based on the target feature vector of each execution plan segment to obtain at least one group of similar execution plan segments successfully matched; the second acquisition module is used for acquiring the production operation and maintenance information of the similar execution plan fragments; and the second determining module is used for determining redundant execution plan fragments to be adjusted in the at least one group of similar execution plan fragments according to the production operation and maintenance information.
The embodiment of the specification also provides a redundant code identification device, which comprises a processor and a memory for storing instructions executable by the processor, wherein the processor realizes the steps of the redundant code identification method when executing the instructions.
The present description also provides a computer-readable storage medium having stored thereon computer instructions that when executed perform the steps of the method of identifying redundant code.
The embodiments of this specification provide a redundant code identification method in which an execution plan information set of target source code is acquired and a target feature vector is determined for each execution plan fragment in the set, where the execution plan information set may contain at least one execution plan fragment corresponding to each structured query statement in the target source code. Based on the determined target feature vectors, similarity matching is performed on the execution plan fragments using a clustering algorithm to obtain at least one group of successfully matched similar execution plan fragments, so that at least one group of suspected redundant execution plan fragments can be obtained accurately. Furthermore, because low-consumption code occupies few resources and can be retained, and in order to make the identified redundant code more useful, the high-consumption fragments among the suspected redundant execution plan fragments can be determined according to the acquired production operation and maintenance information of each similar execution plan fragment and taken as the redundant execution plan fragments to be adjusted. Because the identification process does not rely on manual operation, redundant, high-consumption logic on the big data platform can be identified automatically, which effectively improves the platform's self-tuning capability.
Drawings
The accompanying drawings, which are included to provide a further understanding of embodiments of the present specification, are incorporated in and constitute a part of this specification and do not limit the embodiments of the present specification. In the drawings:
FIG. 1 is a schematic diagram of the steps of a redundant code identification method provided according to an embodiment of this specification;
FIG. 2 is a schematic diagram of the results of segmentation by STAGE and of parsing into a tree structure, provided according to an embodiment of this specification;
FIG. 3 is a schematic structural diagram of a redundant code identification device provided according to an embodiment of this specification;
FIG. 4 is a schematic structural diagram of a redundant code identification device provided according to an embodiment of this specification.
Detailed Description
The principles and spirit of the embodiments of the present specification will be described below with reference to several exemplary implementations. It should be understood that these embodiments are presented merely to enable one skilled in the art to better understand and implement the present description embodiments and are not intended to limit the scope of the present description embodiments in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Those skilled in the art will appreciate that the implementations of the embodiments of the present description may be implemented as a system, apparatus, method, or computer program product. Accordingly, the present disclosure may be embodied in the following forms, namely: complete hardware, complete software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
While the flow described below includes a number of operations occurring in a particular order, it should be apparent that these processes may include more or fewer operations, which may be performed sequentially or in parallel (e.g., using a parallel processor or a multi-threaded environment).
Referring to FIG. 1, this embodiment provides a method for identifying redundant code that can identify redundant code efficiently and accurately. The method may include the following steps.
S101: acquiring an execution plan information set of the target source code, where the execution plan information set contains at least one execution plan fragment corresponding to each structured query statement in the target source code.
In this embodiment, an execution plan information set of the target source code may be obtained, where the execution plan information set may include at least one execution plan fragment corresponding to each structured query statement in the target source code. The target source code is the source code in which redundant code is to be identified, and may be chosen according to actual requirements.
In this embodiment, the target source code may consist of multiple source code files, and in some embodiments it may include the latest version of the source code of different applications. The source code files are usually stored in a cloud code repository and can be pulled with a tool. This process can be configured as a continuous integration task in a DevOps system, so that source code files are continuously fetched from the repository for subsequent redundancy identification. DevOps is a collective term for a set of processes, methods and systems used to promote communication, collaboration and integration among development, technical operations and quality assurance departments.
In this embodiment, the target source code may be written in Structured Query Language (SQL), a special-purpose database query and programming language used for accessing data and for querying, updating and managing relational database systems. Because SQL is a declarative language rather than a procedural one, a user writing SQL describes only "what to do", not "how to do it". An execution plan can therefore be generated for each structured query statement in the target source code to describe the specific steps by which the statement is executed.
In this embodiment, since the execution plan of a single structured query statement has different phases (STAGEs), each execution plan may be divided by STAGE into multiple execution plan fragments. The STAGEs of the execution plan of the same structured query statement depend on one another.
In this embodiment, the execution plan information set may include information about each execution plan fragment, for example: the key processing steps, the dependencies among the steps, and the name-value pairs obtained from the key processing steps. This information may be stored in a structured form, for example a tree structure; the form may be chosen according to the actual situation, and the embodiments of this specification are not limited in this respect.
S102: a target feature vector for each execution plan segment in the execution plan information set is determined.
In this embodiment, a target feature vector of each execution plan segment in the execution plan information set may be determined, where the target feature vector may be used to characterize feature information of the execution plan segment.
In this embodiment, the target feature vector may include multiple items of feature data, and the feature data may include at least one of: the execution plan hierarchy, table names, predicates, the aggregation algorithm, the sorting mode, the projection mode, text processing features, the association (join) mode, and the like. The above feature data is merely an example; in some embodiments the target feature vector may include more or fewer items, for example aggregation fields, partition columns, and so on. The choice may be made according to the actual situation, and the embodiments of this specification are not limited in this respect.
S103: performing similarity matching on the execution plan fragments using a clustering algorithm, based on the target feature vector of each execution plan fragment, to obtain at least one group of successfully matched similar execution plan fragments.
In this embodiment, based on the target feature vectors of the execution plan fragments, similarity matching may be performed on the fragments using a clustering algorithm to obtain at least one group of successfully matched similar execution plan fragments. Cluster analysis is a statistical analysis method for classifying samples or indicators and is also an important data mining technique. It is based on similarity: patterns within one cluster are more similar to each other than to patterns in other clusters.
In this embodiment, modeling may be performed by using a clustering algorithm, and the target feature vectors of the respective execution plan segments are used as input data of the established model, so as to perform similarity matching. The output result of the model may be at least one group of similar execution plan fragments with successfully matched similarity, each group of similar execution plan fragments may include at least two execution plan fragments, and the execution plan fragments in each group of similar execution plan fragments may be regarded as suspected redundant execution plan fragments.
In this embodiment, the clustering algorithm may be any one of the following: the K-MEANS clustering algorithm, the mean shift clustering algorithm, the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm, Expectation-Maximization (EM) clustering using Gaussian Mixture Models (GMM), hierarchical clustering algorithms, and the like. The choice may be made according to the actual situation, and the embodiments of this specification are not limited in this respect.
In this embodiment, when modeling with the K-MEANS clustering algorithm, the feature vectors in the training data set may be passed to the K-MEANS algorithm as the object set, the number of clusters N is specified, and N objects are selected at random from the object set as the initial cluster centers. A convergence tolerance on the movement of the cluster centers is set as the iteration termination condition, and training continues, taking the mean vector of each cluster as its new center, until the termination condition is met. The resulting model can then be used for similarity matching.
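The K-MEANS procedure described above can be sketched as follows. This is a minimal, standard-library illustration only: the function names, the `tol` and `max_iter` parameters and the use of Euclidean distance are assumptions made for the sketch and are not specified by this embodiment.

```python
import random
from typing import List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Sequence[float]], n_clusters: int,
           tol: float = 1e-6, max_iter: int = 100,
           seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Cluster feature vectors; returns (labels, centers)."""
    rng = random.Random(seed)
    # Randomly select N objects from the object set as initial centers.
    centers = [list(p) for p in rng.sample(list(points), n_clusters)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each vector joins its nearest center.
        labels = [min(range(n_clusters), key=lambda k: sq_dist(p, centers[k]))
                  for p in points]
        # Update step: the mean vector of each cluster becomes its new center.
        new_centers = []
        for k in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers)) ** 0.5
        centers = new_centers
        if shift <= tol:  # termination: centers moved less than the tolerance
            break
    return labels, centers
```

Fragments that receive the same label form one group of similar execution plan fragments; in practice a library implementation of K-MEANS could be substituted for this sketch.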
S104: acquiring production operation and maintenance information for the similar execution plan fragments.
In this embodiment, since execution plan fragments that are successfully matched as similar are not necessarily redundant code that needs to be adjusted, production operation and maintenance information for the similar execution plan fragments may be acquired to evaluate their resource consumption. If the resource consumption is low, no adjustment is needed and the code can be retained; if the resource consumption is high, the suspected redundant code is high-consumption code and may need to be adjusted.
In this embodiment, the production operation and maintenance information may include: the running duration of the SQL statement, CPU (Central Processing Unit) usage at runtime, memory usage at runtime, disk space usage at runtime, and the like. The production operation and maintenance information is not limited to these examples; those skilled in the art may make other modifications in light of the technical spirit of the embodiments of this specification, and such modifications fall within the scope of the embodiments as long as their functions and effects are the same as or similar to those described here.
In this embodiment, the production operation and maintenance information of the similar execution plan fragments may be acquired by pulling it from a preset database, or by mining it from text descriptions using rule extraction combined with a corpus. Other possible ways may also be used, for example receiving production operation and maintenance information entered by a user; the way may be chosen according to the actual situation, and the embodiments of this specification are not limited in this respect.
S105: determining, according to the production operation and maintenance information, the redundant execution plan fragments to be adjusted in the at least one group of similar execution plan fragments.
In this embodiment, the redundant execution plan segments to be adjusted in at least one group of similar execution plan segments may be determined according to the production operation and maintenance information, so that high-consumption redundant codes may be accurately screened out, and further, the high-consumption redundant codes may be adjusted to optimize the target source code.
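As a purely illustrative sketch of this screening step, the fragments in a group of similar execution plan fragments could be compared against resource limits. The `OpsInfo` record, its field names and the two limits below are hypothetical; this embodiment does not prescribe a particular screening rule at this point.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OpsInfo:
    """Hypothetical production operation and maintenance record for one fragment."""
    fragment_id: str
    run_seconds: float   # running duration of the SQL statement
    cpu_seconds: float   # CPU usage at runtime
    memory_mb: float     # memory usage at runtime


def fragments_to_adjust(group: List[OpsInfo],
                        max_run_seconds: float,
                        max_cpu_seconds: float) -> List[str]:
    """Return the ids of fragments in a similar group that exceed either limit."""
    return [info.fragment_id for info in group
            if info.run_seconds > max_run_seconds
            or info.cpu_seconds > max_cpu_seconds]
```

Fragments below both limits are treated as low-consumption and are retained without adjustment.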
In this embodiment, the corresponding structured query statement in the target source code may be determined from the redundant execution plan fragment to be adjusted. So that operation and maintenance personnel can see the identification result more intuitively, the structured query statement corresponding to the redundant execution plan fragment to be adjusted, the resource consumption determined from the production operation and maintenance information, and similar information can be associated with one another and displayed in an interface for the operation and maintenance personnel. The information displayed may also include other data, such as the position of the redundant execution plan fragment to be adjusted in the target source code and the execution plan cost estimation information; this may be decided according to the actual situation and is not limited in the embodiments of this specification.
From the above description, it can be seen that the embodiments of this specification achieve the following technical effects. An execution plan information set of the target source code is acquired and a target feature vector is determined for each execution plan fragment in the set, where the execution plan information set may contain at least one execution plan fragment corresponding to each structured query statement in the target source code. Based on the determined target feature vectors, similarity matching is performed on the execution plan fragments using a clustering algorithm to obtain at least one group of successfully matched similar execution plan fragments, so that at least one group of suspected redundant execution plan fragments can be obtained accurately. Furthermore, because low-consumption code occupies few resources and can be retained, and in order to make the identified redundant code more useful, the high-consumption fragments among the suspected redundant execution plan fragments can be determined according to the acquired production operation and maintenance information of each similar execution plan fragment and taken as the redundant execution plan fragments to be adjusted. Because the identification process does not rely on manual operation, redundant, high-consumption logic on the big data platform can be identified automatically, which effectively improves the platform's self-tuning capability.
In one embodiment, obtaining the execution plan information set of the target source code may include: acquiring the target source code and determining the structured query statement information set corresponding to it. Database table statistics corresponding to the target source code may be obtained, and an execution plan for each structured query statement may be generated from the database table statistics and the structured query statement information set. Further, the execution plan of each structured query statement may be divided by STAGE to obtain multiple execution plan fragments, and the execution plan fragments may be parsed into a tree structure to obtain the execution plan information set of the target source code.
In this embodiment, the source code files are generally stored in a cloud code repository and can be pulled with a tool. This process can be configured as a continuous integration task in a DevOps system, so that source code files are continuously fetched from the repository for subsequent redundancy identification. DevOps is a collective term for a set of processes, methods and systems used to promote communication, collaboration and integration among development, technical operations and Quality Assurance (QA) departments.
In this embodiment, the structured query statements may be parsed out of the target source code to obtain the structured query statement information set corresponding to the target source code. The structured query statement information set may include the attribute of each structured query statement, the path of the statement in the target source code, the original structured query statement, and the job, job group and application to which it belongs. The attribute of a structured query statement may take one of two values: the increment attribute indicates code newly added in the current version, and the stock attribute indicates code from a historical version that is already running in the production environment.
In this embodiment, in order to generate an execution plan for each structured query statement, database table statistics corresponding to the target source code may be obtained. The database table statistics mainly describe the size, scale, data distribution and similar properties of the database tables involved in the structured query statements.
In this embodiment, the database table statistics may be obtained through an interface provided by production operation and maintenance and imported into the development environment at a preset time interval. The execution plan generation interface of the development environment can then be used to obtain production-like execution plan text, that is, the execution plan of each structured query statement. So that the execution plans can be stored in a structured manner, operations such as fragment segmentation and keyword feature extraction may be performed on them. This processing can be configured as a continuous integration task in a DevOps framework, so that structured query statements are processed continuously and the resulting execution plan fragments are stored in the execution plan information set in a structured form.
In this embodiment, the generated execution plan of each structured query statement may be divided by STAGE, and each STAGE may be parsed and stored as a tree structure. A STAGE represents one phase of an execution plan; the plan can be divided by STAGE into different execution plan fragments, and the STAGEs depend on one another.
In this embodiment, each STAGE obtained by the division may be parsed into a tree structure, and the key processing steps and the dependencies between steps obtained by parsing may be stored in the execution plan information set in tree form. The key processing steps may include: MapOperator (Map operation), TableScan (table scan operation), FilterOperator (filter operation), ReduceOutputOperator (output to Reduce), SelectOperator (column projection operation), GroupByOperator (grouping and aggregation), ReduceOperator (Reduce operation), JoinOperator (association operation), FileOutputOperator (file output operation), and the like. The results of division by STAGE and parsing into a tree structure may be as shown in FIG. 2.
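The division by STAGE and the tree parsing can be sketched as follows. The plan text layout assumed here (a `Stage: Stage-N` header line followed by operators nested by indentation) and the names `PlanNode`, `split_by_stage` and `parse_tree` are assumptions for illustration; the actual execution plan text depends on the engine that produces it.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Assumed header format for the start of each STAGE in the plan text.
STAGE_HEADER = re.compile(r"^\s*Stage: (Stage-\d+)\s*$")


@dataclass
class PlanNode:
    """One processing step in the tree of an execution plan fragment."""
    name: str
    children: List["PlanNode"] = field(default_factory=list)


def split_by_stage(plan_text: str) -> Dict[str, List[str]]:
    """Divide an execution plan into fragments keyed by STAGE name."""
    fragments: Dict[str, List[str]] = {}
    current = None
    for line in plan_text.splitlines():
        match = STAGE_HEADER.match(line)
        if match:
            current = match.group(1)
            fragments[current] = []
        elif current is not None and line.strip():
            fragments[current].append(line)
    return fragments


def parse_tree(lines: List[str]) -> PlanNode:
    """Build a tree from indented lines; deeper indentation means a child step."""
    root = PlanNode("root")
    stack = [(-1, root)]  # (indent, node) pairs along the current path
    for line in lines:
        indent = len(line) - len(line.lstrip(" "))
        node = PlanNode(line.strip())
        while stack[-1][0] >= indent:
            stack.pop()  # climb back up to this line's parent
        stack[-1][1].children.append(node)
        stack.append((indent, node))
    return root


SAMPLE_PLAN = "\n".join([
    "Stage: Stage-1",
    "  Map Operator Tree:",
    "    TableScan",
    "      Filter Operator",
    "Stage: Stage-2",
    "  File Output Operator",
])
```

Calling `split_by_stage(SAMPLE_PLAN)` yields one list of lines per STAGE, and `parse_tree` turns each list into a tree whose parent-child links record the dependencies between steps.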
In one embodiment, determining the structured query statement information set corresponding to the target source code may include: segmenting and shaping the target source code to obtain multiple structured query statements. The attributes of the structured query statements may then be marked, where the attributes may include the increment attribute and the stock attribute. Further, the structured query statement information set can be obtained based on the attributes of the structured query statements. The structured query statement information set contains the attribute and the feature information of each structured query statement, and the feature information includes: the path of the structured query statement in the target source code, the original structured query statement, and the job, job group and application to which it belongs.
In this embodiment, a code segmentation shaper may be used to segment and shape the target source code. The source code may be split on a separator to obtain multiple independent structured query statements. Further, the source code may be shaped, for example by regular-expression matching and substitution, so that each statement becomes an executable structured query statement.
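A minimal sketch of segmentation and shaping is given below. It assumes that the separator is a semicolon and that shaping consists of filling `${name}` placeholders and collapsing whitespace; both the placeholder syntax and the function names are hypothetical and are not taken from this embodiment.

```python
import re
from typing import Dict, List

# Assumed placeholder syntax for script variables, e.g. ${day}.
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def split_statements(source: str) -> List[str]:
    """Split source code on the separator into independent statements.

    Note: this naive split does not handle semicolons inside string literals.
    """
    return [part.strip() for part in source.split(";") if part.strip()]


def shape_statement(statement: str, defaults: Dict[str, str]) -> str:
    """Make a statement executable by regular-expression substitution."""
    filled = PLACEHOLDER.sub(lambda m: defaults.get(m.group(1), "NULL"),
                             statement)
    # Collapse line breaks and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", filled)
```

For example, `shape_statement("SELECT a\nFROM t WHERE dt = '${day}'", {"day": "20210101"})` produces a single-line statement with the placeholder filled in.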
In this embodiment, the segmented and shaped structured query statements may be marked with the increment or stock attribute. The changed lines can be identified by comparing different versions of the source code files: if a structured query statement falls within the changed lines, it is marked as an increment statement; otherwise, it is marked as a stock statement.
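The version comparison can be sketched with Python's standard `difflib` module. The representation of a statement as a (first line, last line) span and the function names are assumptions made for this illustration.

```python
import difflib
from typing import Set, Tuple


def changed_lines(old_text: str, new_text: str) -> Set[int]:
    """Return 1-based line numbers in the new version that were added or changed."""
    old = old_text.splitlines()
    new = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed: Set[int] = set()
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.update(range(j1 + 1, j2 + 1))
    return changed


def mark_attribute(span: Tuple[int, int], changed: Set[int]) -> str:
    """Mark a statement spanning lines [start, end] as 'increment' or 'stock'."""
    start, end = span
    if any(line in changed for line in range(start, end + 1)):
        return "increment"
    return "stock"
```

A statement is marked as an increment statement when any of its lines falls within the changed lines, and as a stock statement otherwise.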
In this embodiment, feature information such as the path of the structured query statement in the target source code, the original structured query statement, and the job, job group and application to which it belongs may be stored in the structured query statement information set together with the statement, as its attached information.
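One possible in-memory shape for an entry of the structured query statement information set is sketched below. The class and field names are hypothetical; only the listed items of information come from this embodiment.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class StatementInfo:
    """One entry of the structured query statement information set."""
    attribute: str      # "increment" or "stock"
    path: str           # path of the statement in the target source code
    original_sql: str   # the original structured query statement
    job: str            # job to which the statement belongs
    job_group: str
    application: str


def group_by_attribute(infos: Iterable[StatementInfo]) -> Dict[str, List[StatementInfo]]:
    """Separate entries into increment statements and stock statements."""
    grouped: Dict[str, List[StatementInfo]] = {"increment": [], "stock": []}
    for info in infos:
        grouped[info.attribute].append(info)
    return grouped
```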
In one embodiment, determining the target feature vector for each execution plan fragment in the execution plan information set may include: extracting the feature data of each execution plan fragment using a rule matching method to obtain a feature information set for each execution plan fragment. The feature information set of each execution plan fragment may be preprocessed to obtain an initial feature vector for each fragment. Further, the initial feature vector of each execution plan fragment may be normalized to obtain the target feature vector of each fragment.
In this embodiment, the keyword features of an execution plan fragment may be extracted by rule matching and stored in a structured form. In some embodiments, each node of the tree structure corresponding to each STAGE in the execution plan information set may record its key features as name-value pairs. For example: for TableScan, the table name and statistics features may be extracted; for FilterOperator, predicates and statistics features; for GroupByOperator, the aggregation algorithm, aggregation fields, aggregation mode and statistics features; for ReduceOutputOperator, the sorting mode, name-value expressions, partition columns and statistics features; for SelectOperator, the projection columns, output columns and statistics features; for FileOutputOperator, the compression flag, input format, output format and statistics features; and for JoinOperator, the association mode and association fields. In this way the initial feature data set of each execution plan fragment is obtained.
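Rule-matching extraction into name-value pairs can be sketched as follows. The plan line formats matched by the patterns (for example `alias: ...` and `predicate: ...`) and the feature names are assumptions made for illustration; real rules would be written against the actual execution plan text.

```python
import re
from typing import Dict, List

# Hypothetical extraction rules: feature name -> pattern with one capture group.
RULES = {
    "table_name": re.compile(r"alias:\s*(\S+)"),
    "predicate": re.compile(r"predicate:\s*(.+)"),
    "aggregation": re.compile(r"aggregations:\s*(.+)"),
    "sort_order": re.compile(r"sort order:\s*(\S+)"),
}


def extract_features(fragment_lines: List[str]) -> Dict[str, List[str]]:
    """Collect name-value pairs from the lines of one execution plan fragment."""
    features: Dict[str, List[str]] = {}
    for line in fragment_lines:
        for name, pattern in RULES.items():
            match = pattern.search(line)
            if match:
                features.setdefault(name, []).append(match.group(1).strip())
    return features
```

Each feature name maps to the list of values found in the fragment, which can then be preprocessed and encoded into the initial feature vector.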
In this embodiment, the initial feature data set of each execution plan segment may be preprocessed; the preprocessing may include statistical analysis, removal of redundant features, and the like. For example, the execution plan level may be obtained by statistical analysis, and the projection mode may be determined from the projection columns. The specific preprocessing may be determined according to the actual situation, and the embodiments of the present specification are not limited in this respect. The initial feature vector may be generated from the preprocessed feature data. Further, in order to remove the influence of differing dimensions (units and scales), the initial feature vector of each execution plan segment may be normalized to obtain the target feature vector of each execution plan segment, so that the feature data of each execution plan segment can be stored in a structured form.
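The specification does not name a particular normalization. As one possible reading, the sketch below applies min-max scaling column by column so that every feature falls in the range 0 to 1; the choice of min-max scaling and the function name `min_max_normalize` are assumptions for illustration only.

```python
def min_max_normalize(vectors):
    """Scale each column of a list of equal-length numeric vectors to [0, 1]."""
    if not vectors:
        return []
    columns = list(zip(*vectors))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for vec in vectors:
        row = []
        for value, low, high in zip(vec, lows, highs):
            # A constant column carries no information; map it to 0.0.
            row.append(0.0 if high == low else (value - low) / (high - low))
        normalized.append(row)
    return normalized
```

For example, `min_max_normalize([[1, 10], [3, 10], [2, 30]])` maps the first column from the range 1..3 and the second from the range 10..30 onto 0..1, so neither feature dominates a distance computation merely because of its scale.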
In one embodiment, after the at least one group of successfully matched similar execution plan fragments is obtained, the method may further include: adding the at least one group of successfully matched similar execution plan fragments to a first redundant information set, and determining a cost estimation mean value for each group of similar execution plan fragments in the first redundant information set. Further, each group of similar execution plan fragments whose cost estimation mean value is less than or equal to a first preset threshold may be removed from the first redundant information set to obtain a second redundant information set, and the execution plan fragments in the second redundant information set may be classified according to the attribute of the structured query statement corresponding to each execution plan fragment to obtain an incremental redundant information set and a stock redundant information set.
In this embodiment, the cost estimation value of each individual execution plan segment in the first redundant information set may be obtained as a weighted average of the cost values of the key processing steps of that segment. The cost estimation mean value of each group of similar execution plan segments may then be obtained as the arithmetic mean of the cost estimation values of the segments in the group. Groups whose cost estimation mean value is greater than the first preset threshold may be retained; groups whose cost estimation mean value is less than or equal to the first preset threshold need no adjustment and may be removed from the first redundant information set.
In this embodiment, each execution plan segment may include a plurality of key processing steps, where each key processing step has a corresponding cost value, and the cost value may be a dimensionless value greater than 0, for example 2, 3, etc. The cost value may be determined according to the CPU usage, the calling frequency, etc. of the structured query statement; the specific manner may be determined according to the actual situation, which is not limited in this specification.
In this embodiment, a weight may be set for each key processing step, and the cost estimation value of a single execution plan segment may be calculated as a weighted average that combines the weight of each key processing step with its corresponding cost value. In some embodiments, the weights may be ignored and the cost estimation value of a single execution plan segment obtained directly as the arithmetic mean of the cost values of its key processing steps. The specific manner may be determined according to the actual situation, and the specification is not limited thereto.
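The cost filtering described above can be sketched as follows, under stated assumptions: each fragment is represented as a dictionary with a `costs` list and an optional `weights` list, and groups are held in a dictionary keyed by a group name. These data shapes and the function names are hypothetical and chosen only for this example.

```python
def fragment_cost(costs, weights=None):
    """Weighted average of step cost values; plain mean when weights is None."""
    if weights is None:
        weights = [1.0] * len(costs)
    return sum(c * w for c, w in zip(costs, weights)) / sum(weights)


def filter_groups(groups, first_threshold):
    """Keep only groups whose mean fragment cost exceeds first_threshold."""
    kept = {}
    for name, fragments in groups.items():
        costs = [fragment_cost(f["costs"], f.get("weights")) for f in fragments]
        mean_cost = sum(costs) / len(costs)
        # Groups at or below the threshold are dropped without adjustment.
        if mean_cost > first_threshold:
            kept[name] = fragments
    return kept
```

With `groups = {"g1": [{"costs": [2, 4]}, {"costs": [6, 6]}], "g2": [{"costs": [1, 1]}]}` and a first threshold of 2, group `g1` (mean cost 4.5) is retained and group `g2` (mean cost 1.0) is removed.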
In this embodiment, because an incremental statement has not yet been put into production and therefore has no production operation and maintenance record, each group of similar execution plan fragments in the second redundant information set may be classified according to the attribute (incremental or stock) of the structured query statement corresponding to each execution plan fragment, so as to obtain an incremental redundancy information set and a stock redundancy information set for each group of similar execution plan fragments.
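A minimal sketch of this classification step is given below. The `Fragment` record, its field names, and the string labels `"incremental"` and `"stock"` are assumptions introduced for the example; the specification only states that each structured query statement carries an incremental or stock attribute.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    fragment_id: str
    attribute: str  # "incremental" or "stock" (assumed labels)


def split_group(group):
    """Split one group of similar fragments into (incremental, stock) lists."""
    incremental = [f for f in group if f.attribute == "incremental"]
    stock = [f for f in group if f.attribute == "stock"]
    return incremental, stock
```

Either list may be empty, which is what the subsequent emptiness check relies on.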
In one embodiment, determining the redundant execution plan segments to be adjusted in the at least one group of similar execution plan segments based on the production operation and maintenance information may include: in the case that the incremental redundancy information set corresponding to a group of similar execution plan fragments is empty and the stock redundancy information set is not empty, determining the average resource consumption of the stock redundancy information set according to the production operation and maintenance information of each execution plan fragment in the stock redundancy information set; and, in the case that the average resource consumption is greater than a second preset threshold, adding the execution plan fragments in the stock redundancy information set to the redundancy information set to be adjusted. In the case that the average resource consumption is less than or equal to the second preset threshold, the execution plan fragments in the stock redundancy information set may be removed.
In this embodiment, an emptiness check may be performed on the incremental redundancy information set and the stock redundancy information set obtained by the classification, and the following three cases may be handled separately: the incremental set is empty and the stock set is not empty; neither the incremental set nor the stock set is empty; and the incremental set is not empty and the stock set is empty.
In this embodiment, when it is determined that the incremental redundancy information set corresponding to a group of similar execution plan segments is empty and the stock redundancy information set is not empty, the production operation and maintenance information of each execution plan segment in the stock redundancy information set may be acquired first, and the average resource consumption of the stock redundancy information set may be determined. The production operation and maintenance information may include a plurality of production operation and maintenance indexes, for example: SQL statement running time, CPU usage during running, memory usage during running, disk space usage during running, etc. Of course, the production operation and maintenance indexes are not limited to the above examples. Those skilled in the art may make other modifications in light of the technical spirit of the embodiments of the present specification, and as long as the functions and effects achieved are the same as or similar to those of the embodiments of the present specification, they should be covered within the protection scope of the embodiments of the present specification.
In this embodiment, each production operation and maintenance index may have a corresponding second preset threshold, and the second preset threshold of each production operation and maintenance index may be different, where the second preset threshold may be used to measure the resource consumption. In the case where the stock redundancy information set includes a plurality of execution plan segments, the average value of each production operation and maintenance index may be calculated, respectively, and the average value of each production operation and maintenance index may be used as the average resource consumption of the stock redundancy information set. Further, the average value of the calculated production operation and maintenance indexes can be respectively compared with a second preset threshold value of each production operation and maintenance index, so that the resource consumption is determined.
In this embodiment, if the stock redundancy information set includes only one execution plan segment, the value of each production operation and maintenance index of that execution plan segment may be compared directly with the second preset threshold of the corresponding production operation and maintenance index, so as to determine the level of resource consumption.
In this embodiment, the resource consumption may be considered high if one production operation and maintenance index is greater than the second preset threshold, so that the execution plan segment in the stock redundancy information set is added to the redundancy information set to be adjusted. Of course, the judging manner of the resource consumption is not limited to the above example, and it is also possible to consider that the resource consumption is high when more than half of the production operation and maintenance indexes are greater than the second preset threshold, and those skilled in the art may make other changes in light of the technical spirit of the embodiments of the present specification, but as long as the functions and effects achieved by the method are the same as or similar to those of the embodiments of the present specification, the method shall be covered in the protection scope of the embodiments of the present specification.
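The per-index comparison can be sketched as below. The index names in the usage note, the dictionary layout, and the `mode` parameter (which selects between the "any index exceeds" rule and the "more than half exceed" rule mentioned above) are assumptions made for this example.

```python
def average_metrics(fragments):
    """Per-index mean over the production O&M records of the given fragments."""
    names = fragments[0]["metrics"].keys()
    count = len(fragments)
    return {n: sum(f["metrics"][n] for f in fragments) / count for n in names}


def is_high_consumption(averages, thresholds, mode="any"):
    """Compare each averaged index with its own second preset threshold."""
    exceeded = sum(1 for name, value in averages.items() if value > thresholds[name])
    if mode == "any":
        return exceeded >= 1
    if mode == "majority":
        return exceeded > len(averages) / 2
    raise ValueError(f"unknown mode: {mode}")
```

For two stock fragments with `{"runtime_s": 100, "cpu_pct": 20}` and `{"runtime_s": 300, "cpu_pct": 40}`, the averages are 200.0 and 30.0. Against thresholds of 150 and 50, one of the two indexes is exceeded, so the set counts as high under the "any" rule but not under the "majority" rule. A set containing a single fragment is handled by the same code, since the average of one value is the value itself.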
In one embodiment, after the incremental redundancy information set and the stock redundancy information set are obtained, the method may further include: in the case that neither the incremental redundancy information set nor the stock redundancy information set is empty, determining the average resource consumption of the stock redundancy information set according to the production operation and maintenance information of each execution plan segment in the stock redundancy information set. In the case that the average resource consumption is greater than the second preset threshold, the execution plan segments in both the stock redundancy information set and the incremental redundancy information set are added to the redundancy information set to be adjusted, where the redundancy information set to be adjusted contains at least one redundant execution plan segment to be adjusted. In the case that the average resource consumption is less than or equal to the second preset threshold, the execution plan segments in the stock redundancy information set and the incremental redundancy information set may be removed. In the case that the incremental redundancy information set is not empty and the stock redundancy information set is empty, the execution plan segments in the incremental redundancy information set may be removed.
In this embodiment, since the incremental statement is not put into production and no operation and maintenance information is recorded, when the incremental redundancy information set is not empty, it is necessary to determine the resource consumption of the stock redundancy information set first, and if it is determined that the resource consumption of the stock redundancy information set is high (the average resource consumption is greater than the second preset threshold), the stock redundancy information set and the execution plan segment in the incremental redundancy information set corresponding to the stock redundancy information set may be added to the redundancy information set to be adjusted.
In this embodiment, if it is determined that the resource consumption of the stock redundancy information set is low (the average resource consumption is less than or equal to the second preset threshold), the execution plan segments in the stock redundancy information set need no adjustment, and neither do the execution plan segments in the corresponding incremental redundancy information set, so the execution plan segments in both the stock redundancy information set and the incremental redundancy information set can be removed.
In this embodiment, if it is determined that the incremental redundancy information set is not empty and the stock redundancy information set is empty, then, because the incremental statements have not been put into production and have no production operation and maintenance record, the execution plan segments in the incremental redundancy information set may be removed directly and left unprocessed for the time being.
In one embodiment, the target feature vector may include a plurality of feature data, where the feature data may include at least one of: an execution plan level, table names, predicates, an aggregation algorithm, a sorting mode, a projection mode, text processing features, an association mode, and the like. Of course, the above feature data is merely an example, and in some embodiments the target feature vector may include more or fewer feature data; for example, it may additionally include aggregation fields, partition columns, etc. The specific feature data may be determined according to the actual situation, and the embodiments of the present specification are not limited thereto.
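The three cases can be condensed into a single decision function, sketched below. The function name `select_for_adjustment` and the callable `is_high`, which stands in for the comparison of average resource consumption against the second preset threshold, are hypothetical names used only for this illustration.

```python
def select_for_adjustment(incremental, stock, is_high):
    """Return the fragments of one group that go into the to-be-adjusted set.

    incremental, stock: lists of fragments for the same group.
    is_high: callable taking the stock list and returning True when its
             average resource consumption exceeds the second threshold.
    """
    if not stock:
        # Incremental fragments alone have no production records: remove them.
        return []
    if not is_high(stock):
        # Low consumption: neither stock nor incremental fragments are adjusted.
        return []
    # High consumption: stock fragments, plus incremental ones if present.
    return stock + incremental
```

Because the incremental list is simply appended, the case "incremental empty, stock not empty" and the case "both not empty" share the same branch; `is_high` is only evaluated when stock records actually exist.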
As can be seen from the above description, a comparison between the redundant code identification method of the embodiments of the present specification (method 2) and the existing redundant code identification method (method 1) is shown in Table 1.
TABLE 1
Based on the same inventive concept, the embodiments of the present specification also provide a redundant code identification apparatus, as described in the following embodiments. Because the principle by which the redundant code identification apparatus solves the problem is similar to that of the redundant code identification method, the implementation of the apparatus can refer to the implementation of the method, and repeated description is omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated. Fig. 3 is a block diagram of a redundant code identification apparatus according to an embodiment of the present specification. As shown in Fig. 3, the apparatus may include a first acquisition module 301, a first determination module 302, a matching module 303, a second acquisition module 304, and a second determination module 305, which are described below.
A first obtaining module 301, configured to obtain an execution plan information set of the target source code; the execution plan information set comprises at least one execution plan segment corresponding to each structured query statement in the target source code;
a first determining module 302, configured to determine a target feature vector of each execution plan segment in the execution plan information set;
the matching module 303 may be configured to perform similarity matching on each execution plan segment by using a clustering algorithm based on the target feature vector of each execution plan segment, so as to obtain at least one group of similar execution plan segments that are successfully matched;
a second obtaining module 304, configured to obtain production operation and maintenance information of similar execution plan segments;
the second determining module 305 may be configured to determine, according to the production operation and maintenance information, a redundant execution plan segment to be adjusted in the at least one set of similar execution plan segments.
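The matching module relies on a clustering algorithm that this passage does not specify. Purely as a stand-in, the sketch below groups normalized feature vectors greedily by Euclidean distance to the first member of each group; the function name `group_similar`, the distance measure, and the `max_distance` parameter are assumptions, and any established clustering method could take its place.

```python
import math


def group_similar(vectors, max_distance):
    """Greedy grouping of vectors; returns lists of indices of similar vectors.

    A vector joins the first group whose first member lies within
    max_distance; otherwise it starts a new group.
    """
    groups = []
    for index, vec in enumerate(vectors):
        for group in groups:
            if math.dist(vectors[group[0]], vec) <= max_distance:
                group.append(index)
                break
        else:
            groups.append([index])
    # Only groups with at least two members count as a successful match.
    return [g for g in groups if len(g) >= 2]
```

For `vectors = [[0, 0], [0.1, 0], [1, 1], [0, 0.05]]` and `max_distance = 0.2`, the vectors at indices 0, 1 and 3 form one group, while index 2 has no near neighbor and is not reported.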
The embodiments of the present specification further provide an electronic device. Fig. 4 is a schematic structural diagram of an electronic device based on the redundant code identification method provided by the embodiments of the present specification; the electronic device may specifically include an input device 41, a processor 42, and a memory 43. The input device 41 may be used for inputting the target source code. The processor 42 may be configured to: obtain an execution plan information set of the target source code, where the execution plan information set comprises at least one execution plan segment corresponding to each structured query statement in the target source code; determine target feature vectors of all execution plan fragments in the execution plan information set; based on the target feature vectors of the execution plan fragments, perform similarity matching on the execution plan fragments by using a clustering algorithm to obtain at least one group of successfully matched similar execution plan fragments; acquire production operation and maintenance information of the similar execution plan fragments; and determine redundant execution plan fragments to be adjusted in the at least one group of similar execution plan fragments according to the production operation and maintenance information. The memory 43 may be used for storing the production operation and maintenance information, the redundant execution plan segments to be adjusted, and the like.
In this embodiment, the input device may be one of the main means for exchanging information between the user and the computer system. The input device may include a keyboard, mouse, camera, scanner, light pen, handwriting input board, voice input apparatus, etc.; the input device is used to input raw data, and the programs that process the data, into the computer. The input device may also obtain data transmitted from other modules, units, and devices. The processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, an embedded microcontroller, and so on. The memory may be a memory device used for storing information in modern information technology. The memory may exist at several levels: in a digital system, anything that can store binary data may be a memory; in an integrated circuit, a circuit that has a storage function but no physical form is also called a memory, such as a RAM or a FIFO; in a system, a storage device in physical form is also called a memory, such as a memory module or a TF card.
In this embodiment, the specific functions and effects of the electronic device may be understood with reference to the other embodiments, and are not described again herein.
In an embodiment of the present specification, there is further provided a computer storage medium based on the redundant code identification method, the computer storage medium storing computer program instructions which, when executed, may implement: acquiring an execution plan information set of the target source code, where the execution plan information set comprises at least one execution plan segment corresponding to each structured query statement in the target source code; determining target feature vectors of all execution plan fragments in the execution plan information set; based on the target feature vectors of the execution plan fragments, performing similarity matching on the execution plan fragments by using a clustering algorithm to obtain at least one group of successfully matched similar execution plan fragments; acquiring production operation and maintenance information of the similar execution plan fragments; and determining redundant execution plan fragments to be adjusted in the at least one group of similar execution plan fragments according to the production operation and maintenance information.
In the present embodiment, the storage medium includes, but is not limited to, a random access Memory (Random Access Memory, RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. The network communication unit may be an interface for performing network connection communication, which is set in accordance with a standard prescribed by a communication protocol.
In this embodiment, the functions and effects of the program instructions stored in the computer storage medium may be understood with reference to the other embodiments, and are not described again herein.
It will be apparent to those skilled in the art that the modules or steps of the embodiments described above may be implemented on a general purpose computing device, and may be concentrated on a single computing device or distributed across a network of computing devices. Alternatively, they may be implemented as program code executable by computing devices, so that they can be stored in a storage device and executed by the computing devices. In some cases, the steps shown or described may be performed in an order different from that described herein, or the modules or steps may be fabricated separately as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the embodiments of the present specification are not limited to any specific combination of hardware and software.
Although the present specification provides the method operation steps as described in the above embodiments or flowcharts, the method may include more or fewer operation steps based on routine or non-inventive effort. For steps that have no logically necessary causal relationship, the execution order is not limited to the order provided in the embodiments of the present specification. When the described methods are performed in an actual apparatus or end product, they may be performed sequentially or in parallel (e.g., in a parallel processor or multithreaded environment) according to the embodiments or the methods shown in the figures.
It is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and many applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the embodiments of the specification should, therefore, be determined not with reference to the above description, but instead should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
The above description covers only preferred embodiments of the present specification and is not intended to limit the embodiments of the present specification; those skilled in the art can make various modifications and variations to the embodiments. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the embodiments of the present specification should be included in the protection scope of the embodiments of the present specification.

Claims (9)

1. A method for identifying redundant codes, comprising:
acquiring an execution plan information set of the target source code; wherein the execution plan information set contains at least one execution plan segment corresponding to each structured query statement in the target source code;
determining target feature vectors of all execution plan fragments in the execution plan information set;
based on the target feature vectors of the execution plan fragments, performing similarity matching on the execution plan fragments by using a clustering algorithm to obtain at least one group of similar execution plan fragments successfully matched;
adding the at least one group of similar execution plan fragments successfully matched into a first redundant information set; determining a cost estimation mean value of each group of similar execution plan segments in the first redundant information set; removing each group of similar execution plan fragments with the cost estimation mean value smaller than or equal to a first preset threshold value from the first redundant information set to obtain a second redundant information set; classifying each group of similar execution plan fragments in the second redundant information set according to the attribute of the structured query statement corresponding to the execution plan fragments to obtain a plurality of groups of incremental redundant information sets and stock redundant information sets;
acquiring production operation and maintenance information of the similar execution plan fragments;
determining redundant execution plan fragments to be adjusted in at least one group of similar execution plan fragments according to the production operation and maintenance information; wherein the determining includes: under the condition that the incremental redundancy information set corresponding to a group of similar execution plan fragments is empty and the stock redundancy information set is not empty, determining the average resource consumption of the stock redundancy information set according to the production operation and maintenance information of each execution plan fragment in the stock redundancy information set; adding the execution plan segment in the stock redundancy information set into a redundancy information set to be adjusted under the condition that the average resource consumption is larger than a second preset threshold value; and removing the execution plan fragments in the stock redundancy information set under the condition that the average resource consumption is less than or equal to the second preset threshold value.
2. The method of claim 1, wherein obtaining the execution plan information set for the target source code comprises:
acquiring the target source code;
determining a structured query statement information set corresponding to the target source code;
acquiring database table statistical information corresponding to the target source code;
generating an execution plan of each structured query statement according to the database table statistical information and the structured query statement information set;
dividing the execution plan of each structured query statement according to the STAGE to obtain a plurality of execution plan fragments;
and analyzing the plurality of execution plan fragments according to a tree structure to obtain an execution plan information set of the target source code.
3. The method of claim 2, wherein determining the structured query statement information set corresponding to the target source code comprises:
dividing and shaping the target source code to obtain a plurality of structured query statements;
marking attributes of the plurality of structured query statements; wherein the attributes include: increment attribute and stock attribute;
based on the attributes of the plurality of structured query statements, obtaining a structured query statement information set; wherein the structured query statement information set contains the attribute and feature information of each structured query statement, and the feature information includes: the path of the structured query statement in the target source code, the original structured query statement, and the job, job group and application to which the structured query statement belongs.
4. The method of claim 1, wherein determining the target feature vector for each execution plan segment in the set of execution plan information comprises:
extracting the characteristic data of each execution plan segment by using a rule matching method to obtain a characteristic information set of each execution plan segment;
preprocessing the characteristic information set of each execution plan segment to obtain an initial characteristic vector of each execution plan segment;
and carrying out normalization processing on the initial feature vectors of the execution plan fragments to obtain target feature vectors of the execution plan fragments.
5. The method of claim 1, further comprising, after obtaining the incremental redundancy information set and the stock redundancy information set:
under the condition that the incremental redundancy information set corresponding to a group of similar execution plan fragments is not empty and the stock redundancy information set is not empty, determining the average resource consumption of the stock redundancy information set according to the production operation and maintenance information of each execution plan fragment in the stock redundancy information set;
adding the execution plan segments in the stock redundancy information set and the incremental redundancy information set into the redundancy information set to be adjusted under the condition that the average resource consumption is larger than a second preset threshold value; wherein the redundancy information set to be adjusted comprises at least one redundancy execution plan segment to be adjusted;
removing the execution plan segments in the stock redundancy information set and the incremental redundancy information set under the condition that the average resource consumption is less than or equal to a second preset threshold;
and removing the execution plan fragment in the incremental redundancy information set in the case that the incremental redundancy information set is determined to be non-empty and the stock redundancy information set is determined to be empty.
6. The method of claim 1, wherein the feature data in the target feature vector comprises at least one of: an execution plan level, table names, predicates, an aggregation algorithm, a sorting mode, a projection mode, text processing features and an association mode.
7. An apparatus for identifying redundant codes, comprising:
the first acquisition module is used for acquiring an execution plan information set of the target source code; wherein the execution plan information set contains at least one execution plan segment corresponding to each structured query statement in the target source code;
the first determining module is used for determining target feature vectors of all the execution plan fragments in the execution plan information set;
the matching module is used for carrying out similarity matching on each execution plan segment by utilizing a clustering algorithm based on the target feature vector of each execution plan segment to obtain at least one group of similar execution plan segments successfully matched;
the redundant information set acquisition module is used for adding the at least one group of similar execution plan fragments which are successfully matched into a first redundant information set; determining a cost estimation mean value of each group of similar execution plan segments in the first redundant information set; removing each group of similar execution plan fragments with the cost estimation mean value smaller than or equal to a first preset threshold value from the first redundant information set to obtain a second redundant information set; classifying each group of similar execution plan fragments in the second redundant information set according to the attribute of the structured query statement corresponding to the execution plan fragments to obtain a plurality of groups of incremental redundant information sets and stock redundant information sets;
the second acquisition module is used for acquiring the production operation and maintenance information of the similar execution plan fragments;
the second determining module is used for determining redundant execution plan fragments to be adjusted in at least one group of similar execution plan fragments according to the production operation and maintenance information; wherein the determining includes: under the condition that the incremental redundancy information set corresponding to a group of similar execution plan fragments is empty and the stock redundancy information set is not empty, determining the average resource consumption of the stock redundancy information set according to the production operation and maintenance information of each execution plan fragment in the stock redundancy information set; adding the execution plan segment in the stock redundancy information set into a redundancy information set to be adjusted under the condition that the average resource consumption is larger than a second preset threshold value; and removing the execution plan fragments in the stock redundancy information set under the condition that the average resource consumption is less than or equal to the second preset threshold value.
8. An identification device for redundant code, comprising a processor and a memory for storing processor-executable instructions, wherein the processor, when executing the instructions, implements the steps of the method of any one of claims 1 to 6.
9. A computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 6.
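For illustration only, the filtering and stock-set decision recited for the redundant information set acquisition module and the second determining module might be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the `PlanFragment` class, its field names, and the function names are hypothetical, and the boolean `is_incremental` flag stands in for the unspecified "attribute of the structured query statement". Only the case where the incremental set is empty and the stock set is not empty is covered.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional, Tuple


@dataclass
class PlanFragment:
    """Hypothetical record for one execution plan fragment."""
    sql_id: str
    cost_estimate: float               # cost estimate of the fragment
    is_incremental: bool               # assumed attribute of the SQL statement
    resource_consumption: float = 0.0  # from production O&M information


Group = List[PlanFragment]


def build_second_set(first_set: List[Group], first_threshold: float) -> List[Group]:
    """Keep only groups whose mean cost estimate exceeds the first threshold."""
    return [g for g in first_set
            if mean(f.cost_estimate for f in g) > first_threshold]


def split_group(group: Group) -> Tuple[Group, Group]:
    """Classify one group into its incremental and stock sets."""
    incremental = [f for f in group if f.is_incremental]
    stock = [f for f in group if not f.is_incremental]
    return incremental, stock


def stock_only_decision(incremental: Group, stock: Group,
                        second_threshold: float) -> Optional[Group]:
    """Return the fragments to adjust when only the stock set is populated.

    Returns None for the cases this sketch does not cover.
    """
    if incremental or not stock:
        return None
    average = mean(f.resource_consumption for f in stock)
    # Above the threshold: every stock fragment goes to the to-be-adjusted set.
    # Otherwise the stock fragments are removed, so nothing is adjusted.
    return list(stock) if average > second_threshold else []


# Example data (made up): group_a is expensive, group_b is cheap.
group_a = [PlanFragment("A1", 10.0, False, 100.0),
           PlanFragment("A2", 20.0, False, 300.0)]
group_b = [PlanFragment("B1", 1.0, False, 5.0),
           PlanFragment("B2", 2.0, False, 5.0)]

second_set = build_second_set([group_a, group_b], first_threshold=5.0)
incremental_a, stock_a = split_group(group_a)
to_adjust = stock_only_decision(incremental_a, stock_a, second_threshold=150.0)
```

Both comparisons are strict (`>`), which mirrors the claim wording: groups or sets at exactly the threshold are removed rather than kept.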
CN202110439936.4A 2021-04-23 2021-04-23 Redundant code identification method, device and equipment Active CN113064597B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110439936.4A CN113064597B (en) 2021-04-23 2021-04-23 Redundant code identification method, device and equipment

Publications (2)

Publication Number Publication Date
CN113064597A CN113064597A (en) 2021-07-02
CN113064597B CN113064597B (en) 2024-03-08

Family

ID=76567592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110439936.4A Active CN113064597B (en) 2021-04-23 2021-04-23 Redundant code identification method, device and equipment

Country Status (1)

Country Link
CN (1) CN113064597B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106547683A (en) * 2015-09-22 2017-03-29 阿里巴巴集团控股有限公司 A kind of redundant code detection method and device
CN110502443A (en) * 2019-08-22 2019-11-26 深圳前海环融联易信息科技服务有限公司 Redundant code detection method, detection module, electronic equipment and computer storage medium
CN111290784A (en) * 2020-01-21 2020-06-16 北京航空航天大学 Program source code similarity detection method suitable for large-scale samples

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200183668A1 (en) * 2018-12-05 2020-06-11 Bank Of America Corporation System for code analysis by stacked denoising autoencoders

Also Published As

Publication number Publication date
CN113064597A (en) 2021-07-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant