CN117435467A - Code processing method, device, equipment, storage medium and product - Google Patents

Code processing method, device, equipment, storage medium and product

Info

Publication number
CN117435467A
Authority
CN
China
Prior art keywords
code
sample
objects
prompt
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311009107.8A
Other languages
Chinese (zh)
Inventor
薛恩鹏
王乾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311009107.8A priority Critical patent/CN117435467A/en
Publication of CN117435467A publication Critical patent/CN117435467A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/36 - Preventing errors by testing or debugging software
    • G06F 11/3604 - Software analysis for verifying properties of programs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a code processing method, apparatus, device, storage medium and product, belonging to the technical field of artificial intelligence. After a first code object and at least one target code block are acquired, the method may generate code reference prompt information based on the first code object and each second code object in each target code block. The code reference prompt information is used to constrain the output of a code analysis model, so that the output can indicate whether the first code object is referenced by any second code object, i.e., whether the first code object is referenced by any target code block. Based on the output of the code analysis model, it can be determined whether the first code object is a useless code object, thereby realizing automatic detection of useless code and achieving the technical purposes of reducing the cost of useless-code detection and improving its efficiency.

Description

Code processing method, device, equipment, storage medium and product
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a code processing method, apparatus, device, storage medium, and product.
Background
In the related art, automatic detection of useless code is difficult to realize; that is, the related art generally relies on manual analysis of the code to pick out useless code. Useless-code analysis is a core step in improving code compactness and stability and in reducing the package size of a software development kit. Relying on manual detection of useless code has obvious drawbacks such as high cost, unstable detection results, and low detection efficiency.
Disclosure of Invention
The embodiment of the application provides a code processing method, a device, equipment, a storage medium and a product, which can realize automatic detection of useless codes, remarkably improve the detection efficiency and also improve the stability of the detection quality.
According to an aspect of the embodiments of the present application, there is provided a code processing method, including:
acquiring a first code object and at least one target code block;
extracting a second code object in each target code block, and generating code reference prompt information according to the first code object and each second code object, wherein the code reference prompt information is used for instructing a code analysis model to output a code reference analysis result, and the code reference analysis result is used for indicating whether the first code object is referenced by the at least one target code block;
inputting the code reference prompt information, the code corresponding to the first code object and the code corresponding to each second code object into the code analysis model to obtain the code reference analysis result;
determining the first code object as a useless code object in the case that the code reference analysis result indicates that the first code object is not referenced by any of the target code blocks;
the code analysis model is a model obtained by training a large language model based on prompt learning.
According to an aspect of the embodiments of the present application, there is provided a code processing apparatus, the apparatus including:
the code information extraction module is used for acquiring a first code object and at least one target code block;
the prompt information generation module is used for extracting a second code object in each target code block and generating code reference prompt information according to the first code object and each second code object, wherein the code reference prompt information is used for instructing a code analysis model to output a code reference analysis result, and the code reference analysis result is used for indicating whether the first code object is referenced by the at least one target code block;
the code processing module is used for inputting the code reference prompt information, the code corresponding to the first code object and the code corresponding to each second code object into the code analysis model to obtain the code reference analysis result, and determining the first code object as a useless code object in the case that the code reference analysis result indicates that the first code object is not referenced by any of the target code blocks;
the code analysis model is a model obtained by training a large language model based on prompt learning.
According to an aspect of the embodiments of the present application, there is provided a computer apparatus including a processor and a memory, where at least one instruction, at least one program, a code set, or an instruction set is stored, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the code processing method described above.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement the above-described code processing method.
According to one aspect of embodiments of the present application, there is provided a computer program product comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from a computer-readable storage medium, and the processor executes the computer instructions so that the computer device executes to implement the above-described code processing method.
The technical scheme provided by the embodiment of the application can bring the following beneficial effects:
The embodiments of the present application provide a code processing method that, after acquiring a first code object and at least one target code block, can generate code reference prompt information based on the first code object and each second code object in each target code block. The code reference prompt information is used to constrain the output of a code analysis model, so that the output can indicate whether the first code object is referenced by any second code object, i.e., whether the first code object is referenced by any target code block. Based on the output of the code analysis model, it can be determined whether the first code object is a useless code object, thereby realizing automatic detection of useless code and achieving the technical purposes of reducing the cost of useless-code detection and improving its efficiency.
The code analysis model is obtained by training a large language model based on prompt learning. A large language model is a text processing model obtained through pre-training on a large corpus and contains rich textual knowledge; prompt learning can guide the large language model toward the specific problem of code reference analysis, so that the large language model fine-tuned through prompt learning, i.e., the code analysis model, has a strong ability to analyze code reference relationships. Pairing a large language model with prompt learning can significantly raise the upper limit of the code analysis model's ability to analyze code reference relationships and ensure the accuracy of useless-code detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application execution environment provided by one embodiment of the present application;
FIG. 2 is a flow chart of a code processing method provided by one embodiment of the present application;
FIG. 3 is a flow chart of a method for determining an object code block according to one embodiment of the present application;
FIG. 4 is a schematic diagram of content of a preset file according to one embodiment of the present application;
FIG. 5 is a schematic diagram of the semantic distances of code blocks found to be semantically adjacent to a first code object according to one embodiment of the present application;
FIG. 6 is a schematic diagram of a code analysis model training method provided by one embodiment of the present application;
FIG. 7 is a schematic diagram of sample code reference prompt information provided by one embodiment of the present application;
FIG. 8 is a schematic diagram of the content of sample code reference prompt information provided by one embodiment of the present application;
FIG. 9 is a schematic diagram of file upload provided in one embodiment of the present application;
FIG. 10 is a schematic diagram of a document information interface provided in one embodiment of the present application;
FIG. 11 is a schematic diagram of the file content formed after method names are exported according to one embodiment of the present application;
FIG. 12 is a schematic diagram of a character string provided in one embodiment of the present application;
FIG. 13 is a schematic diagram of an overall process of a packet reduction scheme according to one embodiment of the present application;
FIG. 14 is a block diagram of a code processing apparatus provided in one embodiment of the present application;
Fig. 15 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
Before describing the method embodiments provided herein, related terms that may be involved in the method embodiments of the present application are briefly explained for ease of understanding by those skilled in the art.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and more. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
Deep learning: the concept of deep learning originates from the study of artificial neural networks. A multi-layer perceptron with multiple hidden layers is a typical deep learning structure. Deep learning forms more abstract high-level representations of attribute categories or features by combining low-level features, so as to discover distributed feature representations of data. Deep learning is a sub-field of machine learning.
Cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software, and network within a wide area network or a local area network to realize the computation, storage, processing, and sharing of data. Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, and the like based on the cloud computing business model; it can form a resource pool that is used flexibly and conveniently on demand. Cloud computing technology will become an important support. Background services of technical network systems, such as video websites, picture websites, and other portals, require large amounts of computing and storage resources. With the rapid development and application of the internet industry, each item may have its own identification mark in the future, which needs to be transmitted to a background system for logical processing; data at different levels will be processed separately, and all kinds of industry data require strong backend system support, which can only be realized through cloud computing.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics.
LLM: a Large Language Model (LLM) is a computer model capable of processing and generating natural language. It represents a significant advance in the field of artificial intelligence and is expected to transform the field through the knowledge it learns. By learning the statistical regularities and semantic information of language data, an LLM can predict the next word or sentence, and as the input data set and parameter space continue to expand, its capability improves accordingly. It is used in a variety of application fields such as robotics, machine learning, machine translation, speech recognition, and image processing; a model extended to multiple modalities is called a Multi-modal Large Language Model (MLLM).
Instruction Tuning: fine-tuning a model over a collection of tasks, with instructions generated separately for each task, and then evaluating its generalization capability on specific tasks. It is typically carried out on a large number of public NLP task datasets to stimulate the understanding capability of the language model, by giving more explicit instructions for the model to understand and give correct feedback.
Prompting (prompt learning), a type of learning method in machine learning: without significantly changing the structure and parameters of the pre-trained language model, "prompt information" is added to the input as an information enhancement, which greatly improves the effect of the model. The prompt can be regarded as an instruction for the task and also as a reuse of the pre-training objectives; in essence, it is a form of parameter-efficient training.
Transformer: a neural network that learns context, and thus meaning, by extracting relationships in sequence data. The Transformer model employs an evolving set of mathematical techniques, known as attention or self-attention, to detect the subtle ways in which even distant data elements in a series influence and depend on each other. The Transformer can be used as the backbone for building large language models.
Code reference relationship: refers herein to the case where one piece of code references other code, for example when one function calls another function or calls a function in an external library.
Useless code: typically a function, class, variable, or the like in a program that, although present in the code, is never called or used during program execution. This is often caused by programmer error, code refactoring, and the like.
SDK: short for software development kit. In the embodiments of the present application, it mainly refers to a dynamic link library on which a program depends and which exposes a number of interfaces for external calls.
Package reduction: reducing the volume of a program and improving its performance mainly by removing useless code. In the embodiments of the present application, it may refer to reducing the data size of an SDK.
In the related art, automatic detection of useless code is difficult to realize; that is, the related art generally relies on manual analysis of the code to pick out useless code. Useless-code analysis is a core step in improving code compactness and stability and in reducing the package size of a software development kit. Relying on manual detection of useless code has obvious drawbacks such as high cost, unstable detection results, and low detection efficiency.
Because useless-code analysis is difficult to automate, package reduction of a software development kit in the related art is mainly done manually: each interface in the SDK is checked by hand, and whether the interface is useless code is judged by manually examining the related calling logic in the code kernel; if it is useless code, the interface can be deleted to achieve package reduction. However, manual package reduction has the following obvious problems:
High cost: manual package reduction places high demands on headcount and personnel capability, so the time and financial costs are correspondingly high. Each code submission by a software engineer may produce new useless code, so the package-reduction demand is clearly high, and the time and financial costs are high.
High error rate: relying purely on manual package reduction carries a high probability of missed and false detections and produces a great deal of repeated work, so package-reduction efficiency is low.
In view of this, the embodiments of the present application provide a code processing method. After acquiring a first code object and at least one target code block, the method may generate code reference prompt information based on the first code object and each second code object in each target code block. The code reference prompt information is used to constrain the output of a code analysis model, so that the output can indicate whether the first code object is referenced by any second code object, i.e., whether the first code object is referenced by any target code block. Based on the output of the code analysis model, it can be determined whether the first code object is a useless code object, thereby realizing automatic detection of useless code and achieving the technical purposes of reducing the cost of useless-code detection and improving its efficiency.
The code analysis model is obtained by training a large language model based on prompt learning. A large language model is a text processing model obtained through pre-training on a large corpus and contains rich textual knowledge; prompt learning can guide the large language model toward the specific problem of code reference analysis, so that the large language model fine-tuned through prompt learning, i.e., the code analysis model, has a strong ability to analyze code reference relationships. Pairing a large language model with prompt learning can significantly raise the upper limit of the code analysis model's ability to analyze code reference relationships and ensure the accuracy of useless-code detection.
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of an application running environment provided in one embodiment of the present application is shown. The application execution environment may include: a terminal 10 and a server 20.
The terminal 10 includes, but is not limited to, a mobile phone, a computer, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, a game console, an electronic book reader, a multimedia playback device, a wearable device, and the like. A client of an application program may be installed in the terminal 10.
In the embodiment of the present application, the application may be any application capable of providing a code processing service. The server 20 is used to provide background services for clients of applications in the terminal 10. For example, the server 20 may be a background server of the application program described above. The server 20 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (content delivery network), internet of vehicles, and basic cloud computing services such as big data and an artificial intelligence platform. Alternatively, the server 20 provides background services for applications in a plurality of terminals 10 at the same time.
Alternatively, the terminal 10 and the server 20 may communicate with each other via the network 30. The terminal 10 and the server 20 may be directly or indirectly connected through wired or wireless communication, and the present application is not limited thereto.
Referring to fig. 2, a flowchart of a code processing method according to an embodiment of the present application is shown. The method can be applied to a computer device, wherein the computer device is an electronic device with data computing and processing capabilities, and the execution subject of each step can be the server 20 in the application running environment shown in fig. 1. The method may comprise the steps of:
Step 201, a first code object and at least one target code block are obtained.
The first code object in the embodiments of the present application may be understood as a code object for which it needs to be determined whether it is referenced by other code objects. The content of the first code object is not limited in this embodiment; it may be a variable, an interface, or a method in a code file. Taking the package-reduction scenario as an example, it needs to be determined whether the target SDK contains useless interfaces, where a useless interface is an interface that is not called by any method of the software program referencing the target SDK. Such an interface may be treated as the first code object, and the at least one target code block is derived from the methods of the software program referencing the target SDK.
The embodiments of the present application do not limit the specific content of a code block; a code block can be understood as a body of code formed by multiple pieces of code in a code file. For example, a code file may be segmented, with the methods in each segment forming code blocks.
In order to reduce the number of target code blocks, so that the code analysis model does not need to judge for every code object in the code file whether it references the first code object, the embodiments of the present application can exclude code objects whose semantic similarity to the first code object is too low, because such code objects are unlikely to reference the first code object. That is, by screening the code blocks in the code file with relatively high semantic similarity as the target code blocks, a preliminary screening is realized in the useless-code detection scenario, reducing the volume of input data for the subsequent code analysis model. If the first code object belongs to a first code file and a second code file is a code file that references or calls the first code file, the target code blocks obtained after preliminary screening are the code blocks of the second code file whose semantic similarity to the first code object meets a preset requirement. The embodiments of the present application do not limit the preset requirement; for example, code blocks whose semantic distance to the first code object is within a preset distance threshold may be defined as target code blocks. Of course, the embodiments of the present application do not limit the preset distance threshold, which may be set independently according to the actual situation.
In one embodiment, the first code object is any code object in a first code file, the target code block is a code block in a second code file, and the first code file is called by the second code file; that is, useless-code detection here means determining whether the first code object is called by code in the second code file, and if not, the first code object is a useless code object. Referring to fig. 3, a flowchart of a method for determining target code blocks in an embodiment of the present application is shown, where the method includes:
s301, extracting semantic information corresponding to each code block in the second code file based on the semantic information extraction model, and constructing a semantic information base according to semantic information extraction results.
The embodiment of the application does not limit the semantic information extraction model, for example: a bag of words model, a word embedding model, a character level vectorization model, and a large language model can be used as the semantic information extraction model.
In one embodiment, a mapping relationship between each code block in the second code file and its file name and line numbers in the file may first be determined, and the semantic information base is then constructed based on the mapping relationship and the semantic information corresponding to each code block.
In an exemplary embodiment, the second code file may be traversed by writing a script; the database storing the second code file is referred to simply as the code library. For example, the relative paths of all files with the suffix .java in the code library may be traversed and saved to a preset list. The files corresponding to the addresses in the preset list are then traversed in sequence, the file code is traversed line by line, method definitions are located in the file code by means such as regular expressions, and the code blocks in the second code file are extracted. Specifically, the starting line of a method may be identified by keywords such as public, private, or protected, and the ending line may be determined by analyzing the symmetry of the left and right braces with a stack-based matching method.
Next, the "File:Line" column is obtained by combining the file's relative path with the starting line number. The position of the left parenthesis of the parameter list on the starting line is then analyzed; the first complete word to its left is the method name, and the "MethodName" column can be obtained with a regular expression. The complete code between the start and end lines is then extracted, leading and trailing whitespace characters are removed, and only the necessary "\n" symbols are retained, thereby simplifying the method body and obtaining the "MethodBody" column. The "MethodBody" column may retain the comments in the code; because a large language model is subsequently used to analyze the code, the comments can make reference-relationship identification more effective, a capability that traditional code analysis based on abstract syntax trees does not have. Finally, the three columns of data are saved in a preset file; the embodiments of the present application do not limit the file type of the preset file. Referring to fig. 4, which shows a schematic diagram of the content of the preset file, the preset file may be a CSV file, a general and relatively simple file format widely used to transfer tabular data between programs.
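As an illustrative sketch of the extraction step just described (this is not the literal script of the embodiment; the keyword list, the regular expressions, and the simplified brace matching are assumptions), a Python script could walk the code library, locate method definitions in .java files, and write the three columns to a CSV file roughly as follows:

```python
import csv
import os
import re

# Assumed patterns: a method definition line starts with an access keyword and contains "(".
METHOD_START = re.compile(r'^\s*(public|private|protected)\b.*\(')
METHOD_NAME = re.compile(r'(\w+)\s*\(')  # first complete word to the left of the parameter list

def extract_methods(code_root, out_csv="methods.csv"):
    rows = []
    for root, _, files in os.walk(code_root):                      # traverse the code library
        for name in files:
            if not name.endswith(".java"):
                continue
            path = os.path.join(root, name)
            rel_path = os.path.relpath(path, code_root)
            with open(path, encoding="utf-8") as f:
                lines = f.readlines()
            i = 0
            while i < len(lines):
                if not METHOD_START.search(lines[i]):
                    i += 1
                    continue
                start, depth, seen_brace, body = i, 0, False, []
                j = i
                while j < len(lines):
                    depth += lines[j].count("{") - lines[j].count("}")
                    seen_brace = seen_brace or "{" in lines[j]
                    body.append(lines[j].strip())                  # keep comments, strip blanks
                    if seen_brace and depth == 0:
                        break                                      # matching closing brace found
                    if not seen_brace and lines[j].rstrip().endswith(";"):
                        break                                      # declaration without a body
                    j += 1
                m = METHOD_NAME.search(lines[start])
                rows.append({"File:Line": f"{rel_path}:{start + 1}",
                             "MethodName": m.group(1) if m else "",
                             "MethodBody": "\n".join(body)})
                i = j + 1
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["File:Line", "MethodName", "MethodBody"])
        writer.writeheader()
        writer.writerows(rows)
```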
On the basis of the preset file, the semantic information of each code block recorded in the preset file can be extracted to form a semantic vector. For example, a large language model may be selected for extracting semantic information to obtain more accurate extraction results. With the rapid development of technology, various types of large language models are evolving very quickly; the code analysis model in the embodiments of the present application belongs to this type, and may be a language model whose backbone is built on the Transformer architecture and whose number of parameters is larger than a preset value, which may be determined by a person skilled in the art. In an exemplary embodiment, a QPilot model, a large language model provided in the related art, may be used as the semantic information extraction model. Specifically, the model may be called with the parameter text-embedding-ada-002 to output a 1536-dimensional vector array for an input code block; this vector array is the semantic information corresponding to the code block and may be recorded in the preset file to obtain an updated preset file, whose content may refer to fig. 4. Fig. 4 shows the updated preset file.
To facilitate preliminary screening of target code blocks, the embodiments of the present application may store the content of the preset file in a database to form a semantic information base. Screening the target code blocks requires measuring the correlation or similarity between the semantic information of the first code object and the semantic information of each code block in the preset file; the most commonly used measures include Euclidean distance, cosine similarity, and dot product. With a traditional database, screening target code blocks requires a traversal comparison of every code block in the database, which is inefficient when the number of code blocks is large. In the embodiments of the present application, a vector database may be selected as the semantic information base; its search efficiency can be improved through approximate nearest-neighbor search, which offers excellent advantages in semantic search scenarios. The embodiments of the present application are not limited to a particular vector database; for example, Chroma may be used. Chroma is an open-source embedded database with notable speed advantages in both the index-building and application stages. The first three columns of the preset file in fig. 4 may be read, the "File:Line" data used as the primary key identifier, the first three columns stored completely in the metadata, and the fourth column used directly as the vector; the preset file is thus stored in Chroma, which organizes the storage format of the data reasonably and persists the data to disk, facilitating subsequent repeated queries.
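The following is a minimal sketch of how the rows of the preset file could be embedded and stored in a Chroma collection as described above. The get_embedding helper is hypothetical (it stands for a call to the semantic information extraction model, e.g., one returning a 1536-dimensional vector), and the Chroma API shown is the one exposed by recent chromadb releases and may differ across versions:

```python
import csv
import chromadb

def get_embedding(text):
    """Hypothetical wrapper around the semantic information extraction model
    (e.g., an embedding endpoint that returns a 1536-dimensional vector)."""
    raise NotImplementedError("replace with the actual embedding call")

def build_semantic_base(preset_csv="methods.csv", persist_dir="./code_vectors"):
    client = chromadb.PersistentClient(path=persist_dir)        # persists to disk for repeated queries
    collection = client.get_or_create_collection("second_code_file")
    with open(preset_csv, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            collection.add(
                ids=[row["File:Line"]],                          # File:Line as the primary key
                embeddings=[get_embedding(row["MethodBody"])],
                metadatas=[{"MethodName": row["MethodName"],
                            "MethodBody": row["MethodBody"]}],
            )
    return collection
```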
In the embodiments of the present application, the preset file shown in fig. 4 is built based on the second code file, and the whole process of building the semantic information base can then be fully automated by writing a script.
S302, extracting semantic information corresponding to the first code object based on the semantic information extraction model to obtain target semantic information, and searching the semantic information base for code blocks whose semantic distance from the target semantic information is within a preset range to obtain the at least one target code block.
When querying the vector database, the first code object is converted, using the same semantic information extraction model, into a vector of the same dimension, i.e., the target semantic information. To improve efficiency, the semantic information base may employ K-nearest-neighbor search to obtain the at least one target code block; referring to fig. 5, which shows the semantic distances of the code blocks found to be semantically adjacent to the first code object. The distances field is a distance array sorted from small to large; the smaller the distance, the stronger the correlation. This search method reduces the amount of computation and improves query efficiency. K-Nearest Neighbor (KNN) search is a classical algorithm in the field of machine learning that achieves fast retrieval based on the proximity of data in a feature space.
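A query sketch following the K-nearest-neighbor retrieval described above; the value of k and the distance threshold used for the preliminary screening are assumptions, and get_embedding is the hypothetical helper from the previous sketch:

```python
def find_target_code_blocks(collection, first_code_object_code, k=5, max_distance=0.35):
    """Return up to k semantically closest code blocks whose distance is within the threshold."""
    query_vec = get_embedding(first_code_object_code)            # same model, same dimension
    result = collection.query(query_embeddings=[query_vec], n_results=k)
    targets = []
    for ident, distance, meta in zip(result["ids"][0],
                                     result["distances"][0],     # sorted from small to large
                                     result["metadatas"][0]):
        if distance <= max_distance:                             # preliminary screening by semantic distance
            targets.append({"File:Line": ident, **meta})
    return targets
```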
In step 202, a second code object in each of the target code blocks is extracted, and code reference prompt information is generated according to the first code object and each second code object, where the code reference prompt information is used to instruct a code analysis model to output a code reference analysis result, and the code reference analysis result is used to indicate whether the first code object is referenced by the at least one target code block.
The code objects in the target code blocks are the second code objects in the embodiments of the present application. The code reference prompt information is used to instruct the code analysis model to output an analysis result of whether the first code object is referenced by any second code object, which is equivalent to outputting whether the first code object is referenced by the at least one target code block. A reference in the embodiments of the present application may be understood as use or invocation.
In step 203, the code reference prompt information, the code corresponding to the first code object, and the code corresponding to each second code object are input into the code analysis model to obtain the code reference analysis result. The code analysis model is obtained by training a large language model based on prompt learning.
In the embodiments of the present application, in order to make full use of the capability of the large language model, its intrinsic capability can be stimulated with a prompt-learning scheme. The core idea is to adapt the model to the task through small-scale parameter adjustment by constructing prompt information, so as to fully use the large language model for the specific code analysis service. The prompt information serves as information-enhancing data; its purpose is to make clear to the large model what task it needs to perform and what content to output, i.e., to reuse the objectives and parameters used by the large language model in the pre-training stage and to freeze some parameters and layers on that basis. By freezing some model parameters, hardware computing and storage resources are saved, the parameter-tuned large language model can be deployed in actual business scenarios, i.e., applied to code reference analysis, and at the same time modeling cost is reduced and modeling efficiency is improved. The training method of the code analysis model provided by the embodiments of the present application is described below.
Step 204, determining the first code object as a useless code object when the code reference analysis result indicates that the first code object is not referenced by any target code block.
Of course, if the code reference analysis result indicates that the first code object is referenced by any of the target code blocks, the first code object is a useful code object. When the first code object is determined to be a useless code object, the first code object may be deleted from the first code file. In this way, the volume of the first code file can be reduced, achieving the package-reduction effect for the software development kit generated based on the first code file.
Referring to fig. 6, a schematic diagram of a code analysis model training method is shown. The code analysis model is obtained by training the following method:
s601, acquiring sample code reference prompt information, wherein the sample code reference prompt information comprises a sample prompt question and a sample prompt answer, the sample prompt question is used for indicating a large language model to output a sample code reference analysis result, the sample code reference analysis result is used for indicating whether a sample first code object in the sample code prompt information is referenced by at least one sample second code object in the sample code prompt information, and the sample prompt answer is used for indicating a reference relation between the sample first code object and the at least one sample second code object.
In the embodiments of the present application, the sample first code object is the code object of the referenced party in the reference-relationship analysis, and the sample second code object is the code object of the possible referencing party. Both the sample first code object and the sample second code object are recorded in the sample code reference prompt information. Referring to FIG. 7, which shows a schematic diagram of sample code reference prompt information: the sample prompt question may be entered in the prompt field, and the sample prompt answer in the completion field. The sample prompt question may include the method name A of the first code object and the method name B of the second code object, asking whether the method corresponding to method name A is referenced by the method corresponding to method name B. The sample prompt answer may simply be YES or NO to indicate whether a reference relationship exists. This simple and standardized answer format makes it easy to use the return value to confirm whether a reference relationship exists.
Referring to FIG. 8, which shows the content of sample code reference prompt information. Based on the sample code reference prompt information, the large language model can quickly determine which methods need their reference relationship analyzed and how to return the analysis result.
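Assuming the prompt/completion layout suggested by fig. 7 and fig. 8 (the exact wording of the question template below is an assumption, not the text of the figures), a training sample could be assembled as one JSON line per method pair:

```python
import json

def build_sample(method_a, code_a, method_b, code_b, referenced):
    """One fine-tuning sample: is method A referenced by method B? The answer is YES or NO."""
    prompt = (f"Method A is named {method_a}:\n{code_a}\n\n"
              f"Method B is named {method_b}:\n{code_b}\n\n"
              f"Is method {method_a} referenced by method {method_b}? Answer YES or NO.")
    return {"prompt": prompt, "completion": "YES" if referenced else "NO"}

def write_training_file(samples, path="code_reference_samples.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for sample in samples:
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```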
In an exemplary embodiment, a sample third code object and at least one sample code block may be acquired, where the semantic distance between each sample code block and the sample third code object is within a preset range; at least one sample fourth code object is obtained from the code objects extracted from each sample code block; and, in the case that the sample third code object is called by any of the sample fourth code objects, sample code reference prompt information whose sample prompt answer is a positive answer is generated based on the sample third code object and the at least one sample fourth code object. The embodiments of the present application do not limit how the reference relationship between the sample third code object and a sample fourth code object is analyzed; it may be analyzed manually.
In the case that the sample third code object is called by any of the sample fourth code objects, all sample fifth code objects in the at least one sample code block are determined, where a sample fifth code object is a sample fourth code object that references the sample third code object; all sample fifth code objects are deleted from the at least one sample fourth code object; and sample code reference prompt information whose sample prompt answer is a negative answer is generated based on the sample third code object and the deletion result. Of course, some non-existent methods may also be generated randomly, and sample code reference prompt information whose sample prompt answer is a negative answer may be generated based on these methods and the sample third code object.
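A sketch of the positive/negative sample construction described above, reusing build_sample from the previous sketch. It assumes the set of referencing methods (the sample fifth code objects) for each sample third code object is already known, e.g., from manual analysis; the names referencing_names and fake_names are illustrative:

```python
import random

def make_samples(third_obj, fourth_objs, referencing_names, fake_names=None):
    """third_obj and each entry of fourth_objs are dicts with 'name' and 'code' keys."""
    samples = []
    for obj in fourth_objs:
        if obj["name"] in referencing_names:
            # positive sample: the sample third code object is called by this sample fourth code object
            samples.append(build_sample(third_obj["name"], third_obj["code"],
                                        obj["name"], obj["code"], referenced=True))
        else:
            # negative sample: with the sample fifth code objects removed, the rest do not reference it
            samples.append(build_sample(third_obj["name"], third_obj["code"],
                                        obj["name"], obj["code"], referenced=False))
    if fake_names:
        # randomly generated non-existent methods also yield negative samples
        name = random.choice(fake_names)
        samples.append(build_sample(third_obj["name"], third_obj["code"],
                                    name, f"void {name}() {{ }}", referenced=False))
    return samples
```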
S602, inputting the sample code reference prompt information, the code corresponding to the sample first code object, and the code corresponding to each sample second code object into the large language model, and triggering the large language model to output a sample code reference analysis result.
Referring to fig. 9, which shows a schematic diagram of file upload. In an exemplary embodiment, the sample code reference prompt information obtained in the previous step may be assembled into a sample data file, and the file uploaded via POST through the files interface to a server on which the large language model is deployed. Specifically, when uploading the data, the purpose parameter may be set to fine-tune, i.e., the file is designated as training data for fine-tuning, and the file parameter points to the file; the purpose of this setting is to train the large language model by fine-tuning without changing too many parameters. After the upload succeeds, the id value in the return value may be recorded; a unique id is generated for each upload for subsequent use.
Referring to fig. 10, which shows a schematic diagram of the file information interface. If the file id is lost after a successful upload, the list of all successfully uploaded files can be obtained via GET through the files interface, and the target file can be identified by its file name and upload time. In the embodiments of the present application, POST and GET refer to the operations executed for file upload and file retrieval, respectively; the purpose parameter indicates the tuning mode, fine-tune indicates training the model by fine-tuning, and the file parameter records the file information.
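A sketch of the upload step, assuming an OpenAI-style HTTP files interface of the kind the description suggests; the base URL, authentication header, and response field names are assumptions:

```python
import requests

API_BASE = "https://api.example.com/v1"           # assumed base URL of the server hosting the model
HEADERS = {"Authorization": "Bearer <API_KEY>"}   # assumed authentication header

def upload_training_file(path="code_reference_samples.jsonl"):
    with open(path, "rb") as f:
        resp = requests.post(f"{API_BASE}/files", headers=HEADERS,
                             data={"purpose": "fine-tune"},       # mark the file as fine-tuning data
                             files={"file": f})
    resp.raise_for_status()
    return resp.json()["id"]                       # record the unique file id for later use

def list_uploaded_files():
    # fallback if the file id was lost: list all successfully uploaded files
    return requests.get(f"{API_BASE}/files", headers=HEADERS).json()
```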
S603, adjusting parameters of the large language model based on the difference between the sample code reference analysis result and the sample prompt answer to obtain the code analysis model.
In an exemplary embodiment, preset parameters in the large language model may be frozen; a cross-entropy loss is calculated based on the difference between the sample code reference analysis result and the sample prompt answer; and the unfrozen parameters in the large language model are adjusted according to the cross-entropy loss to obtain the code analysis model.
The embodiments of the present application do not limit which parameters are frozen; they may be selected according to the actual situation and also depend on the specific structure of the large language model used. The selection does not hinder the implementation of the embodiments of the present application and is not detailed here. Some layers and some parameters of the large language model may be frozen, where freezing means that all parameters in frozen layers and the frozen parameters in unfrozen layers remain unchanged during training, and the training objective is achieved only by adjusting the unfrozen parameters in the unfrozen layers. Of course, in some embodiments only layers may be frozen, or only some parameters may be frozen.
The unfrozen parameters may be adjusted based on gradient descent. Gradient descent is a method commonly used in machine learning and deep learning for adjusting network parameters; it performs first-order optimization of the network parameters along the direction of descending gradient. In the embodiments of the present application, gradient descent guides the parameters to be adjusted in the direction that reduces the loss. Parameter adjustment stops when the number of adjustments reaches a preset threshold or when the loss is smaller than a preset loss threshold, yielding the code analysis model.
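A minimal PyTorch-style sketch of the freezing and gradient-descent idea above. Which layers are unfrozen, the optimizer, the learning rate, and the shape of the model output are all assumptions; the actual large language model, its tokenization, and the YES/NO label encoding are not shown:

```python
import torch
from torch import nn

def freeze_and_tune(model, dataloader, num_unfrozen_layers=2, max_steps=1000, loss_threshold=0.01):
    # freeze everything, then unfreeze only the last few sub-modules (the unfrozen parameters)
    for p in model.parameters():
        p.requires_grad = False
    for module in list(model.children())[-num_unfrozen_layers:]:
        for p in module.parameters():
            p.requires_grad = True

    criterion = nn.CrossEntropyLoss()              # loss between predicted answer and the sample prompt answer
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-5)

    step = 0
    for inputs, targets in dataloader:             # encoded prompts / encoded YES-NO labels
        logits = model(inputs)
        loss = criterion(logits, targets)
        optimizer.zero_grad()
        loss.backward()                            # gradients flow only into unfrozen parameters
        optimizer.step()
        step += 1
        if step >= max_steps or loss.item() < loss_threshold:
            break                                  # stop by adjustment count or loss threshold
    return model
```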
In an exemplary embodiment, in the tuning step, a new tuning task can be created via POST through the fine-tunes interface. After the task is created successfully, the task id is recorded, and the fine-tuning task can later be queried and managed using this id. Model training takes time; the progress of the fine-tuning task can be queried until the task status indicates that the job has succeeded, at which point the large language model can be used as the code analysis model.
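Continuing the assumed HTTP interface from the upload sketch, creating and polling the tuning task might look like this; the endpoint path, request fields, and status strings are assumptions based on the description:

```python
import time
import requests

def create_and_wait(training_file_id, base_model="<base-model-name>"):
    resp = requests.post(f"{API_BASE}/fine-tunes", headers=HEADERS,
                         json={"training_file": training_file_id, "model": base_model})
    resp.raise_for_status()
    job_id = resp.json()["id"]                     # record the task id for later queries and management
    while True:
        status = requests.get(f"{API_BASE}/fine-tunes/{job_id}",
                              headers=HEADERS).json()["status"]
        if status == "succeeded":                  # the tuned model can now serve as the code analysis model
            return job_id
        if status in ("failed", "cancelled"):
            raise RuntimeError(f"fine-tune job ended with status: {status}")
        time.sleep(60)                             # training takes time; poll periodically
```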
After the code analysis model is trained, the code can be checked for useless code based on the model. In one embodiment, all exported method signatures in the .so file can be listed using the nm command. Specifically, a Python script using regular-expression matching can list all method names and save them to a preset file; referring to fig. 11, which shows the file content formed after the method names are exported. Python is a scripting language, and nm is a command that lists the symbols in certain files. A .so file is a dynamic link library file under the Linux operating system.
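A sketch of exporting the method names from the .so file with nm; the demangling flag and the symbol-matching regular expression are assumptions and should be adapted to the actual naming convention of the library:

```python
import re
import subprocess

def export_method_names(so_path, out_path="exported_methods.txt"):
    # nm -D lists the dynamic (exported) symbols; -C demangles C++ names
    output = subprocess.run(["nm", "-D", "-C", so_path],
                            capture_output=True, text=True, check=True).stdout
    names = []
    for line in output.splitlines():
        m = re.search(r"\sT\s+(.+)$", line)        # "T" marks symbols defined in the code (text) section
        if m:
            names.append(m.group(1).strip())
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(names))
    return names
```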
Because the amount of data that a large language model accepts as input is limited, and the code analysis model trained from it inherits this limit, it is often difficult to analyze at once the reference relationships between an exported method name and every method of the main file that references the dynamic link library file. Therefore, after an exported method name is obtained, the related methods in the main file most likely to reference it can be determined by preliminary screening, and the method name and the code segments of the related methods can then be spliced into a character string for querying the code reference relationship; referring to fig. 12, which shows a schematic diagram of such a character string. The character string is input into the code analysis model as code reference prompt information, so that the code analysis model performs the query within a limited range of code blocks, avoiding a large amount of code input and saving model computation. Whether the method corresponding to the method name is useless code is then judged from the output of the code analysis model, and if so, the useless code is recorded.
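Putting the pieces together, the query for one exported method could look like the following sketch. query_code_analysis_model stands for a hypothetical call to the deployed, fine-tuned model, lookup_code is a hypothetical helper returning the code of an exported method, and the spliced string only loosely follows the format shown in fig. 12:

```python
def is_useless(method_name, method_code, collection, k=5):
    """Return True if no semantically close code block of the client code references the method."""
    for block in find_target_code_blocks(collection, method_code, k=k):
        # splice the exported method and the candidate code block into one query string
        prompt = (f"Method A is named {method_name}:\n{method_code}\n\n"
                  f"Method B is named {block['MethodName']}:\n{block['MethodBody']}\n\n"
                  f"Is method {method_name} referenced by method {block['MethodName']}? "
                  f"Answer YES or NO.")
        answer = query_code_analysis_model(prompt)          # hypothetical inference call
        if answer.strip().upper().startswith("YES"):
            return False                                    # referenced by at least one target code block
    return True                                             # not referenced: record as useless code

useless_methods = [name for name in export_method_names("libsdk.so")
                   if is_useless(name, lookup_code(name), collection)]  # lookup_code is hypothetical
```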
After all the useless code is obtained, the package-reduction effect can be achieved by deleting it; of course, the code should be compiled and tested for errors after package reduction. If there is no error, the removal was successful; if there is an error, the code needs to be restored and rechecked.
Referring to fig. 13, which shows an overall schematic diagram of a package-reduction scheme according to an embodiment of the present application. The SDK in fig. 13 is the object to be reduced; all method names in it are extracted by running a Python script to form a method name list. The code of the client referencing the SDK is semantically extracted and stored in a code vector database, i.e., the semantic information base. Preliminary screening is performed in the code vector database by semantic query, several code blocks with similar semantics are selected, and code reference-relationship analysis is carried out based on the selection result. Since the amount of client code is relatively large, passing all of it to the code analysis model to compute reference relationships would clearly overburden the model and would likely exceed its input limit, and too much irrelevant code would also affect the model's computation and output. Preliminary screening avoids these problems caused by feeding excessive information into the model. In short, the client code is vectorized and stored in a vector database (the semantic information base); when a reference relationship is queried, the code to be queried is vectorized as well, and the several most similar code blocks are retrieved from the vector database. These code blocks are finally input into the large language model together. This is a typical way of integrating large language models and is currently a relatively flexible and economical approach.
The embodiments of the present application train a code analysis model using a large language model, automatically checking code reference relationships and searching for useless code. In actual use, only the deletion of the code is done manually, so the package-reduction requirement can be met efficiently and accurately. The embodiments of the present application can be widely applied in the field of SDK development and provide developers with an efficient useless-code checking method. Of course, the model may be further trained to identify whether an interface on which useless code depends is relied on only by that useless code, so that more useless code can be identified at once. A script for removing useless code can also be developed, using the code optimization capability of the large language model to automatically clean up and optimize project code and improve code quality. For the manual compile-and-test step after code deletion, pipeline compilation and visual automated testing can be used to reduce the workload of developers and testers.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Referring to fig. 14, a block diagram of a code processing apparatus according to an embodiment of the present application is shown. The device has the function of realizing the code processing method, and the function can be realized by hardware or by executing corresponding software by hardware. The device may be a computer device or may be provided in a computer device. The apparatus may include:
a code information extraction module 1401 for acquiring a first code object and at least one target code block;
a prompt information generation module 1402, configured to extract a second code object in each of the target code blocks and generate code reference prompt information according to the first code object and each second code object, where the code reference prompt information is used to instruct a code analysis model to output a code reference analysis result, and the code reference analysis result is used to indicate whether the first code object is referenced by the at least one target code block;
a code processing module 1403, configured to input the code reference prompt information, the code corresponding to the first code object, and the code corresponding to each second code object into the code analysis model to obtain the code reference analysis result, and to determine the first code object as a useless code object in the case that the code reference analysis result indicates that the first code object is not referenced by any of the target code blocks;
The code analysis model is a model obtained by training a large language model based on prompt learning.
In one embodiment, the first code object is a code object in a first code file, the target code block is a code block in a second code file, the first code file is used for being called by the second code file, and the code processing module 1403 is used for deleting the first code object in the first code file if the first code object is determined to be a useless code object.
In one embodiment, the target code block is a code block in the second code file, where the semantic similarity between the target code block and the first code object meets a preset requirement.
In one embodiment, the code information extraction module 1401 is configured to perform the following operations:
extracting semantic information corresponding to each code block in the second code file based on the semantic information extraction model, and constructing a semantic information base according to semantic information extraction results;
extracting semantic information corresponding to the first code object based on the semantic information extraction model to obtain target semantic information;
searching code blocks with the semantic distance within a preset range from the target semantic information in the semantic information base to obtain the at least one target code block.
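As a minimal illustration of the distance-based retrieval in this embodiment (a preset range rather than a fixed top-k), the following sketch filters code blocks by a distance threshold; the vectors are assumed to come from the same, otherwise unspecified, semantic information extraction model.

import numpy as np

def blocks_within_semantic_range(target_vector, block_vectors, code_blocks, max_distance):
    """Return every code block whose semantic distance to the target is within the preset range."""
    distances = np.linalg.norm(block_vectors - target_vector, axis=1)
    return [block for block, d in zip(code_blocks, distances) if d <= max_distance]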
In one embodiment, the code processing module 1403 is configured to perform the following operations:
acquiring sample code reference prompt information, wherein the sample code reference prompt information comprises a sample prompt question and a sample prompt answer, the sample prompt question is used for indicating a large language model to output a sample code reference analysis result, the sample code reference analysis result is used for indicating whether a sample first code object in the sample code prompt information is referenced by at least one sample second code object in the sample code prompt information, and the sample prompt answer is used for indicating a reference relation between the sample first code object and the at least one sample second code object;
inputting the sample code reference prompt information, the codes corresponding to the sample first code objects and the codes corresponding to the sample second code objects into the large language model, and triggering the large language model to output the sample code reference analysis result;
and adjusting parameters of the large language model based on the difference between the sample code reference analysis result and the sample prompt answer to obtain the code analysis model.
In one embodiment, the code processing module 1403 is configured to perform the following operations:
acquiring a sample third code object and at least one sample code block, wherein the semantic distance between the sample code block and the sample third code object is within a preset range;
obtaining at least one sample fourth code object based on the code object extracted from each sample code block;
and generating sample code reference prompt information with a sample prompt answer being a positive answer based on the sample third code object and the at least one sample fourth code object under the condition that the sample third code object is called by any one of the sample fourth code objects.
In one embodiment, the code processing module 1403 is configured to perform the following operations:
in the case that the sample third code object is invoked by any of the sample fourth code objects, determining all sample fifth code objects in the at least one sample code block, the sample fifth code objects being sample fourth code objects that reference the sample third code object;
deleting all of the sample fifth code objects from the at least one sample fourth code object;
and generating sample code reference prompt information with a sample prompt answer being a negative answer based on the sample third code object and the deleting result.
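By way of non-limiting illustration, the following sketch builds one positive and one negative sample code reference prompt in the manner described above. The references() check and the prompt wording are simplifying assumptions for the example; an actual pipeline could use a static analyzer and its own prompt template.

def references(caller_code, callee_name):
    """Toy reference check: the callee's name appears in the caller's code."""
    return callee_name in caller_code

def build_reference_samples(third_name, third_code, fourth_objects):
    """Build one positive and one negative training sample from a sample third code object.

    fourth_objects is a list of (name, code) pairs extracted from the semantically
    similar sample code blocks.
    """
    samples = []
    callers = [(n, c) for n, c in fourth_objects if references(c, third_name)]
    if callers:  # the sample third code object is called by at least one sample fourth code object
        samples.append({
            "question": "Is " + third_name + " referenced by "
                        + ", ".join(n for n, _ in fourth_objects) + "?",
            "code": [third_code] + [c for _, c in fourth_objects],
            "answer": "yes",
        })
        # Remove every "sample fifth code object" (the callers) so the remaining
        # fourth objects no longer reference the third object: a negative sample.
        remaining = [(n, c) for n, c in fourth_objects if (n, c) not in callers]
        samples.append({
            "question": "Is " + third_name + " referenced by "
                        + ", ".join(n for n, _ in remaining) + "?",
            "code": [third_code] + [c for _, c in remaining],
            "answer": "no",
        })
    return samples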
In one embodiment, the code processing module 1403 is configured to perform the following operations:
freezing preset parameters in the large language model;
calculating cross entropy loss based on the difference between the sample code reference analysis result and the sample prompt answer;
and adjusting unfrozen parameters in the large language model according to the cross entropy loss to obtain the code analysis model.
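The following PyTorch-style sketch is a hedged illustration of this fine-tuning step. Which parameters count as "preset" (frozen) is not fixed by the present application; here, purely as an assumption, only parameters whose names contain "lm_head" remain trainable, and the cross entropy loss is computed over the tokens of the sample prompt answer.

import torch
import torch.nn as nn

def freeze_preset_parameters(model, trainable_keywords=("lm_head",)):
    """Freeze every parameter except those whose names contain a trainable keyword."""
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in trainable_keywords)

def training_step(model, optimizer, input_ids, answer_labels):
    """One update: cross entropy between the model output and the sample prompt answer."""
    logits = model(input_ids)                          # assumed shape: (batch, seq_len, vocab)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),           # flatten all token positions
        answer_labels.reshape(-1),                     # token ids of the sample prompt answer
        ignore_index=-100,                             # positions outside the answer are masked
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In practice the optimizer would be constructed over the unfrozen parameters only, for example torch.optim.AdamW(p for p in model.parameters() if p.requires_grad).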
It should be noted that, when the apparatus provided in the foregoing embodiment implements its functions, the division into the foregoing functional modules is merely illustrative; in practical applications, the functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided in the foregoing embodiments belong to the same concept, and the specific implementation processes thereof are detailed in the method embodiments and are not repeated herein.
Referring to fig. 15, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be a server for performing the above-described code processing method. Specifically:
The computer device 1000 includes a central processing unit (Central Processing Unit, CPU) 1001, a system memory 1004 including a random access memory (Random Access Memory, RAM) 1002 and a read only memory (Read Only Memory, ROM) 1003, and a system bus 1005 connecting the system memory 1004 and the central processing unit 1001. The computer device 1000 also includes a basic input/output system (I/O system) 1006, which helps to transfer information between various components within the computer, and a mass storage device 1007 for storing an operating system 1013, application programs 1014, and other program modules 1015.
The basic input/output system 1006 includes a display 1008 for displaying information and an input device 1009, such as a mouse, keyboard, etc., for the user to enter information. Wherein the display 1008 and the input device 1009 are connected to the central processing unit 1001 through an input output controller 1010 connected to a system bus 1005. The basic input/output system 1006 may also include an input/output controller 1010 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 1010 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1007 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1007 and its associated computer-readable media provide non-volatile storage for the computer device 1000. That is, the mass storage device 1007 may include a computer readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other solid state memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the foregoing. The system memory 1004 and the mass storage device 1007 described above may be collectively referred to as memory.
According to various embodiments of the present application, the computer device 1000 may also operate by being connected to a remote computer on a network, such as the Internet. That is, the computer device 1000 may be connected to the network 1012 through a network interface unit 1011 connected to the system bus 1005, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 1011.
The memory further stores a computer program that is configured to be executed by one or more processors to implement the code processing method described above.
In an exemplary embodiment, a computer readable storage medium is also provided, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored, where the at least one instruction, the at least one program, the set of codes, or the set of instructions, when executed by a processor, implement the code processing method.
Specifically, the code processing method includes:
acquiring a first code object and at least one target code block;
extracting a second code object in each target code block, and generating code reference prompt information according to the first code object and each second code object, wherein the code reference prompt information is used for indicating a code analysis model to output a code reference analysis result, and the code reference analysis result is used for indicating whether the first code object is referenced by the at least one target code block;
inputting the code reference prompt information, the codes corresponding to the first code objects and the codes corresponding to the second code objects into the code analysis model to obtain the code reference analysis result;
determining the first code object as a useless code object in the case that the code reference analysis result indicates that the first code object is not referenced by any of the target code blocks;
the code analysis model is a model obtained by training a large language model based on prompt learning.
In one embodiment, the first code object is a code object in a first code file, the target code block is a code block in a second code file, and the first code file is used for being called by the second code file, and the method further includes:
and deleting the first code object in the first code file when the first code object is determined to be a useless code object.
In one embodiment, the target code block is a code block in the second code file, where the semantic similarity between the target code block and the first code object meets a preset requirement.
In one embodiment, the acquiring the first code object and the at least one target code block includes:
extracting semantic information corresponding to each code block in the second code file based on the semantic information extraction model, and constructing a semantic information base according to semantic information extraction results;
extracting semantic information corresponding to the first code object based on the semantic information extraction model to obtain target semantic information;
searching code blocks with the semantic distance within a preset range from the target semantic information in the semantic information base to obtain the at least one target code block.
In one embodiment, the code analysis model is trained by the following method:
acquiring sample code reference prompt information, wherein the sample code reference prompt information comprises a sample prompt question and a sample prompt answer, the sample prompt question is used for indicating a large language model to output a sample code reference analysis result, the sample code reference analysis result is used for indicating whether a sample first code object in the sample code prompt information is referenced by at least one sample second code object in the sample code prompt information, and the sample prompt answer is used for indicating a reference relation between the sample first code object and the at least one sample second code object;
inputting the sample code reference prompt information, the codes corresponding to the sample first code objects and the codes corresponding to the sample second code objects into the large language model, and triggering the large language model to output the sample code reference analysis result;
and adjusting parameters of the large language model based on the difference between the sample code reference analysis result and the sample prompt answer to obtain the code analysis model.
In one embodiment, the acquiring the sample code reference hint information includes:
acquiring a sample third code object and at least one sample code block, wherein the semantic distance between the sample code block and the sample third code object is within a preset range;
obtaining at least one sample fourth code object based on the code object extracted from each sample code block;
and generating sample code reference prompt information with a sample prompt answer being a positive answer based on the sample third code object and the at least one sample fourth code object under the condition that the sample third code object is called by any one of the sample fourth code objects.
In one embodiment, the acquiring the sample code reference hint information further includes:
in the case that the sample third code object is invoked by any of the sample fourth code objects, determining all sample fifth code objects in the at least one sample code block, the sample fifth code objects being sample fourth code objects that reference the sample third code object;
deleting all of the sample fifth code objects from the at least one sample fourth code object;
and generating sample code reference prompt information with a sample prompt answer being a negative answer based on the sample third code object and the deleting result.
In one embodiment, the adjusting parameters of the large language model based on the difference between the sample code reference analysis result and the sample prompt answer to obtain the code analysis model includes:
freezing preset parameters in the large language model;
calculating cross entropy loss based on the difference between the sample code reference analysis result and the sample prompt answer;
and adjusting unfrozen parameters in the large language model according to the cross entropy loss to obtain the code analysis model.
Alternatively, the computer-readable storage medium may include: a ROM (Read-Only Memory), a RAM (Random Access Memory), an SSD (Solid State Drive), an optical disk, or the like. The random access memory may include a ReRAM (Resistive Random Access Memory) and a DRAM (Dynamic Random Access Memory), among others.
In an exemplary embodiment, a computer program product or a computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions so that the computer device performs the above-described code processing method.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. In addition, the step numbers described herein merely illustrate one possible execution order of the steps; in some other embodiments, the steps may be executed out of the numbered order, for example two differently numbered steps may be executed simultaneously, or two differently numbered steps may be executed in an order opposite to that shown, which is not limited by the embodiments of the present application.
In addition, the specific embodiments of the present application involve data related to users, such as user information. When the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the related data need to comply with the relevant laws, regulations and standards of the relevant countries and regions.
The foregoing description of the exemplary embodiments of the present application is not intended to limit the invention to the particular embodiments disclosed; on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.

Claims (12)

1. A method of code processing, the method comprising:
acquiring a first code object and at least one target code block;
extracting second code objects in each target code block, and generating code reference prompt information according to the first code objects and the second code objects, wherein the code reference prompt information is used for indicating a code analysis model to output code reference analysis results, and the code reference analysis results are used for indicating whether the first code objects are referenced by the at least one target code block;
inputting the code reference prompt information, the codes corresponding to the first code objects and the codes corresponding to the second code objects into the code analysis model to obtain the code reference analysis result;
determining the first code object as a useless code object in the case that the code reference analysis result indicates that the first code object is not referenced by any of the target code blocks;
the code analysis model is a model obtained by training a large language model based on prompt learning.
2. The method of claim 1, wherein the first code object is a code object in a first code file and the target code block is a code block in a second code file, the first code file for invocation by the second code file, the method further comprising:
in the event that the first code object is determined to be a useless code object, the first code object is deleted in the first code file.
3. The method of claim 2, wherein the target code block is a code block in the second code file having a semantic similarity with the first code object that meets a preset requirement.
4. A method according to any one of claims 1 to 3, wherein said obtaining a first code object and at least one target code block comprises:
extracting semantic information corresponding to each code block in the second code file based on the semantic information extraction model, and constructing a semantic information base according to semantic information extraction results;
extracting semantic information corresponding to the first code object based on the semantic information extraction model to obtain target semantic information;
searching code blocks with the semantic distance within a preset range from the target semantic information in the semantic information base to obtain the at least one target code block.
5. The method of claim 4, wherein the code analysis model is trained by:
acquiring sample code reference prompt information, wherein the sample code reference prompt information comprises a sample prompt question and a sample prompt answer, the sample prompt question is used for indicating a large language model to output a sample code reference analysis result, the sample code reference analysis result is used for indicating whether a sample first code object in the sample code prompt information is referenced by at least one sample second code object in the sample code prompt information, and the sample prompt answer is used for indicating a reference relation between the sample first code object and the at least one sample second code object;
inputting the sample code reference prompt information, the codes corresponding to the sample first code objects and the codes corresponding to the sample second code objects into the large language model, and triggering the large language model to output the sample code reference analysis result;
and adjusting parameters of the large language model based on the difference between the sample code reference analysis result and the sample prompt answer to obtain the code analysis model.
6. The method of claim 5, wherein the obtaining sample code reference hint information comprises:
acquiring a sample third code object and at least one sample code block, wherein the semantic distance between the sample code block and the sample third code object is within a preset range;
obtaining at least one sample fourth code object based on the code objects extracted from each sample code block;
and generating sample code reference prompt information with a sample prompt answer being a positive answer based on the sample third code object and the at least one sample fourth code object under the condition that the sample third code object is called by any sample fourth code object.
7. The method of claim 6, wherein the obtaining the sample code references hint information, further comprising:
in the event that the sample third code object is invoked by any of the sample fourth code objects, determining all sample fifth code objects in the at least one sample code block, the sample fifth code objects being sample fourth code objects that reference the sample third code object;
deleting all of the sample fifth code objects from the at least one sample fourth code object;
and generating sample code reference prompt information with a sample prompt answer being a negative answer based on the sample third code object and the deleting result.
8. The method of claim 5, wherein adjusting parameters of the large language model based on the difference between the sample code reference analysis result and the sample prompt answer to obtain the code analysis model comprises:
freezing preset parameters in the large language model;
calculating cross entropy loss based on the difference between the sample code reference analysis result and the sample prompt answer;
and according to the cross entropy loss, adjusting unfrozen parameters in the large language model to obtain the code analysis model.
9. A code processing apparatus, the apparatus comprising:
the code information extraction module is used for acquiring a first code object and at least one target code block;
the prompt information generation module is used for extracting second code objects in each target code block, generating code reference prompt information according to the first code objects and the second code objects, wherein the code reference prompt information is used for indicating a code analysis model to output code reference analysis results, and the code reference analysis results are used for indicating whether the first code objects are referenced by the at least one target code block;
the code processing module is used for inputting the code reference prompt information, the codes corresponding to the first code objects and the codes corresponding to the second code objects into the code analysis model to obtain the code reference analysis result; determining the first code object as a useless code object in the case that the code reference analysis result indicates that the first code object is not referenced by any of the target code blocks;
the code analysis model is a model obtained by training a large language model based on prompt learning.
10. A computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set that is loaded and executed by the processor to implement the code processing method of any of claims 1 to 8.
11. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the code processing method of any of claims 1 to 8.
12. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the code processing method of any of claims 1 to 8.
CN202311009107.8A 2023-08-11 2023-08-11 Code processing method, device, equipment, storage medium and product Pending CN117435467A (en)

Priority Applications (1)

CN202311009107.8A (priority date 2023-08-11, filing date 2023-08-11): Code processing method, device, equipment, storage medium and product

Publications (1)

CN117435467A, published 2024-01-23

Family ID

89546954

Country Status (1)

CN (1) CN117435467A (en)


Legal Events

Date Code Title Description
PB01 Publication