Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It will be apparent that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments in this disclosure without inventive faculty, are intended to be within the scope of this disclosure.
It should be noted that the terminal device according to embodiments of the present disclosure may include, but is not limited to, smart devices such as a mobile phone, a personal digital assistant (PDA), a wireless handheld device, a tablet computer, and the like; and the display device may include, but is not limited to, devices with a display function such as a personal computer, a television, and the like.
In addition, the term "and/or" herein merely describes an association relationship between associated objects, and indicates that three relationships may exist. For example, "A and/or B" may represent three cases: A exists alone, both A and B exist, and B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
In a practical scenario, a developer typically has to manually search a pre-created code file library and learn the knowledge of each code file in the code file library.
Fig. 1 is a schematic diagram of a first embodiment of the present disclosure. As shown in Fig. 1, the present embodiment provides a large model-based question-answer processing method for implementing question-answer processing based on a code segment database, which may specifically include the following steps:
S101, retrieving from a pre-created code segment database based on a question of a user to obtain a target code segment, wherein the code segment database comprises a plurality of code segments;
S102, acquiring at least one relevant code segment by referring to a pre-created graph database based on the target code segment, wherein the graph database comprises relationships among different code segments;
S103, generating an answer to the question by using a large language model (Large Language Model, LLM) based on the target code segment and the at least one relevant code segment.
The execution subject of the large model-based question-answer processing method of the present embodiment may be a large model-based question-answer processing apparatus, which may be an electronic entity or an application integrated by software. The apparatus can answer the question of the user based on a large model with reference to the code segment database.
The code segment database of this embodiment is a database built at the granularity of code segments. For example, each individual code segment in the code segment database in this embodiment may be a function, a class name, a structure, an intra-class member, a global variable, a static variable, an annotation, or the like.
Each node in the graph database of this embodiment is a code segment, and the graph database includes the relationships between different code segments. For example, the relationships in this embodiment may include: includes, is included by, belongs to, is attributed to, references, is referenced by, calls, is called by, parent-child relationships, and so forth. That is, in the graph database, if there is a relationship between two code segments, there is an edge between the two code segments; otherwise, there is no connection between the two code segments.
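As a minimal illustrative sketch only, such a graph of code segments and typed relationships could be represented with the networkx library standing in for a graph database; the node identifiers and relation labels below are assumptions:

```python
# Minimal sketch of the node/edge model: each node is a code segment, and an
# edge labelled with a relation type exists only where a relationship exists.
import networkx as nx

graph = nx.MultiDiGraph()
graph.add_node("function_A")
graph.add_node("function_B")
graph.add_node("class_C")

graph.add_edge("function_A", "function_B", relation="calls")
graph.add_edge("function_B", "function_A", relation="is called")
graph.add_edge("function_A", "class_C", relation="parent")
graph.add_edge("class_C", "function_A", relation="child")

# Example query: which segments is function_A directly related to, and how?
for _, neighbor, data in graph.out_edges("function_A", data=True):
    print("function_A ->", data["relation"], "->", neighbor)
```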
In this embodiment, the target code segment and the at least one relevant code segment can provide rich knowledge support for answering the question of the user. Based on this, the large language model can generate an answer to the question of the user more accurately from the target code segment and the at least one relevant code segment.
The user of the present embodiment may be a developer. An application scenario of this embodiment may be that, during development, a developer has a question and inputs it to the large model-based question-answer processing apparatus. The question of the user may be a question about development, such as how the call of function A is implemented, how the global variable X is defined, and so on. After receiving the question of the user, the large model-based question-answer processing apparatus can use the large language model with reference to the code segment database and the graph database and, by adopting the technical solution of this embodiment, generate an answer to the question of the user and feed it back to the user.
Based on the technical solution of this embodiment, research and development personnel do not need to manually check code files; by adopting the solution of this embodiment, a large language model can be used to provide accurate and efficient answers for users based on the code segment database and the graph database.
According to the large model-based question-answer processing method of this embodiment, the target code segment and the at least one relevant code segment corresponding to the question of the user can be obtained based on the code segment database and the graph database, which provides rich knowledge support for answering the question of the user. Further, by using a large language model, an answer to the question can be generated efficiently and accurately according to the target code segment and the at least one relevant code segment, providing effective support for research and development personnel to obtain knowledge from the code segment database.
Fig. 2 is a schematic diagram of a second embodiment of the present disclosure, and Fig. 3 is a schematic flow diagram of the embodiment shown in Fig. 2. The solution of the present disclosure is further described in more detail on the basis of the embodiment shown in Fig. 1. As shown in Fig. 2 and Fig. 3, the large model-based question-answer processing method of this embodiment may specifically include the following steps:
S201, acquiring a vector expression of the question of the user;
S202, acquiring a vector expression of each code segment in the code segment database;
S203, acquiring, based on the vector expression of the question of the user and the vector expression of each code segment, the code segment with the highest similarity to the question of the user from the code segment database as the target code segment;
Specifically, the similarity between the vector expression of the question of the user and the vector expression of each code segment is calculated, and the code segment with the highest similarity is then taken as the target code segment.
In this embodiment, the vector expression of the question of the user may be obtained by using a pre-trained vector expression model.
Steps S201-S203 are one implementation of step S101 in the embodiment shown in Fig. 1. This implementation acquires the target code segment by means of vectorized retrieval. In this way, the target code segment can be acquired accurately and efficiently.
Steps S201-S203 correspond to the vectorized retrieval process in Fig. 3, and the obtained target code segment is the retrieval result in Fig. 3.
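A minimal sketch of the vectorized retrieval in steps S201-S203, assuming the sentence-transformers library as a stand-in for the pre-trained vector expression model (the model name and the dictionary form of the code segment database are assumptions):

```python
# Sketch of steps S201-S203: embed the question and every code segment,
# then pick the segment whose vector is most similar to the question's.
# sentence-transformers is used here only as a stand-in embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer

_model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def retrieve_target_segment(question, segment_db):
    """segment_db maps segment id -> segment content; returns the best id."""
    ids = list(segment_db)
    q_vec = _model.encode([question])[0]                      # S201
    seg_vecs = _model.encode([segment_db[i] for i in ids])    # S202
    # S203: cosine similarity between the question and each segment.
    sims = seg_vecs @ q_vec / (
        np.linalg.norm(seg_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-12
    )
    return ids[int(np.argmax(sims))]
```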
Optionally, in practical applications, the relevance between the question of the user and each code segment can also be obtained directly based on semantics, so that the code segment with the highest relevance is taken as the target code segment. For example, a keyword in the question of the user may be extracted, and the code segment with the highest relevance may then be retrieved from the code segment database as the target code segment based on the keyword.
S204, retrieving, based on the target code segment, an identification of at least one relevant code segment having a relationship with the target code segment from the graph database;
For example, in a specific implementation, considering that the number of target code segments is limited, in order to enrich the retrieval result, at least one level of search may be performed in the graph database with the target code segment as the search start point, so as to obtain the identification of at least one relevant code segment having a relationship with the target code segment. In this way, the identification of the at least one relevant code segment can be obtained accurately.
Alternatively, in practice, two or three levels of search may be performed, subject to the restriction on the number of tokens that the LLM can receive as input.
In other words, with the target code segment as the search start point, relationships of at most 2 degrees are searched in the graph database. When searching for 1st-degree relationships with the target code segment as the start point, several nearest-neighbor nodes directly related to the target code segment are obtained; then, searching for 2nd-degree relationships with each nearest-neighbor node as the start point, several second-nearest-neighbor nodes directly related to the nearest-neighbor nodes are obtained, at which point the 2-degree relationship search is completed.
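A sketch of the at-most-2-degree search described above, again with networkx standing in for the graph database (the function name and return form are assumptions):

```python
# Sketch of the at-most-2-degree relationship search: first collect the
# nearest-neighbor nodes of the target segment (1st degree), then the
# nearest neighbors of those nodes (2nd degree).
import networkx as nx

def search_related_ids(graph, target_id):
    first_degree = set(graph.successors(target_id)) | set(graph.predecessors(target_id))
    first_degree.discard(target_id)
    second_degree = set()
    for node in first_degree:
        second_degree |= set(graph.successors(node)) | set(graph.predecessors(node))
    second_degree -= first_degree | {target_id}
    return first_degree, second_degree
```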
S205, acquiring the at least one relevant code segment based on the identification of the at least one relevant code segment;
When this step S205 is specifically implemented, two cases may be involved:
In the first case, the graph database stores not only the identification of each node but also the content of the code segment of each node. At this time, the at least one relevant code segment may be obtained from the graph database based on the identification of the at least one relevant code segment.
In the second case, only the identification of each node is stored in the graph database, and the content of the code segment of each node is not stored. At this time, the at least one relevant code segment may be obtained from the code segment database based on the identification of the at least one relevant code segment.
In either case, at least one relevant code segment can be accurately and efficiently acquired.
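A sketch covering the two cases above: segment content is taken from the graph database when it is stored on the node, and from the code segment database otherwise (the attribute name "content" and the dictionary form of the code segment database are assumptions):

```python
# Sketch of step S205: resolve the content of each relevant code segment from
# the graph database when it is stored there (case 1), otherwise fall back to
# the code segment database (case 2).
import networkx as nx

def fetch_segments(ids, graph, segment_db):
    contents = {}
    for seg_id in ids:
        attrs = graph.nodes[seg_id] if seg_id in graph else {}
        if "content" in attrs:            # case 1: content stored on the node
            contents[seg_id] = attrs["content"]
        else:                             # case 2: only the identification is stored
            contents[seg_id] = segment_db[seg_id]
    return contents
```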
Steps S204-S205 are a specific implementation of step S102 in the embodiment shown in Fig. 1. By adopting this method, at least one relevant code segment can be acquired accurately and efficiently.
Steps S204-S205 correspond to the relationship retrieval process in Fig. 3, and the obtained at least one relevant code segment corresponds to the relationship retrieval result in Fig. 3.
Alternatively, in practical applications, each code segment in the graph database may be analyzed directly to mine at least one relevant code segment related to the target code segment.
S206, acquiring, from the graph database, the relationships existing among the target code segment and the at least one relevant code segment;
Specifically, these may include the relationships between the target code segment and the relevant code segments obtained by the 1st-degree retrieval, and the relationships between each relevant code segment obtained by the 1st-degree retrieval and the relevant code segments obtained by the corresponding 2nd-degree retrieval. This step may be regarded as further adding the relationships existing among the at least one relevant code segment to the relationship retrieval result, so as to improve the accuracy of the generated answer to the question.
In this embodiment, the type of relationship depends on the specific scenario. For example, in a unit test scenario, call relationships are required, whereas in a function interpretation scenario, parent-child relationships are required. The specific processing for specific scenarios in practical applications is not described in detail herein. In summary, all relationships existing among the target code segment and the at least one relevant code segment are obtained from the graph database.
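As a sketch of step S206, all relationships whose two endpoints both lie in the union of the target code segment and the relevant code segments can be collected from the graph (networkx again stands in for the graph database; the triple form is an assumption):

```python
# Sketch of step S206: collect every edge of the graph whose two endpoints
# both belong to the union of {target segment} and {relevant segments}.
import networkx as nx

def collect_relations(graph, target_id, related_ids):
    nodes = {target_id} | set(related_ids)
    subgraph = graph.subgraph(nodes)
    # Each triple reads "source -> relation -> target".
    return [(u, data.get("relation", ""), v)
            for u, v, data in subgraph.edges(data=True)]
```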
S207, generating an answer to the question by using the LLM based on the target code segment, the at least one relevant code segment, and the relationships existing among the target code segment and the at least one relevant code segment.
Optionally, in this embodiment, the maximum number of tokens that the LLM can receive is taken as a constraint. If the total length of the target code segment, the at least one relevant code segment, and the relationships existing among them exceeds this maximum, the relevant code segments obtained by the 2nd-degree relationship retrieval and their corresponding relationships may be cut preferentially, so as to ensure that the LLM works normally.
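A sketch of this optional truncation, assuming the 1st-degree and 2nd-degree segments are kept in separate dictionaries; count_tokens is a crude placeholder for whatever tokenizer matches the LLM actually used:

```python
# Sketch of the token-budget truncation: when the assembled context exceeds
# the LLM's maximum token count, drop 2nd-degree segments and their
# relationships first, keeping the target and 1st-degree segments.
def count_tokens(text):
    return len(text.split())  # crude stand-in for a real tokenizer

def trim_to_budget(target, first_degree, second_degree, relations, max_tokens):
    def total_tokens():
        parts = [target, *first_degree.values(), *second_degree.values(),
                 *(" ".join(triple) for triple in relations)]
        return sum(count_tokens(part) for part in parts)

    while total_tokens() > max_tokens and second_degree:
        dropped_id, _ = second_degree.popitem()   # cut one 2nd-degree segment
        relations = [t for t in relations if dropped_id not in (t[0], t[2])]
    return first_degree, second_degree, relations
```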
In this embodiment, not only the target code segment and the at least one relevant code segment are input to the LLM, but also the relationships existing among the target code segment and the at least one relevant code segment. Combining the relationships enables the LLM to perform better, so that the generated answer to the question is more accurate.
In this embodiment, by providing the LLM with the relationships between the target code segment and the at least one relevant code segment, richer knowledge support can be provided for the LLM, so that the LLM can generate answers to questions more accurately and efficiently.
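A sketch of how the LLM input of step S207 might be assembled; llm_generate is a placeholder for whatever concrete large language model interface is used, and the prompt wording is an assumption:

```python
# Sketch of step S207: combine the question, the target segment, the relevant
# segments and the relationships among them into one prompt for the LLM.
def llm_generate(prompt):
    raise NotImplementedError("plug in a concrete large language model here")

def answer_question(question, target, related, relations):
    relation_lines = "\n".join(f"{u} -> {rel} -> {v}" for u, rel, v in relations)
    prompt = (
        "Answer the developer's question using only the context below.\n\n"
        f"Target code segment:\n{target}\n\n"
        "Relevant code segments:\n" + "\n\n".join(related.values()) + "\n\n"
        f"Relationships between segments:\n{relation_lines}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_generate(prompt)
```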
Optionally, before step S202 of the present embodiment, the following steps may be further included:
(1) Collecting a plurality of code documents;
(2) For each code document, dividing the code document according to functions, class names, structures, intra-class members, global variables, static variables or annotations to obtain a plurality of code segments;
(3) Constructing the code segment database based on the plurality of code segments of each code document.
In this embodiment, each code document is segmented according to functions, class names, structures, intra-class members, global variables, static variables or annotations, so that each code segment has a smaller granularity and can be accurately located when being retrieved. In this way, the code segment database can be constructed efficiently and accurately.
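For Python code documents such as the db.py example below, the division could be sketched with the standard ast module; the granularity shown (top-level functions, classes and assignments) is a simplification of the full list above, and the identifier scheme is an assumption:

```python
# Sketch of splitting one Python code document into code segments at the
# granularity of top-level functions, classes and (global) assignments.
# Structures, intra-class members, comments, etc. would need extra handling.
import ast

def split_code_document(source, doc_name):
    segments = {}
    tree = ast.parse(source)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            seg_id = f"{doc_name}:{node.name}"
        elif isinstance(node, (ast.Assign, ast.AnnAssign)):
            seg_id = f"{doc_name}:assign@line{node.lineno}"
        else:
            continue
        segments[seg_id] = ast.get_source_segment(source, node)
    return segments
```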
Optionally, before step S204 of the present embodiment, the following steps may be further included:
(a) For each code document, mining the relationships between different code segments in the code document;
For example, the mined relationship between code segments Segment1 and Segment2 may be expressed as Segment1 -> Relation -> Segment2.
In this embodiment, a pre-trained relation mining model may be used to mine the relationships between different code segments.
For example, the relationships of the present embodiment may include call relationships, parent-child relationships, and so forth. The call relationship may include, for example, that function A calls function B, and that function B uses a global variable. The parent-child relationship may include, for example, a class and its class member functions, a class and its class member variables, or a function and the local variables within the function.
For example, Fig. 4 is a schematic diagram of relationships mined in an embodiment of the present disclosure. As shown in Fig. 4, the following relationships can be mined: function A -> calls -> function B, function B -> is called -> function A, function A -> parent -> class C, class C -> child -> function A, class member -> is called -> function A, function A -> calls -> class member, class member -> parent -> class C, class C -> child -> class member, and so forth.
(b) Constructing a graph database based on the relationships between different code segments in each code document.
Specifically, in the constructed graph database, each code segment is used as a node, and the relationships between different code segments are used as the edges between different code segments. The graph database created in this manner can be regarded as a knowledge graph of the code segments. By adopting this manner, the graph database can be constructed accurately and efficiently.
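A sketch of steps (a) and (b) for a single Python code document: call relationships between top-level functions are mined with the ast module and written into a networkx graph as labelled edges (the ast-based mining here merely stands in for the pre-trained relation mining model mentioned above, and the edge labels are assumptions):

```python
# Sketch of mining call relationships between top-level functions of one
# Python code document and building the graph: code segments are nodes,
# relationships are labelled edges.
import ast
import networkx as nx

def build_graph(source, doc_name):
    graph = nx.MultiDiGraph()
    tree = ast.parse(source)
    functions = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    for name in functions:
        graph.add_node(f"{doc_name}:{name}")
    for name, node in functions.items():
        for inner in ast.walk(node):
            if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                callee = inner.func.id
                if callee in functions and callee != name:
                    graph.add_edge(f"{doc_name}:{name}", f"{doc_name}:{callee}",
                                   relation="calls")
                    graph.add_edge(f"{doc_name}:{callee}", f"{doc_name}:{name}",
                                   relation="is called")
    return graph
```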
Alternatively, in this embodiment, the content of the code segment of each node may be stored in the graph database; or only the identification of each node may be stored without the content of the code segment, in which case the content of the code segment is obtained from the code segment database.
For example, the technical solution of the present disclosure is described below by taking a code file db.py as an example. The specific contents of the code file db.py are as follows:
According to the code segment division manner of this embodiment, the following two code segments can be obtained by division:
Code segment 1:
Code segment 2:
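As a purely hypothetical illustration consistent with the application scenario discussed next, db.py may be assumed to define a create_db function (corresponding to code segment 1) that calls a create_password helper (corresponding to code segment 2); the function bodies below are assumptions, not the actual listing:

```python
# Hypothetical contents of db.py (for illustration only).
import secrets
import sqlite3

# Code segment 1: the create_db function, which calls create_password.
def create_db(path):
    password = create_password()
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES (?, ?)", ("admin", password))
    conn.commit()
    conn.close()

# Code segment 2: the create_password helper used by create_db.
def create_password(length=12):
    return secrets.token_urlsafe(length)
```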
In one specific application scenario, the query of the user may be "check whether a bug exists in the create database function?".
The retrieval in the first step can find the "create_db" function. If only a one-level search is performed, the obtained retrieval result contains only "create_db". In this case, since the relevance between the query of the user and "create_password" is low, "create_password" cannot be retrieved. As a result, the LLM cannot see the "create_password" function, cannot answer correctly, and can only tell the user to check for anomalies.
Therefore, by adopting the manner of this embodiment, a search of at least two degrees can be performed to obtain at least one relevant code segment. In this application example, "create_password" can be retrieved based on "create_db", so that the LLM can generate the answer to the query of the user based on both "create_db" and "create_password".
Fig. 5 is a schematic diagram of a third embodiment of the present disclosure. As shown in Fig. 5, the present embodiment provides a large model-based question-answer processing apparatus 500 for implementing question-answer processing based on a code segment database, including:
a segment retrieval module 501, configured to retrieve, based on a question of a user, from a pre-created code segment database to obtain a target code segment;
a segment obtaining module 502, configured to obtain at least one relevant code segment by referring to a pre-created graph database based on the target code segment, where the graph database includes relationships between different code segments; and
a generating module 503, configured to generate an answer to the question using a large language model based on the target code segment and the at least one relevant code segment.
The large model-based question-answer processing apparatus 500 of this embodiment implements question-answer processing by adopting the above modules. Its implementation principle and technical effect are the same as those of the related method embodiments above; for details, reference may be made to the description of the related method embodiments, which will not be repeated herein.
Fig. 6 is a schematic diagram of a fourth embodiment of the present disclosure. As shown in Fig. 6, the large model-based question-answer processing apparatus 600 of this embodiment further describes the technical solution of the present disclosure in more detail on the basis of the embodiment shown in Fig. 5. As shown in Fig. 6, the large model-based question-answer processing apparatus 600 of this embodiment includes modules with the same names and functions as those shown in Fig. 5, namely a segment retrieval module 601, a segment obtaining module 602, and a generating module 603.
In this embodiment, the segment obtaining module 602 is configured to:
retrieve, based on the target code segment, an identification of at least one relevant code segment having a relationship with the target code segment from the graph database; and
acquire the at least one relevant code segment based on the identification of the at least one relevant code segment.
Further optionally, in an embodiment of the present disclosure, the segment obtaining module 602 is configured to:
acquire the at least one relevant code segment from the graph database based on the identification of the at least one relevant code segment; or
acquire the at least one relevant code segment from the code segment database based on the identification of the at least one relevant code segment.
Further optionally, in an embodiment of the present disclosure, the segment obtaining module 602 is configured to:
perform at least one level of search in the graph database with the target code segment as the search start point, to obtain the identification of at least one relevant code segment having a relationship with the target code segment.
Further alternatively, in one embodiment of the present disclosure, the segment retrieval module 601 is configured to:
acquire a vector expression of the question of the user;
acquire a vector expression of each code segment in the code segment database; and
acquire, based on the vector expression of the question of the user and the vector expression of each code segment, the code segment with the highest similarity to the question of the user from the code segment database as the target code segment.
Further alternatively, as shown in Fig. 6, in one embodiment of the present disclosure, the large model-based question-answer processing apparatus 600 further includes:
a relationship obtaining module 604, configured to obtain, from the graph database, the relationships existing among the target code segment and the at least one relevant code segment;
and the generating module 603 is configured to:
generate an answer to the question by using the large language model based on the target code segment, the at least one relevant code segment, and the relationships existing among the target code segment and the at least one relevant code segment.
Further alternatively, as shown in Fig. 6, in one embodiment of the present disclosure, the large model-based question-answer processing apparatus 600 further includes:
an acquisition module 605, configured to acquire a plurality of code documents;
a segmentation module 606, configured to segment each code document according to functions, class names, structures, intra-class members, global variables, static variables or annotations, to obtain a plurality of code segments; and
a construction module 607, configured to construct the code segment database based on the plurality of code segments of each code document.
Further alternatively, as shown in Fig. 6, in one embodiment of the present disclosure, the large model-based question-answer processing apparatus 600 further includes:
a mining module 608, configured to mine, for each code document, the relationships between different code segments in the code document;
and the construction module 607 is further configured to construct the graph database based on the relationships between different code segments in each code document.
The large model-based question-answer processing apparatus 600 of this embodiment implements question-answer processing by adopting the above modules. Its implementation principle and technical effect are the same as those of the related method embodiments above; for details, reference may be made to the description of the related method embodiments, which will not be repeated herein.
In the technical solution of the present disclosure, the acquisition, storage, application, and the like of the user personal information involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 7 illustrates a schematic block diagram of an example electronic device 700 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 7, the device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 may also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706, e.g., a keyboard, a mouse, etc.; an output unit 707, e.g., various types of displays, speakers, etc.; a storage unit 708, e.g., a magnetic disk, an optical disk, etc.; and a communication unit 709, e.g., a network card, a modem, a wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above, such as the above-described methods of the present disclosure. For example, in some embodiments, the above-described methods of the present disclosure may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the above-described methods of the present disclosure may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the above-described methods of the present disclosure by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), and the Internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.