CN116521866A - Training sample construction method and device, electronic equipment and medium - Google Patents

Info

Publication number
CN116521866A
Authority
CN
China
Prior art keywords
code
training
training sample
content
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310381165.7A
Other languages
Chinese (zh)
Inventor
赵悦浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310381165.7A priority Critical patent/CN116521866A/en
Publication of CN116521866A publication Critical patent/CN116521866A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/36Preventing errors by testing or debugging software
    • G06F11/362Software debugging
    • G06F11/3628Software debugging of optimised code

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure provides a training sample construction method and device, an electronic device, and a medium, and relates to the technical field of the Internet, in particular to the technical fields of big data and code management. The specific implementation scheme is as follows: code written by a user is acquired from a code hosting platform, and a task card submitted by the user for the code is acquired, wherein the task card includes the type and description information of the code. A training sample is then constructed according to the type of the code and the code, and a training label of the training sample is set based on the description information of the code. Model training efficiency is thereby improved.

Description

Training sample construction method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to the field of big data and code management technologies.
Background
In a code intelligence scenario, the code may be used as a training sample to train an artificial intelligence (Artificial Intelligence, AI) model, enabling the AI model to support functions such as code interpretation or code repair.
Disclosure of Invention
The disclosure provides a training sample construction method and device, an electronic device, and a medium.
In a first aspect of an embodiment of the present disclosure, a training sample construction method is provided, including:
acquiring code written by a user from a code hosting platform;
acquiring a task card submitted by the user for the code, wherein the task card includes the type and description information of the code;
constructing a training sample according to the type of the code and the code; and
setting a training label of the training sample based on the description information of the code.
In a second aspect of embodiments of the present disclosure, there is provided a training sample construction apparatus, including:
an acquisition module, configured to acquire code written by a user from a code hosting platform;
the acquisition module being further configured to acquire a task card submitted by the user for the code, wherein the task card includes the type and description information of the code;
a construction module, configured to construct a training sample according to the type of the code and the code acquired by the acquisition module; and
a setting module, configured to set a training label of the training sample based on the description information of the code acquired by the acquisition module.
In a third aspect of the disclosed embodiments, there is provided an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first aspects.
A fourth aspect of embodiments of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method according to any one of the first aspects.
A fifth aspect of embodiments of the present disclosure provides a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the first aspects.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a training sample construction method provided by an embodiment of the present disclosure;
FIG. 2 is an exemplary schematic diagram of a task card provided by an embodiment of the present disclosure;
FIG. 3 is an exemplary schematic diagram of a display interface provided by an embodiment of the present disclosure;
FIG. 4 is a flow chart of a method of setting training labels provided by an embodiment of the present disclosure;
FIG. 5 is a flow chart of another method of setting training labels provided by an embodiment of the present disclosure;
FIG. 6 is an exemplary schematic diagram of another task card provided by an embodiment of the present disclosure;
FIG. 7 is an exemplary schematic diagram of another display interface provided by an embodiment of the present disclosure;
FIG. 8 is an exemplary schematic diagram of a training sample construction process provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a training sample construction device according to an embodiment of the present disclosure;
FIG. 10 is a block diagram of an electronic device used to implement a training sample construction method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
At present, the training samples used to train an AI model are obtained from public code sets. Code in public code sets usually carries no labels, so the training samples must be labeled manually after they are obtained, which requires a great deal of manpower and time and makes model training inefficient.
To improve the training efficiency of a model, an embodiment of the present disclosure provides a training sample construction method. The method is applied to an electronic device, for example a server, a desktop computer, or a virtual machine with data processing capability. As shown in FIG. 1, the training sample construction method according to the embodiment of the disclosure includes the following steps:
S101, acquiring code written by a user from a code hosting platform.
The user may be a software engineer, hereinafter referred to as an engineer, and the code written by the engineer may be business code internal to a company. Code written by engineers must be submitted to a code hosting platform before it goes online.
S102, acquiring a task card submitted by the user for the code.
The task card includes the type and description information of the code.
The type of the code may be Task or Bug. The Task type may also be called the demand task type: code of the demand task type is code written for a demand task, for example calculating a weighted sum of data or querying a data table. The Bug type may also be called the error-eliminating type: code of the Bug type is used to replace previously submitted problem code that contains errors.
The description information describes the meaning of the code. For example, the description information may be "a function that calculates a weighted sum of data" or "a function that queries a data table".
S103, constructing a training sample according to the type of the code and the code.
The embodiment of the disclosure may directly use the code written by the user as a training sample, or may construct the training sample in other ways; the specific construction methods are described below.
S104, setting a training label of the training sample based on the description information of the code.
The embodiment of the disclosure may directly use the description information as the training label, or may set the training label in other ways; the specific setting methods are described below.
With the above method, the training sample is constructed from the code written by the user and the type of the code, while the training label of the training sample is generated automatically from the description information of the code. Because the description information represents the meaning of the code, using it to construct the training label improves the accuracy of the label. Moreover, the code written by engineers is internal company code rather than public external code, so the type and description information of the code can be obtained from within the company, and the training samples need not be labeled manually after they are obtained. The manpower and time required to obtain training samples and training labels are therefore reduced, the efficiency, accuracy, and timeliness of obtaining them are improved, and model training efficiency is improved in turn.
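As a minimal sketch of steps S101 to S104 in their simplest form (the code itself is the sample and the description content is the label), the flow could be illustrated as follows. The class names `TaskCard` and `TrainingSample`, the field names, and the function `build_sample` are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class TaskCard:
    """Hypothetical task card record: type plus description information."""
    card_id: str
    code_type: str  # "Task" or "Bug"
    title: str      # brief description of the code
    content: str    # detailed description of the code


@dataclass
class TrainingSample:
    sample: str  # the code used as model input
    label: str   # the training label derived from the task card


def build_sample(code: str, card: TaskCard) -> TrainingSample:
    # S103 (simplest case): use the submitted code directly as the sample.
    # S104 (simplest case): use the description content directly as the label.
    return TrainingSample(sample=code, label=card.content)


card = TaskCard("CARD-1", "Task", "add a division function", "add a division function")
result = build_sample("func Div(a, b float64) float64 { return a / b }", card)
```

The refinements described in the remainder of the description (splitting by function, checking the content against a preset specification) would replace the two one-line choices inside `build_sample`.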
The training sample construction method provided by the embodiment of the present disclosure is specifically described below:
In some embodiments of the present disclosure, before acquiring the task card submitted by the user for the code in S102, the electronic device may further acquire, from the code hosting platform, the code submission information (commit message) submitted by the user for the written code. The code submission information includes the identifier of the task card submitted by the user for the code.
In the embodiment of the disclosure, the electronic device may periodically acquire the code and submission information that users submitted to the code hosting platform during the previous period. For example, at 0:00 each day, it acquires the code and submission information submitted to the code hosting platform during the previous day.
Similarly, the electronic device may periodically acquire the task cards that users submitted to the demand management platform. For example, at 0:00 each day, it acquires the task cards submitted to the demand management platform during the previous day.
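The daily acquisition window could be computed as in the following sketch. It assumes that each submission record carries a `submitted_at` timestamp; that field name and both function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def previous_day_window(run_time: datetime) -> Tuple[datetime, datetime]:
    """Return [start, end) covering the calendar day before run_time."""
    end = run_time.replace(hour=0, minute=0, second=0, microsecond=0)
    start = end - timedelta(days=1)
    return start, end


def submitted_in_window(items: List[Dict], start: datetime, end: datetime) -> List[Dict]:
    # Keep only records whose (hypothetical) submitted_at falls inside the window.
    return [item for item in items if start <= item["submitted_at"] < end]


# Example: a job that runs at 0:00 collects everything from the previous day.
start, end = previous_day_window(datetime(2023, 4, 11, 0, 0))
records = [
    {"id": "a", "submitted_at": datetime(2023, 4, 10, 15, 30)},
    {"id": "b", "submitted_at": datetime(2023, 4, 11, 0, 0)},
]
selected = submitted_in_window(records, start, end)
```

The half-open interval means that a record submitted exactly at 0:00 belongs to the next day's run, so no record is collected twice.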
In the embodiment of the disclosure, the first user who submits the code and the code submitting information to the code hosting platform may be a manager or an engineer who writes the code, etc.; likewise, the second user submitting the task card to the demand management platform may be a manager or an engineer writing code, etc.
The first user and the second user may be the same user. For example, before submitting code, an engineer fills in a task card for the code that the engineer wrote, then submits the code and the code submission information to the code hosting platform and submits the task card for the code to the demand management platform.
The first user and the second user may also be different users. For example, a product manager fills in a task card in advance for code that an engineer is required to write and submits the task card to the demand management platform; after writing the code, the engineer looks up the identifier of the task card corresponding to the code, fills the identifier into the code submission information, and then submits the written code and the code submission information to the code hosting platform.
It should be noted that, the electronic devices where the code hosting platform and the demand management platform are located and the electronic devices to which the training sample construction method is applied may be the same device or different devices, which is not particularly limited in the embodiments of the present disclosure.
Because the code submission information includes the identifier (ID) of the task card, and the identifier of each task card is unique, the electronic device, when executing S102 to acquire the task card submitted by the user for the code, may look up the task card corresponding to the identifier among the task cards of the demand management platform. The electronic device may look up the task card directly on the demand management platform; alternatively, the electronic device may periodically synchronize the task cards from the demand management platform and look up the task card corresponding to the identifier among the synchronized task cards.
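The lookup against synchronized task cards could be sketched as a simple join on the task card ID. The record layout (dictionaries with `card_id`, `type`, `title`, and `content` keys) and the function name `attach_task_cards` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def attach_task_cards(commits: List[Dict], cards_by_id: Dict[str, Dict]) -> List[Tuple[Dict, Dict]]:
    """Pair each commit with the synchronized task card that its ID refers to."""
    paired = []
    for commit in commits:
        card = cards_by_id.get(commit["card_id"])
        if card is not None:  # skip commits whose card has not been synchronized yet
            paired.append((commit, card))
    return paired


# Task cards previously synchronized from the demand management platform, keyed by unique ID.
cards = {"T-1": {"type": "Task", "title": "add a division function",
                 "content": "add a division function"}}
commits = [
    {"code": "package main ...", "card_id": "T-1"},
    {"code": "package util ...", "card_id": "T-9"},
]
paired = attach_task_cards(commits, cards)
```

Because the ID of each task card is unique, a dictionary keyed by ID is sufficient and each lookup is a constant-time operation.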
It should be noted that, because the amount of data that can be carried in the commit message is limited, it is difficult to directly carry specific contents such as the type of the code and the description information, so the embodiments of the present disclosure record these specific contents in the task card, and carry the identifier of the task card corresponding to the code in the commit message, so as to associate the code with these specific contents.
Optionally, the code submission information may also include a title of the task card, thereby facilitating faster understanding of the use of the code by virtue of the title in the code submission information.
Because code currently must be submitted to the code hosting platform before it goes online, the embodiment of the disclosure uses this mechanism and adds the identifier of the task card to the submission information, so that code in the code hosting platform can be associated with its task card, which facilitates the subsequent acquisition of the type and description information of the code.
Moreover, at a large internet company, software engineers make tens of thousands of code submissions in total per workday, involving updates to millions of lines of code, which is a huge amount of code data. The embodiment of the disclosure constructs training samples from the code that engineers submit after writing it, so that enough training samples can be obtained and the generalization of the subsequently trained model is ensured. In addition, the code inside a company is correlated with the company's business, so a model trained on samples constructed from that code recognizes the company's business more accurately, which improves the recall of the model and helps improve the enterprise's research and development efficiency.
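The disclosure does not specify how the task card identifier is written into the commit message. As one possible sketch, assume a hypothetical convention in which the first line of the commit message starts with `[card:<ID>]`, optionally followed by the task card title:

```python
import re
from typing import Dict, Optional

# Hypothetical convention (not from the disclosure): "[card:<ID>] <optional title>".
CARD_ID_PATTERN = re.compile(r"^\[card:(?P<card_id>[A-Za-z0-9_-]+)\]\s*(?P<title>.*)$")


def parse_commit_message(message: str) -> Optional[Dict[str, str]]:
    """Extract the task card ID and optional title from the first line."""
    first_line = message.splitlines()[0] if message else ""
    match = CARD_ID_PATTERN.match(first_line)
    if match is None:
        return None  # the commit is not associated with a task card
    return {"card_id": match.group("card_id"), "title": match.group("title")}


parsed = parse_commit_message("[card:T-42] add a division function\n\ndetails")
```

Only the short identifier travels in the commit message; the type and description information stay in the task card and are retrieved by ID.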
In some embodiments of the present disclosure, S103 constructs the training sample according to the type of the code and the code in one of the following two ways, depending on the type of the code:
If the type of the code is the demand task type, the code is split into a plurality of functions, and each function is used as a training sample. Code of the demand task type is code written for a demand task.
For example, FIG. 2 shows a task card of the demand task type, in which the code type is Task, the title is "add a division function", and the content is "add a division function". As shown in FIG. 2, the task card may further include: the plan to which the code belongs (XXXX), the complete release flow of the code (state 1 to state 5), and the person responsible for the code (AA). In FIG. 2, state 1 is the current state of the code release flow, state 2 is a reachable state, and states 3 to 5 are unreachable states. A development requirement can be abstractly split into a plurality of plans of different levels and types; a plan may also be called a task. Generally, the code that a user writes at one time corresponds to one or more tasks. Recording the tasks of the code in the task card yields the association between development activity and code, which makes the development requirement easier to manage.
After the user submits the task card, the code corresponding to the task card can be viewed in a display interface. For example, FIG. 3 shows a display interface in which the code corresponding to the task card includes "package main xxxxxxx xxxxxxx", where "x" stands for the actual code. The code may be displayed after a click on the "math.go" button on the left is detected, that is, the code content under "math.go" on the right is displayed. As shown in FIG. 3, the display interface may further include: the title of the code, the time the code was updated, a reviewer's review of the code, the submission record of the code, the file change record of the code, and so on. "xxxx-xx-xxxx: xx" represents a time.
The code that a user writes each time is typically a code package composed of one or more functions. For example, "package main xxxxxxx xxxxxxx" in FIG. 3 is the main function in the code package.
Through the method, the code written by the user can be split by taking the function as a unit, each split function is used as a training sample, the number of the acquired training samples is further increased, and generalization of the model obtained through subsequent training is guaranteed. And the codes written by the user are split, so that the code length included in each training sample can be reduced, and the training efficiency of the model is improved.
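Splitting code by function could be sketched as follows. The figures show a Go file; purely for illustration, this sketch assumes the submitted code is Python source and uses the standard `ast` module, and it only considers top-level functions. The function name `split_into_functions` is hypothetical.

```python
import ast
from typing import List


def split_into_functions(source: str) -> List[str]:
    """Return the source text of each top-level function in the given code."""
    tree = ast.parse(source)
    functions = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment returns the exact text spanned by the node.
            functions.append(ast.get_source_segment(source, node))
    return functions


source = '''def add(a, b):
    return a + b

def div(a, b):
    return a / b
'''
samples = split_into_functions(source)
```

Each element of `samples` would then become one training sample, so a single submission yields as many samples as it has functions.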
In the embodiment of the disclosure, the description information of the codes in the task card comprises titles and contents for describing the meaning of the codes. Wherein, the content is a detailed description of the meaning of the code, and the title is a brief introduction of the meaning of the code.
When the type of the code is the demand task type, S104 sets the training label of the training sample based on the description information of the code in one of the following two modes.
Referring to FIG. 4, when the type of the code is the demand task type, mode 1 of setting the training label includes the following steps:
S401, detecting whether the content meets a preset specification. If yes, S402 is executed; if not, S403 is executed.
The preset specification may include: the number of words in the content is greater than a preset number of words, and/or the content is not in a preset blacklist, etc.
If the number of words in the content is less than or equal to the preset number of words, the content is too short and describes the meaning of the code too briefly, so it has little value as a training label. For example, when the preset number of words is 2, contents such as "function", "add", and "delete" do not meet the preset specification.
The preset blacklist contains contents that cannot be used as training labels. For example, the preset blacklist includes: "upper", "above", "same heading" and "slightly", etc. When filling in a task card, a user may not fully understand the difference between the title and the content and may mistakenly regard them as identical, and therefore fill the content with such words. These words do not actually represent the meaning of the code, so the content does not meet the preset specification and cannot be used as a training label.
S402, taking the content as a training label.
When the content meets the preset specification, the content very likely reflects the meaning of the code, so the content can be used as the training label of the training sample.
S403, taking the title as a training label.
When the content does not meet the preset specification, the content is unlikely to reflect the meaning of the code, so the title is used as the training label of the training sample.
With the above method, whether the content meets the preset specification is detected, so that whichever of the content or the title reflects the meaning of the code is set as the training label of the training sample, which improves the accuracy and validity of the training label.
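Steps S401 to S403 could be sketched as follows. The sketch applies both conditions of the preset specification together (the description allows "and/or"), counts whitespace-separated words, and uses illustrative blacklist entries; the threshold value, the blacklist entries, the counting rule, and the function names are all assumptions.

```python
from typing import FrozenSet

PRESET_WORD_COUNT = 2  # assumed threshold, matching the example in the description
# Hypothetical blacklist entries, chosen only for illustration.
PRESET_BLACKLIST: FrozenSet[str] = frozenset({"same as above", "see title"})


def meets_preset_spec(content: str,
                      min_words: int = PRESET_WORD_COUNT,
                      blacklist: FrozenSet[str] = PRESET_BLACKLIST) -> bool:
    """S401: content must not be blacklisted and must exceed the word count."""
    text = content.strip()
    if text.lower() in blacklist:
        return False
    return len(text.split()) > min_words


def choose_label(title: str, content: str) -> str:
    # S402: use the content when it meets the specification.
    # S403: otherwise fall back to the title.
    return content if meets_preset_spec(content) else title


label_a = choose_label("add a division function", "add a function that divides two numbers")
label_b = choose_label("add a division function", "same as above")
label_c = choose_label("add a division function", "add")
```

For text written in a language without spaces between words, the word count would need a different rule, for example a character count.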
Referring to FIG. 5, when the type of the code is the demand task type, mode 2 of setting the training label includes the following steps:
S501, detecting whether the content meets the preset specification, and identifying whether the training sample includes a function interpretation.
For the manner of detecting whether the content meets the preset specification, refer to the description of S401, which is not repeated here.
When writing code, an engineer may record the engineer's own description of the code, i.e., a function interpretation, in a function. For example, the engineer may record the title and content of the task card associated with the code in the main function of the code. Alternatively, the engineer may record other forms of function interpretation, which the embodiments of the disclosure do not specifically limit.
For example, the function includes: {"code": "xxxxxxxxxx", "card_title": "[Task] add a division function", "card_content": "add a division function"}, and the Chinese text in the function can be identified to obtain the function interpretation of the function, where "x" stands for the actual code.
Optionally, the Chinese text identified from the function may be deduplicated, thereby reducing duplicate content in the training label.
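Identifying and deduplicating the Chinese text in a function could be sketched with a regular expression over the basic CJK Unified Ideographs range. The disclosure does not say how the text is identified, so the pattern and the function name `extract_function_interpretation` are assumptions.

```python
import re
from typing import List

# Runs of characters in the basic CJK Unified Ideographs block.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_function_interpretation(function_source: str) -> List[str]:
    """Collect Chinese phrases from a function, dropping repeated phrases."""
    seen = set()
    phrases = []
    for phrase in CJK_RUN.findall(function_source):
        if phrase not in seen:  # deduplicate while keeping first-seen order
            seen.add(phrase)
            phrases.append(phrase)
    return phrases


# The phrase below means "add a division function" and appears twice.
source = "// 添加除法函数\nfunc Div(a, b int) int { return a / b } // 添加除法函数"
interpretation = extract_function_interpretation(source)
```

An empty result would indicate that the training sample does not include a function interpretation.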
S502, if the content meets the preset specification and the training sample includes a function interpretation, the content and the function interpretation are used as the training label.
The content meets the preset specification and therefore very likely reflects the meaning of the code; the training sample includes a function interpretation, which can also reflect the meaning of the code. The content and the function interpretation are therefore used together as the training label.
Optionally, the function interpretation and the content may be deduplicated, and the deduplicated result is used as the training label of the training sample, thereby reducing duplicate content in the training label.
S503, if the content meets the preset specification and the training sample does not include a function interpretation, the content is used as the training label.
The content meets the preset specification and therefore very likely reflects the meaning of the code, and the training sample does not include a function interpretation, so the content is used as the training label.
S504, if the content does not meet the preset specification and the training sample includes a function interpretation, the title and the function interpretation are used as the training label.
The content does not meet the preset specification and is therefore unlikely to reflect the meaning of the code; the training sample includes a function interpretation, which can reflect the meaning of the code. The title and the function interpretation are therefore used together as the training label.
Optionally, the function interpretation and the title may be deduplicated, and the deduplicated result is used as the training label of the training sample, thereby reducing duplicate content in the training label.
S505, if the content does not meet the preset specification and the training sample does not include a function interpretation, the title is used as the training label.
The content does not meet the preset specification and is therefore unlikely to reflect the meaning of the code, and the training sample does not include a function interpretation, so the title is used as the training label.
Because the description information and the function interpretation can embody the meaning of the function, when the function interpretation is included in the training sample, the accuracy and the comprehensiveness of the training label can be improved by setting the training label in combination with the function interpretation and the description information.
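The four branches S502 to S505 reduce to two independent choices, as the following sketch shows. The result of the preset-specification check is passed in as a boolean so that the sketch stays self-contained; the function name `choose_labels` and the exact-match deduplication rule are assumptions.

```python
from typing import List, Optional


def choose_labels(title: str, content: str,
                  function_interpretation: Optional[str],
                  content_ok: bool) -> List[str]:
    """Return the label parts for mode 2 (S502 to S505)."""
    # Choice 1: content if it meets the preset specification, otherwise the title.
    base = content if content_ok else title
    labels = [base]
    # Choice 2: append the function interpretation when the sample has one,
    # skipping it if it merely repeats the chosen description (simple dedup).
    if function_interpretation and function_interpretation != base:
        labels.append(function_interpretation)
    return labels


s502 = choose_labels("title", "content", "interpretation", True)
s503 = choose_labels("title", "content", None, True)
s504 = choose_labels("title", "content", "interpretation", False)
s505 = choose_labels("title", "content", None, False)
```

A finer-grained deduplication, for example at phrase level, could replace the exact-match comparison.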
It will be appreciated that when the type of the code submitted by the user is the demand task type, the training samples are the individual functions in the code, and the training labels are the meanings of the functions. A model trained with these training samples and their corresponding training labels can identify the meaning of a function. For example, the trained model may be used for code interpretation.
If the type of the code written by the user is the error-eliminating type, the target problem code corresponding to the code written by the user is acquired, and the training sample is constructed based on the target problem code. Code of the error-eliminating type is used to replace the corresponding problem code.
In the embodiment of the present disclosure, when the type of the acquired code written by the user is Bug, i.e., the error-eliminating type, the acquired code may be called repair code, i.e., code obtained by repairing the target problem code. The target problem code was submitted earlier than the acquired code written by the user. The acquired code written by the user implements the same function as the target problem code but does not contain the error present in the target problem code.
When the type of the code written by the user is Bug, the code submission information further includes the identifier of the target problem code corresponding to the code, i.e., the ID of the target problem code, so that the electronic device can acquire the target problem code based on that ID.
Optionally, when acquiring the target problem code, the electronic device may look up the target problem code directly on the code hosting platform based on the ID of the target problem code. Alternatively, based on the ID of the target problem code, the electronic device may first check whether the target problem code exists in local storage and look it up on the code hosting platform only if it does not, which speeds up acquisition of the target problem code.
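The local-first lookup could be sketched as follows. The platform access is represented by a caller-supplied function, because the disclosure names no API; `get_problem_code`, `fetch_from_platform`, and the step of writing the fetched code back to the local store are assumptions.

```python
from typing import Callable, Dict, List


def get_problem_code(code_id: str,
                     local_store: Dict[str, str],
                     fetch_from_platform: Callable[[str], str]) -> str:
    """Return the target problem code, preferring the local store."""
    if code_id in local_store:
        return local_store[code_id]
    code = fetch_from_platform(code_id)  # slower path: ask the code hosting platform
    local_store[code_id] = code          # assumed: keep a copy for later lookups
    return code


# Stand-in for the code hosting platform, recording how often it is queried.
platform_calls: List[str] = []


def fake_fetch(code_id: str) -> str:
    platform_calls.append(code_id)
    return "def div(a, b):\n    return a * b"


store: Dict[str, str] = {}
first = get_problem_code("c1", store, fake_fetch)
second = get_problem_code("c1", store, fake_fetch)
```

In this run the platform is queried once; the second call is served from the local store.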
The target problem code can be directly used as a training sample when the training sample is constructed.
Alternatively, a problem function with an error in the target problem code is determined, and the problem function is used as a training sample.
The difference data between the target problem code and the repair code, i.e., the difference (DIFF) data, may be obtained from the code hosting platform or by comparing the target problem code with the repair code. The function containing the data that differs from the repair code can then be extracted from the target problem code; the extracted function is the problem function, and the problem function is used as a training sample.
The target problem code may contain a plurality of functions, some of which have no problem. Using those problem-free functions as training samples would increase the workload of model training while doing little to improve the recognition accuracy of the model. The problem function can therefore be extracted from the target problem code and used as a training sample, which improves the efficiency of model training.
When the type of the code written by the user is the error-eliminating type, the code is repair code obtained by repairing the corresponding target problem code, so the target problem code very likely contains an error. A training sample can therefore be constructed based on the target problem code, so that a model trained with the training sample is able to identify problem code.
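Extracting the problem function from the difference data could be sketched as follows. The sketch computes the difference locally with the standard `difflib` module and, purely for illustration, assumes the code is Python source so that function boundaries can be read from the `ast` module; the function names are hypothetical.

```python
import ast
import difflib
from typing import List, Set


def changed_line_numbers(problem_code: str, repair_code: str) -> Set[int]:
    """1-based line numbers in problem_code that differ from repair_code."""
    old, new = problem_code.splitlines(), repair_code.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed: Set[int] = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if i1 == i2:
            # Pure insertion: attribute it to the line before the insertion point.
            changed.add(max(i1, 1))
        else:
            changed.update(range(i1 + 1, i2 + 1))
    return changed


def problem_functions(problem_code: str, repair_code: str) -> List[str]:
    """Top-level functions of problem_code that contain a changed line."""
    changed = changed_line_numbers(problem_code, repair_code)
    result = []
    for node in ast.parse(problem_code).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if any(node.lineno <= line <= node.end_lineno for line in changed):
                result.append(ast.get_source_segment(problem_code, node))
    return result


problem = "def add(a, b):\n    return a + b\n\ndef div(a, b):\n    return a * b\n"
repair = "def add(a, b):\n    return a + b\n\ndef div(a, b):\n    return a / b\n"
extracted = problem_functions(problem, repair)
```

Here only `div` differs between the two versions, so `add` is left out of the training samples.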
In the case that the type of the code written by the user is the error-eliminating type, the above S104 sets the training label of the training sample based on the description information of the code, which includes the following two ways.
In the case where the type of code written by the user is the error-eliminating type, the mode 1 of setting the training tag may be implemented as: and detecting whether the content meets a preset specification. If yes, taking the content as a meaning label of the training sample, and taking a code written by a user as a repair label of the training sample; if not, the title is used as the meaning label of the training sample, and the code written by the user is used as the repair label of the training sample.
The manner of detecting whether the content meets the preset specification may refer to the description related to S401, which is not repeated here.
Since the description information in the task card may include a title and content for describing the meaning of the code, in the case where the type of code written by the user is the error elimination type, the title may be a brief summary of the error repaired by the code, and the content may be a detailed description of that error.
For example, FIG. 6 shows a task card for code of the error elimination type, in which the code type is bug, the title is "fix divide-by-0 error", and the content is "divide-by-0 error causes the program to crash". Other information may also be included in the task card; reference may be made to the description of fig. 2, which is not repeated here.
The title or the content is used as the meaning label of the training sample, so that the model obtained through training of the training sample and the meaning label can identify the meaning of the error in the problem code, namely, the error type of the problem code. The repair code is used as a repair label of the training sample, so that the problem code can be repaired by a model obtained through training of the training sample and the repair label, and the repaired code is output.
Optionally, a positioning label may also be set for the training sample, where the positioning label is the DIFF associated data between the repair code and the problem code, so that a model trained based on the training sample and the positioning label is able to locate errors in the code.
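One possible way to produce such DIFF associated data is a unified diff between the two code versions. The sketch below uses Python's standard `difflib`; the disclosure does not specify the diff format, so the unified format and the function name `positioning_label` are assumptions.

```python
import difflib


def positioning_label(problem_code: str, repair_code: str) -> str:
    """Build a unified diff from the problem code to the repair code."""
    diff = difflib.unified_diff(
        problem_code.splitlines(),
        repair_code.splitlines(),
        fromfile="problem_code",
        tofile="repair_code",
        lineterm="",
    )
    return "\n".join(diff)


problem = "def div(a, b):\n    return a / b\n"
repair = "def div(a, b):\n    return a / b if b else None\n"
label = positioning_label(problem, repair)
```

The hunk headers and the lines prefixed with `-` and `+` indicate where the problem code was changed, which is the location information the positioning label is meant to carry.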
After the user submits the task card, the repair code and the problem code corresponding to the task card can be checked in the display interface. For example, fig. 7 shows a display interface, where the left code is a problem code and the right code is a repair code, where the "x" under "math. Other information may also be included in the display interface, and specific reference may be made to the description of fig. 3, which is not repeated here.
Because the repair code written by the user is the result of repairing the problem code, the repair code can be used as the repair label, so that a model trained based on the repair label has the ability to repair problem code. The title and the content describe the error repaired by the code, so using the title or the content as the meaning label gives a model trained based on the meaning label the ability to recognize the meaning of the error in the problem code. Therefore, embodiments of the present disclosure can improve the accuracy and comprehensiveness of the training labels.
In the case where the type of code written by the user is the error elimination type, mode 2 of setting the training label may be implemented as the following steps:
Step one: detect whether the content meets the preset specification, and identify whether the code written by the user includes a function interpretation.
The manner of detecting whether the content meets the preset specification may refer to the description related to S401, which is not repeated here.
Step two: if the content meets the preset specification and the code written by the user includes a function interpretation, the content and the function interpretation are used as the meaning label of the training sample, and the code written by the user is used as the repair label of the training sample.
Step three: if the content meets the preset specification and the code written by the user does not include a function interpretation, the content is used as the meaning label of the training sample, and the code written by the user is used as the repair label of the training sample.
Step four: if the content does not meet the preset specification and the code written by the user includes a function interpretation, the title and the function interpretation are used as the meaning label of the training sample, and the code written by the user is used as the repair label of the training sample.
Step five: if the content does not meet the preset specification and the code written by the user does not include a function interpretation, the title is used as the meaning label of the training sample, and the code written by the user is used as the repair label of the training sample.
The specific implementation of mode 2 of setting the training label is similar to that of fig. 5; reference may be made to the related description of fig. 5, which is not repeated here.
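The four branches of steps two to five reduce to two independent choices, as the following sketch shows. It assumes the result of the specification check and the extracted function interpretation are already available as inputs; the function name and the dictionary keys are hypothetical.

```python
from typing import Optional


def set_labels_mode2(
    title: str,
    content: str,
    repair_code: str,
    content_meets_spec: bool,
    function_interpretation: Optional[str],
) -> dict:
    """Combine title or content with the function interpretation, if any."""
    # Choice 1: content when it meets the preset specification, else the title.
    meaning = [content if content_meets_spec else title]
    # Choice 2: append the function interpretation only when the code has one.
    if function_interpretation:
        meaning.append(function_interpretation)
    return {"meaning_label": meaning, "repair_label": repair_code}


step2 = set_labels_mode2("T", "C", "fix", True, "F")
step3 = set_labels_mode2("T", "C", "fix", True, None)
step4 = set_labels_mode2("T", "C", "fix", False, "F")
step5 = set_labels_mode2("T", "C", "fix", False, None)
```

Keeping the meaning label as a list is a design choice of this sketch; the disclosure only states that the two pieces of text are used together as the meaning label.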
Illustratively, the user-written repair code includes: "problem_code": "xxxxxxxxxx", "fixed_code": "xxxxx", "card_title": "bug fix: program crash caused by division by 0", and "card_content": "division by 0", where "x" represents the actual code. Chinese text can be identified from the code to obtain the function interpretation of the code.
Optionally, the Chinese text identified from the code may also be deduplicated, thereby reducing duplicate content in the meaning label.
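Identifying and deduplicating Chinese text can be sketched with a regular expression. This is only one possible approach: the disclosure does not state how the Chinese text is identified, and the pattern below covers only the basic CJK Unified Ideographs block, so punctuation splits a comment into several runs.

```python
import re

# Runs of characters in the basic CJK Unified Ideographs block (an assumption).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def function_interpretation(code: str) -> list[str]:
    """Return the distinct runs of Chinese text in code, in first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(CJK_RUN.findall(code)))


code = (
    "def div(a, b):\n"
    "    # 修复除零错误\n"
    "    if b == 0:  # 修复除零错误\n"
    "        return None\n"
    "    return a / b  # 返回商\n"
)
interpretation = function_interpretation(code)
```

The repeated comment appears once in the result, so the meaning label does not carry the same text twice.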
Because the repair code written by the user is the result of repairing the problem code, the repair code can be used as the repair label, so that a model trained based on the repair label has the ability to repair problem code. Moreover, the title and the content describe the error repaired by the code, and the function interpretation in the code can also reflect the repaired error, so using the title or the content together with the function interpretation as the meaning label gives a model trained based on the meaning label the ability to recognize the meaning of the error in the problem code. Therefore, embodiments of the present disclosure can improve the accuracy and comprehensiveness of the training labels.
It can be understood that when the type of the code written by the user is the error elimination type, the code written by the user is the repair code. In this case, the training sample is obtained based on the problem code, and the training labels are obtained based on the repair code, the description information, and the function interpretation. A model trained with the training sample and the corresponding training labels can therefore repair problem code to obtain repair code, and/or identify the type or cause of the error in the problem code. For example, the trained model may be used for code repair.
In some embodiments of the present disclosure, after constructing the training samples and corresponding training labels, the electronic device may also store the training samples and training labels in a database.
Then, the electronic device can periodically acquire the training samples and training labels in the database and perform data cleaning on them, that is, delete the training samples and corresponding training labels that meet the filtering conditions.
Optionally, the filtering conditions may include any one or a combination of the following conditions: the file size of the training sample exceeds a preset number of bytes, the number of code lines included in the training sample exceeds a preset number of lines, the number of characters of the training label does not exceed a preset number of characters, and the training label is meaningless content.
For example, the training label "this is a test" is meaningless content.
A natural language processing (NLP) word segmentation algorithm can be used to identify whether the training label is meaningless content. Other algorithms may also be used for this purpose; embodiments of the present disclosure are not particularly limited in this respect.
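The data cleaning step can be sketched as a simple filter. All thresholds below are hypothetical, since the disclosure only calls them "preset", and the meaningless-content check is a phrase lookup standing in for the NLP word segmentation algorithm mentioned above.

```python
from dataclasses import dataclass


@dataclass
class Record:
    sample: str  # training sample (code)
    label: str   # training label (text)


# Hypothetical preset values; the disclosure does not give concrete numbers.
MAX_BYTES = 64 * 1024
MAX_LINES = 500
MIN_LABEL_CHARS = 5
# Stand-in for an NLP-based check of meaningless content.
MEANINGLESS = {"this is a test", "test", "todo"}


def should_filter(record: Record) -> bool:
    """Return True if the record meets any one of the filtering conditions."""
    return (
        len(record.sample.encode("utf-8")) > MAX_BYTES
        or len(record.sample.splitlines()) > MAX_LINES
        or len(record.label) <= MIN_LABEL_CHARS
        or record.label.strip().lower() in MEANINGLESS
    )


def clean(records: list[Record]) -> list[Record]:
    """Delete the records that meet the filtering conditions."""
    return [r for r in records if not should_filter(r)]


records = [
    Record("def f():\n    return 1\n", "Return the constant one"),
    Record("x = 1", "this is a test"),                 # meaningless label
    Record("x = 1", "fix"),                            # label too short
    Record("\n".join(["pass"] * 501), "A valid label"),  # too many lines
]
kept = clean(records)
```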
Referring to fig. 8, the overall flow of the training sample construction method provided by the embodiments of the present disclosure is described below in combination with an actual application scenario:
In this scenario, the training sample construction method provided by the embodiments of the present disclosure is implemented by a construction system, which includes: a construction platform, a database, and a data collector. The construction platform, database, and data collector may be in the same electronic device or in different electronic devices.
The engineer submits the code and the commit message to the code hosting platform through the user terminal, and submits the task card to the demand management platform, wherein the commit message includes the ID of the task card associated with the code.
The construction platform periodically acquires, from the code hosting platform, the code and commit messages written and submitted by users in the previous period, and periodically acquires, from the demand management platform, the task cards submitted by users in the previous period. It then associates each piece of code with its task card according to the task card ID in the commit message of that code.
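The association step can be sketched as a lookup keyed by the task card ID found in each commit message. The ID format (`CARD-` followed by digits in the example) and the dictionary fields are assumptions made for illustration; the disclosure only states that the commit message contains the task card ID.

```python
import re

# Hypothetical ID format: uppercase prefix, hyphen, digits (e.g. "CARD-42").
CARD_ID = re.compile(r"\b[A-Z]+-\d+\b")


def associate(commits: list[dict], cards: dict[str, dict]) -> list[tuple[str, dict]]:
    """Pair each commit's code with the task card named in its commit message."""
    pairs = []
    for commit in commits:
        match = CARD_ID.search(commit["message"])
        # Skip commits without an ID or with an ID unknown to the platform.
        if match and match.group(0) in cards:
            pairs.append((commit["code"], cards[match.group(0)]))
    return pairs


cards = {"CARD-42": {"type": "bug", "title": "Fix divide-by-0 error"}}
commits = [
    {"code": "def div(a, b): ...", "message": "CARD-42 fix divide-by-0 crash"},
    {"code": "x = 1", "message": "no card referenced"},
]
pairs = associate(commits, cards)
```

Commits that cannot be matched to a task card are simply skipped in this sketch, since without a card there is no type or description information from which to build a sample and label.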
The construction platform constructs a training sample according to the code and the type of the code, and sets a training label of the training sample based on the description information of the code. The training samples and the training labels are stored in the database in correspondence with each other.
The data collector periodically collects the training samples and corresponding training labels in the database to obtain an original data set. The data collector also performs data cleaning on the collected training samples and training labels, and builds the cleaning result into a training data set for the model.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the code information comply with relevant laws and regulations and do not violate public order and good customs.
Based on the same inventive concept, corresponding to the above method embodiments, the present disclosure provides a training sample constructing apparatus, as shown in fig. 9, including: an acquisition module 901, a construction module 902 and a setting module 903;
an acquisition module 901, configured to acquire a code written by a user from a code hosting platform;
the acquisition module 901 is further configured to acquire a task card submitted by a user for the code, where the task card includes a type and description information of the code;
a construction module 902, configured to construct a training sample according to the code and the type of the code acquired by the acquisition module 901;
a setting module 903, configured to set a training label of the training sample based on the description information of the code acquired by the acquisition module 901.
In some embodiments of the present disclosure, the construction module 902 is specifically configured to:
if the type of the code is a demand task type, splitting the code into a plurality of functions, wherein the code of the demand task type is written for the demand task;
each function is used as a training sample.
In some embodiments of the present disclosure, the description information includes a title and content for describing the meaning of the code; the setting module 903 is specifically configured to:
detecting whether the content meets a preset specification;
if yes, taking the content as a training label;
if not, the title is used as a training label.
In some embodiments of the present disclosure, the description information includes a title and content for describing the meaning of the code; the setting module 903 is specifically configured to:
detecting whether the content accords with a preset specification, and identifying whether a training sample comprises function interpretation;
if the content accords with the preset specification and the training sample comprises function interpretation, the content and the function interpretation are used as training labels;
If the content accords with the preset specification and the training sample does not comprise function interpretation, taking the content as a training label;
if the content does not meet the preset specification and the training sample comprises function interpretation, taking the title and the function interpretation as training labels;
if the content does not meet the preset specification and the training sample does not comprise function interpretation, the title is used as a training label.
In some embodiments of the present disclosure, the construction module 902 is specifically configured to:
if the type of the code written by the user is the error elimination type, acquiring a target problem code corresponding to the code written by the user, wherein the code of the error elimination type is used for replacing the corresponding problem code;
constructing a training sample based on the target problem code.
In some embodiments of the present disclosure, the construction module 902 is specifically configured to:
taking the target problem code as a training sample; or,
and determining a problem function with errors in the target problem code, and taking the problem function as a training sample.
In some embodiments of the present disclosure, the description information includes a title and content for describing the meaning of the code; the setting module 903 is specifically configured to:
detecting whether the content meets a preset specification;
If yes, taking the content as a meaning label of the training sample, and taking a code written by a user as a repair label of the training sample;
if not, the title is used as the meaning label of the training sample, and the code written by the user is used as the repair label of the training sample.
In some embodiments of the present disclosure, the description information includes a title and content for describing the meaning of the code; the setting module 903 is specifically configured to:
detecting whether the content accords with a preset specification, and identifying whether a code written by a user comprises function interpretation;
if the content accords with the preset specification and the code written by the user comprises function interpretation, the content and the function interpretation are used as meaning labels of training samples, and the code written by the user is used as a repair label of the training samples;
if the content accords with the preset specification and the code written by the user does not comprise function interpretation, taking the content as a meaning label of the training sample and taking the code written by the user as a repair label of the training sample;
if the content does not accord with the preset specification and the code written by the user comprises function interpretation, the title and the function interpretation are used as meaning labels of training samples, and the code written by the user is used as a repair label of the training samples;
If the content does not meet the preset specification and the code written by the user does not comprise function interpretation, the title is used as a meaning label of the training sample, and the code written by the user is used as a repair label of the training sample.
In some embodiments of the present disclosure, the acquisition module 901 is further configured to acquire, from the code hosting platform, code submission information submitted by a user for the written code before acquiring the task card submitted by the user for the code, the code submission information including an identification of the task card;
the acquisition module 901 is specifically configured to:
and searching the task card corresponding to the identification from the task cards of the demand management platform.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 shows a schematic block diagram of an example electronic device 1000 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the electronic device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. In the RAM 1003, various programs and data required for the operation of the electronic apparatus 1000 can also be stored. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to bus 1004.
Various components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and communication unit 1009 such as a network card, modem, wireless communication transceiver, etc. Communication unit 1009 allows electronic device 1000 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the respective methods and processes described above, such as a training sample construction method. For example, in some embodiments, the training sample construction method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into RAM 1003 and executed by computing unit 1001, one or more steps of the training sample construction method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the training sample construction method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (21)

1. A training sample construction method comprising:
acquiring codes written by users from a code hosting platform;
acquiring a task card submitted by a user aiming at the code, wherein the task card comprises the type and description information of the code;
according to the type of the codes and the codes, constructing training samples;
and setting a training label of the training sample based on the description information of the codes.
2. The method of claim 1, wherein the constructing training samples from the type of code and the code comprises:
if the type of the code is a demand task type, splitting the code into a plurality of functions, wherein the code of the demand task type is written for the demand task;
each function is used as a training sample.
3. The method of claim 2, wherein the description information includes a title and contents for describing a meaning of a code; the setting the training label of the training sample based on the description information of the code comprises the following steps:
detecting whether the content meets a preset specification;
if yes, taking the content as the training label;
and if not, taking the title as the training label.
4. The method of claim 2, wherein the description information includes a title and contents for describing a meaning of a code; the setting the training label of the training sample based on the description information of the code comprises the following steps:
detecting whether the content accords with a preset specification or not, and identifying whether function interpretation is included in the training sample or not;
If the content accords with the preset specification and the training sample comprises function interpretation, the content and the function interpretation are used as the training label;
if the content accords with the preset specification and the training sample does not comprise function interpretation, taking the content as the training label;
if the content does not meet the preset specification and the training sample comprises function interpretation, the title and the function interpretation are used as the training label;
and if the content does not accord with the preset specification and the training sample does not comprise function interpretation, taking the title as the training label.
5. The method of claim 1, wherein the constructing training samples from the type of code and the code comprises:
if the type of the code written by the user is the error elimination type, acquiring a target problem code corresponding to the code written by the user; the code for eliminating the error type is used for replacing the corresponding problem code;
a training sample is constructed based on the target problem code.
6. The method of claim 5, wherein the constructing training samples based on the target problem code comprises:
Taking the target problem code as the training sample; or,
and determining a problem function with errors in the target problem code, and taking the problem function as the training sample.
7. The method of claim 5, wherein the description information includes a title and content for describing a meaning of a code; the setting the training label of the training sample based on the description information of the code comprises the following steps:
detecting whether the content meets a preset specification;
if yes, taking the content as a meaning label of the training sample, and taking a code written by a user as a repair label of the training sample;
and if not, taking the title as a meaning label of the training sample, and taking a code written by a user as a repair label of the training sample.
8. The method of claim 7, wherein the description information includes a title and content for describing a meaning of a code; the setting the training label of the training sample based on the description information of the code comprises the following steps:
detecting whether the content accords with a preset specification or not, and identifying whether a code written by a user comprises function interpretation or not;
If the content accords with the preset specification and the code written by the user comprises function interpretation, the content and the function interpretation are used as meaning labels of the training samples, and the code written by the user is used as a repair label of the training samples;
if the content accords with the preset specification and the code written by the user does not comprise function interpretation, the content is used as a meaning label of the training sample, and the code written by the user is used as a repair label of the training sample;
if the content does not accord with the preset specification and the code written by the user comprises function interpretation, the title and the function interpretation are used as meaning labels of the training samples, and the code written by the user is used as a repair label of the training samples;
and if the content does not accord with the preset specification and the code written by the user does not comprise function interpretation, taking the title as a meaning label of the training sample and taking the code written by the user as a repair label of the training sample.
9. The method of any of claims 1-8, wherein prior to the acquiring a task card submitted by a user for the code, the method further comprises:
acquiring, from the code hosting platform, code submission information submitted by a user for the written code, the code submission information comprising the identification of the task card;
the acquiring a task card submitted by a user for the code comprises:
searching for the task card corresponding to the identification among all the task cards of the demand management platform.
10. A training sample construction apparatus comprising:
the acquisition module is used for acquiring codes written by a user from the code hosting platform;
the acquisition module is also used for acquiring a task card submitted by a user aiming at the code, wherein the task card comprises the type and description information of the code;
the construction module is used for constructing a training sample according to the type of the code and the code acquired by the acquisition module;
the setting module is used for setting the training label of the training sample based on the description information of the codes acquired by the acquisition module.
11. The apparatus of claim 10, wherein the construction module is specifically configured to:
if the type of the code is a demand task type, splitting the code into a plurality of functions, wherein the code of the demand task type is written for the demand task;
Each function is used as a training sample.
12. The apparatus of claim 11, wherein the description information includes a title and content for describing a meaning of a code; the setting module is specifically configured to:
detecting whether the content meets a preset specification;
if yes, taking the content as the training label;
and if not, taking the title as the training label.
13. The apparatus of claim 11, wherein the description information includes a title and content for describing a meaning of a code; the setting module is specifically configured to:
detect whether the content meets a preset specification, and identify whether the training sample includes a function interpretation;
if the content meets the preset specification and the training sample includes a function interpretation, take the content and the function interpretation as the training label;
if the content meets the preset specification and the training sample does not include a function interpretation, take the content as the training label;
if the content does not meet the preset specification and the training sample includes a function interpretation, take the title and the function interpretation as the training label; and
if the content does not meet the preset specification and the training sample does not include a function interpretation, take the title as the training label.
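The four branches of this claim reduce to two independent choices: content or title, and with or without the function interpretation. The sketch below illustrates that reduction. The `meets_spec` check is a hypothetical stand-in, since the source does not define the preset specification, and joining the two parts with a newline is likewise an assumption.

```python
from typing import Callable, Optional


def meets_spec(content: str) -> bool:
    """Hypothetical stand-in for the 'preset specification' (not in the source)."""
    return len(content.strip()) >= 10


def choose_training_label(title: str,
                          content: str,
                          function_interpretation: Optional[str] = None,
                          spec: Callable[[str], bool] = meets_spec) -> str:
    """Pick the training label for a demand-task sample."""
    # Choice 1: use the content if it meets the specification, else the title.
    base = content if spec(content) else title
    # Choice 2: append the function interpretation when the sample has one.
    if function_interpretation:
        return f"{base}\n{function_interpretation}"  # separator is an assumption
    return base
```

Calling `choose_training_label` without a function interpretation gives the simpler content-or-title behaviour of claim 12.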
14. The apparatus of claim 10, wherein the construction module is specifically configured to:
if the type of the code written by the user is an error elimination type, acquire a target problem code corresponding to the code written by the user, wherein code of the error elimination type is used to replace the corresponding problem code; and
construct a training sample based on the target problem code.
15. The apparatus of claim 14, wherein the construction module is specifically configured to:
take the target problem code as the training sample; or
determine a problem function containing an error in the target problem code, and take the problem function as the training sample.
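The claim does not say how the problem function is determined. One possible approach, shown here purely as an assumption, is to compare the target problem code with the user's fix and treat every top-level function whose text changed as a problem function. The sketch assumes Python source, and the helper names are hypothetical.

```python
import ast
from typing import Dict, List


def functions_by_name(source: str) -> Dict[str, str]:
    """Map each top-level function name to its source text."""
    tree = ast.parse(source)
    return {
        node.name: ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def find_problem_functions(problem_code: str, fixed_code: str) -> List[str]:
    """Return functions from the problem code that the fix changed (assumed heuristic)."""
    old = functions_by_name(problem_code)
    new = functions_by_name(fixed_code)
    return [src for name, src in old.items()
            if name in new and new[name] != src]
```

Functions that the fix deletes or renames are not reported by this heuristic; in that case the whole target problem code can serve as the sample, which is the first alternative of the claim.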
16. The apparatus of claim 14, wherein the description information includes a title and content for describing a meaning of a code; the setting module is specifically configured to:
detect whether the content meets a preset specification;
if so, take the content as a meaning label of the training sample, and take the code written by the user as a repair label of the training sample; and
if not, take the title as the meaning label of the training sample, and take the code written by the user as the repair label of the training sample.
17. The apparatus of claim 14, wherein the description information includes a title and content for describing a meaning of a code; the setting module is specifically configured to:
detect whether the content meets a preset specification, and identify whether the code written by the user includes a function interpretation;
if the content meets the preset specification and the code written by the user includes a function interpretation, take the content and the function interpretation as a meaning label of the training sample, and take the code written by the user as a repair label of the training sample;
if the content meets the preset specification and the code written by the user does not include a function interpretation, take the content as the meaning label of the training sample, and take the code written by the user as the repair label of the training sample;
if the content does not meet the preset specification and the code written by the user includes a function interpretation, take the title and the function interpretation as the meaning label of the training sample, and take the code written by the user as the repair label of the training sample; and
if the content does not meet the preset specification and the code written by the user does not include a function interpretation, take the title as the meaning label of the training sample, and take the code written by the user as the repair label of the training sample.
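For the error elimination case, each sample carries two labels: a meaning label and a repair label. The sketch below shows one way to assemble them. It assumes Python source and treats a function docstring in the user's fix as the "function interpretation", which is an interpretation on our part rather than something the source states. `RepairSample`, `meets_spec`, and the newline separator are hypothetical.

```python
import ast
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepairSample:
    problem_code: str   # the training sample itself
    meaning_label: str  # what the code is meant to do
    repair_label: str   # the code the user wrote to fix it


def meets_spec(content: str) -> bool:
    """Hypothetical stand-in for the 'preset specification'."""
    return len(content.strip()) >= 10


def extract_interpretation(fixed_code: str) -> Optional[str]:
    """Return the first function docstring in the fix, if any (assumed meaning)."""
    for node in ast.walk(ast.parse(fixed_code)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:
                return doc
    return None


def build_repair_sample(problem_code: str, fixed_code: str,
                        title: str, content: str) -> RepairSample:
    base = content if meets_spec(content) else title
    interpretation = extract_interpretation(fixed_code)
    meaning = f"{base}\n{interpretation}" if interpretation else base
    # The repair label is always the code written by the user.
    return RepairSample(problem_code, meaning, fixed_code)
```

Note that the function interpretation is read from the user's fix, not from the problem code, which matches the wording of this claim.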
18. The apparatus according to any one of claims 10 to 17, wherein
the acquisition module is further configured to acquire, from the code hosting platform, code submission information submitted by the user for the written code before acquiring the task card submitted by the user for the code, wherein the code submission information comprises an identifier of the task card; and
the acquisition module is specifically configured to:
search all task cards of the demand management platform for the task card corresponding to the identifier.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-9.
21. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-9.
CN202310381165.7A 2023-04-11 2023-04-11 Training sample construction method and device, electronic equipment and medium Pending CN116521866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310381165.7A CN116521866A (en) 2023-04-11 2023-04-11 Training sample construction method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310381165.7A CN116521866A (en) 2023-04-11 2023-04-11 Training sample construction method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN116521866A (en) 2023-08-01

Family

ID=87407395

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310381165.7A Pending CN116521866A (en) 2023-04-11 2023-04-11 Training sample construction method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN116521866A (en)

Similar Documents

Publication Publication Date Title
CN111709527A (en) Operation and maintenance knowledge map library establishing method, device, equipment and storage medium
CN110750654A (en) Knowledge graph acquisition method, device, equipment and medium
CN114595686B (en) Knowledge extraction method, and training method and device of knowledge extraction model
CN112784591B (en) Data processing method and device, electronic equipment and storage medium
CN113836314B (en) Knowledge graph construction method, device, equipment and storage medium
CN114428677A (en) Task processing method, processing device, electronic equipment and storage medium
EP4145298A1 (en) Method and apparatus for snapshotting metadata
CN115599769A (en) Data migration method and device, electronic equipment and storage medium
CN112084150A (en) Model training method, data retrieval method, device, equipment and storage medium
CN114417118A (en) Abnormal data processing method, device, equipment and storage medium
CN113609100A (en) Data storage method, data query method, data storage device, data query device and electronic equipment
CN115048352B (en) Log field extraction method, device, equipment and storage medium
CN115454971A (en) Data migration method and device, electronic equipment and storage medium
CN115544010A (en) Mapping relation determining method and device, electronic equipment and storage medium
US20220129418A1 (en) Method for determining blood relationship of data, electronic device and storage medium
CN116521866A (en) Training sample construction method and device, electronic equipment and medium
CN114443493A (en) Test case generation method and device, electronic equipment and storage medium
CN113961672A (en) Information labeling method and device, electronic equipment and storage medium
CN113553826A (en) Information input method and device combining RPA and AI and electronic equipment
CN114116688A (en) Data processing and data quality inspection method, device and readable storage medium
CN117573561B (en) Automatic test system, method, electronic equipment and storage medium
CN114281981B (en) News brief report generation method and device and electronic equipment
US20240193161A1 (en) Reverse engineered retokenization for translation of machine interpretable languages
CN117422412A (en) Project management method, device, equipment and storage medium
CN117312339A (en) Question bank updating method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination