WO2021220461A1

WO2021220461A1 - Relevant commit identification device, relevant commit identification method, and program

Info

Publication number: WO2021220461A1
Application number: PCT/JP2020/018267
Authority: WO
Inventors: 和明足立; 卓弥岩塚; 大輔山口
Original assignee: 日本電信電話株式会社
Priority date: 2020-04-30
Filing date: 2020-04-30
Publication date: 2021-11-04

Abstract

This relevant commit identification device comprises: an extraction unit that extracts a string of words from a text input in relation to a certain piece of software; a generation unit that generates, for each of a plurality of commits generated with respect to the software, a string of characters including the content of the commit; and an identification unit that identifies a commit having a relatively high relevance to the text from among the plurality of commits on the basis of a comparison between the string of words and each of the strings of characters. Consequently, the relevant commit identification device enhances the efficiency of an operation which identifies a commit related to a certain text from among a list of commits with respect to software.

Description

Related commit identification device, related commit identification method and program

The present invention relates to a related commit specifying device, a related commit specifying method, and a program.

In software development, a series of changes made for a certain purpose is grouped into a unit called a commit. A commit includes the identifier of the commit, the change difference (changed file list) of the files related to the software (source files, various documents, etc.), the developer who made the change, the date and time when the change was made, and descriptive text that explains the content of the change (including the purpose of the change) in natural language. The changed file list is a list of the file names of the changed files and, for each file, the corrected part (the line numbers where the correction was made).

When a new version of software is released, release notes are generally created. The release notes may include a list of the commits that made the change, in addition to the text that describes the changes from the previous version in natural language.

If the text and the commits related to the software changes are listed in the release notes, users can operate the software more smoothly. For example, assume a situation in which a vulnerability inherent in software is discovered and a version of the software in which the vulnerability has been fixed is released. In such a situation, a user of the software wants to determine whether or not the vulnerability affects their own usage. In this case, the user first searches the release notes for a sentence describing the vulnerability and identifies the commit in which the vulnerability was fixed. The user then reviews the identified commit to determine whether it affects the user's usage.

However, since the release notes are created by the developer who made the software changes (including corrections), the contents vary. For example, there are developers who do not create release notes at all, and developers who describe only major changes in the release notes.

If the text about the software changes and the commit list have not been created, or if the commit list is incomplete, the user needs to manually associate the software changes with commits. For example, in order to identify the commit that fixed a vulnerability indicated by publicly available vulnerability information such as CVE (Common Vulnerabilities and Exposures), a user, a security researcher, or the like must manually search the commit list for the corresponding commit based on words related to the vulnerability (for example, Non-Patent Document 1).

Humans can use their expertise in software implementation to associate text written in natural language with software features, bug fixes, and vulnerabilities, but they must search for related commits and check the search results one by one.

Therefore, expertise in software implementation is required to identify the relevant commit. In addition, it takes time to identify the target commit from among hundreds to thousands of commits. As a result, when a new version of certain software is released, there is a problem that it is not easy for the user to decide whether to use the version, or that it takes time to make the decision.

The present invention has been made in view of the above points, and an object of the present invention is to streamline the work of identifying a commit related to a certain sentence from a list of commits to software.

Therefore, in order to solve the above problem, the related commit identification device includes: an extraction unit that extracts a word string from a sentence input for certain software; a generation unit that generates, for each of a plurality of commits generated for the software, a character string including the content of the commit; and an identification unit that identifies, based on a comparison between the word string and each of the character strings, a commit that is relatively highly relevant to the sentence from among the plurality of commits.

It is possible to streamline the work of identifying the commit related to a certain sentence from the list of commits to the software.

FIG. 1 is a diagram showing a hardware configuration example of the related commit specifying device 10 according to the embodiment of the present invention.
FIG. 2 is a diagram showing a functional configuration example of the related commit specifying device 10 according to the embodiment of the present invention.
FIG. 3 is a flowchart for explaining an example of the processing procedure executed by the software information processing unit 11.
FIG. 4 is a flowchart for explaining an example of the processing procedure of the processed software information generation process.
FIG. 5 is a flowchart for explaining an example of the processing procedure executed by the corpus generation unit 12.
FIG. 6 is a flowchart for explaining a first example of the processing procedure of the corpus generation process.
FIG. 7 is a flowchart for explaining a second example of the processing procedure of the corpus generation process.
FIG. 8 is a flowchart for explaining an example of the processing procedure executed by the related commit specifying unit 13.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a diagram showing a hardware configuration example of the related commit specifying device 10 according to the embodiment of the present invention. The related commit specifying device 10 of FIG. 1 has a drive device 100, an auxiliary storage device 102, a memory device 103, a CPU 104, an interface device 105, a display device 106, an input device 107, and the like, which are connected to each other by a bus B.

The program that realizes the processing in the related commit specifying device 10 is provided by the recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive device 100, the program is installed in the auxiliary storage device 102 from the recording medium 101 via the drive device 100. However, the program does not necessarily have to be installed from the recording medium 101, and may be downloaded from another computer via the network. The auxiliary storage device 102 stores the installed program and also stores necessary files, data, and the like.

The memory device 103 reads the program from the auxiliary storage device 102 and stores it when an instruction to start the program is given. The CPU 104 realizes the functions of the related commit specifying device 10 according to the program stored in the memory device 103. The interface device 105 is used as an interface for connecting to a network. The display device 106 displays a GUI (Graphical User Interface) provided by the program, or the like. The input device 107 is composed of a keyboard, a mouse, and the like, and is used for inputting various operation instructions.

FIG. 2 is a diagram showing a functional configuration example of the related commit specifying device 10 according to the embodiment of the present invention. As shown in FIG. 2, the related commit specifying device 10 includes a software information processing unit 11, a corpus generation unit 12, and a related commit specifying unit 13. Each of these units is realized by processing that one or more programs installed in the related commit specifying device 10 cause the CPU 104 to execute.

When software information is input, the software information processing unit 11 generates processed software information by executing preprocessing on the software information.

The software information is text written in natural language that describes the functional changes from the previous version of certain upgraded software (hereinafter referred to as the "target software"; the latest version of the target software is referred to as the "target version"). For example, the software information may include sentences related to software changes included in the release notes for the target version, and vulnerability information such as CVE (Common Vulnerabilities and Exposures). Alternatively, a sentence created by the user of the related commit specifying device 10 may be input as software information. In this case, the user may create, as software information, a sentence indicating the functional change that is of interest to the user.

On the other hand, the processed software information is a word string extracted from the software information by executing processing such as the deletion of stop words on the software information.

For example, if the software information is "In ih264d_init_decoder of ih264d_api.c, there is a possible out of bounds write due to a use after free.", a word string extracted from this sentence is generated as the processed software information.

The corpus generation unit 12 receives as input a list of commits for the target version (for example, a list of commits included in the release notes of the target version), and generates a corpus by applying a corpus generation rule to each commit included in the commit list. A commit is a unit of changes made for a certain purpose in software development. A commit includes the identifier of the commit, the change difference of the files related to the software (source files, various documents, etc.) (hereinafter referred to as the "changed file list"), the developer who made the change, the date and time of the change, and descriptive text that explains the content of the change (including the purpose of the change) in natural language (hereinafter referred to as the "description of the change content").

The corpus is a database whose attributes are, for each commit in the commit list, the identifier of the commit and a sequenced form of the "changed file list" and the "description of the change content" included in the commit (hereinafter simply referred to as a "sequence"). Sequencing means, for example, conversion into an array of words. In this case, words corresponding to stop words may be removed, and compound words may be divided into words. Alternatively, the sequence may be a character string in which the "changed file list" and the "description of the change content" are simply concatenated.

The related commit specifying unit 13 receives the processed software information and the corpus as input, identifies from the corpus a commit related to the software information, that is, a commit that is relatively highly relevant to the software information (hereinafter referred to as a "related commit"), and outputs the identifier of the related commit.
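The corpus described above can be pictured as a mapping from commit identifiers to sequences. The following is a minimal sketch under stated assumptions, not the implementation of the embodiment: the names `Commit`, `Corpus`, and `add`, the commit identifier `c0ffee1`, and the use of whitespace splitting for sequencing are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Commit:
    """Hypothetical container for the commit fields used by the corpus."""
    commit_id: str            # identifier of the commit
    changed_files: List[str]  # changed file list (file names only in this sketch)
    description: str          # description of the change content


@dataclass
class Corpus:
    """Maps each commit identifier to its sequence (an array of words)."""
    sequences: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, commit: Commit) -> None:
        # Assumption: sequencing is a plain whitespace split of the
        # concatenated changed file list and description.
        text = " ".join(commit.changed_files) + " " + commit.description
        self.sequences[commit.commit_id] = text.split()


corpus = Corpus()
corpus.add(Commit("c0ffee1", ["ih264d_api.c"], "Fix use after free"))
```

Keying the sequences by commit identifier makes it straightforward to output the identifier once the most similar sequence has been found.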

Hereinafter, the processing procedure executed by the related commit specifying device 10 will be described. FIG. 3 is a flowchart for explaining an example of a processing procedure executed by the software information processing unit 11.

When the software information processing unit 11 reads the input software information (hereinafter referred to as the "target software information") (S101), it executes the processed software information generation process (S102). In the processed software information generation process, the processed software information is generated by applying the processing rules to the target software information. Subsequently, the software information processing unit 11 outputs the generated processed software information (hereinafter referred to as the "target processed software information") (S103).

Subsequently, the details of step S102 will be described. The software information processing rules include the deletion of stop words and the separation of compound words. For example, removing the stop word "this" from the software information and separating a compound word such as "forceStop" into "force" and "Stop" are examples of software information processing rules.

FIG. 4 is a flowchart for explaining an example of the processing procedure of the processed software information generation process.

In step S201, the software information processing unit 11 parses (syntax analysis) the target software information and generates (extracts) a list of words (hereinafter, referred to as "word string") included in the target software information.

Subsequently, the software information processing unit 11 acquires an unprocessed word from the word string as a processing target (S202). For example, the words may be acquired in the order in which they appear in the word string. Hereinafter, the word to be processed is referred to as the "target word". An unprocessed word is a word that has not yet been selected as a target word.

Subsequently, the software information processing unit 11 determines whether or not the target word is a stop word (S203). Whether or not a word corresponds to a stop word may be determined based on a known technique. When the target word is a stop word (Yes in S203), steps S204 to S206 are not executed for the target word, and the process proceeds to step S207.

On the other hand, when the target word is not a stop word (No in S203), the software information processing unit 11 determines whether or not the target word is a compound word (S204). When the target word is a compound word (Yes in S204), the software information processing unit 11 divides the target word into a plurality of words (S205) and adds each of the divided words to the target processed software information (S206). On the other hand, when the target word is not a compound word (No in S204), the software information processing unit 11 adds the target word as it is to the target processed software information (S206). After step S206, the process proceeds to step S207.

In step S207, the software information processing unit 11 determines whether or not all the words included in the word string generated in step S201 are targeted for processing. If there are unprocessed words (No in S207), steps S202 and subsequent steps are repeated. When there is no unprocessed word (Yes in S207), the processing procedure of FIG. 4 ends.
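The procedure of FIG. 4 can be sketched as follows. This is only an illustration under stated assumptions: the stop word set `STOP_WORDS`, the regular-expression tokenizer standing in for the parsing of step S201, and the rule that a compound word is one containing an underscore or a lower-case-to-upper-case boundary are all hypothetical, since the embodiment leaves these to known techniques.

```python
import re

# Assumption: a tiny illustrative stop word list; a real list would be larger.
STOP_WORDS = {"a", "an", "the", "this", "is", "of", "in", "to"}


def split_compound(word):
    """Split a word at underscores and at lower-to-upper case boundaries."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", word)
    return spaced.replace("_", " ").split()


def generate_processed_info(text):
    """Return the processed software information (a list of words)."""
    words = re.findall(r"[A-Za-z0-9_]+", text)   # S201: extract the word string
    processed = []
    for word in words:                           # S202 / S207: visit every word once
        if word.lower() in STOP_WORDS:           # S203: skip stop words
            continue
        parts = split_compound(word)             # S204 / S205: divide compound words
        processed.extend(parts)                  # S206: add the word(s)
    return processed


result = generate_processed_info("This change fixes forceStop")
```

A word that is not a compound word comes back from `split_compound` as a one-element list, so both branches of step S204 end in the same `extend` call.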

FIG. 5 is a flowchart for explaining an example of the processing procedure executed by the corpus generation unit 12.

In step S301, the corpus generation unit 12 reads the commit list of the target version (hereinafter referred to as the "target commit list"). Subsequently, the corpus generation unit 12 determines whether or not there is an unprocessed commit among the plurality of commits included in the target commit list (S302). "Unprocessed" means that the commit has not yet been subjected to step S303 and the subsequent steps.

When there is an unprocessed commit (Yes in S302), the corpus generation unit 12 acquires one of the unprocessed commits as a processing target (hereinafter referred to as "target commit") (S303).

Subsequently, the corpus generation unit 12 executes the corpus generation process (S304). In the corpus generation process, the corpus generation rule is applied to the target commit, and the corpus data for the target commit is generated. Subsequently, the corpus generation unit 12 adds the corpus data for the target commit to the corpus (hereinafter referred to as the "target corpus") (S305). When step S303 and the subsequent steps have been executed for all the commits included in the target commit list (No in S302), the processing procedure of FIG. 5 ends.

Subsequently, the details of step S304 will be described. FIG. 6 is a flowchart for explaining a first example of the processing procedure of the corpus generation process.

In step S411, the corpus generation unit 12 generates a sequence for the target commit (hereinafter referred to as the "target sequence") by sequencing the changed file list included in the target commit and the document in which the change content is described in natural language.

Subsequently, the corpus generation unit 12 acquires an unprocessed word among the words included in the target sequence as a processing target (hereinafter referred to as the "target word") (S412). "Unprocessed" means that the word has not yet been subjected to step S413 and the subsequent steps.

Subsequently, the corpus generation unit 12 determines whether or not the target word is a stop word (S413). When the target word is a stop word (Yes in S413), the corpus generation unit 12 deletes (removes) the target word from the target sequence (S414), and proceeds to step S418.

On the other hand, when the target word is not a stop word (No in S413), the corpus generation unit 12 determines whether or not the target word is a compound word (S415). When the target word is a compound word (Yes in S415), the corpus generation unit 12 divides the target word into a plurality of words (S416), replaces the target word in the target sequence with the divided word group (S417), and proceeds to step S418. When the target word is not a compound word (No in S415), the target word is left as it is and the process proceeds to step S418.

In step S418, the corpus generation unit 12 determines whether or not all the words included in the target sequence are targeted for processing. If there are unprocessed words (No in S418), steps S412 and subsequent steps are repeated. If there are no unprocessed words (Yes in S418), the processing procedure of FIG. 6 ends.
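The first example of the corpus generation process (FIG. 6) edits the target sequence in place, deleting stop words and replacing compound words with their parts. A minimal sketch follows, with the same caveats as before: `STOP_WORDS`, the regular-expression tokenizer, and the compound-word rule in `split_compound` are assumptions for illustration, and the function name `generate_sequence` is hypothetical.

```python
import re

# Assumption: illustrative stop word list only.
STOP_WORDS = {"a", "an", "the", "this", "is", "of", "in", "to"}


def split_compound(word):
    """Split a word at underscores and at lower-to-upper case boundaries."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", word)
    return spaced.replace("_", " ").split()


def generate_sequence(changed_files, description):
    """Build the target sequence for one commit."""
    text = " ".join(changed_files) + " " + description
    sequence = re.findall(r"[A-Za-z0-9_]+", text)   # S411: sequencing
    i = 0
    while i < len(sequence):                        # S412 / S418: loop over words
        word = sequence[i]
        if word.lower() in STOP_WORDS:              # S413
            del sequence[i]                         # S414: remove the stop word
            continue
        parts = split_compound(word)                # S415 / S416
        sequence[i:i + 1] = parts                   # S417: replace with the parts
        i += len(parts)
    return sequence


seq = generate_sequence(["ih264d_api.c"], "Fix the forceStop handler")
```

Because the tokenizer in this sketch splits on the period, the file name `ih264d_api.c` contributes the words `ih264d`, `api`, and `c`; a different tokenizer could keep the file name whole.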

Alternatively, as the corpus generation process, the processing procedure of FIG. 7 may be executed instead of the processing procedure of FIG. 6. FIG. 7 is a flowchart for explaining a second example of the processing procedure of the corpus generation process. In FIG. 7, the same steps as those in FIG. 6 are assigned the same step numbers, and their description is omitted. In FIG. 7, steps S401 and S402 are added before step S411.

In step S401, the corpus generation unit 12 acquires the file name of each changed file (hereinafter referred to as "change file") from the list of change files included in the target commit. Subsequently, the corpus generation unit 12 determines whether or not there is a source code file (source file) in the change file list based on the file name of each change file (S402). For example, it can be determined whether or not the file related to the file name is a source file based on the extension of the file name. Specifically, the extension of the source file written in C language is ".c". On the other hand, the extension of the text file containing some explanation and not including the source code is ".txt". In this way, it is possible to determine whether or not each change file is a source file based on the extension.

If there is a source file in the changed file list (Yes in S402), the corpus generation unit 12 executes step S411 and subsequent steps. If there is no source file in the changed file list (No in S402), the corpus generation unit 12 does not execute step S411 and subsequent steps.

That is, commits are created not only when a source file is changed, but also when various documents attached to the software are changed. According to the processing procedure of FIG. 7, a sequence is generated only for commits in which a source file has been modified (commits that affect the operation of the target software).
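The source file check of steps S401 and S402 can be sketched as an extension lookup. The set `SOURCE_EXTENSIONS` below is an assumption: the description only contrasts ".c" with ".txt", and the other extensions are added here purely for illustration. The function name `has_source_file` is likewise hypothetical.

```python
import os

# Assumption: which extensions count as source files is a configuration choice.
SOURCE_EXTENSIONS = {".c", ".h", ".cpp", ".java", ".py"}


def has_source_file(changed_files):
    """S401 / S402: True if any changed file name has a source file extension."""
    return any(
        os.path.splitext(name)[1].lower() in SOURCE_EXTENSIONS
        for name in changed_files
    )


# A commit touching only documentation is skipped; one touching a .c file is kept.
keep = has_source_file(["README.txt", "ih264d_api.c"])
skip = has_source_file(["notes.txt"])
```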

FIG. 8 is a flowchart for explaining an example of the processing procedure executed by the related commit specifying unit 13.

The related commit specifying unit 13 first reads each of the target processed software information and the target corpus (S501, S502).

Subsequently, the related commit specifying unit 13 determines whether or not there is an unprocessed sequence in the sequence group included in the target corpus (S503). "Unprocessed" means that the sequence has not yet been subjected to step S504 and the subsequent steps. When there is an unprocessed sequence (Yes in S503), the related commit specifying unit 13 acquires one of the unprocessed sequences as a processing target (hereinafter referred to as the "target sequence") (S504). Subsequently, the related commit specifying unit 13 applies the similarity calculation rule to the target sequence to calculate its similarity with the target processed software information (S505). That is, the similarity with the target processed software information is calculated for each sequence.

As a similarity calculation rule, a rule is conceivable in which, for each word constituting the target processed software information, the appearance frequency (number of occurrences) in the target sequence is counted, and the similarity between the target processed software information and the target sequence is calculated based on the appearance frequency of each word.

Specifically, the total, average, or maximum of the appearance frequencies counted for the words constituting the target processed software information may be regarded as the similarity. In this case, the larger the total, average, or maximum, the higher the similarity.

Further, for each word constituting the target processed software information, the appearance frequency may also be counted in each sequence other than the target sequence, and, for a word whose appearance frequency in the other sequences is lower than its appearance frequency in the target sequence, the appearance frequency in the target sequence may be weighted. For example, the appearance frequency of the word in the target sequence may be multiplied by the value obtained by dividing that appearance frequency by the appearance frequency of the word in the other sequences.
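The frequency-based similarity rule can be sketched as follows. The function name `similarity` and the `mode` parameter are hypothetical, and exact, case-sensitive word matching is an assumption of this sketch.

```python
from collections import Counter


def similarity(query_words, sequence, mode="total"):
    """Similarity between the processed software information and one sequence.

    For each query word, count its occurrences in the sequence, then
    aggregate the counts as a total, an average, or a maximum.
    """
    counts = Counter(sequence)
    freqs = [counts[word] for word in query_words]
    if not freqs:
        return 0.0
    if mode == "total":
        return float(sum(freqs))
    if mode == "average":
        return sum(freqs) / len(freqs)
    if mode == "max":
        return float(max(freqs))
    raise ValueError("mode must be 'total', 'average', or 'max'")


query = ["use", "after", "free"]
sequence = ["fix", "use", "after", "free", "free"]
total = similarity(query, sequence, "total")
average = similarity(query, sequence, "average")
maximum = similarity(query, sequence, "max")
```

In this example the per-word counts are 1, 1, and 2, so the three aggregates differ; which aggregate separates commits best is left open by the description.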
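The weighting idea can be sketched as below, with several assumptions stated up front because the description leaves the details open: the frequency "in the other sequences" is taken here as the sum over all other sequences, the divisor is floored at 1 to avoid division by zero when the word appears nowhere else, and no weight is applied when the word is at least as frequent elsewhere. The name `weighted_frequency` is hypothetical.

```python
def weighted_frequency(word, target_sequence, other_sequences):
    """Appearance frequency of `word` in the target sequence, weighted upward
    when the word is rarer in the other sequences than in the target."""
    tf = target_sequence.count(word)
    # Assumption: combine the other sequences by summing their counts.
    other = sum(seq.count(word) for seq in other_sequences)
    if tf == 0 or other >= tf:
        return float(tf)            # no weighting applies
    # Multiply the frequency by (frequency / frequency elsewhere);
    # the floor of 1 is an assumption to keep the division defined.
    return tf * (tf / max(other, 1))


target = ["free", "free", "free", "fix"]
others = [["free"], ["fix", "fix"]]
w_free = weighted_frequency("free", target, others)
w_fix = weighted_frequency("fix", target, others)
```

The effect resembles inverse document frequency weighting: a word concentrated in one commit counts for more than a word spread across many commits.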

However, the degree of similarity between the target processed software information and the target sequence may be calculated by another method.

When the similarity has been calculated for all the sequences (No in S503), the related commit specifying unit 13 identifies the sequence having the maximum similarity (S506) and outputs the commit from which that sequence was generated (S507). That is, the commit from which the sequence having the maximum similarity was generated is specified as the related commit for the target software information.

However, the related commit specifying unit 13 may specify the N sequences with the highest similarity (N > 1). In this case, the related commit specifying unit 13 may output, as the identifiers of the related commits, the identifiers of the commits from which the specified sequences were generated. The identifiers may be output in descending order of similarity. Further, the value of N may be input by the user.
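Selecting the related commits (S503 to S507, including the top-N variant) can be sketched as a ranking over the corpus. In this sketch the corpus is assumed to be a dictionary from commit identifier to sequence, the similarity is the total appearance frequency, and the identifiers `c1` to `c3` and the function name `top_related_commits` are hypothetical.

```python
def top_related_commits(query_words, corpus, n=1):
    """Return the identifiers of the n commits whose sequences are most
    similar to the query, in descending order of similarity."""
    scores = {
        commit_id: sum(sequence.count(word) for word in query_words)   # S505
        for commit_id, sequence in corpus.items()
    }
    ranked = sorted(scores, key=lambda cid: scores[cid], reverse=True)  # S506
    return ranked[:n]                                                   # S507


corpus = {
    "c1": ["update", "docs"],
    "c2": ["fix", "use", "after", "free"],
    "c3": ["free", "buffer"],
}
best = top_related_commits(["use", "after", "free"], corpus)
top_two = top_related_commits(["use", "after", "free"], corpus, n=2)
```

With `n=1` the result is the single commit of step S506; a larger `n` gives the ranked list described for the top-N variant.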

As is clear from the above, the related commits to the target software information are specified based on the comparison (similarity) between the target processed software information and the sequence related to each commit. That is, the identification of the related commit largely depends on the content of the target processed software information and the content of each sequence.

Therefore, external information (that is, information for improving the accuracy of the similarity) may be added to the target processed software information or to each sequence so that a high similarity is calculated between software information and commits that are generally or empirically considered to be highly relevant to each other.

For example, as information for adding such external information to the processed software information, a dictionary (hereinafter referred to as the "software change-related word dictionary") may be generated in advance and stored in the auxiliary storage device 102 or the like. In this dictionary, a key is a change content that is generally or empirically included in commits, and the value is a word that characterizes the commits generally or empirically considered to be relevant to that change content.

An example of a software change-related word dictionary specialized for vulnerabilities is shown below.
[Key (change content)]
cross site scripting
[Value]
Injection, websites, HTML, token

The software change-related word dictionary may be generated by manually extracting change contents to serve as keys and words to serve as values from past sentences that can correspond to software information (sentences included in release notes, vulnerability information, etc.). Alternatively, the software change-related word dictionary may be generated by mechanically extracting the keys and values from the past sentences by using natural language processing or machine learning.

When the software change-related word dictionary is used, the software information processing unit 11 may execute the following processing after executing the processing procedure of FIG.

For each word added to the target processed software information in step S102 of FIG. 3, the software information processing unit 11 searches the software change-related word dictionary for a key that matches the word or a key that includes the word. When there is a corresponding key, the software information processing unit 11 adds the value (word) associated with the key in the software change-related word dictionary to the target processed software information as the above-mentioned external information.

By adding such words to the target processed software information, it can be expected that the accuracy of the similarity will be improved and, by extension, that the accuracy of specifying related commits will be improved.
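The dictionary lookup can be sketched as follows, using the example entry given earlier. Two points are assumptions of this sketch rather than statements of the embodiment: "a key that includes the word" is interpreted as the word being one of the space-separated words of the key (compared case-insensitively), and values already present are not added twice. The names `CHANGE_WORD_DICT` and `expand_with_dictionary` are hypothetical.

```python
# The single entry below is the example from the description.
CHANGE_WORD_DICT = {
    "cross site scripting": ["Injection", "websites", "HTML", "token"],
}


def expand_with_dictionary(processed_words, dictionary):
    """Append dictionary values for every key that matches or includes a word."""
    expanded = list(processed_words)
    for word in processed_words:
        lowered = word.lower()
        for key, values in dictionary.items():
            key_lower = key.lower()
            # Assumption: "includes" means the word is one of the key's words.
            if lowered == key_lower or lowered in key_lower.split():
                for value in values:
                    if value not in expanded:   # avoid duplicate additions
                        expanded.append(value)
    return expanded


expanded = expand_with_dictionary(["fix", "scripting", "bug"], CHANGE_WORD_DICT)
```

Skipping duplicates keeps a key that matches several query words from inflating the frequency counts used later for the similarity.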

Similarly, the corpus generation unit 12 may add external information to the sequence, which is considered to contribute to the accuracy of the similarity, when generating the sequence of each corpus.

In the above, an example in which the search range of related commits is limited to the commit list of the target version has been described, but the commit lists of some or all of the past versions of the target software may be added to the search range of related commits.

As described above, according to the present embodiment, it is possible to automatically identify the commit that is considered to be highly relevant to the target software information. Therefore, the user can quickly identify the commit related to the target software information even if the user lacks expertise in implementing the software. That is, it is possible to streamline the work of identifying the commit related to a certain sentence from the list of commits to the software.

In the present embodiment, the software information processing unit 11 is an example of the extraction unit. The corpus generation unit 12 is an example of the generation unit. The related commit specifying unit 13 is an example of the identification unit.

Although the embodiments of the present invention have been described in detail above, the present invention is not limited to such specific embodiments, and various modifications and changes can be made within the scope of the gist of the present invention described in the claims.

10 Related commit identification device
11 Software information processing unit
12 Corpus generation unit
13 Related commit identification unit
100 Drive device
101 Recording medium
102 Auxiliary storage device
103 Memory device
104 CPU
105 Interface device
106 Display device
107 Input device
B Bus

Claims

1. A related commit identifying device comprising:
an extraction unit that extracts a word string from a sentence input for certain software;
a generation unit that generates, for each of a plurality of commits generated for the software, a character string including the content of the commit; and
an identification unit that identifies, based on a comparison between the word string and each of the character strings, a commit having a relatively high relevance to the sentence from among the plurality of commits.
2. The related commit identifying device according to claim 1, wherein the identification unit calculates the similarity between the word string and each of the character strings, and identifies the commit corresponding to a character string having a relatively high similarity.
3. The related commit identifying device according to claim 1 or 2, wherein the generation unit generates, for each of the plurality of commits, a character string including each word included in a document that is included in the commit and describes the changes to the software.
4. A related commit identification method in which a computer executes:
an extraction procedure of extracting a word string from a sentence input for certain software;
a generation procedure of generating, for each of a plurality of commits generated for the software, a character string including the content of the commit; and
an identification procedure of identifying, based on a comparison between the word string and each of the character strings, a commit having a relatively high relevance to the sentence from among the plurality of commits.
5. The related commit identification method according to claim 4, wherein, in the identification procedure, the similarity between the word string and each of the character strings is calculated, and the commit corresponding to a character string having a relatively high similarity is identified.
6. The related commit identification method according to claim 4 or 5, wherein the generation procedure generates, for each of the plurality of commits, a character string including each word included in a document that is included in the commit and describes the changes to the software.
7. A program that causes a computer to execute the related commit identification method according to any one of claims 4 to 6.