WO2021220461A1 - Relevant commit identification device, relevant commit identification method, and program - Google Patents

Relevant commit identification device, relevant commit identification method, and program

Info

Publication number
WO2021220461A1
Authority
WO
WIPO (PCT)
Prior art keywords
commit
software
word
commits
target
Prior art date
Application number
PCT/JP2020/018267
Other languages
French (fr)
Japanese (ja)
Inventor
和明 足立
卓弥 岩塚
大輔 山口
Original Assignee
日本電信電話株式会社
Application filed by 日本電信電話株式会社
Priority to PCT/JP2020/018267
Publication of WO2021220461A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/70: Software maintenance or management

Definitions

  • The present invention relates to a related commit specifying device, a related commit specifying method, and a program.
  • In software development, a series of changes made for a certain purpose is grouped into a unit called a commit. A commit includes its identifier, the change difference (changed file list) of the files related to the software (source files, various documents, etc.), the developer who made the change, the date and time of the change, and descriptive text that explains the content of the change (including its purpose) in natural language.
  • The changed file list is a list of the file names of the changed files and the corrected locations (line numbers where corrections were made) in each file.
  • When a new version of software is released, release notes are generally created.
  • In addition to text describing the changes from the previous version in natural language, the release notes may include a list of the commits that made those changes.
  • If the text and the commits related to the software changes are listed in the release notes, users can operate the software smoothly. For example, assume that a vulnerability inherent in the software has been discovered and a version in which the vulnerability has been fixed has been released. In such a situation, a user of the software wants to determine whether the vulnerability affects his or her own usage. The user first searches the release notes for text describing the vulnerability and identifies the commit in which the vulnerability was fixed. The user then reviews the identified commit to determine whether it affects the user's usage.
  • However, because release notes are created by the developers who made the software changes (including fixes), their contents vary. For example, some developers do not create release notes at all, and others describe only major changes in them.
  • If the text about the software changes and the commit list have not been created, or if the commit list is incomplete, the user needs to associate the software changes with the commits manually. For example, to identify the commit that fixes a vulnerability indicated by publicly available vulnerability information such as CVE (Common Vulnerabilities and Exposures), a person (for example, a user or a security researcher) must search the commit list for the corresponding commit based on words related to the vulnerability (for example, Non-Patent Document 1).
  • The present invention has been made in view of the above points, and an object of the present invention is to streamline the work of identifying a commit related to a certain sentence from a list of commits to software.
  • To solve the above problem, the related commit specifying device has: an extraction unit that extracts a word string from a sentence input for certain software; a generation unit that generates, for each of a plurality of commits generated for the software, a character string including the content of the commit; and a specifying unit that identifies, from among the plurality of commits, a commit that is relatively highly relevant to the sentence based on a comparison between the word string and each character string.
  • FIG. 1 is a diagram showing a hardware configuration example of the related commit specifying device 10 according to the embodiment of the present invention.
  • The related commit specifying device 10 of FIG. 1 has a drive device 100, an auxiliary storage device 102, a memory device 103, a CPU 104, an interface device 105, a display device 106, an input device 107, and the like, which are connected to each other by a bus B.
  • The program that realizes the processing in the related commit specifying device 10 is provided by a recording medium 101 such as a CD-ROM.
  • When the recording medium 101 storing the program is set in the drive device 100, the program is installed in the auxiliary storage device 102 from the recording medium 101 via the drive device 100.
  • However, the program does not necessarily have to be installed from the recording medium 101, and may be downloaded from another computer via a network.
  • The auxiliary storage device 102 stores the installed program and also stores necessary files, data, and the like.
  • The memory device 103 reads the program from the auxiliary storage device 102 and stores it when an instruction to start the program is given.
  • The CPU 104 realizes the functions of the related commit specifying device 10 according to the program stored in the memory device 103.
  • The interface device 105 is used as an interface for connecting to a network.
  • The display device 106 displays a GUI (Graphical User Interface) or the like provided by the program.
  • The input device 107 is composed of a keyboard, a mouse, and the like, and is used for inputting various operation instructions.
  • FIG. 2 is a diagram showing a functional configuration example of the related commit specifying device 10 according to the embodiment of the present invention.
  • The related commit specifying device 10 includes a software information processing unit 11, a corpus generation unit 12, and a related commit specifying unit 13. Each of these units is realized by processing that one or more programs installed in the related commit specifying device 10 cause the CPU 104 to execute.
  • When software information is input, the software information processing unit 11 generates processed software information by executing preprocessing on the software information.
  • The software information is text written in natural language that describes the changes from the previous version in the functions of certain upgraded software (hereinafter referred to as "target software"; the latest version of the target software is referred to as the "target version").
  • The software information may include text about software changes contained in the release notes for the target version, and vulnerability information such as CVE (Common Vulnerabilities and Exposures).
  • Alternatively, text created by the user of the related commit specifying device 10 may be input as software information. In this case, the user may create, as software information, text indicating the changes to the functions that are of interest to him or her.
  • The processed software information is a word string extracted from the software information by executing processing such as the deletion of stop words on the software information.
  • The corpus generation unit 12 receives as input a list of commits for the target version (for example, the list of commits included in the release notes of the target version), and generates a corpus by applying a corpus generation rule to each commit included in the commit list.
  • A commit is a unit that groups the changes made for a certain purpose in software development. A commit includes its identifier, the change difference of the files related to the software (source files, various documents, etc.) (hereinafter referred to as the "changed file list"), the developer who made the change, the date and time of the change, and descriptive text that explains the content of the change (including its purpose) in natural language (hereinafter referred to as the "change content description").
  • The corpus is a database that has as attributes, for each commit in the commit list, the identifier of the commit and a sequenced form of the "changed file list" and the "change content description" included in the commit (hereinafter simply referred to as a "sequence").
  • Sequencing means, for example, conversion into an array of words. In this case, words corresponding to stop words may be removed, and compound words may be divided into words. Alternatively, the sequence may be a character string in which the "changed file list" and the "change content description" are simply concatenated.
  • The related commit specifying unit 13 receives the processed software information and the corpus as input, identifies from the corpus a commit related to the software information (that is, a commit that is relatively highly relevant to the software information; hereinafter referred to as a "related commit"), and outputs the identifier of the related commit.
  • FIG. 3 is a flowchart for explaining an example of the processing procedure executed by the software information processing unit 11.
  • When the software information processing unit 11 reads the input software information (hereinafter referred to as "target software information") (S101), it executes the processed software information generation process (S102). In this process, the processed software information is generated by applying the processing rules to the target software information. Subsequently, the software information processing unit 11 outputs the generated processed software information (hereinafter referred to as "target processed software information") (S103).
  • The software information processing rules include the deletion of stop words and the separation of compound words. For example, removing the stop word "this" from the software information and separating a compound word such as "forceStop" into "force" and "Stop" are examples of software information processing rules.
  • FIG. 4 is a flowchart for explaining an example of the processing procedure of the processed software information generation process.
  • In step S201, the software information processing unit 11 parses the target software information and generates (extracts) a list of the words (hereinafter referred to as a "word string") included in the target software information.
  • Subsequently, the software information processing unit 11 acquires an unprocessed word from the word string as a processing target (S202).
  • The words to be processed may be acquired in the order in which they appear in the word string.
  • Hereinafter, the word to be processed is referred to as the "target word".
  • An unprocessed word is a word that has not yet been taken as a target word.
  • The software information processing unit 11 determines whether or not the target word is a stop word (S203). Whether or not a word is a stop word may be determined based on a known technique. When the target word is a stop word (Yes in S203), step S204 and the subsequent steps are skipped for the target word, and the process proceeds to step S207.
  • When the target word is not a stop word (No in S203), the software information processing unit 11 determines whether or not the target word is a compound word (S204). When the target word is a compound word (Yes in S204), the software information processing unit 11 divides the target word into a plurality of words (S205) and adds each of the divided words to the target processed software information (S206). On the other hand, when the target word is not a compound word (No in S204), the software information processing unit 11 adds the target word as it is to the target processed software information (S206). The process proceeds to step S207 following step S206.
  • In step S207, the software information processing unit 11 determines whether or not all the words included in the word string generated in step S201 have been processed. If there are unprocessed words (No in S207), step S202 and the subsequent steps are repeated. When there is no unprocessed word (Yes in S207), the processing procedure of FIG. 4 ends.
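  • The loop of FIG. 4 can be sketched in Python as follows. This is only an illustrative sketch: the stop-word set `STOP_WORDS`, the regular expressions, and the function names are assumptions made for the example, since the text only says that stop words are detected "based on a known technique" and does not say how compound words are recognized. Here a compound word is assumed to be a camelCase or snake_case identifier.

```python
import re

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "this", "that", "is", "in", "of", "to", "and"}

def split_compound(word):
    """Split a camelCase or snake_case word into its parts (S205).

    A word that is not a compound comes back as a one-element list,
    which corresponds to adding the word as it is (No in S204).
    """
    parts = []
    for chunk in word.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return parts or [word]

def process_software_info(text):
    """Return the processed software information as a list of words."""
    word_string = re.findall(r"[A-Za-z0-9_]+", text)  # S201: extract the word string
    processed = []
    for target_word in word_string:                   # S202 / S207: visit each word once
        if target_word.lower() in STOP_WORDS:         # S203: stop words are skipped
            continue
        processed.extend(split_compound(target_word)) # S204 to S206
    return processed

print(process_software_info("This fixes forceStop in the parser"))
```

    With the assumed stop-word list, "This", "in", and "the" are dropped, and "forceStop" is split into "force" and "Stop" as in the example given for the processing rules.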
  • FIG. 5 is a flowchart for explaining an example of the processing procedure executed by the corpus generation unit 12.
  • In step S301, the corpus generation unit 12 reads the commit list of the target version (hereinafter referred to as the "target commit list"). Subsequently, the corpus generation unit 12 determines whether or not there is an unprocessed commit among the plurality of commits included in the target commit list (S302). "Unprocessed" means not yet having been processed in step S303 and the subsequent steps.
  • If there is an unprocessed commit (Yes in S302), the corpus generation unit 12 acquires one of the unprocessed commits as a processing target (hereinafter referred to as the "target commit") (S303).
  • Subsequently, the corpus generation unit 12 executes the corpus generation process (S304).
  • In the corpus generation process, the corpus generation rule is applied to the target commit, and the corpus data for the target commit is generated.
  • Subsequently, the corpus generation unit 12 adds the corpus data to the corpus corresponding to the target commit (hereinafter referred to as the "target corpus") (S305).
  • When step S303 and the subsequent steps have been executed for all the commits included in the commit list (No in S302), the processing procedure of FIG. 5 ends.
  • FIG. 6 is a flowchart for explaining a first example of the processing procedure of the corpus generation process.
  • In step S411, the corpus generation unit 12 generates a sequence for the target commit (hereinafter referred to as the "target sequence") by sequencing the changed file list included in the target commit and the text that describes the change content in natural language.
  • Subsequently, the corpus generation unit 12 acquires an unprocessed word among the words included in the target sequence as a processing target (hereinafter referred to as the "target word") (S412).
  • "Unprocessed" means not yet having been processed in step S413 and the subsequent steps.
  • Subsequently, the corpus generation unit 12 determines whether or not the target word is a stop word (S413).
  • When the target word is a stop word (Yes in S413), the corpus generation unit 12 deletes (removes) the target word from the target sequence (S414), and proceeds to step S418.
  • When the target word is not a stop word (No in S413), the corpus generation unit 12 determines whether or not the target word is a compound word (S415).
  • When the target word is a compound word (Yes in S415), the corpus generation unit 12 divides the target word into a plurality of words (S416), replaces the target word in the target sequence with the divided word group (S417), and proceeds to step S418.
  • In step S418, the corpus generation unit 12 determines whether or not all the words included in the target sequence have been processed. If there are unprocessed words (No in S418), step S412 and the subsequent steps are repeated. If there are no unprocessed words (Yes in S418), the processing procedure of FIG. 6 ends.
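  • Taken together, FIG. 5 and FIG. 6 amount to mapping each commit identifier to a cleaned word sequence. The following Python sketch shows one way this could look. The `Commit` class, its field names, the stop-word set, and the camelCase-based compound splitting are all assumptions introduced for illustration, and the changed file list is reduced to file names (line numbers are omitted).

```python
import re
from dataclasses import dataclass

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "this", "is", "in", "of", "to", "and", "for"}

@dataclass
class Commit:
    """Hypothetical minimal commit record used only in this sketch."""
    identifier: str
    changed_files: list   # file names from the changed file list
    description: str      # change content description

def to_sequence(commit):
    """Build the target sequence for one commit (FIG. 6)."""
    # S411: sequence the changed file list and the change content description
    text = " ".join(commit.changed_files) + " " + commit.description
    sequence = []
    for target_word in re.findall(r"[A-Za-z0-9]+", text):
        if target_word.lower() in STOP_WORDS:   # S413, S414: drop stop words
            continue
        # S415 to S417: a compound word is replaced by its parts;
        # a simple word yields itself
        sequence.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", target_word)
        )
    return sequence

def build_corpus(commit_list):
    """Map each commit identifier to its sequence (loop S302 to S305 of FIG. 5)."""
    return {commit.identifier: to_sequence(commit) for commit in commit_list}

corpus = build_corpus([
    Commit("abc123", ["src/HtmlEscaper.java"], "Escape the token in forceStop"),
])
print(corpus)
```

    A plain dictionary stands in for the "database" mentioned in the text; any store keyed by commit identifier would serve the same purpose.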
  • FIG. 7 is a flowchart for explaining a second example of the processing procedure of the corpus generation process.
  • In FIG. 7, the same steps as those in FIG. 6 are assigned the same step numbers, and their description is omitted.
  • In FIG. 7, steps S401 and S402 are added before step S411.
  • In step S401, the corpus generation unit 12 acquires the file name of each changed file from the changed file list included in the target commit. Subsequently, the corpus generation unit 12 determines whether or not the changed file list contains a source code file (source file) based on the file name of each changed file (S402). For example, whether or not a file is a source file can be determined from the extension of its file name. Specifically, the extension of a source file written in the C language is ".c". On the other hand, the extension of a text file that contains some explanation and no source code is ".txt". In this way, it is possible to determine whether or not each changed file is a source file based on its extension.
  • If there is a source file in the changed file list (Yes in S402), the corpus generation unit 12 executes step S411 and the subsequent steps. If there is no source file in the changed file list (No in S402), the corpus generation unit 12 does not execute step S411 and the subsequent steps.
  • Commits are created not only when source files are changed, but also when the various documents attached to the software are changed. According to the processing procedure of FIG. 7, a sequence is generated only for the commits in which source files have been modified (commits that affect the operation of the target software).
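  • The extension check of steps S401 and S402 can be sketched as follows. The set `SOURCE_EXTENSIONS` is an assumption: the text names only ".c" as a source extension and ".txt" as a non-source extension, and the other entries are added purely for illustration.

```python
from pathlib import PurePosixPath

# Only ".c" comes from the text; the remaining extensions are assumed examples.
SOURCE_EXTENSIONS = {".c", ".h", ".py", ".java"}

def has_source_file(changed_files):
    """Return True if any changed file looks like a source file (S401, S402)."""
    return any(
        PurePosixPath(file_name).suffix.lower() in SOURCE_EXTENSIONS
        for file_name in changed_files
    )

print(has_source_file(["docs/README.txt"]))
print(has_source_file(["docs/README.txt", "src/main.c"]))
```

    A corpus builder would call such a check before generating a sequence and skip the commit when it returns False, which corresponds to the "No in S402" branch.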
  • FIG. 8 is a flowchart for explaining an example of the processing procedure executed by the related commit specifying unit 13.
  • The related commit specifying unit 13 first reads the target processed software information and the target corpus (S501, S502).
  • Subsequently, the related commit specifying unit 13 determines whether or not there is an unprocessed sequence in the sequence group included in the target corpus (S503). "Unprocessed" means not yet having been processed in step S504 and the subsequent steps.
  • If there is an unprocessed sequence, the related commit specifying unit 13 acquires one of the unprocessed sequences as a processing target (hereinafter referred to as the "target sequence") (S504).
  • Subsequently, the related commit specifying unit 13 applies the similarity calculation rule to the target sequence to calculate its similarity to the target processed software information (S505). That is, the similarity to the target processed software information is calculated for each sequence.
  • As the similarity calculation rule, for example, a rule is conceivable in which, for each word constituting the target processed software information, the appearance frequency (number of occurrences) in the target sequence is counted, and the similarity between the target processed software information and the target sequence is calculated based on the appearance frequency of each word.
  • For example, the total, the average, or the maximum of the appearance frequencies counted for the words constituting the target processed software information may be used as the similarity.
  • In this case, the appearance frequency may also be counted for each sequence other than the target sequence, and for a word whose appearance frequency in each of the other sequences is lower than its appearance frequency in the target sequence, the appearance frequency in the target sequence may be weighted. For example, the frequency of occurrence of the word in the target sequence may be multiplied by the value obtained by dividing that frequency by the frequency of occurrence of the word in each of the other sequences.
  • The similarity between the target processed software information and the target sequence may also be calculated by another method.
  • Once the similarity has been calculated for every sequence, the related commit specifying unit 13 identifies the sequence having the maximum similarity (S506) and outputs the identifier of the commit from which that sequence was generated (S507). That is, the commit from which the sequence having the maximum similarity was generated is specified as the related commit for the target software information.
  • The related commit specifying unit 13 may instead specify the N sequences having the highest similarity (N > 1). In this case, the related commit specifying unit 13 may output, as the identifiers of the related commits, the identifiers of the commits from which the specified sequences were generated. The identifiers may be output in descending order of similarity. Further, the value of N may be input by the user.
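  • The frequency-based similarity and the top-N selection of FIG. 8 can be sketched as follows. The function names, the `mode` parameter, and the case-insensitive matching are assumptions made for this example; the weighting variant described above is not implemented here.

```python
from collections import Counter

def similarity(query_words, sequence, mode="sum"):
    """Similarity of one sequence to the processed software information (S505).

    For each query word, count its occurrences in the sequence, then
    aggregate by total ("sum"), average ("mean"), or maximum ("max").
    Matching ignores case, which is an assumption of this sketch.
    """
    counts = Counter(word.lower() for word in sequence)
    frequencies = [counts[word.lower()] for word in query_words]
    if not frequencies:
        return 0.0
    if mode == "sum":
        return float(sum(frequencies))
    if mode == "mean":
        return sum(frequencies) / len(frequencies)
    if mode == "max":
        return float(max(frequencies))
    raise ValueError(f"unknown mode: {mode}")

def top_related_commits(query_words, corpus, n=1, mode="sum"):
    """Return the identifiers of the n most similar commits (S506, S507)."""
    scored = [
        (similarity(query_words, sequence, mode), commit_id)
        for commit_id, sequence in corpus.items()
    ]
    # Descending by similarity; ties keep the order of the corpus.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [commit_id for _, commit_id in scored[:n]]

corpus = {
    "c1": ["fix", "typo", "readme"],
    "c2": ["escape", "html", "token", "token"],
    "c3": ["token", "refresh"],
}
print(top_related_commits(["html", "token"], corpus, n=2))
```

    In this toy corpus, "c2" scores 3 (one "html" plus two "token") and "c3" scores 1, so they are returned in that order when n is 2.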
  • As described above, the related commit for the target software information is specified based on the comparison (similarity) between the target processed software information and the sequence of each commit. That is, the identification of the related commit depends largely on the content of the target processed software information and the content of each sequence.
  • Therefore, external information for improving the accuracy of the similarity, that is, information that causes a high similarity to be calculated between software information and a commit that are generally or empirically considered to be highly relevant to each other, may be added to the target processed software information or to each sequence.
  • For example, a dictionary (hereinafter referred to as a "software change-related word dictionary") whose keys are change contents that are generally or empirically included in commits, and whose values are words characterizing the commits that are generally or empirically considered to be relevant to those change contents, may be generated in advance and stored in the auxiliary storage device 102 or the like.
  • An example entry of a software change-related word dictionary specialized for vulnerabilities is shown below.
  • [Key (change content)] cross site scripting
    [Value] Injection, websites, HTML, token
  • The software change-related word dictionary may be generated by manually extracting the key change contents and the value words from past text that can correspond to software information (text included in release notes, vulnerability information, etc.).
  • Alternatively, the software change-related word dictionary may be generated by mechanically extracting the key change contents and the value words from such past text by using natural language processing or machine learning.
  • the software information processing unit 11 may execute the following processing after executing the processing procedure of FIG.
  • For each word added to the processed software information in step S102, the software information processing unit 11 searches the software change-related word dictionary for a key that matches the word or a key that includes the word. When there is a corresponding key, the software information processing unit 11 adds the values (words) associated with that key in the software change-related word dictionary to the target processed software information as the above-mentioned external information.
  • Similarly, when generating the sequence of each commit for the corpus, the corpus generation unit 12 may add to the sequence external information that is considered to contribute to the accuracy of the similarity.
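  • The dictionary lookup described above can be sketched as follows. The dictionary literal reuses the single example entry from the text; the function name, the case-insensitive whole-word matching used to decide that a key "includes" a word, and the suppression of duplicate additions are assumptions of this sketch.

```python
# Example entry taken from the text; a real dictionary would hold many entries.
CHANGE_WORD_DICTIONARY = {
    "cross site scripting": ["Injection", "websites", "HTML", "token"],
}

def expand_with_dictionary(words, dictionary):
    """Append the values of every key that matches or includes a word.

    A key "includes" a word here when the word equals one of the key's
    space-separated terms, compared without regard to case.
    """
    expanded = list(words)
    for word in words:
        for key, values in dictionary.items():
            key_lower = key.lower()
            if word.lower() == key_lower or word.lower() in key_lower.split():
                for value in values:
                    if value not in expanded:   # avoid adding the same word twice
                        expanded.append(value)
    return expanded

print(expand_with_dictionary(["scripting", "bug"], CHANGE_WORD_DICTIONARY))
```

    The same helper could be applied to a commit's sequence instead of the processed software information, which is one way to read the remark that the corpus generation unit 12 may also add external information.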
  • As described above, according to the present embodiment, it is possible to automatically identify the commit that is considered to be highly relevant to the target software information. Therefore, the user can quickly identify the commit related to the target software information even without expertise in the implementation of the software. That is, it is possible to streamline the work of identifying the commit related to a certain sentence from the list of commits to the software.
  • The software information processing unit 11 is an example of the extraction unit.
  • The corpus generation unit 12 is an example of the generation unit.
  • The related commit specifying unit 13 is an example of the specifying unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This relevant commit identification device comprises: an extraction unit that extracts a string of words from a text input in relation to a certain piece of software; a generation unit that generates, for each of a plurality of commits generated with respect to the software, a string of characters including the content of the commit; and an identification unit that identifies a commit having a relatively high relevance to the text from among the plurality of commits on the basis of a comparison between the string of words and each of the strings of characters. Consequently, the relevant commit identification device enhances the efficiency of an operation which identifies a commit related to a certain text from among a list of commits with respect to software.

Description

Related commit specifying device, related commit specifying method, and program
The present invention relates to a related commit specifying device, a related commit specifying method, and a program.
In software development, a series of changes made for a certain purpose is grouped into a unit called a commit. A commit includes its identifier, the change difference (changed file list) of the files related to the software (source files, various documents, etc.), the developer who made the change, the date and time of the change, and descriptive text that explains the content of the change (including its purpose) in natural language. The changed file list is a list of the file names of the changed files and the corrected locations (line numbers where corrections were made) in each file.
When a new version of software is released, release notes are generally created. In addition to text describing the changes from the previous version in natural language, the release notes may include a list of the commits that made those changes.
If the text and the commits related to the software changes are listed in the release notes, users can operate the software smoothly. For example, assume that a vulnerability inherent in the software has been discovered and a version in which the vulnerability has been fixed has been released. In such a situation, a user of the software wants to determine whether the vulnerability affects his or her own usage. The user first searches the release notes for text describing the vulnerability and identifies the commit in which the vulnerability was fixed. The user then reviews the identified commit to determine whether it affects the user's usage.
However, because release notes are created by the developers who made the software changes (including fixes), their contents vary. For example, some developers do not create release notes at all, and others describe only major changes in them.
If the text about the software changes and the commit list have not been created, or if the commit list is incomplete, the user needs to associate the software changes with the commits manually. For example, to identify the commit that fixes a vulnerability indicated by publicly available vulnerability information such as CVE (Common Vulnerabilities and Exposures), a person (for example, a user or a security researcher) must search the commit list for the corresponding commit based on words related to the vulnerability (for example, Non-Patent Document 1).
For a person to associate text written in natural language about a measure such as adding a software function, fixing a bug, or addressing a vulnerability with the commit in which that measure was taken, the person needs to use expertise in the implementation of the software to search for related commits and check the search results one by one.
Therefore, identifying the relevant commit requires expertise in the implementation of the software. In addition, identifying the target commit among hundreds to thousands of commits takes time. As a result, when a new version of certain software is released, deciding whether to use that version is not easy for the user, or the decision takes time.
The present invention has been made in view of the above points, and an object of the present invention is to streamline the work of identifying a commit related to a certain sentence from a list of commits to software.
To solve the above problem, the related commit specifying device has: an extraction unit that extracts a word string from a sentence input for certain software; a generation unit that generates, for each of a plurality of commits generated for the software, a character string including the content of the commit; and a specifying unit that identifies, from among the plurality of commits, a commit that is relatively highly relevant to the sentence based on a comparison between the word string and each character string.
It is possible to streamline the work of identifying a commit related to a certain sentence from a list of commits to software.
FIG. 1 is a diagram showing a hardware configuration example of the related commit specifying device 10 according to the embodiment of the present invention.
FIG. 2 is a diagram showing a functional configuration example of the related commit specifying device 10 according to the embodiment of the present invention.
FIG. 3 is a flowchart for explaining an example of the processing procedure executed by the software information processing unit 11.
FIG. 4 is a flowchart for explaining an example of the processing procedure of the processed software information generation process.
FIG. 5 is a flowchart for explaining an example of the processing procedure executed by the corpus generation unit 12.
FIG. 6 is a flowchart for explaining a first example of the processing procedure of the corpus generation process.
FIG. 7 is a flowchart for explaining a second example of the processing procedure of the corpus generation process.
FIG. 8 is a flowchart for explaining an example of the processing procedure executed by the related commit specifying unit 13.
Hereinafter, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a diagram showing an example hardware configuration of the related commit specifying device 10 according to the embodiment of the present invention. The related commit specifying device 10 of FIG. 1 includes a drive device 100, an auxiliary storage device 102, a memory device 103, a CPU 104, an interface device 105, a display device 106, an input device 107, and the like, which are connected to one another by a bus B.
A program that implements the processing in the related commit specifying device 10 is provided on a recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive device 100, the program is installed from the recording medium 101 into the auxiliary storage device 102 via the drive device 100. However, the program does not necessarily have to be installed from the recording medium 101 and may instead be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program as well as necessary files, data, and the like.
When an instruction to start the program is given, the memory device 103 reads the program from the auxiliary storage device 102 and stores it. The CPU 104 implements the functions of the related commit specifying device 10 in accordance with the program stored in the memory device 103. The interface device 105 is used as an interface for connecting to a network. The display device 106 displays a GUI (Graphical User Interface) or the like provided by the program. The input device 107 consists of a keyboard, a mouse, and the like, and is used to input various operation instructions.
FIG. 2 is a diagram showing an example functional configuration of the related commit specifying device 10 according to the embodiment of the present invention. As shown in FIG. 2, the related commit specifying device 10 includes a software information processing unit 11, a corpus generation unit 12, and a related commit specifying unit 13. Each of these units is implemented by processing that one or more programs installed in the related commit specifying device 10 cause the CPU 104 to execute.
When software information is input, the software information processing unit 11 generates processed software information by performing preprocessing on the software information.
Software information is text, written in natural language, that describes changes from the immediately preceding version in the functions and the like of certain software that has been upgraded (hereinafter referred to as the "target software"; the latest version of the target software is referred to as the "target version"). For example, text about software changes included in the release notes for the target version, or vulnerability information such as CVE (Common Vulnerabilities and Exposures) entries, may be used as software information. Alternatively, text created by a user of the related commit specifying device 10 may be input as software information. In this case, the user may create, as software information, text describing the changes to a function that is of interest to the user.
Processed software information, on the other hand, is a word string extracted from the software information by performing processing such as stop-word removal on the software information.
For example, when the software information is "In ih264d_init_decoder of ih264d_api.c, there is a possible out of bounds write due to a use after free.", the word string "ih264d init decoder ih264d api c possible bounds write due use after free" is generated (extracted) as the processed software information.
The corpus generation unit 12 takes as input a list of commits for the target version (for example, the list of commits included in the release notes of the target version) and generates a corpus by applying a corpus generation rule to each commit in the list. A commit is a unit that groups together a series of changes made for a certain purpose in software development. A commit includes the identifier of the commit, the change difference of the files related to the software (source files, various documents, and the like), that is, a list of the changed files (hereinafter referred to as the "changed file list"), the developer who made the change, the date and time of the change, and a description that explains the content of the change (including its purpose) in natural language (hereinafter referred to as the "change description").
A corpus is a database that has, as attributes for each commit in the commit list, the identifier of the commit and a sequenced form of the "changed file list" and "change description" contained in the commit (hereinafter simply referred to as a "sequence"). Sequencing means, for example, converting the text into the form of an array of words. In this case, words that are stop words may be removed, and compound words may be split into individual words. Alternatively, the sequence may be a character string obtained by simply concatenating the "changed file list" and the "change description".
The related commit specifying unit 13 takes the processed software information and the corpus as input, specifies from the corpus the commits related to the software information, that is, the commits having relatively high relevance to the software information (hereinafter referred to as "related commits"), and outputs the identifiers of the related commits.
Hereinafter, the processing procedures executed by the related commit specifying device 10 will be described. FIG. 3 is a flowchart for explaining an example of the processing procedure executed by the software information processing unit 11.
When the software information processing unit 11 reads the input software information (hereinafter referred to as the "target software information") (S101), it executes the processed software information generation process (S102). In this process, processed software information is generated by applying processing rules to the target software information. The software information processing unit 11 then outputs the generated processed software information (hereinafter referred to as the "target processed software information") (S103).
Next, the details of step S102 will be described. The processing rules for software information include removing stop words and splitting compound words. For example, removing the stop word "this" from the software information, and splitting a compound word such as forceStop into "force" and "Stop", are examples of such rules.
FIG. 4 is a flowchart for explaining an example of the processing procedure for generating processed software information.
In step S201, the software information processing unit 11 parses the target software information and generates (extracts) a list of the words it contains (hereinafter referred to as the "word string").
Next, the software information processing unit 11 acquires an unprocessed word from the word string as the processing target (S202). For example, the words to be processed may be acquired in the order in which they appear in the word string. Hereinafter, the word being processed is referred to as the "target word". An unprocessed word is a word that has not yet been treated as the target word.
Next, the software information processing unit 11 determines whether the target word is a stop word (S203). Whether a word is a stop word may be determined using known techniques. If the target word is a stop word (Yes in S203), steps S204 to S206 are skipped for the target word, and the process proceeds to step S207.
If the target word is not a stop word (No in S203), the software information processing unit 11 determines whether the target word is a compound word (S204). If the target word is a compound word (Yes in S204), the software information processing unit 11 splits it into a plurality of words (S205) and adds each of the resulting words to the target processed software information (S206). If the target word is not a compound word (No in S204), the software information processing unit 11 adds the target word as it is to the target processed software information (S206). After step S206, the process proceeds to step S207.
In step S207, the software information processing unit 11 determines whether all the words in the word string generated in step S201 have been processed. If there is an unprocessed word (No in S207), step S202 and the subsequent steps are repeated. If there is no unprocessed word (Yes in S207), the processing procedure of FIG. 4 ends.
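The procedure of FIG. 4 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the embodiment: the names `STOPWORDS`, `split_compound`, and `preprocess` are hypothetical, the stop-word list is a small set chosen only so that the example sentence given earlier is reproduced, and the compound-word heuristic (splitting on separators such as `_` and `.` and on camelCase boundaries) is an assumption, since the text leaves both determinations to known techniques.

```python
import re

# Illustrative stop-word list (an assumption); a real system would use an
# established, much larger list.
STOPWORDS = {"in", "of", "there", "is", "a", "out", "to", "this"}

# Heuristic for camelCase pieces: an all-caps run, or an optionally
# capitalized run of lowercase letters and digits.
_CAMEL = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z0-9]+")


def split_compound(word):
    """Split a word on non-alphanumeric separators and camelCase boundaries."""
    parts = []
    for chunk in re.split(r"[^A-Za-z0-9]+", word):
        parts.extend(_CAMEL.findall(chunk))
    return parts


def preprocess(text):
    """Return the processed software information as a list of words."""
    processed = []
    for raw in text.split():                       # S201-S202: take each word in order
        word = raw.strip(".,;:!?()\"'")            # drop surrounding punctuation
        if not word or word.lower() in STOPWORDS:  # S203: skip stop words
            continue
        parts = split_compound(word)               # S204-S205: split compound words
        # S206: add the split words, or the word itself if it is not a compound
        processed.extend(parts if len(parts) > 1 else [word])
    return processed
```

With these assumptions, `preprocess` applied to the example sentence yields the words `ih264d init decoder ih264d api c possible bounds write due use after free`, and `split_compound("forceStop")` yields `force` and `Stop`.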
FIG. 5 is a flowchart for explaining an example of the processing procedure executed by the corpus generation unit 12.
In step S301, the corpus generation unit 12 reads the commit list of the target version (hereinafter referred to as the "target commit list"). Next, the corpus generation unit 12 determines whether any of the commits in the target commit list are unprocessed (S302). Unprocessed means that the commit has not yet been handled in step S303 and the subsequent steps.
If there is an unprocessed commit (Yes in S302), the corpus generation unit 12 acquires one of the unprocessed commits as the processing target (hereinafter referred to as the "target commit") (S303).
Next, the corpus generation unit 12 executes the corpus generation process (S304). In the corpus generation process, the corpus generation rule is applied to the target commit to generate corpus data for the target commit. The corpus generation unit 12 then adds the corpus data to the corpus corresponding to the target commit (hereinafter referred to as the "target corpus") (S305). When step S303 and the subsequent steps have been executed for all the commits in the commit list (No in S302), the processing procedure of FIG. 5 ends.
Next, the details of step S304 will be described. FIG. 6 is a flowchart for explaining a first example of the processing procedure of the corpus generation process.
In step S411, the corpus generation unit 12 generates a sequence for the target commit (hereinafter referred to as the "target sequence") by sequencing the changed file list and the natural-language change description contained in the target commit.
Next, the corpus generation unit 12 acquires an unprocessed word from the words in the target sequence as the processing target (hereinafter referred to as the "target word") (S412). Unprocessed means that the word has not yet been handled in step S413 and the subsequent steps.
Next, the corpus generation unit 12 determines whether the target word is a stop word (S413). If the target word is a stop word (Yes in S413), the corpus generation unit 12 deletes the target word from the target sequence (S414) and proceeds to step S418.
If the target word is not a stop word (No in S413), the corpus generation unit 12 determines whether the target word is a compound word (S415). If the target word is a compound word (Yes in S415), the corpus generation unit 12 splits it into a plurality of words (S416), replaces the target word in the target sequence with the resulting group of words (S417), and proceeds to step S418.
In step S418, the corpus generation unit 12 determines whether all the words in the target sequence have been processed. If there is an unprocessed word (No in S418), step S412 and the subsequent steps are repeated. If there is no unprocessed word (Yes in S418), the processing procedure of FIG. 6 ends.
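The corpus generation of FIG. 5 and FIG. 6 can be sketched in the same way. The following is a hedged Python illustration: the `Commit` class, the function names, the sample commit, and the stop-word list are all hypothetical, and the corpus is modeled simply as a dictionary from commit identifier to sequence.

```python
import re
from dataclasses import dataclass

# Illustrative stop-word list (an assumption).
STOPWORDS = {"a", "an", "the", "in", "of", "to", "is", "this"}


@dataclass
class Commit:
    commit_id: str       # identifier of the commit
    changed_files: list  # file names taken from the changed file list
    message: str         # natural-language change description


def to_sequence(commit):
    """Build the word sequence for one commit (FIG. 6)."""
    # S411: sequence the changed file list and the change description
    raw = " ".join(commit.changed_files) + " " + commit.message
    sequence = []
    for word in raw.split():                       # S412: take each word in order
        word = word.strip(".,;:!?()\"'")
        if not word or word.lower() in STOPWORDS:  # S413-S414: drop stop words
            continue
        # S415-S417: split on separators and on lower-to-upper camelCase
        # boundaries; a non-compound word comes back unchanged
        pattern = r"[^A-Za-z0-9]+|(?<=[a-z0-9])(?=[A-Z])"
        sequence.extend(p for p in re.split(pattern, word) if p)
    return sequence


def build_corpus(commits):
    """Map each commit identifier to its sequence (FIG. 5, S302-S305)."""
    return {c.commit_id: to_sequence(c) for c in commits}
```

For a hypothetical commit `Commit("abc123", ["decoder/ih264d_api.c"], "Fix a use after free in the forceStop path")`, the resulting sequence is `decoder ih264d api c Fix use after free force Stop path`. Splitting on empty matches with `re.split` requires Python 3.7 or later.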
Alternatively, the processing procedure of FIG. 7 may be executed as the corpus generation process instead of the processing procedure of FIG. 6. FIG. 7 is a flowchart for explaining a second example of the processing procedure of the corpus generation process. In FIG. 7, steps identical to those in FIG. 6 are given the same step numbers, and their description is omitted. In FIG. 7, steps S401 and S402 are added before step S411.
In step S401, the corpus generation unit 12 acquires the file name of each changed file (hereinafter referred to as a "changed file") from the changed file list contained in the target commit. Next, based on the file name of each changed file, the corpus generation unit 12 determines whether the changed file list contains a source code file (source file) (S402). For example, whether the file with a given name is a source file can be determined from the extension of the file name. Specifically, the extension of a source file written in the C language is ".c", whereas the extension of a text file that contains some explanatory text but no source code is ".txt". In this way, whether each changed file is a source file can be determined from its extension.
If the changed file list contains a source file (Yes in S402), the corpus generation unit 12 executes step S411 and the subsequent steps. If the changed file list contains no source file (No in S402), the corpus generation unit 12 does not execute step S411 and the subsequent steps.
That is, a commit is created not only when a source file is changed but also when any of the various documents that accompany the software is changed. With the processing procedure of FIG. 7, sequences are generated only for commits in which a source file was modified (commits that affect the behavior of the target software).
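The check in steps S401 and S402 can be sketched as follows. This is a minimal Python illustration; the function name and the set of extensions treated as source files are assumptions (the text names only ".c" and ".txt").

```python
import os

# Extensions treated as source files; only ".c" comes from the text,
# the others are illustrative assumptions.
SOURCE_EXTENSIONS = {".c", ".h", ".cpp", ".java", ".py"}


def has_source_file(changed_files):
    """Return True if any changed file looks like a source file (S401-S402)."""
    return any(
        os.path.splitext(name)[1].lower() in SOURCE_EXTENSIONS
        for name in changed_files
    )
```

A commit whose changed file list is `["README.txt", "src/ih264d_api.c"]` passes the check, while one that changes only `["README.txt"]` is skipped and no sequence is generated for it.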
FIG. 8 is a flowchart for explaining an example of the processing procedure executed by the related commit specifying unit 13.
The related commit specifying unit 13 first reads the target processed software information and the target corpus (S501, S502).
Next, the related commit specifying unit 13 determines whether any of the sequences in the target corpus are unprocessed (S503). Unprocessed means that the sequence has not yet been handled in step S504 and the subsequent steps. If there is an unprocessed sequence (Yes in S503), the related commit specifying unit 13 acquires one of the unprocessed sequences as the processing target (hereinafter referred to as the "target sequence") (S504). Next, the related commit specifying unit 13 applies a similarity calculation rule to the target sequence to calculate its similarity to the target processed software information (S505). That is, the similarity to the target processed software information is calculated for each sequence.
One possible similarity calculation rule is to count, for each word in the target processed software information, its frequency of occurrence (number of occurrences) in the target sequence, and to calculate the similarity between the target processed software information and the target sequence from these frequencies.
Specifically, the total, the average, or the maximum of the frequencies counted for the words in the target processed software information may be used as the similarity. In this case, the larger the total, average, or maximum, the higher the similarity.
Furthermore, for each word in the target processed software information, the frequency of occurrence may also be counted in each sequence other than the target sequence, and, for a word whose frequency in the other sequences is lower than its frequency in the target sequence, the frequency in the target sequence may be weighted. For example, the frequency of the word in the target sequence may be multiplied by the value obtained by dividing that frequency by the frequency of the word in the other sequences.
However, the similarity between the target processed software information and the target sequence may be calculated by other methods.
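The frequency-based similarity rules described above can be sketched as follows. This is a hedged Python illustration with hypothetical function names. The weighted variant is only one possible reading: it compares against the mean frequency over the other sequences and uses the ratio (tf + 1) / (mean + 1), where the add-one smoothing is an assumption introduced to avoid division by zero and is not stated in the text.

```python
from collections import Counter


def similarity(query_words, sequence, mode="sum"):
    """Total, average, or maximum frequency of the query words in a sequence."""
    counts = Counter(sequence)
    freqs = [counts[w] for w in query_words]  # occurrences of each query word
    if not freqs:
        return 0.0
    if mode == "sum":
        return float(sum(freqs))
    if mode == "mean":
        return sum(freqs) / len(freqs)
    if mode == "max":
        return float(max(freqs))
    raise ValueError(f"unknown mode: {mode}")


def weighted_similarity(query_words, target_id, corpus):
    """Sum of frequencies, boosting words that are rarer in the other sequences."""
    target = Counter(corpus[target_id])
    others = [Counter(seq) for cid, seq in corpus.items() if cid != target_id]
    score = 0.0
    for word in query_words:
        tf = target[word]
        if tf == 0:
            continue
        # Mean frequency in the other sequences (an assumed aggregation)
        other_mean = sum(c[word] for c in others) / len(others) if others else 0.0
        if other_mean < tf:
            # Assumed weight with add-one smoothing; always greater than 1 here
            score += tf * (tf + 1) / (other_mean + 1)
        else:
            score += tf
    return score
```

The weighting plays a role similar to inverse document frequency: a word that appears often in one commit but rarely elsewhere contributes more to that commit's score.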
When the similarity has been calculated for all the sequences (No in S503), the related commit specifying unit 13 specifies the sequence with the highest similarity (S506) and outputs the commit from which that sequence was generated (S507). That is, the commit from which the sequence with the highest similarity was generated is specified as the related commit for the target software information.
However, the related commit specifying unit 13 may instead specify a plurality of sequences, namely those ranked in the top N (N > 1) by similarity. In this case, the related commit specifying unit 13 may output the identifiers of the commits from which the specified sequences were generated as the identifiers of the related commits. The identifiers may be output in descending order of similarity. The value of N may be input by the user.
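Selecting the related commits from the calculated similarities (S506, S507, and the top-N variant) can be sketched as follows. This is a minimal Python illustration; the function name `top_related` and the example scores are hypothetical.

```python
def top_related(scores, n=1):
    """Return the identifiers of the n commits with the highest similarity.

    scores maps each commit identifier to the similarity of its sequence.
    Identifiers are returned in descending order of similarity.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [commit_id for commit_id, _ in ranked[:n]]
```

With `scores = {"c1": 6.0, "c2": 1.0, "c3": 0.0}`, the default call returns only `c1`, and `n=2` returns `c1` followed by `c2`.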
As is clear from the above, the related commits for the target software information are specified based on a comparison (similarity) between the target processed software information and the sequence of each commit. That is, the specification of related commits depends heavily on the content of the target processed software information and the content of each sequence.
Therefore, external information may be added to the target processed software information or to each sequence so that a high similarity is calculated between software information and commits that are generally or empirically considered to be highly related (that is, external information for improving the accuracy of the similarity).
For example, as information for adding such external information to the processed software information, a dictionary (hereinafter referred to as the "software change-related word dictionary") may be generated in advance and stored in the auxiliary storage device 102 or the like. Each key of this dictionary is a change content that is generally or empirically included in commits, and each value is a set of words that characterize commits generally or empirically considered to be highly related to that change content.
For example, an entry of a software change-related word dictionary specialized for vulnerabilities is shown below.
[Key (change content)]
 cross site scripting
[Value]
 Injection, websites, HTML, token
The software change-related word dictionary may be generated by manually extracting the change contents that serve as keys and the words that serve as values from past texts that can correspond to software information (text included in release notes, vulnerability information, and the like). Alternatively, the dictionary may be generated by mechanically extracting the keys and values from such past texts using natural language processing or machine learning.
When the software change-related word dictionary is used, the software information processing unit 11 may execute the following processing after executing the processing procedure of FIG. 4.
For each word added to the processed software information in step S102 of FIG. 3, the software information processing unit 11 searches the software change-related word dictionary for a key that matches the word or a key that contains the word. If such a key exists, the software information processing unit 11 adds the values (words) associated with that key in the software change-related word dictionary to the target processed software information as the external information described above.
Adding such words to the target processed software information can be expected to improve the accuracy of the similarity and, in turn, the accuracy with which related commits are specified.
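The dictionary lookup described above can be sketched as follows. This is a hedged Python illustration: the function name is hypothetical, the dictionary holds only the single example entry from the text, and "a key that contains the word" is interpreted as a case-insensitive whole-word match within the key, which is an assumption.

```python
# Software change-related word dictionary with the single example entry
# from the text; a real dictionary would hold many entries.
CHANGE_RELATED_WORDS = {
    "cross site scripting": ["Injection", "websites", "HTML", "token"],
}


def expand_with_dictionary(words, dictionary):
    """Append dictionary values for every key that matches or contains a word."""
    expanded = list(words)
    for word in words:
        lowered = word.lower()
        for key, related in dictionary.items():
            key_lowered = key.lower()
            # Match the whole key, or any single word inside the key
            if lowered == key_lowered or lowered in key_lowered.split():
                for extra in related:
                    if extra not in expanded:  # avoid adding duplicates
                        expanded.append(extra)
    return expanded
```

For the processed words `possible scripting attack`, the word `scripting` matches the key and the four values are appended. Matching on a single word of the key is deliberately loose; a stricter rule (for example, requiring the full key phrase) may be preferable when the dictionary is large.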
Similarly, when generating each sequence of the corpus, the corpus generation unit 12 may add to the sequence external information that is considered to contribute to the accuracy of the similarity.
Although an example in which the search range for related commits is limited to the commit list of the target version has been described above, the commit lists for some or all of the past versions of the target software may be added to the search range.
As described above, according to the present embodiment, commits considered to be highly related to the target software information can be specified automatically. Therefore, even a user who lacks expertise in the implementation of the software can identify the commits related to the target software information in a short time. That is, the work of identifying, from a list of commits to software, the commits related to a given sentence can be streamlined.
In the present embodiment, the software information processing unit 11 is an example of the extraction unit. The corpus generation unit 12 is an example of the generation unit. The related commit specifying unit 13 is an example of the specifying unit.
Although an embodiment of the present invention has been described in detail above, the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.
10     Related commit specifying device
11     Software information processing unit
12     Corpus generation unit
13     Related commit specifying unit
100    Drive device
101    Recording medium
102    Auxiliary storage device
103    Memory device
104    CPU
105    Interface device
106    Display device
107    Input device
B      Bus

Claims (7)

  1.  A related commit specifying device comprising:
     an extraction unit that extracts a word string from a sentence input regarding certain software;
     a generation unit that generates, for each of a plurality of commits generated for the software, a character string including the content of the commit; and
     a specifying unit that specifies, from among the plurality of commits, a commit having relatively high relevance to the sentence, based on a comparison between the word string and each of the character strings.
  2.  The related commit specifying device according to claim 1, wherein the specifying unit calculates a similarity between the word string and each of the character strings, and specifies the commit associated with a character string having a relatively high similarity.
  3.  The related commit specifying device according to claim 1 or 2, wherein the generation unit generates, for each of the plurality of commits, a character string including each word contained in a document that is included in the commit and describes the changes to the software.
  4.  A related commit identification method in which a computer executes:
     an extraction procedure of extracting a word string from a sentence input regarding certain software;
     a generation procedure of generating, for each of a plurality of commits generated for the software, a character string including the content of the commit; and
     an identification procedure of identifying, from among the plurality of commits, a commit having relatively high relevance to the sentence, based on a comparison between the word string and each of the character strings.
  5.  The related commit identification method according to claim 4, wherein the identification procedure calculates a similarity between the word string and each of the character strings, and identifies the commit associated with a character string having a relatively high similarity.
  6.  The related commit identification method according to claim 4 or 5, wherein the generation procedure generates, for each of the plurality of commits, a character string including each word contained in a document that is included in the commit and describes the changes made to the software.
  7.  A program that causes a computer to execute the related commit identification method according to any one of claims 4 to 6.
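
The flow recited in claims 1 to 3 (extract a word string from the input sentence, build one character string per commit from its descriptive text, compare, and pick the commits with relatively high similarity) can be illustrated with a minimal Python sketch. This is not the claimed implementation. The claims do not fix a similarity measure, so cosine similarity over raw word counts is an assumption made here, as are the function names `extract_words`, `cosine_similarity`, and `rank_commits`, the commit identifiers, and the sample messages. The simple ASCII tokenizer is also an assumption; Japanese text would need a morphological analyzer instead.

```python
import math
import re
from collections import Counter


def extract_words(text):
    """Lowercase the text and split it into ASCII word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors (Counter objects)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_commits(sentence, commits, top_k=3):
    """Return up to top_k (commit_id, score) pairs, most similar first.

    `commits` maps a hypothetical commit identifier to its descriptive text.
    """
    query = Counter(extract_words(sentence))
    scored = [
        (commit_id, cosine_similarity(query, Counter(extract_words(message))))
        for commit_id, message in commits.items()
    ]
    # Stable sort: ties keep the insertion order of `commits`.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical commit history, for illustration only.
    commits = {
        "c1": "Fix login timeout when session token expires",
        "c2": "Update README with build instructions",
        "c3": "Add retry logic to session token refresh",
    }
    query = "login fails after session token expires"
    for commit_id, score in rank_commits(query, commits):
        print(commit_id, round(score, 3))
```

Returning a ranked list rather than a single commit mirrors the claim wording "relatively high": the caller decides how many candidates to inspect through `top_k`.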
PCT/JP2020/018267 2020-04-30 2020-04-30 Relevant commit identification device, relevant commit identification method, and program WO2021220461A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/018267 WO2021220461A1 (en) 2020-04-30 2020-04-30 Relevant commit identification device, relevant commit identification method, and program

Publications (1)

Publication Number Publication Date
WO2021220461A1 (en) 2021-11-04

Family

ID=78331902

Country Status (1)

Country Link
WO (1) WO2021220461A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014157598A (en) * 2013-01-21 2014-08-28 Nec Corp Software asset management device, software asset management method, and software asset management program

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAWAHIRA, KOSUKE : "Improving the efficiency of embedded software developments based on impact analysis by requirements change", IPSJ SIG TECHNICAL REPORTS, vol. 2008-SE-159, 17 March 2008 (2008-03-17), pages 17 - 24, XP009531951, ISSN: 0919-6072 *
KAWAI, HIROKI : "Linking fixed bug report to source code using N-gram", IEICE TECHNICAL REPORT, vol. 112, 27 February 2013 (2013-02-27), pages 57 - 62, XP009531950, ISSN: 0913-5685 *

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20933816

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 20933816

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP