WO2024069741A1

WO2024069741A1 - Software technological field extraction device and software technological field extraction method

Info

Publication number: WO2024069741A1
Application number: PCT/JP2022/035898
Authority: WO
Inventors: 啓太森; 陽一郎古賀; 俊直石井
Original assignee: 三菱電機株式会社
Priority date: 2022-09-27
Filing date: 2022-09-27
Publication date: 2024-04-04

Abstract

The purpose of the present disclosure is to suitably extract a technological field from a software development deliverable and provide the technological field to a user. This software technological field extraction device (101) comprises: a preprocessing unit (11) that preprocesses a software development deliverable, thereby creating preprocessed data; a classification model construction unit (18) that creates a classification model for automating extraction of the technological field from the preprocessed data; a technological field extraction unit (23) that extracts the technological field of the deliverable from the preprocessed data by using the classification model; a skill map creation unit (26) that aggregates the technological fields extracted by the technological field extraction unit (23) for each individual or organization, thereby creating a skill map that represents the proportions of technological fields to which the individuals or organizations are related; and an output control unit (32) that causes an output device to output the skill map.

Description

Apparatus and method for extracting software technical field

This disclosure relates to a technique for extracting software technical fields from the results of software development.

In the field of natural language processing, there are examples of cluster classification using machine learning models. Spam filters are one example.

Topic analysis is one of the cluster classification techniques. Topic analysis is a technique for classifying texts into any number of topics that describe the text. Automatic tagging of Q&A sites is an example of topic analysis.

In the field of software development, there is technology that extracts the skills possessed by individuals or teams from past deliverables. Cited document 1 describes a mechanism for calculating an engineer's programming skill standard score from source code stored in a source code management repository.

Cited document 2 describes a system that analyzes stored files and extracts employee skills.

JP 2020-035077 A; JP 2005-202812 A; International Publication No. 2021/019942; JP 2012-221316 A; JP 2015-511733 A

In recent years, the field of software technology has become increasingly diverse. In this context, it has become difficult for organizations to extract the technical skills of the teams or developers who produce software.

Cited documents 1 and 2 describe technologies that analyze deliverables such as design documents or source code to help determine programming language proficiency or to analyze the skills of individuals. However, these technologies either evaluate a developer's skills for a predetermined programming language, or use simple string analysis to pick out words and treat those words as skills. The technical field of a team or developer can be identified by a combination of multiple keywords, but existing technologies cannot estimate the technical field from multiple keywords. In addition, if each word obtained from a deliverable is extracted as a skill, as with existing technologies, the number of extracted skills becomes enormous. This makes grouping difficult, so the results are hard to handle for people without sufficient knowledge of software technology.

This disclosure has been made to solve the above problems, and aims to appropriately extract technical fields from the results of software development and present them to users.

The software technical field extraction device disclosed herein includes a preprocessing unit that creates preprocessed data by preprocessing the software development deliverables, a classification model construction unit that creates a classification model that automates the extraction of technical fields from the preprocessed data, a technical field extraction unit that extracts the technical fields of the deliverables from the preprocessed data using the classification model, a skill map creation unit that creates a skill map that represents the proportion of technical fields to which an individual or organization is related by aggregating the technical fields extracted by the technical field extraction unit for each individual or organization, and an output control unit that causes an output device to output the skill map.

The software technology field extraction device disclosed herein can appropriately extract technology fields from software development results. The objectives, features, aspects, and advantages of the present disclosure will become more apparent from the following detailed description and the accompanying drawings.

FIG. 1 is a block diagram showing a configuration of a software technical field extraction device.
FIG. 2 is a diagram illustrating an example of source code keyword acquisition rules.
FIG. 3 is a diagram illustrating an example of a combination rule.
FIG. 4 is a diagram illustrating an example of a post-preprocessing DB.
FIG. 5 is a flowchart showing an operation of a classification model construction unit.
FIG. 6 is a diagram illustrating an example of a technical field name DB.
FIG. 7 is a diagram showing an example of a technical field confirmation screen.
FIG. 8 is a flowchart showing an operation of a technical field extraction unit.
FIG. 9 is a diagram illustrating an example of an estimation result DB.
FIG. 10 is a flowchart showing the operation of a skill map creation unit.
FIG. 11 is a diagram illustrating an example of an individual skill map DB.
FIG. 12 is a diagram illustrating an example of an organization skill map DB.
FIG. 13 is a diagram illustrating an example of an individual skill map display screen.
FIG. 14 is a diagram illustrating an example of an organization skill map display screen.
FIG. 15 is a diagram showing an example of a technical field inquiry screen.
FIG. 16 is a diagram illustrating a hardware configuration of a software technical field extraction device.
FIG. 17 is a diagram illustrating a hardware configuration of a software technical field extraction device.

<A. First embodiment>
FIG. 1 is a diagram showing the overall configuration of a software technology field extraction device 101.

A set of software development deliverables is input to the software technology field extraction device 101 in response to an instruction input from the input device 40. Here, the deliverables include design documents 30 and source code 31. The input device 40 is, for example, a terminal such as a personal computer. A user inputs instructions to the software technology field extraction device 101 by operating the screen of the terminal. The software technology field extraction device 101 creates a classification model that automates the extraction of the technology fields contained in the deliverables, and extracts the technology fields of the deliverables using the classification model. The results of the extraction of the technology fields are presented to the user by the output device 41. The output device 41 is, for example, a display device.

The software technology field extraction device 101 is configured with a preprocessing unit 11, a classification model construction unit 18, a technology field extraction unit 23, a skill map creation unit 26, and an output control unit 32.

The pre-processing unit 11 includes a design document pre-processing unit 12, a source code pre-processing unit 13, an information combination unit 14, and a memory unit 51. The memory unit 51 stores source code keyword acquisition rules 15, combination rules 16, and a post-preprocessing database 17. The pre-processing unit 11 starts processing in response to a command input from the input device 40. The design document 30 of the project is input to the design document pre-processing unit 12, and a source file containing source code 31 is input to the source code pre-processing unit 13. The software technology field extraction device 101 may acquire the design document 30 and the source code 31 through communication.

The design document preprocessing unit 12 extracts keywords by performing morphological analysis on the design document 30, and converts the design document 30 into a set of keywords. The source code preprocessing unit 13 extracts keywords by performing analysis on the source code 31 based on the source code keyword acquisition rules 15, and converts the source code into a set of keywords.

Figure 2 shows an example of source code keyword acquisition rules 15. The source code keyword acquisition rules 15 in Figure 2 define extended regular expressions for acquiring keywords for each extension of a source file that includes source code 31. The first column specifies the extension. The second column specifies the rule name. The third column specifies the extended regular expression. The fourth column specifies whether to perform lexical analysis on the acquired character string. The fifth column specifies whether the rule is enabled or disabled. The first line shows a rule that when the extension is .c or .h, the library name following #include is acquired as a keyword. The second line shows a rule that when the extension is .c or .h, the contents of the comment are acquired and morphological analysis is performed on the contents to acquire keywords. The third line shows a rule that when the extension is .c, the function name is acquired as a keyword.
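As an illustration only, the rule table described for Figure 2 could be represented as follows. This is a minimal sketch, not the disclosed implementation: the class name `KeywordRule`, the function `extract_keywords`, and the three regular expressions are assumptions made for this example, and the morphological analysis is replaced by a simple split on non-word characters.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class KeywordRule:
    extensions: tuple  # file extensions the rule applies to
    name: str          # rule name
    pattern: str       # regular expression with one capture group
    tokenize: bool     # whether to analyse the captured string further
    enabled: bool      # whether the rule is active


# Hypothetical rules modelled on the three rows described for Figure 2.
RULES = [
    KeywordRule((".c", ".h"), "include", r'#include\s*[<"]([^>"]+)[>"]', False, True),
    KeywordRule((".c", ".h"), "comment", r"/\*(.*?)\*/", True, True),
    # Simplified function-definition pattern; real C parsing is more involved.
    KeywordRule((".c",), "function", r"^\s*\w[\w\s\*]*?\b(\w+)\s*\([^;{]*\)\s*\{", False, True),
]


def extract_keywords(filename, source, rules=RULES):
    """Apply every enabled rule whose extension matches the file name."""
    keywords = []
    for rule in rules:
        if not rule.enabled or not filename.endswith(rule.extensions):
            continue
        for match in re.finditer(rule.pattern, source, re.MULTILINE | re.DOTALL):
            text = match.group(1)
            if rule.tokenize:
                # Stand-in for morphological analysis: split on non-word characters.
                keywords.extend(w for w in re.split(r"\W+", text) if w)
            else:
                keywords.append(text)
    return keywords


SAMPLE = "#include <stdio.h>\n/* motor control loop */\nint main(void) {\n    return 0;\n}\n"
print(extract_keywords("main.c", SAMPLE))
```

Keeping the rules as data rather than code mirrors the table form of Figure 2: a rule can be disabled or a new extension supported without touching the extraction loop.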

The data created by the design document preprocessing unit 12 and the source code preprocessing unit 13 is input to the information combination unit 14. The information combination unit 14 associates the design document 30 with the source code 31 based on the combination rules 16.

Figure 3 shows an example of the combination rules 16. The combination rules 16 in Figure 3 describe, in a programming language, the process of associating a design document 30 with source code 31. For each input design document, this rule searches the input source code 31 and determines that source code 31 with matching keywords is related to that design document.

When the information combination unit 14 creates data that associates the design document 30 with the source code 31, it stores this data as preprocessed data in the post-preprocessing database (DB) 17.
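The association step can be sketched as follows. This is an assumption-laden illustration, not the rule of Figure 3 itself: the function name `combine`, the dictionary layout, and the `min_shared` parameter (how many shared keywords count as a match) are hypothetical.

```python
def combine(design_docs, source_files, min_shared=1):
    """Associate each design document with source files that share keywords.

    design_docs / source_files: dict mapping a name to its list of keywords.
    min_shared is a hypothetical knob; the text only says keywords must match.
    """
    combined = []
    for doc_name, doc_keywords in design_docs.items():
        doc_set = set(doc_keywords)
        related = [
            name
            for name, keywords in source_files.items()
            if len(doc_set & set(keywords)) >= min_shared
        ]
        # Preprocessed data: design document keywords, then related source keywords.
        merged = list(doc_keywords)
        for name in related:
            merged.extend(source_files[name])
        combined.append(
            {
                "design_document": doc_name,
                "related_source_files": related,
                "keywords": merged,
            }
        )
    return combined


docs = {"A": ["can", "motor"], "B": ["html"]}
sources = {
    "XXX.c": ["can", "driver"],
    "YYY.c": ["motor", "pwm"],
    "ZZZ.c": ["socket"],
}
for record in combine(docs, sources):
    print(record)
```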

Figure 4 shows an example of the post-preprocessing DB 17. In Figure 4, a relational DB is used for the post-preprocessing DB 17, which consists of tables 401, 402, 403, and 404. Table 401 has fields such as data ID, project ID, data owner ID, design document name, related source file name, and keywords. The preprocessed data for data ID = 1 is the keywords of design document A followed by the keywords of XXX.c and YYY.c. Table 402 has fields such as project ID and project name. Table 403 has fields such as data owner ID, name, and organization ID. Table 404 has fields such as organization ID and organization name.
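For readers who prefer a concrete schema, the four tables could be laid out as below using SQLite. The table and column names, the column types, and the choice to store file names and keywords as delimited text are all assumptions for this sketch; the source only lists the fields.

```python
import sqlite3

# Hypothetical schema mirroring tables 401 to 404; names and types are assumed.
SCHEMA = """
CREATE TABLE organization (
    organization_id INTEGER PRIMARY KEY,
    organization_name TEXT NOT NULL
);
CREATE TABLE data_owner (
    data_owner_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    organization_id INTEGER REFERENCES organization(organization_id)
);
CREATE TABLE project (
    project_id INTEGER PRIMARY KEY,
    project_name TEXT NOT NULL
);
CREATE TABLE preprocessed_data (
    data_id INTEGER PRIMARY KEY,
    project_id INTEGER REFERENCES project(project_id),
    data_owner_id INTEGER REFERENCES data_owner(data_owner_id),
    design_document_name TEXT,
    related_source_files TEXT,  -- comma-separated in this sketch
    keywords TEXT               -- space-separated in this sketch
);
"""


def create_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def owner_and_organization(conn, data_id):
    """Look up the owner name and organization name for one data ID."""
    return conn.execute(
        """SELECT o.name, g.organization_name
           FROM preprocessed_data AS p
           JOIN data_owner AS o ON o.data_owner_id = p.data_owner_id
           JOIN organization AS g ON g.organization_id = o.organization_id
           WHERE p.data_id = ?""",
        (data_id,),
    ).fetchone()


conn = create_db()
conn.execute("INSERT INTO organization VALUES (1, 'Section A')")
conn.execute("INSERT INTO data_owner VALUES (1, 'A', 1)")
conn.execute("INSERT INTO project VALUES (1, 'Project A')")
conn.execute(
    "INSERT INTO preprocessed_data VALUES (1, 1, 1, 'Design document A', "
    "'XXX.c,YYY.c', 'can motor driver pwm')"
)
print(owner_and_organization(conn, 1))
```

The join from data ID to owner to organization is the lookup that per-individual and per-organization aggregation depends on.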

FIG. 5 is a flowchart showing the operation of the classification model construction unit 18. The classification model construction unit 18 includes a model construction unit 19, a technical field naming unit 20, and a memory unit 52. The memory unit 52 stores a technical field classification model 21 and a technical field name database 22. The classification model construction unit 18 performs processing using the post-preprocessing database 17. The classification model construction unit 18 executes processing in the model construction unit 19 in response to a command input from the input device 40 or the completion of processing by the preprocessing unit 11.

In step S101, the model construction unit 19 uses the preprocessed data from the preprocessed database 17 to construct N technology field classification models with 1 to N topics using a topic analysis model algorithm such as PLSA or LDA.

Next, in step S102, the model construction unit 19 evaluates the topic model performance indicators (perplexity, coherence, etc.) for the N technical field classification models, and stores the most highly evaluated technical field classification model in the memory unit 52 as the technical field classification model 21 to be used in processing by the technical field extraction unit 23.
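The selection in step S102 can be illustrated with perplexity as the indicator, where a lower value means the model explains the keywords better. The sketch below assumes each candidate is already fitted and is given as a pair of probability tables; fitting by PLSA or LDA is left to a library of choice. The function names and the data layout are hypothetical, and in practice the indicator would usually be computed on held-out data rather than on the training data used here.

```python
import math


def perplexity(docs, doc_topic, topic_word):
    """Perplexity of a topic model over keyword lists.

    docs: list of keyword lists.
    doc_topic[d][k]: P(topic k | document d).
    topic_word[k][w]: P(keyword w | topic k), stored as a dict per topic.
    """
    log_likelihood = 0.0
    n_tokens = 0
    for d, words in enumerate(docs):
        for w in words:
            p = sum(
                doc_topic[d][k] * topic_word[k].get(w, 0.0)
                for k in range(len(topic_word))
            )
            log_likelihood += math.log(max(p, 1e-12))  # guard against log(0)
            n_tokens += 1
    return math.exp(-log_likelihood / n_tokens)


def select_best(docs, candidates):
    """candidates: dict mapping number of topics to (doc_topic, topic_word)."""
    return min(candidates, key=lambda n: perplexity(docs, *candidates[n]))


# Toy, hand-written candidates (not fitted models) for two keyword sets.
DOCS = [["can", "motor"], ["html", "css"]]
CANDIDATES = {
    1: ([[1.0], [1.0]],
        [{"can": 0.25, "motor": 0.25, "html": 0.25, "css": 0.25}]),
    2: ([[1.0, 0.0], [0.0, 1.0]],
        [{"can": 0.5, "motor": 0.5}, {"html": 0.5, "css": 0.5}]),
}
print(select_best(DOCS, CANDIDATES))
```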

Then, in step S103, the technical field naming unit 20 takes the most frequently occurring keyword among the keywords constituting each topic of the technical field classification model 21 as the technical field name, which is a phrase representing the technical field. At this time, it is assumed that a keyword that is the technical field name of a certain topic does not appear in other topics. The technical field naming unit 20 reflects the technical field names of each topic of the technical field classification model 21 in the technical field name DB 22 of the storage unit 52.
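The naming rule of step S103 can be sketched as picking, for each topic, the most frequent keyword that appears in no other topic. The function name `name_topics` and the input layout (one keyword-to-frequency dictionary per topic) are assumptions for illustration.

```python
def name_topics(topic_keyword_freq):
    """Return one technical field name per topic.

    topic_keyword_freq: list of dicts mapping keyword to frequency, one per topic.
    A topic with no keyword unique to it gets None in this sketch.
    """
    names = []
    for i, freq in enumerate(topic_keyword_freq):
        # Collect every keyword used by the other topics.
        others = set()
        for j, other in enumerate(topic_keyword_freq):
            if j != i:
                others.update(other)
        unique = {kw: f for kw, f in freq.items() if kw not in others}
        names.append(max(unique, key=unique.get) if unique else None)
    return names


TOPICS = [
    {"driver": 5, "can": 9, "test": 10},
    {"html": 7, "test": 12, "css": 3},
]
print(name_topics(TOPICS))
```

In the toy data, "test" is the most frequent keyword in both topics but is shared, so it is skipped and the names fall back to the most frequent unique keywords.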

FIG. 6 shows an example of a technical field name DB22. In the example of FIG. 6, a relational DB is used for the technical field name DB22. The technical field name DB22 has fields for a technical field ID and a technical field name.

After step S103, in step S104, the output control unit 32 causes the output device 41 to display a technical field confirmation screen that shows, in a two-dimensional map, each topic in the technical field classification model 21, the technical field name corresponding to each topic, and the keywords that make up each topic.

Figure 7 shows an example of a technical field confirmation screen. In addition to topics such as technical field 1 and technical field N, the technical field confirmation screen displays a two-dimensional map 702 of the keywords that make up the topics. The two-dimensional map 702 displays various keywords that make up the topics. In the two-dimensional map 702, keywords that appear more frequently are displayed in larger font size, and keywords that appear less frequently are displayed in smaller font size. The technical field confirmation screen also displays the technical field name 703 of the topic.

The user can input a correction to the technical field name to the software technical field extraction device 101 by operating the technical field confirmation screen with the input device 40. When the user inputs a correction to the technical field name in step S105, the technical field naming unit 20 corrects the technical field name in the technical field name DB 22 in step S106.

FIG. 8 is a flowchart showing the operation of the technical field extraction unit 23. The technical field extraction unit 23 is configured with an estimation unit 24 and a storage unit 53. The storage unit 53 stores the estimation result DB 25.

In step S201, the estimation unit 24 inputs the preprocessed data of the preprocessed DB 17 specified by the user or the skill map creation unit 26 into the technology field classification model 21, and calculates the probability that the deliverable includes each technology field.

Next, in step S202, the estimation unit 24 estimates that the technology field that exceeds a certain probability is the technology field of the deliverable. This certain probability is set by the user.

Finally, in step S203, the estimation unit 24 registers the estimation results of the technology field in the estimation result DB 25 of the storage unit 53.

Figure 9 shows an example of the estimation result DB 25. A relational DB is used for the estimation result DB 25 in Figure 9. The estimation result DB 25 has fields such as data ID, classification result, and technical field estimation result. The data IDs in the estimation result DB 25 correspond to the data IDs in the post-preprocessing DB 17. The classification result stores, for each technical field ID, the probability that the preprocessed data includes that technical field. The technical field estimation result stores the technical field IDs whose probability exceeds the threshold set by the user.
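Steps S201 to S203 amount to thresholding the per-topic probabilities and recording the outcome. A minimal sketch follows, assuming the classification model has already produced a probability per technical field ID; the function name `estimate` and the record keys are hypothetical.

```python
def estimate(data_id, topic_probabilities, threshold):
    """Build one estimation result record.

    topic_probabilities: dict mapping technical field ID to probability.
    threshold: user-set probability that a field must exceed.
    """
    estimated = sorted(
        field_id for field_id, p in topic_probabilities.items() if p > threshold
    )
    return {
        "data_id": data_id,
        "classification_result": dict(topic_probabilities),
        "technical_field_estimation_result": estimated,
    }


record = estimate(1, {1: 0.62, 2: 0.30, 3: 0.08}, threshold=0.25)
print(record["technical_field_estimation_result"])
```

Because the comparison is against a threshold rather than a single maximum, one deliverable can be assigned several technical fields, or none.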

FIG. 10 is a flowchart showing the operation of the skill map creation unit 26. The skill map creation unit 26 is configured with a creation processing unit 27 and a memory unit 54. The memory unit 54 stores an individual skill map DB 28 and an organization skill map DB 29. The individual skill map DB 28 is a database of skill map data for each individual, and the organization skill map DB 29 is a database of skill map data for each organization.

The creation processing unit 27 instructs the technical field extraction unit 23 to extract technical fields using the preprocessed data in the preprocessed DB 17, triggered by a command input from the input device 40 or a scheduled timing such as when the preprocessed DB 17 is updated or when the technical field classification model 21 is updated (step S301).

Next, in step S302, the creation processing unit 27 selects one piece of estimation result data from the estimation result DB 25.

Then, in step S303, the creation processing unit 27 uses the data ID of the selected estimation result data as a key to confirm the owner and the owner's organization of each data from the preprocessing DB 17. The creation processing unit 27 then counts the number of technical field IDs extracted for each owner and organization, and updates the individual skill map DB 28 and the organizational skill map DB 29.

FIG. 11 shows an example of an individual skill map DB 28. A relational DB is used for the individual skill map DB 28 in FIG. 11. The individual skill map DB 28 has technical field ID and count as fields. In the individual skill map DB 28, a table exists for each individual. For each individual, the number of extractions from the deliverables owned is counted for each technical field ID.

FIG. 12 shows an example of an organization skill map DB 29. A relational DB is used for the organization skill map DB 29 in FIG. 12. The organization skill map DB 29 has technical field ID and count as fields. In the organization skill map DB 29, a table exists for each organization. For each organization, the number of extractions from the deliverables owned is counted for each technical field ID.

Next, in step S304, the creation processing unit 27 determines whether or not there is unselected estimation result data in the estimation result DB 25. If there is unselected estimation result data, the process of the creation processing unit 27 returns to step S302. Once the process has been completed for all the estimation result data in the estimation result DB 25, in step S305 the output control unit 32 causes the output device 41 to display the individual skill map data and organization skill map data.
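The loop of steps S302 to S304 can be sketched as a single pass over the estimation results that counts technical field IDs per owner and per organization. The function name `build_skill_maps` and the in-memory data layout are assumptions; the source keeps these counts in database tables.

```python
from collections import Counter, defaultdict


def build_skill_maps(estimation_results, owners):
    """Count extracted technical field IDs per individual and per organization.

    estimation_results: list of (data_id, [technical field IDs]).
    owners: dict mapping data_id to (owner ID, organization ID).
    """
    individual = defaultdict(Counter)
    organization = defaultdict(Counter)
    for data_id, field_ids in estimation_results:
        owner_id, organization_id = owners[data_id]
        individual[owner_id].update(field_ids)
        organization[organization_id].update(field_ids)
    return individual, organization


RESULTS = [(1, [1, 2]), (2, [1]), (3, [2])]
OWNERS = {1: ("A", "Section A"), 2: ("A", "Section A"), 3: ("B", "Section A")}
individual_map, organization_map = build_skill_maps(RESULTS, OWNERS)
print(dict(individual_map["A"]), dict(organization_map["Section A"]))
```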

FIG. 13 shows an example of an individual skill map display screen on which individual skill map data is displayed on the output device 41. The individual skill map display screen displays an individual selection tab 1301, technical fields owned by the individual 1302, the number of data owned by the individual 1303, related projects 1304 of the individual, and a pie chart 1305 representing the technical fields owned by the individual. The user can select an individual for whom he/she wants to check the skill map from the individual selection tab 1301. The technical fields owned by the individual 1302 show the technical fields owned by the individual along with their percentages. This percentage is calculated from the count number in the individual skill map DB 28. In the example of FIG. 13, the skill map of "A" shows the technical fields as embedded (60%) and WEB (40%), the number of data as 100, and the related projects as A and B.

FIG. 14 shows an example of an organization skill map display screen on which organization skill map data is displayed on the output device 41. The organization skill map display screen displays an organization selection tab 1401, technical fields owned by the organization 1402, the number of data owned by the organization 1403, related projects of the organization 1404, and a pie chart 1405 showing the technical fields owned by the organization. The user can select the organization for which he wants to check the skill map from the organization selection tab 1401. The technical fields owned by the organization 1402 are shown along with the percentages of the technical fields owned by the organization. This percentage is calculated from the count number in the organization skill map DB 29. In the example of FIG. 14, the skill map of "Section A" shows the technical fields as embedded (60%) and WEB (40%), the number of data as 1000, and the related projects as A, B, and C.
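The percentages shown on both skill map screens follow directly from the stored counts. The sketch below assumes the counts for one individual or organization are available as a dictionary; the function name `field_percentages` is hypothetical.

```python
def field_percentages(counts):
    """Convert per-field counts into percentages of the total.

    counts: dict mapping technical field name (or ID) to its count.
    Returns an empty dict when nothing has been counted.
    """
    total = sum(counts.values())
    if total == 0:
        return {}
    return {field: 100.0 * n / total for field, n in counts.items()}


print(field_percentages({"embedded": 60, "WEB": 40}))
```

With counts of 60 and 40, this gives the embedded 60% and WEB 40% split used in the examples of FIG. 13 and FIG. 14.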

Figure 15 shows an example of a technical field inquiry screen that the software technical field extraction device 101 displays on the output device 41. The technical field inquiry screen includes a design document selection button 1501 and an inquiry button 1503. The user selects a design document registered in the preprocessing DB 17 by pressing the design document selection button 1501. The selected design document and the associated source code related to the selected design document are displayed to the right of the design document selection button 1501. When the user presses the inquiry button 1503, the output control unit 32 retrieves the extraction results of the technical field of the selected design document from the estimation result DB 25, and displays them as extraction results to the right of the inquiry button 1503.

<B. Hardware Configuration>
The preprocessing unit 11, the classification model construction unit 18, the technical field extraction unit 23, the skill map creation unit 26, and the output control unit 32 in the above-mentioned software technical field extraction device 101 are realized by a processing circuit 81 shown in Fig. 16. That is, the processing circuit 81 includes the preprocessing unit 11, the classification model construction unit 18, the technical field extraction unit 23, the skill map creation unit 26, and the output control unit 32 (hereinafter, the preprocessing unit 11, etc.). The processing circuit 81 may be implemented by dedicated hardware or a processor that executes a program stored in a memory. The processor may be, for example, a central processing unit, a processing unit, an arithmetic unit, a microprocessor, a microcomputer, a DSP (Digital Signal Processor), etc.

When the processing circuit 81 is dedicated hardware, the processing circuit 81 corresponds to, for example, a single circuit, a composite circuit, a programmed processor, a parallel programmed processor, an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a combination of these. Each function of each part such as the pre-processing unit 11 may be realized by multiple processing circuits 81, or the functions of each part may be combined and realized by a single processing circuit.

When the processing circuit 81 is a processor, the functions of the preprocessing unit 11 and the like are realized by a combination of software, etc. (software, firmware, or software and firmware). The software, etc. is written as a program and stored in a memory. As shown in FIG. 17, the processor 82 applied to the processing circuit 81 realizes the functions of each unit by reading and executing the program stored in the memory 83. That is, the software technology field extraction device 101 includes a memory 83 for storing a program that, when executed by the processing circuit 81, results in the execution of the steps of the preprocessing unit 11 performing preprocessing on the software development product to create preprocessed data, the classification model construction unit 18 creating a classification model that automates the extraction of technology fields from the preprocessed data, the technology field extraction unit 23 extracting technology fields from the preprocessed data using the classification model, and the skill map creation unit 26 creating a skill map of an individual or organization by aggregating the technology fields extracted by the technology field extraction unit 23 for each individual or organization. In other words, this program can be said to cause a computer to execute the procedure or method of the preprocessing unit 11 and the like. Here, memory 83 may be, for example, non-volatile or volatile semiconductor memory such as RAM (Random Access Memory), ROM (Read Only Memory), flash memory, EPROM (Erasable Programmable Read Only Memory), EEPROM (Electrically Erasable Programmable Read Only Memory), HDD (Hard Disk Drive), magnetic disk, flexible disk, optical disk, compact disk, mini disk, DVD (Digital Versatile Disk) and its drive device, or any storage medium to be used in the future.

The above describes configurations in which the functions of the preprocessing unit 11 and the like are realized by either hardware or software, etc. However, the configuration is not limited to these; part of the preprocessing unit 11 and the like may be realized by dedicated hardware and another part by software, etc. For example, the functions of the preprocessing unit 11 can be realized by the processing circuit 81 as dedicated hardware, and the remaining functions can be realized by the processing circuit 81 as the processor 82 reading and executing a program stored in the memory 83.

As described above, the processing circuit 81 can realize each of the above-mentioned functions by hardware, software, etc., or a combination of these. Note that the storage units 51, 52, 53, and 54 are composed of the memory 83; they may be composed of a single memory 83, or each may be composed of an individual memory.

The software technology field extraction device 101 may be configured on a user terminal, which is a terminal used by a user, or on an administrator terminal, which is a terminal managed by an administrator. The software technology field extraction device 101 may also be configured as a system that combines a user terminal or an administrator terminal with a server. In this case, each function or each component of the software technology field extraction device 101 described above may be distributed and placed on each device that makes up the system, or may be centrally placed on one of the devices.

It is possible to freely combine the various embodiments, and to modify or omit the various embodiments as appropriate. The above description is illustrative in all respects. It is understood that countless variations not illustrated can be envisioned.

11 Preprocessing unit, 12 Design document preprocessing unit, 13 Source code preprocessing unit, 14 Information combination unit, 15 Source code keyword acquisition rules, 16 Combination rules, 17 Post-preprocessing database, 18 Classification model construction unit, 19 Model construction unit, 20 Technical field naming unit, 21 Technical field classification model, 22 Technical field name database, 23 Technical field extraction unit, 24 Estimation unit, 25 Estimation result database, 26 Skill map creation unit, 27 Creation processing unit, 28 Individual skill map database, 29 Organization skill map database, 30 Design document, 31 Source code, 32 Output control unit, 40 Input device, 41 Output device, 51, 52, 53, 54 Storage units, 81 Processing circuit, 82 Processor, 83 Memory, 101 Software technical field extraction device.

Claims

1. A software technology field extraction device comprising:
a preprocessing unit that performs preprocessing on a software development deliverable to generate preprocessed data;
a classification model construction unit that creates a classification model that automates the extraction of technical fields from the preprocessed data;
a technical field extraction unit that extracts a technical field of the deliverable from the preprocessed data by using the classification model;
a skill map creation unit that creates a skill map representing the proportion of technical fields to which an individual or organization is related by aggregating the technical fields extracted by the technical field extraction unit for each individual or organization; and
an output control unit that causes an output device to output the skill map.
2. The software technology field extraction device according to claim 1, wherein
the deliverable includes a design document and source code, and
the preprocessing unit includes:
a design document preprocessing unit that extracts keywords from the design document;
a source code preprocessing unit that extracts keywords from the source code; and
an information combination unit that associates the design document with the source code based on the keywords extracted from the design document and the keywords extracted from the source code, and creates the preprocessed data by combining the keywords of the design document and the source code that are in a corresponding relationship.
3. The software technology field extraction device according to claim 1, wherein
the classification model construction unit includes:
a model construction unit that creates a plurality of classification model candidates, each extracting a different number of topics from the preprocessed data, and determines, as the classification model, the classification model candidate having an optimal number of extracted topics from among the plurality of classification model candidates; and
a technical field naming unit that determines a technical field name of each of the topics from keywords that are not included in other topics, based on the frequency of appearance of the keywords that constitute each of the topics extracted by the classification model.
4. The software technology field extraction device according to claim 3, wherein
the output control unit causes the output device to output the technical field name of the topic, and
the technical field naming unit modifies the technical field name of the topic based on input information from a user.
5. The software technology field extraction device according to claim 1, wherein
the technical field extraction unit includes
an estimation unit that calculates a probability that the deliverable is included in each of the topics by inputting the preprocessed data into the classification model, and estimates a topic whose probability exceeds a predetermined value to be the technical field of the deliverable.
6. The software technology field extraction device according to claim 5, wherein
the skill map creation unit includes
a creation processing unit that aggregates the estimation results of the technical fields by the estimation unit for each set of the deliverables owned by an individual or organization and creates the skill map.
7. A software technology field extraction method, wherein:
a preprocessing unit performs preprocessing on a software development deliverable to generate preprocessed data;
a classification model construction unit creates a classification model that automates the extraction of technical fields from the preprocessed data;
a technical field extraction unit extracts technical fields from the preprocessed data using the classification model; and
a skill map creation unit creates a skill map of an individual or organization by aggregating the technical fields extracted by the technical field extraction unit for each individual or organization.