WO2023157074A1 - Teacher data generation auxiliary device, teacher data generation auxiliary system, teacher data generation method, and non-transitory computer-readable medium


Info

Publication number
WO2023157074A1
Authority
WO
WIPO (PCT)
Prior art keywords
columns
data generation
cluster
temporary
feature amount
Application number
PCT/JP2022/005937
Other languages
English (en)
Japanese (ja)
Inventor
Hiroaki Takeuchi
Original Assignee
NEC Corporation
Application filed by NEC Corporation
Priority to PCT/JP2022/005937
Publication of WO2023157074A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Definitions

  • The present disclosure relates to a teacher data generation auxiliary device, a teacher data generation auxiliary system, a teacher data generation method, and a non-transitory computer-readable medium.
  • Patent Document 1 describes a method of calculating feature values from column names and attributes and extracting columns similar to a given column by supervised learning. Further, Japanese Patent Application Laid-Open No. 2002-200001 discloses a method in which a column estimating unit assigns a meaning to a column by removing notation variations from the column name using a related word dictionary.
  • The database schema used in an actual system does not always consist of commonly used terms, and it may be difficult to remove notation variations automatically, so original training data matching the system must be prepared.
  • The annotation process, which requires the most man-hours, must be performed manually, and there is a problem that the burden on the person in charge is heavy.
  • Therefore, an object of the present disclosure is to provide a teacher data generation auxiliary device, a teacher data generation auxiliary system, a teacher data generation method, and a non-transitory computer-readable medium that improve the efficiency of annotation work and allow teacher data to be created in a short time.
  • The training data generation auxiliary device of the present disclosure comprises: data input means for inputting a table; feature amount calculation means for calculating a feature amount based on the actual data of the columns in the table; cluster classification means for classifying the columns in the table into a plurality of clusters based on the feature amount; and temporary label generation means for generating temporary labels associated with the columns belonging to each of the plurality of clusters.
  • The teaching data generation auxiliary system of the present disclosure comprises: data input means for inputting a table; feature amount calculation means for calculating a feature amount based on the actual data of the columns in the table; cluster classification means for classifying the columns in the table into a plurality of clusters based on the feature amount; and temporary label generation means for generating temporary labels associated with the columns belonging to each of the plurality of clusters.
  • The teacher data generation method of the present disclosure comprises: a data input step of inputting a table; a feature amount calculation step of calculating a feature amount based on the actual data of the columns in the table; a cluster classification step of classifying the columns in the table into a plurality of clusters based on the feature amount; and a temporary label generation step of generating temporary labels associated with the columns belonging to each of the plurality of clusters.
  • The non-transitory computer-readable medium of the present disclosure stores a program for executing a training data generation method, the method comprising: a data input step of inputting a table; a feature amount calculation step of calculating a feature amount based on the actual data of the columns in the table; a cluster classification step of classifying the columns in the table into a plurality of clusters based on the feature amount; and a temporary label generation step of generating temporary labels associated with the columns belonging to each of the plurality of clusters.
  • FIG. 1 is a block diagram showing a configuration example of a teaching data generation auxiliary device 100 according to a first embodiment
  • FIG. 4 is a flowchart of an operation example of the training data generation auxiliary device 100
  • FIG. 10 is a diagram showing an example of functional blocks of a teaching data generation auxiliary device 100 according to a second embodiment
  • It is an example of a table input by the data input unit 10.
  • It is an example of a feature amount.
  • It is another example of a feature amount.
  • It is a diagram showing a state in which the columns (five) in the table shown in FIG. 6 are classified into three clusters (classes) when the table input by the data input unit 10 is the table shown in FIG. 6 and the number input by the user is 3.
  • It is an example of a table (classification result confirmation screen) in which temporary labels (“code”, “prefecture”, “prefecture”, “code”, “code”) are associated.
  • It is an example of a classification number input screen.
  • It is a flowchart of an operation example of the training data generation auxiliary device 100.
  • It is an example of teacher data output by the teacher data generation auxiliary device 100.
  • It is an example of a table input by the data input unit 10.
  • It is a diagram showing a hardware configuration example of a teaching data generation auxiliary device 100 according to the present disclosure.
  • It is an example of teacher data.
  • It is an example of general annotation work.
  • The person in charge of annotation work must create a label set that can cover the meanings of all the tables used for learning, and must manually select and assign a label from the label set to each column on the screen.
  • There are hundreds to thousands of columns that require annotation work, and if the person in charge does not know the structure of the entire data, the huge volume of data must be grasped visually.
  • Human error therefore occurs, so support by automation is necessary.
  • a typical analysis pattern may be used for automatic analysis. For example, there is a typical analysis pattern for analyzing who did what kind of purchasing behavior. However, comprehending the meaning of the tables used in such analysis is done manually. Therefore, there is a problem that it takes time to grasp the meaning of the table used in the analysis.
  • database migration work may be performed.
  • an operator different from the database operator before migration may use the table after migration.
  • the worker cannot smoothly use the database after migration because it takes time to grasp the meaning of the table after migration.
  • A thesaurus may be used for supervised learning and natural language processing, but the database schema used in an actual system has columns that cannot be represented by a generalized thesaurus. Therefore, in some cases, it is necessary to prepare original teacher data specialized for the system.
  • The teacher data generation auxiliary device 100 of each embodiment described below does not replace the process of estimation (inference) using a learning model. This is because, while a high accuracy rate (e.g., 80% or more) is expected of estimation using a learning model, labeling by the teacher data generation auxiliary device 100, which does not use a learning model, cannot be expected to achieve such accuracy. Therefore, the training data generation auxiliary device 100 (and the training data generation method) is not used in the estimation stage, but is used only to assist the work of creating a learning model.
  • FIG. 2 is an example of a table input to the teacher data generation auxiliary device 100.
  • table refers to tabular data that includes “columns”.
  • A “column” is a unit representing the fields arranged in the vertical direction of a table.
  • Column includes "column name” and "actual data”.
  • Column name refers to the name entered in “Column”.
  • Column names are determined by humans, so notation variations occur in the “column name”.
  • For example, various column names such as “type”, “male and female”, and “gender” can be entered as the column name of a column whose attribute value is the gender of a person.
  • the concept represented by the column will be referred to as "the meaning of the column” to distinguish it from the column name.
  • “gender” is one example of the meaning of the column.
  • “Actual data” refers to the data that was actually entered. In FIG. 2, a set of data in the second and subsequent rows of columns is the actual data.
  • Annotation refers to the work of associating a label (common label) that expresses the meaning of a column with a column.
  • AI engine refers to a system that realizes analysis processing such as prediction and discrimination by generating an analysis model using machine learning technology according to a predetermined data analysis method.
  • the AI engine is implemented by a commercial software program or an open source software program (eg, scikit-learn and PyTorch).
  • the input of the AI engine is teacher data, and the output is a learning model.
  • Training data is a combination (multiple sets) of column names and feature values for which the annotation work described later has been completed (see Fig. 19).
  • A “learning model” generally refers to the learning results generated by machine learning (an AI engine).
  • a “learning model” outputs a classification result based on a learning result for an input.
  • the learning model created in the second embodiment outputs the most suitable label from the labels registered in the learning model when a table (column) is given as an input.
  • FIG. 3 is a block diagram showing a configuration example of the training data generation auxiliary device 100 according to the first embodiment.
  • the training data generation auxiliary device 100 may be a personal computer or a server.
  • The training data generation auxiliary device 100 includes data input means 1, feature quantity calculation means 2, cluster classification means 3, and temporary label generation means 4.
  • the data input means 1 inputs the table to the teacher data generation auxiliary device 100.
  • the feature amount calculation means 2 calculates feature amounts based on the actual data of the columns in the table.
  • a cluster classification means 3 classifies the columns in the table into a plurality of clusters based on the feature amount.
  • Temporary label generation means 4 generates temporary labels associated with columns belonging to each of a plurality of clusters.
  • FIG. 4 is a flowchart of an operation example of the training data generation auxiliary device 100.
  • a table (see, for example, FIG. 2) is input to the training data generation auxiliary device 100 (step S1). This is performed by the data input means 1.
  • the feature amount is calculated based on the actual data of the columns in the table (step S2). This is executed by the feature quantity calculation means 2 .
  • step S3 the columns in the table are classified into a plurality of clusters based on the feature amount. This is performed by the cluster classification means 3 .
  • step S4 generate temporary labels associated with the columns belonging to each of the plurality of clusters. This is executed by the temporary label generating means 4.
  • FIG. 5 is a diagram showing an example of functional blocks of the training data generation auxiliary device 100 according to the second embodiment.
  • the training data generation auxiliary device 100 generates training data (see, for example, FIG. 19) by executing a predetermined process based on the actual data of the columns in the table (one or more) that is input thereto.
  • The teaching data generation auxiliary device 100 may be a personal computer or a server. As shown in FIG. 5, the training data generation auxiliary device 100 includes a data input unit 10 (an example of the data input means of the present invention), a feature amount calculation unit 20 (an example of the feature amount calculation means of the present invention), a cluster classification unit 30 (an example of the cluster classification means of the present invention), a temporary label generation unit 40 (an example of the temporary label generation means of the present invention), a label display control unit 50, a learning model storage unit 60, a label estimation unit 70, and a temporary label correction unit 80. An input unit 90 is also connected to the training data generation auxiliary device 100.
  • the data input unit 10 inputs the table (actual data and its column names) to the teacher data generation auxiliary device 100.
  • FIG. 6 is an example of a table input by the data input unit 10.
  • the number of tables input by the data input unit 10 may be one or plural (for example, several tens).
  • the number of columns in each table may be one or plural (for example, hundreds of columns).
  • The table input by the data input unit 10 may be stored in a storage device built into or external to the teacher data generation auxiliary device 100, or may be acquired from the outside via a communication unit (not shown).
  • The feature amount calculation unit 20 calculates feature amounts based on the actual data of the columns in the table input by the data input unit 10. At that time, the feature amount calculation unit 20 calculates the feature amount for each column in the table based on the actual data of that column. That is, one feature amount is calculated for one column.
  • FIG. 7 is an example of the feature amount.
  • The feature amount may be, for example, the average value of the semantic vectors of the actual data.
  • a semantic vector can be calculated by a known algorithm (eg, Word2vec).
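As a minimal sketch of this feature, the per-column average of word vectors can be computed as follows. The tiny `EMBEDDINGS` table is a hypothetical stand-in for a trained Word2vec model (which would normally come from a library such as gensim); the words and vector values are made-up examples, not data from the disclosure.

```python
# Hypothetical word -> vector lookup; a real system would load a trained Word2vec model.
EMBEDDINGS = {
    "Tokyo": [0.9, 0.1, 0.0],
    "Osaka": [0.8, 0.2, 0.1],
    "13":    [0.0, 0.1, 0.9],
    "27":    [0.1, 0.0, 0.8],
}

def column_feature(actual_data):
    """Feature amount of one column: average of the semantic vectors of its actual data."""
    vectors = [EMBEDDINGS[v] for v in actual_data if v in EMBEDDINGS]
    if not vectors:
        return None  # no known word in this column
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

print([round(x, 2) for x in column_feature(["Tokyo", "Osaka"])])  # -> [0.85, 0.15, 0.05]
```

One feature vector per column, as stated above: the column's rows are reduced to a single multi-dimensional vector.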
  • the feature amount is represented by a multi-dimensional vector.
  • FIG. 8 is another example of the feature quantity.
  • the feature amount may be the appearance frequency (occurrence frequency) of ASCII characters (128 characters) appearing in the actual data.
  • the feature amount is represented by a multi-dimensional vector.
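The ASCII appearance-frequency feature can be sketched in a few lines; this is an illustrative reading of the description above, not the exact implementation of the disclosure, and the sample values are made up.

```python
def ascii_frequency(actual_data):
    """128-dimensional feature: occurrence count of each ASCII character in a column's actual data."""
    counts = [0] * 128
    for value in actual_data:
        for ch in str(value):
            code = ord(ch)
            if code < 128:  # non-ASCII characters are simply ignored in this sketch
                counts[code] += 1
    return counts

feat = ascii_frequency(["13", "27", "13"])  # actual data of a code-like column
print(feat[ord("1")], feat[ord("3")])       # -> 2 2
```

Columns holding similar kinds of values (e.g. numeric codes vs. prose) produce visibly different frequency vectors, which is what the later clustering step exploits.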
  • FIG. 9 is another example of the feature amount.
  • the feature quantity may be a statistical feature quantity such as an average value or maximum value.
  • the feature amount is represented by a multi-dimensional vector. Further, when the target DB has type information, the feature amount may be the type information of the target column.
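A minimal sketch of such a statistical feature vector, assuming a purely numeric column (the type-information case mentioned above is omitted); the sample values are illustrative only.

```python
def statistical_feature(actual_data):
    """Statistical feature vector (mean, max, min) computed from a numeric column's actual data."""
    values = [float(v) for v in actual_data]
    return [sum(values) / len(values), max(values), min(values)]

print(statistical_feature(["13", "27", "14"]))  # -> [18.0, 27.0, 13.0]
```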
  • The cluster classification unit 30 classifies the columns (plurality) in the table into a plurality of clusters based on the feature amounts calculated by the feature amount calculation unit 20. Specifically, the cluster classification unit 30 classifies (clusters) the columns in the table input by the data input unit 10 into the number of clusters (classes) input (specified) by the user, based on the distribution tendency of the feature amounts calculated by the feature amount calculation unit 20. Clustering can be performed by known algorithms (e.g., k-means). FIG. 10 shows a state in which the columns (five) in the table shown in FIG. 6 are classified into three clusters (classes).
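The clustering step can be illustrated with a bare-bones k-means over column feature vectors. A real system would use a library implementation (e.g. scikit-learn's KMeans); initialising the centroids with the first k points, and the feature vectors themselves, are simplifications introduced for this sketch.

```python
import math

def kmeans(points, k, iterations=20):
    """Minimal k-means: classify column feature vectors into k clusters (classes)."""
    centroids = [list(p) for p in points[:k]]  # naive initialisation: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: each column goes to its nearest centroid
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # update step: each centroid moves to the mean of its member columns
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# five column feature vectors classified into three clusters, mirroring FIG. 10's "5 columns, 3 classes"
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1], [9.0, 0.0]]
print(kmeans(features, 3))
```

The number of clusters k corresponds to the number the user enters on the classification number input screen.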
  • The temporary label generation unit 40 generates temporary labels associated with the columns belonging to the clusters classified by the cluster classification unit 30. For example, for each cluster classified by the cluster classification unit 30, the temporary label generation unit 40 tokenizes the column names of the columns belonging to the cluster and generates the most frequently occurring noun as the temporary label associated with the columns belonging to the cluster. Tokenization can be performed, for example, by a known morphological analysis engine.
  • Alternatively, a similar-word dictionary such as WordNet may be used to search for words corresponding to the superordinate concept of each word, and the most frequent noun may be generated as the temporary label.
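The column-name-based labelling can be sketched as follows; a naive underscore/space split stands in for the morphological analysis engine and noun filtering described above, and the column names are made-up examples echoing the "code" cluster of FIG. 11.

```python
from collections import Counter

def temporary_label(column_names):
    """Temporary label for one cluster: the most frequent token among its column names.
    A real implementation would tokenize with a morphological analysis engine and keep
    only nouns; splitting on underscores/spaces is a simplification for this sketch."""
    tokens = []
    for name in column_names:
        tokens.extend(name.replace("_", " ").split())
    token, _count = Counter(tokens).most_common(1)[0]
    return token

print(temporary_label(["prefecture_code", "classification_code", "order_code"]))  # -> code
```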
  • the label display control unit 50 displays a classification result confirmation screen on the display unit (for example, the display connected to the teacher data generation auxiliary device 100).
  • The classification result confirmation screen is a screen for displaying the result of classification by the cluster classification unit 30 (which column is associated with which temporary label). For example, as shown in FIG. 11, it includes the columns belonging to each cluster and the temporary labels associated with the columns (“code”, “prefecture”, “prefecture”, “code”, “code”).
  • FIG. 11 is an example of a table (classification result confirmation screen) in which temporary labels (“code”, “prefecture”, “prefecture”, “code”, “code”) are associated. The user can determine whether or not the classification by the cluster classification unit 30 has been performed appropriately by checking this cluster classification result confirmation screen.
  • the learning model storage unit 60 is a storage device that stores learning models.
  • the learning model is a learning model that, when a table (column) is input, estimates the column name of the input table (column) based on the learning result, and outputs the estimated column name.
  • This learning model is generated by inputting an annotated table (see, e.g., FIG. 14), which is teacher data, to an AI engine (e.g., scikit-learn or PyTorch), as will be described later.
  • the label estimation unit 70 estimates temporary labels using the learning models stored in the learning model storage unit 60 .
  • the input unit 90 is, for example, an input device (eg, keyboard, mouse) connected to the teaching data generation auxiliary device 100.
  • the input unit 90 is used to input (designate) the number of clusters.
  • the input unit 90 is an example of the classification number input means of the present disclosure.
  • the input number is displayed on the classification number input screen (see FIG. 12), which is a UI displayed on the display unit (for example, the display connected to the teacher data generation auxiliary device 100).
  • FIG. 12 is an example of the classification number input screen.
  • FIG. 12 shows a state in which 3 is input as the number for cluster classification.
  • The input unit 90 is also used to correct the temporary labels generated by the temporary label generation unit 40.
  • The temporary label correction unit 80 corrects the temporary label generated by the temporary label generation unit 40 based on input from the input unit 90.
  • The temporary label correction unit 80 is an example of the temporary label correction means of the present disclosure. <Specific example of temporary label correction> A specific example in which the temporary label correction unit 80 corrects a temporary label will be described.
  • FIG. 11 is an example of a table (classification result confirmation screen) in which temporary labels (“code”, “prefecture”, “prefecture”, “code”, “code”) are associated. An example of correcting the temporary label in FIG. 11 (“Code” at the left end in FIG. 11) to “Prefecture Code” at the left end in FIG. 14 will be described below.
  • First, the user inputs from the input unit 90 an instruction to correct the temporary label to be corrected (here, the leftmost “code” in FIG. 11). The temporary label correction unit 80 then controls the display unit (for example, the display connected to the training data generation auxiliary device 100) to display a temporary label correction screen (not shown) including the leftmost “code” in FIG. 11.
  • Next, the temporary label correction unit 80 deletes the temporary label to be corrected (here, the leftmost “code” in FIG. 11) from the temporary label correction screen.
  • When the user inputs a new temporary label from the input unit 90, the temporary label correction unit 80 controls the display to show the input new temporary label on the temporary label correction screen. The user judges whether the correction is appropriate by viewing the temporary label correction screen, and if it is appropriate, inputs an instruction from the input unit 90 to confirm the correction.
  • The temporary label correction unit 80 then associates the corrected new temporary label (here, the leftmost “prefecture code” in FIG. 14) with the column (here, the leftmost column in FIG. 11) that was associated with the temporary label to be corrected (here, the leftmost “code” in FIG. 11), and stores the association in a storage device (for example, a storage device connected to the teaching data generation auxiliary device 100).
  • In this way, the temporary label to be corrected (here, the leftmost “code” in FIG. 11) can be corrected to a new temporary label (here, the leftmost “prefecture code” in FIG. 14).
  • FIG. 13 is a flowchart of an operation example of the training data generation auxiliary device 100. It shows the flow of processing in which the table shown in FIG. 6 is input to the teacher data generation auxiliary device 100 and teacher data (see FIG. 19) is finally output.
  • FIG. 14 is an example of teacher data output by the teacher data generation auxiliary device 100.
  • First, a table containing columns (column names and actual data) is input (step S10). This is performed, for example, by the data input unit 10. Here, it is assumed that the table shown in FIG. 6 has been input. Note that the input in step S10 is not limited to one table; a plurality of tables may be input.
  • Next, it is determined whether a learning model exists, that is, whether a learning model is stored in the learning model storage unit 60 (step S11).
  • If there is no learning model (step S11: No), a label-unassigned explicit label is associated (step S12). Specifically, each column in the table input in step S10 is associated with a label-unassigned explicit label. This is performed, for example, by the temporary label generation unit 40.
  • A label-unassigned explicit label is a label that clearly indicates that no temporary label is associated, such as “label name not set”.
  • Next, the feature amount is calculated for each column (step S13). Specifically, for each column in the table input in step S10, the feature amount, for example, the average value of the semantic vectors of the actual data (see FIG. 7), is calculated based on the actual data of the column. This is performed, for example, by the feature amount calculation unit 20.
  • Next, a temporary label for cluster classification is selected (step S14). This is performed, for example, by the user using an input device (e.g., keyboard, mouse) connected to the training data generation auxiliary device 100.
  • The temporary label is selected on a temporary label selection screen (not shown), a UI displayed on a display unit (for example, a display connected to the training data generation auxiliary device 100).
  • Here, the label-unassigned explicit label associated in step S12 is selected as the label to be clustered.
  • On the first pass, step S14 may be omitted because only the single “label name not set” label is available for division.
  • From the second pass onward, however, step S14 of selecting the temporary label for cluster classification is required.
  • Next, the number for cluster classification is entered (step S15). This is done, for example, by the user inputting the number for cluster classification from the input unit 90.
  • Here, as shown in FIG. 12, it is assumed that 3 is input as the number for cluster classification.
  • Next, cluster classification is performed (step S16). Specifically, based on the tendency of the distribution of the feature amounts calculated in step S13, the columns (plurality) in the table input in step S10 are classified (clustered) into the number of clusters (classes) input by the user in step S15. This is performed, for example, by the cluster classification unit 30. Here, as shown in FIG. 10, it is assumed that the data are classified into three clusters (classes): “class 1”, “class 2”, and “class 3”.
  • Next, a temporary label is generated (step S17). Specifically, a temporary label associated with the columns belonging to each cluster classified in step S16 is generated. For example, for each cluster classified in step S16, the column names of the columns belonging to the cluster are tokenized, and the noun with the highest appearance frequency among the tokens is generated as the temporary label associated with the columns belonging to the cluster. This is performed, for example, by the temporary label generation unit 40.
  • Next, the classification result confirmation screen is displayed (step S18). This screen displays the result of classification by the cluster classification unit 30 (which column is associated with which temporary label). For example, as shown in FIG. 11, it includes the columns belonging to each cluster and the temporary labels associated with the columns (“code”, “prefecture”, “prefecture”, “code”, “code”). The user can determine whether the classification by the cluster classification unit 30 was performed appropriately by checking this classification result confirmation screen.
  • Next, it is determined whether the cluster size is appropriate (step S19). This is determined, for example, by the user checking the classification result confirmation screen displayed in step S18.
  • If the cluster size is not appropriate (step S19: No) and the cluster is to be divided further (step S20: Yes), the process returns to step S14, and the processing from step S14 onward is executed again. On the other hand, if the cluster size is not appropriate (step S19: No) and further division is not required (step S20: No), the previous clustering result is canceled (step S21), and then the process returns to step S14 and the processing from step S14 onward is executed repeatedly.
  • Step S21 is executed, for example, when a large number is input as the number for cluster classification in step S15 and the clusters (classes) become too fine.
  • If the cluster size is appropriate (step S19: Yes), it is determined whether the annotation result is appropriate (step S22). This is determined, for example, by the user checking the classification result confirmation screen (see FIG. 11) displayed in step S18. If the annotation result is not appropriate (step S22: No), the annotation result is corrected (step S23). For example, if any of the associated temporary labels (“code”, “prefecture”, “prefecture”, “code”, “code”) shown in FIG. 11 is not appropriate, that temporary label is modified. This is performed, for example, by the user operating the input unit 90 as described above (see <Specific example of temporary label correction> above). FIG. 14 is an example in which the temporary label “code” (associated with class 1) generated in step S17 is corrected to “prefecture code”, and the temporary label “code” generated in step S17 is corrected to “classification code”.
  • Step S23 is repeatedly executed until the annotation result is determined to be appropriate in step S22. When the annotation result is appropriate as a result of the correction in step S23 (step S22: Yes), the teacher data is completed (step S24).
  • The teacher data is a set of combinations (multiple sets) of column names and feature amounts for which the annotation work (steps S10 to S22) has been completed.
  • FIG. 19 is an example of teacher data completed in step S24.
  • Finally, a learning model is created (step S25). This is done by an AI engine (e.g., scikit-learn or PyTorch).
  • The AI engine generates a learning model when the teacher data completed in step S24 is input.
  • This learning model is a learning model that, when a table (column) is input, infers the column name of the input table (column) based on the learning result and outputs the inferred column name.
  • The learning model created in step S25 is stored in the learning model storage unit 60. The learning model created here can also be applied to the inference of the table semantic inference system of Japanese Patent No. 6890764.
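As an illustration only (the disclosure leaves the model to the AI engine), a nearest-centroid classifier shows the train-then-estimate flow from teacher data to label estimation. Nearest-centroid is a stand-in chosen for brevity, not the method of the disclosure, and the labels and feature vectors below are made-up examples.

```python
import math

def train(teacher_data):
    """Build a label -> centroid model from (label, feature_vector) teacher data.
    A simple stand-in for the AI engine (e.g., scikit-learn or PyTorch)."""
    groups = {}
    for label, feature in teacher_data:
        groups.setdefault(label, []).append(feature)
    return {label: [sum(col) / len(feats) for col in zip(*feats)]
            for label, feats in groups.items()}

def estimate(model, feature):
    """Return the registered label whose centroid is nearest to a column's feature vector."""
    return min(model, key=lambda label: math.dist(model[label], feature))

model = train([("prefecture code", [0.0, 1.0]),
               ("prefecture",      [1.0, 0.0]),
               ("prefecture",      [0.9, 0.1])])
print(estimate(model, [0.8, 0.2]))  # -> prefecture
```

This mirrors the behaviour described above: for a new column's feature, the model outputs the most suitable label among those registered during training.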
  • A table containing columns (column names and actual data) is input (step S10). This is performed, for example, by the data input unit 10. Here, it is assumed that the table shown in FIG. 15 has been input.
  • FIG. 15 is an example of a table input by the data input unit 10. Note that the input in step S10 is not limited to one table; a plurality of tables may be input.
  • Next, it is determined whether a learning model exists, that is, whether a learning model is stored in the learning model storage unit 60 (step S11).
  • If a learning model exists (step S11: Yes), a temporary label is generated by estimation using the learning model (step S26).
  • Specifically, the label estimation unit 70 inputs each column in the table input in step S10 to the learning model stored in the learning model storage unit 60. The learning model estimates the column name of each input column based on the learning result and outputs the estimated column name.
  • Here, it is assumed that “prefecture code”, “prefecture”, “prefecture”, “prefecture name”, and “classification code” are estimated as temporary labels.
  • FIG. 16 is an example of a table (classification result confirmation screen) in which the temporary labels (“prefecture code”, “prefecture”, “prefecture”, “prefecture name”, “classification code”) are associated.
  • The user can correct a temporary label generated in step S26 to a unique name by operating the input unit 90 as described above (see <Specific example of temporary label correction> above).
  • From step S18 onward, the processes described above are executed in the same manner.
  • The learning model can return highly accurate results for known data (see the column with the column name "prefecture code", the column with the column name "prefecture name", and the column with the column name "Todofuken" in FIG. 16).
  • However, data containing unknown concepts (see the column with the column name "order recipient name" and the column with the column name "order receipt code" in FIG. 16) cannot be estimated correctly.
  • Next, the annotation result is corrected (step S23).
  • In this example, the temporary labels "prefecture name" and "classification code" are not appropriate. Therefore, as shown in FIG. 17, the user corrects them. This is performed, for example, by the user operating the input unit 90 as described above (see <Specific example of temporary label correction> above). FIG. 17 shows an example in which the temporary label "prefecture name" (associated with class 1) generated in step S26 is corrected to "receiver", and the temporary label "classification code" generated in step S26 is corrected to "order code".
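The class-level correction of step S23 can be sketched as follows. The dict-based representation and the `apply_corrections` helper are hypothetical; the class numbers and corrected names follow the FIG. 17 example:

```python
# Hypothetical sketch of step S23: a user correction replaces a temporary
# label for a whole class, so every column in that class is relabeled at
# once. Class ids and label names follow the FIG. 17 example.
temporary_labels = {1: "prefecture name", 2: "classification code"}
corrections = {1: "receiver", 2: "order code"}  # user input via input unit 90

def apply_corrections(labels, corrections):
    """Return labels with any user-corrected class names substituted."""
    return {cls: corrections.get(cls, name) for cls, name in labels.items()}

corrected = apply_corrections(temporary_labels, corrections)
```

Because a correction is keyed by class rather than by column, one user operation updates every column belonging to that class.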
  • According to the present embodiment, effective feature amounts can be calculated.
  • In addition, label names can be automatically assigned from column names by natural language processing. Therefore, the columns classified into each class can be grasped without checking the contents of the classes.
  • Furthermore, because the teacher data generated as described above has its feature amounts weighted at the annotation stage, teacher data that is convenient for the learner (learning model) can be created. For example, when deep learning is used as the learner, the time required for learning to converge can be expected to be shortened.
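The automatic assignment of a label name from column names can be sketched minimally. The normalization and majority vote below are assumptions standing in for the natural language processing the text mentions; all names are illustrative:

```python
# Hypothetical sketch of temporary label generation from column names:
# normalize the names of the columns belonging to one cluster and take
# a majority vote. This trivially stands in for the natural language
# processing mentioned in the text.
from collections import Counter

def cluster_label(column_names):
    """Pick the most common normalized column name as the cluster's label."""
    normalized = [name.strip().lower().replace("_", " ") for name in column_names]
    return Counter(normalized).most_common(1)[0][0]

label = cluster_label(["Prefecture_Name", "prefecture name", "pref_name"])
```

A label chosen this way lets the user grasp what a class contains without opening the class itself, as the text states.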
  • Note that the teacher data generation auxiliary device 100 described in the above embodiments may have the following hardware configuration.
  • FIG. 18 is a diagram showing a hardware configuration example of the teaching data generation auxiliary device 100 according to the present disclosure.
  • The teacher data generation auxiliary device 100 includes a processor 201 and a memory 202.
  • the processor 201 reads software (computer program) from the memory 202 and executes it to perform the processing of the teacher data generation auxiliary device 100 described using the flowcharts in the above-described embodiments.
  • the processor 201 may be, for example, a microprocessor, an MPU (Micro Processing Unit), or a CPU (Central Processing Unit).
  • Processor 201 may include multiple processors.
  • The processor 201 executes software (computer programs) read out from a memory 202 such as a RAM, and thereby functions as the data input unit 10, the feature amount calculation unit 20, the cluster classification unit 30, the temporary label generation unit 40, the label display control unit 50, the learning model storage unit 60, the label estimation unit 70, and the temporary label correction unit 80.
  • These functions may be implemented in one personal computer or server, or may be distributed and implemented in a plurality of servers. Even when distributed and implemented in a plurality of servers, the processing of each of the above flowcharts can be realized by the plurality of servers communicating with each other via a communication line (for example, the Internet). A part or all of these functions may be realized by hardware.
  • the memory 202 is configured by a combination of volatile memory and non-volatile memory.
  • The memory 202 may include storage located remotely from the processor 201.
  • the processor 201 may access the memory 202 via an I/O (Input/Output) interface (not shown).
  • the memory 202 is used to store software modules.
  • the processor 201 reads and executes these software modules from the memory 202, thereby performing the processing of the teacher data generation auxiliary device 100 described in the above embodiments.
  • Each of the one or more processors included in the teacher data generation auxiliary device 100 executes one or more programs containing a group of instructions for causing a computer to execute the algorithms described with reference to the drawings.
  • the program includes instructions (or software code) that, when read into a computer, cause the computer to perform one or more of the functions described in the embodiments.
  • the program may be stored in a non-transitory computer-readable medium or tangible storage medium.
  • Computer-readable media or tangible storage media may include random-access memory (RAM), read-only memory (ROM), flash memory, solid-state drives (SSD) or other memory technology, CD-ROM, digital versatile disc (DVD), Blu-ray disc or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices.
  • the program may be transmitted on a transitory computer-readable medium or communication medium.
  • transitory computer readable media or communication media include electrical, optical, acoustic, or other forms of propagated signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a teacher data generation auxiliary device (100) comprising a data input means (1) for receiving input of a table, a feature amount calculation means (2) for calculating feature amounts based on the actual data in the columns of the table, a cluster classification means (3) for classifying the columns of the table into a plurality of clusters based on the feature amounts, and a temporary label generation means (4) for generating temporary labels associated with the columns belonging to each of the plurality of clusters.
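The claimed pipeline (table input, per-column feature amounts, cluster classification, temporary labels) can be sketched end to end as follows, assuming simple length/digit-ratio features and k-means clustering; all column names and data are illustrative:

```python
# Hypothetical end-to-end sketch of the claimed pipeline: feature amounts
# are computed from each column's actual data, the columns are grouped
# into clusters, and each cluster would then receive a temporary label.
from sklearn.cluster import KMeans

def feature_amount(values):
    strs = [str(v) for v in values]
    n = len(strs)
    return [sum(len(s) for s in strs) / n,        # average value length
            sum(s.isdigit() for s in strs) / n]   # ratio of numeric values

table = {
    "pref_cd":   ["01", "13", "27", "40"],
    "order_cd":  ["001", "002", "003", "004"],
    "pref_name": ["Tokyo", "Osaka", "Aichi", "Hokkaido"],
    "receiver":  ["Tanaka", "Suzuki", "Sato", "Watanabe"],
}
names = list(table)
X = [feature_amount(table[c]) for c in names]

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
grouped = {}
for name, cluster in zip(names, clusters):
    grouped.setdefault(int(cluster), []).append(name)
# The two code-like columns land in one cluster and the two name-like
# columns in the other; each cluster is then given a temporary label.
```

The grouping corresponds to the cluster classification means; attaching a name to each entry of `grouped` corresponds to the temporary label generation means.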
PCT/JP2022/005937 2022-02-15 2022-02-15 Teacher data generation auxiliary device, teacher data generation auxiliary system, teacher data generation method, and non-transitory computer-readable medium WO2023157074A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/005937 WO2023157074A1 (fr) Teacher data generation auxiliary device, teacher data generation auxiliary system, teacher data generation method, and non-transitory computer-readable medium

Publications (1)

Publication Number Publication Date
WO2023157074A1 true WO2023157074A1 (fr) 2023-08-24

Family

ID=87577750

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/005937 WO2023157074A1 (fr) Teacher data generation auxiliary device, teacher data generation auxiliary system, teacher data generation method, and non-transitory computer-readable medium

Country Status (1)

Country Link
WO (1) WO2023157074A1 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931229A (zh) * 2020-07-10 2020-11-13 深信服科技股份有限公司 Data identification method, device, and storage medium
JP2021043881A (ja) * 2019-09-13 2021-03-18 株式会社クレスコ Information processing device, information processing method, and information processing program

Similar Documents

Publication Publication Date Title
AU2019261735B2 (en) System and method for recommending automation solutions for technology infrastructure issues
US20190228064A1 (en) Generation apparatus, generation method, and program
US11727203B2 (en) Information processing system, feature description method and feature description program
US10073827B2 (en) Method and system to generate a process flow diagram
CN106778878B (zh) 一种人物关系分类方法及装置
US11537797B2 (en) Hierarchical entity recognition and semantic modeling framework for information extraction
Kashmira et al. Generating entity relationship diagram from requirement specification based on nlp
CN114528845A (zh) 异常日志的分析方法、装置及电子设备
CN113704429A (zh) 基于半监督学习的意图识别方法、装置、设备及介质
CN113486189A (zh) 一种开放性知识图谱挖掘方法及系统
JP2019211974A (ja) 企業分析装置
JP6770709B2 (ja) 機械学習用モデル生成装置及びプログラム。
US11650996B1 (en) Determining query intent and complexity using machine learning
EP3605362A1 (fr) Système de traitement d'informations, procédé d'explication de valeur de caractéristique et programme d'explication de valeur de caractéristique
KR102206742B1 (ko) 자연언어 텍스트의 어휘 지식 그래프 표현 방법 및 장치
WO2023157074A1 (fr) Teacher data generation auxiliary device, teacher data generation auxiliary system, teacher data generation method, and non-transitory computer-readable medium
WO2018174000A1 (fr) Dispositif de gestion de configuration, procédé de gestion de configuration et support d'enregistrement
US20220229998A1 (en) Lookup source framework for a natural language understanding (nlu) framework
CN115017271A (zh) 用于智能生成rpa流程组件块的方法及系统
CN114201961A (zh) 一种注释预测方法、装置、设备及可读存储介质
CN113254612A (zh) 知识问答处理方法、装置、设备及存储介质
JP7159780B2 (ja) 修正内容特定プログラムおよびレポート修正内容特定装置
JP7375096B2 (ja) 分散表現生成システム、分散表現生成方法及び分散表現生成プログラム
US11681870B2 (en) Reducing latency and improving accuracy of work estimates utilizing natural language processing
WO2023228351A1 (fr) Learning device, management sheet creation assistance device, program, learning method, and management sheet creation assistance method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22926977

Country of ref document: EP

Kind code of ref document: A1