WO2023166578A1 - Labeling assistance system, labeling assistance method, and labeling assistance program - Google Patents
Labeling assistance system, labeling assistance method, and labeling assistance program
- Publication number
- WO2023166578A1 (application PCT/JP2022/008749)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Definitions
- the present invention relates to a labeling support system, a labeling support method, and a labeling support program that support labeling of unlabeled data.
- Patent Document 1 describes a sensor data classification device that classifies sensor data obtained from a large number of sensors according to their characteristics.
- the device described in Patent Document 1 associates each set of sensor data, divided at preset time intervals, with a sensor identifier and a divided-section identifier, and calculates a plurality of types of feature parameters from the data included in each divided set.
- an object of the present invention is to provide a labeling support system, a labeling support method, and a labeling support program that can support labeling work for clusters in which unlabeled data are classified.
- a labeling support system according to the present invention includes: first classification means for generating a first plurality of clusters by classifying, by unsupervised learning, a first data group, which is the data group to be labeled; second classification means for generating a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and output means for outputting, from among the data included in the second plurality of clusters, data that was classified into a different cluster in the first plurality of clusters.
- in a labeling support method according to the present invention, a computer generates a first plurality of clusters by classifying, by unsupervised learning, a first data group, which is the data group to be labeled; the computer generates a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and the computer outputs, from among the data included in the second plurality of clusters, data that was classified into a different cluster in the first plurality of clusters.
- a labeling support program according to the present invention causes a computer to execute: a first classification process of generating a first plurality of clusters by classifying, by unsupervised learning, a first data group, which is the data group to be labeled; a second classification process of generating a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and an output process of outputting, from among the data included in the second plurality of clusters, data that was classified into a different cluster in the first plurality of clusters.
- FIG. 1 is a block diagram showing a configuration example of an embodiment of a labeling support system according to the present invention.
- FIG. 2 is an explanatory diagram showing an example of data used in the labeling support system.
- FIG. 3 is an explanatory diagram showing an example of feature amounts.
- FIG. 4 is an explanatory diagram showing an example of visualizing dimension-reduced data in a graph.
- FIG. 5 is an explanatory diagram showing another example of visualizing dimension-reduced data in a graph.
- FIG. 6 is an explanatory diagram showing an example of processing for labeling data in a cluster.
- FIG. 7 is an explanatory diagram showing an example of processing for selecting some clusters.
- FIG. 8 is an explanatory diagram showing an example of processing for excluding part of the data.
- FIG. 9 is an explanatory diagram showing an example of an overlay display of results before and after refinement.
- FIGS. 10 and 11 are explanatory diagrams showing examples of displaying results before and after refinement in parallel windows.
- FIG. 12 is an explanatory diagram showing an example of displaying, in another window, a list of data with different results before and after refinement.
- FIG. 13 is an explanatory diagram showing an example of an overlay display of the results of multiple refinements.
- FIG. 14 is an explanatory diagram showing an example of displaying, in another window, a list of data with different results due to multiple refinements.
- FIG. 15 is an explanatory diagram showing an example of displaying statistical information of each cluster.
- FIG. 16 is an explanatory diagram showing another example of displaying statistical information of each cluster.
- a flowchart showing an operation example of the labeling support system.
- a block diagram showing an overview of a labeling support system according to the present invention.
- a schematic block diagram showing the configuration of a computer according to at least one embodiment.
- unlabeled data is not limited to moving images, and may be still images, music data, text data, and the like. Further, unlabeled data (data to be labeled) may be hereinafter referred to as unclassified data.
- FIG. 1 is a block diagram showing a configuration example of one embodiment of a labeling support system according to the present invention.
- the labeling support system 1 of this embodiment includes a data acquisition unit 10, a related information acquisition unit 20, an object identification unit 30, a data processing unit 40, a text information input unit 50, a feature extraction unit 60, a feature storage unit 70, a visualization processing unit 80, an input/output device 90, and a data refinement unit 100.
- the data acquisition unit 10 acquires data to be labeled (that is, unclassified data). For example, when a camera (not shown) captures an image of a traveling vehicle, the data acquisition unit 10 may acquire a moving image of the vehicle captured by the camera as data to be labeled.
- the data acquired by the data acquisition unit 10 is not limited to data acquired in real time.
- the data acquisition unit 10 may acquire the data to be labeled, for example, from a storage server (not shown) in which the data to be labeled is stored.
- the related information acquisition unit 20 acquires information related to data to be labeled (hereinafter referred to as related information).
- the related information is information indicating the situation in which the data to be labeled was generated, for example data detected by sensors (hereinafter referred to as sensor data).
- for example, when the data to be labeled is video data captured by an in-vehicle camera (drive recorder), the related information includes GPS (Global Positioning System) information representing the vehicle position and information acquired via CAN (Controller Area Network).
- the sensor data acquired in this case includes, for example, velocity, acceleration, and position (latitude, longitude, altitude, etc.).
- when a video showing the operating status of a thermal power plant is used as the data to be labeled, the sensor data includes, for example, fuel flow rate, pressure, temperature, rotation speed, and power generation amount.
- when images showing farm conditions are used as the data to be labeled, the sensor data includes, for example, time, temperature, humidity, pH, soil water content, solar radiation, wind direction and speed, and water level.
- the object identification unit 30 identifies objects included in the acquired data and generates information specifying the identified objects (hereinafter referred to as an object list). For example, when the object to be identified is a vehicle, the object identification unit 30 may identify the vehicle from the data acquired by the data acquisition unit 10 and generate, as an object list, information specifying the vehicle (for example, coordinates indicating its position in the image). Methods for identifying objects from images and videos are widely known, and detailed description thereof is omitted here.
- the data processing unit 40 processes the data (more specifically, the object list) into a form that can be used when the feature extraction unit 60, which will be described later, performs processing. Specifically, the data processing unit 40 processes the data so as to improve the accuracy of feature extraction and clustering.
- for example, the data processing unit 40 thins out data, interpolates missing values, excludes outliers, and deletes unnecessary data items. Further, when the data to be labeled is video data, the data processing unit 40 may convert the video data into numerical time-series data.
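- as a minimal sketch of this kind of preprocessing (not the implementation disclosed here), the following Python assumes a one-dimensional numerical series in which missing samples are `None`. The function names, the linear interpolation, and the z-score outlier rule are illustrative assumptions.

```python
from statistics import mean, pstdev

def interpolate_missing(series):
    """Fill None gaps linearly between the nearest known neighbours (assumed rule)."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        return filled
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:          # gap at the start: copy the first known value
            filled[i] = series[right]
        elif right is None:       # gap at the end: copy the last known value
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled

def exclude_outliers(series, z_max=3.0):
    """Drop values more than z_max population standard deviations from the mean."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return list(series)
    return [v for v in series if abs(v - mu) / sigma <= z_max]

def thin(series, step=2):
    """Keep every step-th sample."""
    return list(series[::step])
```

The three helpers are independent, so a pipeline could apply them in whichever order suits the data, for example interpolating first so that thinning never lands on a gap.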
- the text information input unit 50 accepts input of text data including information to be added to each data to be labeled (hereinafter referred to as additional information).
- the additional information is information indicating the content of the labeling target data that can be acquired other than the related information. Categories of additional information include, for example, weather, types of plants, and traffic participants. Examples of weather category values include sunny, cloudy, rainy, and snowy. Examples of plant type category values include rice, wheat, and barley. Examples of traffic participant category values include pedestrians and the like.
- labeling target data associated with additional information is also simply referred to as labeling target data.
- FIG. 2 is an explanatory diagram showing an example of data used in the labeling support system 1 of this embodiment.
- the example shown in FIG. 2 indicates that the data acquisition unit 10 has acquired the video 11 as data to be labeled, and the related information acquisition unit 20 has acquired related information 21 regarding the location where the video 11 was shot.
- the example also indicates that the data processing unit 40 has processed the video 11 and the related information 21 (more specifically, the object list generated by the object identification unit 30) to generate numerical time-series data 41. Furthermore, the example shown in FIG. 2 indicates that the text information input unit 50 has received input of text data 51 including information on the weather, scene, time period, and objects as additional information.
- the feature extraction unit 60 extracts features from each data to be labeled.
- the feature extraction unit 60 of the present embodiment firstly generates a plurality of clusters by automatically classifying each data to be labeled including additional information by unsupervised learning. Any method can be used to generate clusters by unsupervised learning, and examples thereof include the k-means method and the Gaussian mixture model.
- the process in which the feature extraction unit 60 classifies the data group to be labeled by unsupervised learning to generate a plurality of clusters will be referred to as the first classification process.
- a plurality of clusters generated by the first classification process will be referred to as a first plurality of clusters, and a data group classified into the first plurality of clusters will be referred to as a first data group.
- since the feature extraction unit 60 classifies the data to be labeled by unsupervised learning, it can also be called a classification means.
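- the first classification process can be illustrated with a small k-means sketch in plain Python. This is only one of the unsupervised methods the text allows; the function names, the optional `init` parameter, and the fixed iteration count are assumptions made for the example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(points, k, init=None, iterations=20, seed=0):
    """Return (assignment, centroids) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Start from the given centroids, or from k randomly chosen data points.
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(pt, centroids[c]))
            for pt in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [pt for pt, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids
```

The returned `assignment` list plays the role of the first plurality of clusters: entry `i` is the cluster index of data item `i`.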
- the feature extraction unit 60 extracts the feature amount of each data included in the generated cluster.
- the feature extraction unit 60 may extract, for example, additional information included in the text data as a feature amount.
- the feature extraction unit 60 may extract feature amounts indicated by numerical time-series data.
- the feature extraction unit 60 may extract feature amounts based on sensor values included in the data to be labeled (more specifically, numerical time-series data).
- any method can be used to extract feature amounts from numerical time-series data. For example, for each cluster generated by the k-means method, the feature extraction unit 60 may extract, as a feature amount, the distance from the center of gravity of the numerical time-series data included in the cluster to each data item (a cluster distance feature).
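- a cluster distance feature of this kind can be sketched as follows, assuming each data item is already a numeric tuple with a cluster index; the function name is hypothetical.

```python
import math

def cluster_distance_features(points, assignment):
    """Distance from each point to the centre of gravity of its own cluster."""
    clusters = {}
    for pt, c in zip(points, assignment):
        clusters.setdefault(c, []).append(pt)
    # Centre of gravity = per-dimension mean of the cluster members.
    centroids = {
        c: tuple(sum(col) / len(members) for col in zip(*members))
        for c, members in clusters.items()
    }
    return [math.dist(pt, centroids[c]) for pt, c in zip(points, assignment)]
```

Items with a large value sit far from the bulk of their cluster, which makes this feature a natural input for later outlier handling.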
- in the description above, the object identification unit 30 identifies objects from the information obtained by the data acquisition unit 10 and the related information acquisition unit 20, the data processing unit 40 processes the identification result, and the feature extraction unit 60 uses the processed result.
- alternatively, the data acquisition unit 10 may directly acquire data in the format used by the feature extraction unit 60 and input the acquired data to the feature extraction unit 60.
- in that case, the labeling support system 1 does not have to include the related information acquisition unit 20, the object identification unit 30, and the data processing unit 40.
- the feature storage unit 70 stores feature amounts of each data extracted by the feature extraction unit 60 .
- the feature storage unit 70 may also store information on labels added by the data refinement unit 100, which will be described later. Note that the mode in which the feature storage unit 70 stores the feature amount for each data is arbitrary.
- FIG. 3 is an explanatory diagram showing an example of feature amounts stored in the feature storage unit 70.
- in the example shown in FIG. 3, each entry in the vertical direction represents one feature point, and the horizontal direction represents the feature amount (category value) of each category (for example, weather, traffic participants, types of plants, etc.).
- the feature storage unit 70 is implemented by, for example, a magnetic disk.
- the visualization processing unit 80 performs processing for visualizing information that contributes to the labeling work for the generated clusters.
- the visualization processing unit 80 of the present embodiment visualizes the data to be labeled by reducing its dimensionality and drawing the result as a graph on the input/output device 90, so that a person can observe how the data is clustered.
- the visualization processing unit 80 may use UMAP (Uniform Manifold Approximation and Projection) or the like to reduce the data to be labeled to two or three dimensions, and may visualize the dimension-reduced data as a graph such as a distribution map.
- the visualization processing unit 80 may display the data classified into the same cluster in a manner different from that of other clusters (for example, by changing the color, changing the symbol, etc.).
- FIG. 4 is an explanatory diagram showing an example of visualizing the dimension-reduced data in a graph.
- the graph illustrated in FIG. 4 shows an example in which the data reduced to two dimensions by UMAP are displayed in different manners (hatching, blacking, etc.) for each cluster to which they belong.
- FIG. 5 is an explanatory diagram showing another example of visualizing the dimension-reduced data in a graph.
- the graph illustrated in FIG. 5 is a graph displayed by changing symbols plotted for each type of video data.
- the visualization processing unit 80 may display the range surrounded by a dotted line so that the range of data included in the cluster can be identified.
- the visualization processing unit 80 may display all data, or may display or hide only data that satisfies a specific condition.
- for example, the visualization processing unit 80 may decide whether to display or hide clusters that satisfy a specific condition (for example, clusters with more data than a predetermined number) and unclassified data (that is, unlabeled data).
- the visualization processing unit 80 of the present embodiment outputs data that belong to different clusters as a result of re-learning processing, which will be described later.
- a data output method will be described later.
- the input/output device 90 displays the output result from the visualization processing unit 80.
- the input/output device 90 also receives input from the user regarding the displayed result, and executes processing according to the input.
- the processing of the data refinement unit 100 which will be described later, is performed based on the input of the cluster specified by the user with respect to the output of the input/output device 90.
- the input/output device 90 may be realized by a tablet terminal or the like. Alternatively, the input/output device 90 may be realized by a device having a display device and a pointing device.
- the data refinement unit 100 performs each process on the data group to be labeled based on the clusters generated by the feature extraction unit 60. Specifically, the data refinement unit 100 generates a second data group from the labeling target data group according to the generated first plurality of clusters. In this embodiment, the data refinement unit 100 performs the following three types of processing.
- the first process is the process of labeling the data within the cluster.
- the data refinement unit 100 generates the second data group by labeling, cluster by cluster, the data in the data group to be labeled that was classified into one of the first plurality of clusters. The data refinement unit 100 may label any cluster.
- the data refinement unit 100 may label all clusters, or may label clusters specified by the user via the input/output device 90 .
- the data refinement unit 100 may add an arbitrary temporary label to the data in the target cluster, or may add a label with content specified by the user. Then, the data refinement unit 100 may associate the data (more specifically, the feature amount of the data) with the added label and store them in the feature storage unit 70 .
- FIG. 6 is an explanatory diagram showing an example of processing for labeling data within a cluster.
- the example shown in FIG. 6 indicates that the data refinement unit 100 added temporary labels “A”, “B” and “C” to the clusters illustrated in FIG. 5, respectively. Note that when the user designates a cluster to be labeled among the clusters illustrated in FIG. 5, the data refinement unit 100 may add a temporary label only to the designated cluster.
- the feature extraction unit 60 regenerates a plurality of clusters by learning (supervised learning) using the labeled data.
- the feature extraction unit 60 may perform learning (unsupervised learning) by adding unlabeled data.
- a process of generating a plurality of clusters by classifying a data group including at least part of data to be labeled by the feature extraction unit 60 will be referred to as a second classification process.
- a plurality of clusters generated by the second classification process will be referred to as a second plurality of clusters
- a data group classified into the second plurality of clusters will be referred to as a second data group.
- in the second classification process, at least part of the labeling target data used in the first classification process is used to generate the plurality of clusters again and refine them.
- This can be called a relearning process or refinement. This makes it possible to semi-automate labeling through unsupervised learning, and also contributes to the discovery of new labels.
- the feature extraction unit 60 may extract feature amounts of each data included in the clusters (second plurality of clusters) generated by the second classification process, and store the extracted feature amounts in the feature storage unit 70.
- the visualization processing unit 80 outputs data classified into different clusters in the first plurality of clusters among the data included in the second plurality of clusters. This corresponds to the process of visualizing data belonging to different clusters as a result of re-learning. Note that specific processing for visualization will be described later.
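- the first process and its output step can be sketched as below. The text does not name the supervised learner, so a nearest-centroid classifier stands in for it here purely as an assumption; the function names and the use of `None` for unlabeled data are also illustrative.

```python
import math

def relearn_with_labels(points, temp_labels):
    """Re-classify every point with a nearest-centroid rule (assumed learner).

    temp_labels[i] is the temporary label given to point i after the first
    classification, or None if the point was left unlabeled.
    """
    groups = {}
    for pt, label in zip(points, temp_labels):
        if label is not None:
            groups.setdefault(label, []).append(pt)
    centroids = {
        label: tuple(sum(col) / len(members) for col in zip(*members))
        for label, members in groups.items()
    }
    return [
        min(centroids, key=lambda label: math.dist(pt, centroids[label]))
        for pt in points
    ]

def changed_items(first_labels, second_labels):
    """Indices of labeled data whose cluster differs after re-learning."""
    return [
        i for i, (a, b) in enumerate(zip(first_labels, second_labels))
        if a is not None and a != b
    ]
```

The indices returned by `changed_items` are the data that the output step would highlight for the user to review.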
- the second process is a process of selecting at least some clusters and learning again (unsupervised learning).
- the data refinement unit 100 generates, as the second data group, the data in the data group to be labeled that was classified into a cluster selected from the first plurality of clusters.
- the data refinement unit 100 selects at least some clusters from among the first plurality of clusters.
- the data refinement unit 100 may select a cluster specified by the user via the input/output device 90, or may automatically select a cluster that satisfies a condition.
- the condition is arbitrary; examples include a cluster whose number of data items is at least a predetermined number, and a cluster whose ratio of classified data exceeds a predetermined threshold.
- the data group within the cluster selected here corresponds to the above-described second data group.
- FIG. 7 is an explanatory diagram showing an example of the process of selecting some clusters.
- the example shown in FIG. 7 indicates that two clusters have been selected from the three generated clusters.
- the data refinement unit 100 may add arbitrary cluster identification information to the data in each cluster so that the clusters classified in the first classification process can be identified.
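- automatic cluster selection and construction of the second data group can be sketched as follows. The minimum-size condition is one of the conditions named in the text; the function names and the `(item, cluster id)` pairing used to carry cluster identification information are assumptions for illustration.

```python
from collections import Counter

def select_clusters(assignment, min_size):
    """Cluster ids whose number of data items is at least min_size."""
    counts = Counter(assignment)
    return {c for c, n in counts.items() if n >= min_size}

def second_data_group(items, assignment, selected):
    """Keep only data in the selected clusters, tagged with its first cluster id."""
    return [(item, c) for item, c in zip(items, assignment) if c in selected]
```

Carrying the first cluster id alongside each item keeps the original grouping recoverable after the data is clustered again.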
- the feature extraction unit 60 regenerates a plurality of clusters (that is, performs re-learning processing) by learning (unsupervised learning) targeting data in the selected cluster.
- This process corresponds to the above-described second classification process, and the generated clusters correspond to the second plurality of clusters.
- the feature extraction unit 60 may perform learning by adding new data separately. As a result, it is possible to dig deeper into the data within the cluster, so it can be expected to classify the data in more detail.
- as in the first process, the visualization processing unit 80 outputs, from among the data included in the second plurality of clusters, data that was classified into a different cluster in the first plurality of clusters.
- specifically, among the data in each cluster, the visualization processing unit 80 may output the data whose cluster identification information is in the minority (that is, other than the identification information with the largest share) as data classified into a different cluster in the first plurality of clusters.
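- the minority rule can be sketched as follows, assuming each data item carries the cluster identification information from the first classification (`first_ids`) and its cluster in the second classification (`second_ids`); the function name is hypothetical.

```python
from collections import Counter, defaultdict

def minority_items(first_ids, second_ids):
    """Indices of data whose first-cluster id is not the majority in its second cluster."""
    by_second = defaultdict(list)
    for i, c2 in enumerate(second_ids):
        by_second[c2].append(i)
    flagged = []
    for members in by_second.values():
        # The id with the largest share in this second cluster is the majority;
        # on a tie, Counter keeps the id that was encountered first.
        majority_id, _ = Counter(first_ids[i] for i in members).most_common(1)[0]
        flagged.extend(i for i in members if first_ids[i] != majority_id)
    return sorted(flagged)
```

This avoids having to match cluster numbers between the two runs, since only the composition of each new cluster is inspected.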
- the third process is a process of excluding at least part of the data not classified into clusters, such as outliers, and learning again (unsupervised learning or supervised learning).
- the data refinement unit 100 generates, as a second data group, a data group excluding one or more data not classified into any of the first plurality of clusters from the data group to be labeled.
- FIG. 8 is an explanatory diagram showing an example of processing for excluding part of the data.
- the example shown in FIG. 8 indicates that the data in the range surrounded by a solid line circle is excluded as an outlier.
- when the data to be labeled is video data, this corresponds to processing for excluding noise scenes.
- in the third process, at least one of the above-described first process and second process, or both, is then performed. This is expected to improve classification accuracy.
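- one way to picture the exclusion step is a user-drawn circle on the two-dimensional plot, as in the FIG. 8 example. The sketch below assumes the data has already been reduced to 2D coordinates; the circle parameters and the function name are illustrative assumptions, not part of the disclosed system.

```python
import math

def exclude_in_circle(coords, center, radius):
    """Split data indices into (kept, removed) by a circle on the 2D plot."""
    kept, removed = [], []
    for i, xy in enumerate(coords):
        # Points inside or on the circle are treated as outliers to exclude.
        (removed if math.dist(xy, center) <= radius else kept).append(i)
    return kept, removed
```

The `kept` indices would form the second data group that is then passed to the first or second process.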
- the three types of processing performed by the data refinement unit 100 have been described above. However, the processing executed by the data refinement unit 100 is not limited to the three types of processing described above.
- the data refinement unit 100 may also perform data maintenance processing. Also, after each of the first process, the second process, and the third process, the same process may be performed again, or a different process may be performed.
- the data refinement unit 100 may output a file containing a data group to which labels have been added or a data group from which outliers have been removed.
- the data refinement unit 100 may, for example, create a label file in which the designated labels are described, copy only the labeled data to a folder for the next learning run, and sort (move or copy) the original data into folders for each label.
- the data refinement unit 100 may create a data list file describing only the data belonging to the selected cluster, copy only the data belonging to the selected cluster to the folder for the next learning run, and the like.
- the data refinement unit 100 may create a data list file describing only data other than the specified data (outliers), copy the data other than the specified data (outliers) to the folder for the next learning run, and the like.
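- the label-file output can be sketched with the Python standard library as below. The CSV layout, the `labels.csv` file name, and the per-label folder layout are assumptions chosen for the example; the text does not specify a file format.

```python
import csv
import shutil
from pathlib import Path

def export_labeled(data_dir, labels, out_dir):
    """Write a label file and copy labeled files into per-label folders.

    labels maps a file name to its label; unlabeled files are simply absent.
    """
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    label_file = out_dir / "labels.csv"
    with open(label_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file", "label"])
        for name, label in sorted(labels.items()):
            writer.writerow([name, label])
            target = out_dir / label
            target.mkdir(exist_ok=True)
            # Copy rather than move so the original data stays untouched.
            shutil.copy2(data_dir / name, target / name)
    return label_file
```

Only files named in `labels` are copied, so the output folder contains exactly the labeled data for the next learning run.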
- a method for the visualization processing unit 80 to visualize data belonging to a different cluster as a result of re-learning will be specifically described below.
- the visualization processing unit 80 performs dimension reduction on the data group to be labeled, and draws a graph of the dimension-reduced data included in the first plurality of clusters and the dimension-reduced data included in the second plurality of clusters in such a manner that each cluster can be identified.
- the visualization processing unit 80 displays, among the dimension-reduced data included in the second plurality of clusters, the data that was classified into a different cluster in the first plurality of clusters in a manner different from other data.
- Examples of different aspects include changing the shade of color, changing the color itself, changing the line of the outer frame, and blinking.
- FIG. 9 is an explanatory diagram showing an example of an overlay display of results before and after refinement.
- the example shown in FIG. 9 indicates that the visualization processing unit 80 superimposes the data distributions of the individual refinements and displays data outside the layer of interest (that is, the refinement of interest) in a manner different from the data of the layer of interest.
- in this example, the result of the first refinement and the result of the second refinement are superimposed and displayed.
- the data d1 included in the target cluster only in the second refinement is shown in a manner different from other data.
- the data d2 which is included in the cluster of interest only in the first refinement, is shown in a manner different from the other data.
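- the sets that the overlay highlights can be computed with plain set operations. In this sketch, `first` and `second` are assumed to map a data id to its cluster in the first and second refinement; the function and key names are hypothetical.

```python
def membership_changes(first, second, first_cluster, second_cluster):
    """Compare membership of a cluster of interest across two refinements."""
    in_first = {i for i, c in first.items() if c == first_cluster}
    in_second = {i for i, c in second.items() if c == second_cluster}
    return {
        "only_second": sorted(in_second - in_first),  # corresponds to data d1
        "only_first": sorted(in_first - in_second),   # corresponds to data d2
        "both": sorted(in_first & in_second),
    }
```

The caller names the cluster of interest in each refinement explicitly, because cluster identifiers are not guaranteed to match between runs.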
- FIGS. 10 and 11 are explanatory diagrams showing examples of displaying results before and after refinement in parallel windows.
- the visualization processing unit 80 may display the results before and after refinement in separate windows.
- the visualization processing unit 80 may display the data that changed before and after refinement in a manner different from other data, as illustrated in FIG. 11.
- the visualization processing unit 80 may display a list of data with different results before and after refinement (that is, data classified into different clusters).
- FIG. 12 is an explanatory diagram showing an example of displaying, in a separate window, a list of data d3 whose results differ before and after refinement. In the example shown in FIG. 12, the list gives the coordinates at which each such data item is displayed.
- FIGS. 9 to 12 exemplify the case of comparing two refinement results.
- comparison targets are not limited to two results, and may be three or more.
- FIG. 13 is an explanatory diagram showing an example of an overlay display of the results of multiple refinements.
- FIG. 14 is an explanatory diagram showing an example of displaying, in another window, a list of data whose results differ across multiple refinements. Whereas the example shown in FIG. 9 compares two refinement results, the example shown in FIG. 13 has four refinement results. Similarly, the example shown in FIG. 14 shows an example in which there are four refinement results in comparison with the example shown in FIG
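- a list of data whose results differ across several refinements can be built as follows. The sketch assumes that cluster identifiers are comparable between refinements, which a real system would have to ensure separately; the function name and the `history` structure are illustrative.

```python
def changed_across_refinements(history):
    """List data whose cluster differs in at least one refinement.

    history is a list of dicts, one per refinement, mapping data id -> cluster id.
    Data missing from a refinement (for example, excluded data) appears as None.
    """
    ids = set().union(*history)
    rows = []
    for i in sorted(ids):
        trail = [h.get(i) for h in history]
        if len(set(trail)) > 1:
            rows.append((i, trail))
    return rows
```

Each returned row gives a data id and its cluster in every refinement, which is the kind of content a list window could show.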
- the visualization processing unit 80 may display cluster statistical information for each data group classification process (that is, refinement) separately from the above-described graph or together with the above-described graph. Note that the creation of the statistical information may be performed by the visualization processing unit 80 or by the feature extraction unit 60 .
- FIG. 15 is an explanatory diagram showing an example of displaying statistical information of each cluster.
- the example shown in FIG. 15 shows an example of displaying the number of data in the cluster, the center of gravity of the data, and the variance (x-direction and y-direction) as the cluster statistical information.
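- the statistics named here can be computed per cluster from the two-dimensional coordinates, as in the sketch below. Using the population variance is an assumption, as are the function and key names.

```python
from statistics import mean, pvariance

def cluster_statistics(coords, assignment):
    """Count, centre of gravity, and x/y variance for each cluster."""
    groups = {}
    for (x, y), c in zip(coords, assignment):
        groups.setdefault(c, []).append((x, y))
    stats = {}
    for c, pts in groups.items():
        xs, ys = zip(*pts)
        stats[c] = {
            "count": len(pts),
            "centroid": (mean(xs), mean(ys)),
            "variance": (pvariance(xs), pvariance(ys)),  # population variance
        }
    return stats
```

Computing the same table after every refinement makes it possible to show the statistics side by side or to switch between them.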
- the visualization processing unit 80 may switch and display the statistical information for each refinement, or may display them side by side.
- FIG. 16 is an explanatory diagram showing another example of displaying the statistical information of each cluster.
- the visualization processing unit 80 may display cluster statistical information (e.g., false positive rate) in graph and tabular form.
- the example shown in FIG. 16 represents the degree of matching between labels and assigned clusters when supervised learning is performed. In the example shown in FIG. 16, the first run is assumed to be unsupervised learning, so it has no evaluation result.
- the data acquisition unit 10, the related information acquisition unit 20, the object identification unit 30, the data processing unit 40, the text information input unit 50, the feature extraction unit 60, the visualization processing unit 80, and the data refinement unit 100 are implemented by a computer processor (e.g., a CPU (Central Processing Unit)) operating according to a program (labeling support program).
- the program may be stored in a storage unit (not shown) of the labeling support system 1, and the processor may read the program and, according to the program, operate as the data acquisition unit 10, the related information acquisition unit 20, the object identification unit 30, the data processing unit 40, the text information input unit 50, the feature extraction unit 60, the visualization processing unit 80, and the data refinement unit 100.
- the functions of the labeling support system 1 may be provided in a SaaS (Software as a Service) format.
- a data acquisition unit 10, a related information acquisition unit 20, an object identification unit 30, a data processing unit 40, a text information input unit 50, a feature extraction unit 60, a visualization processing unit 80, and a data refinement unit 100 may be implemented by dedicated hardware. Also, part or all of each component of each device may be implemented by general-purpose or dedicated circuitry, processors, etc., or combinations thereof. These may be composed of a single chip, or may be composed of multiple chips connected via a bus. A part or all of each component of each device may be implemented by a combination of the above-described circuits and the like and programs.
- when each component of the labeling support system 1 is realized by a plurality of information processing devices, circuits, etc., the plurality of information processing devices, circuits, etc. may be centrally arranged or may be distributed.
- the information processing devices, circuits, and the like may be implemented in a form in which each is connected via a communication network, such as a client-server system or a cloud computing system.
- FIG. 17 is a flow chart showing an operation example of the labeling support system 1.
- FIG. 17 is an operation example when the data acquisition unit 10 directly acquires data in a format used by the feature extraction unit 60 and inputs the acquired data to the feature extraction unit 60 .
- the feature extraction unit 60 generates a first plurality of clusters from the data group to be labeled (first data group) (step S11). After that, the feature extraction unit 60 generates a second plurality of clusters from a data group (second data group) including at least part of data to be labeled (step S12). Then, the visualization processing unit 80 outputs data classified into different clusters in the first plurality of clusters among the data included in the second plurality of clusters (step S13).
- the feature extraction unit 60 classifies the first data group by unsupervised learning to generate the first plurality of clusters. Also, the feature extraction unit 60 classifies the second data group to generate a second plurality of clusters. Then, the visualization processing unit 80 outputs data classified into different clusters in the first plurality of clusters among the data included in the second plurality of clusters. Therefore, it is possible to support labeling work for clusters in which unlabeled data are classified.
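Step S13 can be illustrated with a small sketch. Cluster ids produced by two separate classification runs are generally not comparable, so this example first aligns each second-run cluster with the first-run cluster it overlaps most and then reports the data whose aligned cluster differs. The alignment rule and the names `align_labels` and `changed_items` are assumptions made for illustration; the source does not state how clusters from the two runs are matched.

```python
from collections import Counter


def align_labels(first, second):
    """Map each second-run cluster id to the first-run id it overlaps most.

    first, second: dicts of data id -> cluster id. The second data group
    may contain only part of the first one.
    """
    overlap = Counter((second[i], first[i]) for i in second if i in first)
    mapping = {}
    for (new_id, old_id), _count in overlap.most_common():
        # most_common() is sorted by count, so the first hit is the largest.
        mapping.setdefault(new_id, old_id)
    return mapping


def changed_items(first, second):
    """Return ids in the second run that sat in a different first-run cluster."""
    mapping = align_labels(first, second)
    return sorted(
        i for i in second
        if i in first and mapping[second[i]] != first[i]
    )
```

This greedy majority mapping is not guaranteed to be one-to-one; a stricter matching could be substituted without changing the surrounding flow.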
- the data refinement unit 100 generates a second data group from among the data group to be labeled, according to the generated first plurality of clusters. Therefore, it is possible to improve the accuracy of re-learning using the generated second data group.
- FIG. 18 is a block diagram showing an overview of a labeling support system according to the present invention.
- a labeling support system 180 (for example, the labeling support system 1) according to the present invention includes: first classifying means 181 (for example, the feature extraction unit 60) that classifies a first data group, which is a data group to be labeled, by unsupervised learning to generate a first plurality of clusters; second classifying means 182 (for example, the feature extraction unit 60) that classifies (that is, re-learns) a second data group, which is a data group including at least part of the data to be labeled, to generate a second plurality of clusters; and output means 183 (for example, the visualization processing unit 80) that outputs, among the data included in the second plurality of clusters, the data classified into a different cluster in the first plurality of clusters.
- the labeling support system 180 includes data refining means (for example, the data refinement unit 100).
- the data refining means generates a second data group by performing labeling for each cluster on data classified into one of the first plurality of clusters among the data group to be labeled (for example, the first processing by the data refinement unit 100).
- the data refining means may generate, as a second data group, a data group classified into a cluster selected from the first plurality of clusters among the data group to be labeled (for example, the second processing by the data refinement unit 100).
- the data refining means may generate, as the second data group, a data group obtained by excluding one or more data not classified into any of the first plurality of clusters from the data group to be labeled (for example, the third processing by the data refinement unit 100).
- the output means may reduce the dimension of the data group to be labeled, graph the dimension-reduced data included in the first plurality of clusters and the dimension-reduced data included in the second plurality of clusters in a manner in which each cluster can be identified, and display, among the dimension-reduced data included in the second plurality of clusters, the data classified into different clusters in the first plurality of clusters in a manner different from the other data.
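The three ways of building the second data group described above can be sketched as follows. This is an illustrative sketch only: the function names, the dictionary representation of cluster assignments, and the use of `None` to mark data that belongs to no cluster are all assumptions, not details from the source.

```python
def label_by_cluster(assignments, cluster_labels):
    """First processing: give each datum the label chosen for its cluster.

    assignments: data id -> cluster id (None means "in no cluster").
    cluster_labels: cluster id -> label text for the clusters to be labeled.
    """
    return {
        item: cluster_labels[cluster]
        for item, cluster in assignments.items()
        if cluster in cluster_labels
    }


def select_clusters(assignments, selected):
    """Second processing: keep only data that fell into the selected clusters."""
    return [item for item, cluster in assignments.items() if cluster in selected]


def drop_unclustered(assignments):
    """Third processing: exclude data that was not classified into any cluster."""
    return [item for item, cluster in assignments.items() if cluster is not None]
```

The output of any of the three functions would then serve as input to the second classification.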
- the output means may display cluster statistical information for each data group classification process.
- FIG. 19 is a schematic block diagram showing the configuration of a computer according to at least one embodiment.
- a computer 1000 comprises a processor 1001, a main storage device 1002, an auxiliary storage device 1003, and an interface 1004.
- the labeling support system 180 described above is implemented in the computer 1000.
- the operation of each processing unit described above is stored in the auxiliary storage device 1003 in the form of a program (labeling support program).
- the processor 1001 reads out the program from the auxiliary storage device 1003, loads it into the main storage device 1002, and executes the above processing according to the program.
- the auxiliary storage device 1003 is an example of a non-transitory tangible medium.
- Other examples of non-transitory tangible media include magnetic disks, magneto-optical disks, CD-ROMs (Compact Disc Read-only memory), DVD-ROMs (Read-only memory), and semiconductor memories connected via the interface 1004.
- the computer 1000 receiving the distribution may develop the program in the main storage device 1002 and execute the above process.
- the program may be for realizing part of the functions described above.
- the program may be a so-called difference file (difference program) that implements the above-described functions in combination with another program already stored in the auxiliary storage device 1003 .
- the data refining means generates a second data group by performing labeling for each cluster on data classified into one of the first plurality of clusters among the data group to be labeled.
- the labeling support system according to appendix 1 or appendix 2.
- the data refining means generates, as a second data group, a data group classified into a cluster selected from the first plurality of clusters among the data groups to be labeled.
- the data refining means generates, as a second data group, a data group excluding one or more data not classified into any of the first plurality of clusters from the data group to be labeled.
- the labeling support system according to any one of appendices 1 to 4.
- the output means reduces the dimension of the data group to be labeled, graphs the dimension-reduced data contained in the first plurality of clusters and the dimension-reduced data contained in the second plurality of clusters in a manner in which each cluster can be identified, and displays, among the dimension-reduced data included in the second plurality of clusters, the data classified into different clusters in the first plurality of clusters in a manner different from other data.
- the labeling support system according to any one of appendices 1 to 5.
- a computer generates a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning,
- the computer generates a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled, and
- the labeling support method wherein the computer outputs data classified into different clusters in the first plurality of clusters, among the data included in the second plurality of clusters.
- (Appendix 11) the program storage medium according to appendix 10, which stores a labeling support program for causing the computer to execute a data refinement process for generating a second data group from the data group to be labeled according to the generated first plurality of clusters.
- Reference signs: 1 labeling support system; 10 data acquisition unit; 20 related information acquisition unit; 30 object identification unit; 40 data processing unit; 50 text information input unit; 60 feature extraction unit; 70 feature storage unit; 80 visualization processing unit; 90 input/output device; 100 data refinement unit
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A first classification means 181 generates a first plurality of clusters by classifying a first group of data, which is a group of data to be labeled, by unsupervised learning. A second classification means 182 generates a second plurality of clusters by classifying a second group of data, which is a group of data including at least a part of the group of data to be labeled. An output means 183 outputs data that is included in the second plurality of clusters and is classified into a different cluster for the first plurality of clusters.
Description
本発明は、ラベル付けされていないデータに対するラベリングを支援するラベリング支援システム、ラベリング支援方法およびラベリング支援プログラムに関する。
The present invention relates to a labeling support system, a labeling support method, and a labeling support program that support labeling of unlabeled data.
IoT(Internet of Things)社会において、様々な機器からデータを収集することが可能になっている。一方、例えば、大量のデータの中から、目的とする映像を単純作業で見つけようとするのは、非常に困難である。そこで、収集されたデータを検索する仕組みが求められている。
In the IoT (Internet of Things) society, it has become possible to collect data from various devices. On the other hand, for example, it is very difficult to find a desired video from a large amount of data by simple work. Therefore, there is a demand for a mechanism for searching the collected data.
データを検索するための仕組みとして、そのデータに対するラベリングを行う方法が挙げられる。ただし、大量のデータに対するラベリングを人手で行うには膨大な時間およびコストがかかってしまうため、データを分類するための方法が各種提案されている。
As a mechanism for searching data, there is a method of labeling the data. However, since labeling a large amount of data manually takes a huge amount of time and cost, various methods for classifying data have been proposed.
例えば、特許文献1には、多数のセンサにより得られるセンサデータをその特徴に応じて分類するセンサデータ分類装置が記載されている。特許文献1に記載された装置は、予め設定した時間区間ごとに分割されたセンサデータの集合をセンサ識別子および分割区間識別子と関連付け、分割データの集合に含まれるデータからその複数種の特徴パラメータを算出する。
For example, Patent Document 1 describes a sensor data classification device that classifies sensor data obtained from a large number of sensors according to their characteristics. The device described in Patent Document 1 associates a set of sensor data divided into preset time intervals with a sensor identifier and a divided-section identifier, and calculates a plurality of types of feature parameters from the data included in the set of divided data.
例えば、ルールベースで自動的にラベリングを行うことも考えられる。しかし、環境等の変化に応じてルールをメンテナンスする作業は煩雑であり、また、ルールの追加等の作業も容易ではない。
For example, automatic labeling based on rules is also possible. However, the work of maintaining rules in response to changes in the environment or the like is complicated, and work such as adding rules is not easy.
特許文献1に記載された装置では、分類を行うための特徴パラメータの計算方法や、分割区間が予め定められる。しかし、何らかの基準に基づいて算出された数値からデータを分類したとしても、ラベル付けされていないデータに対して意味のあるラベリング作業を行うには、やはりコストがかかってしまうという問題がある。
In the device described in Patent Document 1, the calculation method of feature parameters for classification and division intervals are determined in advance. However, even if data is classified based on numerical values calculated on the basis of some criteria, there is still the problem that performing meaningful labeling work on unlabeled data still entails costs.
そこで、本発明は、ラベル付けされていないデータが分類されたクラスタに対するラベリング作業を支援できるラベリング支援システム、ラベリング支援方法およびラベリング支援プログラムを提供することを目的とする。
Therefore, an object of the present invention is to provide a labeling support system, a labeling support method, and a labeling support program that can support labeling work for clusters in which unlabeled data are classified.
本発明によるラベリング支援システムは、ラベリング対象のデータ群である第一のデータ群を教師なし学習により分類することで第一の複数のクラスタを生成する第一分類手段と、ラベリング対象のデータの少なくとも一部のデータを含むデータ群である第二のデータ群を分類することで第二の複数のクラスタを生成する第二分類手段と、第二の複数のクラスタに含まれるデータのうち、第一の複数のクラスタでは異なるクラスタに分類されていたデータを出力する出力手段とを備えたことを特徴とする。
A labeling support system according to the present invention includes: first classification means for generating a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning; second classification means for generating a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and output means for outputting, among the data included in the second plurality of clusters, data that was classified into a different cluster in the first plurality of clusters.
本発明によるラベリング支援方法は、コンピュータが、ラベリング対象のデータ群である第一のデータ群を教師なし学習により分類することで第一の複数のクラスタを生成し、コンピュータが、ラベリング対象のデータの少なくとも一部のデータを含むデータ群である第二のデータ群を分類することで第二の複数のクラスタを生成し、コンピュータが、第二の複数のクラスタに含まれるデータのうち、第一の複数のクラスタでは異なるクラスタに分類されていたデータを出力することを特徴とする。
In the labeling support method according to the present invention, a computer generates a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning; the computer generates a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and the computer outputs, among the data included in the second plurality of clusters, data that was classified into a different cluster in the first plurality of clusters.
本発明によるラベリング支援プログラムは、コンピュータに、ラベリング対象のデータ群である第一のデータ群を教師なし学習により分類することで第一の複数のクラスタを生成する第一分類処理、ラベリング対象のデータの少なくとも一部のデータを含むデータ群である第二のデータ群を分類することで第二の複数のクラスタを生成する第二分類処理、および、第二の複数のクラスタに含まれるデータのうち、第一の複数のクラスタでは異なるクラスタに分類されていたデータを出力する出力処理を実行させることを特徴とする。
A labeling support program according to the present invention causes a computer to execute: a first classification process of generating a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning; a second classification process of generating a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and an output process of outputting, among the data included in the second plurality of clusters, data that was classified into a different cluster in the first plurality of clusters.
本発明によれば、ラベル付けされていないデータが分類されたクラスタに対するラベリング作業を支援できる。
According to the present invention, it is possible to support labeling work for clusters in which unlabeled data are classified.
以下、本発明の実施形態を図面を参照して説明する。以下の説明では、ラベル付けされていないデータの一例として、動画(映像データ)を例示する。ただし、ラベル付けされていないデータは、動画に限られず、例えば、静止画や、音楽データ、テキストデータなどであってもよい。また、ラベル付けされていないデータ(ラベリング対象のデータ)のことを、以下、未分類データと記すこともある。
Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the following description, moving images (video data) will be exemplified as an example of unlabeled data. However, unlabeled data is not limited to moving images, and may be still images, music data, text data, and the like. Further, unlabeled data (data to be labeled) may be hereinafter referred to as unclassified data.
図1は、本発明によるラベリング支援システムの一実施形態の構成例を示すブロック図である。本実施形態のラベリング支援システム1は、データ取得部10と、関連情報取得部20と、物体識別部30と、データ加工部40と、テキスト情報入力部50と、特徴抽出部60と、特徴記憶部70と、可視化処理部80と、入出力装置90と、データ精緻化部100とを備えている。
FIG. 1 is a block diagram showing a configuration example of one embodiment of a labeling support system according to the present invention. The labeling support system 1 of this embodiment includes a data acquisition unit 10, a related information acquisition unit 20, an object identification unit 30, a data processing unit 40, a text information input unit 50, a feature extraction unit 60, a feature storage unit 70, a visualization processing unit 80, an input/output device 90, and a data refinement unit 100.
データ取得部10は、ラベリング対象のデータ(すなわち、未分類データ)を取得する。例えば、カメラ(図示せず)によって走行する車両が撮像されている場合、データ取得部10は、ラベリング対象のデータとして、そのカメラが撮影した車両の動画を取得してもよい。なお、データ取得部10が取得するデータは、リアルタイムで取得されるデータに限られない。データ取得部10は、例えば、ラベリング対象のデータが記憶されたストレージサーバ(図示せず)から、ラベリング対象のデータを取得してもよい。
The data acquisition unit 10 acquires data to be labeled (that is, unclassified data). For example, when a camera (not shown) captures an image of a traveling vehicle, the data acquisition unit 10 may acquire a moving image of the vehicle captured by the camera as data to be labeled. The data acquired by the data acquisition unit 10 is not limited to data acquired in real time. The data acquisition unit 10 may acquire the data to be labeled, for example, from a storage server (not shown) in which the data to be labeled is stored.
関連情報取得部20は、ラベリング対象のデータに関連する情報(以下、関連情報と記す。)を取得する。本実施形態では、関連情報は、ラベリング対象のデータの生成された状況を示す情報であり、例えば、データが生成された場所(撮像された場所)や時間を表わす情報、センサにより取得されたデータ(以下、センサデータと記す。)である。
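The related information acquisition unit 20 acquires information related to the data to be labeled (hereinafter referred to as related information). In this embodiment, the related information is information indicating the situation in which the data to be labeled was generated, for example, information representing the place where the data was generated (the place where it was captured) or the time, or data acquired by a sensor (hereinafter referred to as sensor data).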
例えば、ラベリング対象のデータが、車載カメラ(ドライブレコーダ)で撮像された映像データである場合、関連情報として車両位置を表わすGPS(Global Positioning System )情報や、CAN(Controller Area Network )に基づいて取得される情報などが挙げられる。この場合に取得されるセンサデータの例が、速度や加速度、位置(緯度、経度、高度など)である。
For example, if the data to be labeled is video data captured by an in-vehicle camera (drive recorder), the related information includes GPS (Global Positioning System) information representing the vehicle position and information acquired based on CAN (Controller Area Network). Examples of sensor data acquired in this case are speed, acceleration, and position (latitude, longitude, altitude, etc.).
また、ラベリング対象のデータとして火力発電所の稼働状況を示す映像が用いられる場合、センサデータとして、例えば、燃料の流量、圧力、温度、回転数、発電量などが挙げられる。他にも、ラベリング対象のデータとして農場の状況を示す映像が用いられる場合、センサデータとして、時間や温度、湿度、pH、土壌水分量、日射量、風向・風速、水位などが挙げられる。
Also, when a video showing the operating status of a thermal power plant is used as the data to be labeled, sensor data includes, for example, fuel flow rate, pressure, temperature, rotation speed, and power generation amount. In addition, when images showing farm conditions are used as data to be labeled, sensor data includes time, temperature, humidity, pH, soil water content, solar radiation, wind direction/speed, water level, and the like.
物体識別部30は、取得されたデータに含まれる物体を識別し、識別した物体を特定する情報(以下、オブジェクトリストと記す。)を生成する。例えば、識別対象の物体が車両の場合、物体識別部30は、データ取得部10が取得したデータから、車両を識別し、その車両を特定する情報(例えば、画像中の位置を示す座標等)をオブジェクトリストとして生成してもよい。なお、画像や映像から物体を識別する方法は広く知られており、ここでは詳細な説明は省略する。
The object identification unit 30 identifies objects included in the acquired data and generates information specifying the identified objects (hereinafter referred to as an object list). For example, when the object to be identified is a vehicle, the object identification unit 30 may identify the vehicle from the data acquired by the data acquisition unit 10 and generate information specifying that vehicle (for example, coordinates indicating its position in the image) as the object list. Methods for identifying objects from images and videos are widely known, and detailed description thereof is omitted here.
データ加工部40は、後述する特徴抽出部60が処理を行う際に用いることができる態様にデータ(より具体的には、オブジェクトリスト)を加工する。具体的には、データ加工部40は、特徴抽出やクラスタリングの精度を向上させられるようにデータを加工する。データ加工部40は、例えば、データの間引きや、欠損値の補間、外れ値の除外、不要なデータ項目の削除などを行う。また、例えば、ラベリング対象のデータが映像データの場合、データ加工部40は、映像データを数値時系列データへ変換してもよい。
The data processing unit 40 processes the data (more specifically, the object list) into a form that can be used when the feature extraction unit 60, which will be described later, performs processing. Specifically, the data processing unit 40 processes the data so as to improve the accuracy of feature extraction and clustering. The data processing unit 40, for example, thins data, interpolates missing values, excludes outliers, and deletes unnecessary data items. Further, for example, when the data to be labeled is video data, the data processing unit 40 may convert the video data into numerical time-series data.
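Two of the processing steps mentioned above, thinning and interpolation of missing values, can be sketched as follows. This is a minimal sketch assuming a numeric time series held in a Python list with `None` marking missing samples; the function names and the choice of linear interpolation are assumptions, since the source does not specify the methods used.

```python
def thin(samples, step):
    """Keep every `step`-th sample of a time series (data thinning)."""
    return samples[::step]


def interpolate_missing(samples):
    """Fill None gaps linearly between the nearest known neighbours.

    Gaps at the very start or end have no neighbour on one side and are
    left as None in this sketch.
    """
    filled = list(samples)
    known = [i for i, value in enumerate(filled) if value is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            fraction = (i - left) / span
            filled[i] = filled[left] + fraction * (filled[right] - filled[left])
    return filled
```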
テキスト情報入力部50は、ラベリング対象の各データに付加する情報(以下、付加情報と記す。)を含むテキストデータの入力を受け付ける。付加情報は、関連情報以外で取得し得るラベリング対象のデータの内容を示す情報である。付加情報を示すカテゴリとして、例えば、天気や植物の種類、交通参加者などが挙げられる。天気のカテゴリ値の例として、晴れ・曇り・雨・雪などが挙げられ、植物の種類のカテゴリ値の例として、米・小麦・大麦などが挙げられ、交通参加者の例として、自動車・自転車・歩行者などが挙げられる。
The text information input unit 50 accepts input of text data including information to be added to each data to be labeled (hereinafter referred to as additional information). The additional information is information, obtainable other than from the related information, that indicates the content of the data to be labeled. Categories of additional information include, for example, weather, types of plants, and traffic participants. Examples of category values for weather include sunny, cloudy, rainy, and snowy; examples for plant type include rice, wheat, and barley; and examples for traffic participants include automobiles, bicycles, and pedestrians.
なお、テキストデータの入力は任意である。すなわち、ラベリング対象のデータに対する付加情報が入力されていなくてもよい。ただし、ラベリング対象のデータに付加情報が増えるほど、分類の精度を向上できるため、入力されることが好ましい。以下の説明では、付加情報が対応付けられたラベリング対象のデータも、単にラベリング対象のデータと記す。
The input of text data is optional. In other words, additional information for the data to be labeled may not be input. However, the more additional information is added to the data to be labeled, the more the accuracy of classification can be improved, so input is preferable. In the following description, labeling target data associated with additional information is also simply referred to as labeling target data.
図2は、本実施形態のラベリング支援システム1で利用されるデータの例を示す説明図である。図2に示す例では、データ取得部10がラベリング対象のデータとして映像11を取得し、関連情報取得部20は、映像11が撮影された場所等に関する関連情報21を取得したことを示す。また、図2に示す例では、データ加工部40が、映像11および関連情報21(より具体的には、物体識別部30により生成されたオブジェクトリスト)を加工して数値時系列データ41を生成したことを示す。さらに、図2に示す例では、テキスト情報入力部50が、付加情報として、天気、シーン、時間帯および物体に関する情報を含むテキストデータ51の入力を受け付けたことを示す。
FIG. 2 is an explanatory diagram showing an example of data used in the labeling support system 1 of this embodiment. The example shown in FIG. 2 indicates that the data acquisition unit 10 has acquired the video 11 as data to be labeled, and that the related information acquisition unit 20 has acquired related information 21 regarding, for example, the location where the video 11 was shot. The example also indicates that the data processing unit 40 has generated numerical time-series data 41 by processing the video 11 and the related information 21 (more specifically, the object list generated by the object identification unit 30). Furthermore, the example indicates that the text information input unit 50 has received input of text data 51 including information on the weather, scene, time period, and objects as additional information.
特徴抽出部60は、ラベリング対象の各データから特徴を抽出する。本実施形態の特徴抽出部60は、まず初めに、付加情報を含むラベリング対象の各データを教師なし学習により自動的に分類することで複数のクラスタを生成する。教師なし学習によりクラスタを生成する方法は任意であり、例えば、k-means法や、混合ガウスモデルなどが挙げられる。
The feature extraction unit 60 extracts features from each data to be labeled. The feature extraction unit 60 of the present embodiment firstly generates a plurality of clusters by automatically classifying each data to be labeled including additional information by unsupervised learning. Any method can be used to generate clusters by unsupervised learning, and examples thereof include the k-means method and the Gaussian mixture model.
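As an illustration of cluster generation by unsupervised learning, the following is a minimal sketch of the k-means method (Lloyd's algorithm) using only the standard library. The function name `k_means` and the decision to have the caller supply the initial centroids are assumptions made to keep the example deterministic; the source only names k-means as one possible method.

```python
from math import dist
from statistics import fmean


def k_means(points, centroids, max_iter=100):
    """Cluster points around the given initial centroids.

    points: list of equal-length numeric tuples.
    centroids: initial centroid for each cluster (caller-supplied here).
    Returns (labels, centroids) once assignments stop changing.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                updated.append(tuple(fmean(axis) for axis in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids
```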
以下、特徴抽出部60が、ラベリング対象のデータ群を教師なし学習により分類することで複数のクラスタを生成する処理を、第一分類処理と記す。また、第一分類処理により生成される複数のクラスタを、第一の複数のクラスタと記し、第一の複数のクラスタに分類されるデータ群のことを、第一のデータ群と記す。また、特徴抽出部60が、ラベリング対象のデータを教師なし学習により分類する処理を行うことから、特徴抽出部60のことを分類手段と言うこともできる。
Hereinafter, the process in which the feature extraction unit 60 classifies the data group to be labeled by unsupervised learning to generate a plurality of clusters will be referred to as the first classification process. A plurality of clusters generated by the first classification process will be referred to as a first plurality of clusters, and a data group classified into the first plurality of clusters will be referred to as a first data group. In addition, since the feature extraction unit 60 performs a process of classifying data to be labeled by unsupervised learning, the feature extraction unit 60 can also be called a classifying means.
そして、特徴抽出部60は、生成したクラスタに含まれる各データの特徴量を抽出する。特徴抽出部60は、例えば、テキストデータに含まれている付加情報を特徴量として抽出してもよい。他にも、特徴抽出部60は、数値時系列データが示す特徴量を抽出してもよい。具体的には、特徴抽出部60は、ラベリング対象のデータ(より具体的には、数値時系列データ)に含まれるセンサ値に基づいて特徴量を抽出してもよい。
Then, the feature extraction unit 60 extracts the feature amount of each data included in the generated cluster. The feature extraction unit 60 may extract, for example, additional information included in the text data as a feature amount. In addition, the feature extraction unit 60 may extract feature amounts indicated by numerical time-series data. Specifically, the feature extraction unit 60 may extract feature amounts based on sensor values included in the data to be labeled (more specifically, numerical time-series data).
なお、数値時系列データから特徴量を抽出する方法は任意である。例えば、k-means法により生成された各クラスタについて、特徴抽出部60は、クラスタに含まれる数値時系列データの重心点から各データまでの距離(cluster distance feature)という特徴量を抽出してもよい。
Any method can be used to extract feature amounts from numerical time-series data. For example, for each cluster generated by the k-means method, the feature extraction unit 60 may extract, as a feature amount, the distance from the center of gravity of the numerical time-series data included in the cluster to each data item (cluster distance feature).
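The cluster distance feature can be sketched as follows, assuming each data item in a cluster is a numeric tuple of equal length and that Euclidean distance is used; the source does not name the distance metric, and the function name is hypothetical.

```python
from math import dist
from statistics import fmean


def cluster_distance_feature(members):
    """Return the distance from the cluster centroid to each member.

    members: non-empty list of equal-length numeric tuples in one cluster.
    Euclidean distance is an assumption made for this sketch.
    """
    dimensions = range(len(members[0]))
    centroid = tuple(fmean(m[d] for m in members) for d in dimensions)
    return [dist(centroid, m) for m in members]
```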
また、本実施形態では、データ取得部10と関連情報取得部20により取得された情報から物体識別部30が物体を識別し、識別結果に対してデータ加工部40が、特徴抽出部60が用いる形式にデータを加工する場合について説明した。ただし、データ取得部10が、直接、特徴抽出部60が用いる形式のデータを取得し、取得したデータを特徴抽出部60に入力してもよい。この場合、ラベリング支援システム1は、関連情報取得部20、物体識別部30およびデータ加工部40を備えていなくてもよい。
Further, in this embodiment, a case has been described in which the object identification unit 30 identifies objects from the information acquired by the data acquisition unit 10 and the related information acquisition unit 20, and the data processing unit 40 processes the identification result into the format used by the feature extraction unit 60. However, the data acquisition unit 10 may directly acquire data in the format used by the feature extraction unit 60 and input the acquired data to the feature extraction unit 60. In this case, the labeling support system 1 does not have to include the related information acquisition unit 20, the object identification unit 30, and the data processing unit 40.
特徴記憶部70は、特徴抽出部60が抽出した各データの特徴量を記憶する。また、特徴記憶部70は、後述するデータ精緻化部100によって付加されたラベルの情報を併せて記憶してもよい。なお、特徴記憶部70がデータごとの特徴量を記憶する態様は任意である。
The feature storage unit 70 stores feature amounts of each data extracted by the feature extraction unit 60 . The feature storage unit 70 may also store information on labels added by the data refinement unit 100, which will be described later. Note that the mode in which the feature storage unit 70 stores the feature amount for each data is arbitrary.
図3は、特徴記憶部70が記憶する特徴量の例を示す説明図である。図3に示す例では、縦方向が1つの特徴点を表わし、横方向が各カテゴリ(例えば、天気、交通参加者、植物の種類など)の特徴量(カテゴリ値)を表わしている。特徴記憶部70は、例えば、磁気ディスク等により実現される。
FIG. 3 is an explanatory diagram showing an example of feature amounts stored in the feature storage unit 70. In the example shown in FIG. 3, the vertical direction represents one feature point, and the horizontal direction represents the feature amount (category value) of each category (for example, weather, traffic participants, types of plants, etc.). The feature storage unit 70 is implemented by, for example, a magnetic disk.
可視化処理部80は、生成されたクラスタに対するラベリング作業に寄与する情報を可視化するための処理を行う。本実施形態の可視化処理部80は、ラベリング対象のデータをクラスタ化した様子を人間が観察できるように、ラベリング対象のデータを次元削減(低次元化)したものを、入出力装置90にグラフ描画することで可視化する。
The visualization processing unit 80 performs processing for visualizing information that contributes to the labeling work for the generated clusters. The visualization processing unit 80 of this embodiment visualizes the data to be labeled by reducing its dimensionality and drawing the result as a graph on the input/output device 90, so that a person can observe how the data to be labeled has been clustered.
可視化処理部80は、例えば、UMAP(Uniform Manifold Approximation and Projection )などにより、2次元または3次元にラベリング対象のデータを次元削減し、次元削減されたデータを、分布図などのグラフとして可視化してもよい。その際、可視化処理部80は、同一のクラスタに分類されたデータを、他のクラスタと異なる態様(例えば、色を変える、記号を変える、など)で表示してもよい。
The visualization processing unit 80 may, for example, reduce the data to be labeled to two or three dimensions using UMAP (Uniform Manifold Approximation and Projection) or the like, and visualize the dimension-reduced data as a graph such as a distribution map. At that time, the visualization processing unit 80 may display the data classified into the same cluster in a manner different from that of other clusters (for example, by changing the color or the symbol).
図4は、次元削減されたデータをグラフで可視化した例を示す説明図である。図4に例示するグラフは、UMAPにより2次元に次元削減したデータを、属するクラスタごとに態様(斜線、黒塗り等)を変えて表示した例を示す。
FIG. 4 is an explanatory diagram showing an example of visualizing the dimension-reduced data in a graph. The graph illustrated in FIG. 4 shows an example in which the data reduced to two dimensions by UMAP are displayed in different manners (hatching, blacking, etc.) for each cluster to which they belong.
図5は、次元削減されたデータをグラフで可視化した他の例を示す説明図である。図5に例示するグラフは、映像データの種類ごとにプロットされる記号を変化させて表示したグラフである。また、図5に例示するように、可視化処理部80は、クラスタに含まれるデータの範囲を特定できるように、その範囲を点線で囲む表示をしてもよい。
FIG. 5 is an explanatory diagram showing another example of visualizing the dimension-reduced data in a graph. The graph illustrated in FIG. 5 is a graph displayed by changing symbols plotted for each type of video data. Further, as illustrated in FIG. 5, the visualization processing unit 80 may display the range surrounded by a dotted line so that the range of data included in the cluster can be identified.
さらに、グラフ描画の際、可視化処理部80は、全てのデータを表示してもよいし、特定の条件を満たすデータのみ表示する又は表示しないと決定してもよい。可視化処理部80は、例えば、特定の条件を満たすクラスタ(例えば、データ数が所定数よりも多いクラスタ、など)や、未分類のデータ(すなわち、ラベリングされていないデータ)を対象に、表示するか表示しないか判断してもよい。
Furthermore, when drawing the graph, the visualization processing unit 80 may display all data, or may decide to display or hide only data that satisfies a specific condition. For example, the visualization processing unit 80 may decide whether or not to display clusters that satisfy a specific condition (for example, clusters containing more data than a predetermined number) and unclassified data (that is, unlabeled data).
Furthermore, the visualization processing unit 80 of the present embodiment outputs data that has come to belong to a different cluster as a result of the re-learning process described later. The method of outputting the data is also described later.
The input/output device 90 displays the output of the visualization processing unit 80. The input/output device 90 also accepts input from the user on the displayed result and executes processing according to that input. In the present embodiment, the processing of the data refinement unit 100, described later, is performed based on the cluster that the user specifies in response to the output of the input/output device 90.
The input/output device 90 may be realized by a tablet terminal or the like. Alternatively, the input/output device 90 may be realized by a device having a display device and a pointing device, or the like.
The data refinement unit 100 executes processing on the data group to be labeled based on the clusters generated by the feature extraction unit 60. Specifically, the data refinement unit 100 generates a second data group from the data group to be labeled according to the generated first plurality of clusters. In the present embodiment, a case in which the data refinement unit 100 executes the following three types of processing is described.
First, the first process is described. The first process labels the data within clusters. In the first process, the data refinement unit 100 generates a second data group in which, among the data group to be labeled, the data classified into any of the first plurality of clusters has been labeled cluster by cluster. The data refinement unit 100 may label any cluster: it may label all clusters, or only the clusters specified by the user via the input/output device 90.
The content of a label is arbitrary as long as the same label is added to the data within a cluster. The data refinement unit 100 may add an arbitrary temporary label to the data in the target cluster, or may add a label whose content is specified by the user. The data refinement unit 100 may then store the data (more specifically, the feature amount of the data) in the feature storage unit 70 in association with the added label.
FIG. 6 is an explanatory diagram showing an example of the process of labeling the data within clusters. In the example shown in FIG. 6, the data refinement unit 100 has added the temporary labels "A", "B", and "C" to the respective clusters illustrated in FIG. 5. When the user designates which of the clusters illustrated in FIG. 5 are to be labeled, the data refinement unit 100 need only add temporary labels to the designated clusters.
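As a rough illustration of the first process, the sketch below gives every data item in a cluster the same temporary label and, optionally, restricts labeling to user-designated clusters. The function name, the `target_clusters` parameter, and the choice of uppercase letters as temporary labels are assumptions made for this example; the description only requires that data in one cluster share one label.

```python
import string


def assign_temporary_labels(cluster_ids, target_clusters=None):
    """Map each data index to a temporary label shared by its whole cluster.

    Items whose cluster is None (unclassified) or not targeted stay unlabeled.
    This sketch supports at most 26 clusters because it uses the letters A-Z.
    """
    clusters = sorted({c for c in cluster_ids if c is not None})
    if target_clusters is not None:
        clusters = [c for c in clusters if c in target_clusters]
    label_of = dict(zip(clusters, string.ascii_uppercase))
    return {i: label_of[c] for i, c in enumerate(cluster_ids) if c in label_of}


cluster_ids = [0, 1, 2, 0, None, 1]
print(assign_temporary_labels(cluster_ids))
print(assign_temporary_labels(cluster_ids, target_clusters={1}))
```

The returned mapping plays the role of the second data group of the first process: the labeled items can then be passed to a supervised learner of the reader's choice.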
After that, the feature extraction unit 60 generates a plurality of clusters again by learning with the labeled data (supervised learning). The feature extraction unit 60 may also include unlabeled data and perform learning (unsupervised learning). Hereinafter, the process in which the feature extraction unit 60 generates a plurality of clusters by classifying a data group that includes at least part of the data to be labeled is referred to as the second classification process. The plurality of clusters generated by the second classification process is referred to as the second plurality of clusters, and the data group classified into the second plurality of clusters is referred to as the second data group.
In this way, the second classification process uses at least part of the data to be labeled that was used in the first classification process to generate a plurality of clusters again and refine them. The second classification process can therefore be called a re-learning process or refinement. This makes it possible to semi-automate labeling through unsupervised learning, and can also contribute to the discovery of new labels.
The feature extraction unit 60 may extract the feature amount of each data item included in the clusters generated by the second classification process (the second plurality of clusters) and store the extracted feature amounts in the feature storage unit 70.
After the second classification process, the visualization processing unit 80 outputs, from among the data included in the second plurality of clusters, the data that had been classified into a different cluster in the first plurality of clusters. This corresponds to visualizing the data that has come to belong to a different cluster as a result of re-learning. The specific visualization processing is described later.
Next, the second process is described. The second process selects at least some of the clusters and performs learning again (unsupervised learning). The data refinement unit 100 generates, as the second data group, the data group that is classified into the clusters selected from the first plurality of clusters, among the data group to be labeled.
First, the data refinement unit 100 selects at least some clusters from the first plurality of clusters. The data refinement unit 100 may select clusters specified by the user via the input/output device 90, or may automatically select clusters that satisfy a condition. The condition is arbitrary; examples include a cluster whose number of data items is equal to or greater than a predetermined number, and a cluster whose proportion of classified data is greater than a predetermined threshold. The data group within the clusters selected here corresponds to the second data group described above.
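The automatic selection mentioned above might look like the following sketch. The names `select_clusters` and `build_second_data_group` and the parameters `min_count` and `min_ratio` are hypothetical. The description does not define exactly how the proportion is computed, so this example assumes it is the cluster's share of the whole data group to be labeled.

```python
from collections import Counter


def select_clusters(cluster_ids, min_count=None, min_ratio=None):
    """Return the cluster IDs that satisfy every condition that was given."""
    sizes = Counter(c for c in cluster_ids if c is not None)
    total = len(cluster_ids)
    selected = []
    for cluster, size in sorted(sizes.items()):
        if min_count is not None and size < min_count:
            continue  # fewer items than the predetermined number
        if min_ratio is not None and size / total <= min_ratio:
            continue  # share of the data not greater than the threshold
        selected.append(cluster)
    return selected


def build_second_data_group(cluster_ids, selected):
    """Return the indices of the data items in the selected clusters."""
    return [i for i, c in enumerate(cluster_ids) if c in selected]


cluster_ids = [0, 0, 0, 0, 1, 1, 2, None, 0, 1]
chosen = select_clusters(cluster_ids, min_count=3)
print(chosen, build_second_data_group(cluster_ids, chosen))
```

Only the returned indices would be passed to the next round of unsupervised learning; a user-specified selection can reuse `build_second_data_group` directly.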
FIG. 7 is an explanatory diagram showing an example of the process of selecting some of the clusters. In the example shown in FIG. 7, two of the three generated clusters have been selected. In the second process as well, the data refinement unit 100 may attach arbitrary cluster identification information to the data in each cluster so that the clusters produced by the first classification process can be identified.
After that, the feature extraction unit 60 generates a plurality of clusters again (that is, performs the re-learning process) by learning on the data in the selected clusters (unsupervised learning). This process corresponds to the second classification process described above, and the generated clusters correspond to the second plurality of clusters. The feature extraction unit 60 may also add new data separately and perform the learning. Because this allows the data within a cluster to be examined in greater depth, a more detailed classification of the data can be expected.
After the second classification process, the visualization processing unit 80 outputs, as in the first process, the data that had been classified into a different cluster in the first plurality of clusters from among the data included in the second plurality of clusters. Because a selected cluster may be subdivided, the visualization processing unit 80 may treat, among the data in a cluster, the data whose cluster identification information is in the minority (that is, anything other than the identification information with the largest share) as data that had been classified into a different cluster in the first plurality of clusters, and output it.
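The minority rule described above can be sketched as follows. The function name and the string cluster IDs are illustrative assumptions. Tie handling is not specified in the description; this sketch keeps whichever identification information `Counter.most_common` returns first as the majority.

```python
from collections import Counter, defaultdict


def find_minority_members(first_ids, second_ids):
    """Return indices whose first-round cluster ID is in the minority
    within the second-round cluster they now belong to."""
    members = defaultdict(list)
    for index, second in enumerate(second_ids):
        members[second].append(index)

    moved = []
    for indices in members.values():
        counts = Counter(first_ids[i] for i in indices)
        majority, _ = counts.most_common(1)[0]  # ID with the largest share
        moved.extend(i for i in indices if first_ids[i] != majority)
    return sorted(moved)


# Cluster identification information from the first and second classification.
first_ids = ["c1", "c1", "c1", "c2", "c2", "c2", "c2"]
second_ids = ["n1", "n1", "n2", "n1", "n2", "n2", "n2"]
print(find_minority_members(first_ids, second_ids))
```

In this toy input, item 3 is the only former member of "c2" inside "n1", and item 2 is the only former member of "c1" inside "n2", so both are reported.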
Next, the third process is described. The third process excludes at least part of the data that was not classified into any cluster, such as outliers, and performs learning again (unsupervised or supervised learning). The data refinement unit 100 generates, as the second data group, the data group obtained by excluding, from the data group to be labeled, one or more data items that were not classified into any of the first plurality of clusters.
FIG. 8 is an explanatory diagram showing an example of the process of excluding part of the data. In the example shown in FIG. 8, the data within the range enclosed by the solid circle is excluded as outliers. When the data to be labeled is video data, for example, this corresponds to excluding noise scenes. Thereafter, the first process, the second process, or both are performed. This is expected to improve classification accuracy.
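A minimal sketch of the third process is shown below. It assumes that unclassified data is marked with a cluster ID of `None`, and it adds a hypothetical `keep_indices` parameter so that only part of the unclassified data is excluded; neither detail comes from the original description.

```python
def exclude_unclassified(items, cluster_ids, keep_indices=()):
    """Split items into the second data group and the excluded outliers."""
    second_group, excluded = [], []
    for index, (item, cluster) in enumerate(zip(items, cluster_ids)):
        if cluster is None and index not in keep_indices:
            excluded.append(item)      # not in any cluster: treat as outlier
        else:
            second_group.append(item)  # classified, or explicitly kept
    return second_group, excluded


scenes = ["s0", "s1", "s2", "s3", "s4"]
cluster_ids = [0, None, 1, None, 0]
print(exclude_unclassified(scenes, cluster_ids))
print(exclude_unclassified(scenes, cluster_ids, keep_indices={3}))
```

The first returned list would be used for the next round of learning, while the second list records what was dropped (for video data, the noise scenes).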
The three types of processing performed by the data refinement unit 100 have been described above. However, the processing executed by the data refinement unit 100 is not limited to these three types. The data refinement unit 100 may also perform data maintenance processing. In addition, after each of the first, second, and third processes, the same process may be performed again, or a different process may be performed.
One example of data maintenance processing is maintaining the data that the feature extraction unit 60 uses for learning. The data refinement unit 100 may output a file containing the data group to which labels have been added or the data group from which outliers have been excluded.
For example, suppose that the data group to be labeled has been labeled in the first process described above. In this case, the data refinement unit 100 may create a label file that records the specified labels, copy only the labeled data to a folder for the next round of learning, sort (move or copy) the original data into a folder for each label based on the labels, and so on.
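The maintenance steps for labeled data could be sketched as below. The file name `labels.csv`, the CSV layout, the folder names, and the function name `export_labeled_data` are all assumptions made for illustration; the description does not specify any file format. The sketch combines two of the listed steps by copying only labeled files into per-label subfolders of the next-training folder.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def export_labeled_data(labels, source_dir, output_dir):
    """Write a label file and copy each labeled file into a per-label folder.

    labels: dict mapping a file name in source_dir to its label.
    """
    source_dir, output_dir = Path(source_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    label_file = output_dir / "labels.csv"
    with label_file.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "label"])
        for name, label in sorted(labels.items()):
            writer.writerow([name, label])
            label_dir = output_dir / label
            label_dir.mkdir(exist_ok=True)
            shutil.copy2(source_dir / name, label_dir / name)
    return label_file


# Demonstration in a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "raw"
    src.mkdir()
    for name in ("a.mp4", "b.mp4", "c.mp4"):
        (src / name).write_bytes(b"")
    out = Path(tmp) / "next_training"
    export_labeled_data({"a.mp4": "A", "b.mp4": "B"}, src, out)
    copied = sorted(p.relative_to(out).as_posix() for p in out.rglob("*") if p.is_file())
print(copied)
```

The unlabeled file `c.mp4` is left out of the next-training folder, which mirrors copying only the labeled data.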
As another example, suppose that clusters have been selected in the second process described above. In this case, the data refinement unit 100 may create a data list file that lists only the data belonging to the selected clusters, copy only the data belonging to the selected clusters to the folder for the next round of learning, and so on.
As yet another example, suppose that outliers have been excluded in the third process described above. In this case, the data refinement unit 100 may create a data list file that lists only the data other than the specified data (outliers), copy the data other than the specified data (outliers) to the folder for the next round of learning, and so on.
The following describes specifically how the visualization processing unit 80 visualizes the data that has come to belong to a different cluster as a result of re-learning. First, the visualization processing unit 80 reduces the dimensions of the data group to be labeled, and draws a graph of the dimension-reduced data included in the first plurality of clusters and the dimension-reduced data included in the second plurality of clusters in a manner that allows each cluster to be identified. Then, among the dimension-reduced data included in the second plurality of clusters, the visualization processing unit 80 displays the data that had been classified into a different cluster in the first plurality of clusters in a manner different from the other data.
Examples of a different manner include changing the shade of the color, changing the color itself, changing the outline, and blinking the display.
FIG. 9 is an explanatory diagram showing an example in which the results before and after refinement are displayed as an overlay. In the example shown in FIG. 9, the visualization processing unit 80 superimposes the data distributions of the respective refinements, and displays the data of layers other than the layer of interest (that is, the refinement of interest) in a manner different from the data of the layer of interest. Specifically, in the example shown in FIG. 9, the result of the first refinement and the result of the second refinement are superimposed. When the result of the first refinement is of interest, the data d1, which is included in the target cluster only in the second refinement, is shown in a manner different from the other data. Similarly, when the result of the second refinement is of interest, the data d2, which is included in the target cluster only in the first refinement, is shown in a manner different from the other data.
FIGS. 10 and 11 are explanatory diagrams showing examples in which the results before and after refinement are displayed in side-by-side windows. As illustrated in FIG. 10, the visualization processing unit 80 may display the results before and after refinement in separate windows. In that case, as illustrated in FIG. 11, the visualization processing unit 80 may display the data that changed between before and after refinement in a manner different from the other data.
Furthermore, the visualization processing unit 80 may display a list of the data whose result differs before and after refinement (that is, the data classified into a different cluster). FIG. 12 is an explanatory diagram showing an example in which the data d3, whose result differs before and after refinement, is listed in a separate window. In the example shown in FIG. 12, the result is presented as a list of the coordinates at which the data whose result differs before and after refinement is displayed.
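A list view of this kind can be approximated with the sketch below. It assumes that cluster IDs are directly comparable before and after refinement (for example, because they are the temporary labels of the first process) and that each data item already has two-dimensional plot coordinates; the function name and the row layout are hypothetical.

```python
def list_changed_points(coords, before_ids, after_ids):
    """Return (index, (x, y), before, after) for every item whose cluster changed."""
    return [
        (index, xy, before, after)
        for index, (xy, before, after) in enumerate(zip(coords, before_ids, after_ids))
        if before != after
    ]


coords = [(0.1, 0.2), (0.4, 0.9), (1.5, 1.1), (2.0, 0.3)]
before_ids = ["A", "A", "B", "B"]
after_ids = ["A", "B", "B", "A"]

for index, (x, y), before, after in list_changed_points(coords, before_ids, after_ids):
    print(f"data {index}: ({x:.2f}, {y:.2f})  {before} -> {after}")
```

The same rows could drive the highlighted display in the overlay or side-by-side views, since they identify exactly which plotted points to draw differently.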
FIGS. 9 to 12 illustrate the case of comparing two refinement results. However, the comparison is not limited to two results and may cover three or more. FIG. 13 is an explanatory diagram showing an example in which the results of multiple refinements are displayed as an overlay. FIG. 14 is an explanatory diagram showing an example in which the data whose result differs across multiple refinements is listed in a separate window. The example shown in FIG. 13 corresponds to the example shown in FIG. 9 for the case in which there are four refinement results. Likewise, the example shown in FIG. 14 corresponds to the example shown in FIG. 12 for the case in which there are four refinement results.
In addition, the visualization processing unit 80 may display cluster statistics for each classification process of the data group (that is, for each refinement), either separately from or together with the graphs described above. The statistics may be created by the visualization processing unit 80 or by the feature extraction unit 60.
FIG. 15 is an explanatory diagram showing an example in which statistics for each cluster are displayed. In the example shown in FIG. 15, the number of data items in the cluster, the centroid of the data, and the variance (in the x and y directions) are displayed as the cluster statistics. As illustrated in FIG. 15, the visualization processing unit 80 may switch between the statistics of the individual refinements, or may display them side by side.
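The statistics named above (count, centroid, and variance in the x and y directions) can be computed as in the following sketch. The function name and the dictionary layout are assumptions, and the use of population variance rather than sample variance is a choice made for this example because the description does not say which is intended.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def cluster_statistics(points, cluster_ids):
    """Return count, centroid, and per-axis population variance per cluster."""
    grouped = defaultdict(list)
    for (x, y), cluster in zip(points, cluster_ids):
        if cluster is not None:  # skip unclassified data
            grouped[cluster].append((x, y))

    stats = {}
    for cluster, members in grouped.items():
        xs = [x for x, _ in members]
        ys = [y for _, y in members]
        stats[cluster] = {
            "count": len(members),
            "centroid": (fmean(xs), fmean(ys)),
            "variance": (pvariance(xs), pvariance(ys)),
        }
    return stats


points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (5.0, 5.0)]
cluster_ids = [0, 0, 0, 0, 1]
stats = cluster_statistics(points, cluster_ids)
print(stats[0])
```

Calling the function once per refinement yields one table per round, which can then be switched or shown side by side.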
FIG. 16 is an explanatory diagram showing another example in which statistics for each cluster are displayed. As illustrated in FIG. 16, the visualization processing unit 80 may display cluster statistics (for example, the false detection rate) as a graph and as a table. The example shown in FIG. 16 represents the degree of agreement between the labels and the clusters to which the data was assigned when supervised learning was performed. In the example shown in FIG. 16, the first round is assumed to be unsupervised learning, so no evaluation result exists for it.
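One simple way to express the degree of agreement is the fraction of labeled items whose assigned cluster matches their label, as sketched below. The function name is hypothetical, and treating the false detection rate as one minus this agreement is an assumption of the example rather than a definition given in the description.

```python
def label_agreement(given_labels, assigned_labels):
    """Return the fraction of labeled items assigned to the matching cluster.

    Returns None when no item carries a label, for example after a first,
    purely unsupervised round for which no evaluation result exists.
    """
    pairs = [(g, a) for g, a in zip(given_labels, assigned_labels) if g is not None]
    if not pairs:
        return None
    matches = sum(1 for g, a in pairs if g == a)
    return matches / len(pairs)


given = ["A", "A", "B", None]
assigned = ["A", "B", "B", "B"]
agreement = label_agreement(given, assigned)
print(f"agreement: {agreement:.3f}, assumed false detection rate: {1 - agreement:.3f}")
```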
The data acquisition unit 10, the related information acquisition unit 20, the object identification unit 30, the data processing unit 40, the text information input unit 50, the feature extraction unit 60, the visualization processing unit 80, and the data refinement unit 100 are realized by a processor of a computer (for example, a CPU (Central Processing Unit)) that operates according to a program (the labeling support program).
For example, the program may be stored in a storage unit (not shown) of the labeling support system 1, and the processor may read the program and operate, according to the program, as the data acquisition unit 10, the related information acquisition unit 20, the object identification unit 30, the data processing unit 40, the text information input unit 50, the feature extraction unit 60, the visualization processing unit 80, and the data refinement unit 100. The functions of the labeling support system 1 may also be provided in SaaS (Software as a Service) form.
Each of the data acquisition unit 10, the related information acquisition unit 20, the object identification unit 30, the data processing unit 40, the text information input unit 50, the feature extraction unit 60, the visualization processing unit 80, and the data refinement unit 100 may instead be realized by dedicated hardware. Part or all of the components of each device may be realized by general-purpose or dedicated circuitry, a processor, or the like, or a combination thereof. These may be configured as a single chip or as a plurality of chips connected via a bus. Part or all of the components of each device may be realized by a combination of the above-described circuitry or the like and a program.
When part or all of the components of the labeling support system 1 are realized by a plurality of information processing devices, circuits, or the like, these may be arranged centrally or in a distributed manner. For example, the information processing devices, circuits, and the like may be realized in a form in which they are connected to one another via a communication network, such as a client-server system or a cloud computing system.
Next, the operation of the labeling support system 1 of the present embodiment is described. FIG. 17 is a flowchart showing an operation example of the labeling support system 1. The operation example illustrated in FIG. 17 is for the case in which the data acquisition unit 10 directly acquires data in the format used by the feature extraction unit 60 and inputs the acquired data to the feature extraction unit 60.
The feature extraction unit 60 generates the first plurality of clusters from the data group to be labeled (the first data group) (step S11). After that, the feature extraction unit 60 generates the second plurality of clusters from a data group that includes at least part of the data to be labeled (the second data group) (step S12). Then, the visualization processing unit 80 outputs, from among the data included in the second plurality of clusters, the data that had been classified into a different cluster in the first plurality of clusters (step S13).
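The flow of steps S11 to S13 can be sketched with pluggable functions, as below. Everything here is a toy stand-in: the threshold-based classifiers, the function names, and the assumption that cluster IDs are comparable across the two rounds are illustrative only, and a real system would plug in learned models instead.

```python
def run_labeling_support(first_group, first_classify, refine, second_classify):
    """Return the items whose cluster differs between the two classifications."""
    first_clusters = first_classify(first_group)        # step S11
    second_group = refine(first_group, first_clusters)  # data refinement
    second_clusters = second_classify(second_group)     # step S12
    return [item for item in second_group
            if first_clusters[item] != second_clusters[item]]  # step S13


def first_classify(group):
    # Toy first classification: values of 20 or more stay unclassified (None).
    return {x: ("A" if x < 5 else "B" if x < 20 else None) for x in group}


def refine(group, clusters):
    # Toy refinement in the spirit of the third process: drop unclassified data.
    return [x for x in group if clusters[x] is not None]


def second_classify(group):
    # Toy second classification with a slightly different boundary.
    return {x: ("A" if x <= 10 else "B") for x in group}


changed = run_labeling_support([1, 2, 3, 10, 11, 12, 50],
                               first_classify, refine, second_classify)
print(changed)
```

Here the value 50 is dropped as unclassified before the second round, and the value 10 is the only item reported because its cluster changes from "B" to "A".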
As described above, in the present embodiment, the feature extraction unit 60 generates the first plurality of clusters by classifying the first data group by unsupervised learning. The feature extraction unit 60 also generates the second plurality of clusters by classifying the second data group. Then, the visualization processing unit 80 outputs, from among the data included in the second plurality of clusters, the data that had been classified into a different cluster in the first plurality of clusters. This makes it possible to support the work of labeling the clusters into which unlabeled data has been classified.
In addition, in the present embodiment, the data refinement unit 100 generates the second data group from the data group to be labeled according to the generated first plurality of clusters. This makes it possible to improve the accuracy of re-learning that uses the generated second data group.
Next, an overview of the present invention is described. FIG. 18 is a block diagram showing an overview of the labeling support system according to the present invention. The labeling support system 180 according to the present invention (for example, the labeling support system 1) includes: first classification means 181 (for example, the feature extraction unit 60) that generates a first plurality of clusters by classifying, by unsupervised learning, a first data group that is the data group to be labeled; second classification means 182 (for example, the feature extraction unit 60) that generates a second plurality of clusters by classifying (that is, re-learning) a second data group that is a data group including at least part of the data to be labeled; and output means 183 (for example, the visualization processing unit 80) that outputs, from among the data included in the second plurality of clusters, the data that had been classified into a different cluster in the first plurality of clusters.
Such a configuration makes it possible to support the work of labeling the clusters into which unlabeled data has been classified.
The labeling support system 180 may also include data refinement means (for example, the data refinement unit 100) that generates the second data group from the data group to be labeled according to the generated first plurality of clusters.
Specifically, the data refinement means may generate a second data group in which, among the data group to be labeled, the data classified into any of the first plurality of clusters has been labeled cluster by cluster (for example, the first process by the data refinement unit 100 described above).
The data refinement means may also generate, as the second data group, the data group that is classified into the clusters selected from the first plurality of clusters, among the data group to be labeled (for example, the second process by the data refinement unit 100 described above).
The data refinement means may also generate, as the second data group, the data group obtained by excluding, from the data group to be labeled, one or more data items that were not classified into any of the first plurality of clusters (for example, the third process by the data refinement unit 100 described above).
The output means may also reduce the dimensions of the data group to be labeled, draw a graph of the dimension-reduced data included in the first plurality of clusters and the dimension-reduced data included in the second plurality of clusters in a manner that allows each cluster to be identified, and display, among the dimension-reduced data included in the second plurality of clusters, the data that had been classified into a different cluster in the first plurality of clusters in a manner different from the other data.
The output means may also display cluster statistics for each classification process of the data group.
FIG. 19 is a schematic block diagram showing the configuration of a computer according to at least one embodiment. The computer 1000 includes a processor 1001, a main storage device 1002, an auxiliary storage device 1003, and an interface 1004.
The labeling support system 180 described above is implemented in the computer 1000. The operation of each processing unit described above is stored in the auxiliary storage device 1003 in the form of a program (the labeling support program). The processor 1001 reads the program from the auxiliary storage device 1003, loads it into the main storage device 1002, and executes the above processing according to the program.
In at least one embodiment, the auxiliary storage device 1003 is an example of a non-transitory tangible medium. Other examples of non-transitory tangible media include a magnetic disk, a magneto-optical disk, a CD-ROM (Compact Disc Read-Only Memory), a DVD-ROM (Read-Only Memory), and a semiconductor memory connected via the interface 1004. When the program is delivered to the computer 1000 over a communication line, the computer 1000 that receives it may load the program into the main storage device 1002 and execute the above processing.
The program may also be one that realizes only part of the functions described above. Further, the program may be a so-called difference file (difference program) that realizes the above-described functions in combination with another program already stored in the auxiliary storage device 1003.
Part or all of the above embodiments may also be described as in the following supplementary notes, but are not limited to them.
(Supplementary Note 1) A labeling support system comprising:
first classification means for generating a first plurality of clusters by classifying, by unsupervised learning, a first data group that is a data group to be labeled;
second classification means for generating a second plurality of clusters by classifying a second data group that is a data group including at least part of the data to be labeled; and
output means for outputting, among the data included in the second plurality of clusters, data that were classified into different clusters in the first plurality of clusters.
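As an illustration of the output step described in the note above, the following Python sketch compares two sets of cluster assignments and reports data whose first-pass cluster differs from that of the other members of their second-pass cluster. It is only one possible reading: the function name `find_reassigned`, the dictionary representation of cluster assignments, and the "dominant first-pass cluster" rule are assumptions made for this example and are not taken from the specification. The clustering itself is not shown; the labels are assumed to have been produced already.

```python
from collections import Counter, defaultdict


def find_reassigned(first_labels, second_labels):
    """Return {second_cluster: [data ids]} for data whose first-pass cluster
    differs from the dominant first-pass cluster of their second-pass cluster.

    Both arguments map a data id to a cluster id (hypothetical representation).
    Every id in second_labels is assumed to appear in first_labels as well.
    """
    # Group data ids by the cluster they received in the second pass.
    members = defaultdict(list)
    for data_id, second_cluster in second_labels.items():
        members[second_cluster].append(data_id)

    flagged = {}
    for second_cluster, ids in members.items():
        # Most common first-pass cluster among the members of this cluster.
        counts = Counter(first_labels[i] for i in ids)
        dominant, _ = counts.most_common(1)[0]
        # Members that came from a different first-pass cluster.
        outliers = sorted(i for i in ids if first_labels[i] != dominant)
        if outliers:
            flagged[second_cluster] = outliers
    return flagged


# Toy data: "c" was grouped with "d" and "e" in the first pass,
# but with "a" and "b" in the second pass.
first_labels = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}
second_labels = {"a": "x", "b": "x", "c": "x", "d": "y", "e": "y"}
flagged = find_reassigned(first_labels, second_labels)
print(flagged)
```

Data flagged this way are candidates for closer inspection by the person doing the labeling; other criteria for "classified into different clusters" are equally possible.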
(Supplementary Note 2) The labeling support system according to Supplementary Note 1, further comprising data refinement means for generating the second data group from the data group to be labeled according to the generated first plurality of clusters.
(Supplementary Note 3) The labeling support system according to Supplementary Note 1 or Supplementary Note 2, wherein the data refinement means generates the second data group by labeling, cluster by cluster, the data classified into any of the first plurality of clusters among the data group to be labeled.
(Supplementary Note 4) The labeling support system according to Supplementary Note 1 or Supplementary Note 2, wherein the data refinement means generates, as the second data group, the data group classified into a cluster selected from the first plurality of clusters, among the data group to be labeled.
(Supplementary Note 5) The labeling support system according to any one of Supplementary Notes 1 to 4, wherein the data refinement means generates, as the second data group, a data group obtained by excluding, from the data group to be labeled, one or more pieces of data not classified into any of the first plurality of clusters.
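The refinement variants in the preceding notes (keeping only data from selected clusters, or dropping data that fell into no cluster) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `refine`, the use of `-1` as a marker for unclassified data, and the dictionary representation of first-pass labels are hypothetical choices for this example, not details given in the specification.

```python
NOISE = -1  # assumed marker for "not classified into any cluster"


def refine(first_labels, selected_clusters=None):
    """Build the second data group (a list of data ids) from first-pass labels.

    first_labels maps a data id to its first-pass cluster id.
    selected_clusters, if given, is the set of cluster ids to keep.
    """
    kept = []
    for data_id, cluster in first_labels.items():
        if cluster == NOISE:
            continue  # exclude data that belongs to no first-pass cluster
        if selected_clusters is not None and cluster not in selected_clusters:
            continue  # keep only data from the selected clusters
        kept.append(data_id)
    return kept


first_labels = {"a": 0, "b": 1, "c": NOISE, "d": 1}
without_noise = refine(first_labels)                       # drops "c"
only_cluster_1 = refine(first_labels, selected_clusters={1})
print(without_noise, only_cluster_1)
```

The resulting list of ids would then be passed to the second classification step.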
(Supplementary Note 6) The labeling support system according to any one of Supplementary Notes 1 to 5, wherein the output means reduces the dimensionality of the data group to be labeled; plots the dimension-reduced data included in the first plurality of clusters and the dimension-reduced data included in the second plurality of clusters in a manner that allows each cluster to be identified; and displays, among the dimension-reduced data included in the second plurality of clusters, the data that were classified into different clusters in the first plurality of clusters in a manner different from the other data.
(Supplementary Note 7) The labeling support system according to any one of Supplementary Notes 1 to 6, wherein the output means displays statistical information on the clusters for each classification process performed on a data group.
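The note above does not say which statistics are displayed. As a hedged illustration, the sketch below assumes the statistics are the number of clusters, the number of data items, and the size of each cluster, computed separately for each classification pass. The function name `cluster_stats` and the dictionary layout are assumptions made for this example.

```python
from collections import Counter


def cluster_stats(labels):
    """Summarize one classification pass; labels maps data id -> cluster id."""
    sizes = Counter(labels.values())
    return {
        "num_clusters": len(sizes),       # distinct cluster ids in this pass
        "num_data": sum(sizes.values()),  # total number of classified items
        "sizes": dict(sizes),             # items per cluster
    }


# One entry per classification pass (toy data).
passes = {
    "first": {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1},
    "second": {"a": "x", "b": "x", "c": "x", "d": "y", "e": "y"},
}
stats_by_pass = {name: cluster_stats(labels) for name, labels in passes.items()}
for name, stats in stats_by_pass.items():
    print(name, stats)
```

Showing these figures side by side for each pass lets a user see how the cluster structure changed between the first and second classification.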
(Supplementary Note 8) A labeling support method, wherein:
a computer generates a first plurality of clusters by classifying, by unsupervised learning, a first data group that is a data group to be labeled;
the computer generates a second plurality of clusters by classifying a second data group that is a data group including at least part of the data to be labeled; and
the computer outputs, among the data included in the second plurality of clusters, data that were classified into different clusters in the first plurality of clusters.
(Supplementary Note 9) The labeling support method according to Supplementary Note 8, wherein the second data group is generated from the data group to be labeled according to the generated first plurality of clusters.
(Supplementary Note 10) A program storage medium storing a labeling support program for causing a computer to execute:
a first classification process of generating a first plurality of clusters by classifying, by unsupervised learning, a first data group that is a data group to be labeled;
a second classification process of generating a second plurality of clusters by classifying a second data group that is a data group including at least part of the data to be labeled; and
an output process of outputting, among the data included in the second plurality of clusters, data that were classified into different clusters in the first plurality of clusters.
(Supplementary Note 11) The program storage medium according to Supplementary Note 10, storing the labeling support program for causing the computer to further execute a data refinement process of generating the second data group from the data group to be labeled according to the generated first plurality of clusters.
(Supplementary Note 12) A labeling support program for causing a computer to execute:
a first classification process of generating a first plurality of clusters by classifying, by unsupervised learning, a first data group that is a data group to be labeled;
a second classification process of generating a second plurality of clusters by classifying a second data group that is a data group including at least part of the data to be labeled; and
an output process of outputting, among the data included in the second plurality of clusters, data that were classified into different clusters in the first plurality of clusters.
(Supplementary Note 13) The labeling support program according to Supplementary Note 12, causing the computer to further execute a data refinement process of generating the second data group from the data group to be labeled according to the generated first plurality of clusters.
Although the present invention has been described above with reference to the embodiments, it is not limited to those embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.
1 Labeling support system
10 Data acquisition unit
20 Related information acquisition unit
30 Object identification unit
40 Data processing unit
50 Text information input unit
60 Feature extraction unit
70 Feature storage unit
80 Visualization processing unit
90 Input/output device
100 Data refinement unit
Claims (11)
- A labeling support system comprising:
first classification means for generating a first plurality of clusters by classifying, by unsupervised learning, a first data group that is a data group to be labeled;
second classification means for generating a second plurality of clusters by classifying a second data group that is a data group including at least part of the data to be labeled; and
output means for outputting, among the data included in the second plurality of clusters, data that were classified into different clusters in the first plurality of clusters.
- The labeling support system according to claim 1, further comprising data refinement means for generating the second data group from the data group to be labeled according to the generated first plurality of clusters.
- The labeling support system according to claim 1 or claim 2, wherein the data refinement means generates the second data group by labeling, cluster by cluster, the data classified into any of the first plurality of clusters among the data group to be labeled.
- The labeling support system according to claim 1 or claim 2, wherein the data refinement means generates, as the second data group, the data group classified into a cluster selected from the first plurality of clusters, among the data group to be labeled.
- The labeling support system according to any one of claims 1 to 4, wherein the data refinement means generates, as the second data group, a data group obtained by excluding, from the data group to be labeled, one or more pieces of data not classified into any of the first plurality of clusters.
- The labeling support system according to any one of claims 1 to 5, wherein the output means reduces the dimensionality of the data group to be labeled; plots the dimension-reduced data included in the first plurality of clusters and the dimension-reduced data included in the second plurality of clusters in a manner that allows each cluster to be identified; and displays, among the dimension-reduced data included in the second plurality of clusters, the data that were classified into different clusters in the first plurality of clusters in a manner different from the other data.
- The labeling support system according to any one of claims 1 to 6, wherein the output means displays statistical information on the clusters for each classification process performed on a data group.
- A labeling support method, wherein:
a computer generates a first plurality of clusters by classifying, by unsupervised learning, a first data group that is a data group to be labeled;
the computer generates a second plurality of clusters by classifying a second data group that is a data group including at least part of the data to be labeled; and
the computer outputs, among the data included in the second plurality of clusters, data that were classified into different clusters in the first plurality of clusters.
- The labeling support method according to claim 8, wherein the second data group is generated from the data group to be labeled according to the generated first plurality of clusters.
- A program storage medium storing a labeling support program for causing a computer to execute:
a first classification process of generating a first plurality of clusters by classifying, by unsupervised learning, a first data group that is a data group to be labeled;
a second classification process of generating a second plurality of clusters by classifying a second data group that is a data group including at least part of the data to be labeled; and
an output process of outputting, among the data included in the second plurality of clusters, data that were classified into different clusters in the first plurality of clusters.
- The program storage medium according to claim 10, storing the labeling support program for causing the computer to further execute a data refinement process of generating the second data group from the data group to be labeled according to the generated first plurality of clusters.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2024504060A JPWO2023166578A1 (en) | 2022-03-02 | 2022-03-02 | |
PCT/JP2022/008749 WO2023166578A1 (en) | 2022-03-02 | 2022-03-02 | Labeling assistance system, labeling assistance method, and labeling assistance program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/008749 WO2023166578A1 (en) | 2022-03-02 | 2022-03-02 | Labeling assistance system, labeling assistance method, and labeling assistance program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023166578A1 (en) | 2023-09-07 |
Family
ID=87883223
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/008749 WO2023166578A1 (en) | 2022-03-02 | 2022-03-02 | Labeling assistance system, labeling assistance method, and labeling assistance program |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2023166578A1 (en) |
WO (1) | WO2023166578A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118152826A (en) * | 2024-05-09 | 2024-06-07 | 深圳市翔飞科技股份有限公司 | Intelligent camera alarm system based on behavior analysis |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008084151A (en) * | 2006-09-28 | 2008-04-10 | Just Syst Corp | Information display device and information display method |
JP2008084203A (en) * | 2006-09-28 | 2008-04-10 | Nec Corp | System, method and program for assigning label |
JP2014063343A (en) * | 2012-09-21 | 2014-04-10 | Nippon Telegr & Teleph Corp <Ntt> | Clustering quality improvement method |
-
2022
- 2022-03-02 JP JP2024504060A patent/JPWO2023166578A1/ja active Pending
- 2022-03-02 WO PCT/JP2022/008749 patent/WO2023166578A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
JPWO2023166578A1 (en) | 2023-09-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5075924B2 (en) | Classifier learning image generation program, method, and system | |
CN108932303B (en) | Distributed dynamic target detection and analysis system for visible light remote sensing image | |
US10963734B1 (en) | Perception visualization tool | |
CN109325502B (en) | Shared bicycle parking detection method and system based on video progressive region extraction | |
CN109871875B (en) | Building change detection method based on deep learning | |
WO2012139228A1 (en) | Video-based detection of multiple object types under varying poses | |
US20210133495A1 (en) | Model providing system, method and program | |
Li et al. | Robust vehicle detection in high-resolution aerial images with imbalanced data | |
EP3443482A1 (en) | Classifying entities in digital maps using discrete non-trace positioning data | |
CN112990065A (en) | Optimized YOLOv5 model-based vehicle classification detection method | |
CN113160395B (en) | CIM-based urban multi-dimensional information interaction and scene generation method, device and medium | |
WO2023166578A1 (en) | Labeling assistance system, labeling assistance method, and labeling assistance program | |
CN115830399B (en) | Classification model training method, device, equipment, storage medium and program product | |
CN113052108A (en) | Multi-scale cascade aerial photography target detection method and system based on deep neural network | |
CN114003672A (en) | Method, device, equipment and medium for processing road dynamic event | |
CN113942521B (en) | Method for identifying style of driver under intelligent vehicle road system | |
CN106454241B (en) | Dust-haze source determination method based on surveillance video and social network data | |
Zhai et al. | GAN-BiLSTM network for field-road classification on imbalanced GNSS recordings | |
CN117557983A (en) | Scene reconstruction method and driving assistance system based on depth forward projection and query back projection | |
Greer et al. | Language-Driven Active Learning for Diverse Open-Set 3D Object Detection | |
Yang et al. | A data-driven method for flight time estimation based on air traffic pattern identification and prediction | |
WO2024069729A1 (en) | Clustering support system, method, and program | |
WO2023166579A1 (en) | Labelling assistance system, labelling assistance method, and labelling assistance program | |
CN110413662B (en) | Multichannel economic data input system, acquisition system and method | |
CN114155440A (en) | Automatic detection method and system for farmland non-farming |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22929726 Country of ref document: EP Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2024504060 Country of ref document: JP Kind code of ref document: A |