WO2023166578A1

WO2023166578A1 - Labeling assistance system, labeling assistance method, and labeling assistance program

Info

Publication number: WO2023166578A1
Application number: PCT/JP2022/008749
Authority: WO
Inventors: 哲孝山下; 卓郎鹿嶋; 憲人大井; 秋紗子藤井
Original assignee: 日本電気株式会社
Priority date: 2022-03-02
Filing date: 2022-03-02
Publication date: 2023-09-07
Also published as: JPWO2023166578A1

Abstract

A first classification means 181 generates a first plurality of clusters by classifying a first group of data, which is a group of data to be labeled, by unsupervised learning. A second classification means 182 generates a second plurality of clusters by classifying a second group of data, which is a group of data including at least a part of the group of data to be labeled. An output means 183 outputs, from among the data included in the second plurality of clusters, data that was classified into a different cluster in the first plurality of clusters.

Description

Labeling support system, labeling support method and labeling support program

The present invention relates to a labeling support system, a labeling support method, and a labeling support program that support labeling of unlabeled data.

In the IoT (Internet of Things) society, it has become possible to collect data from various devices. On the other hand, for example, it is very difficult to find a desired video from a large amount of data by simple work. Therefore, there is a demand for a mechanism for searching the collected data.

As a mechanism for searching data, there is a method of labeling the data. However, since labeling a large amount of data manually takes a huge amount of time and cost, various methods for classifying data have been proposed.

For example, Patent Document 1 describes a sensor data classification device that classifies sensor data obtained from a large number of sensors according to their characteristics. The device described in Patent Document 1 associates a set of sensor data divided for each preset time interval with a sensor identifier and a divided section identifier, and calculates a plurality of types of feature parameters from the data included in each set of divided data.

Patent Document 1: JP 2016-99888 A

For example, automatic labeling based on rules is also possible. However, the work of maintaining rules in response to changes in the environment or the like is complicated, and work such as adding rules is not easy.

In the device described in Patent Document 1, the calculation method of feature parameters for classification and the division intervals are determined in advance. However, even if data is classified based on numerical values calculated according to some criterion, meaningful labeling of the unlabeled data still entails costs.

Therefore, an object of the present invention is to provide a labeling support system, a labeling support method, and a labeling support program that can support labeling work for clusters in which unlabeled data are classified.

A labeling support system according to the present invention includes: first classification means for generating a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning; second classification means for generating a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and output means for outputting, from among the data included in the second plurality of clusters, data classified into a different cluster in the first plurality of clusters.

In the labeling support method according to the present invention, a computer generates a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning; the computer generates a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and the computer outputs, from among the data included in the second plurality of clusters, data classified into a different cluster in the first plurality of clusters.

A labeling support program according to the present invention causes a computer to execute: a first classification process for generating a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning; a second classification process for generating a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and an output process for outputting, from among the data included in the second plurality of clusters, data classified into a different cluster in the first plurality of clusters.

According to the present invention, it is possible to support labeling work for clusters in which unlabeled data are classified.

FIG. 1 is a block diagram showing a configuration example of an embodiment of a labeling support system according to the present invention.
FIG. 2 is an explanatory diagram showing an example of data used in the labeling support system.
FIG. 3 is an explanatory diagram showing an example of feature amounts.
FIG. 4 is an explanatory diagram showing an example of visualizing dimension-reduced data in a graph.
FIG. 5 is an explanatory diagram showing another example of visualizing dimension-reduced data in a graph.
FIG. 6 is an explanatory diagram showing an example of processing for labeling data in a cluster.
FIG. 7 is an explanatory diagram showing an example of processing for selecting some clusters.
FIG. 8 is an explanatory diagram showing an example of processing for excluding part of the data.
FIG. 9 is an explanatory diagram showing an example of an overlay display of results before and after refinement.
FIG. 10 is an explanatory diagram showing an example of displaying results before and after refinement in parallel windows.
FIG. 11 is an explanatory diagram showing an example of displaying results before and after refinement in parallel windows.
FIG. 12 is an explanatory diagram showing an example of displaying, in a separate window, a list of data with different results before and after refinement.
FIG. 13 is an explanatory diagram showing an example of an overlay display of the results of multiple refinements.
FIG. 14 is an explanatory diagram showing an example of displaying, in a separate window, a list of data with different results across multiple refinements.
FIG. 15 is an explanatory diagram showing an example of displaying statistical information of each cluster.
FIG. 16 is an explanatory diagram showing another example of displaying statistical information of each cluster.
FIG. 17 is a flowchart showing an operation example of the labeling support system.
FIG. 18 is a block diagram showing an overview of a labeling support system according to the present invention.
FIG. 19 is a schematic block diagram showing the configuration of a computer according to at least one embodiment.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the following description, moving images (video data) will be exemplified as an example of unlabeled data. However, unlabeled data is not limited to moving images, and may be still images, music data, text data, and the like. Further, unlabeled data (data to be labeled) may be hereinafter referred to as unclassified data.

FIG. 1 is a block diagram showing a configuration example of one embodiment of a labeling support system according to the present invention. The labeling support system 1 of this embodiment includes a data acquisition unit 10, a related information acquisition unit 20, an object identification unit 30, a data processing unit 40, a text information input unit 50, a feature extraction unit 60, a feature storage unit 70, a visualization processing unit 80, an input/output device 90, and a data refinement unit 100.

The data acquisition unit 10 acquires data to be labeled (that is, unclassified data). For example, when a camera (not shown) captures an image of a traveling vehicle, the data acquisition unit 10 may acquire a moving image of the vehicle captured by the camera as data to be labeled. The data acquired by the data acquisition unit 10 is not limited to data acquired in real time. The data acquisition unit 10 may acquire the data to be labeled, for example, from a storage server (not shown) in which the data to be labeled is stored.

The related information acquisition unit 20 acquires information related to data to be labeled (hereinafter referred to as related information). In this embodiment, the related information is information indicating the situation in which the data to be labeled was generated (hereinafter referred to as sensor data).

For example, if the data to be labeled is video data captured by an in-vehicle camera (drive recorder), the related information includes GPS (Global Positioning System) information representing the vehicle position and information acquired via CAN (Controller Area Network). Examples of sensor data acquired in this case are velocity, acceleration, and position (latitude, longitude, altitude, etc.).

Also, when a video showing the operating status of a thermal power plant is used as the data to be labeled, sensor data includes, for example, fuel flow rate, pressure, temperature, rotation speed, and power generation amount. In addition, when images showing farm conditions are used as data to be labeled, sensor data includes time, temperature, humidity, pH, soil water content, solar radiation, wind direction/speed, water level, and the like.

The object identification unit 30 identifies objects included in the acquired data and generates information specifying the identified objects (hereinafter referred to as an object list). For example, when the object to be identified is a vehicle, the object identification unit 30 may identify the vehicle from the data acquired by the data acquisition unit 10 and generate information specifying the vehicle (for example, coordinates indicating its position in the image) as an object list. Methods for identifying objects from images and videos are widely known, and detailed description thereof is omitted here.

The data processing unit 40 processes the data (more specifically, the object list) into a form that can be used when the feature extraction unit 60, which will be described later, performs processing. Specifically, the data processing unit 40 processes the data so as to improve the accuracy of feature extraction and clustering. The data processing unit 40, for example, thins data, interpolates missing values, excludes outliers, and deletes unnecessary data items. Further, for example, when the data to be labeled is video data, the data processing unit 40 may convert the video data into numerical time-series data.
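The preprocessing steps named above can be sketched as follows. This is a minimal illustration only, under assumptions that are not part of the original description: missing values are represented by None and filled by linear interpolation, outliers are judged by a z-score threshold, and thinning keeps every n-th sample. The function names, the threshold, and the sample speed values are hypothetical.

```python
from statistics import mean, pstdev


def interpolate_missing(series):
    """Fill None gaps by linear interpolation; gaps at the edges copy the nearest known value."""
    known = [i for i, v in enumerate(series) if v is not None]
    if not known:
        return list(series)
    filled = []
    for i, v in enumerate(series):
        if v is not None:
            filled.append(v)
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled.append(series[right])
        elif right is None:
            filled.append(series[left])
        else:
            w = (i - left) / (right - left)
            filled.append(series[left] + w * (series[right] - series[left]))
    return filled


def exclude_outliers(series, z_max):
    """Drop values whose z-score exceeds z_max (assumed outlier criterion)."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return list(series)
    return [v for v in series if abs(v - mu) / sigma <= z_max]


def thin(series, step):
    """Keep every step-th sample."""
    return list(series[::step])


# Hypothetical speed readings with two gaps and one implausible spike.
speed = [10.0, None, 14.0, 15.0, None, None, 21.0, 500.0]
filled = interpolate_missing(speed)
cleaned = exclude_outliers(filled, z_max=2.0)
thinned = thin(cleaned, step=2)
```

The order of the three steps is a design choice of this sketch; the description does not prescribe one.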

The text information input unit 50 accepts input of text data including information to be added to each data to be labeled (hereinafter referred to as additional information). The additional information is information indicating the content of the labeling target data that can be acquired other than the related information. Categories indicating additional information include, for example, weather, types of plants, traffic participants, and the like. Examples of weather categorical values include sunny, cloudy, rainy, and snowy. Examples of plant type categorical values include rice, wheat, and barley. Examples of traffic participant categorical values include pedestrians and the like.

The input of text data is optional. In other words, additional information for the data to be labeled need not be input. However, the more additional information is added to the data to be labeled, the more the accuracy of classification can be improved, so input is preferable. In the following description, labeling target data associated with additional information is also simply referred to as labeling target data.

FIG. 2 is an explanatory diagram showing an example of data used in the labeling support system 1 of this embodiment. The example shown in FIG. 2 indicates that the data acquisition unit 10 has acquired the video 11 as data to be labeled, and the related information acquisition unit 20 has acquired related information 21 regarding the location where the video 11 was shot. The example also indicates that the data processing unit 40 processes the video 11 and the related information 21 (more specifically, the object list generated by the object identification unit 30) to generate numerical time-series data 41. Furthermore, the example shown in FIG. 2 indicates that the text information input unit 50 has received input of text data 51 including information on the weather, scene, time period, and objects as additional information.

The feature extraction unit 60 extracts features from each data to be labeled. The feature extraction unit 60 of the present embodiment firstly generates a plurality of clusters by automatically classifying each data to be labeled including additional information by unsupervised learning. Any method can be used to generate clusters by unsupervised learning, and examples thereof include the k-means method and the Gaussian mixture model.
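As one possible illustration of clustering by unsupervised learning, the following is a minimal sketch of the k-means method written with the standard library only. It is not the implementation of the embodiment: the initialization (random sampling with a fixed seed), the iteration limit, and the two-dimensional sample points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, iterations=50, seed=0):
    """Plain k-means: returns (cluster index per point, list of centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return assignments, centroids


# Hypothetical feature vectors forming two well-separated groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
first_clusters, centroids = kmeans(points, k=2)
```

The cluster indices returned here carry no meaning by themselves; they only indicate which data were grouped together.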

Hereinafter, the process in which the feature extraction unit 60 classifies the data group to be labeled by unsupervised learning to generate a plurality of clusters will be referred to as the first classification process. A plurality of clusters generated by the first classification process will be referred to as a first plurality of clusters, and a data group classified into the first plurality of clusters will be referred to as a first data group. In addition, since the feature extraction unit 60 performs a process of classifying data to be labeled by unsupervised learning, the feature extraction unit 60 can also be called a classifying means.

Then, the feature extraction unit 60 extracts the feature amount of each data included in the generated cluster. The feature extraction unit 60 may extract, for example, additional information included in the text data as a feature amount. In addition, the feature extraction unit 60 may extract feature amounts indicated by numerical time-series data. Specifically, the feature extraction unit 60 may extract feature amounts based on sensor values included in the data to be labeled (more specifically, numerical time-series data).

Any method can be used to extract feature values from numerical time-series data. For example, for each cluster generated by the k-means method, the feature extraction unit 60 may extract, as a feature amount, the distance from the center of gravity of the numerical time-series data included in the cluster to each data (cluster distance feature).
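The cluster distance feature can be sketched as follows, assuming Euclidean distance and assuming that each data item has already been reduced to a fixed-length numerical vector. The function name and the sample values are hypothetical.

```python
import math
from collections import defaultdict


def cluster_distance_features(points, assignments):
    """Return, for each point, its distance to the centroid of its own cluster."""
    members = defaultdict(list)
    for point, cluster in zip(points, assignments):
        members[cluster].append(point)
    centroids = {
        cluster: tuple(sum(col) / len(ps) for col in zip(*ps))
        for cluster, ps in members.items()
    }
    return [math.dist(p, centroids[c]) for p, c in zip(points, assignments)]


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 14.0)]
assignments = [0, 0, 1, 1]
distances = cluster_distance_features(points, assignments)
# Centroids are (1, 0) and (10, 12), so the distances are 1, 1, 2, 2.
```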

Further, this embodiment has described a case in which the object identification unit 30 identifies objects from the information obtained by the data acquisition unit 10 and the related information acquisition unit 20, and the data processing unit 40 uses the identification result to process the data into the format used by the feature extraction unit 60. However, the data acquisition unit 10 may directly acquire data in the format used by the feature extraction unit 60 and input the acquired data to the feature extraction unit 60. In this case, the labeling support system 1 does not have to include the related information acquisition unit 20, the object identification unit 30, and the data processing unit 40.

The feature storage unit 70 stores feature amounts of each data extracted by the feature extraction unit 60. The feature storage unit 70 may also store information on labels added by the data refinement unit 100, which will be described later. Note that the mode in which the feature storage unit 70 stores the feature amount for each data is arbitrary.

FIG. 3 is an explanatory diagram showing an example of feature amounts stored in the feature storage unit 70. In the example shown in FIG. 3, the vertical direction represents one feature point, and the horizontal direction represents the feature amount (category value) of each category (for example, weather, traffic participants, types of plants, etc.). The feature storage unit 70 is implemented by, for example, a magnetic disk.

The visualization processing unit 80 performs processing for visualizing information that contributes to the labeling work for the generated clusters. The visualization processing unit 80 of the present embodiment reduces the dimensionality of the data to be labeled and draws the result as a graph on the input/output device 90, so that a person can observe how the data to be labeled is clustered.

The visualization processing unit 80 may, for example, use UMAP (Uniform Manifold Approximation and Projection) or the like to reduce the data to be labeled to two or three dimensions, and visualize the dimension-reduced data as a graph such as a distribution map. At that time, the visualization processing unit 80 may display the data classified into the same cluster in a manner different from that of other clusters (for example, by changing the color, changing the symbol, etc.).

FIG. 4 is an explanatory diagram showing an example of visualizing the dimension-reduced data in a graph. The graph illustrated in FIG. 4 shows an example in which the data reduced to two dimensions by UMAP are displayed in different manners (hatching, blacking, etc.) for each cluster to which they belong.

FIG. 5 is an explanatory diagram showing another example of visualizing the dimension-reduced data in a graph. The graph illustrated in FIG. 5 is a graph displayed by changing symbols plotted for each type of video data. Further, as illustrated in FIG. 5, the visualization processing unit 80 may display the range surrounded by a dotted line so that the range of data included in the cluster can be identified.

Furthermore, when drawing the graph, the visualization processing unit 80 may display all data, or may decide whether to display data depending on whether it satisfies a specific condition. For example, the visualization processing unit 80 may decide whether or not to display clusters that satisfy a specific condition (for example, clusters with more data than a predetermined number) and unclassified data (that is, unlabeled data).

Furthermore, the visualization processing unit 80 of the present embodiment outputs data that belong to different clusters as a result of re-learning processing, which will be described later. A data output method will be described later.

The input/output device 90 displays the output result from the visualization processing unit 80. The input/output device 90 also receives input from the user regarding the displayed result, and executes processing according to the input. In this embodiment, the processing of the data refinement unit 100, which will be described later, is performed based on the input of the cluster specified by the user with respect to the output of the input/output device 90.

The input/output device 90 may be realized by a tablet terminal or the like. Alternatively, the input/output device 90 may be realized by a device having a display device and a pointing device.

The data refinement unit 100 performs each process on the data group to be labeled based on the clusters generated by the feature extraction unit 60. Specifically, the data refinement unit 100 generates a second data group from the labeling target data group according to the generated first plurality of clusters. In this embodiment, the data refinement unit 100 performs the following three types of processing.

First, the first process will be explained. The first process is the process of labeling the data within the cluster. In the first process, the data refinement unit 100 generates a second data group by labeling, for each cluster, the data in the data group to be labeled that has been classified into one of the first plurality of clusters. Any cluster can be labeled by the data refinement unit 100. The data refinement unit 100 may label all clusters, or may label clusters specified by the user via the input/output device 90.

As long as the same label is added to all data in a cluster, the content of the label is arbitrary. The data refinement unit 100 may add an arbitrary temporary label to the data in the target cluster, or may add a label with content specified by the user. Then, the data refinement unit 100 may associate the data (more specifically, the feature amount of the data) with the added label and store them in the feature storage unit 70.

FIG. 6 is an explanatory diagram showing an example of processing for labeling data within a cluster. The example shown in FIG. 6 indicates that the data refinement unit 100 added temporary labels “A”, “B” and “C” to the clusters illustrated in FIG. 5, respectively. Note that when the user designates a cluster to be labeled among the clusters illustrated in FIG. 5, the data refinement unit 100 may add a temporary label only to the designated cluster.

After that, the feature extraction unit 60 regenerates a plurality of clusters by learning (supervised learning) using the labeled data. Note that the feature extraction unit 60 may perform learning (unsupervised learning) by adding unlabeled data. Hereinafter, a process of generating a plurality of clusters by classifying a data group including at least part of data to be labeled by the feature extraction unit 60 will be referred to as a second classification process. Also, a plurality of clusters generated by the second classification process will be referred to as a second plurality of clusters, and a data group classified into the second plurality of clusters will be referred to as a second data group.
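A minimal sketch of re-learning with temporary labels is shown below. The description does not specify the supervised learner, so a nearest-centroid classifier is assumed here purely for illustration; the function names, the labels, and the sample points are hypothetical. Data without a label (None) is left out of training and is classified afterwards together with the labeled data.

```python
import math
from collections import defaultdict


def fit_nearest_centroid(points, labels):
    """Learn one centroid per label from the labeled points (label None is skipped)."""
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        if label is not None:
            groups[label].append(point)
    return {
        label: tuple(sum(col) / len(members) for col in zip(*members))
        for label, members in groups.items()
    }


def classify(model, point):
    """Assign the label whose centroid is nearest to the point."""
    return min(model, key=lambda label: math.dist(point, model[label]))


# Temporary labels from the first classification; the last point was left unlabeled.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 4.0), (9.0, 9.0), (10.0, 9.0), (6.0, 6.0)]
temporary_labels = ["A", "A", "A", "B", "B", None]

model = fit_nearest_centroid(points, temporary_labels)
second_clusters = [classify(model, p) for p in points]
```

Because the second classification reuses the temporary labels as cluster names, its result can be compared item by item with the first classification.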

Thus, in the second classification process, at least part of the labeling target data used in the first classification process is used to generate and refine a plurality of clusters again. This can be called a relearning process or refinement. This makes it possible to semi-automate labeling through unsupervised learning, and also contributes to the discovery of new labels.

The feature extraction unit 60 may extract feature amounts of each data included in the clusters (second plurality of clusters) generated by the second classification process, and store the extracted feature amounts in the feature storage unit 70.

After the second classification process, the visualization processing unit 80 outputs data classified into different clusters in the first plurality of clusters among the data included in the second plurality of clusters. This corresponds to the process of visualizing data belonging to different clusters as a result of re-learning. Note that specific processing for visualization will be described later.
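The selection of data to output can be sketched as a comparison of the two classification results. The sketch assumes that both results use the same cluster names (for example, the temporary labels added in the first process) and that each result is held as a mapping from a data identifier to a cluster name; the identifiers below are hypothetical.

```python
def changed_data(first, second):
    """Return identifiers in the second result whose cluster differs from the first result."""
    return sorted(
        data_id
        for data_id, cluster in second.items()
        if data_id in first and first[data_id] != cluster
    )


first = {"clip1": "A", "clip2": "A", "clip3": "B", "clip4": "C"}
second = {"clip1": "A", "clip2": "B", "clip3": "B", "clip4": "A"}
moved = changed_data(first, second)  # clip2 and clip4 changed cluster
```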

Next, the second process will be explained. The second process is a process of selecting at least some clusters and learning again (unsupervised learning). The data refinement unit 100 generates, as the second data group, the data in the data group to be labeled that was classified into clusters selected from the first plurality of clusters.

First, the data refinement unit 100 selects at least some clusters from among the first plurality of clusters. The data refinement unit 100 may select a cluster specified by the user via the input/output device 90, or may automatically select a cluster that satisfies a condition. The conditions here are arbitrary, and include, for example, clusters in which the number of data is a predetermined number or more, a ratio of classified data that is greater than a predetermined threshold, and the like. The data group within the cluster selected here corresponds to the above-described second data group.
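Automatic selection by a condition can be sketched as follows, taking "the number of data is a predetermined number or more" as the condition. The function names, the threshold, and the identifiers are hypothetical.

```python
from collections import Counter


def select_clusters(assignments, min_size):
    """Return the set of clusters containing at least min_size data items."""
    counts = Counter(assignments)
    return {cluster for cluster, n in counts.items() if n >= min_size}


def second_data_group(data_ids, assignments, selected):
    """Keep only the data classified into one of the selected clusters."""
    return [d for d, cluster in zip(data_ids, assignments) if cluster in selected]


data_ids = ["d0", "d1", "d2", "d3", "d4", "d5"]
assignments = [0, 0, 0, 1, 2, 2]
selected = select_clusters(assignments, min_size=2)
group = second_data_group(data_ids, assignments, selected)
```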

FIG. 7 is an explanatory diagram showing an example of the process of selecting some clusters. The example shown in FIG. 7 indicates that two clusters have been selected from the three generated clusters. Also in the second process, the data refinement unit 100 may add arbitrary cluster identification information to the data in each cluster so that the clusters classified in the first classification process can be identified.

After that, the feature extraction unit 60 regenerates a plurality of clusters (that is, performs re-learning processing) by learning (unsupervised learning) targeting data in the selected cluster. This process corresponds to the above-described second classification process, and the generated clusters correspond to the second plurality of clusters. Note that the feature extraction unit 60 may perform learning by adding new data separately. This makes it possible to dig deeper into the data within the selected clusters, so a more detailed classification can be expected.

Then, after the second classification process, the visualization processing unit 80 outputs, from among the data included in the second plurality of clusters, data classified into different clusters in the first plurality of clusters, in the same manner as in the first process. In addition, since a selected cluster may be subdivided, the visualization processing unit 80 may output, as data classified into different clusters in the first plurality of clusters, the data in each cluster whose cluster identification information is in the minority (that is, other than the identification information with the largest share).
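The minority-based output can be sketched as follows. For each cluster of the second classification, the sketch counts the cluster identification information carried over from the first classification and flags the data whose identification information is not the most frequent one. The mapping layout and the identifiers are assumptions made for the example, and ties are resolved in favor of the identification information encountered first.

```python
from collections import Counter, defaultdict


def minority_data(first_ids, second_ids):
    """Flag data whose first-classification cluster is in the minority within its second cluster."""
    by_second = defaultdict(list)
    for data_id, second in second_ids.items():
        by_second[second].append(data_id)
    flagged = []
    for members in by_second.values():
        majority, _ = Counter(first_ids[d] for d in members).most_common(1)[0]
        flagged.extend(d for d in members if first_ids[d] != majority)
    return sorted(flagged)


first_ids = {"d1": "c1", "d2": "c1", "d3": "c1", "d4": "c2", "d5": "c2", "d6": "c1"}
second_ids = {"d1": "s1", "d2": "s1", "d3": "s1", "d4": "s2", "d5": "s2", "d6": "s2"}
flagged = minority_data(first_ids, second_ids)  # d6 is the only c1 item inside s2
```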

Next, the third process will be explained. The third process is a process of excluding at least part of the data not classified into clusters, such as outliers, and learning again (unsupervised learning or supervised learning). The data refinement unit 100 generates, as a second data group, a data group excluding one or more data not classified into any of the first plurality of clusters from the data group to be labeled.

FIG. 8 is an explanatory diagram showing an example of processing for excluding part of the data. The example shown in FIG. 8 indicates that the data in the range surrounded by a solid line circle is excluded as an outlier. For example, when the data to be labeled is video data, this corresponds to processing for excluding noise scenes. Thereafter, the first process, the second process, or both are performed. This is expected to improve classification accuracy.

The three types of processing performed by the data refinement unit 100 have been described above. However, the processing executed by the data refinement unit 100 is not limited to the three types of processing described above. The data refinement unit 100 may also perform data maintenance processing. Also, after each of the first process, the second process, and the third process, the same process may be performed again, or a different process may be performed.

An example of the data maintenance process is the process of maintaining the data used by the feature extraction unit 60 for learning. The data refinement unit 100 may output a file containing a data group to which labels have been added or a data group from which outliers have been removed.

For example, suppose that the data group to be labeled was labeled in the first process described above. In this case, the data refinement unit 100 may, for example, create a label file in which the designated labels are described, copy only the labeled data to the folder for the next learning, and sort (move or copy) the original data into folders for each label.

Also, for example, assume that clusters have been selected in the second process described above. In this case, the data refinement unit 100 may create a data list file describing only the data belonging to the selected cluster, copy only the data belonging to the selected cluster to the next learning folder, and the like.

Also, for example, assume that outliers are excluded in the third process described above. In this case, the data refinement unit 100 may create a data list file describing only data other than the specified data (outliers), copy the data other than the specified data (outliers) to the folder for the next learning, and the like.
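The data maintenance processing described above can be sketched with ordinary file operations. The file names, the CSV layout of the label file, and the folder names below are hypothetical; the description does not specify any of them. The sketch writes a label file and copies only the labeled data into one folder per label.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def write_label_file(path, labels):
    """Write one 'data,label' row per labeled item (assumed CSV layout)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["data", "label"])
        for name in sorted(labels):
            writer.writerow([name, labels[name]])


def copy_into_label_folders(source_dir, dest_dir, labels):
    """Copy each labeled file into dest_dir/<label>/, creating folders as needed."""
    for name, label in labels.items():
        target = Path(dest_dir) / label
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(Path(source_dir) / name, target / name)


# clip3.mp4 has no label, so it is not copied to the next learning folder.
labels = {"clip1.mp4": "A", "clip2.mp4": "B"}
with tempfile.TemporaryDirectory() as tmp:
    source, dest = Path(tmp) / "original", Path(tmp) / "next_learning"
    source.mkdir()
    for name in ["clip1.mp4", "clip2.mp4", "clip3.mp4"]:
        (source / name).write_bytes(b"")  # empty placeholder files
    copy_into_label_folders(source, dest, labels)
    write_label_file(dest / "labels.csv", labels)
    copied = sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*.mp4"))
    label_rows = (dest / "labels.csv").read_text(encoding="utf-8").splitlines()
```

A data list file for selected clusters or for data other than outliers could be written in the same way, with the list of retained data in place of the label mapping.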

A method for the visualization processing unit 80 to visualize data belonging to a different cluster as a result of re-learning will be specifically described below. First, the visualization processing unit 80 performs dimension reduction on the data group to be labeled, and draws the dimension-reduced data included in the first plurality of clusters and the dimension-reduced data included in the second plurality of clusters in graphs in such a manner that each cluster can be identified. Then, the visualization processing unit 80 displays, among the dimension-reduced data included in the second plurality of clusters, the data classified into different clusters in the first plurality of clusters in a manner different from other data.

Examples of different aspects include changing the shade of color, changing the color itself, changing the line of the outer frame, and blinking.

FIG. 9 is an explanatory diagram showing an example of an overlay display of results before and after refinement. In the example shown in FIG. 9, the visualization processing unit 80 superimposes the data distributions of the individual refinements and displays the data of layers other than the layer of interest (that is, the refinement of interest) in a manner different from the data of the layer of interest. Specifically, in the example shown in FIG. 9, the result of the first refinement and the result of the second refinement are superimposed and displayed. At that time, when attention is paid to the result of the first refinement, the data d1 included in the target cluster only in the second refinement is shown in a manner different from other data. Similarly, when looking at the result of the second refinement, the data d2, which is included in the cluster of interest only in the first refinement, is shown in a manner different from the other data.

FIGS. 10 and 11 are explanatory diagrams showing examples of displaying results before and after refinement in parallel windows. As illustrated in FIG. 10, the visualization processing unit 80 may display the results before and after refinement in separate windows. At that time, the visualization processing unit 80 may display the data that changed before and after refinement in a different manner from other data, as illustrated in FIG. 11.

Furthermore, the visualization processing unit 80 may display a list of data with different results before and after refinement (that is, data classified into different clusters). FIG. 12 is an explanatory diagram showing an example of displaying, in a separate window, a list of data d3 that have different results before and after refinement. In the example shown in FIG. 12, the results are shown by listing the coordinates at which the data with different results before and after refinement are displayed.

Note that FIGS. 9 to 12 exemplify the case of comparing two refinement results. However, comparison targets are not limited to two results, and may be three or more. FIG. 13 is an explanatory diagram showing an example of an overlay display of the results of refinement performed multiple times. Also, FIG. 14 is an explanatory diagram showing an example of displaying, in a separate window, a list of data that have different results across multiple refinements. Compared with the example shown in FIG. 9, the example shown in FIG. 13 shows a case in which there are four refinement results. Similarly, compared with the example shown in FIG. 12, the example shown in FIG. 14 shows a case in which there are four refinement results.

In addition, the visualization processing unit 80 may display cluster statistical information for each data group classification process (that is, refinement) separately from the above-described graph or together with the above-described graph. Note that the creation of the statistical information may be performed by the visualization processing unit 80 or by the feature extraction unit 60 .

FIG. 15 is an explanatory diagram showing an example of displaying statistical information of each cluster. The example shown in FIG. 15 shows an example of displaying the number of data in the cluster, the center of gravity of the data, and the variance (x-direction and y-direction) as the cluster statistical information. Further, as illustrated in FIG. 15, the visualization processing unit 80 may switch and display the statistical information for each refinement, or may display them side by side.
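
As a rough illustration of the statistics named above (number of data, center of gravity, and variance in the x and y directions), the following Python sketch computes them per cluster. The function name `cluster_statistics` and the data layout are hypothetical, and the use of population variance rather than sample variance is an assumption, since the description does not specify which is meant.

```python
from statistics import fmean, pvariance
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def cluster_statistics(points: Dict[str, Point],
                       assignment: Dict[str, int]) -> Dict[int, dict]:
    """Compute count, centroid, and per-axis variance for every cluster."""
    grouped: Dict[int, List[Point]] = {}
    for data_id, cluster in assignment.items():
        grouped.setdefault(cluster, []).append(points[data_id])

    stats: Dict[int, dict] = {}
    for cluster, members in sorted(grouped.items()):
        xs = [p[0] for p in members]
        ys = [p[1] for p in members]
        stats[cluster] = {
            "count": len(members),
            "centroid": (fmean(xs), fmean(ys)),
            # Population variance (assumption); pvariance of one value is 0.
            "variance": (pvariance(xs), pvariance(ys)),
        }
    return stats


# Hypothetical sample data in the two-dimensional display space.
points = {"a": (0.0, 0.0), "b": (2.0, 4.0), "c": (5.0, 5.0)}
assignment = {"a": 0, "b": 0, "c": 1}
for cluster, s in cluster_statistics(points, assignment).items():
    print(cluster, s)
```

Calling the function once per refinement result would give the per-refinement tables that can then be switched or shown side by side.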

FIG. 16 is an explanatory diagram showing another example of displaying the statistical information of each cluster. As illustrated in FIG. 16, the visualization processing unit 80 may display cluster statistical information (e.g., a false positive rate) in graph form and in tabular form. The example shown in FIG. 16 represents the degree of matching between labels and assigned clusters when supervised learning is performed. In the example shown in FIG. 16, the first classification is assumed to be unsupervised learning, so there is no evaluation result for it.
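
The description does not define how the degree of matching is calculated. The following Python sketch shows one possible reading, under the assumption that each cluster produced by supervised learning carries a class name that can be compared directly with the label. The function name `match_statistics`, the metric definitions, and the class names "car" and "person" are hypothetical.

```python
from typing import Dict, Optional


def match_statistics(labels: Dict[str, str],
                     predicted: Dict[str, str]) -> Dict[str, Dict[str, Optional[float]]]:
    """Per-class match rate and false positive rate (one possible definition)."""
    ids = [i for i in predicted if i in labels]   # only labeled data can be evaluated
    classes = sorted(set(labels.values()) | set(predicted.values()))
    stats: Dict[str, Dict[str, Optional[float]]] = {}
    for c in classes:
        tp = sum(1 for i in ids if predicted[i] == c and labels[i] == c)
        fp = sum(1 for i in ids if predicted[i] == c and labels[i] != c)
        tn = sum(1 for i in ids if predicted[i] != c and labels[i] != c)
        assigned = tp + fp
        stats[c] = {
            # Share of data assigned to c whose label is also c.
            "match_rate": tp / assigned if assigned else None,
            # Share of data not labeled c that was nevertheless assigned to c.
            "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        }
    return stats


# Hypothetical sample: one "person" item is assigned to the "car" cluster.
labels = {"a": "car", "b": "car", "c": "person", "d": "person"}
predicted = {"a": "car", "b": "car", "c": "car", "d": "person"}
for name, s in match_statistics(labels, predicted).items():
    print(name, s)
```

For a first classification done by unsupervised learning there are no labels to compare against, so no such table can be produced for it.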

The data acquisition unit 10, the related information acquisition unit 20, the object identification unit 30, the data processing unit 40, the text information input unit 50, the feature extraction unit 60, the visualization processing unit 80, and the data refinement unit 100 are realized by a computer processor (e.g., a CPU (Central Processing Unit)) that operates according to a program (the labeling support program).

For example, the program may be stored in a storage unit (not shown) of the labeling support system 1, and the processor may read the program and, according to the program, operate as the data acquisition unit 10, the related information acquisition unit 20, the object identification unit 30, the data processing unit 40, the text information input unit 50, the feature extraction unit 60, the visualization processing unit 80, and the data refinement unit 100. Also, the functions of the labeling support system 1 may be provided in a SaaS (Software as a Service) format.

The data acquisition unit 10, the related information acquisition unit 20, the object identification unit 30, the data processing unit 40, the text information input unit 50, the feature extraction unit 60, the visualization processing unit 80, and the data refinement unit 100 may each be implemented by dedicated hardware. Also, part or all of each component of each device may be implemented by general-purpose or dedicated circuitry, processors, and the like, or combinations thereof. These may be composed of a single chip, or may be composed of multiple chips connected via a bus. Part or all of each component of each device may be implemented by a combination of the above-described circuitry and the like and a program.

Further, when part or all of each component of the labeling support system 1 is realized by a plurality of information processing devices, circuits, and the like, the plurality of information processing devices, circuits, and the like may be centrally arranged or may be distributed. For example, the information processing devices, circuits, and the like may be implemented in a form in which they are connected to one another via a communication network, such as a client-server system or a cloud computing system.

Next, the operation of the labeling support system 1 of this embodiment will be described. FIG. 17 is a flowchart showing an operation example of the labeling support system 1. The operation example illustrated in FIG. 17 is one in which the data acquisition unit 10 directly acquires data in a format used by the feature extraction unit 60 and inputs the acquired data to the feature extraction unit 60.

The feature extraction unit 60 generates a first plurality of clusters from the data group to be labeled (first data group) (step S11). After that, the feature extraction unit 60 generates a second plurality of clusters from a data group (second data group) including at least part of data to be labeled (step S12). Then, the visualization processing unit 80 outputs data classified into different clusters in the first plurality of clusters among the data included in the second plurality of clusters (step S13).
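
Steps S11 to S13 can be sketched in Python as follows. This is only an illustration under stated assumptions: the classifiers are passed in as functions, `threshold_classifier` is a hypothetical stand-in for whatever learning the feature extraction unit 60 actually performs, the features are one-dimensional for brevity, and cluster identifiers are assumed to be comparable between the two classifications.

```python
from typing import Callable, Dict, Tuple

Features = Dict[str, float]                  # data id -> feature value (1-D for brevity)
Assignment = Dict[str, int]                  # data id -> cluster id
Classifier = Callable[[Features], Assignment]


def threshold_classifier(threshold: float) -> Classifier:
    """Hypothetical stand-in for a learned classifier: cluster 0 below the threshold, else 1."""
    def classify(group: Features) -> Assignment:
        return {data_id: 0 if value < threshold else 1
                for data_id, value in group.items()}
    return classify


def run_flow(first_group: Features, second_group: Features,
             classify_first: Classifier,
             classify_second: Classifier) -> Dict[str, Tuple[int, int]]:
    first = classify_first(first_group)      # step S11: first plurality of clusters
    second = classify_second(second_group)   # step S12: second plurality of clusters
    # Step S13: data in the second result whose cluster differs from the first result.
    return {data_id: (first[data_id], cluster)
            for data_id, cluster in second.items()
            if data_id in first and first[data_id] != cluster}


# Hypothetical sample: the second data group is a subset of the first.
first_group = {"a": 1.0, "b": 4.0, "c": 8.0}
second_group = {"a": 1.0, "b": 4.0}
changed = run_flow(first_group, second_group,
                   threshold_classifier(5.0), threshold_classifier(3.0))
print(changed)
```

Data that appears only in the second data group has no earlier cluster to compare against, so this sketch leaves it out of the output.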

As described above, in the present embodiment, the feature extraction unit 60 classifies the first data group by unsupervised learning to generate the first plurality of clusters. Also, the feature extraction unit 60 classifies the second data group to generate a second plurality of clusters. Then, the visualization processing unit 80 outputs data classified into different clusters in the first plurality of clusters among the data included in the second plurality of clusters. Therefore, it is possible to support labeling work for clusters in which unlabeled data are classified.

In addition, in the present embodiment, the data refinement unit 100 generates a second data group from among the data group to be labeled, according to the generated first plurality of clusters. Therefore, it is possible to improve the accuracy of re-learning using the generated second data group.

Next, an outline of the present invention will be described. FIG. 18 is a block diagram showing an overview of a labeling support system according to the present invention. A labeling support system 180 (for example, the labeling support system 1) according to the present invention includes: a first classification means 181 (for example, the feature extraction unit 60) that generates a first plurality of clusters by classifying, by unsupervised learning, a first data group, which is a data group to be labeled; a second classification means 182 (for example, the feature extraction unit 60) that generates a second plurality of clusters by classifying (that is, re-learning) a second data group, which is a data group including at least part of the data to be labeled; and an output means 183 (for example, the visualization processing unit 80) that outputs, from among the data included in the second plurality of clusters, data classified into a cluster different from that in the first plurality of clusters.

With such a configuration, it is possible to support labeling work on clusters in which unlabeled data has been classified.

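In addition, the labeling support system 180 may include a data refinement means (for example, the data refinement unit 100) that generates the second data group from the data group to be labeled according to the generated first plurality of clusters.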

Specifically, the data refinement means may generate the second data group by labeling, cluster by cluster, the data classified into any of the first plurality of clusters among the data group to be labeled (for example, the first processing by the data refinement unit 100).

Further, the data refinement means may generate, as the second data group, a data group classified into a cluster selected from the first plurality of clusters among the data group to be labeled (for example, the second processing by the data refinement unit 100).

Further, the data refinement means may generate, as the second data group, a data group obtained by excluding, from the data group to be labeled, one or more data not classified into any of the first plurality of clusters (for example, the third processing by the data refinement unit 100).
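
The three kinds of processing described above can be sketched in Python as follows. The function names, the dictionary layout, and the use of `None` to mark data not classified into any cluster are assumptions made for illustration. The label strings are also hypothetical.

```python
from typing import Dict, Optional, Set

# data id -> cluster id; None marks data not classified into any cluster (assumption)
Assignment = Dict[str, Optional[int]]


def label_by_cluster(assignment: Assignment,
                     cluster_labels: Dict[int, str]) -> Dict[str, str]:
    """First processing: give each clustered item the label chosen for its cluster."""
    return {data_id: cluster_labels[cluster]
            for data_id, cluster in assignment.items()
            if cluster is not None and cluster in cluster_labels}


def select_clusters(assignment: Assignment, selected: Set[int]) -> Set[str]:
    """Second processing: keep only the data classified into the selected clusters."""
    return {data_id for data_id, cluster in assignment.items() if cluster in selected}


def exclude_unclustered(assignment: Assignment) -> Set[str]:
    """Third processing: drop the data not classified into any cluster."""
    return {data_id for data_id, cluster in assignment.items() if cluster is not None}


# Hypothetical first classification result; "d" belongs to no cluster.
assignment = {"a": 0, "b": 0, "c": 1, "d": None}
print(label_by_cluster(assignment, {0: "car", 1: "person"}))
print(sorted(select_clusters(assignment, {1})))
print(sorted(exclude_unclustered(assignment)))
```

Each function returns a candidate second data group (with labels in the first case) that could then be passed to the second classification.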

Further, the output means may reduce the dimension of the data group to be labeled, graph the dimension-reduced data included in the first plurality of clusters and the dimension-reduced data included in the second plurality of clusters in a manner in which each cluster can be identified, and display, among the dimension-reduced data included in the second plurality of clusters, the data classified into a cluster different from that in the first plurality of clusters in a manner different from the other data.

In addition, the output means may display cluster statistical information for each data group classification process.

FIG. 19 is a schematic block diagram showing the configuration of a computer according to at least one embodiment. A computer 1000 comprises a processor 1001, a main storage device 1002, an auxiliary storage device 1003, and an interface 1004.

The labeling support system 180 described above is implemented in the computer 1000. The operation of each processing unit described above is stored in the auxiliary storage device 1003 in the form of a program (labeling support program). The processor 1001 reads the program from the auxiliary storage device 1003, loads it into the main storage device 1002, and executes the above processing according to the program.

Note that in at least one embodiment, the auxiliary storage device 1003 is an example of a non-transitory tangible medium. Other examples of non-transitory tangible media include magnetic disks, magneto-optical disks, CD-ROMs (Compact Disc Read-Only Memory), DVD-ROMs (Digital Versatile Disc Read-Only Memory), and semiconductor memories connected via the interface 1004. Further, when this program is distributed to the computer 1000 via a communication line, the computer 1000 receiving the distribution may load the program into the main storage device 1002 and execute the above processing.

In addition, the program may be one that realizes part of the functions described above. Further, the program may be a so-called difference file (difference program) that implements the above-described functions in combination with another program already stored in the auxiliary storage device 1003.

Some or all of the above embodiments can also be described as in the following supplementary notes, but are not limited thereto.

(Supplementary Note 1) A labeling support system comprising:
a first classification means for generating a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning;
a second classification means for generating a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and
an output means for outputting, from among the data included in the second plurality of clusters, data classified into a cluster different from that in the first plurality of clusters.

(Supplementary Note 2) The labeling support system according to Supplementary Note 1, further comprising a data refinement means for generating the second data group from the data group to be labeled according to the generated first plurality of clusters.

(Supplementary Note 3) The labeling support system according to Supplementary Note 1 or 2, wherein the data refinement means generates the second data group by labeling, cluster by cluster, the data classified into any of the first plurality of clusters among the data group to be labeled.

(Supplementary Note 4) The labeling support system according to Supplementary Note 1 or 2, wherein the data refinement means generates, as the second data group, a data group classified into a cluster selected from the first plurality of clusters among the data group to be labeled.

(Supplementary Note 5) The labeling support system according to any one of Supplementary Notes 1 to 4, wherein the data refinement means generates, as the second data group, a data group obtained by excluding, from the data group to be labeled, one or more data not classified into any of the first plurality of clusters.

(Supplementary Note 6) The labeling support system according to any one of Supplementary Notes 1 to 5, wherein the output means reduces the dimension of the data group to be labeled, graphs the dimension-reduced data included in the first plurality of clusters and the dimension-reduced data included in the second plurality of clusters in a manner in which each cluster can be identified, and displays, among the dimension-reduced data included in the second plurality of clusters, the data classified into a cluster different from that in the first plurality of clusters in a manner different from the other data.

(Supplementary Note 7) The labeling support system according to any one of Supplementary Notes 1 to 6, wherein the output means displays cluster statistical information for each data group classification process.

(Supplementary Note 8) A labeling support method, wherein:
a computer generates a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning;
the computer generates a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and
the computer outputs, from among the data included in the second plurality of clusters, data classified into a cluster different from that in the first plurality of clusters.

(Supplementary Note 9) The labeling support method according to Supplementary Note 8, wherein the second data group is generated from the data group to be labeled according to the generated first plurality of clusters.

(Supplementary Note 10) A program storage medium storing a labeling support program for causing a computer to execute:
a first classification process of generating a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning;
a second classification process of generating a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and
an output process of outputting, from among the data included in the second plurality of clusters, data classified into a cluster different from that in the first plurality of clusters.

(Supplementary Note 11) The program storage medium according to Supplementary Note 10, storing the labeling support program for further causing the computer to execute a data refinement process of generating the second data group from the data group to be labeled according to the generated first plurality of clusters.

(Supplementary Note 12) A labeling support program for causing a computer to execute:
a first classification process of generating a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning;
a second classification process of generating a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and
an output process of outputting, from among the data included in the second plurality of clusters, data classified into a cluster different from that in the first plurality of clusters.

(Supplementary Note 13) The labeling support program according to Supplementary Note 12, further causing the computer to execute a data refinement process of generating the second data group from the data group to be labeled according to the generated first plurality of clusters.

Although the present invention has been described with reference to the embodiments, the present invention is not limited to the above embodiments. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.

1 labeling support system
10 data acquisition unit
20 related information acquisition unit
30 object identification unit
40 data processing unit
50 text information input unit
60 feature extraction unit
70 feature storage unit
80 visualization processing unit
90 input/output device
100 data refinement unit

Claims

1. A labeling support system comprising:
a first classification means for generating a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning;
a second classification means for generating a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and
an output means for outputting, from among the data included in the second plurality of clusters, data classified into a cluster different from that in the first plurality of clusters.
2. The labeling support system according to claim 1, further comprising data refinement means for generating a second data group from the data group to be labeled according to the generated first plurality of clusters.
3. The labeling support system according to claim 1 or 2, wherein the data refinement means generates the second data group by labeling, cluster by cluster, the data classified into any of the first plurality of clusters among the data group to be labeled.
4. The labeling support system according to claim 1 or 2, wherein the data refinement means generates, as the second data group, a data group classified into a cluster selected from the first plurality of clusters among the data group to be labeled.
5. The labeling support system according to any one of claims 1 to 4, wherein the data refinement means generates, as the second data group, a data group obtained by excluding, from the data group to be labeled, one or more data not classified into any of the first plurality of clusters.
6. The labeling support system according to any one of claims 1 to 5, wherein the output means reduces the dimension of the data group to be labeled, graphs the dimension-reduced data included in the first plurality of clusters and the dimension-reduced data included in the second plurality of clusters in a manner in which each cluster can be identified, and displays, among the dimension-reduced data included in the second plurality of clusters, the data classified into a cluster different from that in the first plurality of clusters in a manner different from the other data.
7. The labeling support system according to any one of claims 1 to 6, wherein the output means displays cluster statistical information for each data group classification process.
8. A labeling support method, wherein:
a computer generates a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning;
the computer generates a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and
the computer outputs, from among the data included in the second plurality of clusters, data classified into a cluster different from that in the first plurality of clusters.
9. The labeling support method according to claim 8, wherein the second data group is generated from the data group to be labeled according to the generated first plurality of clusters.
10. A program storage medium storing a labeling support program for causing a computer to execute:
a first classification process of generating a first plurality of clusters by classifying a first data group, which is a data group to be labeled, by unsupervised learning;
a second classification process of generating a second plurality of clusters by classifying a second data group, which is a data group including at least part of the data to be labeled; and
an output process of outputting, from among the data included in the second plurality of clusters, data classified into a cluster different from that in the first plurality of clusters.
11. The program storage medium according to claim 10, storing the labeling support program for further causing the computer to execute a data refinement process of generating the second data group from the data group to be labeled according to the generated first plurality of clusters.