WO2019069506A1 - Feature value generation device, feature value generation method, and feature value generation program - Google Patents
Feature value generation device, feature value generation method, and feature value generation program
- Publication number
- WO2019069506A1 (PCT/JP2018/022428)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- attribute
- geographical
- value
- generator
- map
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
Definitions
- the present invention relates to a feature quantity generation device, a feature quantity generation method, and a feature quantity generation program that combine a plurality of tables to generate a feature quantity.
- Data mining is a technology for finding useful knowledge that has been unknown so far from a large amount of information.
- it is important to generate more attribute candidates. Specifically, it is important to generate candidates for many attributes (explanatory variables) that can affect variables (target variables) to be predicted. By generating such a large number of candidates, it is possible to increase the possibility that an attribute useful for prediction is included in the candidates.
- Patent Document 1 describes that candidates for feature quantities used in machine learning processing are generated by combining a target table including an object variable with a source table not including an object variable.
- In Patent Document 1, the process of generating candidate feature quantities is defined by a combination of three conditions (filter conditions, map conditions, and reduce conditions), which reduces the man-hours of the analysts who generate candidate feature quantities.
- Patent Document 2 describes a demand forecasting device that predicts the number of demands for a dispatch service of vehicles such as taxis in a forecast target area by regression analysis.
- the demand prediction device described in Patent Document 2 acquires estimated population information in a predetermined area, and uses the acquired estimated population information as an explanatory variable of regression analysis.
- The present inventor arrived at the idea that prediction accuracy improves when various information sources are used to predict a target in a predetermined area. That is, it may be preferable to combine multiple related information sources to obtain information.
- Patent Document 1 exemplifies using a customer ID commonly included in a target table and a source table as a joining condition (that is, map condition) between the target table and the source table.
- Patent Document 2 states that the prediction target area, which is the unit used when predicting the number of service demands, and the predetermined area, which is the unit of the estimated population information used as an explanatory variable, are defined by the same reference (area ID, area polygon, and the like).
- the method of defining geographical information included in each information source may be different from the method of defining geographical information at the time of prediction.
- For example, geographical information can be specified by latitude and longitude, or by municipality name.
- the present inventor has found that the task of generating candidate feature quantities for predicting a prediction target from each information source can be complicated.
- In Patent Document 1 and Patent Document 2, it is assumed that the information sources are associated by a customer ID or by the same standard. However, even when geographical information is used to associate information sources, that geographical information is not necessarily defined on the same basis. It is therefore difficult to associate these information sources directly, and data analysis using such information requires a large number of man-hours.
- this invention aims at providing the feature-value production
- A feature quantity generation device comprises: a table acquisition unit for acquiring a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute; and an attribute adding means for adding, when the value of the second geographical attribute satisfies a predetermined condition with respect to the value of the first geographical attribute, a statistic of the distance calculated from the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition to the attributes of the first table as a feature quantity, that is, a variable that can affect the prediction target.
- The feature quantity generation method acquires a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute, and, when the value of the second geographical attribute satisfies a predetermined condition with respect to the value of the first geographical attribute, adds a statistic of the distance calculated from the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition to the attributes of the first table as a feature quantity, that is, a variable that can affect the prediction target.
- A feature quantity generation program causes a computer to execute: a table acquisition process for acquiring a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute; and an attribute addition process for adding, when the value of the second geographical attribute satisfies a predetermined condition with respect to the value of the first geographical attribute, a statistic of the distance calculated from the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition to the attributes of the first table as a feature quantity, that is, a variable that can affect the prediction target.
- FIG. 1 is a block diagram illustrating an embodiment of an information processing system according to the present invention.
- FIG. 2 is an explanatory view showing an example of a configuration file.
- FIG. 3 is an explanatory view showing an example of processing for converting data.
- FIG. 4 is an explanatory view showing an example of the relationship between each parameter, the first table, and the second table.
- FIG. 5 is an explanatory view showing an example of processing for generating map parameters based on distances.
- The information processing system uses a table including the variable to be predicted (for example, the objective variable; hereinafter also referred to as the first table) and a table different from the first table (hereinafter also referred to as the second table).
- the first table may be referred to as a target table
- the second table may be referred to as a source table.
- the first table and the second table may each include a set of data.
- The first table and the second table each include an attribute that shares a common viewpoint.
- Common viewpoint means that the semantic content of the data of the attribute is common.
- the method of representing data may be common or different.
- the attribute included in the first table is described as a first attribute
- the attribute included in the second table is described as a second attribute.
- Examples of attributes with a common viewpoint include attributes of the geographical viewpoint and of the temporal viewpoint.
- The value of an attribute of the geographical viewpoint can be classified into the following four geographical data types. In each heading, the part after the colon gives the notation for the data.
- Point P (Point): p(x, y) ∈ P. The point P is represented as coordinates (longitude, latitude).
- Polygon G (Polygon): g(b_1, b_2, ..., b_n) ∈ G. The polygon G is defined by one outer boundary b_1 and zero or more inner boundaries (b_2, ..., b_n). A boundary such as b_1(p_1, p_2, ..., p_n) (where p_1, p_2, ..., p_n ∈ P) is a closed ring defined as an ordered sequence of three or more points.
- Multi-polygon M: the multi-polygon M is composed of one or more polygons.
- String S (String): s ∈ S. The string is an address represented by a character string.
- an analysis data type may be defined in association with a data type.
- For example, the polygon G and the multi-polygon M may be defined as analysis data types for an area, and the point P may be defined as an analysis data type for a point.
- the character string related to the address may be defined as, for example, an analysis data type related to a country, a city, a town, a landmark, a street or a point.
- an analysis data type representing geographical information may be referred to as a geographical data type.
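- As a minimal sketch of the data types described above, the Python below models points, polygons, and multi-polygons as small classes and maps each value to an analysis data type. The class names, the `analysis_type` helper, and the label "Address" for strings are illustrative assumptions and do not appear in the original text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Point:
    x: float  # longitude
    y: float  # latitude


@dataclass(frozen=True)
class Polygon:
    # boundaries[0] is the outer boundary b_1; any others are inner boundaries.
    boundaries: Tuple[Tuple[Point, ...], ...]

    def __post_init__(self):
        if not self.boundaries:
            raise ValueError("a polygon needs exactly one outer boundary")
        for ring in self.boundaries:
            if len(ring) < 3:
                raise ValueError("a ring needs three or more points")


@dataclass(frozen=True)
class MultiPolygon:
    polygons: Tuple[Polygon, ...]

    def __post_init__(self):
        if not self.polygons:
            raise ValueError("a multi-polygon needs one or more polygons")


def analysis_type(value) -> str:
    """Return a (hypothetical) analysis data type label for a value."""
    if isinstance(value, Point):
        return "Point"
    if isinstance(value, (Polygon, MultiPolygon)):
        return "Area"
    if isinstance(value, str):
        return "Address"
    raise TypeError(f"unsupported geographical value: {value!r}")


square = Polygon(((Point(0, 0), Point(1, 0), Point(1, 1), Point(0, 1)),))
print(analysis_type(Point(139.7, 35.5)), analysis_type(square))
```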
- the type (temporal data type) of a temporal viewpoint attribute can be defined as a TimeStamp type.
- When the attribute having a common viewpoint is a geographical attribute, the attribute included in the first table is referred to as a first geographical attribute, and the attribute included in the second table is referred to as a second geographical attribute.
- Likewise, when the attribute having a common viewpoint is a temporal attribute, the attribute included in the first table is referred to as a first temporal attribute, and the attribute included in the second table is referred to as a second temporal attribute.
- the first geographical attribute may be the primary key of the first table.
- The geographical viewpoint and the temporal viewpoint are given here as examples, but the common attribute is not limited to these viewpoints.
- Other examples of common attributes include the character string viewpoint and the structural viewpoint.
- the value of the attribute of the character string viewpoint is, for example, an address or the like.
- the value of the attribute of the structural viewpoint is, for example, a URL (Uniform Resource Locator), a tree structure path, or the like.
- FIG. 1 is a block diagram showing an embodiment of an information processing system according to the present invention.
- The information processing system 100 of the present embodiment includes an input unit 10, a geocoder (Geo-Coder) 20, a map parameter generator (Map Parameter Generator) 30, a filter parameter generator (Filter Parameter Generator) 50, an aggregation parameter generator (Reduce Parameter Generator) 60, a storage unit 80, a feature quantity generation function generator (Feature Descriptor Generator) 81, a feature quantity generator (Feature Generator) 82, and a feature quantity selector (Feature Selector) 83, among other units.
- the input unit 10 acquires a first table and a second table. Since the input unit 10 acquires each table, the input unit 10 can be referred to as table acquisition means.
- the input unit 10 may acquire a plurality of second tables. For example, when the storage unit 80 stores the first table and the second table, the input unit 10 may acquire the first table and the second table from the storage unit 80. Also, the input unit 10 may obtain the first table and the second table from another system or storage unit via a communication network (not shown).
- For example, when the geographical viewpoint is common, the input unit 10 may acquire a first table including the prediction target and the first geographical attribute, and a second table including the second geographical attribute. When the temporal viewpoint is common, the input unit 10 may acquire a first table including the prediction target and the first temporal attribute, and a second table including the second temporal attribute. In addition, the input unit 10 may acquire a first table including a prediction target and a first character string attribute and a second table including a second character string attribute, or a first table containing a first structural attribute and a second table containing a second structural attribute. The structural attributes will be described later.
- The input unit 10 accepts a function that calculates the similarity between the first attribute and the second attribute (hereinafter referred to as a similarity function) and a condition on that similarity for determining whether the value of the first attribute and the value of the second attribute are similar (hereinafter also referred to as a condition for the similarity).
- the similarity function may be represented by a mathematical expression or may be represented as a parameter.
- The condition for the similarity may be expressed as a threshold (hereinafter simply referred to as a similarity threshold) for judging the presence or absence of similarity from the degree of the relationship, or as an expression that outputs whether the values are similar according to parameters or the like.
- the input unit 10 may receive a geographical relationship as a similarity function, and receive a threshold of similarity indicating the degree of the geographical relationship as a condition. That is, when the first attribute and the second attribute are geographical attributes, for example, the similarity function is defined as a function that calculates the similarity higher as the distance is closer.
- the input unit 10 may receive the temporal relationship as the similarity function, and may receive the threshold of the similarity indicating the degree of the temporal relationship as the condition. That is, when the first attribute and the second attribute are temporal attributes, the similarity function is defined as, for example, a function that calculates the similarity higher as the difference in time is smaller.
- the input unit 10 may receive the relationship of the character string as the similarity function, and may receive the threshold value of the similarity indicating the degree of the relationship of the character string as a condition.
- When the first attribute and the second attribute are texts, the similarity function is defined, for example, as a function that yields a higher similarity the more closely the two texts match.
- An example of text similarity is the Simpson coefficient computed over morphemes.
- Let morph(a) denote the set of morphemes contained in the text string a.
- For example, text strings indicating addresses can each be represented, through morphological analysis, as a set of morphemes.
- The function textSim(a, b), which calculates the degree of similarity between the text string a and the text string b, can be defined by Equation 1.
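- Equation 1 itself is not reproduced in this text. The sketch below therefore assumes the standard Simpson (overlap) coefficient, |morph(a) ∩ morph(b)| / min(|morph(a)|, |morph(b)|), and uses whitespace tokenisation as a stand-in for real morphological analysis. The function names `morph` and `text_sim` are illustrative.

```python
def morph(text: str) -> set:
    """Stand-in for morphological analysis: split on whitespace (assumption)."""
    return set(text.split())


def text_sim(a: str, b: str) -> float:
    """Simpson (overlap) coefficient between the morpheme sets of a and b."""
    ma, mb = morph(a), morph(b)
    if not ma or not mb:
        return 0.0  # no morphemes to compare
    return len(ma & mb) / min(len(ma), len(mb))


# One string's morphemes are a subset of the other's, so the coefficient is 1.0.
print(text_sim("Kanagawa Kawasaki Nakahara", "Kanagawa Kawasaki"))
```

- Because the denominator is the size of the smaller set, an abbreviated address scores 1.0 against its longer form, which is one reason this coefficient suits address matching.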
- the input unit 10 may receive a structural relationship as a similarity function, and may receive a threshold of similarity indicating the degree of the structural relationship as a condition.
- A character string that expresses tree-structured information, such as an address or a file directory structure, with "/" separators is defined as a path string.
- For example, the address "Kanagawa Prefecture Kawasaki City" is expressed as "/ Kanagawa Prefecture / Kawasaki City" as a path string.
- Similarly, a directory structure with "news" at the top, "economy" beneath it, and "bigdata" beneath that is expressed as "news / economy / bigdata" as a path string.
- When the first attribute and the second attribute are path strings, the similarity function is defined, for example, as a function that yields a higher similarity the shorter the distance between the two path strings.
- One example of a distance function for path strings is the minimum of the distances to the lowest common ancestor (LCA).
- The lowest common ancestor node is the first shared node encountered when tracing upward (toward the ancestors) from the lowest nodes represented by the two paths. The distance to the lowest common ancestor node is the number of nodes traversed when going from the lowest common ancestor node down to the lowest node.
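- The path distance described above can be sketched as follows, assuming that both paths start from the same root and that the distance counts the nodes below the lowest common ancestor. The helper names `split_path` and `lca_distance` are hypothetical.

```python
def split_path(path: str) -> list:
    """Split a path string on "/" and drop empty or blank components."""
    return [part.strip() for part in path.split("/") if part.strip()]


def lca_distance(a: str, b: str) -> int:
    """Minimum of the two distances from each lowest node up to the LCA."""
    nodes_a, nodes_b = split_path(a), split_path(b)
    common = 0  # length of the shared prefix, i.e. the depth of the LCA
    for x, y in zip(nodes_a, nodes_b):
        if x != y:
            break
        common += 1
    return min(len(nodes_a) - common, len(nodes_b) - common)


print(lca_distance("/Kanagawa Prefecture/Kawasaki City",
                   "/Kanagawa Prefecture/Yokohama City"))  # siblings
```

- Taking the minimum means that a path and any of its ancestors have distance 0; a different aggregate (for example the sum) would be needed if such pairs should be told apart.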
- FIG. 2 is an explanatory view showing an example of a configuration file.
- In the example shown in FIG. 2, the similarity function and the condition for the similarity are set in the configuration file.
- the input unit 10 may receive this configuration file.
- the C1 portion of the configuration file illustrated in FIG. 2 indicates conditions for the similarity function and the similarity.
- the C2 to C4 portions of the configuration file will be described later.
- In each row of the C1 portion, the part before the colon gives the data type (more specifically, the analysis data type) of the first attribute and that of the second attribute, and the part after the colon gives the similarity function and the condition for the similarity.
- the "Point-Point" row in the C1 portion defines a geographical relationship representing the distance between the first geographical attribute represented by the point and the second geographical attribute represented by the point.
- “DistanceMap” is a map function that defines the degree of geographical relationship, and includes a distance threshold as a parameter.
- The three parameters of the DistanceMap function indicate, in order, the "start value", "end value", and "interval" of the threshold (applied from the start value to the end value). Assuming that the unit of distance is km, ("DistanceMap", 1, 3, 1) illustrated in FIG. 2 indicates that three thresholds, "distance within 1 km", "distance within 2 km", and "distance within 3 km", are applied to the function.
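- The expansion of a (start value, end value, interval) triple into individual thresholds can be sketched as below. The function name `expand_thresholds` is hypothetical, and the use of `Decimal` is a design choice of this sketch to avoid floating-point drift with fractional intervals such as 0.1.

```python
from decimal import Decimal


def expand_thresholds(start, end, interval):
    """Return every threshold from start to end (inclusive) in steps of interval."""
    first, last, step = (Decimal(str(v)) for v in (start, end, interval))
    if step <= 0:
        raise ValueError("interval must be positive")
    thresholds = []
    value = first
    while value <= last:
        thresholds.append(float(value))
        value += step
    return thresholds


# ("DistanceMap", 1, 3, 1) -> within 1 km, within 2 km, within 3 km
print(expand_thresholds(1, 3, 1))
```

- Each resulting threshold corresponds to one candidate map condition, so a single configuration entry yields several candidates.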
- KNearestMap is a map function that defines the degree of geographical relationship, and includes, as a parameter, a threshold of the number of pieces of geographical information in proximity.
- the three parameters in the KNearestMap function indicate “start value”, “end value”, and “interval” (of the threshold applied from the start value to the end value).
- ("KNearestMap", 3, 5, 1) illustrated in FIG. 2 indicates that three thresholds on the number of nearby pieces of geographical information, "within three", "within four", and "within five", are applied to the function.
- "SameCityMap" is a map function that defines the degree of geographical relationship and determines whether two points are included in the same area. The SameCityMap function takes no parameters; whether two points are in the same area is determined from predefined area information that defines each area.
- the "Point-Area" row in the C1 portion defines a geographical relationship that represents an inclusive relationship between the first geographic attribute represented by the point and the second geographic attribute represented by the region.
- InclusionMap is a map function that defines the degree of geographical relationship and determines whether the first geographical attribute represented by a point is included in the second geographical attribute represented by a region. InclusionMap takes no parameters.
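- A minimal sketch of such an inclusion test is shown below, using the ray-casting method on planar coordinates. This is one common technique, not necessarily the one used in the invention; the function names are hypothetical, and the handling of points lying exactly on a boundary is left unspecified.

```python
def in_ring(point, ring):
    """Ray casting: count crossings of a ray going in the +x direction."""
    x, y = point
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]  # wrap around to close the ring
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def inclusion_map(point, polygon):
    """polygon = [outer_boundary, inner_boundary, ...]; inner rings are holes."""
    outer, *holes = polygon
    return in_ring(point, outer) and not any(in_ring(point, h) for h in holes)


area = [[(0, 0), (4, 0), (4, 4), (0, 4)],   # outer boundary
        [(1, 1), (2, 1), (2, 2), (1, 2)]]   # one inner boundary (hole)
print(inclusion_map((3, 3), area), inclusion_map((1.5, 1.5), area))
```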
- KNearestMap is also defined in the "Point-Area" row.
- Its content is the same as that of the KNearestMap function in "Point-Point".
- The "Area-Area" row in the C1 portion defines a geographical relationship that represents the intersection between the first geographical attribute represented by a region and the second geographical attribute represented by a region.
- IntersectMap is a map function that defines the degree of geographical relationship and determines whether the first geographical attribute represented by a region intersects the second geographical attribute represented by a region. IntersectMap takes no parameters.
- the first geographical data type and the second geographical data type may be the same geographical data type as each other, or may be different geographical data types.
- For example, the first geographical data type may be a type of data that identifies geography by point information, and the second geographical data type may be a type of data that identifies geography by range information.
- the line "TimeStamp-TimeStamp" in the C1 section defines a temporal relationship that represents the difference between the first temporal attribute and the second temporal attribute.
- TimeDiffMap is a map function that defines the degree of temporal relationship, and includes a threshold of time difference as a parameter.
- the three parameters in the TimeDiffMap function indicate “start value”, “end value”, and “interval” (of the threshold applied from the start value to the end value).
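- A temporal counterpart of the distance-based association can be sketched as follows. The unit of the TimeDiffMap threshold is not stated in this text, so the sketch takes a `timedelta`; the function name `time_diff_map` and the sample timestamps are illustrative.

```python
from datetime import datetime, timedelta


def time_diff_map(target_times, source_times, max_diff: timedelta):
    """Map each target index to the source indices within max_diff of it."""
    return {
        i: [j for j, s in enumerate(source_times) if abs(t - s) <= max_diff]
        for i, t in enumerate(target_times)
    }


targets = [datetime(2015, 1, 8, 22, 0)]
sources = [datetime(2015, 1, 8, 21, 30), datetime(2015, 1, 8, 19, 0)]
# Only the 21:30 record lies within one hour of the 22:00 target.
print(time_diff_map(targets, sources, timedelta(hours=1)))
```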
- the line "Text-Text” in the C1 portion defines the correspondence between the first attribute representing a character string and the second attribute representing a character string.
- “ExactMap” is a function that determines whether the attribute represented by the character string matches.
- Alternatively, a similarity relationship between a first attribute representing a character string and a second attribute representing a character string may be defined.
- In that case, a map function "textSimMap", which defines the degree of relationship between character strings, may be set in the "Text-Text" row.
- "textSimMap" is a map function that defines the degree of relationship between character strings and takes a similarity threshold as a parameter.
- Like the DistanceMap function, the textSimMap function has three parameters, which indicate the "start value", "end value", and "interval" of the threshold (applied from the start value to the end value), respectively.
- For example, the definition [("textSimMap", 0.8, 1.0, 0.1)] indicates that three thresholds, "similarity of 0.8 or more", "similarity of 0.9 or more", and "similarity of 1.0 (or more)", are applied to the function.
- The way the similarity function and the similarity threshold are set is not limited to the content illustrated in the C1 portion of FIG. 2.
- For example, a structural relationship "Path-Path", representing the distance between a first structural attribute represented by a path string and a second structural attribute represented by a path string, may be defined.
- In that case, a map function "pathDisMap", which defines the degree of structural relationship, may be set.
- "pathDisMap" is a map function that defines the degree of structural relationship and takes a distance threshold as a parameter.
- The pathDisMap function has three parameters, which indicate the "start value", "end value", and "interval" of the threshold (applied from the start value to the end value), respectively.
- For example, the definition [("pathDisMap", 1, 3, 1)] indicates that three thresholds, "distance of 1 or less", "distance of 2 or less", and "distance of 3 or less", are applied to the function.
- Based on these settings, the map parameter generator 30, which will be described later, generates a join condition (map parameter) for joining a record included in the first table with a record included in the second table.
- the input unit 10 may also receive the attribute of the data indicated by each column of the table.
- the geocoder 20 converts data of an attribute represented by a character string. For example, when the data of the geographical attribute is represented by a character string, the geocoder 20 converts the character string into data of point, polygon or multi-polygon. Note that when there is no need to convert data, the information processing system 100 may not include the geocoder 20.
- FIG. 3 is an explanatory view showing an example of processing for converting data.
- a table adt1 in which an analysis data type for each column is defined and a table adt2 in which correspondence to convert an analysis data type to a data type is defined are acquired in advance.
- the analysis data type of the "Pickup_location” column of the source table S2 is Point when referring to the table adt1, and there is no need for conversion.
- The analysis data type of the "community" column of the source table S1 is "TownAddress" according to the table adt1, and according to the table adt2 it must be converted to the data type Polygon. The geocoder 20 therefore converts the data in the "community" column of the source table S1 so that it is represented as a polygon area. For example, area information that specifies an area as a polygon may be predetermined for each value of "community", and the geocoder 20 may convert the data to the Polygon data type based on that area information.
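- The lookup through the two tables can be sketched as below. The dictionary layout, the `AREA_INFO` contents, the value "Town A", and the rule that an analysis data type absent from adt2 needs no conversion are all assumptions of this sketch; only the column names and type names come from the text.

```python
# adt1: (table, column) -> analysis data type
ADT1 = {
    ("S1", "community"): "TownAddress",
    ("S2", "Pickup_location"): "Point",
}
# adt2: analysis data type -> data type to convert to
ADT2 = {"TownAddress": "Polygon"}

# Hypothetical predefined area information: address string -> polygon ring.
AREA_INFO = {"Town A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]}


def geocode_column(table, column, values):
    """Convert a column's values when adt2 names a target data type for it."""
    target_type = ADT2.get(ADT1[(table, column)])
    if target_type is None:
        return list(values)  # e.g. Point: already usable, no conversion
    if target_type == "Polygon":
        return [AREA_INFO[v] for v in values]
    raise NotImplementedError(f"no converter for {target_type}")


print(geocode_column("S1", "community", ["Town A"]))
```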
- The map parameter generator 30, the filter parameter generator 50, and the aggregation parameter generator 60 generate the parameters that the feature quantity generation function generator 81, described later, uses when generating a feature quantity generation function, that is, a function for generating feature quantities, which are variables that can affect the prediction target.
- The feature quantity means the content of the feature itself (for example, "population", "position", etc.).
- The feature quantities generated by the feature quantity generator 82, described later, become candidates for explanatory variables when a model is generated using machine learning.
- In other words, with the feature quantity generation function generated in the present embodiment, candidate explanatory variables can be generated automatically when a model is generated using machine learning.
- FIG. 4 is an explanatory view showing an example of the relationship between each parameter and the first table and the second table.
- the parameters generated by the filter parameter generator 50 are parameters representing extraction conditions of the rows included in the second table.
- this parameter may be referred to as a filter parameter, and a process of extracting a row from the second table based on the filter parameter may be described as “filter”.
- this list of extraction conditions may be described as "F list”.
- The extraction condition is arbitrary; one example is a condition that tests whether a value is equal to (or larger or smaller than) the value of a designated column.
- The parameters generated by the aggregation parameter generator 60 represent the method of aggregating the data of the rows included in the second table for each objective variable.
- A row in the first table often corresponds to multiple rows in the second table, so those rows are aggregated.
- Aggregation information may be defined as an aggregation function over columns of the source table (second table).
- The aggregation method is arbitrary and includes, for example, the count, maximum value, minimum value, average value, median value, and variance of a column. The count may be taken either excluding or not excluding duplicate data.
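- The aggregation methods listed above can be sketched with the Python standard library as follows. The dictionary keys and the `reduce_column` helper are hypothetical names; the text does not say whether the variance is the population or sample variance, so the population variance is assumed here.

```python
import statistics

REDUCERS = {
    "count": len,                                     # duplicates kept
    "count_distinct": lambda values: len(set(values)),  # duplicates excluded
    "max": max,
    "min": min,
    "mean": statistics.mean,
    "median": statistics.median,
    "variance": statistics.pvariance,                 # population variance (assumption)
}


def reduce_column(values, method):
    """Aggregate the matched values of one source column into a single number."""
    values = list(values)
    if method in ("count", "count_distinct"):
        return REDUCERS[method](values)
    if not values:
        return None  # no matched rows: leave the feature empty
    return REDUCERS[method](values)


trip_distances = [1, 2, 2, 5]
print({name: reduce_column(trip_distances, name) for name in REDUCERS})
```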
- This parameter may be described as an aggregation parameter, and the process of aggregating the data of each column by the method indicated by the aggregation parameter may be described as "reduce".
- the process of aggregating geographical information may be described as "Geo-reduce”.
- the list of aggregation processing may be described as "R list”. The details of the process of aggregating geographical information will be described later.
- The parameters generated by the map parameter generator 30 represent the correspondence conditions between the columns of the first table and the columns of the second table.
- this parameter may be referred to as a map parameter, and the process of associating the columns of each table based on the map parameter may be referred to as “map”.
- the list of correspondence conditions may be described as "M list”.
- the process of associating geographical information may be described as "Geo-map”.
- Associating the columns of the tables by map can be regarded as joining a plurality of tables into one table on the associated columns. The details of the process of associating geographical information are also described later.
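- How one filter condition, one map condition, and one reduce method combine into a single feature can be sketched as below. The function `generate_feature`, the table contents, and the column names are illustrative assumptions; an exact-match map condition is used here for brevity.

```python
from statistics import mean


def generate_feature(target_rows, source_rows, filt, mapper, reducer, column):
    """For each target row: filter source rows, map (join) them, then reduce."""
    feature = []
    for t in target_rows:
        matched = [s for s in source_rows if filt(s) and mapper(t, s)]
        values = [s[column] for s in matched]
        feature.append(reducer(values) if values else None)
    return feature


target = [{"area": "A"}, {"area": "B"}]
source = [
    {"area": "A", "trip_km": 2.0},
    {"area": "A", "trip_km": 4.0},
    {"area": "B", "trip_km": 1.0},
]

feature = generate_feature(
    target, source,
    filt=lambda s: s["trip_km"] >= 2.0,        # one entry of the F list
    mapper=lambda t, s: t["area"] == s["area"],  # one entry of the M list
    reducer=mean,                               # one entry of the R list
    column="trip_km",
)
print(feature)  # one value per target row; None where nothing matched
```

- Iterating over every combination of entries from the F list, M list, and R list would yield one candidate feature per combination.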
- the map parameter generator 30 includes a geomap generator (GeoMap Generator) 40, a time difference map generator (TimeDiff Map Generator) 31, a map generator (Exact Map Generator) 32, and an attribute specifying unit 33.
- When the similarity calculated from the value of the first attribute and the value of the second attribute satisfies the condition, the map parameter generator 30 (more specifically, each generator included in the map parameter generator 30) generates a join condition for joining the record of the first table that contains that value of the first attribute with the record of the second table that contains that value of the second attribute.
- Satisfying the condition means, for example, that the similarity is at or above (or at or below) a threshold, or falls within a predetermined range.
- the geomap generator 40 generates a parameter representing a correspondence condition between columns including geographical attributes of the first table and the second table.
- The geomap generator 40 includes a distance map generator (Distance Map Generator) 41, an inclusion map generator (Inclusion Map Generator) 42, an overlap map generator (Overlap Map Generator) 43, and a same-area map generator (SameArea Map Generator) 44.
- the geomap generator 40 determines that the relationship between the value of the first geographical attribute and the value of the second geographical attribute is geographically
- the processing of each generator will be described in detail below.
- The distance map generator 41 generates map parameters when it receives a similarity function and a condition (for example, a similarity threshold) for associating the first table with the second table based on the closeness of the distance.
- the example shown in FIG. 2 corresponds to the case where at least one of the DistanceMap function and the KNearestMap function is set in the configuration file.
- The distance map generator 41 generates map parameters for joining the records included in the first table with the records included in the second table such that the distance between the value of the first geographical attribute and the value of the second geographical attribute is within a threshold.
- FIG. 5 is an explanatory view showing an example of processing for generating map parameters based on distances.
- the example shown in FIG. 5 shows the case where one target table T and one source table S2 are acquired.
- the target table T illustrated in FIG. 5 is a table including data representing the number of passengers (pickup_number) at five locations on January 8, 2015 at 22:00.
- The source table S2 illustrated in FIG. 5 is a table that records, for each time, the number of passengers, the travel distance, and the passenger pickup position in association with one another.
- For example, the distance map generator 41 generates a parameter that associates each record of the target table T with the records of the source table S2 for which the distance between the position indicated by the value of the first geographical attribute and the position indicated by the value of the second geographical attribute is within 1 km.
- Similarly, the distance map generator 41 generates parameters that associate each record of the target table T with the records of the source table S2 for which that distance is within 2 km and within 3 km, respectively.
- the attribute of the "target_location" column of the target table T is a first geographical attribute
- the attribute of the "Pickup_location” column of the source table S2 is a second geographical attribute. These two columns are associated.
- the columns to be associated between the first table and the second table may be specified in advance, or may be specified by the attribute specifying unit 33 described later.
- the parameter P11 illustrated in FIG. 5 is generated.
- map parameters are generated based on the geographical analysis data type, and one map processing is defined based on one map parameter.
- the map data M11 illustrated in FIG. 5 indicates the result of associating each record of the target table T with the record of the source table S2 having a distance of 1 km or less. For example, only one record from the source table is associated with the first record of the target table. Also, for example, two records from the source table are associated with the second record of the target table.
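- the distance-based association described above can be sketched as follows. This is only a minimal illustration, assuming that positions are (latitude, longitude) pairs and that the haversine formula is an acceptable distance; the function names `haversine_km` and `distance_map` and the sample records are hypothetical and are not taken from FIG. 5.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def distance_map(target, source, target_col, source_col, threshold_km):
    """Pair each target record with every source record within threshold_km."""
    pairs = []
    for t in target:
        for s in source:
            d = haversine_km(t[target_col], s[source_col])
            if d <= threshold_km:
                pairs.append((t, s, d))
    return pairs

# Hypothetical sample data (assumption, for illustration only).
target = [{"target_location": (40.7500, -73.9900), "pickup_number": 12}]
source = [
    {"Pickup_location": (40.7505, -73.9905), "passengers": 2},  # well under 1 km away
    {"Pickup_location": (40.8000, -73.9000), "passengers": 1},  # several km away
]
mapped = distance_map(target, source, "target_location", "Pickup_location", 1.0)
```

- calling `distance_map` again with thresholds of 2.0 and 3.0 would correspond to the additional parameters generated for 2 km and 3 km.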
- FIG. 6 is an explanatory view showing an example of another process of generating map parameters based on distances.
- the target table T and source table S2 illustrated in FIG. 6 are similar to the target table T and source table S2 illustrated in FIG.
- the distance map generator 41 generates a parameter that associates each record of the target table T with up to two records of the source table S2, taken in order from the closest distance between the position indicated by the value of the first geographical attribute and the position indicated by the value of the second geographical attribute.
- the distance map generator 41 likewise generates parameters that associate each record of the target table T with records of the source table S2 taken in order from the closest distance between the position indicated by the value of the first geographical attribute and the position indicated by the value of the second geographical attribute.
- the attribute of the "target_location” column of the target table T is the first geographical attribute
- the attribute of the "Pickup_location” column of the source table S2 is the second geographical attribute. These two columns are associated.
- the columns to be associated between the first table and the second table may be specified in advance, or may be specified by the attribute specifying unit 33 described later.
- the parameter P12 illustrated in FIG. 6 is generated.
- map parameters are generated based on the geographical analysis data type, and one map processing is defined based on one map parameter.
- the map data M12 illustrated in FIG. 6 indicates the result of associating each record of the target table T with the two closest records of the source table S2. For example, for each record of the target table, the two closest records from the source table are associated.
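- the nearest-record association described above can be sketched as follows. This is a minimal illustration that assumes planar coordinates and Euclidean distance for brevity; the function name `k_nearest_map` and the sample records are hypothetical.

```python
import heapq
import math

def k_nearest_map(target, source, target_col, source_col, k):
    """For each target record, keep the k source records closest to it."""
    mapped = {}
    for i, t in enumerate(target):
        nearest = heapq.nsmallest(
            k, source, key=lambda s: math.dist(t[target_col], s[source_col]))
        mapped[i] = nearest
    return mapped

# Hypothetical sample data (assumption, for illustration only).
target = [{"target_location": (0.0, 0.0)}, {"target_location": (10.0, 10.0)}]
source = [
    {"id": "a", "Pickup_location": (1.0, 0.0)},
    {"id": "b", "Pickup_location": (0.0, 2.0)},
    {"id": "c", "Pickup_location": (9.0, 9.0)},
]
result = k_nearest_map(target, source, "target_location", "Pickup_location", k=2)
```

- unlike the threshold-based association, every target record here receives the same number of source records (as long as the source table has at least k records), regardless of how far away they are.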
- the area map generator 44 generates map parameters when it receives a similarity function for associating the first table with the second table based on whether positions are included in the same area.
- the example shown in FIG. 2 corresponds to the case where the SameCityMap function is set in the configuration file.
- the area map generator 44 generates map parameters for combining the records included in the first table with the records included in the second table such that the position indicated by the value of the first geographical attribute and the position indicated by the value of the second geographical attribute are included in the same area.
- FIG. 7 is an explanatory view showing an example of a method of determining whether or not it is included in the same area.
- a common area table CAT, in which each area is associated with its extent specified by a polygon, is defined in advance. Examples of common areas include countries, states, cities, and autonomous regions.
- the common areas are defined so that they do not overlap one another, and represent boundary information on the map.
- the common area table CAT may be stored, for example, in the storage unit 80.
- using the common area table CAT, it is determined whether two positions exist in the same area. Specifically, the area containing the position of the record t1 in the target table T is specified, and it is determined whether the position of the record s1 in the source table S is within that area. Hereinafter, the same processing is performed on all the records of the target table T and the source table S.
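- the same-area determination described above can be sketched as follows. This is a minimal illustration that assumes each area in the common area table is a simple polygon given as a list of (x, y) vertices and uses the standard ray-casting test; the names `point_in_polygon`, `area_of`, `in_same_area`, and the sample table `CAT` are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point (x, y) lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count edges crossed by a horizontal ray going right from the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def area_of(point, common_area_table):
    """Return the name of the area containing point, or None if there is none."""
    for name, polygon in common_area_table.items():
        if point_in_polygon(point, polygon):
            return name
    return None

def in_same_area(p, q, common_area_table):
    """True if q lies in the (non-overlapping) area that contains p."""
    area = area_of(p, common_area_table)
    return area is not None and point_in_polygon(q, common_area_table[area])

# Hypothetical common area table with two adjacent square areas (assumption).
CAT = {
    "city_A": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "city_B": [(4, 0), (8, 0), (8, 4), (4, 4)],
}
```

- because the areas are assumed not to overlap, looking up the area of the first position once and testing the second position against that single polygon is sufficient.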
- FIG. 8 is an explanatory view showing an example of processing of generating map parameters based on whether or not it is a common area.
- the target table T and source table S2 illustrated in FIG. 8 are similar to the target table T and source table S2 illustrated in FIG.
- the same area map generator 44 generates a parameter that associates each record of the target table T with the records of the source table S2 for which the position indicated by the value of the first geographical attribute and the position indicated by the value of the second geographical attribute are included in the same area.
- the attribute of the "target_location" column of the target table T is a first geographical attribute
- the attribute of the "Pickup_location” column of the source table S2 is a second geographical attribute. These two columns are associated.
- the columns to be associated between the first table and the second table may be specified in advance, or may be specified by the attribute specifying unit 33 described later.
- the map data M13 illustrated in FIG. 8 indicates the result of associating the records of the source table S2 having the geographical attribute determined to be the same area with the records of the target table T.
- the map data M13 illustrated in FIG. 8 shows an example of association made on the assumption that points less than 1 km apart are located in the same city.
- the inclusion map generator 42 generates map parameters when it receives a similarity function for associating the first table with the second table based on the inclusion relation.
- the example shown in FIG. 2 corresponds to the case where the InclusionMap function is set in the configuration file.
- the inclusion map generator 42 generates map parameters for combining the records included in the first table with the records included in the second table such that the position indicated by the value of the first geographical attribute is included in the area indicated by the value of the second geographical attribute.
- FIG. 9 is an explanatory view showing an example of processing of generating map parameters based on the inclusive relation.
- the target table T illustrated in FIG. 9 is similar to the target table T illustrated in FIG. Further, the source table S1 illustrated in FIG. 9 is a table that associates and records the population in each area, the number of males, and the number of people from 20 to 40 years old.
- the inclusion map generator 42 generates a parameter that associates each record of the target table T with the records of the source table S1 for which the position indicated by the value of the first geographical attribute is included in the area indicated by the value of the second geographical attribute.
- the attribute of the "target_location” column of the target table T is the first geographical attribute
- the attribute of the "community” column of the source table S1 is the second geographical attribute. These two columns are associated.
- the columns to be associated between the first table and the second table may be specified in advance, or may be specified by the attribute specifying unit 33 described later.
- the map data M14 illustrated in FIG. 9 indicates the result of associating each record of the target table with the record of the source table S1 existing in the same area.
- the overlap map generator 43 generates map parameters when it receives a similarity function for associating the first table with the second table based on the overlapping area.
- the example shown in FIG. 2 corresponds to the case where the IntersectMap function is set in the configuration file.
- the overlap map generator 43 generates map parameters for combining the records included in the first table with the records included in the second table such that the area indicated by the value of the first geographical attribute and the area indicated by the value of the second geographical attribute overlap.
- the time difference map generator 31 generates map parameters when it receives a similarity function and a condition (for example, a threshold of similarity) for associating the first table with the second table based on the difference in time.
- the time difference map generator 31 generates a join condition for joining the records included in the first table with the records included in the second table such that the relationship between the value of the first temporal attribute and the value of the second temporal attribute satisfies the degree of the temporal relationship. In the present embodiment, the time difference map generator 31 generates map parameters for combining the records included in the first table with the records included in the second table such that the difference between the value of the first temporal attribute and the value of the second temporal attribute is within the threshold.
- FIG. 10 is an explanatory drawing showing an example of processing for generating map parameters based on the difference in time.
- the target table T and the source table S2 illustrated in FIG. 10 are similar to the target table T and the source table S2 illustrated in FIG.
- the time difference map generator 31 generates a parameter that associates each record of the target table T with the records of the source table S2 for which the difference between the value of the first temporal attribute and the value of the second temporal attribute is within 30 minutes. Furthermore, the time difference map generator 31 generates a parameter that associates each record of the target table T with the records of the source table S2 for which the difference between the value of the first temporal attribute and the value of the second temporal attribute is within 60 minutes.
- the attribute of the "time” column of the target table T is the first temporal attribute
- the attribute of the "pickup_time” column of the source table S2 is the second temporal attribute. These two columns are associated.
- the columns to be associated between the first table and the second table may be specified in advance, or may be specified by the attribute specifying unit 33 described later.
- the map data M15 illustrated in FIG. 10 shows the result of associating each record of the target table T with the record of the source table S2 in which the time difference is within 30 minutes.
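- the time-difference association described above can be sketched as follows. This is a minimal illustration using Python's standard `datetime` types; the function name `time_difference_map` and the sample records are hypothetical and are not taken from FIG. 10.

```python
from datetime import datetime, timedelta

def time_difference_map(target, source, target_col, source_col, threshold):
    """Pair each target record with source records whose time differs by at most threshold."""
    pairs = []
    for t in target:
        for s in source:
            if abs(t[target_col] - s[source_col]) <= threshold:
                pairs.append((t, s))
    return pairs

# Hypothetical sample data (assumption, for illustration only).
target = [{"time": datetime(2015, 1, 8, 22, 0), "pickup_number": 12}]
source = [
    {"pickup_time": datetime(2015, 1, 8, 21, 45)},  # 15 minutes earlier
    {"pickup_time": datetime(2015, 1, 8, 22, 50)},  # 50 minutes later
    {"pickup_time": datetime(2015, 1, 9, 1, 0)},    # 3 hours later
]
within_30 = time_difference_map(target, source, "time", "pickup_time",
                                timedelta(minutes=30))
within_60 = time_difference_map(target, source, "time", "pickup_time",
                                timedelta(minutes=60))
```

- each threshold yields its own map parameter, so the 30-minute and 60-minute associations are produced as separate results.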
- the map generator 32 generates map parameters when it receives a similarity function for associating the first table with the second table. In this embodiment, based on the value of an attribute that is neither a geographical attribute nor a temporal attribute, a parameter that associates a record of the target table with a record of the source table is generated.
- the example shown in FIG. 2 corresponds to the case where the ExactMap function is set in the configuration file.
- the map generator 32 generates map parameters for combining a record included in the first table with a record included in the second table such that the value of the first attribute matches the value of the second attribute.
- FIG. 11 is an explanatory view showing an example of processing of generating map parameters based on text similarity.
- the target table T illustrated in FIG. 11 is a table including data representing the number of passengers (pickup_number) at a certain address.
- the source table S illustrated in FIG. 11 is a table for recording the income average in each area.
- the map generator 32 generates a parameter that associates each record of the target table T with the records of the source table S for which the similarity between the value of the first character string attribute and the value of the second character string attribute is 0.8 or more. Furthermore, the map generator 32 generates parameters that associate each record of the target table T with the records of the source table S for which that similarity is 0.9 or more and 1.0 or more, respectively.
- assume that the attribute of the "address" column of the target table T is registered as the first character string attribute
- and that the attribute of the "address" column of the source table S is registered as the second character string attribute. Then, these two columns are associated. As a result, the parameter P16 illustrated in FIG. 11 is generated.
- the map data M illustrated in FIG. 11 indicates the result of associating each record of the target table T with the record of the source table S having a similarity of 0.8 or more. For example, only one record from the source table is associated with the first record of the target table.
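- the similarity-based association described above can be sketched as follows. The text does not name a specific similarity function, so this minimal illustration assumes the ratio computed by Python's `difflib.SequenceMatcher`; the function names and the sample addresses are hypothetical.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Similarity in [0, 1] between two strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def similarity_map(target, source, target_col, source_col, threshold):
    """Pair each target record with source records whose similarity is at least threshold."""
    pairs = []
    for t in target:
        for s in source:
            if text_similarity(t[target_col], s[source_col]) >= threshold:
                pairs.append((t, s))
    return pairs

# Hypothetical sample data (assumption, for illustration only).
target = [{"address": "1-2-3 Shiba, Minato-ku", "pickup_number": 7}]
source = [
    {"address": "1-2-3 Shiba Minato-ku", "income_average": 500},  # same address, no comma
    {"address": "5 Chuo, Sapporo", "income_average": 400},        # unrelated address
]
matched = similarity_map(target, source, "address", "address", 0.8)
```

- a threshold of 1.0 under this assumed function reduces to an exact (case-insensitive) match, which is one way to read the strictest of the three conditions.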
- FIG. 12 is an explanatory view showing an example of a process of generating map parameters based on the structural similarity.
- the target table T illustrated in FIG. 12 is a table including data representing the number of accesses (access_number) to the Web page identified by a certain URL.
- the source table S illustrated in FIG. 12 is a table for recording the number of accesses (access_number) of the last month of the Web page identified by a certain URL.
- the map generator 32 generates a parameter that associates each record of the target table T with the records of the source table S for which the distance between the value of the first structural attribute and the value of the second structural attribute is 1 or less. Furthermore, the map generator 32 generates parameters that associate each record of the target table T with the records of the source table S for which that distance is 2 or less and 3 or less, respectively.
- assume that the attribute of the "URL" column of the target table T is registered as the first structural attribute
- and that the attribute of the "URL" column of the source table S is registered as the second structural attribute. Then, these two columns are associated. As a result, the parameter P17 illustrated in FIG. 12 is generated.
- the map data M illustrated in FIG. 12 indicates the result of associating each record of the target table T with the records of the source table S having a distance of 1 or less. For example, only one record from the source table is associated with the first record of the target table.
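- one possible way to measure a distance between structural values such as URLs is sketched below. The text does not define the distance, so this is purely an assumption: each URL is treated as a node in a tree formed by its host and path segments, and the distance is the number of tree edges between the two nodes. The names `url_segments` and `structural_distance` are hypothetical.

```python
from urllib.parse import urlsplit

def url_segments(url):
    """Split a URL into its host followed by its non-empty path segments."""
    parts = urlsplit(url)
    return [parts.netloc] + [seg for seg in parts.path.split("/") if seg]

def structural_distance(url_a, url_b):
    """Number of tree edges between two URLs (assumed definition, see lead-in)."""
    a, b = url_segments(url_a), url_segments(url_b)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    # Steps up from a to the shared prefix, then down to b.
    return (len(a) - common) + (len(b) - common)
```

- under this assumed definition, a page and its parent directory are at distance 1, so a threshold of "1 or less" would associate a page with itself and with its immediate parent or children.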
- the attribute specifying unit 33 specifies an attribute having a common viewpoint in the first table and the second table. Specifically, the attribute specifying unit 33 specifies attributes that are the same between the attribute of the data indicated by each column of the first table and the attribute of the data indicated by each column of the second table. For example, in the case of the geographical data type, the attribute specifying unit 33 may specify, from the first table, the first geographical attribute having the geographical data type, and may specify, from the second table, the second geographical attribute having the same geographical data type. By doing this, it is possible to identify columns having geographical data types from each table. In addition, the attribute specifying unit 33 may specify the attributes of the columns of the first table and the second table from the information on the attributes of the columns input to the input unit 10.
- the map parameter generator 30 (more specifically, each generator included in the map parameter generator 30) may store, in the storage unit 80, a parameter that includes the column of the first table containing the first geographical (temporal) attribute and the column of the second table containing the second geographical (temporal) attribute, which are the targets of determination of the geographical (temporal) relationship, together with the degree of the geographical (temporal) relationship.
- the map parameter generator 30 may store the parameter P11 illustrated in FIG. 5 or the parameter P15 illustrated in FIG. 10 in the storage unit 80.
- FIG. 13 is an explanatory view showing an example of the generated map parameter.
- the input unit 10 receives the target table T, the source table S1 and the source table S2 illustrated in FIG. 13, and the C1 portion of the configuration file illustrated in FIG.
- the map parameter P16 is an example of a parameter generated based on the KNearestMap function, with the attribute of the "target_location" column of the target table T as the first geographical attribute and the attribute of the "community" column of the source table S1 as the second geographical attribute.
- the map parameter generator 30 (more specifically, each generator included in the map parameter generator 30) generates 13 map parameters P11 to 16 illustrated in FIG. 13 from these pieces of information.
- the filter parameter generator 50 includes a filter generator (Exact Filter Generator) 51.
- the filter generator 51 generates filter parameters in which the columns of the second table are associated with the extraction conditions applied to the columns.
- the method of generating the filter parameters is arbitrary.
- the filter generator 51 may generate filter parameters based on, for example, the information defined in the C2 portion of the configuration file illustrated in FIG.
- the extraction condition may be stored in advance in the storage unit 80, and the filter generator 51 may read the extraction condition to generate a filter parameter.
- the filter generator 51 may combine a plurality of extraction conditions to generate additional extraction conditions. Also, the number of combinations of extraction conditions is arbitrary.
- the input unit 10 may receive this combined maximum number, for example. For example, as illustrated in FIG. 2, a parameter (“max_combination_filter_length”) indicating the maximum number of combinations may be set in the C4 portion of the configuration file.
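- the combination of extraction conditions up to a maximum length can be sketched as follows. This is a minimal illustration that assumes each extraction condition is a predicate string and that combined conditions are joined with AND; the function name `combine_filters` and the column names in the sample conditions are hypothetical.

```python
from itertools import combinations

def combine_filters(conditions, max_combination_filter_length):
    """Build all conjunctions of 1..max_combination_filter_length conditions."""
    combined = []
    for r in range(1, max_combination_filter_length + 1):
        for combo in combinations(conditions, r):
            combined.append(" AND ".join(combo))
    return combined

# Hypothetical extraction conditions on source-table columns (assumption).
conditions = ["passengers > 1", "trip_distance < 5", "payment_type = 'cash'"]
filters = combine_filters(conditions, 2)
```

- with three base conditions and a maximum length of 2, this yields the three single conditions plus the three pairwise conjunctions; raising the maximum length increases the number of candidates combinatorially, which is why a cap is useful.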
- the aggregation parameter generator 60 (more specifically, each generator included in the aggregation parameter generator 60) generates a parameter representing a method of aggregating data of each row included in the second table.
- the aggregation parameter generator 60 includes a geo aggregation generator (GeoReduce Generator) 70 and a numerical aggregation generator (Numeric Reduce Generator) 61.
- the geo aggregation generator 70 (more specifically, each generator included in the geo aggregation generator 70) generates aggregation parameters representing a method of aggregating the data of each row by the value of the column including the geographical attribute included in the second table. Specifically, the geo aggregation generator 70 calculates the statistical value of the value of the geographical attribute based on the designated aggregation method.
- the method of specifying the aggregation method is arbitrary.
- the input unit 10 may receive designation of the aggregation method.
- the aggregation method may be defined according to the analysis data type of the geographical attribute, and the aggregation parameter may be generated according to the defined aggregation method.
- the "Point" row in the C3 portion defines an aggregation method when the second geographical attribute (more specifically, the geographical data type) is represented by Point.
- ("sum", "distance") defines an aggregation method for calculating, among the records of the second table associated with each record of the first table, the sum of the distances calculated based on the value of the first geographical attribute and the value of the second geographical attribute.
- ("count") defines an aggregation method for calculating, as a statistical value, the number of records of the second table associated with each record (that is, the target variable) of the first table.
- the "Area" line in the C3 portion defines an aggregation method when the second geographical attribute (more specifically, the geographical data type) is represented by an area.
- ("count") defines an aggregation method for calculating, as a statistical value, the number of records of the second table associated with each record (that is, the target variable) of the first table.
- the geo consolidation generator 70 includes a point consolidation generator (Point Reduce Generator) 71 and an area consolidation generator (Area Reduce Generator) 72.
- the point aggregation generator 71 generates an aggregation parameter for calculating a distance statistic calculated based on the value of the first geographical attribute and the value of the second geographical attribute.
- the records of the second table targeted here are records respectively associated with the records of the first table.
- records that satisfy a certain condition, such as the value of the first geographical attribute and the value of the second geographical attribute either matching or falling within a certain range, are associated with each other.
- in other words, when the value of the second geographical attribute satisfies a predetermined condition with respect to the value of the first geographical attribute, the point aggregation generator 71 generates an aggregation parameter for calculating a distance statistic based on the value of the first geographical attribute and the value of the second geographical attribute that satisfies the condition. The calculated statistical value is used as a feature value.
- when at least one of ("sum", "distance"), ("avg", "distance") and ("count") illustrated in FIG. 2 is set in the configuration file, the point aggregation generator 71 may generate aggregation parameters for calculating distance statistics.
- FIG. 14 is an explanatory diagram of an example of a process of generating an aggregation parameter for calculating a distance statistic.
- the point aggregation generator 71 generates an aggregation parameter for calculating the sum and average of the distances between the records of the source table and a record of the target table, and an aggregation parameter for calculating the number of records of the associated source table. For example, as in the aggregation list P21 illustrated in FIG. 14, the point aggregation generator 71 may generate an aggregation parameter in which the column name of the source table to be aggregated, the column name of the target table to be associated, the aggregation content (distance), and the aggregation function are associated.
- Aggregated data R21 illustrated in FIG. 14 shows the result of aggregating map data M11 based on the aggregation parameter for calculating the sum of distances.
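- the aggregation of distances over associated records can be sketched as follows. This is a minimal illustration that assumes the map step has already produced (target record id, distance) pairs; the function name `aggregate_distances` and the sample values are hypothetical and are not taken from FIG. 14.

```python
from collections import defaultdict

def aggregate_distances(mapped_pairs):
    """Compute sum, average, and count of distances per target record.

    mapped_pairs: iterable of (target_id, distance) for associated records.
    """
    grouped = defaultdict(list)
    for target_id, distance in mapped_pairs:
        grouped[target_id].append(distance)
    return {
        tid: {"sum": sum(d), "avg": sum(d) / len(d), "count": len(d)}
        for tid, d in grouped.items()
    }

# Hypothetical mapped data: target 1 has one associated record, target 2 has two.
pairs = [(1, 0.5), (2, 0.25), (2, 0.75)]
features = aggregate_distances(pairs)
```

- each of the three statistics becomes one candidate feature column that can be appended to the target table.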
- the area aggregation generator 72 generates an aggregation parameter for calculating the statistical value of the area calculated based on the value of the second geographical attribute. Similar to the point aggregation generator 71, the records in the second table targeted here are records respectively associated with the records in the first table.
- when at least one of ("sum", "areaSize"), ("avg", "areaSize") and ("count") illustrated in FIG. 2 is set in the configuration file, the area aggregation generator 72 may generate aggregation parameters for calculating area statistics.
- FIG. 15 is an explanatory diagram of an example of a process of generating an aggregation parameter for calculating a region statistical value.
- the area aggregation generator 72 generates an aggregation parameter for calculating the sum and average of the areas of the records of the source table associated with each record of the target table, and an aggregation parameter for calculating the number of records of the associated source table.
- the area aggregation generator 72 may generate an aggregation parameter in which the column name of the source table to be aggregated, the aggregation content (area), and the aggregation function are associated, for example, as in the aggregation list P22 illustrated in FIG.
- Aggregated data R22 illustrated in FIG. 15 shows the result of aggregating the map data M14 based on the aggregation parameter for calculating the sum of the areas.
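- the aggregation of area sizes can be sketched as follows. This is a minimal illustration that assumes each area is a simple polygon given as (x, y) vertices and computes its size with the shoelace formula; the function names `polygon_area` and `aggregate_area_sizes` and the sample polygons are hypothetical.

```python
def polygon_area(polygon):
    """Area of a simple polygon from its vertices (shoelace formula)."""
    n = len(polygon)
    twice_area = sum(
        polygon[i][0] * polygon[(i + 1) % n][1]
        - polygon[(i + 1) % n][0] * polygon[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2

def aggregate_area_sizes(areas):
    """Sum, average, and count of the sizes of the areas associated with one target record."""
    sizes = [polygon_area(p) for p in areas]
    return {
        "sum": sum(sizes),
        "avg": sum(sizes) / len(sizes) if sizes else 0.0,
        "count": len(sizes),
    }

# Hypothetical areas associated with a single target record (assumption).
associated = [
    [(0, 0), (2, 0), (2, 2), (0, 2)],  # square of size 4
    [(0, 0), (4, 0), (0, 3)],          # right triangle of size 6
]
stats = aggregate_area_sizes(associated)
```

- returning 0.0 as the average when no area is associated is a design choice of this sketch; a real implementation might prefer a missing value instead.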
- the numerical aggregation generator 61 generates an aggregation parameter representing a method of aggregating the data of each row by the value of a column including a numeric (Numeric) attribute (hereinafter referred to as a numerical attribute) included in the second table. Specifically, the numerical aggregation generator 61 calculates statistical values of numerical values based on the designated aggregation method.
- the method of specifying the aggregation method is arbitrary. Similar to the geo aggregation generator 70, for example, the input unit 10 may receive specification of the aggregation method. Specifically, as exemplified in the C3 portion of the configuration file of FIG. 2, an aggregation method for numerical attributes may be defined, and an aggregation parameter may be generated according to the defined aggregation method. In the example shown in FIG. 2, designation is made to generate an aggregation parameter for calculating the sum and average of the columns of numerical attributes.
- the aggregation parameter generator 60 (more specifically, each generator included in the aggregation parameter generator 60) may store the generated aggregation parameter in the storage unit 80.
- FIG. 16 is an explanatory diagram of an example of the generated aggregation parameter. As shown in the example described above, the input unit 10 receives the target table T, the source table S1 and the source table S2 illustrated in FIG. 16, and the C3 portion of the configuration file illustrated in FIG.
- the aggregation parameter P23 is an example of an aggregation parameter for a column of numerical attributes of the source table S2.
- the aggregation parameter P24 is an example of an aggregation parameter for a column of numerical attributes of the source table S1.
- the aggregation parameter generator 60 (more specifically, each generator included in the aggregation parameter generator 60) generates 16 aggregation parameters P21 to P24 illustrated in FIG. 16 from these pieces of information.
- the feature quantity generation function generator 81 generates a feature quantity generation function for generating the above-mentioned feature quantity from the first table and the second table. Specifically, the feature quantity generation function generator 81 generates a feature quantity generation function using (combining) the combination condition (map parameter) and the aggregation condition (aggregation parameter) described above. Further, the feature quantity generation function generator 81 may generate a feature quantity generation function using (in combination with) the extraction condition (filter parameter) in addition to the combination condition and the aggregation condition.
- the feature quantity generation function generator 81 may generate, among the combining conditions (map parameters), a map parameter in which a map parameter for a geographical attribute and a map parameter for a temporal attribute are combined in advance.
- the feature quantity generation function generator 81 may determine to combine the map parameter for the geographical attribute with the map parameter for the temporal attribute when, for example, "True" is set to the parameter "time_spatial_map_combination", as shown in the C4 part of the configuration file illustrated in FIG. 2.
- the procedure of the feature quantity generation function generator 81 generating a feature quantity generation function will be specifically described.
- the target table T and source tables S1 and S2 illustrated in FIG. 13 are input.
- the variable to be predicted is a variable that represents the number of passengers (pickup_number) included in the target table T.
- FIG. 18 is an explanatory view showing an example of a method of generating a feature quantity generation function by combining parameters.
- FIG. 18A shows a combination example of generating a feature quantity generation function for generating a feature quantity from the target table T and the source table S1.
- FIG. 18B shows a combination example of generating a feature quantity generation function for generating a feature quantity from the target table T and the source table S2.
- map parameters in which map parameters for geographical attributes and map parameters for temporal attributes are combined are used.
- 4 map parameters and 9 aggregation parameters are generated.
- the feature value generation function generator 81 selects one parameter each from the map parameter, the filter parameter, and the aggregation parameter, and generates a combination of each parameter.
- 14 map parameters and 7 aggregation parameters are generated.
- the feature value generation function generator 81 selects one parameter from each of the map parameters and the aggregation parameter, and generates a combination of each parameter.
- the feature quantity generation function generator 81 generates a feature quantity generation function based on the generated combination. Specifically, the feature quantity generation function generator 81 converts the parameters included in the generated combination into a form of a query language for performing manipulation and definition of table data.
- the feature value generation function generator 81 may use, for example, SQL as a query language.
- the feature quantity generation function generator 81 may generate each feature quantity generation function by applying each parameter to a template for generating an SQL statement. Specifically, a template for generating an SQL statement by fitting each parameter is prepared in advance, and the feature quantity generation function generator 81 sequentially applies each parameter included in the generated combination to the template. You may generate SQL statements.
- the feature quantity generation function is defined as a SQL statement, and each selected parameter corresponds to a parameter for generating the SQL statement.
- Defining feature quantities using combinations of these parameters makes it possible to express many types of feature quantity generation functions as combinations of simple elements. Therefore, multiple table data can be used to efficiently generate a large number of feature quantity candidates. For example, in the case of the above-described example, 130 types of feature quantities can be easily generated simply by generating 4 map parameters and 9 aggregation parameters, and 14 map parameters and 7 aggregation parameters. Further, since the definition of each parameter once generated can be reused, the effect of reducing the number of man-hours for generating the feature quantity generation function can also be obtained.
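- the generation of feature quantity generation functions by fitting parameter combinations into an SQL template can be sketched as follows. This is a minimal illustration, not the template of the embodiment: the template text, the function name `generate_feature_queries`, the SQL function `distance_km`, and the column names are all hypothetical assumptions.

```python
from itertools import product

# Hypothetical SQL template (assumption); placeholders are filled per combination.
TEMPLATE = (
    "SELECT t.id, {agg}(s.{agg_col}) AS feature "
    "FROM {target} AS t LEFT JOIN {source} AS s ON {map_cond} "
    "GROUP BY t.id"
)

def generate_feature_queries(target, source, map_params, reduce_params):
    """Generate one SQL statement per (map parameter, aggregation parameter) pair."""
    queries = []
    for map_cond, (agg, agg_col) in product(map_params, reduce_params):
        queries.append(TEMPLATE.format(
            agg=agg, agg_col=agg_col, target=target,
            source=source, map_cond=map_cond))
    return queries

# Hypothetical parameters; distance_km is an assumed user-defined SQL function.
map_params = [
    "distance_km(t.target_location, s.Pickup_location) <= 1",
    "distance_km(t.target_location, s.Pickup_location) <= 2",
]
reduce_params = [("SUM", "passengers"), ("AVG", "passengers"), ("COUNT", "passengers")]
queries = generate_feature_queries("T", "S2", map_params, reduce_params)
```

- the number of generated statements is the product of the numbers of map and aggregation parameters (here 2 x 3), which illustrates how a small set of reusable parameter definitions expands into many feature quantity candidates.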
- the feature amount generator 82 generates a feature amount using a feature amount generation function.
- the feature amount generation function includes a parameter for calculating the above-described distance statistical value.
- the feature amount generator 82 may calculate the statistical value of the distance by performing, based on the feature amount generation function, an operation of aggregating the records of the second table that satisfy the predetermined condition for each record of the first geographical attribute.
- the feature quantity generator 82 performs, as an operation of aggregating the records of the second table, the geographical attribute of the second table satisfying the predetermined condition with respect to each record of the first geographical attribute. The sum and / or the average of the distances may be calculated. Then, the feature quantity generator 82 may add at least one of the sum and the average of the calculated distances as the feature quantity to the attribute of the first table.
- the feature quantity generator 82 is a record of the geographical attribute of the second table which satisfies a predetermined condition for each record of the first geographical attribute as an operation of aggregating the records of the second table. The number may be calculated. Then, the feature quantity generator 82 may add the calculated number of records as the feature quantity to the attribute of the first table.
- the feature quantity generator 82 since the feature quantity generator 82 also performs processing for adding the generated feature quantity to the attribute of the first table, the feature quantity generator 82 can be called attribute addition means.
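The distance aggregation described above can be sketched as follows, assuming that both geographical attributes are (longitude, latitude) points and that the predetermined condition is "within `max_km` kilometers". The haversine distance, the field name `loc`, and the function names are assumptions made for this illustration.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def add_distance_features(first_table, second_table, max_km):
    """Add count, sum, and mean of distances to nearby second-table records."""
    result = []
    for row in first_table:
        distances = [haversine_km(row["loc"], s["loc"]) for s in second_table]
        near = [d for d in distances if d <= max_km]  # assumed condition
        enriched = dict(row)
        enriched["near_count"] = len(near)
        enriched["dist_sum"] = sum(near)
        enriched["dist_mean"] = sum(near) / len(near) if near else None
        result.append(enriched)
    return result


# Hypothetical data: one target record and three source records.
first_table = [{"id": 1, "loc": (139.70, 35.68)}]
second_table = [
    {"loc": (139.70, 35.68)},  # same point
    {"loc": (139.71, 35.68)},  # roughly 0.9 km east
    {"loc": (135.50, 34.70)},  # several hundred km away
]
features = add_distance_features(first_table, second_table, max_km=2.0)
print(features[0])
```

Each target record keeps its original attributes and gains three new ones, corresponding to the number of records, the sum of distances, and the average distance named in the text.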
- the feature quantities generated by the feature quantity generator 82 can also be said to be candidates for feature quantities because they become candidates when the feature quantity selector 83 described later selects feature quantities.
- the feature quantity generator 82 may directly generate feature quantity candidates from the first table and the second table using a join condition based on the similarity function and an aggregation condition.
- the join condition is a condition for joining a record of the first table and a record of the second table when the similarity calculated from the value of the first attribute in the former and the value of the second attribute in the latter satisfies the condition on the similarity.
- the aggregation condition is a condition represented by an aggregation method for a plurality of records in the second table and a column that is the object of the aggregation.
- the feature quantity generator 82 may generate a number of feature quantities by combining a plurality of join conditions and a plurality of aggregation conditions.
- the same effect as the process of generating the feature quantity generation function by the feature quantity generation function generator 81 described above can be obtained.
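A similarity-based join followed by aggregation can be sketched as follows. The token-level Jaccard similarity, the threshold of 0.5, and the field names `address` and `sales` are assumptions chosen for this illustration; the source leaves the similarity function and its condition to the user.

```python
def token_jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def generate_candidates(first_table, second_table, similarity, threshold,
                        aggregations):
    """Join records whose similarity meets the threshold, then aggregate."""
    rows = []
    for rec in first_table:
        joined = [s for s in second_table
                  if similarity(rec["address"], s["address"]) >= threshold]
        enriched = dict(rec)
        # Each aggregation condition is (aggregation method, target column).
        for name, (func, column) in aggregations.items():
            values = [s[column] for s in joined]
            enriched[name] = func(values) if values else None
        rows.append(enriched)
    return rows


# Hypothetical tables.
first = [{"id": 1, "address": "1-2 Shibuya Tokyo"}]
second = [
    {"address": "Shibuya Tokyo", "sales": 10},
    {"address": "2-1 Shibuya Tokyo", "sales": 30},
    {"address": "Namba Osaka", "sales": 99},
]
aggregations = {
    "sales_sum": (sum, "sales"),
    "sales_max": (max, "sales"),
    "sales_count": (len, "sales"),
}
candidates = generate_candidates(first, second, token_jaccard, 0.5, aggregations)
print(candidates[0])
```

Passing several aggregation conditions at once yields one feature quantity candidate per condition for the same join condition.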
- the feature amount selector 83 selects a feature amount optimal for prediction from the generated feature amounts.
- the method of feature-value selection is arbitrary.
- the feature quantity selector 83 may select feature quantities using, for example, L1 regularization.
- the algorithm used to select feature quantities is not limited to L1 regularization.
- the feature quantity selector 83 may select the feature quantity most suitable for prediction according to the algorithm used for selecting the feature quantity.
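As one possible reading of selection by L1 regularization, the sketch below fits a lasso model by coordinate descent and keeps the feature quantities whose weights are non-zero. The solver, the absence of an intercept (the data is assumed to be centered), and the value of `alpha` are assumptions; the source does not specify how the regularized model is fitted.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha, returning 0.0 inside [-alpha, alpha]."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_select(X, y, alpha, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + alpha*||w||_1 by coordinate descent.

    Returns the indices of features with non-zero weight and the weights.
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        for j in range(d):
            # Correlation of feature j with the residual that excludes j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return [j for j in range(d) if w[j] != 0.0], w


# Hypothetical candidates: column 0 drives y, column 1 is unrelated.
X = [[1.0, 1.0], [-1.0, 1.0], [2.0, -1.0], [-2.0, -1.0]]
y = [3.0, -3.0, 6.0, -6.0]
selected, weights = lasso_select(X, y, alpha=0.5)
print(selected, weights)
```

The L1 penalty drives the weight of the unrelated candidate to exactly zero, so only the informative candidate remains selected.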
- the output unit 90 outputs the generated feature amount.
- the output unit 90 may output only the feature amount selected by the feature amount selector 83, or may output all the feature amounts generated by the feature amount generator 82.
- the learning unit 91 learns a prediction model using the generated feature amount.
- the learning unit 91 learns, for example, a prediction model using the added attribute as a feature amount.
- the learning unit 91 applies the data of the first table and the second table to the generated feature amount to generate training data.
- the learning unit 91 learns a model that predicts the value of the prediction target, using the generated feature quantity as an explanatory variable candidate.
- the learning method of a model is arbitrary.
- the prediction unit 92 performs prediction using the model learned by the learning unit 91. Specifically, the prediction unit 92 applies the data of the first table and the second table to the generated feature amount to generate data for prediction. Then, the prediction unit 92 applies the generated data for prediction to the learned model to obtain a prediction result.
- the map parameter generator 30 more specifically includes the geomap generator 40 (more specifically, the distance map generator 41, the inclusion map generator 42, the overlap map generator 43, and the same area map generator 44), a time difference map generator 31, a map generator 32, and an attribute specifying unit 33.
- the aggregation parameter generator 60 is realized by the geo aggregation generator 70 (more specifically, the point aggregation generator 71 and the area aggregation generator 72) and the numerical aggregation generator 61.
- the program is stored in the storage unit 80, and the processor may read the program and, according to the program, operate as the input unit 10, the geocoder 20, the map parameter generator 30, the filter parameter generator 50, the aggregation parameter generator 60, the feature quantity generation function generator 81, the feature quantity generator 82, the feature quantity selector 83, the output unit 90, the learning unit 91, and the prediction unit 92.
- the functions of the information processing system may be provided in the form of Software as a Service (SaaS).
- the input unit 10, the geocoder 20, the map parameter generator 30, the filter parameter generator 50, the aggregation parameter generator 60, the feature quantity generation function generator 81, the feature quantity generator 82, the feature quantity selector 83, the output unit 90, the learning unit 91, and the prediction unit 92 may each be realized by dedicated hardware.
- part or all of each component of each device may be realized by a general purpose or dedicated circuit, a processor, or the like, or a combination thereof. These may be configured by a single chip or may be configured by a plurality of chips connected via a bus. A part or all of each component of each device may be realized by a combination of the above-described circuits and the like and a program.
- the plurality of information processing devices, circuits, and the like may be arranged centrally or in a distributed manner.
- the information processing apparatus, the circuit, and the like may be realized as a form in which each is connected via a communication network, such as a client and server system, a cloud computing system, and the like.
- the information processing system 100 of the present embodiment may be realized as a single information processing apparatus.
- because a part or all of the information processing system 100 according to the present embodiment performs the above-described process of generating feature quantities, an apparatus including the function of performing that process can be called a feature quantity generation apparatus.
- FIG. 19 is a flowchart illustrating an example of a process of generating a join condition.
- the input unit 10 acquires a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute (step S11). Further, the input unit 10 receives the geographical relationship and the degree of the geographical relationship (step S12).
- the map parameter generator 30 generates a join condition for joining the records included in the first table and the records included in the second table such that the relation between the value of the first geographical attribute and the value of the second geographical attribute satisfies the degree of the geographical relationship (step S13).
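Steps S11 to S13 can be sketched as follows, assuming that the geographical relationship is "distance" and that its degree is a radius in kilometers. The function names, the attribute names `loc` and `pos`, and the restriction to point data are assumptions for this illustration.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (longitude, latitude) points."""
    lon1, lat1, lon2, lat2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def make_geo_join_condition(relationship, degree_km):
    """Build a predicate over (first, second) geographical attribute values."""
    if relationship == "distance":
        return lambda first, second: haversine_km(first, second) <= degree_km
    raise ValueError(f"unsupported relationship: {relationship}")


def join_records(first_table, second_table, condition, first_attr, second_attr):
    """Pair every first-table record with the second-table records it matches."""
    return [(a, b) for a in first_table for b in second_table
            if condition(a[first_attr], b[second_attr])]


# Hypothetical tables (step S11) and relationship with its degree (step S12).
stores = [{"store": "A", "loc": (139.700, 35.680)}]
stations = [
    {"name": "near", "pos": (139.705, 35.680)},  # about 0.45 km away
    {"name": "far", "pos": (139.800, 35.680)},   # about 9 km away
]
within_1km = make_geo_join_condition("distance", 1.0)  # step S13
pairs = join_records(stores, stations, within_1km, "loc", "pos")
print([(a["store"], b["name"]) for a, b in pairs])
```

Representing the join condition as a predicate keeps it independent of how each table encodes its geographical attribute, as long as both sides have been converted to points beforehand.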
- FIG. 20 is a flowchart showing another example of the process of generating a join condition.
- the input unit 10 acquires a first table including a prediction target and a first temporal attribute, and a second table including a second temporal attribute (step S21). Also, the input unit 10 receives a temporal relationship and a degree of the temporal relationship (step S22).
- the map parameter generator 30 generates a join condition for joining the records included in the first table and the records included in the second table such that the relationship between the value of the first temporal attribute and the value of the second temporal attribute satisfies the degree of the temporal relationship (step S23).
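The temporal case in steps S21 to S23 can be sketched in the same way, assuming that both temporal attributes are timestamps and that the degree is a time span. The relationship names `"within"` and `"before"` and the function name are assumptions introduced for this example.

```python
from datetime import datetime, timedelta


def make_temporal_join_condition(relationship, degree):
    """Build a predicate over (first, second) timestamp values.

    relationship: "within" (at most `degree` apart in either direction) or
                  "before" (second precedes first by at most `degree`).
    degree: a datetime.timedelta.
    """
    if relationship == "within":
        return lambda first, second: abs(first - second) <= degree
    if relationship == "before":
        return lambda first, second: timedelta(0) <= first - second <= degree
    raise ValueError(f"unsupported relationship: {relationship}")


# Hypothetical timestamps.
target_time = datetime(2018, 6, 12, 12, 0)
source_times = [
    datetime(2018, 6, 12, 11, 30),  # 30 minutes earlier
    datetime(2018, 6, 12, 12, 30),  # 30 minutes later
    datetime(2018, 6, 12, 13, 30),  # 90 minutes later
]

within_hour = make_temporal_join_condition("within", timedelta(hours=1))
before_hour = make_temporal_join_condition("before", timedelta(hours=1))

matched_within = [t for t in source_times if within_hour(target_time, t)]
matched_before = [t for t in source_times if before_hour(target_time, t)]
print(len(matched_within), len(matched_before))
```

A directional relationship such as "before" is useful when only source records that precede the target record should contribute to a feature quantity.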
- FIG. 21 is a flowchart illustrating an example of processing for generating a feature amount.
- the input unit 10 acquires a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute (step S31).
- the feature quantity generator 82 calculates a distance statistic when the value of the second geographical attribute with respect to the value of the first geographical attribute satisfies a predetermined condition (step S32), and adds the calculated statistic to the attributes of the first table as a feature quantity (step S33).
- FIG. 22 is a flowchart illustrating another example of the process of generating the feature amount.
- the input unit 10 acquires a first table including the prediction target and the first attribute and a second table including the second attribute (step S41).
- the input unit 10 also receives a similarity function used to calculate the similarity between the first attribute and the second attribute, and a condition for the similarity (for example, a threshold for the similarity) (step S42).
- the feature quantity generator 82 generates feature quantity candidates from the first table and the second table using the combination condition and the aggregation condition calculated using the similarity function (step S43).
- the feature amount selector 83 selects a feature amount optimal for prediction from the feature amount candidates (step S44).
- the input unit 10 acquires the first table including the prediction target and the first geographical attribute, and the second table including the second geographical attribute.
- the input unit 10 receives a geographical relationship and the degree of the geographical relationship.
- the map parameter generator 30 then generates a join condition for joining the records included in the first table and the records included in the second table such that the relation between the value of the first geographical attribute and the value of the second geographical attribute satisfies the degree of the geographical relationship.
- the input unit 10 acquires a first table including a prediction target and a first temporal attribute, and a second table including a second temporal attribute.
- the input unit 10 receives a temporal relationship and a degree of the temporal relationship. Then, the map parameter generator 30 generates a join condition for joining the records included in the first table and the records included in the second table such that the relationship between the value of the first temporal attribute and the value of the second temporal attribute satisfies the degree of the temporal relationship.
- the input unit 10 acquires a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute. Then, when the value of the second geographical attribute with respect to the value of the first geographical attribute satisfies a predetermined condition, the feature quantity generator 82 adds a statistic of the distance calculated based on the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition to the attributes of the first table as a feature quantity, which is a variable that can affect the prediction target. Therefore, feature quantities can be efficiently generated from a plurality of information sources having geographical information.
- the input unit 10 acquires a first table including a prediction target and a first attribute, and a second table including a second attribute. Further, the input unit 10 receives a similarity function used to calculate the similarity between the first attribute and the second attribute, and a condition for the similarity. Then, the feature quantity generator 82 generates feature quantity candidates from the first table and the second table using the join condition and the aggregation condition calculated using the similarity function, and the feature quantity selector 83 selects a feature quantity optimal for prediction from the feature quantity candidates. Therefore, the analyst man-hours required to generate feature quantities can be reduced.
- FIG. 23 is a block diagram showing an outline of a feature quantity generation apparatus according to the present invention.
- the feature quantity generation device 280 according to the present invention includes a table acquisition unit 281 (for example, the input unit 10) that acquires a first table (for example, a target table) including a prediction target and a first geographical attribute and a second table (for example, a source table) including a second geographical attribute, and an attribute adding unit 282 (for example, the feature quantity generator 82) that, when the value of the second geographical attribute with respect to the value of the first geographical attribute satisfies a predetermined condition, adds a statistic of the distance calculated based on the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition to the attributes of the first table as a feature quantity, which is a variable that can affect the prediction target.
- the attribute adding unit 282 may calculate the distance statistical value by performing an operation of aggregating the records of the second table satisfying the predetermined condition for each record of the first geographical attribute.
- as the operation of aggregating the records of the second table, the attribute adding unit 282 may calculate, for each record of the first geographical attribute, the sum and/or the average of the distances to the geographical attributes of the second table that satisfy the predetermined condition, and add the result to the attributes of the first table.
- as the operation of aggregating the records of the second table, the attribute adding unit 282 may calculate, for each record of the first geographical attribute, the number of records of the geographical attribute of the second table that satisfy the predetermined condition, and add the result to the attributes of the first table.
- the feature quantity generation device 280 may include a learning unit (for example, a learning unit 91) that learns a prediction model using the added attribute as a feature quantity.
- FIG. 24 is a schematic block diagram showing the configuration of a computer according to at least one embodiment.
- the computer 1000 includes a processor 1001, a main storage 1002, an auxiliary storage 1003, and an interface 1004.
- the above-described information processing system is implemented in a computer 1000.
- the operation of each processing unit described above is stored in the auxiliary storage device 1003 in the form of a program (combination condition generation program).
- the processor 1001 reads a program from the auxiliary storage device 1003 and expands it in the main storage device 1002, and executes the above processing according to the program.
- the auxiliary storage device 1003 is an example of a non-transitory tangible medium.
- Other examples of non-transitory tangible media include magnetic disks connected via an interface 1004, magneto-optical disks, CD-ROMs, DVD-ROMs, semiconductor memories, and the like.
- when the program is distributed to the computer 1000, the computer 1000 that has received the program may expand it in the main storage device 1002 and execute the above processing.
- the program may be for realizing a part of the functions described above.
- the program may be a so-called difference file (difference program) that realizes the above-described function in combination with other programs already stored in the auxiliary storage device 1003.
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Remote Sensing (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A table acquisition means 281 acquires a first table including prediction objects and first geographical attributes, and a second table including second geographical attributes. If the values of the second geographical attributes satisfy a prescribed condition with respect to the values of the first geographical attributes, an attribute adding means 282 adds, to the attributes of the first table, as feature values, i.e. variables which may affect the prediction objects, statistical values of distances calculated on the basis of the values of the first geographical attributes and the values of the second geographical attributes which satisfy the condition.
Description
The present invention relates to a feature quantity generation device, a feature quantity generation method, and a feature quantity generation program that generate feature quantities by joining a plurality of tables.
Data mining is a technology for finding useful, previously unknown knowledge in a large amount of information. To find such knowledge, it is important to generate a larger number of attribute candidates. Specifically, it is important to generate many candidates for attributes (explanatory variables) that can affect the variable to be predicted (the objective variable). Generating many such candidates increases the likelihood that attributes useful for prediction are included among them.
For example, Patent Document 1 describes generating candidates for feature quantities used in machine learning by joining a target table that includes an objective variable with a source table that does not include it. In the method described in Patent Document 1, the process of generating feature quantity candidates is defined by a combination of three conditions (a Filter condition, a map condition, and a reduce condition), which reduces the analyst man-hours needed to generate the candidates.
Further, Patent Document 2 describes a demand prediction device that uses regression analysis to predict the number of requests for a dispatch service of vehicles such as taxis in a prediction target area. The demand prediction device described in Patent Document 2 acquires estimated population information for a predetermined area and uses it as an explanatory variable in the regression analysis.
The present inventor conceived that, when predicting some target within a predetermined area, using diverse information sources improves prediction accuracy. That is, it is considered preferable to obtain information by combining a plurality of related information sources.
For example, Patent Document 1 illustrates using a customer ID, which is included in both the target table and the source table, as the join condition (that is, the map condition) between the two tables. Patent Document 2 describes that the prediction target area, which is the unit for predicting the number of service requests, and the predetermined area, which is the unit of the estimated population information used as an explanatory variable, are defined by the same criteria (area ID and area polygon).
However, the present inventor found that, when diverse information sources are to be used for prediction, the way geographical information is defined in each information source may differ from the way it is defined for the prediction. For example, geographical information can be specified by latitude and longitude, or by the name of a municipality. The present inventor further found that, in such cases, generating feature quantity candidates for predicting the prediction target from each information source can become cumbersome.
That is, Patent Documents 1 and 2 assume that the information sources are associated by a customer ID or by the same criteria. However, even when geographical information is used to associate the information sources, that information is not necessarily defined by the same criteria. It is therefore difficult to associate these information sources in a simple way, and data analysis using such information requires a very large number of man-hours.
Therefore, an object of the present invention is to provide a feature quantity generation device, a feature quantity generation method, and a feature quantity generation program that can efficiently generate feature quantities from a plurality of information sources having geographical information.
A feature quantity generation device according to the present invention includes: table acquisition means for acquiring a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute; and attribute adding means for adding, when the value of the second geographical attribute with respect to the value of the first geographical attribute satisfies a predetermined condition, a statistic of the distance calculated based on the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition to the attributes of the first table as a feature quantity, which is a variable that can affect the prediction target.
A feature quantity generation method according to the present invention includes: acquiring a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute; and adding, when the value of the second geographical attribute with respect to the value of the first geographical attribute satisfies a predetermined condition, a statistic of the distance calculated based on the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition to the attributes of the first table as a feature quantity, which is a variable that can affect the prediction target.
A feature quantity generation program according to the present invention causes a computer to execute: a table acquisition process of acquiring a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute; and an attribute addition process of adding, when the value of the second geographical attribute with respect to the value of the first geographical attribute satisfies a predetermined condition, a statistic of the distance calculated based on the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition to the attributes of the first table as a feature quantity, which is a variable that can affect the prediction target.
According to the present invention, the technical means described above provide the technical effect that feature quantities can be efficiently generated from a plurality of information sources having geographical information.
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
The information processing system of the present embodiment acquires a table that includes the variable to be predicted (for example, an objective variable), hereinafter also referred to as the first table, and a table different from the first table, hereinafter also referred to as the second table. In the following description, the first table may be referred to as the target table, and the second table as the source table. The first table and the second table may each include a set of data.
In the present embodiment, the first table and the second table each include an attribute that shares a common viewpoint. Sharing a common viewpoint means that the semantic content of the attribute's data is common. The representation of the data may be the same or different. Hereinafter, the attribute included in the first table is referred to as the first attribute, and the attribute included in the second table is referred to as the second attribute.
For example, attributes that share a common viewpoint include attributes with a geographical viewpoint and attributes with a temporal viewpoint. The values of an attribute with a geographical viewpoint can be classified into the following four geographical data types. The text after the colon in each heading gives the syntax of the data.
(1) Point P (Point): $p = (x, y) \in P$
A point P is represented as (longitude, latitude) coordinates.
(2) Polygon G (Polygon): $g = (b_1, b_2, \ldots, b_n) \in G$
A polygon G is defined by one outer boundary $b_1$ and zero or more inner boundaries $(b_2, \ldots, b_n)$. Here, $b_1 = (p_1, p_2, \ldots, p_n)$ (where $p_1, p_2, \ldots, p_n \in P$) is the boundary of a closed ring defined as an ordered sequence of three or more points.
(3) Multi-polygon M (MultiPolygon): $m = (g_1, g_2, \ldots, g_n) \in M$, where $g_1, g_2, \ldots, g_n \in G$
A multi-polygon M is composed of one or more polygons.
(4) String S (String): $s \in S$
An address represented by a character string.
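The four geographical data types listed above can be sketched as simple Python types. The class and field names are assumptions for this illustration; only the constraints stated in the text (a ring of three or more points, one outer boundary with zero or more inner boundaries, and one or more polygons per multi-polygon) are enforced.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)
Ring = List[Point]           # closed boundary given as an ordered point list
Address = str                # address represented by a character string


@dataclass
class Polygon:
    outer: Ring                                      # outer boundary b_1
    inner: List[Ring] = field(default_factory=list)  # inner boundaries b_2..b_n

    def __post_init__(self):
        for ring in [self.outer, *self.inner]:
            if len(ring) < 3:
                raise ValueError("a boundary needs at least three points")


@dataclass
class MultiPolygon:
    polygons: List[Polygon]

    def __post_init__(self):
        if not self.polygons:
            raise ValueError("a multi-polygon needs at least one polygon")


# Hypothetical values for each data type.
station: Point = (139.7671, 35.6812)
block = Polygon(outer=[(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)])
district = MultiPolygon(polygons=[block])
office: Address = "1-1 Example Street"
print(station, len(block.outer), len(district.polygons), office)
```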
Analysis data types may also be defined in association with data types, as semantic information relevant to data analysis. For example, for the geographical viewpoint described above, the polygon G and the multi-polygon M may be defined as analysis data types for an area (Area), and the point P may be defined as an analysis data type for a point (Point). A character string representing an address may be defined as an analysis data type for, for example, a country, city, town, landmark, street, or point. Hereinafter, an analysis data type representing geographical information may be referred to as a geographical data type.
Also, for example, the type of an attribute with a temporal viewpoint (the temporal data type) can be defined as a timestamp (TimeStamp) type.
Hereinafter, when the attribute sharing a common viewpoint is a geographical attribute, the attribute included in the first table is referred to as the first geographical attribute, and the attribute included in the second table is referred to as the second geographical attribute. Similarly, when the attribute sharing a common viewpoint is a temporal attribute, the attribute included in the first table is referred to as the first temporal attribute, and the attribute included in the second table is referred to as the second temporal attribute. Other attributes are named in the same way. The first geographical attribute may be the primary key of the first table.
The examples above use the geographical and temporal viewpoints, but the common attribute is not limited to these viewpoints. Other examples include a character string viewpoint and a structural viewpoint. The value of an attribute with a character string viewpoint is, for example, an address. The value of an attribute with a structural viewpoint is, for example, a URL (Uniform Resource Locator) or a tree structure path. For ease of explanation, the following description focuses mainly on geographical and temporal attributes as attributes sharing a common viewpoint.
FIG. 1 is a block diagram showing an embodiment of an information processing system according to the present invention. The information processing system 100 of the present embodiment includes an input unit 10, a geocoder (Geo-Coder) 20, a map parameter generator (Map Parameter Generator) 30, a filter parameter generator (Filter Parameter Generator) 50, an aggregation parameter generator (Reduce Parameter Generator) 60, a storage unit 80, a feature quantity generation function generator (Feature Descriptor Generator) 81, a feature quantity generator (Feature Generator) 82, a feature quantity selector (Feature Selector) 83, an output unit 90, a learning unit 91, and a prediction unit 92.
The input unit 10 acquires a first table and a second table. Because the input unit 10 acquires each table, it can also be called a table acquisition means. The input unit 10 may acquire a plurality of second tables. For example, when the storage unit 80 stores the first table and the second table, the input unit 10 may acquire them from the storage unit 80. The input unit 10 may also acquire the first table and the second table from another system or storage unit via a communication network (not shown).
For example, when a geographical viewpoint is shared, the input unit 10 may acquire a first table including the prediction target and the first geographical attribute, and a second table including the second geographical attribute. Likewise, when a temporal viewpoint is shared, the input unit 10 may acquire a first table including the prediction target and the first temporal attribute, and a second table including the second temporal attribute. The input unit 10 may also acquire a first table including the prediction target and a first character-string attribute together with a second table including a second character-string attribute, or a first table including the prediction target and a first structural attribute together with a second table including a second structural attribute. Structural attributes are described later.
Furthermore, the input unit 10 receives a function for calculating the similarity between the first attribute and the second attribute (hereinafter referred to as a similarity function) and a condition for deciding at what degree of similarity the value of the first attribute and the value of the second attribute are judged to be similar (hereinafter also referred to as a condition on the similarity). The similarity function may be expressed as a mathematical expression or as parameters. The condition on the similarity may be expressed as a threshold for judging, based on the degree of the relationship, whether similarity is present (hereinafter simply referred to as a similarity threshold), or as an expression that outputs whether the values are similar according to parameters or the like.
For example, when a geographical viewpoint is shared, the input unit 10 may receive a geographical relationship as the similarity function and a similarity threshold indicating the degree of the geographical relationship as the condition. That is, when the first attribute and the second attribute are geographical attributes, the similarity function is defined, for example, as a function that yields a higher similarity the shorter the distance.
Also, for example, when a temporal viewpoint is shared, the input unit 10 may receive a temporal relationship as the similarity function and a similarity threshold indicating the degree of the temporal relationship as the condition. That is, when the first attribute and the second attribute are temporal attributes, the similarity function is defined, for example, as a function that yields a higher similarity the smaller the time difference.
In addition, when a character-string viewpoint is shared, the input unit 10 may receive a character-string relationship as the similarity function and a similarity threshold indicating the degree of the character-string relationship as the condition. Specifically, when the first attribute and the second attribute are character-string attributes, the similarity function is defined, for example, as a function that yields a higher similarity the more closely the two texts match. One example of a text similarity is the Simpson coefficient computed over morphemes.
Define morph(a) as the set of morphemes contained in the text string a. For example, morphological analysis represents the following four text strings, each indicating an address, as the following sets of morphemes. In English, the four addresses read Nakahara Ward, Kawasaki City; Nakahara Ward, Kawasaki City, Kanagawa Prefecture; Saiwai Ward, Kawasaki City, Kanagawa Prefecture; and Konan Ward, Yokohama City, Kanagawa Prefecture.
・morph('川崎市中原区')={'川崎','市','中原','区'}
・morph('神奈川県川崎市中原区')={'神奈川','県','川崎','市','中原','区'}
・morph('神奈川県川崎市幸区')={'神奈川','県','川崎','市','幸','区'}
・morph('神奈川県横浜市港南区')={'神奈川','県','横浜','市','港南','区'}
The function textSim(a, b), which calculates the similarity between text string a and text string b, can be defined by Equation 1 below.
textSim(a, b) = |morph(a) ∩ morph(b)| / min(|morph(a)|, |morph(b)|)   ... (Equation 1)
In this case, the similarities between the address text strings exemplified above are calculated as follows.
・textSim('川崎市中原区','神奈川県川崎市中原区')=4/4=1.0
・textSim('川崎市中原区','神奈川県川崎市幸区')=3/4=0.75
・textSim('川崎市中原区','神奈川県横浜市港南区')=2/4=0.5
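The calculation above can be sketched in a few lines of Python. This is a minimal illustration, assuming the morphological analysis has already been performed so that each address is available as a set of morphemes; the function name `text_sim` and the variable names are hypothetical and do not appear in the source.

```python
def text_sim(morph_a: set[str], morph_b: set[str]) -> float:
    """Simpson coefficient: shared morphemes divided by the smaller set size."""
    if not morph_a or not morph_b:
        return 0.0  # assumption: an empty set is treated as having no similarity
    return len(morph_a & morph_b) / min(len(morph_a), len(morph_b))


# Morpheme sets copied from the examples above (tokenization assumed done).
nakahara_short = {"川崎", "市", "中原", "区"}
nakahara_full = {"神奈川", "県", "川崎", "市", "中原", "区"}
saiwai = {"神奈川", "県", "川崎", "市", "幸", "区"}
konan = {"神奈川", "県", "横浜", "市", "港南", "区"}

print(text_sim(nakahara_short, nakahara_full))  # 1.0
print(text_sim(nakahara_short, saiwai))         # 0.75
print(text_sim(nakahara_short, konan))          # 0.5
```

Dividing by the smaller set is what lets the shorter address '川崎市中原区' score 1.0 against its fully qualified form.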
Also, when a structural viewpoint is shared, the input unit 10 may receive a structural relationship as the similarity function and a similarity threshold indicating the degree of the structural relationship as the condition. Hereinafter, a character string that expresses tree-structured information, such as an address or a file directory structure, using "/" is defined as a path string. For example, the address "神奈川県川崎市" (Kawasaki City, Kanagawa Prefecture) is expressed as the path string '/神奈川県/川崎市'. Also, for example, the directory structure "news→economy→bigdata" is expressed as the path string 'news/economy/bigdata'.
When the first attribute and the second attribute are structural attributes defined by the path strings described above, the similarity function is defined, for example, as a function that yields a higher similarity the shorter the distance between the two path strings. One example of a distance function for path strings is the minimum of the distances to the lowest common ancestor (LCA) node.
The lowest common ancestor node is the first node common to both paths that appears when tracing upward (toward the ancestors) from the bottommost node of each of the two paths. The distance to the lowest common ancestor node is the number of nodes traversed when going from the bottommost node up to the lowest common ancestor node.
For example, suppose there are two path strings '/a/b/c' and '/a/b/z'. In this case, the lowest common ancestor node of the two paths is '/a/b'. The distance from '/a/b/c' to '/a/b' is 1, and the distance from '/a/b/z' to '/a/b' is also 1.
Also, for example, suppose there are two path strings '/a/b/c' and '/a/d/e/z'. In this case, the lowest common ancestor node of the two paths is '/a'. The distance from '/a/b/c' to '/a' is 2, and the distance from '/a/d/e/z' to '/a' is 3.
Letting pathDis(x, y) be the function representing the distance between path strings, the distances for the path strings above are calculated as follows.
・pathDis('/a/b/c','/a/b/z')=1
・pathDis('/a/b/c','/a/d/e/z')=2
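The path distance can be sketched as follows. This is a minimal illustration, assuming that path strings are split on "/" and that the lowest common ancestor corresponds to the longest shared prefix of path components; the function name `path_dis` is a hypothetical stand-in for pathDis.

```python
def path_dis(x: str, y: str) -> int:
    """Smaller of the two step counts from each path's last node up to the LCA."""
    xs = [part for part in x.split("/") if part]
    ys = [part for part in y.split("/") if part]

    # Length of the shared prefix, i.e. the depth of the lowest common ancestor.
    common = 0
    for a, b in zip(xs, ys):
        if a != b:
            break
        common += 1

    return min(len(xs) - common, len(ys) - common)


print(path_dis("/a/b/c", "/a/b/z"))    # 1
print(path_dis("/a/b/c", "/a/d/e/z"))  # 2
```

Taking the minimum of the two distances reproduces the second example, where the individual distances to '/a' are 2 and 3.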
FIG. 2 is an explanatory diagram showing an example of a configuration file (hereinafter referred to as a config file). In the example shown in FIG. 2, the similarity functions and the conditions on the similarity are set in the config file. The input unit 10 may receive this config file.
The C1 portion of the config file illustrated in FIG. 2 specifies the similarity functions and the conditions on the similarity. The C2 to C4 portions of the config file are described later. In the C1 portion, the part before the colon indicates the correspondence between the data type of the first attribute and the data type of the second attribute (more specifically, their analysis data types). The part after the colon indicates the similarity function and the condition (similarity threshold). Each item is described in detail below.
The "Point-Point" row in the C1 portion defines a geographical relationship representing the distance between a first geographical attribute expressed as a point and a second geographical attribute expressed as a point.
"DistanceMap" is a map function that defines the degree of the geographical relationship and takes distance thresholds as parameters. The three parameters of the DistanceMap function indicate, in order, the "start value", the "end value", and the "interval" (of the thresholds applied from the start value to the end value). With kilometers as the unit of distance, ("DistanceMap", 1, 3, 1) illustrated in FIG. 2 applies three thresholds to the function: "distance within 1 km", "distance within 2 km", and "distance within 3 km".
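The expansion of a (start value, end value, interval) triple into individual thresholds can be sketched as below. This is a minimal illustration, assuming the end value is inclusive, as the ("DistanceMap", 1, 3, 1) example suggests; the helper name `expand_thresholds` is hypothetical, and the rounding step is only there to keep fractional thresholds free of floating-point noise.

```python
def expand_thresholds(start, end, interval):
    """List every threshold from start to end (inclusive) in steps of interval."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    steps = round((end - start) / interval)
    # round(..., 9) keeps integers unchanged and tidies fractional thresholds.
    return [round(start + i * interval, 9) for i in range(steps + 1)]


print(expand_thresholds(1, 3, 1))  # [1, 2, 3]
```

Each resulting threshold would yield one candidate join condition, so a single config entry produces several map parameters.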
"KNearestMap" is a map function that defines the degree of the geographical relationship and takes, as parameters, thresholds on the number of neighboring pieces of geographical information. The three parameters of the KNearestMap function likewise indicate, in order, the "start value", the "end value", and the "interval" (of the thresholds applied from the start value to the end value). ("KNearestMap", 3, 5, 1) illustrated in FIG. 2 applies three thresholds to the function: the number of neighboring pieces of geographical information is "within 3", "within 4", and "within 5".
"SameCityMap" is a map function that defines the degree of the geographical relationship and judges whether two points are included in the same area. The SameCityMap function takes no parameters; whether the points fall in the same area is judged based on area information that defines the areas. The area information is defined in advance.
The "Point-Area" row in the C1 portion defines a geographical relationship representing the inclusion relationship between a first geographical attribute expressed as a point and a second geographical attribute expressed as a region.
"InclusionMap" is a map function that defines the degree of the geographical relationship and judges whether the first geographical attribute expressed as a point is included in the second geographical attribute expressed as a region. InclusionMap takes no parameters.
"KNearestMap" is also defined in the "Point-Area" row. The KNearestMap function here works the same way as the KNearestMap function in the "Point-Point" row.
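The inclusion judgment made for the "Point-Area" row can be sketched with a standard ray-casting test. This is a minimal illustration, assuming the region is a simple polygon given as a list of (x, y) vertices; the source does not state which algorithm InclusionMap uses, and the function name `point_in_polygon` is hypothetical. Points lying exactly on the boundary are not handled specially.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count how many polygon edges a rightward ray crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
print(point_in_polygon((2, 2), square))  # True
print(point_in_polygon((5, 2), square))  # False
```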
The "Area-Area" row in the C1 portion defines a geographical relationship representing the intersection relationship between a first geographical attribute expressed as a region and a second geographical attribute expressed as a region.
"IntersectMap" is a map function that defines the degree of the geographical relationship and judges whether the first geographical attribute expressed as a region intersects the second geographical attribute expressed as a region. IntersectMap takes no parameters.
As shown above, the first geographical data type and the second geographical data type may be the same geographical data type or different geographical data types. For example, the first geographical data type may be a type of data that identifies a location by point information, and the second geographical data type may be a type of data that identifies a location by range information.
The "TimeStamp-TimeStamp" row in the C1 portion defines a temporal relationship representing the difference between the first temporal attribute and the second temporal attribute.
"TimeDiffMap" is a map function that defines the degree of the temporal relationship and takes time-difference thresholds as parameters. The three parameters of the TimeDiffMap function likewise indicate, in order, the "start value", the "end value", and the "interval" (of the thresholds applied from the start value to the end value). With minutes as the unit of time, ("TimeDiffMap", 30, 60, 30) illustrated in FIG. 2 applies two thresholds to the function: "time difference within 30 minutes" and "time difference within 60 minutes".
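The time-difference judgment can be sketched as follows. This is a minimal illustration, assuming the difference is taken as an absolute value in minutes and that "within" includes the threshold itself; the function name `within_time_diff` and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def within_time_diff(t1: datetime, t2: datetime, threshold_minutes: int) -> bool:
    """True when the two timestamps differ by at most threshold_minutes."""
    return abs(t1 - t2) <= timedelta(minutes=threshold_minutes)


# Illustrative timestamps 45 minutes apart (not taken from the source).
first = datetime(2018, 1, 1, 9, 0)
second = datetime(2018, 1, 1, 9, 45)

print(within_time_diff(first, second, 30))  # False
print(within_time_diff(first, second, 60))  # True
```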
The "Text-Text" row in the C1 portion defines the matching relationship between a first attribute representing a character string and a second attribute representing a character string. "ExactMap" is a function that judges whether the attributes expressed as character strings match.
The "Text-Text" row may also define a similarity relationship between a first attribute representing a character string and a second attribute representing a character string. Specifically, the map function "textSimMap" may be set in the "Text-Text" row. "textSimMap" is a map function that defines the degree of the character-string relationship and takes similarity thresholds as parameters. Like the DistanceMap function, the textSimMap function has three parameters, which indicate, in order, the "start value", the "end value", and the "interval" (of the thresholds applied from the start value to the end value).
For example, suppose the textSimMap function is used in the definition [("textSimMap", 0.8, 1.0, 0.1)]. This applies three thresholds to the function: "similarity of 0.8 or more", "similarity of 0.9 or more", and "similarity of 1.0 (or more)".
The way of setting the similarity function and the similarity threshold is not limited to what is illustrated in the C1 portion of FIG. 2. For example, the config file may define a structural relationship "Path-Path" representing the distance between a first structural attribute expressed as a path string and a second structural attribute expressed as a path string.
Specifically, the map function "pathDisMap" may be set in the "Path-Path" row. "pathDisMap" is a map function that defines the degree of the structural relationship and takes distance thresholds as parameters. Like the DistanceMap function, the pathDisMap function has three parameters, which indicate, in order, the "start value", the "end value", and the "interval" (of the thresholds applied from the start value to the end value).
For example, suppose the pathDisMap function is used in the definition [("pathDisMap", 1, 3, 1)]. This applies three thresholds to the function: "distance of 1 or less", "distance of 2 or less", and "distance of 3 or less".
When the input unit 10 receives the config file illustrated in FIG. 2, the map parameter generator 30, described later, generates join conditions (map parameters) for joining records included in the first table with records included in the second table.
The input unit 10 may also receive, together with the tables, the attribute of the data indicated by each column of the tables.
The geocoder 20 converts attribute data expressed as character strings. For example, when geographical attribute data is expressed as a character string, the geocoder 20 converts the character string into point, polygon, or multi-polygon data. When no data needs to be converted, the information processing system 100 need not include the geocoder 20.
FIG. 3 is an explanatory diagram showing an example of the data conversion process. In the example shown in FIG. 3, it is assumed that a table adt1, which defines the analysis data type of each column, and a table adt2, which defines the conversion from analysis data types to data types, have been acquired in advance.
In this situation, suppose the input unit 10 acquires the target table T, the source table S1, and the source table S2 illustrated in FIG. 3. According to table adt1, the analysis data type of the "Pickup_location" column of source table S2 is Point, so no conversion is needed. In contrast, according to table adt1, the analysis data type of the "community" column of source table S1 is "TownAddress", and according to table adt2 it must be converted to the data type Polygon. The geocoder 20 therefore converts the data in the "community" column of source table S1 so that it is expressed as polygonal regions. For example, area information that identifies a polygonal region for each value of "community" may be defined in advance, and the geocoder 20 may convert the data based on that area information so that the data type becomes Polygon.
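The lookup described for FIG. 3 can be sketched as two dictionary lookups followed by a conversion. This is a minimal illustration: `ADT1` and `ADT2` are hypothetical dictionary stand-ins for tables adt1 and adt2 containing only the entries mentioned in the text, and `AREA_INFO` holds an invented town name and invented coordinates in place of the predefined area information.

```python
# (table, column) -> analysis data type, as described for table adt1.
ADT1 = {
    ("S1", "community"): "TownAddress",
    ("S2", "Pickup_location"): "Point",
}

# analysis data type -> data type to convert to, as described for table adt2.
ADT2 = {"TownAddress": "Polygon"}

# Hypothetical area information: town name -> polygon vertices (invented values).
AREA_INFO = {"Example Town": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]}


def target_type(table: str, column: str) -> str:
    """Data type a column should have after geocoding."""
    analysis_type = ADT1[(table, column)]
    # Assumption: types without an adt2 entry are kept as they are.
    return ADT2.get(analysis_type, analysis_type)


def geocode_column(table: str, column: str, values: list) -> list:
    """Convert values to polygons when adt2 requires it; otherwise pass through."""
    if ADT1[(table, column)] not in ADT2:
        return list(values)
    return [AREA_INFO[value] for value in values]


print(target_type("S1", "community"))        # Polygon
print(target_type("S2", "Pickup_location"))  # Point
```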
The map parameter generator 30, the filter parameter generator 50, and the aggregation parameter generator 60 generate the parameters that the feature quantity generation function generator 81, described later, uses when generating a feature quantity generation function, that is, a function for generating feature quantities, which are variables that can affect the prediction target.
In the following description, a feature quantity means the content of the feature itself (for example, "population" or "position"). A feature quantity filled in with concrete data (for example, population = "8112" or position = "(-73.965, 40.724)") is referred to as a feature quantity vector (or, when there are several, a feature quantity table).
The feature quantities generated by the feature quantity generator 82, described later, become candidate explanatory variables when a model is generated using machine learning. In other words, using the feature quantity generation function generated in the present embodiment makes it possible to generate candidate explanatory variables automatically when a model is generated using machine learning.
FIG. 4 is an explanatory diagram showing an example of the relationship between each parameter and the first and second tables.
The parameter generated by the filter parameter generator 50 represents the extraction condition for rows included in the second table. Hereinafter, this parameter is referred to as a filter parameter, and the process of extracting rows from the second table based on the filter parameter may be written as "filter". The list of these extraction conditions may be written as the "F list". The extraction condition is arbitrary; one example is a condition that judges whether a value is equal to (or greater than, or less than) the value of a specified column.
The parameter generated by the aggregation parameter generator 60 represents the aggregation method for aggregating the data of the rows included in the second table for each objective variable. In general, rows of the first table and rows of the second table often correspond one-to-many, so rows end up being aggregated. The aggregation information may be defined as an aggregation function applied to a column of the source table (second table).
The aggregation method is arbitrary; examples include the count, maximum, minimum, mean, median, and variance of a column. The count may be computed either with duplicate data excluded or without excluding duplicate data.
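The aggregation methods listed above can be sketched as a table of named functions applied to the values of one column. This is a minimal illustration using Python's standard `statistics` module; the names `REDUCERS` and `reduce_column` are hypothetical, and the choice of population variance is an assumption, since the source does not say which variance is meant.

```python
import statistics

# Aggregation name -> function applied to the matched values of one column.
REDUCERS = {
    "count": len,                                 # duplicates kept
    "count_distinct": lambda xs: len(set(xs)),    # duplicates excluded
    "max": max,
    "min": min,
    "mean": statistics.mean,
    "median": statistics.median,
    "variance": statistics.pvariance,             # assumption: population variance
}


def reduce_column(values, how):
    """Aggregate the values matched to one first-table row into a single value."""
    return REDUCERS[how](values)


matched = [2, 4, 4, 6]  # illustrative values, not taken from the source
print(reduce_column(matched, "count"))           # 4
print(reduce_column(matched, "count_distinct"))  # 3
```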
Hereinafter, this parameter is referred to as an aggregation parameter, and the process of aggregating the data of each column by the method indicated by the aggregation parameter may be written as "reduce". In particular, the process of aggregating geographical information may be written as "Geo-reduce". The list of these aggregation processes may be written as the "R list". The details of the process of aggregating geographical information are described later.
The parameter generated by the map parameter generator 30 represents the correspondence condition between columns of the first table and columns of the second table. Hereinafter, this parameter is referred to as a map parameter, and the process of associating the columns of the tables based on the map parameter may be written as "map". The list of these correspondence conditions may be written as the "M list". In particular, the process of associating pieces of geographical information with each other may be written as "Geo-map". Associating the columns of the tables by map can also be described as joining a plurality of tables into one table on the associated columns. The details of the process of associating geographical information are also described later.
The map parameter generator 30 includes a geomap generator (GeoMap Generator) 40, a time difference map generator (TimeDiff Map Generator) 31, a map generator (Exact Map Generator) 32, and an attribute specifying unit 33. The map parameter generator 30 (more specifically, each generator included in it) generates a join condition for joining a record of the first table containing a value of the first attribute with a record of the second table containing a value of the second attribute such that the similarity calculated from the two values satisfies the condition. Satisfying the condition means, for example, that the similarity is at or below a threshold, at or above a threshold, or within a predetermined range.
The geomap generator 40 generates parameters representing correspondence conditions between columns of the first table and the second table that contain geographical attributes. The geomap generator 40 includes a distance map generator (Distance Map Generator) 41, an inclusion map generator (Inclusion Map Generator) 42, an overlap map generator (Overlap Map Generator) 43, and a same-area map generator (SameArea Map Generator) 44.
The geomap generator 40 (more specifically, each generator included in the geomap generator 40) generates a join condition (map parameter) for joining a record included in the first table with a record included in the second table, such that the relationship between the value of the first geographical attribute and the value of the second geographical attribute satisfies a given degree of geographical relationship. The processing of each generator is described in detail below.
The distance map generator 41 generates map parameters when it receives a similarity function and a condition (for example, a similarity threshold) for associating the first table with the second table based on closeness in distance. In the example shown in FIG. 2, this corresponds to the case where at least one of the DistanceMap function and the KNearestMap function is set in the configuration file.
The distance map generator 41 generates a map parameter for joining a record included in the first table with a record included in the second table, such that the distance between the value of the first geographical attribute and the value of the second geographical attribute is within a threshold.
FIG. 5 is an explanatory diagram showing an example of processing for generating map parameters based on distance. The example shown in FIG. 5 illustrates the case where one target table T and one source table S2 are acquired. The target table T illustrated in FIG. 5 contains data representing the number of passengers (pickup_number) at five locations at 22:00 on January 8, 2015. The source table S2 illustrated in FIG. 5 records, in association with each other, the number of passengers, the trip distance, and the passenger pickup location at each time.
For example, in the case of the DistanceMap function illustrated in FIG. 2, the distance map generator 41 generates a parameter that associates each record of the target table T with the records of the source table S2 for which the distance between the position indicated by the value of the first geographical attribute and the position indicated by the value of the second geographical attribute is within 1 km. The distance map generator 41 likewise generates separate parameters for distances within 2 km and within 3 km.
In the example shown in FIG. 5, the attribute of the "target_location" column of the target table T is the first geographical attribute, and the attribute of the "Pickup_location" column of the source table S2 is the second geographical attribute. These two columns are associated with each other. The columns to be associated between the first table and the second table may be specified in advance, or may be identified by the attribute specifying unit 33 described later.
As a result, the parameter P11 illustrated in FIG. 5 is generated. As illustrated in FIG. 5, map parameters are generated based on the geographical analysis data type, and one map process is defined by one map parameter. The map data M11 illustrated in FIG. 5 shows the result of associating each record of the target table T with the records of the source table S2 that lie within 1 km. For example, only one record of the source table is associated with the first record of the target table, whereas two records of the source table are associated with the second record of the target table.
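The distance-based association described above can be sketched as a threshold join. This is a minimal illustration, not the implementation of the embodiment: it assumes that locations are (latitude, longitude) pairs in degrees and that distance is the great-circle (haversine) distance, neither of which is stated in the description. The names `haversine_km` and `distance_map` and the sample coordinates are hypothetical.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def distance_map(target, source, t_col, s_col, threshold_km):
    """Pair each target row with every source row within threshold_km (assumed semantics)."""
    return [(t, s) for t in target for s in source
            if haversine_km(t[t_col], s[s_col]) <= threshold_km]

# Hypothetical sample rows; column names follow FIG. 5 as described in the text.
target = [{"id": 1, "target_location": (40.7500, -73.9900)}]
source = [
    {"id": "a", "Pickup_location": (40.7510, -73.9910)},  # a few hundred metres away at most
    {"id": "b", "Pickup_location": (40.8000, -73.9500)},  # several km away
]
pairs = distance_map(target, source, "target_location", "Pickup_location", 1.0)
```

Calling `distance_map` once per threshold (1 km, 2 km, 3 km) would yield one mapping per map parameter, mirroring the one-parameter-per-map-process relationship described above.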
FIG. 6 is an explanatory diagram showing another example of processing for generating map parameters based on distance. The target table T and the source table S2 illustrated in FIG. 6 are the same as the target table T and the source table S2 illustrated in FIG. 5.
For example, in the case of the KNearestMap function illustrated in FIG. 2, the distance map generator 41 generates a parameter that associates each record of the target table T with up to two records of the source table S2, taken in ascending order of the distance between the position indicated by the value of the first geographical attribute and the position indicated by the value of the second geographical attribute. The distance map generator 41 likewise generates separate parameters that associate each record with up to three and up to four records of the source table S2.
In the example shown in FIG. 6, the attribute of the "target_location" column of the target table T is the first geographical attribute, and the attribute of the "Pickup_location" column of the source table S2 is the second geographical attribute. These two columns are associated with each other. The columns to be associated between the first table and the second table may be specified in advance, or may be identified by the attribute specifying unit 33 described later.
As a result, the parameter P12 illustrated in FIG. 6 is generated. As illustrated in FIG. 6, map parameters are generated based on the geographical analysis data type, and one map process is defined by one map parameter. The map data M12 illustrated in FIG. 6 shows the result of associating each record of the target table T with two records of the source table S2 in order of closeness. That is, the two nearest records of the source table are associated with each record of the target table.
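The k-nearest association can be sketched as a sort-and-truncate per target record. This is an illustrative sketch only: it assumes planar coordinates and Euclidean distance (via `math.dist`) for brevity, and the function name `k_nearest_map` and the sample rows are hypothetical.

```python
import math

def k_nearest_map(target, source, t_col, s_col, k):
    """Pair each target row with its k nearest source rows (planar distance assumed)."""
    pairs = []
    for t in target:
        ranked = sorted(source, key=lambda s: math.dist(t[t_col], s[s_col]))
        pairs.extend((t, s) for s in ranked[:k])  # at most k rows per target record
    return pairs

# Hypothetical planar coordinates.
target = [{"id": 1, "target_location": (0.0, 0.0)}]
source = [
    {"id": "a", "Pickup_location": (1.0, 0.0)},
    {"id": "b", "Pickup_location": (0.0, 2.0)},
    {"id": "c", "Pickup_location": (3.0, 3.0)},
]
nearest_two = k_nearest_map(target, source, "target_location", "Pickup_location", 2)
```

Unlike the threshold join, this always yields up to k matches per target record regardless of how far away they are, which is why the two functions produce different candidate features.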
The same-area map generator 44 generates map parameters when it receives a similarity function for associating the first table with the second table based on whether positions fall within the same area. In the example shown in FIG. 2, this corresponds to the case where the SameCityMap function is set in the configuration file.
The same-area map generator 44 generates a map parameter for joining a record included in the first table with a record included in the second table, such that the position indicated by the value of the first geographical attribute and the position indicated by the value of the second geographical attribute are included in the same area.
FIG. 7 is an explanatory diagram showing an example of a method of determining whether two positions are included in the same area. In the example shown in FIG. 7, a common area table CAT, in which each area is associated with its extent specified as a polygon, is defined in advance. Examples of common areas include countries, states, cities, boroughs, and towns. Common areas are defined as shared regions that do not overlap one another, and they represent boundary information on a map. The common area table CAT may be stored, for example, in the storage unit 80.
First, whether two positions lie in the same area is determined based on the common area table CAT. Specifically, the area containing the position of the record t1 in the target table T is identified, and it is then determined whether the position of the record s1 in the source table S lies within that area. The same processing is performed for all records of the target table T and the source table S.
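The same-area check above can be sketched with a standard ray-casting point-in-polygon test. This is only a sketch under assumptions: the description does not say how containment is computed, and the table layout (a dict from area name to a list of polygon vertices) and all names here are hypothetical.

```python
def point_in_polygon(pt, poly):
    """Ray-casting test: True if pt lies inside the polygon given as a vertex list."""
    x, y = pt
    inside = False
    for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
        # Count edges that a horizontal ray from pt towards +x crosses.
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def area_of(pt, area_table):
    """Return the name of the first area containing pt, or None."""
    for name, poly in area_table.items():
        if point_in_polygon(pt, poly):
            return name
    return None

def same_area(p, q, area_table):
    """True if both positions fall inside the same area of the table."""
    a = area_of(p, area_table)
    return a is not None and a == area_of(q, area_table)

# Hypothetical common area table with two non-overlapping square areas.
CAT = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
```

Because the common areas are defined as non-overlapping, returning the first containing area is sufficient; points exactly on a shared boundary are an edge case this sketch does not treat specially.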
FIG. 8 is an explanatory diagram showing an example of processing for generating map parameters based on whether positions lie in a common area. The target table T and the source table S2 illustrated in FIG. 8 are the same as the target table T and the source table S2 illustrated in FIG. 5.
For example, in the case of the SameCityMap function illustrated in FIG. 2, the same-area map generator 44 generates a parameter that associates each record of the target table T with the records of the source table S2 for which the position indicated by the value of the first geographical attribute and the position indicated by the value of the second geographical attribute are included in the same area.
In the example shown in FIG. 8, the attribute of the "target_location" column of the target table T is the first geographical attribute, and the attribute of the "Pickup_location" column of the source table S2 is the second geographical attribute. These two columns are associated with each other. The columns to be associated between the first table and the second table may be specified in advance, or may be identified by the attribute specifying unit 33 described later.
As a result, the parameter P13 illustrated in FIG. 8 is generated. The map data M13 illustrated in FIG. 8 shows the result of associating each record of the target table T with the records of the source table S2 whose geographical attribute was determined to lie in the same area. Note that the map data M13 illustrated in FIG. 8 is an example in which points less than 1 km apart are assumed to be located in the same city.
The inclusion map generator 42 generates map parameters when it receives a similarity function for associating the first table with the second table based on an inclusion relationship. In the example shown in FIG. 2, this corresponds to the case where the InclusionMap function is set in the configuration file.
The inclusion map generator 42 generates a map parameter for joining a record included in the first table with a record included in the second table, such that the position indicated by the value of the first geographical attribute is included in the region indicated by the value of the second geographical attribute.
FIG. 9 is an explanatory diagram showing an example of processing for generating map parameters based on an inclusion relationship. The target table T illustrated in FIG. 9 is the same as the target table T illustrated in FIG. 5. The source table S1 illustrated in FIG. 9 records, in association with each other, the population, the number of males, and the number of people aged 20 to 40 in each region.
For example, in the case of the InclusionMap function illustrated in FIG. 2, the inclusion map generator 42 generates a parameter that associates each record of the target table T with the records of the source table S1 for which the position indicated by the value of the first geographical attribute is included in the region indicated by the value of the second geographical attribute.
In the example shown in FIG. 9, the attribute of the "target_location" column of the target table T is the first geographical attribute, and the attribute of the "community" column of the source table S1 is the second geographical attribute. These two columns are associated with each other. The columns to be associated between the first table and the second table may be specified in advance, or may be identified by the attribute specifying unit 33 described later.
As a result, the parameter P14 illustrated in FIG. 9 is generated. The map data M14 illustrated in FIG. 9 shows the result of associating each record of the target table T with the records of the source table S1 that lie in the same area.
The overlap map generator 43 generates map parameters when it receives a similarity function for associating the first table with the second table based on overlapping regions. In the example shown in FIG. 2, this corresponds to the case where the IntersectMap function is set in the configuration file.
The overlap map generator 43 generates a map parameter for joining a record included in the first table with a record included in the second table, such that the region indicated by the value of the first geographical attribute and the region indicated by the value of the second geographical attribute overlap.
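An overlap-based join can be illustrated with the simplest possible region model. The sketch below assumes, purely for illustration, that each region is an axis-aligned rectangle given as `(x_min, y_min, x_max, y_max)`; the description does not specify the region representation, and real regions would typically be general polygons. The names `rects_overlap` and `intersect_map` and the sample data are hypothetical.

```python
def rects_overlap(a, b):
    """True if two axis-aligned rectangles share interior area (touching edges do not count)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def intersect_map(target, source, t_col, s_col):
    """Pair each target row with every source row whose region overlaps its own."""
    return [(t, s) for t in target for s in source
            if rects_overlap(t[t_col], s[s_col])]

# Hypothetical regions.
target = [{"id": 1, "region": (0, 0, 2, 2)}]
source = [
    {"id": "a", "region": (1, 1, 3, 3)},  # overlaps the target region
    {"id": "b", "region": (2, 0, 4, 2)},  # only touches along an edge
    {"id": "c", "region": (5, 5, 6, 6)},  # disjoint
]
overlapping = intersect_map(target, source, "region", "region")
```

Whether regions that merely share a boundary should count as overlapping is a design choice; this sketch excludes them.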
The time difference map generator 31 generates map parameters when it receives a similarity function and a condition (for example, a similarity threshold) for associating the first table with the second table based on a difference in time. In the example shown in FIG. 2, this corresponds to the case where the TimeDiffMap function is set in the configuration file.
The time difference map generator 31 generates a join condition for joining a record included in the first table with a record included in the second table, such that the relationship between the value of the first temporal attribute and the value of the second temporal attribute satisfies a given degree of temporal relationship. In the present embodiment, the time difference map generator 31 generates a map parameter for joining a record included in the first table with a record included in the second table, such that the difference between the value of the first temporal attribute and the value of the second temporal attribute is within a threshold.
FIG. 10 is an explanatory diagram showing an example of processing for generating map parameters based on a difference in time. The target table T and the source table S2 illustrated in FIG. 10 are the same as the target table T and the source table S2 illustrated in FIG. 5.
For example, in the case of the TimeDiffMap function illustrated in FIG. 2, the time difference map generator 31 generates a parameter that associates each record of the target table T with the records of the source table S2 for which the difference between the value of the first temporal attribute and the value of the second temporal attribute is within 30 minutes. The time difference map generator 31 further generates a parameter that associates each record of the target table T with the records of the source table S2 for which that difference is within 60 minutes.
In the example shown in FIG. 10, the attribute of the "time" column of the target table T is the first temporal attribute, and the attribute of the "pickup_time" column of the source table S2 is the second temporal attribute. These two columns are associated with each other. The columns to be associated between the first table and the second table may be specified in advance, or may be identified by the attribute specifying unit 33 described later.
As a result, the parameter P15 illustrated in FIG. 10 is generated. The map data M15 illustrated in FIG. 10 shows the result of associating each record of the target table T with the records of the source table S2 whose time difference is within 30 minutes.
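The time-difference association can be sketched with the standard `datetime` module. This is a minimal illustration that assumes the difference is taken as an absolute value, so that earlier and later source records both qualify; the description does not state this explicitly. The function name `time_diff_map` and the sample rows are hypothetical.

```python
from datetime import datetime, timedelta

def time_diff_map(target, source, t_col, s_col, max_diff):
    """Pair each target row with source rows whose timestamp is within max_diff of it."""
    return [(t, s) for t in target for s in source
            if abs(t[t_col] - s[s_col]) <= max_diff]

# Hypothetical rows; column names follow FIG. 10 as described in the text.
target = [{"id": 1, "time": datetime(2015, 1, 8, 22, 0)}]
source = [
    {"id": "a", "pickup_time": datetime(2015, 1, 8, 21, 40)},  # 20 minutes earlier
    {"id": "b", "pickup_time": datetime(2015, 1, 8, 22, 50)},  # 50 minutes later
    {"id": "c", "pickup_time": datetime(2015, 1, 9, 0, 0)},    # 2 hours later
]
within_30 = time_diff_map(target, source, "time", "pickup_time", timedelta(minutes=30))
within_60 = time_diff_map(target, source, "time", "pickup_time", timedelta(minutes=60))
```

The two calls correspond to the two map parameters (30 minutes and 60 minutes) described above.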
The map generator 32 generates map parameters when it receives a similarity function for associating the first table with the second table. In the present embodiment, the map generator 32 generates a parameter that associates records of the target table with records of the source table based on the value of an attribute that is neither a geographical attribute nor a temporal attribute.
In the example shown in FIG. 2, this corresponds to the case where the ExactMap function is set in the configuration file. The map generator 32 generates a map parameter for joining a record included in the first table with a record included in the second table, such that the value of the first attribute matches the value of the second attribute.
FIG. 11 is an explanatory diagram showing an example of processing for generating map parameters based on text similarity. The target table T illustrated in FIG. 11 contains data representing the number of passengers (pickup_number) at a given address. The source table S illustrated in FIG. 11 records the average income in each area.
For example, in the case of the textSimMap function described above, the map generator 32 generates a parameter that associates each record of the target table T with the records of the source table S for which the similarity between the value of the first character string attribute and the value of the second character string attribute is 0.8 or more. The map generator 32 likewise generates separate parameters for similarities of 0.9 or more and 1.0 or more.
In the example shown in FIG. 11, it is assumed that the attribute of the "address" column of the target table T is registered as the first character string attribute and the attribute of the "address" column of the source table S is registered as the second character string attribute. These two columns are therefore associated with each other. As a result, the parameter P16 illustrated in FIG. 11 is generated.
The map data M illustrated in FIG. 11 shows the result of associating each record of the target table T with the records of the source table S whose similarity is 0.8 or more. For example, only one record of the source table is associated with the first record of the target table.
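A text-similarity join of this kind can be sketched with Python's standard `difflib`. The similarity measure actually used by textSimMap is not given in this passage, so `difflib.SequenceMatcher.ratio()` is swapped in here purely as a stand-in that yields a score between 0 and 1. The function names and the sample addresses are hypothetical.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    """Similarity in [0, 1]; SequenceMatcher.ratio() is an assumed stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()

def text_sim_map(target, source, t_col, s_col, min_similarity):
    """Pair each target row with source rows whose text is at least min_similarity alike."""
    return [(t, s) for t in target for s in source
            if text_similarity(t[t_col], s[s_col]) >= min_similarity]

# Hypothetical address strings.
target = [{"id": 1, "address": "1 Main St"}]
source = [
    {"id": "a", "address": "1 Main St."},    # near-identical spelling
    {"id": "b", "address": "99 Ocean Ave"},  # unrelated address
]
matches = text_sim_map(target, source, "address", "address", 0.8)
```

With a threshold of 1.0, this degenerates to an exact-match join, which is consistent with the 1.0 setting listed above behaving like ExactMap on string columns.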
FIG. 12 is an explanatory diagram showing an example of processing for generating map parameters based on structural similarity. The target table T illustrated in FIG. 12 contains data representing the number of accesses (access_number) to the Web page identified by a given URL. The source table S illustrated in FIG. 12 records the number of accesses (access_number) in the previous month to the Web page identified by a given URL.
For example, in the case of the pathDisMap function described above, the map generator 32 generates a parameter that associates each record of the target table T with the records of the source table S for which the distance between the value of the first structural attribute and the value of the second structural attribute is 1 or less. The map generator 32 likewise generates separate parameters for distances of 2 or less and 3 or less.
In the example shown in FIG. 12, it is assumed that the attribute of the "URL" column of the target table T is registered as the first structural attribute and the attribute of the "URL" column of the source table S is registered as the second structural attribute. These two columns are therefore associated with each other. As a result, the parameter P17 illustrated in FIG. 12 is generated.
The map data M illustrated in FIG. 12 shows the result of associating each record of the target table T with the records of the source table S whose similarity (here, the path distance) is 1 or less. For example, only one record of the source table is associated with the first record of the target table.
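One way to picture a structural distance between URLs is as a tree distance over path segments. The definition of the distance used by pathDisMap is not given in this passage, so the sketch below makes a loud assumption: the distance is the number of path segments that must be removed from one URL and added to reach the other from their longest common prefix, and the host name is ignored. All names here are hypothetical.

```python
from urllib.parse import urlsplit

def path_segments(url):
    """Split the path component of a URL into its non-empty segments."""
    return [seg for seg in urlsplit(url).path.split("/") if seg]

def path_distance(url_a, url_b):
    """Assumed tree distance: segments outside the longest common path prefix."""
    a, b = path_segments(url_a), path_segments(url_b)
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return (len(a) - common) + (len(b) - common)

def path_dis_map(target, source, t_col, s_col, max_distance):
    """Pair each target row with source rows whose URL is within max_distance."""
    return [(t, s) for t in target for s in source
            if path_distance(t[t_col], s[s_col]) <= max_distance]
```

Under this assumed definition, a page and its parent directory are at distance 1, while two sibling pages are at distance 2.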
The attribute specifying unit 33 identifies attributes that share a common viewpoint between the first table and the second table. Specifically, the attribute specifying unit 33 identifies attributes for which the attribute of the data indicated by a column of the first table is the same as the attribute of the data indicated by a column of the second table. For example, in the case of a geographical data type, the attribute specifying unit 33 may identify, from the first table, a first geographical attribute having the same data type as the first geographical data type, and may identify, from the second table, a second geographical attribute having the same data type as the data type of the second geographical information. This makes it possible to identify columns having a geographical data type in each table. The attribute specifying unit 33 may also identify the attributes of the columns of the first table and the second table from column attribute information input to the input unit 10.
The map parameter generator 30 (more specifically, each generator included in the map parameter generator 30) may store, in the storage unit 80, a parameter that includes the column of the first table containing the first geographical (temporal) attribute and the column of the second table containing the second geographical (temporal) attribute, which are the subjects of the geographical (temporal) relationship determination, together with the degree of the geographical (temporal) relationship. For example, the map parameter generator 30 may store the parameter P11 illustrated in FIG. 5 or the parameter P15 illustrated in FIG. 10 in the storage unit 80.
FIG. 13 is an explanatory diagram showing an example of generated map parameters. As in the examples described above, the input unit 10 receives the target table T, the source table S1, and the source table S2 illustrated in FIG. 13, together with the C1 portion of the configuration file illustrated in FIG. 2. Note that the map parameter P16 is an example of a parameter generated based on the KNearestMap function, with the attribute of the "target_location" column of the target table T as the first geographical attribute and the attribute of the "community" column of the source table S1 as the second geographical attribute. From this information, the map parameter generator 30 (more specifically, each generator included in the map parameter generator 30) generates the 13 map parameters P11 to P16 illustrated in FIG. 13.
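How a configuration expands into a list of map parameters (the "M list") can be sketched as one parameter per function and candidate value. The configuration layout below is hypothetical, since FIG. 2 is not reproduced in this passage; only the function names, column names, and candidate values that appear in the text are used, and the sketch does not attempt to reproduce the exact set of parameters in FIG. 13.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MapParameter:
    """One correspondence condition: a function, the two columns, and its setting."""
    function: str
    target_column: str
    source_column: str
    value: float

# Hypothetical configuration entries: (function, target column, source column, values).
CONFIG = [
    ("DistanceMap", "target_location", "Pickup_location", [1, 2, 3]),  # km
    ("KNearestMap", "target_location", "Pickup_location", [2, 3, 4]),  # k
    ("TimeDiffMap", "time", "pickup_time", [30, 60]),                  # minutes
]

def generate_map_parameters(config):
    """Expand each configured function into one MapParameter per candidate value."""
    return [MapParameter(fn, t_col, s_col, v)
            for fn, t_col, s_col, values in config
            for v in values]

m_list = generate_map_parameters(CONFIG)
```

Each element of `m_list` then defines exactly one map process, matching the one-parameter-per-process relationship described for FIG. 5 and FIG. 6.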
The filter parameter generator 50 includes a filter generator (Exact Filter Generator) 51. The filter generator 51 generates filter parameters, each of which associates a column of the second table with an extraction condition to be applied to that column.
The filter parameters may be generated in any manner. For example, the filter generator 51 may generate filter parameters based on the information defined in the C2 portion of the configuration file illustrated in FIG. 2. Alternatively, extraction conditions may be stored in the storage unit 80 in advance, and the filter generator 51 may read those extraction conditions to generate filter parameters.
Furthermore, the filter generator 51 may combine a plurality of extraction conditions to generate further extraction conditions. The number of extraction conditions to be combined is also arbitrary. The input unit 10 may, for example, receive the maximum number of conditions to combine. For example, as illustrated in FIG. 2, a parameter indicating this maximum number ("max_combination_filter_length") may be set in the C4 portion of the configuration file.
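Combining extraction conditions up to a maximum length can be sketched with `itertools.combinations`. The representation of a condition as a `(column, operator, value)` tuple and the sample conditions are hypothetical; only the parameter name `max_combination_filter_length` comes from the text.

```python
from itertools import combinations

def combine_filters(conditions, max_combination_filter_length):
    """Return every conjunction of 1..max_combination_filter_length base conditions."""
    return [combo
            for n in range(1, max_combination_filter_length + 1)
            for combo in combinations(conditions, n)]

# Hypothetical base conditions on source-table columns.
base_conditions = [
    ("passenger_count", ">=", 2),
    ("trip_distance", "<", 5),
    ("payment_type", "==", "card"),
]
filters = combine_filters(base_conditions, 2)
```

The number of combined filters grows combinatorially with the maximum length, which is one practical reason to expose it as a configurable bound. This sketch does not prune contradictory combinations on the same column.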
The aggregation parameter generator 60 (more specifically, each generator included in the aggregation parameter generator 60) generates parameters representing a method of aggregating the data of the rows included in the second table. The aggregation parameter generator 60 includes a geo aggregation generator (GeoReduce Generator) 70 and a numeric aggregation generator (Numeric Reduce Generator) 61.
The geo aggregation generator 70 (more specifically, each generator included in the geo aggregation generator 70) generates an aggregation parameter representing a method of aggregating the data of the rows by the values of a column of the second table that contains a geographical attribute. Specifically, the geo aggregation generator 70 calculates a statistic of the values of the geographical attribute based on the specified aggregation method.
集約方法を指定する方法は任意である。例えば、入力部10が集約方法の指定を受け付けてもよい。具体的には、図2のコンフィグファイルのC3部分に例示するように、地理的属性の分析データ型に応じて集約方法を定義し、定義された集約方法に応じて集約パラメータを生成してもよい。以下、各内容について、詳細に説明する。
The method of specifying the aggregation method is arbitrary. For example, the input unit 10 may receive designation of the aggregation method. Specifically, as exemplified in the C3 portion of the configuration file in FIG. 2, the aggregation method is defined according to the analysis data type of geographical attribute, and the aggregation parameter is generated according to the defined aggregation method. Good. Each content will be described in detail below.
The "Point" row in the C3 portion defines the aggregation methods used when the second geographical attribute (more specifically, its geographical data type) is represented by a point (Point).
("sum", "distance") defines an aggregation method that calculates, as a statistic, the sum of the distances computed from the value of the first geographical attribute and the value of the second geographical attribute, over the records of the second table associated with a record of the first table.
("avg", "distance") defines an aggregation method that calculates, as a statistic, the average of the distances computed from the value of the first geographical attribute and the value of the second geographical attribute, over the records of the second table associated with a record of the first table.
("count") defines an aggregation method that calculates, as a statistic, the number of records of the second table associated with each record of the first table (that is, with each target variable).
The "Area" row in the C3 portion defines the aggregation methods used when the second geographical attribute (more specifically, its geographical data type) is represented by an area (Area).
("sum", "areaSize") defines an aggregation method that calculates, as a statistic, the sum of the sizes of the areas given by the second geographical attribute, over the records of the second table associated with a record of the first table.
("avg", "areaSize") defines an aggregation method that calculates, as a statistic, the average of the sizes of the areas given by the second geographical attribute, over the records of the second table associated with a record of the first table.
("count") defines an aggregation method that calculates, as a statistic, the number of records of the second table associated with each record of the first table (that is, with each target variable).
The geo aggregation generator 70 includes a point aggregation generator (Point Reduce Generator) 71 and an area aggregation generator (Area Reduce Generator) 72.
The point aggregation generator 71 generates an aggregation parameter for calculating a statistic of the distances computed from the value of the first geographical attribute and the value of the second geographical attribute. The records of the second table considered here are those associated with the respective records of the first table. For geographical attributes, as described above, records are associated with each other when they satisfy a certain condition, for example when the value of the first geographical attribute and the value of the second geographical attribute match or lie within a certain range of each other. Accordingly, when the value of the second geographical attribute satisfies a predetermined condition with respect to the value of the first geographical attribute, the point aggregation generator 71 generates an aggregation parameter for calculating a distance statistic based on the value of the first geographical attribute and the value of the second geographical attribute that satisfies the condition. The calculated statistic is used as a feature quantity.
For example, the point aggregation generator 71 may generate the aggregation parameters for calculating distance statistics when at least one of ("sum", "distance"), ("avg", "distance"), and ("count") illustrated in FIG. 2 is set in the configuration file.
FIG. 14 is an explanatory diagram showing an example of a process of generating aggregation parameters for calculating distance statistics. In the example shown in FIG. 14, three aggregation methods are set in the configuration file. The point aggregation generator 71 therefore generates aggregation parameters for calculating the sum and the average of the distances between the records of the source table and the records of the target table, and an aggregation parameter for calculating the number of associated source table records. For example, as in the aggregation list P21 illustrated in FIG. 14, the point aggregation generator 71 may generate aggregation parameters that associate the column name of the source table to be aggregated, the column name of the target table to be associated, the aggregation content (distance), and the aggregation function.
The aggregated data R21 illustrated in FIG. 14 shows the result of aggregating the map data M11 based on the aggregation parameter for calculating the sum of distances.
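The three point aggregation methods ("sum" and "avg" of "distance", and "count") can be illustrated with a short sketch. The function names, the sample coordinates, the use of planar Euclidean distance, and the "within `max_distance`" association condition are assumptions for illustration; the embodiment does not fix a particular distance function here.

```python
import math


def distance(p, q):
    """Planar Euclidean distance (an assumption; any distance could be used)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def point_reduce(target, source, max_distance):
    """For each target record, aggregate the source points within max_distance."""
    result = {}
    for t_id, t_point in target.items():
        # Keep only the distances to source points that satisfy the condition.
        dists = [d for d in (distance(t_point, s) for s in source)
                 if d <= max_distance]
        result[t_id] = {
            "sum_distance": sum(dists),
            "avg_distance": sum(dists) / len(dists) if dists else None,
            "count": len(dists),
        }
    return result


# Hypothetical target records (id -> point) and source points.
target = {"t1": (0.0, 0.0), "t2": (10.0, 10.0)}
source = [(3.0, 4.0), (0.0, 5.0), (6.0, 8.0), (50.0, 50.0)]
features = point_reduce(target, source, max_distance=6.0)
```

In this sketch, record t1 is associated with two source points at distance 5 each, so its sum is 10, its average is 5, and its count is 2.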
The area aggregation generator 72 generates an aggregation parameter for calculating a statistic of the areas computed from the value of the second geographical attribute. As with the point aggregation generator 71, the records of the second table considered here are those associated with the respective records of the first table.
For example, the area aggregation generator 72 may generate the aggregation parameters for calculating area statistics when at least one of ("sum", "areaSize"), ("avg", "areaSize"), and ("count") illustrated in FIG. 2 is set in the configuration file.
FIG. 15 is an explanatory diagram showing an example of a process of generating aggregation parameters for calculating area statistics. In the example shown in FIG. 15, three aggregation methods are set in the configuration file. The area aggregation generator 72 therefore generates aggregation parameters for calculating the sum and the average of the areas of the source table records associated with each record of the target table, and an aggregation parameter for calculating the number of associated source table records. For example, as in the aggregation list P22 illustrated in FIG. 15, the area aggregation generator 72 may generate aggregation parameters that associate the column name of the source table to be aggregated, the aggregation content (area), and the aggregation function.
The aggregated data R22 illustrated in FIG. 15 shows the result of aggregating the map data M14 based on the aggregation parameter for calculating the sum of the areas.
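The area aggregation can be sketched in the same way. The sketch below assumes that each area is a simple polygon given as a list of (x, y) vertices and that the association between target records and source areas has already been made; the function names and sample data are hypothetical.

```python
def polygon_area(vertices):
    """Area of a simple polygon by the shoelace formula."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def area_reduce(mapped):
    """mapped: target record id -> list of polygons associated with it."""
    result = {}
    for t_id, polygons in mapped.items():
        sizes = [polygon_area(p) for p in polygons]
        result[t_id] = {
            "sum_areaSize": sum(sizes),
            "avg_areaSize": sum(sizes) / len(sizes) if sizes else None,
            "count": len(sizes),
        }
    return result


# Hypothetical map data: t1 is associated with a 2 x 2 square and a right
# triangle with legs 4 and 3; t2 has no associated area.
mapped = {
    "t1": [[(0, 0), (2, 0), (2, 2), (0, 2)], [(0, 0), (4, 0), (0, 3)]],
    "t2": [],
}
area_features = area_reduce(mapped)
```

Here the square has area 4 and the triangle has area 6, so t1 receives a sum of 10, an average of 5, and a count of 2.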
The numerical aggregation generator 61 generates aggregation parameters representing a method of aggregating the data of each row by the values of a column of the second table that contains a numeric (Numeric) attribute (hereinafter referred to as a numerical attribute). Specifically, the numerical aggregation generator 61 calculates a statistic of the numerical values based on a designated aggregation method.
The aggregation method may be designated in any manner. As with the geo aggregation generator 70, the input unit 10 may, for example, receive the designation of the aggregation method. Specifically, as exemplified in the C3 portion of the configuration file in FIG. 2, an aggregation method for numerical attributes may be defined, and aggregation parameters may be generated according to the defined aggregation method. In the example shown in FIG. 2, the configuration specifies that aggregation parameters for calculating the sum and the average of numerical attribute columns are to be generated.
The aggregation parameter generator 60 (more specifically, each generator included in the aggregation parameter generator 60) may store the generated aggregation parameters in the storage unit 80. FIG. 16 is an explanatory diagram showing an example of the generated aggregation parameters. As in the examples described above, the input unit 10 receives the target table T, the source table S1, and the source table S2 illustrated in FIG. 16, together with the C3 portion of the configuration file illustrated in FIG. 2.
The aggregation parameter P23 is an example of an aggregation parameter for a numerical attribute column of the source table S2, and the aggregation parameter P24 is an example of an aggregation parameter for a numerical attribute column of the source table S1. From this information, the aggregation parameter generator 60 (more specifically, each generator included in the aggregation parameter generator 60) generates the 16 aggregation parameters P21 to P24 illustrated in FIG. 16.
The feature quantity generation function generator 81 generates feature quantity generation functions for generating the above-described feature quantities from the first table and the second table. Specifically, the feature quantity generation function generator 81 generates a feature quantity generation function by combining the join condition (map parameter) and the aggregation condition (aggregation parameter) described above. The feature quantity generation function generator 81 may also generate a feature quantity generation function by combining an extraction condition (filter parameter) with the join condition and the aggregation condition.
In the present embodiment, the feature quantity generation function generator 81 may also generate map parameters in which, among the join conditions (map parameters), a map parameter for a geographical attribute and a map parameter for a temporal attribute are combined in advance. For example, the feature quantity generation function generator 81 may decide to combine map parameters for geographical attributes with map parameters for temporal attributes when the parameter "time_spatial_map_combination", shown in the C4 portion of the configuration file illustrated in FIG. 2, is set to "True".
FIG. 17 is an explanatory diagram showing an example of combining map parameters. Suppose, for example, that there are six map parameters P11 and P12 for geographical attributes and two map parameters P15 for temporal attributes. The feature quantity generation function generator 81 may then generate new map parameters P31 by pairing each map parameter for a geographical attribute with each map parameter for a temporal attribute. In the example shown in FIG. 17, 6 × 2 = 12 new map parameters are generated.
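The pairing of geographical and temporal map parameters amounts to a Cartesian product, as the following sketch shows. The map parameter names are placeholders, and the function `combine_map_parameters` is hypothetical; only the flag name `time_spatial_map_combination` is taken from the configuration file.

```python
from itertools import product


def combine_map_parameters(geo_maps, time_maps, time_spatial_map_combination=True):
    """Pair every geographical map parameter with every temporal one."""
    if not time_spatial_map_combination:
        return []
    return [{"geo": g, "time": t} for g, t in product(geo_maps, time_maps)]


# Placeholder names standing in for six geographical and two temporal
# map parameters.
geo_maps = [f"geo_map_{i}" for i in range(1, 7)]
time_maps = ["time_map_1", "time_map_2"]
combined_maps = combine_map_parameters(geo_maps, time_maps)
```

With six geographical and two temporal map parameters, the product contains 6 × 2 = 12 combined map parameters.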
The procedure by which the feature quantity generation function generator 81 generates feature quantity generation functions is described specifically below. Here, it is assumed that the target table T and the source tables S1 and S2 illustrated in FIG. 13 are input. The variable to be predicted (target variable) is the variable in the target table T that represents the number of passengers (pickup_number).
FIG. 18 is an explanatory diagram showing an example of a method of generating feature quantity generation functions by combining parameters. FIG. 18(a) shows an example of the combinations used to generate feature quantity generation functions that produce feature quantities from the target table T and the source table S1. FIG. 18(b) shows an example of the combinations used to generate feature quantity generation functions that produce feature quantities from the target table T and the source table S2. In the example shown in FIG. 18(b), map parameters obtained by combining a map parameter for a geographical attribute with a map parameter for a temporal attribute are used.
In the example shown in FIG. 18(a), four map parameters and nine aggregation parameters are generated. The feature quantity generation function generator 81 selects one parameter from the map parameters and one from the aggregation parameters to generate each combination. In this example, 4 × 9 = 36 combinations are generated. When filter parameters have been generated, the feature quantity generation function generator 81 selects one parameter each from the map parameters, the filter parameters, and the aggregation parameters to generate each combination.
Similarly, in the example shown in FIG. 18(b), 14 map parameters and 7 aggregation parameters are generated. The feature quantity generation function generator 81 selects one parameter from the map parameters and one from the aggregation parameters to generate each combination. In this example, 14 × 7 = 98 combinations are generated. In total, 36 + 98 = 134 parameter combinations are generated.
Next, the feature quantity generation function generator 81 generates feature quantity generation functions based on the generated combinations. Specifically, the feature quantity generation function generator 81 converts the parameters included in each generated combination into the form of a query language used to manipulate and define table data. The feature quantity generation function generator 81 may use, for example, SQL as the query language.
In doing so, the feature quantity generation function generator 81 may generate a feature quantity generation function by applying each parameter to a template for generating an SQL statement. Specifically, a template into which the parameters are substituted to produce an SQL statement is prepared in advance, and the feature quantity generation function generator 81 may generate SQL statements by applying the parameters included in each generated combination to the template in turn. In this case, the feature quantity generation function is defined as an SQL statement, and each selected parameter corresponds to a parameter for generating the SQL statement.
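The template-based generation of SQL statements can be sketched as follows. The template text, the table aliases `t` and `s`, the column names, and the function `build_queries` are all hypothetical and serve only to show how one map parameter, one filter parameter, and one aggregation parameter per combination might be substituted into a prepared template.

```python
from itertools import product
from string import Template

# Hypothetical template; "target", the aliases t and s, and "id" are assumptions.
SQL_TEMPLATE = Template(
    "SELECT t.id, $func($column) AS $name "
    "FROM target t LEFT JOIN $source s "
    "ON $map_condition AND $filter_condition "
    "GROUP BY t.id"
)


def build_queries(source, map_params, reduce_params, filter_params=("1 = 1",)):
    """Apply every (map, filter, aggregation) combination to the template."""
    queries = []
    for map_cond, filter_cond, (func, column) in product(
            map_params, filter_params, reduce_params):
        queries.append(SQL_TEMPLATE.substitute(
            func=func, column=column, name=f"feature_{len(queries)}",
            source=source, map_condition=map_cond,
            filter_condition=filter_cond))
    return queries


# Hypothetical parameters: two join conditions and three aggregation methods.
map_params = ["t.area_code = s.area_code",
              "ABS(t.x - s.x) + ABS(t.y - s.y) <= 1.0"]
reduce_params = [("SUM", "s.price"), ("AVG", "s.price"), ("COUNT", "s.id")]
queries = build_queries("source1", map_params, reduce_params)
```

Two map parameters and three aggregation parameters yield 2 × 3 = 6 statements, one per combination; the default filter `1 = 1` stands in for the case where no filter parameter is generated.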
Defining feature quantities as combinations of these parameters makes it possible to express many kinds of feature quantity generation functions as combinations of simple elements. A large number of feature quantity candidates can therefore be generated efficiently from a plurality of tables. In the example described above, 4 × 9 + 14 × 7 = 134 kinds of feature quantities can be generated easily, merely by generating 4 map parameters and 9 aggregation parameters, and 14 map parameters and 7 aggregation parameters. In addition, because the definition of each parameter can be reused once it has been generated, the effort required to generate the feature quantity generation functions themselves is also reduced.
The feature quantity generator 82 generates feature quantities using the feature quantity generation functions. Suppose, for example, that a feature quantity generation function includes a parameter for calculating the distance statistic described above. In this case, the feature quantity generator 82 may calculate the distance statistic by performing, based on the feature quantity generation function, an operation that aggregates the records of the second table satisfying a predetermined condition for each record of the first geographical attribute.
Specifically, as the operation that aggregates the records of the second table, the feature quantity generator 82 may calculate, for each record of the first geographical attribute, at least one of the sum and the average of the distances to the geographical attributes of the second table that satisfy the predetermined condition. The feature quantity generator 82 may then add at least one of the calculated sum and average of the distances to the attributes of the first table as a feature quantity.
Alternatively, as the operation that aggregates the records of the second table, the feature quantity generator 82 may calculate, for each record of the first geographical attribute, the number of records of the second table whose geographical attribute satisfies the predetermined condition. The feature quantity generator 82 may then add the calculated number of records to the attributes of the first table as a feature quantity.
Because the feature quantity generator 82 thus also adds the generated feature quantities to the attributes of the first table, it can be called attribute addition means. In addition, because the feature quantities generated by the feature quantity generator 82 are the candidates from which the feature quantity selector 83, described later, selects feature quantities, they can also be called feature quantity candidates.
The present embodiment has described the case in which the feature quantity generator 82 generates feature quantity candidates using feature quantity generation functions. However, the feature quantity generator 82 may instead generate feature quantity candidates directly from the first table and the second table, using a similarity function together with a join condition and an aggregation condition. As described above, the join condition is a condition for joining a record of the first table containing a value of the first attribute with a record of the second table containing a value of the second attribute when the similarity calculated from the two values satisfies a condition. The aggregation condition is a condition expressed by an aggregation method applied to a plurality of records in the second table and the column to be aggregated.
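Direct generation of a feature quantity candidate from a similarity function, a join condition, and an aggregation condition can be sketched as follows. The character-set Jaccard similarity, the threshold, and the table contents are illustrative assumptions only; the embodiment leaves the similarity function open.

```python
def jaccard(a, b):
    """Jaccard similarity of the character sets of two strings (illustrative)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def generate_feature(first_table, second_table, similarity, threshold,
                     column, reducer):
    """Join rows whose similarity meets the threshold, then aggregate column."""
    feature = []
    for row1 in first_table:
        joined = [row2[column] for row2 in second_table
                  if similarity(row1["key"], row2["key"]) >= threshold]
        feature.append(reducer(joined) if joined else None)
    return feature


# Hypothetical tables; "key" is the attribute compared by the similarity function.
first_table = [{"key": "tokyo"}, {"key": "osaka"}]
second_table = [
    {"key": "tokio", "sales": 10},
    {"key": "tokyo", "sales": 5},
    {"key": "osaka", "sales": 7},
]
candidate = generate_feature(first_table, second_table, jaccard,
                             threshold=0.5, column="sales", reducer=sum)
```

In this sketch "tokyo" is joined with both "tokio" (similarity 0.6) and "tokyo" (similarity 1.0), so its aggregated value is 15, while "osaka" is joined only with its exact match.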
For example, when there are a plurality of join conditions and a plurality of aggregation conditions, the feature quantity generator 82 may generate as many feature quantities as there are combinations of the join conditions and the aggregation conditions. Combining join conditions and aggregation conditions in this way provides the same effect as the process in which the feature quantity generation function generator 81 generates feature quantity generation functions.
The feature quantity selector 83 selects, from the generated feature quantities, those best suited to prediction. Any feature selection method may be used. For example, the feature quantity selector 83 may select feature quantities using L1 regularization. However, the algorithm used for feature selection is not limited to L1 regularization. The feature quantity selector 83 may select the feature quantities best suited to prediction according to whichever selection algorithm is used.
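Feature selection by L1 regularization can be illustrated with a small lasso solver that uses coordinate descent. This is a sketch under stated assumptions, not the selector of the embodiment: the objective (1/(2n))·||y − Xw||² + α·||w||₁, the value of α, and the toy data are chosen for illustration, and a practical system would more likely rely on an existing library.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; values within [-alpha, alpha] become 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_select(X, y, alpha, sweeps=100):
    """Return the indices of features with nonzero lasso weights, and the weights."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            # Correlation of column j with the residual that excludes column j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(d) if k != j))
                for i in range(n)) / n
            norm = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / norm if norm else 0.0
    return [j for j in range(d) if w[j] != 0.0], w


# Toy data with mutually orthogonal columns: y = 2.0 * x0 + 0.1 * x2.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.1, 1.9, -2.1, -1.9]
selected, weights = lasso_select(X, y, alpha=0.5)
```

With α = 0.5 the weak contribution of the third column and the irrelevant second column are both shrunk to exactly zero, so only the first feature quantity candidate is selected.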
The output unit 90 outputs the generated feature quantities. The output unit 90 may output only the feature quantities selected by the feature quantity selector 83, or may output all the feature quantities generated by the feature quantity generator 82.
The learning unit 91 learns a prediction model using the generated feature quantities. For example, the learning unit 91 learns the prediction model using the added attributes as feature quantities. Specifically, the learning unit 91 applies the data of the first table and the second table to the generated feature quantities to generate training data. The learning unit 91 then learns a model that predicts the value of the prediction target, using the generated feature quantities as candidate explanatory variables. Any model learning method may be used.
The prediction unit 92 performs prediction using the model learned by the learning unit 91. Specifically, the prediction unit 92 applies the data of the first table and the second table to the generated feature quantities to generate prediction data. The prediction unit 92 then applies the generated prediction data to the learned model to obtain a prediction result.
The input unit 10, the geocoder 20, the map parameter generator 30, the filter parameter generator 50, the aggregation parameter generator 60, the feature quantity generation function generator 81, the feature quantity generator 82, the feature quantity selector 83, the output unit 90, the learning unit 91, and the prediction unit 92 are realized by a processor of a computer (for example, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), or an FPGA (field-programmable gate array)) that operates according to a program (information processing program). More specifically, the map parameter generator 30 is realized by the geomap generator 40 (in further detail, the distance map generator 41, the inclusion map generator 42, the overlap map generator 43, and the same area map generator 44), the time difference map generator 31, the map generator 32, and the attribute specifying unit 33. The aggregation parameter generator 60 is realized by the geo aggregation generator 70 (in further detail, the point aggregation generator 71 and the area aggregation generator 72) and the numerical aggregation generator 61.
For example, the program may be stored in the storage unit 80, and the processor may read the program and operate, according to the program, as the input unit 10, the geocoder 20, the map parameter generator 30, the filter parameter generator 50, the aggregation parameter generator 60, the feature quantity generation function generator 81, the feature quantity generator 82, the feature quantity selector 83, the output unit 90, the learning unit 91, and the prediction unit 92. The functions of the information processing system may also be provided in SaaS (Software as a Service) form.
The input unit 10, the geocoder 20, the map parameter generator 30, the filter parameter generator 50, the aggregation parameter generator 60, the feature quantity generation function generator 81, the feature quantity generator 82, the feature quantity selector 83, the output unit 90, the learning unit 91, and the prediction unit 92 may each be realized by dedicated hardware. Part or all of the components of each device may be realized by general-purpose or dedicated circuitry, a processor, or the like, or by a combination of these. They may be configured as a single chip or as a plurality of chips connected via a bus. Part or all of the components of each device may also be realized by a combination of the above-described circuitry and a program.
When part or all of the components of each device are realized by a plurality of information processing devices, circuits, or the like, those information processing devices and circuits may be arranged centrally or in a distributed manner. For example, the information processing devices and circuits may be realized in a form in which they are connected to one another via a communication network, such as a client-server system or a cloud computing system. The information processing system 100 of the present embodiment may also be realized as a single information processing device. Because part or all of the information processing system 100 of the present embodiment performs the process of generating the feature quantities described above, a device that includes the function of performing this process can be called a feature quantity generation device.
Next, the operation of the information processing system 100 of the present embodiment will be described. FIG. 19 is a flowchart showing an example of a process of generating a join condition.
The input unit 10 acquires a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute (step S11). The input unit 10 also receives a geographical relationship and a degree of the geographical relationship (step S12). The map parameter generator 30 generates a join condition for joining records included in the first table with records included in the second table such that the relationship between the value of the first geographical attribute and the value of the second geographical attribute satisfies the degree of the geographical relationship (step S13).
FIG. 20 is a flowchart showing another example of the process of generating a join condition. The input unit 10 acquires a first table including a prediction target and a first temporal attribute, and a second table including a second temporal attribute (step S21). The input unit 10 also receives a temporal relationship and a degree of the temporal relationship (step S22). The map parameter generator 30 generates a join condition for joining records included in the first table with records included in the second table such that the relationship between the value of the first temporal attribute and the value of the second temporal attribute satisfies the degree of the temporal relationship (step S23).
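As a non-limiting illustration, steps S21 to S23 could be sketched in Python as follows. The relationship names `"before"` and `"after"`, the `time` field, and the function name are hypothetical and chosen only for this example. The degree of the temporal relationship is assumed to be a maximum time difference.

```python
from datetime import datetime, timedelta

def make_temporal_join_condition(relation, degree):
    """Step S23 (sketch): build a predicate over (target_record, source_record).

    relation: "before" (source precedes target) or "after" (source follows target).
    degree:   maximum allowed time difference, as a timedelta.
    """
    if relation not in ("before", "after"):
        raise ValueError(f"unsupported temporal relationship: {relation}")

    def condition(target_rec, source_rec):
        delta = target_rec["time"] - source_rec["time"]
        if relation == "after":
            delta = -delta
        # Keep the pair only if the source lies on the requested side of the
        # target and no further away than the given degree.
        return timedelta(0) <= delta <= degree

    return condition

# Step S21 (sketch): tables with hypothetical temporal attributes.
target = [{"id": "order_1", "time": datetime(2018, 6, 12, 12, 0), "label": 1}]
source = [
    {"event": "campaign_mail", "time": datetime(2018, 6, 12, 9, 0)},
    {"event": "old_mail", "time": datetime(2018, 6, 10, 9, 0)},
    {"event": "follow_up", "time": datetime(2018, 6, 12, 15, 0)},
]

# Step S22 (sketch): relationship = "before", degree = within 6 hours.
condition = make_temporal_join_condition("before", timedelta(hours=6))
joined = [(t["id"], s["event"]) for t in target for s in source if condition(t, s)]
```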
FIG. 21 is a flowchart illustrating an example of a process of generating a feature quantity. The input unit 10 acquires a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute (step S31). When the value of the second geographical attribute satisfies a predetermined condition with respect to the value of the first geographical attribute, the feature quantity generator 82 calculates a statistic of the distance (step S32) and adds the calculated statistic to the attributes of the first table as a feature quantity (step S33).
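As a non-limiting illustration, steps S31 to S33 could be sketched in Python as follows. The sketch assumes that each geographical attribute is already a planar coordinate pair in kilometres (field `xy`) and that the predetermined condition is "within a given radius". These choices, the function name, and the column names are assumptions made for this example only.

```python
import math
import statistics

def add_distance_feature(target, source, radius, stat=statistics.mean, name="dist_stat"):
    """Return copies of the target rows with one distance statistic added per row.

    For each target row, distances to all source rows are computed, those within
    `radius` are kept (the assumed condition), and `stat` is applied to them.
    Rows with no qualifying source row receive None.
    """
    enriched = []
    for t in target:
        dists = [math.dist(t["xy"], s["xy"]) for s in source]
        near = [d for d in dists if d <= radius]          # step S32: filter
        row = dict(t)
        row[name] = stat(near) if near else None          # step S32: aggregate
        enriched.append(row)                              # step S33: add attribute
    return enriched

# Step S31 (sketch): first table with a prediction target ("price"), second table
# with point locations.
target = [
    {"id": 1, "xy": (0.0, 0.0), "price": 100},
    {"id": 2, "xy": (100.0, 100.0), "price": 80},
]
source = [{"xy": (3.0, 4.0)}, {"xy": (6.0, 8.0)}, {"xy": (30.0, 40.0)}]

rows = add_distance_feature(target, source, radius=20.0, name="mean_dist_20")
```

Passing a different callable as `stat` (for example `sum` or `len`) yields the total distance or the number of qualifying records in the same way.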
FIG. 22 is a flowchart illustrating another example of the process of generating a feature quantity. The input unit 10 acquires a first table including a prediction target and a first attribute, and a second table including a second attribute (step S41). The input unit 10 also receives a similarity function used to calculate the similarity between the first attribute and the second attribute, and a condition on the similarity (for example, a similarity threshold) (step S42). The feature quantity generator 82 generates feature quantity candidates from the first table and the second table using an aggregation condition and a join condition calculated using the similarity function (step S43). The feature quantity selector 83 then selects, from the feature quantity candidates, the feature quantities optimal for prediction (step S44).
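As a non-limiting illustration, steps S41 to S44 could be sketched in Python as follows. The word-level Jaccard similarity, the thresholds, the two aggregations, and the selection criterion (largest absolute Pearson correlation with the prediction target) are all assumptions introduced for this example. The embodiment does not prescribe a particular similarity function or selection method.

```python
import math

def jaccard(a, b):
    """Hypothetical similarity function: Jaccard index over lowercase words."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0

def generate_candidates(target, source, sim, thresholds, aggregations):
    """Step S43 (sketch): one candidate column per (threshold, aggregation) pair."""
    candidates = {}
    for th in thresholds:
        for agg_name, agg in aggregations.items():
            column = []
            for t in target:
                # Join condition: similarity of the two attributes meets the threshold.
                values = [s["value"] for s in source if sim(t["name"], s["name"]) >= th]
                column.append(agg(values))
            candidates[f"{agg_name}_sim>={th}"] = column
    return candidates

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def select_best(candidates, y):
    """Step S44 (sketch): pick the candidate most correlated with the target."""
    return max(candidates, key=lambda name: abs(pearson(candidates[name], y)))

# Step S41 (sketch): hypothetical tables; "y" is the prediction target.
target = [
    {"name": "tokyo station", "y": 3.0},
    {"name": "osaka station", "y": 1.0},
    {"name": "kyoto tower", "y": 0.0},
]
source = [
    {"name": "tokyo station hotel", "value": 10.0},
    {"name": "tokyo station cafe", "value": 20.0},
    {"name": "tokyo tower cafe", "value": 5.0},
    {"name": "osaka station hotel", "value": 8.0},
]

# Step S42 (sketch): similarity function plus conditions on the similarity.
candidates = generate_candidates(
    target, source, jaccard,
    thresholds=[0.2, 0.6],
    aggregations={"count": len, "sum": sum},
)
best = select_best(candidates, [t["y"] for t in target])
```

Enumerating thresholds and aggregations mechanically is what produces many candidates without analyst involvement; the selection step then narrows them down.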
As described above, in the present embodiment, the input unit 10 acquires a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute. The input unit 10 also receives a geographical relationship and a degree of the geographical relationship. The map parameter generator 30 then generates a join condition for joining records included in the first table with records included in the second table such that the relationship between the value of the first geographical attribute and the value of the second geographical attribute satisfies the degree of the geographical relationship. Similarly, in the present embodiment, the input unit 10 acquires a first table including a prediction target and a first temporal attribute, and a second table including a second temporal attribute. The input unit 10 also receives a temporal relationship and a degree of the temporal relationship. The map parameter generator 30 then generates a join condition for joining records included in the first table with records included in the second table such that the relationship between the value of the first temporal attribute and the value of the second temporal attribute satisfies the degree of the temporal relationship. This reduces the man-hours required to associate a plurality of pieces of information via geographical or temporal information. As a result, the load on a computer that processes information expressed in diverse representations can be reduced.
Further, in the present embodiment, the input unit 10 acquires a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute. When the value of the second geographical attribute satisfies a predetermined condition with respect to the value of the first geographical attribute, the feature quantity generator 82 adds, to the attributes of the first table, a statistic of the distance calculated based on the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition, as a feature quantity, that is, a variable that can affect the prediction target. Feature quantities can therefore be generated efficiently from a plurality of information sources having geographical information.
Furthermore, in the present embodiment, the input unit 10 acquires a first table including a prediction target and a first attribute, and a second table including a second attribute. The input unit 10 also receives a similarity function used to calculate the similarity between the first attribute and the second attribute, and a condition on that similarity. The feature quantity generator 82 then generates feature quantity candidates from the first table and the second table using an aggregation condition and a join condition calculated using the similarity function, and the feature quantity selector 83 selects, from the feature quantity candidates, the feature quantities optimal for prediction. The analyst man-hours required to generate feature quantities can therefore be reduced.
Next, an outline of the present invention will be described. FIG. 23 is a block diagram showing an outline of a feature quantity generation device according to the present invention. The feature quantity generation device 280 according to the present invention includes: a table acquisition means 281 (for example, the input unit 10) that acquires a first table (for example, a target table) including a prediction target and a first geographical attribute, and a second table (for example, a source table) including a second geographical attribute; and an attribute adding means 282 (for example, the feature quantity generator 82) that, when the value of the second geographical attribute satisfies a predetermined condition with respect to the value of the first geographical attribute, adds, to the attributes of the first table, a statistic of the distance calculated based on the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition, as a feature quantity, that is, a variable that can affect the prediction target.
With such a configuration, feature quantities can be generated efficiently from a plurality of information sources having geographical information.
The attribute adding means 282 may calculate the distance statistic by performing, for each record of the first geographical attribute, an operation that aggregates the records of the second table satisfying the predetermined condition.
Specifically, as the operation of aggregating the records of the second table, the attribute adding means 282 may calculate, for each record of the first geographical attribute, at least one of the sum and the average of the distances to the geographical attributes of the second table that satisfy the predetermined condition, and add the result to the attributes of the first table.
Alternatively, as the operation of aggregating the records of the second table, the attribute adding means 282 may calculate, for each record of the first geographical attribute, the number of records of the geographical attribute of the second table that satisfy the predetermined condition, and add the result to the attributes of the first table.
The feature quantity generation device 280 may also include a learning means (for example, the learning unit 91) that learns a prediction model using the added attribute as a feature quantity.
FIG. 24 is a schematic block diagram showing the configuration of a computer according to at least one embodiment. The computer 1000 includes a processor 1001, a main storage device 1002, an auxiliary storage device 1003, and an interface 1004.
The information processing system described above is implemented in the computer 1000. The operation of each processing unit described above is stored in the auxiliary storage device 1003 in the form of a program (a join condition generation program). The processor 1001 reads the program from the auxiliary storage device 1003, loads it into the main storage device 1002, and executes the above processing according to the program.
In at least one embodiment, the auxiliary storage device 1003 is an example of a non-transitory tangible medium. Other examples of non-transitory tangible media include a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, and a semiconductor memory connected via the interface 1004. When the program is distributed to the computer 1000 over a communication line, the computer 1000 that receives the distribution may load the program into the main storage device 1002 and execute the above processing.
The program may realize only some of the functions described above. Furthermore, the program may be a so-called difference file (difference program) that realizes the above-described functions in combination with another program already stored in the auxiliary storage device 1003.
Although the present invention has been described above with reference to the embodiments and examples, the present invention is not limited to those embodiments and examples. Various changes that can be understood by those skilled in the art may be made to the configuration and details of the present invention within its scope.
This application claims priority based on U.S. Provisional Application No. 62/568,448, filed on October 5, 2017, the entire disclosure of which is incorporated herein.
10 input unit
20 geocoder
30 map parameter generator
31 time difference map generator
32 map generator
33 attribute specifying unit
40 geo map generator
41 distance map generator
42 inclusion map generator
43 overlap map generator
44 same area map generator
50 filter parameter generator
51 filter generator
60 aggregation parameter generator
61 numerical aggregation generator
70 geo aggregation generator
71 point aggregation generator
72 area aggregation generator
80 storage unit
81 feature quantity generation function generator
82 feature quantity generator
83 feature quantity selector
90 output unit
91 learning unit
92 prediction unit
Claims (9)
1. A feature quantity generation device comprising:
table acquisition means for acquiring a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute; and
attribute adding means for, when the value of the second geographical attribute satisfies a predetermined condition with respect to the value of the first geographical attribute, adding, to the attributes of the first table, a statistic of the distance calculated based on the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition, as a feature quantity that is a variable that can affect the prediction target.
2. The feature quantity generation device according to claim 1, wherein the attribute adding means calculates the distance statistic by performing, for each record of the first geographical attribute, an operation of aggregating the records of the second table that satisfy the predetermined condition.
3. The feature quantity generation device according to claim 1 or claim 2, wherein, as the operation of aggregating the records of the second table, the attribute adding means calculates, for each record of the first geographical attribute, at least one of the sum and the average of the distances to the geographical attributes of the second table that satisfy the predetermined condition, and adds the result to the attributes of the first table.
4. The feature quantity generation device according to claim 1 or claim 2, wherein, as the operation of aggregating the records of the second table, the attribute adding means calculates, for each record of the first geographical attribute, the number of records of the geographical attribute of the second table that satisfy the predetermined condition, and adds the result to the attributes of the first table.
5. The feature quantity generation device according to any one of claims 1 to 4, further comprising learning means for learning a prediction model using the added attribute as a feature quantity.
6. A feature quantity generation method comprising:
acquiring a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute; and
when the value of the second geographical attribute satisfies a predetermined condition with respect to the value of the first geographical attribute, adding, to the attributes of the first table, a statistic of the distance calculated based on the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition, as a feature quantity that is a variable that can affect the prediction target.
7. The feature quantity generation method according to claim 6, wherein the distance statistic is calculated by performing, for each record of the first geographical attribute, an operation of aggregating the records of the second table that satisfy the predetermined condition.
8. A feature quantity generation program for causing a computer to execute:
a table acquisition process of acquiring a first table including a prediction target and a first geographical attribute, and a second table including a second geographical attribute; and
an attribute adding process of, when the value of the second geographical attribute satisfies a predetermined condition with respect to the value of the first geographical attribute, adding, to the attributes of the first table, a statistic of the distance calculated based on the value of the first geographical attribute and the value of the second geographical attribute satisfying the condition, as a feature quantity that is a variable that can affect the prediction target.
9. The feature quantity generation program according to claim 8, causing the computer, in the attribute adding process, to calculate the distance statistic by performing, for each record of the first geographical attribute, an operation of aggregating the records of the second table that satisfy the predetermined condition.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201762568448P | 2017-10-05 | 2017-10-05 | |
US62/568448 | 2017-10-05 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019069506A1 (en) | 2019-04-11 |
Family
ID=65994880
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2018/022428 WO2019069506A1 (en) | 2017-10-05 | 2018-06-12 | Feature value generation device, feature value generation method, and feature value generation program |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2019069506A1 (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11219367A (en) * | 1998-02-03 | 1999-08-10 | Nippon Telegr & Teleph Corp <Ntt> | Connection processing method and device for different kinds of data by address information |
JP2003527649A (en) * | 1999-04-28 | 2003-09-16 | アリーナ・フアーマシユーチカルズ・インコーポレーテツド | System and method for database similarity join |
JP2013542478A (en) * | 2010-08-25 | 2013-11-21 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Geospatial database integration method and device |
WO2017090475A1 (en) * | 2015-11-25 | 2017-06-01 | 日本電気株式会社 | Information processing system, function creation method, and function creation program |
2018-06-12 WO PCT/JP2018/022428 patent/WO2019069506A1/en active Application Filing
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 18864530; Country of ref document: EP; Kind code of ref document: A1 |