WO2024048305A1

WO2024048305A1 - Information processing system and information processing method

Info

Publication number: WO2024048305A1
Application number: PCT/JP2023/029732
Authority: WO
Inventors: 雄介熊谷; 龍道本
Original assignee: 株式会社博報堂Ｄｙホールディングス
Priority date: 2022-08-29
Filing date: 2023-08-17
Publication date: 2024-03-07
Also published as: JP2024032488A; JP7260704B1

Abstract

In the present invention, with regard to a plurality of first elements in a first set, a first data set that includes data describing a feature of each of the plurality of first elements is acquired. With regard to a plurality of second elements in a second set, a second data set that includes data describing a feature of each of the plurality of second elements is acquired. Similarity concerning data structure between the first data set and the second data set is evaluated on the basis of a comparison of a neighborhood graph of the first set based on similarity between the plurality of first elements and a neighborhood graph of the second set based on similarity between the plurality of second elements.

Description

Information processing system and information processing method

Cross-reference of related applications

This international application claims priority based on Japanese Patent Application No. 2022-136165, filed with the Japan Patent Office on August 29, 2022, and the entire contents of Japanese Patent Application No. 2022-136165 are incorporated by reference into this international application.

The present disclosure relates to an information processing system and an information processing method.

Conventionally, consumer purchasing behavior has been analyzed based on product sales data. Consumers' exposure to mass media and network content has also been analyzed.

Data fusion technology, which combines multiple pieces of data collected by different means on the basis of a common variable, is also known. Patent Document 1 discloses a technique for combining a first data set related to a first consumer group and a second data set related to a second consumer group using variables shared between the first data set and the second data set.

Patent Document 1: Japanese Patent Application Publication No. 2016-126609

When attempting to combine a second data set with a first data set, multiple types of data sets may be prepared as candidates for the second data set to be combined. For example, when attempting to combine a data set related to consumer purchasing behavior as the second data set with the first data set, multiple data sets related to the purchasing behavior of different sets of consumers may be available from data vendors as candidates for combination.

Alternatively, multiple data sets describing purchasing behavior using different parameters may be prepared by processing sales history such as POS (Point of Sale) data. Examples of parameters include the number of products purchased and the purchase price.

Here, consider the case where the first data set relates to a consumer set in which consumers of all ages and all genders are approximately uniformly represented. In this case, the accuracy of data fusion is expected to be higher when the second data set to be combined relates to a consumer set in which consumers of all ages and all genders are likewise approximately uniformly represented, rather than a data set of only female consumers.

In other words, the accuracy of data fusion between the first data set and the second data set is considered to change depending on the consumer set handled by the second data set. Similarly, the accuracy of data fusion between the first data set and the second data set is considered to vary depending on the type of purchasing behavior parameter described by the second data set. This is because the distribution of consumers on the feature space changes depending on the type of parameter.

Thus, the accuracy of data fusion between the first data set and the second data set depends on the similarity of the data structures between the first data set and the second data set. This dependence is not limited to purchasing behavior data sets.

Therefore, according to one aspect of the present disclosure, it is desirable to provide a system and method capable of evaluating the similarity in data structure between a first data set and a second data set for various types of data sets.

According to one aspect of the present disclosure, an information processing system is provided. The information processing system includes a first acquisition unit, a second acquisition unit, and an evaluation unit. The first acquisition unit is configured to acquire, with respect to the plurality of first elements in the first set, a first data set including data describing characteristics of each of the plurality of first elements. The second acquisition unit is configured to acquire, with respect to the plurality of second elements in the second set, a second data set including data describing characteristics of each of the plurality of second elements.

The evaluation unit is configured to evaluate the similarity in data structure between the first data set and the second data set based on a comparison between a neighborhood graph of the first set and a neighborhood graph of the second set. The neighborhood graph of the first set is based on the similarity between the plurality of first elements determined from the first data set. The neighborhood graph of the second set is based on the similarity between the plurality of second elements determined from the second data set.

The above neighborhood graph is related to the distribution of multiple elements on the feature space. Therefore, according to the above comparison, it is possible to evaluate the similarity between the data structure of the first data set and the data structure of the second data set.

This evaluation is useful, for example, in determining compatibility regarding data fusion between the first data set and the second data set. Evaluation is useful, for example, in selecting datasets to be combined in data fusion. However, the evaluation is not limited to data fusion applications.

According to one aspect of the present disclosure, the evaluation unit may be configured to evaluate the similarity in data structure between the first data set and the second data set, as the comparison between the neighborhood graph of the first set and the neighborhood graph of the second set, based on a comparison using the graph Laplacian matrix corresponding to the neighborhood graph of the first set and the graph Laplacian matrix corresponding to the neighborhood graph of the second set.

According to one aspect of the present disclosure, the second acquisition unit may acquire a plurality of evaluation target datasets as the second dataset. Each of the plurality of evaluation target data sets may be a data set including data describing characteristics of each of the plurality of elements with respect to the plurality of elements in the corresponding set. The plurality of data sets to be evaluated may be data sets relating to different sets, or data sets having different described characteristics.

The evaluation unit may be configured to evaluate, for each of the plurality of evaluation target data sets, the similarity in data structure between the corresponding evaluation target data set and the first data set based on a comparison between the neighborhood graph of the first set and the neighborhood graph of the corresponding set, the latter being based on the similarity between the plurality of elements in the corresponding set determined from the corresponding evaluation target data set.

According to one aspect of the present disclosure, the information processing system may include a selection unit. The selection unit may be configured to select, as a combination target, a data set with the highest evaluation of similarity regarding data structure from among the plurality of evaluation target data sets. According to such selection, a second data set suitable for combination with the first data set can be selected from a plurality of data sets.

According to one aspect of the present disclosure, the information processing system may further include a combining unit. The combining unit may be configured to combine the first data set and the data set selected as the combination target so as to associate data describing characteristics of similar elements between the first set and the corresponding set. Such a combination enables highly accurate data fusion between the first data set and the second data set.

According to one aspect of the present disclosure, the evaluation unit may include a first similarity calculation unit, a second similarity calculation unit, a first eigenvalue calculation unit, and a second eigenvalue calculation unit.

The first similarity calculation unit is configured to calculate the similarity between the plurality of first elements based on the first data set. The second similarity calculation unit is configured to calculate the similarity between the plurality of second elements based on the second data set.

The first eigenvalue calculation unit is configured to calculate a group of eigenvalues of the first graph Laplacian matrix as a group of first eigenvalues based on the degree of similarity between the plurality of first elements. The first graph Laplacian matrix is the graph Laplacian matrix corresponding to a neighborhood graph in which each of the plurality of first elements is connected to one or more first elements in the first set whose degree of similarity satisfies a predetermined condition.

The second eigenvalue calculation unit is configured to calculate a group of eigenvalues of the second graph Laplacian matrix as a group of second eigenvalues based on the degree of similarity between the plurality of second elements. The second graph Laplacian matrix is the graph Laplacian matrix corresponding to a neighborhood graph in which each of the plurality of second elements is connected to one or more second elements in the second set whose degree of similarity satisfies a predetermined condition.

The evaluation unit may be configured to evaluate the similarity in data structure between the first data set and the second data set based on a comparison between the group of first eigenvalues and the group of second eigenvalues.

The neighborhood graph is related to the distribution of multiple elements on the feature space. When two neighborhood graphs are similar, the groups of eigenvalues of the two corresponding graph Laplacian matrices are also similar. According to the above comparison, it is possible to evaluate the similarity between the neighborhood graph regarding the first set and the neighborhood graph regarding the second set.
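As a small illustration of why eigenvalue groups are a convenient basis for comparing graphs, the following sketch checks that relabeling the nodes of a graph leaves the eigenvalues of its graph Laplacian matrix unchanged. The three-node path graph and the permutation are hypothetical values chosen for the example, and an undirected graph is assumed here only so that a symmetric eigenvalue routine can be used.

```python
import numpy as np

# Graph Laplacian (degree matrix minus adjacency matrix) of a
# hypothetical undirected path graph 0 - 1 - 2.
laplacian = np.array([
    [1.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 1.0],
])

# Relabel the nodes: the same graph, with rows and columns permuted.
perm = np.array([2, 0, 1])
relabeled = laplacian[np.ix_(perm, perm)]

# Eigenvalues ranked in descending order of magnitude.
original_spectrum = np.sort(np.linalg.eigvalsh(laplacian))[::-1]
relabeled_spectrum = np.sort(np.linalg.eigvalsh(relabeled))[::-1]

# The two groups of eigenvalues agree although the node order differs.
same = np.allclose(original_spectrum, relabeled_spectrum)
```

Because the eigenvalues do not depend on how nodes are labeled, two neighborhood graphs can be compared without first deciding which element of one set corresponds to which element of the other.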

The neighborhood graph corresponds to the data structure of the corresponding dataset. Therefore, according to the above comparison, it is possible to evaluate the similarity between the data structure of the first data set and the data structure of the second data set.

According to one aspect of the present disclosure, the evaluation unit may evaluate the similarity in data structure by comparing each of the plurality of first eigenvalues with the eigenvalue of the same rank among the plurality of second eigenvalues, based on the rank of each of the plurality of first eigenvalues within the group of first eigenvalues and the rank of each of the plurality of second eigenvalues within the group of second eigenvalues. The plurality of first eigenvalues and the plurality of second eigenvalues may be ranked based on the magnitude of the eigenvalue. Such an evaluation makes it possible to more appropriately evaluate the similarity in data structure between data sets.

According to one aspect of the present disclosure, the evaluation unit may evaluate the similarity in data structure by comparing each of the eigenvalues from the first rank to a predetermined rank, in descending order of magnitude, in the group of first eigenvalues with the eigenvalue of the same rank among the plurality of second eigenvalues.

According to one aspect of the present disclosure, the evaluation unit may be configured to calculate the similarity evaluation value regarding the data structure using the sum of squares of errors. Each of the errors may be a difference between a first eigenvalue of a corresponding rank among the plurality of first eigenvalues and a second eigenvalue of a corresponding rank among the plurality of second eigenvalues. By using the sum of squared errors, it is possible to more appropriately evaluate the similarity in data structure between data sets.

According to one aspect of the present disclosure, the first graph Laplacian matrix may be the graph Laplacian matrix of a nearest neighbor graph in which each of the plurality of first elements is connected to the first element having the highest degree of similarity in the first set. The second graph Laplacian matrix may be the graph Laplacian matrix of a nearest neighbor graph in which each of the plurality of second elements is connected to the second element having the highest degree of similarity in the second set.

According to one aspect of the present disclosure, when the second acquisition unit acquires a plurality of evaluation target data sets as the second data set, the second similarity calculation unit may calculate, for each of the plurality of evaluation target data sets, the degree of similarity between the plurality of elements in the corresponding set. The second eigenvalue calculation unit may calculate, for each of the plurality of evaluation target data sets, a group of eigenvalues of a graph Laplacian matrix as a group of comparison target eigenvalues, the graph Laplacian matrix corresponding to a neighborhood graph in which each of the plurality of elements in the corresponding set is connected to one or more elements in the corresponding set whose degree of similarity satisfies a predetermined condition.

The evaluation unit may evaluate, for each of the plurality of evaluation target data sets, the similarity in data structure between the first data set and the corresponding evaluation target data set based on a comparison between the group of comparison target eigenvalues based on the corresponding evaluation target data set and the group of first eigenvalues.
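The per-candidate evaluation described above, together with the selection of the best-rated candidate, might be sketched as follows. This is a minimal illustration under stated assumptions: the names `vendor_a` and `vendor_b`, the eigenvalue lists, and the fixed comparison rank `k` are hypothetical, each eigenvalue group is assumed to hold at least `k` values, and a smaller sum of squared errors is treated as a higher evaluation of similarity.

```python
import numpy as np

def squared_error(eig_a, eig_b, k):
    """Sum of squared differences between same-rank eigenvalues up to rank k."""
    a = np.sort(np.asarray(eig_a, dtype=float))[::-1][:k]  # descending order
    b = np.sort(np.asarray(eig_b, dtype=float))[::-1][:k]
    return float(np.sum((a - b) ** 2))

def select_candidate(first_eigs, candidate_eigs, k):
    """Score every candidate against the first data set and pick the best one."""
    scores = {
        name: squared_error(first_eigs, eigs, k)
        for name, eigs in candidate_eigs.items()
    }
    # Assumption: the smallest error means the most similar data structure.
    best = min(scores, key=scores.get)
    return best, scores

# Hypothetical eigenvalue groups for the first data set and two candidates.
first = [2.0, 1.0, 1.0, 0.0]
candidates = {
    "vendor_a": [2.0, 2.0, 0.0, 0.0],
    "vendor_b": [2.0, 1.0, 0.9, 0.1],
}
best, scores = select_candidate(first, candidates, k=3)
```

In this example `vendor_b` is selected, because its leading eigenvalues are closer to those of the first data set.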

According to one aspect of the present disclosure, the first data set may be a data set that describes characteristics of a plurality of people in the first set as a plurality of first elements. The second data set may be a data set that describes characteristics of a plurality of people in the second set as a plurality of second elements.

According to one aspect of the present disclosure, a computer program for causing a computer to implement at least some of the functions in the information processing system described above may be provided. According to one aspect of the present disclosure, a computer program for causing a computer to function as at least part of the first acquisition unit, the second acquisition unit, and the evaluation unit may be provided.

According to one aspect of the present disclosure, an information processing method may be provided. The information processing method may be executed by a computer. The information processing method may include, for the plurality of first elements in the first set, obtaining a first data set including data describing characteristics of each of the plurality of first elements.

The information processing method may include, regarding the plurality of second elements in the second set, obtaining a second dataset including data describing characteristics of each of the plurality of second elements.

The information processing method may include evaluating the similarity in data structure between the first data set and the second data set based on a comparison between a neighborhood graph of the first set, which is based on the similarity between the plurality of first elements determined from the first data set, and a neighborhood graph of the second set, which is based on the similarity between the plurality of second elements determined from the second data set.

According to one aspect of the present disclosure, another information processing method may be provided. Another information processing method may be performed by a computer. Another information processing method may include, with respect to the plurality of first elements in the first set, obtaining a first data set including data describing characteristics of each of the plurality of first elements.

Another information processing method may include, with respect to the plurality of second elements in the second set, obtaining a second data set including data describing characteristics of each of the plurality of second elements.

Another information processing method may include calculating the degree of similarity between the plurality of first elements based on the first data set. Another information processing method may include calculating the similarity between the plurality of second elements based on the second data set.

Another information processing method may include calculating a group of eigenvalues of the first graph Laplacian matrix as a group of first eigenvalues based on the degree of similarity between the plurality of first elements. The first graph Laplacian matrix may be the graph Laplacian matrix corresponding to a neighborhood graph in which each of the plurality of first elements is connected to one or more first elements in the first set whose degree of similarity satisfies a predetermined condition.

Another information processing method may include calculating a group of eigenvalues of the second graph Laplacian matrix as a group of second eigenvalues based on the degree of similarity between the plurality of second elements. The second graph Laplacian matrix may be the graph Laplacian matrix corresponding to a neighborhood graph in which each of the plurality of second elements is connected to one or more second elements in the second set whose degree of similarity satisfies a predetermined condition.

Another information processing method may include evaluating the similarity in data structure between the first data set and the second data set based on a comparison between the group of first eigenvalues and the group of second eigenvalues.

According to the above information processing method, it is possible to evaluate the similarity in data structure between the first data set and the second data set with respect to various types of data sets. The information processing system and information processing method described above are not limited to data fusion applications.

According to one aspect of the present disclosure, a computer program including instructions for causing a computer to execute the above-described information processing method may be provided. The computer program may be recorded on a computer-readable recording medium.

FIG. 1 is a block diagram showing the configuration of an information processing system.
FIG. 2 is a diagram illustrating an example of generation of an extended data set by data fusion.
FIG. 3 is a flowchart (part 1) representing evaluation processing executed by a processor.
FIG. 4 is a flowchart (part 2) representing evaluation processing executed by the processor.
FIG. 5 is a flowchart (part 1) showing extended processing executed by the processor.
FIG. 6 is a flowchart (part 2) showing extended processing executed by the processor.

1... Information processing system, 11... Processor, 13... Memory, 15... Storage, 17... User interface, 19... Communication interface, Pr... Computer program.

Exemplary embodiments of the present disclosure will be described below with reference to the drawings.

The information processing system 1 of this embodiment is configured by installing a dedicated computer program Pr into a general-purpose computer. As shown in FIG. 1, the information processing system 1 includes a processor 11, a memory 13, a storage 15, a user interface 17, and a communication interface 19.

The processor 11 is configured to execute processing according to a computer program Pr stored in the storage 15. The memory 13 is a primary storage device including a RAM, and is used as a work area when the processor 11 executes processing.

The storage 15 is a secondary storage device including, for example, a hard disk drive or a solid state drive. The storage 15 stores, in addition to the computer program Pr, various types of data used when executing processes according to the computer program Pr.

The user interface 17 includes an input device for inputting operation signals from a user operating the information processing system 1 to the processor 11. The user interface 17 further includes a display for displaying various information to the user. Examples of input devices include keyboards and pointing devices.

The communication interface 19 includes a LAN (Local Area Network) interface and a USB (Universal Serial Bus) interface, and is used for communication with external devices. The information processing system 1 transmits and receives data to and from external devices through the communication interface 19.

The processor 11 generates the extended data set 15C by executing processing according to the computer program Pr. The extended data set 15C is generated by extending the first data set 15A stored in the storage 15 using the second data set 15B stored in the storage 15. The first data set 15A and the second data set 15B are obtained in advance from an external device through the communication interface 19, for example, and stored in the storage 15.

The first data set 15A is a data set that describes the first feature regarding the first set. The first data set 15A includes feature data for each first entity as first feature data. Each of the first entities corresponds to one of the plurality of elements included in the first set. The first set is a set of first entities. The first set may be a first set of consumers. According to one example, the first entity is a consumer, i.e., a person.

The first feature data for each first entity is data that describes the first feature of the corresponding first entity. For example, the first data set 15A may be a data set regarding the purchasing behavior of a first set of consumers, as shown in FIG. 2. In this case, the first feature data may be data describing characteristics of the corresponding consumer's purchasing behavior. The first feature data may be, for example, data describing, for each of a plurality of products, whether or not the product has been purchased.

The second data set 15B is a data set that describes the second feature regarding the second set. The second data set 15B includes feature data for each second entity as second feature data. Each of the second entities corresponds to each of the plurality of elements included in the second set.

The second set is a set of second entities. The second set may be a second set of consumers. The second set of consumers can be the same as or different from the first set of consumers. According to one example, the second entity is a consumer, i.e., a person.

The second feature data for each second entity is data that describes the second feature of the corresponding second entity. The second feature data may be data describing a feature that is the same as or different from the first feature described by the first feature data. That is, at least one of the second set and the second feature is different from the first set and the first feature.

For example, the second data set 15B may be a data set regarding online behavior of a second set of consumers, as shown in FIG. According to the example shown in FIG. 2, the online behavior may be the behavior of visiting a website. The second characteristic data may be, for example, data describing whether or not each website has been visited, regarding a plurality of websites.

The extended data set 15C is a data set in which information included in the second data set 15B is added to the first data set 15A. The expansion increases the amount of information about the first entity. An increase in the amount of information will be useful for analyzing human behavior and distributing advertisements.

According to this embodiment, the processor 11 is configured to execute the evaluation process shown in FIGS. 3 and 4 according to instructions from the user. In the evaluation process, the similarity in data structure between the first data set 15A and the second data set 15B that the user wants to combine by data fusion is evaluated, and the accuracy of data fusion is thereby evaluated in advance. The accuracy of data fusion corresponds to the accuracy (i.e., correctness) of the information described by the extended data set 15C generated by data fusion.

The data structure of each of the first data set 15A and the second data set 15B corresponds to the structure of the graph obtained when the similarity between entities in that data set is expressed as a graph. As is well known, a graph is composed of a set of nodes (in other words, vertices) and links (in other words, edges).

In the evaluation process, the nearest neighbor graph of the first set is used as the graph corresponding to the first data set 15A. The nearest neighbor graph of the first set is constructed by connecting each node of the first entity in the first set to the node of the first entity having the highest degree of similarity on the feature space.

Similarly, the nearest neighbor graph of the second set is used as the graph corresponding to the second data set 15B. The nearest neighbor graph of the second set is constructed by connecting each node of the second entity in the second set to the node of the second entity having the highest degree of similarity on the feature space.

Upon starting the evaluation process shown in FIG. 3, the processor 11 reads the first data set 15A specified by the user through the user interface 17 from the storage 15. Based on the read first data set 15A, the processor 11 generates a feature vector x for each first entity for the plurality of first entities included in the first set (S110).

Specifically, for each first entity, the processor 11 calculates the feature vector x = (x1, x2, ..., xM1). M1 corresponds to the number of dimensions of the feature vector x.

If the first data set 15A is a data set representing the characteristics of the consumer's purchasing behavior as illustrated in FIG. 2, the feature vector x can include vector elements for each product. The vector element of each product represents whether the corresponding consumer has purchased the corresponding product.
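The generation of feature vectors in S110 might look like the following minimal sketch. It assumes, purely for illustration, that the data set is available as a list of (consumer, product) purchase records; the function name `binary_feature_vectors` and the sample identifiers are hypothetical and do not come from the disclosure.

```python
import numpy as np

def binary_feature_vectors(records, entities, items):
    """Return one 0/1 feature vector per entity, with one element per item."""
    row = {entity: i for i, entity in enumerate(entities)}
    col = {item: j for j, item in enumerate(items)}
    vectors = np.zeros((len(entities), len(items)), dtype=int)
    for entity, item in records:
        # 1 means the consumer purchased the product at least once.
        vectors[row[entity], col[item]] = 1
    return vectors

# Hypothetical purchase records (consumer, product).
records = [("c1", "milk"), ("c1", "bread"), ("c2", "milk"), ("c3", "tea")]
x = binary_feature_vectors(records, ["c1", "c2", "c3"], ["milk", "bread", "tea"])
```

Each row of `x` is the feature vector of one consumer, so the number of columns plays the role of M1.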

In the following S120, the processor 11 reads the second data set 15B specified by the user through the user interface 17 from the storage 15. Based on the read second data set 15B, the processor 11 generates a feature vector y for each second entity for the plurality of second entities included in the second set.

Specifically, for each second entity, the processor 11 calculates the feature vector y = (y1, y2, ..., yM2). M2 corresponds to the number of dimensions of the feature vector y.

If the second data set 15B is a data set representing characteristics of consumers' online behavior as illustrated in FIG. 2, the feature vector y may include vector elements for each website. The vector element for each website represents whether the corresponding consumer has visited the corresponding website.

In the following S130, the processor 11 calculates the similarity R1 between the first entities included in the first set. For all possible combinations of two first entities in the first set, the processor 11 calculates, for each combination, the similarity R1 between the two first entities constituting the combination using the feature vectors x.

The similarity R1 may be, for example, a cosine similarity calculated by the normalized inner product of the feature vectors x of the two first entities forming the combination. However, the similarity R1 is not limited to cosine similarity.
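The cosine similarity between every pair of entities can be computed in one step from the matrix of feature vectors. The following is a minimal sketch, not the implementation of the disclosure; the function name and the sample data are hypothetical, and the handling of all-zero feature vectors is an assumption added to avoid division by zero.

```python
import numpy as np

def cosine_similarity_matrix(features):
    """Pairwise cosine similarity between the rows of a feature matrix."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    # Assumption: an all-zero vector is divided by 1 instead of 0, so its
    # similarity to every vector (including itself) becomes 0.
    safe_norms = np.where(norms == 0.0, 1.0, norms)
    unit = features / safe_norms
    # Inner products of normalized vectors are cosine similarities.
    return unit @ unit.T

# Hypothetical purchase flags: rows are consumers, columns are products.
purchases = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
])
R1 = cosine_similarity_matrix(purchases)
```

The element `R1[i, j]` is the similarity between entity i and entity j; the first two consumers have identical purchases and therefore similarity 1, while the third shares no product with them and has similarity 0.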

In the following S140, the processor 11 calculates the first graph Laplacian matrix L1 based on the similarity R1 between the first entities. The first graph Laplacian matrix L1 is a graph Laplacian matrix of the nearest neighbor graph of the first data set 15A. The first graph Laplacian matrix L1 can be calculated using the equation L1=D1-A1 using the degree matrix D1 and the adjacency matrix A1 of the nearest neighbor graph.

The nearest neighbor graph of the first data set 15A can be defined, for example, by performing the following procedure.

Step 1: Select one of the plurality of first entities as the entity to be processed.

Step 2: Create a link (in other words, a directed edge) from the node of the entity to be processed to the node of the first entity that has the highest degree of similarity R1 with the entity to be processed.

Steps 1 and 2 are repeated until all of the plurality of first entities in the first set have been selected as entities to be processed. That is, the nearest neighbor graph of the first data set 15A may be a directed graph defined by performing steps 1 and 2 for all of the plurality of first entities in the first set.
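The procedure above and the formula L1=D1-A1 might be sketched as follows. Several points are assumptions of this sketch rather than statements of the disclosure: the degree matrix is taken to be the out-degree matrix of the directed graph, ties in similarity are resolved in favor of the lowest index, at least two entities are assumed, and the function name and the sample similarity matrix are hypothetical.

```python
import numpy as np

def nearest_neighbor_laplacian(similarity):
    """Return D - A for the directed nearest neighbor graph."""
    sim = np.array(similarity, dtype=float)  # copy, so the input is not modified
    n = sim.shape[0]
    np.fill_diagonal(sim, -np.inf)           # an entity is never its own neighbor
    nearest = np.argmax(sim, axis=1)         # most similar other entity per row
    adjacency = np.zeros((n, n))
    adjacency[np.arange(n), nearest] = 1.0   # directed edge i -> nearest[i]
    # Assumption: D holds out-degrees, which are all 1 in this graph.
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

# Hypothetical similarity matrix for three entities.
R1 = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])
L1 = nearest_neighbor_laplacian(R1)
```

Here entities 0 and 1 point at each other and entity 2 points at entity 1, so every row of `L1` sums to zero.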

In the following S150, the processor 11 calculates the eigenvalues λ1[1], λ1[2], ..., λ1[i], ..., λ1[N1] of the first graph Laplacian matrix L1. The value N1 is the number of eigenvalues.

The index i of the eigenvalue λ1[i] (i = 1, 2, ..., N1) represents the rank of the eigenvalue λ1[i] within the group of eigenvalues λ1[1], λ1[2], ..., λ1[i], ..., λ1[N1], ranked based on the magnitude of the eigenvalue. That is, λ1[1] ≧ λ1[2] ≧ ... ≧ λ1[N1].

In the following S160, the processor 11 determines the rank K1 of the eigenvalue λ1[K1] at which, when the eigenvalues λ1[1], λ1[2], ..., λ1[i], ..., λ1[N1] are added in descending order, the cumulative sum exceeds a predetermined proportion α of the total sum. α may, for example, have the value 0.9. That is, the processor 11 determines the minimum value K1 that satisfies the following conditional expression, which follows from the description above.

$$\sum_{i=1}^{K_1} \lambda_1[i] > \alpha \sum_{i=1}^{N_1} \lambda_1[i]$$
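Steps S150 and S160 might be sketched as follows. This is a sketch under stated assumptions: because the nearest neighbor graph is directed, its graph Laplacian matrix is generally not symmetric, and keeping only the real parts of the eigenvalues is a choice made for this example; the word "exceeds" is read as a strict inequality; the function names and the sample matrix are hypothetical.

```python
import numpy as np

def ranked_eigenvalues(laplacian):
    """Eigenvalues ranked in descending order of magnitude."""
    eigenvalues = np.linalg.eigvals(np.asarray(laplacian, dtype=float))
    # Assumption: only real parts are kept, since a non-symmetric matrix
    # can in general yield complex eigenvalues.
    return np.sort(eigenvalues.real)[::-1]

def cutoff_rank(eigenvalues, alpha=0.9):
    """Smallest 1-based rank at which the cumulative sum exceeds alpha * total."""
    cumulative = np.cumsum(eigenvalues)
    return int(np.argmax(cumulative > alpha * cumulative[-1])) + 1

# Hypothetical graph Laplacian of a three-node nearest neighbor graph.
L1 = np.array([
    [1.0, -1.0, 0.0],
    [-1.0, 1.0, 0.0],
    [0.0, -1.0, 1.0],
])
lambda1 = ranked_eigenvalues(L1)     # approximately [2, 1, 0]
K1 = cutoff_rank(lambda1, alpha=0.9)
```

For this matrix the cumulative sums are 2 and 3 against a threshold of 0.9 × 3 = 2.7, so the rank K1 is 2.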

In the following S170, the processor 11 calculates the similarity R2 between the second entities included in the second set. For all possible combinations of two second entities in the second set, the processor 11 calculates, for each combination, the similarity R2 between the two second entities constituting the combination using the feature vectors y.

The similarity R2 may be, for example, a cosine similarity calculated by the normalized inner product of the feature vectors y of the two second entities forming the combination. However, the similarity R2 is not limited to cosine similarity.

In the following S180, the processor 11 calculates a second graph Laplacian matrix L2 based on the similarity R2 between the second entities. The second graph Laplacian matrix L2 is a graph Laplacian matrix of the nearest neighbor graph of the second data set 15B.

The second graph Laplacian matrix L2 can be calculated by the formula L2=D2-A2 using the degree matrix D2 of the nearest neighbor graph and the adjacency matrix A2. The nearest neighbor graph of the second data set 15B may be defined, for example, by performing the following procedure. Step 11: Select one of the plurality of second entities as the entity to be processed. Step 12: Create a link (in other words, a directed edge) from the node of the selected entity to be processed to the node of the second entity that has the highest similarity R2 with the entity to be processed. . Steps 11 and 12 are repeated until all of the plurality of second entities in the second set are selected as entities to be processed. That is, the nearest neighbor graph of the second data set 15B may be a directed graph defined by performing steps 11 and 12 for all of the plurality of second entities in the second set.

In the following S190, the processor 11 calculates the eigenvalues λ2[1], λ2[2], ..., λ2[i], ..., λ2[N2] of the second graph Laplacian matrix L2. The value N2 is the number of eigenvalues.

The index i of the eigenvalue λ2[i] (i=1, 2, ..., N2) represents the rank of the eigenvalue λ2[i] within the group of eigenvalues λ2[1], λ2[2], ..., λ2[i], ..., λ2[N2], ranked by the magnitude of the eigenvalue. That is, λ2[1]≧λ2[2]≧...≧λ2[N2].

In subsequent S200, the processor 11 adds the eigenvalues λ2[1], λ2[2], ..., λ2[i], ..., λ2[N2] in descending order and determines the rank K2 of the eigenvalue λ2[K2] at which the running sum first exceeds a predetermined proportion α of the total sum of the eigenvalues. That is, the processor 11 determines the minimum value K2 that satisfies the corresponding conditional expression. α may, for example, have the value 0.9.
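The rank determination of S160 and S200 can be sketched as follows. The function name `rank_for_proportion` and the sample eigenvalues are hypothetical; the sketch assumes real, non-negative eigenvalues and reads "exceeds" as a strict inequality.

```python
def rank_for_proportion(eigenvalues, alpha=0.9):
    """Return the smallest rank K such that the sum of the K largest
    eigenvalues exceeds alpha times the sum of all eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)  # lambda[1] >= lambda[2] >= ...
    threshold = alpha * sum(ordered)
    running = 0.0
    for rank, value in enumerate(ordered, start=1):
        running += value
        if running > threshold:
            return rank
    # Fallback for degenerate input (for example, all eigenvalues equal to zero).
    return len(ordered)


# Hypothetical eigenvalues of a graph Laplacian matrix.
eigs = [2.0, 2.0, 1.0, 1.0, 0.0]
K = rank_for_proportion(eigs, alpha=0.9)   # running sums: 2, 4, 5, 6; threshold 5.4
K_half = rank_for_proportion(eigs, alpha=0.5)  # threshold 3.0
```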

In the following S210, the processor 11 sets the value K to the smaller of the values K1 and K2, that is, K=min{K1, K2}.

In the following S220, the processor 11 calculates the sum of squared errors of the eigenvalues according to the following equation as the evaluation value E regarding the similarity of data structures between the first data set 15A and the second data set 15B.
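With K as set in S210, and with each error taken as the difference between eigenvalues of equal rank, the sum of squared errors may be written as below. This is a reconstruction from the surrounding description rather than a quotation.

```latex
E = \sum_{i=1}^{K} \bigl( \lambda_1[i] - \lambda_2[i] \bigr)^2,
\qquad K = \min\{K_1, K_2\}
```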

Each error is the difference (λ1[i]−λ2[i]) between the eigenvalue λ1[i] of the corresponding rank among the eigenvalues λ1[1], λ1[2], ..., λ1[i], ..., λ1[K] of the first graph Laplacian matrix L1 and the eigenvalue λ2[i] of the corresponding rank among the eigenvalues λ2[1], λ2[2], ..., λ2[i], ..., λ2[K] of the second graph Laplacian matrix L2.

Calculating the sum of squares of errors involves comparing each of the eigenvalues λ1[1], λ1[2], ..., λ1[i], ..., λ1[K], from the first rank up to a predetermined rank within the group of eigenvalues λ1[1], λ1[2], ..., λ1[i], ..., λ1[N1] of the first graph Laplacian matrix L1, with the eigenvalue of the same rank among the eigenvalues λ2[1], λ2[2], ..., λ2[i], ..., λ2[K] of the second graph Laplacian matrix L2.
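The rank-by-rank comparison can be sketched as follows. The function name and the sample values are hypothetical, and the ranks K1 and K2 are assumed to have been determined beforehand as in S160 and S200.

```python
def evaluation_value(eigs1, eigs2, k1, k2):
    """Sum of squared differences between the K largest eigenvalues of two
    graph Laplacian matrices, where K = min(k1, k2)."""
    k = min(k1, k2)
    top1 = sorted(eigs1, reverse=True)[:k]  # lambda1[1..K]
    top2 = sorted(eigs2, reverse=True)[:k]  # lambda2[1..K]
    return sum((a - b) ** 2 for a, b in zip(top1, top2))


# Hypothetical eigenvalue groups of different sizes (N1 = 4, N2 = 5).
eigs1 = [2.0, 1.0, 1.0, 0.0]
eigs2 = [2.0, 2.0, 1.0, 1.0, 0.0]
E = evaluation_value(eigs1, eigs2, k1=3, k2=4)  # K = 3; errors are 0, -1, 0
```

Truncating both groups to K entries is what allows two data sets with different numbers of eigenvalues to be compared.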

In the following S230, the processor 11 displays the evaluation value E calculated in S220 to the user through the display of the user interface 17. After that, the evaluation process ends.

According to this evaluation process, the user can predict in advance the accuracy of data fusion between the first data set 15A and the second data set 15B based on the displayed evaluation value E.

Specifically, the user can determine that the smaller the displayed evaluation value E, the higher the similarity in data structure between the first data set 15A and the second data set 15B. The user can also determine that the smaller the displayed evaluation value E, the more accurate the data fusion that can be achieved between the first data set 15A and the second data set 15B. Thereby, the user can determine whether it is possible to obtain the expanded data set 15C with high information accuracy.

Next, details of the expansion process executed by the processor 11 when the user inputs an instruction to execute the expansion process through the user interface 17 will be explained using FIGS. 5 and 6. Along with the execution instruction, the user specifies a plurality of data sets through the user interface 17 as candidates for the second data set 15B to be combined with the first data set 15A. The plurality of data sets may be data sets relating to different sets, or data sets having different described characteristics.

In the expansion process, among these multiple data sets, the data set with the smallest evaluation value E calculated using the same method as the evaluation process described above is selected as the second data set 15B to be combined. The selected second data set 15B is combined with the first data set 15A by data fusion.

When the expansion process starts, the processor 11 reads the first data set 15A designated by the user through the user interface 17 from the storage 15, similar to the process at S110. The processor 11 generates a feature vector x for each first entity based on the read first data set 15A (S310). Furthermore, the processor 11 acquires a plurality of data sets designated as candidates for the second data set 15B to be combined by reading them from the storage 15 (S320).

After that, the processor 11 executes the processes of S330 to S360, similar to the processes of S130 to S160. That is, in S330, the processor 11 calculates the similarity R1 between the first entities.

In S340, the processor 11 calculates the first graph Laplacian matrix L1 based on the similarity R1 between the first entities. In subsequent S350, the processor 11 calculates the eigenvalues λ1[1], λ1[2], ..., λ1[i], ..., λ1[N1] of the first graph Laplacian matrix L1. The value N1 is the number of eigenvalues. The eigenvalues λ1[1], λ1[2], ..., λ1[i], ..., λ1[N1] satisfy the conditional expression λ1[1]≧λ1[2]≧...≧λ1[N1].

In subsequent S360, the processor 11 adds the eigenvalues λ1[1], λ1[2], ..., λ1[i], ..., λ1[N1] in descending order and determines the rank K1 of the eigenvalue λ1[K1] at which the running sum first exceeds a predetermined proportion α of the total sum of the eigenvalues. α may, for example, have the value 0.9.

In the following S370, the processor 11 selects one dataset to be evaluated from among the plurality of candidate datasets. In subsequent S380, the processor 11 generates a feature vector y of the corresponding entity for each entity based on the dataset to be evaluated.

The entity here is an element in the sample set of information handled by the dataset to be evaluated. A sample set may correspond to a consumer set. An entity may be each of a plurality of consumers included in a consumer set.

The dataset to be evaluated includes, for each entity, feature data that describes the characteristics of the corresponding entity. Generation of the feature vector y for each entity in S380 is performed in the same way as the process in S120 regarding the second data set 15B.

In subsequent S390, the processor 11 calculates the similarity R3 between entities included in the sample set handled by the evaluation target dataset based on the feature vector y, similar to the process in S170.

For all possible combinations of two entities in the sample set, the processor 11 calculates, for each combination, the similarity R3 between the two entities that make up the combination using the feature vector y. Similarity R3 may be a cosine similarity.

In the following S400, the processor 11 calculates a graph Laplacian matrix L3 based on the similarity R3 between entities, similar to the process in S180.

The graph Laplacian matrix L3 is a graph Laplacian matrix of the nearest neighbor graph of the dataset to be evaluated. The nearest neighbor graph of the dataset to be evaluated may be defined, for example, by performing the following steps. Step 21: Select one of the multiple entities in the sample set as the entity to be processed. Step 22: A link (in other words, a directed edge) is created from the node of the selected entity to be processed to the node of one entity that has the highest similarity R3 with the entity to be processed. Steps 21 and 22 are repeated until all of the multiple entities in the sample set are selected as entities to be processed. That is, the nearest neighbor graph of the dataset to be evaluated may be a directed graph defined by performing steps 21 and 22 for all of the plurality of entities in the sample set.

In the following S410, the processor 11 calculates the eigenvalues λ3[1], λ3[2], ..., λ3[i], ..., λ3[N3] of the graph Laplacian matrix L3, similarly to the process in S190. The value N3 is the number of eigenvalues, and the eigenvalues λ3[1], λ3[2], ..., λ3[i], ..., λ3[N3] satisfy the conditional expression λ3[1]≧λ3[2]≧...≧λ3[N3].

In subsequent S420, similarly to the process in S200, the processor 11 adds the eigenvalues λ3[1], λ3[2], ..., λ3[i], ..., λ3[N3] in descending order and determines the rank K3 of the eigenvalue λ3[K3] at which the running sum first exceeds a predetermined proportion α of the total sum of the eigenvalues. α may, for example, have the value 0.9.

In the following S430, the processor 11 sets the value K to the smaller of the values K1 and K3, that is, K=min{K1, K3}.

In the following S440, the processor 11 calculates the sum of squares of the errors of the eigenvalues according to the following equation as the evaluation value E regarding the similarity of the data structure between the first data set 15A and the data set to be evaluated.
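By analogy with S220, and assuming each error is the difference between the eigenvalues λ1[i] and λ3[i] of equal rank, the evaluation value for a data set to be evaluated may be written as below. This is a reconstruction from the surrounding description rather than a quotation.

```latex
E = \sum_{i=1}^{K} \bigl( \lambda_1[i] - \lambda_3[i] \bigr)^2,
\qquad K = \min\{K_1, K_3\}
```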

In the following S450, the processor 11 determines whether the processes of S370 to S440 have been executed for all of the plurality of data sets designated as candidates. If it is determined that it has not been executed (No in S450), the processor 11 selects one new data set that has not been selected as an evaluation target from among the candidates as a data set to be evaluated (S370). The processor 11 executes the processes of S380 to S440 regarding the newly selected data set to be evaluated.

In this way, the processor 11 makes a negative determination in S450 and repeatedly executes the processes of S370 to S440 until the processes of S370 to S440 are executed for all of the plurality of data sets designated as candidates. As a result, an evaluation value E is calculated for each data set with respect to a plurality of data sets designated as candidates.

When determining that the processes of S370 to S440 have been executed for all of the plurality of data sets (Yes in S450), the processor 11 executes the process of S460. That is, the processor 11 determines the data set with the smallest evaluation value E among the plurality of data sets designated as candidates as the data set with the highest similarity in data structure to the first data set 15A (S460).

Then, the processor 11 selects the data set with the smallest evaluation value E from among the plurality of data sets designated as candidates as the second data set 15B to be combined with the first data set 15A (S460).
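The loop of S370 to S460 can be sketched as follows, starting from eigenvalue groups that are assumed to have been computed already. All function names, candidate names, and eigenvalues are hypothetical, and "exceeds" is read as a strict inequality.

```python
def rank_for_proportion(eigs, alpha=0.9):
    """Smallest rank whose running sum (descending order) exceeds alpha * total."""
    ordered = sorted(eigs, reverse=True)
    threshold = alpha * sum(ordered)
    running = 0.0
    for rank, value in enumerate(ordered, start=1):
        running += value
        if running > threshold:
            return rank
    return len(ordered)


def evaluation_value(eigs_a, eigs_b, alpha=0.9):
    """Sum of squared errors over the top K eigenvalues, K = min of both ranks."""
    k = min(rank_for_proportion(eigs_a, alpha), rank_for_proportion(eigs_b, alpha))
    top_a = sorted(eigs_a, reverse=True)[:k]
    top_b = sorted(eigs_b, reverse=True)[:k]
    return sum((a - b) ** 2 for a, b in zip(top_a, top_b))


def select_candidate(first_eigs, candidate_eigs, alpha=0.9):
    """Evaluate every candidate and return the name with the smallest E,
    together with all evaluation values."""
    scores = {
        name: evaluation_value(first_eigs, eigs, alpha)
        for name, eigs in candidate_eigs.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical eigenvalue groups for the first data set and two candidates.
first_eigs = [2.0, 2.0, 1.0, 1.0, 0.0]
candidates = {
    "candidate_a": [2.0, 1.0, 1.0, 1.0, 0.0],
    "candidate_b": [2.0, 2.0, 1.0, 0.0],
}
best, scores = select_candidate(first_eigs, candidates)
```

In this sample, `candidate_b` matches the first data set on all of its top three eigenvalues, so it receives the smaller evaluation value and is selected.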

Thereafter, the processor 11 combines the first data set 15A with the selected second data set 15B using data fusion technology, thereby generating an expanded data set 15C in which the first data set 15A is expanded with the second data set 15B (S470).

The combination between the first data set 15A and the second data set 15B is performed by combining the feature data of the related first entity and the feature data of the second entity. Combining two pieces of feature data corresponds to associating the two pieces of feature data.

According to the first example, feature data of a first entity and feature data of a second entity that have similar features are combined. According to the second example, feature data of a first entity and feature data of a second entity whose relative positions are similar are combined, based on the relative position of each first entity in the first set and the relative position of each second entity in the second set on the feature space.

After that, the processor 11 outputs the generated expanded data set 15C (S480). Specifically, the processor 11 writes the expanded data set 15C to the storage 15. The expanded data set 15C written to the storage 15 is useful for analyzing consumer behavior, for example.

Here, we will provide additional explanation about data fusion technology. Applicants have already disclosed several data fusion techniques through prior patent applications. Consider a case where the first data set 15A and the second data set 15B include variables that are common between the first entity and the second entity, such as demographic attributes. In this case, the processor 11 can combine the first data set 15A and the second data set 15B so as to combine the feature data of a first entity and the feature data of a second entity that have similar features as determined by the common variables.
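One simple way to picture combination by common variables is nearest-match fusion, sketched below. This is not the applicants' disclosed technique; the squared-distance criterion, the function name, and the sample records are all assumptions made for illustration.

```python
def fuse_by_common_variables(first, second, common_keys):
    """For each first entity, attach the feature data of the second entity
    that is closest on the common variables (squared Euclidean distance)."""
    fused = []
    for a in first:
        best = min(
            second,
            key=lambda b: sum((a[k] - b[k]) ** 2 for k in common_keys),
        )
        merged = dict(best)   # start from the matched second entity's data
        merged.update(a)      # the first entity's own values take precedence
        fused.append(merged)
    return fused


# Hypothetical records: "age" is the common variable.
first = [
    {"age": 30, "purchases": 5},
    {"age": 60, "purchases": 1},
]
second = [
    {"age": 58, "tv_hours": 4.0},
    {"age": 29, "tv_hours": 1.5},
]
expanded = fuse_by_common_variables(first, second, common_keys=["age"])
```

Each fused record keeps the first entity's own values and gains the variables that exist only in the second data set.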

As another example, consider a case where no common variable is included between the first data set 15A and the second data set 15B. In this case, the processor 11 searches for a first entity and a second entity for which the relative position of the first entity in the first set on the feature space is similar to the relative position of the second entity in the second set on the feature space. The processor 11 can then combine the first data set 15A and the second data set 15B so as to combine the feature data of the first entity and the feature data of the second entity having similar relative positions.

According to the information processing system 1 of this embodiment described above, the similarity of data structures between datasets is evaluated based on the eigenvalues of the graph Laplacian matrix based on the nearest neighbor graph.

The nearest neighbor graph corresponds to the data structure of the corresponding dataset. The nearest neighbor graph relates to the distribution on the feature space of multiple elements that make up the set. When two neighborhood graphs are similar, the groups of eigenvalues of the two corresponding graph Laplacian matrices are also similar.

Therefore, by comparing the eigenvalues, it is possible to evaluate the similarity between the nearest neighbor graph regarding the first set and the nearest neighbor graph regarding the second set. As a result, it is possible to evaluate the similarity between the data structure of the first data set 15A and the data structure of the second data set 15B.

This evaluation is useful for selecting data sets to be combined in data fusion. By using data fusion technology to combine the first data set 15A with a second data set 15B that has a highly similar data structure, an expanded data set 15C with high accuracy regarding the expanded information can be generated.

In other words, combining two data sets with similar data structures allows feature data to be combined between entities more appropriately across the entire data set than combining two data sets with very different data structures.

In particular, in this embodiment, the values K1, K2, and K3 are calculated and the value K is determined based on the following ideas. Idea 1: The larger the eigenvalue, the more important it is for evaluating the data structure. Idea 2: The larger the ratio of a partial sum of eigenvalues to the total sum of all eigenvalues, the better the eigenvalues contributing to that partial sum represent the entire group of eigenvalues.

In this embodiment, the evaluation value E is further calculated as the sum of squares of K errors. That is, according to the method for calculating the evaluation value E of this embodiment, even if the number of eigenvalues differs between the data sets to be compared, the eigenvalues can be compared and the evaluation value E regarding the similarity of data structures can be appropriately calculated. Therefore, according to this embodiment, it is possible to achieve good evaluation regarding the similarity of data structures and good data fusion based on this evaluation.

[Other embodiments]
The present disclosure is not limited to the above embodiments, and can take various forms. For example, the graph Laplacian matrix may be a graph Laplacian matrix of a k-nearest neighbor graph. For example, the first graph Laplacian matrix L1 may be a graph Laplacian matrix corresponding to a k-nearest neighbor graph in which the node of each first entity in the first set is connected to the nodes of k (one or more) first entities in the first set, taken in descending order of similarity R1.

The second graph Laplacian matrix L2 may be a graph Laplacian matrix corresponding to a k-nearest neighbor graph in which the node of each second entity in the second set is connected to the nodes of k (one or more) second entities in the second set, taken in descending order of similarity R2. The k-nearest neighbor graph may be a directed graph or an undirected graph. Similarly, the graph Laplacian matrix L3 may be a graph Laplacian matrix of a k-nearest neighbor graph.
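The k-nearest neighbor variant can be sketched as follows for both the directed and the undirected case. The function name, the tie-breaking rule, and the sample similarity matrix are hypothetical; for the undirected case the sketch assumes an edge exists when either endpoint lists the other among its k neighbors.

```python
def knn_adjacency(R, k, directed=True):
    """Adjacency matrix of the k-nearest neighbor graph of similarity matrix R."""
    n = len(R)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        # Other entities in descending order of similarity to entity i.
        # Assumption: ties keep the lower index first (sorted() is stable).
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: R[i][j],
            reverse=True,
        )
        for j in others[:k]:
            A[i][j] = 1
            if not directed:
                A[j][i] = 1  # assumption: symmetrize by union of neighbor lists
    return A


# Hypothetical similarity matrix for four entities.
R = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.2],
    [0.8, 0.7, 0.0, 0.3],
    [0.1, 0.2, 0.3, 0.0],
]
A_directed = knn_adjacency(R, k=2)
A_undirected = knn_adjacency(R, k=2, directed=False)
```

Setting `k=1` with `directed=True` reduces to the nearest neighbor graph of the main embodiment.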

In the above embodiment, comparison of neighborhood graphs is performed through comparison of eigenvalues of graph Laplacian matrices. However, comparison of neighborhood graphs is not limited to this example. The structures of the neighborhood graphs may be expressed numerically using any method, and the structures of the neighborhood graphs may be compared by comparing the numerical values corresponding to the two neighborhood graphs. By comparing the structures of such neighborhood graphs, the similarity of the data structures of two corresponding data sets may be evaluated. The numerical value here may include a vector.

The function of one component in the above embodiment may be distributed and provided to multiple components. Functions possessed by multiple components may be integrated into one component. A part of the configuration of the above embodiment may be omitted. At least a part of the configuration of the embodiment described above may be added to or replaced with the configuration of other embodiments described above. All aspects included in the technical idea specified from the words in the claims are embodiments of the present disclosure.

Claims

a first acquisition unit configured to acquire, with respect to a plurality of first elements in a first set, a first data set comprising data describing characteristics of each of the plurality of first elements;
a second acquisition unit configured to acquire, with respect to a plurality of second elements in a second set, a second data set comprising data describing characteristics of each of the plurality of second elements;
an evaluation unit configured to evaluate similarity regarding data structure between the first data set and the second data set based on a comparison between a neighborhood graph of the first set, which is based on the similarity between the plurality of first elements determined from the first data set, and a neighborhood graph of the second set, which is based on the similarity between the plurality of second elements determined from the second data set;
An information processing system equipped with.
The information processing system according to claim 1, wherein the evaluation unit is configured to evaluate the similarity regarding the data structure between the first data set and the second data set based on, as the comparison between the neighborhood graph of the first set and the neighborhood graph of the second set, a comparison using a graph Laplacian matrix corresponding to the neighborhood graph of the first set and a graph Laplacian matrix corresponding to the neighborhood graph of the second set.
The second acquisition unit acquires a plurality of evaluation target datasets as the second dataset,
Each of the plurality of evaluation target data sets is a data set including data describing characteristics of each of the plurality of elements with respect to the plurality of elements in the corresponding set,
The plurality of data sets to be evaluated are data sets related to different sets, or data sets with different described characteristics,
For each of the plurality of evaluation target data sets, the evaluation unit evaluates the similarity regarding the data structure between the corresponding evaluation target data set and the first data set based on a comparison between a neighborhood graph of the corresponding set, which is based on the degree of similarity between the plurality of elements in the corresponding set determined from the corresponding evaluation target data set, and the neighborhood graph of the first set,
The information processing system further includes:
a selection unit configured to select, as a combination target, a data set with the highest evaluation of similarity regarding the data structure among the plurality of evaluation target data sets;
a combining unit configured to combine the first data set and the data set selected as the combination target so as to associate data describing characteristics of similar elements between the first set and the corresponding set;
The information processing system according to claim 1, comprising:
The information processing system according to claim 1, wherein the evaluation unit comprises:
a first similarity calculation unit configured to calculate the similarity between the plurality of first elements based on the first data set;
a second similarity calculation unit configured to calculate the similarity between the plurality of second elements based on the second data set;
a first eigenvalue calculation unit configured to calculate, as a group of first eigenvalues, a group of eigenvalues of a first graph Laplacian matrix corresponding to a neighborhood graph in which, based on the degree of similarity between the plurality of first elements, each of the plurality of first elements is connected to one or more first elements in the first set whose degree of similarity is high enough to satisfy a predetermined condition; and
a second eigenvalue calculation unit configured to calculate, as a group of second eigenvalues, a group of eigenvalues of a second graph Laplacian matrix corresponding to a neighborhood graph in which, based on the degree of similarity between the plurality of second elements, each of the plurality of second elements is connected to one or more second elements in the second set whose degree of similarity is high enough to satisfy the predetermined condition,
and wherein the evaluation unit is configured to evaluate the similarity regarding the data structure between the first data set and the second data set based on a comparison between the group of first eigenvalues and the group of second eigenvalues.
The information processing system according to claim 4, wherein the evaluation unit evaluates the similarity regarding the data structure by comparing each of the plurality of first eigenvalues with the second eigenvalue of the same rank among the plurality of second eigenvalues, based on the rank, by magnitude of the eigenvalue, of each of the plurality of first eigenvalues included in the group of first eigenvalues and the rank, by magnitude of the eigenvalue, of each of the plurality of second eigenvalues included in the group of second eigenvalues.
The information processing system according to claim 5, wherein the evaluation unit evaluates the similarity regarding the data structure by comparing each of the eigenvalues from the first rank to a predetermined rank, in descending order of the eigenvalues included in the group of first eigenvalues, with the eigenvalue of the same rank among the plurality of second eigenvalues.
The information processing system according to claim 5 or claim 6, wherein the evaluation unit is configured to calculate an evaluation value of the similarity regarding the data structure by a sum of squares of errors, each of the errors being the difference between the first eigenvalue of the corresponding rank among the plurality of first eigenvalues and the second eigenvalue of the corresponding rank among the plurality of second eigenvalues.
The first graph Laplacian matrix is a graph Laplacian matrix of a nearest neighbor graph in which each of the plurality of first elements is connected to the first element having the highest degree of similarity in the first set,
4. The second graph Laplacian matrix is a graph Laplacian matrix of a nearest neighbor graph in which each of the plurality of second elements is connected to a second element having the highest degree of similarity in the second set. The information processing system according to any one of claims 7 to 9.
The second acquisition unit acquires a plurality of evaluation target datasets as the second dataset,
Each of the plurality of evaluation target data sets is a data set including data describing characteristics of each of the plurality of elements with respect to the plurality of elements in the corresponding set,
The plurality of data sets to be evaluated are data sets related to different sets, or data sets with different described characteristics,
The second similarity calculation unit calculates the similarity between the plurality of elements in the corresponding set for each of the plurality of evaluation target data sets,
The second eigenvalue calculation unit calculates, for each of the plurality of evaluation target data sets, as a group of comparison target eigenvalues, a group of eigenvalues of a graph Laplacian matrix corresponding to a neighborhood graph in which each of the plurality of elements in the corresponding set is connected to one or more elements in the corresponding set whose degree of similarity is high enough to satisfy the predetermined condition,
The evaluation unit, for each of the plurality of evaluation target data sets, based on a comparison between the group of comparison target eigenvalues based on the corresponding evaluation target data set and the first group of eigenvalues, The information processing system according to any one of claims 4 to 8, wherein similarity regarding the data structure between the first data set and the corresponding evaluation target data set is evaluated.
a selection unit configured to select, as a combination target, a data set with the highest evaluation of similarity regarding the data structure among the plurality of evaluation target data sets;
a combining unit configured to combine the first data set and the data set selected as the combination target so as to associate data describing characteristics of similar elements between the first set and the corresponding set;
The information processing system according to claim 9, comprising:
The information processing system according to any one of claims 1 to 10, wherein the first data set is a data set that describes characteristics of a plurality of people in the first set as the plurality of first elements, and the second data set is a data set that describes characteristics of a plurality of people in the second set as the plurality of second elements.
An information processing method performed by a computer, the method comprising:
obtaining, with respect to a plurality of first elements in a first set, a first data set that includes data describing characteristics of each of the plurality of first elements;
obtaining, with respect to a plurality of second elements in a second set, a second data set that includes data describing characteristics of each of the plurality of second elements;
evaluating similarity regarding data structure between the first data set and the second data set based on a comparison between a neighborhood graph of the first set, which is based on the similarity between the plurality of first elements determined from the first data set, and a neighborhood graph of the second set, which is based on the similarity between the plurality of second elements determined from the second data set;
Information processing methods including.
The information processing method according to claim 12, wherein the evaluating includes evaluating the similarity regarding the data structure between the first data set and the second data set based on, as the comparison between the neighborhood graph of the first set and the neighborhood graph of the second set, a comparison using a graph Laplacian matrix corresponding to the neighborhood graph of the first set and a graph Laplacian matrix corresponding to the neighborhood graph of the second set.
Obtaining the second data set includes obtaining a plurality of evaluation target data sets as the second data set,
Each of the plurality of evaluation target data sets is a data set including data describing characteristics of each of the plurality of elements with respect to the plurality of elements in the corresponding set,
The plurality of data sets to be evaluated are data sets related to different sets, or data sets with different described characteristics,
The evaluating includes, for each of the plurality of evaluation target data sets, evaluating the similarity regarding the data structure between the corresponding evaluation target data set and the first data set based on a comparison between a neighborhood graph of the corresponding set, which is based on the similarity between the plurality of elements in the corresponding set determined from the corresponding evaluation target data set, and the neighborhood graph of the first set,
The information processing method further includes:
Selecting a dataset with the highest evaluation of similarity regarding the data structure from among the plurality of evaluation target datasets as a combination target;
combining the first data set and the data set selected to be combined so as to associate data describing characteristics of similar elements between the first set and the corresponding set; and,
The information processing method according to claim 12, comprising:
An information processing method performed by a computer, the method comprising:
obtaining, with respect to a plurality of first elements in a first set, a first data set that includes data describing characteristics of each of the plurality of first elements;
obtaining, with respect to a plurality of second elements in a second set, a second data set that includes data describing characteristics of each of the plurality of second elements;
Calculating the degree of similarity between the plurality of first elements based on the first data set;
Calculating the degree of similarity between the plurality of second elements based on the second data set;
calculating, as a group of first eigenvalues, a group of eigenvalues of a first graph Laplacian matrix corresponding to a neighborhood graph in which, based on the degree of similarity between the plurality of first elements, each of the plurality of first elements is connected to one or more first elements in the first set whose degree of similarity is high enough to satisfy a predetermined condition;
calculating, as a group of second eigenvalues, a group of eigenvalues of a second graph Laplacian matrix corresponding to a neighborhood graph in which, based on the degree of similarity between the plurality of second elements, each of the plurality of second elements is connected to one or more second elements in the second set whose degree of similarity is high enough to satisfy the predetermined condition;
evaluating similarity regarding data structure between the first data set and the second data set based on a comparison between the group of first eigenvalues and the group of second eigenvalues;
Information processing methods including.
A computer program comprising instructions for causing the computer to execute the information processing method according to any one of claims 12 to 15 when executed by a computer.
A computer-readable recording medium that stores a computer program containing instructions for causing the computer to execute the information processing method according to any one of claims 12 to 15 when executed by a computer.
a first acquisition unit configured to acquire, with respect to a plurality of first elements in a first set, a first data set comprising data describing characteristics of each of the plurality of first elements;
a second acquisition unit configured to acquire, with respect to a plurality of second elements in a second set, a second data set comprising data describing characteristics of each of the plurality of second elements;
a first similarity calculation unit configured to calculate the similarity between the plurality of first elements based on the first data set;
a second similarity calculation unit configured to calculate the similarity between the plurality of second elements based on the second data set;
a first eigenvalue calculation unit configured to calculate, as a group of first eigenvalues, a group of eigenvalues of a first graph Laplacian matrix corresponding to a neighborhood graph in which, based on the degree of similarity between the plurality of first elements, each of the plurality of first elements is connected to one or more first elements in the first set whose degree of similarity is high enough to satisfy a predetermined condition;
a second eigenvalue calculation unit configured to calculate, as a group of second eigenvalues, a group of eigenvalues of a second graph Laplacian matrix corresponding to a neighborhood graph in which, based on the degree of similarity between the plurality of second elements, each of the plurality of second elements is connected to one or more second elements in the second set whose degree of similarity is high enough to satisfy the predetermined condition;
an evaluation unit configured to evaluate similarity regarding data structure between the first data set and the second data set based on a comparison between the group of first eigenvalues and the group of second eigenvalues;
An information processing system equipped with.