WO2023162206A1

WO2023162206A1 - Information processing device, information processing method, and information processing program

Info

Publication number: WO2023162206A1
Application number: PCT/JP2022/008227
Authority: WO
Inventors: 勝悟林; 昌史小山田; 元紀草野
Original assignee: 日本電気株式会社
Priority date: 2022-02-28
Filing date: 2022-02-28
Publication date: 2023-08-31

Abstract

The present invention provides, as a technology for calculating the degree of similarity between a record pair, a technology that does not require training data pertaining to a record pair and that is capable of handling data of differing types. To this end, an information processing device (1) comprises: an acquisition unit (11) that acquires a record pair; a conversion unit (12) that converts the record pair so as to generate a converted record pair; a similarity degree calculation unit (13) that inputs the converted record pair into a model so as to calculate a similarity degree pertaining to the converted record pair; and an output unit (14) that outputs the similarity degree which has been calculated by the similarity degree calculation unit (13).

Description

Information processing device, information processing method and information processing program

The present invention relates to an information processing device, an information processing method, and an information processing program.

A process is performed to identify and associate combinations of identical or similar records among records stored in different datasets. Such processing is also called name identification processing. Name identification processing enables unified management of tables, expansion of data, and the like. One technique for performing name identification processing is matching by machine learning. For example, Patent Document 1 describes a device that calculates the similarity of record pairs using a plurality of similarity functions, and learns the weights of the similarities by supervised machine learning using training data. Here, the training data is a data set in which each combination of records is labeled to indicate whether the records are identical. In addition, Non-Patent Document 1 describes a technique called DITTO that performs name identification by supervised machine learning. Non-Patent Document 2 describes a technique called ZeroER that matches records by unsupervised machine learning that does not use training data.

Also, in recent years, language models (e.g., Non-Patent Documents 3 to 5) and image classification models (e.g., Non-Patent Document 6) have been proposed as models generated by machine learning.

Patent Document 1: Japanese Patent Application Laid-Open No. 2019-185244

However, supervised machine learning requires a large amount of training data, and there is also a problem that it cannot handle heterogeneous data. Here, heterogeneous data refers to a combination of records whose formats are not the same. In addition, ZeroER, the unsupervised machine learning technique described in Non-Patent Document 2, does not require training data, but the attributes must be aligned, so there is a problem that it cannot be applied to heterogeneous data with different attributes.

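One aspect of the present invention has been made in view of the above problems. An example of its object is to provide, as a technique for calculating the similarity of record pairs, a technique that does not require training data for record pairs and that can also handle heterogeneous data.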

An information processing apparatus according to one aspect of the present invention includes: acquisition means for acquiring a record pair; conversion means for generating a converted record pair by converting the record pair; similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and output means for outputting the similarity calculated by the similarity calculation means.

An information processing method according to one aspect of the present invention includes: at least one processor acquiring a record pair; generating a converted record pair by converting the record pair; calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and outputting the calculated similarity.

An information processing program according to one aspect of the present invention causes a computer to execute: an acquisition process of acquiring a record pair; a conversion process of generating a converted record pair by converting the record pair; a similarity calculation process of calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and an output process of outputting the similarity calculated in the similarity calculation process.

According to one aspect of the present invention, as a technique for calculating the similarity of record pairs, it is possible to provide a technique that does not require training data for record pairs and that can handle heterogeneous data.

FIG. 1 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 1.
FIG. 2 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 1.
FIG. 3 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 2.
FIG. 4 is a diagram showing a specific example of data including records according to exemplary embodiment 2.
FIG. 5 is a diagram showing an overview of the flow of processing performed by an information processing apparatus according to exemplary embodiment 2.
FIG. 6 is a diagram showing a specific example of identity determination results according to exemplary embodiment 2.
FIG. 7 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 2.
FIG. 8 is a diagram schematically illustrating entailment relationships of documents according to exemplary embodiment 2.
FIG. 9 is a diagram showing an example of an image converted by a conversion unit according to exemplary embodiment 2.
FIG. 10 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 3.
FIG. 11 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 3.
FIG. 12 is a conceptual diagram of similarity calculation processing using a question-answering model according to exemplary embodiment 3.
FIG. 13 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 4.
FIG. 14 is a diagram showing a specific example of similarity calculated by a similarity calculation unit according to exemplary embodiment 4.
FIG. 15 is a diagram showing a specific example of similarity calculated by a similarity calculation unit according to exemplary embodiment 4.
FIG. 16 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 5.
FIG. 17 is a diagram showing a specific example of screen display according to exemplary embodiment 5.
FIG. 18 is a diagram showing an example of graph data.
FIG. 19 is a diagram showing an example of a graph database.
FIG. 20 is a diagram schematically showing a configuration in which learned converters are provided before and after the output of a model.
FIG. 21 is a block diagram showing the configuration of a computer functioning as an information processing device according to each exemplary embodiment.

[Exemplary embodiment 1]
A first exemplary embodiment of the invention will now be described in detail with reference to the drawings. This exemplary embodiment is the basis for the exemplary embodiments described later.

<Configuration of information processing device 1>
The configuration of an information processing apparatus 1 according to this exemplary embodiment will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the information processing device 1. The information processing device 1 is a device that calculates the degree of similarity between records. Here, a record is a unit of data for which similarity is calculated. Examples of data containing records include structured data such as table data, semi-structured data described in a data description language such as JSON (JavaScript Object Notation: registered trademark) or XML (Extensible Markup Language), and unstructured data representing documents written in a natural language. A record is, for example, a row of a table and contains one or more pairs of an attribute name and an attribute value corresponding to the columns of the table. Also, the record may be graph data.

The information processing device 1 includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13, and an output unit 14.

(Acquisition unit 11)
Acquisition unit 11 acquires a record pair. A record pair is a set of records, such as a set of records included in a first table and records included in a second table. The first table and the second table are, for example, tables that store customer information of businesses or tables that store product information. However, the first table and the second table are not limited to the examples described above, and may be other tables. Also, the first table and the second table may be the same or different.

The records included in a record pair may have different data formats. More specifically, for example, when a record is a row of a table, some of the attribute names included in the records may differ, or all of the attribute names may differ.

The acquisition unit 11 may acquire the record pair by reading it from a storage device, or may acquire the record pair by receiving it from another device connected via a communication interface. Also, the acquisition unit 11 may acquire a record pair input from an input device via an input/output interface.

(Conversion unit 12)
The conversion unit 12 converts the record pair to generate a converted record pair. For example, the conversion unit 12 converts records included in a record pair into data representing documents, images, sounds, or graphs. More specifically, the conversion unit 12 converts the record into an affirmative sentence or a question sentence, for example. However, the method by which the conversion unit 12 converts the record pair is not limited to the example described above, and the conversion unit 12 may convert the record pair by another method.
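As a minimal sketch of one possible conversion, assuming a record is held as a mapping from attribute names to attribute values, the following Python functions serialize a record into an affirmative sentence and build a question sentence. The function names `to_affirmative` and `to_question` and the sentence templates are illustrative assumptions, not the conversion specified in this embodiment.

```python
def to_affirmative(record: dict) -> str:
    """Serialize a record into one affirmative sentence (hypothetical template)."""
    parts = [f"the {name} is {value}" for name, value in record.items()]
    return "In this record, " + ", and ".join(parts) + "."


def to_question(attribute: str) -> str:
    """Build a question sentence asking for one attribute (hypothetical template)."""
    return f"What is the {attribute} of this record?"


# A record pair whose two records have different attribute names and counts.
record_pair = (
    {"name": "NEC", "city": "Tokyo"},
    {"company": "NEC Corporation", "location": "Tokyo", "industry": "IT"},
)

# The converted record pair: each record becomes a document (a sentence).
converted_pair = tuple(to_affirmative(r) for r in record_pair)
```

Because the attribute names are written into the sentence itself, the two records do not need to share a schema before they are passed to a model.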

(Similarity calculation unit 13)
The similarity calculation unit 13 calculates the similarity regarding the converted record pair by inputting the converted record pair into the model. Here, the model is a model for calculating the degree of similarity, and as an example, is a model that is open to the public and available to any user. The model may be a model generated by machine learning or a rule-based model created by humans. Specifically, the model is, by way of example, a document classification model, an image classification model, an audio classification model, or a graph classification model. A document classification model is a model for classifying document data. An image classification model is a model for classifying image data. A speech classification model is a model that classifies speech data. A graph classification model is a model for classifying graph data.

Document classification models include, for example, a document embedding model, an entailment recognition model, a paraphrase prediction model, a question answering model, and a mask language model. A document embedding model is a model that embeds documents or words in a vector space. The entailment recognition model is a model that predicts entailment relationships of multiple documents. A paraphrasing prediction model is a model that predicts whether two documents are paraphrasing expressions. A question answering model is a model that extracts and outputs answers from documents given to questions. A mask language model is a model for predicting words that fit a mask in a document.

An example of an image classification model is an image embedding model. An image embedding model is a model that embeds image data in a vector space. A speech classification model includes, for example, a speech embedding model. A speech embedding model is a model that embeds speech data in a vector space.

Inputs for the above model include at least one of text data, image data, audio data, graphs, and vectors, for example. The output of the model includes, by way of example, a vector or a score indicating confidence. The score is, for example, a score indicating the degree of certainty regarding the entailment relationship between documents or a score indicating the degree of certainty that two documents are paraphrases of each other. However, the inputs and outputs of the model are not limited to the examples described above, and may include other information.

When the model is generated by machine learning, the model may be, for example, a language model described in Non-Patent Documents 3 to 5, an image classification model described in Non-Patent Document 6, or an audio classification model for classifying audio data, but is not limited to these. Moreover, the model may be stored in the memory of the information processing device 1 or may be stored in another device capable of communicating with the information processing device 1.

The degree of similarity is information relating to the degree of similarity between records included in a record pair, and an example is the cosine similarity of vector pairs. Also, the similarity may be a value calculated from the score output by the model.
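For reference, the cosine similarity of a vector pair mentioned above can be computed as in the following minimal Python sketch. The function name `cosine_similarity` and the convention of returning 0.0 for a zero vector are assumptions made for illustration; the vectors would in practice be outputs of an embedding model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Toy vectors standing in for the embeddings of two converted records.
similarity = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
```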

(Output unit 14)
The output unit 14 outputs the similarity calculated by the similarity calculation unit 13. For example, the output unit 14 may output the degree of similarity by writing it in a storage device, or may output the degree of similarity by transmitting the degree of similarity to another device via a communication interface. Also, the output unit 14 may output the degree of similarity to an output device (not shown) connected via an input/output interface. The output device is, for example, a display, printer, projector, or speaker.

The degree of similarity output by the output unit 14 is used, for example, for table integration processing or information search processing. In the case of table integration processing, by linking records predicted to be identical based on the similarity calculated by the similarity calculation unit 13, a plurality of tables can be integrated and unified data management can be performed. Further, in information retrieval, the similarity calculation unit 13 may calculate the similarity for each record pair consisting of a record serving as a search key (for example, a record specified by a user) and any other record registered in a predetermined table. In this case, the information processing apparatus 1 may output, as a search result, the records included in record pairs predicted to be identical based on the similarity calculated by the similarity calculation unit 13. As a result, search processing using the search key is possible even in a table that is not associated with the record serving as the search key.
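The information search use described above can be sketched as follows. This is only an illustration under stated assumptions: `serialize`, `stand_in_similarity`, and `search` are hypothetical names, and the string-matching ratio from Python's standard `difflib.SequenceMatcher` is used purely as a stand-in for the model of this embodiment.

```python
from difflib import SequenceMatcher


def serialize(record: dict) -> str:
    """Flatten the attribute values of a record into one lowercase string."""
    return " ".join(str(value) for value in record.values()).lower()


def stand_in_similarity(text_a: str, text_b: str) -> float:
    """Stand-in for the model: character-level matching ratio in [0, 1]."""
    return SequenceMatcher(None, text_a, text_b).ratio()


def search(key_record: dict, table: list, threshold: float) -> list:
    """Return the records of `table` predicted to be identical to the search key."""
    key_text = serialize(key_record)
    hits = []
    for record in table:
        s = stand_in_similarity(key_text, serialize(record))
        if s >= threshold:
            hits.append((s, record))
    hits.sort(key=lambda hit: hit[0], reverse=True)  # most similar first
    return [record for _, record in hits]


table = [
    {"company": "NEC Corporation", "city": "Tokyo"},
    {"company": "Acme Inc", "city": "Osaka"},
]
# The search key uses attribute names that do not appear in the table.
results = search({"name": "NEC Corp", "location": "Tokyo"}, table, threshold=0.6)
```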

<Effects of information processing device 1>
As described above, the information processing apparatus 1 according to this exemplary embodiment adopts a configuration comprising the acquisition unit 11 that acquires a record pair, the conversion unit 12 that converts the record pair to generate a converted record pair, the similarity calculation unit 13 that calculates a similarity regarding the converted record pair by inputting the converted record pair into a model, and the output unit 14 that outputs the similarity calculated by the similarity calculation unit 13. Therefore, the information processing apparatus 1 according to this exemplary embodiment makes it possible to provide, as a technique for calculating the similarity of record pairs, a technique that does not require training data regarding record pairs and that can handle heterogeneous data.

<Information processing program>
The functions of the information processing apparatus 1 described above can also be realized by a program. An information processing program according to this exemplary embodiment causes a computer to execute: an acquisition process of acquiring a record pair; a conversion process of generating a converted record pair by converting the record pair; a similarity calculation process of calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and an output process of outputting the similarity calculated in the similarity calculation process.

<Flow of information processing method S1>
The flow of the information processing method S1 according to this exemplary embodiment will be described with reference to FIG. 2. FIG. 2 is a flow diagram showing the flow of the information processing method S1. Each step in the information processing method S1 may be executed by a processor included in the information processing apparatus 1 or by a processor included in another apparatus.

At step S11, at least one processor acquires a record pair. At step S12, at least one processor generates transformed record pairs by transforming the record pairs. At step S13, at least one processor calculates a similarity for the transformed record pair by inputting the transformed record pair into a model. At step S14, at least one processor outputs the calculated similarity.
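Steps S11 to S14 can be sketched end to end as follows. This is a minimal illustration, not the implementation of this embodiment: the names `convert`, `token_overlap_model`, and `information_processing_method` are hypothetical, and a simple rule-based word-overlap score is assumed as the model so that the example runs without any trained model.

```python
from typing import Callable, Dict, Tuple

Record = Dict[str, str]


def convert(record: Record) -> str:
    """Flatten a record into a lowercase text (assumed conversion)."""
    return " ".join(f"{name} {value}" for name, value in record.items()).lower()


def token_overlap_model(doc_a: str, doc_b: str) -> float:
    """Stand-in rule-based model: Jaccard overlap of the two word sets."""
    a, b = set(doc_a.split()), set(doc_b.split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def information_processing_method(
    record_pair: Tuple[Record, Record],
    model: Callable[[str, str], float],
) -> float:
    left, right = record_pair                    # S11: acquire the record pair
    converted = (convert(left), convert(right))  # S12: generate the converted record pair
    similarity = model(*converted)               # S13: input the converted pair into the model
    return similarity                            # S14: output the similarity


pair = ({"name": "NEC", "city": "Tokyo"}, {"company": "NEC", "location": "Tokyo"})
s = information_processing_method(pair, token_overlap_model)
```

Passing the model as an argument mirrors the point that the model may be any publicly available or rule-based model; only the conversion has to produce an input the chosen model accepts.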

<Effect of information processing method S1>
As described above, the information processing method S1 according to this exemplary embodiment adopts a configuration in which at least one processor acquires a record pair, generates a converted record pair by converting the record pair, calculates a similarity regarding the converted record pair by inputting the converted record pair into a model, and outputs the calculated similarity. Therefore, the information processing method S1 according to this exemplary embodiment makes it possible to provide, as a technique for calculating the similarity of record pairs, a technique that does not require training data regarding record pairs and that can handle heterogeneous data.

[Exemplary embodiment 2]
A second exemplary embodiment of the invention will now be described in detail with reference to the drawings. Components having the same functions as the components described in the exemplary embodiment 1 are denoted by the same reference numerals, and description thereof will not be repeated.

<Overview of Information Processing Device 1A>
FIG. 3 is a block diagram showing the configuration of the information processing device 1A according to this exemplary embodiment. The information processing device 1A has a function of determining identity between records. Examples of data containing records are structured data such as table data, semi-structured data described in a data description language such as JSON or XML, or unstructured data representing a document written in a natural language.

FIG. 4 is a diagram showing a specific example of data containing records. In FIG. 4, data D1 is a table. In this case, a record is each row of the table. In FIG. 4, data D2 is semi-structured data described in a data description language such as a markup language. In this case, the record is a web page as an example. Data D3 is unstructured data representing a document written in natural language. In this case, the record is, for example, a file generated in a predetermined file format.

Here, an overview of the processing performed by the information processing device 1A according to this exemplary embodiment will be described with reference to FIG. 5. FIG. 5 is a diagram showing an overview of the flow of processing performed by the information processing apparatus 1A. The processing performed by the information processing apparatus 1A is roughly divided into (i) record pair generation processing, (ii) similarity calculation processing, and (iii) identity determination processing.

(i) In the process of generating record pairs, the information processing device 1A generates record pairs from first data x including multiple records e and second data x' including multiple records e'. As an example, the information processing apparatus 1A generates all combinations of the record e included in the first data x and the record e' included in the second data x'. Further, the information processing apparatus 1A may narrow down the candidates for identity determination of the second data x' for the record e of the first data x by a technique called blocking in generating the record pair.
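The record pair generation in (i) can be sketched as follows, under the assumption that each record is a Python dictionary. `all_pairs` enumerates every combination, and `blocked_pairs` shows one simple form of blocking. The blocking key used here (the first letter of the first attribute value) is a hypothetical choice for illustration only.

```python
from itertools import product


def all_pairs(first_data, second_data):
    """Every combination of a record e in x and a record e' in x'."""
    return list(product(first_data, second_data))


def blocked_pairs(first_data, second_data, key):
    """Keep only the pairs whose blocking keys are equal."""
    buckets = {}
    for e2 in second_data:
        buckets.setdefault(key(e2), []).append(e2)
    return [(e, e2) for e in first_data for e2 in buckets.get(key(e), [])]


def first_letter(record):
    """Hypothetical blocking key: first letter of the first attribute value."""
    first_value = next(iter(record.values()), "")
    return str(first_value)[:1].lower()


x = [{"name": "NEC"}, {"name": "Acme"}]
x_prime = [{"company": "NEC Corporation"}, {"company": "Apex"}, {"company": "Nippon"}]

candidates_all = all_pairs(x, x_prime)
candidates_blocked = blocked_pairs(x, x_prime, first_letter)
```

Blocking trades recall for speed: pairs that fall into different blocks are never scored, so the blocking key should be chosen loosely enough that true matches share a block.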

(ii) In the similarity calculation process, the information processing device 1A calculates the similarity between the records included in the record pair. In this exemplary embodiment, the information processing apparatus 1A calculates the degree of similarity by inputting converted record pairs obtained by converting records into a model. The details of the similarity calculation process will be described later.

(iii) In the identity determination process, the information processing device 1A determines the identity of the records included in the record pair based on the calculated similarity. As an example, the information processing device 1A determines that the records are the same when the degree of similarity is equal to or greater than a threshold. However, the method for determining identity is not limited to the above-described method, and information processing apparatus 1A may determine identity between records using other methods.

FIG. 6 is a diagram showing a specific example of identity determination results. In FIG. 6, a table TBL1 is an example of the first data x and includes multiple rows and multiple columns. Also, the table TBL2 is an example of the second data x' and includes multiple rows and multiple columns. In table TBL1 and table TBL2, a record is a row of the table. Table TBL1 contains records l1, l2, l3 and l4, and table TBL2 contains records r1, r2, r3.

In the example of FIG. 6, through the processes (i) to (iii) above, the information processing device 1A determines that the record l1 and the record r2 are the same, that the record l2 and the record r3 are the same, and that the record l3 and the record r1 are the same.

<Configuration of information processing device 1A>
As shown in FIG. 3, the information processing apparatus 1A includes a control unit 10A, a storage unit 20A, a communication unit 30A, and an input/output unit 40A.

(Communication unit 30A)
The communication unit 30A communicates with an external device of the information processing device 1A via a communication line. Although the specific configuration of the communication line does not limit this exemplary embodiment, examples of the communication line include wireless LAN (Local Area Network), wired LAN, WAN (Wide Area Network), public line network, mobile data communication network, or a combination thereof. The communication unit 30A transmits data supplied from the control unit 10A to other devices, and supplies data received from other devices to the control unit 10A.

(Input/output unit 40A)
Input/output devices such as a keyboard, mouse, display, printer, and touch panel are connected to the input/output unit 40A. The input/output unit 40A receives input of various kinds of information from the connected input device to the information processing apparatus 1A. Also, the input/output unit 40A outputs various kinds of information to the connected output device under the control of the control unit 10A. As the input/output unit 40A, for example, an interface such as a USB (Universal Serial Bus) can be used.

(Control unit 10A)
As shown in FIG. 3, the control unit 10A includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13, an output unit 14, an identity determination unit 15A, and an integration unit 16A.

(Acquisition unit 11)
In this exemplary embodiment, the acquisition unit 11 generates a record pair including the record e and the record e' from the first data x including the record e and the second data x' including the record e'. However, the acquisition unit 11 does not have to perform the process of generating the record pair. For example, the acquisition unit 11 may acquire record pairs by reading them from the storage unit 20A or another external storage device, or may acquire record pairs received from another device via the communication unit 30A. Also, the acquisition unit 11 may acquire a record pair input from an input device connected to the input/output unit 40A.

(Conversion unit 12)
The conversion unit 12 converts the record pair to generate a converted record pair. As an example, the conversion unit 12 converts records included in a record pair into document data, image data, audio data, or graphs. The conversion processing executed by the conversion unit 12 will be described later.

(Similarity calculation unit 13)
The similarity calculation unit 13 calculates the similarity s regarding the converted record pair by inputting the converted record pair into the model MA. The similarity s is information about the degree of similarity between records included in a record pair, and is, for example, a cosine similarity of a vector pair or a value calculated based on the score output by the model MA. The details of the process of calculating the similarity s by the similarity calculator 13 will be described later.

(Output unit 14)
The output unit 14 outputs the similarity s calculated by the similarity calculation unit 13. As an example, the output unit 14 outputs the similarity s by writing it into the storage unit 20A. However, the method by which the output unit 14 outputs the similarity s is not limited to the example described above, and the similarity s may be output by another method. As an example, the output unit 14 may transmit the similarity s to another device connected via the communication unit 30A, or may output the similarity s to an output device connected via the input/output unit 40A.

(Identity determination unit 15A)
The identity determination unit 15A determines identity between records included in a record pair based on the similarity s. As an example, the identity determination unit 15A determines that the records are the same when the similarity s is equal to or greater than a threshold. The identity determination unit 15A may also determine identity based on the ranking obtained when the record pairs are sorted in descending order of similarity, for example by determining that a predetermined number of record pairs with the highest similarity are identical. Further, as an example, the identity determination unit 15A may determine identity by applying a matching algorithm such as an algorithm for the stable marriage problem.
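The three determination methods above can be sketched as follows. The function names and the similarity values are hypothetical, chosen only so that the result matches the pairs shown for tables TBL1 and TBL2 in FIG. 6. The stable matching function is a plain Gale-Shapley procedure in which the records of the first data propose in descending order of similarity, which is one possible reading of "a matching algorithm such as an algorithm for the stable marriage problem".

```python
def by_threshold(similarities, threshold):
    """Pairs whose similarity s is equal to or greater than the threshold."""
    return {pair for pair, s in similarities.items() if s >= threshold}


def by_top_k(similarities, k):
    """The k pairs with the highest similarity."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return set(ranked[:k])


def by_stable_matching(similarities):
    """Gale-Shapley: left records propose to right records by similarity."""
    lefts = sorted({left for left, _ in similarities})
    rights = sorted({right for _, right in similarities})
    score = lambda left, right: similarities.get((left, right), 0.0)
    prefs = {l: sorted(rights, key=lambda r: score(l, r), reverse=True) for l in lefts}
    next_choice = {l: 0 for l in lefts}
    engaged = {}  # right record -> left record currently matched to it
    free = list(lefts)
    while free:
        l = free.pop()
        if next_choice[l] >= len(prefs[l]):
            continue  # l has proposed to every right record and stays unmatched
        r = prefs[l][next_choice[l]]
        next_choice[l] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = l
        elif score(l, r) > score(current, r):
            engaged[r] = l
            free.append(current)  # the displaced record proposes again
        else:
            free.append(l)
    return {(l, r) for r, l in engaged.items()}


# Hypothetical similarities for records l1..l4 (TBL1) and r1..r3 (TBL2).
sims = {
    ("l1", "r1"): 0.2, ("l1", "r2"): 0.9, ("l1", "r3"): 0.1,
    ("l2", "r1"): 0.3, ("l2", "r2"): 0.4, ("l2", "r3"): 0.8,
    ("l3", "r1"): 0.7, ("l3", "r2"): 0.2, ("l3", "r3"): 0.3,
    ("l4", "r1"): 0.1, ("l4", "r2"): 0.3, ("l4", "r3"): 0.2,
}
expected = {("l1", "r2"), ("l2", "r3"), ("l3", "r1")}
```

Note that stable matching pairs records even when their similarity is low, as long as a partner is free, so in practice it may be combined with a threshold.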

Also, the method of determining identity by the identity determination unit 15A is not limited to the above examples, and the identity determination unit 15A may determine identity by other methods. As an example, the identity determination unit 15A may predict the identity of record pairs by inputting record pairs and similarities into a prediction model generated by machine learning. In this case, the input of the prediction model includes, for example, record pairs and similarities. Also, the output of the prediction model includes, as an example, a prediction result of identity. In this case, the machine learning method of the prediction model is not limited; as an example, a decision-tree-based method, linear regression, or a neural network may be used, and two or more of these methods may be used in combination.

(Integration unit 16A)
The integration unit 16A integrates the first data x and the second data x' based on the determination result of the identity determination unit 15A. For example, the integration unit 16A integrates the first data x and the second data x' by increasing the number of records and/or increasing the number of data attributes. Data integration performed by the integration unit 16A includes, for example, (i) entity integration, (ii) data cleansing, and (iii) schema matching. (i) Entity integration refers to unifying the notation of different attributes and their values, given a set of records determined to be identical. (ii) Data cleansing refers to unifying differences in description formats such as company names, addresses, and area codes (for example, variant notations of a corporate suffix such as "Co., Ltd."). (iii) Schema matching means aligning (matching) a plurality of attributes with different notations.
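As a minimal sketch of integration that increases both the number of attributes and the number of records, the following Python function merges matched records and appends unmatched ones. The function name `integrate`, the representation of matches as index pairs, and the rule that values of the first data take precedence are all assumptions made for illustration.

```python
def integrate(first_data, second_data, matches):
    """Integrate two lists of records.

    `matches` is a list of (i, j) index pairs meaning that first_data[i]
    and second_data[j] were determined to be identical.
    """
    partner = dict(matches)
    matched_second = {j for _, j in matches}
    integrated = []
    for i, record in enumerate(first_data):
        merged = dict(record)
        if i in partner:
            # Add attributes from the matched record (more attributes);
            # existing values from the first data are kept as-is.
            for name, value in second_data[partner[i]].items():
                merged.setdefault(name, value)
        integrated.append(merged)
    # Append records of the second data that had no match (more records).
    integrated.extend(
        dict(record) for j, record in enumerate(second_data) if j not in matched_second
    )
    return integrated


x = [{"name": "NEC"}, {"name": "Acme"}]
x_prime = [{"company": "NEC Corporation", "city": "Tokyo"}, {"company": "Apex"}]
result = integrate(x, x_prime, matches=[(0, 0)])
```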

(Storage unit 20A)
The storage unit 20A stores the first data x and the second data x′, and also stores the similarity s calculated by the similarity calculation unit 13. A model MA is stored in the storage unit 20A. Note that the expression that the model MA is stored in the storage unit 20A means that the parameters that define the model MA are stored in the storage unit 20A.

(Model MA)
The model MA is a model for calculating the degree of similarity, and as an example, is a model that is open to the public and can be used by any user. The model MA may be a model generated by machine learning or a rule-based model created by humans.

As an example, the model MA includes at least one of a document classification model, image classification model, speech classification model and graph classification model. Examples of document classification models include document embedding models, entailment recognition models, paraphrase prediction models, and mask language models. An example of an image classification model is an image embedding model that embeds image data in a vector space. As a speech classification model, for example, there is a speech embedding model that embeds speech data in a vector space.

Inputs of the model MA include at least one of document data, image data, audio data, graphs, and vectors, for example. The output of the model MA includes, as an example, vectors and/or scores.

(Example 1 of document classification model: document embedding model)
A document embedding model is a model that embeds documents or words in a vector space. The document embedding model is generated by RoBERTa described in Non-Patent Document 3 as an example. The input of the document embedding model is, for example, a document or a word (for example, the sentence "An elderly man is walking in the park."). The output is a vector as an example.

(Example 2 of document classification model: entailment recognition model)
The entailment recognition model is a model for predicting whether or not there is an entailment relation of "if it is document 1, then it is document 2". The entailment recognition model is generated by the technique described in Non-Patent Document 4 as an example. The input of the entailment recognition model is two documents as an example. The two documents are, for example, document 1 "an old man is walking in the park" and document 2 "a man is in the park". In this case, document 1 entails document 2, because whenever document 1 holds, document 2 also holds. FIG. 8 is a diagram schematically showing entailment relationships of documents. In the example of FIG. 8, document 1 entails document 2.

Also, the output of the entailment recognition model is an entailment score as an example. The entailment score is a numerical value indicating the certainty of the entailment relation, and is a real number between 0 and 1, for example. As an example, the entailment score indicates that the higher the value, the higher the certainty of the entailment relation.

(Example 3 of document classification model: paraphrase prediction model)
A paraphrase prediction model is a model that predicts whether two documents are paraphrases of each other. A paraphrase prediction model is generated by RoBERTa described in Non-Patent Document 3 as an example. The input of the paraphrase prediction model is two documents as an example. The two documents are, for example, document 1 stating "NEC is an IT company" and document 2 stating "NEC Corporation is in the IT business". The output of the model includes a paraphrase score, as an example. The paraphrase score indicates the degree of certainty that the two documents are paraphrases of each other, and is a real number between 0 and 1, for example. As an example, the higher the paraphrase score, the higher the certainty that the two documents are paraphrases.

(Example 4 of document classification model: mask language model)
A mask language model is a model that predicts the word that fits a masked position in a document. The mask language model is, for example, a model generated by RoBERTa described in Non-Patent Document 3. The input of the mask language model is, for example, a document (for example, the sentences "This pizza is very good. I [mask] this pizza."). The output of the model includes words (e.g., "like") and scores. The score is a value indicating the degree of certainty that the word fits the mask, and is a real number between 0 and 1, for example.

(Image classification model)
An image classification model is a model that classifies images. The image classification model is a model generated by the technique described in Non-Patent Document 6 as an example. An example input for the image classification model is an image (e.g., an image of a dog). The intermediate output of the model is, for example, a vector representation of the image, and the output of the model is, for example, a label (for example, a label indicating "dog") and a score. A score is a value indicating the certainty of a label, and is a real number between 0 and 1, for example. In this exemplary embodiment, the similarity calculator 13 calculates the similarity s using the vector representation of the image, which is the intermediate output.

(Speech classification model)
A speech classification model is a model that classifies speech. The input of the speech classification model is, for example, speech data (e.g., a dog barking). The intermediate output of the model is, as an example, a vector representation of the speech, and the output of the model is, as an example, a label (e.g., a label indicating that the speech is a dog bark) and a score. A score is a value indicating the certainty of a label, and is a real number between 0 and 1, for example. In this exemplary embodiment, the similarity calculator 13 calculates the similarity s using the vector representation of the speech, which is the intermediate output.

(Graph classification model)
A graph classification model is a model that classifies graphs. The input of the graph classification model is, for example, graph data (for example, a graph representing facial features). The intermediate output of the model is, for example, a vector representation of the graph, and the output of the model is, for example, a label (for example, a label indicating a person) and a score. A score is a value indicating the certainty of a label, and is a real number between 0 and 1, for example. In this exemplary embodiment, the similarity calculator 13 calculates the similarity s using the vector representation of the graph, which is the intermediate output.

<Flow of information processing method S100A>
FIG. 7 is a flowchart showing the flow of the information processing method S100A executed by the information processing apparatus 1A. Note that some of the steps included in the information processing method S100A may be executed in parallel or in a different order. Also, the description of the already described contents will not be repeated.

(Step S101)
In step S101, the acquisition unit 11 reads the model MA. In this exemplary embodiment, model MA is selected from a plurality of model candidates. The plurality of model candidates include, for example, at least one of a document embedding model, a document classification model such as an entailment recognition model, an image classification model, and an audio classification model. The selection of the model MA may be performed based on a user's operation as an example, or may be performed according to a predetermined algorithm. The model MA may be a single model or a set of multiple models.

(Step S102)
In step S102, the acquisition unit 11 reads record pairs. For example, record e and record e' included in a record pair are represented as
e = ((a_j, v_j))_{j=1,...,d},
e′ = ((a′_j, v′_j))_{j=1,...,d′}.
Here, the attribute name a_j ∈ A_j, where A_j is a string space as an example, and the attribute value v_j ∈ V_j, where V_j is, for example, a string space or a real number space. In this example, record l1 shown in the drawings is
l1 = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99)),
and record r2 is
r2 = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)).

In this exemplary embodiment, the acquisition unit 11 generates record pairs from the first data x and the second data x'. For example, the acquisition unit 11 generates all combinations of records e included in the first data x and records e' included in the second data x'. Further, in generating a record pair, the obtaining unit 11 may narrow down candidates for identity determination of the second data x′ for the record e of the first data x by a technique called blocking.
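The pair generation described above can be sketched as follows. This is only a minimal illustration, assuming that a record is held as a Python list of (attribute name, attribute value) tuples. The function names `all_pairs` and `blocked_pairs` and the shared-word blocking rule are assumptions introduced for illustration and are not prescribed by this exemplary embodiment.

```python
from itertools import product

def tokens(record):
    """Collect lower-cased words from the string attribute values of a record."""
    words = set()
    for _name, value in record:
        if isinstance(value, str):
            words.update(value.lower().split())
    return words

def all_pairs(x, x_prime):
    """Generate every combination of a record e in x and a record e' in x'."""
    return list(product(x, x_prime))

def blocked_pairs(x, x_prime, min_shared=1):
    """Keep only pairs that share at least `min_shared` words (assumed blocking rule)."""
    return [(e, e2) for e, e2 in product(x, x_prime)
            if len(tokens(e) & tokens(e2)) >= min_shared]

# First data x and second data x' (values taken from the examples in this description).
x = [[("title", "sims 2 glamor life stuff pack"),
      ("manufacturer", "aspyr media"), ("price", 24.99)]]
x_prime = [
    [("title", "adobe photoshop elements 4.0 photo-editing software for mac"),
     ("price", 85.95)],
    [("title", "aspyr media inc sims 2 glamor life stuff pack"),
     ("price", float("nan"))],
]

pairs = all_pairs(x, x_prime)           # every combination
candidates = blocked_pairs(x, x_prime)  # combinations surviving blocking
```

With these inputs, blocking discards the pair whose records share no word, so fewer converted record pairs need to be input to the model MA.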

(Step S103)
In step S103, the conversion unit 12 generates a converted record pair by converting the record pair (e, e') into a format corresponding to the input of the model MA. As an example, when the model MA includes a document classification model, the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair (e, e') into documents. When the model MA includes an image classification model, the conversion processing includes processing for generating a converted record pair by converting the record pair (e, e') into images. When the model MA includes a speech classification model, the conversion processing includes processing for generating a converted record pair by converting the record pair into speech. When the model MA includes a graph classification model, the conversion processing includes processing for generating a converted record pair by converting the record pair into graphs.

(Step S104)
In step S104, the similarity calculation unit 13 calculates the similarity s regarding the converted record pair by inputting the converted record pair into the model MA.

(Processing examples 1 to 5 of steps S101 to S104)
Here, processing examples 1 to 5 will be described as processing examples of steps S101 to S104. Processing example 1 is a processing example in the case of using the document embedding model. Processing example 2 is a processing example in the case of using an image classification model. Processing example 3 is a processing example in the case of using a speech classification model. Processing example 4 is a processing example when using an entailment recognition model. Processing example 5 is a processing example in the case of using a paraphrase prediction model.

(Processing example 1: document embedding model)
In this example, in step S101, the acquisition unit 11 reads the document embedding model. Further, in step S103, the conversion unit 12 converts the record pair (e, e') into the document pair (t, t'). As an example, the conversion unit 12 converts a record pair (e, e') containing the records
e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99))
e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN))
into a document pair (t, t') containing the documents
t = "Title is sims 2 glamor life stuff pack. Manufacturer is aspyr media. Price is 24.99."
t' = "Title is aspyr media inc sims 2 glamor life stuff pack."

Also, in step S104, the similarity calculation unit 13 converts the document pair (t, t') into a vector pair (v, v') using the document embedding model. Here, v=M(t) and v'=M(t'), where M denotes the document embedding model. Further, the similarity calculator 13 calculates the similarity s from the vector pair (v, v'). The similarity s is, for example, s=exp(-||v-v'||/c), where c>0. Alternatively, the similarity s may be the cosine similarity s=v^T v'/(||v|| ||v'||). Here, ^T is a symbol representing transposition.
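The two similarity formulas in this processing example can be written out as the following sketch. The vectors used here are toy values standing in for v=M(t) and v'=M(t'); an actual document embedding model is not called, and the function names are assumptions introduced for illustration.

```python
import math

def exp_distance_similarity(v, w, c=1.0):
    """s = exp(-||v - w|| / c) with c > 0; the value is 1 when the vectors coincide."""
    if c <= 0:
        raise ValueError("c must be positive")
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))
    return math.exp(-dist / c)

def cosine_similarity(v, w):
    """s = v^T w / (||v|| ||w||)."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

# Toy vectors in place of the embeddings v = M(t) and v' = M(t').
v = [1.0, 0.0, 1.0]
v_prime = [1.0, 0.0, 0.0]

s_exp = exp_distance_similarity(v, v_prime, c=2.0)  # exp(-1 / 2)
s_cos = cosine_similarity(v, v_prime)               # 1 / sqrt(2)
```

The constant c controls how quickly the first similarity decays with distance; the cosine similarity ignores vector length and depends only on direction.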

(Processing example 2: image embedding model)
In this example, in step S101, the acquisition unit 11 reads an image embedding model. In step S103, the conversion unit 12 converts the record pair (e, e') into the image pair (i, i').

FIG. 9 is a diagram showing an example of images converted by the conversion unit 12. As an example, the conversion unit 12 converts a record pair (e, e') containing the records
e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99))
e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN))
into the images i and i' shown in the figure.

Also, in step S104, the similarity calculation unit 13 converts the image pair (i, i') into a vector pair (v, v') using the image embedding model. Here, v=M(i) and v'=M(i'). Further, the similarity calculator 13 calculates the similarity s from the vector pair (v, v'). The similarity s is, for example, s=exp(-||v-v'||/c), where c>0.

When using the image embedding model, the conversion unit 12 may convert one record into one image, or may perform image conversion for each element (eg, word) included in the record. When the conversion unit 12 performs image conversion for each element, the similarity calculation unit 13 calculates the similarity s using a set of images for each element. Further, when image conversion is performed for each element, the conversion unit 12 may not perform image conversion for missing values in records.

(Processing example 3: speech embedding model)
In this example, in step S101, the acquisition unit 11 reads the speech embedding model. Also, in step S103, the conversion unit 12 converts the record pair (e, e') into the speech data pair (i, i'). As an example, the conversion unit 12 converts a record pair (e, e') containing the records
e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99))
e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN))
into speech data i representing record e read aloud and speech data i' representing record e' read aloud.

Also, in step S104, the similarity calculation unit 13 converts the speech data pair (i, i') into a vector pair (v, v') using the speech embedding model. Here, v=M(i) and v'=M(i'). Further, the similarity calculator 13 calculates the similarity s from the vector pair (v, v'). The similarity s is, for example, s=exp(-||v-v'||/c), where c>0.

(Processing example 4: entailment recognition model)
In this example, in step S101, the acquisition unit 11 reads an entailment recognition model. Further, in step S103, the conversion unit 12 converts the record pair (e, e') into the document pair (t, t').

As an example, the conversion unit 12 converts a record e=((a_j, v_j))_{j=1,...,d} into the document
t = "a_1 is v_1. ... a_d is v_d."
However, the conversion unit 12 does not include missing values in the document t.

As an example, the conversion unit 12 converts a record pair (e, e') containing the records
e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99))
e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN))
into a document pair (t, t') containing the documents
t = "Title is sims 2 glamor life stuff pack. Manufacturer is aspyr media. Price is 24.99."
t' = "Title is aspyr media inc sims 2 glamor life stuff pack."
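The record-to-document conversion of this processing example can be sketched as follows. This is a minimal illustration, assuming that records are lists of (attribute name, attribute value) tuples and that a missing value is represented by None or NaN. The names `is_missing` and `record_to_document` are hypothetical.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing values (an assumed representation)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def record_to_document(record):
    """Render each non-missing (a_j, v_j) as the sentence "a_j is v_j." and join them."""
    sentences = [f"{name.capitalize()} is {value}."
                 for name, value in record if not is_missing(value)]
    return " ".join(sentences)

e = [("title", "sims 2 glamor life stuff pack"),
     ("manufacturer", "aspyr media"), ("price", 24.99)]
e_prime = [("title", "aspyr media inc sims 2 glamor life stuff pack"),
           ("price", float("nan"))]

t = record_to_document(e)
t_prime = record_to_document(e_prime)
```

Because the missing price of record e' is skipped, t' consists of the title sentence only, which matches the document pair given in the example.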

Also, in step S104, the similarity calculation unit 13 calculates the entailment score of the document pair (t, t') using the entailment recognition model. Furthermore, the similarity calculation unit 13 calculates the similarity s using the entailment score. The similarity s is, for example, s=M(t, t')×M(t', t). In other words, the similarity s is the product of the entailment score M(t, t′) of the entailment relation "if document t, then document t′" and the entailment score M(t′, t) of the entailment relation "if document t′, then document t". However, the similarity s is not limited to the example described above, and may be another value. The similarity s may be, for example, the maximum of the entailment scores M(t, t′) and M(t′, t), or the sum of the entailment scores M(t, t′) and M(t′, t).
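The three ways of combining the two directional entailment scores can be sketched as follows. The scoring function `toy_score` is a hypothetical word-overlap stand-in for the entailment recognition model M and is used only so that the sketch runs on its own; an actual entailment recognition model would be substituted for it.

```python
def entailment_similarity(score, t, t_prime, mode="product"):
    """Combine the directional entailment scores M(t, t') and M(t', t)."""
    forward = score(t, t_prime)   # "if t, then t'"
    backward = score(t_prime, t)  # "if t', then t"
    if mode == "product":
        return forward * backward
    if mode == "max":
        return max(forward, backward)
    if mode == "sum":
        return forward + backward
    raise ValueError(f"unknown mode: {mode}")

def toy_score(premise, hypothesis):
    """Hypothetical stand-in for M: fraction of hypothesis words found in the premise."""
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / len(h) if h else 0.0

t = "an old man is walking in the park"
t_prime = "a man is in the park"

s_product = entailment_similarity(toy_score, t, t_prime)
s_max = entailment_similarity(toy_score, t, t_prime, mode="max")
s_sum = entailment_similarity(toy_score, t, t_prime, mode="sum")
```

The product is high only when both directions score high, which corresponds to requiring mutual entailment, whereas the maximum is satisfied by a single direction.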

If the record e and the record e' included in the record pair (e, e') are the same, then e⊂e' and e'⊂e. In this processing example, the similarity calculator 13 uses this relationship to calculate the similarity.

(Processing example 5: paraphrase prediction model)
In this example, in step S101, the acquisition unit 11 reads a paraphrase prediction model. Further, in step S103, the conversion unit 12 converts the record pair (e, e') into the document pair (t, t').

As an example, the conversion unit 12 converts a record e=((a_j, v_j))_{j=1,...,d} into the document
t = "v_1 ... v_d"
However, the conversion unit 12 does not include missing values in the document.

As an example, the conversion unit 12 converts a record pair (e, e') containing the records
e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99))
e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN))
into a document pair (t, t') containing the documents
t = "sims 2 glamor life stuff pack aspyr media 24.99"
t' = "aspyr media inc sims 2 glamor life stuff pack"
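The value-only conversion used for the paraphrase prediction model can be sketched as follows, again assuming that records are lists of (attribute name, attribute value) tuples and that a missing value is None or NaN. The function name `record_to_value_document` is hypothetical.

```python
import math

def record_to_value_document(record):
    """Join the non-missing attribute values v_1 ... v_d with single spaces."""
    kept = []
    for _name, value in record:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # missing values are left out of the document
        kept.append(str(value))
    return " ".join(kept)

e = [("title", "sims 2 glamor life stuff pack"),
     ("manufacturer", "aspyr media"), ("price", 24.99)]
e_prime = [("title", "aspyr media inc sims 2 glamor life stuff pack"),
           ("price", float("nan"))]

t = record_to_value_document(e)
t_prime = record_to_value_document(e_prime)
```

Dropping the attribute names leaves two short phrases, which is closer to the sentence pairs that a paraphrase prediction model typically receives.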

Also, in step S104, the similarity calculation unit 13 calculates the paraphrase score of the document pair (t, t') using the paraphrase prediction model, and sets the calculated paraphrase score as the similarity s of the record pair. That is, in this processing example, the similarity calculation unit 13 calculates the similarity s by putting the record pair into a format that asks whether it is a paraphrase expression.

(Steps S105 and S106)
In step S105, the output unit 14 outputs the similarity s calculated by the similarity calculation unit 13. As an example, the output unit 14 outputs the similarity s by writing it into the storage unit 20A. In step S106, the identity determination unit 15A determines identity between the records included in the record pair based on the similarity s.

(Step S107)
In step S107, the integration unit 16A refers to the determination result of the identity determination unit 15A and generates integrated data from the first data x and the second data x'. The integrated data includes, for example, a record obtained by integrating the records included in a record pair determined to be identical by the identity determination unit 15A.

<Effects of information processing device 1A>
In recent years, highly accurate models (inference models) trained on huge datasets have been published for various tasks in the fields of natural language processing and image processing (text classification, question answering, image classification, etc.). For example, DITTO (see Non-Patent Document 1), which is a supervised machine learning name identification model, partially applies a pretrained language model of BERT (Bidirectional Encoder Representations from Transformers). However, since the data format varies depending on the task, it has not been possible to perform name identification using only existing inference models. In particular, existing inference models could not perform name identification on records containing unexpected attributes.

On the other hand, according to the present exemplary embodiment, the information processing device 1A converts the record pair into a format corresponding to the input of the model MA and inputs it to the model MA to calculate the similarity s for the record pair. By transforming records and applying name identification tasks to tasks in the fields of natural language processing and image processing, it is possible to utilize inference models trained with large amounts of data. That is, according to this exemplary embodiment, identity can be determined for records having various attributes without requiring training data for the model MA.

Further, the information processing apparatus 1A according to this exemplary embodiment adopts a configuration in which the model MA is selected from a plurality of model candidates, and the conversion unit 12 generates a converted record pair by converting the record pair (e, e') into a format corresponding to the input of the model MA. Through the conversion by the conversion unit 12, the record pair becomes data in a format that can be input to the model MA. That is, no matter what kind of attributes the records whose similarity is to be calculated contain, the similarity calculator 13 can calculate the similarity s by inputting the converted record pair into the model MA. Thus, the information processing apparatus 1A according to the present exemplary embodiment provides the effect that similarities can be calculated for records having various attributes using the model MA without requiring training data for the model MA.

Further, the information processing apparatus 1A according to the present exemplary embodiment adopts a configuration in which the model MA includes a document classification model, and the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into documents. Therefore, the information processing apparatus 1A according to this exemplary embodiment provides the effect that the similarity of records having various attributes can be calculated using the model MA, which is a document classification model, without training the model MA.

Further, the information processing apparatus 1A according to the present exemplary embodiment adopts a configuration in which the model MA includes an image classification model, and the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into images. Because the information processing apparatus 1A converts the record pair into images, the similarity between records whose character strings differ but whose written forms look alike can be calculated more suitably. For example, for a record containing the word "glamour" and a record containing the word "glqmour", the character strings differ but the character shapes are similar, so a high degree of similarity is calculated. As described above, in addition to the effects of the information processing apparatus 1 according to the first exemplary embodiment, the information processing apparatus 1A according to the present exemplary embodiment provides the effect that a similarity reflecting the similarity of character shapes can be calculated.

Further, the information processing apparatus 1A according to the present exemplary embodiment adopts a configuration in which the model MA includes a speech classification model, and the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into speech. Because the information processing apparatus 1A converts the record pair into speech, the similarity between records whose character strings differ but whose phonemes are similar can be calculated more suitably. For example, a record containing the word "glamour" and a record containing the word "glamar" differ in their character strings but are similar in pronunciation, so a high degree of similarity is calculated. As described above, in addition to the effects of the information processing apparatus 1 according to the first exemplary embodiment, the information processing apparatus 1A according to the present exemplary embodiment provides the effect that a similarity reflecting the similarity of sounds can be calculated.

Further, the information processing apparatus 1A according to the present exemplary embodiment adopts a configuration in which the model MA includes a graph classification model, and the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into graphs. Because the information processing apparatus 1A converts the record pair into graphs, the similarity of records having various attributes can be calculated using the model MA, which is a graph classification model, without training the model MA.

[Exemplary embodiment 3]
A third exemplary embodiment of the invention will now be described in detail with reference to the drawings. Components having the same functions as those described in exemplary embodiments 1 and 2 are denoted by the same reference numerals, and description thereof will not be repeated.

<Configuration of information processing device 1B>
FIG. 10 is a block diagram showing the configuration of an information processing device 1B according to this exemplary embodiment. The control unit 10B of the information processing device 1B includes an acquisition unit 11B, a conversion unit 12B, a similarity calculation unit 13B, an output unit 14, an identity determination unit 15A, and an integration unit 16A. The storage unit 20B also stores the model MB in addition to the first data x, the second data x', and the similarity s.

The acquisition unit 11B acquires an auxiliary record in addition to the record pair (e, e'). An auxiliary record is a record used as an aid in calculating the similarity of the record pair (e, e'). The auxiliary record is, for example, a record that is included in the first data x and is other than the record e included in the record pair (e, e'). Alternatively, the auxiliary record is, for example, a record that is included in the second data x' and is other than the record e' included in the record pair (e, e').

The conversion unit 12B converts the record pairs acquired by the acquisition unit 11B to generate converted record pairs. Also, the conversion unit 12B generates a converted auxiliary record by converting the auxiliary record. For example, the conversion unit 12B converts the auxiliary records into data representing documents, images, sounds, or graphs.

More specifically, as an example, the conversion unit 12B generates the converted record pair by converting one record e included in the record pair (e, e') into a question sentence and converting each of the other record e' included in the record pair (e, e') and the auxiliary records into an answer sentence.

The similarity calculation unit 13B calculates the similarity regarding the converted record pair by inputting the converted record pair and the converted auxiliary record into the model MB. The model MB includes, as an example, a question-answering model that inputs question sentences and answer sentences.

(question answering model)
The question-answering model is a model that extracts, from a given document, an answer sentence to a question sentence and outputs it. The question-answering model is a model generated by a technique called TANDA described in Non-Patent Document 5 as an example. Inputs of the question-answering model include, for example, a question sentence and a document. The question sentence is, for example, "Where is NEC's headquarters?" As an example, the document is "Nippon Denki (English: NEC Corporation) is an electronics manufacturer of the Sumitomo Group headquartered in Shiba 5-chome, Minato-ku, Tokyo. One of the constituent stocks of the Nikkei Stock Average."

The output of the model includes, as an example, answer sentences and scores. An example of the answer sentence is "Shiba 5-chome, Minato-ku, Tokyo". The score is, for example, a real number between 0 and 1. Here, a score for each word may be calculated when determining the output of the model. For example, the question-answering model calculates a score of "0.1" for "NEC Corporation", a score of "0.02" for "Sumitomo Group", and a score of "0.08" for "Nikkei Stock Average".

<Flow of information processing method S100B>
FIG. 11 is a flowchart showing the flow of information processing method S100B executed by information processing apparatus 1B. Note that some steps may be performed in parallel or out of order. Also, the description of the already described contents will not be repeated.

The information processing method S100B includes steps S101B, S102, S102B, S103B, S104B, S105, S106, and S107.

At step S101B, the acquisition unit 11B reads the model MB. In step S102B, the acquisition unit 11B reads the auxiliary record.

In step S103B, the conversion unit 12B converts the record pair to generate a converted record pair, and converts the auxiliary record to generate a converted auxiliary record. In step S104B, the similarity calculation unit 13B inputs the converted record pair and the converted auxiliary record to the model MB to calculate the similarity regarding the converted record pair.

(Processing example 6 of steps S101B to S104B: question answering model)
Here, as an example of the processing from step S101B to step S104B, an example of processing in the case of using a question answering model will be described. In this example, in step S101B, the acquisition unit 11B reads a question answering model. Also, in steps S102 and S102B, the acquisition unit 11B reads the record pair (e, e') and the auxiliary record R={e_1, . . . , e_k} (k is a natural number). The auxiliary record R is, for example, the set of all records included in the second data x'. However, the auxiliary record R is not limited to the example described above, and may be a set of other records. For example, the auxiliary record R may be a set of records selected from the second data x' by a randomized algorithm. Also, the auxiliary record R may be a record set obtained by blocking, such as a record set obtained by extracting, from the second data x', records containing words in common with the record e.

In this processing example, the auxiliary record R includes the record e' contained in the record pair (e, e'). The auxiliary record R is, for example, the set of records r1 to r3 of the table TBL2 shown in the drawings:
R = {((title, adobe photoshop elements 4.0 photo-editing software for mac), (price, 85.95)), ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)), ((title, final-draft final draft av 2.5 screenwriting software mac/win screen writing software), (price, 199.95))}

In this processing example, in step S103B, the conversion unit 12B converts record e and auxiliary record R (e' ∈ R). More specifically, the conversion unit 12B converts the record e into a question sentence q=T1(e). Here, the question sentence q is preferably of the so-called 5W1H open question type. Also, the conversion unit 12B converts the auxiliary record R into a document containing a plurality of answer sentences.

As an example, the conversion unit 12B converts the record e = ((a_1, v_1), ..., (a_d, v_d)) into the question sentence
T1(e) = "What is characterized as v_1 of a_1, ..., and v_d of a_d?"
Also, the conversion unit 12B converts the auxiliary record R={e_1, ..., e_k} into the document
T2(R) = "T3(e_1). ... T3(e_k)."
Here, the answer sentence T3(e_j) (1≤j≤k) included in the document T2(R) is
T3(e_j) = "{ID of e_j} is characterized as v_1 of a_1, ..., and v_d of a_d",
where {ID of e_j} is the unique ID assigned to record e_j ∈ R. However, the conversion unit 12B does not include missing values in the document during conversion.

As an example, the conversion unit 12B converts the record
e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99))
into the question sentence
q = "What is characterized as title of sims 2 glamor life stuff pack, manufacturer of aspyr media, and price of 24.99?"
Also, the conversion unit 12B converts the auxiliary record
R = {((title, adobe photoshop elements 4.0 photo-editing software for mac), (price, 85.95)), ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)), ((title, final-draft final draft av 2.5 screenwriting software mac/win screen writing software), (price, 199.95))}
into the document
c = "r1 is characterized as title of adobe photoshop elements 4.0 photo-editing software for mac and price of 85.95. r2 is characterized as title of aspyr media inc sims 2 glamor life stuff pack. r3 is characterized as title of final-draft final draft av 2.5 screenwriting software mac/win screen writing software and price of 199.95."
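The conversions T1, T2, and T3 can be sketched as follows. This is a minimal illustration that follows the word order of the worked example above (attribute name, then "of", then attribute value). It assumes that records are lists of (attribute name, attribute value) tuples, that a missing value is None or NaN, and that the auxiliary record R is given as a list of (ID, record) tuples. All function names are hypothetical.

```python
import math

def _pairs(record):
    """Yield (attribute, value) pairs, leaving out missing values (None or NaN)."""
    for name, value in record:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        yield name, value

def _phrase(record):
    """Build "a of v", "a of v and a of v", or "a of v, ..., and a of v"."""
    parts = [f"{name} of {value}" for name, value in _pairs(record)]
    if len(parts) <= 2:
        return " and ".join(parts)
    return ", ".join(parts[:-1]) + ", and " + parts[-1]

def to_question(record):
    """T1: convert a record into a question sentence."""
    return f"What is characterized as {_phrase(record)}?"

def to_answer(record_id, record):
    """T3: convert a record with its ID into an answer sentence."""
    return f"{record_id} is characterized as {_phrase(record)}."

def to_document(auxiliary):
    """T2: concatenate the answer sentences of all auxiliary records."""
    return " ".join(to_answer(rid, rec) for rid, rec in auxiliary)

e = [("title", "sims 2 glamor life stuff pack"),
     ("manufacturer", "aspyr media"), ("price", 24.99)]
R = [
    ("r1", [("title", "adobe photoshop elements 4.0 photo-editing software for mac"),
            ("price", 85.95)]),
    ("r2", [("title", "aspyr media inc sims 2 glamor life stuff pack"),
            ("price", float("nan"))]),
    ("r3", [("title", "final-draft final draft av 2.5 screenwriting software "
                      "mac/win screen writing software"),
            ("price", 199.95)]),
]

q = to_question(e)
c = to_document(R)
```

Prefixing each answer sentence with the record ID lets the score returned by the question answering model be traced back to a specific record in R.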

Also, in step S104B, the similarity calculation unit 13B inputs the question sentence q and the document c to the question answering model. The question answering model outputs a score indicating the degree of certainty that the answer to the input question sentence q is the answer sentence T3(e_j) (1≦j≦k) extracted from the document c. The similarity calculator 13B calculates the similarity s based on the score output by the question answering model. The similarity s is, for example, MB(q, c, {ID of e'}), that is, the confidence that the record e' included in the record pair (e, e') is an answer sentence. However, the similarity s is not limited to this example, and the similarity calculation unit 13B may calculate the similarity s by another method. As an example, the similarity calculation unit 13B may take the sum of the score when the record e is used as the question and the score when the record e' is used as the question as the degree of similarity.
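Reading the similarity off the scores of a question answering model can be sketched as follows. For simplicity, the answer sentences are passed as a dictionary from record ID to sentence rather than as one document, which is an assumption of this sketch. The function `toy_model` is a hypothetical word-overlap stand-in for the model MB and does not reflect how an actual question answering model scores answers.

```python
def qa_similarity(model, question, answers, target_id):
    """Return the score that the answer sentence of `target_id` answers `question`."""
    scores = model(question, answers)
    return scores[target_id]

def toy_model(question, answers):
    """Hypothetical stand-in for MB: word overlap with the question, normalized to sum to 1."""
    q_words = set(question.lower().rstrip("?").split())
    overlap = {rid: len(q_words & set(text.lower().split()))
               for rid, text in answers.items()}
    total = sum(overlap.values())
    return {rid: (n / total if total else 0.0) for rid, n in overlap.items()}

question = "What is characterized as title of sims 2 glamor life stuff pack?"
answers = {
    "r1": "r1 is characterized as title of adobe photoshop elements",
    "r2": "r2 is characterized as title of aspyr media inc sims 2 glamor life stuff pack",
}

# s corresponds to MB(q, c, {ID of e'}) with e' being the record whose ID is r2.
s = qa_similarity(toy_model, question, answers, "r2")
```

Because the scores are normalized over all answer sentences, the similarity of the pair depends on how well the other auxiliary records match the question, which is the role the auxiliary records play in this processing example.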

FIG. 12 is a conceptual diagram of similarity calculation processing using the question answering model. In the example of FIG. 12, the conversion unit 12B converts the record e and the auxiliary record R into a question sentence and a document, and the similarity calculation unit 13B calculates the similarity s by inputting the question sentence and the document into the model MB, which is a question answering model. As described above, in this processing example, the information processing apparatus 1B calculates the similarity of the records by converting the records into a question-and-answer format.

<Effects of Information Processing Device 1B>
As described above, the information processing apparatus 1B according to the present exemplary embodiment adopts a configuration in which the acquisition unit 11B further acquires auxiliary records, the conversion unit 12B converts the auxiliary records to generate converted auxiliary records, and the similarity calculation unit 13B calculates the similarity regarding the converted record pair by inputting the converted record pair and the converted auxiliary records to the model MB. Therefore, the information processing apparatus 1B according to the present exemplary embodiment provides the effect that similarities can be calculated for records having various attributes using the model MB without training the model MB.

Further, in the information processing device 1B according to the present exemplary embodiment, the model MB includes a question answering model that takes a question sentence and answer sentences as input, and the conversion unit 12B generates the converted record pair by converting one of the records included in the record pair into a question sentence and converting each of the other record included in the record pair and the auxiliary records into an answer sentence. Therefore, the information processing apparatus 1B according to this exemplary embodiment provides the effect that the similarity of records having various attributes can be calculated using the question answering model without training the question answering model.

[Exemplary embodiment 4]
A fourth exemplary embodiment of the invention will now be described in detail with reference to the drawings. Components having the same functions as those described in exemplary embodiments 1 to 3 are denoted by the same reference numerals, and description thereof will not be repeated.

<Configuration of information processing device 1C>
FIG. 13 is a block diagram showing the configuration of an information processing device 1C according to this exemplary embodiment. The control unit 10C of the information processing device 1C includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13C, a similarity integration unit 17C, an output unit 14C, an identity determination unit 15A, and an integration unit 16A. The storage unit 20C also stores the model MC in addition to the first data x, the second data x', and the similarity s.

(Similarity calculator 13C)
The similarity calculator 13C calculates a plurality of similarities si for one record pair (e, e'). As an example, the similarity calculator 13C calculates the first similarity s1 by inputting two records included in the record pair (e, e') to the model MC without interchanging them. Further, the similarity calculation unit 13C calculates the second similarity s2 by replacing the two records included in the record pair (e, e') with each other and inputting them to the model MC.

In processing examples 4 and 5 described in the second exemplary embodiment and processing example 6 described in the third exemplary embodiment, the similarity of the record pair (e, e') does not necessarily match the similarity of the record pair (e', e). Therefore, in this exemplary embodiment, the similarity calculation unit 13C calculates both the similarity of the record pair (e, e') and the similarity of the record pair (e', e), and identity is determined with reference to these similarities.
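The calculation in both input orders can be sketched as follows. The scoring function below is a hypothetical stand-in for the model MC, chosen only because its result depends on argument order; it is not the model used in the embodiment.

```python
from typing import Callable, Tuple


def toy_asymmetric_score(premise: str, hypothesis: str) -> float:
    """Hypothetical stand-in for model MC: the fraction of hypothesis tokens
    that also appear in the premise. Swapping the arguments can change it."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    hits = sum(1 for token in hypothesis_tokens if token in premise_tokens)
    return hits / len(hypothesis_tokens)


def bidirectional_similarities(
    e: str, e_prime: str, model: Callable[[str, str], float] = toy_asymmetric_score
) -> Tuple[float, float]:
    """Return (s1, s2): s1 for the order (e, e'), s2 for the swapped order (e', e)."""
    s1 = model(e, e_prime)
    s2 = model(e_prime, e)
    return s1, s2


s1, s2 = bidirectional_similarities("salted potato chips", "potato chips")
```

With this toy scorer the two directions give different values, which is the situation the fourth exemplary embodiment is designed to handle.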

However, the method by which the similarity calculation unit 13C calculates a plurality of similarities si is not limited to the example described above, and the similarity calculation unit 13C may calculate a plurality of similarities si by other methods. For example, the similarity calculation unit 13C may calculate a plurality of similarities si using a plurality of models. In this case, for example, the conversion unit 12 may perform a plurality of conversions on one record pair, and the similarity calculation unit 13C may calculate a plurality of similarities si by inputting the converted record pairs into the respective models (a document classification model, an image classification model, and so on).

Further, the similarity calculation unit 13C may calculate a plurality of similarities si by converting one record pair by a plurality of conversion methods to generate a plurality of converted record pairs and inputting the plurality of converted record pairs into one model.

(Similarity integration unit 17C)
The similarity integration unit 17C integrates a plurality of similarities si into an integrated similarity s. For example, the similarity integration unit 17C calculates the post-integration similarity s by averaging or weighting a plurality of similarities si. However, the method by which the similarity integration unit 17C integrates a plurality of similarities si is not limited to the example described above, and the similarity integration unit 17C may calculate the post-integration similarity s by another method. For example, the similarity integration unit 17C may set the sum or integrated value of a plurality of similarities si as the integrated similarity s.
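The integration methods named above can be sketched as follows. This is an illustration under assumptions: the function name and the `method` argument are introduced here, "weighting" is interpreted as a normalized weighted average, and only averaging, weighting, and summation are covered.

```python
from typing import Optional, Sequence


def integrate_similarities(
    similarities: Sequence[float],
    method: str = "mean",
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Integrate a plurality of similarities si into one similarity s."""
    if not similarities:
        raise ValueError("at least one similarity is required")
    if method == "mean":
        return sum(similarities) / len(similarities)
    if method == "weighted":
        if weights is None or len(weights) != len(similarities):
            raise ValueError("weights must match the number of similarities")
        # Normalizing by the total weight is an assumption made here.
        return sum(w * s for w, s in zip(weights, similarities)) / sum(weights)
    if method == "sum":
        return float(sum(similarities))
    raise ValueError(f"unknown method: {method}")


mean_s = integrate_similarities([9, 10])              # plain average
sum_s = integrate_similarities([9, 10], "sum")        # summation
weighted_s = integrate_similarities([9, 10], "weighted", [1, 3])
```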

In this specification, it can be said that the similarity integration unit 17C is configured to determine the identity of the target record pair based on a plurality of similarities si regarding the target record pair.

(Output section 14C)
The output unit 14C outputs an integrated similarity s obtained by integrating the plurality of similarities si. As an example, the output unit 14C outputs the similarity s by writing it into the storage unit 20C.

(Model MC)
The model MC is a model for calculating the degree of similarity. The model MC is, for example, a model that is asymmetric with respect to the mutual replacement of two elements that are input to the model. The model MC includes, as an example, at least one of an entailment recognition model, a paraphrase prediction model, and a question answer model.

FIG. 14 is a diagram showing a specific example of the similarity si calculated by the similarity calculation unit 13C. In FIG. 14, the first similarity s1 calculated by the similarity calculation unit 13C for the record pair (L1, R1) is "9", and the second similarity s2 calculated by the similarity calculation unit 13C for the record pair (R1, L1), obtained by exchanging the two records, is "10". Thus, the similarity calculation unit 13C calculates the first similarity s1 and the second similarity s2 for one record pair, and the identity determination unit 15A determines that the records are the same if both the first similarity s1 and the second similarity s2 are the highest compared to other record pairs. In the example of FIG. 14, the identity determination unit 15A determines that record L1 and record R1 are the same, and that record L2 and record R3 are the same.
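One way to read this determination rule is sketched below. The interpretation is an assumption: a pair is accepted when its s1 is the highest among pairs sharing the same left record and its s2 is the highest among pairs sharing the same right record. The numeric values other than the "9" and "10" for (L1, R1) are hypothetical and are not taken from FIG. 14.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def mutual_best_matches(s1: Dict[Pair, float], s2: Dict[Pair, float]) -> List[Pair]:
    """s1[(l, r)] is the similarity of (l, r); s2[(l, r)] is the similarity of
    the swapped pair (r, l). Both dictionaries are assumed to share the same keys."""
    matches = []
    for (left, right) in s1:
        best_for_left = max(v for (l, _), v in s1.items() if l == left)
        best_for_right = max(v for (_, r), v in s2.items() if r == right)
        if s1[(left, right)] == best_for_left and s2[(left, right)] == best_for_right:
            matches.append((left, right))
    return matches


# Hypothetical similarity values for illustration only.
s1 = {("L1", "R1"): 9, ("L1", "R2"): 7, ("L2", "R2"): 4, ("L2", "R3"): 8}
s2 = {("L1", "R1"): 10, ("L1", "R2"): 9, ("L2", "R2"): 9, ("L2", "R3"): 8}
matches = mutual_best_matches(s1, s2)
```

With these assumed values, the pairs (L1, R1) and (L2, R3) are selected, which matches the outcome described for FIG. 14.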

FIG. 15 is a diagram showing another example of the similarity si calculated by the similarity calculation unit 13C. In FIG. 15, the similarity integration unit 17C aggregates bidirectional similarities. For example, the similarity integration unit 17C sets the sum of the similarity s1 of the record pair (L1, R1) and the similarity s2 of the record pair (R1, L1) as the similarity s. In the example of FIG. 15, the similarity s of the record pair (L1, R1) is the sum of "10" and "9", that is, "19", and the similarity s of the record pair (L1, R2) is the sum of "9" and "7", that is, "16". The similarity s of the record pair (L2, R2) is the sum of "9" and "4", that is, "13", and the similarity s of the record pair (L2, R3) is the sum of "8" and "8", that is, "16".

In the example of FIG. 15, the identity determination unit 15A determines that record L1 and record R1 are the same, and that record L2 and record R3 are the same, as in the example of FIG. 14. In this example, the identity determination unit 15A further determines that record pairs having a similarity s equal to or higher than a predetermined threshold are also identical. Here, the threshold is, for example, the minimum value ("13" in the example of FIG. 15) of the similarities s of record pairs determined to be identical. The threshold may instead be determined based on the ratio of identical to non-identical record pairs, if that ratio is known. When the threshold is "13" in the example of FIG. 15, the identity determination unit 15A determines that the record pair (L1, R2) is also identical.

<Effect of information processing device 1C>
As described above, the information processing apparatus 1C according to this exemplary embodiment adopts a configuration in which the similarity calculation unit 13C calculates a plurality of similarities si regarding the record pair, and the output unit 14C outputs the post-integration similarity s obtained by integrating the plurality of similarities si. Therefore, the information processing apparatus 1C according to the present exemplary embodiment can calculate the similarity s of the record pair more accurately.

Further, in the information processing apparatus 1C according to this exemplary embodiment, the model MC is a model that is asymmetric with respect to the mutual replacement of the two elements input to the model. The similarity calculation unit 13C calculates the first similarity s1 by inputting the two records included in the record pair (e, e') into the model MC without interchanging them, and calculates the second similarity s2 by interchanging the two records included in the record pair (e, e') and then inputting them into the model MC. Therefore, the information processing apparatus 1C according to the present exemplary embodiment can calculate the similarity s of the records more accurately by integrating the first similarity s1 and the second similarity s2.

[Exemplary embodiment 5]
A fifth exemplary embodiment of the present invention will now be described in detail with reference to the drawings. Components having the same functions as those described in exemplary embodiments 1 to 3 are denoted by the same reference numerals, and description thereof will not be repeated.

<Configuration of information processing device 1D>
FIG. 16 is a block diagram showing the configuration of an information processing device 1D according to this exemplary embodiment. A control unit 10D of the information processing device 1D includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13, an output unit 14, an identity determination unit 15A, and a search result output unit 18D.

The acquisition unit 11 according to this exemplary embodiment acquires input data from the user as the first record e included in the record pair (e, e'). Input data from the user is, for example, input by an input device (for example, a keyboard, a mouse, etc.) connected to the input/output unit 40A.

Also, the acquiring unit 11 acquires one of the plurality of records included in the target data as the second record e' included in the record pair (e, e'). The target data is data to be searched, and includes, for example, one or more tables.

The identity determination unit 15A performs identity prediction for record pairs of the first record e and each of the plurality of records included in the target data.

Based on the degree of similarity s calculated by the degree of similarity calculation unit 13, the search result output unit 18D outputs the search results based on the input data and with the target data as the search target. As an example, the search result output unit 18D refers to the determination result of the identity determination unit 15A and outputs the search result based on the input data and the target data as the search target. For example, the search result output unit 18D outputs search results to an output device (display, printer, etc.) connected to the input/output unit 40A. Further, the search result output unit 18D may output the search result by transmitting the search result to another device connected via the communication unit 30A. Further, the search result output unit 18D may output search results by storing the search results in the storage unit 20A or an external storage device.

FIG. 17 is a diagram showing a specific example of screen display output by the search result output unit 18D. In the example of FIG. 17, the input data is a character string that the user inputs into the text box 51, and the target data are tables T1 and T2 having a plurality of records. The identity determination unit 15A determines the identity of record pairs between the first record e, which is the user's input data, and each of the records included in the table T1 and the record e' included in the table T2.

In the example of FIG. 17, the search result output unit 18D refers to the determination result of the identity determination unit 15A and outputs search results 53 and 54 based on the input data. A search result 53 is a search result obtained by searching the table T1 using the character string "potato chips" as input data. A search result 54 is a search result obtained by searching the table T2 using the character string "potato chips" as input data.
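The search flow of this embodiment can be sketched as follows. This is a rough illustration under assumptions: `difflib.SequenceMatcher` from the Python standard library stands in for the model, records are serialized by joining their values, and the threshold of 0.5 and the table contents are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for the model: character-level similarity ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def search(
    query: str,
    table: List[Record],
    threshold: float = 0.5,
    similarity: Callable[[str, str], float] = toy_similarity,
) -> List[Tuple[float, Record]]:
    """Pair the user input with every record in the table, score each pair,
    and return the records judged identical, best match first."""
    scored = []
    for record in table:
        document = " ".join(str(value) for value in record.values())
        s = similarity(query, document)
        if s >= threshold:  # simple threshold-based identity determination
            scored.append((s, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


table_t1 = [{"name": "potato chips"}, {"name": "green tea"}]
results = search("potato chips", table_t1)
```

Running the same function once per table corresponds to producing one search result per table, as with the search results 53 and 54.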

<Effects of information processing device 1D>
As described above, the information processing apparatus 1D according to the present exemplary embodiment adopts a configuration that refers to the determination result of the identity determination unit 15A and outputs search results that are based on the input data and have the target data as the search target. Therefore, in addition to the effects of the information processing apparatus 1 according to the first exemplary embodiment, the information processing apparatus 1D according to the present exemplary embodiment can perform a search of the target data based on the input data more suitably.

The information processing device 1D can also be described as follows.
An information processing device comprising:
acquisition means for acquiring, as a record pair, input data from a user and one of a plurality of records included in target data;
conversion means for converting the record pair to generate a converted record pair;
similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
output means for referring to the similarity calculated by the similarity calculation means and outputting search results that are based on the input data and have the target data as a search target.

[Modification]
<Modification 1>
In each of the exemplary embodiments described above, the information processing apparatuses 1, 1A, 1B, 1C, and 1D (hereinafter referred to as "information processing apparatuses 1, etc.") determine the identity of a record e included in the first data x and a record e' included in the second data x'. However, the plurality of records to be determined by the information processing apparatus 1 or the like may be records included in different data, or may be records included in common data. In other words, the information processing device 1 and the like may execute processing for searching for identical records within one database. Also, in the exemplary embodiments described above, the case where the first data x and the second data x' are integrated has been described, but the information processing apparatus 1 and the like may integrate three or more sets of data.

<Modification 2>
In each of the exemplary embodiments described above, the information processing device 1 or the like may select the models MA, MB, and MC (hereinafter referred to as "model M") from a plurality of model candidates, or the user may select the model M. The algorithm by which the information processing device 1 or the like selects the model M is not limited, but as an example, the information processing device 1 or the like may select the model M on a rule basis. For example, the information processing device 1 or the like may select the model M according to the characteristics of the record pair. Here, the characteristics of a record pair include, for example, the attributes of the records included in the record pair, the data size of the records, the type of database to which the records belong, and the attributes of the database.
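A rule-based selection of this kind might look like the sketch below. Every rule, threshold, and model label in it is hypothetical; the text only states that the selection may depend on characteristics such as record attributes and data size.

```python
from typing import Set


def select_model(attribute_types: Set[str], data_size_bytes: int) -> str:
    """Pick a model label from characteristics of the record pair.
    All rules and the 4096-byte threshold are illustrative assumptions."""
    if "image" in attribute_types:
        return "image_classification_model"
    if "audio" in attribute_types:
        return "audio_classification_model"
    if data_size_bytes > 4096:
        # Assumed rule: long records are handled by a question-answer model.
        return "question_answer_model"
    return "document_classification_model"


chosen = select_model({"text", "number"}, data_size_bytes=512)
```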

<Modification 3>
In each of the exemplary embodiments described above, the data containing records e, e' may be semi-structured data such as JSON or XML. By applying the information processing apparatus according to the exemplary embodiment to semi-structured data, it is possible to determine the identity of document data or web pages. For example, on a housing information site that provides housing information, there are cases where multiple web pages are created for the same property. In this case, the web pages can be grouped for each property by performing identity determination on the web pages.

In this example, the records are, by way of example, web pages contained in the target site. For example, if the record is
e = {id1: value1, id2: {id2-1: value2-1, id2-2: value2-1}, id3: value3},
the converted document is, for example,
"id1 is value1. id2-1 of id2 is value2-1. id2-2 of id2 is value2-1. id3 is value3."
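A conversion that reproduces this example document can be sketched as follows. The function name and the recursive traversal are assumptions; the text gives only the input record and the resulting document.

```python
from typing import Any, Dict, List


def record_to_document(record: Dict[str, Any]) -> str:
    """Flatten a nested, JSON-like record into 'key of parent is value.' sentences."""
    sentences: List[str] = []

    def walk(node: Dict[str, Any], ancestors: List[str]) -> None:
        # ancestors lists the enclosing keys, nearest parent first.
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value, [key, *ancestors])
            else:
                name = " of ".join([key, *ancestors])
                sentences.append(f"{name} is {value}.")

    walk(record, [])
    return " ".join(sentences)


e = {
    "id1": "value1",
    "id2": {"id2-1": "value2-1", "id2-2": "value2-1"},
    "id3": "value3",
}
document = record_to_document(e)
```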

<Modification 4>
Also, the record according to the present specification may be graph data as shown in FIG. 18, for example. FIG. 18 is a diagram showing an example of graph data. For example, face matching can be performed by applying the information processing apparatus 1 or the like according to the present specification to graph data. For example, if the record is the graph shown in FIG. 18, the document after conversion is, as an example,
“1 and 2 are linked. 1 and 4 are linked. 2 and 3 are linked. 2 and 4 are linked.”
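The graph-to-document conversion can be sketched as follows. The edge list is inferred from the converted document quoted above rather than from FIG. 18 itself, and the function name is an assumption.

```python
from typing import List, Tuple


def graph_to_document(edges: List[Tuple[int, int]]) -> str:
    """Describe each undirected edge as one sentence."""
    return " ".join(f"{u} and {v} are linked." for u, v in edges)


# Edges inferred from the example document in the text.
edges = [(1, 2), (1, 4), (2, 3), (2, 4)]
document = graph_to_document(edges)
```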

<Modification 5>
Data containing records may be a graph database as shown in FIG. 19, for example. By applying the information processing apparatus 1 or the like according to the present specification to a graph database, it is possible, for example, to determine the identity of different SNS (Social Networking Service) communities and to investigate criminal organizations. In this example, if the graph database is as shown in FIG. 19, the document after conversion is as follows:
“Taro of age 23 follows Sakura of age 26. Taro of age 23 follows Emi of age 25. Sakura of age 26 follows Emi of age 25. Sakura of age 26 wrote via smartphone tweet of text “I'm sleepy.” date 20XX/YY/ZZ. Emi of age 25 follows Sakura of age 26. Emi of age 25 follows Taro of age 23.”
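Part of this conversion can be sketched as follows. The sketch covers only the "follows" relationships between person nodes and omits the tweet node; the data layout and function names are assumptions, with node and edge contents inferred from the converted document in the text.

```python
from typing import Dict, List, Tuple

Node = Dict[str, int]


def describe(name: str, nodes: Dict[str, Node]) -> str:
    """Render a node as 'name of property value', e.g. 'Taro of age 23'."""
    properties = " ".join(f"{key} {value}" for key, value in nodes[name].items())
    return f"{name} of {properties}"


def graph_db_to_document(
    nodes: Dict[str, Node], edges: List[Tuple[str, str, str]]
) -> str:
    """Render each (source, relationship, target) edge as one sentence."""
    return " ".join(
        f"{describe(src, nodes)} {rel} {describe(dst, nodes)}."
        for src, rel, dst in edges
    )


nodes = {"Taro": {"age": 23}, "Sakura": {"age": 26}, "Emi": {"age": 25}}
edges = [
    ("Taro", "follows", "Sakura"),
    ("Taro", "follows", "Emi"),
    ("Sakura", "follows", "Emi"),
    ("Emi", "follows", "Sakura"),
    ("Emi", "follows", "Taro"),
]
document = graph_db_to_document(nodes, edges)
```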

<Modification 6>
In each of the exemplary embodiments described above, the information processing device 1 and the like may be configured to execute a learning phase for training the model M. The method of machine learning for the model M is not limited, but as an example, a decision tree-based method, linear regression, or a neural network method may be used, or two or more of these methods may be used in combination.

<Modification 7>
In each of the exemplary embodiments described above, a converter with learnable parameters may be added before or after the model M. FIG. 20 schematically shows a configuration in which learned converters 121 and 122 with learnable parameters are provided before and after the model M. The learned converters 121 and 122 are models that have learnable parameters, and a learning unit (not shown) uses training data to optimize the method of converting records (how sentences are composed, the number of auxiliary records, and the like) and/or the conversion parameters. By providing the learned converters 121 and 122, it is possible to calculate the similarity of records with higher accuracy.

The machine learning method of the learned converters 121 and 122 is not limited, but as an example, a decision tree-based method, linear regression, or a neural network method may be used, and two or more of these methods may be used in combination. Also, the learned converters 121 and 122 may be models generated by active learning.

[Example of realization by software]
Some or all of the functions of the information processing apparatuses 1, 1A, 1B, 1C, and 1D may be implemented by hardware such as integrated circuits (IC chips), or may be implemented by software.

In the latter case, the information processing apparatuses 1, 1A, 1B, 1C, and 1D are implemented, for example, by computers that execute the instructions of a program, which is software that implements each function. An example of such a computer (hereinafter referred to as computer C) is shown in FIG. Computer C comprises at least one processor C1 and at least one memory C2. A program P for operating the computer C as the information processing apparatuses 1, 1A, 1B, 1C, and 1D is recorded in the memory C2. In the computer C, the processor C1 reads the program P from the memory C2 and executes it, thereby implementing the functions of the information processing apparatuses 1, 1A, 1B, 1C, and 1D.

As the processor C1, for example, a CPU (Central Processing Unit), GPU (Graphic Processing Unit), DSP (Digital Signal Processor), MPU (Micro Processing Unit), FPU (Floating point number Processing Unit), PPU (Physics Processing Unit), a microcontroller, or a combination thereof can be used. As the memory C2, for example, a flash memory, HDD (Hard Disk Drive), SSD (Solid State Drive), or a combination thereof can be used.

Note that the computer C may further include a RAM (Random Access Memory) for expanding the program P during execution and temporarily storing various data. Computer C may further include a communication interface for sending and receiving data to and from other devices. Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.

In addition, the program P can be recorded on a non-transitory tangible recording medium M that is readable by the computer C. As such a recording medium M, for example, a tape, disk, card, semiconductor memory, programmable logic circuit, or the like can be used. The computer C can acquire the program P via such a recording medium M. Also, the program P can be transmitted via a transmission medium. As such a transmission medium, for example, a communication network or broadcast waves can be used. Computer C can also obtain program P via such a transmission medium.

[Appendix 1]
The present invention is not limited to the above-described embodiments, and various modifications are possible within the scope of the claims. For example, embodiments obtained by appropriately combining the technical means disclosed in the embodiments described above are also included in the technical scope of the present invention.

[Appendix 2]
Some or all of the above-described embodiments may also be described as follows. However, the present invention is not limited to the embodiments described below.
(Appendix 1)
An information processing device comprising:
acquisition means for acquiring a record pair;
conversion means for converting the record pair to generate a converted record pair;
similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
output means for outputting the similarity calculated by the similarity calculation means.

(Appendix 2)
The information processing device according to Appendix 1, wherein
the model is selected from a plurality of model candidates, and
the conversion means generates the converted record pair by converting the record pair into a format corresponding to the input of the model.

(Appendix 3)
The information processing device according to Appendix 1 or 2, wherein
the model includes a document classification model, and
the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into a document.

(Appendix 4)
The information processing device according to any one of Appendices 1 to 3, wherein
the model includes an image classification model, and
the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into an image.

(Appendix 5)
The information processing device according to any one of Appendices 1 to 4, wherein
the model includes an audio classification model, and
the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into speech.

(Appendix 6)
The information processing device according to any one of Appendices 1 to 5, wherein
the model includes a graph classification model, and
the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into a graph.

(Appendix 7)
The information processing device according to any one of Appendices 1 to 6, wherein
the acquisition means further acquires an auxiliary record,
the conversion means generates a converted auxiliary record by converting the auxiliary record, and
the similarity calculation means calculates the similarity regarding the converted record pair by inputting the converted record pair and the converted auxiliary record into the model.

(Appendix 8)
The information processing device according to Appendix 7, wherein
the model includes a question-answer model to which a question sentence and a response sentence are input, and
the conversion means generates the converted record pair by converting one record included in the record pair into a question sentence and converting each of the other record included in the record pair and the auxiliary record into a response sentence.

(Appendix 9)
The information processing device according to any one of Appendices 1 to 8, wherein
the similarity calculation means calculates a plurality of similarities regarding the record pair, and
the output means outputs an integrated similarity obtained by integrating the plurality of similarities.

(Appendix 10)
The information processing device according to Appendix 9, wherein
the model is a model that is asymmetric with respect to the mutual replacement of two elements input to the model, and
the similarity calculation means
calculates a first similarity by inputting the two records included in the record pair into the model without interchanging them, and
calculates a second similarity by interchanging the two records included in the record pair and then inputting them into the model.

(Appendix 11)
An information processing method comprising, by at least one processor:
acquiring a record pair;
generating a converted record pair by converting the record pair;
calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
outputting the calculated similarity.

(Appendix 12)
An information processing program that causes a computer to execute:
an acquisition process for acquiring a record pair;
a conversion process for generating a converted record pair by converting the record pair;
a similarity calculation process for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
an output process for outputting the similarity calculated in the similarity calculation process.

(Appendix 13)
An information processing device comprising:
acquisition means for acquiring, as a record pair, input data from a user and one of a plurality of records included in target data;
conversion means for converting the record pair to generate a converted record pair;
similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
output means for referring to the similarity calculated by the similarity calculation means and outputting search results that are based on the input data and have the target data as a search target.

[Appendix 3]
Some or all of the embodiments described above can also be expressed as follows.

An information processing apparatus comprising at least one processor, the processor executing: an acquisition process for acquiring a record pair; a conversion process for converting the record pair to generate a converted record pair; a similarity calculation process for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and an output process for outputting the similarity calculated in the similarity calculation process.

Note that this information processing apparatus may further include a memory, and this memory may store a program for causing the processor to execute the acquisition process, the conversion process, the similarity calculation process, and the output process. Also, this program may be recorded on a computer-readable non-transitory tangible recording medium.

1, 1A, 1B, 1C, 1D information processing apparatuses
11, 11B acquisition units
12, 12B conversion units
13, 13B, 13C similarity calculation units
14, 14C output units
16A integration unit

Claims

1. An information processing apparatus comprising:
acquisition means for acquiring a record pair;
conversion means for converting the record pair to generate a converted record pair;
similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
output means for outputting the similarity calculated by the similarity calculation means.

2. The information processing apparatus according to claim 1, wherein
the model is selected from a plurality of model candidates, and
the conversion means generates the converted record pair by converting the record pair into a format corresponding to the input of the model.

3. The information processing apparatus according to claim 1, wherein
the model includes a document classification model, and
the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into a document.

4. The information processing apparatus according to any one of claims 1 to 3, wherein
the model includes an image classification model, and
the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into an image.

5. The information processing apparatus according to any one of claims 1 to 4, wherein
the model includes an audio classification model, and
the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into speech.

6. The information processing apparatus according to any one of claims 1 to 5, wherein
the model includes a graph classification model, and
the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into a graph.

7. The information processing apparatus according to any one of claims 1 to 6, wherein
the acquisition means further acquires an auxiliary record,
the conversion means generates a converted auxiliary record by converting the auxiliary record, and
the similarity calculation means calculates the similarity regarding the converted record pair by inputting the converted record pair and the converted auxiliary record into the model.

8. The information processing apparatus according to claim 7, wherein
the model includes a question-answer model to which a question sentence and a response sentence are input, and
the conversion means generates the converted record pair by converting one record included in the record pair into a question sentence and converting each of the other record included in the record pair and the auxiliary record into a response sentence.

9. The information processing apparatus according to any one of claims 1 to 8, wherein
the similarity calculation means calculates a plurality of similarities regarding the record pair, and
the output means outputs an integrated similarity obtained by integrating the plurality of similarities.

10. The information processing apparatus according to claim 9, wherein
the model is a model that is asymmetric with respect to the mutual replacement of two elements input to the model, and
the similarity calculation means
calculates a first similarity by inputting the two records included in the record pair into the model without interchanging them, and
calculates a second similarity by interchanging the two records included in the record pair and then inputting them into the model.

11. An information processing method comprising, by at least one processor:
acquiring a record pair;
generating a converted record pair by converting the record pair;
calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
outputting the calculated similarity.

12. An information processing program that causes a computer to execute:
an acquisition process for acquiring a record pair;
a conversion process for generating a converted record pair by converting the record pair;
a similarity calculation process for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
an output process for outputting the similarity calculated in the similarity calculation process.