WO2023162206A1 - Information processing device, information processing method, and information processing program - Google Patents

Publication number
WO2023162206A1
Authority
WO
WIPO (PCT)
Prior art keywords
record
model
similarity
record pair
information processing
Prior art date
Application number
PCT/JP2022/008227
Other languages
French (fr)
Japanese (ja)
Inventor
勝悟 林
昌史 小山田
元紀 草野
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社
Priority to PCT/JP2022/008227
Publication of WO2023162206A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models

Definitions

  • the present invention relates to an information processing device, an information processing method, and an information processing program.
  • Patent Document 1 describes a device that calculates the similarity of record pairs using a plurality of similarity functions and learns the weights of the similarities by supervised machine learning using training data.
  • the training data is a data set with labels indicating combinations of records and whether they are identical.
  • Non-Patent Document 1 describes a technique called DITTO that performs entity matching (name identification) by supervised machine learning.
  • Non-Patent Document 2 describes a technique called ZeroER that matches records by unsupervised machine learning that does not use training data.
  • language models (e.g., Non-Patent Documents 3 to 5)
  • image classification models (e.g., Non-Patent Document 6)
  • Non-Patent Document 1: Yuliang Li et al., "Deep Entity Matching with Pre-Trained Language Models," PVLDB 2021
  • Non-Patent Document 2: Renzhi Wu et al., "ZeroER: Entity Resolution using Zero Labeled Examples," SIGMOD 2020
  • Non-Patent Document 3: Yinhan Liu et al., "RoBERTa: A Robustly Optimized BERT Pretraining Approach," arXiv 2019
  • Non-Patent Document 4: Sinong Wang et al., "Entailment as Few-Shot Learner," arXiv 2021
  • Non-Patent Document 5: Siddhant Garg et al., "TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection," AAAI 2020
  • Non-Patent Document 6: Kaiming He et al., "Deep Residual Learning for Image Recognition," CVPR 2016
  • Supervised machine learning requires a large amount of training data.
  • Heterogeneous data refers to a combination of records whose formats are not the same.
  • ZeroER, the unsupervised machine learning technique described in Non-Patent Document 2, has the problem that it cannot handle heterogeneous data.
  • One aspect of the present invention has been made in view of the above problems, and its object is to provide a technique that does not require training data regarding record pairs and can also handle heterogeneous data.
  • An information processing apparatus according to one aspect includes: acquisition means for acquiring a record pair; conversion means for generating a converted record pair by converting the record pair; similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and output means for outputting the similarity calculated by the similarity calculation means.
  • An information processing method according to one aspect is characterized in that at least one processor acquires a record pair, generates a transformed record pair by transforming the record pair, calculates a similarity for the transformed record pair by inputting the transformed record pair into a model, and outputs the calculated similarity.
  • An information processing program according to one aspect causes a computer to execute: an acquisition process for acquiring a record pair; a conversion process for generating a converted record pair by converting the record pair; a similarity calculation process for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and an output process for outputting the similarity calculated in the similarity calculation process.
  • FIG. 1 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 1.
  • FIG. 2 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 1.
  • FIG. 3 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 2.
  • FIG. 4 is a diagram showing a specific example of data including records according to exemplary embodiment 2.
  • FIG. 5 is a diagram showing an overview of the flow of processing performed by an information processing apparatus according to exemplary embodiment 2.
  • FIG. 6 is a diagram showing a specific example of identity determination results according to exemplary embodiment 2.
  • FIG. 7 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 2.
  • FIG. 8 is a diagram schematically illustrating entailment relationships of documents according to exemplary embodiment 2.
  • FIG. 9 is a diagram showing an example of an image converted by a conversion unit according to exemplary embodiment 2.
  • FIG. 10 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 3.
  • FIG. 11 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 3.
  • FIG. 12 is a conceptual diagram of similarity calculation processing using a question-answering model according to exemplary embodiment 3.
  • FIG. 13 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 4.
  • FIG. 14 is a diagram showing a specific example of similarity calculated by a similarity calculation unit according to exemplary embodiment 4.
  • FIG. 15 is a diagram showing a specific example of similarity calculated by a similarity calculation unit according to exemplary embodiment 4.
  • FIG. 16 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 5.
  • FIG. 17 is a diagram showing a specific example of screen display according to exemplary embodiment 5.
  • FIG. 18 is a diagram showing an example of graph data.
  • FIG. 19 is a diagram showing an example of a graph database.
  • FIG. 20 is a diagram schematically showing a configuration in which learned converters are provided before and after the output of a model.
  • FIG. 21 is a block diagram showing the configuration of a computer functioning as an information processing device according to each exemplary embodiment.
  • FIG. 1 is a block diagram showing the configuration of an information processing device 1.
  • the information processing device 1 is a device that calculates the degree of similarity between records.
  • a record is a unit of data for which similarity is calculated.
  • Examples of data containing records include structured data such as table data, semi-structured data described in a data description language such as JSON (JavaScript Object Notation: registered trademark) or XML (Extensible Markup Language), and unstructured data representing documents written in a natural language.
  • a record is, for example, a row of a table and contains a set of one or more attribute names and attribute values corresponding to the columns of the table. Also, the record may be graph data.
  • the information processing device 1 includes an acquisition unit 11 , a conversion unit 12 , a similarity calculation unit 13 and an output unit 14 .
  • a record pair is a set of records, such as a set of records included in a first table and records included in a second table.
  • the first table and the second table are, for example, tables that store customer information of businesses or tables that store product information.
  • the first table and the second table are not limited to the examples described above, and may be other tables. Also, the first table and the second table may be the same or different.
  • Multiple records included in a record pair may have different data formats. More specifically, for example, when a record is a row of a table, some or all of the attribute names included in the records may differ.
  • the acquiring unit 11 may acquire the record pair by reading the record pair from the storage device, or may acquire the record pair by receiving it from another device connected via the communication interface. Also, the acquisition unit 11 may acquire a record pair input from an input device via an input/output interface.
  • the conversion unit 12 converts the record pair to generate a converted record pair.
  • the conversion unit 12 converts records included in a record pair into data representing documents, images, sounds, or graphs. More specifically, the conversion unit 12 converts the record into an affirmative sentence or a question sentence, for example.
  • the method by which the conversion unit 12 converts the record pair is not limited to the example described above, and the conversion unit 12 may convert the record pair by another method.
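As a minimal sketch of the conversion into an affirmative sentence or a question sentence, a table row can be serialized from its attribute names and values. The sentence templates and the function names `record_to_statement` and `record_to_question` are assumptions made for illustration; the source does not specify the wording that the conversion unit 12 produces.

```python
def record_to_statement(record: dict) -> str:
    """Turn attribute name/value pairs into one affirmative sentence."""
    clauses = [f"the {name} is {value}" for name, value in record.items()]
    return "In this record, " + ", and ".join(clauses) + "."


def record_to_question(record: dict) -> str:
    """Turn the same pairs into a yes/no question (hypothetical template)."""
    clauses = [f"the {name} is {value}" for name, value in record.items()]
    return "Is this a record in which " + " and ".join(clauses) + "?"
```

Because the templates only rely on whatever attribute names a record happens to carry, two records with different schemas can be converted with the same function.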
  • the similarity calculation unit 13 calculates the similarity regarding the converted record pair by inputting the converted record pair into the model.
  • the model is a model for calculating the degree of similarity, and as an example, is a model that is open to the public and available to any user.
  • the model may be a model generated by machine learning or a rule-based model created by humans.
  • the model is, by way of example, a document classification model, an image classification model, an audio classification model, or a graph classification model.
  • a document classification model is a model for classifying document data.
  • An image classification model is a model for classifying image data.
  • a speech classification model is a model that classifies speech data.
  • a graph classification model is a model for classifying graph data.
  • Document classification models include, for example, a document embedding model, an entailment recognition model, a paraphrase prediction model, a question answering model, and a mask language model.
  • a document embedding model is a model that embeds documents or words in a vector space.
  • the entailment recognition model is a model that predicts entailment relationships of multiple documents.
  • a paraphrasing prediction model is a model that predicts whether two documents are paraphrasing expressions.
  • a question answering model is a model that extracts an answer to a given question from a document and outputs it.
  • a mask language model is a model for predicting words that fit a mask in a document.
  • An example of an image classification model is an image embedding model.
  • An image embedding model is a model that embeds image data in a vector space.
  • a speech classification model includes, for example, a speech embedding model.
  • a speech embedding model is a model that embeds speech data in a vector space.
  • Inputs for the above model include at least one of text data, image data, audio data, graphs, and vectors, for example.
  • the output of the model includes, by way of example, a vector or score indicating confidence.
  • the score is, for example, a score indicating the degree of certainty regarding the entailment relationship of the documents or a score indicating the degree of certainty as to whether they are paraphrasing expressions.
  • the inputs and outputs of the model are not limited to the examples described above, and may include other information.
  • the degree of similarity is information relating to the degree of similarity between records included in a record pair, and an example is the cosine similarity of vector pairs. Also, the similarity may be a value calculated from the score output by the model.
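The cosine similarity of a vector pair mentioned above can be sketched as follows. The function name and the choice to return 0.0 for a zero vector are assumptions for illustration; the source does not say how degenerate vectors are handled.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: treat a zero vector as "not similar"
    return dot / (norm_u * norm_v)
```

In this sketch, `u` and `v` would be the vectors that an embedding model outputs for the two converted records.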
  • the output unit 14 outputs the similarity calculated by the similarity calculation unit 13 .
  • the output unit 14 may output the degree of similarity by writing it in a storage device, or may output the degree of similarity by transmitting the degree of similarity to another device via a communication interface.
  • the output unit 14 may output the degree of similarity to an output device (not shown) connected via an input/output interface.
  • the output device is, for example, a display, printer, projector, or speaker.
  • the degree of similarity output by the output unit 14 is used, for example, for table integration processing or information search processing.
  • In table integration processing, linking records predicted to be identical based on the similarity calculated by the similarity calculation unit 13 allows a plurality of tables to be integrated and managed in a unified manner.
  • In information search processing, the similarity calculation unit 13 may calculate the similarity for each record pair consisting of a record serving as a search key (for example, a record specified by a user) and another record registered in a predetermined table.
  • the information processing apparatus 1 may output, as a search result, the records included in record pairs predicted to be identical based on the similarity calculated by the similarity calculation unit 13. As a result, search processing using the search key is possible even on a table that is not associated with the record serving as the search key.
  • As described above, the information processing device 1 according to this exemplary embodiment adopts a configuration comprising the acquisition unit 11 that acquires a record pair, the conversion unit 12 that converts the record pair to generate a converted record pair, the similarity calculation unit 13 that calculates a similarity regarding the converted record pair by inputting the converted record pair into a model, and the output unit 14 that outputs the similarity calculated by the similarity calculation unit 13.
  • Therefore, the information processing device 1 according to this exemplary embodiment can provide, as a technique for calculating the similarity between record pairs, a technique that does not require training data regarding record pairs and can handle heterogeneous data.
  • An information processing program according to this exemplary embodiment causes a computer to execute: an acquisition process for acquiring a record pair; a conversion process for generating a converted record pair by converting the record pair; a similarity calculation process for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and an output process for outputting the similarity calculated in the similarity calculation process.
  • FIG. 2 is a flow diagram showing the flow of the information processing method S1.
  • the execution subject of each step in the information processing method S1 may be a processor included in the information processing apparatus 1 or a processor included in another apparatus.
  • At step S11 at least one processor acquires a record pair.
  • At step S12 at least one processor generates transformed record pairs by transforming the record pairs.
  • At step S13 at least one processor calculates a similarity for the transformed record pair by inputting the transformed record pair into a model.
  • At step S14 at least one processor outputs the calculated similarity.
  • As described above, the information processing method S1 according to the present exemplary embodiment adopts a configuration in which at least one processor acquires a record pair, generates a transformed record pair by transforming the record pair, calculates a similarity regarding the transformed record pair by inputting the transformed record pair into a model, and outputs the calculated similarity. For this reason, the information processing method S1 according to the present exemplary embodiment can provide, as a technique for calculating the similarity between record pairs, a technique that does not require training data regarding record pairs and can handle heterogeneous data.
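Steps S11 to S14 can be sketched end to end as below. This is only an illustration under stated assumptions: `toy_model` is a bag-of-words stand-in for the model (the source envisions, for example, publicly available pre-trained models), and the names `convert`, `calculate_similarity`, and `method_s1` are hypothetical.

```python
import math
from collections import Counter


def convert(record):
    """S12: serialize a record into a lowercase sentence-like string."""
    return " ".join(f"{name} is {value}" for name, value in record.items()).lower()


def toy_model(text):
    """Stand-in for the model: a bag-of-words count vector (assumption)."""
    return Counter(text.split())


def calculate_similarity(converted_pair, model=toy_model):
    """S13: feed both converted records to the model and compare the vectors."""
    a, b = (model(text) for text in converted_pair)
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def method_s1(record_pair):
    """S11 to S14: take the acquired pair, convert it, score it, return the score."""
    converted_pair = tuple(convert(record) for record in record_pair)  # S12
    similarity = calculate_similarity(converted_pair)                  # S13
    return similarity                                                  # S14
```

Note that the two records passed to `method_s1` do not need to share attribute names, which is the heterogeneous case the text describes; no labeled record pairs are used anywhere in the sketch.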
  • FIG. 3 is a block diagram showing the configuration of the information processing device 1A according to this exemplary embodiment.
  • the information processing device 1A has a function of determining identity between records. Examples of data containing records are structured data such as table data, semi-structured data described in a data description language such as JSON or XML, or unstructured data representing a document written in a natural language.
  • FIG. 4 is a diagram showing a specific example of data containing records.
  • data D1 is a table.
  • a record is each row of the table.
  • data D2 is semi-structured data described in a data description language such as a markup language.
  • the record is a web page as an example.
  • Data D3 is unstructured data representing a document written in natural language.
  • the record is, for example, a file generated in a predetermined file format.
  • FIG. 5 is a diagram showing an overview of the flow of processing performed by the information processing apparatus 1A.
  • the processing performed by the information processing apparatus 1A is roughly divided into (i) record pair generation processing, (ii) similarity calculation processing, and (iii) identity determination processing.
  • In the record pair generation processing, the information processing device 1A generates record pairs from first data x including multiple records e and second data x' including multiple records e'. As an example, the information processing apparatus 1A generates all combinations of the record e included in the first data x and the record e' included in the second data x'. Further, in generating the record pairs, the information processing apparatus 1A may use a technique called blocking to narrow down, for each record e of the first data x, the records of the second data x' that are candidates for identity determination.
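Record pair generation with and without blocking can be sketched as follows. The blocking key is supplied by the caller; the source does not specify which blocking technique is used, so the key in the usage note is purely hypothetical.

```python
from itertools import product


def all_pairs(first, second):
    """Every combination of a record e from x and a record e' from x'."""
    return list(product(first, second))


def blocked_pairs(first, second, key):
    """Keep only pairs whose records fall into the same block under `key`."""
    blocks = {}
    for record in second:
        blocks.setdefault(key(record), []).append(record)
    return [(e, e2) for e in first for e2 in blocks.get(key(e), [])]
```

For example, `key=lambda r: next(iter(r.values()))[0].lower()` blocks on the first letter of the first attribute value, which works even when the two data sets use different attribute names. Blocking trades some recall for a much smaller number of pairs to score.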
  • the information processing device 1A calculates the similarity between the records included in the record pair.
  • the information processing apparatus 1A calculates the degree of similarity by inputting converted record pairs obtained by converting records into a model. The details of the similarity calculation process will be described later.
  • the information processing device 1A determines the identity of the records included in the record pair based on the calculated similarity. As an example, the information processing device 1A determines that the records are the same when the degree of similarity is equal to or greater than a threshold.
  • the method for determining identity is not limited to the above-described method, and information processing apparatus 1A may determine identity between records using other methods.
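The threshold-based determination can be sketched in a few lines. The default threshold of 0.8 and the function names are assumptions for illustration; the source gives no concrete value.

```python
def is_identical(similarity, threshold=0.8):
    """Records are judged identical when the similarity reaches the threshold."""
    return similarity >= threshold


def identical_pairs(scores, threshold=0.8):
    """scores maps a (left_id, right_id) pair to its similarity."""
    return [pair for pair, s in scores.items() if is_identical(s, threshold)]
```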
  • FIG. 6 is a diagram showing a specific example of identity determination results.
  • a table TBL1 is an example of the first data x and includes multiple rows and multiple columns.
  • the table TBL2 is an example of the second data x' and includes multiple rows and multiple columns.
  • a record is a row of the table.
  • Table TBL1 contains records l1, l2, l3 and l4, and table TBL2 contains records r1, r2, r3.
  • Through the processes (i) to (iii) above, the information processing device 1A determines that the record l1 and the record r2 are the same, that the record l2 and the record r3 are the same, and that the record l3 and the record r1 are the same.
  • the information processing apparatus 1A includes a control section 10A, a storage section 20A, a communication section 30A and an input/output section 40A.
  • the communication unit 30A communicates with an external device of the information processing device 1A via a communication line.
  • a communication line includes wireless LAN (Local Area Network), wired LAN, WAN (Wide Area Network), public line network, mobile data communication network, or a combination thereof.
  • the communication unit 30A transmits data supplied from the control unit 10A to other devices, and supplies data received from other devices to the control unit 10A.
  • (Input/output unit 40A) Input/output devices such as a keyboard, mouse, display, printer, and touch panel are connected to the input/output unit 40A.
  • the input/output unit 40A receives input of various kinds of information from the connected input device to the information processing apparatus 1A. Also, the input/output unit 40A outputs various kinds of information to the connected output device under the control of the control unit 10A.
  • an interface such as a USB (Universal Serial Bus) can be used as the input/output unit 40A.
  • control unit 10A includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13, an output unit 14, an identity determination unit 15A, and an integration unit 16A.
  • the acquisition unit 11 generates a record pair including the record e and the record e' from the first data x including the record e and the second data x' including the record e'.
  • the acquisition unit 11 does not have to perform the process of generating the record pair.
  • the acquisition unit 11 may acquire record pairs by reading them from the storage unit 20A or another external storage device, or may acquire record pairs received from another device via the communication unit 30A. Also, the acquisition unit 11 may acquire a record pair input from an input device connected to the input/output unit 40A.
  • the conversion unit 12 converts the record pair to generate a converted record pair.
  • the conversion unit 12 converts records included in a record pair into document data, image data, audio data, or graphs. The conversion processing executed by the conversion unit 12 will be described later.
  • the similarity calculation unit 13 calculates the similarity s regarding the converted record pair by inputting the converted record pair into the model MA.
  • the similarity s is information about the degree of similarity between records included in a record pair, and is, for example, a cosine similarity of a vector pair or a value calculated based on the score output by the model MA. The details of the process of calculating the similarity s by the similarity calculator 13 will be described later.
  • the output unit 14 outputs the similarity s calculated by the similarity calculation unit 13 .
  • the output unit 14 outputs the degree of similarity by writing it into the storage unit 20A.
  • the method by which the output unit 14 outputs the degree of similarity is not limited to the example described above, and the degree of similarity s may be output by another method.
  • the output unit 14 may transmit the degree of similarity to another device connected via the communication unit 30A, or may output the degree of similarity to an output device connected via the input/output unit 40A.
  • the identity determination unit 15A determines identity between records included in a record pair based on the degree of similarity s. As an example, the identity determination unit 15A determines that the records are the same when the similarity s is equal to or greater than the threshold. Also, the identity determination unit 15A may determine identity based on the ranking obtained when the record pairs are sorted in descending order of similarity, for example by determining that the x record pairs with the highest similarity are identical. Further, as an example, the identity determination unit 15A may determine identity by applying a matching algorithm such as an algorithm for the stable marriage problem.
  • the method of determining identity by the identity determination unit 15A is not limited to the above example, and the identity determination unit 15A may determine identity by other methods.
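The ranking-based and matching-based determinations can be sketched as follows. The stable marriage variant uses the Gale-Shapley procedure with preferences derived from the similarity scores; that derivation, the tie handling, and all names here are assumptions for illustration, not the method the source prescribes.

```python
def top_x_pairs(scores, x):
    """Return the x record pairs with the highest similarity."""
    return sorted(scores, key=scores.get, reverse=True)[:x]


def stable_matching(scores, left_ids, right_ids):
    """Gale-Shapley: left records propose in order of decreasing similarity."""
    prefs = {
        l: sorted(right_ids, key=lambda r: scores[(l, r)], reverse=True)
        for l in left_ids
    }
    next_choice = {l: 0 for l in left_ids}
    engaged = {}  # right record id -> left record id currently matched
    free = list(left_ids)
    while free:
        l = free.pop()
        if next_choice[l] >= len(prefs[l]):
            continue  # l has proposed to everyone and stays unmatched
        r = prefs[l][next_choice[l]]
        next_choice[l] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = l
        elif scores[(l, r)] > scores[(current, r)]:
            engaged[r] = l          # r prefers the new proposer
            free.append(current)
        else:
            free.append(l)          # r keeps its current partner
    return {l: r for r, l in engaged.items()}
```

Unlike a plain threshold, the matching enforces a one-to-one assignment, so a record in the smaller table is linked to at most one record in the other table. In practice the result could still be filtered by a threshold to drop weak matches.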
  • the identity determination unit 15A may perform identity prediction of record pairs by inputting record pairs and similarities into a prediction model generated by machine learning.
  • the input of the prediction model includes, for example, record pairs and similarities.
  • the output of the predictive model includes, as an example, a predictive result of identity.
  • the machine learning method of the prediction model is not limited; as an example, a decision-tree-based method, linear regression, or a neural network may be used, and two or more of these methods may be combined.
  • (Integration unit 16A) The integration unit 16A integrates the first data x and the second data x' based on the determination result of the identity determination unit 15A. For example, the integration unit 16A integrates the first data x and the second data x' by increasing the number of records and/or increasing the number of data attributes.
  • Data integration performed by the integration unit 16A includes, for example, (i) entity integration, (ii) data cleansing, and (iii) schema matching.
  • Entity integration refers to unifying the notation of different attributes and their values when the same set of records is given.
  • Data cleansing refers to the unification of differences in description formats such as company names, addresses, and area codes ("Co., Ltd.” and "Co., Ltd.”, etc.).
  • Schema matching means aligning (matching) a plurality of attributes with different notations.
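One way to picture integration that increases the number of records and the number of attributes is the sketch below. The merge policy (values from the first data take precedence, unmatched records of the second data are appended) and the function name `integrate` are assumptions for illustration only.

```python
def integrate(first, second, matches):
    """Merge two record lists given (i, j) index pairs judged identical."""
    partner = dict(matches)
    matched_second = {j for _, j in matches}
    merged = []
    for i, record in enumerate(first):
        row = dict(record)
        if i in partner:
            # Add attributes from the matched record without overwriting.
            for name, value in second[partner[i]].items():
                row.setdefault(name, value)
        merged.append(row)
    # Records of the second data with no match become additional records.
    merged.extend(dict(r) for j, r in enumerate(second) if j not in matched_second)
    return merged
```

This sketch does not perform schema matching or data cleansing; attributes with different names that mean the same thing are simply kept side by side.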
  • the storage unit 20A stores the first data x and the second data x′, and also stores the similarity s calculated by the similarity calculation unit 13 .
  • a model MA is stored in the storage unit 20A. Note that the expression that the model MA is stored in the storage unit 20A means that the parameters that define the model MA are stored in the storage unit 20A.
  • the model MA is a model for calculating the degree of similarity, and as an example, is a model that is open to the public and can be used by any user.
  • the model MA may be a model generated by machine learning or a rule-based model created by humans.
  • the model MA includes at least one of a document classification model, image classification model, speech classification model and graph classification model.
  • document classification models include document embedding models, entailment recognition models, paraphrase prediction models, and mask language models.
  • An example of an image classification model is an image embedding model that embeds image data in a vector space.
  • a speech classification model for example, there is a speech embedding model that embeds speech data in a vector space.
  • Inputs of the model MA include at least one of document data, image data, audio data, graphs, and vectors, for example.
  • the output of the model MA includes, as an example, vectors and/or scores.
  • a document embedding model is a model that embeds documents or words in a vector space.
  • the document embedding model is generated by RoBERTa described in Non-Patent Document 3 as an example.
  • the input of the document embedding model is, for example, a document or a word (for example, the sentence "An elderly man is walking in the park.”).
  • the output is a vector as an example.
  • the entailment recognition model is a model for predicting whether or not there is an entailment relation of "if it is document 1, then it is document 2".
  • the entailment recognition model is generated by the technique described in Non-Patent Document 4 as an example.
  • the input of the entailment recognition model is two documents as an example.
  • the two documents are, for example, document 1 "an old man is walking in the park" and document 2 "a man is in the park". In this case, document 1 entails document 2, because if document 1 holds, then document 2 also holds.
  • FIG. 8 is a diagram schematically showing entailment relationships of documents. In the example of FIG. 8, document 1 entails document 2.
  • the output of the entailment recognition model is an entailment score as an example.
  • the entailment score is a numerical value indicating the certainty of the entailment relation, and is a real number between 0 and 1, for example. As an example, the entailment score indicates that the higher the value, the higher the certainty of the entailment relation.
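The source states that the similarity may be calculated from the score output by the model but does not say how. One possible way, shown here purely as an assumption, is to score entailment in both directions and average the two scores so that the result does not depend on the order of the records. `toy_entailment_score` is a word-overlap stand-in for a real entailment recognition model.

```python
def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Stand-in scorer: share of hypothesis words that also occur in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    if not hypothesis_words:
        return 0.0
    return len(hypothesis_words & premise_words) / len(hypothesis_words)


def similarity_from_entailment(doc_a: str, doc_b: str, scorer=toy_entailment_score) -> float:
    """Average the scores of both directions so the result is symmetric (assumption)."""
    return (scorer(doc_a, doc_b) + scorer(doc_b, doc_a)) / 2
```

A real entailment recognition model would replace `toy_entailment_score` through the `scorer` parameter; the averaging step would stay the same.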
  • a paraphrasing prediction model is a model that predicts whether two documents are paraphrasing expressions.
  • a paraphrase prediction model is generated by RoBERTa described in Non-Patent Document 3 as an example.
  • the input of the paraphrasing prediction model is two documents as an example.
  • the two documents are, for example, document 1 stating "NEC is an IT company" and document 2 stating "NEC Corporation is in the IT business".
  • the output of the model includes paraphrase scores, as an example.
  • the paraphrasing score is a score indicating the degree of certainty that two documents are paraphrasing expressions, and is a real number between 0 and 1, for example. As an example, the paraphrase score indicates that the higher the value, the higher the confidence that the two documents are paraphrase expressions.
  • a mask language model is a model that predicts words that fit a mask in a document.
  • the mask language model is, for example, a model generated by RoBERTa described in Non-Patent Document 3.
  • the input of the mask language model is, for example, a document containing a mask (for example, the sentence "This pizza is very good. I like this pizza [mask].").
  • the output of the model includes words (eg, "like") and scores.
  • the score is a value indicating the degree of confidence that the word fits the mask, and is a real number between 0 and 1, for example.
  • An image classification model is a model that classifies images.
  • the image classification model is a model generated by the technique described in Non-Patent Document 6 as an example.
  • An example input for the image classification model is an image (eg, an image of a dog).
  • the intermediate output of the model is, for example, a vector representation of the image, and the output of the model is, for example, a label (for example, a label indicating "dog") and a score.
  • a score is a value indicating the certainty of a label, and is a real number between 0 and 1, for example.
  • the similarity calculator 13 calculates the similarity s using the vector representation of the images.
  • a speech classification model is a model that classifies speech.
  • the input of the speech classification model is, for example, speech data (eg, dog barking).
  • the intermediate output of the model is, as an example, a vector representation of the speech, and the output of the model is, as an example, a label (eg, a label indicating that the speech is a dog bark) and a score.
  • a score is a value indicating the certainty of a label, and is a real number between 0 and 1, for example.
  • the similarity calculator 13 calculates the similarity s using the vector representation of the speech, which is the intermediate output.
  • a graph classification model is a model for classifying graphs.
  • the input of the graph classification model is, for example, graph data (for example, a graph representing facial features).
  • the intermediate output of the model is, for example, a vector representation of the graph, and the output of the model is, for example, a label (for example, a label indicating a person) and a score.
  • a score is a value indicating the certainty of a label, and is a real number between 0 and 1, for example.
  • the similarity calculator 13 calculates the similarity s using the vector representation of the graph, which is the intermediate output.
  • FIG. 7 is a flowchart showing the flow of the information processing method S100A executed by the information processing apparatus 1A. Note that some of the steps included in the information processing method S100A may be executed in parallel or in a different order. Also, the description of the already described contents will not be repeated.
  • In step S101, the acquisition unit 11 reads the model MA.
  • model MA is selected from a plurality of model candidates.
  • the plurality of model candidates include, for example, at least one of a document embedding model, a document classification model such as an entailment recognition model, an image classification model, and an audio classification model.
  • the selection of the model MA may be performed based on a user's operation as an example, or may be performed according to a predetermined algorithm.
  • the model MA may be a single model or a set of multiple models.
  • In step S102, the acquisition unit 11 reads record pairs.
  • the acquisition unit 11 generates record pairs from the first data x and the second data x'. For example, the acquisition unit 11 generates all combinations of records e included in the first data x and records e' included in the second data x'. Further, in generating a record pair, the obtaining unit 11 may narrow down candidates for identity determination of the second data x′ for the record e of the first data x by a technique called blocking.
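  • As a non-limiting illustration of the record pair generation described above, the following minimal sketch enumerates all combinations of records and optionally narrows them down by blocking. The function names, the word-overlap blocking rule, and the second sample record are assumptions made for illustration only and are not taken from the embodiment.

```python
from itertools import product


def generate_record_pairs(first_data, second_data, block=None):
    """Return candidate record pairs (e, e') from two lists of records.

    Each record is a dict mapping attribute names to values. `block` is a
    hypothetical predicate; when given, only pairs it accepts are kept.
    """
    pairs = product(first_data, second_data)  # all combinations of e and e'
    if block is None:
        return list(pairs)
    return [(e, e_prime) for e, e_prime in pairs if block(e, e_prime)]


def shares_title_word(e, e_prime):
    """Illustrative blocking rule: keep pairs whose titles share a word."""
    words = set(str(e.get("title", "")).split())
    words_prime = set(str(e_prime.get("title", "")).split())
    return bool(words & words_prime)


x = [{"title": "sims 2 glamor life stuff pack",
      "manufacturer": "aspyr media", "price": 24.99}]
x_prime = [
    {"title": "aspyr media inc sims 2 glamor life stuff pack", "price": None},
    {"title": "potato chips", "price": None},  # illustrative extra record
]

all_pairs = generate_record_pairs(x, x_prime)
blocked_pairs = generate_record_pairs(x, x_prime, block=shares_title_word)
```

  • With blocking, only the pair whose titles share at least one word remains as a candidate for identity determination.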
  • In step S103, the conversion unit 12 generates a converted record pair by converting the record pair (e, e') into a format corresponding to the input of the model MA.
  • As an example, when the model MA includes a document classification model, the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair (e, e') into a document.
  • When the model MA includes an image classification model, the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair (e, e') into an image.
  • When the model MA includes a speech classification model, the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into speech.
  • When the model MA includes a graph classification model, the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into a graph.
  • In step S104, the similarity calculation unit 13 calculates the similarity s regarding the converted record pair by inputting the converted record pair into the model MA.
  • Processing examples 1 to 5 are described below as processing examples of steps S101 to S104.
  • Processing example 1 is a processing example in the case of using the document embedding model.
  • Processing example 2 is a processing example in the case of using an image classification model.
  • Processing example 3 is a processing example in the case of using a speech classification model.
  • Processing example 4 is a processing example when using an entailment recognition model.
  • Processing example 5 is a processing example in the case of using a paraphrase prediction model.
  • In step S101, the acquisition unit 11 reads the document embedding model.
  • In step S103, the conversion unit 12 converts the record pair (e, e') into the document pair (t, t').
  • As an example, the conversion unit 12 converts the record pair (e, e') containing the record e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99)) and the record e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)) into a document pair (t, t') containing document t and document t', where t' is "Title is aspyr media inc sims 2 glamor life stuff pack."
  • the similarity calculation unit 13 converts the document pair (t, t') into a vector pair (v, v') using the document embedding model.
  • the similarity calculator 13 calculates the similarity s from the vector pair (v, v').
  • Here, ^T is a symbol representing transposition.
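  • As a rough sketch of processing example 1, the code below converts each record into a document, embeds both documents, and compares the vectors. The sentence template for attributes other than the title, the word-count `toy_embed` function standing in for a document embedding model, and the use of cosine similarity are all assumptions made for illustration; the embodiment does not prescribe them here.

```python
import math
from collections import Counter


def record_to_document(record):
    """Render a record as one sentence per attribute, skipping missing values."""
    parts = []
    for attribute, value in record:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # a missing value contributes no sentence
        parts.append(f"{attribute.capitalize()} is {value}.")
    return " ".join(parts)


def tokenize(document):
    """Very simple tokenizer used only by this toy example."""
    return document.lower().replace(".", " ").split()


def toy_embed(document, vocabulary):
    """Hypothetical stand-in for a document embedding model (word counts)."""
    counts = Counter(tokenize(document))
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(v, v_prime):
    """Assumed similarity: v^T v' divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(v, v_prime))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in v_prime))
    return dot / norm if norm else 0.0


e = (("title", "sims 2 glamor life stuff pack"),
     ("manufacturer", "aspyr media"), ("price", 24.99))
e_prime = (("title", "aspyr media inc sims 2 glamor life stuff pack"),
           ("price", float("nan")))

t, t_prime = record_to_document(e), record_to_document(e_prime)
vocabulary = sorted(set(tokenize(t) + tokenize(t_prime)))
s = cosine_similarity(toy_embed(t, vocabulary), toy_embed(t_prime, vocabulary))
```

  • In practice the toy embedding would be replaced by a trained document embedding model; the conversion and comparison steps stay the same.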
  • In step S101, the acquisition unit 11 reads an image embedding model.
  • In step S103, the conversion unit 12 converts the record pair (e, e') into the image pair (i, i').
  • FIG. 9 is a diagram showing an example of an image converted by the converter 12.
  • As an example, the conversion unit 12 converts the record pair (e, e') containing the record e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99)) and the record e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)) into the images i and i' shown in the figure.
  • step S104 the similarity calculation unit 13 converts the image pair (i, i') into a vector pair (v, v') using the image embedding model.
  • the similarity calculator 13 calculates the similarity s from the vector pair (v, v').
  • the conversion unit 12 may convert one record into one image, or may perform image conversion for each element (eg, word) included in the record.
  • the similarity calculation unit 13 calculates the similarity s using a set of images for each element. Further, when image conversion is performed for each element, the conversion unit 12 may not perform image conversion for missing values in records.
  • In step S101, the acquisition unit 11 reads the speech embedding model. Also, in step S103, the conversion unit 12 converts the record pair (e, e') into the speech pair (i, i').
  • As an example, the conversion unit 12 converts the record pair (e, e') containing the record e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99)) and the record e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)) into voice data i representing the record e read aloud and voice data i' representing the record e' read aloud.
  • In step S104, the similarity calculation unit 13 converts the voice data pair (i, i') into a vector pair (v, v') using the speech embedding model.
  • the similarity calculator 13 calculates the similarity s from the vector pair (v, v').
  • In step S101, the acquisition unit 11 reads an entailment recognition model. Further, in step S103, the conversion unit 12 converts the record pair (e, e') into the document pair (t, t').
  • As an example, the conversion unit 12 converts the record pair (e, e') containing the record e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99)) and the record e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)) into a document pair (t, t') containing document t and document t', where t' is "Title is aspyr media inc sims 2 glamor life stuff pack."
  • The similarity calculation unit 13 calculates the entailment score of the document pair (t, t') using the entailment recognition model. Furthermore, the similarity calculation unit 13 calculates the similarity s using the entailment score.
  • As an example, the similarity s is the product of the entailment score M(t, t') for the entailment relation "if document t, then document t'" and the entailment score M(t', t) for the entailment relation "if document t', then document t".
  • The similarity s is not limited to the example described above, and may be another value.
  • The similarity s may be, for example, the maximum of the entailment scores M(t, t') and M(t', t), or the sum of the entailment scores M(t, t') and M(t', t).
  • the similarity calculator 13 uses this relationship to calculate the similarity.
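  • The three ways of combining the two directional entailment scores mentioned in processing example 4 (product, maximum, and sum) can be sketched as follows. The function names and the `model(premise, hypothesis)` calling convention are assumptions for illustration; `stub_model` is only a word-overlap placeholder and not an entailment recognition model.

```python
def combine_entailment_scores(forward, backward, method="product"):
    """Combine M(t, t') and M(t', t) into a single similarity s."""
    if method == "product":
        return forward * backward
    if method == "max":
        return max(forward, backward)
    if method == "sum":
        return forward + backward
    raise ValueError(f"unknown method: {method}")


def similarity_from_entailment(model, t, t_prime, method="product"):
    """Score both directions with `model` (a hypothetical callable that
    returns an entailment score in [0, 1]) and combine the two scores."""
    return combine_entailment_scores(model(t, t_prime), model(t_prime, t), method)


def stub_model(premise, hypothesis):
    """Placeholder scorer: fraction of hypothesis words found in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = set(hypothesis.lower().split())
    return len(premise_words & hypothesis_words) / len(hypothesis_words)


s_product = similarity_from_entailment(stub_model, "a b c d", "a b")
s_max = similarity_from_entailment(stub_model, "a b c d", "a b", method="max")
```

  • The product is high only when entailment holds in both directions, whereas the maximum is high as soon as one direction holds; which behavior is preferable depends on the data.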
  • In step S101, the acquisition unit 11 reads a paraphrase prediction model. Further, in step S103, the conversion unit 12 converts the record pair (e, e') into the document pair (t, t').
  • As an example, the conversion unit 12 converts the record pair (e, e') containing the record e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99)) and the record e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)) into a document pair (t, t') containing document t and document t', where t' is "aspyr media inc sims 2 glamor life stuff pack".
  • step S104 the similarity calculation unit 13 calculates the paraphrase score of the document pair (t, t') using the paraphrase prediction model, and sets the calculated paraphrase score as the similarity s of the record pair. That is, in this processing example, the similarity calculation unit 13 calculates the similarity s by putting the record pair into a format that asks whether it is a paraphrase expression.
  • (Steps S105 and S106) In step S105, the output unit 14 outputs the similarity s calculated by the similarity calculation unit 13. As an example, the output unit 14 outputs the similarity s by writing it into the storage unit 20A. In step S106, the identity determination unit 15A determines identity between the records included in the record pair based on the similarity s.
  • In step S106, the integration unit 16A refers to the determination result of the identity determination unit 15A and generates integrated data from the first data x and the second data x'.
  • the integrated data includes, for example, a record obtained by integrating records included in a record pair determined to be identical by the identity determination unit 15A.
  • the information processing device 1A converts the record pair into a format corresponding to the input of the model MA and inputs it to the model MA to calculate the similarity s for the record pair.
  • In the information processing device 1A according to this exemplary embodiment, a configuration is adopted in which the model MA is selected from a plurality of model candidates, and the conversion unit 12 generates a converted record pair by converting the record pair (e, e') into a format corresponding to the input of the model MA.
  • As a result, the record pair becomes data in a format that can be input to the model MA. That is, no matter what kind of attributes the records whose similarity is to be calculated contain, the similarity calculator 13 can calculate the similarity s by inputting the converted record pair into the model MA.
  • In the information processing apparatus 1A according to this exemplary embodiment, a configuration is adopted in which the model MA includes a document classification model, and the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into a document. Therefore, according to the information processing apparatus 1A according to this exemplary embodiment, an effect is obtained in that the similarity of records having various attributes can be calculated using the model MA, which is a document classification model, without training the model MA.
  • In the information processing apparatus 1A according to this exemplary embodiment, a configuration is adopted in which the model MA includes an image classification model, and the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into an image.
  • Therefore, according to the information processing apparatus 1A according to this exemplary embodiment, an effect is obtained in that a similarity reflecting the similarity of character shapes can be calculated.
  • In the information processing apparatus 1A according to this exemplary embodiment, a configuration is adopted in which the model MA includes a speech classification model, and the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into speech.
  • Because the information processing apparatus 1A converts the record pair into speech, the similarity between records that differ in characters but have similar phonemes can be calculated more suitably. For example, a record containing the word "glamour" and a record containing the word "glamar" are similar in pronunciation even though the character strings contained in the records differ, so a high similarity is calculated.
  • Therefore, according to the information processing apparatus 1A according to this exemplary embodiment, an effect is obtained in that a similarity reflecting the similarity between sounds can be calculated.
  • In the information processing apparatus 1A according to this exemplary embodiment, a configuration is adopted in which the model MA includes a graph classification model, and the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into a graph.
  • FIG. 10 is a block diagram showing the configuration of an information processing device 1B according to this exemplary embodiment.
  • the control unit 10B of the information processing device 1B includes an acquisition unit 11B, a conversion unit 12B, a similarity calculation unit 13B, an output unit 14, an identity determination unit 15A, and an integration unit 16A.
  • the storage unit 20B also stores the model MB in addition to the first data x, the second data x', and the similarity s.
  • the acquisition unit 11B further acquires an auxiliary record in addition to the record pair (e, e').
  • An auxiliary record is a record used as an aid in calculating the similarity of the record pair (e, e').
  • the auxiliary record is, for example, a record included in the first data x and other than the record e included in the record pair (e, e').
  • the auxiliary record is, for example, a record other than the record e' included in the record pair (e, e') which is included in the second data x'.
  • the conversion unit 12B converts the record pairs acquired by the acquisition unit 11B to generate converted record pairs. Also, the conversion unit 12B generates a converted auxiliary record by converting the auxiliary record. For example, the conversion unit 12B converts the auxiliary records into data representing documents, images, sounds, or graphs.
  • As an example, the conversion unit 12B generates the converted record pair by converting one record e included in the record pair (e, e') into a question sentence, and converting each of the other record e' included in the record pair (e, e') and the auxiliary records into a response sentence.
  • the similarity calculation unit 13B calculates the similarity regarding the converted record pair by inputting the converted record pair and the converted auxiliary record into the model MB.
  • the model MB includes, as an example, a question-answering model that inputs question sentences and answer sentences.
  • the question-answering model is a model that extracts and outputs answer sentences from documents given to question sentences.
  • the question-answer model is a model generated by a technique called TANDA described in Non-Patent Document 5 as an example.
  • Inputs of the question-answering model include, for example, a question sentence and a document.
  • the question sentence is, for example, "Where is NEC's headquarters?"
  • The document is, for example, "Nippon Denki (English: NEC Corporation) is an electronics manufacturer of the Sumitomo Group headquartered in Shiba 5-chome, Minato-ku, Tokyo. It is one of the constituent stocks of the Nikkei Stock Average."
  • the output of the model includes, as an example, answer sentences and scores.
  • An example of the reply sentence is "Shiba 5-chome, Minato-ku, Tokyo".
  • the score is, for example, a real number between 0 and 1.
  • a score for each word may be calculated when determining the output of the model.
  • the question answering model calculates a score of "0.1" for "NEC CORPORATION”, a score of "0.02" for "Sumitomo Group”, and a score of "0.08" for "Nikkei Stock Average”.
  • FIG. 11 is a flowchart showing the flow of information processing method S100B executed by information processing apparatus 1B. Note that some steps may be performed in parallel or out of order. Also, the description of the already described contents will not be repeated.
  • the information processing method S100B includes steps S101B, S102, S102B, S103B, S104B, S105, S106, and S107.
  • step S101B the acquisition unit 11B reads the model MB.
  • step S102B the acquisition unit 11B reads the auxiliary record.
  • step S103B the conversion unit 12B converts the record pair to generate a converted record pair, and converts the auxiliary record to generate a converted auxiliary record.
  • step S104B the similarity calculation unit 13B inputs the converted record pair and the converted auxiliary record to the model MB to calculate the similarity regarding the converted record pair.
  • steps S101 to S104B question answering model
  • the similarity calculation unit 13B reads a question answer model.
  • the auxiliary record R is, for example, a set of all records included in the second data x'.
  • the auxiliary record R is not limited to the example described above, and may be a set of other records.
  • the auxiliary records R may be a set of records selected by a randomized algorithm from the second data x'.
  • the auxiliary record R may be a blocked record set, such as a record set obtained by extracting records containing words common to the record e from the second data x'.
  • auxiliary record R includes record e' contained in record pair (e, e').
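  • The alternatives described above for choosing the auxiliary records R (all records of the second data x', a random selection, or a blocked subset sharing words with the record e) can be sketched as follows. The function name, the `strategy` parameter, the use of the title attribute for blocking, and the sample records are assumptions for illustration.

```python
import random


def select_auxiliary_records(e, e_prime, second_data, strategy="all", k=2, seed=0):
    """Choose the auxiliary record set R from the second data x'.

    strategy: "all"     -> every record in x'
              "random"  -> k records drawn at random (seeded for repeatability)
              "blocked" -> records whose title shares a word with record e
    The record e' of the pair is always added so that R contains it.
    """
    if strategy == "all":
        selected = list(second_data)
    elif strategy == "random":
        rng = random.Random(seed)
        selected = rng.sample(second_data, min(k, len(second_data)))
    elif strategy == "blocked":
        words = set(e["title"].split())
        selected = [r for r in second_data if words & set(r["title"].split())]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    if e_prime not in selected:
        selected.append(e_prime)  # R must include e'
    return selected


e = {"title": "sims 2 glamor life stuff pack"}
x_prime = [
    {"title": "aspyr media inc sims 2 glamor life stuff pack"},
    {"title": "potato chips"},        # illustrative record
    {"title": "glamor magazine"},     # illustrative record
]
R_blocked = select_auxiliary_records(e, x_prime[0], x_prime, strategy="blocked")
R_random = select_auxiliary_records(e, x_prime[0], x_prime, strategy="random", k=1)
```

  • A smaller R shortens the document given to the question answering model, at the cost of fewer competing candidate answers.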
  • the question sentence q is preferably of the so-called 5W1H open question type.
  • the conversion unit 12B converts the auxiliary record R into a document containing a plurality of reply sentences.
  • {ID of e_j} is the unique ID assigned to record e_j ∈ R.
  • the conversion unit 12B does not include missing values in the document during conversion.
  • r2 is characterized as title of aspyr media inc sims 2 glamor life stuff pack.
  • r3 is
  • step S104B the similarity calculation unit 13B inputs the question sentence q and the document c to the question answering model.
  • The question answering model outputs a score indicating the degree of certainty that the answer to the input question sentence q is the answer sentence T3(e_j) (1 ≤ j ≤ k) extracted from the document c.
  • the similarity calculator 13B calculates the similarity s based on the score output by the question answering model.
  • The similarity s is, for example, MB(q, c, {ID of e'}), that is, the confidence that the record e' included in the record pair (e, e') is the answer sentence.
  • the similarity s is not limited to this example, and the similarity calculation unit 13B may calculate the similarity s by another method.
  • the similarity calculation unit 13B may take the sum of the score when the record e is used as the question and the score when the record e' is used as the question as the degree of similarity.
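  • The question-and-answer formulation can be sketched as below: the record e becomes a question sentence q, the auxiliary records R become a document c with one answer sentence per record ID, and the score assigned to the ID of e' is taken as the similarity s. The answer sentence follows the "r2 is characterized as ..." pattern quoted above, while the question template, the " and " joiner, and `stub_qa_model` (a word-overlap placeholder returning one score per record ID) are assumptions for illustration and not an actual question answering model.

```python
def describe(record):
    """List the non-missing attributes of a record as 'attribute of value'."""
    return " and ".join(f"{attr} of {value}"
                        for attr, value in record.items() if value is not None)


def record_to_answer_sentence(record_id, record):
    """One answer sentence per auxiliary record; missing values are omitted."""
    return f"{record_id} is characterized as {describe(record)}."


def build_document(auxiliary_records):
    """Concatenate the answer sentences of all auxiliary records into document c."""
    return " ".join(record_to_answer_sentence(rid, rec)
                    for rid, rec in auxiliary_records.items())


def record_to_question(record):
    """Hypothetical open question built from record e."""
    return f"Which record is characterized as {describe(record)}?"


def stub_qa_model(question, document):
    """Placeholder for model MB: scores each record ID by word overlap."""
    q_words = set(question.lower().rstrip("?").split())
    overlaps = {}
    for sentence in document.split(". "):
        words = sentence.lower().rstrip(".").split()
        overlaps[words[0]] = len(q_words & set(words[1:]))
    total = sum(overlaps.values()) or 1
    return {rid: n / total for rid, n in overlaps.items()}


def similarity_from_qa(qa_model, e, auxiliary_records, e_prime_id):
    """Similarity s = score that the record with ID e_prime_id answers q."""
    scores = qa_model(record_to_question(e), build_document(auxiliary_records))
    return scores.get(e_prime_id, 0.0)


e = {"title": "sims 2 glamor life stuff pack",
     "manufacturer": "aspyr media", "price": 24.99}
R = {
    "r1": {"title": "potato chips", "price": 1.5},  # illustrative record
    "r2": {"title": "aspyr media inc sims 2 glamor life stuff pack", "price": None},
}
s = similarity_from_qa(stub_qa_model, e, R, "r2")
```

  • The bidirectional variant mentioned above would call `similarity_from_qa` a second time with the roles of the two data sets exchanged and add the two scores.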
  • FIG. 12 is a conceptual diagram of similarity calculation processing using the question answering model.
  • The conversion unit 12B converts the record e and the auxiliary record R into a question sentence and a document, and the similarity calculation unit 13B calculates the similarity s by inputting the question sentence and the document into the model MB, which is a question answering model.
  • the information processing apparatus 1B calculates the similarity of the records by converting the records into the question-and-answer format.
  • In the information processing apparatus 1B according to this exemplary embodiment, a configuration is adopted in which the acquisition unit 11B further acquires auxiliary records, the conversion unit 12B generates converted auxiliary records by converting the auxiliary records, and the similarity calculation unit 13B calculates the similarity regarding the converted record pair by inputting the converted record pair and the converted auxiliary records to the model MB. Therefore, according to the information processing apparatus 1B according to the present exemplary embodiment, an effect is obtained in that similarities can be calculated for records having various attributes using the model MB without training the model MB.
  • In the information processing apparatus 1B according to this exemplary embodiment, a configuration is adopted in which the model MB includes a question answering model that takes a question sentence and response sentences as input, and the conversion unit 12B generates the converted record pair by converting one of the records included in the record pair into a question sentence and converting each of the other record included in the record pair and the auxiliary records into a response sentence. Therefore, according to the information processing apparatus 1B according to this exemplary embodiment, an effect is obtained in that the similarity of records having various attributes can be calculated using the question answering model without training the question answering model.
  • FIG. 13 is a block diagram showing the configuration of an information processing device 1C according to this exemplary embodiment.
  • the control unit 10C of the information processing device 1C includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13C, a similarity integration unit 17C, an output unit 14C, an identity determination unit 15A, and an integration unit 16A.
  • the storage unit 20C also stores the model MC in addition to the first data x, the second data x', and the similarity s.
  • the similarity calculator 13C calculates a plurality of similarities si for one record pair (e, e'). As an example, the similarity calculator 13C calculates the first similarity s1 by inputting two records included in the record pair (e, e') to the model MC without interchanging them. Further, the similarity calculation unit 13C calculates the second similarity s2 by replacing the two records included in the record pair (e, e') with each other and inputting them to the model MC.
  • When the model MC is asymmetric, the similarity of the record pair (e, e') may differ from the similarity of the record pair (e', e). Therefore, in this exemplary embodiment, the similarity calculation unit 13C calculates both the similarity of the record pair (e, e') and the similarity of the record pair (e', e), and identity is determined with reference to these similarities.
  • the method by which the similarity calculation unit 13C calculates a plurality of similarities si is not limited to the example described above, and the similarity calculation unit 13C may calculate a plurality of similarities si by other methods.
  • the similarity calculator 13C may calculate a plurality of similarities si using a plurality of models.
  • As an example, the conversion unit 12 may perform a plurality of conversions on one record pair, and the similarity calculation unit 13C may calculate a plurality of similarities si by inputting the converted record pairs into the respective models (document classification model, image classification model, ...).
  • Alternatively, a plurality of converted record pairs may be generated by converting one record pair by a plurality of conversion methods, and the similarity calculation unit 13C may calculate a plurality of similarities si by inputting the plurality of converted record pairs into one model.
  • the similarity integration unit 17C integrates a plurality of similarities si into an integrated similarity s.
  • As an example, the similarity integration unit 17C calculates the post-integration similarity s as the average or the weighted average of the plurality of similarities si.
  • the method by which the similarity integration unit 17C integrates a plurality of similarities si is not limited to the example described above, and the similarity integration unit 17C may calculate the post-integration similarity s by another method.
  • the similarity integration unit 17C may set the sum or integrated value of a plurality of similarities si as the integrated similarity s.
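  • A minimal sketch of integrating a plurality of similarities si into one similarity s is given below, covering the average, a weighted average, and the sum. The function name, the `method` parameter, and the normalization of the weighted variant by the total weight are assumptions for illustration.

```python
def integrate_similarities(similarities, method="mean", weights=None):
    """Integrate several similarities s_i into one post-integration similarity s."""
    if not similarities:
        raise ValueError("at least one similarity is required")
    if method == "mean":
        return sum(similarities) / len(similarities)
    if method == "weighted":
        if weights is None or len(weights) != len(similarities):
            raise ValueError("one weight per similarity is required")
        # Assumed form: weighted average normalized by the total weight.
        return sum(w * s for w, s in zip(weights, similarities)) / sum(weights)
    if method == "sum":
        return sum(similarities)
    raise ValueError(f"unknown method: {method}")


# Example with a first similarity s1 = 9 and a second similarity s2 = 10.
s_mean = integrate_similarities([9, 10])
s_sum = integrate_similarities([9, 10], method="sum")
s_weighted = integrate_similarities([9, 10], method="weighted", weights=[1, 3])
```

  • Weights could, for instance, reflect how much each model or conversion method is trusted; the embodiment does not fix how they are chosen.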
  • the similarity integration unit 17C is configured to determine the identity of the target record pair based on a plurality of similarities si regarding the target record pair.
  • the output unit 14C outputs an integrated similarity s obtained by integrating the plurality of similarities si. As an example, the output unit 14C outputs the similarity s by writing it into the storage unit 20C.
  • the model MC is a model for calculating the degree of similarity.
  • the model MC is, for example, a model that is asymmetric with respect to the mutual replacement of two elements that are input to the model.
  • the model MC includes, as an example, at least one of an entailment recognition model, a paraphrase prediction model, and a question answer model.
  • FIG. 14 is a diagram showing a specific example of the similarity si calculated by the similarity calculation unit 13C.
  • In this example, the first similarity s1 calculated by the similarity calculation unit 13C for the record pair (L1, R1) is "9", and the second similarity s2 calculated by the similarity calculation unit 13C for the record pair (R1, L1) obtained by exchanging the two records is "10".
  • When the similarity calculation unit 13C calculates the first similarity s1 and the second similarity s2 for one record pair, the identity determination unit 15A determines, as an example, that the records are the same if both the first similarity s1 and the second similarity s2 are the highest compared to other record pairs.
  • the identity determination unit 15A determines that record L1 and record R1 are the same, and that record L2 and record R3 are the same.
  • FIG. 15 is a diagram showing another example of the similarity si calculated by the similarity calculation unit 13C.
  • the similarity integration unit 17C aggregates bidirectional similarities. For example, the similarity integration unit 17C sets the sum of the similarity s1 of the record pair (L1, R1) and the similarity s2 of the record pair (R1, L1) as the similarity s.
  • In this example, the similarity s of the record pair (L1, R1) is the sum of "10" and "9", that is, "19", and the similarity s of the record pair (L1, R2) is the sum of "9" and "7", that is, "16".
  • Likewise, the similarity s of the record pair (L2, R2) is the sum of "9" and "4", that is, "13", and the similarity s of the record pair (L2, R3) is the sum of "8" and "8", that is, "16".
  • the identity determination unit 15A determines that record L1 and record R1 are the same, and that record L2 and record R3 are the same, as in the example of FIG. In this example, the identity determination unit 15A further determines that record pairs having a similarity s equal to or higher than a predetermined threshold among the record pairs determined to be identical are also identical.
  • The threshold is, for example, the minimum value ("13" in the example of FIG. 15) of the similarities s of the record pairs determined to be identical. The threshold may also be determined based on the ratio of identical to non-identical record pairs, if known. When the threshold is "13" in the example of FIG. 15, the identity determination unit 15A determines that the record pair (L1, R2) is also the same.
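  • The determination flow described above (sum the two directional similarities, match records that are each other's best candidate, then also accept pairs at or above a threshold) can be sketched as follows. Only the four record pairs whose sums are quoted in the text are used, so the values computed here need not coincide with those in the figure. The function names, the mutual-best-match reading of the rule, and the default threshold (minimum over matched pairs) are assumptions for illustration.

```python
def integrate(directional):
    """Sum the two directional similarities of every record pair."""
    return {pair: s1 + s2 for pair, (s1, s2) in directional.items()}


def mutual_best_matches(similarity):
    """Pairs (left, right) where each record is the other's best candidate."""
    best_for_left, best_for_right = {}, {}
    for (left, right), s in similarity.items():
        if left not in best_for_left or s > similarity[(left, best_for_left[left])]:
            best_for_left[left] = right
        if right not in best_for_right or s > similarity[(best_for_right[right], right)]:
            best_for_right[right] = left
    return {(l, r) for l, r in best_for_left.items() if best_for_right[r] == l}


def extend_by_threshold(similarity, matches, threshold=None):
    """Also accept every pair whose similarity reaches the threshold.

    When no threshold is given, the minimum similarity among the already
    matched pairs is used (an assumed default).
    """
    if threshold is None:
        threshold = min(similarity[pair] for pair in matches)
    return {pair for pair, s in similarity.items() if s >= threshold}


# Directional similarities quoted in the text for four record pairs.
directional = {
    ("L1", "R1"): (10, 9),
    ("L1", "R2"): (9, 7),
    ("L2", "R2"): (9, 4),
    ("L2", "R3"): (8, 8),
}
similarity = integrate(directional)
matches = mutual_best_matches(similarity)
identical = extend_by_threshold(similarity, matches)
```

  • Passing an explicit `threshold` instead of relying on the default allows the cut-off to be set from prior knowledge, such as a known ratio of identical to non-identical pairs.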
  • In the information processing apparatus 1C according to this exemplary embodiment, a configuration is adopted in which the similarity calculation unit 13C calculates a plurality of similarities si for the record pair, and the output unit 14C outputs the post-integration similarity s obtained by integrating the plurality of similarities si. Therefore, according to the information processing apparatus 1C according to the present exemplary embodiment, an effect is obtained in that the similarity s of the record pair can be calculated more accurately.
  • In the information processing apparatus 1C according to this exemplary embodiment, a configuration is adopted in which the model MC is a model having asymmetry with respect to the mutual replacement of the two elements input to the model, and the similarity calculation unit 13C calculates the first similarity s1 by inputting the two records included in the record pair (e, e') to the model MC without replacing them with each other, and calculates the second similarity s2 by replacing the two records included in the record pair (e, e') with each other and then inputting them to the model MC. Therefore, according to the information processing apparatus 1C according to the present exemplary embodiment, an effect is obtained in that the similarity s of the records can be calculated more accurately by integrating the first similarity s1 and the second similarity s2.
  • FIG. 16 is a block diagram showing the configuration of an information processing device 1D according to this exemplary embodiment.
  • a control unit 10D of the information processing device 1D includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13, an output unit 14, an identity determination unit 15A, and a search result output unit 18D.
  • the acquisition unit 11 acquires input data from the user as the first record e included in the record pair (e, e').
  • Input data from the user is, for example, input by an input device (for example, a keyboard, a mouse, etc.) connected to the input/output unit 40A.
  • the acquiring unit 11 acquires one of the plurality of records included in the target data as the second record e' included in the record pair (e, e').
  • the target data is data to be searched, and includes, for example, one or more tables.
  • the identity determination unit 15A performs identity prediction for record pairs of the first record e and each of the plurality of records included in the target data.
  • the search result output unit 18D Based on the degree of similarity s calculated by the degree of similarity calculation unit 13, the search result output unit 18D outputs the search results based on the input data and with the target data as the search target.
  • the search result output unit 18D refers to the determination result of the identity determination unit 15A and outputs the search result based on the input data and the target data as the search target.
  • the search result output unit 18D outputs search results to an output device (display, printer, etc.) connected to the input/output unit 40A.
  • the search result output unit 18D may output the search result by transmitting the search result to another device connected via the communication unit 30A.
  • the search result output unit 18D may output search results by storing the search results in the storage unit 20A or an external storage device.
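  • The search use case can be sketched as follows: the user's input is treated as the first record e, every record of the target data is treated as a candidate e', and the records judged identical are returned in descending order of similarity. The function names, the threshold-based identity determination, the Jaccard word-overlap scorer standing in for the model-based similarity, and the sample table contents are assumptions for illustration.

```python
def jaccard_similarity(input_text, record):
    """Placeholder similarity: word overlap between the input and the record."""
    input_words = set(input_text.lower().split())
    record_words = set(" ".join(str(v) for v in record.values()).lower().split())
    union = input_words | record_words
    return len(input_words & record_words) / len(union) if union else 0.0


def search(input_text, target_records, similarity_fn, threshold=0.5):
    """Return the target records judged identical to the input, best first.

    `similarity_fn` and `threshold` are hypothetical; identity is decided
    here simply by comparing the similarity with the threshold.
    """
    hits = []
    for record in target_records:
        s = similarity_fn(input_text, record)
        if s >= threshold:
            hits.append((s, record))
    hits.sort(key=lambda item: item[0], reverse=True)
    return [record for _, record in hits]


# Illustrative target table; the contents are made up for this sketch.
table_t1 = [
    {"name": "orange juice"},
    {"name": "potato chips salted"},
    {"name": "potato chips"},
]
results = search("potato chips", table_t1, jaccard_similarity)
```

  • The returned list could then be shown on a display, sent to another device, or written to storage, as described for the search result output unit 18D.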
  • FIG. 17 is a diagram showing a specific example of screen display output by the search result output unit 18D.
  • In this example, the input data is a character string that the user inputs into the text box 51, and the target data are tables T1 and T2, each having a plurality of records.
  • the identity determination unit 15A determines the identity of record pairs between the first record e, which is the user's input data, and each of the records included in the table T1 and the record e' included in the table T2.
  • the search result output unit 18D refers to the determination result of the identity determination unit 15A and outputs search results 53 and 54 based on the input data.
  • a search result 53 is a search result obtained by searching the table T1 using the character string "potato chips" as input data.
  • a search result 54 is a search result obtained by searching the table T2 using the character string "potato chips" as input data.
  • In the information processing apparatus 1D according to this exemplary embodiment, a configuration is adopted in which the determination result of the identity determination unit 15A is referred to and search results based on the input data, with the target data as the search target, are output. Therefore, according to the information processing apparatus 1D according to the present exemplary embodiment, in addition to the effects of the information processing apparatus 1 according to the first exemplary embodiment, an effect is obtained in that a search of the target data based on the input data can be performed more suitably.
  • the information processing device 1D can also be described as follows. An information processing device comprising: acquisition means for acquiring, as a record pair, input data from a user and one of a plurality of records included in target data; conversion means for generating a converted record pair by converting the record pair; similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and output means for referring to the similarity calculated by the similarity calculation means and outputting a search result based on the input data, with the target data as the search target.
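The search flow described for the information processing device 1D can be sketched roughly as follows. This is only an illustration under stated assumptions: `token_overlap` is a simple token-based stand-in for the model-based similarity, and the threshold, the table contents, and all function names are hypothetical rather than taken from the specification.

```python
def record_to_text(record):
    """Convert a table row (dict) into a plain text string of its values."""
    return " ".join(str(value) for value in record.values())


def token_overlap(a, b):
    """Jaccard overlap of lowercase tokens; a stand-in for the model (assumption)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def search(input_data, target_table, similarity=token_overlap, threshold=0.5):
    """Pair the input data with each record and keep records judged similar."""
    scored = []
    for record in target_table:
        score = similarity(input_data, record_to_text(record))
        if score >= threshold:  # hypothetical identity criterion
            scored.append((score, record))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [record for _, record in scored]


# Hypothetical contents for table T1.
table_t1 = [
    {"product": "Potato Chips", "flavor": "salt"},
    {"product": "Chocolate Bar", "flavor": "milk"},
]
results = search("potato chips", table_t1)
print(results)
```

In practice the stand-in similarity would be replaced by the similarity calculated from the model output, and the threshold would play the role of the identity determination.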
  • the information processing apparatuses 1, 1A, 1B, 1C, and 1D (hereinafter referred to as "information processing apparatuses 1, etc.")
  • the identity with the contained record e' was determined.
  • a plurality of records to be determined by the information processing apparatus 1 or the like may be records included in different data, or may be records included in common data.
  • the information processing device 1 and the like may execute processing for searching for the same record from one database.
  • the information processing apparatus 1 and the like may integrate three or more data.
  • the information processing device 1 or the like may select the models MA, MB, and MC (hereinafter referred to as "model M") from a plurality of model candidates, or the user may select the model M.
  • the algorithm by which the information processing device 1 or the like selects the model M is not limited, but as an example, the information processing device 1 or the like may select the model M on a rule basis.
  • the information processing device 1 or the like may select the model M according to the characteristics of the record pair.
  • the characteristics of a record pair include, for example, the attribute of the record included in the record pair, the data size of the record, the type of database to which the record belongs, and the attribute of the database.
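A rule-based selection of the model M according to record-pair characteristics might look like the sketch below. The specification does not fix the rules, so the feature names, the size limit, and the returned model labels are all hypothetical assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecordPairFeatures:
    data_kind: str   # hypothetical label such as "table", "json", or "graph"
    size_bytes: int  # combined data size of the two records


def select_model(features, size_limit=10_000):
    """Pick a model name from simple hand-written rules (illustrative only)."""
    if features.data_kind == "graph":
        return "graph_classification_model"
    if features.size_bytes > size_limit:
        # Assumption: large records are handled by an embedding model.
        return "document_embedding_model"
    return "entailment_recognition_model"


print(select_model(RecordPairFeatures(data_kind="table", size_bytes=512)))
```

Other characteristics named in the text, such as the type or attributes of the database, could be added as further fields and rules in the same way.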
  • the data containing records e, e' may be semi-structured data such as JSON or XML.
  • the records are, by way of example, web pages contained in the target site.
  • record e {id1: value1, id2: {id2-1: value2-1, id2-2: value2-1}, id3: value3}
  • the converted document is, for example, "id1 is value1. id2-1 of id2 is value2-1. id2-2 of id2 is value2-1. id3 is value3."
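The conversion of a semi-structured record into a document can be sketched as below. This is a minimal illustration assuming the record is held as a nested Python dict; the function name is hypothetical, and for nesting deeper than one level it only mentions the immediate parent key, which is an assumption beyond what the text shows.

```python
def record_to_sentences(record, parent=None):
    """Turn a (possibly nested) dict into sentences like 'id2-1 of id2 is value2-1.'"""
    sentences = []
    for key, value in record.items():
        if isinstance(value, dict):
            # Nested object: describe each child relative to this key.
            sentences.extend(record_to_sentences(value, parent=key))
        elif parent is None:
            sentences.append(f"{key} is {value}.")
        else:
            sentences.append(f"{key} of {parent} is {value}.")
    return sentences


record_e = {
    "id1": "value1",
    "id2": {"id2-1": "value2-1", "id2-2": "value2-1"},
    "id3": "value3",
}
document = " ".join(record_to_sentences(record_e))
print(document)
```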
  • the record according to the present specification may be graph data as shown in FIG. 18, for example.
  • FIG. 18 is a diagram showing an example of graph data.
  • face matching can be performed by applying the information processing apparatus 1 or the like according to the present specification to graph data.
  • the document after conversion is, as an example, "1 and 2 are linked. 1 and 4 are linked. 2 and 3 are linked. 2 and 4 are linked."
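For graph data, the same idea can be sketched by emitting one sentence per edge. The edge list below is inferred from the converted document quoted above, and the function name is a hypothetical choice.

```python
def graph_to_document(edges):
    """Describe each undirected edge (a, b) as 'a and b are linked.'"""
    return " ".join(f"{a} and {b} are linked." for a, b in edges)


# Edges inferred from the example document in the text.
edges = [(1, 2), (1, 4), (2, 3), (2, 4)]
graph_document = graph_to_document(edges)
print(graph_document)
```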
  • Data containing records may be a graph database as shown in FIG. 19, for example.
  • with the information processing apparatus 1 or the like according to the present specification, it is possible, for example, to determine the identity of different SNS (Social Networking Service) communities and to investigate criminal organizations.
  • when the graph database is as shown in FIG. 19, the document after conversion is, as an example, "Taro of age 23 follows Sakura of age 26. Taro of age 23 follows Emi of age 25. Sakura of age 26 follows Emi of age 25. Sakura of age 26 wrote via smartphone tweet of text "I'm sleepy." date 20XX/YY/ZZ. Emi of age 25 follows Sakura of age 26. Emi of age 25 follows Taro of age 23."
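A property-graph conversion of this kind can be sketched as follows. The sketch covers only the "follows" relationships from the example; the tweet node and its properties are omitted, and the data layout (a dict of ages plus a list of directed edges) is an assumption made for illustration.

```python
def describe(person, ages):
    """Render a node together with its age property."""
    return f"{person} of age {ages[person]}"


def follows_to_document(ages, follows):
    """Emit one sentence per directed 'follows' edge."""
    return " ".join(
        f"{describe(src, ages)} follows {describe(dst, ages)}."
        for src, dst in follows
    )


ages = {"Taro": 23, "Sakura": 26, "Emi": 25}
follows = [
    ("Taro", "Sakura"),
    ("Taro", "Emi"),
    ("Sakura", "Emi"),
    ("Emi", "Sakura"),
    ("Emi", "Taro"),
]
sns_document = follows_to_document(ages, follows)
print(sns_document)
```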
  • the information processing device 1 and the like may be configured to execute the learning phase for learning the model M.
  • the method of machine learning for model M is not limited, but as an example, a decision tree-based, linear regression, or neural network method may be used, or two or more of these methods may be used.
  • FIG. 20 schematically shows a configuration in which learned converters 121 and 122 with learnable parameters are provided before and after the output of model M.
  • The learned converters 121 and 122 are models with learnable parameters, in which a learning unit (not shown) uses training data to optimize how records are converted (for example, how sentences are constructed or the number of auxiliary records) and/or the conversion parameters. By providing the learned converters 121 and 122, it is possible to calculate the similarity of records with higher accuracy.
  • the machine learning method of the learned converters 121 and 122 is not limited, but as an example, a decision tree-based, linear regression, or neural network method may be used, or two or more of these methods may be used. Also, the learned converters 121 and 122 may be models generated by active learning.
  • Some or all of the functions of the information processing apparatuses 1, 1A, 1B, 1C, and 1D may be implemented by hardware such as integrated circuits (IC chips), or may be implemented by software.
  • the information processing apparatuses 1, 1A, 1B, 1C, and 1D are implemented by computers that execute program instructions, which are software that implements each function, for example.
  • An example of such a computer (hereinafter referred to as computer C) is shown in FIG.
  • Computer C comprises at least one processor C1 and at least one memory C2.
  • a program P for operating the computer C as the information processing apparatuses 1, 1A, 1B, 1C, and 1D is recorded in the memory C2.
  • the processor C1 reads the program P from the memory C2 and executes it, thereby implementing the functions of the information processing apparatuses 1, 1A, 1B, 1C, and 1D.
  • as the processor C1, for example, a CPU (Central Processing Unit), GPU (Graphic Processing Unit), DSP (Digital Signal Processor), MPU (Micro Processing Unit), FPU (Floating point number Processing Unit), PPU (Physics Processing Unit), a microcontroller, or a combination thereof can be used.
  • as the memory C2, for example, a flash memory, HDD (Hard Disk Drive), SSD (Solid State Drive), or a combination thereof can be used.
  • the computer C may further include a RAM (Random Access Memory) for expanding the program P during execution and temporarily storing various data.
  • Computer C may further include a communication interface for sending and receiving data to and from other devices.
  • Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.
  • the program P can be recorded on a non-transitory tangible recording medium M that is readable by the computer C.
  • as the recording medium M, for example, a tape, disk, card, semiconductor memory, programmable logic circuit, or the like can be used.
  • the computer C can acquire the program P via such a recording medium M.
  • the program P can be transmitted via a transmission medium.
  • as the transmission medium, for example, a communication network or broadcast waves can be used.
  • Computer C can also obtain program P via such a transmission medium.
  • Appendix 2 Some or all of the above-described embodiments may also be described as follows. However, the present invention is not limited to the aspects described below. (Appendix 1) An information processing device comprising: acquisition means for acquiring a record pair; conversion means for generating a converted record pair by converting the record pair; similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and output means for outputting the similarity calculated by the similarity calculation means.
  • the model is selected from a plurality of model candidates, the transforming means generates the transformed record pair by transforming the record pair into a format corresponding to the input of the model;
  • the information processing device according to appendix 1.
  • the model includes a document classification model;
  • the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into a document.
  • the information processing device according to appendix 1 or 2.
  • the model includes an image classification model,
  • the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into an image.
  • the information processing apparatus according to any one of Appendices 1 to 3.
  • the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into speech.
  • the information processing apparatus according to any one of Appendices 1 to 4.
  • the model includes a graph classification model,
  • the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into a graph.
  • the information processing apparatus according to any one of Appendices 1 to 5.
  • the obtaining means further obtains an auxiliary record,
  • the conversion means generates a converted auxiliary record by converting the auxiliary record;
  • the similarity calculation means calculates the similarity regarding the converted record pair by inputting the converted record pair and the converted auxiliary record into the model.
  • the information processing apparatus according to any one of Appendices 1 to 6.
  • the model includes a question-answer model in which a question sentence and an answer sentence are input,
  • the conversion means converts one record included in the record pair into a question sentence, and converts the other record included in the record pair and each of the auxiliary records into a response sentence, thereby converting the converted record pair into a question sentence.
  • the information processing device according to appendix 7.
  • the similarity calculating means calculates a plurality of similarities with respect to the record pair,
  • the output means outputs an integrated similarity obtained by integrating the plurality of similarities.
  • the model is a model that has asymmetry with respect to the replacement of two elements input to the model,
  • the similarity calculation means calculates a first similarity by inputting the two records included in the record pair into the model without swapping them, and calculates a second similarity by swapping the two records included in the record pair and then inputting them into the model,
  • the information processing device according to appendix 9.
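The handling of an asymmetric model can be sketched as below: the model is called once with the records in their original order and once with them swapped, and the two similarities are then integrated. The `containment` function is a toy asymmetric scorer standing in for a real model, and averaging is only one possible way to integrate the similarities; both are assumptions.

```python
def containment(premise, hypothesis):
    """Toy asymmetric score: fraction of hypothesis tokens found in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = set(hypothesis.lower().split())
    if not hypothesis_tokens:
        return 0.0
    return len(hypothesis_tokens & premise_tokens) / len(hypothesis_tokens)


def integrated_similarity(model, record_a, record_b):
    """Score both input orders and integrate them by averaging (assumption)."""
    first = model(record_a, record_b)   # original order
    second = model(record_b, record_a)  # swapped order
    return first, second, (first + second) / 2


first, second, integrated = integrated_similarity(
    containment, "salted potato chips", "potato chips"
)
print(first, second, integrated)
```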
  • (Appendix 11) An information processing method including: at least one processor acquiring a record pair; generating a converted record pair by converting the record pair; calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and outputting the calculated similarity.
  • An information processing device comprising at least one processor, the processor executing: an acquisition process of acquiring a record pair; a conversion process of generating a converted record pair by converting the record pair; a similarity calculation process of calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and an output process of outputting the similarity calculated in the similarity calculation process.
  • this information processing device may further include a memory, and this memory may store a program for causing the processor to execute the acquisition process, the conversion process, the similarity calculation process, and the output process. This program may also be recorded on a computer-readable non-transitory tangible recording medium.

Abstract

The present invention provides, as a technology for calculating the degree of similarity between a record pair, a technology that does not require training data pertaining to a record pair and that is capable of handling data of differing types. To this end, an information processing device (1) comprises: an acquisition unit (11) that acquires a record pair; a conversion unit (12) that converts the record pair so as to generate a converted record pair; a similarity degree calculation unit (13) that inputs the converted record pair into a model so as to calculate a similarity degree pertaining to the converted record pair; and an output unit (14) that outputs the similarity degree which has been calculated by the similarity degree calculation unit (13).

Description

Information processing device, information processing method, and information processing program
The present invention relates to an information processing device, an information processing method, and an information processing program.
Processing is performed to identify and associate combinations of identical or similar records among records stored in different datasets. Such processing is also called name identification processing. Name identification processing enables unified management of tables, expansion of data, and the like. One technique for name identification processing is matching by machine learning. For example, Patent Document 1 describes a device that calculates the similarity of a record pair using a plurality of similarity functions, each of which calculates the similarity of the record pair, and learns the weights of the similarities by supervised machine learning using training data. Here, the training data is a dataset in which combinations of records are labeled to indicate whether the records are identical. Non-Patent Document 1 describes a technique called DITTO that performs name identification by supervised machine learning. Non-Patent Document 2 describes a technique called ZeroER that matches records by unsupervised machine learning, which does not use training data.
In recent years, language models (for example, Non-Patent Documents 3 to 5), image classification models (for example, Non-Patent Document 6), and the like have been proposed as models generated by machine learning.
Japanese Patent Application Laid-Open No. 2019-185244
However, because supervised machine learning requires a large amount of training data, the techniques described in Patent Document 1 and Non-Patent Document 1 incur considerable cost and time for collecting training data, and they cannot handle heterogeneous data. Here, heterogeneous data refers to a combination of records whose data formats are not identical. ZeroER, the unsupervised machine learning technique described in Non-Patent Document 2, does not require training data, but it requires the attributes to be aligned and therefore cannot be applied to heterogeneous data whose attributes differ.
One aspect of the present invention has been made in view of the above problems, and an example of its object is to provide, as a technique for calculating the similarity of a record pair, a technique that does not require training data regarding record pairs and that can also handle heterogeneous data.
An information processing device according to one aspect of the present invention includes: acquisition means for acquiring a record pair; conversion means for generating a converted record pair by converting the record pair; similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and output means for outputting the similarity calculated by the similarity calculation means.
An information processing method according to one aspect of the present invention includes: at least one processor acquiring a record pair; generating a converted record pair by converting the record pair; calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and outputting the calculated similarity.
An information processing program according to one aspect of the present invention causes a computer to execute: an acquisition process of acquiring a record pair; a conversion process of generating a converted record pair by converting the record pair; a similarity calculation process of calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and an output process of outputting the similarity calculated in the similarity calculation process.
According to one aspect of the present invention, as a technique for calculating the similarity of a record pair, it is possible to provide a technique that does not require training data regarding record pairs and that can also handle heterogeneous data.
FIG. 1 is a block diagram showing the configuration of an information processing device according to Exemplary Embodiment 1.
FIG. 2 is a flow diagram showing the flow of an information processing method according to Exemplary Embodiment 1.
FIG. 3 is a block diagram showing the configuration of an information processing device according to Exemplary Embodiment 2.
FIG. 4 is a diagram showing a specific example of data including records according to Exemplary Embodiment 2.
FIG. 5 is a diagram showing an overview of the flow of processing performed by the information processing device according to Exemplary Embodiment 2.
FIG. 6 is a diagram showing a specific example of identity determination results according to Exemplary Embodiment 2.
FIG. 7 is a flow diagram showing the flow of an information processing method according to Exemplary Embodiment 2.
FIG. 8 is a diagram schematically showing entailment relationships of documents according to Exemplary Embodiment 2.
FIG. 9 is a diagram showing an example of an image converted by the conversion unit according to Exemplary Embodiment 2.
FIG. 10 is a block diagram showing the configuration of an information processing device according to Exemplary Embodiment 3.
FIG. 11 is a flow diagram showing the flow of an information processing method according to Exemplary Embodiment 3.
FIG. 12 is a conceptual diagram of similarity calculation processing using a question answering model according to Exemplary Embodiment 3.
FIG. 13 is a block diagram showing the configuration of an information processing device according to Exemplary Embodiment 4.
FIG. 14 is a diagram showing a specific example of the similarity calculated by the similarity calculation unit according to Exemplary Embodiment 4.
FIG. 15 is a diagram showing a specific example of the similarity calculated by the similarity calculation unit according to Exemplary Embodiment 4.
FIG. 16 is a block diagram showing the configuration of an information processing device according to Exemplary Embodiment 5.
FIG. 17 is a diagram showing a specific example of screen display according to Exemplary Embodiment 5.
FIG. 18 is a diagram showing an example of graph data.
FIG. 19 is a diagram showing an example of a graph database.
FIG. 20 is a diagram schematically showing a configuration in which learned converters are provided before and after the output of a model.
FIG. 21 is a block diagram showing the configuration of a computer functioning as the information processing device according to each exemplary embodiment.
[Exemplary embodiment 1]
A first exemplary embodiment of the present invention will be described in detail with reference to the drawings. This exemplary embodiment is the basis for the exemplary embodiments described later.
<Configuration of information processing device 1>
The configuration of the information processing device 1 according to this exemplary embodiment will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the information processing device 1. The information processing device 1 is a device that calculates the similarity between records. Here, a record is a unit of data for which similarity is calculated. Examples of data containing records include structured data such as table data, semi-structured data described in a data description language such as JSON (JavaScript Object Notation: registered trademark) or XML (Extensible Markup Language), and unstructured data representing documents written in natural language. A record is, for example, a row of a table and contains one or more sets of an attribute name and an attribute value corresponding to the columns of the table. A record may also be graph data.
The information processing device 1 includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13, and an output unit 14.
(Acquisition unit 11)
The acquisition unit 11 acquires a record pair. A record pair is a set of records, for example, a set of a record included in a first table and a record included in a second table. The first table and the second table are, for example, tables that store customer information of a business operator or tables that store product information. However, the first table and the second table are not limited to these examples and may be other tables. The first table and the second table may be the same table or different tables.
The records included in a record pair may have different data formats. More specifically, for example, when the records are rows of tables, some of the attribute names included in the records may differ, or all of the attribute names included in the records may differ.
The acquisition unit 11 may acquire the record pair by reading it from a storage device, or by receiving it from another device connected via a communication interface. The acquisition unit 11 may also acquire a record pair input from an input device via an input/output interface.
(Conversion unit 12)
The conversion unit 12 generates a converted record pair by converting the record pair. For example, the conversion unit 12 converts the records included in the record pair into data representing documents, images, speech, or graphs. More specifically, the conversion unit 12 converts a record into an affirmative sentence or a question sentence, for example. However, the method by which the conversion unit 12 converts the record pair is not limited to these examples, and the conversion unit 12 may convert the record pair by another method.
(Similarity calculation unit 13)
The similarity calculation unit 13 calculates the similarity regarding the converted record pair by inputting the converted record pair into a model. Here, the model is a model for calculating the similarity and is, for example, a model that is publicly available and can be used by any user. The model may be a model generated by machine learning or a rule-based model created by humans. Specifically, the model is, for example, a document classification model, an image classification model, a speech classification model, or a graph classification model. A document classification model is a model that classifies document data. An image classification model is a model that classifies image data. A speech classification model is a model that classifies speech data. A graph classification model is a model that classifies graph data.
Document classification models include, for example, a document embedding model, an entailment recognition model, a paraphrase prediction model, a question answering model, and a masked language model. A document embedding model is a model that embeds documents or words in a vector space. An entailment recognition model is a model that predicts the entailment relationship between a plurality of documents. A paraphrase prediction model is a model that predicts whether two documents are paraphrases of each other. A question answering model is a model that extracts and outputs an answer to a question from a given document. A masked language model is a model that predicts the word that fits a mask in a document.
Image classification models include, for example, an image embedding model. An image embedding model is a model that embeds image data in a vector space. Speech classification models include, for example, a speech embedding model. A speech embedding model is a model that embeds speech data in a vector space.
The input of the model includes, for example, at least one of text data, image data, speech data, a graph, and a vector. The output of the model includes, for example, a vector or a score indicating a degree of confidence. The score is, for example, a score indicating the degree of confidence regarding the inclusion relationship between documents, or a score indicating the degree of confidence that the documents are paraphrases. However, the input and output of the model are not limited to these examples and may include other information.
When the model is generated by machine learning, examples of the model include, but are not limited to, the language models described in Non-Patent Documents 3 to 5, the image classification model described in Non-Patent Document 6, and a speech classification model that classifies speech data. The model may be stored in the memory of the information processing device 1 or in another device capable of communicating with the information processing device 1.
The similarity is information on the degree to which the records included in a record pair resemble each other and is, for example, the cosine similarity of a vector pair. The similarity may also be a value calculated from a score output by the model.
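As a concrete illustration of the cosine similarity of a vector pair, the following sketch computes it for two plain Python lists. The vectors here are made-up values standing in for embeddings that a model would output; returning 0.0 for a zero vector is a convention chosen for this sketch, not something stated in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for degenerate input (assumption)
    return dot / (norm_u * norm_v)


# Hypothetical embedding vectors for the two converted records.
embedding_e = [0.2, 0.7, 0.1]
embedding_e_prime = [0.25, 0.65, 0.05]
print(cosine_similarity(embedding_e, embedding_e_prime))
```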
(Output unit 14)
The output unit 14 outputs the similarity calculated by the similarity calculation unit 13. For example, the output unit 14 may output the similarity by writing it to a storage device, or by transmitting it to another device via a communication interface. The output unit 14 may also output the similarity to an output device (not shown) connected via an input/output interface. The output device is, for example, a display, a printer, a projector, or a speaker.
The similarity output by the output unit 14 is used, for example, for table integration processing or information retrieval processing. In table integration processing, records predicted to be identical based on the similarity calculated by the similarity calculation unit 13 are linked, so that a plurality of tables can be integrated and the data managed centrally. In information retrieval, the similarity calculation unit 13 may calculate the similarity for a record pair consisting of a record used as a search key (for example, a record specified by a user) and any other record registered in a predetermined table. In this case, the information processing device 1 may output, as a search result, the records included in record pairs predicted to be identical based on the similarity calculated by the similarity calculation unit 13. This enables search processing using the search key even on a table that is not linked to the record serving as the search key.
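Table integration based on the similarity can be sketched as follows: every cross-table record pair is scored, pairs at or above a threshold are treated as identical, and linked rows are merged. The `shared_value_ratio` scorer, the threshold of 0.5, and the sample tables are hypothetical stand-ins for the model-based similarity and real data.

```python
def shared_value_ratio(record_a, record_b):
    """Stand-in similarity: shared attribute values over the smaller value set."""
    values_a = {str(v).lower() for v in record_a.values()}
    values_b = {str(v).lower() for v in record_b.values()}
    if not values_a or not values_b:
        return 0.0
    return len(values_a & values_b) / min(len(values_a), len(values_b))


def link_tables(table_a, table_b, similarity, threshold):
    """Return index pairs (i, j) of records predicted to be identical."""
    return [
        (i, j)
        for i, record_a in enumerate(table_a)
        for j, record_b in enumerate(table_b)
        if similarity(record_a, record_b) >= threshold
    ]


customers = [{"name": "Taro", "age": 23}]
members = [
    {"full_name": "Taro", "city": "Tokyo"},
    {"full_name": "Emi", "city": "Osaka"},
]
links = link_tables(customers, members, shared_value_ratio, threshold=0.5)
# Merge each linked pair into one row; attribute names need not match.
merged = [{**members[j], **customers[i]} for i, j in links]
print(links, merged)
```

Because the attribute names differ between the two tables (`name` versus `full_name`), the merged row simply keeps both; resolving such duplicates is left out of this sketch.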
<Effects of information processing device 1>
As described above, the information processing device 1 according to this exemplary embodiment adopts a configuration that includes: the acquisition unit 11, which acquires a record pair; the conversion unit 12, which generates a converted record pair by converting the record pair; the similarity calculation unit 13, which calculates a similarity for the converted record pair by inputting the converted record pair into a model; and the output unit 14, which outputs the similarity calculated by the similarity calculation unit 13. The information processing device 1 according to this exemplary embodiment therefore provides a technique for calculating the similarity of a record pair that requires no training data on record pairs and that can also handle heterogeneous data.
<Information processing program>
The functions of the information processing device 1 described above can also be realized by a program. The information processing program according to this exemplary embodiment causes a computer to execute: an acquisition process of acquiring a record pair; a conversion process of generating a converted record pair by converting the record pair; a similarity calculation process of calculating a similarity for the converted record pair by inputting the converted record pair into a model; and an output process of outputting the similarity calculated in the similarity calculation process.
<Flow of information processing method S1>
The flow of the information processing method S1 according to this exemplary embodiment will be described with reference to FIG. 2. FIG. 2 is a flowchart showing the flow of the information processing method S1. Each step of the information processing method S1 may be executed by a processor included in the information processing device 1 or by a processor included in another device, and the individual steps may be executed by processors provided in different devices.
In step S11, at least one processor acquires a record pair. In step S12, at least one processor generates a converted record pair by converting the record pair. In step S13, at least one processor calculates a similarity for the converted record pair by inputting the converted record pair into a model. In step S14, at least one processor outputs the calculated similarity.
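The four steps S11 to S14 can be sketched as a simple pipeline. The following Python sketch is illustrative only: the function names (`run_method_s1`, `toy_convert`, `toy_model`) and the sample strings are assumptions introduced here, and the token-overlap (Jaccard) score merely stands in for the model, which the description leaves open.

```python
from typing import Callable, List, Tuple

RecordPair = Tuple[str, str]


def run_method_s1(
    acquire: Callable[[], RecordPair],
    convert: Callable[[RecordPair], RecordPair],
    model: Callable[[RecordPair], float],
    output: Callable[[float], None],
) -> float:
    """Run steps S11 to S14 in order and return the similarity."""
    pair = acquire()               # S11: acquire a record pair
    converted = convert(pair)      # S12: generate the converted record pair
    similarity = model(converted)  # S13: input the converted pair into a model
    output(similarity)             # S14: output the similarity
    return similarity


def toy_convert(pair: RecordPair) -> RecordPair:
    # Hypothetical conversion: normalize both records to lower case.
    return (pair[0].lower(), pair[1].lower())


def toy_model(pair: RecordPair) -> float:
    # Hypothetical stand-in for the model: Jaccard overlap of tokens.
    a, b = set(pair[0].split()), set(pair[1].split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


emitted: List[float] = []
score = run_method_s1(
    lambda: ("Sims 2 Pack", "sims 2 pack deluxe"),
    toy_convert,
    toy_model,
    emitted.append,  # here "output" just appends to a list
)
```

Passing the four stages as callables mirrors the statement that each step may be executed by a different processor or device.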
<Effects of information processing method S1>
As described above, the information processing method S1 according to this exemplary embodiment adopts a configuration in which at least one processor acquires a record pair, generates a converted record pair by converting the record pair, calculates a similarity for the converted record pair by inputting the converted record pair into a model, and outputs the calculated similarity. The information processing method S1 according to this exemplary embodiment therefore provides a technique for calculating the similarity of a record pair that requires no training data on record pairs and that can also handle heterogeneous data.
[Exemplary embodiment 2]
A second exemplary embodiment of the present invention will be described in detail with reference to the drawings. Components having the same functions as those described in exemplary embodiment 1 are denoted by the same reference numerals, and their description is not repeated.
<Overview of information processing device 1A>
FIG. 3 is a block diagram showing the configuration of the information processing device 1A according to this exemplary embodiment. The information processing device 1A has a function of determining whether records are identical to each other. The data containing the records is, as an example, structured data such as table data, semi-structured data described in a data description language such as JSON or XML, or unstructured data representing a document written in a natural language.
FIG. 4 is a diagram showing specific examples of data containing records. In FIG. 4, data D1 is a table; in this case, each row of the table is a record. Data D2 is semi-structured data described in a data description language such as a markup language; in this case, a record is, for example, a web page. Data D3 is unstructured data representing a document written in a natural language; in this case, a record is, for example, a file generated in a predetermined file format.
An overview of the processing performed by the information processing device 1A according to this exemplary embodiment will now be described with reference to FIG. 5. FIG. 5 is a diagram showing an overview of the flow of processing performed by the information processing device 1A. The processing performed by the information processing device 1A is broadly divided into (i) record pair generation processing, (ii) similarity calculation processing, and (iii) identity determination processing.
(i) In the record pair generation processing, the information processing device 1A generates record pairs from first data x, which includes a plurality of records e, and second data x', which includes a plurality of records e'. As an example, the information processing device 1A generates all combinations of a record e included in the first data x and a record e' included in the second data x'. In generating record pairs, the information processing device 1A may also use a technique called blocking to narrow down, for each record e of the first data x, the candidate records in the second data x' for identity determination.
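The two ways of generating record pairs described in (i) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function names, the toy records, and the blocking key (the first token of the title) are all hypothetical choices made here for illustration; the description does not specify a particular blocking method.

```python
from collections import defaultdict
from itertools import product
from typing import Callable, Dict, Hashable, List, Tuple

Record = Dict[str, str]


def all_pairs(x: List[Record], x_prime: List[Record]) -> List[Tuple[Record, Record]]:
    """Generate every combination of a record e in x and a record e' in x'."""
    return list(product(x, x_prime))


def blocked_pairs(
    x: List[Record],
    x_prime: List[Record],
    key: Callable[[Record], Hashable],
) -> List[Tuple[Record, Record]]:
    """Pair e only with the records e' that share the same blocking key."""
    buckets = defaultdict(list)
    for e_prime in x_prime:
        buckets[key(e_prime)].append(e_prime)
    return [(e, e_prime) for e in x for e_prime in buckets.get(key(e), [])]


# Toy data made up for this sketch.
x = [{"title": "sims 2 glamour life stuff pack"}, {"title": "atlas road map"}]
x_prime = [{"title": "sims 2 pets"}, {"title": "atlas world map"}, {"title": "sims 3"}]


def first_token(record: Record) -> str:
    # Hypothetical blocking key: the first word of the title.
    return record["title"].split()[0]


full = all_pairs(x, x_prime)
narrowed = blocked_pairs(x, x_prime, first_token)
```

With this toy data the full product contains 6 pairs, while blocking keeps 3 candidates, which shows how blocking reduces the number of pairs passed to the similarity calculation.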
(ii) In the similarity calculation processing, the information processing device 1A calculates the similarity between the records included in a record pair. In this exemplary embodiment, the information processing device 1A calculates the similarity by inputting, into a model, a converted record pair obtained by converting the records. Details of the similarity calculation processing will be described later.
(iii) In the identity determination processing, the information processing device 1A determines, on the basis of the calculated similarity, whether the records included in the record pair are identical. As an example, the information processing device 1A determines that the records are identical when the similarity is equal to or greater than a threshold. However, the method of determining identity is not limited to this, and the information processing device 1A may determine the identity of records by another method.
FIG. 6 is a diagram showing a specific example of identity determination results. In FIG. 6, table TBL1 is an example of the first data x and includes a plurality of rows and a plurality of columns. Table TBL2 is an example of the second data x' and likewise includes a plurality of rows and a plurality of columns. In tables TBL1 and TBL2, a record is a row of the table. Table TBL1 contains records l1, l2, l3, and l4, and table TBL2 contains records r1, r2, and r3.
In the example of FIG. 6, through the processing (i) to (iii) above, the information processing device 1A determines that record l1 and record r2 are identical, that record l2 and record r3 are identical, and that record l3 and record r1 are identical.
<Configuration of information processing device 1A>
As shown in FIG. 3, the information processing device 1A includes a control unit 10A, a storage unit 20A, a communication unit 30A, and an input/output unit 40A.
(Communication unit 30A)
The communication unit 30A communicates with devices external to the information processing device 1A via a communication line. The specific configuration of the communication line does not limit this exemplary embodiment; as an example, the communication line is a wireless LAN (Local Area Network), a wired LAN, a WAN (Wide Area Network), a public network, a mobile data communication network, or a combination of these. The communication unit 30A transmits data supplied from the control unit 10A to other devices and supplies data received from other devices to the control unit 10A.
(Input/output unit 40A)
Input/output devices such as a keyboard, a mouse, a display, a printer, and a touch panel are connected to the input/output unit 40A. The input/output unit 40A accepts input of various kinds of information to the information processing device 1A from the connected input devices. Under the control of the control unit 10A, the input/output unit 40A also outputs various kinds of information to the connected output devices. An example of the input/output unit 40A is an interface such as USB (Universal Serial Bus).
(Control unit 10A)
As shown in FIG. 3, the control unit 10A includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13, an output unit 14, an identity determination unit 15A, and an integration unit 16A.
(Acquisition unit 11)
In this exemplary embodiment, the acquisition unit 11 generates a record pair including a record e and a record e' from first data x including the record e and second data x' including the record e'. However, the acquisition unit 11 does not have to generate the record pair itself. As an example, the acquisition unit 11 may acquire a record pair by reading it from the storage unit 20A or another external storage device, or may acquire a record pair received from another device via the communication unit 30A. The acquisition unit 11 may also acquire a record pair input from an input device connected to the input/output unit 40A.
(Conversion unit 12)
The conversion unit 12 generates a converted record pair by converting the record pair. As an example, the conversion unit 12 converts the records included in the record pair into document data, image data, audio data, or a graph. The conversion processing executed by the conversion unit 12 will be described later.
(Similarity calculation unit 13)
The similarity calculation unit 13 calculates a similarity s for the converted record pair by inputting the converted record pair into the model MA. The similarity s is information on the degree to which the records included in the record pair resemble each other and is, as an example, the cosine similarity of a vector pair or a value calculated from a score output by the model MA. Details of the processing by which the similarity calculation unit 13 calculates the similarity s will be described later.
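For the case in which the similarity s is the cosine similarity of a vector pair, the computation is s = (u·v) / (|u||v|). The following Python sketch shows this computation; the function name `cosine_similarity` and the convention of returning 0.0 for a zero vector are assumptions made here for illustration, not details given in the description.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the cosine of the angle between vectors u and v."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```

In this setting, u and v would be the vectors that the model MA outputs for the two converted records.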
(Output unit 14)
The output unit 14 outputs the similarity s calculated by the similarity calculation unit 13. As an example, the output unit 14 outputs the similarity s by writing it to the storage unit 20A. However, the method by which the output unit 14 outputs the similarity s is not limited to this example. As an example, the output unit 14 may transmit the similarity s to another device connected via the communication unit 30A, or may output it to an output device connected via the input/output unit 40A.
(Identity determination unit 15A)
The identity determination unit 15A determines, on the basis of the similarity s, whether the records included in a record pair are identical. As an example, the identity determination unit 15A determines that the records are identical when the similarity s is equal to or greater than a threshold. The identity determination unit 15A may also determine identity on the basis of the rank of each record pair when the record pairs are sorted in descending order of similarity, for example by determining that the top x record pairs are identical. As a further example, the identity determination unit 15A may determine identity by applying a matching algorithm such as an algorithm for the stable marriage problem.
The method by which the identity determination unit 15A determines identity is not limited to the examples above. As an example, the identity determination unit 15A may predict the identity of a record pair by inputting the record pair and its similarity into a prediction model generated by machine learning. In this case, the input of the prediction model includes, as an example, the record pair and the similarity, and the output of the prediction model includes, as an example, the predicted identity. The machine learning method used for the prediction model is not limited; as an example, a decision-tree-based method, linear regression, or a neural network may be used, and two or more of these methods may be combined.
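The threshold-based and rank-based determinations can be sketched in Python as follows. The function names and the similarity values are hypothetical; only the record labels (l1 to l4, r1 to r3) are borrowed from FIG. 6, and the scores are made up so that the result matches the pairs that FIG. 6 reports as identical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def identical_by_threshold(scores: Dict[Pair, float], threshold: float) -> List[Pair]:
    """Return the pairs whose similarity is equal to or greater than the threshold."""
    return [pair for pair, s in scores.items() if s >= threshold]


def identical_by_rank(scores: Dict[Pair, float], top_x: int) -> List[Pair]:
    """Return the top_x pairs when sorted in descending order of similarity."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ranked[:top_x]]


# Made-up similarities for a few record pairs.
scores = {
    ("l1", "r2"): 0.92,
    ("l1", "r1"): 0.15,
    ("l2", "r3"): 0.81,
    ("l3", "r1"): 0.77,
    ("l4", "r2"): 0.30,
}

by_threshold = identical_by_threshold(scores, 0.5)
by_rank = identical_by_rank(scores, 2)
```

Neither function enforces a one-to-one assignment between the two tables; a matching algorithm such as one for the stable marriage problem would add that constraint.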
(Integration unit 16A)
The integration unit 16A integrates the first data x and the second data x' on the basis of the determination result of the identity determination unit 15A. As an example, the integration unit 16A integrates the first data x and the second data x' by increasing the number of records and/or increasing the number of attributes of the data. Examples of the data integration executed by the integration unit 16A include (i) entity integration, (ii) data cleansing, and (iii) schema matching. (i) Entity integration refers to unifying the notation of differing attributes and their values when a set of identical records is given. (ii) Data cleansing refers to unifying differences in the way items such as company names, addresses, and area codes are written (for example, the abbreviation "(株)" and the full form "株式会社", both corresponding to "Co., Ltd."). (iii) Schema matching refers to aligning (matching) a plurality of attributes whose notations differ.
(Storage unit 20A)
The storage unit 20A stores the first data x and the second data x', as well as the similarity s calculated by the similarity calculation unit 13. The storage unit 20A also stores the model MA. Here, the statement that the model MA is stored in the storage unit 20A means that the parameters defining the model MA are stored in the storage unit 20A.
(Model MA)
The model MA is a model for calculating the similarity and is, as an example, a publicly available model that any user can use. The model MA may be a model generated by machine learning or a rule-based model created by a human.
As an example, the model MA includes at least one of a document classification model, an image classification model, a speech classification model, and a graph classification model. Examples of the document classification model include a document embedding model, an entailment recognition model, a paraphrase prediction model, and a masked language model. An example of the image classification model is an image embedding model that embeds image data in a vector space. An example of the speech classification model is a speech embedding model that embeds speech data in a vector space.
The input of the model MA includes, as an example, at least one of document data, image data, audio data, a graph, and a vector. The output of the model MA includes, as an example, at least one of a vector and a score.
(Example 1 of document classification model: document embedding model)
A document embedding model is a model that embeds a document or a word in a vector space. As an example, the document embedding model is generated by RoBERTa, described in Non-Patent Document 3. The input of the document embedding model is, as an example, a document or a word (for example, the sentence "An elderly man is walking in the park."). The output is, as an example, a vector.
(Example 2 of document classification model: entailment recognition model)
An entailment recognition model is a model that predicts whether an entailment relation of the form "if document 1, then document 2" holds. As an example, the entailment recognition model is generated by the technique described in Non-Patent Document 4. The input of the entailment recognition model is, as an example, two documents, such as document 1, "An elderly man is walking in the park", and document 2, "A man is in the park". In this case, document 2 entails document 1. FIG. 8 is a diagram schematically showing the entailment relation between documents. In the example of FIG. 8, document 2 entails document 1.
The output of the entailment recognition model is, as an example, an entailment score. The entailment score is a numerical value indicating the confidence in the entailment relation and is, for example, a real number from 0 to 1. As an example, a larger entailment score indicates higher confidence in the entailment relation.
(Example 3 of document classification model: paraphrase prediction model)
A paraphrase prediction model is a model that predicts whether two documents are paraphrases of each other. As an example, the paraphrase prediction model is generated by RoBERTa, described in Non-Patent Document 3. The input of the paraphrase prediction model is, as an example, two documents, such as document 1, "NEC is an IT company.", and document 2, "NEC Corporation (written with the full Japanese company name, 日本電気株式会社) runs an IT business." The output of the model includes, as an example, a paraphrase score. The paraphrase score indicates the confidence that the two documents are paraphrases of each other and is, as an example, a real number from 0 to 1. As an example, a larger paraphrase score indicates higher confidence that the two documents are paraphrases.
(Example 4 of document classification model: masked language model)
A masked language model is a model that predicts the word that fits a mask in a document. As an example, the masked language model is a model generated by RoBERTa, described in Non-Patent Document 3. The input of the model is, as an example, a document (for example, the text "This pizza is very delicious. I [mask] this pizza."). The output of the model includes a word (for example, "like") and a score. The score indicates the confidence that the word fits the mask and is, as an example, a real number from 0 to 1.
(Image classification model)
An image classification model is a model that classifies images. As an example, the image classification model is a model generated by the technique described in Non-Patent Document 6. The input of the image classification model is, as an example, an image (for example, an image of a dog). The intermediate output of the model is, as an example, a vector representation of the image, and the output of the model is, as an example, a label (for example, a label indicating "dog") and a score. The score indicates the confidence in the label and is, for example, a real number from 0 to 1. In this exemplary embodiment, the similarity calculation unit 13 calculates the similarity s using the vector representation of the image.
(Speech classification model)
A speech classification model is a model that classifies audio. The input of the speech classification model is, as an example, audio data (for example, a dog's bark). The intermediate output of the model is, as an example, a vector representation of the audio, and the output of the model is, as an example, a label (for example, a label indicating that the audio is a dog's bark) and a score. The score indicates the confidence in the label and is, for example, a real number from 0 to 1. In this exemplary embodiment, the similarity calculation unit 13 calculates the similarity s using the vector representation of the audio, which is the intermediate output.
(Graph classification model)
A graph classification model is a model that classifies graphs. The input of the graph classification model is, as an example, graph data (for example, a graph representing facial features). The intermediate output of the model is, as an example, a vector representation of the graph, and the output of the model is, as an example, a label (for example, a label indicating a person) and a score. The score indicates the confidence in the label and is, for example, a real number from 0 to 1. In this exemplary embodiment, the similarity calculation unit 13 calculates the similarity s using the vector representation of the graph, which is the intermediate output.
<Flow of information processing method S100A>
FIG. 7 is a flowchart showing the flow of the information processing method S100A executed by the information processing device 1A. Some of the steps included in the information processing method S100A may be executed in parallel or in a different order. Matters that have already been described are not described again.
(Step S101)
In step S101, the acquisition unit 11 reads the model MA. In this exemplary embodiment, the model MA is selected from a plurality of model candidates. The plurality of model candidates includes, as an example, at least one of a document classification model (such as a document embedding model or an entailment recognition model), an image classification model, and a speech classification model. The model MA may be selected, as an example, on the basis of a user operation, or in accordance with a predetermined algorithm. The model MA may be a single model or a set of a plurality of models.
(Step S102)
In step S102, the acquisition unit 11 reads a record pair. As an example, the record e and the record e' included in the record pair are expressed as
e=((a_j,v_j))_{j=1,…,d},
e'=((a'_j,v'_j))_{j=1,…,d'}.
Here, each attribute name a_j∈A_j, where A_j is, as an example, a string space, and each attribute value v_j∈V_j, where V_j is, as an example, a string space or a real number space. In this notation, the record l1 in FIG. 6 is
l1=((title, sims 2 glamour life stuff pack),(manufacturer, aspyr media),(price, 24.99)),
and the record r2 is
r2=((title, aspyr media inc sims 2 glamour life stuff pack),(price, NaN)).
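The record notation above maps naturally onto a sequence of (attribute name, attribute value) tuples. The Python sketch below encodes the records l1 and r2 of FIG. 6 in that form; the type aliases and the helper functions `attribute_names` and `is_missing` are assumptions introduced here for illustration.

```python
import math
from typing import List, Tuple, Union

# An attribute value lies in a string space or a real number space.
Value = Union[str, float]
# A record e = ((a_j, v_j))_{j=1,...,d} as an ordered list of pairs.
Record = List[Tuple[str, Value]]

l1: Record = [
    ("title", "sims 2 glamour life stuff pack"),
    ("manufacturer", "aspyr media"),
    ("price", 24.99),
]
r2: Record = [
    ("title", "aspyr media inc sims 2 glamour life stuff pack"),
    ("price", float("nan")),  # NaN marks a missing price
]


def attribute_names(e: Record) -> List[str]:
    """Return the attribute names a_1, ..., a_d of a record."""
    return [name for name, _ in e]


def is_missing(value: Value) -> bool:
    """Return True when a real-valued attribute is NaN."""
    return isinstance(value, float) and math.isnan(value)
```

Note that the two records have different numbers of attributes (d = 3 and d' = 2), which is why the notation uses separate index ranges for e and e'.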
In this exemplary embodiment, the acquisition unit 11 generates record pairs from the first data x and the second data x'. As an example, the acquisition unit 11 generates all combinations of a record e included in the first data x and a record e' included in the second data x'. In generating record pairs, the acquisition unit 11 may also use a technique called blocking to narrow down, for each record e of the first data x, the candidate records in the second data x' for identity determination.
(Step S103)
In step S103, the conversion unit 12 generates a converted record pair by converting the record pair (e, e') into a format that corresponds to the input of the model MA. As an example, when the model MA includes a document classification model, the conversion processing by the conversion unit 12 includes processing that generates a converted record pair by converting the record pair (e, e') into documents. Likewise, when the model MA includes an image classification model, the conversion processing includes processing that generates a converted record pair by converting the record pair (e, e') into images. When the model MA includes a speech classification model, the conversion processing includes processing that generates a converted record pair by converting the record pair into audio. When the model MA includes a graph classification model, the conversion processing includes processing that generates a converted record pair by converting the record pair into graphs.
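For the document case, one simple conversion is to serialize each attribute as an "Attribute is value." sentence, the pattern that processing example 1 uses for its documents t and t'. The Python sketch below illustrates this; the function name `record_to_document` and the rule of skipping NaN values are assumptions inferred from that example, not a definitive description of the conversion unit 12.

```python
import math
from typing import List, Tuple, Union

Value = Union[str, float]
Record = List[Tuple[str, Value]]


def record_to_document(e: Record) -> str:
    """Serialize a record as 'Name is value.' sentences, skipping NaN values."""
    sentences = []
    for name, value in e:
        if isinstance(value, float) and math.isnan(value):
            continue  # assumed rule: a missing value produces no sentence
        sentences.append(f"{name.capitalize()} is {value}.")
    return " ".join(sentences)


e: Record = [
    ("title", "sims 2 glamour life stuff pack"),
    ("manufacturer", "aspyr media"),
    ("price", 24.99),
]
e_prime: Record = [
    ("title", "aspyr media inc sims 2 glamour life stuff pack"),
    ("price", float("nan")),
]

t = record_to_document(e)
t_prime = record_to_document(e_prime)
```

Because the output is plain text, records with different attribute sets can be fed to the same document model without aligning their schemas first.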
(Step S104)
In step S104, the similarity calculation unit 13 calculates the similarity s for the converted record pair by inputting the converted record pair into the model MA.
(Processing examples 1 to 5 of steps S101 to S104)
Processing examples 1 to 5 are described here as examples of the processing in steps S101 to S104. Processing example 1 uses a document embedding model. Processing example 2 uses an image classification model. Processing example 3 uses a speech classification model. Processing example 4 uses an entailment recognition model. Processing example 5 uses a paraphrase prediction model.
(Processing example 1: document embedding model)
In this example, in step S101, the acquisition unit 11 reads a document embedding model. In step S103, the conversion unit 12 converts the record pair (e, e') into a document pair (t, t'). As an example, the conversion unit 12 converts the record pair (e, e') consisting of the records
e = ((title, sims 2 glamour life stuff pack), (manufacturer, aspyr media), (price, 24.99))
e' = ((title, aspyr media inc sims 2 glamour life stuff pack), (price, NaN))
into the document pair (t, t') consisting of the documents
t = "Title is sims 2 glamour life stuff pack. Manufacturer is aspyr media. Price is 24.99."
t' = "Title is aspyr media inc sims 2 glamour life stuff pack."
In step S104, the similarity calculation unit 13 converts the document pair (t, t') into a vector pair (v, v') using the document embedding model. Here, v = M(t) and v' = M(t'). The similarity calculation unit 13 then calculates the similarity s from the vector pair (v, v'). As an example, the similarity is s = exp(-||v - v'|| / c), where c > 0. The similarity s may instead be the cosine similarity s = v^T v' / (||v|| ||v'||), where ^T denotes transposition.
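The two similarity formulas above can be sketched in plain Python as follows. The function names are hypothetical, the embedding model M itself is not modeled (the functions take already-computed vectors), and the use of the Euclidean norm for ||·|| is an assumption, since the text does not name a specific norm.

```python
import math

def exp_similarity(v, w, c=1.0):
    """s = exp(-||v - w|| / c) with c > 0, using the Euclidean norm (an assumption)."""
    if c <= 0:
        raise ValueError("c must be positive")
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))
    return math.exp(-distance / c)

def cosine_similarity(v, w):
    """s = v^T w / (||v|| ||w||); both vectors must be non-zero."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)
```

The exponential form maps identical vectors to 1 and decays toward 0 as the distance grows, with c controlling how quickly; the cosine form ignores vector length and depends only on direction.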
(Processing example 2: image embedding model)
In this example, in step S101, the acquisition unit 11 reads an image embedding model. In step S103, the conversion unit 12 converts the record pair (e, e') into an image pair (i, i').
FIG. 9 is a diagram showing an example of images produced by the conversion unit 12. As an example, the conversion unit 12 converts the record pair (e, e') consisting of the records
e = ((title, sims 2 glamour life stuff pack), (manufacturer, aspyr media), (price, 24.99))
e' = ((title, aspyr media inc sims 2 glamour life stuff pack), (price, NaN))
into the images i and i' shown in FIG. 9.
In step S104, the similarity calculation unit 13 converts the image pair (i, i') into a vector pair (v, v') using the image embedding model. Here, v = M(i) and v' = M(i'). The similarity calculation unit 13 then calculates the similarity s from the vector pair (v, v'). As an example, the similarity is s = exp(-||v - v'|| / c), where c > 0.
When an image embedding model is used, the conversion unit 12 may convert one record into one image, or may perform image conversion for each element (for example, each word) included in the record. When the conversion unit 12 performs image conversion for each element, the similarity calculation unit 13 calculates the similarity s using the set of per-element images. In that case, the conversion unit 12 may skip image conversion for missing values in the record.
(Processing example 3: speech embedding model)
In this example, in step S101, the acquisition unit 11 reads a speech embedding model. In step S103, the conversion unit 12 converts the record pair (e, e') into a speech pair (i, i'). As an example, the conversion unit 12 converts the record pair (e, e') consisting of the records
e = ((title, sims 2 glamour life stuff pack), (manufacturer, aspyr media), (price, 24.99))
e' = ((title, aspyr media inc sims 2 glamour life stuff pack), (price, NaN))
into speech data i representing the record e read aloud and speech data i' representing the record e' read aloud.
In step S104, the similarity calculation unit 13 converts the speech data pair (i, i') into a vector pair (v, v') using the speech embedding model. Here, v = M(i) and v' = M(i'). The similarity calculation unit 13 then calculates the similarity s from the vector pair (v, v'). As an example, the similarity is s = exp(-||v - v'|| / c), where c > 0.
(Processing example 4: entailment recognition model)
In this example, in step S101, the acquisition unit 11 reads an entailment recognition model. In step S103, the conversion unit 12 converts the record pair (e, e') into a document pair (t, t').
As an example, the conversion unit 12 converts a record
e = ((a_j, v_j))_{j=1,…,d}
into the document
t = "a_1 is v_1. … a_d is v_d."
However, the conversion unit 12 does not include missing values in the document t.
As an example, the conversion unit 12 converts the record pair (e, e') consisting of the records
e = ((title, sims 2 glamour life stuff pack), (manufacturer, aspyr media), (price, 24.99))
e' = ((title, aspyr media inc sims 2 glamour life stuff pack), (price, NaN))
into the document pair (t, t') consisting of the documents
t = "Title is sims 2 glamour life stuff pack. Manufacturer is aspyr media. Price is 24.99."
t' = "Title is aspyr media inc sims 2 glamour life stuff pack."
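The record-to-document conversion "a_1 is v_1. … a_d is v_d." with missing values omitted can be sketched as follows. The function names are hypothetical, records are modeled as tuples of (attribute, value) pairs, and treating `None` and NaN as the missing values is an assumption. Capitalizing the attribute name follows the worked example rather than the general template.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing values (an assumption about the data encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def record_to_document(record):
    """Render each non-missing (attribute, value) pair as the sentence 'Attribute is value.'"""
    sentences = [
        f"{attr.capitalize()} is {value}."
        for attr, value in record
        if not is_missing(value)
    ]
    return " ".join(sentences)

e = (("title", "sims 2 glamour life stuff pack"),
     ("manufacturer", "aspyr media"),
     ("price", 24.99))
e2 = (("title", "aspyr media inc sims 2 glamour life stuff pack"),
      ("price", float("nan")))
```

Applied to `e` and `e2`, the function reproduces the documents t and t' given in the example; the NaN price of `e2` is dropped.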
In step S104, the similarity calculation unit 13 calculates entailment scores for the document pair (t, t') using the entailment recognition model, and then calculates the similarity s from the entailment scores. As an example, the similarity is s = M(t, t') × M(t', t). In other words, the similarity s is the product of the entailment score M(t, t') for the entailment relation "if document t holds, then document t' holds" and the entailment score M(t', t) for the entailment relation "if document t' holds, then document t holds". However, the similarity s is not limited to this example and may be another value. For example, the similarity s may be the larger of the entailment scores M(t, t') and M(t', t), or the sum of the two.
If the record e and the record e' included in the record pair (e, e') are identical, then e ⊂ e' and e' ⊂ e. In this processing example, the similarity calculation unit 13 uses this relationship to calculate the similarity.
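The ways of combining the two directional entailment scores (product, maximum, or sum) can be sketched as follows. The entailment recognition model M is not reproduced here: `toy_score` is a purely illustrative stand-in based on word overlap, and all names are hypothetical.

```python
def entailment_similarity(score, t, t2, mode="product"):
    """Combine the entailment scores M(t, t') and M(t', t) into a similarity."""
    forward = score(t, t2)   # M(t, t')
    backward = score(t2, t)  # M(t', t)
    if mode == "product":
        return forward * backward
    if mode == "max":
        return max(forward, backward)
    if mode == "sum":
        return forward + backward
    raise ValueError(f"unknown mode: {mode}")

def toy_score(premise, hypothesis):
    """Stand-in for M (an assumption): fraction of hypothesis words found in the premise."""
    premise_words = set(premise.lower().split())
    hypothesis_words = hypothesis.lower().split()
    if not hypothesis_words:
        return 0.0
    return sum(w in premise_words for w in hypothesis_words) / len(hypothesis_words)
```

The product is high only when entailment holds in both directions, which matches the observation that identical records satisfy both e ⊂ e' and e' ⊂ e. The maximum and the sum are more lenient when one record carries attributes the other lacks.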
(Processing example 5: paraphrase prediction model)
In this example, in step S101, the acquisition unit 11 reads a paraphrase prediction model. In step S103, the conversion unit 12 converts the record pair (e, e') into a document pair (t, t').
As an example, the conversion unit 12 converts a record
e = ((a_j, v_j))_{j=1,…,d}
into the document
t = "v_1 … v_d"
However, the conversion unit 12 does not include missing values in the document.
As an example, the conversion unit 12 converts the record pair (e, e') consisting of the records
e = ((title, sims 2 glamour life stuff pack), (manufacturer, aspyr media), (price, 24.99))
e' = ((title, aspyr media inc sims 2 glamour life stuff pack), (price, NaN))
into the document pair (t, t') consisting of the documents
t = "sims 2 glamour life stuff pack aspyr media 24.99"
t' = "aspyr media inc sims 2 glamour life stuff pack"
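The value-only conversion "v_1 … v_d" used for the paraphrase prediction model can be sketched as follows. The function names are hypothetical, and treating `None` and NaN as the missing values is an assumption.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing values (an assumption about the data encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def record_to_value_string(record):
    """Concatenate the non-missing attribute values of a record, separated by spaces."""
    return " ".join(str(value) for _attr, value in record if not is_missing(value))

e = (("title", "sims 2 glamour life stuff pack"),
     ("manufacturer", "aspyr media"),
     ("price", 24.99))
e2 = (("title", "aspyr media inc sims 2 glamour life stuff pack"),
      ("price", float("nan")))
```

Unlike the sentence form used in processing examples 1 and 4, this form drops the attribute names, so the two records are compared purely on their values.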
In step S104, the similarity calculation unit 13 calculates a paraphrase score for the document pair (t, t') using the paraphrase prediction model, and uses the calculated paraphrase score as the similarity s of the record pair. That is, in this processing example, the similarity calculation unit 13 calculates the similarity s by recasting the record pair as the question of whether the two documents are paraphrases of each other.
(Steps S105 and S106)
In step S105, the output unit 14 outputs the similarity s calculated by the similarity calculation unit 13. As an example, the output unit 14 outputs the similarity s by writing it to the storage unit 20A. In step S106, the identity determination unit 15A determines, based on the similarity s, whether the records included in the record pair are identical.
(Step S107)
In step S107, the integration unit 16A refers to the determination result of the identity determination unit 15A and generates integrated data from the first data x and the second data x'. As an example, the integrated data includes a record obtained by integrating the records of a record pair that the identity determination unit 15A has determined to be identical.
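The text does not specify how identity is decided from the similarity s or how two records are integrated, so the following is only one possible sketch under stated assumptions: identity is decided by comparing s against a hypothetical threshold, and integration keeps the first record's values while filling missing or absent attributes from the second record. All names are hypothetical.

```python
import math

def is_missing(value):
    """Treat None and NaN as missing values (an assumption about the data encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def is_same(similarity, threshold=0.5):
    """Assumed decision rule: records are identical when s reaches the threshold."""
    return similarity >= threshold

def merge_records(e, e2):
    """Assumed integration rule: prefer values from e, fill gaps from e2."""
    merged = {}
    for attr, value in list(e) + list(e2):
        if attr not in merged or is_missing(merged[attr]):
            merged[attr] = value
    return tuple(merged.items())

e = (("title", "sims 2 glamour life stuff pack"),
     ("manufacturer", "aspyr media"),
     ("price", 24.99))
e2 = (("title", "aspyr media inc sims 2 glamour life stuff pack"),
      ("price", float("nan")))
```

Other integration policies (for example, keeping the longer title or recording both source IDs) would fit the description equally well.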
<Effects of information processing device 1A>
In recent years, highly accurate models (inference models) trained on huge datasets have been released and applied to various tasks in the fields of natural language processing and image processing (text classification, question answering, image classification, and so on). For example, DITTO (see Non-Patent Document 1), a supervised machine learning model for name identification (entity matching), partially applies the pretrained language model BERT (Bidirectional Encoder Representations from Transformers). However, because data formats vary from task to task, name identification could not be performed with existing inference models alone. In particular, existing inference models could not perform name identification on records containing attributes that were not anticipated.
In contrast, according to this exemplary embodiment, the information processing device 1A calculates the similarity s for a record pair by converting the record pair into a format corresponding to the input of the model MA and inputting it into the model MA. Converting the records so that the name identification task is recast as a task in a field such as natural language processing or image processing makes it possible to exploit inference models trained on large amounts of data. That is, according to this exemplary embodiment, identity can be determined for records having various attributes without requiring training data for the model MA.
In the information processing device 1A according to this exemplary embodiment, the model MA is selected from a plurality of model candidates, and the conversion unit 12 generates the converted record pair by converting the record pair (e, e') into a format corresponding to the input of the model MA. Through this conversion, the record pair becomes data in a format that can be input to the model MA. That is, whatever attributes the records subject to similarity calculation contain, the similarity calculation unit 13 can calculate the similarity s by inputting the converted record pair into the model MA. Consequently, the information processing device 1A according to this exemplary embodiment can use the model MA to calculate similarities for records having various attributes without requiring training data for the model MA.
In the information processing device 1A according to this exemplary embodiment, the model MA includes a document classification model, and the conversion processing by the conversion unit 12 includes processing for generating the converted record pair by converting the record pair into documents. Therefore, the information processing device 1A according to this exemplary embodiment can use the model MA, which is a document classification model, to calculate the similarity of records having various attributes without training the model MA.
In the information processing device 1A according to this exemplary embodiment, the model MA includes an image classification model, and the conversion processing by the conversion unit 12 includes processing for generating the converted record pair by converting the record pair into images. By converting the record pair into images, the information processing device 1A can more suitably calculate the similarity between records that differ as character strings but look similar when written. For example, for a record containing the word "glamour" and a record containing the word "glqmour", the character strings differ, but the character shapes are similar, so a high similarity is calculated. Thus, in addition to the effects of the information processing device 1 according to exemplary embodiment 1, the information processing device 1A according to this exemplary embodiment can calculate a similarity that reflects the degree of similarity in character shape.
In the information processing device 1A according to this exemplary embodiment, the model MA includes a speech classification model, and the conversion processing by the conversion unit 12 includes processing for generating the converted record pair by converting the record pair into speech. By converting the record pair into speech, the information processing device 1A can more suitably calculate the similarity between records that differ as character strings but have similar phonemes. For example, for a record containing the word "glamour" and a record containing the word "glamar", the character strings differ, but the words are pronounced similarly, so a high similarity is calculated. Thus, in addition to the effects of the information processing device 1 according to exemplary embodiment 1, the information processing device 1A according to this exemplary embodiment can calculate a similarity that reflects the degree of similarity in sound.
In the information processing device 1A according to this exemplary embodiment, the model MA includes a graph classification model, and the conversion processing by the conversion unit 12 includes processing for generating the converted record pair by converting the record pair into graphs. By converting the record pair into graphs, the information processing device 1A can use the model MA, which is a graph classification model, to calculate the similarity of records having various attributes without training the model MA.
[Exemplary embodiment 3]
A third exemplary embodiment of the present invention is described in detail with reference to the drawings. Components having the same functions as those described in exemplary embodiments 1 and 2 are given the same reference numerals, and their description is not repeated.
<Configuration of information processing device 1B>
FIG. 10 is a block diagram showing the configuration of the information processing device 1B according to this exemplary embodiment. The control unit 10B of the information processing device 1B includes an acquisition unit 11B, a conversion unit 12B, a similarity calculation unit 13B, an output unit 14, an identity determination unit 15A, and an integration unit 16A. The storage unit 20B stores the model MB in addition to the first data x, the second data x', and the similarity s.
The acquisition unit 11B acquires auxiliary records in addition to the record pair (e, e'). An auxiliary record is a supplementary record used to calculate the similarity of the record pair (e, e'). As an example, an auxiliary record is a record included in the first data x other than the record e of the record pair (e, e'). As another example, an auxiliary record is a record included in the second data x' other than the record e' of the record pair (e, e').
The conversion unit 12B generates a converted record pair by converting the record pair acquired by the acquisition unit 11B. The conversion unit 12B also generates converted auxiliary records by converting the auxiliary records. As an example, the conversion unit 12B converts the auxiliary records into data representing documents, images, speech, or graphs.
More specifically, as an example, the conversion unit 12B generates the converted record pair by converting one record e of the record pair (e, e') into a question sentence, and converting the other record e' of the record pair (e, e') and each of the auxiliary records into a response sentence.
The similarity calculation unit 13B calculates the similarity for the converted record pair by inputting the converted record pair and the converted auxiliary records into the model MB. As an example, the model MB includes a question answering model that takes a question sentence and response sentences as input.
(Question answering model)
A question answering model is a model that, given a question sentence, extracts an answer sentence from a supplied document and outputs it. As an example, the question answering model is a model generated by a technique called TANDA, described in Non-Patent Document 5. As an example, the input of the question answering model includes a question sentence and a document. An example of the question sentence is "Where is NEC's headquarters?" An example of the document is "NEC Corporation (Japanese: Nippon Denki) is an electronics manufacturer of the Sumitomo Group headquartered in Shiba 5-chome, Minato-ku, Tokyo. It is one of the constituent stocks of the Nikkei Stock Average."
As an example, the output of the model includes an answer sentence and a score. An example of the answer sentence is "Shiba 5-chome, Minato-ku, Tokyo". As an example, the score is a real number from 0 to 1. When the output of the model is obtained, a score may be calculated for each word. For example, the question answering model calculates a score of "0.1" for "NEC Corporation", a score of "0.02" for "Sumitomo Group", and a score of "0.08" for "Nikkei Stock Average".
<Flow of information processing method S100B>
FIG. 11 is a flowchart showing the flow of the information processing method S100B executed by the information processing device 1B. Some steps may be executed in parallel or in a different order. Content that has already been described is not repeated.
The information processing method S100B includes step S101B, step S102, step S102B, step S103B, step S104B, step S105, step S106, and step S107.
In step S101B, the acquisition unit 11B reads the model MB. In step S102B, the acquisition unit 11B reads the auxiliary records.
In step S103B, the conversion unit 12B generates a converted record pair by converting the record pair, and generates converted auxiliary records by converting the auxiliary records. In step S104B, the similarity calculation unit 13B calculates the similarity for the converted record pair by inputting the converted record pair and the converted auxiliary records into the model MB.
(Processing example 6 of steps S101 to S104B: question answering model)
A processing example that uses a question answering model is described here as an example of the processing in steps S101 to S104B. In this example, in step S101, the similarity calculation unit 13B reads a question answering model. In steps S102 and S102B, the acquisition unit 11B reads the record pair (e, e') and the auxiliary records R = {e_1, …, e_k} (k is a natural number). As an example, the auxiliary records R are the set of all records included in the second data x'. However, the auxiliary records R are not limited to this example and may be another set of records. For example, the auxiliary records R may be a set of records selected from the second data x' by a randomized algorithm. The auxiliary records R may also be a blocked record set, such as the set of records extracted from the second data x' that contain a word in common with the record e.
In this processing example, the auxiliary records R include the record e' of the record pair (e, e'). As an example, the auxiliary records are the set of records r1 to r3 of the table TBL2 in FIG. 6, that is,
R = {((title, adobe photoshop elements 4.0 photo-editing software for mac), (price, 85.95)), ((title, aspyr media inc sims 2 glamour life stuff pack), (price, NaN)), ((title, final-draft final draft av 2.5 screenwriting software mac/win screen writing software), (price, 199.95))}.
In this processing example, in step S103B, the conversion unit 12B converts the record e and the auxiliary records R (e' ∈ R). More specifically, the conversion unit 12B converts the record e into a question sentence q = T1(e). The question sentence q is preferably an open question of the so-called 5W1H type. The conversion unit 12B also converts the auxiliary records R into a document containing a plurality of answer sentences.
As an example, the conversion unit 12B converts a record e = ((a_1, v_1), …, (a_d, v_d)) into
T1(e) = "What is characterized as v_1 of a_1, … and v_d of a_d?"
The conversion unit 12B also converts the auxiliary records R = {e_1, …, e_k} into the document
T2(R) = "T3(e_1). T3(e_2). … T3(e_k)."
Here, each answer sentence T3(e_j) (1 ≤ j ≤ k) included in the document T2(R) is
T3(e_j) = "{ID of e_j} is characterized as v_1 of a_1, …, and v_d of a_d"
where {ID of e_j} is a unique ID assigned to the record e_j ∈ R. However, the conversion unit 12B does not include missing values in the document during conversion.
 変換部12Bは、一例として、
e=((title, sims 2 glamour life stuff pack),(manufacturer, aspyr media),(price, 24.99))
を、
q=“What is characterized as title of sims 2 glamour life stuff pack, manufacturer of aspyr media, and price of 24.99?”
に変換する。また、補助レコードR={((title, adobe photoshop elements 4.0 photo-editing software for mac), (price, 85.95)), ((title, aspyr media inc sims 2 glamour life stuff pack), (price, NaN)), ((title, final-draft final draft av 2.5 screenwriting software mac/win screen writing software), (price, 199.95))}
を、文書c=“r1 is characterized as title of adobe photoshop elements 4.0 photo-editing software for mac and price of 85.95. r2 is characterized as title of aspyr media inc sims 2 glamour life stuff pack. r3 is characterized as title of final-draft final draft av 2.5 screenwriting software mac/win screen writing software and price of 199.95.”
に変換する。
As an example, the conversion unit 12B converts
e = ((title, sims 2 glamour life stuff pack), (manufacturer, aspyr media), (price, 24.99))
into
q = “What is characterized as title of sims 2 glamour life stuff pack, manufacturer of aspyr media, and price of 24.99?”
It also converts the auxiliary records R = {((title, adobe photoshop elements 4.0 photo-editing software for mac), (price, 85.95)), ((title, aspyr media inc sims 2 glamour life stuff pack), (price, NaN)), ((title, final-draft final draft av 2.5 screenwriting software mac/win screen writing software), (price, 199.95))}
into the document c = “r1 is characterized as title of adobe photoshop elements 4.0 photo-editing software for mac and price of 85.95. r2 is characterized as title of aspyr media inc sims 2 glamour life stuff pack. r3 is characterized as title of final-draft final draft av 2.5 screenwriting software mac/win screen writing software and price of 199.95.”
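The conversion in this worked example can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of the conversion unit 12B: the function names (`to_question`, `to_answer`, `to_document`), the `r1`, `r2`, … ID scheme, and the treatment of `None` and NaN as missing values are hypothetical. The phrase order ("attribute of value") and the way phrases are joined follow the concrete example above.

```python
import math

def is_missing(value):
    # Assumption: None and float NaN both count as missing values.
    return value is None or (isinstance(value, float) and math.isnan(value))

def phrase(record):
    # Build "attr of value" phrases, skipping missing values.
    parts = [f"{attr} of {value}" for attr, value in record if not is_missing(value)]
    if len(parts) <= 1:
        return "".join(parts)
    if len(parts) == 2:
        return " and ".join(parts)
    return ", ".join(parts[:-1]) + ", and " + parts[-1]

def to_question(record):
    # Corresponds to T1(e): an open question built from the record.
    return f"What is characterized as {phrase(record)}?"

def to_answer(record_id, record):
    # Corresponds to T3(e_j): one answer sentence per auxiliary record.
    return f"{record_id} is characterized as {phrase(record)}."

def to_document(aux_records):
    # Corresponds to T2(R): answer sentences concatenated into one document.
    return " ".join(to_answer(f"r{j}", rec) for j, rec in enumerate(aux_records, 1))

e = [("title", "sims 2 glamour life stuff pack"),
     ("manufacturer", "aspyr media"),
     ("price", "24.99")]
R = [
    [("title", "adobe photoshop elements 4.0 photo-editing software for mac"),
     ("price", "85.95")],
    [("title", "aspyr media inc sims 2 glamour life stuff pack"),
     ("price", float("nan"))],
    [("title", "final-draft final draft av 2.5 screenwriting software "
               "mac/win screen writing software"),
     ("price", "199.95")],
]
q = to_question(e)
c = to_document(R)
```

With these inputs, `q` and `c` reproduce the question sentence and the document of the example, and the NaN price of the second auxiliary record is omitted.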
 また、ステップS104Bにおいて、類似度算出部13Bは、質問応答モデルに質問文qと文書cとを入力する。質問応答モデルは、入力された質問文qの回答が文書cから抽出した回答文T3(e_j)(1≦j≦k)である確信度を示すスコアを出力する。類似度算出部13Bは、質問応答モデルが出力したスコアに基づき類似度sを算出する。類似度sは、一例として、MB(q,c,{ID of e´})、すなわちレコード対(e,e´)に含まれるレコードe´が回答文である確信度である。ただし、類似度sはこの例に限られず、類似度算出部13Bは他の手法により類似度sを算出してもよい。類似度算出部13Bは、一例として、レコードeを質問文にした際のスコアと、レコードe´を質問文にした際のスコアとの和を類似度としてもよい。 Also, in step S104B, the similarity calculation unit 13B inputs the question sentence q and the document c to the question answering model. The question answering model outputs a score indicating the degree of certainty that the answer to the input question sentence q is the answer sentence T3(e_j) (1≦j≦k) extracted from the document c. The similarity calculator 13B calculates the similarity s based on the score output by the question answering model. The similarity s is, for example, MB(q, c, {ID of e'}), that is, the confidence that the record e' included in the record pair (e, e') is an answer sentence. However, the similarity s is not limited to this example, and the similarity calculation unit 13B may calculate the similarity s by another method. As an example, the similarity calculation unit 13B may take the sum of the score when the record e is used as the question and the score when the record e' is used as the question as the degree of similarity.
 図12は、質問応答モデルを用いた類似度の算出処理の概念図である。図12の例では、変換部12Bがレコードeと補助レコードRとを質問文と文書に変換し、類似度算出部13Bが質問文と文書とを質問応答モデルであるモデルMBに入力することにより、類似度sを算出する。このように、本処理例では、情報処理装置1Bは、レコードを質問応答の形式に落とし込むことによって、レコードの類似度を算出する。 FIG. 12 is a conceptual diagram of similarity calculation processing using the question answering model. In the example of FIG. 12, the conversion unit 12B converts the record e and the auxiliary record R into a question sentence and a document, and the similarity calculation unit 13B inputs the question sentence and the document into the model MB, which is a question-answer model. , the similarity s is calculated. As described above, in this processing example, the information processing apparatus 1B calculates the similarity of the records by converting the records into the question-and-answer format.
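The flow of step S104B can be illustrated with the sketch below. The question answering model itself is not specified here, so `toy_qa_model` is a purely hypothetical stand-in that scores each answer sentence by token overlap with the question; a real system would use a pretrained question answering model instead. The function names and the dictionary-based interface are likewise assumptions made for illustration.

```python
from typing import Callable, Dict

# Assumed interface: a QA model maps (question, {answer_id: sentence})
# to a confidence score per answer ID.
QAModel = Callable[[str, Dict[str, str]], Dict[str, float]]

def tokens(text: str) -> set:
    # Lowercase, drop commas and the trailing "." or "?", split on spaces.
    return set(text.lower().replace(",", "").rstrip(".?").split())

def toy_qa_model(question: str, answers: Dict[str, str]) -> Dict[str, float]:
    # Stand-in scorer (NOT a real QA model): Jaccard overlap of tokens.
    q = tokens(question)
    scores = {}
    for answer_id, sentence in answers.items():
        a = tokens(sentence)
        scores[answer_id] = len(q & a) / len(q | a)
    return scores

def similarity(model: QAModel, question: str,
               answers: Dict[str, str], target_id: str) -> float:
    # s = MB(q, c, {ID of e'}): the score assigned to the answer for e'.
    return model(question, answers)[target_id]

question = ("What is characterized as title of sims 2 glamour life stuff pack, "
            "manufacturer of aspyr media, and price of 24.99?")
answers = {
    "r1": "r1 is characterized as title of adobe photoshop elements 4.0 "
          "photo-editing software for mac and price of 85.95.",
    "r2": "r2 is characterized as title of aspyr media inc sims 2 glamour "
          "life stuff pack.",
    "r3": "r3 is characterized as title of final-draft final draft av 2.5 "
          "screenwriting software mac/win screen writing software and price of 199.95.",
}
scores = toy_qa_model(question, answers)
best_id = max(scores, key=scores.get)
```

Even with this crude stand-in, the answer sentence for r2 receives the highest score for the example question, which is the behaviour the processing example relies on.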
 <情報処理装置1Bの効果>
 以上のように、本例示的実施形態に係る情報処理装置1Bにおいては、取得部11Bは補助レコードを更に取得し、変換部12Bは補助レコードを変換することによって変換済補助レコードを生成し、類似度算出部13Bは上記変換済レコード対及び上記変換済補助レコードをモデルMBに入力することによって、上記変換済レコード対に関する類似度を算出する構成が採用されている。このため、本例示的実施形態に係る情報処理装置1Bによれば、モデルMBを訓練することなく、モデルMBを用いて様々な属性を有するレコードについて類似度を算出できるという効果が得られる。
<Effects of Information Processing Device 1B>
As described above, in the information processing apparatus 1B according to the present exemplary embodiment, the acquisition unit 11B further acquires auxiliary records, the conversion unit 12B generates converted auxiliary records by converting the auxiliary records, and the similarity calculation unit 13B calculates the similarity for the converted record pair by inputting the converted record pair and the converted auxiliary records into the model MB. Therefore, the information processing apparatus 1B according to the present exemplary embodiment can use the model MB to calculate similarities for records having various attributes without training the model MB.
 また、本例示的実施形態に係る情報処理装置1Bにおいては、モデルMBには、質問文と応答文とを入力とする質問応答モデルが含まれており、変換部12Bは、上記レコード対に含まれる一方のレコードを質問文に変換し、上記レコード対に含まれる他方のレコード及び上記補助レコードの各々を応答文に変換することによって上記変換済レコード対を生成するという構成が採用されている。このため、本例示的実施形態に係る情報処理装置1Bによれば、質問応答モデルを訓練することなく、様々な属性を有するレコードについて質疑応答モデルを用いて類似度を算出できるという効果が得られる。 Further, in the information processing apparatus 1B according to the present exemplary embodiment, the model MB includes a question answering model that takes a question sentence and response sentences as input, and the conversion unit 12B generates the converted record pair by converting one record of the record pair into a question sentence and converting the other record of the record pair and each of the auxiliary records into a response sentence. Therefore, the information processing apparatus 1B according to the present exemplary embodiment can calculate similarities for records having various attributes using the question answering model, without training that model.
 〔例示的実施形態4〕
 本発明の第4の例示的実施形態について、図面を参照して詳細に説明する。なお、例示的実施形態1~3にて説明した構成要素と同じ機能を有する構成要素については、同じ符号を付記し、その説明を繰り返さない。
[Exemplary embodiment 4]
A fourth exemplary embodiment of the invention will now be described in detail with reference to the drawings. Components having the same functions as those described in exemplary embodiments 1 to 3 are denoted by the same reference numerals, and description thereof will not be repeated.
 <情報処理装置1Cの構成>
 図13は、本例示的実施形態に係る情報処理装置1Cの構成を示すブロック図である。情報処理装置1Cの制御部10Cは、取得部11、変換部12、類似度算出部13C、類似度統合部17C、出力部14C、同一性判定部15A及び統合部16Aを備える。また、記憶部20Cは、第1データx、第2データx´及び類似度sに加えてモデルMCを記憶する。
<Configuration of information processing device 1C>
FIG. 13 is a block diagram showing the configuration of an information processing device 1C according to this exemplary embodiment. The control unit 10C of the information processing device 1C includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13C, a similarity integration unit 17C, an output unit 14C, an identity determination unit 15A, and an integration unit 16A. The storage unit 20C also stores the model MC in addition to the first data x, the second data x', and the similarity s.
 (類似度算出部13C)
 類似度算出部13Cは、ひとつのレコード対(e,e´)に関して複数の類似度siを算出する。類似度算出部13Cは、一例として、モデルMCに対して、レコード対(e,e´)に含まれる2つのレコードを互いに入れ替えずに入力することによって第1の類似度s1を算出する。また、類似度算出部13Cは、モデルMCに対して、レコード対(e,e´)に含まれる2つのレコードを互いに入れ替えてから入力することによって第2の類似度s2を算出する。
(Similarity calculator 13C)
The similarity calculator 13C calculates a plurality of similarities si for one record pair (e, e'). As an example, the similarity calculator 13C calculates the first similarity s1 by inputting two records included in the record pair (e, e') to the model MC without interchanging them. Further, the similarity calculation unit 13C calculates the second similarity s2 by replacing the two records included in the record pair (e, e') with each other and inputting them to the model MC.
 上述の例示的実施形態2で説明した処理例4~5、及び例示的実施形態3で説明した処理例6では、レコード対(e,e´)の類似度は、レコード対(e´,e)の類似度と異なる。そこで、本例示的実施形態では、類似度算出部13Cは、レコード対(e,e´)の類似度とレコード対(e´,e)の類似度とをそれぞれ算出し、それらの類似度を参照して同一性を判定する。 In processing examples 4 and 5 described in exemplary embodiment 2 and processing example 6 described in exemplary embodiment 3, the similarity of the record pair (e, e') differs from the similarity of the record pair (e', e). Therefore, in this exemplary embodiment, the similarity calculation unit 13C calculates both the similarity of the record pair (e, e') and the similarity of the record pair (e', e), and identity is determined with reference to these similarities.
 ただし、類似度算出部13Cが複数の類似度siを算出する手法は上述した例に限られず、類似度算出部13Cは他の手法により複数の類似度siを算出してもよい。例えば、類似度算出部13Cは複数のモデルを用いて複数の類似度siを算出してもよい。この場合、例えば、変換部12が1つのレコード対に対して複数の変換を施し、類似度算出部13Cが変換後のレコード対をそれぞれのモデル(文書分類モデル、画像分類モデル、・・・)に入力することにより、複数の類似度siを算出してもよい。 However, the method by which the similarity calculation unit 13C calculates the plurality of similarities si is not limited to the example described above, and the similarity calculation unit 13C may calculate the plurality of similarities si by other methods. For example, the similarity calculation unit 13C may calculate the plurality of similarities si using a plurality of models. In this case, for example, the conversion unit 12 may apply a plurality of conversions to one record pair, and the similarity calculation unit 13C may calculate the plurality of similarities si by inputting each converted record pair into the corresponding model (a document classification model, an image classification model, and so on).
 また、類似度算出部13Cは、ひとつのレコード対を複数の変換方法で変換することにより複数の変換済レコード対を生成し、複数の変換済レコード対をひとつのモデルに入力することにより、複数の類似度siを算出してもよい。 The similarity calculation unit 13C may also generate a plurality of converted record pairs by converting one record pair with a plurality of conversion methods, and calculate the plurality of similarities si by inputting the plurality of converted record pairs into a single model.
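The idea of obtaining two similarities from one record pair by swapping the inputs of an asymmetric model can be sketched as below. The `containment` function is a hypothetical stand-in for an asymmetric model such as an entailment recognition model: it returns the fraction of tokens of its second argument that also appear in its first, so the result depends on the input order. The names and the toy scoring rule are assumptions, not the model MC itself.

```python
from typing import Callable, Tuple

def containment(first: str, second: str) -> float:
    # Asymmetric toy score: share of the second text's tokens found in the first.
    a, b = set(first.split()), set(second.split())
    return len(a & b) / len(b) if b else 0.0

def directional_similarities(model: Callable[[str, str], float],
                             e: str, e_prime: str) -> Tuple[float, float]:
    s1 = model(e, e_prime)   # records in the original order
    s2 = model(e_prime, e)   # records swapped
    return s1, s2

e = "sims 2 glamour life stuff pack"
e_prime = "aspyr media inc sims 2 glamour life stuff pack"
s1, s2 = directional_similarities(containment, e, e_prime)
```

Here `s1` and `s2` differ because every token of `e` occurs in `e_prime` but not the other way round, which is the kind of asymmetry that motivates computing both directions.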
 (類似度統合部17C)
 類似度統合部17Cは、複数の類似度siを統合して統合後の類似度sとする。類似度統合部17Cは、一例として、複数の類似度siの平均又は加重平均をとることによって統合後の類似度sを算出する。ただし、類似度統合部17Cが複数の類似度siを統合する手法は上述した例に限られず、類似度統合部17Cは他の手法により統合後の類似度sを算出してもよい。例えば、類似度統合部17Cは、複数の類似度siの総和又は積算値を統合後の類似度sとしてもよい。
(Similarity integration unit 17C)
The similarity integration unit 17C integrates the plurality of similarities si into an integrated similarity s. As an example, the similarity integration unit 17C calculates the integrated similarity s by taking the average or the weighted average of the plurality of similarities si. However, the method by which the similarity integration unit 17C integrates the plurality of similarities si is not limited to this example, and the similarity integration unit 17C may calculate the integrated similarity s by another method. For example, the similarity integration unit 17C may use the sum or the accumulated value of the plurality of similarities si as the integrated similarity s.
 本明細書において、類似度統合部17Cは、対象レコード対に関する複数の類似度siに基づき、当該対象レコード対に関する同一性を判定する構成であるともいえる。 In this specification, it can be said that the similarity integration unit 17C is configured to determine the identity of the target record pair based on a plurality of similarities si regarding the target record pair.
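The integration methods named for the similarity integration unit 17C (average, weighted average, sum) can be sketched as a single helper. The function name `integrate`, its `method` argument, and the example weights are hypothetical choices made for illustration.

```python
from typing import Optional, Sequence

def integrate(similarities: Sequence[float], method: str = "mean",
              weights: Optional[Sequence[float]] = None) -> float:
    # Combine several similarities s_i into one integrated similarity s.
    if not similarities:
        raise ValueError("at least one similarity is required")
    if method == "sum":
        return float(sum(similarities))
    if method == "mean":
        return sum(similarities) / len(similarities)
    if method == "weighted_mean":
        if weights is None or len(weights) != len(similarities):
            raise ValueError("weights must match the similarities in length")
        return sum(w * s for w, s in zip(weights, similarities)) / sum(weights)
    raise ValueError(f"unknown method: {method}")

s_sum = integrate([9, 10], "sum")
s_mean = integrate([9, 10], "mean")
s_weighted = integrate([9, 10], "weighted_mean", weights=[1, 3])
```

Which method is appropriate depends on how the integrated value is used afterwards; a plain sum is enough when only the ranking of record pairs matters.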
 (出力部14C)
 出力部14Cは、上記複数の類似度siを統合して得られる統合後の類似度sを出力する。出力部14Cは、一例として、記憶部20Cに類似度sを書き込むことにより出力する。
(Output section 14C)
The output unit 14C outputs an integrated similarity s obtained by integrating the plurality of similarities si. As an example, the output unit 14C outputs the similarity s by writing it into the storage unit 20C.
 (モデルMC)
 モデルMCは、類似度を算出するためのモデルである。モデルMCは、一例として、当該モデルに入力される2つの要素の互いの入れ替えに対して非対称性を有するモデルである。モデルMCは、一例として、含意認識モデル、言い換え予測モデル、及び質問応答モデルの少なくともいずれかひとつを含む。
(Model MC)
The model MC is a model for calculating the degree of similarity. The model MC is, as an example, a model that is asymmetric with respect to swapping the two elements input to the model. The model MC includes, as an example, at least one of an entailment recognition model, a paraphrase prediction model, and a question answering model.
 図14は、類似度算出部13Cが算出する類似度siの具体例を示す図である。図14において、レコード対(L1,R1)について類似度算出部13Cが算出した第1の類似度s1は「9」であり、2つのレコードを互いに入れ替えたレコード対(R1,L1)について類似度算出部13Cが算出した第2の類似度s2は「10」である。このように、類似度算出部13Cはひとつのレコード対について第1の類似度s1と第2の類似度s2とを算出し、同一性判定部15Aは第1の類似度s1と第2の類似度s2とが共に他のレコード対と比較して最も高ければレコードが同一であると判定する。図14の例では、同一性判定部15Aは、レコードL1とレコードR1が同一であると判定するとともに、レコードL2とレコードR3が同一であると判定する。 FIG. 14 is a diagram showing a specific example of the similarities si calculated by the similarity calculation unit 13C. In FIG. 14, the first similarity s1 calculated by the similarity calculation unit 13C for the record pair (L1, R1) is "9", and the second similarity s2 calculated for the record pair (R1, L1), in which the two records are swapped, is "10". In this way, the similarity calculation unit 13C calculates the first similarity s1 and the second similarity s2 for one record pair, and the identity determination unit 15A determines that the records are identical if the first similarity s1 and the second similarity s2 are both the highest in comparison with the other record pairs. In the example of FIG. 14, the identity determination unit 15A determines that record L1 and record R1 are identical and that record L2 and record R3 are identical.
 図15は、類似度算出部13Cが算出する類似度siの他の例を示す図である。図15において、類似度統合部17Cは、双方向の類似度を集約する。類似度統合部17Cは、一例として、レコード対(L1,R1)の類似度s1と、レコード対(R1,L1)の類似度s2との和を類似度sとする。図15の例では、レコード対(L1,R1)の類似度sは「10」と「9」の和、すなわち「19」であり、レコード対(L1,R2)の類似度sは「9」と「7」の和、すなわち「16」である。また、レコード対(L2,R2)の類似度sは「9」と「4」との和、すなわち「13」であり、レコード対(L2,R3)の類似度sは「8」と「8」との和、すなわち「16」である。 FIG. 15 is a diagram showing another example of the similarities si calculated by the similarity calculation unit 13C. In FIG. 15, the similarity integration unit 17C aggregates the similarities of both directions. As an example, the similarity integration unit 17C takes the sum of the similarity s1 of the record pair (L1, R1) and the similarity s2 of the record pair (R1, L1) as the similarity s. In the example of FIG. 15, the similarity s of the record pair (L1, R1) is the sum of "10" and "9", that is, "19", and the similarity s of the record pair (L1, R2) is the sum of "9" and "7", that is, "16". Likewise, the similarity s of the record pair (L2, R2) is the sum of "9" and "4", that is, "13", and the similarity s of the record pair (L2, R3) is the sum of "8" and "8", that is, "16".
 図15の例では、同一性判定部15Aは、図14の例と同様に、レコードL1とレコードR1が同一であると判定するとともに、レコードL2とレコードR3が同一であると判定する。この例では更に、同一性判定部15Aは、同一であると判定したレコード対の中で類似度sが所定の閾値以上であるレコード対についても、同一であると判定する。ここで、閾値は例えば、同一であると判定されたレコード対の類似度sのうちの最小値(図15の例では、「13」)である。閾値は、同一と非同一の割合が既知であるなら、その割合に基づいて決められてもよい。図15の例において閾値が「13」である場合、同一性判定部15Aは、類似度sが「13」以上であるレコード対、すなわち類似度sが「16」であるレコード対(L1,R2)についてもレコード同士が同一であると判定する。 In the example of FIG. 15, as in the example of FIG. 14, the identity determination unit 15A determines that record L1 and record R1 are identical and that record L2 and record R3 are identical. In this example, the identity determination unit 15A further determines that record pairs whose similarity s is equal to or higher than a predetermined threshold, relative to the record pairs determined to be identical, are also identical. Here, the threshold is, for example, the minimum of the similarities s of the record pairs determined to be identical ("13" in the example of FIG. 15). If the ratio of identical to non-identical pairs is known, the threshold may be determined based on that ratio. When the threshold is "13" in the example of FIG. 15, the identity determination unit 15A determines that the records of a record pair whose similarity s is "13" or higher, namely the record pair (L1, R2) whose similarity s is "16", are also identical.
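The determination described for FIG. 15 can be sketched as a mutual-best-match step followed by an optional threshold step. The code below uses only the four summed similarities stated in the text; the function names are hypothetical, and the threshold of 14 used in the usage lines is an arbitrary value chosen for illustration rather than the value given for the figure.

```python
from typing import Dict, Optional, Set, Tuple

Pair = Tuple[str, str]

def mutual_best_pairs(sim: Dict[Pair, float]) -> Set[Pair]:
    # A pair (l, r) is kept when r is the best match for l AND l is the best for r.
    best_r: Dict[str, str] = {}
    best_l: Dict[str, str] = {}
    for (l, r), s in sim.items():
        if l not in best_r or s > sim[(l, best_r[l])]:
            best_r[l] = r
        if r not in best_l or s > sim[(best_l[r], r)]:
            best_l[r] = l
    return {(l, r) for l, r in best_r.items() if best_l[r] == l}

def identical_pairs(sim: Dict[Pair, float],
                    threshold: Optional[float] = None) -> Set[Pair]:
    matched = mutual_best_pairs(sim)
    if threshold is not None:
        # Additionally accept every pair whose similarity reaches the threshold.
        matched |= {pair for pair, s in sim.items() if s >= threshold}
    return matched

# Summed bidirectional similarities taken from the description of FIG. 15.
sim = {("L1", "R1"): 19, ("L1", "R2"): 16, ("L2", "R2"): 13, ("L2", "R3"): 16}
base = identical_pairs(sim)
extended = identical_pairs(sim, threshold=14)
```

Without a threshold the sketch yields the pairs (L1, R1) and (L2, R3); with the illustrative threshold it additionally accepts (L1, R2).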
 <情報処理装置1Cの効果>
 以上のように、本例示的実施形態に係る情報処理装置1Cにおいては、類似度算出部13Cは、上記レコード対に関して複数の類似度siを算出し、出力部14Cは、複数の類似度siを統合して得られる統合後の類似度sを出力する構成が採用されている。このため、本例示的実施形態に係る情報処理装置1Cによれば、レコード対の類似度sをより精度よく算出できるという効果が得られる。
<Effect of information processing device 1C>
As described above, in the information processing apparatus 1C according to this exemplary embodiment, the similarity calculation unit 13C calculates a plurality of similarities si with respect to the record pair, and the output unit 14C calculates the plurality of similarities si. A configuration for outputting the post-integration similarity s obtained by integration is adopted. Therefore, according to the information processing apparatus 1C according to the present exemplary embodiment, it is possible to obtain the effect that the similarity s of the record pair can be calculated more accurately.
 また、本例示的実施形態に係る情報処理装置1Cにおいては、モデルMCは、当該モデルに入力される2つの要素の互いの入れ替えに対して非対称性を有するモデルであり、類似度算出部13Cは、モデルMCに対して、レコード対(e,e´)に含まれる2つのレコードを互いに入れ替えずに入力することによって第1の類似度s1を算出し、モデルMCに対して、レコード対(e,e´)に含まれる2つのレコードを互いに入れ替えてから入力することによって第2の類似度s2を算出する構成が採用されている。このため、本例示的実施形態に係る情報処理装置1Cによれば、第1の類似度s1と第2の類似度s2とを統合することにより、レコードの類似度sをより精度よく算出できるという効果が得られる。 Further, in the information processing apparatus 1C according to this exemplary embodiment, the model MC is a model that is asymmetric with respect to swapping the two elements input to the model, and the similarity calculation unit 13C calculates the first similarity s1 by inputting the two records included in the record pair (e, e') into the model MC without swapping them, and calculates the second similarity s2 by inputting the two records into the model MC after swapping them. Therefore, by integrating the first similarity s1 and the second similarity s2, the information processing apparatus 1C according to the present exemplary embodiment can calculate the similarity s of the records more accurately.
 〔例示的実施形態5〕
 本発明の第5の例示的実施形態について、図面を参照して詳細に説明する。なお、例示的実施形態1~3にて説明した構成要素と同じ機能を有する構成要素については、同じ符号を付記し、その説明を繰り返さない。
[Exemplary embodiment 5]
A fifth exemplary embodiment of the present invention will now be described in detail with reference to the drawings. Components having the same functions as those described in exemplary embodiments 1 to 3 are denoted by the same reference numerals, and description thereof will not be repeated.
 <情報処理装置1Dの構成>
 図16は、本例示的実施形態に係る情報処理装置1Dの構成を示すブロック図である。情報処理装置1Dの制御部10Dは、取得部11、変換部12、類似度算出部13、出力部14、同一性判定部15A及び検索結果出力部18Dを備える。
<Configuration of information processing device 1D>
FIG. 16 is a block diagram showing the configuration of an information processing device 1D according to this exemplary embodiment. A control unit 10D of the information processing device 1D includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13, an output unit 14, an identity determination unit 15A, and a search result output unit 18D.
 本例示的実施形態に係る取得部11は、レコード対(e,e´)に含まれる第1のレコードeとして、ユーザからの入力データを取得する。ユーザからの入力データは、一例として、入出力部40Aに接続された入力装置(例えば、キーボード、マウス、等)により入力される。 The acquisition unit 11 according to this exemplary embodiment acquires input data from the user as the first record e included in the record pair (e, e'). Input data from the user is, for example, input by an input device (for example, a keyboard, a mouse, etc.) connected to the input/output unit 40A.
 また、取得部11は、レコード対(e,e´)に含まれる第2のレコードe´として、対象データに含まれる複数のレコードの1つを取得する。対象データは、検索対象のデータであり、一例として、1又は複数のテーブルを含む。 Also, the acquiring unit 11 acquires one of the plurality of records included in the target data as the second record e' included in the record pair (e, e'). The target data is data to be searched, and includes, for example, one or more tables.
 同一性判定部15Aは、第1のレコードeと、対象データに含まれる複数のレコードの各々とのレコード対に対して同一性予測を行う。 The identity determination unit 15A performs identity prediction for record pairs of the first record e and each of the plurality of records included in the target data.
 検索結果出力部18Dは、類似度算出部13が算出した類似度sに基づき、入力データに基づく検索結果であって、対象データを検索対象とする検索結果を出力する。検索結果出力部18Dは、一例として、同一性判定部15Aの判定結果を参照して、入力データに基づく検索結果であって、対象データを検索対象とする検索結果を出力する。検索結果出力部18Dは、一例として、入出力部40Aに接続された出力装置(ディスプレイ、プリンタ、等)に検索結果を出力する。また、検索結果出力部18Dは、通信部30Aを介して接続された他の装置に検索結果を送信することにより、検索結果を出力してもよい。また、検索結果出力部18Dは、検索結果を記憶部20A又は外部記憶装置に記憶することにより検索結果を出力してもよい。 Based on the degree of similarity s calculated by the degree of similarity calculation unit 13, the search result output unit 18D outputs the search results based on the input data and with the target data as the search target. As an example, the search result output unit 18D refers to the determination result of the identity determination unit 15A and outputs the search result based on the input data and the target data as the search target. For example, the search result output unit 18D outputs search results to an output device (display, printer, etc.) connected to the input/output unit 40A. Further, the search result output unit 18D may output the search result by transmitting the search result to another device connected via the communication unit 30A. Further, the search result output unit 18D may output search results by storing the search results in the storage unit 20A or an external storage device.
 図17は、検索結果出力部18Dが出力する画面表示の具体例を示す図である。図17の例で、入力データは、ユーザがテキストボックス51に入力する文字列であり、対象データは複数のレコードを有するテーブルT1及びテーブルT2である。同一性判定部15Aは、ユーザの入力データである第1のレコードeと、テーブルT1に含まれるレコード及びテーブルT2に含まれるレコードe´の各々とのレコード対に対して同一性を判定する。 FIG. 17 is a diagram showing a specific example of screen display output by the search result output unit 18D. In the example of FIG. 17, the input data is a character string that the user inputs into the text box 51, and the target data are tables T1 and T2 having a plurality of records. The identity determination unit 15A determines the identity of record pairs between the first record e, which is the user's input data, and each of the records included in the table T1 and the record e' included in the table T2.
 図17の例において、検索結果出力部18Dは、同一性判定部15Aの判定結果を参照して、入力データに基づく検索結果53、及び検索結果54を出力する。検索結果53は、「ポテチ」の文字列を入力データとして、テーブルT1から検索された検索結果である。検索結果54は、「ポテチ」の文字列を入力データとして、テーブルT2から検索された検索結果である。 In the example of FIG. 17, the search result output unit 18D refers to the determination result of the identity determination unit 15A and outputs search results 53 and 54 based on the input data. A search result 53 is a search result obtained by searching the table T1 using the character string "potato chips" as input data. A search result 54 is a search result obtained by searching the table T2 using the character string "potato chips" as input data.
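The search flow of this exemplary embodiment, in which the user's input is paired with every record of the target data and the pairs judged similar are returned per table, can be sketched as follows. Everything concrete in the sketch is an assumption: the table contents are invented sample data, `token_containment` is a toy stand-in for the model-based similarity, and the threshold rule stands in for the identity determination unit 15A.

```python
from typing import Callable, Dict, List

def token_containment(query: str, record: str) -> float:
    # Toy similarity: share of the query's tokens that appear in the record.
    q = set(query.lower().split())
    r = set(record.lower().split())
    return len(q & r) / len(q) if q else 0.0

def search(query: str, tables: Dict[str, List[str]],
           similarity: Callable[[str, str], float] = token_containment,
           threshold: float = 1.0) -> Dict[str, List[str]]:
    # Pair the query with each record of each table; keep records that pass.
    results: Dict[str, List[str]] = {}
    for table_name, records in tables.items():
        results[table_name] = [rec for rec in records
                               if similarity(query, rec) >= threshold]
    return results

# Hypothetical sample data; the actual contents of tables T1 and T2 are not given.
tables = {
    "T1": ["potato chips salt", "green tea"],
    "T2": ["chips potato consomme", "cola"],
}
hits = search("potato chips", tables)
```

The per-table result dictionary mirrors the separate search results shown for tables T1 and T2 in the screen example.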
 <情報処理装置1Dの効果>
 以上のように、本例示的実施形態に係る情報処理装置1Dにおいては、同一性判定部15Aの判定結果を参照して、入力データに基づく検索結果であって、対象データを検索対象とする検索結果を出力する構成が採用されている。このため、本例示的実施形態に係る情報処理装置1Dによれば、例示的実施形態1に係る情報処理装置1の奏する効果に加えて、入力データに基づく対象データからの検索をより好適に行うことができるという効果が得られる。
<Effects of information processing device 1D>
As described above, the information processing apparatus 1D according to the present exemplary embodiment is configured to refer to the determination result of the identity determination unit 15A and to output a search result that is based on the input data and that takes the target data as the search target. Therefore, in addition to the effects of the information processing apparatus 1 according to exemplary embodiment 1, the information processing apparatus 1D according to the present exemplary embodiment can more suitably search the target data based on the input data.
 情報処理装置1Dは、以下のようにも記載され得る。
 ユーザからの入力データと、対象データに含まれる複数のレコードの1つとをレコード対として取得する取得手段と、
 前記レコード対を変換することによって変換済レコード対を生成する変換手段と、
 前記変換済レコード対をモデルに入力することによって、変換済レコード対に関する類似度を算出する類似度算出手段と、
 前記類似度算出手段が算出した類似度を参照して、前記入力データに基づく検索結果であって、前記対象データを検索対象とする検索結果を出力する出力手段と、
を備えている情報処理装置。
The information processing device 1D can also be described as follows.
An information processing device comprising:
acquisition means for acquiring, as a record pair, input data from a user and one of a plurality of records included in target data;
conversion means for generating a converted record pair by converting the record pair;
similarity calculation means for calculating a similarity for the converted record pair by inputting the converted record pair into a model; and
output means for outputting, with reference to the similarity calculated by the similarity calculation means, a search result that is based on the input data and that takes the target data as the search target.
 〔変形例〕
 <変形例1>
 上述の各例示的実施形態では、情報処理装置1、1A、1B、1C、1D(以下「情報処理装置1等」という)は、第1データxに含まれるレコードeと第2データx´に含まれるレコードe´との同一性を判定した。情報処理装置1等が判定の対象とする複数のレコードはそれぞれ異なるデータに含まれるレコードであってもよく、また、共通のデータに含まれるレコードであってもよい。換言すると、情報処理装置1等は、1つのデータベースから同一のレコードを探し出す処理を実行してもよい。また、上述の例示的実施形態では、第1データxと第2データx´とを統合する場合について説明したが、情報処理装置1等は3つ以上のデータを統合してもよい。
[Modification]
<Modification 1>
In each of the exemplary embodiments described above, the information processing apparatuses 1, 1A, 1B, 1C, and 1D (hereinafter referred to as "the information processing apparatus 1 and the like") determine the identity of a record e included in the first data x and a record e' included in the second data x'. The plurality of records to be compared by the information processing apparatus 1 and the like may be records included in different data or records included in common data. In other words, the information processing apparatus 1 and the like may execute processing for finding identical records within a single database. Also, although the exemplary embodiments above describe the case of integrating the first data x and the second data x', the information processing apparatus 1 and the like may integrate three or more pieces of data.
 <変形例2>
 上述の各例示的実施形態において、情報処理装置1等が複数のモデル候補からモデルMA、MB、MC(以下「モデルM」という)を選択してもよく、また、ユーザが複数のモデル候補からモデルMを選択してもよい。情報処理装置1等がモデルMを選択するアルゴリズムは限定されないが、一例として、情報処理装置1等はルールベースでモデルMを選択してもよい。例えば、情報処理装置1等はレコード対の特徴に応じてモデルMを選択してもよい。ここで、レコード対の特徴は、一例として、レコード対に含まれるレコードの属性、レコードのデータサイズ、レコードが属するデータベースの種別、データベースの属性を含む。
<Modification 2>
In each of the exemplary embodiments described above, the information processing device 1 or the like may select models MA, MB, MC (hereinafter referred to as "model M") from a plurality of model candidates, and the user may Model M may be selected. The algorithm by which the information processing device 1 or the like selects the model M is not limited, but as an example, the information processing device 1 or the like may select the model M on a rule basis. For example, the information processing device 1 or the like may select the model M according to the characteristics of the record pair. Here, the characteristics of a record pair include, for example, the attribute of the record included in the record pair, the data size of the record, the type of database to which the record belongs, and the attribute of the database.
 <変形例3>
 上述の各例示的実施形態において、レコードe、e´が含まれるデータは、JSON又はXML等の半構造データであってもよい。半構造データに上記例示的実施形態に係る情報処理装置を適用することにより、書類データ又はウェブページの同一性判定を行うことができる。例えば、住宅情報を提供する住宅情報サイトにおいては、同一の物件についてのウェブページが複数作成されている場合がある。この場合、ウェブページについて同一性判定を行うことにより、ウェブページを物件ごとにまとめることができる。
<Modification 3>
In each of the exemplary embodiments described above, the data containing records e, e' may be semi-structured data such as JSON or XML. By applying the information processing apparatus according to the exemplary embodiment to semi-structured data, it is possible to determine the identity of document data or web pages. For example, on a housing information site that provides housing information, there are cases where multiple web pages are created for the same property. In this case, the web pages can be grouped for each property by performing identity determination on the web pages.
 この例において、レコードは一例として、対象であるサイトに含まれるウェブページである。例えば、レコードe={id1: value1, id2: {id2-1: value2-1, id2-2: value2-1}, id3: value3}
である場合、変換後の文書は、一例として、
“id1 is value1. id2-1 of id2 is value2-1. id2-2 of id2 is value2-1. id3 is value3.”
である。
In this example, a record is, for instance, a web page contained in the target site. For example, when the record is
e = {id1: value1, id2: {id2-1: value2-1, id2-2: value2-1}, id3: value3}
the converted document is, as an example,
“id1 is value1. id2-1 of id2 is value2-1. id2-2 of id2 is value2-1. id3 is value3.”
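A conversion of nested semi-structured data into sentences of this form can be sketched with a short recursive function. The name `flatten` and the handling of deeper nesting ("key of parent of grandparent") are assumptions that extend the single example given here.

```python
from typing import Any, Dict, List, Tuple

def flatten(obj: Dict[str, Any], parents: Tuple[str, ...] = ()) -> List[str]:
    # Emit one "<key> [of <parent> ...] is <value>." sentence per leaf value.
    sentences: List[str] = []
    for key, value in obj.items():
        if isinstance(value, dict):
            # Descend, remembering the enclosing key as the nearest parent.
            sentences += flatten(value, (key,) + parents)
        else:
            name = " of ".join((key,) + parents)
            sentences.append(f"{name} is {value}.")
    return sentences

e = {"id1": "value1",
     "id2": {"id2-1": "value2-1", "id2-2": "value2-1"},
     "id3": "value3"}
document = " ".join(flatten(e))
```

For the record `e` above, `document` is the same text as the converted document in the example.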
 <変形例4>
 また、本明細書に係るレコードは、例えば図18に示すようなグラフデータであってもよい。図18は、グラフデータの一例を示す図である。グラフデータに本明細書に係る情報処理装置1等を適用することにより、例えば顔照合を行うことができる。例えば、レコードが図18に示すグラフである場合、変換後の文書は、一例として、
“1 and 2 are linked. 1 and 4 are linked. 2 and 3 are linked. 2 and 4 are linked.”
である。
<Modification 4>
The record according to the present specification may also be graph data such as that shown in FIG. 18. FIG. 18 is a diagram showing an example of graph data. Applying the information processing apparatus 1 and the like according to the present specification to graph data makes it possible, for example, to perform face matching. For example, when the record is the graph shown in FIG. 18, the converted document is, as an example,
“1 and 2 are linked. 1 and 4 are linked. 2 and 3 are linked. 2 and 4 are linked.”
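Turning an edge list into such a document takes only a few lines. In the sketch below, the function name `graph_to_document` and the edge-list representation are assumptions; the edges are the ones that appear in the converted document above.

```python
from typing import Iterable, Tuple

def graph_to_document(edges: Iterable[Tuple[int, int]]) -> str:
    # One sentence per undirected edge, in the order the edges are given.
    return " ".join(f"{u} and {v} are linked." for u, v in edges)

edges = [(1, 2), (1, 4), (2, 3), (2, 4)]
document = graph_to_document(edges)
```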
 <変形例5>
 また、レコードが含まれるデータは、例えば図19に示すようなグラフデータベースであってもよい。グラフデータベースに本明細書に係る情報処理装置1等を適用することにより、例えば異なるSNS(Social Networking Service)のコミュニティの同一性を判定することができ、例えば犯罪組織の調査に応用できる。この例で、グラフデータベースが図19に示すものである場合、変換後の文書は一例として、
“Taro of age 23 follows Sakura of age 26. Taro of age 23 follows Emi of age 25. Sakura of age 26 follows Emi of age 25. Sakura of age 26 wrote via smartphone tweet of text “I’m sleepy.” date 20XX/YY/ZZ. Emi of age 25 follows Sakura of age 26. Emi of age 25 follows Taro of age 23.”
である。
<Modification 5>
The data containing the records may also be a graph database such as that shown in FIG. 19. Applying the information processing apparatus 1 and the like according to the present specification to a graph database makes it possible, for example, to determine whether communities on different SNSs (Social Networking Services) are identical, which can be applied, for example, to the investigation of criminal organizations. In this example, when the graph database is the one shown in FIG. 19, the converted document is, as an example,
“Taro of age 23 follows Sakura of age 26. Taro of age 23 follows Emi of age 25. Sakura of age 26 follows Emi of age 25. Sakura of age 26 wrote via smartphone tweet of text “I’m sleepy.” date 20XX/YY/ZZ. Emi of age 25 follows Sakura of age 26. Emi of age 25 follows Taro of age 23.”
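For a property graph, a comparable sketch describes each node together with its properties and emits one sentence per labelled edge. It covers only the "follows" relations of the example; the "wrote" edge, which carries its own properties, is left out. The data layout and the names `describe` and `property_graph_to_document` are assumptions.

```python
from typing import Dict, List, Tuple

def describe(name: str, nodes: Dict[str, Dict[str, object]]) -> str:
    # Render a node as "<name> of <property> <value>", e.g. "Taro of age 23".
    props = ", ".join(f"{k} {v}" for k, v in nodes[name].items())
    return f"{name} of {props}" if props else name

def property_graph_to_document(nodes: Dict[str, Dict[str, object]],
                               edges: List[Tuple[str, str, str]]) -> str:
    # One sentence per (source, label, target) edge.
    return " ".join(
        f"{describe(src, nodes)} {label} {describe(dst, nodes)}."
        for src, label, dst in edges
    )

nodes = {"Taro": {"age": 23}, "Sakura": {"age": 26}, "Emi": {"age": 25}}
edges = [("Taro", "follows", "Sakura"), ("Taro", "follows", "Emi"),
         ("Sakura", "follows", "Emi"), ("Emi", "follows", "Sakura"),
         ("Emi", "follows", "Taro")]
document = property_graph_to_document(nodes, edges)
```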
 <変形例6>
 上述の各例示的実施形態において、情報処理装置1等がモデルMを学習する学習フェーズを実行する構成であってもよい。モデルMの機械学習の手法は限定されないが、一例として、決定木ベース、線形回帰、又はニューラルネットワークの手法が用いられてもよく、また、これらのうちの2以上の手法が用いられてもよい。
<Modification 6>
In each of the exemplary embodiments described above, the information processing device 1 and the like may be configured to execute a learning phase in which the model M is trained. The machine learning method for the model M is not limited; as an example, a decision-tree-based method, linear regression, or a neural network may be used, and two or more of these methods may be used in combination.
 <変形例7>
 上述の各例示的実施形態において、モデルMの出力の前後に、学習可能なパラメータを備えた変換器を加える構成としてもよい。図20は、学習可能なパラメータを備えた学習済み変換器121、122をモデルMの出力の前後に設けた構成を概略的に示す図である。学習済み変換器121、122は学習可能なパラメータを備え、学習部(図示略)が訓練データを用いてレコード変換の仕方(文章の作り方又は補助レコードの数、等)、及び/又は、変換のパラメータを最適化するモデルである。学習済み変換器121、122を設けることにより、レコードの類似度をより精度よく算出することができる。
<Modification 7>
In each of the exemplary embodiments described above, converters with learnable parameters may be added before and after the output of the model M. FIG. 20 is a diagram schematically showing a configuration in which trained converters 121 and 122 with learnable parameters are provided before and after the output of the model M. The trained converters 121 and 122 are models with learnable parameters, for which a learning unit (not shown) uses training data to optimize the way records are converted (how sentences are composed, the number of auxiliary records, etc.) and/or the conversion parameters. Providing the trained converters 121 and 122 makes it possible to calculate the similarity of records more accurately.
 学習済み変換器121、122の機械学習の手法は限定されないが、一例として、決定木ベース、線形回帰、又はニューラルネットワークの手法が用いられてもよく、また、これらのうちの2以上の手法が用いられてもよい。また、学習済み変換器121、122は能動学習により生成されたモデルであってもよい。 The machine learning method of the trained converters 121, 122 is not limited, but as an example, a decision tree-based, linear regression, or neural network method may be used, and two or more of these methods may be used. may be used. Also, the learned converters 121 and 122 may be models generated by active learning.
 〔ソフトウェアによる実現例〕
 情報処理装置1、1A、1B、1C、1Dの一部又は全部の機能は、集積回路(ICチップ)等のハードウェアによって実現してもよいし、ソフトウェアによって実現してもよい。
[Example of realization by software]
Some or all of the functions of the information processing apparatuses 1, 1A, 1B, 1C, and 1D may be implemented by hardware such as integrated circuits (IC chips), or may be implemented by software.
 後者の場合、情報処理装置1、1A、1B、1C、1Dは、例えば、各機能を実現するソフトウェアであるプログラムの命令を実行するコンピュータによって実現される。このようなコンピュータの一例(以下、コンピュータCと記載する)を図18に示す。コンピュータCは、少なくとも1つのプロセッサC1と、少なくとも1つのメモリC2と、を備えている。メモリC2には、コンピュータCを情報処理装置1、1A、1B、1C、1Dとして動作させるためのプログラムPが記録されている。コンピュータCにおいて、プロセッサC1は、プログラムPをメモリC2から読み取って実行することにより、情報処理装置1、1A、1B、1C、1Dの各機能が実現される。 In the latter case, the information processing apparatuses 1, 1A, 1B, 1C, and 1D are implemented by computers that execute program instructions, which are software that implements each function, for example. An example of such a computer (hereinafter referred to as computer C) is shown in FIG. Computer C comprises at least one processor C1 and at least one memory C2. A program P for operating the computer C as the information processing apparatuses 1, 1A, 1B, 1C, and 1D is recorded in the memory C2. In the computer C, the processor C1 reads the program P from the memory C2 and executes it, thereby implementing the functions of the information processing apparatuses 1, 1A, 1B, 1C, and 1D.
As the processor C1, for example, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), an FPU (Floating point number Processing Unit), a PPU (Physics Processing Unit), a microcontroller, or a combination thereof can be used. As the memory C2, for example, a flash memory, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a combination thereof can be used.
Note that the computer C may further include a RAM (Random Access Memory) into which the program P is loaded at execution time and in which various data are temporarily stored. The computer C may further include a communication interface for transmitting and receiving data to and from other devices. The computer C may further include an input/output interface for connecting input/output devices such as a keyboard, a mouse, a display, and a printer.
The program P can be recorded on a non-transitory, tangible recording medium M that is readable by the computer C. As such a recording medium M, for example, a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used. The computer C can acquire the program P via such a recording medium M. The program P can also be transmitted via a transmission medium. As such a transmission medium, for example, a communication network or broadcast waves can be used. The computer C can also acquire the program P via such a transmission medium.
[Additional Remark 1]
The present invention is not limited to the above-described embodiments, and various modifications are possible within the scope of the claims. For example, embodiments obtained by appropriately combining the technical means disclosed in the above-described embodiments are also included in the technical scope of the present invention.
[Additional Remark 2]
Some or all of the above-described embodiments may also be described as follows. However, the present invention is not limited to the aspects described below.
(Appendix 1)
An information processing device comprising:
acquisition means for acquiring a record pair;
conversion means for generating a converted record pair by converting the record pair;
similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
output means for outputting the similarity calculated by the similarity calculation means.
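The four means of Appendix 1 can be pictured as a short pipeline. The following Python sketch is a minimal illustration under stated assumptions, not the implementation of the embodiments: the function names, the example records, and the token-overlap function that stands in for the model are all hypothetical.

```python
from typing import Callable, Dict, Tuple

Record = Dict[str, str]


def acquire_record_pair() -> Tuple[Record, Record]:
    """Acquisition: return a record pair (hypothetical example records)."""
    left = {"name": "NEC Corporation", "city": "Tokyo"}
    right = {"name": "NEC Corp.", "city": "Tokyo"}
    return left, right


def convert(record: Record) -> str:
    """Conversion: turn a record into plain sentences, one possible converted form."""
    return " ".join(f"{key} is {value}." for key, value in record.items())


def toy_model(text_a: str, text_b: str) -> float:
    """Stand-in for a trained model: Jaccard overlap of lower-cased tokens."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def similarity_of(pair: Tuple[Record, Record],
                  model: Callable[[str, str], float] = toy_model) -> float:
    """Similarity calculation: input the converted record pair into the model."""
    return model(convert(pair[0]), convert(pair[1]))


# Output: here the similarity is simply printed.
similarity = similarity_of(acquire_record_pair())
print(f"similarity = {similarity:.3f}")
```

In practice the stand-in function would be replaced by a trained model, and the conversion would be chosen to match the input that model expects.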
(Appendix 2)
The information processing device according to Appendix 1, wherein
the model is selected from a plurality of model candidates, and
the conversion means generates the converted record pair by converting the record pair into a format corresponding to the input of the model.
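One way to picture the conversion into a format corresponding to the input of the selected model is a lookup from model candidate to converter. The Python sketch below assumes two hypothetical candidates, a document-type model and an image-type model; the registry, the "[SEP]" separator, and the byte-value image encoding are illustrative assumptions rather than details from the embodiments.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record]


def to_document(pair: Pair) -> str:
    """Serialize both records into one text document for a document-type model."""
    return " [SEP] ".join(
        " ".join(f"{key}: {value}" for key, value in record.items())
        for record in pair
    )


def to_image(pair: Pair, width: int = 16) -> List[List[int]]:
    """Encode each record as one fixed-width row of byte values (a tiny grayscale image)."""
    rows = []
    for record in pair:
        data = " ".join(record.values()).encode("utf-8")[:width]
        rows.append(list(data) + [0] * (width - len(data)))  # zero-pad short rows
    return rows


# Hypothetical registry: the converter matching each model candidate's input format.
CONVERTERS: Dict[str, Callable[[Pair], object]] = {
    "document_classifier": to_document,
    "image_classifier": to_image,
}


def convert_for(model_name: str, pair: Pair) -> object:
    """Convert the record pair into the format expected by the selected model."""
    return CONVERTERS[model_name](pair)


pair = ({"name": "Alice", "city": "Paris"}, {"name": "Alicia", "city": "Paris"})
print(convert_for("document_classifier", pair))
print(convert_for("image_classifier", pair))
```

Adding a further model candidate then amounts to registering one more converter.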
(Appendix 3)
The information processing device according to Appendix 1 or 2, wherein
the model includes a document classification model, and
the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into a document.
(Appendix 4)
The information processing device according to any one of Appendices 1 to 3, wherein
the model includes an image classification model, and
the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into an image.
(Appendix 5)
The information processing device according to any one of Appendices 1 to 4, wherein
the model includes an audio classification model, and
the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into audio.
(Appendix 6)
The information processing device according to any one of Appendices 1 to 5, wherein
the model includes a graph classification model, and
the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into a graph.
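As one conceivable form of the graph conversion, each record can become a node whose attribute values are attached as labeled edges, so that values shared by both records become common nodes. The Python sketch below is an assumption made for illustration; the node names, the edge representation, and the value normalization are hypothetical, and the graph classification model itself is not shown.

```python
from typing import Dict, Set, Tuple

Record = Dict[str, str]
Edge = Tuple[str, str, str]  # (record node, attribute label, value node)


def pair_to_graph(left: Record, right: Record) -> Set[Edge]:
    """Convert a record pair into a set of labeled edges."""
    edges: Set[Edge] = set()
    for record_id, record in (("record_1", left), ("record_2", right)):
        for attribute, value in record.items():
            # Normalize values so that equal values map onto the same node.
            edges.add((record_id, attribute, value.strip().lower()))
    return edges


def shared_value_nodes(edges: Set[Edge]) -> Set[str]:
    """Value nodes that are connected to both record nodes."""
    from_1 = {target for source, _, target in edges if source == "record_1"}
    from_2 = {target for source, _, target in edges if source == "record_2"}
    return from_1 & from_2


graph = pair_to_graph(
    {"name": "Alice", "city": "Paris"},
    {"name": "Alicia", "city": "paris "},
)
print(sorted(graph))
print(shared_value_nodes(graph))
```

A graph classification model would take such a structure as input; the shared-node helper is included only to make the structure of the converted pair visible.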
(Appendix 7)
The information processing device according to any one of Appendices 1 to 6, wherein
the acquisition means further acquires an auxiliary record,
the conversion means generates a converted auxiliary record by converting the auxiliary record, and
the similarity calculation means calculates the similarity regarding the converted record pair by inputting the converted record pair and the converted auxiliary record into the model.
(Appendix 8)
The information processing device according to Appendix 7, wherein
the model includes a question answering model that takes a question sentence and response sentences as input, and
the conversion means generates the converted record pair by converting one record included in the record pair into a question sentence and converting each of the other record included in the record pair and the auxiliary record into a response sentence.
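The conversion of Appendix 8 can be sketched as follows: one record of the pair becomes a question sentence, and the other record and each auxiliary record become response sentences. The sentence templates and function names in this Python sketch are hypothetical, and the question answering model that would consume the result is not shown.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def describe(record: Record) -> str:
    """Render the attributes of a record as a comma-separated phrase."""
    return ", ".join(f"{key} {value}" for key, value in record.items())


def to_question(record: Record) -> str:
    return f"Which record has {describe(record)}?"


def to_response(record: Record) -> str:
    return f"The record with {describe(record)}."


def build_qa_input(pair: Tuple[Record, Record],
                   auxiliary_records: List[Record]) -> Tuple[str, List[str]]:
    """Return the question and the response sentences.

    The response at index 0 comes from the other record of the pair;
    the remaining responses come from the auxiliary records.
    """
    question = to_question(pair[0])
    responses = [to_response(pair[1])]
    responses += [to_response(record) for record in auxiliary_records]
    return question, responses


pair = (
    {"name": "NEC Corp", "city": "Tokyo"},
    {"name": "NEC Corporation", "city": "Tokyo"},
)
auxiliary = [{"name": "Hitachi", "city": "Tokyo"}]
question, responses = build_qa_input(pair, auxiliary)
print(question)
for response in responses:
    print(response)
```

Under this arrangement, how strongly the model prefers the response at index 0 over the auxiliary responses could serve as the similarity of the pair.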
(Appendix 9)
The information processing device according to any one of Appendices 1 to 8, wherein
the similarity calculation means calculates a plurality of similarities regarding the record pair, and
the output means outputs an integrated similarity obtained by integrating the plurality of similarities.
(Appendix 10)
The information processing device according to Appendix 9, wherein
the model is asymmetric with respect to swapping the two elements input to the model, and
the similarity calculation means
calculates a first similarity by inputting the two records included in the record pair into the model without swapping them, and
calculates a second similarity by inputting the two records included in the record pair into the model after swapping them.
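The handling of an asymmetric model in Appendices 9 and 10 can be illustrated with a toy model whose output depends on the input order. In the Python sketch below, the containment score stands in for the model and the two similarities are integrated by simple averaging; both choices are assumptions made for illustration, and other integration methods are equally conceivable.

```python
from typing import Callable


def containment(first: str, second: str) -> float:
    """Asymmetric toy model: share of tokens in `first` that also occur in `second`."""
    first_tokens = first.lower().split()
    second_tokens = set(second.lower().split())
    if not first_tokens:
        return 0.0
    return sum(token in second_tokens for token in first_tokens) / len(first_tokens)


def integrated_similarity(record_a: str, record_b: str,
                          model: Callable[[str, str], float] = containment) -> float:
    """Compute the similarity for both input orders and integrate the results."""
    first = model(record_a, record_b)   # records input without swapping
    second = model(record_b, record_a)  # records input after swapping
    return (first + second) / 2         # integration by averaging (an assumption)


a = "nec corporation tokyo"
b = "nec tokyo"
print(containment(a, b), containment(b, a), integrated_similarity(a, b))
```

Because both orders are evaluated, the integrated value no longer depends on which record happens to be input first.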
(Appendix 11)
An information processing method comprising, by at least one processor:
acquiring a record pair;
generating a converted record pair by converting the record pair;
calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
outputting the calculated similarity.
(Appendix 12)
An information processing program causing a computer to execute:
an acquisition process of acquiring a record pair;
a conversion process of generating a converted record pair by converting the record pair;
a similarity calculation process of calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
an output process of outputting the similarity calculated in the similarity calculation process.
(Appendix 13)
An information processing device comprising:
acquisition means for acquiring, as a record pair, input data from a user and one of a plurality of records included in target data;
conversion means for generating a converted record pair by converting the record pair;
similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
output means for outputting, with reference to the similarity calculated by the similarity calculation means, a search result that is based on the input data and that has the target data as its search target.
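The search use of Appendix 13 can be sketched by pairing the user input with each record of the target data, scoring each pair, and returning the highest-scoring records. In the Python sketch below, the standard-library `difflib.SequenceMatcher` stands in for the model, and the record contents, function names, and `top_k` parameter are hypothetical.

```python
import difflib
from typing import Dict, List, Tuple

Record = Dict[str, str]


def convert(record: Record) -> str:
    """Convert a record into a lower-cased text string."""
    return " ".join(record.values()).lower()


def toy_model(text_a: str, text_b: str) -> float:
    """Stand-in for a trained model: character-level similarity ratio."""
    return difflib.SequenceMatcher(None, text_a, text_b).ratio()


def search(user_input: Record, target_data: List[Record],
           top_k: int = 2) -> List[Tuple[float, Record]]:
    """Pair the user input with every target record and rank by similarity."""
    scored = []
    for record in target_data:
        similarity = toy_model(convert(user_input), convert(record))
        scored.append((similarity, record))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


targets = [
    {"name": "Fujitsu Limited", "city": "Kawasaki"},
    {"name": "NEC Corporation", "city": "Tokyo"},
    {"name": "Hitachi Ltd", "city": "Tokyo"},
]
results = search({"name": "NEC Corp", "city": "Tokyo"}, targets)
for similarity, record in results:
    print(f"{similarity:.3f}", record["name"])
```

The ranking step is the only addition relative to the pairwise similarity calculation; the conversion and the model can be the same as in the other aspects.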
[Additional Remark 3]
Some or all of the embodiments described above can also be expressed as follows.
An information processing device comprising at least one processor, wherein the processor executes: an acquisition process of acquiring a record pair; a conversion process of generating a converted record pair by converting the record pair; a similarity calculation process of calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and an output process of outputting the similarity calculated in the similarity calculation process.
Note that this information processing device may further include a memory, and the memory may store a program for causing the processor to execute the acquisition process, the conversion process, the similarity calculation process, and the output process. This program may also be recorded on a computer-readable, non-transitory, tangible recording medium.
1, 1A, 1B, 1C, 1D  Information processing device
11, 11B  Acquisition unit
12, 12B  Conversion unit
13, 13B, 13C  Similarity calculation unit
14, 14C  Output unit
16A  Integration unit

Claims (12)

  1.  An information processing device comprising:
     acquisition means for acquiring a record pair;
     conversion means for generating a converted record pair by converting the record pair;
     similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
     output means for outputting the similarity calculated by the similarity calculation means.
  2.  The information processing device according to claim 1, wherein
     the model is selected from a plurality of model candidates, and
     the conversion means generates the converted record pair by converting the record pair into a format corresponding to the input of the model.
  3.  The information processing device according to claim 1 or 2, wherein
     the model includes a document classification model, and
     the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into a document.
  4.  The information processing device according to any one of claims 1 to 3, wherein
     the model includes an image classification model, and
     the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into an image.
  5.  The information processing device according to any one of claims 1 to 4, wherein
     the model includes an audio classification model, and
     the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into audio.
  6.  The information processing device according to any one of claims 1 to 5, wherein
     the model includes a graph classification model, and
     the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into a graph.
  7.  The information processing device according to any one of claims 1 to 6, wherein
     the acquisition means further acquires an auxiliary record,
     the conversion means generates a converted auxiliary record by converting the auxiliary record, and
     the similarity calculation means calculates the similarity regarding the converted record pair by inputting the converted record pair and the converted auxiliary record into the model.
  8.  The information processing device according to claim 7, wherein
     the model includes a question answering model that takes a question sentence and response sentences as input, and
     the conversion means generates the converted record pair by converting one record included in the record pair into a question sentence and converting each of the other record included in the record pair and the auxiliary record into a response sentence.
  9.  The information processing device according to any one of claims 1 to 8, wherein
     the similarity calculation means calculates a plurality of similarities regarding the record pair, and
     the output means outputs an integrated similarity obtained by integrating the plurality of similarities.
  10.  The information processing device according to claim 9, wherein
     the model is asymmetric with respect to swapping the two elements input to the model, and
     the similarity calculation means
      calculates a first similarity by inputting the two records included in the record pair into the model without swapping them, and
      calculates a second similarity by inputting the two records included in the record pair into the model after swapping them.
  11.  An information processing method comprising, by at least one processor:
     acquiring a record pair;
     generating a converted record pair by converting the record pair;
     calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
     outputting the calculated similarity.
  12.  An information processing program causing a computer to execute:
     an acquisition process of acquiring a record pair;
     a conversion process of generating a converted record pair by converting the record pair;
     a similarity calculation process of calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
     an output process of outputting the similarity calculated in the similarity calculation process.

PCT/JP2022/008227 2022-02-28 2022-02-28 Information processing device, information processing method, and information processing program WO2023162206A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/008227 WO2023162206A1 (en) 2022-02-28 2022-02-28 Information processing device, information processing method, and information processing program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/008227 WO2023162206A1 (en) 2022-02-28 2022-02-28 Information processing device, information processing method, and information processing program

Publications (1)

Publication Number Publication Date
WO2023162206A1 true WO2023162206A1 (en) 2023-08-31

Family

ID=87765225

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/008227 WO2023162206A1 (en) 2022-02-28 2022-02-28 Information processing device, information processing method, and information processing program

Country Status (1)

Country Link
WO (1) WO2023162206A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7454156B1 (en) 2023-12-26 2024-03-22 ファーストアカウンティング株式会社 Information processing device, information processing method and program

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170091274A1 (en) * 2015-09-30 2017-03-30 Linkedin Corporation Organizational data enrichment
JP2019185244A (en) * 2018-04-05 2019-10-24 富士通株式会社 Learning program and learning method
JP2021174300A (en) * 2020-04-27 2021-11-01 アットホームラボ株式会社 Information processing device, information processing method and information processing program
US20210374164A1 (en) * 2020-06-02 2021-12-02 Banque Nationale Du Canada Automated and dynamic method and system for clustering data records
US20210374186A1 (en) * 2020-05-26 2021-12-02 Rovi Guides, Inc. Automated metadata asset creation using machine learning models
JP2022510818A (en) * 2018-11-20 2022-01-28 アマゾン テクノロジーズ インコーポレイテッド Transliteration of data records for improved data matching

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YULIANG LI; JINFENG LI; YOSHIHIKO SUHARA; ANHAI DOAN; WANG-CHIEW TAN: "Deep Entity Matching with Pre-Trained Language Models", arXiv.org, Cornell University Library, 2 September 2020 (2020-09-02), XP081753835, DOI: 10.14778/3421424.3421431 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22928732

Country of ref document: EP

Kind code of ref document: A1