WO2023162206A1 - Information processing device, information processing method, and information processing program
- Publication number
- WO2023162206A1 (PCT/JP2022/008227)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
Definitions
- the present invention relates to an information processing device, an information processing method, and an information processing program.
- Patent Document 1 describes a device that calculates the similarity of record pairs using a plurality of similarity functions and learns the weights of the similarities by supervised machine learning using training data.
- the training data is a data set with labels indicating combinations of records and whether they are identical.
- Non-Patent Document 1 describes a technique called DITTO that performs entity matching (name identification) by supervised machine learning.
- Non-Patent Document 2 describes a technique called ZeroER that matches records by unsupervised machine learning that does not use training data.
- Language models (e.g., Non-Patent Documents 3 to 5) and image classification models (e.g., Non-Patent Document 6) are also known.
- Non-Patent Document 1: Yuliang Li et al., Deep Entity Matching with Pre-Trained Language Models, PVLDB 2021
- Non-Patent Document 2: Renzhi Wu et al., ZeroER: Entity Resolution using Zero Labeled Examples, SIGMOD 2020
- Non-Patent Document 3: Yinhan Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach, arXiv 2019
- Non-Patent Document 4: Sinong Wang et al., Entailment as Few-Shot Learner, arXiv 2021
- Non-Patent Document 5: Siddhant Garg et al., TANDA: Transfer and Adapt Pre-Trained Transformer Models for Answer Sentence Selection, AAAI 2020
- Non-Patent Document 6: Kaiming He et al., Deep Residual Learning for Image Recognition, CVPR 2016
- Supervised machine learning requires a large amount of training data.
- ZeroER, the unsupervised machine learning technique described in Non-Patent Document 2, has the problem that it cannot handle heterogeneous data.
- Here, heterogeneous data refers to a combination of records whose formats are not the same.
- One aspect of the present invention has been made in view of the above problems. An example of its object is to provide a technique for calculating the similarity of record pairs that does not require training data regarding record pairs and can also deal with heterogeneous data.
- An information processing apparatus includes:
- acquisition means for acquiring a record pair;
- conversion means for generating a converted record pair by converting the record pair;
- similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and
- output means for outputting the similarity calculated by the similarity calculation means.
- An information processing method includes: at least one processor acquiring a record pair; generating a converted record pair by converting the record pair; calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and outputting the calculated similarity.
- An information processing program causes a computer to execute: an acquisition process of acquiring a record pair; a conversion process of generating a converted record pair by converting the record pair; a similarity calculation process of calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and an output process of outputting the similarity calculated in the similarity calculation process.
- FIG. 1 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 1.
- FIG. 2 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 1.
- FIG. 3 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 2.
- FIG. 4 is a diagram showing a specific example of data including records according to exemplary embodiment 2.
- FIG. 5 is a diagram showing an overview of the flow of processing performed by an information processing apparatus according to exemplary embodiment 2.
- FIG. 6 is a diagram showing a specific example of identity determination results according to exemplary embodiment 2.
- FIG. 7 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 2.
- FIG. 8 is a diagram schematically illustrating entailment relationships of documents according to exemplary embodiment 2.
- FIG. 9 is a diagram showing an example of an image converted by a conversion unit according to exemplary embodiment 2.
- FIG. 10 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 3.
- FIG. 11 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 3.
- FIG. 12 is a conceptual diagram of similarity calculation processing using a question-answering model according to exemplary embodiment 3.
- FIG. 13 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 4.
- FIG. 14 is a diagram showing a specific example of similarity calculated by a similarity calculation unit according to exemplary embodiment 4.
- FIG. 15 is a diagram showing a specific example of similarity calculated by a similarity calculation unit according to exemplary embodiment 4.
- FIG. 16 is a block diagram showing the configuration of an information processing apparatus according to exemplary embodiment 5.
- FIG. 17 is a diagram showing a specific example of screen display according to exemplary embodiment 5.
- FIG. 18 is a diagram showing an example of graph data.
- FIG. 19 is a diagram showing an example of a graph database.
- FIG. 20 is a diagram schematically showing a configuration in which learned converters are provided before and after the output of a model.
- FIG. 21 is a block diagram showing the configuration of a computer functioning as an information processing device according to each exemplary embodiment.
- FIG. 1 is a block diagram showing the configuration of an information processing device 1.
- the information processing device 1 is a device that calculates the degree of similarity between records.
- a record is a unit of data for which similarity is calculated.
- Examples of data containing records include structured data such as table data, semi-structured data described in a data description language such as JSON (JavaScript Object Notation: registered trademark) or XML (Extensible Markup Language), and unstructured data representing documents written in natural language.
- a record is, for example, a row of a table and contains a set of one or more attribute names and attribute values corresponding to the columns of the table. Also, the record may be graph data.
- the information processing device 1 includes an acquisition unit 11 , a conversion unit 12 , a similarity calculation unit 13 and an output unit 14 .
- a record pair is a set of records, such as a set of records included in a first table and records included in a second table.
- the first table and the second table are, for example, tables that store customer information of businesses or tables that store product information.
- the first table and the second table are not limited to the examples described above, and may be other tables. Also, the first table and the second table may be the same or different.
- Multiple records included in a record pair may have different data formats. More specifically, for example, when a record is a row of a table, some or all of the attribute names included in the records may differ.
- The acquisition unit 11 may acquire the record pair by reading it from a storage device, or by receiving it from another device connected via a communication interface. The acquisition unit 11 may also acquire a record pair input from an input device via an input/output interface.
- the conversion unit 12 converts the record pair to generate a converted record pair.
- the conversion unit 12 converts records included in a record pair into data representing documents, images, sounds, or graphs. More specifically, the conversion unit 12 converts the record into an affirmative sentence or a question sentence, for example.
- the method by which the conversion unit 12 converts the record pair is not limited to the example described above, and the conversion unit 12 may convert the record pair by another method.
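- The conversion of a record into an affirmative sentence or a question sentence can be illustrated with a minimal sketch. The sentence templates, the function names `record_to_statement` and `record_to_question`, and the sample record are assumptions made for illustration; the source does not specify how the sentences are worded.

```python
def record_to_statement(record: dict) -> str:
    """Serialize a record as affirmative sentences, one per attribute.

    The template "The <attribute name> is <attribute value>." is a
    hypothetical choice, not the template used in the source.
    """
    return " ".join(f"The {name} is {value}." for name, value in record.items())


def record_to_question(record: dict) -> str:
    """Serialize a record as question sentences, one per attribute."""
    return " ".join(f"Is the {name} {value}?" for name, value in record.items())


# Hypothetical record: a table row given as attribute name -> attribute value.
record = {"name": "NEC Corporation", "industry": "IT"}
statement = record_to_statement(record)
question = record_to_question(record)
print(statement)
print(question)
```

- Because the attribute names are written into the text itself, two records with different attribute names can still be passed to the same document model.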
- the similarity calculation unit 13 calculates the similarity regarding the converted record pair by inputting the converted record pair into the model.
- the model is a model for calculating the degree of similarity, and as an example, is a model that is open to the public and available to any user.
- the model may be a model generated by machine learning or a rule-based model created by humans.
- the model is, by way of example, a document classification model, an image classification model, an audio classification model, or a graph classification model.
- a document classification model is a model for classifying document data.
- An image classification model is a model for classifying image data.
- a speech classification model is a model that classifies speech data.
- a graph classification model is a model for classifying graph data.
- Document classification models include, for example, a document embedding model, an entailment recognition model, a paraphrase prediction model, a question answering model, and a mask language model.
- a document embedding model is a model that embeds documents or words in a vector space.
- the entailment recognition model is a model that predicts entailment relationships of multiple documents.
- a paraphrasing prediction model is a model that predicts whether two documents are paraphrasing expressions.
- a question answering model is a model that extracts and outputs answers from documents given to questions.
- a mask language model is a model for predicting words that fit a mask in a document.
- An example of an image classification model is an image embedding model.
- An image embedding model is a model that embeds image data in a vector space.
- a speech classification model includes, for example, a speech embedding model.
- a speech embedding model is a model that embeds speech data in a vector space.
- Inputs for the above model include at least one of text data, image data, audio data, graphs, and vectors, for example.
- the output of the model includes, by way of example, a vector or score indicating confidence.
- The score is, for example, a score indicating the degree of certainty regarding the entailment relation between documents or a score indicating the degree of certainty as to whether the documents are paraphrases of each other.
- the inputs and outputs of the model are not limited to the examples described above, and may include other information.
- the degree of similarity is information relating to the degree of similarity between records included in a record pair, and an example is the cosine similarity of vector pairs. Also, the similarity may be a value calculated from the score output by the model.
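- As a minimal sketch of the cosine similarity mentioned above, the following function computes it for two vectors given as plain lists of numbers. The function name and the choice to return 0.0 for a zero vector are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (sequences of numbers)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```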
- the output unit 14 outputs the similarity calculated by the similarity calculation unit 13 .
- the output unit 14 may output the degree of similarity by writing it in a storage device, or may output the degree of similarity by transmitting the degree of similarity to another device via a communication interface.
- the output unit 14 may output the degree of similarity to an output device (not shown) connected via an input/output interface.
- the output device is, for example, a display, printer, projector, or speaker.
- the degree of similarity output by the output unit 14 is used, for example, for table integration processing or information search processing.
- In table integration processing, records predicted to be identical based on the similarity calculated by the similarity calculation unit 13 are linked, so that a plurality of tables can be integrated and managed in a unified manner.
- In information search processing, the similarity calculation unit 13 may calculate the similarity for record pairs each consisting of a record serving as a search key (for example, a record specified by a user) and another record registered in a predetermined table.
- the information processing apparatus 1 may output records included in a record pair predicted to be identical based on the similarity calculated by the similarity calculation unit 13 as a search result. As a result, even in a table that is not associated with a record that is a search key, search processing using the search key is possible.
- As described above, the information processing apparatus 1 according to this exemplary embodiment adopts a configuration comprising the acquisition unit 11 that acquires a record pair, the conversion unit 12 that converts the record pair to generate a converted record pair, the similarity calculation unit 13 that calculates a similarity regarding the converted record pair by inputting the converted record pair into a model, and the output unit 14 that outputs the similarity calculated by the similarity calculation unit 13.
- Therefore, the information processing apparatus 1 according to this exemplary embodiment provides, as a technique for calculating the similarity of record pairs, a technique that does not require training data regarding record pairs and can handle heterogeneous data.
- An information processing program causes a computer to execute: an acquisition process of acquiring a record pair; a conversion process of generating a converted record pair by converting the record pair; a similarity calculation process of calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and an output process of outputting the similarity calculated in the similarity calculation process.
- FIG. 2 is a flow diagram showing the flow of the information processing method S1.
- The execution subject of each step in the information processing method S1 may be a processor included in the information processing apparatus 1 or a processor included in another apparatus.
- At step S11, at least one processor acquires a record pair.
- At step S12, at least one processor generates a converted record pair by converting the record pair.
- At step S13, at least one processor calculates a similarity regarding the converted record pair by inputting the converted record pair into a model.
- At step S14, at least one processor outputs the calculated similarity.
- As described above, the information processing method S1 according to this exemplary embodiment adopts a configuration in which at least one processor acquires a record pair, generates a converted record pair by converting the record pair, calculates a similarity regarding the converted record pair by inputting the converted record pair into a model, and outputs the calculated similarity. Therefore, the information processing method S1 according to this exemplary embodiment provides, as a technique for calculating the similarity of record pairs, a technique that does not require training data regarding record pairs and can handle heterogeneous data.
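- Steps S11 to S14 can be sketched end to end as follows. This is an illustration under stated assumptions, not the implementation of the source: the sample records and the sentence template are hypothetical, and `toy_embed` is a bag-of-words stand-in for the publicly available document embedding model that would be used in practice.

```python
import math
import re
from collections import Counter


def acquire_record_pair():
    """S11: acquire a record pair (hard-coded, hypothetical data).

    The two records deliberately use different attribute names.
    """
    left = {"name": "NEC Corporation", "city": "Tokyo"}
    right = {"company": "NEC Corporation", "location": "Tokyo", "sector": "IT"}
    return left, right


def convert(record):
    """S12: convert a record into a document, one sentence per attribute."""
    return " ".join(f"The {k} is {v}." for k, v in record.items())


def toy_embed(text):
    """Stand-in for a pre-trained embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def similarity(doc_a, doc_b, embed=toy_embed):
    """S13: embed both documents and take the cosine similarity."""
    va, vb = embed(doc_a), embed(doc_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def run():
    left, right = acquire_record_pair()            # S11
    converted = (convert(left), convert(right))    # S12
    s = similarity(*converted)                     # S13
    print(f"similarity = {s:.3f}")                 # S14
    return s


s = run()
```

- No labelled record pairs are used anywhere in this flow, which is the point of the configuration. The toy embedding counts template words such as "the" and "is", so its scores are inflated compared with what a real embedding model would produce.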
- FIG. 3 is a block diagram showing the configuration of the information processing device 1A according to this exemplary embodiment.
- the information processing device 1A has a function of determining identity between records. Examples of data containing records are structured data such as table data, semi-structured data described in a data description language such as JSON or XML, or unstructured data representing a document written in a natural language.
- FIG. 4 is a diagram showing a specific example of data containing records.
- data D1 is a table.
- a record is each row of the table.
- data D2 is semi-structured data described in a data description language such as a markup language.
- the record is a web page as an example.
- Data D3 is unstructured data representing a document written in natural language.
- the record is, for example, a file generated in a predetermined file format.
- FIG. 5 is a diagram showing an overview of the flow of processing performed by the information processing apparatus 1A.
- The processing performed by the information processing apparatus 1A is roughly divided into (i) record pair generation processing, (ii) similarity calculation processing, and (iii) identity determination processing.
- In the record pair generation processing, the information processing device 1A generates record pairs from first data x including multiple records e and second data x' including multiple records e'. As an example, the information processing apparatus 1A generates all combinations of a record e included in the first data x and a record e' included in the second data x'. When generating the record pairs, the information processing apparatus 1A may also use a technique called blocking to narrow down the records of the second data x' that are candidates for identity determination with a record e of the first data x.
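- The two ways of generating record pairs can be sketched as follows. The blocking key used here (the lowercased first character of a record's first attribute value) and the sample data are hypothetical; the source names blocking but does not specify a key.

```python
from itertools import product


def all_pairs(first, second):
    """All combinations of a record e in x and a record e' in x'."""
    return list(product(first, second))


def blocked_pairs(first, second, key):
    """Blocking: keep only pairs whose blocking keys are equal."""
    buckets = {}
    for e2 in second:
        buckets.setdefault(key(e2), []).append(e2)
    return [(e1, e2) for e1 in first for e2 in buckets.get(key(e1), [])]


def first_letter_key(record):
    """Hypothetical blocking key: first character of the first attribute value."""
    first_value = str(next(iter(record.values()), ""))
    return first_value[:1].lower()


# Hypothetical first data x and second data x' with different attribute names.
x = [{"name": "Alice Smith"}, {"name": "Bob Jones"}]
x2 = [{"customer": "alice smith"}, {"customer": "Carol White"}, {"customer": "bob j."}]

print(len(all_pairs(x, x2)))
print(blocked_pairs(x, x2, first_letter_key))
```

- With blocking, only the pairs that share a key are passed to the similarity calculation, which reduces the number of model invocations at the cost of possibly missing pairs whose keys differ.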
- the information processing device 1A calculates the similarity between the records included in the record pair.
- the information processing apparatus 1A calculates the degree of similarity by inputting converted record pairs obtained by converting records into a model. The details of the similarity calculation process will be described later.
- the information processing device 1A determines the identity of the records included in the record pair based on the calculated similarity. As an example, the information processing device 1A determines that the records are the same when the degree of similarity is equal to or greater than a threshold.
- the method for determining identity is not limited to the above-described method, and information processing apparatus 1A may determine identity between records using other methods.
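- Threshold-based identity determination can be sketched as follows. The threshold value 0.8, the record labels, and the similarity values are hypothetical and chosen only for illustration.

```python
def determine_identity(scored_pairs, threshold=0.8):
    """Return the record pairs whose similarity is equal to or greater than the threshold.

    scored_pairs: iterable of (record_e, record_e_prime, similarity) tuples.
    """
    return [(e, e_prime) for e, e_prime, s in scored_pairs if s >= threshold]


# Hypothetical similarities for three record pairs.
scored = [("e1", "e'2", 0.91), ("e1", "e'1", 0.12), ("e2", "e'1", 0.80)]
identical = determine_identity(scored, threshold=0.8)
print(identical)
```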
- FIG. 6 is a diagram showing a specific example of identity determination results.
- a table TBL1 is an example of the first data x and includes multiple rows and multiple columns.
- the table TBL2 is an example of the second data x' and includes multiple rows and multiple columns.
- a record is a row of the table.
- Table TBL1 contains records l1, l2, l3 and l4, and table TBL2 contains records r1, r2, r3.
- Through the processes (i) to (iii) above, the information processing device 1A determines that record l1 and record r2 are the same, that record l2 and record r3 are the same, and that record l3 and record r1 are the same.
- the information processing apparatus 1A includes a control section 10A, a storage section 20A, a communication section 30A and an input/output section 40A.
- the communication unit 30A communicates with an external device of the information processing device 1A via a communication line.
- a communication line includes wireless LAN (Local Area Network), wired LAN, WAN (Wide Area Network), public line network, mobile data communication network, or a combination thereof.
- the communication unit 30A transmits data supplied from the control unit 10A to other devices, and supplies data received from other devices to the control unit 10A.
- (Input/output unit 40A) Input/output devices such as a keyboard, mouse, display, printer, and touch panel are connected to the input/output unit 40A.
- the input/output unit 40A receives input of various kinds of information from the connected input device to the information processing apparatus 1A. Also, the input/output unit 40A outputs various kinds of information to the connected output device under the control of the control unit 10A.
- an interface such as a USB (Universal Serial Bus) can be used as the input/output unit 40A.
- control unit 10A includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13, an output unit 14, an identity determination unit 15A, and an integration unit 16A.
- the acquisition unit 11 generates a record pair including the record e and the record e' from the first data x including the record e and the second data x' including the record e'.
- the acquisition unit 11 does not have to perform the process of generating the record pair.
- The acquisition unit 11 may acquire record pairs by reading them from the storage unit 20A or another external storage device, or may acquire record pairs received from another device via the communication unit 30A. The acquisition unit 11 may also acquire a record pair input from an input device connected to the input/output unit 40A.
- the conversion unit 12 converts the record pair to generate a converted record pair.
- the conversion unit 12 converts records included in a record pair into document data, image data, audio data, or graphs. The conversion processing executed by the conversion unit 12 will be described later.
- the similarity calculation unit 13 calculates the similarity s regarding the converted record pair by inputting the converted record pair into the model MA.
- the similarity s is information about the degree of similarity between records included in a record pair, and is, for example, a cosine similarity of a vector pair or a value calculated based on the score output by the model MA. The details of the process of calculating the similarity s by the similarity calculator 13 will be described later.
- the output unit 14 outputs the similarity s calculated by the similarity calculation unit 13 .
- the output unit 14 outputs the degree of similarity by writing it into the storage unit 20A.
- the method by which the output unit 14 outputs the degree of similarity is not limited to the example described above, and the degree of similarity s may be output by another method.
- The output unit 14 may transmit the degree of similarity to another device connected via the communication unit 30A, or may output the degree of similarity to an output device connected via the input/output unit 40A.
- The identity determination unit 15A determines identity between records included in a record pair based on the degree of similarity s. As an example, the identity determination unit 15A determines that the records are the same when the similarity s is equal to or greater than a threshold. The identity determination unit 15A may also determine identity based on the ranking obtained when the record pairs are sorted in descending order of similarity, for example by determining that the x record pairs with the highest degree of similarity are identical. As a further example, the identity determination unit 15A may determine identity by applying a matching algorithm such as an algorithm for the stable marriage problem.
- the method of determining identity by the identity determination unit 15A is not limited to the above example, and the identity determination unit 15A may determine identity by other methods.
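- The ranking-based and matching-based determinations can be sketched as follows. The similarity values are hypothetical, and the matching function is a Gale-Shapley style procedure in which records of the first table propose in descending order of similarity; the source names the stable marriage problem but does not fix a particular algorithm.

```python
def top_x_pairs(sim, x):
    """Return the x record pairs with the highest similarity."""
    ranked = sorted(
        ((s, l, r) for l, row in sim.items() for r, s in row.items()),
        reverse=True,
    )
    return [(l, r) for s, l, r in ranked[:x]]


def stable_matching(sim):
    """Gale-Shapley style matching; sim[l][r] is the similarity of (l, r).

    Returns a dict mapping each matched left record to its right record.
    """
    prefs = {l: sorted(row, key=row.get, reverse=True) for l, row in sim.items()}
    next_choice = {l: 0 for l in sim}
    engaged = {}  # right record -> left record
    free = list(sim)
    while free:
        l = free.pop()
        if next_choice[l] >= len(prefs[l]):
            continue  # l has proposed to every right record and stays unmatched
        r = prefs[l][next_choice[l]]
        next_choice[l] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = l
        elif sim[l][r] > sim[current][r]:
            engaged[r] = l          # r prefers the more similar record
            free.append(current)
        else:
            free.append(l)
    return {l: r for r, l in engaged.items()}


# Hypothetical similarities between records l1..l4 (TBL1) and r1..r3 (TBL2).
sim = {
    "l1": {"r1": 0.2, "r2": 0.9, "r3": 0.1},
    "l2": {"r1": 0.3, "r2": 0.4, "r3": 0.8},
    "l3": {"r1": 0.7, "r2": 0.2, "r3": 0.3},
    "l4": {"r1": 0.5, "r2": 0.6, "r3": 0.4},
}
print(top_x_pairs(sim, 3))
print(stable_matching(sim))
```

- With these values both functions pair l1 with r2, l2 with r3, and l3 with r1, and leave l4 unmatched. Unlike the top-x rule, the matching guarantees that each record is paired at most once.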
- the identity determination unit 15A may perform identity prediction of record pairs by inputting record pairs and similarities into a prediction model generated by machine learning.
- the input of the prediction model includes, for example, record pairs and similarities.
- the output of the predictive model includes, as an example, a predictive result of identity.
- The machine learning method for the prediction model is not limited; as an example, a decision tree-based method, linear regression, or a neural network may be used, and two or more of these methods may be combined.
- (Integration unit 16A) The integration unit 16A integrates the first data x and the second data x' based on the determination result of the identity determination unit 15A. For example, the integration unit 16A integrates the first data x and the second data x' by increasing the number of records and/or increasing the number of data attributes.
- Data integration performed by the integration unit 16A includes, for example, (i) entity integration, (ii) data cleansing, and (iii) schema matching.
- Entity integration refers to unifying the notation of different attributes and their values when the same set of records is given.
- Data cleansing refers to the unification of differences in description formats such as company names, addresses, and area codes ("Co., Ltd.” and "Co., Ltd.”, etc.).
- Schema matching means aligning (matching) a plurality of attributes with different notations.
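- Integration by increasing the number of attributes and the number of records can be sketched as follows. The function name `integrate`, the index-based representation of the identity determination result, and the sample records are assumptions made for illustration. The sketch does not perform data cleansing or schema matching, so attributes with different notations stay separate.

```python
def integrate(first, second, matches):
    """Integrate two lists of records given the identity determination result.

    matches: list of (i, j) pairs meaning first[i] and second[j] are identical.
    Matched records are merged (more attributes); unmatched records of the
    second data are appended (more records).
    """
    partner = dict(matches)
    matched_second = {j for _, j in matches}
    merged = []
    for i, rec in enumerate(first):
        row = dict(rec)
        if i in partner:
            # Add attributes from the matching record without overwriting.
            for name, value in second[partner[i]].items():
                row.setdefault(name, value)
        merged.append(row)
    merged.extend(dict(r) for j, r in enumerate(second) if j not in matched_second)
    return merged


# Hypothetical first data x and second data x'.
x = [{"name": "NEC Corporation", "city": "Tokyo"}, {"name": "Foo Ltd.", "city": "Osaka"}]
x2 = [{"company": "NEC Corp.", "sector": "IT"}, {"company": "Bar Inc.", "sector": "Retail"}]

result = integrate(x, x2, matches=[(0, 0)])
for row in result:
    print(row)
```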
- the storage unit 20A stores the first data x and the second data x′, and also stores the similarity s calculated by the similarity calculation unit 13 .
- a model MA is stored in the storage unit 20A. Note that the expression that the model MA is stored in the storage unit 20A means that the parameters that define the model MA are stored in the storage unit 20A.
- the model MA is a model for calculating the degree of similarity, and as an example, is a model that is open to the public and can be used by any user.
- the model MA may be a model generated by machine learning or a rule-based model created by humans.
- the model MA includes at least one of a document classification model, image classification model, speech classification model and graph classification model.
- document classification models include document embedding models, entailment recognition models, paraphrase prediction models, and mask language models.
- An example of an image classification model is an image embedding model that embeds image data in a vector space.
- An example of a speech classification model is a speech embedding model that embeds speech data in a vector space.
- Inputs of the model MA include at least one of document data, image data, audio data, graphs, and vectors, for example.
- the output of the model MA includes, as an example, vectors and/or scores.
- a document embedding model is a model that embeds documents or words in a vector space.
- the document embedding model is generated by RoBERTa described in Non-Patent Document 3 as an example.
- the input of the document embedding model is, for example, a document or a word (for example, the sentence "An elderly man is walking in the park.”).
- the output is a vector as an example.
- the entailment recognition model is a model for predicting whether or not there is an entailment relation of "if it is document 1, then it is document 2".
- the entailment recognition model is generated by the technique described in Non-Patent Document 4 as an example.
- the input of the entailment recognition model is two documents as an example.
- The two documents are, for example, document 1 "an old man is walking in the park" and document 2 "a man is in the park". In this case, document 1 entails document 2: if document 1 holds, then document 2 also holds.
- FIG. 8 is a diagram schematically showing entailment relationships of documents. In the example of FIG. 8, document 1 entails document 2.
- the output of the entailment recognition model is an entailment score as an example.
- the entailment score is a numerical value indicating the certainty of the entailment relation, and is a real number between 0 and 1, for example. As an example, the entailment score indicates that the higher the value, the higher the certainty of the entailment relation.
- a paraphrasing prediction model is a model that predicts whether two documents are paraphrasing expressions.
- a paraphrase prediction model is generated by RoBERTa described in Non-Patent Document 3 as an example.
- the input of the paraphrasing prediction model is two documents as an example.
- the two documents are, for example, document 1 stating "NEC is an IT company" and document 2 stating "NEC Corporation is in the IT business".
- the output of the model includes paraphrase scores, as an example.
- the paraphrasing score is a score indicating the degree of certainty that two documents are paraphrasing expressions, and is a real number between 0 and 1, for example. As an example, the paraphrase score indicates that the higher the value, the higher the confidence that the two documents are paraphrase expressions.
- a mask language model is a model that predicts words that fit a mask in a document.
- The mask language model is, for example, a model generated by RoBERTa described in Non-Patent Document 3.
- the input of the document classification model is, for example, a document (for example, the sentence "This pizza is very good. I like this pizza [mask].”).
- the output of the model includes words (eg, "like") and scores.
- the score is a value indicating the degree of confidence that the word fits the mask, and is a real number between 0 and 1, for example.
- An image classification model is a model that classifies images.
- the image classification model is a model generated by the technique described in Non-Patent Document 6 as an example.
- An example input for the image classification model is an image (eg, an image of a dog).
- the intermediate output of the model is, for example, a vector representation of the image, and the output of the model is, for example, a label (for example, a label indicating "dog") and a score.
- a score is a value indicating the certainty of a label, and is a real number between 0 and 1, for example.
- the similarity calculator 13 calculates the similarity s using the vector representation of the images.
- a speech classification model is a model that classifies speech.
- the input of the speech classification model is, for example, speech data (eg, dog barking).
- the intermediate output of the model is, as an example, a vector representation of the speech, and the output of the model is, as an example, a label (eg, a label indicating that the speech is a dog bark) and a score.
- a score is a value indicating the certainty of a label, and is a real number between 0 and 1, for example.
- the similarity calculator 13 calculates the similarity s using the vector representation of the speech, which is the intermediate output.
- a graph classification model is a model for classifying graphs.
- the input of the graph classification model is, for example, graph data (for example, a graph representing facial features).
- the intermediate output of the model is, for example, a vector representation of the graph, and the output of the model is, for example, a label (for example, a label indicating a person) and a score.
- a score is a value indicating the certainty of a label, and is a real number between 0 and 1, for example.
- the similarity calculator 13 calculates the similarity s using the vector representation of the graph, which is the intermediate output.
- FIG. 7 is a flowchart showing the flow of the information processing method S100A executed by the information processing apparatus 1A. Note that some of the steps included in the information processing method S100A may be executed in parallel or in a different order. Also, the description of the already described contents will not be repeated.
- In step S101, the acquisition unit 11 reads the model MA.
- model MA is selected from a plurality of model candidates.
- the plurality of model candidates include, for example, at least one of a document embedding model, a document classification model such as an entailment recognition model, an image classification model, and an audio classification model.
- the selection of the model MA may be performed based on a user's operation as an example, or may be performed according to a predetermined algorithm.
- the model MA may be a single model or a set of multiple models.
- In step S102, the acquisition unit 11 reads record pairs.
- the acquisition unit 11 generates record pairs from the first data x and the second data x'. For example, the acquisition unit 11 generates all combinations of records e included in the first data x and records e' included in the second data x'. Further, in generating a record pair, the obtaining unit 11 may narrow down candidates for identity determination of the second data x′ for the record e of the first data x by a technique called blocking.
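- The record pair generation described above can be sketched as follows. This is only an illustrative sketch: the function name `generate_record_pairs`, the dict-based record layout, the shared-token blocking rule, and the second record in `x_prime` are assumptions made for this example and are not taken from the specification.

```python
from itertools import product


def generate_record_pairs(first_data, second_data, use_blocking=False):
    """Return candidate record pairs (e, e') from two lists of records.

    Each record is a dict mapping attribute names to values. With
    use_blocking=True, a pair is kept only if the two records share at
    least one whitespace-separated token (a hypothetical blocking rule).
    """
    def tokens(record):
        # Collect lowercase tokens from all non-missing attribute values.
        return {
            tok
            for value in record.values()
            if value is not None
            for tok in str(value).lower().split()
        }

    pairs = []
    for e, e_prime in product(first_data, second_data):
        if use_blocking and not (tokens(e) & tokens(e_prime)):
            continue  # no shared token: drop this candidate
        pairs.append((e, e_prime))
    return pairs


x = [{"title": "sims 2 glamor life stuff pack",
      "manufacturer": "aspyr media", "price": 24.99}]
x_prime = [
    {"title": "aspyr media inc sims 2 glamor life stuff pack", "price": None},
    {"title": "photo editor deluxe", "price": 19.99},  # hypothetical record
]
all_pairs = generate_record_pairs(x, x_prime)
blocked_pairs = generate_record_pairs(x, x_prime, use_blocking=True)
```

- Without blocking, every combination is generated, so the number of pairs grows with the product of the two data sizes. Blocking trades some recall for fewer candidates.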
- In step S103, the conversion unit 12 generates a converted record pair by converting the record pair (e, e') into a format corresponding to the input of the model MA.
- As an example, the model MA includes a document classification model, and the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair (e, e') into a document.
- As an example, the model MA includes an image classification model, and the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair (e, e') into an image.
- the model MA includes a speech classification model, and the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into speech.
- the model MA includes a graph classification model, and the conversion processing by the conversion unit 12 includes processing for generating converted record pairs by converting record pairs into graphs.
- In step S104, the similarity calculation unit 13 calculates the similarity s regarding the converted record pair by inputting the converted record pair into the model MA.
- Processing examples 1 to 5 will be described below as processing examples of steps S101 to S104.
- Processing example 1 is a processing example in the case of using the document embedding model.
- Processing example 2 is a processing example in the case of using an image classification model.
- Processing example 3 is a processing example in the case of using a speech classification model.
- Processing example 4 is a processing example when using an entailment recognition model.
- Processing example 5 is a processing example in the case of using a paraphrase prediction model.
- In step S101, the acquisition unit 11 reads the document embedding model.
- In step S103, the conversion unit 12 converts the record pair (e, e') into the document pair (t, t').
- For example, the conversion unit 12 converts the record pair (e, e') consisting of record e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99)) and record e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)) into a document pair (t, t') containing document t and document t', where document t' is "Title is aspyr media inc sims 2 glamor life stuff pack."
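- The conversion of a record into a document can be sketched as follows. The template "<Attribute> is <value>." is inferred from the example document t' in the text, and skipping missing values (None or NaN) is assumed from the same example. The function name `record_to_document` is hypothetical.

```python
import math


def record_to_document(record):
    """Convert a record, given as (attribute, value) tuples, into a document.

    Assumption: one sentence "<Attribute> is <value>." per attribute,
    with missing values (None or NaN) left out of the document.
    """
    sentences = []
    for attribute, value in record:
        is_nan = isinstance(value, float) and math.isnan(value)
        if value is None or is_nan:
            continue  # missing value: produce no sentence
        sentences.append(f"{attribute.capitalize()} is {value}.")
    return " ".join(sentences)


e_prime = [
    ("title", "aspyr media inc sims 2 glamor life stuff pack"),
    ("price", float("nan")),
]
t_prime = record_to_document(e_prime)
```

- For record e', this sketch yields the same sentence as the document t' quoted in the text, because the price attribute is missing and is therefore omitted.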
- the similarity calculation unit 13 converts the document pair (t, t') into a vector pair (v, v') using the document embedding model.
- the similarity calculator 13 calculates the similarity s from the vector pair (v, v').
- Here, the superscript T is a symbol representing transposition.
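- The formula for the similarity s is not reproduced in this text. The reference to transposition suggests an inner product between v and v'. As one possible reading only, the sketch below uses cosine similarity; both the choice of cosine similarity and the function name `cosine_similarity` are assumptions.

```python
import math


def cosine_similarity(v, v_prime):
    """Cosine similarity between two equal-length vectors.

    Assumed form: s = (v^T v') / (|v| * |v'|). Returns 0.0 when either
    vector has zero length, to avoid division by zero.
    """
    dot = sum(a * b for a, b in zip(v, v_prime))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_v_prime = math.sqrt(sum(b * b for b in v_prime))
    if norm_v == 0.0 or norm_v_prime == 0.0:
        return 0.0
    return dot / (norm_v * norm_v_prime)


# Toy vectors standing in for the embeddings (v, v') of a document pair.
s_parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
s_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```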
- In step S101, the acquisition unit 11 reads an image embedding model.
- In step S103, the conversion unit 12 converts the record pair (e, e') into the image pair (i, i').
- FIG. 9 is a diagram showing an example of an image converted by the converter 12.
- For example, the conversion unit 12 converts the record pair (e, e') consisting of record e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99)) and record e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)) into the images i and i' shown in FIG. 9.
- In step S104, the similarity calculation unit 13 converts the image pair (i, i') into a vector pair (v, v') using the image embedding model.
- the similarity calculator 13 calculates the similarity s from the vector pair (v, v').
- the conversion unit 12 may convert one record into one image, or may perform image conversion for each element (eg, word) included in the record.
- the similarity calculation unit 13 calculates the similarity s using a set of images for each element. Further, when image conversion is performed for each element, the conversion unit 12 may not perform image conversion for missing values in records.
- In step S101, the acquisition unit 11 reads the speech embedding model. Also, in step S103, the conversion unit 12 converts the record pair (e, e') into the speech pair (i, i').
- For example, the conversion unit 12 converts the record pair (e, e') consisting of record e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99)) and record e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)) into speech data i representing record e read aloud and speech data i' representing record e' read aloud.
- In step S104, the similarity calculation unit 13 converts the speech data pair (i, i') into a vector pair (v, v') using the speech embedding model.
- the similarity calculator 13 calculates the similarity s from the vector pair (v, v').
- In step S101, the acquisition unit 11 reads an entailment recognition model. Further, in step S103, the conversion unit 12 converts the record pair (e, e') into the document pair (t, t').
- For example, the conversion unit 12 converts the record pair (e, e') consisting of record e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99)) and record e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)) into a document pair (t, t') containing document t and document t', where document t' is "Title is aspyr media inc sims 2 glamor life stuff pack."
- The similarity calculation unit 13 calculates the entailment score of the document pair (t, t') using the entailment recognition model. Furthermore, the similarity calculation unit 13 calculates the similarity s using the entailment score.
- The similarity s is, for example, the product of the entailment score M(t, t') of the entailment relation "if document t, then document t'" and the entailment score M(t', t) of the entailment relation "if document t', then document t".
- The similarity s is not limited to the example described above, and may be another value.
- The similarity s may be, for example, the maximum of the entailment scores M(t, t') and M(t', t), or the sum of the entailment score M(t, t') and the entailment score M(t', t).
- the similarity calculator 13 uses this relationship to calculate the similarity.
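- The three ways of combining the two directional entailment scores (product, maximum, sum) can be sketched as follows. The function name `entailment_similarity`, the `mode` parameter, and the score values 0.9 and 0.8 are hypothetical. In practice the scores would come from the entailment recognition model.

```python
def entailment_similarity(score_forward, score_backward, mode="product"):
    """Combine M(t, t') and M(t', t) into a single similarity s.

    score_forward  : entailment score M(t, t')
    score_backward : entailment score M(t', t)
    mode           : "product", "max", or "sum"
    """
    if mode == "product":
        return score_forward * score_backward
    if mode == "max":
        return max(score_forward, score_backward)
    if mode == "sum":
        return score_forward + score_backward
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical scores for one document pair.
m_forward, m_backward = 0.9, 0.8
s_product = entailment_similarity(m_forward, m_backward, "product")
s_max = entailment_similarity(m_forward, m_backward, "max")
s_sum = entailment_similarity(m_forward, m_backward, "sum")
```

- The product is high only when both directions are confident, which matches the intuition that identical records should entail each other. The maximum and the sum are more tolerant of one weak direction.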
- In step S101, the acquisition unit 11 reads a paraphrase prediction model. Further, in step S103, the conversion unit 12 converts the record pair (e, e') into the document pair (t, t').
- For example, the conversion unit 12 converts the record pair (e, e') consisting of record e = ((title, sims 2 glamor life stuff pack), (manufacturer, aspyr media), (price, 24.99)) and record e' = ((title, aspyr media inc sims 2 glamor life stuff pack), (price, NaN)) into a document pair (t, t') containing document t and document t', where document t' is "aspyr media inc sims 2 glamor life stuff pack".
- In step S104, the similarity calculation unit 13 calculates the paraphrase score of the document pair (t, t') using the paraphrase prediction model, and sets the calculated paraphrase score as the similarity s of the record pair. That is, in this processing example, the similarity calculation unit 13 calculates the similarity s by casting the record pair as a question of whether the two documents are paraphrases of each other.
- In step S105, the output unit 14 outputs the similarity s calculated by the similarity calculation unit 13. As an example, the output unit 14 outputs the similarity s by writing it into the storage unit 20A. In step S106, the identity determination unit 15A determines identity between the records included in the record pair based on the similarity s.
- In step S107, the integration unit 16A refers to the determination result of the identity determination unit 15A and generates integrated data from the first data x and the second data x'.
- the integrated data includes, for example, a record obtained by integrating records included in a record pair determined to be identical by the identity determination unit 15A.
- the information processing device 1A converts the record pair into a format corresponding to the input of the model MA and inputs it to the model MA to calculate the similarity s for the record pair.
- The model MA is selected from a plurality of model candidates, and a configuration is adopted in which the conversion unit 12 generates a converted record pair by converting the record pair (e, e') into a format corresponding to the input of the model MA.
- The record pair thereby becomes data in a format that can be input to the model MA. That is, no matter what kind of attributes the records whose similarity is to be calculated contain, the similarity calculator 13 can calculate the similarity s by inputting the converted record pair into the model MA.
- The model MA includes a document classification model, and a configuration is adopted in which the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into a document. Therefore, according to the information processing apparatus 1A according to this exemplary embodiment, the effect is obtained that the similarity of records having various attributes can be calculated using the model MA, which is a document classification model, without training the model MA.
- The model MA includes an image classification model, and a configuration is adopted in which the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into an image.
- With this configuration, the effect is obtained that a similarity reflecting the similarity of character shapes can be calculated.
- The model MA includes a speech classification model, and a configuration is adopted in which the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into speech.
- Because the information processing apparatus 1A converts the record pair into speech, the similarity between records that differ in characters but are similar in phonemes can be calculated more suitably. For example, a record containing the word "glamour" and a record containing the word "glamar" are pronounced similarly even though the character strings contained in the records differ, so a high similarity is calculated.
- With this configuration, the effect is obtained that a similarity reflecting the similarity between sounds can be calculated.
- The model MA includes a graph classification model, and a configuration is adopted in which the conversion processing by the conversion unit 12 includes processing for generating a converted record pair by converting the record pair into a graph.
- FIG. 10 is a block diagram showing the configuration of an information processing device 1B according to this exemplary embodiment.
- the control unit 10B of the information processing device 1B includes an acquisition unit 11B, a conversion unit 12B, a similarity calculation unit 13B, an output unit 14, an identity determination unit 15A, and an integration unit 16A.
- the storage unit 20B also stores the model MB in addition to the first data x, the second data x', and the similarity s.
- the acquisition unit 11B further acquires an auxiliary record in addition to the record pair (e, e').
- An auxiliary record is a record used as an aid in calculating the similarity of the record pair (e, e').
- the auxiliary record is, for example, a record included in the first data x and other than the record e included in the record pair (e, e').
- the auxiliary record is, for example, a record other than the record e' included in the record pair (e, e') which is included in the second data x'.
- the conversion unit 12B converts the record pairs acquired by the acquisition unit 11B to generate converted record pairs. Also, the conversion unit 12B generates a converted auxiliary record by converting the auxiliary record. For example, the conversion unit 12B converts the auxiliary records into data representing documents, images, sounds, or graphs.
- As an example, the conversion unit 12B generates the converted record pair by converting one record e included in the record pair (e, e') into a question sentence and converting each of the other record e' included in the record pair (e, e') and the auxiliary records into a response sentence.
- the similarity calculation unit 13B calculates the similarity regarding the converted record pair by inputting the converted record pair and the converted auxiliary record into the model MB.
- the model MB includes, as an example, a question-answering model that inputs question sentences and answer sentences.
- The question-answering model is a model that, given a question sentence, extracts an answer sentence from a document and outputs it.
- the question-answer model is a model generated by a technique called TANDA described in Non-Patent Document 5 as an example.
- Inputs of the question-answering model include, for example, a question sentence and a document.
- the question sentence is, for example, "Where is NEC's headquarters?"
- The document is, for example, "Nippon Denki (English: NEC Corporation) is an electronics manufacturer of the Sumitomo Group headquartered in Shiba 5-chome, Minato-ku, Tokyo. It is one of the constituent stocks of the Nikkei Stock Average."
- the output of the model includes, as an example, answer sentences and scores.
- An example of the reply sentence is "Shiba 5-chome, Minato-ku, Tokyo".
- the score is, for example, a real number between 0 and 1.
- a score for each word may be calculated when determining the output of the model.
- the question answering model calculates a score of "0.1" for "NEC CORPORATION”, a score of "0.02" for "Sumitomo Group”, and a score of "0.08" for "Nikkei Stock Average”.
- FIG. 11 is a flowchart showing the flow of information processing method S100B executed by information processing apparatus 1B. Note that some steps may be performed in parallel or out of order. Also, the description of the already described contents will not be repeated.
- the information processing method S100B includes steps S101B, S102, S102B, S103B, S104B, S105, S106, and S107.
- In step S101B, the acquisition unit 11B reads the model MB.
- In step S102B, the acquisition unit 11B reads the auxiliary record.
- In step S103B, the conversion unit 12B converts the record pair to generate a converted record pair, and converts the auxiliary record to generate a converted auxiliary record.
- In step S104B, the similarity calculation unit 13B inputs the converted record pair and the converted auxiliary record to the model MB to calculate the similarity regarding the converted record pair.
- A processing example of steps S101B to S104B using a question answering model is described below.
- the similarity calculation unit 13B reads a question answer model.
- the auxiliary record R is, for example, a set of all records included in the second data x'.
- the auxiliary record R is not limited to the example described above, and may be a set of other records.
- the auxiliary records R may be a set of records selected by a randomized algorithm from the second data x'.
- the auxiliary record R may be a blocked record set, such as a record set obtained by extracting records containing words common to the record e from the second data x'.
- auxiliary record R includes record e' contained in record pair (e, e').
- the question sentence q is preferably of the so-called 5W1H open question type.
- the conversion unit 12B converts the auxiliary record R into a document containing a plurality of reply sentences.
- <ID of e_j> is the unique ID assigned to record e_j ∈ R.
- the conversion unit 12B does not include missing values in the document during conversion.
- r2 is characterized as title of aspyr media inc sims 2 glamor life stuff pack.
- r3 is
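- The conversion of the auxiliary record R into a document of reply sentences can be sketched as follows. The template "<ID> is characterized as <attribute> of <value>." is inferred from the single example sentence for r2 above. Joining several attributes with " and " is an assumption, as are the function names. The conversion templates themselves are not reproduced in this text.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def record_to_reply_sentence(record_id, record):
    """Build one reply sentence for a record given as (attribute, value) tuples.

    Missing values are left out. A record with no usable value yields an
    empty string.
    """
    parts = [f"{attr} of {value}" for attr, value in record
             if not is_missing(value)]
    if not parts:
        return ""
    # Assumption: multiple attributes are joined with " and ".
    return f"{record_id} is characterized as " + " and ".join(parts) + "."


def auxiliary_records_to_document(auxiliary_records):
    """Concatenate the reply sentences of all auxiliary records into document c."""
    sentences = (record_to_reply_sentence(rid, rec)
                 for rid, rec in auxiliary_records.items())
    return " ".join(s for s in sentences if s)


r2 = [("title", "aspyr media inc sims 2 glamor life stuff pack"),
      ("price", float("nan"))]
sentence_r2 = record_to_reply_sentence("r2", r2)
```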
- In step S104B, the similarity calculation unit 13B inputs the question sentence q and the document c to the question answering model.
- The question answering model outputs a score indicating the degree of certainty that the answer to the input question sentence q is the answer sentence T3(e_j) (1 ≤ j ≤ k) extracted from the document c.
- the similarity calculator 13B calculates the similarity s based on the score output by the question answering model.
- The similarity s is, for example, MB(q, c, <ID of e'>), that is, the confidence that the record e' included in the record pair (e, e') is the answer sentence.
- the similarity s is not limited to this example, and the similarity calculation unit 13B may calculate the similarity s by another method.
- the similarity calculation unit 13B may take the sum of the score when the record e is used as the question and the score when the record e' is used as the question as the degree of similarity.
- FIG. 12 is a conceptual diagram of similarity calculation processing using the question answering model.
- The conversion unit 12B converts the record e and the auxiliary record R into a question sentence and a document, and the similarity calculation unit 13B inputs the question sentence and the document into the model MB, which is a question-answer model, whereby the similarity s is calculated.
- the information processing apparatus 1B calculates the similarity of the records by converting the records into the question-and-answer format.
- The acquisition unit 11B further acquires auxiliary records, the conversion unit 12B converts the auxiliary records to generate converted auxiliary records, and the similarity calculation unit 13B is configured to calculate the similarity regarding the converted record pair by inputting the converted record pair and the converted auxiliary records to the model MB. Therefore, according to the information processing apparatus 1B according to the present exemplary embodiment, the effect is obtained that similarities can be calculated for records having various attributes using the model MB without training the model MB.
- The model MB includes a question-answer model that takes a question sentence and response sentences as input, and the conversion unit 12B generates the converted record pair by converting one of the records included in the record pair into a question sentence and converting each of the other record included in the record pair and the auxiliary records into a response sentence. Therefore, according to the information processing apparatus 1B according to this exemplary embodiment, the effect is obtained that the similarity of records having various attributes can be calculated using the question-answer model without training the question-answer model.
- FIG. 13 is a block diagram showing the configuration of an information processing device 1C according to this exemplary embodiment.
- the control unit 10C of the information processing device 1C includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13C, a similarity integration unit 17C, an output unit 14C, an identity determination unit 15A, and an integration unit 16A.
- the storage unit 20C also stores the model MC in addition to the first data x, the second data x', and the similarity s.
- the similarity calculator 13C calculates a plurality of similarities si for one record pair (e, e'). As an example, the similarity calculator 13C calculates the first similarity s1 by inputting two records included in the record pair (e, e') to the model MC without interchanging them. Further, the similarity calculation unit 13C calculates the second similarity s2 by replacing the two records included in the record pair (e, e') with each other and inputting them to the model MC.
- The similarity of the record pair (e, e') is not necessarily equal to the similarity of the record pair (e', e). Therefore, in this exemplary embodiment, the similarity calculation unit 13C calculates both the similarity of the record pair (e, e') and the similarity of the record pair (e', e), and identity is determined by referring to these similarities.
- the method by which the similarity calculation unit 13C calculates a plurality of similarities si is not limited to the example described above, and the similarity calculation unit 13C may calculate a plurality of similarities si by other methods.
- the similarity calculator 13C may calculate a plurality of similarities si using a plurality of models.
- For example, the conversion unit 12 may perform a plurality of conversions on one record pair, and the similarity calculation unit 13C may calculate a plurality of similarities si by inputting the converted record pairs into the respective models (document classification model, image classification model, ...).
- Alternatively, one record pair may be converted by a plurality of conversion methods to generate a plurality of converted record pairs, and a plurality of similarities si may be calculated by inputting the plurality of converted record pairs into one model.
- the similarity integration unit 17C integrates a plurality of similarities si into an integrated similarity s.
- the similarity integration unit 17C calculates the post-integration similarity s by averaging or weighting a plurality of similarities si.
- the method by which the similarity integration unit 17C integrates a plurality of similarities si is not limited to the example described above, and the similarity integration unit 17C may calculate the post-integration similarity s by another method.
- the similarity integration unit 17C may set the sum or integrated value of a plurality of similarities si as the integrated similarity s.
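- The integration of a plurality of similarities si into one post-integration similarity s can be sketched as follows. The function name `integrate_similarities`, the `method` parameter, and the reading of "weighting" as a weighted average are assumptions. The sample values 9 and 10 are the two directional similarities quoted for the record pair (L1, R1) in the examples of this embodiment.

```python
def integrate_similarities(similarities, method="mean", weights=None):
    """Integrate several similarities s_i into one similarity s.

    method: "mean"     -> arithmetic mean
            "weighted" -> weighted average using `weights`
            "sum"      -> plain sum
    """
    if not similarities:
        raise ValueError("at least one similarity is required")
    if method == "mean":
        return sum(similarities) / len(similarities)
    if method == "weighted":
        if weights is None or len(weights) != len(similarities):
            raise ValueError("weights must match similarities in length")
        total_weight = sum(weights)
        return sum(w * s for w, s in zip(weights, similarities)) / total_weight
    if method == "sum":
        return sum(similarities)
    raise ValueError(f"unknown method: {method}")


s_i = [9, 10]
s_mean = integrate_similarities(s_i)                        # (9 + 10) / 2
s_weighted = integrate_similarities(s_i, "weighted", [1, 3])  # (9 + 30) / 4
s_sum = integrate_similarities(s_i, "sum")                  # 9 + 10
```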
- the similarity integration unit 17C is configured to determine the identity of the target record pair based on a plurality of similarities si regarding the target record pair.
- the output unit 14C outputs an integrated similarity s obtained by integrating the plurality of similarities si. As an example, the output unit 14C outputs the similarity s by writing it into the storage unit 20C.
- the model MC is a model for calculating the degree of similarity.
- the model MC is, for example, a model that is asymmetric with respect to the mutual replacement of two elements that are input to the model.
- the model MC includes, as an example, at least one of an entailment recognition model, a paraphrase prediction model, and a question answer model.
- FIG. 14 is a diagram showing a specific example of the similarity si calculated by the similarity calculation unit 13C.
- In the example of FIG. 14, the first similarity s1 calculated by the similarity calculation unit 13C for the record pair (L1, R1) is "9", and the second similarity s2 calculated by the similarity calculation unit 13C for the record pair (R1, L1), obtained by exchanging the two records, is "10".
- The similarity calculation unit 13C calculates the first similarity s1 and the second similarity s2 for each record pair, and the identity determination unit 15A determines that the records are identical if both the first similarity s1 and the second similarity s2 are the highest compared with other record pairs.
- the identity determination unit 15A determines that record L1 and record R1 are the same, and that record L2 and record R3 are the same.
- FIG. 15 is a diagram showing another example of the similarity si calculated by the similarity calculation unit 13C.
- the similarity integration unit 17C aggregates bidirectional similarities. For example, the similarity integration unit 17C sets the sum of the similarity s1 of the record pair (L1, R1) and the similarity s2 of the record pair (R1, L1) as the similarity s.
- The similarity s of the record pair (L1, R1) is the sum of "10" and "9", that is, "19", and the similarity s of the record pair (L1, R2) is the sum of "9" and "7", that is, "16".
- The similarity s of the record pair (L2, R2) is the sum of "9" and "4", that is, "13", and the similarity s of the record pair (L2, R3) is the sum of "8" and "8", that is, "16".
- the identity determination unit 15A determines that record L1 and record R1 are the same, and that record L2 and record R3 are the same, as in the example of FIG. In this example, the identity determination unit 15A further determines that record pairs having a similarity s equal to or higher than a predetermined threshold among the record pairs determined to be identical are also identical.
- The threshold is, for example, the minimum value ("13" in the example of FIG. 15) of the similarities s of record pairs determined to be identical. If the proportion of identical to non-identical pairs is known, the threshold may be determined based on that proportion. When the threshold is "13" in the example of FIG. 15, the identity determination unit 15A determines that the record pair (L1, R2) is also identical.
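- The summation of bidirectional similarities and the selection of the best-matching record can be sketched as follows. Only the four record pairs whose values are quoted in the text are used. They are not the full table of FIG. 15, so the threshold step is deliberately left out of this sketch. The function name `sum_and_match` is hypothetical.

```python
# Directional similarity values quoted in the text for four record pairs.
bidirectional = {
    ("L1", "R1"): (10, 9),
    ("L1", "R2"): (9, 7),
    ("L2", "R2"): (9, 4),
    ("L2", "R3"): (8, 8),
}


def sum_and_match(scores):
    """Sum the two directional similarities per pair, then pick, for each
    left-hand record, the right-hand record with the highest summed value."""
    summed = {pair: a + b for pair, (a, b) in scores.items()}
    best = {}
    for (left, right), s in summed.items():
        if left not in best or s > summed[(left, best[left])]:
            best[left] = right
    return summed, best


summed, best = sum_and_match(bidirectional)
```

- With these values, the summed similarities are 19, 16, 13, and 16, and the best match is R1 for L1 and R3 for L2, consistent with the determination described above.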
- The similarity calculation unit 13C calculates a plurality of similarities si for the record pair, and a configuration is adopted in which the output unit 14C outputs the post-integration similarity s obtained by integrating the plurality of similarities si. Therefore, according to the information processing apparatus 1C according to the present exemplary embodiment, the effect is obtained that the similarity s of the record pair can be calculated more accurately.
- The model MC is a model that is asymmetric with respect to the mutual replacement of the two elements input to the model. The similarity calculation unit 13C calculates the first similarity s1 by inputting the two records included in the record pair (e, e') to the model MC without exchanging them, and calculates the second similarity s2 by exchanging the two records included in the record pair (e, e') and then inputting them to the model MC. Therefore, according to the information processing apparatus 1C according to the present exemplary embodiment, the effect is obtained that the similarity s of the records can be calculated more accurately by integrating the first similarity s1 and the second similarity s2.
- FIG. 16 is a block diagram showing the configuration of an information processing device 1D according to this exemplary embodiment.
- a control unit 10D of the information processing device 1D includes an acquisition unit 11, a conversion unit 12, a similarity calculation unit 13, an output unit 14, an identity determination unit 15A, and a search result output unit 18D.
- the acquisition unit 11 acquires input data from the user as the first record e included in the record pair (e, e').
- Input data from the user is, for example, input by an input device (for example, a keyboard, a mouse, etc.) connected to the input/output unit 40A.
- the acquiring unit 11 acquires one of the plurality of records included in the target data as the second record e' included in the record pair (e, e').
- the target data is data to be searched, and includes, for example, one or more tables.
- the identity determination unit 15A performs identity prediction for record pairs of the first record e and each of the plurality of records included in the target data.
- Based on the similarity s calculated by the similarity calculation unit 13, the search result output unit 18D outputs search results based on the input data, with the target data as the search target.
- the search result output unit 18D refers to the determination result of the identity determination unit 15A and outputs the search result based on the input data and the target data as the search target.
- the search result output unit 18D outputs search results to an output device (display, printer, etc.) connected to the input/output unit 40A.
- the search result output unit 18D may output the search result by transmitting the search result to another device connected via the communication unit 30A.
- the search result output unit 18D may output search results by storing the search results in the storage unit 20A or an external storage device.
- FIG. 17 is a diagram showing a specific example of screen display output by the search result output unit 18D.
- the input data is a character string that the user inputs into the text box 51
- the target data are tables T1 and T2 having a plurality of records.
- The identity determination unit 15A determines the identity of the record pairs formed by the first record e, which is the user's input data, and each record e' included in the table T1 and the table T2.
- the search result output unit 18D refers to the determination result of the identity determination unit 15A and outputs search results 53 and 54 based on the input data.
- a search result 53 is a search result obtained by searching the table T1 using the character string "potato chips" as input data.
- a search result 54 is a search result obtained by searching the table T2 using the character string "potato chips" as input data.
- As described above, the information processing apparatus 1D according to the present exemplary embodiment adopts a configuration that refers to the determination result of the identity determination unit 15A and outputs search results based on the input data, with the target data as the search target. Therefore, in addition to the effects of the information processing apparatus 1 according to the first exemplary embodiment, the information processing apparatus 1D can search the target data based on the input data more suitably.
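The search flow of this embodiment can be sketched as below, under stated assumptions. `stand_in_similarity` uses the standard-library `difflib.SequenceMatcher` purely as a placeholder for the similarity that the source obtains from the model M; `record_to_document`, `search`, and the 0.5 threshold are hypothetical names and values chosen for illustration.

```python
from difflib import SequenceMatcher


def record_to_document(record: dict) -> str:
    """Convert a flat record into a short document, one sentence per attribute."""
    return " ".join(f"{key} is {value}." for key, value in record.items())


def stand_in_similarity(query: str, document: str) -> float:
    """Placeholder for the model-based similarity s (an assumption)."""
    return SequenceMatcher(None, query.lower(), document.lower()).ratio()


def search(query: str, table: list, threshold: float = 0.5) -> list:
    """Pair the query with every record and keep the pairs judged identical."""
    hits = []
    for record in table:
        s = stand_in_similarity(query, record_to_document(record))
        if s >= threshold:  # identity determination by thresholding
            hits.append((s, record))
    hits.sort(key=lambda hit: hit[0], reverse=True)  # best match first
    return hits
```

Calling `search("potato chips", table)` once per table corresponds to producing one result list per table, in the manner of search results 53 and 54.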
- The information processing device 1D can also be described as follows: an information processing device comprising: acquisition means for acquiring input data from a user and one of a plurality of records included in target data as a record pair; transforming means for transforming the record pair to generate a transformed record pair; similarity calculating means for calculating a similarity with respect to the converted record pair by inputting the converted record pair into a model; and output means for referring to the similarity calculated by the similarity calculating means and outputting search results based on the input data, with the target data as the search target.
- the information processing apparatuses 1, 1A, 1B, 1C, and 1D (hereinafter referred to as "information processing apparatuses 1, etc.")
- the identity with the contained record e' was determined.
- a plurality of records to be determined by the information processing apparatus 1 or the like may be records included in different data, or may be records included in common data.
- the information processing device 1 and the like may execute processing for searching for the same record from one database.
- the information processing apparatus 1 and the like may integrate three or more data.
- The information processing device 1 or the like may select the models MA, MB, and MC (hereinafter referred to as "model M") from a plurality of model candidates, or the user may select the model M.
- the algorithm by which the information processing device 1 or the like selects the model M is not limited, but as an example, the information processing device 1 or the like may select the model M on a rule basis.
- the information processing device 1 or the like may select the model M according to the characteristics of the record pair.
- the characteristics of a record pair include, for example, the attribute of the record included in the record pair, the data size of the record, the type of database to which the record belongs, and the attribute of the database.
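A rule-based selection of the model M from record-pair characteristics might look like the sketch below. The rules, the `PairFeatures` fields, and the model names are all hypothetical; the source only states that selection may be rule-based and may depend on characteristics such as record attributes and database type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairFeatures:
    """Characteristics of a record pair (hypothetical field names)."""
    attributes: frozenset   # attribute names present in the records
    database_type: str      # e.g. "relational" or "graph"


# Ordered rules: the first predicate that matches decides the model.
RULES = [
    (lambda f: f.database_type == "graph", "graph_classification_model"),
    (lambda f: "image" in f.attributes, "image_classification_model"),
    (lambda f: True, "document_classification_model"),  # default rule
]


def select_model(features: PairFeatures) -> str:
    """Return the name of the first model candidate whose rule matches."""
    for predicate, model_name in RULES:
        if predicate(features):
            return model_name
    raise ValueError("no rule matched")  # unreachable while a default rule exists
```

Keeping the rules in an ordered list makes the precedence explicit and lets new candidates be added without touching `select_model`.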
- the data containing records e, e' may be semi-structured data such as JSON or XML.
- the records are, by way of example, web pages contained in the target site.
- As an example, suppose the record e is {id1: value1, id2: {id2-1: value2-1, id2-2: value2-1}, id3: value3}.
- The converted document is, for example, "id1 is value1. id2-1 of id2 is value2-1. id2-2 of id2 is value2-1. id3 is value3."
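The conversion of a nested, semi-structured record into a document can be sketched as a recursive walk. This is one possible reading of the example above; the function names are hypothetical, and the "child of parent" phrasing is taken directly from the example sentences.

```python
def record_sentences(record: dict, parents: tuple = ()):
    """Yield one sentence per leaf value, naming enclosing keys as 'of <parent>'."""
    for key, value in record.items():
        if isinstance(value, dict):
            # Descend into the nested object, remembering the nearest parent first.
            yield from record_sentences(value, (key,) + parents)
        else:
            owner = "".join(f" of {parent}" for parent in parents)
            yield f"{key}{owner} is {value}."


def record_to_document(record: dict) -> str:
    """Join the per-attribute sentences into a single document string."""
    return " ".join(record_sentences(record))
```

Applied to the record e of the example, `record_to_document` reproduces the converted document quoted above.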
- the record according to the present specification may be graph data as shown in FIG. 18, for example.
- FIG. 18 is a diagram showing an example of graph data.
- face matching can be performed by applying the information processing apparatus 1 or the like according to the present specification to graph data.
- The document after conversion is, as an example, “1 and 2 are linked. 1 and 4 are linked. 2 and 3 are linked. 2 and 4 are linked.”
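A minimal sketch of this graph-to-document conversion follows, assuming the graph is given as a list of undirected edges between numbered nodes. The edge list used in the usage note is inferred from the example sentences rather than read from FIG. 18, and `graph_to_document` is a hypothetical name.

```python
def graph_to_document(edges) -> str:
    """Describe each undirected edge once, in a deterministic order."""
    # Normalise (u, v) and (v, u) to the same pair, then drop duplicates.
    unique_edges = sorted({(min(u, v), max(u, v)) for u, v in edges})
    return " ".join(f"{u} and {v} are linked." for u, v in unique_edges)
```

For example, `graph_to_document([(1, 2), (1, 4), (2, 3), (2, 4)])` returns the document quoted above.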
- Data containing records may be a graph database as shown in FIG. 19, for example.
- By applying the information processing apparatus 1 or the like according to the present specification to a graph database, it is possible, for example, to determine the identity of different SNS (Social Networking Service) communities and to investigate criminal organizations.
- When the graph database is as shown in FIG. 19, the document after conversion is as follows: “Taro of age 23 follows Sakura of age 26. Taro of age 23 follows Emi of age 25. Sakura of age 26 follows Emi of age 25. Sakura of age 26 wrote via smartphone tweet of text “I'm sleepy.” date 20XX/YY/ZZ. Emi of age 25 follows Sakura of age 26. Emi of age 25 follows Taro of age 23.”
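The sentence pattern in this example can be sketched for the "follows" relationships as below. The data layout (a dictionary of node properties plus a list of labelled edges) and the function names are assumptions made for illustration; FIG. 19 itself is not reproduced here, and edge properties such as "via smartphone" are left out of the sketch.

```python
def describe(name: str, properties: dict) -> str:
    """Render a node as '<name> of <key> <value> ...', e.g. 'Taro of age 23'."""
    if not properties:
        return name
    details = " ".join(f"{key} {value}" for key, value in properties.items())
    return f"{name} of {details}"


def property_graph_to_document(nodes: dict, edges: list) -> str:
    """Emit one sentence per (source, relation, destination) edge."""
    sentences = [
        f"{describe(src, nodes[src])} {relation} {describe(dst, nodes[dst])}."
        for src, relation, dst in edges
    ]
    return " ".join(sentences)
```

With nodes `{"Taro": {"age": 23}, "Sakura": {"age": 26}}` and the single edge `("Taro", "follows", "Sakura")`, the result is the first sentence of the document quoted above.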
- the information processing device 1 and the like may be configured to execute the learning phase for learning the model M.
- The method of machine learning for the model M is not limited, but as an example, a decision tree-based, linear regression, or neural network method may be used, or two or more of these methods may be used in combination.
- FIG. 20 schematically shows a configuration in which trained converters 121 and 122 with learnable parameters are provided before and after the model M.
- The trained converters 121 and 122 have learnable parameters. They are models in which a learning unit (not shown) uses training data to optimize how records are converted (for example, how sentences are composed or how many auxiliary records are used) and/or the conversion parameters. Providing the trained converters 121 and 122 makes it possible to calculate the similarity of records with higher accuracy.
- The machine learning method of the trained converters 121 and 122 is not limited, but as an example, a decision tree-based, linear regression, or neural network method may be used, or two or more of these methods may be used in combination. The trained converters 121 and 122 may also be models generated by active learning.
- Some or all of the functions of the information processing apparatuses 1, 1A, 1B, 1C, and 1D may be implemented by hardware such as integrated circuits (IC chips), or may be implemented by software.
- the information processing apparatuses 1, 1A, 1B, 1C, and 1D are implemented by computers that execute program instructions, which are software that implements each function, for example.
- An example of such a computer (hereinafter referred to as computer C) is shown in FIG.
- Computer C comprises at least one processor C1 and at least one memory C2.
- a program P for operating the computer C as the information processing apparatuses 1, 1A, 1B, 1C, and 1D is recorded in the memory C2.
- the processor C1 reads the program P from the memory C2 and executes it, thereby implementing the functions of the information processing apparatuses 1, 1A, 1B, 1C, and 1D.
- As the processor C1, for example, a CPU (Central Processing Unit), GPU (Graphic Processing Unit), DSP (Digital Signal Processor), MPU (Micro Processing Unit), FPU (Floating point number Processing Unit), PPU (Physics Processing Unit), a microcontroller, or a combination thereof can be used.
- As the memory C2, for example, a flash memory, HDD (Hard Disk Drive), SSD (Solid State Drive), or a combination thereof can be used.
- the computer C may further include a RAM (Random Access Memory) for expanding the program P during execution and temporarily storing various data.
- Computer C may further include a communication interface for sending and receiving data to and from other devices.
- Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.
- The program P can be recorded on a non-transitory tangible recording medium M that is readable by the computer C.
- As the recording medium M, for example, a tape, disk, card, semiconductor memory, programmable logic circuit, or the like can be used.
- the computer C can acquire the program P via such a recording medium M.
- the program P can be transmitted via a transmission medium.
- As the transmission medium, for example, a communication network or broadcast waves can be used.
- Computer C can also obtain program P via such a transmission medium.
- Appendix 2. Some or all of the above-described embodiments may also be described as follows. However, the present invention is not limited to the embodiments described below. (Appendix 1) An information processing device comprising: an acquisition means for acquiring a record pair; a transforming means for transforming the record pair to generate a transformed record pair; a similarity calculation means for calculating a similarity regarding the converted record pair by inputting the converted record pair into a model; and an output means for outputting the similarity calculated by the similarity calculation means.
- the model is selected from a plurality of model candidates, the transforming means generates the transformed record pair by transforming the record pair into a format corresponding to the input of the model;
- the information processing device according to appendix 1.
- the model includes a document classification model;
- the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into a document.
- the information processing device according to appendix 1 or 2.
- the model includes an image classification model
- the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into an image.
- the information processing apparatus according to any one of Appendices 1 to 3.
- the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into speech.
- the information processing apparatus according to any one of Appendices 1 to 4.
- the model includes a graph classification model
- the conversion processing by the conversion means includes processing for generating the converted record pair by converting the record pair into a graph.
- the information processing apparatus according to any one of Appendices 1 to 5.
- the obtaining means further obtains an auxiliary record,
- the conversion means generates a converted auxiliary record by converting the auxiliary record;
- the similarity calculation means calculates the similarity regarding the converted record pair by inputting the converted record pair and the converted auxiliary record into the model.
- the information processing apparatus according to any one of Appendices 1 to 6.
- the model includes a question-answer model in which a question sentence and an answer sentence are input,
- the conversion means converts one record included in the record pair into a question sentence, and converts the other record included in the record pair and each of the auxiliary records into an answer sentence, thereby generating the converted record pair,
- the information processing device according to appendix 7.
- the similarity calculating means calculates a plurality of similarities with respect to the record pair,
- the output means outputs an integrated similarity obtained by integrating the plurality of similarities.
- the model is a model that has asymmetry with respect to the replacement of two elements input to the model,
- the similarity calculation means calculates a first similarity by inputting the two records included in the record pair into the model without swapping them, and calculates a second similarity by swapping the two records included in the record pair and then inputting them into the model,
- the information processing device according to appendix 9.
- (Appendix 11) An information processing method including: at least one processor obtaining a record pair; generating a transformed record pair by transforming the record pair; calculating a similarity for the transformed record pair by inputting the transformed record pair into a model; and outputting the calculated similarity.
- An information processing apparatus comprising at least one processor, the processor executing: an acquisition process of obtaining a record pair; a conversion process of transforming the record pair to generate a transformed record pair; a similarity calculation process of calculating a similarity regarding the transformed record pair by inputting the transformed record pair into a model; and an output process of outputting the similarity calculated in the similarity calculation process.
- This information processing apparatus may further include a memory, and this memory may store a program for causing the processor to execute the acquisition process, the conversion process, the similarity calculation process, and the output process. This program may also be recorded on a computer-readable non-transitory tangible recording medium.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides, as a technology for calculating the similarity between a pair of records, a technology that does not require training data relating to record pairs and that can handle data of different types. To this end, an information processing device (1) comprises: an acquisition unit (11) that acquires a record pair; a conversion unit (12) that converts the record pair to generate a converted record pair; a similarity calculation unit (13) that inputs the converted record pair into a model to calculate a similarity relating to the converted record pair; and an output unit (14) that outputs the similarity calculated by the similarity calculation unit (13).
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/008227 WO2023162206A1 (fr) | 2022-02-28 | 2022-02-28 | Information processing device, information processing method, and information processing program |
JP2024502726A JPWO2023162206A1 (fr) | 2022-02-28 | 2022-02-28 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/008227 WO2023162206A1 (fr) | 2022-02-28 | 2022-02-28 | Information processing device, information processing method, and information processing program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023162206A1 (fr) | 2023-08-31 |
Family
ID=87765225
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/008227 WO2023162206A1 (fr) | 2022-02-28 | 2022-02-28 | Information processing device, information processing method, and information processing program |
Country Status (2)
Country | Link |
---|---|
JP (1) | JPWO2023162206A1 (fr) |
WO (1) | WO2023162206A1 (fr) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP7454156B1 (ja) | 2023-12-26 | 2024-03-22 | ファーストアカウンティング株式会社 | Information processing device, information processing method, and program |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170091274A1 (en) * | 2015-09-30 | 2017-03-30 | Linkedin Corporation | Organizational data enrichment |
JP2019185244A (ja) * | 2018-04-05 | 2019-10-24 | 富士通株式会社 | Learning program and learning method |
JP2021174300A (ja) * | 2020-04-27 | 2021-11-01 | アットホームラボ株式会社 | Information processing device, information processing method, and information processing program |
US20210374164A1 (en) * | 2020-06-02 | 2021-12-02 | Banque Nationale Du Canada | Automated and dynamic method and system for clustering data records |
US20210374186A1 (en) * | 2020-05-26 | 2021-12-02 | Rovi Guides, Inc. | Automated metadata asset creation using machine learning models |
JP2022510818A (ja) * | 2018-11-20 | 2022-01-28 | アマゾン テクノロジーズ インコーポレイテッド | Transliteration of data records for improved data matching |
-
2022
- 2022-02-28 WO PCT/JP2022/008227 patent/WO2023162206A1/fr unknown
- 2022-02-28 JP JP2024502726A patent/JPWO2023162206A1/ja active Pending
Non-Patent Citations (1)
Title |
---|
YULIANG LI; JINFENG LI; YOSHIHIKO SUHARA; ANHAI DOAN; WANG-CHIEW TAN: "Deep Entity Matching with Pre-Trained Language Models", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 2 September 2020 (2020-09-02), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081753835, DOI: 10.14778/3421424.3421431 * |
Also Published As
Publication number | Publication date |
---|---|
JPWO2023162206A1 (fr) | 2023-08-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22928732 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2024502726 Country of ref document: JP Kind code of ref document: A |