WO2022130579A1 - Similarity determination program, similarity determination device, and similarity determination method - Google Patents
- Publication number
- WO2022130579A1 (application PCT/JP2020/047219)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- document
- similarity
- groups
- vectors
- named entity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Definitions
- the present invention relates to a similarity determination program, a similarity determination device, and a similarity determination method.
- one of the objects of the present invention is to improve the accuracy of determining the degree of similarity between partially similar documents.
- The similarity determination program may cause a computer to execute the following processing.
- The processing may include calculating a first plurality of vectors, one for each of a first plurality of partial document groups, based on the words contained in each of those groups. The first plurality of partial document groups is obtained by classifying a first plurality of partial documents, obtained by dividing a first document, based on a first plurality of groups into which a first plurality of named entities included in the first plurality of partial documents are classified.
- The processing may include acquiring a second plurality of vectors, one for each of a second plurality of partial document groups obtained by classifying a second plurality of partial documents obtained by dividing a second document.
- The processing may include determining the degree of similarity between the first document and the second document based on a comparison between the first plurality of vectors and the second plurality of vectors.
- the present invention can improve the accuracy of determining the degree of similarity between partially similar documents.
- HW hardware
- FIG. 1 is a diagram for explaining the similarity determination system 100 according to a comparative example.
- The similarity determination system 100 calculates a similarity based on word meaning vectors, using a query 101 that requests a similarity determination for a query document (input document) and a document set 102 that includes one or more comparison target documents.
- the similarity determination system 100 extracts words from each of a plurality of documents, that is, the query document included in the query 101 and the comparison target document included in the document set 102, for example, by morphological analysis (process P110).
- the similarity determination system 100 statistically calculates the word weights for each of the plurality of documents based on the words obtained in the process P110 (process P120). For example, the similarity determination system 100 may evaluate the importance of a word in a document as a weight by using an evaluation method such as tf-idf (Term Frequency-Inverse Document Frequency).
- The similarity determination system 100 executes process P130 in parallel with, or before or after, at least part of process P120. For example, the similarity determination system 100 calculates word vectors for each of the plurality of documents based on the words obtained in process P110 (process P130).
- the word vector may be referred to as a word embedding vector or a meaning vector.
- the similarity determination system 100 may search a vector database in which a vector expressing the meaning of a word is stored and acquire a word vector.
- For each document, the similarity determination system 100 calculates a document vector by multiplying the word vector acquired in process P130 by the word weight acquired in process P120 and summing the products over all the words in the document. The similarity determination system 100 then calculates the text similarity between the query document and each comparison target document as the similarity between the document vector of the query document and the document vector of that comparison target document (process P140).
- the similarity determination system 100 performs ranking processing based on the calculated text similarity (processing P150), and stores the comparison target document having a high similarity with the query document as the ranking result 103 together with the similarity.
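The comparative pipeline (processes P110 to P150) can be sketched as follows. This is a minimal sketch under stated assumptions, not the system's actual implementation: the whitespace tokenizer stands in for morphological analysis, the toy `WORD_VECTORS` table stands in for the vector database, and the particular tf-idf variant and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical 2-dimensional word vectors standing in for a vector database.
WORD_VECTORS = {
    "graphite": (1.0, 0.0),
    "silicon": (0.9, 0.1),
    "cobalt": (0.0, 1.0),
    "nickel": (0.1, 0.9),
}

def tokenize(text):
    # Stand-in for morphological analysis (P110): lowercase whitespace split,
    # keeping only words that have a vector.
    return [w for w in text.lower().split() if w in WORD_VECTORS]

def tfidf_weights(docs_tokens):
    # P120: one common tf-idf variant (assumed; the text does not fix a formula).
    n = len(docs_tokens)
    df = Counter(w for toks in docs_tokens for w in set(toks))
    weights = []
    for toks in docs_tokens:
        tf = Counter(toks)
        weights.append(
            {w: tf[w] / len(toks) * (math.log(n / df[w]) + 1.0) for w in tf}
        )
    return weights

def document_vector(weights):
    # P130/P140: sum of (word weight x word vector) over the words of a document.
    dim = len(next(iter(WORD_VECTORS.values())))
    vec = [0.0] * dim
    for word, weight in weights.items():
        for i, x in enumerate(WORD_VECTORS[word]):
            vec[i] += weight * x
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank(query, candidates):
    # P140/P150: score each comparison target document against the query
    # and sort in descending order of similarity.
    tokens = [tokenize(query)] + [tokenize(c) for c in candidates]
    vectors = [document_vector(w) for w in tfidf_weights(tokens)]
    scores = [cosine(vectors[0], v) for v in vectors[1:]]
    return sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
```

Calling `rank("graphite silicon", ["cobalt nickel", "graphite graphite"])` places the second candidate first, because a single vector per document is compared.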
- FIG. 2 is a diagram illustrating an example of determination of similarity by the similarity determination system 100 shown in FIG. 1.
- the example of FIG. 2 shows a case where the similarity is determined for the query document 101a and the comparison target document 102a relating to the lithium ion battery.
- The "document" here includes a document that describes a plurality of elements, for example, a patent document or a paper describing a device, a system, a manufacturing method, or the like having a plurality of components.
- Compound names related to each classification (group) of the components of the lithium ion battery, such as "positive electrode active material", "negative electrode active material", "binder", "electrolyte", and "electrolyte solution solvent", may be mixed in the description.
- Differences from the comparison target document in other elements, in other words, elements that are not the investigation target, may affect the result of the similarity determination between documents.
- the document as a whole may be calculated as a value having a low degree of similarity.
- the accuracy of determining the degree of similarity between partially similar documents may decrease.
- Although the semantic vector space is shown in two dimensions in FIG. 2 for convenience, the vectors can actually have several hundred dimensions. As the number of dimensions of the semantic vector space increases, it becomes more likely that the similarity is judged to be low when the descriptions of other elements differ between the documents, even if the elements to be investigated are similar.
- The similarity determination system 1 acquires, for each document, a plurality of vectors corresponding to the partial documents obtained by dividing the document, and determines the document similarity based on a comparison of the plurality of vectors between the documents.
- the similarity determination system 1 may acquire a plurality of sub-document groups by classifying a plurality of sub-documents based on a plurality of groups. Further, the similarity determination system 1 may use the similarity of the subdocument clusters having the highest similarity as the document similarity by comparing the similarity between the subdocument clusters of both documents to be determined.
- FIG. 3 is a diagram for explaining the similarity determination system 1 according to the first embodiment
- FIGS. 4 and 5 are diagrams for explaining an example of processing of the similarity determination system 1.
- The similarity determination system 1 calculates a similarity based on word meaning vectors, using a query 11 that requests a similarity determination for a query document (input document) and a document set (document group) 12 that includes one or more comparison target documents to be determined.
- the similarity determination system 1 determines the similarity between the query document 11a specified by the query 11 and the comparison target document 12a in the document set 12.
- the query document 11a is an example of the first document
- the comparison target document 12a is an example of the second document.
- the similarity determination system 1 extracts words from each of a plurality of documents by, for example, morphological analysis (process P1), as in the comparative example.
- the similarity determination system 1 statistically calculates the word weights for each of the plurality of documents based on the words obtained in the process P1 (process P2). For example, the similarity determination system 1 may evaluate the importance of a word in a document as a weight by using an evaluation method such as tf-idf.
- The similarity determination system 1 executes process P3 in parallel with, or before or after, at least part of process P2. For example, the similarity determination system 1 calculates word vectors for each of the plurality of documents based on the words obtained in process P1 (process P3).
- the word vector may be referred to as a word embedding vector or a meaning vector.
- the similarity determination system 1 may search a vector database in which a vector expressing the meaning of a word is stored and acquire a word vector.
- the similarity determination system 1 may acquire a word vector corresponding to each of the words obtained in the process P1 based on the trained model.
- The similarity determination system 1 divides each of the plurality of documents into a plurality of sub-documents (for example, paragraphs) and clusters the sub-documents based on the named entities included in each sub-document to generate sub-document clusters (process P4). The similarity determination system 1 also calculates a partial document vector for each partial document cluster.
- the similarity determination system 1 calculates the text similarity between the partial document clusters based on the plurality of partial document vectors of the query document 11a and each of the plurality of partial document vectors of the comparison target document 12a (process P5).
- The similarity determination system 1 performs a ranking process of ranking each of the plurality of comparison target documents 12a according to the similarity with the query document 11a based on the text similarity (process P6), and outputs the result 13.
- the result 13 may include a ranking result.
- the similarity determination system 1 acquires a plurality of partial documents (partial texts) by dividing the document for each document in the process P4.
- Sub-documents, in other words, units of document division, include, for example, sentences, paragraphs, chapters, and sections.
- the partial document is a paragraph.
- The similarity determination system 1 divides the query document 11a (document X) to acquire a plurality of paragraphs PX, which are examples of the first plurality of partial documents, and divides the comparison target document 12a (document Y) to acquire a plurality of paragraphs PY, which are examples of the second plurality of partial documents.
- When paragraphs PX and PY are not distinguished from each other, they are simply referred to as "paragraph P".
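The document division step can be illustrated with a short sketch. It assumes, purely for illustration, that paragraphs are separated by blank lines; the function name `split_into_paragraphs` is hypothetical, and other division units (sentences, chapters, sections) would need a different splitter.

```python
import re

def split_into_paragraphs(document: str) -> list[str]:
    # Assumption: paragraphs are separated by one or more blank lines.
    parts = re.split(r"\n\s*\n", document)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]
```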
- The similarity determination system 1 acquires a plurality of partial document clusters PX1 to PXN and PY1 to PYM by clustering the plurality of paragraphs P based on named entity lists indicating named entity clusters, for example, the compound lists CX1 to CXN and CY1 to CYM shown in the drawings.
- the named entity is the compound name and the document is a document in the field of chemistry that includes the compound name.
- various existing methods such as the shortest distance method may be used.
- The compound lists CX1 to CXN are lists of the first plurality of named entities contained in the document X, for example, lists of named entity clusters obtained by classifying a plurality of compound names, and are examples of the first plurality of groups.
- The compound lists CX1 to CXN may be acquired by classifying, in other words, clustering, the names of the first plurality of compounds based on the respective positions of the first plurality of compounds and the similarity among the first plurality of compounds, and may be referred to as a first cluster group.
- N is an integer of 1 or more, and indicates the number of groups included in the document X, in other words, the number of clusters.
- The compound lists CY1 to CYM are lists of the second plurality of named entities contained in the document Y, for example, lists of named entity clusters obtained by classifying a plurality of compound names, and are examples of the second plurality of groups.
- The compound lists CY1 to CYM may be acquired by classifying, in other words, clustering, the names of the second plurality of compounds based on the respective positions of the second plurality of compounds and the similarity among the second plurality of compounds, and may be referred to as a second cluster group.
- M is an integer of 1 or more, and indicates the number of groups included in the document Y, in other words, the number of clusters.
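The shortest distance method mentioned above is commonly known as single-linkage clustering. The following is a minimal sketch of how compound-name mentions might be grouped with it. The distance function, the stopping `threshold`, and the use of character offsets alone are assumptions made for illustration; the text states that both positions and similarity of the compounds are used, without fixing how they are combined.

```python
def single_linkage(items, distance, threshold):
    # Start with one cluster per item, then repeatedly merge the two clusters
    # whose closest members are nearest, until no pair is within `threshold`.
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        # j > i, so removing cluster j does not shift cluster i.
        clusters[i].extend(clusters.pop(j))
    return clusters

# Hypothetical mentions: (compound name, character offset in the document).
mentions = [("graphite", 10), ("silicon", 25), ("LiCoO2", 400), ("LiNiO2", 430)]
compound_clusters = single_linkage(
    mentions, distance=lambda a, b: abs(a[1] - b[1]), threshold=100
)
```

With these toy offsets, the two compounds mentioned near the start of the document end up in one list and the two mentioned later in another.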
- the similarity determination system 1 may cluster the paragraph P by a clustering process using the degree of agreement between the named entity included in the named entity cluster and the named entity included in the plurality of paragraphs P.
- For the document X, the similarity determination system 1 generates the partial document clusters PX1 to PXN based on the degree of coincidence cos (CPX, CXa) between each of the compound lists CX1 to CXN for each cluster and each of the plurality of paragraphs PX, according to the following formula (1). Likewise, for the document Y, the similarity determination system 1 generates the partial document clusters PY1 to PYM based on the degree of coincidence cos (CPY, CYb) between each of the compound lists CY1 to CYM for each cluster and each of the plurality of paragraphs PY, according to the following formula (2).
- CPX is the compound list included in paragraph PX, a is an integer from 1 to N, and CXa is one of the compound lists CX1 to CXN for each cluster.
- CPY is the compound list included in paragraph PY, b is an integer from 1 to M, and CYb is one of the compound lists CY1 to CYM for each cluster.
- cos is a function that calculates the cosine similarity between the two elements in parentheses.
- argmax is a function that extracts the condition (here, the cluster) for which the element in parentheses is the maximum.
- Paragraph P can thereby be assigned to the element (cluster of compounds) for which the cosine similarity between the compound names included in paragraph P and the compound names in the compound list is the maximum, for example, for which the number of occurrences is the largest.
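The assignment of paragraphs to compound clusters can be sketched as follows. Formulas (1) and (2) themselves are not reproduced in this text, so the sketch assumes one plausible reading: each paragraph's compound names and each compound list are treated as bags of names, and the paragraph goes to the cluster with the highest cosine similarity. The function names, the whitespace matching of compound names, and the choice to leave paragraphs without any listed compound unassigned are all assumptions.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bags of compound names.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def assign_paragraphs(paragraphs, compound_lists):
    # compound_lists: one non-empty list of compound names per cluster
    # (for example CX1 .. CXN); at least one cluster is assumed.
    cluster_bags = [Counter(names) for names in compound_lists]
    vocabulary = set().union(*compound_lists)
    clusters = [[] for _ in compound_lists]
    for paragraph in paragraphs:
        # Compound names that occur in this paragraph, with counts.
        found = Counter(w for w in paragraph.split() if w in vocabulary)
        scores = [cosine(found, bag) for bag in cluster_bags]
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if scores[best] > 0.0:
            clusters[best].append(paragraph)
        # Assumption: paragraphs naming no listed compound stay unassigned.
    return clusters
```

Leaving unmatched paragraphs out is one way the number of partial document clusters can end up smaller than the number of compound lists.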
- Partial document clusters PX1 and PY1: paragraphs describing the "negative electrode active material".
- Partial document clusters PX2 and PY2: paragraphs describing the "positive electrode active material".
- Partial document clusters PX3 and PY3: paragraphs describing the "binder".
- Partial document clusters PX4 and PY4: paragraphs describing the "electrolyte solvent".
- The numbers N and M of partial document clusters of the documents X and Y are assumed to match the numbers N and M of compound lists, but they are not limited to this and need not match. For example, the number of partial document clusters may be smaller than N and M.
- the similarity determination system 1 calculates a plurality of subdocument vectors corresponding to each of the plurality of subdocument clusters based on the words included in each of the subdocument clusters. For example, the similarity determination system 1 adds the result of multiplying the word vector acquired in the process P3 and the weight of the word acquired in the process P2 over all the words in the subdocument cluster for each subdocument cluster. By doing so, the partial document vector may be calculated.
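The weighted sum described above can be sketched in a few lines. The function name and the data layout (a weight dictionary and a word-vector dictionary) are hypothetical, and summing once per word occurrence rather than once per distinct word is an assumption, since the text only says "over all the words in the subdocument cluster".

```python
def subdocument_vector(words, weights, word_vectors):
    # Sum weight(word) * vector(word) over every word occurrence in the
    # cluster. Words without a vector are skipped; words without a weight
    # contribute zero.
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        if word in word_vectors:
            w = weights.get(word, 0.0)
            for i, x in enumerate(word_vectors[word]):
                total[i] += w * x
    return total
```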
- The similarity determination system 1 calculates the text similarity between partial document clusters based on the similarity between each partial document vector of the query document 11a and each partial document vector of the comparison target document 12a, in other words, a text similarity based on word meaning vectors.
- the partial document vector of the query document 11a is an example of the first plurality of vectors
- the partial document vector of the comparison target document 12a is an example of the second plurality of vectors.
- The similarity determination system 1 may calculate the text similarity, for example, the cosine similarity, between a partial document cluster of the query document 11a and a partial document cluster of the comparison target document 12a according to the following equation (3).
- WPXa is the distributed representation vector of the words included in paragraph PXa, and WPYb is the distributed representation vector of the words included in paragraph PYb.
- The similarity determination system 1 may calculate the text similarity according to the above equation (3) for all pairs of the partial document clusters PX1, PX2, PX3, ..., PXN and the partial document clusters PY1, PY2, PY3, ..., PYM.
- the similarity determination system 1 performs a ranking process in the process P6 to rank each of the plurality of comparison target documents 12a according to the similarity with the query document 11a based on the text similarity, and outputs the result 13.
- the similarity determination system 1 outputs rankings of a plurality of comparison target documents 12a according to the similarity with the query document 11a based on the text similarity in the ranking process.
- the similarity determination system 1 may calculate the document similarity Sim (X, Y) between the document X and one comparison target document Y, for example, according to the following equation (4).
- ft is the text similarity according to the above equation (3)
- max is a function that adopts the maximum value among all the combinations in parentheses.
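Equations (3) and (4) are not reproduced in this text. Based only on the symbol definitions given here (ft, cos, max, WPXa, WPYb), one consistent reading is the following; the exact form is an assumption and not the published formulas.

```latex
\begin{align}
  f_t\left(P_{Xa}, P_{Yb}\right)
    &= \cos\left(W_{P_{Xa}}, W_{P_{Yb}}\right)
     = \frac{W_{P_{Xa}} \cdot W_{P_{Yb}}}
            {\lVert W_{P_{Xa}} \rVert \, \lVert W_{P_{Yb}} \rVert} \tag{3} \\
  \mathrm{Sim}(X, Y)
    &= \max_{1 \le a \le N,\; 1 \le b \le M} f_t\left(P_{Xa}, P_{Yb}\right) \tag{4}
\end{align}
```

Under this reading, the document similarity is the best score found among all N × M pairs of partial document clusters.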
- For example, the similarity determination system 1 determines that the text similarity for the pair of compound lists CX2 and CY2, in other words, between the partial document clusters for the "positive electrode active material", is the maximum, and determines that text similarity as the document similarity Sim (X, Y) between the documents X and Y.
- the above equation (4) shows an example of calculating the document similarity between the document X (query document 11a) and one document Y (comparison target document 12a).
- The similarity determination system 1 may perform the above processing for each of a plurality of comparison target documents 12a, for example, documents Y1 to YL (L is an integer of 2 or more and is the number of comparison target documents 12a), and acquire document similarities Sim (X, Y1) to Sim (X, YL), one for each document Y.
- The similarity determination system 1 sorts all the documents Y1 to YL to be searched in descending order of document similarity Sim (X, Y1) to Sim (X, YL), for example, and may output the sort result as the result 13.
- the result 13 may include the identification information of the document Y together with the rank (rank), and may include the document similarity Sim (X, Y) of each document Y.
- The identification information of the document Y may include at least one of an identifier such as a document number or a document code, bibliographic information such as a document name, and at least a part of the contents of the document Y, such as a summary or a predetermined part.
- The similarity determination system 1 may output the identification information of the document Y determined to have a specific rank, for example, the document Y having the highest document similarity Sim (X, Y) with the query document 11a.
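The document similarity and ranking steps described above can be sketched as follows. This is a minimal sketch: it assumes that each document is already represented by a non-empty list of partial document vectors, that text similarity is cosine similarity, and that comparison target documents are keyed by a hypothetical identifier such as a document number.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def document_similarity(query_vectors, target_vectors):
    # Maximum text similarity over all pairs of partial document vectors.
    return max(cosine(u, v) for u in query_vectors for v in target_vectors)

def rank_documents(query_vectors, targets):
    # targets: mapping from a document identifier to that document's
    # partial document vectors. Returns (identifier, similarity) pairs
    # sorted in descending order of similarity.
    scored = [
        (doc_id, document_similarity(query_vectors, vectors))
        for doc_id, vectors in targets.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The first element of the returned list corresponds to the document with the highest similarity to the query document, which is the case singled out in the text.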
- The similarity between documents is determined based on the text similarity for each partial document cluster classified by the clustering process, which makes it possible to improve the accuracy of determining the degree of similarity between partially similar documents.
- By comparing the partial document vectors between the documents X and Y, the similarity determination system 1 can determine that the degree of similarity between the documents X and Y is high because their semantic vectors for the "positive electrode active material" are similar.
- the semantic vector space is shown in two dimensions, but it can actually be a vector of several hundred dimensions.
- the accuracy of determining the degree of similarity between partially similar documents can be improved by comparing the partial document clusters.
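A small numerical illustration of this effect follows. The three-dimensional vectors and cluster names are invented for the example and do not come from the source; they model two documents whose "positive electrode" descriptions coincide while their "binder" descriptions are unrelated.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Hypothetical partial document vectors for documents X and Y.
x_clusters = {"positive electrode": [1.0, 0.0, 0.0], "binder": [0.0, 1.0, 0.0]}
y_clusters = {"positive electrode": [1.0, 0.0, 0.0], "binder": [0.0, 0.0, 1.0]}

# Comparative approach: one vector per document (sum over all clusters).
whole_x = add(*x_clusters.values())
whole_y = add(*y_clusters.values())
whole = cosine(whole_x, whole_y)

# Cluster-wise approach: best similarity among all cluster pairs.
partial = max(
    cosine(u, v) for u in x_clusters.values() for v in y_clusters.values()
)
```

Here `whole` is about 0.5 while `partial` is 1.0: the unrelated binder descriptions pull the whole-document score down, whereas the cluster-wise comparison still detects the matching part.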
- The similarity determination system 1 has been described as calculating a plurality of partial document vectors for both the query document 11a and the comparison target document 12a, but the present invention is not limited to this.
- For one of the documents, for example, the plurality of comparison target documents 12a, the plurality of partial document vectors may be calculated and accumulated in advance. For example, when the similarity determination system 1 stores the document set 12 in advance, the plurality of partial document vectors of each of the plurality of comparison target documents 12a may be calculated in advance and accumulated.
- In this case, the similarity determination system 1 may calculate a plurality of partial document vectors for the other document, for example, the query document 11a, and acquire the accumulated plurality of partial document vectors for the comparison target document 12a. The similarity determination system 1 may then perform the above-mentioned text similarity calculation process and ranking process based on the calculated partial document vectors of the query document 11a and the acquired partial document vectors of the comparison target document 12a.
- the document in which a plurality of partial document vectors are calculated and accumulated in advance is not limited to the comparison target document 12a, and may be a query document 11a in place of or in addition to the comparison target document 12a.
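Accumulating partial document vectors in advance can be sketched as a simple store. The class name `PartialVectorStore`, its methods, and the in-memory dictionary are hypothetical; the text does not specify how or where the vectors are accumulated.

```python
class PartialVectorStore:
    """Keeps precomputed partial document vectors keyed by document identifier."""

    def __init__(self, compute):
        # compute: function mapping a document text to its list of
        # partial document vectors (division, clustering, vectorization).
        self._compute = compute
        self._cache = {}

    def add(self, doc_id, text):
        # Compute once, ahead of any query, and keep the result.
        self._cache[doc_id] = self._compute(text)

    def get(self, doc_id):
        # At query time the stored vectors are returned without recomputation.
        return self._cache[doc_id]
```

With such a store, only the query document needs to be vectorized when a similarity determination is requested.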
- FIG. 6 is a block diagram showing a functional configuration example of the server 2 in the similarity determination system 1 according to the first embodiment, and FIG. 7 is a diagram showing a screen output example by the server 2.
- the server 2 is an example of a similarity determination device, an information processing device, or a computer.
- The server 2 may perform various communications, such as reception of the query document 11a and the comparison target document 12a and transmission of the result 13, with a terminal device (not shown), another server, or the like.
- the server 2 may provide, for example, a function for enabling access to the terminal device.
- Examples of the function include generation and display control of a screen such as a web page used for access by a terminal device.
- The terminal device may send an access request to the server 2 using an application such as a browser, and access the server 2 via a web page displayed on the application based on the screen information received from the server 2.
- the server 2 may output the screen information of the query specification screen 210 for designating the query and the determination result output screen 240 for outputting the determination result.
- the server 2 may optionally include a memory unit 21, a document input unit 22, a similarity calculation unit 23, and a similarity output unit 24.
- the memory unit 21, the document input unit 22, the similarity calculation unit 23, and the similarity output unit 24 are examples of control units.
- the memory unit 21 has a storage area for storing various data related to the similarity determination process.
- The memory unit 21 may store information such as the query document 11a shown in FIG. 3, a plurality of comparison target documents 12a, the result 13, and a compound list for each cluster preclassified for each document. Further, the memory unit 21 may store information such as the paragraphs P, partial document clusters, text similarity, and document similarity Sim for each document shown in FIGS. 4 and 5 as intermediate data in the similarity determination process.
- the document input unit 22 may receive input of the query document 11a and the comparison target document 12a from a computer such as a terminal device (not shown) or another server, and store the query document 11a and the comparison target document 12a in the memory unit 21, for example, as a DB (Database). In this way, the document input unit 22 may be able to construct and refer to the DB of the document.
- the document input unit 22 may receive the input of the query document 11a related to the similarity determination request from a computer such as a terminal device (not shown) or another server and store it in the memory unit 21.
- the query document 11a may be included in the query 11, for example.
- the document input unit 22 may accept, for example, as the query 11, not the query document 11a itself, but the identification information of the query document 11a, for example, information such as a document number and a document code.
- the document input unit 22 may specify the query document 11a related to the similarity determination request from, for example, the DB of the memory unit 21 based on the identification information.
- the document input unit 22 may accept the document number set in the input field 211 when the determination button 212 of the query specification screen 210 is pressed.
- the similarity calculation unit 23 calculates the similarity between the query document 11a and the comparison target document 12a. As illustrated in FIG. 6, the similarity calculation unit 23 may include a document division unit 231, a partial document clustering unit 232, and a document similarity calculation unit 233.
- the document division unit 231 divides each of the query document 11a and the comparison target document 12a stored in the memory unit 21 to generate partial documents, for example, paragraphs PX and PY .
- The partial document clustering unit 232 clusters each of the plurality of paragraphs PX and the plurality of paragraphs PY based on the compound lists CX1 to CXN and CY1 to CYM stored in the memory unit 21, and acquires the partial document clusters PX1 to PXN and PY1 to PYM. Further, the partial document clustering unit 232 calculates a partial document vector for each of the partial document clusters PX1 to PXN and PY1 to PYM based on the results of morphological analysis, word weight calculation, and word vector calculation for each of the documents X and Y.
- The processing of the document division unit 231 and the partial document clustering unit 232 is an example of the processes P1 to P4 in FIG. 3.
- The document similarity calculation unit 233 calculates the text similarity for each partial document cluster based on the partial document vector of each partial document cluster, and calculates the text similarity of the cluster pair having the highest similarity in the documents as the document similarity Sim (X, Y).
- The document similarity calculation unit 233 may calculate the similarities Sim (X, Y1) to Sim (X, YL) for each of the comparison target documents 12a.
- the document similarity calculation unit 233 may store the calculated similarity Sim (X, Y) in the memory unit 21.
- the similarity output unit 24 outputs the similarity Sim (X, Y) calculated by the similarity calculation unit 23.
- For example, the similarity output unit 24 may output information on the comparison target documents 12a together with the similarity Sim (X, Y) in descending order of the calculated similarities Sim (X, Y1) to Sim (X, YL).
- The processing of the document similarity calculation unit 233 and the similarity output unit 24 is an example of the processes P5 and P6 of FIG. 3.
- the output by the similarity output unit 24 may include, for example, transmission to a computer such as a terminal device (not shown), storage in a storage area of a server 2 such as a memory unit 21, and the like.
- the similarity output unit 24 may output the determination result output screen 240.
- the determination result output screen 240 may include a display area 241 of the query document 11a and display areas 245a to 245c of at least one (three in FIG. 7) of the comparison target document 12a.
- the display area 241 may include a display area 242 such as bibliographic information and a summary, and a full-text reference button 243 for transitioning to a screen for displaying the full text of the query document 11a.
- the display areas 245a to 245c may include display areas 246a to 246c for bibliographic information and summaries, and full text reference buttons 247a to 247c.
- In the display areas 245a to 245c, one or more paragraphs PY or the compound list corresponding to the partial document cluster determined to be similar, and/or the similarity Sim (X, Y), may be displayed.
- the similarity output unit 24 can present to the user information about the document determined to have the highest similarity as a result of the similarity calculation between the query document 11a and the comparison target document 12a.
- FIG. 8 is a flowchart illustrating an operation example of the server 2. As shown in FIG. 8, the server 2 may execute the processing for the query document 11a and the processing for the comparison target document 12a at different timings.
- the document input unit 22 accepts the input of the query document 11a (step S1).
- the document division unit 231 divides the query document 11a into a plurality of subdocuments, for example, a plurality of paragraphs PX (step S2).
- the partial document clustering unit 232 clusters a plurality of paragraphs PX based on the compound lists C X1 to C XN , and acquires the partial document clusters PX1 to PXN (step S3). Further, the partial document clustering unit 232 calculates each partial document vector of the partial document clusters PX1 to PXN based on the weight of each word included in the document X and the meaning vector of each word (step S4).
- the document input unit 22 accepts the input of the comparison target document 12a (step S5).
- the document division unit 231 selects an unselected comparison target document 12a (step S6), and divides the selected comparison target document 12a into a plurality of partial documents, for example, a plurality of paragraphs PY (step S7).
- The partial document clustering unit 232 clusters a plurality of paragraphs PY based on the compound lists CY1 to CYM, and acquires the partial document clusters PY1 to PYM (step S8). Further, the partial document clustering unit 232 calculates each partial document vector of the partial document clusters PY1 to PYM based on the weight of each word included in the document Y and the meaning vector of each word (step S9).
- the document similarity calculation unit 233 compares the partial document vectors of the query document 11a and the comparison target document 12a, calculates the similarity Sim between the documents (step S10), and stores it in the memory unit 21 (step S11).
- The document similarity calculation unit 233 determines whether or not there is an unselected comparison target document 12a (step S12). If it determines that there is one (YES in step S12), the process returns to step S6.
- If there is no unselected comparison target document 12a (NO in step S12), the similarity output unit 24 outputs the comparison target documents 12a and their similarities Sim (X, Y) in descending order of similarity Sim (X, Y) (step S13). Then, the process ends.
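The control flow of steps S1 to S13 can be summarized in a short driver. This is a sketch of the loop structure only: `to_vectors` (division, clustering, and vectorization of one document) and `similarity` (comparison of two sets of partial document vectors) are hypothetical callables supplied by the caller, and the mapping of code lines to step numbers is an interpretation of the flowchart description.

```python
def run_similarity_determination(query_doc, target_docs, to_vectors, similarity):
    # Steps S1-S4: accept the query document and compute its vectors.
    query_vectors = to_vectors(query_doc)
    results = {}
    # Step S5: accept the comparison target documents (identifier -> text).
    pending = list(target_docs.items())
    # Step S12: repeat while an unselected comparison target document remains.
    while pending:
        doc_id, text = pending.pop(0)          # step S6: select one document
        target_vectors = to_vectors(text)      # steps S7-S9
        # Steps S10-S11: compute the similarity and store it.
        results[doc_id] = similarity(query_vectors, target_vectors)
    # Step S13: output documents in descending order of similarity.
    return sorted(results.items(), key=lambda item: item[1], reverse=True)
```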
- the server 2 may be a virtual server (VM; Virtual Machine) or a physical server. Further, the function of the server 2 may be realized by one computer or may be realized by two or more computers. Further, at least a part of the functions of the server 2 may be realized by using the HW (Hardware) resource and the NW (Network) resource provided by the cloud environment.
- FIG. 9 is a block diagram showing a hardware (HW) configuration example of the computer 10 that realizes the function of the server 2.
- as an HW configuration, the computer 10 may include, for example, a processor 10a, a memory 10b, a storage unit 10c, an IF (Interface) unit 10d, an I / O (Input / Output) unit 10e, and a reading unit 10f.
- the processor 10a is an example of an arithmetic processing unit that performs various controls and operations.
- the processor 10a may be connected to each block in the computer 10 so as to be communicable with each other by the bus 10i.
- the processor 10a may be a multi-processor including a plurality of processors, a multi-core processor having a plurality of processor cores, or a configuration having a plurality of multi-core processors.
- Examples of the processor 10a include integrated circuits (ICs) such as CPUs, MPUs, GPUs, APUs, DSPs, ASICs, and FPGAs. As the processor 10a, a combination of two or more of these integrated circuits may be used.
- MPU is an abbreviation for Micro Processing Unit
- GPU is an abbreviation for Graphics Processing Unit
- APU is an abbreviation for Accelerated Processing Unit.
- DSP is an abbreviation for Digital Signal Processor
- ASIC is an abbreviation for Application Specific IC
- FPGA is an abbreviation for Field-Programmable Gate Array.
- the memory 10b is an example of HW that stores information such as various data and programs.
- Examples of the memory 10b include one or both of a volatile memory such as DRAM (Dynamic Random Access Memory) and a non-volatile memory such as PM (Persistent Memory).
- the storage unit 10c is an example of HW that stores information such as various data and programs.
- Examples of the storage unit 10c include a magnetic disk device such as an HDD (Hard Disk Drive), a semiconductor drive device such as an SSD (Solid State Drive), and various storage devices such as a non-volatile memory.
- Examples of the non-volatile memory include flash memory, SCM (Storage Class Memory), ROM (Read Only Memory) and the like.
- the storage unit 10c may store a program 10g (similarity determination program) that realizes all or a part of various functions of the computer 10.
- the processor 10a of the server 2 can realize the function as the server 2 illustrated in FIG. 6 by loading the program 10g stored in the storage unit 10c into the memory 10b and executing it.
- the memory unit 21 shown in FIG. 6 may be realized by a storage area of one or both of the memory 10b and the storage unit 10c.
- the IF unit 10d is an example of a communication IF that controls connection and communication with a network.
- the IF unit 10d may include an adapter compliant with LAN (Local Area Network) such as Ethernet (registered trademark) or optical communication such as FC (Fibre Channel).
- the adapter may support one or both wireless and wired communication methods.
- the server 2 may be connected to the terminal device and each of the other servers so as to be able to communicate with each other via the IF unit 10d.
- the program 10g may be downloaded from the network to the computer 10 via the communication IF and stored in the storage unit 10c.
- the I / O unit 10e may include one or both of an input device and an output device.
- Examples of the input device include a keyboard, a mouse, a touch panel, and the like.
- Examples of the output device include a monitor, a projector, a printer and the like.
- the reading unit 10f is an example of a reader that reads data and program information recorded on the recording medium 10h.
- the reading unit 10f may include a connection terminal or device to which the recording medium 10h can be connected or inserted.
- Examples of the reading unit 10f include an adapter compliant with USB (Universal Serial Bus), a drive device for accessing a recording disk, a card reader for accessing a flash memory such as an SD card, and the like.
- the program 10g may be stored in the recording medium 10h, or the reading unit 10f may read the program 10g from the recording medium 10h and store it in the storage unit 10c.
- Examples of the recording medium 10h include non-transitory computer-readable recording media such as magnetic / optical disks and flash memories.
- Examples of the magnetic / optical disk include flexible discs, CDs (Compact Discs), DVDs (Digital Versatile Discs), Blu-ray discs, HVDs (Holographic Versatile Discs), and the like.
- Examples of the flash memory include semiconductor memories such as USB memory and SD card.
- the above-mentioned HW configuration of the computer 10 is an example. Therefore, the increase / decrease of HW (for example, addition or deletion of arbitrary blocks), division, integration in any combination, addition or deletion of buses, etc. may be appropriately performed in the computer 10.
- in the server 2, at least one of the I / O unit 10e and the reading unit 10f may be omitted.
- Second Embodiment [2-1] Description of the Second Embodiment Next, the second embodiment will be described.
- in the first embodiment, the similarity determination system 1 stores the compound lists C X1 to C XN and CY1 to CYM for each cluster in advance.
- in the second embodiment, a method by which the similarity determination system 1A calculates the compound lists C X1 to C XN and CY1 to CYM will be described.
- FIG. 10 is a diagram for explaining the similarity determination system 1A according to the second embodiment, and FIG. 11 is a diagram for explaining an example of processing of the similarity determination system 1A.
- the processes P1 to P6 based on the query 11 and the document set 12 are the same as those in the first embodiment.
- the processes P7 and P8 may be executed in parallel with, or before or after, at least a part of the processes P1 to P3.
- the processes P7 and P8 will be described.
- the similarity determination system 1A extracts a compound name, as an example of a named entity, from each of a plurality of documents, for example, a query document 11a and a plurality of comparison target documents 12a (process P7), and generates a named entity list, for example, a compound list, for each document.
- the similarity determination system 1A extracts the compound name from the query document 11a (denoted as “document X”) included in the query 11 and generates the compound list CX . Further, the similarity determination system 1A extracts a compound name from the comparison target document 12a (denoted as “document Y”) included in the document set 12 to generate a compound list CY .
- as an example, the query document 11a and the comparison target document 12a are assumed to be documents relating to lithium ion batteries.
- when the compound lists C X and CY generated for the set of documents to be determined are not distinguished from each other, they are simply referred to as “compound list C”.
- the similarity determination system 1A executes clustering for classifying and grouping the named entities based on the named entity list (process P8).
- as the clustering method, various existing methods such as the shortest distance method (single linkage) may be used.
- the similarity determination system 1A may calculate, based on the named entity list, the similarity score S for each pair (set) of named entities included in the named entity list. For example, the similarity determination system 1A calculates the similarity score S for a pair of named entities based on the position of each named entity and the similarity between the named entities.
- the similarity determination system 1A may calculate the similarity score S (x 1 , x 2 ) using the following formula (5).
- TC (x 1 , x 2 ) is the Tanimoto coefficient of MACCS Key.
- MACCS Key is one of the expression methods (compound descriptors) of the characteristics of compounds.
- the Tanimoto coefficient is one of the indexes showing the structural similarity between compounds using MACCS Key, and is an example of the similarity between named entities when the named entity is a compound name.
- Distance (x 1 , x 2 ) is, for example, a numerical value obtained by quantifying the proximity of each appearance position of the named entity in a document, and is, for example, a value corresponding to the following conditions.
- the similarity determination system 1A may apply the above formula (5) to every pair (x 1 , x 2 ) of compound names included in the compound list C, and calculate the similarity score S (x 1 , x 2 ) of each pair.
- the similarity determination system 1A may cluster the compound names by applying a method such as the shortest distance method to the calculated similarity scores S (x 1 , x 2 ), thereby classifying and grouping the plurality of compound names included in the compound list C.
- the similarity determination system 1A divides the compound names in the compound list C X into N clusters (groups) by clustering the compound list C X , and generates the compound lists C X1 to C XN for each cluster. Further, the similarity determination system 1A divides the compound names in the compound list CY into M clusters (groups) by clustering the compound list CY , and generates the compound lists CY1 to CYM for each cluster.
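The clustering of process P8 can be sketched as follows, under stated assumptions. Formula (5) is not reproduced in this text; the sketch takes S as the product of the Tanimoto coefficient and a positional proximity value, as the text describes for this formula. The fingerprints are made-up bit sets (not real MACCS keys), `proximity` is a placeholder for Distance (whose conditions are not reproduced here), and the threshold 0.5 is arbitrary. All function names are hypothetical.

```python
from itertools import combinations


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def single_linkage_clusters(names, score, threshold):
    """Group names so that any pair with score >= threshold shares a cluster.

    Cutting a single-linkage (shortest distance) dendrogram at a fixed level
    is equivalent to taking connected components of the thresholded graph,
    which is what the union-find structure below computes.
    """
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if score(a, b) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


# Made-up fingerprints for illustration; real ones would come from a
# cheminformatics toolkit.
fingerprints = {
    "LiCoO2": {1, 2, 3, 4},
    "LiNiO2": {1, 2, 3, 5},
    "graphite": {7, 8},
    "hard carbon": {7, 8, 9},
}


def proximity(a, b):
    return 1.0  # placeholder for Distance(x1, x2)


def score(a, b):
    # Assumed product form of the similarity score S(x1, x2).
    return tanimoto(fingerprints[a], fingerprints[b]) * proximity(a, b)


compound_clusters = single_linkage_clusters(list(fingerprints), score, threshold=0.5)
```

In this toy run the two lithium oxides end up in one cluster and the two carbon materials in another, mirroring the "positive electrode active material" and "negative electrode active material" groups.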
- the compound lists C X and CY can be classified into clusters of the following four elements (characteristics) by such clustering.
- Compound lists C X1 and CY1: a cluster having the element (characteristic) of “negative electrode active material”.
- Compound lists C X2 and CY2: a cluster having the element (characteristic) of “positive electrode active material”.
- Compound lists C X3 and CY3: a cluster having the element (characteristic) of “binder”.
- Compound lists C X4 and CY4: a cluster having the element (characteristic) of “electrolyte solvent”.
- in this way, the similarity determination system 1A can generate the compound lists C X1 to C XN and CY1 to CYM for each cluster, which are used in the partial document clustering (process P4).
- the Tanimoto coefficient of MACCS Key is used as the structural similarity, but the description is not limited to this.
- the method for expressing the characteristics of a compound is not limited to MACCS Key, in other words, MACCS fingerprint, and various compound descriptors such as Morgan fingerprint may be adopted.
- the index indicating the structural similarity between the compounds is not limited to the Tanimoto coefficient, and various coefficients such as the Dice coefficient may be used.
- the similarity determination system 1A calculates, as the similarity score S (x 1 , x 2 ), the product of the numerical value of the proximity of the appearance positions of the named entities in the document and the similarity between the named entities, but the calculation is not limited to this.
- the similarity determination system 1A may calculate the similarity score S (x 1 , x 2 ) using the following equation (6).
- W is a weight.
- for W, a value such as “0.5” may be appropriately defined and set by the user or the like so that the positions of the named entities and the similarity between the named entities are considered evenly.
- alternatively, W may be set based on a model trained by machine learning, using training data that includes search queries and correct answer examples (correct answer data), so that the correct answer examples are ranked higher in the search.
- the similarity determination system 1A generates the first cluster group by classifying the first plurality of compound names included in the query document 11a based on the respective positions of the first plurality of compound names and the similarities among them. Further, the similarity determination system 1A generates the second cluster group by classifying the second plurality of compound names included in the comparison target document 12a based on the respective positions of the second plurality of compound names and the similarities among them. The first cluster group is an example of the first plurality of groups, and the second cluster group is an example of the second plurality of groups.
- as described above, according to the similarity determination system 1A according to the second embodiment, the same effect as that of the first embodiment can be obtained. Further, since the compound list for each cluster can be generated for each document, the user does not need to generate the compound list manually, which is convenient. Further, even when the similarity determination system 1A does not store one or both of the query document 11a and the comparison target document 12a, the similarity of the documents can be determined. Such cases include, for example, the case where the document is included in the query 11, or the case where the location of the document (a storage location other than the similarity determination system 1A) is specified by the query 11.
- FIG. 12 is a block diagram showing a functional configuration example of the server 3 in the similarity determination system 1A according to the second embodiment, and FIG. 13 is a diagram showing a screen output example by the server 3.
- the server 3 is an example of a similarity determination device, an information processing device, or a computer.
- the server 3 may perform various communications, such as reception of the query document 11a and the comparison target document 12a and transmission of the result 14, with a terminal device (not shown), another server, or the like.
- the server 3 may provide, for example, a function for enabling access to the terminal device. For example, as shown in FIG. 13, the server 3 may output screen information of a search query specification screen 330 for designating a search query and a search result output screen 340 for outputting search results.
- the above-mentioned similarity determination process by the similarity determination system 1A may be realized by the server 3.
- the server 3 may optionally include a document DB unit 31 and a document search unit 32.
- the document DB unit 31 and the document search unit 32 are examples of control units.
- the server 3 may include the document input unit 22 shown in FIG. 6.
- the document DB unit 31 stores the query document 11a and the comparison target document 12a, and performs a document DB construction process for constructing the document DB.
- the document search unit 32 performs a document search process for searching a comparison target document 12a similar to the query document 11a specified in the query 11 based on the information stored in the document DB unit 31 in response to the acceptance of the query 11.
- the document search process is a process including a similarity determination process, and is an example of use (application example) of the similarity determination process.
- the document DB unit 31 may include, for example, a document storage unit 311, a compound name extraction unit 312, a clustering unit 313, a document cluster vector calculation unit 314, and a document cluster vector storage unit 315.
- the document storage unit 311 is an example of the memory unit 21 (see FIG. 6) according to the first embodiment, and stores a plurality of documents.
- the document is a document that can be used as either the query document 11a or the comparison target document 12a. Therefore, it can be said that the document storage unit 311 stores the query document 11a and the document set (document group) 12 including the plurality of comparison target documents 12a that are the targets of the query 11.
- the document storage unit 311 may store a plurality of documents in advance before receiving the query 11.
- the document storage unit 311 may store a plurality of documents received by the document input unit 22 according to the first embodiment.
- the compound name extraction unit 312 extracts a compound name as an example of a named entity from each of a plurality of documents accumulated by the document storage unit 311 and generates compound lists C X and CY for each document.
- the process of the compound name extraction unit 312 is an example of the process P7 in FIG. 10.
- the clustering unit 313 calculates the similarity score S for the compound names included in each of the compound lists C X and CY . Further, the clustering unit 313 classifies the compound names into a plurality of clusters based on the similarity score S, and generates the compound lists C X1 , C X2 , C X3 , ... C XN and the compound lists CY1 , CY2 , CY3 , ... CYM .
- the process of the clustering unit 313 is an example of the process P8 of FIG. 10.
- the document cluster vector calculation unit 314 may calculate the document vector for each partial document cluster based on the information of the compound clusters from the clustering unit 313 and on the weights and word vectors calculated from the words extracted from each of the plurality of documents accumulated by the document storage unit 311.
- the process of the document cluster vector calculation unit 314 is an example of at least a part of the processes P1 to P4 and the process P5 in FIG. 10.
- the document cluster vector storage unit 315 is an example of the memory unit 21 shown in FIG. 6, and stores the document vector for each partial document cluster calculated by the document cluster vector calculation unit 314.
- the document search unit 32 may optionally include a search query designation unit 321, a document similarity calculation unit 322, a search result generation unit 323, and a search result output unit 324.
- the search query designation unit 321 is an example of the document input unit 22 shown in FIG. 6, and accepts the input of a query 11 requesting a document search (hereinafter may be referred to as “search query 11”) from a computer such as a terminal device (not shown) or another server.
- the search query designation unit 321 may accept the document number of the query document 11a set in the input field 331 when the search button 332 of the search query specification screen 330 is pressed.
- the document similarity calculation unit 322 is an example of the document similarity calculation unit 233 shown in FIG. 6.
- the document similarity calculation unit 322 calculates the document similarity Sim (X, Y) between the query document 11a specified by the search query 11 and the comparison target document 12a based on the document vectors stored in the document cluster vector storage unit 315.
- the document similarity calculation unit 322 may compare the plurality of partial document vectors corresponding to the query document 11a and to the comparison target document 12a, among the partial document vectors stored in the document cluster vector storage unit 315, and calculate the text similarity.
- the document similarity calculation unit 322 may calculate the document similarity Sim (X, Y) based on the text similarity, sort the comparison target documents 12a in descending order of the document similarity Sim (X, Y), and generate the ranking result 14.
- the content and output method of the result 14 are the same as those of the result 13 according to the first embodiment.
- the process of the document similarity calculation unit 322 is an example of at least a part of the process P5 in FIG. 10 and the process P6.
- the search result generation unit 323 generates a search result for output based on the result 14.
- the search result generation unit 323 may generate the search result output screen 340 shown in FIG. 13.
- the search result output screen 340 may replace the determination result 244 in the determination result output screen 240 shown in FIG. 7 with the search result 344.
- the search result output screen 340 may include a display area 341 of the query document 11a and display areas 345a to 345c of at least one (three in FIG. 13) of the comparison target documents 12a.
- the display area 341 may include a display area 342 such as bibliographic information and a summary, and a full-text reference button 343 of the query document 11a.
- the display areas 345a to 345c may include display areas 346a to 346c for bibliographic information and summaries, and full text reference buttons 347a to 347c.
- in the display areas 346a to 346c, one or more paragraphs PY or the compound list corresponding to the partial document cluster determined to be similar, and / or the similarity Sim (X, Y), may be displayed.
- the search result output unit 324 outputs the search result output screen 340 to a computer such as a terminal device or another server (not shown).
- FIG. 14 is a flowchart illustrating an operation example of the document DB construction process of the server 3, and FIG. 15 is a flowchart illustrating an operation example of the document retrieval process of the server 3.
- the document storage unit 311 selects an unselected document (step S21) and registers the document in the document DB (step S22).
- the compound name extraction unit 312 extracts the compound name from the text of the document (step S23).
- the clustering unit 313 clusters the extracted compound names (step S24).
- the document cluster vector calculation unit 314 divides the document into a plurality of sub-documents (step S25), and clusters a plurality of sub-documents based on the compound cluster generated by the clustering unit 313 (step S26).
- the document cluster vector calculation unit 314 calculates the document vector of each partial document cluster (step S27).
- the document cluster vector storage unit 315 associates the calculated document vector with the document and registers (stores) it in, for example, a document DB or a document cluster vector DB (step S28).
- the document storage unit 311 determines whether or not there is an unselected document (step S29), and if it determines that there is an unselected document (YES in step S29), the process proceeds to step S21. When the document storage unit 311 determines that there is no unselected document (NO in step S29), the process ends.
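Steps S25 and S26 of the document DB construction process can be sketched as follows. This is an illustration under assumptions: paragraphs are separated by blank lines, and a paragraph is assigned to every compound cluster whose compound name it mentions as a substring. The actual assignment criterion is not reproduced in this part of the text, and the function names are hypothetical.

```python
def split_paragraphs(text):
    """Step S25: split a document into paragraphs on blank lines (assumption)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def cluster_paragraphs(paragraphs, compound_clusters):
    """Step S26: group paragraphs by the compound clusters they mention.

    Assumption: a paragraph belongs to a cluster if it contains at least one
    compound name of that cluster; a paragraph may belong to several clusters
    or to none.
    """
    result = {name: [] for name in compound_clusters}
    for paragraph in paragraphs:
        for name, compounds in compound_clusters.items():
            if any(c in paragraph for c in compounds):
                result[name].append(paragraph)
    return result


# Toy document and compound clusters (made up for illustration).
text = (
    "The anode uses graphite.\n\n"
    "The cathode uses LiCoO2.\n\n"
    "Assembly is done in dry air."
)
clusters = {"negative": ["graphite"], "positive": ["LiCoO2", "LiNiO2"]}
partial_document_clusters = cluster_paragraphs(split_paragraphs(text), clusters)
```

The third paragraph mentions no compound and is therefore left out of every partial document cluster in this sketch.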
- the search query designation unit 321 accepts the designation of the query document 11a from the search query specification screen 330 (step S31).
- the document similarity calculation unit 322 acquires the document vector of the query document 11a from the document cluster vector storage unit 315 (step S32).
- the document similarity calculation unit 322 selects an unselected document (step S33), and acquires the document vector of the partial document cluster of the selected document from the document cluster vector storage unit 315 (step S34).
- the document similarity calculation unit 322 compares the document vectors of a plurality of partial document clusters between the query document 11a and the selected document, and calculates the document similarity Sim (X, Y) (step S35).
- the document similarity calculation unit 322 determines whether or not there is an unselected document (step S36), and if so (YES in step S36), the process proceeds to step S33.
- when the document similarity calculation unit 322 determines that there is no unselected document (NO in step S36), it extracts a predetermined number of documents in descending order of document similarity (step S37).
- the search result generation unit 323 generates a search result based on the extracted data, and the search result output unit 324 outputs the search result, for example, the search result output screen 340 (step S38). Then, the process ends.
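The retrieval loop of steps S32 to S37 can be sketched as follows. The definition of Sim (X, Y) is not reproduced in this part of the text, so the sketch uses a placeholder (the best cosine similarity over all pairs of partial document vectors) and lets the caller substitute another function. It also assumes that the query document itself is excluded from the candidates. All names are hypothetical.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def max_pair_similarity(vectors_x, vectors_y):
    """Placeholder Sim(X, Y): best cosine over all pairs of cluster vectors."""
    return max((cosine(u, v) for u in vectors_x for v in vectors_y), default=0.0)


def search(query_id, vector_store, top_k, sim=max_pair_similarity):
    """Score every other stored document against the query and keep top_k."""
    query_vectors = vector_store[query_id]           # step S32
    scored = (
        (sim(query_vectors, vectors), doc_id)        # steps S34 and S35
        for doc_id, vectors in vector_store.items()  # steps S33 and S36
        if doc_id != query_id
    )
    # Step S37: a predetermined number of documents in descending order.
    return [(doc_id, s) for s, doc_id in heapq.nlargest(top_k, scored)]


# Toy vector store: document id -> list of partial document vectors.
store = {
    "X": [[1.0, 0.0]],
    "Y1": [[1.0, 0.0], [0.0, 1.0]],
    "Y2": [[0.0, 1.0]],
    "Y3": [[1.0, 1.0]],
}
ranking = search("X", store, top_k=2)
```

Using `heapq.nlargest` avoids sorting the whole collection when only a small predetermined number of documents is needed.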
- FIG. 16 is a diagram for explaining the similarity determination system 1B according to the third embodiment, and FIGS. 17 and 18 are diagrams for explaining an example of processing of the similarity determination system 1B.
- the similarity determination system 1B replaces the process P6 of the similarity determination system 1A shown in FIG. 10 with the process P10, and adds the process P9 using the result of the process P8.
- the process P10 is executed using the results of both the processes P5 and P9.
- Process P9 is a process of calculating named entity similarity for each cluster, for example, compound similarity for each pair of clusters between documents.
- the process P10 is a process of ranking each of the plurality of comparison target documents 12a according to the similarity with the query document 11a based on the text similarity and the named entity similarity.
- the processes P9 and P10 will be described.
- the similarity determination system 1B may compare, for example, the plurality of compound lists of the first plurality of clusters generated from the query document 11a with the plurality of compound lists of the second plurality of clusters generated from the comparison target document 12a. Then, the similarity determination system 1B may calculate the compound similarity, for example, the cosine similarity, by the calculation of the following formula (7) for all the cluster pairs between the first plurality of clusters and the second plurality of clusters.
- i is an index for specifying all the compound names contained in the compound lists C Xa and CYb .
- C Xai and CYbi indicate the number of appearances of the i-th compound name in the compound lists C Xa and CYb , respectively.
- the denominator is the product of the square root of the sum of squares of the numbers of appearances of the compounds of C Xa and the square root of the sum of squares of the numbers of appearances of the compounds of CYb .
- the numerator is the sum of the products of the numbers of appearances of the compounds common to C Xa and CYb .
- the similarity determination system 1B may calculate the compound similarity according to the above formula (7) for each pair of the compound lists C X1 , C X2 , C X3 , ... C XN and the compound lists CY1 , CY2 , CY3 , ... CYM .
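Formula (7) itself is not reproduced in this text. Based on the description of its two parts given above, it presumably takes the standard cosine similarity form below; the symbol f_c follows the later reference to fc, and the exact notation of the original formula is an assumption.

```latex
f_c(C_{Xa}, C_{Yb}) =
  \frac{\sum_{i} C_{Xai}\, C_{Ybi}}
       {\sqrt{\sum_{i} C_{Xai}^{2}}\;\sqrt{\sum_{i} C_{Ybi}^{2}}}
```

Only compound names that appear in both lists contribute to the sum in the upper part, since the product is zero whenever either count is zero.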
- the similarity determination system 1B performs a ranking process of ranking each of the plurality of comparison target documents 12a according to the similarity with the query document 11a based on the text similarity and the named entity similarity (process P10), and outputs the result 14.
- in the ranking process, the similarity determination system 1B calculates a similarity in which the text similarity and the named entity similarity are integrated, and, based on that similarity, outputs the ranking of the plurality of comparison target documents 12a according to the similarity with the query document 11a.
- the similarity determination system 1B may calculate the document similarity Sim (X, Y) between the document X and one comparison target document Y, for example, according to the following equation (8).
- fc is a cosine similarity according to the above equation (7), in other words, a named entity similarity.
- the above formula (8) shows an example of calculating the document similarity between the document X (query document 11a) and one document Y (comparison target document 12a). Similar to the second embodiment, the similarity determination system 1B may acquire document similarities Sim (X, Y 1 ) to Sim (X, Y L ) according to the number of documents Y.
- for example, as in the second embodiment, the similarity determination system 1B performs the ranking process by sorting all the documents Y 1 to Y L to be searched in descending order of the document similarities Sim (X, Y 1 ) to Sim (X, Y L ). Further, the similarity determination system 1B may output the sort result as the result 14.
- the similarity determination system 1B may calculate the document similarity Sim (X, Y) between the document X and one comparison target document Y as a weighted sum of the named entity similarity and the text similarity according to the following equation (9).
- w is a weight.
- for w, a value such as “0.5” may be appropriately defined and set by the user or the like so that the named entity similarity and the text similarity are considered equally.
- alternatively, w may be set based on a model trained by machine learning, using training data that includes search queries and correct answer examples (correct answer data), so that the correct answer examples are ranked higher in the search.
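The weighted sum of equation (9) can be sketched as follows. Equation (9) is not reproduced in this text; the sketch assumes that w multiplies the named entity similarity and (1 - w) multiplies the text similarity, which is one plausible reading of "weighted sum" and not a confirmed form. The function name is hypothetical.

```python
def weighted_document_similarity(named_entity_sim, text_sim, w=0.5):
    """Weighted sum of named entity similarity and text similarity.

    Assumption: w weights the named entity similarity and (1 - w) weights the
    text similarity; the source only states that a weight w is used.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be in [0, 1]")
    return w * named_entity_sim + (1.0 - w) * text_sim


# With w = 0.5 both similarities are considered equally.
balanced = weighted_document_similarity(0.8, 0.4)
```

Setting w to 1.0 or 0.0 reduces the score to the named entity similarity alone or the text similarity alone, which is a quick way to check the effect of each component.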
- consider a case where the compound lists C X and CY generated for each entire document are compared with each other.
- suppose that the element to be investigated is “positive electrode active material”.
- in this case, compound names related to “positive electrode active material” such as “LiCoO2” appear in common between the documents, while compound names related to other elements differ between the documents. Therefore, the compound similarity between the documents may be calculated as a low value. In this way, when the compound list C is compared for each document, the compound similarity may be calculated as a low value even if the elements to be investigated are similar between the documents.
- in contrast, as illustrated in FIG. 18, the similarity determination system 1B can determine that the compound similarity between the pair of the compound lists C X2 and CY2 , in other words, between the clusters of “positive electrode active material”, is the maximum. Then, the similarity determination system 1B can adopt that compound similarity as the value of fc used for calculating the document similarity Sim (X, Y).
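The contrast between comparing whole-document compound lists and comparing per-cluster compound lists can be sketched as follows. The cosine similarity follows the textual description of formula (7); the counts are made up for illustration, and the function names and cluster keys are hypothetical.

```python
import math
from collections import Counter


def compound_cosine(counts_a, counts_b):
    """Cosine similarity between two compound lists given as name -> count."""
    dot = sum(n * counts_b[name] for name, n in counts_a.items() if name in counts_b)
    norm_a = math.sqrt(sum(n * n for n in counts_a.values()))
    norm_b = math.sqrt(sum(n * n for n in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_cluster_pair(clusters_x, clusters_y):
    """Return (similarity, a, b) for the most similar pair of cluster lists."""
    return max(
        (compound_cosine(cx, cy), a, b)
        for a, cx in clusters_x.items()
        for b, cy in clusters_y.items()
    )


# Made-up appearance counts per cluster for documents X and Y.
clusters_x = {
    "CX1": Counter({"graphite": 3}),
    "CX2": Counter({"LiCoO2": 4, "LiNiO2": 1}),
}
clusters_y = {
    "CY1": Counter({"silicon": 2}),
    "CY2": Counter({"LiCoO2": 2}),
}

# Whole-document lists are the sums of the per-cluster counts.
whole_x = sum(clusters_x.values(), Counter())
whole_y = sum(clusters_y.values(), Counter())

whole_similarity = compound_cosine(whole_x, whole_y)
fc, best_a, best_b = best_cluster_pair(clusters_x, clusters_y)
```

In this toy data the per-cluster maximum is found for the pair of clusters that share "LiCoO2", and it is higher than the whole-document similarity, which is diluted by the unrelated compounds.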
- as described above, since the similarity between documents is determined based on the named entity similarity for each cluster classified by the clustering process, the accuracy of determining the similarity between partially similar documents can be further improved.
- FIG. 19 is a block diagram showing a functional configuration example of the server 4 in the similarity determination system 1B according to the third embodiment. Unless otherwise specified, the server 4 may be the same as the server 3 shown in FIG. 12.
- the above-mentioned similarity determination process by the similarity determination system 1B may be realized by the server 4.
- the server 4 may optionally include a document DB unit 41 and a document search unit 42.
- the document DB unit 41 and the document search unit 42 are examples of control units.
- the document DB unit 41 may include a compound cluster storage unit 416 in addition to the configuration of the document DB unit 31 shown in FIG. 12.
- the document search unit 42 may include a document similarity calculation unit 422 instead of the document similarity calculation unit 322 shown in FIG. 12.
- the compound cluster storage unit 416 is an example of the memory unit 21 shown in FIG. 6, and may store the information of the compound clusters calculated by the clustering unit 313, for example, the compound list C, in association with the document.
- the document similarity calculation unit 422 compares the partial document vectors of the query document 11a and the comparison target document 12a stored in the document cluster vector storage unit 315, and calculates the text similarity. Further, the document similarity calculation unit 422 compares the compound lists of the query document 11a and the comparison target document 12a stored in the compound cluster storage unit 416, and calculates the compound similarity.
- the document similarity calculation unit 422 calculates the document similarity Sim (X, Y) based on the text similarity and the compound similarity, and generates the result 14 from the document similarity Sim (X, Y).
- the process of the document similarity calculation unit 422 is an example of the processes P5, P9, and P10 of FIG.
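As an illustration of how the document similarity calculation unit 422 might combine the two scores, a minimal Python sketch follows. It is not the formula of the embodiment: the cosine measure between partial document vectors, the best-match (maximum) aggregation, the Jaccard coefficient for compound lists, the weight `alpha`, and all function names are assumptions made only for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_similarity(query_vectors, target_vectors):
    """Best cosine similarity over all pairs of partial document vectors
    (assumed aggregation; returns 0.0 when either side is empty)."""
    return max(
        (cosine(q, t) for q in query_vectors for t in target_vectors),
        default=0.0,
    )


def compound_similarity(query_compounds, target_compounds):
    """Jaccard coefficient of two compound lists (assumed measure)."""
    a, b = set(query_compounds), set(target_compounds)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def document_similarity(query, target, alpha=0.5):
    """Hypothetical Sim(X, Y): weighted sum of text and compound similarity."""
    text = text_similarity(query["vectors"], target["vectors"])
    compound = compound_similarity(query["compounds"], target["compounds"])
    return alpha * text + (1.0 - alpha) * compound
```

With `alpha` close to 1 the score is dominated by the text similarity, and with `alpha` close to 0 by the compound similarity; how the two are actually balanced is not stated in this passage.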
- document retrieval unit 42 may output the screen illustrated in FIG.
- FIG. 20 is a flowchart illustrating an operation example of the document DB construction process of the server 4.
- FIG. 21 is a flowchart illustrating an operation example of the document retrieval process of the server 4.
- FIG. 20 shows that step S41 is added between steps S24 and S25 shown in FIG. As illustrated in FIG. 20, the compound cluster storage unit 416 stores the calculated compound cluster information for each document in step S41.
- step S51 is added between steps S32 and S33 shown in FIG. 15, and step S35 is replaced with steps S52 and S53.
- in step S51, the document similarity calculation unit 422 acquires the compound cluster of the query document 11a, for example, the compound list, from the compound cluster storage unit 416.
- in step S52, the document similarity calculation unit 422 acquires the compound cluster of the selected document, for example, the compound list, from the compound cluster storage unit 416.
- in step S53, the document similarity calculation unit 422 calculates the document similarity Sim (X, Y) based on the document vectors acquired in steps S32 and S34 and the compound clusters acquired in steps S51 and S52.
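The retrieval flow around steps S51 to S53 can be pictured as a loop over the stored documents. The sketch below is only an illustration: the two dictionaries are hypothetical in-memory stand-ins for the document cluster vector storage unit 315 and the compound cluster storage unit 416, and the scoring function is passed in as a parameter because this passage does not define how Sim (X, Y) is computed.

```python
def search(query_id, vector_store, compound_store, score):
    """Rank every stored document other than the query by score, highest first.

    vector_store and compound_store are hypothetical stand-ins for storage
    units 315 and 416: dicts mapping a document id to its partial document
    vectors and to its compound list. `score` is any function that returns
    a similarity value for a (query, target) pair.
    """
    # Steps S32 and S51: fetch the query document's vectors and compound list.
    query = {
        "vectors": vector_store[query_id],
        "compounds": compound_store[query_id],
    }
    results = []
    for doc_id in vector_store:
        if doc_id == query_id:
            continue
        # Steps S34 and S52: fetch the selected document's vectors and compound list.
        target = {
            "vectors": vector_store[doc_id],
            "compounds": compound_store[doc_id],
        }
        # Step S53: compute the document similarity from both kinds of information.
        results.append((doc_id, score(query, target)))
    return sorted(results, key=lambda item: item[1], reverse=True)
```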
- FIG. 22 is a block diagram showing a functional configuration example of the server 5 in the similarity determination system 1C according to the first modification of the third embodiment.
- FIG. 23 is a diagram showing a screen output example by the server 5.
- the similarity determination system 1C calculates the text similarity by comparing the partial document cluster containing a predetermined keyword in the query document 11a with each partial document cluster of the plurality of comparison target documents 12a.
- the server 5 may optionally include a document DB unit 41 and a document search unit 52.
- the document DB unit 41 and the document search unit 52 are examples of control units.
- the document DB unit 41 is the same as the document DB unit 41 shown in FIG.
- the document search unit 52 may include a document similarity calculation unit 522, a keyword input unit 525, and a document cluster identification unit 526 in place of the document similarity calculation unit 422 of the document search unit 42 shown in FIG.
- the keyword input unit 525 accepts input of one or more keywords from the user. For example, as shown in FIG. 23, when the search button 533 of the search query specification screen 530 is pressed, the keyword input unit 525 notifies the document cluster identification unit 526 of the document number of the query document 11a and the one or more keywords set in the input fields 531 and 532.
- the document cluster identification unit 526 refers to the document cluster vector storage unit 315 and identifies, from the plurality of partial document clusters of the notified query document 11a, the partial document cluster that includes the one or more notified keywords (for example, includes them a predetermined number of times or more).
- the document similarity calculation unit 522 limits the partial document vector of the query document 11a to be compared with the plurality of partial document vectors of the comparison target document 12a to the document vector of the partial document cluster specified by the document cluster identification unit 526. In other words, the document similarity calculation unit 522 sets the importance (priority) of the specified subdocument cluster to be higher than that of other subdocument clusters. Then, the document similarity calculation unit 522 calculates the text similarity for the specified partial document cluster, and calculates the inter-document similarity based on the text similarity and the compound similarity.
- with the server 5 according to the first modification, the same effects as those of the first and second embodiments can be obtained. Further, the comparison target document 12a can be searched using, among the plurality of partial document clusters in the query document 11a, the partial document cluster that includes the keyword intended by the user, so the accuracy of determining the similarity between the documents can be further improved. Further, since the number of partial document clusters used for determining the similarity can be limited, the processing time of the document retrieval process can be shortened. In addition, the user can flexibly specify a cluster including a predetermined keyword, which is highly convenient.
- FIG. 24 is a flowchart illustrating an operation example of the document retrieval process of the server 5.
- steps S61 and S62 are added between steps S51 and S33 shown in FIG. 21, and step S53 is replaced with step S63.
- in step S61, the keyword input unit 525 accepts the designation of the keyword.
- in step S62, the document cluster identification unit 526 identifies a partial document cluster of the query document 11a that includes the keywords accepted by the keyword input unit 525 at least a first threshold number of times (a predetermined number of times).
- in step S63, the document similarity calculation unit 522 calculates the text similarity between the identified partial document cluster of the query document 11a and all the partial document clusters of the selected document. Then, the document similarity calculation unit 522 calculates the document similarity Sim (X, Y) based on the calculated text similarity and the compound similarity.
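The keyword-based narrowing of steps S62 and S63 can be sketched as follows. This is an assumption-laden illustration, not the method of the embodiment: clusters are represented by their plain text, tokens are obtained by whitespace splitting, matching is case-insensitive, and the hits of all keywords are summed before comparison with the first threshold. The function names are hypothetical.

```python
def count_keyword_hits(text, keywords):
    """Count whitespace-separated tokens that equal any keyword (case-insensitive)."""
    wanted = {k.lower() for k in keywords}
    return sum(1 for token in text.lower().split() if token in wanted)


def identify_clusters(cluster_texts, keywords, first_threshold=1):
    """Step S62 (sketch): indices of clusters whose keyword hits reach the threshold."""
    return [
        index
        for index, text in enumerate(cluster_texts)
        if count_keyword_hits(text, keywords) >= first_threshold
    ]


def restrict_vectors(cluster_vectors, selected_indices):
    """Step S63 (sketch): keep only the query vectors of the identified clusters."""
    return [cluster_vectors[i] for i in selected_indices]
```

Only the vectors returned by `restrict_vectors` would then be compared with the partial document vectors of each comparison target document, which is what reduces the number of comparisons.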
- FIG. 25 is a block diagram showing a functional configuration example of the server 6 in the similarity determination system 1D according to the second modification of the third embodiment.
- the similarity determination system 1D calculates the text similarity by comparing the partial document clusters whose similarity with the text of a predetermined part of the document is equal to or higher than the second threshold value. Further, the similarity determination system 1D calculates the compound similarity by comparing the compound clusters whose degree of agreement with the text of the predetermined portion included in the partial document cluster is equal to or more than the third threshold value.
- the accuracy of determining the similarity of partially similar documents can, in some cases, be further improved by determining the document similarity based on the description content of the predetermined portion.
- the similarity determination system 1D specifies a predetermined part of the text such as "(patent) claims" from the document according to the type of the input document. Further, the similarity determination system 1D accumulates only the partial document cluster and the compound cluster related to the text among the clusters calculated from the document. Then, in the similarity determination process, the similarity determination system 1D determines the similarity based on the partial document cluster and the compound cluster related to the text of the predetermined portion according to the type of the designated query document 11a.
- the server 6 may optionally include a document DB unit 61 and a document search unit 42.
- the document DB unit 61 and the document search unit 42 are examples of control units.
- the document search unit 42 is the same as the document search unit 42 shown in FIG.
- the document DB unit 61 may include a predetermined document cluster vector storage unit 615 and a predetermined compound cluster storage unit 616 instead of the document cluster vector storage unit 315 and the compound cluster storage unit 416 of the document DB unit 41 shown in FIG. Further, the document DB unit 61 may include a predetermined document structure analysis unit 617.
- the predetermined document cluster vector storage unit 615 stores the information of the partial document cluster specified by the predetermined document structure analysis unit 617, which will be described later, among the partial document clusters calculated by the document cluster vector calculation unit 314, in association with the document.
- the predetermined compound cluster storage unit 616 stores the information of the compound cluster specified by the predetermined document structure analysis unit 617, which will be described later, among the compound clusters calculated by the clustering unit 313, in association with the document.
- the predetermined document structure analysis unit 617 specifies the text of the predetermined part from the document according to the type of the input document.
- the "predetermined portion" may be preset according to the type of document, for example, a predetermined document type in which a document structure is defined, such as a patent document, a paper, and various materials.
- the predetermined document structure analysis unit 617 identifies, among the partial document clusters calculated from the input document, the partial document cluster whose similarity with the specified text is equal to or higher than the second threshold value, and stores its document vector in the predetermined document cluster vector storage unit 615.
- the predetermined document structure analysis unit 617 may treat the specified text as a partial document (partial document cluster) and compare the text similarity between that partial document and each of the other partial document clusters in the document with the second threshold value.
- the predetermined document structure analysis unit 617 identifies, among the compound clusters calculated from the input document, the compound cluster whose degree of agreement with the compound names included in the specified partial document cluster is equal to or higher than the third threshold value, and accumulates it in the predetermined compound cluster storage unit 616. For example, the predetermined document structure analysis unit 617 may treat the compound names included in the specified partial document cluster as a compound list for that cluster, and compare the similarity between that compound list and the compound list of each other cluster in the document with the third threshold value.
- the predetermined document structure analysis unit 617 may accumulate the calculated document vectors of the partial document clusters and the calculated compound clusters in the predetermined document cluster vector storage unit 615 and the predetermined compound cluster storage unit 616, respectively.
- the document similarity calculation unit 422 may perform the same operation as in the similarity determination system 1B shown in FIG. 19. However, when the type of the document related to the query 11 is a predetermined document type for which a "predetermined portion" is set, the information of each cluster used to calculate the document similarity is limited. That is, the document similarity calculation unit 422 determines the document similarity based on the partial document cluster and the compound cluster related to the text of the predetermined portion according to the type of the input query document 11a.
- with the server 6 according to the second modification, the same effects as those of the first and second embodiments can be obtained. Further, by setting a "predetermined portion" for each document type in advance, important (high priority) partial document clusters and compound clusters according to the document type can be easily identified. Therefore, the accuracy of determining the similarity between documents can be further improved by determining the similarity based on the important partial document clusters and compound clusters. Further, since the number of partial document clusters and compound clusters used for determining the similarity can be limited, the processing time of the document retrieval process can be shortened.
- FIG. 26 is a flowchart illustrating an operation example of the document DB construction process of the server 6.
- FIG. 26 shows that step S28 shown in FIG. 14 is replaced with steps S71 to S75.
- in step S71, the predetermined document structure analysis unit 617 specifies the text of the predetermined portion in the document.
- in step S72, the predetermined document structure analysis unit 617 identifies, among the partial document clusters, a partial document cluster whose similarity with the text of the predetermined portion is equal to or higher than the second threshold value.
- in step S73, the predetermined document cluster vector storage unit 615 registers the document vector of the identified partial document cluster.
- in step S74, the predetermined document structure analysis unit 617 identifies a compound cluster whose degree of agreement with the compound names included in the identified partial document cluster is equal to or higher than the third threshold value.
- in step S75, the predetermined compound cluster storage unit 616 registers the identified compound cluster.
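The threshold-based selection of steps S72 and S74 can be sketched as below. The measures are assumptions chosen for illustration: cosine similarity stands in for the similarity with the text of the predetermined portion, and the Jaccard coefficient stands in for the "degree of agreement" between compound lists. The function and parameter names are hypothetical, and `portion_compounds` is assumed to be the list of compound names found in the identified partial document cluster.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def jaccard(a, b):
    """Jaccard coefficient of two lists treated as sets (0.0 if both are empty)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def select_predetermined(portion_vector, portion_compounds,
                         cluster_vectors, compound_clusters,
                         second_threshold, third_threshold):
    """Return the vectors and compound clusters that would be registered.

    Step S72 (sketch): keep vectors similar enough to the predetermined portion.
    Step S74 (sketch): keep compound clusters that agree enough with its compounds.
    """
    kept_vectors = [
        v for v in cluster_vectors
        if cosine(portion_vector, v) >= second_threshold
    ]
    kept_compounds = [
        c for c in compound_clusters
        if jaccard(portion_compounds, c) >= third_threshold
    ]
    return kept_vectors, kept_compounds
```

Raising either threshold stores fewer clusters per document, which trades recall for a shorter retrieval time.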
- FIG. 27 is a flowchart illustrating an operation example of the document retrieval process of the server 6.
- steps S32 and S51 shown in FIG. 21 are replaced with S81 and S82, and S34, S52 and S53 are replaced with S83, S84 and S85.
- in step S81, the document similarity calculation unit 422 acquires the document vector of the partial document cluster of the query document 11a from the predetermined document cluster vector storage unit 615. That is, the document similarity calculation unit 422 acquires the document vector of the predetermined partial document cluster.
- in step S82, the document similarity calculation unit 422 acquires the compound cluster of the query document 11a, that is, the predetermined compound cluster when the query document 11a is of the predetermined document type, from the predetermined compound cluster storage unit 616.
- in step S83, the document similarity calculation unit 422 acquires the document vector of the partial document cluster of the selected document from the predetermined document cluster vector storage unit 615. That is, the document similarity calculation unit 422 acquires the document vector of the predetermined partial document cluster.
- in step S84, the document similarity calculation unit 422 acquires the compound cluster of the selected document, that is, the predetermined compound cluster when the selected document is of the predetermined document type, from the predetermined compound cluster storage unit 616.
- in step S85, the document similarity calculation unit 422 calculates the document similarity based on the acquired document vectors of the predetermined partial document clusters and the acquired predetermined compound clusters.
- although the compound name is used as a named entity, the present invention is not limited to this.
- as the named entity, various terms that can be the target of the named entity extraction process in natural language processing, such as a gene sequence (genome), may be used.
- each of the servers 2 to 6 shown in FIGS. 6, 12, 19, 22, and 25 may be merged or divided in any combination.
- the first to third embodiments and the first and second modifications of the third embodiment may be combined as appropriate.
- each of the servers 2 to 6 may generate screen information of any of the screens of FIGS. 7, 13, and 23, and may have a functional configuration according to the screen.
- the functions of the servers 5 and 6 according to the first and second modifications of the third embodiment shown in FIGS. 22 and 25 may be implemented in combination with each other. Further, the function may be applied to the document similarity determination process based on the text similarity in the server 2 or 3 according to the first or second embodiment shown in FIG. 6 or FIG.
- each of the servers 2 to 6 shown in FIGS. 6, 12, 19, 22, and 25 may have a configuration in which a plurality of devices cooperate with each other via a network to realize each processing function.
- for example, the memory unit 21 may be a DB server, the document DB units 31, 41, and 61 may be a combination of an application server and a DB server, and the document input unit 22, the similarity calculation unit 23, the similarity output unit 24, and the document search units 32, 42, and 52 may be a combination of an application server and a Web server, and the like.
- the computer, the application server, and the DB server may cooperate with each other via the network to realize each processing function as the servers 2 to 6.
- each of the servers 3 to 6 may be provided with the HW configuration of the computer 10 illustrated in FIG.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2020/047219 WO2022130579A1 (ja) | 2020-12-17 | 2020-12-17 | 類似度判定プログラム、類似度判定装置、及び、類似度判定方法 |
| JP2022569435A JPWO2022130579A1 | 2020-12-17 | 2020-12-17 | |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2020/047219 WO2022130579A1 (ja) | 2020-12-17 | 2020-12-17 | 類似度判定プログラム、類似度判定装置、及び、類似度判定方法 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2022130579A1 (ja) | 2022-06-23 |
Family
ID=82057430
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2020/047219 Ceased WO2022130579A1 (ja) | 2020-12-17 | 2020-12-17 | 類似度判定プログラム、類似度判定装置、及び、類似度判定方法 |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JPWO2022130579A1 |
| WO (1) | WO2022130579A1 |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115659945A (zh) * | 2022-12-22 | 2023-01-31 | 南方电网科学研究院有限责任公司 | 一种标准文档相似度检测方法、装置及系统 |
| WO2024202379A1 (ja) * | 2023-03-31 | 2024-10-03 | 富士フイルム株式会社 | 情報処理装置、情報処理方法、及び情報処理プログラム |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH11272680A (ja) * | 1998-03-19 | 1999-10-08 | Fujitsu Ltd | 文書データ提供装置およびそのプログラム記録媒体 |
| JP2000112949A (ja) * | 1998-09-30 | 2000-04-21 | Fuji Xerox Co Ltd | 情報判別支援装置及び類似情報判別支援プログラムを記録した記録媒体 |
| JP2002259411A (ja) * | 2001-03-06 | 2002-09-13 | Nec Corp | 文章情報変換システム、文章情報変換方法および文章情報変換プログラム |
| JP2008009671A (ja) * | 2006-06-29 | 2008-01-17 | National Institute Of Information & Communication Technology | データ表示装置、データ表示方法及びデータ表示プログラム |
| JP2013020431A (ja) * | 2011-07-11 | 2013-01-31 | Nec Corp | 多義語抽出システム、多義語抽出方法、およびプログラム |
| JP2016045552A (ja) * | 2014-08-20 | 2016-04-04 | 富士通株式会社 | 特徴抽出プログラム、特徴抽出方法、および特徴抽出装置 |
-
2020
- 2020-12-17 JP JP2022569435A patent/JPWO2022130579A1/ja not_active Withdrawn
- 2020-12-17 WO PCT/JP2020/047219 patent/WO2022130579A1/ja not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2022130579A1 | 2022-06-23 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20965966; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2022569435; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20965966; Country of ref document: EP; Kind code of ref document: A1 |