WO2022130579A1 - Similarity determination program, similarity determination device, and similarity determination method - Google Patents

Similarity determination program, similarity determination device, and similarity determination method

Info

Publication number
WO2022130579A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
similarity
groups
vectors
named entity
Prior art date
Application number
PCT/JP2020/047219
Other languages
French (fr)
Japanese (ja)
Inventor
伸之 片江
Original Assignee
富士通株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 富士通株式会社
Priority to PCT/JP2020/047219 (WO2022130579A1)
Priority to JP2022569435A (JPWO2022130579A1)
Publication of WO2022130579A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Definitions

  • the present invention relates to a similarity determination program, a similarity determination device, and a similarity determination method.
  • one of the objects of the present invention is to improve the accuracy of determining the degree of similarity between partially similar documents.
  • the similarity determination program may cause a computer to perform the following processing.
  • the processing may include calculating a first plurality of vectors, one for each of a first plurality of partial document groups, based on the words contained in each of the first plurality of partial document groups. The first plurality of partial document groups is obtained by classifying a first plurality of partial documents, obtained by dividing a first document, based on a first plurality of named entities included in the first plurality of partial documents.
  • the processing may include acquiring a second plurality of vectors, one for each of a second plurality of partial document groups obtained by classifying a second plurality of partial documents obtained by dividing a second document.
  • the processing may include determining the degree of similarity between the first document and the second document based on a comparison between the first plurality of vectors and the second plurality of vectors.
  • the present invention can improve the accuracy of determining the degree of similarity between partially similar documents.
  • FIG. 1 is a diagram for explaining the similarity determination system 100 according to a comparative example.
  • the similarity determination system 100 calculates a similarity based on word meaning vectors, using a query 101 that requests determination of the similarity of a query document (input document) and a document set 102 including one or more comparison target documents.
  • the similarity determination system 100 extracts words from each of a plurality of documents, that is, the query document included in the query 101 and the comparison target document included in the document set 102, for example, by morphological analysis (process P110).
  • the similarity determination system 100 statistically calculates the word weights for each of the plurality of documents based on the words obtained in the process P110 (process P120). For example, the similarity determination system 100 may evaluate the importance of a word in a document as a weight by using an evaluation method such as tf-idf (Term Frequency-Inverse Document Frequency).
  • the similarity determination system 100 may execute the process P130 in parallel with at least part of the process P120, or before or after it. For example, the similarity determination system 100 calculates word vectors for each of the plurality of documents based on the words obtained in the process P110 (process P130).
  • the word vector may be referred to as a word embedding vector or a meaning vector.
  • the similarity determination system 100 may search a vector database in which a vector expressing the meaning of a word is stored and acquire a word vector.
  • the similarity determination system 100 calculates a document vector for each document by summing, over all the words in the document, the product of the word vector acquired in the process P130 and the word weight acquired in the process P120. The similarity determination system 100 then calculates the text similarity between the query document and each comparison target document as the similarity between the document vector of the query document and the document vector of that comparison target document (process P140).
  • the similarity determination system 100 performs ranking processing based on the calculated text similarity (processing P150), and stores the comparison target document having a high similarity with the query document as the ranking result 103 together with the similarity.
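  • as a rough illustration of the processes P140 and P150 of the comparative example, the following Python sketch builds a weighted document vector and ranks documents by cosine similarity. The function names (`cosine`, `document_vector`, `rank`), the dictionary-based inputs, and the treatment of unknown words are assumptions made for this example and do not appear in the publication.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def document_vector(words, weights, word_vectors, dim):
    """Sum weight * word vector over every word occurrence in the document."""
    vec = [0.0] * dim
    for w in words:
        if w not in word_vectors:
            continue  # assumption: words without a vector are skipped
        for i, x in enumerate(word_vectors[w]):
            vec[i] += weights.get(w, 0.0) * x
    return vec


def rank(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs sorted by descending similarity."""
    scored = [(doc_id, cosine(query_vec, v)) for doc_id, v in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

  • because every word of the document contributes to a single vector, a document that matches the query in only one element can still receive a low overall score, which is the limitation discussed next.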
  • FIG. 2 is a diagram illustrating an example of determination of similarity by the similarity determination system 100 shown in FIG. 1.
  • the example of FIG. 2 shows a case where the similarity is determined for the query document 101a and the comparison target document 102a relating to the lithium ion battery.
  • the "document" includes a document containing descriptions of a plurality of elements, for example, a patent document or a paper describing a device, a system, a manufacturing method, or the like having a plurality of components.
  • in such a document, compound names related to each classification (group) of the components of the lithium ion battery, such as "positive electrode active material", "negative electrode active material", "binder", "electrolyte", and "electrolyte solution solvent", may be described together in a mixed manner.
  • in that case, differences from the comparison target document with respect to other elements, in other words, elements that are not the investigation target, may affect the result of the similarity determination between the documents.
  • for example, the similarity of the document as a whole may be calculated as a low value.
  • the accuracy of determining the degree of similarity between partially similar documents may decrease.
  • although the semantic vector space is shown in two dimensions in FIG. 2 for convenience, it can actually have several hundred dimensions. As the number of dimensions of the semantic vector space increases, it becomes more likely that the similarity is judged to be low when the descriptions of other elements differ between the documents, even if the elements to be investigated are similar.
  • the similarity determination system 1 acquires a plurality of vectors corresponding to the respective partial documents obtained by dividing each document, and determines the document similarity based on a comparison of the plurality of vectors between the documents.
  • the similarity determination system 1 may acquire a plurality of sub-document groups by classifying a plurality of sub-documents based on a plurality of groups. Further, the similarity determination system 1 may use the similarity of the subdocument clusters having the highest similarity as the document similarity by comparing the similarity between the subdocument clusters of both documents to be determined.
  • FIG. 3 is a diagram for explaining the similarity determination system 1 according to the first embodiment.
  • FIGS. 4 and 5 are diagrams for explaining an example of processing of the similarity determination system 1.
  • the similarity determination system 1 calculates a similarity based on word meaning vectors, using a query 11 that requests determination of the similarity of a query document (input document) and a document set (document group) 12 including one or more comparison target documents to be determined.
  • the similarity determination system 1 determines the similarity between the query document 11a specified by the query 11 and the comparison target document 12a in the document set 12.
  • the query document 11a is an example of the first document, and the comparison target document 12a is an example of the second document.
  • the similarity determination system 1 extracts words from each of a plurality of documents by, for example, morphological analysis (process P1), as in the comparative example.
  • the similarity determination system 1 statistically calculates the word weights for each of the plurality of documents based on the words obtained in the process P1 (process P2). For example, the similarity determination system 1 may evaluate the importance of a word in a document as a weight by using an evaluation method such as tf-idf.
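  • as one possible reading of the process P2, the sketch below computes tf-idf weights per document. The publication only names tf-idf as an example evaluation method; the specific variant used here (term frequency normalized by document length, idf = log(N / df)) and the name `tfidf_weights` are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """docs: list of word lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs:
        doc_freq.update(set(words))  # count each word once per document
    weights = []
    for words in docs:
        counts = Counter(words)
        total = len(words)
        weights.append(
            {w: (c / total) * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return weights
```

  • with this variant, a word that appears in every document gets weight 0, so only words that distinguish a document contribute to its vectors.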
  • the similarity determination system 1 may execute the process P3 in parallel with at least part of the process P2, or before or after it. For example, the similarity determination system 1 calculates word vectors for each of the plurality of documents based on the words obtained in the process P1 (process P3).
  • the word vector may be referred to as a word embedding vector or a meaning vector.
  • the similarity determination system 1 may search a vector database in which a vector expressing the meaning of a word is stored and acquire a word vector.
  • the similarity determination system 1 may acquire a word vector corresponding to each of the words obtained in the process P1 based on the trained model.
  • the similarity determination system 1 divides each of the plurality of documents into a plurality of partial documents (for example, paragraphs) and clusters the partial documents based on the named entities included in each partial document to generate partial document clusters (process P4). Further, the similarity determination system 1 calculates the partial document vector of each partial document cluster.
  • the similarity determination system 1 calculates the text similarity between the partial document clusters based on the plurality of partial document vectors of the query document 11a and each of the plurality of partial document vectors of the comparison target document 12a (process P5).
  • the similarity determination system 1 performs a ranking process of ranking each of the plurality of comparison target documents 12a according to the similarity with the query document 11a based on the text similarity (process P6), and outputs the result 13.
  • the result 13 may include a ranking result.
  • the similarity determination system 1 acquires a plurality of partial documents (partial texts) by dividing the document for each document in the process P4.
  • partial documents, in other words, document division units, include, for example, sentences, paragraphs, chapters, sections, and the like.
  • in the following description, the partial document is assumed to be a paragraph.
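  • a minimal sketch of the division step, assuming that paragraphs are separated by blank lines in plain text. The publication does not specify how paragraph boundaries are detected, so both the delimiter and the function name `split_paragraphs` are assumptions.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```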
  • the similarity determination system 1 divides the query document 11a (document X) to obtain a plurality of paragraphs PX, which are an example of the first plurality of partial documents, and divides the comparison target document 12a (document Y) to obtain a plurality of paragraphs PY, which are an example of the second plurality of partial documents.
  • hereinafter, when paragraphs PX and PY are not distinguished from each other, they are simply referred to as "paragraph P".
  • the similarity determination system 1 acquires a plurality of partial document clusters PX1 to PXN and PY1 to PYM by clustering the plurality of paragraphs P based on named entity lists indicating named entity clusters, for example, the compound lists CX1 to CXN and CY1 to CYM shown in the figure.
  • in this example, the named entity is a compound name, and the document is a document in the field of chemistry that includes compound names.
  • for the clustering, various existing methods such as the shortest distance method may be used.
  • the compound lists CX1 to CXN are lists of a first plurality of named entities contained in the document X, for example, lists of named entity clusters obtained by classifying a plurality of compound names, and are an example of the first plurality of groups.
  • the compound lists CX1 to CXN may be acquired by classifying, in other words, clustering, the names of the first plurality of compounds based on the respective positions of the first plurality of compounds and their mutual similarity, and may be referred to as a first cluster group.
  • N is an integer of 1 or more, and indicates the number of groups included in the document X, in other words, the number of clusters.
  • the compound lists CY1 to CYM are lists of a second plurality of named entities contained in the document Y, for example, lists of named entity clusters obtained by classifying a plurality of compound names, and are an example of the second plurality of groups.
  • the compound lists CY1 to CYM may be acquired by classifying, in other words, clustering, the names of the second plurality of compounds based on the respective positions of the second plurality of compounds and their mutual similarity, and may be referred to as a second cluster group.
  • M is an integer of 1 or more, and indicates the number of groups included in the document Y, in other words, the number of clusters.
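  • the shortest distance method mentioned above can be sketched as a simple single-linkage agglomerative clustering. The distance function is left as a parameter because the publication does not state how position and similarity are combined; the stopping `threshold`, the `(name, position)` tuples in the usage note, and the function name `single_linkage` are assumptions for illustration only.

```python
def single_linkage(items, distance, threshold):
    """Repeatedly merge the two closest clusters until the closest pair
    is farther apart than `threshold`. The distance between two clusters
    is the smallest distance between any pair of their members."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

  • for example, with hypothetical items such as `("LiCoO2", 1)`, `("LiMn2O4", 2)`, and `("graphite", 10)`, and a distance defined as the gap between positions, a threshold of 3 groups the first two names together and leaves the third in its own cluster.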
  • the similarity determination system 1 may cluster the paragraph P by a clustering process using the degree of agreement between the named entity included in the named entity cluster and the named entity included in the plurality of paragraphs P.
  • for example, for the document X, the similarity determination system 1 generates the partial document clusters PX1 to PXN based on the degree of coincidence $\cos(C_{PX}, C_{Xa})$ between each of the per-cluster compound lists CX1 to CXN and each of the plurality of paragraphs PX, according to the following formula (1). Likewise, for the document Y, the similarity determination system 1 generates the partial document clusters PY1 to PYM based on the degree of coincidence $\cos(C_{PY}, C_{Yb})$ between each of the per-cluster compound lists CY1 to CYM and each of the plurality of paragraphs PY, according to the following formula (2).
  • $\operatorname*{argmax}_{a} \cos(C_{PX}, C_{Xa})$ (1)
  • $\operatorname*{argmax}_{b} \cos(C_{PY}, C_{Yb})$ (2)
  • $C_{PX}$ is the compound list included in paragraph PX, a is an integer of 1 to N, and $C_{Xa}$ is one of the per-cluster compound lists CX1 to CXN.
  • $C_{PY}$ is the compound list included in paragraph PY, b is an integer of 1 to M, and $C_{Yb}$ is one of the per-cluster compound lists CY1 to CYM.
  • cos is a function that calculates the cosine similarity between the two elements in parentheses.
  • argmax is a function that extracts the condition (here, the cluster) for which the element in parentheses is the maximum.
  • in this way, paragraph P can be assigned to the element (cluster of compounds) for which the cosine similarity between the compound names included in paragraph P and the compound names in the compound list is the maximum, for example, the element whose compound names appear most often in the paragraph.
  • partial document clusters PX1 and PY1: paragraphs describing "negative electrode active material".
  • partial document clusters PX2 and PY2: paragraphs describing "positive electrode active material".
  • partial document clusters PX3 and PY3: paragraphs describing "binder".
  • partial document clusters PX4 and PY4: paragraphs describing "electrolyte solvent".
  • the numbers of partial document clusters of the documents X and Y are assumed to match the numbers N and M of compound lists, respectively, but they are not limited to this and need not match. For example, the number of partial document clusters may be smaller than N or M.
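  • the paragraph-to-cluster assignment by maximum cosine similarity can be sketched as follows. Representing each compound list as a bag of compound-name counts, leaving paragraphs that share no compound with any cluster unassigned, and the names `cosine_counts` and `assign_paragraphs` are all assumptions for this example; the publication only states that the cosine similarity between compound lists is maximized.

```python
import math
from collections import Counter


def cosine_counts(a, b):
    """Cosine similarity between two Counters of compound-name occurrences."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def assign_paragraphs(paragraph_compounds, cluster_lists):
    """paragraph_compounds: one list of compound names per paragraph.
    cluster_lists: one list of compound names per named entity cluster.
    Returns {cluster index: [paragraph indices]}."""
    cluster_bags = [Counter(names) for names in cluster_lists]
    clusters = {}
    for p_idx, names in enumerate(paragraph_compounds):
        bag = Counter(names)
        scores = [cosine_counts(bag, cb) for cb in cluster_bags]
        if not scores or max(scores) == 0.0:
            continue  # assumption: no shared compound, so leave unassigned
        best = scores.index(max(scores))  # argmax over clusters
        clusters.setdefault(best, []).append(p_idx)
    return clusters
```

  • a cluster that attracts no paragraph simply does not appear in the returned dictionary, which is one way the number of partial document clusters can end up smaller than the number of compound lists.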
  • the similarity determination system 1 calculates a plurality of subdocument vectors corresponding to each of the plurality of subdocument clusters based on the words included in each of the subdocument clusters. For example, the similarity determination system 1 adds the result of multiplying the word vector acquired in the process P3 and the weight of the word acquired in the process P2 over all the words in the subdocument cluster for each subdocument cluster. By doing so, the partial document vector may be calculated.
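  • the weighted sum described above can be sketched per cluster as follows. Counting each word occurrence separately, skipping words without a vector, and the name `partial_document_vector` are assumptions for this example.

```python
def partial_document_vector(cluster_paragraphs, weights, word_vectors, dim):
    """cluster_paragraphs: list of word lists, one per paragraph in the cluster.
    Returns the sum of weight * word vector over all words in the cluster."""
    vec = [0.0] * dim
    for words in cluster_paragraphs:
        for w in words:
            v = word_vectors.get(w)
            if v is None:
                continue  # assumption: words without a vector are skipped
            weight = weights.get(w, 0.0)
            for i in range(dim):
                vec[i] += weight * v[i]
    return vec
```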
  • the similarity determination system 1 calculates the text similarity between partial document clusters, in other words, the similarity based on the meaning vectors of words, from the similarity between each partial document vector of the query document 11a and each partial document vector of the comparison target document 12a.
  • the partial document vectors of the query document 11a are an example of the first plurality of vectors, and the partial document vectors of the comparison target document 12a are an example of the second plurality of vectors.
  • the similarity determination system 1 may calculate the text similarity, for example, the cosine similarity, between a partial document cluster of the query document 11a and a partial document cluster of the comparison target document 12a by the following equation (3).
  • $f_t(P_{Xa}, P_{Yb}) = \cos(W_{PXa}, W_{PYb})$ (3)
  • $W_{PXa}$ is the distributed representation vector of the words included in the partial document cluster PXa, and $W_{PYb}$ is the distributed representation vector of the words included in the partial document cluster PYb.
  • the similarity determination system 1 may calculate the text similarity according to the above equation (3) for all pairs of the partial document clusters PX1, PX2, PX3, ..., PXN and the partial document clusters PY1, PY2, PY3, ..., PYM.
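  • computing the text similarity for all pairs of partial document clusters amounts to filling an N-by-M matrix of cosine similarities. The sketch below assumes that the partial document vectors are plain lists of floats; the names `cosine` and `text_similarity_matrix` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def text_similarity_matrix(x_vectors, y_vectors):
    """Entry [a][b] is the cosine similarity between the a-th partial document
    vector of document X and the b-th partial document vector of document Y."""
    return [[cosine(wx, wy) for wy in y_vectors] for wx in x_vectors]
```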
  • the similarity determination system 1 performs a ranking process in the process P6 to rank each of the plurality of comparison target documents 12a according to the similarity with the query document 11a based on the text similarity, and outputs the result 13.
  • the similarity determination system 1 outputs rankings of a plurality of comparison target documents 12a according to the similarity with the query document 11a based on the text similarity in the ranking process.
  • the similarity determination system 1 may calculate the document similarity Sim(X, Y) between the document X and one comparison target document Y, for example, according to the following equation (4).
  • $\mathrm{Sim}(X, Y) = \max_{a, b} f_t(P_{Xa}, P_{Yb})$ (4)
  • $f_t$ is the text similarity according to the above equation (3), and max is a function that adopts the maximum value among all the combinations in parentheses.
  • for example, the similarity determination system 1 determines that the text similarity is the maximum for the pair corresponding to the compound lists CX2 and CY2, in other words, between the partial document clusters of the "positive electrode active material", and determines that text similarity as the document similarity Sim(X, Y) between the documents X and Y.
  • the above equation (4) shows an example of calculating the document similarity between the document X (query document 11a) and one document Y (comparison target document 12a).
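  • taking the maximum over all cluster pairs can be sketched as a scan over a precomputed matrix of text similarities. The matrix layout (`sim_matrix[a][b]` holding the text similarity of the a-th cluster of X and the b-th cluster of Y), the returned tuple, and the name `document_similarity` are assumptions for this example.

```python
def document_similarity(sim_matrix):
    """Return (similarity, a, b) for the most similar pair of partial document
    clusters, or None when the matrix has no entries."""
    best = None
    for a, row in enumerate(sim_matrix):
        for b, s in enumerate(row):
            if best is None or s > best[0]:
                best = (s, a, b)
    return best
```

  • returning the indices along with the value makes it possible to report which pair of clusters, for instance the "positive electrode active material" pair, produced the document similarity.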
  • the similarity determination system 1 may perform the above processing for each of a plurality of comparison target documents 12a, for example, documents Y1 to YL (L is an integer of 2 or more and is the number of comparison target documents 12a), and acquire the document similarities Sim(X, Y1) to Sim(X, YL), one for each document Y.
  • the similarity determination system 1 may, for example, sort all the documents Y1 to YL to be searched in descending order of the document similarities Sim(X, Y1) to Sim(X, YL), and output the sort result as the result 13.
  • the result 13 may include the identification information of each document Y together with its rank, and may include the document similarity Sim(X, Y) of each document Y.
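  • a minimal sketch of how such a result could be assembled from the document similarities. The row layout (`rank`, `document`, `similarity` keys) and the name `build_result` are hypothetical; the publication does not prescribe a data format for the result 13.

```python
def build_result(similarities):
    """similarities: {document id: Sim(X, Y)}.
    Returns rows sorted by descending similarity, with 1-based ranks."""
    ordered = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [
        {"rank": i, "document": doc_id, "similarity": sim}
        for i, (doc_id, sim) in enumerate(ordered, start=1)
    ]
```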
  • the identification information of the document Y may include at least one of an identifier such as a document number or a document code, bibliographic information such as a document name, and at least a part of the contents of the document Y such as a summary or a predetermined part.
  • the similarity determination system 1 may output the identification information of a document Y determined to have a specific rank, for example, the document Y having the highest document similarity Sim(X, Y) with the query document 11a.
  • as described above, the similarity between documents is determined based on the text similarity for each partial document cluster classified by the clustering process, so the accuracy of determining the degree of similarity between partially similar documents can be improved.
  • for example, by comparing the partial document vectors between the documents X and Y, the similarity determination system 1 can determine that the similarity between the documents X and Y is high because the meaning vectors for the "positive electrode active material" are similar.
  • the semantic vector space is shown in two dimensions, but it can actually be a vector of several hundred dimensions.
  • the accuracy of determining the degree of similarity between partially similar documents can be improved by comparing the partial document clusters.
  • the similarity determination system 1 has been described as calculating a plurality of partial document vectors for both the query document 11a and the comparison target document 12a, but the present invention is not limited to this.
  • for one of the documents, for example, the plurality of comparison target documents 12a, when the similarity determination system 1 stores the document set 12 in advance, the plurality of partial document vectors of each of the plurality of comparison target documents 12a may be calculated in advance and accumulated.
  • in this case, the similarity determination system 1 may calculate a plurality of partial document vectors for the other document, for example, the query document 11a, and acquire the accumulated partial document vectors for the comparison target document 12a. The similarity determination system 1 may then perform the above-described text similarity calculation process and ranking process based on the calculated partial document vectors of the query document 11a and the acquired partial document vectors of the comparison target document 12a.
  • the document in which a plurality of partial document vectors are calculated and accumulated in advance is not limited to the comparison target document 12a, and may be a query document 11a in place of or in addition to the comparison target document 12a.
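  • calculating partial document vectors in advance and accumulating them can be pictured as a small cache keyed by document identifier. The class name `VectorStore`, the in-memory dictionary, and the `compute` callback are assumptions for this sketch; the publication does not describe how the vectors are stored.

```python
class VectorStore:
    """Caches the partial document vectors of each document by its identifier."""

    def __init__(self, compute):
        self._compute = compute  # callable: document text -> list of vectors
        self._cache = {}

    def get(self, doc_id, text):
        """Return cached vectors, computing them only on first access."""
        if doc_id not in self._cache:
            self._cache[doc_id] = self._compute(text)
        return self._cache[doc_id]
```

  • with such a store, only the query document needs to be vectorized at query time, and the vectors of the comparison target documents are read back from the accumulated data.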
  • FIG. 6 is a block diagram showing a functional configuration example of the server 2 in the similarity determination system 1 according to the first embodiment, and FIG. 7 is a diagram showing a screen output example by the server 2.
  • the server 2 is an example of a similarity determination device, an information processing device, or a computer.
  • the server 2 may perform various communications, such as reception of the query document 11a and the comparison target document 12a and transmission of the result 13, with a terminal device (not shown), another server, or the like.
  • the server 2 may provide, for example, a function for enabling access to the terminal device.
  • Examples of the function include generation and display control of a screen such as a web page used for access by a terminal device.
  • the terminal device may send an access request to the server 2 using an application such as a browser, and access the server 2 via a web page displayed in the application based on the screen information received from the server 2.
  • the server 2 may output the screen information of the query specification screen 210 for designating the query and the determination result output screen 240 for outputting the determination result.
  • the server 2 may optionally include a memory unit 21, a document input unit 22, a similarity calculation unit 23, and a similarity output unit 24.
  • the memory unit 21, the document input unit 22, the similarity calculation unit 23, and the similarity output unit 24 are examples of control units.
  • the memory unit 21 has a storage area for storing various data related to the similarity determination process.
  • the memory unit 21 may store information such as the query document 11a shown in FIG. 3, a plurality of comparison target documents 12a, the result 13, and a compound list for each cluster preclassified for each document. Further, the memory unit 21 may store information such as the paragraphs P, partial document clusters, text similarity, and document similarity Sim for each document shown in FIGS. 4 and 5 as intermediate data in the similarity determination process.
  • the document input unit 22 may receive input of the query document 11a and the comparison target document 12a from a computer such as a terminal device (not shown) or another server, and store the query document 11a and the comparison target document 12a in the memory unit 21, for example, as a DB (Database). In this way, the document input unit 22 may be able to construct and refer to the DB of the document.
  • the document input unit 22 may receive the input of the query document 11a related to the similarity determination request from a computer such as a terminal device (not shown) or another server and store it in the memory unit 21.
  • the query document 11a may be included in the query 11, for example.
  • the document input unit 22 may accept, for example, as the query 11, not the query document 11a itself, but the identification information of the query document 11a, for example, information such as a document number and a document code.
  • the document input unit 22 may specify the query document 11a related to the similarity determination request from, for example, the DB of the memory unit 21 based on the identification information.
  • the document input unit 22 may accept the document number set in the input field 211 when the determination button 212 of the query specification screen 210 is pressed.
  • the similarity calculation unit 23 calculates the similarity between the query document 11a and the comparison target document 12a. As illustrated in FIG. 6, the similarity calculation unit 23 may include a document division unit 231, a partial document clustering unit 232, and a document similarity calculation unit 233.
  • the document division unit 231 divides each of the query document 11a and the comparison target document 12a stored in the memory unit 21 to generate partial documents, for example, paragraphs PX and PY .
  • the partial document clustering unit 232 clusters the plurality of paragraphs PX and the plurality of paragraphs PY based on the compound lists CX1 to CXN and CY1 to CYM stored in the memory unit 21, and acquires the partial document clusters PX1 to PXN and PY1 to PYM. Further, the partial document clustering unit 232 calculates a partial document vector for each of the partial document clusters PX1 to PXN and PY1 to PYM based on the results of morphological analysis, word weight calculation, and word vector calculation for each of the documents X and Y.
  • the processing of the document division unit 231 and the partial document clustering unit 232 is an example of the processes P1 to P4 in FIG.
  • the document similarity calculation unit 233 calculates the text similarity for each pair of partial document clusters based on the partial document vector of each partial document cluster, and takes the text similarity of the cluster pair having the highest similarity as the similarity Sim(X, Y) of the documents.
  • the document similarity calculation unit 233 may calculate the similarities Sim(X, Y1) to Sim(X, YL) for the respective comparison target documents 12a.
  • the document similarity calculation unit 233 may store the calculated similarity Sim (X, Y) in the memory unit 21.
  • the similarity output unit 24 outputs the similarity Sim (X, Y) calculated by the similarity calculation unit 23.
  • for example, the similarity output unit 24 may output information on the comparison target documents 12a and their similarities Sim(X, Y) in descending order of the calculated similarities Sim(X, Y1) to Sim(X, YL).
  • the processing of the document similarity calculation unit 233 and the similarity output unit 24 is an example of the processes P5 and P6 of FIG.
  • the output by the similarity output unit 24 may include, for example, transmission to a computer such as a terminal device (not shown), storage in a storage area of a server 2 such as a memory unit 21, and the like.
  • the similarity output unit 24 may output the determination result output screen 240.
  • the determination result output screen 240 may include a display area 241 of the query document 11a and display areas 245a to 245c of at least one (three in FIG. 7) of the comparison target document 12a.
  • the display area 241 may include a display area 242 such as bibliographic information and a summary, and a full-text reference button 243 for transitioning to a screen for displaying the full text of the query document 11a.
  • the display areas 245a to 245c may include display areas 246a to 246c for bibliographic information and summaries, and full text reference buttons 247a to 247c.
  • in the display areas 245a to 245c, one or more paragraphs PY or the compound list corresponding to the partial document cluster determined to be similar, and/or the similarity Sim(X, Y), may be displayed.
  • the similarity output unit 24 can present to the user information about the document determined to have the highest similarity as a result of the similarity calculation between the query document 11a and the comparison target document 12a.
  • FIG. 8 is a flowchart illustrating an operation example of the server 2. As shown in FIG. 8, the server 2 may execute the processing for the query document 11a and the processing for the comparison target document 12a at different timings.
  • the document input unit 22 accepts the input of the query document 11a (step S1).
  • the document division unit 231 divides the query document 11a into a plurality of subdocuments, for example, a plurality of paragraphs PX (step S2).
  • the partial document clustering unit 232 clusters a plurality of paragraphs PX based on the compound lists C X1 to C XN , and acquires the partial document clusters PX1 to PXN (step S3). Further, the partial document clustering unit 232 calculates each partial document vector of the partial document clusters PX1 to PXN based on the weight of each word included in the document X and the meaning vector of each word (step S4).
  • the document input unit 22 accepts the input of the comparison target document 12a (step S5).
  • the document division unit 231 selects an unselected comparison target document 12a (step S6), and divides the selected comparison target document 12a into a plurality of partial documents, for example, a plurality of paragraphs PY (step S7).
  • the partial document clustering unit 232 clusters a plurality of paragraphs PY based on the compound lists CY1 to CYM, and acquires the partial document clusters PY1 to PYM (step S8). Further, the partial document clustering unit 232 calculates each partial document vector of the partial document clusters PY1 to PYM based on the weight of each word included in the document Y and the meaning vector of each word (step S9).
  • the document similarity calculation unit 233 compares the partial document vectors of the query document 11a and the comparison target document 12a, calculates the similarity Sim between the documents (step S10), and stores it in the memory unit 21 (step S11).
  • the document similarity calculation unit 233 determines whether or not there is an unselected comparison target document 12a (step S12), and if it determines that there is (YES in step S12), the process proceeds to step S6.
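  • When it is determined that there is no unselected comparison target document 12a (NO in step S12), the similarity output unit 24 outputs the comparison target documents 12a in descending order of similarity Sim (X, Y), together with their similarities Sim (X, Y) (step S13). Then, the process ends.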
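The loop of steps S6 to S13 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the toy vectors, and in particular the way the partial document vectors are aggregated into one score (here, the maximum cosine similarity over all cluster pairs) are assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def document_similarity(query_vectors, target_vectors):
    """Assumed aggregation: best cosine score over all cluster-vector pairs."""
    return max(cosine(x, y) for x in query_vectors for y in target_vectors)

def rank_documents(query_vectors, targets):
    """Score every comparison target (S6 to S12), then sort descending (S13)."""
    scores = {}  # stands in for the memory unit 21
    for doc_id, target_vectors in targets.items():
        scores[doc_id] = document_similarity(query_vectors, target_vectors)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy partial document vectors for document X and two comparison targets.
query = [[1.0, 0.0], [0.0, 1.0]]
targets = {
    "Y1": [[0.6, 0.8]],
    "Y2": [[1.0, 0.0], [0.5, 0.5]],
}
ranking = rank_documents(query, targets)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Because the query-side vectors do not depend on the comparison target, they are computed once before the loop, which matches the separation of steps S1 to S4 from steps S5 to S12 in FIG. 8.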
  • the server 2 may be a virtual server (VM; Virtual Machine) or a physical server. Further, the function of the server 2 may be realized by one computer or may be realized by two or more computers. Further, at least a part of the functions of the server 2 may be realized by using the HW (Hardware) resource and the NW (Network) resource provided by the cloud environment.
  • FIG. 9 is a block diagram showing a hardware (HW) configuration example of the computer 10 that realizes the function of the server 2.
  • As the HW configuration, the computer 10 may include, for example, a processor 10a, a memory 10b, a storage unit 10c, an IF (Interface) unit 10d, an I/O (Input/Output) unit 10e, and a reading unit 10f.
  • the processor 10a is an example of an arithmetic processing unit that performs various controls and operations.
  • the processor 10a may be connected to each block in the computer 10 so as to be communicable with each other by the bus 10i.
  • the processor 10a may be a multi-processor including a plurality of processors, a multi-core processor having a plurality of processor cores, or a configuration having a plurality of multi-core processors.
  • Examples of the processor 10a include integrated circuits (ICs) such as CPUs, MPUs, GPUs, APUs, DSPs, ASICs, and FPGAs. As the processor 10a, two or more combinations of these integrated circuits may be used.
  • MPU is an abbreviation for Micro Processing Unit
  • GPU is an abbreviation for Graphics Processing Unit
  • APU is an abbreviation for Accelerated Processing Unit.
  • DSP is an abbreviation for Digital Signal Processor
  • ASIC is an abbreviation for Application Specific IC
  • FPGA is an abbreviation for Field-Programmable Gate Array.
  • the memory 10b is an example of HW that stores information such as various data and programs.
  • Examples of the memory 10b include one or both of a volatile memory such as DRAM (Dynamic Random Access Memory) and a non-volatile memory such as PM (Persistent Memory).
  • the storage unit 10c is an example of HW that stores information such as various data and programs.
  • Examples of the storage unit 10c include a magnetic disk device such as an HDD (Hard Disk Drive), a semiconductor drive device such as an SSD (Solid State Drive), and various storage devices such as a non-volatile memory.
  • Examples of the non-volatile memory include flash memory, SCM (Storage Class Memory), ROM (Read Only Memory) and the like.
  • the storage unit 10c may store a program 10g (similarity determination program) that realizes all or a part of various functions of the computer 10.
  • The processor 10a of the server 2 can realize the functions of the server 2 illustrated in FIG. 6 by loading the program 10g stored in the storage unit 10c into the memory 10b and executing it.
  • The memory unit 21 shown in FIG. 6 may be realized by a storage area of one or both of the memory 10b and the storage unit 10c.
  • the IF unit 10d is an example of a communication IF that controls connection and communication with a network.
  • the IF unit 10d may include an adapter compliant with LAN (Local Area Network) such as Ethernet (registered trademark) or optical communication such as FC (Fibre Channel).
  • the adapter may support one or both wireless and wired communication methods.
  • the server 2 may be connected to the terminal device and each of the other servers so as to be able to communicate with each other via the IF unit 10d.
  • the program 10g may be downloaded from the network to the computer 10 via the communication IF and stored in the storage unit 10c.
  • the I / O unit 10e may include one or both of an input device and an output device.
  • Examples of the input device include a keyboard, a mouse, a touch panel, and the like.
  • Examples of the output device include a monitor, a projector, a printer and the like.
  • the reading unit 10f is an example of a reader that reads data and program information recorded on the recording medium 10h.
  • the reading unit 10f may include a connection terminal or device to which the recording medium 10h can be connected or inserted.
  • Examples of the reading unit 10f include an adapter compliant with USB (Universal Serial Bus), a drive device for accessing a recording disk, a card reader for accessing a flash memory such as an SD card, and the like.
  • the program 10g may be stored in the recording medium 10h, or the reading unit 10f may read the program 10g from the recording medium 10h and store it in the storage unit 10c.
  • Examples of the recording medium 10h include non-transitory computer-readable recording media such as magnetic / optical disks and flash memories.
  • Examples of the magnetic / optical disk include flexible discs, CDs (Compact Discs), DVDs (Digital Versatile Discs), Blu-ray discs, HVDs (Holographic Versatile Discs), and the like.
  • Examples of the flash memory include semiconductor memories such as USB memory and SD card.
  • The above-described HW configuration of the computer 10 is an example. Accordingly, HW in the computer 10 may be added or removed (for example, arbitrary blocks may be added or deleted), divided, or integrated in any combination, and buses may be added or deleted, as appropriate.
  • For example, in the server 2, at least one of the I / O unit 10e and the reading unit 10f may be omitted.
  • Second Embodiment
  • [2-1] Description of the Second Embodiment
  • Next, the second embodiment will be described.
  • In the first embodiment, the similarity determination system 1 stores the compound lists C X1 to C XN and CY1 to CYM for each cluster in advance.
  • In the second embodiment, a method by which the similarity determination system 1A calculates the compound lists C X1 to C XN and CY1 to CYM will be described.
  • FIG. 10 is a diagram for explaining the similarity determination system 1A according to the second embodiment, and FIG. 11 is a diagram for explaining an example of processing of the similarity determination system 1A.
  • the processes P1 to P6 based on the query 11 and the document set 12 are the same as those in the first embodiment.
  • The processes P7 and P8 may be executed in parallel with at least a part of the processes P1 to P3, or before or after them.
  • the processes P7 and P8 will be described.
  • The similarity determination system 1A extracts compound names, as an example of named entities, from each of a plurality of documents, for example, a query document 11a and a plurality of comparison target documents 12a (process P7), and generates a named entity list, for example, a compound list, for each document.
  • the similarity determination system 1A extracts the compound name from the query document 11a (denoted as “document X”) included in the query 11 and generates the compound list CX . Further, the similarity determination system 1A extracts a compound name from the comparison target document 12a (denoted as “document Y”) included in the document set 12 to generate a compound list CY .
  • the query document 11a and the comparison target document 12a are documents relating to the lithium ion battery.
  • Hereinafter, when the compound lists C X and CY generated for the set of documents to be determined are not distinguished from each other, they are simply referred to as "compound list C".
  • The similarity determination system 1A executes clustering that classifies and groups the named entities based on the named entity list (process P8).
  • As the clustering method, various existing methods such as the shortest distance method (single linkage) may be used.
  • The similarity determination system 1A may calculate, for each pair (set) of named entities included in the named entity list, a similarity score S between the named entities. For example, the similarity determination system 1A calculates the similarity score S for a pair of named entities based on the position of each named entity and the similarity between the named entities.
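  • The similarity determination system 1A may calculate the similarity score S (x 1 , x 2 ) using the following formula (5), that is, as the product of the structural similarity and the positional proximity of the pair.
    $$S(x_1, x_2) = TC(x_1, x_2) \times \mathrm{Distance}(x_1, x_2) \tag{5}$$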
  • TC (x 1 , x 2 ) is the Tanimoto coefficient of MACCS Key.
  • MACCS Key is one of the expression methods (compound descriptors) of the characteristics of compounds.
  • the Tanimoto coefficient is one of the indexes showing the structural similarity between compounds using MACCS Key, and is an example of the similarity between named entities when the named entity is a compound name.
  • Distance (x 1 , x 2 ) is, for example, a numerical value obtained by quantifying the proximity of each appearance position of the named entity in a document, and is, for example, a value corresponding to the following conditions.
  • The similarity determination system 1A may apply the above formula (5) to every pair (x 1 , x 2 ) of compound names included in the compound list C to calculate the similarity score S (x 1 , x 2 ) of each pair.
  • The similarity determination system 1A may cluster the compound names by applying a method such as the shortest distance method to the calculated similarity scores S (x 1 , x 2 ), thereby classifying and grouping the plurality of compound names included in the compound list C.
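The scoring and grouping described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the fingerprints are toy bit sets rather than real MACCS keys, the `proximity` rule (1.0 for the same paragraph, 0.5 otherwise) is hypothetical because the conditions for Distance are not reproduced here, and the shortest distance method is approximated by linking every pair whose score reaches a hypothetical `threshold`.

```python
from itertools import combinations

def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def proximity(pos_a, pos_b):
    """Hypothetical Distance(x1, x2): 1.0 in the same paragraph, 0.5 otherwise."""
    return 1.0 if pos_a == pos_b else 0.5

def similarity_score(x1, x2):
    """Product of structural similarity and positional proximity."""
    return tanimoto(x1["bits"], x2["bits"]) * proximity(x1["paragraph"], x2["paragraph"])

def single_linkage(entities, threshold):
    """Group names whose score reaches the threshold, transitively (union-find)."""
    parent = {name: name for name in entities}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for a, b in combinations(entities, 2):
        if similarity_score(entities[a], entities[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in entities:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in clusters.values())

# Toy data: bit sets and paragraph indices are made up for illustration.
entities = {
    "LiCoO2":   {"bits": {1, 2, 3, 4}, "paragraph": 0},
    "LiMn2O4":  {"bits": {1, 2, 3, 5}, "paragraph": 0},
    "graphite": {"bits": {7, 8},       "paragraph": 1},
    "PVDF":     {"bits": {9, 10, 11},  "paragraph": 2},
}
clusters = single_linkage(entities, threshold=0.5)
print(clusters)
```

Linking by a threshold and taking connected components yields the same groups as cutting a single-linkage dendrogram at that similarity level, which keeps the sketch free of external dependencies.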
  • The similarity determination system 1A divides the compound names in the compound list C X into N clusters (groups) by clustering the compound list C X , and generates the compound lists C X1 to C XN for each cluster. Further, the similarity determination system 1A divides the compound names in the compound list CY into M clusters (groups) by clustering the compound list CY , and generates the compound lists CY1 to CYM for each cluster.
  • the compound lists C X and CY can be classified into clusters of the following four elements (characteristics) by such clustering.
  • Compound lists C X1 and CY1: a cluster having the element (characteristic) "negative electrode active material".
  • Compound lists C X2 and CY2: a cluster having the element (characteristic) "positive electrode active material".
  • Compound lists C X3 and CY3: a cluster having the element (characteristic) "binder".
  • Compound lists C X4 and CY4: a cluster having the element (characteristic) "electrolyte solvent".
  • In this way, the similarity determination system 1A can generate the compound lists C X1 to C XN and CY1 to CYM for each cluster that are used in the partial document clustering (process P4).
  • the Tanimoto coefficient of MACCS Key is used as the structural similarity, but the description is not limited to this.
  • the method for expressing the characteristics of a compound is not limited to MACCS Key, in other words, MACCS fingerprint, and various compound descriptors such as Morgan fingerprint may be adopted.
  • the index indicating the structural similarity between the compounds is not limited to the Tanimoto coefficient, and various coefficients such as the Dice coefficient may be used.
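As a small illustration of how the choice of coefficient changes the score, the following sketch computes both the Tanimoto and the Dice coefficient on the same pair of fingerprints. The fingerprints are toy bit sets chosen for the example, not real MACCS or Morgan fingerprints; in practice a cheminformatics toolkit would supply the bit vectors.

```python
def tanimoto(a, b):
    """|A and B| / |A or B| for fingerprints given as sets of 'on' bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def dice(a, b):
    """2 * |A and B| / (|A| + |B|) for the same representation."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0

# Two made-up fingerprints sharing three of their four 'on' bits.
fp1 = {1, 2, 3, 4}
fp2 = {1, 2, 3, 5}
print(tanimoto(fp1, fp2), dice(fp1, fp2))
```

The Dice coefficient is never smaller than the Tanimoto coefficient for the same pair, so any threshold used in the clustering would need to be re-tuned when the coefficient is swapped.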
  • In the above description, the similarity determination system 1A calculates, as the similarity score S (x 1 , x 2 ), the product of the numerical value of the proximity of the appearance positions of the named entities in the document and the similarity between the named entities, but the calculation is not limited to this.
  • the similarity determination system 1A may calculate the similarity score S (x 1 , x 2 ) using the following equation (6).
  • W is a weight.
  • For example, W may be set to a value such as "0.5", defined as appropriate by the user or the like, so that the position of each named entity and the similarity between the named entities are considered evenly.
  • Alternatively, W may be set based on a model trained by machine learning, using training data that includes search queries and correct answer examples (correct answer data), so that the correct answer examples are ranked higher in the search.
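A weighted combination of the two quantities might look like the sketch below. Formula (6) is not reproduced in this text, so the form `W * structural + (1 - W) * positional`, including which term receives W, is an assumption; the function and parameter names are hypothetical.

```python
def weighted_score(structural, positional, weight=0.5):
    """Assumed weighted sum: weight * structural + (1 - weight) * positional."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * structural + (1.0 - weight) * positional

# With weight 0.5 both quantities contribute evenly.
print(weighted_score(0.6, 1.0))
# With weight 1.0 only the structural similarity is used.
print(weighted_score(0.6, 1.0, weight=1.0))
```

Unlike the product form, a weighted sum stays positive when one of the two quantities is zero, so structurally unrelated compounds that appear close together can still receive a non-zero score.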
  • As described above, the similarity determination system 1A generates the first cluster group by classifying the first plurality of compound names included in the query document 11a based on the position of each of the first plurality of compound names and the similarity between them. Further, the similarity determination system 1A generates the second cluster group by classifying the second plurality of compound names included in the comparison target document 12a based on the position of each of the second plurality of compound names and the similarity between them. The first cluster group is an example of the first plurality of groups, and the second cluster group is an example of the second plurality of groups.
  • As described above, the similarity determination system 1A according to the second embodiment provides the same effect as the first embodiment. Further, since the compound list for each cluster can be generated for each document, the user does not have to generate the compound lists manually, which is convenient. Further, even when the similarity determination system 1A does not store one or both of the query document 11a and the comparison target document 12a, the similarity of the documents can be determined. Such cases include, for example, the case where the document is included in the query 11, or the case where the location of the document (a storage location other than the similarity determination system 1A) is specified by the query 11.
  • FIG. 12 is a block diagram showing a functional configuration example of the server 3 in the similarity determination system 1A according to the second embodiment, and FIG. 13 is a diagram showing a screen output example by the server 3.
  • the server 3 is an example of a similarity determination device, an information processing device, or a computer.
  • The server 3 may perform various communications, such as reception of the query document 11a and the comparison target document 12a and transmission of the result 14, with a terminal device (not shown), another server, or the like.
  • the server 3 may provide, for example, a function for enabling access to the terminal device. For example, as shown in FIG. 13, the server 3 may output screen information of a search query specification screen 330 for designating a search query and a search result output screen 340 for outputting search results.
  • the above-mentioned similarity determination process by the similarity determination system 1A may be realized by the server 3.
  • The server 3 may include, for example, a document DB unit 31 and a document search unit 32.
  • the document DB unit 31 and the document search unit 32 are examples of control units.
  • The server 3 may include the document input unit 22 shown in FIG. 6.
  • the document DB unit 31 stores the query document 11a and the comparison target document 12a, and performs a document DB construction process for constructing the document DB.
  • the document search unit 32 performs a document search process for searching a comparison target document 12a similar to the query document 11a specified in the query 11 based on the information stored in the document DB unit 31 in response to the acceptance of the query 11.
  • the document search process is a process including a similarity determination process, and is an example of use (application example) of the similarity determination process.
  • The document DB unit 31 may include, for example, a document storage unit 311, a compound name extraction unit 312, a clustering unit 313, a document cluster vector calculation unit 314, and a document cluster vector storage unit 315.
  • the document storage unit 311 is an example of the memory unit 21 (see FIG. 6) according to the first embodiment, and stores a plurality of documents.
  • the document is a document that can be used as either the query document 11a or the comparison target document 12a. Therefore, it can be said that the document storage unit 311 stores the query document 11a and the document set (document group) 12 including the plurality of comparison target documents 12a that are the targets of the query 11.
  • the document storage unit 311 may store a plurality of documents in advance before receiving the query 11.
  • the document storage unit 311 may store a plurality of documents received by the document input unit 22 according to the first embodiment.
  • the compound name extraction unit 312 extracts a compound name as an example of a named entity from each of a plurality of documents accumulated by the document storage unit 311 and generates compound lists C X and CY for each document.
  • The process of the compound name extraction unit 312 is an example of the process P7 in FIG. 10.
  • The clustering unit 313 calculates the similarity score S for each pair of compound names included in the compound lists C X and CY . Further, the clustering unit 313 classifies the compound names into a plurality of clusters based on the similarity score S, and generates the compound lists C X1 , C X2 , C X3 , ... C XN and the compound lists CY1 , CY2 , CY3 , ... CYM .
  • The process of the clustering unit 313 is an example of the process P8 in FIG. 10.
  • The document cluster vector calculation unit 314 may calculate a document vector for each partial document cluster based on the compound cluster information from the clustering unit 313 and on the weights and word vectors calculated from the words extracted from each of the plurality of documents accumulated by the document storage unit 311.
  • The process of the document cluster vector calculation unit 314 is an example of the processes P1 to P4 and at least a part of the process P5 in FIG. 10.
  • the document cluster vector storage unit 315 is an example of the memory unit 21 shown in FIG. 6, and stores the document vector for each partial document cluster calculated by the document cluster vector calculation unit 314.
  • The document search unit 32 may include, for example, a search query designation unit 321, a document similarity calculation unit 322, a search result generation unit 323, and a search result output unit 324.
  • The search query designation unit 321 is an example of the document input unit 22 shown in FIG. 6, and accepts, from a computer such as a terminal device or another server (not shown), the input of a query 11 requesting a document search (hereinafter sometimes referred to as "search query 11").
  • The search query designation unit 321 may accept the document number of the query document 11a set in the input field 331 when the search button 332 of the search query specification screen 330 is pressed.
  • The document similarity calculation unit 322 is an example of the document similarity calculation unit 233 shown in FIG. 6.
  • The document similarity calculation unit 322 calculates the document similarity Sim (X, Y) between the query document 11a specified by the search query 11 and the comparison target document 12a based on the document vectors stored in the document cluster vector storage unit 315.
  • The document similarity calculation unit 322 may compare, among the partial document vectors stored in the document cluster vector storage unit 315, the plurality of partial document vectors corresponding to the query document 11a with those corresponding to the comparison target document 12a, and calculate the text similarity.
  • The document similarity calculation unit 322 may calculate the document similarity Sim (X, Y) based on the text similarity, sort the comparison target documents 12a in descending order of the document similarity Sim (X, Y), and generate the ranking result 14.
  • the content and output method of the result 14 are the same as those of the result 13 according to the first embodiment.
  • the process of the document similarity calculation unit 322 is an example of at least a part of the process P5 in FIG. 10 and the process P6.
  • the search result generation unit 323 generates a search result for output based on the result 14.
  • The search result generation unit 323 may generate the search result output screen 340 shown in FIG. 13.
  • the search result output screen 340 may replace the determination result 244 in the determination result output screen 240 shown in FIG. 7 with the search result 344.
  • The search result output screen 340 may include a display area 341 of the query document 11a and display areas 345a to 345c of at least one (three in FIG. 13) comparison target document 12a.
  • the display area 341 may include a display area 342 such as bibliographic information and a summary, and a full-text reference button 343 of the query document 11a.
  • the display areas 345a to 345c may include display areas 346a to 346c for bibliographic information and summaries, and full text reference buttons 347a to 347c.
  • In the display areas 346a to 346c, one or more paragraphs PY or the compound list corresponding to the partial document cluster determined to be similar, and/or the similarity Sim (X, Y), may be displayed.
  • the search result output unit 324 outputs the search result output screen 340 to a computer such as a terminal device or another server (not shown).
  • FIG. 14 is a flowchart illustrating an operation example of the document DB construction process of the server 3, and FIG. 15 is a flowchart illustrating an operation example of the document search process of the server 3.
  • the document storage unit 311 selects an unselected document (step S21) and registers the document in the document DB (step S22).
  • the compound name extraction unit 312 extracts the compound name from the text of the document (step S23).
  • the clustering unit 313 clusters the extracted compound names (step S24).
  • the document cluster vector calculation unit 314 divides the document into a plurality of sub-documents (step S25), and clusters a plurality of sub-documents based on the compound cluster generated by the clustering unit 313 (step S26).
  • the document cluster vector calculation unit 314 calculates the document vector of each partial document cluster (step S27).
  • the document cluster vector storage unit 315 associates the calculated document vector with the document and registers (stores) it in, for example, a document DB or a document cluster vector DB (step S28).
  • the document storage unit 311 determines whether or not there is an unselected document (step S29), and if it determines that there is an unselected document (YES in step S29), the process proceeds to step S21. When the document storage unit 311 determines that there is no unselected document (NO in step S29), the process ends.
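Steps S25 to S27 can be sketched as follows. This is a simplified illustration under several assumptions: partial documents are blank-line-separated paragraphs, a paragraph is assigned to every compound cluster whose compound names it mentions, and the partial document vector is a weighted average of word vectors. The word vectors, the weights, and all function names are hypothetical.

```python
import re

def split_paragraphs(text):
    """Step S25: split a document into partial documents (blank-line paragraphs)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def cluster_paragraphs(paragraphs, compound_clusters):
    """Step S26: a paragraph joins every cluster whose compound names it mentions."""
    return {
        label: [p for p in paragraphs if any(name in p for name in names)]
        for label, names in compound_clusters.items()
    }

def cluster_vector(paragraphs, word_vectors, word_weights):
    """Step S27: weighted average of the word vectors found in the cluster."""
    dim = len(next(iter(word_vectors.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for paragraph in paragraphs:
        for word in re.findall(r"\w+", paragraph):
            if word in word_vectors:
                w = word_weights.get(word, 1.0)  # default weight is an assumption
                total = [t + w * v for t, v in zip(total, word_vectors[word])]
                weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total

# Toy inputs: the vectors and weights below are made up for illustration.
text = "The cathode uses LiCoO2 powder.\n\nThe binder is PVDF resin."
compound_clusters = {"positive electrode": ["LiCoO2"], "binder": ["PVDF"]}
word_vectors = {"cathode": [1.0, 0.0], "powder": [0.0, 1.0], "binder": [0.0, 2.0]}
word_weights = {"cathode": 3.0, "powder": 1.0}

grouped = cluster_paragraphs(split_paragraphs(text), compound_clusters)
vectors = {
    label: cluster_vector(ps, word_vectors, word_weights)
    for label, ps in grouped.items()
}
print(vectors)
```

Storing `vectors` keyed by document and cluster label corresponds to step S28, so that the search process only needs to read precomputed vectors.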
  • The search query designation unit 321 accepts the designation of the query document 11a from the search query specification screen 330 (step S31).
  • the document similarity calculation unit 322 acquires the document vector of the query document 11a from the document cluster vector storage unit 315 (step S32).
  • the document similarity calculation unit 322 selects an unselected document (step S33), and acquires the document vector of the partial document cluster of the selected document from the document cluster vector storage unit 315 (step S34).
  • the document similarity calculation unit 322 compares the document vectors of a plurality of partial document clusters between the query document 11a and the selected document, and calculates the document similarity Sim (X, Y) (step S35).
  • the document similarity calculation unit 322 determines whether or not there is an unselected document (step S36), and if so (YES in step S36), the process proceeds to step S33.
  • When the document similarity calculation unit 322 determines that there is no unselected document (NO in step S36), it extracts a predetermined number of documents in descending order of document similarity (step S37).
  • The search result generation unit 323 generates a search result based on the extracted data, and the search result output unit 324 outputs the search result, for example, the search result output screen 340 (step S38). Then, the process ends.
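The search flow of steps S31 to S37 can be sketched as a top-k selection over precomputed vectors. The dictionary `vector_store` stands in for the document cluster vector storage unit 315; skipping the query document itself and aggregating cluster pairs by the maximum cosine similarity are assumptions of this sketch, as are all names used.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search(query_id, vector_store, top_k):
    """Return the top_k (similarity, document id) pairs for the query document."""
    query_vectors = vector_store[query_id]              # step S32
    scored = []
    for doc_id, vectors in vector_store.items():        # steps S33, S34, S36
        if doc_id == query_id:
            continue                                     # assumption: skip the query itself
        sim = max(cosine(x, y) for x in query_vectors for y in vectors)  # step S35
        scored.append((sim, doc_id))
    return heapq.nlargest(top_k, scored)                # step S37

# Toy store with one partial document vector per document.
vector_store = {
    "Q": [[1.0, 0.0]],
    "A": [[0.0, 1.0]],
    "B": [[1.0, 0.0]],
    "C": [[1.0, 1.0]],
}
results = search("Q", vector_store, top_k=2)
print(results)
```

`heapq.nlargest` avoids sorting the full candidate list when only a predetermined number of documents is needed.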
  • FIG. 16 is a diagram for explaining the similarity determination system 1B according to the third embodiment, and FIGS. 17 and 18 are diagrams for explaining an example of processing of the similarity determination system 1B.
  • the similarity determination system 1B replaces the process P6 of the similarity determination system 1A shown in FIG. 10 with the process P10, and adds the process P9 using the result of the process P8.
  • the process P10 is executed using the results of both the processes P5 and P9.
  • Process P9 is a process of calculating named entity similarity for each cluster, for example, compound similarity for each pair of clusters between documents.
  • the process P10 is a process of ranking each of the plurality of comparison target documents 12a according to the similarity with the query document 11a based on the text similarity and the named entity similarity.
  • the processes P9 and P10 will be described.
  • The similarity determination system 1B may compare, for example, the compound lists of the first plurality of clusters generated from the query document 11a with the compound lists of the second plurality of clusters generated from the comparison target document 12a. The similarity determination system 1B may then calculate the compound similarity, for example, the cosine similarity given by the following formula (7), for every cluster pair between the first plurality of clusters and the second plurality of clusters.
    $$f_c(C_{Xa}, C_{Yb}) = \frac{\sum_i C_{Xai}\, C_{Ybi}}{\sqrt{\sum_i C_{Xai}^2}\,\sqrt{\sum_i C_{Ybi}^2}} \tag{7}$$
  • Here, i is an index identifying each of the compound names contained in the compound lists C Xa and CYb , and C Xai and CYbi denote the number of appearances of the i-th compound name in the compound lists C Xa and CYb , respectively.
  • The denominator is the product of the square root of the sum of squares of the numbers of appearances of the compounds in C Xa and the square root of the sum of squares of the numbers of appearances of the compounds in CYb . The numerator is the sum of the products of the numbers of appearances of the compounds common to C Xa and CYb .
  • The similarity determination system 1B may calculate the compound similarity according to the above formula (7) for the compound lists C X1 , C X2 , C X3 , ... C XN and the compound lists CY1 , CY2 , CY3 , ...
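The cosine similarity over appearance counts described above can be sketched as follows. The function name and the example compound lists are made up for illustration; each list holds one entry per appearance of a compound name in the cluster.

```python
import math
from collections import Counter

def compound_similarity(names_a, names_b):
    """Cosine similarity between the appearance counts of two compound lists."""
    counts_a, counts_b = Counter(names_a), Counter(names_b)
    # Numerator: products of counts for compound names common to both lists.
    numerator = sum(counts_a[n] * counts_b[n] for n in counts_a.keys() & counts_b.keys())
    # Denominator: product of the two Euclidean norms of the count vectors.
    denominator = (
        math.sqrt(sum(c * c for c in counts_a.values()))
        * math.sqrt(sum(c * c for c in counts_b.values()))
    )
    return numerator / denominator if denominator else 0.0

# "LiCoO2" appears twice in the first list and once in the second.
c_x2 = ["LiCoO2", "LiCoO2", "LiMn2O4"]
c_y2 = ["LiCoO2", "LiFePO4"]
score = compound_similarity(c_x2, c_y2)
print(round(score, 4))
```

Compound names that occur in only one list contribute nothing to the numerator but still enlarge the denominator, which is why unrelated compounds pull the score down.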
  • The similarity determination system 1B performs a ranking process of ranking each of the plurality of comparison target documents 12a according to its similarity to the query document 11a based on the text similarity and the named entity similarity (process P10), and outputs the result 14.
  • In the ranking process, the similarity determination system 1B calculates a similarity in which the text similarity and the named entity similarity are integrated and, based on that similarity, outputs the ranking of the plurality of comparison target documents 12a according to their similarity to the query document 11a.
  • the similarity determination system 1B may calculate the document similarity Sim (X, Y) between the document X and one comparison target document Y, for example, according to the following equation (8).
  • fc is a cosine similarity according to the above equation (7), in other words, a named entity similarity.
  • The above formula (8) shows an example of calculating the document similarity between the document X (query document 11a) and one document Y (comparison target document 12a). As in the second embodiment, the similarity determination system 1B may acquire the document similarities Sim (X, Y 1 ) to Sim (X, Y L ) according to the number of documents Y.
  • For example, as in the second embodiment, the similarity determination system 1B performs the ranking process by sorting all the documents Y 1 to Y L to be searched in descending order of the document similarities Sim (X, Y 1 ) to Sim (X, Y L ). Further, the similarity determination system 1B may output the sort result as the result 14.
  • The similarity determination system 1B may calculate the document similarity Sim (X, Y) between the document X and one comparison target document Y as a weighted sum of the named entity similarity and the text similarity according to the following equation (9).
  • w is a weight.
  • For example, w may be set to a value such as "0.5", defined as appropriate by the user or the like, so that the named entity similarity and the text similarity are considered equally.
  • Alternatively, w may be set based on a model trained by machine learning, using training data that includes search queries and correct answer examples (correct answer data), so that the correct answer examples are ranked higher in the search.
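The weighted combination and the subsequent ranking can be sketched as follows. Equation (9) is not reproduced in this text, so the form `w * named entity similarity + (1 - w) * text similarity`, including which term receives w, is an assumption; the candidate scores are made-up values.

```python
def document_similarity(text_sim, entity_sim, w=0.5):
    """Assumed weighted sum: w * entity_sim + (1 - w) * text_sim."""
    return w * entity_sim + (1.0 - w) * text_sim

# (text similarity, named entity similarity) per comparison target document.
candidates = {"Y1": (0.9, 0.2), "Y2": (0.6, 0.8)}

def rank(w):
    """Sort document ids in descending order of the combined similarity."""
    return sorted(
        candidates,
        key=lambda doc_id: document_similarity(*candidates[doc_id], w=w),
        reverse=True,
    )

print(rank(0.5))  # both similarities weighted equally
print(rank(0.0))  # text similarity only
```

In this toy data the order of Y1 and Y2 flips between the two settings, which shows why the choice of w (fixed by the user or learned from correct answer data) matters for the final ranking.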
  • Here, consider a case where the compound lists C X and CY of the whole documents are compared with each other.
  • When the element to be investigated is "positive electrode active material", compound names related to "positive electrode active material", such as "LiCoO2", appear in common between the documents, while compound names related to the other elements differ between the documents. The compound similarity between the documents may therefore be calculated as a low value. In this way, when the compound lists C are compared for each whole document, the compound similarity may be calculated as a low value even if the element to be investigated is similar between the documents.
  • In contrast, as illustrated in FIG. 18, the similarity determination system 1B can determine that the compound similarity between the pair of the compound lists C X2 and CY2 , in other words, between the clusters of "positive electrode active material", is the maximum. The similarity determination system 1B can then adopt that compound similarity as the value of fc used for calculating the document similarity Sim (X, Y).
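The contrast between whole-document and per-cluster comparison can be reproduced with a small sketch. The cluster labels and compound names are toy data, and selecting the best pair with `max` is one straightforward way to realize "the compound similarity is the maximum"; the function names are hypothetical.

```python
import math
from collections import Counter

def cosine_counts(names_a, names_b):
    """Cosine similarity between the appearance counts of two compound lists."""
    a, b = Counter(names_a), Counter(names_b)
    num = sum(a[n] * b[n] for n in a.keys() & b.keys())
    den = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return num / den if den else 0.0

def best_cluster_pair(clusters_x, clusters_y):
    """Return (similarity, label_x, label_y) of the most similar cluster pair."""
    return max(
        (cosine_counts(cx, cy), label_x, label_y)
        for label_x, cx in clusters_x.items()
        for label_y, cy in clusters_y.items()
    )

# Toy clusters: only the positive electrode clusters share a compound.
clusters_x = {"C_X1": ["graphite"], "C_X2": ["LiCoO2", "LiCoO2"]}
clusters_y = {"C_Y1": ["Si"], "C_Y2": ["LiCoO2"]}

whole_x = [n for names in clusters_x.values() for n in names]
whole_y = [n for names in clusters_y.values() for n in names]
whole_document = cosine_counts(whole_x, whole_y)
best = best_cluster_pair(clusters_x, clusters_y)
print(round(whole_document, 4), best)
```

The whole-document score is diluted by the unrelated negative electrode compounds, while the per-cluster maximum isolates the shared "positive electrode active material" cluster.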
  • As described above, according to the third embodiment, the similarity between documents is determined based on the named entity similarity for each cluster classified by the clustering process, so that the accuracy of determining the similarity between partially similar documents can be further improved.
  • FIG. 19 is a block diagram showing a functional configuration example of the server 4 in the similarity determination system 1B according to the third embodiment. Unless otherwise specified, the server 4 may be the same as the server 3 shown in FIG. 12.
  • the above-mentioned similarity determination process by the similarity determination system 1B may be realized by the server 4.
  • The server 4 may include, for example, a document DB unit 41 and a document search unit 42.
  • the document DB unit 41 and the document search unit 42 are examples of control units.
  • the document DB unit 41 may include a compound cluster storage unit 416 in addition to the configuration of the document DB unit 31 shown in FIG.
  • the document search unit 42 may include a document similarity calculation unit 422 instead of the document similarity calculation unit 322 shown in FIG. 12.
  • the compound cluster storage unit 416 is an example of the memory unit 21 shown in FIG. 6, and may store the information of the compound cluster calculated by the clustering unit 313, for example, the compound list C, in association with the document.
  • the document similarity calculation unit 422 compares the partial document vectors of the query document 11a and the comparison target document 12a stored in the document cluster vector storage unit 315, and calculates the text similarity. Further, the document similarity calculation unit 422 compares the compound lists of the query document 11a and the comparison target document 12a stored in the compound cluster storage unit 416, and calculates the compound similarity.
  • the document similarity calculation unit 422 calculates the document similarity Sim (X, Y) based on the text similarity and the compound similarity, and generates the result 14 from the document similarity Sim (X, Y).
  • the process of the document similarity calculation unit 422 is an example of the processes P5, P9, and P10 of FIG.
  • the document search unit 42 may output the screen illustrated in FIG.
  • FIG. 20 is a flowchart illustrating an operation example of the document DB construction process of the server 4.
  • FIG. 21 is a flowchart illustrating an operation example of the document retrieval process of the server 4.
  • FIG. 20 shows that step S41 is added between steps S24 and S25 shown in FIG. As illustrated in FIG. 20, the compound cluster storage unit 416 stores the calculated compound cluster information for each document in step S41.
  • step S51 is added between steps S32 and S33 shown in FIG. 15, and step S35 is replaced with steps S52 and S53.
  • in step S51, the document similarity calculation unit 422 acquires the compound cluster of the query document 11a, for example, the compound list, from the compound cluster storage unit 416.
  • in step S52, the document similarity calculation unit 422 acquires the compound cluster of the selected document, for example, the compound list, from the compound cluster storage unit 416.
  • in step S53, the document similarity calculation unit 422 calculates the document similarity Sim(X, Y) based on the document vectors acquired in steps S32 and S34 and the compound clusters acquired in steps S51 and S52.
  • FIG. 22 is a block diagram showing a functional configuration example of the server 5 in the similarity determination system 1C according to the first modification of the third embodiment.
  • FIG. 23 is a diagram showing a screen output example by the server 5.
  • the similarity determination system 1C calculates the text similarity by comparing the partial document cluster containing a predetermined keyword in the query document 11a with each partial document cluster of the plurality of comparison target documents 12a.
  • the server 5 may optionally include a document DB unit 41 and a document search unit 52.
  • the document DB unit 41 and the document search unit 52 are examples of control units.
  • the document DB unit 41 is the same as the document DB unit 41 shown in FIG.
  • the document search unit 52 may include a document similarity calculation unit 522, a keyword input unit 525, and a document cluster identification unit 526 in place of the document similarity calculation unit 422 of the document search unit 42 shown in FIG.
  • the keyword input unit 525 accepts input of one or more keywords from the user. For example, as shown in FIG. 23, when the search button 533 of the search query specification screen 530 is pressed, the keyword input unit 525 notifies the document cluster identification unit 526 of the document number of the query document 11a and the one or more keywords set in the input fields 531 and 532.
  • the document cluster identification unit 526 refers to the document cluster vector storage unit 315 and identifies, from the plurality of partial document clusters of the notified query document 11a, the partial document cluster that includes the one or more notified keywords (for example, includes them a predetermined number of times or more).
  • the document similarity calculation unit 522 limits the partial document vector of the query document 11a to be compared with the plurality of partial document vectors of the comparison target document 12a to the document vector of the partial document cluster identified by the document cluster identification unit 526. In other words, the document similarity calculation unit 522 sets the importance (priority) of the identified partial document cluster higher than that of the other partial document clusters. Then, the document similarity calculation unit 522 calculates the text similarity for the identified partial document cluster, and calculates the inter-document similarity based on the text similarity and the compound similarity.
  • with the server 5 according to the first modification, the same effect as that of the first and second embodiments can be obtained. Further, among the plurality of partial document clusters in the query document 11a, the comparison target document 12a can be searched using an appropriate partial document cluster that includes the keyword intended by the user, so the accuracy of determining the similarity between the documents can be further improved. Further, since the number of partial document clusters used for determining the similarity can be limited, the processing time of the document retrieval process can be shortened. In addition, the user can flexibly specify a cluster including a predetermined keyword, which is highly convenient.
  • FIG. 24 is a flowchart illustrating an operation example of the document retrieval process of the server 5.
  • steps S61 and S62 are added between steps S51 and S33 shown in FIG. 21, and step S53 is replaced with step S63.
  • the keyword input unit 525 accepts the designation of the keyword in step S61.
  • in step S62, the document cluster identification unit 526 identifies a partial document cluster of the query document 11a that includes the keywords accepted by the keyword input unit 525 a number of times equal to or greater than the first threshold value (predetermined number of times).
  • in step S63, the document similarity calculation unit 522 calculates the text similarity between the identified partial document cluster of the query document 11a and all the partial document clusters of the selected document. Then, the document similarity calculation unit 522 calculates the document similarity Sim(X, Y) based on the calculated text similarity and compound similarity.
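The keyword-based identification of step S62 can be sketched as follows. This is a minimal illustration under stated assumptions: each partial document cluster is represented by its concatenated text, matching is a case-insensitive substring count, and the counts of all keywords are summed before being compared with the first threshold. The text does not say whether the threshold applies per keyword or in total, so that choice, the function name, and the sample texts are all hypothetical.

```python
def identify_clusters(cluster_texts, keywords, first_threshold):
    """Return indices of clusters whose total keyword count meets the threshold."""
    selected = []
    for index, text in enumerate(cluster_texts):
        lowered = text.lower()
        # Assumption: occurrences of all keywords are summed per cluster.
        count = sum(lowered.count(keyword.lower()) for keyword in keywords)
        if count >= first_threshold:
            selected.append(index)
    return selected


# Hypothetical partial document clusters of a query document.
cluster_texts = [
    "The positive electrode active material contains LiCoO2. "
    "The positive electrode is coated on a current collector.",
    "The negative electrode active material is graphite.",
    "The binder for the positive electrode is PVDF.",
]

selected = identify_clusters(cluster_texts, ["positive electrode"], first_threshold=2)
print(selected)
```

Only the vectors of the selected clusters would then be compared with the partial document clusters of each comparison target document, which is what reduces the number of comparisons.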
  • FIG. 25 is a block diagram showing a functional configuration example of the server 6 in the similarity determination system 1D according to the second modification of the third embodiment.
  • the similarity determination system 1D calculates the text similarity by comparing the partial document clusters whose similarity with the text of a predetermined part of the document is equal to or higher than the second threshold value. Further, the similarity determination system 1D calculates the compound similarity by comparing the compound clusters whose degree of agreement with the text of the predetermined portion included in the partial document cluster is equal to or more than the third threshold value.
  • in some cases, the accuracy of determining the similarity of partially similar documents can be further improved by determining the document similarity based on the description content of the predetermined portion.
  • the similarity determination system 1D specifies a predetermined part of the text such as "(patent) claims" from the document according to the type of the input document. Further, the similarity determination system 1D accumulates only the partial document cluster and the compound cluster related to the text among the clusters calculated from the document. Then, in the similarity determination process, the similarity determination system 1D determines the similarity based on the partial document cluster and the compound cluster related to the text of the predetermined portion according to the type of the designated query document 11a.
  • the server 6 may optionally include a document DB unit 61 and a document search unit 42.
  • the document DB unit 61 and the document search unit 42 are examples of control units.
  • the document search unit 42 is the same as the document search unit 42 shown in FIG.
  • the document DB unit 61 may include a predetermined document cluster vector storage unit 615 and a predetermined compound cluster storage unit 616 instead of the document cluster vector storage unit 315 and the compound cluster storage unit 416 of the document DB unit 41 shown in FIG. Further, the document DB unit 61 may include a predetermined document structure analysis unit 617.
  • the predetermined document cluster vector storage unit 615 stores the information of the partial document cluster specified by the predetermined document structure analysis unit 617, which will be described later, among the partial document clusters calculated by the document cluster vector calculation unit 314, in association with the document.
  • the predetermined compound cluster storage unit 616 stores the information of the compound cluster specified by the predetermined document structure analysis unit 617, which will be described later, among the compound clusters calculated by the clustering unit 313, in association with the document.
  • the predetermined document structure analysis unit 617 specifies the text of the predetermined part from the document according to the type of the input document.
  • the "predetermined portion" may be preset according to the type of document, for example, a predetermined document type in which a document structure is defined, such as a patent document, a paper, and various materials.
  • the predetermined document structure analysis unit 617 identifies, among the partial document clusters calculated from the input document, the partial document cluster whose similarity with the specified text is equal to or higher than the second threshold, and stores its document vector in the predetermined document cluster vector storage unit 615.
  • for example, the predetermined document structure analysis unit 617 may treat the specified text as a partial document (partial document cluster) and compare the text similarity between that partial document (partial document cluster) and each of the other partial document clusters in the document with the second threshold.
  • the predetermined document structure analysis unit 617 identifies, among the compound clusters calculated from the input document, the compound cluster whose degree of agreement with the compound names included in the specified partial document cluster is equal to or higher than the third threshold value, and stores it in the predetermined compound cluster storage unit 616. For example, the predetermined document structure analysis unit 617 may treat the compound names included in the specified partial document cluster as a compound list for one cluster, and compare the similarity between that compound list and the compound list of each of the other clusters in the document with the third threshold.
  • the predetermined document structure analysis unit 617 may store the calculated document vectors of the partial document clusters and the calculated compound clusters in the predetermined document cluster vector storage unit 615 and the predetermined compound cluster storage unit 616, respectively.
  • the document similarity calculation unit 422 may perform the same operation as in the similarity determination system 1B shown in FIG. 19, but when the type of the document related to the query 11 is a predetermined document type for which a "predetermined portion" is set, the information of each cluster used to calculate the document similarity is limited. That is, the document similarity calculation unit 422 determines the document similarity based on the partial document cluster and the compound cluster related to the text of the predetermined portion according to the type of the input query document 11a.
  • with the server 6 according to the second modification, the same effect as that of the first and second embodiments can be obtained. Further, by setting a "predetermined portion" for each document type in advance, important (high priority) partial document clusters and compound clusters according to the document type can be easily identified. Therefore, the accuracy of determining the similarity between documents can be further improved by determining the similarity based on the important partial document cluster and compound cluster. Further, since the number of partial document clusters and compound clusters used for determining the similarity can be limited, the processing time of the document retrieval process can be shortened.
  • FIG. 26 is a flowchart illustrating an operation example of the document DB construction process of the server 6.
  • FIG. 26 shows that step S28 shown in FIG. 14 is replaced with steps S71 to S75.
  • the predetermined document structure analysis unit 617 specifies the text of the predetermined portion in the document in step S71.
  • in step S72, the predetermined document structure analysis unit 617 identifies, among the partial document clusters, a partial document cluster whose similarity with the text of the predetermined portion is equal to or higher than the threshold value.
  • the predetermined document cluster vector storage unit 615 registers the document vector of the specified partial document cluster in step S73.
  • in step S74, the predetermined document structure analysis unit 617 identifies a compound cluster whose degree of agreement with the compound names included in the specified partial document cluster is equal to or greater than the threshold value.
  • the predetermined compound cluster storage unit 616 registers the specified compound cluster in step S75.
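Steps S72 and S74 can be sketched as two threshold filters. This is a minimal illustration under stated assumptions: cosine similarity stands in for the text similarity, and the "degree of agreement" is taken to be the fraction of a compound cluster's names that also appear in the specified partial document cluster. Neither measure is specified in this passage, and all function names, vectors, thresholds, and compound names below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def select_partial_document_clusters(part_vector, cluster_vectors, second_threshold):
    """Step S72: keep clusters similar enough to the text of the predetermined portion."""
    return [i for i, vec in enumerate(cluster_vectors)
            if cosine(part_vector, vec) >= second_threshold]


def agreement(compound_cluster, names):
    """Assumed measure: share of the cluster's compounds that appear in `names`."""
    cluster = set(compound_cluster)
    if not cluster:
        return 0.0
    return len(cluster & set(names)) / len(cluster)


def select_compound_clusters(compound_clusters, names, third_threshold):
    """Step S74: keep compound clusters that agree with the specified cluster's compounds."""
    return [i for i, cluster in enumerate(compound_clusters)
            if agreement(cluster, names) >= third_threshold]


# Hypothetical vector of the predetermined portion (for example, the claims).
claims_vector = [1.0, 0.0]
cluster_vectors = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
kept_documents = select_partial_document_clusters(claims_vector, cluster_vectors, 0.8)

# Hypothetical compound names found in the kept partial document cluster.
names_in_kept = ["LiCoO2", "LiMn2O4", "PVDF"]
compound_clusters = [["LiCoO2", "LiMn2O4"], ["graphite"], ["PVDF", "SBR"]]
kept_compounds = select_compound_clusters(compound_clusters, names_in_kept, 0.6)
print(kept_documents, kept_compounds)
```

Only the clusters that pass the filters would be registered in steps S73 and S75, so the later retrieval process compares fewer clusters per document.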
  • FIG. 27 is a flowchart illustrating an operation example of the document retrieval process of the server 6.
  • steps S32 and S51 shown in FIG. 21 are replaced with S81 and S82, and S34, S52 and S53 are replaced with S83, S84 and S85.
  • in step S81, the document similarity calculation unit 422 acquires the document vector of the partial document cluster of the query document 11a from the predetermined document cluster vector storage unit 615.
  • the document similarity calculation unit 422 acquires the document vector of the predetermined partial document cluster.
  • in step S82, the document similarity calculation unit 422 acquires the compound cluster of the query document 11a, that is, the predetermined compound cluster when the query document 11a is of the predetermined document type, from the predetermined compound cluster storage unit 616.
  • in step S83, the document similarity calculation unit 422 acquires the document vector of the partial document cluster of the selected document from the predetermined document cluster vector storage unit 615.
  • the document similarity calculation unit 422 acquires the document vector of the predetermined partial document cluster.
  • in step S84, the document similarity calculation unit 422 acquires the compound cluster of the selected document, that is, the predetermined compound cluster when the selected document is of the predetermined document type, from the predetermined compound cluster storage unit 616.
  • the document similarity calculation unit 422 calculates the document similarity based on the acquired document vector of the predetermined partial document cluster and the predetermined compound cluster in step S85.
  • the compound name is used as a named entity
  • the present invention is not limited to this.
  • as the named entity, various terms that can be the target of named entity extraction in natural language processing, such as a gene sequence (genome), may be used.
  • each of the servers 2 to 6 shown in FIGS. 6, 12, 19, 22, and 25 may be merged or divided in any combination.
  • the first to third embodiments and the first and second modifications of the third embodiment may be combined as appropriate.
  • each of the servers 2 to 6 may generate screen information of any of the screens of FIGS. 7, 13, and 23, and may have a functional configuration according to the screen.
  • the functions of the servers 5 and 6 according to the first and second modifications of the third embodiment shown in FIGS. 22 and 25 may be implemented in combination with each other. Further, the function may be applied to the document similarity determination process based on the text similarity in the server 2 or 3 according to the first or second embodiment shown in FIG. 6 or FIG.
  • each of the servers 2 to 6 shown in FIGS. 6, 12, 19, 22, and 25 may have a configuration in which a plurality of devices cooperate with each other via a network to realize each processing function.
  • the memory unit 21 may be a DB server; the document DB units 31, 41, and 61 may be a combination of an application server and a DB server; and the document input unit 22, the similarity calculation unit 23, the similarity output unit 24, and the document search units 32, 42, and 52 may be a combination of an application server and a Web server, and the like.
  • the computer, the application server, and the DB server may cooperate with each other via the network to realize each processing function as the servers 2 to 6.
  • each of the servers 3 to 6 may be provided with the HW configuration of the computer 10 illustrated in FIG.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This similarity determination program causes a computer to execute a process on a plurality of first partial-document groups (PX1-PXN). These groups are obtained by dividing a first document (11a) into a plurality of first partial documents (PX) and categorizing the first partial documents on the basis of a plurality of first groups (CX1-CXN), which are in turn obtained by categorizing a plurality of first named entities included in the first partial documents (PX). The process includes: calculating a plurality of first vectors corresponding to the first partial-document groups (PX1-PXN), on the basis of words included in the respective first partial-document groups (PX1-PXN); acquiring a plurality of second vectors respectively corresponding to a plurality of second partial-document groups (PY1-PYM) obtained from a second document (12a); and determining the similarity (Sim) between the first document (11a) and the second document (12a) on the basis of a comparison between the plurality of first and second vectors.

Description

Similarity determination program, similarity determination device, and similarity determination method
 The present invention relates to a similarity determination program, a similarity determination device, and a similarity determination method.
 A known technique determines the similarity between documents by dividing a plurality of documents into words, calculating vectors expressing the meanings of the words and a weight for each word, and calculating a document vector for each document based on the vectors and the weights.
Japanese Unexamined Patent Publication No. 2006-331245
 Because a document contains various kinds of information, when an entire document is converted into a single document vector to determine the similarity between documents, the similarity may be judged to be low even between partially similar documents.
 In one aspect, one of the objects of the present invention is to improve the accuracy of determining the similarity between partially similar documents.
 In one aspect, the similarity determination program may cause a computer to execute the following process. The process may include, for a first plurality of partial document groups obtained by classifying a first plurality of partial documents, which are obtained by dividing a first document, based on a first plurality of groups obtained by classifying a first plurality of named entities included in the first plurality of partial documents, calculating a first plurality of vectors corresponding to the respective first plurality of partial document groups based on the words included in each of the first plurality of partial document groups. The process may also include acquiring a second plurality of vectors corresponding to the respective second plurality of partial document groups obtained by classifying a second plurality of partial documents obtained by dividing a second document. Further, the process may include determining the similarity between the first document and the second document based on a comparison between the first plurality of vectors and the second plurality of vectors.
 In one aspect, the present invention can improve the accuracy of determining the similarity between partially similar documents.
FIG. 1 is a diagram for explaining a similarity determination system according to a comparative example.
FIG. 2 is a diagram explaining an example of similarity determination by the similarity determination system shown in FIG. 1.
FIG. 3 is a diagram for explaining a similarity determination system according to the first embodiment.
FIG. 4 is a diagram for explaining an example of the processing of the similarity determination system.
FIG. 5 is a diagram for explaining an example of the processing of the similarity determination system.
FIG. 6 is a block diagram showing a functional configuration example of the server in the similarity determination system according to the first embodiment.
FIG. 7 is a diagram showing a screen output example by the server.
FIG. 8 is a flowchart explaining an operation example of the server.
FIG. 9 is a block diagram showing a hardware (HW) configuration example of a computer that realizes the functions of the server.
FIG. 10 is a diagram for explaining a similarity determination system according to the second embodiment.
FIG. 11 is a diagram for explaining an example of the processing of the similarity determination system.
FIG. 12 is a block diagram showing a functional configuration example of the server in the similarity determination system according to the second embodiment.
FIG. 13 is a diagram showing a screen output example by the server.
FIG. 14 is a flowchart explaining an operation example of the document DB (database) construction process of the server.
FIG. 15 is a flowchart explaining an operation example of the document search process of the server.
FIG. 16 is a diagram for explaining a similarity determination system according to the third embodiment.
FIG. 17 is a diagram for explaining an example of the processing of the similarity determination system.
FIG. 18 is a diagram for explaining an example of the processing of the similarity determination system.
FIG. 19 is a block diagram showing a functional configuration example of the server in the similarity determination system according to the third embodiment.
FIG. 20 is a flowchart explaining an operation example of the document DB construction process of the server.
FIG. 21 is a flowchart explaining an operation example of the document search process of the server.
FIG. 22 is a block diagram showing a functional configuration example of the server in the similarity determination system according to the first modification of the third embodiment.
FIG. 23 is a diagram showing a screen output example by the server.
FIG. 24 is a flowchart explaining an operation example of the document search process of the server.
FIG. 25 is a block diagram showing a functional configuration example of the server in the similarity determination system according to the second modification of the third embodiment.
FIG. 26 is a flowchart explaining an operation example of the document DB construction process of the server.
FIG. 27 is a flowchart explaining an operation example of the document search process of the server.
 Hereinafter, embodiments of the present invention will be described with reference to the drawings. However, the embodiments described below are merely examples, and there is no intention to exclude various modifications or applications of techniques not explicitly described below. For example, the present embodiments can be implemented with various modifications without departing from their spirit. In the drawings used in the following description, parts with the same reference numerals represent the same or similar parts unless otherwise specified.
 [1] First Embodiment
 [1-1] Comparative Example
 As described above, when an entire document is converted into a document vector to determine the similarity between documents, the similarity may be judged to be low even between partially similar documents.
 FIG. 1 is a diagram for explaining a similarity determination system 100 according to a comparative example. As shown in FIG. 1, the similarity determination system 100 calculates a similarity based on word meaning vectors, from a query 101 requesting determination of the similarity of a query document (input document) and a document set 102 including one or more comparison target documents.
 For example, the similarity determination system 100 extracts words, for example by morphological analysis, from each of a plurality of documents, that is, the query document included in the query 101 and the comparison target documents included in the document set 102 (process P110).
 Based on the words obtained in process P110, the similarity determination system 100 statistically calculates word weights for each of the plurality of documents (process P120). For example, the similarity determination system 100 may evaluate the importance of a word within a document as its weight by using an evaluation method such as tf-idf (Term Frequency - Inverse Document Frequency).
 The similarity determination system 100 also executes process P130, at least part of which may run in parallel with, or before or after, process P120. For example, the similarity determination system 100 calculates a word vector for each of the plurality of documents based on the words obtained in process P110 (process P130). The word vector may be referred to as a word embedding vector or a meaning vector.
 For example, the similarity determination system 100 may acquire a word vector by searching a vector database that stores vectors expressing the meanings of words.
 For each document, the similarity determination system 100 calculates a document vector by multiplying the word vector acquired in process P130 by the word weight acquired in process P120 and summing the results over all the words in the document. The similarity determination system 100 then calculates the text similarity between the query document and each comparison target document by calculating the similarity between the document vector of the query document and the document vector of each comparison target document (process P140).
 The similarity determination system 100 performs ranking processing based on the calculated text similarity (process P150), and stores the comparison target documents having a high similarity to the query document, together with their similarities, as a ranking result 103.
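Processes P110 to P150 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the comparative system itself: whitespace splitting stands in for morphological analysis, a small hand-written table stands in for the vector database, the tf-idf variant (length-normalized term frequency with a smoothed idf) is one common choice among several, and cosine similarity is assumed as the vector similarity. All names and sample data are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """P120: return one {word: weight} dict per tokenized document."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            word: (count / len(doc)) * (math.log((1 + n) / (1 + df[word])) + 1.0)
            for word, count in tf.items()
        })
    return weights


def document_vector(weights, word_vectors, dim):
    """P140 (first half): weighted sum of word vectors over all words in a document."""
    vec = [0.0] * dim
    for word, weight in weights.items():
        for k, value in enumerate(word_vectors.get(word, [0.0] * dim)):
            vec[k] += weight * value
    return vec


def cosine(u, v):
    """P140 (second half): cosine similarity between two document vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# P130: hypothetical 2-D word vectors standing in for a vector database lookup.
word_vectors = {
    "cathode": [1.0, 0.0], "LiCoO2": [0.9, 0.1],
    "anode": [0.0, 1.0], "graphite": [0.1, 0.9],
}

# P110: whitespace tokenization as a stand-in for morphological analysis.
query = "cathode LiCoO2".split()
candidates = {
    "doc_a": "cathode LiCoO2 LiCoO2".split(),
    "doc_b": "anode graphite".split(),
}

weights = tf_idf([query] + list(candidates.values()))
vectors = [document_vector(w, word_vectors, 2) for w in weights]

# P150: rank the comparison target documents by similarity to the query.
ranking = sorted(
    ((name, cosine(vectors[0], vec)) for name, vec in zip(candidates, vectors[1:])),
    key=lambda item: item[1],
    reverse=True,
)
print(ranking)
```

In this toy data, the document that shares the cathode vocabulary with the query ranks first. Note that each document is reduced to a single vector, which is the property the following paragraphs identify as the weakness of the comparative example.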
 図2は、図1に示す類似度判定システム100による類似度の判定例を説明する図である。図2の例では、リチウムイオン電池に関するクエリ文書101a及び比較対象文書102aについて、類似度が判定される場合を示す。 FIG. 2 is a diagram illustrating an example of determination of similarity by the similarity determination system 100 shown in FIG. The example of FIG. 2 shows a case where the similarity is determined for the query document 101a and the comparison target document 102a relating to the lithium ion battery.
Incidentally, some "documents" contain descriptions of a plurality of elements; examples include patent documents and papers that describe a device, system, or manufacturing method having a plurality of components. For example, in the documents relating to a lithium ion battery shown in FIG. 2, compound names belonging to the respective categories (groups) of battery components, such as "positive electrode active material", "negative electrode active material", "binder", "electrolyte", and "electrolyte solvent", may appear mixed together.
Therefore, even when it is desired to determine the similarity to the comparison target document 102a by focusing on a given element described in the query document 101a, differences from the comparison target document in the other elements, in other words the elements that are not under investigation, may affect the result of the similarity determination between the documents.
For example, as shown on the left side of FIG. 2, assume that the underlined, bold paragraphs in both the query document 101a and the comparison target document 102a describe the "positive electrode active material", and that the contents of the two paragraphs are similar.
The right side of FIG. 2 shows a two-dimensional semantic vector space onto which the range 101b of the description content of the query document 101a and the range 102b of the description content of the comparison target document 102a are mapped.
When the element under investigation is the "positive electrode active material", a comparison of the ranges 101b and 102b shows that the paragraphs on the "positive electrode active material" are similar between the documents, whereas the descriptions of the other elements differ. In this case, the similarity between the document vector 101c of the query document 101a and the document vector 102c of the comparison target document 102a is calculated as a low value.
Thus, when an entire document is represented by a single vector, the similarity of the documents as a whole may be calculated as a low value even if the components described in part of the documents are similar, and the accuracy of determining the similarity between partially similar documents may decrease.
Although FIG. 2 shows the semantic vector space in two dimensions for convenience, the actual vectors may have several hundred dimensions. As the number of dimensions of the semantic vector space increases, it becomes more likely that the similarity is determined to be low when the descriptions of other elements differ between the documents, even if the element under investigation is similar between them.
[1-2] Description of the First Embodiment
Therefore, the similarity determination system 1 according to the first embodiment acquires a plurality of vectors corresponding to the respective partial documents obtained by dividing a document, and determines the document similarity based on a comparison of the plurality of vectors between documents.
For example, the similarity determination system 1 may acquire a plurality of partial document groups by classifying a plurality of partial documents based on a plurality of groups. The similarity determination system 1 may also compare the similarities between the partial document clusters of the two documents to be judged, and use the similarity of the most similar pair of partial document clusters as the similarity of the documents.
FIG. 3 is a diagram for explaining the similarity determination system 1 according to the first embodiment, and FIGS. 4 and 5 are diagrams for explaining an example of processing by the similarity determination system 1.
As shown in FIG. 3, the similarity determination system 1 according to the first embodiment calculates a similarity based on the semantic vectors of words, using a query 11 that requests determination of the similarity of a query document (input document) and a document set (document group) 12 that includes one or more comparison target documents to be judged.
In the example shown in FIGS. 4 and 5, the similarity determination system 1 determines the similarity between the query document 11a specified by the query 11 and a comparison target document 12a in the document set 12. The query document 11a is an example of a first document, and the comparison target document 12a is an example of a second document.
For example, as in the comparative example, the similarity determination system 1 extracts words from each of the plurality of documents, for example by morphological analysis (process P1).
Based on the words obtained in process P1, the similarity determination system 1 statistically calculates word weights for each of the plurality of documents (process P2). For example, the similarity determination system 1 may use an evaluation method such as tf-idf to evaluate the importance of each word in a document as its weight.
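The weighting of process P2 could, for instance, be sketched as below. The specification names tf-idf only as one example and does not fix a formula, so the variant used here (term frequency normalized by document length, inverse document frequency as log(n / df)) and the function name `tf_idf` are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n = len(docs)
    # Document frequency: number of documents in which each word occurs.
    df = Counter(w for doc in docs for w in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        result.append(
            {w: (c / len(doc)) * math.log(n / df[w]) for w, c in tf.items()}
        )
    return result

# Hypothetical tokenized documents (the output of process P1).
docs = [["cathode", "oxide"], ["cathode", "binder"]]
word_weights = tf_idf(docs)
```

With this variant, a word that occurs in every document (here "cathode") receives a weight of zero, while words specific to one document receive a positive weight.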
The similarity determination system 1 also executes process P3, at least part of which may run in parallel with, before, or after process P2. For example, based on the words obtained in process P1, the similarity determination system 1 calculates word vectors for each of the plurality of documents (process P3). A word vector may also be referred to as a word embedding vector or a semantic vector.
For example, the similarity determination system 1 may acquire word vectors by searching a vector database that stores vectors expressing the meanings of words. As one example, the similarity determination system 1 may acquire the word vector corresponding to each of the words obtained in process P1 based on a trained model.
The similarity determination system 1 divides each of the plurality of documents into a plurality of partial documents (for example, paragraphs), clusters the partial documents based on the named entities included in each partial document (process P4), and thereby generates partial document clusters. The similarity determination system 1 also calculates a partial document vector for each partial document cluster.
The similarity determination system 1 calculates the text similarity between partial document clusters based on the plurality of partial document vectors of the query document 11a and the plurality of partial document vectors of each comparison target document 12a (process P5).
Then, based on the text similarity, the similarity determination system 1 performs ranking processing that ranks each of the plurality of comparison target documents 12a according to its similarity to the query document 11a (process P6), and outputs a result 13. The result 13 may include the ranking result.
Examples of the partial document clustering process (process P4), the text similarity calculation process (process P5), and the ranking process (process P6) are described below.
(Partial document clustering process)
In process P4, the similarity determination system 1 acquires a plurality of partial documents (partial texts) by dividing each document. Examples of a partial document, in other words a unit of document division, include a sentence, a paragraph, a chapter, and a section. In the following, a partial document is assumed to be a paragraph.
In the examples of FIGS. 4 and 5, the similarity determination system 1 divides the query document 11a (document X) to acquire a plurality of paragraphs P_X, which are an example of a first plurality of partial documents, and divides the comparison target document 12a (document Y) to acquire a plurality of paragraphs P_Y, which are an example of a second plurality of partial documents. Hereinafter, when the paragraphs P_X and P_Y are not distinguished from each other, they are simply referred to as "paragraphs P".
The similarity determination system 1 acquires a plurality of partial document clusters P_X1 to P_XN and P_Y1 to P_YM by clustering the plurality of paragraphs P based on named entity lists that represent named entity clusters, for example the compound lists C_X1 to C_XN and C_Y1 to C_YM shown in FIG. 4. In the first embodiment, the named entities are assumed to be compound names, and the documents are assumed to be documents in the field of chemistry that contain compound names. Various existing methods, such as the shortest distance method, may be used as the clustering method.
The compound lists C_X1 to C_XN are lists of named entity clusters obtained by classifying a first plurality of named entities, for example a plurality of compound names, contained in document X, and are an example of a first plurality of groups. For example, the compound lists C_X1 to C_XN may be acquired by classifying, in other words clustering, the first plurality of compound names based on the position of each of the first plurality of compounds and the similarity of each of the first plurality of compounds, and may be referred to as a first cluster group. N is an integer of 1 or more and indicates the number of groups, in other words the number of clusters, contained in document X.
The compound lists C_Y1 to C_YM are lists of named entity clusters obtained by classifying a second plurality of named entities, for example a plurality of compound names, contained in document Y, and are an example of a second plurality of groups. For example, the compound lists C_Y1 to C_YM may be acquired by classifying, in other words clustering, the second plurality of compound names based on the position of each of the second plurality of compounds and the similarity of each of the second plurality of compounds, and may be referred to as a second cluster group. M is an integer of 1 or more and indicates the number of groups, in other words the number of clusters, contained in document Y.
For example, the similarity determination system 1 may cluster the paragraphs P by a clustering process that uses the degree of agreement between the named entities included in a named entity cluster and the named entities included in each of the paragraphs P.
In the example of FIG. 4, for document X, the similarity determination system 1 generates the partial document clusters P_X1 to P_XN according to the following equation (1), based on the degree of agreement between each of the per-cluster compound lists C_X1 to C_XN and each of the plurality of paragraphs P_X. Likewise, for document Y, the similarity determination system 1 generates the partial document clusters P_Y1 to P_YM according to the following equation (2), based on the degree of agreement between each of the per-cluster compound lists C_Y1 to C_YM and each of the plurality of paragraphs P_Y.
$$\operatorname*{argmax}_{a}\bigl(\cos(C_{PX}, C_{Xa})\bigr) \tag{1}$$
$$\operatorname*{argmax}_{b}\bigl(\cos(C_{PY}, C_{Yb})\bigr) \tag{2}$$
In the above equations (1) and (2), C_PX is the list of compounds contained in a paragraph P_X, a is an integer from 1 to N, and C_Xa is one of the per-cluster compound lists C_X1 to C_XN. C_PY is the list of compounds contained in a paragraph P_Y, b is an integer from 1 to M, and C_Yb is one of the per-cluster compound lists C_Y1 to C_YM. cos is a function that calculates the cosine similarity between the two elements in parentheses. argmax is a function that extracts the condition (here, the cluster) under which the element in parentheses is maximized. According to equations (1) and (2), each paragraph P can be assigned to the element (cluster of compounds) for which the cosine similarity between the compound names contained in the paragraph P and the compound names in the compound list is largest, for example the element with the largest number of occurrences.
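The assignment described by equations (1) and (2) might be sketched as follows. The specification does not state how a compound list is turned into a vector for the cosine calculation, so this sketch assumes a simple bag-of-names count vector; the names `cosine_counts` and `assign_paragraphs`, the sample compound names, and the choice to leave a paragraph without any matching compound unassigned (`None`) are likewise assumptions.

```python
import math
from collections import Counter

def cosine_counts(a, b):
    """Cosine similarity between two bag-of-names Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def assign_paragraphs(paragraph_compounds, compound_lists):
    """For each paragraph, return the index of the best-matching compound list."""
    lists = [Counter(c) for c in compound_lists]
    assignment = []
    for names in paragraph_compounds:
        bag = Counter(names)
        scores = [cosine_counts(bag, c) for c in lists]
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        # A paragraph sharing no compound with any list stays unassigned.
        assignment.append(best if scores[best] > 0 else None)
    return assignment

# Hypothetical per-cluster compound lists and per-paragraph compound names.
compound_lists = [["graphite", "silicon"], ["LiCoO2", "LiFePO4"]]
paragraphs = [["LiCoO2", "LiCoO2", "graphite"], ["silicon"], []]
assignment = assign_paragraphs(paragraphs, compound_lists)  # [1, 0, None]
```

Paragraphs that receive the same index would then form one partial document cluster.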
In the example shown in FIG. 5, it is assumed that N = M = 4 and that the compound lists C_X1 to C_X4 and C_Y1 to C_Y4 for the following four clusters are used. The compound lists C_X1 to C_XN and C_Y1 to C_YM are assumed to be stored in the similarity determination system 1 in advance.
- Compound lists C_X1 and C_Y1:
  A cluster having the element (characteristic) "negative electrode active material".
- Compound lists C_X2 and C_Y2:
  A cluster having the element (characteristic) "positive electrode active material".
- Compound lists C_X3 and C_Y3:
  A cluster having the element (characteristic) "binder".
- Compound lists C_X4 and C_Y4:
  A cluster having the element (characteristic) "electrolyte solvent".
In the example of FIG. 5, the similarity determination system 1 classifies the paragraphs P_X and P_Y into four clusters each (N = M = 4) and generates the partial document clusters P_X1 to P_X4 and P_Y1 to P_Y4. As a result of this clustering, the paragraphs P_X and P_Y can be classified into partial document clusters for the following four elements (characteristics).
- Partial document clusters P_X1 and P_Y1:
  Paragraphs describing the "negative electrode active material".
- Partial document clusters P_X2 and P_Y2:
  Paragraphs describing the "positive electrode active material".
- Partial document clusters P_X3 and P_Y3:
  Paragraphs describing the "binder".
- Partial document clusters P_X4 and P_Y4:
  Paragraphs describing the "electrolyte solvent".
In the examples of FIGS. 4 and 5, the numbers N and M of partial document clusters of documents X and Y match the numbers N and M of compound lists; however, this is not a limitation, and the numbers need not match. For example, the number of partial document clusters may be smaller than N or M.
Then, the similarity determination system 1 calculates a plurality of partial document vectors corresponding to the respective partial document clusters, based on the words included in each partial document cluster. For example, for each partial document cluster, the similarity determination system 1 may calculate the partial document vector by multiplying each word vector acquired in process P3 by the weight of the word acquired in process P2 and summing the products over all the words in the partial document cluster.
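A minimal sketch of the per-cluster weighted sum is given below. It assumes that each paragraph has already been tokenized and assigned a cluster index (or `None` when unassigned); the function name `cluster_vectors` and all sample data are hypothetical.

```python
from collections import defaultdict

def cluster_vectors(paragraph_tokens, assignment, word_vectors, weights, dim):
    """Accumulate weight * word vector over all words of each cluster."""
    vectors = defaultdict(lambda: [0.0] * dim)
    for tokens, cluster in zip(paragraph_tokens, assignment):
        if cluster is None:  # unassigned paragraphs are ignored (assumption)
            continue
        vec = vectors[cluster]
        for w in tokens:
            wv = word_vectors.get(w)
            if wv is None:  # words without a vector are skipped (assumption)
                continue
            for i, x in enumerate(wv):
                vec[i] += weights.get(w, 0.0) * x
    return dict(vectors)

# Hypothetical inputs: three paragraphs, two of them in cluster 1.
paragraph_tokens = [["cathode", "oxide"], ["binder"], ["cathode"]]
assignment = [1, 2, 1]
word_vectors = {"cathode": [1.0, 0.0], "oxide": [0.0, 1.0], "binder": [1.0, 1.0]}
weights = {"cathode": 2.0, "oxide": 1.0, "binder": 0.5}

partial_vectors = cluster_vectors(
    paragraph_tokens, assignment, word_vectors, weights, dim=2
)
# partial_vectors == {1: [4.0, 1.0], 2: [0.5, 0.5]}
```

Each document thus yields one vector per partial document cluster rather than a single vector for the whole document.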
(Text similarity calculation process)
In process P5, the similarity determination system 1 calculates the similarity between the partial document vectors of the query document 11a and the partial document vectors of each comparison target document 12a, in other words the text similarity between partial document clusters based on the semantic vectors of words. The partial document vectors of the query document 11a are an example of a first plurality of vectors, and the partial document vectors of the comparison target document 12a are an example of a second plurality of vectors.
For example, the similarity determination system 1 may calculate the text similarity, for example the cosine similarity, between a partial document cluster of the query document 11a and a partial document cluster of the comparison target document 12a by the following equation (3).
$$ft(P_{Xa}, P_{Yb}) = \cos(WP_{Xa}, WP_{Yb}) \tag{3}$$
In the above equation (3), WP_Xa is the distributed representation vector of the words contained in the paragraphs P_Xa, and WP_Yb is the distributed representation vector of the words contained in the paragraphs P_Yb.
In the example shown in FIG. 4, the similarity determination system 1 may calculate the text similarity according to the above equation (3) for all pairs of the partial document clusters P_X1, P_X2, P_X3, ..., P_XN and the partial document clusters P_Y1, P_Y2, P_Y3, ..., P_YM.
(Ranking process)
In process P6, based on the text similarity, the similarity determination system 1 performs ranking processing that ranks each of the plurality of comparison target documents 12a according to its similarity to the query document 11a, and outputs the result 13.
For example, in the ranking process, the similarity determination system 1 outputs a ranking of the plurality of comparison target documents 12a according to their similarity to the query document 11a, based on the text similarity.
The similarity determination system 1 may calculate the document similarity Sim(X, Y) between document X and one comparison target document Y according to, for example, the following equation (4).
$$\mathrm{Sim}(X, Y) = \max_{a,\,b}\bigl(ft(P_{Xa}, P_{Yb})\bigr) \tag{4}$$
In the above equation (4), ft is the text similarity according to the above equation (3), and max is a function that adopts the largest value among all the combinations in parentheses.
According to the above equation (4), the similarity determination system 1 may adopt, as the document similarity Sim(X, Y) between documents X and Y, the text similarity of the pair of partial document clusters (one of the combinations of a = 1 to N and b = 1 to M) whose value is largest among the text similarities calculated by the above equation (3).
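Taken together, the all-pairs comparison of equation (3) and the maximum of equation (4) could be sketched as follows. The partial document vectors are assumed to be given as dictionaries keyed by cluster index; `document_similarity` and the sample vectors are hypothetical names and data chosen for this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def document_similarity(query_vectors, target_vectors):
    """Return (Sim, a, b): the largest text similarity and the pair achieving it."""
    best = (0.0, None, None)
    for a, u in query_vectors.items():
        for b, v in target_vectors.items():
            s = cosine(u, v)
            if best[1] is None or s > best[0]:
                best = (s, a, b)
    return best

# Hypothetical partial document vectors of documents X and Y.
query_vectors = {1: [1.0, 0.0], 2: [0.0, 1.0]}
target_vectors = {1: [-1.0, 0.0], 2: [0.0, 2.0]}
sim, a, b = document_similarity(query_vectors, target_vectors)
# sim == 1.0 for the pair a == 2, b == 2
```

Returning the indices of the best pair in addition to the similarity is a design choice of this sketch; it makes it possible to report which element (for example, the "positive electrode active material" cluster) produced the match.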
In the example of FIG. 5, the similarity determination system 1 determines that the text similarity is largest for the pair corresponding to the compound lists C_X2 and C_Y2, in other words between the "positive electrode active material" partial document clusters, and adopts that text similarity as the document similarity Sim(X, Y) between documents X and Y.
Equation (4) above shows an example of calculating the document similarity between document X (query document 11a) and one document Y (comparison target document 12a). The similarity determination system 1 may perform the above processing for each of a plurality of comparison target documents 12a, for example documents Y_1 to Y_L (L is an integer of 2 or more and is the number of comparison target documents 12a), and acquire document similarities Sim(X, Y_1) to Sim(X, Y_L) corresponding to the number of documents Y.
Then, the similarity determination system 1 may, for example, sort all the documents Y_1 to Y_L to be searched in descending order of the document similarities Sim(X, Y_1) to Sim(X, Y_L), and output the sorted result as the result 13. The result 13 may include the identification information of each document Y together with its rank, and may also include the document similarity Sim(X, Y) of each document Y. The identification information of a document Y may include at least one of: an identifier such as a document number or document code; bibliographic information such as a document name; and at least part of the content of the document Y, such as an abstract or a predetermined portion.
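The descending sort could look like the sketch below, in which the document identifiers and similarity values are made up for illustration and `rank_documents` is a hypothetical helper name.

```python
def rank_documents(similarities):
    """similarities: {doc_id: Sim(X, Y)}. Returns [(rank, doc_id, sim)], best first."""
    ordered = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, doc_id, sim) for rank, (doc_id, sim) in enumerate(ordered, start=1)]

# Hypothetical document similarities for three comparison target documents.
result = rank_documents({"Y1": 0.2, "Y2": 0.9, "Y3": 0.5})
# result == [(1, "Y2", 0.9), (2, "Y3", 0.5), (3, "Y1", 0.2)]
```

Because Python's `sorted` is stable, documents with equal similarity keep their input order in this sketch; the specification does not say how ties are handled.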
As the result 13, the similarity determination system 1 may instead output the identification information of a document Y determined to have a specific rank, for example the document Y having the highest document similarity Sim(X, Y) to the query document 11a.
As described above, the similarity determination system 1 according to the first embodiment determines the similarity between documents based on the text similarity of each partial document cluster classified by the clustering process, and can thereby improve the accuracy of determining the similarity between partially similar documents.
For example, by comparing partial document vectors between documents X and Y, the similarity determination system 1 can determine that documents X and Y are highly similar because their semantic vectors for the "positive electrode active material" are similar. Although FIG. 5 shows the semantic vector space in two dimensions for convenience, the actual vectors may have several hundred dimensions. According to the first embodiment, comparing partial document clusters with each other can improve the accuracy of determining the similarity between partially similar documents.
In the first embodiment, the similarity determination system 1 has been described as calculating a plurality of partial document vectors for both the query document 11a and the comparison target document 12a, but this is not a limitation.
For example, for one of the documents, such as the plurality of comparison target documents 12a, when the similarity determination system 1 stores the document set 12 in advance, the plurality of partial document vectors of each of the comparison target documents 12a may be calculated and accumulated in advance.
In this case, the similarity determination system 1 may calculate the plurality of partial document vectors for the other document, such as the query document 11a, and may acquire the accumulated plurality of partial document vectors for the comparison target documents 12a. The similarity determination system 1 may then perform the text similarity calculation process and the ranking process described above based on the calculated partial document vectors of the query document 11a and the acquired partial document vectors of the comparison target documents 12a.
The documents for which the plurality of partial document vectors are calculated and accumulated in advance are not limited to the comparison target documents 12a; the query document 11a may be such a document instead of or in addition to the comparison target documents 12a.
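One way to picture the accumulation of precomputed partial document vectors is a small in-memory cache, sketched below. The class name `PartialVectorStore`, its `get` method, and the stand-in `compute` function are all hypothetical; the specification does not describe a particular storage mechanism.

```python
class PartialVectorStore:
    """Caches the partial document vectors of each document by identifier."""

    def __init__(self, compute):
        self._compute = compute  # function: document text -> {cluster: vector}
        self._cache = {}

    def get(self, doc_id, text):
        # Compute the vectors only on first access; reuse them afterwards.
        if doc_id not in self._cache:
            self._cache[doc_id] = self._compute(text)
        return self._cache[doc_id]

calls = []

def compute(text):
    """Stand-in for processes P1 to P4; records each call for demonstration."""
    calls.append(text)
    return {1: [float(len(text)), 0.0]}

store = PartialVectorStore(compute)
first = store.get("Y1", "abc")
second = store.get("Y1", "abc")  # served from the cache, compute is not re-run
```

With such a store, only the vectors of a newly received query document need to be calculated at query time.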
[1-3] Functional Configuration Example
FIG. 6 is a block diagram showing an example of the functional configuration of the server 2 in the similarity determination system 1 according to the first embodiment, and FIG. 7 is a diagram showing an example of screen output by the server 2.
The server 2 is an example of a similarity determination device, an information processing device, or a computer. For example, in the similarity determination system 1, the server 2 may perform various communications, such as receiving the query document 11a and the comparison target documents 12a and transmitting the result 13, with a terminal device, another server, or the like (not shown).
The server 2 may, for example, provide a terminal device with a function for enabling access. Examples of this function include generation and display control of a screen, such as a web page, used for access by the terminal device. For example, the terminal device may transmit an access request to the server 2 using an application such as a browser, and access the server 2 via a web page displayed in the application based on screen information received from the server 2. For example, as shown in FIG. 7, the server 2 may output screen information for a query specification screen 210 for specifying a query and a determination result output screen 240 for outputting the determination result.
The similarity determination process by the similarity determination system 1 described above may be realized by the server 2. As shown in FIG. 6, the server 2 may illustratively include a memory unit 21, a document input unit 22, a similarity calculation unit 23, and a similarity output unit 24. The memory unit 21, the document input unit 22, the similarity calculation unit 23, and the similarity output unit 24 are an example of a control unit.
The memory unit 21 has a storage area for storing various data related to the similarity determination process. The memory unit 21 may store, for example, the query document 11a, the plurality of comparison target documents 12a, and the result 13 shown in FIG. 3, as well as the per-cluster compound lists classified in advance for each document. The memory unit 21 may also store, as intermediate data of the similarity determination process, information such as the paragraphs P, the partial document clusters, the text similarities, and the document similarities Sim for each document shown in FIGS. 4 and 5.
 文書入力部22は、図示しない端末装置又は他のサーバ等のコンピュータから、クエリ文書11a及び比較対象文書12aの入力を受け付け、例えばメモリ部21にDB(Database)として蓄積してもよい。このように、文書入力部22は、文書のDBを構築及び参照可能であってもよい。 The document input unit 22 may receive input of the query document 11a and the comparison target document 12a from a computer such as a terminal device (not shown) or another server, and store the query document 11a and the comparison target document 12a in the memory unit 21, for example, as a DB (Database). In this way, the document input unit 22 may be able to construct and refer to the DB of the document.
 また、文書入力部22は、図示しない端末装置又は他のサーバ等のコンピュータから、類似判定要求に係るクエリ文書11aの入力を受け付け、メモリ部21に格納してよい。クエリ文書11aは、例えばクエリ11に含まれてもよい。 Further, the document input unit 22 may receive the input of the query document 11a related to the similarity determination request from a computer such as a terminal device (not shown) or another server and store it in the memory unit 21. The query document 11a may be included in the query 11, for example.
As the query 11, the document input unit 22 may receive identification information of the query document 11a, such as a document number or a document code, instead of the query document 11a itself. In this case, the document input unit 22 may identify the query document 11a associated with the similarity determination request, for example from the DB in the memory unit 21, based on the identification information.
For example, as shown in FIG. 7, the document input unit 22 may accept the document number entered in the input field 211 when the determination button 212 on the query specification screen 210 is pressed.
The similarity calculation unit 23 calculates the similarity between the query document 11a and the comparison target document 12a. As illustrated in FIG. 6, the similarity calculation unit 23 may include a document division unit 231, a partial document clustering unit 232, and a document similarity calculation unit 233.
The document division unit 231 divides each of the query document 11a and the comparison target document 12a stored in the memory unit 21 to generate partial documents, for example paragraphs P_X and P_Y.
The partial document clustering unit 232 clusters the plurality of paragraphs P_X and the plurality of paragraphs P_Y based on the compound lists C_X1 to C_XN and C_Y1 to C_YM stored in the memory unit 21, and obtains the partial document clusters P_X1 to P_XN and P_Y1 to P_YM. The partial document clustering unit 232 also calculates a partial document vector for each of the partial document clusters P_X1 to P_XN and P_Y1 to P_YM, based on the results of morphological analysis, word weight calculation, and word vector calculation performed on each of the documents X and Y.
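The grouping of paragraphs by compound list can be pictured with a small sketch. This is only an illustration under assumptions: the membership rule (a paragraph joins a cluster when it mentions any compound name from that cluster's list), the function name `cluster_paragraphs`, and the sample data are all hypothetical and are not taken from the specification.

```python
from collections import defaultdict


def cluster_paragraphs(paragraphs, compound_lists):
    """Group paragraph indices by the compound-list cluster they mention.

    paragraphs: list of paragraph strings.
    compound_lists: dict mapping a cluster label to a list of compound names.
    Assumption: a paragraph joins every cluster whose list shares a name with it.
    """
    clusters = defaultdict(list)
    for idx, text in enumerate(paragraphs):
        lowered = text.lower()
        for label, names in compound_lists.items():
            # Case-insensitive substring match stands in for real entity matching.
            if any(name.lower() in lowered for name in names):
                clusters[label].append(idx)
    return dict(clusters)


# Hypothetical sample data for a document X.
paragraphs_x = [
    "The negative electrode uses graphite as the active material.",
    "LiCoO2 is used for the positive electrode.",
    "Graphite may be mixed with silicon oxide.",
]
compound_lists_x = {
    "C_X1": ["graphite", "silicon oxide"],
    "C_X2": ["LiCoO2"],
}
partial_clusters_x = cluster_paragraphs(paragraphs_x, compound_lists_x)
```

Here paragraphs 0 and 2 fall into the cluster for C_X1 and paragraph 1 into the cluster for C_X2. A real system would match compound names on the morphological analysis output rather than on raw substrings.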
The processing of the document division unit 231 and the partial document clustering unit 232 is an example of the processes P1 to P4 in FIG. 3.
The document similarity calculation unit 233 calculates a text similarity for each partial document based on the partial document vector of each partial document cluster, and uses the text similarity of the cluster with the highest similarity within the document as the similarity Sim(X, Y) of that document. When there are multiple (for example, L) comparison target documents 12a, the document similarity calculation unit 233 may calculate the similarities Sim(X, Y_1) to Sim(X, Y_L), one for each comparison target document 12a. The document similarity calculation unit 233 may store the calculated similarity Sim(X, Y) in the memory unit 21.
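As a rough sketch of taking the best-matching cluster as the document similarity, the following assumes cosine similarity as the text similarity and compares every cluster vector of X with every cluster vector of Y. Both choices are assumptions made for illustration; the passage above does not fix the similarity measure or the pairing rule, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_similarity(vectors_x, vectors_y):
    """Sim(X, Y) as the highest text similarity over all cluster pairs."""
    return max(cosine(u, v) for u in vectors_x for v in vectors_y)


# Hypothetical partial document vectors for documents X and Y.
vectors_x = [[1.0, 0.0], [0.0, 1.0]]
vectors_y = [[0.0, 2.0], [1.0, 1.0]]
sim_xy = document_similarity(vectors_x, vectors_y)
```

Taking the maximum rather than an average is what lets two documents that agree on only one constituent element still score as similar.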
The similarity output unit 24 outputs the similarity Sim(X, Y) calculated by the similarity calculation unit 23. When there are multiple (for example, L) comparison target documents 12a, the similarity output unit 24 may output the comparison target documents 12a together with their similarities Sim(X, Y), in descending order of the calculated similarities Sim(X, Y_1) to Sim(X, Y_L).
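The descending-order output can be sketched as a simple sort. The document identifiers and scores below are made-up values, and `rank_documents` is a hypothetical helper name.

```python
def rank_documents(similarities):
    """Return (document_id, similarity) pairs sorted from most to least similar."""
    return sorted(similarities.items(), key=lambda item: item[1], reverse=True)


# Hypothetical similarities Sim(X, Y_1) to Sim(X, Y_3).
scores = {"Y_1": 0.42, "Y_2": 0.91, "Y_3": 0.67}
ranking = rank_documents(scores)
for doc_id, sim in ranking:
    print(f"{doc_id}\t{sim:.2f}")
```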
The processing of the document similarity calculation unit 233 and the similarity output unit 24 is an example of the processes P5 and P6 in FIG. 3. Output by the similarity output unit 24 may include, for example, transmission to a computer such as a terminal device (not shown) and storage in a storage area of the server 2 such as the memory unit 21.
For example, as shown in FIG. 7, the similarity output unit 24 may output a determination result output screen 240. The determination result output screen 240 may include a display area 241 for the query document 11a and display areas 245a to 245c for at least one (three in FIG. 7) comparison target document 12a. The display area 241 may include a display area 242 for bibliographic information, an abstract, and the like, and a full-text reference button 243 for moving to a screen that displays the full text of the query document 11a.
The display areas 245a to 245c may include display areas 246a to 246c for bibliographic information, abstracts, and the like, and full-text reference buttons 247a to 247c. The display areas 245a to 245c may also display one or more paragraphs P_Y or the compound list corresponding to the partial document cluster determined to be similar, and/or the similarity Sim(X, Y).
In this way, the similarity output unit 24 can present to the user information on the documents determined to have the highest similarity as a result of the similarity calculation between the query document 11a and the comparison target documents 12a.
[1-4] Operation Example
FIG. 8 is a flowchart illustrating an operation example of the server 2. As shown in FIG. 8, the server 2 may perform the processing for the query document 11a and the processing for the comparison target documents 12a at different timings.
As illustrated in FIG. 8, the document input unit 22 receives the query document 11a (step S1). The document division unit 231 divides the query document 11a into a plurality of partial documents, for example a plurality of paragraphs P_X (step S2).
The partial document clustering unit 232 clusters the plurality of paragraphs P_X based on the compound lists C_X1 to C_XN and obtains the partial document clusters P_X1 to P_XN (step S3). The partial document clustering unit 232 also calculates a partial document vector for each of the partial document clusters P_X1 to P_XN, based on the weight of each word included in the document X and the semantic vector of each word (step S4).
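One common way to combine word weights and word semantic vectors into a single vector is a weighted average, sketched below. The weighted-average formula itself, the helper name `partial_document_vector`, and the toy two-dimensional vectors are assumptions for illustration; the step above only states that the vector is based on the word weights and the semantic vectors.

```python
def partial_document_vector(words, weights, word_vectors):
    """Weighted average of the semantic vectors of the words in one cluster.

    words: tokens in the partial document cluster.
    weights: dict word -> weight (for example, a TF-IDF-style value).
    word_vectors: dict word -> list of floats, all of the same length.
    """
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in words:
        if word not in word_vectors:
            continue  # skip out-of-vocabulary words
        w = weights.get(word, 0.0)
        for i, component in enumerate(word_vectors[word]):
            total[i] += w * component
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when nothing contributed
    return [value / weight_sum for value in total]


# Hypothetical weights and 2-dimensional word vectors.
word_vectors = {"graphite": [1.0, 0.0], "anode": [0.0, 1.0]}
weights = {"graphite": 3.0, "anode": 1.0}
vec = partial_document_vector(["graphite", "anode"], weights, word_vectors)
```

With these values the heavier word "graphite" dominates, giving the vector [0.75, 0.25].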
The document input unit 22 also receives the comparison target documents 12a (step S5). The document division unit 231 selects an unselected comparison target document 12a (step S6) and divides the selected comparison target document 12a into a plurality of partial documents, for example a plurality of paragraphs P_Y (step S7).
The partial document clustering unit 232 clusters the plurality of paragraphs P_Y based on the compound lists C_Y1 to C_YM and obtains the partial document clusters P_Y1 to P_YM (step S8). The partial document clustering unit 232 also calculates a partial document vector for each of the partial document clusters P_Y1 to P_YM, based on the weight of each word included in the document Y and the semantic vector of each word (step S9).
When the compound lists and the partial document vectors of the comparison target document 12a have been obtained in advance and stored in the similarity determination system 1 (for example, in the memory unit 21), the processing of steps S7 to S9 may be omitted.
The document similarity calculation unit 233 compares the partial document vectors of the query document 11a and the comparison target document 12a, calculates the similarity Sim between the two documents (step S10), and stores it in the memory unit 21 (step S11).
The document similarity calculation unit 233 determines whether any unselected comparison target document 12a remains (step S12). If one remains (YES in step S12), the process returns to step S6.
If the document similarity calculation unit 233 determines that no unselected comparison target document 12a remains (NO in step S12), the similarity output unit 24 outputs the comparison target documents 12a and their similarities Sim(X, Y) in descending order of similarity (step S13). The process then ends.
[1-5] Hardware Configuration Example
The server 2 may be a virtual server (VM; Virtual Machine) or a physical server. The functions of the server 2 may be realized by one computer or by two or more computers. Further, at least some of the functions of the server 2 may be realized using HW (Hardware) resources and NW (Network) resources provided by a cloud environment.
FIG. 9 is a block diagram showing a hardware (HW) configuration example of a computer 10 that realizes the functions of the server 2. When a plurality of computers are used as the HW resources that realize the functions of the server 2, each computer may have the HW configuration illustrated in FIG. 9.
As shown in FIG. 9, the computer 10 may illustratively include, as its HW configuration, a processor 10a, a memory 10b, a storage unit 10c, an IF (Interface) unit 10d, an I/O (Input/Output) unit 10e, and a reading unit 10f.
The processor 10a is an example of an arithmetic processing unit that performs various kinds of control and computation. The processor 10a may be connected to each block in the computer 10 via a bus 10i so that they can communicate with each other. The processor 10a may be a multiprocessor including a plurality of processors, a multi-core processor having a plurality of processor cores, or a configuration having a plurality of multi-core processors.
Examples of the processor 10a include integrated circuits (ICs) such as a CPU, an MPU, a GPU, an APU, a DSP, an ASIC, and an FPGA. A combination of two or more of these integrated circuits may be used as the processor 10a. CPU stands for Central Processing Unit, MPU for Micro Processing Unit, GPU for Graphics Processing Unit, APU for Accelerated Processing Unit, DSP for Digital Signal Processor, ASIC for Application Specific IC, and FPGA for Field-Programmable Gate Array.
The memory 10b is an example of HW that stores information such as various data and programs. Examples of the memory 10b include one or both of a volatile memory such as a DRAM (Dynamic Random Access Memory) and a non-volatile memory such as a PM (Persistent Memory).
The storage unit 10c is an example of HW that stores information such as various data and programs. Examples of the storage unit 10c include various storage devices such as a magnetic disk device such as an HDD (Hard Disk Drive), a semiconductor drive device such as an SSD (Solid State Drive), and a non-volatile memory. Examples of the non-volatile memory include a flash memory, an SCM (Storage Class Memory), and a ROM (Read Only Memory).
The storage unit 10c may also store a program 10g (similarity determination program) that realizes all or some of the various functions of the computer 10. For example, the processor 10a of the server 2 can realize the functions of the server 2 illustrated in FIG. 6 by loading the program 10g stored in the storage unit 10c into the memory 10b and executing it.
The memory unit 21 shown in FIG. 6 may be realized by a storage area of one or both of the memory 10b and the storage unit 10c.
The IF unit 10d is an example of a communication IF that controls connections and communication with a network. For example, the IF unit 10d may include an adapter compliant with a LAN (Local Area Network) standard such as Ethernet (registered trademark) or with optical communication such as FC (Fibre Channel). The adapter may support one or both of wireless and wired communication. For example, the server 2 may be connected to the terminal device and to other servers via the IF unit 10d so that they can communicate with each other. The program 10g may also be downloaded from the network to the computer 10 via the communication IF and stored in the storage unit 10c.
The I/O unit 10e may include one or both of an input device and an output device. Examples of the input device include a keyboard, a mouse, and a touch panel. Examples of the output device include a monitor, a projector, and a printer.
The reading unit 10f is an example of a reader that reads data and program information recorded on a recording medium 10h. The reading unit 10f may include a connection terminal or a device to which the recording medium 10h can be connected or into which it can be inserted. Examples of the reading unit 10f include an adapter compliant with USB (Universal Serial Bus) or the like, a drive device that accesses a recording disk, and a card reader that accesses a flash memory such as an SD card. The program 10g may be stored on the recording medium 10h, and the reading unit 10f may read the program 10g from the recording medium 10h and store it in the storage unit 10c.
Examples of the recording medium 10h include non-transitory computer-readable recording media such as magnetic/optical disks and flash memories. Examples of the magnetic/optical disk include a flexible disk, a CD (Compact Disc), a DVD (Digital Versatile Disc), a Blu-ray disc, and an HVD (Holographic Versatile Disc). Examples of the flash memory include semiconductor memories such as a USB memory and an SD card.
The HW configuration of the computer 10 described above is an example. Accordingly, HW in the computer 10 may be added or removed (for example, any block may be added or deleted), divided, or integrated in any combination, and buses may be added or deleted, as appropriate. For example, in the server 2, at least one of the I/O unit 10e and the reading unit 10f may be omitted.
[2] Second Embodiment
[2-1] Description of the Second Embodiment
Next, a second embodiment will be described. The first embodiment was described on the assumption that the similarity determination system 1 stores the per-cluster compound lists C_X1 to C_XN and C_Y1 to C_YM in advance. The second embodiment describes a method by which a similarity determination system 1A calculates the compound lists C_X1 to C_XN and C_Y1 to C_YM.
In the following description of the second embodiment, any configuration, processing, or function that is not specifically mentioned is the same as the corresponding configuration, processing, or function of the first embodiment described above.
FIG. 10 is a diagram for explaining the similarity determination system 1A according to the second embodiment, and FIG. 11 is a diagram for explaining an example of the processing of the similarity determination system 1A. As shown in FIG. 10, in the similarity determination system 1A according to the second embodiment, the processes P1 to P6 based on the query 11 and the document set 12 are the same as in the first embodiment.
As illustrated in FIG. 10, in the similarity determination system 1A, the processes P7 and P8 may be executed in parallel with, or before or after, at least part of the processes P1 to P3. The processes P7 and P8 are described below.
For example, the similarity determination system 1A extracts compound names, as an example of named entities, from each of a plurality of documents, for example the query document 11a and the plurality of comparison target documents 12a (process P7), and generates a named entity list, for example a compound list, for each document.
In the example of FIG. 11, the similarity determination system 1A extracts compound names from the query document 11a (denoted "document X") included in the query 11 and generates a compound list C_X. The similarity determination system 1A also extracts compound names from a comparison target document 12a (denoted "document Y") included in the document set 12 and generates a compound list C_Y.
In the second embodiment, as in the first embodiment, the query document 11a and the comparison target document 12a are assumed to be documents relating to lithium-ion batteries. Hereinafter, when the compound lists C_X and C_Y generated for the pair of documents to be compared need not be distinguished from each other, they are simply referred to as "compound list C".
The similarity determination system 1A according to the second embodiment performs clustering that classifies and groups the named entities based on the named entity list (process P8). Various existing methods, such as the shortest distance method, may be used as the clustering method.
For example, based on the named entity list, the similarity determination system 1A may calculate a similarity score S between the named entities included in the list for each pair of named entities. For example, for each pair of named entities, the similarity determination system 1A calculates the similarity score S based on the position of each named entity and the similarity between the named entities.
As an example, when a pair of named entities is denoted as compounds x_1 and x_2, the similarity determination system 1A may calculate the similarity score S(x_1, x_2) using the following equation (5).
$$S(x_1, x_2) = \mathrm{Distance}(x_1, x_2) \times \mathrm{TC}(x_1, x_2) \qquad (5)$$
In equation (5) above, TC(x_1, x_2) is the Tanimoto coefficient of the MACCS Keys. MACCS Keys are one way of representing the features of a compound (a compound descriptor). The Tanimoto coefficient is one index of the structural similarity between compounds computed using MACCS Keys, and is an example of the similarity between named entities when the named entities are compound names. Distance(x_1, x_2) is, for example, a value that quantifies how close the appearance positions of the named entities are within the document, and takes, as an example, one of the following values depending on the condition.
- Compounds x_1 and x_2 appear in the same sentence and are in a parallel relationship: "1.0"
- Compounds x_1 and x_2 appear in the same sentence: "0.8"
- Compounds x_1 and x_2 appear in the same paragraph: "0.5"
- Otherwise: "0.1"
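The positional value and the score of equation (5) can be sketched as follows. The description later characterizes equation (5) as the product of the positional value and the named entity similarity, which is what `score_product` computes. The boolean inputs are an assumption for illustration: how co-occurrence and the parallel relationship are detected in the text is not covered here, and the function names are hypothetical.

```python
def distance(same_sentence, parallel, same_paragraph):
    """Positional value Distance(x1, x2) following the four listed conditions."""
    if same_sentence and parallel:
        return 1.0
    if same_sentence:
        return 0.8
    if same_paragraph:
        return 0.5
    return 0.1


def score_product(dist, tanimoto):
    """Similarity score as the product of positional value and structural similarity."""
    return dist * tanimoto


# Hypothetical pair: same sentence, not listed in parallel, Tanimoto coefficient 0.5.
s = score_product(distance(True, False, True), 0.5)
```

The conditions are checked from most to least specific, so a pair that satisfies several of them receives the highest applicable value.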
For the plurality of compound names included in the compound list C, the similarity determination system 1A may apply equation (5) to every combination of compound name pairs (x_1, x_2) and calculate the similarity score S(x_1, x_2) of each pair.
The similarity determination system 1A may cluster the compound names by applying a method such as the shortest distance method to the calculated similarity scores S(x_1, x_2), thereby classifying and grouping the plurality of compound names included in the compound list C.
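A minimal sketch of grouping compound names from pairwise scores is shown below. It treats the shortest distance method as single-linkage clustering cut at a fixed similarity threshold, which is equivalent to taking connected components of the graph whose edges are pairs scoring at or above the threshold. The threshold value, the function name, and the sample scores are assumptions for illustration and do not come from the specification.

```python
def single_linkage_clusters(items, scores, threshold):
    """Group items whose pairwise score chain reaches the threshold.

    items: list of compound names.
    scores: dict mapping (name_a, name_b) to a similarity score.
    threshold: hypothetical cut-off; pairs at or above it are linked.
    """
    parent = {item: item for item in items}

    def find(item):
        # Union-find lookup with path halving.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical compound names and similarity scores.
compounds = ["graphite", "Si", "LiCoO2", "LiMn2O4"]
pair_scores = {
    ("graphite", "Si"): 0.8,
    ("LiCoO2", "LiMn2O4"): 0.7,
    ("graphite", "LiCoO2"): 0.05,
}
clusters = single_linkage_clusters(compounds, pair_scores, threshold=0.5)
```

With these scores the result is two groups, one holding the two negative electrode materials and one holding the two positive electrode materials. A production system might instead fix the number of clusters or use a library implementation of hierarchical clustering.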
In the example of FIG. 11, the similarity determination system 1A divides the compound names in the compound list C_X into N clusters (groups) by clustering the compound list C_X, and generates the per-cluster compound lists C_X1 to C_XN. The similarity determination system 1A also divides the compound names in the compound list C_Y into M clusters (groups) by clustering the compound list C_Y, and generates the per-cluster compound lists C_Y1 to C_YM.
As an example, consider the compound lists used in the process P4 of FIG. 10. In this case, as shown in FIG. 5, the similarity determination system 1A classifies each of the compound lists C_X and C_Y into four clusters (N = M = 4) and generates the compound lists C_X1 to C_X4 and C_Y1 to C_Y4. As a result of this clustering, the compound lists C_X and C_Y can be classified into clusters of the following four elements (characteristics).
- Compound lists C_X1 and C_Y1:
  A cluster having the element (characteristic) "negative electrode active material".
- Compound lists C_X2 and C_Y2:
  A cluster having the element (characteristic) "positive electrode active material".
- Compound lists C_X3 and C_Y3:
  A cluster having the element (characteristic) "binder".
- Compound lists C_X4 and C_Y4:
  A cluster having the element (characteristic) "electrolyte solvent".
As described above, the similarity determination system 1A according to the second embodiment can generate the per-cluster compound lists C_X1 to C_XN and C_Y1 to C_YM used in the partial document clustering (process P4).
The description so far has assumed that the Tanimoto coefficient of the MACCS Keys is used as the structural similarity, but the structural similarity is not limited to this. For example, the representation of compound features is not limited to MACCS Keys (in other words, the MACCS fingerprint), and various other compound descriptors such as the Morgan fingerprint may be adopted. Likewise, the index of structural similarity between compounds is not limited to the Tanimoto coefficient, and various other coefficients such as the Dice coefficient may be used.
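To make the difference between the two coefficients concrete, the sketch below computes both on fingerprints represented as sets of on-bit indices. The bit sets are made-up values rather than real MACCS or Morgan fingerprints; in practice a cheminformatics toolkit would generate the fingerprints from the compound structures.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by on-bits in either fingerprint."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(bits_a, bits_b):
    """Dice coefficient: twice the shared on-bits divided by the total on-bit count."""
    a, b = set(bits_a), set(bits_b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical fingerprints given as indices of bits that are set.
fp1 = {1, 5, 9, 12}
fp2 = {1, 5, 7}
tc = tanimoto(fp1, fp2)  # 2 shared bits out of 5 distinct bits
dc = dice(fp1, fp2)      # 2 * 2 shared bits over 4 + 3 bits
```

For the same pair the Dice coefficient is never smaller than the Tanimoto coefficient, so any clustering threshold would need to be retuned when switching between them.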
In equation (5) above, the similarity determination system 1A calculates, as the similarity score S(x_1, x_2), the product of the value quantifying the proximity of the appearance positions of the named entities in the document and the similarity between the named entities; however, the calculation is not limited to this.
As an example, the similarity determination system 1A may calculate the similarity score S(x_1, x_2) using the following equation (6).
Figure JPOXMLDOC01-appb-M000005
In equation (6) above, W is a weight. For example, W may be defined and set as appropriate by a user or the like to a value such as "0.5" so that the positions of the named entities and the similarity between the named entities are considered equally. Alternatively, W may be set based on a model trained, by machine learning on training data that includes search queries and correct answer examples (ground-truth data), so that the correct answer examples are ranked higher in the search results.
For example, for compounds that are not structurally similar but are used in the same way in a single constituent element (and are therefore likely to be listed together in the same sentence), the similarity may be underestimated when equation (5) above is used. In contrast, calculating the similarity score as in equation (6), based on the weighted sum of the value quantifying the proximity of the appearance positions of the named entities in the document and the similarity between the named entities, allows the similarity of such compounds to be evaluated appropriately.
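The effect described here can be seen with a small numeric sketch. The exact form of equation (6) is not reproduced in this text, so the code assumes one plausible weighted sum, W * Distance + (1 - W) * TC; which term carries W, and the use of 1 - W for the other, are assumptions, as are the function names and sample values.

```python
def score_product(dist, tanimoto):
    """Product form: positional value times structural similarity."""
    return dist * tanimoto


def score_weighted_sum(dist, tanimoto, w=0.5):
    """Assumed weighted-sum form: w * positional value + (1 - w) * structural similarity."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must be between 0 and 1")
    return w * dist + (1.0 - w) * tanimoto


# Hypothetical pair: listed in parallel in one sentence (Distance = 1.0)
# but structurally dissimilar (Tanimoto coefficient = 0.1).
product = score_product(1.0, 0.1)
weighted = score_weighted_sum(1.0, 0.1, w=0.5)
```

Under the product the pair scores 0.1 despite being listed side by side, whereas the assumed weighted sum gives 0.55, so the positional evidence is no longer cancelled by a low structural similarity.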
In this way, the similarity determination system 1A generates a first cluster group by classifying a first plurality of compound names included in the query document 11a based on the position of each of the first plurality of compound names and the similarity between them. The similarity determination system 1A also generates a second cluster group by classifying a second plurality of compound names included in the comparison target document 12a based on the position of each of the second plurality of compound names and the similarity between them. The first cluster group is an example of a first plurality of groups, and the second cluster group is an example of a second plurality of groups.
As described above, the similarity determination system 1A according to the second embodiment can achieve the same effects as the first embodiment. In addition, because a per-cluster compound list can be generated for each document, the user does not have to create the compound lists manually, which is convenient. Further, the similarity between documents can be determined even when the similarity determination system 1A has not accumulated one or both of the query document 11a and the comparison target document 12a. Such cases include, for example, a case where the document is included in the query 11 and a case where the query 11 specifies the location of the document (a storage location other than the similarity determination system 1A).
[2-2] Functional Configuration Example
FIG. 12 is a block diagram showing a functional configuration example of the server 3 in the similarity determination system 1A according to the second embodiment, and FIG. 13 is a diagram showing an example of screen output by the server 3.
The server 3 is an example of a similarity determination device, an information processing device, or a computer. For example, in the similarity determination system 1A, the server 3 may perform various kinds of communication with a terminal device, another server, or the like (not shown), such as receiving the query document 11a and the comparison target document 12a and transmitting the result 14.
Like the server 2, the server 3 may provide, for example, a function that allows a terminal device to access it. For example, as shown in FIG. 13, the server 3 may output screen information for a search query specification screen 330 for specifying a search query and a search result output screen 340 for outputting search results.
The similarity determination process of the similarity determination system 1A described above may be implemented by the server 3. As shown in FIG. 12, the server 3 may include, by way of example, a document DB unit 31 and a document search unit 32. The document DB unit 31 and the document search unit 32 are examples of a control unit. The server 3 may also include the document input unit 22 shown in FIG. 6.
The document DB unit 31 stores the query document 11a and the comparison target documents 12a, and performs a document DB construction process for constructing a document DB.
In response to receiving the query 11, the document search unit 32 performs a document search process that searches, based on the information stored by the document DB unit 31, for comparison target documents 12a similar to the query document 11a specified in the query 11. The document search process includes the similarity determination process and is a usage example (application example) of the similarity determination process.
(Description of the document DB unit 31)
As shown in FIG. 12, the document DB unit 31 may include, by way of example, a document storage unit 311, a compound name extraction unit 312, a clustering unit 313, a document cluster vector calculation unit 314, and a document cluster vector storage unit 315.
The document storage unit 311 is an example of the memory unit 21 (see FIG. 6) according to the first embodiment, and stores a plurality of documents. Each document can be used as either the query document 11a or a comparison target document 12a. The document storage unit 311 can therefore be said to store the query document 11a and a document set (document group) 12 that includes the plurality of comparison target documents 12a targeted by the query 11. The document storage unit 311 may store the plurality of documents in advance, before the query 11 is received. The document storage unit 311 may also store a plurality of documents received by the document input unit 22 according to the first embodiment.
The compound name extraction unit 312 extracts compound names, as an example of named entities, from each of the plurality of documents stored in the document storage unit 311, and generates the compound lists C_X and C_Y for the respective documents. The processing of the compound name extraction unit 312 is an example of the process P7 in FIG. 10.
The clustering unit 313 calculates a similarity score S for the compound names included in each of the compound lists C_X and C_Y. The clustering unit 313 then classifies the compound names into a plurality of clusters based on the similarity score S, and generates the compound lists C_X1, C_X2, C_X3, ..., C_XN and the compound lists C_Y1, C_Y2, C_Y3, ..., C_YM. The processing of the clustering unit 313 is an example of the process P8 in FIG. 10.
The document cluster vector calculation unit 314 may calculate a document vector for each partial document cluster based on the compound cluster information from the clustering unit 313 and on the weights and word vectors calculated from the words extracted from each of the plurality of documents stored in the document storage unit 311. The processing of the document cluster vector calculation unit 314 is an example of the processes P1 to P4 and at least part of the process P5 in FIG. 10.
The document cluster vector storage unit 315 is an example of the memory unit 21 shown in FIG. 6, and stores the document vectors calculated for each partial document cluster by the document cluster vector calculation unit 314.
(Description of the document search unit 32)
As shown in FIG. 12, the document search unit 32 may include, by way of example, a search query designation unit 321, a document similarity calculation unit 322, a search result generation unit 323, and a search result output unit 324.
The search query designation unit 321 is an example of the document input unit 22 shown in FIG. 6, and accepts input of a query 11 requesting a document search (hereinafter sometimes referred to as the "search query 11") from a computer (not shown) such as a terminal device or another server.
For example, as shown in FIG. 13, the search query designation unit 321 may accept the document number of the query document 11a that is set in the input field 331 when the search button 332 of the search query specification screen 330 is pressed.
The document similarity calculation unit 322 is an example of the document similarity calculation unit 233 shown in FIG. 6. Based on the document vectors stored in the document cluster vector storage unit 315, the document similarity calculation unit 322 calculates the document similarity Sim(X, Y) between the query document 11a specified by the search query 11 and a comparison target document 12a.
For example, the document similarity calculation unit 322 may calculate the text similarity by comparing, among the partial document vectors stored in the document cluster vector storage unit 315, the plurality of partial document vectors corresponding to the query document 11a with those corresponding to the comparison target document 12a.
The document similarity calculation unit 322 may then calculate the document similarity Sim(X, Y) based on the text similarity and generate the ranking result 14 by sorting the comparison target documents 12a in descending order of the document similarity Sim(X, Y). The content and output method of the result 14 are the same as those of the result 13 according to the first embodiment.
The processing of the document similarity calculation unit 322 is an example of at least part of the process P5 and of the process P6 in FIG. 10.
The search result generation unit 323 generates a search result for output based on the result 14. For example, the search result generation unit 323 may generate the search result output screen 340 shown in FIG. 13. The search result output screen 340 may be the determination result output screen 240 shown in FIG. 7 with the determination result 244 replaced by the search result 344.
As shown in FIG. 13, the search result output screen 340 may include a display area 341 for the query document 11a and display areas 345a to 345c for at least one (three in FIG. 13) comparison target document 12a. The display area 341 may include a display area 342 for bibliographic information, an abstract, and the like, and a full-text reference button 343 for the query document 11a.
The display areas 345a to 345c may include display areas 346a to 346c for bibliographic information, abstracts, and the like, and full-text reference buttons 347a to 347c. The display areas 346a to 346c may also display one or more paragraphs P_Y or compound lists corresponding to the partial document cluster determined to be similar, and/or the similarity Sim(X, Y).
The search result output unit 324 outputs the search result output screen 340 to a computer (not shown) such as a terminal device or another server.
[2-3] Operation Example
FIG. 14 is a flowchart illustrating an operation example of the document DB construction process of the server 3, and FIG. 15 is a flowchart illustrating an operation example of the document search process of the server 3.
(Operation example of the document DB construction process)
As illustrated in FIG. 14, the document storage unit 311 selects an unselected document (step S21) and registers the document in the document DB (step S22).
The compound name extraction unit 312 extracts compound names from the text of the document (step S23). The clustering unit 313 clusters the extracted compound names (step S24).
The document cluster vector calculation unit 314 divides the document into a plurality of partial documents (step S25), and clusters the plurality of partial documents based on the compound clusters generated by the clustering unit 313 (step S26).
The document cluster vector calculation unit 314 calculates the document vector of each partial document cluster (step S27). The document cluster vector storage unit 315 associates the calculated document vectors with the document and registers (stores) them in, for example, the document DB or a document cluster vector DB (step S28).
The document storage unit 311 determines whether any unselected document remains (step S29). If so (YES in step S29), the process returns to step S21. If the document storage unit 311 determines that no unselected document remains (NO in step S29), the process ends.
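The loop of steps S21 to S29 can be sketched as follows. This is only an illustrative outline: the function name `build_document_db` and the four callables passed to it are hypothetical, and the rule used for step S26 (a partial document joins a cluster when it mentions one of that cluster's compound names) is an assumption, not something the description specifies.

```python
from typing import Callable, Dict, List

Vector = List[float]


def build_document_db(
    documents: Dict[str, str],
    extract_names: Callable[[str], List[str]],              # stands in for step S23
    cluster_names: Callable[[List[str]], List[List[str]]],  # stands in for step S24
    split_document: Callable[[str], List[str]],             # stands in for step S25
    embed: Callable[[List[str]], Vector],                   # stands in for step S27
) -> Dict[str, List[Vector]]:
    """Return {document id: one vector per partial-document cluster}."""
    vector_db: Dict[str, List[Vector]] = {}
    # S21/S29: visit every registered document exactly once.
    for doc_id, text in documents.items():
        names = extract_names(text)                # S23
        compound_clusters = cluster_names(names)   # S24
        parts = split_document(text)               # S25
        # S26 (assumed rule): group the parts that mention any name of a cluster.
        part_clusters = [
            [part for part in parts if any(name in part for name in cluster)]
            for cluster in compound_clusters
        ]
        # S27/S28: compute one vector per cluster and store it under the document.
        vector_db[doc_id] = [embed(cluster_parts) for cluster_parts in part_clusters]
    return vector_db
```

Passing the extraction, clustering, splitting, and embedding steps in as callables keeps the sketch independent of any particular named entity recognizer or word vector model.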
(Operation example of the document search process)
As illustrated in FIG. 15, the search query designation unit 321 accepts the designation of the query document 11a from the search query specification screen 330 (step S31).
The document similarity calculation unit 322 acquires the document vectors of the query document 11a from the document cluster vector storage unit 315 (step S32).
The document similarity calculation unit 322 selects an unselected document (step S33), and acquires the document vectors of the partial document clusters of the selected document from the document cluster vector storage unit 315 (step S34).
The document similarity calculation unit 322 compares the document vectors of the plurality of partial document clusters between the query document 11a and the selected document, and calculates the document similarity Sim(X, Y) (step S35).
The document similarity calculation unit 322 determines whether any unselected document remains (step S36). If so (YES in step S36), the process returns to step S33. If the document similarity calculation unit 322 determines that no unselected document remains (NO in step S36), it extracts a predetermined number of documents in descending order of document similarity (step S37).
The search result generation unit 323 generates a search result based on the extracted data, the search result output unit 324 outputs the search result, for example the search result output screen 340 (step S38), and the process ends.
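A minimal sketch of steps S32 to S37 is shown below. The names `search`, `cosine`, and `top_k` are hypothetical. Two points are assumptions made only for illustration: Sim(X, Y) is taken as the best cosine similarity over all pairs of partial-document cluster vectors, and the query document itself is skipped as a candidate.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def search(
    query_id: str, vector_db: Dict[str, List[Vector]], top_k: int
) -> List[Tuple[str, float]]:
    query_vectors = vector_db[query_id]             # S32
    scores: List[Tuple[str, float]] = []
    for doc_id, doc_vectors in vector_db.items():   # S33, S36: loop over documents
        if doc_id == query_id:                      # assumed: do not rank the query itself
            continue
        # S34, S35 (assumed aggregation): score of the best-matching cluster pair.
        sim = max(
            (cosine(q, d) for q in query_vectors for d in doc_vectors),
            default=0.0,
        )
        scores.append((doc_id, sim))
    scores.sort(key=lambda item: item[1], reverse=True)  # S37: highest similarity first
    return scores[:top_k]
```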
[3] Third Embodiment
[3-1] Description of the Third Embodiment
Next, the third embodiment will be described. The third embodiment describes a method that uses a named entity similarity based on compound clusters in the document similarity calculation process according to the second embodiment.
In the following description of the third embodiment, configurations, processes, and functions that are not specifically mentioned are the same as those of the first and second embodiments described above.
FIG. 16 is a diagram for explaining the similarity determination system 1B according to the third embodiment, and FIGS. 17 and 18 are diagrams for explaining an example of the processing of the similarity determination system 1B.
As shown in FIG. 16, the similarity determination system 1B according to the third embodiment replaces the process P6 of the similarity determination system 1A shown in FIG. 10 with a process P10, adds a process P9 that uses the result of the process P8, and executes the process P10 using the results of both the processes P5 and P9.
The process P9 calculates a named entity similarity for each cluster, for example a compound similarity, for each pair of clusters between documents. The process P10 ranks each of the plurality of comparison target documents 12a according to its similarity to the query document 11a, based on the text similarity and the named entity similarity. The processes P9 and P10 are described below.
(Example of the named entity similarity calculation process)
In the process P9, the similarity determination system 1B may, for example, compare the plurality of compound lists of the first plurality of clusters generated from the query document 11a with the plurality of compound lists of the second plurality of clusters generated from the comparison target document 12a. The similarity determination system 1B may then calculate the compound similarity, for example a cosine similarity, for every pair of clusters between the first plurality of clusters and the second plurality of clusters by the calculation of the following formula (7).
[Formula (7): equation image JPOXMLDOC01-appb-M000006]
In formula (7), i is an index identifying each of the compound names contained in the compound lists C_Xa and C_Yb, and C_Xai and C_Ybi denote the number of occurrences of the i-th compound name in the compound lists C_Xa and C_Yb, respectively. The denominator of formula (7) is the product of the square root of the sum of squares of the occurrence counts of the compounds in C_Xa and the square root of the sum of squares of the occurrence counts of the compounds in C_Yb, and the numerator is the sum of the products of the occurrence counts of the compounds common to C_Xa and C_Yb.
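As a rough illustration of the calculation described for formula (7), the sketch below computes a cosine similarity over compound-name occurrence counts. The function name `compound_similarity` is hypothetical, and because the equation image is not available here, the denominator follows the standard cosine-similarity form (the product of the two norms), which is an assumption.

```python
import math
from collections import Counter
from typing import Iterable


def compound_similarity(names_a: Iterable[str], names_b: Iterable[str]) -> float:
    """Cosine similarity between two compound lists, using occurrence counts."""
    counts_a, counts_b = Counter(names_a), Counter(names_b)
    # Numerator: sum over the shared compound names of the product of their counts.
    numerator = sum(
        counts_a[name] * counts_b[name]
        for name in counts_a.keys() & counts_b.keys()
    )
    # Denominator (assumed standard form): product of the two Euclidean norms.
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty list shares nothing with any other list
    return numerator / (norm_a * norm_b)
```

For example, a list in which "LiCoO2" appears twice and "LiNiO2" once, compared with a list containing "LiCoO2" and "LiMn2O4" once each, gives 2 / (sqrt(5) * sqrt(2)).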
In the example of the compound lists C_X and C_Y shown in FIG. 17, the similarity determination system 1B may calculate the compound similarity according to formula (7) for every pair (combination) of the compound lists C_X1, C_X2, C_X3, ..., C_XN and the compound lists C_Y1, C_Y2, C_Y3, ..., C_YM.
(Example of the ranking process)
The similarity determination system 1B performs a ranking process that ranks each of the plurality of comparison target documents 12a according to its similarity to the query document 11a, based on the text similarity and the named entity similarity (process P10), and outputs the result 14.
For example, in the ranking process, the similarity determination system 1B calculates a similarity that integrates the text similarity and the named entity similarity and, based on that similarity, outputs a ranking of the plurality of comparison target documents 12a according to their similarity to the query document 11a.
The similarity determination system 1B may calculate the document similarity Sim(X, Y) between the document X and one comparison target document Y according to, for example, the following formula (8).
[Formula (8): equation image JPOXMLDOC01-appb-M000007]
In formula (8), fc is the cosine similarity according to formula (7), in other words, the named entity similarity.
Formula (8) shows an example of calculating the document similarity between the document X (query document 11a) and one document Y (comparison target document 12a). As in the second embodiment, the similarity determination system 1B may obtain as many document similarities Sim(X, Y_1) to Sim(X, Y_L) as there are documents Y.
Then, as in the second embodiment, the similarity determination system 1B performs the ranking process by, for example, sorting all the search target documents Y_1 to Y_L in descending order of the document similarities Sim(X, Y_1) to Sim(X, Y_L). The similarity determination system 1B may output the sorted result as the result 14.
The similarity determination system 1B may instead calculate the document similarity Sim(X, Y) between the document X and one comparison target document Y as a weighted sum of the named entity similarity and the text similarity, according to the following formula (9).
[Formula (9): equation image JPOXMLDOC01-appb-M000008]
In formula (9), w is a weight. The value of w may be defined and set as appropriate by a user or the like, for example to a value such as "0.5" so that the named entity similarity and the text similarity are considered equally. Alternatively, w may be set based on a model trained by machine learning on training data that includes search queries and correct examples (ground-truth data), so that the correct examples are retrieved at higher ranks.
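The weighted sum can be sketched as below. The exact form of formula (9) is not visible in this text, so the sketch assumes a convex combination in which w weights the named entity similarity fc and (1 - w) weights the text similarity. This is consistent with w = 0.5 treating both equally, but which term carries w is an assumption, and the function name `combined_similarity` is hypothetical.

```python
def combined_similarity(fc: float, text_similarity: float, w: float = 0.5) -> float:
    """Weighted sum of named entity similarity (fc) and text similarity.

    Assumed form: w * fc + (1 - w) * text_similarity, with 0 <= w <= 1.
    """
    if not 0.0 <= w <= 1.0:
        raise ValueError("w is expected to lie between 0 and 1")
    return w * fc + (1.0 - w) * text_similarity
```

With the default w of 0.5, a named entity similarity of 0.8 and a text similarity of 0.4 combine to about 0.6. Setting w to 1.0 would rank documents by the named entity similarity alone.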
As described above, the similarity determination system 1B according to the third embodiment achieves the same effects as the first and second embodiments.
The similarity determination system 1B can also use the named entity similarity calculated by formula (7) to calculate the document similarity Sim(X, Y) according to formula (8) or formula (9). For example, among the named entity similarities, the similarity determination system 1B may adopt the value for the cluster pair with the maximum value (one of the combinations of a = 1 to N and b = 1 to M) as the value of fc used to calculate the document similarity Sim(X, Y).
Here, assume that the compound lists C_X and C_Y are compared with each other in the example shown in FIG. 18. When the element under investigation is the "positive electrode active material", compound names related to the positive electrode active material, such as "LiCoO2", appear in both documents, whereas compound names related to other elements differ between the documents, so the compound similarity between the documents may be calculated as a low value. Thus, when the per-document compound lists C are compared, the compound similarity may come out low even if the element under investigation is similar between the documents.
In contrast, as illustrated in FIG. 18, the similarity determination system 1B can determine that the pair of compound lists C_X2 and C_Y2, in other words the pair of "positive electrode active material" clusters, has the maximum compound similarity. The similarity determination system 1B can then adopt that compound similarity as the value of fc used to calculate the document similarity Sim(X, Y).
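Selecting the cluster pair with the maximum named entity similarity can be sketched as follows. The function name `best_cluster_pair` is hypothetical, and the pairwise measure is passed in as a callable so that any similarity, such as the cosine similarity of formula (7), can be used. The indices a and b are 1-based to match the notation a = 1 to N and b = 1 to M.

```python
from typing import Callable, List, Sequence, Tuple


def best_cluster_pair(
    clusters_x: Sequence[List[str]],
    clusters_y: Sequence[List[str]],
    similarity: Callable[[List[str], List[str]], float],
) -> Tuple[int, int, float]:
    """Return (a, b, score) for the cluster pair with the highest similarity."""
    best_a, best_b, best_score = 0, 0, float("-inf")
    for a, cluster_x in enumerate(clusters_x, start=1):
        for b, cluster_y in enumerate(clusters_y, start=1):
            score = similarity(cluster_x, cluster_y)
            if score > best_score:
                best_a, best_b, best_score = a, b, score
    if best_a == 0:
        raise ValueError("each document needs at least one cluster")
    return best_a, best_b, best_score
```

The returned score would then serve as the value of fc. In a case like the FIG. 18 example, the pair of "positive electrode active material" clusters would be expected to win even when the other clusters share no compound names.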
As described above, the similarity determination system 1B according to the third embodiment determines the similarity between documents based on the named entity similarity of each cluster classified by the clustering process, which can further improve the accuracy of determining the similarity between partially similar documents.
[3-2] Functional Configuration Example
FIG. 19 is a block diagram showing a functional configuration example of the server 4 in the similarity determination system 1B according to the third embodiment. Unless otherwise noted, the server 4 may be the same as the server 3 shown in FIG. 12.
The similarity determination process of the similarity determination system 1B described above may be implemented by the server 4. As shown in FIG. 19, the server 4 may include, by way of example, a document DB unit 41 and a document search unit 42. The document DB unit 41 and the document search unit 42 are examples of a control unit.
The document DB unit 41 may include a compound cluster storage unit 416 in addition to the configuration of the document DB unit 31 shown in FIG. 12. The document search unit 42 may include a document similarity calculation unit 422 in place of the document similarity calculation unit 322 shown in FIG. 12.
(Description of the document DB unit 41)
For example, the compound cluster storage unit 416 is an example of the memory unit 21 shown in FIG. 6, and may store information on the compound clusters calculated by the clustering unit 313, for example the compound lists C, in association with the documents.
(Description of the document search unit 42)
The document similarity calculation unit 422 calculates the text similarity by comparing the partial document vectors of the query document 11a and of the comparison target document 12a stored in the document cluster vector storage unit 315. The document similarity calculation unit 422 also calculates the compound similarity by comparing the compound lists of the query document 11a and of the comparison target document 12a stored in the compound cluster storage unit 416.
The document similarity calculation unit 422 then calculates the document similarity Sim(X, Y) based on the text similarity and the compound similarity, and generates the result 14 from the document similarity Sim(X, Y). The processing of the document similarity calculation unit 422 is an example of the processes P5, P9, and P10 in FIG. 16.
The document search unit 42 according to the third embodiment may output the screens illustrated in FIG. 13.
[3-3] Operation Example
FIG. 20 is a flowchart illustrating an operation example of the document DB construction process of the server 4, and FIG. 21 is a flowchart illustrating an operation example of the document search process of the server 4.
(Operation example of the document DB construction process)
FIG. 20 adds step S41 between steps S24 and S25 shown in FIG. 14. As illustrated in FIG. 20, in step S41 the compound cluster storage unit 416 stores the calculated compound cluster information for each document.
(Operation example of the document search process)
FIG. 21 adds step S51 between steps S32 and S33 shown in FIG. 15, and replaces step S35 with steps S52 and S53.
In step S51, the document similarity calculation unit 422 acquires the compound clusters of the query document 11a, for example the compound lists, from the compound cluster storage unit 416.
In step S52, the document similarity calculation unit 422 acquires the compound clusters of the selected document, for example the compound lists, from the compound cluster storage unit 416.
In step S53, the document similarity calculation unit 422 calculates the document similarity Sim(X, Y) based on the document vectors acquired in steps S32 and S34 and the compound clusters acquired in steps S51 and S52.
 〔3-4〕第1変形例
 次に、第3実施形態の第1変形例について説明する。
[3-4] First Modification Example Next, a first modification of the third embodiment will be described.
 (機能構成例)
 図22は、第3実施形態の第1変形例に係る類似度判定システム1Cにおけるサーバ5の機能構成例を示すブロック図であり、図23は、サーバ5による画面出力例を示す図である。
(Functional configuration example)
FIG. 22 is a block diagram showing a functional configuration example of the server 5 in the similarity determination system 1C according to the first modification of the third embodiment, and FIG. 23 is a diagram showing a screen output example by the server 5.
The similarity determination system 1C according to the first modification calculates the text similarity by comparing the partial document cluster of the query document 11a that contains a predetermined keyword with each partial document cluster of the plurality of comparison target documents 12a.
As shown in FIG. 22, the server 5 may illustratively include a document DB unit 41 and a document search unit 52. The document DB unit 41 and the document search unit 52 are examples of a control unit. The document DB unit 41 is the same as the document DB unit 41 shown in FIG. 19.
The document search unit 52 may include a document similarity calculation unit 522 in place of the document similarity calculation unit 422 of the document search unit 42 shown in FIG. 19, and may further include a keyword input unit 525 and a document cluster identification unit 526.
The keyword input unit 525 accepts input of one or more keywords from the user. For example, as shown in FIG. 23, when the search button 533 of the search query specification screen 530 is pressed, the keyword input unit 525 notifies the document cluster identification unit 526 of the document number of the query document 11a and the one or more keywords set in the input fields 531 and 532.
The document cluster identification unit 526 refers to the document cluster vector storage unit 315 and identifies, from among the plurality of partial document clusters of the notified query document 11a, the partial document clusters that contain the notified one or more keywords (for example, at least a predetermined number of times).
The document similarity calculation unit 522 limits the partial document vectors of the query document 11a that are compared with the plurality of partial document vectors of the comparison target document 12a to the document vectors of the partial document clusters identified by the document cluster identification unit 526. In other words, the document similarity calculation unit 522 sets the importance (priority) of the identified partial document clusters higher than that of the other partial document clusters. The document similarity calculation unit 522 then calculates the text similarity for the identified partial document clusters and calculates the inter-document similarity based on the text similarity and the compound similarity.
As described above, the server 5 according to the first modification can achieve the same effects as the first and second embodiments. In addition, among the plurality of partial document clusters in the query document 11a, the comparison target documents 12a can be searched using the appropriate partial document clusters that contain the keyword intended by the user, which further improves the accuracy of determining the similarity between documents. Furthermore, since the number of partial document clusters used for determining the similarity can be limited, the processing time of the document search process can be shortened. The user can also flexibly specify clusters containing a predetermined keyword, which is highly convenient.
(Operation example of document search process)
FIG. 24 is a flowchart illustrating an operation example of the document search process of the server 5. FIG. 24 corresponds to FIG. 21 with steps S61 and S62 added between steps S51 and S33 and with step S53 replaced by step S63.
As illustrated in FIG. 24, in step S61, the keyword input unit 525 accepts the specification of a keyword.
In step S62, the document cluster identification unit 526 identifies the partial document clusters of the query document 11a that contain the keyword accepted by the keyword input unit 525 at least a first threshold (predetermined) number of times.
In step S63, the document similarity calculation unit 522 calculates the text similarity between the identified partial document clusters of the query document 11a and all the partial document clusters of the selected document. The document similarity calculation unit 522 then calculates the document similarity Sim(X, Y) based on the calculated text similarity and the compound similarity.
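The cluster selection in steps S61 and S62 can be sketched as follows. This is a minimal illustration, not the implementation described here: the dictionary layout of a cluster, the function name, and the choice to accept a cluster when any single keyword reaches the first threshold are all assumptions (the text does not say how multiple keywords are combined).

```python
def select_clusters_by_keywords(clusters, keywords, first_threshold=1):
    """Return the partial document clusters whose text contains at least one
    of the keywords `first_threshold` or more times.

    `clusters` is a list of dicts with a "text" entry (hypothetical layout).
    Counting uses non-overlapping substring matches via str.count.
    """
    selected = []
    for cluster in clusters:
        text = cluster["text"]
        if any(text.count(keyword) >= first_threshold for keyword in keywords):
            selected.append(cluster)
    return selected
```

Only the vectors of the clusters returned here would then be compared with the partial document cluster vectors of each selected document in step S63, which is what reduces the number of comparisons.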
[3-5] Second Modification
Next, a second modification of the third embodiment will be described.
(Functional configuration example)
FIG. 25 is a block diagram showing a functional configuration example of the server 6 in the similarity determination system 1D according to the second modification of the third embodiment.
The similarity determination system 1D according to the second modification calculates the text similarity by comparing partial document clusters whose similarity to the text of a predetermined portion of the document is equal to or greater than a second threshold. The similarity determination system 1D also calculates the compound similarity by comparing compound clusters whose degree of agreement with the text of the predetermined portion included in the partial document clusters is equal to or greater than a third threshold.
For a specific type of document for which a predetermined format (document format) is defined, determining the document similarity based on the content of a predetermined portion may further improve the accuracy of determining the similarity of partially similar documents.
As an example, when the specific type of document is a patent document, a partial document cluster whose similarity to the text of the "(patent) claims" and the "abstract" is equal to or greater than a threshold may be an important passage (for example, a paragraph) related to the content of the "(patent) claims" and the "abstract".
Therefore, the similarity determination system 1D identifies the text of a predetermined portion, such as the "(patent) claims", in an input document according to the type of the document. Of the clusters calculated from the document, the similarity determination system 1D stores only the partial document clusters and compound clusters related to that text. In the similarity determination process, the similarity determination system 1D then determines the similarity based on the partial document clusters and compound clusters related to the text of the predetermined portion that corresponds to the type of the specified query document 11a.
As shown in FIG. 25, the server 6 may illustratively include a document DB unit 61 and a document search unit 42. The document DB unit 61 and the document search unit 42 are examples of a control unit. The document search unit 42 is the same as the document search unit 42 shown in FIG. 19.
The document DB unit 61 may include a predetermined document cluster vector storage unit 615 and a predetermined compound cluster storage unit 616 in place of the document cluster vector storage unit 315 and the compound cluster storage unit 416 of the document DB unit 41 shown in FIG. 19. The document DB unit 61 may also include a predetermined document structure analysis unit 617.
Of the partial document clusters calculated by the document cluster vector calculation unit 314, the predetermined document cluster vector storage unit 615 stores information on the partial document clusters identified by the predetermined document structure analysis unit 617, described later, in association with the document.
Of the compound clusters calculated by the clustering unit 313, the predetermined compound cluster storage unit 616 stores information on the compound clusters identified by the predetermined document structure analysis unit 617, described later, in association with the document.
The predetermined document structure analysis unit 617 identifies the text of a predetermined portion in an input document according to the type of the document. The "predetermined portion" may be set in advance for each predetermined document type whose document structure is defined, such as patent documents, papers, and various other materials.
Of the partial document clusters calculated from the input document, the predetermined document structure analysis unit 617 also identifies the partial document clusters whose similarity to the identified text is equal to or greater than the second threshold, and stores their document vectors in the predetermined document cluster vector storage unit 615. For example, the predetermined document structure analysis unit 617 may treat the identified text as a partial document (partial document cluster) and compare, against the second threshold, the text similarity between that partial document (partial document cluster) and each of the other partial document clusters in the document.
Furthermore, of the compound clusters calculated from the input document, the predetermined document structure analysis unit 617 identifies the compound clusters whose degree of agreement with the compound names included in the identified partial document clusters is equal to or greater than the third threshold, and stores them in the predetermined compound cluster storage unit 616. For example, the predetermined document structure analysis unit 617 may treat the compound names included in the identified partial document clusters as a per-cluster compound list and compare, against the third threshold, the compound similarity between that compound list and each of the other per-cluster compound lists in the document.
When a document of a type for which no predetermined portion is set is input, the predetermined document structure analysis unit 617 may store the calculated document vectors of the partial document clusters and the calculated compound clusters in the predetermined document cluster vector storage unit 615 and the predetermined compound cluster storage unit 616, respectively.
The document similarity calculation unit 422 may operate in the same manner as in the similarity determination system 1B shown in FIG. 19. However, when the type of the document related to the query 11 is a predetermined document type for which a "predetermined portion" is set, the cluster information used to calculate the document similarity is restricted. That is, the document similarity calculation unit 422 determines the document similarity based on the partial document clusters and compound clusters related to the text of the predetermined portion that corresponds to the type of the input query document 11a.
As described above, the server 6 according to the second modification can achieve the same effects as the first and second embodiments. In addition, by setting a "predetermined portion" in advance for each document type, the important (high-priority) partial document clusters and compound clusters for that document type can be identified easily. Determining the similarity based on these important partial document clusters and compound clusters can therefore further improve the accuracy of determining the similarity between documents. Moreover, since the number of partial document clusters and compound clusters used for determining the similarity can be limited, the processing time of the document search process can be shortened.
(Operation example of document DB construction process)
FIG. 26 is a flowchart illustrating an operation example of the document DB construction process of the server 6. FIG. 26 corresponds to FIG. 14 with step S28 replaced by steps S71 to S75.
As illustrated in FIG. 26, in step S71, the predetermined document structure analysis unit 617 identifies the text of the predetermined portion in the document.
In step S72, the predetermined document structure analysis unit 617 identifies, among the partial document clusters, those whose similarity to the text of the predetermined portion is equal to or greater than a threshold.
In step S73, the predetermined document cluster vector storage unit 615 registers the document vectors of the identified partial document clusters.
In step S74, the predetermined document structure analysis unit 617 identifies the compound clusters whose degree of agreement with the compound names included in the identified partial document clusters is equal to or greater than a threshold.
In step S75, the predetermined compound cluster storage unit 616 registers the identified compound clusters.
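Steps S72 to S75 can be sketched in Python as below, assuming the predetermined portion has already been identified and vectorized (step S71). The data layout, the function names, cosine similarity as the text similarity, and the "fraction of a compound cluster's names that also appear in the selected partial document clusters" as the degree of agreement are all illustrative assumptions rather than details given in the text.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def agreement(names, compound_cluster):
    """Assumed degree of agreement: share of the cluster's names found in `names`."""
    cluster = set(compound_cluster)
    return len(cluster & names) / len(cluster) if cluster else 0.0

def build_restricted_entries(section_vector, doc_clusters, compound_clusters,
                             second_threshold, third_threshold):
    """Return (document vectors to register, compound clusters to register)."""
    # S72: keep partial document clusters close to the predetermined portion.
    kept = [c for c in doc_clusters
            if cosine(section_vector, c["vector"]) >= second_threshold]
    # S74: collect compound names in the kept clusters, then keep the
    # compound clusters that agree with them well enough.
    names = set()
    for c in kept:
        names.update(c["compounds"])
    kept_compounds = [cl for cl in compound_clusters
                      if agreement(names, cl) >= third_threshold]
    # S73 and S75 would register these two results in units 615 and 616.
    return [c["vector"] for c in kept], kept_compounds
```

A document type without a predetermined portion would skip this filtering and register all clusters, as noted for the predetermined document structure analysis unit 617.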
(Operation example of document search process)
FIG. 27 is a flowchart illustrating an operation example of the document search process of the server 6. FIG. 27 corresponds to FIG. 21 with steps S32 and S51 replaced by steps S81 and S82, and with steps S34, S52, and S53 replaced by steps S83, S84, and S85.
In step S81, the document similarity calculation unit 422 acquires the document vectors of the partial document clusters of the query document 11a from the predetermined document cluster vector storage unit 615. When the query document 11a is of a predetermined document type, the document similarity calculation unit 422 acquires the document vectors of the predetermined partial document clusters.
In step S82, the document similarity calculation unit 422 acquires the compound clusters of the query document 11a from the predetermined compound cluster storage unit 616; when the query document 11a is of a predetermined document type, these are the predetermined compound clusters.
In step S83, the document similarity calculation unit 422 acquires the document vectors of the partial document clusters of the selected document from the predetermined document cluster vector storage unit 615. When the selected document is of a predetermined document type, the document similarity calculation unit 422 acquires the document vectors of the predetermined partial document clusters.
In step S84, the document similarity calculation unit 422 acquires the compound clusters of the selected document from the predetermined compound cluster storage unit 616; when the selected document is of a predetermined document type, these are the predetermined compound clusters.
In step S85, the document similarity calculation unit 422 calculates the document similarity based on the acquired document vectors of the predetermined partial document clusters and the acquired predetermined compound clusters.
[4] Others
The techniques according to the first to third embodiments and the first and second modifications of the third embodiment described above can be modified or changed and implemented as follows.
For example, the first to third embodiments and the first and second modifications of the third embodiment have been described using the case where compound names are used as the named entities, but the named entities are not limited to compound names. Various terms that can be the target of named entity extraction in natural language processing, such as gene sequences (genomes), may be used as the named entities.
Further, for example, the functional configurations of the servers 2 to 6 shown in FIGS. 6, 12, 19, 22, and 25 may be merged in any combination or may each be divided. The first to third embodiments and the first and second modifications of the third embodiment may also be combined as appropriate. In addition, each of the servers 2 to 6 may generate the screen information of any of the screens shown in FIGS. 7, 13, and 23, and may have a functional configuration corresponding to that screen.
Furthermore, for example, the functions of the servers 5 and 6 according to the first and second modifications of the third embodiment shown in FIGS. 22 and 25 may be implemented in combination with each other. These functions may also be applied to the document similarity determination process based on the text similarity in the server 2 or 3 according to the first or second embodiment shown in FIG. 6 or FIG. 12.
Each of the servers 2 to 6 shown in FIGS. 6, 12, 19, 22, and 25 may also be configured so that a plurality of devices cooperate with each other via a network to realize each processing function. As an example, the memory unit 21 may be a DB server; the document DB units 31, 41, and 61 may be a combination of an application server and a DB server; and the document input unit 22, the similarity calculation unit 23, the similarity output unit 24, and the document search units 32, 42, and 52 may be a combination of an application server and a Web server. In these cases, the computer, the application server, and the DB server may cooperate with each other via the network to realize the processing functions of the servers 2 to 6.
Further, each of the servers 3 to 6 may have the hardware (HW) configuration of the computer 10 illustrated in FIG. 9.
1, 1A to 1D  Similarity determination system
10  Computer
11  Query
11a  Query document
12  Document set
12a  Comparison target document
13, 14  Result
2 to 6  Server
21  Memory unit
22  Document input unit
23  Similarity calculation unit
24  Similarity output unit
231, 312  Compound name extraction unit
232, 313  Clustering unit
233, 322, 422, 522  Document similarity calculation unit
31, 41, 61  Document DB unit
311  Document storage unit
314  Document cluster vector calculation unit
315  Document cluster vector storage unit
32, 42, 52  Document search unit
321  Search query specification unit
323  Search result generation unit
324  Search result output unit
416  Compound cluster storage unit
525  Keyword input unit
526  Document cluster identification unit
615  Predetermined document cluster vector storage unit
616  Predetermined compound cluster storage unit
617  Predetermined document structure analysis unit

Claims (20)

  1.  A similarity determination program that causes a computer to execute a process comprising:
    calculating, for a first plurality of partial document groups obtained by classifying a first plurality of partial documents, which are obtained by dividing a first document, based on a first plurality of groups obtained by classifying a first plurality of named entities included in the first plurality of partial documents, a first plurality of vectors corresponding respectively to the first plurality of partial document groups, based on words included in each of the first plurality of partial document groups;
    acquiring a second plurality of vectors corresponding respectively to a second plurality of partial document groups obtained by classifying a second plurality of partial documents obtained by dividing a second document; and
    determining a similarity between the first document and the second document based on a comparison between the first plurality of vectors and the second plurality of vectors.
  2.  The similarity determination program according to claim 1, wherein the process of acquiring the second plurality of vectors includes a process of calculating, for the second plurality of partial document groups obtained by classifying the second plurality of partial documents based on a second plurality of groups obtained by classifying a second plurality of named entities included in the second plurality of partial documents, the second plurality of vectors corresponding respectively to the second plurality of partial document groups, based on words included in each of the second plurality of partial document groups.
  3.  The similarity determination program according to claim 2, wherein
    the process of calculating the first plurality of vectors includes a process of generating the first plurality of partial document groups by a clustering process that uses a degree of agreement between the first plurality of named entities included in the first plurality of partial documents and a plurality of named entities included in the first plurality of groups, and
    the process of calculating the second plurality of vectors includes a process of generating the second plurality of partial document groups by a clustering process that uses a degree of agreement between the first plurality of named entities included in the second plurality of partial documents and a plurality of named entities included in the second plurality of groups.
  4.  The similarity determination program according to claim 2 or claim 3, wherein
    the process of calculating the first plurality of vectors includes a process of generating the first plurality of groups by classifying the first plurality of named entities based on the respective positions of the first plurality of named entities and the respective similarities of the first plurality of named entities, and
    the process of calculating the second plurality of vectors includes a process of generating the second plurality of groups by classifying the second plurality of named entities based on the respective positions of the second plurality of named entities and the respective similarities of the second plurality of named entities.
  5.  The similarity determination program according to claim 4, wherein
    the process of generating the first plurality of groups includes a clustering process that uses a value quantifying the closeness of the positions at which each of the first plurality of named entities appears in the first document and the similarity of each of the first plurality of named entities, and
    the process of generating the second plurality of groups includes a clustering process that uses a value quantifying the closeness of the positions at which each of the second plurality of named entities appears in the second document and the similarity of each of the second plurality of named entities.
  6.  The similarity determination program according to any one of claims 1 to 5, wherein the process of determining the similarity includes a process of determining, as the similarity between the first document and the second document, the vector similarity of the combination that has the maximum vector similarity among the combinations of each of the first plurality of vectors with each of the second plurality of vectors.
  7.  The similarity determination program according to any one of claims 1 to 6, wherein the process of determining the similarity includes a process of determining the similarity between the first document and the second document based on:
      a comparison of each of the first plurality of vectors with each of the second plurality of vectors; and
      a comparison of the first plurality of groups with a second plurality of groups obtained by classifying a second plurality of named entities included in the second plurality of partial documents.
  8.  The similarity determination program according to claim 7, wherein the process of comparing the first plurality of groups with the second plurality of groups includes a process of acquiring the group similarity of the combination that has the maximum group similarity among the combinations of each of the first plurality of groups with each of the second plurality of groups.
  9.  The similarity determination program according to any one of claims 1 to 8, wherein the process of determining the similarity includes a process of determining the similarity between the first document and the second document based on a comparison between a partial document group, among the first plurality of partial document groups, that contains a specified keyword and the second plurality of partial document groups.
  10.  The similarity determination program according to any one of claims 1 to 9, wherein the process of determining the similarity includes, when the first document is of a predetermined document type, a process of determining the similarity between the first document and the second document based on a comparison between a partial document group identified based on named entities included in a predetermined portion of the first document and a partial document group identified based on named entities included in the predetermined portion of the second document.
  11.  The first document is a document specified in a search query,
     the second document is one of a plurality of second documents included in a document group to be searched by the search query, and
     the program causes the computer to execute a process of outputting, as a search result of the search query, information on a second document determined according to the plurality of similarities between the first document and each of the plurality of second documents.
    The similarity determination program according to any one of claims 1 to 10.
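    The retrieval step of claim 11 can be illustrated with a minimal sketch. It assumes a generic `similarity` callable standing in for the document similarity determined by the preceding claims; the names `rank_by_similarity`, `jaccard`, and `top_k` are hypothetical, and the word-set Jaccard score is only a placeholder, not the method of the application.

```python
from typing import Callable, List, Sequence, Tuple


def jaccard(doc_a: str, doc_b: str) -> float:
    """Placeholder similarity: Jaccard overlap of lower-cased word sets."""
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_by_similarity(
    query_doc: str,
    candidates: Sequence[str],
    similarity: Callable[[str, str], float] = jaccard,
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every candidate against the query document and return the
    top_k (score, document) pairs, highest score first."""
    scored = [(similarity(query_doc, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    hits = rank_by_similarity(
        "sodium chloride solution",
        ["sodium chloride crystal", "polymer film",
         "sodium chloride solution heated"],
        top_k=2,
    )
    for score, doc in hits:
        print(f"{score:.2f}  {doc}")
```

    In practice the placeholder would be swapped for the group-vector comparison described in the other claims; the ranking logic itself does not depend on how the score is computed.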
  12.  The named entities are compound names, and
     each of the similarities of the first plurality of named entities and each of the similarities of the second plurality of named entities is a structural similarity between compounds.
    The similarity determination program according to any one of claims 2 to 5.
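    Claim 12 does not name a particular structural similarity measure. As one commonly used possibility, the sketch below computes the Tanimoto coefficient over fingerprint feature sets. The fingerprints are assumed to be precomputed sets of integer feature IDs (hypothetical data); deriving them from compound names would require a cheminformatics toolkit and is outside this sketch.

```python
from typing import Iterable


def tanimoto(fp_a: Iterable[int], fp_b: Iterable[int]) -> float:
    """Tanimoto coefficient |A & B| / |A | B| of two feature sets.

    Returns 0.0 when both fingerprints are empty, to avoid dividing by zero.
    """
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical fingerprints: IDs of substructure features present in each compound.
    compound_x = {1, 2, 3}
    compound_y = {2, 3, 4}
    print(tanimoto(compound_x, compound_y))
```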
  13.  A similarity determination device comprising a control unit that:
     calculates, for a first plurality of sub-document groups, a first plurality of vectors respectively corresponding to the first plurality of sub-document groups based on words contained in each of the first plurality of sub-document groups, the first plurality of sub-document groups being obtained by classifying a first plurality of sub-documents, obtained by dividing a first document, based on a first plurality of groups obtained by classifying a first plurality of named entities contained in the first plurality of sub-documents;
     acquires a second plurality of vectors respectively corresponding to a second plurality of sub-document groups obtained by classifying a second plurality of sub-documents obtained by dividing a second document; and
     determines a similarity between the first document and the second document based on a comparison between the first plurality of vectors and the second plurality of vectors.
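    The vector calculation in claim 13 only requires that each sub-document group's vector be based on the words the group contains. A minimal sketch, assuming plain word-count vectors over a shared vocabulary and whitespace tokenization (both are illustrative assumptions, as is the name `group_vectors`):

```python
from collections import Counter
from typing import List, Sequence, Tuple


def group_vectors(
    groups: Sequence[Sequence[str]],
) -> Tuple[List[str], List[List[int]]]:
    """Build one word-count vector per sub-document group.

    groups: each element is a list of sub-document strings that were
    classified into the same group.
    Returns the sorted vocabulary and one count vector per group.
    """
    # Count words over all sub-documents belonging to each group.
    counts = [
        Counter(word for doc in group for word in doc.lower().split())
        for group in groups
    ]
    # Shared vocabulary so that all vectors have the same dimensions.
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


if __name__ == "__main__":
    vocab, vecs = group_vectors(
        [["acid reacts with base", "acid dissolves"], ["polymer film"]]
    )
    print(vocab)
    print(vecs)
```

    A real system would likely use a proper tokenizer and a weighting scheme or learned embeddings; the sketch only shows how one vector per group arises from the group's words.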
  14.  In the process of acquiring the second plurality of vectors, the control unit performs a process of calculating, for the second plurality of sub-document groups obtained by classifying the second plurality of sub-documents based on a second plurality of groups obtained by classifying a second plurality of named entities contained in the second plurality of sub-documents, the second plurality of vectors respectively corresponding to the second plurality of sub-document groups based on words contained in each of the second plurality of sub-document groups.
    The similarity determination device according to claim 13.
  15.  The control unit:
      in the process of calculating the first plurality of vectors, performs a process of generating the first plurality of sub-document groups by a clustering process that uses a degree of matching between the first plurality of named entities contained in the first plurality of sub-documents and a plurality of named entities contained in the first plurality of groups; and
      in the process of calculating the second plurality of vectors, performs a process of generating the second plurality of sub-document groups by a clustering process that uses a degree of matching between the first plurality of named entities contained in the second plurality of sub-documents and a plurality of named entities contained in the second plurality of groups.
    The similarity determination device according to claim 14.
  16.  The control unit:
      in the process of calculating the first plurality of vectors, performs a process of generating the first plurality of groups by classifying the first plurality of named entities based on the position of each of the first plurality of named entities and the similarity of each of the first plurality of named entities; and
      in the process of calculating the second plurality of vectors, performs a process of generating the second plurality of groups by classifying the second plurality of named entities based on the position of each of the second plurality of named entities and the similarity of each of the second plurality of named entities.
    The similarity determination device according to claim 14 or claim 15.
  17.  The control unit:
      in the process of generating the first plurality of groups, performs a clustering process that uses a value quantifying the proximity of the positions at which each of the first plurality of named entities appears in the first document and the similarity of each of the first plurality of named entities; and
      in the process of generating the second plurality of groups, performs a clustering process that uses a value quantifying the proximity of the positions at which each of the second plurality of named entities appears in the second document and the similarity of each of the second plurality of named entities.
    The similarity determination device according to claim 16.
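    The clustering of claim 17 combines two signals: how close two named entities appear in the document and how similar the entities themselves are. The sketch below is one possible reading, not the claimed implementation. The proximity formula, the weighted sum, the threshold, and the single-linkage merging via union-find are all assumptions, and `entity_sim` is a caller-supplied stand-in for whatever entity similarity is used.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def proximity(pos_a: int, pos_b: int, scale: float = 100.0) -> float:
    """Quantify closeness of two appearance positions as a value in (0, 1];
    1.0 means the same position (assumed formula)."""
    return 1.0 / (1.0 + abs(pos_a - pos_b) / scale)


def cluster_entities(
    entities: Sequence[Tuple[str, int]],
    entity_sim: Callable[[str, str], float],
    weight: float = 0.5,
    threshold: float = 0.6,
) -> List[List[str]]:
    """Group (name, position) entities whose combined score reaches threshold.

    Combined score = weight * proximity + (1 - weight) * entity similarity.
    Pairs above the threshold are merged transitively (single linkage).
    """
    parent = list(range(len(entities)))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(entities)), 2):
        (name_i, pos_i), (name_j, pos_j) = entities[i], entities[j]
        score = (weight * proximity(pos_i, pos_j)
                 + (1.0 - weight) * entity_sim(name_i, name_j))
        if score >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, (name, _) in enumerate(entities):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


if __name__ == "__main__":
    # Toy entity similarity: 1.0 if the first four characters match.
    def prefix_sim(a: str, b: str) -> float:
        return 1.0 if a[:4] == b[:4] else 0.0

    mentions = [("ethanol", 0), ("ethane", 10),
                ("benzene", 500), ("benzoic acid", 520)]
    print(cluster_entities(mentions, prefix_sim))
```

    The `weight` parameter lets position and entity similarity be traded off against each other; the claim itself leaves that balance open.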
  18.  In the process of determining the similarity, the control unit performs a process of determining, as the similarity between the first document and the second document, the vector similarity of the combination having the maximum vector similarity among the combinations of each of the first plurality of vectors with each of the second plurality of vectors.
    The similarity determination device according to any one of claims 13 to 17.
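    The maximum-over-combinations rule of claim 18 can be sketched as follows. Cosine similarity is assumed as the vector similarity, since the claim does not fix a measure, and the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def document_similarity(
    first_vectors: Sequence[Sequence[float]],
    second_vectors: Sequence[Sequence[float]],
) -> float:
    """Return the largest pairwise similarity between any vector of the first
    document and any vector of the second document (0.0 if either is empty)."""
    return max(
        (cosine(u, v) for u in first_vectors for v in second_vectors),
        default=0.0,
    )


if __name__ == "__main__":
    first = [[1.0, 0.0], [0.0, 1.0]]
    second = [[0.0, 2.0], [1.0, 1.0]]
    print(document_similarity(first, second))
```

    Taking the maximum means that two documents score highly as soon as one sub-document group in each matches well, which is the partial-similarity case the application targets.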
  19.  In the process of determining the similarity, the control unit performs a process of determining the similarity between the first document and the second document based on:
      a comparison between each of the first plurality of vectors and each of the second plurality of vectors; and
      a comparison between the first plurality of groups and a second plurality of groups obtained by classifying a second plurality of named entities contained in the second plurality of sub-documents.
    The similarity determination device according to any one of claims 13 to 18.
  20.  A similarity determination method in which a computer executes a process comprising:
     calculating, for a first plurality of sub-document groups, a first plurality of vectors respectively corresponding to the first plurality of sub-document groups based on words contained in each of the first plurality of sub-document groups, the first plurality of sub-document groups being obtained by classifying a first plurality of sub-documents, obtained by dividing a first document, based on a first plurality of groups obtained by classifying a first plurality of named entities contained in the first plurality of sub-documents;
     acquiring a second plurality of vectors respectively corresponding to a second plurality of sub-document groups obtained by classifying a second plurality of sub-documents obtained by dividing a second document; and
     determining a similarity between the first document and the second document based on a comparison between the first plurality of vectors and the second plurality of vectors.
PCT/JP2020/047219 2020-12-17 2020-12-17 Similarity determination program, similarity determination device, and similarity determination method WO2022130579A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/047219 WO2022130579A1 (en) 2020-12-17 2020-12-17 Similarity determination program, similarity determination device, and similarity determination method
JP2022569435A JPWO2022130579A1 (en) 2020-12-17 2020-12-17

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/047219 WO2022130579A1 (en) 2020-12-17 2020-12-17 Similarity determination program, similarity determination device, and similarity determination method

Publications (1)

Publication Number Publication Date
WO2022130579A1 true WO2022130579A1 (en) 2022-06-23

Family

ID=82057430

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/047219 WO2022130579A1 (en) 2020-12-17 2020-12-17 Similarity determination program, similarity determination device, and similarity determination method

Country Status (2)

Country Link
JP (1) JPWO2022130579A1 (en)
WO (1) WO2022130579A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115659945A (en) * 2022-12-22 2023-01-31 南方电网科学研究院有限责任公司 Standard document similarity detection method, device and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11272680A (en) * 1998-03-19 1999-10-08 Fujitsu Ltd Document data providing device and program recording medium thereof
JP2000112949A (en) * 1998-09-30 2000-04-21 Fuji Xerox Co Ltd Information discrimination supporting device and record medium recording similar information discrimination supporting program
JP2002259411A (en) * 2001-03-06 2002-09-13 Nec Corp Text information conversion system, text information conversion method and text information conversion program
JP2008009671A (en) * 2006-06-29 2008-01-17 National Institute Of Information & Communication Technology Data display device, data display method and data display program
JP2013020431A (en) * 2011-07-11 2013-01-31 Nec Corp Polysemic word extraction system, polysemic word extraction method and program
JP2016045552A (en) * 2014-08-20 2016-04-04 富士通株式会社 Feature extraction program, feature extraction method, and feature extraction device

Also Published As

Publication number Publication date
JPWO2022130579A1 (en) 2022-06-23

Similar Documents

Publication Publication Date Title
Yilmaz et al. Applying BERT to document retrieval with birch
US10394851B2 (en) Methods and systems for mapping data items to sparse distributed representations
US8775442B2 (en) Semantic search using a single-source semantic model
US10353925B2 (en) Document classification device, document classification method, and computer readable medium
US20230147941A1 (en) Method, apparatus and device used to search for content
KR102046692B1 (en) Method and System for Entity summarization based on multilingual projected entity space
Jin et al. Entity linking at the tail: sparse signals, unknown entities, and phrase models
JP6420268B2 (en) Image evaluation learning device, image evaluation device, image search device, image evaluation learning method, image evaluation method, image search method, and program
Hare et al. Imageterrier: an extensible platform for scalable high-performance image retrieval
CN111143400B (en) Full stack type retrieval method, system, engine and electronic equipment
Xu et al. Learning to refine expansion terms for biomedical information retrieval using semantic resources
KR20220025540A (en) Method and apparatus for summarizing document using keyword clustering
CN111373386A (en) Similarity index value calculation device, similarity search device, and similarity index value calculation program
WO2022130579A1 (en) Similarity determination program, similarity determination device, and similarity determination method
JP5869948B2 (en) Passage dividing method, apparatus, and program
US10394870B2 (en) Search method
WO2022130578A1 (en) Similarity determination program, similarity determination device, and similarity determination method
JPWO2020157887A1 (en) Sentence structure vectorization device, sentence structure vectorization method, and sentence structure vectorization program
WO2015125209A1 (en) Information structuring system and information structuring method
WO2021044519A1 (en) Information processing device, program, and information processing method
Wang et al. Citationas: A summary generation tool based on clustering of retrieved citation content
JP2011248827A (en) Cross-lingual information searching method, cross-lingual information searching system and cross-lingual information searching program
Ping et al. Research on search ranking technology of chinese electronic medical record based on AdaRank
Matos et al. Classification methods for finding articles describing protein-protein interactions in PubMed
KR20190136292A (en) Method and Apparatus for Processing Data Based on Intelligent Data Structure

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20965966

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2022569435

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20965966

Country of ref document: EP

Kind code of ref document: A1