WO2022172334A1 - Information processing device, extraction method, and extraction program - Google Patents

Information processing device, extraction method, and extraction program

Info

Publication number
WO2022172334A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
document vector
words
information processing
vector
Prior art date
Application number
PCT/JP2021/004792
Other languages
French (fr)
Japanese (ja)
Inventor
浩 宮尾
Original Assignee
日本電信電話株式会社
Priority date
Filing date
Publication date
Application filed by 日本電信電話株式会社
Priority to PCT/JP2021/004792
Publication of WO2022172334A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present invention relates to an information processing device, an extraction method, and an extraction program.
  • FIG. 8 is a diagram for explaining prior art 1.
  • document vectors are generated based on the words appearing in the document and the appearance frequency of the words, and study materials close to the contents of the specification are extracted based on the cosine similarity of each document vector. For example, let the document vectors of specifications A, B, and C be VdA, VdB, and VdC, respectively. Let VsD, VsE, and VsF be the document vectors of study materials D, E, and F, respectively.
  • the cosine similarity between the document vector VdA of the specification A and each of the document vectors VsD to VsF of the study materials D to F is calculated, and the study material corresponding to the specification A is extracted based on the pairs whose cosine similarity is equal to or greater than the threshold. For example, when the cosine similarity between the document vector VdA of the specification A and the document vector VsD of the study material D is equal to or greater than the threshold, the study material D is extracted as the study material corresponding to the specification A. For the other specifications B and C, study materials are extracted in the same manner.
  • FIG. 9 is a diagram for explaining prior art 2.
  • the topic of the document is analyzed, and study materials close to the content of the specification are extracted based on the distance between the topics of the documents.
  • the topics of specifications A, B, and C and the topics of study materials D, E, and F are analyzed, and the breakdown of topics is calculated and vectorized.
  • based on the vectorized values, mapping is performed on the graph G1.
  • the horizontal axis of the graph G1 corresponds to the value of the first topic, and the vertical axis corresponds to the value of the second topic.
  • specifications A, B, and C are mapped to pA, pB, and pC of graph G1, respectively, and study materials D, E, and F are mapped to pD, pE, and pF of graph G1, respectively.
  • since the distance between pA and pD is short, study material D is extracted as the study material corresponding to specification A.
  • even when cosine similarity is used, the number of words that are key to a document (hereinafter referred to as "key words") may be small, and the appearance frequency of such key words may be low.
  • even if figures, tables, etc. are included in order to increase the volume of the document, the number of words other than key words (common words that also appear in other documents) increases.
  • as a result, the cosine similarity does not increase even for study material that should be detected, and study materials cannot be appropriately extracted.
  • an information processing apparatus includes: a calculation unit that calculates the similarity between a first document vector, generated based on the words included in a first document and the appearance frequency of those words, and each of a plurality of second document vectors, generated for a plurality of second documents based on the words included in each second document and the appearance frequency of those words; and an extraction unit that, if there is no second document vector whose similarity with the first document vector is equal to or greater than a threshold, counts the number of common words shared between the words set in the first document vector and the words set in each second document vector, and extracts a second document based on the counting result.
  • FIG. 1 is a diagram for explaining the processing of the information processing apparatus according to the embodiment.
  • FIG. 2 is a functional block diagram showing the configuration of the information processing apparatus according to this embodiment.
  • FIG. 3 is a diagram showing an example of the data structure of a specification table.
  • FIG. 4 is a diagram illustrating an example of the data structure of a study material table.
  • FIG. 5 is a flow chart showing the processing procedure of the information processing apparatus according to the embodiment.
  • FIG. 6 is a flowchart showing the procedure of extraction processing.
  • FIG. 7 is a diagram showing an example of a computer that executes an extraction program.
  • FIG. 8 is a diagram for explaining prior art 1.
  • FIG. 9 is a diagram for explaining prior art 2.
  • FIG. 1 is a diagram for explaining the processing of the information processing apparatus according to this embodiment.
  • specifications A, B, and C and study materials D, E, and F will be used as an example for explanation.
  • the information processing device generates a document vector based on the words appearing in the document and the appearance frequency of the words. In this embodiment, a word is set for each element (dimension) of the document vector.
  • the document vector of the specification is referred to as the "first document vector".
  • the document vector of the study material is referred to as the "second document vector".
  • when the first document vectors of the specifications are indicated individually, the document vectors of the specifications A, B, and C are denoted VdA, VdB, and VdC, respectively.
  • when the second document vectors of the study materials are indicated individually, the document vectors of the study materials D, E, and F are denoted VsD, VsE, and VsF, respectively.
  • the information processing device calculates cosine similarity between the first document vector of the specification and the second document vector of each study material.
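  • the information processing device calculates the cosine similarity using Equation (1), the standard cosine similarity: $\cos(V_{dx}, V_{sy}) = \dfrac{V_{dx} \cdot V_{sy}}{\lVert V_{dx} \rVert \, \lVert V_{sy} \rVert}$
  • Vdx indicates the document vector of the specification.
  • Vsy indicates the document vector of the study material.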
  • the information processing device calculates the cosine similarity between the first document vector and each second document vector, and determines whether or not there is a second document vector whose cosine similarity with the first document vector is equal to or greater than a threshold. If there is a second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold, the information processing apparatus extracts the study material for that second document vector.
  • if there is no second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold, the information processing device executes the following process.
  • the information processing device counts the number of words shared between the words set in the elements of the first document vector of the specification and the words set in the elements of the second document vector of each study material. In the following description, such shared words are referred to as "common words".
  • if exactly one study material has the maximum number of common words with the specification, the information processing device extracts that study material.
  • if multiple study materials share the maximum number of common words, the information processing apparatus identifies, among the second document vectors of those study materials, the one having the maximum cosine similarity with the first document vector, and extracts the study material for the identified second document vector.
  • the information processing device calculates the cosine similarity between the document vector VdA of the specification A and the document vectors VsD, VsE, and VsF of the study materials D, E, and F, respectively.
  • when the cosine similarity between the document vector VdA of the specification A and the document vector VsD of the study material D is equal to or greater than the threshold, the information processing apparatus extracts the study material D for the specification A.
  • the information processing device calculates the cosine similarity between the document vector VdB of the specification B and the document vectors VsD, VsE, and VsF of the study materials D, E, and F, respectively.
  • when no second document vector has a cosine similarity with VdB equal to or greater than the threshold, the information processing apparatus counts the number of common words between the document vector VdB of the specification B and the document vector VsD of the study material D.
  • the information processing device counts the number of common words between the document vector VdB of the specification B and the document vector VsE of the study material E.
  • the information processing device counts the number of common words between the document vector VdB of the specification B and the document vector VsF of the study material F.
  • when the number of common words with the study material E is the maximum, the study material E is extracted as the study material from which the specification B was created.
  • the information processing apparatus calculates the cosine similarity between the document vector VdC of the specification C and the document vectors VsD, VsE, and VsF of the study materials D, E, and F, respectively.
  • when no second document vector has a cosine similarity with VdC equal to or greater than the threshold, the information processing apparatus counts the number of common words between the document vector VdC of the specification C and the document vector VsD of the study material D.
  • the information processing device counts the number of common words between the document vector VdC of the specification C and the document vector VsE of the study material E.
  • the information processing device counts the number of common words between the document vector VdC of the specification C and the document vector VsF of the study material F.
  • when the number of common words between VdC and VsD and the number of common words between VdC and VsF are both the maximum (that is, when multiple pairs share the maximum number of common words), the information processing device performs the following processing.
  • the information processing device compares the cosine similarity of the document vectors VdC and VsD with the cosine similarity of the document vectors VdC and VsF. If the cosine similarity of VdC and VsF is larger, the study material F is extracted as the study material from which the specification C was created.
  • the information processing apparatus calculates the cosine similarity between the first document vector of the specification and the second document vector of each study material, and determines whether there is a second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold. If there is such a second document vector, the information processing apparatus extracts the study material for that second document vector.
  • if there is no such second document vector, the information processing apparatus compares the first document vector of the specification with the second document vector of each study material and counts the number of common words. If exactly one study material has the maximum number of common words with the specification, the information processing apparatus extracts that study material.
  • if multiple study materials share the maximum number of common words, the information processing apparatus identifies, among the second document vectors of those study materials, the second document vector having the maximum cosine similarity with the first document vector, and extracts the study material for the identified second document vector.
  • the information processing apparatus can improve the extraction accuracy of study materials by extracting study materials corresponding to specifications from the viewpoint of cosine similarity and the number of common words.
  • FIG. 2 is a functional block diagram showing the configuration of the information processing apparatus according to this embodiment.
  • this information processing apparatus 100 has a communication unit 110, an input unit 120, a display unit 130, a storage unit 140, and a control unit 150.
  • the communication unit 110 is a communication interface that transmits and receives various types of information to and from an external device connected via a network or the like.
  • the communication unit 110 is realized by a NIC (Network Interface Card) or the like, and performs communication between an external device and the control unit 150 via an electric communication line such as a LAN (Local Area Network) or the Internet.
  • the input unit 120 is an input interface that receives various operations from the operator of the information processing device 100.
  • it is composed of input devices such as a keyboard and a mouse.
  • the display unit 130 is an output device that outputs information acquired from the control unit 150, and is realized by a display device such as a liquid crystal display, a printing device such as a printer, and the like.
  • the storage unit 140 has a specification table 141 and a study material table 142.
  • the storage unit 140 is implemented by a semiconductor memory device such as RAM (Random Access Memory) or flash memory, or a storage device such as a hard disk or optical disk.
  • the specification table 141 is a table that holds information on specifications.
  • FIG. 3 is a diagram showing an example of the data structure of a specification table. As shown in FIG. 3, this specification table 141 has a specification number, a specification, and a first document vector.
  • the specification number is information that identifies the specification. For example, specifications corresponding to specification numbers M10A to M10C are assumed to be specifications A to C, respectively.
  • the specification is document information (text data) of the specification.
  • the first document vector is a vector generated based on the words included in the specification and the frequencies of these words. The first document vector is generated by the generation unit 151, which will be described later.
  • the study material table 142 is a table that holds information about study materials.
  • FIG. 4 is a diagram illustrating an example of the data structure of a study material table. As shown in FIG. 4, the study material table 142 has study material numbers, study materials, and second document vectors.
  • the study material number is information that identifies the study material. For example, study materials corresponding to study material numbers M10D to M10F are assumed to be study materials D to F, respectively.
  • the study material is the document information (text data) that the user referred to when creating the specification.
  • the second document vector is a vector generated based on the words contained in the study material and the frequencies of these words. The second document vector is generated by the generation unit 151, which will be described later.
  • the control unit 150 is implemented using a CPU (Central Processing Unit) or the like.
  • the control unit 150 has a generation unit 151, a calculation unit 152, and an extraction unit 153.
  • the generation unit 151 generates document vectors from document information such as specifications and study materials.
  • the generation unit 151 extracts words by morphologically analyzing the document information of the specifications stored in the specification table 141, and generates a first document vector based on the extracted words and the frequency of the words.
  • the generation unit 151 registers the generated first document vector in the specification table 141.
  • the generation unit 151 repeatedly executes the above process for each specification stored in the specification table 141.
  • the generation unit 151 extracts words by morphologically analyzing the document information of the study materials stored in the study material table 142, and generates a second document vector based on the extracted words and the frequency of the words.
  • the generation unit 151 registers the generated second document vector in the study material table 142.
  • the generation unit 151 repeatedly executes the above process for each study material stored in the study material table 142.
  • the generation unit 151 may generate a document vector by any method.
  • for example, the generation unit 151 generates document vectors based on the technique described in Non-Patent Document 1.
  • the calculation unit 152 calculates the cosine similarity between the first document vector of the specification and the second document vector of each study material.
  • the calculation unit 152 calculates the cosine similarity using Equation (1) described above.
  • the calculation unit 152 outputs the calculation result of the cosine similarity to the extraction unit 153.
  • the calculation result associates the specification number of the selected specification and the study material number of each study material with the corresponding cosine similarity.
  • the user may operate the input unit 120 to select the specifications, or the calculation unit 152 may select the specifications in a predetermined order.
  • in the following description, a specification that has been selected is referred to as a "selected specification".
  • the extraction unit 153 extracts study materials corresponding to the selected specifications based on the cosine similarity calculation results. The extraction unit 153 determines whether or not there is a set in which the cosine similarity between the first document vector of the selected specification and the second document vector of each study material is equal to or greater than a threshold.
  • if there is a pair whose cosine similarity is equal to or greater than the threshold, the extraction unit 153 uses the study material number corresponding to that second document vector to extract the relevant study material from the study material table 142.
  • if there is no pair whose cosine similarity is equal to or greater than the threshold, the extraction unit 153 compares the first document vector of the selected specification with the second document vector of each study material and counts the number of common words.
  • if exactly one second document vector has the maximum number of common words, the extraction unit 153 uses the study material number corresponding to that second document vector to extract the relevant study material from the study material table 142.
  • if multiple second document vectors share the maximum number of common words, the extraction unit 153 identifies, among them, the second document vector having the maximum cosine similarity with the first document vector. The extraction unit 153 extracts the corresponding study material from the study material table 142 by using the study material number for the identified second document vector.
  • the extraction unit 153 may cause the display unit 130 to display information that associates the selected specifications with the extracted study material.
  • FIG. 5 is a flow chart showing the processing procedure of the information processing apparatus according to the embodiment.
  • the generation unit 151 of the information processing apparatus 100 generates a first document vector of the specification (step S101).
  • the generation unit 151 generates a second document vector of study material (step S102).
  • the calculation unit 152 of the information processing device 100 accepts the specification selection (step S103).
  • the calculation unit 152 acquires the first document vector of the selected specification (step S104).
  • the calculation unit 152 calculates the cosine similarity between the first document vector of the selected specification and each second document vector (step S105).
  • the extraction unit 153 of the information processing apparatus 100 determines whether or not there is a second document vector whose cosine similarity with the first document vector is equal to or greater than a threshold (step S106). If there is a second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold (step S106, Yes), the extraction unit 153 proceeds to step S107.
  • the extraction unit 153 extracts study materials of the second document vectors whose cosine similarity is equal to or greater than the threshold (step S107).
  • if there is no second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold value (step S106, No), the extraction unit 153 executes extraction processing (step S108).
  • FIG. 6 is a flowchart showing the procedure of extraction processing.
  • the extraction unit 153 of the information processing apparatus 100 counts the number of common words between the first document vector and the second document vector (step S201).
  • the extraction unit 153 determines whether or not there are a plurality of second document vectors having the maximum number of common words (step S202). If there are not a plurality of second document vectors with the maximum number of common words (step S202, No), the extracting unit 153 extracts study material for the second document vector with the largest number of common words (step S203).
  • if there are a plurality of second document vectors with the maximum number of common words (step S202, Yes), the extraction unit 153 proceeds to step S204.
  • the extraction unit 153 identifies, among the second document vectors having the maximum number of common words, the second document vector whose cosine similarity with the first document vector is the maximum (step S204).
  • the extraction unit 153 extracts the study material of the identified second document vector (step S205).
  • the information processing apparatus 100 calculates the cosine similarity between the first document vector of the specification and the second document vector of each study material, and determines whether there is a second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold. If there is such a second document vector, the information processing apparatus extracts the study material for that second document vector. As a result, if there is a pair whose cosine similarity is equal to or greater than the threshold, it is possible to extract study material based on the cosine similarity.
  • if there is no such second document vector, the information processing apparatus 100 compares the first document vector of the specification with the second document vector of each study material and counts the number of common words. If exactly one study material has the maximum number of common words with the specification, the information processing apparatus extracts that study material. As a result, even if there is no pair whose cosine similarity is equal to or greater than the threshold value, it is possible to extract study materials based on the number of common words.
  • if multiple study materials share the maximum number of common words, the information processing apparatus 100 identifies, among the second document vectors of those study materials, the second document vector having the maximum cosine similarity with the first document vector, and extracts the study material for the identified second document vector.
  • as a result, even when multiple study materials share the maximum number of common words, the cosine similarity can be further used to extract study material.
  • with the information processing apparatus 100, it is possible to improve the accuracy of extracting the study materials that were used as a basis for creating specifications.
  • FIG. 7 is a diagram showing an example of a computer that executes an extraction program.
  • Computer 1000 has, for example, memory 1010, CPU 1020, hard disk drive interface 1030, disk drive interface 1040, serial port interface 1050, video adapter 1060, and network interface 1070. These units are connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
  • Hard disk drive interface 1030 is connected to hard disk drive 1031.
  • Disk drive interface 1040 is connected to disk drive 1041.
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041, for example.
  • a mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example.
  • a display 1061 is connected to the video adapter 1060.
  • the hard disk drive 1031 stores an OS 1091, application programs 1092, program modules 1093 and program data 1094, for example. Each piece of information described in the above embodiment is stored in the hard disk drive 1031 or memory 1010, for example.
  • the extraction program is stored in the hard disk drive 1031, for example, as a program module 1093 in which commands to be executed by the computer 1000 are written.
  • the hard disk drive 1031 stores a program module 1093 in which each process executed by the information processing apparatus 100 described in the above embodiment is described.
  • Data used for information processing by the extraction program is stored as program data 1094 in the hard disk drive 1031, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the hard disk drive 1031 to the RAM 1012 as necessary, and executes each procedure described above.
  • the program module 1093 and program data 1094 related to the extraction program are not limited to being stored in the hard disk drive 1031.
  • for example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like.
  • alternatively, the program module 1093 and program data 1094 related to the extraction program may be stored in another computer connected via a network such as a LAN or WAN (Wide Area Network), and read out by the CPU 1020 via the network interface 1070.
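
The extraction procedure described above (threshold check on cosine similarity, fallback to the common-word count, tie-break by cosine similarity) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the information processing apparatus 100: the function names, the use of `collections.Counter` as a sparse document vector, and the sample words and threshold values are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse word-frequency vectors."""
    dot = sum(freq * v[word] for word, freq in u.items() if word in v)
    norm_u = math.sqrt(sum(f * f for f in u.values()))
    norm_v = math.sqrt(sum(f * f for f in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def count_common_words(u: Counter, v: Counter) -> int:
    """Number of distinct words set in both document vectors."""
    return len(set(u) & set(v))


def extract_study_materials(spec_vec, material_vecs, threshold):
    """Return the names of the study materials extracted for one specification."""
    if not material_vecs:
        return []
    sims = {name: cosine_similarity(spec_vec, vec)
            for name, vec in material_vecs.items()}

    # Steps S106/S107: extract every material at or above the threshold.
    above = [name for name, sim in sims.items() if sim >= threshold]
    if above:
        return above

    # Step S201: otherwise count common words for each material.
    common = {name: count_common_words(spec_vec, vec)
              for name, vec in material_vecs.items()}
    best = max(common.values())
    tied = [name for name, count in common.items() if count == best]

    # Steps S202/S203: a unique maximum decides the result.
    if len(tied) == 1:
        return tied

    # Steps S204/S205: break ties by the highest cosine similarity.
    return [max(tied, key=lambda name: sims[name])]


# Hypothetical data: specification B shares few key words with any material.
spec_b = Counter({"login": 1, "timeout": 1, "the": 5, "system": 4})
materials = {
    "D": Counter({"the": 9, "system": 1, "billing": 6}),
    "E": Counter({"login": 1, "timeout": 1, "the": 2, "retry": 8}),
    "F": Counter({"report": 7, "the": 1}),
}
print(extract_study_materials(spec_b, materials, threshold=0.9))
```

With this sample data no cosine similarity reaches 0.9, so the sketch falls back to the common-word count and selects study material E, which shares three words with the specification.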

Landscapes

  • Engineering & Computer Science
  • Theoretical Computer Science
  • Physics & Mathematics
  • Data Mining & Analysis
  • Databases & Information Systems
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Library & Information Science
  • Mathematical Physics
  • Computational Linguistics
  • Information Retrieval, Db Structures And Fs Structures Therefor

Abstract

This information processing device calculates the similarity between a first document vector, generated based on the words included in a first document and the occurrence frequency of those words, and each of a plurality of second document vectors, generated for a plurality of second documents based on the words included in each second document and the occurrence frequency of those words. When there is no second document vector whose similarity with the first document vector is greater than or equal to a threshold value, the information processing device counts the number of common words shared between the words set in the first document vector and the words set in each second document vector, and extracts a second document based on the count result.

Description

Information processing device, extraction method, and extraction program
The present invention relates to an information processing device, an extraction method, and an extraction program.
When creating specifications (development documents), it is necessary to eliminate omissions and ambiguity through reviews in order to ensure the quality of the specifications. To eliminate such omissions and ambiguities, the materials studied in advance (hereinafter referred to as study materials) must be accurately referenced when a specification is created, and the specification must be checked against them. However, because the parts of the sections and chapters in a specification are written at different times, the study materials cannot always be referenced accurately, and oversights may occur.
Prior arts 1 and 2 automatically extract the study materials corresponding to a document by exploiting the fact that the contents of a specification and the contents of its study materials are very similar.
FIG. 8 is a diagram for explaining prior art 1. In prior art 1, document vectors are generated based on the words appearing in each document and the appearance frequency of the words, and study materials close to the contents of the specification are extracted based on the cosine similarity between the document vectors. For example, let the document vectors of specifications A, B, and C be VdA, VdB, and VdC, respectively. Let VsD, VsE, and VsF be the document vectors of study materials D, E, and F, respectively.
In prior art 1, the cosine similarity between the document vector VdA of the specification A and each of the document vectors VsD to VsF of the study materials D to F is calculated, and the study material corresponding to the specification A is extracted based on the pairs whose cosine similarity is equal to or greater than the threshold. For example, when the cosine similarity between the document vector VdA of the specification A and the document vector VsD of the study material D is equal to or greater than the threshold, the study material D is extracted as the study material corresponding to the specification A. For the other specifications B and C, study materials are extracted in the same manner.
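
As a rough sketch of prior art 1, the following Python builds word-frequency document vectors and keeps the study materials whose cosine similarity with the specification reaches a threshold. Several things here are assumptions made for illustration: the regex tokenizer stands in for whatever word extraction the prior art actually uses, and the function names, sample texts, and threshold of 0.5 are hypothetical.

```python
import math
import re
from collections import Counter


def document_vector(text: str) -> Counter:
    """Word-frequency vector; one dimension per distinct word.

    Assumption: a simple lowercase alphanumeric tokenizer is used here
    purely as a stand-in for real word extraction.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    dot = sum(freq * v[word] for word, freq in u.items() if word in v)
    norm_u = math.sqrt(sum(f * f for f in u.values()))
    norm_v = math.sqrt(sum(f * f for f in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def extract_by_threshold(spec_text, material_texts, threshold):
    """Names of study materials whose similarity is at or above the threshold."""
    spec_vec = document_vector(spec_text)
    return [
        name
        for name, text in material_texts.items()
        if cosine_similarity(spec_vec, document_vector(text)) >= threshold
    ]


# Hypothetical specification A and study materials D and E.
spec_a = "session timeout after login"
study_materials = {
    "D": "login session timeout policy",
    "E": "monthly billing report",
}
print(extract_by_threshold(spec_a, study_materials, threshold=0.5))
```

In this toy data, study material D shares three of its four words with specification A, so it clears the threshold, while E shares none.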
 FIG. 9 is a diagram for explaining prior art 2. In prior art 2, the topics of each document are analyzed, and study materials close in content to a specification are extracted based on the distance between the topics of the documents. For example, prior art 2 analyzes the topics of specifications A, B, and C and of study materials D, E, and F, calculates the topic breakdown of each, and vectorizes it. The vectorized values are then mapped onto a graph G1, whose horizontal axis corresponds to the value of a first topic and whose vertical axis corresponds to the value of a second topic.
 For example, specifications A, B, and C are mapped to points pA, pB, and pC of graph G1, respectively, and study materials D, E, and F are mapped to points pD, pE, and pF of graph G1, respectively. Because pA and pD are close to each other, study material D is extracted as the study material corresponding to specification A.
 For example, the description of functions and requirements in each section of a specification is only a few hundred characters long, which makes it a relatively short document. When topic analysis is performed on such documents, the results all resemble one another, no marked difference appears between the topics of the documents, and the study materials corresponding to a specification cannot be extracted appropriately.
 When cosine similarity is used, on the other hand, a document may contain few of its key words (hereinafter, "key words"), and those key words may occur infrequently. Even if figures, tables, and the like are included to increase the volume of the document, what increases is the number of words other than key words (common words shared with other documents). As a result, the cosine similarity does not rise even for a study material that should be detected, and the study material cannot be extracted appropriately.
 The present invention has been made in view of the above, and its object is to provide an information processing device, an extraction method, and an extraction program capable of improving the accuracy of extracting the study materials from which a specification was created.
 To solve the above problem and achieve the object, an information processing device according to the present invention includes: a calculation unit that calculates the similarity between a first document vector, generated based on the words contained in a first document and the frequencies of occurrence of those words, and each of a plurality of second document vectors, generated for a plurality of second documents based on the words contained in each second document and the frequencies of occurrence of those words; and an extraction unit that, when there is no second document vector whose similarity to the first document vector is equal to or greater than a threshold, counts, for each second document vector, the number of common words shared between the words set in the first document vector and the words set in that second document vector, and extracts a second document based on the counting result.
 The accuracy of extracting the study materials from which a specification was created can be improved.
FIG. 1 is a diagram for explaining the processing of the information processing device according to the present embodiment.
FIG. 2 is a functional block diagram showing the configuration of the information processing device according to the present embodiment.
FIG. 3 is a diagram showing an example of the data structure of the specification table.
FIG. 4 is a diagram showing an example of the data structure of the study material table.
FIG. 5 is a flowchart showing the processing procedure of the information processing device according to the present embodiment.
FIG. 6 is a flowchart showing the processing procedure of the extraction process.
FIG. 7 is a diagram showing an example of a computer that executes the extraction program.
FIG. 8 is a diagram for explaining prior art 1.
FIG. 9 is a diagram for explaining prior art 2.
 An embodiment of the information processing device, extraction method, and extraction program disclosed in the present application is described in detail below with reference to the drawings. The present invention is not limited by this embodiment.
 FIG. 1 is a diagram for explaining the processing of the information processing device according to the present embodiment. FIG. 1 uses specifications A, B, and C and study materials D, E, and F as an example. The information processing device generates a document vector based on the words appearing in a document and the frequencies of occurrence of those words. In the present embodiment, a word is set for each element (dimension) of a document vector.
 For convenience of explanation, the document vector of a specification is referred to as a "first document vector", and the document vector of a study material is referred to as a "second document vector". When the first document vectors are indicated individually, the document vectors of specifications A, B, and C are denoted VdA, VdB, and VdC, respectively. When the second document vectors are indicated individually, the document vectors of study materials D, E, and F are denoted VsD, VsE, and VsF, respectively.
 When extracting the study material corresponding to a specification, the information processing device calculates the cosine similarity between the first document vector of the specification and the second document vector of each study material. The information processing device calculates the cosine similarity by equation (1), where "Vdx" denotes the document vector of a specification and "Vsy" denotes the document vector of a study material.
 (Vdx · Vsy) / (|Vdx| |Vsy|) ... (1)
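 Equation (1) can be sketched in Python as follows. This is a minimal illustration, assuming that each document vector is held as a sparse mapping from word to frequency; the function name `cosine_similarity` and the handling of an empty vector are assumptions made for this sketch and are not taken from the source.

```python
import math
from typing import Mapping


def cosine_similarity(v_dx: Mapping[str, float], v_sy: Mapping[str, float]) -> float:
    """Cosine similarity of two sparse word-frequency vectors, as in equation (1)."""
    # Dot product: only words present in both vectors contribute.
    dot = sum(freq * v_sy.get(word, 0.0) for word, freq in v_dx.items())
    norm_dx = math.sqrt(sum(f * f for f in v_dx.values()))
    norm_sy = math.sqrt(sum(f * f for f in v_sy.values()))
    if norm_dx == 0.0 or norm_sy == 0.0:
        # Assumption: a vector with no words is treated as having similarity 0.
        return 0.0
    return dot / (norm_dx * norm_sy)
```

 Holding the vectors as mappings keeps the word attached to each dimension, which matches the statement that a word is set for each element of a document vector.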
 The information processing device calculates the cosine similarity between the first document vector and each second document vector, and determines whether there is a second document vector whose cosine similarity to the first document vector is equal to or greater than a threshold. When such a second document vector exists, the information processing device extracts the study material of that second document vector.
 When there is no second document vector whose cosine similarity to the first document vector is equal to or greater than the threshold, on the other hand, the information processing device performs the following processing. The information processing device counts the number of words shared between the words of the elements of the first document vector of the specification and the words of the elements of the second document vector of each study material. In the following description, such shared words are referred to as "common words".
 When exactly one study material has the largest number of common words, the information processing device extracts that study material.
 When a plurality of study materials have the largest number of common words, the information processing device identifies, among the second document vectors of those study materials, the second document vector whose cosine similarity to the first document vector is the largest, and extracts the study material of the identified second document vector.
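 The fallback described above can be sketched as follows. This is an illustrative sketch only: it assumes that a word is "set" in a vector when it appears as a key of a word-to-frequency mapping, and that the cosine similarities have already been computed. The function name `extract_by_common_words` and its parameters are hypothetical.

```python
from typing import Dict, Mapping, Optional


def extract_by_common_words(
    first_vector: Mapping[str, float],
    second_vectors: Mapping[str, Mapping[str, float]],
    cosines: Mapping[str, float],
) -> Optional[str]:
    """Return the id of the study material chosen by the common-word fallback."""
    if not second_vectors:
        return None  # assumption: nothing to extract when there are no candidates
    # Count the words shared by the first vector and each second vector.
    counts: Dict[str, int] = {
        doc_id: len(first_vector.keys() & vector.keys())
        for doc_id, vector in second_vectors.items()
    }
    best = max(counts.values())
    tied = [doc_id for doc_id, n in counts.items() if n == best]
    if len(tied) == 1:
        return tied[0]
    # Several candidates share the maximum: break the tie by cosine similarity.
    return max(tied, key=lambda doc_id: cosines[doc_id])
```

 The behavior when two tied candidates also have equal cosine similarity is not stated in the source; in this sketch the first such candidate in iteration order is returned.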
 An example of the processing by which the information processing device extracts the study material corresponding to specification A is described. The information processing device calculates the cosine similarity between the document vector VdA of specification A and each of the document vectors VsD, VsE, and VsF of study materials D, E, and F.
 For example, when the cosine similarity between the document vector VdA of specification A and the document vector VsD of study material D is equal to or greater than the threshold, the information processing device extracts study material D as the study material from which specification A was created.
 An example of the processing by which the information processing device extracts the study material corresponding to specification B is described. The information processing device calculates the cosine similarity between the document vector VdB of specification B and each of the document vectors VsD, VsE, and VsF of study materials D, E, and F.
 When there is no second document vector whose cosine similarity to the document vector VdB is equal to or greater than the threshold, the information processing device counts the number of common words between the document vector VdB of specification B and the document vector VsD of study material D. It likewise counts the number of common words between VdB and the document vector VsE of study material E, and between VdB and the document vector VsF of study material F.
 When the number of common words between the document vector VdB of specification B and the document vector VsE of study material E is the largest and exceeds the other counts, the information processing device extracts study material E as the study material from which specification B was created.
 An example of the processing by which the information processing device extracts the study material corresponding to specification C is described. The information processing device calculates the cosine similarity between the document vector VdC of specification C and each of the document vectors VsD, VsE, and VsF of study materials D, E, and F.
 When there is no second document vector whose cosine similarity to the document vector VdC is equal to or greater than the threshold, the information processing device counts the number of common words between the document vector VdC of specification C and the document vector VsD of study material D. It likewise counts the number of common words between VdC and the document vector VsE of study material E, and between VdC and the document vector VsF of study material F.
 When the number of common words between the document vector VdC of specification C and the document vector VsD of study material D and the number of common words between VdC and the document vector VsF of study material F are both the largest (that is, when a plurality of pairs have the largest number of common words), the information processing device performs the following processing.
 The information processing device compares the cosine similarity between the document vectors VdC and VsD with the cosine similarity between the document vectors VdC and VsF. When the cosine similarity between VdC and VsF is the larger, the information processing device extracts study material F as the study material from which specification C was created.
 As described above, the information processing device according to the present embodiment calculates the cosine similarity between the first document vector of a specification and the second document vector of each study material, and determines whether there is a second document vector whose cosine similarity to the first document vector is equal to or greater than the threshold. When such a second document vector exists, the information processing device extracts the study material of that second document vector.
 When there is no second document vector whose cosine similarity to the first document vector is equal to or greater than the threshold, on the other hand, the information processing device compares the first document vector of the specification with the second document vector of each study material and counts the number of common words. When exactly one study material has the largest number of common words with the specification, the information processing device extracts that study material.
 When a plurality of second document vectors have the largest number of common words with the first document vector of the specification, the information processing device identifies, among those second document vectors, the one whose cosine similarity to the first document vector is the largest, and extracts the study material of the identified second document vector.
 In this way, the information processing device extracts the study material corresponding to a specification from the two viewpoints of cosine similarity and the number of common words, and can thereby improve the accuracy of extracting study materials.
 Next, an example of the configuration of the information processing device according to the present embodiment is described. FIG. 2 is a functional block diagram showing the configuration of the information processing device according to the present embodiment. As shown in FIG. 2, the information processing device 100 includes a communication unit 110, an input unit 120, a display unit 130, a storage unit 140, and a control unit 150.
 The communication unit 110 is a communication interface that transmits and receives various types of information to and from external devices connected via a network or the like. The communication unit 110 is implemented by a NIC (Network Interface Card) or the like, and handles communication between the control unit 150 and external devices over a telecommunication line such as a LAN (Local Area Network) or the Internet.
 The input unit 120 is an input interface that accepts various operations from the operator of the information processing device 100. It consists of, for example, input devices such as a keyboard and a mouse.
 The display unit 130 is an output device that outputs information acquired from the control unit 150, and is implemented by a display device such as a liquid crystal display, a printing device such as a printer, or the like.
 The storage unit 140 holds a specification table 141 and a study material table 142. The storage unit 140 is implemented by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk.
 The specification table 141 is a table that holds information about specifications. FIG. 3 is a diagram showing an example of the data structure of the specification table. As shown in FIG. 3, the specification table 141 has a specification number, a specification, and a first document vector.
 The specification number is information that identifies a specification. For example, the specifications corresponding to specification numbers M10A to M10C are specifications A to C, respectively. The specification is the document information (text data) of the specification. The first document vector is a vector generated based on the words contained in the specification and the frequencies of those words. The first document vector is generated by the generation unit 151, described later.
 The study material table 142 is a table that holds information about study materials. FIG. 4 is a diagram showing an example of the data structure of the study material table. As shown in FIG. 4, the study material table 142 has a study material number, a study material, and a second document vector.
 The study material number is information that identifies a study material. For example, the study materials corresponding to study material numbers M10D to M10F are study materials D to F, respectively. The study material is the document information (text data) of a material that the user referred to when creating a specification. The second document vector is a vector generated based on the words contained in the study material and the frequencies of those words. The second document vector is generated by the generation unit 151, described later.
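 The two tables can be pictured with the following sketch. It is only an illustration of the three columns described above: the class name `DocumentRecord`, the use of dictionaries keyed by number, and the placeholder text values are assumptions, and the actual layout of FIG. 3 and FIG. 4 is not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class DocumentRecord:
    """One row of the specification table or the study material table."""
    number: str                       # e.g. specification number "M10A"
    text: str                         # document information (text data)
    vector: Dict[str, int] = field(default_factory=dict)  # word -> frequency


# Hypothetical contents; the text values are placeholders, not source data.
specification_table = {
    "M10A": DocumentRecord("M10A", "text of specification A"),
}
study_material_table = {
    "M10D": DocumentRecord("M10D", "text of study material D"),
}
```

 The vector column starts empty in this sketch because the document vectors are registered later by the generation unit.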
 The description returns to FIG. 2. The control unit 150 is implemented using a CPU (Central Processing Unit) or the like. The control unit 150 includes a generation unit 151, a calculation unit 152, and an extraction unit 153.
 The generation unit 151 generates document vectors from document information such as specifications and study materials. The generation unit 151 extracts words by performing morphological analysis on the document information of a specification stored in the specification table 141, and generates a first document vector based on the extracted words and their frequencies. The generation unit 151 registers the generated first document vector in the specification table 141. The generation unit 151 repeats this processing for each specification stored in the specification table 141.
 The generation unit 151 extracts words by performing morphological analysis on the document information of a study material stored in the study material table 142, and generates a second document vector based on the extracted words and their frequencies. The generation unit 151 registers the generated second document vector in the study material table 142. The generation unit 151 repeats this processing for each study material stored in the study material table 142.
 The generation unit 151 may generate document vectors by any method. For example, the generation unit 151 generates document vectors based on the technique described in Non-Patent Document 1.
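 As one possible illustration of generating a document vector from words and their frequencies, the sketch below counts word occurrences with `collections.Counter`. It does not reproduce the technique of Non-Patent Document 1, and it swaps in a simple regular-expression tokenizer for morphological analysis, which for Japanese text would require a dedicated analyzer. The function name `build_document_vector` is hypothetical.

```python
import re
from collections import Counter


def build_document_vector(text: str) -> Counter:
    """Map each word of the text to its frequency of occurrence."""
    # Assumption: a regex tokenizer stands in for morphological analysis.
    words = re.findall(r"\w+", text.lower())
    return Counter(words)
```

 For example, `build_document_vector("Login token. Token expiry")` yields a vector in which "token" has frequency 2 and "login" and "expiry" each have frequency 1.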
 The calculation unit 152 calculates the cosine similarity between the first document vector of a specification and the second document vector of each study material, using equation (1) described above. The calculation unit 152 outputs the calculated cosine similarities to the extraction unit 153.
 For example, the calculation result associates the specification number of the selected specification, the study material number of each study material, and the corresponding cosine similarity. The user may select a specification by operating the input unit 120, or the calculation unit 152 may select specifications in a predetermined order. In the following description, the selected specification is referred to as the "selected specification".
 The extraction unit 153 extracts the study material corresponding to the selected specification based on the calculated cosine similarities. The extraction unit 153 determines whether there is a pair for which the cosine similarity between the first document vector of the selected specification and the second document vector of a study material is equal to or greater than the threshold.
 When such a pair exists, the extraction unit 153 uses the study material number of that second document vector to extract the corresponding study material from the study material table 142.
 When there is no pair for which the cosine similarity between the first document vector of the selected specification and the second document vector of a study material is equal to or greater than the threshold, the extraction unit 153 compares the first document vector of the selected specification with the second document vector of each study material and counts the number of common words.
 When only one second document vector has the largest number of common words, the extraction unit 153 uses the study material number of that second document vector to extract the corresponding study material from the study material table 142.
 When a plurality of second document vectors have the largest number of common words, the extraction unit 153 identifies, among those second document vectors, the one whose cosine similarity to the first document vector is the largest. The extraction unit 153 uses the study material number of the identified second document vector to extract the corresponding study material from the study material table 142.
 The extraction unit 153 may cause the display unit 130 to display information that associates the selected specification with the extracted study material.
 Next, an example of the processing procedure of the information processing device according to the present embodiment is described. FIG. 5 is a flowchart showing the processing procedure of the information processing device according to the present embodiment. As shown in FIG. 5, the generation unit 151 of the information processing device 100 generates the first document vectors of the specifications (step S101). The generation unit 151 generates the second document vectors of the study materials (step S102).
 The calculation unit 152 of the information processing device 100 accepts the selection of a specification (step S103). The calculation unit 152 acquires the first document vector of the selected specification (step S104). The calculation unit 152 calculates the cosine similarity between the first document vector of the selected specification and each second document vector (step S105).
 The extraction unit 153 of the information processing device 100 determines whether there is a second document vector whose cosine similarity to the first document vector is equal to or greater than the threshold (step S106). When such a second document vector exists (step S106, Yes), the extraction unit 153 proceeds to step S107.
 The extraction unit 153 extracts the study material of the second document vector whose cosine similarity is equal to or greater than the threshold (step S107).
 When there is no second document vector whose cosine similarity to the first document vector is equal to or greater than the threshold (step S106, No), on the other hand, the extraction unit 153 executes the extraction process (step S108).
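 The branch at steps S106 to S108 can be sketched as follows. This is a minimal sketch that assumes the cosine similarities of step S105 are supplied as a mapping from study material number to similarity, and that the extraction process of step S108 is passed in as a callable. The names `extract_study_materials` and `fallback` are hypothetical, and returning every study material at or above the threshold is one reading of step S107.

```python
from typing import Callable, List, Mapping, Optional


def extract_study_materials(
    cosines: Mapping[str, float],
    threshold: float,
    fallback: Callable[[], Optional[str]],
) -> List[str]:
    """Return the study material numbers extracted for the selected specification."""
    # Step S106: look for similarities equal to or greater than the threshold.
    hits = [doc_id for doc_id, sim in cosines.items() if sim >= threshold]
    if hits:
        return hits  # step S107
    # Step S108: no candidate reached the threshold, so run the extraction process.
    chosen = fallback()
    return [] if chosen is None else [chosen]
```

 Passing the extraction process as a callable keeps this sketch independent of how the common words are counted.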
 An example of the processing procedure of the extraction process shown in step S108 of FIG. 5 is now described. FIG. 6 is a flowchart showing the processing procedure of the extraction process. As shown in FIG. 6, the extraction unit 153 of the information processing device 100 counts the number of common words between the first document vector and each second document vector (step S201).
 The extraction unit 153 determines whether a plurality of second document vectors have the largest number of common words (step S202). When only one second document vector has the largest number of common words (step S202, No), the extraction unit 153 extracts the study material of that second document vector (step S203).
 When a plurality of second document vectors have the largest number of common words (step S202, Yes), on the other hand, the extraction unit 153 proceeds to step S204. Among the second document vectors having the largest number of common words, the extraction unit 153 identifies the one whose cosine similarity to the first document vector is the largest (step S204).
The extraction unit 153 extracts the study material corresponding to the identified second document vector (step S205).
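The flow of steps S201 to S205 can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation: the function names, the use of `collections.Counter` as a document vector, and the reading of "common words" as words with an entry in both vectors are all assumptions made for the example.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def extract_by_common_words(first: Counter, seconds: dict[str, Counter]) -> str:
    """Return the ID of the study material chosen by steps S201 to S205.

    `seconds` maps a hypothetical study-material ID to its document vector
    and must not be empty.
    """
    # S201: count the words shared with each second document vector.
    common = {doc_id: len(first.keys() & vec.keys()) for doc_id, vec in seconds.items()}
    best = max(common.values())
    candidates = [doc_id for doc_id, n in common.items() if n == best]
    # S202 No -> S203: a single candidate is extracted directly.
    if len(candidates) == 1:
        return candidates[0]
    # S202 Yes -> S204, S205: break the tie by the highest cosine similarity.
    return max(candidates, key=lambda doc_id: cosine_similarity(first, seconds[doc_id]))
```

In this sketch the cosine similarity is computed only for the tied candidates, which mirrors the order of steps S202 and S204 in FIG. 6.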
Next, the effects of the information processing apparatus 100 according to this embodiment are described. The information processing apparatus 100 calculates the cosine similarity between the first document vector of the specification and the second document vector of each study material, and determines whether there is a second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold. If such a second document vector exists, the information processing apparatus 100 extracts the study material corresponding to that second document vector. Thus, when there is a pair whose cosine similarity is equal to or greater than the threshold, the study material can be extracted based on the cosine similarity.
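For reference, the cosine similarity used here can be written out in its standard form. The notation is an assumption introduced for this illustration and does not appear in the publication: $f_1(w)$ is the frequency of word $w$ in the specification, $f_{2,i}(w)$ is its frequency in the $i$-th study material, $V$ is the set of words over which the vectors are defined, and $\theta$ is the threshold.

```latex
\[
\mathrm{sim}(\mathbf{d}_1, \mathbf{d}_{2,i})
  = \frac{\sum_{w \in V} f_1(w)\, f_{2,i}(w)}
         {\sqrt{\sum_{w \in V} f_1(w)^2}\; \sqrt{\sum_{w \in V} f_{2,i}(w)^2}}
\]
```

Under this notation, the threshold test described above corresponds to checking whether $\mathrm{sim}(\mathbf{d}_1, \mathbf{d}_{2,i}) \ge \theta$ holds for at least one $i$.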
If there is no second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold, the information processing apparatus 100 compares the first document vector of the specification with the second document vector of each study material and counts the number of common words. If exactly one study material has the maximum number of words in common with the specification, the information processing apparatus 100 extracts that study material. Thus, even when no pair has a cosine similarity equal to or greater than the threshold, the study material can be extracted based on the number of common words.
If there are a plurality of second document vectors of study materials having the maximum number of words in common with the first document vector of the specification, the information processing apparatus 100 identifies, among those second document vectors, the one whose cosine similarity with the first document vector is the highest, and extracts the study material corresponding to the identified second document vector. Thus, even when a plurality of second document vectors share the maximum number of common words, the study material can be extracted by additionally using the cosine similarity.
That is, the information processing apparatus 100 according to the first embodiment can improve the accuracy of extracting the study material from which a specification was created.
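The first branch of the procedure, in which study materials are selected by the threshold on cosine similarity, might look like the sketch below. Several points are assumptions made for illustration only: document vectors are built by lowercase whitespace tokenization, all study materials that meet the threshold are returned, and the names `document_vector` and `select_by_threshold` are hypothetical.

```python
import math
from collections import Counter


def document_vector(text: str) -> Counter:
    """Map each word in the text to its frequency of appearance."""
    # Assumption: lowercase, whitespace-separated tokens stand in for real word segmentation.
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_by_threshold(spec_text: str, materials: dict[str, str], threshold: float) -> list[str]:
    """Return the IDs of study materials whose similarity to the specification meets the threshold."""
    first = document_vector(spec_text)
    return [
        doc_id
        for doc_id, text in materials.items()
        if cosine_similarity(first, document_vector(text)) >= threshold
    ]
```

An empty result corresponds to the case in which no second document vector reaches the threshold, which is where the count of common words takes over.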
FIG. 7 is a diagram showing an example of a computer that executes the extraction program. The computer 1000 has, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1031. The disk drive interface 1040 is connected to a disk drive 1041. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041. A mouse 1051 and a keyboard 1052, for example, are connected to the serial port interface 1050. A display 1061, for example, is connected to the video adapter 1060.
Here, the hard disk drive 1031 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. Each piece of information described in the above embodiment is stored in, for example, the hard disk drive 1031 or the memory 1010.
The extraction program is stored in the hard disk drive 1031 as, for example, the program module 1093 in which instructions to be executed by the computer 1000 are written. Specifically, the program module 1093 describing each process executed by the information processing apparatus 100 described in the above embodiment is stored in the hard disk drive 1031.
Data used for information processing by the extraction program is stored as the program data 1094 in, for example, the hard disk drive 1031. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the hard disk drive 1031 into the RAM 1012 as necessary and executes each of the procedures described above.
Note that the program module 1093 and the program data 1094 related to the extraction program are not limited to being stored in the hard disk drive 1031. For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 related to the extraction program may be stored in another computer connected via a network such as a LAN or a WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.
Although an embodiment to which the invention made by the present inventor is applied has been described above, the present invention is not limited by the descriptions and drawings that form part of the disclosure of the present invention according to this embodiment. That is, other embodiments, examples, operation techniques, and the like made by those skilled in the art based on this embodiment are all included in the scope of the present invention.
 100  information processing apparatus
 110  communication unit
 120  input unit
 130  display unit
 140  storage unit
 141  specification table
 142  study material table
 150  control unit

Claims (6)

  1.  An information processing apparatus comprising:
     a calculation unit that calculates the similarity between a first document vector, generated based on the words included in a first document and the frequencies of appearance of those words, and each of a plurality of second document vectors, generated for a plurality of second documents based on the words included in each second document and the frequencies of appearance of those words; and
     an extraction unit that, when there is no second document vector whose similarity with the first document vector is equal to or greater than a threshold, counts, for each second document vector, the number of common words shared between the words set in the first document vector and the words set in that second document vector, and extracts a second document based on the counting result.
  2.  The information processing apparatus according to claim 1, wherein, when there is a second document vector whose similarity with the first document vector is equal to or greater than the threshold, the extraction unit further executes a process of extracting the second document corresponding to that second document vector.
  3.  The information processing apparatus according to claim 1, wherein the extraction unit extracts the second document corresponding to the second document vector having the maximum number of common words.
  4.  The information processing apparatus according to claim 3, wherein, when there are a plurality of second document vectors having the maximum number of common words, the extraction unit extracts the second document corresponding to the second document vector having the maximum similarity among those second document vectors.
  5.  An extraction method comprising:
     a calculation step of calculating the similarity between a first document vector, generated based on the words included in a first document and the frequencies of appearance of those words, and each of a plurality of second document vectors, generated for a plurality of second documents based on the words included in each second document and the frequencies of appearance of those words; and
     an extraction step of, when there is no second document vector whose similarity with the first document vector is equal to or greater than a threshold, counting, for each second document vector, the number of common words shared between the words set in the first document vector and the words set in that second document vector, and extracting a second document based on the counting result.
  6.  An extraction program comprising:
     a calculation step of calculating the similarity between a first document vector, generated based on the words included in a first document and the frequencies of appearance of those words, and each of a plurality of second document vectors, generated for a plurality of second documents based on the words included in each second document and the frequencies of appearance of those words; and
     an extraction step of, when there is no second document vector whose similarity with the first document vector is equal to or greater than a threshold, counting, for each second document vector, the number of common words shared between the words set in the first document vector and the words set in that second document vector, and extracting a second document based on the counting result.
PCT/JP2021/004792 2021-02-09 2021-02-09 Information processing device, extraction method, and extraction program WO2022172334A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/004792 WO2022172334A1 (en) 2021-02-09 2021-02-09 Information processing device, extraction method, and extraction program

Publications (1)

Publication Number Publication Date
WO2022172334A1 true WO2022172334A1 (en) 2022-08-18

Family

ID=82838469

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/004792 WO2022172334A1 (en) 2021-02-09 2021-02-09 Information processing device, extraction method, and extraction program

Country Status (1)

Country Link
WO (1) WO2022172334A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061842A (en) * 2019-12-26 2020-04-24 上海众源网络有限公司 Similar text determination method and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21925588

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21925588

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP