WO2022172334A1 - Information processing device, extraction method, and extraction program

Info

Publication number: WO2022172334A1
Application number: PCT/JP2021/004792
Authority: WO
Inventors: 浩宮尾
Original assignee: 日本電信電話株式会社
Priority date: 2021-02-09
Filing date: 2021-02-09
Publication date: 2022-08-18

Abstract

This information processing device calculates the similarity between a first document vector, generated based on the words included in a first document and the occurrence frequency of those words, and each of a plurality of second document vectors, generated for a plurality of second documents based on the words included in each second document and the occurrence frequency of those words. When no second document vector has a similarity with the first document vector that is greater than or equal to a threshold value, the information processing device counts the number of common words shared between the words set in the first document vector and the words set in each second document vector, and extracts a second document based on the count result.

Description

Information processing device, extraction method and extraction program

The present invention relates to an information processing device, an extraction method, and an extraction program.

When creating specifications (development documents), omissions and ambiguities must be eliminated through reviews in order to ensure the quality of the specifications. To eliminate such omissions and ambiguities, it is necessary, when creating a specification, to accurately refer to the materials that were considered in advance (hereinafter referred to as study materials) and check the specification against them. However, because the sections and chapters of a specification are created at different times, the study materials cannot always be referred to accurately, and oversights may occur.

Here, prior art 1 and prior art 2 automatically extract the study materials corresponding to a document by exploiting the fact that the contents of a specification and the contents of its study materials are very similar.

FIG. 8 is a diagram for explaining prior art 1. In prior art 1, document vectors are generated based on the words appearing in each document and the appearance frequency of those words, and study materials close to the contents of a specification are extracted based on the cosine similarity between document vectors. For example, let the document vectors of specifications A, B, and C be V_dA, V_dB, and V_dC, respectively. Let V_sD, V_sE, and V_sF be the document vectors of study materials D, E, and F, respectively.

In prior art 1, the cosine similarity between the document vector V_dA of the specification A and each of the document vectors V_sD to V_sF of the study materials D to F is calculated, and the study material corresponding to the specification A is extracted based on the pairs whose cosine similarity is equal to or higher than a threshold. For example, when the cosine similarity between the document vector V_dA of the specification A and the document vector V_sD of the study material D is equal to or greater than the threshold, the study material D is extracted as the study material corresponding to the specification A. For the other specifications B and C, study materials are extracted in the same manner.

FIG. 9 is a diagram for explaining prior art 2. In prior art 2, the topics of each document are analyzed, and study materials close to the content of a specification are extracted based on the distance between the topics of the documents. For example, in prior art 2, the topics of specifications A, B, and C and the topics of study materials D, E, and F are analyzed, and the breakdown of topics is calculated and vectorized. The documents are then mapped onto the graph G1 based on the vectorized values. The horizontal axis of the graph G1 corresponds to the value of the first topic, and the vertical axis corresponds to the value of the second topic.

For example, specifications A, B, and C are mapped to pA, pB, and pC of graph G1, respectively, and study materials D, E, and F are mapped to pD, pE, and pF of graph G1, respectively. Here, since pA and pD are close to each other, study material D is extracted as the study material corresponding to specification A.
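The nearest-point selection of prior art 2 can be sketched as follows. This is only an illustration: the two-dimensional topic coordinates, the use of Euclidean distance, and the function name `nearest_material` are assumptions and do not come from the application.

```python
import math

# Hypothetical positions on graph G1: (value of first topic, value of second topic).
spec_points = {"A": (0.8, 0.2)}
material_points = {"D": (0.7, 0.3), "E": (0.2, 0.8), "F": (0.5, 0.5)}


def nearest_material(spec_point, materials):
    """Return the name of the study material whose point is closest to spec_point."""
    return min(materials, key=lambda name: math.dist(spec_point, materials[name]))


# With these illustrative coordinates, pD is the point nearest to pA.
closest = nearest_material(spec_points["A"], material_points)
```

When the sections are short, the topic coordinates of all documents tend to cluster together, so the nearest point becomes almost arbitrary. This is the weakness described in the following paragraphs.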

However, the descriptions of the functions and requirements in each section of a specification are only about several hundred characters long, which makes each section a relatively small document. When topic analysis is performed on such documents, the results are similar to each other and no noticeable difference appears between the topics of the documents, so the study materials corresponding to a specification cannot be extracted appropriately.

On the other hand, even when cosine similarity is used, the number of important words in a document (hereinafter referred to as "key words") may be small, and the appearance frequency of such key words may be low. Even if figures, tables, and the like are included to increase the volume of the document, what increases is the number of words other than key words (words that are also common in other documents). As a result, the cosine similarity does not increase even though the document is the relevant study material, and the study material cannot be extracted appropriately.
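As an illustrative calculation (the counts are hypothetical and not taken from the application), suppose a specification section has the count vector (2, 1) over two key words and its study material has (1, 1) over the same key words. If a table then adds three generic words to the study material, each appearing 5 times, the dot product is unchanged while the norm of the study material vector grows, so the cosine similarity drops sharply:

```latex
\cos_{\text{before}}
  = \frac{2 \cdot 1 + 1 \cdot 1}{\sqrt{2^2 + 1^2}\,\sqrt{1^2 + 1^2}}
  = \frac{3}{\sqrt{10}} \approx 0.95
\qquad
\cos_{\text{after}}
  = \frac{2 \cdot 1 + 1 \cdot 1}{\sqrt{2^2 + 1^2}\,\sqrt{1^2 + 1^2 + 3 \cdot 5^2}}
  = \frac{3}{\sqrt{385}} \approx 0.15
```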

An object of the present invention is to provide an information processing apparatus, an extraction method, and an extraction program capable of improving the accuracy of extracting the study materials from which specifications are created.

In order to solve the above-described problems and achieve the object, an information processing apparatus according to the present invention includes: a calculation unit that calculates the similarity between a first document vector, generated based on the words included in a first document and the appearance frequency of those words, and each of a plurality of second document vectors, generated for a plurality of second documents based on the words included in each second document and the appearance frequency of those words; and an extraction unit that, when there is no second document vector whose similarity with the first document vector is equal to or greater than a threshold, counts the number of common words shared between the words set in the first document vector and the words set in the second document vectors, and extracts a second document based on the counting result.

It is possible to improve the accuracy of extracting study materials that are the basis for creating specifications.

FIG. 1 is a diagram for explaining the processing of the information processing apparatus according to the embodiment. FIG. 2 is a functional block diagram showing the configuration of the information processing apparatus according to this embodiment. FIG. 3 is a diagram showing an example of the data structure of a specification table. FIG. 4 is a diagram illustrating an example of the data structure of a study material table. FIG. 5 is a flowchart showing the processing procedure of the information processing apparatus according to the embodiment. FIG. 6 is a flowchart showing the procedure of extraction processing. FIG. 7 is a diagram showing an example of a computer that executes an extraction program. FIG. 8 is a diagram for explaining prior art 1. FIG. 9 is a diagram for explaining prior art 2.

Below, embodiments of the information processing device, the extraction method, and the extraction program disclosed in the present application will be described in detail based on the drawings. Note that the present invention is not limited by these embodiments.

FIG. 1 is a diagram for explaining the processing of the information processing apparatus according to this embodiment. In FIG. 1, specifications A, B, and C and study materials D, E, and F will be used as an example for explanation. The information processing device generates a document vector based on the words appearing in the document and the appearance frequency of the words. In this embodiment, a word is set for each element (dimension) of the document vector.
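A minimal sketch of such a document vector is shown below. It assumes that the document has already been split into words and that raw occurrence counts are used as the element values; the application leaves the exact weighting open, and the function name `document_vector` is hypothetical.

```python
from collections import Counter


def document_vector(words):
    """Build a sparse document vector: one dimension per distinct word,
    with the word's occurrence count as the element value (an assumption)."""
    return Counter(words)


# Hypothetical, already-tokenized text of one specification section.
tokens = ["session", "timeout", "session", "retry"]
vector = document_vector(tokens)
```

A `Counter` returns 0 for words that never occurred, which matches the idea that every word is a dimension and absent words have a zero element.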

For convenience of explanation, the document vector of a specification is referred to as a "first document vector", and the document vector of a study material is referred to as a "second document vector". When the first document vectors of the specifications are distinguished individually, the document vectors of the specifications A, B, and C are denoted V_dA, V_dB, and V_dC, respectively. When the second document vectors of the study materials are distinguished individually, the document vectors of the study materials D, E, and F are denoted V_sD, V_sE, and V_sF, respectively.

When extracting study material corresponding to a specification, the information processing device calculates cosine similarity between the first document vector of the specification and the second document vector of each study material. The information processing device calculates the cosine similarity using Equation (1). In equation (1), "Vdx" indicates the document vector of the specification. "Vsy" corresponds to the document vector of the study material.

$$\frac{V_{dx} \cdot V_{sy}}{|V_{dx}|\,|V_{sy}|} \qquad (1)$$

The information processing device calculates the cosine similarity between the first document vector and each second document vector, and determines whether or not there is a second document vector whose cosine similarity with the first document vector is equal to or greater than a threshold. If there is such a second document vector, the information processing device extracts the study material corresponding to that second document vector.
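The cosine similarity and the threshold check can be sketched as follows. The sparse dictionary representation, the threshold value 0.5, the sample words, and the function names are assumptions made for illustration; the application does not specify them.

```python
import math

THRESHOLD = 0.5  # assumed value; the application does not give a concrete threshold


def cosine_similarity(v_dx, v_sy):
    """Cosine similarity of two sparse vectors given as {word: count} dicts."""
    dot = sum(v_dx[w] * v_sy[w] for w in v_dx.keys() & v_sy.keys())
    norm_dx = math.sqrt(sum(c * c for c in v_dx.values()))
    norm_sy = math.sqrt(sum(c * c for c in v_sy.values()))
    if norm_dx == 0 or norm_sy == 0:
        return 0.0  # treat an empty document as dissimilar to everything
    return dot / (norm_dx * norm_sy)


def materials_above_threshold(spec_vec, material_vecs, threshold=THRESHOLD):
    """Return {study material name: similarity} for materials meeting the threshold."""
    sims = {name: cosine_similarity(spec_vec, vec) for name, vec in material_vecs.items()}
    return {name: s for name, s in sims.items() if s >= threshold}


# Hypothetical vectors for one specification and two study materials.
spec = {"login": 2, "token": 1}
materials = {"D": {"login": 1, "token": 1}, "E": {"billing": 3}}
hits = materials_above_threshold(spec, materials)
```

An empty result from `materials_above_threshold` corresponds to the case in which no second document vector reaches the threshold.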

On the other hand, if there is no second document vector whose cosine similarity with the first document vector is greater than or equal to the threshold value, the information processing device executes the following process. The information processing device counts the number of words that appear both among the words set in the elements of the first document vector of the specification and among the words set in the elements of the second document vector of each study material. In the following description, such shared words are referred to as "common words".

If there is one study material with the largest number of common words, the information processing device extracts such study material.

When there are a plurality of study materials with the maximum number of common words, the information processing device identifies, among the second document vectors of those study materials, the second document vector having the maximum cosine similarity with the first document vector, and extracts the study material for the identified second document vector.
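The fallback selection described above can be sketched as follows. The sketch assumes that each vector is a `{word: count}` dictionary containing only words that occur in the document, that at least one study material exists, and that the cosine similarities have already been computed; the function name and all sample values are hypothetical.

```python
def pick_by_common_words(spec_vec, material_vecs, similarities):
    """Pick the study material sharing the most words with the specification.

    Ties on the common-word count are broken by the larger precomputed
    cosine similarity (similarities: {study material name: similarity}).
    """
    # Number of words set in both the first and the second document vector.
    counts = {name: len(spec_vec.keys() & vec.keys()) for name, vec in material_vecs.items()}
    best = max(counts.values())
    tied = [name for name, count in counts.items() if count == best]
    if len(tied) == 1:
        return tied[0]
    return max(tied, key=lambda name: similarities[name])


# Illustrative data: D and F both share two words with the specification.
spec = {"a": 1, "b": 1, "c": 1}
materials = {"D": {"a": 1, "b": 1, "x": 5}, "E": {"a": 1}, "F": {"b": 1, "c": 1}}
sims = {"D": 0.2, "E": 0.1, "F": 0.3}  # hypothetical values, all below the threshold
chosen = pick_by_common_words(spec, materials, sims)
```

Here D and F tie on the common-word count, so the larger similarity decides and F is selected.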

An example of the process by which the information processing device extracts the study material corresponding to the specification A will be described. The information processing device calculates the cosine similarity between the document vector V_dA of the specification A and each of the document vectors V_sD, V_sE, and V_sF of the study materials D, E, and F.

For example, when the cosine similarity between the document vector V_dA of the specification A and the document vector V_sD of the study material D is equal to or greater than the threshold, the information processing device extracts the study material D as the study material from which the specification A was created.

An example of the process by which the information processing device extracts the study material corresponding to the specification B will be described. The information processing device calculates the cosine similarity between the document vector V_dB of the specification B and each of the document vectors V_sD, V_sE, and V_sF of the study materials D, E, and F.

If there is no second document vector whose cosine similarity with the document vector V_dB is equal to or greater than the threshold, the information processing device counts the number of common words between the document vector V_dB of the specification B and the document vector V_sD of the study material D. Likewise, the information processing device counts the number of common words between the document vector V_dB and the document vector V_sE of the study material E, and between the document vector V_dB and the document vector V_sF of the study material F.

When the number of common words between the document vector V_dB of the specification B and the document vector V_sE of the study material E is the maximum and is greater than the other common-word counts, the information processing device extracts the study material E as the study material from which the specification B was created.

An example of the process by which the information processing device extracts the study material corresponding to the specification C will be described. The information processing device calculates the cosine similarity between the document vector V_dC of the specification C and each of the document vectors V_sD, V_sE, and V_sF of the study materials D, E, and F.

If there is no second document vector whose cosine similarity with the document vector V_dC is equal to or greater than the threshold, the information processing device counts the number of common words between the document vector V_dC of the specification C and the document vector V_sD of the study material D. Likewise, the information processing device counts the number of common words between the document vector V_dC and the document vector V_sE of the study material E, and between the document vector V_dC and the document vector V_sF of the study material F.

When the number of common words between the document vector V_dC of the specification C and the document vector V_sD of the study material D and the number of common words between the document vector V_dC and the document vector V_sF of the study material F are both the maximum (that is, when there are multiple pairs with the maximum number of common words), the information processing device performs the following processing.

The information processing device compares the cosine similarity between the document vectors V_dC and V_sD with the cosine similarity between the document vectors V_dC and V_sF. If the cosine similarity between V_dC and V_sF is larger, the information processing device extracts the study material F as the study material from which the specification C was created.

As described above, the information processing apparatus according to the present embodiment calculates the cosine similarity between the first document vector of the specification and the second document vector of each study material, and determines whether there is a second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold value. If there is such a second document vector, the information processing apparatus extracts the study material for that second document vector.

On the other hand, if there is no second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold, the information processing apparatus compares the first document vector of the specification with the second document vector of each study material and counts the number of common words. If exactly one study material has the maximum number of common words with the specification, the information processing apparatus extracts that study material.

When a plurality of study materials have second document vectors with the maximum number of common words with the first document vector of the specification, the information processing apparatus identifies, among those second document vectors, the one having the maximum cosine similarity with the first document vector, and extracts the study material for the identified second document vector.

In this way, the information processing apparatus can improve the extraction accuracy of study materials by extracting study materials corresponding to specifications from the viewpoint of cosine similarity and the number of common words.

Next, an example of the configuration of the information processing apparatus according to this embodiment will be described. FIG. 2 is a functional block diagram showing the configuration of the information processing apparatus according to this embodiment. As shown in FIG. 2, this information processing apparatus 100 has a communication unit 110, an input unit 120, a display unit 130, a storage unit 140, and a control unit 150.

The communication unit 110 is a communication interface that transmits and receives various types of information to and from an external device connected via a network or the like. The communication unit 110 is realized by a NIC (Network Interface Card) or the like, and performs communication between an external device and the control unit 150 via an electric communication line such as a LAN (Local Area Network) or the Internet.

The input unit 120 is an input interface that receives various operations from the operator of the information processing device 100. For example, the input unit 120 is composed of input devices such as a keyboard and a mouse.

The display unit 130 is an output device that outputs information acquired from the control unit 150, and is realized by a display device such as a liquid crystal display, a printing device such as a printer, and the like.

The storage unit 140 has a specification table 141 and a study material table 142. The storage unit 140 is implemented by a semiconductor memory device such as RAM (Random Access Memory) or flash memory, or a storage device such as a hard disk or optical disk.

The specification table 141 is a table that holds information on specifications. FIG. 3 is a diagram showing an example of the data structure of a specification table. As shown in FIG. 3, this specification table 141 has a specification number, a specification, and a first document vector.

The specification number is information that identifies the specification. For example, specifications corresponding to specification numbers M10A to M10C are assumed to be specifications A to C, respectively. The specification is document information (text data) of the specification. The first document vector is a vector generated based on the words included in the specification and the frequencies of these words. The first document vector is generated by the generation unit 151, which will be described later.

The study material table 142 is a table that holds information about study materials. FIG. 4 is a diagram illustrating an example of the data structure of a study material table. As shown in FIG. 4, the study material table 142 has study material numbers, study materials, and second document vectors.

The study material number is information that identifies the study material. For example, study materials corresponding to study material numbers M10D to M10F are assumed to be study materials D to F, respectively. The study material is the document information (text data) that the user referred to when creating the specification. The second document vector is a vector generated based on the words contained in the study material and the frequencies of these words. The second document vector is generated by the generation unit 151, which will be described later.
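The two tables can be pictured as simple records keyed by their numbers. The sketch below is an assumption about one possible in-memory layout; the class names, field names, and placeholder texts are hypothetical, and only the three columns per table come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SpecificationRecord:
    specification_number: str  # e.g. "M10A"
    specification: str         # text data of the specification
    first_document_vector: Dict[str, int] = field(default_factory=dict)


@dataclass
class StudyMaterialRecord:
    study_material_number: str  # e.g. "M10D"
    study_material: str         # text data of the study material
    second_document_vector: Dict[str, int] = field(default_factory=dict)


# Tables keyed by number; the vectors stay empty until the generation unit fills them.
specification_table = {
    rec.specification_number: rec
    for rec in [SpecificationRecord("M10A", "text of specification A")]
}
study_material_table = {
    rec.study_material_number: rec
    for rec in [StudyMaterialRecord("M10D", "text of study material D")]
}
```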

Returning to the description of FIG. 2, the control unit 150 is implemented using a CPU (Central Processing Unit) or the like. The control unit 150 has a generation unit 151, a calculation unit 152, and an extraction unit 153.

The generation unit 151 generates document vectors from document information such as specifications and study materials. The generation unit 151 extracts words by morphologically analyzing the document information of a specification stored in the specification table 141, and generates a first document vector based on the extracted words and the frequency of those words. The generation unit 151 registers the generated first document vector in the specification table 141. The generation unit 151 repeatedly executes the above process for each specification stored in the specification table 141.

The generation unit 151 extracts words by morphologically analyzing the document information of a study material stored in the study material table 142, and generates a second document vector based on the extracted words and the frequency of those words. The generation unit 151 registers the generated second document vector in the study material table 142. The generation unit 151 repeatedly executes the above process for each study material stored in the study material table 142.

The generation unit 151 may generate a document vector by any method. For example, the generation unit 151 generates document vectors based on the technique described in Non-Patent Document 1.

The calculation unit 152 calculates the cosine similarity between the first document vector of the specification and the second document vector of each study material. The calculation unit 152 calculates the cosine similarity using Equation (1) described above. The calculation unit 152 outputs the calculation result of the cosine similarity to the extraction unit 153 .

For example, the calculation result associates the specification number of the selected specification with the study material number of each study material and the corresponding cosine similarity. The specification may be selected by the user operating the input unit 120, or by the calculation unit 152 in a predetermined order. In the following description, the selected specification is referred to as the "selected specification".

The extraction unit 153 extracts study materials corresponding to the selected specifications based on the cosine similarity calculation results. The extraction unit 153 determines whether or not there is a set in which the cosine similarity between the first document vector of the selected specification and the second document vector of each study material is equal to or greater than a threshold.

If there is a pair in which the cosine similarity between the first document vector of the selected specification and the second document vector of a study material is equal to or greater than the threshold, the extraction unit 153 uses the study material number of that second document vector to extract the corresponding study material from the study material table 142.

If there is no pair in which the cosine similarity between the first document vector of the selected specification and the second document vector of a study material is equal to or greater than the threshold, the extraction unit 153 compares the first document vector of the selected specification with the second document vector of each study material and counts the number of common words.

When there is only one second document vector with the largest number of common words, the extraction unit 153 uses the study material number corresponding to that second document vector to extract the corresponding study material from the study material table 142.

When there are a plurality of second document vectors with the largest number of common words, the extraction unit 153 identifies, among those second document vectors, the one having the highest cosine similarity with the first document vector. The extraction unit 153 then uses the study material number of the identified second document vector to extract the corresponding study material from the study material table 142.

The extraction unit 153 may cause the display unit 130 to display information that associates the selected specifications with the extracted study material.

Next, an example of the processing procedure of the information processing apparatus according to this embodiment will be described. FIG. 5 is a flow chart showing the processing procedure of the information processing apparatus according to the embodiment. As shown in FIG. 5, the generation unit 151 of the information processing apparatus 100 generates a first document vector of the specification (step S101). The generation unit 151 generates a second document vector of study material (step S102).

The calculation unit 152 of the information processing device 100 accepts the selection of a specification (step S103). The calculation unit 152 acquires the first document vector of the selected specification (step S104). The calculation unit 152 calculates the cosine similarity between the first document vector of the selected specification and each second document vector (step S105).

The extraction unit 153 of the information processing apparatus 100 determines whether or not there is a second document vector whose cosine similarity with the first document vector is equal to or greater than a threshold (step S106). If there is a second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold (step S106, Yes), the extraction unit 153 proceeds to step S107.

The extraction unit 153 extracts study materials of the second document vectors whose cosine similarity is equal to or greater than the threshold (step S107).

On the other hand, if there is no second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold value (step S106, No), the extraction unit 153 executes extraction processing (step S108).

Here, an example of the processing procedure of the extraction processing shown in step S108 of FIG. 5 will be described. FIG. 6 is a flowchart showing the procedure of extraction processing. As shown in FIG. 6, the extraction unit 153 of the information processing apparatus 100 counts the number of common words between the first document vector and the second document vector (step S201).

The extraction unit 153 determines whether or not there are a plurality of second document vectors having the maximum number of common words (step S202). If there are not a plurality of second document vectors with the maximum number of common words (step S202, No), the extracting unit 153 extracts study material for the second document vector with the largest number of common words ( step S203).

On the other hand, if there are a plurality of second document vectors with the maximum number of common words (step S202, Yes), the extraction unit 153 proceeds to step S204. Among the second document vectors having the maximum number of common words, the extraction unit 153 identifies the one whose cosine similarity with the first document vector is the largest (step S204).

The extraction unit 153 extracts the study material of the identified second document vector (step S205).
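The procedures of FIG. 5 and FIG. 6 can be sketched end to end as follows, with the step numbers shown as comments. This is an illustration under assumptions rather than the applicant's implementation: documents are taken as already-tokenized word lists, raw counts are used as vector elements, at least one study material is assumed to exist, and the function names, sample data, and threshold are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two {word: count} vectors (0.0 if either is empty)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def extract_study_materials(spec_words, materials_words, threshold):
    """Return the names of the study materials extracted for one selected specification."""
    first = Counter(spec_words)                                          # S101, S104
    seconds = {name: Counter(ws) for name, ws in materials_words.items()}  # S102
    sims = {name: cosine(first, vec) for name, vec in seconds.items()}   # S105
    hits = [name for name, s in sims.items() if s >= threshold]          # S106
    if hits:
        return hits                                                      # S107
    common = {name: len(first.keys() & vec.keys()) for name, vec in seconds.items()}  # S201
    best = max(common.values())
    tied = [name for name, count in common.items() if count == best]     # S202
    if len(tied) == 1:
        return tied                                                      # S203
    return [max(tied, key=sims.get)]                                     # S204, S205


# Illustrative data: key words are rare and each material is padded with generic words.
spec_words = ["a", "b", "c"]
materials_words = {
    "D": ["a", "b"] + ["x"] * 10,
    "E": ["a"] + ["y"] * 10,
    "F": ["b", "c"] + ["z"] * 20,
}
result = extract_study_materials(spec_words, materials_words, threshold=0.9)
```

In this data no similarity reaches the threshold, D and F tie with two common words each, and D is chosen because its similarity to the specification is larger.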

Next, the effects of the information processing apparatus 100 according to this embodiment will be described. The information processing apparatus 100 calculates the cosine similarity between the first document vector of the specification and the second document vector of each study material, and determines whether there is a second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold. If there is such a second document vector, the information processing apparatus 100 extracts the study material for that second document vector. As a result, if there is a pair whose cosine similarity is equal to or greater than the threshold, the study material can be extracted based on the cosine similarity.

If there is no second document vector whose cosine similarity with the first document vector is equal to or greater than the threshold value, the information processing apparatus 100 compares the first document vector of the specification with the second document vector of each study material and counts the number of common words. If exactly one study material has the maximum number of common words with the specification, the information processing apparatus 100 extracts that study material. As a result, even if there is no pair whose cosine similarity is equal to or greater than the threshold value, the study material can be extracted based on the number of common words.

When a plurality of study materials have second document vectors with the maximum number of common words with the first document vector of the specification, the information processing apparatus 100 identifies, among those second document vectors, the one having the maximum cosine similarity with the first document vector, and extracts the study material for the identified second document vector. As a result, even if there are a plurality of second document vectors having the maximum number of common words, the cosine similarity can additionally be used to extract the study material.

That is, the information processing apparatus 100 according to this embodiment can improve the accuracy of extracting the study materials that are used as a basis for creating specifications.

FIG. 7 is a diagram showing an example of a computer that executes an extraction program. The computer 1000 has, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012 . The ROM 1011 stores a boot program such as BIOS (Basic Input Output System). Hard disk drive interface 1030 is connected to hard disk drive 1031 . Disk drive interface 1040 is connected to disk drive 1041 . A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1041, for example. A mouse 1051 and a keyboard 1052 are connected to the serial port interface 1050, for example. For example, a display 1061 is connected to the video adapter 1060 .

Here, the hard disk drive 1031 stores an OS 1091, application programs 1092, program modules 1093 and program data 1094, for example. Each piece of information described in the above embodiment is stored in the hard disk drive 1031 or memory 1010, for example.

Also, the extraction program is stored in the hard disk drive 1031, for example, as a program module 1093 in which commands to be executed by the computer 1000 are written. Specifically, the hard disk drive 1031 stores a program module 1093 in which each process executed by the information processing apparatus 100 described in the above embodiment is described.

Data used for information processing by the extraction program is stored as program data 1094 in the hard disk drive 1031, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the hard disk drive 1031 to the RAM 1012 as necessary, and executes each procedure described above.

Note that the program module 1093 and the program data 1094 related to the extraction program are not limited to being stored in the hard disk drive 1031. For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1041 or the like. Alternatively, the program module 1093 and the program data 1094 related to the extraction program may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network), and read by the CPU 1020 via the network interface 1070.

Although the embodiment to which the invention made by the present inventor is applied has been described above, the present invention is not limited by the descriptions and drawings forming part of the disclosure of the present invention according to the present embodiment. That is, other embodiments, examples, operation techniques, etc. made by those skilled in the art based on this embodiment are all included in the scope of the present invention.

100 information processing device
110 communication unit
120 input unit
130 display unit
140 storage unit
141 specification table
142 study material table
150 control unit

Claims

1. An information processing apparatus comprising: a calculation unit that calculates the similarity between a first document vector, generated based on the words included in a first document and the frequency of appearance of the words, and each of a plurality of second document vectors, generated for a plurality of second documents based on the words included in the second documents and the frequency of appearance of the words; and an extraction unit that, when there is no second document vector whose similarity to the first document vector is equal to or greater than a threshold, counts the number of common words that are common between the words set in the first document vector and the words set in each second document vector, and extracts a second document based on the counting result.
2. The information processing apparatus according to claim 1, wherein, when there is a second document vector whose similarity to the first document vector is equal to or greater than the threshold, the extraction unit further executes a process of extracting the second document corresponding to that second document vector.
3. The information processing apparatus according to claim 1, wherein the extraction unit extracts the second document corresponding to the second document vector having the largest number of common words.
4. The information processing apparatus according to claim 3, wherein, when there are a plurality of second document vectors having the largest number of common words, the extraction unit extracts the second document corresponding to the second document vector with the largest similarity among them.
5. An extraction method comprising: a calculation step of calculating the similarity between a first document vector, generated based on the words included in a first document and the frequency of appearance of the words, and each of a plurality of second document vectors, generated for a plurality of second documents based on the words included in the second documents and the frequency of appearance of the words; and an extraction step of, when there is no second document vector whose similarity to the first document vector is equal to or greater than a threshold, counting the number of common words that are common between the words set in the first document vector and the words set in each second document vector, and extracting a second document based on the counting result.
6. An extraction program that causes a computer to execute: a calculation step of calculating the similarity between a first document vector, generated based on the words included in a first document and the frequency of appearance of the words, and each of a plurality of second document vectors, generated for a plurality of second documents based on the words included in the second documents and the frequency of appearance of the words; and an extraction step of, when there is no second document vector whose similarity to the first document vector is equal to or greater than a threshold, counting the number of common words that are common between the words set in the first document vector and the words set in each second document vector, and extracting a second document based on the counting result.