WO2021117246A1 - Data processing device, data processing method, and data processing program - Google Patents

Data processing device, data processing method, and data processing program

Info

Publication number
WO2021117246A1
WO2021117246A1 PCT/JP2019/049053 JP2019049053W WO2021117246A1 WO 2021117246 A1 WO2021117246 A1 WO 2021117246A1 JP 2019049053 W JP2019049053 W JP 2019049053W WO 2021117246 A1 WO2021117246 A1 WO 2021117246A1
Authority
WO
WIPO (PCT)
Prior art keywords
frequency
data processing
word
occurrences
unit
Application number
PCT/JP2019/049053
Other languages
French (fr)
Japanese (ja)
Inventor
須永 聡
一宏 菊間
Original Assignee
日本電信電話株式会社
Application filed by 日本電信電話株式会社
Priority to PCT/JP2019/049053
Publication of WO2021117246A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Definitions

  • the present invention relates to a data processing apparatus, a data processing method, and a data processing program.
  • BoW stands for Bag-of-words.
  • the data sparseness problem will be explained by taking as an example the case where the number of words recorded in a Japanese dictionary is 50,000 and a vector having these words as elements is generated.
  • When a newspaper article consisting of 100 words is vectorized, at least 49,900 elements are "0"; that is, most elements are "0". Moreover, the topic is unlikely to change frequently within one article, so the same words are used many times and even more of the vector elements are "0". Therefore, even if a large number of articles are prepared, it may be difficult to determine how readily a certain word appears.
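
The arithmetic behind this sparseness example can be checked with a minimal sketch. The dictionary size and article length come from the text; the function name `min_zero_elements` is a hypothetical helper introduced only for illustration.

```python
# Sizes taken from the example in the text.
DICTIONARY_SIZE = 50_000
ARTICLE_LENGTH = 100


def min_zero_elements(dictionary_size: int, word_count: int) -> int:
    """Lower bound on the number of zero-valued elements in a BoW vector.

    A text of `word_count` words can touch at most `word_count` distinct
    dictionary entries, so every other element must be 0.
    """
    max_nonzero = min(dictionary_size, word_count)
    return dictionary_size - max_nonzero


zeros = min_zero_elements(DICTIONARY_SIZE, ARTICLE_LENGTH)
sparsity = zeros / DICTIONARY_SIZE
print(zeros, f"{sparsity:.1%}")
```

Repeated words only lower the number of distinct entries, so the real share of zeros is at least this bound.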
  • The present invention has been made in view of the above, and an object of the present invention is to provide a data processing apparatus, a data processing method, and a data processing program capable of producing an appropriate vector representation even for a short sentence having a small number of words.
  • The data processing apparatus includes a first acquisition unit that acquires the number of occurrences or the frequency of occurrence of a predetermined word included in the target portion of vectorization in document data, a second acquisition unit that acquires the number of occurrences or the frequency of occurrence of the predetermined word included in a certain range with respect to the target portion of the document data, and a generation unit that generates a vector in which the value obtained by adding the values acquired by the first acquisition unit and the second acquisition unit is used as the value of each element.
  • The data processing method is executed by a data processing apparatus and includes a first acquisition step of acquiring the number of occurrences or the frequency of occurrence of a predetermined word included in the target portion of vectorization in document data, a second acquisition step of acquiring the number of occurrences or the frequency of occurrence of the predetermined word included in a certain range with respect to the target portion of the document data, and a generation step of generating a vector in which the value obtained by adding the number of occurrences or the frequency of occurrence acquired in the first acquisition step and that acquired in the second acquisition step is used as the value of each element.
  • The data processing program causes a computer to execute a first acquisition step of acquiring the number of occurrences or the frequency of occurrence of a predetermined word included in the target portion of vectorization in document data, a second acquisition step of acquiring the number of occurrences or the frequency of occurrence of the predetermined word included in a certain range with respect to the target portion of the document data, and a generation step of generating a vector in which the value obtained by adding the number of occurrences or the frequency of occurrence acquired in the first acquisition step and that acquired in the second acquisition step is used as the value of each element.
  • FIG. 1 is a diagram schematically showing an example of a configuration of a data processing device according to an embodiment.
  • FIG. 2 is a diagram illustrating a processing flow of the data processing apparatus.
  • FIG. 3 is a diagram showing an example of BoW vector representation by the conventional method.
  • FIG. 4 is a diagram illustrating a processing flow of the data processing apparatus.
  • FIG. 5 is a diagram showing an example of BoW vector representation according to the embodiment.
  • FIG. 6 is a diagram illustrating a processing flow of the data processing apparatus.
  • FIG. 7 is a diagram showing an example of BoW vector representation according to the embodiment.
  • FIG. 8 is a diagram illustrating a processing flow of the data processing apparatus.
  • FIG. 9 is a diagram illustrating a processing flow of the data processing apparatus.
  • FIG. 10 is a flowchart showing a processing procedure of the data processing method according to the embodiment.
  • FIG. 11 is a flowchart showing a processing procedure of the first acquisition process.
  • FIG. 12 is a flowchart showing a processing procedure of the second acquisition process.
  • FIG. 13 is a diagram showing an example of a computer in which a data processing device is realized by executing a program.
  • the object is digitized text document data (hereinafter referred to as document data).
  • the document feature vector (BoW vector) is generated by using the BoW method.
  • In the present embodiment, a vector is generated in which the value of each element is obtained by adding, to the number of occurrences or the frequency of occurrence of each word in the target portion of vectorization, the number of occurrences or the frequency of occurrence of that word in a certain range with respect to the target portion.
  • Even when the target portion of vectorization is a short sentence with a small number of words, the number of occurrences or the frequency of occurrence of words included in the certain range is added to that of the target portion, so an appropriate vector representation is possible.
  • Humans interpret a sentence or a text using background knowledge. In this embodiment, words in the vicinity of the target portion of vectorization are collected and added to the elements of the vector as a substitute for background knowledge, so that the meaning of the sentence or text can be expressed by a vector.
  • a BoW vector with reinforced features is generated by supplementing the number of occurrences or the frequency of appearance of words in the vicinity as background knowledge.
  • Such a BoW vector is used in a method of predicting the necessity of verification from a verification matrix by machine learning or the like.
  • FIG. 1 is a diagram schematically showing an example of a configuration of a data processing device according to an embodiment.
  • the data processing device 10 includes an input unit 11, an output unit 12, a communication unit 13, a control unit 14, and a storage unit 15.
  • the input unit 11 is an input interface that receives various operations from the operator of the data processing device 10.
  • the input unit 11 is composed of an input device such as a touch panel, a voice input device, a keyboard, or a mouse.
  • the communication unit 13 is a communication interface for transmitting and receiving various information to and from other devices connected via a network or the like.
  • the communication unit 13 is realized by a NIC (Network Interface Card) or the like, and communicates between another device and the control unit 14 (described later) via a telecommunication line such as a LAN (Local Area Network) or the Internet.
  • the communication unit 13 receives the data of the document file for which the BoW vector is generated via the network and outputs the data to the control unit 14. Further, the communication unit 13 outputs the BoW vector information generated by the control unit 14 to an external device via the network.
  • the output unit 12 is realized by, for example, a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like, and outputs information indicating a target word and a BoW vector generated by the control unit 14.
  • the control unit 14 controls the entire data processing device 10.
  • the control unit 14 is, for example, an electronic circuit such as a CPU (Central Processing Unit) or MPU (Micro Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array).
  • the control unit 14 has an internal memory for storing programs and control data that define various processing procedures, and executes each process using the internal memory. Further, the control unit 14 functions as various processing units by operating various programs.
  • the control unit 14 has a first acquisition unit 141, a second acquisition unit 142, and a generation unit 143.
  • the first acquisition unit 141 acquires the number of occurrences or the frequency of occurrence of a predetermined word included in the target portion of vectorization in the document data.
  • the first acquisition unit 141 includes a first decomposition unit 1411, a first deletion unit 1412, and a first element acquisition unit 1413.
  • the first decomposition unit 1411 decomposes the target portion of the document data for vectorization into each word using, for example, a morphological analysis tool such as MeCab.
  • the first deletion unit 1412 sorts each word decomposed by the first decomposition unit 1411, and then deletes duplicate words.
  • the first deletion unit 1412 also selects unnecessary words (stop words) and deletes them. Stop words are words that are not useful for characterizing document data, such as "the", "a", "is", "have", "take", "ha", "no", and "masu".
  • Each word obtained after the processing by the first deletion unit 1412 is a predetermined word to be vectorized.
  • the first element acquisition unit 1413 obtains the number of occurrences of each word obtained after processing by the first deletion unit 1412, and acquires each appearance number as the first element.
  • the first element acquisition unit 1413 may acquire the appearance frequency of each word obtained after the processing by the first deletion unit 1412 as the first element.
  • the frequency of appearance refers to, for example, the ratio of the corresponding word to the total number of words included in the target portion of vectorization.
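
The first acquisition process described above (decompose, delete duplicates and stop words, then count) might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: a regular-expression tokenizer stands in for morphological analysis with a tool such as MeCab, the stop-word set is a small subset of the examples in the text, the frequency denominator is taken to be all tokens of the target portion, and the names `tokenize` and `first_acquisition` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a small subset of the stop words listed in the text.
STOP_WORDS = {"the", "a", "is", "have", "take"}


def tokenize(text):
    # Stand-in for morphological analysis (the source names MeCab as one option).
    return re.findall(r"[a-z]+", text.lower())


def first_acquisition(target_text, use_frequency=False):
    tokens = tokenize(target_text)
    # Sort, delete duplicates, and delete stop words to get the target words.
    target_words = sorted(set(tokens) - STOP_WORDS)
    counts = Counter(tokens)
    if use_frequency:
        # Ratio of each word to the total number of words in the target portion.
        total = len(tokens)
        return {word: counts[word] / total for word in target_words}
    return {word: counts[word] for word in target_words}


sample = "The function checks the function state."  # hypothetical target portion
first_counts = first_acquisition(sample)
first_freqs = first_acquisition(sample, use_frequency=True)
print(first_counts)
```

Returning a dictionary keyed by target word keeps the word-to-element correspondence explicit until the vector is finally assembled.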
  • the second acquisition unit 142 acquires the number of occurrences or the frequency of appearance of the predetermined word included in a certain range with respect to the target portion of the document data.
  • a certain range is a sentence that includes a target part.
  • a certain range is a paragraph that includes the target part.
  • the second acquisition unit 142 includes an extraction unit 1421, a second decomposition unit 1422, a second deletion unit 1423, and a second element acquisition unit 1424.
  • the extraction unit 1421 extracts a certain range from the document data with respect to the target portion for vectorization. Specifically, the extraction unit 1421 extracts a sentence including the target portion or a paragraph including the target portion from the document data.
  • the second decomposition unit 1422 decomposes the certain range extracted by the extraction unit 1421 for the vectorization target portion into words using, for example, a morphological analysis tool such as MeCab.
  • the second deletion unit 1423 sorts each word decomposed by the second decomposition unit 1422, and then deletes the duplicate word.
  • the second deletion unit 1423 selects a stop word and deletes the stop word.
  • the second element acquisition unit 1424 executes the following processing for a predetermined word to be vectorized among the words obtained after the deletion process.
  • the second element acquisition unit 1424 obtains the number of occurrences of a predetermined word to be vectorized obtained after the processing by the second deletion unit 1423, and acquires each appearance number as the second element.
  • When the first element acquisition unit 1413 acquires the appearance frequency as the first element, the second element acquisition unit 1424 acquires, as the second element, the appearance frequency of each predetermined word to be vectorized obtained after processing by the second deletion unit 1423.
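
A minimal sketch of the second acquisition process follows. It assumes that paragraphs are separated by blank lines, that the target portion is removed from the extracted paragraph before counting (as in the worked example later in the text), and that a regular-expression tokenizer stands in for morphological analysis. The sample document and the names `extract_paragraph` and `second_acquisition` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Stand-in for morphological analysis (the source names MeCab as one option).
    return re.findall(r"[a-z]+", text.lower())


def extract_paragraph(document, target):
    # Assumption: paragraphs are separated by blank lines.
    for paragraph in document.split("\n\n"):
        if target in paragraph:
            return paragraph
    return ""


def second_acquisition(document, target, target_words):
    paragraph = extract_paragraph(document, target)
    # Count only the surroundings; the target portion itself is handled
    # by the first acquisition process.
    surrounding = paragraph.replace(target, " ", 1)
    counts = Counter(tokenize(surrounding))
    return {word: counts[word] for word in target_words}


document = (
    "An unrelated opening paragraph.\n\n"
    "The function is new. Add a function now. Each function has a state."
)
second = second_acquisition(document, "Add a function now.", ["add", "function", "now"])
print(second)
```

Only the vectorization target words are looked up, so the result lines up element by element with the output of the first acquisition process.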
  • The generation unit 143 generates a vector in which the value of each element is obtained by adding the number of appearances or the frequency of appearance of each word acquired by the first acquisition unit 141 and the number of appearances or the frequency of appearance of each word acquired by the second acquisition unit 142.
  • The generation unit 143 may also weight the number of occurrences or the frequency of occurrence of each word acquired by the first acquisition unit 141, and set, as the value of each element of the vector, the value obtained by adding the weighted value and the number of occurrences or the frequency of appearance of each word acquired by the second acquisition unit 142.
  • the storage unit 15 is a storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or an optical disk.
  • the storage unit 15 may be a semiconductor memory that can rewrite data such as RAM (Random Access Memory), flash memory, NVSRAM (Non Volatile Static Random Access Memory), and the like.
  • the storage unit 15 stores an OS (Operating System) and various programs executed by the data processing device 10. Further, the storage unit 15 stores various information used in executing the program.
  • the storage unit 15 stores the document data 151.
  • the document data 151 is digitized text document data, and includes an electronic file document to be processed by the data processing device 10.
  • the first decomposition unit 1411 decomposes the target portion of the electronic file document into words by morphological analysis (see (1) in FIG. 2).
  • the target part of vectorization is, for example, text T1 "Add a function that meets the prerequisites.”
  • the first deletion unit 1412 sorts each word decomposed by the first decomposition unit, and then deletes the duplicated words (see (2) in FIG. 2).
  • the first deletion unit 1412 selects unnecessary words (stop words) (see (3) in FIG. 2) and excludes stop words (see (4) in FIG. 2). Each word remaining after the stop word is deleted becomes the target word for BoW vectorization.
  • the first element acquisition unit 1413 acquires the number of occurrences of each target word in the BoW vectorization as the first element (see (4) in FIG. 2).
  • List L1 is a list of the number of occurrences of the target word for vectorization in the text T1.
  • Each number in the right column of the list L1 is the number of occurrences of the target word for each vectorization.
  • In the conventional method, a vector is generated by using each number of occurrences in the list L1 as an element value.
  • FIG. 3 is a diagram showing an example of BoW vector representation by the conventional method.
  • In each element, the left side of the colon (:) is the word number (index) and the right side of the colon (:) is the number of occurrences of that word.
  • the word number "1" corresponds to the "function" of the list L1
  • the word number "2” corresponds to the "match” of the list L1
  • the word number "3" corresponds to the "premise” of the list L1.
  • As in the vector representation example of FIG. 3, the elements from word number "6" onward are "0", so the conventional method cannot sufficiently express the characteristics of the sentence.
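
The "word number:occurrences" notation described for the conventional BoW vector could be produced along these lines. Only the word numbers of "function", "match", and "premise" and the count "2" for "function" come from the text; the remaining vocabulary entries (`word4` to `word7`), the other counts, and the helper name `to_index_count_pairs` are hypothetical placeholders.

```python
def to_index_count_pairs(counts, vocabulary):
    """Render a BoW vector as space-separated 'index:count' pairs (1-based index)."""
    return " ".join(
        f"{index}:{counts.get(word, 0)}"
        for index, word in enumerate(vocabulary, start=1)
    )


# Word numbers 1 to 3 follow the text; word4..word7 are placeholders.
vocabulary = ["function", "match", "premise", "word4", "word5", "word6", "word7"]
# Counts other than "function" are illustrative assumptions.
list_l1 = {"function": 2, "match": 1, "premise": 1, "word4": 1, "word5": 1}

conventional = to_index_count_pairs(list_l1, vocabulary)
print(conventional)
```

Words absent from the target portion fall back to 0, which is exactly where the conventional representation loses information for short texts.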
  • the data processing device 10 further collects the words of a sentence or paragraph including the target portion, and acquires the number of occurrences of each vectorization target word in this sentence or paragraph as the second element. This will be described specifically with reference to FIG. 4.
  • the extraction unit 1421 extracts a sentence including the target part or a paragraph including the target part from the electronic file document (see (5) in FIG. 4).
  • For example, the extraction unit 1421 extracts paragraph P1 "The conditions are not changed. The communication function performs error processing. The setting is to confirm the conditions for each function." shown in (6) of FIG. 4.
  • The extracted paragraph contains the text T1 to be vectorized, "Add a function that meets the prerequisites."
  • In the figure, paragraph P1 is shown with the text T1 deleted, and the subsequent processing is likewise executed on paragraph P1 with the text T1 deleted.
  • the second acquisition unit 142 acquires the number of occurrences of each target word in the BoW vectorization as the second element in the extracted sentence or paragraph (see (6) in FIG. 4).
  • the second decomposition unit 1422 decomposes the paragraph P1 extracted by the extraction unit 1421 into words by morphological analysis, and the second deletion unit 1423 deletes duplicate words and stop words.
  • the second element acquisition unit 1424 acquires the number of occurrences of each target word in the BoW vectorization as the second element.
  • List L2 is a list of the number of occurrences of each vectorized word in the extracted paragraph P1.
  • the generation unit 143 generates a BoW vector in which the value of each element is obtained by adding the number of occurrences of the vectorization target word acquired by the first acquisition unit 141 and the number of appearances of the vectorization target word acquired by the second acquisition unit 142 (see (7) in FIG. 4).
  • the number of occurrences of each word in the list L3 is a value obtained by adding the number of appearances in the list L1 and the number of appearances in the list L2 for each word. In this way, the generation unit 143 generates a BoW vector in which the value obtained by adding the number of occurrences of the list L1 and the list L2 for each word is used as the value of each element.
  • FIG. 5 is a diagram showing an example of BoW vector representation according to the embodiment.
  • the BoW vector shown in FIG. 5 corresponds to the list L3 in FIG. 4, and the correspondence between word numbers and words is the same as in the preceding BoW vector example.
  • the value of each element of the BoW vector in FIG. 5 is a value obtained by adding the number of occurrences of each word in the paragraph P1 including the text T1 to the number of appearances of each word in the text T1 to be vectorized.
  • the BoW vector shown in FIG. 5 has a value larger than "0" for the element values after the word number "6" as compared with the BoW vector shown in FIG. That is, according to the data processing device 10, since element values other than "0" can be given to more words than in the past, the characteristics of each word can be sufficiently expressed.
  • each element of the BoW vector includes the number of occurrences of a word in a sentence or paragraph including the target part, and by using this as a substitute for background knowledge, the meaning of the sentence or text can be expressed by a vector.
  • For example, as shown in the list L3, the generation unit 143 reinforces the number of occurrences of "function" (word number "1") to "4" by this addition (see frame W1 in FIG. 5).
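
The element-wise addition of lists L1 and L2 into list L3 can be illustrated as below. Only the counts for "function" (2 in L1, 2 in L2, 4 in L3) come from the text; the other words and counts are hypothetical, chosen to show zero-valued elements becoming non-zero, and the helper name `add_lists` is an assumption.

```python
def add_lists(first, second):
    """Add per-word counts; words missing from `second` contribute 0."""
    return {word: first[word] + second.get(word, 0) for word in first}


# "function" follows the text; all other entries are illustrative.
list_l1 = {"function": 2, "match": 1, "premise": 1, "error": 0, "setting": 0}
list_l2 = {"function": 2, "match": 0, "premise": 0, "error": 1, "setting": 1}

list_l3 = add_lists(list_l1, list_l2)
nonzero_before = sum(1 for value in list_l1.values() if value)
nonzero_after = sum(1 for value in list_l3.values() if value)
print(list_l3, nonzero_before, nonzero_after)
```

In this toy data the number of non-zero elements rises from 3 to 5, which mirrors the reinforcement effect the embodiment describes.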
  • the generation unit 143 may weight each appearance number of the list L1 by multiplying it by a weight (for example, "2"). Then, the generation unit 143 adds each number of occurrences of the weighted list L1 and each number of appearances of the list L2 for each word, and sets each value (see list L4) as the value of each element of the BoW vector.
  • FIG. 7 is a diagram showing an example of BoW vector representation according to the embodiment.
  • the BoW vector shown in FIG. 7 corresponds to the list L4 in FIG. 6, and the correspondence between word numbers and words is the same as in the preceding BoW vector examples.
  • For the vectorization target word "function" (word number "1"), the generation unit 143 weights the number of occurrences "2" in the list L1 by multiplying it by "2", and adds the number of appearances "2" in the list L2 to the resulting value "4", reinforcing the number of appearances to "6" (see frame W2 in FIG. 7).
  • By multiplying the number of occurrences of each word in the vectorization target portion by a weight greater than 1, the generation unit 143 can generate a feature vector that emphasizes the words in the original vectorization target portion.
  • the generation unit 143 may use a weight other than "2".
  • the generation unit 143 may set the weight to "10".
  • In that case, for the vectorization target word "function", the generation unit 143 adds the number of occurrences "2" in the list L2 to the value "20", obtained by multiplying the number of occurrences "2" in the list L1 by the weight "10", and sets the number of appearances to "22".
  • The weight is not limited to these values and may be "3" or "4"; it may be adjusted according to how strongly the words of the original vectorization target portion are to be emphasized.
  • The generation unit 143 may also set the weight to a value less than 1, for example a fraction (for example, "1/3") or a decimal (for example, "0.2"), to weaken the emphasis on the words in the original vectorization target portion.
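
The weighting arithmetic for the word "function" can be reproduced with a short sketch. The counts (2 in list L1, 2 in list L2) and the weights 2, 10, and 1/3 come from the text; the function name `reinforce` is hypothetical, and `fractions.Fraction` is used only to keep the fractional weight exact.

```python
from fractions import Fraction


def reinforce(first_count, second_count, weight=1):
    """Weighted first element plus unweighted second element."""
    return weight * first_count + second_count


FUNCTION_L1 = 2  # occurrences of "function" in list L1 (from the text)
FUNCTION_L2 = 2  # occurrences of "function" in list L2 (from the text)

unweighted = reinforce(FUNCTION_L1, FUNCTION_L2)
weight_2 = reinforce(FUNCTION_L1, FUNCTION_L2, weight=2)
weight_10 = reinforce(FUNCTION_L1, FUNCTION_L2, weight=10)
weight_third = reinforce(FUNCTION_L1, FUNCTION_L2, weight=Fraction(1, 3))
print(unweighted, weight_2, weight_10, weight_third)
```

Weights above 1 shift the balance toward the target portion, while weights below 1 let the surrounding sentence or paragraph dominate.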
  • FIGS. 8 and 9 are diagrams for explaining the processing flow of the data processing device 10.
  • The first counting method counts the occurrences of the words to be vectorized directly in paragraph P1 (see (1) and (2) in FIG. 8), which is extracted from the electronic file document by a search (see list L2 in (3) of FIG. 8). In the first counting method, since there is only one text to be counted, a simple process is sufficient.
  • In the second counting method, the second acquisition unit 142 extracts the paragraph P1 including the text T1, which is the target portion, from the electronic file document. Then, the second acquisition unit 142 searches the paragraph P1 for each word in the list L1 and extracts the sentences that include each vectorization target word.
  • For example, for the word "function", the second acquisition unit 142 searches paragraph P1 (see (1-1) in FIG. 9) and extracts the text T3 "The communication function performs error processing." and the text T4 "The setting is to confirm the conditions for each function.", both of which include "function" (see (2-1) in FIG. 9).
  • the second acquisition unit 142 counts the number of occurrences of each word appearing in the texts T3 and T4 (see (3-1) in FIG. 9), and generates the list La1.
  • the second acquisition unit 142 performs the same processing for other words. Specifically, the second acquisition unit 142 searches paragraph P1 for the word “match” (see (1-2) in FIG. 9) and extracts text containing “match”. In this case, since the corresponding text is not in paragraph P1 (see (2-2) in FIG. 9), the second acquisition unit 142 generates a list La2 in which the number of occurrences of each word is “0”. (See (3-2) in FIG. 9).
  • the second acquisition unit 142 searches the paragraph P1 for the word "condition" (see (1-3) in FIG. 9), and extracts the text T2 "The conditions are not changed." and the text T4 "The setting is to confirm the conditions for each function.", both of which include "condition" (see (2-3) in FIG. 9).
  • the second acquisition unit 142 counts the number of occurrences of each word appearing in the texts T2 and T4 (see (3-3) in FIG. 9), and generates the list La4.
  • the generation unit 143 adds the count numbers for each vectorization target word, such as "function", "match", and "premise", and generates a BoW vector using the added value as the value of each element (see list L11 in (4) of FIG. 9).
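
One possible reading of this per-word counting method is sketched below: for each target word, the sentences of paragraph P1 that contain it are collected, the target words are counted within those sentences, and the resulting per-word lists are summed. The paragraph text follows the English rendering of P1; the tiny `LEMMAS` map is a stand-in for morphological analysis, and all function names are hypothetical. The numbers this sketch yields are not claimed to match the values of list L11 in the figure.

```python
import re
from collections import Counter

# Stand-in for lemmatization by a morphological analyzer (assumption).
LEMMAS = {"conditions": "condition"}


def tokenize(text):
    return [LEMMAS.get(token, token) for token in re.findall(r"[a-z]+", text.lower())]


def split_sentences(paragraph):
    return [s for s in re.split(r"(?<=\.)\s+", paragraph.strip()) if s]


def per_word_lists(paragraph, target_words):
    """For each target word, count target words in the sentences containing it."""
    sentences = [tokenize(sentence) for sentence in split_sentences(paragraph)]
    lists = {}
    for word in target_words:
        hits = [tokens for tokens in sentences if word in tokens]
        counts = Counter(token for tokens in hits for token in tokens)
        lists[word] = {target: counts[target] for target in target_words}
    return lists


def sum_lists(lists, target_words):
    return {t: sum(single[t] for single in lists.values()) for t in target_words}


paragraph_p1 = (
    "The conditions are not changed. "
    "The communication function performs error processing. "
    "The setting is to confirm the conditions for each function."
)
targets = ["function", "match", "condition"]
lists = per_word_lists(paragraph_p1, targets)
totals = sum_lists(lists, targets)
print(lists)
print(totals)
```

Under this reading, a sentence that contains several target words is counted once per matching word, so words that co-occur with many target words receive extra emphasis.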
  • FIG. 10 is a flowchart showing a processing procedure of the data processing method according to the embodiment.
  • When the first acquisition unit 141 receives the input of the electronic file document (step S1), it performs the first acquisition process of acquiring the number of occurrences of words included in the target portion of vectorization in the electronic file document (step S2). The first acquisition unit 141 may instead obtain the frequency of occurrence of the target words for vectorization.
  • The second acquisition unit 142 performs a second acquisition process of acquiring the number of occurrences of the vectorization target words included in a certain range with respect to the target portion of the electronic file document (step S3).
  • The second acquisition unit 142 may instead obtain the frequency of appearance of a predetermined word included in a certain range with respect to the target portion for vectorization.
  • The generation unit 143 performs a generation process of generating a BoW vector in which the value obtained by adding the number of appearances acquired in the first acquisition process and the number of appearances acquired in the second acquisition process is used as the value of each element (step S4).
  • The generation unit 143 may weight the number of appearances acquired in the first acquisition process and use, as the value of each element, the value obtained by adding the weighted number of appearances and the number of appearances or the frequency of appearance acquired in the second acquisition process.
  • FIG. 11 is a flowchart showing a processing procedure of the first acquisition process.
  • the first decomposition unit 1411 decomposes the target portion of vectorization of the electronic file document into words by morphological analysis (step S11). Then, the first deletion unit 1412 sorts each word decomposed by the first decomposition unit 1411 and deletes the duplicate word and the stop word (step S12).
  • the first element acquisition unit 1413 obtains the number of occurrences of each vectorized target word obtained after the processing of step S12 (step S13), and acquires each number of occurrences as the first element.
  • the first element acquisition unit 1413 may acquire the appearance frequency of each vectorized target word obtained after the processing by the first deletion unit 1412 as the first element.
  • FIG. 12 is a flowchart showing a processing procedure of the second acquisition process.
  • the extraction unit 1421 extracts a sentence including the target part or a paragraph including the target part from the electronic file document (step S21).
  • The second decomposition unit 1422 decomposes the sentence or paragraph extracted by the extraction unit 1421 into words by morphological analysis (step S22), and the second deletion unit 1423 sorts the words decomposed by the second decomposition unit 1422 and deletes duplicate words and stop words (step S23).
  • The second element acquisition unit 1424 obtains the number of occurrences of each vectorization target word obtained after the processing of step S23 (step S24), and acquires each number of occurrences as the second element.
  • Alternatively, the second element acquisition unit 1424 may acquire the appearance frequency of each word obtained after the processing of step S23 as the second element.
  • The generation unit 143 generates a vector using, as the value of each element, the value obtained by adding the first element acquired in the first acquisition process and the second element acquired in the second acquisition process.
  • As described above, the data processing device 10 generates a vector in which the value of each element is obtained by adding the number of occurrences or the frequency of occurrence of words included in the target portion of vectorization and the number of occurrences or the frequency of occurrence of those words included in a certain range with respect to the target portion in the document data.
  • Because the data processing device 10 adds the number of occurrences or the frequency of occurrence of words included in the certain range to that of the words in the target portion of vectorization, an appropriate vector representation is possible even if the target portion is a short sentence with a small number of words. Further, the data processing device 10 collects words in the certain range with respect to the target portion of vectorization and adds them to the elements of the vector as a substitute for background knowledge, thereby expressing the meaning of the sentence or text as a BoW vector.
  • The data processing device 10 may weight the number of occurrences or the frequency of occurrence of words included in the target portion of vectorization, and use, as the value of each element, the value obtained by adding the weighted value and the number of occurrences or the frequency of occurrence of the vectorization target words included in the certain range with respect to the target portion.
  • By weighting the number of occurrences or the frequency of occurrence of each word in the target portion of vectorization, the data processing device 10 can generate a BoW vector in which the priority of the original target portion of vectorization is adjusted.
  • a certain range for the target part of vectorization is a sentence that includes the target part or a paragraph that includes the target part.
  • The data processing device 10 therefore adds the number of occurrences or the frequency of occurrence of words in a portion highly related to the target portion to the number of appearances or the frequency of appearance of the words included in the target portion of vectorization.
  • As a result, a BoW vector whose features are reinforced can be generated by supplementing, as background knowledge, the number of appearances or the appearance frequency of words in the vicinity of the target portion.
  • Each component of each of the illustrated devices is a functional concept and does not necessarily have to be physically configured as shown in the figure. That is, the specific form of distribution and integration of each device is not limited to the one shown in the figure, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions. Further, each processing function performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware by wired logic.
  • FIG. 13 is a diagram showing an example of a computer in which the data processing device 10 is realized by executing a program.
  • the computer 1000 has, for example, a memory 1010 and a CPU 1020.
  • the computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • the ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System).
  • the hard disk drive interface 1030 is connected to the hard disk drive 1090.
  • the disk drive interface 1040 is connected to the disk drive 1100.
  • a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100.
  • the serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120.
  • the video adapter 1060 is connected to, for example, the display 1130.
  • the hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the data processing device 10 is implemented as a program module 1093 in which a code that can be executed by a computer is described.
  • the program module 1093 is stored in, for example, the hard disk drive 1090.
  • a program module 1093 for executing processing similar to the functional configuration in the data processing device 10 is stored in the hard disk drive 1090.
  • the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
  • the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as needed and executes them.
  • the program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read by the CPU 1020 from another computer via the network interface 1070.

Abstract

A data processing device (10) comprises: a first acquisition unit (141) which acquires the number of occurrences or the frequency of occurrence of a prescribed word within a target portion to be vectorized that is included in document data; a second acquisition unit (142) which acquires the number of occurrences or the frequency of occurrence of the prescribed word within a certain region of the document data, said certain region being associated with the target portion; and a generation unit (143) which generates a vector, each element of which has a value equal to the value obtained by adding the number of occurrences or the frequency of occurrence acquired by the first acquisition unit (141) to the number of occurrences or the frequency of occurrence acquired by the second acquisition unit (142).

Description

Data processing device, data processing method, and data processing program
The present invention relates to a data processing device, a data processing method, and a data processing program.
There are methods for classifying documents by converting the linguistic expressions of sentences and words into representations that are mathematically easy to handle. For example, the bag-of-words (BoW) method, which represents document data as a vector, has been proposed. In the BoW method, the number of occurrences or the frequency of occurrence of each word is used as a vector element. For example, when machine learning is applied to natural language processing, representing sentences, passages, and words as mathematically tractable vectors makes it possible to obtain the similarity between documents from the similarity between their vectors.
Japanese Unexamined Patent Publication No. 9-297766
However, the BoW method has a data sparseness problem: when a short sentence contains few words, a vector that represents the features of the sentence cannot be generated appropriately.
The data sparseness problem is explained using an example in which a Japanese dictionary contains 50,000 words and a vector having these words as elements is generated. In this example, when a newspaper article consisting of 100 words is vectorized, at least 49,900 elements are "0". In other words, most elements are "0". Because the topic is unlikely to change frequently within a single article, the same words are used many times. That is, most vector elements are again likely to be "0". Therefore, even if a large number of articles are prepared, it can be difficult to determine how likely a given word is to appear.
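The arithmetic in this example can be checked with a short sketch. The figures (a 50,000-word dictionary and a 100-word article) are those given above; the variable names are illustrative and do not come from the source.

```python
# Sparsity arithmetic for the dictionary example (variable names are illustrative).
vocabulary_size = 50_000   # words in the dictionary, one vector element per word
article_length = 100       # words in the newspaper article

# Even if all 100 words are distinct, at most 100 elements can be nonzero.
max_nonzero_elements = min(article_length, vocabulary_size)
min_zero_elements = vocabulary_size - max_nonzero_elements
max_nonzero_ratio = max_nonzero_elements / vocabulary_size

print(min_zero_elements)   # 49900
print(max_nonzero_ratio)   # 0.002
```

Repeated words only lower the number of nonzero elements further, which is why the text treats 49,900 as a lower bound on the zeros.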
In general, data is said to be sparse when the number of elements taking a value other than "0" tends to be small. When data is sparse, the statistics needed to process the data cannot be obtained in sufficient quantity; this is called the data sparseness problem.
The present invention has been made in view of the above, and an object thereof is to provide a data processing device, a data processing method, and a data processing program capable of appropriate vector representation even for a short sentence having a small number of words.
In order to solve the above-described problem and achieve the object, a data processing device according to the present invention includes: a first acquisition unit that acquires the number of occurrences or the frequency of occurrence of a predetermined word included in a target portion of vectorization in document data; a second acquisition unit that acquires, for a certain range of the document data relative to the target portion, the number of occurrences or the frequency of occurrence of the predetermined word included in the certain range; and a generation unit that generates a vector in which each element has a value obtained by adding the number of occurrences or the frequency of occurrence acquired by the first acquisition unit to the number of occurrences or the frequency of occurrence acquired by the second acquisition unit.
Further, a data processing method according to the present invention is a data processing method executed by a data processing device, and includes: a first acquisition step of acquiring the number of occurrences or the frequency of occurrence of a predetermined word included in a target portion of vectorization in document data; a second acquisition step of acquiring, for a certain range of the document data relative to the target portion, the number of occurrences or the frequency of occurrence of the predetermined word included in the certain range; and a generation step of generating a vector in which each element has a value obtained by adding the number of occurrences or the frequency of occurrence acquired in the first acquisition step to the number of occurrences or the frequency of occurrence acquired in the second acquisition step.
Further, a data processing program according to the present invention causes a computer to execute: a first acquisition step of acquiring the number of occurrences or the frequency of occurrence of a predetermined word included in a target portion of vectorization in document data; a second acquisition step of acquiring, for a certain range of the document data relative to the target portion, the number of occurrences or the frequency of occurrence of the predetermined word included in the certain range; and a generation step of generating a vector in which each element has a value obtained by adding the number of occurrences or the frequency of occurrence acquired in the first acquisition step to the number of occurrences or the frequency of occurrence acquired in the second acquisition step.
According to the present invention, even short document data can be appropriately represented as a vector.
FIG. 1 is a diagram schematically showing an example of the configuration of a data processing device according to an embodiment.
FIG. 2 is a diagram illustrating a processing flow of the data processing device shown in FIG. 1.
FIG. 3 is a diagram showing an example of a BoW vector representation by the conventional method.
FIG. 4 is a diagram illustrating a processing flow of the data processing device shown in FIG. 1.
FIG. 5 is a diagram showing an example of a BoW vector representation according to the embodiment.
FIG. 6 is a diagram illustrating a processing flow of the data processing device shown in FIG. 1.
FIG. 7 is a diagram showing an example of a BoW vector representation according to the embodiment.
FIG. 8 is a diagram illustrating a processing flow of the data processing device shown in FIG. 1.
FIG. 9 is a diagram illustrating a processing flow of the data processing device shown in FIG. 1.
FIG. 10 is a flowchart showing a processing procedure of the data processing method according to the embodiment.
FIG. 11 is a flowchart showing a processing procedure of the first acquisition process shown in FIG. 10.
FIG. 12 is a flowchart showing a processing procedure of the first acquisition process shown in FIG. 10.
FIG. 13 is a diagram showing an example of a computer in which the data processing device is realized by executing a program.
Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited by this embodiment. In the drawings, identical parts are denoted by the same reference numerals.
[Embodiment]
An embodiment of the present invention will be described. The embodiment assumes that the target is digitized text document data (hereinafter referred to as document data). In the present embodiment, a document feature vector (BoW vector) is generated using the BoW method.
In the present embodiment, a vector is generated in which each element has a value obtained by adding the number of occurrences or the frequency of occurrence of a predetermined word, which is a word included in the target portion of vectorization in the document data, to the number of occurrences or the frequency of occurrence of the predetermined word in a certain range relative to the target portion.
As a result, in the present embodiment, even if the target portion of vectorization is a short sentence with a small number of words, an appropriate vector representation is possible because the number of occurrences or the frequency of occurrence of words in a certain range relative to the target portion is added. In addition, humans interpret a sentence or passage using background knowledge; in the present embodiment, words in the vicinity of the target portion of vectorization are collected and added to the elements of the vector as a substitute for background knowledge, so that the meaning of the sentence or passage can be expressed as a vector.
In this way, even for a short sentence with a small number of words, the present embodiment generates a BoW vector with reinforced features by supplementing the number of occurrences or the frequency of occurrence of words in the nearby portion as background knowledge. Such a BoW vector is used, for example, in a method of predicting by machine learning, from a verification matrix, whether verification is necessary.
[Data processing device configuration]
First, the configuration of the data processing device according to the embodiment will be described. FIG. 1 is a diagram schematically showing an example of the configuration of the data processing device according to the embodiment. As shown in FIG. 1, the data processing device 10 includes an input unit 11, an output unit 12, a communication unit 13, a control unit 14, and a storage unit 15.
The input unit 11 is an input interface that receives various operations from the operator of the data processing device 10. For example, the input unit 11 is composed of input devices such as a touch panel, a voice input device, a keyboard, and a mouse.
The communication unit 13 is a communication interface that transmits and receives various kinds of information to and from other devices connected via a network or the like. The communication unit 13 is realized by a NIC (Network Interface Card) or the like, and performs communication between other devices and the control unit 14 (described later) via a telecommunication line such as a LAN (Local Area Network) or the Internet. For example, the communication unit 13 receives, via the network, the data of a document file for which a BoW vector is to be generated, and outputs the data to the control unit 14. The communication unit 13 also outputs the information on the BoW vector generated by the control unit 14 to an external device via the network.
The output unit 12 is realized by, for example, a display device such as a liquid crystal display, a printing device such as a printer, or an information communication device, and outputs information indicating the target words and the BoW vector generated by the control unit 14.
The control unit 14 controls the entire data processing device 10. The control unit 14 is, for example, an electronic circuit such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The control unit 14 has an internal memory for storing programs that define various processing procedures and control data, and executes each process using the internal memory. The control unit 14 also functions as various processing units when various programs operate. The control unit 14 includes a first acquisition unit 141, a second acquisition unit 142, and a generation unit 143.
The first acquisition unit 141 acquires the number of occurrences or the frequency of occurrence of a predetermined word included in the target portion of vectorization in the document data. The first acquisition unit 141 includes a first decomposition unit 1411, a first deletion unit 1412, and a first element acquisition unit 1413.
The first decomposition unit 1411 decomposes the target portion of vectorization in the document data into individual words using, for example, a morphological analysis tool such as MeCab.
The first deletion unit 1412 sorts the words produced by the first decomposition unit 1411 and then deletes duplicate words. The first deletion unit 1412 also selects unnecessary words (stop words) and deletes them. Stop words are words that are not useful for capturing the features of document data, for example, "the", "a", "is", "have", "take", "は", "の", "です", and "ます". The words that remain after the processing by the first deletion unit 1412 are the predetermined words to be vectorized.
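The sort, de-duplicate, and stop-word steps can be sketched as follows. This is a minimal illustration, not the device's implementation: the function name `select_target_words` and the `STOP_WORDS` set are hypothetical, and the set simply reuses the example stop words listed above.

```python
# Hypothetical stop-word set; the source lists these words only as examples.
STOP_WORDS = {"the", "a", "is", "have", "take", "は", "の", "です", "ます"}


def select_target_words(tokens, stop_words=STOP_WORDS):
    """Sort the tokens, drop duplicates, then drop stop words."""
    unique_sorted = sorted(set(tokens))
    return [word for word in unique_sorted if word not in stop_words]


tokens = ["the", "function", "is", "a", "function", "condition"]
target_words = select_target_words(tokens)
print(target_words)  # ['condition', 'function']
```

The returned list plays the role of the vectorization target words; occurrence counts would still be taken from the original token list, since duplicates are removed here.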
The first element acquisition unit 1413 obtains the number of occurrences of each word remaining after the processing by the first deletion unit 1412 and acquires each number of occurrences as a first element. The first element acquisition unit 1413 may instead acquire the frequency of occurrence of each of these words as the first element. The frequency of occurrence is, for example, the proportion of the total number of words in the target portion of vectorization that is accounted for by the word in question.
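A minimal sketch of the first element, under the assumption that the total word count is taken over all tokens of the target portion (the source does not say whether stop words are included in that total). The function name `first_elements` and its parameters are illustrative.

```python
from collections import Counter


def first_elements(tokens, target_words, use_frequency=False):
    """Return the count (or frequency) of each target word in the target portion."""
    counts = Counter(tokens)
    if not use_frequency:
        return {word: counts[word] for word in target_words}
    total = len(tokens)  # assumption: every token of the target portion is counted
    return {word: (counts[word] / total if total else 0.0) for word in target_words}


tokens = ["function", "match", "function", "premise"]
print(first_elements(tokens, ["function", "match", "premise"]))
# {'function': 2, 'match': 1, 'premise': 1}
print(first_elements(tokens, ["function"], use_frequency=True))
# {'function': 0.5}
```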
The second acquisition unit 142 acquires, for a certain range of the document data relative to the target portion, the number of occurrences or the frequency of occurrence of the predetermined word included in the certain range. Specifically, the certain range is the sentence that includes the target portion or the paragraph that includes the target portion. The second acquisition unit 142 includes an extraction unit 1421, a second decomposition unit 1422, a second deletion unit 1423, and a second element acquisition unit 1424.
The extraction unit 1421 extracts, from the document data, the certain range relative to the target portion of vectorization. Specifically, the extraction unit 1421 extracts from the document data the sentence that includes the target portion or the paragraph that includes the target portion.
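One way to picture the extraction step is the sketch below. It assumes that paragraphs are separated by newlines and that sentences end with "。", ".", "!", or "?"; the source does not specify how boundaries are detected, and the function name `extract_range` is hypothetical.

```python
import re


def extract_range(document, target, unit="paragraph"):
    """Return the first paragraph or sentence containing `target`, or None."""
    paragraphs = [p.strip() for p in document.split("\n") if p.strip()]
    if unit == "paragraph":
        candidates = paragraphs
    elif unit == "sentence":
        # Assumed sentence boundary: text up to and including 。 . ! or ?
        candidates = [
            sentence.strip()
            for paragraph in paragraphs
            for sentence in re.findall(r"[^。.!?]+[。.!?]?", paragraph)
            if sentence.strip()
        ]
    else:
        raise ValueError(f"unknown unit: {unit}")
    return next((c for c in candidates if target in c), None)


document = (
    "Overview of the update.\n"
    "Do not change the conditions. Add a function that meets the prerequisites. "
    "Check the conditions for each function.\n"
    "End of notes."
)
target = "meets the prerequisites"
print(extract_range(document, target, unit="sentence"))
# Add a function that meets the prerequisites.
```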
The second decomposition unit 1422 decomposes the certain range extracted by the extraction unit 1421 into individual words using, for example, a morphological analysis tool such as MeCab.
The second deletion unit 1423 sorts the words produced by the second decomposition unit 1422 and then deletes duplicate words. The second deletion unit 1423 also selects stop words and deletes them.
The second element acquisition unit 1424 executes the following processing for the predetermined words to be vectorized among the words obtained after the deletion processing. The second element acquisition unit 1424 obtains the number of occurrences of each predetermined word to be vectorized that remains after the processing by the second deletion unit 1423, and acquires each number of occurrences as a second element. When the first element acquisition unit 1413 acquires the frequency of occurrence as the first element, the second element acquisition unit 1424 acquires, as the second element, the frequency of occurrence of each predetermined word to be vectorized that remains after the processing by the second deletion unit 1423.
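The key point of the second element is that only the vectorization target words are counted in the surrounding range. A minimal sketch, with a hypothetical function name and illustrative tokens:

```python
from collections import Counter


def second_elements(range_tokens, target_words):
    """Count only the vectorization target words within the surrounding range."""
    counts = Counter(range_tokens)
    # Words of the range that are not target words are ignored;
    # target words absent from the range get a count of 0.
    return {word: counts[word] for word in target_words}


range_tokens = ["condition", "change", "communication", "function",
                "error", "function", "condition"]
print(second_elements(range_tokens, ["function", "match", "premise"]))
# {'function': 2, 'match': 0, 'premise': 0}
```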
The generation unit 143 generates a vector in which each element has a value obtained by adding the number of occurrences or the frequency of occurrence of each word acquired by the first acquisition unit 141 to the number of occurrences or the frequency of occurrence of each word acquired by the second acquisition unit 142. The generation unit 143 weights the number of occurrences or the frequency of occurrence of each word acquired by the first acquisition unit 141, and uses, as the value of each element of the vector, the sum of this weighted value and the number of occurrences or the frequency of occurrence of each word acquired by the second acquisition unit 142.
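The element-wise combination described above can be written compactly. The symbols are notation introduced here for illustration and do not appear in the source: $w_i$ is the $i$-th vectorization target word, $c_{\mathrm{T}}(w_i)$ is its count or frequency in the target portion, $c_{\mathrm{R}}(w_i)$ is its count or frequency in the certain range, and $\alpha$ is the weight applied to the target portion ($\alpha = 1$ gives plain addition).

```latex
v_i = \alpha \, c_{\mathrm{T}}(w_i) + c_{\mathrm{R}}(w_i), \qquad i = 1, \dots, n
```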
The storage unit 15 is a storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or an optical disc. The storage unit 15 may be a rewritable semiconductor memory such as a RAM (Random Access Memory), a flash memory, or an NVSRAM (Non Volatile Static Random Access Memory). The storage unit 15 stores an OS (Operating System) and various programs executed by the data processing device 10. The storage unit 15 also stores various kinds of information used in executing the programs. The storage unit 15 stores document data 151.
The document data 151 is digitized text document data and includes the electronic file documents to be processed by the data processing device 10.
[Data processing flow]
Next, the processing flow in the data processing device 10 will be described in detail. FIGS. 2, 4, and 6 are diagrams illustrating the processing flow of the data processing device 10 shown in FIG. 1. First, the processing up to the point where the data processing device 10 acquires the first element will be described with reference to FIG. 2.
First, in the first acquisition unit 141, the first decomposition unit 1411 decomposes the target portion of vectorization in the electronic file document into words by morphological analysis (see (1) in FIG. 2). The target portion of vectorization is, for example, text T1, "前提条件に合う機能を機能追加します。" ("Add a function that meets the prerequisites.").
Then, the first deletion unit 1412 sorts the words produced by the first decomposition unit 1411 and deletes duplicate words (see (2) in FIG. 2). The first deletion unit 1412 selects unnecessary words (stop words) (see (3) in FIG. 2) and removes them (see (4) in FIG. 2). The words remaining after the stop words are deleted are the target words for BoW vectorization.
Then, the first element acquisition unit 1413 acquires the number of occurrences of each BoW vectorization target word as the first element (see (4) in FIG. 2). List L1 is a list of the numbers of occurrences of the vectorization target words in text T1. Each number in the right column of list L1 is the number of occurrences of the corresponding vectorization target word. Conventionally, a vector has been generated using the numbers of occurrences in list L1 as the element values. FIG. 3 is a diagram showing an example of a BoW vector representation by the conventional method.
In the vector representation example of FIG. 3, each element consists of two numbers separated by a colon (:): the number to the left of the colon is the word number (index), and the number to the right is the number of occurrences of that word. For example, in FIG. 3, word number "1" corresponds to "機能" (function) in list L1, word number "2" corresponds to "合う" (match), and word number "3" corresponds to "前提" (premise). Because text T1 contains few words, all element values from word number "6" onward are "0", as in the vector representation example of FIG. 3, and the conventional method cannot sufficiently express the features of the sentence.
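The "word number:count" notation described above can be produced with a small formatter. This is a sketch under assumptions: indices start at 1 as in the text, elements are separated by spaces, and zero elements are printed; the exact layout of FIG. 3 is not reproduced in the text, and the function name and sample values are illustrative.

```python
def to_index_count_text(values):
    """Format element values as space-separated 'word_number:count' pairs (1-based)."""
    return " ".join(f"{index}:{value}" for index, value in enumerate(values, start=1))


# Illustrative element values, not the contents of FIG. 3.
print(to_index_count_text([2, 1, 1, 0]))  # 1:2 2:1 3:1 4:0
```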
In contrast, the data processing device 10 additionally collects the words of the sentence or paragraph that includes the target portion, and acquires the number of occurrences of the vectorization target words in that sentence or paragraph as the second element. This is described in detail with reference to FIG. 4.
As shown in FIG. 4, in the second acquisition unit 142, the extraction unit 1421 extracts from the electronic file document the sentence that includes the target portion or the paragraph that includes the target portion (see (5) in FIG. 4). For example, the extraction unit 1421 extracts paragraph P1 in (6) of FIG. 4, "条件を変更しない。通信機能がエラー処理を行う。設定は機能毎に条件を確認すること。" ("Do not change the conditions. The communication function performs error processing. For settings, check the conditions for each function."). In this example, the extracted paragraph contains text T1 to be vectorized, "前提条件に合う機能を機能追加します。", but text T1 is omitted from the illustration for ease of explanation, and the subsequent processing is likewise executed on paragraph P1 with text T1 removed.
Then, the second acquisition unit 142 acquires the number of occurrences of each BoW vectorization target word in the extracted sentence or paragraph as the second element (see (6) in FIG. 4). For example, the second decomposition unit 1422 decomposes paragraph P1 extracted by the extraction unit 1421 into words by morphological analysis, and the second deletion unit 1423 deletes duplicate words and stop words. The second element acquisition unit 1424 then acquires the number of occurrences of each BoW vectorization target word as the second element. List L2 is a list of the numbers of occurrences of the vectorization target words in the extracted paragraph P1.
Further, the generation unit 143 generates a BoW vector in which each element has a value obtained by adding the number of occurrences of the vectorization target word acquired by the first acquisition unit 141 to the number of occurrences of the vectorization target word acquired by the second acquisition unit 142 (see (7) in FIG. 4). The number of occurrences of each word in list L3 is the sum, for that word, of the number of occurrences in list L1 and the number of occurrences in list L2. In this way, the generation unit 143 generates a BoW vector whose element values are the per-word sums of the numbers of occurrences in list L1 and list L2.
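The per-word summation of lists L1 and L2 can be sketched as follows. The dictionaries are illustrative stand-ins for the lists in FIG. 4, which are not reproduced in the text: only the first three word numbers are included, and the counts for words other than "機能" are assumptions inferred from text T1 and paragraph P1.

```python
# Illustrative stand-ins for lists L1 and L2 (not the full contents of FIG. 4).
list_l1 = {"機能": 2, "合う": 1, "前提": 1}   # counts in the target portion (text T1)
list_l2 = {"機能": 2, "合う": 0, "前提": 0}   # counts in the surrounding range (paragraph P1)

# List L3: per-word sum of the two counts, keeping the word order of list L1.
list_l3 = {word: list_l1[word] + list_l2.get(word, 0) for word in list_l1}
bow_vector = list(list_l3.values())

print(list_l3)     # {'機能': 4, '合う': 1, '前提': 1}
print(bow_vector)  # [4, 1, 1]
```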
FIG. 5 is a diagram showing an example of a BoW vector representation according to the embodiment. The BoW vector shown in FIG. 5 corresponds to list L3 in FIG. 4, and the correspondence between word numbers and words is the same as in the BoW vector shown in FIG. 3. The value of each element of the BoW vector in FIG. 5 is the number of occurrences of the word in the text T1 to be vectorized plus the number of occurrences of the word in the paragraph P1 that contains the text T1.
Therefore, unlike the BoW vector shown in FIG. 3, the BoW vector shown in FIG. 5 also has element values greater than "0" for word number "6" and later. That is, the data processing device 10 can give non-zero element values to more words than the conventional approach, so the features of each word can be expressed sufficiently.
Thus, in the data processing device 10, each element of the BoW vector includes the number of occurrences of the word in the sentence or paragraph containing the target part. Using this count as a substitute for background knowledge allows the meaning of a sentence or passage to be expressed as a vector.
Further, as shown in FIGS. 4 and 5, the vectorization target word "function" appears twice in list L1 and twice in list L2. By adding these counts, the generation unit 143 can reinforce the number of occurrences of "function" (word number "1") to "4", as shown in list L3 (see frame W1 in FIG. 5).
At this time, as shown in FIG. 6, the generation unit 143 may weight each count in list L1 by multiplying it by a weight (for example, "2"). The generation unit 143 then adds, word by word, each weighted count in list L1 and each count in list L2, and uses the resulting values (see list L4) as the element values of the BoW vector. FIG. 7 is a diagram showing an example of a BoW vector representation according to the embodiment. The BoW vector shown in FIG. 7 corresponds to list L4 in FIG. 6, and the correspondence between word numbers and words is the same as in the BoW vector shown in FIG. 5.
As shown in FIGS. 6 and 7, when the weight is "2", the generation unit 143 multiplies the count "2" of the vectorization target word "function" (word number "1") in list L1 by the weight "2" to obtain "4", adds the count "2" from list L2, and thereby reinforces the number of occurrences to "6" (see frame W2 in FIG. 7). By multiplying the number of occurrences of each word in the vectorization target part by a weight greater than 1, the generation unit 143 can generate a feature vector that emphasizes the words of the original vectorization target part.
Of course, the generation unit 143 may use a weight other than "2". For example, the generation unit 143 may set the weight to "10". In this case, for the vectorization target word "function", the generation unit 143 multiplies the count "2" in list L1 by the weight "10" to obtain "20" and adds the count "2" from list L2, giving "22". The weight is not limited to these values; it may be "3" or "4", and may be adjusted according to how much emphasis is placed on the words of the original vectorization target part. The generation unit 143 may also set the weight to a value less than 1, such as a fraction (for example, "1/3") or a decimal (for example, "0.2"), to reduce the emphasis on the words of the original vectorization target part.
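The weighting arithmetic above can be checked with a short sketch. The function name `weighted_bow` is an assumption made for illustration; the counts for "function" and the weights "2", "10", and "1/3" are the ones named in the description.

```python
from fractions import Fraction

def weighted_bow(list_l1, list_l2, vocabulary, weight):
    """Multiply target-part counts by `weight`, then add surrounding-range counts."""
    return [weight * list_l1.get(w, 0) + list_l2.get(w, 0) for w in vocabulary]

vocabulary = ["function"]
list_l1 = {"function": 2}  # count in the vectorization target part
list_l2 = {"function": 2}  # count in the surrounding range

print(weighted_bow(list_l1, list_l2, vocabulary, 2))   # [6]
print(weighted_bow(list_l1, list_l2, vocabulary, 10))  # [22]
# A weight below 1 de-emphasizes the target part: 2 * 1/3 + 2 = 8/3.
print(weighted_bow(list_l1, list_l2, vocabulary, Fraction(1, 3)))  # [Fraction(8, 3)]
```

`Fraction` is used only to keep the fractional case exact; a float weight such as 0.2 works the same way.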
Next, two methods of counting the number of occurrences of vectorization target words within a certain range around the vectorization target part are described with reference to FIGS. 8 and 9. FIGS. 8 and 9 are diagrams illustrating the flow of processing of the data processing device 10 shown in FIG. 1.
First, the first counting method is described with reference to FIG. 8. In the first counting method, the number of occurrences of the vectorization target words is counted directly in the paragraph P1 extracted from the electronic file document by a search (see (1) and (2) in FIG. 8), which yields list L2 (see (3) in FIG. 8). Because only one text is counted, the first counting method requires only simple processing.
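A minimal sketch of the first counting method follows. It rests on several assumptions: a regular expression stands in for morphological analysis, the paragraph is built only from rough English renderings of the three sentences quoted in the description (the real paragraph P1 may contain more), and the names `tokenize` and `count_in_range` are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Stand-in for morphological analysis: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def count_in_range(range_text, target_words):
    """First counting method: count each target word over the whole range at once."""
    counts = Counter(tokenize(range_text))
    return {w: counts[w] for w in target_words}

# Illustrative stand-in for paragraph P1.
paragraph_p1 = (
    "Do not change the condition. "
    "The communication function performs error processing. "
    "For settings, check the condition for each function."
)
list_l2 = count_in_range(paragraph_p1, ["function", "match", "condition"])
print(list_l2)  # {'function': 2, 'match': 0, 'condition': 2}
```

The whole range is tokenized once, which is why this method needs only simple processing.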
Next, the second counting method is described with reference to FIG. 9. When the second counting method is used, the second acquisition unit 142 extracts, from the electronic file document, the paragraph P1 that contains the text T1, which is the target part. The second acquisition unit 142 then searches the paragraph P1 for each word in list L1 and extracts the sentences that contain that vectorization target word.
For example, for the word "function", the second acquisition unit 142 searches the paragraph P1 (see (1-1) in FIG. 9) and extracts the text T3 "The communication function performs error processing." and the text T4 "For settings, check the condition for each function.", both of which contain "function" (see (2-1) in FIG. 9). The second acquisition unit 142 counts the number of occurrences of each word appearing in the texts T3 and T4 (see (3-1) in FIG. 9) and generates list La1.
The second acquisition unit 142 performs the same processing for the other words. Specifically, for the word "match", the second acquisition unit 142 searches the paragraph P1 (see (1-2) in FIG. 9) and extracts the texts that contain "match". In this case, the paragraph P1 contains no such text (see (2-2) in FIG. 9), so the second acquisition unit 142 generates list La2, in which the number of occurrences of every word is "0" (see (3-2) in FIG. 9).
Likewise, for the word "condition", the second acquisition unit 142 searches the paragraph P1 (see (1-3) in FIG. 9) and extracts the text T2 "Do not change the condition." and the text T4 "For settings, check the condition for each function.", both of which contain "condition" (see (2-3) in FIG. 9). The second acquisition unit 142 counts the number of occurrences of each word appearing in the texts T2 and T4 (see (3-3) in FIG. 9) and generates list La4.
Then, the generation unit 143 adds up the counts for each vectorization target word, such as "function", "match", and "premise", and generates a BoW vector whose element values are the resulting sums (see list L11 in (4) of FIG. 9).
In this way, in the second counting method, for the paragraph P1 extracted as the certain range around the vectorization target part, the sentences containing each vectorization target word that appears in the text T1 (the target part) are extracted sentence by sentence for each target word, and the number of occurrences of each word is counted in them. By adding up all the counted occurrences of the vectorization target words, the second counting method can generate a BoW vector that emphasizes words appearing multiple times across the sentences of the paragraph. That is, when the second counting method is used, the data processing device 10 can generate a BoW vector that gives distinctive features to words that appear multiple times across the sentences constituting the paragraph.
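The second counting method can be sketched as below. This is one reading of the description under stated assumptions: regex tokenization and sentence splitting replace morphological analysis, the paragraph consists only of rough English renderings of texts T2 to T4, the per-word lists count only the vectorization target words, and the function names are hypothetical. How list L1 is combined with the per-word lists in list L11 is not reproduced here.

```python
import re
from collections import Counter

def tokenize(text):
    """Stand-in for morphological analysis: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def split_sentences(paragraph):
    """Split after sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]

def second_counting_method(paragraph, target_words):
    sentences = [tokenize(s) for s in split_sentences(paragraph)]
    total = Counter({w: 0 for w in target_words})
    for word in target_words:
        # Sentences containing the current target word (e.g. T3 and T4 for "function").
        hits = [tokens for tokens in sentences if word in tokens]
        # Per-word list (La1, La2, ...): counts of the target words in those sentences.
        per_word = Counter(t for tokens in hits for t in tokens if t in target_words)
        total.update(per_word)  # add the per-word lists together
    return dict(total)

paragraph_p1 = (
    "Do not change the condition. "
    "The communication function performs error processing. "
    "For settings, check the condition for each function."
)
result = second_counting_method(paragraph_p1, ["function", "match", "condition"])
print(result)  # {'function': 3, 'match': 0, 'condition': 3}
```

Because the last sentence contains both "function" and "condition", it is counted once for each of those words, which is how co-occurring words end up with larger values than under the first counting method.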
[Processing procedure of the data processing method]
Next, the processing procedure of the data processing method performed by the data processing device 10 shown in FIG. 1 is described with reference to FIG. 10. FIG. 10 is a flowchart showing the processing procedure of the data processing method according to the embodiment.
First, as shown in FIG. 10, upon receiving input of an electronic file document (step S1), the first acquisition unit 141 performs a first acquisition process of acquiring the number of occurrences of the words included in the vectorization target part of the electronic file document (step S2). The first acquisition unit 141 may instead obtain the appearance frequency of the vectorization target words.
Then, the second acquisition unit 142 performs a second acquisition process of acquiring, for a certain range of the electronic file document relative to the target part, the number of occurrences of the vectorization target words included in that range (step S3). The second acquisition unit 142 may instead obtain the appearance frequency of the predetermined words included in the certain range relative to the vectorization target part.
Subsequently, the generation unit 143 performs a generation process of generating a BoW vector in which the value of each element is the sum of the number of occurrences acquired in the first acquisition process and the number of occurrences acquired in the second acquisition process (step S4). In step S4, the generation unit 143 may weight the number of occurrences acquired in the first acquisition process and use, as the value of each element, the sum of the weighted number of occurrences and the number of occurrences or appearance frequency acquired in the second acquisition process.
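Steps S2 to S4 may use appearance frequencies instead of raw counts. The sketch below illustrates that variant under an assumption the description does not confirm: the appearance frequency of a word is taken to be its count divided by the number of tokens in the text being counted. The texts, the regex tokenizer, and the function names are hypothetical.

```python
import re
from collections import Counter
from fractions import Fraction

def tokenize(text):
    """Stand-in for morphological analysis: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def relative_frequencies(text, target_words):
    """Assumed definition: occurrences of a word divided by total tokens in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty input
    return {w: Fraction(counts[w], total) for w in target_words}

def generate_vector(target_text, range_text, target_words, weight=1):
    first = relative_frequencies(target_text, target_words)   # step S2
    second = relative_frequencies(range_text, target_words)   # step S3
    return [weight * first[w] + second[w] for w in target_words]  # step S4

target_text = "The function checks the condition."  # hypothetical target part
range_text = "The function checks the condition. Do not change the condition."
vector = generate_vector(target_text, range_text, ["function", "condition"])
print([str(v) for v in vector])  # ['3/10', '2/5']
```

With frequencies, a long surrounding range does not automatically dominate a short target part, which is one reason an implementation might prefer them over raw counts.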
[Processing procedure of the first acquisition process]
Next, the processing procedure of the first acquisition process (step S2) is described with reference to FIG. 11. FIG. 11 is a flowchart showing the processing procedure of the first acquisition process shown in FIG. 10.
As shown in FIG. 11, in the first acquisition unit 141, the first decomposition unit 1411 decomposes the vectorization target part of the electronic file document into words by morphological analysis (step S11). Then, the first deletion unit 1412 sorts the words obtained by the first decomposition unit 1411 and deletes duplicate words and stop words (step S12).
The first element acquisition unit 1413 obtains the number of occurrences of each vectorization target word obtained after the processing of step S12 (step S13) and acquires each count as the first element. The first element acquisition unit 1413 may instead acquire, as the first element, the appearance frequency of each vectorization target word obtained after the processing by the first deletion unit 1412.
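Steps S11 to S13 can be sketched as follows. The regex tokenizer is a stand-in for morphological analysis, and the stop-word list, the sample sentence, and the name `first_acquisition` are hypothetical; the description does not specify them.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "for", "to"}  # hypothetical stop-word list

def first_acquisition(target_text):
    # Step S11: split the target part into words.
    tokens = re.findall(r"[a-z]+", target_text.lower())
    # Step S12: sort, then drop duplicates and stop words to get the target words.
    target_words = sorted(set(tokens) - STOP_WORDS)
    # Step S13: count how often each target word occurs in the target part.
    counts = Counter(tokens)
    return {w: counts[w] for w in target_words}

first_elements = first_acquisition("The function checks the condition of the function.")
print(first_elements)  # {'checks': 1, 'condition': 1, 'function': 2}
```

Deduplication in step S12 only determines which words become vector elements; the counts in step S13 are still taken from the full token sequence.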
[Processing procedure of the second acquisition process]
Next, the processing procedure of the second acquisition process (step S3) is described with reference to FIG. 12. FIG. 12 is a flowchart showing the processing procedure of the second acquisition process shown in FIG. 10.
As shown in FIG. 12, in the second acquisition unit 142, the extraction unit 1421 extracts, from the electronic file document, the sentence or the paragraph that contains the target part (step S21). The second decomposition unit 1422 decomposes the sentence or paragraph extracted by the extraction unit 1421 into words by morphological analysis (step S22), and the second deletion unit 1423 sorts the words obtained by the second decomposition unit 1422 and deletes duplicate words and stop words (step S23).
Then, the second element acquisition unit 1424 obtains the number of occurrences of each vectorization target word obtained after the processing of step S22 (step S24) and acquires each count as the second element. When the appearance frequency is acquired as the first element in step S13, the second element acquisition unit 1424 acquires, as the second element, the appearance frequency of each word obtained after the processing of step S22. In step S4, the generation unit 143 generates a BoW vector in which the value of each element is the sum of the first element acquired in the first acquisition process and the second element acquired in the second acquisition process.
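Steps S21 to S24 can be sketched as follows, again as an illustration rather than the patented implementation. It assumes paragraphs are separated by blank lines and that the target part appears verbatim in the document; the sample document and the function names are hypothetical.

```python
import re
from collections import Counter

def extract_paragraph(document, target_text):
    """Step S21: return the blank-line-delimited paragraph containing the target part."""
    for paragraph in re.split(r"\n\s*\n", document):
        if target_text in paragraph:
            return paragraph.strip()
    return target_text  # fall back to the target part itself if no paragraph matches

def second_acquisition(document, target_text, target_words):
    paragraph = extract_paragraph(document, target_text)
    # Step S22: split the extracted range into words (regex stand-in for morphological analysis).
    tokens = re.findall(r"[a-z]+", paragraph.lower())
    counts = Counter(tokens)
    # Steps S23 and S24: keeping only the target words plays the role of removing
    # duplicates and stop words before the counts are taken as the second element.
    return {w: counts[w] for w in target_words}

document = (
    "Overview of the device.\n\n"
    "The function checks the condition. Do not change the condition.\n\n"
    "End of notes."
)
second_elements = second_acquisition(
    document, "The function checks the condition.", ["function", "condition"]
)
print(second_elements)  # {'function': 1, 'condition': 2}
```

Extracting the enclosing sentence instead of the paragraph would only change `extract_paragraph`; the counting step stays the same.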
[Effects of the embodiment]
As described above, the data processing device 10 according to the present embodiment generates a vector in which the value of each element is the sum of the number of occurrences or appearance frequency of a word included in the vectorization target part of the document data and the number of occurrences or appearance frequency of the word within a certain range relative to the target part.
By adding the number of occurrences or appearance frequency of the words within a certain range relative to the vectorization target part to the number of occurrences or appearance frequency of the words in the target part, the data processing device 10 can produce an appropriate vector representation even when the vectorization target part is a short sentence with few words. In addition, by collecting the words within a certain range relative to the vectorization target part and adding them to the elements of the vector, the data processing device 10 uses them as a substitute for background knowledge, so that the meaning of a sentence or passage can be expressed as a BoW vector.
Further, the data processing device 10 weights the number of occurrences or appearance frequency of the words included in the vectorization target part and uses, as the value of each element, the sum of the weighted number of occurrences or appearance frequency and the number of occurrences or appearance frequency of the vectorization target words within the certain range relative to the vectorization target part. By weighting the number of occurrences or appearance frequency of each word in the vectorization target part in this way, the data processing device 10 can generate a BoW vector in which the emphasis on the words of the original vectorization target part is adjusted.
The certain range relative to the vectorization target part is the sentence or the paragraph that contains the target part. The data processing device 10 thus adds, to the number of occurrences or appearance frequency of the words included in the vectorization target part, the number of occurrences or appearance frequency of the words in a part that is closely related to the target part. As a result, even when the target part is a short sentence with few words, the data processing device 10 can generate a BoW vector with reinforced features by supplementing it, as background knowledge, with the number of occurrences or appearance frequency of the words near the target part.
[System configuration, etc.]
Each component of each illustrated device is a functional concept and does not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to the illustrated one, and all or part of them may be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Further, all or any part of each processing function performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or as hardware using wired logic.
Among the processes described in the present embodiment, all or part of the processes described as being performed automatically may be performed manually, and all or part of the processes described as being performed manually may be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings may be changed arbitrarily unless otherwise specified.
[Program]
FIG. 13 is a diagram showing an example of a computer that realizes the data processing device 10 by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
The hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the data processing device 10 is implemented as the program module 1093, in which code executable by a computer is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, the program module 1093 for executing processing equivalent to the functional configuration of the data processing device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
The setting data used in the processing of the above-described embodiment is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as needed and executes them.
The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; they may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (a LAN (Local Area Network), a WAN (Wide Area Network), or the like) and read by the CPU 1020 from that computer via the network interface 1070.
Although an embodiment to which the invention made by the present inventors is applied has been described above, the present invention is not limited by the description and drawings that form part of the disclosure of the present invention according to this embodiment. That is, all other embodiments, examples, operational techniques, and the like made by those skilled in the art based on this embodiment are included in the scope of the present invention.
 10 Data processing device
 11 Input unit
 12 Output unit
 13 Communication unit
 14 Control unit
 15 Storage unit
 141 First acquisition unit
 142 Second acquisition unit
 143 Generation unit
 151 Document data
 1411 First decomposition unit
 1412 First deletion unit
 1413 First element acquisition unit
 1421 Extraction unit
 1422 Second decomposition unit
 1423 Second deletion unit
 1424 Second element acquisition unit

Claims (5)

  1.  A data processing device comprising:
     a first acquisition unit that acquires the number of occurrences or the appearance frequency of a predetermined word included in a vectorization target part of document data;
     a second acquisition unit that acquires, for a certain range of the document data relative to the target part, the number of occurrences or the appearance frequency of the predetermined word included in the certain range; and
     a generation unit that generates a vector in which the value of each element is the sum of the number of occurrences or appearance frequency acquired by the first acquisition unit and the number of occurrences or appearance frequency acquired by the second acquisition unit.
  2.  The data processing device according to claim 1, wherein the generation unit weights the number of occurrences or appearance frequency acquired by the first acquisition unit and uses, as the value of each element, the sum of the weighted number of occurrences or appearance frequency acquired by the first acquisition unit and the number of occurrences or appearance frequency acquired by the second acquisition unit.
  3.  The data processing device according to claim 1 or 2, wherein the certain range is a sentence containing the target part or a paragraph containing the target part.
  4.  A data processing method executed by a data processing device, the method comprising:
     a first acquisition step of acquiring the number of occurrences or the appearance frequency of a predetermined word included in a vectorization target part of document data;
     a second acquisition step of acquiring, for a certain range of the document data relative to the target part, the number of occurrences or the appearance frequency of the predetermined word included in the certain range; and
     a generation step of generating a vector in which the value of each element is the sum of the number of occurrences or appearance frequency acquired in the first acquisition step and the number of occurrences or appearance frequency acquired in the second acquisition step.
  5.  A data processing program for causing a computer to execute:
     a first acquisition step of acquiring the number of occurrences or the appearance frequency of a predetermined word included in a vectorization target part of document data;
     a second acquisition step of acquiring, for a certain range of the document data relative to the target part, the number of occurrences or the appearance frequency of the predetermined word included in the certain range; and
     a generation step of generating a vector in which the value of each element is the sum of the number of occurrences or appearance frequency acquired in the first acquisition step and the number of occurrences or appearance frequency acquired in the second acquisition step.
PCT/JP2019/049053 2019-12-13 2019-12-13 Data processing device, data processing method, and data processing program WO2021117246A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/049053 WO2021117246A1 (en) 2019-12-13 2019-12-13 Data processing device, data processing method, and data processing program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/049053 WO2021117246A1 (en) 2019-12-13 2019-12-13 Data processing device, data processing method, and data processing program

Publications (1)

Publication Number Publication Date
WO2021117246A1 true WO2021117246A1 (en) 2021-06-17

Family

ID=76330137

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/049053 WO2021117246A1 (en) 2019-12-13 2019-12-13 Data processing device, data processing method, and data processing program

Country Status (1)

Country Link
WO (1) WO2021117246A1 (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006350656A (en) * 2005-06-15 2006-12-28 Nippon Telegr & Teleph Corp <Ntt> Time-series document grouping method, device, and program, and recording medium storing program

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19955585; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 19955585; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: JP)