WO2021117246A1 - Data processing device, data processing method, and data processing program

Info

Publication number: WO2021117246A1
Application number: PCT/JP2019/049053
Authority: WO
Inventors: 須永　聡; 一宏菊間
Original assignee: 日本電信電話株式会社
Priority date: 2019-12-13
Filing date: 2019-12-13
Publication date: 2021-06-17

Abstract

A data processing device (10) comprises: a first acquisition unit (141) which acquires the number of occurrences or the frequency of occurrence of a prescribed word within a target portion to be vectorized that is included in document data; a second acquisition unit (142) which acquires the number of occurrences or the frequency of occurrence of the prescribed word within a certain region of the document data, said certain region being associated with the target portion; and a generation unit (143) which generates a vector, each element of which has a value equal to the value obtained by adding the number of occurrences or the frequency of occurrence acquired by the first acquisition unit (141) to the number of occurrences or the frequency of occurrence acquired by the second acquisition unit (142).

Description

Data processing device, data processing method, and data processing program

The present invention relates to a data processing apparatus, a data processing method, and a data processing program.

One way to classify documents is to convert the linguistic expressions of sentences and words into a form that is mathematically easy to handle. For example, the bag-of-words (BoW) method has been proposed for expressing document data as a vector. In the BoW method, the number of occurrences or the frequency of occurrence of each word is used as a vector element. When machine learning is used for natural language processing, for example, expressing texts, sentences, and words as mathematically tractable vectors makes it possible to obtain the similarity between documents from the similarity between their vectors.

Japanese Unexamined Patent Publication No. 9-297766

However, the BoW method suffers from a data sparseness problem: when a short sentence contains only a few words, a vector that represents the characteristics of the sentence cannot be generated appropriately.

The data sparseness problem will be explained using an example in which a Japanese dictionary records 50,000 words and a vector having these words as elements is generated. In this example, if a newspaper article consisting of 100 words is vectorized, at least 49,900 elements will be "0". In other words, most elements are "0". Moreover, the topic is unlikely to change often within one article, so the same words will be used many times, and the number of distinct words is smaller still. That is, most of the vector elements are considered to be "0". Therefore, even if a large number of articles are prepared, it may be difficult to determine how readily a certain word appears.

Generally, data in which only a small number of elements tend to take a value other than "0" is said to be sparse. When data is sparse, the statistics needed to process the data cannot be obtained in sufficient quantity; this is called the data sparseness problem.
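The arithmetic of the example above can be checked with a short sketch. The vocabulary size and article length are the figures given in the text; the helper name `min_zero_elements` is an assumption introduced only for illustration.

```python
def min_zero_elements(vocab_size: int, words_in_text: int) -> int:
    """Lower bound on the number of zero elements in a BoW vector.

    A text of `words_in_text` words contains at most that many distinct
    words, so at least the rest of the vocabulary does not occur in it.
    """
    return max(vocab_size - words_in_text, 0)


zeros = min_zero_elements(50_000, 100)
print(zeros)           # 49900
print(zeros / 50_000)  # 0.998
```

Repeated words reduce the number of distinct words further, so the actual share of zero elements can only be higher than this bound.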

The present invention has been made in view of the above, and an object of the present invention is to provide a data processing apparatus, a data processing method, and a data processing program capable of generating an appropriate vector representation even for a short sentence with a small number of words.

In order to solve the above-mentioned problems and achieve the object, the data processing apparatus according to the present invention includes: a first acquisition unit that acquires the number of occurrences or the frequency of occurrence of a predetermined word included in a target portion of vectorization in document data; a second acquisition unit that acquires the number of occurrences or the frequency of occurrence of the predetermined word included in a certain range of the document data relative to the target portion; and a generation unit that generates a vector in which each element has a value obtained by adding the number of occurrences or the frequency of occurrence acquired by the first acquisition unit and the number of occurrences or the frequency of occurrence acquired by the second acquisition unit.

Further, the data processing method according to the present invention is a data processing method executed by a data processing apparatus, and includes: a first acquisition step of acquiring the number of occurrences or the frequency of occurrence of a predetermined word included in a target portion of vectorization in document data; a second acquisition step of acquiring the number of occurrences or the frequency of occurrence of the predetermined word included in a certain range of the document data relative to the target portion; and a generation step of generating a vector in which each element has a value obtained by adding the number of occurrences or the frequency of occurrence acquired in the first acquisition step and the number of occurrences or the frequency of occurrence acquired in the second acquisition step.

Further, the data processing program according to the present invention causes a computer to execute: a first acquisition step of acquiring the number of occurrences or the frequency of occurrence of a predetermined word included in a target portion of vectorization in document data; a second acquisition step of acquiring the number of occurrences or the frequency of occurrence of the predetermined word included in a certain range of the document data relative to the target portion; and a generation step of generating a vector in which each element has a value obtained by adding the number of occurrences or the frequency of occurrence acquired in the first acquisition step and the number of occurrences or the frequency of occurrence acquired in the second acquisition step.

According to the present invention, even short document data can be appropriately expressed as a vector.

FIG. 1 is a diagram schematically showing an example of a configuration of a data processing device according to an embodiment.
FIG. 2 is a diagram illustrating a processing flow of the data processing apparatus shown in FIG. 1.
FIG. 3 is a diagram showing an example of BoW vector representation by the conventional method.
FIG. 4 is a diagram illustrating a processing flow of the data processing apparatus shown in FIG. 1.
FIG. 5 is a diagram showing an example of BoW vector representation according to the embodiment.
FIG. 6 is a diagram illustrating a processing flow of the data processing apparatus shown in FIG. 1.
FIG. 7 is a diagram showing an example of BoW vector representation according to the embodiment.
FIG. 8 is a diagram illustrating a processing flow of the data processing apparatus shown in FIG. 1.
FIG. 9 is a diagram illustrating a processing flow of the data processing apparatus shown in FIG. 1.
FIG. 10 is a flowchart showing a processing procedure of the data processing method according to the embodiment.
FIG. 11 is a flowchart showing a processing procedure of the first acquisition process shown in FIG. 10.
FIG. 12 is a flowchart showing a processing procedure of the second acquisition process shown in FIG. 10.
FIG. 13 is a diagram showing an example of a computer in which a data processing device is realized by executing a program.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings. The present invention is not limited to this embodiment. Further, in the description of the drawings, the same parts are indicated by the same reference numerals.

[Embodiment]
Embodiments of the present invention will be described. In the embodiment, the processing target is assumed to be digitized text document data (hereinafter referred to as document data). In the present embodiment, a document feature vector (BoW vector) is generated by using the BoW method.

In this embodiment, a vector is generated in which each element has a value obtained by adding the number of occurrences or the frequency of occurrence of a predetermined word included in the target portion of vectorization in the document data and the number of occurrences or the frequency of occurrence of that predetermined word in a certain range relative to the target portion.

As a result, even if the target portion of vectorization is a short sentence with few words, the number of occurrences or the frequency of occurrence of the words in the certain range is added to that of the target portion, so an appropriate vector representation is possible. In addition, humans interpret a sentence or text using background knowledge; in this embodiment, words in the vicinity of the target portion of vectorization are collected and added to the elements of the vector as a substitute for background knowledge, so that the meaning of the sentence or text can be expressed by the vector.

As described above, in this embodiment, even in a short sentence with a small number of words, a BoW vector with reinforced features is generated by supplementing the number of occurrences or the frequency of appearance of words in the vicinity as background knowledge. Such a BoW vector is used in a method of predicting the necessity of verification from a verification matrix by machine learning or the like.

[Data processing device configuration]
First, the configuration of the data processing device according to the embodiment will be described. FIG. 1 is a diagram schematically showing an example of a configuration of a data processing device according to an embodiment. As shown in FIG. 1, the data processing device 10 includes an input unit 11, an output unit 12, a communication unit 13, a control unit 14, and a storage unit 15.

The input unit 11 is an input interface that receives various operations from the operator of the data processing device 10. For example, the input unit 11 is composed of an input device such as a touch panel, a voice input device, a keyboard, or a mouse.

The communication unit 13 is a communication interface for transmitting and receiving various information to and from other devices connected via a network or the like. The communication unit 13 is realized by a NIC (Network Interface Card) or the like, and communicates between another device and the control unit 14 (described later) via a telecommunication line such as a LAN (Local Area Network) or the Internet. For example, the communication unit 13 receives the data of the document file for which the BoW vector is generated via the network and outputs the data to the control unit 14. Further, the communication unit 13 outputs the BoW vector information generated by the control unit 14 to an external device via the network.

The output unit 12 is realized by, for example, a display device such as a liquid crystal display, a printing device such as a printer, an information communication device, or the like, and outputs information indicating a target word and a BoW vector generated by the control unit 14.

The control unit 14 controls the entire data processing device 10. The control unit 14 is, for example, an electronic circuit such as a CPU (Central Processing Unit) or MPU (Micro Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array). Further, the control unit 14 has an internal memory for storing programs and control data that define various processing procedures, and executes each process using the internal memory. Further, the control unit 14 functions as various processing units by operating various programs. The control unit 14 has a first acquisition unit 141, a second acquisition unit 142, and a generation unit 143.

The first acquisition unit 141 acquires the number of occurrences or the frequency of occurrence of a predetermined word included in the target portion of vectorization in the document data. The first acquisition unit 141 includes a first decomposition unit 1411, a first deletion unit 1412, and a first element acquisition unit 1413.

The first decomposition unit 1411 decomposes the target portion of the document data for vectorization into each word using, for example, a morphological analysis tool such as MeCab.

The first deletion unit 1412 sorts the words decomposed by the first decomposition unit 1411 and then deletes duplicate words. The first deletion unit 1412 also selects unnecessary words (stop words) and deletes them. Stop words are words that are not useful for characterizing document data, such as "the", "a", "is", "have", and "take", or Japanese function words such as "ha", "no", and "masu". Each word remaining after the processing by the first deletion unit 1412 is a predetermined word to be vectorized.

The first element acquisition unit 1413 obtains the number of occurrences of each word obtained after processing by the first deletion unit 1412, and acquires each appearance number as the first element. The first element acquisition unit 1413 may acquire the appearance frequency of each word obtained after the processing by the first deletion unit 1412 as the first element. The frequency of appearance refers to, for example, the ratio of the corresponding word to the total number of words included in the target portion of vectorization.
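The first acquisition described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the regular-expression tokenizer is a stand-in for morphological analysis (the text names MeCab as one possible tool), and `STOP_WORDS`, `tokenize`, `first_acquisition`, and the sample sentence are all assumptions introduced here.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the source gives only a few examples.
STOP_WORDS = {"the", "a", "is", "have", "take"}


def tokenize(text: str) -> list[str]:
    """Stand-in for morphological analysis: lower-case alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def first_acquisition(target: str, use_frequency: bool = False) -> dict[str, float]:
    tokens = tokenize(target)
    # Sort, drop duplicates, drop stop words: the words to be vectorized.
    target_words = sorted(set(tokens) - STOP_WORDS)
    counts = Counter(tokens)
    if use_frequency:
        # Frequency: share of the word among all words of the target portion.
        return {w: counts[w] / len(tokens) for w in target_words}
    return {w: counts[w] for w in target_words}


print(first_acquisition("The function calls a function"))
# {'calls': 1, 'function': 2}
```

Passing `use_frequency=True` returns ratios instead of counts, which corresponds to the alternative in which the frequency of appearance is used as the first element.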

The second acquisition unit 142 acquires the number of occurrences or the frequency of occurrence of a predetermined word included in a certain range of the document data relative to the target portion. Specifically, the certain range is a sentence that includes the target portion or, alternatively, a paragraph that includes the target portion. The second acquisition unit 142 includes an extraction unit 1421, a second decomposition unit 1422, a second deletion unit 1423, and a second element acquisition unit 1424.

The extraction unit 1421 extracts a certain range from the document data with respect to the target portion for vectorization. Specifically, the extraction unit 1421 extracts a sentence including the target portion or a paragraph including the target portion from the document data.
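A sketch of such an extraction step is shown below. The function name `extract_range`, the blank-line paragraph delimiter, and the punctuation-based sentence split are assumptions made for illustration; the source does not specify how sentence or paragraph boundaries are detected.

```python
import re


def extract_range(document: str, target: str, unit: str = "paragraph") -> str:
    """Return the paragraph (or sentence) of `document` that contains `target`."""
    if unit == "paragraph":
        # Assumption: paragraphs are separated by blank lines.
        candidates = re.split(r"\n\s*\n", document)
    else:
        # Assumption: a sentence ends with '.', '!' or '?' followed by whitespace.
        candidates = re.split(r"(?<=[.!?])\s+", document)
    for candidate in candidates:
        if target in candidate:
            return candidate.strip()
    return ""  # target portion not found
```

For Japanese text the sentence delimiter would differ (for example the full stop "。"), so the split patterns would need to be adapted.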

The second decomposition unit 1422 decomposes a certain range of the vectorized target portion extracted by the extraction unit 1421 into each word using, for example, a morphological analysis tool such as MeCab.

The second deletion unit 1423 sorts each word decomposed by the second decomposition unit 1422, and then deletes the duplicate word. The second deletion unit 1423 selects a stop word and deletes the stop word.

The second element acquisition unit 1424 executes the following processing for the predetermined words to be vectorized among the words obtained after the deletion process. The second element acquisition unit 1424 obtains the number of occurrences of each predetermined word to be vectorized obtained after the processing by the second deletion unit 1423, and acquires each number of occurrences as the second element. When the first element acquisition unit 1413 has acquired the frequency of appearance as the first element, the second element acquisition unit 1424 acquires, as the second element, the frequency of appearance of each predetermined word to be vectorized obtained after the processing by the second deletion unit 1423.

The generation unit 143 generates a vector in which each element has a value obtained by adding the number of occurrences or the frequency of occurrence of each word acquired by the first acquisition unit 141 and the number of occurrences or the frequency of occurrence of each word acquired by the second acquisition unit 142. The generation unit 143 may also weight the number of occurrences or the frequency of occurrence of each word acquired by the first acquisition unit 141, and use, as the value of each element of the vector, the sum of the weighted value and the number of occurrences or the frequency of occurrence of each word acquired by the second acquisition unit 142.
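The addition performed by the generation unit 143 can be written compactly. The notation below is introduced here for illustration and does not appear in the source: $c^{(1)}_i$ is the count (or frequency) of the $i$-th target word in the target portion, $c^{(2)}_i$ is its count (or frequency) in the certain range, $w$ is the weight, and $N$ is the number of vector elements.

```latex
v_i = w \, c^{(1)}_i + c^{(2)}_i , \qquad i = 1, \dots, N
```

Setting $w = 1$ gives the unweighted sum; a weight greater than 1 emphasizes the target portion, and a weight less than 1 emphasizes the surrounding range.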

The storage unit 15 is a storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or an optical disk. The storage unit 15 may be a rewritable semiconductor memory such as a RAM (Random Access Memory), a flash memory, or an NVSRAM (Non Volatile Static Random Access Memory). The storage unit 15 stores an OS (Operating System) and various programs executed by the data processing device 10. Further, the storage unit 15 stores various information used in executing the programs. The storage unit 15 stores the document data 151.

The document data 151 is digitized text document data, and includes an electronic file document to be processed by the data processing device 10.

[Data processing flow]
Next, the processing flow in the data processing apparatus 10 will be described in detail. FIGS. 2, 4, and 6 are diagrams for explaining the flow of processing by the data processing apparatus 10 shown in FIG. 1. First, with reference to FIG. 2, the process until the data processing device 10 acquires the first element will be described.

First, in the first acquisition unit 141, the first decomposition unit 1411 decomposes the target portion of the electronic file document into words by morphological analysis (see (1) in FIG. 2). The target part of vectorization is, for example, text T1 "Add a function that meets the prerequisites."

Then, the first deletion unit 1412 sorts each word decomposed by the first decomposition unit, and then deletes the duplicated words (see (2) in FIG. 2). The first deletion unit 1412 selects unnecessary words (stop words) (see (3) in FIG. 2) and excludes stop words (see (4) in FIG. 2). Each word remaining after the stop word is deleted becomes the target word for BoW vectorization.

Then, the first element acquisition unit 1413 acquires the number of occurrences of each target word in the BoW vectorization as the first element (see (4) in FIG. 2). List L1 is a list of the number of occurrences of the target word for vectorization in the text T1. Each number in the right column of the list L1 is the number of occurrences of the target word for each vectorization. Conventionally, a vector has been generated by using each number of occurrences of the list L1 as an element value. FIG. 3 is a diagram showing an example of BoW vector representation by the conventional method.

In the vector representation example of FIG. 3, there are numerical values before and after each colon (:); the value to the left of the colon is the word number (index), and the value to the right is the number of occurrences of that word. For example, in FIG. 3, the word number "1" corresponds to "function" in the list L1, the word number "2" corresponds to "match" in the list L1, and the word number "3" corresponds to "premise" in the list L1. Since the number of words in the text T1 is small, all the element values from the word number "6" onward are "0", as in the vector representation example of FIG. 3, and the conventional method cannot sufficiently express the characteristics of the sentence.
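The "word number:count" notation can be illustrated with a short sketch. The helper name `to_index_count_text`, the eight-word vocabulary, and the sample counts are assumptions for illustration; only the count "2" for word number "1" ("function") is taken from the text, and the actual contents of FIG. 3 are not reproduced here.

```python
def to_index_count_text(counts: list[int]) -> str:
    """Format a BoW vector as 'index:count' pairs with 1-based word numbers."""
    return " ".join(f"{i}:{c}" for i, c in enumerate(counts, start=1))


# Hypothetical vocabulary of eight words; only the first five occur in the text.
print(to_index_count_text([2, 1, 1, 1, 1, 0, 0, 0]))
# 1:2 2:1 3:1 4:1 5:1 6:0 7:0 8:0
```

With a realistic vocabulary the run of zeros after the last occurring word would be far longer, which is the sparseness the text describes.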

In contrast, the data processing device 10 further collects the words of a sentence or paragraph that includes the target portion, and acquires the number of occurrences of the vectorization target words in this sentence or paragraph as the second element. This will be described specifically with reference to FIG. 4.

As shown in FIG. 4, in the second acquisition unit 142, the extraction unit 1421 extracts a sentence including the target portion or a paragraph including the target portion from the electronic file document (see (5) in FIG. 4). For example, the extraction unit 1421 extracts paragraph P1 "The conditions are not changed. The communication function performs error processing. The setting is to confirm the conditions for each function" (see (6) in FIG. 4). In this example, the extracted paragraph contains the text T1 to be vectorized, "Add a function that meets the prerequisites."; for ease of explanation, however, the paragraph is shown with the text T1 deleted. Accordingly, the subsequent processing is also assumed to be executed on the paragraph P1 from which the text T1 has been deleted.

Then, the second acquisition unit 142 acquires the number of occurrences of each target word in the BoW vectorization as the second element in the extracted sentence or paragraph (see (6) in FIG. 4). For example, the second decomposition unit 1422 decomposes the paragraph P1 extracted by the extraction unit 1421 into words by morphological analysis, and the second deletion unit 1423 deletes duplicate words and stop words. Then, the second element acquisition unit 1424 acquires the number of occurrences of each target word in the BoW vectorization as the second element. List L2 is a list of the number of occurrences of each vectorized word in the extracted paragraph P1.

Further, the generation unit 143 generates a BoW vector in which each element has a value obtained by adding the number of occurrences of the vectorization target word acquired by the first acquisition unit 141 and the number of occurrences of the vectorization target word acquired by the second acquisition unit 142 (see (7) in FIG. 4). The number of occurrences of each word in the list L3 is the sum, for that word, of the number of occurrences in the list L1 and the number of occurrences in the list L2. In this way, the generation unit 143 generates a BoW vector in which the value obtained by adding the numbers of occurrences in the list L1 and the list L2 for each word is used as the value of each element.

FIG. 5 is a diagram showing an example of BoW vector representation according to the embodiment. The BoW vector shown in FIG. 5 corresponds to the list L3 in FIG. 4, and the correspondence between word numbers and words is the same as in the BoW vector shown in FIG. 3. The value of each element of the BoW vector in FIG. 5 is obtained by adding the number of occurrences of each word in the paragraph P1 including the text T1 to the number of occurrences of that word in the text T1 to be vectorized.

Therefore, compared with the BoW vector shown in FIG. 3, the BoW vector shown in FIG. 5 has values larger than "0" among the elements from the word number "6" onward. That is, according to the data processing device 10, element values other than "0" can be given to more words than with the conventional method, so the characteristics of each word can be sufficiently expressed.

Therefore, in the data processing device 10, each element of the BoW vector includes the number of occurrences of a word in the sentence or paragraph that includes the target portion, and by using this as a substitute for background knowledge, the meaning of the sentence or text can be expressed by the vector.

Further, as shown in FIGS. 4 and 5, the vectorization target word "function" appears twice in the list L1 and twice in the list L2. Therefore, as shown in the list L3, the generation unit 143 can reinforce the number of occurrences of "function" (word number "1") to "4" by adding them up (see frame W1 in FIG. 5).

At this time, as shown in FIG. 6, the generation unit 143 may weight each number of occurrences in the list L1 by multiplying it by a weight (for example, "2"). The generation unit 143 then adds, for each word, the weighted number of occurrences from the list L1 and the number of occurrences from the list L2, and uses each resulting value (see list L4) as the value of the corresponding element of the BoW vector. FIG. 7 is a diagram showing an example of BoW vector representation according to the embodiment. The BoW vector shown in FIG. 7 corresponds to the list L4 in FIG. 6, and the correspondence between word numbers and words is the same as in the preceding BoW vector examples.

As shown in FIGS. 6 and 7, when the weight is "2", the generation unit 143 multiplies the number of occurrences "2" in the list L1 for the vectorization target word "function" (word number "1") by the weight "2" to obtain "4", and adds the number of occurrences "2" in the list L2, reinforcing the number of occurrences to "6" (see frame W2 in FIG. 7). In this way, by multiplying the number of occurrences of each word in the target portion of vectorization by a weight greater than 1, the generation unit 143 can generate a feature vector that emphasizes the words in the original target portion.

Of course, the generation unit 143 may use a weight other than "2". For example, with a weight of "10", the generation unit 143 multiplies the number of occurrences "2" in the list L1 for the vectorization target word "function" by "10" to obtain "20", and adds the number of occurrences "2" in the list L2, giving "22". The weight is not limited to these values and may be "3" or "4"; it may be adjusted according to how strongly the words of the original target portion are to be emphasized. The generation unit 143 may also set the weight to a value less than 1, for example a fraction (such as "1/3") or a decimal (such as "0.2"), to weaken the emphasis on the words of the original target portion.
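The weighted addition for the word "function" can be reproduced with a small sketch. The counts (2 in the list L1 and 2 in the list L2) and the weights come from the text; the function name `reinforce` and the dictionary representation of the lists are assumptions for illustration.

```python
def reinforce(first: dict[str, float], second: dict[str, float],
              weight: float = 1) -> dict[str, float]:
    """Return weight * first + second for every target word (the keys of `first`)."""
    return {word: weight * count + second.get(word, 0)
            for word, count in first.items()}


list_l1 = {"function": 2}  # occurrences in the target portion
list_l2 = {"function": 2}  # occurrences in the surrounding paragraph

for w in (1, 2, 10):
    print(w, reinforce(list_l1, list_l2, weight=w)["function"])
# 1 4
# 2 6
# 10 22
```

A fractional weight such as 0.5 would give 3.0 for the same word, shifting the emphasis toward the surrounding paragraph.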

Next, with reference to FIGS. 8 and 9, two methods of counting the number of occurrences of the vectorization target words included in a certain range relative to the target portion of vectorization will be described. FIGS. 8 and 9 are diagrams for explaining the flow of processing by the data processing apparatus 10 shown in FIG. 1.

First, with reference to FIG. 8, the first method of counting the number of occurrences of the vectorization target words included in a certain range relative to the target portion of vectorization will be described. The first counting method counts, as is, the number of occurrences of the vectorization target words in the paragraph P1 extracted from the electronic file document by a search (see (1) and (2) in FIG. 8, and list L2 in (3) of FIG. 8). In the first counting method, only one text is counted, so simple processing suffices.

Subsequently, with reference to FIG. 9, a second counting method of the number of occurrences of the target word for vectorization included in a certain range with respect to the target portion for vectorization will be described. When the second counting method is used, the second acquisition unit 142 extracts the paragraph P1 including the text T1 which is the target portion from the electronic file document. Then, the second acquisition unit 142 searches the paragraph P1 for each word in the list L1 and extracts a sentence including the target word for each vectorization.

For example, for the word "function", the second acquisition unit 142 searches paragraph P1 (see (1-1) in FIG. 9) and extracts the texts that include "function": text T3 "The communication function performs error processing." and text T4 "The setting is to confirm the conditions for each function." (see (2-1) in FIG. 9). The second acquisition unit 142 counts the number of occurrences of each word appearing in the texts T3 and T4 (see (3-1) in FIG. 9), and generates the list La1.

The second acquisition unit 142 performs the same processing for other words. Specifically, the second acquisition unit 142 searches paragraph P1 for the word “match” (see (1-2) in FIG. 9) and extracts text containing “match”. In this case, since the corresponding text is not in paragraph P1 (see (2-2) in FIG. 9), the second acquisition unit 142 generates a list La2 in which the number of occurrences of each word is “0”. (See (3-2) in FIG. 9).

Further, the second acquisition unit 142 searches the paragraph P1 for the word "condition" (see (1-3) in FIG. 9) and extracts the texts that include "condition": text T2 "The conditions are not changed." and text T4 "The setting is to confirm the conditions for each function." (see (2-3) in FIG. 9). The second acquisition unit 142 counts the number of occurrences of each word appearing in the texts T2 and T4 (see (3-3) in FIG. 9), and generates the list La4.

Then, the generation unit 143 adds the counts for each vectorization target word, such as "function", "match", and "premise", and generates a BoW vector using the added value as the value of each element (see list L11 in (4) of FIG. 9).

As described above, in the second counting method, for the paragraph P1 extracted as the certain range relative to the target portion of vectorization, the sentences containing each vectorization target word appearing in the text T1 (the target portion) are extracted sentence by sentence, and the number of occurrences of each word is counted. Then, by adding up all the counted numbers of occurrences for each vectorization target word, a BoW vector that emphasizes the words appearing multiple times across the sentences constituting the paragraph can be generated. That is, when the second counting method is used, the data processing device 10 can generate a BoW vector in which words that appear multiple times across the sentences of the paragraph stand out as features.
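One reading of the second counting method is sketched below. It is an interpretation under stated assumptions, not the procedure of FIG. 9 itself: the tokenizer and sentence split are simple English stand-ins, counting is restricted to the target words, and the target-word list uses English surface forms chosen for illustration ("conditions" rather than "condition").

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Stand-in for morphological analysis."""
    return re.findall(r"[a-z]+", text.lower())


def second_counting(target_words: list[str], paragraph: str) -> dict[str, int]:
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]
    total = Counter()
    for word in target_words:
        # Sentences of the paragraph that contain this target word.
        hits = [s for s in sentences if word in tokenize(s)]
        # Per-word list: counts of every target word inside those sentences.
        per_word = Counter(t for s in hits for t in tokenize(s) if t in target_words)
        total.update(per_word)  # add this list into the running sum
    return {w: total[w] for w in target_words}


P1 = ("The conditions are not changed. "
      "The communication function performs error processing. "
      "The setting is to confirm the conditions for each function.")
print(second_counting(["function", "match", "conditions"], P1))
# {'function': 3, 'match': 0, 'conditions': 3}
```

Because the last sentence contains both "function" and "conditions", it is retrieved twice and its words are counted twice, which is how this method gives extra emphasis to words that co-occur with several target words.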

[Processing procedure of data processing method]
Next, with reference to FIG. 10, the processing procedure of the data processing method by the data processing apparatus 10 shown in FIG. 1 will be described. FIG. 10 is a flowchart showing a processing procedure of the data processing method according to the embodiment.

First, as shown in FIG. 10, upon receiving input of an electronic file document (step S1), the first acquisition unit 141 performs a first acquisition process of acquiring the number of occurrences of the words included in the target portion of vectorization in the electronic file document (step S2). The first acquisition unit 141 may instead obtain the frequency of occurrence of the vectorization target words.

Then, the second acquisition unit 142 performs a second acquisition process of acquiring the number of occurrences of the vectorization target words included in a certain range of the electronic file document relative to the target portion (step S3). The second acquisition unit 142 may instead obtain the frequency of occurrence of the predetermined words included in the certain range relative to the target portion of vectorization.

Subsequently, the generation unit 143 performs a generation process of generating a BoW vector in which each element has a value obtained by adding the number of occurrences acquired in the first acquisition process and the number of occurrences acquired in the second acquisition process (step S4). In step S4, the generation unit 143 may weight the number of occurrences acquired in the first acquisition process and use, as the value of each element, the sum of the weighted number of occurrences and the number of occurrences or the frequency of occurrence acquired in the second acquisition process.
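Steps S2 to S4 can be put together in a compact end-to-end sketch. Everything named here (`STOP_WORDS`, `count_words`, `bow_vector`, the blank-line paragraph delimiter, the sample document) is an assumption for illustration; the target portion is removed from the paragraph before counting, following the simplification used in the walkthrough of FIG. 4.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "to", "that"}  # hypothetical list


def count_words(text: str) -> Counter:
    """Tokenize (stand-in for morphological analysis) and drop stop words."""
    return Counter(t for t in re.findall(r"[a-z]+", text.lower())
                   if t not in STOP_WORDS)


def bow_vector(document: str, target: str, weight: float = 1) -> dict[str, float]:
    # S2: first acquisition, counts inside the target portion.
    first = count_words(target)
    # S3: second acquisition, counts of the same words in the paragraph
    # that contains the target portion (target text itself excluded).
    paragraph = next((p for p in document.split("\n\n") if target in p), "")
    second = count_words(paragraph.replace(target, ""))
    # S4: generation, weighted first element plus second element per word.
    return {w: weight * c + second[w] for w, c in sorted(first.items())}


doc = ("Unrelated paragraph.\n\n"
       "The conditions are not changed. Add a function. "
       "The communication function performs error processing.")
print(bow_vector(doc, "Add a function."))
# {'add': 1, 'function': 2}
```

Calling `bow_vector(doc, "Add a function.", weight=2)` doubles the contribution of the target portion and returns `{'add': 2, 'function': 3}`.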

[Processing procedure of the first acquisition process]
Next, the processing procedure of the first acquisition process (step S2) will be described with reference to FIG. 11. FIG. 11 is a flowchart showing a processing procedure of the first acquisition process shown in FIG. 10.

As shown in FIG. 11, in the first acquisition unit 141, the first decomposition unit 1411 decomposes the target portion of vectorization of the electronic file document into words by morphological analysis (step S11). Then, the first deletion unit 1412 sorts each word decomposed by the first decomposition unit 1411 and deletes the duplicate word and the stop word (step S12).

The first element acquisition unit 1413 obtains the number of occurrences of each vectorization target word remaining after step S12 (step S13) and acquires each number of occurrences as a first element. Alternatively, the first element acquisition unit 1413 may acquire, as the first element, the frequency of occurrence of each vectorization target word remaining after the processing by the first deletion unit 1412.
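Steps S11 to S13 can be sketched as follows. The sketch rests on several assumptions: a lowercase regular-expression split stands in for morphological analysis (Japanese text would need a real morphological analyzer), the stop-word list is hypothetical, and step S12 is read as producing the set of vectorization target words whose occurrences are then counted in the decomposed word list.

```python
import re

# Hypothetical stop-word list; the source does not enumerate one.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and"}


def tokenize(text):
    """Stand-in for morphological analysis (step S11): lowercase word split."""
    return re.findall(r"[a-z0-9]+", text.lower())


def first_acquisition(target_portion):
    tokens = tokenize(target_portion)                   # step S11: decompose
    target_words = sorted(set(tokens) - STOP_WORDS)     # step S12: sort, dedupe, drop stop words
    return {w: tokens.count(w) for w in target_words}   # step S13: count occurrences


first_elements = first_acquisition("Restart the router and check the router log")
print(first_elements)  # {'check': 1, 'log': 1, 'restart': 1, 'router': 2}
```

To obtain frequencies of occurrence instead of counts, each count could be divided by the total number of words in the target portion; that variant is omitted here.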

[Processing procedure of the second acquisition process]
Next, the processing procedure of the second acquisition process (step S3) will be described with reference to FIG. 12. FIG. 12 is a flowchart showing the processing procedure of the second acquisition process.

As shown in FIG. 12, in the second acquisition unit 142, the extraction unit 1421 extracts a sentence including the target portion or a paragraph including the target portion from the electronic file document (step S21). The second decomposition unit 1422 decomposes the sentence or paragraph extracted by the extraction unit 1421 into words by morphological analysis (step S22), and the second deletion unit 1423 sorts the words produced by the second decomposition unit 1422 and deletes duplicate words and stop words (step S23).

Then, the second element acquisition unit 1424 obtains the number of occurrences of each vectorization target word remaining after step S23 (step S24) and acquires each number of occurrences as a second element. When the frequency of occurrence was acquired as the first element in step S13, the second element acquisition unit 1424 acquires, as the second element, the frequency of occurrence of each word remaining after step S23. In step S4, the generation unit 143 generates a vector in which each element takes the value obtained by adding the first element acquired in the first acquisition process to the second element acquired in the second acquisition process.
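Steps S21 to S24 can be sketched in the same way. The function names, the unit parameter, the splitting rules (blank lines for paragraphs, sentence-ending punctuation for sentences), the stop-word list, and the sample document are all assumptions made for illustration; a simple regular-expression split again stands in for morphological analysis.

```python
import re

# Hypothetical stop-word list; the source does not enumerate one.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and"}


def tokenize(text):
    """Stand-in for morphological analysis (step S22)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_range(document, target_portion, unit="paragraph"):
    """Step S21 sketch: return the paragraph or sentence containing the target."""
    if unit == "paragraph":
        candidates = re.split(r"\n\s*\n", document)
    else:
        candidates = re.split(r"(?<=[.!?])\s+", document)
    for candidate in candidates:
        if target_portion in candidate:
            return candidate.strip()
    return target_portion  # assumption: fall back to the target itself


def second_acquisition(document, target_portion, unit="paragraph"):
    surrounding = extract_range(document, target_portion, unit)  # step S21
    tokens = tokenize(surrounding)                                # step S22
    words = sorted(set(tokens) - STOP_WORDS)                      # step S23
    return {w: tokens.count(w) for w in words}                    # step S24


document = (
    "The router drops packets. Restart the router after the firmware update.\n"
    "\n"
    "Billing questions go to the accounts team."
)
target = "Restart the router"

by_paragraph = second_acquisition(document, target, unit="paragraph")
by_sentence = second_acquisition(document, target, unit="sentence")
print(by_paragraph)
print(by_sentence)
```

Choosing the paragraph as the range pulls in more context words ("drops", "packets") than choosing the sentence, while the unrelated second paragraph contributes nothing in either case.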

[Effect of Embodiment]
As described above, the data processing device 10 according to the present embodiment generates a vector in which each element takes the value obtained by adding the number or frequency of occurrences of words included in the vectorization target portion of the document data to the number or frequency of occurrences of words included in a certain range associated with the target portion.

Because the data processing device 10 adds the number or frequency of occurrences of words included in the certain range to the number or frequency of occurrences of words in the vectorization target portion, an appropriate vector representation is possible even if the target portion is a short sentence with few words. Further, by collecting the words in the certain range associated with the vectorization target portion and adding them to the elements of the vector, the data processing device 10 uses them as a substitute for background knowledge, so that the meaning of the sentence or text can be expressed in the BoW vector.

Further, the data processing device 10 weights the number or frequency of occurrences of the words included in the vectorization target portion and uses, as the value of each element, the sum of the weighted number or frequency and the number or frequency of occurrences of the vectorization target words included in the certain range associated with the target portion. By weighting the number or frequency of occurrences of each word in the vectorization target portion in this way, the data processing device 10 can generate a BoW vector in which the priority of the original vectorization target portion is adjusted.

The certain range associated with the vectorization target portion is a sentence that includes the target portion or a paragraph that includes the target portion. In this way, the data processing device 10 adds the number or frequency of occurrences of words in a portion closely related to the target portion to the number or frequency of occurrences of the words included in the vectorization target portion. As a result, even if the target portion is a short sentence with few words, the data processing device 10 can generate a BoW vector whose features are reinforced by supplementing, as background knowledge, the number or frequency of occurrences of words in the vicinity of the target portion.
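The reinforcing effect can be illustrated with a toy example. The two paragraphs, the stop-word list, the helper names, and the use of cosine similarity as the comparison measure are assumptions made for this sketch; the values it prints are not results reported in the source. The same short target portion appears in two different paragraphs: vectors built from the target portion alone are indistinguishable, whereas vectors supplemented with the surrounding paragraph differ.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the source does not enumerate one.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "and", "in"}


def count_words(text):
    """Count non-stop words; a regex split stands in for morphological analysis."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def bow_vector(target, surrounding, vocabulary, weight=1.0):
    """Each element = weight * count in target + count in surrounding range."""
    first = count_words(target)
    second = count_words(surrounding)
    return [weight * first[w] + second[w] for w in vocabulary]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


target = "Check the settings"
paragraph_a = "Check the settings. The router firmware controls the network settings."
paragraph_b = "Check the settings. The printer driver controls the paper tray settings."

vocabulary = sorted(set(count_words(paragraph_a)) | set(count_words(paragraph_b)))

# Target portion only: both vectors are identical.
plain_a = bow_vector(target, "", vocabulary)
plain_b = bow_vector(target, "", vocabulary)

# Target portion plus the paragraph that contains it.
rich_a = bow_vector(target, paragraph_a, vocabulary)
rich_b = bow_vector(target, paragraph_b, vocabulary)

plain_similarity = cosine(plain_a, plain_b)
rich_similarity = cosine(rich_a, rich_b)
print(plain_similarity, rich_similarity)
```

In this toy data the supplemented vectors are less similar to each other than the target-only vectors, because each one now carries words from its own context.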

[System configuration, etc.]
Each component of each of the illustrated devices is a functional concept and does not necessarily have to be physically configured as shown in the figures. That is, the specific form of distribution and integration of each device is not limited to the one shown in the figures, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions. Further, all or any part of the processing functions performed by each device may be realized by a CPU and a program analyzed and executed by the CPU, or may be realized as hardware by wired logic.

Further, among the processes described in the present embodiment, all or part of the processes described as being performed automatically can be performed manually, and all or part of the processes described as being performed manually can be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.

[Program]
FIG. 13 is a diagram showing an example of a computer in which the data processing device 10 is realized by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. Each of these parts is connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to the disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, the display 1130.

The hard disk drive 1090 stores, for example, an OS (Operating System) 1091, an application program 1092, a program module 1093, and program data 1094. That is, the program that defines each process of the data processing device 10 is implemented as a program module 1093 in which computer-executable code is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, a program module 1093 for executing processing equivalent to the functional configuration of the data processing device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

Further, the setting data used in the processing of the above-described embodiment is stored as program data 1094 in, for example, a memory 1010 or a hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 into the RAM 1012 as needed, and executes the program.

The program module 1093 and the program data 1094 are not limited to those stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Then, the program module 1093 and the program data 1094 may be read by the CPU 1020 from another computer via the network interface 1070.

Although the embodiment to which the invention made by the present inventor is applied has been described above, the present invention is not limited by the description and the drawings which form a part of the disclosure of the present invention according to the present embodiment. That is, all other embodiments, examples, operational techniques, and the like made by those skilled in the art based on the present embodiment are included in the scope of the present invention.

10 Data processing device
11 Input unit
12 Output unit
13 Communication unit
14 Control unit
15 Storage unit
141 First acquisition unit
142 Second acquisition unit
143 Generation unit
151 Document data
1411 First decomposition unit
1412 First deletion unit
1413 First element acquisition unit
1421 Extraction unit
1422 Second decomposition unit
1423 Second deletion unit
1424 Second element acquisition unit

Claims

1. A data processing device characterized by comprising:
a first acquisition unit that acquires the number of occurrences or the frequency of occurrence of a predetermined word included in a target portion of document data;
a second acquisition unit that acquires the number of occurrences or the frequency of occurrence of the predetermined word included in a certain range of the document data, the certain range being associated with the target portion; and
a generation unit that generates a vector in which each element takes the value obtained by adding the number of occurrences or the frequency of occurrence acquired by the first acquisition unit to the number of occurrences or the frequency of occurrence acquired by the second acquisition unit.
2. The data processing device according to claim 1, wherein the generation unit weights the number of occurrences or the frequency of occurrence acquired by the first acquisition unit, and uses, as the value of each element, the value obtained by adding the weighted number of occurrences or frequency of occurrence to the number of occurrences or the frequency of occurrence acquired by the second acquisition unit.
3. The data processing device according to claim 1 or 2, wherein the certain range is a sentence including the target portion or a paragraph including the target portion.
4. A data processing method executed by a data processing device, the method characterized by comprising:
a first acquisition step of acquiring the number of occurrences or the frequency of occurrence of a predetermined word included in a target portion, to be vectorized, of document data;
a second acquisition step of acquiring the number of occurrences or the frequency of occurrence of the predetermined word included in a certain range of the document data, the certain range being associated with the target portion; and
a generation step of generating a vector in which each element takes the value obtained by adding the number of occurrences or the frequency of occurrence acquired in the first acquisition step to the number of occurrences or the frequency of occurrence acquired in the second acquisition step.
5. A data processing program that causes a computer to execute:
a first acquisition step of acquiring the number of occurrences or the frequency of occurrence of a predetermined word included in a target portion, to be vectorized, of document data;
a second acquisition step of acquiring the number of occurrences or the frequency of occurrence of the predetermined word included in a certain range of the document data, the certain range being associated with the target portion; and
a generation step of generating a vector in which each element takes the value obtained by adding the number of occurrences or the frequency of occurrence acquired in the first acquisition step to the number of occurrences or the frequency of occurrence acquired in the second acquisition step.