WO2023162273A1

WO2023162273A1 - Generation method, generation program, and information processing device

Info

Publication number: WO2023162273A1
Application number: PCT/JP2022/008433
Authority: WO
Inventors: 正弘片岡; 博岩崎; 承剛大山; 量松村
Original assignee: 富士通株式会社
Priority date: 2022-02-28
Filing date: 2022-02-28
Publication date: 2023-08-31

Abstract

This information processing device comprises a storage device which stores target data including a plurality of attribute value arrays obtained by associating a plurality of attributes with attribute values of numerical data corresponding to the attributes. When text data is received, the information processing device designates, from among the plurality of attribute value arrays included in the target data, an attribute value array that satisfies conditions in response to the text data. The information processing device associates a vector responding to the text data with the designated attribute value array.

Description

Generation method, generation program and information processing device

The present invention relates to a generation method and the like.

In the field of document search, there is a technology in which a vector is assigned to each document registered in a document DB (database) and, when a search query is received, a document whose vector corresponds to the vector of the search query is retrieved from the document DB.

Documents containing tagged numerical values, such as XBRL (eXtensible Business Reporting Language) documents, may be registered in the above document DB, and it is required to assign vectors to such documents so that they can be searched. An XBRL document is, for example, a securities report.

When a vector is assigned to a document, the vector of each word contained in the document is calculated using a conventional technique such as Word2Vec or Poincaré embedding, the word vectors are integrated to calculate the vector of the document, and the calculated vector is assigned to the document.

As a conventional technique, a technology has been published in which a machine learning model that estimates a word contained in a document and its adjacent words is trained from the sum vector obtained by summing the vectors of the words contained in the document and from the embedding vector of the hidden layer, and machine translation is performed.

Japanese Patent Application Laid-Open No. 2020-060970

Unlike the words contained in ordinary documents, the tagged numerical values contained in the XBRL document described above are document information associated with attributes, attribute values, and the like that are unique to XBRL documents. Therefore, even if the word vectors of a document are simply calculated as they are using Word2Vec, Poincaré embedding, or the like, the resulting vectors cannot be said to be suitable for effectively searching XBRL documents.

In one aspect, an object of the present invention is to provide a generation method, a generation program, and an information processing apparatus capable of appropriately generating vectors associated with documents, data, etc. containing tagged numerical values.

In a first aspect, a computer executes the following processing. The computer has a storage device storing target data including a plurality of attribute value arrays in which a plurality of attributes and attribute values of numerical data corresponding to the attributes are associated with each other. When receiving text data, the computer identifies an attribute value array that satisfies a condition corresponding to the text data among the plurality of attribute value arrays included in the target data. The computer associates the identified attribute value array with the vector corresponding to the text data.

It is possible to appropriately generate vectors that correspond to documents and data that contain tagged numerical values.

FIG. 1 is a diagram (1) for explaining the processing of the information processing apparatus according to the embodiment.
FIG. 2 is a diagram (2) for explaining the processing of the information processing apparatus according to the embodiment.
FIG. 3 is a diagram (3) for explaining the processing of the information processing apparatus according to the embodiment.
FIG. 4 is a functional block diagram showing the configuration of the information processing apparatus according to this embodiment.
FIG. 5 is a flowchart showing the processing procedure of the preparation phase of the information processing apparatus.
FIG. 6 is a flowchart showing the processing procedure of the search phase of the information processing apparatus.
FIG. 7 is a diagram for explaining other processing (1) of the information processing apparatus.
FIG. 8 is a diagram for explaining other processing (2) of the information processing apparatus.
FIG. 9 is a diagram illustrating an example of a hardware configuration of a computer that implements functions similar to those of the information processing apparatus of the embodiment.

Embodiments of the generation method, generation program, and information processing apparatus disclosed in the present application will be described in detail below with reference to the drawings. Note that the present invention is not limited by these embodiments.

FIGS. 1 to 3 are diagrams for explaining the processing of the information processing apparatus according to this embodiment. FIGS. 1 and 2 show processing in the preparation phase of the information processing apparatus. First, FIG. 1 will be described. The information processing apparatus has an XBRL document DB 50. The XBRL document DB 50 is a DB that stores XBRL documents. An XBRL document is a document that includes a plurality of tagged numerical values, each of which associates a plurality of attributes with an attribute value corresponding to those attributes. For example, the XBRL document DB 50 stores the XBRL document 51 and the like.

In the XBRL document 51, the words enclosed by the tags (XBRL tags) "<", ">" and "</", ">" are the "attributes", and the numerical value sandwiched between matching attributes is the "attribute value". Focusing on the tagged numerical value 51a "<sales><2014>20</2014></sales>", the tagged numerical value 51a includes the attribute "sales" and the attribute "2014" indicating the year, and its attribute value is "2 billion yen" (the tagged figure 20 is expressed in units of 100 million yen). Although sales and year are shown as attributes in FIG. 1, the XBRL document may include other attributes.

Note that the format of the tagged numerical values in an XBRL document is not limited to that shown in FIG. 1 and may be another format. For example, the tagged numerical value 51a of FIG. 1 may instead be written as "<sales year=2014>2 billion yen</sales>"; both notations are tagged numerical values with the same meaning.
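The two notations can be read into the same (attribute, year, value) triple. The following is a minimal sketch, not the method of the application: the function name `parse_tagged_value` and both regular expressions are hypothetical, the value is assumed to be a bare integer inside the tags, and unit handling is omitted.

```python
import re

# Nested form from FIG. 1, e.g. "<sales><2014>20</2014></sales>".
# "<2014>" is not a legal XML element name, so this sketch uses a regular
# expression rather than an XML parser (an assumption of the example).
NESTED = re.compile(
    r"<(?P<attr>\w+)><(?P<year>\d{4})>(?P<value>\d+)</(?P=year)></(?P=attr)>"
)
# Attribute form, e.g. "<sales year=2014>20</sales>".
FLAT = re.compile(
    r"<(?P<attr>\w+) year=(?P<year>\d{4})>(?P<value>\d+)</(?P=attr)>"
)


def parse_tagged_value(text):
    """Return (attribute, year, value) for either notation, or None."""
    for pattern in (NESTED, FLAT):
        match = pattern.search(text)
        if match:
            return (
                match.group("attr"),
                int(match.group("year")),
                int(match.group("value")),
            )
    return None
```

Both `parse_tagged_value("<sales><2014>20</2014></sales>")` and `parse_tagged_value("<sales year=2014>20</sales>")` yield the same triple, which is the sense in which the two formats carry the same meaning.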

The information processing apparatus scans the XBRL document DB 50 and extracts the tagged numerical values sandwiched between the sales tags "<sales>" and "</sales>". The information processing apparatus then sorts the extracted tagged numerical values in ascending order of the year tags (that is, in chronological order) to generate the extracted data 60. The extracted data 60 includes tagged numerical values 51a, 51b, 51c, 51d, 51e, and 51f. Although the case of extracting the tagged numerical values sandwiched between the sales tags has been described here, tagged numerical values sandwiched between other tags may be extracted instead.
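The extraction and sorting step can be sketched as follows. This is an illustration under assumptions: documents are plain strings in the nested notation of FIG. 1, and the names `SALES` and `build_extracted_data` are hypothetical.

```python
import re

# Matches "<sales><YYYY>N</YYYY></sales>" and captures the year and the value.
SALES = re.compile(r"<sales><(\d{4})>(\d+)</\1></sales>")


def build_extracted_data(documents):
    """Collect (year, value) pairs tagged as sales and sort them by year."""
    pairs = []
    for doc in documents:
        for year, value in SALES.findall(doc):
            pairs.append((int(year), int(value)))
    return sorted(pairs)  # ascending by year, i.e. chronological order


# Toy documents; the cost entry is ignored because only sales tags are extracted.
docs = [
    "<sales><2016>20</2016></sales> <cost><2016>5</2016></cost>",
    "<sales><2014>20</2014></sales> <sales><2015>10</2015></sales>",
]
extracted = build_extracted_data(docs)
```

Extracting a different attribute would only require a different pattern, which mirrors the remark that tags other than the sales tag may be used.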

The tagged numerical values 51a to 51f of the extracted data 60 indicate sales for fiscal years 2014 to 2019, respectively. Their attribute values are, in order, "2 billion yen" (51a, 2014), "1 billion yen" (51b, 2015), "2 billion yen" (51c, 2016), "3 billion yen" (51d, 2017), "4 billion yen" (51e, 2018), and "3 billion yen" (51f, 2019).

Next, FIG. 2 will be described. The information processing apparatus scans the attribute values of the tagged numerical values 51a to 51f of the extracted data 60 in chronological order of the year, and identifies the sequence of tagged numerical values in the rising interval T1, in which the attribute value rises, and the sequence of tagged numerical values in the falling interval T2, in which the attribute value falls. In the example shown in FIG. 2, the sequence of tagged numerical values in the rising interval T1 is the tagged numerical values 51b to 51e, and the sequence of tagged numerical values in the falling interval T2 is the tagged numerical values 51e and 51f.
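One way to identify such intervals is to scan consecutive values and group them into maximal rising or falling runs. The sketch below is an assumption about how the scan could be done, not the procedure of the application; `find_intervals` is a hypothetical name, and a step with no change is treated as ending a run.

```python
def find_intervals(values):
    """Return (kind, start, end) for each maximal rising or falling run.

    start and end are inclusive indices into values.
    """
    intervals = []
    start, kind = 0, None
    for i in range(1, len(values)):
        if values[i] > values[i - 1]:
            step = "rising"
        elif values[i] < values[i - 1]:
            step = "falling"
        else:
            step = None
        if step != kind:
            if kind is not None:
                intervals.append((kind, start, i - 1))
            start, kind = i - 1, step
    if kind is not None:
        intervals.append((kind, start, len(values) - 1))
    return intervals


# Attribute values of 51a to 51f (2014 to 2019), in billions of yen.
sales = [2, 1, 2, 3, 4, 3]
intervals = find_intervals(sales)
```

With indices 0 to 5 standing for 51a to 51f, the run covering indices 1 to 4 corresponds to the rising interval T1 and the run covering indices 4 to 5 to the falling interval T2. The scan also reports the drop from 2014 to 2015 as a falling run, which the example of FIG. 2 does not label.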

Here, the information processing apparatus accepts designation of a sentence 10A corresponding to the sequence of tagged numerical values in the rising interval T1. In the example shown in FIG. 2, the sentence 10A is "sales increase". The information processing apparatus calculates the vector Vec10A of the sentence 10A. For example, the information processing apparatus divides the sentence 10A into a plurality of words by executing morphological analysis on the sentence 10A, assigns a vector to each word, and integrates the word vectors to calculate the vector "Vec10A" of the sentence 10A. The vector of each word is defined in dictionary information.
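The sentence vector calculation can be illustrated with a toy dictionary. This sketch makes several assumptions: "integrating" is read as element-wise summation, whitespace splitting stands in for morphological analysis, and the dictionary values and the name `sentence_vector` are made up for the example.

```python
# Toy dictionary information: word -> word vector (values are invented).
DICTIONARY = {
    "sales": [1.0, 0.0, 0.5],
    "increase": [0.0, 1.0, 0.25],
    "decrease": [0.0, -1.0, 0.25],
}


def sentence_vector(sentence, dictionary=DICTIONARY):
    """Split a sentence into words and sum the vectors of the known words."""
    dim = len(next(iter(dictionary.values())))
    total = [0.0] * dim
    for word in sentence.lower().split():  # stand-in for morphological analysis
        vec = dictionary.get(word)
        if vec is not None:  # words missing from the dictionary are skipped
            total = [t + v for t, v in zip(total, vec)]
    return total


vec10a = sentence_vector("sales increase")
```

Because the word vectors are simply accumulated, two sentences that share most of their words end up with nearby vectors, which is what the later similarity search relies on.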

The information processing apparatus sets the vector Vec10A of the sentence 10A as the vector corresponding to the sequence of tagged numerical values 51b to 51e in the rising interval T1. The information processing apparatus associates the vector Vec10A with the offset of the sequence of tagged numerical values 51b to 51e in the rising interval T1 and sets them in the inverted index 70. For example, the offset of the sequence of tagged numerical values 51b to 51e includes the positions, in the extracted data 60, of the first word of the tagged numerical value 51b and the last word of the tagged numerical value 51e.

In addition, the information processing apparatus accepts designation of a sentence 10B corresponding to the sequence of tagged numerical values in the falling interval T2. In the example shown in FIG. 2, the sentence 10B is "Sales are going down." The information processing apparatus calculates the vector "Vec10B" of the sentence 10B. For example, the information processing apparatus performs morphological analysis on the sentence 10B to divide it into a plurality of words, assigns a vector to each word, and integrates the word vectors to calculate the vector of the sentence 10B.

The information processing apparatus sets the vector Vec10B of the sentence 10B as the vector corresponding to the sequence of tagged numerical values 51e and 51f in the falling interval T2. The information processing apparatus associates the vector Vec10B with the offset of the sequence of tagged numerical values 51e and 51f in the falling interval T2 and sets them in the inverted index 70. For example, the offset of the sequence of tagged numerical values 51e and 51f includes the positions, in the extracted data 60, of the first word of the tagged numerical value 51e and the last word of the tagged numerical value 51f.
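The index that results from these two registrations can be pictured as a list of (vector, offset) entries. The sketch below is illustrative only: the class name `InvertedIndex` is hypothetical, the vectors are placeholder values, and offsets are simplified to the positions of the first and last tagged numerical value rather than word positions.

```python
from dataclasses import dataclass, field


@dataclass
class InvertedIndex:
    """Associates a sentence vector with the offset of a sequence of values."""

    entries: list = field(default_factory=list)

    def register(self, vector, start, end):
        # start and end are the inclusive positions of the first and last
        # element of the sequence within the extracted data.
        self.entries.append((tuple(vector), (start, end)))


index = InvertedIndex()
index.register([1.0, 1.0, 0.75], 1, 4)   # stands in for Vec10A -> 51b to 51e (T1)
index.register([1.0, -1.0, 0.75], 4, 5)  # stands in for Vec10B -> 51e and 51f (T2)
```

Storing only offsets keeps the index small; the tagged numerical values themselves stay in the extracted data and are read back by position at search time.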

Next, FIG. 3 will be described. FIG. 3 shows the processing of the search phase of the information processing apparatus. When a search query is received, the information processing apparatus uses the inverted index 70 generated by the processing described with reference to FIGS. 1 and 2 to search the extracted data 60 for the sequence of tagged numerical values corresponding to the search query.

In the example shown in FIG. 3, the sentence of the search query 20 is "sales increase". The information processing apparatus calculates a sentence vector of the search query 20. That is, the information processing apparatus performs morphological analysis on the sentence of the search query 20 to divide it into a plurality of words, assigns a vector to each word, and integrates the word vectors to calculate the query vector Vec20 of the search query 20.

The information processing apparatus compares the query vector Vec20 with each vector registered in the inverted index 70 to calculate a similarity, identifies the vector with the maximum similarity in the inverted index 70, and identifies the offset corresponding to that vector. Sentence vectors have the property that sentences with similar content have similar vectors. The similarity is, for example, cosine similarity.

For example, the vector with the maximum similarity to the query vector Vec20 of the search query "sales increase" is the vector Vec10A of the sentence 10A "sales increase" described with reference to FIG. 2. Therefore, based on the offset associated with the vector Vec10A in the inverted index 70, the information processing apparatus identifies the sequence of tagged numerical values 51b to 51e in the rising interval T1. The information processing apparatus outputs, as a search result 80, information obtained by extracting the sequence of tagged numerical values 51b to 51e in the rising interval T1.
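The search phase can be sketched as a nearest-vector lookup by cosine similarity. The code below is a simplified illustration: the index is a plain list of (vector, offset) pairs, the vectors are placeholder values, and the names `cosine` and `search` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vector, index, extracted):
    """Return the slice of extracted data whose vector is most similar."""
    _, (start, end) = max(index, key=lambda entry: cosine(query_vector, entry[0]))
    return extracted[start:end + 1]


# (year, sales in billions of yen) for 51a to 51f.
extracted = [(2014, 2), (2015, 1), (2016, 2), (2017, 3), (2018, 4), (2019, 3)]
index = [
    ([1.0, 1.0, 0.75], (1, 4)),   # stands in for Vec10A, rising interval T1
    ([1.0, -1.0, 0.75], (4, 5)),  # stands in for Vec10B, falling interval T2
]
result = search([1.0, 1.0, 0.75], index, extracted)
```

A linear scan over the index is enough for a handful of entries; a larger index would likely call for an approximate nearest-neighbor structure, which is outside the scope of this sketch.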

As described above, the information processing apparatus according to the present embodiment extracts the tagged numerical values corresponding to the sales tags and sorts them in chronological order to generate the extracted data 60. The information processing apparatus scans the tagged numerical values of the extracted data 60 in chronological order of the year, identifies the sequence of tagged numerical values in the rising interval, in which the attribute value rises, and the sequence in the falling interval, in which the attribute value falls, and sets the vector of the designated sentence for each of those sequences. For example, the sequence of tagged numerical values 51b to 51e in the rising interval described with reference to FIG. 2 satisfies the condition corresponding to the sentence 10A, and the sequence of tagged numerical values 51e and 51f in the falling interval satisfies the condition corresponding to the sentence 10B. This makes it possible to generate a vector suited to the sequence of tagged numerical values in the rising interval and to the sequence in the falling interval.

The information processing apparatus associates the sequence of tagged numerical values 51b to 51e in the rising interval with the generated vector and sets them in the inverted index 70, and likewise associates the sequence of tagged numerical values 51e and 51f in the falling interval with the generated vector and sets them in the inverted index 70. When receiving a search query, the information processing apparatus can use the inverted index 70 to search for the sequence of tagged numerical values corresponding to the search query.

Next, a configuration example of the information processing apparatus according to this embodiment will be described. FIG. 4 is a functional block diagram showing the configuration of the information processing apparatus according to this embodiment. As shown in FIG. 4, the information processing apparatus 100 has a communication unit 110, an input unit 120, a display unit 130, a storage unit 140, and a control unit 150.

The communication unit 110 executes data communication with an external device via a network. The communication unit 110 corresponds to a network card or the like.

The input unit 120 is an input device that receives operations from the user, and is realized by, for example, a keyboard, a mouse, or the like. The user operates the input unit 120 to input a sentence and a condition corresponding to the sentence. For example, in the example described with reference to FIGS. 1 and 2, the user designates the sentence "sales increase" together with the condition for the sequence of tagged numerical values "extract the tagged numerical values with the attribute 'sales tag', sort them in chronological order by the attribute 'year', and take the rising interval". Likewise, the user designates the sentence "Sales are going down" together with the condition "extract the tagged numerical values with the attribute 'sales tag', sort them in chronological order by the attribute 'year', and take the falling interval". The user may also operate the input unit 120 to input a search query.

The display unit 130 is a display device that outputs the processing results of the control unit 150. For example, the display unit 130 is realized by a liquid crystal monitor, a printer, or the like.

The storage unit 140 is a storage device that stores various types of information, and is implemented by, for example, a semiconductor memory device such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk. For example, the storage unit 140 stores the XBRL document DB 50, the extracted data 60, the inverted index 70, and the dictionary information 140a.

The XBRL document DB 50 is a DB that stores XBRL documents. The explanation of the XBRL document DB 50 is the same as that given with reference to FIG. 1.

The extracted data 60 is information extracted from the XBRL document DB 50 by the control unit 150, which will be described later. The extracted data 60 corresponds to the extracted data 60 described with reference to FIG. 1.

The inverted index 70 is information set by the control unit 150, which will be described later, and associates a sentence vector with the offsets of a plurality of tagged numerical values. For example, as described with reference to FIG. 2, the plurality of tagged numerical values correspond to the sequence of tagged numerical values in the rising interval and the sequence in the falling interval.

The dictionary information 140a is information that associates and holds words and word vectors. Word vectors are pre-trained using conventional techniques such as Word2Vec or Poincaré embedding. A feature is that vectors of words having similar meanings are similar.

The control unit 150 is realized by a processor such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit) executing various programs stored in a storage device inside the information processing apparatus 100, using a RAM or the like as a work area. The control unit 150 may also be implemented by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). For example, the control unit 150 has an identification unit 151, a generation unit 152, and a search unit 153.

When the identification unit 151 receives a sentence and a condition for the sequence of tagged numerical values, it generates the extracted data 60 from the XBRL document DB 50 based on that condition. The processing of the identification unit 151 corresponds to the processing described with reference to FIG. 1.

For example, if the condition for the sequence of tagged numerical values is "extract the tagged numerical values with the attribute 'sales tag', sort them in chronological order by the attribute 'year', and take the rising interval", the identification unit 151 executes the following processing. The identification unit 151 scans the XBRL document DB 50 and extracts the tagged numerical values sandwiched between the sales tags "<sales>" and "</sales>". The identification unit 151 generates the extracted data 60 by sorting the extracted tagged numerical values in ascending order of the year tags (that is, in chronological order).

The identification unit 151 generates the extracted data 60 in the same way when the condition for the sequence of tagged numerical values is "extract the tagged numerical values with the attribute 'sales tag', sort them in chronological order by the attribute 'year', and take the falling interval".

The identification unit 151 registers the generated extracted data 60 in the storage unit 140 and outputs the sentence and the condition for the sequence of tagged numerical values to the generation unit 152.
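The pair passed from the identification unit to the generation unit, a sentence plus a condition, could be represented as a small record. The field and class names below (`ColumnCondition`, `Designation`, `attribute`, `sort_key`, `interval`) are hypothetical and chosen only to mirror the wording of the conditions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnCondition:
    attribute: str  # tag whose values are extracted, e.g. "sales"
    sort_key: str   # attribute used for chronological ordering, e.g. "year"
    interval: str   # "rising" or "falling"


@dataclass(frozen=True)
class Designation:
    sentence: str
    condition: ColumnCondition


designations = [
    Designation("sales increase", ColumnCondition("sales", "year", "rising")),
    Designation("Sales are going down", ColumnCondition("sales", "year", "falling")),
]
```

Keeping the condition structured rather than as free text makes it straightforward for later steps to decide which attribute to extract and which kind of interval to look for.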

The generation unit 152 associates the vector of the designated sentence with the tagged numerical values in the interval that satisfies the condition for the sequence of tagged numerical values, among the plurality of tagged numerical values of the extracted data 60. The generation unit 152 registers, in the inverted index 70, the relationship between the offset of the tagged numerical values in that interval and the associated vector. The processing of the generation unit 152 corresponds to the processing described with reference to FIG. 2.

For example, the generation unit 152 executes the following processing for the sentence "sales increase" (sentence 10A in FIG. 2) and the condition for the sequence of tagged numerical values "extract the tagged numerical values with the attribute 'sales tag', sort them in chronological order by the attribute 'year', and take the rising interval". The generation unit 152 scans the attribute values of the tagged numerical values 51a to 51f of the extracted data 60 in chronological order of the year, and identifies the rising interval T1 in which the attribute values rise.

The generation unit 152 divides the sentence 10A into a plurality of words by executing morphological analysis. The generation unit 152 assigns a vector to each word based on the dictionary information 140a, and integrates the vectors of each word to calculate the vector "Vec10A" of the sentence 10A.

The generation unit 152 sets the vector Vec10A of the sentence 10A as the vector corresponding to the tagged numerical values 51b to 51e of the rising interval T1. The generation unit 152 associates the vector Vec10A with the offsets of the tagged numerical values 51b to 51e of the rising interval T1 and sets them in the inverted index 70.

The generation unit 152 executes the following processing for the sentence "Sales are going down" (sentence 10B in FIG. 2) and the condition for the sequence of tagged numerical values "extract the tagged numerical values with the attribute 'sales tag', sort them in chronological order by the attribute 'year', and take the falling interval". The generation unit 152 scans the attribute values of the tagged numerical values 51a to 51f of the extracted data 60 in chronological order of the year, and identifies the falling interval T2 in which the attribute values fall.

The generation unit 152 divides the sentence 10B into a plurality of words by executing morphological analysis. The generation unit 152 assigns a vector to each word based on the dictionary information 140a, and integrates the word vectors to calculate the vector "Vec10B" of the sentence 10B.

The generation unit 152 sets the vector Vec10B of the sentence 10B as the vector corresponding to the tagged numerical values 51e and 51f of the falling interval T2. The generation unit 152 associates the vector Vec10B with the offsets of the tagged numerical values 51e and 51f of the falling interval T2 and sets them in the inverted index 70.

Upon receiving a search query, the search unit 153 calculates a query vector of the search query and searches for the tagged numerical values corresponding to the search query based on the query vector and the inverted index 70. The processing of the search unit 153 corresponds to the processing described with reference to FIG. 3.

For example, when the search unit 153 receives a search query 20 (for example, "sales increase"), it calculates a sentence vector of the search query 20. The search unit 153 divides the sentence of the search query 20 into a plurality of words by executing morphological analysis. The search unit 153 calculates a query vector Vec20 of the search query 20 by assigning a vector to each word based on the dictionary information 140a and integrating the vector of each word.

The search unit 153 compares the query vector Vec20 with each vector registered in the inverted index 70 to calculate a similarity, identifies the vector with the highest similarity in the inverted index 70, and identifies the offset corresponding to that vector.

For example, the vector with the maximum similarity to the query vector Vec20 of the search query "sales increase" is the vector Vec10A of the sentence 10A "sales increase" described with reference to FIG. 2. Therefore, the search unit 153 identifies the tagged numerical values 51b to 51e of the rising interval T1 based on the offset associated with the vector Vec10A in the inverted index 70. The search unit 153 generates, as a search result 80, information obtained by extracting the tagged numerical values 51b to 51e of the rising interval T1. The search unit 153 may output and display the search result 80 on the display unit 130, or may transmit it to an external device.

Next, an example of the processing procedure of the information processing apparatus 100 according to this embodiment will be described. FIG. 5 is a flowchart showing the processing procedure of the preparation phase of the information processing apparatus. As shown in FIG. 5, the information processing apparatus 100 receives designation of a sentence and a condition for the sequence of tagged numerical values (step S101). The identification unit 151 extracts the tagged numerical values corresponding to the sales tag from the XBRL document DB 50 (step S102). The identification unit 151 generates the extracted data 60 by sorting the extracted tagged numerical values in chronological order of the year (step S103).

The generation unit 152 scans the attribute values of each tagged numerical value in the extraction data 60 in chronological order of the year, and identifies rising sections and falling sections (step S104). The generation unit 152 calculates a vector of each specified sentence based on the dictionary information 140a (step S105).

The generation unit 152 associates the vector of the designated sentence with the offset of the sequence of tagged numerical values in the rising interval, and sets them in the inverted index 70 (step S106). The generation unit 152 associates the vector of the designated sentence with the offset of the sequence of tagged numerical values in the falling interval, and sets them in the inverted index 70 (step S107).

FIG. 6 is a flowchart showing the processing procedure of the search phase of the information processing device. As shown in FIG. 6, the search unit 153 of the information processing device 100 receives a search query (step S201). The search unit 153 calculates a query vector of the search query based on the dictionary information 140a (step S202).

The search unit 153 identifies the offset corresponding to the vector in the inverted index 70 that has the maximum similarity to the query vector (step S203). The search unit 153 generates a search result based on the identified offset (step S204). The search unit 153 outputs the search result (step S205).

Next, the effects of the information processing apparatus 100 according to this embodiment will be described. The information processing apparatus 100 extracts the tagged numerical values corresponding to the sales tags and sorts them in chronological order to generate the extracted data 60. The information processing apparatus 100 scans the tagged numerical values of the extracted data 60 in chronological order of the year, identifies the tagged numerical values in the rising interval, in which the attribute value rises, and the tagged numerical values in the falling interval, in which the attribute value falls, and sets the vector of the designated sentence for the tagged numerical values of each interval. For example, the tagged numerical values 51b to 51e of the rising interval described with reference to FIG. 2 satisfy the condition corresponding to the sentence 10A, and the tagged numerical values 51e and 51f of the falling interval satisfy the condition corresponding to the sentence 10B. This makes it possible to generate a vector suited to the tagged numerical values in the rising interval and to the tagged numerical values in the falling interval.

The information processing apparatus 100 associates the tagged numerical values 51b to 51e of the rising interval with the generated vector and sets them in the inverted index 70, and likewise associates the tagged numerical values 51e and 51f of the falling interval with the generated vector and sets them in the inverted index 70. When receiving a search query, the information processing apparatus 100 can use the inverted index 70 to search for the tagged numerical values corresponding to the search query.

Note that the processing of the information processing apparatus 100 described above is an example, and the information processing apparatus 100 may perform other processing. Other processing (1) and (2) of the information processing apparatus 100 will be described below.

Other processing (1) of the information processing apparatus 100 will be described. Documents stored in the XBRL document DB 50 described above may be in CSV (Comma Separated Value) format. The CSV format does not include tags such as those described with reference to FIG. 1. Therefore, the information processing apparatus 100 may use various conversion tables to convert CSV numerical data into tagged numerical data.

FIG. 7 is a diagram for explaining another process (1) of the information processing device. The information processing apparatus 100 converts the CSV numerical data 80A into tagged numerical data 80B based on the conversion tables 81A and 81B and the tag vector dictionary 81C.

The conversion table 81A is a table that defines the relationship between XBRL tags and columns. A column is information identifying each column of the CSV numerical data 80A.

The conversion table 81B is a table that defines the relationship between tag attributes and rows. For example, the conversion table 81B defines that the tag attribute of each numerical value set in a row of the CSV numerical data 80A is "year=".

The tag vector dictionary 81C associates XBRL tags, words, and tag vectors derived from word vectors. For example, the tag vector dictionary 81C indicates that the XBRL tag <sales> corresponds to the word "sales" and that its tag vector is "Vec1-1 . . . Vec1-n". The tag vector is a pre-computed vector that is specific to the XBRL tag.

For example, the first row of the CSV numerical data 80A will be explained. Based on the conversion table 81A, the information processing apparatus 100 identifies that the XBRL tag of the value set in the column β is "<cost>", and based on the tag vector dictionary 81C, identifies that the word corresponding to "<cost>" is "cost". Based on the conversion table 81B, the information processing apparatus 100 converts "2020" set in the row into "year=2020".

The information processing apparatus 100 arranges the conversion results in a predetermined order, thereby converting the information in the first row of the CSV numerical data 80A into "<cost year=2020>40</cost>". The information processing apparatus 100 performs the same processing on the information in the second row of the CSV numerical data 80A to convert it into "<cost year=2021>42</cost>".
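The conversion can be sketched with two lookup structures standing in for the conversion tables. The CSV layout used here (a `year` column followed by columns named `alpha` and `beta`) is an assumption made for the example, as are the names `COLUMN_TO_TAG`, `ROW_ATTRIBUTE`, and `csv_to_tagged`; the layout of the CSV numerical data 80A in FIG. 7 may differ.

```python
import csv
import io

# Hypothetical stand-ins for the conversion tables 81A and 81B.
COLUMN_TO_TAG = {"alpha": "sales", "beta": "cost"}  # 81A: column -> XBRL tag
ROW_ATTRIBUTE = "year"                               # 81B: row label -> tag attribute


def csv_to_tagged(csv_text):
    """Convert each non-empty numeric CSV cell into a tagged numerical value."""
    tagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        row_label = row[ROW_ATTRIBUTE]
        for column, tag in COLUMN_TO_TAG.items():
            if row.get(column):  # skip empty cells
                tagged.append(
                    f"<{tag} {ROW_ATTRIBUTE}={row_label}>{row[column]}</{tag}>"
                )
    return tagged


csv_text = "year,alpha,beta\n2020,,40\n2021,,42\n"
tagged = csv_to_tagged(csv_text)
```

Once the cells are in tagged form, the same tag-based extraction and interval processing used for XBRL documents can be applied to them unchanged.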

As described above, the information processing apparatus 100 uses various conversion tables to convert numerical data represented by the column and row attributes of a CSV file into tagged numerical data. As a result, the tag-based processing described with reference to FIGS. 1 and 2 can also be executed on CSV data, and appropriate vectors can be associated with it.

Other processing (2) of the information processing apparatus 100 will be described. FIG. 8 is a diagram for explaining other processing (2) of the information processing apparatus. By repeatedly executing the processing described above, the information processing apparatus 100 registers, in the storage unit 140, the plurality of tagged numerical values included in the extracted data 60 (for example, the tagged numerical values of the rising interval T1) and the vector of the designated sentence (for example, the vector of the sentence 10A). In addition, the information processing apparatus 100 calculates a vector of the plurality of tagged numerical values included in the extracted data 60 as the vector of the XBRL document, associates it with the vector of the designated sentence, and registers the pair in the teacher table 90. The vector of the designated sentence is referred to as a sentence vector.

For example, the information processing apparatus 100 calculates the vector of the XBRL document by multiplying the vectors of the plurality of tagged numerical values included in the extraction data 60. When calculating the vector of the XBRL document, the information processing apparatus 100 uses the dictionary information 140a described with reference to FIG. 4 and the vectors set in the tag vector dictionary 81C.

The information processing apparatus 100 repeatedly executes the above-described processing for each combination of a plurality of tagged numerical values included in the extraction data 60 and a designated sentence vector, thereby identifying the relationship between each sentence vector and the vector of the XBRL document, and registers the relationship in the teacher table 90.
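The calculation and registration steps above can be sketched as follows. The source does not define how the vectors are "multiplied"; this sketch assumes an element-wise product of equal-length vectors. The function names `document_vector` and `register_pair`, the list used as the teacher table, and the sample vectors are hypothetical.

```python
from math import prod


def document_vector(value_vectors):
    """Combine the vectors of tagged numerical values into one vector.

    Assumption: "multiplying" is read as an element-wise product.
    """
    if not value_vectors:
        raise ValueError("at least one vector is required")
    return [prod(components) for components in zip(*value_vectors)]


def register_pair(teacher_table, sentence_vector, value_vectors):
    """Append one (sentence vector, XBRL document vector) pair to the table."""
    teacher_table.append((list(sentence_vector), document_vector(value_vectors)))


teacher_table = []  # hypothetical stand-in for the teacher table 90
register_pair(
    teacher_table,
    sentence_vector=[0.2, 0.7, 0.1],
    value_vectors=[[1.0, 2.0, 0.5], [2.0, 0.5, 4.0]],
)
print(teacher_table)
```

Each call adds one training pair, so repeating it over the extraction data builds up the table used for training.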

Subsequently, the information processing apparatus 100 trains the learning model M1 by using the sentence vectors in the teacher table 90 as inputs and the vectors of the XBRL document as outputs (correct labels). The learning model M1 is a neural network, and the information processing apparatus 100 trains the learning model M1 using backpropagation or a similar learning method.
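As a rough illustration of this training step, the following sketch fits a small one-hidden-layer network to (sentence vector, XBRL document vector) pairs using gradient descent with hand-written backpropagation. The network size, the tanh activation, the mean-squared-error loss, the learning rate, and the sample data are all assumptions; the source states only that the learning model M1 is a neural network trained by backpropagation or the like. NumPy is assumed to be available.

```python
import numpy as np


def train_model(inputs, targets, hidden=8, lr=0.1, epochs=500, seed=0):
    """Train a one-hidden-layer network; return a predict function and the loss history."""
    rng = np.random.default_rng(seed)
    x = np.asarray(inputs, dtype=float)
    y = np.asarray(targets, dtype=float)
    w1 = rng.normal(scale=0.5, size=(x.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.normal(scale=0.5, size=(hidden, y.shape[1]))
    b2 = np.zeros(y.shape[1])
    losses = []
    for _ in range(epochs):
        # Forward pass.
        h = np.tanh(x @ w1 + b1)
        pred = h @ w2 + b2
        err = pred - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: gradients of the mean squared error.
        grad_pred = 2.0 * err / err.size
        grad_w2 = h.T @ grad_pred
        grad_b2 = grad_pred.sum(axis=0)
        grad_h = grad_pred @ w2.T
        grad_z = grad_h * (1.0 - h ** 2)  # derivative of tanh
        grad_w1 = x.T @ grad_z
        grad_b1 = grad_z.sum(axis=0)
        # Gradient descent update.
        w1 -= lr * grad_w1
        b1 -= lr * grad_b1
        w2 -= lr * grad_w2
        b2 -= lr * grad_b2

    def predict(vector):
        v = np.asarray(vector, dtype=float)
        return np.tanh(v @ w1 + b1) @ w2 + b2

    return predict, losses


# Hypothetical teacher table contents: sentence vectors and document vectors.
sentence_vectors = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
document_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
predict, losses = train_model(sentence_vectors, document_vectors)
print(losses[0], losses[-1])
print(predict([0.1, 0.9]))
```

The returned `predict` function plays the role of the trained model: given a query vector, it outputs an estimated XBRL document vector.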

The information processing apparatus 100 also generates an inverted index that associates the vector of the XBRL document with the offsets of the plurality of tagged numerical values included in the extraction data 60 that correspond to that vector.

On the other hand, when receiving a search query, the information processing apparatus 100 inputs the query vector of the search query into the trained learning model M1 and calculates the vector of the XBRL document. The information processing apparatus 100 compares the calculated vector of the XBRL document with the inverted index, identifies the offset, and extracts the plurality of tagged numerical values corresponding to the search query from the position corresponding to the offset, thereby obtaining the search result.
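The lookup against the inverted index can be sketched as below. The source does not state how the calculated vector is compared with the index; this sketch assumes cosine similarity and returns the offsets registered for the closest vector. The dictionary layout, the function names, and the sample offsets are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_inverted_index(entries):
    """Map each XBRL document vector to the offsets of its tagged numerical values."""
    index = {}
    for vector, offset in entries:
        index.setdefault(tuple(vector), []).append(offset)
    return index


def search(index, document_vec):
    """Return the offsets registered for the vector most similar to document_vec."""
    if not index:
        return []
    best = max(index, key=lambda vector: cosine(vector, document_vec))
    return index[best]


# Hypothetical entries: (XBRL document vector, offset into the stored data).
entries = [([1.0, 0.0], 0), ([1.0, 0.0], 128), ([0.0, 1.0], 64)]
inverted_index = build_inverted_index(entries)
offsets = search(inverted_index, [0.9, 0.1])
print(offsets)
```

In a full pipeline, the vector passed to `search` would be the output of the trained model for the query vector, and the returned offsets would be used to read the tagged numerical values from storage.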

Next, an example of the hardware configuration of a computer that implements the same functions as the information processing apparatus 100 shown in the above embodiment will be described. FIG. 9 is a diagram illustrating an example of a hardware configuration of a computer that implements functions similar to those of the information processing apparatus of the embodiment.

As shown in FIG. 9, the computer 200 has a CPU 201 that executes various arithmetic processes, an input device 202 that receives data input from the user, and a display 203. The computer 200 also has a communication device 204 and an interface device 205 for exchanging data with an external device or the like via a wired or wireless network. The computer 200 also has a RAM 206 that temporarily stores various information, and a hard disk device 207. The devices 201 to 207 are connected to a bus 208.

The hard disk device 207 has an identification program 207a, a generation program 207b, and a search program 207c. The CPU 201 reads each of the programs 207a to 207c and loads it into the RAM 206.

The identification program 207a functions as an identification process 206a. The generation program 207b functions as a generation process 206b. The search program 207c functions as a search process 206c.

The processing of the identification process 206a corresponds to the processing of the identification unit 151. The processing of the generation process 206b corresponds to the processing of the generation unit 152. The processing of the search process 206c corresponds to the processing of the search unit 153.

It should be noted that the programs 207a to 207c do not necessarily have to be stored in the hard disk device 207 from the beginning. For example, each program may be stored in a “portable physical medium” such as a flexible disk (FD), CD-ROM, DVD, magneto-optical disk, or IC card inserted into the computer 200. The computer 200 may then read and execute each of the programs 207a to 207c.

50 XBRL document database
60 Extraction data
70 Inverted index
100 Information processing apparatus
110 Communication unit
120 Input unit
130 Display unit
140 Storage unit
140a Dictionary information
150 Control unit
151 Identification unit
152 Generation unit
153 Search unit

Claims

1. A generation method in which a computer, provided with a storage device storing target data including a plurality of attribute value arrays in which a plurality of attributes and attribute values of numerical data corresponding to the attributes are associated, executes a process comprising:
when text data is received, identifying, from among the plurality of attribute value arrays included in the target data, an attribute value array that satisfies a condition corresponding to the text data; and
associating a vector corresponding to the text data with the identified attribute value array.

2. The generation method according to claim 1, wherein the plurality of attributes include a fiscal year attribute, and the process further comprises: rearranging the plurality of attribute value arrays in chronological order based on the fiscal year attributes set in the identified plurality of attribute value arrays; and identifying an attribute value array to be associated with the text data based on the attribute values of the rearranged plurality of attribute value arrays.

3. The generation method according to claim 2, wherein the process of identifying an attribute value array to be associated with the text data includes: scanning the attribute values of the plurality of attribute value arrays in time series; identifying a rising section of attribute value arrays in which the attribute value continuously rises and a falling section of attribute value arrays in which the attribute value continuously falls; associating a vector corresponding to text data related to the rising section with the attribute value arrays of the rising section; and associating a vector corresponding to text data related to the falling section with the attribute value arrays of the falling section.

4. The generation method according to claim 1, wherein the process further comprises: generating an inverted index that associates the position of the attribute value array with the vector corresponding to the attribute value array; and, when a search query is received, searching for an attribute value array corresponding to the search query based on the vector of the search query and the inverted index.

5. The generation method according to claim 1, wherein the process further comprises, when tabular data in which an attribute is set for each of a plurality of columns is acquired, converting the tabular data into the target data by converting the attribute of each column of the tabular data into a tag format.

6. A generation program that causes a computer, provided with a storage device storing target data including a plurality of attribute value arrays in which a plurality of attributes and attribute values of numerical data corresponding to the attributes are associated, to execute a process comprising:
when text data is received, identifying, from among the plurality of attribute value arrays included in the target data, an attribute value array that satisfies a condition corresponding to the text data; and
associating a vector corresponding to the text data with the identified attribute value array.

7. The generation program according to claim 6, wherein the plurality of attributes include a fiscal year attribute, and the process further comprises: rearranging the plurality of attribute value arrays in chronological order based on the fiscal year attributes set in the identified plurality of attribute value arrays; and identifying an attribute value array to be associated with the text data based on the attribute values of the rearranged plurality of attribute value arrays.

8. The generation program according to claim 7, wherein the process of identifying an attribute value array to be associated with the text data includes: scanning the attribute values of the plurality of attribute value arrays in time series; identifying a rising section of attribute value arrays in which the attribute value continuously rises and a falling section of attribute value arrays in which the attribute value continuously falls; associating a vector corresponding to text data related to the rising section with the attribute value arrays of the rising section; and associating a vector corresponding to text data related to the falling section with the attribute value arrays of the falling section.

9. The generation program according to claim 6, wherein the process further comprises: generating an inverted index that associates the position of the attribute value array with the vector corresponding to the attribute value array; and, when a search query is received, searching for an attribute value array corresponding to the search query based on the vector of the search query and the inverted index.

10. The generation program according to claim 6, wherein the process further comprises, when tabular data in which an attribute is set for each of a plurality of columns is acquired, converting the tabular data into the target data by converting the attribute of each column of the tabular data into a tag format.

11. An information processing apparatus comprising:
a storage device storing target data including a plurality of attribute value arrays in which a plurality of attributes and attribute values of numerical data corresponding to the attributes are associated; and
a control unit that executes a process of, when text data is received, identifying, from among the plurality of attribute value arrays included in the target data, an attribute value array that satisfies a condition corresponding to the text data, and associating a vector corresponding to the text data with the identified attribute value array.

12. The information processing apparatus according to claim 11, wherein the control unit further executes a process of rearranging the plurality of attribute value arrays in chronological order based on the fiscal year attributes set in the identified plurality of attribute value arrays, and identifying an attribute value array to be associated with the text data based on the attribute values of the rearranged plurality of attribute value arrays.

13. The information processing apparatus according to claim 12, wherein the control unit scans the attribute values of the plurality of attribute value arrays in time series, identifies a rising section of attribute value arrays in which the attribute value continuously rises and a falling section of attribute value arrays in which the attribute value continuously falls, associates a vector corresponding to text data related to the rising section with the attribute value arrays of the rising section, and associates a vector corresponding to text data related to the falling section with the attribute value arrays of the falling section.

14. The information processing apparatus according to claim 11, wherein the control unit further executes a process of generating an inverted index that associates the position of the attribute value array with the vector corresponding to the attribute value array, and, when a search query is received, searching for an attribute value array corresponding to the search query based on the vector of the search query and the inverted index.

15. The information processing apparatus according to claim 11, wherein, when tabular data in which an attribute is set for each of a plurality of columns is acquired, the control unit further executes a process of converting the tabular data into the target data by converting the attribute of each column of the tabular data into a tag format.