WO2023162273A1 - Generation method, generation program, and information processing device - Google Patents

Generation method, generation program, and information processing device

Info

Publication number
WO2023162273A1
Authority
WO
WIPO (PCT)
Prior art keywords
attribute value
attribute
vector
value array
text data
Prior art date
Application number
PCT/JP2022/008433
Other languages
French (fr)
Japanese (ja)
Inventor
正弘 片岡
博 岩崎
承剛 大山
量 松村
Original Assignee
富士通株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 富士通株式会社 filed Critical 富士通株式会社
Priority to PCT/JP2022/008433 priority Critical patent/WO2023162273A1/en
Publication of WO2023162273A1 publication Critical patent/WO2023162273A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84Mapping; Conversion

Definitions

  • the present invention relates to a generation method and the like.
  • a vector is assigned to each document registered in the document DB (Data Base), and when a search query is received, the document whose vector corresponds to the search query vector is searched from the document DB.
  • XBRL stands for eXtensible Business Reporting Language.
  • the XBRL document is a securities report or the like.
  • The vector of each word contained in the document is calculated using conventional techniques such as Word2Vec and Poincaré embedding, the word vectors are integrated to calculate the vector of the document, and the calculated vector is assigned to the document.
  • the tagged numerical values contained in the XBRL document described above are document information associated with attributes, attribute values, etc. unique to XBRL documents. Therefore, even if word vectors of a document are simply calculated as they are using Word2Vec, Poincaré embedding, etc., such vectors cannot be said to be vectors that can effectively search an XBRL document.
  • an object of the present invention is to provide a generation method, a generation program, and an information processing apparatus capable of appropriately generating vectors associated with documents, data, etc. containing tagged numerical values.
  • the computer executes the following processes.
  • the computer has a storage device storing target data including a plurality of attribute value arrays in which a plurality of attributes and attribute values of numerical data corresponding to the attributes are associated with each other.
  • the computer identifies an attribute value array that satisfies a condition corresponding to the text data among the plurality of attribute value arrays included in the target data.
  • the computer associates the identified attribute value array with the vector corresponding to the text data.
  • FIG. 1 is a diagram (1) for explaining the processing of the information processing apparatus according to the embodiment.
  • FIG. 2 is a diagram (2) for explaining the processing of the information processing apparatus according to the embodiment.
  • FIG. 3 is a diagram (3) for explaining the processing of the information processing apparatus according to the embodiment.
  • FIG. 4 is a functional block diagram showing the configuration of the information processing apparatus according to this embodiment.
  • FIG. 5 is a flow chart showing the processing procedure of the preparation phase of the information processing apparatus.
  • FIG. 6 is a flow chart showing the processing procedure of the search phase of the information processing apparatus.
  • FIG. 7 is a diagram for explaining another process (1) of the information processing apparatus.
  • FIG. 8 is a diagram for explaining another process (2) of the information processing apparatus.
  • FIG. 9 is a diagram illustrating an example of a hardware configuration of a computer that implements functions similar to those of the information processing apparatus of the embodiment.
  • the information processing device has an XBRL document DB 50 .
  • the XBRL document DB 50 is a DB that stores XBRL documents.
  • An XBRL document is a document that includes a plurality of tagged numeric values that associate a plurality of attributes with attribute values corresponding to the attributes.
  • the XBRL document DB 50 stores the XBRL document 51 and the like.
  • the words enclosed by the tags are the attributes, and the numerical value sandwiched between the same attributes is the “attribute value”.
  • Focusing on the tagged numerical value 51a "<sales><2014>20</2014></sales> billion yen", the tagged numerical value 51a includes the attribute "sales" and the attribute "2014" indicating the year. The attribute value is "2 billion yen".
  • Although sales and year are shown as attributes in FIG. 1, other attributes may be included in the XBRL document.
  • The format of the tagged numerical values in the XBRL document is not limited to that shown in FIG. 1, and may be another format.
  • The information processing device scans the XBRL document DB 50 and extracts the tagged numerical values sandwiched between the sales tags "<sales>, </sales>". Further, the information processing device sorts the extracted tagged numerical values in ascending order (time series) according to the year tags, which yields the extracted data 60.
  • the extracted data 60 includes tagged numerical values 51a, 51b, 51c, 51d, 51e, and 51f.
  • the tagged numerical value 51a of the extracted data 60 is a tagged numerical value indicating sales in "2014", and the attribute value is "2 billion yen.”
  • the tagged numerical value 51b of the extracted data 60 is a tagged numerical value indicating the sales in “2015”, and the attribute value is “1 billion yen”.
  • the tagged numerical value 51c of the extracted data 60 is a tagged numerical value indicating sales in “2016 fiscal year”, and the attribute value is “2 billion yen”.
  • the tagged numerical value 51d of the extracted data 60 is a tagged numerical value indicating the sales in "2017 fiscal year”, and the attribute value is "3 billion yen”.
  • the tagged numerical value 51e of the extracted data 60 is a tagged numerical value indicating the sales in “2018 fiscal year”, and the attribute value is “4 billion yen”.
  • the tagged numerical value 51f of the extracted data 60 is a tagged numerical value indicating sales in “2019 fiscal year”, and the attribute value is “3 billion yen”.
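The extraction and sorting step described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the regular expression only matches the simplified `<sales><2014>20</2014></sales>` layout quoted for tagged numerical value 51a (real XBRL syntax is richer), and the function name `extract_sorted` and the sample string are hypothetical.

```python
import re

# ASSUMPTION: matches only the simplified layout quoted in the text,
# e.g. <sales><2014>20</2014></sales>. Real XBRL documents look different.
TAGGED = re.compile(
    r"<(?P<attr>[a-z_]+)><(?P<year>\d{4})>(?P<value>\d+)</(?P=year)></(?P=attr)>"
)

def extract_sorted(text, attribute):
    """Return (year, value) pairs for one attribute, sorted by year."""
    rows = [
        (int(m.group("year")), int(m.group("value")))
        for m in TAGGED.finditer(text)
        if m.group("attr") == attribute
    ]
    return sorted(rows)  # tuples sort by year first, i.e. chronologically

# Made-up input: values are illustrative, not taken from the figures.
sample = (
    "<sales><2016>20</2016></sales> <profit><2015>3</2015></profit> "
    "<sales><2014>20</2014></sales> <sales><2015>10</2015></sales>"
)
extracted = extract_sorted(sample, "sales")
print(extracted)  # [(2014, 20), (2015, 10), (2016, 20)]
```

Tags whose attribute is not the requested one (here `profit`) are dropped, which mirrors extracting only the values sandwiched between the sales tags.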
  • The information processing device scans the attribute values of the tagged numerical values 51a to 51f of the extracted data 60 in chronological order of year, and identifies the string of tagged numerical values in the rising interval T1, in which the attribute value rises, and the string of tagged numerical values in the descending interval T2, in which the attribute value falls.
  • the column of tagged numerical values in the rising interval T1 is tagged numerical values 51b to 51e.
  • the column of tagged numerical values in the descending section T2 is tagged numerical values 51e and 51f.
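One possible way to find such intervals is to split the time-sorted attribute values into maximal rising and falling runs. The sketch below is an assumption about how the scan could work; the function name `monotonic_runs` is hypothetical, and the values are the attribute values of 51a to 51f in billions of yen. Note that this sketch reports every run, so it also returns the 2014 to 2015 fall (51a, 51b), which the text does not label as an interval.

```python
def monotonic_runs(values):
    """Split a series into maximal rising and falling runs.

    Returns (rising, falling), each a list of inclusive (start, end) index
    pairs. A turning point belongs to both neighbouring runs, as 51e does.
    """
    rising, falling = [], []
    start = 0
    for i in range(1, len(values)):
        # Sign of the step ending at i: +1 rise, -1 fall, 0 flat.
        step = (values[i] > values[i - 1]) - (values[i] < values[i - 1])
        nxt = None
        if i + 1 < len(values):
            nxt = (values[i + 1] > values[i]) - (values[i + 1] < values[i])
        if step != nxt:  # direction changes (or series ends): run ends at i
            if step > 0:
                rising.append((start, i))
            elif step < 0:
                falling.append((start, i))
            start = i
    return rising, falling

labels = ["51a", "51b", "51c", "51d", "51e", "51f"]
sales = [2, 1, 2, 3, 4, 3]  # attribute values of 51a..51f, in billion yen
rising, falling = monotonic_runs(sales)
print(rising)   # [(1, 4)]          -> 51b..51e, the rising interval T1
print(falling)  # [(0, 1), (4, 5)]  -> (4, 5) is 51e..51f, the interval T2
```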
  • the information processing device accepts the designation of the sentence 10A corresponding to the string of numerical values with tags in the ascending interval T1.
  • the sentence 10A becomes "sales increase".
  • the information processing device calculates vector Vec10A of sentence 10A.
  • The information processing device divides the sentence 10A into a plurality of words by executing morphological analysis on it, assigns a vector to each word, and integrates the word vectors to calculate the vector "Vec10A" of the sentence 10A. Each word vector is defined in the dictionary information.
  • the information processing device sets the vector Vec10A of the sentence 10A as a vector corresponding to the columns 51b to 51e of tagged numerical values in the rising section T1.
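The sentence-vector calculation can be sketched as below. Several things here are assumptions made for illustration: whitespace splitting stands in for morphological analysis, combining the word vectors is taken to mean an element-wise sum (the text does not spell out the operation), and the dictionary entries are made-up three-dimensional vectors rather than trained Word2Vec or Poincaré embeddings.

```python
# Toy dictionary information: word -> vector. Values are invented.
DICTIONARY = {
    "sales": [1.0, 0.0, 0.5],
    "increase": [0.0, 1.0, 0.5],
}

def tokenize(sentence):
    """Stand-in for morphological analysis: lowercase, split on whitespace."""
    return sentence.lower().split()

def sentence_vector(sentence, dictionary):
    """Combine word vectors into one sentence vector.

    ASSUMPTION: the combination is an element-wise sum.
    """
    dim = len(next(iter(dictionary.values())))
    total = [0.0] * dim
    for word in tokenize(sentence):
        vec = dictionary.get(word)  # unknown words are skipped in this sketch
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total

vec10a = sentence_vector("sales increase", DICTIONARY)
print(vec10a)  # [1.0, 1.0, 1.0]
```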
  • the information processing device associates the vector Vec10A with the offsets of the columns 51b to 51e of the tagged numerical values in the rising section T1 and sets them in the transposed index 70.
  • the offset of the columns 51b-51e of tagged numbers in rising interval T1 includes the position of the first word of tagged number 51b and the last word of tagged number 51e in extracted data 60.
  • the information processing device accepts designation of the sentence 10B corresponding to each tagged numerical value in the descending interval T2. In the example shown in FIG. 2, the sentence 10B becomes "Sales are going down.”
  • The information processing device calculates the vector "Vec10B" of sentence 10B. For example, the information processing device performs morphological analysis on the sentence 10B to divide it into a plurality of words, assigns a vector to each word, and integrates the vectors of each word to calculate the vector of the sentence 10B.
  • the information processing device sets the vector Vec10B of the sentence 10B as a vector corresponding to the columns 51e and 51f of the tagged numerical values of the descending interval T2.
  • the information processing device associates the vector Vec10B with the offsets of the columns 51e and 51f of tagged numerical values in the descending interval T2, and sets them in the transposed index 70.
  • the offsets of the tagged numeric columns 51e and 51f of the descending interval T2 include the position of the first word of the tagged numeric value 51e and the last word of the tagged numeric value 51f in the extracted data 60.
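The registration of vectors and offsets in the index can be pictured as follows. This is a simplified sketch: each entry is a hypothetical `(vector, (start, end))` tuple, the offsets are positions of whole tagged numerical values in the extracted data rather than word positions, and the vectors are placeholders, not real sentence vectors.

```python
def register(index, vector, start, end):
    """Append one entry mapping a sentence vector to an inclusive offset range."""
    if start > end:
        raise ValueError("start offset must not exceed end offset")
    index.append((vector, (start, end)))
    return index

# Positions 0..5 stand for tagged numerical values 51a..51f.
index = []
register(index, [1.0, 1.0, 1.0], 1, 4)   # placeholder for Vec10A -> 51b..51e
register(index, [1.0, -1.0, 1.0], 4, 5)  # placeholder for Vec10B -> 51e..51f

for vector, (start, end) in index:
    print(vector, start, end)
```

Storing only the first and last positions keeps each entry small while still letting a later lookup slice the matching run out of the extracted data.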
  • FIG. 3 shows the processing of the search phase of the information processing apparatus.
  • The information processing device uses the transposed index 70 generated by the preceding processing to search the extracted data 60 for the data corresponding to the search query 20.
  • the sentence of the search query 20 is "sales increase".
  • The information processing device calculates a sentence vector of the search query 20. That is, the information processing device performs morphological analysis on the sentence of the search query 20 to divide it into a plurality of words, assigns a vector to each word, and integrates the vectors of each word to calculate a query vector Vec20 of the search query 20.
  • The information processing device compares the query vector Vec20 with each vector registered in the transposed index 70, calculates the similarity, identifies the vector with the maximum similarity in the transposed index 70, and identifies the corresponding offset.
  • Sentence vectors are characterized in that vectors of sentences having similar sentence contents are similar to each other. The degree of similarity is cosine similarity or the like.
  • The vector that maximizes the similarity to the query vector Vec20 of the search query "sales increase" is the vector Vec10A of the sentence 10A "sales increase" described above. Therefore, based on the offset associated with the vector Vec10A in the transposed index 70, the information processing device identifies the string 51b to 51e of tagged numerical values in the rising interval T1. The information processing device outputs, as a search result 80, information obtained by extracting the string 51b to 51e of tagged numerical values in the rising interval T1.
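The search phase can be sketched as a nearest-vector lookup using cosine similarity, which the text names as one possible similarity measure. The index layout `(vector, (start, end))`, the function names, and all vector values below are assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, index, extracted):
    """Return the slice of extracted data whose vector is most similar."""
    _, (start, end) = max(index, key=lambda entry: cosine(query_vec, entry[0]))
    return extracted[start:end + 1]  # offsets are inclusive

extracted = ["51a", "51b", "51c", "51d", "51e", "51f"]
index = [
    ([1.0, 1.0, 1.0], (1, 4)),   # placeholder for Vec10A, rising interval T1
    ([1.0, -1.0, 1.0], (4, 5)),  # placeholder for Vec10B, descending interval T2
]
query_vec = [0.9, 1.1, 1.0]      # placeholder for Vec20
result = search(query_vec, index, extracted)
print(result)  # ['51b', '51c', '51d', '51e']
```

A linear scan over all entries is enough for a handful of registered sentences; a larger index would likely need an approximate nearest-neighbour structure, which the text does not discuss.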
  • the information processing apparatus extracts the tagged numerical values corresponding to the sales tags, and sorts the extracted tagged numerical values in chronological order to generate the extracted data 60.
  • The information processing device scans the tagged numerical values of the extracted data 60 in chronological order of year, identifies the column of tagged numerical values in the rising section, where the attribute value rises, and the column of tagged numerical values in the falling section, where the attribute value falls, and sets the specified sentence vector for each of these columns.
  • columns 51b to 51e of tagged numerical values for the ascending interval described in FIG. 2 satisfy the conditions according to sentence 10A.
  • Columns 51e and 51f of tagged numbers in the descending interval satisfy the conditions according to sentence 10B. This makes it possible to generate a vector suitable for each column of tagged numbers in the ascending interval and for each column of tagged numbers in the descending interval.
  • The information processing device associates the column of tagged numerical values 51b to 51e in the ascending interval with the generated vector and sets them in the transposed index 70, and likewise associates the column of tagged numerical values 51e and 51f in the descending interval with the generated vector and sets them in the transposed index 70.
  • the information processing apparatus can use the inverted index 70 to search for a string of tagged numerical values corresponding to the search query.
  • FIG. 4 is a functional block diagram showing the configuration of the information processing apparatus according to this embodiment.
  • this information processing apparatus 100 has a communication section 110 , an input section 120 , a display section 130 , a control section 150 and a storage section 140 .
  • the communication unit 110 executes data communication with an external device via a network.
  • a communication unit 110 corresponds to a network card or the like.
  • the input unit 120 is an input device that receives operations from the user, and is realized by, for example, a keyboard, mouse, and the like.
  • The user operates the input unit 120 to input a sentence and a condition corresponding to the sentence. For example, in the example described in FIGS. 1 and 2, the user enters the sentence "sales increase" and the following condition on the string of tagged numerical values: extract the tagged numerical values with the attribute "sales tag", sort them in chronological order by the attribute "year", and take the "rising interval".
  • Similarly, the user enters the sentence "sales are going down" and the following condition on the string of tagged numerical values: extract the tagged numerical values with the attribute "sales tag", sort them in chronological order by the attribute "year", and take the "descending interval". The user may also operate the input unit 120 to input a search query.
  • the display unit 130 is a display device for outputting the processing result of the control unit 150 .
  • display unit 130 is realized by a liquid crystal monitor, a printer, or the like.
  • The storage unit 140 is a storage device that stores various types of information, and is implemented by, for example, a semiconductor memory device such as RAM (Random Access Memory) or flash memory, or a storage device such as a hard disk or optical disk.
  • the storage unit 140 stores the XBRL document DB 50, the extracted data 60, the transposed index 70, and the dictionary information 140a.
  • the XBRL document DB 50 is a DB that stores XBRL documents.
  • The description of the XBRL document DB 50 is the same as that given above.
  • The extracted data 60 is sentence information extracted from the XBRL document DB 50 by the control unit 150, which will be described later.
  • The extracted data 60 corresponds to the extracted data 60 described above.
  • The transposed index 70 is information set by the control unit 150, which will be described later, and associates a sentence vector with the offsets of a plurality of sentences.
  • the multiple sentences correspond to rising segment sentences and falling segment sentences.
  • the dictionary information 140a is information that associates and holds words and word vectors.
  • Word vectors are pre-trained using conventional techniques such as Word2Vec or Poincaré embedding.
  • a feature is that vectors of words having similar meanings are similar.
  • The control unit 150 is realized by a processor such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit) executing various programs stored in a storage device inside the information processing apparatus 100, using a RAM or the like as a work area. The control unit 150 may also be implemented by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). For example, the control unit 150 has an identifying unit 151, a generating unit 152, and a searching unit 153.
  • When the identification unit 151 receives a sentence and a condition on the string of tagged numerical values, it generates the extracted data 60 from the XBRL document DB 50 based on that condition.
  • The processing of the identification unit 151 corresponds to the processing described above.
  • The identification unit 151 scans the XBRL document DB 50 and extracts a string of tagged numerical values sandwiched between the sales tags "<sales>, </sales>".
  • The identification unit 151 generates the extracted data 60 by sorting the extracted string of tagged numerical values in ascending order (time series) by the year tags.
  • The identification unit 151 generates the extracted data 60 in the same way when the condition on the string of tagged numerical values is to extract each tagged numerical value with the attribute "sales tag", sort by the attribute "year" in chronological order, and take the "falling interval".
  • the specifying unit 151 registers the generated extracted data 60 in the storage unit 140 and outputs the sentence and the condition of the string of tagged numerical values to the generating unit 152 .
  • the generation unit 152 associates the specified sentence vector with the tagged numerical values in the section corresponding to the condition of the tagged numerical value column among the multiple tagged numerical values of the extracted data 60 .
  • the generation unit 152 registers, in the transposed index 70, the relationship between the offset of the tagged numerical value in the section corresponding to the condition of the string of tagged numerical values and the associated vector.
  • The processing of the generation unit 152 corresponds to the processing described above.
  • The generation unit 152 executes the following processing for the sentence "sales increase" (sentence 10A in FIG. 2) and the condition on the string of tagged numerical values, namely: extract the tagged numerical values with the attribute "sales tag", sort by the attribute "year" in chronological order, and take the "rising section".
  • The generation unit 152 scans the attribute values of the tagged numerical values 51a to 51f of the extracted data 60 in chronological order of year, and identifies the rising section T1 in which the attribute values increase.
  • the generation unit 152 divides the sentence 10A into a plurality of words by executing morphological analysis.
  • the generation unit 152 assigns a vector to each word based on the dictionary information 140a, and integrates the vectors of each word to calculate the vector "Vec10A" of the sentence 10A.
  • the generation unit 152 sets the vector Vec10A of the sentence 10A as a vector corresponding to the tagged numerical values 51b to 51e of the rising section T1.
  • the generation unit 152 associates the vector Vec10A with the offsets of the tagged numerical values 51b to 51e of the rising section T1, and sets them in the transposed index 70.
  • The generation unit 152 executes the following processing for the sentence "Sales are going down" (sentence 10B in FIG. 2) and the condition on the string of tagged numerical values, namely: extract the tagged numerical values with the attribute "sales tag", sort by the attribute "year" in chronological order, and take the "descending interval".
  • The generation unit 152 scans the attribute values of the tagged numerical values 51a to 51f of the extracted data 60 in chronological order of year, and identifies the descending interval T2 in which the attribute values fall.
  • the generation unit 152 divides the sentence 10B into a plurality of words by executing morphological analysis.
  • the generator 152 assigns a vector to each word based on the dictionary information 140a, and integrates the vector of each word to calculate the vector "Vec10B" of the sentence 10B.
  • the generation unit 152 sets the vector Vec10B of the sentence 10B as a vector corresponding to the tagged numerical values 51e and 51f of the descending interval T2.
  • the generation unit 152 associates the vector Vec10B with the offsets of the tagged numerical values 51e and 51f of the descending interval T2, and sets them in the transposed index 70.
  • Upon receiving a search query, the search unit 153 calculates a query vector of the search query, and searches for sentences corresponding to the search query based on the query vector and the inverted index.
  • The processing of the search unit 153 corresponds to the search processing described above.
  • When the search unit 153 receives a search query 20 (for example, "sales increase"), it calculates a sentence vector of the search query 20.
  • the search unit 153 divides the sentence of the search query 20 into a plurality of words by executing morphological analysis.
  • the search unit 153 calculates a query vector Vec20 of the search query 20 by assigning a vector to each word based on the dictionary information 140a and integrating the vector of each word.
  • The search unit 153 compares the query vector Vec20 with each vector registered in the transposed index 70, calculates the similarity, identifies the vector with the highest similarity in the transposed index 70, and identifies the corresponding offset.
  • The search unit 153 identifies the tagged numerical values 51b to 51e of the rising section T1 based on the offset associated with the vector Vec10A of the transposed index 70.
  • The search unit 153 generates, as a search result 80, information obtained by extracting the tagged numerical values 51b to 51e of the rising section T1.
  • the search unit 153 may output and display the search result 80 on the display unit 130, or may transmit it to an external device.
  • FIG. 5 is a flow chart showing the processing procedure of the preparation phase of the information processing apparatus.
  • the information processing apparatus 100 receives specification of a condition for a sentence and a string of numerical values with tags (step S101).
  • the specifying unit 151 extracts each tagged numeric value corresponding to the sales tag from the XBRL document DB 50 (step S102).
  • the identification unit 151 generates extraction data by sorting the extracted numerical values with tags in chronological order of year (step S103).
  • the generation unit 152 scans the attribute values of each tagged numerical value in the extraction data 60 in chronological order of the year, and identifies rising sections and falling sections (step S104). The generation unit 152 calculates a vector of each specified sentence based on the dictionary information 140a (step S105).
  • the generation unit 152 associates the specified sentence vector with the offset of the column of tagged numerical values in the ascending section, and sets them in the transposed index 70 (step S106).
  • the generation unit 152 associates the specified sentence vector with the offset of the column of tagged numerical values in the descending section, and sets them in the transposed index 70 (step S107).
  • FIG. 6 is a flowchart showing the processing procedure of the search phase of the information processing device.
  • the search unit 153 of the information processing device 100 receives a search query (step S201).
  • the search unit 153 calculates a query vector of the search query based on the dictionary information 140a (step S202).
  • the search unit 153 identifies the offset corresponding to the vector that maximizes the similarity between the query vector and each vector of the transposed index 70 (step S203).
  • the search unit 153 generates a search result based on the specified offset (step S204).
  • the search unit 153 outputs the search result (step S205).
  • the information processing apparatus 100 extracts the tagged numerical values corresponding to the sales tags, and sorts the extracted tagged numerical values in chronological order to generate the extraction data 60 .
  • The information processing apparatus 100 scans the tagged numerical values of the extracted data 60 in chronological order of year, identifies each tagged numerical value in the rising section, in which the attribute value rises, and each tagged numerical value in the falling section, in which the attribute value falls, and sets the specified sentence vector for each tagged numerical value in the rising section and in the falling section.
  • the tagged numerical values 51b-51e of the ascending interval described in FIG. 2 satisfy the conditions according to sentence 10A.
  • Tagged numbers 51e and 51f in the descending interval satisfy the conditions according to sentence 10B. This makes it possible to generate a vector suitable for each tagged number in the ascending interval and for each tagged number in the descending interval.
  • The information processing apparatus 100 associates the tagged numerical values 51b to 51e of the ascending interval with the generated vector and sets them in the transposed index 70, and likewise associates the tagged numerical values 51e and 51f of the descending interval with the generated vector and sets them in the transposed index 70.
  • the information processing apparatus 100 can search for a sentence corresponding to the search query by using the inverted index 70 .
  • processing of the information processing device 100 described above is an example, and the information processing device 100 may perform other processing.
  • Other processes (1) and (2) of the information processing apparatus 100 will be described below.
  • Documents stored in the XBRL document DB 50 described above may be set in a CSV (Comma Separated Value) format.
  • The CSV format does not include tags such as those described above. Therefore, the information processing apparatus 100 may use various conversion tables to convert CSV data into tagged data.
  • FIG. 7 is a diagram for explaining another process (1) of the information processing device.
  • the information processing apparatus 100 converts the CSV numerical data 80A into tagged numerical data 80B based on the conversion tables 81A and 81B and the tag vector dictionary 81C.
  • the conversion table 81A is a table that defines the relationship between XBRL tags and columns.
  • a column is information identifying each column of the CSV numerical data 80A.
  • the tag vector dictionary 81C associates XBRL tags, words, and tag vectors according to word vectors. For example, the tag vector dictionary 81C indicates that the XBRL tag ⁇ sales> corresponds to the word "sales" and the tag vector is "Vec1-1 . . . Vec1-n".
  • The tag vector is a pre-computed vector that is specific to the XBRL tag.
  • The information processing apparatus 100 uses various conversion tables to convert numerical data represented by the attributes of CSV columns and rows into numerical data with tags. As a result, the information processing apparatus 100 can also execute the tag-based processing described in FIGS. 1 and 2 on CSV data and associate appropriate vectors.
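A conversion from CSV to tagged numerical data might look roughly like the sketch below. The mapping `COLUMN_TO_TAG` is a hypothetical stand-in for conversion table 81A (column to XBRL tag). The text does not define conversion table 81B, so treating a `Year` column as the source of the year tag is purely an assumption, as are the sample data and the simplified output layout.

```python
import csv
import io

# Hypothetical stand-in for conversion table 81A: CSV column name -> tag name.
COLUMN_TO_TAG = {"Sales": "sales", "Profit": "profit"}

def csv_to_tagged(csv_text, column_to_tag, key_column="Year"):
    """Convert CSV cells to the <attr><year>value</year></attr> layout.

    ASSUMPTION: each row carries its year in `key_column`.
    """
    tagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        year = row[key_column]
        for column, tag in column_to_tag.items():
            if row.get(column):  # skip empty or missing cells
                tagged.append(f"<{tag}><{year}>{row[column]}</{year}></{tag}>")
    return tagged

csv_text = "Year,Sales,Profit\n2014,20,3\n2015,10,\n"
tagged = csv_to_tagged(csv_text, COLUMN_TO_TAG)
for item in tagged:
    print(item)
# <sales><2014>20</2014></sales>
# <profit><2014>3</2014></profit>
# <sales><2015>10</2015></sales>
```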
  • FIG. 8 is a diagram for explaining another process (2) of the information processing apparatus.
  • By repeatedly executing the processing described above, the information processing apparatus 100 registers in the storage unit 140 the plurality of tagged numerical values (for example, each tagged numerical value of the rising section T1) included in the extraction data 60, together with the vector of the designated sentence (for example, the vector of sentence 10A).
  • the information processing apparatus 100 calculates vectors of a plurality of tagged numerical values included in the extraction data 60 as vectors of the XBRL text, associates them with vectors of the designated text, and registers them in the teacher table 90 .
  • a vector of specified sentences is denoted as a sentence vector.
  • the information processing apparatus 100 calculates the vector of the XBRL document by multiplying the vectors of the multiple tagged numerical values included in the extracted data 60 .
  • For this calculation, the information processing apparatus 100 uses the dictionary information 140a described with reference to FIG. 4 and the vectors set in the tag vector dictionary 81C.
  • By repeatedly executing the above-described processing for the plurality of tagged numerical values included in the extracted data 60 and the designated sentence vector, the information processing apparatus 100 identifies the relationship between the sentence vector and the vector of the XBRL document and registers it in the teacher table 90.
  • the information processing apparatus 100 learns the learning model M1 by using the sentence vector of the teacher table 90 as input and the vector of the XBRL document as output (correct label).
  • the learning model M1 is a neural network, and the information processing apparatus 100 trains the learning model M1 using a backpropagation learning method or the like.
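As a rough illustration of training on the teacher table, the sketch below fits a single linear layer by stochastic gradient descent on squared error. This is a deliberately minimal stand-in: the text only says that the learning model M1 is a neural network trained with backpropagation or a similar method, so the architecture, the learning rate, the epoch count, and the two made-up teacher pairs are all assumptions.

```python
def predict(w, x):
    """Apply the linear layer: one output per row of w."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def train_linear(pairs, dim_in, dim_out, lr=0.1, epochs=500):
    """Fit w (dim_out x dim_in) so that predict(w, x) approximates y."""
    w = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(epochs):
        for x, y in pairs:
            err = [p - t for p, t in zip(predict(w, x), y)]
            # Gradient of 0.5 * squared error w.r.t. w[j][i] is err[j] * x[i].
            for j in range(dim_out):
                for i in range(dim_in):
                    w[j][i] -= lr * err[j] * x[i]
    return w

# Made-up teacher table: (sentence vector, XBRL document vector) pairs.
teacher_table = [
    ([1.0, 0.0], [0.5, 1.0]),
    ([0.0, 1.0], [1.0, -0.5]),
]
model = train_linear(teacher_table, dim_in=2, dim_out=2)
print([round(v, 3) for v in predict(model, [1.0, 0.0])])  # [0.5, 1.0]
```

At search time, the query vector would be passed through `predict` and the result compared against the vectors in the inverted index.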
  • The information processing apparatus 100 also generates an inverted index that associates the vector of the XBRL document with the offsets of the plurality of tagged numerical values included in the extraction data 60 that correspond to that vector.
  • the information processing apparatus 100 when receiving a search query, inputs the query vector of the search query to the trained learning model M1 and calculates the vector of the XBRL document.
  • The information processing apparatus 100 compares the calculated vector of the XBRL document with the inverted index, identifies the offset, and extracts a plurality of tagged numerical values corresponding to the search query from the position corresponding to the offset, thereby obtaining the search result.
  • FIG. 9 is a diagram illustrating an example of a hardware configuration of a computer that implements functions similar to those of the information processing apparatus of the embodiment.
  • The computer 200 has a CPU 201 that executes various arithmetic processes, an input device 202 that receives data input from the user, and a display 203.
  • The computer 200 also has a communication device 204 and an interface device 205 for exchanging data with an external device or the like via a wired or wireless network.
  • The computer 200 also has a RAM 206 that temporarily stores various information, and a hard disk device 207. The devices 201 to 207 are connected to a bus 208.
  • The hard disk device 207 has an identification program 207a, a generation program 207b, and a search program 207c. The CPU 201 reads the programs 207a to 207c and loads them into the RAM 206.
  • The identification program 207a functions as an identification process 206a.
  • The generation program 207b functions as a generation process 206b.
  • The search program 207c functions as a search process 206c.
  • The processing of the identification process 206a corresponds to the processing of the identification unit 151.
  • The processing of the generation process 206b corresponds to the processing of the generation unit 152.
  • The processing of the search process 206c corresponds to the processing of the search unit 153.
  • The programs 207a to 207c do not necessarily have to be stored in the hard disk device 207 from the beginning.
  • For example, each program may be stored in a "portable physical medium" such as a flexible disk (FD), CD-ROM, DVD, magneto-optical disk, or IC card that is inserted into the computer 200. The computer 200 may then read and execute the programs 207a to 207c.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This information processing device comprises a storage device which stores target data including a plurality of attribute value arrays obtained by associating a plurality of attributes with attribute values of numerical data corresponding to the attributes. When text data is received, the information processing device designates, from among the plurality of attribute value arrays included in the target data, an attribute value array that satisfies conditions in response to the text data. The information processing device associates a vector responding to the text data with the designated attribute value array.

Description

Generation method, generation program, and information processing device
 The present invention relates to a generation method and the like.
 In the field of document search technology, there is a technique in which a vector is assigned in advance to each document registered in a document DB (database) and, when a search query is received, a document whose vector corresponds to the vector of the search query is retrieved from the document DB.
 Documents containing tagged numerical values, such as XBRL (eXtensible Business Reporting Language) documents, may also be registered in the document DB, and it is desirable to assign vectors to such documents so that they can be searched as well. An XBRL document is, for example, a securities report.
 When a vector is assigned to a document, the vector of each word contained in the document is calculated using a conventional technique such as Word2Vec or Poincaré embedding, the word vectors are accumulated (summed) to calculate the vector of the document, and the calculated vector is assigned to the document.
 As a conventional technique, a technique has been disclosed in which a machine learning model is trained to estimate a word contained in a document and its adjacent words from a summed vector, obtained by summing the vectors of the words contained in the document, and an embedding vector of a hidden layer, and the model is used for machine translation.
Japanese Patent Application Laid-Open No. 2020-060970
 Unlike the words contained in ordinary documents, the tagged numerical values contained in the XBRL document described above are document information associated with attributes, attribute values, and the like that are specific to XBRL documents. Therefore, even if word vectors of the document are simply calculated as they are using Word2Vec, Poincaré embedding, or the like, the resulting vectors can hardly be said to be vectors with which an XBRL document can be searched effectively.
 In one aspect, an object of the present invention is to provide a generation method, a generation program, and an information processing apparatus capable of appropriately generating vectors to be associated with documents, data, and the like containing tagged numerical values.
 In a first aspect, a computer executes the following processing. The computer has a storage device storing target data including a plurality of attribute value arrays in which a plurality of attributes are associated with attribute values of numerical data corresponding to the attributes. When receiving text data, the computer identifies, from among the plurality of attribute value arrays included in the target data, an attribute value array that satisfies a condition corresponding to the text data. The computer associates a vector corresponding to the text data with the identified attribute value array.
 Vectors to be associated with documents, data, and the like containing tagged numerical values can be generated appropriately.
FIG. 1 is a diagram (1) for explaining the processing of the information processing apparatus according to the present embodiment.
FIG. 2 is a diagram (2) for explaining the processing of the information processing apparatus according to the present embodiment.
FIG. 3 is a diagram (3) for explaining the processing of the information processing apparatus according to the present embodiment.
FIG. 4 is a functional block diagram showing the configuration of the information processing apparatus according to the present embodiment.
FIG. 5 is a flowchart showing the processing procedure of the preparation phase of the information processing apparatus.
FIG. 6 is a flowchart showing the processing procedure of the search phase of the information processing apparatus.
FIG. 7 is a diagram for explaining other processing (1) of the information processing apparatus.
FIG. 8 is a diagram for explaining other processing (2) of the information processing apparatus.
FIG. 9 is a diagram illustrating an example of a hardware configuration of a computer that implements functions similar to those of the information processing apparatus of the embodiment.
 Embodiments of the generation method, the generation program, and the information processing apparatus disclosed in the present application will be described in detail below with reference to the drawings. The present invention is not limited by these embodiments.
 FIGS. 1 to 3 are diagrams for explaining the processing of the information processing apparatus according to the present embodiment. FIGS. 1 and 2 show the processing in the preparation phase of the information processing apparatus. First, FIG. 1 will be described. The information processing apparatus has an XBRL document DB 50, which is a DB that stores XBRL documents. An XBRL document is a document that includes a plurality of tagged numerical values, each of which associates a plurality of attributes with an attribute value corresponding to the attributes. For example, the XBRL document DB 50 stores an XBRL document 51 and the like.
 In the XBRL document 51, a word enclosed in an opening tag "<...>" or a closing tag "</...>" (an XBRL tag) is an attribute, and the numerical value sandwiched between the tags of the same attribute is the "attribute value". Focusing on the tagged numerical value 51a, "<sales><2014>20</2014></sales> hundred million yen", the tagged numerical value 51a includes the attribute "sales" and the attribute "2014" indicating the fiscal year, and the attribute value is "20" (in units of 100 million yen, that is, 2 billion yen). Although sales and fiscal year are shown as attributes in FIG. 1, the XBRL document may include other attributes.
 The format of the tagged numerical values in the XBRL document is not limited to that shown in FIG. 1 and may be another format. For example, "<sales><2014>20</2014></sales> hundred million yen" may instead be written as "<sales year=2014>20 hundred million yen</sales>"; the two tagged numerical values have the same meaning.
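 The two notations described above can be read with a small parser. The following is a minimal sketch, assuming the simplified tag notation used in this description (real XBRL instances use namespaced XML elements and contexts, which are not modeled here). The pattern names and `parse_tagged_value` are hypothetical, and the tag names are written in English as in the translated examples.

```python
import re

# Hypothetical patterns for the two notations in the text (not real XBRL syntax).
# Nested form:  <sales><2014>20</2014></sales>
NESTED = re.compile(
    r"<(?P<attr>[^<>/\s]+)><(?P<year>\d{4})>(?P<value>\d+)</(?P=year)></(?P=attr)>"
)
# Inline form:  <sales year=2014>20 ...</sales>
INLINE = re.compile(
    r"<(?P<attr>[^<>/\s]+)\s+year=(?P<year>\d{4})>(?P<value>\d+)[^<]*</(?P=attr)>"
)


def parse_tagged_value(text):
    """Return (attribute, year, attribute value) for either notation, or None."""
    for pattern in (NESTED, INLINE):
        match = pattern.search(text)
        if match:
            return match.group("attr"), int(match.group("year")), int(match.group("value"))
    return None


nested = parse_tagged_value("<sales><2014>20</2014></sales>")
inline = parse_tagged_value("<sales year=2014>20</sales>")
```

 Both calls yield the same triple, which mirrors the statement that the two tagged numerical values have the same meaning.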
 The information processing apparatus scans the XBRL document DB 50 and extracts the tagged numerical values enclosed by the sales tags "<sales>" and "</sales>". The information processing apparatus then sorts the extracted tagged numerical values in ascending (chronological) order by the year tag, which yields the extracted data 60. The extracted data 60 includes the tagged numerical values 51a, 51b, 51c, 51d, 51e, and 51f. Although the case of extracting tagged numerical values enclosed by the sales tags has been described here, tagged numerical values enclosed by another tag designated by an administrator or the like may be extracted instead.
 The tagged numerical value 51a of the extracted data 60 indicates the sales for fiscal year 2014, and its attribute value is "20" (hundred million yen). The tagged numerical value 51b indicates the sales for fiscal year 2015, and its attribute value is "10" (hundred million yen). The tagged numerical value 51c indicates the sales for fiscal year 2016, and its attribute value is "20" (hundred million yen). The tagged numerical value 51d indicates the sales for fiscal year 2017, and its attribute value is "30" (hundred million yen). The tagged numerical value 51e indicates the sales for fiscal year 2018, and its attribute value is "40" (hundred million yen). The tagged numerical value 51f indicates the sales for fiscal year 2019, and its attribute value is "30" (hundred million yen).
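 The extraction and sorting step can be sketched as follows. This is an illustration only, assuming the tagged numerical values have already been parsed into records; `TaggedValue`, `build_extracted_data`, and the extra "profit" record are hypothetical and do not appear in the publication.

```python
from typing import NamedTuple


class TaggedValue(NamedTuple):
    label: str      # reference sign used in the text, e.g. "51a"
    attribute: str  # e.g. "sales"
    year: int       # fiscal year attribute
    value: int      # attribute value in units of 100 million yen


def build_extracted_data(db, attribute):
    """Keep only records with the given attribute and sort them chronologically."""
    hits = [tv for tv in db if tv.attribute == attribute]
    return sorted(hits, key=lambda tv: tv.year)


# Unsorted stand-in for the XBRL document DB 50 (the "profit" record is made up).
xbrl_db = [
    TaggedValue("51d", "sales", 2017, 30),
    TaggedValue("51a", "sales", 2014, 20),
    TaggedValue("x1", "profit", 2014, 3),
    TaggedValue("51f", "sales", 2019, 30),
    TaggedValue("51b", "sales", 2015, 10),
    TaggedValue("51e", "sales", 2018, 40),
    TaggedValue("51c", "sales", 2016, 20),
]
extracted = build_extracted_data(xbrl_db, "sales")
```

 With this input, `extracted` lists 51a to 51f in fiscal-year order with the attribute values 20, 10, 20, 30, 40, 30 given above.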
 The description now moves to FIG. 2. The information processing apparatus scans the attribute values of the tagged numerical values 51a to 51f of the extracted data 60 in chronological order by fiscal year, and identifies the sequence of tagged numerical values in a rising section T1, in which the attribute value rises, and the sequence of tagged numerical values in a falling section T2, in which the attribute value falls. In the example shown in FIG. 2, the sequence of tagged numerical values in the rising section T1 consists of the tagged numerical values 51b to 51e, and the sequence of tagged numerical values in the falling section T2 consists of the tagged numerical values 51e and 51f.
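 One way to find such sections is to scan for maximal runs of strictly rising or strictly falling values. The sketch below is an assumption about how the scan could work, not the procedure of the publication; `monotone_runs` is a hypothetical helper. Note that it reports every maximal run, so for the example values it also reports the decline from 51a to 51b, which the text does not label as a section.

```python
def monotone_runs(values, rising=True):
    """Return inclusive (start, end) index pairs of maximal strictly monotone runs.

    Only runs covering at least two values are reported.
    """
    runs = []
    start = 0
    for i in range(1, len(values) + 1):
        continues = i < len(values) and (
            values[i] > values[i - 1] if rising else values[i] < values[i - 1]
        )
        if not continues:
            if i - 1 > start:  # run spans at least two values
                runs.append((start, i - 1))
            start = i
    return runs


# Attribute values of 51a..51f in chronological order.
values = [20, 10, 20, 30, 40, 30]
rising_sections = monotone_runs(values, rising=True)    # [(1, 4)] -> 51b..51e
falling_sections = monotone_runs(values, rising=False)  # [(0, 1), (4, 5)]
```

 Index pair (1, 4) corresponds to the rising section T1 (51b to 51e) and (4, 5) to the falling section T2 (51e, 51f).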
 Here, the information processing apparatus accepts the designation of a sentence 10A corresponding to the sequence of tagged numerical values in the rising section T1. In the example shown in FIG. 2, the sentence 10A is "sales go up". The information processing apparatus calculates the vector Vec10A of the sentence 10A. For example, the information processing apparatus divides the sentence 10A into a plurality of words by performing morphological analysis, assigns a vector to each word, and accumulates (sums) the word vectors to calculate the vector Vec10A of the sentence 10A. The vector of each word is defined in dictionary information.
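 The sentence-vector calculation can be illustrated as follows. This is a toy sketch under stated assumptions: whitespace splitting stands in for morphological analysis, the accumulation is taken to be an element-wise sum, and the dictionary entries are made-up three-dimensional vectors rather than trained Word2Vec or Poincaré embeddings. `sentence_vector` and `toy_dictionary` are hypothetical names.

```python
def sentence_vector(sentence, dictionary):
    """Sum the vectors of the words in the sentence; unknown words are skipped."""
    dim = len(next(iter(dictionary.values())))
    total = [0.0] * dim
    for word in sentence.split():  # stand-in for morphological analysis
        vector = dictionary.get(word)
        if vector is None:
            continue
        total = [t + v for t, v in zip(total, vector)]
    return total


# Made-up dictionary information (word -> vector); values are illustrative only.
toy_dictionary = {
    "sales": [1.0, 0.0, 0.0],
    "go": [0.0, 0.5, 0.5],
    "up": [0.0, 1.0, 0.0],
    "down": [0.0, 0.0, 1.0],
}
vec_10a = sentence_vector("sales go up", toy_dictionary)
vec_10b = sentence_vector("sales go down", toy_dictionary)
```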
 The information processing apparatus sets the vector Vec10A of the sentence 10A as the vector corresponding to the sequence of tagged numerical values 51b to 51e in the rising section T1. The information processing apparatus associates the vector Vec10A with the offset of the sequence of tagged numerical values 51b to 51e in the rising section T1 and sets them in the inverted index 70. For example, the offset of the sequence of tagged numerical values 51b to 51e in the rising section T1 includes the position, in the extracted data 60, of the first word of the tagged numerical value 51b and the position of the last word of the tagged numerical value 51e.
 The information processing apparatus also accepts the designation of a sentence 10B corresponding to the sequence of tagged numerical values in the falling section T2. In the example shown in FIG. 2, the sentence 10B is "sales go down". The information processing apparatus calculates the vector Vec10B of the sentence 10B. For example, the information processing apparatus divides the sentence 10B into a plurality of words by performing morphological analysis, assigns a vector to each word, and accumulates (sums) the word vectors to calculate the vector of the sentence 10B.
 The information processing apparatus sets the vector Vec10B of the sentence 10B as the vector corresponding to the sequence of tagged numerical values 51e and 51f in the falling section T2. The information processing apparatus associates the vector Vec10B with the offset of the sequence of tagged numerical values 51e and 51f in the falling section T2 and sets them in the inverted index 70. For example, the offset of the sequence of tagged numerical values 51e and 51f in the falling section T2 includes the position, in the extracted data 60, of the first word of the tagged numerical value 51e and the position of the last word of the tagged numerical value 51f.
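 The registration of vectors and offsets can be pictured with a simple list-based index. This sketch is an assumption about the data layout: the offset is simplified to the first and last positions of the tagged numerical values within the extracted data, whereas the text describes word positions. `register` and the vector values are hypothetical.

```python
def register(inverted_index, vector, first, last):
    """Associate a sentence vector with the (first, last) offset of a sequence."""
    inverted_index.append({"vector": tuple(vector), "offset": (first, last)})
    return inverted_index


inverted_index = []
# Illustrative vectors standing in for Vec10A and Vec10B.
register(inverted_index, [1.0, 1.5, 0.5], 1, 4)  # rising section T1: 51b..51e
register(inverted_index, [1.0, 0.5, 1.5], 4, 5)  # falling section T2: 51e, 51f
```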
 Next, the description moves to FIG. 3, which shows the processing in the search phase of the information processing apparatus. In FIG. 3, when a search query is received, the information processing apparatus uses the inverted index 70 generated by the processing of FIGS. 1 and 2 to search the extracted data 60 for the sequence of XBRL tagged numerical values corresponding to the search query.
 In the example shown in FIG. 3, the sentence of the search query 20 is "sales rise". The information processing apparatus calculates the vector of the sentence of the search query 20. That is, the information processing apparatus divides the sentence of the search query 20 into a plurality of words by performing morphological analysis, assigns a vector to each word, and accumulates (sums) the word vectors to calculate the query vector Vec20 of the search query 20.
 The information processing apparatus compares the query vector Vec20 with each vector registered in the inverted index 70 to calculate similarities, identifies the vector with the highest similarity in the inverted index 70, and identifies the offset corresponding to the identified vector. Sentence vectors have the property that the vectors of sentences with similar content are similar to each other. The similarity is, for example, cosine similarity.
 For example, the vector with the highest similarity to the query vector Vec20 of the search query "sales rise" is the vector Vec10A of the sentence 10A "sales go up" described with reference to FIG. 2. The information processing apparatus therefore identifies the sequence of tagged numerical values 51b to 51e in the rising section T1 based on the offset associated with the vector Vec10A in the inverted index 70. The information processing apparatus outputs, as a search result 80, the information obtained by extracting the sequence of tagged numerical values 51b to 51e in the rising section T1.
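 The search phase can be sketched as a nearest-vector lookup by cosine similarity. The code below is a minimal illustration, assuming a list-based index whose offsets are positions in the extracted data; the vectors are made-up stand-ins for Vec10A, Vec10B, and Vec20, and `cosine` and `search` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_vector, inverted_index, extracted):
    """Return the sequence whose registered vector is most similar to the query."""
    best = max(inverted_index, key=lambda entry: cosine(query_vector, entry["vector"]))
    first, last = best["offset"]
    return extracted[first:last + 1]


extracted = ["51a", "51b", "51c", "51d", "51e", "51f"]
inverted_index = [
    {"vector": (1.0, 0.9, 0.1), "offset": (1, 4)},  # stand-in for Vec10A
    {"vector": (1.0, 0.1, 0.9), "offset": (4, 5)},  # stand-in for Vec10B
]
query_vec_20 = (1.0, 0.8, 0.2)  # stand-in for Vec20
search_result = search(query_vec_20, inverted_index, extracted)
```

 A linear scan over the index is used here for clarity; how the similarity comparison is organized for a large index is not specified in the text.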
 As described above, the information processing apparatus according to the present embodiment extracts the tagged numerical values corresponding to the sales tag and sorts the extracted tagged numerical values in chronological order by fiscal year to generate the extracted data 60. The information processing apparatus scans the tagged numerical values of the extracted data 60 in chronological order, identifies the sequence of tagged numerical values in the rising section, in which the attribute value rises, and the sequence of tagged numerical values in the falling section, in which the attribute value falls, and sets the vector of the designated sentence for the sequence in the rising section and for the sequence in the falling section. For example, the sequence of tagged numerical values 51b to 51e in the rising section described with reference to FIG. 2 satisfies the condition corresponding to the sentence 10A, and the sequence of tagged numerical values 51e and 51f in the falling section satisfies the condition corresponding to the sentence 10B. This makes it possible to generate vectors suited to the sequence of tagged numerical values in the rising section and to the sequence in the falling section.
 The information processing apparatus associates the sequence of tagged numerical values 51b to 51e in the rising section with the generated vector and sets them in the inverted index 70, and likewise associates the sequence of tagged numerical values 51e and 51f in the falling section with the generated vector and sets them in the inverted index 70. When a search query is received, the information processing apparatus can use the inverted index 70 to search for the sequence of tagged numerical values corresponding to the search query.
 Next, a configuration example of the information processing apparatus according to the present embodiment will be described. FIG. 4 is a functional block diagram showing the configuration of the information processing apparatus according to the present embodiment. As shown in FIG. 4, the information processing apparatus 100 has a communication unit 110, an input unit 120, a display unit 130, a control unit 150, and a storage unit 140.
 The communication unit 110 performs data communication with an external device via a network. The communication unit 110 corresponds to a network card or the like.
 The input unit 120 is an input device that accepts operations from the user and is realized by, for example, a keyboard or a mouse. The user operates the input unit 120 to input a sentence and a condition corresponding to the sentence. For example, in the example described with reference to FIGS. 1 and 2, the user designates the sentence "sales go up" together with the condition for the sequence of tagged numerical values: extract the sequence of tagged numerical values with the attribute "sales tag", sort by the attribute "year" in chronological order, and select the "rising section". The user likewise designates the sentence "sales go down" together with the condition: extract the tagged numerical values with the attribute "sales tag", sort by the attribute "year" in chronological order, and select the "falling section". The user may also operate the input unit 120 to input a search query.
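 The pairing of a sentence with its condition could be represented as a small record. The following sketch is purely illustrative; the class names and field names are hypothetical and are not defined in the publication.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceCondition:
    extract_attribute: str  # attribute whose tagged values are extracted
    sort_attribute: str     # attribute used for chronological sorting
    section: str            # "rising" or "falling"


@dataclass(frozen=True)
class Designation:
    sentence: str
    condition: SequenceCondition


designations = [
    Designation("sales go up", SequenceCondition("sales", "year", "rising")),
    Designation("sales go down", SequenceCondition("sales", "year", "falling")),
]
```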
 The display unit 130 is a display device for outputting the processing results of the control unit 150. For example, the display unit 130 is realized by a liquid crystal monitor, a printer, or the like.
 The storage unit 140 is a storage device that stores various kinds of information and is realized by, for example, a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or a storage device such as a hard disk or an optical disk. For example, the storage unit 140 stores the XBRL document DB 50, the extracted data 60, the inverted index 70, and dictionary information 140a.
 The XBRL document DB 50 is a DB that stores XBRL documents. The description of the XBRL document DB 50 is the same as that given with reference to FIG. 1.
 The extracted data 60 is information on the sentences extracted from the XBRL document DB 50 by the control unit 150, which will be described later. The extracted data 60 corresponds to the extracted data 60 described with reference to FIG. 1.
 The inverted index 70 is information set by the control unit 150, which will be described later, and associates a sentence vector with the offsets of a plurality of sentences. For example, as described with reference to FIG. 2, the plurality of sentences correspond to the sentences in the rising section and the sentences in the falling section.
 The dictionary information 140a is information that holds words in association with their vectors. The word vectors are learned in advance using a conventional technique such as Word2Vec or Poincaré embedding. Vectors of words with similar meanings have the property of being similar to each other.
 The control unit 150 is realized by a processor such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit) executing various programs stored in a storage device inside the information processing apparatus 100, using a RAM or the like as a work area. The control unit 150 may also be realized by an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). For example, the control unit 150 has an identification unit 151, a generation unit 152, and a search unit 153.
 When the identification unit 151 receives a sentence and a condition for the sequence of tagged numerical values, it generates the extracted data 60 from the XBRL document DB 50 based on the condition. The processing of the identification unit 151 corresponds to the processing described with reference to FIG. 1.
 For example, when the condition for the sequence of tagged numerical values is "extract each tagged numerical value with the attribute 'sales tag', sort by the attribute 'year' in chronological order, rising section", the identification unit 151 executes the following processing. The identification unit 151 scans the XBRL document DB 50 and extracts the sequence of tagged numerical values enclosed by the sales tags "<sales>" and "</sales>". The identification unit 151 generates the extracted data 60 by sorting the extracted sequence of tagged numerical values in ascending (chronological) order by the year tag.
 The identification unit 151 generates the extracted data 60 in the same way when the condition for the sequence of tagged numerical values is "extract each tagged numerical value with the attribute 'sales tag', sort by the attribute 'year' in chronological order, falling section".
 The identification unit 151 registers the generated extracted data 60 in the storage unit 140 and outputs the sentence and the condition for the sequence of tagged numerical values to the generation unit 152.
 The generation unit 152 associates the vector of the designated sentence with those tagged numerical values, among the plurality of tagged numerical values of the extracted data 60, that lie in the section corresponding to the condition for the sequence of tagged numerical values. The generation unit 152 registers, in the inverted index 70, the relationship between the offset of the tagged numerical values in that section and the associated vector. The processing of the generation unit 152 corresponds to the processing described with reference to FIG. 2.
 For example, the generation unit 152 executes the following processing for the sentence "sales go up" (the sentence 10A in FIG. 2) and the condition "extract the tagged numerical values with the attribute 'sales tag', sort by the attribute 'year' in chronological order, rising section". The generation unit 152 scans the attribute values of the tagged numerical values 51a to 51f of the extracted data 60 in chronological order by fiscal year and identifies the rising section T1, in which the attribute value rises.
 The generation unit 152 divides the sentence 10A into a plurality of words by performing morphological analysis. The generation unit 152 assigns a vector to each word based on the dictionary information 140a and accumulates (sums) the word vectors to calculate the vector Vec10A of the sentence 10A.
 The generation unit 152 sets the vector Vec10A of the sentence 10A as the vector corresponding to the tagged numerical values 51b to 51e in the rising section T1. The generation unit 152 associates the vector Vec10A with the offset of the tagged numerical values 51b to 51e in the rising section T1 and sets them in the inverted index 70.
 The generation unit 152 executes the following processing for the sentence "sales go down" (the sentence 10B in FIG. 2) and the condition "extract the tagged numerical values with the attribute 'sales tag', sort by the attribute 'year' in chronological order, falling section". The generation unit 152 scans the attribute values of the tagged numerical values 51a to 51f of the extracted data 60 in chronological order by fiscal year and identifies the falling section T2, in which the attribute value falls.
 The generation unit 152 divides the sentence 10B into a plurality of words by performing morphological analysis. The generation unit 152 assigns a vector to each word based on the dictionary information 140a and accumulates (sums) the word vectors to calculate the vector Vec10B of the sentence 10B.
 The generation unit 152 sets the vector Vec10B of the sentence 10B as the vector corresponding to the tagged numerical values 51e and 51f in the falling section T2. The generation unit 152 associates the vector Vec10B with the offset of the tagged numerical values 51e and 51f in the falling section T2 and sets them in the inverted index 70.
Upon receiving a search query, the search unit 153 calculates a query vector for the search query and, based on the query vector and the inverted index, searches for the sentence corresponding to the search query. The processing of the search unit 153 corresponds to the processing described with reference to FIG. 3.
For example, upon receiving the search query 20 (for example, "sales rise"), the search unit 153 calculates the vector of the sentence of the search query 20. The search unit 153 performs morphological analysis on the sentence of the search query 20 to divide it into words. Based on the dictionary information 140a, the search unit 153 assigns a vector to each word and accumulates the word vectors to calculate the query vector Vec20 of the search query 20.
The search unit 153 compares the query vector Vec20 with each vector registered in the inverted index 70 to calculate a similarity, identifies the vector with the highest similarity in the inverted index 70, and identifies the offsets associated with that vector.
For example, the vector with the highest similarity to the query vector Vec20 of the search query "sales rise" is the vector Vec10A of sentence 10A, "sales go up", described with reference to FIG. 2. The search unit 153 therefore identifies the tagged numerical values 51b to 51e of the rising section T1 based on the offsets associated with the vector Vec10A in the inverted index 70. The search unit 153 generates, as the search result 80, information obtained by extracting the tagged numerical values 51b to 51e of the rising section T1. The search unit 153 may output the search result 80 to the display unit 130 for display or transmit it to an external device.
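The lookup described above can be sketched as a nearest-vector search. The document does not name the similarity measure, so cosine similarity is an assumption here, as are the index contents, the query vector, and the function names.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query: Vector, index: Dict[Vector, List[int]]) -> List[int]:
    """Return the offsets registered under the vector most similar to the query."""
    best = max(index, key=lambda v: cosine(query, v))
    return index[best]

# Hypothetical inverted index: sentence vector -> offsets of tagged values.
index = {
    (0.9, 0.7, 0.3): [120, 160, 200, 240],   # stands in for Vec10A, rising section T1
    (0.9, -0.5, 0.4): [240, 280],            # stands in for Vec10B, falling section T2
}
vec20 = (0.8, 0.6, 0.2)  # hypothetical query vector for "sales rise"
hits = search(vec20, index)
```

A linear scan is adequate for a handful of registered vectors; a larger index would likely need an approximate nearest-neighbour structure, which the document does not discuss.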
Next, an example of the processing procedure of the information processing apparatus 100 according to this embodiment is described. FIG. 5 is a flowchart showing the processing procedure of the preparation phase of the information processing apparatus. As shown in FIG. 5, the information processing apparatus 100 receives the specification of sentences and of a condition on the sequence of tagged numerical values (step S101). The identification unit 151 extracts each tagged numerical value corresponding to the sales tag from the XBRL document DB 50 (step S102). The identification unit 151 generates the extraction data by sorting the extracted tagged numerical values chronologically by year (step S103).
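Steps S102 and S103 can be sketched as a filter followed by a sort. The textual form of a tagged numerical value is assumed to follow the "<cost year=2020>40</cost>" pattern that appears in this document; the regular expression, the sample document, and the function name `extract_sorted` are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed textual form of a tagged numerical value: <tag year=YYYY>value</tag>.
TAGGED_VALUE = re.compile(
    r"<(?P<tag>\w+) year=(?P<year>\d{4})>(?P<value>-?\d+(?:\.\d+)?)</(?P=tag)>"
)

def extract_sorted(document: str, tag: str) -> List[Tuple[int, int, float]]:
    """Return (year, offset, value) for every value of `tag`, sorted by year."""
    rows = [
        (int(m.group("year")), m.start(), float(m.group("value")))
        for m in TAGGED_VALUE.finditer(document)
        if m.group("tag") == tag
    ]
    return sorted(rows)

# Hypothetical document fragment with values out of chronological order.
document = (
    "<sales year=2021>45</sales>"
    "<cost year=2020>40</cost>"
    "<sales year=2020>50</sales>"
)
extraction_data = extract_sorted(document, "sales")
```

Keeping the character offset next to each value means later steps can register positions in an index without re-parsing the document.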
The generation unit 152 scans the attribute value of each tagged numerical value in the extraction data 60 in chronological order of year and identifies the rising section and the falling section (step S104). The generation unit 152 calculates the vector of each specified sentence based on the dictionary information 140a (step S105).
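The scan in step S104 can be sketched as a single pass that records maximal runs of consecutive increases and decreases. The sample values are hypothetical, and treating the turning point as a member of both neighbouring sections is an assumption taken from the example in which the tagged numerical value 51e belongs to both T1 and T2.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # inclusive (start, end) indices into the sorted values

def find_sections(values: List[float]) -> Tuple[List[Run], List[Run]]:
    """Return (rising, falling) runs of consecutive strict increases/decreases."""
    rising: List[Run] = []
    falling: List[Run] = []
    start, direction = 0, 0  # direction: +1 rising, -1 falling, 0 flat/undecided
    for i in range(1, len(values)):
        step = (values[i] > values[i - 1]) - (values[i] < values[i - 1])
        if step != direction:
            # Close the run that just ended, if there was one.
            if direction == 1:
                rising.append((start, i - 1))
            elif direction == -1:
                falling.append((start, i - 1))
            start, direction = i - 1, step
    if direction == 1:
        rising.append((start, len(values) - 1))
    elif direction == -1:
        falling.append((start, len(values) - 1))
    return rising, falling

# Hypothetical sales values for the tagged numerical values 51a to 51f.
sales = [40, 40, 45, 52, 60, 55]
rising, falling = find_sections(sales)
```

With these values the rising run covers indices 1 to 4 and the falling run covers indices 4 to 5, which mirrors the sections T1 (51b to 51e) and T2 (51e, 51f).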
The generation unit 152 associates the vector of the specified sentence with the offsets of the sequence of tagged numerical values in the rising section and registers the pair in the inverted index 70 (step S106). The generation unit 152 associates the vector of the specified sentence with the offsets of the sequence of tagged numerical values in the falling section and registers the pair in the inverted index 70 (step S107).
FIG. 6 is a flowchart showing the processing procedure of the search phase of the information processing apparatus. As shown in FIG. 6, the search unit 153 of the information processing apparatus 100 receives a search query (step S201). The search unit 153 calculates the query vector of the search query based on the dictionary information 140a (step S202).
The search unit 153 identifies the offsets associated with the vector in the inverted index 70 that has the highest similarity to the query vector (step S203). The search unit 153 generates a search result based on the identified offsets (step S204). The search unit 153 outputs the search result (step S205).
Next, the effects of the information processing apparatus 100 according to this embodiment are described. The information processing apparatus 100 extracts the tagged numerical values corresponding to the sales tag and sorts them chronologically by year to generate the extraction data 60. The information processing apparatus 100 scans the tagged numerical values of the extraction data 60 in chronological order of year, identifies the tagged numerical values of the rising section, in which the attribute value rises, and those of the falling section, in which the attribute value falls, and sets the vector of the specified sentence for the tagged numerical values of each section. For example, the tagged numerical values 51b to 51e of the rising section described with reference to FIG. 2 satisfy the condition corresponding to sentence 10A, and the tagged numerical values 51e and 51f of the falling section satisfy the condition corresponding to sentence 10B. This makes it possible to generate a vector suited to each tagged numerical value of the rising section and of the falling section.
The information processing apparatus 100 associates the tagged numerical values 51b to 51e of the rising section with the generated vector and registers them in the inverted index 70, and likewise associates the tagged numerical values 51e and 51f of the falling section with the generated vector and registers them in the inverted index 70. When a search query is received, the information processing apparatus 100 can use the inverted index 70 to search for the sentence corresponding to the search query.
The processing of the information processing apparatus 100 described above is an example, and the information processing apparatus 100 may execute other processing. Other processing (1) and (2) of the information processing apparatus 100 is described below.
Other processing (1) of the information processing apparatus 100 is described. The documents stored in the XBRL document DB 50 described above may be provided in CSV (Comma Separated Values) format. The CSV format does not contain tags such as those described with reference to FIG. 1. The information processing apparatus 100 may therefore use conversion tables to convert CSV text data into tagged text data.
FIG. 7 is a diagram for explaining other processing (1) of the information processing apparatus. The information processing apparatus 100 converts the CSV numerical data 80A into tagged numerical data 80B based on the conversion tables 81A and 81B and the tag vector dictionary 81C.
The conversion table 81A defines the relationship between XBRL tags and columns. A column here is information that identifies a column of the CSV numerical data 80A.
The conversion table 81B defines the relationship between tag attributes and rows. For example, the conversion table 81B defines that the tag attribute of each numerical value set in a row of the CSV numerical data 80A is "year=".
The tag vector dictionary 81C associates XBRL tags, words, and tag vectors in the same manner as word vectors. For example, the tag vector dictionary 81C indicates that the XBRL tag <sales> corresponds to the word "sales" and that its tag vector is "Vec1-1 ... Vec1-n". A tag vector is a vector specific to an XBRL tag and is calculated in advance.
For example, consider the first row of the CSV numerical data 80A. Based on the conversion table 81A, the information processing apparatus 100 identifies that the XBRL tag of the value set in column β is "<cost>" and, based on the tag vector dictionary 81C, identifies that the word corresponding to "<cost>" is "cost". The information processing apparatus 100 converts "2020" set in the row into "year=2020" based on the conversion table 81B.
By arranging the above conversion results in a predetermined order, the information processing apparatus 100 converts the information in the first row of the CSV numerical data 80A into "<cost year=2020>40</cost>". It processes the second row of the CSV numerical data 80A in the same way and converts it into "<cost year=2021>42</cost>".
As described above, the information processing apparatus 100 uses conversion tables to convert numerical data expressed by CSV column and row attributes into tagged numerical data. The tag-based processing described with reference to FIGS. 1 and 2 can then also be executed on CSV numerical data, and appropriate vectors can be associated with it.
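The CSV conversion can be sketched with two small lookup structures standing in for the conversion tables 81A and 81B. The column names, the header row, the sales figures, and the function name `csv_to_tagged` are hypothetical; only the cost values 40 and 42 for 2020 and 2021 come from the example above.

```python
import csv
import io
from typing import Dict, List

# Hypothetical stand-ins for the conversion tables.
COLUMN_TO_TAG: Dict[str, str] = {"alpha": "sales", "beta": "cost"}  # 81A: column -> XBRL tag
ROW_ATTRIBUTE = "year"                                              # 81B: attribute of each row label

def csv_to_tagged(text: str) -> List[str]:
    """Convert CSV rows into tagged numerical values such as <cost year=2020>40</cost>."""
    tagged: List[str] = []
    for row in csv.DictReader(io.StringIO(text)):
        label = row[ROW_ATTRIBUTE]
        for column, tag in COLUMN_TO_TAG.items():
            tagged.append(f"<{tag} {ROW_ATTRIBUTE}={label}>{row[column]}</{tag}>")
    return tagged

# Assumed CSV layout: a header row, then one row per year.
csv_text = "year,alpha,beta\n2020,100,40\n2021,110,42\n"
result = csv_to_tagged(csv_text)
```

Once the data is in tagged form, the same extraction, sorting, and section detection used for XBRL documents can be applied without a separate CSV code path.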
Other processing (2) of the information processing apparatus 100 is described. FIG. 8 is a diagram for explaining other processing (2) of the information processing apparatus. The information processing apparatus 100 registers in the storage unit 140 the relationship between the tagged numerical values included in the extraction data 60 (for example, the tagged numerical values of the rising section T1), obtained by repeatedly executing the processing shown in FIG. 2, and the vector of the specified sentence (the vector of sentence 10A). The information processing apparatus 100 also calculates the vector of the tagged numerical values included in the extraction data 60 as the vector of an XBRL document, associates it with the vector of the specified sentence, and registers the pair in the teacher table 90. The vector of the specified sentence is referred to as the sentence vector.
For example, the information processing apparatus 100 calculates the vector of the XBRL document by accumulating the vectors of the tagged numerical values included in the extraction data 60. When calculating the vector of the XBRL document, the information processing apparatus 100 uses, as the word vectors and tag vectors, the vectors set in the dictionary information 140a described with reference to FIG. 4, in the tag vector dictionary 81C described with reference to FIG. 7, and the like.
By repeatedly executing the above processing for the tagged numerical values included in the extraction data 60 and the vector of the specified sentence, the information processing apparatus 100 identifies the relationship between sentence vectors and XBRL document vectors and registers it in the teacher table 90.
Subsequently, the information processing apparatus 100 trains the learning model M1 using the sentence vectors in the teacher table 90 as inputs and the XBRL document vectors as outputs (correct labels). The learning model M1 is a neural network, and the information processing apparatus 100 trains it using backpropagation or a similar method.
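The training step can be illustrated with the simplest possible stand-in for the learning model M1: a single linear layer fitted by gradient descent on squared error. The document only states that M1 is a neural network trained with backpropagation or the like, so the architecture, the learning rate, the epoch count, and the teacher table contents below are all assumptions.

```python
import random
from typing import List, Tuple

Pair = Tuple[List[float], List[float]]  # (sentence vector, XBRL document vector)

def predict(w: List[List[float]], x: List[float]) -> List[float]:
    """Apply the linear layer: one output per weight row."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def train_linear(pairs: List[Pair], epochs: int = 500, lr: float = 0.1) -> List[List[float]]:
    """Fit weights so that predict(w, sentence_vector) approximates the document vector."""
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, y in pairs:
            pred = predict(w, x)
            for i in range(n_out):
                err = pred[i] - y[i]  # gradient of 0.5 * squared error w.r.t. pred[i]
                for j in range(n_in):
                    w[i][j] -= lr * err * x[j]
    return w

# Hypothetical teacher table 90: sentence vector -> XBRL document vector.
teacher_table: List[Pair] = [
    ([1.0, 0.0], [0.5, 1.0, 0.0]),
    ([0.0, 1.0], [0.0, -1.0, 0.5]),
]
model = train_linear(teacher_table)
estimate = predict(model, [1.0, 0.0])
```

A real model would add hidden layers and a nonlinearity; the point here is only the input/output pairing taken from the teacher table.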
The information processing apparatus 100 also generates in advance an inverted index that associates each XBRL document vector with the offsets of the corresponding tagged numerical values included in the extraction data 60.
When a search query is received, the information processing apparatus 100 inputs the query vector of the search query into the trained learning model M1 to calculate an XBRL document vector. The information processing apparatus 100 compares the calculated XBRL document vector with the inverted index to identify offsets and extracts, from the positions corresponding to those offsets, the tagged numerical values corresponding to the search query, thereby obtaining the search result.
Next, an example of the hardware configuration of a computer that implements the same functions as the information processing apparatus 100 of the above embodiment is described. FIG. 9 is a diagram showing an example of the hardware configuration of a computer that implements the same functions as the information processing apparatus of the embodiment.
As shown in FIG. 9, the computer 200 has a CPU 201 that executes various arithmetic processes, an input device 202 that receives data input from a user, and a display 203. The computer 200 also has a communication device 204 that exchanges data with external devices and the like over a wired or wireless network, and an interface device 205. The computer 200 further has a RAM 206 that temporarily stores various information and a hard disk device 207. The devices 201 to 207 are connected to a bus 208.
The hard disk device 207 holds an identification program 207a, a generation program 207b, and a search program 207c. The CPU 201 reads the programs 207a to 207c and loads them into the RAM 206.
The identification program 207a functions as an identification process 206a. The generation program 207b functions as a generation process 206b. The search program 207c functions as a search process 206c.
The processing of the identification process 206a corresponds to the processing of the identification unit 151. The processing of the generation process 206b corresponds to the processing of the generation unit 152. The processing of the search process 206c corresponds to the processing of the search unit 153.
The programs 207a to 207c need not be stored in the hard disk device 207 from the outset. For example, each program may be stored on a "portable physical medium" inserted into the computer 200, such as a flexible disk (FD), CD-ROM, DVD, magneto-optical disk, or IC card, and the computer 200 may read and execute the programs 207a to 207c from that medium.
  50  XBRL document DB
  60  Extraction data
  70  Inverted index
 100  Information processing apparatus
 110  Communication unit
 120  Input unit
 130  Display unit
 140  Storage unit
 140a Dictionary information
 150  Control unit
 151  Identification unit
 152  Generation unit
 153  Search unit

Claims (15)

  1.  A generation method in which a computer, having a storage device that stores target data including a plurality of attribute value arrays each associating a plurality of attributes with attribute values that are numerical data corresponding to the attributes, executes a process comprising:
     identifying, when text data is received, an attribute value array that satisfies a condition corresponding to the text data from among the plurality of attribute value arrays included in the target data; and
     associating a vector corresponding to the text data with the identified attribute value array.
  2.  The generation method according to claim 1, wherein the plurality of attributes includes a year attribute, and the process further comprises sorting the identified plurality of attribute value arrays chronologically by year based on the year attribute set in each of them, and identifying the attribute value array to be associated with the text data based on the attribute values of the sorted attribute value arrays.
  3.  The generation method according to claim 2, wherein the identifying of the attribute value array to be associated with the text data includes scanning the attribute values of the plurality of attribute value arrays in chronological order, identifying a rising section of attribute value arrays in which the attribute value rises continuously and a falling section of attribute value arrays in which the attribute value falls continuously, associating a vector corresponding to text data related to the rising section with the attribute value arrays of the rising section, and associating a vector corresponding to text data related to the falling section with the attribute value arrays of the falling section.
  4.  The generation method according to claim 1, wherein the process further comprises generating an inverted index that associates the position of the attribute value array with the vector corresponding to the attribute value array, and searching, when a search query is received, for the attribute value array corresponding to the search query based on a vector of the search query and the inverted index.
  5.  The generation method according to claim 1, wherein the process further comprises converting, when tabular data in which an attribute is set for each of a plurality of columns is acquired, the tabular data into the target data by converting the attribute of each column of the tabular data into a tag format.
  6.  A generation program that causes a computer, having a storage device that stores target data including a plurality of attribute value arrays each associating a plurality of attributes with attribute values that are numerical data corresponding to the attributes, to execute a process comprising:
     identifying, when text data is received, an attribute value array that satisfies a condition corresponding to the text data from among the plurality of attribute value arrays included in the target data; and
     associating a vector corresponding to the text data with the identified attribute value array.
  7.  The generation program according to claim 6, wherein the plurality of attributes includes a year attribute, and the process further comprises sorting the identified plurality of attribute value arrays chronologically by year based on the year attribute set in each of them, and identifying the attribute value array to be associated with the text data based on the attribute values of the sorted attribute value arrays.
  8.  The generation program according to claim 7, wherein the identifying of the attribute value array to be associated with the text data includes scanning the attribute values of the plurality of attribute value arrays in chronological order, identifying a rising section of attribute value arrays in which the attribute value rises continuously and a falling section of attribute value arrays in which the attribute value falls continuously, associating a vector corresponding to text data related to the rising section with the attribute value arrays of the rising section, and associating a vector corresponding to text data related to the falling section with the attribute value arrays of the falling section.
  9.  The generation program according to claim 6, wherein the process further comprises generating an inverted index that associates the position of the attribute value array with the vector corresponding to the attribute value array, and searching, when a search query is received, for the attribute value array corresponding to the search query based on a vector of the search query and the inverted index.
  10.  The generation program according to claim 6, wherein the process further comprises converting, when tabular data in which an attribute is set for each of a plurality of columns is acquired, the tabular data into the target data by converting the attribute of each column of the tabular data into a tag format.
  11.  An information processing apparatus comprising: a storage device that stores target data including a plurality of attribute value arrays each associating a plurality of attributes with attribute values that are numerical data corresponding to the attributes; and a control unit that executes a process comprising:
     identifying, when text data is received, an attribute value array that satisfies a condition corresponding to the text data from among the plurality of attribute value arrays included in the target data; and
     associating a vector corresponding to the text data with the identified attribute value array.
  12.  The information processing apparatus according to claim 11, wherein the plurality of attributes includes a year attribute, and the control unit further executes a process of sorting the identified plurality of attribute value arrays chronologically by year based on the year attribute set in each of them, and identifying the attribute value array to be associated with the text data based on the attribute values of the sorted attribute value arrays.
  13.  The information processing apparatus according to claim 12, wherein the control unit scans the attribute values of the plurality of attribute value arrays in chronological order, identifies a rising section of attribute value arrays in which the attribute value rises continuously and a falling section of attribute value arrays in which the attribute value falls continuously, associates a vector corresponding to text data related to the rising section with the attribute value arrays of the rising section, and associates a vector corresponding to text data related to the falling section with the attribute value arrays of the falling section.
  14.  The information processing apparatus according to claim 11, wherein the control unit further executes a process of generating an inverted index that associates the position of the attribute value array with the vector corresponding to the attribute value array, and searching, when a search query is received, for the attribute value array corresponding to the search query based on a vector of the search query and the inverted index.
  15.  The information processing apparatus according to claim 11, wherein the control unit further executes a process of converting, when tabular data in which an attribute is set for each of a plurality of columns is acquired, the tabular data into the target data by converting the attribute of each column of the tabular data into a tag format.
PCT/JP2022/008433 2022-02-28 2022-02-28 Generation method, generation program, and information processing device WO2023162273A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/008433 WO2023162273A1 (en) 2022-02-28 2022-02-28 Generation method, generation program, and information processing device

Publications (1)

Publication Number Publication Date
WO2023162273A1 true WO2023162273A1 (en) 2023-08-31

Family

ID=87765263

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/008433 WO2023162273A1 (en) 2022-02-28 2022-02-28 Generation method, generation program, and information processing device

Country Status (1)

Country Link
WO (1) WO2023162273A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030135495A1 (en) * 2001-06-21 2003-07-17 Isc, Inc. Database indexing method and apparatus
JP2006040058A (en) * 2004-07-28 2006-02-09 Mitsubishi Electric Corp Document classification device
JP2006331089A (en) * 2005-05-26 2006-12-07 Toshiba Corp Method and device for generating time series data from webpage
US20160078079A1 (en) * 2014-09-17 2016-03-17 Futurewei Technologies, Inc. Statement based migration for adaptively building and updating a column store database from a row store database based on query demands using disparate database systems
JP2016035684A (en) * 2014-08-04 2016-03-17 日本電信電話株式会社 Information management system, information management method, and information management program
US11061934B1 (en) * 2018-04-06 2021-07-13 Intuit Inc. Method and system for characterizing time series

Wilkinson et al. Neural word search in historical manuscript collections
CN113128231A (en) Data quality inspection method and device, storage medium and electronic equipment
JP2011150603A (en) Category theme phrase extracting device, hierarchical tag attaching device, method, and program, and computer-readable recording medium
EP1681643B1 (en) Method and system for information extraction
JP2019168758A (en) Data processing device, data processing method and data processing program
EP1072986A2 (en) System and method for extracting data from semi-structured text

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 22928795
Country of ref document: EP
Kind code of ref document: A1