JP6910061B2

JP6910061B2 - Text generation device, text generation method, and text generation program

Info

Publication number: JP6910061B2
Application number: JP2017168673A
Authority: JP
Inventors: 聡一朗村上; 亮彦渡邉; 祐介宮尾; 彬宮澤; 圭一五島; 大也高村
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2017-09-01
Filing date: 2017-09-01
Publication date: 2021-07-28
Anticipated expiration: 2037-09-01
Also published as: JP2019046158A

Description

The present invention relates to a text generation device, a text generation method, and a text generation program.

In recent years, language models using neural networks, such as recurrent neural networks, have been studied in the field of natural language processing.

For example, Patent Document 1 below describes a dialogue text summarization device having: a recognition result acquisition unit that acquires, from a first database, words recognized from dialogue-form text, time-series information on the words, and identification information identifying the speaker of each word; and a text summarization unit that corrects the words based on the words, their time-series information, the identification information, and a summarization model, and outputs the correction result to the first database.

Patent Document 1: JP-A-2017-111190

A language model using a neural network is sometimes trained on a large amount of text data so that it generates text based on the statistical characteristics of the words appearing in the training data.

However, when generating text that explains fluctuations in time-series numerical data, that is, data containing a sequence of numerical values that changes over time (for example, stock prices), the training data may quote the time-series numerical data or describe amounts of change (for a stock price, for example, the opening price, the closing price, and the size of a gain). Because the numerical values tied to such descriptions vary widely, each individual value becomes a word that appears only rarely in statistical terms. It has therefore been difficult to train a language model to reproduce numerical descriptions correctly, and difficult to generate text that explains fluctuations in time-series numerical data.

Accordingly, the present invention provides a text generation device, a text generation method, and a text generation program that generate text explaining fluctuations in time-series numerical data.

A text generation device according to one aspect of the present invention generates text data explaining fluctuations in time-series numerical data that includes a number sequence associated with the passage of time. The device comprises: a learning unit that, using as training data the time-series numerical data and replacement text data in which numerical values in the text data that relate to the time-series numerical data have been replaced with predetermined character strings according to a predetermined rule, trains a language model to output replacement text data when time-series numerical data is input; a generation unit that inputs new time-series numerical data to the language model trained by the learning unit and generates, from the output of the language model, new replacement text data explaining the new time-series numerical data; and a replacement unit that replaces the predetermined character strings contained in the new replacement text data with numerical values relating to the new time-series numerical data according to the predetermined rule.

According to this aspect, new replacement text data explaining the new time-series numerical data is generated from the output of the language model, and the predetermined character strings it contains are replaced with numerical values relating to the new time-series numerical data according to the predetermined rule. The language model therefore never needs to output those numerical values directly, and text explaining fluctuations in time-series numerical data can be generated with correct numerical descriptions even when the values vary widely.
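The round trip described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the rule table `RULES`, the meanings assigned to the two tags, the function names, and the sample sentence are all assumptions made for illustration, and the trained language model that would produce the template from new data is left out.

```python
import re

# Hypothetical rule table: each tag maps to a function that derives one
# number from a price series. The meanings of the tags are assumptions.
RULES = {
    "<price1>": lambda xs: xs[-1],          # latest value in the window
    "<price2>": lambda xs: xs[-1] - xs[0],  # change over the window
}


def to_replacement_text(text, series, rules=RULES):
    """Training side: replace each number that a rule can derive with its tag."""
    for tag, derive in rules.items():
        number = str(derive(series))
        # Match the number only when it is not part of a longer digit run.
        text = re.sub(r"(?<!\d)" + re.escape(number) + r"(?!\d)", tag, text)
    return text


def fill_tags(replacement_text, series, rules=RULES):
    """Generation side: replace each tag with the number derived from new data."""
    for tag, derive in rules.items():
        replacement_text = replacement_text.replace(tag, str(derive(series)))
    return replacement_text


train_series = [19400, 19450, 19520]
train_text = "The index closed at 19520 yen, up 120 yen from the open."
template = to_replacement_text(train_text, train_series)
print(template)

new_series = [20100, 20050, 20310]
print(fill_tags(template, new_series))
```

In this sketch the model would only ever see and emit the tags, so a price it has never encountered costs it nothing: the number is computed from the new series after generation.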

In the above aspect, the numerical values relating to the time-series numerical data may include at least one of: a value in the number sequence associated with a predetermined time point; a difference between values in the number sequence associated with different time points; the value associated with a predetermined time point rounded down at a predetermined digit; the difference between values associated with different time points rounded down at a predetermined digit; the value associated with a predetermined time point rounded up at a predetermined digit; and the difference between values associated with different time points rounded up at a predetermined digit. The predetermined rule may be a rule that associates each type of numerical value relating to the time-series numerical data with a predetermined character string.

According to this aspect, associating the types of numerical values relating to the time-series numerical data with predetermined character strings makes it possible to generate text explaining fluctuations in the time-series numerical data that includes values quoted from the data and values obtained by calculation on the data.

In the above aspect, the time-series numerical data may include first time-series numerical data containing a number sequence associated with the passage of time at a first interval, and second time-series numerical data containing a number sequence associated with the passage of time at a second interval longer than the first interval.

According to this aspect, using time-series numerical data whose number sequences are associated with the passage of time at different intervals makes it possible to generate text explaining fluctuations in the time-series numerical data that correctly includes words depending on the history of the data.

In the above aspect, the generation unit may input to the language model one or more sets of numerical data obtained by converting the time-series numerical data by one or more methods, the sets corresponding one-to-one to the methods.

According to this aspect, inputting to the language model the one or more sets of numerical data obtained by converting the time-series numerical data by one or more methods prevents the generated text from varying with the absolute values of the time-series numerical data.

In the above aspect, the one or more sets of numerical data may include at least one of: numerical data in which the number sequence in the time-series numerical data is normalized to a predetermined numerical range; numerical data in which the time-series numerical data is standardized using the mean and standard deviation of the number sequence; and numerical data in which the number sequence is made relative to a reference value, the reference value being the value in the number sequence associated with a predetermined time point.

According to this aspect, using the normalized or standardized numerical data prevents the generated text from varying with the absolute values of the time-series numerical data, and using the relativized numerical data makes it possible to generate text explaining fluctuations in the time-series numerical data that correctly includes words depending on the history of the data.
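The three conversions can be sketched in Python as follows. The text does not give formulas, so several choices here are assumptions: min-max scaling for normalization, the population standard deviation for standardization, and subtraction of the reference value (rather than, say, division by it) for relativization. The function names are likewise made up for illustration.

```python
from statistics import mean, pstdev


def normalize(xs, lo=0.0, hi=1.0):
    """Min-max scale the sequence into [lo, hi] (assumed form of normalization)."""
    x_min, x_max = min(xs), max(xs)
    span = x_max - x_min
    if span == 0:
        return [lo for _ in xs]  # a flat sequence carries no shape information
    return [lo + (x - x_min) * (hi - lo) / span for x in xs]


def standardize(xs):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    if sigma == 0:
        return [0.0 for _ in xs]
    return [(x - mu) / sigma for x in xs]


def relativize(xs, ref_index=0):
    """Express each value relative to the value at a reference time point."""
    ref = xs[ref_index]
    return [x - ref for x in xs]


cheap = [10, 20, 30]
expensive = [100, 200, 300]
print(normalize(cheap), normalize(expensive))
print(standardize(cheap))
print(relativize(cheap))
```

Two series with the same shape but different price levels normalize to the same input, which is the sense in which the generated text stops depending on absolute values.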

In the above aspect, the language model may include: a first encoder that receives one or more sets of first numerical data, obtained by converting first time-series numerical data, which contains a number sequence associated with the passage of time at a first interval, by one or more methods and corresponding one-to-one to those methods; a second encoder that receives one or more sets of second numerical data, obtained by converting second time-series numerical data, which contains a number sequence associated with the passage of time at a second interval longer than the first interval, by one or more methods and corresponding one-to-one to those methods; a synthesis unit that combines the output of the first encoder and the output of the second encoder; and a decoder that receives the data combined by the synthesis unit and outputs the replacement text data.

According to this aspect, time-series numerical data whose number sequences are associated with the passage of time at different intervals is input to separate encoders, and the encoder outputs are combined and input to the decoder. This makes it possible to generate text explaining fluctuations in the time-series numerical data that correctly includes words depending on the history of the data.
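The data flow of the two encoders and the synthesis unit can be sketched in plain Python. This only shows how the pieces connect: the single-layer encoders with random, untrained weights, the hidden size of 8, the input lengths, and the use of concatenation as the combining operation are all assumptions, and the decoder is omitted.

```python
import math
import random


def make_encoder(in_dim, out_dim, seed):
    """Build a one-layer tanh encoder with fixed random weights (untrained)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def encode(x):
        if len(x) != in_dim:
            raise ValueError(f"expected {in_dim} values, got {len(x)}")
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return encode


def combine(h_short, h_long):
    """Synthesis step, assumed here to be simple concatenation."""
    return h_short + h_long


# Illustrative input lengths: a short-interval and a long-interval series.
SHORT_LEN, LONG_LEN, HIDDEN = 62, 7, 8
encode_short = make_encoder(SHORT_LEN, HIDDEN, seed=0)  # first encoder
encode_long = make_encoder(LONG_LEN, HIDDEN, seed=1)    # second encoder

x_short = [i / (SHORT_LEN - 1) for i in range(SHORT_LEN)]
x_long = [j / (LONG_LEN - 1) for j in range(LONG_LEN)]

decoder_input = combine(encode_short(x_short), encode_long(x_long))
print(len(decoder_input))
```

A real system would train both encoders jointly with the decoder; the point of the sketch is only that each time scale gets its own encoder and the decoder sees one combined vector.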

In the above aspect, the synthesis unit may combine the output of the first encoder, the output of the second encoder, the one or more sets of first numerical data, and the one or more sets of second numerical data.

According to this aspect, inputting to the decoder not only the encoder outputs but also the sets of numerical data makes it possible to generate text explaining fluctuations in the time-series numerical data that correctly includes words depending on the history of the data.

In the above aspect, the decoder may receive the data combined by the synthesis unit and data relating to the time series of the time-series numerical data.

According to this aspect, inputting to the decoder not only the data combined by the synthesis unit but also data relating to the time series of the time-series numerical data makes it possible to generate text explaining fluctuations in the time-series numerical data that correctly includes words depending on the history of the data.

A text generation method according to another aspect of the present invention is a method in which a computer having a hardware processor and a memory generates text data explaining fluctuations in time-series numerical data that includes a number sequence associated with the passage of time. The computer executes: training a language model, using as training data the time-series numerical data and replacement text data in which numerical values in the text data that relate to the time-series numerical data have been replaced with predetermined character strings according to a predetermined rule, so that the language model outputs replacement text data when time-series numerical data is input; inputting new time-series numerical data to the trained language model and generating, from the output of the language model, new replacement text data explaining the new time-series numerical data; and replacing the predetermined character strings contained in the new replacement text data with numerical values relating to the new time-series numerical data according to the predetermined rule.

According to this aspect, new replacement text data explaining the new time-series numerical data is generated from the output of the language model, and the predetermined character strings it contains are replaced with numerical values relating to the new time-series numerical data according to the predetermined rule. The language model therefore never needs to output those numerical values directly, and text explaining fluctuations in time-series numerical data can be generated with correct numerical descriptions even when the values vary widely.

A text generation program according to another aspect of the present invention causes a computer provided in a text generation device, which generates text data explaining fluctuations in time-series numerical data that includes a number sequence associated with the passage of time, to function as: a learning unit that, using as training data the time-series numerical data and replacement text data in which numerical values in the text data that relate to the time-series numerical data have been replaced with predetermined character strings according to a predetermined rule, trains a language model to output replacement text data when time-series numerical data is input; a generation unit that inputs new time-series numerical data to the language model trained by the learning unit and generates, from the output of the language model, new replacement text data explaining the new time-series numerical data; and a replacement unit that replaces the predetermined character strings contained in the new replacement text data with numerical values relating to the new time-series numerical data according to the predetermined rule.

According to the present invention, it is possible to provide a text generation device, a text generation method, and a text generation program that generate text explaining fluctuations in time-series numerical data.

FIG. 1 is a diagram showing the network configuration of a text generation device according to an embodiment of the present invention.
FIG. 2 is a diagram showing the physical configuration of the text generation device according to the present embodiment.
FIG. 3 is a diagram showing the functional blocks of the text generation device according to the present embodiment.
FIG. 4 is a diagram showing the configuration of the language model.
FIG. 5 is a diagram showing a rule that associates types of numerical values relating to time-series numerical data with predetermined character strings.
FIG. 6 is a flowchart of processing executed by the text generation device according to the present embodiment.
FIG. 7 is a diagram showing text generated by the text generation device according to the present embodiment.
FIG. 8 is a diagram showing index values that evaluate how close the text generated by the text generation device according to the present embodiment is to reference text.

Embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, elements with the same reference numerals have the same or similar configurations.

FIG. 1 is a diagram showing the network configuration of the text generation device 10 according to an embodiment of the present invention. In the present embodiment, a text generation system 100 includes: a database DB that stores an initial data set containing time-series numerical data, which includes a number sequence associated with the passage of time, and text data explaining fluctuations in that time-series numerical data; a language model 20 that outputs replacement text data in response to input time-series numerical data; and the text generation device 10, which uses the initial data set stored in the database DB to train the language model 20 so that it generates text correctly explaining fluctuations in time-series numerical data, and which, when new time-series numerical data is acquired, generates text data explaining its fluctuations. In the present embodiment, the time-series numerical data is stock prices. However, the time-series numerical data may be any continuously acquired numerical data that includes a number sequence associated with the passage of time. For example, it may be vital-sign data such as electrocardiogram or blood pressure data, weather data such as temperature or humidity, or traffic data such as traffic volume or passenger counts.

The text generation system 100 is connected to a communication network N, acquires stock prices from a stock price distribution server 40 at predetermined time intervals, and stores them in the database DB or inputs them to the text generation device 10. The text generation system 100 also provides the generated text data to a user terminal 30 via the communication network N. Based on instructions from the user terminal 30, the text generation system 100 may add to or edit the initial data set stored in the database DB, or train the language model 20. The communication network N is a wired or wireless communication network and may be, for example, the Internet or a LAN (Local Area Network). All or some of the components of the text generation system 100 may be implemented on remote computers in the form of so-called cloud computing, or all or some may be implemented on local computers.

The language model 20 is a model that, when time-series numerical data is input, outputs replacement text data, that is, text data explaining the time-series numerical data in which the numerical values relating to the data have been replaced with predetermined character strings according to a predetermined rule. A numerical value relating to the time-series numerical data may be a value quoted from the data or a value calculated from values in the data. The language model 20 may be, for example, a model using a neural network, and may be a so-called encoder-decoder model. As the encoder, the language model 20 may include, for example, an MLP (Multi-Layer Perceptron), a CNN (Convolutional Neural Network), or an RNN (Recurrent Neural Network); as the decoder, it may include an RNN. A different language model 20 may be used for each type of input time-series numerical data. The language model 20 is described in detail later with reference to FIG. 4.

FIG. 2 is a diagram showing the physical configuration of the text generation device 10 according to the present embodiment. The text generation device 10 has a CPU (Central Processing Unit) 10a corresponding to a hardware processor, a RAM (Random Access Memory) 10b corresponding to memory, a ROM (Read Only Memory) 10c corresponding to memory, a communication unit 10d, an input unit 10e, and a display unit 10f. These components are connected via a bus so that they can exchange data with one another.

The CPU 10a is a control unit that controls the execution of programs stored in the RAM 10b or the ROM 10c and performs computation and processing of data. The CPU 10a is an arithmetic unit that executes a program (the text generation program) that generates text data using the language model 20. The CPU 10a receives various input data from the input unit 10e and the communication unit 10d, and displays the results of computation on the input data on the display unit 10f or stores them in the RAM 10b or the ROM 10c.

The RAM 10b is a rewritable storage unit and is composed of, for example, semiconductor storage elements. The RAM 10b stores programs, such as applications executed by the CPU 10a, and data.

The ROM 10c is a read-only storage unit and is composed of, for example, semiconductor storage elements. The ROM 10c stores programs, such as firmware, and data.

The communication unit 10d is a communication interface that connects the text generation device 10 to the communication network N.

The input unit 10e accepts data input from a user and is composed of, for example, a keyboard, a mouse, or a touch panel.

The display unit 10f visually displays the results of computation by the CPU 10a and is composed of, for example, an LCD (Liquid Crystal Display).

The text generation program may be provided stored on a computer-readable storage medium such as the RAM 10b or the ROM 10c, or may be provided via the communication network N connected through the communication unit 10d. In the text generation device 10, the CPU 10a executes the text generation program to realize the various functions described with reference to the next figure. These physical components are examples and need not be independent of one another. For example, the text generation device 10 may include an LSI (Large-Scale Integration) in which the CPU 10a is integrated with the RAM 10b or the ROM 10c. The text generation device 10 may also include arithmetic circuits such as a GPU (Graphics Processing Unit), an FPGA (Field-Programmable Gate Array), or an ASIC (Application Specific Integrated Circuit).

FIG. 3 is a diagram showing the functional blocks of the text generation device 10 according to the present embodiment. The text generation device 10 includes a learning unit 11, an acquisition unit 12, a generation unit 13, a replacement unit 14, and a rule storage unit 15.

The learning unit 11 trains the language model 20 to output replacement text data when time-series numerical data is input. As training data it uses time-series numerical data D2 and replacement text data D1, in which the numerical values in text data explaining fluctuations in the time-series numerical data that relate to that data have been replaced with predetermined character strings according to a predetermined rule. The replacement text data D1 and the time-series numerical data D2 that the learning unit 11 uses to train the language model 20 may be those stored in the database DB as the initial data set.

Here, the numerical values relating to the time-series numerical data may include at least one of: a value in the number sequence associated with a predetermined time point; a difference between values in the number sequence associated with different time points; the value associated with a predetermined time point rounded down at a predetermined digit; the difference between values associated with different time points rounded down at a predetermined digit; the value associated with a predetermined time point rounded up at a predetermined digit; and the difference between values associated with different time points rounded up at a predetermined digit. Rounding down or up at a predetermined digit may be performed at any place, such as the tens, hundreds, thousands, or ten-thousands place. The predetermined rule may be a rule that associates each type of numerical value relating to the time-series numerical data with a predetermined character string. The predetermined character string may be any string that can be distinguished from ordinary text data, for example a string whose beginning and end are marked by predetermined symbols (here, "<" and ">"), such as <price1> or <price2>.
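The six kinds of value listed above can be computed with a few lines of Python. This is a sketch under stated assumptions: the dictionary keys are made-up labels (the actual pairing of value types with strings such as <price1> is defined by the rule, which is not reproduced here), the difference is taken as a magnitude, and the rounding helpers assume non-negative integers.

```python
def round_down(value, place):
    """Round a non-negative integer down to a multiple of `place`."""
    return (value // place) * place


def round_up(value, place):
    """Round a non-negative integer up to a multiple of `place`."""
    return -(-value // place) * place


def candidate_values(series, t_now, t_ref, place):
    """Return the six kinds of value named in the text, keyed by made-up labels."""
    value = series[t_now]
    # Assumption: the difference is reported as a magnitude; the direction
    # ("up" or "down") would be expressed by the surrounding words.
    diff = abs(series[t_now] - series[t_ref])
    return {
        "value": value,
        "diff": diff,
        "value_rounded_down": round_down(value, place),
        "diff_rounded_down": round_down(diff, place),
        "value_rounded_up": round_up(value, place),
        "diff_rounded_up": round_up(diff, place),
    }


# place=100 rounds at the hundreds place.
values = candidate_values([19432, 19587], t_now=1, t_ref=0, place=100)
for kind, number in values.items():
    print(kind, number)
```

Rounded variants matter because market commentary often says things like "recovered the 19,500 level" rather than quoting the exact price.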

The time-series numerical data may include first time-series numerical data containing a sequence associated with the passage of time at a first interval, and second time-series numerical data containing a sequence associated with the passage of time at a second interval longer than the first interval. In the present embodiment, the first time-series numerical data X_short is time-series numerical data on stock prices acquired at 5-minute intervals from the opening to the close of one business day, and the second time-series numerical data X_long is time-series numerical data on closing stock prices acquired at one-business-day intervals over 7 business days. That is, X_short contains a sequence associated with the passage of time at 5-minute intervals, and X_long contains a sequence associated with the passage of time at one-business-day intervals. On the Tokyo Stock Exchange in Japan, the trading session time in one business day is 5 hours (300 minutes), and the first time-series numerical data acquired at 5-minute intervals contains 62 data points, denoted X_{short,i} (i = 1 to 62). The second time-series numerical data X_long contains 7 data points, denoted X_{long,j} (j = 1 to 7).
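The count of 62 points can be checked with a short sketch. The description gives only the 5-hour session time and the resulting count; the two-session schedule used below (09:00 to 11:30 and 12:30 to 15:00, with both endpoints of each session sampled) is an assumption made for illustration, as are the function name and the date.

```python
from datetime import datetime, timedelta

def five_minute_ticks(start, end, step_minutes=5):
    """Return every timestamp from start to end inclusive, step_minutes apart."""
    ticks = []
    t = start
    while t <= end:
        ticks.append(t)
        t += timedelta(minutes=step_minutes)
    return ticks

# Arbitrary business day, for illustration only.
day = datetime(2017, 9, 1)

# Assumed session boundaries (not stated in the description).
morning = five_minute_ticks(day.replace(hour=9, minute=0),
                            day.replace(hour=11, minute=30))
afternoon = five_minute_ticks(day.replace(hour=12, minute=30),
                              day.replace(hour=15, minute=0))
ticks = morning + afternoon

print(len(morning), len(afternoon), len(ticks))  # 31 31 62
```

Under this assumption each 150-minute session contributes 30 intervals and therefore 31 sample points, which gives the 62 entries of X_short.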

The acquisition unit 12 acquires new time-series numerical data from the stock price distribution server 40. The acquisition unit 12 may, for example, acquire new time-series numerical data on stock prices from the stock price distribution server 40 at 5-minute intervals.

The generation unit 13 inputs the new time-series numerical data into the language model 20 trained by the learning unit 11, and generates, from the output of the language model 20, new replacement text data describing the new time-series numerical data. The generation unit 13 may input into the language model 20 one or more pieces of numerical data obtained by converting the time-series numerical data by one or more methods, each piece corresponding one-to-one to one of the methods. Here, the one or more pieces of numerical data may include at least one of: numerical data obtained by normalizing the sequence included in the time-series numerical data to a predetermined numerical range; numerical data obtained by standardizing the time-series numerical data using the mean and standard deviation of the sequence; and numerical data obtained by relativizing the sequence with respect to a reference value, the reference value being the value associated with a predetermined time point in the sequence.

More specifically, the numerical data X_{norm,i} obtained by normalizing the sequence X_i (i = 1 to N) included in the time-series numerical data to a predetermined numerical range may be defined by X_{norm,i} = (2X_i − (X_max + X_min)) / (X_max − X_min), where X_max = max_i(X_i) and X_min = min_i(X_i). In this case, the normalized numerical data X_{norm,i} lies in the numerical range from −1 to 1.

The numerical data X_{std,i} obtained by standardizing the sequence X_i (i = 1 to N) included in the time-series numerical data may be defined by X_{std,i} = (X_i − μ) / σ, where μ = E[X_i] and σ = (var[X_i])^{1/2}.

The numerical data X_{move,i} obtained by relativizing the sequence X_i (i = 1 to N) included in the time-series numerical data with respect to a reference value r_i may be defined by X_{move,i} = X_i − r_i. When the time-series numerical data is a stock price, the reference value r_i may be the previous day's closing price. That is, for the first time-series numerical data X_short, which contains a sequence associated with the passage of time at 5-minute intervals, the relativized numerical data may be calculated as (X_{short,i} − r) for all i, where r is the previous day's closing price. For the second time-series numerical data X_long, which contains a sequence associated with the passage of time at one-business-day intervals, the relativized numerical data may be calculated as (X_{long,j} − X_{long,j−1}).

Using normalized or standardized numerical data prevents the generated text from varying with the absolute values of the time-series numerical data. Using relativized numerical data makes it possible to generate text describing fluctuations in the time-series numerical data that correctly includes words depending on the history of the data.
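The three conversions described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the implementation of the embodiment: the function names are hypothetical, the use of the population standard deviation is an assumption, and the value used for the first element of the relativized daily series (which has no predecessor inside X_long) is also an assumption.

```python
from statistics import fmean, pstdev

def normalize(xs):
    """Scale xs into [-1, 1]: (2*x - (max + min)) / (max - min)."""
    hi, lo = max(xs), min(xs)
    # Assumes the series is not constant (hi != lo).
    return [(2 * x - (hi + lo)) / (hi - lo) for x in xs]

def standardize(xs):
    """Zero-mean, unit-variance scaling: (x - mu) / sigma."""
    mu = fmean(xs)
    sigma = pstdev(xs)  # population standard deviation (an assumption)
    return [(x - mu) / sigma for x in xs]

def relativize_short(xs, prev_close):
    """Intraday series: subtract the previous day's closing price r."""
    return [x - prev_close for x in xs]

def relativize_long(xs):
    """Daily series: difference from the preceding business day."""
    # The first element has no predecessor in xs; 0.0 is used here as an
    # assumption so that the output keeps the same length as the input.
    return [0.0] + [b - a for a, b in zip(xs, xs[1:])]

print(normalize([1.0, 2.0, 3.0]))            # [-1.0, 0.0, 1.0]
print(relativize_short([101.0, 99.0], 100))  # [1.0, -1.0]
print(relativize_long([1.0, 3.0, 6.0]))      # [0.0, 2.0, 3.0]
```

Keeping each output the same length as its input matches the vector sizes stated later for the preprocessing units (three conversions of 62 points, and three conversions of 7 points).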

The replacement unit 14 replaces a predetermined character string included in the new replacement text data with a numerical value related to the new time-series numerical data according to a predetermined rule. To do so, the replacement unit 14 refers to the predetermined rule stored in the rule storage unit 15. For example, the replacement unit 14 replaces the predetermined character string <price1> with the difference between the last value of X_long (X_{long,7}, the previous day's closing price) and the last value of X_short (X_{short,62}, the closing price of the day), and replaces the predetermined character string <price2> with the value obtained by rounding down that difference at the tens place.

FIG. 4 is a diagram showing the configuration of the language model 20. The language model 20 includes: a first encoder 22a that receives one or more pieces of first numerical data l_s, obtained by the first preprocessing unit 21a converting, by one or more methods, the first time-series numerical data X_short containing a sequence associated with the passage of time at the first interval, each piece corresponding one-to-one to one of the methods; a second encoder 22b that receives one or more pieces of second numerical data l_l, obtained by the second preprocessing unit 21b converting, by one or more methods, the second time-series numerical data X_long containing a sequence associated with the passage of time at the second interval longer than the first interval, each piece corresponding one-to-one to one of the methods; a combining unit 23 that combines the output h_s of the first encoder 22a and the output h_l of the second encoder 22b; and a decoder 24 that receives the data m combined by the combining unit 23 and outputs replacement text data.

In this example, the first time-series numerical data X_short is given as a 62-dimensional vector containing values such as "12167.29" and "12278.83". The second time-series numerical data X_long is given as a 7-dimensional vector containing values such as "12116.57" and "12120.94". The first preprocessing unit 21a converts the input first time-series numerical data X_short by three methods and outputs the first numerical data l_s as the direct sum of the three resulting vectors. The three methods are: calculating numerical data normalized to a predetermined numerical range, calculating standardized numerical data, and calculating numerical data relativized with respect to a reference value. In this example, the first numerical data l_s output from the first preprocessing unit 21a is a 186-dimensional vector.

Using time-series numerical data containing sequences associated with the passage of time at different intervals in this way makes it possible to generate text describing fluctuations in the time-series numerical data that correctly includes words depending on the history of the data. In addition, inputting into the language model one or more pieces of numerical data obtained by converting the time-series numerical data by one or more methods prevents the generated text from varying with the absolute values of the time-series numerical data.

Similarly, the second preprocessing unit 21b converts the input second time-series numerical data X_long by three methods and outputs the second numerical data l_l as the direct sum of the three resulting vectors. The three methods are: calculating numerical data normalized to a predetermined numerical range, calculating standardized numerical data, and calculating numerical data relativized with respect to a reference value. In this example, the second numerical data l_l output from the second preprocessing unit 21b is a 21-dimensional vector.

The first encoder 22a receives the first numerical data l_s output from the first preprocessing unit 21a and outputs a vector h_s. The dimension of the vector h_s equals the number of output nodes in the output layer of the first encoder 22a. Similarly, the second encoder 22b receives the second numerical data l_l output from the second preprocessing unit 21b and outputs a vector h_l. The dimension of the vector h_l equals the number of output nodes in the output layer of the second encoder 22b. The first encoder 22a and the second encoder 22b may each be an MLP, a CNN, or an RNN, or may be another type of model.

The combining unit 23 combines the output h_s of the first encoder 22a, the output h_l of the second encoder 22b, the first numerical data l_s output from the first preprocessing unit 21a, and the second numerical data l_l output from the second preprocessing unit 21b by taking their direct sum.
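The dimensions involved in this combination can be traced with a toy sketch. The direct sum is read here as vector concatenation, which is consistent with the 186-dimensional and 21-dimensional sizes given above. The encoder below is a stand-in with fixed random weights, and the hidden size of 32 is a hypothetical value; neither comes from the description.

```python
import random

def mlp_encoder(dim_in, dim_out, seed):
    """Return a toy one-layer encoder with fixed random weights (illustration only)."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim_in)]
               for _ in range(dim_out)]

    def encode(vec):
        # ReLU(W @ vec); biases are omitted for brevity.
        return [max(0.0, sum(w * v for w, v in zip(row, vec)))
                for row in weights]

    return encode

HIDDEN = 32  # hypothetical encoder output size; not given in the description

l_s = [0.0] * 186  # first numerical data: 3 conversions x 62 points
l_l = [0.0] * 21   # second numerical data: 3 conversions x 7 points

encoder_short = mlp_encoder(len(l_s), HIDDEN, seed=0)
encoder_long = mlp_encoder(len(l_l), HIDDEN, seed=1)

h_s = encoder_short(l_s)
h_l = encoder_long(l_l)

# Direct sum, read here as concatenation: m = [h_s; h_l; l_s; l_l]
m = h_s + h_l + l_s + l_l
print(len(m))  # 2 * 32 + 186 + 21 = 271
```

Passing l_s and l_l through to m alongside the encoder outputs means the decoder sees the preprocessed values directly as well as their encoded form.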

The decoder 24 receives the data m combined by the combining unit 23 and data T concerning the time series of the time-series numerical data. The data T may be, for example, data on the latest of the times associated with the sequence included in the first time-series numerical data X_short, or data on the range of business days associated with the sequence included in the second time-series numerical data X_long.

In this example, the decoder 24 outputs the replacement text data "日経" (Nikkei), "平均" (average), "、" (comma), "上げ幅" (gain), "<price1>", "円" (yen), "超える" (exceeds), and "</s>". The fifth output character string, "<price1>", is a predetermined character string that the replacement unit 14 of the text generation device 10 replaces with a numerical value related to the time-series numerical data. The last output character string, "</s>", is a predetermined character string indicating the end of the text data. From these character strings output by the decoder 24, the text generation device 10 generates the new replacement text data "日経平均、上げ幅<price1>円超える" ("Nikkei average, gain exceeds <price1> yen"). The replacement unit 14 then replaces the predetermined character string "<price1>" with the difference between the last value of X_long (X_{long,7}, the previous day's closing price) and the last value of X_short (X_{short,62}, the closing price of the day), thereby generating text data describing the fluctuation of the stock price.
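The two steps in this example, joining the decoder's tokens and then substituting the placeholder, can be sketched as follows. The function names, the use of the absolute difference, and the integer formatting of the result are assumptions for illustration; the price values reuse figures quoted elsewhere in the description purely as sample inputs.

```python
tokens = ["日経", "平均", "、", "上げ幅", "<price1>", "円", "超える", "</s>"]

def detokenize(tokens, eos="</s>"):
    """Join decoder tokens up to (not including) the end-of-text marker."""
    out = []
    for tok in tokens:
        if tok == eos:
            break
        out.append(tok)
    return "".join(out)  # Japanese text is joined without spaces

def fill_price1(text, x_long, x_short):
    """Replace <price1> with the gap between the latest price and the previous close."""
    diff = abs(x_short[-1] - x_long[-1])  # absolute value is an assumption
    return text.replace("<price1>", str(int(diff)))  # integer formatting is an assumption

template = detokenize(tokens)
print(template)  # 日経平均、上げ幅<price1>円超える

# Sample inputs only: last daily close and last intraday price.
sentence = fill_price1(template, x_long=[12116.57], x_short=[12278.83])
print(sentence)  # 日経平均、上げ幅162円超える
```

Because the model only has to emit the placeholder token, the specific number never needs to appear in its vocabulary.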

As in the language model 20 of this example, inputting time-series numerical data containing sequences associated with the passage of time at different intervals into separate encoders, combining the outputs, and inputting the result into the decoder makes it possible to generate text describing fluctuations in the time-series numerical data that correctly includes words depending on the history of the data. Inputting into the decoder not only the encoder outputs but also the multiple pieces of numerical data contributes to the same result. Furthermore, inputting into the decoder not only the data combined by the combining unit but also the data concerning the time series of the time-series numerical data likewise helps the generated text correctly include words that depend on the history of the data.

According to the text generation device 10 of the present embodiment, generating text describing fluctuations in the time-series numerical data so that it correctly includes words depending on the history of the data makes it possible, for example, to correctly generate expressions that refer to the past history of stock prices, such as "上げ幅" (gain), "続落" (continued decline), and "反発" (rebound), and to correctly generate expressions that depend on the time of day, such as "始まる" (opens), "寄り付き" (opening), "前引け" (morning close), "午後" (afternoon), and "大引け" (close).

FIG. 5 is a diagram showing rule D3, which associates types of numerical values related to the time-series numerical data with predetermined character strings. Rule D3 is an example of the predetermined rule stored in the rule storage unit 15 and referred to by the replacement unit 14.

Rule D3 associates 12 character strings with 12 types of numerical values related to the time-series numerical data. Each character string corresponds one-to-one to a numerical value related to the time-series numerical data. In this example, the character string <price1> is associated with the difference between the last value of X_long (X_{long,7}) and the last value of X_short (X_{short,62}). The character string <price2> is associated with the value obtained by rounding down, at the tens place, the difference between the last value of X_long and the last value of X_short.

The character string <price3> is associated with the value obtained by rounding down, at the hundreds place, the difference between the last value of X_long and the last value of X_short; the character string <price4> is associated with the value obtained by rounding up that difference at the tens place; and the character string <price5> is associated with the value obtained by rounding up that difference at the hundreds place.

Furthermore, the character string <price6> is associated with the last value of X_short; <price7> with the last value of X_short rounded down at the hundreds place; <price8> with the last value of X_short rounded down at the thousands place; and <price9> with the last value of X_short rounded down at the ten-thousands place. Similarly, <price10> is associated with the last value of X_short rounded up at the hundreds place; <price11> with the last value of X_short rounded up at the thousands place; and <price12> with the last value of X_short rounded up at the ten-thousands place.

Associating types of numerical values related to the time-series numerical data with predetermined character strings in this way makes it possible to generate text describing fluctuations in the time-series numerical data that includes values quoted from the data and values obtained by performing calculations on the data.
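Rule D3 can be pictured as a lookup table from the 12 placeholder strings to values computed from X_long and X_short. The sketch below is one possible reading, not the rule as stored in the rule storage unit 15: it assumes that rounding "at the tens place" means rounding to a multiple of 10 (and likewise for 100, 1000, and 10000), that the difference is taken as an absolute value, and that values are printed as integers. All function names are hypothetical.

```python
import math
import re

def floor_to(value, unit):
    """Round value down to a multiple of unit (assumed meaning of rounding down)."""
    return math.floor(value / unit) * unit

def ceil_to(value, unit):
    """Round value up to a multiple of unit (assumed meaning of rounding up)."""
    return math.ceil(value / unit) * unit

def build_rule_d3(x_long, x_short):
    """Map each placeholder of rule D3 to its value for the given series."""
    diff = abs(x_short[-1] - x_long[-1])  # absolute value is an assumption
    latest = x_short[-1]
    return {
        "<price1>": diff,
        "<price2>": floor_to(diff, 10),
        "<price3>": floor_to(diff, 100),
        "<price4>": ceil_to(diff, 10),
        "<price5>": ceil_to(diff, 100),
        "<price6>": latest,
        "<price7>": floor_to(latest, 100),
        "<price8>": floor_to(latest, 1000),
        "<price9>": floor_to(latest, 10000),
        "<price10>": ceil_to(latest, 100),
        "<price11>": ceil_to(latest, 1000),
        "<price12>": ceil_to(latest, 10000),
    }

def apply_rule(text, rule):
    """Replace every <priceN> placeholder in text with its integer-formatted value."""
    return re.sub(r"<price\d+>", lambda m: str(int(rule[m.group(0)])), text)

# Sample inputs only: last daily close and last intraday price.
rule = build_rule_d3(x_long=[12116.57], x_short=[12278.83])
text = apply_rule("日経平均、上げ幅<price2>円超える", rule)
print(text)  # 日経平均、上げ幅160円超える
```

Offering both rounded-down and rounded-up variants lets the model pair a placeholder with wording such as "exceeds" or "approaches" while the arithmetic stays outside the model.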

FIG. 6 is a flowchart of the processing executed by the text generation device 10 according to the present embodiment. First, the acquisition unit 12 acquires stock prices recorded at 5-minute intervals as the first time-series numerical data (S10), and acquires stock prices recorded at one-business-day intervals as the second time-series numerical data (S11).

The generation unit 13 then inputs the first time-series numerical data and the second time-series numerical data into the language model 20. In the language model 20, the first preprocessing unit 21a converts the first time-series numerical data into normalized, standardized, and relativized numerical data (S12), and the second preprocessing unit 21b converts the second time-series numerical data into normalized, standardized, and relativized numerical data (S13). The multiple pieces of first numerical data obtained by converting the first time-series numerical data are input to the first encoder 22a (S14), and the multiple pieces of second numerical data obtained by converting the second time-series numerical data are input to the second encoder 22b (S15).

The combining unit 23 then combines the multiple pieces of first numerical data, the multiple pieces of second numerical data, the output of the first encoder 22a, and the output of the second encoder 22b (S16). The combined data and the data concerning the time series are then input to the decoder 24 (S17).

The replacement unit 14 replaces the predetermined character strings in the replacement text data output from the decoder 24 with numerical values according to the predetermined rule (S18), and generates text data describing the fluctuation of the time-series numerical data. This completes the processing.

According to the text generation device 10 of the present embodiment, new replacement text data describing the new time-series numerical data is generated from the output of the language model, and the predetermined character strings included in the new replacement text data are replaced, according to the predetermined rule, with numerical values related to the new time-series numerical data. This removes the need for the language model to output those numerical values directly, so that even when the values vary widely, text describing fluctuations in the time-series numerical data can be generated with the descriptions of those values correctly included.

FIG. 7 is a diagram showing text generated by the text generation device 10 according to the present embodiment. The figure shows Table 1 (R1), which summarizes the text generated when a conventional model is used as the language model and when the language model 20 according to the present embodiment, or a partially modified version of it, is used. The language model 20 according to the present embodiment is the language model 20 described with reference to FIG. 4, with MLPs used as the first encoder 22a and the second encoder 22b. The first modified example is the language model 20 of FIG. 4 without standardized data, that is, a model in which the first preprocessing unit 21a and the second preprocessing unit 21b calculate only two types of data, normalized and relativized. The second modified example is the language model 20 of FIG. 4 without replacement text data, that is, a model in which the language model directly generates the numerical values related to the time-series numerical data. The third modified example is the language model 20 of FIG. 4 in which the data concerning the time series of the time-series numerical data is not input to the decoder 24.

In addition to Table 1 (R1), the figure shows an example E of accurate text data. Example E is "日経平均大引け、続伸終値は３２円高の１６９０６円" ("Nikkei average at the close: continued gains, closing price up 32 yen at 16,906 yen").

By contrast, the example text generated when a conventional model is used as the language model is "日経平均、反落前引けは５７円安の２０６０６円" ("Nikkei average falls back; morning close down 57 yen at 20,606 yen"). It is inaccurate in four respects: it gives the delivery time of the text data as "前引け" (morning close), it describes the change from the previous day as "反落" (falls back), it gives the difference from the previous day's closing price as "５７円安" (down 57 yen), and it gives the current stock price as "２０６０６円" (20,606 yen).

On the other hand, the example text generated with the language model 20 according to the present embodiment, listed second from the top of Table 1 (R1), is "日経平均、続伸大引けは３２円高の１６９０６円" ("Nikkei average, continued gains; the close is up 32 yen at 16,906 yen"). It correctly gives the delivery time of the text data as "大引け" (close), correctly describes the change from the previous day as "続伸" (continued gains), correctly calculates the difference from the previous day's closing price as "３２円高" (up 32 yen), and correctly quotes the current stock price as "１６９０６円" (16,906 yen). All expressions are accurate.

The example text generated with the first modified example of the language model 20, listed third from the top of Table 1 (R1), is "日経平均、続伸大引けは３２円高の１６９０６円". It correctly gives the delivery time of the text data as "大引け" (close), correctly describes the change from the previous day as "続伸" (continued gains), correctly calculates the difference from the previous day's closing price as "３２円高" (up 32 yen), and correctly quotes the current stock price as "１６９０６円" (16,906 yen). All expressions are accurate. This shows that even a model in which the first preprocessing unit 21a and the second preprocessing unit 21b calculate only two types of data, normalized and relativized, can generate text data describing the time-series numerical data with accuracy equal to or better than that of a model in which they calculate three types of data: standardized, normalized, and relativized.

The example text generated with the second modified example of the language model 20, listed fourth from the top of Table 1 (R1), is "日経平均、続伸大引けは２８円高の<unk>円" ("Nikkei average, continued gains; the close is up 28 yen at <unk> yen"). It correctly gives the delivery time of the text data as "大引け" (close) and correctly describes the change from the previous day as "続伸" (continued gains), but it miscalculates the difference from the previous day's closing price as "２８円高" (up 28 yen) and fails to quote the current stock price, producing "<unk>円" instead. Here, <unk> is a character string representing "unknown" and indicates that no suitable word could be generated. This shows that when the language model directly generates the numerical values related to the time-series numerical data, it is difficult to generate those values correctly, and that not only expressions involving calculations on the data but also simple quotations of the data are hard to include.

The example text generated by the third example of a model obtained by modifying part of the language model 20 according to the present embodiment, listed fifth from the top of Table 1 (R1), is "Nikkei average extends gains; morning close 32 yen higher at 16906 yen". This text incorrectly expresses the delivery time slot of the text data as "the morning close" (前引け), while it correctly expresses the change in the stock price from the previous day as "extends gains" (続伸), correctly computes the difference from the previous day's closing price as "32 yen higher", and correctly quotes the current stock price as "16906 yen". This shows that a model in which data on the time series of the time-series numerical data is not input to the decoder 24 has difficulty referring correctly to the time slot in which the time-series numerical data was acquired.

FIG. 8 is a diagram showing index values that evaluate how close the text generated by the text generator 10 according to the present embodiment is to a reference text. The figure shows Table 2 (R2), which summarizes these index values for the case where a conventional model is used as the language model and for the cases where the language model 20 according to the present embodiment, or a model obtained by modifying part of it, is used. The index value is BLEU (BiLingual Evaluation Understudy), which takes a value from 0 to 1; the closer the value is to 1, the closer the generated text is to the reference text (the accurate text). This index is one example of a measure used to evaluate text, and other index values can also be used.
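To make the index concrete, the following is a simplified sketch of a sentence-level BLEU computation (clipped n-gram precision up to 4-grams, uniform weights, a single reference, no smoothing). The source does not state which BLEU variant, tokenization, or n-gram order was used for Table 2, so the function name `bleu` and every choice below are assumptions made for illustration only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bleu(candidate, reference, max_n=4):
    """Simplified BLEU for one candidate and one reference token list."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing: any empty precision gives a score of 0
        log_precision_sum += math.log(matched / total)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_precision_sum / max_n)


reference = "Nikkei average extends gains closes 32 yen higher at 16906 yen".split()
wrong_diff = "Nikkei average extends gains closes 28 yen higher at 16906 yen".split()
print(bleu(reference, reference))   # identical text scores 1.0
print(bleu(wrong_diff, reference))  # one wrong number lowers the score
```

A single wrong token breaks every n-gram that spans it, which is why errors in numbers or in the time-slot expression reduce the score noticeably even when the rest of the sentence matches.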

The language model 20 according to the present embodiment is the language model 20 described with reference to FIG. 4, namely a model in which the first encoder 22a and the second encoder 22b are MLPs, a model in which the first encoder 22a and the second encoder 22b are CNNs, and a model in which the first encoder 22a and the second encoder 22b are RNNs.

The first example of a model obtained by modifying part of the language model 20 according to the present embodiment is the language model 20 described with reference to FIG. 4 without the first time-series numerical data, that is, a model that uses only the second time-series numerical data. The second example is the language model 20 described with reference to FIG. 4 without the second time-series numerical data, that is, a model that uses only the first time-series numerical data.

The third example of a model obtained by modifying part of the language model 20 according to the present embodiment is the language model 20 described with reference to FIG. 4 without the normalized data, that is, a model in which the first preprocessing unit 21a and the second preprocessing unit 21b compute two kinds of data: standardized data and relativized data. The fourth example is the language model 20 described with reference to FIG. 4 without the standardized data, that is, a model in which the first preprocessing unit 21a and the second preprocessing unit 21b compute two kinds of data: normalized data and relativized data. The fifth example is the language model 20 described with reference to FIG. 4 without the relativized data, that is, a model in which the first preprocessing unit 21a and the second preprocessing unit 21b compute two kinds of data: standardized data and normalized data.
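The three kinds of preprocessed data that these variants switch on and off can be sketched as below. This is an illustrative reading rather than the exact formulas of the embodiment: min-max scaling for normalization, the population standard deviation for standardization, and subtraction of the reference value for relativization are assumptions, as are the function names and the sample prices.

```python
from statistics import mean, pstdev


def normalize(xs, lo=0.0, hi=1.0):
    """Scale the sequence into the range [lo, hi] (min-max scaling, assumed)."""
    x_min, x_max = min(xs), max(xs)
    if x_max == x_min:
        return [lo for _ in xs]  # constant sequence: avoid division by zero
    return [lo + (x - x_min) * (hi - lo) / (x_max - x_min) for x in xs]


def standardize(xs):
    """Subtract the mean and divide by the standard deviation of the sequence."""
    mu, sigma = mean(xs), pstdev(xs)
    if sigma == 0:
        return [0.0 for _ in xs]
    return [(x - mu) / sigma for x in xs]


def relativize(xs, reference):
    """Express each value relative to a reference value (difference, assumed)."""
    return [x - reference for x in xs]


prices = [16880, 16895, 16906]  # hypothetical intraday values
previous_close = 16874          # hypothetical reference value

# For instance, the fourth example above would use only these two kinds of data.
features_fourth_example = normalize(prices) + relativize(prices, previous_close)
print(features_fourth_example)
```

Relativized data keeps the sign and size of the move against the reference point, which normalization and standardization discard; this is one plausible reading of why removing it hurts the evaluation value.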

The sixth example of a model obtained by modifying part of the language model 20 according to the present embodiment is the language model 20 described with reference to FIG. 4 in which the one or more pieces of numerical data obtained by converting the time-series numerical data by one or more methods are not input to the decoder 24, that is, a model in which only the output of the first encoder 22a and the output of the second encoder 22b are input to the decoder 24. The seventh example is the language model 20 described with reference to FIG. 4 without the replacement text data, that is, a model in which the language model directly generates the numerical values relating to the time-series numerical data. The eighth example is the language model 20 described with reference to FIG. 4 in which data on the time series of the time-series numerical data is not input to the decoder 24.

The evaluation value of the text generated when a conventional model is used as the language model is 0.244. In contrast, the evaluation value of the text generated by the text generator 10 using the language model 20 according to the present embodiment is 0.415 when MLPs are used as the encoders, 0.414 when CNNs are used as the encoders, and 0.415 when RNNs are used as the encoders. In every case the evaluation value is substantially higher than that of the conventional model, which shows that accurate text data is generated.

The evaluation value of the text generated using the first example of a model obtained by modifying part of the language model 20 according to the present embodiment, listed fifth from the top of Table 2 (R2), is 0.356, and the evaluation value of the text generated using the second example, listed sixth from the top of Table 2 (R2), is 0.397. Both are lower than the values of 0.414 to 0.415 obtained with the language model 20, which shows that using two kinds of time-series numerical data acquired at different time intervals improves the evaluation value. This is considered to be because the text generator 10 according to the present embodiment can correctly generate words that depend on the history of the time-series numerical data.

The evaluation value of the text generated using the third example of a model obtained by modifying part of the language model 20 according to the present embodiment, listed seventh from the top of Table 2 (R2), is 0.424; that of the fourth example, listed eighth from the top, is 0.424; and that of the fifth example, listed ninth from the top, is 0.408. These results show that using the relativized data together with either the normalized data or the standardized data generates more appropriate text data than using all three of the normalized data, the standardized data, and the relativized data. They also show that the evaluation value deteriorates when the relativized data is not used.

The evaluation value of the text generated using the sixth example of a model obtained by modifying part of the language model 20 according to the present embodiment, listed tenth from the top of Table 2 (R2), is 0.397; that of the seventh example, listed eleventh from the top, is 0.313; and that of the eighth example, listed twelfth from the top, is 0.358. These results show that the index value is worse than with the language model 20 according to the present embodiment when using a model in which only the output of the first encoder 22a and the output of the second encoder 22b are input to the decoder 24, a model in which the language model directly generates the numerical values relating to the time-series numerical data, or a model in which data on the time series of the time-series numerical data is not input.

The embodiments described above are intended to facilitate understanding of the present invention and are not to be construed as limiting it. The elements of the embodiments and their arrangement, materials, conditions, shapes, sizes, and the like are not limited to those illustrated and may be changed as appropriate. Configurations shown in different embodiments may also be partially substituted for or combined with one another.

10 ... text generator, 10a ... CPU, 10b ... RAM, 10c ... ROM, 10d ... communication unit, 10e ... input unit, 10f ... display unit, 11 ... learning unit, 12 ... acquisition unit, 13 ... generation unit, 14 ... replacement unit, 15 ... rule storage unit, 20 ... language model, 21a ... first preprocessing unit, 21b ... second preprocessing unit, 22a ... first encoder, 22b ... second encoder, 23 ... synthesis unit, 24 ... decoder, 30 ... user terminal, 40 ... stock price distribution server, 100 ... text generation system, D1 ... replacement text data, D2 ... time-series numerical data, D3 ... rule, E ... example of accurate text data, N ... communication network, R1 ... Table 1, R2 ... Table 2

Claims

A text generator that generates text data describing fluctuations in time-series numerical data including a sequence of numerical values associated with the passage of time, the text generator comprising:
a learning unit that, using as training data both replacement text data, in which numerical values in the text data that relate to the time-series numerical data are replaced with predetermined character strings according to a predetermined rule, and the time-series numerical data, trains a language model to output the replacement text data when the time-series numerical data is input;
a generation unit that inputs new time-series numerical data to the language model trained by the learning unit and generates new replacement text data from the output of the language model; and
a replacement unit that replaces the predetermined character string included in the new replacement text data with a numerical value relating to the new time-series numerical data according to the predetermined rule.

The text generator according to claim 1, wherein
the numerical values relating to the time-series numerical data include at least one of:
a numerical value associated with a predetermined time point in the sequence of numerical values included in the time-series numerical data;
a difference between numerical values associated with different time points in the sequence of numerical values included in the time-series numerical data;
a value obtained by rounding down, to a predetermined digit, the numerical value associated with a predetermined time point in the sequence of numerical values included in the time-series numerical data;
a value obtained by rounding down, to a predetermined digit, the difference between numerical values associated with different time points in the sequence of numerical values included in the time-series numerical data;
a value obtained by rounding up, to a predetermined digit, the numerical value associated with a predetermined time point in the sequence of numerical values included in the time-series numerical data; and
a value obtained by rounding up, to a predetermined digit, the difference between numerical values associated with different time points in the sequence of numerical values included in the time-series numerical data, and
the predetermined rule is a rule that associates each type of numerical value relating to the time-series numerical data with the predetermined character string.

The text generator according to claim 1 or 2, wherein
the time-series numerical data includes first time-series numerical data including a sequence of numerical values associated with the passage of time at a first interval, and second time-series numerical data including a sequence of numerical values associated with the passage of time at a second interval longer than the first interval.

The text generator according to any one of claims 1 to 3, wherein
the generation unit inputs to the language model one or more pieces of numerical data that are obtained by converting the time-series numerical data by one or more methods and that correspond one-to-one to the one or more methods.

The text generator according to claim 4, wherein
the one or more pieces of numerical data include at least one of:
numerical data obtained by normalizing the sequence of numerical values included in the time-series numerical data to a predetermined numerical range;
numerical data obtained by standardizing the time-series numerical data using the mean and standard deviation of the sequence of numerical values included in the time-series numerical data; and
numerical data obtained by relativizing the sequence of numerical values included in the time-series numerical data with respect to a reference value, the reference value being the numerical value associated with a predetermined time point in that sequence.

The text generator according to any one of claims 1 to 5, wherein
the language model includes:
a first encoder to which are input one or more pieces of first numerical data that are obtained by converting, by one or more methods, first time-series numerical data including a sequence of numerical values associated with the passage of time at a first interval, and that correspond one-to-one to the one or more methods;
a second encoder to which are input one or more pieces of second numerical data that are obtained by converting, by the one or more methods, second time-series numerical data including a sequence of numerical values associated with the passage of time at a second interval longer than the first interval, and that correspond one-to-one to the one or more methods;
a synthesis unit that combines the output of the first encoder and the output of the second encoder; and
a decoder to which the data combined by the synthesis unit is input and which outputs the replacement text data.

The text generator according to claim 6, wherein
the synthesis unit combines the output of the first encoder, the output of the second encoder, the one or more pieces of first numerical data, and the one or more pieces of second numerical data.

The text generator according to claim 6 or 7, wherein
the data combined by the synthesis unit and data on the time series of the time-series numerical data are input to the decoder.

A text generation method in which a computer having a hardware processor and a memory generates text data describing fluctuations in time-series numerical data including a sequence of numerical values associated with the passage of time, the method comprising the computer executing:
training a language model, using as training data both replacement text data, in which numerical values in the text data that relate to the time-series numerical data are replaced with predetermined character strings according to a predetermined rule, and the time-series numerical data, so that the language model outputs the replacement text data when the time-series numerical data is input;
inputting new time-series numerical data to the trained language model and generating, from the output of the language model, new replacement text data describing the new time-series numerical data; and
replacing the predetermined character string included in the new replacement text data with a numerical value relating to the new time-series numerical data according to the predetermined rule.

A text generation program that causes a computer provided in a text generator, which generates text data describing fluctuations in time-series numerical data including a sequence of numerical values associated with the passage of time, to function as:
a learning unit that, using as training data both replacement text data, in which numerical values in the text data that relate to the time-series numerical data are replaced with predetermined character strings according to a predetermined rule, and the time-series numerical data, trains a language model to output the replacement text data when the time-series numerical data is input;
a generation unit that inputs new time-series numerical data to the language model trained by the learning unit and generates, from the output of the language model, new replacement text data describing the new time-series numerical data; and
a replacement unit that replaces the predetermined character string included in the new replacement text data with a numerical value relating to the new time-series numerical data according to the predetermined rule.