JP7343566B2 - Data generation method, computer device, and computer program using language models

Info

Publication number: JP7343566B2
Application number: JP2021209463A
Authority: JP
Inventors: Yoo Kang-min; Park Dong-joo; Kang Jae-wook; Lee Sang-woo; Park Woo-myung
Original assignee: Naver Corp
Current assignee: Naver Corp
Priority date: 2021-07-27
Filing date: 2021-12-23
Publication date: 2023-09-12
Anticipated expiration: 2041-12-23
Also published as: JP2023018624A; KR20230016794A

Description

Application of Article 30, Paragraph 2 of the Patent Act. Website publication date: April 18, 2021; website address: https://arxiv.org/abs/2104.08826; published by: Yoo Kang-min, Park Dong-joo, Kang Jae-wook, Lee Sang-woo, Park Woo-myung. Website publication date: May 25, 2021; website address: https://tv.naver.com/v/20349649; published by: Yoo Kang-min, Park Dong-joo, Kang Jae-wook, Lee Sang-woo, Park Woo-myung.

The following description relates to techniques for generating text data.

To build a natural language processing (NLP) model, the data required for model training is first secured through data design and data collection. After the model is trained on the secured data, data design and model training are improved iteratively while performance is evaluated.

Since the advent of pretrained language models (PLMs), a common approach has been to fine-tune a PLM into an NLP model that solves a specific domain or task.

In addition, data augmentation techniques are used to train NLP models efficiently.

As one example, techniques such as easy data augmentation (EDA), which generate text data for model training by manipulating phrases at the surface level, produce sentences with low grammaticality, and the meaning of a generated sentence may differ greatly from the intended meaning.

As another example, techniques such as back-translation, which generate similar sentences using a machine translation model, rely on a translation model trained on a translation corpus with specific linguistic characteristics. As a result, the generated style cannot reflect the linguistic characteristics of the existing text data that is the target of generation, so the technique cannot be used for general purposes.

Provided is a data augmentation technique that can derive natural language generation results using a large-scale language model trained on corpora with diverse linguistic characteristics, and that can extract new data from the derived results.

Provided is a data generation method performed by a computer device, the computer device including at least one processor configured to execute computer-readable instructions contained in a memory, the data generation method including: constructing, by the at least one processor, a prompt to serve as an input sentence of a language model using original data; and inputting, by the at least one processor, the prompt into the language model and generating, from the language model, new data and label information for the new data.

According to one aspect, the generating may include generating a soft label indicating a label distribution using a probability distribution over the natural language tokens corresponding to the labels in the prompt.

According to another aspect, the data generation method may further include selecting, by the at least one processor, the original data from a training dataset according to the semantic diversity of the text.

According to another aspect, the selecting may select as many pieces of the original data from the training dataset as there are label types.

According to another aspect, the constructing may construct the prompt in a format that includes a text type and a label type.

According to another aspect, the constructing may construct the prompt in a format that includes a text type, a label type, and a verbalizer that identifies the label position.

According to another aspect, the constructing may process the original data and construct the prompt in natural language form in the same format as the original data.

According to another aspect, the constructing may construct the prompt by combining a task specification, the original data, and a prompt template.

According to another aspect, the generating may include extracting the new data and the label information using an autoregressive method in which, for the natural language corresponding to the text and the label in the prompt, the probability distribution of the previous token is passed as input for the next token.

According to yet another aspect, the extracting may include passing only the top portion of the probabilities in the probability distribution of the previous token as input for the next token, by beam search using a heuristic.

Provided is a computer program for causing the computer device to execute the data generation method.

Provided is a computer device including at least one processor configured to execute computer-readable instructions contained in a memory, the at least one processor processing: a process of constructing a prompt to serve as an input sentence of a language model using original data; and a process of inputting the prompt into the language model and generating, from the language model, new data and label information for the new data.

According to embodiments of the present invention, new data is generated by transforming or expanding original data using a language model, and the knowledge held by the language model is transferred through the new data. This can significantly reduce the effort invested in data collection and achieve the effect of a lighter model.

FIG. 1 is a block diagram for explaining an example of the internal configuration of a computer device in an embodiment of the present invention.
FIG. 2 is a diagram for explaining the concept of text augmentation using a large-scale language model in an embodiment of the present invention.
FIG. 3 is a flowchart illustrating an example of a data generation method that can be executed by a computer device in an embodiment of the present invention.
FIG. 4 is a diagram for explaining a process of constructing an input prompt for a language model in an embodiment of the present invention.
FIG. 5 is a diagram for explaining a data augmentation process in an embodiment of the present invention.
FIG. 6 is a diagram for explaining a data augmentation process in an embodiment of the present invention.
FIG. 7 is a diagram for explaining a process of generating a new sentence and label information using a language model in an embodiment of the present invention.
FIG. 8 is a diagram for explaining a process of generating soft labels having a distribution in an embodiment of the present invention.
FIG. 9 is a diagram showing an example of original data and an example of new data generated from the original data in an embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Embodiments of the present invention relate to text data generation technology using a language model.

Embodiments including the matters specifically disclosed in this specification use a large-scale language model to generate sentences that are consistent with the characteristics of the existing text data and that are also highly grammatical and natural. In addition, high-quality label information can be generated for those sentences.

FIG. 1 is a block diagram showing an example of a computer device in an embodiment of the present invention. For example, the data generation system according to embodiments of the present invention may be implemented by the computer device 100 shown in FIG. 1.

As shown in FIG. 1, the computer device 100 may include a memory 110, a processor 120, a communication interface 130, and an input/output interface 140 as components for executing the data generation method according to embodiments of the present invention.

The memory 110 is a computer-readable recording medium and may include random access memory (RAM), read-only memory (ROM), and a permanent mass storage device such as a disk drive. A permanent mass storage device such as a ROM or a disk drive may also be included in the computer device 100 as a permanent storage device separate from the memory 110. An operating system and at least one program code may be recorded in the memory 110. Such software components may be loaded into the memory 110 from a computer-readable recording medium separate from the memory 110. Such a separate computer-readable recording medium may include a floppy (registered trademark) drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, and the like. In other embodiments, software components may be loaded into the memory 110 through the communication interface 130 rather than from a computer-readable recording medium. For example, software components may be loaded into the memory 110 of the computer device 100 based on a computer program installed from files received over the network 160.

The processor 120 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Instructions may be provided to the processor 120 by the memory 110 or the communication interface 130. For example, the processor 120 may be configured to execute instructions received according to program code recorded in a storage device such as the memory 110.

The communication interface 130 may provide a function for the computer device 100 to communicate with other devices via the network 160. As one example, requests, instructions, data, files, and the like generated by the processor 120 of the computer device 100 according to program code recorded in a storage device such as the memory 110 may be transmitted to other devices via the network 160 under the control of the communication interface 130. Conversely, signals, instructions, data, files, and the like from other devices may be received by the computer device 100 via the network 160 and through the communication interface 130. Signals, instructions, data, and the like received through the communication interface 130 may be passed to the processor 120 or the memory 110, and files and the like may be recorded on a recording medium (the permanent storage device described above) that the computer device 100 may further include.

The communication method is not limited. It may include not only communication methods that use a communication network that the network 160 may include (for example, a mobile communication network, wired Internet, wireless Internet, or a broadcasting network) but also short-range wireless communication between devices. For example, the network 160 may include any one or more of a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), the Internet, and the like. Further, the network 160 may include any one or more network topologies, including, but not limited to, a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network.

The input/output interface 140 may be a means for interfacing with the input/output device 150. For example, an input device may include a microphone, keyboard, camera, or mouse, and an output device may include a display or a speaker. As another example, the input/output interface 140 may be a means for interfacing with a device in which input and output functions are integrated, such as a touch screen. The input/output device 150 may be configured as a single device together with the computer device 100.

In other embodiments, the computer device 100 may include fewer or more components than those shown in FIG. 1. However, most conventional components need not be clearly illustrated. For example, the computer device 100 may be implemented to include at least some of the input/output devices 150 described above, or may further include other components such as a transceiver, a camera, various sensors, and a database.

The large-scale language model used in the present invention may refer to a language model capable of inference without fine-tuning, using an approach such as few-shot learning (FSL), and has more than ten times as many parameters as conventional general language models (for example, 100 billion parameters or more). For example, large-scale language models such as GPT-3 (Generative Pre-trained Transformer 3) and HyperClova are excellent few-shot learners that can be controlled with natural-text prompts. They are capable of in-context learning, that is, the ability to understand a pattern from only a small amount of data given in the prompt and to solve an NLP problem.

This embodiment relates to a new data augmentation technique that uses a large-scale language model to generate new data from original data. Furthermore, by using soft labels predicted by the language model, knowledge can be effectively distilled from the large-scale language model while textual perturbations are generated at the same time.

FIG. 2 is a diagram for explaining the concept of text augmentation using a large-scale language model in an embodiment of the present invention.

Referring to FIG. 2, in this embodiment the large-scale language model 210 is used as a backbone for generating the synthetic text data required for model training.

According to this embodiment, using the large-scale language model 210 makes it possible to generate new data from the original data that is synthetic yet hyper-realistic.

When text data is available, in labeled or unlabeled form, the data may be converted into a prompt input sentence in natural language form, and the converted prompt may be given to the language model 210 as input to derive a natural language generation result. The derived result may be analyzed to extract new data, which has the same form as the original data, labeled or unlabeled. The extracted new data can be added to the original text data to help with data collection, and training a model on this data can improve the model's performance.

In other words, this embodiment uses a prompt-based approach to generate new data with the large-scale language model 210 for the purpose of data augmentation. Knowledge distillation can be achieved by training a small-scale classification model on the new data, which is inspired by the original data, together with the soft labels predicted by the large-scale language model 210.

FIG. 3 is a flowchart illustrating an example of a data generation method that can be executed by a computer device in an embodiment of the present invention.

The data generation method according to this embodiment may be executed by the computer device 100 described above. In this case, the processor 120 of the computer device 100 may be implemented to execute control instructions according to the operating system code and the code of at least one program contained in the memory 110. Here, the processor 120 may control the computer device 100 so that it executes steps 310 to 340 of the data generation method of FIG. 3 according to the control instructions provided by the code recorded in the computer device 100.

The data generation method according to this embodiment can generate highly fluent new data based on the data distribution.

Referring to FIG. 3, in step 310 the processor 120 may select, from a dataset, the original data to be used in the prompt.

The following describes a method of selecting original data, taking data generation for a text classification task as an example. Given a classification task T, the training dataset is a set of pairs of text x and label y:

$$D = \{(x_i, y_i)\}_{i=1}^{N}$$

where N is the number of pairs in the training dataset.

As one example, the processor 120 may randomly select k pieces of original data from the training dataset D. The processor 120 may select the k pieces of original data using a uniform distribution (Equation (1)):

$$p_s(i) = \frac{1}{N} \qquad (1)$$

The number of pieces of original data, k, may be determined in consideration of cost and performance. For example, the processor 120 may set k to 2 and select two pieces of original data from the training dataset D.
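The uniform selection described above can be sketched in a few lines of Python. This is only an illustration: the function name `select_examples_uniform`, the toy dataset, and the choice to draw without replacement are assumptions, not details taken from the specification.

```python
import random

def select_examples_uniform(dataset, k=2, seed=None):
    """Pick k (text, label) pairs, each example being equally likely."""
    rng = random.Random(seed)
    # Assumption: draw without replacement so the k examples are distinct.
    return rng.sample(dataset, k)

# Hypothetical toy training dataset D of (text, label) pairs.
dataset = [
    ("great movie", "positive"),
    ("boring plot", "negative"),
    ("loved the cast", "positive"),
    ("waste of time", "negative"),
]
examples = select_examples_uniform(dataset, k=2, seed=0)
```

With k = 2, each prompt carries two original examples, which keeps the prompt short and the cost per generation low.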

As another example, the processor 120 may select the original data from the training dataset D in consideration of the semantic diversity of the text. Semantic diversity is a measure of how varied the meanings of the texts are. When semantic diversity is low, the generated data is highly similar to the originals in the existing dataset, so the data augmentation effect is low. The higher the semantic diversity, the higher the probability of obtaining new data that is useful for data augmentation.

To compute semantic diversity, a vector is extracted for each text using a sentence vector representation (for example, bag-of-words, aggregated word2vec, BERT embeddings, or BLEURT). The distance between vectors may be computed using a similarity measure (for example, cosine similarity), or the pairwise semantic distance may be computed using a network such as BLEURT. Sampling may then be performed with higher weights assigned to pairs that are far apart (pairs with a large semantic distance).

The processor 120 may select as many pieces of original data from the training dataset D as there are classes, that is, label types. Among original data with the same label, the processor 120 may preferentially select data whose pairwise semantic distance is large. An α (alpha) multiplier may be applied to the semantic distance, where α is a hyperparameter that needs to be optimized.
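A minimal sketch of distance-weighted sampling follows, assuming bag-of-words vectors and cosine distance. Interpreting α as an exponent applied to the distance is an assumption made here for illustration, as are all function names; a real system would more likely use BERT embeddings or BLEURT as the text mentions.

```python
import math
import random
from collections import Counter
from itertools import combinations

def bow_vector(text):
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())

def cosine_distance(a, b):
    """1 - cosine similarity, clamped to [0, 1] against rounding error."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values()) * sum(c * c for c in b.values()))
    similarity = dot / norm if norm else 0.0
    return min(1.0, max(0.0, 1.0 - similarity))

def sample_diverse_pair(texts, alpha=1.0, seed=None):
    """Sample an index pair with weight proportional to distance ** alpha."""
    rng = random.Random(seed)
    vectors = [bow_vector(t) for t in texts]
    pairs = list(combinations(range(len(texts)), 2))
    # Assumption: alpha is used as an exponent that sharpens the preference
    # for semantically distant pairs.
    weights = [cosine_distance(vectors[i], vectors[j]) ** alpha for i, j in pairs]
    if sum(weights) == 0:
        return rng.choice(pairs)  # all texts identical: fall back to uniform
    return rng.choices(pairs, weights=weights, k=1)[0]

# Hypothetical texts that share the same label.
same_label_texts = ["good movie", "good movie indeed", "terrible acting"]
pair = sample_diverse_pair(same_label_texts, alpha=2.0, seed=0)
```

A larger α concentrates the sampling on the most distant pairs, while α = 0 reduces to uniform sampling over pairs.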

In step 320, the processor 120 may construct a prompt that serves as the input of the language model. The processor 120 may construct the input prompt of the language model using the original data selected in step 310. The processor 120 may create a dedicated prompt template that appropriately reflects the characteristics of the given NLP problem, and the prompt template may include the definition of the task and meta-information. In other words, the processor 120 may process the original data selected from the dataset to construct a prompt in natural language form; the prompt is produced in a format that the language model can understand and is given as input to the language model. When the original data is labeled, the processor 120 designs the prompt so that a sentence is generated together with its label information.

Given original data $D_s = \{(x_i, y_i)\}_{i=1}^{k}$ sampled from the training dataset D, the processor 120 may generate a prompt consisting of a description header, a list of text-label pairs, and an augmentation prefix. The prompt carries information about each task so that the language model can generalize better to the data distribution. Such task indicators are unique to each task and may provide meta-information about the task.

The format of the prompt itself may vary. As one example, the prompt may include a text type (for example, review or article), a label type (for example, sentiment or category), and a label-token verbalizer that makes it possible to identify the label position.

The text type T is the meta-type of the input text x; for example, in movie review sentiment analysis, the text type is movie review. The label type L is the meta-type of the label class y; for example, in movie review sentiment analysis, the label type is sentiment. To formulate a prompt, the label-token verbalizer $v: \mathcal{Y} \rightarrow \mathcal{V}$ requires a one-to-one mapping between the label classes $\mathcal{Y}$ and word tokens in the language model's vocabulary $\mathcal{V}$.

These three pieces of meta-information constitute the task specification $S = (T, L, v)$. Each task T requires a task specification $S_T$ with which prompts can be formulated. By default, the processor 120 may generate prompts using the generic task specification $S_{\text{generic}} = (\text{text}, \text{label}, I)$, where I denotes the identity function, which assumes that the class labels exist in the vocabulary $\mathcal{V}$.

In short, as shown in FIG. 4, for a given task T the processor 120 may construct the input prompt of the language model by combining a task specification 410, data examples 420 (original data sampled from the training dataset 400), and a prompt template 430 that takes the characteristics of the given task T into account.

As the specific example of FIG. 5 shows, the processor 120 may produce the input prompt 540 of the language model by arranging the task specification 410, the data examples 420, and the prompt template 430 in a format that the language model can understand. An example of the task specification 410 is shown in Table 1, and an example of the prompt template 430 is shown in Table 2. When data examples 420 are given, the input prompt 540 may be constructed as shown in Table 3.
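The combination of a task specification, sampled examples, and a template can be sketched as follows. The template wording, the `TaskSpec` class, and the example data are hypothetical stand-ins chosen for illustration; they are not the contents of Tables 1 to 3.

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    text_type: str    # meta-type of the input text, e.g. "movie review"
    label_type: str   # meta-type of the label, e.g. "sentiment"
    verbalizer: dict  # label class -> word token (one-to-one)

def build_prompt(spec, examples):
    """Description header + text-label pair list + augmentation prefix."""
    text_name = spec.text_type.capitalize()
    label_name = spec.label_type.capitalize()
    tokens = ", ".join(f"'{tok}'" for tok in spec.verbalizer.values())
    # Hypothetical description header carrying the task meta-information.
    lines = [
        f"Each item in the following list contains a {spec.text_type} "
        f"and the respective {spec.label_type}. {label_name} is one of {tokens}."
    ]
    for text, label in examples:
        lines.append(f"{text_name}: {text} ({label_name}: {spec.verbalizer[label]})")
    # Augmentation prefix: the model continues from here with a new example.
    lines.append(f"{text_name}:")
    return "\n".join(lines)

spec = TaskSpec("movie review", "sentiment", {"pos": "positive", "neg": "negative"})
prompt = build_prompt(spec, [("great movie", "pos"), ("boring plot", "neg")])
```

Ending the prompt with the bare text-type prefix invites the language model to continue the list with a new text followed by its label in the same parenthesized pattern.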

Referring again to FIG. 3, in step 330 the processor 120 may input the prompt constructed in step 320 into the language model and generate, from the language model, natural language text that contains new data. In other words, after inputting the prompt into the language model, the processor 120 may obtain a generation result through the completion function of the language model.

In step 340, the processor 120 may analyze the natural language generation result to extract new data. The language model may generate natural language that follows the pattern of the prompt given as input, and new data may be extracted by pattern analysis of the generated text.

As shown in FIG. 6, the processor 120 may input the prompt 540 into the language model 210 and analyze the pattern of the natural language generated by the language model 210 to obtain (new sentence, label information of that sentence) pairs as augmented data 650. As one example, the processor 120 may obtain a label distribution using the language modeling probability distribution over the natural language tokens that correspond to the labels in the prompt 540.

In the prompt-based approach, the augmented text x' and the label y' are generated consecutively as natural language text following the prompt. Because the predefined prompt template, filled with the sampled original data, leads the language model to generate the (x', y') structure, each value may be extracted by pattern matching. In addition, generating the text and the label jointly ensures that the generated text is associated with the correct label.
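The pattern-matching extraction can be illustrated with a regular expression. The line format "Movie review: ... (Sentiment: ...)" is an assumed template used only for this sketch, and the generated string below is a made-up stand-in for a language model completion.

```python
import re

def extract_pairs(generated, text_type="Movie review", label_type="Sentiment"):
    """Extract (text, label) pairs from a completion that follows the
    assumed '<Text type>: <text> (<Label type>: <label>)' line pattern."""
    pattern = re.compile(
        rf"^\s*(?:{re.escape(text_type)}:\s*)?(?P<text>.+?)\s*"
        rf"\({re.escape(label_type)}:\s*(?P<label>[^)]+?)\s*\)\s*$",
        re.MULTILINE,
    )
    return [(m.group("text"), m.group("label")) for m in pattern.finditer(generated)]

# Hypothetical completion: the first line continues the augmentation prefix,
# and the last line is truncated, so it carries no label and is skipped.
generated = (
    " A touching story. (Sentiment: positive)\n"
    "Movie review: Too long and dull. (Sentiment: negative)\n"
    "Movie review: unfinished"
)
pairs = extract_pairs(generated)
```

Lines that do not match the pattern, such as a completion cut off before the label, are simply dropped rather than repaired.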

The prompt design of this embodiment guarantees that the label token corresponding to […] is generated after the text x. The processor 120 may perform pseudo-labeling using the language model, and may normalize the likelihoods of generating the label tokens to obtain soft-label probabilities for the augmented text x'.

The pseudo-label probability that the augmented text x' is labeled with the label y' is given by Equation (2).

Here, […] denotes the language modeling likelihood, and […] is the function that constitutes the given task specification.

In this embodiment, text perturbation, pseudo-labeling, and knowledge distillation can be effectively combined in a single augmentation task. In practice, the pseudo-labeled new data is used for training together with the original data, using a cross-entropy loss.

Referring to FIG. 7, the processor 120 may generate new data based on the probability distribution produced by the completion function of the language model 210. In doing so, it may locate the label by pattern analysis with reference to the label-token delimiter and use the probabilities of the tokens corresponding to the label to generate a soft label, that is, a label expressed as a distribution.

More specifically, the processor 120 converts all natural language provided as the prompt input into a particular tokenization scheme (for example, integer indices) and then inputs it into the language model 210. To obtain a (new sentence, label information for that sentence) pair, generation proceeds autoregressively, with the probability distribution of the previous token passed on as input for the next token; here, a heuristic beam search is used so that only the top n probabilities are carried forward as input for the next token. By keeping the top n of the joint probabilities, each obtained by multiplying a token's own probability distribution by the probability distribution of the preceding tokens, sequences with high probability can be extracted.
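The top-n selection over joint probabilities can be illustrated with a small beam search. The sketch below is only an illustration under stated assumptions: `toy_next_token_probs` is a hand-written stand-in for the language model 210, the vocabulary and probabilities are invented, and `beam_width` plays the role of n. Log probabilities are summed instead of multiplying raw probabilities, which is equivalent and numerically safer.

```python
import math

def toy_next_token_probs(prefix):
    """Toy next-token model (an assumption): maps a prefix to a distribution."""
    table = {
        (): {"good": 0.6, "bad": 0.4},
        ("good",): {"movie": 0.7, "<eos>": 0.3},
        ("bad",): {"movie": 0.9, "<eos>": 0.1},
    }
    # Any prefix not listed above ends the sequence with certainty.
    return table.get(tuple(prefix), {"<eos>": 1.0})

def beam_search(next_token_probs, beam_width, max_len):
    """Keep the beam_width sequences with the highest joint probability."""
    beams = [((), 0.0)]  # (token sequence, log of joint probability)
    for _ in range(max_len):
        candidates = []
        for seq, logp in beams:
            if seq and seq[-1] == "<eos>":
                candidates.append((seq, logp))  # finished sequences carry over
                continue
            for token, p in next_token_probs(seq).items():
                # Joint probability = probability of the prefix * probability of the token.
                candidates.append((seq + (token,), logp + math.log(p)))
        # Carry only the top candidates forward to the next step.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return [(seq, math.exp(logp)) for seq, logp in beams]

results = beam_search(toy_next_token_probs, beam_width=2, max_len=3)
```

With a beam width of 2, the two highest-probability sequences survive each step, so low-probability continuations are pruned before they are expanded further.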

For example, as shown in FIG. 8, the processor 120 may construct a soft label using the probabilities of the tokens corresponding to the labels. In other words, the soft label may be extracted from the normalized label-token distribution predicted by the language model 210.
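As a rough illustration of extracting a soft label from a normalized label-token distribution, the sketch below divides each label token's likelihood by the sum over all label tokens. The function name `soft_label` and the example likelihoods are assumptions made for this illustration; the values a real language model would assign are not given in this description.

```python
def soft_label(label_token_probs):
    """Normalize label-token likelihoods so that they sum to 1."""
    total = sum(label_token_probs.values())
    if total <= 0:
        raise ValueError("label-token likelihoods must have a positive sum")
    return {label: p / total for label, p in label_token_probs.items()}

# Hypothetical likelihoods assigned to each label token at the label position.
# They need not sum to 1, because the model also gives mass to non-label tokens.
likelihoods = {"positive": 0.30, "negative": 0.10}
distribution = soft_label(likelihoods)  # roughly 0.75 positive, 0.25 negative
```

Restricting the normalization to label tokens discards the probability mass the model places on unrelated tokens, which is why the result can be read as a distribution over the labels.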

Referring to FIG. 9, when data example 1 (910), classified as positive, and data example 2 (920), classified as negative, are given as original data, new data 930 may be extracted from the generation result of the language model 210.

The processor 120 analyzes the result generated from the prompt and extracts the synthesized new data and its label information. Repeating this operation several times yields diverse data that was not originally present in the data set, together with accurate classification information, and mixing this data with the existing data makes it usable for fine-tuning a downstream NLP model. The new data 930 is output generated with appropriate reference to the linguistic and structural features of the two given original data 910 and 920; because it appropriately mixes positive and negative, the new data 930 is entirely new data that did not previously exist.
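The repeat-and-mix procedure described above might be organized as in the following sketch. Everything named here is hypothetical: `augment` is an illustrative helper, `stub_generate` stands in for the prompt construction, language model call, and extraction steps, and uniform random sampling is used only as a simple placeholder for however the original data is actually selected.

```python
import random

def augment(dataset, generate, rounds, k=2, seed=0):
    """Repeat generation `rounds` times and mix the results with the original data."""
    rng = random.Random(seed)
    augmented = list(dataset)  # start from a copy of the existing data
    for _ in range(rounds):
        examples = rng.sample(dataset, k)     # placeholder selection of original data
        augmented.extend(generate(examples))  # append new (text, soft label) pairs
    return augmented

def stub_generate(examples):
    """Stand-in for the language model: returns one blended example."""
    blended = " / ".join(text for text, _ in examples)
    return [("mix of: " + blended, {"positive": 0.5, "negative": 0.5})]

original = [
    ("great film", {"positive": 1.0, "negative": 0.0}),
    ("dull film", {"positive": 0.0, "negative": 1.0}),
]
mixed_set = augment(original, stub_generate, rounds=3)
```

The returned list keeps the original examples first and appends the generated ones, so it can be passed directly to a fine-tuning routine that accepts label distributions.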

Adding the new data 930 to the training data set, that is, augmenting the data, can improve the performance of the final classifier.

The original data 910 and 920 are data with hard labels, each carrying a single label, whereas the new data 930 may be generated as data with a soft label in the form of a distribution over at least two labels.

Because a soft label can be converted into a hard label by, for example, a max operation, the original data 910 and 920 and the new data 930 can all be used in hard-label form. Moreover, because the training process uses a loss function such as cross entropy, both the hard-label form and the soft-label form can be used.
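The two points above, converting a soft label to a hard label with a max operation and using cross entropy with either form, can be checked with a few lines of code. The helper names and the numeric values below are assumptions chosen for illustration, and the predicted distribution is invented rather than taken from any trained model.

```python
import math

def to_hard_label(soft):
    """Max operation: pick the label with the highest probability."""
    return max(soft, key=soft.get)

def to_one_hot(label, labels):
    """Express a hard label as a distribution with all mass on one label."""
    return {name: 1.0 if name == label else 0.0 for name in labels}

def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(predicted[name] + eps) for name, t in target.items())

labels = ["positive", "negative"]
soft = {"positive": 0.75, "negative": 0.25}      # soft label of a generated example
predicted = {"positive": 0.6, "negative": 0.4}   # hypothetical classifier output

hard = to_hard_label(soft)                        # "positive"
loss_hard = cross_entropy(to_one_hot(hard, labels), predicted)
loss_soft = cross_entropy(soft, predicted)
```

Because a hard label is just a one-hot distribution, the same loss function handles both forms; the soft target additionally penalizes the classifier for ignoring the minority label.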

Thus, according to embodiments of the present invention, augmented data is generated by transforming or extending existing data based on a language model, and the knowledge held by the language model is transferred through the augmented data. This can markedly reduce the effort required for data collection and can achieve the effect of a lighter model.

The apparatus described above may be implemented by hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the OS. The processing device may also access, record, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description may refer to a single processing device, but those skilled in the art will understand that a processing device may include multiple processing elements and/or multiple types of processing elements. For example, a processing device may include multiple processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct a processing device independently or collectively. Software and/or data may be embodied in any kind of machine, component, physical device, computer storage medium, or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over computer systems connected by a network and may be stored or executed in a distributed manner. Software and data may be recorded on one or more computer-readable recording media.

The methods according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. Here, the medium may continuously store a computer-executable program or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or a combination of several, and is not limited to a medium directly connected to a particular computer system; it may also be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy (registered trademark) disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by application stores that distribute applications, or by sites and servers that supply or distribute various other software.

Although the embodiments have been described above with reference to limited embodiments and drawings, those skilled in the art will be able to make various modifications and variations from the above description. For example, appropriate results can be achieved even if the described techniques are performed in an order different from that of the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that described, or are substituted or replaced by other components or equivalents.

Accordingly, other embodiments that are equivalent to the claims also fall within the scope of the appended claims.

210: Language model
220: PLM

Claims

1. A data generation method performed by a computer device,
wherein the computer device includes at least one processor configured to execute computer-readable instructions contained in a memory,
the data generation method comprising:
constructing, by the at least one processor, a prompt to serve as an input sentence for a language model using original data; and
inputting, by the at least one processor, the prompt into the language model and generating, from the language model, new data and label information for the new data,
wherein the generating step comprises generating a soft label indicating a label distribution using a probability distribution over the natural language corresponding to the labels in the prompt.

2. The data generation method according to claim 1, further comprising selecting, by the at least one processor, the original data from a training data set according to the semantic diversity of the text.

3. The data generation method according to claim 2, wherein the selecting step comprises selecting, from the training data set, as many items of the original data as there are label types.

4. The data generation method according to claim 1, wherein the constructing step comprises constructing the prompt in a format including a text type and a label type.

5. The data generation method according to claim 1, wherein the constructing step comprises constructing the prompt in a format including a text type, a label type, and a label-token delimiter.

6. The data generation method according to claim 1, wherein the constructing step comprises processing the original data so that the prompt is constructed in the same natural language format as the original data.

7. A data generation method performed by a computer device,
wherein the computer device includes at least one processor configured to execute computer-readable instructions contained in a memory,
the data generation method comprising:
constructing, by the at least one processor, a prompt to serve as an input sentence for a language model using original data, the prompt being constructed by combining a task specification, the original data, and a prompt template; and
inputting, by the at least one processor, the prompt into the language model and generating, from the language model, new data and label information for the new data.

8. A data generation method performed by a computer device,
wherein the computer device includes at least one processor configured to execute computer-readable instructions contained in a memory,
the data generation method comprising:
constructing, by the at least one processor, a prompt to serve as an input sentence for a language model using original data; and
inputting, by the at least one processor, the prompt into the language model and generating, from the language model, new data and label information for the new data,
wherein the generating step comprises extracting the new data and the label information by an autoregressive method in which, for the natural language corresponding to the text and the labels in the prompt, the probability distribution of a previous token is passed on as input for a next token.

9. The data generation method according to claim 8, wherein the extracting step comprises passing on, as the input for the next token, the top portion of the probabilities in the probability distribution of the previous token by beam search using a heuristic.

10. A data generation method performed by a computer device,
wherein the computer device includes at least one processor configured to execute computer-readable instructions contained in a memory,
the data generation method comprising:
selecting, by the at least one processor, original data from a training data set according to the semantic diversity of the text;
constructing, by the at least one processor, a prompt to serve as an input sentence for a language model using the original data; and
inputting, by the at least one processor, the prompt into the language model and generating, from the language model, new data and label information for the new data.

11. A computer program for causing a computer device to execute the data generation method according to any one of claims 1 to 10.

12. A computer device comprising at least one processor configured to execute computer-readable instructions contained in a memory,
wherein the at least one processor performs:
a process of constructing a prompt to serve as an input sentence for a language model using original data; and
a process of inputting the prompt into the language model and generating, from the language model, new data and label information for the new data,
and wherein, in the generating process, the at least one processor generates a soft label indicating a label distribution using a probability distribution over the natural language corresponding to the labels in the prompt.

13. The computer device according to claim 12, wherein the at least one processor selects the original data from a training data set according to the semantic diversity of the text.

14. The computer device according to claim 13, wherein the at least one processor selects, from the training data set, as many items of the original data as there are label types.

15. The computer device according to any one of claims 12 to 14, wherein the at least one processor constructs the prompt in a format including a text type and a label type.

16. The computer device according to any one of claims 12 to 14, wherein the at least one processor constructs the prompt in a format including a text type, a label type, and a label-token delimiter.

17. A computer device comprising at least one processor configured to execute computer-readable instructions contained in a memory,
wherein the at least one processor performs:
a process of constructing a prompt to serve as an input sentence for a language model using original data, the prompt being constructed by combining a task specification, the original data, and a prompt template; and
a process of inputting the prompt into the language model and generating, from the language model, new data and label information for the new data.

18. A computer device comprising at least one processor configured to execute computer-readable instructions contained in a memory,
wherein the at least one processor performs:
a process of constructing a prompt to serve as an input sentence for a language model using original data; and
a process of inputting the prompt into the language model and generating, from the language model, new data and label information for the new data,
and wherein, in the generating process, the at least one processor extracts the new data and the label information by an autoregressive method in which, for the natural language corresponding to the text and the labels in the prompt, the probability distribution of a previous token is passed on as input for a next token.

19. The computer device according to claim 18, wherein the at least one processor passes on, as the input for the next token, the top portion of the probabilities in the probability distribution of the previous token by beam search using a heuristic.

20. A computer device comprising at least one processor configured to execute computer-readable instructions contained in a memory,
wherein the at least one processor performs:
a process of selecting original data from a training data set according to the semantic diversity of the text;
a process of constructing a prompt to serve as an input sentence for a language model using the original data; and
a process of inputting the prompt into the language model and generating, from the language model, new data and label information for the new data.