JP2023018624A

JP2023018624A - Data generation method using language model, computer device, and computer program

Info

Publication number: JP2023018624A
Application number: JP2021209463A
Authority: JP
Inventors: ガンミンユ; Kang Min Yoo; ドンジュパク; Dongju Park; ジェウクカン; Jaewook Kang; サンウイ; Sang Woo Lee; ウミョンパク; Woomyoung Park
Original assignee: Naver Corp
Current assignee: Naver Corp
Priority date: 2021-07-27
Filing date: 2021-12-23
Publication date: 2023-02-08
Anticipated expiration: 2041-12-23
Also published as: JP7343566B2; KR102710087B1; KR20230016794A

Abstract

To provide a data generation method, a computer device, and a computer program that use a language model.

SOLUTION: A data generation method using a language model comprises the steps of: constructing a prompt that is an input sentence of a language model using original data; and inputting the prompt into the language model and generating new data and label information for the new data from the language model.

SELECTED DRAWING: Figure 3

Description

新規性喪失の例外適用申請有り There is an application for exception to loss of novelty

以下の説明は、テキストデータを生成する技術に関する。 The following description relates to techniques for generating text data.

To build a natural language processing (NLP) model, the data needed for model training is first secured through data design and data collection. After the model is trained on the secured data, data design and model training are repeatedly improved while performance is evaluated.

Since the emergence of pretrained language models (PLMs), a common approach has been to fine-tune a PLM into an NLP model that solves a specific domain or task.

さらに、ＮＬＰモデルを効率的に学習するためには、データ拡張（ｄａｔａａｕｇｍｅｎｔａｔｉｏｎ）技法が使用されている。 Furthermore, data augmentation techniques have been used to efficiently train NLP models.

As one example, techniques such as easy data augmentation (EDA), which generate text data for model training by superficially manipulating phrases, produce sentences with low grammaticality, and the meaning of a generated sentence may differ greatly from the intended meaning.

As another example, techniques such as back-translation, which generate similar sentences using a machine translation model, rely on a translation model trained on a translation corpus with specific linguistic characteristics. As a result, the generated style cannot reflect the linguistic characteristics of the existing text data (the target of generation), so such techniques are not suitable for general-purpose use.

多様な言語特性のコーパスによって学習された大規模言語モデルを利用して自然語生成結果を導き出すことができ、導き出された生成結果から新規データを抽出することができる、データ拡張技術を提供する。 To provide a data augmentation technology capable of deriving a natural language generation result by using a large-scale language model trained by a corpus of various language characteristics and extracting new data from the derived generation result.

Provided is a data generation method performed by a computer device, the computer device including at least one processor configured to execute computer-readable instructions contained in a memory, the data generation method including: constructing, by the at least one processor, a prompt that is an input sentence of a language model using original data; and inputting, by the at least one processor, the prompt into the language model and generating, from the language model, new data and label information for the new data.

一側面によると、前記生成する段階は、前記プロンプト内のラベルに該当する自然語に対する確率分布（ｐｒｏｂａｂｉｌｉｔｙｄｉｓｔｒｉｂｕｔｉｏｎ）を利用してラベル分布を示すソフトラベルを生成する段階を含んでよい。 According to one aspect, the generating may include generating soft labels representing label distributions using a probability distribution for natural words corresponding to labels in the prompt.

他の側面によると、前記データ生成方法は、前記少なくとも１つのプロセッサが、テキストの意味的多様性に応じて、学習データセットから前記原本データを選択する段階をさらに含んでよい。 According to another aspect, the data generation method may further include the at least one processor selecting the original data from a training data set according to semantic diversity of text.

また他の側面によると、前記選択する段階は、前記学習データセットからラベルタイプの個数だけの前記原本データを選択してよい。 According to another aspect, the selecting step may select the original data corresponding to the number of label types from the learning data set.

また他の側面によると、前記構成する段階は、テキストタイプとラベルタイプが含まれた形式で前記プロンプトを構成してよい。 According to yet another aspect, the configuring step may configure the prompt in a format that includes a text type and a label type.

また他の側面によると、前記構成する段階は、テキストタイプとラベルタイプ、およびラベル位置区分子（ｖｅｒｂａｌｉｚｅｒ）が含まれた形式で前記プロンプトを構成してよい。 According to another aspect, the constructing step may construct the prompt in a format including a text type, a label type, and a label verbalizer.

また他の側面によると、前記構成する段階は、前記原本データを加工し、前記原本データと同一形式の自然語形態で前記プロンプトを構成してよい。 According to another aspect, the constructing step may process the original data to construct the prompt in a natural language form having the same format as the original data.

また他の側面によると、前記構成する段階は、タスク仕様（ｔａｓｋｓｐｅｃｉｆｉｃａｔｉｏｎ）、前記原本データ、およびプロンプトテンプレート（ｔｅｍｐｌａｔｅ）を組み合わせて前記プロンプトを構成してよい。 According to yet another aspect, the constructing step may construct the prompt by combining a task specification, the original data, and a prompt template.

According to yet another aspect, the generating may include extracting the new data and the label information using an autoregressive method in which, for the natural language corresponding to the text and labels in the prompt, the probability distribution of the previous token is passed as input for the next token.

According to still another aspect, the extracting may include passing, by a heuristic beam search, the top portion of the probabilities in the probability distribution of the previous token as input for the next token.

前記データ生成方法を前記コンピュータ装置に実行させるためのコンピュータプログラムを提供する。 A computer program is provided for causing the computer device to execute the data generation method.

Provided is a computer device including at least one processor configured to execute computer-readable instructions contained in a memory, wherein the at least one processor performs: a process of constructing a prompt that is an input sentence of a language model using original data; and a process of inputting the prompt into the language model and generating, from the language model, new data and label information for the new data.

According to embodiments of the present invention, new data is generated by transforming or augmenting original data using a language model, and the knowledge held by the language model is transferred through the new data. This can significantly reduce the man-hours invested in data collection and achieve a model-lightening effect.

FIG. 1 is a block diagram for explaining an example of the internal configuration of a computer device in one embodiment of the present invention.

FIG. 2 is a diagram for explaining the concept of text augmentation using a large-scale language model in one embodiment of the present invention.

FIG. 3 is a flowchart showing an example of a data generation method that can be executed by a computer device in one embodiment of the present invention.

FIG. 4 is a diagram for explaining the process of constructing an input prompt for a language model in one embodiment of the present invention.

FIG. 5 is a diagram for explaining a data augmentation process in one embodiment of the present invention.

FIG. 6 is a diagram for explaining a data augmentation process in one embodiment of the present invention.

FIG. 7 is a diagram for explaining the process of generating a new sentence and label information using a language model in one embodiment of the present invention.

FIG. 8 is a diagram for explaining the process of generating a soft label with a distribution in one embodiment of the present invention.

FIG. 9 is a diagram showing an example of original data and an example of new data generated from the original data in one embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

本発明の実施形態は、言語モデルを利用したテキストデータ生成技術に関する。 The embodiments of the present invention relate to text data generation technology using language models.

Embodiments including the matters specifically disclosed herein can, by using a large-scale language model, generate sentences that are consistent with the characteristics of existing text data and are also highly grammatical and natural. In addition, high-quality label information can be generated for those sentences.

FIG. 1 is a block diagram showing an example of a computer device in one embodiment of the present invention. For example, the data generation system according to embodiments of the present invention may be implemented by the computer device 100 shown in FIG. 1.

As shown in FIG. 1, the computer device 100 may include a memory 110, a processor 120, a communication interface 130, and an input/output interface 140 as components for executing the data generation method according to embodiments of the present invention.

メモリ１１０は、コンピュータ読み取り可能な記録媒体であって、ＲＡＭ（ｒａｎｄｏｍａｃｃｅｓｓｍｅｍｏｒｙ）、ＲＯＭ（ｒｅａｄｏｎｌｙｍｅｍｏｒｙ）、およびディスクドライブのような永続的大容量記録装置を含んでよい。ここで、ＲＯＭやディスクドライブのような永続的大容量記録装置は、メモリ１１０とは区分される別の永続的記録装置としてコンピュータ装置１００に含まれてもよい。また、メモリ１１０には、オペレーティングシステムと、少なくとも１つのプログラムコードが記録されてよい。このようなソフトウェア構成要素は、メモリ１１０とは別のコンピュータ読み取り可能な記録媒体からメモリ１１０にロードされてよい。このような別のコンピュータ読み取り可能な記録媒体は、フロッピー（登録商標）ドライブ、ディスク、テープ、ＤＶＤ／ＣＤ－ＲＯＭドライブ、メモリカードなどのコンピュータ読み取り可能な記録媒体を含んでよい。他の実施形態において、ソフトウェア構成要素は、コンピュータ読み取り可能な記録媒体ではない通信インタフェース１３０を通じてメモリ１１０にロードされてもよい。例えば、ソフトウェア構成要素は、ネットワーク１６０を介して受信されるファイルによってインストールされるコンピュータプログラムに基づいてコンピュータ装置１００のメモリ１１０にロードされてよい。 The memory 110 is a computer-readable storage medium and may include random access memory (RAM), read only memory (ROM), and permanent mass storage devices such as disk drives. Here, a permanent mass storage device such as a ROM or disk drive may be included in computer device 100 as a separate permanent storage device separate from memory 110 . Also stored in memory 110 may be an operating system and at least one program code. Such software components may be loaded into memory 110 from a computer-readable medium separate from memory 110 . Such other computer-readable recording media may include computer-readable recording media such as floppy drives, disks, tapes, DVD/CD-ROM drives, memory cards, and the like. In other embodiments, software components may be loaded into memory 110 through communication interface 130 that is not a computer-readable medium. For example, software components may be loaded into memory 110 of computing device 100 based on computer programs installed by files received over network 160 .

The processor 120 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. The instructions may be provided to the processor 120 by the memory 110 or the communication interface 130. For example, the processor 120 may be configured to execute instructions received in accordance with program code stored in a storage device such as the memory 110.

The communication interface 130 may provide a function for the computer device 100 to communicate with other devices over the network 160. As an example, requests, instructions, data, files, and the like that the processor 120 of the computer device 100 generates in accordance with program code stored in a storage device such as the memory 110 may be transmitted to other devices over the network 160 under the control of the communication interface 130. Conversely, signals, instructions, data, files, and the like from other devices may be received by the computer device 100 through the communication interface 130 of the computer device 100 via the network 160. Signals, instructions, data, and the like received through the communication interface 130 may be passed to the processor 120 or the memory 110, and files and the like may be stored in a storage medium (the permanent storage device described above) that the computer device 100 may further include.

The communication method is not limited, and may include not only communication methods that use a communication network that the network 160 may include (for example, a mobile communication network, wired Internet, wireless Internet, or a broadcast network) but also short-range wireless communication between devices. For example, the network 160 may include any one or more of networks such as a personal area network (PAN), a local area network (LAN), a campus area network (CAN), a metropolitan area network (MAN), a wide area network (WAN), a broadband network (BBN), and the Internet. Further, the network 160 may include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited to these.

The input/output interface 140 may be a means for interfacing with the input/output device 150. For example, an input device may include a device such as a microphone, keyboard, camera, or mouse, and an output device may include a device such as a display or speaker. As another example, the input/output interface 140 may be a means for interfacing with a device in which input and output functions are integrated, such as a touch screen. The input/output device 150 may be configured as a single device together with the computer device 100.

In other embodiments, the computer device 100 may include fewer or more components than those of FIG. 1. However, most conventional components need not be explicitly illustrated. For example, the computer device 100 may be implemented to include at least some of the input/output devices 150 described above, or may further include other components such as a transceiver, a camera, various sensors, and a database.

The large-scale language model used in the present invention may refer to a language model capable of inference without fine-tuning by using a method such as few-shot learning (FSL), and has more than ten times as many parameters as conventional general language models (for example, 100 billion parameters or more). For example, large-scale language models such as GPT-3 (Generative Pre-trained Transformer 3) and HyperClova are strong few-shot learners that can be controlled with natural-text prompts, and are capable of in-context learning, that is, the ability to understand a pattern from only a small amount of data given in the prompt and solve an NLP problem.

本実施形態は、大規模言語モデルを活用して原本データから新規データを生成する新たなデータ拡張技法に関する。さらに、言語モデルで予測したソフトラベル（ｓｏｆｔｌａｂｅｌ）を活用して大規模言語モデルで知識を効果的に蒸溜すると同時に、テキスト摂動（ｔｅｘｔｕａｌｐｅｒｔｕｒｂａｔｉｏｎｓ）を生成することができる。 This embodiment relates to a new data augmentation technique that utilizes a large-scale language model to generate new data from original data. In addition, it is possible to leverage soft labels predicted by the language model to effectively distill knowledge in the large-scale language model while simultaneously generating textual perturbations.

図２は、本発明の一実施形態における、大規模言語モデルを利用したテキスト拡張の概念を説明するための図である。 FIG. 2 is a diagram for explaining the concept of text extension using a large-scale language model in one embodiment of the present invention.

図２を参照すると、本実施形態において、大規模言語モデル２１０は、モデル学習に必要な合成テキストデータ（ｓｙｎｔｈｅｔｉｃｔｅｘｔｄａｔａ）を生成するためのバックボーンとして使用される。 Referring to FIG. 2, in this embodiment, a large scale language model 210 is used as the backbone for generating the synthetic text data required for model training.

本実施形態によると、大規模言語モデル２１０を使用することで、原本データから、合成でありながらも極事実的（ｈｙｐｅｒ－ｒｅａｌｉｓｔｉｃ）な新規データを生成することができる。 According to the present embodiment, the large language model 210 can be used to generate synthetic yet hyper-realistic new data from the original data.

When text data is given, in labeled or unlabeled form, the data may be converted into a prompt input sentence in natural language form, and the converted prompt input sentence may be given to the language model 210 as input to derive a natural language generation result. New data may be extracted by analyzing the derived generation result; the new data has the same form as the original data, labeled or unlabeled. The extracted new data can be added to the original text data to aid data collection, and training a model on this data can improve the model's performance.

In other words, this embodiment uses a prompt-based approach to generate new data with the large-scale language model 210 for the purpose of data augmentation. Knowledge distillation can be achieved by training a small classification model using the new data, which is inspired by the original data, together with the soft labels predicted by the large-scale language model 210.

図３は、本発明の一実施形態における、コンピュータ装置が実行することのできるデータ生成方法の例を示したフローチャートである。 FIG. 3 is a flow chart illustrating an example of a data generation method that can be executed by a computing device in one embodiment of the present invention.

The data generation method according to this embodiment may be executed by the computer device 100 described above. In this case, the processor 120 of the computer device 100 may be implemented to execute control instructions according to the code of the operating system and the code of at least one program contained in the memory 110. Here, the processor 120 may control the computer device 100, in accordance with the control instructions provided by the code stored in the computer device 100, so that the computer device 100 executes steps 310 to 340 included in the data generation method of FIG. 3.

本実施形態に係るデータ生成方法は、データ分布に基づいて、極めて流暢な新規データを生成することができる。 The data generation method according to the present embodiment can generate extremely fluent new data based on data distribution.

図３を参照すると、段階３１０で、プロセッサ１２０は、データセットからプロンプトに使用する原本データを選定してよい。 Referring to FIG. 3, at step 310, processor 120 may select original data to use for prompts from the data set.

The following describes how the original data is selected, taking data generation for a text classification task as an example. Given a classification task T, the training data set is a set of pairs of text x and label y, $D = \{(x_i, y_i)\}_{i=1}^{N}$.

一例として、プロセッサ１２０は、学習データセットＤからｋ個の原本データをランダムに選択してよい。プロセッサ１２０は、一様分布（ｕｎｉｆｏｒｍｄｉｓｔｒｉｂｕｔｉｏｎ）を利用してｋ個の原本データを選択してよい（数式（１））。 As an example, the processor 120 may randomly select k original data from the learning data set D. The processor 120 may select k original data using a uniform distribution (Equation (1)).

The number of pieces of original data, k, may be determined in consideration of cost and performance. For example, the processor 120 may set k to 2 and select two pieces of original data from the training data set D.
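The random selection described above can be sketched as follows. This is only an illustration: the function name `sample_examples`, the toy data set, and the choice to sample without replacement are assumptions, not details taken from the specification.

```python
import random


def sample_examples(dataset, k=2, seed=None):
    """Draw k distinct (text, label) pairs uniformly at random.

    Sampling without replacement is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    return rng.sample(dataset, k)


# Hypothetical toy training set of (text, label) pairs.
dataset = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I walked out halfway through.", "negative"),
    ("A warm, funny, and generous film.", "positive"),
    ("Two hours I will never get back.", "negative"),
]

picked = sample_examples(dataset, k=2, seed=0)
```

A larger k gives the language model more context per prompt but increases the cost of each request, which is the cost/performance trade-off mentioned above.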

As another example, the processor 120 may select the original data from the training data set D in consideration of the semantic diversity of the text. Semantic diversity is an indicator of how varied the meanings of the texts are. When semantic diversity is low, the generated data is highly similar to the originals in the existing data set, so the augmentation effect is small; the higher the semantic diversity, the higher the probability of obtaining new data that is useful for data augmentation.

意味的多様性を計算する方法として、文章ベクトル表現法（例えば、ｂａｇ－ｏｆ－ｗｏｒｄｓ、ａｇｇｒｅｇａｔｅｗｏｒｄ２ｖｅｃ、ＢＥＲＴｅｍｂｅｄｄｉｎｇ、ＢＬＥＵＲＴなど）を利用して各テキストのベクトルを抽出する。類似度（例えば、ｃｏｓｉｎｅｓｉｍｉｌａｒｉｔｙなど）を利用してベクトル間の距離を計算したり、ＢＬＥＵＲＴのようなネットワークを利用して各対のセマンティック距離（ｐａｉｒｗｉｓｅｓｅｍａｎｔｉｃｄｉｓｔａｎｃｅ）を計算したりした後、距離が遠い（ｓｅｍａｎｔｉｃｄｉｓｔａｎｃｅが高い）対にさらに高い加重値を付与してサンプリングを実施してよい。 As a method for calculating semantic diversity, a vector of each text is extracted using a sentence vector representation method (eg, bag-of-words, aggregate word2vec, BERT embedding, BLEURT, etc.). After calculating the distance between vectors using similarity (e.g., cosine similarity, etc.), or using a network such as BLEURT to calculate the pairwise semantic distance, the distance is Sampling may be performed by giving higher weights to distant (high semantic distance) pairs.

The processor 120 may select from the training data set D as many pieces of original data as there are classes, that is, label types. Among original data with the same label, the processor 120 may preferentially select data whose pairwise semantic distance is high. An α (alpha) multiplier may be applied to the semantic distance, where α is a hyperparameter that needs to be optimized.
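A minimal sketch of distance-weighted sampling is shown below. It uses a bag-of-words representation with cosine distance, one of the options listed above, and the standard library only. Treating α as an exponent on the distance is an assumption made for this sketch, as are all function names.

```python
import math
import random
from collections import Counter
from itertools import combinations


def bow_vector(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return max(0.0, 1.0 - dot / norm) if norm else 1.0


def sample_diverse_pair(texts, alpha=1.0, seed=None):
    """Sample one pair of texts, favouring semantically distant pairs.

    alpha is applied as an exponent here (an assumption): larger values
    concentrate the sampling on the most distant pairs.
    """
    rng = random.Random(seed)
    vectors = [bow_vector(t) for t in texts]
    pairs = list(combinations(range(len(texts)), 2))
    weights = [cosine_distance(vectors[i], vectors[j]) ** alpha for i, j in pairs]
    if sum(weights) == 0:  # all texts identical: fall back to uniform sampling
        weights = [1.0] * len(pairs)
    i, j = rng.choices(pairs, weights=weights, k=1)[0]
    return texts[i], texts[j]
```

In practice the bag-of-words vectors could be swapped for sentence embeddings; only `bow_vector` and `cosine_distance` would need to change.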

In step 320, the processor 120 may construct a prompt that serves as the input of the language model. The processor 120 may construct the input prompt of the language model using the original data selected in step 310. The processor 120 may create a dedicated prompt template that appropriately reflects the characteristics of the given NLP problem, and the prompt template may include the definition of the task and meta information. In other words, the processor 120 may process the original data selected from the data set to construct a prompt in natural language form; the prompt is produced in a format that the language model can understand and is given to the language model as input. When the original data is labeled, the processor 120 designs the prompt so that label information is generated together with the sentence.

Given the original data $D_s = \{(x_i, y_i)\}_{i=1}^{k}$ sampled from the training data set D, the processor 120 may generate a prompt consisting of a description header, a list of text-label pairs, and an augmentation prefix. The prompt carries information about each task so that the language model can generalize better to the data distribution; such task indicators are unique to each task and may provide meta information about the task.

プロンプトの形式自体は多様に構成されてよいが、一例として、プロンプトは、テキストタイプ（例えば、レビューや記事など）とラベルタイプ（例えば、感情や分類など）、さらにラベル位置を確認することのできるラベル－トークン区分子（ｌａｂｅｌ－ｔｏｋｅｎｖｅｒｂａｌｉｚｅｒ）を含んでよい。 The format of the prompt itself may be configured in various ways, but as an example, the prompt can check the text type (eg, review, article, etc.), the label type (eg, sentiment, classification, etc.), and the label position. A label-token verbalizer may be included.

The text type T is the meta type of the input text x; for example, in movie review sentiment analysis, the text type is movie review. The label type L is the meta type of the label class y; for example, in movie review sentiment analysis, the label type is sentiment. The label-token verbalizer $v: \mathcal{Y} \rightarrow \mathcal{V}$ is needed to formulate the prompt: it is a one-to-one mapping between the label classes $\mathcal{Y}$ and word tokens in the language model's vocabulary $\mathcal{V}$.

The three pieces of meta information described above constitute the task specification $\mathcal{S} = (T, L, v)$. Each task requires its own task specification in order to formulate the prompt. By default, the processor 120 may generate the prompt using the generic task specification $\mathcal{S}_{\text{generic}} = (\text{text}, \text{label}, I)$. Here, I denotes the identity function, which assumes that the class labels exist in the language model's vocabulary $\mathcal{V}$.

要するに、図４に示すように、プロセッサ１２０は、与えられたタスクＴに対し、タスク仕様４１０、学習データセット４００からサンプリングされた原本データであるデータ例４２０、与えられたタスクＴの特性を考慮したプロンプトテンプレート４３０を組み合わせて言語モデルの入力プロンプトを構成してよい。 In short, as shown in FIG. 4, for a given task T, the processor 120 considers the task specification 410, the example data 420, which is the original data sampled from the training data set 400, and the characteristics of the given task T. The prompt templates 430 may be combined to construct the language model input prompt.

As can be seen from the specific example of FIG. 5, the processor 120 may use the task specification 410, the data example 420, and the prompt template 430 to produce the input prompt 540 of the language model in a format that the language model can understand. An example of the task specification 410 is shown in Table 1, and an example of the prompt template 430 is shown in Table 2. For a given data example 420, the input prompt 540 may be constructed as shown in Table 3.
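The combination of task specification, data examples, and prompt template can be sketched as follows. Tables 1 to 3 are not reproduced in this text, so the template wording below is purely hypothetical; `TaskSpec`, `identity`, and `build_prompt` are illustrative names, not part of the specification.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


def identity(label: str) -> str:
    """Identity verbalizer: assumes the class label is itself a vocabulary token."""
    return label


@dataclass
class TaskSpec:
    text_type: str                               # e.g. "movie review"
    label_type: str                              # e.g. "sentiment"
    verbalizer: Callable[[str], str] = identity  # label class -> label token


def build_prompt(spec: TaskSpec,
                 examples: Sequence[Tuple[str, str]],
                 label_classes: Sequence[str]) -> str:
    """Assemble description header, text-label pair list, and augmentation prefix."""
    text_name = spec.text_type.capitalize()
    label_name = spec.label_type.capitalize()
    options = " or ".join(f"'{spec.verbalizer(y)}'" for y in label_classes)
    header = (f"Each item in the following list contains a {spec.text_type} and "
              f"the respective {spec.label_type}. {label_name} is one of {options}.")
    items = [f"{text_name}: {x} ({label_name}: {spec.verbalizer(y)})"
             for x, y in examples]
    prefix = f"{text_name}:"  # the language model continues from here
    return "\n".join([header, "", *items, prefix])


spec = TaskSpec(text_type="movie review", label_type="sentiment")
prompt = build_prompt(
    spec,
    [("A moving story.", "positive"), ("Dull and slow.", "negative")],
    ["positive", "negative"],
)
```

Ending the prompt with the bare text-type prefix invites the model to continue the list with a new item in the same `text (label type: label)` shape as the examples.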

再び図３を参照すると、段階３３０で、プロセッサ１２０は、段階３２０で構成されたプロンプトを言語モデルに入力し、言語モデルから新規データが含まれた自然語を生成してよい。言い換えれば、プロセッサ１２０は、プロンプト入力文を言語モデルに入力した後、言語モデルの完成機能によって言語生成結果を得てよい。 Referring again to FIG. 3, at step 330, processor 120 may input the prompts constructed at step 320 into the language model and generate natural language containing the new data from the language model. In other words, the processor 120 may obtain the language generation result by the completion function of the language model after inputting the prompt input sentence into the language model.

段階３４０で、プロセッサ１２０は、自然語生成結果を分析して新規データを抽出してよい。言語モデルは、入力文として与えられたプロンプトのパターンに沿って自然語を生成してよく、生成された自然語のパターン分析によって新規データを抽出してよい。 At step 340, processor 120 may analyze the natural language generation results to extract new data. The language model may generate natural language according to the prompt pattern given as an input sentence, and may extract new data by pattern analysis of the generated natural language.

As shown in FIG. 6, the processor 120 may input the prompt 540 into the language model 210 and analyze the pattern of the natural language generated by the language model 210 to obtain, as augmented data 650, a pair of (new sentence, label information for that sentence). As an example, the processor 120 may obtain a label distribution using the language modeling probability distribution of the natural language tokens corresponding to labels in the prompt 540.

In the prompt-based approach, the augmented text x' and the label y' are generated consecutively as natural language text after the prompt. Because the prompt template, predefined and filled with the sampled original data, presents the input so that the language model generates an (x', y') structure, each value may be extracted by pattern matching. Jointly generating the text and the label also ensures that the generated text is tied to the correct label.
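The pattern-matching step can be sketched with a regular expression. The sketch assumes, hypothetically, that each generated item has the form `<text> (<Label type>: <label>)` on a single line; the actual pattern depends on the prompt template in use, and `extract_pair` is an illustrative name.

```python
import re

# Matches "<text> (<label type>: <label>)" on one line (assumed item format).
ITEM_PATTERN = re.compile(
    r"^\s*(?P<text>.+?)\s*\((?P<label_type>[^:()]+):\s*(?P<label>[^()]+?)\s*\)\s*$"
)


def extract_pair(completion, label_type, label_classes):
    """Return (text, label) parsed from the first generated line, or None."""
    lines = [ln for ln in completion.splitlines() if ln.strip()]
    if not lines:
        return None
    match = ITEM_PATTERN.match(lines[0])
    if match is None:
        return None
    if match["label_type"].strip().lower() != label_type.lower():
        return None
    label = match["label"].strip()
    if label not in label_classes:
        return None  # discard generations whose label is not a known class
    return match["text"], label


completion = " A quiet, moving film. (Sentiment: positive)\nMovie review: It was"
pair = extract_pair(completion, "sentiment", ["positive", "negative"])
```

Generations that do not follow the pattern are simply dropped in this sketch, which keeps malformed text out of the augmented data set.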

The prompt design of this embodiment ensures that the label token corresponding to [expression not reproduced] is generated after the text x. The processor 120 may perform pseudo-labeling using the language model, and may normalize the likelihood of generating each label token to obtain soft-label probabilities for the augmented text x'.
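The normalization of label-token likelihoods into soft-label probabilities can be sketched as below. This is an illustrative sketch under stated assumptions: the log-likelihood values are made up, and the function name `soft_label` is hypothetical. A real system would read these values from the language model's output at the label position.

```python
import math


def soft_label(label_token_logprobs):
    """Normalize per-label token log-likelihoods into a distribution that sums to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(label_token_logprobs.values())
    weights = {
        label: math.exp(lp - peak) for label, lp in label_token_logprobs.items()
    }
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Hypothetical log-likelihoods that a language model assigns to each label
# token at the label position following the generated text x'.
logprobs = {"positive": math.log(0.6), "negative": math.log(0.2)}
probs = soft_label(logprobs)
```

Only the label tokens take part in the normalization, so probability mass that the model assigns to unrelated tokens at that position does not affect the resulting soft label.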

The pseudo-label probability that the augmented text x' is labeled with the label y' is given by Equation (2).

[Equation (2) not reproduced]

Here, [symbol not reproduced] denotes the language modeling likelihood, and [symbol not reproduced] is a function that composes the given task specification.

In this embodiment, text perturbation, pseudo-labeling, and knowledge distillation can be effectively combined in a single augmentation task. In practice, the pseudo-labeled new data is used for training together with the original data, using a cross-entropy loss.

Referring to FIG. 7, the processor 120 may generate new data based on the probability distribution produced by the completion function of the language model 210. In doing so, it may use pattern analysis keyed on the label-token delimiter to take the probabilities of the tokens that correspond to labels and generate a soft label, that is, a label expressed as a distribution.

More specifically, the processor 120 converts all natural language provided as the prompt into a specific tokenized form (tokenization scheme), for example integer indices, and then inputs it to the language model 210. To obtain a (new sentence, label information for that sentence) pair, generation proceeds autoregressively, passing the probability distribution of the previous token as the input for the next token; here, a beam search using a heuristic is applied so that only the top n probabilities are used as input for the next token. By using the top n of the joint probabilities, each obtained by multiplying a token's own probability distribution by the probability distribution of the preceding tokens, a sequence with a high probability can be extracted.
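The beam search step can be illustrated with a small, self-contained sketch. It implements a standard beam search and is not taken from the embodiment: the toy lookup table `TOY_MODEL` stands in for the language model and, as a simplifying assumption, conditions only on the previous token; the token names and probabilities are invented.

```python
import math

# Toy next-token table standing in for the language model (an assumption):
# maps the previous token to a distribution over next tokens.
TOY_MODEL = {
    "<s>": {"good": 0.6, "bad": 0.4},
    "good": {"movie": 0.7, "</s>": 0.3},
    "bad": {"movie": 0.9, "</s>": 0.1},
    "movie": {"</s>": 1.0},
}


def beam_search(model, beam_width, max_steps):
    """Keep the beam_width highest joint-probability sequences at each step."""
    beams = [(0.0, ["<s>"])]  # (joint log-probability, tokens so far)
    for _ in range(max_steps):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == "</s>":
                candidates.append((score, tokens))  # finished sequence carries over
                continue
            for token, prob in model[tokens[-1]].items():
                # Joint probability = previous joint probability * token
                # probability, computed as a sum of logs.
                candidates.append((score + math.log(prob), tokens + [token]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return [(math.exp(score), tokens) for score, tokens in beams]


results = beam_search(TOY_MODEL, beam_width=2, max_steps=3)
best_prob, best_tokens = results[0]
```

Working in log space avoids numerical underflow when many token probabilities are multiplied together, which matters for sequences much longer than this toy example.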

For example, as shown in FIG. 8, the processor 120 may construct a soft label using the probabilities of the tokens that correspond to the labels. In other words, the soft label may be extracted from the normalized label-token distribution predicted by the language model 210.

Referring to FIG. 9, when data example 1 (910), classified as positive, and data example 2 (920), classified as negative, are given as original data, new data 930 may be extracted from the language generation result produced by the language model 210.

The processor 120 analyzes the result generated from the prompt and extracts the synthesized new data and its label information. Repeating this operation several times yields diverse data that was not originally in the data set, together with accurate classification information; when this data is mixed with the existing data, it can be used to fine-tune a downstream NLP model. The new data 930 is output data generated with appropriate reference to the linguistic and structural features of the two given original data 910 and 920. Because positive and negative are appropriately mixed in the new data 930, it is entirely new data that did not previously exist.

Adding the new data 930 to the training data set, that is, performing data augmentation, can improve the performance of the final classifier.

The original data 910 and 920 are hard-labeled data to which a single label is attached, whereas the new data 930 may be generated as soft-labeled data in the form of a distribution over at least two labels.

Because a soft label can be converted into a hard label by, for example, a max operation, the original data 910 and 920 and the new data 930 can all be used in hard-label form. Since the training process uses a loss function such as cross-entropy, both the hard-label form and the soft-label form can be used.
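The two points above, converting a soft label to a hard label with a max operation and using cross-entropy with either label form, can be sketched as follows. The function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def to_hard_label(soft):
    """Max operation: pick the label with the highest probability."""
    return max(soft, key=soft.get)


def cross_entropy(target, predicted):
    """Cross-entropy between a target label distribution and predicted probabilities."""
    return -sum(
        p * math.log(predicted[label]) for label, p in target.items() if p > 0
    )


soft = {"positive": 0.75, "negative": 0.25}     # soft label of a generated sample
hard = {"positive": 1.0, "negative": 0.0}       # one-hot form of a hard label
predicted = {"positive": 0.5, "negative": 0.5}  # hypothetical classifier output

hard_name = to_hard_label(soft)
loss_soft = cross_entropy(soft, predicted)
loss_hard = cross_entropy(hard, predicted)
```

A hard label is the special case of a distribution with all of its mass on one class, so the same loss function handles original and generated samples without separate code paths.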

Thus, according to embodiments of the present invention, augmented data is generated by transforming or extending existing data based on a language model, and the knowledge held by the language model is transferred through the augmented data. This can significantly reduce the effort required for data collection and can achieve the effect of a lighter model.

The apparatus described above may be implemented by hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the OS. The processing device may also access, record, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described as being used, but those skilled in the art will understand that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over computer systems connected by a network, and may be stored or executed in a distributed manner. The software and data may be recorded on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. Here, the medium may store the computer-executable program continuously, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in which a single piece or multiple pieces of hardware are combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy (registered trademark) disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by application stores that distribute applications, or by sites and servers that supply or distribute various other software.

Although the embodiments have been described above with reference to limited embodiments and drawings, those skilled in the art can make various modifications and variations from the above description. For example, appropriate results can be achieved even if the described techniques are performed in an order different from that of the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that of the described method, or are replaced or substituted by other components or equivalents.

Accordingly, other embodiments that are equivalent to the claims also fall within the scope of the appended claims.

210: Language model
220: PLM

Claims

1. A data generation method executed by a computer device, wherein the computer device includes at least one processor configured to execute computer-readable instructions contained in a memory, the data generation method comprising: constructing, by the at least one processor, a prompt that is an input sentence of a language model using original data; and inputting, by the at least one processor, the prompt into the language model and generating new data and label information for the new data from the language model.

2. The data generation method of claim 1, wherein the generating step includes generating a soft label representing a label distribution using a probability distribution for the natural language corresponding to the labels in the prompt.

3. The data generation method of claim 1 or 2, further comprising selecting, by the at least one processor, the original data from a training data set according to the semantic diversity of the text.

4. The data generation method of claim 3, wherein the selecting step includes selecting, from the training data set, original data corresponding to the number of label types.

5. The data generation method according to any one of claims 1 to 4, wherein the constructing step includes constructing the prompt in a format including a text type and a label type.

6. The data generation method according to any one of claims 1 to 4, wherein the constructing step includes constructing the prompt in a format including a text type, a label type, and a label verbalizer.

7. The data generation method according to any one of claims 1 to 4, wherein the constructing step includes processing the original data to construct the prompt in the same natural language form as the original data.

8. The data generation method according to any one of claims 1 to 4, wherein the constructing step includes constructing the prompt by combining a task specification, the original data, and a prompt template.

9. The data generation method according to any one of claims 1 to 8, wherein the generating step includes extracting the new data and the label information using an autoregressive method in which, for the natural language corresponding to the text and the label in the prompt, the probability distribution of the previous token is passed as the input for the next token.

10. The data generation method of claim 9, wherein the extracting step includes passing, as the input for the next token, the top-ranked probabilities of the probability distribution of the previous token by a beam search using a heuristic.

11. A computer program for causing a computer device to execute the data generation method according to any one of claims 1 to 10.

12. A computer device comprising at least one processor configured to execute computer-readable instructions contained in a memory, wherein the at least one processor performs: a process of constructing a prompt that is an input sentence of a language model using original data; and a process of inputting the prompt into the language model and generating new data and label information for the new data from the language model.

13. The computer device of claim 12, wherein the at least one processor generates a soft label representing a label distribution using a probability distribution for the natural language corresponding to the labels in the prompt.

14. The computer device of claim 12 or 13, wherein the at least one processor selects the original data from a training data set according to the semantic diversity of the text.

15. The computer device of claim 14, wherein the at least one processor selects, from the training data set, original data corresponding to the number of label types.

16. The computer device according to any one of claims 12 to 15, wherein the at least one processor constructs the prompt in a format including a text type and a label type.

17. The computer device according to any one of claims 12 to 15, wherein the at least one processor constructs the prompt in a format including a text type, a label type, and a label position delimiter.

18. The computer device according to any one of claims 12 to 15, wherein the at least one processor constructs the prompt by combining a task specification, the original data, and a prompt template.

19. The computer device according to any one of claims 12 to 18, wherein the at least one processor extracts the new data and the label information using an autoregressive method in which, for the natural language corresponding to the text and the label in the prompt, the probability distribution of the previous token is passed as the input for the next token.

20. The computer device of claim 19, wherein the at least one processor passes, as the input for the next token, the top-ranked probabilities of the probability distribution of the previous token by a beam search using a heuristic.