JP6743942B2

JP6743942B2 - Vocabulary table selection method, device, and computer-readable storage medium

Info

Publication number: JP6743942B2
Application number: JP2019090337A
Authority: JP
Inventors: トォンイシュアヌ; ジャンヨンウエイ; ドォンビヌ; ジアンシャヌシャヌ; ジャンジィアシ
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2018-07-10
Filing date: 2019-05-13
Publication date: 2020-08-19
Anticipated expiration: 2039-05-13
Also published as: JP2020008836A; CN110705279A

Description

The present invention relates to the field of language information processing, and in particular to a vocabulary table selection method, a vocabulary table selection device, and a computer-readable storage medium.

In natural language processing, solving a problem with a neural network model usually requires specifying a vocabulary table and training the model to obtain one with a specific function. For example, a named entity extraction model can be trained with a vocabulary table that contains person-name entities together with other, non-person-name vocabulary; the resulting model can then be used to find person-name entities in input natural language.

In the prior art, sample vocabulary is usually extracted from an original corpus to form a vocabulary table, and the model is then trained on that table. A vocabulary table obtained from an original corpus usually contains a large amount of vocabulary, some of it of low value. Training a model with such a table is therefore inefficient and slow, and the resulting model has low accuracy.

The technical problem addressed by the embodiments of the present invention is to provide a vocabulary table selection method, a device, and a computer-readable recording medium that select and generate a vocabulary table better suited to model training, thereby improving training efficiency, reducing training time, and increasing the accuracy of the trained model.

To solve the above technical problem, the vocabulary table selection method provided by an embodiment of the present invention includes:
a step of introducing a vocabulary weighting layer into a target neural network model to build a preliminary training model, wherein the vocabulary weighting layer weights the target vocabulary in a first vocabulary table according to vocabulary weights and inputs the weighted target vocabulary to the target neural network model;
a step of training the preliminary training model based on the first vocabulary table, updating the model parameters of the preliminary training model and the vocabulary weights of the vocabulary weighting layer, and, after training ends, obtaining the vocabulary weights of the target vocabulary in the first vocabulary table; and
a step of selecting from the first vocabulary table according to the vocabulary weights to obtain a second vocabulary table.

Preferably, after the second vocabulary table is obtained, the method further includes training the target neural network model with the second vocabulary table.

Preferably, in the above method, the step of weighting the target vocabulary in the first vocabulary table according to the vocabulary weights includes computing a weighted sum of the target word vector corresponding to the target vocabulary and the unknown word vector corresponding to unknown vocabulary. The first weight, applied to the target word vector, is a positively correlated function of the vocabulary weight of the target vocabulary; the second weight, applied to the unknown word vector, is a negatively correlated function of that vocabulary weight; and the sum of the first weight and the second weight is a predetermined value. The unknown vocabulary is vocabulary that does not exist in the first vocabulary table, and all vocabulary absent from the first vocabulary table corresponds to the same unknown word vector.

Preferably, in the above method, the first weight may be a first function of the vocabulary weight of the target vocabulary, the first function mapping the vocabulary weight into the range from 0 to 1; the second weight is a second function of the first weight and is negatively correlated with the first weight.

Preferably, in the above method, the target word vector and the unknown word vector are initialized by random initialization or by a word vector pre-training algorithm.

Preferably, in the above method, the step of selecting from the first vocabulary table according to the vocabulary weights may select a second number of vocabulary items from the first vocabulary table in descending order of vocabulary weight to obtain the second vocabulary table. Alternatively, it may select from the first vocabulary table the vocabulary whose vocabulary weight lies within a preset numerical range to obtain the second vocabulary table. In either case, the second vocabulary table contains fewer vocabulary items than the first vocabulary table.

Preferably, in the above method, before the preliminary training model is built, the method further includes: performing data cleaning on original corpus data, dividing the cleaned corpus data into sentences, segmenting the sentences to obtain a plurality of vocabulary items, and selecting a first number of vocabulary items in descending order of their frequency of occurrence in the original corpus to obtain the first vocabulary table.

Preferably, in the above method, the target neural network model and the preliminary training model are both models built for the same target task.

An embodiment of the present invention further provides a vocabulary table selection device, which includes:
a preliminary training model modeling unit that introduces a vocabulary weighting layer into a target neural network model to build a preliminary training model, wherein the vocabulary weighting layer weights the target vocabulary in a first vocabulary table according to vocabulary weights and inputs the weighted target vocabulary to the target neural network model;
a first training unit that trains the preliminary training model based on the first vocabulary table, updates the model parameters of the preliminary training model and the vocabulary weights of the vocabulary weighting layer, and, after training ends, obtains the vocabulary weights of the target vocabulary in the first vocabulary table; and
a vocabulary selection unit that selects from the first vocabulary table according to the vocabulary weights to obtain a second vocabulary table.

Preferably, the vocabulary table selection device further includes a second training unit that trains the target neural network model with the second vocabulary table.

Preferably, in the vocabulary table selection device, the preliminary training model modeling unit computes a weighted sum of the target word vector corresponding to the target vocabulary and the unknown word vector corresponding to unknown vocabulary. The first weight, applied to the target word vector, is a positively correlated function of the vocabulary weight of the target vocabulary; the second weight, applied to the unknown word vector, is a negatively correlated function of that vocabulary weight; and the sum of the first weight and the second weight is a predetermined value. The unknown vocabulary is vocabulary that does not exist in the first vocabulary table, and all vocabulary absent from the first vocabulary table corresponds to the same unknown word vector.

Preferably, in the vocabulary table selection device, the first weight may be a first function of the vocabulary weight of the target vocabulary, the first function mapping the vocabulary weight into the range from 0 to 1; the second weight is a second function of the first weight and is negatively correlated with the first weight.

Preferably, in the vocabulary table selection device, the target word vector and the unknown word vector are initialized by random initialization or by a word vector pre-training algorithm.

Preferably, in the vocabulary table selection device, the vocabulary selection unit may select a second number of vocabulary items from the first vocabulary table in descending order of vocabulary weight to obtain the second vocabulary table. Alternatively, it may select from the first vocabulary table the vocabulary whose vocabulary weight lies within a preset numerical range to obtain the second vocabulary table. In either case, the second vocabulary table contains fewer vocabulary items than the first vocabulary table.

Preferably, the vocabulary table selection device further includes a vocabulary table generation unit that performs data cleaning on original corpus data, divides the cleaned corpus data into sentences, segments the sentences to obtain a plurality of vocabulary items, and selects a first number of vocabulary items in descending order of their frequency of occurrence in the original corpus to obtain the first vocabulary table.

Preferably, in the vocabulary table selection device, the target neural network model and the preliminary training model are both models built for the same target task.

An embodiment of the present invention further provides a vocabulary table selection device that includes a memory, a processor, and a computer program stored in the memory and executable by the processor. When executed by the processor, the computer program implements the vocabulary table selection method described above.

An embodiment of the present invention further provides a computer-readable recording medium that stores a computer program. When executed by a processor, the computer program implements the vocabulary table selection method described above.

Compared with the prior art, the vocabulary table selection method, device, and computer-readable storage medium according to the embodiments of the present invention train a preliminary training model with a first vocabulary table and update both the model parameters and the vocabulary weights during training. After training ends, the vocabulary weights obtained for the target vocabulary are used to select vocabulary from the first vocabulary table, yielding a second vocabulary table for training the target neural network model. Because the second vocabulary table contains the higher-value vocabulary, training the target neural network model with it improves training efficiency, reduces training time, and increases the accuracy of the trained model.

To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed for the description are briefly introduced below. The drawings show only some embodiments of the present invention, and a person skilled in the art can derive other drawings from them without requiring advanced technical skill.
The drawings are, in order: an application scenario of the vocabulary table selection method according to an embodiment of the present invention; a flowchart of the method; another flowchart of the method; a further flowchart of the method; a structural diagram of a named entity extraction model to which the method is applied; a structural diagram of the preliminary training model; a structural diagram of the vocabulary table selection device; another structural diagram of the device; and a further structural diagram of the device.

To make the technical problem, the technical solution, and the advantages of the present invention clearer, a detailed description is given below with reference to the drawings and specific embodiments. Specific details such as particular configurations and units are provided only to aid understanding of the embodiments. It will therefore be apparent to a person skilled in the art that various changes and modifications can be made to the described embodiments within the scope and spirit of the present invention. For clarity and brevity, descriptions of well-known functions and structures are omitted.

References in this specification to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic related to that embodiment is included in at least one embodiment of the present invention. The phrases "in one embodiment" and "in an embodiment" therefore do not necessarily refer to the same embodiment. These particular features, structures, or characteristics may also be combined in any suitable manner in one or more embodiments.

In the embodiments of the present invention, the sequence numbers of the processes below do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and the numbering should not limit the implementation of the embodiments in any way.

As described above, a polysemous word has several different senses, so it is important to distinguish the sense of a polysemous word in different contexts. The vocabulary table selection method according to the embodiment of the present invention can generate, for a polysemous word, word representations corresponding to its different senses. Because the method requires relatively little computation and time, it improves the efficiency of generating word representations.

FIG. 1 shows an exemplary system architecture 100 of an embodiment to which the vocabulary table selection method of the present application can be applied. As shown in FIG. 1, the system architecture 100 includes terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.

A user can use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104, for example to send text data. Various communication client applications, such as file editing, information retrieval, and information transmission applications, can be installed on the terminal devices 101, 102, 103. The terminal devices 101, 102, 103 may be any electronic devices that have a display and can send information and files, including but not limited to smartphones, tablet PCs, laptop PCs, and desktop PCs.

The server 105 is a server capable of performing vocabulary table selection. Specifically, the server can collect original corpus data over the network 104, for example from sites on the Internet, and generate a vocabulary table from it. Corpus data collected in advance can also be sent to the server 105 from the terminal devices 101, 102, 103. Because the vocabulary table selection method according to the embodiments of the present application is generally executed on the server 105, the vocabulary table selection device can correspondingly be installed on the server 105.

The numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. Any number of terminal devices, networks, and servers may be provided according to implementation needs.

FIG. 2A shows a flowchart of the vocabulary table selection method according to an embodiment of the present invention. The method can be applied to any neural network model that must be trained with a specified vocabulary table. By reducing the vocabulary table, it improves training efficiency, shortens training time, and increases the accuracy of the trained model. As shown in FIG. 2A, the vocabulary table selection method may include the following steps.

In step 201, a vocabulary weighting layer is introduced into the target neural network model to build a preliminary training model. The vocabulary weighting layer weights the target vocabulary in the first vocabulary table according to vocabulary weights and inputs the weighted target vocabulary to the target neural network model.

Here, the target neural network model is a neural network model that must be trained with a specified vocabulary table, and the first vocabulary table is the original vocabulary table that would otherwise be used to train the target neural network model. The vocabulary table selection method according to the embodiment of the present invention reduces the vocabulary in the first vocabulary table.

In step 201, the embodiment of the present invention adds a new layer structure, the vocabulary weighting layer, on top of the target neural network model to build a new model, referred to in the present application as the preliminary training model. In the preliminary training model, the vocabulary weighting layer weights the target vocabulary in the first vocabulary table according to vocabulary weights and inputs the weighted target vocabulary to the target neural network model. The vocabulary weighting layer sits between the input interface of the preliminary training model and the target neural network model: vocabulary input to the preliminary training model undergoes the weighting process and is then passed to the target neural network model. The initial vocabulary weight of each target vocabulary item in the vocabulary weighting layer may be a preset initial value.
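The layering described in step 201 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the class names `VocabWeightingLayer` and `PreTrainingModel`, the pluggable `first_weight_fn`, the default initial weight of 0.0, and the use of plain Python lists as word vectors are all hypothetical, and the predetermined sum of the two weights is taken to be 1.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


class VocabWeightingLayer:
    """Hypothetical weighting layer placed in front of the target model."""

    def __init__(
        self,
        vectors: Dict[str, Vector],          # word vectors of the first vocabulary table
        unk_vector: Vector,                  # shared vector for all out-of-table words
        first_weight_fn: Callable[[float], float],  # maps a vocabulary weight to [0, 1]
        initial_weight: float = 0.0,         # preset initial vocabulary weight (assumed)
    ) -> None:
        self.vectors = vectors
        self.unk_vector = unk_vector
        self.first_weight_fn = first_weight_fn
        # One trainable vocabulary weight per word in the first vocabulary table.
        self.vocab_weights = {word: initial_weight for word in vectors}

    def __call__(self, word: str) -> Vector:
        if word not in self.vectors:
            # Every word absent from the table uses the same unknown-word vector.
            return list(self.unk_vector)
        a = self.first_weight_fn(self.vocab_weights[word])
        # Weighted sum; the two weights a and (1 - a) add up to 1.
        return [a * t + (1.0 - a) * u
                for t, u in zip(self.vectors[word], self.unk_vector)]


class PreTrainingModel:
    """Weighting layer followed by the unchanged target model."""

    def __init__(self, layer: VocabWeightingLayer,
                 target_model: Callable[[List[Vector]], object]) -> None:
        self.layer = layer
        self.target_model = target_model

    def forward(self, words: Sequence[str]) -> object:
        return self.target_model([self.layer(w) for w in words])
```

In this sketch the target model is any callable that accepts a list of vectors, so the same target model can later be trained on its own without the weighting layer.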

Specifically, the vocabulary weighting layer can compute a weighted sum of the target word vector corresponding to the target vocabulary and the unknown word vector corresponding to unknown vocabulary. The first weight, applied to the target word vector, is a positively correlated function of the vocabulary weight of the target vocabulary; the second weight, applied to the unknown word vector, is a negatively correlated function of that vocabulary weight; and the sum of the first weight and the second weight is a predetermined value, for example 1. The unknown vocabulary is vocabulary that does not exist in the first vocabulary table, and all vocabulary absent from the first vocabulary table corresponds to the same unknown word vector.
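The weighted sum can be written compactly as below. The symbols are introduced here only for illustration and do not appear in the source: $\mathbf{e}_i$ is the target word vector of word $i$, $\mathbf{e}_{\mathrm{unk}}$ the shared unknown word vector, $w_i$ the vocabulary weight, $f$ an increasing function, $g$ a decreasing function, and $c$ the predetermined value.

```latex
\tilde{\mathbf{e}}_i = \alpha_i\,\mathbf{e}_i + \beta_i\,\mathbf{e}_{\mathrm{unk}},
\qquad \alpha_i = f(w_i), \quad \beta_i = g(w_i), \quad \alpha_i + \beta_i = c
```

With $c = 1$ this gives $\beta_i = 1 - \alpha_i$, so a word whose first weight approaches 1 is passed on almost unchanged, while a word whose first weight approaches 0 becomes nearly indistinguishable from an unknown word.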

Here, the target word vector and the unknown word vector can be initialized by random initialization or by a word vector pre-training algorithm; a pre-training algorithm can reduce the Euclidean distance between the word vectors of vocabulary with similar meanings. These are only two word vector initialization methods that the embodiment of the present invention can adopt. Other methods from the prior art can also be used, and the embodiment of the present invention is not specifically limited in this respect.

In one implementation, the first weight may be a first function of the vocabulary weight of the target vocabulary, the first function mapping the vocabulary weight into the range from 0 to 1. The second weight may be a second function of the first weight and is negatively correlated with the first weight. These functions are only examples; the embodiment of the present invention is not specifically limited to them.
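As one possible instance of such a pair of functions, the sketch below assumes a logistic sigmoid for the first function and "one minus the first weight" for the second. The source does not name specific functions, so both choices, and the names `first_weight` and `second_weight`, are assumptions made for illustration.

```python
import math


def first_weight(vocab_weight: float) -> float:
    """Sigmoid: monotonically increasing, maps a real weight into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-vocab_weight))


def second_weight(first: float) -> float:
    """Decreases as the first weight grows; the two weights sum to 1."""
    return 1.0 - first
```

A sigmoid is differentiable everywhere, which lets the vocabulary weight be updated by gradient-based training along with the other model parameters; any other increasing map into the range from 0 to 1 would satisfy the description equally well.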

In step 202, the preliminary training model is trained based on the first vocabulary table, the model parameters of the preliminary training model and the vocabulary weights of the vocabulary weighting layer are updated, and, after training ends, the vocabulary weights of the target vocabulary in the first vocabulary table are obtained.

In step 202, the preliminary training model is trained with the first vocabulary table, and the model parameters and the vocabulary weights of the target vocabulary in the vocabulary weighting layer are updated during training. When training reaches a preset termination condition, for example when the number of iteration rounds reaches a predetermined count or the objective function satisfies a predetermined condition, training ends, yielding the trained preliminary training model and the vocabulary weights of the target vocabulary in the vocabulary weighting layer.
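The joint update and the two termination conditions can be illustrated with a deliberately tiny toy. Everything in this sketch is an assumption made for illustration: the function name `pretrain`, scalar "word vectors", a one-parameter linear model, a squared-error objective, a sigmoid first weight, and plain stochastic gradient descent. The source specifies none of these.

```python
import math
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def pretrain(
    samples: List[Tuple[str, float]],   # toy data: (word, target value) pairs
    embed: Dict[str, float],            # scalar stand-ins for target word vectors
    unk: float,                         # shared unknown-word value
    lr: float = 0.1,
    max_rounds: int = 200,              # stop after this many iteration rounds
    tol: float = 1e-4,                  # or once the mean squared error is below tol
) -> Tuple[float, Dict[str, float], int]:
    theta = 1.0                                  # the toy model's only parameter
    weights = {w: 0.0 for w in embed}            # preset initial vocabulary weights
    rounds_done = 0
    for _ in range(max_rounds):
        rounds_done += 1
        total = 0.0
        for word, y in samples:
            a = sigmoid(weights[word])
            x = a * embed[word] + (1.0 - a) * unk    # output of the weighting layer
            err = theta * x - y
            total += err * err
            # Gradients of the squared error for theta and for the vocabulary weight.
            g_theta = 2.0 * err * x
            g_w = 2.0 * err * theta * (embed[word] - unk) * a * (1.0 - a)
            theta -= lr * g_theta
            weights[word] -= lr * g_w
        if total / len(samples) < tol:           # objective-based termination
            break
    return theta, weights, rounds_done
```

In this toy, a word whose value differs from the unknown-word value and helps reduce the error has its vocabulary weight pushed upward, while a word that is indistinguishable from the unknown word receives no gradient and keeps its initial weight.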

In step 203, the first vocabulary table is filtered according to the vocabulary weights to obtain a second vocabulary table.

After the vocabulary weights of the target vocabulary are obtained in step 202, the first vocabulary table can be filtered according to those weights to obtain the second vocabulary table, reducing the amount of vocabulary needed for model training. Specifically, the embodiment of the present invention can select a second number of vocabulary items from the first vocabulary table in descending order of vocabulary weight to obtain the second vocabulary table. Alternatively, it can select from the first vocabulary table the vocabulary whose vocabulary weight lies within a preset numerical range. In either case, the second vocabulary table contains fewer vocabulary items than the first vocabulary table. One possible filtering method, for example, keeps only the vocabulary whose vocabulary weight is greater than Q or less than -Q to form the second vocabulary table, where Q is a positive real number specified in advance by the user.
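Both selection rules can be sketched in a few lines. The function names `select_top_k` and `select_by_threshold` and the dictionary representation of the learned vocabulary weights are hypothetical; only the rules themselves (top entries by weight, or weight greater than Q or less than -Q) come from the text.

```python
from typing import Dict, List


def select_top_k(weights: Dict[str, float], k: int) -> List[str]:
    """Keep the k words with the highest vocabulary weights."""
    ranked = sorted(weights, key=lambda w: weights[w], reverse=True)
    return ranked[:k]


def select_by_threshold(weights: Dict[str, float], q: float) -> List[str]:
    """Keep words whose weight is greater than q or less than -q."""
    if q <= 0:
        raise ValueError("q must be a positive real number")
    return [w for w, v in weights.items() if v > q or v < -q]
```

For example, with the made-up weights `{"alice": 2.1, "the": 0.05, "bob": 1.4, "of": -1.8}`, `select_top_k(..., 2)` keeps "alice" and "bob", while `select_by_threshold(..., 1.0)` also keeps "of" because its weight is below -1.0.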

Through the above steps, the embodiment of the present invention generates from the first vocabulary table a second vocabulary table with comparatively fewer entries. Because the second vocabulary table contains the high-value target vocabulary, it improves the training efficiency of the target neural network model, reduces training time, and increases the accuracy of the trained model. For example, in a model for finding person-name entities in natural language, the accuracy of person-name entity detection can be improved.

さらに、図2Bに示すように、本発明の実施例は上記のステップ203の後に以下をさらに含むことができる。 Further, as shown in FIG. 2B, the embodiment of the present invention may further include the following after step 203 described above.

ステップ204では、前記第2語彙テーブルによって、前記ターゲットニューラルネットモデルをトレーニングする。 In step 204, the target neural network model is trained according to the second vocabulary table.

Here, the target neural network model and the preliminary training model are both models built for the same target task. For example, the two models are used to solve the same task, and both take the same objective function as the optimization target during training.

Because the second vocabulary table contains fewer words and retains the highly weighted ones, the embodiments of the present invention can improve the training efficiency of the target neural network model, reduce training time, and increase the accuracy of the trained model.

FIG. 3 shows another process of the method for selecting a vocabulary table according to the embodiments of the present invention. This process uses a named entity extraction model as a concrete example of the target neural network model to describe the method in more detail. As shown in FIG. 3, the process includes the following.

ステップ301では、オリジナルコーパスによって第1語彙テーブルを生成する。 In step 301, the first vocabulary table is generated by the original corpus.

Here, an original corpus collected in advance can be obtained. The original corpus data is cleaned, the cleaned data is split into sentences, and the sentences are split to obtain a plurality of words. Then a first number of words is selected in descending order of frequency of occurrence in the original corpus to obtain the first vocabulary table.

In this embodiment, the method for selecting the vocabulary table can be executed by an electronic device (for example, the server 105 shown in FIG. 1). The electronic device can collect text data from a network (for example, websites on the Internet) over a wired or wireless connection, or the terminal devices 101, 102, 103, and so on can collect text data and send it to the electronic device. The wireless connection may include, but is not limited to, 3G/4G/5G, WiFi, Bluetooth (registered trademark), WiMAX, ZigBee, UWB (Ultra Wide Band), and other wireless connection methods now known or developed in the future.

Text data collected from a network may use several different encodings. Therefore, as one implementation, the embodiments of the present invention may further include the following procedure before step 301 above.

In step 300, text data is collected and preprocessed to generate the original corpus required by the subsequent step 301. Specifically, the preprocessing includes the following.

a) Code unification: The text data is converted into a single encoding format. For example, all full-width characters are converted to half-width characters, and the text data is converted to one common encoding, for example UTF-8.

b) Data cleaning: The text data is cleaned. Data cleaning removes noise that is unnecessary for text analysis and retains only content that carries actual semantic information. Noise here usually means special symbols, links, email addresses, emoji, emoticons, HTML tags (for example, <html>, <title>, <body>, <br>, <span>) and other symbols such as &lt, &gt, @, #, $, %, ^, &, *, (), <>, {}, and [].

c) Data division: The cleaned text data is divided into morphemes, and stop words are removed to obtain the corpus data. For example, the text is first split into sentences, the sentences are then split into words as appropriate for the application scenario, and the stop words are removed to obtain the corpus data. Here, a morpheme includes at least one of a word, a collocation, and a word string. Stop words usually include words with no substantive meaning, such as particles, prepositions, and adverbs, as well as some very high-frequency and very low-frequency words. A collocation usually contains two or more words, and a word string can contain two or more collocations. Specifically, the Natural Language Toolkit (NLTK), a Python library, can be used to split the text into sentences, and a tool such as a word tokenizer can then split the sentences into words. In this document, "word" and "vocabulary" have the same meaning.

d) Data ID conversion: From the words obtained in step c, a first number (for example, 30,000) of distinct words is selected according to frequency of occurrence to form the first vocabulary table, and each word is assigned its own unique ID. For example, the first of the 30,000 words in the first vocabulary table is assigned ID 1, the second is assigned ID 2, and so on. Words not included in the first vocabulary table, that is, unknown words, can all be assigned ID 0. In this way, every word can be replaced by its corresponding ID.
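Preprocessing steps a) through d) can be sketched end to end as below. This is a simplified illustration under stated assumptions: Unicode NFKC normalization stands in for the full-width to half-width conversion, the regular expressions cover only a few of the noise types listed, a whitespace split stands in for a real tokenizer such as NLTK's, and all function names and the sample text are hypothetical.

```python
import re
import unicodedata
from collections import Counter
from typing import Dict, List, Set

UNK_ID = 0  # ID shared by every word outside the first vocabulary table


def normalize(text: str) -> str:
    """Step a: NFKC folds full-width characters into their half-width forms."""
    return unicodedata.normalize("NFKC", text)


def clean(text: str) -> str:
    """Step b: drop HTML tags, links, and stray symbols (illustrative patterns only)."""
    text = re.sub(r"<[^>]+>", " ", text)               # HTML tags
    text = re.sub(r"https?://\S+", " ", text)          # links
    return re.sub(r"[@#$%^&*(){}\[\]<>]", " ", text)   # leftover symbols


def tokenize(text: str, stop_words: Set[str]) -> List[str]:
    """Step c: whitespace split (assumed stand-in for a tokenizer), minus stop words."""
    return [t for t in text.lower().split() if t not in stop_words]


def build_vocab(tokens: List[str], first_number: int) -> Dict[str, int]:
    """Step d: the first_number most frequent words receive IDs 1, 2, ..."""
    most_common = Counter(tokens).most_common(first_number)
    return {word: i for i, (word, _) in enumerate(most_common, start=1)}


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace each word by its ID; unknown words all map to UNK_ID."""
    return [vocab.get(t, UNK_ID) for t in tokens]


raw = "＜br＞Ｂｏｂ met Alice and Bob met Carol and Bob left"
tokens = tokenize(clean(normalize(raw)), stop_words={"and"})
vocab = build_vocab(tokens, first_number=2)
ids = encode(tokens, vocab)
```

Normalization runs before cleaning so that full-width tag delimiters are already half-width when the tag pattern is applied.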

ステップ302では、固有表現抽出モデルに語彙重み付け層を導入して、予備トレーニングモデルを構築する。 In step 302, a vocabulary weighting layer is introduced into the named entity extraction model to build a preliminary training model.

Here, FIG. 4 shows an example structure of the named entity extraction model, and FIG. 5 shows the structure of the preliminary training model obtained by introducing the vocabulary weighting layer, indicated by the dashed box, into the named entity extraction model. In the embodiments of the present invention, each target vocabulary has its own single vocabulary weight parameter.

FIG. 5 is described below.

a) The operator "+" denotes vector addition. For n-dimensional vectors A and B, let C be the result of adding A and B. C is also an n-dimensional vector, and its i-th element is $C_i = A_i + B_i$, where $A_i$ is the i-th element of A and $B_i$ is the i-th element of B.

b) The operator "×" denotes multiplication of a vector by a real number. For an n-dimensional vector A and a real number b, let D be the result of multiplying A by b. D is also an n-dimensional vector, and its i-th element is $D_i = b \cdot A_i$.

c) The operator "σ" corresponds to the activation function g(x). The activation function g(x) computes the first weight of the target word vector. It is the first function of the vocabulary weight x of the target vocabulary and maps the vocabulary weight to a value between 0 and 1. An example of the activation function is given in Equation 4.

d) The operator "1-" corresponds to the mapping function f(z). The mapping function f(z) computes the second weight of the unknown word vector and is the second function of the first weight z of the target word vector. Its range is [0, 1], and it is negatively correlated with z: it decreases as z increases and increases as z decreases. An example of the mapping function is Equation 3: $f(z) = 1 - z$.

よって、本例の前記第1重みと第2重みの合計値が1になる。 Therefore, the total value of the first weight and the second weight in this example becomes 1.

e) The word vectors can be trained with the tool "word2vec", and the word vector dimension can be set to 256 during training. Words not covered by the trained word vectors can be randomly initialized as 256-dimensional vectors.

f) All words outside the 30,000 words of the first vocabulary table are defined as unknown words. The unknown words share one randomly initialized 256-dimensional vector as their word vector, namely the unknown word vector described above.

g）本発明の実施例は前記予備トレーニングモデルのトレーニング過程において、単語ベクトルパラメータを一括更新する。 g) In the embodiment of the present invention, the word vector parameters are collectively updated in the training process of the preliminary training model.

h) Each word has a corresponding vocabulary weight used to evaluate the importance of the word. The vocabulary weight can be initialized to the positive real number 0.5 and is updated continually as the preliminary training model is trained.

i) The output of the vocabulary weighting layer is a weighted word vector. In the vocabulary weighting layer of FIG. 5, the result of g(x) is multiplied by the target word vector of the target vocabulary. The result of g(x) is also fed to f as its input z, and the output f(z) is multiplied by the unknown word vector of the unknown vocabulary. The two products are added, and the sum is input to the named entity extraction model as the weighted word vector corresponding to the target vocabulary.

Therefore, the output of the vocabulary weighting layer in FIG. 5 is g(x)·A + (1 - g(x))·U, where A is the target word vector of the target vocabulary, U is the unknown word vector, and x is the vocabulary weight of the target vocabulary. The weighted word vector is input to the named entity extraction model of FIG. 4 and used to train that model.
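The weighting layer output g(x)·A + (1 - g(x))·U can be sketched in a few lines. The text only requires g to map the vocabulary weight into the range 0 to 1; the logistic sigmoid used here is an assumption suggested by the operator symbol "σ", and the function names and the short 3-dimensional vectors are hypothetical (the description uses 256 dimensions).

```python
import math
from typing import List


def g(x: float) -> float:
    """First weight. Assumed choice: logistic sigmoid, which maps x into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def f(z: float) -> float:
    """Second weight: 1 - z, so the first and second weights sum to 1."""
    return 1.0 - z


def weighted_word_vector(x: float, target: List[float], unknown: List[float]) -> List[float]:
    """Blend the target word vector and the shared unknown word vector element-wise."""
    z = g(x)
    return [z * a + f(z) * u for a, u in zip(target, unknown)]


A = [1.0, 2.0, 3.0]  # target word vector (hypothetical values)
U = [0.0, 0.0, 0.0]  # shared unknown word vector (hypothetical values)

halfway = weighted_word_vector(0.0, A, U)   # g(0) = 0.5, an even blend of A and U
high = weighted_word_vector(10.0, A, U)     # large weight: output stays close to A
low = weighted_word_vector(-10.0, A, U)     # small weight: output collapses toward U
```

Under this assumption, a word whose learned weight is strongly negative is represented almost entirely by the unknown word vector, which is consistent with treating it as a candidate for removal from the table.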

ステップ303では、第1語彙テーブルに基づいて予備トレーニングモデルをトレーニングする。 In step 303, a preliminary training model is trained based on the first vocabulary table.

予備トレーニングモデルは図5に示すように、第1語彙テーブル中の語彙を語彙重み付け層に入力して、語彙重み付け層の出力を図4に示す固有表現抽出モデルに入力する。図4に示す固有表現抽出モデルは標準的な双方向長短期記憶（LSTM、Long Short Term Memory）ネットワーク構造である。図4には入力が三つの単語である文の例示であり、より多い単語を含むより長い文を入力する場合は、図示の構造を複数回に繰り返すことができる。 As shown in FIG. 5, the preliminary training model inputs the vocabulary in the first vocabulary table to the vocabulary weighting layer and inputs the output of the vocabulary weighting layer to the named entity extraction model shown in FIG. The named entity extraction model shown in Fig. 4 has a standard bidirectional long short-term memory (LSTM) network structure. FIG. 4 shows an example of a sentence in which the input is three words, and in the case of inputting a longer sentence including more words, the illustrated structure can be repeated multiple times.

以下、図4に示す構造について説明する。 The structure shown in FIG. 4 will be described below.

a）モジュール「x0」、「x1」、「x2」はそれぞれ文中の語彙に対応するIDを入力とする。これらのIDは語彙重み付け層によって重み付け単語ベクトルに転換される。 a) Modules “x0”, “x1”, and “x2” each take an ID corresponding to the vocabulary in the sentence. These IDs are transformed into weighted word vectors by the vocabulary weighting layer.

b）モジュール「LSTM Cell」はLSTMモデルの基本ユニットである。フォワードとバックワードのLSTM Cellは、それぞれ異なるモデルパラメータ組を使用する。 b) The module "LSTM Cell" is the basic unit of the LSTM model. The forward and backward LSTM Cell use different model parameter sets.

c）モジュール「CONCAT」は、二つのベクトルをつなぎ合わせた演算子である。つなぎ合わせ操作の出力は入力されたベクトルの全要素を保留し、かつ出力するベクトルの次元は二つの入力されたベクトル次元の和である。 c) The module "CONCAT" is an operator that connects two vectors. The output of the concatenation operation holds all the elements of the input vector, and the dimension of the output vector is the sum of the two input vector dimensions.

d）モジュール「SOFTMAX」は出力を生成するための標準Softmax層である。Softmax層のパラメータは現在のモデルで共有される。 d) The module "SOFTMAX" is a standard Softmax layer for producing output. The parameters of the Softmax layer are shared by the current model.

e）モジュール「OUTPUT」は1次元の2値出力（0または1を出力）であり、現在の語彙が人名エンティティであるかを識別する。 e) Module "OUTPUT" is a one-dimensional binary output (0 or 1 output) that identifies whether the current vocabulary is a person name entity.

f）確率的勾配降下法（SGD、Stochastic Gradient Descent）によりモデルを最適化して、最適化過程において、モデルパラメータおよび語彙重みが更新される。 f) The model is optimized by Stochastic Gradient Descent (SGD), and model parameters and vocabulary weights are updated in the optimization process.
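The CONCAT, SOFTMAX, and OUTPUT modules for a single word position can be sketched as follows. The LSTM cells themselves are omitted, so the forward and backward hidden states are supplied as hypothetical values. The linear projection in front of the softmax is an assumption (the text only says the Softmax layer has shared parameters), and the matrix `W`, bias `b`, and all function names are illustrative.

```python
import math
from typing import List


def concat(forward: List[float], backward: List[float]) -> List[float]:
    """CONCAT: keep every element; output dimension is the sum of the input dimensions."""
    return list(forward) + list(backward)


def linear(W: List[List[float]], b: List[float], h: List[float]) -> List[float]:
    """Assumed projection from the concatenated state to one score per class."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over the class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def output_label(probs: List[float]) -> int:
    """OUTPUT: 1 if the word is judged a person name entity, otherwise 0."""
    return max(range(len(probs)), key=probs.__getitem__)


fwd = [0.2, -0.1]  # hypothetical forward LSTM hidden state
bwd = [0.4, 0.3]   # hypothetical backward LSTM hidden state
W = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]  # hypothetical shared parameters
b = [0.0, 0.0]

h = concat(fwd, bwd)
probs = softmax(linear(W, b, h))
label = output_label(probs)
```

The same `W` and `b` would be reused at every word position, which is one reading of the statement that the Softmax layer parameters are shared.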

ステップ304は語彙重みに基づき第1語彙テーブルをフィルタリングして、第2語彙テーブルを得る。 Step 304 filters the first vocabulary table based on the vocabulary weights to obtain the second vocabulary table.

Here, the words can be sorted in descending order of the vocabulary weights obtained in step 303, and a second number of the top-ranked words can be selected to form the second vocabulary table. For example, the 20,000 words with the largest weights are selected to form the second vocabulary table.

ステップ305は第2語彙テーブルに基づいて、ターゲットニューラルネットワークモデルをトレーニングする。 Step 305 trains the target neural network model based on the second vocabulary table.

Here, the second vocabulary table is specified and the named entity extraction model is trained to obtain the final target model. At this stage the named entity extraction model does not include the vocabulary weighting layer.

Specifically, a grid search can be performed with the named entity extraction model to tune the model's hyperparameters, which include the batch size, learning rate, dropout rate, and number of training iterations. Then, using these hyperparameters, several (for example, nine) models are initialized with different random seeds and trained, and their outputs are combined by voting to obtain the final target model. Note that the embodiments of the present invention can adopt any of the existing model training approaches of the prior art and are not specifically limited in this respect.
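The grid enumeration and the voting step can be sketched as below. Training itself is left out, so the per-model predictions are hypothetical placeholders. The candidate hyperparameter values and the function names are also assumptions, and three models are used instead of nine to keep the example short.

```python
from itertools import product
from typing import Dict, List, Sequence


def grid(space: Dict[str, Sequence]) -> List[dict]:
    """Enumerate every combination of candidate hyperparameter values."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def majority_vote(predictions: List[List[int]]) -> List[int]:
    """predictions[m][t] is model m's 0/1 label for word t; an odd model count avoids ties."""
    voted = []
    for votes in zip(*predictions):
        voted.append(1 if 2 * sum(votes) > len(votes) else 0)
    return voted


# Hypothetical search space; the values are illustrative only.
space = {"batch_size": [32, 64], "learning_rate": [0.1, 0.01], "dropout_rate": [0.5]}
configs = grid(space)

# Hypothetical outputs of three models trained from different random seeds.
per_model = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
]
final_labels = majority_vote(per_model)
```

Each entry of `configs` would be trained and scored on held-out data, and the best configuration would then be reused for the seeded ensemble.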

以上の方法に基づき、本発明の実施例はさらに上記の方法を実施する装置を提供し、図6によると、本発明の実施例が提供する語彙テーブルの選択装置600が以下を含む。 Based on the above method, the embodiment of the present invention further provides an apparatus for performing the above method, and according to FIG. 6, the vocabulary table selection apparatus 600 provided by the embodiment of the present invention includes the following.

予備トレーニングモデルのモデリングユニット601により、語彙重み付け層をターゲットニューラルネットワークモデルに導入して、予備トレーニングモデルを構築し、前記語彙重み付け層は語彙ウェイトに基づいて第1語彙テーブル中のターゲット語彙を重み付けするようにし、かつ重み付け処理によって得られたターゲット語彙を前記ターゲットニューラルネットワークモデルに入力する。 The vocabulary weighting layer is introduced into the target neural network model by the modeling unit 601 of the preliminary training model to build the preliminary training model, and the vocabulary weighting layer weights the target vocabulary in the first vocabulary table based on the vocabulary weights. Then, the target vocabulary obtained by the weighting process is input to the target neural network model.

The first training unit 602 trains the preliminary training model based on the first vocabulary table, updates the model parameters of the preliminary training model and the vocabulary weights of the vocabulary weighting layer, and, after training ends, obtains the vocabulary weights of the target vocabulary in the first vocabulary table.

語彙選択ユニット603により、前記語彙ウェイトによって前記第1語彙テーブルを選別して、第2語彙テーブルを得る。 The vocabulary selection unit 603 selects the first vocabulary table according to the vocabulary weight to obtain a second vocabulary table.

以上のユニットにより、本発明の実施例の語彙テーブルの選択装置600は、第1語彙テーブルを簡約化して、よりモデルトレーニングに適した第2語彙テーブルを選択かつ生成することができる。 With the above units, the vocabulary table selection device 600 according to the embodiment of the present invention can simplify the first vocabulary table and select and generate the second vocabulary table more suitable for model training.

図7によると、本発明の実施例が提供するもう一つの語彙テーブルの選択装置700は、図6に示した類似的なユニットを含むほか、さらに以下を含む。 According to FIG. 7, another vocabulary table selection device 700 provided by the embodiment of the present invention includes the similar unit shown in FIG. 6, and further includes:

第2トレーニングユニット604により、前記第2語彙テーブルに基づいて、前記ターゲットニューラルネットモデルをトレーニングする。 A second training unit 604 trains the target neural network model based on the second vocabulary table.

ここで、第2トレーニングユニット604は第2語彙テーブルによってターゲットニューラルネットモデルをトレーニングするが、第2語彙テーブルにより少なくより品質の高い語彙が含まれているため、トレーニング効率を向上かつトレーニングに必要な時間を減少することができ、学習し得たモデルの精度を向上させることができる。 Here, the second training unit 604 trains the target neural network model with the second vocabulary table, but since the second vocabulary table contains fewer and higher quality vocabulary, it is necessary to improve the training efficiency and training. The time can be reduced and the accuracy of the learned model can be improved.

As one implementation, in the vocabulary table selection device 600 or the vocabulary table selection device 700 of the above embodiments, the modeling unit 601 of the preliminary training model specifically computes a weighted sum of the target word vector corresponding to the target vocabulary and the unknown word vector corresponding to the unknown vocabulary. The first weight, applied to the target word vector, is a positively correlated function of the vocabulary weight of the target vocabulary; the second weight, applied to the unknown word vector, is a negatively correlated function of that vocabulary weight; and the sum of the first weight and the second weight is a predetermined value. The unknown vocabulary consists of words that do not exist in the first vocabulary table, and all words that do not exist in the first vocabulary table correspond to the same unknown word vector.

一つの実現方式として、前記第1重みは前記ターゲット語彙の語彙重みの第1関数であり、前記第1関数は前記ターゲット語彙の語彙重みを0から1までにマッピングするように用いられる。前記第2重みは前記第1重みの第二関数であり、且つ前記第1重みと負の相関である。 As one implementation, the first weight is a first function of the vocabulary weight of the target vocabulary, and the first function is used to map the vocabulary weight of the target vocabulary from 0 to 1. The second weight is a second function of the first weight and has a negative correlation with the first weight.

Here, the target word vector and the unknown word vector can be initialized either randomly or by a word vector pre-training algorithm.

As one implementation, the vocabulary selection unit 603 is specifically used as follows. It selects a second number of words from the first vocabulary table in descending order of vocabulary weight to obtain the second vocabulary table. Alternatively, it selects from the first vocabulary table the words whose vocabulary weights fall within a preset numerical range to obtain the second vocabulary table. In either case, the second vocabulary table contains fewer words than the first vocabulary table.

Further, the vocabulary table selection device 600 or the vocabulary table selection device 700 may include a vocabulary table generation unit (not shown). This unit cleans the original corpus data, splits the cleaned data into sentences, splits the sentences to obtain a plurality of words, and selects a first number of words in descending order of frequency of occurrence in the original corpus to obtain the first vocabulary table.

Note that, in the embodiments of the present invention, the target neural network model and the preliminary training model are both models built for the same target task. For example, both use the same objective function as the optimization target during training.

Referring to FIG. 8, the embodiments of the present invention further provide a hardware structure of the vocabulary table selection device. As shown in FIG. 8, the vocabulary table selection device 800 includes a processor 802 and a memory 804 in which computer program commands are stored.

ここで、前記コンピュータプログラムコマンドが前記プロセッサーにより実行された時に、前記プロセッサー802を下記のステップを行わせる。 Here, when the computer program command is executed by the processor, it causes the processor 802 to perform the following steps.

語彙重み付け層をターゲットニューラルネットワークモデルに導入して、予備トレーニングモデルを構築し、前記語彙重み付け層は語彙重みに基づいて第1語彙テーブル中のターゲット語彙を重み付けするようにし、かつ重み付け処理によって得られたターゲット語彙を前記ターゲットニューラルネットワークモデルに入力する。 A vocabulary weighting layer is introduced into the target neural network model to build a pre-training model, said vocabulary weighting layer being adapted to weight the target vocabulary in the first vocabulary table based on the vocabulary weights and obtained by the weighting process. The target vocabulary is input to the target neural network model.

前記第1語彙テーブルに基づき、前記予備トレーニングモデルをトレーニングして、前記予備トレーニングモデルのモデルパラメータおよび語彙重み付け層の語彙重みを更新し、かつトレーニング終了後に、前記第1語彙テーブルにおけるターゲット語彙の語彙重みを取得する。 Based on the first vocabulary table, the preliminary training model is trained to update the model parameters of the preliminary training model and the vocabulary weights of the vocabulary weighting layer, and after the training, the vocabulary of the target vocabulary in the first vocabulary table. Get the weight.

前記語彙重みによって前記第1語彙テーブルを選別して、第2語彙テーブルを得る。 The second vocabulary table is obtained by selecting the first vocabulary table according to the vocabulary weight.

さらに、図8に示すように、当該語彙テーブルの選択装置800はさらに、ネットワークインタフェース801、入力デバイス803、ハードディスク805および表示デバイス806が含まれる。 Further, as shown in FIG. 8, the vocabulary table selection device 800 further includes a network interface 801, an input device 803, a hard disk 805, and a display device 806.

The above interfaces and devices are connected to one another through a bus architecture. The bus architecture may include any number of interconnected buses and bridges. Specifically, it links together one or more central processing units (CPUs), represented by the processor 802, and the various circuits of one or more memories, represented by the memory 804. The bus architecture can also link various other circuits, such as peripheral devices, voltage regulators, and power management circuits. In other words, the bus architecture provides the connection and communication between these components. It is well known in the art that, in addition to a data bus, the bus architecture includes a power bus, a control bus, and a status signal bus, so a detailed description is omitted.

前記ネットワークインターフェース801はネットワーク（例えばインターネット、ローカルエリアネットワークなど）に接続されて、ネットワークから情報を受信し、受信した情報をハードディスク805に保存し、例えば受信したコーパスデータを生成するためのテキストデータをハードディスク805に保存する。 The network interface 801 is connected to a network (for example, the Internet, a local area network, etc.), receives information from the network, stores the received information in the hard disk 805, and stores, for example, text data for generating the received corpus data. Save to hard disk 805.

The input device 803 receives various commands entered by an operator and sends them to the processor 802 for execution. The input device 803 may include a keyboard or a pointing device (for example, a mouse, a trackball, a touchpad, or a touch screen).

前記表示デバイス806はプロセッサー802がコマンドを実行して得た結果を表示、例えば生成された第2語彙テーブルなどを表示することができる。 The display device 806 can display a result obtained by the processor 802 executing the command, for example, a generated second vocabulary table.

前記メモリ804は、システム稼動時に必須なプログラムとデータ、およびプロセッサー802の計算過程における中間結果などのデータを格納するように用いられ。 The memory 804 is used to store programs and data essential when the system is operating, and data such as intermediate results in the calculation process of the processor 802.

It will be understood that the memory 804 in the embodiments of the present invention may be a volatile memory or a non-volatile memory, or may include both. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), or an electrically erasable programmable read-only memory (EEPROM). The volatile memory may be a random access memory (RAM), which is used as an external cache. The memory 804 of the devices and methods described herein includes, but is not limited to, these and any other suitable types of memory.

いくつかの実施形態において、メモリ804は以下の要素を格納しており、実行可能なモジュールまたはデータ構造、或いはそれらのサブ集合または拡張集合、即ち、オペレーティングシステム8041とアプリケーションプログラム8042である。 In some embodiments, memory 804 stores the following elements, which are executable modules or data structures, or sub-sets or extensions thereof, namely operating system 8041 and application programs 8042.

Here, the operating system 8041 includes various system programs, such as a framework layer, a core layer, and a driver layer, for implementing various basic services and hardware-based tasks. The application programs 8042 include various application programs, such as a browser, for implementing various application services. A program that implements the method of the embodiments of the present invention can be included in the application programs 8042.

The method according to the above embodiments of the present invention can be applied to, or implemented by, the processor 802. The processor 802 may be an integrated circuit chip with signal processing capability. In implementation, each step of the above method can be completed by an integrated logic circuit of hardware in the processor 802 or by commands in the form of software. The processor 802 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present invention may be executed and completed directly by a hardware decoding processor, or by a combination of hardware and software modules within a decoding processor. The software modules can reside in a storage medium that is well established in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 804, and the processor 802 reads the information in the memory 804 and completes the steps of the above method in combination with its hardware.

It should be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing unit may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic modules for performing the functions described in the present application, or a combination thereof.

For a software implementation, the techniques described herein may be implemented by modules (for example, procedures or functions) that perform the functions described herein. The software code may be stored in a memory and executed by a processor. The memory may be implemented inside or outside the processor.

Specifically, when executed by the processor 802, the computer program can implement the following steps.

The target neural network model is trained with the second vocabulary table.

A weighted sum is computed over the target word vector corresponding to the target vocabulary and the unknown word vector corresponding to the unknown vocabulary. The first weight, applied to the target word vector, is a positively correlated function of the vocabulary weight of the target vocabulary; the second weight, applied to the unknown word vector, is a negatively correlated function of that vocabulary weight; and the sum of the first weight and the second weight is a predetermined value, for example 1. The unknown vocabulary is vocabulary that does not exist in the first vocabulary table, and all vocabulary that does not exist in the first vocabulary table corresponds to the same unknown word vector.

The first weight may be a first function of the vocabulary weight of the target vocabulary, where the first function maps the vocabulary weight into the range from 0 to 1. The second weight is a second function of the first weight and is negatively correlated with the first weight. The target word vector and the unknown word vector may be initialized by random initialization or by a word vector pre-training algorithm.
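The weighting described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the embodiment: the choice of a sigmoid as the first function, the choice of "1 minus the first weight" as the second function, and the names `first_weight` and `weighted_embedding` are all assumptions made for the example. The text only requires that the first function map into the range 0 to 1, that the second weight be negatively correlated with the first, and that the two sum to a predetermined value (1 here).

```python
import math


def first_weight(vocab_weight: float) -> float:
    """Assumed first function: a sigmoid, which increases with the
    vocabulary weight and maps it into the open interval (0, 1)."""
    if vocab_weight >= 0:
        return 1.0 / (1.0 + math.exp(-vocab_weight))
    # Equivalent form that avoids overflow for large negative inputs.
    e = math.exp(vocab_weight)
    return e / (1.0 + e)


def weighted_embedding(target_vec, unknown_vec, vocab_weight):
    """Weighted sum of the target word vector and the shared unknown
    word vector. The second weight is assumed to be 1 - first weight,
    so the two weights always sum to the predetermined value 1."""
    a = first_weight(vocab_weight)
    b = 1.0 - a
    return [a * t + b * u for t, u in zip(target_vec, unknown_vec)]


# Toy vectors (hypothetical values, two dimensions for readability).
target = [1.0, 0.0]
unknown = [0.0, 1.0]
mixed = weighted_embedding(target, unknown, 0.0)  # both weights are 0.5
```

With a large vocabulary weight the output approaches the target word vector, and with a strongly negative one it approaches the unknown word vector, which is what lets the learned weight indicate how much a word matters to the task.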

The second vocabulary table can be obtained by selecting a second number of vocabulary items from the first vocabulary table in descending order of vocabulary weight. Alternatively, the second vocabulary table can be obtained by selecting from the first vocabulary table the vocabulary items whose vocabulary weights fall within a preset numerical range. In either case, the number of vocabulary items in the second vocabulary table is smaller than the number in the first vocabulary table.
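The two selection strategies can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names `select_top_k` and `select_in_range`, the dictionary representation of the learned vocabulary weights, the inclusive range bounds, and the sample weights are all hypothetical and chosen only for the example.

```python
def select_top_k(vocab_weights: dict, k: int) -> list:
    """Keep the k vocabulary items with the highest vocabulary weight
    (k plays the role of the 'second number')."""
    ranked = sorted(vocab_weights, key=vocab_weights.get, reverse=True)
    return ranked[:k]


def select_in_range(vocab_weights: dict, low: float, high: float) -> list:
    """Keep the items whose vocabulary weight lies in [low, high].
    Inclusive bounds are an assumption of this sketch."""
    return [w for w, v in vocab_weights.items() if low <= v <= high]


# Hypothetical weights obtained after training the preliminary model.
weights = {"tokyo": 2.1, "the": -0.7, "ricoh": 1.4, "of": -1.2}

second_table_a = select_top_k(weights, 2)
second_table_b = select_in_range(weights, 0.0, 3.0)
```

Both calls return a table smaller than the first vocabulary table, provided `k` is less than the table size or the range excludes at least one item.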

Data cleaning is performed on the original corpus data; the cleaned corpus data is split into sentences, and the sentences are segmented into words to obtain a plurality of vocabulary items. A first number of vocabulary items is then selected in descending order of frequency of occurrence in the original corpus to obtain the first vocabulary table.
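The construction of the first vocabulary table can be sketched as below. The cleaning rules, the punctuation-based sentence splitter, and the regular-expression tokenizer are simplifying assumptions for English-like text; the source does not specify them, and a language such as Japanese or Chinese would need a proper word segmenter. The function name `build_first_vocab` is likewise hypothetical.

```python
import re
from collections import Counter


def build_first_vocab(raw_text: str, first_number: int) -> list:
    """Clean the corpus, split it into sentences, tokenize each sentence,
    and keep the `first_number` most frequent vocabulary items."""
    # Data cleaning (assumed rules): drop markup-like residue,
    # collapse whitespace, and lowercase.
    cleaned = re.sub(r"<[^>]+>", " ", raw_text)
    cleaned = re.sub(r"\s+", " ", cleaned).strip().lower()

    # Sentence splitting on terminal punctuation (a simplification).
    sentences = [s for s in re.split(r"[.!?]+", cleaned) if s.strip()]

    # Word segmentation and frequency counting.
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z0-9']+", sentence))

    # Most frequent items first.
    return [word for word, _ in counts.most_common(first_number)]


corpus = "<p>The cat sat. The cat ran!</p> A dog sat?"
first_table = build_first_vocab(corpus, 3)
```

In this toy corpus, "the", "cat", and "sat" each occur twice and make up the resulting table, while the words that occur once are dropped.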

Preferably, the target neural network model and the preliminary training model are both models built for the same target task.

A person skilled in the art will appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. A person skilled in the art may implement the described functions in different ways for each particular application, but such implementations should not be regarded as going beyond the scope of the present invention.

A person skilled in the art will understand that, for convenience and brevity of description, the specific operation of the systems, devices, and units described above may be understood by reference to the corresponding processes in the foregoing method embodiments, and is not repeated here.

It goes without saying that the methods and devices described in the embodiments provided in the present application can also be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into modules is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple modules or units may be combined or integrated into another system, or some features may be omitted or not executed. Moreover, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections via an interface, device, or module, and may be electrical, mechanical, or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiments of the present invention.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.

When the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. On this understanding, the essence of the technical solution of the present invention, or the part that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, ROM, RAM, a magnetic disk, or an optical disc.

The above are specific embodiments of the present invention, but the scope of protection of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention falls within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of protection of the claims.

Claims

A method for selecting a vocabulary table, comprising:
introducing a vocabulary weighting layer into a target neural network model to build a preliminary training model, wherein the vocabulary weighting layer weights target vocabulary in a first vocabulary table according to vocabulary weights and inputs the weighted target vocabulary to the target neural network model;
training the preliminary training model based on the first vocabulary table, updating model parameters of the preliminary training model and the vocabulary weights of the vocabulary weighting layer, and obtaining, after training, the vocabulary weights of the target vocabulary in the first vocabulary table; and
selecting vocabulary from the first vocabulary table according to the vocabulary weights to obtain a second vocabulary table.

The selection method according to claim 1, further comprising, after obtaining the second vocabulary table,
training the target neural network model with the second vocabulary table.

The selection method according to claim 1, wherein weighting the target vocabulary in the first vocabulary table according to the vocabulary weights comprises:
computing a weighted sum over a target word vector corresponding to the target vocabulary and an unknown word vector corresponding to unknown vocabulary, wherein a first weight of the target word vector is a positively correlated function of the vocabulary weight of the target vocabulary, a second weight of the unknown word vector is a negatively correlated function of the vocabulary weight of the target vocabulary, and the sum of the first weight and the second weight is a predetermined value; and wherein the unknown vocabulary is vocabulary that does not exist in the first vocabulary table, and all vocabulary that does not exist in the first vocabulary table corresponds to the same unknown word vector.

The selection method according to claim 3, wherein:
the first weight is a first function of the vocabulary weight of the target vocabulary, the first function mapping the vocabulary weight of the target vocabulary into the range from 0 to 1; and
the second weight is a second function of the first weight and is negatively correlated with the first weight.

The selection method according to claim 1 or 2, wherein selecting vocabulary from the first vocabulary table according to the vocabulary weights comprises:
selecting a second number of vocabulary items from the first vocabulary table in descending order of vocabulary weight to obtain the second vocabulary table, or selecting from the first vocabulary table the vocabulary items whose vocabulary weights fall within a preset numerical range to obtain the second vocabulary table,
wherein the number of vocabulary items in the second vocabulary table is less than the number of vocabulary items in the first vocabulary table.

The selection method according to claim 1 or 2, wherein
the target neural network model and the preliminary training model are both models built for the same target task.

A vocabulary table selection device, comprising:
a modeling unit that introduces a vocabulary weighting layer into a target neural network model to build a preliminary training model, wherein the vocabulary weighting layer weights target vocabulary in a first vocabulary table according to vocabulary weights and inputs the weighted target vocabulary to the target neural network model;
a first training unit that trains the preliminary training model based on the first vocabulary table, updates model parameters of the preliminary training model and the vocabulary weights of the vocabulary weighting layer, and obtains, after training, the vocabulary weights of the target vocabulary in the first vocabulary table; and
a vocabulary selection unit that selects vocabulary from the first vocabulary table according to the vocabulary weights to obtain a second vocabulary table.

The selection device according to claim 7, further comprising: a second training unit that trains the target neural network model according to the second vocabulary table.

The selection device according to claim 7 or 8, wherein
a weighted sum is computed over a target word vector corresponding to the target vocabulary and an unknown word vector corresponding to unknown vocabulary, wherein a first weight of the target word vector is a positively correlated function of the vocabulary weight of the target vocabulary, a second weight of the unknown word vector is a negatively correlated function of the vocabulary weight of the target vocabulary, and the sum of the first weight and the second weight is a predetermined value; and wherein the unknown vocabulary is vocabulary that does not exist in the first vocabulary table, and all vocabulary that does not exist in the first vocabulary table corresponds to the same unknown word vector.

A program for causing a computer to execute the method of selecting a vocabulary table according to any one of claims 1 to 6.

A computer-readable storage medium storing the program according to claim 10.