JP2022530447A

JP2022530447A - Chinese word segmentation method, apparatus, storage medium, and computer device based on deep learning

Info

Publication number: JP2022530447A
Application number: JP2021563188A
Authority: JP
Inventors: ▲ミン▼川陳; 駿馬; 少軍王
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-04-22
Filing date: 2019-11-14
Publication date: 2022-06-29
Anticipated expiration: 2039-11-14
Also published as: CN110222329A; WO2020215694A1; CN110222329B; JP7178513B2; SG11202111464WA

Abstract

Training corpus data is converted into character-level data, and the character-level data is converted into sequence data. The sequence data is split at preset symbols to obtain a plurality of subsequence data items, which are grouped by length into K data sets. K trained temporal convolutional neural network-conditional random field (TCN-CRF) models are obtained from the K data sets. Data obtained by processing target corpus data is input into at least one of the K trained TCN-CRF models to obtain a word segmentation result for the target corpus data. This provides a Chinese word segmentation method that can solve the problem of low Chinese word segmentation accuracy. [Selected drawing] FIG. 1

Description

This application claims priority to the Chinese patent application filed with the China Patent Office on April 22, 2019, with application number 201910322127.8 and entitled "Chinese word segmentation method and apparatus based on deep learning", the entire contents of which are incorporated into this application by reference.

The present application relates to the technical field of artificial intelligence, and in particular to a Chinese word segmentation method, apparatus, storage medium, and computer device based on deep learning.

Conventional deep learning algorithms for Chinese word segmentation are mainly based on recurrent neural network models, typified by long short-term memory (LSTM), and their derivatives. However, the ability of an LSTM model to handle sequence data degrades as the sequence length increases, which results in low Chinese word segmentation accuracy.

In view of the above, to solve the problem of low Chinese word segmentation accuracy in the prior art, embodiments of the present application provide a Chinese word segmentation method, apparatus, storage medium, and computer device based on deep learning.

In one aspect, an embodiment of the present application provides a Chinese word segmentation method based on deep learning. The method includes: converting training corpus data into character-level data; converting the character-level data into sequence data; splitting the sequence data at preset symbols to obtain a plurality of subsequence data items, and grouping the subsequence data items by length to obtain K data sets, where all subsequence data items in a given data set have the same length and K is a natural number greater than 1; extracting a plurality of subsequence data items from the i-th data set, inputting them into the i-th temporal convolutional neural network-conditional random field (TCN-CRF) model, and training that model to obtain the trained i-th TCN-CRF model, with i taking each natural number from 1 to K in turn, so that K trained TCN-CRF models are obtained in total; and converting target corpus data into character-level data to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained TCN-CRF models to obtain a word segmentation result for the target corpus data.

In one aspect, an embodiment of the present application provides a Chinese word segmentation apparatus based on deep learning. The apparatus includes: a first conversion unit for converting training corpus data into character-level data; a second conversion unit for converting the character-level data into sequence data; a first splitting unit for splitting the sequence data at preset symbols to obtain a plurality of subsequence data items and grouping the subsequence data items by length to obtain K data sets, where all subsequence data items in a given data set have the same length and K is a natural number greater than 1; a first determination unit for extracting a plurality of subsequence data items from the i-th data set, inputting them into the i-th temporal convolutional neural network-conditional random field (TCN-CRF) model, and training that model to obtain the trained i-th TCN-CRF model, with i taking each natural number from 1 to K in turn, so that K trained TCN-CRF models are obtained in total; and a second determination unit for converting target corpus data into character-level data to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained TCN-CRF models to obtain a word segmentation result for the target corpus data.

In one aspect, an embodiment of the present application provides a storage medium containing a stored program. When the program runs, it controls the device on which the storage medium is located to perform the Chinese word segmentation method based on deep learning described above.

In one aspect, an embodiment of the present application provides a computer device that includes a memory for storing information including program instructions and a processor for controlling execution of the program instructions. When the program instructions are loaded and executed by the processor, the steps of the Chinese word segmentation method based on deep learning described above are carried out.

In the embodiments of the present application, target corpus data is converted into character-level data, the character-level data is converted into sequence data, and the sequence data is input into a trained temporal convolutional neural network-conditional random field (TCN-CRF) model to obtain a word segmentation result for the target corpus data. By adding network layers, a temporal convolutional neural network can widen its receptive field at an exponentially increasing rate. It can therefore process long sequences and other data with complex characteristics, and the more accurate encoding improves the accuracy of Chinese word segmentation.

To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed for the embodiments are briefly described below. The drawings described below show only some embodiments of the present application, and a person skilled in the art can derive other drawings from them without creative effort.
The drawings are: a flowchart of an optional Chinese word segmentation method based on deep learning according to an embodiment of the present application; a schematic diagram of an optional Chinese word segmentation apparatus based on deep learning according to an embodiment of the present application; and a schematic diagram of an optional computer device according to an embodiment of the present application.

For a better understanding of the technical solutions of the present application, embodiments of the present application are described in detail below with reference to the drawings.

The embodiments described are clearly only some, not all, of the embodiments of the present application. All other embodiments that a person skilled in the art obtains from the embodiments of the present application without creative effort also fall within the scope of protection of the present application.

The terms used in the embodiments of the present application serve only to describe particular embodiments and are not intended to limit the present application. The singular forms "a", "said", and "the" used in the embodiments of the present application and the appended claims are intended to include the plural forms as well, unless the context clearly indicates otherwise.

The term "and/or" as used herein merely describes an association between related objects and indicates that three relationships are possible. For example, "A and/or B" may mean that A exists alone, that A and B exist together, or that B exists alone. The symbol "/" as used herein generally indicates an "or" relationship between the objects before and after it.

FIG. 1 is a flowchart of an optional Chinese word segmentation method based on deep learning according to an embodiment of the present application. As shown in FIG. 1, the method includes steps S102, S104, S106, S108, and S110.

In step S102, training corpus data is converted into character-level data.

In step S104, the character-level data is converted into sequence data.

In step S106, the sequence data is split at preset symbols to obtain a plurality of subsequence data items, and the subsequence data items are grouped by length to obtain K data sets. All subsequence data items in a given data set have the same length, and K is a natural number greater than 1. The preset symbols are punctuation marks used to divide sentences, for example the period, question mark, exclamation mark, comma, enumeration comma, semicolon, and colon.
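The splitting and grouping in step S106 can be sketched as follows. This is a minimal illustration, not the patented implementation: it works on raw characters rather than on encoded sequence data, and the delimiter set `DELIMITERS` and the function names are assumptions introduced here.

```python
import re
from collections import defaultdict

# Hypothetical delimiter set: sentence-dividing punctuation (full-width and ASCII).
DELIMITERS = "。？！，、；：.?!,;:"

def split_by_punctuation(text):
    """Split text at any delimiter character, dropping empty pieces."""
    pattern = "[" + re.escape(DELIMITERS) + "]"
    return [piece for piece in re.split(pattern, text) if piece]

def group_by_length(subsequences):
    """Map each length to the list of subsequences having that length."""
    groups = defaultdict(list)
    for sub in subsequences:
        groups[len(sub)].append(sub)
    return dict(groups)

pieces = split_by_punctuation("我愛北京，天安門。你好！")
datasets = group_by_length(pieces)  # one data set per distinct length
```

Here the number of distinct lengths plays the role of K, and each value of `datasets` would be the training set for one model.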

In step S108, a plurality of subsequence data items are extracted from the i-th data set and input into the i-th temporal convolutional neural network-conditional random field (TCN-CRF) model, and that model is trained to obtain the trained i-th TCN-CRF model. Here i takes each natural number from 1 to K in turn, so that K trained TCN-CRF models are obtained in total.

In step S110, target corpus data is converted into character-level data to obtain first data, the first data is converted into sequence data to obtain second data, and the second data is input into at least one of the K trained TCN-CRF models to obtain a word segmentation result for the target corpus data.

Corpus data is a basic resource that carries linguistic knowledge with the electronic computer as its carrier; it consists of language material that has actually appeared in real language use.

The temporal convolutional neural network-conditional random field (TCN-CRF) model combines a temporal convolutional neural network (TCN) with a conditional random field (CRF). The temporal convolutional neural network is a deep learning temporal convolutional network, and the conditional random field is a typical discriminative model. The conditional random field treats word segmentation as the problem of classifying each character by its position within a word. The position labels are usually defined as follows: B for the beginning of a word, M for the middle of a word, E for the end of a word, and S for a single-character word. Segmentation with a conditional random field labels each character with its position and then forms the result from the characters spanning each B through the following E, together with the single-character words labeled S. For example, the sentence to be segmented, "我愛北京天安門" ("I love Beijing Tiananmen"), is labeled as 我/S 愛/S 北/B 京/E 天/B 安/M 門/E, and the segmentation result is "我/愛/北京/天安門".
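The step from position labels to words can be illustrated with a short sketch. The function name `tags_to_words` is introduced here for illustration only, and the code assumes the label sequence is well formed (every B is eventually closed by an E).

```python
def tags_to_words(chars, tags):
    """Rebuild words from per-character B/M/E/S position labels."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":          # single-character word
            words.append(ch)
            current = ""
        elif tag == "B":        # start a new word
            current = ch
        elif tag == "M":        # continue the current word
            current += ch
        elif tag == "E":        # close the current word
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    return words

words = tags_to_words("我愛北京天安門", "SSBEBME")
```

Applied to the example sentence and its labels, this yields the four words of the segmentation result given in the text.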

In the embodiments of the present application, target corpus data is converted into character-level data, the character-level data is converted into sequence data, and the sequence data is input into a trained TCN-CRF model to obtain a word segmentation result for the target corpus data. By adding network layers, a temporal convolutional neural network can widen its receptive field at an exponentially increasing rate. It can therefore process long sequences and other data with complex characteristics, and the more accurate encoding improves the accuracy of Chinese word segmentation.
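The exponential growth of the receptive field can be made concrete with a small calculation. The text does not specify the network configuration, so this sketch assumes one dilated convolution per layer with dilations 1, 2, 4, and so on, which is a common TCN design; `receptive_field` and `kernel_size` are names introduced here.

```python
def receptive_field(num_layers, kernel_size=2):
    """Receptive field of stacked dilated convolutions with dilations 1, 2, 4, ..."""
    field = 1
    for layer in range(num_layers):
        dilation = 2 ** layer
        # Each layer extends the field by (kernel_size - 1) * dilation positions.
        field += (kernel_size - 1) * dilation
    return field

fields = [receptive_field(n) for n in range(1, 6)]
```

Under these assumptions, with a kernel size of 2 the receptive field doubles with every added layer, so a modest number of layers is enough to cover a long subsequence.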

In addition, neurons on the same feature map of a temporal convolutional neural network share the same weights, so learning can proceed in parallel and processing is fast. The TCN-CRF model can therefore be implemented in a distributed system.

Optionally, the step of converting the character-level data into sequence data includes converting the character-level data into sequence data using a preset encoding scheme, which is either one-hot encoding or word vector encoding.

One-hot encoding, also called one-bit-effective encoding, uses an N-bit status register to encode N states. Each state has its own register bit, and only one bit is active at any time. For example, if a data set has the feature "color" with the values yellow, red, and green, then after one-hot encoding yellow becomes [100], red becomes [010], and green becomes [001]. One-hot encoded sequence data thus corresponds to vectors and can be used by a neural network model.
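The color example can be reproduced with a few lines of code. This is a minimal sketch; the function name `one_hot_encode` and the explicit vocabulary list are choices made here for illustration.

```python
def one_hot_encode(items, vocabulary):
    """Encode each item as a list with a single 1 at its vocabulary index."""
    index = {symbol: i for i, symbol in enumerate(vocabulary)}
    vectors = []
    for item in items:
        vector = [0] * len(vocabulary)
        vector[index[item]] = 1
        vectors.append(vector)
    return vectors

colors = ["yellow", "red", "green"]
encoded = one_hot_encode(["yellow", "red", "green"], colors)
```

For character-level data, the vocabulary would be the set of distinct characters in the corpus, so each character becomes a vector whose length equals the vocabulary size.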

Word vector encoding may be word2vec, an efficient algorithmic model that represents words as real-valued vectors; through training, it reduces the processing of text content to vector operations in a K-dimensional vector space. The word vectors output by word2vec can be used for many natural language processing (NLP) tasks, such as clustering, synonym search, and part-of-speech analysis. For example, word2vec takes the character-level data as features, maps the features into a K-dimensional vector space, and yields sequence data represented by those features.

Optionally, the step of inputting the extracted subsequence data items into the i-th TCN-CRF model and training it to obtain the trained i-th TCN-CRF model includes: step S1, inputting the extracted subsequence data items into the i-th temporal convolutional neural network, which is the temporal convolutional neural network of the i-th TCN-CRF model, and performing forward propagation to obtain first output data; step S2, calculating the value of a loss function from the first output data and the input subsequence data items; step S3, if the value of the loss function is greater than a preset value, inputting the subsequence data items into the i-th temporal convolutional neural network and performing backward propagation to optimize its network parameters; step S4, repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value; step S5, when the value of the loss function is less than or equal to the preset value, determining that training is complete and obtaining the trained i-th temporal convolutional neural network; and step S6, inputting the data output by the trained i-th temporal convolutional neural network into the i-th conditional random field, which is the conditional random field of the i-th TCN-CRF model, and training the i-th conditional random field to obtain the trained i-th TCN-CRF model.
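The control flow of steps S1 to S5 (forward pass, loss, comparison with a preset value, backward pass, repeat) can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: a one-parameter model `y = w * x` stands in for the temporal convolutional neural network, mean squared error stands in for the loss function, and the names `train_until_threshold`, `threshold`, and `lr` are hypothetical.

```python
def train_until_threshold(inputs, targets, threshold=1e-4, lr=0.1, max_iters=10_000):
    """Fit y = w * x by gradient descent, stopping once the loss <= threshold."""
    w = 0.0                                   # parameter initialization
    n = len(inputs)
    loss = float("inf")
    for _ in range(max_iters):                # S4: repeat S1 to S3
        outputs = [w * x for x in inputs]     # S1: forward propagation
        # S2: loss value (mean squared error, an assumed stand-in)
        loss = sum((o - t) ** 2 for o, t in zip(outputs, targets)) / n
        if loss <= threshold:                 # S5: training complete
            break
        # S3: backward propagation, here the gradient of the loss w.r.t. w
        grad = sum(2 * (o - t) * x for o, t, x in zip(outputs, targets, inputs)) / n
        w -= lr * grad                        # parameter update
    return w, loss

w, loss = train_until_threshold([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The `max_iters` cap is an addition not found in the text; it only guards the sketch against looping forever if the threshold is never reached.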

Specifically, the step of training the i-th temporal convolutional neural network based on the value of the loss function includes: initializing the network parameters of the i-th temporal convolutional neural network; training the network iteratively using stochastic gradient descent, calculating the value of the loss function once per iteration; and iterating until the value of the loss function is minimized, thereby obtaining the trained i-th temporal convolutional neural network and its converged network parameters.

Specifically, the loss function may be calculated by the following formula (Equation 1, inserted next).

Here, Loss denotes the value of the loss function; N denotes the number of subsequence data items input into the i-th temporal convolutional neural network; y^(i) denotes the i-th subsequence data item input into the i-th temporal convolutional neural network; and the symbol inserted next as Equation 2 denotes the data output after the i-th subsequence data item is input into the i-th temporal convolutional neural network.

Optionally, the step of training the i-th conditional random field includes: calculating the conditional probability of the output data of the i-th conditional random field from the data output by the trained i-th temporal convolutional neural network; and training with maximum likelihood estimation to obtain the maximum value of the conditional probability of the output data of the i-th conditional random field.

A conditional random field is a Markov random field of the random variable Y given the random variable X. In a Markov random field, each random variable depends only on its neighboring random variables and is independent of the non-neighboring ones.

In the conditional probability model P(Y|X), Y is the output variable and represents the label sequence, also called the state sequence; X is the input variable and represents the observation sequence to be labeled. During training, the conditional probability model is obtained from the training data by maximum likelihood estimation. The model is then used for prediction: for a given input sequence X, the output sequence Y is the one with the highest conditional probability. A linear-chain conditional random field is generally used. The input sequence is X = (X1, X2, ..., Xn), and the output sequence Y = (Y1, Y2, ..., Yn) is a sequence of random variables arranged in a linear chain. Given the random variable sequence X, the conditional probability distribution P(Y|X) of the random variable sequence Y constitutes a conditional random field.
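Finding the output sequence with the highest conditional probability in a linear-chain conditional random field is commonly done with the Viterbi algorithm. The text does not name a decoding algorithm, so the following is a sketch of that standard technique under assumed inputs: `emissions` and `transitions` are hypothetical score tables (log-potentials), not quantities defined in the source.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label index sequence.

    emissions[t][j]: score of label j at position t.
    transitions[i][j]: score of moving from label i to label j.
    """
    num_tags = len(emissions[0])
    best = list(emissions[0])      # best score of any path ending in each label
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = [], []
        for j in range(num_tags):
            # Best previous label for arriving at label j.
            prev = max(range(num_tags), key=lambda i: best[i] + transitions[i][j])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][j] + scores[j])
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label.
    tag = max(range(num_tags), key=lambda j: best[j])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

path = viterbi([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
```

In a TCN-CRF arrangement, the emission scores would plausibly come from the network output and the transition scores from the trained conditional random field, with label indices standing for B, M, E, and S.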

Maximum likelihood estimation means running a number of trials, observing the results, and using them to find the parameter value that maximizes the probability of the observed sample. It is a method for estimating model parameters from observed data, that is, for the case where "the model is known and the parameters are unknown". Given the known sample data X = (X1, X2, ..., Xn), where n is the number of samples, and a parameter t to be estimated, the likelihood function of t with respect to X is given by Equation 3 below.

Here, i is a natural number from 1 to n. If t' is the value of t that maximizes the likelihood function f(t) over the parameter space, then t' is the "most likely" parameter, and t' is the maximum likelihood estimate of t.
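The formula referred to as Equation 3 does not survive in this text. For reference, the standard likelihood function for independent samples, written with the symbols used here (X_i, t, f, t'), has the following form; this is the textbook definition and is offered as an assumption about what the equation contains, not as a reproduction of it. The notation p(X_i | t) for the probability of one sample under parameter t is introduced here.

```latex
f(t) = \prod_{i=1}^{n} p(X_i \mid t),
\qquad
t' = \arg\max_{t} f(t) = \arg\max_{t} \sum_{i=1}^{n} \log p(X_i \mid t)
```

The second equality holds because the logarithm is monotonically increasing, which is why the log-likelihood is usually maximized in practice.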

Optionally, the step of inputting the second data into at least one of the K trained TCN-CRF models to obtain a word segmentation result for the target corpus data includes: splitting the second data at the preset symbols to obtain a plurality of sequence data items; grouping the sequence data items by length to obtain L data sets, where all sequence data items in a given data set have the same length and L is a natural number with 1 ≤ L ≤ K; selecting L trained TCN-CRF models from the K trained TCN-CRF models according to the length of the subsequence data used in their training, to obtain the L_1-th to L_L-th trained TCN-CRF models; inputting all sequence data items of the j-th data set into the L_j-th trained TCN-CRF model to obtain a plurality of word segmentation results; and stitching the word segmentation results together to obtain the word segmentation result for the target corpus data. Here, the length of the subsequence data used to train the L_j-th trained TCN-CRF model equals the length of the sequence data in the j-th data set, j takes each natural number from 1 to L in turn, and L_j is a natural number from 1 to K.

For example, suppose K is 5 and the five trained temporal convolutional network-conditional random field (TCN-CRF) models were trained on subsequences of lengths 10, 20, 30, 40, and 50, respectively. After the second data is divided, two pieces of sequence data are obtained, of lengths 20 and 50 respectively. Then, based on the training subsequence lengths 20 and 50, two trained TCN-CRF models are screened out of the five: the first screened model was trained on subsequence data of length 20, and the second screened model was trained on subsequence data of length 50. The sequence data of length 20 is input into the first trained TCN-CRF model to obtain a plurality of word segmentation results, and the sequence data of length 50 is input into the second trained TCN-CRF model to obtain a plurality of word segmentation results. The results output by the first model and the results output by the second model are then stitched together to obtain the word segmentation result of the target corpus data.
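The routing in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each trained TCN-CRF model is represented by a hypothetical callable (`Segmenter`) keyed by its training length, the stub models simply cut their input into fixed-size chunks, and the results are stitched in the original order of the pieces, which the text does not spell out.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical type: a trained TCN-CRF model is modeled as a callable that
# maps one piece of sequence data (a string here) to its segmented words.
Segmenter = Callable[[str], List[str]]


def segment_by_length(pieces: List[str], models: Dict[int, Segmenter]) -> List[str]:
    """Route each piece to the model trained on that length, then stitch."""
    # Group piece indices by length (these groups play the role of the L datasets).
    groups: Dict[int, List[int]] = defaultdict(list)
    for idx, piece in enumerate(pieces):
        groups[len(piece)].append(idx)

    # Screen the K models: only models whose training length occurs are used.
    results: Dict[int, List[str]] = {}
    for length, indices in groups.items():
        model = models[length]  # KeyError if no model was trained on this length
        for idx in indices:
            results[idx] = model(pieces[idx])

    # Stitch the per-piece results back together (original order is assumed).
    stitched: List[str] = []
    for idx in range(len(pieces)):
        stitched.extend(results[idx])
    return stitched


def make_stub(chunk: int) -> Segmenter:
    # Stand-in for a trained model: cuts the input into fixed-size chunks.
    return lambda s: [s[i:i + chunk] for i in range(0, len(s), chunk)]


# K = 5 stub models keyed by training length, as in the example above.
models = {n: make_stub(n // 10) for n in (10, 20, 30, 40, 50)}
pieces = ["a" * 20, "b" * 50]
words = segment_by_length(pieces, models)
```

Keying the models by training length makes the screening step a plain dictionary lookup; a real system would also need a policy for lengths that no model was trained on, which this text does not describe.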

FIG. 2 is a schematic diagram of an optional deep-learning-based Chinese word segmentation device according to an embodiment of the present application. The device is used to execute the deep-learning-based Chinese word segmentation method described above and, as shown in FIG. 2, includes a first conversion unit 10, a second conversion unit 20, a first division unit 30, a first determination unit 40, and a second determination unit 50.

The first conversion unit 10 is used to convert the training corpus data into character-level data.

The second conversion unit 20 is used to convert the character-level data into sequence data.

The first division unit 30 is used to divide the sequence data based on preset symbols to obtain a plurality of pieces of subsequence data, and to group the subsequence data by length to obtain K datasets, where the subsequence data in each of the K datasets have the same length and K is a natural number greater than 1. The preset symbols are punctuation marks that divide sentences, for example a period, question mark, exclamation mark, comma, enumeration comma, semicolon, or colon.
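As a rough illustration of the split-then-group behavior of the first division unit, the sketch below works on raw characters rather than encoded sequence data. The delimiter set, the choice to drop the delimiter itself, and the function names are assumptions made for this example; the text only says that the symbols are sentence-dividing punctuation marks.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed delimiter set (full-width and ASCII forms); not taken from the text.
DELIMITERS = set("。？！，、；：.?!,;:")


def split_on_delimiters(text: str) -> List[str]:
    """Cut the text at every delimiter, dropping the delimiter itself."""
    pieces: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch in DELIMITERS:
            if current:
                pieces.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))
    return pieces


def group_by_length(pieces: List[str]) -> Dict[int, List[str]]:
    """Group pieces so that every dataset holds pieces of one length only."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for piece in pieces:
        groups[len(piece)].append(piece)
    return dict(groups)


datasets = group_by_length(split_on_delimiters("今天天气好，我们去公园。你去吗？我也去！"))
# Two lengths occur (5 and 3), so K = 2 datasets result in this toy case.
```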

The first determination unit 40 is used to extract a plurality of pieces of subsequence data from the i-th dataset, input the extracted subsequence data into the i-th temporal convolutional network-conditional random field (TCN-CRF) model, and train that model to obtain the trained i-th TCN-CRF model, with i taking the natural numbers 1 to K in turn, so that K trained TCN-CRF models are obtained in total.

The second determination unit 50 is used to convert the target corpus data into character-level data to obtain first data, convert the first data into sequence data to obtain second data, and input the second data into at least one of the K trained temporal convolutional network-conditional random field (TCN-CRF) models to obtain the word segmentation result of the target corpus data.

The temporal convolutional network-conditional random field (TCN-CRF) model combines a temporal convolutional network (TCN) with a conditional random field (CRF). The TCN is a time-series convolutional network from deep learning, and the CRF is a typical discriminative model. A CRF treats word segmentation as the problem of classifying each character by its position within a word, and the position labels are usually defined as follows: B for the beginning of a word, M for the middle of a word, E for the end of a word, and S for a single-character word. CRF-based segmentation labels each character with its position and then forms the segmentation result from the characters spanning each B to the following E and from the S single-character words. For example, if the sentence to be segmented is "我愛北京天安門" ("I love Beijing Tiananmen"), labeling yields 我/S 愛/S 北/B 京/E 天/B 安/M 門/E, and the segmentation result is "我/愛/北京/天安門".
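The step from B/M/E/S labels to words can be made concrete with a short decoder. This is a minimal sketch, not the application's implementation: the function name `decode_bmes` is hypothetical, and ill-formed label sequences are handled leniently by flushing any unfinished word.

```python
from typing import List


def decode_bmes(chars: str, tags: str) -> List[str]:
    """Turn per-character B/M/E/S labels into a list of words."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag in ("S", "B") and current:
            # Lenient handling: an unfinished word is emitted as-is.
            words.append(current)
            current = ""
        if tag == "S":
            words.append(ch)          # single-character word
        elif tag == "B":
            current = ch              # start a new word
        elif tag == "M":
            current += ch             # continue the current word
        elif tag == "E":
            words.append(current + ch)  # close the current word
            current = ""
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    if current:
        words.append(current)
    return words


# The example from the text: 我/S 愛/S 北/B 京/E 天/B 安/M 門/E
result = decode_bmes("我愛北京天安門", "SSBEBME")
```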

In the embodiments of the present application, the target corpus data is converted into character-level data, the character-level data is converted into sequence data, and the sequence data is input into a trained temporal convolutional network-conditional random field (TCN-CRF) model to obtain the word segmentation result of the target corpus data. By adding network layers, a temporal convolutional network can enlarge its receptive field at an exponentially increasing rate, so it can process long sequence data or data with otherwise complex characteristics. This improves the accuracy of the encoding result and, in turn, the accuracy of Chinese word segmentation.
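The exponential growth of the receptive field can be checked with a small calculation. The sketch below assumes a common TCN layout that this text does not itself specify: one dilated convolution per layer, with the dilation doubling at each layer (1, 2, 4, ...). The function name and parameters are illustrative.

```python
def receptive_field(kernel_size: int, num_layers: int) -> int:
    """Receptive field of stacked dilated convolutions with dilations 1, 2, 4, ..."""
    field = 1
    for layer in range(num_layers):
        dilation = 2 ** layer  # assumed doubling schedule
        # Each layer extends the field by (kernel_size - 1) * dilation positions.
        field += (kernel_size - 1) * dilation
    return field


# Kernel size 3, one to five layers: the field roughly doubles per added layer.
sizes = [receptive_field(3, n) for n in range(1, 6)]
```

Under these assumptions the receptive field equals 1 + (k - 1)(2^n - 1) for kernel size k and n layers, so each extra layer roughly doubles how far the network can look along the sequence.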

Optionally, the second conversion unit 20 includes a sub-conversion unit. The sub-conversion unit is used to convert the character-level data into sequence data using a preset encoding scheme, which is either one-hot encoding or word-vector encoding.
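For the one-hot option, a minimal sketch looks like the following. The vocabulary construction (first-seen order) and the helper names are assumptions for illustration; the text does not say how the character vocabulary is built.

```python
from typing import Dict, List


def build_vocab(chars: str) -> Dict[str, int]:
    """Assign each distinct character an index in first-seen order."""
    vocab: Dict[str, int] = {}
    for ch in chars:
        if ch not in vocab:
            vocab[ch] = len(vocab)
    return vocab


def one_hot(chars: str, vocab: Dict[str, int]) -> List[List[int]]:
    """Encode each character as a vector with a single 1 at its vocabulary index."""
    rows: List[List[int]] = []
    for ch in chars:
        row = [0] * len(vocab)
        row[vocab[ch]] = 1
        rows.append(row)
    return rows


vocab = build_vocab("我愛北京天安門")   # 7 distinct characters
encoded = one_hot("北京", vocab)        # one row per character
```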

Optionally, the first determination unit 40 is used to execute: step S1 of inputting the extracted subsequence data into the i-th temporal convolutional network (TCN), that is, the TCN of the i-th temporal convolutional network-conditional random field (TCN-CRF) model, and performing forward propagation to obtain first output data; step S2 of calculating the value of a loss function based on the first output data and the input subsequence data; step S3 of, if the value of the loss function is greater than a preset value, inputting the subsequence data into the i-th TCN, performing backpropagation, and optimizing the network parameters of the i-th TCN; step S4 of repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value; step S5 of determining that training is complete once the value of the loss function is less than or equal to the preset value, thereby obtaining the trained i-th TCN; and step S6 of inputting the data output by the trained i-th TCN into the i-th conditional random field (CRF), that is, the CRF of the i-th TCN-CRF model, and training the i-th CRF to obtain the trained i-th TCN-CRF model.
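The control flow of steps S1 to S5 is a threshold-terminated gradient-descent loop. The sketch below shows only that control flow: a toy one-parameter linear model with a mean-squared-error loss stands in for the TCN, and the learning rate, the step cap, and all names are assumptions added for the example rather than details from this application.

```python
from typing import List, Tuple


def train_until_threshold(
    xs: List[float],
    ys: List[float],
    threshold: float,
    lr: float = 0.05,
    max_steps: int = 1000,
) -> Tuple[float, float, int]:
    """Fit y = w * x by gradient descent until the loss is at most `threshold`."""
    w = 0.0
    for step in range(max_steps):
        # S1: forward propagation.
        preds = [w * x for x in xs]
        # S2: loss computed from the outputs and the training data.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        # S5: training is complete once the loss is at or below the threshold.
        if loss <= threshold:
            return w, loss, step
        # S3: backward pass and parameter update while the loss is too large.
        grad = sum(2 * x * (p - y) for x, p, y in zip(xs, preds, ys)) / len(xs)
        w -= lr * grad
        # S4: the loop repeats S1 to S3.
    # Safety cap (an addition for this sketch; the text has no step limit).
    raise RuntimeError("loss did not reach the threshold within max_steps")


w, loss, steps = train_until_threshold([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], threshold=1e-6)
```

Step S6, training the CRF on the outputs of the trained network, is a separate stage and is not covered by this loop.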

Optionally, the first determination unit 40 includes a first sub-calculation unit and a first sub-determination unit. The first sub-calculation unit is used to calculate the conditional probability of the output data of the i-th conditional random field based on the data output by the trained i-th temporal convolutional network. The first sub-determination unit is used to train with the maximum likelihood estimation method to obtain the maximum value of the conditional probability of the output data of the i-th conditional random field.
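The conditional probability and the maximum-likelihood objective are not written out in this text. As a point of reference only, a standard linear-chain CRF (an assumption here, not a formula taken from this application) scores a label sequence y = (y_1, ..., y_T) for an input x using per-position emission scores E, assumed to come from the trained temporal convolutional network, and a learned label-transition matrix A:

```latex
s(x, y) = \sum_{t=1}^{T} E_{t,\,y_t} + \sum_{t=2}^{T} A_{y_{t-1},\,y_t},
\qquad
P(y \mid x) = \frac{\exp s(x, y)}{\sum_{y'} \exp s(x, y')}

\hat{\theta} = \arg\max_{\theta} \sum_{n=1}^{N} \log P\bigl(y^{(n)} \mid x^{(n)};\, \theta\bigr)
```

Under that assumption, maximum likelihood estimation chooses the parameters θ (here the entries of A) that maximize the log conditional probability of the labeled sequences over the N training subsequences, and the sum over y' ranges over all label sequences of length T.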

Optionally, the second determination unit 50 includes a sub-division unit, a sub-grouping unit, a second sub-determination unit, and a sub-stitching unit. The sub-division unit is used to divide the second data based on the preset symbols to obtain a plurality of pieces of sequence data. The sub-grouping unit is used to group the pieces of sequence data by length to obtain L datasets, where all sequence data in each of the L datasets have the same length and L is a natural number with 1 ≤ L ≤ K. The second sub-determination unit is used to screen L trained temporal convolutional network-conditional random field (TCN-CRF) models out of the K trained TCN-CRF models based on the lengths of the subsequence data used during training, thereby obtaining the L_1-th to L_L-th trained TCN-CRF models, and to input all sequence data in the j-th dataset into the L_j-th trained TCN-CRF model to obtain a plurality of word segmentation results. Here, the length of the subsequence data used to train the L_j-th trained TCN-CRF model equals the length of the sequence data in the j-th dataset, j takes the natural numbers 1 to L in turn, and L_j is a natural number from 1 to K. The sub-stitching unit is used to stitch the plurality of word segmentation results to obtain the word segmentation result of the target corpus data.

In one aspect, an embodiment of the present application provides a storage medium that includes a stored program. When the program runs, it controls the device on which the storage medium is located to perform the following steps: converting training corpus data into character-level data; converting the character-level data into sequence data; dividing the sequence data based on preset symbols to obtain a plurality of pieces of subsequence data, and grouping the subsequence data by length to obtain K datasets, where the subsequence data in each of the K datasets have the same length and K is a natural number greater than 1; extracting a plurality of pieces of subsequence data from the i-th dataset, inputting the extracted subsequence data into the i-th temporal convolutional network-conditional random field (TCN-CRF) model, and training that model to obtain the trained i-th TCN-CRF model, with i taking the natural numbers 1 to K in turn, so that K trained TCN-CRF models are obtained in total; and converting target corpus data into character-level data to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained TCN-CRF models to obtain the word segmentation result of the target corpus data.

Optionally, when the program runs, it further controls the device on which the storage medium is located to perform the step of converting the character-level data into sequence data using a preset encoding scheme, which is either one-hot encoding or word-vector encoding.

Optionally, when the program runs, it further controls the device on which the storage medium is located to perform: step S1 of inputting the extracted subsequence data into the i-th temporal convolutional network (TCN), that is, the TCN of the i-th temporal convolutional network-conditional random field (TCN-CRF) model, and performing forward propagation to obtain first output data; step S2 of calculating the value of a loss function based on the first output data and the input subsequence data; step S3 of, if the value of the loss function is greater than a preset value, inputting the subsequence data into the i-th TCN, performing backpropagation, and optimizing the network parameters of the i-th TCN; step S4 of repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value; step S5 of determining that training is complete once the value of the loss function is less than or equal to the preset value, thereby obtaining the trained i-th TCN; and step S6 of inputting the data output by the trained i-th TCN into the i-th conditional random field (CRF), that is, the CRF of the i-th TCN-CRF model, and training the i-th CRF to obtain the trained i-th TCN-CRF model.

Optionally, when the program runs, it further controls the device on which the storage medium is located to perform the steps of: calculating the conditional probability of the output data of the i-th conditional random field based on the data output by the trained i-th temporal convolutional network; and training with the maximum likelihood estimation method to obtain the maximum value of the conditional probability of the output data of the i-th conditional random field.

Optionally, when the program runs, it further controls the device on which the storage medium is located to perform the steps of: dividing the second data based on the preset symbols to obtain a plurality of pieces of sequence data; grouping the pieces of sequence data by length to obtain L datasets, where all sequence data in each of the L datasets have the same length and L is a natural number with 1 ≤ L ≤ K; screening L trained temporal convolutional network-conditional random field (TCN-CRF) models out of the K trained TCN-CRF models based on the lengths of the subsequence data used during training, thereby obtaining the L_1-th to L_L-th trained TCN-CRF models, and inputting all sequence data in the j-th dataset into the L_j-th trained TCN-CRF model to obtain a plurality of word segmentation results; and stitching the plurality of word segmentation results to obtain the word segmentation result of the target corpus data. Here, the length of the subsequence data used to train the L_j-th trained TCN-CRF model equals the length of the sequence data in the j-th dataset, j takes the natural numbers 1 to L in turn, and L_j is a natural number from 1 to K.

In one aspect, an embodiment of the present application provides a computer device that includes a memory for storing information including program instructions and a processor for controlling execution of the program instructions. When the program instructions are loaded and executed by the processor, the following steps are carried out: converting training corpus data into character-level data; converting the character-level data into sequence data; dividing the sequence data based on preset symbols to obtain a plurality of pieces of subsequence data, and grouping the subsequence data by length to obtain K datasets, where the subsequence data in each of the K datasets have the same length and K is a natural number greater than 1; extracting a plurality of pieces of subsequence data from the i-th dataset, inputting the extracted subsequence data into the i-th temporal convolutional network-conditional random field (TCN-CRF) model, and training that model to obtain the trained i-th TCN-CRF model, with i taking the natural numbers 1 to K in turn, so that K trained TCN-CRF models are obtained in total; and converting target corpus data into character-level data to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained TCN-CRF models to obtain the word segmentation result of the target corpus data.

Optionally, when the program instructions are loaded and executed by the processor, the following step is further carried out: converting the character-level data into sequence data using a preset encoding scheme, which is either one-hot encoding or word-vector encoding.

Optionally, when the program instructions are loaded and executed by the processor, the following steps are further carried out: step S1 of inputting the extracted subsequence data into the i-th temporal convolutional network (TCN), that is, the TCN of the i-th temporal convolutional network-conditional random field (TCN-CRF) model, and performing forward propagation to obtain first output data; step S2 of calculating the value of a loss function based on the first output data and the input subsequence data; step S3 of, if the value of the loss function is greater than a preset value, inputting the subsequence data into the i-th TCN, performing backpropagation, and optimizing the network parameters of the i-th TCN; step S4 of repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value; step S5 of determining that training is complete once the value of the loss function is less than or equal to the preset value, thereby obtaining the trained i-th TCN; and step S6 of inputting the data output by the trained i-th TCN into the i-th conditional random field (CRF), that is, the CRF of the i-th TCN-CRF model, and training the i-th CRF to obtain the trained i-th TCN-CRF model.

Optionally, when the program instructions are loaded and executed by the processor, the following steps are further carried out: calculating the conditional probability of the output data of the i-th conditional random field based on the data output by the trained i-th temporal convolutional network; and training with the maximum likelihood estimation method to obtain the maximum value of the conditional probability of the output data of the i-th conditional random field.

Optionally, when the program instructions are loaded and executed by the processor, the following steps are further carried out: dividing the second data based on the preset symbols to obtain a plurality of pieces of sequence data; grouping the pieces of sequence data by length to obtain L datasets, where all sequence data in each of the L datasets have the same length and L is a natural number with 1 ≤ L ≤ K; screening L trained temporal convolutional network-conditional random field (TCN-CRF) models out of the K trained TCN-CRF models based on the lengths of the subsequence data used during training, thereby obtaining the L_1-th to L_L-th trained TCN-CRF models, and inputting all sequence data in the j-th dataset into the L_j-th trained TCN-CRF model to obtain a plurality of word segmentation results; and stitching the plurality of word segmentation results to obtain the word segmentation result of the target corpus data. Here, the length of the subsequence data used to train the L_j-th trained TCN-CRF model equals the length of the sequence data in the j-th dataset, j takes the natural numbers 1 to L in turn, and L_j is a natural number from 1 to K.

FIG. 3 is a schematic diagram of a computer device according to an embodiment of the present application. As shown in FIG. 3, the computer device 50 of this embodiment includes a processor 51, a memory 52, and a computer program 53 that is stored in the memory 52 and executable on the processor 51. When executed by the processor 51, the computer program 53 implements the deep-learning-based Chinese word segmentation method of the embodiments; to avoid repetition, the details are not described again here. Alternatively, when executed by the processor 51, the computer program implements the functions of each model/unit of the deep-learning-based Chinese word segmentation device of the embodiments; to avoid repetition, the details are not described again here.

The computer device 50 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device may include, but is not limited to, the processor 51 and the memory 52. A person skilled in the art will understand that FIG. 3 is merely an example of the computer device 50 and does not limit it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the computer device may further include input/output devices, network access devices, buses, and the like.

The processor 51 may be a central processing unit (CPU), or it may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.

The memory 52 may be an internal storage unit of the computer device 50, for example a hard disk or memory of the computer device 50. The memory 52 may also be an external storage device of the computer device 50, for example a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device 50. Further, the memory 52 may include both an internal storage unit of the computer device 50 and an external storage device. The memory 52 is used to store the computer program and other programs and data required by the computer device. The memory 52 may also be used to temporarily store data that has been output or is to be output.

As those skilled in the art will clearly understand, for convenience and brevity of description, the specific operating processes of the systems, devices, and units described above may be understood by reference to the corresponding processes in the foregoing method embodiments, and their description is omitted here.

It should be understood that, in the several embodiments of the present application, the provided systems, devices, and methods may be implemented in other forms. For example, the device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Further, the couplings, direct couplings, or communication connections shown or described may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.

The above are merely preferred embodiments of the present application and are not intended to limit it. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims

A Chinese word division method based on deep learning, comprising:
a step of converting training corpus data into character-level data;
a step of converting the character-level data into sequence data;
a step of dividing the sequence data based on a preset code to obtain a plurality of subsequence data, and grouping the plurality of subsequence data based on the length of the subsequence data to obtain K datasets, wherein all subsequence data contained in each of the K datasets have the same length and K is a natural number greater than 1;
a step of extracting a plurality of subsequence data from the i-th dataset, inputting the extracted subsequence data into the i-th timing convolutional neural network-conditional random field model, and training that model to obtain the i-th trained timing convolutional neural network-conditional random field model, with i taking the natural numbers from 1 to K in order, so that a total of K trained timing convolutional neural network-conditional random field models are obtained; and
a step of converting target corpus data into character-level data to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained timing convolutional neural network-conditional random field models to obtain a word division result for the target corpus data.
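
For illustration only, the dividing and grouping step recited above might be sketched as follows. This is a minimal sketch under stated assumptions: the preset codes are assumed to be common punctuation marks (`PRESET_CODES` is a hypothetical choice), and the functions operate on raw characters rather than encoded sequence data so that the example stays readable.

```python
from collections import defaultdict

# Assumption: the preset codes are Chinese and ASCII punctuation marks.
PRESET_CODES = "，。！？；：、,.!?;:"


def split_on_preset_codes(chars, codes=PRESET_CODES):
    """Cut a character sequence at every preset code, dropping the codes."""
    subsequences, current = [], []
    for ch in chars:
        if ch in codes:
            if current:
                subsequences.append(current)
                current = []
        else:
            current.append(ch)
    if current:
        subsequences.append(current)
    return subsequences


def group_by_length(subsequences):
    """Return {length: [subsequences of that length]}, one dataset per length."""
    datasets = defaultdict(list)
    for seq in subsequences:
        datasets[len(seq)].append(seq)
    return dict(datasets)


chars = list("今天天气好，我们去公园。你来吗？")
datasets = group_by_length(split_on_preset_codes(chars))
```

Here the sample sentence yields two subsequences of length 5 and one of length 3, that is, K = 2 datasets, each of which would train its own model.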

The Chinese word division method based on deep learning according to claim 1, wherein the step of converting the character-level data into sequence data includes:
converting the character-level data into the sequence data by a preset encoding method, the preset encoding method being either one-hot encoding or word vector encoding.
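
For illustration only, the one-hot option of the preset encoding method might be sketched as follows. This is a minimal sketch; the helper names `build_vocab` and `one_hot` are hypothetical, and the vocabulary is built from the sample text itself rather than from a full training corpus.

```python
def build_vocab(chars):
    """Assign each distinct character an index in order of first appearance."""
    vocab = {}
    for ch in chars:
        if ch not in vocab:
            vocab[ch] = len(vocab)
    return vocab


def one_hot(chars, vocab):
    """Encode each character as a vector with a single 1 at its vocab index."""
    size = len(vocab)
    vectors = []
    for ch in chars:
        vec = [0] * size
        vec[vocab[ch]] = 1
        vectors.append(vec)
    return vectors


chars = list("天天向上")
vocab = build_vocab(chars)
encoded = one_hot(chars, vocab)
```

Word vector encoding would replace each one-hot vector with a dense learned vector, which keeps the sequence length unchanged while reducing the dimension of each element.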

The Chinese word division method based on deep learning according to claim 1, wherein inputting the extracted plurality of subsequence data into the i-th timing convolutional neural network-conditional random field model, training that model, and obtaining the i-th trained timing convolutional neural network-conditional random field model includes:
step S1 of inputting the extracted plurality of subsequence data into the i-th timing convolutional neural network, which is the timing convolutional neural network in the i-th timing convolutional neural network-conditional random field model, and performing forward propagation to obtain first output data;
step S2 of calculating the value of a loss function based on the first output data and the input plurality of subsequence data;
step S3 of, when the value of the loss function is greater than a preset value, inputting the plurality of subsequence data into the i-th timing convolutional neural network, performing backward propagation, and optimizing the network parameters of the i-th timing convolutional neural network;
step S4 of repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value;
step S5 of, when the value of the loss function is less than or equal to the preset value, determining that training is complete and obtaining the i-th trained timing convolutional neural network; and
inputting the data output from the i-th trained timing convolutional neural network into the i-th conditional random field, which is the conditional random field in the i-th timing convolutional neural network-conditional random field model, and training the i-th conditional random field to obtain the i-th trained timing convolutional neural network-conditional random field model.
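
For illustration only, the control flow of steps S1 to S5 might be sketched as follows. This is a minimal sketch under stated assumptions, not a timing convolutional neural network: a single weight `w` with a mean squared error loss stands in for the network, the function name `train_until_threshold` and its parameters are hypothetical, and the `max_iters` guard is an addition that the claim does not recite.

```python
def train_until_threshold(xs, ys, threshold=1e-4, lr=0.05, max_iters=10_000):
    """Repeat forward pass, loss, and gradient update until loss <= threshold."""
    w = 0.0  # single trainable parameter standing in for the network weights
    for step in range(max_iters):
        # S1: forward propagation.
        preds = [w * x for x in xs]
        # S2: value of the loss function (mean squared error, an assumption).
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        # S5: training is complete once the loss is at or below the threshold.
        if loss <= threshold:
            return w, loss, step
        # S3: backward propagation and parameter update (gradient descent).
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        w -= lr * grad
        # S4: the loop repeats S1 to S3.
    raise RuntimeError("loss did not reach the threshold within max_iters")


w, final_loss, steps = train_until_threshold([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The point of the sketch is the stopping rule: the update of S3 runs only while the loss exceeds the threshold, and the parameters at the first iteration that satisfies the threshold are kept as the trained result.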

The Chinese word division method based on deep learning according to claim 3, wherein the step of training the i-th conditional random field includes:
a step of calculating the conditional probability of the output data of the i-th conditional random field based on the data output from the i-th trained timing convolutional neural network; and
a step of training by maximum likelihood estimation to obtain the maximum value of the conditional probability of the output data of the i-th conditional random field.
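
For illustration only, the conditional probability that maximum likelihood estimation would maximize might be computed as follows for a linear-chain conditional random field. This is a minimal sketch under stated assumptions: the claim does not fix the form of the conditional random field, so the linear-chain structure, the `emissions` scores (standing in for the network output), and the `transitions` scores are all assumptions, and the example omits the numerical stabilisation a real implementation would need.

```python
import math
from itertools import product


def path_score(emissions, transitions, tags):
    """Unnormalised score of one tag sequence."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, transitions):
    """Log of the sum of exp(score) over all tag sequences (forward algorithm)."""
    n_tags = len(emissions[0])
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            math.log(sum(math.exp(alpha[i] + transitions[i][j]) for i in range(n_tags)))
            + emissions[t][j]
            for j in range(n_tags)
        ]
    return math.log(sum(math.exp(a) for a in alpha))


def log_likelihood(emissions, transitions, tags):
    """log p(tags | input); training would adjust the scores to maximise this."""
    return path_score(emissions, transitions, tags) - log_partition(emissions, transitions)


# Hypothetical scores: 3 positions, 2 tags.
emissions = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.5]]
transitions = [[0.4, -0.1], [-0.3, 0.6]]
total = sum(
    math.exp(log_likelihood(emissions, transitions, tags))
    for tags in product(range(2), repeat=3)
)
```

Summing the probabilities of all eight tag sequences gives a value of 1 up to rounding, which is a quick check that the normalisation is consistent with the path score.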

The Chinese word division method based on deep learning according to any one of claims 1 to 4, wherein the step of inputting the second data into at least one of the K trained timing convolutional neural network-conditional random field models to obtain the word division result for the target corpus data includes:
a step of dividing the second data based on the preset code to obtain a plurality of sequence data;
a step of grouping the plurality of sequence data based on the length of the sequence data to obtain L datasets, wherein all sequence data contained in each of the L datasets have the same length, L is a natural number, and 1 ≤ L ≤ K;
a step of selecting, from the K trained timing convolutional neural network-conditional random field models, L trained models based on the length of the subsequence data used in their training, namely the L1-th to LL-th trained timing convolutional neural network-conditional random field models, and inputting all sequence data contained in the j-th dataset into the Lj-th trained timing convolutional neural network-conditional random field model to obtain a plurality of word division results; and
a step of stitching the plurality of word division results together to obtain the word division result for the target corpus data,
wherein the length of the subsequence data used in training the Lj-th trained timing convolutional neural network-conditional random field model is the same as the length of the sequence data contained in the j-th dataset, j takes the natural numbers from 1 to L in order, and Lj is a natural number from 1 to K.

A Chinese word division device based on deep learning, comprising:
a first conversion unit for converting training corpus data into character-level data;
a second conversion unit for converting the character-level data into sequence data;
a first division unit for dividing the sequence data based on a preset code to obtain a plurality of subsequence data, and grouping the plurality of subsequence data based on the length of the subsequence data to obtain K datasets, wherein all subsequence data contained in each of the K datasets have the same length and K is a natural number greater than 1;
a first determination unit for extracting a plurality of subsequence data from the i-th dataset, inputting the extracted subsequence data into the i-th timing convolutional neural network-conditional random field model, and training that model to obtain the i-th trained timing convolutional neural network-conditional random field model, with i taking the natural numbers from 1 to K in order, so that a total of K trained timing convolutional neural network-conditional random field models are obtained; and
a second determination unit for converting target corpus data into character-level data to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained timing convolutional neural network-conditional random field models to obtain a word division result for the target corpus data.

The Chinese word division device based on deep learning according to claim 6, wherein the second conversion unit includes a sub-conversion unit for converting the character-level data into the sequence data by a preset encoding method, the preset encoding method being either one-hot encoding or word vector encoding.

The Chinese word division device based on deep learning according to claim 6, wherein the first determination unit is used to perform:
step S1 of inputting the extracted plurality of subsequence data into the i-th timing convolutional neural network, which is the timing convolutional neural network in the i-th timing convolutional neural network-conditional random field model, and performing forward propagation to obtain first output data;
step S2 of calculating the value of a loss function based on the first output data and the input plurality of subsequence data;
step S3 of, when the value of the loss function is greater than a preset value, inputting the plurality of subsequence data into the i-th timing convolutional neural network, performing backward propagation, and optimizing the network parameters of the i-th timing convolutional neural network;
step S4 of repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value;
step S5 of, when the value of the loss function is less than or equal to the preset value, determining that training is complete and obtaining the i-th trained timing convolutional neural network; and
step S6 of inputting the data output from the i-th trained timing convolutional neural network into the i-th conditional random field, which is the conditional random field in the i-th timing convolutional neural network-conditional random field model, and training the i-th conditional random field to obtain the i-th trained timing convolutional neural network-conditional random field model.

The Chinese word division device based on deep learning according to claim 8, wherein the first determination unit includes:
a first sub-computation unit for calculating the conditional probability of the output data of the i-th conditional random field based on the data output from the i-th trained timing convolutional neural network; and
a first sub-determination unit for training by maximum likelihood estimation to obtain the maximum value of the conditional probability of the output data of the i-th conditional random field.

The Chinese word division device based on deep learning according to any one of claims 6 to 9, wherein the second determination unit includes:
a sub-division unit for dividing the second data based on the preset code to obtain a plurality of sequence data;
a sub-grouping unit for grouping the plurality of sequence data based on the length of the sequence data to obtain L datasets, wherein all sequence data contained in each of the L datasets have the same length, L is a natural number, and 1 ≤ L ≤ K;
a second sub-determination unit for selecting, from the K trained timing convolutional neural network-conditional random field models, L trained models based on the length of the subsequence data used in their training, namely the L1-th to LL-th trained timing convolutional neural network-conditional random field models, and inputting all sequence data contained in the j-th dataset into the Lj-th trained timing convolutional neural network-conditional random field model to obtain a plurality of word division results, wherein the length of the subsequence data used in training the Lj-th trained model is the same as the length of the sequence data contained in the j-th dataset, j takes the natural numbers from 1 to L in order, and Lj is a natural number from 1 to K; and
a sub-stitching unit for stitching the plurality of word division results together to obtain the word division result for the target corpus data.

A storage medium including a stored program, wherein, when the program runs, a device in which the storage medium is located is controlled to execute:
a step of converting training corpus data into character-level data;
a step of converting the character-level data into sequence data;
a step of dividing the sequence data based on a preset code to obtain a plurality of subsequence data, and grouping the plurality of subsequence data based on the length of the subsequence data to obtain K datasets, wherein all subsequence data contained in each of the K datasets have the same length and K is a natural number greater than 1;
a step of extracting a plurality of subsequence data from the i-th dataset, inputting the extracted subsequence data into the i-th timing convolutional neural network-conditional random field model, and training that model to obtain the i-th trained timing convolutional neural network-conditional random field model, with i taking the natural numbers from 1 to K in order, so that a total of K trained timing convolutional neural network-conditional random field models are obtained; and
a step of converting target corpus data into character-level data to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained timing convolutional neural network-conditional random field models to obtain a word division result for the target corpus data.

The storage medium according to claim 11, wherein, when the program runs, controlling the device in which the storage medium is located to execute the step of converting the character-level data into sequence data includes:
a step of converting the character-level data into the sequence data by a preset encoding method, the preset encoding method being either one-hot encoding or word vector encoding.

The storage medium according to claim 11, wherein, when the program runs, controlling the device in which the storage medium is located to input the extracted plurality of subsequence data into the i-th timing convolutional neural network-conditional random field model, train that model, and obtain the i-th trained timing convolutional neural network-conditional random field model includes:
step S1 of inputting the extracted plurality of subsequence data into the i-th timing convolutional neural network, which is the timing convolutional neural network in the i-th timing convolutional neural network-conditional random field model, and performing forward propagation to obtain first output data;
step S2 of calculating the value of a loss function based on the first output data and the input plurality of subsequence data;
step S3 of, when the value of the loss function is greater than a preset value, inputting the plurality of subsequence data into the i-th timing convolutional neural network, performing backward propagation, and optimizing the network parameters of the i-th timing convolutional neural network;
step S4 of repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value;
step S5 of, when the value of the loss function is less than or equal to the preset value, determining that training is complete and obtaining the i-th trained timing convolutional neural network; and
inputting the data output from the i-th trained timing convolutional neural network into the i-th conditional random field, which is the conditional random field in the i-th timing convolutional neural network-conditional random field model, and training the i-th conditional random field to obtain the i-th trained timing convolutional neural network-conditional random field model.

The storage medium according to claim 13, wherein, when the program runs, controlling the device in which the storage medium is located to execute the step of training the i-th conditional random field includes:
a step of calculating the conditional probability of the output data of the i-th conditional random field based on the data output from the i-th trained timing convolutional neural network; and
a step of training by maximum likelihood estimation to obtain the maximum value of the conditional probability of the output data of the i-th conditional random field.

The storage medium according to any one of claims 11 to 14, wherein, when the program runs, controlling the device in which the storage medium is located to execute the step of inputting the second data into at least one of the K trained timing convolutional neural network-conditional random field models to obtain the word division result for the target corpus data includes:
a step of dividing the second data based on the preset code to obtain a plurality of sequence data;
a step of grouping the plurality of sequence data based on the length of the sequence data to obtain L datasets, wherein all sequence data contained in each of the L datasets have the same length, L is a natural number, and 1 ≤ L ≤ K;
a step of selecting, from the K trained timing convolutional neural network-conditional random field models, L trained models based on the length of the subsequence data used in their training, namely the L1-th to LL-th trained timing convolutional neural network-conditional random field models, and inputting all sequence data contained in the j-th dataset into the Lj-th trained timing convolutional neural network-conditional random field model to obtain a plurality of word division results; and
a step of stitching the plurality of word division results together to obtain the word division result for the target corpus data,
wherein the length of the subsequence data used in training the Lj-th trained timing convolutional neural network-conditional random field model is the same as the length of the sequence data contained in the j-th dataset, j takes the natural numbers from 1 to L in order, and Lj is a natural number from 1 to K.

A computer device comprising a memory for storing information including program instructions and a processor for controlling execution of the program instructions, wherein, when the program instructions are loaded and executed by the processor, the following steps are implemented:
a step of converting training corpus data into character-level data;
a step of converting the character-level data into sequence data;
a step of dividing the sequence data based on a preset code to obtain a plurality of subsequence data, and grouping the plurality of subsequence data based on the length of the subsequence data to obtain K datasets, wherein all subsequence data contained in each of the K datasets have the same length and K is a natural number greater than 1;
a step of extracting a plurality of subsequence data from the i-th dataset, inputting the extracted subsequence data into the i-th timing convolutional neural network-conditional random field model, and training that model to obtain the i-th trained timing convolutional neural network-conditional random field model, with i taking the natural numbers from 1 to K in order, so that a total of K trained timing convolutional neural network-conditional random field models are obtained; and
a step of converting target corpus data into character-level data to obtain first data, converting the first data into sequence data to obtain second data, and inputting the second data into at least one of the K trained timing convolutional neural network-conditional random field models to obtain a word division result for the target corpus data.

The computer device according to claim 16, wherein, when the program instructions are loaded and executed by the processor, implementing the step of converting the character-level data into sequence data includes:
a step of converting the character-level data into the sequence data by a preset encoding method, the preset encoding method being either one-hot encoding or word vector encoding.

The computer device according to claim 16, wherein, when the program instructions are loaded and executed by the processor, implementing the step of inputting the extracted plurality of subsequence data into the i-th timing convolutional neural network-conditional random field model, training that model, and obtaining the i-th trained timing convolutional neural network-conditional random field model includes:
step S1 of inputting the extracted plurality of subsequence data into the i-th timing convolutional neural network, which is the timing convolutional neural network in the i-th timing convolutional neural network-conditional random field model, and performing forward propagation to obtain first output data;
step S2 of calculating the value of a loss function based on the first output data and the input plurality of subsequence data;
step S3 of, when the value of the loss function is greater than a preset value, inputting the plurality of subsequence data into the i-th timing convolutional neural network, performing backward propagation, and optimizing the network parameters of the i-th timing convolutional neural network;
step S4 of repeating steps S1 to S3 until the value of the loss function is less than or equal to the preset value;
step S5 of, when the value of the loss function is less than or equal to the preset value, determining that training is complete and obtaining the i-th trained timing convolutional neural network; and
inputting the data output from the i-th trained timing convolutional neural network into the i-th conditional random field, which is the conditional random field in the i-th timing convolutional neural network-conditional random field model, and training the i-th conditional random field to obtain the i-th trained timing convolutional neural network-conditional random field model.

The computer device according to claim 18, wherein, when the program instructions are loaded and executed by the processor, implementing the step of training the i-th conditional random field includes:
a step of calculating the conditional probability of the output data of the i-th conditional random field based on the data output from the i-th trained timing convolutional neural network; and
a step of training by maximum likelihood estimation to obtain the maximum value of the conditional probability of the output data of the i-th conditional random field.

The computer device according to any one of claims 16 to 19, wherein, when the program instructions are loaded and executed by the processor, implementing the step of inputting the second data into at least one of the K trained timing convolutional neural network-conditional random field models to obtain the word division result for the target corpus data includes:
a step of dividing the second data based on the preset code to obtain a plurality of sequence data;
a step of grouping the plurality of sequence data based on the length of the sequence data to obtain L datasets, wherein all sequence data contained in each of the L datasets have the same length, L is a natural number, and 1 ≤ L ≤ K;
a step of selecting, from the K trained timing convolutional neural network-conditional random field models, L trained models based on the length of the subsequence data used in their training, namely the L1-th to LL-th trained timing convolutional neural network-conditional random field models, and inputting all sequence data contained in the j-th dataset into the Lj-th trained timing convolutional neural network-conditional random field model to obtain a plurality of word division results; and
a step of stitching the plurality of word division results together to obtain the word division result for the target corpus data,
wherein the length of the subsequence data used in training the Lj-th trained timing convolutional neural network-conditional random field model is the same as the length of the sequence data contained in the j-th dataset, j takes the natural numbers from 1 to L in order, and Lj is a natural number from 1 to K.