JP6928206B2

JP6928206B2 - Data identification method based on associative clustering deep learning neural network

Info

Publication number: JP6928206B2
Application number: JP2018199173A
Authority: JP
Inventors: 朱定局
Original assignee: 大国創新智能科技（東莞）有限公司
Priority date: 2017-10-23
Filing date: 2018-10-23
Publication date: 2021-09-01
Anticipated expiration: 2038-10-23
Also published as: JP2019079536A; CN107704888A; CN107704888B

Description

The present invention relates to an associative clustering deep learning method, and more specifically to a data identification method based on an associative clustering deep learning neural network.

Existing deep learning can obtain an output label from input data (for example, a person's ID card number from a profile image, or from a voice recording). In the top-down supervised learning stage, however, it must learn under the supervision of labeled data (for example, profile images with ID card numbers, or voice recordings with ID card numbers). When both profile images with ID card numbers and voice recordings with ID card numbers are available, a profile image can be input to the deep learning neural network for the profile image class to obtain an output ID card number, a voice recording can be input to the deep learning neural network for the voice class to obtain an output ID card number, and the two output ID card numbers can then be compared. If they are the same, the profile image and the voice are judged to correspond to the same person; otherwise, they are judged to correspond to different people.

However, the accuracy of a deep learning neural network cannot reach 100%. When a profile image is input to the network for the profile image class, the ID card number of another person with a similar profile image may be output; likewise, when a voice recording is input to the network for the voice class, the ID card number of another person with a similar voice may be output. A profile image and a voice that do not belong to the same person may therefore be judged to correspond to the same person. Moreover, when a profile image and a voice are judged to correspond to different people, the probability that they nevertheless correspond to the same person cannot be calculated; and when they are judged to correspond to the same person, the probability that they correspond to different people cannot be calculated.

When an object is identified from one or more types of data, such as voice, profile images, or other data types, existing deep learning techniques cannot combine similarity measures with the results of multiple deep learning networks to compute the other possible outputs and the best output, so more accurate identification and judgment are not possible.

Chinese Patent Application Publication No. 104951403

The technical problem to be solved by the present invention is to provide a data identification method based on an associative clustering deep learning neural network.

The technical solution that achieves the object of the present invention is a data identification method based on an associative clustering deep learning neural network, comprising the following steps:

Step 1: acquire N classes of data sample sets and the label set corresponding to each class's data sample set; acquire the data preset format of the data samples of each class and the label preset format; then preprocess the N classes of data sample sets and the label sets, where N is greater than or equal to 1.

The data preset format of the data samples of each of the N classes and the label preset format are acquired as follows:

Acquire the data format of each data sample in each class's data sample set and merge identical data formats within the class to obtain s data formats. Count the number of data samples Mi that use each data format Pi in the class's data sample set, and take the data format Pi with the largest Mi as the data preset format for the data samples of that class, where s is greater than or equal to 1 and i is between 1 and s inclusive.

Acquire the label format of each label in the label set corresponding to each class's data sample set and merge identical label formats across all classes to obtain at least t label formats. Count the number of labels Nj that use each label format Qj in the label sets, and take the label format Qj with the largest Nj as the label preset format, where t is greater than or equal to 1 and j is between 1 and t inclusive.
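The format selection described above amounts to a majority vote over format descriptors. A minimal Python sketch follows, assuming each format can be represented as a hashable string; the function name `preset_format` and the sample values are hypothetical, and the tie-breaking rule (first format encountered wins) is an assumption, since the text does not specify one.

```python
from collections import Counter

def preset_format(formats):
    """Return the most frequent format descriptor in `formats`."""
    counts = Counter(formats)          # merges identical formats and counts M_i per format P_i
    fmt, _ = counts.most_common(1)[0]  # the format with the largest count
    return fmt

# Hypothetical format descriptors for one class of data samples.
sample_formats = ["TIFF 480x640"] * 5 + ["JPEG 480x640"] * 3 + ["BMP 480x640"]
data_preset = preset_format(sample_formats)
```

The same function could be applied to label formats to pick the label preset format.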

The N classes of data sample sets and the label sets are preprocessed as follows:

Step 1-1: determine whether the data format of each data sample in each class's data sample set matches the data preset format of that class; if not, convert the data sample to the class's data preset format.

Step 1-2: determine whether the format of the label corresponding to each data sample in each class's data sample set matches the label preset format; if not, convert the label to the label preset format.

Step 1-3: cluster each class's data sample set within the N classes of data sample sets to obtain J clustered data sample sets and their corresponding output label sets.

Step 1-4: merge identical labels of each class within the J clustered output label sets to obtain J updated output label sets.

Step 1-5: among the J updated output label sets, merge the label sets that share a label, together with their corresponding data sample sets, to obtain the preprocessed data sample sets and their corresponding output label sets.
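Steps 1-4 and 1-5 can be sketched as set operations. The Python below is an illustrative sketch only: it assumes the clustering of step 1-3 has already produced a list of (samples, labels) pairs, and the function name `merge_clusters` and the sample identifiers are hypothetical.

```python
def merge_clusters(clusters):
    """Merge clusters whose label sets share at least one label.

    clusters: list of (samples, labels) pairs, one per cluster.
    Returns a list of (sample_set, label_set) pairs with pairwise
    disjoint label sets.
    """
    merged = []
    for samples, labels in clusters:
        # Step 1-4: converting to a set merges identical labels.
        cur_samples, cur_labels = set(samples), set(labels)
        kept = []
        for other_samples, other_labels in merged:
            if other_labels & cur_labels:
                # Step 1-5: a shared label means both sets are merged.
                cur_samples |= other_samples
                cur_labels |= other_labels
            else:
                kept.append((other_samples, other_labels))
        kept.append((cur_samples, cur_labels))
        merged = kept
    return merged

clusters = [
    (["img11", "img12"], ["id_A", "id_A"]),
    (["img13"], ["id_B"]),
    (["img14", "img15"], ["id_B", "id_C"]),
]
result = merge_clusters(clusters)
```

Because the groups already in `merged` are kept pairwise disjoint, a single pass over them per new cluster is enough to absorb every group that overlaps it.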

Step 2: initialize the N deep learning neural networks corresponding to the N classes of data sample sets, specifically through steps 2-1 to 2-3:

Step 2-1: use the data preset format of each class's data samples as the input format of the deep learning neural network for that class.

Step 2-2: use the label preset format as the output format of the deep learning neural network for each class.

Step 2-3: acquire configuration information for the deep learning neural network of each class, adopt it as that network's configuration information, and configure the network accordingly, specifically through steps 2-3-1 to 2-3-4:

Step 2-3-1: from a deep learning neural network configuration knowledge base, retrieve the configuration information of the deep learning neural network whose input format and output format best match the class's data preset format and the label preset format, and use it as the preset configuration information of the deep learning neural network for that class.

Here, the degree of matching between the input and output formats and the class's data preset format and the label preset format = (degree of matching between the input format and the class's data preset format) × u% + (degree of matching between the output format and the label preset format) × (1 − u%), where the default value of u is 90.

Step 2-3-2: output the preset configuration information of the deep learning neural network for each class to the user.

Step 2-3-3: acquire the user's changes to the preset configuration information of the deep learning neural network for each class.

Step 2-3-4: use the changed preset configuration information as the preset configuration information of the deep learning neural network for that class.
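The weighted matching degree of step 2-3-1 can be illustrated with a short Python sketch. The text does not say how the two component matching degrees are measured, so they are assumed here to be given as numbers between 0 and 1; the knowledge-base entries and the key names `name`, `input_match`, and `output_match` are hypothetical.

```python
def match_degree(input_match, output_match, u=90):
    """Overall matching degree: input_match * u% + output_match * (1 - u%)."""
    w = u / 100.0
    return input_match * w + output_match * (1.0 - w)

def best_config(knowledge_base, u=90):
    """Return the knowledge-base entry with the highest overall matching degree."""
    return max(
        knowledge_base,
        key=lambda c: match_degree(c["input_match"], c["output_match"], u),
    )

# Hypothetical knowledge-base entries with precomputed component matching degrees.
kb = [
    {"name": "config_a", "input_match": 0.9, "output_match": 0.2},
    {"name": "config_b", "input_match": 0.7, "output_match": 1.0},
]
chosen = best_config(kb)
```

With the default u = 90 the input format dominates, so `config_a` is selected; lowering u to 50 would favor `config_b`.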

Step 3: using each class's data sample set acquired in step 1 as input and its corresponding label set as output, train the deep learning neural network for that class to obtain N trained deep learning neural networks, specifically through steps 3-1 and 3-2:

Step 3-1: use each data sample in each class's data sample set as input to the deep learning neural network for that class and perform bottom-up unsupervised training on that network.

Step 3-2: use each data sample in each class's data sample set as input to the deep learning neural network for that class, use the label corresponding to that data sample in the class's label set as output, and perform top-down supervised learning on that network, obtaining N trained deep learning neural networks.

Step 4: acquire one item of test data for the deep learning neural network of each class, convert the test data of each class to the data preset format of that class, input it to the deep learning neural network for that class, and obtain the class's test output label through the network's computation.

Step 5: among the label sets preprocessed in step 1, find the label set that contains each class's test output label, and determine whether that label set has only one label element. If the label set containing each class's test output label has only one label element, take each class's test output label as the best output label for that class; otherwise proceed to the next step.
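The check in step 5 can be sketched for a single class as follows. This is an illustrative Python sketch, not the patented implementation; the function name `best_label_if_unique` is hypothetical, and returning `None` when the label is not found in any label set is an assumption, because the text does not cover that case.

```python
def best_label_if_unique(label_sets, test_label):
    """Return test_label if the label set containing it has one element, else None.

    None signals that the method should continue with step 6.
    """
    for label_set in label_sets:
        if test_label in label_set:
            return test_label if len(label_set) == 1 else None
    return None  # assumption: a label found in no set is treated as unresolved

# Hypothetical preprocessed label sets from step 1.
label_sets = [{"id_A"}, {"id_B", "id_C"}]
unique_case = best_label_if_unique(label_sets, "id_A")
ambiguous_case = best_label_if_unique(label_sets, "id_B")
```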

Step 6: compute the similarity between the data sample set corresponding to each class's test output label and the data sample set corresponding to each label element in the label set containing that test output label, and determine the groups of possible output labels from these similarities, where each group of possible output labels contains one possible output label for each class. Specifically:

If N = 1, compute the similarity between the data sample set corresponding to the test output label and the data sample set corresponding to each label element in the label set containing the test output label, and take all label elements whose similarity exceeds a first preset value a as one group of possible output labels.

If N > 1, acquire the data sample set Di corresponding to the test output label of the i-th class, the number mi of label elements in the label set containing the i-th class's test output label, and the data sample set Dij corresponding to the j-th label element in that label set, and compute the similarity Pij between Di and Dij, where i takes each natural number from 1 to N and j takes each natural number from 1 to mi.

For each combination of values of k1, k2, …, kN, compute the first combined similarity f(P1k1, P2k2, …, PNkN). If f(P1k1, P2k2, …, PNkN) is greater than a second preset value b, take the k1-th label element of the label set containing the first class's test output label, the k2-th label element of the label set containing the second class's test output label, …, and the kN-th label element of the label set containing the N-th class's test output label as one group of possible output labels, where k1 takes each natural number from 1 to m1, k2 takes each natural number from 1 to m2, …, kN takes each natural number from 1 to mN, and f(P1k1, P2k2, …, PNkN) is the product of P1k1, P2k2, …, PNkN.
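The N > 1 case of step 6 enumerates one label element per class and keeps the combinations whose similarity product exceeds b. A minimal Python sketch, assuming the similarities Pij have already been computed; the function name `possible_label_sets`, the label names, and the numeric values are hypothetical.

```python
from itertools import product
from math import prod

def possible_label_sets(candidates, b):
    """Enumerate groups of possible output labels.

    candidates[i] is a list of (label, P_ij) pairs for class i.
    Returns (labels, f) for every combination with one label per class
    whose similarity product f exceeds the threshold b.
    """
    groups = []
    for combo in product(*candidates):
        f = prod(p for _, p in combo)  # first combined similarity
        if f > b:
            groups.append((tuple(label for label, _ in combo), f))
    return groups

# Hypothetical similarities for two classes (profile image, voice).
candidates = [
    [("id_1", 1.0), ("id_2", 0.6)],
    [("id_1", 0.8), ("id_3", 1.0)],
]
groups = possible_label_sets(candidates, b=0.5)
```

The number of combinations is m1 × m2 × … × mN, so this brute-force enumeration is practical only when the label sets are small.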

Step 7: compute the similarity between the data sample set corresponding to each class's possible output label in each group of possible output labels and the test data set of that class, and from these similarities determine one group of possible output labels as the best output labels. Specifically:

If N = 1, compute the similarity between the data sample set corresponding to the possible output label in each group of possible output labels and the test data set of the class, and take the group of possible output labels with the largest similarity as the group of best output labels.

If N > 1, compute the similarity Pi between the data sample set corresponding to the i-th class's possible output label in each group of possible output labels and the test data set of that class, then compute the second combined similarity g(P1, P2, …, PN), and take the group of possible output labels with the largest second combined similarity as the group of best output labels, where g(P1, P2, …, PN) is the product of P1, P2, …, PN and i takes each natural number from 1 to N.
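Selecting the best group in step 7 is an argmax over the product g. A short Python sketch under the assumption that the per-class similarities Pi are already available for each group; the function name `best_label_set` and the values are hypothetical.

```python
from math import prod

def best_label_set(scored_groups):
    """Return the labels of the group with the largest similarity product g.

    scored_groups: list of (labels, [P_1, ..., P_N]) pairs.
    """
    labels, _ = max(scored_groups, key=lambda item: prod(item[1]))
    return labels

# Hypothetical groups of possible output labels with per-class similarities.
scored_groups = [
    (("id_1", "id_1"), [0.9, 0.7]),
    (("id_1", "id_3"), [0.9, 0.95]),
    (("id_2", "id_3"), [0.5, 0.95]),
]
best = best_label_set(scored_groups)
```

With N = 1 each similarity list has a single entry, so the same function picks the group with the largest similarity.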

Step 8: compute the probability that the output labels of the classes in the possible output labels match and the probability that they do not match, and use these as the match and mismatch probabilities of the classes' output labels. Specifically:

If N = 1, there is only one class of output label, so the probability that the output labels match is 100% and the probability that they do not match is 0%.

If N > 1, first determine, for each group of possible output labels, whether the possible output labels of all classes in the group match.

Then divide the sum of the second combined similarities of the groups judged to match by the sum of the second combined similarities of all groups of possible output labels to obtain the probability that the classes' output labels match.

Finally, subtract the probability that the classes' output labels match from 100% to obtain the probability that they do not match.
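The probabilities of step 8 follow directly from the second combined similarities. A minimal Python sketch, assuming each group is given as a tuple of labels with its value of g; the function name `match_probability` and the numbers are hypothetical, and raising an error when all similarities are zero is an assumption, since the text does not address that case.

```python
def match_probability(scored_groups):
    """Return (p_match, p_mismatch) as fractions between 0 and 1.

    scored_groups: list of (labels, g) pairs, where labels holds one
    label per class and g is the second combined similarity.
    """
    total = sum(g for _, g in scored_groups)
    if total == 0:
        raise ValueError("all combined similarities are zero")
    # A group matches when every class proposes the same label.
    matching = sum(g for labels, g in scored_groups if len(set(labels)) == 1)
    p_match = matching / total
    return p_match, 1.0 - p_match

# Hypothetical groups with their second combined similarities.
scored_groups = [
    (("id_1", "id_1"), 0.75),
    (("id_1", "id_3"), 0.125),
    (("id_2", "id_3"), 0.125),
]
p_match, p_mismatch = match_probability(scored_groups)
```

When N = 1 every group holds a single label, so the function returns a match probability of 1.0, in line with the 100% stated for that case.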

Step 9: output the possible output labels, the best output labels, and the match and mismatch probabilities of the classes' output labels.

Compared with the prior art, the present invention has the following notable effects: it combines deep learning neural networks with similarity calculation, enriching the output and raising output accuracy. By combining the two so that each offsets the other's weaknesses, the similarity calculation compensates for the loss of output accuracy that a deep learning neural network suffers when there are many output labels and too few input samples, further improving output accuracy.

The present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a flowchart of the data identification method based on the associative clustering deep learning neural network of the present invention.

FIG. 2 is a flowchart of the preprocessing of the data sample sets and label sets in the data identification method based on the associative clustering deep learning neural network of the present invention.

FIG. 3 is a flowchart of the deep learning neural network training in the data identification method based on the associative clustering deep learning neural network of the present invention.

With reference to the drawings, the data identification method based on the associative clustering deep learning neural network of the present invention comprises steps 1 to 9.

Step 1: acquire N classes of data sample sets and the label set corresponding to each class's data sample set; acquire the data preset format of the data samples of each class and the label preset format; then preprocess the N classes of data sample sets and the label sets, where N is greater than or equal to 1.

The data preset format of the data samples of each of the N classes and the label preset format are acquired, specifically:

The N classes of data sample sets and the label sets are preprocessed, specifically:

Step 2: initialize the N deep learning neural networks corresponding to the N classes of data sample sets, specifically:

Step 2-1: use the data preset format of each class's data samples as the input format of the deep learning neural network for that class.

Step 2-2: use the label preset format as the output format of the deep learning neural network for each class.

Step 2-3: acquire configuration information for the deep learning neural network of each class, adopt it as that network's configuration information, and configure the network accordingly, specifically:

Step 2-3-1: from a deep learning neural network configuration knowledge base, retrieve the configuration information of the deep learning neural network whose input format and output format best match the class's data preset format and the label preset format, and use it as the preset configuration information of the deep learning neural network for that class.

Step 2-3-2: output the preset configuration information of the deep learning neural network for each class to the user.

Step 2-3-3: acquire the user's changes to the preset configuration information of the deep learning neural network for each class.

Step 2-3-4: use the changed preset configuration information as the preset configuration information of the deep learning neural network for that class.

Step 3: using each class's data sample set acquired in step 1 as input and its corresponding label set as output, train the deep learning neural network for that class to obtain N trained deep learning neural networks, specifically:

Step 3-1: use each data sample in each class's data sample set as input to the deep learning neural network for that class and perform bottom-up unsupervised training on that network.

Step 3-2: use each data sample in each class's data sample set as input to the deep learning neural network for that class, use the label corresponding to that data sample in the class's label set as output, and perform top-down supervised learning on that network, obtaining N trained deep learning neural networks.

The similarity between data sample set A and data sample set B = max(similarity between each sample in data sample set A and each sample in data sample set B).
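The set-level similarity defined above is the maximum over all cross-set sample pairs. The Python sketch below illustrates this; the per-sample similarity is not specified in the text, so cosine similarity on feature vectors is used purely as an assumed stand-in, and the names `set_similarity` and `cosine` are hypothetical.

```python
from math import hypot

def cosine(a, b):
    """Cosine similarity of two non-zero vectors (assumed per-sample measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (hypot(*a) * hypot(*b))

def set_similarity(set_a, set_b, sample_similarity=cosine):
    """Similarity of two sample sets: the maximum over all cross-set pairs."""
    return max(sample_similarity(a, b) for a in set_a for b in set_b)

# Hypothetical two-dimensional feature vectors.
set_a = [(1.0, 0.0), (0.0, 1.0)]
set_b = [(1.0, 1.0), (0.0, 2.0)]
similarity = set_similarity(set_a, set_b)
```

Both sets must be non-empty, because `max` raises `ValueError` on an empty sequence.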

The present invention combines deep learning neural networks with similarity calculation, enriching the output and raising output accuracy. By combining the two so that each offsets the other's weaknesses, the similarity calculation compensates for the loss of output accuracy that a deep learning neural network suffers when there are many output labels and too few input samples, further improving output accuracy.

The present invention is described in detail below with reference to the drawings and a specific embodiment.

Embodiment

Take two classes of data sample sets and their corresponding label sets as an example. The first-class data sample set is "profile image 11, profile image 12, profile image 13, profile image 14, …, profile image 1m", and the corresponding first-class output label set is "ID card number 11, ID card number 12, ID card number 13, ID card number 14, …, ID card number 1m", where profile image 11 corresponds to ID card number 11, profile image 12 corresponds to ID card number 12, profile image 13 corresponds to ID card number 13, profile image 14 corresponds to ID card number 14, …, and profile image 1m corresponds to ID card number 1m. Identical ID card numbers may occur among them; for example, ID card number 13 and ID card number 16 are the same. The second-class data sample set is "voice 21, voice 22, voice 23, voice 24, …, voice 2n", and the corresponding second-class output label set is "ID card number 21, ID card number 22, ID card number 23, ID card number 24, …, ID card number 2n", where voice 21 corresponds to ID card number 21, voice 22 corresponds to ID card number 22, voice 23 corresponds to ID card number 23, voice 24 corresponds to ID card number 24, …, and voice 2n corresponds to ID card number 2n. Identical ID card numbers may occur among them; for example, ID card number 22 and ID card number 28 are the same.

With reference to FIG. 1, the data identification method based on the associative clustering deep learning neural network of the present invention comprises steps 1 to 9.

Step 1: acquire the two classes of data sample sets and the label set corresponding to each class's data sample set, and acquire the data preset format of the data samples of each of the two classes and the label preset format, specifically:

The data format of every data sample in each class's data sample set is obtained, and identical data formats within the class are merged to yield s distinct data formats. For each data format Pi in the class's data sample set, the number of data samples Mi in that format is counted, and the format Pi with the largest Mi is taken as the data preset format for the data samples of that class, where s is at least 1 and i ranges from 1 to s. For example, the first-class data samples are image samples and the second-class data samples are audio samples. Taking the first class as an example, its data sample set contains 809 samples in 480x640-pixel JPEG format, 8367 samples in 480x640-pixel TIFF format, 67 samples in 480x640-pixel BMP format, 5362 samples in 2576x1932-pixel JPEG format, 32 samples in 2576x1932-pixel TIFF format, and 136 samples in 2576x1932-pixel BMP format. The format with the largest number of samples is the 480x640-pixel TIFF format, so the 480x640-pixel TIFF image format is taken as the data preset format of the first-class data samples.
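The majority-count rule for the data preset format can be sketched in Python as follows. This is only an illustration: modelling a format as a (width, height, file type) tuple, and the names `format_counts` and `preset_format`, are assumptions made here rather than anything the patent specifies. The counts are those of the first-class example.

```python
from collections import Counter

# Sample counts Mi per data format Pi for the first class (from the example).
# A format is modelled here as a hypothetical (width, height, file type) tuple.
format_counts = Counter({
    (480, 640, "JPEG"): 809,
    (480, 640, "TIFF"): 8367,
    (480, 640, "BMP"): 67,
    (2576, 1932, "JPEG"): 5362,
    (2576, 1932, "TIFF"): 32,
    (2576, 1932, "BMP"): 136,
})


def preset_format(counts):
    """Return the format with the largest sample count."""
    fmt, _count = counts.most_common(1)[0]
    return fmt


data_preset_format = preset_format(format_counts)
print(data_preset_format)
```

The text does not say how ties between equally frequent formats are resolved, so the sketch simply takes whichever format `most_common` reports first.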

The label format of every label in the label set corresponding to each class's data sample set is obtained, and identical label formats across all classes are merged to yield at least t label formats. For each label format Qj, the number of labels Nj in that format is counted, and the format Qj with the largest Nj is taken as the label preset format, where t is at least 1 and j ranges from 1 to t. For example, there are two classes of label sets corresponding to the data sample sets. The label set of the first-class data sample set contains 5636 ID card number labels and 5426 name labels, and the label set of the second-class data sample set contains 2654 ID card number labels and 235 name labels. Across both classes there are therefore 8290 ID card number labels and 5661 name labels, so the ID card number label is taken as the label preset format.
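Unlike the data preset format, the label preset format in this example is chosen from counts pooled over both classes. A minimal sketch of that pooling, using the example's figures (the key names `id_card_number` and `name` and all variable names are hypothetical):

```python
from collections import Counter

# Label-format counts per class, taken from the example above.
class1_label_formats = Counter({"id_card_number": 5636, "name": 5426})
class2_label_formats = Counter({"id_card_number": 2654, "name": 235})

# Pool the counts over all classes, then pick the most frequent format.
pooled = class1_label_formats + class2_label_formats
label_preset_format = pooled.most_common(1)[0][0]

print(dict(pooled), label_preset_format)
```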

The two input classes of data sample sets and label sets are then preprocessed. Referring to Fig. 2 and taking the first-class data sample set and label set as an example, the process is as follows:

Step 1-1: determine whether the data format of each data sample in each class's data sample set matches the data preset format of that class; if it does not, convert the data sample to the data preset format of that class. For example, suppose the 480x640-pixel TIFF image format is the data preset format of the first-class data samples. A first-class data sample that is already in 480x640-pixel TIFF format matches the preset format and needs no conversion, whereas a first-class data sample in 2576x1932-pixel JPEG format differs from the preset format and must be converted to 480x640-pixel TIFF format.

Step 1-2: determine whether the format of the label corresponding to each data sample in each class's data sample set matches the label preset format; if it does not, convert that label to the label preset format. For example, suppose the ID card number label is the label preset format. If the label of a first-class data sample is in ID card number format, it matches the label preset format and needs no conversion; if the label is in name format, it differs from the label preset format and must be converted to ID card number format.

Step 1-3: cluster the first-class data sample set to obtain j clustered data sample sets and their corresponding output label sets. Specifically:

First, the first-class data sample set "profile image 11, profile image 12, profile image 13, profile image 14, ..., profile image 1m" is clustered according to the following rule. Profile images whose similarity exceeds the preset profile-image similarity threshold (90% by default) are placed in the same cluster; that is, similarities are computed among the profile images, and images whose similarity exceeds 90% are placed in one cluster. For any profile image in a cluster, there is another profile image in the same cluster whose similarity to it exceeds the threshold. At the same time, for any profile image in a cluster, no profile image in any other cluster has a similarity to it that exceeds the threshold, and each profile image belongs to exactly one cluster. Under this rule, the first cluster "profile image 111, profile image 112, ..., profile image 11m1", the second cluster "profile image 211, profile image 212, ..., profile image 21m2", the third cluster "profile image 311, profile image 312, ..., profile image 31m3", ..., and the j-th cluster "profile image j11, profile image j12, ..., profile image j1mj" are obtained.
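One way to read the clustering rule is as finding the connected components of a graph whose edges join image pairs with similarity above the threshold. The sketch below takes that reading and uses a small union-find. The function names, the image identifiers and the toy similarity table are hypothetical, and the patent does not specify how similarity is computed or which clustering algorithm is used.

```python
def cluster_by_threshold(items, similarity, threshold=90):
    """Group items so that any pair scoring above `threshold` shares a cluster."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similarity(items[i], items[j]) > threshold:
                parent[find(i)] = find(j)  # union the two components

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())


# Hypothetical pairwise similarities in percent; unlisted pairs count as 0.
toy_scores = {
    ("image11", "image12"): 95,
    ("image12", "image13"): 93,
    ("image14", "image15"): 96,
}


def toy_similarity(a, b):
    return toy_scores.get((a, b), toy_scores.get((b, a), 0))


images = ["image11", "image12", "image13", "image14", "image15"]
clusters = cluster_by_threshold(images, toy_similarity)
print(clusters)
```

Under this reading, an image with no sufficiently similar partner ends up in a cluster of its own, a case the rule as written does not address.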

Next, the label set corresponding to the first-class data sample set is clustered. The output label set corresponding to the first cluster of the first-class data sample set, "profile image 111, profile image 112, ..., profile image 11m1", is "ID card number 111, ID card number 112, ..., ID card number 11m1", and it becomes the first cluster of the first-class output labels. The output label set corresponding to the second cluster of the data sample set, "profile image 211, profile image 212, ..., profile image 21m2", is "ID card number 211, ID card number 212, ..., ID card number 21m2", and it becomes the second cluster of the first-class output labels. Likewise, the output label set corresponding to the j-th cluster of the data sample set, "profile image j11, profile image j12, ..., profile image j1mj", is "ID card number j11, ID card number j12, ..., ID card number j1mj", and it becomes the j-th cluster of the first-class output labels.

Step 1-4: merge identical labels within each of the J clustered output label sets to obtain J updated output label sets.

Step 1-5: among the J updated output label sets, merge the label sets that share a label, and merge their corresponding data sample sets in the same way, to obtain the preprocessed data sample sets and their corresponding output label sets.

For example, the output label set corresponding to the first cluster of the first-class data sample set, "profile image 111, profile image 112, ..., profile image 11m1", is "ID card number 111, ID card number 112, ..., ID card number 11m1", and the output label set corresponding to the second cluster, "profile image 211, profile image 212, ..., profile image 21m2", is "ID card number 211, ID card number 212, ..., ID card number 21m2". If ID card number 212 in the output label set of the second cluster is the same as ID card number 116 in the output label set of the first cluster, the output label set of the second cluster is merged into the output label set of the first cluster, and at the same time the data sample set of the second cluster is merged into the data sample set of the first cluster.
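Steps 1-4 and 1-5 can be sketched as follows, under the assumption that each cluster is held as a (samples, labels) pair. Storing the labels as a Python set performs the de-duplication of step 1-4, and clusters whose label sets intersect are merged as in step 1-5. The function name and the placeholder identifiers are hypothetical.

```python
def merge_clusters_sharing_labels(clusters):
    """Merge (samples, labels) clusters whose label sets share any label."""
    merged = []
    for samples, labels in clusters:
        samples, labels = list(samples), set(labels)  # set() drops duplicates
        remaining = []
        for m_samples, m_labels in merged:
            if labels & m_labels:
                # Shared label: absorb the earlier cluster into this one.
                samples = m_samples + samples
                labels |= m_labels
            else:
                remaining.append((m_samples, m_labels))
        remaining.append((samples, labels))
        merged = remaining
    return merged


# Placeholder data: the first and second clusters share the label "ID-B".
clusters_in = [
    (["image111", "image112"], ["ID-A", "ID-B", "ID-B"]),
    (["image211", "image212"], ["ID-C", "ID-B"]),
    (["image311"], ["ID-D"]),
]
merged = merge_clusters_sharing_labels(clusters_in)
print(merged)
```

Because each incoming cluster absorbs every earlier cluster it overlaps with, chains of shared labels (A with B, B with C) also collapse into one cluster. Whether the patent intends such transitive merging is not stated.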

In step 2, N deep learning neural networks are initialized, one for each of the N classes of data sample sets.

In step 3, the data sample set of each class obtained in step 1 is used as input and its corresponding label set as output to train the deep learning neural network of that class, yielding two trained deep learning neural networks. Referring to Fig. 3, specifically:

Step 3-2: each data sample in each class's data sample set is used as the input of that class's deep learning neural network and the corresponding label in the corresponding label set as the output, and top-down supervised learning is performed on that network, yielding two trained deep learning neural networks.

In step 4, one item of test data is acquired for each class's deep learning neural network, and the test data of each class is converted to the data preset format of that class. For example, suppose the 480x640-pixel TIFF image format is the data preset format of the first-class data samples. If the first-class test data is already in 480x640-pixel TIFF format, it matches the preset format and needs no conversion; if it is in 2576x1932-pixel JPEG format, it differs from the preset format and must be converted to 480x640-pixel TIFF format.

The test data is then fed to the deep learning neural network of its class, and the network's computation yields the test output label for that class. For example, a first-class test item "profile image 1p" is input to the first-class deep learning neural network and produces the test output label "Zhang San's ID card number", and a second-class test item "voice 2q" is input to the second-class deep learning neural network and produces the test output label "Li Si's ID card number".

In step 5, the label sets preprocessed in step 1 are searched for the label set that contains each class's test output label, and it is determined whether that label set has only one label element. If the label set containing each class's test output label has only one label element, the test output label of each class is taken as the best output label of that class; that is, "Zhang San's ID card number" and "Li Si's ID card number" from step 4 become the best output labels of the first and second classes respectively. Otherwise, the method proceeds to the next step.
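The singleton check of step 5 can be sketched for a single class as below. The function name and the placeholder label strings are hypothetical, and returning `None` for a label that appears in no label set is an assumption, since the text does not cover that case.

```python
def best_label_or_none(preprocessed_label_sets, test_label):
    """Return test_label if its label set is a singleton, else None (go to step 6)."""
    for label_set in preprocessed_label_sets:
        if test_label in label_set:
            return test_label if len(label_set) == 1 else None
    return None  # assumption: a label found in no set is also deferred


# Placeholder label sets after preprocessing.
label_sets = [{"ID-Zhang-San"}, {"ID-Li-Si", "ID-Wu-Qi"}]

print(best_label_or_none(label_sets, "ID-Zhang-San"))
print(best_label_or_none(label_sets, "ID-Li-Si"))
```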

In step 6, the similarity is computed between the data sample set corresponding to each class's test output label and the data sample set corresponding to each label element in the label set that contains that test output label. The groups of possible output labels are then determined from these similarities, each group containing one possible output label per class. For example, the first-class output label set containing "Zhang San's ID card number" from step 4 is "Zhu Yi's ID card number, Zheng Er's ID card number, Zhang San's ID card number, Wu Qi's ID card number", and the corresponding data sample sets are "Zhu Yi's profile image set, Zheng Er's profile image set, Zhang San's profile image set, Wu Qi's profile image set". The second-class output label set containing "Li Si's ID card number" is "Tian Yi's ID card number, Li Si's ID card number, Wu Qi's ID card number", and the corresponding data sample sets are "Tian Yi's voice set, Li Si's voice set, Wu Qi's voice set". Assuming there are N classes of data sample sets with corresponding label sets, the process falls into the following two cases:

(1) N = 1: there is only one class of data sample set and corresponding label set, for example only the first-class data sample set and its label set described above.

The similarity is computed between the data sample set corresponding to the test output label, "Zhang San's profile image set", and the data sample set corresponding to each label element in the label set containing the test output label. The similarity a1 between "Zhang San's profile image set" and "Zhu Yi's profile image set" is 80%, the similarity a2 with "Zheng Er's profile image set" is 90%, the similarity a3 with "Zhang San's profile image set" itself is 100%, and the similarity a4 with "Wu Qi's profile image set" is 92%. Of these, a2, a3 and a4 are all greater than the first preset value of 80%, so there are three groups of possible output labels: "Zheng Er's ID card number", "Zhang San's ID card number" and "Wu Qi's ID card number".
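For the N = 1 case, the selection of possible output labels is a strict threshold filter. A minimal sketch with the example's similarities held as integer percentages (the dictionary keys are romanized names used only as placeholders, and the variable names are hypothetical):

```python
# Similarity (in percent) of each candidate's profile image set to
# "Zhang San's profile image set", as given in the example.
similarity_pct = {"Zhu Yi": 80, "Zheng Er": 90, "Zhang San": 100, "Wu Qi": 92}
FIRST_PRESET_PCT = 80

# Keep only candidates strictly above the first preset value.
possible_labels = [
    name for name, pct in similarity_pct.items() if pct > FIRST_PRESET_PCT
]
print(possible_labels)
```

The comparison is strict, which is why the 80% similarity a1 does not qualify.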

(2) N > 1: there are several classes of data sample sets and corresponding label sets. For example, when N = 2, these are the first-class data sample set with its label set and the second-class data sample set with its label set described above.

First, the similarity is computed between the data sample set corresponding to the first-class test output label, "Zhang San's profile image set", and the data sample set corresponding to each label element in the label set containing that test output label. The results are the same as in the N = 1 case above.

Next, the similarity is computed between the data sample set corresponding to the second-class test output label, "Li Si's voice set", and the data sample set corresponding to each label element in the label set containing that test output label. The similarity b1 between "Li Si's voice set" and "Tian Yi's voice set" is 95%, the similarity b2 between "Li Si's voice set" and "Li Si's voice set" itself is 100%, and the similarity b3 between "Li Si's voice set" and "Wu Qi's voice set" is 85%.

Finally, the first combined similarity value of every output label group is computed as the product of the per-class similarities:
c1 "Zhu Yi's ID card number, Tian Yi's ID card number": f1 = 80% × 95% = 76%;
c2 "Zhu Yi's ID card number, Li Si's ID card number": f2 = 80% × 100% = 80%;
c3 "Zhu Yi's ID card number, Wu Qi's ID card number": f3 = 80% × 85% = 68%;
c4 "Zheng Er's ID card number, Tian Yi's ID card number": f4 = 90% × 95% = 85.5%;
c5 "Zheng Er's ID card number, Li Si's ID card number": f5 = 90% × 100% = 90%;
c6 "Zheng Er's ID card number, Wu Qi's ID card number": f6 = 90% × 85% = 76.5%;
c7 "Zhang San's ID card number, Tian Yi's ID card number": f7 = 100% × 95% = 95%;
c8 "Zhang San's ID card number, Li Si's ID card number": f8 = 100% × 100% = 100%;
c9 "Zhang San's ID card number, Wu Qi's ID card number": f9 = 100% × 85% = 85%;
c10 "Wu Qi's ID card number, Tian Yi's ID card number": f10 = 92% × 95% = 87.4%;
c11 "Wu Qi's ID card number, Li Si's ID card number": f11 = 92% × 100% = 92%;
c12 "Wu Qi's ID card number, Wu Qi's ID card number": f12 = 92% × 85% = 78.2%.
Of these, f4, f5, f7, f8, f10 and f11 are greater than the second preset value of 85%, so there are six groups of possible output labels, namely the output label groups c4, c5, c7, c8, c10 and c11.
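The enumeration of label groups and their first combined similarity values can be reproduced with a short sketch. The similarities are kept as integer percentages so that the comparison against the second preset value is exact. The function and variable names, and the romanized names used as keys, are assumptions made for illustration.

```python
from itertools import product

# Per-class similarities in percent, from the example above.
image_sim = {"Zhu Yi": 80, "Zheng Er": 90, "Zhang San": 100, "Wu Qi": 92}
voice_sim = {"Tian Yi": 95, "Li Si": 100, "Wu Qi": 85}
SECOND_PRESET_PCT = 85


def possible_groups(sims_per_class, preset_pct):
    """Return (labels, combined %) for groups whose product exceeds the preset."""
    groups = []
    scale = 100 ** (len(sims_per_class) - 1)
    for combo in product(*(sims.items() for sims in sims_per_class)):
        labels = tuple(label for label, _pct in combo)
        scaled = 1
        for _label, pct in combo:
            scaled *= pct  # integer product, e.g. 90 * 95 = 8550
        if scaled > preset_pct * scale:  # exact integer comparison
            groups.append((labels, scaled / scale))
    return groups


groups = possible_groups([image_sim, voice_sim], SECOND_PRESET_PCT)
for labels, combined in groups:
    print(labels, combined)
```

Because the comparison is strict, the group with a combined value of exactly 85% (c9) is excluded, matching the six groups listed in the text.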

In step 7, for each group of possible output labels obtained in step 6, the similarity is computed between the data sample set corresponding to each class's possible output label and the test data set of that class, and on the basis of these similarities one group of possible output labels is determined as the best output labels. Following the example in step 6, the process falls into the following two cases:

(1) N = 1: as step 6 shows, there are three groups of possible output labels: "Zheng Er's ID card number", "Zhang San's ID card number" and "Wu Qi's ID card number". Of these, the possible output label "Zhang San's ID card number" has the largest similarity value, so it is taken as the best output label group.

(2) N > 1: as step 6 shows, there are six groups of possible output labels in total, c4, c5, c7, c8, c10 and c11. The process is as follows:

First, for each group of possible output labels, the similarity is computed between the data sample set corresponding to its first-class possible output label and the test data set of that class, "Zhang San's profile image set". The similarity between "Zheng Er's profile image set" and "Zhang San's profile image set" is 90%, the similarity between "Zhang San's profile image set" and itself is 100%, and the similarity between "Wu Qi's profile image set" and "Zhang San's profile image set" is 92%.

Next, for each group of possible output labels, the similarity is computed between the data sample set corresponding to its second-class possible output label and the test data set of that class, "Li Si's voice set". The similarity between "Li Si's voice set" and itself is 100%, and the similarity between "Tian Yi's voice set" and "Li Si's voice set" is 95%.

The second combined similarity value of each group is then the product of these similarities:
c4 "Zheng Er's ID card number, Tian Yi's ID card number": g4 = 90% × 95% = 85.5%;
c5 "Zheng Er's ID card number, Li Si's ID card number": g5 = 90% × 100% = 90%;
c7 "Zhang San's ID card number, Tian Yi's ID card number": g7 = 100% × 95% = 95%;
c8 "Zhang San's ID card number, Li Si's ID card number": g8 = 100% × 100% = 100%;
c10 "Wu Qi's ID card number, Tian Yi's ID card number": g10 = 92% × 95% = 87.4%;
c11 "Wu Qi's ID card number, Li Si's ID card number": g11 = 92% × 100% = 92%.
The largest second combined similarity value is g8, so the output label group c8, "Zhang San's ID card number, Li Si's ID card number", is taken as the best output label group.
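Selecting the best output label group in step 7 amounts to taking the group with the largest second combined similarity value. A minimal sketch using the example's figures (all names are hypothetical; romanized names stand in for the ID card number labels):

```python
# Similarity (percent) of each candidate's sample set to the class test data set.
image_sim_to_test = {"Zheng Er": 90, "Zhang San": 100, "Wu Qi": 92}
voice_sim_to_test = {"Tian Yi": 95, "Li Si": 100}

# The six possible output label groups c4, c5, c7, c8, c10, c11 from step 6.
candidate_groups = [
    ("Zheng Er", "Tian Yi"), ("Zheng Er", "Li Si"),
    ("Zhang San", "Tian Yi"), ("Zhang San", "Li Si"),
    ("Wu Qi", "Tian Yi"), ("Wu Qi", "Li Si"),
]


def second_combined_pct(group):
    """Product of the per-class similarities, expressed in percent."""
    image_label, voice_label = group
    return image_sim_to_test[image_label] * voice_sim_to_test[voice_label] / 100


best_group = max(candidate_groups, key=second_combined_pct)
print(best_group, second_combined_pct(best_group))
```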

In step 8, the probability that the per-class output labels within the possible output labels match, and the probability that they do not match, are computed and taken as the match probability and mismatch probability of the output labels of the classes. Following the example in step 6, the process falls into the following two cases:

(1) N = 1: as step 6 shows, there is only one class of output label, so the probability that the output labels of the classes match is 100% and the probability that they do not match is 0%.

(2) N > 1: as step 6 shows, there are six groups of possible output labels in total, c4, c5, c7, c8, c10 and c11. The per-class possible output labels do not match in any of them: not in c4 "Zheng Er's ID card number, Tian Yi's ID card number", c5 "Zheng Er's ID card number, Li Si's ID card number", c7 "Zhang San's ID card number, Tian Yi's ID card number", c8 "Zhang San's ID card number, Li Si's ID card number", c10 "Wu Qi's ID card number, Tian Yi's ID card number", or c11 "Wu Qi's ID card number, Li Si's ID card number". Since no group of possible output labels has matching per-class labels, the probability that the output labels of the classes match is 0% and the probability that they do not match is 100%. This indicates that the probability that the test samples of the classes correspond to the same person is 0%.

To illustrate the process further, suppose instead that there are four groups of possible output labels, d4, d6, d10 and d11. In group d4, "Zheng Er's ID card number, Zheng Er's ID card number", the per-class possible output labels match, and the second combined similarity value is 89%. In group d6, "Zhang San's ID card number, Zhang San's ID card number", the labels match, and the second combined similarity value is 53%. In group d10, "Zheng Er's ID card number, Li Si's ID card number", the labels do not match, and the second combined similarity value is 67%. In group d11, "Zhang San's ID card number, Zheng Er's ID card number", the labels do not match, and the second combined similarity value is 75%.

Dividing the sum of the second combined similarity values of the groups judged to match (89% + 53%) by the sum of the second combined similarity values of all groups of possible output labels (89% + 53% + 67% + 75%) gives a 50% probability that the output labels of the classes match. Subtracting this 50% from 100% gives a 50% probability that the output labels of the classes do not match.
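The match probability of step 8 is thus a similarity-weighted share of the matching groups. A sketch using the d4, d6, d10, d11 figures (the function name and data layout are assumptions; romanized names stand in for the ID card number labels):

```python
# (per-class labels, second combined similarity in percent) for d4, d6, d10, d11.
groups = [
    (("Zheng Er", "Zheng Er"), 89),
    (("Zhang San", "Zhang San"), 53),
    (("Zheng Er", "Li Si"), 67),
    (("Zhang San", "Zheng Er"), 75),
]


def match_probability(groups):
    """Share of total combined similarity carried by groups whose labels all agree."""
    total = sum(score for _labels, score in groups)
    matched = sum(score for labels, score in groups if len(set(labels)) == 1)
    return matched / total


p_match = match_probability(groups)
p_mismatch = 1 - p_match
print(p_match, p_mismatch)
```

With a single class every group trivially agrees, so the same function returns 1.0, consistent with the N = 1 case; with the c4 to c11 groups, where no labels agree, it returns 0.0.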

In step 9, the possible output labels, the best output labels, and the match and mismatch probabilities of the output labels of the classes are output.

As the above shows, the present invention uses similarity calculation to compensate for the insufficient output accuracy of a deep learning neural network when there are many output labels but too few input samples, and thereby improves the accuracy of the output.

Claims

A data identification method based on an associative clustering deep learning neural network comprises steps 1-9.
Step 1 obtains the data sample sets of N classes and the label set corresponding to the data sample set of each class, obtains the data preset format of the data samples of each class in the N-class data sample sets as well as the label preset format, and then preprocesses the N-class data sample sets and label sets, where N is greater than or equal to 1.
Step 2 initializes the N deep learning neural networks corresponding to the N-class data sample sets.
Step 3 trains the deep learning neural network corresponding to each class, using the data sample set of that class acquired in step 1 as input and the corresponding label set as output, and obtains N trained deep learning neural networks.
Step 4 obtains one piece of test data for the deep learning neural network corresponding to each class, converts the data format of each class's test data to the data preset format of that class's data samples, feeds the test data as input to the deep learning neural network corresponding to the class, and obtains the test output label of the class through the computation of that network.
Step 5 searches the label set preprocessed in step 1 for the label set in which the test output label of each class exists, and then determines whether the label set has only one label element. If the label set in which each class's test output label resides has only one label element, make each class's test output label the best output label for that class, otherwise proceed to the next step.
Step 6 calculates the similarity between the data sample set corresponding to each class's test output label and the data sample set corresponding to each label element in the label set in which that class's test output label resides, and then calculates and determines each set of possible output labels based on the similarity, where each set of possible output labels includes one possible output label of each class.
Step 7 calculates the similarity between the data sample set corresponding to each class's possible output label in each set of possible output labels and the test data set of that class, and calculates and determines one set of possible output labels based on the similarity to serve as the best output labels.
Step 8 calculates the matching and non-matching probabilities of the output labels of each class in the possible output labels and sets them as the matching and non-matching probabilities of the output labels of each class.
Step 9 outputs the possible output labels, the best output labels, the matching probabilities and the non-matching probabilities of the output labels of each class.
Acquiring the data preset format of the data samples of each class in the N classes, and acquiring the label preset format, specifically includes:
a step of obtaining the data format of each data sample in the data sample set of each class, merging identical data formats within the class to obtain s kinds of data formats, counting the number Mi of data samples corresponding to each data format Pi in the data sample set of the class, and taking the data format Pi corresponding to the largest Mi as the data preset format of the data samples of the class (s is 1 or more, and i is 1 or more and s or less),
a step of obtaining the label format of each label in the label set corresponding to the data sample set of each class, merging identical label formats across all classes to obtain t kinds of label formats, counting the number Nj of labels corresponding to each label format Qj in the label sets, and taking the label format Qj corresponding to the largest Nj as the label preset format (t is 1 or more, and j is 1 or more and t or less),
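The preset-format selection just described amounts to picking the most frequent format. A minimal Python sketch is given below; the function name `preset_format` and the use of file-extension strings as format identifiers are illustrative assumptions, and the claim does not say how ties between equally frequent formats are resolved.

```python
from collections import Counter


def preset_format(formats):
    """Return the most frequent format in `formats`.

    Counter merges identical formats and counts them (the Mi or Nj of the
    claim); the format with the largest count becomes the preset format.
    Tie-breaking is unspecified in the source; here the format that
    appears first among the tied ones wins (an assumption).
    """
    if not formats:
        raise ValueError("no formats to choose from")
    counts = Counter(formats)
    fmt, _count = counts.most_common(1)[0]
    return fmt


# Hypothetical data formats of the samples in one class.
image_formats = ["jpg", "png", "jpg", "bmp", "jpg"]
print(preset_format(image_formats))
```

The same helper could be applied to the label formats pooled over all classes to obtain the label preset format.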
The preprocessing of the N-class data sample sets and label sets in step 1 specifically includes steps 1-1 to 1-5:
Step 1-1, determine whether the data format of each data sample in each class's data sample set matches the data preset format of that class's data samples, and if not, convert the data format of the data sample to the data preset format of that class's data samples,
Step 1-2, determine whether the format of the label corresponding to each data sample in each class's data sample set matches the label preset format, and if not, convert the format of the label corresponding to the data sample to the label preset format,
Step 1-3, cluster the data sample set of each class in the N class data sample set to obtain J clustered data sample sets and their corresponding output label sets.
Step 1-4, merge the same labels from each class of J clustered output label sets to get the updated J output label sets,
Step 1-5, merge the updated J output label sets that have the same labels, and merge their corresponding data sample sets, to obtain the preprocessed data sample sets and the corresponding output label sets,
Initializing the N deep learning neural networks corresponding to the N-class data sample sets specifically includes steps 2-1 to 2-3:
Step 2-1. Set the data preset format of the data sample of each class as the input format of the corresponding deep learning neural network of the class.
Step 2-2, set the label preset format as the output format of the corresponding deep learning neural network of each class.
Step 2-3, acquire the preset configuration information of the deep learning neural network corresponding to each class, use it as the configuration information of that network, and configure the deep learning neural network corresponding to the class,
Acquiring the preset configuration information of the deep learning neural network corresponding to each class and using it as the configuration information of that network specifically includes steps 2-3-1 to 2-3-4:
Step 2-3-1, from a deep learning neural network configuration knowledge base, obtain the configuration information of the deep learning neural network whose input format and output format best match the data preset format of each class and the label preset format, and take it as the preset configuration information of the deep learning neural network corresponding to the class,
where the matching degree between the input and output formats and the data preset format and label preset format of each class = (matching degree between the input format and the data preset format of the class) × u% + (matching degree between the output format and the label preset format) × (1 - u%), and the default value of u is 90,
Step 2-3-2, output the preset configuration information of the corresponding deep learning neural network of each class to the user,
Step 2-3-3, obtain the user's modifications to the preset configuration information of the deep learning neural network corresponding to each class,
Step 2-3-4, use the modified preset configuration information of the deep learning neural network corresponding to each class as the preset configuration information of the deep learning neural network corresponding to the class.
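The weighted matching degree of step 2-3-1 can be sketched in Python as follows. The claim does not define how the matching degree of two individual formats is measured, so `format_match` below (1.0 for identical format names, 0.0 otherwise) is purely an assumption, as are the dictionary keys and entries of the hypothetical knowledge base.

```python
def matching_degree(input_match, output_match, u=90):
    """Weighted matching degree: input_match * u% + output_match * (1 - u%)."""
    w = u / 100.0
    return input_match * w + output_match * (1.0 - w)


def format_match(fmt_a, fmt_b):
    # Assumption: exact equality scores 1.0, anything else 0.0.
    return 1.0 if fmt_a == fmt_b else 0.0


def best_config(knowledge_base, data_preset, label_preset, u=90):
    """Pick the knowledge-base entry with the highest matching degree."""
    return max(
        knowledge_base,
        key=lambda entry: matching_degree(
            format_match(entry["input_format"], data_preset),
            format_match(entry["output_format"], label_preset),
            u,
        ),
    )


# Hypothetical knowledge base with two stored network configurations.
kb = [
    {"name": "cfg_a", "input_format": "jpg", "output_format": "text"},
    {"name": "cfg_b", "input_format": "png", "output_format": "int"},
]
chosen = best_config(kb, data_preset="jpg", label_preset="int")
print(chosen["name"])
```

With the default u = 90, an entry that matches only the input format scores 0.9 and outranks one that matches only the output format, which scores 0.1.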
Training the deep learning neural network corresponding to each class, using the data sample set of that class acquired in step 1 as input and the corresponding label set as output, to obtain N trained deep learning neural networks specifically includes steps 3-1 to 3-2:
Step 3-1, each data sample in the data sample set of each class is used as input to the deep learning neural network corresponding to the class, and bottom-up unsupervised learning is performed on that network,
Step 3-2, each data sample in the data sample set of each class is used as input to the deep learning neural network corresponding to the class, the label corresponding to that data sample in the label set corresponding to the class's data sample set is used as output, top-down supervised learning is performed on the deep learning neural network corresponding to the class, and N trained deep learning neural networks are obtained.
Calculating the similarity between the data sample set corresponding to each class's test output label and the data sample set corresponding to each label element in the label set in which that class's test output label resides, and then calculating and determining each set of possible output labels based on the similarity, is specifically as follows:
If N = 1, the similarity between the data sample set corresponding to the test output label and the data sample set corresponding to each label element in the label set in which the test output label resides is calculated, and every label element whose similarity exceeds the first preset value a is taken as a set of possible output labels.
If N > 1, the data sample set Di corresponding to the test output label of the i-th class is obtained, the number mi of label elements in the label set in which the i-th class's test output label resides is obtained, the data sample set Dij corresponding to the j-th label element in that label set is obtained, and the similarity Pij between Di and Dij is calculated, where i is each natural number from 1 to N and j is each natural number from 1 to mi,
For each combination of values k1, k2, ..., kN, the first overall similarity value f(P1k1, P2k2, ..., PNkN) is calculated, and if f(P1k1, P2k2, ..., PNkN) is greater than the second preset value b, the k1-th label element in the label set in which the first class's test output label resides, the k2-th label element in the label set in which the second class's test output label resides, ..., and the kN-th label element in the label set in which the N-th class's test output label resides are taken as one set of possible output labels, where k1 is each natural number from 1 to m1, k2 is each natural number from 1 to m2, ..., kN is each natural number from 1 to mN, and f(P1k1, P2k2, ..., PNkN) is the product of P1k1, P2k2, ..., PNkN,
Similarity between data sample set A and data sample set B = max (similarity between each sample in data sample set A and each sample in data sample set B),
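The set-level similarity and the enumeration of candidate label sets for N > 1 can be illustrated with the Python sketch below. The sample-level similarity function is not specified in the claim and is passed in as a parameter; the function names, the zero-based indices, and the toy numbers are assumptions for illustration only.

```python
from itertools import product
from math import prod


def set_similarity(set_a, set_b, sample_similarity):
    """Similarity of two sample sets = max over all cross-set sample pairs."""
    return max(sample_similarity(x, y) for x in set_a for y in set_b)


def candidate_label_sets(P, b):
    """Enumerate index tuples (k1, ..., kN) whose product of P[i][ki] exceeds b.

    P[i][j] holds the similarity Pij for class i and label element j.
    Indices are zero-based here, unlike the one-based indices in the claim.
    """
    index_ranges = [range(len(row)) for row in P]
    selected = []
    for ks in product(*index_ranges):
        f = prod(P[i][k] for i, k in enumerate(ks))  # first overall value
        if f > b:
            selected.append(ks)
    return selected


# Toy sample similarity on numbers (hypothetical): 1 minus absolute difference.
toy_similarity = lambda x, y: 1.0 - abs(x - y)
sim = set_similarity([0.0, 1.0], [0.75, 3.0], toy_similarity)

# Two classes with two label elements each (made-up similarities).
P = [[0.9, 0.5], [0.8, 0.4]]
candidates = candidate_label_sets(P, b=0.3)
print(sim, candidates)
```

The number of combinations grows as m1 × m2 × ... × mN, so exhaustive enumeration as sketched here is only practical when the label sets are small.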
Calculating the similarity between the data sample set corresponding to each class's possible output label in each set of possible output labels and the test data set of that class, and calculating and determining one set of possible output labels based on the similarity to serve as the best output labels, is specifically as follows:
When N = 1, the similarity between the data sample set corresponding to the possible output label in each set of possible output labels and the test data set of the class is calculated, and the set of possible output labels with the largest similarity is obtained and taken as the set of best output labels,
When N > 1, the similarity Pi between the data sample set corresponding to the possible output label of the i-th class in each set of possible output labels and the test data set of that class is calculated, the second overall similarity value g(P1, P2, ..., PN) is then calculated, and the set of possible output labels corresponding to the largest second overall similarity value is obtained and taken as the set of best output labels, where g(P1, P2, ..., PN) is the product of P1, P2, ..., PN and i is each natural number from 1 to N,
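Selecting the best set of output labels can be sketched in Python as follows, on the reading that the set with the largest second overall similarity value is chosen. The dictionary layout, the function name `best_label_set`, and the label strings are hypothetical.

```python
from math import prod


def best_label_set(candidates):
    """Return the label tuple whose per-class similarities have the largest product.

    candidates: dict mapping a tuple of per-class labels to the list
    [P1, ..., PN] of similarities between each label's sample set and the
    test data of the corresponding class (layout is an assumption).
    For N = 1 each list has one element, so this reduces to picking the
    largest similarity.
    """
    if not candidates:
        raise ValueError("no candidate label sets")
    return max(candidates, key=lambda labels: prod(candidates[labels]))


# Hypothetical candidates for N = 2 (image class, voice class).
candidates = {
    ("id_001", "id_001"): [0.9, 0.8],
    ("id_001", "id_002"): [0.9, 0.5],
    ("id_003", "id_002"): [0.4, 0.5],
}
best = best_label_set(candidates)
print(best)
```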
Calculating the probabilities that the output labels of the classes in the possible output labels match and do not match, and taking them as the matching and non-matching probabilities of the output labels of the classes, is specifically as follows: if N = 1, there is only one class of output label, so the probability that the output labels of the classes match is 100% and the probability that they do not match is 0%.
If N > 1, it is first determined whether the possible output labels of the classes in each set of possible output labels match, the sum of the second overall similarity values of the sets of possible output labels determined to match is then divided by the sum of the second overall similarity values of all sets of possible output labels to obtain the probability that the output labels of the classes match, and finally the matching probability is subtracted from 100% to obtain the probability that the output labels of the classes do not match.