JP6702456B2

JP6702456B2 - Character shape feature extraction method, character shape feature extraction device, electronic device, and storage medium

Info

Publication number: JP6702456B2
Application number: JP2019019457A
Authority: JP
Inventors: トォンイシュアヌ; ジャンヨンウエイ; ドォンビヌ; ジアンシャヌシャヌ; ジャンジィアシ
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2018-02-08
Filing date: 2019-02-06
Publication date: 2020-06-03
Anticipated expiration: 2039-02-06
Also published as: CN110134935B; CN110134935A; JP2019139771A

Description

The present invention belongs to the field of character processing technology and, more specifically, relates to a method, an apparatus, and an electronic device for extracting character shape features.

In the prior art, character shape extraction is usually implemented with models such as a CNN (Convolutional Neural Network) or an LSTM (Long Short-Term Memory). However, in the course of making the present invention, it was found that the high complexity of these models results in poor performance when acquiring character shape features.

In view of the above problem, the present invention provides a character shape feature extraction method, apparatus, and electronic device with the aim of improving the performance of character shape feature extraction.

To solve the above problem, an embodiment of the present invention first provides a method for extracting character shape features, comprising: preprocessing data to be processed; acquiring a character shape feature extraction window having a predetermined size; and extracting character shape features from the preprocessed data using the character shape feature extraction window, wherein the size of the character shape feature extraction window is kept constant during character shape feature extraction.

Here, preprocessing the data to be processed includes: removing noise from the data; dividing the noise-free data into one or more sentences and dividing each sentence into one or more words; and assigning IDs to the words.

Assigning IDs to the words includes: selecting V different words from the words and constructing a model library from those V words, where V is a natural number; assigning a first ID to each first target word, that is, a word among the words that is in the model library, with different first target words receiving different IDs; and assigning a second ID, different from every first ID, to each second target word, where the second target words include words that are not in the model library and all second IDs are the same.

Acquiring the character shape feature extraction window having the predetermined size includes: extracting the first P characters of a preset word as prefix information and the last S characters of the preset word as suffix information, where P and S are natural numbers; and constructing the character shape feature extraction window from the prefix information and the suffix information.

Extracting character shape features from the preprocessed data using the character shape feature extraction window includes: selecting C different characters as known characters from the character set consisting of the uppercase and lowercase letters of the alphabet, and giving each of the C characters an N-dimensional display vector, where N is a natural number; assigning a display vector to each first target character, that is, a character of a third target word among the words that is not among the known characters, where the display vector assigned to the first target character differs from those given to the C characters; obtaining the display vectors of the first P characters and of the last S characters to form a first vector of (P+S)*N dimensions; acquiring a weight matrix M having (P+S)*N rows and F columns, where F is a natural number; and multiplying the first vector by the weight matrix to obtain an F-dimensional character shape feature.

The method further includes: merging the character shape feature with an acquired word vector feature and using the merged vector as the input of a learning model; and training the learning model to update the weights of the weight matrix M and/or the display vectors.

Next, an embodiment of the present invention provides an apparatus for extracting character shape features, comprising: a preprocessing module that preprocesses data to be processed; an acquisition module that acquires a character shape feature extraction window having a predetermined size; and an extraction module that extracts character shape features from the preprocessed data using the character shape feature extraction window, wherein the size of the character shape feature extraction window is kept constant during character shape feature extraction.

Here, the extraction module acquires the word vector feature, the weight matrix M, and the display vectors, and the apparatus further includes: a merging module that merges the character shape feature with the acquired word vector feature and uses the merged vector as the input of a learning model; and a training module that trains the learning model to update the weights of the weight matrix M and/or the display vectors.

Further, an embodiment of the present invention provides an electronic device comprising a processor and a memory in which computer program instructions are stored, wherein the computer program instructions, when executed by the processor, cause the processor to: preprocess data to be processed; acquire a character shape feature extraction window having a predetermined size; and extract character shape features from the preprocessed data using the character shape feature extraction window, wherein the size of the character shape feature extraction window is kept constant during character shape feature extraction.

Finally, an embodiment of the present invention provides a computer-readable storage medium storing a computer program that, when executed by a processor, causes the processor to: preprocess data to be processed; acquire a character shape feature extraction window having a predetermined size; and extract character shape features from the preprocessed data using the character shape feature extraction window, wherein the size of the character shape feature extraction window is kept constant during character shape feature extraction.

In the embodiments of the present invention, the character shape feature extraction window used while extracting character shape features from the preprocessed data is kept unchanged; that is, the size of the window is constant. As a result, the embodiments improve the performance of character shape feature extraction.

FIG. 1 is a flowchart showing a character shape feature extraction method according to an embodiment of the present invention.
FIG. 2 is a diagram showing a system configuration according to an embodiment of the present invention.
FIG. 3 is a diagram showing hardware according to an embodiment of the present invention.
FIG. 4 is a flowchart showing a method for extracting character shape features according to an embodiment of the present invention.
FIG. 5 is a diagram showing a character shape feature extraction window according to an embodiment of the present invention.
FIG. 6 is a diagram showing a character shape feature extraction apparatus according to an embodiment of the present invention.
FIG. 7 is a block diagram showing the configuration of a preprocessing module according to an embodiment of the present invention.
FIG. 8 is a block diagram showing the configuration of an extraction module according to an embodiment of the present invention.
FIG. 9 is a diagram showing the configuration of a character shape feature extraction apparatus according to an embodiment of the present invention.
FIG. 10 is a diagram showing the configuration of an electronic device according to an embodiment of the present invention.

Specific embodiments of the present invention are described in more detail below with reference to the drawings and examples. The following examples illustrate the invention and do not limit its scope.

As shown in FIG. 1, the method for extracting character shape features according to an embodiment of the present invention includes the following steps.

In step 101, the data to be processed is preprocessed.

Here, the data to be processed may be any data, for example a segment of text on a web page.

In this embodiment, the preprocessing mainly includes the following steps.

(1) Remove noise from the data to be processed.

This step is also called data cleansing, and its main purpose is to remove noise from the data to be processed. Noise includes URLs (Uniform Resource Locators), e-mail addresses, and symbols introduced by the web such as "<" and ">". XML (Extensible Markup Language) tags introduced by web pages, such as "<html>", "<title>", and "<body>", are removed, and only the text between the tags is retained.
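The cleansing step can be sketched as follows. This is a minimal illustration, not the patented implementation: the regular expressions and the function name `clean_text` are assumptions, since the source does not say how URLs, e-mail addresses, or tags are recognized.

```python
import re

# Hypothetical patterns; the source does not specify the exact matching rules.
TAG_RE = re.compile(r"<[^>]+>")                              # markup tags such as <html>
URL_RE = re.compile(r"https?://\S+|www\.\S+")                # URLs
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")   # e-mail addresses


def clean_text(raw: str) -> str:
    """Remove tags, URLs, and e-mail addresses, keeping the text between tags."""
    text = TAG_RE.sub(" ", raw)
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = text.replace("<", " ").replace(">", " ")  # stray angle brackets
    return " ".join(text.split())                     # collapse whitespace


page = "<html><body>Contact me@example.com or visit http://example.com/page today</body></html>"
cleaned = clean_text(page)
```

A production cleaner would more likely rely on an HTML parser than on regular expressions; the regex form is used here only to keep the sketch short.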

(2) Divide the noise-free data into one or more sentences, and divide each sentence into one or more words.

Here, the Python library NLTK (Natural Language Toolkit) is used to divide the data into sentences and then to divide those sentences into words. The word is the smallest unit of data.
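The two-level division can be illustrated with the sketch below. The source uses NLTK, whose `sent_tokenize` and `word_tokenize` functions perform this job; to keep the example free of external dependencies, the code substitutes a deliberately naive regex-based stand-in. The function names and splitting rules are assumptions and are far cruder than NLTK's tokenizers.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive stand-in for sentence splitting: break after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def split_words(sentence: str) -> List[str]:
    """Naive stand-in for word splitting: letter/digit runs, with punctuation kept as separate tokens."""
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", sentence)


sentences = split_sentences("Hello world. How are you?")
words = [split_words(s) for s in sentences]
```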

(3) Assign IDs to the words.

In this step, V different words are selected from the words to construct a model library, where V is a natural number. If a first target word among the words is in the model library, a first ID is assigned to it; different first target words receive different IDs. If a second target word is not in the model library, a second ID is assigned to it. The second ID differs from every first ID.

A first target word may be any one of the obtained words.

The second target words include words that are not in the model library, and all second target words have the same ID value. For example, a second target word may be one of the words obtained by the division that is not in the model library, or any other word that is not in the model library.

In practice, V different words are selected from the obtained words to construct the model library, where the parameter V is specified by the user. After a unique ID has been assigned to each distinct word, the obtained words are replaced with their IDs, as follows.

(a) A word in the model library is replaced with its corresponding unique ID.

(b) A word not in the model library is replaced with an otherwise unused ID, which serves as the unknown-word ID.
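Rules (a) and (b) can be sketched as follows. The source does not say how the V words are chosen, so taking the first V distinct words is an assumption, as are the function names and the choice of 0 as the shared unknown-word ID.

```python
from typing import Dict, List

UNKNOWN_ID = 0  # assumption: one reserved ID shared by every out-of-library word


def build_model_library(words: List[str], v: int) -> Dict[str, int]:
    """Take the first v distinct words (assumed selection rule) and give each a unique ID."""
    library: Dict[str, int] = {}
    for w in words:
        if w not in library and len(library) < v:
            library[w] = len(library) + 1  # IDs run from 1 to v
    return library


def words_to_ids(words: List[str], library: Dict[str, int]) -> List[int]:
    """Replace each word with its ID, or with UNKNOWN_ID if it is not in the library."""
    return [library.get(w, UNKNOWN_ID) for w in words]


tokens = ["the", "cat", "saw", "the", "dog"]
library = build_model_library(tokens, v=3)
ids = words_to_ids(tokens, library)
```

A frequency-based selection (keeping the V most common words) would be an equally plausible reading; only the selection loop would change.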

In step 102, a character shape feature extraction window of a preset size is acquired. Its size is kept constant throughout the character shape feature extraction process.

In this step, the first P characters of a preset word can be extracted as prefix information, and the last S characters of the preset word can be extracted as suffix information, where P and S are natural numbers. The preset word may be any word, and P and S can be specified by the user. The character shape feature extraction window is then constructed from the prefix information and the suffix information.
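A minimal sketch of the window construction is shown below. The function name `extraction_window` and the choice to return the prefix and suffix as a pair are illustrative assumptions; the behavior for words shorter than P or S characters (the slices simply come back shorter) is also an assumption of this sketch.

```python
from typing import Tuple


def extraction_window(word: str, p: int, s: int) -> Tuple[str, str]:
    """Return (prefix, suffix): the first p and the last s characters of word.

    p and s are natural numbers (>= 1). For a short word the slices are
    simply shorter than p or s characters.
    """
    return word[:p], word[-s:]


prefix, suffix = extraction_window("information", p=4, s=4)
```

Because P and S are fixed in advance, every word yields a window of at most P+S characters, regardless of the word's length.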

In step 103, character shape features are extracted from the preprocessed data using the character shape feature extraction window.

This step mainly includes the following sub-steps.

In step 1031, C different characters are selected as known characters from the character set consisting of the uppercase and lowercase letters of the alphabet. Each of these C characters is given an N-dimensional display vector (a vector that represents the character), where N is a natural number.

In practice, the alphabet has 52 letters in total, counting uppercase and lowercase. Any C distinct letters are selected from these 52 as the known characters, and each is given an N-dimensional display vector. The values of C and N are specified by the user. The display vectors of the C characters are initialized randomly and updated during the training described later. Alternatively, one-hot vectors may be used directly as the display vectors, in which case they do not take part in the training described later.

In step 1032, for a third target word among the words, if a first target character in the third target word is not among the known characters, a display vector is assigned to that first target character. The display vector assigned to the first target character differs from the display vectors assigned to the C characters.

Here, the third target word is any one of the words. If a first target character contained in the third target word (for example, a letter, digit, or symbol) is not a known character, it is assigned a display vector that differs from those assigned to the C characters.
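Steps 1031 and 1032 can be sketched as a lookup table with one extra entry for characters outside the known set. The table layout, the `"<unk>"` key, the initialization range, and the function names are all assumptions made for illustration.

```python
import random
import string
from typing import Dict, List

UNKNOWN_KEY = "<unk>"  # hypothetical key for characters outside the known set


def init_display_vectors(known_chars: str, n: int, seed: int = 0) -> Dict[str, List[float]]:
    """Randomly initialize one n-dimensional vector per known character, plus one for unknown characters."""
    rng = random.Random(seed)
    table = {ch: [rng.uniform(-0.1, 0.1) for _ in range(n)] for ch in known_chars}
    table[UNKNOWN_KEY] = [rng.uniform(-0.1, 0.1) for _ in range(n)]
    return table


def lookup(table: Dict[str, List[float]], ch: str) -> List[float]:
    """Return the display vector for ch, falling back to the unknown-character vector."""
    return table.get(ch, table[UNKNOWN_KEY])


# Here C = 52 (all letters) and N = 8; both values are user-chosen parameters.
table = init_display_vectors(string.ascii_letters, n=8)
digit_vec = lookup(table, "7")  # a digit is not a known character
```

In this sketch all unknown characters share a single vector, which is one way to satisfy the requirement that it differ from the C known-character vectors.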

In step 1033, the display vectors of the first P characters and of the last S characters are obtained to form a first vector, which is a (P+S)*N-dimensional vector.

From the assigned display vectors, the display vectors of the first P characters and of the last S characters are taken and concatenated to form the first vector, which has (P+S)*N dimensions. Character shape features are then extracted from this first vector. If the word is too short to supply enough characters, all-zero N-dimensional vectors are used as padding.
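The construction of the first vector, including zero padding, might look like the sketch below. The source does not say where the padding goes, so padding after a short prefix and before a short suffix is an assumption, as are the `"<unk>"` key and the toy table.

```python
from typing import Dict, List


def first_vector(word: str, p: int, s: int,
                 table: Dict[str, List[float]], n: int) -> List[float]:
    """Concatenate the display vectors of the first p and last s characters.

    Positions a short word cannot fill are padded with all-zero n-dimensional vectors.
    """
    zero = [0.0] * n
    unknown = table["<unk>"]  # hypothetical key for characters outside the known set
    prefix = [table.get(ch, unknown) for ch in word[:p]]
    suffix = [table.get(ch, unknown) for ch in word[-s:]]
    prefix = prefix + [zero] * (p - len(prefix))   # assumed: pad after a short prefix
    suffix = [zero] * (s - len(suffix)) + suffix   # assumed: pad before a short suffix
    return [x for vec in prefix + suffix for x in vec]


# Toy 2-dimensional table, for illustration only.
toy_table = {"a": [1.0, 0.0], "b": [0.0, 1.0], "<unk>": [0.5, 0.5]}
vec = first_vector("ab", p=3, s=3, table=toy_table, n=2)
```

Whatever the word length, the result always has (P+S)*N components, which is what allows a single fixed-size weight matrix to be applied to every word.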

In step 1034, a weight matrix M is acquired. M has (P+S)*N rows and F columns, where F is a natural number that gives the dimension of the character shape feature and is a user-specified parameter. The weights in the matrix are floating-point numbers, obtained by random initialization and continually updated in subsequent training.

In step 1035, the first vector is multiplied by the weight matrix to obtain the character shape feature, whose dimension is F.
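Steps 1034 and 1035 amount to multiplying a row vector by a matrix. The sketch below uses plain Python lists to stay dependency-free; the function names and the initialization range are assumptions, and a real system would normally use a numerical library.

```python
import random
from typing import List


def init_weight_matrix(rows: int, cols: int, seed: int = 0) -> List[List[float]]:
    """Randomly initialize a rows x cols matrix of floating-point weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def shape_feature(first_vec: List[float], m: List[List[float]]) -> List[float]:
    """Multiply a row vector of length R by an R x F matrix, giving an F-dimensional feature."""
    if len(first_vec) != len(m):
        raise ValueError("vector length must equal the number of matrix rows")
    f = len(m[0])
    return [sum(first_vec[i] * m[i][j] for i in range(len(m))) for j in range(f)]


# Small hand-checkable example: a 3-dimensional vector and a 3 x 2 matrix.
feature = shape_feature([1.0, 0.0, 2.0], [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
m = init_weight_matrix(rows=12, cols=5)  # e.g. (P+S)*N = 12 and F = 5
```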

In this embodiment, the character shape feature extraction window is kept unchanged while character shape features are extracted from the preprocessed data; that is, the window size is the same for every word. The embodiment can therefore improve the performance and accuracy of character shape feature extraction. The method can also be applied to simpler network models, which lowers the difficulty of character shape feature extraction and improves reliability.

To further enrich the network model and further improve extraction accuracy, the method may additionally merge the character shape feature with an acquired word vector, use the merged vector as the input of a learning model, and train the learning model to update the weights of the weight matrix M and/or the display vectors. Word vectors can be obtained by conventional techniques.

FIG. 2 shows a system configuration according to an embodiment of the present invention. Character shape features are used as input to a natural language processing model. The input to the overall system is natural-language text collected from the Internet, which also contains web-page noise introduced during collection.

The data preprocessing module 210 preprocesses the data to remove noise. The character shape feature extraction module 220 extracts character shape features, and the feature merging module 230 merges them with other features and feeds the result to the natural language processing system.

FIG. 3 shows hardware according to an embodiment of the present invention. As shown in FIG. 3, the hardware includes: a network interface 310 for connecting to the Internet or another communication network; an input device 320 that collects input signals from the system's user; a hard disk 330 that stores information such as user logs; a central processing unit (CPU) 340 that executes programs; a storage unit 350 that holds temporary variables during program execution; and a display 360 that shows relevant information to the system's user.

Next, the extraction of character shape features according to an embodiment of the present invention is described in detail with reference to FIG. 4. It mainly includes the following steps.

In step 401, the data is preprocessed. Specifically, this includes the following.

(1) Data cleansing
Noise is removed from the data to be processed. Noise includes URLs, e-mail addresses, and symbols introduced by web pages such as "<" and ">". XML tags introduced by web pages, such as "<html>", "<title>", and "<body>", are removed, and only the text between the tags is retained.

(2) Data division
Using the Python library NLTK, the data is divided into sentences and then into words.

(3) Conversion of data to IDs
From the words obtained in (2), 30,000 different words are selected as model words, and each is assigned a unique ID: the first of the 30,000 words receives ID 1, the second receives ID 2, and so on. ID 0 is reserved for unknown words. Each word obtained by the division is then replaced with its corresponding ID.

In step 402, character shape features are extracted.

As shown in FIG. 5, the first four letters of the word "information" are extracted to provide prefix character shape information, and the last four letters of the word are extracted to provide suffix character shape information. The eight extracted letters are concatenated in order to form the character shape feature extraction window, which is therefore "infotion" or "tioninfo".

The set of all uppercase and lowercase letters, which has 52 elements, is chosen as the known character set. Each character in the known character set is assigned a 53-dimensional display vector. These display vectors are one-hot: every component is 0 or 1, and exactly one component is 1. The vector whose first component is 1 and whose other components are 0 is the display vector of an unknown character. The vector whose second component is 1 and whose other components are 0 is the display vector of the first character in the known character set, and so on.

In the character shape feature extraction window, the display vectors of the characters are concatenated to form a (4+4)*53 = 424-dimensional vector. For a word with fewer than four characters, the missing positions are padded with 53-dimensional all-zero vectors.
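The one-hot encoding and the 424-dimensional result can be checked with a short sketch. The ordering of the 52 letters (lowercase first, as in Python's `string.ascii_letters`) and the function names are assumptions; the source fixes only that index 0 belongs to unknown characters.

```python
import string
from typing import List

KNOWN = string.ascii_letters  # 52 known characters; this ordering is an assumption
DIM = len(KNOWN) + 1          # 53: component 0 is reserved for unknown characters


def one_hot(ch: str) -> List[int]:
    """Return the 53-dimensional one-hot display vector of a single character."""
    vec = [0] * DIM
    vec[KNOWN.find(ch) + 1] = 1  # find() returns -1 for unknown characters, giving index 0
    return vec


def window_vector(window: str) -> List[int]:
    """Concatenate the one-hot vectors of every character in the window."""
    return [x for ch in window for x in one_hot(ch)]


vec = window_vector("infotion")  # the window for "information" with P = S = 4
```

Each of the eight window characters contributes exactly one nonzero component, so the 424-dimensional vector contains eight ones.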

A weight matrix M of floating-point numbers is constructed. The matrix has 424 rows and 256 columns; its weights are obtained by random initialization and continually updated in subsequent training. The (4+4)*53 = 424-dimensional vector is multiplied by the weight matrix M to obtain a 256-dimensional vector, which is the character shape feature vector.

In step 403, a sample vector is generated.

The resulting 256-dimensional character shape feature vector is concatenated with other features, including the word vector, and used as the input to the model, which is then trained. The weight matrix M is updated along with the model during training.
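The concatenation in step 403 can be sketched as below. The function name and the 100-dimensional word vector are hypothetical; the source fixes only the 256-dimensional character shape feature and does not give the size of the other features.

```python
from typing import List, Sequence


def merge_features(shape_feature: Sequence[float], word_vector: Sequence[float]) -> List[float]:
    """Concatenate the character shape feature with the word vector to form one model input."""
    return list(shape_feature) + list(word_vector)


# Placeholder values only: a 256-dimensional shape feature and a
# hypothetical 100-dimensional word vector.
sample = merge_features([0.0] * 256, [0.0] * 100)
```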

As described above, the embodiment of the present invention can improve the performance and accuracy of character shape feature extraction. The method can also be applied to simple network models, which lowers the difficulty of character shape feature extraction and improves reliability.

As shown in FIG. 6, a character shape feature extraction apparatus 600 according to an embodiment of the present invention includes: a preprocessing module 601 that preprocesses the data to be processed; an acquisition module 602 that acquires a character shape feature extraction window whose size is preset and kept constant during character shape feature extraction; and an extraction module 603 that uses the acquired character shape feature extraction window to extract character shape features from the preprocessed data.

As shown in FIG. 7, the preprocessing module 601 includes: a removal submodule 6011 that removes noise from the data to be processed; a division submodule 6012 that divides the noise-free data into one or more sentences and divides each sentence into one or more words; and an assignment submodule 6013 that assigns IDs to the words.

Specifically, the assignment submodule 6013 selects V different words from the words to construct a model library, where V is a natural number. If a first target word among the words is in the model library, it assigns a first ID to that word; different first target words receive different IDs. If a second target word is not in the model library, it assigns a second ID to that word. The second ID differs from every first ID; the second target words include words that are not in the model library, and all second target words have the same ID.

Specifically, the acquisition module 602 extracts the first P letters of a preset word as prefix information and the last S letters of the preset word as suffix information, where P and S are natural numbers, and constructs the character shape feature extraction window from the prefix information and the suffix information.

As shown in FIG. 8, the extraction module 603 specifically includes a first assignment sub-module 6031, a second assignment sub-module 6032, a first acquisition sub-module 6033, a second acquisition sub-module 6034, and an extraction sub-module 6035.

The first assignment sub-module 6031 selects C different characters as known characters from a character set consisting of the uppercase and lowercase letters of the alphabet and assigns an N-dimensional display vector to each of the C characters, where N is a natural number.

For a third target word among the plurality of words, when a first target character in the third target word is not among the known characters, the second assignment sub-module 6032 assigns a display vector to the first target character. This display vector differs from the display vectors of the C known characters.

The first acquisition sub-module 6033 acquires the display vectors of the first P characters and of the last S characters and forms a first vector from them. The first vector is a (P+S)*N-dimensional vector.

The second acquisition sub-module 6034 acquires a weighting matrix M, which has (P+S)*N rows and F columns, where F is a natural number.

The extraction sub-module 6035 multiplies the first vector by the weighting matrix M to obtain the character shape feature, whose dimension is F.
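The sub-modules 6031 to 6035 together amount to a table lookup followed by one matrix product. The Python sketch below illustrates this under several assumptions that are not in the text: the C known characters are all 52 ASCII letters, the display vectors and the matrix M are randomly initialized, the first vector is formed by concatenation, and every word has at least max(P, S) characters. All function names are hypothetical.

```python
import random
import string

def make_char_vectors(n, seed=0):
    """Assign an N-dimensional display vector to each known character, plus one
    separate vector shared by all characters outside the known set."""
    rng = random.Random(seed)
    known = string.ascii_letters  # assumed choice: C = 52 known characters
    table = {ch: [rng.uniform(-1, 1) for _ in range(n)] for ch in known}
    unknown = [rng.uniform(-1, 1) for _ in range(n)]
    return table, unknown

def first_vector(word, p, s, table, unknown):
    """Concatenate the display vectors of the first P and last S characters,
    giving a (P+S)*N-dimensional vector."""
    vec = []
    for ch in list(word[:p]) + list(word[-s:]):
        vec.extend(table.get(ch, unknown))
    return vec

def shape_feature(vec, m):
    """Row vector (length R) times matrix M (R rows, F columns) -> F values."""
    return [sum(vec[i] * m[i][j] for i in range(len(vec)))
            for j in range(len(m[0]))]

P, S, N, F = 3, 2, 4, 5
table, unknown = make_char_vectors(N)
rng = random.Random(1)
M = [[rng.uniform(-1, 1) for _ in range(F)] for _ in range((P + S) * N)]
x = first_vector("running", P, S, table, unknown)
feature = shape_feature(x, M)
```

Here `x` has (3 + 2) * 4 = 20 entries and `feature` has F = 5 entries, independent of the length of the input word.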

For the operating principle of the apparatus according to the embodiment of the present invention, refer to the description of the method embodiment above.

In the embodiment of the present invention, the character shape feature extraction window remains unchanged while character shape features are extracted from the preprocessed data; that is, the window size is the same for every word. The embodiment can therefore improve the performance and accuracy of character shape feature extraction. In addition, the method according to the embodiment can be applied to a simpler network model, which lowers the difficulty of character shape feature extraction and improves reliability.

To further enrich the function of the network model and further improve the accuracy of character shape feature extraction, the apparatus according to the embodiment of the present invention further includes, as shown in FIG. 9: a merging module 604 that merges the character shape feature acquired by the extraction module 603 with the word vector feature acquired by the extraction module 603 and uses the merged vector as the input of a learning model; and a training module 605 that trains the learning model and updates the weighting matrix M and/or the weights of the display vectors.
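The merging and training steps can be sketched in Python as follows. The text does not specify the learning model or its loss, so this sketch assumes the simplest possible case: merging is concatenation, the learning model is a single linear read-out with weights `w`, and training is one gradient-descent step on a squared error. Only the update of M is shown; updating the display vectors would follow the same chain rule. All names are hypothetical.

```python
def matvec(x, m):
    """Row vector x (length R) times matrix m (R rows, F columns)."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]

def merge(shape_feat, word_vec):
    """Merge the character shape feature with the word vector by concatenation."""
    return shape_feat + word_vec

def train_step(x, word_vec, m, w, target, lr=0.1):
    """One gradient step on L = 0.5 * (w . z - target)^2, where z is the merged
    vector. Updates m and w in place and returns the loss before the update."""
    f = matvec(x, m)
    z = merge(f, word_vec)
    err = sum(wi * zi for wi, zi in zip(w, z)) - target
    # dL/dM[i][j] = err * w[j] * x[i], because z[j] = sum_i x[i] * M[i][j] for j < F.
    for i in range(len(x)):
        for j in range(len(f)):
            m[i][j] -= lr * err * w[j] * x[i]
    # dL/dw[k] = err * z[k], using z computed before M was changed.
    for k in range(len(w)):
        w[k] -= lr * err * z[k]
    return 0.5 * err * err

x = [1.0, 0.5]                 # stands in for the (P+S)*N-dimensional first vector
word_vec = [0.2]               # stands in for the word vector feature
M = [[0.1, 0.2], [0.3, 0.4]]   # toy weighting matrix with F = 2 columns
w = [0.5, -0.5, 1.0]           # assumed read-out weights over the merged vector
loss_before = train_step(x, word_vec, M, w, target=1.0)
loss_after = train_step(x, word_vec, M, w, target=1.0)
```

Because the shape feature is a linear function of M, the error signal of the learning model reaches M directly, which is what allows M to be learned jointly with the rest of the model.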

As shown in FIG. 10, an embodiment of the present invention provides an electronic device 1000 that includes a processor 1001 and a memory 1002 storing computer program instructions. When the computer program instructions are executed by the processor 1001, the processor 1001 is caused to perform: a step of preprocessing the data to be processed; a step of acquiring a character shape feature extraction window whose size is preset and held constant during character shape feature extraction; and a step of extracting character shape features from the preprocessed data using the character shape feature extraction window.

As shown in FIG. 10, the electronic device 1000 further includes a network interface 1003, an input device 1004, a hard disk 1005, and a display device 1006.

The above interfaces and devices are interconnected through a bus architecture. The bus architecture may include any number of interconnected buses and bridges. Specifically, it links together various circuits of one or more central processing units (CPUs), represented by the processor 1001, and of one or more memories, represented by the memory 1002. The bus architecture also connects various other circuits, such as peripheral devices, voltage regulators, and power management circuits, so that these components can communicate with one another. Besides a data bus, the bus architecture includes a power bus, a control bus, and a status signal bus. These are well known in the art and are not described in detail here.

The network interface 1003 connects to a network (for example, the Internet or a LAN), receives relevant data from the network, and stores the data on the hard disk 1005.

The input device 1004 receives various instructions entered by the user and sends them to the processor 1001 for execution. The input device 1004 may include a keyboard, a pointing device (for example, a mouse, a trackball, or a touchpad), a touch panel, or a touch screen.

The display device 1006 displays the results obtained when the processor 1001 executes instructions.

The memory 1002 stores the programs and data needed to run the operating system, as well as data such as intermediate results produced during computation by the processor 1001.

The memory 1002 according to the embodiment of the present invention may be a volatile memory, a non-volatile memory, or a memory that includes both. The non-volatile memory may be a ROM, PROM, EPROM, EEPROM, or flash memory. The volatile memory may be a RAM used as an external cache. The memory 1002 used in the apparatus and method described herein is not limited to these types and may be any other suitable type of memory.

In some embodiments, the memory 1002 stores an operating system 10021 and an application program 10010, which are executable modules or data structures, or submodules or extension modules thereof.

The operating system 10021 includes various system programs, such as a framework layer, a core library layer, and a driver layer, and is used to implement various core services and hardware-based tasks. The application program 10010 includes various application programs, such as a web browser, and implements various application services. A program that executes the method according to the present embodiment is included in the application program 10010.

When an application program stored in the memory 1002, for example a program or instruction of the application program 10010, is executed by the processor 1001, the processor 1001 performs: a step of preprocessing the data to be processed; a step of acquiring a character shape feature extraction window whose size is preset and held constant during character shape feature extraction; and a step of extracting character shape features from the preprocessed data using the acquired character shape feature extraction window.

The method according to the embodiments of the present invention described above may be applied to, or implemented by, the processor 1001. The processor 1001 is an integrated circuit capable of processing signals. Each step of the method may be carried out by an integrated logic circuit in the hardware of the processor 1001 or by instructions in the form of software. The processor 1001 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method according to the embodiments of the present invention may be executed by a hardware decoder or by a combination of hardware and software in the decoder. The software modules may reside in a storage medium that is well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The processor 1001 reads information from the memory 1002, which contains the storage medium holding this software, and carries out the steps of the method in combination with its hardware.

The embodiments described above may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing unit may be implemented in one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), general-purpose processors, controllers, microcontrollers, microprocessors, other electronic units that perform the functions of the present invention, or a combination thereof.

For a software implementation, the techniques may be implemented with modules (for example, procedures and functions) that perform the functions described above. The software code is stored in a memory and executed by a processor. The memory may be implemented inside or outside the processor.

Specifically, the processor 1001 reads the computer program and performs: a step of removing noise from the data to be processed; a step of dividing the noise-free data into one or more sentences and dividing each sentence into one or more words; and a step of assigning IDs to the words.

Specifically, the processor 1001 reads the computer program and performs: a step of selecting V different words from the plurality of words and configuring a model library from the V different words, where V is a natural number; a step of assigning a first ID to a first target word, among the plurality of words, that is in the model library, with different first target words receiving different first IDs; and a step of assigning a second ID to a second target word that is not in the model library, where the second ID differs from the first ID and all second target words share the same second ID.

Specifically, the processor 1001 reads the computer program and performs: a step of extracting P characters from the beginning of a preset word as prefix information and S characters from the end of the preset word as suffix information, where P and S are natural numbers; and a step of configuring the character shape feature extraction window using the prefix information and the suffix information.

Specifically, the processor 1001 also reads the computer program and performs: a step of selecting C different characters as known characters from a character set consisting of the uppercase and lowercase letters of the alphabet and assigning an N-dimensional display vector to each of the C characters, where N is a natural number; a step of, for a third target word among the plurality of words, assigning to a first target character of the third target word that is not among the known characters a display vector different from the display vectors assigned to the C characters; a step of acquiring the display vectors of the first P characters and of the last S characters and forming a first vector that is a (P+S)*N-dimensional vector; a step of acquiring a weighting matrix M having (P+S)*N rows and F columns, where F is a natural number; and a step of multiplying the first vector by the weighting matrix M to obtain an F-dimensional character shape feature.

Specifically, the processor 1001 reads the computer program and performs: a step of merging the character shape feature with the acquired word vector and using the merged vector as the input of a learning model; and a step of training the learning model and updating the weighting matrix M and/or the weights of the display vectors.

An embodiment of the present invention also provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the processor is caused to perform: a step of preprocessing the data to be processed; a step of acquiring a character shape feature extraction window whose size is preset and held constant during character shape feature extraction; and a step of extracting character shape features from the preprocessed data using the acquired character shape feature extraction window.

Here, the step of preprocessing the data to be processed includes: a step of removing noise from the data to be processed; a step of dividing the noise-free data into one or more sentences and dividing each sentence into one or more words; and a step of assigning IDs to the words.

The step of assigning IDs to the plurality of words includes: a step of selecting V different words from the plurality of words to configure a model library, where V is a natural number; a step of assigning a first ID to a first target word, among the plurality of words, that is in the model library, with different first target words receiving different first IDs; and a step of assigning a second ID to a second target word that is not in the model library, where the second ID differs from the first ID and all second target words share the same second ID.

The step of acquiring the character shape feature extraction window whose size is preset and held constant during character shape feature extraction includes: a step of extracting P characters from the beginning of a preset word as prefix information and S characters from the end of the preset word as suffix information, where P and S are natural numbers; and a step of configuring the character shape feature extraction window using the prefix information and the suffix information.

The step of extracting character shape features from the preprocessed data using the acquired character shape feature extraction window includes: a step of selecting C different characters as known characters from a character set consisting of the uppercase and lowercase letters of the alphabet and assigning an N-dimensional display vector to each of the C characters, where N is a natural number; a step of, for a third target word among the plurality of words, assigning a display vector to a first target character of the third target word that is not among the known characters, where this display vector differs from the display vectors assigned to the C characters; a step of acquiring the display vectors of the first P characters and of the last S characters and forming a first vector that is a (P+S)*N-dimensional vector; a step of acquiring a weighting matrix M having (P+S)*N rows and F columns, where F is a natural number; and a step of multiplying the first vector by the weighting matrix M to obtain an F-dimensional character shape feature.

The method further includes: a step of merging the character shape feature with the acquired word vector and using the merged vector as the input of a learning model; and a step of training the learning model and updating the weighting matrix M and/or the weights of the display vectors.

It will be readily understood that the method and apparatus disclosed in the embodiments of the present invention can also be implemented in other forms. For example, the apparatus described above is merely schematic. The division into units is only one example of a logical division of functions, and a different division may be adopted in an actual implementation. For example, multiple units or modules may be combined or integrated into another system, or some functions may be omitted or not executed. The mutual couplings, direct couplings, or communication connections shown or discussed may be made through interfaces, and indirect couplings or communication connections between devices or units may be electrical, mechanical, or of another form.

The functional units according to the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more of them may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a functional unit combining hardware and software.

An integrated unit implemented as a software functional unit may be stored in a computer-readable storage medium. The steps of the method according to the embodiments of the present invention are carried out by causing a computer (for example, a PC, a server, or a network device) to execute the software instructions stored in the storage medium. The storage medium may be any medium capable of storing program code, such as a USB flash drive, a hard disk, a ROM (Read Only Memory), a RAM (Random Access Memory), a CD, or a DVD.

Finally, a person of ordinary skill in the art may make further improvements and modifications to the preferred embodiments of the present invention described above without departing from the spirit of the present invention. Such improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims

A method for extracting character shape features, in which a processor performs:
a step of preprocessing data to be processed;
a step of acquiring a character shape feature extraction window having a predetermined size; and
a step of extracting character shape features from the preprocessed data using the character shape feature extraction window,
wherein the size of the character shape feature extraction window is held constant during character shape feature extraction,
the step of preprocessing the data to be processed includes:
a step of removing noise from the data to be processed;
a step of dividing the noise-free data into one or more sentences and dividing each sentence into one or more words; and
a step of assigning IDs to the plurality of words, and
the step of acquiring the character shape feature extraction window having the predetermined size includes:
a step of extracting P characters from the beginning of a preset word as prefix information and extracting S characters from the end of the preset word as suffix information, where P and S are natural numbers; and
a step of configuring the character shape feature extraction window using the prefix information and the suffix information.

The method for extracting character shape features according to claim 1, wherein the step of assigning IDs to the plurality of words includes:
a step of selecting V different words from the plurality of words and configuring a model library using the V different words, where V is a natural number;
a step of assigning a first ID to a first target word, among the plurality of words, that is in the model library, wherein different first target words have different first IDs; and
a step of assigning a second ID different from the first ID to a second target word that is not in the model library, wherein all second target words have the same second ID.

The method for extracting character shape features according to claim 1, wherein the step of extracting character shape features from the preprocessed data using the character shape feature extraction window includes:
a step of selecting C different characters as known characters from a character set consisting of the uppercase and lowercase letters of the alphabet and assigning an N-dimensional display vector to each of the C different characters, where N is a natural number;
a step of assigning, in a third target word among the plurality of words, a display vector to a first target character that is not among the known characters, wherein the display vector assigned to the first target character differs from the display vectors assigned to the C characters;
a step of acquiring the display vectors of the first P characters and of the last S characters and forming a first vector that is a (P+S)*N-dimensional vector;
a step of acquiring a weighting matrix M having (P+S)*N rows and F columns, where F is a natural number; and
a step of multiplying the first vector by the weighting matrix M to obtain an F-dimensional character shape feature.

The method for extracting character shape features according to claim 3, further comprising:
a step of merging the character shape feature with an acquired word vector feature and using the merged vector as an input of a learning model; and
a step of training the learning model and updating the weighting matrix M and/or the weights of the display vectors.

A character shape feature extraction apparatus comprising:
a preprocessing module that preprocesses data to be processed;
an acquisition module that acquires a character shape feature extraction window having a predetermined size; and
an extraction module that extracts character shape features from the preprocessed data using the character shape feature extraction window,
wherein the size of the character shape feature extraction window is held constant during character shape feature extraction,
the preprocessing module
removes noise from the data to be processed,
divides the noise-free data into one or more sentences and divides each sentence into one or more words, and
assigns IDs to the plurality of words,
the acquisition module
extracts P characters from the beginning of a preset word as prefix information and extracts S characters from the end of the preset word as suffix information, and
configures the character shape feature extraction window using the prefix information and the suffix information, and
P and S are natural numbers.

The character shape feature extraction apparatus according to claim 5, wherein, when assigning IDs to the plurality of words, the preprocessing module
selects V different words from the plurality of words and configures a model library using the V different words,
assigns a first ID to a first target word, among the plurality of words, that is in the model library, and
assigns a second ID different from the first ID to a second target word that is not in the model library,
V is a natural number,
different first target words have different first IDs, and
all second target words have the same second ID.

The character shape feature extraction apparatus according to claim 5, wherein the extraction module
selects C different characters as known characters from a character set consisting of the uppercase and lowercase letters of the alphabet and assigns an N-dimensional display vector to each of the C different characters,
assigns a display vector to a first target character that is not among the known characters in a third target word among the plurality of words,
acquires the display vectors of the first P characters and of the last S characters and forms a first vector that is a (P+S)*N-dimensional vector,
acquires a weighting matrix M having (P+S)*N rows and F columns, and
multiplies the first vector by the weighting matrix M to obtain an F-dimensional character shape feature,
N is a natural number,
the display vector assigned to the first target character differs from the display vectors assigned to the C characters, and
F is a natural number.

The character shape feature extraction apparatus according to claim 7, wherein the extraction module acquires word vector features, the weighting matrix M, and the display vectors, and
the character shape feature extraction apparatus further comprises:
a merging module that merges the character shape feature with the acquired word vector feature and uses the merged vector as an input of a learning model; and
a training module that trains the learning model and updates the weighting matrix M and/or the weights of the display vectors.

An electronic device comprising a processor and a memory in which computer program instructions are stored, wherein the computer program instructions, when executed by the processor, cause the processor to perform:
a step of preprocessing the processed data;
a step of obtaining a character shape feature extraction window having a predetermined size; and
a step of extracting the shape feature of a character from the preprocessed processed data using the character shape feature extraction window,
wherein the size of the character shape feature extraction window is held constant during character shape feature extraction,
the step of preprocessing the processed data includes:
removing noise from the processed data;
dividing the processed data from which noise has been removed into one or more sentences, and dividing the divided sentences into one or more words; and
assigning IDs to the plurality of words, and
the step of obtaining the character shape feature extraction window having the predetermined size includes:
extracting P characters from the beginning of a preset word as prefix information and extracting S characters from the end of the preset word as suffix information, P and S being natural numbers; and
configuring the character shape feature extraction window using the prefix information and the suffix information.
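
The preprocessing and window steps can be illustrated with a minimal Python sketch. The claim does not define what counts as noise or how sentences and words are delimited, so the regular expressions below (markup tags as noise, sentence breaks after ".", "!" or "?", words as runs of letters, digits, and apostrophes) are assumptions, as are the function names and the default P = S = 3.

```python
import re


def remove_noise(text):
    """Assumed noise model: strip markup-like tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split after sentence-ending punctuation followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def split_words(sentence):
    return re.findall(r"[A-Za-z0-9']+", sentence)


def preprocess(text):
    """Remove noise, split into sentences and words, and assign an ID to each word."""
    sentences = [split_words(s) for s in split_sentences(remove_noise(text))]
    ids = {}
    for words in sentences:
        for word in words:
            ids.setdefault(word, len(ids))  # first occurrence fixes the ID
    return sentences, ids


def extraction_window(word, p=3, s=3):
    """Return (prefix information, suffix information) for a fixed window size."""
    return word[:p], word[-s:]


sentences, ids = preprocess("<p>Hello  world.</p> Hello again!")
```

With p and s fixed, the window has the same size for every word, which corresponds to the constant window size recited above.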

A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, causes the processor to perform:
a step of preprocessing the processed data;
a step of obtaining a character shape feature extraction window having a predetermined size; and
a step of extracting the shape feature of a character from the preprocessed processed data using the character shape feature extraction window,
wherein the size of the character shape feature extraction window is held constant during character shape feature extraction,
the step of preprocessing the processed data includes:
removing noise from the processed data;
dividing the processed data from which noise has been removed into one or more sentences, and dividing the divided sentences into one or more words; and
assigning IDs to the plurality of words, and
the step of obtaining the character shape feature extraction window having the predetermined size includes:
extracting P characters from the beginning of a preset word as prefix information and extracting S characters from the end of the preset word as suffix information, P and S being natural numbers; and
configuring the character shape feature extraction window using the prefix information and the suffix information.