JP2018060474A

JP2018060474A - Place name extraction program, place name extraction device and place name extraction method

Info

Publication number: JP2018060474A
Application number: JP2016199447A
Authority: JP
Inventors: 美佐子宗; Misako So
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2016-10-07
Filing date: 2016-10-07
Publication date: 2018-04-12
Anticipated expiration: 2036-10-07
Also published as: JP6759955B2

Abstract

PROBLEM TO BE SOLVED: To extract an accurate place name character string from a character string that includes an incomplete place name notation.
SOLUTION: A computer is caused to execute a process of receiving a character string, referring to a storage unit that stores place name character strings, and outputting, as the place name included in the received character string, a place name character string that has a larger number of characters in common with the received character string and at least one character from whose end is contained in the received character string.
SELECTED DRAWING: Figure 2

Description

The present invention relates to a place name extraction program, a place name extraction device, and a place name extraction method.

There is a need to use, as text data, the address/location notation contained in images obtained with cameras or scanners. (In the following, the portion up to just before the chome address, that is, the block and lot number, is called the place name part, including cases where upper levels such as the prefecture are omitted, and the whole including the chome address is called the address/location notation.) For example, one conceivable application lets a user photograph, with a smartphone camera, the address/location notation in a magazine article about a facility; the notation is extracted from the article and registered and displayed at the corresponding position on an electronic map. Similarly, another conceivable application registers and displays, at the corresponding position on an electronic map, the address/location notation written on the signboard of a facility in town photographed with an in-vehicle camera.

The address/location notation contained in such images is often incomplete because upper levels such as the prefecture are omitted, extra character strings that are not part of the notation appear before or after it, or the notation varies. In addition, photographs may contain characters lost to shadows or blur, which can cause character recognition errors. Dirt on the subject can cause characters to be lost in the same way as a shadow.

FIG. 1(a) is an example in which upper levels such as the prefecture are omitted (here, the top two levels); this is common in magazines and on signboards because they serve a limited area. FIG. 1(b) is an example of extra character strings that are not part of the address/location notation appearing before and after it: part of the article text, a symbol indicating "address", a symbol indicating a parking lot, the number of parking spaces, and so on. FIG. 1(c) shows an example of notational variation, in which the number of characters increases or decreases because the phonetic "の" (no) or the character "字" (aza) is inserted or omitted. FIG. 1(d) shows an example in which characters are lost because of a shadow when the photograph was taken. FIG. 1(e) shows an example in which insufficient focus when the photograph was taken caused blur and a character was misrecognized ("桑" was misrecognized as "団").

Because of these factors, the recognized character string is incomplete as an address/location notation and must be corrected to the accurate address/location character string before it can be associated with map information or the like.

Meanwhile, a character recognition result correction method has been disclosed that corrects recognition errors in the character recognition results of addresses written on sales slips, delivery slips, and the like, and also fills in partially omitted address character strings (see, for example, Patent Document 1). However, that method focuses on delimiter characters such as "県" (prefecture), "市" (city), and "町" (town) and identifies the correct answer from the possible combinations among a predetermined number of candidates. It may therefore fail to correct the string when a delimiter character is missing or when character strings that are not part of the address appear before or after the address character string.

Patent Document 1: Japanese Patent Laid-Open No. 3-257693

As described above, with conventional techniques it is difficult to extract an accurate place name character string from a character string that includes an incomplete address/location notation, in particular an incomplete place name notation.

Therefore, in one aspect, an object of the present invention is to extract an accurate place name character string from a character string that includes an incomplete place name notation.

In one embodiment, a computer is caused to execute a process of receiving a character string, referring to a storage unit that stores place name character strings, and outputting, as the place name included in the received character string, a place name character string that has a larger number of characters in common with the received character string and at least one character from whose end is contained in the received character string.

An accurate place name character string can be extracted from a character string that includes an incomplete place name.

FIG. 1 is a diagram showing examples of incomplete place names.
FIG. 2 is a diagram showing an example of the functional configuration of the address/location notation extraction device.
FIG. 3 is a diagram showing an example of the hardware configuration of the address/location notation extraction device.
FIG. 4 is a flowchart showing a processing example of the embodiment.
FIG. 5 is a diagram showing a processing example of narrowing down place name candidates.
FIG. 6 is a diagram showing an example of the place name information.
FIG. 7 is a diagram showing an example of a formula for calculating the collation cost.
FIG. 8 is a diagram showing an example of place name delimiter character determination.
FIG. 9 is a diagram showing an example of replacing the place name part of a recognition result character string.
FIG. 10 is a diagram showing an example of chome address delimiter character detection.
FIG. 11 is a diagram showing an example of unnecessary character string deletion.

Preferred embodiments of the present invention are described below.

<Configuration>
FIG. 2 is a diagram showing an example of the functional configuration of the address/location notation extraction device (information processing device) 1. In FIG. 2, the address/location notation extraction device 1 includes a recognition result input unit 101, a place name candidate narrowing unit 102, a place name collation unit 103, a place name delimiter character determination unit 104, a place name determination unit 105, and a place name correction unit 106. The device also includes a chome address delimiter character detection unit 107, a chome address correction/determination unit 108, and an address/location notation output unit 109. In addition, as information referred to during processing, the device holds place name character information 111, place name information 112, and chome address character information 113.

The place name character information 111 associates the place name character strings (the character strings up to just before the chome address) that exist in the range covered by the address/location notation (for example, all of Japan) with the individual characters (heading characters) contained in those strings. Specifying a character identifies the one or more place name character strings that contain it. A specific example of the place name character information 111 is described later.
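The lookup described above behaves like an inverted index from a heading character to the place names that contain it. The following is a minimal sketch under that reading; the function name `build_place_name_char_index`, the use of list positions as place name numbers, and the three sample place names are assumptions made for illustration and are not taken from the figures.

```python
from collections import defaultdict


def build_place_name_char_index(place_names):
    """Map each heading character to the numbers of the place names containing it."""
    index = defaultdict(set)
    for number, name in enumerate(place_names):
        for ch in set(name):
            index[ch].add(number)
    return index


# Illustrative sample data only; a real table would cover the whole target range.
place_names = [
    "鹿児島県曽於郡大崎町菱田",
    "東京都品川区大崎",
    "宮城県大崎市古川",
]
index = build_place_name_char_index(place_names)

# Specifying one character yields every place name that contains it.
print(sorted(index["大"]))
print(sorted(index["菱"]))
```

Storing place name numbers rather than the strings themselves keeps each entry small, which matches the description of the character information pointing into a separate place name table.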

The place name information 112 is a collection of the place name character strings that exist in the target range. A specific example of the place name information 112 is described later.

The chome address character information 113 associates each character that may be used in a chome address (chome or lot number) with the characters that are easily misrecognized as (confused with) that character, and with whether that character can appear at the end of a chome address. A specific example of the chome address character information 113 is described later.

The recognition result input unit 101 has a function of inputting (receiving) the text data of a character string (recognition result character string) that is a character recognition result containing an address/location notation. For example, a user photographs the part of a magazine, pamphlet, or the like that contains an address/location notation with a smartphone camera, and the result of character recognition on the photographed image is input as the recognition result character string.

The place name candidate narrowing unit 102 has a function of referring to the place name character information 111 and narrowing down, for the recognition result character string input by the recognition result input unit 101, the candidate place name character strings used in subsequent processing. Details of the processing are described later.

The place name collation unit 103 has a function of collating the candidate place name character strings narrowed down by the place name candidate narrowing unit 102 against the recognition result character string input by the recognition result input unit 101 and calculating a collation score or a collation cost. The collation score indicates how many characters, taking character order into account, the recognition result character string and the candidate place name character string have in common. The collation cost indicates the amount of effort, in character insertions, deletions, substitutions, and the like, needed to make the recognition result character string and the candidate place name character string match. A specific example of calculating the collation score or collation cost is described later.

The place name delimiter character determination unit 104 has a function of determining, for a predetermined number of top candidates taken in descending order of collation score or ascending order of collation cost, whether the place name delimiter characters of each candidate place name character string are contained in the recognition result character string. The place name delimiter characters are one or more characters that include the last character of the place name part (the character immediately before the characters indicating the chome address begin), and the unit determines whether any of them is contained in the recognition result character string. Because the end of the place name part is unlikely to be omitted, a match on characters near the end is used to identify the corresponding place name. More than just the last character is used so as to handle cases where the corresponding character in the recognition result character string is missing or misrecognized.

The place name determination unit 105 has a function of determining the place name included in the recognition result character string, choosing from the place name character strings whose place name delimiter characters are contained in the recognition result character string and giving priority to those with a higher collation score (or, when the collation cost is used, a lower cost).

The place name correction unit 106 has a function of identifying the end of the place name character string within the recognition result character string and correcting the recognition result character string by replacing the span from the beginning of the recognition result character string to that end with the place name character string determined by the place name determination unit 105.

The chome address delimiter character detection unit 107 has a function of treating the portion of the corrected recognition result character string after the end of the place name part as a chome address part followed by an unnecessary character string part, and using the chome address character information 113 to detect the chome address delimiter character that corresponds to the boundary between the two.

The chome address correction/determination unit 108 has a function of identifying the chome address part from the chome address delimiter character detected by the chome address delimiter character detection unit 107 and deleting the unnecessary character string part that follows the chome address part from the recognition result character string.

The address/location notation output unit 109 has a function of outputting the finally obtained, corrected recognition result character string as the address/location character string.

FIG. 3 is a diagram showing an example of the hardware configuration of the address/location notation extraction device 1. In FIG. 3, the device includes a CPU (Central Processing Unit) 1002, a ROM (Read Only Memory) 1003, a RAM (Random Access Memory) 1004, and an NVRAM (Non-Volatile Random Access Memory) 1005, all connected to a system bus 1001. The device also includes an I/F (Interface) 1006 and, connected to the I/F 1006, an I/O (Input/Output Device) 1007, an HDD (Hard Disk Drive)/SSD (Solid State Drive) 1008, and an NIC (Network Interface Card) 1009. A monitor 1010, a keyboard 1011, a mouse 1012, and the like are connected to the I/O 1007. A CD/DVD (Compact Disk/Digital Versatile Disk) drive or the like can also be connected to the I/O 1007.

The functions of the address/location notation extraction device 1 described with reference to FIG. 2 are realized by the CPU 1002 executing a predetermined program. The program may be obtained via a recording medium, obtained via a network, or embedded in the ROM. Information referred to or updated during processing is stored temporarily in the RAM 1004 and persistently in the HDD/SSD 1008 or the NVRAM 1005.

<Operation>
FIG. 4 is a flowchart showing a processing example of the above embodiment. In FIG. 4, when the address/location notation extraction device 1 starts processing, the recognition result input unit 101 inputs (receives) the text data of a character string (recognition result character string) that is a character recognition result containing an address/location notation (step S101).

Next, the place name candidate narrowing unit 102 refers to the place name character information 111 and narrows down, for the recognition result character string input by the recognition result input unit 101, the candidate place name character strings used in subsequent processing (step S102).

FIG. 5 is a diagram showing an example of the place name candidate narrowing process performed by the place name candidate narrowing unit 102. Suppose the recognition result character string shown on the right side of FIG. 5(a) has been input. For each character in the recognition result character string, the unit checks whether the character exists among the heading characters of the place name character information 111. If it does, one vote is cast for each place name character string associated with that heading character. The illustrated example shows votes being cast for "大" of "大崎" (Osaki) and "菱" of "菱田" (Hishida). The vote counts are stored temporarily in association with each place name character string of the place name character information 111. FIG. 5(b) shows an example of the data structure of the place name character information 111, which associates a serial number, the character code of a heading character, the number of place name character strings associated with that heading character, and the place name numbers of those strings (corresponding to the place names in the place name information 112). The vote counts are stored, for example, in association with the place name numbers of the place name character information 111.

As shown in FIG. 5(c), the voting results are sorted in descending order of vote count, and place name character strings at or below a predetermined threshold are cut off, narrowing the candidates to the place name character strings with the most votes. For example, with a vote threshold of 2, cutting off strings with 2 or fewer votes can reduce the number of place name candidates from about 120,000 to N = O(1000) to O(10), where O() denotes the order of magnitude.
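The voting and cut-off step can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of the embodiment: the names `build_char_index` and `narrow_candidates`, the sample place names, and the sample recognition result string are all hypothetical, and place name numbers are simply list positions.

```python
from collections import Counter, defaultdict


def build_char_index(place_names):
    """Map each heading character to the numbers of the place names containing it."""
    index = defaultdict(set)
    for number, name in enumerate(place_names):
        for ch in name:
            index[ch].add(number)
    return index


def narrow_candidates(recognized, char_index, threshold):
    """Cast one vote per recognized character, then drop counts at or below threshold."""
    votes = Counter()
    for ch in recognized:
        for number in char_index.get(ch, ()):
            votes[number] += 1
    # most_common() sorts by descending vote count.
    return [(number, count) for number, count in votes.most_common()
            if count > threshold]


# Illustrative sample data only.
place_names = [
    "鹿児島県曽於郡大崎町菱田",
    "東京都品川区大崎",
    "宮城県大崎市古川",
]
recognized = "大崎町菱田32＠P10台"  # made-up OCR result with trailing noise
char_index = build_char_index(place_names)
candidates = narrow_candidates(recognized, char_index, threshold=2)
print(candidates)  # [(0, 5)]
```

In this sample the two other place names receive only the two votes for "大" and "崎", so the threshold of 2 removes them before the more expensive collation step.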

Returning to FIG. 4, the place name collation unit 103 obtains from the place name information 112 the candidate place name character strings narrowed down by the place name candidate narrowing unit 102, collates them against the recognition result character string input by the recognition result input unit 101, and calculates a collation score or a collation cost (step S103).

FIG. 6 is a diagram showing an example of the place name information 112, which associates a serial number, a prefecture number, a character count, and a place name character string. For example, the place name collation unit 103 receives the serial numbers of the narrowed-down place name candidates from the place name candidate narrowing unit 102 and, by specifying those serial numbers, can obtain the place name character strings from the place name information 112.

FIG. 7 is a diagram showing an example of a formula for calculating the collation cost. Collation uses a method that can align strings even when characters have been inserted, deleted, or substituted, for example DP matching (dynamic programming), and the edit distance L obtained in that process is used. The edit distance L is a quantity representing the degree of difference between two character strings and corresponds to the minimum number of character insertions, deletions, and substitutions needed to convert one string into the other. In the illustrated formula, C is the collation cost, n_1 is the number of characters in the place name character string of interest in the place name information 112 (character string #1), n_2 is the number of characters in the input character string (the recognition result character string; character string #2), and k is the number of characters that match between character string #1 and character string #2. So that the collation cost C does not depend on string length, the edit distance L is normalized by the character counts n_1 and n_2 of the two strings. In addition, for the same edit distance L, a larger proportion of matching characters yields a smaller collation cost C. The illustrated formula is only an example, and various designs are possible. The collation score is a value that trends in the opposite direction from the collation cost and depends on, for example, the proportion of matching characters and the proportion of agreement in character order.
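As an illustration, the edit distance L can be computed with the standard dynamic-programming recurrence shown below. The formula of FIG. 7 is not reproduced in this text, so `collation_cost` is a hypothetical stand-in that only has the two stated properties (normalization by n_1 and n_2, and a lower cost for a larger share of matching characters). Counting k as the multiset overlap of characters is likewise an assumption.

```python
from collections import Counter


def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                    # delete ch_a
                current[j - 1] + 1,                 # insert ch_b
                previous[j - 1] + (ch_a != ch_b),   # substitute or keep
            ))
        previous = current
    return previous[-1]


def collation_cost(place_name, recognized):
    """Hypothetical cost: NOT the FIG. 7 formula, only a stand-in with its properties."""
    n1, n2 = len(place_name), len(recognized)
    if n1 + n2 == 0:
        return 0.0
    distance = edit_distance(place_name, recognized)
    # Assumption: k is the number of characters shared by the two strings.
    k = sum((Counter(place_name) & Counter(recognized)).values())
    return distance / (n1 + n2) * (1 - k / (n1 + n2))


print(edit_distance("kitten", "sitting"))  # 3
print(collation_cost("鹿児島県曽於郡大崎町菱田", "大崎町菱田"))
```

Keeping only two rows of the DP table is enough here because only the final distance is needed, not the alignment itself.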

Returning to FIG. 4, the place name delimiter character determination unit 104 sorts the place name candidates in descending order of collation score or ascending order of collation cost (step S104). The unit then selects the top M place name candidates (step S105) and checks whether the place name delimiter characters of the i-th place name candidate appear in the recognition result character string (step S106). If they do not (No in step S107), it checks the next place name candidate.

FIG. 8(a) shows an example in which the place name candidates have been sorted in ascending order of collation cost. The place name character string ranked 1, "鹿児島県曽於郡大崎町菱田", has as its place name delimiter characters its last two characters, "菱" and "田". If the recognition result character string is as shown in FIG. 8(b), the place name delimiter character "田" of the rank-1 place name character string is determined to be present ("菱" also matches, but the character closer to the end takes priority).

Returning to FIG. 4, when a place name delimiter character of a place name candidate is determined to be present in the recognition result character string (Yes in step S107), the place name determination unit 105 determines, as the place name, the place name character string whose place name delimiter character was found in the recognition result character string (step S108).
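Steps S105 to S108 can be sketched as a loop over the ranked candidates. The function names, the default of two delimiter characters, the choice of the first occurrence in the recognition result string, and the sample strings are assumptions for illustration; the text only states that characters closer to the end of the place name take priority.

```python
def find_delimiter_position(recognized, place_name, num_delimiters=2):
    """Return the index in recognized of a delimiter character, or None if absent.

    Delimiter characters are the last num_delimiters characters of place_name,
    tried from the end so that the character closest to the end takes priority.
    """
    for ch in reversed(place_name[-num_delimiters:]):
        position = recognized.find(ch)  # assumption: use the first occurrence
        if position != -1:
            return position
    return None


def decide_place_name(recognized, ranked_candidates, top_m=5, num_delimiters=2):
    """Return (place_name, delimiter_position) for the best-ranked match, else None."""
    for name in ranked_candidates[:top_m]:
        position = find_delimiter_position(recognized, name, num_delimiters)
        if position is not None:
            return name, position
    return None


# Candidates are assumed to be already sorted by collation cost (best first).
ranked = ["鹿児島県曽於郡大崎町菱田", "東京都品川区大崎"]
print(decide_place_name("大崎町菱田32＠P10台", ranked))
print(decide_place_name("大崎町菱32", ranked))  # "田" lost, falls back to "菱"
```

Returning the matched position together with the place name lets the later replacement step reuse it instead of searching the string again.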

Next, the place name correction unit 106 identifies the end of the place name character string within the recognition result character string and corrects the recognition result character string by replacing the span from the beginning of the recognition result character string to that end with the place name character string determined by the place name determination unit 105 (step S109).

FIG. 9 is a diagram showing an example of replacing the place name part of the recognition result character string. As shown in FIG. 9(a), the span from the beginning of the recognition result character string to the character "田" that matched the place name delimiter character is the replacement target, and this span is replaced with the determined place name character string. FIG. 9(b) shows the recognition result character string after replacement.
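The replacement amounts to discarding everything up to and including the matched delimiter character and prepending the determined place name. A minimal sketch follows, with a hypothetical function name and made-up sample strings; `delimiter_pos` is assumed to be the index of the matched delimiter character in the recognition result string.

```python
def replace_place_name_part(recognized, delimiter_pos, place_name):
    """Replace recognized[0..delimiter_pos] with the determined place name."""
    return place_name + recognized[delimiter_pos + 1:]


place_name = "鹿児島県曽於郡大崎町菱田"

# Upper levels omitted: "田" is at index 4 of the recognition result.
print(replace_place_name_part("大崎町菱田32＠P10台", 4, place_name))

# Extra leading text is dropped as well: "田" is at index 7 here.
print(replace_place_name_part("住所：大崎町菱田32", 7, place_name))
```

Because the whole prefix is replaced, omitted upper levels, leading noise, and misrecognized characters inside the place name part are all corrected in one operation.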

Returning to FIG. 4, the chome address delimiter character detection unit 107 treats the portion of the corrected recognition result character string after the end of the place name part as a chome address part followed by an unnecessary character string part, and uses the chome address character information 113 to detect the chome address delimiter character that corresponds to the boundary between the two (step S110).

FIG. 10(a) shows an example of the chome address character information 113, which associates each character that may be used in a chome address with the confusion characters that are easily misrecognized as (confused with) it and with whether it can appear at the end of a chome address. A character that is not itself a character that may be used in a chome address but is a confusion character is treated in the same way as a character that may be used in a chome address. When a character is a confusion character, the corresponding character in the recognition result character string is replaced with the character that may be used in a chome address.

If the recognition result character string is as shown in FIG. 10(b), then among the characters following the end of the place name part, "３" and "２" are registered in the chome address character information 113 and are judged to be valid (OK) chome address characters. The "＠" that follows, however, is registered in the chome address character information 113 neither as a character nor as a confusion character, so it is judged to be the start of the unnecessary character string part, and the "２" immediately before it is taken as the chome address delimiter character.
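The scan for the chome address delimiter character can be sketched as below. The table contents are hypothetical and do not reproduce FIG. 10(a); ASCII digits are used for simplicity. How the "can appear at the end" flag is applied is not spelled out in the text, so trimming trailing characters that cannot end a chome address is an assumption of this sketch.

```python
# Hypothetical table: character -> (confusion characters, may end a chome address)
CHOME_ADDRESS_CHARS = {str(d): ("", True) for d in range(10)}
CHOME_ADDRESS_CHARS.update({
    "0": ("O", True),
    "1": ("lI", True),
    "-": ("ー−", False),
    "丁": ("", False),
    "目": ("", True),
    "番": ("", False),
    "地": ("", True),
    "号": ("", True),
})


def extract_chome_address(tail, table=CHOME_ADDRESS_CHARS):
    """Return the chome address part at the start of tail, dropping what follows."""
    # Map every registered or confusion character to its registered character.
    canonical = {ch: ch for ch in table}
    for ch, (confusions, _) in table.items():
        for confusion in confusions:
            canonical.setdefault(confusion, ch)

    accepted = []
    for ch in tail:
        if ch not in canonical:
            break  # first unregistered character starts the unnecessary part
        accepted.append(canonical[ch])  # confusion characters are replaced

    # Assumption: back up past characters that cannot end a chome address.
    while accepted and not table[accepted[-1]][1]:
        accepted.pop()
    return "".join(accepted)


print(extract_chome_address("32＠P10台"))      # 32
print(extract_chome_address("3-2l番地 TEL"))   # 3-21番地
```

The function takes only the text after the place name part, so it can be applied directly to the remainder left over from the place name replacement step.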

Returning to FIG. 4, the chome address correction/determination unit 108 identifies the chome address part from the chome address delimiter character detected by the chome address delimiter character detection unit 107 and deletes the unnecessary character string part that follows the chome address part from the recognition result character string (step S111). FIG. 11(a) shows the recognition result character string before the unnecessary character string is deleted, and FIG. 11(b) shows it after the deletion.

Returning to FIG. 4, the address/location notation output unit 109 outputs the finally obtained, corrected recognition result character string as the address/location character string (step S112), and the processing ends.

<Summary>
As described above, according to the present embodiment, an accurate place name character string can be extracted from a character string that includes an incomplete place name. An accurate character string can also be extracted for the address/location notation as a whole.

The present invention has been described above by way of a preferred embodiment. Although specific examples have been shown and described, it is apparent that various modifications and changes can be made to these examples without departing from the broad spirit and scope defined in the claims. That is, the invention should not be construed as being limited by the details of the specific examples or by the accompanying drawings.

With regard to the above description, the following appendices are further disclosed.
(Appendix 1)
A place name extraction program that causes a computer to execute a process comprising:
receiving a character string; and
referring to a storage unit that stores place name character strings, and outputting, as the place name included in the character string, a place name character string that has a larger number of characters in common with the character string and for which at least one character from the end of the place name character string is included in the character string.
(Appendix 2)
The place name extraction program according to Appendix 1, wherein the character string of the place name included in the character string is replaced with the place name character string to be output.
(Appendix 3)
The place name extraction program according to Appendix 1 or 2, wherein, in the character string that follows the place name included in the character string, the characters from the beginning of that following character string up to the character immediately before a character other than the characters registered as a chome or address are identified as the characters indicating the chome or address.
(Appendix 4)
The place name extraction program according to Appendix 3, wherein the characters registered as a chome or address include characters that are easily confused with the characters registered as a chome or address, and an easily confused character is replaced with the corresponding character registered as a chome or address.
(Appendix 5)
The place name extraction program according to Appendix 3 or 4, wherein the character other than the characters registered as the chome or address, and the characters following it, are deleted.
(Appendix 6)
The place name extraction program according to any one of Appendices 1 to 5, wherein, immediately after the character string is received, a storage unit that stores associations between place name character strings and the characters included in them is referred to, and a vote is cast for each place name character string that includes a character matching a character included in the character string, and
the candidates for the place name character string used in subsequent processing are narrowed down to a predetermined number of place name character strings having the most votes.
(Appendix 7)
A place name extraction device comprising:
a reception unit that receives a character string; and
an output unit that refers to a storage unit that stores place name character strings, and outputs, as the place name included in the character string, a place name character string that has a larger number of characters in common with the character string and for which at least one character from the end of the place name character string is included in the character string.
(Appendix 8)
The place name extraction device according to Appendix 7, wherein the character string of the place name included in the character string is replaced with the place name character string to be output.
(Appendix 9)
The place name extraction device according to Appendix 7 or 8, wherein, in the character string that follows the place name included in the character string, the characters from the beginning of that following character string up to the character immediately before a character other than the characters registered as a chome or address are identified as the characters indicating the chome or address.
(Appendix 10)
The place name extraction device according to Appendix 9, wherein the characters registered as a chome or address include characters that are easily confused with the characters registered as a chome or address, and an easily confused character is replaced with the corresponding character registered as a chome or address.
(Appendix 11)
The place name extraction device according to Appendix 9 or 10, wherein the character other than the characters registered as the chome or address, and the characters following it, are deleted.
(Appendix 12)
The place name extraction device according to any one of Appendices 7 to 11, wherein, immediately after the character string is received, a storage unit that stores associations between place name character strings and the characters included in them is referred to, and a vote is cast for each place name character string that includes a character matching a character included in the character string, and
the candidates for the place name character string used in subsequent processing are narrowed down to a predetermined number of place name character strings having the most votes.
(Appendix 13)
A place name extraction method in which a computer executes a process comprising:
receiving a character string; and
referring to a storage unit that stores place name character strings, and outputting, as the place name included in the character string, a place name character string that has a larger number of characters in common with the character string and for which at least one character from the end of the place name character string is included in the character string.
(Appendix 14)
The place name extraction method according to Appendix 13, wherein the character string of the place name included in the character string is replaced with the place name character string to be output.
(Appendix 15)
The place name extraction method according to Appendix 13 or 14, wherein, in the character string that follows the place name included in the character string, the characters from the beginning of that following character string up to the character immediately before a character other than the characters registered as a chome or address are identified as the characters indicating the chome or address.
(Appendix 16)
The place name extraction method according to Appendix 15, wherein the characters registered as a chome or address include characters that are easily confused with the characters registered as a chome or address, and an easily confused character is replaced with the corresponding character registered as a chome or address.
(Appendix 17)
The place name extraction method according to Appendix 15 or 16, wherein the character other than the characters registered as the chome or address, and the characters following it, are deleted.
(Appendix 18)
The place name extraction method according to any one of Appendices 13 to 17, wherein, immediately after the character string is received, a storage unit that stores associations between place name character strings and the characters included in them is referred to, and a vote is cast for each place name character string that includes a character matching a character included in the character string, and
the candidates for the place name character string used in subsequent processing are narrowed down to a predetermined number of place name character strings having the most votes.
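As an illustration only, the candidate narrowing by voting and the selection of the place name recited in the appendices above might be sketched as follows. The function names, the sample place names, and two simplifications are assumptions of this sketch: the number of common characters is counted over distinct characters, and the condition on the end of the place name character string is checked against its last character only.

```python
from collections import Counter, defaultdict


def build_char_index(place_names):
    """Map each character to the place names that contain it (inverted index)."""
    index = defaultdict(list)
    for name in place_names:
        for ch in dict.fromkeys(name):  # distinct characters, in order
            index[ch].append(name)
    return index


def vote_candidates(text, index, top_n):
    """Each distinct character of text casts one vote for every place name
    that contains it; the top_n place names by vote count are returned."""
    votes = Counter()
    for ch in dict.fromkeys(text):
        for name in index.get(ch, ()):
            votes[name] += 1
    return [name for name, _ in votes.most_common(top_n)]


def select_place_name(text, candidates):
    """Return the candidate sharing the most distinct characters with text,
    considering only candidates whose last character appears in text."""
    text_chars = set(text)
    best, best_score = None, 0
    for name in candidates:
        if name[-1] not in text_chars:
            continue  # end of the place name is not present in the input
        score = len(set(name) & text_chars)
        if score > best_score:
            best, best_score = name, score
    return best


if __name__ == "__main__":
    names = ["東京都港区", "東京都北区", "京都府京都市"]
    index = build_char_index(names)
    recognized = "東京郡港区３２"  # "都" misrecognized as "郡"
    candidates = vote_candidates(recognized, index, top_n=2)
    print(select_place_name(recognized, candidates))
```

Voting over an inverted index touches only the place names that share at least one character with the input, which is why it is a plausible first step before the more detailed comparison.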

The recognition result input unit 101 is an example of the reception unit. The address/location notation output unit 109 is an example of the output unit.

DESCRIPTION OF SYMBOLS
1 Address/location notation extraction device
101 Recognition result input unit
102 Place name candidate narrowing unit
103 Place name matching unit
104 Place name delimiter determination unit
105 Place name determination unit
106 Place name correction unit
107 Chome address delimiter detection unit
108 Chome address correction/determination unit
109 Address/location notation output unit
111 Place name character information
112 Place name information
113 Chome address character information

Claims

A place name extraction program that causes a computer to execute a process comprising:
receiving a character string; and
referring to a storage unit that stores place name character strings, and outputting, as the place name included in the character string, a place name character string that has a larger number of characters in common with the character string and for which at least one character from the end of the place name character string is included in the character string.

The place name extraction program according to claim 1, wherein the character string of the place name included in the character string is replaced with the place name character string to be output.

The place name extraction program according to claim 1 or 2, wherein, in the character string that follows the place name included in the character string, the characters from the beginning of that following character string up to the character immediately before a character other than the characters registered as a chome or address are identified as the characters indicating the chome or address.

The place name extraction program according to claim 3, wherein the characters registered as a chome or address include characters that are easily confused with the characters registered as a chome or address, and an easily confused character is replaced with the corresponding character registered as a chome or address.

The place name extraction program according to claim 3 or 4, wherein the character other than the characters registered as the chome or address, and the characters following it, are deleted.

The place name extraction program according to any one of claims 1 to 5, wherein, immediately after the character string is received, a storage unit that stores associations between place name character strings and the characters included in them is referred to, and a vote is cast for each place name character string that includes a character matching a character included in the character string, and
the candidates for the place name character string used in subsequent processing are narrowed down to a predetermined number of place name character strings having the most votes.

A place name extraction device comprising:
a reception unit that receives a character string; and
an output unit that refers to a storage unit that stores place name character strings, and outputs, as the place name included in the character string, a place name character string that has a larger number of characters in common with the character string and for which at least one character from the end of the place name character string is included in the character string.

A place name extraction method in which a computer executes a process comprising:
receiving a character string; and
referring to a storage unit that stores place name character strings, and outputting, as the place name included in the character string, a place name character string that has a larger number of characters in common with the character string and for which at least one character from the end of the place name character string is included in the character string.