JP5087326B2

JP5087326B2 - Furigana Collection and Utilization Device and Furigana Collection and Utilization Method

Info

Publication number: JP5087326B2
Application number: JP2007164241A
Authority: JP
Inventors: 秀樹本野
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2007-06-21
Filing date: 2007-06-21
Publication date: 2012-12-05
Anticipated expiration: 2027-06-21
Also published as: JP2009003717A

Description

The present invention relates to a furigana collection and utilization device and a furigana collection and utilization method.

Techniques for attaching furigana to kanji by using a dictionary have been proposed (for example, Patent Document 1). Patent Document 2, for example, discloses a technique that prepares kanji dictionaries for several kanji proficiency levels and, when translating another language into Japanese, converts the result into Japanese text suited to the reader's kanji proficiency level so that even young readers can read it easily. In both cases, however, the dictionary must be prepared in advance, and neither document discloses a technique for automatically collecting the data to be registered in the dictionary.
Japanese Patent Laid-Open No. H10-91627
Japanese Patent Laid-Open No. 2004-54784

Preparing a dictionary that covers the furigana for a large number of kanji and kanji compounds takes a great deal of effort. In addition, new readings of words come into use from time to time, and it has been difficult to gather them quickly. Meanwhile, many readings of words are in actual use on the Internet and elsewhere, and there is demand for a technique that collects highly accurate reading (furigana) information from among them.

Accordingly, an object of the present invention is to provide a method for automatically collecting furigana actually used on existing Web pages and reusing them according to the purpose of use.

The inventor has devised a device and a method that accumulate furigana collected from existing Web pages in a database, determine from their frequency of appearance which furigana are highly accurate and which are incorrect, and reuse them according to the purpose of use, and has thereby completed the present invention.
Specifically, the present invention provides the following.

(1) A furigana collection and utilization device comprising:
furigana data acquisition means for acquiring, from the content of a Web page, a combination of a phrase composed of kanji and its furigana;
furigana data recording means for recording the combination acquired by the furigana data acquisition means together with its number of appearances; and
furigana data extraction means for extracting the combination recorded by the furigana data recording means on the basis of the number of appearances.

According to this invention, the furigana collection and utilization device automatically collects combinations of kanji phrases and furigana from Web pages and reuses them according to their number of appearances. There is therefore no need to compile a dictionary or the like by hand, and furigana in wide general use can be reused.

(2) The furigana collection and utilization device according to (1), further comprising Web page acquisition means for acquiring the content of a plurality of Web pages through the Internet.

According to this invention, the furigana collection and utilization device acquires the content of a plurality of Web pages, so it automatically collects combinations of kanji phrases and furigana from many Web pages and reuses them according to their number of appearances. There is therefore no need to compile a dictionary or the like by hand, and furigana in wide general use can be reused.

(3) The furigana collection and utilization device according to (2), wherein the Web page acquisition means does not acquire the content of a Web page when the last update date and time of the Web page whose content is to be acquired matches the last update date and time of the Web page already acquired.

According to this invention, the furigana collection and utilization device does not acquire the content of the same Web page redundantly, so Web page content can be acquired efficiently.

(4) The furigana collection and utilization device according to any one of (1) to (3), wherein the furigana data acquisition means acquires the combination from ruby text displayed on the Web page.

According to this invention, the furigana collection and utilization device automatically acquires combinations of phrases and furigana from ruby text, so the combinations can be acquired easily by inspecting, for example, the ruby tags of a page description language such as HTML (Hypertext Markup Language).

(5) The furigana collection and utilization device according to any one of (1) to (4), wherein, when all the characters inside a pair of parentheses in a character string displayed on the Web page are hiragana and the character immediately before the parentheses is a kanji, the furigana data acquisition means acquires the combination by treating the characters inside the parentheses as the furigana of the phrase of one or more consecutive kanji immediately preceding the parentheses.

According to this invention, when all the characters inside a pair of parentheses in a character string displayed on the Web page are hiragana and the character immediately before the parentheses is a kanji, the furigana collection and utilization device treats the characters inside the parentheses as the furigana of the run of consecutive kanji immediately preceding the parentheses and acquires the combination of phrase and furigana. Furigana written in parentheses can therefore be acquired automatically.

(6) The furigana collection and utilization device according to any one of (1) to (5), wherein, when all the characters inside a pair of parentheses in a character string displayed on the Web page are katakana and the character immediately before the parentheses is a kanji, the furigana data acquisition means acquires the combination by treating the characters inside the parentheses as the furigana of the phrase of one or more consecutive kanji immediately preceding the parentheses.

According to this invention, when all the characters inside a pair of parentheses in a character string displayed on the Web page are katakana and the character immediately before the parentheses is a kanji, the furigana collection and utilization device treats the characters inside the parentheses as the furigana of the run of consecutive kanji immediately preceding the parentheses and acquires the combination of phrase and furigana. The same effect as in (5) can therefore be expected.

(7) The furigana collection and utilization device according to any one of (1) to (6), wherein the number of appearances is the number of Web pages from which the combination was acquired.

According to this invention, the furigana collection and utilization device does not collect the same combination of phrase and furigana more than once from the content of the same Web page, so even when an incorrect furigana is used several times on the same Web page, its influence can be reduced.

(8) The furigana collection and utilization device according to any one of (1) to (6), wherein the number of appearances is the number of times the combination actually appears in the Web pages acquired by the Web page acquisition means.

According to this invention, the furigana collection and utilization device counts the number of times a combination of phrase and furigana actually appears on Web pages, so when several people write on the same Web page, as on a blog, the collection can reflect the number of people using the reading.
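The difference between the two counting rules in (7) and (8) can be sketched as follows. This is only an illustration in Python, not the patented implementation: the function name `count_appearances`, the `per_page` flag, and the sample page data are assumptions introduced here.

```python
from collections import Counter


def count_appearances(pages, per_page=True):
    """Count (phrase, furigana) pairs across pages.

    pages: iterable of lists of (phrase, furigana) tuples, one list per Web page.
    per_page=True  -> rule (7): a pair counts once per page it appears on.
    per_page=False -> rule (8): every occurrence of a pair is counted.
    """
    counts = Counter()
    for pairs in pages:
        # A set collapses repeated pairs within one page; a list keeps them.
        counts.update(set(pairs) if per_page else pairs)
    return counts


# Illustrative input: the first page repeats one pair twice.
pages = [
    [("宥和", "ゆうわ"), ("宥和", "ゆうわ"), ("獣", "けだもの")],
    [("宥和", "ゆうわ")],
]
by_page = count_appearances(pages, per_page=True)
by_occurrence = count_appearances(pages, per_page=False)
```

With this input, rule (7) counts the repeated pair once for each of the two pages, while rule (8) counts all three occurrences.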

(9) The furigana collection and utilization device according to any one of (1) to (8), wherein the furigana data extraction means extracts, from the combinations recorded by the furigana data recording means, the furigana with the largest number of appearances among the furigana for the same phrase.

According to this invention, the furigana collection and utilization device can obtain, for a given phrase, the highly accurate furigana that is most widely used in practice.

(10) The furigana collection and utilization device according to any one of (1) to (8), wherein the furigana data extraction means extracts, from the combinations recorded by the furigana data recording means, the phrase with the largest number of appearances among the phrases for the same furigana.

According to this invention, the furigana collection and utilization device can obtain, for a given furigana, the phrase that is most widely used in practice.

(11) The furigana collection and utilization device according to any one of (1) to (8), wherein the furigana data extraction means extracts, as error data, those combinations recorded by the furigana data recording means whose number of appearances is less than a predetermined number.

According to this invention, the furigana collection and utilization device can obtain combinations of phrase and furigana whose number of appearances is less than a predetermined number, and can therefore present, for example, examples of incorrect furigana.

(12) A furigana collection and utilization method for collecting information on furigana by using a computer, the method comprising the steps of:
acquiring the content of a plurality of Web pages through the Internet;
acquiring, from the content of the Web pages, combinations of a phrase composed of kanji and its furigana;
recording the acquired combinations together with their number of appearances; and
extracting the recorded combinations according to the number of appearances.

According to this invention, executing the method on a computer can be expected to provide the same effect as in (2).

According to this invention, the furigana collection and utilization device automatically collects combinations of kanji phrases and furigana from Web pages and reuses them according to their number of appearances. There is therefore no need to compile a dictionary or the like by hand, and furigana in wide general use can be reused.
In addition, because the device can reuse combinations of phrase and furigana according to their number of appearances, furigana can be reused in a way suited to the purpose of use, taking into account factors such as the difficulty of the kanji and whether the reading is correct.

The best mode for carrying out the present invention is described below with reference to the drawings. This is only an example, and the technical scope of the present invention is not limited to it.
(First embodiment)

[Overall view]
FIG. 1 is an overall view showing the relationship between the furigana collection and utilization device 1 and the Web pages 20 to 24 from which the device 1 collects furigana data through the Internet 10. The furigana collection and utilization device 1 comprises: Web page acquisition means 3, which acquires Web pages in order to collect furigana data; furigana data acquisition means 4, which acquires furigana data from the Web pages acquired by the Web page acquisition means 3; furigana data recording means 5, which records the furigana data acquired by the furigana data acquisition means 4; furigana data extraction means 6, which extracts the furigana data recorded in the furigana data recording means 5 according to the purpose of use; and control means 2, which controls each of these means and performs communication control for accessing the Web pages 20 to 24 through the Internet 10.
Furigana data means a combination of a phrase and a furigana. The same phrase with a different furigana, or a different phrase with the same furigana, constitutes a separate item of furigana data.

[Hardware configuration of the furigana collection and utilization device 1]
FIG. 2 is a diagram showing the hardware configuration of the furigana collection and utilization device 1 according to the present embodiment.
The furigana collection and utilization device 1 includes a CPU (Central Processing Unit) 41 constituting a control device 40 (in a multiprocessor configuration, additional CPUs such as a CPU 42 may be added), a bus line 30, a communication I/F (interface) 43, a main memory 44, a BIOS (Basic Input Output System) 45, a display device 46, an I/O controller 47, and an input device 48 such as a keyboard and a mouse.

The communication I/F 43 is a network adapter through which the furigana collection and utilization device 1 accesses, via the Internet 10, the servers or the like (not shown) that hold the Web pages 20 to 24. The communication I/F 43 may include a modem, a cable modem, and an Ethernet (registered trademark) adapter.
The BIOS 45 stores a boot program that the CPU 41 executes when the furigana collection and utilization device 1 starts up, programs that depend on the hardware of the device 1, and the like.

The display device 46 displays screens such as the results of processing by the furigana collection and utilization device 1, and includes display devices such as a cathode-ray tube (CRT) display and a liquid crystal display (LCD).
A storage device 51, such as a hard disk 49 and a semiconductor memory 50, can be connected to the I/O controller 47.
The input device 48 accepts input from the administrator of the furigana collection and utilization device 1.
The hard disk 49 stores the various programs that make the computer function as the furigana collection and utilization device 1, the programs that execute the functions of the present invention, and the tables described later.

The above example mainly describes the hardware configuration of the furigana collection and utilization device 1, but the functions described above can also be realized by installing a program on a computer and operating that computer as the furigana collection and utilization device 1. Accordingly, the functions realized by the device 1 described as one embodiment of the present invention can also be realized by having the computer execute the above-described method, or by installing the above-described program on the computer and executing it.

In the present invention, a computer means an information processing device that includes a storage device, a control device, and the like. The furigana collection and utilization device 1 is an information processing device that includes the storage device 51, the control device 40, and the like, and this information processing device falls within the concept of a computer in the present invention. Of the elements shown in FIG. 1, the control means 2, the Web page acquisition means 3, the furigana data acquisition means 4, and the furigana data extraction means 6 correspond mainly to the control device 40, and the furigana data recording means 5 corresponds to the storage device 51.

[Tables]
FIG. 3 shows the URL update date/time table according to the present embodiment. The table holds the URL address 100 of each acquired Web page and the last update date and time 101 of that Web page. When a Web page is acquired, the last update date and time of the Web page file is obtained from the server or the like that holds the page and is recorded together with the URL address 100. Then, when the Web page at the same URL is accessed again, if the last update date and time recorded on the server holding the page matches the last update date and time 101 recorded in the URL update date/time table, the content of the page is known not to have changed since the previous acquisition, so acquisition of new furigana data can be skipped.

FIG. 4 shows the preliminary update table according to the present embodiment. It is a working table used so that, when the same combination of phrase and furigana appears several times within one Web page, the appearance is not counted more than once. How it is used is explained in the description of the processing flow. The table consists of a phrase 110 and a furigana 111.

FIG. 5 shows the phrase-furigana table according to the present embodiment. It records the combinations of phrase 120 and furigana 121 acquired from Web pages, together with the number of appearances 122. The number of appearances 122 makes it possible to judge which furigana are commonly used, which are incorrect, and so on.

FIGS. 6 and 7 show examples of data extracted from the phrase-furigana table (FIG. 5) on the basis of the number of appearances 122, according to the purpose of use. The specific method of use is described later.

FIG. 8 shows the most frequent furigana table according to the present embodiment. For the combinations of phrase 120 and furigana 121 registered in the phrase-furigana table (FIG. 5), it holds, for each phrase, the furigana with the largest number of appearances among the furigana for that phrase. The table consists of a phrase 130, a furigana 131, and a number of appearances 132.
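The extraction behind the table of FIG. 8 can be sketched as below, assuming the FIG. 5 table is held as a Python dict keyed by (phrase, furigana). The function name, the dict layout, and the sample counts are illustrative assumptions; the source does not say how ties are resolved, so this sketch simply keeps the first entry encountered.

```python
def most_frequent_furigana(table):
    """Return {phrase: (furigana, count)} with the highest count per phrase.

    table: dict mapping (phrase, furigana) -> number of appearances.
    On a tie, the entry seen first is kept (an assumption of this sketch).
    """
    best = {}
    for (phrase, furigana), count in table.items():
        if phrase not in best or count > best[phrase][1]:
            best[phrase] = (furigana, count)
    return best


# Counts below are invented for illustration only.
table = {
    ("明星", "みょうじょう"): 12,
    ("明星", "めいせい"): 5,
    ("宥和", "ゆうわ"): 3,
}
best = most_frequent_furigana(table)
```

Swapping the roles of phrase and furigana in the loop gives the corresponding extraction of the most frequent phrase for each furigana.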

FIG. 9 shows the most frequent phrase table according to the present embodiment. For the combinations of phrase 120 and furigana 121 registered in the phrase-furigana table (FIG. 5), it holds, for each furigana, the phrase with the largest number of appearances among the phrases for that furigana. The table consists of a phrase 140, a furigana 141, and a number of appearances 142.

FIG. 10 shows the incorrect phrase-furigana table according to the present embodiment. For the combinations of phrase 120 and furigana 121 registered in the phrase-furigana table (FIG. 5), it holds those whose number of appearances is at or below a fixed number. The table consists of a phrase 150, a furigana 151, and a number of appearances 152.
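A minimal sketch of this threshold extraction follows, again assuming a dict keyed by (phrase, furigana). The function name, the threshold value, and the sample counts are assumptions made for illustration. The comparison here is "at or below", following the description of FIG. 10; item (11) words the condition as "less than a predetermined number", so the operator would be changed accordingly under that reading.

```python
def rare_combinations(table, threshold):
    """Return the entries whose appearance count is at or below threshold.

    table: dict mapping (phrase, furigana) -> number of appearances.
    """
    return {pair: n for pair, n in table.items() if n <= threshold}


# Counts below are invented for illustration only.
table = {
    ("宥和", "ゆうわ"): 40,
    ("徒然日記", "のようなもの"): 1,
}
suspects = rare_combinations(table, threshold=1)
```

A low count only marks a combination as a candidate error; a rare but correct reading would be picked up by the same rule.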

[Processing flow]
The processing flow of the furigana collection and utilization device 1 is described with reference to FIG. 11. Unless otherwise noted, the following processing is performed by the control device 40 of the furigana collection and utilization device 1.
The present embodiment assumes a process known as crawling, in which the control device 40 accesses a plurality of predetermined URL addresses through the Internet 10 according to a predetermined rule and collects the content of the corresponding Web pages 20 to 24.
The control device 40 accesses the Web pages 20 to 24 through the Internet 10 on the basis of their URL addresses and obtains the last update date and time of each Web page file (S1000).

Next, the control device checks whether the URL address of the Web page is registered in the URL update date/time table (FIG. 3) (S1010). If it is not registered (S1010: No), the URL address 100 and the last update date and time 101 are registered in the URL update date/time table (S1020), the content of the Web page is acquired (S1050), and processing moves on to furigana acquisition (S1060 onward). If it is registered (S1010: Yes), the last update date and time obtained for the Web page file is compared with the last update date and time 101 registered in the URL update date/time table to check whether the two match (S1030). If they match (S1030: Yes), the content of the page has not changed since it was last acquired, so nothing further is done and processing of that page ends. If they do not match (S1030: No), the content has changed since the last acquisition, so the last update date and time 101 in the URL update date/time table is updated (S1040), the content of the Web page is acquired (S1050), and processing moves on to furigana acquisition (S1060 onward).
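The decision in steps S1010 to S1040 can be sketched as a single function. The sketch assumes the URL update date/time table of FIG. 3 is a Python dict from URL to a timestamp string; the function name, the timestamp format, and the example URL are hypothetical, and how the last update date and time is obtained from the server is left outside the sketch.

```python
def should_fetch(url, last_modified, url_table):
    """Decide whether the page content should be acquired (S1050).

    url_table: dict mapping URL -> last update date/time seen previously.
    The table is updated in place, mirroring S1020 and S1040.
    """
    if url not in url_table:             # S1010: No
        url_table[url] = last_modified   # S1020: register URL and timestamp
        return True
    if url_table[url] == last_modified:  # S1030: Yes, page unchanged
        return False
    url_table[url] = last_modified       # S1040: record the new timestamp
    return True


url_table = {}
first = should_fetch("http://example.com/a", "2007-06-01T00:00:00", url_table)
second = should_fetch("http://example.com/a", "2007-06-01T00:00:00", url_table)
third = should_fetch("http://example.com/a", "2007-06-10T09:30:00", url_table)
```

Here the first call registers the page and requests a fetch, the second skips it because the timestamp is unchanged, and the third requests a fetch again after the timestamp changes.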

Next, furigana data is acquired from the content of the Web page (S1060). Two methods are used. The first uses the ruby tags of the HTML that describes the Web page. In accordance with the HTML language specification, the phrase following <rb> and the furigana following <rt>, both enclosed between <ruby> and </ruby>, are acquired as a pair.

For example, from "<ruby><rb>七尾奈留<rt>ななをなる</ruby>", the pair "phrase: 七尾奈留, furigana: ななをなる" is acquired. From "<ruby><rb>獣</rb><rp>（</rp><rt>けだもの</rt><rp>）</rp></ruby>", the pair "phrase: 獣, furigana: けだもの" is acquired. For languages other than HTML that have a ruby function, the same processing can be performed by extracting the ruby text.
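One way to sketch the ruby-tag method is with the `html.parser` module from the Python standard library. The class and function names are assumptions introduced here, and the sketch handles only markup that uses an explicit <rb> element as in the examples above; ruby markup that places the base text directly inside <ruby> without <rb> is not covered.

```python
from html.parser import HTMLParser


class RubyExtractor(HTMLParser):
    """Collect (phrase, furigana) pairs from <ruby><rb>..<rt>..</ruby> markup."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._field = None  # "rb", "rt", or None while other text is read
        self._rb = []
        self._rt = []

    def handle_starttag(self, tag, attrs):
        if tag == "ruby":
            self._rb, self._rt, self._field = [], [], None
        elif tag in ("rb", "rt"):
            self._field = tag   # text that follows belongs to this element
        elif tag == "rp":
            self._field = None  # fallback parentheses are ignored

    def handle_endtag(self, tag):
        if tag in ("rb", "rt", "rp"):
            self._field = None
        elif tag == "ruby":
            phrase = "".join(self._rb).strip()
            furigana = "".join(self._rt).strip()
            if phrase and furigana:
                self.pairs.append((phrase, furigana))
            self._field = None

    def handle_data(self, data):
        if self._field == "rb":
            self._rb.append(data)
        elif self._field == "rt":
            self._rt.append(data)


def extract_ruby_pairs(html):
    parser = RubyExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Because the parser only tokenizes tags and does not enforce nesting, the unclosed <rb> and <rt> elements in the first example are handled the same way as the fully closed form in the second.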

The second method applies when, in the displayed text, all the characters inside a pair of parentheses are hiragana and the character immediately before the parentheses is a kanji. In that case the kanji and the hiragana inside the parentheses are acquired as the phrase and its furigana. If several kanji appear consecutively, they are taken together as one phrase. For example, from "宥和（ゆうわ）", the pair "phrase: 宥和, furigana: ゆうわ" is acquired. From "土方歳三（としぞう）", however, what is acquired is not "phrase: 歳三, furigana: としぞう" but "phrase: 土方歳三, furigana: としぞう", because the whole run of kanji is taken even though the reading covers only 歳三. Likewise, from "徒然日記（のようなもの）", the pair "phrase: 徒然日記, furigana: のようなもの" is acquired, although the parenthesized text is a comment rather than a reading. Incorrect furigana may thus be acquired, but this is handled by the processing described later. Whether a character is a kanji can be determined from its character code. The same applies when all the characters inside the parentheses are katakana rather than hiragana.
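The parenthesis rule can be sketched with a regular expression. The Unicode ranges used below for kanji, hiragana, and katakana are assumptions of this sketch (the source says only that the character code is inspected), as are the function name and the decision to accept both full-width and ASCII parentheses.

```python
import re

# Assumed ranges: CJK Unified Ideographs, hiragana, katakana plus the long-vowel mark.
PAREN_READING = re.compile(
    r"([\u4e00-\u9fff]+)"                       # one or more consecutive kanji
    r"[（(]"                                     # opening parenthesis
    r"([\u3041-\u3096]+|[\u30a1-\u30fa\u30fc]+)" # all hiragana, or all katakana
    r"[）)]"                                     # closing parenthesis
)


def extract_parenthesized(text):
    """Return (phrase, furigana) pairs found by the parenthesis rule."""
    return [(m.group(1), m.group(2)) for m in PAREN_READING.finditer(text)]


pairs = extract_parenthesized("宥和（ゆうわ）と土方歳三（としぞう）、徒然日記（のようなもの）")
```

As in the description, the pattern takes the whole run of kanji before the parenthesis and cannot tell a comment from a reading, so filtering is left to the later frequency-based steps.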

Each item of furigana data acquired in this way is checked against the preliminary update table (FIG. 4) (S1070). If it is not yet registered (S1070: No), it is registered in the preliminary update table (S1080); if it is already registered (S1070: Yes), it is not registered again. The control device then checks whether furigana data remains in the Web page at the same URL (S1090). If so (S1090: Yes), the next item is acquired in the same way (S1060), and steps S1060 to S1090 are repeated. When no furigana data remains in the Web page (S1090: No), the data is registered in the phrase-furigana table (FIG. 5) on the basis of the preliminary update table (FIG. 4) (S1100 to S1140).

Next, registration in the phrase-furigana table (FIG. 5) is described with reference to FIG. 12. First, a combination of phrase and furigana registered in the preliminary update table (FIG. 4) is read (S1100). The control device checks whether the combination is already registered in the phrase-furigana table (FIG. 5) (S1110). If it is (S1110: Yes), 1 is added to the number of appearances 122 (S1120). If it is not (S1110: No), the combination is registered in the phrase-furigana table and the number of appearances 122 is set to 1 (S1130). This processing is repeated for all furigana data registered in the preliminary update table (FIG. 4) (S1140).
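Steps S1060 to S1140 for a single page can be sketched as follows. The sketch assumes the FIG. 4 working table is a Python list and the FIG. 5 table is a dict from (phrase, furigana) to a count; the function name and sample data are illustrative assumptions.

```python
def register_page(pairs, phrase_table):
    """Register one page's (phrase, furigana) pairs, counting each pair once.

    pairs: pairs extracted from one Web page, possibly with repeats.
    phrase_table: dict mapping (phrase, furigana) -> number of appearances.
    """
    staging = []                     # working table for this page (FIG. 4)
    for pair in pairs:               # S1060 to S1090: walk the page's pairs
        if pair not in staging:      # S1070: already staged?
            staging.append(pair)     # S1080: stage it once
    for pair in staging:             # S1100 to S1140: update the main table
        if pair in phrase_table:     # S1110: Yes
            phrase_table[pair] += 1  # S1120
        else:
            phrase_table[pair] = 1   # S1130
    return phrase_table


phrase_table = {}
register_page([("宥和", "ゆうわ"), ("宥和", "ゆうわ"), ("獣", "けだもの")], phrase_table)
register_page([("宥和", "ゆうわ")], phrase_table)
```

After the two calls, the pair that was repeated on the first page has a count of 2, one for each page, rather than 3.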

In this way, combinations of phrases and furigana actually used on many Web pages are collected automatically together with their appearance counts. Even if an incorrect furigana is attached on some Web page, checking the appearance count makes it possible to collect accurate furigana that many people actually use.
Furthermore, identical furigana data that appears several times on the same Web page is counted only once. So even if someone repeatedly uses incorrect furigana data within one Web page, it contributes only 1 to the appearance count finally recorded in the phrase furigana table (FIG. 5), which reduces the likelihood that an incorrect furigana is taken to be correct.
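The two-stage registration described above (steps S1060 to S1140) can be sketched roughly as follows. This is only an illustrative sketch: the function name `register_page`, the use of a `Counter` for the phrase furigana table, and the sample readings are assumptions and do not come from the specification.

```python
from collections import Counter

def register_page(pairs_on_page, phrase_table):
    """Fold one Web page's (phrase, furigana) pairs into the running table."""
    # Update reserve table (FIG. 4): a pair repeated on the same page
    # is kept only once (S1070 / S1080).
    update_reserve = []
    for pair in pairs_on_page:
        if pair not in update_reserve:
            update_reserve.append(pair)
    # Phrase furigana table (FIG. 5): each distinct pair adds 1 per page.
    # A Counter returns 0 for a missing key, so the first registration
    # yields a count of 1 (S1110 to S1130).
    for pair in update_reserve:
        phrase_table[pair] += 1
    return phrase_table

# Hypothetical sample data: page 1 repeats the same pair twice.
table = Counter()
register_page([("明星", "みょうじょう"), ("明星", "みょうじょう"), ("明星", "めいせい")], table)
register_page([("明星", "みょうじょう")], table)
```

With this sample input, the repeated pair on the first page contributes only once, so its count reflects the number of pages rather than the number of occurrences.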

[Reuse of furigana data]
When the phrase furigana table (FIG. 5) is narrowed down using a furigana 121 or a phrase 120 as the key and the results are displayed in order of appearance count 122, the phrases 120 for that furigana 121, or the furigana 121 for that phrase 120, are shown in order of how frequently they are used on Web pages. FIG. 6 shows an example of narrowing the phrase furigana table (FIG. 5) by the furigana 121 「あすか」, and FIG. 7 shows an example of narrowing it by the phrase 120 「明星」. This makes it possible to obtain furigana data suited to the intended use.
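As a rough illustration of this narrowing, the sketch below filters a table by furigana or by phrase and sorts the rows by appearance count. The function name `narrow`, the dictionary layout, and all counts other than the single appearance of 「亜巣化」 are hypothetical and are not taken from FIGS. 5 to 7.

```python
def narrow(table, *, furigana=None, phrase=None):
    """Return (phrase, furigana, count) rows matching the key, most frequent first."""
    rows = [
        (p, f, n)
        for (p, f), n in table.items()
        if (furigana is None or f == furigana) and (phrase is None or p == phrase)
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Hypothetical phrase furigana table: (phrase, furigana) -> appearance count.
sample_table = {
    ("明日香", "あすか"): 120,
    ("飛鳥", "あすか"): 95,
    ("亜巣化", "あすか"): 1,
    ("明星", "みょうじょう"): 40,
    ("明星", "めいせい"): 12,
}

by_furigana = narrow(sample_table, furigana="あすか")  # in the manner of FIG. 6
by_phrase = narrow(sample_table, phrase="明星")        # in the manner of FIG. 7
best_reading = by_phrase[0][1]                         # most frequent furigana for the phrase
```

Taking the first row of the phrase-keyed result gives the furigana with the highest appearance count for that phrase.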

For example, when ordinary text is displayed on a Web page, the phrase 「明星」 can be given the furigana 「みょうじょう」, which has the highest appearance count.

Also, a higher appearance count indicates that the phrase is used with furigana attached on more Web pages, so it can be inferred that the phrase is generally hard to read without furigana. For example, a phrase may be deemed generally difficult to read if its appearance count is 1000 or more. Phrases whose counts are lower than that but still at or above a certain number are considered to be ones that are sometimes given furigana and sometimes not.

For example, in the most frequent furigana table (FIG. 8), 「蒲公英」 (dandelion) and 「倫敦」 (London) have been given furigana on Web pages 1000 times or more, so they can be regarded as phrases that are hard to read even for adults in general. 「土筆」 (field horsetail) and 「憂鬱」 (melancholy) have counts of 800 and 300 respectively, so they can be regarded as generally considered hard to read, though less so than 「蒲公英」 and 「倫敦」. Accordingly, when a Web page is created, one possible usage is to use furigana data with an appearance count of 1000 or more if the expected users are adults, and furigana data with a count of, for example, 300 or more if the expected users are junior high or high school students.
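The audience-dependent threshold can be sketched as below. The names `THRESHOLDS` and `phrases_to_annotate` are hypothetical, the readings shown are the standard ones rather than values read from FIG. 8, and the counts for 「蒲公英」 and 「倫敦」 are placeholders, since the text only states that they are 1000 or more.

```python
# Thresholds follow the example in the text; the audience labels are assumptions.
THRESHOLDS = {"adult": 1000, "student": 300}

def phrases_to_annotate(most_frequent, audience):
    """Pick the phrases whose appearance count meets the audience's threshold."""
    limit = THRESHOLDS[audience]
    return {
        phrase: reading
        for phrase, (reading, count) in most_frequent.items()
        if count >= limit
    }

# phrase -> (most frequent furigana, appearance count)
most_frequent = {
    "蒲公英": ("たんぽぽ", 1200),  # placeholder count (text says 1000 or more)
    "倫敦": ("ロンドン", 1500),    # placeholder count (text says 1000 or more)
    "土筆": ("つくし", 800),
    "憂鬱": ("ゆううつ", 300),
}

for_adults = phrases_to_annotate(most_frequent, "adult")
for_students = phrases_to_annotate(most_frequent, "student")
```

Under these assumptions, a page aimed at adults would annotate only the two phrases at or above 1000, while a page aimed at students would annotate all four.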

Further, the most frequent phrase table (FIG. 9) shows the phrase with the highest appearance count for each furigana.
By contrast, 「亜巣化」 (あすか) has an appearance count of 1 and is therefore presumed to be a case of incorrect furigana, so it would not normally be used. However, a table extracting entries whose appearance count is at or below a certain number (for example, 10 or fewer), such as the mistaken phrase furigana table (FIG. 10), can be created and displayed on a Web page in a form such as "Examples of mistaken furigana found on Web pages".
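The two derived tables mentioned here might be built along the following lines. This is a sketch only: the function names and the sample counts (apart from the single appearance of 「亜巣化」) are hypothetical, and the cutoff of 10 is the example value given in the text.

```python
def most_frequent_phrase(table):
    """For each furigana, keep the phrase with the highest count (in the manner of FIG. 9)."""
    best = {}
    for (phrase, furigana), count in table.items():
        if furigana not in best or count > best[furigana][1]:
            best[furigana] = (phrase, count)
    return best

def likely_mistakes(table, max_count=10):
    """Collect low-count pairs as candidate mistakes (in the manner of FIG. 10)."""
    return {pair: count for pair, count in table.items() if count <= max_count}

# Hypothetical phrase furigana table: (phrase, furigana) -> appearance count.
counts = {
    ("明日香", "あすか"): 120,
    ("飛鳥", "あすか"): 95,
    ("亜巣化", "あすか"): 1,
}

top = most_frequent_phrase(counts)
mistakes = likely_mistakes(counts)
```

A low count is only a heuristic signal; a rare but correct reading would also land in the second table, so some manual review is presumably needed before publishing it as a list of mistakes.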

(Second Embodiment)
In the first embodiment, identical furigana data appearing several times on the same Web page was counted once, but the invention is not limited to this. Every appearance may instead be counted as-is and reflected in the appearance count 122 of the phrase furigana table (FIG. 5).

In this embodiment, FIGS. 1 to 3 and FIG. 5 are the same as in the first embodiment. The update reserve table (FIG. 4), however, is not used.

[Processing flow]
The processing flow of the furigana collection and use device 1 is described with reference to FIG. 13. Unless otherwise noted, the following processing is performed by the control device 40 of the computer that implements the furigana collection and use device 1. Parts that perform the same processing as in FIGS. 11 and 12 carry the same step numbers.

Steps 1000 to 1060 are the same as in FIG. 11. In this embodiment, every piece of furigana data appearing on the same Web page is counted, so nothing is registered in the update reserve table; all furigana data acquired from the Web page (S1060) is reflected directly in the phrase furigana table (FIG. 5). The registration processing for the phrase furigana table (FIG. 5) (S1110 to S1130) is the same as in FIG. 12. After that registration processing (S1110 to S1130) finishes, however, it is checked whether any furigana data remains in the Web page at the same URL (S1150). If so (S1150: Yes), the next piece of furigana data is acquired in the same manner (S1060). If none remains (S1150: No), the processing ends.
Because identical furigana data is then counted as many times as it actually appears, the collection can reflect real-world usage frequency in cases where, for example, several people write text on the same Web page, as on a blog.
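The difference from the first embodiment comes down to skipping the per-page deduplication. A minimal sketch, assuming a `Counter` as the phrase furigana table and a hypothetical function name `register_page_every_occurrence`:

```python
from collections import Counter

def register_page_every_occurrence(pairs_on_page, phrase_table):
    """Second-embodiment counting: every occurrence on the page adds 1."""
    for pair in pairs_on_page:
        # No update reserve table; each acquired pair goes straight
        # into the phrase furigana table (S1110 to S1130).
        phrase_table[pair] += 1
    return phrase_table

# Hypothetical page on which three different commenters use the same pair.
page = [("明星", "みょうじょう")] * 3

raw_counts = register_page_every_occurrence(page, Counter())
# For comparison, first-embodiment style counting collapses repeats per page.
per_page_counts = Counter(set(page))
```

Which variant is preferable depends on the source: counting every occurrence suits multi-author pages, while per-page counting is more robust against a single author repeating one mistake.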

Embodiments of the present invention have been described above, but the present invention is not limited to them. The effects described in the embodiments merely list the most preferable effects arising from the present invention, and the effects of the present invention are not limited to those described in the embodiments.

FIG. 1 is an overall diagram showing the relationship between the furigana collection and use device 1 according to an example embodiment of the present invention and the Web pages 20 to 24 from which the device collects furigana data through the Internet 10.
FIG. 2 is a diagram showing the hardware configuration of the furigana collection and use device 1 according to an example embodiment of the present invention.
FIG. 3 is a diagram showing the URL update date and time table according to an example embodiment of the present invention.
FIG. 4 is a diagram showing the update reserve table according to an example embodiment of the present invention.
FIG. 5 is a diagram showing the phrase furigana table according to an example embodiment of the present invention.
FIG. 6 is a diagram showing an example of entries extracted from the phrase furigana table (FIG. 5) using a furigana as the key.
FIG. 7 is a diagram showing an example of entries extracted from the phrase furigana table (FIG. 5) using a phrase as the key.
FIG. 8 is a diagram showing the most frequent furigana table according to an example embodiment of the present invention.
FIG. 9 is a diagram showing the most frequent phrase table according to an example embodiment of the present invention.
FIG. 10 is a diagram showing the mistaken phrase furigana table according to an example embodiment of the present invention.
FIG. 11 is a flowchart (part 1) of the processing of the furigana collection and use device 1 according to the first embodiment of the present invention.
FIG. 12 is a flowchart (part 2) of the processing of the furigana collection and use device 1 according to the first embodiment of the present invention.
FIG. 13 is a flowchart of the processing of the furigana collection and use device 1 according to the second embodiment of the present invention.

Explanation of symbols

1 Furigana collection and use device
2 Control means
3 Web page acquisition means
4 Furigana data acquisition means
5 Furigana data recording means
6 Furigana data extraction means
10 Internet
20 to 24 Web pages
30 Bus line
40 Control device
41, 42 CPU (Central Processing Unit)
43 Communication I/F (I/F: interface)
44 Main memory
45 BIOS (Basic Input Output System)
46 Display device
47 I/O controller
48 Input device
49 Hard disk
50 Semiconductor memory
51 Storage device

Claims

A furigana collection and use device comprising:
furigana data acquisition means for acquiring, from text in a Web page, a combination of words that satisfies a predetermined condition for being a combination of a phrase composed of kanji and its furigana;
furigana data recording means for recording the combination acquired by the furigana data acquisition means in a table together with the appearance count of the combination; and
furigana data extraction means for extracting, based on the magnitude of the appearance count, combinations to be used as a dictionary from the combinations recorded in the table by the furigana data recording means,
wherein the furigana data extraction means presumes that a combination recorded in the table whose appearance count is equal to or greater than a predetermined threshold is a phrase that is generally difficult to read.

The furigana collection and use device according to claim 1, wherein the furigana data extraction means uses the threshold set according to the age group of the users assumed for the Web page to be created, and extracts the dictionary that satisfies the appearance count specified for that age group.

The furigana collection and use device according to claim 1 or 2, further comprising Web page acquisition means that, when acquiring the contents of a plurality of Web pages via the Internet, does not acquire the contents of a Web page whose last update date and time matches the last update date and time of the already acquired Web page.

The furigana collection and use device according to claim 1, wherein the furigana data acquisition means acquires the combination from ruby text displayed on the Web page.

The furigana collection and use device according to claim 1, wherein, when all characters inside parentheses in a character string displayed on the Web page are hiragana and the character immediately before the parentheses is a kanji, the furigana data acquisition means regards the characters inside the parentheses as the furigana of the phrase formed by the one or more consecutive kanji, and acquires the combination.

The furigana collection and use device according to claim 1, wherein, when all characters inside parentheses in a character string displayed on the Web page are katakana and the character immediately before the parentheses is a kanji, the furigana data acquisition means regards the characters inside the parentheses as the furigana of the phrase formed by the one or more consecutive kanji, and acquires the combination.

The furigana collection and use device according to claim 1, wherein the appearance count is the number of Web pages from which the combination was acquired.

The furigana collection and use device according to claim 3, wherein the appearance count is the number of times the combination actually appeared on the Web pages acquired by the Web page acquisition means.

The furigana collection and use device according to claim 1, wherein the furigana data extraction means extracts, from the combinations recorded in the table by the furigana data recording means, the furigana with the highest appearance count among the furigana for the same phrase, as a combination to be used as the dictionary.

The furigana collection and use device according to claim 1, wherein the furigana data extraction means extracts, from the combinations recorded in the table by the furigana data recording means, the phrase with the highest appearance count among the phrases for the same furigana, as a combination to be used as the dictionary.

The furigana collection and use device according to claim 1, wherein the furigana data extraction means extracts, as a combination to be used as the dictionary, data whose appearance count is less than a predetermined number from the combinations recorded in the table by the furigana data recording means, treating it as erroneous data.

A furigana collection and use method, executed by a computer, for collecting information on furigana, the method comprising the steps of:
the computer acquiring the contents of a plurality of Web pages via the Internet from servers on which the Web pages are held;
the computer acquiring, from text in the acquired Web pages, a combination of words that satisfies a predetermined condition for being a combination of a phrase composed of kanji and its furigana;
the computer recording the acquired combination in a table together with the appearance count of the combination; and
the computer extracting, according to the appearance count, combinations to be used as a dictionary from the combinations recorded in the table,
wherein the extracting step presumes that a combination recorded in the table whose appearance count is equal to or greater than a predetermined threshold is a phrase that is generally difficult to read.