JP2002108851A

JP2002108851A - Sentence processor, method therefor and storage medium

Info

Publication number: JP2002108851A
Application number: JP2000295014A
Authority: JP
Inventors: Yuji Suga (須賀 祐治)
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2000-09-27
Filing date: 2000-09-27
Publication date: 2002-04-12

Abstract

PROBLEM TO BE SOLVED: To embed watermark information in text data without making the user aware of any unnaturalness in the text, by converting codes in the text data into corresponding codes based on the watermark information. SOLUTION: First, text data and watermark information are input (step S1001), and the input text data is normalized so that "voiced and semi-voiced hiragana and katakana characters are each represented by one character, without combining characters" (step S1002). The watermark information is then embedded in the normalized text data by representing the same character as different encoded data (step S1003).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a text processing apparatus, a method therefor, and a storage medium for embedding watermark information in text data, or for extracting watermark information from text data in which watermark information has been embedded.

[0002]

[Description of the Related Art] In recent years, as the spread of the Internet has advanced communication technology, all kinds of information have been digitized and can now be distributed over the Internet. Digitizing data improves reusability and convenience, but it also makes digital content such as image data and audio data easy to modify and copy, which has raised serious problems for copyright protection.

[0003] Against this background, digital watermarking techniques, particularly for image data and audio data, have been proposed for the purpose of copyright protection. Digital watermarking embeds watermark information in digital data to a degree that humans cannot perceive; the name of the copyright holder, the ID of the purchaser, and the like are embedded. This watermark information makes it possible to trace unauthorized use through modification or copying.

[0004] Digital watermarking is generally applied to data formats with much redundancy, such as image data and audio data. As a digital watermark for documents, the National Defense Academy method has been proposed (Yasuhiro Nakamura and Kineo Matsui, "Embedding Signatures in Hard Copies of Electronic Documents for Copyright Protection," Transactions of the Information Processing Society of Japan, Vol. 36, No. 8, 1995; and Japanese Patent Laid-Open No. 9-186603, "Encoding and decoding methods using the length of the spaces between words in an electronic document, method for embedding signature information in an electronic document, and method for encrypting a confidential document"). However, this method merely applies digital watermarking to the document as image data rather than treating the document as text data. Specifically, it embeds watermark information by manipulating the spacing between words.

[0005] A similar method, the IBM Japan method, has also been proposed (Tomio Amano and Yuki Hirayama, "A Method for Embedding Digital Watermarks in Page Descriptions Using Layout Structure," IPSJ SIG Technical Report, Multimedia Communication and Distributed Processing 90-9, Computer Security 2-9, 1998; and Japanese Patent Laid-Open No. 2000-99501 (P2000-99501A), "Method and System for Embedding Information in Document Data"). It embeds watermark information by manipulating the spacing between characters.

[0006] As digital watermarks for text data itself, methods that forcibly insert line feed characters or place multiple space characters between words have long been known. Tsutomu Matsumoto, Hiroshi Nakagawa, and Ichiro Murase, "Research and Development of Secret Communication Using Steganography" (Proceedings of the 18th IPA Technical Presentation, pp. 51-60), propose replacing words in a document by a dictionary conversion method. This method uses a dictionary in which synonyms are defined and embeds watermark information by replacing words in a sentence with other words.

[0007]

[Problems to be Solved by the Invention] The prior art described above has the following problems. Watermarking techniques that change the positional relationships in a document, for example by manipulating the spacing between words or characters, that is, techniques that change layout information, can be applied only to data formats that carry layout information.

[0008] In other words, when the text data, which is the document itself with the layout information removed, is extracted from the document, the watermark data cannot be restored. In addition, because layout information is manipulated, unnaturalness is noticeable at a glance. The dictionary conversion method likewise leaves unnaturalness, because replacing words disturbs the context.

[0009] The present invention has been made in view of the above problems. Its object is to embed watermark information in text data without making the user feel any unnaturalness in the text, by converting codes in the text data into corresponding codes based on the watermark information.

[0010]

[Means for Solving the Problems] To achieve the object of the present invention, a text processing apparatus of the present invention has, for example, the following configuration. Namely, a text processing apparatus that embeds watermark information in text data comprises normalizing means for normalizing the text data using a predetermined normalization method, and converting means for converting the text data normalized by the normalizing means into other text data based on the watermark information.

[0011] The apparatus further comprises normalized data generating means for generating normalized data by normalizing text data in which watermark information is embedded; calculating means for calculating a value based on the normalized data; and embedding data generating means for generating embedding data from the value based on the normalized data, using information relating to the text data. The data generated by the embedding data generating means is embedded by the embedding means.

[0012] To achieve the object of the present invention, a text processing apparatus of the present invention has, for example, the following configuration. Namely, a text processing apparatus that extracts watermark information from text data in which watermark information is embedded comprises normalized data generating means for generating normalized data by normalizing the text data with a predetermined normalization method, and difference information generating means for comparing the normalized data with the text data and generating their difference information. The watermark information is obtained by generating a bit string based on the difference information.

[0013]

[Embodiments of the Invention] The present invention will now be described in detail according to preferred embodiments, with reference to the accompanying drawings.

[0014] [First Embodiment] FIG. 1 is a block diagram illustrating the method of embedding watermark information in a sentence (text data) in this embodiment.

[0015] First, the document (text data) in which watermark information (watermark data) is to be embedded is input (100). Text data is a collection of character data. When a computer handles characters, it usually represents each character by a numerical value. The table that maps characters to their numerical values is called a character code. Converting a character to a number is called encoding. Text data is therefore a collection of encoded data expressed using some character code.

[0016] Next, the input text data is normalized (101) in accordance with the normalization method (102).

[0017] Next, a watermark information embedding process (103) embeds the watermark information 104 by exploiting the fact that the same character can be represented by different encoded data.

[0018] As a concrete example for the text input 100, a method using Unicode combining characters is described below.

[0019] Unicode is a character set standardized by the Unicode Consortium; its character encoding schemes include UTF-8 and UTF-16. Unicode contains a group of characters called combining characters. For Japanese, these include U+3099 (the voiced sound mark, dakuten "゛") and U+309A (the semi-voiced sound mark, handakuten "゜").

[0020] Unicode also defines U+304C "が" as a single character, so U+304C "が" and the sequence {U+304B, U+3099} "か゛" are equivalent. Thus data that expresses the same character can still differ as encoded data.
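The equivalence described in [0020] can be checked with a short sketch. It assumes Python's standard `unicodedata` module and uses Unicode NFC composition as a stand-in for "the same character"; the variable names are illustrative only and do not appear in the specification.

```python
import unicodedata

precomposed = "\u304C"       # が as one code point (U+304C)
combined = "\u304B\u3099"    # か (U+304B) followed by the combining dakuten (U+3099)

# As encoded data the two strings differ ...
same_code_points = precomposed == combined

# ... but canonical composition (NFC) maps the pair to the single character.
same_after_nfc = unicodedata.normalize("NFC", combined) == precomposed

# The difference is also visible in the UTF-8 byte lengths.
utf8_lengths = (len(precomposed.encode("utf-8")), len(combined.encode("utf-8")))
```

Both forms render as the same glyph in most environments, which is what lets the choice between them carry one bit without visibly changing the text.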

[0021] The normalization process 101 requires a normalization method 102. Suppose the adopted method is "voiced and semi-voiced hiragana and katakana characters are each represented by one character, without combining characters." Watermark information can then be embedded as follows. One bit of watermark information is assigned to each voiced or semi-voiced character in the text data. The watermark information embedding process 103 keeps the one-character representation (no combining character) to embed bit 0 and uses the combining character to embed bit 1. For example, a normalized "が" remains "が" to embed bit 0 and is converted to "か゛" ("か" + "゛") to embed bit 1.
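A minimal sketch of the embedding rule in [0021] follows. The function names `normalize`, `is_carrier`, and `embed` are hypothetical, and two behaviors are assumptions not stated in the specification: carriers left over after the last bit keep the one-character form, and too few carriers raises an error.

```python
import unicodedata

MARKS = ("\u3099", "\u309A")  # combining dakuten, combining handakuten

def normalize(text):
    """Fold each 'base kana + combining mark' pair into its one-character form."""
    out = []
    for ch in text:
        if ch in MARKS and out:
            composed = unicodedata.normalize("NFC", out[-1] + ch)
            if len(composed) == 1:
                out[-1] = composed
                continue
        out.append(ch)
    return "".join(out)

def is_carrier(ch):
    """True for a one-character voiced or semi-voiced kana such as が or ぱ."""
    decomposed = unicodedata.normalize("NFD", ch)
    return len(decomposed) == 2 and decomposed[1] in MARKS

def embed(text, bits):
    """Embed one bit per carrier: 0 keeps one character, 1 uses base + mark."""
    remaining = list(bits)
    out = []
    for ch in normalize(text):
        if is_carrier(ch) and remaining:
            bit = remaining.pop(0)
            out.append(unicodedata.normalize("NFD", ch) if bit else ch)
        else:
            out.append(ch)
    if remaining:
        raise ValueError("not enough voiced/semi-voiced kana to carry all bits")
    return "".join(out)

# が carries bit 1 (becomes か + U+3099), ば carries bit 0 (unchanged).
marked = embed("がんばる", [1, 0])
```

Normalizing `marked` again returns the original string, so the watermark lives only in the choice of representation.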

[0022] FIG. 10 shows a flowchart of the watermark information embedding process described above. Because the flowchart follows the description above, it is explained only briefly here.

[0023] First, text data and watermark information are input (step S1001), and the input text data is normalized as described above (step S1002). The watermark information is then embedded in the normalized text data by representing the same character as different encoded data (step S1003).

[0024] FIG. 5 shows the schematic configuration of an apparatus that performs the watermark embedding process described above (a watermark information embedding apparatus). The inputs to the apparatus are a text document 501 and watermark information 502. The watermark information embedding apparatus comprises a normalization processing device 503 that performs the normalization process 101, a normalization method storage device 504 that stores the normalization method 102, and an embedding processing device 505 that performs the watermark information embedding process 103; it outputs a watermarked text document 506. Each of these devices can also be realized by an apparatus having the configuration shown in FIG. 7, described below.

[0025] The program code according to the flowchart of FIG. 10 is stored in a memory such as a ROM or RAM (not shown) and is read out and executed by a CPU (not shown).

[0026] In FIG. 7, the host computer 701 is, for example, a widely available personal computer. Instructions from the user are entered with the mouse 712 or the keyboard 713, and text data can be printed on the printer 715.

[0027] Inside the host computer 701, the blocks described below are connected by a bus 716 and can exchange various data.

[0028] 702 is a monitor used to display text data and instructions to the user, including system messages.

[0029] 703 is a CPU that controls the operation of the internal blocks and executes the various programs stored in the ROM 704 and the RAM 705.

[0030] 704 is a ROM that stores specific images that are not permitted to be printed and stores in advance the necessary image processing programs and the like (including the program according to the flowchart of FIG. 10). It also stores character codes and the like.

[0031] 705 is a RAM that temporarily stores programs and the text data to be processed so that the CPU 703 can perform processing.

[0032] 706 is a hard disk (HD) that stores in advance the programs and text data to be transferred to the RAM 705 and the like, and that can store the processed text data.

[0033] 708 is a CD drive that can read and write data stored on a CD (CD-R), one type of external storage medium.

[0034] 709 is an FD drive that, like the CD drive 708, can read from and write to an FD. 710 is a DVD drive that, like the CD drive 708, can read from and write to a DVD. When editing programs or printer drivers are stored on a CD, FD, DVD, or the like, these programs are installed on the HD 706 and transferred to the RAM 705 as needed.

[0035] 711 is an interface (I/F) connected to the mouse 712 and the keyboard 713 to accept input instructions from them.

[0036] 718 is a modem, which is connected to an external network over a public line via an interface (I/F) 719.

[0037] 707 is a network connection device, which is connected to an external network via an interface (I/F) 714.

[0038] The watermark information extraction process is described next. FIG. 2 is a block diagram illustrating the method of extracting watermark information from a sentence (text data) in which watermark information is embedded in this embodiment.

[0039] First, the text data in which watermark information is embedded is input (200).

[0040] Next, the input text data is normalized at the character code level (201) in accordance with the normalization method 202 described later. The text data input at 200 is stored in a predetermined memory (not shown), a copy of the text data is generated, and the normalization is performed on the copy.

[0041] Next, the text data before normalization (the text data stored in the predetermined memory, not shown) is compared with the text data after normalization (the result of normalizing the copy), and difference information is generated (203).

[0042] Next, the watermark information is identified from the difference information obtained at 203 and extracted.

[0043] In this embodiment, the normalization method 202 is "voiced and semi-voiced hiragana and katakana characters are each represented by one character, without combining characters," and extraction uses the scheme described above, in which one bit of watermark information corresponds to each voiced or semi-voiced character.

[0044] As a result, the watermark information extraction process 204 extracts bit 0 where a voiced or semi-voiced character uses the one-character representation (no combining character) and bit 1 where a combining character is used. For example, comparing with the normalized "が", bit 0 is extracted if the data before normalization is "が", and bit 1 is extracted if it is "か゛".

[0045] FIG. 11 shows a flowchart of the watermark information extraction process described above. Because the flowchart follows the description above, it is explained only briefly here.

[0046] First, the text data in which watermark information is embedded is input and stored in a predetermined memory (not shown), and a copy is generated (step S1101). Next, the copied text data is normalized as described above (step S1102). Next, difference information between the text data before normalization and the text data after normalization is generated (step S1103). Next, for each voiced or semi-voiced character, it is determined from the difference information whether a combining character is used (step S1104). If one is used, the bit is set to 1 (step S1105); if not, the bit is set to 0 (step S1106). A bit string is generated in this way, and the check is repeated for all voiced and semi-voiced characters (step S1107). The watermark information is extracted by obtaining the bit string generated for all voiced and semi-voiced characters (step S1108).
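Steps S1101 to S1108 can be sketched as below. This is an illustration under assumptions, not the apparatus itself: the function names are hypothetical, and the "difference information" is represented implicitly by walking the received text and its normalized copy in step.

```python
import unicodedata

MARKS = ("\u3099", "\u309A")  # combining dakuten, combining handakuten

def normalize(text):
    """Fold each 'base kana + combining mark' pair into its one-character form."""
    out = []
    for ch in text:
        if ch in MARKS and out:
            composed = unicodedata.normalize("NFC", out[-1] + ch)
            if len(composed) == 1:
                out[-1] = composed
                continue
        out.append(ch)
    return "".join(out)

def is_carrier(ch):
    """True for a one-character voiced or semi-voiced kana such as が or ぱ."""
    decomposed = unicodedata.normalize("NFD", ch)
    return len(decomposed) == 2 and decomposed[1] in MARKS

def extract(marked):
    """Return one bit per voiced/semi-voiced kana in the received text."""
    original = marked                 # S1101: keep the text as received
    normalized = normalize(marked)    # S1102: normalize a copy
    bits, i = [], 0
    for ch in normalized:             # S1103-S1107: compare position by position
        if is_carrier(ch):
            uses_mark = original[i] != ch   # a difference means base + mark was used
            bits.append(1 if uses_mark else 0)
            i += 2 if uses_mark else 1
        else:
            i += 1
    return bits                       # S1108: the extracted bit string

bits = extract("\u304C\u304D\u3099\u3050")  # が, き + dakuten, ぐ
```

Because `normalize` only merges kana-plus-mark pairs, each normalized character corresponds to either one or two characters of the received text, which keeps the index `i` aligned.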

[0047] FIG. 6 illustrates a watermark information extraction apparatus that performs the watermark extraction process described above. The input to the apparatus is a watermarked text document 601. The watermark information extraction apparatus comprises a normalization processing device 602 that performs the normalization process 201, a normalization method storage device 603 that stores the normalization method 202, a comparison processing device 604 that performs the comparison process 203, and an extraction processing device 605 that performs the watermark information extraction process 204; it outputs watermark information 606. The program code according to the flowchart of FIG. 11 is stored in a memory (not shown) in the apparatus of FIG. 6 and is read out and executed by a CPU (not shown).

[0048] Each of these devices can also be realized by a signal processing apparatus having the configuration shown in FIG. 7. In that case, the program code according to the flowchart of FIG. 11 is stored in the ROM 704 and is read out and executed by the CPU 703.

[0049] Although the description above uses Japanese voiced and semi-voiced characters as the combining-character example, this embodiment is not specific to Japanese text data. It can also be applied to, for example, German umlauts, and to any character encoding in which combining characters are used.

[0050] As another case in which the same character is represented by different encoded data, the variation between half-width and full-width forms of commas, colons, and semicolons can also be used.
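The half-width/full-width variant in [0050] can be sketched in the same way. The specification does not fix the details, so the following are assumptions: normalization maps to the half-width form, bit 0 keeps the half-width form, and bit 1 uses the full-width form. All names are illustrative.

```python
# Half-width punctuation and its full-width counterpart (U+FF0C, U+FF1A, U+FF1B).
HALF_TO_FULL = {",": "\uFF0C", ":": "\uFF1A", ";": "\uFF1B"}
FULL_TO_HALF = {full: half for half, full in HALF_TO_FULL.items()}

def normalize(text):
    """Assumed normalization: every carrier is written in its half-width form."""
    return "".join(FULL_TO_HALF.get(ch, ch) for ch in text)

def embed(text, bits):
    """Bit 1 switches the next comma/colon/semicolon to full width; bit 0 leaves it."""
    remaining = list(bits)
    out = []
    for ch in normalize(text):
        if ch in HALF_TO_FULL and remaining:
            bit = remaining.pop(0)
            out.append(HALF_TO_FULL[ch] if bit else ch)
        else:
            out.append(ch)
    return "".join(out)

def extract(marked):
    """Read one bit per comma/colon/semicolon from its width."""
    bits = []
    for ch in marked:
        if ch in FULL_TO_HALF:
            bits.append(1)
        elif ch in HALF_TO_FULL:
            bits.append(0)
    return bits

marked = embed("a, b; c: d", [1, 0, 1])
recovered = extract(marked)
```

Unlike the combining-character case, full-width punctuation is usually visibly wider, so this variant may be easier for a reader to notice.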

[0051] [Second Embodiment] The first embodiment described extracting watermark information from the difference between the data before and after normalization. Unlike digital watermarks for images, however, the watermark of the first embodiment has weak robustness and can easily be removed. It therefore cannot be used in applications where erasure of the watermark would be a problem, such as copyright protection by embedding the author's copyright information or the purchaser's ID. It can, however, be used to guarantee the authenticity of text data and to detect tampering.

[0052] FIG. 3 is a block diagram illustrating the method of embedding watermark information in a sentence (text data) in this embodiment.

[0053] First, text data in which watermark information is embedded is input (300). The text data subjected to the following processing is a copy of the input text data.

[0054] Next, the input text data is normalized (301) in accordance with the normalization method 302.

[0055] Next, the hash value (described later) of the normalized data is calculated (303).

[0056] Next, signature data (described later) is created from the hash value obtained in the hash value calculation process 303, using the secret key of the signature creator (described later) (304).

[0057] Next, the signature data obtained in the signature calculation process 304 is embedded as watermark information in the input text data (305).

[0058] In this embodiment, when the watermark information embedding apparatus that performs the above processing has the configuration shown in FIG. 5, the secret key is additionally input to the embedding processing device 505, and the hash value calculation and signature data creation are performed inside the embedding processing device 505. Alternatively, dedicated devices for the hash value calculation and the signature data creation may be provided outside the embedding processing device 505.

[0059] [Hash Value] A hash value is the output of a hash function h, which is a function of one variable. A hash function is a compression function that rarely produces collisions. A collision occurs when h(x1) = h(x2) for distinct input values x1 and x2. A compression function is a function that converts a bit string of arbitrary length into a bit string of a fixed length.

[0060] A hash function is therefore a function that converts a bit string of arbitrary length into a bit string of a fixed length and for which x1 and x2 satisfying h(x1) = h(x2) cannot easily be found. Typical hash functions include MD5 (Message Digest 5) and SHA (Secure Hash Algorithm).
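The fixed-length property in [0059] and [0060] can be illustrated with Python's standard `hashlib`. SHA-256 is used here as a present-day member of the SHA family; that choice is an assumption of this sketch, since the specification names only MD5 and SHA in general.

```python
import hashlib

def h(data: bytes) -> bytes:
    """Hash function h: arbitrary-length input, fixed-length (32-byte) output."""
    return hashlib.sha256(data).digest()

short_digest = h("が".encode("utf-8"))            # 3-byte input
long_digest = h(("が" * 10000).encode("utf-8"))   # 30000-byte input

digest_lengths = (len(short_digest), len(long_digest))
```

Both digests are 32 bytes long regardless of the input size, and finding two inputs with the same digest is believed to be computationally infeasible for SHA-256.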

[0061] [Signature Data] Signature data can be created, for example, with a scheme based on public-key cryptography, but this embodiment is not limited to any particular scheme. A signature scheme based on public-key cryptography is described below.

[0062] Public-key cryptography is a cryptosystem in which the encryption key and the decryption key differ; the encryption key is made public and the decryption key is kept secret. For a message M, let E(kp, M) denote encryption with the public encryption key kp, and let C = E(kp, M) (C is the message M encrypted with the encryption key kp). Let D(ks, C) denote decryption with the secret decryption key ks. A public-key encryption algorithm then satisfies the following three conditions.

[0063] (1) Given kp, E(kp, M) is easy to compute. Given ks, D(ks, M) is easy to compute.

[0064] (2) Without knowledge of ks, determining M is computationally difficult even if kp, the procedure for computing E, and C = E(kp, M) are known.

[0065] (3) E(kp, M) is defined for every message (plaintext) M, and D(ks, E(kp, M)) = M holds.

[0066] Using a public-key cryptosystem that satisfies these properties, a user P signs a document M. That is, the scheme for proving that M is indeed a document created by P is as follows.

[0067] P generates the message C = D(ks, M) with P's own secret key ks and sends it to a user V together with M.

[0068] User V applies the restoring transformation M' = E(kp, C) to C with user P's public key kp and checks whether M' matches the document M. This operation by user V is called signature verification.

[0069] Because public-key encryption is generally slow, the above operation is usually not applied to the document M itself. Instead, the data is first compressed with a hash function and then signed. The signature calculation process 304 also adopts this approach.
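The hash-then-sign procedure of [0066] to [0069] can be sketched with textbook RSA, where D(ks, x) and E(kp, x) are modular exponentiations. The specification does not name a cryptosystem, so RSA, the key values, and the function names below are all assumptions. The key is far too small and lacks padding; it is for illustration only.

```python
import hashlib

# Toy RSA key built from two Mersenne primes (illustration only, not secure).
P, Q = 2**61 - 1, 2**89 - 1
N = P * Q
KP = 65537                              # public exponent (part of the public key kp)
KS = pow(KP, -1, (P - 1) * (Q - 1))     # secret exponent ks; needs Python 3.8+

def digest(message: bytes) -> int:
    """h(M), reduced modulo N so it fits the toy key."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N

def sign(message: bytes) -> int:
    """P computes C = D(ks, h(M))."""
    return pow(digest(message), KS, N)

def verify(message: bytes, signature: int) -> bool:
    """V computes E(kp, C) and compares it with h(M)."""
    return pow(signature, KP, N) == digest(message)

document = "透かし情報".encode("utf-8")
signature = sign(document)
is_valid = verify(document, signature)
is_valid_after_change = verify(document + b"!", signature)
```

Signing the hash rather than the document means the expensive public-key operation runs once on a short value, whatever the length of M.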

【００７０】上述の処理のフローチャートを図１２に示
す。なお、本フローチャートは図１０に示したフローチ
ャートに従った処理により生成される、透かし情報が埋
め込まれたテキストデータを入力することが前提となっ
ている。よって、本実施形態における透かし情報埋め込
み装置が行う処理は、ステップＳ１００３の代わりにス
テップＳ１２０１以降の処理を行うとしたフローチャー
トに従った処理となる。FIG. 12 shows a flowchart of the above processing. This flowchart is based on the premise that text data in which watermark information is embedded, which is generated by the processing according to the flowchart shown in FIG. 10, is input. Therefore, the process performed by the watermark information embedding device in the present embodiment is a process according to a flowchart in which the process after step S1201 is performed instead of step S1003.

【００７１】また、本フローチャートに従ったプログラ
ムコードは、不図示のＲＯＭやＲＡＭなどのメモリに格
納され、不図示のＣＰＵにより読み出され、実行される
ものとする。The program code according to this flowchart is stored in a memory such as a ROM or a RAM (not shown), and is read and executed by a CPU (not shown).

[0072] First, text data and a secret key are input (step S1201). Next, the input text data is normalized (step S1202), and a hash value is calculated from the normalized copy of the data (step S1203). Next, signature data is created from the calculated hash value using the input secret key (step S1204). The signature data is then embedded in the input text data (step S1205).
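
The flow of steps S1201 to S1205 can be sketched in Python as follows. This is a sketch under stated assumptions, not the embodiment's actual implementation: Unicode NFC stands in for the normalization method, a decomposed kana (base character plus combining sound mark) encodes a 1 bit, SHA-256 is the hash function, and a toy RSA key stands in for the secret key. All function names are hypothetical.

```python
import hashlib
import unicodedata

# Toy RSA key (assumption, insecure): N = 61 * 53 with public exponent 17.
N, KP = 3233, 17
KS = pow(KP, -1, 60 * 52)      # secret key, Python 3.8+
SIG_BITS = 12                  # 2**12 > N, so any signature value fits
MARKS = {"\u3099", "\u309a"}   # combining voiced / semi-voiced sound marks


def normalize(text):
    """S1202: spell every voiced or semi-voiced kana as one code point."""
    return unicodedata.normalize("NFC", text)


def is_carrier(ch):
    """True if ch can also be spelled as base character + combining mark."""
    decomposed = unicodedata.normalize("NFD", ch)
    return len(decomposed) == 2 and decomposed[1] in MARKS


def embed(text, bits):
    """S1205: a 1 bit decomposes the next carrier, a 0 bit leaves it composed."""
    out, i = [], 0
    for ch in normalize(text):
        if i < len(bits) and is_carrier(ch):
            out.append(unicodedata.normalize("NFD", ch) if bits[i] else ch)
            i += 1
        else:
            out.append(ch)
    if i < len(bits):
        raise ValueError("text has too few carrier characters")
    return "".join(out)


def sign_and_embed(text):
    """S1203 to S1205: hash the normalized text, sign the hash, embed the bits."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).digest()  # S1203
    hashed = int.from_bytes(digest, "big") % N
    signature = pow(hashed, KS, N)                                      # S1204
    bits = [(signature >> k) & 1 for k in reversed(range(SIG_BITS))]
    return embed(text, bits)                                            # S1205
```

Because the embedding only changes how a character is spelled, normalizing the marked text reproduces the normalized input, so the hash computed at extraction time matches the one that was signed.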

[0073] FIG. 4 is a block diagram illustrating a method, in this embodiment, for obtaining signature information from a text (text data) in which watermark information is embedded and verifying it.

[0074] First, text data in which watermark information is embedded is input (400). Next, a copy of the input text data is normalized in accordance with the normalization method 402 (401).

[0075] Next, the text data before normalization is compared with the text data after normalization, and difference information is generated (403).

[0076] Signature data is obtained from the difference information generated in 403 (404).

[0077] The signature data obtained in 404 is verified as described above.

[0078] The watermark information extraction device of this embodiment is obtained by adding, to the device shown in FIG. 6, a unit that receives the verification key corresponding to the secret key used at embedding time and verifies the signature data obtained from the extraction processing device 605. This addition enables the verification described above.

[0079] FIG. 13 shows a flowchart of the above processing performed by the watermark information extraction device of this embodiment. The program code according to this flowchart is stored in a memory such as a ROM or RAM (not shown), and is read out and executed by a CPU (not shown).

[0080] First, text data in which watermark information is embedded is input, together with the verification key corresponding to the secret key used at embedding time (step S1301). Next, a copy of the input text data is normalized (step S1302), and the text data before normalization is compared with the text data after normalization to generate the difference information described above (step S1303). Next, signature data is obtained from this difference information (step S1304), and the signature data is verified using the key input in step S1301 (step S1305).
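
Steps S1301 to S1305 can be sketched in Python as follows, under the same assumptions as one might make for the embedding side: Unicode NFC as the normalization method, a decomposed kana read as a 1 bit, SHA-256 as the hash function, and a toy RSA verification key. These choices and the function names are illustrative assumptions, not details given in the description.

```python
import hashlib
import unicodedata

N, KP = 3233, 17               # toy RSA verification key (assumption, insecure)
SIG_BITS = 12                  # signature length assumed known to the extractor
MARKS = {"\u3099", "\u309a"}   # combining voiced / semi-voiced sound marks


def normalize(text):
    """S1302: one code point per voiced or semi-voiced kana (NFC, assumed)."""
    return unicodedata.normalize("NFC", text)


def difference_bits(marked):
    """S1303: compare each character unit with its normalized form."""
    bits, i = [], 0
    while i < len(marked):
        # A unit is one character, plus a directly following sound mark if any.
        step = 2 if i + 1 < len(marked) and marked[i + 1] in MARKS else 1
        unit = marked[i:i + step]
        composed = normalize(unit)
        decomposed = unicodedata.normalize("NFD", composed)
        if len(composed) == 1 and len(decomposed) == 2 and decomposed[1] in MARKS:
            # Unit differs from its normalized form -> 1, identical -> 0.
            bits.append(0 if unit == composed else 1)
        i += step
    return bits


def verify_marked(marked):
    """S1304 and S1305: rebuild the signature from the bits and check it."""
    bits = difference_bits(marked)[:SIG_BITS]                  # S1304
    if len(bits) < SIG_BITS:
        return False
    signature = int("".join(str(b) for b in bits), 2)
    digest = hashlib.sha256(normalize(marked).encode("utf-8")).digest()
    hashed = int.from_bytes(digest, "big") % N
    return pow(signature, KP, N) == hashed                     # S1305
```

The hash is taken over the normalized text, so it does not depend on the embedded bits; any edit that changes the normalized text or the embedded signature bits causes the final comparison to fail, up to collisions of the toy 12-bit hash residue.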

[0081] As described above, according to this embodiment, signature data can be embedded as watermark information, and this signature data can be used to guarantee the authenticity of the text data and to detect tampering.

[0082] [Third Embodiment] As a special case of the second embodiment, this embodiment describes embedding signature information in an XML document.

[0083] XML (eXtensible Markup Language) is a markup language defined by the W3C. The specification is published as the W3C Recommendation "Extensible Markup Language (XML) 1.0" at http://www.w3.org/TR/1998/REC-xml-19980210.html.

[0084] XML is a human-readable data format written as text data. Data items are expressed with tags delimited by "<" and ">", such as "<Tag>" and "</Tag>", and a variety of data can be represented by enclosing other data within such tags to form a nested structure.

[0085] A tag can also carry feature data related to the tag, called a property. This is done by writing a string of the form property-name="value" inside the tag, after the tag name ("Tag" in this example) and separated from it by whitespace, as in "<Tag prop="Property"> ... </Tag>".

[0086] An XML document is divided into the following three parts.

[0087] (1) The XML declaration, which declares the version and the character encoding.

[0088] (2) The DTD (Document Type Definition), which declares the document structure of the XML instance described below and defines the syntax of the tags.

[0089] (3) The XML instance, which is the actual tagged document.

[0090] Because XML is a text data format, it can easily be created and edited with general document editing software (such as a text editor). This ease of handling is an advantage of XML, and XML is coming into use in a variety of applications.

[0091] Signing data is an indispensable technique for purposes such as guaranteeing contract documents, and demand for signatures on XML documents is also high. The W3C and the IETF are standardizing the format, and the specification "XML-Signature Syntax and Processing" has been published at http://www.w3.org/TR/xmldsig-core/ (hereinafter "xmldsig-core").

[0092] For the signature process, it has been proposed that the XML document first be normalized according to the normalization method defined in another W3C document, "Canonical XML Version 1.0", http://www.w3.org/TR/xml-c14n (hereinafter "xml-c14n"), and then signed.

[0093] To sign an XML document under xmldsig-core, a hash is taken of the data normalized by xml-c14n, the signature data is calculated, and a header such as the following, indicating that the document is signed, must then be attached to the original XML document.

<Signature>...<SignatureValue>ABCDEF...</SignatureValue></Signature>

xml-c14n specifies, among other things, that the line-break character is LF and that the character encoding is always UTF-8. Normalization at the character-code level, however, is outside the scope of that specification and is left to agreements between vendors or to individual applications. Therefore, the following procedure is used for normalization before signing.

[0094] FIG. 9 is a block diagram illustrating the signature procedure in this embodiment.

[0095] First, an XML document in which watermark information is embedded is input (900).

[0096] Next, the input XML document (text data) is normalized at the character-code level in accordance with the normalization method 902 (901).

[0097] Next, normalization by xml-c14n is performed (903).

[0098] Using the secret key of the signature creator, a hash value is taken of the data that has undergone the two kinds of normalization (901, 903), and signature data is generated (904).

[0099] The signature data obtained in 904 is embedded as watermark information in the form of a header (hereinafter, the signature header) (905).

[0100] By performing the character-code-level normalization described above before the xml-c14n normalization, as normalization prior to signing, a signature can be produced in the same way as in the second embodiment, and the calculated signature header can be embedded in the XML document.
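
The two-stage normalization before hashing (901, then 903) can be sketched in Python as follows. Two substitutions are assumptions of this sketch: Unicode NFC stands in for the character-code-level normalization method 902, and the standard library function `xml.etree.ElementTree.canonicalize` (Python 3.8+), which implements C14N 2.0 rather than the Canonical XML 1.0 named in the text, stands in for xml-c14n. SHA-256 and the function name `digest_for_signing` are likewise illustrative.

```python
import hashlib
import unicodedata
import xml.etree.ElementTree as ET


def digest_for_signing(xml_text):
    """Return the hash that would be passed to the signature step (904)."""
    # 901: character-code-level normalization (NFC is an assumed scheme).
    char_normalized = unicodedata.normalize("NFC", xml_text)
    # 903: XML canonicalization (C14N 2.0 here, as a stand-in for xml-c14n).
    canonical = ET.canonicalize(char_normalized)
    # Input to 904: hash of the doubly normalized data.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    composed = '<doc b="2" a="1">が</doc>'
    decomposed = "<doc a='1' b='2'>か\u3099</doc>"
    # Same document up to attribute order, quoting, and kana spelling.
    print(digest_for_signing(composed) == digest_for_signing(decomposed))
```

The point of the first stage is that two spellings of the same kana produce the same digest, so bits embedded by changing the spelling do not invalidate the signature.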

[0101] When the signature header is large, it is also conceivable to embed only part of the signature header as watermark data, as follows. The following is an example of the signature header when only the signature data within it is embedded as watermark information.

<Signature>...<SignatureValue option="WaterMarked"/></Signature>

The property value "WaterMarked" indicates that the signature data is embedded.
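
A reader of such a document could test for the "WaterMarked" property before looking for embedded signature data. The following Python sketch assumes the header is well-formed XML without namespaces and that the property is an attribute named `option` on `SignatureValue`, as the example in paragraph [0101] suggests; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def signature_is_watermarked(header_xml):
    """True if SignatureValue carries option="WaterMarked" (assumed attribute)."""
    root = ET.fromstring(header_xml)
    if root.tag == "SignatureValue":
        value = root
    else:
        value = root.find(".//SignatureValue")
    return value is not None and value.get("option") == "WaterMarked"


if __name__ == "__main__":
    header = '<Signature>...<SignatureValue option="WaterMarked"/></Signature>'
    print(signature_is_watermarked(header))
```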

[0102] The configuration of the watermark information embedding device in this embodiment is the same as in the first embodiment, and its processing follows a flowchart in which normalization by xml-c14n is additionally performed in step S1202. Likewise, the configuration of the watermark information extraction device in this embodiment is the same as in the first embodiment, and its processing follows a flowchart in which normalization by xml-c14n is performed in step S1302.

[0103] As described above, according to this embodiment, there is no need to hold the signature data for an XML document externally and manage it separately. By extracting the signature data from the XML document and verifying the signature, the authenticity of the XML document can be guaranteed and tampering can be detected. The above method is applicable not only to XML but also to other markup languages such as SGML and HTML.

[0104] [Fourth Embodiment] In the embodiments described above, the normalization process requires a normalization method. This embodiment describes a method in which the normalization method is kept secret so that only specific users can extract the embedded data. The embedder performs character-code-level normalization while keeping the normalization method secret, and shares the normalization method securely, for example over an encrypted channel, only with extractors who are permitted to extract the watermark information. The encryption scheme is not particularly limited in this embodiment.

[0105] FIG. 8 is a diagram showing an outline of a watermarking system that holds a separate normalization method for each extractor.

[0106] The embedding device 801 has a normalization method storage device 802. The embedder X creates normalization method A (803) and normalization method B (804) and stores them in the normalization method storage device 802.

[0107] The embedder X shares normalization method A (803) with extractor A and normalization method B (804) with extractor B, in such a way that they do not become known to third parties.

[0108] Extractor A, who has the extraction device 805, stores normalization method A (807) in the normalization method storage device 806. Similarly, extractor B, who has the extraction device 808, stores normalization method B (810) in the normalization method storage device 809.

[0109] When the embedder X embeds information that only extractor A should be able to extract, X performs the watermarking with normalization method A (803), so that only extractor A, using normalization method A (807), can extract the information. Extractor B, who does not have normalization method A, cannot extract the watermark information.
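
The description does not say what distinguishes normalization method A from normalization method B. Purely as an assumption for illustration, the Python sketch below treats each method as a secret set of kana that the method composes into a single code point; only characters in that set carry bits. The names `make_scheme`, `embed`, and `extract` are hypothetical.

```python
import unicodedata

MARKS = {"\u3099", "\u309a"}   # combining voiced / semi-voiced sound marks


def make_scheme(carriers):
    """A scheme is modeled as the secret set of characters it normalizes."""
    return frozenset(unicodedata.normalize("NFC", carriers))


def units(text):
    """Yield (unit, composed) pairs: a character plus a following mark, if any."""
    i = 0
    while i < len(text):
        step = 2 if i + 1 < len(text) and text[i + 1] in MARKS else 1
        unit = text[i:i + step]
        yield unit, unicodedata.normalize("NFC", unit)
        i += step


def embed(text, bits, scheme):
    """Normalize with the scheme, then decompose its carriers for 1 bits."""
    out, i = [], 0
    for unit, composed in units(text):
        if composed in scheme:
            bit = bits[i] if i < len(bits) else 0   # unused carriers stay composed
            out.append(unicodedata.normalize("NFD", composed) if bit else composed)
            i += 1
        else:
            out.append(unit)
    if i < len(bits):
        raise ValueError("text has too few carrier characters for this scheme")
    return "".join(out)


def extract(marked, scheme):
    """Read one bit per carrier of the scheme: 1 if it differs from normalized."""
    return [0 if unit == composed else 1
            for unit, composed in units(marked) if composed in scheme]


if __name__ == "__main__":
    scheme_a = make_scheme("がぎぐげご")   # shared only with extractor A
    scheme_b = make_scheme("ガギグゲゴ")   # shared only with extractor B
    marked = embed("がガぎギぐグ", [1, 0, 1], scheme_a)
    print(extract(marked, scheme_a))
    print(extract(marked, scheme_b))
```

In this model, an extractor holding the wrong scheme looks at the wrong characters and recovers bits unrelated to the embedded ones. How well such a scheme resists an attacker who simply inspects the text is outside what this sketch shows.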

[0110] As described above, according to this embodiment, giving each user an individual normalization method makes it possible to provide a mechanism by which only a specific extractor can extract the embedded data.

[0111] [Fifth Embodiment] This embodiment describes an example in which metadata is handled as watermark information. Metadata (meta-data) is "data about data": data M that describes some data D. If data D and metadata M exist as separate files, however, the user must keep track of both whenever files are moved or copied. In this embodiment, using metadata as the watermark information eliminates the burden of managing multiple files.
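
Carrying metadata M inside data D requires turning M into a bit string and recovering it again. The Python sketch below shows one way to do this under assumptions that are not in the description: Unicode NFC as the normalization method, a decomposed kana read as a 1 bit, and a one-byte length prefix so the extractor knows where the metadata ends. All names are hypothetical.

```python
import unicodedata

MARKS = {"\u3099", "\u309a"}   # combining voiced / semi-voiced sound marks


def is_carrier(ch):
    """True if ch can also be spelled as base character + combining mark."""
    decomposed = unicodedata.normalize("NFD", ch)
    return len(decomposed) == 2 and decomposed[1] in MARKS


def to_bits(data):
    """Bytes to a list of bits, most significant bit first."""
    return [(byte >> k) & 1 for byte in data for k in range(7, -1, -1)]


def from_bits(bits):
    """List of bits to bytes, ignoring an incomplete trailing byte."""
    usable = len(bits) - len(bits) % 8
    return bytes(int("".join(str(b) for b in bits[i:i + 8]), 2)
                 for i in range(0, usable, 8))


def embed_metadata(text, metadata):
    """Embed a length-prefixed metadata byte string (at most 255 bytes)."""
    bits = to_bits(bytes([len(metadata)]) + metadata)
    out, i = [], 0
    for ch in unicodedata.normalize("NFC", text):
        if i < len(bits) and is_carrier(ch):
            out.append(unicodedata.normalize("NFD", ch) if bits[i] else ch)
            i += 1
        else:
            out.append(ch)
    if i < len(bits):
        raise ValueError("text has too few carrier characters")
    return "".join(out)


def extract_metadata(marked):
    """Compare each unit with its normalized form and decode the prefix."""
    bits, i = [], 0
    while i < len(marked):
        step = 2 if i + 1 < len(marked) and marked[i + 1] in MARKS else 1
        unit = marked[i:i + step]
        composed = unicodedata.normalize("NFC", unit)
        if len(composed) == 1 and is_carrier(composed):
            bits.append(0 if unit == composed else 1)
        i += step
    raw = from_bits(bits)
    return raw[1:1 + raw[0]] if raw else b""
```

Each metadata byte consumes eight carrier characters, so the capacity of a document is bounded by the number of voiced and semi-voiced kana it contains.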

[0112] It is also possible to embed, as watermark information, link information that indicates relationships between data, such as relationships to other documents or pointers to related information. The same effect as above is obtained.

[0113] Link information can be expressed by an application-specific identifier, a URL (Uniform Resource Locator), a URI (Uniform Resource Identifier), or the like. A URL is a format for identifying data that exists on the Internet and is used in HTML hyperlinks. A URI is an identifier that subsumes URLs, and is defined by the W3C.

[0114] As described above, according to this embodiment, there is no need to hold metadata or link information for a document externally and manage it separately, so documents can be managed smoothly. In the above method, the data to be embedded into, or the watermark information, may also be written in a markup language such as XML, SGML, or HTML.

[0115] [Sixth Embodiment] The present invention is not limited to the apparatuses and methods for realizing the above embodiments, or to methods that combine the methods described in the embodiments. The scope of the present invention also includes the case where software program code for realizing the above embodiments is supplied to a computer (CPU or MPU) in the above system or apparatus, and the computer of the system or apparatus realizes the above embodiments by operating the various devices in accordance with the program code.

[0116] In this case, the program code of the software itself realizes the functions of the above embodiments. The program code itself, and the means for supplying the program code to the computer, specifically a storage medium storing the program code, are included in the scope of the present invention.

[0117] As the storage medium for storing such program code, for example, a floppy (registered trademark) disk, hard disk, optical disk, magneto-optical disk, CD-ROM, magnetic tape, nonvolatile memory card, ROM, or the like can be used.

[0118] Such program code is included in the scope of the present invention not only when the functions of the above embodiments are realized by the computer controlling the various devices in accordance with the supplied program code alone, but also when the above embodiments are realized by the program code in cooperation with an OS (operating system) running on the computer, other application software, or the like.

[0119] Furthermore, the scope of the present invention also includes the case where the supplied program code is stored in a memory provided on a function expansion board of the computer or in a function expansion unit connected to the computer, a CPU or the like provided on the function expansion board or in the function expansion unit then performs part or all of the actual processing based on the instructions of the program code, and the above embodiments are realized by that processing.

[0120] When the present invention is applied to the above storage medium, the storage medium stores program code corresponding to the flowcharts described earlier (shown in FIG. 10, and/or FIG. 11, and/or FIG. 12, and/or FIG. 13).

[0121] [Effects of the Invention] As described above, according to the present invention, the codes of text data are converted into corresponding codes on the basis of watermark information, so that watermark information can be embedded in the text data without making the user aware of any unnaturalness in the text.

[Brief description of the drawings]

FIG. 1 is a block diagram illustrating a method for embedding watermark information in a text (text data) according to the first embodiment of the present invention.

FIG. 2 is a diagram illustrating a method for extracting watermark information from a text (text data) in which watermark information is embedded, according to the first embodiment of the present invention.

FIG. 3 is a diagram illustrating a method for embedding watermark information in a text (text data) according to the second embodiment of the present invention.

FIG. 4 is a block diagram illustrating a method for obtaining signature information from a text (text data) in which watermark information is embedded and verifying it, according to the second embodiment of the present invention.

FIG. 5 is a diagram showing a schematic configuration of the watermark information embedding device according to the first embodiment of the present invention.

FIG. 6 is a diagram showing a schematic configuration of the watermark information extraction device according to the first embodiment of the present invention.

FIG. 7 is a diagram showing a schematic configuration of an apparatus serving as the watermark information embedding device according to the first embodiment of the present invention.

FIG. 8 is a diagram showing an outline of a watermarking system that holds a separate normalization method for each extractor, according to the fourth embodiment of the present invention.

FIG. 9 is a block diagram illustrating the signature procedure according to the third embodiment of the present invention.

FIG. 10 is a flowchart of the watermark information embedding process performed by the watermark information embedding device according to the first embodiment of the present invention.

FIG. 11 is a flowchart of the watermark information extraction process performed by the watermark information extraction device according to the first embodiment of the present invention.

FIG. 12 is a flowchart of the watermark information embedding process performed by the watermark information embedding device according to the second embodiment of the present invention.

FIG. 13 is a flowchart of the watermark information extraction process performed by the watermark information extraction device according to the second embodiment of the present invention.

Claims

1. A text processing device for embedding watermark information in text data, comprising: a normalization unit for normalizing the text data using a predetermined normalization method; and a conversion unit for converting the text data normalized by the normalization unit into other text data based on the watermark information.

2. The text processing apparatus according to claim 1, wherein the conversion unit expresses the predetermined normalization method by a plurality of normalization methods according to the watermark information.

3. The text processing apparatus according to claim 1, wherein the predetermined normalization method and another normalization method corresponding to the predetermined normalization method are normalization methods for the same character notation.

4. The sentence processing apparatus according to claim 1, wherein the normalization method is a normalization method for expressing a hiragana / katakana voiced character and a semi-voiced character in one character.

5. The text processing apparatus according to claim 4, wherein the normalization method represents the voiced and semi-voiced voices with a combined character.

6. The text processing apparatus according to claim 1, wherein the text data includes, or is expressed in, at least one of XML, HTML, and SGML.

7. The text processing apparatus according to claim 1, wherein the text data is written in a markup language.

8. The text processing apparatus according to claim 1, wherein the watermark information includes metadata for the text data.

9. The text processing apparatus according to claim 1, further comprising: normalized data generating means for generating normalized data by normalizing text data in which watermark information is embedded; calculating means for calculating a value based on the normalized data; embedded data generating means for generating embedded data from the value based on the normalized data using information on the text data; and embedding means for embedding the data generated by the embedded data generating means.

10. The text processing apparatus according to claim 9, wherein the calculation unit calculates a hash value by a one-way hash function as a value based on the normalized data.

11. The text processing apparatus according to claim 9, wherein the normalized data generated by the normalized data generating means is further normalized by a normalization method according to a predetermined rule.

12. The text processing apparatus according to claim 11, wherein the normalization method according to the predetermined rule is a normalization method specified in a W3C formulation document.

13. The text processing apparatus according to claim 9, wherein said embedding means embeds signature information of text data.

14. A text processing apparatus for extracting watermark information from text data in which the watermark information has been embedded, comprising: normalized data generation means for generating normalized data by normalizing the text data with a predetermined normalization method; and difference information generation means for comparing the normalized data with the text data and generating difference information, wherein the watermark information is obtained by generating a bit string based on the difference information.

15. The text processing apparatus according to claim 14, wherein the predetermined normalization method has, as a condition, that a semi-voiced character is expressed as one character.

16. The text processing apparatus according to claim 14, wherein the normalized data generated by the normalized data generation means is further normalized by a normalization method according to a predetermined rule.

17. The text processing apparatus according to claim 16, wherein the normalization method according to the predetermined rule is a normalization method specified in a W3C formulation document.

18. The text processing apparatus according to claim 14, wherein the difference information includes metadata in which information on the text data is described.

19. The text processing apparatus according to claim 18, further comprising a verification unit that verifies the metadata by using information for decoding the metadata.

20. The text processing apparatus according to claim 17, wherein the verification unit extracts signature information of text data.

21. A text processing method for embedding watermark information in text data, comprising: a normalization step of normalizing the text data using a predetermined normalization method; and a conversion step of converting the text data normalized in the normalization step into other text data based on the watermark information.

22. The text processing method according to claim 21, further comprising: a normalized data generation step of generating normalized data by normalizing text data in which watermark information is embedded; a calculation step of calculating a value based on the normalized data; and an embedded data generation step of generating embedded data from the value based on the normalized data using information on the text data, wherein the data generated in the embedded data generation step is embedded in an embedding step.

23. A text processing method for extracting watermark information from text data in which the watermark information is embedded, comprising: a normalized data generation step of generating normalized data by normalizing the text data with a predetermined normalization method; and a difference information generation step of comparing the normalized data with the text data and generating difference information, wherein the watermark information is obtained by generating a bit string based on the difference information.

24. A storage medium for storing a text processing program code for embedding watermark information in text data, the program code for a normalization step of normalizing the text data using a predetermined normalization method. And a program code for a conversion step of converting the text data normalized in the normalization step into other text data based on watermark information.

25. The storage medium according to claim 24, further storing: program code for a normalized data generation step of generating normalized data by normalizing text data in which watermark information is embedded; program code for a calculation step of calculating a value based on the normalized data; and program code for an embedded data generation step of generating embedded data from the value based on the normalized data using information on the text data, wherein the data generated in the embedded data generation step is embedded in an embedding step.

26. A storage medium storing text processing program code for extracting watermark information from text data in which the watermark information is embedded, the storage medium storing: program code for a normalized data generation step of generating normalized data by normalizing the text data with a predetermined normalization method; and program code for a difference information generation step of comparing the normalized data with the text data and generating difference information, wherein the watermark information is obtained by generating a bit string based on the difference information.