CN105404614A - Subject and predicate coding based text watermark embedding and extraction method

Publication number
CN105404614A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510743382.1A
Other languages
Chinese (zh)
Other versions
CN105404614B (en)
Inventor
陈建平
李桂森
朱晓辉
施佺
马海英
王进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Application filed by Nantong University
Priority to CN201510743382.1A
Publication of CN105404614A
Application granted
Publication of CN105404614B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/10 Protecting distributed programs or content, e.g. vending or licensing of copyrighted material; Digital rights management [DRM]
    • G06F21/16 Program or content traceability, e.g. by watermarking

Abstract

The invention relates to a text watermark embedding and extraction method based on subject-predicate coding. The embedding method comprises: 1) representing each character of the watermark information by its Unicode code to form a Unicode code string; 2) detecting the subject-predicate pair of each sentence in the text to be watermarked and storing the pairs in a set; 3) dividing the Unicode code string into segments according to the number of subject-predicate pairs, so that each pair encodes one segment, and assigning each segment a sequence number; 4) storing each subject-predicate pair, its Unicode code segment, and its sequence number in turn to form a codebook, which completes the encoding and embeds the watermark. The extraction method comprises: finding the subject-predicate pairs in the text under examination, looking each pair up in the codebook to retrieve its Unicode code segment and sequence number, concatenating the segments in sequence-number order, and converting the resulting Unicode code string back into characters to recover the watermark information. The method changes neither the format nor the content of the text, offers good concealment and robustness, and is simple in construction and easy to implement.

Description

A text watermark embedding and extraction method based on subject-predicate coding

Technical field

The invention relates to watermark embedding and extraction technology, and in particular to a text watermark embedding and extraction method based on subject-predicate coding.

Background

With the spread of the Internet and information technology, more and more text is published, distributed, and used in digital form. This brings convenience to study, work, and daily life, but it also makes text easy to copy and misappropriate illegally, so intellectual property protection for digital text has drawn wide attention in the industry. Text watermarking is a technique that has emerged in recent years for protecting the intellectual property of digital text. It embeds copyright identification or identity authentication information (the watermark) into a digital text in some way; when the text is found to have been illegally copied or misappropriated, this information can be extracted to prove copyright ownership, confirm the illegal copying or misappropriation, and protect the rights of the author or owner. Text watermarking can also be used to hide and transmit secret information in text, to authenticate text content, and to trace text information.

Text watermarking currently falls into two main classes: format-based and natural-language-based. Format-based watermarking embeds the watermark by exploiting the fact that slight changes to text formatting, such as line spacing, character spacing, or character size, are hard to notice. Such techniques are simple in construction and easy to implement, but reformatting the text may destroy the embedded watermark, so their robustness is weak. Natural-language-based watermarking embeds the watermark by encoding it in the syntax and semantics of the text content; most current implementations encode the watermark through synonym substitution and syntactic transformation. Compared with format-based watermarks, natural-language watermarks have better concealment and robustness, and format changes do not affect them. However, because of the complexity of Chinese, synonym substitution and syntactic transformation may introduce ambiguity or change the meaning, and they are unsuitable where the text content must not be altered.

Summary of the invention

The object of the invention is to overcome the above deficiencies of the prior art and to provide a text watermark embedding and extraction method based on subject-predicate coding that has good concealment and robustness. It is realized by the following technical solution:

The text watermark embedding method based on subject-predicate coding comprises:

1) Representing each character of the watermark information by its Unicode code to form a Unicode code string.

2) Detecting the subject-predicate pair of each sentence in the text to be watermarked and storing the pairs in a set.

3) Dividing the Unicode code string into segments according to the number of detected subject-predicate pairs, so that each pair encodes one segment. Because changing the order of sentences in the text could otherwise prevent correct extraction, each pair's Unicode code segment is given a sequence number, which is used to concatenate the segments in order when the watermark is extracted.

4) Storing each subject-predicate pair, its Unicode code segment, and its sequence number in turn to form a codebook, which completes the encoding and embeds the watermark.

The Unicode encoding uses the UTF-16 format; each character is a 4-digit hexadecimal number, so the result is a hexadecimal Unicode code string.

Detecting the subject-predicate pairs in step 2) comprises the following steps:

A) converting the text submitted for watermark embedding into a character string;

B) submitting that string to the Language Technology Platform (LTP) for dependency parsing, which returns an XML-format string containing the dependency relations between the sentence constituents of the text;

C) converting the XML-format string into an XML document, parsing it with DOM, and looping over the document to find the subject-predicate pair of each sentence from the link between the head (core) relation and the subject-verb relation in the relation attribute of the sentence constituents.

In a further design of the embedding method, the subject-predicate pair, Unicode code segment, and sequence number on each line of the codebook are separated by spaces.

Based on this embedding method, a corresponding text watermark extraction method is proposed. It comprises finding the subject-predicate pairs in the text under examination, looking them up in the codebook formed during embedding, retrieving from the codebook the Unicode code segment and sequence number of each pair, concatenating the segments in sequence-number order to obtain the Unicode code string that represents the watermark, and converting that string into the corresponding characters to recover the embedded watermark information.

In a further design of the extraction method, retrieving the Unicode code segment and sequence number for each subject-predicate pair comprises comparing each pair found in the text under examination with each pair in the codebook one by one and, where the two match, taking the corresponding Unicode code segment and sequence number from the codebook.

The advantages of the invention are as follows:

The invention proposes a new text watermark embedding and extraction method that embeds the watermark by using the subject-predicate pairs of the sentences in a text to encode the watermark information. The method changes neither the format nor the content of the text and has no effect on the original; embedding leaves no trace to be noticed or discovered, so concealment is very good. Reformatting the text (changing line spacing, character spacing, character size, font, color, and so on), rearranging paragraphs, or reordering sentences does not affect correct extraction, so robustness is good. The algorithm is also simple in construction and easy to implement.

Detailed description

The solution of the invention is described in detail below.

The text watermark embedding method based on subject-predicate coding provided by this embodiment comprises the following steps: 1) Each character of the watermark information is represented by its Unicode code in UTF-16 format, 4 hexadecimal digits per character, forming a hexadecimal Unicode code string. 2) The subject-predicate pairs in the text to be watermarked are detected and stored in a set. 3) According to the number of detected pairs, the Unicode code string is divided into segments, each pair encoding one segment; each pair's Unicode code segment is given a sequence number, which is used to concatenate the Unicode code string when the watermark is extracted. 4) Each subject-predicate pair, its Unicode code segment, and its sequence number are stored in turn to form a codebook, which completes the encoding and embeds the watermark. On each line of the codebook, the subject-predicate pair, Unicode code segment, and sequence number are separated by spaces.

Further, detecting the subject-predicate pairs in step 2) comprises the following steps: A) The text submitted for watermark embedding is converted into a character string. B) That string is submitted to the Language Technology Platform (LTP) for dependency parsing, which returns an XML-format string containing the dependency relations between the sentence constituents of the text. C) The XML-format string is converted into an XML document and parsed with DOM; the document is traversed in a loop, and the subject-predicate pair of each sentence is found from the link between the head (core) relation and the subject-verb relation in the relation attribute of the sentence constituents.

The Language Technology Platform (LTP) mentioned above is a complete, open, online Chinese natural language processing system developed over ten years by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology. It provides six language processing functions in three areas: lexical analysis (word segmentation, part-of-speech tagging, and named entity recognition), syntactic analysis (dependency parsing), and semantic analysis (word sense disambiguation and semantic role labeling). The platform is publicly accessible and easy to use. It exposes an application programming interface (API): the user sets the API parameters for the task at hand, constructs an HTTP request, submits the text to the system, and receives the analysis result online. Here the method mainly uses LTP's dependency parsing: the text to be analyzed is submitted to the platform, which returns the dependency relations between the constituents of each sentence. The subject-predicate pair of each sentence is then obtained from these relations by further processing. The basic procedure is as follows:

The text to be analyzed is converted into a string, the API parameters and call method are set, and the string is submitted to LTP for dependency parsing. The LTP result is an XML-format document containing the dependency relations between the sentence constituents. It contains para (paragraph), sent (sentence), and word (segmented word) nodes. Each word node has the following attributes: id, the index of the word within its sentence; cont, the word's text; parent, the id of the word's parent node in the dependency parse; and relate, the relation to that parent. To find all subject-predicate pairs, the procedure loops over every paragraph, sentence, and word. When it reaches a word node whose relate attribute is "HED" (the head, or core, relation), that node's cont value is the predicate of the sentence. It then checks whether the same sentence has a word node whose relate attribute is SBV (the subject-verb relation) and whose parent attribute equals the id of the predicate node. If so, that node's cont value is the subject of the sentence. The subject and predicate are extracted and stored in a set.
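
The traversal just described can be sketched in Python (the patent's own listings are C#). This is a minimal sketch rather than the patented implementation: the sample XML is hypothetical and only mirrors the node and attribute names given above (para, sent, word; id, cont, parent, relate), and the xml4nlp and doc wrapper elements and the function name subject_predicate_pairs are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like the description above: para > sent > word,
# with id, cont, parent and relate attributes on each word node.
SAMPLE_XML = """<xml4nlp>
  <doc>
    <para id="0">
      <sent id="0" cont="我们喜欢音乐">
        <word id="0" cont="我们" parent="1" relate="SBV"/>
        <word id="1" cont="喜欢" parent="-1" relate="HED"/>
        <word id="2" cont="音乐" parent="1" relate="VOB"/>
      </sent>
      <sent id="1" cont="他读书">
        <word id="0" cont="他" parent="1" relate="SBV"/>
        <word id="1" cont="读" parent="-1" relate="HED"/>
        <word id="2" cont="书" parent="1" relate="VOB"/>
      </sent>
    </para>
  </doc>
</xml4nlp>"""


def subject_predicate_pairs(xml_text):
    """Return subject+predicate strings, one per sentence that has both."""
    root = ET.fromstring(xml_text)
    pairs = []
    for sent in root.iter("sent"):
        words = sent.findall("word")
        for head in words:
            if head.get("relate") != "HED":
                continue  # only the head word is the predicate
            for word in words:
                # subject: an SBV word whose parent is the predicate
                if word.get("relate") == "SBV" and word.get("parent") == head.get("id"):
                    pair = word.get("cont") + head.get("cont")
                    if pair not in pairs:  # keep each pair once, in first-seen order
                        pairs.append(pair)
    return pairs


pairs = subject_predicate_pairs(SAMPLE_XML)
print(pairs)
```

Sentences without an SBV word attached to the head contribute no pair, which matches the "if so" condition in the description.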

The specific code for the watermark embedding process is given below, with comments.

As required by the LTP application programming interface (API), the text to be analyzed is first converted into a string. In C# this can be done with the System.IO.File.ReadAllText(string path, Encoding encoding) function, called here with Encoding.Default:

    String text = System.IO.File.ReadAllText(path, Encoding.Default);

Here path is the path of the text file to be analyzed and text is the string containing its content.

Next the API parameters are set: the address of the LTP web service (urlbase), the API key (api_key, obtained at user registration), the analysis mode (pattern; dp selects dependency parsing), the result format (format; XML is selected), and the HTTP request method (GET). The string containing the text content (text) is then submitted to the LTP platform for dependency parsing. The core code is as follows:

    string urlbase = "http://api.ltp-cloud.com/analysis/";
    string api_key = "k2r3q7tqGgWp5zBZRSnEHvNKfTRSFhjMtnHQ0QeP";
    string pattern = "dp";
    string format = "xml";
    // query string carrying the key, the text, the analysis mode and the result format
    string strParam = ("api_key=" + api_key + "&text=" + text.ToString()
        + "&pattern=" + pattern + "&format=" + format);
    Encoding encoding = Encoding.GetEncoding("utf-8");
    // '?' separates the service address from the query string
    HttpWebRequest req =
        WebRequest.Create(urlbase + "?" + strParam) as HttpWebRequest;
    req.Method = "GET";

The LTP result is then read back as an XML-format string containing the dependency relations between the sentence constituents of the text. This can be done with the C# StreamReader class:

    HttpWebResponse webResponse = req.GetResponse() as HttpWebResponse;
    StreamReader streamReader =
        new StreamReader(webResponse.GetResponseStream(), encoding);
    String result = streamReader.ReadToEnd();

result holds the analysis output.
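
The request construction described above can also be sketched in Python (the patent's listings are C#). The sketch only builds the GET URL and sends nothing, since whether the service still answers at this address is not something the text establishes. The function name build_ltp_url and the placeholder key YOUR_API_KEY are assumptions; the parameter names and values come from the description.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_ltp_url(text, api_key, urlbase="http://api.ltp-cloud.com/analysis/"):
    """Build the GET URL for a dependency-parsing request with an XML result."""
    params = {
        "api_key": api_key,
        "text": text,
        "pattern": "dp",   # dependency parsing
        "format": "xml",   # XML result
    }
    # urlencode percent-encodes the values (UTF-8 by default), so characters
    # such as '&' or Chinese characters cannot break the query string.
    return urlbase + "?" + urlencode(params)


url = build_ltp_url("我们喜欢音乐 & 书", "YOUR_API_KEY")
query = parse_qs(urlsplit(url).query)  # decode again to check the round trip
print(url)
```

Percent-encoding the text is the main design point: plain string concatenation would let an ampersand inside the submitted text be read as a parameter separator.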

The XML-format string is converted into an XML document and parsed with DOM. Using the link between the HED (head) and SBV (subject-verb) values of the relate attribute in the dependency parsing result, the code loops over every paragraph, sentence, and word to find the subject-predicate pair of each sentence. The core code is as follows:

    // hs collects the pairs; its declaration is not part of the listing and is
    // assumed here to be a HashSet<string>
    XmlDocument doc = new XmlDocument();   // build the XML document
    doc.LoadXml(result);
    XmlElement root = doc.DocumentElement; // start of the traversal
    XmlNodeList list1, list2, list3;
    XmlNode list4;
    list1 = root.SelectNodes("//para");
    foreach (XmlNode node1 in list1) {              // loop over para nodes
        list2 = node1.ChildNodes;
        foreach (XmlNode node2 in list2) {          // loop over sent nodes
            list3 = node2.ChildNodes;
            foreach (XmlNode node3 in list3) {      // loop over word nodes
                if (node3.Attributes["relate"].InnerText == "HED") {  // predicate found
                    list4 = node3;
                    foreach (XmlNode node4 in list3) {
                        // subject: an SBV word whose parent is the predicate
                        if (node4.Attributes["parent"].InnerText == list4.Attributes["id"].InnerText
                            && node4.Attributes["relate"].InnerText == "SBV")
                            hs.Add(node4.Attributes["cont"].InnerText + list4.Attributes["cont"].InnerText);
                    }
                }
            }
        }
    }
    List<string> sbv = new List<string>();
    sbv.AddRange(hs);

The list sbv holds the subject-predicate pairs of the text.

The watermark information to be embedded is converted into a Unicode code string using UTF-16 encoding:

    byte[] bts = Encoding.Unicode.GetBytes(info);  // UTF-16 little-endian bytes
    for (int i = 0; i < bts.Length; i += 2)
        // high byte first, then low byte, two hex digits each
        uc += bts[i + 1].ToString("x").PadLeft(2, '0') + bts[i].ToString("x").PadLeft(2, '0');

Here info holds the watermark information and uc receives the generated Unicode code string.
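
The same conversion can be sketched in Python (the patent's listing is C#). This is a minimal sketch; the function name to_code_string is an assumption. It also makes one limit of the "4 hexadecimal digits per character" statement visible: in UTF-16, characters outside the Basic Multilingual Plane occupy two 16-bit code units, so they produce 8 hex digits.

```python
def to_code_string(info):
    """Encode text as UTF-16 code units, 4 lowercase hex digits per unit."""
    # Big-endian byte order puts the high byte first, like the listing above.
    return info.encode("utf-16-be").hex()


uc = to_code_string("版权")
print(uc, len(uc))                          # two characters, one code unit each
print(len(to_code_string("\U0001F600")))   # surrogate pair: two code units
```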

The Unicode code string is then encoded with the subject-predicate pairs in the set. Each pair is taken in turn, assigned a segment of the Unicode code string, and given a sequence number; the pair, the code segment, and the number are separated by spaces to form the codebook. The core code is as follows.

Determining how many hex digits each pair receives:

    int strU_size = uc.Length;              // number of hex digits in the Unicode code string
    int sbv_size = sbv.Count;               // number of subject-predicate pairs
    int count_size = strU_size / sbv_size;  // hex digits assigned to each pair (integer division)

Assigning a segment of the Unicode code string to each pair:

    for (int x = 0; x < sbv_size; x++) {
        if (x == sbv_size - 1) {
            // last pair: takes the remainder, so its length may differ
            code_list.Add(sbv[x] + " " + uc.ToString().Substring(x * count_size) + " " + (x + 1));
        } else {
            // earlier pairs: equal shares of count_size digits each
            code_list.Add(sbv[x] + " " + uc.ToString().Substring(x * count_size, count_size) + " " + (x + 1));
        }
    }

The list code_list holds the codebook content; writing it to a .txt file yields the codebook file in which the watermark is embedded.
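
The segmentation arithmetic can be checked with a small Python sketch (the patent's listing is C#). With a 20-digit code string and 3 pairs, integer division gives 6 digits per pair, so the first two pairs receive 6 digits each and the last receives the remaining 8. The function name build_codebook, the sample pairs, and the guard against empty segments are assumptions of this sketch; the description does not say what should happen when there are more pairs than hex digits.

```python
def build_codebook(pairs, uc):
    """Return codebook lines of the form 'pair segment number'."""
    if not pairs or len(uc) < len(pairs):
        # Assumption of this sketch: refuse inputs that would leave a pair
        # with an empty segment.
        raise ValueError("need at least one hex digit per subject-predicate pair")
    count_size = len(uc) // len(pairs)  # digits per pair, integer division
    lines = []
    for x, pair in enumerate(pairs):
        start = x * count_size
        if x == len(pairs) - 1:
            segment = uc[start:]                    # last pair takes the remainder
        else:
            segment = uc[start:start + count_size]  # equal share
        lines.append(f"{pair} {segment} {x + 1}")
    return lines


codebook = build_codebook(["我们喜欢", "他读", "她写"], "72486743004100420043")
for line in codebook:
    print(line)
```

The space-separated line format relies on the pairs themselves containing no spaces, which holds for pairs built by joining two segmented Chinese words.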

Based on the embedding method above, the text watermark extraction method is implemented as follows:

When the watermark needs to be extracted, the text under examination is submitted to the LTP platform for dependency parsing, and the result is processed further to obtain the subject-predicate pairs of the text, which are stored in a set. The code for this is the same as for embedding.

The codebook file formed during embedding is opened and each subject-predicate pair in the set is decoded against it: each pair in the set is compared one by one with each pair in the codebook and, where the two match, the corresponding Unicode code segment and sequence number are retrieved. The retrieved segments are concatenated in sequence-number order to obtain the Unicode code string representing the watermark.

The code for the main operations of this process is given below, with comments.

Each line of the codebook is read into an array:

    string[] lines = File.ReadAllLines(path);

Here path is the path of the codebook file and lines is the array holding each line of the codebook.

Each line is split at the first space to isolate its subject-predicate pair, which is compared one by one with the pairs found in the text under examination. Where they match, the line's Unicode code segment and sequence number are taken and added to a collection:

    for (int i = 0; i < sbv.Count; i++) {
        for (int j = 0; j < lines.Length; j++) {
            // lgs[0] = subject-predicate pair, lgs[1] = "segment number"
            string[] lgs = lines[j].ToString().Split(new Char[] { ' ' }, 2);
            if (sbv[i] == lgs[0])
                st.Add(lgs[1]);
        }
    }

st is the collection holding the Unicode code segments and their sequence numbers.

Each entry is split at the space into its Unicode code segment and sequence number, and the segments are concatenated in sequence-number order to obtain the Unicode code string representing the watermark:

    for (int x = 0; x < st.Count; x++) {
        for (int y = 0; y < st.Count; y++) {
            // lgs[0] = code segment, lgs[1] = sequence number
            string[] lgs = st[y].ToString().Split(new Char[] { ' ' }, 2);
            if (Convert.ToInt32(lgs[1]) == (x + 1))
                drawUc.Append(lgs[0]);
        }
    }

drawUc holds the Unicode code string representing the watermark.
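
The lookup and reassembly can be sketched in Python (the patent's listings are C#). This sketch swaps the pairwise comparison loops for a dictionary lookup and a sort on the sequence number; the result is the same, only the technique differs. The function name extract_code_string and the sample codebook lines are assumptions.

```python
def extract_code_string(found_pairs, codebook_lines):
    """Look up each found pair in the codebook and join segments by number."""
    table = {}
    for line in codebook_lines:
        # Each line is 'pair segment number'; pairs are assumed to hold no spaces.
        pair, segment, number = line.split(" ")
        table[pair] = (int(number), segment)
    # Keep only pairs present in both the text and the codebook, once each,
    # ordered by sequence number.
    hits = sorted({table[p] for p in found_pairs if p in table})
    return "".join(segment for _, segment in hits)


codebook_lines = ["我们喜欢 724867 1", "他读 430041 2", "她写 00420043 3"]
# Sentences appear in a different order in the examined text:
# the stored sequence numbers restore the original order.
code_string = extract_code_string(["她写", "我们喜欢", "他读"], codebook_lines)
print(code_string)
```

Pairs found in the text but absent from the codebook are ignored, so added sentences do not disturb the result; a pair missing from the text leaves a gap in the recovered string.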

根据嵌入水印时使用的UTF-16编码规则,将上述Unicode码串转换为对应的字符,便得到嵌入的水印信息。实现这一过程的核心代码如下: According to the UTF-16 encoding rules used when embedding the watermark, the above Unicode code string is converted into corresponding characters, and the embedded watermark information is obtained. The core code to realize this process is as follows:

    // Each match is one UTF-16 code unit: group 1 is the high byte, group 2 the low byte.
    // The pattern is a verbatim string (@"...") so that \w is not read as a string escape.
    MatchCollection mc = Regex.Matches(str, @"([\w]{2})([\w]{2})",
        RegexOptions.Compiled | RegexOptions.IgnoreCase);
    byte[] bts = new byte[2];
    foreach (Match m in mc) {
        // Encoding.Unicode is little-endian UTF-16, so the low byte goes first.
        bts[0] = (byte)int.Parse(m.Groups[2].Value, NumberStyles.HexNumber);
        bts[1] = (byte)int.Parse(m.Groups[1].Value, NumberStyles.HexNumber);
        toStr += Encoding.Unicode.GetString(bts);
    }

toStr then contains the extracted watermark information.
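The conversion can also be sketched without a regular expression, by reading the code string four hexadecimal digits at a time. This is a minimal illustration, not the patent's implementation: the class name `Utf16Hex` and the sample input are hypothetical, and the sketch assumes the code string consists only of complete 4-digit UTF-16 code units.

```csharp
using System;
using System.Globalization;
using System.Text;

static class Utf16Hex
{
    // Converts a string of 4-digit hexadecimal UTF-16 code units back into text.
    public static string Decode(string hex)
    {
        if (hex.Length % 4 != 0)
            throw new ArgumentException("Expected a multiple of 4 hex digits.", nameof(hex));

        var sb = new StringBuilder(hex.Length / 4);
        for (int i = 0; i < hex.Length; i += 4)
        {
            // One group of 4 hex digits is one UTF-16 code unit, i.e. one C# char.
            ushort unit = ushort.Parse(hex.Substring(i, 4),
                NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            sb.Append((char)unit);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // "6C34" and "5370" are the code units of the two characters in "水印".
        Console.WriteLine(Decode("6C345370") == "\u6C34\u5370"); // prints True
    }
}
```

Appending each code unit as a `char` avoids the byte-order bookkeeping and the repeated string concatenation. Unlike the regular-expression version, which silently skips a trailing incomplete group, this sketch rejects a code string whose length is not a multiple of four.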

Claims (6)

1. A text watermark embedding method based on subject-predicate coding, characterized by comprising:
1) representing each character of the watermark information by its Unicode code to form a Unicode code string;
2) detecting the subject-predicates of the sentences in the text to be embedded and storing them in a set;
3) dividing the Unicode code string into several segments according to the number of subject-predicates detected, each subject-predicate encoding one of the segments, and assigning a number to the Unicode code segment corresponding to each subject-predicate, the numbers being used to splice the Unicode code string back together when the watermark is extracted;
4) storing, in sequence, each subject-predicate, the Unicode code segment corresponding to that subject-predicate, and the number corresponding to that subject-predicate, to form a code book, thereby completing the coding and realizing the embedding of the watermark.
2. The text watermark embedding method based on subject-predicate coding according to claim 1, characterized in that the Unicode coding uses the UTF-16 format, each character being represented by 4 hexadecimal digits, so as to form a hexadecimal Unicode code string.
3. The text watermark embedding method according to claim 1, characterized in that detecting the subject-predicates in the text to be embedded in step 2) comprises the following steps:
A) converting the text in which the watermark is to be embedded into the form of a character string;
B) submitting the character string of the text in which the watermark is to be embedded to the Language Technology Platform (LTP) for dependency parsing, to obtain an XML-format character string containing the dependency relations between the sentence components of the text;
C) converting the obtained XML-format character string into an XML file, performing DOM parsing on the XML file, and traversing the file in a loop, according to the correspondence between the key values of the sentence-component relation attribute in the XML file and the subject-predicate relation, to find the subject-predicate of each sentence.
4. The text watermark embedding method based on subject-predicate coding according to claim 1, characterized in that, in each line of the code book, the subject-predicate, the Unicode code segment, and the number are separated from one another by spaces.
5. A text watermark extraction method based on subject-predicate coding, proposed for the text watermark embedding method based on subject-predicate coding according to any one of claims 1-4, characterized by comprising: finding the subject-predicates in the detected text; comparing them against the code book formed when the watermark was embedded; taking from the code book the Unicode code segment and number corresponding to each subject-predicate; concatenating the Unicode code segments in the order of their numbers to obtain the Unicode code string representing the watermark information; and converting that string into the corresponding characters to form the embedded watermark information.
6. The text watermark extraction method based on subject-predicate coding according to claim 5, characterized in that taking out the Unicode code segment and number corresponding to each subject-predicate in the detected text comprises: comparing each subject-predicate found in the detected text one by one with each subject-predicate in the code book and, where the two are identical, taking from the code book the Unicode code segment and number corresponding to that subject-predicate.
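Steps 1), 3) and 4) of claim 1, together with the line layout of claim 4, can be illustrated with a short sketch. It rests on several assumptions that the claims do not fix: the class name `CodeBookBuilder` is hypothetical, the code string is cut into near-equal segments, the subject-predicates are taken as already detected and distinct, and surplus subject-predicates are left unused when the code string is shorter than their count.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class CodeBookBuilder
{
    // Builds code book lines of the form "subjectPredicate codeSegment number".
    public static List<string> Build(string watermark, IList<string> subjectPredicates)
    {
        // Step 1: every UTF-16 code unit becomes 4 hexadecimal digits.
        var hex = new StringBuilder();
        foreach (char c in watermark)
            hex.Append(((int)c).ToString("X4"));

        var book = new List<string>();
        if (hex.Length == 0) return book;
        if (subjectPredicates.Count == 0)
            throw new InvalidOperationException("No subject-predicate available to carry the watermark.");

        // Step 3: split into n near-equal segments (assumed splitting rule), numbered from 1.
        int n = Math.Min(subjectPredicates.Count, hex.Length);
        int size = hex.Length / n, extra = hex.Length % n, pos = 0;
        for (int i = 0; i < n; i++)
        {
            int len = size + (i < extra ? 1 : 0); // the first 'extra' segments get one more digit
            // Step 4: one space-separated code book line per subject-predicate.
            book.Add(subjectPredicates[i] + " " + hex.ToString(pos, len) + " " + (i + 1));
            pos += len;
        }
        return book;
    }

    static void Main()
    {
        // Placeholder subject-predicates "A", "B", "C"; the watermark is "水印" (U+6C34, U+5370).
        foreach (string line in Build("\u6C34\u5370", new[] { "A", "B", "C" }))
            Console.WriteLine(line);
    }
}
```

With these inputs the code string 6C345370 is cut into 6C3, 453 and 70, so segment boundaries need not coincide with character boundaries; the numbers alone are enough to restore the order at extraction time.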
CN201510743382.1A 2015-11-05 2015-11-05 Subject and predicate coding based text watermark embedding and extraction method Active CN105404614B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510743382.1A CN105404614B (en) 2015-11-05 2015-11-05 Subject and predicate coding based text watermark embedding and extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510743382.1A CN105404614B (en) 2015-11-05 2015-11-05 Subject and predicate coding based text watermark embedding and extraction method

Publications (2)

Publication Number Publication Date
CN105404614A (en) 2016-03-16
CN105404614B (en) 2018-05-25

Family

ID=55470109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510743382.1A Active CN105404614B (en) 2015-11-05 2015-11-05 Subject and predicate coding based text watermark embedding and extraction method

Country Status (1)

Country Link
CN (1) CN105404614B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101639826A (en) * 2009-09-01 2010-02-03 西北大学 Text hidden method based on Chinese sentence pattern template transformation
CN102184243A (en) * 2011-05-17 2011-09-14 沈阳化工大学 Text-type attribute-based relational database watermark embedding method
US20140016814A1 (en) * 2012-07-13 2014-01-16 International Business Machines Corporation Hierarchical and index based watermarks represented as trees

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZUNERA JALIL ET AL: "A Review of Digital Watermarking Techniques for Text Documents", 《2009 INTERNATIONAL CONFERENCE ON INFORMATION AND MULTIMEDIA TECHNOLOGY》 *
斯琴 等: "基于文本特征的文本水印算法", 《计算机应用》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491423A (en) * 2016-06-12 2017-12-19 北京云量数盟科技有限公司 Chinese document gene quantization and characterization method based on numerical value-character string mixed coding
CN107491423B (en) * 2016-06-12 2021-03-30 北京云量数盟科技有限公司 Chinese document gene quantization and characterization method based on numerical value-character string mixed coding
CN108363910A (en) * 2018-01-23 2018-08-03 南通大学 Webpage watermark embedding and extracting method based on HTML (Hypertext markup language) code
CN108363910B (en) * 2018-01-23 2020-01-10 南通大学 Webpage watermark embedding and extracting method based on HTML (Hypertext markup language) code
CN114896945A (en) * 2022-06-01 2022-08-12 广州零世纪信息科技有限公司 Terminal information display method for tracking divulgence

Also Published As

Publication number Publication date
CN105404614B (en) 2018-05-25

Similar Documents

Publication Publication Date Title
US7665015B2 (en) Hardware unit for parsing an XML document
CN105205355B (en) Text watermark embedding and extraction method based on semantic role position mapping
US7596745B2 (en) Programmable hardware finite state machine for facilitating tokenization of an XML document
US7716577B2 (en) Method and apparatus for hardware XML acceleration
CN112446207B (en) Title generation method, title generation device, electronic equipment and storage medium
CN104850574B (en) A kind of filtering sensitive words method of text-oriented information
US8750630B2 (en) Hierarchical and index based watermarks represented as trees
CN103761459B (en) A kind of document multiple digital watermarking embedding, extracting method and device
CN107871002B (en) Fingerprint fusion-based cross-language plagiarism detection method
CN105404614B (en) Subject and predicate coding based text watermark embedding and extraction method
CN103778200A (en) Method for extracting information source of message and system thereof
CN102194081B (en) Method for hiding natural language information
Al-Wesabi A Smart English Text Zero-Watermarking Approach Based on Third-Level Order and Word Mechanism of Markov Model.
CN103544408A (en) Method for embedment and extraction of PDF document hidden information according to composite font
Chen et al. Text watermarking algorithm based on semantic role labeling
Lüngen et al. Integrating corpora of computer-mediated communication in CLARIN-D: Results from the curation project ChatCorpus2CLARIN
CN115712909B (en) Text watermark embedding method, tracing method and system based on blockchain
CN106407288B (en) Method and system for synchronously updating information
CN108363910A (en) A kind of insertion of the webpage watermark based on HTML code and extracting method
CN109948089A (en) A method and device for extracting webpage text
CN118535792A (en) Chinese corpus acquisition method and system based on Common Crawl data
CN105320716A (en) Automatic labeling method for digital publication
CN107491423A (en) A kind of Chinese document gene based on numeric character string hybrid coding quantifies and characterizing method
CN114238550A (en) Element extraction method, device, electronic equipment and storage medium
CN103530536B (en) Method for embedding Java software watermark

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant